In this newsletter:
Notes on Google's Gemma 3
Adding AI-generated descriptions to my tools collection
Plus 6 links and 4 quotations
Notes on Google's Gemma 3 - 2025-03-12
Google's Gemma team released an impressive new model today (under their not-open-source Gemma license). Gemma 3 comes in four sizes - 1B, 4B, 12B, and 27B - and while 1B is text-only the larger three models are all multi-modal for vision:
Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.
Here's the Gemma 3 Technical Report PDF, which makes the big claim that they find "Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks".
I tried out the largest model using the latest Ollama - this is the second time I've spotted a major model release partnering with Ollama on launch day, the first being Mistral Small 3.
I ran this (after upgrading Ollama through their menu icon upgrade option):
ollama pull gemma3:27b
That pulled 17GB of model weights. I've been trying it out using LLM and llm-ollama:
llm install llm-ollama
llm -m gemma3:27b 'Build a single page HTML+CSS+JavaScript UI that gives me a large textarea for writing in which constantly saves what I have entered to localStorage (restoring when I reload the page) and displays a word counter'
That was a replay of a prompt I ran against Claude Artifacts a few months ago. Here's what Gemma built, and the full chat transcript. It's a simple example but it worked just right.
Let's try out the vision mode, by asking it to describe this photo:
llm -m gemma3:27b describe -a https://static.simonwillison.net/static/2025/two-pelicans.jpg
Here's the full response:
Overall Impression:
The image features two brown pelicans in flight against a solid, bright blue background. The composition is dynamic, capturing the birds mid-wingbeat.
Details:
Pelicans: Both pelicans are a rich, dark brown color. They have large, pale yellow-orange beaks and pouches. One pelican is shown with its wings fully extended, appearing to be gliding or soaring. The other is captured with its wings in a more dynamic, upward-sweeping motion.
Background: The background is a flat, uniform shade of bright blue, which makes the pelicans stand out.
Composition: The birds are positioned in opposite directions, creating a sense of movement and balance. The image is well-lit, highlighting the texture of the pelicans' feathers.
Overall, the image is a striking and visually appealing depiction of these majestic birds in flight.
Is it right to say "the birds are positioned in opposite directions"? I thought that was a mistake at first, but I'll give it a pass since their wings are positioned in opposite directions to each other.
Something I've been curious about recently is longer context support: how well can a local model on my laptop deal with summarization or data extraction tasks against longer pieces of text?
I decided to try my Hacker News summarize script using Gemma, against the thread there discussing the Gemma 3 technical paper.
First I did a quick token count (using the OpenAI tokenizer but it's usually a similar number to other models):
curl 'https://hn.algolia.com/api/v1/items/43340491' | ttok
This returned 22,260 - well within Gemma's documented limits but still a healthy number considering just last year most models topped out at 4,000 or 8,000.
I ran my script like this:
hn-summary.sh 43340491 -m gemma3:27b
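If you haven't seen it, hn-summary.sh is a small wrapper around curl, jq and LLM. A rough sketch of how a script like that can work (the prompt wording here is illustrative, not my exact script):

#!/bin/bash
# Sketch of a Hacker News thread summarizer (not the exact script I use):
# fetch the full comment tree from the Algolia API, flatten it to
# "author: text" lines with jq, then pipe the result into llm.
item_id="$1"
shift
curl -s "https://hn.algolia.com/api/v1/items/$item_id" | \
  jq -r 'recurse(.children[]?) | select(.text != null) | "\(.author): \(.text)"' | \
  llm "$@" -s 'Summarize the themes of the opinions expressed here.
For each theme, output a markdown header.
Include direct quotes from the thread, attributed to their authors.'

Invoked as above, everything after the item ID is passed straight through to llm, which is how the -m gemma3:27b option selects the model.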
It did a pretty good job! Here's the full prompt and response. The one big miss is that it ignored my instructions to include illustrative quotes - I don't know if modifying the prompt will fix that but it's disappointing that it didn't handle that well, given how important direct quotes are for building confidence in RAG-style responses.
Here's what I got for Generate an SVG of a pelican riding a bicycle:
llm -m gemma3:27b 'Generate an SVG of a pelican riding a bicycle'
You can also try out the new Gemma in Google AI Studio, and via their API. I added support for it to llm-gemini 0.15, though sadly it appears vision mode doesn't work with that API hosted model yet.
llm install -U llm-gemini
llm keys set gemini
# paste key here
llm -m gemma-3-27b-it 'five facts about pelicans of interest to skunks'
Here's what I got. I'm not sure how pricing works for that hosted model.
Gemma 3 is also already available through MLX-VLM - here's the MLX model collection - but I haven't tried that version yet.
Adding AI-generated descriptions to my tools collection - 2025-03-13
The /colophon page on my tools.simonwillison.net site lists all 78 of the HTML+JavaScript tools I've built (with AI assistance) along with their commit histories, including links to prompting transcripts. I wrote about how I built that colophon the other day. It now also includes a description of each tool, generated using Claude 3.7 Sonnet.
This is the single largest chunk of AI-generated text I've ever published on the indexable web. I share lots of LLM transcripts but I usually put them in a secret Gist such that they'll be kept out of search engines.
On the one hand this feels slop adjacent - but I was never going to write documentation for all 78 of those little experiments, yet I find myself forgetting details of how they worked. Just one example: my CSV marker map tool which I built last year for knocking out an ad-hoc Bigfoot sightings map and then promptly forgot what it could do.
Here's what I got when I ran:
cat csv-marker-map.html | llm -m claude-3.7-sonnet \
--system "Write a paragraph of documentation for this page as markdown. Do not include a heading. Do not use words like just or simply."
This interactive map displays markers based on data from a CSV file. Load data by adding ?csv=URL to the page URL, where the CSV must include latitude and longitude columns. You can customize the map with several URL parameters: center (coordinates in format lat,lng), zoom (zoom level), q (location to search for), marker (add individual markers at coordinates), and color (change the default marker color). The map automatically updates the URL as you navigate, preserving your current view. Markers from CSV files appear as small colored circles on the map, while markers specified directly in the URL use the default Leaflet marker style.
That's genuinely useful. I got Claude to write me a write_docs.py script to loop through all 78 files and generate filename.docs.md descriptions that live in the repo, then updated build_colophon.py to include those descriptions if they exist.
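The real script is in the repo; a minimal sketch of the same idea, shelling out to the llm CLI for each tool (everything here is simplified, only the file layout matches what I described):

#!/usr/bin/env python3
# Minimal sketch of a write_docs.py-style script (not the exact one Claude
# wrote for me): generate a filename.docs.md file for every HTML tool that
# does not have one yet, by piping the HTML into the llm CLI.
import pathlib
import subprocess

SYSTEM = (
    "Write a paragraph of documentation for this page as markdown. "
    "Do not include a heading. Do not use words like just or simply."
)

for html_file in sorted(pathlib.Path(".").glob("*.html")):
    docs_file = html_file.with_suffix(".docs.md")
    if docs_file.exists():
        continue  # the real script also compares commit hashes, see below
    result = subprocess.run(
        ["llm", "-m", "claude-3.7-sonnet", "--system", SYSTEM],
        input=html_file.read_text(),
        capture_output=True, text=True, check=True,
    )
    docs_file.write_text(result.stdout)
    print(f"Documented {html_file}")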
It cost about $1.08 in Claude credits to document all 78 tools, because some of those HTML files were quite long so the input tokens added up.
Each of the documentation files includes a comment with the most recent commit hash of the file that was used to generate the document, like this:
<!-- Generated from commit: 7c6af8eeabc7682b5f9ec2621e34bc771c5471d8 -->
The tool can use this to spot if a file has been updated - if so, the documentation will be regenerated.
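As an example of how that check can work (a sketch, not the exact logic in my script - the file name here is just an example):

# Compare the hash recorded in the .docs.md comment with the most recent
# commit that touched the corresponding HTML file.
latest=$(git log -1 --format=%H -- csv-marker-map.html)
recorded=$(grep -oE '[0-9a-f]{40}' csv-marker-map.docs.md | head -1)
if [ "$latest" != "$recorded" ]; then
  echo "csv-marker-map.html has changed - regenerate its documentation"
fi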
Add this all together and now I can drop new HTML+JavaScript tools into my simonw/tools repo and, moments later, they'll be published on tools.simonwillison.net with auto-generated descriptions added to my colophon. I think that's pretty neat!
Update: I decided that the descriptions were too long, so I modified the script to add "Keep it to 2-3 sentences" to the end of the system prompt. These new, shorter descriptions are now live - here's the diff. Total usage was 283,528 input tokens and 6,010 output tokens for a cost of 94 cents.
The new, shorter description for csv-marker-map.html looks like this:
This page creates an interactive map with markers based on CSV data. It accepts parameters in the URL to set the center, zoom level, search query, individual markers, and a CSV file URL for bulk marker placement. The markers are displayed on an OpenStreetMap base layer, and the map view automatically updates the URL when panned or zoomed.
For comparison, here's a copy of the previous colophon with the longer descriptions.
Quote 2025-03-11
Languages that allow for a structurally similar codebase offer a significant boon for anyone making code changes because we can easily port changes between the two codebases. In contrast, languages that require fundamental rethinking of memory management, mutation, data structuring, polymorphism, laziness, etc., might be a better fit for a ground-up rewrite, but we're undertaking this more as a port that maintains the existing behavior and critical optimizations we've built into the language. Idiomatic Go strongly resembles the existing coding patterns of the TypeScript codebase, which makes this porting effort much more tractable.
Link 2025-03-11 OpenAI API: Responses vs. Chat Completions:
OpenAI released a bunch of new API platform features this morning under the headline "New tools for building agents" (their somewhat mushy interpretation of "agents" here is "systems that independently accomplish tasks on behalf of users").
A particularly significant change is the introduction of a new Responses API, which is a slightly different shape from the Chat Completions API that they've offered for the past couple of years and which others in the industry have widely cloned as an ad-hoc standard.
In this guide they illustrate the differences, with a reassuring note that:
The Chat Completions API is an industry standard for building AI applications, and we intend to continue supporting this API indefinitely. We're introducing the Responses API to simplify workflows involving tool use, code execution, and state management. We believe this new API primitive will allow us to more effectively enhance the OpenAI platform into the future.
An API that is going away is the Assistants API, a perpetual beta first launched at OpenAI DevDay in 2023. The new Responses API effectively solves the same problems, but better, and assistants will be sunset "in the first half of 2026".
The best illustration I've seen of the differences between the two is this giant commit to the openai-python GitHub repository updating ALL of the example code in one go.
The most important feature of the Responses API (a feature it shares with the old Assistants API) is that it can manage conversation state on the server for you. An oddity of the Chat Completions API is that you need to maintain your own records of the current conversation, sending back full copies of it with each new prompt. You end up making API calls that look like this (from their examples):
{
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "user",
      "content": "knock knock."
    },
    {
      "role": "assistant",
      "content": "Who's there?"
    },
    {
      "role": "user",
      "content": "Orange."
    }
  ]
}
These can get long and unwieldy - especially when attachments such as images are involved - but the real challenge is when you start integrating tools: in a conversation with tool use you'll need to maintain that full state and drop messages in that show the output of the tools the model requested. It's not a trivial thing to work with.
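To illustrate what that looks like (a made-up example, not from their docs - the tool name, IDs and values here are invented): a single tool call means replaying the assistant's tool_calls message plus a role "tool" result message on every subsequent request, alongside the rest of the conversation.

# Illustrative only - get_weather, call_123 and the values are hypothetical
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"},
      {"role": "assistant", "content": null, "tool_calls": [
        {"id": "call_123", "type": "function",
         "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"}}
      ]},
      {"role": "tool", "tool_call_id": "call_123", "content": "18C and cloudy"},
      {"role": "user", "content": "Should I bring an umbrella?"}
    ],
    "tools": [{"type": "function", "function": {
      "name": "get_weather",
      "parameters": {"type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]}}}]
  }'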
The new Responses API continues to support this list of messages format, but you also get the option to outsource that to OpenAI entirely: you can add a new "store": true property and then in subsequent messages include a "previous_response_id": response_id key to continue that conversation.
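That looks something like this (a sketch based on their documentation; the response ID is a placeholder):

# First request opts in to server-side conversation state with "store": true
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "input": "knock knock.", "store": true}'

# Follow-up passes the "id" from that response (placeholder shown here)
# instead of replaying the whole conversation
curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "input": "Orange.",
       "previous_response_id": "resp_abc123"}'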
This feels a whole lot more natural than the Assistants API, which required you to think in terms of threads, messages and runs to achieve the same effect.
Also fun: the Responses API supports HTML form encoding now in addition to JSON:
curl https://api.openai.com/v1/responses \
-u :$OPENAI_API_KEY \
-d model="gpt-4o" \
-d input="What is the capital of France?"
I found that in an excellent Twitter thread providing background on the design decisions in the new API from OpenAI's Atty Eleti. Here's a nitter link for people who don't have a Twitter account.
New built-in tools
A potentially more exciting change today is the introduction of default tools that you can request while using the new Responses API. There are three of these, all of which can be specified in the "tools": [...] array - a sketch of a request using one of them follows this list.

{"type": "web_search_preview"} - the same search feature available through ChatGPT. The documentation doesn't clarify which underlying search engine is used - I initially assumed Bing, but the tool documentation links to this Overview of OpenAI Crawlers page so maybe it's entirely in-house now? Web search is priced at between $25 and $50 per thousand queries depending on if you're using GPT-4o or GPT-4o mini and the configurable size of your "search context".

{"type": "file_search", "vector_store_ids": [...]} provides integration with the latest version of their file search vector store, mainly used for RAG. "Usage is priced at $2.50 per thousand queries and file storage at $0.10/GB/day, with the first GB free".

{"type": "computer_use_preview", "display_width": 1024, "display_height": 768, "environment": "browser"} is the most surprising to me: it's tool access to the Computer-Using Agent system they built for their Operator product. This one is going to be a lot of fun to explore. The tool's documentation includes a warning about prompt injection risks. Though on closer inspection I think this may work more like Claude Computer Use, where you have to run the sandboxed environment yourself rather than outsource that difficult part to them.
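Requesting one of these looks something like this (a sketch - the tool spec is from their announcement, the prompt is mine):

curl https://api.openai.com/v1/responses \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "tools": [{"type": "web_search_preview"}],
    "input": "What is in the latest release of Datasette?"
  }'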
I'm still thinking through how to expose these new features in my LLM tool, which is made harder by the fact that a number of plugins now rely on the default OpenAI implementation from core, which is currently built on top of Chat Completions. I've been worrying for a while about the impact of our entire industry building clones of one proprietary API that might change in the future - I guess now we get to see how that shakes out!
Link 2025-03-11 OpenAI Agents SDK:
OpenAI's other big announcement today (see also) - a Python library (openai-agents) for building "agents", which is a replacement for their previous swarm research project.
In this project, an "agent" is a class that configures an LLM with a system prompt and access to specific tools.
An interesting idea in this one is the concept of handoffs, where one agent can choose to hand execution over to a different system-prompt-plus-tools agent, treating it almost like a tool itself. This code example illustrates the idea:
from agents import Agent, handoff

billing_agent = Agent(
    name="Billing agent"
)
refund_agent = Agent(
    name="Refund agent"
)
triage_agent = Agent(
    name="Triage agent",
    handoffs=[billing_agent, handoff(refund_agent)]
)
The library also includes guardrails - classes you can add that attempt to filter user input to make sure it fits expected criteria. Bits of this look suspiciously like trying to solve AI security problems with more AI to me.
Link 2025-03-13 Smoke test your Django admin site:
Justin Duke demonstrates a neat pattern for running simple tests against your internal Django admin site: introspect every admin route via django.urls.get_resolver() and loop through them with @pytest.mark.parametrize to check they all return a 200 HTTP status code.
This catches simple mistakes with the admin configuration that trigger exceptions that might otherwise go undetected.
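A minimal sketch of the pattern (not Justin's exact code - this iterates over admin.site._registry rather than get_resolver(), and relies on pytest-django's admin_client fixture):

import pytest
from django.contrib import admin
from django.urls import reverse

# Every registered ModelAdmin, as (app_label, model_name) pairs
registered = [
    (model._meta.app_label, model._meta.model_name)
    for model in admin.site._registry
]


@pytest.mark.parametrize("app_label,model_name", registered)
def test_admin_changelist(admin_client, app_label, model_name):
    # admin_client is a pytest-django fixture logged in as a superuser
    url = reverse(f"admin:{app_label}_{model_name}_changelist")
    assert admin_client.get(url).status_code == 200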
I rarely write automated tests against my own admin sites and often feel guilty about it. I wrote some notes on testing it with pytest-django fixtures a few years ago.
Link 2025-03-13 Introducing Command A: Max performance, minimal compute:
New LLM release from Cohere. It's interesting to see which aspects of the model they're highlighting, as an indicator of what their commercial customers value the most (highlight mine):
Command A delivers maximum performance with minimal hardware costs when compared to leading proprietary and open-weights models, such as GPT-4o and DeepSeek-V3. For private deployments, Command A excels on business-critical agentic and multilingual tasks, while being deployable on just two GPUs, compared to other models that typically require as many as 32.
It's open weights but very much not open source - the license is Creative Commons Attribution Non-Commercial and also requires adhering to their Acceptable Use Policy.
Cohere offer it for commercial use via "contact us" pricing or through their API. I released llm-command-r 0.3 adding support for this new model, plus their smaller and faster Command R7B (released in December) and support for structured outputs via LLM schemas.
(I found a weird bug with their schema support where schemas that end in an integer output a seemingly limitless integer - in my experiments it affected Command R and the new Command A but not Command R7B.)
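For reference, here's roughly how I've been exercising the schema support - the model ID and key name below are assumptions (run llm models after installing to see what the plugin actually registers), and the schema is just an example that happens to end in an integer, as in the buggy case:

llm install llm-command-r
llm keys set cohere
# paste Cohere API key here
# model ID is an assumption - run "llm models" to see what got registered
llm -m command-a --schema 'name, age int' 'invent a pelican researcher'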
Link 2025-03-13 Anthropic API: Text editor tool:
Anthropic released a new "tool" today for text editing. It looks similar to the tool they offered as part of their computer use beta API, and the trick they've been using for a while in both Claude Artifacts and the new Claude Code to more efficiently edit files there.
The new tool requires you to implement several commands:
view - view a specified file, either the whole thing or a specified range
str_replace - execute an exact string match replacement on a file
create - create a new file with the specified contents
insert - insert new text after a specified line number
undo_edit - undo the last edit made to a specific file
Providing implementations of these commands is left as an exercise for the developer.
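To give a sense of the shape of that exercise, here's a rough sketch of a client-side handler - the parameter names approximate the tool's input schema (check Anthropic's docs for the exact ones), and there's no error handling, path sandboxing or proper undo stack:

from pathlib import Path

_backups: dict[str, str] = {}  # naive single-level undo history per file


def handle_text_editor(command: str, path: str, **kwargs) -> str:
    # Sketch only: parameter names approximate Anthropic's schema
    file = Path(path)
    if command == "view":
        lines = file.read_text().splitlines()
        start, end = kwargs.get("view_range", [1, len(lines)])
        return "\n".join(lines[start - 1:end])
    if command == "create":
        file.write_text(kwargs["file_text"])
        return f"Created {path}"
    if command == "undo_edit":
        file.write_text(_backups.pop(path))
        return f"Reverted last edit to {path}"
    # str_replace and insert both modify the file, so snapshot it first
    original = file.read_text()
    _backups[path] = original
    if command == "str_replace":
        file.write_text(original.replace(kwargs["old_str"], kwargs["new_str"], 1))
        return f"Replaced text in {path}"
    if command == "insert":
        lines = original.splitlines()
        lines.insert(kwargs["insert_line"], kwargs["new_str"])
        file.write_text("\n".join(lines))
        return f"Inserted text in {path}"
    raise ValueError(f"Unknown command: {command}")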
Once implemented, you can have conversations with Claude where it knows that it can request the content of existing files, make modifications to them and create new ones.
There's quite a lot of assembly required to start using this. I tried vibe coding an implementation by dumping a copy of the documentation into Claude itself but I didn't get as far as a working program - it looks like I'd need to spend a bunch more time on that to get something to work, so my effort is currently abandoned.
This was introduced in a post on Token-saving updates on the Anthropic API, which also included a simplification of their token caching API and a new Token-efficient tool use (beta), where sending a token-efficient-tools-2025-02-19 beta header to Claude 3.7 Sonnet can save 14-70% of the tokens needed to define tools and schemas.
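A sketch of what opting in looks like (the get_weather tool definition here is a made-up example; the beta header is the one from their post):

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: token-efficient-tools-2025-02-19" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 1024,
    "tools": [{
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "input_schema": {"type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]}
    }],
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}]
  }'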
Quote 2025-03-13
Today we release OLMo 2 32B, the most capable and largest model in the OLMo 2 family, scaling up the OLMo 2 training recipe used for our 7B and 13B models released in November. It is trained up to 6T tokens and post-trained using Tulu 3.1. OLMo 2 32B is the first fully-open model (all data, code, weights, and details are freely available) to outperform GPT3.5-Turbo and GPT-4o mini on a suite of popular, multi-skill academic benchmarks.
Link 2025-03-13 Xata Agent:
Xata are a hosted PostgreSQL company who also develop the open source pgroll and pgstream schema migration tools.
Their new "Agent" tool is a system that helps monitor and optimize a PostgreSQL server using prompts to LLMs.
Any time I see a new tool like this I go hunting for the prompts. It looks like the main system prompts for orchestrating the tool live here - here's a sample:
Provide clear, concise, and accurate responses to questions. Use the provided tools to get context from the PostgreSQL database to answer questions. When asked why a query is slow, call the explainQuery tool and also take into account the table sizes. During the initial assessment use the getTablesAndInstanceInfo, getPerfromanceAndVacuumSettings, and getPostgresExtensions tools. When asked to run a playbook, use the getPlaybook tool to get the playbook contents. Then use the contents of the playbook as an action plan. Execute the plan step by step.
The really interesting thing is those playbooks, each of which is implemented as a prompt in the lib/tools/playbooks.ts file. There are six of these so far:
SLOW_QUERIES_PLAYBOOK
GENERAL_MONITORING_PLAYBOOK
TUNING_PLAYBOOK
INVESTIGATE_HIGH_CPU_USAGE_PLAYBOOK
INVESTIGATE_HIGH_CONNECTION_COUNT_PLAYBOOK
INVESTIGATE_LOW_MEMORY_PLAYBOOK
Here's the full text of INVESTIGATE_LOW_MEMORY_PLAYBOOK:
Objective: To investigate and resolve low freeable memory in the PostgreSQL database. Step 1: Get the freeable memory metric using the tool getInstanceMetric. Step 3: Get the instance details and compare the freeable memory with the amount of memory available. Step 4: Check the logs for any indications of memory pressure or out of memory errors. If there are, make sure to report that to the user. Also this would mean that the situation is critical. Step 4: Check active queries. Use the tool getConnectionsGroups to get the currently active queries. If a user or application stands out for doing a lot of work, record that to indicate to the user. Step 5: Check the work_mem setting and shared_buffers setting. Think if it would make sense to reduce these in order to free up memory. Step 6: If there is no clear root cause for using memory, suggest to the user to scale up the Postgres instance. Recommend a particular instance class.
This is the first time I've seen prompts arranged in a "playbooks" pattern like this. What a weird and interesting way to write software!
Quote 2025-03-13
One of the most essential practices for maintaining the long-term quality of computer code is to write automated tests that ensure the program continues to act as expected, even when other people (including your future self) muck with it.
Quote 2025-03-14
[...] in 2013, I did not understand that the things I said had meaning. I hate talking about this because it makes me seem more important than I am, but it’s also important to acknowledge. I saw myself at the time as just Steve, some random guy. If I say something on the internet, it’s like I’m talking to a friend in real life, my words are just random words and I’m human and whatever. It is what it is.
But at that time in my life, that wasn’t actually the case. I was on the Rails team, I was speaking at conferences, and people were reading my blog and tweets. I was an “influencer,” for better or worse. But I hadn’t really internalized that change in my life yet. And so I didn’t really understand that if I criticized something, it was something thousands of people would see.