In this newsletter:
Trying out llama.cpp's new vision support
Plus 11 links and 3 quotations and 1 TIL and 3 notes
Trying out llama.cpp's new vision support - 2025-05-10
This llama.cpp server vision support via libmtmd pull request - via Hacker News - was merged earlier today. The PR finally adds full support for vision models to the excellent llama.cpp project. It's documented on this page, with the more detailed technical notes covered here. Here are my notes on getting it working on a Mac.
llama.cpp models are usually distributed as .gguf files. This project introduces a new variant of those called mmproj, for multimodal projector. libmtmd is the new library for handling these.
You can try it out by compiling llama.cpp from source, but I found another option that works: you can download pre-compiled binaries from the GitHub releases. On macOS there's an extra step to jump through to get these working, which I'll describe below.
Update: it turns out the Homebrew package for llama.cpp turns things around extremely quickly. You can run brew install llama.cpp or brew upgrade llama.cpp and start running the below tools without any extra steps.
I downloaded the llama-b5332-bin-macos-arm64.zip file from this GitHub release and unzipped it, which created a build/bin directory.
That directory contains a bunch of binary executables and a whole lot of .dylib files. macOS wouldn't let me execute these files because they were quarantined. Running this command fixed that for the llama-mtmd-cli and llama-server executables and the .dylib files they needed:
sudo xattr -rd com.apple.quarantine llama-server llama-mtmd-cli *.dylib
Now I can run an interactive terminal LLM session using this command:
./llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
The first time this runs it fetches four files from unsloth/gemma-3-4b-it-GGUF - a quantized copy of Google DeepMind's Gemma 3 4B - and stores them in ~/Library/Caches/llama.cpp:
2.4G unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-UD-Q4_K_XL.gguf
217B unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-UD-Q4_K_XL.gguf.json
812M unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf
203B unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf.json
Once they've downloaded an interactive session starts. You can chat directly with the model or use the /image path-to-image.jpg command to load an image into the context.
I loaded in a pelican photo and prompted "describe it" and got back this:
It’s a striking photograph of a brown pelican standing in shallow water on a sandy beach. Here's a breakdown of what I see:
Subject: The main focus is a large brown pelican. It’s standing with its wings fully extended, creating a dramatic pose.
Pose: The pelican is in a slightly awkward, almost comical stance, with its legs planted firmly in the wet sand and its wings spread wide. It looks like it might be preparing to take off or is simply enjoying the water.
Color & Texture: The pelican is a rich, dark brown with lighter markings on its wings and head. The sand is a muted beige color, and the water reflects the sunlight, creating a shimmering effect. You can see the texture of the sand and the ripples in the water.
Lighting: The lighting appears to be bright and sunny, casting shadows and highlighting the details of the pelican's feathers.
Background: The background is blurred, suggesting a shallow depth of field, which keeps the focus on the pelican. It appears to be a beach scene.
Overall Impression: The image has a dynamic feel due to the pelican’s pose and the movement implied by the water. It’s a well-composed shot that captures a moment of wildlife in its natural habitat.
Do you want me to zoom in on a particular detail, or perhaps tell you more about pelicans?
Not bad for a 3.2GB model running on my laptop!
Running llama-server
Even more fun is the llama-server command. This starts a localhost web server running on port 8080 to serve the model, with both a web UI and an OpenAI-compatible API endpoint.
The command to run it is the same:
./llama-server -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
Now visit http://localhost:8080 in your browser to start interacting with the model:
It miscounted the pelicans in the group photo, but again, this is a tiny 3.2GB model.
With the server running on port 8080 you can also access the OpenAI-compatible API endpoint. Here's how to do that using curl:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Describe a pelicans ideal corporate retreat"}
]
}' | jq
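The endpoint speaks the standard OpenAI chat completions protocol, so the official openai Python client should work pointed at it too. Here's a rough sketch - the API key is a placeholder (llama-server doesn't need one) and the model name is largely ignored since the server only has one model loaded:
# Sketch: use the openai client against llama-server's OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="gemma-3-4b-it",  # placeholder - llama-server serves whatever model it loaded
    messages=[
        {"role": "user", "content": "Describe a pelican's ideal corporate retreat"}
    ],
)
print(response.choices[0].message.content)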
I built a new plugin for LLM just now called llm-llama-server to make interacting with this API more convenient. You can use that like this:
llm install llm-llama-server
llm -m llama-server 'invent a theme park ride for a pelican'
Or for vision models use llama-server-vision:
llm -m llama-server-vision 'describe this image' -a https://static.simonwillison.net/static/2025/pelican-group.jpg
The LLM plugin uses the streaming API, so responses will stream back to you as they are being generated.
Link 2025-05-07 astral-sh/ty:
Astral have been working on this "extremely fast Python type checker and language server, written in Rust" quietly but in-the-open for a while now. Here's the first public alpha release - albeit not yet announced - as ty on PyPI (nice donated two-letter name!).
You can try it out via uvx like this - run the command in a folder full of Python code and see what comes back:
uvx ty check
I got zero errors for my recent, simple condense-json library and a ton of errors for my more mature sqlite-utils library - output here.
It really is fast:
cd /tmp
git clone https://github.com/simonw/sqlite-utils
cd sqlite-utils
time uvx ty check
Reports it running in around a tenth of a second (0.109 total wall time) using multiple CPU cores:
uvx ty check 0.18s user 0.07s system 228% cpu 0.109 total
Running time uvx mypy . in the same folder (both after first ensuring the underlying tools had been cached) took around 7x longer:
uvx mypy . 0.46s user 0.09s system 74% cpu 0.740 total
This isn't a fair comparison yet as ty still isn't feature complete in comparison to mypy.
Link 2025-05-07 llm-prices.com:
I've been maintaining a simple LLM pricing calculator since October last year. I finally decided to split it out to its own domain name (previously it was hosted at tools.simonwillison.net/llm-prices), running on Cloudflare Pages.
The site runs out of my simonw/llm-prices GitHub repository. I ported the history of the old llm-prices.html file using a vibe-coded bash script that I forgot to save anywhere.
I rarely use AI-generated imagery in my own projects, but for this one I found an excellent reason to use GPT-4o image outputs... to generate the favicon! I dropped a screenshot of the site into ChatGPT (o4-mini-high in this case) and asked for the following:
design a bunch of options for favicons for this site in a single image, white background
I liked the top right one, so I cropped it in Pixelmator and made a 32x32 version. Here's what it looks like in my browser:
I added a new feature just now: the state of the calculator is now reflected in the #fragment-hash URL of the page, which means you can link to your previous calculations.
I implemented that feature using the new gemini-2.5-pro-preview-05-06, since that model boasts improved front-end coding abilities. It did a pretty great job - here's how I prompted it:
llm -m gemini-2.5-pro-preview-05-06 -f https://www.llm-prices.com/ -s 'modify this code so that the state of the page is reflected in the fragmenth hash URL - I want to capture the values filling out the form fields and also the current sort order of the table. These should be respected when the page first loads too. Update them using replaceHistory, no need to enable the back button.'
Here's the transcript and the commit updating the tool, plus an example link showing the new feature in action (and calculating the cost for that Gemini 2.5 Pro prompt at 16.8224 cents, after fixing the calculation).
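The arithmetic behind the calculator is nothing fancy. Here's a sketch of the formula it implements - the token counts and prices below are placeholders, not the real numbers from that prompt:
# Sketch of the per-million-token pricing formula llm-prices.com implements
def prompt_cost_cents(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    dollars = (
        input_tokens / 1_000_000 * input_price_per_m
        + output_tokens / 1_000_000 * output_price_per_m
    )
    return dollars * 100

# Placeholder token counts and prices, purely for illustration
print(prompt_cost_cents(100_000, 4_000, 1.25, 10.00))  # roughly 16.5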
Link 2025-05-07 Medium is the new large:
New model release from Mistral - this time closed source/proprietary. Mistral Medium claims strong benchmark scores similar to GPT-4o and Claude 3.7 Sonnet, but is priced at $0.40/million input and $2/million output - about the same price as GPT 4.1 Mini. For comparison, GPT-4o is $2.50/$10 and Claude 3.7 Sonnet is $3/$15.
The model is a vision LLM, accepting both images and text.
More interesting than the price is the deployment model. Mistral Medium may not be open weights but it is very much available for self-hosting:
Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above.
Mistral's other announcement today is Le Chat Enterprise. This is a suite of tools that can integrate with your company's internal data and provide "agents" (these look similar to Claude Projects or OpenAI GPTs), again with the option to self-host.
Is there a new open weights model coming soon? This note tucked away at the bottom of the Mistral Medium 3 announcement seems to hint at that:
With the launches of Mistral Small in March and Mistral Medium today, it's no secret that we're working on something 'large' over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we're excited to 'open' up what's to come :)
I released llm-mistral 0.12 adding support for the new model.
Link 2025-05-07 Create and edit images with Gemini 2.0 in preview:
Gemini 2.0 Flash has had image generation capabilities for a while now, and they're now available via the paid Gemini API - at 3.9 cents per generated image.
According to the API documentation you need to use the new gemini-2.0-flash-preview-image-generation model ID and specify {"responseModalities":["TEXT","IMAGE"]} as part of your request.
Here's an example that calls the API using curl (and fetches a Gemini key from the llm keys get store):
curl -s -X POST \
"https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key=$(llm keys get gemini)" \
-H "Content-Type: application/json" \
-d '{
"contents": [{
"parts": [
{"text": "Photo of a raccoon in a trash can with a paw-written sign that says I love trash"}
]
}],
"generationConfig":{"responseModalities":["TEXT","IMAGE"]}
}' > /tmp/raccoon.json
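The generated image comes back base64-encoded inside that JSON. Here's a quick sketch for extracting it - this assumes the usual response shape, with the image under candidates[0].content.parts[].inlineData, so adjust the keys if your response differs:
# Sketch: pull the base64-encoded image out of the saved response JSON
import base64, json

with open("/tmp/raccoon.json") as f:
    data = json.load(f)

for part in data["candidates"][0]["content"]["parts"]:
    inline = part.get("inlineData")
    if inline:
        with open("/tmp/raccoon.png", "wb") as out:
            out.write(base64.b64decode(inline["data"]))
        print("Wrote /tmp/raccoon.png", inline.get("mimeType"))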
Here's the response. I got Gemini 2.5 Pro to vibe-code me a new debug tool for visualizing that JSON. If you visit that tool and click the "Load an example" link you'll see the result of the raccoon image visualized:
The other prompt I tried was this one:
Provide a vegetarian recipe for butter chicken but with chickpeas not chicken and include many inline illustrations along the way
The result of that one was a 41MB JSON file(!) containing 28 images - which presumably cost over a dollar since images are 3.9 cents each.
Some of the illustrations it chose for that one were somewhat unexpected:
If you want to see that one you can click the "Load a really big example" link in the debug tool, then wait for your browser to fetch and render the full 41MB JSON file.
The most interesting feature of Gemini (as with GPT-4o images) is the ability to accept images as inputs. I tried that out with this pelican photo like this:
cat > /tmp/request.json << EOF
{
"contents": [{
"parts":[
{"text": "Modify this photo to add an inappropriate hat"},
{
"inline_data": {
"mime_type":"image/jpeg",
"data": "$(base64 -i pelican.jpg)"
}
}
]
}],
"generationConfig": {"responseModalities": ["TEXT", "IMAGE"]}
}
EOF
# Execute the curl command with the JSON file
curl -X POST \
'https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key='$(llm keys get gemini) \
-H 'Content-Type: application/json' \
-d @/tmp/request.json \
> /tmp/out.json
And now the pelican is wearing a hat:
Link 2025-05-07 Introducing web search on the Anthropic API:
Anthropic's web search (presumably still powered by Brave) is now also available through their API, in the shape of a new web search tool called web_search_20250305.
You can specify a maximum number of uses per prompt and you can also pass a list of disallowed or allowed domains, plus hints as to the user's current location.
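Here's roughly what that looks like with the anthropic Python client - a sketch based on my reading of the docs, so treat the exact parameter names (max_uses, allowed_domains, user_location) as assumptions:
# Sketch: calling Claude with the new web search tool enabled
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",
    max_tokens=1024,
    messages=[{"role": "user", "content": "What's new on sqlite.org this month?"}],
    tools=[{
        "type": "web_search_20250305",
        "name": "web_search",
        "max_uses": 3,                      # cap searches per prompt
        "allowed_domains": ["sqlite.org"],  # restrict which sites it can use
        "user_location": {"type": "approximate", "city": "San Francisco", "country": "US"},
    }],
)
print(response.content)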
Search results are returned in a format that looks similar to the Anthropic Citations API.
It's charged at $10 per 1,000 searches, which is a little more expensive than what the Brave Search API charges ($3 or $5 or $9 per thousand depending on how you're using them).
I couldn't find any details of additional rules surrounding storage or display of search results, which surprised me because both Google Gemini and OpenAI have these for their own API search results.
Link 2025-05-08 llm-gemini 0.19.1:
Bugfix release for my llm-gemini plugin, which was recording the number of output tokens (needed to calculate the price of a response) incorrectly for the Gemini "thinking" models. Those models turn out to return candidatesTokenCount and thoughtsTokenCount as two separate values which need to be added together to get the total billed output token count. Full details in this issue.
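In other words the fix boils down to something like this, using the field names above:
# Sketch: total billed output tokens for Gemini "thinking" models
def billed_output_tokens(usage_metadata):
    return (usage_metadata.get("candidatesTokenCount", 0)
            + usage_metadata.get("thoughtsTokenCount", 0))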
I spotted this potential bug in this response log this morning, and my concerns were confirmed when Paul Gauthier wrote about a similar fix in Aider in Gemini 2.5 Pro Preview 03-25 benchmark cost, where he noted that the $6.32 cost recorded to benchmark Gemini 2.5 Pro Preview 03-25 was incorrect. Since that model is no longer available (despite the date-based model alias persisting) Paul is not able to accurately calculate the new cost, but it's likely a lot more since the Gemini 2.5 Pro Preview 05-06 benchmark cost $37.
I've gone through my gemini tag and attempted to update my previous posts with new calculations - this mostly involved increases in the order of 12.336 cents to 16.316 cents (as seen here).
Quote 2025-05-08
But I’ve also had my own quiet concerns about what [vibe coding] means for early-career developers. So much of how I learned came from chasing bugs in broken tutorials and seeing how all the pieces connected, or didn’t. There was value in that. And maybe I’ve been a little protective of it.
A mentor challenged that. He pointed out that debugging AI generated code is a lot like onboarding into a legacy codebase, making sense of decisions you didn’t make, finding where things break, and learning to trust (or rewrite) what’s already there. That’s the kind of work a lot of developers end up doing anyway.
Quote 2025-05-08
Microservices only pay off when you have real scaling bottlenecks, large teams, or independently evolving domains. Before that? You’re paying the price without getting the benefit: duplicated infra, fragile local setups, and slow iteration.
Link 2025-05-08 Reservoir Sampling:
Yet another outstanding interactive essay by Sam Rose (previously), this time explaining how reservoir sampling can be used to select a "fair" random sample when you don't know how many options there are and don't want to accumulate them before making a selection.
Reservoir sampling is one of my favourite algorithms, and I've been wanting to write about it for years now. It allows you to solve a problem that at first seems impossible, in a way that is both elegant and efficient.
I appreciate that Sam starts the article with "No math notation, I promise." Lots of delightful widgets to interact with here, all of which help build an intuitive understanding of the underlying algorithm.
Sam shows how this algorithm can be applied to the real-world problem of sampling log files when incoming logs threaten to overwhelm a log aggregator.
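For reference, the core algorithm (Algorithm R) fits in a few lines of Python. This is the textbook version, not Sam's code:
import random

def reservoir_sample(iterable, k):
    # Keep the first k items, then give the n-th item a k/n chance of
    # replacing a random slot - every item ends up equally likely to be kept.
    reservoir = []
    for n, item in enumerate(iterable, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            j = random.randint(1, n)
            if j <= k:
                reservoir[j - 1] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 5))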
The dog illustration is commissioned art and the MIT-licensed code is available on GitHub.
Quote 2025-05-08
If Claude is asked to count words, letters, and characters, it thinks step by step before answering the person. It explicitly counts the words, letters, or characters by assigning a number to each. It only answers the person once it has performed this explicit counting step. [...]
If Claude is shown a classic puzzle, before proceeding, it quotes every constraint or premise from the person’s message word for word before inside quotation marks to confirm it’s not dealing with a new variant. [...]
If asked to write poetry, Claude avoids using hackneyed imagery or metaphors or predictable rhyming schemes.
Link 2025-05-08 SQLite CREATE TABLE: The DEFAULT clause:
If your SQLite create table statement includes a line like this:
CREATE TABLE alerts (
-- ...
alert_created_at text default current_timestamp
)
current_timestamp will be replaced with a UTC timestamp in the format 2025-05-08 22:19:33. You can also use current_time for HH:MM:SS and current_date for YYYY-MM-DD, again using UTC.
Posting this here because I hadn't previously noticed that this defaults to UTC, which is a useful detail. It's also a strong vote in favor of YYYY-MM-DD HH:MM:SS as a string format for use with SQLite, which doesn't otherwise provide a formal datetime type.
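Here's a quick way to see that UTC behavior from Python's standard library sqlite3 module:
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE alerts (
        id INTEGER PRIMARY KEY,
        alert_created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
db.execute("INSERT INTO alerts DEFAULT VALUES")
print(db.execute("SELECT alert_created_at FROM alerts").fetchone())
# e.g. ('2025-05-08 22:19:33',) - UTC, not local time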
Link 2025-05-09 Gemini 2.5 Models now support implicit caching:
I just spotted a cacheTokensDetails key in the token usage JSON while running a long chain of prompts against Gemini 2.5 Flash - despite not configuring caching myself:
{"cachedContentTokenCount": 200658, "promptTokensDetails": [{"modality": "TEXT", "tokenCount": 204082}], "cacheTokensDetails": [{"modality": "TEXT", "tokenCount": 200658}], "thoughtsTokenCount": 2326}
I went searching and it turns out Gemini had a massive upgrade to their prompt caching earlier today:
Implicit caching directly passes cache cost savings to developers without the need to create an explicit cache. Now, when you send a request to one of the Gemini 2.5 models, if the request shares a common prefix as one of previous requests, then it’s eligible for a cache hit. We will dynamically pass cost savings back to you, providing the same 75% token discount. [...]
To make more requests eligible for cache hits, we reduced the minimum request size for 2.5 Flash to 1024 tokens and 2.5 Pro to 2048 tokens.
Previously you needed to both explicitly configure the cache and pay a per-hour charge to keep that cache warm.
This new mechanism is so much more convenient! It imitates how both DeepSeek and OpenAI implement prompt caching, leaving Anthropic as the remaining large provider that requires you to manually configure prompt caching to get it to work.
Gemini's explicit caching mechanism is still available. The documentation says:
Explicit caching is useful in cases where you want to guarantee cost savings, but with some added developer work.
With implicit caching the cost savings aren't possible to predict in advance, especially since the cache timeout within which a prefix will be discounted isn't described and presumably varies based on load and other circumstances outside of the developer's control.
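You can at least see after the fact how much of a prompt was served from the cache. Here's a quick sketch using the key names from the usage block above:
# Sketch: what fraction of the prompt tokens were implicit cache hits?
usage = {
    "cachedContentTokenCount": 200658,
    "promptTokensDetails": [{"modality": "TEXT", "tokenCount": 204082}],
}
prompt_tokens = usage["promptTokensDetails"][0]["tokenCount"]
cached = usage["cachedContentTokenCount"]
print(f"{cached / prompt_tokens:.1%} of prompt tokens hit the cache")  # 98.3%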
Update: DeepMind's Philipp Schmid:
There is no fixed time, but it's should be a few minutes.
Link 2025-05-09 sqlite-utils 4.0a0:
New alpha release of sqlite-utils, my Python library and CLI tool for manipulating SQLite databases.
It's the first 4.0 alpha because there's a (minor) backwards-incompatible change: I've upgraded the .upsert() and .upsert_all() methods to use SQLite's UPSERT mechanism, INSERT INTO ... ON CONFLICT DO UPDATE. Details in this issue.
That feature was added to SQLite in version 3.24.0, released 2018-06-04. I'm pretty cautious about my SQLite version support since the underlying library can be difficult to upgrade, depending on your platform and operating system.
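Here's a sketch of what the change means in practice. The .upsert() call itself stays the same - it's the SQL underneath that switches to UPSERT (the statement in the comment is illustrative, not the exact SQL the library generates):
import sqlite_utils

db = sqlite_utils.Database("example.db")
# First call inserts the row, second call with the same primary key updates it
db["dogs"].upsert({"id": 1, "name": "Cleo", "age": 4}, pk="id")
db["dogs"].upsert({"id": 1, "name": "Cleo", "age": 5}, pk="id")
print(list(db["dogs"].rows))
# Roughly equivalent to:
#   INSERT INTO dogs (id, name, age) VALUES (?, ?, ?)
#     ON CONFLICT (id) DO UPDATE SET name = excluded.name, age = excluded.age;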
I'm going to leave the new alpha to bake for a little while before pushing a stable release. Since this is a major version bump I'm going to take the opportunity to see if there are any other minor API warts that I can clean up at the same time.
Note 2025-05-09
I had some notes in a GitHub issue thread in a private repository that I wanted to export as Markdown. I realized that I could get them using a combination of several recent projects.
Here's what I ran:
export GITHUB_TOKEN="$(llm keys get github)"
llm -f issue:https://github.com/simonw/todos/issues/170 \
-m echo --no-log | jq .prompt -r > notes.md
I have a GitHub personal access token stored in my LLM keys, for use with Anthony Shaw's llm-github-models plugin.
My own llm-fragments-github plugin expects an optional GITHUB_TOKEN environment variable, so I set that first - here's an issue to have it use the github key instead.
With that set, the issue: fragment loader can take a URL to a private GitHub issue thread and load it via the API using the token, then concatenate the comments together as Markdown. Here's the code for that.
Fragments are meant to be used as input to LLMs. I built a llm-echo plugin recently which adds a fake LLM called "echo" which simply echoes its input back out again.
Adding --no-log prevents that junk data from being stored in my LLM log database.
The output is JSON with a "prompt" key for the original prompt. I use jq .prompt to extract that out, then -r to get it as raw text (not a "JSON string").
... and I write the result to notes.md.
Link 2025-05-10 TIL: SQLite triggers:
I've been doing some work with SQLite triggers recently while working on sqlite-chronicle, and I decided I needed a single reference to exactly which triggers are executed for which SQLite actions and what data is available within those triggers.
I wrote this triggers.py script to output as much information about triggers as possible, then wired it into a TIL article using Cog. The Cog-powered source code for the TIL article can be seen here.
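For flavor, here's a minimal trigger example using Python's sqlite3 module - my own toy example, not one from the TIL:
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE audit_log (
        document_id INTEGER,
        action TEXT,
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- AFTER INSERT triggers can read the just-written row via NEW
    CREATE TRIGGER documents_ai AFTER INSERT ON documents
    BEGIN
        INSERT INTO audit_log (document_id, action) VALUES (new.id, 'insert');
    END;
""")
db.execute("INSERT INTO documents (title) VALUES ('hello')")
print(db.execute("SELECT * FROM audit_log").fetchall())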
Note 2025-05-10 Poker Face season two just started on Peacock (the US streaming service). It's my favorite thing on TV right now. I've started threads on MetaFilter FanFare for episodes one, two and three.
Note 2025-05-11 Achievement unlocked: tap danced in the local community college dance recital.