In this newsletter:
Claude 3.5 Haiku
W̶e̶e̶k̶n̶o̶t̶e̶s̶ Monthnotes for October
Plus 14 links and 3 quotations
Claude 3.5 Haiku - 2024-11-04
Anthropic released Claude 3.5 Haiku today, a few days later than expected (they said it would be out by the end of October).
I was expecting this to be a complete replacement for their existing Claude 3 Haiku model, in the same way that Claude 3.5 Sonnet eclipsed the existing Claude 3 Sonnet while maintaining the same pricing.
Claude 3.5 Haiku is different. First, it doesn't (yet) support image inputs - so Claude 3 Haiku remains the least expensive Anthropic model for handling those.
Secondly, it's not priced the same as the previous Haiku. That was $0.25/million input and $1.25/million for output - the new 3.5 Haiku is 4x that at $1/million input and $5/million output.
Anthropic tweeted:
During final testing, Haiku surpassed Claude 3 Opus, our previous flagship model, on many benchmarks—at a fraction of the cost.
As a result, we've increased pricing for Claude 3.5 Haiku to reflect its increase in intelligence.
Given that Anthropic claim that their new Haiku out-performs their older Claude 3 Opus (still $15/m input and $75/m output!), this price isn't disappointing, but it's a small surprise nonetheless.
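To make the comparison concrete, here's a quick back-of-the-envelope calculation in Python using the per-million-token prices above (the token counts are just an illustrative example):
def cost_usd(input_tokens, output_tokens, input_per_m, output_per_m):
    # Prices are expressed in dollars per million tokens
    return input_tokens / 1_000_000 * input_per_m + output_tokens / 1_000_000 * output_per_m

# A hypothetical prompt with 10,000 input tokens and 2,000 output tokens:
print(cost_usd(10_000, 2_000, 0.25, 1.25))    # Claude 3 Haiku:   $0.005
print(cost_usd(10_000, 2_000, 1.00, 5.00))    # Claude 3.5 Haiku: $0.02
print(cost_usd(10_000, 2_000, 15.00, 75.00))  # Claude 3 Opus:    $0.30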
Accessing Claude 3.5 Haiku with LLM
I released a new version of my llm-claude-3 plugin with support for the new model. You can install (or upgrade) the plugin and run it like this:
llm install --upgrade llm-claude-3
llm keys set claude
# Paste API key here
llm -m claude-3.5-haiku 'describe memory management in Rust'
Here's the output from that prompt.
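The same model should also be usable from LLM's Python API. Here's a rough sketch (if the stored key isn't picked up automatically you can set model.key directly):
import llm

model = llm.get_model("claude-3.5-haiku")
# model.key = "..."  # only needed if the key from `llm keys set claude` isn't found
response = model.prompt("describe memory management in Rust")
print(response.text())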
Comparing prices
I added the new price to my LLM pricing calculator, which inspired me to extract this comparison table for the leading models from Gemini, Anthropic and OpenAI. Here they are sorted from least to most expensive:
Gemini 1.5 Flash-8B remains the model to beat on pricing: it's 1/6th of the price of the new Haiku - far less capable, but still extremely useful for tasks such as audio transcription.
Also notable from Anthropic's model comparison table: Claude 3.5 Haiku has a max output of 8,192 tokens (same as 3.5 Sonnet, but twice that of Claude 3 Opus and Claude 3 Haiku). 3.5 Haiku has a training cut-off date of July 2024, the most recent of any Anthropic model. 3.5 Sonnet is April 2024 and the Claude 3 family are all August 2023.
W̶e̶e̶k̶n̶o̶t̶e̶s̶ Monthnotes for October - 2024-10-30
I try to publish weeknotes at least once every two weeks. It's been four weeks since the last entry, so I guess this one counts as monthnotes instead.
In my defense, the reason I've fallen behind on weeknotes is that I've been publishing a lot of long-form blog entries this month.
Plentiful LLM vendor news
A lot of LLM stuff happened. OpenAI had their DevDay, which I used as an opportunity to try out live blogging for the first time. I figured out video scraping with Google Gemini and generally got excited about how incredibly inexpensive the Gemini models are. Anthropic launched Computer Use and JavaScript analysis, and the month ended with GitHub Universe.
My LLM tool goes multi-modal
My big achievement of the month was finally shipping multi-modal support for my LLM tool. This has been almost a year in the making: GPT-4 vision kicked off the new era of vision LLMs at OpenAI DevDay last November and I've been watching the space with keen interest ever since.
I had a couple of false starts at the feature, which was difficult at first because LLM acts as a cross-model abstraction layer, and it's hard to design those effectively without plenty of examples of different models.
Initially I thought the feature would just be for images, but then Google Gemini launched the ability to feed in PDFs, audio files and videos as well. That's why I renamed it from -i/--image to -a/--attachment - I'm glad I hadn't committed to the image UI before realizing that file attachments could be so much more.
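Here's roughly what attachments look like from LLM's Python API - treat this as a sketch rather than gospel, since the exact class and parameter names here are my shorthand and the docs are the source of truth:
import llm

model = llm.get_model("gpt-4o-mini")
# Attachments can be local files, URLs or raw bytes
response = model.prompt(
    "Describe this image",
    attachments=[llm.Attachment(path="photo.jpg")],
)
print(response.text())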
I'm really happy with how the feature turned out. The one missing piece at the moment is local models: I prototyped some incomplete local model plugins to verify the API design would work, but I've not yet pushed any of them to a state where I think they're ready to release. My research into mistral.rs was part of that process.
Now that attachments have landed I'm free to start thinking about the next major LLM feature. I'm leaning towards tool usage: enough models have tool use / structured output capabilities now that I think I can design an abstraction layer that works across all of them. The combination of tool use with LLM's plugin system is really fun to think about.
Blog entries
You can now run prompts against images, audio and video in your terminal using LLM
Run a prompt to generate and execute jq programs using llm-jq
Notes on the new Claude analysis JavaScript code execution tool
Initial explorations of Anthropic's new Computer Use capability
Running Llama 3.2 Vision and Phi-3.5 Vision on a Mac with mistral.rs
Experimenting with audio input and output for the OpenAI Chat Completion API
Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent
Releases
llm-mistral 0.7 - 2024-10-29
LLM plugin providing access to Mistral models using the Mistral API
llm-claude-3 0.6 - 2024-10-29
LLM plugin for interacting with the Claude 3 family of models
llm-gemini 0.3 - 2024-10-29
LLM plugin to access Google's Gemini family of models
llm 0.17 - 2024-10-29
Access large language models from the command-line
llm-whisper-api 0.1.1 - 2024-10-27
Run transcriptions using the OpenAI Whisper API
llm-jq 0.1.1 - 2024-10-27
Write and execute jq programs with the help of LLM
claude-to-sqlite 0.2 - 2024-10-21
Convert a Claude.ai export to SQLite
files-to-prompt 0.4 - 2024-10-16
Concatenate a directory full of files into a single prompt for use with LLMs
datasette-examples 0.1a0 - 2024-10-08
Load example SQL scripts into Datasette on startup
datasette 0.65 - 2024-10-07
An open source multi-tool for exploring and publishing data
TILs
Installing flash-attn without compiling it - 2024-10-25
Using uv to develop Python command-line applications - 2024-10-24
Setting cache-control: max-age=31536000 with a Cloudflare Transform Rule - 2024-10-24
Running prompts against images, PDFs, audio and video with Google Gemini - 2024-10-23
The most basic possible Hugo site - 2024-10-23
Livestreaming a community election event on YouTube - 2024-10-10
Upgrading Homebrew and avoiding the failed to verify attestation error - 2024-10-09
Collecting replies to tweets using JavaScript - 2024-10-09
Compiling and running sqlite3-rsync - 2024-10-04
Building an automatically updating live blog in Django - 2024-10-02
Link 2024-10-30 docs.jina.ai - the Jina meta-prompt:
From Jina AI on Twitter:
curl docs.jina.ai - This is our Meta-Prompt. It allows LLMs to understand our Reader, Embeddings, Reranker, and Classifier APIs for improved codegen. Using the meta-prompt is straightforward. Just copy the prompt into your preferred LLM interface like ChatGPT, Claude, or whatever works for you, add your instructions, and you're set.
The page is served using content negotiation. If you hit it with curl you get plain text, but a browser with text/html in the accept: header gets an explanation along with a convenient copy to clipboard button.
Link 2024-10-30 Creating a LLM-as-a-Judge that drives business results:
Hamel Husain's sequel to Your AI product needs evals. This is packed with hard-won actionable advice.
Hamel warns against using scores on a 1-5 scale, instead promoting an alternative he calls "Critique Shadowing". Find a domain expert (one is better than many, because you want to keep their scores consistent) and have them answer the yes/no question "Did the AI achieve the desired outcome?" - providing a critique explaining their reasoning for each of their answers.
This gives you a reliable score to optimize against, and the critiques mean you can capture nuance and improve the system based on that captured knowledge.
Most importantly, the critique should be detailed enough so that you can use it in a few-shot prompt for a LLM judge. In other words, it should be detailed enough that a new employee could understand it.
Once you've gathered this expert data you can switch to using an LLM-as-a-judge. You can then iterate on the prompt you use for it in order to converge its "opinions" with those of your domain expert.
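To make that concrete, here's a minimal sketch of how expert critiques might be folded into a few-shot judge prompt - the example records and wording are entirely made up for illustration:
# Hypothetical expert-labelled examples: yes/no verdict plus a critique
examples = [
    {
        "question": "Can I get a refund after 60 days?",
        "answer": "Yes, refunds are available at any time.",
        "verdict": "no",
        "critique": "Policy only allows refunds within 30 days, so this answer is wrong.",
    },
    {
        "question": "How do I reset my password?",
        "answer": "Use the 'Forgot password' link on the login page.",
        "verdict": "yes",
        "critique": "Correctly points the user at the self-service flow.",
    },
]

def build_judge_prompt(question, answer):
    parts = [
        "You are judging whether the AI achieved the desired outcome.",
        "Answer yes or no, then give a short critique explaining your reasoning.",
        "",
    ]
    for ex in examples:
        parts.append(f"Question: {ex['question']}")
        parts.append(f"Answer: {ex['answer']}")
        parts.append(f"Verdict: {ex['verdict']}")
        parts.append(f"Critique: {ex['critique']}")
        parts.append("")
    parts.append(f"Question: {question}")
    parts.append(f"Answer: {answer}")
    parts.append("Verdict:")
    return "\n".join(parts)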
Hamel concludes:
The real value of this process is looking at your data and doing careful analysis. Even though an AI judge can be a helpful tool, going through this process is what drives results. I would go as far as saying that creating a LLM judge is a nice “hack” I use to trick people into carefully looking at their data!
Link 2024-10-31 Australia/Lord_Howe is the weirdest timezone:
Lord Howe Island - part of Australia, population 382 - is unique in that the island's standard time zone is UTC+10:30 but is UTC+11 when daylight saving time applies. It's the only time zone where DST represents a 30 minute offset.
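You can see both offsets using Python's standard library zoneinfo module:
from datetime import datetime
from zoneinfo import ZoneInfo

lord_howe = ZoneInfo("Australia/Lord_Howe")
# July is standard time on Lord Howe Island
print(datetime(2024, 7, 1, 12, tzinfo=lord_howe).utcoffset())  # 10:30:00
# January is daylight saving time - only a 30 minute shift
print(datetime(2024, 1, 1, 12, tzinfo=lord_howe).utcoffset())  # 11:00:00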
Link 2024-10-31 Cerebras Coder:
Val Town founder Steve Krouse has been building demos on top of the Cerebras API that runs Llama3.1-70b at 2,000 tokens/second.
Having a capable LLM with that kind of performance turns out to be really interesting. Cerebras Coder is a demo that implements Claude Artifact-style on-demand JavaScript apps, and having it run at that speed means changes you request are visible within less than a second:
Steve's implementation (created with the help of Townie, the Val Town code assistant) demonstrates the simplest possible version of an iframe sandbox:
<iframe
  srcDoc={code}
  sandbox="allow-scripts allow-modals allow-forms allow-popups allow-same-origin allow-top-navigation allow-downloads allow-presentation allow-pointer-lock"
/>
Where code is populated by a setCode(...) call inside a React component.
The most interesting applications of LLMs continue to be where they operate in a tight loop with a human - this can make those review loops potentially much faster and more productive.
Link 2024-11-01 Control your smart home devices with the Gemini mobile app on Android:
Google are adding smart home integration to their Gemini chatbot - so far on Android only.
Have they considered the risk of prompt injection? It looks like they have, at least a bit:
Important: Home controls are for convenience only, not safety- or security-critical purposes. Don't rely on Gemini for requests that could result in injury or harm if they fail to start or stop.
The Google Home extension can’t perform some actions on security devices, like gates, cameras, locks, doors, and garage doors. For unsupported actions, the Gemini app gives you a link to the Google Home app where you can control those devices.
It can control lights and power, climate control, window coverings, TVs and speakers and "other smart devices, like washers, coffee makers, and vacuums".
I imagine we will see some security researchers having a lot of fun with this shortly.
Quote 2024-11-01
Lord Clement-Jones: To ask His Majesty's Government what assessment they have made of the cybersecurity risks posed by prompt injection attacks to the processing by generative artificial intelligence of material provided from outside government, and whether any such attacks have been detected thus far.
Lord Vallance of Balham: Security is central to HMG's Generative AI Framework, which was published in January this year and sets out principles for using generative AI safely and responsibly. The risks posed by prompt injection attacks, including from material provided outside of government, have been assessed as part of this framework and are continually reviewed. The published Generative AI Framework for HMG specifically includes Prompt Injection attacks, alongside other AI specific cyber risks.
Question for Department for Science, Innovation and Technology
Link 2024-11-01 Claude API: PDF support (beta):
Claude 3.5 Sonnet now accepts PDFs as attachments:
The new Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) model now supports PDF input and understands both text and visual content within documents.
I just released llm-claude-3 0.7 with support for the new attachment type (attachments are a very new feature), so now you can do this:
llm install llm-claude-3 --upgrade
llm -m claude-3.5-sonnet 'extract text' -a mydoc.pdf
Visual PDF analysis can also be turned on for the Claude.ai application.
Also new today: Claude now offers a free (albeit rate-limited) token counting API. This addresses a complaint I've had for a while: previously it wasn't possible to accurately estimate the cost of a prompt before sending it to be executed.
Link 2024-11-01 From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code:
Google's Project Zero security team used a system based around Gemini 1.5 Pro to find a previously unreported security vulnerability in SQLite (a stack buffer underflow), in time for it to be fixed prior to making it into a release.
A key insight here is that LLMs are well suited for checking for new variants of previously reported vulnerabilities:
A key motivating factor for Naptime and now for Big Sleep has been the continued in-the-wild discovery of exploits for variants of previously found and patched vulnerabilities. As this trend continues, it's clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach.
We also feel that this variant-analysis task is a better fit for current LLMs than the more general open-ended vulnerability research problem. By providing a starting point – such as the details of a previously fixed vulnerability – we remove a lot of ambiguity from vulnerability research, and start from a concrete, well-founded theory: "This was a previous bug; there is probably another similar one somewhere".
LLMs are great at pattern matching. It turns out feeding in a pattern describing a prior vulnerability is a great way to identify potential new ones.
Link 2024-11-02 SmolLM2:
New from Loubna Ben Allal and her research team at Hugging Face:
SmolLM2 is a family of compact language models available in three sizes: 135M, 360M, and 1.7B parameters. They are capable of solving a wide range of tasks while being lightweight enough to run on-device. [...]
It was trained on 11 trillion tokens using a diverse dataset combination: FineWeb-Edu, DCLM, The Stack, along with new mathematics and coding datasets that we curated and will release soon.
The model weights are released under an Apache 2 license. I've been trying these out using my llm-gguf plugin for LLM and my first impressions are really positive.
Here's a recipe to run a 1.7GB Q8 quantized model from lmstudio-community:
llm install llm-gguf
llm gguf download-model https://huggingface.co/lmstudio-community/SmolLM2-1.7B-Instruct-GGUF/resolve/main/SmolLM2-1.7B-Instruct-Q8_0.gguf -a smol17
llm chat -m smol17
Or at the other end of the scale, here's how to run the 138MB Q8 quantized 135M model:
llm gguf download-model https://huggingface.co/lmstudio-community/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q8_0.gguf -a smol135m
llm chat -m smol135m
The blog entry to accompany SmolLM2 should be coming soon, but in the meantime here's the entry from July introducing the first version: SmolLM - blazingly fast and remarkably powerful.
Link 2024-11-02 Please publish and share more:
💯 to all of this by Jeff Triplett:
Friends, I encourage you to publish more, indirectly meaning you should write more and then share it. [...]
You don’t have to change the world with every post. You might publish a quick thought or two that helps encourage someone else to try something new, listen to a new song, or binge-watch a new series.
Jeff shares my opinion on conclusions: giving myself permission to hit publish even when I haven't wrapped everything up neatly was a huge productivity boost for me:
Our posts are done when you say they are. You do not have to fret about sticking to landing and having a perfect conclusion. Your posts, like this post, are done after we stop writing.
And another 💯 to this footnote:
PS: Write and publish before you write your own static site generator or perfect blogging platform. We have lost billions of good writers to this side quest because they spend all their time working on the platform instead of writing.
Link 2024-11-02 Claude Token Counter:
Anthropic released a token counting API for Claude a few days ago.
I built this tool for running prompts, images and PDFs against that API to count the tokens in them.
The API is free (albeit rate limited), but you'll still need to provide your own API key in order to use it.
Here's the source code. I built this using two sessions with Claude - one to build the initial tool and a second to add PDF and image support. That second one is a bit of a mess - it turns out if you drop an HTML file onto a Claude conversation it converts it to Markdown for you, but I wanted it to modify the original HTML source.
The API endpoint also allows you to specify a model, but as far as I can tell from running some experiments the token count was the same for Haiku, Opus and Sonnet 3.5.
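If you want to call the API directly rather than use my tool, here's a minimal sketch with httpx - the endpoint and beta header reflect my reading of Anthropic's announcement, so double-check them against the official docs:
import os
import httpx

response = httpx.post(
    "https://api.anthropic.com/v1/messages/count_tokens",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        # Beta header name is my best recollection from the announcement
        "anthropic-beta": "token-counting-2024-11-01",
    },
    json={
        "model": "claude-3-5-sonnet-20241022",
        "messages": [{"role": "user", "content": "Hello, Claude"}],
    },
)
print(response.json())  # e.g. {"input_tokens": 10}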
Link 2024-11-03 Docling:
MIT licensed document extraction Python library from the Deep Search team at IBM, who released Docling v2 on October 16th.
Here's the Docling Technical Report paper from August, which provides details of two custom models: a layout analysis model for figuring out the structure of the document (sections, figures, text, tables etc) and a TableFormer model specifically for extracting structured data from tables.
Those models are available on Hugging Face.
Here's how to try out the Docling CLI interface using uvx (avoiding the need to install it first - though since it downloads models it will take a while to run the first time):
uvx docling mydoc.pdf --to json --to md
This will output a mydoc.json file with complex layout information and a mydoc.md Markdown file which includes Markdown tables where appropriate.
The Python API is a lot more comprehensive. It can even extract tables as Pandas DataFrames:
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
for table in result.document.tables:
    df = table.export_to_dataframe()
    print(df)
I ran that inside uv run --with docling python. It took a little while to run, but it demonstrated that the library works.
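The converted document can also be rendered straight to Markdown - a quick sketch, assuming an export_to_markdown() method on the document object (check the Docling docs for the exact name):
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
# Render the whole parsed document as Markdown
print(result.document.export_to_markdown())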
Link 2024-11-03 California Clock Change:
The clocks go back in California tonight and I finally built my dream application for helping me remember if I get an hour extra of sleep or not, using a Claude Artifact. Here's the transcript.
This is one of my favorite examples yet of the kind of tiny low stakes utilities I'm building with Claude Artifacts because the friction involved in churning out a working application has dropped almost to zero.
(I added another feature: it now includes a note of what time my dog thinks it is if the clocks have recently changed.)
Quote 2024-11-03
Building technology in startups is all about having the right level of tech debt. If you have none, you’re probably going too slow and not prioritizing product-market fit and the important business stuff. If you get too much, everything grinds to a halt. Plus, tech debt is a “know it when you see it” kind of thing, and I know that my definition of “a bunch of tech debt” is, to other people, “very little tech debt.”
Link 2024-11-04 Nous Hermes 3:
The Nous Hermes family of fine-tuned models have a solid reputation. Their most recent release came out in August, based on Meta's Llama 3.1:
Our training data aggressively encourages the model to follow the system and instruction prompts exactly and in an adaptive manner. Hermes 3 was created by fine-tuning Llama 3.1 8B, 70B and 405B, and training on a dataset of primarily synthetically generated responses. The model boasts comparable and superior performance to Llama 3.1 while unlocking deeper capabilities in reasoning and creativity.
The model weights are on Hugging Face, including GGUF versions of the 70B and 8B models. Here's how to try the 8B model (a 4.58GB download) using the llm-gguf plugin:
llm install llm-gguf
llm gguf download-model 'https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B-GGUF/resolve/main/Hermes-3-Llama-3.1-8B.Q4_K_M.gguf' -a Hermes-3-Llama-3.1-8B
llm -m Hermes-3-Llama-3.1-8B 'hello in spanish'
Nous Research partnered with Lambda Labs to provide inference APIs. It turns out Lambda host quite a few models now, currently providing free inference to users with an API key.
I just released the first alpha of a llm-lambda-labs plugin. You can use that to try the larger 405b model (very hard to run on a consumer device) like this:
llm install llm-lambda-labs
llm keys set lambdalabs
# Paste key here
llm -m lambdalabs/hermes3-405b 'short poem about a pelican with a twist'
Here's the source code for the new plugin, which I based on llm-mistral. The plugin uses httpx-sse to consume the stream of tokens from the API.
Link 2024-11-04 New OpenAI feature: Predicted Outputs:
Interesting new ability of the OpenAI API - the first time I've seen this from any vendor.
If you know your prompt is mostly going to return the same content - you're requesting an edit to some existing code, for example - you can now send that content as a "prediction" and have GPT-4o or GPT-4o mini use that to accelerate the returned result.
OpenAI's documentation says:
When providing a prediction, any tokens provided that are not part of the final completion are charged at completion token rates.
I initially misunderstood this as meaning you got a price reduction in addition to the latency improvement, but that's not the case: in the best possible case it will return faster and you won't be charged anything extra over the expected cost for the prompt, but the more the result differs from your prediction the more extra tokens you'll be billed for.
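For reference, here's roughly what the request looks like using the OpenAI Python SDK - a sketch based on their documentation, with the file name invented for illustration:
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

existing_code = open("hello.ts").read()  # hypothetical file being edited

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Rename the function to greetUser. Respond only with code."},
        {"role": "user", "content": existing_code},
    ],
    # The prediction: we expect most of the original code back unchanged
    prediction={"type": "content", "content": existing_code},
)
print(response.choices[0].message.content)
print(response.usage)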
I ran the example from the documentation both with and without the prediction and got these results. Without the prediction:
"usage": {
"prompt_tokens": 150,
"completion_tokens": 118,
"total_tokens": 268,
"completion_tokens_details": {
"accepted_prediction_tokens": 0,
"audio_tokens": null,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 0
}
That took 5.2 seconds and cost 0.1555 cents.
With the prediction:
"usage": {
"prompt_tokens": 166,
"completion_tokens": 226,
"total_tokens": 392,
"completion_tokens_details": {
"accepted_prediction_tokens": 49,
"audio_tokens": null,
"reasoning_tokens": 0,
"rejected_prediction_tokens": 107
}
That took 3.3 seconds and cost 0.2675 cents.
Further details from OpenAI's Steve Coffey:
We are using the prediction to do speculative decoding during inference, which allows us to validate large batches of the input in parallel, instead of sampling token-by-token!
[...] If the prediction is 100% accurate, then you would see no cost difference. When the model diverges from your speculation, we do additional sampling to “discover” the net-new tokens, which is why we charge rejected tokens at completion time rates.
Quote 2024-11-05
You already know Donald Trump. He is unfit to lead. Watch him. Listen to those who know him best. He tried to subvert an election and remains a threat to democracy. He helped overturn Roe, with terrible consequences. Mr. Trump's corruption and lawlessness go beyond elections: It's his whole ethos. He lies without limit. If he's re-elected, the G.O.P. won't restrain him. Mr. Trump will use the government to go after opponents. He will pursue a cruel policy of mass deportations. He will wreak havoc on the poor, the middle class and employers. Another Trump term will damage the climate, shatter alliances and strengthen autocrats. Americans should demand better. Vote.