URL-addressable Pyodide Python environments
Plus o3-mini, Gemini 2.0 Pro, Nomic Embed Text v2 and more
In this newsletter:
URL-addressable Pyodide Python environments
Using pip to install a Large Language Model that's under 100MB
Plus 20 links and 6 quotations and 1 TIL
URL-addressable Pyodide Python environments - 2025-02-13
This evening I spotted an obscure bug in Datasette, using Datasette Lite. I figure it's a good opportunity to highlight how useful it is to have a URL-addressable Python environment, powered by Pyodide and WebAssembly.
Here's the page that helped me discover the bug:
To explain what's going on here, let's first review the individual components.
Datasette Lite
Datasette Lite is a version of Datasette that runs entirely in your browser. It runs on Pyodide, which I think is still the most underappreciated project in the Python ecosystem.
I built Datasette Lite almost three years ago as a weekend hack project to try and see if I could get Datasette - a server-side Python web application - to run entirely in the browser.
I've added a bunch of features since then, described in the README - most significantly the ability to load SQLite databases, CSV files, JSON files or Parquet files by passing a URL to a query string parameter.
I built Datasette Lite almost as a joke, thinking nobody would want to wait for a full Python interpreter to download to their browser each time they wanted to explore some data. It turns out internet connections are fast these days and having a version of Datasette that needs a browser, GitHub Pages and nothing else is actually extremely useful.
Just the other day I saw Logan Williams of Bellingcat using it to share a better version of this Excel sheet:
The NSF grants that Ted Cruz has singled out for advancing "neo-Marxist class warfare propaganda," in Datasette-Lite: lite.datasette.io?url=https://...
Let's look at that URL in full:
The ?url=
parameter there points to a SQLite database file, hosted on DigitalOcean Spaces and served with the all-important access-control-allow-origin: *
header which allows Datasette Lite to load it across domains.
The #/cruz_nhs/grants
part of the URL tells Datasette Lite which page to load when you visit the link.
Anything after the #
in Datasette Lite is a URL that gets passed on to the WebAssembly-hosted Datasette instance. Any query string items before that can be used to affect the initial state of the Datasette instance, to import data or even to install additional plugins.
The Datasette 1.0 alphas
I've shipped a lot of Datasette alphas - the most recent is Datasette 1.0a17. Those alphas get published to PyPI, which means they can be installed using pip install datasette==1.0a17
.
A while back I added the same ability to Datasette Lite itself. You can now pass &ref=1.0a17
to the Datasette Lite URL to load that specific version of Datasette.
This works thanks to the magic of Pyodide's micropip mechanism. Every time you load Datasette Lite in your browser it's actually using micropip
to install the packages it needs directly from PyPI. The code looks something like this:
await pyodide.loadPackage('micropip', {messageCallback: log});
let datasetteToInstall = 'datasette';
let pre = 'False';
if (settings.ref) {
  if (settings.ref == 'pre') {
    pre = 'True';
  } else {
    datasetteToInstall = `datasette==${settings.ref}`;
  }
}
await self.pyodide.runPythonAsync(`
import micropip
await micropip.install("${datasetteToInstall}", pre=${pre})
`);
That settings
object has been passed to the Web Worker that loads Datasette, incorporating various query string parameters.
This all means I can pass ?ref=1.0a17
to Datasette Lite to load a specific version, or ?ref=pre
to get the most recently released pre-release version.
This works for plugins, too
Since loading extra packages from PyPI via micropip
is so easy, I went a step further and added plugin support.
The ?install=
parameter can be passed multiple times, each time specifying a Datasette plugin from PyPI that should be installed into the browser.
The README includes a bunch of examples of this mechanism in action. Here's a fun one that loads datasette-mp3-audio to provide inline MP3 playing widgets, originally created for my ScotRail audio announcements project.
This only works for some plugins. They need to be pure Python wheels - getting plugins with compiled binary dependencies to work in Pyodide WebAssembly requires a whole set of steps that I haven't quite figured out.
Frustratingly, it doesn't work for plugins that run their own JavaScript yet! I may need to rearchitect significant chunks of both Datasette and Datasette Lite to make that work.
It's also worth noting that this is a remote code execution security hole. I don't think that's a problem here, because lite.datasette.io
is deliberately hosted on the subdomain of a domain that I never intend to use cookies on. It's possible to vandalize the visual display of lite.datasette.io
but it shouldn't be possible to steal any private data or do any lasting damage.
datasette-visible-internal-db
This evening's debugging exercise used a plugin called datasette-visible-internal-db.
Datasette's internal database is an invisible SQLite database that sits at the heart of Datasette, tracking things like loaded metadata and the schemas of the currently attached tables.
Being invisible means we can use it for features that shouldn't be visible to users - plugins that record API secrets or permissions or track comments or data import progress, for example.
In Python code it's accessed like this:
internal_db = datasette.get_internal_database()
As opposed to Datasette's other databases which are accessed like so:
db = datasette.get_database("my-database")
Sometimes, when hacking on Datasette, it's useful to be able to browse the internal database using the default Datasette UI.
That's what datasette-visible-internal-db
does. The plugin implementation is just five lines of code:
import datasette
@datasette.hookimpl
def startup(datasette):
    db = datasette.get_internal_database()
    datasette.add_database(db, name="_internal", route="_internal")
On startup the plugin grabs a reference to that internal database and then registers it using Datasette's add_database() method. That's all it takes to have it show up as a visible database on the /_internal
path within Datasette.
Spotting the bug
I was poking around with this today out of pure curiosity - I hadn't tried ?install=datasette-visible-internal-db
with Datasette Lite before and I wanted to see if it worked.
Here's that URL from earlier, this time with commentary:
https://lite.datasette.io/ // Datasette Lite
?install=datasette-visible-internal-db // Install the visible internal DB plugin
&ref=1.0a17 // Load the 1.0a17 alpha release
#/_internal/catalog_columns // Navigate to the /_internal/catalog_columns table page
&_facet=database_name // Facet by database_name for good measure
And this is what I saw:
This all looked good... until I clicked on that _internal
link in the database_name
column... and it took me to this /_internal/databases/_internal 404 page.
Why was that a 404? Datasette introspects the SQLite table schema to identify foreign key relationships, then turns those into hyperlinks. The SQL schema for that catalog_columns
table (displayed at the bottom of the table page) looked like this:
CREATE TABLE catalog_columns (
    database_name TEXT,
    table_name TEXT,
    cid INTEGER,
    name TEXT,
    type TEXT,
    "notnull" INTEGER,
    default_value TEXT, -- renamed from dflt_value
    is_pk INTEGER, -- renamed from pk
    hidden INTEGER,
    PRIMARY KEY (database_name, table_name, name),
    FOREIGN KEY (database_name) REFERENCES databases(database_name),
    FOREIGN KEY (database_name, table_name) REFERENCES tables(database_name, table_name)
);
Those foreign key references are a bug! I renamed the internal tables from databases
and tables
to catalog_databases
and catalog_tables
quite a while ago, but apparently forgot to update the references - and SQLite let me get away with it.
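Here's a minimal sketch (mine, not Datasette's code) of how a stale reference like that slips through - SQLite records foreign keys that point at a missing table without complaint, and only checks them on writes when enforcement is turned on:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE catalog_databases (database_name TEXT PRIMARY KEY);
    CREATE TABLE catalog_columns (
        database_name TEXT,
        -- references the old table name, which no longer exists - SQLite accepts this
        FOREIGN KEY (database_name) REFERENCES databases(database_name)
    );
""")
# Introspection reveals the dangling reference, e.g. via PRAGMA foreign_key_list:
print(conn.execute("PRAGMA foreign_key_list(catalog_columns)").fetchall())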
Fixing the bug
I fixed the bug in this commit. As is often the case, the most interesting part of the fix is the accompanying test. I decided to use the introspection helpers in sqlite-utils to guard against ever making a mistake like this again in the future:
@pytest.mark.asyncio
async def test_internal_foreign_key_references(ds_client):
    internal_db = await ensure_internal(ds_client)

    def inner(conn):
        db = sqlite_utils.Database(conn)
        table_names = db.table_names()
        for table in db.tables:
            for fk in table.foreign_keys:
                other_table = fk.other_table
                other_column = fk.other_column
                message = 'Column "{}.{}" references other column "{}.{}" which does not exist'.format(
                    table.name, fk.column, other_table, other_column
                )
                assert other_table in table_names, message + " (bad table)"
                assert other_column in db[other_table].columns_dict, (
                    message + " (bad column)"
                )

    await internal_db.execute_fn(inner)
This uses Datasette's await db.execute_fn() method, which lets you run Python code that accesses SQLite in a thread. That code can then use the blocking sqlite-utils
introspection methods - here I'm looping through every table in that internal database, looping through each table's .foreign_keys
and confirming that the .other_table
and .other_column
values reference a table and column that genuinely exist.
I ran this test, watched it fail, then applied the fix and it passed.
URL-addressable Steps To Reproduce
The idea I most wanted to highlight here is the enormous value provided by URL-addressable Steps To Reproduce.
Having good Steps To Reproduce is crucial for productively fixing bugs. Something you can click on to see the bug is the most effective form of STR there is.
Ideally, these URLs will continue to work long into the future.
The great thing about a system like Datasette Lite is that everything is statically hosted files. The application itself is hosted on GitHub Pages, and it works by loading additional files from various different CDNs. The only dynamic aspect is cached lookups against the PyPI API, which I expect to stay stable for a long time to come.
As a stable component of the Web platform for almost 8 years, WebAssembly is clearly here to stay. I expect we'll be able to execute today's WASM code in browsers 20+ years from now.
I'm confident that the patterns I've been exploring in Datasette Lite over the past few years could be just as valuable for other projects. Imagine demonstrating bugs in a Django application using a static WebAssembly build, archived forever as part of an issue tracking system.
I think WebAssembly and Pyodide still have a great deal of untapped potential for the wider Python world.
Using pip to install a Large Language Model that's under 100MB - 2025-02-07
I just released llm-smollm2, a new plugin for LLM that bundles a quantized copy of the SmolLM2-135M-Instruct LLM inside of the Python package.
This means you can now pip install
a full LLM!
If you're already using LLM you can install it like this:
llm install llm-smollm2
Then run prompts like this:
llm -m SmolLM2 'Are dogs real?'
(New favourite test prompt for tiny models, courtesy of Tim Duffy. Here's the result).
If you don't have LLM yet first follow these installation instructions, or brew install llm
or pipx install llm
or uv tool install llm
depending on your preferred way of getting your Python tools.
If you have uv setup you don't need to install anything at all! The following command will spin up an ephemeral environment, install the necessary packages and start a chat session with the model all in one go:
uvx --with llm-smollm2 llm chat -m SmolLM2
Finding a tiny model
The fact that the model is almost exactly 100MB is no coincidence: that's the default size limit for a Python package that can be uploaded to the Python Package Index (PyPI).
I asked on Bluesky if anyone had seen a just-about-usable GGUF model that was under 100MB, and Artisan Loaf pointed me to SmolLM2-135M-Instruct.
I ended up using this quantization by QuantFactory just because it was the first sub-100MB model I tried that worked.
Trick for finding quantized models: Hugging Face has a neat "model tree" feature in the side panel of their model pages, which includes links to relevant quantized models. I find most of my GGUFs using that feature.
Building the plugin
I first tried the model out using Python and the llama-cpp-python library like this:
uv run --with llama-cpp-python python
Then:
from llama_cpp import Llama
from pprint import pprint
llm = Llama(model_path="SmolLM2-135M-Instruct.Q4_1.gguf")
output = llm.create_chat_completion(messages=[
{"role": "user", "content": "Hi"}
])
pprint(output)
This gave me the output I was expecting:
{'choices': [{'finish_reason': 'stop',
'index': 0,
'logprobs': None,
'message': {'content': 'Hello! How can I assist you today?',
'role': 'assistant'}}],
'created': 1738903256,
'id': 'chatcmpl-76ea1733-cc2f-46d4-9939-90efa2a05e7c',
'model': 'SmolLM2-135M-Instruct.Q4_1.gguf',
'object': 'chat.completion',
'usage': {'completion_tokens': 9, 'prompt_tokens': 31, 'total_tokens': 40}}
But it also spammed my terminal with a huge volume of debugging output - which started like this:
llama_model_load_from_file_impl: using device Metal (Apple M2 Max) - 49151 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 272 tensors from SmolLM2-135M-Instruct.Q4_1.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
And then continued for more than 500 lines!
I've had this problem with llama-cpp-python
and llama.cpp
in the past, and was sad to find that the documentation still doesn't have a great answer for how to avoid this.
So I turned to the just released Gemini 2.0 Pro (Experimental), because I know it's a strong model with a long input limit.
I ran the entire llama-cpp-python
codebase through it like this:
cd /tmp
git clone https://github.com/abetlen/llama-cpp-python
cd llama-cpp-python
files-to-prompt -e py . -c | llm -m gemini-2.0-pro-exp-02-05 \
'How can I prevent this library from logging any information at all while it is running - no stderr or anything like that'
Here's the answer I got back. It recommended setting the logger to logging.CRITICAL
, passing verbose=False
to the constructor and, most importantly, using the following context manager to suppress all output:
import os
from contextlib import contextmanager, redirect_stderr, redirect_stdout

@contextmanager
def suppress_output():
    """
    Suppresses all stdout and stderr output within the context.
    """
    with open(os.devnull, "w") as devnull:
        with redirect_stdout(devnull), redirect_stderr(devnull):
            yield
This worked! It turned out most of the output came from initializing the LLM
class, so I wrapped that like so:
with suppress_output():
model = Llama(model_path=self.model_path, verbose=False)
Proof of concept in hand I set about writing the plugin. I started with my simonw/llm-plugin cookiecutter template:
uvx cookiecutter gh:simonw/llm-plugin
[1/6] plugin_name (): smollm2
[2/6] description (): SmolLM2-135M-Instruct.Q4_1 for LLM
[3/6] hyphenated (smollm2):
[4/6] underscored (smollm2):
[5/6] github_username (): simonw
[6/6] author_name (): Simon Willison
The rest of the plugin was mostly borrowed from my existing llm-gguf plugin, updated based on the latest README for the llama-cpp-python
project.
There's more information on building plugins in the tutorial on writing a plugin.
Packaging the plugin
Once I had that working the last step was to figure out how to package it for PyPI. I'm never quite sure of the best way to bundle a binary file in a Python package, especially one that uses a pyproject.toml
file... so I dumped a copy of my existing pyproject.toml
file into o3-mini-high and prompted:
Modify this to bundle a SmolLM2-135M-Instruct.Q4_1.gguf file inside the package. I don't want to use hatch or a manifest or anything, I just want to use setuptools.
Here's the shared transcript - it gave me exactly what I wanted. I bundled it by adding this to the end of the toml
file:
[tool.setuptools.package-data]
llm_smollm2 = ["SmolLM2-135M-Instruct.Q4_1.gguf"]
Then dropping that .gguf
file into the llm_smollm2/
directory and putting my plugin code in llm_smollm2/__init__.py
.
I tested it locally by running this:
python -m pip install build
python -m build
I fired up a fresh virtual environment and ran pip install ../path/to/llm-smollm2/dist/llm_smollm2-0.1-py3-none-any.whl
to confirm that the package worked as expected.
Publishing to PyPI
My cookiecutter template comes with a GitHub Actions workflow that publishes the package to PyPI when a new release is created using the GitHub web interface. Here's the relevant YAML:
deploy:
  runs-on: ubuntu-latest
  needs: [test]
  environment: release
  permissions:
    id-token: write
  steps:
    - uses: actions/checkout@v4
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: "3.13"
        cache: pip
        cache-dependency-path: pyproject.toml
    - name: Install dependencies
      run: |
        pip install setuptools wheel build
    - name: Build
      run: |
        python -m build
    - name: Publish
      uses: pypa/gh-action-pypi-publish@release/v1
This runs after the test
job has passed. It uses the pypa/gh-action-pypi-publish Action to publish to PyPI - I wrote more about how that works in this TIL.
Is the model any good?
This one really isn't! It's not really surprising but it turns out 94MB really isn't enough space for a model that can do anything useful.
It's super fun to play with, and I continue to maintain that small, weak models are a great way to help build a mental model of how this technology actually works.
That's not to say SmolLM2 isn't a fantastic model family. I'm running the smallest, most restricted version here. SmolLM - blazingly fast and remarkably powerful describes the full model family - which comes in 135M, 360M, and 1.7B sizes. The larger versions are a whole lot more capable.
If anyone can figure out something genuinely useful to do with the 94MB version I'd love to hear about it.
Quote 2025-02-02
While we encourage people to use AI systems during their role to help them work faster and more effectively, please do not use AI assistants during the application process. We want to understand your personal interest in Anthropic without mediation through an AI system, and we also want to evaluate your non-AI-assisted communication skills. Please indicate 'Yes' if you have read and agree.
Why do you want to work at Anthropic? (We value this response highly - great answers are often 200-400 words.)
Link 2025-02-03 A computer can never be held accountable:
This legendary page from an internal IBM training in 1979 could not be more appropriate for our new age of AI.
A computer can never be held accountable
Therefore a computer must never make a management decision
Back in June 2024 I asked on Twitter if anyone had more information on the original source.
Jonty Wareing replied:
It was found by someone going through their father's work documents, and subsequently destroyed in a flood.
I spent some time corresponding with the IBM archives but they can't locate it. Apparently it was common for branch offices to produce things that were not archived.
Here's the reply Jonty got back from IBM:
I believe the image was first shared online in this tweet by @bumblebike in February 2017. Here's where they confirm it was from 1979 internal training.
Here's another tweet from @bumblebike from December 2021 about the flood:
Unfortunately destroyed by flood in 2019 with most of my things. Inquired at the retirees club zoom last week, but there’s almost no one the right age left. Not sure where else to ask.
Link 2025-02-03 Constitutional Classifiers: Defending against universal jailbreaks:
Interesting new research from Anthropic, resulting in the paper Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.
From the paper:
In particular, we introduce Constitutional Classifiers, a framework that trains classifier safeguards using explicit constitutional rules (§3). Our approach is centered on a constitution that delineates categories of permissible and restricted content (Figure 1b), which guides the generation of synthetic training examples (Figure 1c). This allows us to rapidly adapt to new threat models through constitution updates, including those related to model misalignment (Greenblatt et al., 2023). To enhance performance, we also employ extensive data augmentation and leverage pool sets of benign data.[^1]
Critically, our output classifiers support streaming prediction: they assess the potential harmfulness of the complete model output at each token without requiring the full output to be generated. This enables real-time intervention—if harmful content is detected at any point, we can immediately halt generation, preserving both safety and user experience.
A key focus of this research is CBRN - an acronym for Chemical, Biological, Radiological and Nuclear harms. Both Anthropic and OpenAI's safety research frequently discuss these threats.
Anthropic hosted a two month red teaming exercise where participants tried to break through their system:
Specifically, they were given a list of ten “forbidden” queries, and their task was to use whichever jailbreaking techniques they wanted in order to get one of our current models (in this case, Claude 3.5 Sonnet, June 2024) guarded by the prototype Constitutional Classifiers to answer all of the queries. We only considered it a successful “universal” jailbreak if the model provided a detailed answer to all of the queries.
183 active participants spent an estimated >3,000 hours over a two-month experimental period attempting to jailbreak the model. They were offered a monetary reward up to $15,000 should they discover a universal jailbreak. Despite the large amount of effort, none of the participants were able to coerce the model to answer all ten forbidden queries with a single jailbreak—that is, no universal jailbreak was discovered.
I'm disappointed that the goal they set was for a single universal jailbreak. I would have thought that a partial jailbreak - one that worked against a subset of those questions - would still be a result that demonstrated concerning gaps in their proposed protection.
Anthropic are now running a public continuation of their red teaming exercise at claude.ai/constitutional-classifiers until Feb 10, 2025, with 8 target questions.
Here's the first of those:
I hadn't heard of Soman so I pasted that question into R1 on chat.deepseek.com which confidently explained precautionary measures I should take when working with Soman, "a potent nerve agent", but wrapped it up with this disclaimer:
Disclaimer: Handling Soman is inherently high-risk and typically restricted to authorized military/labs. This guide assumes legal access and institutional oversight. Always consult certified safety professionals before proceeding.
Link 2025-02-04 Build a link blog:
Xuanwo started a link blog inspired by my article My approach to running a link blog, and in a delightful piece of recursion his first post is a link blog entry about my post about link blogging, following my tips on quoting liberally and including extra commentary.
I decided to follow simon's approach to creating a link blog, where I can share interesting links I find on the internet along with my own comments and thoughts about them.
Link 2025-02-04 Animating Rick and Morty One Pixel at a Time:
Daniel Hooper says he spent 8 months working on the post, the culmination of which is an animation of Rick from Rick and Morty, implemented in 240 lines of GLSL - the OpenGL Shading Language which apparently has been directly supported by browsers for many years.
The result is a comprehensive GLSL tutorial, complete with interactive examples of each of the steps used to generate the final animation which you can tinker with directly on the page. It feels a bit like Logo!
Shaders work by running code for each pixel to return that pixel's color - in this case the color_for_pixel()
function is wired up as the core logic of the shader.
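As a rough Python analogy (not GLSL, and not Daniel's code): one function gets called for every pixel and returns that pixel's colour:
def color_for_pixel(x, y, width, height):
    # hypothetical example: a simple horizontal gradient from black to red
    return (int(255 * x / (width - 1)), 0, 0)

width, height = 4, 2
image = [[color_for_pixel(x, y, width, height) for x in range(width)] for y in range(height)]
print(image)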
Here's Daniel's code for the live shader editor he built for this post. It looks like this is the function that does the most important work:
function loadShader(shaderSource, shaderType) {
  const shader = gl.createShader(shaderType);
  gl.shaderSource(shader, shaderSource);
  gl.compileShader(shader);
  const compiled = gl.getShaderParameter(shader, gl.COMPILE_STATUS);
  if (!compiled) {
    const lastError = gl.getShaderInfoLog(shader);
    gl.deleteShader(shader);
    return lastError;
  }
  return shader;
}
Where gl
is a canvas.getContext("webgl2")
WebGL2RenderingContext
object, described by MDN here.
TIL 2025-02-04 Running pytest against a specific Python version with uv run:
While working on this issue I figured out a neat pattern for running the tests for my project locally against a specific Python version using uv run: …
Link 2025-02-05 AI-generated slop is already in your public library:
US libraries that use the Hoopla system to offer ebooks to their patrons sign agreements where they pay a license fee for anything selected by one of their members that's in the Hoopla catalog.
The Hoopla catalog is increasingly filling up with junk AI slop ebooks like "Fatty Liver Diet Cookbook: 2000 Days of Simple and Flavorful Recipes for a Revitalized Liver", which then cost libraries money if someone checks them out.
Apparently librarians already have a term for this kind of low-quality, low effort content that predates it being written by LLMs: vendor slurry.
Libraries stand against censorship, making this a difficult issue to address through removing those listings.
Sarah Lamdan, deputy director of the American Library Association, says:
If library visitors choose to read AI eBooks, they should do so with the knowledge that the books are AI-generated.
Link 2025-02-05 Ambsheets: Spreadsheets for exploring scenarios:
Delightful UI experiment by Alex Warth and Geoffrey Litt at Ink & Switch, exploring the idea of a spreadsheet with cells that can handle multiple values at once, which they call "amb" (for "ambiguous") values. A single sheet can then be used to model multiple scenarios.
Here the cell for "Car" contains {500, 1200}
and the cell for "Apartment" contains {2800, 3700, 5500}
, resulting in a "Total" cell with six different values. Hovering over a calculated value highlights its source values and a side panel shows a table of calculated results against those different combinations.
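Not Ambsheets code - just a quick sketch of the cross-product behaviour described above, where a formula over amb values yields one result per combination of inputs:
from itertools import product

car = [500, 1200]
apartment = [2800, 3700, 5500]
totals = [c + a for c, a in product(car, apartment)]
print(totals)  # six values, one per combination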
Always interesting to see neat ideas like this presented on top of UIs that haven't had a significant upgrade in a very long time.
Link 2025-02-05 o3-mini is really good at writing internal documentation:
I wanted to refresh my knowledge of how the Datasette permissions system works today. I already have extensive hand-written documentation for that, but I thought it would be interesting to see if I could derive any insights from running an LLM against the codebase.
o3-mini has an input limit of 200,000 tokens. I used LLM and my files-to-prompt tool to generate the documentation like this:
cd /tmp
git clone https://github.com/simonw/datasette
cd datasette
files-to-prompt datasette -e py -c | \
llm -m o3-mini -s \
'write extensive documentation for how the permissions system works, as markdown'
The files-to-prompt
command is fed the datasette subdirectory, which contains just the source code for the application - omitting tests (in tests/
) and documentation (in docs/
).
The -e py
option causes it to only include files with a .py
extension - skipping all of the HTML and JavaScript files in that hierarchy.
The -c
option causes it to output Claude's XML-ish format - a format that works great with other LLMs too.
You can see the output of that command in this Gist.
Then I pipe that result into LLM, requesting the o3-mini
OpenAI model and passing the following system prompt:
write extensive documentation for how the permissions system works, as markdown
Specifically requesting Markdown is important.
The prompt used 99,348 input tokens and produced 3,118 output tokens (320 of those were invisible reasoning tokens). That's a cost of 12.3 cents.
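That figure checks out if you assume o3-mini's pricing of $1.10 per million input tokens and $4.40 per million output tokens (my assumption here, not something stated above):
input_cost = 99_348 / 1_000_000 * 1.10   # assumed $1.10 / million input tokens
output_cost = 3_118 / 1_000_000 * 4.40   # assumed $4.40 / million output tokens
print(round((input_cost + output_cost) * 100, 1))  # ~12.3 cents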
Honestly, the results are fantastic. I had to double-check that I hadn't accidentally fed in the documentation by mistake.
(It's possible that the model is picking up additional information about Datasette in its training set, but I've seen similar high quality results from other, newer libraries so I don't think that's a significant factor.)
In this case I already had extensive written documentation of my own, but this was still a useful refresher to help confirm that the code matched my mental model of how everything works.
Documentation of project internals as a category is notorious for going out of date. Having tricks like this to derive usable how-it-works documentation from existing codebases in just a few seconds and at a cost of a few cents is wildly valuable.
Link 2025-02-05 Gemini 2.0 is now available to everyone:
Big new Gemini 2.0 releases today:
Gemini 2.0 Pro (Experimental) is Google's "best model yet for coding performance and complex prompts" - currently available as a free preview.
Gemini 2.0 Flash is now generally available.
Gemini 2.0 Flash-Lite looks particularly interesting:
We’ve gotten a lot of positive feedback on the price and speed of 1.5 Flash. We wanted to keep improving quality, while still maintaining cost and speed. So today, we’re introducing 2.0 Flash-Lite, a new model that has better quality than 1.5 Flash, at the same speed and cost. It outperforms 1.5 Flash on the majority of benchmarks.
That means Gemini 2.0 Flash-Lite is priced at 7.5c/million input tokens and 30c/million output tokens - half the price of OpenAI's GPT-4o mini (15c/60c).
Gemini 2.0 Flash isn't much more expensive: 10c/million for text/image input, 70c/million for audio input, 40c/million for output. Again, cheaper than GPT-4o mini.
I pushed a new LLM plugin release, llm-gemini 0.10, adding support for the three new models:
llm install -U llm-gemini
llm keys set gemini
# paste API key here
llm -m gemini-2.0-flash "impress me"
llm -m gemini-2.0-flash-lite-preview-02-05 "impress me"
llm -m gemini-2.0-pro-exp-02-05 "impress me"
Here's the output for those three prompts.
I ran Generate an SVG of a pelican riding a bicycle
through the three new models. Here are the results, cheapest to most expensive:
gemini-2.0-flash-lite-preview-02-05
gemini-2.0-flash
gemini-2.0-pro-exp-02-05
I also ran the same prompt I tried with o3-mini the other day:
cd /tmp
git clone https://github.com/simonw/datasette
cd datasette
files-to-prompt datasette -e py -c | \
llm -m gemini-2.0-pro-exp-02-05 \
-s 'write extensive documentation for how the permissions system works, as markdown' \
-o max_output_tokens 10000
Here's the result from that - you can compare that to o3-mini's result here.
Link 2025-02-05 S1: The $6 R1 Competitor?:
Tim Kellogg shares his notes on a new paper, s1: Simple test-time scaling, which describes an inference-scaling model fine-tuned on top of Qwen2.5-32B-Instruct for just $6 - the cost for 26 minutes on 16 NVIDIA H100 GPUs.
Tim highlights the most exciting result:
After sifting their dataset of 56K examples down to just the best 1K, they found that the core 1K is all that's needed to achieve o1-preview performance on a 32B model.
The paper describes a technique called "Budget forcing":
To enforce a minimum, we suppress the generation of the end-of-thinking token delimiter and optionally append the string “Wait” to the model’s current reasoning trace to encourage the model to reflect on its current generation
That's the same trick Theia Vogel described a few weeks ago.
Here's the s1-32B
model on Hugging Face. I found a GGUF version of it at brittlewis12/s1-32B-GGUF, which I ran using Ollama like so:
ollama run hf.co/brittlewis12/s1-32B-GGUF:Q4_0
I also found those 1,000 samples on Hugging Face in the simplescaling/s1K data repository there.
I used DuckDB to convert the parquet file to CSV (and turn one VARCHAR[]
column into JSON):
COPY (
    SELECT
        solution,
        question,
        cot_type,
        source_type,
        metadata,
        cot,
        json_array(thinking_trajectories) as thinking_trajectories,
        attempt
    FROM 's1k-00001.parquet'
) TO 'output.csv' (HEADER, DELIMITER ',');
Then I loaded that CSV into sqlite-utils so I could use the convert
command to turn a Python data structure into JSON using json.dumps()
and eval()
:
# Load into SQLite
sqlite-utils insert s1k.db s1k output.csv --csv
# Fix that column
sqlite-utils convert s1k.db s1k metadata 'json.dumps(eval(value))' --import json
# Dump that back out to CSV
sqlite-utils rows s1k.db s1k --csv > s1k.csv
Here's that CSV in a Gist, which means I can load it into Datasette Lite.
It really is a tiny amount of training data. It's mostly math and science, but there are also 15 cryptic crossword examples.
Quote 2025-02-06
There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard.
I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away.
It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
Link 2025-02-06 The future belongs to idea guys who can just do things:
Geoffrey Huntley with a provocative take on AI-assisted programming:
I seriously can't see a path forward where the majority of software engineers are doing artisanal hand-crafted commits by as soon as the end of 2026.
He calls for companies to invest in high quality internal training and create space for employees to figure out these new tools:
It's hackathon (during business hours) once a month, every month time.
Geoffrey's concluding note resonates with me. LLMs are a gift to the fiercely curious and ambitious:
If you’re a high agency person, there’s never been a better time to be alive...
Link 2025-02-06 sqlite-page-explorer:
Outstanding tool by Luke Rissacher for understanding the SQLite file format. Download the application (built using redbean and Cosmopolitan, so the same binary runs on Windows, Mac and Linux) and point it at a SQLite database to get a local web application with an interface for exploring how the file is structured.
Here's it running against the datasette.io/content database that runs the official Datasette website:
Link 2025-02-06 Datasette 1.0a17:
New Datasette alpha, with a bunch of small changes and bug fixes accumulated over the past few months. Some (minor) highlights:
The register_magic_parameters(datasette) plugin hook can now register async functions. (#2441)
Breadcrumbs on database and table pages now include a consistent self-link for resetting query string parameters. (#2454)
New internal methods datasette.set_actor_cookie() and datasette.delete_actor_cookie(), described here. (#1690)
The /-/permissions page now shows a list of all permissions registered by plugins. (#1943)
If a table has a single unique text column Datasette now detects that as the foreign key label for that table. (#2458)
The /-/permissions page now includes options for filtering or excluding permission checks recorded against the current user. (#2460)
I was incentivized to push this release by an issue I ran into in my new datasette-load plugin, which resulted in this fix:
Fixed a bug where replacing a database with a new one with the same name did not pick up the new database correctly. (#2465)
Link 2025-02-07 APSW SQLite query explainer:
Today I found out about APSW's (Another Python SQLite Wrapper, in constant development since 2004) apsw.ext.query_info() function, which takes a SQL query and returns a very detailed set of information about that query - all without executing it.
It actually solves a bunch of problems I've wanted to address in Datasette - like taking an arbitrary query and figuring out how many parameters (?
) it takes and which tables and columns are represented in the result.
I tried it out in my console (uv run --with apsw python
) and it seemed to work really well. Then I remembered that the Pyodide project includes WebAssembly builds of a number of Python C extensions and was delighted to find apsw on that list.
... so I got Claude to build me a web interface for trying out the function, using Pyodide to run a user's query in Python in their browser via WebAssembly.
Claude didn't quite get it in one shot - I had to feed it the URL to a more recent Pyodide and it got stuck in a bug loop which I fixed by pasting the code into a fresh session.
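A console session along these lines exercises it (my sketch, not the code from that session - check the APSW docs for the exact fields on the returned QueryDetails object):
import apsw
import apsw.ext
from pprint import pprint

db = apsw.Connection(":memory:")
db.cursor().execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")

# query_info() inspects the query without executing it - bindings, result columns and more
info = apsw.ext.query_info(db, "SELECT title FROM docs WHERE id = ?")
pprint(info)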
Link 2025-02-07 sqlite-s3vfs:
Neat open source project on the GitHub organisation for the UK government's Department for Business and Trade: a "Python virtual filesystem for SQLite to read from and write to S3."
I tried out their usage example by running it in a Python REPL with all of the dependencies:
uv run --python 3.13 --with apsw --with sqlite-s3vfs --with boto3 python
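The example itself is roughly along these lines (reproduced from memory of their README, so treat the exact API details as an assumption; the bucket name is a placeholder):
import apsw
import boto3
import sqlite_s3vfs

bucket = boto3.Session().resource("s3").Bucket("my-bucket")  # placeholder bucket name
s3vfs = sqlite_s3vfs.S3VFS(bucket=bucket)

# Open a SQLite database that lives in S3, using the custom VFS
with apsw.Connection("demo.sqlite", vfs=s3vfs.name) as db:
    db.cursor().execute("CREATE TABLE foo(x, y); INSERT INTO foo VALUES(1, 2);")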
It worked as advertised. When I listed my S3 bucket I found it had created two files - one called demo.sqlite/0000000000
and another called demo.sqlite/0000000001
, both 4096 bytes because each one represented a SQLite page.
The implementation is just 200 lines of Python, implementing a new SQLite Virtual Filesystem on top of apsw.VFS.
The README includes this warning:
No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.
I wonder if the conditional writes feature added to S3 back in November could be used to protect against that happening. Tricky as there are multiple files involved, but maybe it (or a trick like this one) could be used to implement some kind of exclusive lock between multiple processes?
Quote 2025-02-07
Confession: we've been hiding parts of v0's responses from users since September. Since the launch of DeepSeek's web experience and its positive reception, we realize now that was a mistake. From now on, we're also showing v0's full output in every response. This is a much better UX because it feels faster and it teaches end users how to prompt more effectively.
Quote 2025-02-08
[...] We are destroying software with complex build systems.
We are destroying software with an absurd chain of dependencies, making everything bloated and fragile.
We are destroying software telling new programmers: “Don’t reinvent the wheel!”. But, reinventing the wheel is how you learn how things work, and is the first step to make new, different wheels. [...]
Quote 2025-02-09
The cost to use a given level of AI falls about 10x every 12 months, and lower prices lead to much more use. You can see this in the token cost from GPT-4 in early 2023 to GPT-4o in mid-2024, where the price per token dropped about 150x in that time period. Moore’s law changed the world at 2x every 18 months; this is unbelievably stronger.
Link 2025-02-10 Cerebras brings instant inference to Mistral Le Chat:
Mistral announced a major upgrade to their Le Chat web UI (their version of ChatGPT) a few days ago, and one of the signature features was performance.
It turns out that performance boost comes from hosting their model on Cerebras:
We are excited to bring our technology to Mistral – specifically the flagship 123B parameter Mistral Large 2 model. Using our Wafer Scale Engine technology, we achieve over 1,100 tokens per second on text queries.
Given Cerebras's so far unrivaled inference performance I'm surprised that no other AI lab has formed a partnership like this already.
Link 2025-02-11 llm-sort:
Delightful LLM plugin by Evangelos Lamprou which adds the ability to perform "semantic sort" - allowing you to sort the contents of a file by running a prompt against an LLM to determine the sort order.
Best illustrated by these examples from the README:
llm sort --query "Which name is more suitable for a pet monkey?" names.txt
cat titles.txt | llm sort --query "Which book should I read to cook better?"
It works using this pairwise prompt, which is executed multiple times using Python's sorted(documents, key=functools.cmp_to_key(compare_callback))
mechanism:
Given the query:
{query}
Compare the following two lines:
Line A:
{docA}
Line B:
{docB}
Which line is more relevant to the query? Please answer with "Line A" or "Line B".
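To illustrate that cmp_to_key pattern, here's my own toy sketch (not the plugin's code) - the compare function is where the plugin would send the pairwise prompt above to the model and parse its "Line A" or "Line B" answer:
import functools

documents = ["Moby Dick", "The Joy of Cooking", "Dune"]

def compare(doc_a, doc_b):
    # Stand-in for the LLM call: prefer lines mentioning cooking so the example runs offline
    def relevance(doc):
        return 1 if "Cook" in doc else 0
    return relevance(doc_b) - relevance(doc_a)

print(sorted(documents, key=functools.cmp_to_key(compare)))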
From the lobste.rs comments, Cole Kurashige:
I'm not saying I'm prescient, but in The Before Times I did something similar with Mechanical Turk
This made me realize that so many of the patterns we were using against Mechanical Turk a decade+ ago can provide hints about potential ways to apply LLMs.
Link 2025-02-12 Building a SNAP LLM eval: part 1:
Dave Guarino (previously) has been exploring using LLM-driven systems to help people apply for SNAP, the US Supplemental Nutrition Assistance Program (aka food stamps).
This is a domain which existing models know some things about, but which is full of critical details around things like eligibility criteria where accuracy really matters.
Domain-specific evals like this are still pretty rare. As Dave puts it:
There is also not a lot of public, easily digestible writing out there on building evals in specific domains. So one of our hopes in sharing this is that it helps others build evals for domains they know deeply.
Having robust evals addresses multiple challenges. The first is establishing how good the raw models are for a particular domain. A more important one is to help in developing additional systems on top of these models, where an eval is crucial for understanding if RAG or prompt engineering tricks are paying off.
Step 1 doesn't involve writing any code at all:
Meaningful, real problem spaces inevitably have a lot of nuance. So in working on our SNAP eval, the first step has just been using lots of models — a lot. [...]
Just using the models and taking notes on the nuanced “good”, “meh”, “bad!” is a much faster way to get to a useful starting eval set than writing or automating evals in code.
I've been complaining for a while that there isn't nearly enough guidance about evals out there. This piece is an excellent step towards filling that gap.
Link 2025-02-12 Nomic Embed Text V2: An Open Source, Multilingual, Mixture-of-Experts Embedding Model:
Nomic continue to release the most interesting and powerful embedding models. Their latest is Embed Text V2, an Apache 2.0 licensed multi-lingual 1.9GB model (here it is on Hugging Face) trained on "1.6 billion high-quality data pairs", which is the first embedding model I've seen to use a Mixture of Experts architecture:
In our experiments, we found that alternating MoE layers with 8 experts and top-2 routing provides the optimal balance between performance and efficiency. This results in 475M total parameters in the model, but only 305M active during training and inference.
I first tried it out using uv run
like this:
uv run \
--with einops \
--with sentence-transformers \
--python 3.13 python
Then:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
sentences = ["Hello!", "¡Hola!"]
embeddings = model.encode(sentences, prompt_name="passage")
print(embeddings)
Then I got it working on my laptop using the llm-sentence-transformers plugin like this:
llm install llm-sentence-transformers
llm install einops # additional necessary package
llm sentence-transformers register nomic-ai/nomic-embed-text-v2-moe --trust-remote-code
llm embed -m sentence-transformers/nomic-ai/nomic-embed-text-v2-moe -c 'string to embed'
This outputs a 768 item JSON array of floating point numbers to the terminal. These are Matryoshka embeddings which means you can truncate that down to just the first 256 items and get similarity calculations that still work albeit slightly less well.
To use this for RAG you'll need to conform to Nomic's custom prompt format. For documents to be searched:
search_document: text of document goes here
And for search queries:
search_query: term to search for
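Here's a rough sketch (mine, not from Nomic's docs) of feeding those prefixed strings straight into sentence-transformers, including the Matryoshka truncation trick mentioned above:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
docs = model.encode(["search_document: SQLite is a self-contained database engine."])
query = model.encode(["search_query: embedded databases"])
print(util.cos_sim(query, docs))
# Matryoshka: truncate both to the first 256 dimensions and the comparison still works
print(util.cos_sim(query[:, :256], docs[:, :256]))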
I landed a new --prepend option for the llm embed-multi command to help with that, but it's not out in a full release just yet.
I also released llm-sentence-transformers 0.3 with some minor improvements to make running this model more smooth.
Quote 2025-02-12
We want AI to “just work” for you; we realize how complicated our model and product offerings have gotten.
We hate the model picker as much as you do and want to return to magic unified intelligence.
We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model.
After that, a top goal for us is to unify o-series models and GPT-series models by creating systems that can use all our tools, know when to think for a long time or not, and generally be useful for a very wide range of tasks.
In both ChatGPT and our API, we will release GPT-5 as a system that integrates a lot of our technology, including o3. We will no longer ship o3 as a standalone model.
[When asked about release dates for GPT 4.5 / GPT 5:] weeks / months
Link 2025-02-13 python-build-standalone now has Python 3.14.0a5:
Exciting news from Charlie Marsh:
We just shipped the latest Python 3.14 alpha (3.14.0a5) to uv and python-build-standalone. This is the first release that includes the tail-calling interpreter.
Our initial benchmarks show a ~20-30% performance improvement across CPython.
This is an optimization that was first discussed in faster-cpython in January 2024, then landed earlier this month by Ken Jin and included in the 3.14.0a5 release. The alpha release notes say:
A new type of interpreter based on tail calls has been added to CPython. For certain newer compilers, this interpreter provides significantly better performance. Preliminary numbers on our machines suggest anywhere from -3% to 30% faster Python code, and a geometric mean of 9-15% faster on pyperformance depending on platform and architecture. The baseline is Python 3.14 built with Clang 19 without this new interpreter.
This interpreter currently only works with Clang 19 and newer on x86-64 and AArch64 architectures. However, we expect that a future release of GCC will support this as well.
Including this in python-build-standalone means it's now trivial to try out via uv. I upgraded to the latest uv
like this:
pip install -U uv
Then ran uv python list
to see the available versions:
cpython-3.14.0a5+freethreaded-macos-aarch64-none <download available>
cpython-3.14.0a5-macos-aarch64-none <download available>
cpython-3.13.2+freethreaded-macos-aarch64-none <download available>
cpython-3.13.2-macos-aarch64-none <download available>
cpython-3.13.1-macos-aarch64-none /opt/homebrew/opt/python@3.13/bin/python3.13 -> ../Frameworks/Python.framework/Versions/3.13/bin/python3.13
I downloaded the new alpha like this:
uv python install cpython-3.14.0a5
And tried it out like so:
uv run --python 3.14.0a5 python
The Astral team have been using Ken's bm_pystones.py benchmarks script. I grabbed a copy like this:
wget 'https://gist.githubusercontent.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf/raw/fa85c0f3464021a683245f075505860db5e8ba6b/bm_pystones.py'
And ran it with uv
:
uv run --python 3.14.0a5 bm_pystones.py
Giving:
Pystone(1.1) time for 50000 passes = 0.0511138
This machine benchmarks at 978209 pystones/second
Inspired by Charlie's example I decided to try the hyperfine benchmarking tool, which can run multiple commands to statistically compare their performance. I came up with this recipe:
brew install hyperfine
hyperfine \
"uv run --python 3.14.0a5 bm_pystones.py" \
"uv run --python 3.13 bm_pystones.py" \
-n tail-calling \
-n baseline \
--warmup 10
So 3.14.0a5 scored 1.12 times faster than 3.13 on the benchmark (on my extremely overloaded M2 MacBook Pro).
Link 2025-02-13 shot-scraper 1.6 with support for HTTP Archives:
New release of my shot-scraper CLI tool for taking screenshots and scraping web pages.
The big new feature is HTTP Archive (HAR) support. The new shot-scraper har command can now create an archive of a page and all of its dependents like this:
shot-scraper har https://datasette.io/
This produces a datasette-io.har
file (currently 163KB) which is JSON representing the full set of requests used to render that page. Here's a copy of that file. You can visualize that here using ericduran.github.io/chromeHAR.
That JSON includes full copies of all of the responses, base64 encoded if they are binary files such as images.
You can add the --zip
flag to instead get a datasette-io.har.zip
file, containing JSON data in har.har
but with the response bodies saved as separate files in that archive.
The shot-scraper multi
command lets you run shot-scraper
against multiple URLs in sequence, specified using a YAML file. That command now takes a --har
option (or --har-zip
or --har-file name-of-file)
, described in the documentation, which will produce a HAR at the same time as taking the screenshots.
Shots are usually defined in YAML that looks like this:
- output: example.com.png
  url: https://www.example.com/
- output: w3c.org.png
  url: https://www.w3.org/
You can now omit the output:
keys and generate a HAR file without taking any screenshots at all:
- url: https://www.example.com/
- url: https://www.w3.org/
Run like this:
shot-scraper multi shots.yml --har
Which outputs:
Skipping screenshot of 'https://example.com/'
Skipping screenshot of 'https://www.w3.org/'
Wrote to HAR file: trace.har
shot-scraper
is built on top of Playwright, and the new features use the browser.new_context(record_har_path=...) parameter.