Talking Large Language Models with Rooftop Ruby

Plus embeddings, more embeddings and Datasette Cloud

Simon Willison

Sep 29, 2023

In this newsletter:

Talking Large Language Models with Rooftop Ruby
Weeknotes: Embeddings, more embeddings and Datasette Cloud

Plus 19 links and 5 quotations and 7 TILs

Talking Large Language Models with Rooftop Ruby - 2023-09-29

I'm on the latest episode of the Rooftop Ruby podcast with Collin Donnell and Joel Drapper, talking all things LLM.

Here's a full transcript of the episode, which I generated using Whisper and then tidied up manually (after failing to get a good editing job out of Claude and GPT-4). I've also provided a link from each section heading to jump to the relevant spot in the recording.

The topics we covered:

You can listen to it on Apple Podcasts, Spotify, Google Podcasts, Podcast Index, Overcast and a bunch of other places.

Weeknotes: Embeddings, more embeddings and Datasette Cloud - 2023-09-17

Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.

Embeddings in LLM

LLM gained embedding support in version 0.9, and then got binary embedding support (for CLIP) in version 0.10. I wrote about those releases in detail in:

Embeddings are a fascinating tool. If you haven't got your head around them yet the first of my blog entries tries to explain why they are so interesting.

There's a lot more I want to build on top of embeddings - most notably, LLM (or Datasette, or likely a combination of the two) will be growing support for Retrieval Augmented Generation on top of the LLM embedding mechanism.

Annotated releases

I always include a list of new releases in my weeknotes. This time I'm going to use those to illustrate the themes I've been working on.

The first group of releases relates to LLM and its embedding support. LLM 0.10 extended that support:

llm 0.10 - 2023-09-12
Access large language models from the command-line

Embedding models can now be built as LLM plugins. I've released two of those so far:

llm-sentence-transformers 0.1.2 - 2023-09-13
LLM plugin for embeddings using sentence-transformers
llm-clip 0.1 - 2023-09-12
Generate embeddings for images and text using CLIP with LLM

The CLIP one is particularly fun, because it genuinely allows you to build a sophisticated image search engine that runs entirely on your own computer!

symbex 1.4 - 2023-09-05
Find the Python code for specified symbols

Symbex is my tool for extracting symbols - functions, methods and classes - from Python code. I introduced that in Symbex: search Python code for functions and classes, then pipe them into a LLM.

Symbex 1.4 adds a tiny but impactful feature: it can now output a list of symbols as JSON, CSV or TSV. These output formats are designed to be compatible with the new llm embed-multi command, which means you can easily create embeddings for all of your functions:

symbex '*' '*:*' --nl | \
  llm embed-multi symbols - \
  --format nl --database embeddings.db --store

I haven't fully explored what this enables yet, but it should mean that both related functions and semantic function search ("Find me a function that downloads a CSV") are now easy to build.
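To make the idea of semantic search concrete, here's a minimal sketch of ranking stored items against a query by cosine similarity, one common way to compare embedding vectors. The function names and the three-dimensional vectors are made up for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy, hand-written vectors standing in for real embeddings (assumption:
# a real setup would read these from the embeddings database instead).
embeddings = {
    "download_csv": [0.9, 0.1, 0.0],
    "parse_json": [0.1, 0.9, 0.1],
    "fetch_url": [0.8, 0.2, 0.1],
}


def most_similar(query_vector, embeddings, n=2):
    """Return the names of the n stored items closest to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in embeddings.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:n]]


# A hypothetical query vector, as if produced by embedding the search text
results = most_similar([1.0, 0.0, 0.0], embeddings)
print(results)
```

The same ranking serves both use cases: pass in the embedding of a search phrase for semantic search, or the embedding of an existing function to find related ones.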

llm-cluster 0.2 - 2023-09-04
LLM plugin for clustering embeddings

Yet another thing you can do with embeddings is use them to find clusters of related items.

The neatest feature of llm-cluster is that you can ask it to generate names for these clusters by sending the names of the items in each cluster through another language model, something like this:

llm cluster issues 10 \
  -d issues.db \
  --summary \
  --prompt 'Short, concise title for this cluster of related documents'

One last embedding related project: datasette-llm-embed is a tiny plugin that adds a select llm_embed('sentence-transformers/all-mpnet-base-v2', 'This is some text') SQL function. I built it to support quickly prototyping embedding-related ideas in Datasette.

datasette-llm-embed 0.1a0 - 2023-09-08
Datasette plugin adding a llm_embed(model_id, text) SQL function

Spending time with embedding models has led me to spend more time with Hugging Face. I realized last week that the Hugging Face all models sorted by downloads page doubles as a list of the models that are most likely to be easy to use.

One of the models I tried out was Salesforce BLIP, an astonishing model that can genuinely produce usable captions for images.

It's really easy to work with. I ended up building this tiny little CLI tool that wraps the model:

blip-caption 0.1 - 2023-09-10
Generate captions for images with Salesforce BLIP

Releases driven by Datasette Cloud

Datasette Cloud continues to drive improvements to the wider Datasette ecosystem.

It runs on the latest Datasette 1.0 alpha series, taking advantage of the JSON write API.

This also means that it's been highlighting breaking changes in 1.0 that have caused old plugins to break, either subtly or completely.

This has driven a bunch of new plugin releases. Some of these are compatible with both 0.x and 1.x - the ones that only work with the 1.x alphas are themselves marked as alpha releases.

datasette-export-notebook 1.0.1 - 2023-09-15
Datasette plugin providing instructions for exporting data to Jupyter or Observable
datasette-cluster-map 0.18a0 - 2023-09-11
Datasette plugin that shows a map for any data with latitude/longitude columns
datasette-graphql 3.0a0 - 2023-09-07
Datasette plugin providing an automatic GraphQL API for your SQLite databases

Datasette Cloud's API works using database-backed access tokens, to ensure users can revoke tokens if they need to (something that's not easily done with purely signed tokens) and that each token can record when it was most recently used.

I've been building that into the existing datasette-auth-tokens plugin:

datasette-auth-tokens 0.4a3 - 2023-08-31
Datasette plugin for authenticating access using API tokens
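As a rough illustration of why database-backed tokens make revocation and last-used tracking straightforward, here's a minimal sketch using Python's sqlite3 module. The table layout and function names are assumptions invented for this example and are not taken from datasette-auth-tokens.

```python
import hashlib
import secrets
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only a hash of each token is stored
conn.execute(
    """
    CREATE TABLE tokens (
        id INTEGER PRIMARY KEY,
        token_hash TEXT UNIQUE,
        actor_id TEXT,
        revoked INTEGER DEFAULT 0,
        last_used_ms INTEGER
    )
    """
)


def _hash(token):
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def create_token(actor_id):
    """Generate a random token, store its hash and return the token itself."""
    token = secrets.token_urlsafe(32)
    conn.execute(
        "INSERT INTO tokens (token_hash, actor_id) VALUES (?, ?)",
        (_hash(token), actor_id),
    )
    return token


def check_token(token):
    """Return the actor ID for a valid token, recording when it was used."""
    row = conn.execute(
        "SELECT id, actor_id FROM tokens WHERE token_hash = ? AND revoked = 0",
        (_hash(token),),
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE tokens SET last_used_ms = ? WHERE id = ?",
        (int(time.time() * 1000), row[0]),
    )
    return row[1]


def revoke_token(token):
    """Revoking is a single UPDATE, because every check consults the table."""
    conn.execute(
        "UPDATE tokens SET revoked = 1 WHERE token_hash = ?", (_hash(token),)
    )
```

A purely signed token can be verified without a database lookup, which is exactly why it is hard to revoke: there is no row to flip.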

Alex Garcia has been working with me building out features for Datasette Cloud, generously sponsored by Fly.io.

We're beginning to build out social features for Datasette Cloud - features that will help teams privately collaborate on data investigations.

Alex has been building datasette-short-links as an experimental link shortener. In building that, we realized that we needed a mechanism for resolving actor IDs displayed in a list (e.g. this link created by X) to their actual names.

Datasette doesn't dictate the shape of actor representations, and there's no guarantee that actors would be represented in a predictable table.

So... we needed a new plugin hook. I released Datasette 1.0a6 with a new hook, actors_from_ids(actor_ids), which can be used to answer the question "who are the actors represented by these IDs".

Alex is using this in datasette-short-links, and I built two plugins to work with the new hook as well:

datasette 1.0a6 - 2023-09-08
An open source multi-tool for exploring and publishing data
datasette-debug-actors-from-ids 0.1a1 - 2023-09-08
Datasette plugin for trying out the actors_from_ids hook
datasette-remote-actors 0.1a1 - 2023-09-08
Datasette plugin for fetching details of actors from a remote endpoint
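To show the kind of question the actors_from_ids hook answers, here's a plain-Python sketch of the lookup itself. It is not a working plugin: registering it with Datasette's plugin system is omitted, the in-memory directory is hypothetical, and the fallback of returning a bare {"id": ...} dictionary for unknown IDs is an assumption made for this example.

```python
# Hypothetical directory standing in for wherever actor details actually live
# (a users table, a remote endpoint, an SSO provider...).
ACTOR_DIRECTORY = {
    "1": {"id": "1", "name": "Example User One"},
    "2": {"id": "2", "name": "Example User Two"},
}


def actors_from_ids(actor_ids):
    """Map each requested actor ID to a dictionary describing that actor."""
    return {
        actor_id: ACTOR_DIRECTORY.get(str(actor_id), {"id": actor_id})
        for actor_id in actor_ids
    }


resolved = actors_from_ids(["1", "99"])
print(resolved)
```

Resolving a whole batch of IDs in one call means a page listing many links only needs a single lookup, rather than one per row.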

Datasette Cloud lets users insert, edit and delete rows from their tables, using the plugin Alex built called datasette-write-ui which he introduced on the Datasette Cloud blog.

This inspired me to finally put out a fresh release of datasette-edit-schema - the plugin which provides the ability to edit table schemas - adding and removing columns, changing column types, even altering the order columns are stored in the table.

datasette-edit-schema 0.6 is a major release, with three significant new features:

You can now create a brand new table from scratch!
You can edit the table's primary key
You can modify the foreign key constraints on the table

Those last two became important when I realized that Datasette's API is much more interesting if there are foreign key relationships to follow.

Combine that with datasette-write-ui and Datasette Cloud now has a full set of features for building, populating and editing tables - backed by a comprehensive JSON API.

sqlite-migrate 0.1a2 - 2023-09-03
A simple database migration system for SQLite, based on sqlite-utils

sqlite-migrate is still marked as an alpha, but won't be for much longer: it's my attempt at a migration system for SQLite, inspired by Django migrations but with a less sophisticated set of features.

I'm using it in LLM now to manage the schema used to store embeddings, and it's beginning to show up in some Datasette plugins as well. I'll be promoting this to non-alpha status pretty soon.

sqlite-utils 3.35.1 - 2023-09-09
Python CLI utility and library for manipulating SQLite databases

A tiny fix in this, which with hindsight was less impactful than I thought.

I spotted a bug on Datasette Cloud when I configured full-text search on a column, then edited the schema and found that searches no longer returned the correct results.

It turned out the rowid column in SQLite was being rewritten by calls to the sqlite-utils table.transform() method. FTS records are related to their underlying row by rowid, so this was breaking search!

I pushed out a fix for this in 3.35.1. But then... I learned that rowids in SQLite have always been unstable - they are rewritten any time someone VACUUMs a table!

I've been designing future features for Datasette that assume that rowid is a useful stable identifier for a row. This clearly isn't going to work! I'm still thinking through the consequences of it, but I think there may be Datasette features (like the ability to comment on a row) that will only work for tables with a proper primary key.

sqlite-chronicle

sqlite-chronicle 0.1 - 2023-09-11
Use triggers to track when rows in a SQLite table were updated or deleted

This is very early, but I'm excited about the direction it's going in.

I keep on finding problems where I want to be able to synchronize various processes with the data in a table.

I built sqlite-history a few months ago, which uses SQLite triggers to create a full copy of the updated data every time a row in a table is edited.

That's a pretty heavy-weight solution. What if there was something lighter that could achieve a lot of the same goals?

sqlite-chronicle uses triggers to instead create what I'm calling a "chronicle table". This is a shadow table that records, for every row in the main table, four integer values:

added_ms - the timestamp in milliseconds when the row was added
updated_ms - the timestamp in milliseconds when the row was last updated
version - a constantly incrementing version number, global across the entire table
deleted - set to 1 if the row has been deleted

Just storing four integers (plus copies of the primary key) makes this a pretty tiny table, and hopefully one that's cheap to update via triggers.

But... having this table enables some pretty interesting things - because external processes can track the last version number that they saw and use it to see just which rows have been inserted and updated since that point.
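Here's a minimal sketch of the chronicle table idea using Python's sqlite3 module. The docs table, the docs_chronicle name and the trigger definitions are assumptions written for this example rather than the schema sqlite-chronicle actually creates, and re-inserting a previously deleted primary key is not handled.

```python
import sqlite3

# Milliseconds since the Unix epoch, computed in a way that works on older SQLite
NOW_MS = "CAST((julianday('now') - 2440587.5) * 86400000 AS INTEGER)"
NEXT_VERSION = "(SELECT COALESCE(MAX(version), 0) + 1 FROM docs_chronicle)"

conn = sqlite3.connect(":memory:")
conn.executescript(
    f"""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT);

    -- Shadow table: one row per row in docs, four integers plus the key
    CREATE TABLE docs_chronicle (
        id INTEGER PRIMARY KEY,
        added_ms INTEGER,
        updated_ms INTEGER,
        version INTEGER,
        deleted INTEGER DEFAULT 0
    );

    CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN
        INSERT INTO docs_chronicle (id, added_ms, updated_ms, version)
        VALUES (new.id, {NOW_MS}, {NOW_MS}, {NEXT_VERSION});
    END;

    CREATE TRIGGER docs_au AFTER UPDATE ON docs BEGIN
        UPDATE docs_chronicle
        SET updated_ms = {NOW_MS}, version = {NEXT_VERSION}
        WHERE id = old.id;
    END;

    CREATE TRIGGER docs_ad AFTER DELETE ON docs BEGIN
        UPDATE docs_chronicle
        SET updated_ms = {NOW_MS}, version = {NEXT_VERSION}, deleted = 1
        WHERE id = old.id;
    END;
    """
)


def changes_since(conn, version):
    """Return (id, version, deleted) for every row touched after `version`."""
    return conn.execute(
        "SELECT id, version, deleted FROM docs_chronicle "
        "WHERE version > ? ORDER BY version",
        (version,),
    ).fetchall()


conn.execute("INSERT INTO docs (title) VALUES ('first')")
conn.execute("INSERT INTO docs (title) VALUES ('second')")
last_seen = 2  # an external process remembers the highest version it has seen

conn.execute("UPDATE docs SET title = 'first, edited' WHERE id = 1")
conn.execute("DELETE FROM docs WHERE id = 2")

changes = changes_since(conn, last_seen)
print(changes)
```

Because the version number is global across the table, an external process only has to remember a single integer between runs.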

I gave a talk at DjangoCon a few years ago called the denormalized query engine pattern, describing the challenge of syncing an external search index like Elasticsearch with data held in a relational database.

These chronicle tables can solve that problem, and can be applied to a whole host of other problems too. So far I'm thinking about the following:

Publishing SQLite databases up to Datasette, sending only the rows that have changed since the last sync. I wrote a prototype that does this and it seems to work very well.
Copying a table from Datasette Cloud to other places - a desktop copy, or another instance, or even into an alternative database such as PostgreSQL or MySQL, in a way that only copies and deletes rows that have changed.
Saved search alerts: run a SQL query against just rows that were modified since the last time that query ran, then send alerts if any rows are matched.
Showing users a note that "34 rows in this table have changed since your last visit", then displaying those rows.

I'm sure there are many more applications for this. I'm looking forward to finding out what they are!

sqlite-utils-move-tables 0.1 - 2023-09-01
sqlite-utils plugin adding a move-tables command

I needed to fix a bug in Datasette Cloud by moving a table from one database to another... so I built a little plugin for sqlite-utils that adds a sqlite-utils move-tables origin.db destination.db tablename command. I love being able to build single-use features as plugins like this.

And some TILs

Embedding paragraphs from my blog with E5-large-v2 - 2023-09-08

This was a fun TIL exercising the new embeddings feature in LLM. I used Django SQL Dashboard to break up my blog entries into paragraphs and exported those as CSV which could then be piped into llm embed-multi, then used that to build a CLI-driven semantic search engine for my blog.

Using llama-cpp-python grammars to generate JSON - 2023-09-13

llama-cpp has grammars now, which enable you to control the exact output format of the LLM. I'm optimistic that these could be used to implement an equivalent to OpenAI Functions on top of Llama 2 and similar models. So far I've just got them to output arrays of JSON objects.

Summarizing Hacker News discussion themes with Claude and LLM - 2023-09-09

I'm using this trick a lot at the moment. I have API access to Claude now, which has a 100,000 token context limit (GPT-4 is just 8,000 by default). That's enough to summarize 100+ comment threads from Hacker News, for which I'm now using this prompt:

Summarize the themes of the opinions expressed here, including quotes (with author attribution) where appropriate.

The quotes part has been working really well - it turns out summaries of themes with illustrative quotes are much more interesting, and so far my spot checks haven't found any that were hallucinated.

Trying out cr-sqlite on macOS - 2023-09-13

cr-sqlite adds full CRDTs to SQLite, which should enable multiple databases to accept writes independently and then seamlessly merge them together. It's a very exciting capability!

Running Datasette on Hugging Face Spaces - 2023-09-08

It turns out Hugging Face offer free scale-to-zero hosting for demos that run in Docker containers on machines with a full 16GB of RAM! I'm used to optimizing Datasette for tiny 256MB containers, so having this much memory available is a real treat.

And the rest:

Limited JSON API for Google searches using Programmable Search Engine - 2023-09-17
Running tests against multiple versions of a Python dependency in GitHub Actions - 2023-09-15
Remember to commit when using datasette.execute_write_fn() - 2023-08-31

TIL 2023-09-13 Trying out cr-sqlite on macOS:

cr-sqlite is fascinating. It's a loadable SQLite extension by Matt Wonlaw that "allows merging different SQLite databases together that have taken independent writes". …

TIL 2023-09-13 Using llama-cpp-python grammars to generate JSON:

llama.cpp recently added the ability to control the output of any model using a grammar. …

Link 2023-09-13 Simulating History with ChatGPT: Absolutely fascinating new entry in the using-ChatGPT-to-teach genre. Benjamin Breen teaches history at UC Santa Cruz, and has been developing a sophisticated approach to using ChatGPT to play out role-playing scenarios involving different periods of history. His students are challenged to participate in them, then pick them apart - fact-checking details from the scenario and building critiques of the perspectives demonstrated by the language model. There are so many quotable snippets in here, I recommend reading the whole thing.

Quote 2023-09-13

In the long term, I suspect that LLMs will have a significant positive impact on higher education. Specifically, I believe they will elevate the importance of the humanities. [...] LLMs are deeply, inherently textual. And they are reliant on text in a way that is directly linked to the skills and methods that we emphasize in university humanities classes.

Benjamin Breen

Link 2023-09-13 Some notes on Local-First Development: Local-First is the name that has been coined by the community of people who are interested in building apps where data is manipulated in a client application first (mobile, desktop or web) and then continually synchronized with a server, rather than the other way round. This is a really useful review by Kyle Mathews of how the space is shaping up so far - lots of interesting threads to follow here.

Link 2023-09-13 Introducing datasette-litestream: easy replication for SQLite databases in Datasette: We use Litestream on Datasette Cloud for streaming backups of user data to S3. Alex Garcia extracted out our implementation into a standalone Datasette plugin, which bundles the Litestream Go binary (for the relevant platform) in the package you get when you run "datasette install datasette-litestream" - so now Datasette has a very robust answer to questions about SQLite disaster recovery beyond just the Datasette Cloud platform.

Link 2023-09-14 CAISO Grid Status: CAISO is the California Independent System Operator, a non-profit managing 80% of California's electricity flow. This grid status page shows live data about the state of the grid and it's fascinating: right now (2pm local time) California is running 71.4% on renewables, having peaked at 80% three hours ago. The current fuel mix is 52% solar, 31% natural gas, 7% each large hydro and nuclear and 2% wind. The charts on this page show how solar turns off overnight and then picks up and peaks during daylight hours.

TIL 2023-09-15 Running tests against multiple versions of a Python dependency in GitHub Actions:

My datasette-export-notebook plugin worked fine in the stable release of Datasette, currently version 0.64.3, but failed in the Datasette 1.0 alphas. Here's the issue describing the problem. …

Link 2023-09-16 How CPython Implements and Uses Bloom Filters for String Processing: Fascinating dive into Python string internals by Abhinav Upadhyay. It turns out CPython uses very simple bloom filters in several parts of the core string methods, to solve problems like splitting on newlines where there are actually eight codepoints that could represent a newline. A tiny bloom filter can rule out most characters in a single operation, so all eight comparisons are only performed when that first check reports a possible match.
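Here's a small Python sketch of that style of single-word bloom filter. It is an illustration of the technique rather than a transcription of the C code in CPython: the 64-bit width, the function names and the exact list of eight line-break codepoints are assumptions made for this example.

```python
BLOOM_WIDTH = 64  # assumed width: one machine word of bits

# Eight codepoints treated as line breaks for this example
LINEBREAKS = "\n\r\x1c\x1d\x1e\x85\u2028\u2029"


def make_bloom_mask(chars):
    """Set one bit per character, chosen by the low bits of its codepoint."""
    mask = 0
    for ch in chars:
        mask |= 1 << (ord(ch) & (BLOOM_WIDTH - 1))
    return mask


LINEBREAK_MASK = make_bloom_mask(LINEBREAKS)


def is_linebreak(ch):
    # Fast path: if the bit is clear the character is definitely not a line break
    if not LINEBREAK_MASK & (1 << (ord(ch) & (BLOOM_WIDTH - 1))):
        return False
    # Slow path: the bit is set, which may be a false positive, so check properly
    return ch in LINEBREAKS


print(is_linebreak("\n"), is_linebreak("a"), is_linebreak("J"))
```

"a" is rejected by the mask alone, while "J" shares its low bits with "\n" and so falls through to the full comparison before being rejected.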

Link 2023-09-16 Notes on using a single-person Mastodon server: Julia Evans' experiences running a single-person Mastodon server (on masto.host - the same host I use for my own) pretty much exactly match what I've learned so far as well. The biggest disadvantage is the missing replies issue, where your server only shows replies to posts that come from people who you follow - so it's easy to reply to something in a way that duplicates other replies that are invisible to you.

TIL 2023-09-17 Limited JSON API for Google searches using Programmable Search Engine:

I figured out how to use a JSON API to run a very limited Google search today in a legit, non-screen-scraper way. …

Quote 2023-09-18

Note that there have been no breaking changes since the [SQLite] file format was designed in 2004. The changes shows in the version history above have all be one of (1) typo fixes, (2) clarifications, or (3) filling in the "reserved for future extensions" bits with descriptions of those extensions as they occurred.

D. Richard Hipp

Link 2023-09-19 LLM 0.11: I released LLM 0.11 with support for the new gpt-3.5-turbo-instruct completion model from OpenAI.

The most interesting feature of completion models is the option to request "log probabilities" from them, where each token returned is accompanied by up to 5 alternatives that were considered, along with their scores.

Link 2023-09-19 The WebAssembly Go Playground: Jeff Lindsay has a full Go 1.21.1 compiler running entirely in the browser.

Link 2023-09-23 TG: Polygon indexing: TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.

TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON - all with no additional dependencies.

The most interesting thing about it is the way it handles indexing. In this documentation Josh describes two approaches he uses to speeding up point-in-polygon and intersection using a novel approach that goes beyond the usual RTree implementation.

I think this could make the basis of a really useful SQLite extension - a lighter-weight alternative to SpatiaLite.

TIL 2023-09-23 Trying out the facebook/musicgen-small sound generation model:

Facebook's musicgen is a model that generates snippets of audio from a text description - it's effectively a Stable Diffusion for music. …

Link 2023-09-24 Should you give candidates feedback on their interview performance?: Jacob provides a characteristically nuanced answer to the question of whether you should provide feedback to candidates you have interviewed. He suggests offering the candidate the option to email asking for feedback early in the interview process to avoid feeling pushy later on, and proposes the phrase "you failed to demonstrate..." as a useful framing device.

Link 2023-09-25 A Hackers' Guide to Language Models: Jeremy Howard's new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you're an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset.

Quote 2023-09-25

We already know one major effect of AI on the skills distribution: AI acts as a skills leveler for a huge range of professional work. If you were in the bottom half of the skill distribution for writing, idea generation, analyses, or any of a number of other professional tasks, you will likely find that, with the help of AI, you have become quite good.

Ethan Mollick

Link 2023-09-25 Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg: Alex Garcia built sqlite-tg - a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.

Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all important tg_intersects() function for testing if a polygon or point overlap each other.

It's pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table.

Link 2023-09-25 Upsert in SQL: Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.

Useful tips in here on why "insert... on conflict" is usually a better option than "insert or replace into" because the latter can perform a delete and then an insert, firing triggers that you may not have wanted to be fired.
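The difference is easy to see in SQLite with Python's sqlite3 module. This is a minimal sketch with a made-up users table. It assumes a SQLite version with upsert support (3.24 or later), and it turns on the recursive_triggers pragma, since the SQLite documentation says delete triggers only fire for REPLACE conflict resolution when recursive triggers are enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA recursive_triggers = ON")
conn.executescript(
    """
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        visits INTEGER DEFAULT 0
    );
    CREATE TABLE log (event TEXT);

    -- Record every delete so we can see which statements trigger one
    CREATE TRIGGER users_deleted AFTER DELETE ON users BEGIN
        INSERT INTO log VALUES ('deleted ' || old.id);
    END;

    INSERT INTO users (id, name, visits) VALUES (1, 'Ann', 5);
    """
)

# Upsert: the existing row is updated in place, other columns are untouched
conn.execute(
    "INSERT INTO users (id, name) VALUES (1, 'Anna') "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name"
)
upsert_row = conn.execute("SELECT name, visits FROM users WHERE id = 1").fetchone()
upsert_log = conn.execute("SELECT count(*) FROM log").fetchone()[0]

# Replace: the existing row is deleted, then a fresh row is inserted
conn.execute("INSERT OR REPLACE INTO users (id, name) VALUES (1, 'Annie')")
replace_row = conn.execute("SELECT name, visits FROM users WHERE id = 1").fetchone()
replace_log = [row[0] for row in conn.execute("SELECT event FROM log")]

print(upsert_row, upsert_log)
print(replace_row, replace_log)
```

After the upsert the visits count is still 5 and nothing has been logged. After the replace the visits column has fallen back to its default and the delete trigger has fired.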

Link 2023-09-26 Batch size one billion: SQLite insert speedups, from the useful to the absurd: Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table.

TIL 2023-09-26 Snapshot testing with Syrupy:

I'm a big fan of snapshot testing - writing tests where you compare the output of some function to a previously saved version, and can re-generate that version from scratch any time something changes. …

Link 2023-09-26 Rethinking the Luddites in the Age of A.I.: I've been staying way clear of comparisons to Luddites in conversations about the potential harmful impacts of modern AI tools, because it seemed to me like an offensive, unproductive cheap shot.

This article has shown me that the comparison is actually a lot more relevant - and sympathetic - than I had realized.

In a time before labor unions, the Luddites represented an early example of a worker movement that tried to stand up for their rights in the face of transformational, negative change to their specific way of life.

"Knitting machines known as lace frames allowed one employee to do the work of many without the skill set usually required" is a really striking parallel to what's starting to happen with a surprising array of modern professions already.

Quote 2023-09-27

The profusion of dubious A.I.-generated content resembles the badly made stockings of the nineteenth century. At the time of the Luddites, many hoped the subpar products would prove unacceptable to consumers or to the government. Instead, social norms adjusted.

Kyle Chayka

Link 2023-09-27 Optimizing for Taste: David Cramer's detailed explanation as to why his company Sentry mostly avoids A/B testing. David wrote this as an internal blog post originally, but is now sharing it with the world. I found myself nodding along vigorously as I read this - lots of astute observations here.

I particularly appreciated his closing note: "The strength of making a decision is making it. You can always make a new one later. Choose the obvious path forward, and if you don’t see one, find someone who does."

Link 2023-09-27 Finding Bathroom Faucets with Embeddings: Absolutely the coolest thing I've seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for faucet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for "faucets that look like this one", or even run searches for faucets that match "Gawdy" or "Bond Villain" or "Nintendo 64". Live demo included!

Link 2023-09-27 Google was accidentally leaking its Bard AI chats into public search results: I'm quoted in this piece about yesterday's Bard privacy bug: it turned out the share URL and "Let anyone with the link see what you've selected" feature wasn't correctly setting a noindex parameter, and so some shared conversations were being swept up by the Google search crawlers. Thankfully this was a mistake, not a deliberate design decision, and it should be fixed by now.

Quote 2023-09-28

Looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.

Andrej Karpathy

Link 2023-09-28 Getting started with the Datasette Cloud API: I wrote an introduction to the Datasette Cloud API for the company blog, with a tutorial showing how to use Python and GitHub Actions to import data from the Federal Register into a table in Datasette Cloud, then configure full-text search against it.

Simon Willison’s Newsletter
