Talking Large Language Models with Rooftop Ruby
Plus embeddings, more embeddings and Datasette Cloud
In this newsletter:
Talking Large Language Models with Rooftop Ruby
Weeknotes: Embeddings, more embeddings and Datasette Cloud
Plus 19 links and 5 quotations and 7 TILs
Talking Large Language Models with Rooftop Ruby - 2023-09-29
I'm on the latest episode of the Rooftop Ruby podcast with Collin Donnell and Joel Drapper, talking all things LLM.
Here's a full transcript of the episode, which I generated using Whisper and then tidied up manually (after failing to get a good editing job out of Claude and GPT-4). I've also provided a link from each section heading to jump to the relevant spot in the recording.
The topics we covered:
Since my last weeknotes, a flurry of activity. LLM has embeddings support now, and Datasette Cloud has driven some major improvements to the wider Datasette ecosystem.
Embeddings in LLM
LLM gained embedding support in version 0.9, and then got binary embedding support (for CLIP) in version 0.10. I wrote about those releases in detail in:
Embeddings are a fascinating tool. If you haven't got your head around them yet the first of my blog entries tries to explain why they are so interesting.
There's a lot more I want to build on top of embeddings - most notably, LLM (or Datasette, or likely a combination of the two) will be growing support for Retrieval Augmented Generation on top of the LLM embedding mechanism.
I always include a list of new releases in my weeknotes. This time I'm going to use those to illustrate the themes I've been working on.
The first group of releases relates to LLM and its embedding support. LLM 0.10 extended that support:
llm 0.10 - 2023-09-12
Access large language models from the command-line
Embedding models can now be built as LLM plugins. I've released two of those so far:
llm-sentence-transformers 0.1.2 - 2023-09-13
LLM plugin for embeddings using sentence-transformers
llm-clip 0.1 - 2023-09-12
Generate embeddings for images and text using CLIP with LLM
The CLIP one is particularly fun, because it genuinely allows you to build a sophisticated image search engine that runs entirely on your own computer!
symbex 1.4 - 2023-09-05
Find the Python code for specified symbols
Symbex is my tool for extracting symbols - functions, methods and classes - from Python code. I introduced that in Symbex: search Python code for functions and classes, then pipe them into a LLM.
Symbex 1.4 adds a tiny but impactful feature: it can now output a list of symbols as JSON, CSV or TSV. These output formats are designed to be compatible with the new llm embed-multi command, which means you can easily create embeddings for all of your functions:
symbex '*' '*:*' --nl | \
llm embed-multi symbols - \
--format nl --database embeddings.db --store
I haven't fully explored what this enables yet, but it should mean that both related functions and semantic function search ("Find me a function that downloads a CSV") are now easy to build.
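As a rough illustration of what semantic function search comes down to once embeddings exist, here's a minimal sketch that ranks stored vectors against a query vector using cosine similarity. The function names and three-dimensional vectors are invented for the example; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, one per function name
symbol_vectors = {
    "download_csv": [0.9, 0.1, 0.0],
    "parse_json": [0.2, 0.8, 0.1],
    "render_template": [0.0, 0.2, 0.9],
}


def most_similar(query_vector, vectors, n=2):
    # Score every stored vector, then return the names of the top n
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in vectors.items()
    ]
    return [name for score, name in sorted(scored, reverse=True)[:n]]


# Pretend this is the embedding of "a function that downloads a CSV"
query = [0.8, 0.2, 0.1]
print(most_similar(query, symbol_vectors))
```

Finding related functions is the same operation, using the stored vector for one function as the query.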
llm-cluster 0.2 - 2023-09-04
LLM plugin for clustering embeddings
Yet another thing you can do with embeddings is use them to find clusters of related items.
The neatest feature of llm-cluster is that you can ask it to generate names for these clusters by sending the names of the items in each cluster through another language model, something like this:
llm cluster issues 10 \
-d issues.db \
--prompt 'Short, concise title for this cluster of related documents'
One last embedding related project:
datasette-llm-embed is a tiny plugin that adds a select llm_embed('sentence-transformers/all-mpnet-base-v2', 'This is some text') SQL function. I built it to support quickly prototyping embedding-related ideas in Datasette.
datasette-llm-embed 0.1a0 - 2023-09-08
Datasette plugin adding a llm_embed(model_id, text) SQL function
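To show the general mechanism behind a SQL function like this, here's a minimal sketch using Python's built-in sqlite3 module. It is not the plugin's code: toy_embed is a hypothetical stand-in for a real embedding model, and packing the vector as little-endian 32-bit floats is an assumption made for the example.

```python
import sqlite3
import struct


def toy_embed(model_id, text):
    # Hypothetical stand-in for a real model: three numbers derived from the inputs
    vector = [float(len(text)), float(text.count(" ")), float(len(model_id))]
    # Pack as little-endian 32-bit floats so SQLite receives a compact BLOB
    return struct.pack("<" + "f" * len(vector), *vector)


conn = sqlite3.connect(":memory:")
# Register the Python function as a two-argument SQL function
conn.create_function("llm_embed", 2, toy_embed)

blob = conn.execute(
    "select llm_embed('toy-model', 'This is some text')"
).fetchone()[0]

# Decode the BLOB back into a tuple of floats
vector = struct.unpack("<" + "f" * (len(blob) // 4), blob)
print(vector)
```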
Spending time with embedding models has led me to spend more time with Hugging Face. I realized last week that the Hugging Face all models sorted by downloads page doubles as a list of the models that are most likely to be easy to use.
One of the models I tried out was Salesforce BLIP, an astonishing model that can genuinely produce usable captions for images.
It's really easy to work with. I ended up building this tiny little CLI tool that wraps the model:
blip-caption 0.1 - 2023-09-10
Generate captions for images with Salesforce BLIP
Releases driven by Datasette Cloud
Datasette Cloud continues to drive improvements to the wider Datasette ecosystem.
It runs on the latest Datasette 1.0 alpha series, taking advantage of the JSON write API.
This also means that it's been highlighting breaking changes in 1.0 that have caused old plugins to break, either subtly or completely.
This has driven a bunch of new plugin releases. Some of these are compatible with both 0.x and 1.x - the ones that only work with the 1.x alphas are themselves marked as alpha releases.
datasette-export-notebook 1.0.1 - 2023-09-15
Datasette plugin providing instructions for exporting data to Jupyter or Observable
datasette-cluster-map 0.18a0 - 2023-09-11
Datasette plugin that shows a map for any data with latitude/longitude columns
datasette-graphql 3.0a0 - 2023-09-07
Datasette plugin providing an automatic GraphQL API for your SQLite databases
Datasette Cloud's API works using database-backed access tokens, to ensure users can revoke tokens if they need to (something that's not easily done with purely signed tokens) and that each token can record when it was most recently used.
I've been building that into the existing datasette-auth-tokens plugin:
datasette-auth-tokens 0.4a3 - 2023-08-31
Datasette plugin for authenticating access using API tokens
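Here's a minimal sketch of the database-backed token idea: tokens live in a table, so they can be revoked and can record when they were last used. The schema and function names are hypothetical and are not taken from datasette-auth-tokens.

```python
import hashlib
import secrets
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical schema: store a hash of each token, never the token itself
conn.execute(
    """
    create table tokens (
        id integer primary key,
        token_hash text unique,
        actor_id text,
        revoked integer default 0,
        last_used integer
    )
    """
)


def hash_token(token):
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def create_token(actor_id):
    token = secrets.token_urlsafe(32)
    conn.execute(
        "insert into tokens (token_hash, actor_id) values (?, ?)",
        (hash_token(token), actor_id),
    )
    return token


def actor_for_token(token):
    # Only tokens that exist and have not been revoked are accepted
    row = conn.execute(
        "select id, actor_id from tokens where token_hash = ? and revoked = 0",
        (hash_token(token),),
    ).fetchone()
    if row is None:
        return None
    # Record when the token was most recently used
    conn.execute(
        "update tokens set last_used = ? where id = ?", (int(time.time()), row[0])
    )
    return row[1]


def revoke_token(token):
    conn.execute(
        "update tokens set revoked = 1 where token_hash = ?", (hash_token(token),)
    )


token = create_token("user-1")
print(actor_for_token(token))
revoke_token(token)
print(actor_for_token(token))
```

A purely signed token has no row to update, which is why revoking one is harder.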
We're beginning to build out social features for Datasette Cloud - features that will help teams privately collaborate on data investigations.
Alex has been building datasette-short-links as an experimental link shortener. In building that, we realized that we needed a mechanism for resolving actor IDs displayed in a list (e.g. this link created by X) to their actual names.
Datasette doesn't dictate the shape of actor representations, and there's no guarantee that actors would be represented in a predictable table.
So... we needed a new plugin hook. I released Datasette 1.0a6 with a new hook, actors_from_ids(actor_ids), which can be used to answer the question "who are the actors represented by these IDs".
Alex is using this in datasette-short-links, and I built two plugins to work with the new hook as well:
datasette 1.0a6 - 2023-09-08
An open source multi-tool for exploring and publishing data
datasette-debug-actors-from-ids 0.1a1 - 2023-09-08
Datasette plugin for trying out the actors_from_ids hook
datasette-remote-actors 0.1a1 - 2023-09-08
Datasette plugin for fetching details of actors from a remote endpoint
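Here's a sketch of the shape an implementation of this hook might take. It is written as a plain function so it runs without Datasette installed; in a real plugin it would be registered with Datasette's hookimpl decorator. The ACTORS dictionary is hypothetical, and falling back to a dictionary containing just the ID is a choice made for this example.

```python
# Hypothetical in-memory directory of actors; a real plugin might query a table
ACTORS = {
    "1": {"id": "1", "name": "Alex"},
    "2": {"id": "2", "name": "Simon"},
}


# In a real plugin this function would carry Datasette's @hookimpl decorator
def actors_from_ids(datasette, actor_ids):
    # Map every requested ID to an actor dictionary,
    # falling back to a bare {"id": ...} when the ID is unknown
    return {
        actor_id: ACTORS.get(str(actor_id), {"id": actor_id})
        for actor_id in actor_ids
    }


print(actors_from_ids(None, ["1", "3"]))
```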
This inspired me to finally put out a fresh release of datasette-edit-schema - the plugin which provides the ability to edit table schemas - adding and removing columns, changing column types, even altering the order columns are stored in the table.
datasette-edit-schema 0.6 is a major release, with three significant new features:
You can now create a brand new table from scratch!
You can edit the table's primary key
You can modify the foreign key constraints on the table
Those last two became important when I realized that Datasette's API is much more interesting if there are foreign key relationships to follow.
Combine that with datasette-write-ui and Datasette Cloud now has a full set of features for building, populating and editing tables - backed by a comprehensive JSON API.
sqlite-migrate 0.1a2 - 2023-09-03
A simple database migration system for SQLite, based on sqlite-utils
sqlite-migrate is still marked as an alpha, but won't be for much longer: it's my attempt at a migration system for SQLite, inspired by Django migrations but with a less sophisticated set of features.
I'm using it in LLM now to manage the schema used to store embeddings, and it's beginning to show up in some Datasette plugins as well. I'll be promoting this to non-alpha status pretty soon.
sqlite-utils 3.35.1 - 2023-09-09
Python CLI utility and library for manipulating SQLite databases
A tiny fix in this, which with hindsight was less impactful than I thought.
I spotted a bug on Datasette Cloud when I configured full-text search on a column, then edited the schema and found that searches no longer returned the correct results.
It turned out the rowid column in SQLite was being rewritten by calls to the sqlite-utils table.transform() method. FTS records are related to their underlying row by rowid, so this was breaking search!
I pushed out a fix for this in 3.35.1. But then... I learned that rowid values in SQLite have always been unstable - they can be rewritten any time someone VACUUMs a table that lacks an explicit INTEGER PRIMARY KEY!
I've been designing future features for Datasette that assume that rowid is a useful stable identifier for a row. This clearly isn't going to work! I'm still thinking through the consequences of it, but I think there may be Datasette features (like the ability to comment on a row) that will only work for tables with a proper primary key.
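A small experiment makes the rowid issue concrete. This sketch builds two hypothetical tables, one relying on the implicit rowid and one with an explicit integer primary key, deletes the first row from each, then runs VACUUM and compares the identifiers. SQLite's documentation only says VACUUM may change rowids for tables without an explicit INTEGER PRIMARY KEY, so the exact result for the first table can vary.

```python
import os
import sqlite3
import tempfile

TABLES = ("implicit_rowid", "explicit_pk")

with tempfile.TemporaryDirectory() as tmp:
    # isolation_level=None means autocommit: VACUUM cannot run inside a transaction
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"), isolation_level=None)
    conn.execute("create table implicit_rowid (body text)")
    conn.execute("create table explicit_pk (id integer primary key, body text)")
    for table in TABLES:
        conn.executemany(
            f"insert into {table} (body) values (?)", [("a",), ("b",), ("c",)]
        )
        # Delete the first row so the remaining rows have identifiers 2 and 3
        conn.execute(f"delete from {table} where rowid = 1")

    def rowids(table):
        return [
            row[0]
            for row in conn.execute(f"select rowid from {table} order by rowid")
        ]

    before = {table: rowids(table) for table in TABLES}
    conn.execute("vacuum")
    after = {table: rowids(table) for table in TABLES}
    conn.close()

print("before:", before)
print("after: ", after)
```

The explicit_pk table keeps its identifiers across the VACUUM, which is the property features like row comments would need.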
sqlite-chronicle 0.1 - 2023-09-11
Use triggers to track when rows in a SQLite table were updated or deleted
This is very early, but I'm excited about the direction it's going in.
I keep on finding problems where I want to be able to synchronize various processes with the data in a table.
I built sqlite-history a few months ago, which uses SQLite triggers to create a full copy of the updated data every time a row in a table is edited.
That's a pretty heavy-weight solution. What if there was something lighter that could achieve a lot of the same goals?
sqlite-chronicle uses triggers to instead create what I'm calling a "chronicle table". This is a shadow table that records, for every row in the main table, four integer values:
added_ms - the timestamp in milliseconds when the row was added
updated_ms - the timestamp in milliseconds when the row was last updated
version - a constantly incrementing version number, global across the entire table
deleted - set to 1 if the row has been deleted
Just storing four integers (plus copies of the primary key) makes this a pretty tiny table, and hopefully one that's cheap to update via triggers.
But... having this table enables some pretty interesting things - because external processes can track the last version number that they saw and use it to see just which rows have been inserted and updated since that point.
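Here's a minimal sketch of the chronicle table idea using Python's sqlite3 module. It is not the sqlite-chronicle implementation: the table names, trigger names and SQL are assumptions made for illustration, and it does not handle re-inserting a previously deleted primary key.

```python
import sqlite3

# SQL fragments reused by each trigger
NOW_MS = "cast((julianday('now') - 2440587.5) * 86400000 as integer)"
NEXT_VERSION = "(select coalesce(max(version), 0) + 1 from docs_chronicle)"

conn = sqlite3.connect(":memory:")
conn.executescript(f"""
create table docs (id integer primary key, title text);

-- Shadow table: one row of integers per row in docs
create table docs_chronicle (
    id integer primary key,
    added_ms integer,
    updated_ms integer,
    version integer,
    deleted integer default 0
);

create trigger docs_ai after insert on docs begin
    insert into docs_chronicle (id, added_ms, updated_ms, version)
    values (new.id, {NOW_MS}, {NOW_MS}, {NEXT_VERSION});
end;

create trigger docs_au after update on docs begin
    update docs_chronicle
    set updated_ms = {NOW_MS}, version = {NEXT_VERSION}
    where id = old.id;
end;

create trigger docs_ad after delete on docs begin
    update docs_chronicle
    set updated_ms = {NOW_MS}, version = {NEXT_VERSION}, deleted = 1
    where id = old.id;
end;
""")


def changes_since(version):
    # Everything inserted, updated or deleted after the given version
    return conn.execute(
        "select id, version, deleted from docs_chronicle"
        " where version > ? order by version",
        (version,),
    ).fetchall()


conn.execute("insert into docs (id, title) values (1, 'One')")
conn.execute("insert into docs (id, title) values (2, 'Two')")
# An external process remembers the highest version it has seen
last_seen = conn.execute("select max(version) from docs_chronicle").fetchone()[0]

conn.execute("update docs set title = 'One, edited' where id = 1")
conn.execute("insert into docs (id, title) values (3, 'Three')")
conn.execute("delete from docs where id = 2")

print(changes_since(last_seen))
```

The external process only has to store one integer, last_seen, to pick up where it left off.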
I gave a talk at DjangoCon a few years ago called the denormalized query engine pattern, describing the challenge of syncing an external search index like Elasticsearch with data held in a relational database.
These chronicle tables can solve that problem, and can be applied to a whole host of other problems too. So far I'm thinking about the following:
Publishing SQLite databases up to Datasette, sending only the rows that have changed since the last sync. I wrote a prototype that does this and it seems to work very well.
Copying a table from Datasette Cloud to other places - a desktop copy, or another instance, or even into an alternative database such as PostgreSQL or MySQL, in a way that only copies and deletes rows that have changed.
Saved search alerts: run a SQL query against just rows that were modified since the last time that query ran, then send alerts if any rows are matched.
Showing users a note that "34 rows in this table have changed since your last visit", then displaying those rows.
I'm sure there are many more applications for this. I'm looking forward to finding out what they are!
sqlite-utils-move-tables 0.1 - 2023-09-01
sqlite-utils plugin adding a move-tables command
I needed to fix a bug in Datasette Cloud by moving a table from one database to another... so I built a little plugin for sqlite-utils that adds a sqlite-utils move-tables origin.db destination.db tablename command. I love being able to build single-use features as plugins like this.
And some TILs
Embedding paragraphs from my blog with E5-large-v2 - 2023-09-08
This was a fun TIL exercising the new embeddings feature in LLM. I used Django SQL Dashboard to break up my blog entries into paragraphs and exported those as CSV which could then be piped into llm embed-multi, then used that to build a CLI-driven semantic search engine for my blog.
Using llama-cpp-python grammars to generate JSON - 2023-09-13
llama-cpp has grammars now, which enable you to control the exact output format of the LLM. I'm optimistic that these could be used to implement an equivalent to OpenAI Functions on top of Llama 2 and similar models. So far I've just got them to output arrays of JSON objects.
I'm using this trick a lot at the moment. I have API access to Claude now, which has a 100,000 token context limit (GPT-4 is just 8,000 by default). That's enough to summarize 100+ comment threads from Hacker News, for which I'm now using this prompt:
Summarize the themes of the opinions expressed here, including quotes (with author attribution) where appropriate.
The quotes part has been working really well - it turns out summaries of themes with illustrative quotes are much more interesting, and so far my spot checks haven't found any that were hallucinated.
Trying out cr-sqlite on macOS - 2023-09-13
cr-sqlite adds full CRDTs to SQLite, which should enable multiple databases to accept writes independently and then seamlessly merge them together. It's a very exciting capability!
Running Datasette on Hugging Face Spaces - 2023-09-08
It turns out Hugging Face offers free scale-to-zero hosting for demos that run in Docker containers on machines with a full 16GB of RAM! I'm used to optimizing Datasette for tiny 256MB containers, so having this much memory available is a real treat.
And the rest:
TIL 2023-09-13 Trying out cr-sqlite on macOS:
cr-sqlite is fascinating. It's a loadable SQLite extension by Matt Wonlaw that "allows merging different SQLite databases together that have taken independent writes". …
TIL 2023-09-13 Using llama-cpp-python grammars to generate JSON:
llama.cpp recently added the ability to control the output of any model using a grammar. …
Link 2023-09-13 Simulating History with ChatGPT: Absolutely fascinating new entry in the using-ChatGPT-to-teach genre. Benjamin Breen teaches history at UC Santa Cruz, and has been developing a sophisticated approach to using ChatGPT to play out role-playing scenarios involving different periods of history. His students are challenged to participate in them, then pick them apart - fact-checking details from the scenario and building critiques of the perspectives demonstrated by the language model. There are so many quotable snippets in here, I recommend reading the whole thing.
In the long term, I suspect that LLMs will have a significant positive impact on higher education. Specifically, I believe they will elevate the importance of the humanities. [...] LLMs are deeply, inherently textual. And they are reliant on text in a way that is directly linked to the skills and methods that we emphasize in university humanities classes.
Link 2023-09-13 Some notes on Local-First Development: Local-First is the name that has been coined by the community of people who are interested in building apps where data is manipulated in a client application first (mobile, desktop or web) and then continually synchronized with a server, rather than the other way round. This is a really useful review by Kyle Mathews of how the space is shaping up so far - lots of interesting threads to follow here.
Link 2023-09-13 Introducing datasette-litestream: easy replication for SQLite databases in Datasette: We use Litestream on Datasette Cloud for streaming backups of user data to S3. Alex Garcia extracted out our implementation into a standalone Datasette plugin, which bundles the Litestream Go binary (for the relevant platform) in the package you get when you run "datasette install datasette-litestream" - so now Datasette has a very robust answer to questions about SQLite disaster recovery beyond just the Datasette Cloud platform.
Link 2023-09-14 CAISO Grid Status: CAISO is the California Independent System Operator, a non-profit managing 80% of California's electricity flow. This grid status page shows live data about the state of the grid and it's fascinating: right now (2pm local time) California is running 71.4% on renewables, having peaked at 80% three hours ago. The current fuel mix is 52% solar, 31% natural gas, 7% each large hydro and nuclear and 2% wind. The charts on this page show how solar turns off overnight and then picks up and peaks during daylight hours.
Link 2023-09-16 How CPython Implements and Uses Bloom Filters for String Processing: Fascinating dive into Python string internals by Abhinav Upadhyay. It turns out CPython uses very simple bloom filters in several parts of the core string methods, to solve problems like splitting on newlines where there are actually eight codepoints that could represent a newline. A tiny bloom filter can rule out most characters in a single operation, so all eight comparisons are only performed when that first check reports a possible match.
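The technique can be sketched in a few lines of Python. This is an illustration rather than CPython's C code: the 64-bit width and the list of eight line break codepoints are assumptions based on my reading of how the technique is described.

```python
# Eight codepoints treated as line breaks (assumed list for this sketch)
LINEBREAKS = "\n\r\x1c\x1d\x1e\x85\u2028\u2029"
BLOOM_WIDTH = 64


def bloom_bit(ch):
    # Use the low bits of the codepoint to pick one bit in a 64-bit mask
    return 1 << (ord(ch) & (BLOOM_WIDTH - 1))


# Build the filter once by setting one bit per line break character
LINEBREAK_MASK = 0
for ch in LINEBREAKS:
    LINEBREAK_MASK |= bloom_bit(ch)


def is_linebreak(ch):
    # Fast path: a single bitwise test rules out most characters
    if not (LINEBREAK_MASK & bloom_bit(ch)):
        return False
    # The bit was set, which only means "possibly": confirm with exact comparisons
    return ch in LINEBREAKS


print(is_linebreak("\n"), is_linebreak("a"), is_linebreak("J"))
```

"J" is a false positive for the filter because it shares its low six bits with "\n", so it falls through to the exact comparison, while "a" is rejected by the bitwise test alone.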
Link 2023-09-16 Notes on using a single-person Mastodon server: Julia Evans' experiences running a single-person Mastodon server (on masto.host - the same host I use for my own) pretty much exactly match what I've learned so far as well. The biggest disadvantage is the missing replies issue, where your server only shows replies to posts that come from people who you follow - so it's easy to reply to something in a way that duplicates other replies that are invisible to you.
I figured out how to use a JSON API to run a very limited Google search today in a legit, non-screen-scraper way. …
Note that there have been no breaking changes since the [SQLite] file format was designed in 2004. The changes shows in the version history above have all be one of (1) typo fixes, (2) clarifications, or (3) filling in the "reserved for future extensions" bits with descriptions of those extensions as they occurred.
Link 2023-09-19 LLM 0.11: I released LLM 0.11 with support for the new gpt-3.5-turbo-instruct completion model from OpenAI.
The most interesting feature of completion models is the option to request "log probabilities" from them, where each token returned is accompanied by up to 5 alternatives that were considered, along with their scores.
Link 2023-09-19 The WebAssembly Go Playground: Jeff Lindsay has a full Go 1.21.1 compiler running entirely in the browser.
Link 2023-09-23 TG: Polygon indexing: TG is a brand new geospatial library by Josh Baker, author of the Tile38 in-memory spatial server (kind of a geospatial Redis). TG is written in pure C and delivered as a single C file, reminiscent of the SQLite amalgamation.
TG looks really interesting. It implements almost the exact subset of geospatial functionality that I find most useful: point-in-polygon, intersect, WKT, WKB, and GeoJSON - all with no additional dependencies.
The most interesting thing about it is the way it handles indexing. In this documentation Josh describes two novel approaches he uses to speed up point-in-polygon and intersection, going beyond the usual RTree implementation.
I think this could make the basis of a really useful SQLite extension - a lighter-weight alternative to SpatiaLite.
TIL 2023-09-23 Trying out the facebook/musicgen-small sound generation model:
Facebook's musicgen is a model that generates snippets of audio from a text description - it's effectively a Stable Diffusion for music. …
Link 2023-09-24 Should you give candidates feedback on their interview performance?: Jacob provides a characteristically nuanced answer to the question of whether you should provide feedback to candidates you have interviewed. He suggests offering the candidate the option to email asking for feedback early in the interview process to avoid feeling pushy later on, and proposes the phrase "you failed to demonstrate..." as a useful framing device.
Link 2023-09-25 A Hackers' Guide to Language Models: Jeremy Howard's new 1.5 hour YouTube introduction to language models looks like a really useful place to catch up if you're an experienced Python programmer looking to start experimenting with LLMs. He covers what they are and how they work, then shows how to build against the OpenAI API, build a Code Interpreter clone using OpenAI functions, run models from Hugging Face on your own machine (with NVIDIA cards or on a Mac) and finishes with a demo of fine-tuning a Llama 2 model to perform text-to-SQL using an open dataset.
We already know one major effect of AI on the skills distribution: AI acts as a skills leveler for a huge range of professional work. If you were in the bottom half of the skill distribution for writing, idea generation, analyses, or any of a number of other professional tasks, you will likely find that, with the help of AI, you have become quite good.
Link 2023-09-25 Geospatial SQL queries in SQLite using TG, sqlite-tg and datasette-sqlite-tg: Alex Garcia built sqlite-tg - a SQLite extension that uses the brand new TG geospatial library to provide a whole suite of custom SQL functions for working with geospatial data.
Here are my notes on trying out his initial alpha releases. The extension already provides tools for converting between GeoJSON, WKT and WKB, plus the all important tg_intersects() function for testing if a polygon and a point overlap each other.
It's pretty useful already. Without any geospatial indexing at all I was still able to get 700ms replies to a brute-force point-in-polygon query against 150MB of GeoJSON timezone boundaries stored as JSON text in a table.
Link 2023-09-25 Upsert in SQL: Anton Zhiyanov is currently on a one-man quest to write detailed documentation for all of the fundamental SQL operations, comparing and contrasting how they work across multiple engines, generally with interactive examples.
Useful tips in here on why "insert... on conflict" is usually a better option than "insert or replace into" because the latter can perform a delete and then an insert, firing triggers that you may not have wanted to be fired.
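The difference is easy to see in SQLite using Python's sqlite3 module. This is a sketch with a hypothetical pages table. Two details are specific to SQLite: delete triggers only fire for "insert or replace" when recursive triggers are enabled, and the "on conflict" upsert syntax needs SQLite 3.24 or later.

```python
import sqlite3


def run(sql):
    # Fresh database each time: one existing row plus triggers that log events
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    pragma recursive_triggers = on;
    create table pages (
        id integer primary key,
        title text,
        views integer default 0
    );
    create table log (event text);
    create trigger pages_ad after delete on pages begin
        insert into log values ('delete');
    end;
    create trigger pages_au after update on pages begin
        insert into log values ('update');
    end;
    insert into pages (id, title, views) values (1, 'Old title', 5);
    """)
    conn.execute(sql)
    row = conn.execute("select title, views from pages where id = 1").fetchone()
    events = [r[0] for r in conn.execute("select event from log")]
    conn.close()
    return row, events


# Deletes the old row and inserts a new one: views falls back to its default
replace_row, replace_events = run(
    "insert or replace into pages (id, title) values (1, 'New title')"
)

# Updates the existing row in place: views is preserved
upsert_row, upsert_events = run(
    "insert into pages (id, title) values (1, 'New title') "
    "on conflict(id) do update set title = excluded.title"
)

print(replace_row, replace_events)
print(upsert_row, upsert_events)
```

Besides firing the delete trigger, the replace version silently resets any column that the statement did not mention.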
Link 2023-09-26 Batch size one billion: SQLite insert speedups, from the useful to the absurd: Useful, detailed review of ways to maximize the performance of inserting a billion integers into a SQLite database table.
TIL 2023-09-26 Snapshot testing with Syrupy:
I'm a big fan of snapshot testing - writing tests where you compare the output of some function to a previously saved version, and can re-generate that version from scratch any time something changes. …
Link 2023-09-26 Rethinking the Luddites in the Age of A.I.: I've been staying way clear of comparisons to Luddites in conversations about the potential harmful impacts of modern AI tools, because it seemed to me like an offensive, unproductive cheap shot.
This article has shown me that the comparison is actually a lot more relevant - and sympathetic - than I had realized.
In a time before labor unions, the Luddites represented an early example of a worker movement that tried to stand up for their rights in the face of transformational, negative change to their specific way of life.
"Knitting machines known as lace frames allowed one employee to do the work of many without the skill set usually required" is a really striking parallel to what's starting to happen with a surprising array of modern professions already.
The profusion of dubious A.I.-generated content resembles the badly made stockings of the nineteenth century. At the time of the Luddites, many hoped the subpar products would prove unacceptable to consumers or to the government. Instead, social norms adjusted.
Link 2023-09-27 Optimizing for Taste: David Cramer's detailed explanation as to why his company Sentry mostly avoids A/B testing. David wrote this as an internal blog post originally, but is now sharing it with the world. I found myself nodding along vigorously as I read this - lots of astute observations here.
I particularly appreciated his closing note: "The strength of making a decision is making it. You can always make a new one later. Choose the obvious path forward, and if you don’t see one, find someone who does."
Link 2023-09-27 Finding Bathroom Faucets with Embeddings: Absolutely the coolest thing I've seen someone build on top of my LLM tool so far: Drew Breunig is renovating a bathroom and needed a way to filter through literally thousands of options for faucet taps. He scraped 20,000 images of fixtures from a plumbing supply site and used LLM to embed every one of them via CLIP... and now he can ask for "faucets that look like this one", or even run searches for faucets that match "Gawdy" or "Bond Villain" or "Nintendo 64". Live demo included!
Link 2023-09-27 Google was accidentally leaking its Bard AI chats into public search results: I'm quoted in this piece about yesterday's Bard privacy bug: it turned out the share URL and "Let anyone with the link see what you've selected" feature wasn't correctly setting a noindex parameter, and so some shared conversations were being swept up by the Google search crawlers. Thankfully this was a mistake, not a deliberate design decision, and it should be fixed by now.
Looking at LLMs as chatbots is the same as looking at early computers as calculators. We're seeing an emergence of a whole new computing paradigm, and it is very early.
Link 2023-09-28 Getting started with the Datasette Cloud API: I wrote an introduction to the Datasette Cloud API for the company blog, with a tutorial showing how to use Python and GitHub Actions to import data from the Federal Register into a table in Datasette Cloud, then configure full-text search against it.