Datasette Enrichments: a new plugin framework for augmenting your data
Plus links, quotes, weeknotes and more
In this newsletter:
Datasette Enrichments: a new plugin framework for augmenting your data
Weeknotes: datasette-enrichments, datasette-comments, sqlite-chronicle
Plus 16 links, 8 quotations and 1 TIL
Datasette Enrichments: a new plugin framework for augmenting your data - 2023-12-01
Today I'm releasing datasette-enrichments, a new feature for Datasette which provides a framework for applying "enrichments" that can augment your data.
An enrichment is code that can be run against rows in a database table. That code can transform existing data or fetch additional data from external sources, then write that augmented data back to the database.
A good example of an enrichment is geocoding: take a table with an address column, run each address through a geocoding API, then write the resulting location back to latitude and longitude columns on the same table.
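To make that concrete, here's a hand-rolled sketch of the same idea in plain Python - not the enrichment plugin itself, and the table, column and API key names are all hypothetical:

import httpx
import sqlite_utils

# Hypothetical "locations" table with "id" and "address" columns
db = sqlite_utils.Database("places.db")
for row in db["locations"].rows:
    response = httpx.get(
        "https://api.opencagedata.com/geocode/v1/json",
        params={"q": row["address"], "key": "YOUR_API_KEY"},  # placeholder key
    )
    results = response.json()["results"]
    if results:
        geometry = results[0]["geometry"]
        db["locations"].update(
            row["id"],
            {"latitude": geometry["lat"], "longitude": geometry["lng"]},
            alter=True,  # add latitude/longitude columns if they don't exist yet
        )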
Each enrichment is itself a plugin. The Datasette enrichments system is designed to be easily extended with new enrichment types, to serve a wide variety of use-cases.
Demonstrating enrichments
I've made a video demo to demonstrate the new capabilities introduced by this plugin.
The video shows off two enrichments: datasette-enrichments-gpt for running prompts against OpenAI's GPT language models, and datasette-enrichments-opencage for geocoding addresses.
In the video I demonstrate the following:
Uploading a CSV file of Film Locations in San Francisco to create a table
Running the OpenCage geocoder enrichment against those rows to populate latitude and longitude columns... which results in a map being displayed on the table page using datasette-cluster-map
Applying the GPT enrichment to write terrible haikus about every museum on my Niche Museums website
Extracting JSON with key people and dates from each museum's description
Using the GPT-4 Vision API to generate detailed descriptions of photographs displayed on the site
Enrichments so far
I'm releasing four enrichment plugins today: datasette-enrichments-gpt, datasette-enrichments-opencage, datasette-enrichments-jinja and datasette-enrichments-re2.
I've also published documentation on developing a new enrichment.
datasette-enrichments-gpt
The most interesting enrichment I'm releasing today is datasette-enrichments-gpt. This enrichment provides access to various OpenAI language models, allowing you to do some really interesting things:
Execute a prompt against data pulled from columns in each row of a table and store the result
Run prompts against URLs to images using the GPT-4 Vision API
Extract structured data from text
I demonstrated all three of these in the video. Here's how I used JSON object mode to extract JSON structured data for people and years from the museum descriptions, using this prompt:
Return JSON: {"people": [...], "years": [...]}
Each person should be {"name": "...", "bio": "One line bio"}
Each year should be {"year": 1893, "description": "What happened in that year"}
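Mechanically, that relies on OpenAI's JSON mode. Here's a minimal sketch using the OpenAI Python client - this isn't the plugin's own code, and the model choice and input wiring are assumptions:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
museum_description = "Founded in 1893 by ..."  # placeholder input text
response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # any JSON-mode capable model
    response_format={"type": "json_object"},  # forces valid JSON output
    messages=[
        {"role": "system", "content": 'Return JSON: {"people": [...], "years": [...]}'},
        {"role": "user", "content": museum_description},
    ],
)
print(response.choices[0].message.content)  # a JSON string to parse and store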
I also ran GPT-4 Vision against images, with the prompt "describe this photo". Here's the description it gave for this photograph from the Bigfoot Discovery Museum:
In the photo, we see an elderly man with a full white beard and glasses, wearing a cap and a blue denim shirt, seated behind a cluttered desk. The desk is strewn with various items including papers, books, and what appears to be works of art or prints. The man seems engaged in conversation or explaining something, mid-gesture with his right hand.
The backdrop is a room filled with bookshelves brimming with books and some items that look like filing organizers, hinting at a vast collection. The shelves are densely packed, giving the space a cozy and somewhat cluttered appearance, likely a reflection of intellectual activity and a personal workspace. Various other items such as a poster and possibly personal memorabilia can be seen on the walls adding to the character of the room.
Overall, the image portrays a scholarly or artistic atmosphere, suggesting that the man could be a collector, a bookstore owner, an academic, or an artist.
datasette-enrichments-opencage
datasette-enrichments-opencage provides access to the OpenCage geocoder.
I really like OpenCage. Many geocoders have strict restrictions on what you can do with the data they return - some of them even prohibit storing the results long-term in a database!
OpenCage avoid this by carefully building on top of open data, and they also financially support some of the open data projects they rely on.
Both this plugin and datasette-enrichments-gpt implement a pattern where you can configure an API key using plugin secrets; if you don't, the key will be requested from you each time you run an enrichment.
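With Datasette's plugin configuration that pattern looks something like this - an illustrative metadata.json fragment; check each plugin's README for the exact setting name:

{
    "plugins": {
        "datasette-enrichments-opencage": {
            "api_key": {"$env": "OPENCAGE_API_KEY"}
        }
    }
}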
datasette-enrichments-jinja
I wanted to launch with an example of an enrichment that can execute arbitrary code against each row in a table.
Running code in a sandbox in Python is notoriously difficult. I decided to use the Jinja sandbox, which isn't completely secure against malicious attackers but should be good enough to ensure trustworthy users don't accidentally cause too much damage.
datasette-enrichments-jinja can execute a Jinja template against each row in a table and store the result.
It's a small but powerful template language, and should prove useful for a number of data manipulation tasks.
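The core of the idea is just Jinja's SandboxedEnvironment. A minimal sketch - not the plugin's actual code, and the example row is made up:

from jinja2.sandbox import SandboxedEnvironment

env = SandboxedEnvironment()
# Render a template against a dictionary representing one table row
template = env.from_string("{{ name | upper }} is located in {{ city }}")
row = {"name": "Bigfoot Discovery Museum", "city": "Felton"}  # hypothetical row
print(template.render(row))  # BIGFOOT DISCOVERY MUSEUM is located in Felton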
datasette-enrichments-re2
datasette-enrichments-re2 provides an enrichment that can run a regular expression against a value from a table and store the result.
It offers four different modes:
Execute a search and replace against a column
Extract the first matching result and store that in the specified column (adding a column to the table if necessary)
Extract all matching results and store them as a JSON array in the specified column. If the regular expression uses named capture groups this will be an array of objects, otherwise it will be an array of strings.
Execute a regular expression with named capture groups and store the results in multiple columns, one for each of those named groups
That's quite a lot of functionality bundled into one enrichment! I haven't used this for much yet myself, but I'm looking forward to exploring it further and documenting some useful patterns.
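Named capture groups are the key to the multi-column mode. Sketched here with Python's standard re module - the plugin itself uses re2 bindings, but the group syntax is the same:

import re

# Each named group could become its own column in the table
pattern = re.compile(r"(?P<area>\d{3})-(?P<number>\d{4})")
match = pattern.search("Call 555-0123 for details")
if match:
    print(match.groupdict())  # {'area': '555', 'number': '0123'}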
Writing your own enrichment plugin
The most exciting thing about enrichments is what they can unlock in the future.
I've tried to make it as easy as possible for Python developers to build their own enrichment plugins.
The Developing a new enrichment documentation walks through the process of building a new enrichment plugin from scratch.
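From memory, the shape of an enrichment plugin looks roughly like this - treat the hook name and method signature as approximations and defer to that documentation for the real interface:

from datasette import hookimpl
from datasette_enrichments import Enrichment

class UppercaseEnrichment(Enrichment):
    name = "Uppercase"
    slug = "uppercase"
    description = "Convert values in a column to uppercase"

    async def enrich_batch(self, db, table, rows, pks, config):
        # Called with batches of rows; transform them and write back here
        ...

@hookimpl
def register_enrichments():
    return [UppercaseEnrichment()]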
Enrichments run inside Datasette using Python asyncio. This is a particularly good fit for enrichments that use external APIs, since HTTPX makes it easy to run multiple HTTP requests in parallel.
The -opencage and -gpt enrichments are two examples of enrichments that use HTTPX.
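The underlying pattern is simple and worth knowing. A generic sketch, not code from either plugin - the URLs are placeholders:

import asyncio
import httpx

async def fetch_all(urls):
    # Issue all of the HTTP requests concurrently instead of one at a time
    async with httpx.AsyncClient() as client:
        responses = await asyncio.gather(*(client.get(url) for url in urls))
    return [response.json() for response in responses]

results = asyncio.run(fetch_all([
    "https://api.example.com/geocode?q=123+Main+St",
    "https://api.example.com/geocode?q=456+Oak+Ave",
]))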
Interested in building one? Join the new #enrichments channel on the Datasette Discord to discuss ideas and talk about the new feature!
Weeknotes: datasette-enrichments, datasette-comments, sqlite-chronicle - 2023-12-08
I've mainly been working on Datasette Enrichments and continuing to explore the possibilities enabled by sqlite-chronicle.
Enrichments
I think this is the biggest new Datasette feature to arrive in quite a while, and it's entirely implemented as a plugin.
I described these in detail in Datasette Enrichments: a new plugin framework for augmenting your data (with an accompanying YouTube video demo). The short version: you can now install plugins that can "enrich" data by running transformations (or data fetches) against selected rows - geocoding addresses, or executing a GPT prompt, or applying a regular expression.
The datasette-enrichments plugin provides the mechanism for running these enrichments. Other plugins can then depend on it and define all manner of interesting options for enriching and transforming data.
I've built four of these so far, and I wrote some extensive documentation to help people build more. I'm excited to see how people use and build further on this initial foundation.
Datasette Comments
Alex Garcia released the first version of datasette-comments as part of our continuing collaboration to build out Datasette Cloud.
He wrote about that on the Datasette Cloud blog: Annotate and explore your data with datasette-comments.
This is another capability I've been looking forward to for years: the plugin lets you leave comments on individual rows within a Datasette instance, in order to collaborate with others on finding stories in data.
sqlite-chronicle and datasette-chronicle
I first wrote about sqlite-chronicle in weeknotes back in September. This week, inspired by my work on embeddings, I spent a bit more time on it and shipped a 0.2 release.
sqlite-chronicle is a Python library that implements a SQL pattern where a table can have a _chronicle_tablename companion table created, which is then updated using triggers against the main table.
The chronicle table has a shadow row for every row in the main table, duplicating its primary keys and then storing millisecond timestamp columns for added_ms and updated_ms, an integer version column and a deleted boolean indicator.
The goal is to record when a row was last inserted or updated, with an atomically incrementing version ID representing the version of the entire table.
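For a hypothetical documents table with an integer id primary key, the companion table looks something like this - illustrative SQL only, since the real schema and triggers are created by the library:

import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("""
CREATE TABLE _chronicle_documents (
    id INTEGER PRIMARY KEY,    -- mirrors the primary key of documents
    added_ms INTEGER,          -- millisecond timestamp: row first inserted
    updated_ms INTEGER,        -- millisecond timestamp: row last updated
    version INTEGER,           -- table-wide incrementing version number
    deleted INTEGER DEFAULT 0  -- set to 1 when the row is deleted
)
""")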
This can then enable all sorts of interesting potential use-cases:
Identify which rows have been updated or inserted since a previously recorded version
Synchronize a table with another table, only updating/inserting/deleting rows that have changed since last time
Run scheduled tasks that only consider rows that have changed in some way
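For example, a sync job could persist the highest version it has processed and ask only for rows past it - a sketch against the hypothetical documents table from above:

import sqlite3

conn = sqlite3.connect("data.db")
last_seen_version = 42  # persisted by the sync job after each run
changed = conn.execute(
    """
    SELECT c.id, c.version, c.deleted, documents.*
    FROM _chronicle_documents AS c
    LEFT JOIN documents ON documents.id = c.id  -- LEFT JOIN keeps deleted rows
    WHERE c.version > ?
    ORDER BY c.version
    """,
    (last_seen_version,),
).fetchall()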
The relevance to enrichments is that I'd like to implement a form of "persistent" enrichment - an enrichment which is configured to run repeatedly against new or updated rows, geocoding new addresses for example.
To do that, I need a mechanism to identify which rows have already been enriched and which need to be enriched again. sqlite-chronicle is my current plan to provide that mechanism.
It's still pretty experimental. I recently found that INSERT OR REPLACE INTO queries don't behave how I would expect them to - see issue #7.
I also started a new plugin to accompany the feature: datasette-chronicle, which adds two features to Datasette:
"enable/disable chronicle tracking" table actions for users with the correct permissions, which can be used in the Datasette UI to turn chronicle tracking on and off for a specific table
For tables that have chronicle enabled, a ?_since=VERSION querystring parameter which can be used to filter the table to only rows that have changed since the specified version
I'm running the plugin against the documents table on demos.datasette.cloud - see _chronicle_documents there for the result. That table is populated via GitHub scheduled actions and the Datasette API, as described in Getting started with the Datasette Cloud API - it's also where I first spotted the INSERT OR REPLACE INTO issue I described earlier.
Newsroom Robots
I recorded an episode of the Newsroom Robots AI in journalism podcast with Nikita Roy a couple of weeks ago.
She split our conversation into two episodes:
Simon Willison (Part One): Breaking Down OpenAI's New Features & Security Risks of Large Language Models - which I ended up using as the basis for two blog entries.
Simon Willison (Part Two): How Datasette Helps With Investigative Reporting which has the best audio description of Datasette I've managed to produce so far.
sqlite-utils 3.36
Quoting the release notes:
Support for creating tables in SQLite STRICT mode. Thanks, Taj Khattra. (#344)
CLI commands create-table, insert and upsert all now accept a --strict option.
Python methods that can create a table - table.create() and insert/upsert/insert_all/upsert_all - all now accept an optional strict=True parameter.
The transform command and table.transform() method preserve strict mode when transforming a table.
The sqlite-utils create-table command now accepts str, int and bytes as aliases for text, integer and blob respectively. (#606)
Taj Khattra's contribution of the --strict and strict=True options is a beautiful example of my ideal pull request: a clean implementation, comprehensive tests and thoughtful updates to the documentation all bundled together in one go.
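For example, using the new Python-level option described in those notes:

import sqlite_utils

db = sqlite_utils.Database("data.db")
# strict=True creates a SQLite STRICT table, which enforces column types
db["entries"].create({"id": int, "title": str}, pk="id", strict=True)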
Releases
sqlite-utils 3.36 - 2023-12-08
Python CLI utility and library for manipulating SQLite databases
datasette-leaflet-geojson 0.8.1 - 2023-12-07
Datasette plugin that replaces any GeoJSON column values with a Leaflet map.
datasette-chronicle 0.2 - 2023-12-06
Enable sqlite-chronicle against tables in Datasette
datasette-enrichments-jinja 0.1 - 2023-12-06
Datasette enrichment for evaluating templates in a Jinja sandbox
sqlite-chronicle 0.2.1 - 2023-12-06
Use triggers to track when rows in a SQLite table were updated or deleted
datasette-enrichments-gpt 0.3 - 2023-12-01
Datasette enrichment for analyzing row data using OpenAI's GPT models
datasette-statistics 0.2.1 - 2023-11-30
SQL statistics functions for Datasette
datasette-enrichments-opencage 0.1 - 2023-11-30
Geocoding and reverse geocoding using OpenCage
datasette-enrichments-re2 0.1 - 2023-11-30
Enrich data using regular expressions powered by re2
datasette-enrichments 0.2 - 2023-11-29
Tools for running enrichments against data stored in Datasette
datasette-pretty-json 0.3 - 2023-11-28
Datasette plugin that pretty-prints any column values that are valid JSON objects or arrays
TILs
Grabbing a transcript of a short snippet of a YouTube video with MacWhisper - 2023-12-01
Cryptography in Pyodide - 2023-11-26
Running pip install '.[docs]' on ReadTheDocs - 2023-11-24
Link 2023-11-30 ChatGPT is one year old. Here’s how it changed the world.:
I'm quoted in this piece by Benj Edwards about ChatGPT's one year birthday:
"Imagine if every human being could automate the tedious, repetitive information tasks in their lives, without needing to first get a computer science degree," AI researcher Simon Willison told Ars in an interview about ChatGPT's impact. "I'm seeing glimpses that LLMs might help make a huge step in that direction."
Quote 2023-11-30
This is what I constantly tell my students: The hard part about doing a tech product for the most part isn't the what beginners think makes tech hard — the hard part is wrangling systemic complexity in a good, sustainable and reliable way.
Many non-tech people e.g. look at programmers and think the hard part is knowing what this garble of weird text means. But this is the easy part. And if you are a person who would think it is hard, you probably don't know about all the demons out there that will come to haunt you if you don't build a foundation that helps you actively keep them away.
Link 2023-11-30 Annotate and explore your data with datasette-comments:
New plugin for Datasette and Datasette Cloud: datasette-comments, providing tools for collaborating on data exploration with a team through posting comments on individual rows of data.
Alex Garcia built this for Datasette Cloud but as with almost all of our work there it's also available as an open source Python package.
Quote 2023-12-01
So something everybody I think pretty much agrees on, including Sam Altman, including Yann LeCun, is LLMs aren't going to make it. The current LLMs are not a path to ASI. They're getting more and more expensive, they're getting more and more slow, and the more we use them, the more we realize their limitations.
We're also getting better at taking advantage of them, and they're super cool and helpful, but they appear to be behaving as extremely flexible, fuzzy, compressed search engines, which when you have enough data that's kind of compressed into the weights, turns out to be an amazingly powerful operation to have at your disposal.
[...] And the thing you can really see missing here is this planning piece, right? So if you try to get an LLM to solve fairly simple graph coloring problems or fairly simple stacking problems, things that require backtracking and trying things and stuff, unless it's something pretty similar in its training, they just fail terribly.
[...] So that's the theory about what something like Q* might be, or just in general, how do we get past this current constraint that we have?
TIL 2023-12-01 Grabbing a transcript of a short snippet of a YouTube video with MacWhisper:
I grabbed a quote from a transcript of a snippet of a YouTube video today for my blog. …
Link 2023-12-01 Seamless Communication:
A new "family of AI research models" from Meta AI for speech and text translation. The live demo is particularly worth trying - you can record a short webcam video of yourself speaking and get back the same video with your speech translated into another language.
The key to it is the new SeamlessM4T v2 model, which supports 101 languages for speech input, 96 Languages for text input/output and 35 languages for speech output. SeamlessM4T-Large v2 is a 9GB file, available on Hugging Face.
Also in this release: SeamlessExpressive, which "captures certain underexplored aspects of prosody such as speech rate and pauses" - effectively maintaining things like expressed enthusiasm across languages.
Plus SeamlessStreaming, "a model that can deliver speech and text translations with around two seconds of latency".
Link 2023-12-01 Write shaders for the Vegas sphere:
Alexandre Devaux built this phenomenal three.js / WebGL demo, which displays a rotating flyover of the Vegas Sphere and lets you directly edit shader code to render your own animations on it and see what they would look like. The Hacker News thread includes dozens of examples of scripts you can paste in.
Link 2023-12-04 LLM Visualization:
Brendan Bycroft's beautifully crafted interactive explanation of the transformers architecture - that universal but confusing model diagram, only here you can step through and see a representation of the flurry of matrix algebra that occurs every time you get a Large Language Model to generate the next token.
Link 2023-12-05 Spider-Man: Across the Spider-Verse screenplay (PDF):
Phil Lord shared this on Twitter yesterday - the final screenplay for Spider-Man: Across the Spider-Verse. It's a really fun read.
Quote 2023-12-05
A calculator has a well-defined, well-scoped set of use cases, a well-defined, well-scoped user interface, and a set of well-understood and expected behaviors that occur in response to manipulations of that interface.
Large language models, when used to drive chatbots or similar interactive text-generation systems, have none of those qualities. They have an open-ended set of unspecified use cases.
Link 2023-12-05 Simon Willison (Part Two): How Datasette Helps With Investigative Reporting:
The second part of my Newsroom Robots podcast conversation with Nikita Roy. This episode includes my best audio answer yet to the "what is Datasette?" question, plus notes on how to use LLMs in journalism despite their propensity to make things up.
Quote 2023-12-05
GPT and other large language models are aesthetic instruments rather than epistemological ones. Imagine a weird, unholy synthesizer whose buttons sample textual information, style, and semantics. Such a thing is compelling not because it offers answers in the form of text, but because it makes it possible to play text—all the text, almost—like an instrument.
Link 2023-12-05 AI and Trust:
Barnstormer of an essay by Bruce Schneier about AI and trust. It's worth spending some time with this - it's hard to extract the highlights since there are so many of them.
A key idea is that we are predisposed to trust AI chat interfaces because they imitate humans, which means we are highly susceptible to profit-seeking biases baked into them.
Bruce suggests that what's needed is public models, backed by government funds: "A public model is a model built by the public for the public. It requires political accountability, not just market accountability."
Link 2023-12-06 Ice Cubes GPT-4 prompts:
The Ice Cubes open source Mastodon app recently grew a very good "describe this image" feature to help people add alt text to their images. I had a dig around in their repo and it turns out they're using GPT-4 Vision for this (and regular GPT-4 for other features), passing the image with this prompt:
"What’s in this image? Be brief, it's for image alt description on a social network. Don't write in the first person."
Link 2023-12-06 Long context prompting for Claude 2.1:
Claude 2.1 has a 200,000 token context, enough for around 500 pages of text. Convincing it to answer a question based on a single sentence buried deep within that content can be difficult, but Anthropic found that adding "Assistant: Here is the most relevant sentence in the context:" to the end of the prompt was enough to raise Claude 2.1’s score from 27% to 98% on their evaluation.
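The trick is pre-filling the start of the assistant's turn. With the Anthropic Python client of that era it would look something like this - a sketch of the technique as described, not Anthropic's own code:

from anthropic import Anthropic, HUMAN_PROMPT, AI_PROMPT

client = Anthropic()  # assumes ANTHROPIC_API_KEY is set
long_document = open("context.txt").read()  # placeholder for ~200,000 tokens of text
completion = client.completions.create(
    model="claude-2.1",
    max_tokens_to_sample=300,
    # Ending the prompt mid-assistant-turn steers the model's first words
    prompt=f"{HUMAN_PROMPT} {long_document}\n\nWhat does the document say about X?"
           f"{AI_PROMPT} Here is the most relevant sentence in the context:",
)
print(completion.completion)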
Link 2023-12-07 SVG Tutorial: Learn SVG through 25 examples:
Hunor Márton Borbély published this fantastic advent calendar of tutorials for learning SVG, from the basics up to advanced concepts like animation and interactivity.
Quote 2023-12-08
We like to assume that automation technology will maintain or increase wage levels for a few skilled supervisors. But in the long-term skilled automation supervisors also tend to earn less.
Here's an example: In 1801 the Jacquard loom was invented, which automated silkweaving with punchcards. Around 1800, a manual weaver could earn 30 shillings/week. By the 1830s the same weaver would only earn around 5s/week. A Jacquard operator earned 15s/week, but he was also 12x more productive.
The Jacquard operator upskilled and became an automation supervisor, but their wage still dropped. For manual weavers the wages dropped even more. If we believe assistive AI will deliver unseen productivity gains, we can assume that wage erosion will also be unprecedented.
Link 2023-12-08 Standard Webhooks 1.0.0:
A loose specification for implementing webhooks, put together by a technical steering committee that includes representatives from Zapier, Twilio and more.
These recommendations look great to me. Even if you don't follow them precisely, this document is still worth reviewing any time you consider implementing webhooks - it covers a bunch of non-obvious challenges, such as responsible retry scheduling, thin-vs-thick hook payloads, authentication, custom HTTP headers and protecting against server-side request forgery attacks.
Link 2023-12-08 Announcing Purple Llama: Towards open trust and safety in the new world of generative AI:
New from Meta AI, Purple Llama is "an umbrella project featuring open trust and safety tools and evaluations meant to level the playing field for developers to responsibly deploy generative AI models and experiences".
There are three components: a 27 page "Responsible Use Guide", a new open model called Llama Guard and CyberSec Eval, "a set of cybersecurity safety evaluations benchmarks for LLMs".
Disappointingly, despite this being an initiative around trustworthy LLM development, prompt injection is mentioned exactly once, in the Responsible Use Guide, with an incorrect description that frames it as "attempts to circumvent content restrictions"!
The Llama Guard model is interesting: it's a fine-tune of Llama 2 7B designed to help spot "toxic" content in input or output from a model, effectively an openly released alternative to OpenAI's moderation API endpoint.
The CyberSec Eval benchmarks focus on two concepts: generation of insecure code, and preventing models from assisting attackers in generating new attacks. I don't think either of those are anywhere near as important as prompt injection mitigation.
My hunch is that the reason prompt injection didn't get much coverage in this is that, like the rest of us, Meta's AI research teams have no idea how to fix it yet!
Quote 2023-12-08
Create a culture that favors begging forgiveness (and reversing decisions quickly) rather than asking permission. Invest in infrastructure such as progressive / cancellable rollouts. Use asynchronous written docs to get people aligned (“comment in this doc by Friday if you disagree with the plan”) rather than meetings (“we’ll get approval at the next weekly review meeting”).
Link 2023-12-09 3D Gaussian Splatting - Why Graphics Will Never Be The Same:
Gaussian splatting is an intriguing new approach to 3D computer graphics that's getting a lot of buzz at the moment. This 2m11s YouTube video is the best condensed explanation I've seen of the key idea.
Quote 2023-12-09
I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
We direct their dreams with prompts. The prompts start the dream, and based on the LLM's hazy recollection of its training documents, most of the time the result goes someplace useful.
It's only when the dreams go into deemed factually incorrect territory that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.
Link 2023-12-10 ast-grep:
There are a lot of interesting things about this year-old project.
sg (an alias for ast-grep) is a CLI tool for running AST-based searches against code, built in Rust on top of the Tree-sitter parsing library. You can run commands like this:
sg -p 'await await_me_maybe($ARG)' datasette --lang python
This searches the datasette directory for code matching that pattern, in a syntax-aware way.
It works across 19 different languages, and can handle search-and-replace too, so it can work as a powerful syntax-aware refactoring tool.
My favourite detail is how it's packaged. You can install the CLI utility using Homebrew, Cargo, npm or pip/pipx - each of which will give you a CLI tool you can start running. On top of that it provides API bindings for Rust, JavaScript and Python!
Quote 2023-12-10
When I speak in front of groups and ask them to raise their hands if they used the free version of ChatGPT, almost every hand goes up. When I ask the same group how many use GPT-4, almost no one raises their hand. I increasingly think the decision of OpenAI to make the “bad” AI free is causing people to miss why AI seems like such a huge deal to a minority of people that use advanced systems and elicits a shrug from everyone else.
Link 2023-12-10 Upgrading GitHub.com to MySQL 8.0:
I love a good zero-downtime upgrade story, and this is a fine example of the genre. GitHub spent a year upgrading MySQL from 5.7 to 8 across 1200+ hosts, covering 300+ TB that was serving 5.5 million queries per second. The key technique was extremely carefully managed replication, plus tricks like leaving enough 5.7 replicas available to handle a rollback should one be needed.