Enriching data with GPT3.5 and SQLite SQL functions
Plus running Deno JavaScript and Lua in the ChatGPT Code Interpreter alpha
In this newsletter:
Enriching data with GPT3.5 and SQLite SQL functions
Miscellaneous research into Rye, ChatGPT Code Interpreter and openai-to-sqlite
Plus 7 links and 1 quotation and 5 TILs
Enriching data with GPT3.5 and SQLite SQL functions - 2023-04-29
I shipped openai-to-sqlite 0.3 yesterday with a fun new feature: you can now use the command-line tool to enrich data in a SQLite database by running values through an OpenAI model and saving the results, all in a single SQL query.
The idea for this came out of a conversation in a Datasette Office Hours session. I was asked if there was a way to do sentiment analysis using Datasette. There isn't, yet - and the feature I've been planning that would enable that (which I'm calling "enrichments") is still a fair way out.
But it got me to thinking... what's the simplest possible way to run sentiment analysis on a column of data in a SQLite database using the tools I've already built?
I ended up adding a small new feature to my openai-to-sqlite tool: the ability to call the OpenAI API (currently just the ChatGPT / gpt-3.5-turbo model) with a SQL query, plus a new chatgpt(prompt) SQL function for executing prompts.
This means you can do sentiment analysis something like this:
openai-to-sqlite query database.db "
update messages set sentiment = chatgpt(
'Sentiment analysis for this message: ' || message ||
' - ONLY return a lowercase string from: positive, negative, neutral, unknown'
)
where sentiment not in ('positive', 'negative', 'neutral', 'unknown')
or sentiment is null
"
Running this command causes the sentiment column on the messages table to be populated with one of the following values: positive, negative, neutral or unknown.
It also prints out a cost estimate at the end. Running it against 400 rows of data (each the length of a group chat message, so pretty short) cost me 20,000 tokens, which was about 4 cents. gpt-3.5-turbo is cheap.
The command uses an OpenAI API key from the OPENAI_API_KEY environment variable, or you can pass it in using the --token option to the command.
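Under the hood this works by registering chatgpt() as a custom SQLite function. Here's a minimal sketch of that pattern using Python's sqlite3 module - note that fake_chatgpt is my stand-in for the real OpenAI API call, so this example runs without a network connection or API key:

```python
import sqlite3

# fake_chatgpt stands in for the OpenAI API call the real tool makes -
# it's a trivial keyword classifier so the sketch runs offline.
def fake_chatgpt(prompt):
    return "positive" if "love" in prompt.lower() else "neutral"

db = sqlite3.connect(":memory:")
db.execute("create table messages (message text, sentiment text)")
db.execute("insert into messages (message) values ('I love this group')")

# Register the one-argument chatgpt() function so SQL can call it
db.create_function("chatgpt", 1, fake_chatgpt)

db.execute(
    "update messages set sentiment = chatgpt("
    "'Sentiment analysis for this message: ' || message)"
)
print(db.execute("select sentiment from messages").fetchone()[0])
# prints: positive
```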
The tool also displays a progress bar while it's running, which looks like this:
[Screenshot: a progress bar running sentiment analysis with ChatGPT]
Here's the SQL query that I ran, with extra inline comments:
update messages
-- we're updating rows in the messages table
set sentiment = chatgpt(
-- Construct the ChatGPT prompt
'Sentiment analysis for this message: ' ||
message ||
  ' - ONLY return a lowercase string from: ' ||
  'positive, negative, neutral, unknown'
)
where
-- Don't update rows that already have a sentiment
sentiment not in (
'positive', 'negative', 'neutral', 'unknown'
) or sentiment is null
And here's the prompt I'm using:
Sentiment analysis for this message: {message} - ONLY return a lowercase string from: positive, negative, neutral, unknown
As usual with prompt engineering, you end up having to practically BEG the model to stick to the rules. My first version of this prompt produced all kinds of unexpected output - this version mostly does what I want, but still ends up spitting out the occasional Positive. or Sentiment: Negative result despite my pleas for just those four strings.
I'm sure there are better prompts for this. I'd love to see what they are!
Running prompts with a SELECT
I have other tools for running prompts from the command-line, but if you want to use this to execute a prompt directly without writing to a database you can execute against the :memory: SQLite in-memory database like this:
openai-to-sqlite query :memory: \
  "select chatgpt('Surprising names for a horse')"
How the progress bar works
When I first implemented this, it didn't have a progress bar.
This turned out to be a pretty big problem!
A SQL update that affects 400 rows, each one involving an API call to OpenAI, can take a few minutes to run.
During that time, there is no feedback at all to show you that it's working, or indicate how far it's gone.
The UPDATE statement itself runs inside a transaction, so you can't even peek into the database to see how it's going - all 400 rows will appear at once, at the end of the query.
I really wanted a progress bar. But how could I implement that? I need it to update as the query progresses, and I also need to know how many API calls it's going to make in advance in order to correctly display it.
I figured out a neat way of doing this.
The trick is to run the SQL query twice. The first time, implement a dummy chatgpt() function that counts how many times it has been called but doesn't make an API call. Then, at the end of that query... rollback the transaction! This means that the dummy chatgpt() function will have been called the correct number of times, but the database will be left unchanged.
Now register the "real" chatgpt() function and run the query a second time. That second chatgpt() Python function can increment the counter used by the progress bar each time it's called.
The result is a progress bar that knows how many API calls are going to be made and updates as the query progresses.
Here's the code that makes that happen.
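The two-pass trick can be sketched in a few lines of Python. This is a simplified sketch rather than the tool's actual code: run_with_progress and api_call are names I've invented, and the real implementation drives a proper progress bar instead of printing.

```python
import sqlite3

def run_with_progress(db, sql, api_call):
    # First pass: a dummy chatgpt() that only counts its calls.
    calls = 0
    def dummy(prompt):
        nonlocal calls
        calls += 1
        return ""
    db.create_function("chatgpt", 1, dummy)
    db.execute(sql)
    db.rollback()  # throw away the dummy results - count survives

    # Second pass: the real function advances the progress counter
    # each time it's called, then makes the actual API call.
    done = 0
    def real(prompt):
        nonlocal done
        done += 1
        print(f"progress: {done}/{calls}")
        return api_call(prompt)
    db.create_function("chatgpt", 1, real)
    db.execute(sql)
    db.commit()
    return calls
```

Re-registering chatgpt() with create_function() simply replaces the earlier dummy, which is what makes the second pass work.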
Is this a good way of doing things?
Even though I managed to get the progress bar working, there are still some big limitations in using a SQL query to do this.
The first is that any query which triggers external API calls via a custom SQL function is inherently time-consuming. This means running a really long transaction, and since SQLite only accepts a single write at a time this means locking the database for writes for a very long time.
That's OK for a command-line script like this when you're the only user of the database, but it's not a good idea for databases that are serving other queries - as I frequently do with SQLite and Datasette.
A much bigger problem though is what happens when something goes wrong. An update ... set col1 = chatgpt(...) query executes in a single transaction. If you're running it against 400 rows and something causes an error at row 399, the transaction will rollback and you'll lose all of the work that DID succeed!
As such, while this is a fun tool for ad-hoc experiments with OpenAI data enrichment, I don't think it's a good long-term solution. A better mechanism would enable each individual API call to be written to storage such that problems in later calls don't discard data collected earlier on.
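Here's one hedged sketch of what that better mechanism might look like: select the rows that need enriching first, then update each one in its own transaction. The names enrich_rows and classify are hypothetical - classify stands in for the per-row OpenAI API call.

```python
import sqlite3

def enrich_rows(db, classify):
    # Read the rows that still need a sentiment up front, then
    # commit after every row so a failure on row N doesn't
    # discard rows 1..N-1.
    rows = db.execute(
        "select rowid, message from messages where sentiment is null"
    ).fetchall()
    updated = 0
    for rowid, message in rows:
        try:
            sentiment = classify(message)  # the API call
        except Exception:
            continue  # a failing call only loses this one row
        db.execute(
            "update messages set sentiment = ? where rowid = ?",
            (sentiment, rowid),
        )
        db.commit()  # persist immediately
        updated += 1
    return updated
```

The short per-row transactions also mean the database isn't locked for writes for the whole run.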
Miscellaneous research into Rye, ChatGPT Code Interpreter and openai-to-sqlite - 2023-05-01
I gave myself some time off stressing about my core responsibilities this week after PyCon, which meant allowing myself to be distracted by some miscellaneous research projects.
Rye
Rye is a new experimental twist on Python packaging from Armin Ronacher. He's been quite apologetic about it, asking Should Rye Exist? - Python packaging is a crowded space right now!
Personally, I think a working prototype of an interesting idea is always worthwhile. My experience is that running code increases the quality of the discussion around an idea enormously, because it gives people something concrete to talk about.
Rye has some really interesting ideas. By far my favourite is how it bundles Python itself: it doesn't depend on a system Python, instead downloading a standalone Python build from the python-build-standalone project and stashing it away in a ~/.rye directory.
I love this. Getting Python running on a system is often way harder than it should be. Rye provides a single binary (written in Rust) which can bootstrap a working Python environment, without interfering with the system Python or any other Python environments that might already be installed.
I wrote up a few notes on Rye in a TIL earlier this week, mainly detailing how it works and where it puts things.
I also released Datasette 0.64.3 with a tiny fix to ensure it would install cleanly using rye install datasette.
ChatGPT Code Interpreter
I've been having a whole lot of fun exploring this. I wrote about how I've been using it to run micro-benchmarks a few weeks ago - today I figured out a pattern for installing additional Python packages (despite its lack of an internet connection) and even uploading binaries for Deno and Lua to grant it the ability to run code in other languages!
I think it's the most interesting thing in all of ChatGPT/LLM world at the moment, which is a big statement.
openai-to-sqlite
Inspired by a Datasette Office Hours conversation on Friday I decided to see if I could figure out a way to run simple sentiment analysis against data in a SQLite database using any of my various tools.
I ended up adding a new mechanism to my openai-to-sqlite CLI tool - it can now execute SQL queries that update existing tables with the results of a chatgpt() API call using a custom SQL function.
I wrote more about that in Enriching data with GPT3.5 and SQLite SQL functions.
Upgraded social media cards for my TILs
My Today I Learned site has had social media cards - images that show up in link previews when URLs are shared - for a long time now. Since few of my TILs have images of their own it generates these as screenshots of the pages themselves.
Until recently it stored these images as PNG files directly in the SQLite database itself. Vercel has a 50MB size limit on deployments and the other day the screenshots finally tipped the database over that limit.
To fix it, I moved the images out of the SQLite database and put them in an S3 bucket instead. This also meant I could increase their size and resolution - they are now generated with the shot-scraper --retina option, which doubles their size to 1600x800 pixels.
This ended up being a fun exercise in combining both shot-scraper and my s3-credentials CLI tools. I wrote up full details of how the new screenshot system works in a new TIL, Social media cards generated with shot-scraper.
Next week: a webinar on Prompt Injection
My other blog entry this week introduced The Dual LLM pattern for building AI assistants that can resist prompt injection - my latest thinking on how we might be able to build AI assistants even without a robust solution to the prompt injection problem.
I have a speaking engagement lined up for next week: the LangChain Prompt Injection Webinar.
I'll be discussing prompt injection attacks against LLMs on a panel with Willem Pienaar, Kojin Oshiba and Jonathan Cohen and Christopher Parisien from NVIDIA.
I think it will be an interesting conversation. I'm going to reiterate my argument that You can’t solve AI security problems with more AI - a position that I'm not sure is shared by the other members of the panel!
Entries this week
Releases this week
s3-credentials 0.15 - 2023-04-30
A tool for creating credentials for accessing S3 buckets
openai-to-sqlite 0.3 - 2023-04-29
Save OpenAI API results to a SQLite database
datasette 0.64.3 - 2023-04-27
An open source multi-tool for exploring and publishing data
shot-scraper 1.2 - 2023-04-27
A command-line utility for taking automated screenshots of websites
datasette-explain 0.1a2 - 2023-04-24
Explain and validate SQL queries as you type them into Datasette
TIL this week
Expanding ChatGPT Code Interpreter with Python packages, Deno and Lua - 2023-05-01
Social media cards generated with shot-scraper - 2023-04-30
Deno KV - 2023-04-28
The location of the pip cache directory - 2023-04-28
A few notes on Rye - 2023-04-27
Quote 2023-04-26
The Consumer Financial Protection Bureau (CFPB) supervises, sets rules for, and enforces numerous federal consumer financial laws and guards consumers in the financial marketplace from unfair, deceptive, or abusive acts or practices and from discrimination [...] the fact that the technology used to make a credit decision is too complex, opaque, or new is not a defense for violating these laws.
The Consumer Financial Protection Bureau (PDF)
Link 2023-04-26 How prompt injection attacks hijack today's top-end AI – and it's really tough to fix: Thomas Claburn interviewed me about prompt injection for the Register. Lots of direct quotes from our phone call in here - we went pretty deep into why it's such a difficult problem to address.
Link 2023-04-26 urllib3 v2.0.0 is now generally available: urllib3 is 12 years old now, and is a common low-level dependency for packages like requests and httpx. The biggest new feature in v2 is a higher-level API: resp = urllib3.request("GET", "https://example.com") - a very welcome addition to the library.
TIL 2023-04-27 A few notes on Rye:
Rye is Armin Ronacher's new experimental Python packaging tool. I decided to take it for a test-run. …
Link 2023-04-27 GPT-3 token encoder and decoder: I built an Observable notebook with an interface to encode, decode and search through GPT-3 tokens, building on top of a notebook by EJ Fox and Ian Johnson.
Link 2023-04-28 Trainbot: "Trainbot watches a piece of train track, detects passing trains, and stitches together images of them" - check out the site itself too, which shows beautifully stitched panoramas of trains that have recently passed near Jo M's apartment. Found via the best Hacker News thread I've seen in years, "Ask HN: Most interesting tech you built for just yourself?".
TIL 2023-04-28 The location of the pip cache directory:
pip uses a cache to avoid downloading packages again: …
Link 2023-04-28 IF by DeepFloyd Lab: New image generation AI model, financially backed by StabilityAI but based on the Google Imagen paper. Claims to be much better at following complex prompts, including being able to generate text! I tried the Colab notebook with "a photograph of raccoon in the woods holding a sign that says 'I will eat your trash'" and it didn't quite get the text right, see via link for the result.
TIL 2023-04-28 Deno KV:
I got intrigued by Deno KV, which describes itself as "a global database for global apps". It's a key/value store for Deno applications which bundles some kind of worldwide distributed/replicated database service. …
Link 2023-04-29 MLC LLM: From MLC, the team that gave us Web LLM and Web Stable Diffusion. "MLC LLM is a universal solution that allows any language model to be deployed natively on a diverse set of hardware backends and native applications". I installed their iPhone demo from TestFlight this morning and it does indeed provide an offline LLM that runs on my phone. It's reasonably capable - the underlying model for the app is vicuna-v1-7b, a LLaMA derivative.
Link 2023-04-29 MRSK: A new open source web application deployment tool from 37signals, developed to help migrate their Hey webmail app out of the cloud and onto their own managed hardware. The key feature is one that I care about deeply: it enables zero-downtime deploys by running all traffic through a Traefik reverse proxy in a way that allows requests to be paused while a new deployment is going out - so end users get a few seconds delay on their HTTP requests before being served by the replaced application.
TIL 2023-04-30 Social media cards generated with shot-scraper:
My TIL website has social media card images to make links shared from it look slightly more interesting when shared on sites like Mastodon and Twitter. …
TIL 2023-05-01 Expanding ChatGPT Code Interpreter with Python packages, Deno and Lua:
The ChatGPT Code Interpreter alpha remains incredibly interesting. I wrote about how I was using it for Python and SQLite benchmarking a few weeks ago. Today I found a neat pattern for expanding its capabilities with custom binaries. …