In this newsletter:
Understanding GPT tokenizers
Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking
Plus 5 links and 1 quotation and 1 TIL
Understanding GPT tokenizers - 2023-06-08
Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.
Playing around with these tokens is an interesting way to get a better idea for how this stuff actually works under the hood.
OpenAI offer a Tokenizer tool for exploring how tokens work.
I've built my own, slightly more interesting tool as an Observable notebook:
https://observablehq.com/@simonw/gpt-tokenizer
You can use the notebook to convert text to tokens, tokens to text and also to run searches against the full token table.
The text I'm tokenizing here is:
The dog eats the apples El perro come las manzanas 片仮名
This produces 21 integer tokens: 5 for the English text, 8 for the Spanish text and 6 (2 each) for those three Japanese characters. The two newlines are each represented by a token as well.
The notebook uses the tokenizer from GPT-2 (borrowing from this excellent notebook by EJ Fox and Ian Johnson), so it's useful primarily as an educational tool - there are differences between how it works and the latest tokenizers for GPT-3 and above.
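You can see some of those differences for yourself with the tiktoken library (covered later in this post); here's a minimal sketch comparing the GPT-2 scheme against the cl100k_base scheme used by GPT-4 - expect different token IDs and often a different count:

import tiktoken

text = "The dog eats the apples"
for name in ("gpt2", "cl100k_base"):
    encoding = tiktoken.get_encoding(name)
    tokens = encoding.encode(text)
    # Newer tokenizers assign different IDs and often use fewer tokens
    print(name, len(tokens), tokens)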
Exploring some interesting tokens
Playing with the tokenizer reveals all sorts of interesting patterns.
Most common English words are assigned a single token. As demonstrated above:
"The": 464
" dog": 3290
" eats": 25365
" the": 262
" apples": 22514
Note that capitalization is important here. "The" with a capital T is token 464, but " the" with both a leading space and a lowercase t is token 262.
Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.
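You can check both of these behaviours with tiktoken's gpt2 encoding, which matches the notebook - a quick sketch:

import tiktoken

encoding = tiktoken.get_encoding("gpt2")
print(encoding.encode("The"))   # [464] - capital T, no leading space
print(encoding.encode(" the"))  # [262] - leading space, lowercase t
print(encoding.encode("the"))   # lowercase with no space - a different token again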
Languages other than English suffer from less efficient tokenization.
"El perro come las manzanas" in Spanish is encoded like this:
"El": 9527
" per": 583
"ro": 305
" come": 1282
" las": 39990
" man": 582
"zan": 15201
"as": 292
The English bias is obvious here. " man" gets a lower token ID of 582, because it's an English word. "zan" gets a token ID of 15201 because it's not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.
Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:
片: 31965 229
仮: 20015 106
名: 28938 235
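The same sketch approach shows those multi-token characters directly, again assuming tiktoken's gpt2 encoding matches the notebook:

import tiktoken

encoding = tiktoken.get_encoding("gpt2")
for char in "片仮名":
    # Each character encodes to two tokens, matching the IDs listed above
    print(char, encoding.encode(char))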
Glitch tokens
A fascinating subset of tokens are what are known as "glitch tokens". My favourite example of those is token 23282 - " davidjl".
We can find that token by searching for "david" using the search box in the notebook.
Riley Goodside highlighted some weird behaviour with that token.
Why this happens is an intriguing puzzle.
It looks likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He's posted incremented numbers there well over 163,000 times.
Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.
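If you want to poke at that token yourself, here's a sketch using tiktoken (described in the next section), assuming its gpt2 encoding matches the notebook's token table:

import tiktoken

encoding = tiktoken.get_encoding("gpt2")
print(encoding.decode([23282]))  # ' davidjl'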
But why would that break things like this? The best theory I've seen so far came from londons_explore on Hacker News:
These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to 'repeat' them, gets the wrong one.
That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).
The conversation attached to the post SolidGoldMagikarp (plus, prompt generation) on LessWrong has a great deal more detail on this phenomenon.
Counting tokens with tiktoken
OpenAI's models each have a token limit. It's sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit is not exceeded.
One technique that needs this is Retrieval Augmented Generation, where you answer a user's question by running a search (or an embedding search) against a corpus of documents, extract the most relevant content and include that as context in a prompt.
The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit - so you need to be able to count tokens.
OpenAI provide a Python library for doing this called tiktoken.
If you dig around inside the library you'll find it currently includes five different tokenization schemes: r50k_base, p50k_base, p50k_edit, cl100k_base and gpt2.
Of these, cl100k_base is the most relevant, being the tokenizer for both GPT-4 and the inexpensive gpt-3.5-turbo model used by current ChatGPT. p50k_base is used by text-davinci-003. A full mapping of models to tokenizers can be found in the MODEL_TO_ENCODING dictionary in tiktoken/model.py.
Here's how to use tiktoken:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
# or "gpt-3.5-turbo" or "text-davinci-003"
tokens = encoding.encode("Here is some text")
token_count = len(tokens)
tokens will now be an array of four integer token IDs - [8586, 374, 1063, 1495] in this case.
Use the .decode() method to turn an array of token IDs back into text:
text = encoding.decode(tokens)
# 'Here is some text'
The first time you call encoding_for_model() the encoding data will be fetched over HTTP from an openaipublic.blob.core.windows.net Azure blob storage bucket (code here). This is cached in a temp directory, but that will get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a TIKTOKEN_CACHE_DIR environment variable.
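Putting this together with the Retrieval Augmented Generation pattern described earlier, here's a sketch of packing documents into a prompt until a token budget runs out. The fit_documents() helper and the 4,000 token budget are hypothetical illustrations, not part of tiktoken:

import tiktoken

def fit_documents(documents, model="gpt-4", budget=4000):
    # Accumulate documents in order until the token budget would be exceeded
    encoding = tiktoken.encoding_for_model(model)
    selected = []
    used = 0
    for doc in documents:
        cost = len(encoding.encode(doc))
        if used + cost > budget:
            break
        selected.append(doc)
        used += cost
    return "\n\n".join(selected)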
ttok
I introduced my ttok tool a few weeks ago. It's a command-line wrapper around tiktoken with two key features: it can count tokens in text that is piped to it, and it can also truncate that text down to a specified number of tokens:
# Count tokens
echo -n "Count these tokens" | ttok
# Outputs: 3 (the newline is skipped thanks to echo -n)
# Truncation
curl 'https://simonwillison.net/' | strip-tags -m | ttok -t 6
# Outputs: Simon Willison’s Weblog
# View integer token IDs
echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198
Use -m gpt2 or similar to use an encoding for a different model.
Watching tokens get generated
Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.
In particular, it's fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is slightly slower than 3.5, making it easier to see what's going on).
Here's what I get for llm -s 'Five names for a pet pelican' -4 - using my llm CLI tool to generate text from GPT-4:
As you can see, names that are not in the dictionary such as "Pelly" take multiple tokens, but "Captain Gulliver" outputs the token "Captain" as a single chunk.
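If you want to watch those tokens arrive programmatically, here's a sketch using the openai Python library (the 0.x API current as I write this) with stream=True - each streamed chunk usually corresponds to a single token:

import openai

for chunk in openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Five names for a pet pelican"}],
    stream=True,
):
    # Each chunk's delta usually carries one token's worth of text
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)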
Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking - 2023-06-04
I've fallen a bit behind on my weeknotes. Here's a catchup for the last few weeks.
Parquet in Datasette Lite
Datasette Lite is my build of Datasette (a server-side Python web application) which runs entirely in the browser using WebAssembly and Pyodide. I recently added the ability to directly load Parquet files over HTTP.
This required an upgrade to the underlying version of Pyodide, in order to use the WebAssembly compiled version of the fastparquet library. That upgrade was blocked by an AttributeError: module 'os' has no attribute 'link' error, but Roman Yurchak showed me a workaround which unblocked me.
So now the following works:
https://lite.datasette.io/?parquet=https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
This will work with any URL to a Parquet file that is served with open CORS headers - files on GitHub (or in a GitHub Gist) get these headers automatically.
Also new in Datasette Lite: the ?memory=1 query string option, which starts Datasette Lite without loading any default demo databases. I added this to help me construct this demo for my new datasette-sqlite-url-lite plugin:
https://lite.datasette.io/?memory=1&install=datasette-sqlite-url-lite#/_memory?sql=select+'url_valid()'+as+fn%2C+url_valid(%3Aurl)+as+result%0Aunion+all%0Aselect+'url_scheme()'%2C+url_scheme(%3Aurl)%0Aunion+all%0Aselect+'url_host()'%2C+url_host(%3Aurl)%0Aunion+all%0Aselect+'url_path()'%2C+url_path(%3Aurl)%0Aunion+all%0Aselect+'url_fragment()'%2C+url_fragment(%3Aurl)%3B&url=https%3A%2F%2Fwww.sqlite.org%2Fvtab.html%23usage
datasette-sqlite-url-lite - mostly written by GPT-4
datasette-sqlite-url is a really neat plugin by Alex Garcia which adds custom SQL functions to SQLite that allow you to parse URLs and extract their components.
There's just one catch: the extension itself is written in C, and there isn't yet a version of it compiled for WebAssembly to work in Datasette Lite.
I wanted to use some of the functions in it, so I decided to see if I could get a pure Python alternative working. This was a very low stakes project, so I decided to see if GPT-4 could do essentially all of the work for me.
I prompted it like this - copying and pasting the examples directly from Alex's documentation:
Write Python code to register the following SQLite custom functions:
select url_valid('https://sqlite.org'); -- 1
select url_scheme('https://www.sqlite.org/vtab.html#usage'); -- 'https'
select url_host('https://www.sqlite.org/vtab.html#usage'); -- 'www.sqlite.org'
select url_path('https://www.sqlite.org/vtab.html#usage'); -- '/vtab.html'
select url_fragment('https://www.sqlite.org/vtab.html#usage'); -- 'usage'
The code it produced was almost exactly what I needed.
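For flavour, code along the same lines looks something like this - a sketch of the approach using Python's standard library, not the exact code GPT-4 produced (see the repo for that):

import sqlite3
from urllib.parse import urlparse

def url_valid(url):
    # 1 if the URL parses with both a scheme and a host, otherwise 0
    parsed = urlparse(url)
    return 1 if parsed.scheme and parsed.netloc else 0

def register_url_functions(conn):
    # Register each helper as a single-argument SQL function
    conn.create_function("url_valid", 1, url_valid)
    conn.create_function("url_scheme", 1, lambda url: urlparse(url).scheme)
    conn.create_function("url_host", 1, lambda url: urlparse(url).netloc)
    conn.create_function("url_path", 1, lambda url: urlparse(url).path)
    conn.create_function("url_fragment", 1, lambda url: urlparse(url).fragment)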
I wanted some tests too, so I prompted:
Write a suite of pytest tests for this
This gave me the tests I needed - with one error in the way they called SQLite, but still doing 90% of the work for me.
Here's the full ChatGPT conversation and the resulting code I checked into the repo.
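For illustration, tests in that style might look something like this - a hypothetical sketch using the register_url_functions() helper from the sketch above, not the code GPT-4 generated:

import sqlite3
import pytest

@pytest.fixture
def conn():
    # Fresh in-memory database with the custom URL functions registered
    conn = sqlite3.connect(":memory:")
    register_url_functions(conn)
    return conn

def test_url_scheme(conn):
    result = conn.execute(
        "select url_scheme('https://www.sqlite.org/vtab.html#usage')"
    ).fetchone()[0]
    assert result == "https"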
Various talks
Videos for three of my recent talks are now available on YouTube:
Big Opportunities in Small Data is the keynote I gave at Citus Con: An Event for Postgres 2023 - talking about Datasette, SQLite and some tricks I would love to see the PostgreSQL community adopt from the explorations I've been doing around small data.
The Data Enthusiast's Toolkit is an hour-long interview with Rizel Scarlett about both Datasette and my career to date. Frustratingly, I had about 10 minutes of terrible microphone audio in the middle, but the conversation itself was really great.
Data analysis with SQLite and Python is a video from PyCon of the full 2hr45m tutorial I gave there last month. The handout notes for that are available online too.
I also spotted that the Changelog put up a video Just getting in to AI for development? Start here with an extract from our podcast episode LLMs break the internet.
Entries this week
It's infuriatingly hard to understand how closed models train on their input
Lawyer cites fake cases invented by ChatGPT, judge is not amused
llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs
Releases this week
datasette-sqlite-url-lite 0.1 - 2023-05-26
A pure Python alternative to sqlite-url ready to be used in Datasette Lite
sqlite-utils 3.32.1 - 2023-05-21
Python CLI utility and library for manipulating SQLite databases
strip-tags 0.3 - 2023-05-19
CLI tool for stripping tags from HTML
ttok 0.1 - 2023-05-18
Count and truncate text based on tokens
llm 0.3 - 2023-05-17
Access large language models from the command-line
TIL this week
Testing the Access-Control-Max-Age CORS header - 2023-05-25
Comparing two training datasets using sqlite-utils - 2023-05-23
mlc-chat - RedPajama-INCITE-Chat-3B on macOS - 2023-05-22
hexdump and hexdump -C - 2023-05-22
Exploring Baseline with Datasette Lite - 2023-05-12
Link 2023-06-05 Logan Kilpatrick (OpenAI): "The API does not just change without us telling you. The models are static there."
That's the official line on the ongoing questions concerning whether OpenAI's models have been degrading in quality over the last few weeks and months.
Worth noting that this mentions the API but doesn't mention ChatGPT itself, which I suspect gets model updates a lot more frequently than the models served through the API.
Quote 2023-06-05
If you give feedback that isn't constructive your feedback is worthless. I know that sounds harsh but it is. If you give unconstructive feedback you might as well not be saying anything. If you just look at something and go "That's stupid" or "I don't like that" - that's worthless feedback, nobody can do anything with that. They're not going to start throwing darts against the wall until you say "Oh OK, I like that". You have to say something more.
Link 2023-06-08 ChatGPT Plugins Don't Have PMF: Sam Altman was recently quoted (in a since-unpublished blog post) noting that ChatGPT plugins have not yet demonstrated product-market fit.
This matches my own usage patterns: I use the "browse" and "code interpreter" modes on a daily basis, but I've not found any of the third party developer plugins to stick for me yet.
I like Matt Rickard's observation here: "Chat is not the right UX for plugins. If you know what you want to do, it’s often easier to just do a few clicks on the website. If you don’t, just a chat interface makes it hard to steer the model toward your goal."
Link 2023-06-08 First Impressions of Vision Pro and VisionOS: John Gruber's description of his thirty minute Vision Pro demo includes a bunch of details I haven't seen described anywhere else, including how calibration and corrective lenses work and how precise and stable the overlays of additional information are.
Link 2023-06-08 Examples of weird GPT-4 behavior for the string " davidjl": GPT-4, when told to repeat or otherwise process the string " davidjl" (note the leading space character), treats it as "jndl" or "jspb" or "JDL" instead. It turns out " davidjl" has its own single token in the tokenizer: token ID 23282, presumably dating back to the GPT-2 days.
Riley Goodside refers to these as "glitch tokens".
This token might refer to Reddit user davidjl123 who ranks top of the league for the old /r/counting subreddit, with 163,477 posts there which presumably ended up in older training data.
TIL 2023-06-08 Consecutive groups in SQL using window functions:
I have a database table with all of my Swarm checkins since 2011, created using my swarm-to-sqlite tool. …
Link 2023-06-08 simpleaichat: Max Woolf released his own Python package for building against the GPT-3.5 and GPT-4 APIs (and potentially other LLMs in the future).
It's a very clean piece of API design with some useful additional features: there's an AsyncAIChat subclass that works with Python asyncio, and the library includes a mechanism for registering custom functions that can then be called by the LLM as tools.
One trick I haven't seen before: it uses a combination of max_tokens: 1 and a ChatGPT logit_bias to ensure that answers to one of its default prompts are restricted to just numerals between 0 and 9. This is described in the PROMPTS.md file.
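Here's a sketch of how that trick works in general - my own illustration using the openai 0.x API and tiktoken, not simpleaichat's actual implementation. The idea is to bias the ten digit tokens up so heavily that the single sampled token has to be one of them (this assumes each digit "0" through "9" is a single token in cl100k_base):

import openai
import tiktoken

# Build a logit_bias dictionary mapping each digit's token ID to the maximum bias
encoding = tiktoken.get_encoding("cl100k_base")
digit_bias = {encoding.encode(str(d))[0]: 100 for d in range(10)}

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Rate the sentiment of 'I love this' from 0 to 9"}],
    max_tokens=1,           # only one token can be generated...
    logit_bias=digit_bias,  # ...and it has to be one of the ten digits
)
print(response["choices"][0]["message"]["content"])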