In this newsletter:
Understanding GPT tokenizers
Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking
Plus 5 links and 1 quotation and 1 TIL
Understanding GPT tokenizers - 2023-06-08
Large language models such as GPT-3/4, LLaMA and PaLM work in terms of tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.
Playing around with these tokens is an interesting way to get a better idea for how this stuff actually works under the hood.
OpenAI offer a Tokenizer tool for exploring how tokens work.
I've built my own, slightly more interesting tool as an Observable notebook:
https://observablehq.com/@simonw/gpt-tokenizer
You can use the notebook to convert text to tokens, tokens to text and also to run searches against the full token table.
The text I'm tokenizing here is:
The dog eats the apples El perro come las manzanas 片仮名
This produces 21 integer tokens: 5 for the English text, 8 for the Spanish text and 6 (2 each) for those three Japanese characters. The two newlines are each represented by a token as well.
The notebook uses the tokenizer from GPT-2 (borrowing from this excellent notebook by EJ Fox and Ian Johnson), so it's useful primarily as an educational tool - there are differences between how it works and the latest tokenizers for GPT-3 and above.
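You can see some of those differences for yourself with the tiktoken library (covered later in this post); here's a minimal sketch comparing the GPT-2 scheme against the cl100k_base scheme used by GPT-4 - expect different token IDs and often a different count:

import tiktoken

text = "The dog eats the apples"
for name in ("gpt2", "cl100k_base"):
    encoding = tiktoken.get_encoding(name)
    tokens = encoding.encode(text)
    # Newer tokenizers assign different IDs and often use fewer tokens
    print(name, len(tokens), tokens)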
Exploring some interesting tokens
Playing with the tokenizer reveals all sorts of interesting patterns.
Most common English words are assigned a single token. As demonstrated above:
"The": 464
" dog": 3290
" eats": 25365
" the": 262
" apples": 22514
Note that capitalization is important here. "The" with a capital T is token 464, but " the" with both a leading space and a lowercase t is token 262.
Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.
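You can check both of these behaviours with tiktoken's gpt2 encoding, which matches the notebook - a quick sketch:

import tiktoken

encoding = tiktoken.get_encoding("gpt2")
print(encoding.encode("The"))   # [464] - capital T, no leading space
print(encoding.encode(" the"))  # [262] - leading space, lowercase t
print(encoding.encode("the"))   # lowercase with no space - a different token again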
Languages other than English suffer from less efficient tokenization.
"El perro come las manzanas" in Spanish is encoded like this:
"El": 9527
" per": 583
"ro": 305
" come": 1282
" las": 39990
" man": 582
"zan": 15201
"as": 292
The English bias is obvious here. " man" gets a lower token ID of 582, because it's an English word. "zan" gets a token ID of 15201 because it's not a word that stands alone in English, but is a common enough sequence of characters that it still warrants its own token.
Some languages even have single characters that end up encoding to multiple tokens, such as these Japanese characters:
片: 31965 229
仮: 20015 106
名: 28938 235
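The same sketch approach shows those multi-token characters directly, again assuming tiktoken's gpt2 encoding matches the notebook:

import tiktoken

encoding = tiktoken.get_encoding("gpt2")
for char in "片仮名":
    # Each character encodes to two tokens, matching the IDs listed above
    print(char, encoding.encode(char))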
Glitch tokens
A fascinating subset of tokens are what are known as "glitch tokens". My favourite example of those is token 23282 - " davidjl".
We can find that token by searching for "david" using the search box in the notebook.
Riley Goodside highlighted some weird behaviour with that token.
Why this happens is an intriguing puzzle.
It looks likely that this token refers to user davidjl123 on Reddit, a keen member of the /r/counting subreddit. He's posted incremented numbers there well over 163,000 times.
Presumably that subreddit ended up in the training data used to create the tokenizer used by GPT-2, and since that particular username showed up hundreds of thousands of times it ended up getting its own token.
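If you want to poke at that token yourself, here's a sketch using tiktoken (described in the next section), assuming its gpt2 encoding matches the notebook's token table:

import tiktoken

encoding = tiktoken.get_encoding("gpt2")
print(encoding.decode([23282]))  # ' davidjl'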
But why would that break things like this? The best theory I've seen so far came from londons_explore on Hacker News:
These glitch tokens are all near the centroid of the token embedding space. That means that the model cannot really differentiate between these tokens and the others equally near the center of the embedding space, and therefore when asked to 'repeat' them, gets the wrong one.
That happened because the tokens were on the internet many millions of times (the davidjl user has 163,000 posts on reddit simply counting increasing numbers), yet the tokens themselves were never hard to predict (and therefore while training, the gradients became nearly zero, and the embedding vectors decayed to zero, which some optimizers will do when normalizing weights).
The conversation attached to the post SolidGoldMagikarp (plus, prompt generation) on LessWrong has a great deal more detail on this phenomenon.
Counting tokens with tiktoken
OpenAI's models each have a token limit. It's sometimes necessary to count the number of tokens in a string before passing it to the API, in order to ensure that limit is not exceeded.
One technique that needs this is Retrieval Augmented Generation, where you answer a user's question by running a search (or an embedding search) against a corpus of documents, extract the most relevant content and include that as context in a prompt.
The key to successfully implementing that pattern is to include as much relevant context as will fit within the token limit - so you need to be able to count tokens.
OpenAI provide a Python library for doing this called tiktoken.
If you dig around inside the library you'll find it currently includes five different tokenization schemes: r50k_base, p50k_base, p50k_edit, cl100k_base and gpt2.
Of these, cl100k_base is the most relevant, being the tokenizer for both GPT-4 and the inexpensive gpt-3.5-turbo model used by current ChatGPT. p50k_base is used by text-davinci-003. A full mapping of models to tokenizers can be found in the MODEL_TO_ENCODING dictionary in tiktoken/model.py.
Here's how to use tiktoken:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
# or "gpt-3.5-turbo" or "text-davinci-003"
tokens = encoding.encode("Here is some text")
token_count = len(tokens)
tokens will now be an array of four integer token IDs - [8586, 374, 1063, 1495] in this case.
Use the .decode() method to turn an array of token IDs back into text:
text = encoding.decode(tokens)
# 'Here is some text'
The first time you call encoding_for_model() the encoding data will be fetched over HTTP from an openaipublic.blob.core.windows.net Azure blob storage bucket (code here). This is cached in a temp directory, but that will get cleared should your machine restart. You can force it to use a more persistent cache directory by setting a TIKTOKEN_CACHE_DIR environment variable.
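Putting this together with the Retrieval Augmented Generation pattern described earlier, here's a sketch of packing documents into a prompt until a token budget runs out. The fit_documents() helper and the 4,000 token budget are hypothetical illustrations, not part of tiktoken:

import tiktoken

def fit_documents(documents, model="gpt-4", budget=4000):
    # Accumulate documents in order until the token budget would be exceeded
    encoding = tiktoken.encoding_for_model(model)
    selected = []
    used = 0
    for doc in documents:
        cost = len(encoding.encode(doc))
        if used + cost > budget:
            break
        selected.append(doc)
        used += cost
    return "\n\n".join(selected)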
ttok
I introduced my ttok tool a few weeks ago. It's a command-line wrapper around tiktoken with two key features: it can count tokens in text that is piped to it, and it can also truncate that text down to a specified number of tokens:
# Count tokens
echo -n "Count these tokens" | ttok
# Outputs: 3 (the newline is skipped thanks to echo -n)
# Truncation
curl 'https://simonwillison.net/' | strip-tags -m | ttok -t 6
# Outputs: Simon Willison’s Weblog
# View integer token IDs
echo "Show these tokens" | ttok --tokens
# Outputs: 7968 1521 11460 198
Use -m gpt2 or similar to use an encoding for a different model.
Watching tokens get generated
Once you understand tokens, the way GPT tools generate text starts to make a lot more sense.
In particular, it's fun to watch GPT-4 streaming back its output as independent tokens (GPT-4 is slightly slower than 3.5, making it easier to see what's going on).
Here's what I get for llm -s 'Five names for a pet pelican' -4 - using my llm CLI tool to generate text from GPT-4:
As you can see, names that are not in the dictionary such as "Pelly" take multiple tokens, but "Captain Gulliver" outputs the token "Captain" as a single chunk.
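If you want to watch those tokens arrive programmatically, here's a sketch using the openai Python library (the 0.x API current as I write this) with stream=True - each streamed chunk usually corresponds to a single token:

import openai

for chunk in openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Five names for a pet pelican"}],
    stream=True,
):
    # Each chunk's delta usually carries one token's worth of text
    print(chunk["choices"][0]["delta"].get("content", ""), end="", flush=True)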
Weeknotes: Parquet in Datasette Lite, various talks, more LLM hacking - 2023-06-04
I've fallen a bit behind on my weeknotes. Here's a catchup for the last few weeks.
Parquet in Datasette Lite
Datasette Lite is my build of Datasette (a server-side Python web application) which runs entirely in the browser using WebAssembly and Pyodide. I recently added the ability to directly load Parquet files over HTTP.
This required an upgrade to the underlying version of Pyodide, in order to use the WebAssembly compiled version of the fastparquet library. That upgrade was blocked by an AttributeError: module 'os' has no attribute 'link' error, but Roman Yurchak showed me a workaround which unblocked me.
So now the following works:
https://lite.datasette.io/?parquet=https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
This will work with any URL to a Parquet file that is served with open CORS headers - files on GitHub (or in a GitHub Gist) get these headers automatically.
Also new in Datasette Lite: the ?memory=1 query string option, which starts Datasette Lite without loading any default demo databases. I added this to help me construct this demo for my new datasette-sqlite-url-lite plugin:
https://lite.datasette.io/?memory=1&install=datasette-sqlite-url-lite#/_memory?sql=select+'url_valid()'+as+fn%2C+url_valid(%3Aurl)+as+result%0Aunion+all%0Aselect+'url_scheme()'%2C+url_scheme(%3Aurl)%0Aunion+all%0Aselect+'url_host()'%2C+url_host(%3Aurl)%0Aunion+all%0Aselect+'url_path()'%2C+url_path(%3Aurl)%0Aunion+all%0Aselect+'url_fragment()'%2C+url_fragment(%3Aurl)%3B&url=https%3A%2F%2Fwww.sqlite.org%2Fvtab.html%23usage
datasette-sqlite-url-lite - mostly written by GPT-4
datasette-sqlite-url is a really neat plugin by Alex Garcia which adds custom SQL functions to SQLite that allow you to parse URLs and extract their components.
There's just one catch: the extension itself is written in C, and there isn't yet a version of it compiled for WebAssembly to work in Datasette Lite.
I wanted to use some of the functions in it, so I decided to see if I could get a pure Python alternative working. This was a very low stakes project, so I decided to see if GPT-4 could do essentially all of the work for me.
I prompted it like this - copying and pasting the examples directly from Alex's documentation:
Write Python code to register the following SQLite custom functions:
select url_valid('https://sqlite.org'); -- 1
select url_scheme('https://www.sqlite.org/vtab.html#usage'); -- 'https'
select url_host('https://www.sqlite.org/vtab.html#usage'); -- 'www.sqlite.org'
select url_path('https://www.sqlite.org/vtab.html#usage'); -- '/vtab.html'
select url_fragment('https://www.sqlite.org/vtab.html#usage'); -- 'usage'
The code it produced was almost exactly what I needed.
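For flavour, code along the same lines looks something like this - a sketch of the approach using Python's standard library, not the exact code GPT-4 produced (see the repo for that):

import sqlite3
from urllib.parse import urlparse

def url_valid(url):
    # 1 if the URL parses with both a scheme and a host, otherwise 0
    parsed = urlparse(url)
    return 1 if parsed.scheme and parsed.netloc else 0

def register_url_functions(conn):
    # Register each helper as a single-argument SQL function
    conn.create_function("url_valid", 1, url_valid)
    conn.create_function("url_scheme", 1, lambda url: urlparse(url).scheme)
    conn.create_function("url_host", 1, lambda url: urlparse(url).netloc)
    conn.create_function("url_path", 1, lambda url: urlparse(url).path)
    conn.create_function("url_fragment", 1, lambda url: urlparse(url).fragment)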
I wanted some tests too, so I prompted:
Write a suite of pytest tests for this
This gave me the tests I needed - with one error in the way they called SQLite, but still doing 90% of the work for me.
Here's the full ChatGPT conversation and the resulting code I checked into the repo.
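For illustration, tests in that style might look something like this - a hypothetical sketch using the register_url_functions() helper from the sketch above, not the code GPT-4 generated:

import sqlite3
import pytest

@pytest.fixture
def conn():
    # Fresh in-memory database with the custom URL functions registered
    conn = sqlite3.connect(":memory:")
    register_url_functions(conn)
    return conn

def test_url_scheme(conn):
    result = conn.execute(
        "select url_scheme('https://www.sqlite.org/vtab.html#usage')"
    ).fetchone()[0]
    assert result == "https"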
Various talks
Videos for three of my recent talks are now available on YouTube:
Big Opportunities in Small Data is the keynote I gave at Citus Con: An Event for Postgres 2023 - talking about Datasette, SQLite and some tricks I would love to see the PostgreSQL community adopt from the explorations I've been doing around small data.
The Data Enthusiast's Toolkit is an hour-long interview with Rizel Scarlett about both Datasette and my career to date. Frustratingly, I had about 10 minutes of terrible microphone audio in the middle, but the conversation itself was really great.
Data analysis with SQLite and Python is a video from PyCon of the full 2hr45m tutorial I gave there last month. The handout notes for that are available online too.
I also spotted that the Changelog put up a video Just getting in to AI for development? Start here with an extract from our podcast episode LLMs break the internet.
Entries this week
It's infuriatingly hard to understand how closed models train on their input
Lawyer cites fake cases invented by ChatGPT, judge is not amused
llm, ttok and strip-tags - CLI tools for working with ChatGPT and other LLMs
Releases this week
datasette-sqlite-url-lite 0.1 - 2023-05-26
A pure Python alternative to sqlite-url ready to be used in Datasette Lite
sqlite-utils 3.32.1 - 2023-05-21
Python CLI utility and library for manipulating SQLite databases
strip-tags 0.3 - 2023-05-19
CLI tool for stripping tags from HTML
ttok 0.1 - 2023-05-18
Count and truncate text based on tokens
llm 0.3 - 2023-05-17
Access large language models from the command-line
TIL this week
Testing the Access-Control-Max-Age CORS header - 2023-05-25
Comparing two training datasets using sqlite-utils - 2023-05-23
mlc-chat - RedPajama-INCITE-Chat-3B on macOS - 2023-05-22
hexdump and hexdump -C - 2023-05-22
Exploring Baseline with Datasette Lite - 2023-05-12
Link 2023-06-05 Logan Kilpatrick (OpenAI): "The API does not just change without us telling you. The models are static there."
That's the official line on the ongoing questions concerning whether OpenAI's models have been degrading in quality over the last few weeks and months.
Worth noting that this mentions the API but doesn't mention ChatGPT itself, which I suspect gets model updates a lot more frequently than the models served through the API.
Quote 2023-06-05
If you give feedback that isn't constructive your feedback is worthless. I know that sounds harsh but it is. If you give unconstructive feedback you might as well not be saying anything. If you just look at something and go "That's stupid" or "I don't like that" - that's worthless feedback, nobody can do anything with that. They're not going to start throwing darts against the wall until you say "Oh OK, I like that". You have to say something more.
Link 2023-06-08 ChatGPT Plugins Don't Have PMF: Sam Altman was recently quoted (in a since-unpublished blog post) noting that ChatGPT plugins have not yet demonstrated product-market fit.
This matches my own usage patterns: I use the "browse" and "code interpreter" modes on a daily basis, but I've not found any of the third party developer plugins to stick for me yet.
I like Matt Rickard's observation here: "Chat is not the right UX for plugins. If you know what you want to do, it’s often easier to just do a few clicks on the website. If you don’t, just a chat interface makes it hard to steer the model toward your goal."
Link 2023-06-08 First Impressions of Vision Pro and VisionOS: John Gruber's description of his thirty minute Vision Pro demo includes a bunch of details I haven't seen described anywhere else, including how calibration and corrective lenses work and how precise and stable the overlays of additional information are.
Link 2023-06-08 Examples of weird GPT-4 behavior for the string " davidjl": GPT-4, when told to repeat or otherwise process the string " davidjl" (note the leading space character), treats it as "jndl" or "jspb" or "JDL" instead. It turns out " davidjl" has its own single token in the tokenizer: token ID 23282, presumably dating back to the GPT-2 days.
Riley Goodside refers to these as "glitch tokens".
This token might refer to Reddit user davidjl123 who ranks top of the league for the old /r/counting subreddit, with 163,477 posts there which presumably ended up in older training data.
TIL 2023-06-08 Consecutive groups in SQL using window functions:
I have a database table with all of my Swarm checkins since 2011, created using my swarm-to-sqlite tool. …
Link 2023-06-08 simpleaichat: Max Woolf released his own Python package for building against the GPT-3.5 and GPT-4 APIs (and potentially other LLMs in the future).
It's a very clean piece of API design with some useful additional features: there's an AsyncAIChat subclass that works with Python asyncio, and the library includes a mechanism for registering custom functions that can then be called by the LLM as tools.
One trick I haven't seen before: it uses a combination of max_tokens: 1 and a ChatGPT logit_bias to ensure that answers to one of its default prompts are restricted to just numerals between 0 and 9. This is described in the PROMPTS.md file.
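Here's a sketch of how that trick works in general - my own illustration using the openai 0.x API and tiktoken, not simpleaichat's actual implementation. The idea is to bias the ten digit tokens up so heavily that the single sampled token has to be one of them (this assumes each digit "0" through "9" is a single token in cl100k_base):

import openai
import tiktoken

# Build a logit_bias dictionary mapping each digit's token ID to the maximum bias
encoding = tiktoken.get_encoding("cl100k_base")
digit_bias = {encoding.encode(str(d))[0]: 100 for d in range(10)}

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Rate the sentiment of 'I love this' from 0 to 9"}],
    max_tokens=1,           # only one token can be generated...
    logit_bias=digit_bias,  # ...and it has to be one of the ten digits
)
print(response["choices"][0]["message"]["content"])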