Training is not the same as chatting: ChatGPT and other LLMs don't remember everything you say
Plus notes from PyCon and LLM support for GPT-4o and Gemini Flash
In this newsletter:
Training is not the same as chatting: ChatGPT and other LLMs don't remember everything you say
Weeknotes: PyCon US 2024
Plus 25 links and 10 quotations and 2 TILs
Training is not the same as chatting: ChatGPT and other LLMs don't remember everything you say - 2024-05-29
I'm beginning to suspect that one of the most common misconceptions about LLMs such as ChatGPT involves how "training" works.
A common complaint I see about these tools is that people don't want to even try them out because they don't want to contribute to their training data.
This is by no means an irrational position to take, but it does often correspond to an incorrect mental model about how these tools work.
Short version: ChatGPT and other similar tools do not directly learn from and memorize everything that you say to them.
This can be quite unintuitive: these tools imitate a human conversational partner, and humans constantly update their knowledge based on what you say to them. Computers have much better memory than humans, so surely ChatGPT would remember every detail of everything you ever say to it. Isn't that what "training" means?
That's not how these tools work.
LLMs are stateless functions
From a computer science point of view, it's best to think of LLMs as stateless function calls. Given this input text, what should come next?
In the case of a "conversation" with a chatbot such as ChatGPT or Claude or Google Gemini, that function input consists of the current conversation (everything said by both the human and the bot) up to that point, plus the user's new prompt.
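Here's a minimal sketch of that pattern, using the OpenAI Python client as an example - the model name and message handling here are illustrative, not how ChatGPT itself is implemented, but any chat completion API works the same way:

```python
# Minimal sketch: a "conversation" is just repeated stateless calls,
# each one sending the full transcript so far plus the new prompt.
from openai import OpenAI

client = OpenAI()
messages = []  # the entire conversation lives on the client side

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})
    # Every call sends the FULL transcript - the model keeps no state between calls
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Starting a new conversation is the equivalent of throwing away that messages list and beginning again with an empty one.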
Every time you start a new chat conversation, you clear the slate. Each conversation is an entirely new sequence, carried out entirely independently of previous conversations from both yourself and other users.
Understanding this is key to working effectively with these models. Every time you hit "new chat" you are effectively wiping the short-term memory of the model, starting again from scratch.
This has a number of important consequences:
There is no point at all in "telling" a model something in order to improve its knowledge for future conversations. I've heard from people who have invested weeks of effort pasting new information into ChatGPT sessions to try and "train" a better bot. That's a waste of time!
Understanding this helps explain why the "context length" of a model is so important. Different LLMs have different context lengths, expressed in terms of "tokens" - a token is about 3/4 of a word. This is the number that tells you how much of a conversation the bot can consider at any one time. If your conversation goes past this point the model will "forget" details that occurred at the beginning of the conversation (there's a token-counting sketch at the end of this list).
Sometimes it's a good idea to start a fresh conversation in order to deliberately reset the model. If a model starts making obvious mistakes, or refuses to respond to a valid question for some weird reason, that reset might get it back on the right track.
Tricks like Retrieval Augmented Generation and ChatGPT's "memory" make sense only once you understand this fundamental limitation to how these models work.
If you're excited about local models because you can be certain there's no way they can train on your data, you're mostly right: you can run them offline and audit your network traffic to be absolutely sure your data isn't being uploaded to a server somewhere. But...
... if you're excited about local models because you want something on your computer that you can chat to and it will learn from you and then better respond to your future prompts, that's probably not going to work.
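Here's the rough shape of the token-counting idea mentioned above, using the tiktoken library. The 8,000 token limit is arbitrary - real models each have their own - and this is a sketch of the general approach, not any particular vendor's implementation:

```python
# Rough sketch of why context length matters: once a conversation exceeds the
# limit, the oldest messages have to be dropped before the next call.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def truncate_conversation(messages: list[str], max_tokens: int = 8000) -> list[str]:
    kept, total = [], 0
    # Walk backwards from the most recent message; older messages fall off first
    for message in reversed(messages):
        tokens = len(encoding.encode(message))
        if total + tokens > max_tokens:
            break
        kept.append(message)
        total += tokens
    return list(reversed(kept))
```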
So what is "training" then?
When we talk about model training, we are talking about the process that was used to build these models in the first place.
As a big simplification, there are two phases to this. The first is to pile in several TBs of text - think all of Wikipedia, a scrape of a large portion of the web, books, newspapers, academic papers and more - and spend months of time and potentially millions of dollars in electricity crunching through that "pre-training" data identifying patterns in how the words relate to each other.
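To give a flavor of what "identifying patterns" means, here's a wildly over-simplified toy version: count which word tends to follow which, then sample from those counts. Real pre-training learns billions of floating point parameters, not a dictionary of counts, but the spirit is similar:

```python
import random
from collections import Counter, defaultdict

def train(text: str) -> dict[str, Counter]:
    # Count how often each word follows each other word
    counts: dict[str, Counter] = defaultdict(Counter)
    words = text.split()
    for current_word, next_word in zip(words, words[1:]):
        counts[current_word][next_word] += 1
    return counts

def complete(counts: dict[str, Counter], word: str, length: int = 10) -> str:
    # "Complete the sentence" by repeatedly sampling a likely next word
    output = [word]
    for _ in range(length):
        options = counts.get(output[-1])
        if not options:
            break
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        output.append(next_word)
    return " ".join(output)
```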
This gives you a model that can complete sentences, but not necessarily in a way that will delight and impress a human conversational partner. The second phase aims to fix that - this can incorporate instruction tuning or Reinforcement Learning from Human Feedback (RLHF) which has the goal of teaching the model to pick the best possible sequences of words to have productive conversations.
The end result of these phases is the model itself - an enormous (many GB) blob of floating point numbers that capture both the statistical relationships between the words and some version of "taste" in terms of how best to assemble new words to reply to a user's prompts.
Once trained, the model remains static and unchanged - sometimes for months or even years.
Here's a note from Jason D. Clinton, an engineer who works on Claude 3 at Anthropic:
The model is stored in a static file and loaded, continuously, across 10s of thousands of identical servers each of which serve each instance of the Claude model. The model file never changes and is immutable once loaded; every shard is loading the same model file running exactly the same software.
These models don't change very often!
Reasons to worry anyway
A frustrating thing about this issue is that it isn't actually possible to confidently state "don't worry, ChatGPT doesn't train on your input".
Many LLM providers have terms and conditions that allow them to improve their models based on the way you are using them. Even when they have opt-out mechanisms these are often opted-in by default.
When OpenAI say "We may use Content to provide, maintain, develop, and improve our Services" it's not at all clear what they mean by that!
Are they storing up everything anyone says to their models and dumping that into the training run for their next model versions every few months?
I don't think it's that simple: LLM providers don't want random low-quality text or privacy-invading details making it into their training data. But they are notoriously secretive, so who knows for sure?
The opt-out mechanisms are also pretty confusing. OpenAI try to make it as clear as possible that they won't train on any content submitted through their API (so you had better understand what an "API" is), but lots of people don't believe them! I wrote about the AI trust crisis last year: the pattern where many people actively disbelieve model vendors and application developers (such as Dropbox and Slack) that claim they don't train models on private data.
People also worry that those terms might change in the future. There are options to protect against that: if you're spending enough money you can sign contracts with OpenAI and other vendors that freeze the terms and conditions.
If your mental model is that LLMs remember and train on all input, it's much easier to assume that developers who claim they've disabled that ability may not be telling the truth. If you tell your human friend to disregard a juicy piece of gossip you've mistakenly passed on to them you know full well that they're not going to forget it!
The other major concern is the same as with any cloud service: it's reasonable to assume that your prompts are still logged for a period of time, for compliance and abuse reasons, and if that data is logged there's always a chance of exposure thanks to an accidental security breach.
What about "memory" features?
To make things even more confusing, some LLM tools are introducing features that attempt to work around this limitation.
ChatGPT recently added a memory feature where it can "remember" small details and use them in follow-up conversations.
As with so many LLM features this is a relatively simple prompting trick: during a conversation the bot can call a mechanism to record a short note - your name, or a preference you have expressed - which will then be invisibly included in the chat context passed to future conversations.
You can review (and modify) the list of remembered fragments at any time, and ChatGPT shows a visible UI element any time it adds to its memory.
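Under the hood the pattern is probably something like this toy sketch - the function names here are my own invention, not OpenAI's actual internals:

```python
# Toy sketch of a "memory" feature layered on top of a stateless model
memories: list[str] = []  # persisted between conversations, e.g. in a database

def save_memory(note: str) -> None:
    # Called by the bot when it decides something is worth remembering
    memories.append(note)

def build_context(conversation: list[dict], user_message: str) -> list[dict]:
    # The remembered notes are quietly prepended to every future conversation
    memory_block = "\n".join(f"- {m}" for m in memories)
    system = {"role": "system", "content": f"Things you know about this user:\n{memory_block}"}
    return [system, *conversation, {"role": "user", "content": user_message}]
```

The model itself still learns nothing - the "memory" is just extra text pasted into its input.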
Bad policy based on bad mental models
One of the most worrying results of this common misconception concerns people who make policy decisions for how LLM tools should be used.
Does your company ban all use of LLMs because they don't want their private data leaked to the model providers?
They're not 100% wrong - see reasons to worry anyway - but if they are acting based on the idea that everything said to a model is instantly memorized and could be used in responses to other users they're acting on faulty information.
Even more concerning is what happens with lawmakers. How many politicians around the world are debating and voting on legislation involving these models based on a science fiction idea of what they are and how they work?
If people believe ChatGPT is a machine that instantly memorizes and learns from everything anyone says to it there is a very real risk they will support measures that address invented as opposed to genuine risks involving this technology.
Weeknotes: PyCon US 2024 - 2024-05-28
Earlier this month I attended PyCon US 2024 in Pittsburgh, Pennsylvania. I gave an invited keynote on the Saturday morning titled "Imitation intelligence", tying together much of what I've learned about Large Language Models over the past couple of years and making the case that the Python community has a unique opportunity and responsibility to help try to nudge this technology in a positive direction.
The video isn't out yet but I'll publish detailed notes to accompany my talk (using my annotated presentation format) as soon as it goes live on YouTube.
PyCon was a really great conference. Pittsburgh is a fantastic city, and I'm delighted that PyCon will be in the same venue next year so I can really take advantage of the opportunity to explore in more detail.
I also realized that it's about time Datasette participated in the PyCon sprints - the project is mature enough for that to be a really valuable opportunity now. I'm looking forward to leaning into that next year.
I'm on a family-visiting trip back to the UK at the moment, so taking a bit of time off from my various projects.
LLM support for new models
The big new language model releases from May were OpenAI's GPT-4o and Google's Gemini Flash. I released LLM 0.14, datasette-extract 0.1a7 and datasette-enrichments-gpt 0.5 with support for GPT-4o, and llm-gemini 0.1a4, adding support for the new inexpensive Gemini 1.5 Flash.
Gemini 1.5 Flash is a particularly interesting model: it's now ranked 9th on the LMSYS leaderboard, beating Llama 3 70b. It's inexpensive, priced close to Claude 3 Haiku, and can handle up to a million tokens of context.
I'm also excited about GPT-4o - half the price of GPT-4 Turbo, around twice as fast and it appears to be slightly more capable too. I've been getting particularly good results from it for structured data extraction using datasette-extract - it seems to be able to more reliably produce a longer sequence of extracted rows from a given input.
Blog entries
Releases
datasette-permissions-metadata 0.1 - 2024-05-15
Configure permissions for Datasette 0.x in metadata.json
datasette-enrichments-gpt 0.5 - 2024-05-15
Datasette enrichment for analyzing row data using OpenAI's GPT models
datasette-extract 0.1a7 - 2024-05-15
Import unstructured data (text and images) into structured tables
llm-gemini 0.1a4 - 2024-05-14
LLM plugin to access Google's Gemini family of models
llm 0.14 - 2024-05-13
Access large language models from the command-line
TILs
Listen to a web page in Mobile Safari - 2024-05-21
How I studied for my Ham radio general exam - 2024-05-11
Link 2024-05-17 Programming mantras are proverbs:
I like this idea from Luke Plant that the best way to think about mantras like "Don’t Repeat Yourself" is to think of them as proverbs that can be accompanied by an equal and opposite proverb.
DRY, "Don't Repeat Yourself" matches with WET, "Write Everything Twice".
Proverbs as tools for thinking, not laws to be followed.
Link 2024-05-17 PSF announces a new five year commitment from Fastly:
Fastly have been donating CDN resources to Python - most notably to the PyPI package index - for ten years now.
The PSF just announced at PyCon US that Fastly have agreed to a new five year commitment. This is a really big deal, because it addresses the strategic risk of having a key sponsor like this who might change their support policy based on unexpected future conditions.
Thanks, Fastly. Very much appreciated!
Quote 2024-05-17
I have seen the extremely restrictive off-boarding agreement that contains nondisclosure and non-disparagement provisions former OpenAI employees are subject to. It forbids them, for the rest of their lives, from criticizing their former employer. Even acknowledging that the NDA exists is a violation of it.
If a departing employee declines to sign the document, or if they violate it, they can lose all vested equity they earned during their time at the company, which is likely worth millions of dollars.
Link 2024-05-17 Commit: Add a shared credentials relationship from twitter.com to x.com:
A commit to shared-credentials.json in Apple's password-manager-resources repository. Commit message: "Pour one out."
Link 2024-05-17 Understand errors and warnings better with Gemini:
As part of Google's Gemini-in-everything strategy, Chrome DevTools now includes an opt-in feature for passing error messages in the JavaScript console to Gemini for an explanation, via a lightbulb icon.
Amusingly, this documentation page includes a warning about prompt injection:
Many of LLM applications are susceptible to a form of abuse known as prompt injection. This feature is no different. It is possible to trick the LLM into accepting instructions that are not intended by the developers.
They include a screenshot of a harmless example, but I'd be interested in hearing if anyone has a theoretical attack that could actually cause real damage here.
Quote 2024-05-18
I rewrote it [the Oracle of Bacon] in Rust in January 2023 when I switched over to TMDB as a data source. The new data source was a deep change, and I didn’t want the headache of building it in the original 1990s-era C codebase.
Link 2024-05-18 AI counter app from my PyCon US keynote:
In my keynote at PyCon US this morning I ran a counter at the top of my screen that automatically incremented every time I said the words "AI" or "artificial intelligence", using vosk, pyaudio and Tkinter. I wrote it in a few minutes with the help of GPT-4o - here's the code I ran as a GitHub repository.
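For a flavor of the approach, here's a stripped-down sketch of the speech-to-text counting loop - simplified, skipping the Tkinter overlay, and not the exact code in that repo:

```python
import json
import pyaudio
from vosk import Model, KaldiRecognizer

model = Model(lang="en-us")  # downloads a small English model on first use;
                             # older vosk versions take a model directory path instead
recognizer = KaldiRecognizer(model, 16000)

audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000,
                    input=True, frames_per_buffer=8000)
count = 0
while True:
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        text = json.loads(recognizer.Result()).get("text", "")
        # Increment for each mention of "ai" or "artificial intelligence"
        count += text.split().count("ai") + text.count("artificial intelligence")
        print(count, text)
```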
I'll publish full detailed notes from my talk once the video is available on YouTube.
Link 2024-05-19 A Plea for Sober AI:
Great piece by Drew Breunig: "Imagine having products THIS GOOD and still over-selling them."
Link 2024-05-19 Fast groq-hosted LLMs vs browser jank:
Groq is now serving LLMs such as Llama 3 so quickly that JavaScript which attempts to render Markdown strings on every new token can cause performance issues in browsers.
Taras Glek's solution was to move the rendering to a requestAnimationFrame() callback, effectively buffering the rendering to the fastest rate the browser can support.
Link 2024-05-19 NumFOCUS DISCOVER Cookbook: Minimal Measures:
NumFOCUS publish a guide "for organizers of conferences and events to support and encourage diversity and inclusion at those events."
It includes this useful collection of the easiest and most impactful measures that events can put in place, covering topics such as accessibility, speaker selection, catering and provision of gender-neutral restrooms.
Link 2024-05-19 Spam, junk … slop? The latest wave of AI behind the ‘zombie internet’:
I'm quoted in this piece in the Guardian about slop:
I think having a name for this is really important, because it gives people a concise way to talk about the problem.
Before the term ‘spam’ entered general use it wasn’t necessarily clear to everyone that unwanted marketing messages were a bad way to behave. I’m hoping ‘slop’ has the same impact – it can make it clear to people that generating and publishing unreviewed AI-generated content is bad behaviour.
Link 2024-05-20 CRDT: Text Buffer:
Delightfully short and clear explanation of the CRDT approach to collaborative text editing by Evan Wallace (of Figma and esbuild fame), including a neat interactive demonstration of how the algorithm works even when the network connection between peers is temporarily paused.
Quote 2024-05-20
Last September, I received an offer from Sam Altman, who wanted to hire me to voice the current ChatGPT 4.0 system. He told me that he felt that by my voicing the system, I could bridge the gap between tech companies and creatives and help consumers to feel comfortable with the seismic shift concerning humans and AI. He said he felt that my voice would be comforting to people. After much consideration and for personal reasons, I declined the offer.
Link 2024-05-21 Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet:
Big advances in the field of LLM interpretability from Anthropic, who managed to extract millions of understandable features from their production Claude 3 Sonnet model (the mid-point between the inexpensive Haiku and the GPT-4-class Opus).
Some delightful snippets in here such as this one:
We also find a variety of features related to sycophancy, such as an empathy / “yeah, me too” feature 34M/19922975, a sycophantic praise feature 1M/847723, and a sarcastic praise feature 34M/19415708.
Link 2024-05-21 New Phi-3 models: small, medium and vision:
I couldn't find a good official announcement post to link to about these three newly released models, but this post on LocalLLaMA on Reddit has them in one place: Phi-3 small (7B), Phi-3 medium (14B) and Phi-3 vision (4.2B) (the previously released model was Phi-3 mini - 3.8B).
You can try out the vision model directly here, no login required. It didn't do a great job with my first test image though, hallucinating the text.
As with Mini these are all released under an MIT license.
UPDATE: Here's a page from the newly published Phi-3 Cookbook describing the models in the family.
TIL 2024-05-21 Listen to a web page in Mobile Safari:
I found a better way to listen to a whole web page through text-to-speech on Mobile Safari today. …
Link 2024-05-22 Mastering LLMs: A Conference For Developers & Data Scientists:
I'm speaking at this 5-week (maybe soon 6-week) long online conference about LLMs, presenting about "LLMs on the command line".
Other speakers include Jeremy Howard, Sophia Yang from Mistral, Wing Lian of Axolotl, Jason Liu of Instructor, Paige Bailey from Google, my former co-worker John Berryman and a growing number of fascinating LLM practitioners.
It's been fun watching this grow from a short course on fine-tuning LLMs to a full-blown multi-week conference over the past few days!
Quote 2024-05-22
The default prefix used to be "sqlite_". But then Mcafee started using SQLite in their anti-virus product and it started putting files with the "sqlite" name in the c:/temp folder. This annoyed many windows users. Those users would then do a Google search for "sqlite", find the telephone numbers of the developers and call to wake them up at night and complain. For this reason, the default name prefix is changed to be "sqlite" spelled backwards.
Link 2024-05-22 What is prompt optimization?:
Delightfully clear explanation of a simple automated prompt optimization strategy from Jason Liu. Gather a selection of examples and build an evaluation function to return a numeric score (the hard bit). Then try different shuffled subsets of those examples in your prompt and look for the example collection that provides the highest averaged score.
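A hedged sketch of that loop - score_prompt() below is a placeholder for the hard part, your own evaluation function:

```python
import random

def optimize_examples(examples: list[str], score_prompt, subset_size: int = 5, trials: int = 20):
    # Try random subsets of examples and keep whichever scores highest on your evals
    best_subset, best_score = None, float("-inf")
    for _ in range(trials):
        subset = random.sample(examples, subset_size)
        prompt = "Here are some examples:\n\n" + "\n\n".join(subset)
        score = score_prompt(prompt)  # averaged numeric score across your eval set
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```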
Quote 2024-05-23
The most effective mechanism I’ve found for rolling out No Wrong Door is initiating three-way conversations when asked questions. If someone direct messages me a question, then I will start a thread with the question asker, myself, and the person I believe is the correct recipient for the question. This is particularly effective because it’s a viral approach: rolling out No Wrong Door just requires any one of the three participants to adopt the approach.
Link 2024-05-24 A Grand Unified Theory of the AI Hype Cycle:
Glyph outlines the pattern of every AI hype cycle since the 1960s: a new, novel mechanism is discovered and named. People get excited, and non-practitioners start hyping it as the path to true "AI". It eventually becomes apparent that this is not the case, even while practitioners quietly incorporate this new technology into useful applications while downplaying the "AI" branding. A new mechanism is discovered and the cycle repeats.
Quote 2024-05-24
But increasingly, I’m worried that attempts to crack down on the cryptocurrency industry — scummy though it may be — may result in overall weakening of financial privacy, and may hurt vulnerable people the most. As they say, “hard cases make bad law”.
Link 2024-05-24 Some goofy results from ‘AI Overviews’ in Google Search:
John Gruber collects two of the best examples of Google's new AI overviews going horribly wrong.
Gullibility is a fundamental trait of all LLMs, and Google's new feature apparently doesn't know not to parrot ideas it picked up from articles in The Onion, or jokes from Reddit.
I've heard that LLM providers internally talk about "screenshot attacks" - bugs where the biggest risk is that someone will take an embarrassing screenshot.
In Google search's case this class of bug feels like a significant reputational threat.
Quote 2024-05-24
The leader of a team - especially a senior one - is rarely ever the smartest, the most expert or even the most experienced.
Often it’s the person who can best understand individuals’ motivations and galvanize them towards an outcome, all while helping them stay cohesive.
Quote 2024-05-24
I just left Google last month. The "AI Projects" I was working on were poorly motivated and driven by this panic that as long as it had "AI" in it, it would be great. This myopia is NOT something driven by a user need. It is a stone cold panic that they are getting left behind.
The vision is that there will be a Tony Stark like Jarvis assistant in your phone that locks you into their ecosystem so hard that you'll never leave. That vision is pure catnip. The fear is that they can't afford to let someone else get there first.
Link 2024-05-24 Nilay Patel reports a hallucinated ChatGPT summary of his own article:
Here's a ChatGPT bug that's a new twist on the old issue where it would hallucinate the contents of a web page based on the URL.
The Verge editor Nilay Patel asked for a summary of one of his own articles, pasting in the URL.
ChatGPT 4o replied with an entirely invented summary full of hallucinated details.
It turns out The Verge blocks ChatGPT's browse mode from accessing their site in their robots.txt:
User-agent: ChatGPT-User
Disallow: /
Clearly ChatGPT should reply that it is unable to access the provided URL, rather than inventing a response that guesses at the contents!
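You can check a rule like that yourself with the Python standard library - the URL here is just the obvious one to test:

```python
from urllib import robotparser

parser = robotparser.RobotFileParser("https://www.theverge.com/robots.txt")
parser.read()
# Prints False if the ChatGPT-User agent is disallowed from fetching that page
print(parser.can_fetch("ChatGPT-User", "https://www.theverge.com/"))
```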
Link 2024-05-24 Golden Gate Claude:
This is absurdly fun and weird. Anthropic's recent LLM interpretability research gave them the ability to locate features within the opaque blob of their Sonnet model and boost the weight of those features during inference.
For a limited time only they're serving a "Golden Gate Claude" model which has the feature for the Golden Gate Bridge boosted. No matter what question you ask it the Golden Gate Bridge is likely to be involved in the answer in some way. Click the little bridge icon in the Claude UI to give it a go.
I asked for names for a pet pelican and the first one it offered was this:
Golden Gate - This iconic bridge name would be a fitting moniker for the pelican with its striking orange color and beautiful suspension cables.
And from a recipe for chocolate covered pretzels:
Gently wipe any fog away and pour the warm chocolate mixture over the bridge/brick combination. Allow to air dry, and the bridge will remain accessible for pedestrians to walk along it.
UPDATE: I think the experimental model is no longer available, approximately 24 hours after release. We'll miss you, Golden Gate Claude.
Link 2024-05-25 Why Google’s AI might recommend you mix glue into your pizza:
I got "distrust and verify" as advice on using LLMs into this Washington Post piece by Shira Ovide.
Link 2024-05-26 Statically Typed Functional Programming with Python 3.12:
Oskar Wickström builds a simple expression evaluator that demonstrates some new patterns enabled by Python 3.12, incorporating the match operator, generic types and type aliases.
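Here's a tiny illustrative taste of those 3.12 features - my own example, not code from the article:

```python
from dataclasses import dataclass

@dataclass
class Add:
    left: "Expr"
    right: "Expr"

type Expr = int | Add  # new 3.12 type alias statement

def first[T](items: list[T]) -> T:  # new PEP 695 generic syntax
    return items[0]

def evaluate(expr: Expr) -> int:
    # Structural pattern matching over the expression tree
    match expr:
        case int(value):
            return value
        case Add(left, right):
            return evaluate(left) + evaluate(right)

print(evaluate(Add(1, Add(2, 3))))  # 6
```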
Link 2024-05-26 City In A Bottle – A 256 Byte Raycasting System:
Frank Force explains his brilliant 256 byte canvas ray tracing animated cityscape demo in detail.
Link 2024-05-27 fastlite:
New Python library from Jeremy Howard that adds some neat utility functions and syntactic sugar to my sqlite-utils Python library, specifically for interactive use in Jupyter notebooks.
The autocomplete support through newly exposed dynamic properties is particularly neat, as is the diagram(db.tables) utility for rendering a graphviz diagram showing foreign key relationships between all of the tables.
Link 2024-05-28 Pyodide 0.26 Release:
Pyodide provides Python packaged for browser WebAssembly alongside an ecosystem of additional tools and libraries to help Python and JavaScript work together.
The latest release bumps the Python version up to 3.12, and also adds support for pygame-ce, allowing games written using pygame to run directly in the browser.
The Pyodide community also just landed a 14-month-long PR adding support to cibuildwheel, which should make it easier to ship binary wheels targeting Pyodide.
Link 2024-05-28 Reproducing GPT-2 (124M) in llm.c in 90 minutes for $20:
GPT-2 124M was the smallest model in the GPT-2 series released by OpenAI back in 2019. Andrej Karpathy's llm.c is an evolving 4,000 line C/CUDA implementation which can now train a GPT-2 model from scratch in 90 minutes on an 8XA100 80GB GPU server. This post walks through exactly how to run the training, using 10 billion tokens of FineWeb.
Andrej notes that this isn't actually that far off being able to train a GPT-3:
Keep in mind that here we trained for 10B tokens, while GPT-3 models were all trained for 300B tokens. [...] GPT-3 actually didn't change too much at all about the model (context size 1024 -> 2048, I think that's it?).
Estimated cost for a GPT-3 ADA (350M parameters)? About $2,000.
Quote 2024-05-29
Sometimes the most creativity is found in enumerating the solution space. Design is the process of prioritizing tradeoffs in a high dimensional space. Understand that dimensionality.
Link 2024-05-29 What We Learned from a Year of Building with LLMs (Part I):
Accumulated wisdom from six experienced LLM hackers. Lots of useful tips in here. On providing examples in a prompt:
If n is too low, the model may over-anchor on those specific examples, hurting its ability to generalize. As a rule of thumb, aim for n ≥ 5. Don’t be afraid to go as high as a few dozen.
There's a recommendation not to overlook keyword search when implementing RAG - tricks with embeddings can miss results for things like names or acronyms, and keyword search is much easier to debug.
Plus this tip on using the LLM-as-judge pattern for implementing automated evals:
Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results.
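A minimal sketch of that pairwise pattern, using the OpenAI Python client as a stand-in judge - the model choice and prompt wording are mine, not from the article:

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer_a: str, answer_b: str) -> str:
    # Ask the judge model to pick the better of two candidate answers
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}\n\n"
        "Which answer is better? Reply with exactly one letter: A or B."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```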
TIL 2024-05-29 Cloudflare redirect rules with dynamic expressions:
I wanted to ensure https://niche-museums.com/ would redirect to https://www.niche-museums.com/ - including any path - using Cloudflare. …
Quote 2024-05-29
In their rush to cram in “AI” “features”, it seems to me that many companies don’t actually understand why people use their products. [...] Trust is a precious commodity. It takes a long time to build trust. It takes a short time to destroy it.