It's infuriatingly hard to understand how closed models train on their input
Plus, ChatGPT should include inline tips
In this newsletter:
It's infuriatingly hard to understand how closed models train on their input
ChatGPT should include inline tips
Plus 4 links and 3 quotations
It's infuriatingly hard to understand how closed models train on their input - 2023-06-04
One of the most common concerns I see about large language models regards their training data. People are worried that anything they say to ChatGPT could be memorized by it and spat out to other users. People are concerned that anything they store in a private repository on GitHub might be used as training data for future versions of Copilot.
When someone asked Google Bard how it was trained back in March, it told them its training data included internal Gmail! This turned out to be a complete fabrication - a hallucination by the model itself - and Google issued firm denials, but it's easy to see why that freaked people out.
I've been wanting to write something reassuring about this issue for a while now. The problem is... I can't do it. I don't have the information I need to credibly declare these concerns unfounded, and the more I look into this the murkier it seems to get.
Closed models won't tell you what's in their training data
The fundamental issue here is one of transparency. The builders of the big closed models - GPT-3, GPT-4, Google's PaLM and PaLM 2, Anthropic's Claude - refuse to tell us what's in their training data.
Given this lack of transparency, there's no way to confidently state that private data that is passed to them isn't being used to further train future versions of these models.
I've spent a lot of time digging around in openly available training sets. I built an early tool for searching the training set for Stable Diffusion. I can tell you exactly what has gone into the RedPajama training set that's being used for an increasing number of recent openly licensed language models.
But for those closed models? Barring loose, high-level details that are revealed piecemeal in blog posts and papers, I have no idea what's in them.
What OpenAI do and don't tell us
The good news is that OpenAI have an unambiguous policy regarding data that is sent to them by API users who are paying for the service:
OpenAI does not use data submitted by customers via our API to train OpenAI models or improve OpenAI’s service offering.
That's very clear. It's worth noting that this is a new policy though, introduced in March. The API data usage policies page includes this note:
Data submitted to the API prior to March 1, 2023 (the effective date of this change) may have been used for improvements if the customer had not previously opted out of sharing data.
Where things get a lot murkier is ChatGPT itself. Emphasis mine:
We don’t use data for selling our services, advertising, or building profiles of people—we use data to make our models more helpful for people. ChatGPT, for instance, improves by further training on the conversations people have with it, unless you choose to disable training.
But what does this mean in practice?
My initial assumption had been that this isn't as simple as anything you type into ChatGPT being used as raw input for further rounds of model training - I expected it was more about using that input to identify trends in the kinds of questions people ask, or using feedback from the up/down vote buttons to further fine-tune the model.
But honestly, I have no idea. Maybe they just run a regular expression to strip out phone numbers and email addresses and pipe everything else straight into the GPT-5 training runs? Without further transparency all we can do is guess.
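To be clear about what that naive approach would even look like, here's a hypothetical sketch of regex-based PII scrubbing. Nothing here reflects anything OpenAI has confirmed; the patterns and placeholder names are my own invention:

```python
import re

# Hypothetical sketch of naive PII scrubbing - not anything OpenAI has
# described doing. Real PII removal is much harder than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace email addresses and phone-number-like strings with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(scrub("Contact me at jane@example.com or +1 (555) 123-4567."))
# → Contact me at [EMAIL] or [PHONE].
```

Even this toy version shows why the idea is worrying: anything the patterns miss (names, addresses, internal project details) would flow straight through.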
A clue from the InstructGPT paper
The best clue I’ve seen as to how this data might actually be used comes from OpenAI’s description of InstructGPT back in January 2022:
To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API,[A] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
Crucially, this hints that the data isn’t being used as raw input for future trained models. Instead, it’s being used in an exercise where several potential outputs are produced and human labelers then select which of those is the best possible answer to the prompt. Aside from exposing potentially private data to those human labelers, I don’t see this as a risk for leaking that data in the later output of the model.
That [A] footnote turns out to be important:
We only use prompts submitted through the Playground to an earlier version of the InstructGPT models that was deployed in January 2021. Our human annotators remove personal identifiable information from all prompts before adding it to the training set.
Again though, I’m left with even more questions. This was before ChatGPT existed, so was the Playground development tool being treated separately from the API itself back then? What does “adding it to the training set” mean—is that the raw pre-training data used for future models, or is it the RLHF data used for the fine-tuning that they mentioned earlier?
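For readers unfamiliar with RLHF, the comparison data the InstructGPT paper describes has a rough shape like this. This is purely illustrative; the field names are my invention, not OpenAI's actual format:

```python
# Illustrative only: the rough shape of RLHF comparison data - a prompt,
# several sampled completions, and a human labeler's ranking. Field names
# are my invention, not anything from OpenAI.
comparison_record = {
    "prompt": "Summarize this contract clause...",
    "completions": [
        "Completion A ...",
        "Completion B ...",
        "Completion C ...",
    ],
    # Labeler ranks completions best-to-worst, by index into completions
    "labeler_ranking": [1, 0, 2],
}

# A reward model is trained so higher-ranked completions score higher.
# Note the prompt text is seen by human labelers and the reward model -
# which is exactly why PII removal matters at this stage.
best = comparison_record["completions"][comparison_record["labeler_ranking"][0]]
print(best)  # → Completion B ...
```

The key point this illustrates: in the RLHF pipeline the prompt is used to *rank* model outputs, not fed in as raw pre-training text, which is why the distinction I'm asking about matters.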
Security leaks are another threat
Aside from training concerns, there's another danger to consider here: the risk that an AI vendor might log inputs to their models and then suffer from a security flaw that exposes that data to attackers - or an insider threat where vendor employees access logged data that they shouldn't.
OpenAI themselves had a widely publicized security issue a few months ago where ChatGPT users could see summarized titles of sessions by other users. This is an extremely bad breach!
Their new trust.openai.com site appears to be entirely aimed at reassuring companies about their approach to security.
To be fair, this is not a new issue: companies have been trusting their private data to cloud providers like AWS and Google Cloud for more than a decade.
The challenge is that these AI companies have much less of a track record for staying secure. AWS and Google Cloud have large security teams with many years of experience securing their customers' data. These newer AI vendors are building up those capabilities as they go.
Self-hosted, openly licensed models
I've been tracking the meteoric rise of openly licensed LLMs you can run on your own hardware since LLaMA and Alpaca demonstrated how capable they could be back in March.
These models aren't yet anywhere near as capable as GPT-4, and claims that they compete with ChatGPT's gpt-3.5-turbo mostly don't hold up to deeper scrutiny.
But... they're pretty good - and they're getting better at an impressive rate.
And since you can run them on your own instances, they remove all possible concerns about what happens to the data that you pipe through them.
An open question for me remains how large a language model actually needs to be in order to solve the kinds of problems companies need to solve. Could a weaker, openly licensed model armed with the same retrieval augmented generation tricks that we've seen from Bing and Bard be capable enough to remove the need for a closed model like GPT-4?
My hunch is that for many applications these augmented openly licensed models will be increasingly capable, and will see widespread adoption over the next few months and years.
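The retrieval augmented generation pattern itself is simple enough to sketch in a few lines. Real systems use vector embeddings for retrieval; word overlap stands in here so the example runs with no dependencies, and the `llm()` call is a placeholder for whatever openly licensed model you run on your own hardware:

```python
# Toy sketch of retrieval augmented generation with a self-hosted model.
# Real systems use embedding-based similarity search; simple word overlap
# stands in here so this runs with zero dependencies.
DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Pacific, Monday through Friday.",
]

def retrieve(question: str) -> str:
    """Pick the document sharing the most words with the question."""
    q_words = set(question.lower().split())
    return max(DOCUMENTS, key=lambda d: len(q_words & set(d.lower().split())))

def build_prompt(question: str) -> str:
    context = retrieve(question)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is the refund policy?")
# Pass the assembled prompt to your locally hosted model, e.g.:
# answer = llm(prompt)
print(prompt)
```

Because the model only needs to synthesize an answer from the retrieved context, a smaller, weaker model can often get away with far less built-in knowledge.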
Bonus section: does GitHub use private repos to train future models?
This question came up on Hacker News this morning. GitHub's Privacy & Data Sharing policy says the following:
Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.
I interpret this as GitHub saying that no employee will ever see the contents of your private repo (barring incidents where they are compelled by law), and that the only data that might be shared with partners is "aggregate data learned from our analysis".
But what is "aggregate data"?
Could a large language model trained on data fit under that term? I don't think so, but the terminology is vague enough that once again I'm not ready to stake my reputation on it.
Clarity on this kind of thing is just so important. I think organizations like GitHub need to over-communicate here, and avoid any terminology like "aggregate data" that could leave people confused.
Thanks to Andy Baio and Fred Benenson for reviewing early drafts of this post.
ChatGPT should include inline tips - 2023-05-30
In OpenAI isn’t doing enough to make ChatGPT’s limitations clear, James Vincent argues that OpenAI's existing warnings about ChatGPT's confounding ability to convincingly make stuff up are not effective.
I completely agree.
The case of the lawyer who submitted fake cases invented by ChatGPT to the court is just the most recent version of this.
Plenty of people have argued that the lawyer should have read the warning displayed on every page of the ChatGPT interface. But that warning is clearly inadequate. Here's that warning in full:
ChatGPT may produce inaccurate information about people, places, or facts
Anyone who has spent time with ChatGPT will know that there's a lot more to it than that. It's not just that ChatGPT may produce inaccurate information: it will double down on it, inventing new details to support its initial claims. It will tell lies like this one:
I apologize for the confusion earlier. Upon double-checking, I found that the case Varghese v. China Southern Airlines Co. Ltd., 925 F.3d 1339 (11th Cir. 2019), does indeed exist and can be found on legal research databases such as Westlaw and LexisNexis.
It can't "double-check" information, and it doesn't have access to legal research databases.
"May produce inaccurate information" is a massive understatement here! It implies the occasional mistake, not Machiavellian levels of deception where it doubles down on falsehoods and invents increasingly convincing justifications for them.
Even for people who have read that warning, a single sentence in a footer isn't nearly enough to inoculate people against the many weird ways ChatGPT can lead them astray.
My proposal: Inline tips
I think this problem could be addressed with some careful interface design.
Currently, OpenAI have been trying to train ChatGPT to include additional warnings in its regular output. It will sometimes reply with warnings that it isn't able to do things... but these warnings are unreliable. Often I'll try the same prompt multiple times and only get the warning for some of those attempts.
Instead, I think the warnings should be added in a way that is visually distinct from the regular output. Here's a mockup illustrating the kind of thing I'm talking about:
As you can see, the prompt "Write some tweets based on what's trending on pinterest" triggers an inline warning with a visually different style and a message explaining that "This ChatGPT model does not have access to the internet, and its training data cut-off is September 2021".
My first version of this used "My data is only accurate up to September 2021", but I think having the warnings use "I" pronouns is itself misleading - the tips should be commentary about the model's output, not things that appear to be spoken by the model itself.
Here's a second mockup, inspired by the lawyer example:
This time the warning is "ChatGPT should not be relied on for legal research of this nature, because it is very likely to invent realistic cases that do not actually exist."
Writing these warnings clearly is its own challenge - I think they should probably include links to further information in an OpenAI support site that teaches people how to responsibly use ChatGPT (something that is very much needed).
(Here's the HTML I used for these mockups, added using the Firefox DevTools.)
How would this work?
Actually implementing this system isn't trivial. The first challenge is coming up with the right collection of warnings - my hunch is that this could be hundreds of items already. The next challenge is logic to decide when to display them, which would itself require an LLM (or maybe a fine-tuned model of some sort).
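Even before reaching for an LLM classifier, a rule-based version shows the shape of the system. This is a minimal sketch of my own, not anything OpenAI has built; the warning text paraphrases the mockups above:

```python
import re

# Minimal sketch of rule-based inline-tip triggering. A real system would
# likely need an LLM classifier; regex rules stand in here to show the
# shape of the idea. Warning text paraphrases the mockups above.
TIPS = [
    (re.compile(r"\btrending\b|\bcurrent\b|\btoday\b", re.I),
     "This ChatGPT model does not have access to the internet, and its "
     "training data cut-off is September 2021."),
    (re.compile(r"\blegal research\b|\bcase law\b|\bcite cases\b", re.I),
     "ChatGPT should not be relied on for legal research of this nature, "
     "because it is very likely to invent realistic cases that do not "
     "actually exist."),
]

def tips_for(prompt: str) -> list[str]:
    """Return the inline tips that should be shown alongside a prompt."""
    return [tip for pattern, tip in TIPS if pattern.search(prompt)]

print(tips_for("Write some tweets based on what's trending on pinterest"))
```

The hard part, as noted above, is scaling the rule list into the hundreds and deciding reliably when each one applies - that's where the dedicated classifier model comes in.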
The good news is that a system like this could be developed independently of core ChatGPT itself. New warnings could be added without any changes needed to the underlying model, making it safe to iterate wildly on the inline tips without risk of affecting the core model's performance or utility.
Obviously I'd like it best if OpenAI were to implement something like this as part of ChatGPT itself, but it would be possible for someone else to prototype it on top of the OpenAI APIs.
I thought about doing that myself, but my list of projects is overflowing enough already!
Max Woolf’s prototype
Max Woolf built an implementation of this idea as a demo for his upcoming easy-ChatGPT tool. He shared these screenshots on Twitter:
Link 2023-05-31 The Python Language Summit 2023: Making the Global Interpreter Lock Optional: Extremely informative update covering Sam Gross's python-nogil proposal from this year's language summit at PyCon.
Sam has been working hard on his fork for the past year, and now has it rebased for Python 3.12. If his PEP is accepted it could end up as an optional compile-time build in time for Python 3.13.
"The plan for nogil remains that it would be enabled via a compile-time flag, named --disable-gil. Third-party C extensions would need to provide separate wheels for GIL-disabled Python."
Link 2023-05-31 Mandatory Certification Regarding Generative Artificial Intelligence: From the Judge Specific Requirements for Judge Brantley Starr of the Northern District of Texas:
"All attorneys appearing before the Court must file on the docket a certificate attesting either that no portion of the filing was drafted by generative artificial intelligence (such as ChatGPT, Harvey.AI, or Google Bard) or that any language drafted by generative artificial intelligence was checked for accuracy, using print reporters or traditional legal databases, by a human being. [...]"
Quote 2023-05-31
If I were an AI sommelier I would say that gpt-3.5-turbo is smooth and agreeable with a long finish, though perhaps lacking depth. text-davinci-003 is spicy and tight, sophisticated even.
Quote 2023-06-01
He notes that one simulated test saw an AI-enabled drone tasked with a SEAD mission to identify and destroy SAM sites, with the final go/no go given by the human. However, having been ‘reinforced’ in training that destruction of the SAM was the preferred option, the AI then decided that ‘no-go’ decisions from the human were interfering with its higher mission – killing SAMs – and then attacked the operator in the simulation.
[UPDATE: This turned out to be a "thought experiment" intentionally designed to illustrate how these things could go wrong.]
Highlights from the RAeS Future Combat Air & Space Capabilities Summit
Link 2023-06-02 Vector Search: Amjith Ramanujam provides a very thorough tutorial on implementing vector similarity search using SentenceTransformers embeddings (all-MiniLM-L6-v2) executed using sqlite-utils, then served via datasette-sqlite-vss and deployed using Fly.
Link 2023-06-03 pytest-icdiff: This is neat: "pip install pytest-icdiff" provides an instant usability upgrade to the output of failed tests in pytest, especially if the assertions involve comparing larger strings or nested JSON objects.
Quote 2023-06-04
There was an exchange on Twitter a while back where someone said, ‘What is artificial intelligence?’ And someone else said, ‘A poor choice of words in 1954’. And, you know, they’re right. I think that if we had chosen a different phrase for it, back in the ’50s, we might have avoided a lot of the confusion that we’re having now.