The Stable Diffusion moment for Large Language Models
Also the first edition of my new weekly-ish newsletter
Thanks for subscribing to my newsletter! I plan to send this out around once a week, mainly using content from my blog.
(I’ll write about this more later, but I’m using this Observable notebook to help compose the newsletter.)
In today’s newsletter:
Large language models are having their Stable Diffusion moment
Stanford Alpaca, and the acceleration of on-device large language model development
Weeknotes: NICAR, and an appearance on KQED Forum
Plus 5 links and 3 quotations.
Large language models are having their Stable Diffusion moment - 2023-03-11
The open release of the Stable Diffusion image generation model back in August 2022 was a key moment. I wrote about how Stable Diffusion is a really big deal at the time.
People could now generate images from text on their own hardware!
More importantly, developers could mess around with the guts of what was going on.
The resulting explosion in innovation is still going on today. Most recently, ControlNet appears to have leapt Stable Diffusion ahead of Midjourney and DALL-E in terms of its capabilities.
It feels to me like that Stable Diffusion moment back in August kick-started the entire new wave of interest in generative AI - which was then pushed into overdrive by the release of ChatGPT at the end of November.
That Stable Diffusion moment is happening again right now, for large language models - the technology behind ChatGPT itself.
This morning I ran a GPT-3 class language model on my own personal laptop for the first time!
AI stuff was weird already. It's about to get a whole lot weirder.
LLaMA
Somewhat surprisingly, language models like GPT-3 that power tools like ChatGPT are a lot larger and more expensive to build and operate than image generation models.
The best of these models have mostly been built by private organizations such as OpenAI, and have been kept tightly controlled - accessible via their API and web interfaces, but not released for anyone to run on their own machines.
These models are also BIG. Even if you could obtain the GPT-3 model you would not be able to run it on commodity hardware - these things usually require several A100-class GPUs, each of which retail for $8,000+.
This technology is clearly too important to be entirely controlled by a small group of companies.
There have been dozens of open large language models released over the past few years, but none of them have quite hit the sweet spot for me in terms of the following:
Easy to run on my own hardware
Large enough to be useful - ideally equivalent in capabilities to GPT-3
Open source enough that they can be tinkered with
This all changed yesterday, thanks to the combination of Facebook's LLaMA model and llama.cpp by Georgi Gerganov.
Here's the abstract from the LLaMA paper:
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
It's important to note that LLaMA isn't fully "open". You have to agree to some strict terms to access the model. It's intended as a research preview, and isn't something which can be used for commercial purposes.
In a totally cyberpunk move, within a few days of the release, someone submitted this PR to the LLaMA repository linking to an unofficial BitTorrent download link for the model files!
So they're in the wild now. You may not be legally able to build a commercial product on them, but the genie is out of the bottle. That furious typing sound you can hear is thousands of hackers around the world starting to dig in and figure out what life is like when you can run a GPT-3 class model on your own hardware.
llama.cpp
LLaMA on its own isn't much good if it's still too hard to run it on a personal laptop.
Enter Georgi Gerganov.
Georgi is an open source developer based in Sofia, Bulgaria (according to his GitHub profile). He previously released whisper.cpp, a port of OpenAI's Whisper automatic speech recognition model to C++. That project made Whisper applicable to a huge range of new use cases.
He's just done the same thing with LLaMA.
Georgi's llama.cpp project had its initial release yesterday. From the README:
The main goal is to run the model using 4-bit quantization on a MacBook.
4-bit quantization is a technique for reducing the size of models so they can run on less powerful hardware. It also reduces the model sizes on disk - to 4GB for the 7B model and just under 8GB for the 13B one.
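To give a sense of how this works (a minimal sketch of the general idea, assuming block-wise quantization - llama.cpp's actual scheme differs in its details), you can split the weights into small blocks, store one floating point scale per block, and round each weight to a 4-bit integer:

```python
import numpy as np

def quantize_4bit(weights, block_size=32):
    # Split the weights into blocks (assumes the count is a multiple of
    # block_size); each block gets its own scale factor.
    blocks = weights.reshape(-1, block_size)
    # Map each block's largest magnitude onto the integer range [-7, 7].
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero on all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)  # integers in [-7, 7]
    return (q + 8).astype(np.uint8), scales  # shift to [1, 15]: fits in 4 bits

def dequantize_4bit(q, scales):
    # Recover approximate float weights at inference time.
    return (q.astype(np.float32) - 8) * scales
```

Each weight drops from 16 bits to 4, plus a small per-block overhead for the scales - which is roughly how a 13GB 16-bit model shrinks to a 4GB file.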
It totally works!
I used it to run the 7B LLaMA model on my laptop last night, and then this morning upgraded to the 13B model - the one that Facebook claim is competitive with GPT-3.
Here are my detailed notes on how I did that - most of the information I needed was already there in the README.
As my laptop started to spit out text at me I genuinely had a feeling that the world was about to change, again.
I thought it would be a few more years before I could run a GPT-3 class model on hardware that I owned. I was wrong: that future is here already.
Is this the worst thing that ever happened?
I'm not worried about the science fiction scenarios here. The language model running on my laptop is not an AGI that's going to break free and take over the world.
But there are a ton of very real ways in which this technology can be used for harm. Just a few:
Generating spam
Automated romance scams
Trolling and hate speech
Fake news and disinformation
Automated radicalization (I worry about this one a lot)
Not to mention that this technology makes things up exactly as easily as it parrots factual information, and provides no way to tell the difference.
Prior to this moment, a thin layer of defence existed in terms of companies like OpenAI having a limited ability to control how people interacted with those models.
Now that we can run these on our own hardware, even those controls are gone.
How do we use this for good?
I think this is going to have a huge impact on society. My priority is trying to direct that impact in a positive direction.
It's easy to fall into a cynical trap of thinking there's nothing good here at all, and everything generative AI is either actively harmful or a waste of time.
I'm personally using generative AI tools on a daily basis now for a variety of different purposes. They've given me a material productivity boost, but more importantly they have expanded my ambitions in terms of projects that I take on.
I used ChatGPT to learn enough AppleScript to ship a new project in less than an hour just last week!
I'm going to continue exploring and sharing genuinely positive applications of this technology. It's not going to be un-invented, so I think our priority should be figuring out the most constructive possible ways to use it.
What to look for next
Assuming Facebook don't relax the licensing terms, LLaMA will likely end up more of a proof-of-concept that local language models are feasible on consumer hardware than a new foundation model that people use going forward.
The race is on to release the first fully open language model that gives people ChatGPT-like capabilities on their own devices.
Quoting Stable Diffusion backer Emad Mostaque:
Wouldn't be nice if there was a fully open version eh
It's happening already...
I published this article on Saturday 11th March 2023. On Sunday, Artem Andreenko got it running on a RaspberryPi with 4GB of RAM:
I've sucefully runned LLaMA 7B model on my 4GB RAM Raspberry Pi 4. It's super slow about 10sec/token. But it looks we can run powerful cognitive pipelines on a cheap hardware. pic.twitter.com/XDbvM2U5GY
- Artem Andreenko 🇺🇦 (@miolini) March 12, 2023
Then on Monday, Anish Thite got it working on a Pixel 6 phone (at 26s/token):
@ggerganov's LLaMA works on a Pixel 6!
LLaMAs been waiting for this, and so have I pic.twitter.com/JjEhdzJ2B9
- anishmaxxing (@thiteanish) March 13, 2023
Stanford Alpaca, and the acceleration of on-device large language model development - 2023-03-13
On Saturday 11th March I wrote about how Large language models are having their Stable Diffusion moment. Today is Monday. Let's look at what's happened in the past three days.
Later on Saturday: Artem Andreenko reports that llama.cpp can run the 4-bit quantized 7B LLaMA language model on a 4GB RaspberryPi - at 10 seconds per token, but still hugely impressive.
Sunday 12th March: cocktailpeanut releases Dalai, a "dead simple way to run LLaMA on your computer": npx dalai llama and npx dalai serve.
13th March (today): Anish Thite reports llama.cpp running on a Pixel 6 phone (26 seconds per token).
Also today: a team at Stanford released Alpaca: A Strong Open-Source Instruction-Following Model - fine-tuned from the LLaMA 7B model.
When I talked about a "Stable Diffusion moment" this is the kind of thing I meant: the moment this stuff is available for people to experiment with, things accelerate.
I'm going to dive into Alpaca in detail.
Stanford's Alpaca
Here's the introduction to the Alpaca announcement:
We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$).
The biggest weakness in the LLaMA models released by Meta research last month is their lack of instruction-tuning.
A language model is a sentence completion engine. You give it a sequence of words, "The first man on the moon was", and it completes that sentence, hopefully with useful content.
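To make "sentence completion engine" concrete, here's a minimal sketch of the generation loop - the model and tokenizer interfaces are hypothetical stand-ins, not any real library's API:

```python
def complete(prompt, model, tokenizer, max_new_tokens=50):
    # The model's only job is to score every possible next token given
    # the tokens so far; generation is just calling it in a loop.
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        probs = model.next_token_probabilities(tokens)  # hypothetical interface
        tokens.append(max(probs, key=probs.get))  # greedy: take the likeliest
    return tokenizer.decode(tokens)

# complete("The first man on the moon was", model, tokenizer)
# might return "The first man on the moon was Neil Armstrong, who ..."
```

Real samplers mix in temperature and randomness rather than always taking the single most likely token, but that loop is the whole trick.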
One of the great innovations from OpenAI was their application of instruction tuning to GPT-3:
To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
Prior to this, you had to think very carefully about how to construct your prompts. Thanks to instruction tuning you can be a lot more, well, human in the way you interact with the model. "Write me a poem about pandas!" now works as a prompt, instead of "Here is a poem about pandas:".
The LLaMA models had not been through this process. The LLaMA FAQ acknowledges this:
Keep in mind these models are not finetuned for question answering. As such, they should be prompted so that the expected answer is the natural continuation of the prompt. [...] Overall, always keep in mind that models are very sensitive to prompts (particularly when they have not been finetuned).
This is an enormous usability problem.
One of my open questions about LLaMA was how difficult and expensive it would be to fine-tune it such that it could respond better to instructions.
Thanks to the team at Stanford we now have an answer: 52,000 training samples and $600 of training compute!
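Mechanically, instruction tuning is still just next-token prediction: each demonstration gets formatted into a prompt/completion pair for ordinary supervised training. Schematically it looks something like this (a sketch, not Alpaca's verbatim template, which lives in their repository):

```python
def to_training_example(record):
    # Turn one instruction-following demonstration into a (prompt,
    # completion) pair; the model is then fine-tuned to predict the
    # completion tokens given the prompt.
    prompt = f"Instruction: {record['instruction']}\n"
    if record["input"]:
        prompt += f"Input: {record['input']}\n"
    prompt += "Response:"
    return prompt, " " + record["output"]
```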
Something that stuns me about Alpaca is the quality they claim to be able to get from the 7B model - the smallest of the LLaMA models, and the one which has been seen running (albeit glacially slowly) on a RaspberryPi and a mobile phone! Here's one example from their announcement:
I would be impressed to see this from the 65B (largest) LLaMA model - but getting this from 7B is spectacular.
Still not for commercial usage
I'll quote the Stanford announcement on this in full:
We emphasize that Alpaca is intended only for academic research and any commercial use is prohibited. There are three factors in this decision: First, Alpaca is based on LLaMA, which has a non-commercial license, so we necessarily inherit this decision. Second, the instruction data is based on OpenAI's text-davinci-003, whose terms of use prohibit developing models that compete with OpenAI. Finally, we have not designed adequate safety measures, so Alpaca is not ready to be deployed for general use.
So it's still not something we can use to build commercial offerings - but for personal research and tinkering it's yet another huge leap forwards.
What does this demonstrate?
The license of the LLaMA model doesn't bother me too much. What's exciting to me is what this all proves.
LLaMA itself shows that it's possible to train a GPT-3 class language model using openly available resources. The LLaMA paper includes details of the training data, which is entirely from publicly available sources (which include CommonCrawl, GitHub, Wikipedia, ArXiv and Stack Exchange).
llama.cpp shows that you can then use some tricks to run that language model on consumer hardware - apparently anything with 4GB or more of RAM is enough to at least get it to start spitting out tokens!
Alpaca shows that you can apply fine-tuning with a feasibly sized set of examples (52,000) and cost ($600) such that even the smallest of the LLaMA models - the 7B one, which can compress down to a 4GB file with 4-bit quantization - provides results that compare well to cutting edge text-davinci-003 in initial human evaluation.
One thing that's worth noting: the Alpaca 7B comparison likely used the full-sized 13.48GB 16-bit floating point 7B model, not the smaller 4GB 4-bit quantized model used by llama.cpp. I've not yet seen a robust comparison of quality between the two.
Exploring the Alpaca training data with Datasette Lite
The Alpaca team released the 52,000 fine-tuning instructions they used as a 21.7MB JSON file in their GitHub repository.
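Each record in that file has three fields - instruction, input and output, the same three columns queried below. An illustrative entry, to show the shape (not quoted verbatim from the file):

```json
[
  {
    "instruction": "Rewrite the following sentence in the passive voice.",
    "input": "The cat chased the mouse.",
    "output": "The mouse was chased by the cat."
  }
]
```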
My Datasette Lite tool has the ability to fetch JSON from GitHub and load it into an in-browser SQLite database. Here's the URL to do that:
This will let you browse the 52,000 examples in your browser.
But we can do one step better than that: here's a SQL query that uses LIKE to search through those examples, across all three text columns:
select instruction, input, output from alpaca_data
where instruction || ' ' || input || ' ' || output like '%' || :search || '%'
order by random()
I'm using order by random() because why not? It's more fun to explore that way.
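If you'd rather work locally, sqlite-utils can load the same file into a SQLite database on disk and run that query from Python - a quick sketch, assuming you've downloaded alpaca_data.json from their repository:

```python
import json
import sqlite_utils

db = sqlite_utils.Database("alpaca.db")
with open("alpaca_data.json") as f:
    db["alpaca_data"].insert_all(json.load(f))

# The same search query as above, with a bound :search parameter
rows = db.query(
    "select instruction, input, output from alpaca_data "
    "where instruction || ' ' || input || ' ' || output "
    "like '%' || :search || '%' order by random() limit 5",
    {"search": "panda"},
)
for row in rows:
    print(row["instruction"])
```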
The following link will both load the JSON file and populate and execute that SQL query, plus allow you to change the search term using a form in your browser:
What's next?
This week is likely to be wild. OpenAI are rumored to have a big announcement on Tuesday - possibly GPT-4? And I've heard rumors of announcements from both Anthropic and Google this week as well.
I'm still more excited about seeing what happens next with LLaMA. Language models on personal devices are happening so much faster than I thought they would.
Link 2023-03-08 How Discord Stores Trillions of Messages: This is a really interesting case-study. Discord migrated from MongoDB to Cassandra back in 2016 to handle billions of messages. Today they're handling trillions, and they completed a migration from Cassandra to Scylla, a Cassandra-like data store written in C++ (as opposed to Cassandra's Java) to help avoid problems like GC pauses. In addition to being a really good scaling war story this has some interesting details about their increased usage of Rust. As a fan of request coalescing (which I've previously referred to as dogpile prevention) I particularly liked this bit: "Our data services sit between the API and our ScyllaDB clusters. They contain roughly one gRPC endpoint per database query and intentionally contain no business logic. The big feature our data services provide is request coalescing. If multiple users are requesting the same row at the same time, we’ll only query the database once. The first user that makes a request causes a worker task to spin up in the service. Subsequent requests will check for the existence of that task and subscribe to it. That worker task will query the database and return the row to all subscribers."
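Here's a minimal sketch of that coalescing pattern in Python with asyncio - Discord's implementation is Rust and gRPC, this just illustrates the idea of subscribing to an in-flight request:

```python
import asyncio

class Coalescer:
    # Concurrent requests for the same key share one in-flight query.
    def __init__(self, fetch):
        self.fetch = fetch    # the underlying (expensive) database query
        self.in_flight = {}   # key -> asyncio.Task for the running query

    async def get(self, key):
        task = self.in_flight.get(key)
        if task is None:
            # First requester spins up the worker task...
            task = asyncio.ensure_future(self.fetch(key))
            self.in_flight[key] = task
            task.add_done_callback(lambda _: self.in_flight.pop(key, None))
        # ...subsequent requesters subscribe to the same task.
        return await task
```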
Link 2023-03-09 apple-notes-to-sqlite: With the help of ChatGPT I finally figured out just enough AppleScript to automate the export of my notes to a SQLite database. AppleScript is a notoriously read-only language, which it turns out makes it a killer app for LLM-assisted coding.
Quote 2023-03-10
What could I do with a universal function — a tool for turning just about any X into just about any Y with plain language instructions?
Link 2023-03-11 Running LLaMA 7B on a 64GB M2 MacBook Pro with llama.cpp: I got Facebook's LLaMA 7B to run on my MacBook Pro using llama.cpp (a "port of Facebook's LLaMA model in C/C++") by Georgi Gerganov. It works! I've been hoping to run a GPT-3 class language model on my own hardware for ages, and now it's possible to do exactly that. The model itself ends up being just 4GB after applying Georgi's script to "quantize the model to 4-bits".
Link 2023-03-07 Online gradient descent written in SQL: Max Halford trains an online gradient descent model against two years of AAPL stock data using just a single advanced SQL query. He built this against DuckDB - I tried to replicate his query in SQLite and it almost worked, but it gave me a "recursive reference in a subquery" error that I was unable to resolve.
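For comparison, here's what online gradient descent looks like procedurally - a rough Python sketch of the kind of per-row update that recursive SQL expresses, fitting a simple linear model to a stream of (x, y) rows:

```python
def online_sgd(stream, lr=0.01):
    # Fit y ≈ w*x + b one row at a time, updating after each observation.
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y  # prediction error on this row
        w -= lr * err * x      # gradient step for the weight
        b -= lr * err          # gradient step for the bias
    return w, b
```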
Weeknotes: NICAR, and an appearance on KQED Forum - 2023-03-07
I spent most of this week at NICAR 2023, the data journalism conference hosted this year in Nashville, Tennessee.
This was my third in-person NICAR and it's an absolute delight: NICAR is one of my favourite conferences to go to. It brings together around a thousand journalists who work with data, from all over the country and quite a few from the rest of the world.
People have very different backgrounds and experiences, but everyone has one thing in common: a nerdy obsession with using data to find and tell stories.
I came away with at least a year's worth of new ideas for things I want to build.
I also presented a session: an hour long workshop titled "Datasette: An ecosystem of tools for exploring data and collaborating on data projects".
I demonstrated the scope of the project, took people through some hands-on exercises derived from the Datasette tutorials Cleaning data with sqlite-utils and Datasette and Using Datasette in GitHub Codespaces, and invited everyone in the room to join the Datasette Cloud preview and try using datasette-socrata to import and explore some data from the San Francisco open data portal.
My goal for this year's NICAR was to set up some direct collaborations with working newsrooms. Datasette is ready for this now, and I'm willing to invest significant time and effort in onboarding newsrooms, helping them start using the tools and learning what I need to do to help them be more effective in that environment.
If your newsroom is interested in that, please drop me an email at swillison@ Google's email service.
KQED Forum
My post about Bing attracted attention from the production team at KQED Forum, a long-running and influential Bay Area news discussion radio show.
They invited me to join a live panel discussion on Thursday morning with science-fiction author Ted Chiang and Claire Leibowicz from the Partnership on AI.
I've never done live radio before, so this was an opportunity that was too exciting to miss. I ducked out of the conference for an hour to join the conversation via Zoom.
Aside from a call with a producer a few days earlier I didn't have much of an idea what to expect (similar to my shorter live TV appearance). You really have to be able to think on your feet!
A recording is available on the KQED site, and on Apple Podcasts.
I'm happy with most of it, but I did have one offensive and embarrassing slip-up. I was talking about the Kevin Roose Bing conversation from the New York Times, where Bing declared its love for him. I said (05:30):
So I love this particular example because it actually accidentally illustrates exactly how these things work.
All of these chatbots, all of these language models they're called, all they can do is predict sentences.
They predict the next word that statistically makes sense given what's come before.
And if you look at the way it talks to Kevin Roose, I've got a quote.
It says, "You're married, but you're not happy. You're married, but you're not satisfied. You're married, but you're not in love."
No human being would talk like that. That's practically a kind of weird poetry, right?
But if you're thinking about it in terms of, OK, what sentence should logically come after this sentence?
"You're not happy, and then you're not satisfied", and then "you're not in love" - those just work. So Kevin managed to get himself into the situation where this bot was way off the reservation.
This is one of the most monumental software bugs of all time.
This was Microsoft's Bing search engine. They had a bug in their search engine where it would try and get a user to break up with their wife!
That's absolutely absurd.
But really, all it's doing is it had got itself to a point in the conversation where it's like, Okay, well, I'm in the mode of trying to talk about why a marriage isn't working?
What comes next? What comes next? What comes next?
In talking about Bing's behaviour I've been trying to avoid words like "crazy" and "psycho", because those stigmatize mental illness. I try to use terms like "wild" and "inappropriate" and "absurd" instead.
But saying something is "off the reservation" is much worse!
The term is deeply offensive, based on a dark history of forced relocation of Native Americans. I used it here thoughtlessly. If you asked me to think for a moment about whether it was an appropriate phrase I would have identified that it wasn't. I'm really sorry to have said this, and I will be avoiding this language in the future.
I'll share a few more annotated highlights from the transcript, thankfully without any more offensive language.
Here's my response to a question about how I've developed my own understanding of how these models actually work (19:47):
I'm a software engineer. So I've played around with training my own models on my laptop. I found an example where you can train one just on the complete works of Shakespeare and then have it spit out garbage Shakespeare, which has "thee" and "thus" and so forth.
And it looks like Shakespeare until you read a whole sentence and you realize it's total nonsense.
I did the same thing with my blog. I've got like 20 years of writing that I piped into it and it started producing sentences which were clearly in my tone even though they meant nothing.
It's so interesting seeing it generate these sequences of words in kind of a style but with no actual meaning to them.
And really that's exactly the same thing as ChatGPT. It's just that ChatGPT was fed terabytes of data and trained for months and months and months, whereas I fed in a few megabytes of data and trained it for 15 minutes.
So that really helps me start to get a feel for how these things work. The most interesting thing about these models is it turns out there's this sort of inflection point in size where you train them and they don't really get better up until a certain point where suddenly they start gaining these capabilities.
They start being able to summarize text and generate poems and extract things into bullet pointed lists. And the impression I've got from the AI research community is people aren't entirely sure that they understand why that happens at a certain point.
A lot of AI research these days is just, let's build it bigger and bigger and bigger and play around with it. And oh look, now it can do this thing. I just saw this morning that someone's got it playing chess. It shouldn't be able to play chess, but it turns out the Bing one can play chess and like nine out of ten of the moves it generates are valid moves and one out of ten are rubbish because it doesn't have a chess model baked into it.
So this is one of the great mysteries of these things, is that as you train them more, they gain these capabilities that no one was quite expecting them to gain.
Another example of that: these models are really good at writing code, like writing actual code for software, and nobody really expected that to be the case, right? They weren't designed as things that would replace programmers, but actually the results you can get out of them if you know how to use them in terms of generating code can be really sophisticated.
One of the most important lessons I think is that these things are actually deceptively difficult to use, right? It's a chatbot. How hard can it be? You just type things and it says things back to you.
But if you want to use it effectively, you have to understand pretty deeply what its capabilities and limitations are. If you try and give it mathematical puzzles, it will fail miserably because despite being a computer - and computers should be good at maths! - that's not something that language models are designed to handle.
And it'll make things up left, right, and center, which is something you need to figure out pretty quickly. Otherwise, you're gonna start believing just garbage that it throws out at you.
So there's actually a lot of depth to this. I think it's worth investing a lot of time just playing games with these things and trying out different stuff, because it's very easy to use them incorrectly. And there's very little guidance out there about what they're good at and what they're bad at. It takes a lot of learning.
I was happy with my comparison of writing cliches to programming. A caller had mentioned that they had seen it produce an answer to a coding question that invented an API that didn't exist, causing them to lose trust in it as a programming tool (23:11):
I can push back slightly on this example. That's absolutely right. It will often invent API methods that don't exist. But as somebody who creates APIs, I find that really useful because sometimes it invents an API that doesn't exist, and I'll be like, well, that's actually a good idea.
Because the thing it's really good at is consistency. And when you're designing APIs, consistency is what you're aiming for. So, you know, in writing, you want to avoid cliches. In programming, cliches are your friend. So, yeah, I actually use it as a design assistant where it'll invent something that doesn't exist. And I'll be like, okay, well, maybe that's the thing that I should build next.
A caller asked "Are human beings not also statistically created language models?". My answer to that (at 35:40):
So I'm not a neurologist, so I'm not qualified to answer this question in depth, but this does come up a lot in AI circles. In the discourse, yeah.
Yes, so my personal feeling on this is there is a very small part of our brain that kind of maybe works a little bit like a language model. You know, when you're talking, it's pretty natural to think what word's going to come next in that sentence.
But I'm very confident that that's only a small fraction of how our brains actually work. When you look at these language models like ChatGPT today, it's very clear that if you want to reach this mythical AGI, this general intelligence, it's going to have to be a heck of a lot more than just a language model, right?
You need to tack on models that can tell truth from fiction and that can do sophisticated planning and do logical analysis and so forth. So yeah, my take on this is, sure, there might be a very small part of how our brains work that looks a little bit like a language model if you squint at it, but I think there's a huge amount more to cognition than just the tricks that these language models are doing.
These transcripts were all edited together from an initial attempt created using OpenAI Whisper, running directly on my Mac using MacWhisper.
Releases this week
datasette-simple-html: 0.1 - 2023-03-01
Datasette SQL functions for very simple HTML operations
datasette-app: 0.2.3 - (5 releases total) - 2023-02-27
The Datasette macOS application
TIL this week
Link 2023-03-11 ChatGPT's API is So Good and Cheap, It Makes Most Text Generating AI Obsolete: Max Woolf on the quite frankly weird economics of the ChatGPT API: it's 1/10th the price of GPT-3 Da Vinci and appears to be equivalently capable, if not more so. "But it is very hard to economically justify not using ChatGPT as a starting point for a business need and migrating to a more bespoke infrastructure later as needed, and that’s what OpenAI is counting on. [...] I don’t envy startups whose primary business is text generation right now."
Quote 2023-03-12
I've successfully run LLaMA 7B model on my 4GB RAM Raspberry Pi 4. It's super slow about 10sec/token. But it looks we can run powerful cognitive pipelines on a cheap hardware.
Quote 2023-03-13
We introduce Alpaca 7B, a model fine-tuned from the LLaMA 7B model on 52K instruction-following demonstrations. Alpaca behaves similarly to OpenAI’s text-davinci-003, while being surprisingly small and easy/cheap to reproduce (<600$).
Alpaca: A Strong Open-Source Instruction-Following Model