LLM slop, datasette-secrets, llm-evals, gpt2-chatbot and a whole lot more
Plus 28 links and 19 quotes from the past two weeks
LLM slop
I really like this neologism: “slop”, for text generated entirely by LLMs and published, unwanted, on the Internet:
Quote 2024-05-07
Watching in real time as "slop" becomes a term of art. the way that "spam" became the term for unwanted emails, "slop" is going in the dictionary as the term for unwanted AI generated content
Weeknotes: Llama 3, AI for Data Journalism, llm-evals and datasette-secrets - 2024-04-23
One of my biggest frustrations in working with LLMs is that I still don't have a great way to evaluate improvements to my prompts. Did capitalizing OUTPUT IN JSON really make a difference? I don't have a great mechanism for figuring that out.
llm-evals-plugin (llmevals was taken on PyPI already) is a very early prototype of an LLM plugin that I hope to use to address this problem. […]
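The plugin itself is an early prototype, so this isn't its API - but as a rough sketch of the kind of check involved, a hand-rolled eval harness can compare prompt variants against expected structured output. Everything here (function names, the stub model) is invented for illustration:

```python
import json

def run_eval(model_fn, prompt_variants, cases):
    """Score each prompt variant by how often the model's output
    parses as JSON containing the expected keys."""
    scores = {}
    for name, template in prompt_variants.items():
        passed = 0
        for case in cases:
            output = model_fn(template.format(**case))
            try:
                parsed = json.loads(output)
                if all(k in parsed for k in case["expected_keys"]):
                    passed += 1
            except json.JSONDecodeError:
                pass
        scores[name] = passed / len(cases)
    return scores

# A stub model lets the harness be exercised without an API call.
def stub_model(prompt):
    return '{"name": "x"}' if "JSON" in prompt else "not json"

scores = run_eval(
    stub_model,
    {"plain": "Extract the name: {text}",
     "caps": "Extract the name. OUTPUT IN JSON: {text}"},
    [{"text": "Alice went home", "expected_keys": ["name"]}],
)
```

Running real prompt variants through this kind of loop is what would let you answer the OUTPUT IN JSON question with a number rather than a hunch.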
Weeknotes: more datasette-secrets, plus a mystery video project - 2024-05-07
I introduced datasette-secrets two weeks ago. The core idea is to provide a way for end-users to store secrets such as API keys in Datasette, allowing other plugins to access them - also Patterns for plugins that work against multiple Datasette versions. […]
Link 2024-04-22 timpaul/form-extractor-prototype:
Tim Paul, Head of Interaction Design at the UK's Government Digital Service, published this brilliant prototype built on top of Claude 3 Opus.
The video shows what it can do. Give it an image of a form and it will extract the form fields and use them to create a GDS-style multi-page interactive form, using their GOV.UK Forms design system and govuk-frontend npm package.
It works for both hand-drawn napkin illustrations and images of existing paper forms.
The bulk of the prompting logic is the schema definition in data/extract-form-questions.json
I'm always excited to see applications built on LLMs that go beyond the chatbot UI. This is a great example of exactly that.
Quote 2024-04-23
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone.
Link 2024-04-23 The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions:
By far the most detailed paper on prompt injection I've seen yet from OpenAI, published a few days ago and with six credited authors: Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke and Alex Beutel.
The paper notes that prompt injection mitigations which completely refuse any form of instruction in an untrusted prompt may not actually be ideal: some forms of instruction are harmless, and refusing them may provide a worse experience.
Instead, it proposes a hierarchy - where models are trained to consider if instructions from different levels conflict with or support the goals of the higher-level instructions - if they are aligned or misaligned with them.
The authors tested this idea by fine-tuning a model on top of GPT 3.5, and claim that it shows greatly improved performance against numerous prompt injection benchmarks.
As always with prompt injection, my key concern is that I don't think "improved" is good enough here. If you are facing an adversarial attacker reducing the chance that they might find an exploit just means they'll try harder until they find an attack that works.
The paper concludes with this note: "Finally, our current models are likely still vulnerable to powerful adversarial attacks. In the future, we will conduct more explicit adversarial training, and study more generally whether LLMs can be made sufficiently robust to enable high-stakes agentic applications."
Link 2024-04-23 microsoft/Phi-3-mini-4k-instruct-gguf:
Microsoft's Phi-3 LLM is out and it's really impressive. This 4,000 token context GGUF model is just 2.2GB (for the Q4 version) and ran on my Mac using the llamafile option described in the README. I could then run prompts through it using the llm-llamafile plugin.
The vibes are good! Initial test prompts I've tried feel similar to much larger 7B models, despite using just a few GBs of RAM. Tokens are returned fast too - it feels like the fastest model I've tried yet.
And it's MIT licensed.
Quote 2024-04-23
We [Bluesky] took a somewhat novel approach of giving every user their own SQLite database. By removing the Postgres dependency, we made it possible to run a ‘PDS in a box’ [Personal Data Server] without having to worry about managing a database. We didn’t have to worry about things like replicas or failover. For those thinking this is irresponsible: don’t worry, we are backing up all the data on our PDSs!
SQLite worked really well because the PDS – in its ideal form – is a single-tenant system. We owned up to that by having these single tenant SQLite databases.
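As a rough Python analogue of that single-tenant pattern (the file layout and schema here are invented, not Bluesky's), each account simply gets its own database file:

```python
import sqlite3
from pathlib import Path

DATA_DIR = Path("pds-data")

def db_for_user(did: str) -> sqlite3.Connection:
    """One SQLite file per account: no shared database server to
    manage, and backing up an account means copying one file."""
    DATA_DIR.mkdir(exist_ok=True)
    conn = sqlite3.connect(DATA_DIR / f"{did}.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (uri TEXT PRIMARY KEY, value TEXT)"
    )
    return conn

conn = db_for_user("did-plc-example")
conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", ("at://post/1", "hello"))
conn.commit()
```

The appeal is exactly what the quote describes: the tenant boundary is the filesystem, so there are no replicas or failover to reason about per user.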
Quote 2024-04-24
A bad survey won’t tell you it’s bad. It’s actually really hard to find out that a bad survey is bad — or to tell whether you have written a good or bad set of questions. Bad code will have bugs. A bad interface design will fail a usability test. It’s possible to tell whether you are having a bad user interview right away. Feedback from a bad survey can only come in the form of a second source of information contradicting your analysis of the survey results.
Most seductively, surveys yield responses that are easy to count and counting things feels so certain and objective and truthful.
Even if you are counting lies.
Link 2024-04-24 openelm/README-pretraining.md:
Apple released something big three hours ago, and I'm still trying to get my head around exactly what it is.
The parent project is called CoreNet, described as "A library for training deep neural networks". Part of the release is a new LLM called OpenELM, which includes completely open source training code and a large number of published training checkpoints.
I'm linking here to the best documentation I've found of that training data: it looks like the bulk of it comes from RefinedWeb, RedPajama, The Pile and Dolma.
Quote 2024-04-24
When I said “Send a text message to Julian Chokkattu,” who’s a friend and fellow AI Pin reviewer over at Wired, I thought I’d be asked what I wanted to tell him. Instead, the device simply said OK and told me it sent the words “Hey Julian, just checking in. How's your day going?” to Chokkattu. I've never said anything like that to him in our years of friendship, but I guess technically the AI Pin did do what I asked.
Link 2024-04-25 Snowflake Arctic Cookbook:
Today's big model release was Snowflake Arctic, an enormous 480B model with a 128×3.66B MoE (Mixture of Experts) architecture. It's Apache 2 licensed and Snowflake state that "in addition, we are also open sourcing all of our data recipes and research insights."
The research insights will be shared on this Arctic Cookbook blog - which currently has two articles covering their MoE architecture and describing how they optimized their training run in great detail.
They also list dozens of "coming soon" posts, which should be pretty interesting given how much depth they've provided in their writing so far.
Link 2024-04-25 No, Most Books Don't Sell Only a Dozen Copies:
I linked to a story the other day about book sales claiming "90 percent of them sold fewer than 2,000 copies and 50 percent sold less than a dozen copies", based on numbers released in the Penguin antitrust lawsuit. It turns out those numbers were interpreted incorrectly.
In this piece from September 2022 Lincoln Michel addresses this and other common misconceptions about book statistics.
Understanding these numbers requires understanding a whole lot of intricacies about how publishing actually works. Here's one illustrative snippet:
"Take the statistic that most published books only sell 99 copies. This seems shocking on its face. But if you dig into it, you’ll notice it was counting one year’s sales of all books that were in BookScan’s system. That’s quite different statistic than saying most books don’t sell 100 copies in total! A book could easily be a bestseller in, say, 1960 and sell only a trickle of copies today."
The top comment on the post comes from Kristen McLean of NPD BookScan, the organization whose numbers were misrepresented in the trial. She wasn't certain how the numbers had been sliced to get that 90% result, but in her own analysis of "frontlist sales for the top 10 publishers by unit volume in the U.S. Trade market" she found that 14.7% sold less than 12 copies and the 51.4% spot was for books selling less than a thousand.
Link 2024-04-25 Blogmarks that use markdown:
I needed to attach a correction to an older blogmark (my 20-year-old name for short-form links with commentary on my blog) today - but the commentary field has always been text, not HTML, so I didn't have a way to add the necessary link.
This motivated me to finally add optional Markdown support for blogmarks to my blog's custom Django CMS. I then went through and added inline code markup to a bunch of different older posts, and built this Django SQL Dashboard to keep track of which posts I had updated.
Quote 2024-04-25
I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. [...] It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. [...] What this manifests as is – trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. [...] This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivering compute to approximating that dataset.
Quote 2024-04-25
The only difference between screwing around and science is writing it down
Link 2024-04-26 Food Delivery Leak Unmasks Russian Security Agents:
This story is from April 2022 but I realize now I never linked to it.
Yandex Food, a popular food delivery service in Russia, suffered a major data leak.
The data included an order history with names, addresses and phone numbers of people who had placed food orders through that service.
Bellingcat were able to cross-reference this leak with addresses of Russian security service buildings - including those linked to the GRU and FSB. This allowed them to identify the names and phone numbers of people working for those organizations, and then combine that information with further leaked data as part of their other investigations.
If you look closely at the screenshots in this story they may look familiar: Bellingcat were using Datasette internally as a tool for exploring this data!
TIL 2024-04-26 Transcribing MP3s with whisper-cpp on macOS:
I asked on Twitter for tips about running Whisper transcriptions in the CLI on my Mac. Werner Robitza pointed me to Homebrew's whisper-cpp formula, and when I complained that it didn't have quite enough documentation for me to know how to use it Werner got a PR accepted adding the missing details. …
Quote 2024-04-26
If you’re auditioning for your job every day, and you’re auditioning against every other brilliant employee there, and you know that at the end of the year, 6% of you are going to get cut no matter what, and at the same time, you have access to unrivaled data on partners, sellers, and competitors, you might be tempted to look at that data to get an edge and keep your job and get to your restricted stock units.
Quote 2024-04-26
It's very fast to build something that's 90% of a solution. The problem is that the last 10% of building something is usually the hard part which really matters, and with a black box at the center of the product, it feels much more difficult to me to nail that remaining 10%. With vibecheck, most of the time the results to my queries are great; some percentage of the time they aren't. Closing that gap with gen AI feels much more fickle to me than a normal engineering problem. It could be that I'm unfamiliar with it, but I also wonder if some classes of generative AI based products are just doomed to mediocrity as a result.
Link 2024-04-27 Everything Google's Python team were responsible for:
In a questionable strategic move, Google laid off the majority of their internal Python team a few days ago. Someone on Hacker News asked what the team had been responsible for, and team member zem replied with this fascinating comment providing detailed insight into how the team worked and indirectly how Python is used within Google.
Quote 2024-04-27
I've worked out why I don't get much value out of LLMs. The hardest and most time-consuming parts of my job involve distinguishing between ideas that are correct, and ideas that are plausible-sounding but wrong. Current AI is great at the latter type of ideas, and I don't need more of those.
Link 2024-04-28 Zed Decoded: Rope & SumTree:
Text editors like Zed need in-memory data structures that are optimized for handling large strings where text can be inserted or deleted at any point without needing to copy the whole string.
Ropes are a classic, widely used data structure for this.
Zed have their own implementation of ropes in Rust, but it's backed by something even more interesting: a SumTree, described here as a thread-safe, snapshot-friendly, copy-on-write B+ tree where each leaf node contains multiple items and a Summary for each Item, and internal tree nodes contain a Summary of the items in its subtree.
These summaries allow for some very fast tree traversal operations, such as turning an offset in the file into a line and column coordinate and vice-versa. The summary itself can be anything, so each application of SumTree in Zed collects different summary information.
Uses in Zed include tracking highlight regions, code folding state, git blame information, project file trees and more - over 20 different classes and counting.
Zed co-founder Nathan Sobo calls SumTree "the soul of Zed".
Also notable: this detailed article is accompanied by an hour long video with a four-way conversation between Zed maintainers providing a tour of these data structures in the Zed codebase.
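Zed's SumTree is a copy-on-write B+ tree in Rust; as a much simpler flat analogue of the summary idea (my own toy sketch, not their implementation), you can store a (character count, newline count) summary per chunk and use prefix sums over those summaries to answer offset-to-line queries without scanning the whole text:

```python
import bisect

class ChunkedText:
    """Toy analogue of summary-based lookup: each chunk carries a
    newline count, and prefix sums over those counts turn a character
    offset into a line number without rescanning earlier chunks."""
    def __init__(self, chunks):
        self.chunks = chunks
        self.char_prefix, self.line_prefix = [], []
        chars = lines = 0
        for c in chunks:
            chars += len(c)
            lines += c.count("\n")
            self.char_prefix.append(chars)
            self.line_prefix.append(lines)

    def offset_to_line(self, offset):
        # Binary-search for the chunk containing `offset`, then count
        # newlines only within that single chunk.
        i = bisect.bisect_right(self.char_prefix, offset)
        chars_before = self.char_prefix[i - 1] if i else 0
        lines_before = self.line_prefix[i - 1] if i else 0
        return lines_before + self.chunks[i][:offset - chars_before].count("\n")

text = ChunkedText(["line one\nline ", "two\nline three\n", "line four"])
```

In a real tree the prefix sums live in the internal nodes, so edits only invalidate summaries along one path from leaf to root.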
Link 2024-04-29 How do you accidentally run for President of Iceland?:
Anna Andersen writes about a spectacular user interface design case-study from this year's Icelandic presidential election.
Running for President requires 1,500 endorsements. This year, those endorsements can be filed online through a government website.
The page for collecting endorsements originally had two sections - one for registering to collect endorsements, and another to submit your endorsement. The login link for the first came higher on the page, and at least 11 people ended up accidentally running for President!
Quote 2024-04-29
The creator of a model can not ensure that a model is never used to do something harmful – any more so than the developer of a web browser, calculator, or word processor could. Placing liability on the creators of general purpose tools like these means that, in practice, such tools can not be created at all, except by big businesses with well funded legal teams.
[...]
Instead of regulating the development of AI models, the focus should be on regulating their applications, particularly those that pose high risks to public safety and security. Regulating the use of AI in high-risk areas such as healthcare, criminal justice, and critical infrastructure, where the potential for harm is greatest, would ensure accountability for harmful use, whilst allowing for the continued advancement of AI technology.
Link 2024-04-29 My notes on gpt2-chatbot:
There's a new, unlabeled and undocumented model on the LMSYS Chatbot Arena today called gpt2-chatbot. It's been giving some impressive responses - you can prompt it directly in the Direct Chat tab by selecting it from the big model dropdown menu.
It looks like a stealth new model preview. It's giving answers that are comparable to GPT-4 Turbo and in some cases better - my own experiments led me to think it may have more "knowledge" baked into it, as ego prompts ("Who is Simon Willison?") and questions about things like lists of speakers at DjangoCon over the years seem to hallucinate less and return more specific details than before.
The lack of transparency here is both entertaining and infuriating. Lots of people are performing a parallel distributed "vibe check" and sharing results with each other, but it's annoying that even the most basic questions (What even IS this thing? Can it do RAG? What's its context length?) remain unanswered so far.
The system prompt appears to be the following - but system prompts just influence how the model behaves, they aren't guaranteed to contain truthful information:
You are ChatGPT, a large language model trained
by OpenAI, based on the GPT-4 architecture.
Knowledge cutoff: 2023-11
Current date: 2024-04-29
Image input capabilities: Enabled
Personality: v2
My best guess is that this is a preview of some kind of OpenAI "GPT 4.5" release. I don't think it's a big enough jump in quality to be a GPT-5.
Update: LMSYS do document their policy on using anonymized model names for tests of unreleased models.
Update May 7th: The model has been confirmed as belonging to OpenAI thanks to an error message that leaked details of the underlying API platform.
Quote 2024-04-29
# All the code is wrapped in a main function that gets called at the bottom of the file, so that a truncated partial download doesn't end up executing half a script.
Link 2024-04-30 Why SQLite Uses Bytecode:
Brand new SQLite architecture documentation by D. Richard Hipp explaining the trade-offs between a bytecode based query plan and a tree of objects.
SQLite uses the bytecode approach, which provides an important characteristic: SQLite can very easily execute queries incrementally - stopping after each row, for example. This is more useful for a local library database than for a network server where the assumption is that the entire query will be executed before results are returned over the wire.
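Python's built-in sqlite3 module makes that incremental behavior easy to observe: iterating a cursor steps the compiled bytecode program one row at a time, so you can abandon a query early without ever materializing the full result set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nums (n INTEGER)")
conn.executemany("INSERT INTO nums VALUES (?)", [(i,) for i in range(100_000)])

# The SELECT compiles to a bytecode program; each iteration of the
# cursor runs that program until it yields one more row. ORDER BY
# rowid walks the table's B-tree in order, so no sort step is needed
# and stopping after five rows does almost no work.
cursor = conn.execute("SELECT n FROM nums ORDER BY rowid")
first_five = []
for (n,) in cursor:
    first_five.append(n)
    if len(first_five) == 5:
        break
```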
Link 2024-04-30 My approach to HTML web components:
Some neat patterns here from Jeremy Keith, who is using Web Components extensively for progressive enhancement of existing markup.
The reactivity you get with full-on frameworks [like React and Vue] isn’t something that web components offer. But I do think web components can replace jQuery and other approaches to scripting the DOM.
Jeremy likes naming components with their element as a prefix (since all custom element names must contain at least one hyphen), and suggests building components under the single responsibility principle - so you can do things like <button-confirm><button-clipboard><button>....
Jeremy configures buttons with data- attributes and has them communicate with each other using custom events.
Something I hadn't realized is that since the connectedCallback function on a custom element is fired any time that element is attached to a page, you can fetch() and then insertHTML content that includes elements and know that they will initialize themselves without needing any extra logic - great for the kind of pattern encouraged by systems such as HTMX.
Link 2024-04-30 How an empty S3 bucket can make your AWS bill explode:
Maciej Pocwierz accidentally created an S3 bucket with a name that was already used as a placeholder value in a widely used piece of software. He saw 100 million PUT requests to his new bucket in a single day, racking up a big bill since AWS charges $5/million PUTs.
It turns out AWS charge that same amount for PUTs that result in a 403 authentication error, a policy that extends even to "requester pays" buckets!
So, if you know someone's S3 bucket name you can DDoS their AWS bill just by flooding them with meaningless unauthenticated PUT requests.
AWS support refunded Maciej's bill as an exception here, but I'd like to see them reconsider this broken policy entirely.
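The numbers in the post pencil out like this:

```python
# PUT requests are billed even when they fail with a 403.
put_price_per_million = 5.00      # USD - the S3 PUT rate quoted above
requests_per_day = 100_000_000    # the volume described in the post

daily_bill = requests_per_day / 1_000_000 * put_price_per_million
# $500 for a single day of traffic the bucket owner never asked for.
```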
Update from Jeff Barr:
We agree that customers should not have to pay for unauthorized requests that they did not initiate. We’ll have more to share on exactly how we’ll help prevent these charges shortly.
Quote 2024-04-30
Performance analysis indicates that SQLite spends very little time doing bytecode decoding and dispatch. Most CPU cycles are consumed in walking B-Trees, doing value comparisons, and decoding records - all of which happens in compiled C code. Bytecode dispatch is using less than 3% of the total CPU time, according to my measurements.
So at least in the case of SQLite, compiling all the way down to machine code might provide a performance boost 3% or less. That's not very much, considering the size, complexity, and portability costs involved.
Quote 2024-04-30
We collaborate with open-source and commercial model providers to bring their unreleased models to community for preview testing.
Model providers can test their unreleased models anonymously, meaning the models' names will be anonymized. A model is considered unreleased if its weights are neither open, nor available via a public API or service.
Link 2024-05-01 Save the Web by Being Nice:
This is a neat little article by Andrew Stephens who calls for more people to participate in building and supporting nice things on the web.
The very best thing to keep the web partly alive is to maintain some content yourself - start a blog, join a forum and contribute to the conversation, even podcast if that is your thing. But that takes a lot of time and not everyone has the energy or the knowhow to create like this.
The second best thing to do is to show your support for pages you enjoy by being nice and making a slight effort.
Like, comment-on, share and encourage people who make things you like. If you have the time or energy, make your own things and put them online.
Link 2024-05-01 Introducing the Claude Team plan and iOS app:
The iOS app seems nice, and provides free but heavily rate-limited access to Sonnet (the middle-sized Claude 3 model) - I ran two prompts just now and it told me I could have 3 more, resetting in five hours.
For $20/month you get access to Opus and 5x the capacity - which feels a little ungenerous to me.
The new $30/user/month team plan provides higher rate limits but is a minimum of five seats.
Link 2024-05-01 Llama 3 prompt formats:
I'm often frustrated at how thin the documentation around the prompt format required by an LLM can be.
Llama 3 turns out to be the best example I've seen yet of clear prompt format documentation. Every model needs documentation this good!
Quote 2024-05-02
I'm old enough to remember when the Internet wasn't a group of five websites, each consisting of screenshots of text from the other four.
Link 2024-05-02 We can have a different web:
Molly White's beautifully optimistic manifesto for creating a better web. Read the whole thing, or even better, find some headphones and a dog and go for a walk listening to the audio version.
Link 2024-05-02 Printing music with CSS Grid:
Stephen Bond demonstrates some ingenious tricks for creating surprisingly usable sheet music notation using clever application of CSS grids.
It uses rules like .stave > [data-duration="0.75"] { grid-column-end: span 18; } to turn data- attributes for musical properties into positions on the rendered stave.
Quote 2024-05-02
AI is the most anthropomorphized technology in history, starting with the name—intelligence—and plenty of other words thrown around the field: learning, neural, vision, attention, bias, hallucination. These references only make sense to us because they are hallmarks of being human. [...]
There is something kind of pathological going on here. One of the most exciting advances in computer science ever achieved, with so many promising uses, and we can't think beyond the most obvious, least useful application? What, because we want to see ourselves in this technology? [...]
Anthropomorphizing AI not only misleads, but suggests we are on equal footing with, even subservient to, this technology, and there's nothing we can do about it.
Link 2024-05-03 I'm writing a new vector search SQLite Extension:
Alex Garcia is working on sqlite-vec, a spiritual successor to his sqlite-vss project. The new SQLite C extension will have zero other dependencies (sqlite-vss used some tricky C++ libraries) and will work using virtual tables, storing chunks of vectors in shadow tables to avoid needing to load everything into memory at once.
Quote 2024-05-03
I used to have this singular focus on students writing code that they submit, and then I run test cases on the code to determine what their grade is. This is such a narrow view of what it means to be a software engineer, and I just felt that with generative AI, I’ve managed to overcome that restrictive view.
It’s an opportunity for me to assess their learning process of the whole software development [life cycle]—not just code. And I feel like my courses have opened up more and they’re much broader than they used to be. I can make students work on larger and more advanced projects.
Link 2024-05-04 Figma’s journey to TypeScript: Compiling away our custom programming language:
I love a good migration story. Figma had their own custom language that compiled to JavaScript, called Skew. As WebAssembly support in browsers emerged and improved the need for Skew's performance optimizations reduced, and TypeScript's maturity and popularity convinced them to switch.
Rather than doing a stop-the-world rewrite they built a transpiler from Skew to TypeScript, enabling a multi-year migration without preventing their product teams from continuing to make progress on new features.
Quote 2024-05-04
I believe these things:
1. If you use generative tools to produce or modify your images, you have abandoned photointegrity.
2. That’s not always wrong. Sometimes you need an image of a space battle or a Triceratops family or whatever.
3. What is always wrong is using this stuff without disclosing it.
Link 2024-05-05 What You Need to Know about Modern CSS (Spring 2024 Edition):
Useful guide to the many new CSS features that have become widely enough supported to start using as-of May 2024. Time to learn container queries!
View transitions are still mostly limited to Chrome - I can't wait for those to land in Firefox and Safari.
Quote 2024-05-06
Migrations are not something you can do rarely, or put off, or avoid; not if you are a growing company. Migrations are an ordinary fact of life.
Doing them swiftly, efficiently, and -- most of all -- *completely* is one of the most critical skills you can develop as a team.
Link 2024-05-07 OpenAI cookbook: How to get token usage data for streamed chat completion response:
New feature in the OpenAI streaming API that I've been wanting for a long time: you can now set stream_options={"include_usage": True} to get back a "usage" block at the end of the stream showing how many input and output tokens were used.
This means you can now accurately account for the total cost of each streaming API call. Previously this information was only available for non-streaming responses.
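A sketch of how that cost accounting looks - the stream_options parameter and the usage block on the final chunk follow the OpenAI Python SDK, but the per-token prices here are placeholders, not current rates:

```python
# The streaming call itself needs an API key, so it's shown as a comment:
#
# for chunk in client.chat.completions.create(
#         model="gpt-4-turbo", messages=messages, stream=True,
#         stream_options={"include_usage": True}):
#     ...  # the final chunk of the stream carries the usage counts

def stream_cost(usage, input_per_million=10.0, output_per_million=30.0):
    """Turn the usage block from the end of a stream into dollars."""
    return (usage["prompt_tokens"] / 1_000_000 * input_per_million +
            usage["completion_tokens"] / 1_000_000 * output_per_million)

final_usage = {"prompt_tokens": 1_500, "completion_tokens": 500}
cost = stream_cost(final_usage)
```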
Link 2024-05-07 Deterministic Quoting: Making LLMs Safe for Healthcare:
Matt Yeung introduces Deterministic Quoting, a technique to help reduce the risk of hallucinations while working with LLMs. The key idea is to have parts of the output that are copied directly from relevant source documents, with a different visual treatment to help indicate that they are exact quotes, not generated output.
The AI chooses which section of source material to quote, but the retrieval of that text is a traditional non-AI database lookup. That’s the only way to guarantee that an LLM has not transformed text: don’t send it through the LLM in the first place.
The LLM may still pick misleading quotes or include hallucinated details in the accompanying text, but this is still a useful improvement.
The implementation is straightforward: retrieved chunks include a unique reference, and the LLM is instructed to include those references as part of its replies. Matt's post includes examples of the prompts they are using for this.
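A minimal sketch of the idea - not Matt's actual implementation, and the marker syntax here is invented: the model emits only reference markers, and rendering swaps in the exact stored text, so quoted passages never pass through the LLM at all.

```python
import re

# The chunk store is the traditional non-AI database lookup: the text
# here is guaranteed verbatim because the model never touched it.
chunks = {
    "chunk-17": "Paracetamol dose should not exceed 4g in 24 hours.",
}

def render(llm_output, chunks):
    def substitute(match):
        ref = match.group(1)
        # An unknown reference means the model invented one - surface
        # that rather than guessing at the text.
        return f'"{chunks[ref]}"' if ref in chunks else "[unknown reference]"
    return re.sub(r"\{quote:([\w-]+)\}", substitute, llm_output)

rendered = render("The guideline states {quote:chunk-17} for adults.", chunks)
```

A real system would also give the substituted text the distinct visual treatment Matt describes, so readers can see which spans are verbatim quotes.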
Link 2024-05-08 gpt2-chatbot confirmed as OpenAI:
The mysterious gpt2-chatbot model that showed up in the LMSYS arena a few days ago was suspected to be a testing preview of a new OpenAI model. This has now been confirmed, thanks to a 429 rate limit error message that exposes details from the underlying OpenAI API platform.
The model has been renamed to im-also-a-good-gpt2-chatbot and is now only randomly available in "Arena (battle)" mode, not via "Direct Chat".
Link 2024-05-08 Towards universal version control with Patchwork:
Geoffrey Litt has been working with Ink & Switch exploring UI patterns for applying version control to different kinds of applications, with the goal of developing a set of conceptual primitives that can bring branching and version tracking to interfaces beyond just Git-style version control.
Geoffrey observes that basic version control is already a metaphor in a lot of software - the undo stack in Photoshop or suggestion mode in Google Docs are two examples.
Extending that is a great way to interact with AI tools as well - allowing for editorial bots that can suggest their own changes for you to accept, for example.