AI for Data Journalism: demonstrating what we can do with this stuff right now
Plus news on Mistral, Reka, Claude 3 and more
In this newsletter:
AI for Data Journalism: demonstrating what we can do with this stuff right now
Plus 13 links and 8 quotations and 1 TIL
AI for Data Journalism: demonstrating what we can do with this stuff right now - 2024-04-17
I gave a talk last month at the Story Discovery at Scale data journalism conference hosted at Stanford by Big Local News. My brief was to go deep into the things we can use Large Language Models for right now, illustrated by a flurry of demos to help provide starting points for further conversations at the conference.
I used the talk as an opportunity for some demo driven development - I pulled together a bunch of different project strands for the talk, then spent the following weeks turning them into releasable tools.
There are 12 live demos in this talk!
The full 50 minute video of my talk is available on YouTube. Below I've turned that video into an annotated presentation, with screenshots, further information and links to related resources and demos that I showed during the talk.
Quote 2024-04-10
The challenge [with RAG] is that most corner-cutting solutions look like they’re working on small datasets while letting you pretend that things like search relevance don’t matter, while in reality relevance significantly impacts quality of responses when you move beyond prototyping (whether they’re literally search relevance or are better tuned SQL queries to retrieve more appropriate rows). This creates a false expectation of how the prototype will translate into a production capability, with all the predictable consequences: underestimating timelines, poor production behavior/performance, etc.
Link 2024-04-10 Notes on how to use LLMs in your product:
A whole bunch of useful observations from Will Larson here. I love his focus on the key characteristic of LLMs that "you cannot know whether a given response is accurate", nor can you calculate a dependable confidence score for a response - and as a result you need to either "accept potential inaccuracies (which makes sense in many cases, humans are wrong sometimes too) or keep a Human-in-the-Loop (HITL) to validate the response."
Link 2024-04-10 Shell History Is Your Best Productivity Tool:
Martin Heinz drops a wealth of knowledge about ways to configure zsh (the default shell on macOS these days) to get better utility from your shell history.
Quote 2024-04-11
[on GitHub Copilot] It’s like insisting to walk when you can take a bike. It gets the hard things wrong but all the easy things right, very helpful and much faster. You have to learn what it can and can’t do.
Link 2024-04-11 Use an llm to automagically generate meaningful git commit messages:
Neat, thoroughly documented recipe by Harper Reed using my LLM CLI tool as part of a scheme for if you're feeling too lazy to write a commit message - it uses a prepare-commit-msg Git hook which runs any time you commit without a message and pipes your changes to a model along with a custom system prompt.
Link 2024-04-11 3Blue1Brown: Attention in transformers, visually explained:
Grant Sanderson publishes animated explainers of mathematical topics on YouTube, to over 6 million subscribers. His latest shows how the attention mechanism in transformers (the algorithm behind most LLMs) works and is by far the clearest explanation I've seen of the topic anywhere.
I was intrigued to find out what tool he used to produce the visualizations. It turns out Grant built his own open source Python animation library, manim, to enable his YouTube work.
Quote 2024-04-12
The language issues are indicative of the bigger problem facing the AI Pin, ChatGPT, and frankly, every other AI product out there: you can’t see how it works, so it’s impossible to figure out how to use it. [...] our phones are constant feedback machines — colored buttons telling us what to tap, instant activity every time we touch or pinch or scroll. You can see your options and what happens when you pick one. With AI, you don’t get any of that. Using the AI Pin feels like wishing on a star: you just close your eyes and hope for the best. Most of the time, nothing happens.
Link 2024-04-12 How we built JSR:
Really interesting deep dive by Luca Casonato into the engineering behind the new JSR alternative JavaScript package registry launched recently by Deno.
The backend uses PostgreSQL and a Rust API server hosted on Google Cloud Run.
The frontend uses Fresh, Deno's own server-side JavaScript framework which leans heavily in the concept of "islands" - a progressive enhancement technique where pages are rendered on the server and small islands of interactivity are added once the page has loaded.
Link 2024-04-13 Lessons after a half-billion GPT tokens:
Ken Kantzer presents some hard-won experience from shipping real features on top of OpenAI's models.
They ended up settling on a very basic abstraction over the chat API - mainly to handle automatic retries on a 500 error. No complex wrappers, not even JSON mode or function calling or system prompts.
Rather than counting tokens they estimate tokens as 3 times the length in characters, which works well enough.
One challenge they highlight for structured data extraction (one of my favourite use-cases for LLMs): "GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time."
(Several commenters on Hacker News report success in getting more items back by using numbered keys or sequence IDs in the returned JSON to help the model keep count.)
Link 2024-04-14 redka:
Anton Zhiyanov's new project to build a subset of Redis (including protocol support) using Go and SQLite. Also works as a Go library.
The guts of the SQL implementation are in the internal/sqlx folder.
Quote 2024-04-15
[On complaints about Claude 3 reduction in quality since launch] The model is stored in a static file and loaded, continuously, across 10s of thousands of identical servers each of which serve each instance of the Claude model. The model file never changes and is immutable once loaded; every shard is loading the same model file running exactly the same software. We haven’t changed the temperature either. We don’t see anywhere where drift could happen. The files are exactly the same as at launch and loaded each time from a frozen pristine copy.
Link 2024-04-15 OpenAI Batch API:
OpenAI are now offering a 50% discount on batch chat completion API calls if you submit them in bulk and allow for up to 24 hours for them to be run.
Requests are sent as a newline-delimited JSON file, with each line looking something like this:
{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo", "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 2+2?"}]}}
You upload a file for the batch, kick off a batch request and then poll for completion.
This makes GPT-3.5 Turbo cheaper than Claude 3 Haiku - provided you're willing to wait a few hours for your responses.
Link 2024-04-16 inline-snapshot:
I'm a big fan of snapshot testing, where expected values are captured the first time a test suite runs and then asserted against in future runs. It's a very productive way to build a robust test suite.
inline-snapshot by Frank Hoffmann is a particularly neat implementation of the pattern. It defines a snapshot() function which you can use in your tests:
assert 1548 * 18489 == snapshot()
When you run that test using "pytest --inline-snapshot=create" the snapshot() function will be replaced in your code (using AST manipulation) with itself wrapping the repr() of the expected result:
assert 1548 * 18489 == snapshot(28620972)
If you modify the code and need to update the tests you can run "pytest --inline-snapshot=fix" to regenerate the recorded snapshot values.
Quote 2024-04-16
Permissions have three moving parts, who wants to do it, what do they want to do, and on what object. Any good permission system has to be able to efficiently answer any permutation of those variables. Given this person and this object, what can they do? Given this object and this action, who can do it? Given this person and this action, which objects can they act upon?
Link 2024-04-16 Google NotebookLM Data Exfiltration:
NotebookLM is a Google Labs product that lets you store information as sources (mainly text files in PDF) and then ask questions against those sources - effectively an interface for building your own custom RAG (Retrieval Augmented Generation) chatbots.
Unsurprisingly for anything that allows LLMs to interact with untrusted documents, it's susceptible to prompt injection.
Johann Rehberger found some classic prompt injection exfiltration attacks: you can create source documents with instructions that cause the chatbot to load a Markdown image that leaks other private data to an external domain as data passed in the query string.
Johann reported this privately in the December but the problem has not yet been addressed.
A good rule of thumb is that any time you let LLMs see untrusted tokens there is a risk of an attack like this, so you should be very careful to avoid exfiltration vectors like Markdown images or even outbound links.
Quote 2024-04-16
The saddest part about it, though, is that the garbage books don’t actually make that much money either. It’s even possible to lose money generating your low-quality ebook to sell on Kindle for $0.99. The way people make money these days is by teaching students the process of making a garbage ebook. It’s grift and garbage all the way down — and the people who ultimately lose out are the readers and writers who love books.
TIL 2024-04-17 A script to capture frames from a QuickTime video:
I was putting together some notes for a talk I gave, and I wanted an efficient way to create screenshots of specific moments in a video of that talk. …
Link 2024-04-17 Scammers are targeting teenage boys on social media—and driving some to suicide.:
Horrifying in depth report describing sextortion scams: a scammer tricks a teenage boy into sending them reciprocal nude photos, then instantly starts blackmailing them by threatening to forward those photos to their friends and family members. Most online scams take weeks or even months to play out - these scams can turn to blackmail within minutes.
Quote 2024-04-17
But the reality is that you can't build a hundred-billion-dollar industry around a technology that's kind of useful, mostly in mundane ways, and that boasts perhaps small increases in productivity if and only if the people who use it fully understand its limitations.
Quote 2024-04-18
In mid-March, we added this line to our system prompt to prevent Claude from thinking it can open URLs:
"It cannot open URLs, links, or videos, so if it seems as though the interlocutor is expecting Claude to do so, it clarifies the situation and asks the human to paste the relevant text or image content directly into the conversation."
Link 2024-04-18 mistralai/mistral-common:
New from Mistral: mistral-common, an open source Python library providing "a set of tools to help you work with Mistral models".
So far that means a tokenizer! This is similar to OpenAI's tiktoken library in that it lets you run tokenization in your own code, which crucially means you can count the number of tokens that you are about to use - useful for cost estimates but also for cramming the maximum allowed tokens in the context window for things like RAG.
Mistral's library is better than tiktoken though, in that it also includes logic for correctly calculating the tokens needed for conversation construction and tool definition. With OpenAI's APIs you're currently left guessing how many tokens are taken up by these advanced features.
Anthropic haven't published any form of tokenizer at all - it's the feature I'd most like to see from them next.
Here's how to explore the vocabulary of the tokenizer:
MistralTokenizer.from_model(
"open-mixtral-8x22b"
).instruct_tokenizer.tokenizer.vocab()[:12]
['', '', '', '[INST]', '[/INST]', '[TOOL_CALLS]', '[AVAILABLE_TOOLS]', '[/AVAILABLE_TOOLS]', '[TOOL_RESULTS]', '[/TOOL_RESULTS]']
Link 2024-04-18 llm-reka:
My new plugin for running LLM prompts against the Reka family of API hosted LLM models: reka-core ($10 per million input), reka-flash (80c per million) and reka-edge (40c per million).
All three of those models are trained from scratch by a team that includes several Google Brain alumni.
Reka Core is their most powerful model, released on Monday 15th April and claiming benchmark scores competitive with GPT-4 and Claude 3 Opus.