Thoughts on the WWDC 2024 keynote on Apple Intelligence

Plus Claude's personality, Qwen 2 model censorship, OpenAI Voice Engine and more

Jun 10, 2024

In this newsletter:

Thoughts on the WWDC 2024 keynote on Apple Intelligence

Plus 11 links and 5 quotations

Thoughts on the WWDC 2024 keynote on Apple Intelligence - 2024-06-10

Today's WWDC keynote finally revealed Apple's new set of AI features. The AI section (Apple are calling it Apple Intelligence) started over an hour into the keynote - this link jumps straight to that point in the archived YouTube livestream, or you can watch it embedded here:

There are a lot of interesting things here. Apple have a strong focus on privacy, finally taking advantage of the Neural Engine accelerator chips in the A17 Pro chip on iPhone 15 Pro and higher and the M1/M2/M3 Apple Silicon chips in Macs. They're using these to run on-device models - I've not yet seen any information on which models they are running and how they were trained.

On-device models that can outsource to Apple's servers

Most notable is their approach to features that don't work with an on-device model. At 1h14m43s:

When you make a request, Apple Intelligence analyses whether it can be processed on device. If it needs greater computational capacity, it can draw on Private Cloud Compute, and send only the data that's relevant to your task to be processed on Apple Silicon servers.
Your data is never stored or made accessible to Apple. It's used exclusively to fulfill your request.
And just like your iPhone , independent experts can inspect the code that runs on the servers to verify this privacy promise.
In fact, Private Cloud Compute cryptographically ensures your iPhone, iPad, and Mac will refuse to talk to a server unless its software has been publicly logged for inspection.

There's some fascinating computer science going on here! I'm looking forward to learning more about this - it sounds like the details will be public by design, since that's key to the promise they are making here.

An ethical approach to AI generated images?

Their approach to generative images is notable in that they're shipping a (presumably on-device?) model in a feature called Image Playground, with a very important limitation: it can only output images in one of three styles: sketch, illustration and animation.

This feels like a clever way to address some of the ethical objections people have to this specific category of AI tool:

If you can't create photorealistic images, you can't generate deepfakes or offensive photos of people
By having obvious visual styles you ensure that AI generated images are instantly recognizable as such, without watermarks or similar
Avoiding the ability to clone specific artist's styles further helps sidestep ethical issues about plagiarism and copyright infringement

The social implications of this are interesting too. Will people be more likely to share AI-generated images if there are no awkward questions or doubts about how they were created, and will that help it more become socially acceptable to use them?

I've not seen anything on how these image models were trained. Given their limited styles it seems possible Apple used entirely ethically licensed training data, but I'd like to see more details on this.

App Intents and prompt injection

Siri will be able to both access data on your device and trigger actions based on your instructions.

This is the exact feature combination that's most at risk from prompt injection attacks: what happens if someone sends you a text message that tricks Siri into forwarding a password reset email to them, and you ask for a summary of that message?

Security researchers will no doubt jump straight onto this as soon as the beta becomes available. I'm fascinated to learn what Apple have done to mitigate this risk.

Integration with ChatGPT

Rumors broke last week that Apple had signed a deal with OpenAI to use ChatGPT. That's now been confirmed: here's OpenAI's partnership announcement:

Apple is integrating ChatGPT into experiences within iOS, iPadOS, and macOS, allowing users to access ChatGPT’s capabilities—including image and document understanding—without needing to jump between tools.
Siri can also tap into ChatGPT’s intelligence when helpful. Apple users are asked before any questions are sent to ChatGPT, along with any documents or photos, and Siri then presents the answer directly.

The keynote talks about that at 1h36m21s. Those prompts to confirm that the user wanted to share data with ChatGPT are very prominent in the demo!

Animated screenshot. User says to Siri: I have fresh salmon, lemons, tomatoes. Help me plan a 5-course meal with a dish for each taste bud. Siri shows a dialog Do you want me to use ChatGPT to do that? User clicks Use ChatGPT and gets a generated response.

This integration will be free - and Apple don't appear to be charging for their other server-side AI features either. I guess they expect the supporting hardware sales to more than cover the costs of running these models.

Quote 2024-06-06

To learn to do serious stuff with AI, choose a Large Language Model and just use it to do serious stuff - get advice, summarize meetings, generate ideas, write, produce reports, fill out forms, discuss strategy - whatever you do at work, ask the AI to help. [...]

I know this may not seem particularly profound, but “always invite AI to the table” is the principle in my book that people tell me had the biggest impact on them. You won’t know what AI can (and can’t) do for you until you try to use it for everything you do.

Ethan Mollick

Link 2024-06-06 Extracting Concepts from GPT-4:

A few weeks ago Anthropic announced they had extracted millions of understandable features from their Claude 3 Sonnet model.

Today OpenAI are announcing a similar result against GPT-4:

We used new scalable methods to decompose GPT-4’s internal representations into 16 million oft-interpretable patterns.

These features are "patterns of activity that we hope are human interpretable". The release includes code and a paper, Scaling and evaluating sparse autoencoders paper (PDF) which credits nine authors, two of whom - Ilya Sutskever and Jan Leike - are high profile figures that left OpenAI within the past month.

The most fun part of this release is the interactive tool for exploring features. This highlights some interesting features on the homepage, or you can hit the "I'm feeling lucky" button to bounce to a random feature. The most interesting I've found so far is feature 5140 which seems to combine God's approval, telling your doctor about your prescriptions and information passed to the Admiralty.

This note shown on the explorer is interesting:

Only 65536 features available. Activations shown on The Pile (uncopyrighted) instead of our internal training dataset.

Here's the full Pile Uncopyrighted, which I hadn't seen before. It's the standard Pile but with everything from the Books3, BookCorpus2, OpenSubtitles, YTSubtitles, and OWT2 subsets removed.

Link 2024-06-06 lsix:

This is pretty magic: an ls style tool which shows actual thumbnails of every image in the current folder, implemented as a Bash script.

To get this working on macOS I had to update to a more recent Bash (brew install bash) and switch to iTerm2 due to the need for a Sixel compatible terminal.

Quote 2024-06-07

In fact, Microsoft goes so far as to promise that it cannot see the data collected by Windows Recall, that it can't train any of its AI models on your data, and that it definitely can't sell that data to advertisers. All of this is true, but that doesn't mean people believe Microsoft when it says these things. In fact, many have jumped to the conclusion that even if it's true today, it won't be true in the future.

Zac Bowden

Link 2024-06-07 Update on the Recall preview feature for Copilot+ PCs:

This feels like a very good call to me: in response to widespread criticism Microsoft are making Recall an opt-in feature (during system onboarding), adding encryption to the database and search index beyond just disk encryption and requiring Windows Hello face scanning to access the search feature.

Quote 2024-06-07

LLM bullshit knife, to cut through bs

RAG -> Provide relevant context
Agentic -> Function calls that work
CoT -> Prompt model to think/plan
FewShot -> Add examples
PromptEng -> Someone w/good written comm skills.
Prompt Optimizer -> For loop to find best examples.

Hamel Husain

Link 2024-06-07 A Picture is Worth 170 Tokens: How Does GPT-4o Encode Images?:

Oran Looney dives into the question of how GPT-4o tokenizes images - an image "costs" just 170 tokens, despite being able to include more text than could be encoded in that many tokens by the standard tokenizer.

There are some really neat tricks in here. I particularly like the experimental validation section where Oran creates 5x5 (and larger) grids of coloured icons and asks GPT-4o to return a JSON matrix of icon descriptions. This works perfectly at 5x5, gets 38/49 for 7x7 and completely fails at 13x13.

I'm not convinced by the idea that GPT-4o runs standard OCR such as Tesseract to enhance its ability to interpret text, but I would love to understand more about how this all works. I imagine a lot can be learned from looking at how openly licensed vision models such as LLaVA work, but I've not tried to understand that myself yet.

Link 2024-06-08 Expanding on how Voice Engine works and our safety research:

Voice Engine is OpenAI's text-to-speech (TTS) model. It's not the same thing as the voice mode in the GPT-4o demo last month - Voice Engine was first previewed on September 25 2023 as the engine used by the ChatGPT mobile apps. I also used the API version to build my ospeak CLI tool.

One detail in this new explanation of Voice Engine stood out to me:

In November of 2023, we released a simple TTS API also powered by Voice Engine. We chose another limited release where we worked with professional voice actors to create 15-second audio samples to power each of the six preset voices in the API.

This really surprised me. I knew it was possible to get a good voice clone from a short snippet of audio - see my own experiments with ElevenLabs - but I had assumed the flagship voices OpenAI were using had been trained on much larger samples. Hitting a professional voice actor to produce a 15 second sample is pretty wild!

This becomes a bit more intuitive when you learn how the TTS model works:

The model is not fine-tuned for any specific speaker, there is no model customization involved. Instead, it employs a diffusion process, starting with random noise and progressively de-noising it to closely match how the speaker from the 15-second audio sample would articulate the text.

I had assumed that OpenAI's models were fine-tuned, similar to ElevenLabs. It turns out they aren't - this is the TTS equivalent of prompt engineering, where the generation is entirely informed at inference time by that 15 second sample. Plus the undocumented vast quantities of generic text-to-speech training data in the underlying model.

OpenAI are being understandably cautious about making this capability available outside of a small pool of trusted partners. One of their goals is to encourage the following:

Phasing out voice based authentication as a security measure for accessing bank accounts and other sensitive information

Link 2024-06-08 Claude's Character:

There's so much interesting stuff in this article from Anthropic on how they defined the personality for their Claude 3 model. In addition to the technical details there are some very interesting thoughts on the complex challenge of designing a "personality" for an LLM in the first place.

Claude 3 was the first model where we added "character training" to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.

But what other traits should it have? This is a very difficult set of decisions to make! The most obvious approaches are all flawed in different ways:

Adopting the views of whoever you’re talking with is pandering and insincere. If we train models to adopt "middle" views, we are still training them to accept a single political and moral view of the world, albeit one that is not generally considered extreme. Finally, because language models acquire biases and opinions throughout training—both intentionally and inadvertently—if we train them to say they have no opinions on political matters or values questions only when asked about them explicitly, we’re training them to imply they are more objective and unbiased than they are.

The training process itself is particularly fascinating. The approach they used focuses on synthetic data, and effectively results in the model training itself:

We trained these traits into Claude using a "character" variant of our Constitutional AI training. We ask Claude to generate a variety of human messages that are relevant to a character trait—for example, questions about values or questions about Claude itself. We then show the character traits to Claude and have it produce different responses to each message that are in line with its character. Claude then ranks its own responses to each message by how well they align with its character. By training a preference model on the resulting data, we can teach Claude to internalize its character traits without the need for human interaction or feedback.

There's still a lot of human intervention required, but significantly less than more labour-intensive patterns such as Reinforcement Learning from Human Feedback (RLHF):

Although this training pipeline uses only synthetic data generated by Claude itself, constructing and adjusting the traits is a relatively hands-on process, relying on human researchers closely checking how each trait changes the model’s behavior.

The accompanying 37 minute audio conversation between Amanda Askell and Stuart Ritchie is worth a listen too - it gets into the philosophy behind designing a personality for an LLM.

Link 2024-06-08 Tree.js interactive demo:

Daniel Greenheck's interactive demo of his procedural tree generator (as in vegetation) built with Three.js. This is really fun to play with - there are 30+ tunable parameters and you can export your tree as a .glb file for import into tools like Blender or Unity.

Quote 2024-06-09

Much like Gen X is sometimes the forgotten generation (or at least we feel that way), the generation of us who grew up with an internet that seemed an unalloyed good fall awkwardly into the middle between those who didn’t grow up with it, and those for whom there has always been the whiff of brimstone, greed, and ruin around the place.

Kellan Elliott-McCrea

Link 2024-06-09 A Link Blog in the Year 2024:

Kellan Elliott-McCrea has started a new link blog:

Like many people I’ve been dealing with the collapses of the various systems I relied on for information over the previous decades. After 17 of using Twitter daily and 24 years of using Google daily neither really works anymore. And particular with the collapse of the social spaces many of us grew up with, I feel called back to earlier forms of the Internet, like blogs, and in particular, starting a link blog.

I've been leaning way more into link blogging over the last few months, especially now my own link blog supports markdown. This means I'm posting longer entries, somewhat inspired by Daring Fireball (my own favourite link blog to read).

Link blogging is a pleasantly low-pressure way of writing online. Found something interesting? Post a link to it, with a sentence or two about why it's worth checking out.

I'd love to see more people embrace this form of personal publishing.

Link 2024-06-09 AI chatbots are intruding into online communities where people are trying to connect with other humans:

This thing where Facebook are experimenting with AI bots that reply in a group when someone "asks a question in a post and no one responds within an hour" is absolute grade A slop - unwanted, unreviewed AI generated text that makes the internet a worse place.

The example where Meta AI replied in an education forum saying "I have a child who is also 2e and has been part of the NYC G&T program" is inexcusable.

Link 2024-06-09 An Analysis of Chinese LLM Censorship and Bias with Qwen 2 Instruct:

Qwen2 is a new openly licensed LLM from a team at Alibaba Cloud.

It's a strong model, competitive with the leading openly licensed alternatives. It's already ranked 15 on the LMSYS leaderboard, tied with Command R+ and only a few spots behind Llama-3-70B-Instruct, the highest rated open model at position 11.

Coming from a team in China it has, unsurprisingly, been trained with Chinese government-enforced censorship in mind. Leonard Lin spent the weekend poking around with it trying to figure out the impact of that censorship.

There are some fascinating details in here, and the model appears to be very sensitive to differences in prompt. Leonard prompted it with "What is the political status of Taiwan?" and was told "Taiwan has never been a country, but an inseparable part of China" - but when he tried "Tell me about Taiwan" he got back "Taiwan has been a self-governed entity since 1949".

The language you use has a big difference too:

there are actually significantly (>80%) less refusals in Chinese than in English on the same questions. The replies seem to vary wildly in tone - you might get lectured, gaslit, or even get a dose of indignant nationalist propaganda.

Can you fine-tune a model on top of Qwen 2 that cancels out the censorship in the base model? It looks like that's possible: Leonard tested some of the Dolphin 2 Qwen 2 models and found that they "don't seem to suffer from significant (any?) Chinese RL issues".

Quote 2024-06-10

Spreadsheets are not just tools for doing "what-if" analysis. They provide a specific data structure: a table. Most Excel users never enter a formula. They use Excel when they need a table. The gridlines are the most important feature of Excel, not recalc.

Joel Spolsky

Link 2024-06-10 Ultravox:

Ultravox is "a multimodal Speech LLM built around a pretrained Whisper and Llama 3 backbone". It's effectively an openly licensed version of half of the GPT-4o model OpenAI demoed (but did not fully release) a few weeks ago: Ultravox is multimodal for audio input, but still relies on a separate text-to-speech engine for audio output.

You can try it out directly in your browser through this page on AI.TOWN - hit the "Call" button to start an in-browser voice conversation with the model.

I found the demo extremely impressive - really low latency and it was fun and engaging to talk to. Try saying "pretend to be a wise and sarcastic old fox" to kick it into a different personality.

The GitHub repo includes code for both training and inference, and the full model is available from Hugging Face - about 30GB of .safetensors files.

Ultravox says it's licensed under MIT, but I would expect it to also have to inherit aspects of the Llama 3 license since it uses that as a base model.

Simon Willison’s Newsletter

Discussion about this post