9 Comments
Suhrab Khan:

Impressive work. Claude Opus 4.5 shows real improvements in reasoning and nuanced tasks. The Nano Banana Pro’s text-rendering capabilities are a standout.

Hegemony:

I’m partial to the “LLM plays Factorio” benchmark.

Pyro:

Evaluating new models is trivially easy though.

When GPT-5 dropped I asked it to create 15 software-engineering prompts it would consider a bit too complex for itself (you can see them at https://github.com/pyros-projects/agent-comparison). Lo and behold, pre-Codex GPT failed basically all of them. Sonnet 4.5 (depending on the agent harness) also has serious trouble, while Opus 4.5 basically aces all of them.

The nice thing about LLMs capable of solving harder problems is that you can use them to create harder problems.

So I probably need to regenerate this prompt list by having Opus create a new one. Then we'll know for sure when something surpasses it. At least for coding.

But here's the thing: these "zero-shot" tasks aren't even the right way to test models anyway.

What actually matters is how good a model is as a digital companion and sidekick to do stupid stuff with.

And oh my god, this is the first model that works as a real brainstorming partner.

Usually when you ask Sonnet or GPT to brainstorm projects like X or Y, they just propose slight variations of X and Y.

They don't understand that those are conceptual examples, not the actual thing you want. You want completely new ideas, not permutations.

Opus 4.5 actually gets this.

Just recently it brainstormed a novel way to program LLMs instead of prompting them (https://github.com/pyros-projects/wishful). During another session, Opus itself realized the framework could re-implement AlphaEvolve in a way that's far more elegant than Google's original paper. I never mentioned AlphaEvolve.

Hadn't even thought about that angle. Opus came up with it on its own, validated the idea with experiments, and we're now implementing it.

These are the tests that count.

A single SWE-Bench or ARC-AGI number will never tell you this. Quite the contrary, I'd argue the narrow ARC-AGI-maxxing agents make for terrible brainstorming partners.

ToxSec:

Evaluation and benchmarks are a super odd space. I feel like companies sometimes stress them so much, but sometimes the UX matters more to us. We saw that with GPT-5: a technical step up, yet people disliked it compared to 4o.

Ruben Hassid:

Benchmarks don’t mean much anymore, and every new model looks better depending on the task.

Only thing that matters now is real use. Run the same workflow against models side-by-side and judge the output.

Might be the only reliable eval left.
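To make the idea concrete, here is a minimal sketch of that side-by-side approach: run one fixed workflow prompt against several models and collect the outputs for a human to judge. `call_model` is a hypothetical stand-in, not a real provider API.

```python
def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call; a real harness would make an
    # HTTP request to each provider here.
    canned = {
        "model-a": "Refactored using a strategy pattern.",
        "model-b": "Rewrote the loop; behavior unchanged.",
    }
    return canned[model]

def side_by_side(models: list[str], prompt: str) -> dict[str, str]:
    """Run the same prompt against every model and return {model: output},
    so a human can compare the outputs directly."""
    return {m: call_model(m, prompt) for m in models}

results = side_by_side(["model-a", "model-b"], "Refactor this function.")
for model, output in results.items():
    print(f"{model}: {output}")
```

The point is that the prompt is held fixed and only the model varies, so any difference in output quality is attributable to the model itself.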

James Wang:

Great overview and work here, thanks Simon!

Will:

"today it’s often very difficult to find concrete examples that differentiate the new generation of models from their predecessors."

This problem of "testing intelligence" is interesting. Can someone with an IQ of 110 devise tests that not only tell which of two people (IQs of 145 and 155) has the higher IQ, but also the relative difference between them? Maybe? I'm not sure it's a simple problem. (And something I could imagine a podcast exploring somewhere.)

Did you try asking an AI to come up with some examples? :)

"Benchmarks like SWE-bench Verified show models beating each other by single digit percentage point margins, but what does that actually equate to in real-world problems that I need to solve on a daily basis?"

I think your next section of that post speaks directly to the real-world problems.

"I still don’t think training models not to fall for prompt injection is the way forward here."

That is the foundational change (maybe architecturally outside of the models but wrapped into a model release) needed to really unlock LLM agents for knowledge work outside of development.

Mark Cheverton:

Simon, all your gist links break because Substack appends a tracking query string to the gist URL, which isn't formed as a key-value query string to begin with: https://gistpreview.github.io/?f40971b693024fbe984a68b73cc283d2=&utm_source=substack&utm_medium=email
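In case it helps anyone stuck on a mangled link: because the gist id ends up as a bare query-string key, it can still be recovered by parsing the query and skipping the tracking keys. A minimal sketch using Python's standard `urllib.parse`:

```python
from urllib.parse import urlparse, parse_qsl

def recover_gist_id(url: str) -> str:
    """Return the first non-tracking query key, which in these mangled
    gistpreview URLs is the gist id itself (with an empty value)."""
    query = urlparse(url).query
    for key, _value in parse_qsl(query, keep_blank_values=True):
        if not key.startswith("utm_"):
            return key
    return ""

url = ("https://gistpreview.github.io/"
       "?f40971b693024fbe984a68b73cc283d2=&utm_source=substack&utm_medium=email")
print(recover_gist_id(url))  # → f40971b693024fbe984a68b73cc283d2
```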

Appreciate you sharing the transcripts BTW. Really useful to see how you actually interact.

-Mark

Joe Repka:

Maybe the validity of benchmarks is not critical; maybe benchmarks are just a gamification to stimulate interest and adoption, a marketing aid.