Overview
- Claude Opus 4.5: Anthropic's new flagship model at \$5/\$25 per million tokens, with a 200K token context window, enhanced computer use abilities, and a new "effort" parameter (see the sketch after this list) - but still vulnerable to prompt injection 5-33% of the time
- Model Evaluation Crisis: Even though the Opus 4.5 preview helped Simon land 2,022 additions across 39 files, switching back to Sonnet 4.5 left his productivity unchanged - highlighting how difficult it has become to identify meaningful capability differences
- Nano Banana Pro: Google's new image model excels at infographics, with accurate text rendering, 4K output, and Google Search grounding - it produced a Datasette infographic from a nine-word prompt, complete with correct logos and UI thumbnails
- Olmo 3 Release: Ai2's fully open 32B model ships with its training data, training process, and checkpoints - enabling the kind of backdoor auditing that's impossible with other open-weight models, though it over-thought for 14 minutes on a simple SVG prompt
- sqlite-utils 4.0: The major version bump includes breaking changes such as double-quoted identifiers in generated SQL instead of square brackets, iterator support for bulk inserts (see the sketch after this list), and type detection by default for CSV imports
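For the "effort" parameter, here is a minimal sketch of what a raw Messages API call might look like. The model id "claude-opus-4-5" and passing "effort" as a top-level field are assumptions based on the announcement, not verified against the official Anthropic docs.

```python
# Minimal sketch of a Messages API call with a reduced "effort" setting.
# Assumptions: the model id and the top-level "effort" field name may differ
# from the official API; treat this as illustrative only.
import os
import httpx

response = httpx.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-opus-4-5",  # assumed model id
        "max_tokens": 1024,
        "effort": "medium",  # assumed field name; lower effort trades capability for speed and cost
        "messages": [{"role": "user", "content": "Summarize the trade-offs of lowering effort."}],
    },
    timeout=60.0,
)
print(response.json()["content"][0]["text"])
```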
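For the sqlite-utils 4.0 iterator support, a small sketch of streaming rows from a generator into insert_all is below; the database filename, table name, and row shape are made up for illustration.

```python
# Sketch: bulk-inserting rows from a generator with sqlite-utils.
# The filename, table name, and row contents are hypothetical.
import sqlite_utils

db = sqlite_utils.Database("demo.db")

def generate_rows():
    # Yield rows lazily; an iterator can be handed straight to insert_all
    # instead of materializing a full list in memory first.
    for i in range(100_000):
        yield {"id": i, "square": i * i}

db["squares"].insert_all(generate_rows(), pk="id")
print(db["squares"].count)
```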
Takeaways
Simon Willison's deep dive into Claude Opus 4.5 reveals a fascinating problem: as AI models converge on the frontier, distinguishing their capabilities becomes increasingly difficult. After extensively testing Anthropic's latest flagship model through real production coding tasks, he discovered something unexpected.
What caught my attention was his admission that despite Opus 4.5 being marketed as the "best model in the world for coding," he couldn't definitively identify improvements over Sonnet 4.5 in practical use. This isn't a model limitation - it's an evaluation crisis. Simon notes he's "embarrassingly lacking in suitable challenges" that current models can't already solve.
As Simon puts it: "My favorite moments in AI are when a new model gives me the ability to do something that simply wasn't possible before. In the past these have felt a lot more obvious, but today it's often very difficult to find concrete examples that differentiate the new generation of models from their predecessors."
The implications are profound: we need better evaluation methodologies beyond single-digit benchmark improvements. Simon's suggestion that AI labs provide concrete "before and after" examples - prompts that fail on previous models but succeed on new ones - would transform how we understand model progress. Until then, we're left drawing pelicans on bicycles to spot the differences.