Simon Willison’s Weblog

Subscribe

Thursday, 22nd May 2025

If your library doesn't have any documentation, it can't have any bugs.

Documentation specifies what your code is supposed to do. Your tests specify what it actually does.

Bugs exist when your test-enforced implementation fails to match the behavior described in your documentation. Without documentation a bug is just undefined behavior.

If you aim to follow semantic versioning you bump your major version when you release a backwards incompatible change. Such changes cannot exist if your code is not comprehensively documented!

Inspired by a half-remembered conversation I had with Tom Insam many years ago.

# 1:55 am / testing, semantic-versioning, documentation

Live blog: Claude 4 launch at Code with Claude

I’m at Anthropic’s Code with Claude event, where they are launching Claude 4. I’ll be live blogging the keynote here.

llm-anthropic 0.16. New release of my LLM plugin for Anthropic adding the new Claude 4 Opus and Sonnet models.

You can see pelicans on bicycles generated using the new plugin at the bottom of my live blog covering the release.

I also released llm-anthropic 0.16a1 which works with the latest LLM alpha and provides tool usage feature on top of the Claude models.

The new models can be accessed using both their official model ID and the aliases I've set for them in the plugin:

llm install -U llm-anthropic
llm keys set anthropic
# paste key here
llm -m anthropic/claude-sonnet-4-0 \
  'Generate an SVG of a pelican riding a bicycle'

This uses the full model ID - anthropic/claude-sonnet-4-0.

I've also setup aliases claude-4-sonnet and claude-4-opus. These are notably different from the official Anthropic names - I'm sticking with their previous naming scheme of claude-VERSION-VARIANT as seen with claude-3.7-sonnet.

Here's an example that uses the new alpha tool feature with the new Opus:

llm install llm-anthropic==0.16a1
llm --functions '                                                               
def multiply(a: int, b: int):
    return a * b
' '234324 * 2343243' --td -m claude-4-opus

Outputs:

I'll multiply those two numbers for you.
Tool call: multiply({'a': 234324, 'b': 2343243})
  549078072732
The result of 234,324 × 2,343,243 is **549,078,072,732**.

Here's the output of llm logs -c from that tool-enabled prompt response. More on tool calling in my recent workshop.

# 6:36 pm / llm, anthropic, claude, generative-ai, ai, llms, claude-4

Updated Anthropic model comparison table. A few details in here about Claude 4 that I hadn't spotted elsewhere:

  1. The training cut-off date for Claude Opus 4 and Claude Sonnet 4 is March 2025! That's the most recent cut-off for any of the current popular models, really impressive.
  2. Opus 4 has a max output of 32,000 tokens, Sonnet 4 has a max output of 64,000 tokens. Claude 3.7 Sonnet is 64,000 tokens too, so this is a small regression for Opus.
  3. The input limit for both of the Claude 4 models is still stuck at 200,000. I'm disjointed by this, I was hoping for a leap to a million to catch up with GPT 4.1 and the Gemini Pro series.
  4. Claude 3 Haiku is still in that table - it remains Anthropic's cheapest model, priced slightly lower than Claude 3.5 Haiku.

For pricing: Sonnet 4 is the same price as Sonnet 3.7 ($3/million input, $15/million output). Opus 4 matches the pricing of the older Opus 3 - $15/million for input and $75/million for output. I've updated llm-prices.com with the new models.

I spotted a few more interesting details in Anthropic's Migrating to Claude 4 documentation:

Claude 4 models introduce a new refusal stop reason for content that the model declines to generate for safety reasons, due to the increased intelligence of Claude 4 models.

Plus this note on the new summarized thinking feature:

With extended thinking enabled, the Messages API for Claude 4 models returns a summary of Claude’s full thinking process. Summarized thinking provides the full intelligence benefits of extended thinking, while preventing misuse.

While the API is consistent across Claude 3.7 and 4 models, streaming responses for extended thinking might return in a “chunky” delivery pattern, with possible delays between streaming events.

Summarization is processed by a different model than the one you target in your requests. The thinking model does not see the summarized output.

There's a new beta header, interleaved-thinking-2025-05-14, which turns on the "interleaved thinking" feature where tools can be called as part of the chain-of-thought. More details on that in the interleaved thinking documentation.

This is a frustrating note:

  • You’re charged for the full thinking tokens generated by the original request, not the summary tokens.
  • The billed output token count will not match the count of tokens you see in the response.

I initially misread that second bullet as meaning we would no longer be able to estimate costs based on the return token counts, but it's just warning us that we might see an output token integer that doesn't exactly match the visible tokens that were returned in the API.

# 7:03 pm / anthropic, claude, generative-ai, ai, llms, llm-pricing, claude-4

I was going slightly spare at the fact that every talk at this Anthropic developer conference has used the word "agents" dozens of times, but nobody ever stopped to provide a useful definition.

I'm now in the "Prompting for Agents" workshop and Anthropic's Hannah Moran finally broke the trend by saying that at Anthropic:

Agents are models using tools in a loop

I can live with that! I'm glad someone finally said it out loud.

# 7:07 pm / anthropic, generative-ai, ai-agents, ai, llms