
The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150-page PDF with all sorts of information, not just the usual benchmarks.

There's a big section on deception. In one example, Opus is fed news about Anthropic's safety team being disbanded, but then hides that information from the user.

The risks are a bit scary, especially around CBRN (chemical, biological, radiological and nuclear) capabilities. Opus is still only ASL-3 (systems that substantially increase the risk of catastrophic misuse) and not quite at ASL-4 (uplifting a second-tier state-level bioweapons programme to the sophistication and success of a first-tier one), so I think we're fine...

I've never written a blog post about a model release before but decided to this time [1]. The system card has quite a few surprises, so I've highlighted some bits that stood out to me (and Claude, ChatGPT and Gemini).

[0] https://www.anthropic.com/claude-opus-4-5-system-card

[1] https://dave.engineer/blog/2025/11/claude-opus-4.5-system-ca...



  Pages 22–24 of Opus’s system card provide some evidence for this. Anthropic run a multi-agent search benchmark where Opus acts as an orchestrator and Haiku/Sonnet/Opus act as sub-agents with search access. Using cheap Haiku sub-agents gives a ~12-point boost over Opus alone.
Will this lead to another exponential increase in capabilities and token usage, of the same order as thinking models?


Perhaps. Though if that were feasible, I'd expect it to have been exploited already.

I think this is more about the cost and time savings of being able to use cheaper models. Sub-agents effectively give you parallelization plus temporary context compaction. (Much like delegation and organisational structure in human teams.)
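
A rough sketch of the pattern with the Anthropic Python SDK (the model IDs are placeholders, and a real orchestration loop is obviously more involved):

  import anthropic

  client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

  def run_subagent(task: str) -> str:
      # Cheap worker: it sees only its own task, not the
      # orchestrator's full context (temporary context compaction).
      response = client.messages.create(
          model="claude-haiku-4-5",  # placeholder model ID
          max_tokens=1024,
          messages=[{"role": "user", "content": task}],
      )
      return response.content[0].text

  def orchestrate(question: str, subtasks: list[str]) -> str:
      # Fan subtasks out to cheap workers, then have the expensive
      # model synthesise only their condensed findings.
      findings = "\n\n".join(run_subagent(t) for t in subtasks)
      response = client.messages.create(
          model="claude-opus-4-5",  # placeholder model ID
          max_tokens=2048,
          messages=[{
              "role": "user",
              "content": f"Question: {question}\n\nSub-agent findings:\n{findings}",
          }],
      )
      return response.content[0].text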

We're starting to see benchmarks report stats for low/medium/high reasoning effort, and how newer models can match or beat older ones with fewer reasoning tokens. It would be interesting to see more benchmarks for different sub-agent combinations too. E.g. does Claude perform better when Opus can use 10,000 tokens of Sonnet or 100,000 tokens of Haiku? What's the best agent response you can get for $1?
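
Back-of-envelope version of that last question, with purely illustrative per-token prices (not official pricing):

  # Illustrative output prices in USD per million tokens -- made up
  # for the comparison; check real pricing before relying on this.
  PRICE = {"opus": 25.0, "sonnet": 15.0, "haiku": 5.0}

  def cost(model: str, tokens: int) -> float:
      return PRICE[model] * tokens / 1_000_000

  print(cost("sonnet", 10_000))   # $0.15
  print(cost("haiku", 100_000))   # $0.50

So at these made-up prices the orchestrator gets 10x the Haiku tokens for roughly 3.3x the spend; the benchmark question is which of those buys more useful work per dollar.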

Where I think we might see gains on _some_ types of tasks is with vast quantities of tiny models, i.e. many sub-4B-parameter LLMs used as sub-agents. I wonder what GPT-5.1 Pro would be like if it could orchestrate 1,000 drone-like workers.
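
Hand-wavy sketch of that fan-out, with a stub standing in for a local sub-4B model:

  import asyncio

  async def drone(task: str) -> str:
      # Stub: imagine a call to a tiny local model here.
      await asyncio.sleep(0)  # stands in for real inference latency
      return f"result for {task!r}"

  async def swarm(tasks: list[str]) -> list[str]:
      # Fan every task out concurrently; the orchestrator sees only
      # the aggregated results, never the workers' contexts.
      return await asyncio.gather(*(drone(t) for t in tasks))

  results = asyncio.run(swarm([f"shard {i}" for i in range(1000)]))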



