GPTCrunch
New Release

Claude Opus 4 Sets New Benchmark Records

Anthropic's latest flagship model achieves state-of-the-art on SWE-bench with 72% score and introduces extended thinking capabilities.

GPTUni Team

May 22, 2025 · 1 min read

Anthropic has released Claude Opus 4, its most capable model to date. The model achieves 72% on SWE-bench Verified, the highest score recorded at time of release, and demonstrates substantial improvements across reasoning, coding, and analysis benchmarks.

The model introduces an extended thinking mode that exposes the chain-of-thought process to developers, allowing them to see how the model reasons through complex problems. This transparency is valuable for debugging, auditing, and building trust in model outputs.
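As a rough illustration of how a developer might enable this mode, the sketch below builds a request for the Anthropic Python SDK. The parameter names (`thinking`, `budget_tokens`) and the model ID reflect Anthropic's public API documentation at the time of writing and should be verified against current docs before use:

```python
# Sketch: enabling extended thinking via the Anthropic Python SDK.
# The "thinking" parameter and model ID are assumptions based on the
# public API docs at time of writing; check current documentation.
request = {
    "model": "claude-opus-4-20250514",
    "max_tokens": 16_000,
    # Reserve up to 8K of the output budget for visible chain-of-thought.
    "thinking": {"type": "enabled", "budget_tokens": 8_000},
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}

# With an API key configured, the call would look like:
#   import anthropic
#   response = anthropic.Anthropic().messages.create(**request)
# Thinking blocks then appear in response.content alongside the answer,
# which is what enables the debugging and auditing described above.
print(request["thinking"])
```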

Claude Opus 4 also shows strong results on GPQA Diamond with a 74.1% score and achieves 96.4% on MMLU, demonstrating broad knowledge and reasoning capability. The model supports a 200K-token context window.

On pricing, Opus 4 is positioned as a premium offering at $15 per million input tokens and $75 per million output tokens. This makes it significantly more expensive than mid-tier alternatives, but Anthropic is betting that the quality difference justifies the cost for enterprise use cases like code generation, legal analysis, and scientific research.
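To make those rates concrete, here is a minimal cost estimator using only the per-token prices quoted above (the function name and example token counts are illustrative):

```python
def opus4_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate Claude Opus 4 API cost in USD at the listed rates:
    $15 per million input tokens, $75 per million output tokens."""
    return input_tokens / 1_000_000 * 15 + output_tokens / 1_000_000 * 75

# Example: a 10K-token prompt producing a 2K-token response
print(f"${opus4_cost(10_000, 2_000):.2f}")  # → $0.30
```

At these prices, output tokens dominate for generation-heavy workloads like code synthesis, which is part of why the model is positioned at the enterprise tier.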

The release follows Anthropic's established pattern: a highly capable but expensive flagship first, followed by more cost-effective variants. Claude Sonnet 4.6, the current mid-tier offering, provides much of the same capability at a fraction of the price.
