Beta version: Information might not be fully accurate. Please report any discrepancies.

Anthropic

Verified

25 benchmarks

Claude Opus 4.6

Released 2026-02-05

Unknown Architecture

Training: 2025-05

Verified Model Card

Latest Data

2026-02-20

Context Window

200k

tokens

Input Cost

$5.00

per 1M tokens

Output Cost

$25.00

per 1M tokens

Cache Cost

$0.50 / $6.25

read / write per 1M
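
The per-million-token prices above can be turned into a per-request estimate with simple arithmetic. The following is a minimal sketch, not an official billing formula: the helper name `estimate_cost` and its parameters are hypothetical, and it assumes that cached tokens are counted separately from regular input tokens and that each category is billed linearly at the listed rate.

```python
# Prices copied from the card above, in USD per 1M tokens.
PRICES_PER_MTOK = {
    "input": 5.00,
    "output": 25.00,
    "cache_read": 0.50,
    "cache_write": 6.25,
}


def estimate_cost(input_tokens=0, output_tokens=0,
                  cache_read_tokens=0, cache_write_tokens=0):
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes each token category is billed linearly at its listed rate
    and that cached tokens are not also counted as regular input tokens.
    """
    counts = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
        "cache_write": cache_write_tokens,
    }
    total = sum(counts[kind] * PRICES_PER_MTOK[kind] for kind in counts)
    return total / 1_000_000


if __name__ == "__main__":
    # 10k uncached input tokens plus 2k output tokens.
    print(f"${estimate_cost(input_tokens=10_000, output_tokens=2_000):.4f}")
    # Same request, with an extra 100k tokens read from the cache.
    print(f"${estimate_cost(10_000, 2_000, cache_read_tokens=100_000):.4f}")
```

Under these assumptions, reading 100k tokens from the cache costs one tenth of sending the same tokens as regular input ($0.05 versus $0.50), which is where the read rate matters most for long, repeated prompts.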

Parameters

Unknown

model footprint

Benchmark Provenance

Performance Analysis // Verified Benchmarks

MMLU (5-shot)

Knowledge

91.4 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.

MATH (CoT)

Math

89.2 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Challenging competition mathematics problems (AMC and AIME level).

HumanEval

Coding

94.6 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Functional correctness of synthesized programs from docstrings.

SWE-bench Verified

Coding

80.8 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Resolving real-world GitHub issues. The Verified subset contains only issues that human reviewers confirmed to be solvable.

MMMU (Multimodal)

Multimodal

76.5 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Multi-discipline Multimodal Understanding and Reasoning.

LiveBench

Reasoning

76.33 / 100

Verified

Last Verified: 2026-02-20

LiveBench

Contamination-free, continuously updated reasoning benchmark.

LMArena ELO

Real-world

1502 / 1700

Verified

Last Verified: Unknown Date

Anthropic News

Chatbot Arena ELO score. Crowd-sourced human preference ranking.

AA Intelligence Index

Real-world

46.4* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Artificial Analysis aggregate intelligence index.

MMLU-Pro

Science

82.2 / 100

Verified

Last Verified: Unknown Date

Anthropic News

A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.

HLE

Science

18.6* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Humanity's Last Exam - Hard reasoning benchmark without tools.

HLE-Full

Science

40 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Humanity's Last Exam full evaluation without tools.

CritPt

Science

12.6 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Complex Research using Integrated Thinking - Physics Test. Research-level physics reasoning.

AA Coding Index

Coding

47.6* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Artificial Analysis aggregate coding capability index.

GPQA Diamond

STEM

84* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Graduate-Level Google-Proof Q&A Benchmark.

MRCR v2

Long Context

76 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Multi-Round Co-reference Resolution - 8-needle test.

AA-LCR

Long Context

58.3* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.

IFBench

Instruction Following

44.6* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Artificial Analysis IFBench. Evaluates precise instruction following with constraints.

AA-Omniscience

Hallucination

11 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Evaluates factual knowledge and hallucination across diverse domains.

ARC-AGI-2

Reasoning

68.8 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Abstraction and Reasoning Corpus for AGI, version 2 (extreme difficulty).

Terminal-Bench 2.0

Agentic

65.4 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Agent performance in realistic terminal workflows (v2.0 leaderboard).

Terminal-Bench Hard

Agentic

48.5* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Hard split of Terminal-Bench focused on tougher terminal workflows.

OSWorld-Verified

Agentic

72.7 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Verified desktop computer-use benchmark for end-to-end task completion.

BrowseComp

Agentic

84 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Web browsing and synthesis benchmark for research agents.

TAU-Bench Telecom

Agentic

84.8* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Telecom-domain tool-use and workflow benchmark.

SciCode

Advanced Tasks

45.7* / 100

Third-party

Last Verified: 2026-02-16

Artificial Analysis (Independent)

Scientific programming benchmark for code synthesis and correctness.