Beta version: *Information might not be fully accurate. Please report any discrepancies.

Anthropic

Verified

18 benchmarks

Claude Opus 4.5

Released: 2025-11-24

Architecture: Unknown

Training: 2025-03-31

Verified Model Card

Latest Data: 2026-02-20

Context Window: 200k tokens

Input Cost: $5.00 per 1M tokens

Output Cost: $25.00 per 1M tokens

Cache Cost: $0.50 read / $6.25 write per 1M tokens

Parameters (model footprint): Unknown
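The listed per-1M-token prices translate into a per-request cost by simple proportion. The sketch below is illustrative only: the function name `estimate_cost_usd` and the dictionary keys are hypothetical, and it assumes that cache-read and cache-write tokens are billed separately from regular input tokens, which the listing does not state.

```python
# Prices in USD per 1M tokens, copied from the listing above.
PRICES_PER_MILLION = {
    "input": 5.00,
    "output": 25.00,
    "cache_read": 0.50,
    "cache_write": 6.25,
}


def estimate_cost_usd(input_tokens=0, output_tokens=0,
                      cache_read_tokens=0, cache_write_tokens=0):
    """Estimate the cost of one request in USD.

    Assumption: each token is counted in exactly one bucket
    (input, output, cache read, or cache write).
    """
    usage = {
        "input": input_tokens,
        "output": output_tokens,
        "cache_read": cache_read_tokens,
        "cache_write": cache_write_tokens,
    }
    for kind, count in usage.items():
        if count < 0:
            raise ValueError(f"{kind} token count must be non-negative")
    # Multiply each bucket by its price, then scale from per-1M to per-token.
    total = sum(usage[kind] * PRICES_PER_MILLION[kind] for kind in usage)
    return total / 1_000_000


if __name__ == "__main__":
    cost = estimate_cost_usd(input_tokens=100_000, output_tokens=10_000)
    print(f"${cost:.2f}")  # prints $0.75
```

For example, 100,000 input tokens and 10,000 output tokens come to $0.50 + $0.25 = $0.75 under these prices.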

Model Variants

1 Variant Available

Claude Opus 4.5 High

Unknown

2025-11-24

Benchmark Provenance

Performance Analysis // Verified Benchmarks

MMLU (5-shot)

Category: Knowledge

90.5 / 100

Verified

Last Verified: Unknown Date

Source: Anthropic News

Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.

SWE-bench Verified

Category: Coding

80.9 / 100

Verified

Last Verified: Unknown Date

Source: Anthropic News

Measures the ability to resolve real-world GitHub issues. The Verified subset contains only issues confirmed to be solvable.

LiveBench

Category: Reasoning

59.1 / 100

Verified

Last Verified: 2026-02-20

Source: LiveBench

Contamination-free, continuously updated reasoning benchmark.

AA Intelligence Index

Category: Real-world

43* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Artificial Analysis aggregate intelligence index.

MMLU-Pro

Category: Science

88.9* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.

HLE

Category: Science

12.9* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Humanity's Last Exam: a hard reasoning benchmark, evaluated without tools.

AA Math Index

Category: Math

62.7* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Artificial Analysis aggregate math capability index.

LiveCodeBench v6

Category: Coding

73.8* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Contamination-free coding benchmark using recent problems.

AA Coding Index

Category: Coding

42.9* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Artificial Analysis aggregate coding capability index.

GPQA Diamond

Category: STEM

81* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Graduate-Level Google-Proof Q&A Benchmark.

AA-LCR

Category: Long Context

65.3* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.

IFBench

Category: Instruction Following

43* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Artificial Analysis IFBench. Evaluates precise instruction following with constraints.

AA-Omniscience

Category: Hallucination

10 / 100

Verified

Last Verified: Unknown Date

Source: Anthropic News

Evaluates model omniscience and factual reliability across diverse domains.

AIME 2025

Category: Math

62.7* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

American Invitational Mathematics Examination 2025 problems.

FactScore

Category: Hallucination

51.3 / 100

Verified

Last Verified: Unknown Date

Source: Anthropic News

Precision of fine-grained facts in long-form biographies.

Terminal-Bench Hard

Category: Agentic

40.9* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Hard split of Terminal-Bench focused on tougher terminal workflows.

TAU-Bench Telecom

Category: Agentic

86.3* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Telecom-domain tool-use and workflow benchmark.

SciCode

Category: Advanced Tasks

47* / 100

Third-party

Last Verified: 2026-02-16

Source: Artificial Analysis (Independent)

Scientific programming benchmark for code synthesis and correctness.
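Each entry in the listing pairs a score string with a provenance label, and every third-party score carries a trailing asterisk. A minimal sketch of how such score strings might be parsed and grouped is shown below. The names `parse_score` and `split_by_provenance` are hypothetical, the sample rows are a small subset copied from the listing, and reading the asterisk as a third-party marker is an assumption based on the pattern in the listing, not a documented rule.

```python
import re

# Matches strings such as "90.5/ 100", "43*/ 100", or "43* / 100".
SCORE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)(\*?)\s*/\s*(\d+(?:\.\d+)?)\s*$")


def parse_score(text):
    """Return (value, scale, has_asterisk) for a score string."""
    match = SCORE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized score format: {text!r}")
    value, star, scale = match.groups()
    return float(value), float(scale), star == "*"


# (benchmark name, score string) pairs copied from the listing.
SAMPLE_ROWS = [
    ("MMLU (5-shot)", "90.5/ 100"),
    ("SWE-bench Verified", "80.9/ 100"),
    ("MMLU-Pro", "88.9*/ 100"),
    ("GPQA Diamond", "81*/ 100"),
]


def split_by_provenance(rows):
    """Group benchmark names by whether the score carries an asterisk.

    Assumption: an asterisk marks a third-party measurement.
    """
    groups = {"verified": [], "third_party": []}
    for name, score_text in rows:
        _, _, has_asterisk = parse_score(score_text)
        key = "third_party" if has_asterisk else "verified"
        groups[key].append(name)
    return groups


if __name__ == "__main__":
    for key, names in split_by_provenance(SAMPLE_ROWS).items():
        print(key, names)
```

Keeping the two groups apart is useful when comparing models, since vendor-reported and independently measured scores may come from different evaluation setups.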