Beta version: *Information might not be fully accurate. Please report any discrepancies.

OpenAIVerified20 benchmarks

o3

Released 2025-04-12Reasoning Model Architecture

Training: 2024-05

Verified Model Card

Latest Data

2026-02-20

Context Window

200k

tokens

Input Cost

$2.00

per 1M tokens

Output Cost

$8.00

per 1M tokens

Cache Cost

$0.50 / Free

read / write per 1M

Parameters

Reasoning Model

model footprint

Benchmark Provenance

Performance Analysis // Verified Benchmarks

MATH (CoT)Math

98.2/ 100

Verified

Last Verified: Unknown DateOpenAI Blog

Challenging competition mathematics problems (AIME/IMO level).

LiveBenchReasoning

84.6/ 100

Verified

Last Verified: 2026-02-20LiveBench

Contamination-free, continuously updated reasoning benchmark.

BigCodeBenchCoding

35.5/ 100

Verified

Last Verified: Unknown DateOpenAI Blog

Next-generation HumanEval with more diverse library calls and complex tasks.

AIME 2024/25Math

90.3*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

American Invitational Mathematics Examination. Competition-level math.

AA Intelligence IndexReal-world

38.3*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Artificial Analysis aggregate intelligence index.

AgentBenchAgent

84.7/ 100

Verified

Last Verified: Unknown DateOpenAI Blog

Comprehensive framework to evaluate LLMs as agents across diverse environments.

MMLU-ProScience

85.3*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.

HLEScience

20*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Humanity's Last Exam - Hard reasoning benchmark without tools.

AA Math IndexMath

88.3*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Artificial Analysis aggregate math capability index.

MATH-500Math

99.2*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

500-problem math benchmark for broad quantitative reasoning.

LiveCodeBench v6Coding

80.8*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Contamination-free coding benchmark using recent problems.

AA Coding IndexCoding

38.4*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Artificial Analysis aggregate coding capability index.

GPQA DiamondSTEM

82.7*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Graduate-Level Google-Proof Q&A Benchmark.

AA-LCRLong Context

69.3*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.

IFBenchInstruction Following

71.4*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Artificial Analysis IFBench. Evaluates precise instruction following with constraints.

AIME 2025Math

88.3*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

American Invitational Mathematics Examination 2025 problems.

SuperchemSTEM

40/ 100

Unverified

Last Verified: Unknown DateOpenAI Blog

Expert-level chemistry knowledge and reasoning.

Terminal-Bench HardAgentic

37.1*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Hard split of Terminal-Bench focused on tougher terminal workflows.

TAU-Bench TelecomAgentic

80.7*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Telecom-domain tool-use and workflow benchmark.

SciCodeAdvanced Tasks

41*/ 100

Third-party

Last Verified: 2026-02-16Artificial Analysis (Independent)

Scientific programming benchmark for code synthesis and correctness.