Beta version: Information might not be fully accurate. Please report any discrepancies.

OpenAI | Verified | 20 benchmarks

GPT-5.4

Released 2026-03-05 | Unknown Architecture

Training: 2025-08

Verified Model Card

Latest Data

2026-03-05

Context Window

1.1M

tokens

Input Cost

$2.50

per 1M tokens

Output Cost

$15.00

per 1M tokens
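
The per-token prices above translate into a per-request cost by simple arithmetic. The following is a minimal sketch, assuming the listed rates apply uniformly to all input and output tokens (any tiered, cached, or batch pricing is not covered by this card). The function name `estimate_cost_usd` is hypothetical and used only for illustration.

```python
from decimal import Decimal

# Rates as listed on this card (USD per 1M tokens).
INPUT_USD_PER_M = Decimal("2.50")
OUTPUT_USD_PER_M = Decimal("15.00")
TOKENS_PER_M = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of one request from its token counts.

    Assumes flat pricing: every input token is billed at the input rate
    and every output token at the output rate.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_M
        + Decimal(output_tokens) * OUTPUT_USD_PER_M
    )
    return total / TOKENS_PER_M


# Example: 200,000 input tokens and 10,000 output tokens.
print(estimate_cost_usd(200_000, 10_000))
```

With these rates, output tokens cost six times as much as input tokens, so long generations tend to dominate the bill even when the prompt is large.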

Parameters

Unknown

model footprint

Benchmark Provenance

Performance Analysis // Verified Benchmarks

HLE-Full | Science

39.8 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Humanity's Last Exam full evaluation without tools.

HLE-Full (w/ tools) | Science

52.1 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Humanity's Last Exam full evaluation with tool access enabled.

GPQA Diamond | STEM

92.8 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Graduate-Level Google-Proof Q&A Benchmark.

ARC-AGI-1 | Reasoning

93.7 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Abstraction and Reasoning Corpus, version 1.

MRCR v2 | Long Context

97.3 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Multi-Round Co-reference Resolution, 8-needle test.

ARC-AGI-2 | Reasoning

73.3 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Abstraction and Reasoning Corpus, version 2 (extreme difficulty).

Graphwalks BFS | Long Context

93 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Traversal-based long-context reasoning using breadth-first search (BFS), 128k.
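
For readers unfamiliar with the traversal this benchmark is named after, the following is a minimal sketch of breadth-first search over a small graph. The exact task format used by Graphwalks is not stated on this card, so the adjacency-dict representation and the function name `bfs_order` are assumptions made purely for illustration.

```python
from collections import deque


def bfs_order(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return nodes reachable from `start` in breadth-first visiting order."""
    seen = {start}          # nodes already queued, to avoid revisiting
    order = []              # visiting order, nearest nodes first
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical toy graph: a -> b, a -> c, b -> d, c -> d.
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_order(toy_graph, "a"))
```

In a long-context setting the graph would be given as text spread across the prompt, so the difficulty lies in tracking edges over a large input rather than in the algorithm itself.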

MMMU-Pro | Vision

81.2 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Professional-level expansion of MMMU.

OmniDocBench 1.5 | Vision

0.109 / 1

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

OCR benchmark measuring edit distance (lower is better).
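
Because this score is an edit distance on a 0 to 1 scale, it reads in the opposite direction from the other entries on this card. The following is a minimal sketch of one common way to compute a normalized edit distance: Levenshtein distance divided by the longer string's length. The benchmark's actual normalization and text preprocessing are not stated here, so treat this as an assumption; both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match."""
    if not predicted and not reference:
        return 0.0
    return levenshtein(predicted, reference) / max(len(predicted), len(reference))


print(normalized_edit_distance("kitten", "sitting"))
```

Under this definition, a score near 0 means the OCR output is almost identical to the reference text.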

Terminal-Bench 2.0 | Agentic

75.1 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Agent performance in realistic terminal workflows (v2.0 leaderboard).

OSWorld-Verified | Agentic

75 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Verified desktop computer-use benchmark for end-to-end task completion.

WebArena | Agentic

67.3 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Browser-based autonomous task execution benchmark.

SWE-bench Pro | Agentic

57.7 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Higher-difficulty SWE-bench subset for frontier coding agents.

BrowseComp | Agentic

82.7 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Web browsing and synthesis benchmark for research agents.

MCP Atlas | Agentic

67.2 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Multi-step workflows using Model Context Protocol.

Toolathlon | Agentic

54.6 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Long-horizon real-world software tasks.

GDPVal-AA | Agentic

83 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Artificial Analysis GDPVal benchmark for knowledge-work quality.

TAU-Bench Telecom | Agentic

98.9 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Telecom-domain tool-use and workflow benchmark.

FrontierSci Research | Advanced Tasks

33 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Open-ended scientific research benchmark with expert-level questions.

FrontierMath | Math

47.6 / 100

Verified

Last Verified: 2026-03-05 | Introducing GPT-5.4

Advanced mathematics benchmark with tiered difficulty.