Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data:     2026-02-20
Context Window:  2.0M tokens
Input Cost:      $3.00 per 1M tokens
Output Cost:     $15.00 per 1M tokens
Cache Cost:      $0.75 read / free write, per 1M tokens
Parameters:      Ultra-Dense (model footprint)
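The rates above combine into a simple per-request cost formula. A minimal sketch, assuming the listed prices ($3.00/M input, $15.00/M output, $0.75/M cache reads, free cache writes) and that cached input tokens are billed at the cache-read rate in place of the full input rate; the function name and blending logic are illustrative, not from this page:

```python
INPUT_PER_M = 3.00        # $ per 1M input tokens
OUTPUT_PER_M = 15.00      # $ per 1M output tokens
CACHE_READ_PER_M = 0.75   # $ per 1M cached-input tokens (cache writes are free)

def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the dollar cost of one request under the listed rates."""
    fresh = input_tokens - cached_tokens  # tokens billed at the full input rate
    return (fresh * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M
            + cached_tokens * CACHE_READ_PER_M) / 1_000_000

# Example: 100k input tokens (80k of them cached) plus 5k output tokens
print(round(estimate_cost(100_000, 5_000, cached_tokens=80_000), 4))  # → 0.195
```

Note how heavily cache reads discount repeated context: the 80k cached tokens cost $0.06 here versus $0.24 at the full input rate.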
Performance Analysis // Verified Benchmarks
- Massive Multitask Language Understanding: covers 57 subjects across STEM, the humanities, social sciences, and more.
- Challenging competition mathematics problems (AIME/IMO level).
- Contamination-free, continuously updated reasoning benchmark.
- American Invitational Mathematics Examination: competition-level math.
- Chatbot Arena Elo score: crowd-sourced human preference ranking.
- Artificial Analysis aggregate intelligence index.
- A harder, more robust version of MMLU, focused on complex reasoning and STEM subjects.
- Humanity's Last Exam, hard subset, evaluated without tools.
- Humanity's Last Exam, full evaluation without tools.
- Artificial Analysis aggregate math capability index.
- 500-problem math benchmark for broad quantitative reasoning.
- Contamination-free coding benchmark using recent problems.
- Artificial Analysis aggregate coding capability index.
- Graduate-Level Google-Proof Q&A Benchmark.
- Physics reasoning and problem-solving benchmark.
- Artificial Analysis Long Context Reasoning benchmark: evaluates reasoning over long contexts.
- Artificial Analysis IFBench: evaluates precise instruction following under constraints.
- American Invitational Mathematics Examination 2025 problems.
- Abstraction and Reasoning Corpus, Level 2 (extreme difficulty).
- Precision of fine-grained facts in long-form biographies.
- Hard split of Terminal-Bench, focused on tougher terminal workflows.
- Telecom-domain tool-use and workflow benchmark.
- Scientific programming benchmark for code synthesis and correctness.
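For intuition on the Arena entry above: an Elo-style rating maps a rating gap to a predicted win rate. The sketch below uses the classic Elo expected-score formula, not LMArena's exact fitting procedure (which fits ratings with a Bradley-Terry-style model); under it, a 100-point gap predicts roughly a 64% win rate.

```python
# Classic Elo expected score: the probability that model A beats model B
# given their ratings, with the conventional 400-point scale factor.
# Illustrative only; Chatbot Arena computes ratings differently.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(round(expected_score(1300, 1200), 2))  # → 0.64
print(round(expected_score(1200, 1200), 2))  # → 0.5 (equal ratings)
```

Because the mapping depends only on the rating difference, a model at 1400 vs 1300 is predicted to win exactly as often as one at 1300 vs 1200.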