Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 2.0M tokens
Input Cost: $1.25 per 1M tokens
Output Cost: $5.00 per 1M tokens
Cache Cost: $0.13 read / $0.38 write per 1M tokens
Parameters: Unknown (model footprint)
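The listed rates can be combined into a per-request cost estimate. A minimal sketch, assuming straight per-token pricing at the rates above (real billing granularity and cache semantics may differ by provider):

```python
# Rates taken from the stats above, in USD per 1M tokens.
INPUT_PER_M = 1.25
OUTPUT_PER_M = 5.00
CACHE_READ_PER_M = 0.13
CACHE_WRITE_PER_M = 0.38

def request_cost(input_tokens, output_tokens, cache_read=0, cache_write=0):
    """Estimate the USD cost of one request from raw token counts."""
    return (
        input_tokens * INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
        + cache_read * CACHE_READ_PER_M
        + cache_write * CACHE_WRITE_PER_M
    ) / 1_000_000

# Example: 100k input tokens + 5k output tokens.
print(round(request_cost(100_000, 5_000), 4))  # → 0.15
```

Note how cache reads ($0.13/1M) cost roughly a tenth of fresh input tokens ($1.25/1M), which is why caching a long shared prefix dominates savings on repeated calls.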
Performance Analysis // Verified Benchmarks
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Resolving real-world GitHub issues; the verified subset ensures the issues are solvable.
Contamination-free, continuously updated reasoning benchmark.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Verified subset of SimpleQA for parametric knowledge evaluation.
Competitive programming problems from Codeforces, ICPC, and IOI with Elo rating.
Graduate-Level Google-Proof Q&A Benchmark.
Physics reasoning and problem-solving benchmark.
Multi-Round Context Retrieval - 8-needle test.
Massive Multilingual Language Understanding.
American Invitational Mathematics Examination 2025 problems.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Physical Interaction QA across multiple languages and cultures.
Professional-level MMMU expansion.
OCR benchmark measuring edit distance (lower is better).
Screen understanding benchmark for GUI interaction.
Information synthesis from complex charts.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Long-horizon business simulation benchmark (final account balance).
Factuality benchmark across grounding, parametric, search, and multimodal tasks.
Multi-step workflows using Model Context Protocol.
Long-horizon real-world software tasks.
Tool-use and API orchestration benchmark for assistants.
Video variant of MMMU for multimodal understanding and reasoning.
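The OCR entry above is scored by edit distance, where lower is better. As an illustration of that metric, here is a standard Levenshtein distance (the benchmark's exact variant and normalization are not specified here, so treat this as a sketch):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum number of
    single-character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # → 3
```

An OCR system's output is compared against the ground-truth transcription this way, so a score of 0 means a perfect read.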