Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: Unknown
Context Window: 128k tokens
Input Cost: $0.20 per 1M tokens
Output Cost: $0.80 per 1M tokens
Parameters (model footprint): 230B MoE (10B active)
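As a rough illustration of how the per-token prices listed above translate into a per-request cost, here is a minimal sketch. The two prices are taken from the listing; the function name `estimate_cost_usd` and the sample token counts are hypothetical, and real billing may differ (for example through caching discounts or minimum charges, which this page does not describe).

```python
# Prices copied from the listing above (USD per 1M tokens).
INPUT_USD_PER_MILLION = 0.20
OUTPUT_USD_PER_MILLION = 0.80


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts.

    Hypothetical helper: assumes simple linear pricing with no discounts.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative request: 100,000 input tokens and 2,000 output tokens.
    print(f"${estimate_cost_usd(100_000, 2_000):.4f}")  # prints $0.0216
```

Under these assumptions, output tokens cost four times as much as input tokens, so long generations dominate the bill even when the prompt is large.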
Performance Analysis // Verified Benchmarks
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Resolving real-world GitHub issues. The Verified subset is limited to issues confirmed to be solvable.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam: a hard reasoning benchmark, evaluated without tools.
Contamination-free coding benchmark using recent problems.
Graduate-Level Google-Proof Q&A Benchmark.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
American Invitational Mathematics Examination 2025 problems.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Web browsing + synthesis benchmark for research agents.
Scientific programming benchmark for code synthesis and correctness.