Beta version: information might not be fully accurate. Please report any discrepancies.
Latest data: 2026-04-24
Context window: 1.0M tokens
Input cost: $0.43 per 1M tokens
Output cost: $0.87 per 1M tokens
Parameters (model footprint): 1.6T MoE (49B activated)
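The figures above are enough for a rough estimate of per-request cost and of how much of the model runs for each token. This is a minimal sketch under several assumptions that the page does not state:

- Pricing is flat and linear, with no caching discounts or long-context tiers.
- "1.0M" means 1,000,000 tokens.
- Input and output tokens share the context window.

The helper names `request_cost` and `active_fraction` are hypothetical.

```python
# Back-of-envelope arithmetic using the figures listed on this page.
# ASSUMPTIONS (not stated by the page): flat linear per-token pricing,
# "1.0M" read as 1,000,000 tokens, and input + output sharing the window.
# Helper names below are hypothetical, for illustration only.

INPUT_USD_PER_M = 0.43       # input cost, USD per 1M tokens
OUTPUT_USD_PER_M = 0.87      # output cost, USD per 1M tokens
CONTEXT_WINDOW = 1_000_000   # 1.0M-token context window (assumed exact)

TOTAL_PARAMS = 1.6e12        # 1.6T total parameters (MoE)
ACTIVE_PARAMS = 49e9         # 49B activated parameters


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request at flat per-token pricing."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: output tokens count against the same context window.
    if input_tokens + output_tokens > CONTEXT_WINDOW:
        raise ValueError("request exceeds the 1.0M-token context window")
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


def active_fraction() -> float:
    """Share of total parameters that are activated (49B of 1.6T)."""
    return ACTIVE_PARAMS / TOTAL_PARAMS


if __name__ == "__main__":
    # Hypothetical request: 200k input tokens, 8k output tokens.
    print(f"${request_cost(200_000, 8_000):.4f}")                # $0.0930
    print(f"{active_fraction():.2%} of parameters activated")     # 3.06%
```

With these prices, a 200k-in / 8k-out request costs about nine cents. Because output tokens cost roughly twice as much as input tokens, long generations dominate the bill only when they are a sizable share of the total tokens.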
Performance Analysis // Verified Benchmarks
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Challenging competition mathematics problems (AIME/IMO level).
Functional correctness of programs synthesized from docstrings.
Resolving real-world GitHub issues; the Verified subset is restricted to human-validated issues confirmed to be solvable.
Next-generation HumanEval with more diverse library calls and complex tasks.
Grade school math word problems requiring multi-step reasoning.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam, full evaluation without tools.
Humanity's Last Exam, full evaluation with tool access enabled.
Harvard-MIT Mathematics Tournament 2026: high-difficulty competition math.
International Mathematical Olympiad (IMO)-style, answer-only benchmark.
Competitive programming rating based on problem solving.
Contamination-free coding benchmark using recent problems.
Graduate-Level Google-Proof Q&A Benchmark.
Multi-Round Coreference Resolution (MRCR), 8-needle test.
Comprehensive long-context understanding (128K-token contexts).
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Higher-difficulty SWE-bench subset for frontier coding agents.
Software engineering performance across multilingual codebases.
Web browsing and synthesis benchmark for research agents.
Multi-step workflows using the Model Context Protocol (MCP).
Long-horizon, real-world software tasks.
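Functional-correctness coding benchmarks, such as the docstring-to-program one listed above, are commonly scored as pass@k. This page does not say which metric or which k it reports, so treat the following as an assumption-labeled sketch.

It implements the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). For each problem:

- n samples are generated.
- c of those samples pass every unit test.
- pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems.

The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem (hypothetical helper).

    n: number of samples generated
    c: number of samples that pass all unit tests
    k: sample budget being scored
    Returns 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement from the n are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    # Exact integer binomials; true division yields a float in [0, 1].
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 3 passing samples out of 10 scores pass@1 = 0.3 (pass@1 reduces to c/n) and pass@5 = 1 - 21/252 ≈ 0.917. Sampling n > k and applying the estimator gives a lower-variance result than literally drawing only k samples per problem.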