Beta version: *Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 200k tokens
Input Cost: $1.00 per 1M tokens
Output Cost: $3.20 per 1M tokens
Cache Cost: $0.20 read / free write, per 1M tokens
Parameters: 744B total, 40B active (model footprint)
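As a quick sanity check on the rates above, the cost of a single request can be estimated by splitting the prompt into cached and uncached tokens. This is a minimal sketch; the function and parameter names are illustrative, not from any official SDK.

```python
# Per-request cost estimate from the listed rates:
# $1.00/M uncached input, $3.20/M output, $0.20/M cache reads, free cache writes.
INPUT_RATE = 1.00       # USD per 1M uncached input tokens
OUTPUT_RATE = 3.20      # USD per 1M output tokens
CACHE_READ_RATE = 0.20  # USD per 1M cached input tokens (cache writes are free)

def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD; cached_tokens must not exceed input_tokens."""
    uncached = input_tokens - cached_tokens
    return (uncached * INPUT_RATE
            + cached_tokens * CACHE_READ_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

# Example: a 100k-token prompt with 20k tokens served from cache, 5k-token reply.
print(f"${estimate_cost(100_000, 20_000, 5_000):.2f}")  # → $0.10
```

At these rates, cache hits cut input cost fivefold, so long shared prefixes dominate the savings.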
Performance Analysis // Verified Benchmarks
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Challenging competition mathematics problems (AIME/IMO level).
Functional correctness of synthesized programs from docstrings.
Resolving real-world GitHub issues; the Verified subset is limited to human-confirmed solvable issues.
Contamination-free, continuously updated reasoning benchmark.
Chatbot Arena Elo score. Crowd-sourced human preference ranking.
Artificial Analysis aggregate intelligence index.
Humanity's Last Exam: a hard reasoning benchmark, evaluated without tool use.
Artificial Analysis aggregate coding capability index.
Graduate-Level Google-Proof Q&A Benchmark.
Artificial Analysis Long Context Reasoning benchmark: evaluates reasoning over long inputs.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
American Invitational Mathematics Examination 2025 problems.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench, restricted to the most difficult terminal workflows.
Higher-difficulty SWE-bench subset for frontier coding agents.
Web browsing + synthesis benchmark for research agents.
Broad retrieval and synthesis benchmark across many sources.
Long-horizon business simulation benchmark (final account balance).
Tool-use and API orchestration benchmark for assistants.
Telecom-domain tool-use and workflow benchmark.
Scientific programming benchmark for code synthesis and correctness.