Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data:     2026-02-20
Context Window:  2.0M tokens
Input Cost:      $3.00 per 1M tokens
Output Cost:     $15.00 per 1M tokens
Cache Cost:      $0.75 read / free write, per 1M tokens
Parameters:      Ultra-Dense (model footprint)
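The rates above combine into a simple per-request cost formula. A minimal sketch, assuming the listed prices ($3.00/M input, $15.00/M output, $0.75/M cache reads, free cache writes) and that cached input tokens are billed at the cache-read rate in place of the full input rate; the function name and blending logic are illustrative, not from this page:

```python
INPUT_PER_M = 3.00        # $ per 1M input tokens
OUTPUT_PER_M = 15.00      # $ per 1M output tokens
CACHE_READ_PER_M = 0.75   # $ per 1M cached-input tokens (cache writes are free)

def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the dollar cost of one request under the listed rates."""
    fresh = input_tokens - cached_tokens  # tokens billed at the full input rate
    return (fresh * INPUT_PER_M
            + output_tokens * OUTPUT_PER_M
            + cached_tokens * CACHE_READ_PER_M) / 1_000_000

# Example: 100k input tokens (80k of them cached) plus 5k output tokens
print(round(estimate_cost(100_000, 5_000, cached_tokens=80_000), 4))  # → 0.195
```

Note how heavily cache reads discount repeated context: the 80k cached tokens cost $0.06 here versus $0.24 at the full input rate.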
Performance Analysis // Verified Benchmarks
- Massive Multitask Language Understanding: covers 57 subjects across STEM, the humanities, social sciences, and more.
- Challenging competition mathematics problems (AIME/IMO level).
- Contamination-free, continuously updated reasoning benchmark.
- American Invitational Mathematics Examination: competition-level math.
- Chatbot Arena Elo score: crowd-sourced human preference ranking.
- Artificial Analysis aggregate intelligence index.
- A harder, more robust version of MMLU, focused on complex reasoning and STEM subjects.
- Humanity's Last Exam, hard subset, evaluated without tools.
- Humanity's Last Exam, full evaluation without tools.
- Artificial Analysis aggregate math capability index.
- 500-problem math benchmark for broad quantitative reasoning.
- Contamination-free coding benchmark using recent problems.
- Artificial Analysis aggregate coding capability index.
- Graduate-Level Google-Proof Q&A Benchmark.
- Physics reasoning and problem-solving benchmark.
- Artificial Analysis Long Context Reasoning benchmark: evaluates reasoning over long contexts.
- Artificial Analysis IFBench: evaluates precise instruction following under constraints.
- American Invitational Mathematics Examination 2025 problems.
- Abstraction and Reasoning Corpus, Level 2 (extreme difficulty).
- Precision of fine-grained facts in long-form biographies.
- Hard split of Terminal-Bench, focused on tougher terminal workflows.
- Telecom-domain tool-use and workflow benchmark.
- Scientific programming benchmark for code synthesis and correctness.
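For intuition on the Arena entry above: an Elo-style rating maps a rating gap to a predicted win rate. The sketch below uses the classic Elo expected-score formula, not LMArena's exact fitting procedure (which fits ratings with a Bradley-Terry-style model); under it, a 100-point gap predicts roughly a 64% win rate.

```python
# Classic Elo expected score: the probability that model A beats model B
# given their ratings, with the conventional 400-point scale factor.
# Illustrative only; Chatbot Arena computes ratings differently.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

print(round(expected_score(1300, 1200), 2))  # → 0.64
print(round(expected_score(1200, 1200), 2))  # → 0.5 (equal ratings)
```

Because the mapping depends only on the rating difference, a model at 1400 vs 1300 is predicted to win exactly as often as one at 1300 vs 1200.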