Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-04-03
Context Window: 256k tokens
Input Cost: $0.06 per 1M tokens
Output Cost: $0.33 per 1M tokens
Parameters (model footprint): 25.2B total (3.8B active)
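The per-token prices and parameter counts above lend themselves to a quick back-of-the-envelope calculation. The sketch below is illustrative only: the function names and the example token counts are hypothetical, and it assumes billing is a simple linear function of input and output tokens with no caching discounts or minimum charges.

```python
# Prices and parameter counts copied from the listing above.
INPUT_COST_PER_M = 0.06    # USD per 1M input tokens
OUTPUT_COST_PER_M = 0.33   # USD per 1M output tokens
TOTAL_PARAMS_B = 25.2      # total parameters, in billions
ACTIVE_PARAMS_B = 3.8      # active parameters, in billions


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request, assuming purely linear pricing."""
    return (
        input_tokens * INPUT_COST_PER_M + output_tokens * OUTPUT_COST_PER_M
    ) / 1_000_000


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the listed parameters that is marked as active."""
    return active_b / total_b


if __name__ == "__main__":
    # Hypothetical request: 200,000 input tokens and 4,000 output tokens.
    print(f"cost: ${estimate_cost(200_000, 4_000):.5f}")
    print(f"active share: {active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B):.1%}")
```

Under these assumptions, output tokens cost 5.5 times as much as input tokens, so long generations dominate the bill well before long prompts do.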
Performance Analysis // Verified Benchmarks
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam full evaluation without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Future prediction of AIME performance levels.
Competitive programming rating based on problem solving.
Contamination-free coding benchmark using recent problems.
Graduate-Level Google-Proof Q&A Benchmark.
Multi-Round Coreference Resolution (MRCR) long-context benchmark - 8-needle test.
Multilingual Massive Multitask Language Understanding.
Extra-hard subset of BIG-bench focusing on challenging reasoning and knowledge tasks.
Comprehensive mathematical vision benchmark.
Professional-level expansion of MMMU.
OCR benchmark measuring edit distance (lower is better).
Multimodal medical question answering benchmark.
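For the OCR entry in the list above, "edit distance (lower is better)" can be illustrated with a small sketch. The listing does not say which variant the benchmark uses, so this assumes a character-level Levenshtein distance normalized by the longer string's length; the function names are hypothetical and this is not the benchmark's official scorer.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means a perfect transcription."""
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return edit_distance(predicted, reference) / longest


if __name__ == "__main__":
    # Hypothetical OCR output compared against its ground-truth text.
    print(normalized_edit_distance("Invoice N0. 1234", "Invoice No. 1234"))
```

A score of 0.0 means the OCR output matches the reference exactly, which is why lower values rank higher on this benchmark.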