Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-04-02
Context Window: 128k tokens
Input Cost: Free (per 1M tokens)
Output Cost: Free (per 1M tokens)
Parameters: 2.3B effective (5.1B with embeddings) model footprint
Performance Analysis // Verified Benchmarks
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Projected future AIME performance levels.
Competitive programming rating based on problem solving.
Contamination-free coding benchmark using recent problems.
Graduate-Level Google-Proof Q&A Benchmark.
Multi-Round Context Retrieval - 8-needle test.
Massive Multilingual Language Understanding.
Extra-hard subset of BIG-bench focusing on challenging reasoning and knowledge tasks.
Comprehensive mathematical vision benchmark.
Professional-level MMMU expansion.
OCR benchmark measuring edit distance (lower is better).
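The edit-distance metric above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn the model's OCR output into the reference text; OCR benchmarks often report it normalized by text length. A minimal sketch of the standard Levenshtein computation (the function names here are illustrative, not the benchmark's actual scoring code):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b, using a single rolling DP row."""
    dp = list(range(len(b) + 1))  # dp[j] = distance from a[:0] to b[:j]
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal cell dp[i-1][j-1]
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,              # delete ca from a
                dp[j - 1] + 1,          # insert cb into a
                prev + (ca != cb),      # substitute (free if characters match)
            )
    return dp[-1]

def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match (lower is better)."""
    if not prediction and not reference:
        return 0.0
    return edit_distance(prediction, reference) / max(len(prediction), len(reference))
```

For example, `edit_distance("kitten", "sitting")` is 3 (two substitutions plus one insertion), and a perfect OCR transcription scores a normalized distance of 0.0.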
Multimodal medical question answering benchmark.
Multilingual speech-to-text translation benchmark.
Few-shot evaluation of universal speech representations; scored by error rate (lower is better).