Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 1.0M tokens
Input Cost: $0.50 per 1M tokens
Output Cost: $3.00 per 1M tokens
Cache Cost: $0.05 (read) / Free (write) per 1M tokens
Parameters: Speed Optimized model footprint
1 Variant Available
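As a sketch of how the listed rates combine into a per-request cost (assuming the card's prices: $0.50/M uncached input, $3.00/M output, $0.05/M cache reads, free cache writes — the function name and example token counts are illustrative, not part of the source):

```python
def estimate_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate request cost in USD from the per-1M-token rates above.

    Rates are assumptions taken from the card: uncached input $0.50/M,
    output $3.00/M, cache reads $0.05/M, cache writes free.
    """
    INPUT_RATE = 0.50 / 1_000_000
    OUTPUT_RATE = 3.00 / 1_000_000
    CACHE_READ_RATE = 0.05 / 1_000_000

    uncached = input_tokens - cached_tokens  # tokens billed at the full input rate
    return (uncached * INPUT_RATE
            + cached_tokens * CACHE_READ_RATE
            + output_tokens * OUTPUT_RATE)

# Example: 100k-token prompt, 80k of it served from cache, 2k-token reply
print(round(estimate_cost(100_000, 2_000, cached_tokens=80_000), 4))  # 0.02
```

Note how cache reads at $0.05/M make the cached portion of the prompt an order of magnitude cheaper than billing it at the full input rate.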
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Contamination-free, continuously updated reasoning benchmark.
Artificial Analysis aggregate intelligence index.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Verified subset of SimpleQA for parametric knowledge evaluation.
Artificial Analysis aggregate math capability index.
Contamination-free coding benchmark using recent problems.
Competitive programming problems from Codeforces, ICPC, and IOI with Elo rating.
Artificial Analysis aggregate coding capability index.
Graduate-Level Google-Proof Q&A Benchmark.
Multi-Round Context Retrieval - 8-needle test.
Artificial Analysis Long Context Reasoning benchmark. Evaluates reasoning over long contexts.
Massive Multilingual Language Understanding.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
American Invitational Mathematics Examination 2025 problems.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Physical Interaction QA across multiple languages and cultures.
Professional-level MMMU expansion.
OCR benchmark measuring edit distance (lower is better).
Screen understanding benchmark for GUI interaction.
Information synthesis from complex charts.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench focused on tougher terminal workflows.
Long-horizon business simulation benchmark (final account balance).
Factuality benchmark across grounding, parametric knowledge, search, and multimodal settings.
Multi-step workflows using Model Context Protocol.
Long-horizon real-world software tasks.
Tool-use and API orchestration benchmark for assistants.
Telecom-domain tool-use and workflow benchmark.
Scientific programming benchmark for code synthesis and correctness.
Video variant of MMMU for multimodal understanding and reasoning.
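The OCR entry above scores by edit distance, where lower is better. A minimal sketch of that metric — standard Levenshtein distance, which is an assumption here, not necessarily the benchmark's exact scorer:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # insertions, deletions, and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

A perfect transcription scores 0, so unlike the accuracy-style benchmarks in this list, lower numbers indicate better OCR output.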