Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 1.0M tokens
Input Cost: $0.30 per 1M tokens
Output Cost: $2.50 per 1M tokens
Parameters: Speed Optimized (model footprint)
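The listed input and output prices can be turned into a rough per-request cost estimate. The sketch below assumes simple linear per-token billing at the two rates shown above. The function name `estimate_cost_usd` and the example token counts are hypothetical, and any tiering, caching discount, or minimum charge a provider may apply is not modeled.

```python
# Prices copied from the listing above (USD per 1M tokens).
INPUT_USD_PER_MTOK = 0.30
OUTPUT_USD_PER_MTOK = 2.50


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost, assuming plain linear per-token pricing.

    This is an illustrative helper, not a provider API; real billing
    may differ (tiers, caching, rounding).
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 200k tokens in, 4k tokens out.
    cost = estimate_cost_usd(200_000, 4_000)
    print(f"estimated cost: ${cost:.4f}")
```

Under these assumptions a request with 200,000 input tokens and 4,000 output tokens comes to about $0.07, most of it from the input side despite the higher output rate.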
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Contamination-free, continuously updated reasoning benchmark.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Verified subset of SimpleQA for parametric knowledge evaluation.
Competitive programming problems from Codeforces, ICPC, and IOI with Elo rating.
Graduate-Level Google-Proof Q&A Benchmark.
Multi-Round Context Retrieval - 8-needle test.
Massive Multilingual Language Understanding.
American Invitational Mathematics Examination 2025 problems.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Physical Interaction QA across multiple languages and cultures.
Professional-level expansion of MMMU.
OCR benchmark measuring edit distance (lower is better).
Screen understanding benchmark for GUI interaction.
Information synthesis from complex charts.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Long-horizon business simulation benchmark (final account balance).
Factuality benchmark across grounding, parametric, search, and multimodal.
Multi-step workflows using Model Context Protocol.
Long-horizon real-world software tasks.
Tool-use and API orchestration benchmark for assistants.
Video variant of MMMU for multimodal understanding and reasoning.
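One entry in the list above, the OCR benchmark, is scored by edit distance, where lower is better. As a minimal sketch of what such a score can look like, the code below computes Levenshtein distance and divides it by the longer string's length. This normalization is an assumption for illustration, since the listing does not say how the benchmark normalizes its score, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(predicted: str, reference: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match.

    Dividing by the longer length is an assumed convention here.
    """
    longest = max(len(predicted), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(predicted, reference) / longest
```

With this convention a perfect transcription scores 0.0 and a completely wrong one scores 1.0, which matches the "lower is better" reading of the metric.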