Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-04-20
Context Window: 256k tokens
Input Cost: $0.20 per 1M tokens
Output Cost: $4.00 per 1M tokens
Cache Cost: $0.16 read / free write, per 1M tokens
Parameters: 1T MoE (32B activated) model footprint
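To make the per-1M-token pricing concrete, here is a minimal sketch of a request-cost estimate using the rates listed above. The split between cached and fresh input tokens, and the example token counts, are illustrative assumptions, not part of the spec card.

```python
# Rates taken from the spec card above (dollars per 1M tokens).
INPUT_PER_M = 0.20       # fresh input tokens
OUTPUT_PER_M = 4.00      # output tokens
CACHE_READ_PER_M = 0.16  # cached input tokens (cache writes are free)

def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate the dollar cost of one request. Cached input tokens
    are billed at the cheaper cache-read rate; the rest at the
    fresh-input rate."""
    fresh = input_tokens - cached_tokens
    return (fresh * INPUT_PER_M
            + cached_tokens * CACHE_READ_PER_M
            + output_tokens * OUTPUT_PER_M) / 1_000_000

# Hypothetical request: 100k input tokens (80k served from cache),
# 2k output tokens.
print(round(request_cost(100_000, 2_000, cached_tokens=80_000), 4))  # → 0.0248
```

At these rates, output tokens dominate the bill: each output token costs 20x a fresh input token and 25x a cached one.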
Performance Analysis // Verified Benchmarks
- Resolving real-world GitHub issues; the verified subset ensures issues are solvable.
- Online-judge programming benchmark for Python.
- Humanity's Last Exam, full evaluation without tools.
- Humanity's Last Exam, full evaluation with tool access enabled.
- Forecasting of future AIME performance levels.
- Harvard-MIT Mathematics Tournament 2026: high-difficulty competition math.
- International Mathematical Olympiad-style, answer-only benchmark.
- Contamination-free coding benchmark built from recent problems.
- Graduate-Level Google-Proof Q&A benchmark.
- Comprehensive mathematical vision benchmark.
- Professional-level MMMU expansion.
- Early-stage visual development benchmark.
- Chart-based reasoning over figures from arXiv papers (Reasoning QA).
- Agent performance in realistic terminal workflows (v2.0 leaderboard).
- Benchmark for daily agentic tasks across text and multimodal interactions.
- Advanced agentic planning and execution benchmark.
- Verified desktop computer-use benchmark for end-to-end task completion.
- Higher-difficulty SWE-bench subset for frontier coding agents.
- Software-engineering performance across multilingual codebases.
- Web browsing and synthesis benchmark for research agents.
- Multi-agent swarm variant of BrowseComp.
- Broad retrieval-and-synthesis benchmark across many sources.
- Long-horizon, real-world software tasks.
- Deep multi-hop search QA for long-horizon agents.
- Model Context Protocol (MCP) interoperability benchmark.
- Scientific programming benchmark for code synthesis and correctness.