Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-19
Context Window: 1.0M tokens
Input Cost: $2.50 per 1M tokens
Output Cost: $15.00 per 1M tokens
Parameters (model footprint): Unknown
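As a rough illustration of how the listed prices translate into a per-request cost, here is a minimal sketch. It assumes flat per-token pricing at the input and output rates shown above, with no caching discounts or tiered rates; the helper name `estimate_cost_usd` and the constant names are hypothetical, not part of any provider's API.

```python
from decimal import Decimal

# Listed prices in USD per 1M tokens (taken from the figures above).
INPUT_USD_PER_MTOK = Decimal("2.50")
OUTPUT_USD_PER_MTOK = Decimal("15.00")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Hypothetical helper: estimate request cost under flat per-token pricing.

    Assumption: no caching discounts, batch discounts, or long-context tiers.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = Decimal(input_tokens) * INPUT_USD_PER_MTOK
    output_cost = Decimal(output_tokens) * OUTPUT_USD_PER_MTOK
    return (input_cost + output_cost) / TOKENS_PER_MTOK


# Example: a request with 200,000 input tokens and 10,000 output tokens.
print(estimate_cost_usd(200_000, 10_000))
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding. Actual billing may differ from this estimate if the provider applies other pricing rules.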
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Competitive programming problems from Codeforces, ICPC, and IOI with Elo rating.
Graduate-Level Google-Proof Q&A Benchmark.
Multi-Round Coreference Resolution - 8-needle test.
Multilingual Massive Multitask Language Understanding.
Abstraction and Reasoning Corpus - Level 2 (Extreme difficulty).
Professional level MMMU expansion.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Higher-difficulty SWE-bench subset for frontier coding agents.
Web browsing + synthesis benchmark for research agents.
Retail-domain tool-use and workflow benchmark from τ²-bench.
Telecom-domain tool-use and workflow benchmark.
Scientific programming benchmark for code synthesis and correctness.