Beta version: Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-04-27
Context Window: 262k tokens
Input Cost: $0.50 per 1M tokens
Output Cost: $2.00 per 1M tokens
Parameters: 27B (model footprint)
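The per-token prices above can be turned into a rough per-request estimate. The sketch below is only an illustration: `estimate_cost` is a hypothetical helper, not part of any provider API, and it assumes billing is strictly linear in token count (no caching discounts, minimum charges, or tiered pricing). The prices are copied from this listing, which is marked as beta and may not be accurate.

```python
# Prices copied from the listing above (USD per 1M tokens).
# Assumption: billing is linear in the number of tokens.
INPUT_COST_PER_M = 0.50
OUTPUT_COST_PER_M = 2.00


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_COST_PER_M + output_tokens * OUTPUT_COST_PER_M
    return total / 1_000_000


# Example: 200,000 input tokens and 4,000 output tokens.
print(f"${estimate_cost(200_000, 4_000):.4f}")
```

With these numbers, output tokens cost four times as much as input tokens, so long generations tend to dominate the bill even when the prompt is large.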
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Multi-discipline Multimodal Understanding and Reasoning.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Future prediction of AIME performance levels.
Harvard-MIT Mathematics Tournament 2026 - High difficulty competition math.
Contamination-free coding benchmark using recent problems.
Graduate-Level Google-Proof Q&A Benchmark.
Compact MathVista split for faster multimodal reasoning checks.
Professional level MMMU expansion.
Real-world visual question answering.
Elite multimodal model evaluation.
Visual object counting and identification.
Environment Reasoning and Question Answering.
Chart-based reasoning from arXiv papers (Reasoning QA).
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Higher-difficulty SWE-bench subset for frontier coding agents.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.
Mobile device control and task completion benchmark.
Multi-task long video understanding benchmark.
Comprehensive video understanding benchmark across multiple tasks.