Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-04-27
Context Window: 262k tokens
Input Cost: $0.16 per 1M tokens
Output Cost: $0.97 per 1M tokens
Parameters (model footprint): 35B total (3B active)
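As a rough illustration of how the per-1M-token prices listed above translate into the cost of a single request, here is a minimal sketch. The function names are hypothetical, the reading of "262k" as 262,000 tokens is an assumption (the exact limit may differ), and so is the idea that input and output tokens share one context window.

```python
# Figures copied from the listing above.
INPUT_USD_PER_M = 0.16    # input price, USD per 1M tokens
OUTPUT_USD_PER_M = 0.97   # output price, USD per 1M tokens
# Assumption: "262k" is read as 262,000 tokens; the exact limit may differ.
CONTEXT_WINDOW_TOKENS = 262_000


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request from its token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


def fits_context(input_tokens: int, output_tokens: int) -> bool:
    """Check whether a request fits the window.

    Assumption: input and output tokens count against the same window.
    """
    return input_tokens + output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # Example: a 100k-token prompt with a 10k-token completion.
    cost = estimate_cost_usd(100_000, 10_000)
    print(f"fits: {fits_context(100_000, 10_000)}, cost: ${cost:.4f}")
```

At these prices an output token costs roughly six times as much as an input token, so long completions dominate the bill faster than long prompts do.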
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues. Verified subset ensures solvable issues.
Multi-discipline Multimodal Understanding and Reasoning.
A more robust and harder version of MMLU, focusing on complex reasoning and STEM subjects.
Future prediction of AIME performance levels.
Harvard-MIT Mathematics Tournament 2026 - High difficulty competition math.
Contamination-free coding benchmark using recent problems.
Graduate-Level Google-Proof Q&A Benchmark.
Compact MathVista split for faster multimodal reasoning checks.
Professional level MMMU expansion.
Real-world visual question answering.
Elite multimodal model evaluation.
Visual object counting and identification.
Environment Reasoning and Question Answering.
Chart-based reasoning from arXiv papers (Reasoning QA).
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Higher-difficulty SWE-bench subset for frontier coding agents.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.
Multi-task long video understanding benchmark.
Comprehensive video understanding benchmark across multiple tasks.