Beta version: information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 10.0M tokens
Input Cost: $1.00 per 1M tokens
Output Cost: $3.00 per 1M tokens
Parameters: Proprietary (model footprint not disclosed)
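
To make the per-token rates above concrete, here is a minimal Python sketch that estimates the dollar cost of a single request at the listed prices. The function and variable names are illustrative assumptions, not part of any published API.

    def estimate_request_cost(input_tokens: int, output_tokens: int) -> float:
        """Estimate the USD cost of one request at the listed per-token rates."""
        INPUT_RATE = 1.00 / 1_000_000   # $1.00 per 1M input tokens
        OUTPUT_RATE = 3.00 / 1_000_000  # $3.00 per 1M output tokens
        return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

    # Example: a 400,000-token prompt (well within the 10.0M context window)
    # plus a 2,000-token completion costs 0.40 + 0.006 = $0.406.
    print(f"${estimate_request_cost(400_000, 2_000):.3f}")
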
Performance Analysis // Verified Benchmarks
Resolving real-world GitHub issues; the Verified subset filters to tasks confirmed to be solvable.
Contamination-free, continuously updated reasoning benchmark.
Artificial Analysis aggregate intelligence index.
A harder, more robust version of MMLU, focused on complex reasoning and STEM subjects.
Humanity's Last Exam - Hard reasoning benchmark without tools.
Humanity's Last Exam full evaluation with tool access enabled.
Open-domain factuality benchmark focusing on short, verifiable answers.
Harvard-MIT Mathematics Tournament - High-difficulty competition math.
International Mathematical Olympiad style answer-only benchmark.
Contamination-free coding benchmark using recent problems.
Artificial Analysis aggregate coding capability index.
Research-grade coding and software development tasks.
Cybersecurity-flavored coding benchmark in simulated environments.
Online-judge competitive coding benchmark focused on C++ tasks.
Graduate-Level Google-Proof Q&A Benchmark.
Comprehensive long-context understanding (128k).
Artificial Analysis Long Context Reasoning benchmark; tests reasoning across very long inputs.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
Advanced instruction-following benchmark with verified grading.
American Invitational Mathematics Examination 2025 problems.
Compact MathVista split for faster multimodal reasoning checks.
Comprehensive mathematical vision benchmark.
Spatial and perception benchmark for multimodal models.
Multimodal visual perception benchmark.
Optical character recognition and document understanding benchmark.
Professional-level MMMU expansion.
Validation split of MMMU for multimodal understanding.
Aggregate ZeroBench score across the full task set.
ZeroBench score when tool use is allowed.
Early-stage visual development benchmark.
Global visual knowledge and reasoning.
Universal document understanding benchmark.
Information-seeking visual question answering on the test split.
Chart-based reasoning from arXiv papers (Reasoning QA).
Multimodal Video Understanding.
Comprehensive motion perception evaluation.
Understanding extremely long-form video content.
Large-scale Video Benchmark.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench focused on tougher terminal workflows.
Verified desktop computer-use benchmark for end-to-end task completion.
Browser-based autonomous task execution benchmark.
Higher-difficulty SWE-bench subset for frontier coding agents.
Software engineering performance across multilingual codebases.
Web browsing + synthesis benchmark for research agents.
BrowseComp variant with explicit context-window management.
Multi-agent swarm variant of BrowseComp.
Broad retrieval and synthesis benchmark across many sources.
Multi-agent swarm variant of WideSearch.
Tier 2 and Tier 3 slices of FinSearchComp.
Deep multi-hop search QA for long-horizon agents.
Strategic environment-agent loop benchmark.
Artificial Analysis GDPVal benchmark for knowledge-work quality.
Telecom-domain tool-use and workflow benchmark.
Scientific programming benchmark for code synthesis and correctness.
Short-form visual question answering with verifiable responses.
Video variant of MMMU for multimodal understanding and reasoning.
Video multimodal evaluation benchmark for perception and reasoning.