Beta version: *Information might not be fully accurate. Please report any discrepancies.
Latest Data: 2026-02-20
Context Window: 200k tokens
Input Cost: $1.00 per 1M tokens
Output Cost: $3.20 per 1M tokens
Cache Cost: $0.20 read / free write, per 1M tokens
Parameters: 744B total, 40B active (model footprint)
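As a quick sanity check on the rates above, the cost of a single request can be estimated by splitting the prompt into cached and uncached tokens. This is a minimal sketch; the function and parameter names are illustrative, not from any official SDK.

```python
# Per-request cost estimate from the listed rates:
# $1.00/M uncached input, $3.20/M output, $0.20/M cache reads, free cache writes.
INPUT_RATE = 1.00       # USD per 1M uncached input tokens
OUTPUT_RATE = 3.20      # USD per 1M output tokens
CACHE_READ_RATE = 0.20  # USD per 1M cached input tokens (cache writes are free)

def estimate_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD; cached_tokens must not exceed input_tokens."""
    uncached = input_tokens - cached_tokens
    return (uncached * INPUT_RATE
            + cached_tokens * CACHE_READ_RATE
            + output_tokens * OUTPUT_RATE) / 1_000_000

# Example: a 100k-token prompt with 20k tokens served from cache, 5k-token reply.
print(f"${estimate_cost(100_000, 20_000, 5_000):.2f}")  # → $0.10
```

At these rates, cache hits cut input cost fivefold, so long shared prefixes dominate the savings.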
Performance Analysis // Verified Benchmarks
Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.
Challenging competition mathematics problems (AIME/IMO level).
Functional correctness of synthesized programs from docstrings.
Resolving real-world GitHub issues; the Verified subset is limited to human-confirmed solvable issues.
Contamination-free, continuously updated reasoning benchmark.
Chatbot Arena Elo score. Crowd-sourced human preference ranking.
Artificial Analysis aggregate intelligence index.
Humanity's Last Exam: a hard reasoning benchmark, evaluated without tool use.
Artificial Analysis aggregate coding capability index.
Graduate-Level Google-Proof Q&A Benchmark.
Artificial Analysis Long Context Reasoning benchmark: evaluates reasoning over long inputs.
Artificial Analysis IFBench. Evaluates precise instruction following with constraints.
American Invitational Mathematics Examination 2025 problems.
Agent performance in realistic terminal workflows (v2.0 leaderboard).
Hard split of Terminal-Bench, restricted to the most difficult terminal workflows.
Higher-difficulty SWE-bench subset for frontier coding agents.
Web browsing + synthesis benchmark for research agents.
Broad retrieval and synthesis benchmark across many sources.
Long-horizon business simulation benchmark (final account balance).
Tool-use and API orchestration benchmark for assistants.
Telecom-domain tool-use and workflow benchmark.
Scientific programming benchmark for code synthesis and correctness.