Beta version: Information might not be fully accurate. Please report any discrepancies.

Anthropic

Verified

9 benchmarks

Claude 3.5 Sonnet

Released 2024-06-20

175B (Estimated) Architecture

Verified Model Card

Latest Data

Unknown

Context Window

200k tokens

Input Cost

$3.00 per 1M tokens

Output Cost

$15.00 per 1M tokens
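
As a rough illustration of how the per-token prices combine, the sketch below estimates the cost of a single request. It assumes the listed prices ($3.00 per 1M input tokens, $15.00 per 1M output tokens) apply linearly with no discounts or surcharges; the function name `estimate_cost_usd` is hypothetical and not part of any official SDK.

```python
from decimal import Decimal

# Prices as listed on this page (USD per 1M tokens).
INPUT_USD_PER_MTOK = Decimal("3.00")
OUTPUT_USD_PER_MTOK = Decimal("15.00")
TOKENS_PER_MTOK = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request (hypothetical helper).

    Assumes simple linear pricing: tokens * price / 1,000,000.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = INPUT_USD_PER_MTOK * input_tokens / TOKENS_PER_MTOK
    output_cost = OUTPUT_USD_PER_MTOK * output_tokens / TOKENS_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Example request: 10,000 input tokens and 2,000 output tokens.
    print(estimate_cost_usd(10_000, 2_000))
```

For example, 10,000 input tokens cost $0.03 and 2,000 output tokens also cost $0.03, for a total of $0.06. Output tokens are five times as expensive as input tokens at these rates, so long completions tend to dominate the bill.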

Parameters

175B (Estimated)

model footprint

Benchmark Provenance

Performance Analysis // Verified Benchmarks

MMLU (5-shot)

Knowledge

88.7 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Massive Multitask Language Understanding covers 57 subjects across STEM, the humanities, social sciences, and more.

HumanEval

Coding

92 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Functional correctness of programs synthesized from docstrings.

SWE-bench Verified

Coding

49 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Resolving real-world GitHub issues. The Verified subset contains only issues that human reviewers confirmed to be solvable.

MMMU (Multimodal)

Multimodal

67.2 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Multi-discipline Multimodal Understanding and Reasoning.

BigCodeBench

Coding

30.4 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Next-generation HumanEval with more diverse library calls and complex tasks.

LMArena ELO

Real-world

1271 / 1700

Verified

Last Verified: Unknown Date

Anthropic News

Chatbot Arena Elo score: a crowd-sourced human preference ranking.

AgentBench

Agent

80.1 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Comprehensive framework to evaluate LLMs as agents across diverse environments.

GPQA Diamond

STEM

59.4 / 100

Verified

Last Verified: Unknown Date

Anthropic News

Graduate-Level Google-Proof Q&A Benchmark.

Superchem

STEM

40 / 100

Unverified

Last Verified: Unknown Date

Anthropic News

Expert-level chemistry knowledge and reasoning.