Beta version: Information might not be fully accurate. Please report any discrepancies.
Global LLM Leaderboard
Compare 1,500+ AI models by benchmark scores, pricing, and capabilities — with full provenance tracking.
Current Leaders
Cursor Composer 2.5
4d ago · 0 benchmarks
Performance by Domain
Intelligence
Top: Qwen3.6 27B
Knowledge & Communication
Top: Gemini 3.1 Pro
Coding
Top: GPT-5.1
Math
Top: GPT-5.2 Pro
Agents & Tools
Top: Gemini 2.0 Flash
Vision & Video
Top: GPT-5.2 Pro
Long Context
Top: GPT-5.4
Factuality
Top: Grok-4.1-Fast
Frequently Asked Questions
Quick answers about choosing, comparing, and interpreting AI model rankings.
What is the best LLM right now?
The best LLM depends on your use case. For reasoning tasks, models like Claude Opus and Gemini 3.1 Pro lead on GPQA Diamond. For coding, GPT-5.x and Claude Opus dominate SWE-bench. For budget-conscious users, DeepSeek V4 Pro offers frontier quality at a fraction of the cost. Use our comparison tool to find the best model for your specific needs.
How do I compare LLM models?
Use the LLM Registry Compare tool to select up to three models and see side-by-side benchmark scores, pricing, context windows, and capability profiles. The comparison defaults to strict shared-benchmark analysis for fair head-to-head evaluation.
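As a rough illustration of what a strict shared-benchmark comparison could look like, the sketch below keeps only the benchmarks that every selected model has a score for. It assumes "strict" means a plain set intersection. The function name, model names, and scores are hypothetical placeholders, not the Compare tool's actual implementation or real results.

```python
def shared_benchmarks(models: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Return {benchmark: {model: score}} for benchmarks every model reports.

    Assumption: "strict shared-benchmark" means the intersection of the
    benchmark sets, so no model is compared on a test it never ran.
    """
    if not models:
        return {}
    common = set.intersection(*(set(scores) for scores in models.values()))
    return {
        bench: {name: scores[bench] for name, scores in models.items()}
        for bench in sorted(common)
    }


# Placeholder scores for two hypothetical models (not real results).
example = {
    "model_a": {"GPQA Diamond": 80.0, "SWE-bench Verified": 60.0},
    "model_b": {"GPQA Diamond": 75.0, "HumanEval": 90.0},
}
comparison = shared_benchmarks(example)
```

Here only GPQA Diamond survives, because it is the one benchmark both placeholder models share.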
Which LLM is best for coding?
Based on SWE-bench Verified scores, Claude Opus 4.7 and GPT-5.5 currently lead for software engineering tasks. For code generation benchmarks like HumanEval, GPT-5.3 Codex and Gemini models perform exceptionally well. Check the Coding leaderboard for the latest rankings.
What is the cheapest top AI model?
DeepSeek V4 Pro offers frontier-level performance (90%+ GPQA) at approximately $0.44/$0.87 per million input/output tokens — roughly 10x cheaper than comparable models from OpenAI or Anthropic. DeepSeek V4 Flash is even cheaper at $0.14/$0.28 per million tokens.
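To turn these per-million-token prices into a cost for your own workload, a small calculation like the following may help. The prices are the ones quoted above. The function name and the workload size (2M input tokens, 500k output tokens) are hypothetical and chosen only for illustration.

```python
def workload_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD, given prices quoted per million input and output tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Prices quoted above: (input, output) in USD per million tokens.
V4_PRO = (0.44, 0.87)
V4_FLASH = (0.14, 0.28)

# Hypothetical workload: 2M input tokens and 500k output tokens.
pro_cost = workload_cost(2_000_000, 500_000, *V4_PRO)      # 0.88 + 0.435 = about 1.315 USD
flash_cost = workload_cost(2_000_000, 500_000, *V4_FLASH)  # 0.28 + 0.14 = about 0.42 USD
```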
How are LLM benchmark scores normalized?
LLM Registry normalizes all scores to a 0–100 scale. Bounded metrics use max-scaling (score/max × 100). ELO-based metrics use min-max scaling. Lower-is-better metrics are mathematically inverted so that 100 always represents the best performance. See our Methodology page for full details.
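The three rules above can be sketched as small functions. This is a minimal sketch, not the Registry's actual code: the function names are hypothetical, the min-max bounds are assumed to be the lowest and highest observed values for the metric, and the inversion is assumed to be 100 minus the normalized value, since the exact formula is not given here.

```python
def max_scale(score: float, max_score: float) -> float:
    """Bounded metric: score / max x 100."""
    return score / max_score * 100.0


def min_max_scale(score: float, lo: float, hi: float) -> float:
    """ELO-style metric: position of the score within [lo, hi], scaled to 0-100.

    Assumption: lo and hi are the lowest and highest observed ratings.
    """
    if hi == lo:
        return 100.0  # assumption: a degenerate range maps to the top of the scale
    return (score - lo) / (hi - lo) * 100.0


def invert(normalized: float) -> float:
    """Lower-is-better metric: flip so that 100 is best (assumed form: 100 - x)."""
    return 100.0 - normalized
```

For example, under these assumptions a rating of 1200 in an observed range of 1000 to 1400 normalizes to 50, and a lower-is-better metric normalized to 25 becomes 75 after inversion.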
What is the difference between Verified and Discovered models?
Verified models are manually curated with full provenance tracking — every score has a source ID, verification level, and as-of date. Discovered models are auto-imported from community databases and may have less complete benchmark coverage.
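One way to picture a Verified score is as a record that carries its provenance alongside the value. The sketch below is illustrative only: the class name, field names, and the completeness check are hypothetical and do not reflect the Registry's actual schema, and the sample values are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScoreRecord:
    """Hypothetical shape of one benchmark score with provenance attached."""
    model: str
    benchmark: str
    score: float
    source_id: str           # identifier of the source the score was taken from
    verification_level: str  # e.g. "verified" or "discovered" (assumed labels)
    as_of: date              # date the score was recorded


def has_full_provenance(record: ScoreRecord) -> bool:
    """True when the source ID and verification level are both non-empty."""
    return bool(record.source_id) and bool(record.verification_level)


# Placeholder record, not a real score.
sample = ScoreRecord("model_a", "GPQA Diamond", 80.0, "src-001", "verified", date(2025, 1, 1))
```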