Name: LLM Registry Benchmark Dataset
Creator: LLM Registry

Question 1

What is the best LLM right now?

Accepted Answer

The best LLM depends on your use case. For reasoning tasks, models like Claude Opus and Gemini 3.1 Pro lead on GPQA Diamond. For coding, GPT-5.x and Claude Opus dominate SWE-bench. For budget-conscious users, DeepSeek V4 Pro offers frontier quality at a fraction of the cost. Use our comparison tool to find the best model for your specific needs.

Question 2

How do I compare LLM models?

Accepted Answer

Use the LLM Registry Compare tool to select up to three models and see side-by-side benchmark scores, pricing, context windows, and capability profiles. The comparison defaults to strict shared-benchmark analysis for fair head-to-head evaluation.
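
One way to picture strict shared-benchmark analysis is as a set intersection: only benchmarks that every selected model has a score for are kept. The sketch below is an illustration under that assumption, not LLM Registry's actual implementation. The function names, the data layout, and the placeholder model names and scores are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: model name -> {benchmark name -> normalized score}.
Scores = Dict[str, Dict[str, float]]

MAX_MODELS = 3  # the Compare tool is described as accepting up to three models


def shared_benchmarks(scores: Scores, models: List[str]) -> List[str]:
    """Return the benchmarks that every selected model has a score for."""
    if not 1 <= len(models) <= MAX_MODELS:
        raise ValueError(f"select between 1 and {MAX_MODELS} models")
    common = set(scores[models[0]])
    for name in models[1:]:
        common &= set(scores[name])
    return sorted(common)


def compare(scores: Scores, models: List[str]) -> Dict[str, Dict[str, float]]:
    """Build a side-by-side table restricted to the shared benchmarks."""
    return {
        bench: {name: scores[name][bench] for name in models}
        for bench in shared_benchmarks(scores, models)
    }


# Placeholder values for illustration only; these are not real scores.
example_scores: Scores = {
    "model_a": {"GPQA Diamond": 80.0, "SWE-bench Verified": 70.0, "HumanEval": 90.0},
    "model_b": {"GPQA Diamond": 85.0, "SWE-bench Verified": 65.0},
}

table = compare(example_scores, ["model_a", "model_b"])
for bench, row in table.items():
    print(bench, row)
```

In this sketch HumanEval is dropped from the table because only one of the two models reports it, so neither model gains from a benchmark the other was never run on.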

Question 3

Which LLM is best for coding?

Accepted Answer

Based on SWE-bench Verified scores, Claude Opus 4.7 and GPT-5.5 currently lead for software engineering tasks. For code generation benchmarks like HumanEval, GPT-5.3 Codex and Gemini models perform exceptionally well. Check the Coding leaderboard for the latest rankings.

Question 4

What is the cheapest top AI model?

Accepted Answer

DeepSeek V4 Pro offers frontier-level performance (90%+ on GPQA) at approximately $0.44 per million input tokens and $0.87 per million output tokens, roughly 10x cheaper than comparable models from OpenAI or Anthropic. DeepSeek V4 Flash is cheaper still, at $0.14 per million input tokens and $0.28 per million output tokens.
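
To turn per-million-token prices into a budget figure, multiply each token count by its price and divide by one million. The sketch below uses the approximate prices quoted in this answer. The function name and the workload size are hypothetical, chosen only to show the arithmetic.

```python
def workload_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the cost in dollars for a given number of input and output tokens."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Approximate prices from the answer above, as (input, output) dollars per million tokens.
PRICES = {
    "DeepSeek V4 Pro": (0.44, 0.87),
    "DeepSeek V4 Flash": (0.14, 0.28),
}

# Hypothetical workload: 2 million input tokens and 1 million output tokens.
for model, (input_price, output_price) in PRICES.items():
    cost = workload_cost(2_000_000, 1_000_000, input_price, output_price)
    print(f"{model}: ${cost:.2f}")
```

For this workload the totals come to about $1.75 for V4 Pro and $0.56 for V4 Flash.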

Question 5

How are LLM benchmark scores normalized?

Accepted Answer

LLM Registry normalizes all scores to a 0–100 scale. Bounded metrics use max-scaling (score/max × 100). Elo-based metrics use min-max scaling. Lower-is-better metrics are mathematically inverted so that 100 always represents the best performance. See our Methodology page for full details.
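
The three rules can be sketched in a few lines. This is an illustration, not the Registry's code. It assumes that the min-max bounds are the lowest and highest ratings in the pool being compared, and that inversion means subtracting the normalized value from 100. Neither detail is stated above, so check the Methodology page for the real definitions. All function names are hypothetical.

```python
def max_scale(score: float, max_score: float) -> float:
    """Bounded metric: score/max x 100."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return 100 * score / max_score


def min_max_scale(rating: float, lowest: float, highest: float) -> float:
    """Unbounded metric such as an Elo rating: position between pool min and max.

    Assumption: lowest and highest are the extremes of the compared pool.
    """
    if highest == lowest:
        raise ValueError("highest and lowest must differ")
    return 100 * (rating - lowest) / (highest - lowest)


def invert(normalized: float) -> float:
    """Lower-is-better metric: flip so that 100 is best.

    Assumption: inversion is 100 minus the normalized value.
    """
    return 100 - normalized


print(max_scale(45, 50))                # 45 out of 50 points
print(min_max_scale(1300, 1000, 1500))  # rating 1300 in a 1000-1500 pool
print(invert(max_scale(5, 50)))         # an error count of 5 on a 0-50 scale
```

Under these assumptions the three calls yield 90, 60, and 90 respectively, so a low error count ends up scored as high as a strong result on a higher-is-better metric.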

Question 6

What is the difference between Verified and Discovered models?

Accepted Answer

Verified models are manually curated with full provenance tracking — every score has a source ID, verification level, and as-of date. Discovered models are auto-imported from community databases and may have less complete benchmark coverage.
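
As a rough picture of what provenance tracking implies, a score record could carry the three fields named above alongside the score itself. The sketch below is a hypothetical data structure, not the Registry's schema. The class name, field names, and example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ScoreRecord:
    """One benchmark score plus its provenance (hypothetical schema)."""

    model: str
    benchmark: str
    value: float
    source_id: Optional[str] = None           # where the score was taken from
    verification_level: Optional[str] = None  # how thoroughly it was checked
    as_of: Optional[date] = None              # date the score was current


def has_full_provenance(record: ScoreRecord) -> bool:
    """True when the source ID, verification level, and as-of date are all set."""
    return (
        bool(record.source_id)
        and bool(record.verification_level)
        and record.as_of is not None
    )


# Placeholder records; the IDs, levels, dates, and values are invented.
curated = ScoreRecord(
    model="model_a",
    benchmark="GPQA Diamond",
    value=80.0,
    source_id="src-001",
    verification_level="manual",
    as_of=date(2025, 1, 1),
)
imported = ScoreRecord(model="model_b", benchmark="GPQA Diamond", value=75.0)

print(has_full_provenance(curated))
print(has_full_provenance(imported))
```

Here the curated record passes the check and the auto-imported one does not, which mirrors the Verified versus Discovered distinction.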

Global LLM Leaderboard

Cursor Composer 2.5

Performance by Domain

Intelligence

Knowledge & Communication

Coding

Math

Agents & Tools

Vision & Video

Long Context

Factuality