


Registry Standards

Data policy, normalization rules, and provenance labeling used across the global benchmark registry.

  • Verified Models: 1573

  • Active Benchmarks: 206

  • Evaluation Categories: 17

  • Verified Sources: 30

Provenance & Attribution

We prioritize transparency in every score. Every data point in the registry is tagged with a Source ID and a Verification Level.

Data Attribution

  • Artificial Analysis

    Scores marked with * are imported from artificialanalysis.ai.

  • Metadata Source

    Model metadata (pricing, specs, capabilities) imported from models.dev (MIT License).

Comparison Methodology

Compare views put reliability first: we separate fair, overlap-based analysis from exploratory analysis so that missing data is never interpreted as zero performance.

Strict Mode (Default)

Uses only benchmarks shared by selected models for fair head-to-head comparisons in summary and detailed tables.

Exploratory Mode

Includes non-shared results for broader context. Missing entries remain explicitly labeled as N/A.
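The strict-mode overlap rule above can be sketched in a few lines. This is a minimal illustration, not the registry's actual implementation; the `Score` record shape and field names are assumptions.

```typescript
// Hypothetical score record; field names are illustrative.
interface Score {
  modelId: string;
  benchmarkId: string;
  value: number;
}

// Strict mode: keep only benchmarks that every selected model has a score for.
function sharedBenchmarks(scores: Score[], modelIds: string[]): string[] {
  const byBenchmark = new Map<string, Set<string>>();
  for (const s of scores) {
    if (!modelIds.includes(s.modelId)) continue;
    if (!byBenchmark.has(s.benchmarkId)) {
      byBenchmark.set(s.benchmarkId, new Set());
    }
    byBenchmark.get(s.benchmarkId)!.add(s.modelId);
  }
  return [...byBenchmark.entries()]
    .filter(([, models]) => models.size === modelIds.length)
    .map(([id]) => id);
}
```

Exploratory mode would simply skip the final filter and label the gaps N/A instead of dropping them.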

Capability Profile

Radar uses all available domains within the current scope and preserves missing values as N/A instead of plotting them as zero.

Evidence Quality

Compare pages expose shared benchmark counts and per-model coverage so confidence in comparisons is visible before interpretation.

Coverage-Assisted Mode (Leaderboard)

The leaderboard can optionally fill sparse base-model gaps with family-proxy scores from the same model line. These scores are marked as estimated and shown with a ~ suffix. Use Observed Only mode for strictly measured values.
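The family-proxy fill could look roughly like the sketch below, assuming a hypothetical `Entry` shape with a nullable score; the real selection logic (which family member is chosen as the proxy) is not specified here and is an assumption.

```typescript
// Hypothetical leaderboard row; field names are illustrative.
interface Entry {
  modelId: string;
  family: string;
  score: number | null;
}

// Fill a missing score with one from the same family, flagged as
// estimated (rendered with a "~" suffix in the UI).
function fillWithFamilyProxy(
  entries: Entry[]
): (Entry & { estimated: boolean })[] {
  return entries.map((e) => {
    if (e.score !== null) return { ...e, estimated: false };
    const proxy = entries.find(
      (o) => o.family === e.family && o.score !== null
    );
    return proxy
      ? { ...e, score: proxy.score, estimated: true }
      : { ...e, estimated: false };
  });
}
```

Observed Only mode would skip this pass entirely and leave the nulls as N/A.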

Enhanced Model Metadata

Family System

Models are grouped into families (e.g., Llama, GPT, Claude) for easier discovery and comparison. Family badges appear on model cards and enable family-based filtering.

Capability Icons

Visual indicators show key capabilities: reasoning (chain-of-thought), vision (image analysis), tools (function calling), audio, video, code specialization, JSON mode, file uploads, and temperature control. Hover or tap icons for detailed descriptions.

Training Cutoff

Each model displays its training data cutoff date, providing transparency about knowledge freshness and temporal limitations.

Advanced Pricing

Beyond basic input/output pricing, we track cache read/write costs, reasoning token pricing, audio input/output costs, and context surcharges for models with over 200K context windows.

Model Status

Models are tagged with lifecycle status: active (production-ready), beta (public testing), alpha (early testing), or deprecated (end-of-life).

Max Output Tokens

Maximum generation length is displayed for each model, helping you understand output limitations for long-form content generation.

Automated Data Pipeline

models.dev Integration

We automatically import metadata from models.dev, a community-driven database of LLM specifications. This provides comprehensive coverage of 1,675+ models with pricing, capabilities, and limits.

Weekly Sync

GitHub Actions automatically fetch updated data every Monday, detect changes, and create pull requests for review. This ensures our registry stays current with the rapidly evolving LLM landscape.

ID Normalization

Model IDs from different sources are normalized to our internal naming convention using a community-maintainable JSON mapping file, making it easy for contributors to add new mappings.
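A minimal sketch of that lookup, with made-up mapping entries and a slugified fallback; the real mapping file's contents and fallback rules may differ.

```typescript
// Hypothetical mapping entries; real mappings live in the
// community-maintained JSON file.
const idMap: Record<string, string> = {
  "openai/gpt-4o-2024-08-06": "gpt-4o",
  "meta-llama/llama-3.1-70b-instruct": "llama-3.1-70b",
};

// Use the explicit mapping when present; otherwise fall back to a slug.
function normalizeId(sourceId: string): string {
  return idMap[sourceId] ?? sourceId.toLowerCase().replace(/[^a-z0-9.-]+/g, "-");
}
```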

Deep Merge Logic

Imported metadata is deep-merged with existing data to prevent capability loss. Our test suite (9 tests, 100% coverage) ensures that updating one field never accidentally removes existing capabilities like vision or tool support.
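The key property (an update never drops existing fields) can be illustrated with a generic deep merge; this is a sketch of the idea, not the registry's actual merge code.

```typescript
type Json = { [key: string]: unknown };

// Deep-merge incoming metadata over existing data: nested objects are
// merged recursively, and keys absent from the update keep their
// existing values instead of being dropped.
function deepMerge(existing: Json, incoming: Json): Json {
  const out: Json = { ...existing };
  for (const [k, v] of Object.entries(incoming)) {
    const prev = out[k];
    if (
      v && prev &&
      typeof v === "object" && typeof prev === "object" &&
      !Array.isArray(v) && !Array.isArray(prev)
    ) {
      out[k] = deepMerge(prev as Json, v as Json);
    } else if (v !== undefined) {
      out[k] = v;
    }
  }
  return out;
}
```

For example, merging `{ capabilities: { audio: true } }` into a model that already has vision and tool support adds audio without removing the other two flags.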

Static API Architecture

Static Slicing

Instead of requiring API consumers to download the entire 800KB+ dataset, we generate 1,546 individual JSON files (one per model), each under 1KB. This reduces API payload size by 99.95%.
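Conceptually, the build step maps each model to its own tiny payload. The sketch below returns the file map in memory rather than writing to disk, and the endpoint path is an assumption.

```typescript
// Build the per-model JSON payloads that would be emitted at build time.
// The path pattern is illustrative, not the registry's actual route.
function sliceRegistry(models: { id: string }[]): Map<string, string> {
  const files = new Map<string, string>();
  for (const m of models) {
    files.set(`api/v1/models/${m.id}.json`, JSON.stringify(m));
  }
  return files;
}
```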

Edge Delivery

All API endpoints are pre-generated at build time and served globally from Cloudflare's edge network, delivering sub-20ms response times worldwide under a 100% uptime SLA.

Rate Limiting

Rate limiting is handled by Cloudflare WAF at the edge (100 requests/minute per IP), providing DDoS protection without any application code or bundle size impact.

Bundle Optimization

By using static generation and on-demand data loading via SWR hooks, we reduced the client bundle from ~870KB to ~170KB (80% reduction), dramatically improving initial page load times.

Quality Assurance

Test Coverage

Critical data merge logic is covered by 9 unit tests with 100% coverage, ensuring data integrity is maintained during automated imports.

Data Validation

Automated validation scripts verify data integrity, checking model IDs, benchmark IDs, score bounds, and provenance metadata before each deployment.
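One of these checks might look like the sketch below, which validates a single score record against a known-model set. The record shape and the 0-100 score bound are assumptions for illustration.

```typescript
// Hypothetical score record; field names are illustrative.
interface ScoreRecord {
  modelId: string;
  benchmarkId: string;
  value: number;
  sourceId?: string;
}

// Return a list of human-readable problems; empty means the record passes.
function validateScore(s: ScoreRecord, knownModels: Set<string>): string[] {
  const errors: string[] = [];
  if (!knownModels.has(s.modelId)) {
    errors.push(`unknown model: ${s.modelId}`);
  }
  if (s.value < 0 || s.value > 100) {
    errors.push(`score out of bounds: ${s.value}`);
  }
  if (!s.sourceId) {
    errors.push("missing provenance sourceId");
  }
  return errors;
}
```

A deployment gate would run every record through checks like this and fail the build on any non-empty result.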

Source Attribution

Every score includes provenance metadata: source ID, verification level (third-party, provider, community, estimated), and as-of date for complete transparency.

Unified Fields

We maintain consistent field naming across all data sources (e.g., trainingCutoff instead of knowledgeCutoff) to prevent data fragmentation and ensure reliability.

Registry Changelog

History of Updates

v0.7.0

2026-03-01

Models.dev Integration and Advanced Filtering

Integrated models.dev data pipeline with 1,675+ model metadata imports

Added family badges and capability icons with descriptive hover/tap tooltips

Implemented advanced filtering by family, capability, and provider on explore page

Added static sliced API with under 1KB per-model endpoints (99.95% smaller than full dataset)

Enhanced pricing with cache, reasoning, audio, and context surcharge support

Added automated weekly sync via GitHub Actions with change detection

Implemented deep merge logic to prevent capability data loss during updates

Added comprehensive test suite (9 tests, 100% coverage on critical paths)

Reduced client bundle size by 80% (from 870KB to 170KB) with dynamic loading

Added provider integration guide with SDK examples for 20 major providers

Implemented dynamic metadata loading with SWR hooks for on-demand fetching

Enhanced API documentation with static architecture clarification and client-side examples

Added Cloudflare WAF rate limiting configuration guide (100 requests/minute)

Unified trainingCutoff field across all metadata (replaced knowledgeCutoff)

Generated 1,546 per-model JSON files for edge delivery with global under 20ms response times

Added model status tracking (active, beta, alpha, deprecated) for lifecycle management

Implemented max output tokens display for generation length transparency

Added OpenAPI 3.0 specification for complete API documentation

v0.6.0

2026-02-20

Reliable Compare, SEO Foundation, and Performance Pass

Reworked compare cards to use evidence metrics (coverage, verification share, latest as-of date) instead of synthetic confidence.

Added strict vs exploratory comparison modes with explicit shared benchmark visibility and reliability messaging.

Improved compare benchmark detail rows with provenance context (source, verification badge, as-of date, and N/A handling).

Updated capability profile rendering to support sparse overlap safely while preserving radar behavior and full available domain coverage.

Added comprehensive metadata and social preview improvements across core pages plus canonical URL normalization.

Added structured data (WebSite, Organization, Dataset, TechArticle, BreadcrumbList) for home, model, benchmark, and domain surfaces.

Added Search Console runbook in SEO_CHECKLIST.md and linked it from README methodology guidance.

Improved accessibility with better icon control labels, heading order fixes, and higher-contrast microcopy in key cards.

Reduced initial homepage work by lazy-loading leaderboard interaction code and trimming non-essential above-the-fold motion.

v0.5.0

2026-02-19

Filters, Freshness, and API Enhancements

Added source and verification level multi-select filters to leaderboard toolbar.

Added data freshness indicators: an amber dot for aging scores (91–180 days old) and a red dot for stale scores (over 180 days old).

Fixed URL params clobbering between chart and leaderboard on benchmark pages.

Fixed domain ranking calculation to use normalized scores for cross-benchmark comparability.

Added /api/v1/export endpoint with JSON and CSV format support for research workflows.

Added /api-docs page with live data counts and endpoint documentation.

Added domain detail pages showing top models and benchmark lists.

Added explore page with log scale toggle and searchable benchmark selector.

Added loading states for benchmark, model, explore, and benchmarks pages.

Added Open Graph metadata for model, benchmark, and domain pages.

Added mobile menu domain links and API navigation item.

v0.4.0

2026-02-18

Accessibility and UX Polish

Fixed search input race condition that caused characters to be lost during rapid typing.

Made empty compare slots clickable to open model selector directly.

Added aria-live regions for screen reader announcements on search results.

Added theme-color meta tag for consistent mobile browser theming.

Added prefers-reduced-motion support for all animated progress bars.

Made Report Inaccuracy button functional with mailto link.

Cleaned up unused imports for improved bundle size.

v0.3.0

2026-02-15

Trust and Mobile UX Upgrade

Added score-level provenance, verification tier labels, and freshness metadata across leaderboard, compare, and model views.

Introduced mobile leaderboard cards and sticky compare tray for faster small-screen workflows.

Shipped strict data validation and improved benchmark metadata defaults.

v0.2.0

2026-02-14

Methodology and Registry Guardrails

Added methodology page with normalization and ranking explanations.

Introduced data validation scripts and CI workflow for registry quality.

Expanded benchmark taxonomy with Agentic and Advanced Tasks coverage.

v0.1.0

2026-02-13

Comparison and Category Scoring

Added category-average views in leaderboard and compare workflows.

Implemented column reordering, layout persistence, and summary mode.

Fixed benchmark deduplication and key consistency regressions.