Beta version: information may not be fully accurate. Please report any discrepancies.
Registry / Methodology
Data policy, normalization rules, and provenance labeling used across the global benchmark registry.
Verified Models
1573
Active Benchmarks
206
Evaluation Categories
17
Verified Sources
30
We prioritize transparency in every score. Every data point in the registry is tagged with a Source ID and a Verification Level.
Artificial Analysis
Scores marked with * are imported from artificialanalysis.ai.
Metadata Source
Model metadata (pricing, specs, capabilities) imported from models.dev (MIT License).
Compare views put reliability first: we separate fair, overlap-based analysis from exploratory analysis so that missing data is never interpreted as zero performance.
Strict Mode (Default)
Uses only benchmarks shared by selected models for fair head-to-head comparisons in summary and detailed tables.
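Conceptually, Strict Mode reduces to a set intersection over benchmark IDs. A minimal sketch, assuming a per-model map of benchmark IDs to scores (the `ModelScores` type and `sharedBenchmarks` helper are illustrative, not the site's actual API):

```typescript
// Keep only benchmark IDs for which every selected model has a score.
type ModelScores = Record<string, number>; // benchmarkId -> score

function sharedBenchmarks(models: ModelScores[]): string[] {
  if (models.length === 0) return [];
  // Start from the first model's benchmarks, then intersect with the rest.
  return Object.keys(models[0]).filter((id) =>
    models.every((m) => id in m)
  );
}
```

Anything outside this intersection is exactly what Exploratory Mode adds back, labeled N/A where a model lacks a result.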
Exploratory Mode
Includes non-shared results for broader context. Missing entries remain explicitly labeled as N/A.
Capability Profile
Radar uses all available domains within the current scope and preserves missing values as N/A instead of plotting them as zero.
Evidence Quality
Compare pages expose shared benchmark counts and per-model coverage so confidence in comparisons is visible before interpretation.
Coverage-Assisted Mode (Leaderboard)
Leaderboard can optionally fill sparse base-model gaps with family-proxy scores from the same model line. These scores are marked as estimated and shown with a ~ suffix. Use Observed Only mode for strict measured values.
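A sketch of how an estimated family-proxy score might be rendered with the `~` suffix described above (the `ScoreCell` shape and `formatScore` helper are hypothetical):

```typescript
interface ScoreCell {
  value: number;
  estimated: boolean; // true when filled from a family proxy
}

// Estimated scores carry a "~" suffix; observed scores render plainly.
function formatScore(cell: ScoreCell): string {
  const text = cell.value.toFixed(1);
  return cell.estimated ? `${text}~` : text;
}
```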
Family System
Models are grouped into families (e.g., Llama, GPT, Claude) for easier discovery and comparison. Family badges appear on model cards and enable family-based filtering.
Capability Icons
Visual indicators show key capabilities: reasoning (chain-of-thought), vision (image analysis), tools (function calling), audio, video, code specialization, JSON mode, file uploads, and temperature control. Hover or tap icons for detailed descriptions.
Training Cutoff
Each model displays its training data cutoff date, providing transparency about knowledge freshness and temporal limitations.
Advanced Pricing
Beyond basic input/output pricing, we track cache read/write costs, reasoning token pricing, audio input/output costs, and context surcharges for models with over 200K context windows.
Model Status
Models are tagged with lifecycle status: active (production-ready), beta (public testing), alpha (early testing), or deprecated (end-of-life).
Max Output Tokens
Maximum generation length is displayed for each model, helping you understand output limitations for long-form content generation.
models.dev Integration
We automatically import metadata from models.dev, a community-driven database of LLM specifications. This provides comprehensive coverage of 1,675+ models with pricing, capabilities, and limits.
Weekly Sync
GitHub Actions automatically fetch updated data every Monday, detect changes, and create pull requests for review. This ensures our registry stays current with the rapidly evolving LLM landscape.
ID Normalization
Model IDs from different sources are normalized to our internal naming convention using a community-maintainable JSON mapping file, making it easy for contributors to add new mappings.
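The mapping idea can be sketched as a plain lookup table with a pass-through fallback. The entries and helper below are illustrative, not the registry's actual mapping file:

```typescript
// Community-maintainable mapping: source-specific ID -> internal ID.
const idMap: Record<string, string> = {
  "openai/gpt-4o-2024-08-06": "gpt-4o",           // hypothetical entry
  "meta-llama/Llama-3.1-70B-Instruct": "llama-3.1-70b", // hypothetical entry
};

function normalizeId(sourceId: string): string {
  // Unmapped IDs fall back to the raw source ID so nothing is dropped.
  return idMap[sourceId] ?? sourceId;
}
```

Because the table is a flat JSON object, contributors only need to add one key-value pair per new mapping.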
Deep Merge Logic
Imported metadata is deep-merged with existing data to prevent capability loss. Our test suite (9 tests, 100% coverage) ensures that updating one field never accidentally removes existing capabilities like vision or tool support.
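A minimal sketch of the merge behavior described above, assuming nested objects are merged recursively while scalar fields are overridden. This is not the production code, only an illustration of why untouched capabilities survive an update:

```typescript
type Obj = Record<string, unknown>;

function isObj(v: unknown): v is Obj {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Fields in `update` win, but nested objects merge key by key,
// so updating one capability never erases its siblings.
function deepMerge(base: Obj, update: Obj): Obj {
  const out: Obj = { ...base };
  for (const [key, value] of Object.entries(update)) {
    out[key] =
      isObj(value) && isObj(base[key])
        ? deepMerge(base[key] as Obj, value)
        : value;
  }
  return out;
}
```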
Static Slicing
Instead of requiring API consumers to download the entire 800KB+ dataset, we generate 1,546 individual JSON files (one per model), each under 1KB. This reduces API payload size by 99.95%.
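The slicing step amounts to a build-time loop that writes one small JSON file per model. A hedged sketch, where the output directory, `Model` shape, and `sliceModels` helper are assumptions rather than the actual build script:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

interface Model {
  id: string;
  [key: string]: unknown;
}

// Write one compact JSON file per model; returns the number of files written.
function sliceModels(models: Model[], outDir: string): number {
  mkdirSync(outDir, { recursive: true });
  for (const model of models) {
    writeFileSync(join(outDir, `${model.id}.json`), JSON.stringify(model));
  }
  return models.length;
}
```

Since each file is static, it can be cached and served from the edge without any server-side lookup.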
Edge Delivery
All API endpoints are pre-generated at build time and served from Cloudflare's edge network globally, providing under 20ms response times worldwide with 100% uptime SLA.
Rate Limiting
Rate limiting is handled by Cloudflare WAF at the edge (100 requests/minute per IP), providing DDoS protection without any application code or bundle size impact.
Bundle Optimization
By using static generation and on-demand data loading via SWR hooks, we reduced the client bundle from ~870KB to ~170KB (80% reduction), dramatically improving initial page load times.
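The on-demand pattern behind those hooks can be illustrated without SWR itself: cache per-model requests so each model's JSON is fetched at most once. The `createModelLoader` helper below is an illustrative stand-in, not the site's actual hook:

```typescript
type Fetcher = (id: string) => Promise<unknown>;

// Memoize per-model fetches: repeated requests for the same ID
// reuse the in-flight or resolved promise instead of refetching.
function createModelLoader(fetchModel: Fetcher) {
  const cache = new Map<string, Promise<unknown>>();
  return (id: string): Promise<unknown> => {
    if (!cache.has(id)) cache.set(id, fetchModel(id));
    return cache.get(id)!;
  };
}
```

SWR layers revalidation and React state on top of the same idea, which is what lets the heavy dataset stay out of the initial bundle.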
Test Coverage
Critical data merge logic is covered by 9 unit tests with 100% coverage, ensuring data integrity is maintained during automated imports.
Data Validation
Automated validation scripts verify data integrity, checking model IDs, benchmark IDs, score bounds, and provenance metadata before each deployment.
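The kinds of checks such a script might run can be sketched per score entry. Field names (`modelId`, `sourceId`, a 0-100 score range) are assumptions for illustration, not the registry's actual schema:

```typescript
interface ScoreEntry {
  modelId: string;
  benchmarkId: string;
  score: number;
  sourceId?: string;
  asOf?: string;
}

// Returns a list of problems; an empty list means the entry passes.
function validateEntry(e: ScoreEntry): string[] {
  const errors: string[] = [];
  if (!e.modelId) errors.push("missing modelId");
  if (!e.benchmarkId) errors.push("missing benchmarkId");
  if (!Number.isFinite(e.score) || e.score < 0 || e.score > 100) {
    errors.push("score out of bounds");
  }
  if (!e.sourceId) errors.push("missing sourceId");
  return errors;
}
```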
Source Attribution
Every score includes provenance metadata: source ID, verification level (third-party, provider, community, estimated), and as-of date for complete transparency.
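As a data structure, the provenance envelope described above might look like the following. The exact field names are assumptions; only the three components (source, verification level, as-of date) come from the text:

```typescript
type VerificationLevel = "third-party" | "provider" | "community" | "estimated";

interface Provenance {
  sourceId: string;           // which registry source supplied the score
  verification: VerificationLevel;
  asOf: string;               // ISO date the score was last confirmed
}

// Hypothetical example entry.
const example: Provenance = {
  sourceId: "artificial-analysis",
  verification: "third-party",
  asOf: "2026-02-20",
};
```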
Unified Fields
We maintain consistent field naming across all data sources (e.g., trainingCutoff instead of knowledgeCutoff) to prevent data fragmentation and ensure reliability.
v0.7.0
2026-03-01
Integrated models.dev data pipeline with 1,675+ model metadata imports
Added family badges and capability icons with descriptive hover/tap tooltips
Implemented advanced filtering by family, capability, and provider on explore page
Added static sliced API with under 1KB per-model endpoints (99.95% smaller than full dataset)
Enhanced pricing with cache, reasoning, audio, and context surcharge support
Added automated weekly sync via GitHub Actions with change detection
Implemented deep merge logic to prevent capability data loss during updates
Added comprehensive test suite (9 tests, 100% coverage on critical paths)
Reduced client bundle size by 80% (from 870KB to 170KB) with dynamic loading
Added provider integration guide with SDK examples for 20 major providers
Implemented dynamic metadata loading with SWR hooks for on-demand fetching
Enhanced API documentation with static architecture clarification and client-side examples
Added Cloudflare WAF rate limiting configuration guide (100 requests/minute)
Unified trainingCutoff field across all metadata (replaced knowledgeCutoff)
Generated 1,546 per-model JSON files for edge delivery with global under 20ms response times
Added model status tracking (active, beta, alpha, deprecated) for lifecycle management
Implemented max output tokens display for generation length transparency
Added OpenAPI 3.0 specification for complete API documentation
v0.6.0
2026-02-20
Reworked compare cards to use evidence metrics (coverage, verification share, latest as-of date) instead of synthetic confidence.
Added strict vs exploratory comparison modes with explicit shared benchmark visibility and reliability messaging.
Improved compare benchmark detail rows with provenance context (source, verification badge, as-of date, and N/A handling).
Updated capability profile rendering to support sparse overlap safely while preserving radar behavior and full available domain coverage.
Added comprehensive metadata and social preview improvements across core pages plus canonical URL normalization.
Added structured data (WebSite, Organization, Dataset, TechArticle, BreadcrumbList) for home, model, benchmark, and domain surfaces.
Added Search Console runbook in SEO_CHECKLIST.md and linked it from README methodology guidance.
Improved accessibility with better icon control labels, heading order fixes, and higher-contrast microcopy in key cards.
Reduced initial homepage work by lazy-loading leaderboard interaction code and trimming non-essential above-the-fold motion.
v0.5.0
2026-02-19
Added source and verification level multi-select filters to leaderboard toolbar.
Added data freshness indicators: amber dot for aging (91-180d), red dot for stale (>180d).
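The freshness rule stated in that entry reduces to a small age threshold check. A sketch, with an illustrative helper name:

```typescript
type Freshness = "fresh" | "aging" | "stale";

// Amber ("aging") for 91-180 days old, red ("stale") past 180 days.
function freshness(asOf: Date, now: Date): Freshness {
  const days = Math.floor((now.getTime() - asOf.getTime()) / 86_400_000);
  if (days > 180) return "stale";
  if (days >= 91) return "aging";
  return "fresh";
}
```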
Fixed URL params clobbering between chart and leaderboard on benchmark pages.
Fixed domain ranking calculation to use normalized scores for cross-benchmark comparability.
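One common way to make scores comparable across benchmarks is min-max scaling to a 0-1 range before averaging within a domain. The registry's actual normalization may differ; this is a hedged sketch of the general idea:

```typescript
// Scale a benchmark's scores to [0, 1] so benchmarks with different
// native ranges contribute equally to a domain average.
function minMaxNormalize(scores: number[]): number[] {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  if (max === min) return scores.map(() => 0.5); // degenerate: all scores equal
  return scores.map((s) => (s - min) / (max - min));
}
```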
Added /api/v1/export endpoint with JSON and CSV format support for research workflows.
Added /api-docs page with live data counts and endpoint documentation.
Added domain detail pages showing top models and benchmark lists.
Added explore page with log scale toggle and searchable benchmark selector.
Added loading states for benchmark, model, explore, and benchmarks pages.
Added Open Graph metadata for model, benchmark, and domain pages.
Added mobile menu domain links and API navigation item.
v0.4.0
2026-02-18
Fixed search input race condition that caused characters to be lost during rapid typing.
Made empty compare slots clickable to open model selector directly.
Added aria-live regions for screen reader announcements on search results.
Added theme-color meta tag for consistent mobile browser theming.
Added prefers-reduced-motion support for all animated progress bars.
Made Report Inaccuracy button functional with mailto link.
Cleaned up unused imports for improved bundle size.
v0.3.0
2026-02-15
Added score-level provenance, verification tier labels, and freshness metadata across leaderboard, compare, and model views.
Introduced mobile leaderboard cards and sticky compare tray for faster small-screen workflows.
Shipped strict data validation and improved benchmark metadata defaults.
v0.2.0
2026-02-14
Added methodology page with normalization and ranking explanations.
Introduced data validation scripts and CI workflow for registry quality.
Expanded benchmark taxonomy with Agentic and Advanced Tasks coverage.
v0.1.0
2026-02-13
Added category-average views in leaderboard and compare workflows.
Implemented column reordering, layout persistence, and summary mode.
Fixed benchmark deduplication and key consistency regressions.