Methodology
How we source data, what we track, and how often we update.
Data Sources
OpenRouter API
Daily. Model pricing (9 tiers: input, output, cache read/write, reasoning, image, audio, web search, per-request), context windows, modalities, supported parameters, deprecation dates.
Epoch AI
Weekly. Benchmark scores across 40+ benchmarks for 168+ models, normalized to 0-1. Training compute and cost data for 3,200+ models.
SWE-bench
Weekly. Software engineering task resolution rates across 6 leaderboard variants (Verified, Lite, Full, bash-only, Multilingual, Multimodal). Cost per instance. Open-source flags.
MCP Registry
Daily. 4,000+ MCP server listings with package info, transport types, and repository links.
HuggingFace
Weekly. Model metadata: parameter counts, downloads, likes, licenses, model types, and last modified dates for 650+ models.
Ollama Registry
Weekly. Locally runnable models: names, sizes, and quantization options for on-device inference.
Official Provider Reports
On release. Benchmark scores published by model providers (Anthropic, OpenAI, Google, Meta, etc.) in model cards and technical reports.
Benchmark Scores
We display scores exactly as reported by their source. Scores are not adjusted, weighted, or normalized for cross-benchmark comparison unless explicitly noted. Each score links to its source when available.
When a model has multiple scores for the same benchmark (e.g., different evaluation settings), we use the official score published by the model provider. If no official score exists, we use the most recent independent evaluation.
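The sketch below illustrates this selection rule in TypeScript. The `Score` shape and its field names are hypothetical placeholders, not our actual schema.

```typescript
// Illustrative sketch of the score-selection rule described above.
// The Score interface and field names are hypothetical, not our real schema.
interface Score {
  benchmark: string;
  value: number;
  official: boolean; // published by the model provider itself
  evaluatedAt: Date; // when the evaluation was published
}

function selectScore(scores: Score[]): Score | undefined {
  // A provider-published (official) score always wins.
  const official = scores.find((s) => s.official);
  if (official) return official;
  // Otherwise fall back to the most recent independent evaluation.
  return [...scores].sort(
    (a, b) => b.evaluatedAt.getTime() - a.evaluatedAt.getTime()
  )[0];
}
```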
Average Score
The "Avg" column in the leaderboard is an unweighted arithmetic mean across all benchmarks where the model has been tested. This is a rough signal, not a definitive ranking. Models tested on more benchmarks may have lower averages due to exposure to harder tests.
Pricing
All pricing is sourced from OpenRouter's public API and represents the top available provider for each model. Prices are shown per 1 million tokens. We track 9 pricing tiers: input, output, cache read, cache write, image, audio, internal reasoning, web search, and per-request fees.
Pricing is checked daily. When a price change is detected, the new price is recorded with a timestamp. Historical price data is stored but not currently displayed.
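A minimal sketch of what this daily check looks like, assuming OpenRouter's public /api/v1/models endpoint, which returns per-token prices as strings in a pricing object. The exact fields read here and the change-detection logic are illustrative, not our production pipeline.

```typescript
// Sketch of the daily pricing check against OpenRouter's public models
// endpoint. Prices come back as USD per token, so we scale to per-1M tokens.
const PER_MILLION = 1_000_000;

async function fetchOutputPricing(): Promise<Map<string, number>> {
  const res = await fetch("https://openrouter.ai/api/v1/models");
  const { data } = (await res.json()) as {
    data: Array<{
      id: string;
      pricing: { prompt: string; completion: string };
    }>;
  };
  const prices = new Map<string, number>();
  for (const model of data) {
    // Only output (completion) pricing shown here; the real job covers all 9 tiers.
    prices.set(model.id, parseFloat(model.pricing.completion) * PER_MILLION);
  }
  return prices;
}

// Record a timestamped entry only when a price actually changed.
function detectChanges(
  previous: Map<string, number>,
  current: Map<string, number>
): Array<{ id: string; price: number; seenAt: string }> {
  const changes: Array<{ id: string; price: number; seenAt: string }> = [];
  for (const [id, price] of current) {
    if (previous.get(id) !== price) {
      changes.push({ id, price, seenAt: new Date().toISOString() });
    }
  }
  return changes;
}
```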
Uptime Monitoring
Provider uptime is measured by sending lightweight health check requests to each provider's API endpoint every 60 seconds from US-East. We record response time (latency) and HTTP status code. A provider is marked "degraded" if average latency exceeds 2x its 30-day baseline, and "outage" if requests fail consistently.
Uptime percentage is calculated over a rolling 30-day window. This reflects API endpoint availability, not individual model availability.
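The rules above can be read as a small classifier. The sketch below is illustrative; in particular, the threshold for "fail consistently" (five consecutive failures here) is an assumption, not our exact production value.

```typescript
// Hedged sketch of the status rules described above.
interface HealthCheck {
  ok: boolean;       // HTTP status in the 2xx range
  latencyMs: number; // response time of the health check
}

type Status = "operational" | "degraded" | "outage";

function classify(
  recent: HealthCheck[], // most recent checks, newest last
  baselineMs: number     // 30-day average latency for this provider
): Status {
  // "Outage" if requests fail consistently (assumed: 5 in a row).
  if (countTrailingFailures(recent) >= 5) return "outage";

  // "Degraded" if average latency exceeds 2x the 30-day baseline.
  const okChecks = recent.filter((c) => c.ok);
  const avgLatency =
    okChecks.reduce((sum, c) => sum + c.latencyMs, 0) / (okChecks.length || 1);
  if (avgLatency > 2 * baselineMs) return "degraded";

  return "operational";
}

function countTrailingFailures(checks: HealthCheck[]): number {
  let n = 0;
  for (let i = checks.length - 1; i >= 0 && !checks[i].ok; i--) n++;
  return n;
}

// Uptime over the rolling 30-day window: share of checks that succeeded.
function uptimePercent(window: HealthCheck[]): number {
  if (window.length === 0) return 100;
  return (100 * window.filter((c) => c.ok).length) / window.length;
}
```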
Open Source Classification
A model is marked "Open Source" if its weights are publicly available for download and use under an OSI-approved license (e.g., Apache 2.0, MIT) or a permissive community license (e.g., the Llama Community License). Models with "open weights" released under restrictive licenses are marked separately.
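Expressed as code, the rule looks roughly like this; the license lists are examples only, not our exhaustive allowlists.

```typescript
// Illustrative sketch of the open-source classification rule.
// License identifiers and set contents are examples, not complete lists.
const OSI_APPROVED = new Set(["apache-2.0", "mit"]);
const PERMISSIVE_COMMUNITY = new Set(["llama-community"]);

type WeightsStatus = "open-source" | "open-weights" | "closed";

function classifyModel(
  weightsDownloadable: boolean,
  license: string
): WeightsStatus {
  if (!weightsDownloadable) return "closed";
  const id = license.toLowerCase();
  if (OSI_APPROVED.has(id) || PERMISSIVE_COMMUNITY.has(id)) {
    return "open-source";
  }
  return "open-weights"; // weights available, but under a restrictive license
}
```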
Update Frequency
| Data Type | Frequency |
|---|---|
| API Pricing | Daily |
| Provider Uptime | Every 60 seconds |
| Benchmark Scores | On new release + weekly sweep |
| MCP Servers | Daily |
| GitHub Stars/Forks | Daily |
| Model Directory | Daily (from OpenRouter) |
Corrections
If you spot an error in our data, please open an issue on our GitHub repository or reach out on Twitter @BenchGecko. We take data accuracy seriously and will correct errors within 24 hours.
Built by the BenchGecko team. Powered by data from the open AI ecosystem.