Methodology

How we source data, what we track, and how often we update.

Data Sources

OpenRouter API

Daily

Model pricing across 9 tiers (input, output, cache read, cache write, internal reasoning, image, audio, web search, per-request), context windows, modalities, supported parameters, deprecation dates.

Source: https://openrouter.ai/api/v1/models
License: Public API, no authentication required
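The models endpoint reports prices as per-token USD strings, which we convert to the per-1M-token figures shown on the site. A minimal sketch of that conversion; the response excerpt below is a hypothetical, trimmed-down example of the payload shape, not the full schema:

```python
import json

# Hypothetical excerpt of the /api/v1/models response; the real payload
# carries many more fields. Prices are assumed per-token USD strings.
sample = json.loads("""
{"data": [
  {"id": "example/model-a",
   "context_length": 128000,
   "pricing": {"prompt": "0.000001", "completion": "0.000002"}}
]}
""")

def per_million(price_per_token: str) -> float:
    """Convert a per-token price string to USD per 1M tokens."""
    return float(price_per_token) * 1_000_000

for model in sample["data"]:
    p = model["pricing"]
    print(model["id"], per_million(p["prompt"]), per_million(p["completion"]))
```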

Epoch AI

Weekly

Benchmark scores across 40+ benchmarks for 168+ models, normalized to a 0-1 scale. Training compute and cost data for 3,200+ models.

Source: https://epoch.ai/data
License: CC-BY 4.0

SWE-bench

Weekly

Software engineering task resolution rates across 6 leaderboard variants (Verified, Lite, Full, bash-only, Multilingual, Multimodal). Cost per instance. Open-source flags.

Source: https://www.swebench.com
License: Open data

MCP Registry

Daily

4,000+ MCP server listings with package info, transport types, and repository links.

Source: https://registry.modelcontextprotocol.io
License: Open API

HuggingFace

Weekly

Model metadata — parameter counts, downloads, likes, licenses, model types, and last modified dates for 650+ models.

Source: https://huggingface.co/api/models
License: Public API, no authentication required

Ollama Registry

Weekly

Locally-runnable models — model names, sizes, and quantization options for on-device inference.

Source: https://ollama.com/api/tags
License: Public API

Official Provider Reports

On release

Benchmark scores published by model providers (Anthropic, OpenAI, Google, Meta, etc.) in model cards and technical reports.

License: Public

Benchmark Scores

We display scores exactly as reported by their source. Scores are not adjusted, weighted, or normalized for cross-benchmark comparison unless explicitly noted. Each score links to its source when available.

When a model has multiple scores for the same benchmark (e.g., different evaluation settings), we use the official score published by the model provider. If no official score exists, we use the most recent independent evaluation.
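The selection rule above can be sketched as a small function. The entry fields (`value`, `official`, `date`) are hypothetical illustrations, not our actual schema:

```python
from datetime import date

def pick_score(scores: list[dict]) -> dict:
    """Pick one score per benchmark: prefer the provider's official score;
    otherwise fall back to the most recent independent evaluation."""
    official = [s for s in scores if s["official"]]
    if official:
        return official[0]
    return max(scores, key=lambda s: s["date"])

# Two independent evaluations, no official score: the newer one wins.
independent = [
    {"value": 0.71, "official": False, "date": date(2024, 5, 1)},
    {"value": 0.74, "official": False, "date": date(2024, 8, 1)},
]
print(pick_score(independent)["value"])
```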

Average Score

The "Avg" column in the leaderboard is an unweighted arithmetic mean across all benchmarks where the model has been tested. This is a rough signal, not a definitive ranking. Models tested on more benchmarks may have lower averages due to exposure to harder tests.

Pricing

All pricing is sourced from OpenRouter's public API and represents the top available provider for each model. Prices are shown per 1 million tokens. We track 9 pricing tiers: input, output, cache read, cache write, image, audio, internal reasoning, web search, and per-request fees.

Pricing is checked daily. When a price change is detected, the new price is recorded with a timestamp. Historical price data is stored but not currently displayed.
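The change-detection step amounts to comparing the fetched price against the last recorded one and appending a timestamped entry only on a change. A sketch assuming a simple in-memory history list; the field names and storage are hypothetical:

```python
from datetime import datetime, timezone

def record_if_changed(history: list[dict], model_id: str, new_price: float) -> bool:
    """Append a timestamped price entry only when the price changed.
    'history' is a hypothetical in-memory log; real storage may differ."""
    latest = next((h for h in reversed(history) if h["model"] == model_id), None)
    if latest is not None and latest["price"] == new_price:
        return False  # unchanged: nothing to record
    history.append({
        "model": model_id,
        "price": new_price,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    })
    return True

log: list[dict] = []
record_if_changed(log, "example/model-a", 1.0)   # first sighting: recorded
record_if_changed(log, "example/model-a", 1.0)   # same price: skipped
record_if_changed(log, "example/model-a", 2.0)   # price change: recorded
print(len(log))
```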

Uptime Monitoring

Provider uptime is measured by sending lightweight health check requests to each provider's API endpoint every 60 seconds from US-East. We record response time (latency) and HTTP status code. A provider is marked "degraded" if average latency exceeds 2x its 30-day baseline, and "outage" if requests fail consistently.

Uptime percentage is calculated over a rolling 30-day window. This reflects API endpoint availability, not individual model availability.
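The status and uptime rules can be sketched as below. Reading "fail consistently" as every check in the recent window failing is our assumption for the sketch, not a precise threshold from the monitoring system:

```python
def classify(avg_latency_ms: float, baseline_ms: float,
             failed: int, total: int) -> str:
    """Status rules from the text: 'outage' when all recent checks failed
    (an assumed reading of 'fail consistently'), 'degraded' when average
    latency exceeds 2x the 30-day baseline, otherwise 'healthy'."""
    if total > 0 and failed == total:
        return "outage"
    if avg_latency_ms > 2 * baseline_ms:
        return "degraded"
    return "healthy"

def uptime_pct(successes: int, total: int) -> float:
    """Rolling-window uptime: share of successful health checks."""
    return 100.0 * successes / total if total else 0.0

print(classify(450.0, 200.0, failed=0, total=30))  # latency > 2x baseline
print(uptime_pct(99, 100))
```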

Open Source Classification

A model is marked "Open Source" if its weights are publicly available for download and use under an OSI-approved license (e.g., Apache 2.0, MIT) or a permissive community license (e.g., the Llama Community License). Models with "open weights" under restrictive licenses are marked separately.

Update Frequency

Data Type            Frequency
API Pricing          Daily
Provider Uptime      Every 60 seconds
Benchmark Scores     On new release + weekly sweep
MCP Servers          Daily
GitHub Stars/Forks   Daily
Model Directory      Daily (from OpenRouter)

Corrections

If you spot an error in our data, please open an issue on our GitHub repository or reach out on Twitter @BenchGecko. We take data accuracy seriously and will correct errors within 24 hours.

Built by the BenchGecko team. Powered by data from the open AI ecosystem.