Measuring Routerly: MMLU, HumanEval, and BIRD Benchmarks
One of Routerly’s core claims is that intelligent routing delivers accuracy close to premium models at significantly lower cost: routine requests are served by cheaper models, and the premium models are reserved for tasks that genuinely require them.
Claims like that need evidence. We published routerly-benchmark: an open, reproducible benchmark suite that measures routing quality, cost, and latency against direct model calls.
What the suite measures
Three benchmarks cover distinct dimensions of LLM capability:
| Benchmark | Domain | Dataset | Metric |
|---|---|---|---|
| MMLU | Multi-subject reasoning | 57 subjects, multiple choice | Accuracy (%) |
| HumanEval | Python code generation | 164 problems (OpenAI HumanEval) | pass@1 (%) |
| BIRD | SQL generation | 95 real databases (text-to-SQL) | Execution accuracy (%) |
The goal is not just to measure per-model accuracy. The interesting question is: when Routerly routes across a pool of models, does the aggregate accuracy track the best model in the pool, or does it regress toward the cheapest?
How a benchmark run works
Each benchmark is a Python script that:
- Loads a question subset (controlled by `--seed` for reproducibility)
- Sends each question to the configured target via the OpenAI-compatible API
- Evaluates the response using the task’s native scoring function
- Saves the results to a JSON file under `results/`
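The per-benchmark loop can be sketched as follows. This is an illustrative stand-in, not the repository’s actual code: `call_target` and `score` are hypothetical placeholders for the OpenAI-compatible API call and the task’s native scoring function.

```python
import json
import os
import random

def run_benchmark(questions, call_target, score, seed=42, n=30, out_dir="results"):
    """Sketch of one benchmark run: seeded subset -> query target -> score -> JSON.
    `call_target` and `score` are hypothetical stand-ins for the real API call
    and the task's native scoring function."""
    rng = random.Random(seed)                      # --seed controls the subset
    subset = rng.sample(questions, min(n, len(questions)))
    records = []
    for q in subset:
        answer = call_target(q["prompt"])          # OpenAI-compatible chat call in the real script
        records.append({"id": q["id"], "correct": score(q, answer)})
    accuracy = sum(r["correct"] for r in records) / len(records)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"run_seed{seed}.json")
    with open(path, "w") as f:
        json.dump({"seed": seed, "accuracy": accuracy, "records": records}, f)
    return accuracy, path
```

Because the subset is drawn from a seeded RNG before any network calls, every target sees the same questions for a given seed.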
The same script runs against any target. Targets are configured through .env files:
```
cp .env.example .env_routerly
# edit BASE_URL, API_KEY, MODEL
python benchmark.py --env .env_routerly --seed 42
```
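A filled-in target file might look like this (the variable names come from the suite; the key value is a placeholder):

```shell
# .env_routerly -- illustrative values
BASE_URL=https://api.routerly.ai/v1
API_KEY=your-routerly-token
MODEL=auto
```

Swapping in a different BASE_URL, API_KEY, and MODEL points the same script at any other OpenAI-compatible endpoint.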
Using the same --seed across all runs guarantees the same question subset, which is required for valid comparisons.
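This determinism is just the standard behavior of a seeded RNG; two independently seeded generators draw the identical subset:

```python
import random

# Two independently seeded RNGs draw the identical 30-question subset,
# which is why the same --seed makes runs comparable (illustrative, not the repo's code).
question_ids = list(range(1000))
subset_a = random.Random(42).sample(question_ids, 30)
subset_b = random.Random(42).sample(question_ids, 30)
assert subset_a == subset_b  # same seed -> same subset
```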
Reference baselines (seed 42, n=30)
The following results were recorded on a 30-sample run with seed 42:
MMLU (multi-subject reasoning accuracy)
| Model | Accuracy |
|---|---|
| claude-sonnet-4-6 | 96.7% |
| claude-opus-4-6 | 93.3% |
| gpt-4.1-nano | 70.0% |
HumanEval (Python code generation, pass@1)
| Model | pass@1 |
|---|---|
| claude-sonnet-4-6 | 96.7% |
| claude-opus-4-6 | 86.7% |
| gpt-4.1-nano | 76.7% |
BIRD (text-to-SQL execution accuracy)
| Model | Accuracy |
|---|---|
| claude-opus-4-6 | 66.7% |
| claude-sonnet-4-6 | 56.7% |
| gpt-4.1-nano | 36.7% |
These are baselines for direct model calls. Routerly runs as an additional target against the same question sets, so you can measure the router’s accuracy and cost against each baseline directly.
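Comparing a Routerly run to a baseline is then a matter of diffing two results files from the same seeded run. The JSON field names below are an assumption about the output schema, for illustration:

```python
import json

def accuracy_delta(router_path, baseline_path):
    """Sketch: compare two results files from the same seeded run.
    The 'seed' and 'accuracy' field names are assumed, not the repo's
    documented schema."""
    with open(router_path) as f:
        router = json.load(f)
    with open(baseline_path) as f:
        base = json.load(f)
    # Comparisons are only valid over the same question subset.
    assert router["seed"] == base["seed"], "runs used different --seed values"
    return router["accuracy"] - base["accuracy"]
```

A delta near zero against the strongest baseline, at a fraction of the cost, is the outcome the router is aiming for.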
Supported targets
Every benchmark is fully target-agnostic. It reads BASE_URL, API_KEY, and MODEL from the environment and works identically against any OpenAI-compatible endpoint:
| Target | Base URL |
|---|---|
| Routerly | https://api.routerly.ai/v1 (or your self-hosted instance) |
| Anthropic Claude Opus | https://api.anthropic.com/v1 |
| Anthropic Claude Sonnet | https://api.anthropic.com/v1 |
| OpenAI GPT-4.1-nano | https://api.openai.com/v1 |
The MODEL field can be set to auto when targeting Routerly, letting the routing engine decide which model to use for each request.
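In practice that means the request body is an ordinary chat-completions payload with "auto" in the model field; only the base URL changes per target. The question text below is illustrative:

```python
import os

# Standard OpenAI-compatible chat-completions request; with Routerly as the
# target, model="auto" delegates model choice to the routing engine.
BASE_URL = os.environ.get("BASE_URL", "https://api.routerly.ai/v1")
url = f"{BASE_URL}/chat/completions"
payload = {
    "model": "auto",  # Routerly picks the model per request
    "messages": [{"role": "user", "content": "Which SQL join returns unmatched rows?"}],
}
# POST `payload` to `url` with an Authorization: Bearer API_KEY header.
```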
Why these three benchmarks?
- MMLU covers a broad range of reasoning tasks across 57 academic subjects. A router that inappropriately sends hard questions to weak models will show a clear accuracy drop here.
- HumanEval tests code generation quality. Code correctness is binary (the tests either pass or they do not), making pass@1 a clean, objective signal.
- BIRD is intentionally harder. SQL generation against real multi-table databases requires planning and schema understanding. It stress-tests the router’s ability to identify tasks that genuinely need more capable models.
Together they cover reasoning, code, and structured generation: three workloads that reflect real production usage patterns.
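The binary signal that makes HumanEval clean can be sketched as execution-based scoring: run the candidate against the problem’s unit tests and count any exception as a fail. This is a simplified illustration; the official harness additionally sandboxes execution and enforces timeouts.

```python
def passes(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate solution against its unit tests.
    Any exception, including a failing assert, counts as a fail.
    Simplified sketch: no sandboxing or timeouts."""
    env: dict = {}
    try:
        exec(candidate_src, env)   # define the candidate function
        exec(test_src, env)        # run the problem's asserts against it
        return True
    except Exception:
        return False

def pass_at_1(results):
    # With one sample per problem, pass@1 is simply the fraction that pass.
    return sum(results) / len(results)
```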
Running the suite yourself
The repository is at github.com/Inebrio/routerly-benchmark. Each benchmark has its own directory with a requirements.txt, a README, and .env.example files. A virtual environment per benchmark keeps dependencies isolated.
```
cd "MMLU Benchmark"
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env_routerly  # fill in your Routerly URL and token
python benchmark.py --env .env_routerly --seed 42
```
We will expand the suite as Routerly adds new routing policies and as the model landscape evolves. Contributions are welcome.
Sources
- routerly-benchmark repository: github.com/Inebrio/routerly-benchmark
- MMLU dataset: arxiv.org/abs/2009.03300
- HumanEval dataset: arxiv.org/abs/2107.03374
- BIRD dataset: bird-bench.github.io