Measuring Routerly: MMLU, HumanEval, and BIRD Benchmarks
One of Routerly’s core claims is that intelligent routing delivers accuracy close to premium models at significantly lower cost: routine requests are served by cheaper models, and the premium models are reserved for tasks that genuinely require them.
Claims like that need evidence. We published routerly-benchmark: an open, reproducible benchmark suite that measures routing quality, cost, and latency against direct model calls.
What the suite measures
Three benchmarks cover distinct dimensions of LLM capability:
| Benchmark | Domain | Dataset | Metric |
|---|---|---|---|
| MMLU | Multi-subject reasoning | 57 subjects, multiple choice | Accuracy (%) |
| HumanEval | Python code generation | 164 problems (OpenAI HumanEval) | pass@1 (%) |
| BIRD | SQL generation | 95 real databases (text-to-SQL) | Execution accuracy (%) |
The goal is not just to measure per-model accuracy. The interesting question is: when Routerly routes across a pool of models, does the aggregate accuracy track the best model in the pool, or does it regress toward the cheapest?
How a benchmark run works
Each benchmark is a Python script that:
- Loads a question subset (controlled by `--seed` for reproducibility)
- Sends each question to the configured target via the OpenAI-compatible API
- Evaluates the response using the task’s native scoring function
- Saves the results to a JSON file under `results/`
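The per-benchmark loop can be sketched as follows. This is an illustrative stand-in, not the repository’s actual code: `call_target` and `score` are hypothetical placeholders for the OpenAI-compatible API call and the task’s native scoring function.

```python
import json
import os
import random

def run_benchmark(questions, call_target, score, seed=42, n=30, out_dir="results"):
    """Sketch of one benchmark run: seeded subset -> query target -> score -> JSON.
    `call_target` and `score` are hypothetical stand-ins for the real API call
    and the task's native scoring function."""
    rng = random.Random(seed)                      # --seed controls the subset
    subset = rng.sample(questions, min(n, len(questions)))
    records = []
    for q in subset:
        answer = call_target(q["prompt"])          # OpenAI-compatible chat call in the real script
        records.append({"id": q["id"], "correct": score(q, answer)})
    accuracy = sum(r["correct"] for r in records) / len(records)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"run_seed{seed}.json")
    with open(path, "w") as f:
        json.dump({"seed": seed, "accuracy": accuracy, "records": records}, f)
    return accuracy, path
```

Because the subset is drawn from a seeded RNG before any network calls, every target sees the same questions for a given seed.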
The same script runs against any target. Targets are configured through .env files:
```
cp .env.example .env_routerly
# edit BASE_URL, API_KEY, MODEL
python benchmark.py --env .env_routerly --seed 42
```
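A filled-in target file might look like this (the variable names come from the suite; the key value is a placeholder):

```shell
# .env_routerly -- illustrative values
BASE_URL=https://api.routerly.ai/v1
API_KEY=your-routerly-token
MODEL=auto
```

Swapping in a different BASE_URL, API_KEY, and MODEL points the same script at any other OpenAI-compatible endpoint.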
Using the same --seed across all runs guarantees the same question subset, which is required for valid comparisons.
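This determinism is just the standard behavior of a seeded RNG; two independently seeded generators draw the identical subset:

```python
import random

# Two independently seeded RNGs draw the identical 30-question subset,
# which is why the same --seed makes runs comparable (illustrative, not the repo's code).
question_ids = list(range(1000))
subset_a = random.Random(42).sample(question_ids, 30)
subset_b = random.Random(42).sample(question_ids, 30)
assert subset_a == subset_b  # same seed -> same subset
```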
Reference baselines (seed 42, n=30)
The following results were recorded on a 30-sample run with seed 42:
MMLU (multi-subject reasoning accuracy)
| Model | Accuracy |
|---|---|
| claude-sonnet-4-6 | 96.7% |
| claude-opus-4-6 | 93.3% |
| gpt-4.1-nano | 70.0% |
HumanEval (Python code generation, pass@1)
| Model | pass@1 |
|---|---|
| claude-sonnet-4-6 | 96.7% |
| claude-opus-4-6 | 86.7% |
| gpt-4.1-nano | 76.7% |
BIRD (text-to-SQL execution accuracy)
| Model | Accuracy |
|---|---|
| claude-opus-4-6 | 66.7% |
| claude-sonnet-4-6 | 56.7% |
| gpt-4.1-nano | 36.7% |
These are baselines for direct model calls. Routerly runs as an additional target against the same question sets, so you can measure the router’s accuracy and cost against each baseline directly.
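Comparing a Routerly run to a baseline is then a matter of diffing two results files from the same seeded run. The JSON field names below are an assumption about the output schema, for illustration:

```python
import json

def accuracy_delta(router_path, baseline_path):
    """Sketch: compare two results files from the same seeded run.
    The 'seed' and 'accuracy' field names are assumed, not the repo's
    documented schema."""
    with open(router_path) as f:
        router = json.load(f)
    with open(baseline_path) as f:
        base = json.load(f)
    # Comparisons are only valid over the same question subset.
    assert router["seed"] == base["seed"], "runs used different --seed values"
    return router["accuracy"] - base["accuracy"]
```

A delta near zero against the strongest baseline, at a fraction of the cost, is the outcome the router is aiming for.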
Supported targets
Every benchmark is fully target-agnostic. It reads BASE_URL, API_KEY, and MODEL from the environment and works identically against any OpenAI-compatible endpoint:
| Target | Base URL |
|---|---|
| Routerly | https://api.routerly.ai/v1 (or your self-hosted instance) |
| Anthropic Claude Opus | https://api.anthropic.com/v1 |
| Anthropic Claude Sonnet | https://api.anthropic.com/v1 |
| OpenAI GPT-4.1-nano | https://api.openai.com/v1 |
The MODEL field can be set to auto when targeting Routerly, letting the routing engine decide which model to use for each request.
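In practice that means the request body is an ordinary chat-completions payload with "auto" in the model field; only the base URL changes per target. The question text below is illustrative:

```python
import os

# Standard OpenAI-compatible chat-completions request; with Routerly as the
# target, model="auto" delegates model choice to the routing engine.
BASE_URL = os.environ.get("BASE_URL", "https://api.routerly.ai/v1")
url = f"{BASE_URL}/chat/completions"
payload = {
    "model": "auto",  # Routerly picks the model per request
    "messages": [{"role": "user", "content": "Which SQL join returns unmatched rows?"}],
}
# POST `payload` to `url` with an Authorization: Bearer API_KEY header.
```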
Why these three benchmarks?
- MMLU covers a broad range of reasoning tasks across 57 academic subjects. A router that inappropriately sends hard questions to weak models will show a clear accuracy drop here.
- HumanEval tests code generation quality. Code correctness is binary (the tests either pass or they do not), making pass@1 a clean, objective signal.
- BIRD is intentionally harder. SQL generation against real multi-table databases requires planning and schema understanding. It stress-tests the router’s ability to identify tasks that genuinely need more capable models.
Together they cover reasoning, code, and structured generation: three workloads that reflect real production usage patterns.
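The binary signal that makes HumanEval clean can be sketched as execution-based scoring: run the candidate against the problem’s unit tests and count any exception as a fail. This is a simplified illustration; the official harness additionally sandboxes execution and enforces timeouts.

```python
def passes(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate solution against its unit tests.
    Any exception, including a failing assert, counts as a fail.
    Simplified sketch: no sandboxing or timeouts."""
    env: dict = {}
    try:
        exec(candidate_src, env)   # define the candidate function
        exec(test_src, env)        # run the problem's asserts against it
        return True
    except Exception:
        return False

def pass_at_1(results):
    # With one sample per problem, pass@1 is simply the fraction that pass.
    return sum(results) / len(results)
```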
Running the suite yourself
The repository is at github.com/Inebrio/routerly-benchmark. Each benchmark has its own directory with a requirements.txt, a README, and .env.example files. A virtual environment per benchmark keeps dependencies isolated.
```
cd "MMLU Benchmark"
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env_routerly  # fill in your Routerly URL and token
python benchmark.py --env .env_routerly --seed 42
```
We will expand the suite as Routerly adds new routing policies and as the model landscape evolves. Contributions are welcome.
Sources
- routerly-benchmark repository: github.com/Inebrio/routerly-benchmark
- MMLU dataset: arxiv.org/abs/2009.03300
- HumanEval dataset: arxiv.org/abs/2107.03374
- BIRD dataset: bird-bench.github.io