LLM routing policies work: what three benchmarks confirm

Carlo Satta

TL;DR: Three benchmarks confirm that LLM routing policies work. On HumanEval, intelligent routing saves 12% cost at −1pp accuracy. On MMLU, Routerly beats claude-sonnet-4-6 on both accuracy (+3pp) and cost (−9.5%). BIRD confirms routing reduces cost, and also shows that for highly domain-specific benchmarks, cost savings are only one of the metrics worth tracking.

In a previous post we introduced the open benchmark suite that measures routing quality and cost against direct model calls. This post reports the first validated campaign: three benchmarks, 10 seeds, 20 samples per run.

The question is not just whether Routerly is cheaper than a single reference model. It is whether LLM-based routing policies (using a classifier to assign each request to the right model) hold up under controlled, reproducible evaluation. The answer across three different task types is yes, with an important nuance about what “success” means for each.

Setup

Benchmarks:

  • HumanEval: 164 Python code generation problems (OpenAI, 2021)
  • MMLU: 14,042 multiple-choice questions across 57 disciplines (Hendrycks et al., 2021)
  • BIRD: SQL generation on real-world databases (Li et al., 2024)

Baselines: claude-sonnet-4-6 (primary reference), claude-opus-4-6 (high-quality secondary), gpt-4.1-nano (low-cost lower bound).

Protocol: 20 samples per seed, 10 distinct seeds, same seeds across all configurations. 600 runs per benchmark total.
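
As a rough illustration of what "same seeds across all configurations" implies, the sketch below draws a fixed 20-problem subset per seed, so every configuration is scored on identical inputs. The helper name `sample_indices` and the seed values are assumptions for illustration; the post does not describe how the benchmark suite actually samples.

```python
import random

def sample_indices(dataset_size: int, seed: int, k: int = 20) -> list:
    """Draw k distinct problem indices deterministically from a seed."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return sorted(rng.sample(range(dataset_size), k))

# Hypothetical seed values; the post only states that there are 10 of them.
SEEDS = list(range(10))

# One fixed subset per seed (164 is the HumanEval problem count).
# Each configuration would be evaluated on the same subset for a given seed.
runs = {seed: sample_indices(164, seed) for seed in SEEDS}
```

Because the subset depends only on the seed, per-seed differences between configurations can be attributed to the configuration rather than to the sample.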

Summary

| Benchmark | Config | Accuracy | Δ accuracy vs Sonnet | Δ cost vs Sonnet |
|---|---|---|---|---|
| HumanEval | Routerly (complexity routing) | 96.0% | −1.0pp | −12.2% |
| MMLU | Routerly (gpt-5-mini pool) | 89.5% | +3.0pp | −9.5% |
| BIRD | Routerly (SQL complexity routing) | 49.5% | −6.0pp | −5.8% |

HumanEval: routing works exactly as designed

HumanEval is the ideal case for an LLM routing policy. The problem distribution is bimodal: some problems are trivial string and list operations that any small model handles correctly, and others require real reasoning. The policy can draw a clear boundary between them, and the classifier can learn it.

Config: openai/gpt-4.1-mini as router, gpt-4.1-nano and claude-sonnet-4-6 as targets. The policy routes 42% of problems to nano across all 10 seeds.
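
A minimal sketch of the two-target routing step follows, assuming the router returns a label such as "simple" or "complex". The function names and the length-based `toy_classifier` are hypothetical stand-ins: in the configuration above the classification is done by a gpt-4.1-mini call, not by a heuristic, and this is not Routerly's API.

```python
from typing import Callable

CHEAP_MODEL = "gpt-4.1-nano"
STRONG_MODEL = "claude-sonnet-4-6"

def route(prompt: str, classify: Callable[[str], str]) -> str:
    """Return the target model for a prompt, given a classifier label."""
    label = classify(prompt)
    # Anything not explicitly labelled "simple" falls through to the strong model.
    return CHEAP_MODEL if label == "simple" else STRONG_MODEL

def toy_classifier(prompt: str) -> str:
    """Stand-in for the router LLM call; the length threshold is illustrative only."""
    return "simple" if len(prompt) < 200 else "complex"

print(route("Return the sum of a list of integers.", toy_classifier))
```

Defaulting to the strong model on any unexpected label keeps a misbehaving classifier from silently degrading accuracy, at the price of some lost savings.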

| Metric | Sonnet | Routerly |
|---|---|---|
| Average accuracy (10 seeds) | 97.0% | 96.0% |
| Average cost per run | $0.0489 | $0.0429 |
| Δ accuracy | - | −1.0pp |
| Δ cost | - | −12.2% |

Cost is lower on all 10 seeds: the per-seed reduction ranges from 0.3% to 27.1% and averages 12.2%. Accuracy is at parity or better on eight of ten seeds and within 5pp of Sonnet on the remaining two. This is the routing policy behaving as intended: cheap problems go to cheap models, and hard problems still get the best model.

Versus claude-opus-4-6: Routerly is 12pp more accurate at 35% lower cost ($0.0429 vs. $0.0657).
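
The deltas in this section can be re-derived from the per-run averages. The sketch below uses the rounded figures quoted above, so it lands near the published numbers rather than exactly on them: the Sonnet comparison comes out at about −12.3% instead of −12.2%, presumably because the published value was computed from unrounded costs. The function names are illustrative.

```python
def cost_change_pct(baseline: float, candidate: float) -> float:
    """Relative cost change of candidate vs. baseline, in percent."""
    return (candidate - baseline) / baseline * 100

def accuracy_delta_pp(baseline_pct: float, candidate_pct: float) -> float:
    """Accuracy difference in percentage points (not relative percent)."""
    return candidate_pct - baseline_pct

# HumanEval averages quoted in this section
vs_sonnet = cost_change_pct(0.0489, 0.0429)   # roughly -12.3
vs_opus = cost_change_pct(0.0657, 0.0429)     # roughly -34.7
acc_delta = accuracy_delta_pp(97.0, 96.0)     # -1.0

print(f"{vs_sonnet:.1f}% vs Sonnet, {vs_opus:.1f}% vs Opus, {acc_delta:+.1f}pp")
```

Keeping percentage points and relative percent as separate functions avoids the common mistake of reporting a 1pp accuracy drop as a "1% drop".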

MMLU: the right tool for the task

MMLU is a different challenge. Individual questions are short and quick, which means the cost of a routing call is not negligible relative to the cost of answering the question directly. Profiling showed that a gpt-4.1-mini routing call generates approximately 1,152 reasoning tokens ($0.000517 per call); over 20 questions per run, that overhead offsets the savings from routing.
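
The overhead argument is easy to check with the numbers given in this post: the per-call routing cost from profiling, 20 questions per run, and the $0.0112 average per-run cost of calling Sonnet directly on MMLU. The variable and function names below are illustrative.

```python
ROUTING_COST_PER_CALL = 0.000517  # USD per gpt-4.1-mini routing call (from profiling)
QUESTIONS_PER_RUN = 20            # samples per run in the protocol
SONNET_COST_PER_RUN = 0.0112      # USD, average Sonnet-direct cost per MMLU run

def routing_overhead(cost_per_call: float, calls: int) -> float:
    """Total routing cost for one run, before any model answers a question."""
    return cost_per_call * calls

overhead = routing_overhead(ROUTING_COST_PER_CALL, QUESTIONS_PER_RUN)
share = overhead / SONNET_COST_PER_RUN

print(f"routing overhead per run: ${overhead:.4f} ({share:.0%} of Sonnet direct)")
```

The routing calls alone come to about $0.0103 per run, roughly 92% of what Sonnet costs to answer all 20 questions directly, so per-request routing has almost no room to produce a net saving on this benchmark.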

This is a useful finding in itself: the policy layer in Routerly is not limited to dynamic routing. It also supports direct model pool assignment, where a single model is selected based on task type and no per-request routing is applied. For MMLU, the right policy is to assign openai/gpt-5-mini as a single-model pool. gpt-5-mini is a reasoning model that applies chain-of-thought internally, which explains both the accuracy gain and the cost reduction.
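
The difference between the two policy types can be sketched with a small data structure. This is a hypothetical model for illustration, not Routerly's configuration format: the `Policy` class, its fields, and the fallback behaviour are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Policy:
    """Hypothetical policy: a target pool plus an optional per-request router."""
    targets: List[str]
    router: Optional[Callable[[str], str]] = None  # None means direct pool assignment

    def select(self, prompt: str) -> str:
        if self.router is None:
            # Direct assignment: no routing call, so no per-request overhead.
            return self.targets[0]
        choice = self.router(prompt)
        # Assumed fallback: unexpected router output goes to the last-listed target.
        return choice if choice in self.targets else self.targets[-1]

# MMLU-style policy: a single-model pool, chosen per task type.
mmlu_policy = Policy(targets=["openai/gpt-5-mini"])

# HumanEval-style policy: dynamic routing across two targets.
# The lambda is a placeholder for the router model's decision.
humaneval_policy = Policy(
    targets=["gpt-4.1-nano", "claude-sonnet-4-6"],
    router=lambda prompt: "gpt-4.1-nano" if len(prompt) < 200 else "claude-sonnet-4-6",
)
```

In this sketch the single-pool policy never invokes a router, which is the property that removes the per-question overhead on MMLU.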

| Metric | Sonnet | Routerly (gpt-5-mini) |
|---|---|---|
| Average accuracy (10 seeds) | 86.5% | 89.5% |
| Average cost per run | $0.0112 | $0.0101 |
| Δ accuracy | - | +3.0pp |
| Δ cost | - | −9.5% |

Routerly beats Sonnet on both accuracy and cost simultaneously, with zero routing overhead.

For cost-sensitive workloads, deepseek-chat is a viable alternative: 83.0% accuracy at $0.0007 per run, a 93.3% cost reduction at −3.5pp versus Sonnet.

BIRD: cost savings confirmed, accuracy metric needs context

BIRD is the most domain-specific of the three benchmarks. It measures exact SQL execution accuracy on real-world databases: a generated query either produces the same result set as the gold query, or it does not. There is no partial credit.
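
A minimal sketch of an all-or-nothing execution check is shown below, using Python's built-in `sqlite3` module and a toy in-memory table. Comparing the rows as unordered sets and scoring a query that fails to execute as a miss are assumptions made for this sketch; the benchmark's own evaluation script is the authority on the exact comparison.

```python
import sqlite3

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True only if both queries return the same set of rows. No partial credit."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # assumption: a query that does not execute scores zero
    gold = conn.execute(gold_sql).fetchall()
    return set(predicted) == set(gold)  # assumption: row order is ignored

# Toy database for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0), (3, 40.0)])

gold = "SELECT id FROM orders WHERE amount > 20"
print(execution_match(conn, "SELECT id FROM orders WHERE amount >= 25", gold))
print(execution_match(conn, "SELECT id FROM orders", gold))
```

The first comparison passes because a differently written query returns the same rows; the second fails because one extra row is enough to score zero.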

This is worth keeping in mind when reading the accuracy numbers. At 55.5%, even Sonnet called directly scores well below where it lands on general reasoning tasks. The benchmark is hard by design, and the gap between configurations should be read against that already-low baseline.

Config: openai/gpt-4.1-mini as router, gpt-4.1-nano and claude-sonnet-4-6 as targets. The policy classifies queries by SQL complexity: simple joins and single-table queries go to nano, everything with subqueries, window functions, or CTEs goes to Sonnet. The router is appropriately conservative: 74% of queries are sent to Sonnet.
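
To make the complexity boundary concrete, here is a keyword heuristic that applies the same rule to SQL text: subqueries, window functions, and CTEs go to Sonnet, everything else to nano. It is a deliberately simplified stand-in, not the classifier used in the benchmark, where gpt-4.1-mini makes the decision; the patterns and the function name are assumptions.

```python
import re

# Structural markers treated as "complex" in this sketch
COMPLEX_PATTERNS = [
    re.compile(r"\(\s*SELECT\b", re.IGNORECASE),  # subquery
    re.compile(r"\bOVER\s*\(", re.IGNORECASE),    # window function
    re.compile(r"^\s*WITH\b", re.IGNORECASE),     # common table expression
]

def target_for_sql(sql: str) -> str:
    """Pick a target model from surface features of the SQL text."""
    if any(pattern.search(sql) for pattern in COMPLEX_PATTERNS):
        return "claude-sonnet-4-6"
    return "gpt-4.1-nano"

print(target_for_sql("SELECT a.x FROM a JOIN b ON a.id = b.id"))
print(target_for_sql("SELECT * FROM t WHERE id IN (SELECT id FROM u)"))
```

A purely structural rule like this has no way to flag a query that looks simple but needs multi-step reasoning, so any boundary drawn on structure alone will send some hard questions to the cheaper model.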

| Metric | Sonnet | Routerly |
|---|---|---|
| Average accuracy (10 seeds) | 55.5% | 49.5% |
| Average cost per run | $0.0732 | $0.0689 |
| Δ accuracy | - | −6.0pp |
| Δ cost | - | −5.8% |

The routing policy reduces cost by 5.8% versus Sonnet and 42% versus Opus. The accuracy gap reflects the difficulty of drawing a precise SQL complexity boundary: BIRD contains many queries that look structurally simple but require multi-step reasoning, and the classifier routes some of them to nano.

In production SQL workflows, exact execution accuracy against a gold query is rarely the sole metric. Latency, cost per query, and whether the output needs human review before execution all factor in. For review-before-execute pipelines, Routerly at 42% lower cost than Opus is already a practical configuration. Improving the routing boundary to close the accuracy gap is the active area of iteration.

What these benchmarks confirm

The three tasks together validate LLM routing policies as an approach and also clarify when each policy type applies:

  • Bimodal complexity tasks (HumanEval): LLM-based routing classifies inputs correctly at scale and delivers consistent cost savings with minimal accuracy impact.
  • Short high-throughput tasks (MMLU): model substitution is more efficient than per-request routing. Routerly’s policy layer supports both, and selecting the right model for the task type produces better results than routing across a mixed pool.
  • Domain-specific benchmarks (BIRD): routing reduces cost, and the relevant success metrics depend on how the output is used. Exact execution accuracy on a curated dataset is one dimension; cost per query and suitability for human-review pipelines are others.

What comes next

These results give a concrete foundation to Routerly’s core idea: an LLM proxy that routes intelligently can match or exceed the quality of a single premium model at lower cost, and the policy layer is where the leverage is.

The current policies are complexity-based: they classify inputs by structural properties of the task (code difficulty, SQL structure). That works well when the boundary is clear. The natural next step is to explore policies that classify by intent rather than structure. A semantic or intent-based policy would route based on what the user is trying to accomplish: a question that rephrases existing documentation needs a different model than one that requires synthesis across multiple sources or creative generation. This class of policy is harder to define but potentially more generalizable across task types, and it is a direction we plan to validate in the next benchmark iteration.

The campaign is ongoing. Planned work includes tighter routing policies for BIRD, a semantic/intent policy prototype, latency measurements across all configurations, and results against updated model versions. All configurations and raw results are tracked in the routerly-benchmark repository.


Sources