We ran 200 questions per model. Here is what we found.

Carlo Satta

We spent the past few weeks running a rigorous benchmark campaign to answer a question that matters to anyone building with large language models: can an intelligent routing layer match top-tier model accuracy while spending significantly less? The short answer, for the right task categories, is yes. Here is the longer one.

What we tested

We evaluated two Routerly routing policies against direct calls to Claude Sonnet 4.6 and Claude Opus 4.6 on three standard benchmarks.

MMLU covers factual recall and reasoning across 57 academic subjects. It is a good proxy for the kind of knowledge-intensive workloads common in customer support, search, and QA products.

HumanEval is OpenAI’s Python coding benchmark: 164 function signatures, validated by unit tests. It represents code generation and developer tooling use cases.

BIRD is a text-to-SQL benchmark on real databases. It represents data analytics and reporting products where queries must be translated into working SQL.

Each configuration was run on 10 independent random seeds at 20 questions each, giving a 200-question pool per configuration. The same 10 seeds were used for every model, so comparisons are paired and as fair as they can be.

The two routing policies

We tested Routerly in two modes.

The LLM policy uses a small classifier model (gpt-4.1-mini) to read each query and route it to the appropriate backend according to a configurable set of rules. For MMLU, we defined five tiers from simple general-knowledge to graduate-level reasoning. For code, three tiers. For SQL, three tiers based on structural complexity.
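To make the rule-based half of this concrete, here is a minimal sketch of what a tier-to-backend rule table can look like. The tier labels and backend names below are illustrative placeholders, not Routerly's actual configuration, and the classification step (which in production is a gpt-4.1-mini call) is stubbed out:

```python
# Hypothetical sketch of an LLM routing policy's rule table.
# Tier labels and backend names are illustrative, not Routerly's real config.

MMLU_TIERS = {
    "tier_1_general": "deepseek-chat",    # simple general knowledge -> cheap backend
    "tier_2_applied": "deepseek-chat",
    "tier_3_technical": "deepseek-chat",
    "tier_4_advanced": "claude-sonnet",   # harder questions escalate
    "tier_5_graduate": "claude-sonnet",   # graduate-level reasoning
}

def route(tier_label: str, rules: dict, default: str = "claude-sonnet") -> str:
    """Map the classifier's tier label to a backend; unknown labels fall back
    to the safe (strongest) default rather than the cheapest option."""
    return rules.get(tier_label, default)

# In production the tier label would come from a gpt-4.1-mini classification
# call; here we just exercise the rule table directly.
print(route("tier_1_general", MMLU_TIERS))  # deepseek-chat
print(route("unknown_tier", MMLU_TIERS))    # claude-sonnet (default)
```

The design choice worth noting is the fallback: when the classifier emits something unexpected, routing to the strongest backend costs money but never costs accuracy.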

The semantic-intent policy uses a text embedding model (text-embedding-3-small) to match each query against curated example sets per intent category. There is no LLM call on the routing path. The routing overhead is a single embedding lookup costing approximately $0.000002 per query. This policy is shipping in Routerly v0.2.0.
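The core of the semantic-intent mechanism is a nearest-centroid lookup in embedding space. The sketch below shows the idea with tiny hand-written vectors standing in for real text-embedding-3-small embeddings; the intent names and backend mapping are again illustrative, not the shipped v0.2.0 configuration:

```python
import numpy as np

# Toy sketch of semantic-intent routing: cosine similarity against one
# centroid per intent category. Real vectors come from text-embedding-3-small
# (1536 dimensions); these 3-d vectors are stand-ins.

INTENT_CENTROIDS = {
    "simple_qa": np.array([1.0, 0.1, 0.0]),
    "hard_reasoning": np.array([0.0, 0.2, 1.0]),
}
INTENT_BACKEND = {"simple_qa": "deepseek-chat", "hard_reasoning": "claude-sonnet"}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def route_by_intent(query_vec):
    """Pick the intent whose centroid is most similar to the query embedding,
    then return that intent's configured backend."""
    best = max(INTENT_CENTROIDS, key=lambda k: cosine(query_vec, INTENT_CENTROIDS[k]))
    return INTENT_BACKEND[best]

print(route_by_intent(np.array([0.9, 0.0, 0.1])))  # deepseek-chat
```

Because the only per-query work is one embedding call plus a handful of dot products, the routing overhead stays at the microdollar level described above.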

The results

On MMLU, both policies reached 83.5% accuracy versus Sonnet’s 86.5%. That 3-point gap, across our 200-question paired sample, is statistically indistinguishable from zero. The semantic-intent policy reached this result at $0.00344 per run versus Sonnet’s $0.01118: a 69% cost reduction.
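The "indistinguishable from zero" claim comes from a paired test over per-seed accuracies. Here is the shape of that computation with made-up per-seed numbers (not our actual data; only the two means are set to match the headline 83.5% and 86.5%):

```python
from math import sqrt
from statistics import mean, stdev

# Illustrative paired comparison: the same 10 seeds run through both
# configurations. These per-seed accuracies are placeholders, not our data.
router = [0.85, 0.80, 0.85, 0.90, 0.80, 0.85, 0.80, 0.85, 0.85, 0.80]
sonnet = [0.80, 0.90, 0.85, 0.85, 0.95, 0.85, 0.90, 0.80, 0.90, 0.85]

diffs = [r - s for r, s in zip(router, sonnet)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))  # paired t statistic
print(round(t, 2))
# |t| below the ~2.26 two-sided critical value for 9 degrees of freedom
# means the gap cannot be distinguished from seed-to-seed noise.
```

Pairing on seed is what makes the comparison tight: each difference cancels the shared per-seed difficulty, so the test sees only the between-model gap.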

On HumanEval, both policies hit 95.0% Pass@1 versus Sonnet’s 97.0%. Again, the difference is inside the statistical noise. The semantic-intent policy cost $0.03191 per run versus $0.04889 for Sonnet: a 35% cost reduction.

On BIRD, the picture is more nuanced. The cheapest backend in our current pool (gpt-4.1-nano) only reaches 32.5% on SQL tasks directly. Any fraction of queries routed to it drags accuracy down. Both policies ended below Sonnet’s 55.5%, though both comfortably beat Claude Opus on every dimension that matters. We are transparent about this: the BIRD result is a backend pool limitation, not a routing policy failure. Adding a stronger cheap SQL specialist to the pool is the fix.

What this tells us about routing

A few things stand out from the data.

First, the two routing mechanisms arrived at the same accuracy on MMLU and HumanEval despite completely different architectures: one reads queries with an LLM, the other compares embeddings. That convergence is a strong internal consistency check: the routing signal really is there in the data.

Second, the semantic-intent policy is the better commercial option for most workloads. It is 62% cheaper than the LLM policy on MMLU and 21% cheaper on HumanEval, because it eliminates the routing model call entirely. The LLM policy’s routing call accounted for 80% of total spend on MMLU. That is a large tax for a short-prompt benchmark.
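The cost figures above imply a simple decomposition, which we can sanity-check by arithmetic (these intermediate numbers are derived from the percentages in the text, not separately measured):

```python
# Back out the LLM policy's MMLU cost from "62% cheaper" and split out the
# routing call's 80% share. Derived figures, not independent measurements.
semantic_mmlu = 0.00344                        # per-run cost, semantic-intent policy
llm_policy_mmlu = semantic_mmlu / (1 - 0.62)   # semantic is 62% cheaper
routing_share = 0.80                           # routing call's share of LLM-policy spend
routing_cost = llm_policy_mmlu * routing_share

print(round(llm_policy_mmlu, 5))  # ~0.00905 per run
print(round(routing_cost, 5))     # ~0.00724 of that is the routing call
```

In other words, on a short-prompt benchmark the gpt-4.1-mini routing call alone costs roughly twice the entire semantic-intent run, which is exactly why eliminating it pays off.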

Third, routing diversity matters. On MMLU, the LLM policy sent 96% of queries to DeepSeek. That is dangerously close to the single-backend limit where routing provides no real benefit. The semantic-intent policy split 76/24 between DeepSeek and Sonnet, which is a healthier distribution and shows the routing is doing real work.

What we are building next

We will add gpt-5-mini to the MMLU backend pool. It already scores 91.5% direct at $0.00968 per run; as a third intent target inside the semantic-intent policy, it should push the Routerly accuracy above 90% at a comparable price.

For BIRD, we will test a code-tuned mid-tier model as the cheap SQL backend. The current LLM routing rules for BIRD are already correct; the pool just needs a stronger cheap option.

We also plan to grow the campaign to 55 seeds per configuration. That would give the paired statistical tests enough power to distinguish a genuine 3-point accuracy gap from noise, which the current 10-seed design structurally cannot do.
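A normal-approximation power calculation shows why the seed count has to grow. The standard deviation of per-seed paired differences used below (0.07) is an assumed illustrative value, not measured from our data:

```python
from math import ceil

# Back-of-the-envelope sample size for detecting a 3-point paired gap at
# alpha = 0.05 (two-sided) with 80% power. Normal approximation; sd_diff
# is an assumed value, not measured from our campaign.
z_alpha = 1.96   # two-sided alpha = 0.05
z_beta = 0.84    # power = 0.80
sd_diff = 0.07   # assumed sd of per-seed paired accuracy differences
delta = 0.03     # the 3-point gap we want to detect

n = ceil(((z_alpha + z_beta) * sd_diff / delta) ** 2)
print(n)  # seeds needed under these assumptions
```

Under these assumptions a few dozen seeds suffice, and a 55-seed design leaves headroom for a noisier-than-assumed difference distribution; 10 seeds clearly does not.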

The full data

The complete technical audit, including all per-seed results, routing distributions, cost decompositions, and statistical methodology, is available at the links below. Everything is computed directly from the raw JSON session archives.

Full technical audit and benchmark scripts: github.com/Inebrio/routerly-benchmark

Download the full audit report: Routerly_Benchmark_Audit.pdf
