1,000 questions per model: routing caught up to Sonnet on BIRD

Carlo Satta

Back in April we ran 200 questions per model and wrote up what came out. It was a decent first pass with three weaknesses we flagged at the time: only 10 seeds per config (not enough to tell a real 3-point gap from noise), no GPT-5 models in the comparison, and a BIRD setup that was frankly underbaked.

So we went back and did the bigger version. Same idea, more rigor:

  • 50 seeds of 20 questions each, so 1,000 questions per configuration instead of 200.
  • GPT-5-mini and GPT-5-nano added to the MMLU lineup.
  • A reworked BIRD routing config.

Quick framing for anyone new: Routerly is a self-hosted gateway that routes each query to the cheapest model that can still handle it. The question this benchmark exists to answer is simple and a little uncomfortable to ask: can routing actually match a top-tier model like Claude Sonnet 4.6, or are you just trading accuracy for a smaller bill? Here’s what 1,000 questions per config had to say. The full audit with every per-seed number is in the benchmark repo, so you don’t have to take my word for any of this.

The setup, briefly

Two routing policies, both compared against calling models directly:

  • The LLM policy uses a small classifier (gpt-4.1-mini) to read each query and pick a backend.
  • The semantic-intent policy skips the LLM entirely and matches each query against example sets using embeddings. The routing cost is basically a rounding error, about $0.000002 per query. This is the one shipping in 0.2.0.
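To make the second policy concrete, here is a minimal sketch of embedding-based routing. It is not Routerly's implementation: the backend names, the example sets, and the `embed` function (a toy bag-of-words stand-in for a real embedding model) are all assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical example sets: a few sample queries per backend.
EXAMPLES = {
    "cheap-model": [
        "what is the capital of france",
        "define photosynthesis",
    ],
    "strong-model": [
        "write a python function to merge two sorted lists",
        "write a sql query joining orders and customers",
    ],
}


def embed(text):
    """Toy stand-in for an embedding model: word counts as a sparse vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def route(query, default="strong-model"):
    """Pick the backend whose closest example is most similar to the query."""
    q = embed(query)
    best_backend, best_score = default, 0.0
    for backend, samples in EXAMPLES.items():
        score = max(cosine(q, embed(sample)) for sample in samples)
        if score > best_score:
            best_backend, best_score = backend, score
    return best_backend
```

Falling back to the strong model when nothing matches is a design choice in this sketch: an unfamiliar query costs more but is less likely to be answered badly.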

Three benchmarks: MMLU (factual recall and reasoning), HumanEval (Python coding), BIRD (text-to-SQL on real databases). Every model saw the same 50 seeds in the same order, so the comparisons are paired and as fair as we could make them.
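One way to get that kind of pairing is to let the seed alone decide which questions are drawn, so every model configuration replays identical batches. The sketch below assumes a question pool indexed 0..N-1 and a hypothetical pool size; the actual benchmark scripts may sample differently.

```python
import random


def sample_questions(pool_size, seed, n=20):
    """Draw n question indices; the same seed always yields the same draw."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), n)


def build_runs(pool_size, seeds=range(50), n=20):
    """Map each seed to its fixed batch of question indices."""
    return {seed: sample_questions(pool_size, seed, n) for seed in seeds}


# Hypothetical pool of 5,000 questions: 50 seeds x 20 questions = 1,000 per config.
runs = build_runs(pool_size=5000)
total_questions = sum(len(batch) for batch in runs.values())
```

Because the batches depend only on the seed, per-seed scores from two models can be subtracted directly, which is what a paired test needs.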

BIRD: from a 6-point gap to basically zero

This is the result we’re happiest with, because last time BIRD was the weak spot and we called it out.

In April, the cheap SQL backend dragged routing accuracy roughly 6 points below Sonnet. This time the LLM policy lands at 58.5% versus Sonnet’s 59.2%. That’s a gap of -0.7 points, and the paired test (t = -0.91, p = 0.367) can’t tell it apart from zero. With 1,000 questions the test has enough power to catch a real 3-point difference, so this isn’t “too small to measure.” It’s actual parity on text-to-SQL.
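For readers who want to see the mechanics, here is a sketch of a paired t statistic computed on per-seed accuracy differences, followed by a back-of-envelope check on the reported figures. The per-seed numbers in the toy example are made up; only the -0.7 gap and t = -0.91 come from the results above, and since both are rounded the implied standard error is approximate.

```python
import math
from statistics import mean, stdev


def paired_t(routed, baseline):
    """Return (mean difference, t statistic) for per-seed paired scores."""
    diffs = [r - b for r, b in zip(routed, baseline)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs), mean(diffs) / standard_error


# Toy per-seed accuracies (illustrative only, not benchmark data).
toy_gap, toy_t = paired_t([60, 55, 65, 50], [60, 60, 60, 60])

# Back-of-envelope from the reported, rounded figures.
reported_gap = -0.7   # points, LLM policy minus Sonnet
reported_t = -0.91    # paired t statistic over 50 seeds
implied_se = reported_gap / reported_t        # standard error of the mean gap
three_point_in_se = 3.0 / implied_se          # how many SEs a 3-point gap spans
```

Under these rounded inputs the standard error comes out near 0.77 points, so a true 3-point gap would sit almost four standard errors from zero. That is the sense in which the test would have caught it.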

There’s a catch, worth saying plainly rather than burying. The reason accuracy caught up is that the router now sends 93.7% of BIRD queries to Sonnet. Great for accuracy, less great for your wallet: that leaves the routing overhead sitting on top of a near-Sonnet bill, so the LLM policy ends up costing a hair more than Sonnet direct ($0.074 vs $0.069 per run). Accuracy: solved. Cost: not yet. The fix is a mid-tier SQL model good enough to take more of the easy queries, and that’s on the list.
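The cost side can be sketched as a simple blend: the share of traffic sent to the strong model times its price, plus the rest at the cheap model's price, plus the routing call. In the snippet below the three reported figures are from this post; the cheap-backend price and routing overhead in the what-if are hypothetical, and it assumes queries routed to Sonnet cost the same on average as Sonnet direct.

```python
def routed_cost(share_strong, cost_strong, cost_cheap, routing_overhead):
    """Expected cost per run when a share of traffic goes to the strong model."""
    backend = share_strong * cost_strong + (1 - share_strong) * cost_cheap
    return backend + routing_overhead


SONNET_DIRECT = 0.069     # reported, USD per run
LLM_POLICY = 0.074        # reported, USD per run
SHARE_TO_SONNET = 0.937   # reported share of BIRD queries sent to Sonnet

# What remains for the cheap backend plus the routing call, under the
# assumption that Sonnet-routed queries cost the same as Sonnet direct.
residual = LLM_POLICY - SHARE_TO_SONNET * SONNET_DIRECT

# Hypothetical what-if: cheap backend at $0.010 per run, routing at $0.005.
what_if = {
    share: round(routed_cost(share, SONNET_DIRECT, 0.010, 0.005), 4)
    for share in (0.937, 0.72, 0.50)
}
```

The point of the what-if is the shape rather than the exact values: with a fixed routing overhead, the bill only drops meaningfully once a larger share of queries can be handed to a cheaper backend.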

If you care more about the bill than the last few points, the semantic policy routes a healthier 72/28 split, comes in 26% cheaper, and gives up 5.5 points of accuracy.

HumanEval: the clean win

Coding is where everything lines up.

The LLM policy hits 94.9% pass@1 versus Sonnet’s 97.7%. That -2.8 point gap is inside the ±3-point band we set as the bar, and it comes with a 12% cost saving. So: matches the target accuracy, costs less, holds up across 50 seeds. That’s the cleanest result in the whole campaign, no asterisks.

Want to save more and can live with a bigger gap? The semantic policy is 41% cheaper at 90.8%. Both policies, by the way, comfortably beat calling Opus directly, which on this benchmark was both less accurate and about 4.6x more expensive than Sonnet.

MMLU: a near miss, but a very cheap one

MMLU is where the bigger sample made results more honest, not less.

Both policies land around 84% versus Sonnet’s 88.4%, a gap of roughly 4 points. At 10 seeds that gap hid inside the noise. At 50 seeds it’s real and measurable, and it’s just outside the 3-point band. So strictly speaking, it’s a miss.

But look at the price tag. The semantic policy gets that 84% at $0.003 per run against Sonnet’s $0.011, about 70% cheaper. For a lot of factual-recall workloads, 4 points of accuracy for a 70% smaller bill is a trade worth taking.

Adding the GPT-5 models was also clarifying. GPT-5-nano scored 86.5%, only a couple of points above the routing policies and not far off on cost. GPT-5-mini hit 90.3% but at $0.024 per run, more than 7x the semantic policy. The obvious next move is to drop GPT-5-mini into the routing pool for the hard reasoning questions, which should pull MMLU up toward 87-89% while keeping the bill well under Sonnet. That’s the single change most likely to turn this near-miss into a clean win.
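The cost comparisons in this section are plain ratios, shown here with the per-run figures quoted above. Those figures are rounded to three decimals, so the results are approximate, which is why the text says "about 70%" and "more than 7x" rather than exact values.

```python
def saving(cost, baseline):
    """Fractional saving of `cost` relative to `baseline` (0.25 means 25% cheaper)."""
    return 1 - cost / baseline


def multiple(cost, reference):
    """How many times more expensive `cost` is than `reference`."""
    return cost / reference


SEMANTIC_MMLU = 0.003   # USD per run, as quoted
SONNET_MMLU = 0.011     # USD per run, as quoted
GPT5_MINI_MMLU = 0.024  # USD per run, as quoted

semantic_saving = saving(SEMANTIC_MMLU, SONNET_MMLU)
mini_vs_semantic = multiple(GPT5_MINI_MMLU, SEMANTIC_MMLU)
```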

Why the two policies keep agreeing

One thing that keeps being reassuring: the LLM policy and the semantic policy are completely different mechanisms (one reads queries with a model, the other compares embeddings), and they keep drawing the line in much the same place. Both land around 84% on MMLU, and on HumanEval both settle on roughly a 60/40 divide between the strong model and the cheap one. When two unrelated methods agree on where the line is, the line is probably real, and the routing is doing actual work rather than dumping everything on one backend.

The semantic policy is also cheaper than the LLM policy everywhere (57% on MMLU, 34% on HumanEval, 32% on BIRD), for the boring reason that it doesn’t pay for a routing model call. On MMLU, where the questions are short, that routing call was eating 72% of the LLM policy’s total spend. Embeddings just don’t.

So, does routing work?

Short version, by workload:

  • Coding: yes, clean win. Sonnet-level accuracy, 12% cheaper.
  • Text-to-SQL: accuracy parity with Sonnet, cost roughly a wash for now. Use the semantic policy if you want the 26% saving and can spare about 5.5 points.
  • Factual recall: 4 points under Sonnet, but 70% cheaper, which is a great deal for most apps.
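The verdicts above boil down to comparing each accuracy gap against the ±3-point band. A small sketch using the LLM-policy figures from this post follows; the MMLU routed score is the approximate "around 84%" quoted earlier, so treat that row as indicative.

```python
BAND = 3.0  # points: the tolerance used as the bar in this post

# (routed accuracy, Sonnet accuracy) in percent, LLM policy.
RESULTS = {
    "HumanEval": (94.9, 97.7),
    "BIRD": (58.5, 59.2),
    "MMLU": (84.0, 88.4),  # routed figure is approximate
}


def gap(routed, baseline):
    """Accuracy gap in points, rounded to one decimal like the write-up."""
    return round(routed - baseline, 1)


def within_band(points, band=BAND):
    """True when the gap is no larger than the tolerance in either direction."""
    return abs(points) <= band


scorecard = {
    name: (gap(routed, baseline), within_band(gap(routed, baseline)))
    for name, (routed, baseline) in RESULTS.items()
}
```

A band check like this only covers the accuracy half of the question; the cost figures still have to be read alongside it, as the BIRD result shows.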

None of this is hand-waving. Every number here is recomputed from the raw per-run JSON, confidence intervals and paired tests included, and it’s all in the benchmark repo for you to pull apart. Open an issue if you spot something off.

Full technical audit and benchmark scripts: github.com/Inebrio/routerly-benchmark

Download the full audit report: Routerly_Benchmark_Audit_v2.pdf

And the routing policy that did all this is the one shipping in 0.2.0, which just went out. Free and open source, AGPL-3.0, self-hosted. Go point it at a real workload and see if these numbers hold for you.

