
Routerly vs Fable 5 and Opus 4.8: how the routing holds up

Carlo Satta · June 10, 2026 · 4 min read

Fable 5 and Opus 4.8 are the new frontier. The natural question for anyone running Routerly is whether the routing still holds up against the latest models, or whether the accuracy gap has widened enough to change the math. We ran the comparison.

Two benchmarks, three seeds each, 60 questions per model on each benchmark: HumanEval (code generation, pass@1) and MMLU (general knowledge, accuracy). The frontier models were evaluated on the exact same sampled questions Routerly had already been tested on, with identical random seeds and an identical sample size (n = 20 per seed). The Routerly numbers come from the April extended campaign; the frontier runs are from June 2026. Because the question sets are identical, the accuracy comparison is like for like. The cost figures reflect current API pricing rather than a controlled simultaneous run, so treat them as indicative.

On code, near-frontier accuracy at a fraction of the cost

Configuration	Accuracy	Std. dev	Cost per query
Opus 4.8	93.3%	±2.4	$0.00630
Routerly LLM policy	91.7%	±6.2	$0.00231
Fable 5	88.3%	±2.4	$0.00961
Routerly semantic policy	88.3%	±4.7	$0.00177

The LLM policy retains 98% of Opus 4.8’s accuracy at 37% of the cost, and its 1.6-point gap against Opus falls inside the small-sample noise at this seed count. The semantic policy matches Fable 5’s accuracy exactly, 88.3% on both, at 18% of Fable 5’s cost. On coding workloads, routing delivers near-frontier accuracy and cuts the bill by roughly 3x to 5x.
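As a rough check on the noise claim, the sketch below estimates the binomial standard error of an accuracy measured on 60 questions and expresses the gap between Opus 4.8 and the LLM policy in those units. It assumes each question is a simple pass/fail trial and treats the two runs as independent; both configurations actually saw the same questions, so this is an approximation and not the benchmark's own analysis. The function names are illustrative.

```python
import math

N_QUESTIONS = 60  # 3 seeds x 20 questions, per model per benchmark


def standard_error(accuracy: float, n: int = N_QUESTIONS) -> float:
    """Binomial standard error of an accuracy (as a fraction) measured on n questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def gap_in_standard_errors(acc_a: float, acc_b: float, n: int = N_QUESTIONS) -> float:
    """Absolute accuracy gap divided by the standard error of the difference.

    Assumption: the two measurements are independent, which is a simplification
    because both configurations were run on the same sampled questions.
    """
    se_diff = math.sqrt(standard_error(acc_a, n) ** 2 + standard_error(acc_b, n) ** 2)
    return abs(acc_a - acc_b) / se_diff


opus_accuracy = 0.933        # Opus 4.8 on HumanEval, from the table
llm_policy_accuracy = 0.917  # Routerly LLM policy on HumanEval, from the table

print(f"SE of Opus accuracy:       {standard_error(opus_accuracy):.3f}")
print(f"SE of LLM policy accuracy: {standard_error(llm_policy_accuracy):.3f}")
print(f"Gap in standard errors:    {gap_in_standard_errors(opus_accuracy, llm_policy_accuracy):.2f}")
```

Under these assumptions each accuracy carries a standard error of roughly 3 to 4 points, so a 1.6-point gap sits well under one standard error.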

On knowledge, the cost advantage grows but accuracy gives more ground

Configuration	Accuracy	Std. dev	Cost per query
Opus 4.8	88.3%	±6.2	$0.00104
Fable 5	81.7%	±4.7	$0.00388
Routerly LLM policy	75.0%	±8.2	$0.00046
Routerly semantic policy	73.3%	±6.2	$0.00019

The semantic policy answers at $0.00019 per query, roughly one twentieth of Fable 5’s cost, while holding about 90% of Fable 5’s accuracy. Against Opus 4.8 it keeps 83% of the accuracy at 18% of the cost. The accuracy trade-off is more visible here than on code: the semantic policy trails Opus 4.8 by 15 points and Fable 5 by 8.4. This is the dial routing gives you: decide what a query is worth, and stop overpaying for the rest.
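The retention and cost percentages quoted for both benchmarks follow directly from the tables. A minimal sketch that recomputes them; the dictionary layout and the `compare` helper are illustrative and not part of Routerly:

```python
# (accuracy in %, cost per query in USD), copied from the two tables above
RESULTS = {
    "HumanEval": {
        "Opus 4.8": (93.3, 0.00630),
        "Routerly LLM policy": (91.7, 0.00231),
        "Fable 5": (88.3, 0.00961),
        "Routerly semantic policy": (88.3, 0.00177),
    },
    "MMLU": {
        "Opus 4.8": (88.3, 0.00104),
        "Fable 5": (81.7, 0.00388),
        "Routerly LLM policy": (75.0, 0.00046),
        "Routerly semantic policy": (73.3, 0.00019),
    },
}


def compare(benchmark: str, config: str, baseline: str) -> tuple[float, float]:
    """Return (share of baseline accuracy retained, share of baseline cost paid)."""
    accuracy, cost = RESULTS[benchmark][config]
    base_accuracy, base_cost = RESULTS[benchmark][baseline]
    return accuracy / base_accuracy, cost / base_cost


PAIRS = [
    ("HumanEval", "Routerly LLM policy", "Opus 4.8"),
    ("HumanEval", "Routerly semantic policy", "Fable 5"),
    ("MMLU", "Routerly semantic policy", "Fable 5"),
    ("MMLU", "Routerly semantic policy", "Opus 4.8"),
]

for benchmark, config, baseline in PAIRS:
    retained, cost_share = compare(benchmark, config, baseline)
    print(f"{benchmark}: {config} vs {baseline}: "
          f"{retained:.0%} of the accuracy at {cost_share:.0%} of the cost")
```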

The right metric is quality per dollar

Routing is not a contest to top an accuracy leaderboard. Opus 4.8 scores higher when you send it every request without filtering, and it should. The question for anyone running LLMs at scale is different: how much quality are you buying per dollar?

On code, the answer is clear. The routing captures most of the frontier’s accuracy while spending a fraction of the budget, with the gap against the best model sitting inside measurement noise. On knowledge the savings are larger but the accuracy concession is more visible. The right trade-off depends on the workload, and routing makes it explicit and measurable rather than invisible.
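One way to put a number on quality per dollar is expected correct answers per dollar spent: accuracy as a fraction divided by cost per query. The article does not define the metric formally, so treat this definition and the helper names as assumptions; the inputs are the table values.

```python
# (accuracy as a fraction, cost per query in USD), from the two results tables
TABLES = {
    "HumanEval": {
        "Opus 4.8": (0.933, 0.00630),
        "Routerly LLM policy": (0.917, 0.00231),
        "Fable 5": (0.883, 0.00961),
        "Routerly semantic policy": (0.883, 0.00177),
    },
    "MMLU": {
        "Opus 4.8": (0.883, 0.00104),
        "Fable 5": (0.817, 0.00388),
        "Routerly LLM policy": (0.750, 0.00046),
        "Routerly semantic policy": (0.733, 0.00019),
    },
}


def correct_per_dollar(accuracy: float, cost_per_query: float) -> float:
    """Expected number of correct answers bought with one dollar."""
    return accuracy / cost_per_query


def rank(benchmark: str) -> list[tuple[str, float]]:
    """Configurations on one benchmark, best quality per dollar first."""
    scored = [
        (name, correct_per_dollar(accuracy, cost))
        for name, (accuracy, cost) in TABLES[benchmark].items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for benchmark in TABLES:
    print(benchmark)
    for name, value in rank(benchmark):
        print(f"  {name}: {value:,.0f} correct answers per dollar")
```

This metric deliberately ignores how much a wrong answer costs downstream. For workloads where errors are expensive, the raw accuracy gap matters more than the ratio.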

The routing policies tested here are the ones that shipped in Routerly 0.2.0. Source and install instructions are at github.com/Inebrio/Routerly.

Download the full comparison: Routerly_Benchmark_Comparison.pdf

Method: HumanEval (pass@1) and MMLU (accuracy), n = 20 per seed, 3 seeds per benchmark, 60 questions per model on each benchmark. HumanEval seeds: 27485098, 66336210, 127186558. MMLU seeds: 216112228, 218859290, 238604058. Frontier models (Opus 4.8, Fable 5) evaluated June 2026; Routerly strategies evaluated April 2026 on the same seeds and questions. Cost per query is mean per-run cost divided by 20. Preliminary, small-sample comparison.
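The method can be sketched as follows: a seed fixes which 20 questions are drawn, so any model run with the same seed sees the same questions, and cost per query is the mean per-run cost divided by 20. This is a hypothetical illustration only. The actual sampling code lives in the benchmark repository and may not use Python's `random` module, so these exact indices should not be expected to match the published runs. The helper names are invented for the example.

```python
import random

HUMANEVAL_SIZE = 164  # number of problems in the HumanEval dataset
HUMANEVAL_SEEDS = [27485098, 66336210, 127186558]  # seeds listed in the method note
SAMPLE_SIZE = 20      # questions per seed


def sample_question_ids(seed: int, pool_size: int = HUMANEVAL_SIZE,
                        n: int = SAMPLE_SIZE) -> list[int]:
    """Draw n distinct question indices reproducibly from a seed (assumed scheme)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(pool_size), n))


def cost_per_query(run_costs: list[float], n: int = SAMPLE_SIZE) -> float:
    """Mean per-run cost divided by the number of questions per run."""
    return sum(run_costs) / len(run_costs) / n


for seed in HUMANEVAL_SEEDS:
    print(seed, sample_question_ids(seed))

# Per-run costs are not published in the article. 0.126 is simply
# 20 x $0.00630, the mean per-run cost implied by Opus 4.8's HumanEval figure.
print(f"{cost_per_query([0.126, 0.126, 0.126]):.5f}")
```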

Sources

Routerly benchmark repository: github.com/Inebrio/routerly-benchmark
Routerly repository: github.com/Inebrio/Routerly
