
Routerly 0.2.0 is out 🚀: semantic cache, Anthropic, four providers

Carlo Satta, June 11, 2026

After two months of steady work on the develop branch, 0.2.0 is tagged, tested, and out the door.

The release adds semantic response caching, native Anthropic support, conversation-aware routing, and four new providers. Zero breaking changes. If you’re running an earlier build, update now; if you’re new to Routerly, keep reading.

Quick reminder of what Routerly is: a self-hosted API gateway that sits in front of your AI providers (OpenAI, Anthropic, Google, Ollama, and now a few more) and gives you one endpoint, cost control, routing, and visibility over the whole thing. You run it, you own the data, nobody sends you a bill. Free and open source under AGPL-3.0.

Here’s what’s in it.

Semantic cache: stop paying twice for the same answer

If you run a support bot or a docs assistant, you already live this: ten users ask the same thing in ten slightly different ways, and you pay for all ten.

Routerly now caches responses by meaning, not by exact string. A new request gets embedded and compared against what’s already in the cache. Close enough match? You get the cached answer back, with no provider call, no token cost, and near-zero latency. On repetitive traffic this adds up faster than you’d expect, and you can watch it happen: cache hits show up in the usage logs and the dashboard, so the savings are a number you can actually point at.
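To make "caches by meaning" concrete, here is a minimal TypeScript sketch of a similarity-threshold cache. It is not Routerly's implementation. embed() is a toy character-trigram hash standing in for a real embedding model, and SemanticCache, the 0.8 threshold, and the linear scan over entries are all illustrative assumptions.

```typescript
// Minimal semantic-cache sketch (illustrative, not Routerly's implementation).
// ASSUMPTION: embed() is a toy character-trigram hash so the example runs offline;
// a real deployment would call an embedding model instead.

const DIM = 256;

// Toy embedding: normalize the text, then count character trigrams into DIM buckets.
function embed(text: string): number[] {
  const v: number[] = new Array(DIM).fill(0);
  const t = text.toLowerCase().replace(/[^a-z0-9 ]/g, "");
  for (let i = 0; i + 3 <= t.length; i++) {
    let h = 0;
    for (let j = i; j < i + 3; j++) h = (h * 31 + t.charCodeAt(j)) % DIM;
    v[h] += 1;
  }
  return v;
}

// Cosine similarity: 1 for identical directions, near 0 for unrelated vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na === 0 || nb === 0 ? 0 : dot / Math.sqrt(na * nb);
}

interface CacheEntry {
  embedding: number[];
  response: string;
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  // HYPOTHETICAL default: minimum similarity for two prompts to share an answer.
  constructor(threshold = 0.8) {
    this.threshold = threshold;
  }

  // Return the response stored for the most similar prompt, if it is close enough.
  lookup(prompt: string): string | undefined {
    const q = embed(prompt);
    let best: CacheEntry | undefined;
    let bestScore = -1;
    for (const e of this.entries) {
      const s = cosine(q, e.embedding);
      if (s > bestScore) {
        bestScore = s;
        best = e;
      }
    }
    return best !== undefined && bestScore >= this.threshold ? best.response : undefined;
  }

  // Record a provider response under the embedding of the prompt that produced it.
  store(prompt: string, response: string): void {
    this.entries.push({ embedding: embed(prompt), response });
  }
}
```

A prompt that differs from a stored one only in case or punctuation maps to the same vector and is served from cache, while an unrelated question falls below the threshold and goes to the provider. The threshold is the real design knob: too low and distinct questions start sharing answers, too high and genuine paraphrases miss.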

Native Anthropic support: Claude Desktop and Claude Code, drop-in

This is the one that’s most satisfying.

Routerly now fully speaks the Anthropic /v1/messages format, which means Claude Desktop, Claude Code, and any other Anthropic-native client can use Routerly as a proxy with no changes beyond pointing it at Routerly. Do that and you get multi-provider fallback, cost tracking, and your routing rules, all applied transparently, while the client has no idea anything changed.

The OpenAI path works exactly as before. Both are first-class now.
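For reference, an Anthropic-native client sends the standard Messages request shape; only the base URL changes when Routerly sits in front. The sketch that follows builds such a request without sending it. ROUTERLY_BASE_URL is a hypothetical address, the idea that x-api-key carries a Routerly-issued key is an assumption (check doc.routerly.ai for the actual endpoint and auth scheme), and "fast" is a hypothetical model alias.

```typescript
// Sketch: an Anthropic Messages API request aimed at Routerly instead of
// api.anthropic.com. The request shape is the standard one; only the base URL differs.
// ASSUMPTIONS: ROUTERLY_BASE_URL is hypothetical, and x-api-key is assumed to
// carry a Routerly-issued key rather than an Anthropic one.

const ROUTERLY_BASE_URL = "http://localhost:3000"; // hypothetical host and port

interface Message {
  role: "user" | "assistant";
  content: string;
}

interface MessagesBody {
  model: string;
  max_tokens: number;
  messages: Message[];
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildMessagesRequest(apiKey: string, body: MessagesBody): HttpRequest {
  return {
    url: `${ROUTERLY_BASE_URL}/v1/messages`,
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": apiKey, // assumption: a Routerly key, not an Anthropic one
      "anthropic-version": "2023-06-01", // Messages API version header
    },
    body: JSON.stringify(body),
  };
}
```

The returned url, method, headers, and body can be handed straight to fetch. Nothing about the payload is Routerly-specific, which is the point: the client keeps speaking plain Anthropic.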

Conversation memory: no more switching models mid-chat

A subtle bug bites stateless gateways: route every request on its own and a multi-turn conversation can quietly hop to a different model from one turn to the next. Coherence suffers, especially across models with different context limits or system-prompt quirks.

0.2.0 keeps a short-term memory keyed by conversation ID. The first turn picks a model, and every following turn in that conversation stays pinned to it. There are no extra services to run: it’s in-process and lightweight, and the client never notices.
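The pinning idea fits in a few lines. This is a sketch under assumptions, not Routerly's code: ConversationPins, the sliding TTL, and the pickModel callback (standing in for whatever routing decision the first turn makes) are hypothetical, and the clock is injectable only so the behavior can be tested.

```typescript
// Sketch of conversation-pinned routing (illustrative, not Routerly's code).
// ASSUMPTIONS: the class name, the sliding TTL, and the pickModel callback are
// hypothetical; the clock is injectable only to make the behavior testable.

type Clock = () => number;

interface Pin {
  model: string;
  expiresAt: number; // epoch milliseconds
}

class ConversationPins {
  private pins = new Map<string, Pin>();
  private ttlMs: number;
  private now: Clock;

  constructor(ttlMs: number, now: Clock = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // First turn (or after expiry): run the routing decision and pin its result.
  // Later turns within the TTL: reuse the pinned model and extend the expiry.
  resolve(conversationId: string, pickModel: () => string): string {
    const t = this.now();
    const pin = this.pins.get(conversationId);
    if (pin !== undefined && pin.expiresAt > t) {
      pin.expiresAt = t + this.ttlMs;
      return pin.model;
    }
    const model = pickModel();
    this.pins.set(conversationId, { model, expiresAt: t + this.ttlMs });
    return model;
  }
}
```

Because the map lives in the gateway process, no external store is involved. A fuller version would also sweep expired entries periodically; here a stale pin is only replaced when its conversation ID shows up again.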

Four new providers

Built-in support for four providers that had become too useful to skip:

DeepSeek, strong reasoning at a low price
Groq, very fast inference for when latency is the thing you care about
Together AI, open-source model hosting
Perplexity, search-augmented models for retrieval-heavy work

Their pricing and context-window data ship in the catalog, and their models mix freely with everything else in a project’s pool, local Ollama models included.
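As an illustration of why pricing and context-window data are useful side by side, the sketch that follows picks the cheapest entry in a mixed pool whose context window fits a request. Everything in it is hypothetical: the CatalogEntry fields, the model names, and every price and limit are placeholders rather than real provider figures, and "cheapest that fits" is one simple policy, not a description of Routerly's routing.

```typescript
// Sketch: catalog entries with pricing and context-window data in one mixed pool.
// ASSUMPTIONS: field names, model names, and every number are placeholders (not
// real provider figures); "cheapest model that fits" is an illustrative policy only.

interface CatalogEntry {
  provider: string; // e.g. "deepseek", "groq", "ollama"
  model: string; // hypothetical model name
  inputPerMTok: number; // USD per million input tokens (placeholder)
  outputPerMTok: number; // USD per million output tokens (placeholder)
  contextWindow: number; // max tokens for prompt plus completion
}

// Pick the cheapest entry whose context window can hold the prompt plus the reply.
function cheapestThatFits(
  pool: CatalogEntry[],
  promptTokens: number,
  maxOutputTokens: number,
): CatalogEntry | undefined {
  const needed = promptTokens + maxOutputTokens;
  let best: CatalogEntry | undefined;
  let bestCost = Infinity;
  for (const e of pool) {
    if (e.contextWindow < needed) continue; // would not fit: skip
    const cost = (promptTokens * e.inputPerMTok + maxOutputTokens * e.outputPerMTok) / 1_000_000;
    if (cost < bestCost) {
      bestCost = cost;
      best = e;
    }
  }
  return best;
}

// Placeholder pool mixing hosted providers with a local Ollama model.
const pool: CatalogEntry[] = [
  { provider: "ollama", model: "local-small", inputPerMTok: 0, outputPerMTok: 0, contextWindow: 8_000 },
  { provider: "groq", model: "fast-hosted", inputPerMTok: 0.5, outputPerMTok: 0.8, contextWindow: 32_000 },
  { provider: "deepseek", model: "long-context", inputPerMTok: 1, outputPerMTok: 2, contextWindow: 64_000 },
];
```

With these placeholder numbers, a short prompt lands on the free local model, a 20k-token prompt on the cheaper hosted entry that fits, and a 40k-token prompt on the only entry with a large enough window.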

Decoupled model IDs: aliases that don’t break

Before, the model ID you used inside Routerly had to match the provider’s exact API string, so any rename upstream meant editing your configs.

Now Routerly’s internal model name is fully independent from the provider’s. Call them fast, reasoner, cheap-summarizer, map each to whatever real model you like, and swap the model underneath any time without touching the code that calls it. Small change, the kind you thank yourself for six months later.
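The alias idea boils down to one level of indirection. In this hypothetical sketch (AliasTable and the provider model strings are made up for illustration), callers only ever ask for fast, and re-pointing it changes the provider model without touching the calling code.

```typescript
// Sketch: internal model aliases decoupled from provider model strings.
// ASSUMPTIONS: AliasTable and the provider model strings are hypothetical.

interface Target {
  provider: string;
  providerModel: string; // the provider's exact API string
}

class AliasTable {
  private table = new Map<string, Target>();

  // Point an internal name at a real provider model, or re-point it later.
  set(alias: string, target: Target): void {
    this.table.set(alias, target);
  }

  // Callers pass only the alias; the provider string stays an internal detail.
  resolve(alias: string): Target {
    const target = this.table.get(alias);
    if (target === undefined) throw new Error(`unknown model alias: ${alias}`);
    return target;
  }
}
```

An upstream rename then becomes a single set() call on the alias, and every caller asking for fast picks up the new model on its next request.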

The smaller stuff that still matters

Per-request cost breakdown: input, output, cache-read and cache-write tokens recorded separately, visible in the dashboard detail view.
Updates from the CLI: routerly update check, routerly update channel, routerly update run.
Updates from the dashboard: Settings > About shows your channel, the latest version, and a one-click update button. Pick the channel that matches your nerve: stable, current, or develop.
Opt-in anonymous telemetry: off by default, toggle it whenever. If you turn it on, you help us see what’s actually used. If you don’t, nothing leaves your machine.
Housekeeping under the hood: Fastify v5, Vite v6, Vitest v3, Zod v4, plus fixes for Qwen3 thinking-only responses and for reasoning_effort leaking to non-o-series OpenAI models.
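Once the four token counts are recorded separately, the per-request cost breakdown is plain arithmetic: each category is multiplied by its own per-million-token rate, and the total is the sum. The field names and rates in this sketch are hypothetical placeholders, not Routerly's schema or any provider's prices.

```typescript
// Sketch: per-request cost with the four token categories kept separate.
// ASSUMPTIONS: field names and rates are hypothetical placeholders, not
// Routerly's schema or any provider's price list.

interface TokenUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// USD per million tokens, one rate per token category.
type RatesPerMTok = Record<keyof TokenUsage, number>;

interface CostBreakdown {
  inputUsd: number;
  outputUsd: number;
  cacheReadUsd: number;
  cacheWriteUsd: number;
  totalUsd: number;
}

function costBreakdown(usage: TokenUsage, rates: RatesPerMTok): CostBreakdown {
  // Convert a token count to USD at a per-million-token rate.
  const usd = (tokens: number, perMTok: number): number => (tokens * perMTok) / 1_000_000;
  const inputUsd = usd(usage.input, rates.input);
  const outputUsd = usd(usage.output, rates.output);
  const cacheReadUsd = usd(usage.cacheRead, rates.cacheRead);
  const cacheWriteUsd = usd(usage.cacheWrite, rates.cacheWrite);
  return {
    inputUsd,
    outputUsd,
    cacheReadUsd,
    cacheWriteUsd,
    totalUsd: inputUsd + outputUsd + cacheReadUsd + cacheWriteUsd,
  };
}
```

Keeping cache reads and writes as their own lines matters because providers that support prompt caching typically bill them at different rates from ordinary input tokens, so lumping them together would hide where the money actually goes.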

No breaking changes

The OpenAI and Anthropic wire formats are untouched. Existing configs, API keys, and client integrations all carry over as-is. Upgrading should be boring, which is the highest compliment you can pay an upgrade.

Update, kick the tires, and please break it

Two things would genuinely help:

If you’re on an older build, update and let us know if anything feels off after the jump.
If you’re new, install it, throw a real workload at it, and file an issue the moment something misbehaves. Finding the rough edges now is exactly the point of a release like this.

Source, install instructions, and the issue tracker are all here: github.com/Inebrio/Routerly. Docs at doc.routerly.ai.

And if you’re wondering whether the routing actually holds up against just calling a top-tier model directly: we ran a much bigger benchmark this time. That’s the next post.

Sources

Routerly repository: github.com/Inebrio/Routerly
