Semantic Caching for LLM Serving: When the Cache Hit Is Not a String Match
Exact-match caching misses most LLM cache hits — paraphrases tank hit rate. Semantic caching, threshold tuning, and the production failure modes that bite.
The first cache layer most teams put in front of an LLM is the one they already know: hash the request, key by the hash, store the response, done. It works the way it works for any other backend — then somebody looks at the hit rate, sees four percent, and concludes caching does not help with LLMs. The conclusion is wrong; the cache is. Users do not phrase questions the same way twice. They add a word, drop a word, swap a synonym, paste with a trailing newline. To an exact-match cache, every one of those is a brand new query. Semantic caching recognizes “what is the refund policy” and “how do refunds work” as the same question and serves the second from the first’s response. Done well, it converts a four-percent hit rate into thirty or sixty. Done poorly, it serves confidently wrong answers at high speed.
This is a guide to what semantic caching is, where it goes in the serving path, the threshold and TTL decisions that determine whether it helps or hurts, and the production failure modes the marketing pages omit.
Why exact-match caching underperforms
An exact-match cache keys on the canonicalized request body — model, parameters, prompt string, system message, tool definitions — and hits only when the byte representation is identical. For deterministic backend calls (parameterized SQL, a REST GET with normalized params) that is the right key. For natural-language prompts it is the wrong key, because natural language has high surface variation and low semantic variation. The same intent reaches the model as dozens of strings. The hashing layer cannot see that “cancel my order” and “I need to cancel an order” are the same request, so it pays full inference cost for both.
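To make the miss concrete, here is a minimal sketch of an exact-match key, assuming the request is reduced to a canonical JSON document and hashed with SHA-256. The function name `exact_cache_key`, the field layout, and the model name are illustrative assumptions, not any particular gateway's scheme.

```python
import hashlib
import json


def exact_cache_key(model, params, system, prompt, tools=()):
    """Hash a canonical JSON encoding of every field that shapes the response."""
    canonical = json.dumps(
        {"model": model, "params": params, "system": system,
         "prompt": prompt, "tools": list(tools)},
        sort_keys=True,          # dict ordering must not change the key
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical request fields, shared by all three calls below.
shared = dict(model="example-model", params={"temperature": 0},
              system="You are a support assistant.")
key_a = exact_cache_key(prompt="cancel my order", **shared)
key_b = exact_cache_key(prompt="I need to cancel an order", **shared)
key_c = exact_cache_key(prompt="cancel my order", **shared)

print(key_a == key_c)  # True: byte-identical request, same key
print(key_a == key_b)  # False: same intent, different bytes, so a miss
```

Canonicalization removes incidental differences such as key order, but nothing at this layer can bridge a paraphrase.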
The fix is to key on meaning. Embed the incoming prompt, search a vector index of previously-served prompts, and if the nearest neighbor is within a similarity threshold, return its cached response. Every semantic-cache implementation converges on this architecture with variations in storage, embedding model, and integration surface. GPTCache ↗, the open-source semantic cache from Zilliz (MIT-licensed), states the goal directly: reduce LLM API costs and increase response speed by caching on embedding similarity, with pluggable cache storage (SQLite, Postgres, Redis, MongoDB, Elasticsearch) and pluggable vector stores (Milvus, FAISS, pgvector, Qdrant, Chroma, Weaviate). Same algorithm, swap the components to whatever you already run.
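The lookup described above can be sketched in a few lines. This is a toy under stated assumptions: `toy_embed` is a character-trigram stand-in for a real embedding model (it captures surface overlap, not meaning), the nearest-neighbor search is a linear scan instead of a vector index, and the class name `SemanticCache` is hypothetical.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: character-trigram counts. NOT a real semantic model."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # (vector, response) pairs; a vector index in practice

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return (response, similarity) for the nearest entry, or None on a miss."""
        if not self.entries:
            return None
        query = self.embed(prompt)
        sim, response = max(
            ((cosine(query, vec), resp) for vec, resp in self.entries),
            key=lambda pair: pair[0],
        )
        return (response, sim) if sim >= self.threshold else None


cache = SemanticCache(toy_embed, threshold=0.9)  # 0.9 suits this toy embedding only
cache.put("cancel my order", "Here is how to cancel an order...")
print(cache.get("cancel my orders"))     # near-duplicate: served from cache
print(cache.get("what is the weather"))  # unrelated: None
```

Swapping `toy_embed` for a learned embedding model and the list for a vector store leaves the control flow unchanged, which is the point of the pluggable design.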
The managed-service equivalents take the same approach with different ergonomics. Redis LangCache ↗ is a fully-managed semantic-cache REST service that handles the embedding, vector search, and eviction behind a single API call, sold on the premise that you should not be running a vector database to cache LLM responses. Portkey’s semantic cache ↗ lives in their AI gateway and operates on top of an exact-match check — it first looks for a byte-identical hit and only falls through to semantic search on a miss, which keeps the fast path fast. The default cosine similarity threshold is 0.95, with the constraint that semantic mode applies only to requests under 8,191 tokens and four or fewer messages. The two-tier (exact, then semantic) pattern is worth copying regardless of which tool you use.
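A minimal sketch of the two-tier pattern, assuming an in-memory dict for the exact tier and an injected similarity function for the semantic tier. `TwoTierCache` and `word_jaccard` are hypothetical names, and `word_jaccard` is a crude word-overlap stand-in for embedding cosine similarity; this is not Portkey's implementation.

```python
import hashlib
import re


def word_jaccard(a, b):
    """Stand-in similarity: overlap of lowercase word sets, in [0, 1]."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


class TwoTierCache:
    """Exact-match lookup first; similarity search only on an exact miss."""

    def __init__(self, similarity, threshold=0.95):
        self.similarity = similarity
        self.threshold = threshold
        self.exact = {}    # sha256(prompt) -> response
        self.prompts = []  # (prompt, response) pairs scanned by the semantic tier

    @staticmethod
    def _key(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt, response):
        self.exact[self._key(prompt)] = response
        self.prompts.append((prompt, response))

    def get(self, prompt):
        """Return (response, tier) where tier is 'exact', 'semantic', or 'miss'."""
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"  # fast path: no embedding, no vector search
        scored = [(self.similarity(prompt, p), r) for p, r in self.prompts]
        if scored:
            sim, response = max(scored, key=lambda s: s[0])
            if sim >= self.threshold:
                return response, "semantic"
        return None, "miss"


cache = TwoTierCache(word_jaccard)
cache.put("cancel my order", "cached answer")
print(cache.get("cancel my order"))      # ('cached answer', 'exact')
print(cache.get("Cancel my order!"))     # ('cached answer', 'semantic')
print(cache.get("what is the weather"))  # (None, 'miss')
```

Returning the tier alongside the response makes it cheap to report exact and semantic hit rates separately.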
Where semantic caching sits in the serving path
Place semantic caching in the gateway, after input guardrails ↗ and before the model call. The flow that works in production, in order:
- Auth and request normalization: strip non-semantic noise like client-side timestamps.
- Input guardrails: reject before you cache. Caching unsafe input is wasted storage at best and a policy bypass at worst, when a paraphrase of a previously-allowed prompt gets through.
- Exact-match cache lookup.
- Semantic cache lookup on an exact miss.
- Model call on a full miss, with the response written back to both caches.
- Output guardrails on every response, including cached ones, because policies change and a response cached months ago under the old policy is not exempt from the new one.
Input guardrails before the cache, output guardrails after. The cache is a transparent layer that produces responses indistinguishable from a model call; treat them the same.
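The ordering can be sketched as a single request handler. Everything here is a hypothetical stand-in: the `Gateway` class, the guardrail lambdas, the no-op semantic tier, and `fake_model` exist only to show where each step sits and that a cached response still passes through the output guardrail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Gateway:
    input_ok: Callable[[str], bool]
    output_ok: Callable[[str], bool]
    semantic_lookup: Callable[[str], Optional[str]]
    semantic_store: Callable[[str, str], None]
    call_model: Callable[[str], str]
    exact: Dict[str, str] = field(default_factory=dict)

    def handle(self, raw_prompt: str) -> str:
        prompt = " ".join(raw_prompt.split())      # 1. normalize whitespace noise
        if not self.input_ok(prompt):              # 2. input guardrail, before any cache
            return "[blocked by input guardrail]"
        response = self.exact.get(prompt)          # 3. exact-match tier
        if response is None:
            response = self.semantic_lookup(prompt)  # 4. semantic tier
        if response is None:
            response = self.call_model(prompt)     # 5. full miss: call the model
            self.exact[prompt] = response          #    and write back to both tiers
            self.semantic_store(prompt, response)
        if not self.output_ok(response):           # 6. output guardrail on EVERY response
            return "[blocked by output guardrail]"
        return response


calls = []
policy = {"banned": "secret"}  # mutable so the demo can change policy later


def fake_model(prompt):
    calls.append(prompt)
    return f"answer to: {prompt}"


gw = Gateway(
    input_ok=lambda p: "ignore previous" not in p.lower(),
    output_ok=lambda r: policy["banned"] not in r,
    semantic_lookup=lambda p: None,       # semantic tier stubbed out
    semantic_store=lambda p, r: None,
    call_model=fake_model,
)
first = gw.handle("what is the refund policy\n")
second = gw.handle("what is the refund policy")  # served from the exact tier
policy["banned"] = "refund"                      # policy changes after caching
third = gw.handle("what is the refund policy")   # cached, but now blocked
print(len(calls), second, third)
```

The third call is the case the flow is designed for: the response is still in the cache, but the updated output policy blocks it.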
The threshold problem
A semantic cache has exactly one knob that matters and it is brutal: the similarity threshold. Set it too loose and the cache returns the wrong answer to a different question — the user asks about refunds and gets the response to a question about returns, because the embedding distance was close enough. Set it too tight and the hit rate collapses back toward exact match. Portkey’s 0.95 default is on the conservative side and is a reasonable starting point for general chat workloads, but it is not the answer for every workload, and “tune it” is not a strategy.
The right threshold depends on two things. First, the embedding model: similarity scores map to intent overlap differently from model to model, so a threshold calibrated for OpenAI’s text-embedding-3-small will not transfer to Sentence-Transformers MiniLM unchanged. Second, the workload: the cost of a false-positive hit varies. A chatbot returning an off-topic answer is annoying; a coding assistant returning the wrong API signature is a bug shipped at zero latency.
- Loose-tolerance workloads (FAQ, support triage, general chat): start at 0.93–0.95, sample hits, eyeball-check that served responses answer the new question, tighten if you see drift.
- Tight-tolerance workloads (code generation, structured extraction, anything where “close” is wrong): start at 0.97–0.99, or skip semantic mode in the serving path and use it only for non-critical sub-tasks.
- Calibrate per intent class, not globally. A good gateway lets you scope the threshold per route or namespace.
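One way to turn "sample hits and check them" into a repeatable procedure is a threshold sweep over labeled samples. The sketch below assumes you have logged pairs of (similarity score, whether the cached response actually answered the new question). The sample values, the route name, and the function names are made up for illustration.

```python
def sweep(samples, thresholds):
    """For each threshold, return (threshold, hit_rate, precision_of_hits)."""
    rows = []
    for t in thresholds:
        hits = [correct for sim, correct in samples if sim >= t]
        hit_rate = len(hits) / len(samples)
        precision = sum(hits) / len(hits) if hits else None
        rows.append((t, hit_rate, precision))
    return rows


def pick_threshold(samples, thresholds, min_precision):
    """Lowest candidate threshold whose served hits meet the precision target."""
    for t, _hit_rate, precision in sweep(samples, sorted(thresholds)):
        if precision is not None and precision >= min_precision:
            return t
    return None  # no candidate is safe enough: skip semantic mode for this route


# Hypothetical labeled samples for one route: (similarity, answer_was_correct)
faq_samples = [
    (0.99, True), (0.98, True), (0.97, True), (0.96, True),
    (0.95, False), (0.94, True), (0.93, False), (0.90, False),
]
candidates = [0.93, 0.95, 0.97]

for row in sweep(faq_samples, candidates):
    print(row)

# Thresholds are chosen per route, not globally.
per_route = {"faq": pick_threshold(faq_samples, candidates, min_precision=0.95)}
print(per_route)
```

Precision is not guaranteed to rise monotonically with the threshold on small samples, so it is worth reading the whole sweep instead of trusting the single picked value.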
The discipline is the same one we apply to LLM-as-a-judge calibration in CI ↗: the threshold is part of your test infrastructure, and an uncalibrated threshold produces silently wrong outputs that erode trust in the whole layer.
TTL, eviction, and the staleness trap
Cached LLM responses go stale for reasons that traditional caches do not have. The underlying knowledge changes (prices, policies, product names). The model version is updated by the provider. The system prompt is edited. A document in the RAG corpus is corrected. Any of these can invalidate cached responses while the cache continues serving them.
The TTL controls in the managed services are sensible defaults, not answers. Portkey’s max_age ranges from a 60-second minimum to a 90-day maximum, with a 7-day default. The right value is the shortest TTL at which the cache still earns its keep, set per route based on how fast the underlying answer can change. A “what is your refund window” response can sit for weeks. A “what is today’s pricing” response should not be cached at all.
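Per-route TTLs can be sketched as a small wrapper around the store. This is an illustration under assumptions: the route names and TTL values are hypothetical, a TTL of `None` means "never cache this route", unknown routes default to uncached, and the clock is injectable so expiry can be demonstrated without waiting.

```python
import time

DAY = 24 * 60 * 60  # seconds


class RouteTTLCache:
    def __init__(self, ttl_by_route, clock=time.monotonic):
        self.ttl_by_route = ttl_by_route  # route -> seconds, or None for "do not cache"
        self.clock = clock
        self.store = {}                   # (route, key) -> (response, expires_at)

    def put(self, route, key, response):
        """Store the response if the route is cacheable; return whether it was stored."""
        ttl = self.ttl_by_route.get(route)  # unknown routes are treated as uncacheable
        if ttl is None:
            return False
        self.store[(route, key)] = (response, self.clock() + ttl)
        return True

    def get(self, route, key):
        entry = self.store.get((route, key))
        if entry is None:
            return None
        response, expires_at = entry
        if self.clock() >= expires_at:      # expired: evict and report a miss
            del self.store[(route, key)]
            return None
        return response


now = [0.0]  # fake clock so the demo can jump forward in time
cache = RouteTTLCache({"refund_policy": 14 * DAY, "pricing": None},
                      clock=lambda: now[0])
cache.put("refund_policy", "k1", "cached refund answer")
cache.put("pricing", "k2", "cached pricing answer")
print(cache.get("refund_policy", "k1"))  # served from cache
print(cache.get("pricing", "k2"))        # None: this route is never cached
now[0] += 15 * DAY
print(cache.get("refund_policy", "k1"))  # None: entry expired
```

Defaulting unknown routes to "not cached" is a deliberate choice here: a new route has to opt in with an explicit TTL before it can serve stale answers.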
Cache invalidation hooks matter as much as TTL. Every change that can invalidate cached answers needs to bust the relevant cache keys: a system-prompt deploy, a RAG-corpus update, a model-version pin change (covered in our prompt versioning post ↗). Without those hooks, you are serving last week’s answer with this week’s prompt, and the staleness will not show up in any dashboard until a user complains.
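One common way to implement such hooks, sketched here as an assumption and not as any specific product's mechanism, is a versioned namespace: every cache key is prefixed with a digest of the system-prompt version, corpus version, and model pin, so bumping any of them makes all older entries unreachable. `CacheGeneration`, `VersionedCache`, and the version strings are hypothetical.

```python
import hashlib
import json
from dataclasses import astuple, dataclass, replace


@dataclass(frozen=True)
class CacheGeneration:
    system_prompt_version: str
    rag_corpus_version: str
    model_pin: str

    def namespace(self):
        """Short digest of everything that can change the correct answer."""
        raw = json.dumps(astuple(self))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


class VersionedCache:
    def __init__(self, generation):
        self.generation = generation
        self.store = {}  # (namespace, key) -> response

    def bump(self, **changes):
        """Call from deploy hooks: prompt deploy, corpus update, model pin change."""
        self.generation = replace(self.generation, **changes)

    def put(self, key, response):
        self.store[(self.generation.namespace(), key)] = response

    def get(self, key):
        return self.store.get((self.generation.namespace(), key))


cache = VersionedCache(CacheGeneration("prompt-v1", "corpus-2024-01", "model-a"))
cache.put("refund question", "answer under prompt-v1")
print(cache.get("refund question"))   # hit
cache.bump(system_prompt_version="prompt-v2")
print(cache.get("refund question"))   # None: old namespace is unreachable
```

Entries from the old namespace still occupy storage until TTL or eviction removes them; the namespace only guarantees they are never served.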
Provider-side prompt caching is not a substitute
A separate feature confuses the conversation: provider-side prompt caching. OpenAI’s prompt caching ↗ automatically caches prompt prefixes of 1,024 tokens or longer, with cached tokens billed at a steep discount (the docs cite up to ~90% input-token cost reduction and up to ~80% latency reduction) and no code changes required. Anthropic offers a similar mechanism with explicit cache control. These are prefix caches — they speed up the part of the request that is identical across calls (a long system prompt, a tool catalog, retrieved-context boilerplate), not the response.
The two layers are complementary. Provider prompt caching cuts the cost and latency of input processing when a long shared prefix is reused; semantic response caching skips the model call entirely when the intent is a repeat. Use both. Do not mistake provider prompt caching for a response cache — it does not eliminate the inference, just makes it cheaper.
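A back-of-the-envelope model shows how the two layers compose. All numbers below are hypothetical placeholders in arbitrary price units, not any provider's pricing, and the model ignores the cost of embeddings and of running the cache itself.

```python
def expected_cost_per_request(hit_rate, prompt_tokens, cached_prefix_tokens,
                              output_tokens, in_price, out_price, cached_discount):
    """Expected model spend per request with both cache layers in place.

    hit_rate:        share of requests served by the semantic response cache
    cached_discount: fractional discount on input tokens in the cached prefix
    """
    fresh_input = (prompt_tokens - cached_prefix_tokens) * in_price
    cached_input = cached_prefix_tokens * in_price * (1 - cached_discount)
    miss_cost = fresh_input + cached_input + output_tokens * out_price
    # A semantic hit skips the model call entirely, so only misses pay.
    return (1 - hit_rate) * miss_cost


common = dict(prompt_tokens=2000, output_tokens=300, in_price=1.0, out_price=4.0)

neither = expected_cost_per_request(0.0, cached_prefix_tokens=0, cached_discount=0.0, **common)
prefix_only = expected_cost_per_request(0.0, cached_prefix_tokens=1500, cached_discount=0.5, **common)
both = expected_cost_per_request(0.4, cached_prefix_tokens=1500, cached_discount=0.5, **common)
print(neither, prefix_only, both)
```

The prefix cache lowers the cost of each miss, while the semantic cache lowers the number of misses, so the savings multiply instead of overlapping.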
Failure modes the marketing omits
Four production traps to plan around:
- Personalized responses cached across users: if the response was tailored to user A (name, order history, account state) and you key by prompt embedding alone, user B’s similar prompt gets user A’s data. Always include user/tenant scope in the cache key, and treat any cross-tenant cache as a privacy incident in waiting.
- Cache poisoning via prompt injection: a malicious prompt that gets through guardrails and into the cache becomes a response served to every paraphrase of that prompt. Output guardrails on cached responses (not just fresh ones) are the mitigation.
- Hit-rate vanity: hit rate alone is not the metric; correct hit rate is. Sample served-from-cache responses and judge-grade them the same way you judge fresh ones.
- Quality regression at the threshold edge: cached responses sitting just over the similarity threshold are the most likely to be subtly wrong. Log the similarity score of every cache hit and alert on a rising share of hits clustered near the threshold.
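Two of these mitigations fit in one small sketch: partitioning the cache by tenant so a lookup can only ever see that tenant's entries, and logging hit similarities so the share of near-threshold hits can be alerted on. The class name, the `edge_margin` parameter, and the case-insensitive equality stand-in for embedding similarity are assumptions for illustration.

```python
from collections import defaultdict


class TenantScopedCache:
    def __init__(self, similarity, threshold=0.95, edge_margin=0.01):
        self.similarity = similarity
        self.threshold = threshold
        self.edge_margin = edge_margin      # hits below threshold + margin count as "edge"
        self.by_tenant = defaultdict(list)  # tenant_id -> [(prompt, response)]
        self.hit_scores = []                # similarity of every served hit

    def put(self, tenant_id, prompt, response):
        self.by_tenant[tenant_id].append((prompt, response))

    def get(self, tenant_id, prompt):
        # Only this tenant's partition is searched; other tenants are invisible.
        candidates = self.by_tenant.get(tenant_id, [])
        best = max(((self.similarity(prompt, p), r) for p, r in candidates),
                   key=lambda pair: pair[0], default=None)
        if best is None or best[0] < self.threshold:
            return None
        self.hit_scores.append(best[0])     # log the score for edge monitoring
        return best[1]

    def edge_share(self):
        """Fraction of served hits that landed just above the threshold."""
        if not self.hit_scores:
            return 0.0
        edge = sum(1 for s in self.hit_scores
                   if s < self.threshold + self.edge_margin)
        return edge / len(self.hit_scores)


def same_text(a, b):
    """Stand-in similarity: 1.0 for case-insensitive equality, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


cache = TenantScopedCache(same_text)
cache.put("tenant-a", "where is my order", "answer with tenant A's order details")
print(cache.get("tenant-b", "where is my order"))  # None: no cross-tenant reads
print(cache.get("tenant-a", "Where is my order"))  # tenant A's own cached answer
print(cache.edge_share())                          # 0.0: the only hit scored 1.0
```

A rising `edge_share` over time is the signal to resample hits and recheck the threshold for that route.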
Semantic caching is one of the highest-leverage layers you can add to an LLM serving stack — it pays for itself in cost and latency the day it ships — and one of the easiest to ship wrong, because the failure mode is fast confident incorrect answers rather than visible errors. Build it as a gateway layer, tune the threshold per workload, scope keys by tenant, invalidate on every change that can move the answer, and monitor cached responses with the same quality bar you apply to fresh ones. The cache is not a checkbox; it is a system, and like any production system it earns trust by being measurable.
Sources
- GPTCache — Zilliz ↗ — Open-source (MIT) semantic cache for LLM responses, with pluggable cache stores (SQLite, Postgres, Redis, MongoDB, Elasticsearch, etc.) and vector stores (Milvus, FAISS, pgvector, Qdrant, Chroma, Weaviate), and integrations with LangChain and llama_index.
- Redis LangCache ↗ — Fully-managed semantic-cache REST service that handles embedding, vector search, and eviction behind a single API endpoint, deployed as a Redis Cloud managed offering.
- Simple and Semantic Cache — Portkey ↗ — AI-gateway cache that combines exact-match first then semantic fallback, with a 0.95 default cosine similarity threshold, max_age from 60 seconds to 90 days (default 7 days), and a constraint of <8,191 tokens and ≤4 messages per request for semantic mode.
- Prompt Caching — OpenAI ↗ — Provider-side prefix caching that activates automatically for prompts ≥1,024 tokens, with steeply discounted cached tokens and no code changes required; complementary to, not a substitute for, response-level semantic caching.