Tag

#llm-serving

2 posts tagged llm-serving.

ops

Semantic Caching for LLM Serving: When the Cache Hit Is Not a String Match

Exact-match caching misses most potential LLM cache hits: paraphrases tank the hit rate. Semantic caching, threshold tuning, and the production failure modes that bite.
May 29, 2026
ops

Self-Hosted vs API LLMs: The Operational Tradeoffs

The self-host-versus-API decision is usually framed as a cost-per-token comparison. The real tradeoffs are operational — GPU memory math, who owns reliability, and the hidden engineering cost that the token spreadsheet ignores.
May 12, 2026