Tag
#llm-serving
2 posts tagged llm-serving.
- ops
Semantic Caching for LLM Serving: When the Cache Hit Is Not a String Match
Exact-match caching misses most potential LLM cache hits because paraphrased prompts tank the hit rate. Semantic caching, threshold tuning, and the production failure modes that bite.
- ops
Self-Hosted vs API LLMs: The Operational Tradeoffs
The self-host-versus-API decision is usually framed as a cost-per-token comparison. The real tradeoffs are operational: GPU memory math, who owns reliability, and the hidden engineering cost that the token spreadsheet ignores.