What this site is for
LLMOps Report covers ML observability and MLOps from inside production engineering. It is the kind of writing we wanted to find when we were debugging a model that worked in eval and broke in prod.
What we publish:
- Drift, the unsexy version. Concept drift, label drift, feature drift, training/serving skew. How to detect it in real systems, what thresholds actually catch problems, why most monitoring dashboards lie about it.
- Production failure writeups. The postmortems we wish vendors would publish, covering what happens when models go wrong in the real world: silently degraded predictions, retraining loops gone bad, embedding-store corruption, vector-DB consistency issues.
- Tooling reviews, honest. Arize, Fiddler, WhyLabs, Evidently, NannyML, Aporia, the open-source observability stack. Where each helps, where it solves problems you don’t have, what to install when you’re starting from zero.
- MLOps without the hype cycle. Feature stores, model registries, evaluation pipelines, online inference. What’s worth adopting, what’s reinventing things SREs solved a decade ago, what’s genuinely new.
What we don’t publish:
- Vendor-sponsored “thought leadership”
- “Top 10 MLOps tools” listicles
- Anything we couldn’t show running in production
Bylines are pseudonymous. Send tips and corrections to the editor.
Start with token cost observability in production, concept drift detection for LLM systems, or Arize vs Evidently for production observability.
Related
Best Vector Database for RAG: A Practical Comparison (2026)
Pinecone, Weaviate, Qdrant, pgvector, Chroma, Milvus — benchmarked on recall@k, p99 latency, filtered search, and cost at real production scale.
Semantic Caching for LLM Serving: When the Cache Hit Is Not a String Match
Exact-match caching misses most LLM cache hits — paraphrases tank hit rate. Semantic caching, threshold tuning, and the production failure modes that bite.
LLM Eval Pipelines in CI/CD: Gates That Actually Catch Things
Running LLM evals in CI is easy to set up and easy to get wrong. How to build quality gates and red-team gates that block bad prompts before they ship — and why a passing CI eval is not the same as a working production system.