LLMOps Report

Concept Drift Detection in Production: Practical Thresholds and Why Most Alerts Are Noise

How to actually detect concept drift in live systems, what thresholds matter, and why your monitoring dashboard is probably lying to you.

By Editorial · 7 min read

Concept drift is the thing that makes models silently wrong over time. Not explode-wrong. Not error-wrong. Just wrong enough that the business stops trusting predictions but doesn’t know why.

Most teams discover they have a drift problem the same way: someone looks at a dashboard three months after a model’s performance fell off a cliff and notices the line started curving down around the time some external event changed user behavior. In retrospect, obvious. At the time, no alert fired.

Here’s why that happens, and what to do differently.

The three-layer confusion

When people say “concept drift” they usually mean one of three different things:

Data drift — the distribution of your input features has shifted. This is the easiest to detect because you don’t need ground truth. Feature distributions are observable the moment a request hits your model.

Label drift — the marginal distribution of the thing you’re predicting has changed. If you’re predicting churn, and churn rate doubles because the economy tanked, label drift is happening. Your model didn’t get worse — the target itself moved.

Concept drift (properly) — the relationship between features and labels changed. The same feature vector now predicts a different outcome. This is the hardest to catch because you typically need delayed ground truth to know it’s happening.

Conflating these three leads to both false alerts (panicking over feature drift that doesn’t affect predictions) and missed alerts (assuming stable input distributions mean a stable model when the relationship between inputs and outputs has shifted).
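Of the three, data drift is the only one you can score the moment traffic arrives, with no labels at all. One common statistic is the Population Stability Index (PSI); here is a minimal pure-Python sketch, where the quantile binning scheme and the smoothing constant are illustrative choices rather than a standard:

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index between two samples of one feature.

    Bin edges come from the reference sample's quantiles, so the
    comparison stays stable as the live distribution moves.
    """
    ref_sorted = sorted(reference)
    edges = [ref_sorted[int(len(ref_sorted) * i / bins)] for i in range(1, bins)]

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(1 for e in edges if x >= e)  # which bin x falls into
            counts[idx] += 1
        # Smooth empty bins so the log term stays finite.
        return [(c + 0.5) / (len(sample) + 0.5 * bins) for c in counts]

    p_ref = proportions(reference)
    p_cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))
```

A common rule of thumb treats PSI below 0.1 as benign, 0.1–0.25 as worth watching, and above 0.25 as a real shift — but as the rest of this piece argues, calibrate against your own model rather than trusting generic cutoffs.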

What thresholds actually work

The honest answer is: it depends on your label latency.

For systems with fast label feedback (click predictions, fraud flags, any domain where you learn within hours whether you were right), you can monitor prediction accuracy directly and set thresholds on a rolling window. If accuracy drops 5% over 48 hours, that’s a real signal.
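With fast labels, the monitor really can be this simple. A sketch of a time-windowed accuracy alarm — the class name, the 48-hour window, and the 100-label minimum are all placeholder choices:

```python
from collections import deque

class RollingAccuracyMonitor:
    """Fires when windowed accuracy drops more than `max_drop`
    below a fixed baseline. The window is time-based (seconds)."""

    def __init__(self, baseline_accuracy, window_seconds=48 * 3600, max_drop=0.05):
        self.baseline = baseline_accuracy
        self.window = window_seconds
        self.max_drop = max_drop
        self.events = deque()  # (timestamp, was_correct)

    def record(self, timestamp, was_correct):
        self.events.append((timestamp, was_correct))
        # Evict labels that have fallen out of the window.
        while self.events and self.events[0][0] < timestamp - self.window:
            self.events.popleft()

    def should_alert(self):
        if len(self.events) < 100:  # too few labels to judge
            return False
        acc = sum(c for _, c in self.events) / len(self.events)
        return acc < self.baseline - self.max_drop
```

The eviction loop keeps memory bounded; in a real system you would likely feed this from a stream consumer rather than in-process calls.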

For systems with slow feedback (lifetime value prediction, medical risk scoring, anything measured in weeks or months), you're stuck monitoring proxies: per-feature input distributions, the distribution of the model's own output scores, and residuals on whatever labels have trickled in so far.

The worst thing you can do is set a single threshold for all features. Most features drift constantly for benign reasons — seasonality, product changes, user cohort shifts. Alerting on all feature drift will teach your on-call rotation to ignore monitoring.

Why most monitoring dashboards lie

Dashboards built around statistical tests (KS test, chi-squared, MMD) are detecting whether distributions are statistically different, which is almost always yes when you have enough data. The question you actually care about is whether the drift is big enough to affect prediction quality.

Statistical significance without effect size is noise.
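You can see the problem with a back-of-the-envelope experiment: at production sample sizes, a shift far too small to matter operationally still clears the KS test's significance bar. A self-contained sketch in pure Python, using the standard asymptotic critical value for α = 0.05:

```python
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def significant_at_05(d, n, m):
    """Asymptotic two-sample KS test at alpha = 0.05."""
    critical = 1.358 * math.sqrt((n + m) / (n * m))
    return d > critical

n = 100_000
reference = [i / n for i in range(n)]
live = [x + 0.01 for x in reference]  # a 1% shift: operationally trivial

d = ks_statistic(reference, live)
# At this sample size, a 1% shift is "statistically significant"...
assert significant_at_05(d, n, n)
# ...while the effect size itself is tiny.
assert d < 0.02
```

The p-value told you nothing your traffic volume didn't already guarantee; the statistic `d` is the number worth thresholding.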

A better pattern: gate feature drift alerts through a downstream impact estimate. If you have an approximate mapping from feature shift magnitude to prediction error (build this during validation), use it as a filter. Only fire an alert when the estimated impact on downstream accuracy crosses a threshold you actually care about.
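A minimal sketch of that gate, assuming you've already estimated per-feature sensitivities during validation — the feature names and every number below are illustrative:

```python
# Per-feature sensitivity estimated during validation by perturbing
# each feature and measuring the drop in holdout accuracy.
# All values here are hypothetical.
SENSITIVITY = {                    # accuracy points lost per unit of PSI
    "days_since_signup": 0.002,    # model barely uses this feature
    "txn_amount_zscore": 0.08,     # model leans on this one heavily
}

IMPACT_THRESHOLD = 0.01  # only alert above ~1 point of estimated accuracy loss

def estimated_impact(psi_by_feature):
    """Map observed per-feature drift to an estimated accuracy cost."""
    return sum(SENSITIVITY.get(f, 0.0) * p for f, p in psi_by_feature.items())

def should_alert(psi_by_feature):
    return estimated_impact(psi_by_feature) > IMPACT_THRESHOLD
```

Large drift on a feature the model ignores stays quiet; modest drift on a load-bearing feature fires. That asymmetry is the whole point of the filter.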

This requires knowing your model well enough to build that mapping, which is work most teams skip. That’s why dashboards lie.

Practical approaches that actually catch things

Monitor your residuals, not just your inputs. For regression tasks, watch the distribution of your predictions as soon as they're made, then the error distribution as ground truth trickles in; a sudden shift in either can indicate drift before your headline accuracy metric catches up. If your model was systematically underpredicting high-value users last week and is now overpredicting them, something changed.
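As a sketch, the cheapest version of residual monitoring compares the mean residual across two windows and watches for a sign flip — the function name and the two-window framing are assumptions, not a standard:

```python
def residual_shift(last_week, this_week):
    """Compare mean residual (actual - predicted) across two windows.

    A sign flip, or a large jump in the mean, suggests the
    feature-to-label relationship moved, not just the inputs.
    """
    mean_then = sum(last_week) / len(last_week)
    mean_now = sum(this_week) / len(this_week)
    return {
        "mean_then": mean_then,
        "mean_now": mean_now,
        "sign_flipped": (mean_then > 0) != (mean_now > 0),
    }
```

Underprediction shows up as positive residuals; if the same segment's residuals go negative a week later, that's the signal the paragraph above describes.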

Build drift into your retraining trigger, not your alert trigger. The question isn’t “did something change?” — something always changed. The question is “has enough changed that retraining would improve performance?” Frame your threshold around expected performance gain from retraining.
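Framed that way, the trigger reduces to a comparison: train a candidate on recent data, score both models on a recent labeled holdout, and retrain only if the gain clears a bar. A sketch with your pipeline's train and eval steps abstracted as callables — all names and the 0.02 default are placeholders:

```python
def should_retrain(prod_model, train_fn, eval_fn, recent_data, holdout, min_gain=0.02):
    """Retrain only when a candidate trained on recent data beats
    production on a recent holdout by at least `min_gain`.

    `train_fn` and `eval_fn` stand in for your pipeline; `eval_fn`
    returns a higher-is-better score such as accuracy or AUC.
    """
    candidate = train_fn(recent_data)
    gain = eval_fn(candidate, holdout) - eval_fn(prod_model, holdout)
    return gain >= min_gain
```

Note that "something drifted" never appears in this function: the only question it asks is whether retraining is expected to pay for itself.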

Slice by cohort. Drift is rarely uniform. It hits specific segments first — new user cohorts, mobile vs. web, specific geographies. Aggregate monitoring hides segment-level problems. If you’re not monitoring by cohort, you’ll catch drift late.
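A sketch of cohort-sliced drift scoring — the cohort keys, record shape, and threshold are illustrative, and any two-sample statistic (PSI, KS) can plug in as `drift_score`:

```python
from collections import defaultdict

def drift_by_cohort(records, drift_score, threshold=0.25):
    """Score drift per cohort against each cohort's own reference window.

    `records` is an iterable of (cohort, window, value), where window is
    "ref" or "live"; `drift_score` is any two-sample statistic.
    """
    buckets = defaultdict(lambda: {"ref": [], "live": []})
    for cohort, window, value in records:
        buckets[cohort][window].append(value)
    flagged = {}
    for cohort, w in buckets.items():
        if w["ref"] and w["live"]:
            flagged[cohort] = drift_score(w["ref"], w["live"]) > threshold
    return flagged
```

The key design choice is that each cohort is compared to its *own* reference window, so a stable aggregate can't mask a cohort that moved.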

Track when you don’t see drift. A model serving 10M predictions per day with no feature drift for two weeks in a row might mean your feature pipeline is broken. Total silence is as suspicious as constant alerts.
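Even this failure mode is cheap to instrument. A sketch that flags a suspiciously flat drift history — the 14-window length and the epsilon are placeholder choices:

```python
def suspicious_silence(drift_history, eps=1e-6, windows=14):
    """Flag when every feature's drift score has been ~zero for
    `windows` consecutive checks. At real traffic volumes that
    usually means a frozen feature pipeline, not genuine stability.

    `drift_history` maps feature name -> list of recent drift
    scores, most recent last.
    """
    return all(
        len(scores) >= windows and all(s < eps for s in scores[-windows:])
        for scores in drift_history.values()
    )
```

A single feature showing any movement is enough to clear the flag, which is the right bias: you're hunting for total, implausible stillness.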

What to actually instrument

At minimum: distribution statistics for the features your model leans on most, the distribution of the model's own outputs, rolling accuracy wherever labels arrive quickly, and cohort-level slices of each.

Don’t instrument everything on day one. Instrument the features your model most relies on, validated against actual prediction errors from past incidents. Add more as you learn what actually mattered.

The goal isn’t a dashboard full of green checks. It’s knowing, within a day, when your model’s relationship to the world has shifted enough to matter.

That bar is lower than teams think. You don’t need a sophisticated ML monitoring platform to meet it. You need clear definitions of what “drift” means in your context, thresholds tied to business outcomes, and the discipline to not add an alert every time a feature moves.

#drift #monitoring #observability #production