LLM Eval Pipelines in CI/CD: Gates That Actually Catch Things
Running LLM evals in CI is easy to set up and easy to get wrong. How to build quality gates and red-team gates that block bad prompts before they ship — and why a passing CI eval is not the same as a working production system.
The first LLM eval that runs in CI is a milestone and a trap. The milestone: you can now block a prompt change that tanks quality before it reaches users. The trap: it is dangerously easy to build a green CI gate that proves almost nothing — a handful of cherry-picked examples, an exact-string assertion that breaks on a synonym, a judge prompt no one calibrated. A passing eval suite that does not correlate with production behavior is worse than no suite, because it manufactures confidence.
This article covers how to build CI gates that earn their green checkmark, and how to stay clear-eyed about the gap between “passed CI” and “works in production.”
Two gates, two jobs
Production LLM CI typically wants two distinct gates, and conflating them weakens both.
The quality gate blocks deploys when output quality drops below a threshold — hallucination rate too high, answer relevance too low, a regression on a known-good case. DeepEval ↗ is built for this: it is Python-first, integrates with pytest, ships a large catalog of metrics (faithfulness, answer relevancy, hallucination, bias, and task-specific scorers), and is designed to fail a build when a metric crosses a threshold. If your codebase is Python and your concern is “did this change make answers worse,” this is the natural fit.
The red-team gate blocks deploys when the change opens a security or safety hole: a prompt that is newly injectable, a system prompt that now leaks, a path that emits PII. Promptfoo ↗ is strong here: it is CLI-first, YAML-configured, runs in GitHub Actions, GitLab CI, Jenkins, and Azure Pipelines, generates adversarial test cases, and, per its CI/CD docs, fails the build through standard exit codes when assertions or thresholds are not met, with JUnit XML output for CI reporting. The mature pattern many teams converge on is to run both: promptfoo as the adversarial/security gate, DeepEval as the metric gate. They are complementary, not competing.
The reason to keep them separate is that they fail for different reasons and want different owners. A quality regression is a product/ML concern; a red-team failure is a security concern mapped to the OWASP Top 10 for LLM Applications ↗, whose 2025 edition leads with LLM01:2025 Prompt Injection — still the hardest class to fully mitigate. Wiring both into CI puts both concerns in front of the change that caused them.
What makes a gate actually catch things
The tooling is the easy part. These are the design decisions that separate a gate that catches regressions from one that rubber-stamps them.
Use semantic assertions, not exact-match. response == expected_string fails the moment the model rephrases a correct answer, so teams loosen it until it asserts nothing. Open-ended outputs need semantic scoring — embedding similarity, LLM-as-a-judge against a rubric, or task-specific checks (did it return valid JSON, did it cite a real source, did it stay on topic). The discipline that replaces unit tests for LLMs is covered in our LLMOps best-practices guide ↗; the CI-specific point is that your assertions must tolerate valid variation while still catching real regressions.
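As a rough illustration of the task-specific checks mentioned above, here is a minimal sketch that validates structure and citations instead of comparing strings. The response shape (a JSON object with "answer" and "citations" fields) and the function name are assumptions made for this example, not part of any particular framework.

```python
import json


def check_response(raw: str, known_sources: set[str]) -> list[str]:
    """Return a list of failure reasons; an empty list means the response passes.

    Assumes (hypothetically) that the model is asked to reply with a JSON
    object of the form {"answer": str, "citations": [str, ...]}.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    if not isinstance(payload, dict):
        return ["response is not a JSON object"]

    failures = []

    # The wording of the answer is free to vary; it only has to be present.
    answer = payload.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        failures.append("missing or empty 'answer' field")

    # Every citation must point at a source we actually have.
    citations = payload.get("citations")
    if not isinstance(citations, list) or not citations:
        failures.append("no citations provided")
    else:
        unknown = [
            c for c in citations
            if not isinstance(c, str) or c not in known_sources
        ]
        if unknown:
            failures.append(f"cites unknown sources: {unknown}")

    return failures
```

Two differently worded but correct answers both pass this check, while a response that cites a source outside the known set fails with a readable reason. Open-ended quality still needs a similarity or judge-based score layered on top; this only covers the structural part.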
Calibrate the judge before you trust it. If your gate uses LLM-as-a-judge, the judge is now part of your test infrastructure, and an uncalibrated judge produces noise that erodes trust in the whole suite. Calibration means checking the judge's verdicts against a small human-labeled set before you let it block a build. Two practices help: use a different model family for the judge than the one under test (a same-family judge tends to agree with its own family’s mistakes), and use coarse buckets (a 4-point excellent/acceptable/poor/unsafe scale) rather than a fine 1–10 score, because coarse buckets tend to be more repeatable across runs and, in practice, to track downstream outcomes better. This is the same calibration discipline we recommend for production quality monitoring ↗; the judge in CI and the judge in prod should be calibrated the same way.
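One way to put a number on judge calibration is to compare the judge's bucket labels with human labels on the same cases. The sketch below uses Cohen's kappa, which is a choice made for this example and is not prescribed by the article; any agreement statistic your team understands would serve. The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences.

    1.0 means perfect agreement; 0.0 means no better than chance.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(human)
    # Fraction of cases where judge and human picked the same bucket.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected if both labelers assigned buckets independently
    # at their own observed frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(h_counts) | set(j_counts)
    expected = sum(h_counts[b] * j_counts[b] for b in labels) / (n * n)

    if expected == 1.0:
        # Both labelers used a single identical bucket throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A typical use is to label a few dozen cases by hand with the same excellent/acceptable/poor/unsafe buckets, run the judge on them, and refuse to wire the judge into the gate until the agreement clears a bar the team has chosen. What bar is acceptable is a judgment call and depends on how costly a wrong verdict is.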
Curate a golden dataset that includes the scars. The most valuable test cases are the ones that already broke in production. Every incident should leave behind a regression case in the golden set. A suite assembled this way grows teeth over time; one assembled from synthetic happy-path examples stays toothless.
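A golden set is easier to grow if each incident maps onto one small, serializable record. The following is a minimal sketch under that assumption; the GoldenCase fields, the JSONL storage format, and the substring-based checks are all illustrative choices, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class GoldenCase:
    """One regression case, ideally traceable to a real incident."""
    case_id: str
    prompt: str
    must_include: list[str] = field(default_factory=list)      # phrases a good answer contains
    must_not_include: list[str] = field(default_factory=list)  # phrases that signal the old failure
    source_incident: str = ""                                   # free-form pointer to the incident


def to_jsonl_line(case: GoldenCase) -> str:
    """Serialize a case as one line of a JSONL golden-set file."""
    return json.dumps(asdict(case), sort_keys=True)


def from_jsonl_line(line: str) -> GoldenCase:
    """Rebuild a case from one JSONL line."""
    return GoldenCase(**json.loads(line))


def passes(case: GoldenCase, output: str) -> bool:
    """Case-insensitive property check of a model output against the case."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_include)
    has_forbidden = any(s.lower() in text for s in case.must_not_include)
    return has_required and not has_forbidden
```

Substring checks like these suit failures with a crisp signature, such as a wrong figure or a leaked phrase. For incidents where the failure was one of tone or relevance, the same record could carry a rubric for a judge instead.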
Set thresholds from observed variance, not a guess. Because judge scores and metrics are noisy, a threshold picked arbitrarily either flaps (fails on noise) or never fires. Run the suite a few times against a stable baseline, observe the variance, and set the gate just outside it. Re-tune after any change to the judge or the metric.
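The procedure above can be sketched in a few lines: score a stable baseline several times, then place the gate a chosen number of standard deviations below the mean. The multiplier k = 3 and the function name are assumptions for illustration; the right margin depends on how much flapping you can tolerate.

```python
import statistics


def gate_threshold(baseline_scores: list[float], k: float = 3.0) -> float:
    """Return a fail-below threshold set k standard deviations under the baseline mean.

    baseline_scores: suite-level scores from repeated runs against an
    unchanged prompt and model. k is an assumed safety margin.
    """
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to estimate variance")
    mean = statistics.mean(baseline_scores)
    spread = statistics.stdev(baseline_scores)  # sample standard deviation
    return mean - k * spread


# Example: five runs of the same suite against an unchanged baseline.
baseline = [0.80, 0.82, 0.78, 0.81, 0.79]
threshold = gate_threshold(baseline)
```

With only a handful of runs the standard deviation estimate is itself noisy, so treat the first threshold as provisional and recompute it as more baseline runs accumulate.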
The non-determinism problem
LLM outputs can vary between runs even at temperature 0 (batching and floating-point effects on the serving side are enough to change a token), which collides with CI’s expectation of deterministic pass/fail. Three mitigations, in order of preference:
- Assert on properties, not exact text. “Contains a valid citation,” “is under 200 tokens,” “does not mention a competitor,” “parses as the required schema” — these are stable across runs in a way that exact output is not.
- Aggregate over a sample. Run each case several times and assert on the rate (e.g., “passes the safety check in 100% of 5 runs,” “relevance score averages above threshold”). A single run of a non-deterministic system is a coin flip; the rate is the signal.
- Pin the model version. Pin to a date-stamped model snapshot in CI so the model is not silently changing under your gate — the same provider-drift discipline we cover in silent quality decay ↗. An eval that passed yesterday and fails today with no code change is often the provider, not you.
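The "aggregate over a sample" mitigation can be sketched as a small helper that runs one case several times and reports the pass rate. The generate and check callables are placeholders for your own model call and property check, and the stub model exists only so the example runs without a provider.

```python
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    runs: int = 5,
) -> float:
    """Run one case `runs` times and return the fraction of outputs that pass `check`."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    passed = sum(1 for _ in range(runs) if check(generate(prompt)))
    return passed / runs


def make_stub_model(outputs: list[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: replays canned outputs in order."""
    state = {"i": 0}

    def generate(prompt: str) -> str:
        out = outputs[state["i"] % len(outputs)]
        state["i"] += 1
        return out

    return generate
```

In a gate, a safety property would typically be asserted as `pass_rate(...) == 1.0`, while a quality property might tolerate a lower minimum rate derived from baseline runs. The number of runs trades cost against confidence, so five is only a starting point.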
Where the gate runs
Match the eval depth to the stage, because the full suite is too slow and expensive to block every commit:
- On every PR: a fast subset — the golden regression cases plus a small red-team set. Seconds to a couple of minutes. This is the gate that blocks merge.
- On staging promotion: the full quality and red-team suite against representative traffic. Minutes is acceptable here; this is the gate that blocks production.
- Continuously in production: sampled judge-graded evaluation on real traffic. This is not CI, and it is the part teams skip. See below.
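For the two CI stages above, one simple arrangement is to tag each case with a suite name and let the pipeline stage decide which suites run. The suite names, the stage names, and the case format below are assumptions made for this sketch; production sampling is left out because it runs outside CI.

```python
# Which suites run at which CI stage (names are illustrative).
SUITES_BY_STAGE = {
    "pr": {"golden", "redteam-smoke"},
    "staging": {"golden", "redteam-smoke", "quality-full", "redteam-full"},
}


def select_cases(cases: list[dict], stage: str) -> list[dict]:
    """Return the cases whose 'suite' tag is enabled for the given stage."""
    try:
        wanted = SUITES_BY_STAGE[stage]
    except KeyError:
        raise ValueError(f"unknown stage: {stage!r}") from None
    return [case for case in cases if case["suite"] in wanted]


# Example case list; in practice this would be loaded from the golden set.
ALL_CASES = [
    {"id": "refund-window", "suite": "golden"},
    {"id": "ignore-previous-instructions", "suite": "redteam-smoke"},
    {"id": "long-context-faithfulness", "suite": "quality-full"},
    {"id": "pii-extraction-sweep", "suite": "redteam-full"},
]
```

Keeping the mapping in one place makes it easy to promote a case from the staging-only suites into the PR subset once it has caught a real regression.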
The gap CI cannot close
Here is the limitation to internalize: a CI eval tests a frozen distribution. It runs your curated cases against the current model and prompt. It cannot tell you that production traffic has drifted away from your test set, that users are now asking questions your golden dataset never imagined, or that the provider quietly changed the model after your last green build. The eval suite becomes a museum piece within a couple of months if it is your only quality signal.
CI evals catch regressions you can anticipate. They do not catch drift you did not anticipate. The complete picture pairs the CI gate with continuous production sampling: a separate judge-graded eval running on live traffic, tracking the actual production distribution rather than the curated one. That production-monitoring system, and why it must be distinct from CI, is the subject of our piece on silent quality decay ↗. Build the CI gate first; it is high-leverage and cheap. Just do not mistake its green check for proof the system works; it proves only that the things you thought to test still pass.
For broader tooling context, sentryml.com ↗ covers eval and observability integration across the LLMOps stack, and mlobserve.com ↗ surveys the eval-framework landscape as it stands in 2026.
Sources
- CI/CD integration — promptfoo ↗ — Running evals and red-team scans in CI across GitHub Actions, GitLab, Jenkins, and Azure, with build-failing assertions and JUnit/JSON output.
- DeepEval — Confident AI ↗ — Pytest-integrated metric gates (faithfulness, relevancy, hallucination, bias) for failing builds on quality thresholds.
- OWASP Top 10 for LLM Applications (2025) ↗ — The 2025 risk list led by LLM01:2025 Prompt Injection, the basis for red-team gate coverage.