
Evals are not optional

4 min read · EVALS · ENGINEERING

Every AI system we ship has a test suite. Not because it's a best practice, but because without one, every prompt change becomes a coin flip and every model upgrade becomes a panic.

The minimum bar

For any system that goes to production, we want at least:

  • A regression set of 50–200 real inputs with expected behaviour.
  • An automated grader — sometimes a model-as-judge, sometimes a deterministic check, often both.
  • A CI gate that blocks deploys if quality drops below a floor.
  • A hallucination check specific to the domain.

It sounds like a lot. In practice it's a day or two of setup that pays for itself the first time someone changes a prompt at 4pm on a Friday. A minimal version of the gate is sketched below.
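As a concrete illustration, here is roughly what the CI gate can look like. This is a sketch, not our harness: the dataset path, the 0.90 floor, and the `run_system` stub are all placeholders, and the grader shown is the deterministic kind rather than a model-as-judge.

```python
# ci_eval_gate.py -- minimal regression gate; exits non-zero so CI blocks the deploy.
# Illustrative sketch: dataset path, floor, and run_system are placeholders.
import json
import sys

QUALITY_FLOOR = 0.90  # deploys are blocked below this pass rate


def run_system(prompt: str) -> str:
    """Stub for the system under test -- wire this to your own pipeline."""
    raise NotImplementedError


def grade(expected: str, actual: str) -> bool:
    """Deterministic check: exact match after normalisation.
    Real graders are usually richer (rule checks, model-as-judge, or both)."""
    return expected.strip().lower() == actual.strip().lower()


def main() -> None:
    with open("regression_set.jsonl") as f:  # the 50-200 real inputs
        cases = [json.loads(line) for line in f]

    passed = sum(
        grade(case["expected"], run_system(case["input"])) for case in cases
    )
    pass_rate = passed / len(cases)
    print(f"pass rate: {pass_rate:.2%} over {len(cases)} cases")

    if pass_rate < QUALITY_FLOOR:
        sys.exit(1)  # CI sees a failure; the deploy is blocked


if __name__ == "__main__":
    main()
```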

Model-as-judge, with discipline

LLM-graded evals are useful, but lazy implementations drift. The LLM-as-a-judge paper from Zheng et al. laid out most of the known failure modes; we follow two rules on top:

  1. The judge model is pinned, even when the system model upgrades. Otherwise your "quality went up" is just two models agreeing with each other more.
  2. Every judge prompt has a calibration set — a small batch where humans graded the same outputs, so you know the judge's bias.

A model asked to grade its own outputs tends to grade them generously. Use a different model, or better yet, a different model family.
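A minimal sketch of both rules follows. Every name here is illustrative: `JUDGE_MODEL`, `judge_score`, and the calibration record shape stand in for whatever provider and schema you actually use; the point is that the judge model string is pinned and the bias is measured, not assumed.

```python
# judge.py -- pinned judge plus calibration against human grades.
# Illustrative sketch: model name, client call, and record shape are placeholders.
JUDGE_MODEL = "judge-model-2024-06"  # pinned: never follows the system model's upgrades


def judge_score(prompt: str, output: str) -> float:
    """Ask the pinned judge model to grade an output on [0, 1].
    Stub: wire this to your provider, keeping JUDGE_MODEL fixed and choosing
    a model family different from the system under test."""
    raise NotImplementedError


def judge_bias(calibration: list[dict]) -> float:
    """Mean (judge - human) score over a human-graded calibration batch.
    Positive means the judge is more lenient than your human graders."""
    deltas = [
        judge_score(c["prompt"], c["output"]) - c["human_score"]
        for c in calibration
    ]
    return sum(deltas) / len(deltas)
```

Re-run `judge_bias` whenever the judge prompt changes; a shifted bias means your historical scores are no longer comparable.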

What evals don't cover

Evals catch regressions on known inputs. They don't catch:

  • New input distributions you haven't seen.
  • User behaviour that emerges from the interface.
  • Compounding errors in multi-step agents.

For those, you need production observability — sampled traces, outcome tagging, and a weekly review where someone actually reads the logs. Tools like LangSmith, Braintrust, and Arize Phoenix reduce the plumbing work considerably, but none of them remove the human-reads-the-logs step. There is no substitute for that.
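The sampling half of that is cheap to build yourself. A sketch, assuming a hypothetical `Trace` record and a sample size of 100: reservoir sampling keeps the weekly batch uniform over the whole week's traffic rather than biased toward the newest logs.

```python
# trace_sampler.py -- reservoir-sample production traces for the weekly review.
# Illustrative sketch: the Trace shape and sample size are placeholders.
import random
from dataclasses import dataclass, field


@dataclass
class Trace:
    input: str
    output: str
    tags: list[str] = field(default_factory=list)  # outcome tags, e.g. "user_retried"


class ReservoirSampler:
    """Keeps a uniform random sample of size k from an unbounded trace stream."""

    def __init__(self, k: int = 100):
        self.k = k
        self.seen = 0
        self.sample: list[Trace] = []

    def offer(self, trace: Trace) -> None:
        self.seen += 1
        if len(self.sample) < self.k:
            self.sample.append(trace)
        else:
            j = random.randrange(self.seen)  # replace with probability k/seen
            if j < self.k:
                self.sample[j] = trace
```

The part no tool gives you is the review itself: someone reads the sampled traces, tags the outcomes, and feeds what they find back into the regression set.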

The team that ships fast forever is the team that can prove yesterday's quality still holds today. Everyone else is guessing in production.

