Evals are the new tests

Software engineers know how to write tests: fix the input, assert on the output, fail loudly if the two do not match. AI breaks this model in two ways. First, the output is not deterministic: the same input can produce different text from run to run. Second, "correct" is rarely a single string. It is a distribution of acceptable answers, and in a clinical setting the judge of what is acceptable is a clinician. Exact-match assertions cannot express either property. Eval frameworks can.
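To make the contrast concrete, here is a minimal sketch in Python. The names (exact_match_test, pass_rate, mentions_required_findings) and the sample notes are invented for illustration, and the keyword check is only a stand-in for a clinician-derived rubric, not a real acceptance criterion.

```python
from typing import Callable, Sequence


def exact_match_test(output: str, expected: str) -> bool:
    """Classical assertion: passes only if the strings are identical."""
    return output == expected


def pass_rate(outputs: Sequence[str], is_acceptable: Callable[[str], bool]) -> float:
    """Eval-style check: fraction of sampled outputs judged acceptable."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    return sum(1 for o in outputs if is_acceptable(o)) / len(outputs)


def mentions_required_findings(note: str) -> bool:
    """Hypothetical rubric stand-in. A real rubric would come from clinicians."""
    text = note.lower()
    return "chest pain" in text and "ecg" in text


# Three made-up samples for the same encounter. Two are worded differently
# but would both satisfy the toy rubric; the third misses the findings.
samples = [
    "Patient reports chest pain. ECG ordered.",
    "ECG ordered after patient described chest pain on exertion.",
    "Patient reports discomfort. Follow-up in two weeks.",
]

rate = pass_rate(samples, mentions_required_findings)
print(f"acceptable: {rate:.0%}")
print("exact match between two acceptable notes:",
      exact_match_test(samples[0], samples[1]))
```

The point of the sketch is the shape of the check: a rate over many samples, compared to a threshold, rather than a single equality assertion.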

What an eval suite actually looks like

For a clinical AI feature (say, a note generator from a recorded encounter), our eval suite has three layers running in parallel:

Offline regression: a frozen set of ~500 anonymised encounters with clinician-validated reference notes. Every model change is scored against this set before it can be promoted.
Live shadow eval: a fraction of production traffic is re-scored asynchronously against the previous model. This quietly catches drift before users see it.
Clinician spot-check: a sampled set of live outputs is sent to a clinician reviewer panel weekly. This layer is the bottleneck and the most expensive of the three, but it is irreplaceable.
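The first two layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our production code: the Encounter fields, the tolerance value, the token-overlap scorer, and the hash-based sampling rule are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class Encounter:
    """One item in the frozen regression set (field names are assumptions)."""
    encounter_id: str
    transcript: str
    reference_note: str


def token_overlap(candidate: str, reference: str) -> float:
    """Toy scorer (Jaccard overlap of tokens). A stand-in for a real metric."""
    a, b = set(candidate.lower().split()), set(reference.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0


def suite_score(
    encounters: Sequence[Encounter],
    generate: Callable[[str], str],
    score: Callable[[str, str], float] = token_overlap,
) -> float:
    """Offline regression: mean score of a model over the frozen set."""
    return mean([score(generate(e.transcript), e.reference_note) for e in encounters])


def can_promote(candidate_score: float, baseline_score: float,
                tolerance: float = 0.01) -> bool:
    """Gate: block promotion if the candidate regresses beyond the tolerance."""
    return candidate_score >= baseline_score - tolerance


def in_shadow_sample(encounter_id: str, fraction: float) -> bool:
    """Live shadow eval: deterministically pick a fraction of traffic.

    Hashing the ID means the same encounter is always in or out of the
    sample, which keeps asynchronous re-scoring reproducible.
    """
    digest = hashlib.sha256(encounter_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < fraction
```

A promotion script would call suite_score once for the candidate and once for the current model, then refuse to ship unless can_promote returns True. The clinician spot-check is deliberately left out: it is a human process, and the only code it needs is a sampler like in_shadow_sample.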

Failure modes you will see

Silent precision drops when the upstream model is updated by your vendor.
Reference rot: your "validated" gold set ages out as clinical guidelines evolve.
Eval-overfit: the model improves on the suite without improving in practice. Worst case, it gets worse for users while the metric goes up.
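Each of these failure modes can be turned into a cheap automated check. The sketch below is one possible shape, with hypothetical names and thresholds (max_drop, min_gap, validated_on) that you would need to calibrate for your own suite.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence


@dataclass(frozen=True)
class ReferenceNote:
    """Metadata for one gold-set item (field names are assumptions)."""
    encounter_id: str
    validated_on: date


def precision_dropped(current: float, baseline: float, max_drop: float = 0.02) -> bool:
    """Silent drop: alert when precision falls more than max_drop below baseline."""
    return (baseline - current) > max_drop


def stale_references(refs: Sequence[ReferenceNote],
                     guideline_revised_on: date) -> List[str]:
    """Reference rot: IDs validated before the relevant guideline was revised."""
    return [r.encounter_id for r in refs if r.validated_on < guideline_revised_on]


def looks_overfit(suite_delta: float, clinician_delta: float,
                  min_gap: float = 0.05) -> bool:
    """Eval-overfit: the suite score rose, but clinician ratings did not follow.

    Both deltas are (new model - old model) on their own scale, so the gap
    is only meaningful if the two scales are comparable.
    """
    return suite_delta > 0 and (suite_delta - clinician_delta) > min_gap
```

The overfit check is the reason the clinician spot-check cannot be dropped: it is the only signal in the system that is independent of the suite the model is being tuned against.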

The discipline

Treat your eval suite as a first-class deliverable. Version it like code. Pay clinicians for their review time. Publish the methodology to your customers so they can audit how you measure quality. None of this is cheap, and all of it is non-optional if you ship in a clinical setting.
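One way to make "version it like code" concrete is to fingerprint the frozen set and store that fingerprint next to every score. The sketch below assumes each encounter is a plain dict with an encounter_id key; the function names and record layout are hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def suite_fingerprint(encounters: Sequence[Dict[str, str]]) -> str:
    """Content hash of the eval set, independent of item order and key order."""
    canonical = json.dumps(
        sorted(encounters, key=lambda e: e["encounter_id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def score_record(model_id: str, score: float,
                 encounters: Sequence[Dict[str, str]]) -> Dict[str, object]:
    """Tie a score to the exact suite version that produced it."""
    return {
        "model_id": model_id,
        "score": score,
        "suite_sha256": suite_fingerprint(encounters),
        "suite_size": len(encounters),
    }


suite: List[Dict[str, str]] = [
    {"encounter_id": "enc-2", "reference_note": "Follow-up in two weeks."},
    {"encounter_id": "enc-1", "reference_note": "ECG ordered."},
]
print(score_record("candidate-model", 0.87, suite))
```

With this in place, two scores are comparable only if their suite_sha256 values match, which makes any edit to a reference note visible in the audit trail.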

Tags: Evals, Clinical AI, Testing

Written by

Zowork Engineering

Engineering team

Zowork is a healthcare and behavioral health AI engineering team. For a decade we’ve shipped clinical platforms. Now we’re building the AI that runs underneath them.
