Software engineers know how to write tests. Feed in a known input, compare the output to the expected value, fail loudly if they do not match. AI breaks this model in two ways. First, the output is not deterministic: the same input can produce different text on each run. Second, "correct" is rarely a single string. It is a distribution of acceptable answers, judged by a clinician. Exact-match assertions cannot express that. Eval frameworks can.
What an eval suite actually looks like
For a clinical AI feature (say, a note generator from a recorded encounter), our eval suite has three layers running in parallel:
- Offline regression: a frozen set of ~500 anonymised encounters with clinician-validated reference notes. Every model change is scored against this set before it can be promoted.
- Live shadow eval: a fraction of production traffic is re-scored asynchronously and compared against the previous model's results. This quietly catches drift before users see it.
- Clinician spot-check: a sample of live outputs is sent to a clinician reviewer panel every week. This is the slowest and most expensive layer, but it is irreplaceable.
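To make the offline regression layer concrete, here is a minimal sketch of a promotion gate. Everything named in it is an assumption for illustration: `EvalCase`, `token_f1`, `regression_gate`, and the `tolerance` default are hypothetical, and the token-overlap scorer is only a stand-in for whatever clinically meaningful metric a real suite would use.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One frozen gold-set entry (hypothetical shape)."""
    encounter_id: str
    reference_note: str


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1. A crude placeholder, not a clinical quality metric."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def regression_gate(
    cases: Sequence[EvalCase],
    generate: Callable[[str], str],
    baseline_score: float,
    tolerance: float = 0.01,
) -> tuple[bool, float]:
    """Score a candidate model on the frozen set.

    Returns (may_promote, suite_score). The candidate may be promoted only
    if its mean score is no more than `tolerance` below the baseline.
    """
    if not cases:
        raise ValueError("the gold set is empty")
    scores = [token_f1(generate(c.encounter_id), c.reference_note) for c in cases]
    suite_score = mean(scores)
    return suite_score >= baseline_score - tolerance, suite_score
```

Here `generate` is whatever callable wraps the candidate model. Because the output is not deterministic, a real gate would probably sample each case several times and gate on the aggregate rather than on a single run.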
Failure modes you will see
- Silent precision drops when the upstream model is updated by your vendor.
- Reference rot: your "validated" gold set ages out as clinical guidelines evolve.
- Eval-overfit: the model improves on the suite without improving in practice. Worst case, it gets worse for users while the metric goes up.
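Two of these failure modes lend themselves to simple automated checks. The sketch below is illustrative only: `drift_alert`, `GoldCase`, `stale_references`, and the `z_threshold` default are hypothetical names and values, and the drift test is a plain difference of means in standard-error units, not a claim about how any particular monitoring stack works.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def drift_alert(
    baseline_scores: Sequence[float],
    current_scores: Sequence[float],
    z_threshold: float = 3.0,
) -> bool:
    """Flag a silent quality drop.

    Returns True when the current mean score sits more than `z_threshold`
    standard errors below the baseline mean.
    """
    if len(baseline_scores) < 2 or len(current_scores) < 2:
        raise ValueError("need at least two scores in each window")
    drop = mean(baseline_scores) - mean(current_scores)
    standard_error = sqrt(
        stdev(baseline_scores) ** 2 / len(baseline_scores)
        + stdev(current_scores) ** 2 / len(current_scores)
    )
    if standard_error == 0:
        # Both windows are constant, so any drop at all is a real difference.
        return drop > 0
    return drop / standard_error > z_threshold


@dataclass(frozen=True)
class GoldCase:
    """Gold-set entry with the date a clinician last validated it (assumed field)."""
    encounter_id: str
    validated_on: date


def stale_references(cases: Sequence[GoldCase], guideline_revised_on: date) -> list[str]:
    """Return IDs of gold cases validated before the relevant guideline changed."""
    return [c.encounter_id for c in cases if c.validated_on < guideline_revised_on]
```

Eval-overfit is harder to automate. The most direct signal is disagreement between the suite metric and the clinician spot-check, which is one reason that layer stays in the loop.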
The discipline
Treat your eval suite as a first-class deliverable. Version it like code. Pay clinicians for their review time. Publish the methodology to your customers so they can audit how you measure quality. None of this is cheap, and all of it is non-optional if you ship in a clinical setting.
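One way to version an eval suite like code is to give every release of it a content fingerprint, so that each reported score can be tied to the exact gold set and scorer that produced it. This is a minimal sketch under assumed conventions: `suite_fingerprint`, the `encounter_id` key, and the `scorer_version` string are hypothetical, not an established format.

```python
import hashlib
import json


def suite_fingerprint(cases: list[dict], scorer_version: str) -> str:
    """Return a SHA-256 hex digest identifying one version of the eval suite.

    Cases are sorted by `encounter_id` and serialised with sorted keys, so
    the digest depends on content, not on file or insertion order.
    """
    canonical = json.dumps(
        {
            "scorer_version": scorer_version,
            "cases": sorted(cases, key=lambda c: c["encounter_id"]),
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recording this digest next to every score makes it visible when two numbers are not comparable because a reference note or the scorer changed between runs. It is also a small, concrete artifact to include when publishing methodology to customers.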
Zowork is a healthcare and behavioral health AI engineering team. For a decade we’ve shipped clinical platforms. Now we’re building the AI that runs underneath them.