Engineering

What Structured AI Evaluation Actually Looks Like

Piper Cyterski
April 7, 2026
8 min read

Ask any AI team whether they evaluate their models and they'll say yes. But push on what that means and the answers diverge wildly. For some, it's a Jupyter cell that prints an accuracy number. For others, it's a Slack thread where three people say "looks good to me." Occasionally, it's a proper scoring pipeline -- but even then, it's usually a one-off script that nobody maintains.

The real problem is deeper than most teams realize. It's not just that evaluation is informal -- it's that the criteria themselves are unexamined. Teams treat evaluation as a fixed checkpoint when it should be a living system that's iterated on as carefully as the models it measures.

Generic Benchmarks Won't Save You

It's tempting to lean on public benchmarks and standard metrics. They're easy to implement and they give you a number. But they almost never measure what actually matters for your use case.

A customer support team cares about whether the response resolved the issue, matched the brand voice, and avoided hallucinating policy details. A code generation team cares about whether the output compiles, handles edge cases, and follows the project's conventions. A content team cares about factual grounding, tone, and whether the piece reads like it was written by someone who understands the subject.

None of these map cleanly to BLEU scores or generic "helpfulness" ratings. Every serious use case demands custom criteria -- evaluation dimensions designed around what quality actually means in your specific context.

Here's a concrete example. Say you're building a customer support bot for a fintech company. A generic eval might ask an LLM judge: "Rate this response from 1-5 on helpfulness." That tells you almost nothing. A custom criterion for this use case might look like:

Policy Accuracy (1-5): Does the response accurately reflect current company policies? Score 1 if the response contains any fabricated policy details. Score 3 if it's directionally correct but vague. Score 5 if every policy reference is accurate and specific enough for the customer to act on. Flag any response that invents a policy number, refund timeline, or eligibility requirement that doesn't exist.

That rubric encodes domain knowledge. It tells the LLM judge (or the human rater) exactly what to look for and how to differentiate between good and bad. The difference between "rate helpfulness" and a rubric like this is the difference between evaluation theater and evaluation that actually catches the failures your customers will notice.
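
To make that concrete in code, here is one way a rubric like this could be written down so the human rater and the LLM judge work from exactly the same text. This is a sketch, not a prescription: the dataclass shape, field names, and prompt wording are assumptions for illustration, not any particular tool's API.

    from dataclasses import dataclass, field

    @dataclass
    class Criterion:
        """One evaluation dimension, written so a human rater and an LLM judge
        score from exactly the same instructions."""
        name: str
        question: str
        scale: dict                   # score -> what that score means
        hard_fail_rules: list = field(default_factory=list)
        version: str = "v1"

        def judge_prompt(self, response: str) -> str:
            # Render the rubric as a single prompt for an LLM judge.
            anchors = "\n".join(f"  {score}: {meaning}" for score, meaning in sorted(self.scale.items()))
            flags = "\n".join(f"  - {rule}" for rule in self.hard_fail_rules) or "  (none)"
            return (
                f"Criterion: {self.name} ({self.version})\n"
                f"{self.question}\n"
                f"Scale:\n{anchors}\n"
                f"Flag the response if any of the following apply:\n{flags}\n\n"
                f"Response to evaluate:\n{response}\n\n"
                "Return a single integer score from 1 to 5, plus any flags."
            )

    policy_accuracy = Criterion(
        name="Policy Accuracy",
        question="Does the response accurately reflect current company policies?",
        scale={
            1: "Contains any fabricated policy detail.",
            3: "Directionally correct but vague.",
            5: "Every policy reference is accurate and specific enough to act on.",
        },
        hard_fail_rules=[
            "Invents a policy number, refund timeline, or eligibility requirement that doesn't exist.",
        ],
    )

The useful part of a structure like this is that the rubric becomes data: it can be versioned, diffed, and rendered identically for a judge prompt or a rater's instructions.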

Criteria Are a Product, Not a Spec

Here's the part most evaluation frameworks miss: your criteria need to be iterated on just like your models do.

That policy accuracy rubric above? It's a first draft -- a hypothesis about what "good" means. It needs to be tested. The most reliable way to test it is to benchmark it against human judgment. Have three domain experts rate 50 outputs using the rubric. Run an LLM judge with the same rubric against the same outputs. Measure agreement.
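
Measuring that agreement doesn't need anything elaborate. A minimal sketch, assuming you've collapsed the three experts into one consensus score per output and lined the judge's scores up in the same order (the variable names in the usage comment are hypothetical):

    from statistics import median

    def consensus(scores_per_rater):
        """Collapse several human raters into one score per output (median here)."""
        return [int(median(column)) for column in zip(*scores_per_rater)]

    def agreement_by_band(human, judge):
        """Exact-match agreement between human and judge scores, split into the
        extremes of the scale (1 and 5) and the middle (2-4)."""
        bands = {"extremes": {1, 5}, "middle": {2, 3, 4}}
        result = {}
        for band, members in bands.items():
            pairs = [(h, j) for h, j in zip(human, judge) if h in members]
            if pairs:
                result[band] = sum(h == j for h, j in pairs) / len(pairs)
        return result

    # Usage, with hypothetical variables holding the 50 rated outputs:
    #   agreement_by_band(consensus(expert_scores), llm_judge_scores)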

Maybe you find the LLM judge agrees with humans 92% of the time on scores of 1 and 5 (the extremes) but only 61% of the time on scores of 2-4 (the middle range). That's actionable: you now know the rubric needs sharper distinctions in the middle of the scale. Maybe you add anchoring examples:

Score 2: Response mentions a relevant policy area but gets a specific detail wrong (e.g., says "refunds within 30 days" when the actual policy is 14 days).

Score 4: All policy references are accurate, but the response fails to proactively surface a relevant policy the customer didn't ask about yet would benefit from knowing.

You re-run, and agreement jumps to 78% in the middle range. That's calibration. The criterion got better because you treated it as something worth experimenting on, not a spec you wrote once and forgot about.

This means criteria development is itself an experimental loop (sketched in code after the list):

  1. Define a criterion based on what quality means for your use case
  2. Collect human judgments on a representative sample
  3. Run the criterion against the same sample
  4. Measure agreement -- where does it align with humans, where does it diverge?
  5. Refine the rubric, add anchoring examples, sharpen the scale boundaries
  6. Version the updated criterion and repeat
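
Stitched together, and reusing the agreement_by_band helper and Criterion object from the sketches above, the loop might look like the skeleton below. The three callables are stand-ins for whatever rating workflow, judge harness, and rubric-editing step you actually use, and the 0.75 target is an arbitrary example rather than a recommendation.

    def calibrate(criterion, sample, collect_human_scores, run_llm_judge, refine_rubric,
                  target=0.75, max_rounds=5):
        """The loop above as a skeleton. The three callables are placeholders for
        your own rating workflow, judge harness, and rubric-editing step."""
        for round_number in range(max_rounds):
            human = collect_human_scores(criterion, sample)       # step 2: human judgments
            judge = run_llm_judge(criterion, sample)              # step 3: same rubric, LLM judge
            agreement = agreement_by_band(human, judge)           # step 4: where do they diverge?
            if agreement.get("middle", 0.0) >= target:
                return criterion                                  # calibrated enough to deploy
            criterion = refine_rubric(criterion, human, judge)    # step 5: sharpen the rubric
            criterion.version = f"v{round_number + 2}"            # step 6: version and repeat
        return criterion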

Teams that treat this loop seriously end up with criteria that are genuinely calibrated to their quality standards. Teams that skip it end up with numbers that feel rigorous but don't correlate with the outcomes they actually care about.

Composing Evaluation Layers

Once you've developed criteria worth trusting, the question is how to deploy them efficiently. In practice, a well-designed evaluation composes three layers:

Programmatic checks run first and run cheaply. Is the output valid JSON? Does it stay under the token limit? Does it contain PII patterns? These are deterministic, instant, and binary. They work as gates -- if output fails a format check, there's no reason to spend LLM or human time evaluating it further. For the fintech support bot, a programmatic check might verify that every response includes a required disclaimer and doesn't expose account numbers.
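
A sketch of what that gate could look like for the support bot. The disclaimer string, the account-number regex, and the length cap are all placeholders standing in for whatever your compliance and product requirements actually specify.

    import re

    REQUIRED_DISCLAIMER = "not financial advice"       # placeholder wording, not a real policy
    ACCOUNT_NUMBER = re.compile(r"\b\d{10,16}\b")      # deliberately naive placeholder pattern
    MAX_CHARS = 2000                                   # stand-in for a real length or token limit

    def programmatic_gate(response):
        """Cheap, deterministic checks that run before any LLM or human review.
        Returns the list of failed checks; an empty list means the response proceeds."""
        failures = []
        if REQUIRED_DISCLAIMER not in response.lower():
            failures.append("missing_disclaimer")
        if ACCOUNT_NUMBER.search(response):
            failures.append("possible_account_number")
        if len(response) > MAX_CHARS:
            failures.append("too_long")
        return failures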

LLM judges handle nuance at scale. Your calibrated rubric (like the policy accuracy criterion above) gets applied by an LLM across hundreds or thousands of outputs. The key is that the rubric has been earned through the calibration process -- it's not a generic prompt, it's a tested instrument. An LLM judge running a well-calibrated rubric is qualitatively different from an LLM judge running "rate this 1-5." One is a measurement tool. The other is a random number generator with extra steps.
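
In code, the judge layer is mostly plumbing around that earned rubric. A sketch, reusing the Criterion object from earlier and treating the model client as a placeholder you'd swap for your own:

    import re

    def judge_batch(criterion, responses, call_llm):
        """Apply one calibrated rubric across many outputs. `call_llm` stands in
        for your model client and should return the judge's reply as text."""
        scores = []
        for response in responses:
            reply = call_llm(criterion.judge_prompt(response))
            match = re.search(r"\b[1-5]\b", reply)   # naive parse; real pipelines use structured output
            scores.append(int(match.group()) if match else None)
        return scores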

Human ratings provide ground truth and handle the hard cases. Humans rate a representative sample to keep the calibration loop running. They also handle the cases where automated evaluation is least reliable -- edge cases, novel failure modes, outputs where the LLM judge's confidence is lowest. Using humans for everything is too slow and expensive. Using them as the calibration layer and the safety net is the highest-leverage application of human attention.

A concrete pipeline for the fintech bot: every response hits the programmatic gate (disclaimer present, no PII). Passing responses get scored by the LLM judge on policy accuracy, empathy, and resolution clarity. The bottom 10% by LLM judge score, plus a random 5% sample, get routed to human review. Human scores feed back into the next round of judge calibration.
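
One way to express that routing step; the 10% and 5% thresholds are the ones from the paragraph above, and the (response_id, score) pair shape is just an assumption for the sketch.

    import random

    def route_for_human_review(scored, review_fraction=0.10, audit_fraction=0.05, seed=0):
        """Pick which judged responses go to human review: the lowest-scoring slice
        plus a random audit sample of everything else. `scored` is a list of
        (response_id, judge_score) pairs."""
        rng = random.Random(seed)
        ranked = sorted(scored, key=lambda pair: pair[1])
        cutoff = max(1, int(len(ranked) * review_fraction))
        lowest = [response_id for response_id, _ in ranked[:cutoff]]
        rest = [response_id for response_id, _ in ranked[cutoff:]]
        audit = rng.sample(rest, min(len(rest), round(len(scored) * audit_fraction)))
        return set(lowest) | set(audit)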

That's three layers, composed deliberately, each doing what it's best at. The result is evaluation you can trust at scale without requiring humans to review everything.

The Versioning Imperative

Criteria evolve -- that's the whole point. You refine rubrics, adjust scales, recalibrate LLM judges against fresh human data. But without versioning, you can't tell whether a score change means the output got worse or the criteria got stricter.

Say your fintech bot's average policy accuracy score drops from 4.1 to 3.6 between Tuesday and Thursday. Did the model get worse? Did a policy change invalidate some responses? Or did you tighten the rubric from v2 to v3 and the new version is just stricter? Without version tracking, you can't tell. With it, the answer is immediate.

This also means you can safely experiment with criteria. Try a new rubric as version N+1, run it against the same dataset, compare results to version N. If it better aligns with human judgment, promote it. If not, roll back. Criteria development gets the same rigor as model development.
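
That promote-or-roll-back decision can be as simple as scoring the same dataset with both rubric versions and checking which one tracks human judgment more closely. A sketch, with the 0.02 margin as an arbitrary example:

    def exact_agreement(human, judge):
        """Fraction of outputs where the judge score matches the human score."""
        return sum(h == j for h, j in zip(human, judge)) / len(human)

    def promote_or_roll_back(human, scored_by_current, scored_by_candidate, margin=0.02):
        """Keep the candidate rubric version only if it tracks human judgment at
        least `margin` better than the current version on the same dataset."""
        current = exact_agreement(human, scored_by_current)
        candidate = exact_agreement(human, scored_by_candidate)
        return "promote" if candidate >= current + margin else "roll back"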

From Evaluation to Evidence

When your criteria are custom-built for your use case, calibrated against human judgment, versioned, and composed into layered evaluation -- "should we ship this?" stops being a vibes question. It becomes a data question with a reproducible answer.

That's the real goal: not just measuring quality, but building a body of evidence that compounds over time. Every scored output is a data point. Every criterion refinement makes the measuring stick more accurate. Every evaluation run becomes a baseline for the next.

Your criteria deserve the same experimental rigor you apply to the things they measure. Anything less is just gut feel with extra steps.