LLM evaluation, honestly — series

Your LLM judge is a classifier

Yee Seng Chan — Tue, 26 May 2026 00:00:00 GMT

Part of a series

LLM evaluation, honestly

An LLM judge that predicts PASS or FAIL is a classifier. Like any classifier, it has to be tested against human labels before its scores mean anything.

This is the step teams often skip. They build a judge, spot-check a few outputs, decide the verdicts look reasonable, and start reporting a pass rate. The dashboard may say the judge passed 87% of traces, but that number is only useful if the judge itself is reliable.

This article is about how to validate an LLM judge.

The running example is a documentation Q&A agent. The agent retrieves passages from internal docs and answers questions about a data-retention policy. For example, trial-account data is kept for 90 days, standard-account data for 30 days, and enterprise accounts under legal hold must be escalated.

The judge’s job is to check whether the answer is faithful to the retrieved source. In plain English: did the answer say what the document actually supports? Or did it use the wrong policy, drop an important caveat, or make a claim the document does not justify?

That is not something a simple regex can reliably check. It requires reading the question, the retrieved source, and the answer together. So it is a reasonable job for an LLM judge. The question is whether that judge is any good.

Validate against human labels

Validating a judge means comparing its verdicts against human labels on examples the judge has not seen.

The process is simple:

Start with traces that humans have labeled PASS or FAIL.
Keep a held-out set that is not used in the judge prompt.
Run the judge on that held-out set.
Compare the judge’s verdicts with the human labels.
Report the confusion matrix, precision, recall, and F1 for the class that matters.

The held-out set is the key. If an example appears in the judge’s few-shot prompt or was used to develop the judge, it cannot also be used to validate the judge. The judge has already seen it.
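The comparison step can be sketched in a few lines of Python. This is a minimal sketch, assuming labels are stored as the strings "PASS" and "FAIL"; the function name `confusion_counts` and the dictionary keys are illustrative and do not come from any library.

```python
from collections import Counter

def confusion_counts(human_labels, judge_verdicts):
    """Tally (judge, human) label pairs on a held-out set.

    Both arguments are equal-length sequences of "PASS"/"FAIL" strings
    (an assumed storage format).
    """
    if len(human_labels) != len(judge_verdicts):
        raise ValueError("every trace needs one human label and one judge verdict")
    pairs = Counter(zip(judge_verdicts, human_labels))
    return {
        "judge_pass_human_pass": pairs[("PASS", "PASS")],
        "judge_pass_human_fail": pairs[("PASS", "FAIL")],  # missed failures
        "judge_fail_human_pass": pairs[("FAIL", "PASS")],  # false alarms
        "judge_fail_human_fail": pairs[("FAIL", "FAIL")],  # caught failures
    }

# Tiny made-up held-out set, only to show the call.
human = ["PASS", "PASS", "FAIL", "FAIL", "PASS"]
judge = ["PASS", "FAIL", "PASS", "FAIL", "PASS"]
counts = confusion_counts(human, judge)
```

The four counts are the cells of the confusion matrix; everything else in the report is derived from them.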

Accuracy hides what matters

Suppose the faithfulness judge is tested on 100 human-labeled traces: 80 PASS and 20 FAIL.

It agrees with human labels on 83 of 100 traces, so the headline looks good: 83% accuracy.

Figure 1: Accuracy hides what matters. On 100 human-labeled traces (80 pass, 20 fail), the faithfulness judge agrees with humans 83% of the time, which looks fine. The confusion matrix shows why it isn’t: the judge correctly passes 77 faithful answers and catches 6 failures, but waves through 14 of the 20 unfaithful ones. Recall on the FAIL class is only 30% (precision 67%, F1 41%).

But the confusion matrix tells a different story:

	Human PASS	Human FAIL
Judge PASS	77	14
Judge FAIL	3	6

The judge correctly passes 77 of 80 faithful answers, but catches only 6 of 20 failures.

Accuracy              = (77 + 6) / 100        = 83%
Precision (P) on FAIL = 6 / (6 + 3)           = 67%
Recall (R) on FAIL    = 6 / (6 + 14)          = 30%
F1 on FAIL            = (2 * P * R) / (P + R) = 41%

So the judge is not good enough for faithfulness evaluation. It misses most of the unfaithful answers.
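The arithmetic above can be reproduced directly from the four cells of the confusion matrix. The sketch below treats FAIL as the positive class; the function name `fail_class_metrics` is an illustrative choice.

```python
def fail_class_metrics(tp, fp, fn, tn):
    """Metrics with FAIL as the positive class.

    tp: judge FAIL, human FAIL    fp: judge FAIL, human PASS
    fn: judge PASS, human FAIL    tn: judge PASS, human PASS
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Cells from the confusion matrix in this section.
metrics = fail_class_metrics(tp=6, fp=3, fn=14, tn=77)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Reporting all four numbers side by side makes the gap between 83% accuracy and 30% recall hard to miss.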

Hold out a real test set

Do not validate a judge on examples that appear in its prompt or were used to tune it. Split your labeled traces into three groups: few-shot examples that go inside the judge prompt, a dev set for iterating on the judge, and a held-out test set used for final validation only.
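One way to make the three-way split reproducible is a seeded shuffle. This is a sketch under stated assumptions: the function name, the default of 8 few-shot examples, the 50/50 dev/test ratio, and the seed are all illustrative, not recommendations from the article.

```python
import random

def split_labeled_traces(traces, n_few_shot=8, dev_fraction=0.5, seed=0):
    """Split labeled traces into disjoint few-shot, dev, and test sets.

    n_few_shot, dev_fraction, and seed are illustrative defaults.
    """
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    few_shot = shuffled[:n_few_shot]
    remaining = shuffled[n_few_shot:]
    n_dev = int(len(remaining) * dev_fraction)
    return few_shot, remaining[:n_dev], remaining[n_dev:]

trace_ids = [f"trace-{i}" for i in range(100)]
few_shot, dev, test = split_labeled_traces(trace_ids)
```

Because the slices come from one shuffled list, no trace can land in two groups. With a rare FAIL class, a stratified split may be preferable so the test set keeps enough failures.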

On the dev set, the most useful step is reading disagreements. For example:

User question: “How long do trial-account records stay available?”
Retrieved context: standard accounts keep records for 30 days; trial accounts keep them for 90 days.
Agent answer: “Records are retained for 30 days after closure.”
Judge verdict: PASS (found a matching sentence in the context).
Human label: FAIL (the answer used the wrong account type).

That disagreement tells you what example to add: faithfulness requires support for the specific entity and condition in the user’s question, not just any matching sentence in the retrieved documents.
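Pulling the disagreements out of a dev run is a small filtering job. The sketch below assumes each dev row is a dict with "trace_id", "human", and "judge" keys; that record shape and the function name are hypothetical.

```python
def find_disagreements(rows):
    """Return dev-set trace ids where judge and human differ, grouped by error type.

    Each row is a dict with "trace_id", "human", and "judge" keys
    (an assumed record shape).
    """
    missed, over_flagged = [], []
    for row in rows:
        if row["human"] == "FAIL" and row["judge"] == "PASS":
            missed.append(row["trace_id"])        # judge waved through a failure
        elif row["human"] == "PASS" and row["judge"] == "FAIL":
            over_flagged.append(row["trace_id"])  # judge flagged an acceptable answer
    return {"missed_failures": missed, "over_flagged": over_flagged}

dev_rows = [
    {"trace_id": "t1", "human": "FAIL", "judge": "PASS"},
    {"trace_id": "t2", "human": "PASS", "judge": "PASS"},
    {"trace_id": "t3", "human": "PASS", "judge": "FAIL"},
]
disagreements = find_disagreements(dev_rows)
```

Keeping the two error types separate helps, because missed failures and over-flagging usually call for different few-shot examples.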

Once the judge performs well on dev, run it once on the held-out test set and report the result.

When calibration is poor

Reading the disagreements usually reveals one of a few patterns:

Missing failure subtype. The judge catches outright hallucinations but misses subtle contradictions, dropped caveats, or unsupported generalizations.
Over-flagging. The judge marks answers FAIL whenever the agent paraphrases the source instead of quoting it verbatim.
Unclear judging standard. The judge marks some answers as unfaithful that the human labelers accept, which points to an ambiguous criterion rather than a missing example.
Right for the wrong reason. The verdict is correct but the critique points at the wrong thing.
Too-small validation set. With few labeled examples, the metrics have wide error bars and the conclusions are unreliable (“Eval scores are samples, not truth” covers when this matters).
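To make the last pattern concrete, here is a rough sketch of how wide the error bars are on the FAIL recall from the earlier 100-trace example (6 of 20 failures caught). It uses the Wilson score interval, which is one common choice for a proportion and is my assumption here, not a method the article prescribes.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# FAIL recall from the earlier example: 6 of 20 failures caught.
low, high = wilson_interval(6, 20)
print(f"recall 30%, interval roughly {low:.0%} to {high:.0%}")
```

With only 20 failures in the test set, the interval runs from about 15% to about 52%, so the measured 30% says little about whether a prompt change actually helped.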

The fix progression is examples → criterion → model.

Examples almost always move first: if the judge is wrong on five cases of subtle contradiction, consider adding five contradiction examples to the few-shot block. If multiple disagreements trace to ambiguity in what counts as a failure rather than missing examples, the adjudicator sharpens the criterion. Model swaps come last. Most calibration problems live in the prompt, not the model, and switching models is expensive in cost, latency, and stability.

Revalidate when things change

The judge agrees with humans on the held-out set today. That doesn’t mean it will agree in six months. Foundation models update, the agent’s failure distribution shifts, and new failure types appear. Calibration is a snapshot: true at measurement time, increasingly stale afterward.

“Your eval system also drifts” covers what to do about it: re-validate against fresh labels on a schedule and monitor agreement rates over time. Large drift in those agreement rates is usually a signal that the failure distribution itself has changed.

Every deployed judge should carry a short record of what it was validated on: the domain, the trace types, the test-set size, the date, and what it’s allowed to be used for. Without that record, judges get re-used on data the original validation doesn’t cover, and the team trusts a number that no longer applies.
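Such a record can be as small as a frozen dataclass stored next to the judge prompt. The sketch below is illustrative only: the class name, field names, the `covers` helper, and every value in the example are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class JudgeValidationRecord:
    judge_name: str
    domain: str
    trace_types: tuple
    test_set_size: int
    validated_on: date
    allowed_uses: tuple

    def covers(self, domain, trace_type):
        """True only if this validation applies to the given data."""
        return domain == self.domain and trace_type in self.trace_types

# All values below are made up for illustration.
record = JudgeValidationRecord(
    judge_name="faithfulness_judge",
    domain="data-retention policy Q&A",
    trace_types=("single-turn retrieval answer",),
    test_set_size=100,
    validated_on=date(2026, 5, 26),
    allowed_uses=("offline regression runs",),
)
```

A check like `record.covers(domain, trace_type)` before reusing a judge makes the "validated on something else" case visible instead of silent.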

The judge is part of the system

An LLM judge is not an oracle. It is a model component that needs the same discipline as any other classifier: held-out labels, a confusion matrix, precision and recall on the class that matters, and periodic revalidation.

Once the judge is validated, there is still one more question: how much should we trust movement in the score? A judge may report that v2 improved by two points, but that movement may be real or may be noise. The next article covers that layer: uncertainty, sample size, and when an eval result is large enough to act on.

Don’t ask an LLM judge what code can check

Yee Seng Chan — Fri, 22 May 2026 00:00:00 GMT

Most bad LLM judges are doing a simpler evaluator’s job. They are asked whether JSON is valid. Whether a required field is present. Whether a tool was called. Whether a prohibited phrase appears in the final answer. Whether a handoff record contains an order ID.

That is not judgment. That is checking.

When a team asks an LLM judge to do what code could have checked, it gets the worst of both worlds: slower, noisier, more expensive, and harder to debug.

LLM judges are useful when the evaluator needs to understand meaning in context. They are wasted when the failure has a structural shape. If the failure can be detected with code, use code. Save judges for failures that require understanding meaning.

Start with the failure mode

The wrong question is “Should we use an LLM judge here?” The right question is “What exactly are we trying to detect?”

Code check when the failure has a structural shape: a missing required field, an invalid category, a prohibited phrase in the response.
LLM judge when the failure requires reading language in context: tone mismatch, redundant clarifying questions, subtle policy violations.

Use code for structural failures

A code-based evaluator is a function. Take a trace, return a score with a reason.

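def check_handoff_schema(trace) -> Score:
    # Score is assumed to be the harness's result type: a passed flag plus a reason.
    required = {"case_id", "user_email", "issue_category", "summary"}
    handoff = trace.outputs.get("handoff_record") or {}
    # Set difference leaves the required fields absent from the handoff.
    missing = required - handoff.keys()
    if missing:
        # Sort so the reason string is identical on every run.
        return Score(passed=False, reason=f"Missing fields: {sorted(missing)}")
    return Score(passed=True, reason="Handoff record well-formed")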

A handful of lines. Runs in milliseconds. Deterministic: same input, same output, every time. Debuggable: when it fails, the reason tells you exactly what to look at. Free: no API calls, no token costs, no rate limits. Runs on every trace, on every CI build, on every production trace if you want.

Code and judges aren’t always alternatives. Sometimes they’re layers. A regex check for refund-promise language catches explicit cases (“Your refund has been approved”); it misses subtler implications (“We’ll make sure you get your money back”). The fix isn’t to abandon the regex for a judge. Run the cheap regex on every trace; run a more expensive judge only when the trace involves a refund request and the regex didn’t fire. The principle: run cheap checks broadly, run judges where they add value.
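The layering can be sketched as a single function that only reaches for the judge when the regex stays silent on a refund-related trace. Everything named here is an assumption for illustration: the pattern, the `is_refund_request` flag, and the `judge` callable, which stands in for an LLM call.

```python
import re

# Illustrative pattern only; a real list would come from labeled failures.
REFUND_PROMISE = re.compile(r"refund (has been|is|was) approved", re.IGNORECASE)

def check_refund_promise(response, is_refund_request, judge=None):
    """Cheap regex on every trace; the judge runs only when the regex is silent
    on a refund-related trace. `judge` is a hypothetical callable returning
    True when it finds an implied refund promise."""
    if REFUND_PROMISE.search(response):
        return {"passed": False, "source": "regex"}
    if is_refund_request and judge is not None:
        return {"passed": not judge(response), "source": "judge"}
    return {"passed": True, "source": "regex"}

def fake_judge(response):
    # Stand-in for an LLM call so the sketch runs offline.
    return "money back" in response.lower()

explicit = check_refund_promise("Your refund has been approved.", True, fake_judge)
implied = check_refund_promise("We'll make sure you get your money back.", True, fake_judge)
unrelated = check_refund_promise("I've updated your shipping address.", False, fake_judge)
```

Recording which layer produced the verdict keeps the result debuggable: a regex failure points at a pattern, a judge failure points at a prompt.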

Use LLM judges for semantic failures

Some failure modes don’t have a structural shape. For instance, the user is frustrated but the agent responds in an upbeat and casual manner.

A judge worth trusting is binary (pass or fail, not 1-5) and grounded in labeled examples from the dataset the previous article produced. The next article covers the prompt design and the calibration step: comparing the judge’s verdicts to human labels before treating its numbers as measurement. Until then, the discipline is simple: don’t deploy a judge whose agreement with humans you haven’t measured.

Keep each evaluator narrow

The mistake is building one big judge called quality_judge_v3 and asking it to score everything: tone, helpfulness, schema validity, routing. That judge produces a number. It does not produce a diagnosis.

It’s the same reason you don’t write a single function called is_correct() in production code. Composite signals are harder to act on, harder to debug, and harder to improve.

Suppose the intake agent’s “overall quality” judge drops from 87% to 81% over a release. What changed? Tone got worse? Helpfulness got worse? Grounding got worse? Some combination? The composite verdict can’t tell you. You read failures by hand, you guess, and you propose a fix without knowing which dimension you’re targeting.

Specialized evaluators fix this. The tone judge tracks tone. The redundant-question judge tracks redundancy. When tone scores drop and redundancy scores hold steady, you know what changed. When you fix the tone issue, the tone judge tells you whether the fix worked.

Figure 1: One number can’t tell you what broke. The composite quality_judge_v3 score drops from 87% to 81% over a release, which leaves you asking “what changed?” and reading failures by hand to guess. The narrow stack of evaluators, run on the same release, shows Tone dropping from 87 to 60 while Redundancy, Schema, Routing, and Escalation all held steady. Same release, same data; the diagnosis only exists in the narrow view.

The cost is more API calls per trace, one per judge instead of one total. In practice this is cheap, because most evaluation runs sample only a fraction of production traffic. The per-trace cost matters less than signal quality.

A reasonable evaluator stack for the intake agent might look like:

Evaluator	Failure mode	Mechanism
`no_refund_promise`	Forbidden refund-approval language	Code (regex)
`handoff_schema`	Required handoff fields populated	Code (schema check)
`order_id_confirmed`	Order ID confirmed before handoff	Code (state predicate)
`escalation_honored`	Escalation when user requests human	Code (state predicate)
`tone_appropriate`	Tone matches user’s emotional register	LLM judge
`no_redundant_question`	Clarifying question wasn’t already answered	LLM judge
`classification_correct`	Issue routed to the right category	Reference-based check

Seven evaluators, each answering one question. When the dashboard shows a regression, you can identify which question’s answer changed.
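With one result list per evaluator, finding the regressed question is a dictionary comparison. This is a minimal sketch; the function names, the 5-point threshold, and the sample results are all made up for illustration.

```python
def pass_rates(results):
    """results maps evaluator name -> list of booleans (one per trace)."""
    return {name: sum(passed) / len(passed) for name, passed in results.items() if passed}

def biggest_drops(before, after, threshold=0.05):
    """Evaluators whose pass rate fell by more than `threshold`, largest drop first."""
    drops = {name: before[name] - after[name] for name in before if name in after}
    return sorted((n for n, d in drops.items() if d > threshold), key=lambda n: -drops[n])

# Made-up results for two releases, 10 traces each.
release_a = {"tone_appropriate": [True] * 9 + [False], "handoff_schema": [True] * 10}
release_b = {"tone_appropriate": [True] * 6 + [False] * 4, "handoff_schema": [True] * 10}
regressed = biggest_drops(pass_rates(release_a), pass_rates(release_b))
```

A composite score over the same data would only show that something got worse; the per-evaluator view names `tone_appropriate`.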

One related discipline: when a regression case enters the suite, tag it by the layer where the failure originated, not the layer where it became visible. A malformed handoff at turn 5 caused by a classifier misread at turn 3 should be tagged as a classifier failure.

Code first, judges when you need them

“Your LLM judge is a classifier” covers the next layer: making judges trustworthy when you do need them. A judge that produces verdicts in the right format isn’t the same as a judge that produces correct verdicts.

The next time someone proposes a judge, first ask: could code answer this?

Read traces before you write the labeling guide

Yee Seng Chan — Tue, 19 May 2026 00:00:00 GMT

The eval loop from “Stop vibe-checking your agent” is only as honest as the dataset underneath it. A dataset’s first job is to discover how the system fails.

The running example continues from the harness series: a scheduling assistant that books, reschedules, and cancels meetings. A user writes: “Move my meeting with Alex to next Thursday afternoon if he has time.” That one sentence assumes the assistant can guess which Alex, parse “next Thursday afternoon,” pick the right calendar, and not say “Done” before the write actually succeeds.

Start with traces, not imagined categories

A common mistake is to start with abstract qualities like accurate, helpful, concise, safe, professional. They describe how the team imagines quality before seeing how the system actually fails.

For a scheduling assistant, the real failures are more specific: choosing the wrong Alex, writing to the wrong calendar, confirming before the calendar API succeeds, losing information from earlier turns, or mishandling a forwarded email thread. These failures are concrete enough to test and fix, and they only emerge when humans read real traces.

So start with traces: label what happened, write short critiques, and let the failure modes emerge from the examples. When labeling reveals an obvious bug, fix it instead of building a judge to detect it.

Keep labels consistent

Reviewers will disagree on edge cases. Designate one adjudicator to own the labeling guide, make the call on contested cases, and update examples as the standard becomes clearer.

For domain-heavy products, the adjudicator should be whoever is closest to the product’s real standard of correctness, not whoever happens to be available. In a healthcare project I led, anchoring medical-billing-level labels to a domain expert’s judgment rather than engineers’ guesses gave the dataset real ground truth.

Keep the labels themselves simple: pass or fail. Likert scales feel more nuanced but are harder to act on and invite false precision. The nuance lives in the critique.

Design for coverage and difficulty

Once labels are consistent, the next question is whether the dataset is representative.

Figure 1: Design the dataset on two axes. Coverage is what situations the set must contain: main tasks (book, reschedule, cancel), common ambiguity (which Alex, vague time, forwarded thread), and high-risk failures (wrong calendar, premature confirm, permission error); set target counts per slice and decide before collecting. Difficulty is how hard each case is: easy guards against regressions, medium shows whether it is improving, hard is the frontier, safety must refuse or escalate; keep a visible mix. Raw production clusters in the easy, main-task corner, so the rest has to be designed in deliberately.

A raw production sample usually overrepresents easy traffic. For a scheduling assistant, that means lots of simple booking requests and too few high-risk cases: ambiguous attendees, wrong-calendar writes, permission errors, forwarded email threads, time-zone mistakes.

Design the dataset intentionally:

Main tasks: booking, rescheduling, cancellation.
Common ambiguity: unclear attendee, unclear event, vague time phrase, forwarded email thread.
High-risk failures: wrong calendar, premature confirmation, missing conflict check, permission error.

Set target counts for the slices that matter most (for example, 30 ambiguous-attendee cases and 25 wrong-calendar cases). The numbers are illustrative; the point is deciding the targets before collecting.
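Tracking progress against those targets takes only a counter. The sketch below reuses the two illustrative counts from this section; the slice tag names and the function name are hypothetical.

```python
from collections import Counter

# Illustrative targets from this section; other slices would be added the same way.
TARGETS = {"ambiguous_attendee": 30, "wrong_calendar": 25}

def coverage_gaps(slice_tags, targets=TARGETS):
    """Return how many more cases each slice needs to reach its target."""
    have = Counter(slice_tags)
    return {name: target - have[name] for name, target in targets.items() if have[name] < target}

# Made-up collection state: plenty of easy traffic, too few wrong-calendar cases.
collected = ["ambiguous_attendee"] * 30 + ["wrong_calendar"] * 11 + ["simple_booking"] * 200
gaps = coverage_gaps(collected)
```

Slices without a target, such as the 200 simple bookings here, are ignored on purpose: volume in the easy corner does not count toward coverage.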

For rare but important slices, do a cheap pre-filtering pass before full labeling. If wrong-calendar writes or permission errors are rare in raw traffic, find the candidate traces first, then label those carefully. In a clinical-note project I led, sections like allergies and labs appeared sparsely across clinician-patient transcripts; a quick yes/no pass identified which transcripts contained them.

Balance difficulty too:

Easy cases: the system should already pass these. They protect against regressions.
Medium cases: the system is mixed. These show whether the product is improving.
Hard cases: the system often fails. These show the frontier.
Safety cases: the system must refuse, escalate, or avoid forbidden behavior.

Too easy and the dataset saturates; too hard and it doesn’t show progress. Aim for a visible mix so the team can tell what actually improved.

“Eval scores are samples, not truth” returns to sample size and uncertainty. If a slice matters, include it deliberately and track it separately.

When you don’t have production data

Before launch, the dataset has to come from somewhere other than production. The key rule: generate inputs, not outputs. Let the real system produce the responses, and label the resulting traces like any other trace.

Two sources work well in combination.

Persona-based generation. Define three to five user personas that match the product (for the scheduling assistant: a time-zone-confused product manager, an executive with overlapping calendars, an IC handling forwarded meeting threads). For each persona, generate eight to twelve plausible queries.

Expert elicitation. Sit with someone who has done this product’s work before: a customer success lead, a power user from a prior product, a domain expert. Ask: “What is the worst question someone could ask this?” and “Where do you expect this to fail?” The answers encode failure intuition the team doesn’t yet have from data, including the high-risk cases the coverage section calls out.

The dataset is the product’s memory

An eval dataset is the product’s memory of what has gone wrong, what the team decided “good” means, and what must not break again.

That memory has to keep growing. The first version will be incomplete; failure modes will get sharper as production reveals new ones; the adjudicator will tighten the examples and the labeling guide as the team’s standard becomes clearer. The danger is pretending the dataset doesn’t need to evolve.

Stop vibe-checking your agent

Yee Seng Chan — Fri, 15 May 2026 00:00:00 GMT

Some teams evaluate by feel: read a few runs after a prompt change, form an impression, ship. It works until something changes: a prompt update seems better but nobody can prove it, or a refactor might break something nobody can name.

The first useful eval system replaces feel with a loop: traces, labels, failure modes, checks and judges, regression cases, and a regular review that feeds new failures back in.

This series is about evaluating LLM agents: turning the observable behavior the harness captures into evidence.

The running example: a customer support intake agent that classifies issues, asks clarifying questions, captures relevant facts, and produces a handoff record for a human. No refunds, no side-effecting tools.

What eval is for

A score is useful only if it changes what the team does next. Eval supports different decisions at different stages: during development, whether a new prompt is better; before merge, whether old failures came back; in production, whether new failure patterns are emerging.

The operational test for any metric: When this number moves, what decision changes? If the answer is merge, hold, investigate, roll back, or add a regression case, the metric is useful. If the answer is “nothing,” it does not belong on the dashboard.

The minimum viable eval loop

A useful eval system starts with traces and ends with decisions. The first version needs six pieces: traces, labels, failure modes, checks and judges, regression cases, and a regular review.

Figure 1: The minimum viable eval loop. Traces become labels, labels reveal failure modes, failure modes become checks and judges, checks and judges produce regression cases, and a regular review turns recent failures into decisions. The loop only works if it closes: new production failures re-enter as traces and regression cases rather than disappearing into Slack.

1. Traces

A trace records one agent run: state, retrieved context, actions, tool calls, verification, and the final state it saved. It shows how the agent got there, not just where it ended up.

An agent’s final message often hides the real failure. The intake agent might say: “Thanks, I captured your request and will route it to our support team for review.” The trace may show that it classified the issue as “billing” instead of “refund,” failed to copy the order ID into the handoff, and marked the user’s frustration as “low” despite the message “I’ve asked about this three times already.”

Reading only the final answer misses the first bad move. A malformed handoff at turn five may trace back to a misclassification at turn three.

Figure 2: The final answer hides the first bad move. A final-answer check sees only the agent’s closing message, which reads fine and would pass. The trace shows the run was already broken: at turn 3 the issue was misclassified as billing instead of refund (the first bad move), which propagated to a mismarked frustration level at turn 4 and a handoff with the order ID dropped at turn 5. The turn-5 failure traces back to the turn-3 misclassification.

2. Labels

A label is a human judgment attached to a trace: pass or fail, plus a short critique. For example: “Failed: agent routed the case as billing instead of refund at turn three; downstream handoff omitted refund-request status.” That critique tells the team what failed, where, and what to watch for. A score of 3 out of 5 does not.
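As a data structure, a label is a verdict plus the critique that explains it. The sketch below is one possible shape, not the article's schema: the class, its field names (including `first_bad_turn`), and the trace id are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Label:
    trace_id: str
    passed: bool
    critique: str                         # what failed, where, and what to watch for
    first_bad_turn: Optional[int] = None  # hypothetical field: turn of the first bad move

    def __post_init__(self):
        # A failing label without a critique is as uninformative as a bare score.
        if not self.passed and not self.critique.strip():
            raise ValueError("failing labels need a critique")

label = Label(
    trace_id="intake-0042",
    passed=False,
    critique="Agent routed the case as billing instead of refund at turn three; "
             "downstream handoff omitted refund-request status.",
    first_bad_turn=3,
)
```

Rejecting a failing label with an empty critique enforces the point of the section: the verdict alone is not enough to act on.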

3. Failure modes

After reading thirty or forty labeled traces, failure modes emerge: refund language the agent should not have used, redundant clarifying questions, dropped order IDs, missing handoff fields. Without them, every problem collapses into “the prompt is bad.”
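Once reviewers have tagged each failing trace with the pattern they saw, a frequency count shows which failure modes deserve a check or judge first. This is a sketch under the assumption that each label carries a human-assigned `failure_mode` tag; the tag names and record shape are illustrative.

```python
from collections import Counter

def failure_mode_counts(labels):
    """Count human-assigned failure-mode tags across failing labels only."""
    return Counter(l["failure_mode"] for l in labels if not l["passed"])

# Made-up labels; in practice the tags come from reading real traces.
labels = [
    {"trace_id": "t1", "passed": False, "failure_mode": "dropped_order_id"},
    {"trace_id": "t2", "passed": True, "failure_mode": None},
    {"trace_id": "t3", "passed": False, "failure_mode": "redundant_question"},
    {"trace_id": "t4", "passed": False, "failure_mode": "dropped_order_id"},
]
counts_by_mode = failure_mode_counts(labels)
most_common_mode = counts_by_mode.most_common(1)[0]
```

The count is only a prioritization aid; the tags themselves still have to come from humans reading traces.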

A later article goes deeper on building this dataset: which traces to sample, how to label them, and how failure modes emerge from reading real runs.

4. Checks and judges

Each important failure mode should become either a check or a judge.

A check is a function that examines a trace and returns pass or fail without an LLM call. Use checks for structural failures: refund language appears, a required field is missing, the handoff schema is invalid, or the order ID was dropped.
A judge is a separate LLM that examines a trace and returns a verdict on something a function cannot reliably decide. Use judges for semantic failures: the clarifying question was redundant, the tone was wrong, or the agent answered the wrong question.

Figure 3: Check or judge? A check is a deterministic function with no LLM call, used for structural failures (refund language present, required field missing, order ID dropped); it is cheap and should run on every change. A judge is a separate LLM call, used for semantic failures (redundant clarifying question, wrong tone, answered the wrong question); it needs calibration against human labels before it can be trusted.

For example, “the agent must never promise a refund” should be a code check:

def evaluate_no_refund_promise(trace) -> Score:
    response = trace.outputs.get("agent_response", "")
    matches = find_matches(response, REFUND_PROMISE_PATTERNS)

    if matches:
        return Score(
            passed=False,
            reason=f"Refund-promise language: {matches}",
        )

    return Score(passed=True, reason="No refund-promise language")

Checks are cheap and deterministic. Run them on every important change.
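The check above relies on `find_matches` and `REFUND_PROMISE_PATTERNS`, which the article does not define. One plausible way to fill them in, offered as an assumption rather than the author's implementation, is a list of compiled regular expressions and a helper that collects every hit:

```python
import re

# Illustrative patterns only; a real list would grow out of labeled failures.
REFUND_PROMISE_PATTERNS = [
    re.compile(r"\brefund (has been|is|was) approved\b", re.IGNORECASE),
    re.compile(r"\bwe(?:'ll| will) refund\b", re.IGNORECASE),
]

def find_matches(text, patterns):
    """Return every substring of `text` matched by any pattern, in pattern order."""
    return [m.group(0) for pattern in patterns for m in pattern.finditer(text)]

hits = find_matches("Good news: your refund has been approved.", REFUND_PROMISE_PATTERNS)
clean = find_matches("I've routed your request to the support team.", REFUND_PROMISE_PATTERNS)
```

Returning the matched text, rather than a boolean, is what lets the `reason` string say exactly which phrase tripped the check.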

Judges need calibration: run the judge on traces humans have already labeled, then look at where it agrees, where it misses failures humans caught, and where it flags cases humans considered acceptable. Calibration tells you whether the judge is reliable enough to support the decision you want it to support.

“Don’t ask an LLM judge what code can check” focuses on when a failure should become a code check versus an LLM judge. “Your LLM judge is a classifier” goes deeper on judge calibration.

5. Regression cases

A regression case is a real failure saved so future versions have to pass it.

If the intake agent once promised a refund, dropped an order ID, or asked for information the user already provided, that trace becomes a fixed case in the regression suite. Ten or fifteen real failures are enough to make the system harder to break in the same way twice.

Run the regression suite before important merges. If the new version fails cases the old version passed, the team should understand why before shipping.
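The pre-merge comparison reduces to a diff between two result maps. The sketch below assumes each version's run is summarized as a dict from case id to pass/fail; the function name and case ids are made up.

```python
def newly_failing(old_results, new_results):
    """Regression cases the old version passed and the new version fails.

    Both arguments map case id -> bool (True means the case passed).
    A case missing from new_results is treated as failing.
    """
    return sorted(
        case_id
        for case_id, passed_before in old_results.items()
        if passed_before and not new_results.get(case_id, False)
    )

old = {"refund-promise-01": True, "dropped-order-id-02": True, "redundant-question-03": False}
new = {"refund-promise-01": True, "dropped-order-id-02": False, "redundant-question-03": True}
regressions = newly_failing(old, new)
```

A non-empty result is the signal to stop and read those traces before shipping; cases that newly pass, like the third one here, are not reported by this diff.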

“Eval scores are samples, not truth” covers what changes once raw pass rates are no longer enough and you need to think about uncertainty, sample size, and meaningful differences between versions.

6. Review and decisions

The loop only matters if it changes what the team does. A failed check blocks a merge, a spike in redundant-question failures triggers investigation, a new production failure becomes a regression case, a judge disagreement becomes a calibration example.

Pull a small sample of recent failures regularly, read them end-to-end, and ask whether they reveal a pattern the current dataset misses. The habit is simple: new failures feed back into the eval loop instead of disappearing into Slack.

“Your eval system also drifts” returns to this production loop: how new failures from real usage become new eval cases, alerts, and hardening work.

A concrete first version

The first version for the intake agent might look like this:

Piece	First version
Traces	~50 intake traces, synthetic if pre-launch or sampled from production once traffic exists
Labels	Pass/fail plus short critiques explaining the first bad move
Failure modes	Recurring patterns of mistakes (e.g. “promises refunds”, “drops order IDs”)
Checks and judges	Code checks for the structural ones; one calibrated judge for redundant clarifying questions
Regression cases	10–15 real historical failures fixed as test cases
Review	Regular review of recent failures and judge disagreements

The numbers are illustrative. The shape matters: start small, use real failures, make every new failure feed back into the loop.

The smallest useful promise

Having a minimal eval loop in place is enough to change how the team builds. “Does v2 feel better?” becomes “Which failure modes improved, which regressed, and do we trust the measurement?” Eval stops being a side research exercise and becomes part of the engineering loop.

The harness made the agent’s behavior observable, the eval system turns that behavior into evidence, and the hardening loop turns evidence into a system that is harder to break.

The next article goes one layer deeper: the eval dataset.