
LLM evaluation, honestly

Why your eval scores look good while the system stays unreliable, and what to measure instead.

Author: Yee Seng Chan

Published: 2026 · May 15

The series treats LLM evaluation as engineering: labels built from real traces, code checks before LLM judges, judges treated as calibrated classifiers, scores read as samples not truth, and an eval system you maintain like any other. The point is to know whether the system is actually getting better.

Read in order:

Stop vibe-checking your agent

Eyeballing a handful of runs feels like evaluation and isn’t.

Read traces before you write the labeling guide

Let real traces decide what ‘good’ means.

Don’t ask an LLM judge what code can check

If a check can be written as code, an LLM judge is a slower, noisier, costlier way to an answer.

Your LLM judge is a classifier

Validate LLM judges with held-out labels, confusion matrices, and periodic rechecks.

Eval scores are samples, not truth

How to tell real eval gains from noise.

Your RAG score hides the diagnosis

Score the pipeline, not the answer.

Your eval system also drifts

Eval systems need maintenance too.
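Two ideas from the list above, treating a judge as a classifier and reading scores as samples, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used in the posts: the `human` and `judge` label lists are made-up data, and `confusion_counts` and `bootstrap_interval` are hypothetical helper names. The percentile bootstrap is one common way to put an interval around a score.

```python
import random
from collections import Counter


def confusion_counts(human, judge):
    """Count agreement cells between human labels and judge labels (True = pass)."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(human, judge))
    return {
        "tp": counts[(True, True)],    # both say pass
        "fn": counts[(True, False)],   # human pass, judge fail
        "fp": counts[(False, True)],   # human fail, judge pass
        "tn": counts[(False, False)],  # both say fail
    }


def bootstrap_interval(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the examples with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Made-up held-out labels for illustration only.
human = [True, True, True, False, False, True, False, True, False, True]
judge = [True, True, False, False, True, True, False, True, False, True]

cm = confusion_counts(human, judge)
agreement = [float(h == j) for h, j in zip(human, judge)]
point = sum(agreement) / len(agreement)
low, high = bootstrap_interval(agreement)

print(cm)
print(f"agreement={point:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

With only ten examples the interval is wide, which is the point: a single agreement number from a small sample says little until its spread is shown next to it.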


© 2026 Yee Seng Chan
