Writing

Essays on how modern language models work, and on building and evaluating the systems around them.

All essays, newest first. Multi-part arguments are collected under Series.

Your eval system also drifts

LLM evaluation, honestly

Eval systems need maintenance too.

2026 · Jun 5

Your RAG score hides the diagnosis

LLM evaluation, honestly

Score the pipeline, not the answer.

2026 · Jun 1

Eval scores are samples, not truth

LLM evaluation, honestly

How to tell real eval gains from noise.

2026 · May 29

Your LLM judge is a classifier

LLM evaluation, honestly

Validate LLM judges with held-out labels, confusion matrices, and periodic rechecks.

2026 · May 26

Don’t ask an LLM judge what code can check

LLM evaluation, honestly

If a check can be written as code, an LLM judge is a slower, noisier, costlier way to get an answer.

2026 · May 22

Read traces before you write the labeling guide

LLM evaluation, honestly

Let real traces decide what ‘good’ means.

2026 · May 19

Stop vibe-checking your agent

LLM evaluation, honestly

Eyeballing a handful of runs feels like evaluation and isn’t.

2026 · May 15

Traces are how agents get better

The agent harness

Traces turn failures into evidence you can debug, attribute, and fix.

2026 · May 10

Prompts guide. Gates enforce

The agent harness

A prompt is a suggestion the model can ignore; a gate is code it cannot.

2026 · May 1

State, not transcript, is agent memory

The agent harness

Structured state is the real memory layer.

2026 · Apr 28

The harness is the product

The agent harness

The case that the harness, not the model, is what you actually ship.

2026 · Apr 23

Why AI agent demos break in production

The agent harness

Failures live in everything wrapped around the model.

2026 · Apr 18

Behind the scenes of AI agent frameworks

What an agent actually is

The model emits, the API parses, your code dispatches and loops.

2026 · Apr 14

What the heck is an AI agent?

What an agent actually is

Stop asking if it’s an agent. Ask how much autonomy, and whether that fits the task.

2026 · Apr 11