Yee Seng Chan

Writing

Essays on how modern language models work, and on building and evaluating the systems around them.
Individual essays, newest first. Multi-part arguments are collected under Series.

Your LLM judge is a classifier

LLM evaluation, honestly

Validate LLM judges with held-out labels, confusion matrices, and periodic rechecks.

2026 · May 26

Don’t ask an LLM judge what code can check

LLM evaluation, honestly

If a check can be written as code, an LLM judge is a slower, noisier, costlier way to an answer.

2026 · May 22

Read traces before you write the labeling guide

LLM evaluation, honestly

Let real traces decide what ‘good’ means.

2026 · May 19

Stop vibe-checking your agent

LLM evaluation, honestly

Eyeballing a handful of runs feels like evaluation and isn’t.

2026 · May 15

Traces are how agents get better

The agent harness

Traces turn failures into evidence you can debug, attribute, and fix.

2026 · May 10

Prompts guide. Gates enforce

The agent harness

A prompt is a suggestion the model can ignore; a gate is code it cannot.

2026 · May 1

State, not transcript, is agent memory

The agent harness

Structured state is the real memory layer.

2026 · Apr 28

The harness is the product

The agent harness

The case that the harness, not the model, is what you actually ship.

2026 · Apr 23

Why AI agent demos break in production

The agent harness

Failures live in everything wrapped around the model.

2026 · Apr 18

Behind the scenes of AI agent frameworks

What an agent actually is

The model emits, the API parses, your code dispatches and loops.

2026 · Apr 14

What the heck is an AI agent?

What an agent actually is

Stop asking if it’s an agent. Ask how much autonomy, and whether that fits the task.

2026 · Apr 11

Information extraction didn’t disappear. It moved inside the workflow.

Research foundations of modern LLMs

Classic IE was absorbed into prompts, schemas, and pipelines.

2026 · Apr 5

The fine-tuning stack: one loss, different data

Research foundations of modern LLMs

What actually changes behavior across SFT, instruction tuning, and preference tuning.

2026 · Mar 31

Retrieval is older than RAG: from DPR to end-to-end

Research foundations of modern LLMs

What RAG inherited from DPR, and what it quietly skipped.

2026 · Mar 27

The encoder didn’t die. It became the embedding model

Research foundations of modern LLMs

BERT-style encoders moved into retrieval, reranking, and classification.

2026 · Mar 23

Pretraining objectives: why decoder-only won

Research foundations of modern LLMs

How unification, not raw capability, settled the architecture race.

2026 · Mar 18

How reasoning models learn to use tools

How LLMs learn to reason

Reasoning and tool-calling are different skills. RL teaches the timing.

2026 · Mar 14

R1-Zero was the result. R1 was the product

How LLMs learn to reason

Raw RL grows reasoning. Data, filtering, and staged training make it a product.

2026 · Mar 10

DPO: RLHF collapsed into one loss

How LLMs learn to reason

No reward model, no RL loop. One supervised-looking loss, with hidden costs.

2026 · Mar 6

GRPO: the algorithm behind reasoning models

How LLMs learn to reason

Drop the value model. Score a group of samples against each other.

2026 · Mar 2

PPO is REINFORCE plus five fixes

How LLMs learn to reason

Baselines, advantages, clipping, a value model, stable updates.

2026 · Feb 24

REINFORCE: the gradient that drives training

How LLMs learn to reason

The policy gradient, derived.

2026 · Feb 20

REINFORCE: the world before the gradient

How LLMs learn to reason

MDPs, returns, value vs policy, the objective.

2026 · Feb 16
© 2026 Yee Seng Chan