How LLMs learn to reason

The RL lineage behind reasoning models, traced one algorithm at a time, from REINFORCE and PPO through GRPO and DPO to what R1 actually shipped.

Author

Yee Seng Chan

Published

2026 · February 16

Reasoning models came out of a reinforcement-learning lineage: REINFORCE and PPO set the mechanics, GRPO and DPO reshaped the reward and update, R1 made it shippable, and tool-use RL extended it past single answers. The series follows the chain in order, so R1 reads as a consequence rather than a surprise.

Read in order:

REINFORCE: the world before the gradient

MDPs, returns, value vs policy, the objective.

2026 · February 16

REINFORCE: the gradient that drives training

The policy gradient, derived.

2026 · February 20

PPO is REINFORCE plus five fixes

Baselines, advantages, clipping, a value model, stable updates.

2026 · February 24

GRPO: the algorithm behind reasoning models

Drop the value model. Score a group of samples against each other.

2026 · March 2

DPO: RLHF collapsed into one loss

No reward model, no RL loop. One supervised-looking loss, with hidden costs.

2026 · March 6

R1-Zero was the result. R1 was the product

Raw RL grows reasoning. Data, filtering, and staged training make it a product.

2026 · March 10

How reasoning models learn to use tools

Reasoning and tool-calling are different skills. RL teaches the timing.

2026 · March 14

© 2026 Yee Seng Chan
