How LLMs learn to reason

The RL lineage behind reasoning models, traced one algorithm at a time, from REINFORCE and PPO through GRPO and DPO to what R1 actually shipped.
Author: Yee Seng Chan

Published: 2026 · February 16

Reasoning models came out of a reinforcement-learning lineage: REINFORCE and PPO set the mechanics, GRPO and DPO reshaped the reward signal and the update rule, tool-use RL extended training past single answers, and R1 made the approach shippable. The series follows that chain in order, so R1 reads as a consequence rather than a surprise.

Read in order: