How LLMs learn to reason — series

How reasoning models learn to use tools

Yee Seng Chan — Sat, 14 Mar 2026 00:00:00 GMT

Part of a series

How LLMs learn to reason

Ask Claude what last night’s score was, or to look up the latest version of some Python library, and watch what it does. It writes a search query, calls out to a search tool, reads the results, sometimes searches again to clarify, and only then answers you. The whole sequence happens inside one response.

The behavior isn’t scripted. Real assistants have scaffolding around the model (tool schemas, system instructions, routing, sometimes policy rules), but the core behavior is something the model learned in training. There’s no if user_asked_for_current_info: search() branch anywhere. The model learned that for some questions, calling out to a tool produces better answers than guessing.

The previous piece covered how DeepSeek-R1 used reinforcement learning on math and code problems alone to consolidate reasoning behavior (backtracking, self-correction, longer chains of thought) out of a pretrained base model. The headline finding was that outcome-only RL on verifiable tasks produces models that learn to think.

This piece is about what happens when you point that same training recipe at tool use. It works, with two interesting twists.

The recipe extends: Search-R1

The cleanest demonstration is Search-R1 (Jin et al. 2025). Take the R1 training recipe, but instead of letting the model think to itself, let it call a search engine in the middle of its reasoning.

The training data is question-answering, specifically the kind that requires looking facts up. “Curious is a women’s fragrance by a singer born in what city and state?” The model can’t reasonably know the answer from memorized training data and needs to chain together facts it doesn’t have.

The trajectory is interleaved: reason for a bit, decide a search would help, issue a query, read the retrieved passages, reason more, maybe search again, eventually answer. In practice:

<think> I need to find the singer behind the fragrance "Curious". </think>
<search> Curious fragrance singer </search>
<information> [Wikipedia: Curious is a fragrance by Britney Spears...] </information>
<think> So I need to find where Britney Spears was born. </think>
<search> Britney Spears birthplace </search>
<information> [...Spears was born in McComb, Mississippi...] </information>
<think> McComb, Mississippi. That's the answer. </think>
<answer> McComb, Mississippi </answer>

The training loop:

for each training question:
    trajectory = []
    answered = False
    while not answered:
        generate next chunk from the model
        append chunk to trajectory
        if chunk contains <search> query </search>:
            results = retriever(query)
            append <information> results </information> to trajectory
        if chunk contains <answer> ... </answer>:
            answered = True
    reward = exact_match(extracted_answer, ground_truth)
    update the policy, but only on tokens the model wrote,
    NOT on tokens that came from retrieved passages

The last line is the wrinkle worth understanding.

Setup-wise, this is the R1 recipe with one change. Same policy-gradient RL (Search-R1 tests both PPO and GRPO, with PPO more stable in their main setting). Same outcome-only reward (exact match between predicted and ground-truth answer). No process rewards, no learned reward model, no human-graded trajectories. The only difference is that when the model produces a <search> query </search> block, the training loop pauses generation, runs the query, and pastes the top results back into the trajectory before the model continues.
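To make the pause-and-paste step concrete, here is a minimal Python sketch of one rollout. It is not Search-R1's code: `scripted_policy` is a hypothetical stand-in for the LLM that replays fixed chunks, `retriever` is a two-entry dictionary rather than a search engine, and `exact_match` is a simplified version of the reward (lowercase and strip only). The one detail worth noticing is that every segment is tagged with where it came from, which is what the loss needs later.

```python
import re

def retriever(query):
    """Hypothetical stand-in for a search engine: returns one canned passage."""
    corpus = {
        "Curious fragrance singer": "Curious is a fragrance by Britney Spears.",
        "Britney Spears birthplace": "Spears was born in McComb, Mississippi.",
    }
    return corpus.get(query, "No results.")

def scripted_policy(script):
    """Hypothetical stand-in for the LLM: replays fixed chunks, one per call."""
    chunks = iter(script)
    return lambda segments: next(chunks)

def rollout(generate, max_turns=4):
    """Interleave generation and retrieval, tagging each segment with its source."""
    segments = []  # list of (source, text); source is "model" or "retrieved"
    for _ in range(max_turns):
        chunk = generate(segments)
        segments.append(("model", chunk))
        search = re.search(r"<search>(.*?)</search>", chunk, re.S)
        if search:
            # Pause generation, run the query, paste the result back in.
            passage = retriever(search.group(1).strip())
            segments.append(("retrieved", f"<information> {passage} </information>"))
        answer = re.search(r"<answer>(.*?)</answer>", chunk, re.S)
        if answer:
            return segments, answer.group(1).strip()
    return segments, None  # ran out of turns without answering

def exact_match(prediction, ground_truth):
    """Outcome-only reward: 1.0 on a (simplified) exact match, else 0.0."""
    if prediction is None:
        return 0.0
    return float(prediction.strip().lower() == ground_truth.strip().lower())

script = [
    "<think> Who is behind Curious? </think> <search> Curious fragrance singer </search>",
    "<think> Now her birthplace. </think> <search> Britney Spears birthplace </search>",
    "<think> McComb, Mississippi. </think> <answer> McComb, Mississippi </answer>",
]
segments, answer = rollout(scripted_policy(script))
reward = exact_match(answer, "McComb, Mississippi")
```

In a real training loop the `generate` callable would sample from the policy, and the `("retrieved", ...)` segments are the ones that get excluded from the update.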

And it works. Trained on Natural Questions and HotpotQA, evaluated across seven QA datasets, Qwen2.5-7B with Search-R1 (PPO) gets 43% average accuracy versus 30% for vanilla retrieval-augmented generation and 28% for R1-style RL with no search at all. The gap is consistent across both in-distribution and out-of-distribution test sets.

What’s striking is what the model learned without being told. The reward checks only the final answer. Nothing in the training signal says “search more often” or “search more carefully.” But over training, the average number of searches per trajectory grows. The model learns to search when uncertain, to search again if the first results aren’t good enough, and sometimes to do a verifying search after it thinks it has the answer.

None of this was hand-designed. The policy gradient finds these behaviors because trajectories that include them tend to end with correct answers more often than trajectories that don’t. The same mechanism that gave R1 its self-correcting math behavior gives Search-R1 its search behavior.

The retrieved-token mask

The trajectory contains tokens the model didn’t write: the retrieved Wikipedia passages between the <information> and </information> tags. Run the policy gradient over every token, and you’re training the model to assign higher probability to those passages too. That’s nonsense as a learning signal. The model didn’t choose those tokens, and the retrieved text isn’t even consistent across rollouts of the same prompt.

The fix is to mask retrieved tokens out of the loss. Only update on tokens the model actually generated. The full PPO/GRPO objective is more involved, but the core idea is an indicator in front of each token’s contribution. In simplified policy-gradient form:

$$\mathcal{L}(\theta) = -\frac{1}{\sum_t I(y_t)} \sum_t I(y_t)\, A_t \log \pi_\theta(y_t \mid x, y_{<t})$$

where $A_t$ is the advantage at position $t$, $I(y_t) = 1$ if the model generated token $y_t$, and $I(y_t) = 0$ if the token came from a retrieved passage. Without the indicator, retrieved tokens contribute to the gradient and pull the policy toward whatever was in the search results. With it, only tokens the model actually chose contribute.

Without the mask, the same model gets 34% average accuracy. With it, 43%. Nine points from one indicator function.
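A small sketch of what the mask does numerically, in plain Python. This is a simplified REINFORCE-style surrogate, not the clipped PPO or GRPO objective Search-R1 actually optimizes, and `masked_policy_loss` is a name I made up for illustration. The toy numbers are chosen so that the retrieved tokens have very low probability under the policy, which is typical: the model has no reason to predict a Wikipedia passage verbatim.

```python
import math

def masked_policy_loss(logprobs, advantages, is_model_token):
    """Average of -A_t * log pi(y_t) over model-written tokens only."""
    kept = [
        -adv * lp
        for lp, adv, keep in zip(logprobs, advantages, is_model_token)
        if keep
    ]
    return sum(kept) / max(len(kept), 1)

# Toy trajectory: 3 model tokens, 2 retrieved tokens, 1 model token.
logprobs = [math.log(p) for p in (0.5, 0.4, 0.8, 0.01, 0.02, 0.9)]
advantages = [1.0] * 6      # outcome reward broadcast to every position
mask = [1, 1, 1, 0, 0, 1]   # 0 marks tokens pasted in from retrieval

masked = masked_policy_loss(logprobs, advantages, mask)
unmasked = masked_policy_loss(logprobs, advantages, [1] * 6)

# Changing the retrieved tokens' log-probs leaves the masked loss untouched.
perturbed = logprobs[:3] + [math.log(0.5), math.log(0.5)] + logprobs[5:]
masked_after = masked_policy_loss(perturbed, advantages, mask)
```

Here the unmasked loss is dominated by the two retrieved tokens, so most of the gradient would go toward making the passage more likely rather than toward the model's own decisions.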

The principle generalizes. Anytime your RL setup includes tokens from outside the model (search results, tool outputs, environment state, function returns), you have to be careful which ones enter the gradient. Get it wrong and the model regresses toward whatever distribution those external tokens come from.

When the action space gets richer: ToolRL

Search-R1 works because search is a forgiving tool. The action space is small (queries to one engine). Verification is sharp (did the final answer match?). Search itself is robust, since even mediocre queries usually retrieve something useful.

Real tool use is messier. An agentic model might have access to dozens of tools, each with named parameters, some required and some optional, taking strings or numbers or structured objects. The model can fail by picking the wrong tool, by using wrong parameters, by getting parameter names right but values wrong, or by forgetting to call a tool that was needed.

ToolRL (Qian et al. 2026) looked at training in this richer setting. Two findings: one expected, one surprising.

The expected finding: smaller pieces of credit

Search-R1’s reward only checks the final answer. For richer tool use, that signal is too sparse. The model takes too many distinct actions between question and answer for “did it work in the end” to tell it which specific action was good or bad.

ToolRL breaks the reward into pieces. For each tool call the model makes, compare it to a ground-truth tool call from the training data and grade three things separately:

Tool name: did the model pick the right tool?
Parameter names: did it use the right arguments?
Parameter values: did it fill them in correctly?

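The total correctness reward for a tool call is the sum:

$$r_{\text{match}} = r_{\text{name}} + r_{\text{param}} + r_{\text{value}}$$

The tool-name match is an intersection over union: the number of tool names shared by the prediction and the ground truth, divided by the number of distinct names across both. Writing $N_P$ for the predicted tool names and $N_G$ for the ground-truth ones:

$$r_{\text{name}} = \frac{|N_G \cap N_P|}{|N_G \cup N_P|}$$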

Pick exactly the right set of tools and you get 1. Pick none and you get 0. The parameter-name match works the same way, with overlap between predicted and ground-truth parameter keys. The parameter-value match is stricter: exact equality between values, for the parameters that matched. Add the three components together, normalize, and you have a single scalar reward that scales smoothly from “nothing matched” to “everything matched perfectly.”
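The three components can be sketched in a few lines of Python. This is a simplification under my own assumptions, not ToolRL's implementation: it assumes at most one call per tool name, represents calls as a dict from tool name to arguments, divides value matches by the number of ground-truth parameters, and normalizes to [0, 1] by averaging (the paper uses its own scaling). The function and variable names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two collections; 1.0 when both are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def tool_call_reward(predicted, expected):
    """Partial-credit reward in [0, 1] for a set of tool calls.

    Both arguments map tool name -> {parameter name: value}.
    """
    if not predicted and not expected:
        return 1.0  # nothing was needed and nothing was called

    r_name = iou(predicted, expected)

    r_param = r_value = 0.0
    for tool in set(predicted) & set(expected):
        pred_args, true_args = predicted[tool], expected[tool]
        r_param += iou(pred_args, true_args)
        shared_keys = set(pred_args) & set(true_args)
        matches = sum(pred_args[k] == true_args[k] for k in shared_keys)
        r_value += matches / len(true_args) if true_args else 1.0

    n = max(len(expected), 1)
    # Average the per-tool terms, then average the three components.
    return (r_name + r_param / n + r_value / n) / 3

expected = {"get_weather": {"city": "Paris", "unit": "celsius"}}
perfect = tool_call_reward(expected, expected)
wrong_value = tool_call_reward(
    {"get_weather": {"city": "Paris", "unit": "kelvin"}}, expected
)
wrong_tool = tool_call_reward({"get_news": {"city": "Paris"}}, expected)
```

The middle case is the point: the right tool with the right argument names and one wrong value still earns most of the reward, where an all-or-nothing check would give it zero.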

The ablation: compared against a coarser version that only gives credit when the entire tool call matches exactly, the fine-grained reward trains faster and reaches higher final accuracy. The coarser version starves the model of useful gradient. When most of your trajectories fail somehow, partial credit for the parts you got right keeps the gradient informative.

The surprising finding: longer thinking hurts

In the R1 piece I argued that reasoning models think longer at inference because longer trajectories happened to correlate with correct answers, and the policy gradient followed the correlation. You’d expect this to extend to tool use: more thinking, more careful tool calls, better outcomes.

ToolRL tested this directly by adding a length reward, a small bonus for trajectories that produce longer thinking traces before tool calls. Here’s what happened on a tool-use benchmark called BFCL:

Setup	Accuracy (Qwen2.5-3B)
Standard (no length reward)	53.0%
Fixed length reward	48.9%
Dynamic length reward (escalating over training)	48.2%

The length rewards do their job: the traces visibly get longer. But accuracy gets worse, not better.

This reshapes how to think about the R1 piece. In math and code, longer reasoning traces meant more exploration before commitment, which translated cleanly into accuracy. In tool use, the same instinct backfires. The model about to commit to a tool call doesn’t always benefit from talking itself through one more time. It can talk itself out of the right call, or elaborate a wrong call into a worse one.

Thinking helps when the structure of the task rewards thinking. Math problems reward exploration before commitment. Tool use rewards picking the right tool, getting the arguments right, and getting on with it.

You can see hints of this in production models. Claude and GPT-4 produce different amounts of reasoning depending on whether you’ve asked them to do math or look up a fact. The faster response on a tool-use task isn’t laziness, it’s efficiency. Optimal trace length is task-shaped.

What this means for the models you use

When you watch Claude search the web, call a calculator, or run code, there’s a version of this kind of training behind it. Not literally Search-R1 or ToolRL (production training pipelines are more complex and not public), but the conceptual approach is the same: RL on rollouts where the model has tool access and there’s a clear definition of what a successful interaction looks like.

This explains some uneven patterns. Models are smoother with tools they saw a lot of during training (search, code execution, basic file operations) and rougher with niche or company-specific tools that didn’t get the same coverage. They’re also better at tools where success is checkable. For instance, a search returning useful results is verifiable.

The harness around the model (tool schemas, validation, retries, fallback rules) does some of this work too. But none of it substitutes for the model itself having the right reflexes: knowing when to call which tool, what arguments to pass, and when to skip the tool entirely. That part has to come from training.

What’s next, and the series wrap

This is the last piece in the series on RL training for LLMs. The arc:

REINFORCE foundations: the setup these algorithms live in. MDPs, returns, credit assignment, value vs policy, the RL objective.
REINFORCE gradient: the policy gradient derived. REINFORCE is weighted supervised learning.
PPO: the workhorse policy-gradient algorithm; REINFORCE plus five fixes for variance and data efficiency.
GRPO: the variant that drops the value network and powered DeepSeek’s reasoning models.
DPO: the offline simplification that collapses RLHF into a single supervised loss.
R1: what GRPO produces when you point it at math and code, with reasoning behavior consolidating into a long-chain-of-thought policy.
Search-R1 and ToolRL: what happens when you extend the recipe to tool use.

References

Jin, Bowen, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. “Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning.” arXiv Preprint. https://arxiv.org/abs/2503.09516.

Qian, Cheng, Emre Can Acikgoz, Qi He, et al. 2026. “ToolRL: Reward Is All Tool Learning Needs.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2504.13958.

R1-Zero was the result. R1 was the product

Yee Seng Chan — Tue, 10 Mar 2026 00:00:00 GMT

R1-Zero is the conceptual result. R1 is the engineering wrapping around it.

A reasoning model’s trace looks structurally different from a standard LLM’s. The model works through a problem, gets partway, says wait, that’s not quite right, backs up, tries a different approach, checks the result against the constraints, notices the check failed, revises again, then commits. Thousands of tokens of exploration, self-correction, and hypothesis revision before the boxed answer.

Nobody wrote down the rule “if your first approach looks wrong, say wait and try a different one.” Nobody curated a corpus of reasoning traces with backtracking and trained on them via SFT. The behavior consolidates out of a simple training procedure.

R1-Zero is the result. Everything else is interpretation.

The setup is simple. Start from DeepSeek-V3-Base (Liu et al. 2024), a 671B model pretrained on text but never instruction-tuned. No SFT, no RLHF, no human-written reasoning examples.

Run GRPO on it (PPO with the critic removed and a group-relative advantage in its place; the GRPO piece covers the details). Use math and code prompts where the answer is checkable. Two rewards:

Accuracy: exact-match against the ground-truth answer for math, test execution for code.
Format: the output uses the <think>...</think><answer>...</answer> template.

Both rule-based. No neural reward model, no process supervision, no verifier trained on preferences.
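Rule-based rewards of this kind are short enough to write out. The sketch below is my own illustration, not DeepSeek's code: the regular expression, the choice to require the template before granting accuracy credit, and the equal-weight sum are all assumptions, and it only covers the math-style exact-match case (code rewards would run tests instead).

```python
import re

# Whole output must be: <think> ... </think> <answer> ... </answer>
TEMPLATE = re.compile(r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*", re.S)

def format_reward(output):
    """1.0 if the output follows the think/answer template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(output) else 0.0

def accuracy_reward(output, ground_truth):
    """1.0 if the extracted answer exactly matches the ground truth, else 0.0."""
    match = TEMPLATE.fullmatch(output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(output, ground_truth):
    """Assumed combination: an unweighted sum of the two rule-based rewards."""
    return accuracy_reward(output, ground_truth) + format_reward(output)

good = "<think> 2 + 2 = 4 </think> <answer> 4 </answer>"
wrong = "<think> 2 + 2 = 5 </think> <answer> 5 </answer>"
unformatted = "4"
```

Nothing here looks at the contents of the think block, which is why whatever shows up there is learned rather than specified.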

What happens during training

AIME 2024 accuracy climbs from 15.6% to 71.0%, and to 86.7% with sixteen samples and majority voting (Guo et al. 2025). The surprising part is response length. Average response length grows monotonically across training.

Figure 1: Average response length of DeepSeek-R1-Zero grows monotonically over GRPO training, with no length term in the reward. The jump near step 8.2k coincides with the output cap being raised from 32K to 65K tokens. Source: DeepSeek-AI, 2025 (DeepSeek-R1).

The reward function never mentions length. The model discovers through trajectory comparison under GRPO that longer traces correlate with getting the answer right.

The paper flags an “aha moment”: mid-derivation on a competition math problem, the model produces “Wait, wait. Wait. That’s an aha moment I can flag here.” and re-evaluates a step it was about to commit to. Self-reflective tokens (wait, check, verify, mistake, wrong) grow alongside response length. None were in the reward function or the prompt template. They appeared because trajectories that included them got reward more reliably than trajectories without them.

What’s actually happening

The mechanism is policy gradient on long trajectories with sparse terminal reward. You do many rollouts. Most fail. The verifier returns a single bit at the end of each: correct or not. That bit gets broadcast across every token of every successful trajectory. Whatever distinguished successful trajectories from failures, on average, gets reinforced: backtracking, self-checks, alternative approaches, length.
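Here is a minimal sketch of that broadcast in the GRPO setting, assuming the usual group-relative normalization (reward minus the group mean, divided by the group standard deviation). The helper names are mine, and the choice of population standard deviation and the `eps` value are assumptions. Note that under this normalization failed rollouts are not ignored: they receive a negative advantage, copied onto every one of their tokens.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """One scalar advantage per rollout, relative to its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast(advantages, lengths):
    """Copy each rollout's single advantage onto every token it produced."""
    return [[a] * n for a, n in zip(advantages, lengths)]

# Four rollouts for one prompt: the verifier returns one bit per rollout.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # tokens per rollout (toy numbers)

advantages = group_relative_advantages(rewards)
per_token = broadcast(advantages, lengths)
```

Every token in a rollout gets the same weight, whether it was the step that cracked the problem or filler. The signal only becomes informative in aggregate, across many groups.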

Karpathy calls this “sucking supervision through a straw.” A minute of rollout updated by a single bit. It’s crude credit assignment, and it works given enough samples and gradient steps.

What R1-Zero shows

R1-Zero establishes an important claim: GRPO with rule-based verifiable rewards, on a sufficiently strong pretrained base model, consolidates latent reasoning behavior into a reliable long-CoT policy.

“Consolidates” is doing the work. Recent work on chain-of-thought decoding (Wang and Zhou 2024) shows that reasoning traces, including backtracking and self-correction, are already present in pretrained models, just not at rank one in the next-token distribution. R1-Zero doesn’t create reasoning behavior from nothing. It surfaces latent capability as a consistent policy, with trace lengths scaling up as the model discovers what gets rewarded.

R1 is what you build when the policy needs to be a product

R1’s contribution is different from R1-Zero’s. It’s the recipe for taking the raw reasoning behavior and making it a deployable assistant.

Why R1-Zero isn’t a product

R1-Zero mixes Chinese and English mid-trace and can produce correct math in prose so messy a user can’t tell. The reasoning policy was optimized for one thing (getting the answer right under verification) and it shows.

The four-stage pipeline

R1’s pipeline addresses this in four stages:

Cold-start SFT. A few thousand long-CoT examples sampled from R1-Zero outputs, refined for readability and language consistency, used to fine-tune V3-Base. Gives the next RL stage a stylistically aligned starting point.
R1-Zero-style RL. GRPO on the cold-start checkpoint, same rule-based rewards plus a language-consistency reward to suppress mixing.
Rejection sampling and second SFT. The post-RL model generates reasoning trajectories filtered for correctness, combined with non-reasoning examples, and used for another round of SFT.
Final RL. Rule-based rewards on verifiable tasks combined with neural reward models on helpfulness and harmlessness, the latter trained on human preferences in the standard RLHF way.

Test-time compute is downstream of training

Snell’s result (Snell et al. 2024) predates R1 and shows that a smaller model with adaptive test-time compute can match a 14× larger model on easy and medium-difficulty problems. But Snell got there by building substantial external scaffolding around the base LLM: a separately trained process reward model (PRM; scoring intermediate steps of a reasoning trace), a revision model fine-tuned on self-correction trajectories, beam search over PRM outputs, best-of-N with verifier-weighted aggregation. The scaffolding extracts the gains.

R1 displaced the stack. No PRM. No beam search. No revision model. No best-of-N at inference. The model produces one long CoT in one decoding pass. The DeepSeek paper classifies PRMs and MCTS-style search (Monte Carlo Tree Search) as unsuccessful attempts in their own development.

The scaffolding got internalized into the policy via training. “Wait, that’s wrong” is the model running its own verifier mid-trace. Backtracking is the model doing its own revision. Exploring an alternative approach is the model running its own search.

When somebody says a reasoning model “spends more compute at inference,” what they usually mean is that the model produces a longer CoT. The longer CoT exists because training made the longer CoT effective.

Verifiable rewards are the source and the boundary

Every successful application of this recipe shares a common feature: outcomes can be checked cheaply and unambiguously. Math problems with deterministic final answers. Coding problems with executable test suites. Formal theorem proving with proof checkers.

Reasoning behavior consolidates in R1-Zero because GRPO can attribute reward credit reliably across thousands of tokens based on a single end-of-trace check. The trace is long, the signal is sparse, the per-step contribution is invisible. None of that matters as long as the final-answer check is trustworthy. You compare trajectories, reward the ones that ended right, and over enough samples the policy gradient finds whatever in the middle of the trace was helping. Self-correction phrases, hypothesis exploration, longer length all win the comparison on average, when verification is reliable.

Run the same recipe with a learned reward model (say, a neural judge trained on human preferences for “good reasoning”) and the signal becomes noisy and gameable under optimization pressure. Rule-based rewards on verifiable domains don’t.

The limits of the recipe

“Verifiable” is narrower than it sounds. Math has answer checking. Competition coding has test cases. But real software engineering, where “correct” includes design, readability, maintainability, and integration, is not verifiable in the R1 sense. The R1 recipe reaches the verifiable subset of code, not coding as a practice.

Even within verifiable domains, the recipe doesn’t produce uniform competence. Recent empirical work (Shojaee et al. 2025) reports task-specific reliability ceilings: models collapse on puzzle instances of certain complexity even when nominally within their domain. Being inside a verifiable domain is necessary but not sufficient for reliable reasoning.

The same training that produces backtracking and self-correction can produce overthinking and second-guessing on tasks where the answer was obvious. R1 underperforms standard LLMs on instruction-following evaluations even while crushing AIME and Codeforces.

What’s next

R1-Zero is one demonstration in a broader pattern. The same recipe extends to multi-turn tool-using settings if outcomes remain verifiable. Search-R1 (Jin et al. 2025) is the cleanest demonstration: extend the recipe to a model that interleaves reasoning with search-engine calls, and the model learns to use the tool well as part of its reasoning policy.

The next piece covers what changes when you move from single-turn verifiable reasoning to multi-turn tool-using settings and how the boundaries of “verifiable” get tested when you add tools and external state.

In the broader arc:

PPO is the workhorse algorithm.
DPO is the offline alternative.
GRPO is the variant for verifiable reward settings.
R1 is what GRPO produces when you point it at math and code on a strong base model.

References

Guo, Daya, Dejian Yang, Haowei Zhang, et al. 2025. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv Preprint. https://arxiv.org/abs/2501.12948.

Liu, Aixin, Bei Feng, Bing Xue, et al. 2024. “DeepSeek-V3 Technical Report.” arXiv Preprint. https://arxiv.org/abs/2412.19437.

Shojaee, Parshin, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. 2025. “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” arXiv Preprint. https://arxiv.org/abs/2506.06941.

Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.” arXiv Preprint. https://arxiv.org/abs/2408.03314.

Wang, Xuezhi, and Denny Zhou. 2024. “Chain-of-Thought Reasoning Without Prompting.” arXiv Preprint. https://arxiv.org/abs/2402.10200.

DPO: RLHF collapsed into one loss

Yee Seng Chan — Fri, 06 Mar 2026 00:00:00 GMT

In May 2023, a paper showed up titled “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (Rafailov et al. 2023). The claim: you could do RLHF without a reward model, without PPO, without rollouts, and without RL training of any kind. Just preference pairs and a clever loss function.

If you’d been working with PPO + RLHF up to that point, this sounded too good to be true. PPO was famously painful. Four models in memory simultaneously: policy, value network, reference model, reward model. Online sampling during training. The Anthropic and OpenAI teams running serious RLHF had built infrastructure most people couldn’t afford to replicate.

DPO said: skip all of it. Train a single supervised loss on preference pairs. Done.

I followed DPO when it landed. I derived the math step by step. I wrote a small training pipeline around TRL’s DPOTrainer and got it running locally on a 24GB GPU. In parallel, the community’s reaction was also immediate: within months, the open-source preference-tuning ecosystem had largely shifted to DPO. Models like Zephyr (Tunstall et al. 2023) showed up. The teams running PPO infrastructure didn’t go away, but the floor opened up dramatically.

In this article, let me first anchor what DPO is replacing.

Why PPO + RLHF was painful

PPO + RLHF is a multi-stage pipeline. You start with an SFT model, then:

Collect a preference dataset: pairs of (prompt, chosen response, rejected response).
Train a separate reward model on this dataset, learning to score (prompt, response) pairs.
Run PPO using the reward model as the reward signal, generating fresh rollouts during training.

Training requires four LLM-sized models in memory: the SFT model serves as the reference for KL anchoring, the policy is the model being trained, the reward model scores rollouts, and PPO needs a value network on top.

The KL anchor matters because of what happens without it. The policy will drift toward outputs that score high under the reward model but bear no relationship to coherent text. Reward models, being LLMs themselves, have adversarial inputs: sequences that exploit weaknesses in the reward model and produce arbitrarily high scores while reading as gibberish. The KL term keeps the policy close to the SFT model.

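Mathematically, the RLHF objective is:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$

In words: maximize expected reward, subject to a KL penalty that keeps you close to the reference model. The hyperparameter $\beta$ controls the tradeoff.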

The punchline of DPO is that this objective has a closed-form optimal policy, and that closed form contains all the information you need to train against preference data directly: no reward model, no rollouts, no PPO. We’ll work through the derivation in a few paragraphs. But first, the conceptual move.

The DPO surprise

Figure 1: DPO collapsed the RLHF pipeline. PPO + RLHF runs three pipeline stages (preferences, train a reward model, run PPO with online rollouts) and keeps four LLM-sized models in memory: policy and value being trained in tandem, plus reward and reference as frozen scorers. DPO collapses this to two pipeline stages (preferences feeding directly into one supervised loss) and two models in memory (policy being trained, reference frozen as the KL anchor). Same KL-regularized objective; half the moving parts.

DPO’s pitch is structural. The RLHF pipeline has two coupled training problems: train a reward model that captures human preferences, then use that reward model to train a policy. DPO observes that these two steps are mathematically linked. Specifically, given the KL-regularized RLHF objective above, the optimal policy can be expressed as a function of the reward.

If you can express the reward in terms of the optimal policy, you don’t need to train a reward model separately. You can plug the implicit reward directly into the preference modeling framework, and now your policy is being trained to match preferences directly.

The result: a single supervised loss on preference pairs. No rollouts. No value network. No reward model.

Two models in memory: the policy being trained, and the reference model. That’s it. Compare to PPO + RLHF’s four. The memory and simplicity savings are significant.

But before showing the math, I want to land the right intuition for what DPO is doing, because there’s a misreading of DPO that’s tempting and wrong.

The key intuition: relative to the reference

The wrong reading of DPO: “It’s preference training. Increase chosen, decrease rejected.” Roughly what happens, but it misses the structural point.

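DPO doesn’t push chosen up and rejected down in absolute terms. It pushes the policy’s preference for chosen over rejected above what the reference already was. The intuition: once the policy favors chosen over rejected by a clear margin more than the reference does, DPO barely moves. While the policy is no better than the reference on a pair, or worse, DPO pushes hard.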

This is what preserves the KL anchor from RLHF. Naive preference training (just push chosen up, rejected down) would let the policy drift arbitrarily far from the reference. The “above what the reference already was” part is what keeps DPO solving the same KL-regularized problem PPO + RLHF solves. And it’s free: the KL anchor falls out of the derivation. No separate KL term in the loss.

The derivation: where the DPO loss comes from

I’m going to walk through this in four steps.

Step 1: Start from the KL-regularized RLHF objective

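Same objective as PPO + RLHF:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$

We want high reward, but we pay a penalty for drifting too far from the reference model. The hyperparameter $\beta$ trades off these two pressures.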

Step 2: Solve for the optimal policy

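The KL-regularized objective above has a closed-form solution. After working through the calculus (the partition function manipulation is the heart of it), the optimal policy is:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$

where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\!\left(\frac{1}{\beta} r(x, y)\right)$ is a normalizer (the partition function) that makes the result a valid probability distribution.

The optimal policy is the reference policy tilted toward high-reward completions, with the strength of the tilt controlled by $\beta$: the smaller $\beta$ is, the stronger the tilt.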

This step is doing real algebraic work. Deriving it requires writing the objective as a KL divergence between two distributions and identifying the minimizer. For a more granular walk-through of the derivation, see my earlier write-up at https://chanys.github.io/dpo/. For the conceptual flow, what matters is the form: the optimal policy is the reference, multiplied by an exponentiated reward term, normalized.
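A toy numerical check makes the tilt tangible. The sketch below treats one prompt with only three possible completions, so the normalizer can be computed by brute force. The probabilities and rewards are made-up numbers, and `optimal_policy` is a name I chose for illustration.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, for one fixed prompt."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(unnormalized)  # the partition function Z(x)
    return [u / z for u in unnormalized], z

pi_ref = [0.5, 0.3, 0.2]    # reference probabilities of three completions
rewards = [0.0, 1.0, 2.0]   # toy rewards for those completions

# Large beta: heavy KL penalty, so pi* stays close to the reference.
strong_anchor, _ = optimal_policy(pi_ref, rewards, beta=10.0)

# Small beta: weak KL penalty, so pi* piles onto the highest-reward completion.
weak_anchor, _ = optimal_policy(pi_ref, rewards, beta=0.1)
```

With three completions the sum inside `z` is trivial. With every possible token sequence as a completion it is intractable.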

Step 3: Rearrange to express reward through the policy

The expression in Step 2 has reward on the right and optimal policy on the left. We can flip it. Take the log of both sides:

$$\log \pi^*(y \mid x) = \log \pi_{\mathrm{ref}}(y \mid x) + \frac{1}{\beta}\, r(x, y) - \log Z(x)$$

Solving for $r(x, y)$:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

This is the “your language model is secretly a reward model” moment. The reward function, the thing we were going to train a separate reward model to estimate, can be expressed entirely in terms of the optimal policy and the reference policy. No reward model needed. The reward is encoded in the log-probability ratio between the optimal policy and the reference policy.

The partition function $Z(x)$ is still hanging around, which is annoying because computing it requires summing over all possible completions $y$. We’ll deal with it in the next step.
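The rearrangement is easy to sanity-check on a toy problem where the partition function can be computed exactly. The sketch below uses made-up numbers for one prompt with three completions: it builds the optimal policy from known rewards, then recovers those rewards from the log-probability ratio plus the $\beta \log Z(x)$ term.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]    # reference probabilities of three completions
rewards = [0.0, 1.0, 2.0]   # toy "true" rewards

# Forward direction: optimal policy from rewards.
unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(unnormalized)
pi_star = [u / z for u in unnormalized]

# Reverse direction: rewards from the policy.
log_ratio = [beta * math.log(ps / pr) for ps, pr in zip(pi_star, pi_ref)]
recovered = [lr + beta * math.log(z) for lr in log_ratio]

# The log Z term is the same constant for every completion of this prompt,
# so the gap between two rewards is visible in the log-ratios alone.
gap_from_rewards = rewards[2] - rewards[0]
gap_from_ratios = log_ratio[2] - log_ratio[0]
```

The recovered rewards match the originals, and the reward gap between any two completions can be read off without ever computing `z`.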

Step 4: Plug into Bradley-Terry

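The Bradley-Terry preference model gives the probability that one response is preferred over another, in terms of their rewards. Writing $y_w$ for the chosen response and $y_l$ for the rejected one:

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

where $\sigma$ is the logistic sigmoid. Substituting in our expression for $r(x, y)$ from Step 3:

$$p(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$

The partition function disappeared. Bradley-Terry only cares about reward differences between two completions for the same prompt, and $Z(x)$ depends only on the prompt: it cancels out in the subtraction. This is the reason DPO works as a practical algorithm. Without this cancellation, we’d be stuck computing partition functions over the full output space.

We replace $\pi^*$ with our learned policy $\pi_\theta$ and minimize the negative log-likelihood of the observed preferences:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

This is the DPO loss. Run gradient descent on it over a preference dataset, and you’re solving the same KL-regularized RLHF objective that PPO was solving, without the reward model, the rollouts, or the value network.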

What the loss is actually doing

Look at the structure of the DPO loss. There are two log-probability ratios: one for the chosen response, one for the rejected. Each compares the current policy’s probability to the reference policy’s probability. The loss wants the chosen log-ratio to be larger than the rejected log-ratio.

This is the formal version of the intuition I described earlier. The training pressure isn’t “increase chosen, decrease rejected.” It’s “increase the chosen-over-rejected log-ratio relative to the reference.” If the policy already favors the chosen response more strongly than the reference does, the loss is already partially satisfied and the gradient is small. If the policy has it backwards (it favors the rejected response more strongly than the reference does), the gradient is large, pushing hard to fix the ordering.

The KL anchor isn’t a separate term. It’s built into the structure of the loss. Both numerators in the log-ratios are the policy being trained; both denominators are the reference. The reference’s behavior is the implicit constraint on what the policy can do.
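For a single preference pair the whole loss fits in one small function. The sketch below is a plain-Python illustration, not TRL's implementation: it takes summed sequence log-probabilities as inputs (the argument names are mine), and the default `beta=0.1` is just a commonly used value rather than anything the derivation prescribes.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) = softplus(-margin), written in a stable form.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: both log-ratios are zero.
at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has raised the chosen response relative to the reference.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)

# Policy has raised the rejected response relative to the reference.
regressed = dpo_loss(-10.0, -9.0, -10.0, -12.0)
```

Only the two log-ratios enter the computation. When the policy equals the reference they are both zero and the loss is log 2, whatever absolute probabilities the reference assigned to the two responses.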

The original RLHF objective had a KL penalty as an added term. DPO collapses everything (the reward, the KL anchor, the policy update) into a single supervised loss where the structure of the equation enforces all three.

What DPO meant for the field

DPO didn’t replace online RL. It carved out a specific lane. What DPO won is the open-source ecosystem of budget-constrained preference tuning: for teams without lab-scale compute, DPO is the dominant approach. The structural distinction that matters isn’t PPO vs. DPO; it’s online (fresh rollouts during training) vs. offline (a fixed preference dataset). Online has an exploration advantage; offline is simpler. For most use cases, the simplicity wins.

In mid-2023, doing RLHF was something a small number of well-resourced labs could do. PPO was complex and engineering-intensive. The infrastructure for online preference tuning was confined to teams that had built it. Most people who wanted to fine-tune a model on preference data couldn’t realistically do it.

By late 2023, that had changed. DPO + QLoRA + 4-bit quantization meant the entire preference-tuning recipe could be run on a single GPU. The Zephyr models, the Tülu series, the wave of open instruction-tuned models that followed: all of these were downstream of DPO making the pipeline accessible. The simplification of preference tuning meant orders of magnitude more people could do it.

Appendix: Deriving the optimal policy

This is the gory derivation I deferred in Step 2: the manipulation that takes the KL-regularized RLHF objective and produces the closed-form optimal policy. I worked through this in late 2023, a few months after the DPO paper came out, because I wanted to convince myself the algorithm was real rather than take the result on faith. What follows is essentially that derivation, cleaned up. Skip this section if you’re satisfied with the form of the result. Read on if you want to see how the partition function shows up.

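We start with the RLHF objective:

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\, D_{\text{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)\Big]$$

The KL divergence can be written as an expectation:

$$D_{\text{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big) = \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]$$

So the inner expectation in our objective becomes:

$$\max_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right]$$

Divide by $-\beta$ to convert the maximization into a minimization, and the sign of the reward term flips:

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y)\right]$$

Use the identity $a = \log e^{a}$ to rewrite the reward term:

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \log \exp\left(\frac{1}{\beta} r(x, y)\right)\right]$$

Combine the two logarithms:

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)}\right]$$

The denominator inside the log isn’t a probability distribution: it doesn’t sum to one over $y$. We can fix this by introducing a normalizer $Z(x)$:

$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

and defining:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

By construction, $\pi^*$ is a valid probability distribution: it’s nonnegative everywhere and sums to one over $y$. Now we substitute it back into our objective. Add and subtract $\log Z(x)$ inside the bracket (no change, since these cancel):

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)} + \log Z(x) - \log Z(x)\right]$$

Use the identity $\log Z(x) = -\log \frac{1}{Z(x)}$ on the first $\log Z(x)$ term:

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)} - \log \frac{1}{Z(x)} - \log Z(x)\right]$$

Combine the first two log terms using $\log a - \log b = \log \frac{a}{b}$:

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)} - \log Z(x)\right]$$

Recognize the denominator inside the log as $\pi^*(y \mid x)$:

$$\min_{\pi} \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\left[\log \frac{\pi(y \mid x)}{\pi^*(y \mid x)} - \log Z(x)\right]$$

The term $\log Z(x)$ doesn’t depend on $\pi$, so it’s a constant with respect to the optimization. We can drop it. What’s left is exactly a KL divergence:

$$\min_{\pi} \; D_{\text{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi^*(\cdot \mid x)\big)$$

KL divergence is minimized when the two distributions are equal. So the minimizer is $\pi = \pi^*$:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$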

That’s the closed-form optimal policy. The rest of the DPO derivation (Steps 3 and 4 in the main text) follows from rearranging this expression to extract the implicit reward and plugging it into Bradley-Terry.
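If you would rather check the result than trust the algebra, a toy problem is small enough to brute-force. The sketch below assumes a single prompt with three possible completions and made-up values for the reference policy, the rewards, and beta; none of these numbers come from the DPO paper. It computes the closed-form policy and compares its objective value against a thousand randomly drawn alternatives.

```python
import math
import random

def objective(pi, pi_ref, rewards, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) over a small discrete output space."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, returned together with Z."""
    unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z

# Hypothetical toy problem: one prompt, three possible completions.
pi_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# Random challenger distributions; none should beat the closed form.
rng = random.Random(0)
challengers = []
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in pi_ref]
    total = sum(w)
    challengers.append(objective([x / total for x in w], pi_ref, rewards, beta))

print(best, max(challengers), beta * math.log(z))
```

In this toy setting the objective at the closed-form policy equals $\beta \log Z$, which is the constant that was dropped in the last step of the derivation.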

The partition function is the technical price of admission for this derivation. It would be intractable to compute directly as it requires summing over all possible completions, but it cancels out in the final DPO loss because Bradley-Terry only cares about reward differences. That cancellation is what makes DPO work as a practical algorithm.

References

Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” Advances in Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2305.18290.

Tunstall, Lewis, Edward Beeching, Nathan Lambert, et al. 2023. “Zephyr: Direct Distillation of LM Alignment.” arXiv Preprint. https://arxiv.org/abs/2310.16944.

GRPO: the algorithm behind reasoning models

Yee Seng Chan — Mon, 02 Mar 2026 00:00:00 GMT

In January 2025, DeepSeek released R1 (Guo et al. 2025): a reasoning model that closed most of the gap to OpenAI’s o1 at a fraction of the training cost, and (more importantly) shipped with the training recipe described in the open. The recipe centered on an algorithm called GRPO. Within a few months, essentially every open-source reasoning model (Qwen’s reasoning variants, Llama derivatives, the wave that followed) was using GRPO or close cousins of it.

GRPO itself wasn’t new in 2025. The DeepSeek team had introduced it a year earlier, in their DeepSeekMath paper (Shao et al. 2024), where they used it to train a 7B math-specialist model. What changed in 2025 wasn’t the algorithm. It was the demonstration that the same algorithm could produce general reasoning capability at frontier scale.

If you’ve been trying to understand how reasoning models like o1, R1, or their successors actually get trained, GRPO is most of the answer. This post is about what GRPO is, why it works, and where it sits relative to the PPO + RLHF recipe it (sometimes) replaces.

I’m assuming you already understand PPO at the level of the previous post in this series. That assumption matters because GRPO is best understood as a small, deliberate set of changes from PPO, not as a new algorithm built from scratch. Most of what makes PPO work (clipping, importance sampling, the multi-epoch inner loop) carries over unchanged. The changes are concentrated in two places.

Here’s the short version of GRPO: GRPO is PPO with the critic removed. Instead of using a value network to estimate per-token advantages, GRPO samples multiple rollouts per prompt and computes each rollout’s advantage relative to its siblings.

That’s the whole conceptual move. One model gone from memory. Per-token advantage replaced with per-trajectory advantage. Everything else stays. Which raises the question of why this works at all, and why it works especially well for reasoning.

Figure 1: GRPO is PPO with the critic removed. Two changes: the value network is dropped (four model roles become three), and the advantage is computed per trajectory from a group of sibling rollouts instead of per token from GAE. Clipping, the importance ratio, KL anchoring, and the multi-epoch loop are unchanged.

Why drop the critic?

In PPO, four model roles are active during training:

Policy: the LLM being trained
Value network: the critic, also trained, typically same architecture as the policy
Reference model: frozen copy of the SFT model, for KL anchoring
Reward model: frozen, scores complete responses

That’s four LLM-sized things in the picture. The actual memory cost depends on a lot of implementation details (whether the policy and value share a body, how parameters are sharded across GPUs, optimizer state for the trained models) but the structural fact stands: PPO involves four model roles, and reducing that count is a real win.

Two of those roles you can’t avoid. The reference model is structurally necessary; without it you have no KL anchor and the policy drifts into reward-hacking gibberish. The reward model (or verifier) provides the training signal in the first place.

But the value network? It’s only there to compute advantages. If you can compute advantages another way, you can drop it.

A natural question is whether the reward model could just take over. It can’t, and the reason matters.

The reward model and the value network do fundamentally different jobs. The reward model answers: “How good is this complete response?” It takes a (prompt, response) pair and outputs a scalar. It was trained on human preferences over finished outputs, so it has no signal for half-finished responses. The value network answers a different question: “Given this partial state, what total reward do I expect by the end?” It takes a state (prompt plus tokens-so-far) and predicts where things are heading.

These are different questions. One scores finished things. The other predicts where things are heading. The reward model can’t do the second job because no human ever rated half-responses during preference data collection.

So the value network isn’t redundant. PPO needs it because GAE (the per-token advantage estimator) needs to know the expected outcome at every state along the trajectory, and only the value network can provide that.

GRPO’s move isn’t to substitute the reward model for the value network. It’s to give up on per-token advantage estimation entirely.

Group-relative advantages

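Here’s the new advantage computation. For each prompt, generate $G$ rollouts (typically 4 to 16). Score each one with the verifier to get rewards $r_1, \ldots, r_G$. The advantage of the $i$-th rollout is:

$$A_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$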

That’s it. Each rollout gets one scalar advantage (“how much better than my sibling rollouts did this one do?”) applied identically to every token in that rollout.

This is a different kind of advantage than what PPO uses. PPO had a different advantage per token, derived from GAE; GRPO has the same advantage for every token in a rollout, derived from how the rollout’s reward compared to its siblings’. The credit is assigned at the trajectory level, not the token level. That’s the cost of dropping the value network: less granular credit assignment. But it’s also why GRPO works without per-token value estimates.

Let me make this concrete with the example I’ll use throughout: training a model to solve grade-school math word problems.

“Maya is buying tickets for a concert. Adult tickets cost $12 each and child tickets cost $8 each. She buys 9 tickets total and spends $92. How many adult tickets did she buy?”

The answer is 5. The verifier extracts the model’s final boxed answer and checks: 1 if correct, 0 if not.
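A verifier of this kind can be very small. The sketch below is one possible version, assuming answers are written as \boxed{...} with no nested braces and compared as exact strings; the function names are hypothetical and a production checker would normalize answers more carefully.

```python
import re

def boxed_answer(text):
    """Return the contents of the last \\boxed{...} in text, or None if there is none."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verify(response, gold="5"):
    """Binary reward: 1 if the final boxed answer equals the gold answer, else 0."""
    return 1 if boxed_answer(response) == gold else 0

print(verify(r"12a + 8(9 - a) = 92, so a = 5. \boxed{5}"))
print(verify(r"I think the answer is \boxed{6}"))
```

Taking the last boxed expression means a rollout that revises its answer mid-response is scored on where it ended up.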

Sample 4 rollouts ($G = 4$):

Rollout A: sets up the equation $12a + 8(9 - a) = 92$, solves cleanly, ends with \boxed{5}. Reward: 1.
Rollout B: tries $a + c = 9$ and $12a + 8c = 92$ but makes an arithmetic slip, ends with \boxed{6}. Reward: 0.
Rollout C: solves it correctly via testing values, ends with \boxed{5}. Reward: 1.
Rollout D: confuses adult and child prices, ends with \boxed{4}. Reward: 0.

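Compute the group statistics. The mean is straightforward:

$$\mu = \frac{1 + 0 + 1 + 0}{4} = 0.5$$

For the standard deviation, take squared deviations from the mean and sum them:

$$\sum_{i=1}^{4} (r_i - \mu)^2 = 0.25 + 0.25 + 0.25 + 0.25 = 1$$

Then divide by $G - 1 = 3$ (only 3 of the 4 deviations are independent once the mean is fixed) and take the square root:

$$\sigma = \sqrt{1/3} \approx 0.577$$

(Some implementations divide by $G$ instead, giving $\sigma = \sqrt{1/4} = 0.5$. The choice rescales all advantages by the same factor and doesn’t change their signs, but it can matter when comparing hyperparameters across codebases.)

Advantages are then $A_i = (r_i - \mu)/\sigma$:

$A_A = (1 - 0.5)/0.577 \approx +0.866$ (rollout A)
$A_B = (0 - 0.5)/0.577 \approx -0.866$ (rollout B)
$A_C = (1 - 0.5)/0.577 \approx +0.866$ (rollout C)
$A_D = (0 - 0.5)/0.577 \approx -0.866$ (rollout D)

In the gradient update, every token in rollouts A and C gets multiplied by $+0.866$, and every token in rollouts B and D gets multiplied by $-0.866$. The policy gets pushed toward the trajectories that solved the problem, away from the ones that didn’t.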

The intuition is clean: of these attempts, which ones got it right? Reinforce those.
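The same arithmetic takes a few lines of Python. This is a sketch rather than any library’s implementation: it uses the sample standard deviation (dividing by the group size minus one, as in the worked example), and the small `eps` in the denominator is an assumption added here to keep the division safe.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """One scalar advantage per rollout: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.stdev(rewards)  # sample std: divides by len(rewards) - 1
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for rollouts A, B, C, D from the ticket problem.
advantages = group_advantages([1, 0, 1, 0])
print([round(a, 3) for a in advantages])  # [0.866, -0.866, 0.866, -0.866]
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the divide-by-group-size variant, which scales every advantage by the same factor.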

What happens when all rollouts score the same

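A subtle but important case. If all 4 rollouts get reward 0 (model failed every attempt), or all 4 get reward 1 (model succeeded every time), then the standard deviation is 0, the mean equals every $r_i$, and every advantage is 0. (Implementations typically add a small $\epsilon$ to the denominator so this comes out as a clean 0 rather than 0/0.)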

This isn’t a bug. It’s the algorithm correctly saying “there’s no learning signal here.” If all attempts failed, we don’t know which failures were closer to success. If all succeeded, we don’t know which successes were genuinely well-reasoned versus lucky. Either way, no rollout in this group is “better than its siblings,” so no gradient signal gets generated.

The practical implication: GRPO is most effective when the model is at a difficulty level where it sometimes succeeds and sometimes fails. Too easy and all rewards are 1. Too hard and all rewards are 0. Either way, no signal. Training data needs to sit in the model’s “frontier” of difficulty, which is why curriculum learning matters more for GRPO than it does for PPO.
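One way to act on this is to check each prompt’s group before spending a gradient step on it. The sketch below is an illustration under assumptions, not a description of DeepSeek’s pipeline: the group names and rewards are invented, and filtering by zero variance is just one simple policy.

```python
import statistics

def has_learning_signal(rewards):
    """True if at least two rollouts in the group received different rewards."""
    return statistics.pstdev(rewards) > 0

# Hypothetical reward groups for three prompts of different difficulty.
groups = {
    "too_hard": [0, 0, 0, 0],
    "too_easy": [1, 1, 1, 1],
    "frontier": [1, 0, 1, 0],
}
informative = [name for name, r in groups.items() if has_learning_signal(r)]
print(informative)  # ['frontier']
```

Tracking the fraction of groups that pass this check over time is a cheap proxy for whether the training data still sits at the model’s frontier.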

Everything else looks PPO-shaped

With the advantage computed, the rest of the algorithm is structurally identical to PPO.

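The clipped policy loss has the same form as PPO. Same importance ratio:

$$\rho_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}$$

Same min-of-clipped-and-unclipped construction for the per-token policy loss:

$$\mathcal{L}^{\text{policy}}_{i,t} = -\min\Big(\rho_{i,t}\, A_i,\; \text{clip}\big(\rho_{i,t},\, 1 - \epsilon,\, 1 + \epsilon\big)\, A_i\Big)$$

The only difference from PPO: $A_i$ is now the trajectory-level group-relative advantage rather than the token-level GAE advantage. Every token in a given rollout shares the same $A_i$. Clipping mechanism, importance ratio, asymmetric min: all unchanged.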

The multi-epoch inner loop is the same: collect rollouts, do $K$ epochs of mini-batched gradient steps, discard, repeat.

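KL anchoring is also the same idea, though implemented differently. PPO often folds the KL term into the per-token rewards before computing advantages. GRPO can’t easily do this, because rewards in GRPO arrive only at the end of each rollout, with no per-token reward to fold into. So GRPO adds the KL as a separate per-token loss term:

$$\mathcal{L}^{\text{KL}}_{i,t} = \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\text{ref}}(y_{i,t} \mid x, y_{i,<t})}$$

This is the simple log-ratio estimator of KL, the same one PPO uses. (The DeepSeekMath paper itself uses a different, always-nonnegative estimator of the same KL; the log-ratio form here keeps the parallel with PPO.) The two per-token losses then combine into a single scalar that gets backpropped:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i} \sum_{t} \Big(\mathcal{L}^{\text{policy}}_{i,t} + \beta\, \mathcal{L}^{\text{KL}}_{i,t}\Big)$$

where $N$ is the total number of tokens in the mini-batch. Two separate per-token loss components, summed with a weight $\beta$, averaged across all tokens in the mini-batch. Both contribute their own gradients to the policy parameters.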
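To make the pieces concrete, here is a scalar sketch of the per-token loss in plain Python. It assumes the simple log-ratio KL term described above and takes per-token log-probabilities as plain floats; the function names and the default values for `clip_eps` and `beta` are illustrative choices, and a real implementation would operate on tensors with autograd.

```python
import math

def grpo_token_loss(logp_new, logp_old, logp_ref, advantage, clip_eps=0.2, beta=0.04):
    """Clipped policy loss plus beta-weighted log-ratio KL term for one token."""
    ratio = math.exp(logp_new - logp_old)                   # importance ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    policy_loss = -min(ratio * advantage, clipped * advantage)
    kl_loss = logp_new - logp_ref                           # log(pi_theta / pi_ref)
    return policy_loss + beta * kl_loss

def grpo_loss(tokens):
    """Average the per-token loss over every token in the mini-batch."""
    return sum(grpo_token_loss(*tok) for tok in tokens) / len(tokens)

# Policy unchanged since the rollout: ratio is 1, KL term is 0.
on_policy = grpo_token_loss(-1.0, -1.0, -1.0, 0.866)
# Ratio of 2 with a positive advantage: the gain is capped at 1 + clip_eps.
capped_gain = grpo_token_loss(math.log(2.0) - 1.0, -1.0, math.log(2.0) - 1.0, 1.0)
# Ratio of 2 with a negative advantage: the min keeps the unclipped, larger penalty.
full_penalty = grpo_token_loss(math.log(2.0) - 1.0, -1.0, math.log(2.0) - 1.0, -1.0)

print(on_policy, capped_gain, full_penalty)
```

The last two cases show the asymmetry of the min: the clip limits how much credit a token can earn from a large ratio, but not how much blame.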

If you understand PPO, you understand most of GRPO. Two things change: how the advantage gets computed, and what models are in memory. Everything else carries over.

Why this works for reasoning specifically

GRPO didn’t have to be a reasoning algorithm. The math is generic. You could use it for any task with a scalar reward at the end. But there’s a structural reason it works especially well for reasoning.

Reasoning tasks have two properties that are awkward for PPO:

Rewards are usually verifiable. Math problems have right answers. Code can be tested. The reward is a deterministic check (“did the code pass the tests?”), not a learned approximation of human preferences. This is the world of Reinforcement Learning with Verifiable Rewards (RLVR) and it’s the world GRPO was made for.

Reward is genuinely sparse. You don’t know if a chain of thought is good until you see whether it produced the right final answer. Intermediate tokens can look identical between a trajectory that ends correctly and one that ends wrong, until the very end.

PPO handles sparse rewards via the value network, which has to learn “how good does this partial reasoning trajectory look?” That’s a hard regression problem. Reasoning trajectories can look identical for many tokens, then diverge wildly at the answer. The value network struggles to fit this signal cleanly, and the resulting advantages are noisy.

GRPO sidesteps the problem entirely. It doesn’t try to predict per-token value. It just compares whole rollouts to each other.

There’s also a practical reason GRPO and verifiable rewards pair well. GRPO needs $G$ rollouts per prompt to compute group statistics, meaning the reward function gets called $G$ times per prompt. If the reward function is a few lines of Python that runs in microseconds (a math correctness checker, a code test runner), scaling up $G$ is effectively free. If the reward function is a 7B-parameter reward model requiring a full forward pass, inference cost gets expensive fast. GRPO’s economics work in the verifier setting.

This is also where the practical scope of “verifiable” matters. Verifiable rewards work cleanly for math (correctness check on the final answer), code (test cases), formal proofs (proof assistants), and constrained generation (regex compliance). They don’t work for tasks like “is this response helpful?” or “is this writing tasteful?” Those are inherently subjective and need RLHF-style learned rewards. So GRPO + RLVR doesn’t replace RLHF for general assistant training. It’s a complementary tool for a specific (and increasingly important) class of tasks.

What R1-Zero showed

When DeepSeek published the R1 paper in early 2025, they actually published two models, and the distinction is informative.

DeepSeek-R1 is the deployable model. Trained with a multi-stage pipeline: cold-start SFT, then GRPO + RLVR, then more SFT, then more RL. This is the one people actually use.

DeepSeek-R1-Zero is the surprising one. Trained with pure RL (GRPO + verifiable rewards) directly on the DeepSeek-V3 base model, with no SFT stage at all. R1-Zero showed that reasoning behavior could emerge from RL alone, without any supervised demonstrations of how to reason. The DeepSeek paper documents the specific behaviors that emerged during training:

Generating progressively longer chains of thought as training continued: the model learning that more inference-time compute helped it solve harder problems.
Reflecting on its own solutions, revisiting and evaluating earlier reasoning steps.
Exploring alternative approaches when a first attempt didn’t pan out.
Critiquing intermediate steps and catching its own errors.

None of this was programmed in. The RL process simply rewarded the model for producing correct answers in the right format, and the rest emerged. The DeepSeek paper calls this a “self-evolution”: the model learning, through exploration and reward, how to reason without ever being shown an example.

R1-Zero’s outputs were rough at the surface. The model would mix languages mid-response (sometimes Chinese in the middle of an English answer), use inconsistent formatting, occasionally produce incoherent passages. Strong reasoning, weak presentation. That’s what made the deployable R1 require additional SFT stages on top, to clean up the surface presentation while keeping the reasoning capability.

The R1-Zero result changed how people thought about what RL could do. The conventional wisdom had been that you needed SFT to give the model a starting distribution before any RL could help. R1-Zero showed that’s not strictly true for capabilities that have verifiable signals. Pure RL on a base model can produce reasoning. You don’t need to demonstrate how to reason; you just need to be able to verify when it worked.

The current consensus pipeline for production reasoning models combines: SFT for assistant behavior, preference tuning for helpfulness/harmlessness/honesty, RLVR for reasoning capability. Each stage does a job the others can’t do well. R1-Zero showed the RL stage could work without the prior stages; R1 showed why production models include them anyway.

Where GRPO fits in the broader landscape

The post-training landscape as of 2026 has split into a few distinct lanes.

Frontier general RLHF (proprietary, OpenAI / Anthropic / Google) still uses PPO or proprietary variants of it. Frontier labs have the engineering resources to handle PPO’s complexity and care about every fraction of a percent of quality. PPO is online: it generates new rollouts and learns from them, whereas DPO is offline, learning from a fixed preference dataset. That exploratory advantage seems to matter at the frontier. The frontier labs don’t share their pipelines publicly, so it’s hard to know exactly what they do, but PPO and its descendants appear to remain in use.

Open-source preference tuning has largely migrated to DPO and its variants. DPO is much simpler than PPO: no rollouts, no value network, no reward model, just a clever loss on preference pairs. For most open-source teams, the simplicity is worth more than whatever quality edge PPO might provide.

Reasoning models use GRPO + RLVR. This is the lane DeepSeek pioneered with R1, and most open-source reasoning models follow the same recipe.

Figure 2: The post-training landscape in 2026, split into three lanes. PPO and DPO share the general-RLHF, preference-data space (frontier labs use PPO for quality; open source uses DPO for simplicity); GRPO plus RLVR owns reasoning, where rewards are verifiable. What you have decides the lane.

The practical takeaway: PPO and DPO/variants share the general-RLHF space (frontier vs open-source split); GRPO + RLVR owns reasoning. For practitioners, the choice depends on what you have. Preference data and a tight budget? DPO. Preference data and frontier-quality ambitions? PPO + RLHF. Verifiable rewards (math, code, theorem proving)? GRPO + RLVR.

Summary

In the broader arc of this series, the next piece covers DPO, the offline alternative that displaced PPO in much of the open-source ecosystem. DPO is conceptually different from GRPO (it collapses the whole RL pipeline into a single supervised loss), but it solves a related problem.

There’s a broader point worth naming here. PPO is hard. The four-model setup, the value network’s instability, and the sensitivity to hyperparameters are real engineering challenges. For years, these kept serious RL research mostly confined to a handful of well-resourced labs. GRPO didn’t just simplify an algorithm. It made RL training accessible to teams that couldn’t have run a stable PPO pipeline if they tried. The story is about an algorithm simple enough that more people could actually use it.

References

Guo, Daya, Dejian Yang, Haowei Zhang, et al. 2025. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv Preprint. https://arxiv.org/abs/2501.12948.

Shao, Zhihong, Peiyi Wang, Qihao Zhu, et al. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv Preprint. https://arxiv.org/abs/2402.03300.

PPO is REINFORCE plus five fixes

Yee Seng Chan — Tue, 24 Feb 2026 00:00:00 GMT

Most treatments of PPO either start with MDPs and Bellman equations and lose you in the abstract before the algorithm appears, or treat PPO as a black box. Neither helps when you actually want to understand it. Here’s the framing that finally made it stick for me: PPO (Schulman et al. 2017) is REINFORCE with five fixes that, together, solve two problems.

The plan: walk through the two problems REINFORCE has, watch each PPO fix arise as a response to one of them, and end by tracing how the final objective decomposes all the way down to things you can actually compute.

REINFORCE in one breath

Reinforcement learning is the framework where an agent takes actions in an environment, the environment responds with reward and a new state, and the agent’s job is to find a strategy, a policy, that collects as much reward as possible. Unlike supervised learning, the agent generates its own training data by acting. Bad policies generate bad data. That’s what makes RL hard in a way classification never is.

The simplest policy gradient algorithm is REINFORCE. Run the policy. Observe what happens. Update the policy so good actions become more likely and bad ones less likely. The objective being maximized:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} G_t \log \pi_\theta(a_t \mid s_t)\right]$$

Reading the parts: $\pi_\theta(a_t \mid s_t)$ is the probability the policy assigns to the action $a_t$ that was actually taken in state $s_t$. $G_t$ is the return from time $t$: the sum of future rewards from that point to the end of the trajectory.

REINFORCE as supervised learning, with twists

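Here’s the framing that finally made REINFORCE click for me. Standard supervised cross-entropy classification minimizes:

$$\mathcal{L}_{\text{CE}}(\theta) = -\sum_{i} \log p_\theta(y_i \mid x_i)$$

where $y_i$ is the true label for $x_i$. Compare to REINFORCE:

$$\mathcal{L}_{\text{REINFORCE}}(\theta) = -\sum_{t} G_t \log \pi_\theta(a_t \mid s_t)$$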

Same form, two changes:

No labels: use the sampled action as a pseudo-label. RL doesn’t tell us what the “correct” action was. So we use the action the policy actually sampled as a stand-in. The training signal becomes “do more of what you just did.” Which sounds insane, until you remember the second change.
Weight each pseudo-labeled example by how the trajectory turned out. Supervised learning treats every example as equally important. Every label is correct by definition. REINFORCE weights each term by $G_t$. Positive $G_t$ scales the term up: do more of this. Negative $G_t$ flips its sign: do less.

Read this way, REINFORCE is supervised learning where the agent generates its own pseudo-labels by sampling, and each pseudo-labeled example is weighted by how things turned out. That’s the whole conceptual content of the algorithm. The gradient mechanics (softmax derivatives, autograd, the chain rule) are exactly the gradient mechanics of supervised cross-entropy, scaled by $G_t$. If you’ve ever trained a classifier, you’ve already done most of the work.
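The “cross-entropy scaled by the return” claim is easy to verify on a single timestep. The sketch below uses plain Python with made-up logits and a made-up return of −2.5; the helper names are illustrative. It compares the analytic gradient with respect to the logits, return × (softmax − one-hot), against a finite-difference estimate.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_nll(logits, action, weight):
    """weight * -log softmax(logits)[action]; weight=1 is plain cross-entropy."""
    return -weight * math.log(softmax(logits)[action])

def analytic_grad(logits, action, weight):
    """Gradient w.r.t. the logits: weight * (softmax - one_hot(action))."""
    probs = softmax(logits)
    return [weight * (p - (1.0 if i == action else 0.0)) for i, p in enumerate(probs)]

def numeric_grad(logits, action, weight, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += h
        down[i] -= h
        grads.append((weighted_nll(up, action, weight) - weighted_nll(down, action, weight)) / (2 * h))
    return grads

logits = [0.2, -1.0, 0.7]   # hypothetical policy outputs for one state
sampled_action = 2           # the pseudo-label
ce_grad = analytic_grad(logits, sampled_action, weight=1.0)
reinforce_grad = analytic_grad(logits, sampled_action, weight=-2.5)
check = numeric_grad(logits, sampled_action, weight=-2.5)
print(ce_grad, reinforce_grad, check)
```

With a negative return the gradient points the opposite way from the classification gradient, which is the “do less of this” case.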

The training loop

The loop is correspondingly small:

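import torch.nn.functional as F

# env, policy, optimizer, rollout, gamma, and num_episodes are assumed to be defined elsewhere.
for episode in range(num_episodes):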
    # 1. Roll out a trajectory under the current policy
    states, actions, rewards = rollout(env, policy)

    # 2. Compute returns G_t for each timestep, working backward
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)

    # 3. Form the loss and update
    loss = 0
    for s_t, a_t, G_t in zip(states, actions, returns):
        logits = policy(s_t)
        log_probs = F.log_softmax(logits, dim=-1)
        loss = loss - G_t * log_probs[a_t]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Three things to notice. Autograd does all the work. We never compute the policy gradient by hand. We construct a loss whose gradient is the policy gradient and let the framework handle it. The negation in loss = loss - G_t * log_probs[a_t] is there because optimizers minimize by default, so we minimize $-J$, equivalent to maximizing $J$. Returns are computed backward through the trajectory in one pass using the recursion $G_t = r_t + \gamma G_{t+1}$.

The two problems

REINFORCE works. People used it. Two problems become obvious the moment you scale it.

Problem 1: high variance. The “total reward that followed an action” varies wildly across rollouts even when the action itself was fine. Take the same action twice from the same state under a stochastic policy and you might see very different total rewards because what happened afterward was different. The training signal jumps around. Learning is slow and unstable.

Problem 2: data inefficiency. A trajectory is expensive to collect. For an LLM, sampling a 500-token response takes real GPU time. REINFORCE throws each one away after a single gradient update, because after that update the policy has changed and the old data is no longer “from the right distribution,” a sense we’ll make precise when we get to importance sampling.

PPO fixes both. The first three fixes attack variance. The last two attack data inefficiency. Each fix introduces machinery the next one needs, so they’re easiest to read in order.

Figure 1: PPO is REINFORCE plus five fixes. Fixes 1–3 attack REINFORCE’s variance problem; fixes 4–5 attack its data-inefficiency problem. Each fix introduces machinery the next one needs.

PPO as five fixes

Fix 1: Use the advantage, not the raw reward

The first variance-reduction trick is to stop asking “what reward followed this action?” and start asking “did this action do better than expected?”

A concrete anchor for the rest of the piece: imagine training a chatbot to answer the question “What is PPO?” The model generates a response token by token: “PPO is a reinforcement learning algorithm…”. Then at the end a reward model scores the whole response. Each token is one action. Each full response is one trajectory.

When the model picks “reinforcement” partway through, was that a good choice? It depends on what was available at that point. If most plausible alternatives at that position would have led to a coherent response anyway, the choice isn’t special. If most would have led somewhere worse, then “reinforcement” was a good pick we want to reinforce.

The “expected reward from this state” is the value of the state. The action’s advantage is how much better than the value the actual outcome was:

Advantage = (what happened) − (what we'd expect on average from this state)

Substituting advantage for raw reward in the policy gradient cuts variance dramatically. Actions that go as expected contribute roughly zero to the gradient. Only genuinely surprising actions, good or bad, drive updates. The noise from “all of these actions were fine, actually” stops contributing.

The catch: to compute advantage, you need the value of each state. We don’t know it directly. So we learn it, with a second neural network.

Fix 2: Add a value network (the critic)

PPO trains two networks in parallel:

The policy network (the actor) takes a state, outputs a distribution over actions. For LLMs, this is the language model.
The value network (the critic) takes a state, outputs a single number: the expected total reward from this state onward.

For the chatbot, the value network looks at a partial response, e.g. “PPO is a reinforcement learning”, and outputs something like 0.6, meaning “responses that start this way tend to score around 0.6 from the reward model.” After the next token, the partial response becomes “PPO is a reinforcement learning algorithm” and the value might shift to a slightly higher score of 0.65, indicating that the response is heading somewhere good. The critic is essentially a running estimate of “how well is this generation going so far?”

The critic doesn’t act. It just judges. Taken together, the actor and critic give an actor-critic architecture, of which PPO is one specific instance.

How does the critic learn? Standard regression. After a rollout, you have actual rewards $r_0, r_1, \ldots, r_{T-1}$. From any state $s_t$ along the trajectory, the actual return is $G_t = \sum_{k \ge 0} \gamma^k r_{t+k}$. Train the critic to predict $G_t$ from $s_t$ using mean-squared error. Over many trajectories the critic becomes a calibrated estimator, and the advantages it underwrites become reliable.

For LLMs, the value network is typically a copy of the policy architecture: same transformer body, with a small linear head outputting a scalar instead of a token distribution. Whether the bodies are shared (parameters trained jointly) or separate copies (parameters trained independently) varies by implementation.

Fix 3: Estimate the advantage with GAE

There’s a subtle decision in how to compute the advantage from rollout data. Two extremes:

Use the actual rewards that followed. Accurate but noisy: at the mercy of every random thing that happened in the trajectory.
Trust the value network’s predictions. Stable but biased: if the critic is wrong, your advantages are wrong.

You can blend. Use the actual rewards for the next few steps, then have the critic estimate the rest. This is the n-step family of advantage estimators, parameterized by $n$, the number of real-reward steps to use before deferring to the critic. Small $n$ trusts the critic. Large $n$ trusts the rewards.

Rather than picking one $n$, Generalized Advantage Estimation (Schulman et al. 2016) takes an exponentially-weighted average across all $n$, where $\hat{A}_t^{(n)}$ denotes the n-step estimate:

$$\hat{A}_t^{\text{GAE}} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \hat{A}_t^{(n)}$$

The prefactors $(1 - \lambda)\lambda^{n-1}$ normalize the weights to sum to 1 (geometric series). Small $\lambda$ puts most weight on the 1-step estimator (trust the critic); large $\lambda$ flattens the weights toward the full-trajectory estimator (trust the rewards). Default $\lambda = 0.95$.

The pleasing thing is that this weighted average collapses to a clean recursion. The intermediate identity that does the work is the TD residual:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

“TD” stands for Temporal Difference: the residual compares two value predictions ($V(s_t)$ and $V(s_{t+1})$) separated by one time step, after accounting for the actual reward observed during that step. If the critic were perfect, $\delta_t$ would be zero on average. When nonzero, $\delta_t$ measures how far this step’s outcome departed from the critic’s prediction, and in which direction. The same number has two names because it plays two roles: it’s a prediction error of the critic (which is what trains the value network) and it’s the simplest possible advantage estimate ($\hat{A}_t^{(1)} = \delta_t$, the first row of the n-step ladder).

Now the GAE collapse. Each n-step advantage can be written as a sum of TD residuals, $\hat{A}_t^{(n)} = \sum_{l=0}^{n-1} \gamma^l \delta_{t+l}$ (the unpacking chain at the end walks through this for the 2-step case). Substituting into the GAE weighted average and collecting by $\delta_{t+l}$, the geometric series collapses each coefficient: $\delta_{t+l}$ ends up with weight $(\gamma\lambda)^l$. So:

$$\hat{A}_t^{\text{GAE}} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

That’s the entire GAE computation: a backward pass through the trajectory accumulating $\gamma\lambda$-decayed TD residuals.

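# values has len(rewards) + 1 entries; the last is the bootstrap value
# for the state after the final step (0 if the episode ended there).
advantages = []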
gae = 0
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    gae = delta + gamma * lam * gae
    advantages.insert(0, gae)

That’s it. GAE is one of the most important practical contributions in policy gradient methods, and most actor-critic algorithms, PPO included, use it. But you don’t need to derive it to use it. Two knobs: $\gamma$ (discount factor, typically 0.99) and $\lambda$ (the GAE parameter, typically 0.95).
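To see the two knobs at work, here is a minimal sketch of the backward pass wrapped in a function and run on a made-up 3-step episode. The function name `compute_gae`, the rewards, and the critic values are all hypothetical; the point is only that `lam=0` reproduces the per-step TD residuals and `lam=1` reproduces "return minus value".

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Backward-pass GAE.

    `values` must have len(rewards) + 1 entries; the extra last entry is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                          # GAE recursion
        advantages[t] = gae
    return advantages


# Hypothetical 3-step episode that ends in a terminal state.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]  # critic predictions, plus 0.0 for the terminal state

adv_td = compute_gae(rewards, values, gamma=1.0, lam=0.0)  # trust the critic
adv_mc = compute_gae(rewards, values, gamma=1.0, lam=1.0)  # trust the rewards
print(adv_td)  # roughly [0.1, 0.2, 0.2]: the TD residuals themselves
print(adv_mc)  # roughly [0.5, 0.4, 0.2]: return-to-go (1.0) minus each value
```

Intermediate values of `lam` land between these two extremes.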

Fixes 1, 2, and 3 together complete the variance-reduction story. Fixes 4 and 5 attack the data-inefficiency problem.

Fix 4: Importance sampling: reusing data

We collected an expensive trajectory. We’d like to do many gradient steps on it, not one. The wrinkle that’s specific to RL: unlike supervised learning, where data sits in a fixed dataset, here you generate your training data by running the current policy. After one gradient step, the policy is different, and the trajectory we just used is now from the previous policy. Thus reusing this trajectory is, strictly speaking, off-policy.

Importance sampling is the statistical trick that corrects for this. Each old data point gets a weight equal to the ratio of “how likely is this action under the new policy” to “how likely was it under the old policy”:

$$\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

If the new policy still likes the action just as much, the ratio is 1, and the data point counts normally. If the new policy now likes the action more, the ratio is greater than 1; if less, less than 1. Multiplied with the advantage, $\rho_t(\theta)\,\hat{A}_t$ gives a corrected gradient signal that’s valid even though the data wasn’t generated by the current policy.

This is what lets you take many gradient steps on one batch of trajectories. Exactly the sample efficiency you want.

But there’s a problem. If the policy moves a lot from where the data was collected, those ratios can blow up. An action the old policy gave probability 0.01 to and the new policy gives probability 0.9 has ratio 90. One sample now contributes 90× as much as a normal one. Training becomes wildly unstable, sometimes catastrophically so: the policy can collapse to garbage and never recover.

Importance sampling is a great tool, but only if you keep the new policy close to the old one. Which leads to the final ingredient and the actual contribution of the PPO paper.
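A small sketch of the ratio computation, using the 0.01 to 0.9 example from above. Implementations usually store log-probabilities rather than probabilities, so the ratio is computed as the exponential of a difference; the helper name `importance_ratio` is made up for illustration.

```python
import math


def importance_ratio(logp_new, logp_old):
    # pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    return math.exp(logp_new - logp_old)


# Policy has not moved on this action: the sample counts normally.
unchanged = importance_ratio(math.log(0.3), math.log(0.3))

# Old policy gave the action probability 0.01, new policy gives it 0.9.
blown_up = importance_ratio(math.log(0.9), math.log(0.01))

print(unchanged)  # 1.0 (up to floating-point rounding)
print(blown_up)   # about 90: one sample now outweighs ninety ordinary ones
```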

Fix 5: Clip the importance ratio

The four fixes above (advantage, value network, GAE, importance sampling) were already standard prior to PPO.

To address the instability of importance sampling, TRPO (Trust Region Policy Optimization) added an explicit constraint: the new policy must stay within a KL-divergence bound of the old. This works, but it requires solving a constrained optimization with second-order methods and the implementation is a pain.

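PPO’s contribution is a single trick that gets most of the stability of TRPO without TRPO’s complexity: if a sample is already pushing the policy in some direction, stop letting it push further.

Concretely, writing $\rho_t(\theta)$ for the importance ratio from Fix 4:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]$$

In words: take the importance-corrected advantage; also take a version where the ratio is clipped to $[1-\epsilon,\,1+\epsilon]$; use whichever is smaller. The hyperparameter $\epsilon$ controls “how far is too far”: typically 0.2, so ratios are constrained to roughly $[0.8,\,1.2]$ before clipping kicks in.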

The asymmetric min matters. The interesting question is when the min picks the unclipped value and when it picks the clipped one.

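Advantage	Ratio	Unclipped	Clipped	min picks	What this means
$\hat{A} > 0$ (good)	$\rho > 1+\epsilon$ (moved toward)	$\rho\hat{A}$ (large +)	$(1+\epsilon)\hat{A}$ (smaller +)	clipped	Already pushed right way; cap further reward
$\hat{A} > 0$ (good)	$\rho < 1-\epsilon$ (moved away)	$\rho\hat{A}$ (small +)	$(1-\epsilon)\hat{A}$ (larger +)	unclipped	Wrong direction; let full corrective gradient through
$\hat{A} < 0$ (bad)	$\rho > 1+\epsilon$ (moved toward)	$\rho\hat{A}$ (very −)	$(1+\epsilon)\hat{A}$ (less −)	unclipped	Wrong direction; let full corrective gradient through
$\hat{A} < 0$ (bad)	$\rho < 1-\epsilon$ (moved away)	$\rho\hat{A}$ (less −)	$(1-\epsilon)\hat{A}$ (more −)	clipped	Already pushed right way; cap further punishment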

The pattern: when the policy has moved in the right direction, the min picks the clipped value, and we stop reinforcing that direction beyond the band. When the policy has moved in the wrong direction, the min picks the unclipped value, and the full corrective gradient comes through.
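The four cases can be checked numerically with a per-sample version of the objective. This is a plain-Python sketch with made-up ratios (1.5 and 0.5) and unit advantages, not a training implementation; `clipped_surrogate` is a hypothetical helper name.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min of the unclipped and clipped terms."""
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    clipped = clipped_ratio * advantage
    return min(unclipped, clipped)


cases = {
    "good, moved toward": clipped_surrogate(1.5, +1.0),  # picks clipped
    "good, moved away": clipped_surrogate(0.5, +1.0),    # picks unclipped
    "bad, moved toward": clipped_surrogate(1.5, -1.0),   # picks unclipped
    "bad, moved away": clipped_surrogate(0.5, -1.0),     # picks clipped
}
for name, value in cases.items():
    print(f"{name}: {value:+.2f}")
# good, moved toward: +1.20
# good, moved away: +0.50
# bad, moved toward: -1.50
# bad, moved away: -0.80
```

In the two "clipped" cases the result no longer depends on the ratio, so the gradient with respect to the policy is zero for that sample. That is the mechanism behind "stop letting it push further".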

Four lines of code:

ratio = torch.exp(log_probs_new - log_probs_old)  # = pi_new / pi_old
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

That’s PPO. Five fixes, two problems solved.

The unpacking chain

I want to spend the rest of this piece on how to compute the clipped objective:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\ \text{clip}(\rho_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big]$$

In particular, let’s drill through how to calculate the advantage $\hat{A}_t$. The advantage is the deepest piece because it is defined recursively, through several layers of intermediate quantities. The plan: start from the GAE definition at the top and unpack downward until we hit primitives we can actually compute.

Level 1: GAE as a weighted average of n-step advantages:

$$\hat{A}_t^{\text{GAE}} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\,\hat{A}_t^{(n)}$$

This is what GAE is by definition. The parameter $\lambda$ controls a bias-variance trade-off. At $\lambda = 0$, GAE collapses to the 1-step estimate $\hat{A}_t^{(1)}$: take one real step of reward, then use the value function to estimate everything that comes after. This is low variance because only one stochastic reward enters, but it inherits whatever errors $V$ has. Larger $\lambda$ shifts weight toward longer-horizon estimates, which use more actual rewards and rely less on $V$. That lowers the bias but accumulates variance from the rewards themselves. The coefficients $(1-\lambda)\lambda^{n-1}$ form a geometric series that sums to 1, so GAE is a properly normalized weighted average.

The expression is in terms of n-step advantages $\hat{A}_t^{(n)}$, which themselves need unpacking. Conceptually, an n-step advantage estimates how much better the trajectory turned out than the value function had predicted, by looking $n$ steps ahead before bootstrapping with the value function.

Level 2: n-step advantages as sums of TD residuals:

$$\hat{A}_t^{(n)} = \sum_{l=0}^{n-1}\gamma^l\,\delta_{t+l}$$

Each n-step advantage is a sum of TD residuals weighted by $\gamma^l$. The TD residual $\delta_t$ is a 1-step “surprise”: the gap between what actually happened in one step and what the value function had predicted. Summing $n$ of these gives the $n$-step advantage. The n-step structure is now reduced to TD residuals. We still haven’t said what a TD residual is.

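Level 3: TD residuals expanded into rewards and value predictions:

A TD residual is defined as:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

Reading left to right: the actual reward at time $t$, plus the discounted value of where we ended up, minus what we had predicted before taking the action. A one-step prediction error of the value function.

Plugging the definition into the 2-step advantage:

$$\hat{A}_t^{(2)} = \delta_t + \gamma\,\delta_{t+1} = \big[r_t + \gamma V(s_{t+1}) - V(s_t)\big] + \gamma\big[r_{t+1} + \gamma V(s_{t+2}) - V(s_{t+1})\big]$$

Distributing the $\gamma$ in the second bracket:

$$\hat{A}_t^{(2)} = r_t + \gamma V(s_{t+1}) - V(s_t) + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - \gamma V(s_{t+1})$$

The $+\gamma V(s_{t+1})$ and $-\gamma V(s_{t+1})$ cancel, leaving:

$$\hat{A}_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) - V(s_t)$$

This is the standard 2-step formula. The cancellation is what is meant by “the interior value terms telescope away”: all the intermediate value predictions drop out, leaving only the rewards along the trajectory and the value function evaluated at the endpoints.

At this level, GAE has been reduced to two computable building blocks: per-step rewards $r_t$ and critic predictions $V(s_t)$. The critic predictions are direct outputs of the value network: a forward pass. They don’t need further unpacking. The rewards do.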
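The telescoping is easy to confirm with numbers. The sketch below uses made-up rewards and critic values (nothing here comes from a real run) and checks that summing discounted TD residuals matches the direct 2-step formula.

```python
gamma = 0.9
r = [1.0, 2.0]        # hypothetical r_t, r_{t+1}
V = [0.5, 0.7, 0.3]   # hypothetical V(s_t), V(s_{t+1}), V(s_{t+2})

# TD residuals for the two steps
delta_0 = r[0] + gamma * V[1] - V[0]
delta_1 = r[1] + gamma * V[2] - V[1]

# 2-step advantage, computed both ways
two_step_from_residuals = delta_0 + gamma * delta_1
two_step_direct = r[0] + gamma * r[1] + gamma**2 * V[2] - V[0]

print(two_step_from_residuals, two_step_direct)  # both about 2.543
```

The interior value `V[1]` appears in both residuals with opposite signs, which is why it has no effect on the final number.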

Level 4: the per-step reward.

In standard RL, $r_t$ is whatever the environment gives you. Atari emits a score, a board game emits +1 or -1 at the end. There is nothing to unpack: $r_t$ is a primitive.

For LLMs, this is the level where the application-specific machinery shows up. RLHF, RLVR, and other variants all attach at exactly this level: they each define $r_t$ differently. That definition is what turns a generic policy gradient into RLHF, RLVR, or any other LLM-RL variant. Everything above this level (GAE, n-step advantages, TD residuals, the clipped objective) stays the same.

Two common forms:

RLHF. A separate reward model is trained on human preference data (pairs of “this response is better than that one”). The reward model then scores each completion the policy generates, and that score is $r_t$.
RLVR. A verifier checks the answer against ground truth: did the math come out right, did the code pass the tests, did the extracted JSON match the schema? The verifier emits a numeric score, and that is $r_t$.

Most LLM-RL recipes also add a KL penalty to $r_t$: a term that penalizes the policy for drifting too far from a frozen reference (usually the base model before RL). This keeps the trained model from collapsing into degenerate high-reward outputs.

Apart from these substitutions, the algorithm is unchanged. The clipped objective, the GAE advantage, the value model, the policy update: all the same. Switching from RLHF to RLVR is switching what $r_t$ is, nothing more.
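As an illustration of how the pieces of $r_t$ might be assembled, here is a sketch of one common arrangement: a per-token KL-style penalty against the reference model, with the sequence-level score (from a reward model or a verifier) added on the final token. The exact shaping varies between recipes, so treat the function name `shaped_rewards`, the coefficient `beta`, and the log-probabilities below as assumptions for illustration.

```python
def shaped_rewards(score, logp_policy, logp_ref, beta=0.1):
    """Per-token rewards: KL-style penalty on every token, task score on the last.

    logp_policy[i] and logp_ref[i] are the log-probabilities that the trained
    policy and the frozen reference assign to the i-th generated token.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += score  # sequence-level score lands on the final token
    return rewards


# Hypothetical 3-token completion that the verifier (or reward model) scored 1.0.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.0]
token_rewards = shaped_rewards(1.0, logp_policy, logp_ref)
print(token_rewards)  # roughly [-0.02, 0.0, 1.1]
```

Swapping the source of `score` is all it takes to move between the RLHF and RLVR settings in this sketch; the rest of the pipeline consumes `token_rewards` the same way.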

The full chain, top to bottom: clipped objective → GAE advantage → n-step advantages → TD residuals → rewards and value predictions → for LLMs, log-probs and a learned or verified reward signal. Each level answers “but how do you actually compute the thing one level up?” Eventually you bottom out at things produced by neural-network forward passes, plus whatever reward signal your problem provides.

This is the picture I wish I’d had when I first read the PPO paper. The clipped objective at the top, the unpacking chain underneath, and the application-specific reward sitting at the bottom waiting to be plugged in. Once you have it, the rest is just engineering.

What’s next

PPO was the original workhorse for RLHF, but the field has moved. Two algorithms have displaced it in different settings:

DPO (Direct Preference Optimization) skips the reward model and the RL machinery entirely. Train directly on preference pairs with a clever loss derived from the Bradley-Terry math that justifies RLHF. Much simpler: no PPO, no rollouts, no value network. It also works surprisingly well. It’s displaced PPO in many open-source pipelines.
GRPO (Group Relative Policy Optimization) is a PPO variant from DeepSeek that drops the value network and computes advantages by comparing rollouts to each other. Memory-efficient and well-suited for verifiable reward settings like math and code, where you don’t need a learned reward model, and a programmatic verifier provides the signal. GRPO is what powers most recent reasoning models.

Both are easier to understand once PPO is in your head. DPO is “what if we collapsed PPO into a single supervised loss?” GRPO is “what if we kept PPO but dropped the critic?”
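To make “dropped the critic” concrete, here is a minimal sketch of a group-relative advantage: sample several completions for the same prompt, score them, and standardize the scores within the group. The population standard deviation and the small `eps` guard are choices made for this sketch, and the rewards are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each rollout relative to the other rollouts in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical verifier scores for four rollouts of one prompt (1 = correct).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)  # roughly [1.0, -1.0, -1.0, 1.0]
```

The group mean plays the role the value network plays in PPO: it is the baseline that says what “expected” looks like for this prompt.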

References

Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2016. “High-Dimensional Continuous Control Using Generalized Advantage Estimation.” International Conference on Learning Representations. https://arxiv.org/abs/1506.02438.

Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv Preprint. https://arxiv.org/abs/1707.06347.

REINFORCE: the gradient that drives training

Yee Seng Chan — Fri, 20 Feb 2026 00:00:00 GMT

Part 1 built the world REINFORCE lives in and arrived at the objective: $J(\theta)$, the thing we want to maximize. What it did not explain is why gradient descent on the corresponding loss actually makes good actions more likely. That question lives one layer down, in how the gradient flows through the softmax to the logits and back to the parameters. This part derives that mechanism.

1. The objective and its loss form

Restating, with the notation now grounded:

$$J(\theta) = \sum_t \log \pi_\theta(a_t \mid s_t)\cdot G_t$$

$J(\theta)$ is the objective we want to maximize.
$\theta$ is the network’s parameters.
$\pi_\theta(a_t \mid s_t)$ is the probability the policy assigns to action $a_t$ in state $s_t$.
$G_t$ is the return from time $t$: positive when the trajectory was good, negative when bad.

Neural networks are typically trained by minimizing a loss, so define

$$L(\theta) = -J(\theta) = -\sum_t \log \pi_\theta(a_t \mid s_t)\cdot G_t$$

Minimizing $L$ is equivalent to maximizing $J$. The sign flip is purely a notational convenience for using minimize-by-default optimizers.

2. Why the loss form does the right thing

Look at one term in isolation: $L = -G \log p$, where $p = \pi_\theta(a_t \mid s_t)$.

Good action ($G > 0$). If $p$ increases, $\log p$ increases (becomes less negative), so $L = -G \log p$ decreases. Minimizing $L$ encourages $p$ to grow.

Figure 1

Bad action ($G < 0$). Then $-G$ is positive, so $L = |G| \log p$. If $p$ decreases, $\log p$ becomes more negative, so $L$ decreases. Minimizing $L$ encourages $p$ to shrink.

Figure 2

So the form of $L$ encodes the right thing: minimizing it pushes good-action probabilities up and bad-action probabilities down. This is the easy part of the argument. It tells us that if probabilities move appropriately, the loss decreases. The harder question is whether gradient descent on $L$ actually causes them to move that way. For that, we need to look one layer deeper.

3. A useful reframe: REINFORCE is supervised learning with two tweaks

Before diving into the gradient mechanics, here’s a mental model that makes the REINFORCE loss feel less arbitrary and connects it to something you already know.

In standard supervised classification, the cross-entropy loss is

$$L_{\text{CE}}(\theta) = -\sum_i \log p_\theta(y_i \mid x_i)$$

where $y_i$ is the true label for input $x_i$. Minimizing this pushes the network’s predicted probability of the correct class upward.

Now compare to REINFORCE:

$$L(\theta) = -\sum_t \log \pi_\theta(a_t \mid s_t)\cdot G_t$$

These are the same form, with two changes:

Replace the label with the sampled action. We don’t know what action should have been taken. There’s no oracle telling us “the correct move from $s_t$ was $a^*$.” So we use the action $a_t$ the policy actually sampled as a kind of pseudo-label.
Multiply each term by $G_t$. Supervised learning treats every example as equally important. Every label is “correct” by definition. REINFORCE doesn’t have that luxury, so it weights each pseudo-labeled example by how good the outcome turned out to be. Positive $G_t$ scales the term up (do more of this); negative $G_t$ flips its sign (do less of this).

Read this way, REINFORCE is just supervised learning where (a) the agent generates its own pseudo-labels by sampling, and (b) each example is weighted by how good the outcome turned out to be.

This means the gradient mechanics we’re about to derive are exactly the gradient mechanics of supervised cross-entropy, scaled by $G_t$. If you’ve ever computed the gradient of cross-entropy through a softmax, you’ve already done most of the work. The next few sections are just making that work explicit and showing how the $G_t$ scaling rides along.

4. The network outputs logits, not probabilities

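The neural network does not directly output probabilities. It outputs logits, which are then converted to probabilities via softmax. With two actions and logits $z_1$, $z_2$:

$$p_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}}, \qquad p_2 = \frac{e^{z_2}}{e^{z_1} + e^{z_2}}$$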

For example, with and :

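Logits determine probabilities, and changing logits changes probabilities. Gradient descent updates the logits (and ultimately the weights that produce them). Writing $z_1$ for the logit of the action that was taken and $\alpha$ for the learning rate:

$$z_1 \leftarrow z_1 - \alpha\,\frac{\partial L}{\partial z_1}$$

So the question becomes: what is $\partial L / \partial z_1$, and does its sign push the logit in a direction that makes the action’s probability move correctly?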

5. The chain rule

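Focus on one step: $L = -G \log p$, where $p$ is the probability of the action taken and $z_1$ is its logit. By the chain rule,

$$\frac{\partial L}{\partial z_1} = -G\,\frac{\partial \log p}{\partial z_1}$$

We need $\partial \log p / \partial z_1$.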

6. Deriving the key softmax gradient

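Starting from the softmax $p = e^{z_1} / (e^{z_1} + e^{z_2})$ and using $\log(a/b) = \log a - \log b$:

$$\log p = z_1 - \log\big(e^{z_1} + e^{z_2}\big)$$

Differentiate term by term. The first term gives $\partial z_1 / \partial z_1 = 1$. For the second, let $S = e^{z_1} + e^{z_2}$. Then $\partial S / \partial z_1 = e^{z_1}$ (the $e^{z_2}$ piece doesn’t depend on $z_1$), so

$$\frac{\partial \log S}{\partial z_1} = \frac{1}{S}\cdot e^{z_1} = p$$

Putting it together:

$$\frac{\partial \log p}{\partial z_1} = 1 - p$$

This is a clean, elegant result: as the policy becomes more confident in the action ($p \to 1$), the gradient shrinks toward zero. There’s nothing left to push.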
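A quick way to build trust in the $1 - p$ result is to compare it against a finite-difference estimate. The logits below are arbitrary illustrative values, and `log_prob` is a helper written for this sketch.

```python
import math


def log_prob(logits, action):
    """log softmax(logits)[action], computed stably."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return logits[action] - log_norm


logits = [0.3, -1.2]  # arbitrary two-action logits
action = 0            # the action that was taken
p = math.exp(log_prob(logits, action))

# Central finite difference of log p with respect to the taken action's logit
h = 1e-6
bumped_up = [logits[0] + h, logits[1]]
bumped_down = [logits[0] - h, logits[1]]
numeric = (log_prob(bumped_up, action) - log_prob(bumped_down, action)) / (2 * h)

analytic = 1.0 - p
print(numeric, analytic)  # the two agree to several decimal places
```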

7. The gradient on the loss

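Substituting back:

$$\frac{\partial L}{\partial z_1} = -G\,(1 - p)$$

This is the core equation. With it, we can check that gradient descent does the right thing.

Use $1 - p > 0$ and the update $z_1 \leftarrow z_1 - \alpha\,\partial L / \partial z_1$ with $\alpha > 0$.

Good action ($G > 0$): $\partial L / \partial z_1 < 0$, so $z_1$ increases, which pushes $p$ up.

Bad action ($G < 0$): $\partial L / \partial z_1 > 0$, so $z_1$ decreases, which pushes $p$ down.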

8. Closing the loop

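Increasing $z_1$ enlarges the numerator of the softmax and pushes $p$ up; decreasing it pushes $p$ down. Combined with the gradient signs from the previous section:

$$G > 0 \;\Rightarrow\; z_1 \uparrow \;\Rightarrow\; p \uparrow \qquad\qquad G < 0 \;\Rightarrow\; z_1 \downarrow \;\Rightarrow\; p \downarrow$$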

9. The general update rule

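Everything above was for a single timestep and a single logit. For the full sum across the trajectory, the same logic gives

$$\nabla_\theta L = -\sum_t G_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

and the gradient-descent step on $\theta$ becomes

$$\theta \leftarrow \theta - \alpha\,\nabla_\theta L$$

This is the REINFORCE update:

$$\theta \leftarrow \theta + \alpha \sum_t G_t\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

Each parameter gets nudged in the direction $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$, the direction that would increase the probability of the action that was actually taken, scaled by $G_t$, the return that followed. Good actions ($G_t > 0$) get reinforced; bad actions ($G_t < 0$) get suppressed.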
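Here is a toy sketch of the update in action, under strong simplifying assumptions: a one-step problem with two actions, the two logits themselves as the only parameters, and made-up returns of +1 and -1. It uses the standard softmax fact that the derivative of $\log p_a$ with respect to logit $z_b$ is $1 - p_b$ when $b = a$ (the result derived above) and $-p_b$ otherwise.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
logits = [0.0, 0.0]     # the policy's parameters in this toy setup
returns = [1.0, -1.0]   # hypothetical: action 0 is good, action 1 is bad
alpha = 0.1             # learning rate
p_before = softmax(logits)[0]

for _ in range(200):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]  # sample an action
    G = returns[a]                                # observe its return
    for b in range(2):
        # d log p_a / d z_b = 1[a == b] - p_b
        grad_logp = (1.0 if b == a else 0.0) - probs[b]
        logits[b] += alpha * G * grad_logp        # REINFORCE step

p_after = softmax(logits)[0]
print(p_before, p_after)  # starts at 0.5 and ends well above 0.9
```

Both kinds of sample help here: sampling the good action raises its logit, and sampling the bad action lowers the bad logit, so the probability of the good action rises on every step.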

REINFORCE: the world before the gradient

Yee Seng Chan — Mon, 16 Feb 2026 00:00:00 GMT

These notes walk through REINFORCE end-to-end. This first half builds the conceptual world the algorithm lives in: what RL is, what an MDP is, what value functions are, what trajectories are, where the objective comes from, and what every piece of notation in it means. The second half derives the gradient that actually drives training. Both halves are needed: one tells you what you’re maximizing, the other tells you how gradient descent accomplishes it.

1. RL is a different kind of learning problem

In supervised learning, you have a dataset of inputs and labels, and you train a model to map one to the other. Reinforcement learning has no dataset. You have:

an environment with states,
an agent that takes actions and moves between states,
a reward signal that tells the agent (after the fact) how well things are going.

The agent’s job is to figure out a strategy, a policy, that collects as much reward as possible over time. Unlike supervised learning, the data is generated by the agent’s own behavior, which means a bad policy generates bad data. This is what makes RL hard in a way classification never is.

Throughout these notes we’ll use a small grid world. The agent occupies a square and can move up, down, left, or right. A few squares are terminal: in one corner, in another, and a “penalty” square that punishes with . Most squares are empty.

2. The Markov Decision Process

The grid world is an instance of a more general structure called a Markov Decision Process (MDP). An MDP has:

a set of states $\mathcal{S}$,
a set of actions $\mathcal{A}$,
a reward function $R(s, a)$: how much reward you get for taking action $a$ in state $s$,
transition dynamics $P(s' \mid s, a)$: what state you end up in next.

The basic transition mechanic is straightforward: at each timestep, the agent is in state $s_t$, picks action $a_t$, and the environment responds by giving reward $r_t$ and moving the agent to a new state $s_{t+1}$. We write this as $(s_t, a_t) \to (r_t, s_{t+1})$.

Concretely in the grid world: if the agent is at and chooses “right,” then , , , and (assuming a per-step fuel cost; more on that below). The next iteration starts from , which becomes the new “current state,” and the cycle repeats until the agent hits a terminal square.

The next state depends on which action was taken. From state :

Action leads to
Action leads to
Action leads to
Action leads to

In a deterministic environment like this grid world, $(s, a)$ uniquely determines $s'$. In a stochastic environment, taking action $a$ in state $s$ gives a distribution over possible next states. Think of a robot whose wheels sometimes slip. We’ll come back to this distinction when we write the Bellman equation in its general form.

Where rewards come from

The reward function deserves a closer look, because it’s the silent partner in the whole RL story.

The reward function is a fixed property of the environment, defined before training begins. It’s not learned. It’s not a parameter. The agent never modifies it. When the agent takes action $a$ from state $s$ and the environment returns reward $r$, the agent simply observes that number. It doesn’t see the rule that produced it.

In the grid world, you (the designer) wrote the rules: at one terminal, at another, at the “penalty”, for every step taken. That table is the reward function. The agent will spend its entire training experiencing that table’s outputs without ever being told the table exists.

This means defining the reward function is how you specify the task. Change the reward function, and you change the problem:

Reward at one corner → agent walks to that corner.
Reward for staying alive each step → agent learns to survive.
A negative reward on every step → agent learns to terminate as fast as possible, possibly by walking into the “penalty”.

Same environment, same states, same actions, but completely different behavior, because the reward function is different. This is why reward design is famously tricky: a poorly-specified reward gives you an agent that solves the wrong problem. (The classic example is a boat-racing agent that learned to drive in circles hitting respawning power-ups instead of finishing the race, because that gave more points.)

3. Trajectories and returns

The agent acting in the environment generates a trajectory: the full sequence of states, actions, and rewards from the start of an episode to the end:

$$\tau = (s_0, a_0, r_0,\ s_1, a_1, r_1,\ \ldots,\ s_T)$$

The Greek letter $\tau$ (“tau”) is the standard symbol for it. A trajectory is also sometimes called an “episode” or a “rollout.” All three terms mean the same thing: one run of the agent through the environment, from start to finish.

The return

The return from time $t$ onward is the sum of future rewards:

$$G_t = r_t + r_{t+1} + \cdots + r_{T-1}$$

This is the most important quantity in REINFORCE. Three things to internalize about it.

$G_t$ is forward-looking. It only counts what happens from $t$ onward. Rewards collected before timestep $t$ don’t enter into it. Each timestep in a trajectory has its own $G_t$, and earlier timesteps generally have more future ahead of them.

$G_t$ is empirical, not predicted. It’s a number you measure after running an episode, by summing the rewards you actually got. It’s not a parameter, not an output of any model. Just an observation.

$G_t$ is different from $r_t$, and the distinction matters. $r_t$ is the single-step reward at timestep $t$: what the agent got right then. $G_t$ is the cumulative return from $t$ onward: the sum of all rewards from that point to the end of the episode.

To make this concrete, here’s a trajectory in the grid world:

The rewards are . The returns at each timestep:

Notice: but . They’re not the same number, and they’re not measuring the same thing.

Earlier timesteps had to pay more fuel costs before reaching the , so their returns are smaller. From at the end, the agent is one step from the reward; from at the start, it had to walk all the way there.

Figure 1: The worked trajectory τ on the 4×4 grid world. Every move costs (fuel) until the terminal; the return sums rewards from step onward, so steps nearer the goal carry larger returns ( rising to ).

Why $G_t$ and not $r_t$ is what matters

Anticipating where REINFORCE is going: the gradient updates will weight each action by a return, not by an immediate reward. The reason is that the immediate reward doesn’t tell you whether the action you took was good in any meaningful sense. In the grid world, every step has reward $-1$ regardless of which direction you moved. The single-step reward gives you no information about whether you moved toward the goal or away from it.

What does carry that information? The total reward you collected after taking the action. If you took action $a_t$ and ultimately reached the goal, then $a_t$ was probably part of a good plan. If you took action $a_t$ and ended up at the “penalty”, then $a_t$ was probably a bad choice.

$G_t$ captures this. It’s the agent’s verdict on “how did things go after I took action $a_t$?”, which is exactly the right signal for deciding whether to make $a_t$ more or less likely in the future. We’ll come back to this when we form the objective.

The credit assignment problem

There’s a deeper problem hiding behind this design choice, and it has a name worth knowing.

Suppose the agent finally gets a reward at step 20. Which of the 20 actions deserves credit? The action right before the reward? The one 15 steps earlier that put the agent on the right path? All of them? Some combination?

This is called the credit assignment problem, and it’s the central difficulty of reinforcement learning. In supervised learning, every input has its own label, so credit is unambiguous: image $x_i$ goes with label $y_i$. In RL, rewards are often sparse and delayed: you take many actions before learning whether any of them were good, and there’s no per-step label telling you which actions worked.

$G_t$’s answer is brutally simple: assign every action the credit of everything that came after it. The action at step 15 gets credit for the reward at step 20, because that reward is part of $G_{15}$. So does the action at step 1, because $G_1$ also includes that reward. This is unfair on a per-action basis because some of those early actions probably didn’t matter, but it averages out across many trajectories. Actions that consistently precede good outcomes will accumulate positive updates over many rollouts; actions that don’t, won’t. The noise washes out; the signal accumulates.

This is why REINFORCE needs lots of samples to work. Each individual trajectory gives you a noisy, unfair credit assignment. Only across many trajectories does the right policy emerge.

4. Values and the Bellman equation

Before talking about how to learn a policy, it helps to ask what makes a state “good.” A natural answer: a state is good if you can collect a lot of reward starting from it. Define the value of a state as the return you’d expect if you played optimally from there.

In the simplest version of the grid world, where only terminal squares give reward, every non-terminal state satisfies:

$$V(s) = \max_a V(s')$$

where $s'$ is the state reached by taking action $a$ from $s$. The value of a state is the best value among its neighbors. Notice the action-dependence: $s'$ depends on which $a$ you take. Writing it as $s'(s, a)$ (or sometimes $s'_a$) makes that explicit and avoids the slightly sloppy notation that hides the dependence.

This is the Bellman equation in its simplest form, and it lets you propagate values from the terminal squares throughout the grid.

There’s a wrinkle: with this rule alone, the agent has no incentive to hurry. Wandering forever before reaching the goal is as good as walking straight there. To fix this, add a small per-step reward $r = -1$, a “fuel cost”:

$$V(s) = \max_a \big[\,r + V(s')\,\big]$$

Now the agent prefers shorter paths.

A second refinement is the discount factor $\gamma \in [0, 1]$, which says future rewards matter less than present ones: the same reason a dollar today is worth more than a dollar next year. With both:

$$V(s) = \max_a \big[\,r + \gamma V(s')\,\big]$$

You might wonder why we need $\gamma$ when we already have the step cost $r$. They look redundant: both make the agent prefer short paths in the grid world. But they’re solving different problems:

The step cost makes long paths expensive in absolute terms. Every step subtracts 1 from the total return. A 3-step path to nets ; a 10-step path nets .
The discount factor makes future rewards worth less than present ones. A reward of received in 3 steps is worth today; received in 10 steps, it’s worth .

The practical reasons $\gamma$ does work the step cost can’t:

$\gamma$ encodes uncertainty about the future. A reward you might get in 100 steps is less trustworthy than one you’ll get in 2 steps. The world might change, the model might be wrong. $\gamma$ captures this: distant predictions get less weight because they’re less reliable. A step cost just charges you for moving; it doesn’t model uncertainty.
$\gamma$ doesn’t require knowing the right magnitude. A step cost only works if you tune it. Set it tiny relative to the terminal rewards and the agent barely cares about path length; set it huge and the agent refuses to move. $\gamma$ is scale-free in this sense: $\gamma = 0.99$ creates a roughly-100-step planning horizon regardless of reward scale.
They compose differently with positive rewards along the way. In a “stay alive” task where each surviving step gives a positive reward, a negative step cost would subtract from a signal that’s supposed to be additive. $\gamma$ still works there: it just makes the agent prefer reward sooner rather than later.

So: the step cost is a reward-design choice (you, specifying the task, decide moving is bad), while $\gamma$ is a property of the agent’s planning horizon (the agent decides distant futures matter less). They live at different levels.

Once $\gamma$ is in the picture, the return is also typically written with discounting:

$$G_t = r_t + \gamma\,r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k \ge 0} \gamma^k\,r_{t+k}$$

This is the version that shows up in most modern RL writing. The undiscounted form earlier was a special case with $\gamma = 1$.
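Both versions of the return can be computed with one backward pass over the rewards. The sketch below uses a made-up 4-step reward sequence (three fuel costs of -1 and a terminal reward of 10); these numbers are illustrative and are not the worked trajectory from Section 3.

```python
def returns_to_go(rewards, gamma=1.0):
    """G_t for every timestep: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [-1.0, -1.0, -1.0, 10.0]   # hypothetical episode

undiscounted = returns_to_go(rewards)            # gamma = 1
discounted = returns_to_go(rewards, gamma=0.5)

print(undiscounted)  # [7.0, 8.0, 9.0, 10.0]
print(discounted)    # [-0.5, 1.0, 4.0, 10.0]
```

With heavy discounting the first step's return turns negative: from that far away, the terminal reward is discounted so much that the immediate fuel cost dominates.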

5. The general Bellman equation and expectations

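In a stochastic environment, taking action $a$ in state $s$ doesn’t deterministically land you in $s'$: it gives a distribution $P(s' \mid s, a)$ over next states. The fully general Bellman optimality equation handles this with an expectation:

$$V(s) = \max_a\ \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\,r + \gamma V(s')\,\big]$$

or written out as a sum:

$$V(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\big[\,r + \gamma V(s')\,\big]$$

Here $r$ is the reward received on the transition, which may depend on $s'$. The two forms are the same equation. Going from the first to the second is just unpacking the expectation, so it’s worth being clear about what an expectation is.

The expected value of a random variable $X$ is the weighted average of its possible values, where the weights are the probabilities:

$$\mathbb{E}[X] = \sum_x x \cdot P(X = x)$$

That’s it. Roll a fair six-sided die: outcomes $1, 2, \ldots, 6$ each with probability $1/6$, and $\mathbb{E}[X] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5$. List every possible outcome, multiply by how likely it is, add up.

A conditional expectation is the same weighted average, but using the conditional probabilities instead: “given that I know $Y = y$, what’s the expected value of $X$?”

$$\mathbb{E}[X \mid Y = y] = \sum_x x \cdot P(X = x \mid Y = y)$$

Apply this to the Bellman expectation: the random thing is $s'$ (taking action $a$ in state $s$ gives a random next state), and the quantity being averaged is $r + \gamma V(s')$. List all possible next states $s'$, weight each by $P(s' \mid s, a)$, and sum:

$$\mathbb{E}_{s'}\big[\,r + \gamma V(s')\,\big] = \sum_{s'} P(s' \mid s, a)\,\big[\,r + \gamma V(s')\,\big]$$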

Concrete example. Suppose the agent is in state , takes action , but the environment is noisy:

80% chance you actually go right, landing in with reward
20% chance you slip up, landing in with reward

Then the expected value inside the Bellman equation is

Each possible outcome contributes its value, weighted by how likely it is.

In the deterministic case, $P(s' \mid s, a) = 1$ for one specific $s'$ and $0$ for everything else. The sum collapses to a single term $r + \gamma V(s')$, and the expectation disappears. That’s why the simpler form is fine for the grid world.
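The weighted sum inside the Bellman equation is short enough to write out. In this sketch the state names, values, rewards, and $\gamma = 0.9$ are all invented for illustration; only the 80/20 slip probabilities echo the example above.

```python
def expected_backup(transitions, V, gamma=0.9):
    """Sum over next states of P(s'|s,a) * (r + gamma * V(s')).

    `transitions` is a list of (probability, reward, next_state) tuples
    describing what can happen after one particular (state, action) pair.
    """
    return sum(p * (r + gamma * V[s_next]) for p, r, s_next in transitions)


V = {"right_square": 10.0, "up_square": 2.0}   # hypothetical value estimates

# Noisy move: 80% goes right as intended, 20% slips upward.
noisy = [(0.8, -1.0, "right_square"), (0.2, -1.0, "up_square")]
# Deterministic move: all probability mass on one next state.
deterministic = [(1.0, -1.0, "right_square")]

noisy_value = expected_backup(noisy, V)
det_value = expected_backup(deterministic, V)
print(noisy_value)  # about 6.56 = 0.8 * (-1 + 9) + 0.2 * (-1 + 1.8)
print(det_value)    # 8.0: the sum has collapsed to a single term
```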

6. $V$ vs $Q$: two flavors of value

So far we’ve only talked about $V$, the value of a state. There’s a closely related object $Q$, the value of a state-action pair. Both come up constantly in RL, and they answer slightly different questions.

$V(s)$: the value of a state. Answers: “How good is it to be in state $s$?” Specifically, the expected return starting from $s$ and following the policy from there:

$$V^\pi(s) = \mathbb{E}_\pi\big[\,G_t \mid s_t = s\,\big]$$

One number per state.

$Q(s, a)$: the value of a state-action pair. Answers: “How good is it to take action $a$ from state $s$?” The expected return if you take $a$ in state $s$ and then follow the policy:

$$Q^\pi(s, a) = \mathbb{E}_\pi\big[\,G_t \mid s_t = s,\ a_t = a\,\big]$$

One number per state-action pair. With 4 actions, $Q(s, \cdot)$ is four numbers, one for each action you might take from $s$.

The two are related. $V$ is what you get when you average $Q$ over the actions the policy would take:

$$V^\pi(s) = \sum_a \pi(a \mid s)\,Q^\pi(s, a)$$

If the policy is greedy (always picks the best action), this collapses to $V(s) = \max_a Q(s, a)$. So $V$ is the summary, $Q$ is the breakdown.

Why does this matter? Because $Q$ is more useful for acting. Suppose you have a value function and you need to choose an action.

With $V$: you’d need to look ahead one step for each candidate action, see what state you’d land in, and check the value there. This requires knowing the environment dynamics.
With $Q$: you read off the $Q$ values directly and pick the action with the highest one. No model of the environment needed.

This is why deep RL methods that learn value functions almost always learn $Q$, not $V$. DQN (Deep Q-Network, the famous 2013 Atari paper) learns $Q$ directly: the network takes a state and outputs one number per action; you act greedily by picking the argmax. You don’t need to know how the Atari game works internally; the $Q$-values tell you which button to press.
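The relationship between the two value functions, and the "just read off the argmax" point, fit in a few lines. The Q-values and policy probabilities below are invented for a single hypothetical state.

```python
# Hypothetical Q-values for one state of a 4-action grid world.
Q = {"up": 1.0, "down": -2.0, "left": 0.0, "right": 4.0}

# Hypothetical stochastic policy at that state.
policy = {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7}

# V under the policy: average Q over the actions the policy would take.
v_under_policy = sum(policy[a] * Q[a] for a in Q)

# V under a greedy policy: the best Q-value.
v_greedy = max(Q.values())

# Acting with Q needs no model of the environment: pick the argmax.
greedy_action = max(Q, key=Q.get)

print(v_under_policy)  # about 2.7
print(v_greedy)        # 4.0
print(greedy_action)   # right
```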

7. Why neural networks?

Iterating Bellman across a 16-square grid is fine. Iterating across the state space of Go, which has more positions than there are atoms in the universe, is not. You can’t store one number per state, let alone visit each state repeatedly to update it.

The fix is to use a neural network as a function approximator in place of a giant lookup table. The network takes a state as input and outputs either:

a value estimate ($V(s)$ or $Q(s, a)$): “from this state, you can expect roughly this much return,” or
a policy: “from this state, here are the probabilities of each action.”

These two choices give the two main families of deep RL.

A value network approximates $V$ or $Q$. It’s trained by enforcing the Bellman equation: the network’s prediction at $s$ should match the reward $r$ plus $\gamma$ times its prediction at the next state $s'$. Once trained, you act greedily: pick the action with the highest predicted value. DQN is the canonical example.

A policy network approximates the policy directly. It takes a state and outputs probabilities over actions. There’s no Bellman equation, no value estimation. You train the network so good actions become more likely and bad actions become less likely. This is the world REINFORCE lives in.

Why use one over the other? Policy networks have several advantages worth knowing:

They naturally output stochastic policies (next section), which matters for exploration.
They handle continuous action spaces gracefully: a value network would need to argmax over an infinite set.
They’re often easier to train end-to-end with backpropagation, which is the whole point of writing the objective as something you can take a gradient of.

Value networks tend to be more sample-efficient when they work, but harder to stabilize. The two approaches aren’t mutually exclusive: actor-critic methods use both.

8. Stochastic policies and explore/exploit

A deterministic policy outputs the single best action: “from state $s$, always go right.” A stochastic policy outputs a distribution: “from state $s$, go right with some probability, and up, down, or left each with its own probability,” with the four summing to 1.

Why prefer the stochastic version? Imagine the agent has found one reward and learned to walk to it. With a deterministic policy, it will never deviate and will never discover a larger reward sitting in another corner. Randomness is what lets the agent explore. A stochastic policy mostly exploits what it knows but occasionally tries something else, which is how it keeps improving.

This is also why the policy network outputs a softmax over actions rather than a single chosen action. The softmax gives a differentiable, naturally stochastic parameterization, which is exactly what the gradient mechanics in Part II need.

9. The policy notation

We’ve referenced “the policy” repeatedly. It’s now time to give it a proper symbol.

$\pi_\theta$ is the policy: the neural network that, given a state, outputs a probability distribution over actions. The Greek letter $\pi$ is the standard symbol for a policy. The subscript $\theta$ reminds you that the policy is parameterized by the network’s weights $\theta$. Different weights produce different policies.

Three notations all show up in practice and mean closely related things:

$\pi_\theta$ on its own refers to the whole policy as an object: the full mapping from states to action distributions.
$\pi_\theta(\cdot \mid s)$ is the distribution over actions when the agent is in state $s$: for a 4-action grid world, that’s 4 numbers summing to 1.
$\pi_\theta(a \mid s)$ is a single number: “the probability that policy $\pi$ (with weights $\theta$) assigns to action $a$ when in state $s$.”

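Concretely, for the grid world: the network takes a state $s$ as input, runs it through some layers, ends with a softmax, and outputs $\pi_\theta(\text{up} \mid s)$, $\pi_\theta(\text{down} \mid s)$, $\pi_\theta(\text{left} \mid s)$, $\pi_\theta(\text{right} \mid s)$. Four probabilities summing to 1. Sample from them to get the action the agent actually takes.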
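The last two steps (softmax, then sample) can be sketched in plain Python. The logits below are made-up stand-ins for whatever the network’s final layer would produce, and the helper names `softmax` and `sample_action` are chosen for this illustration only.

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]

def softmax(logits):
    """Turn arbitrary real-valued logits into probabilities that sum to 1."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs, rng):
    """Draw one action according to the policy's probabilities."""
    return rng.choices(ACTIONS, weights=probs, k=1)[0]

logits = [0.1, -1.2, 0.3, 2.0]    # made-up network outputs for one state
probs = softmax(logits)           # pi_theta(. | s): four numbers summing to 1
rng = random.Random(0)            # seeded so the sketch is reproducible
action = sample_action(probs, rng)
```

The largest logit gets the largest probability, but every action keeps a nonzero chance of being sampled, which is where the exploration comes from.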

What the policy actually controls

It’s worth being precise about what the policy is for. The policy controls exactly one thing: the probability of choosing each action in each state. That’s it. Given a state, it produces a distribution. The agent samples from that distribution.

What the policy does not control:

The reward. The environment decides what reward to give. The policy can’t change this.
The next state. The transition dynamics are a property of the environment. If the agent walks right and the floor is icy, the agent slides.
Which states are reachable. The structure of the world is given. The policy only influences which reachable states the agent tends to visit.

The causal chain looks like this:

The policy controls the second arrow. Everything downstream is the environment doing its thing. But by setting up that arrow well, i.e. assigning high probability to good actions in each state, the policy biases what happens in the rest of the chain toward outcomes we want.

This is also why the gradient mechanics in Part II are expressed in terms of $\pi_\theta(a \mid s)$. The only thing the agent can adjust is action probabilities. The only thing the gradient can affect is which actions get sampled.

A useful contrast to hold in mind: a value function ($V$ or $Q$) tells you how good states are. It doesn’t tell you what to do. A policy tells you what to do. It doesn’t predict return. REINFORCE trains the policy directly, without ever learning a value function: the trajectories themselves provide the feedback signal.

10. The objective from a single trajectory

We can finally write down what REINFORCE is trying to do.

Run the agent. It produces a trajectory $\tau$. For each timestep $t$ in that trajectory, you can compute two things:

$G_t$: the return that followed (you get this by summing rewards from $t$ onward).
$\pi_\theta(a_t \mid s_t)$: the probability the policy assigned to the action that was actually taken.

The objective to maximize is:

$$J(\theta) = \sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t \mid s_t)$$

Notice the structure: each term in the sum has two factors: the return $G_t$, and the log-probability of the action that produced it.

How to read this sum

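If a trajectory has 4 steps ($T = 4$), the sum is:

$$J(\theta) = G_0 \log \pi_\theta(a_0 \mid s_0) + G_1 \log \pi_\theta(a_1 \mid s_1) + G_2 \log \pi_\theta(a_2 \mid s_2) + G_3 \log \pi_\theta(a_3 \mid s_3)$$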

One term per timestep. Each term has two factors: the return from that timestep onward, and the log-probability of the action that was taken at that timestep.

Let’s compute it for the grid-world trajectory we used earlier. The returns were . Suppose the current policy assigned these probabilities to the actions actually taken:

, so
, so
, so
, so

Plug in:

That’s one number: the value of $J(\theta)$ for this specific trajectory under the current policy.
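The same kind of calculation is easy to script. The sketch below uses made-up returns and probabilities for a 4-step trajectory; they are placeholders chosen for illustration, not the grid-world numbers from the worked example, and the function name `single_trajectory_objective` is likewise an assumption.

```python
import math

def single_trajectory_objective(returns, action_probs):
    """Sum of G_t * log pi(a_t | s_t) over the timesteps of one trajectory."""
    return sum(G * math.log(p) for G, p in zip(returns, action_probs))

# Made-up values for a 4-step trajectory (NOT the numbers from the example above).
returns = [0.97, 0.98, 0.99, 1.0]        # G_0 .. G_3
action_probs = [0.5, 0.25, 0.5, 0.8]     # pi(a_t | s_t) for the actions actually taken

J = single_trajectory_objective(returns, action_probs)

# Raising the probability of one taken action, with everything else fixed.
J_after = single_trajectory_objective(returns, [0.6, 0.25, 0.5, 0.8])
```

Every log-probability is at most zero, so with positive returns `J` comes out negative. `J_after` is closer to zero than `J`, because one of the taken actions was made more likely.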

What does that number mean?

On its own, the absolute value isn’t very meaningful. What matters is how the sum changes as $\theta$ changes.

Here’s the intuition for why this particular form is what we want to maximize. Each term is doing something specific:

If $G_t$ is large and positive (the trajectory after this timestep was good), then $G_t \log \pi_\theta(a_t \mid s_t)$ is a large negative number that gets less negative if $\pi_\theta(a_t \mid s_t)$ increases. So the sum increases when the policy raises the probability of $a_t$.
If $G_t$ is negative (things went badly after this timestep), then $G_t \log \pi_\theta(a_t \mid s_t)$ is positive, and it increases when $\pi_\theta(a_t \mid s_t)$ decreases. So the sum increases when the policy lowers the probability of $a_t$.

Putting those together: the sum is large when the policy assigns high probability to actions that led to good returns and low probability to actions that led to bad returns. Maximizing it pushes the policy in exactly the direction we want.

This is also why $G_t$ is the right multiplier here, not $r_t$. The return is the agent’s verdict on whether action $a_t$ was a good choice, accounting for everything that followed. The single-step reward would tell us nothing useful in environments where rewards are delayed.
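A small sketch of the delayed-reward case, with a hypothetical 4-step episode in which only the final step pays out. The helper `returns_from` and the discount of 0.9 are assumptions for illustration. Using the per-step reward as the multiplier would leave the first three actions with zero weight in the objective; using the return gives all four some credit.

```python
def returns_from(rewards, gamma):
    """G_t: discounted sum of the rewards from timestep t onward."""
    return [
        sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        for t in range(len(rewards))
    ]

rewards = [0.0, 0.0, 0.0, 1.0]               # hypothetical: reward only at the end
returns = returns_from(rewards, gamma=0.9)   # roughly [0.729, 0.81, 0.9, 1.0]

# Which timesteps get a nonzero multiplier under each choice?
credited_by_reward = [r != 0.0 for r in rewards]
credited_by_return = [G != 0.0 for G in returns]
```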

11. The objective as an expectation

The single-trajectory form is what you actually compute in code. But to be precise about what $J(\theta)$ really is, you need to think of it as an expectation.

Each time you run the agent, you get a different trajectory, because the policy is stochastic and the environment may be too. The “true” objective averages over all the trajectories the policy could possibly produce:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t \mid s_t) \right]$$

Reading the new piece of notation:

$\tau \sim \pi_\theta$ reads as “trajectory $\tau$ sampled from $\pi_\theta$”: meaning $\tau$ is generated by running the policy in the environment.
$\mathbb{E}_{\tau \sim \pi_\theta}[\cdot]$ reads as “the expectation, where the randomness comes from sampling trajectories from the policy.”

So in plain English: “run the policy infinitely many times, compute the bracketed sum on each trajectory, and average the results.”

Expanding the expectation

The expectation is shorthand for a weighted average over trajectories. Let’s expand it.

Step 1: What’s the probability of a specific trajectory $\tau$?

A trajectory is a sequence:

$$\tau = (s_0, a_0, s_1, a_1, \dots, s_{T-1}, a_{T-1}, s_T)$$

For this exact sequence to occur, every step has to play out a particular way. The probability is the product of all the individual things that had to happen:

$$P(\tau; \theta) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t)$$

Reading the pieces:

$p(s_0)$: the probability that the episode started at $s_0$. (Some problems have a fixed start state; others sample from a distribution. $p(s_0)$ covers both.)
$\pi_\theta(a_t \mid s_t)$: the probability that the policy chose action $a_t$ in state $s_t$. (This is the only piece that depends on $\theta$.)
$p(s_{t+1} \mid s_t, a_t)$: the probability that the environment transitioned to $s_{t+1}$ given the state and action. (This is the environment dynamics: fixed, not learnable.)

In a deterministic environment with a fixed start state, $p(s_0)$ and $p(s_{t+1} \mid s_t, a_t)$ are all 0 or 1, so the trajectory probability simplifies to just $\prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)$: the product of the policy probabilities along the path.
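The product is simple enough to write out directly. This sketch uses a made-up 3-step trajectory; the function name `trajectory_probability` and the 0.8 success rate in the “slippery” case are assumptions for illustration only.

```python
import math

def trajectory_probability(start_prob, policy_probs, transition_probs):
    """P(tau) = p(s_0) * prod_t pi(a_t | s_t) * p(s_{t+1} | s_t, a_t)."""
    prob = start_prob
    for pi_t, p_t in zip(policy_probs, transition_probs):
        prob *= pi_t * p_t
    return prob

# Made-up 3-step trajectory.
policy_probs = [0.5, 0.8, 0.9]          # pi(a_t | s_t) along the path

# Deterministic environment, fixed start: every environment factor is 1.
deterministic = trajectory_probability(1.0, policy_probs, [1.0, 1.0, 1.0])

# Slippery environment: each intended move succeeds with probability 0.8 (assumed).
slippery = trajectory_probability(1.0, policy_probs, [0.8, 0.8, 0.8])
```

In the deterministic case the result is just the product of the policy probabilities. In the slippery case the same path is less likely, even though the policy did nothing different.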

Step 2: Plug into the expectation.

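The expectation sums over all possible trajectories, weighted by their probabilities:

$$J(\theta) = \sum_{\tau} P(\tau; \theta) \left[ \sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t \mid s_t) \right]$$

Substituting the expression for $P(\tau; \theta)$:

$$J(\theta) = \sum_{\tau} \left[ p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t) \, p(s_{t+1} \mid s_t, a_t) \right] \left[ \sum_{t=0}^{T-1} G_t \log \pi_\theta(a_t \mid s_t) \right]$$

That’s the fully expanded form.

The “$\sum_{\tau}$” is summing over every possible trajectory: every possible starting state, every possible sequence of actions the policy might take, every possible sequence of next states the environment might produce.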

Why nobody computes this directly

In the grid world with 16 squares, 4 actions, and episodes of length up to 20 steps, the number of possible trajectories is roughly $4^{20} \approx 10^{12}$ even before accounting for state randomness. In something like Atari or Go, it’s effectively infinite.

You can’t enumerate all trajectories. You can’t compute the trajectory probability $P(\tau; \theta)$ for each one. You especially can’t because the transition probability $p(s_{t+1} \mid s_t, a_t)$ is the environment dynamics: you usually don’t even know it; you just experience it by stepping the simulator.

This is exactly why we sample. Instead of summing over all trajectories with their true probabilities, we:

Run the policy once. The environment naturally generates a trajectory with the right probability. We don’t need to know $p(s_{t+1} \mid s_t, a_t)$ explicitly because the simulator embodies it.
Compute $\sum_t G_t \log \pi_\theta(a_t \mid s_t)$ for that trajectory.
Treat this as a single sample from the expectation.
Average across many such samples (across many gradient steps).

This is called a Monte Carlo estimate: using random samples to approximate an expectation.
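To see the estimate at work, here is a toy problem small enough that the full sum over trajectories can be computed and compared with a sampled average. Everything in it is an assumption made for the sketch: a policy that uses the same two-action distribution in every state, 3-step episodes, a reward of 1 whenever action 1 is taken, and no discounting. The policy is held fixed throughout, so this only illustrates the averaging, not training.

```python
import itertools
import math
import random

PROBS = [0.7, 0.3]     # hypothetical policy: same action distribution in every state
T = 3                  # episode length
GAMMA = 1.0            # no discounting, to keep the arithmetic simple

def bracketed_sum(actions):
    """sum_t G_t * log pi(a_t | s_t) for one trajectory; reward r_t = 1 if a_t == 1."""
    rewards = [float(a) for a in actions]
    total = 0.0
    for t, a in enumerate(actions):
        G_t = sum(GAMMA ** k * r for k, r in enumerate(rewards[t:]))
        total += G_t * math.log(PROBS[a])
    return total

def exact_objective():
    """Enumerate all 2**T trajectories and weight each by its probability."""
    total = 0.0
    for actions in itertools.product([0, 1], repeat=T):
        p_tau = math.prod(PROBS[a] for a in actions)
        total += p_tau * bracketed_sum(actions)
    return total

def monte_carlo_objective(num_samples, rng):
    """Average the bracketed sum over trajectories sampled from the policy."""
    total = 0.0
    for _ in range(num_samples):
        actions = rng.choices([0, 1], weights=PROBS, k=T)
        total += bracketed_sum(actions)
    return total / num_samples

exact = exact_objective()
estimate = monte_carlo_objective(20_000, random.Random(0))
```

With 8 trajectories the exact sum is trivial. The sampled version never needs the list of trajectories or their probabilities, which is the only reason it scales to problems where enumeration is hopeless.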

12. Putting the training loop together

We now have all the pieces. The REINFORCE training loop:

Roll out a trajectory. Starting from some initial state, sample actions from the current policy until the episode ends. Record states, actions, and rewards along the way.
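Compute the returns. For each timestep $t$ in the trajectory, sum the (discounted) rewards from $t$ onward to get $G_t$.
Form the objective. Plug the returns into $J(\theta) = \sum_t G_t \log \pi_\theta(a_t \mid s_t)$. This is your single-sample estimate of the true expected objective.
Update the parameters. Take a gradient ascent step on $J(\theta)$ (equivalently, a descent step on $-J(\theta)$).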
Repeat. With updated parameters, roll out a new trajectory and do it again.

A useful intuition: the policy network is being shown its own past behavior and told to “do more of what worked, less of what didn’t.” The return $G_t$ is the supervision signal: it plays the role that the label plays in supervised learning, except the agent generates it for itself by interacting with the environment.

The training loop in code

To make this concrete, here’s the entire REINFORCE training loop in PyTorch. The grid world is abstracted as env, and the policy is a small neural network.

import torch
import torch.nn.functional as F

# Hyperparameters
learning_rate = 1e-3
gamma = 0.99
num_episodes = 10000

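# Policy network: state -> action logits.
# PolicyNetwork (an nn.Module) and env are assumed to be defined elsewhere;
# states are assumed to be float tensors the network can consume directly.
policy = PolicyNetwork()  # outputs logits over actions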
optimizer = torch.optim.Adam(policy.parameters(), lr=learning_rate)

for episode in range(num_episodes):
    # 1. Roll out a trajectory
    states, actions, rewards = [], [], []
    state = env.reset()
    done = False
    while not done:
        logits = policy(state)                     # action logits for this state
        probs = F.softmax(logits, dim=-1)          # action probabilities
        action = torch.multinomial(probs, 1).item()  # sample an action
        next_state, reward, done = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state

    # 2. Compute returns G_t for each timestep (working backward)
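    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)          # collects G_{T-1}, ..., G_0
    returns.reverse()              # back to chronological order: G_0, ..., G_{T-1}
    returns = torch.tensor(returns, dtype=torch.float32)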

    # 3. Form the loss
    # For each timestep, we want G_t * log pi(a_t | s_t).
    # We sum over t and negate (since optimizers minimize).
    loss = 0
    for s_t, a_t, G_t in zip(states, actions, returns):
        logits = policy(s_t)
        log_probs = F.log_softmax(logits, dim=-1)
        loss = loss - G_t * log_probs[a_t]

    # 4. Update parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

A few things worth flagging in this code:

The negation. Because optimizers minimize by default, we minimize $-J(\theta)$, which is equivalent to maximizing $J(\theta)$. Hence loss = loss - G_t * log_probs[a_t] instead of +.

Returns computed backward. The for r in reversed(rewards) loop is the standard way to compute returns in $O(T)$ time using the recursion $G_t = r_t + \gamma G_{t+1}$. Forward computation would be $O(T^2)$.

One trajectory per gradient step. This is REINFORCE in its purest form. In practice you’d batch multiple trajectories before each step to reduce variance, but the algorithm doesn’t require it.
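On the second point, a quick way to convince yourself that the backward recursion matches the definition is to compute the returns both ways and compare. The function names and the reward list below are made up for this sketch.

```python
def returns_backward(rewards, gamma):
    """O(T): one pass from the end, using G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()              # restore chronological order
    return returns

def returns_forward(rewards, gamma):
    """O(T^2): re-sum the discounted tail of the episode at every timestep."""
    return [
        sum(gamma ** k * r for k, r in enumerate(rewards[t:]))
        for t in range(len(rewards))
    ]

rewards = [0.0, -1.0, 0.0, 2.0, 1.0]   # made-up rewards
backward = returns_backward(rewards, gamma=0.99)
forward = returns_forward(rewards, gamma=0.99)
```

The two lists agree up to floating-point rounding. The backward version touches each reward once, while the forward version redoes the tail sum at every timestep.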

13. The bridge to the gradient mechanics

This is the conceptual frame in which REINFORCE makes sense. We have an objective $J(\theta)$, written either as a single-trajectory sum or as an expectation over trajectories, and we want to maximize it. Equivalently, we minimize $-J(\theta)$ by gradient descent.

What this frame doesn’t tell you is why gradient descent on $-J(\theta)$ actually causes good actions to become more probable. That question lives one layer down. It’s about how the gradient flows through the softmax to the logits, and from the logits back through the network to the parameters. That’s what Part II is about.