<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>How LLMs learn to reason — series</title>
<link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/</link>
<atom:link href="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/index.xml" rel="self" type="application/rss+xml"/>
<description>The RL lineage behind reasoning models, traced one algorithm at a time, from REINFORCE and PPO through GRPO and DPO to what R1 actually shipped.</description>
<image>
<url>https://yeesengchan.com/01-reasoning.png</url>
<title>How LLMs learn to reason — series</title>
<link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/</link>
</image>
<generator>quarto-1.9.38</generator>
<lastBuildDate>Sat, 14 Mar 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>How reasoning models learn to use tools</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/05-tool-use/</link>
  <description><![CDATA[ 





<div class="series-nav">
<div class="series-label">
Part of a series
</div>
<div class="series-name">
How LLMs learn to reason
</div>
<ol type="1">
<li><a href="../00-reinforce-foundations/">REINFORCE: the world before the gradient</a></li>
<li><a href="../00-reinforce-gradient/">REINFORCE: the gradient that drives training</a></li>
<li><a href="../01-ppo/">PPO is REINFORCE plus five fixes</a></li>
<li><a href="../02-grpo/">GRPO: The algorithm behind reasoning models</a></li>
<li><a href="../03-dpo/">DPO: RLHF collapsed into one loss</a></li>
<li><a href="../04-r1/">R1-Zero was the result. R1 was the product</a></li>
<li><a href="../05-tool-use/">How reasoning models learn to use tools</a></li>
</ol>
</div>
<p>Ask Claude what last night’s score was, or to look up the latest version of some Python library, and watch what it does. It writes a search query, calls out to a search tool, reads the results, sometimes searches again to clarify, and only then answers you. The whole sequence happens inside one response.</p>
<p>The behavior isn’t scripted. Real assistants have scaffolding around the model (tool schemas, system instructions, routing, sometimes policy rules), but the core behavior is something the model learned in training. There’s no <code>if user_asked_for_current_info: search()</code> branch anywhere. The model learned that for some questions, calling out to a tool produces better answers than guessing.</p>
<p><a href="../../../../posts/series/how-llms-learn-to-reason/04-r1/index.html">The previous piece</a> covered how DeepSeek-R1 used reinforcement learning to consolidate reasoning behavior (backtracking, self-correction, longer chains of thought) out of a base model, with the RL run only on math and code problems. The headline finding was that outcome-only RL on verifiable tasks produces models that learn to think.</p>
<p>This piece is about what happens when you point that same training recipe at tool use. It works, with two interesting twists.</p>
<section id="the-recipe-extends-search-r1" class="level2">
<h2 class="anchored" data-anchor-id="the-recipe-extends-search-r1">The recipe extends: Search-R1</h2>
<p>The cleanest demonstration is Search-R1 <span class="citation" data-cites="jin_etal_2025_searchr1">(Jin et al. 2025)</span>. Take the R1 training recipe, but instead of letting the model only think to itself, let it call a search engine in the middle of its reasoning.</p>
<p>The training data is question-answering, specifically the kind that requires looking facts up. “Curious is a women’s fragrance by a singer born in what city and state?” The model can’t reasonably know the answer from memorized training data and needs to chain together facts it doesn’t have.</p>
<p>The trajectory is interleaved: reason for a bit, decide a search would help, issue a query, read the retrieved passages, reason more, maybe search again, eventually answer. In practice:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb1-1"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">think</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> I need to find the singer behind the fragrance "Curious". <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">think</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-2"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">search</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> Curious fragrance singer <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">search</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-3"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">information</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">Wikipedia: Curious is a fragrance by Britney Spears...</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">information</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-4"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">think</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> So I need to find where Britney Spears was born. <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">think</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-5"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">search</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> Britney Spears birthplace <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">search</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-6"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">information</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">[</span><span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">...Spears was born in McComb, Mississippi...</span><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">]</span> <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">information</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-7"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">think</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> McComb, Mississippi. That's the answer. <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">think</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-8"><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">answer</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span> McComb, Mississippi <span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&lt;/</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">answer</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">&gt;</span></span></code></pre></div></div>
<p>The training loop:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">import re</span>
<span id="cb2-2"></span>
<span id="cb2-3">def train(dataset, policy, retriever, exact_match, update_policy):</span>
<span id="cb2-4">    """Search-R1-style loop; every callable argument is a placeholder, not a real API."""</span>
<span id="cb2-5">    for question, ground_truth in dataset:</span>
<span id="cb2-6">        trajectory = []  # (text, written_by_model) pairs</span>
<span id="cb2-7">        answer = None</span>
<span id="cb2-8">        while answer is None:</span>
<span id="cb2-9">            chunk = policy(question, trajectory)  # generate the next chunk</span>
<span id="cb2-10">            trajectory.append((chunk, True))</span>
<span id="cb2-11">            search = re.search(r"&lt;search&gt;(.*?)&lt;/search&gt;", chunk, re.DOTALL)</span>
<span id="cb2-12">            if search:</span>
<span id="cb2-13">                results = retriever(search.group(1).strip())</span>
<span id="cb2-14">                trajectory.append((f"&lt;information&gt;{results}&lt;/information&gt;", False))</span>
<span id="cb2-15">            found = re.search(r"&lt;answer&gt;(.*?)&lt;/answer&gt;", chunk, re.DOTALL)</span>
<span id="cb2-16">            if found:</span>
<span id="cb2-17">                answer = found.group(1).strip()</span>
<span id="cb2-18">        reward = exact_match(answer, ground_truth)</span>
<span id="cb2-19">        # Update the policy, but only on chunks flagged True (tokens the model</span>
<span id="cb2-20">        # wrote), NOT on chunks that came from retrieved passages.</span>
<span id="cb2-21">        update_policy(trajectory, reward)</span></code></pre></div></div>
<p>The last line is the wrinkle worth understanding.</p>
<p>Setup-wise, this is the R1 recipe with one change. Same policy-gradient RL (Search-R1 tests both PPO and GRPO, with PPO more stable in their main setting). Same outcome-only reward (exact match between predicted and ground-truth answer). No process rewards, no learned reward model, no human-graded trajectories. The only difference is that when the model produces <code>&lt;/search&gt;</code>, the training loop pauses generation, runs the query, and pastes the top results back into the trajectory before the model continues.</p>
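<p>To make the outcome-only reward concrete, here is a minimal sketch of an exact-match check. The normalization (lowercasing, dropping punctuation and English articles) follows the convention common in QA evaluation. It is an assumption that Search-R1’s scorer behaves this way, and the function names are illustrative.</p>

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(predicted: str, ground_truths: list[str]) -> float:
    """Return 1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(predicted)
    return 1.0 if any(pred == normalize_answer(g) for g in ground_truths) else 0.0


# A correct answer and a near miss: the reward is all-or-nothing.
print(exact_match_reward("McComb, Mississippi", ["McComb, Mississippi"]))
print(exact_match_reward("The city of McComb", ["McComb, Mississippi"]))
```

<p>The reward carries no information about how the answer was reached. Everything the model learns about searching has to be inferred from this one number per trajectory.</p>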
<p>And it works. Trained on Natural Questions and HotpotQA, evaluated across seven QA datasets, Qwen2.5-7B with Search-R1 (PPO) gets 43% average accuracy versus 30% for vanilla retrieval-augmented generation and 28% for R1-style RL with no search at all. The gap is consistent across both in-distribution and out-of-distribution test sets.</p>
<p>What’s striking is what the model learned without being told. The reward checks only the final answer. Nothing in the training signal says “search more often” or “search more carefully.” But over training, the average number of searches per trajectory grows. The model learns to search when uncertain, to search again if the first results aren’t good enough, and sometimes to do a verifying search after it thinks it has the answer.</p>
<p>None of this was hand-designed. The policy gradient finds these behaviors because trajectories that include them tend to end with correct answers more often than trajectories that don’t. The same mechanism that gave R1 its self-correcting math behavior gives Search-R1 its search behavior.</p>
<section id="the-retrieved-token-mask" class="level3">
<h3 class="anchored" data-anchor-id="the-retrieved-token-mask">The retrieved-token mask</h3>
<p>The trajectory contains tokens the model didn’t write: the retrieved Wikipedia passages between <code>&lt;information&gt;</code> tags. Run the policy gradient over every token, and you’re training the model to assign higher probability to those passages too. That’s nonsense as a learning signal. The model didn’t choose those tokens, and the retrieved text isn’t even consistent across rollouts of the same prompt.</p>
<p>The fix is to mask retrieved tokens out of the loss. Only update on tokens the model actually generated. The full PPO/GRPO objective is more involved, but the core idea is an indicator in front of each token’s contribution:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BLoss%7D%20%5C;=%5C;%20-%5Csum_%7Bt=1%7D%5E%7B%7Cy%7C%7D%20I(y_t)%20%5C,%5Ccdot%5C,%20%5B%5C,%5Ctext%7Bpolicy-gradient%20signal%20at%20token%20%7D%20t%5C,%5D"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?I(y_t)%20=%201"> if the model generated token <img src="https://latex.codecogs.com/png.latex?y_t">, and <img src="https://latex.codecogs.com/png.latex?I(y_t)%20=%200"> if the token came from a retrieved passage. Without the indicator, retrieved tokens contribute to the gradient and pull the policy toward whatever was in the search results. With it, only tokens the model actually chose contribute.</p>
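<p>The indicator can be sketched in a few lines of plain Python. This follows the simplified formula above rather than the full PPO/GRPO objective, and the per-token signal values are made up for illustration.</p>

```python
def masked_policy_loss(per_token_signal, generated_mask):
    """Sum per-token policy-gradient terms over model-written tokens only.

    per_token_signal: one float per trajectory token (hypothetical values,
        standing in for the policy-gradient signal at each token).
    generated_mask: 1 for tokens the model generated, 0 for retrieved tokens.
    """
    if len(per_token_signal) != len(generated_mask):
        raise ValueError("signal and mask must have the same length")
    return -sum(m * s for m, s in zip(generated_mask, per_token_signal))


# Toy trajectory: think, search, two retrieved tokens, answer.
signal = [0.5, 0.25, 4.0, 4.0, 0.25]
mask = [1, 1, 0, 0, 1]

masked = masked_policy_loss(signal, mask)           # retrieved tokens ignored
unmasked = masked_policy_loss(signal, [1] * 5)      # retrieved tokens included
print(masked, unmasked)
```

<p>In the unmasked case the two retrieved tokens dominate the sum, even though the model had no say in them.</p>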
<p>Without the mask, the same model gets 34% average accuracy. With it, 43%. Nine points from one indicator function.</p>
<p>The principle generalizes. Anytime your RL setup includes tokens from outside the model (search results, tool outputs, environment state, function returns), you have to be careful which ones enter the gradient. Get it wrong and the model regresses toward whatever distribution those external tokens come from.</p>
</section>
</section>
<section id="when-the-action-space-gets-richer-toolrl" class="level2">
<h2 class="anchored" data-anchor-id="when-the-action-space-gets-richer-toolrl">When the action space gets richer: ToolRL</h2>
<p>Search-R1 works because search is a forgiving tool. The action space is small (queries to one engine). Verification is sharp (did the final answer match?). Search itself is robust, since even mediocre queries usually retrieve something useful.</p>
<p>Real tool use is messier. An agentic model might have access to dozens of tools, each with named parameters, some required and some optional, taking strings or numbers or structured objects. The model can fail by picking the wrong tool, by using wrong parameters, by getting parameter names right but values wrong, or by forgetting to call a tool that was needed.</p>
<p>ToolRL <span class="citation" data-cites="qian_etal_2025">(Qian et al. 2025)</span> looked at training in this richer setting. Two findings: one expected, one surprising.</p>
<section id="the-expected-finding-smaller-pieces-of-credit" class="level3">
<h3 class="anchored" data-anchor-id="the-expected-finding-smaller-pieces-of-credit">The expected finding: smaller pieces of credit</h3>
<p>Search-R1’s reward only checks the final answer. For richer tool use, that signal is too sparse. The model takes too many distinct actions between question and answer for “did it work in the end” to tell it which specific action was good or bad.</p>
<p>ToolRL breaks the reward into pieces. For each tool call the model makes, compare it to a ground-truth tool call from the training data and grade three things separately:</p>
<ul>
<li><strong>Tool name:</strong> did the model pick the right tool?</li>
<li><strong>Parameter names:</strong> did it use the right arguments?</li>
<li><strong>Parameter values:</strong> did it fill them in correctly?</li>
</ul>
<p>The total correctness reward for a tool call is the sum:</p>
<p><img src="https://latex.codecogs.com/png.latex?r_%7B%5Ctext%7Btool%20call%7D%7D%20%5C;=%5C;%20r_%7B%5Ctext%7Bname%7D%7D%20%5C;+%5C;%20r_%7B%5Ctext%7Bparam-name%7D%7D%20%5C;+%5C;%20r_%7B%5Ctext%7Bparam-value%7D%7D"></p>
<p>The tool-name match is an intersection over union: the number of tool names that appear in both the model’s calls and the ground truth, divided by the number that appear in either:</p>
<p><img src="https://latex.codecogs.com/png.latex?r_%7B%5Ctext%7Bname%7D%7D%20%5C;=%5C;%20%5Cfrac%7B%7CN_%7B%5Ctext%7Bpredicted%7D%7D%20%5C,%5Ccap%5C,%20N_%7B%5Ctext%7Bground-truth%7D%7D%7C%7D%7B%7CN_%7B%5Ctext%7Bpredicted%7D%7D%20%5C,%5Ccup%5C,%20N_%7B%5Ctext%7Bground-truth%7D%7D%7C%7D"></p>
<p>Pick exactly the right set of tools and you get 1. Pick none of the right tools and you get 0. Pick extra tools on top of the right ones and the union grows, so the score drops below 1. The parameter-name match works the same way, with overlap between predicted and ground-truth parameter keys. The parameter-value match is stricter: exact equality between values, for the parameters that matched. Add the three components together, normalize, and you have a single scalar reward that scales smoothly from “nothing matched” to “everything matched perfectly.”</p>
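<p>A minimal sketch of this decomposition for a single predicted call scored against a single ground-truth call. The dict layout, the function names, and the choice to average the three components into [0, 1] are assumptions for illustration; the paper’s exact scaling and its matching across multiple calls may differ.</p>

```python
def iou(predicted: set, expected: set) -> float:
    """Intersection over union; defined as 1.0 when both sets are empty."""
    union = predicted | expected
    return len(predicted & expected) / len(union) if union else 1.0


def tool_call_reward(predicted: dict, expected: dict) -> float:
    """Score one call against its ground truth, averaged into [0, 1].

    Each call is a hypothetical dict: {"name": str, "parameters": dict}.
    """
    pred_params = predicted["parameters"]
    gold_params = expected["parameters"]

    r_name = iou({predicted["name"]}, {expected["name"]})
    r_param_name = iou(set(pred_params), set(gold_params))
    if gold_params:
        # Exact equality on values, counted over the ground-truth parameters.
        hits = sum(1 for k, v in gold_params.items()
                   if k in pred_params and pred_params[k] == v)
        r_param_value = hits / len(gold_params)
    else:
        r_param_value = 1.0

    return (r_name + r_param_name + r_param_value) / 3.0


gold = {"name": "get_weather", "parameters": {"city": "Paris", "unit": "celsius"}}
near_miss = {"name": "get_weather", "parameters": {"city": "Paris", "units": "celsius"}}
print(tool_call_reward(near_miss, gold))  # right tool, one misspelled key
```

<p>The near miss still earns most of the credit: the tool name is right, one of three distinct keys overlaps, and one of two values matches. An all-or-nothing reward would score it the same as a completely wrong call.</p>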
<p>The ablation: compared against a coarser version that only gives credit when the entire tool call matches exactly, the fine-grained reward trains faster and reaches higher final accuracy. The coarser version starves the model of useful gradient. When most of your trajectories fail somehow, partial credit for the parts you got right keeps the gradient informative.</p>
</section>
<section id="the-surprising-finding-longer-thinking-hurts" class="level3">
<h3 class="anchored" data-anchor-id="the-surprising-finding-longer-thinking-hurts">The surprising finding: longer thinking hurts</h3>
<p>In <a href="../../../../posts/series/how-llms-learn-to-reason/04-r1/index.html">the R1 piece</a> I argued that reasoning models think longer at inference because longer trajectories happened to correlate with correct answers, and the policy gradient followed the correlation. You’d expect this to extend to tool use: more thinking, more careful tool calls, better outcomes.</p>
<p>ToolRL tested this directly by adding a length reward, a small bonus for trajectories that produce longer thinking traces before tool calls. Here’s what happened on a tool-use benchmark called BFCL:</p>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Setup</th>
<th>Accuracy (Qwen2.5-3B)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Standard (no length reward)</td>
<td><strong>53.0%</strong></td>
</tr>
<tr class="even">
<td>Fixed length reward</td>
<td>48.9%</td>
</tr>
<tr class="odd">
<td>Dynamic length reward (escalating over training)</td>
<td>48.2%</td>
</tr>
</tbody>
</table>
<p>The length rewards do what they were designed to do: the traces visibly get longer. But accuracy gets worse, not better.</p>
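<p>For intuition, here is one way a fixed and an escalating length bonus could be written. This is a hypothetical sketch, not ToolRL’s implementation: the saturating form, the linear schedule, and every constant are assumptions.</p>

```python
def length_reward(think_tokens: int, target_tokens: int) -> float:
    """Bonus that grows linearly with thinking length and saturates at 1.0."""
    if target_tokens <= 0:
        raise ValueError("target_tokens must be positive")
    return min(think_tokens / target_tokens, 1.0)


def dynamic_target(step: int, total_steps: int,
                   base_target: int = 256, growth: float = 1.0) -> int:
    """Escalating target: base_target at step 0, rising linearly with progress."""
    progress = step / total_steps
    return int(base_target * (1.0 + growth * progress))


# The same 256-token trace earns less as the target escalates.
early = length_reward(256, dynamic_target(step=0, total_steps=100))
late = length_reward(256, dynamic_target(step=100, total_steps=100))
print(early, late)
```

<p>Note what the bonus does not look at: whether the tool call was right. It pays for tokens, and the policy gradient obliges.</p>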
<p>This reshapes how to think about the R1 piece. In math and code, longer reasoning traces meant more exploration before commitment, which translated cleanly into accuracy. In tool use, the same instinct backfires. The model about to commit to a tool call doesn’t always benefit from talking itself through one more time. It can talk itself out of the right call, or elaborate a wrong call into a worse one.</p>
<p>Thinking helps when the structure of the task rewards thinking. Math problems reward exploration before commitment. Tool use rewards picking the right tool, getting the arguments right, and getting on with it.</p>
<p>You can see hints of this in production models. Claude and GPT-4 produce different amounts of reasoning depending on whether you’ve asked them to do math or look up a fact. The faster response on a tool-use task isn’t laziness; it’s efficiency. Optimal trace length is task-shaped.</p>
</section>
</section>
<section id="what-this-means-for-the-models-you-use" class="level2">
<h2 class="anchored" data-anchor-id="what-this-means-for-the-models-you-use">What this means for the models you use</h2>
<p>When you watch Claude search the web, call a calculator, or run code, there’s a version of this kind of training behind it. Not literally Search-R1 or ToolRL (production training pipelines are more complex and not public), but the conceptual approach is the same: RL on rollouts where the model has tool access and there’s a clear definition of what a successful interaction looks like.</p>
<p>This explains some uneven patterns. Models are smoother with tools they saw a lot of during training (search, code execution, basic file operations) and rougher with niche or company-specific tools that didn’t get the same coverage. They’re also better at tools where success is checkable. For instance, a search returning useful results is verifiable.</p>
<p>The harness around the model (tool schemas, validation, retries, fallback rules) does some of this work too. But none of it substitutes for the model itself having the right reflexes: knowing when to call which tool, what arguments to pass, and when to skip the tool entirely. That part has to come from training.</p>
</section>
<section id="whats-next-and-the-series-wrap" class="level2">
<h2 class="anchored" data-anchor-id="whats-next-and-the-series-wrap">What’s next, and the series wrap</h2>
<p>This is the last piece in the series on RL training for LLMs. The arc:</p>
<ul>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/00-reinforce-foundations/index.html"><strong>REINFORCE foundations</strong></a>: the setup these algorithms live in. MDPs, returns, credit assignment, value vs policy, the RL objective.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/00-reinforce-gradient/index.html"><strong>REINFORCE gradient</strong></a>: the policy gradient derived. REINFORCE is weighted supervised learning.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/01-ppo/index.html"><strong>PPO</strong></a>: the workhorse policy-gradient algorithm; REINFORCE plus five fixes for variance and data efficiency.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/02-grpo/index.html"><strong>GRPO</strong></a>: the variant that drops the value network and powered DeepSeek’s reasoning models.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/03-dpo/index.html"><strong>DPO</strong></a>: the offline simplification that collapses RLHF into a single supervised loss.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/04-r1/index.html"><strong>R1</strong></a>: what GRPO produces when you point it at math and code, with reasoning behavior consolidating into a long-chain-of-thought policy.</li>
<li><strong>Search-R1 and ToolRL:</strong> what happens when you extend the recipe to tool use.</li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-jin_etal_2025_searchr1" class="csl-entry">
Jin, Bowen, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. <span>“Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2503.09516">https://arxiv.org/abs/2503.09516</a>.
</div>
<div id="ref-qian_etal_2025" class="csl-entry">
Qian, Cheng, Emre Can Acikgoz, Qi He, et al. 2025. <span>“ToolRL: Reward Is All Tool Learning Needs.”</span> <em>Advances in Neural Information Processing Systems (NeurIPS)</em>. <a href="https://arxiv.org/abs/2504.13958">https://arxiv.org/abs/2504.13958</a>.
</div>
</div></section></div> ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/05-tool-use/</guid>
  <pubDate>Sat, 14 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>R1-Zero was the result. R1 was the product</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/04-r1/</link>
  <description><![CDATA[ 





<p>R1-Zero is the conceptual result. R1 is the engineering wrapping around it.</p>
<p>A reasoning model’s trace looks structurally different from a standard LLM’s. The model works through a problem, gets partway, says <em>wait, that’s not quite right</em>, backs up, tries a different approach, checks the result against the constraints, notices the check failed, revises again, then commits. Thousands of tokens of exploration, self-correction, and hypothesis revision before the boxed answer.</p>
<p>Nobody wrote down the rule “if your first approach looks wrong, say <em>wait</em> and try a different one.” Nobody curated a corpus of reasoning traces with backtracking and trained on them via SFT. The behavior consolidates out of a simple training procedure.</p>
<section id="r1-zero-is-the-result.-everything-else-is-interpretation." class="level2">
<h2 class="anchored" data-anchor-id="r1-zero-is-the-result.-everything-else-is-interpretation.">R1-Zero is the result. Everything else is interpretation.</h2>
<p>The setup is simple. Start from DeepSeek-V3-Base <span class="citation" data-cites="liu_etal_2024">(<span class="nocase">Liu et al.</span> 2024)</span>, a 671B model pretrained on text but never instruction-tuned. No SFT, no RLHF, no human-written reasoning examples.</p>
<p>Run GRPO on it (PPO with the critic removed and a group-relative advantage in its place; the <a href="../../../../posts/series/how-llms-learn-to-reason/02-grpo/index.html">GRPO piece</a> covers the details). Use math and code prompts where the answer is checkable. Two rewards:</p>
<ul>
<li><strong>Accuracy:</strong> exact-match against the ground-truth answer for math, test execution for code.</li>
<li><strong>Format:</strong> the output uses the <code>&lt;think&gt;...&lt;/think&gt;&lt;answer&gt;...&lt;/answer&gt;</code> template.</li>
</ul>
<p>Both rule-based. No neural reward model, no process supervision, no verifier trained on preferences.</p>
<section id="what-happens-during-training" class="level3">
<h3 class="anchored" data-anchor-id="what-happens-during-training">What happens during training</h3>
<p>AIME 2024 accuracy climbs from 15.6% to 71.0%, and to 86.7% with sixteen samples and majority voting <span class="citation" data-cites="guo_etal_2025">(<span class="nocase">Guo et al.</span> 2025)</span>. The surprising part is response length. Average response length grows monotonically across training.</p>
<div id="fig-r1zero-length" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Line chart from the DeepSeek-R1 paper: average response length per response versus training steps.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-r1zero-length-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/04-r1/r1_zero.png" class="img-fluid figure-img" alt="Line chart from the DeepSeek-R1 paper: average response length per response versus training steps.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-r1zero-length-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Average response length of DeepSeek-R1-Zero grows monotonically over GRPO training, with no length term in the reward. The jump near step 8.2k coincides with the output cap being raised from 32K to 65K tokens. Source: DeepSeek-AI, 2025 (DeepSeek-R1).
</figcaption>
</figure>
</div>
<p>The reward function never mentions length. The model discovers through trajectory comparison under GRPO that longer traces correlate with getting the answer right.</p>
<p>The paper flags an “aha moment”: mid-derivation on a competition math problem, the model produces <em>Wait, wait. Wait. That’s an aha moment I can flag here</em> and re-evaluates a step it was about to commit to. Self-reflective tokens (<em>wait</em>, <em>check</em>, <em>verify</em>, <em>mistake</em>, <em>wrong</em>) grow alongside response length. None were in the reward function or the prompt template. They appeared because trajectories that included them got reward more reliably than trajectories without them.</p>
</section>
<section id="whats-actually-happening" class="level3">
<h3 class="anchored" data-anchor-id="whats-actually-happening">What’s actually happening</h3>
<p>The mechanism is policy gradient on long trajectories with sparse terminal reward. You do many rollouts. Most fail. The verifier returns a single bit at the end of each: correct or not. That bit, converted into a group-relative advantage, gets broadcast across every token of its trajectory: positive for rollouts that beat their group’s average, negative for the rest. Whatever distinguished successful trajectories from failures, on average, gets reinforced: backtracking, self-checks, alternative approaches, length.</p>
<p>Karpathy calls this “sucking supervision through a straw.” A minute of rollout updated by a single bit. It’s crude credit assignment, and it works given enough samples and gradient steps.</p>
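<p>A rough sketch of that credit assignment, assuming the usual GRPO normalization (reward minus group mean, divided by group standard deviation). The population standard deviation, the <code>eps</code> term, and the token counts are illustrative choices, not details taken from the paper.</p>

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages: list[float], lengths: list[int]) -> list[list[float]]:
    """Every token in a rollout receives that rollout's single advantage."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Four rollouts for one prompt; the verifier returns one bit each.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [5, 3, 4, 6]  # hypothetical token counts per rollout
advantages = group_relative_advantages(rewards)
per_token = broadcast_to_tokens(advantages, lengths)
```

<p>Every token in a winning rollout gets the same positive number and every token in a losing one the same negative number. No token knows whether it personally helped. Only averaging over many groups separates the useful habits from the noise.</p>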
</section>
<section id="what-r1-zero-shows" class="level3">
<h3 class="anchored" data-anchor-id="what-r1-zero-shows">What R1-Zero shows</h3>
<p>R1-Zero establishes an important claim: GRPO with rule-based verifiable rewards, on a sufficiently strong pretrained base model, consolidates latent reasoning behavior into a reliable long-CoT policy.</p>
<p><em>Consolidates</em> is doing the work. Recent work on chain-of-thought decoding <span class="citation" data-cites="wang_zhou_2024">(Wang and Zhou 2024)</span> shows that reasoning traces, including backtracking and self-correction, are already present in pretrained models, just not at rank one in the next-token distribution. R1-Zero doesn’t create reasoning behavior from nothing. It surfaces latent capability as a consistent policy, with trace lengths scaling up as the model discovers what gets rewarded.</p>
</section>
</section>
<section id="r1-is-what-you-build-when-the-policy-needs-to-be-a-product" class="level2">
<h2 class="anchored" data-anchor-id="r1-is-what-you-build-when-the-policy-needs-to-be-a-product">R1 is what you build when the policy needs to be a product</h2>
<p>R1’s contribution is different from R1-Zero’s. It’s the recipe for taking the raw reasoning behavior and making it a deployable assistant.</p>
<section id="why-r1-zero-isnt-a-product" class="level3">
<h3 class="anchored" data-anchor-id="why-r1-zero-isnt-a-product">Why R1-Zero isn’t a product</h3>
<p>R1-Zero mixes Chinese and English mid-trace and can produce correct math in prose so messy a user can’t tell. The reasoning policy was optimized for one thing (getting the answer right under verification) and it shows.</p>
</section>
<section id="the-four-stage-pipeline" class="level3">
<h3 class="anchored" data-anchor-id="the-four-stage-pipeline">The four-stage pipeline</h3>
<p>R1’s pipeline addresses this in four stages:</p>
<ol type="1">
<li><strong>Cold-start SFT.</strong> A few thousand long-CoT examples sampled from R1-Zero outputs, refined for readability and language consistency, used to fine-tune V3-Base. Gives the next RL stage a stylistically aligned starting point.</li>
<li><strong>R1-Zero-style RL.</strong> GRPO on the cold-start checkpoint, same rule-based rewards plus a language-consistency reward to suppress mixing.</li>
<li><strong>Rejection sampling and second SFT.</strong> The post-RL model generates reasoning trajectories filtered for correctness, combined with non-reasoning examples, and used for another round of SFT.</li>
<li><strong>Final RL.</strong> Rule-based rewards on verifiable tasks combined with neural reward models on helpfulness and harmlessness, the latter trained on human preferences in the standard RLHF way.</li>
</ol>
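<p>Stage 3 is the easiest of the four to sketch in code. The version below assumes a sampler for the post-RL model and a rule-based correctness check; <code>generate</code>, <code>is_correct</code>, and <code>samples_per_prompt</code> are hypothetical names, and the real pipeline’s filters are not reproduced here.</p>

```python
from typing import Callable


def rejection_sample(
    prompts: list[tuple[str, str]],          # (prompt, ground-truth answer)
    generate: Callable[[str], str],          # hypothetical sampler for the post-RL model
    is_correct: Callable[[str, str], bool],  # hypothetical rule-based checker
    samples_per_prompt: int = 4,
) -> list[tuple[str, str]]:
    """Keep only (prompt, trajectory) pairs whose trajectory passes the check."""
    kept = []
    for prompt, truth in prompts:
        for _ in range(samples_per_prompt):
            trajectory = generate(prompt)
            if is_correct(trajectory, truth):
                kept.append((prompt, trajectory))
    return kept
```

<p>The surviving pairs would then be mixed with non-reasoning examples to form the second SFT set.</p>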
</section>
</section>
<section id="test-time-compute-is-downstream-of-training" class="level2">
<h2 class="anchored" data-anchor-id="test-time-compute-is-downstream-of-training">Test-time compute is downstream of training</h2>
<p>Snell’s result <span class="citation" data-cites="snell_etal_2024">(Snell et al. 2024)</span> predates R1 and shows that a smaller model with adaptive test-time compute can match a 14× larger model on easy and medium-difficulty problems. But Snell got there by building substantial external scaffolding around the base LLM: a separately trained process reward model (PRM; scoring intermediate steps of a reasoning trace), a revision model fine-tuned on self-correction trajectories, beam search over PRM outputs, best-of-N with verifier-weighted aggregation. The scaffolding extracts the gains.</p>
<p>R1 displaced the stack. No PRM. No beam search. No revision model. No best-of-N at inference. The model produces one long CoT in one decoding pass. The DeepSeek paper classifies PRMs and MCTS-style search (Monte Carlo Tree Search) as <em>unsuccessful attempts</em> in their own development.</p>
<p>The scaffolding got internalized into the policy via training. <em>Wait, that’s wrong</em> is the model running its own verifier mid-trace. Backtracking is the model doing its own revision. Exploring an alternative approach is the model running its own search.</p>
<p>When somebody says a reasoning model “spends more compute at inference,” what they usually mean is that the model produces a longer CoT. The longer CoT exists because training made the longer CoT effective.</p>
</section>
<section id="verifiable-rewards-are-the-source-and-the-boundary" class="level2">
<h2 class="anchored" data-anchor-id="verifiable-rewards-are-the-source-and-the-boundary">Verifiable rewards are the source and the boundary</h2>
<p>Every successful application of this recipe shares a common feature: outcomes can be checked cheaply and unambiguously. Math problems with deterministic final answers. Coding problems with executable test suites. Formal theorem proving with proof checkers.</p>
<p>Reasoning behavior consolidates in R1-Zero because GRPO can attribute reward credit reliably across thousands of tokens based on a single end-of-trace check. The trace is long, the signal is sparse, the per-step contribution is invisible. None of that matters as long as the final-answer check is trustworthy. You compare trajectories, reward the ones that ended right, and over enough samples the policy gradient finds whatever in the middle of the trace was helping. Self-correction phrases, hypothesis exploration, longer length all win the comparison on average, when verification is reliable.</p>
<p>Run the same recipe with a learned reward model (say, a neural judge trained on human preferences for “good reasoning”) and the signal becomes noisy and gameable under optimization pressure. Rule-based rewards on verifiable domains don’t.</p>
<section id="the-limits-of-the-recipe" class="level3">
<h3 class="anchored" data-anchor-id="the-limits-of-the-recipe">The limits of the recipe</h3>
<p>“Verifiable” is narrower than it sounds. Math has answer checking. Competition coding has test cases. But real software engineering, where “correct” includes design, readability, maintainability, and integration, is not verifiable in the R1 sense. The R1 recipe reaches the verifiable subset of code, not coding as a practice.</p>
<p>Even within verifiable domains, the recipe doesn’t produce uniform competence. Recent empirical work <span class="citation" data-cites="shojaee_etal_2025">(Shojaee et al. 2025)</span> reports task-specific reliability ceilings: models collapse on puzzle instances of certain complexity even when nominally within their domain. Being inside a verifiable domain is necessary but not sufficient for reliable reasoning.</p>
<p>The same training that produces backtracking and self-correction can produce overthinking and second-guessing on tasks where the answer was obvious. R1 underperforms standard LLMs on instruction-following evaluations even while crushing AIME and Codeforces.</p>
</section>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s next</h2>
<p>R1-Zero is one demonstration in a broader pattern. The same recipe extends to multi-turn tool-using settings if outcomes remain verifiable. Search-R1 <span class="citation" data-cites="jin_etal_2025_searchr1">(Jin et al. 2025)</span> is the cleanest demonstration: extend the recipe to a model that interleaves reasoning with search-engine calls, and the model learns to use the tool well as part of its reasoning policy.</p>
<p>The <a href="../../../../posts/series/how-llms-learn-to-reason/05-tool-use/index.html">next piece</a> covers what changes when you move from single-turn verifiable reasoning to multi-turn tool-using settings and how the boundaries of “verifiable” get tested when you add tools and external state.</p>
<p>In the broader arc:</p>
<ul>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/01-ppo/index.html"><strong>PPO</strong></a> is the workhorse algorithm.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/03-dpo/index.html"><strong>DPO</strong></a> is the offline alternative.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/02-grpo/index.html"><strong>GRPO</strong></a> is the variant for verifiable reward settings.</li>
<li><strong>R1</strong> is what GRPO produces when you point it at math and code on a strong base model.</li>
</ul>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-guo_etal_2025" class="csl-entry">
<span class="nocase">Guo, Daya, Dejian Yang, Haowei Zhang, et al.</span> 2025. <span>“DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2501.12948">https://arxiv.org/abs/2501.12948</a>.
</div>
<div id="ref-jin_etal_2025_searchr1" class="csl-entry">
Jin, Bowen, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. 2025. <span>“Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2503.09516">https://arxiv.org/abs/2503.09516</a>.
</div>
<div id="ref-liu_etal_2024" class="csl-entry">
<span class="nocase">Liu, Aixin, Bei Feng, Bing Xue, et al.</span> 2024. <span>“DeepSeek-V3 Technical Report.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2412.19437">https://arxiv.org/abs/2412.19437</a>.
</div>
<div id="ref-shojaee_etal_2025" class="csl-entry">
Shojaee, Parshin, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. 2025. <span>“The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2506.06941">https://arxiv.org/abs/2506.06941</a>.
</div>
<div id="ref-snell_etal_2024" class="csl-entry">
Snell, Charlie, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. <span>“Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2408.03314">https://arxiv.org/abs/2408.03314</a>.
</div>
<div id="ref-wang_zhou_2024" class="csl-entry">
Wang, Xuezhi, and Denny Zhou. 2024. <span>“Chain-of-Thought Reasoning Without Prompting.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2402.10200">https://arxiv.org/abs/2402.10200</a>.
</div>
</div></section></div> ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/04-r1/</guid>
  <pubDate>Tue, 10 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>DPO: RLHF collapsed into one loss</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/03-dpo/</link>
  <description><![CDATA[ 





<!-- Shared series navigation. Each PART includes this file with a Quarto
     include shortcode pointing at ../_series.qmd (see any part's index.qmd
     for the exact syntax — do NOT repeat that shortcode here, it would
     recurse). Links are sibling-relative (../NN-slug/) so they resolve
     identically from any part. When you add a part, add one line here.
     Files starting with "_" are never rendered as their own page. -->
<div class="series-nav">
<div class="series-label">
Part of a series
</div>
<div class="series-name">
How LLMs learn to reason
</div>
<!-- Add parts as an ordered list below as you publish, e.g.
     1. [Part title](../01-slug/)
     2. [Part title](../02-slug/) -->
<ol type="1">
<li><a href="../00-reinforce-foundations/">REINFORCE: the world before the gradient</a></li>
<li><a href="../00-reinforce-gradient/">REINFORCE: the gradient that drives training</a></li>
<li><a href="../01-ppo/">PPO is REINFORCE plus five fixes</a></li>
<li><a href="../02-grpo/">GRPO: The algorithm behind reasoning models</a></li>
<li><a href="../03-dpo/">DPO: RLHF collapsed into one loss</a></li>
<li><a href="../04-r1/">R1-Zero was the result. R1 was the product</a></li>
<li><a href="../05-tool-use/">How reasoning models learn to use tools</a></li>
</ol>
</div>
<p>In May 2023, a paper showed up titled “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” <span class="citation" data-cites="rafailov_etal_2023">(Rafailov et al. 2023)</span> The claim: you could do RLHF without a reward model, without PPO, without rollouts, and without RL training of any kind. Just preference pairs and a clever loss function.</p>
<p>If you’d been working with <a href="../../../../posts/series/how-llms-learn-to-reason/01-ppo/index.html">PPO + RLHF</a> up to that point, this sounded too good to be true. PPO was famously painful. Four models in memory simultaneously: policy, value network, reference model, reward model. Online sampling during training. The Anthropic and OpenAI teams running serious RLHF had built infrastructure most people couldn’t afford to replicate.</p>
<p>DPO said: skip all of it. Train a single supervised loss on preference pairs. Done.</p>
<p>I followed DPO when it landed. I derived the math step by step. I wrote a small training pipeline around TRL’s DPOTrainer and got it running locally on a 24GB GPU. In parallel, the community’s reaction was also immediate: within months, the open-source preference-tuning ecosystem had largely shifted to DPO. Models like Zephyr <span class="citation" data-cites="tunstall_etal_2023">(Tunstall et al. 2023)</span> showed up. The teams running PPO infrastructure didn’t go away, but the floor opened up dramatically.</p>
<p>First, let me anchor what DPO is replacing.</p>
<section id="why-ppo-rlhf-was-painful" class="level2">
<h2 class="anchored" data-anchor-id="why-ppo-rlhf-was-painful">Why PPO + RLHF was painful</h2>
<p>PPO + RLHF is a multi-stage pipeline. You start with an SFT model, then:</p>
<ol type="1">
<li>Collect a preference dataset: pairs of (prompt, chosen response, rejected response).</li>
<li>Train a separate reward model on this dataset, learning to score (prompt, response) pairs.</li>
<li>Run PPO using the reward model as the reward signal, generating fresh rollouts during training.</li>
</ol>
<p>Training requires four LLM-sized models in memory: the SFT model serves as the reference for KL anchoring, the policy is the LLM being trained, the reward model scores rollouts, and PPO needs a value network on top.</p>
<p>The KL anchor matters because of what happens without it. The policy will drift toward outputs that score high under the reward model but bear no relationship to coherent text. Reward models, being LLMs themselves, have adversarial inputs: sequences that exploit weaknesses in the reward model and produce arbitrarily high scores while reading as gibberish. The KL term keeps the policy close to the SFT model.</p>
<p>Mathematically, the RLHF objective is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmax_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D,%20%5C,%20y%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%20%5Cleft%5B%20r(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20%5Cright%5D"></p>
<p>In words: maximize expected reward, subject to a KL penalty that keeps you close to the reference model. The hyperparameter <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> controls the tradeoff.</p>
<p>The punchline of DPO is that this objective has a closed-form optimal policy, and that closed form contains all the information you need to train against preference data directly: no reward model, no rollouts, no PPO. We’ll work through the derivation in a few paragraphs. But first, the conceptual move.</p>
</section>
<section id="the-dpo-surprise" class="level2">
<h2 class="anchored" data-anchor-id="the-dpo-surprise">The DPO surprise</h2>
<div id="fig-ppo-rlhf-vs-dpo" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Two cards side by side. Left card with amber border, titled PPO plus RLHF. Pipeline row: three boxes preferences, reward model, PPO plus rollouts, connected by arrows. Models in memory row: four boxes, with policy and value filled in indigo to indicate they are being trained, and reward and reference outlined in gray to indicate they are frozen. Summary line: 2 trained, 4 in memory, online rollouts. Right card with teal border, titled DPO. Pipeline row: two boxes preferences, DPO loss, connected by an arrow. Models in memory row: two boxes, policy filled indigo (trained), reference outlined gray (frozen). Summary line: 1 trained, 2 in memory, no rollouts. Caption: Same KL-regularized objective. Half the moving parts.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ppo-rlhf-vs-dpo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/03-dpo/ppo_rlhf_vs_dpo.png" class="img-fluid figure-img" alt="Two cards side by side. Left card with amber border, titled PPO plus RLHF. Pipeline row: three boxes preferences, reward model, PPO plus rollouts, connected by arrows. Models in memory row: four boxes, with policy and value filled in indigo to indicate they are being trained, and reward and reference outlined in gray to indicate they are frozen. Summary line: 2 trained, 4 in memory, online rollouts. Right card with teal border, titled DPO. Pipeline row: two boxes preferences, DPO loss, connected by an arrow. Models in memory row: two boxes, policy filled indigo (trained), reference outlined gray (frozen). Summary line: 1 trained, 2 in memory, no rollouts. Caption: Same KL-regularized objective. Half the moving parts.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ppo-rlhf-vs-dpo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: DPO collapsed the RLHF pipeline. PPO + RLHF runs three pipeline stages (preferences, train a reward model, run PPO with online rollouts) and keeps four LLM-sized models in memory: policy and value being trained in tandem, plus reward and reference as frozen scorers. DPO collapses this to two pipeline stages (preferences feeding directly into one supervised loss) and two models in memory (policy being trained, reference frozen as the KL anchor). Same KL-regularized objective; half the moving parts.
</figcaption>
</figure>
</div>
<p>DPO’s pitch is structural. The RLHF pipeline has two coupled training problems: train a reward model that captures human preferences, then use that reward model to train a policy. DPO observes that these two steps are mathematically linked. Specifically, given the KL-regularized RLHF objective above, the optimal policy can be expressed as a function of the reward.</p>
<p>That relationship can be inverted: the reward can be expressed in terms of the optimal policy, so you don’t need to train a reward model separately. You can plug the implicit reward directly into the preference modeling framework, and now your policy is being trained to match preferences directly.</p>
<p>The result: a single supervised loss on preference pairs. No rollouts. No value network. No reward model.</p>
<p>Two models in memory: the policy being trained, and the reference model. That’s it. Compare to PPO + RLHF’s four. The memory and simplicity savings are significant.</p>
<p>But before showing the math, I want to land the right intuition for what DPO is doing, because there’s a misreading of DPO that’s tempting and wrong.</p>
</section>
<section id="the-key-intuition-relative-to-the-reference" class="level2">
<h2 class="anchored" data-anchor-id="the-key-intuition-relative-to-the-reference">The key intuition: relative to the reference</h2>
<p>The wrong reading of DPO: “It’s preference training. Increase chosen, decrease rejected.” Roughly what happens, but it misses the structural point.</p>
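<p>DPO doesn’t push chosen up and rejected down in absolute terms. It pushes the policy’s preference for chosen over rejected above what the reference already was. The intuition: the reference’s own margin is the baseline. Once the policy favors chosen by comfortably more than the reference did, DPO barely moves. While the policy’s margin sits at or below the reference’s, DPO pushes hard.</p>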
<p>This is what preserves the KL anchor from RLHF. Naive preference training (just push chosen up, rejected down) would let the policy drift arbitrarily far from the reference. The “above what the reference already was” part is what keeps DPO solving the same KL-regularized problem PPO + RLHF solves. And it’s free: the KL anchor falls out of the derivation. No separate KL term in the loss.</p>
</section>
<section id="the-derivation-where-the-dpo-loss-comes-from" class="level2">
<h2 class="anchored" data-anchor-id="the-derivation-where-the-dpo-loss-comes-from">The derivation: where the DPO loss comes from</h2>
<p>I’m going to walk through this in four steps.</p>
<section id="step-1-start-from-the-kl-regularized-rlhf-objective" class="level3">
<h3 class="anchored" data-anchor-id="step-1-start-from-the-kl-regularized-rlhf-objective">Step 1: Start from the KL-regularized RLHF objective</h3>
<p>Same objective as PPO + RLHF:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmax_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D,%20%5C,%20y%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%20%5Cleft%5B%20r(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20%5Cright%5D"></p>
<p>We want high reward, but we pay a penalty for drifting too far from the reference model. The hyperparameter <img src="https://latex.codecogs.com/png.latex?%5Cbeta"> trades off these two pressures.</p>
</section>
<section id="step-2-solve-for-the-optimal-policy" class="level3">
<h3 class="anchored" data-anchor-id="step-2-solve-for-the-optimal-policy">Step 2: Solve for the optimal policy</h3>
<p>The KL-regularized objective above has a closed-form solution. After working through the calculus (the partition function manipulation is the heart of it), the optimal policy is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*(y%7Cx)%20=%20%5Cfrac%7B1%7D%7BZ(x)%7D%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?Z(x)%20=%20%5Csum_y%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)"> is a normalizer that makes the result a valid probability distribution.</p>
<p>The optimal policy is the reference policy tilted toward high-reward completions, with the strength of the tilt controlled by <img src="https://latex.codecogs.com/png.latex?%5Cbeta">.</p>
<p>This step is doing real algebraic work. Deriving it requires writing the objective as a KL divergence between two distributions and identifying the minimizer. For a more granular walk-through of the derivation, see my earlier write-up at <a href="https://chanys.github.io/dpo/" class="uri">https://chanys.github.io/dpo/</a>. For the conceptual flow, what matters is the form: the optimal policy is the reference, multiplied by an exponentiated reward term, normalized.</p>
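<p>If you’d rather check the closed form than trust it, a toy case with three possible completions is enough. The sketch below uses made-up numbers for <code>pi_ref</code>, <code>reward</code>, and <code>beta</code>, and compares the closed-form policy against a thousand random policies on the KL-regularized objective for a single prompt.</p>

```python
import math
import random

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # toy reference probabilities over three completions
reward = [1.0, 0.0, 2.0]   # made-up rewards r(x, y)


def objective(pi: list[float]) -> float:
    """E_{y~pi}[ r(x, y) - beta * log(pi(y|x) / pi_ref(y|x)) ] for one prompt."""
    return sum(
        p * (r - beta * math.log(p / q))
        for p, r, q in zip(pi, reward, pi_ref)
        if p > 0
    )


# Closed form: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)
unnormalized = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
Z = sum(unnormalized)
pi_star = [u / Z for u in unnormalized]

# Compare against randomly drawn policies on the same three completions.
random.seed(0)
best_random = -math.inf
for _ in range(1000):
    raw = [random.random() for _ in pi_ref]
    total = sum(raw)
    best_random = max(best_random, objective([v / total for v in raw]))
```

<p>No random policy beats <code>pi_star</code>, and the objective value at <code>pi_star</code> works out to <code>beta * log(Z)</code>. That is a first sign that the normalizer carries real information.</p>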
</section>
<section id="step-3-rearrange-to-express-reward-through-the-policy" class="level3">
<h3 class="anchored" data-anchor-id="step-3-rearrange-to-express-reward-through-the-policy">Step 3: Rearrange to express reward through the policy</h3>
<p>The expression in Step 2 has reward on the right and optimal policy on the left. We can flip it. Take the log of both sides:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi%5E*(y%7Cx)%20=%20%5Clog%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20+%20%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%20-%20%5Clog%20Z(x)"></p>
<p>Solving for <img src="https://latex.codecogs.com/png.latex?r(x,%20y)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?r(x,%20y)%20=%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20+%20%5Cbeta%20%5Clog%20Z(x)"></p>
<p>This is the “your language model is secretly a reward model” moment. The reward function, the thing we were going to train a separate reward model to estimate, can be expressed entirely in terms of the optimal policy and the reference policy. No reward model needed. The reward is encoded in the log-probability ratio between the optimal policy and the reference policy.</p>
<p>The partition function <img src="https://latex.codecogs.com/png.latex?Z(x)"> is still hanging around, which is annoying because computing it requires summing over all possible completions <img src="https://latex.codecogs.com/png.latex?y">. We’ll deal with it in the next step.</p>
</section>
<section id="step-4-plug-into-bradley-terry" class="level3">
<h3 class="anchored" data-anchor-id="step-4-plug-into-bradley-terry">Step 4: Plug into Bradley-Terry</h3>
<p>The Bradley-Terry preference model gives the probability that one response is preferred over another, in terms of their rewards:</p>
<p><img src="https://latex.codecogs.com/png.latex?p(y_w%20%5Csucc%20y_l%20%5Cmid%20x)%20=%20%5Csigma%5Cbig(r(x,%20y_w)%20-%20r(x,%20y_l)%5Cbig)"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Csigma"> is the logistic sigmoid. Substituting in our expression for <img src="https://latex.codecogs.com/png.latex?r(x,%20y)"> from Step 3:</p>
<p><img src="https://latex.codecogs.com/png.latex?p(y_w%20%5Csucc%20y_l%20%5Cmid%20x)%20=%20%5Csigma%5C!%5Cleft(%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_w%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y_w%7Cx)%7D%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi%5E*(y_l%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y_l%7Cx)%7D%20%5Cright)"></p>
<p>The partition function disappeared. Bradley-Terry only cares about reward <em>differences</em> between two completions for the same prompt, and <img src="https://latex.codecogs.com/png.latex?Z(x)"> depends only on the prompt: it cancels out in the subtraction. This is the reason DPO works as a practical algorithm. Without this cancellation, we’d be stuck computing partition functions over the full output space.</p>
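<p>The same kind of toy setup makes Steps 3 and 4 checkable. This is a sketch on made-up numbers; <code>implicit</code> is my name for the log-ratio term with the normalizer left out.</p>

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


beta = 0.5
pi_ref = [0.5, 0.3, 0.2]   # toy reference probabilities
reward = [1.0, 0.0, 2.0]   # made-up rewards r(x, y)

unnormalized = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
Z = sum(unnormalized)
pi_star = [u / Z for u in unnormalized]

# Implicit reward without the beta * log Z(x) term.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, pi_ref)]

# Step 3: adding beta * log Z(x) back recovers the true reward.
recovered = [h + beta * math.log(Z) for h in implicit]

# Step 4: Bradley-Terry only sees differences, so Z(x) is not needed.
win, lose = 2, 0  # indices of the preferred and dispreferred completions
p_from_rewards = sigmoid(reward[win] - reward[lose])
p_from_ratios = sigmoid(implicit[win] - implicit[lose])
```

<p>The two preference probabilities agree to floating-point precision, even though the second one never adds the normalizer back in.</p>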
<p>We replace <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*"> with our learned policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> and minimize the negative log-likelihood of the observed preferences:</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%5Ctext%7BDPO%7D(%5Ctheta)%20=%20-%5C,%5Cmathbb%7BE%7D_%7B(x,%20y_w,%20y_l)%20%5Csim%20D%7D%20%5Cleft%5B%20%5Clog%20%5Csigma%5C!%5Cleft(%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(y_w%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y_w%7Cx)%7D%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi_%5Ctheta(y_l%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y_l%7Cx)%7D%20%5Cright)%20%5Cright%5D"></p>
<p>This is the DPO loss. Run gradient descent on it over a preference dataset, and you’re solving the same KL-regularized RLHF objective that PPO was solving, without the reward model, the rollouts, or the value network.</p>
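<p>Written as code, the loss is only a few lines. This is a dependency-free sketch that takes summed sequence log-probabilities as plain floats; the dictionary keys are hypothetical, and a real trainer would compute these from model logits and backpropagate through the policy terms.</p>

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(batch: list[dict], beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of preference pairs.

    Each item holds summed sequence log-probabilities under hypothetical keys:
    policy_chosen, policy_rejected, ref_chosen, ref_rejected.
    """
    losses = []
    for ex in batch:
        chosen_logratio = ex["policy_chosen"] - ex["ref_chosen"]
        rejected_logratio = ex["policy_rejected"] - ex["ref_rejected"]
        margin = beta * (chosen_logratio - rejected_logratio)
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```

<p>When the policy equals the reference, both log-ratios are zero and the loss is <code>log 2</code>, so that is where every run starts.</p>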
</section>
</section>
<section id="what-the-loss-is-actually-doing" class="level2">
<h2 class="anchored" data-anchor-id="what-the-loss-is-actually-doing">What the loss is actually doing</h2>
<p>Look at the structure of the DPO loss. There are two log-probability ratios: one for the chosen response, one for the rejected. Each compares the current policy’s probability to the reference policy’s probability. The loss wants the chosen log-ratio to be larger than the rejected log-ratio.</p>
<p>This is the formal version of the intuition I described earlier. The training pressure isn’t “increase chosen, decrease rejected.” It’s “increase the chosen-over-rejected log-ratio relative to the reference.” If the policy already favors the chosen response by a wider margin than the reference does, the loss is already partially satisfied; the gradient is small. If the policy’s margin has fallen below the reference’s, the gradient is large, pushing hard to restore it.</p>
<p>The KL anchor isn’t a separate term. It’s built into the structure of the loss. Both numerators in the log-ratios are the policy being trained; both denominators are the reference. The reference’s behavior is the implicit constraint on what the policy can do.</p>
<p>The original RLHF objective had a KL penalty as an <em>added term</em>. DPO collapses everything (the reward, the KL anchor, the policy update) into a single supervised loss where the structure of the equation enforces all three.</p>
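<p>One way to see the reference acting as a baseline is to look at the scalar that multiplies the DPO gradient for each pair: the sigmoid of the negated margin. The sketch below evaluates it on made-up log-probabilities; the function name and the numbers are illustrative.</p>

```python
import math


def gradient_weight(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """sigmoid(-margin): the scalar multiplying the DPO gradient for one pair."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return 1.0 / (1.0 + math.exp(margin))


# Policy identical to the reference: the margin is zero whatever the reference preferred.
at_init = gradient_weight(-10.0, -20.0, -10.0, -20.0)
# Policy has widened the chosen-over-rejected gap well past the reference's.
ahead = gradient_weight(-5.0, -45.0, -10.0, -20.0)
# Policy prefers chosen less than the reference does.
behind = gradient_weight(-20.0, -10.0, -10.0, -20.0)
```

<p>The weight is 0.5 at initialization, shrinks as the policy pulls ahead of the reference, and grows toward 1 when it falls behind. Absolute log-probabilities never enter, only the gap between policy and reference.</p>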
</section>
<section id="what-dpo-meant-for-the-field" class="level2">
<h2 class="anchored" data-anchor-id="what-dpo-meant-for-the-field">What DPO meant for the field</h2>
<p>DPO didn’t replace online RL. It carved out a specific lane. What DPO won is the open-source ecosystem of <em>budget-constrained</em> preference tuning: for teams without lab-scale compute, DPO is the dominant approach. The structural distinction that matters isn’t PPO vs.&nbsp;DPO; it’s <em>online</em> (fresh rollouts during training) vs.&nbsp;<em>offline</em> (a fixed preference dataset). Online has an exploration advantage; offline is simpler. For most use cases, the simplicity wins.</p>
<p>In mid-2023, doing RLHF was something a small number of well-resourced labs could do. PPO was complex and engineering-intensive. The infrastructure for online preference tuning was confined to teams that had built it. Most people who wanted to fine-tune a model on preference data couldn’t realistically do it.</p>
<p>By late 2023, that had changed. DPO combined with QLoRA (LoRA adapters on a 4-bit quantized base model) meant the entire preference-tuning recipe could be run on a single GPU. The Zephyr models, the Tülu series, the wave of open instruction-tuned models that followed: all of these were downstream of DPO making the pipeline accessible. The simplification of preference tuning meant orders of magnitude more people could do it.</p>
</section>
<section id="appendix-deriving-the-optimal-policy" class="level2">
<h2 class="anchored" data-anchor-id="appendix-deriving-the-optimal-policy">Appendix: Deriving the optimal policy</h2>
<p>This is the gory derivation I deferred in Step 2: the manipulation that takes the KL-regularized RLHF objective and produces the closed-form optimal policy. I worked through this in late 2023, a few months after the DPO paper came out, because I wanted to convince myself the algorithm was real rather than take the result on faith. What follows is essentially that derivation, cleaned up. Skip this section if you’re satisfied with the form of the result. Read on if you want to see how the partition function shows up.</p>
<p>We start with the RLHF objective:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmax_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D,%20%5C,%20y%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%20%5Cleft%5B%20r(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20%5Cright%5D"></p>
<p>The KL divergence can be written as an expectation:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BD%7D_%5Ctext%7BKL%7D%5C!%5Cleft%5B%5Cpi(y%7Cx)%20%5C,%5C%7C%5C,%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%5Cright%5D%20=%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20%5Cright%5D"></p>
<p>So the inner expectation in our objective becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20r(x,%20y)%20-%20%5Cbeta%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20%5Cright%5D"></p>
<p>Divide by <img src="https://latex.codecogs.com/png.latex?-%5Cbeta"> to convert the maximization into a minimization, and the sign of the reward term flips:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20-%20%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%20%5Cright%5D"></p>
<p>Use the identity <img src="https://latex.codecogs.com/png.latex?x%20=%20%5Clog%20%5Cexp(x)"> to rewrite the reward term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%7D%20-%20%5Clog%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)%20%5Cright%5D"></p>
<p>Combine the two logarithms:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)%7D%20%5Cright%5D"></p>
<p>The denominator inside the log isn’t a probability distribution: it doesn’t sum to one over <img src="https://latex.codecogs.com/png.latex?y">. We can fix this by introducing a normalizer <img src="https://latex.codecogs.com/png.latex?Z(x)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?Z(x)%20=%20%5Csum_y%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)"></p>
<p>and defining:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*(y%7Cx)%20=%20%5Cfrac%7B1%7D%7BZ(x)%7D%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)"></p>
<p>By construction, <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*"> is a valid probability distribution: it’s nonnegative everywhere and sums to one over <img src="https://latex.codecogs.com/png.latex?y">. Now we substitute it back into our objective. Add and subtract <img src="https://latex.codecogs.com/png.latex?%5Clog%20Z(x)"> inside the bracket (no change, since these cancel):</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)%7D%20+%20%5Clog%20Z(x)%20-%20%5Clog%20Z(x)%20%5Cright%5D"></p>
<p>Use the identity <img src="https://latex.codecogs.com/png.latex?%5Clog%20Z(x)%20=%20-%5Clog%20%5Cfrac%7B1%7D%7BZ(x)%7D"> on the first <img src="https://latex.codecogs.com/png.latex?%5Clog%20Z(x)"> term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)%7D%20-%20%5Clog%20%5Cfrac%7B1%7D%7BZ(x)%7D%20-%20%5Clog%20Z(x)%20%5Cright%5D"></p>
<p>Combine the first two log terms using <img src="https://latex.codecogs.com/png.latex?%5Clog%20a%20-%20%5Clog%20b%20=%20%5Clog(a/b)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cfrac%7B1%7D%7BZ(x)%7D%5C,%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)%7D%20-%20%5Clog%20Z(x)%20%5Cright%5D"></p>
<p>Recognize the denominator inside the log as <img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*(y%7Cx)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%20%5C,%20%5Cmathbb%7BE%7D_%7By%20%5Csim%20%5Cpi(%5Ccdot%7Cx)%7D%5C!%5Cleft%5B%20%5Clog%20%5Cfrac%7B%5Cpi(y%7Cx)%7D%7B%5Cpi%5E*(y%7Cx)%7D%20-%20%5Clog%20Z(x)%20%5Cright%5D"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?%5Clog%20Z(x)"> term doesn’t depend on <img src="https://latex.codecogs.com/png.latex?%5Cpi">, so it’s a constant with respect to the optimization. We can drop it. What’s left is exactly a KL divergence:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmin_%5Cpi%20%5C;%20%5Cmathbb%7BE%7D_%7Bx%20%5Csim%20D%7D%5C!%5Cleft%5B%20%5Cmathbb%7BD%7D_%5Ctext%7BKL%7D%5C!%5Cleft%5B%5Cpi(y%7Cx)%20%5C,%5C%7C%5C,%20%5Cpi%5E*(y%7Cx)%5Cright%5D%20%5Cright%5D"></p>
<p>KL divergence is minimized when the two distributions are equal. So the minimizer is <img src="https://latex.codecogs.com/png.latex?%5Cpi%20=%20%5Cpi%5E*">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cpi%5E*(y%7Cx)%20=%20%5Cfrac%7B1%7D%7BZ(x)%7D%20%5Cpi_%5Ctext%7Bref%7D(y%7Cx)%20%5Cexp%5C!%5Cleft(%5Cfrac%7B1%7D%7B%5Cbeta%7D%20r(x,%20y)%5Cright)"></p>
<p>That’s the closed-form optimal policy. The rest of the DPO derivation (Steps 3 and 4 in the main text) follows from rearranging this expression to extract the implicit reward and plugging it into Bradley-Terry.</p>
<p>The partition function <img src="https://latex.codecogs.com/png.latex?Z(x)"> is the technical price of admission for this derivation. It would be intractable to compute directly as it requires summing over all possible completions, but it cancels out in the final DPO loss because Bradley-Terry only cares about reward differences. That cancellation is what makes DPO work as a practical algorithm.</p>
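<p>The derivation can be sanity-checked numerically on a toy problem. The sketch below assumes a single prompt with three possible completions; the reference probabilities, rewards, and <code>beta</code> are made up for illustration, and the helper names are hypothetical. It checks three things: no randomly sampled policy beats the closed-form policy, the objective at the optimum equals <code>beta * log Z(x)</code>, and <code>Z(x)</code> cancels when you take differences of the implicit reward.</p>

```python
import math
import random


def optimal_policy(ref_probs, rewards, beta):
    """Closed-form pi*(y|x) for one prompt with a small discrete set of y."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z


def objective(policy, ref_probs, rewards, beta):
    """E_{y ~ pi}[r(x, y)] - beta * KL(pi || pi_ref) for one prompt."""
    total = 0.0
    for p, q, r in zip(policy, ref_probs, rewards):
        if p == 0.0:
            continue  # 0 * log(0) is taken as 0
        total += p * (r - beta * math.log(p / q))
    return total


# Toy numbers, chosen only for illustration.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(ref_probs, rewards, beta)
best = objective(pi_star, ref_probs, rewards, beta)

# Compare pi* against 1000 randomly drawn policies.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    raw = [rng.random() for _ in ref_probs]
    total = sum(raw)
    candidate = [v / total for v in raw]
    gaps.append(best - objective(candidate, ref_probs, rewards, beta))

# Implicit reward beta * log(pi* / pi_ref) equals r - beta * log Z(x).
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, ref_probs)]

# Each line should print True.
print(min(gaps) >= -1e-12)
print(math.isclose(best, beta * math.log(z)))
print(math.isclose(implicit[2] - implicit[1], rewards[2] - rewards[1]))
```

<p>The last check is the cancellation in miniature: each implicit reward carries the same <code>beta * log Z(x)</code> offset, so the offset disappears from any difference between two responses to the same prompt.</p>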



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-rafailov_etal_2023" class="csl-entry">
Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. <span>“Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.”</span> <em>Advances in Neural Information Processing Systems (NeurIPS)</em>. <a href="https://arxiv.org/abs/2305.18290">https://arxiv.org/abs/2305.18290</a>.
</div>
<div id="ref-tunstall_etal_2023" class="csl-entry">
Tunstall, Lewis, Edward Beeching, Nathan Lambert, et al. 2023. <span>“Zephyr: Direct Distillation of LM Alignment.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2310.16944">https://arxiv.org/abs/2310.16944</a>.
</div>
</div></section></div> ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/03-dpo/</guid>
  <pubDate>Fri, 06 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>GRPO: the algorithm behind reasoning models</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/02-grpo/</link>
  <description><![CDATA[ 





<div class="series-nav">
<div class="series-label">
Part of a series
</div>
<div class="series-name">
How LLMs learn to reason
</div>
<ol type="1">
<li><a href="../00-reinforce-foundations/">REINFORCE: the world before the gradient</a></li>
<li><a href="../00-reinforce-gradient/">REINFORCE: the gradient that drives training</a></li>
<li><a href="../01-ppo/">PPO is REINFORCE plus five fixes</a></li>
<li><a href="../02-grpo/">GRPO: The algorithm behind reasoning models</a></li>
<li><a href="../03-dpo/">DPO: RLHF collapsed into one loss</a></li>
<li><a href="../04-r1/">R1-Zero was the result. R1 was the product</a></li>
<li><a href="../05-tool-use/">How reasoning models learn to use tools</a></li>
</ol>
</div>
<p>In January 2025, DeepSeek released R1 <span class="citation" data-cites="guo_etal_2025">(<span class="nocase">Guo et al.</span> 2025)</span>: a reasoning model that closed most of the gap to OpenAI’s o1 at a fraction of the training cost, and (more importantly) shipped with the training recipe described in the open. The recipe centered on an algorithm called GRPO. Within a few months, essentially every open-source reasoning model (Qwen’s reasoning variants, Llama derivatives, the wave that followed) was using GRPO or close cousins of it.</p>
<p>GRPO itself wasn’t new in 2025. The DeepSeek team had introduced it a year earlier, in their DeepSeekMath paper <span class="citation" data-cites="shao_etal_2024">(Shao et al. 2024)</span>, where they used it to train a 7B math-specialist model. What changed in 2025 wasn’t the algorithm. It was the demonstration that the same algorithm could produce general reasoning capability at frontier scale.</p>
<p>If you’ve been trying to understand how reasoning models like o1, R1, or their successors actually get trained, GRPO is most of the answer. This post is about what GRPO is, why it works, and where it sits relative to the PPO + RLHF recipe it (sometimes) replaces.</p>
<p>I’m assuming you already understand PPO at the level of the <a href="../../../../posts/series/how-llms-learn-to-reason/01-ppo/index.html">previous post in this series</a>. That assumption matters because GRPO is best understood as a small, deliberate set of changes from PPO, not as a new algorithm built from scratch. Most of what makes PPO work (clipping, importance sampling, the multi-epoch inner loop) carries over unchanged. The changes are concentrated in two places.</p>
<p>Here’s the short version of GRPO: <strong>GRPO is PPO with the critic removed. Instead of using a value network to estimate per-token advantages, GRPO samples multiple rollouts per prompt and computes each rollout’s advantage relative to its siblings.</strong></p>
<p>That’s the whole conceptual move. One model gone from memory. Per-token advantage replaced with per-trajectory advantage. Everything else stays. Which raises the question of why this works at all, and why it works <em>especially</em> well for reasoning.</p>
<div id="fig-grpo-from-ppo" class="quarto-float quarto-figure quarto-figure-center anchored" alt="A side-by-side comparison. The left panel, PPO, has four model roles: policy, critic, reference, reward; its advantage is critic then GAE, one value per token. An arrow labelled two changes points to the right panel, GRPO, which has the same roles except the critic is struck out and marked removed, leaving three roles; its advantage is computed from a group of rollouts via mean and standard deviation, one value per trajectory. A band underneath lists what is unchanged from PPO: clipping, importance ratio, KL anchor, multi-epoch loop.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-grpo-from-ppo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/02-grpo/grpo_from_ppo.png" class="img-fluid figure-img" alt="A side-by-side comparison. The left panel, PPO, has four model roles: policy, critic, reference, reward; its advantage is critic then GAE, one value per token. An arrow labelled two changes points to the right panel, GRPO, which has the same roles except the critic is struck out and marked removed, leaving three roles; its advantage is computed from a group of rollouts via mean and standard deviation, one value per trajectory. A band underneath lists what is unchanged from PPO: clipping, importance ratio, KL anchor, multi-epoch loop.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-grpo-from-ppo-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: GRPO is PPO with the critic removed. Two changes: the value network is dropped (four model roles become three), and the advantage is computed per trajectory from a group of sibling rollouts instead of per token from GAE. Clipping, the importance ratio, KL anchoring, and the multi-epoch loop are unchanged.
</figcaption>
</figure>
</div>
<section id="why-drop-the-critic" class="level2">
<h2 class="anchored" data-anchor-id="why-drop-the-critic">Why drop the critic?</h2>
<p>In PPO, four model roles are active during training:</p>
<ul>
<li><strong>Policy</strong>: the LLM being trained</li>
<li><strong>Value network</strong>: the critic, also trained, typically same architecture as the policy</li>
<li><strong>Reference model</strong>: frozen copy of the SFT model, for KL anchoring</li>
<li><strong>Reward model</strong>: frozen, scores complete responses</li>
</ul>
<p>That’s four LLM-sized things in the picture. The actual memory cost depends on a lot of implementation details (whether the policy and value share a body, how parameters are sharded across GPUs, optimizer state for the trained models) but the structural fact stands: PPO involves four model roles, and reducing that count is a real win.</p>
<p>Two of those roles you can’t avoid. The reference model is structurally necessary; without it you have no KL anchor and the policy drifts into reward-hacking gibberish. The reward model (or verifier) provides the training signal in the first place.</p>
<p>But the value network? It’s only there to compute advantages. If you can compute advantages another way, you can drop it.</p>
<p>A natural question is whether the reward model could just take over. It can’t, and the reason matters.</p>
<p>The reward model and the value network do fundamentally different jobs. The reward model answers: “How good is this <em>complete</em> response?” It takes a (prompt, response) pair and outputs a scalar. It was trained on human preferences over finished outputs, so it has no signal for half-finished responses. The value network answers a different question: “Given this <em>partial</em> state, what total reward do I expect by the end?” It takes a state (prompt plus tokens-so-far) and predicts where things are heading.</p>
<p>These are different questions. One scores finished things. The other predicts where things are heading. The reward model can’t do the second job because no human ever rated half-responses during preference data collection.</p>
<p>So the value network isn’t redundant. PPO needs it because GAE (the per-token advantage estimator) needs to know the expected outcome at every state along the trajectory, and only the value network can provide that.</p>
<p>GRPO’s move isn’t to substitute the reward model for the value network. It’s to give up on per-token advantage estimation entirely.</p>
</section>
<section id="group-relative-advantages" class="level2">
<h2 class="anchored" data-anchor-id="group-relative-advantages">Group-relative advantages</h2>
<p>Here’s the new advantage computation. For each prompt, generate <img src="https://latex.codecogs.com/png.latex?G"> rollouts (typically 4 to 16). Score each one with the verifier. The advantage of the <img src="https://latex.codecogs.com/png.latex?i">-th rollout is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_i%20=%20%5Cfrac%7Br_i%20-%20%5Ctext%7Bmean%7D(r_1,%20%5Cldots,%20r_G)%7D%7B%5Ctext%7Bstd%7D(r_1,%20%5Cldots,%20r_G)%20+%20%5Cepsilon%7D"></p>
<p>That’s it. Each rollout gets one scalar advantage (“how much better than my sibling rollouts did this one do?”) applied identically to every token in that rollout.</p>
<p>This is a different kind of advantage from the one PPO uses. PPO has a different advantage per token, derived from GAE; GRPO has the same advantage for every token in a rollout, derived from how the rollout’s reward compares to its siblings’. The credit is assigned at the trajectory level, not the token level. That’s the cost of dropping the value network: less granular credit assignment. But it’s also why GRPO works without per-token value estimates.</p>
<p>Let me make this concrete with the example I’ll use throughout: training a model to solve grade-school math word problems.</p>
<p><em>“Maya is buying tickets for a concert. Adult tickets cost $12 each and child tickets cost $8 each. She buys 9 tickets total and spends $92. How many adult tickets did she buy?”</em></p>
<p>The answer is 5. The verifier extracts the model’s final boxed answer and checks: 1 if correct, 0 if not.</p>
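<p>A verifier of this kind can be very small. The sketch below is one hypothetical way to write it: the name <code>verify</code>, the regular expression, and the plain string comparison are assumptions for illustration. A production verifier would likely normalize answers (fractions, units, whitespace) before comparing.</p>

```python
import re

# Matches \boxed{...} with no nested braces; enough for short numeric answers.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def verify(response: str, expected: str) -> int:
    """Return 1 if the last \\boxed{...} answer matches expected, else 0."""
    matches = BOXED.findall(response)
    if not matches:
        return 0  # no boxed answer counts as incorrect
    return int(matches[-1].strip() == expected.strip())


print(verify(r"So a = 5. Final answer: \boxed{5}", "5"))
print(verify(r"Mixing up the prices gives \boxed{4}", "5"))
print(verify("The answer is 5", "5"))
```

<p>Taking the last boxed answer rather than the first is a design choice: it scores what the model finally committed to, not an earlier guess it later revised.</p>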
<p>Sample 4 rollouts (<img src="https://latex.codecogs.com/png.latex?G%20=%204">):</p>
<ul>
<li><strong>Rollout A</strong>: sets up the equation <img src="https://latex.codecogs.com/png.latex?12a%20+%208(9-a)%20=%2092">, solves cleanly, ends with <code>\boxed{5}</code>. Reward: 1.</li>
<li><strong>Rollout B</strong>: tries <img src="https://latex.codecogs.com/png.latex?a%20+%20c%20=%209"> and <img src="https://latex.codecogs.com/png.latex?12a%20+%208c%20=%2092"> but makes an arithmetic slip, ends with <code>\boxed{6}</code>. Reward: 0.</li>
<li><strong>Rollout C</strong>: solves it correctly via testing values, ends with <code>\boxed{5}</code>. Reward: 1.</li>
<li><strong>Rollout D</strong>: confuses adult and child prices, ends with <code>\boxed{4}</code>. Reward: 0.</li>
</ul>
<p>Compute the group statistics. The mean is straightforward:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7Bmean%7D%20=%20%5Cfrac%7B1%20+%200%20+%201%20+%200%7D%7B4%7D%20=%200.5"></p>
<p>For the standard deviation, take squared deviations from the mean and sum them:</p>
<p><img src="https://latex.codecogs.com/png.latex?(1%20-%200.5)%5E2%20+%20(0%20-%200.5)%5E2%20+%20(1%20-%200.5)%5E2%20+%20(0%20-%200.5)%5E2%20=%204%20%5Ctimes%200.25%20=%201.0"></p>
<p>Then divide by <img src="https://latex.codecogs.com/png.latex?G%20-%201%20=%203"> (only 3 of the 4 deviations are independent once the mean is fixed) and take the square root:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%20%5Csqrt%7B1.0%20/%203%7D%20%5Capprox%200.577"></p>
<p>(Some implementations divide by <img src="https://latex.codecogs.com/png.latex?G"> instead, giving <img src="https://latex.codecogs.com/png.latex?%5Csigma%20=%200.5">. The choice rescales all advantages by the same factor and doesn’t change their signs, but it can matter when comparing hyperparameters across codebases.)</p>
<p>Advantages are then <img src="https://latex.codecogs.com/png.latex?(r_i%20-%20%5Ctext%7Bmean%7D)%20/%20%5Csigma">:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_1%20=%20(1%20-%200.5)%20/%200.577%20%5Capprox%20+0.866"> (rollout A)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_2%20=%20(0%20-%200.5)%20/%200.577%20%5Capprox%20-0.866"> (rollout B)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_3%20=%20(1%20-%200.5)%20/%200.577%20%5Capprox%20+0.866"> (rollout C)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_4%20=%20(0%20-%200.5)%20/%200.577%20%5Capprox%20-0.866"> (rollout D)</li>
</ul>
<p>In the gradient update, every token in rollouts A and C gets multiplied by <img src="https://latex.codecogs.com/png.latex?+0.866">, and every token in rollouts B and D gets multiplied by <img src="https://latex.codecogs.com/png.latex?-0.866">. The policy gets pushed toward the trajectories that solved the problem, away from the ones that didn’t.</p>
<p>The intuition is clean: of these attempts, which ones got it right? Reinforce those.</p>
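<p>The same arithmetic as a short Python sketch, using only the standard library. The function name <code>group_advantages</code>, the <code>sample_std</code> flag, and the <code>eps</code> value are illustrative assumptions; the flag switches between the two divisor conventions mentioned above.</p>

```python
import statistics


def group_advantages(rewards, eps=1e-6, sample_std=True):
    """Group-relative advantages: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    # sample_std=True divides by G - 1; False divides by G.
    std = statistics.stdev(rewards) if sample_std else statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = [1, 0, 1, 0]  # rollouts A, B, C, D
print([round(a, 3) for a in group_advantages(rewards)])
# [0.866, -0.866, 0.866, -0.866]
print([round(a, 3) for a in group_advantages(rewards, sample_std=False)])
# [1.0, -1.0, 1.0, -1.0]
```

<p>Switching the divisor changes the magnitude of every advantage by the same factor and leaves the signs alone, which is why the two conventions behave like a rescaled learning rate rather than a different algorithm.</p>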
<section id="what-happens-when-all-rollouts-score-the-same" class="level3">
<h3 class="anchored" data-anchor-id="what-happens-when-all-rollouts-score-the-same">What happens when all rollouts score the same</h3>
<p>A subtle but important case. If all 4 rollouts get reward 0 (model failed every attempt), or all 4 get reward 1 (model succeeded every time), then the standard deviation is 0, the mean equals every <img src="https://latex.codecogs.com/png.latex?r_i">, and every advantage is 0.</p>
<p>This isn’t a bug. It’s the algorithm correctly saying <em>“there’s no learning signal here.”</em> If all attempts failed, we don’t know which failures were closer to success. If all succeeded, we don’t know which successes were genuinely well-reasoned versus lucky. Either way, no rollout in this group is “better than its siblings,” so no gradient signal gets generated.</p>
<p>The practical implication: GRPO is most effective when the model is at a difficulty level where it sometimes succeeds and sometimes fails. Too easy and all rewards are 1. Too hard and all rewards are 0. Either way, no signal. Training data needs to sit in the model’s “frontier” of difficulty, which is why curriculum learning matters more for GRPO than it does for PPO.</p>
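<p>Detecting these zero-signal groups is a one-line check. The sketch below uses hypothetical group names and a hypothetical helper, <code>has_learning_signal</code>. What to do with a flagged group (skip it, resample, or swap in a different prompt) is a design choice this post doesn’t prescribe.</p>

```python
def has_learning_signal(rewards):
    """True when at least two rollouts in the group got different rewards."""
    return len(set(rewards)) > 1


# Hypothetical reward groups for three prompts of different difficulty.
groups = {
    "too_easy": [1, 1, 1, 1],
    "frontier": [1, 0, 1, 0],
    "too_hard": [0, 0, 0, 0],
}
usable = [name for name, r in groups.items() if has_learning_signal(r)]
print(usable)  # ['frontier']
```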
</section>
</section>
<section id="everything-else-looks-ppo-shaped" class="level2">
<h2 class="anchored" data-anchor-id="everything-else-looks-ppo-shaped">Everything else looks PPO-shaped</h2>
<p>With the advantage computed, the rest of the algorithm is structurally identical to PPO.</p>
<p>The clipped policy loss has the same form as PPO. Same importance ratio:</p>
<p><img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20=%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%5Ctext%7Bold%7D%7D(a_t%20%5Cmid%20s_t)%7D"></p>
<p>Same min-of-clipped-and-unclipped construction for the per-token policy loss:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell%5E%5Cpi_t%20=%20-%5Cmin%5C!%5CBig(r_t(%5Ctheta)%20%5Ccdot%20%5Chat%7BA%7D_t,%5C;%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5Ccdot%20%5Chat%7BA%7D_t%5CBig)"></p>
<p>The only difference from PPO: <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t"> is now the trajectory-level group-relative advantage rather than the token-level GAE advantage. <strong>Every token in a given rollout shares the same <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t">.</strong> Clipping mechanism, importance ratio, asymmetric min: all unchanged.</p>
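<p>A minimal sketch of that per-token loss in plain Python, applied to one rollout whose tokens all share the advantage from the worked example. The function name, the <code>clip_eps=0.2</code> default, and the log-probabilities are illustrative assumptions; a real implementation would operate on tensors and differentiate through <code>logp_new</code>.</p>

```python
import math


def clipped_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token clipped policy loss, given log-probs of the sampled token."""
    ratio = math.exp(logp_new - logp_old)  # importance ratio r_t(theta)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return -min(ratio * advantage, clipped * advantage)


# One rollout: every token shares the rollout's advantage.
advantage = 0.866
logp_old = [-1.2, -0.7, -2.1]  # made-up per-token log-probs
logp_new = [-1.0, -0.7, -1.5]
losses = [
    clipped_token_loss(new, old, advantage)
    for new, old in zip(logp_new, logp_old)
]
print(losses)
```

<p>The first and third tokens have ratios above 1.2, so with a positive advantage both are clipped to the same value. The second token has a ratio of exactly 1 and contributes the negated advantage itself.</p>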
<p>The multi-epoch inner loop is the same: collect rollouts, do <img src="https://latex.codecogs.com/png.latex?K"> epochs of mini-batched gradient steps, discard, repeat.</p>
<p>KL anchoring is also the same idea, though implemented differently. PPO often folds the KL term into the per-token rewards before computing advantages. GRPO can’t easily do this, as rewards in GRPO arrive only at the end of each rollout, with no per-token reward to fold into. So GRPO adds the KL as a separate per-token loss term:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell%5E%7B%5Ctext%7BKL%7D%7D_t%20=%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%20-%20%5Clog%20%5Cpi_%5Ctext%7Bref%7D(a_t%20%5Cmid%20s_t)"></p>
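<p>This is the simple log-ratio estimator of KL, the same one PPO uses. (The DeepSeekMath paper itself uses a different, always-nonnegative estimator of the same quantity; the log-ratio form is the simpler one to reason about.) The two per-token losses then combine into a single scalar that gets backpropped:</p>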
<p><img src="https://latex.codecogs.com/png.latex?L_%5Cmathcal%7BB%7D%20=%20%5Cfrac%7B1%7D%7B%7C%5Cmathcal%7BB%7D%7C%7D%20%5Csum_t%20%5CBig%5B%5Cell%5E%5Cpi_t%20+%20%5Cbeta%20%5Ccdot%20%5Cell%5E%7B%5Ctext%7BKL%7D%7D_t%5CBig%5D"></p>
<p>Two separate per-token loss components, summed with a weight <img src="https://latex.codecogs.com/png.latex?%5Cbeta">, averaged across all tokens in the mini-batch. Both contribute their own gradients to the policy parameters.</p>
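<p>As a sketch, the combination is a weighted sum followed by a mean over tokens. The function name <code>minibatch_loss</code> and all numbers below are assumptions for illustration, and the KL term is the simple log-ratio form written above.</p>

```python
def minibatch_loss(policy_losses, logp_new, logp_ref, beta):
    """Mean over tokens of policy loss plus beta times the log-ratio KL term."""
    kl_terms = [new - ref for new, ref in zip(logp_new, logp_ref)]
    per_token = [p + beta * k for p, k in zip(policy_losses, kl_terms)]
    return sum(per_token) / len(per_token)


# Two tokens with made-up values.
policy_losses = [-1.0, -0.5]
logp_new = [-1.0, -2.0]
logp_ref = [-1.5, -2.0]
loss = minibatch_loss(policy_losses, logp_new, logp_ref, beta=0.1)
print(loss)
```

<p>The first token has drifted from the reference (log-ratio 0.5), so it pays a penalty of <code>0.1 * 0.5</code>. The second matches the reference and pays nothing.</p>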
<p>If you understand PPO, you understand most of GRPO. Two things change: how the advantage gets computed, and what models are in memory. Everything else carries over.</p>
</section>
<section id="why-this-works-for-reasoning-specifically" class="level2">
<h2 class="anchored" data-anchor-id="why-this-works-for-reasoning-specifically">Why this works for reasoning specifically</h2>
<p>GRPO didn’t have to be a reasoning algorithm. The math is generic. You could use it for any task with a scalar reward at the end. But there’s a structural reason it works <em>especially</em> well for reasoning.</p>
<p>Reasoning tasks have two properties that are awkward for PPO:</p>
<p><strong>Rewards are usually verifiable.</strong> Math problems have right answers. Code can be tested. The reward is a deterministic check (“did the code pass the tests?”), not a learned approximation of human preferences. This is the world of <em>Reinforcement Learning with Verifiable Rewards</em> (RLVR) and it’s the world GRPO was made for.</p>
<p><strong>Reward is genuinely sparse.</strong> You don’t know if a chain of thought is good until you see whether it produced the right final answer. Intermediate tokens can look identical between a trajectory that ends correctly and one that ends wrong, until the very end.</p>
<p>PPO handles sparse rewards via the value network, which has to learn “how good does this partial reasoning trajectory look?” That’s a hard regression problem. Reasoning trajectories can look identical for many tokens, then diverge wildly at the answer. The value network struggles to fit this signal cleanly, and the resulting advantages are noisy.</p>
<p>GRPO sidesteps the problem entirely. It doesn’t try to predict per-token value. It just compares whole rollouts to each other.</p>
<p>There’s also a practical reason GRPO and verifiable rewards pair well. GRPO needs <img src="https://latex.codecogs.com/png.latex?G"> rollouts per prompt to compute group statistics, which means the reward function gets called <img src="https://latex.codecogs.com/png.latex?G"> times per prompt. If the reward function is a few lines of Python that run in a tiny fraction of the time a forward pass takes (a math correctness checker, a code test runner), scaling to <img src="https://latex.codecogs.com/png.latex?G%20=%2016"> is nearly free. If the reward function is a 7B-parameter reward model requiring a full forward pass, the <img src="https://latex.codecogs.com/png.latex?G%5Ctimes"> inference cost gets expensive fast. GRPO’s economics work in the verifier setting.</p>
<p>This is also where the practical scope of “verifiable” matters. Verifiable rewards work cleanly for math (correctness check on the final answer), code (test cases), formal proofs (proof assistants), and constrained generation (regex compliance). They don’t work for tasks like “is this response helpful?” or “is this writing tasteful?” Those are inherently subjective and need RLHF-style learned rewards. So GRPO + RLVR doesn’t replace RLHF for general assistant training. It’s a complementary tool for a specific (and increasingly important) class of tasks.</p>
</section>
<section id="what-r1-zero-showed" class="level2">
<h2 class="anchored" data-anchor-id="what-r1-zero-showed">What R1-Zero showed</h2>
<p>When DeepSeek published the R1 paper in early 2025, they actually published two models, and the distinction is informative.</p>
<p><strong>DeepSeek-R1</strong> is the deployable model. Trained with a multi-stage pipeline: cold-start SFT, then GRPO + RLVR, then more SFT, then more RL. This is the one people actually use.</p>
<p><strong>DeepSeek-R1-Zero</strong> is the surprising one. Trained with pure RL (GRPO + verifiable rewards) directly on the DeepSeek-V3 base model, with no SFT stage at all. R1-Zero showed that reasoning behavior could emerge from RL alone, without any supervised demonstrations of how to reason. The DeepSeek paper documents the specific behaviors that emerged during training:</p>
<ul>
<li>Generating progressively longer chains of thought as training continued, with the model learning that more inference-time compute helped it solve harder problems.</li>
<li>Reflecting on its own solutions, revisiting and evaluating earlier reasoning steps.</li>
<li>Exploring alternative approaches when a first attempt didn’t pan out.</li>
<li>Critiquing intermediate steps and catching its own errors.</li>
</ul>
<p>None of this was programmed in. The RL process simply rewarded the model for producing correct answers in the right format, and the rest emerged. The DeepSeek paper calls this a “self-evolution”: the model learning, through exploration and reward, how to reason without ever being shown an example.</p>
<p>R1-Zero’s outputs were rough on the surface. The model would mix languages mid-response (sometimes Chinese in the middle of an English answer), use inconsistent formatting, and occasionally produce incoherent passages. Strong reasoning, weak presentation. That’s why the deployable R1 needed additional SFT stages on top: to clean up the surface presentation while keeping the reasoning capability.</p>
<p>The R1-Zero result changed how people thought about what RL could do. The conventional wisdom had been that you needed SFT to give the model a starting distribution before any RL could help. R1-Zero showed that’s not strictly true for capabilities that have verifiable signals. Pure RL on a base model can produce reasoning. You don’t need to demonstrate how to reason; you just need to be able to verify when it worked.</p>
<p>The current consensus pipeline for production reasoning models combines: SFT for assistant behavior, preference tuning for helpfulness/harmlessness/honesty, RLVR for reasoning capability. Each stage does a job the others can’t do well. R1-Zero showed the RL stage could work without the prior stages; R1 showed why production models include them anyway.</p>
</section>
<section id="where-grpo-fits-in-the-broader-landscape" class="level2">
<h2 class="anchored" data-anchor-id="where-grpo-fits-in-the-broader-landscape">Where GRPO fits in the broader landscape</h2>
<p>The post-training landscape as of 2026 has split into a few distinct lanes.</p>
<p><strong>Frontier general RLHF</strong> (proprietary, OpenAI / Anthropic / Google) is generally believed to still use PPO or proprietary variants of it. Frontier labs have the engineering resources to handle PPO’s complexity and care about every fraction of a percent of quality. PPO is online: it generates new rollouts and learns from them, whereas DPO is offline, learning from a fixed preference dataset. That exploratory advantage seems to matter at the frontier. The frontier labs don’t share their pipelines publicly, so it’s hard to know exactly what they do, but PPO and its descendants appear to remain in use.</p>
<p><strong>Open-source preference tuning</strong> has largely migrated to DPO and its variants. DPO is much simpler than PPO: no rollouts, no value network, no reward model, just a clever loss on preference pairs. For most open-source teams, the simplicity is worth more than whatever quality edge PPO might provide.</p>
<p><strong>Reasoning models</strong> use GRPO + RLVR. This is the lane DeepSeek pioneered with R1, and most open-source reasoning models follow the same recipe.</p>
<div id="fig-post-training-lanes" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Three lane cards under two brackets. The first bracket, general RLHF with preference data, spans two lanes. PPO: online RL against a learned reward model, used by frontier labs (OpenAI, Anthropic, Google); choose when you have preference data and frontier-quality goals. DPO and its variants IPO, KTO, ORPO, SimPO: one supervised loss on preference pairs, used by open-source preference tuning; choose when you have preference data and a tight budget. The second bracket, RLVR with verifiable rewards, covers one lane. GRPO: PPO minus the critic with group-relative advantage, used by reasoning models such as DeepSeek-R1 and those after it; choose when you have verifiable rewards such as math, code, or proofs. Summary: preference data leads to DPO or PPO depending on budget versus ambition; verifiable rewards lead to GRPO plus RLVR.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-post-training-lanes-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/02-grpo/post_training_lanes.png" class="img-fluid figure-img" alt="Three lane cards under two brackets. The first bracket, general RLHF with preference data, spans two lanes. PPO: online RL against a learned reward model, used by frontier labs (OpenAI, Anthropic, Google); choose when you have preference data and frontier-quality goals. DPO and its variants IPO, KTO, ORPO, SimPO: one supervised loss on preference pairs, used by open-source preference tuning; choose when you have preference data and a tight budget. The second bracket, RLVR with verifiable rewards, covers one lane. GRPO: PPO minus the critic with group-relative advantage, used by reasoning models such as DeepSeek-R1 and those after it; choose when you have verifiable rewards such as math, code, or proofs. Summary: preference data leads to DPO or PPO depending on budget versus ambition; verifiable rewards lead to GRPO plus RLVR.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-post-training-lanes-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: The post-training landscape in 2026, split into three lanes. PPO and DPO share the general-RLHF, preference-data space (frontier labs use PPO for quality; open source uses DPO for simplicity); GRPO plus RLVR owns reasoning, where rewards are verifiable. What you have decides the lane.
</figcaption>
</figure>
</div>
<p>The practical takeaway: PPO and DPO/variants share the general-RLHF space (frontier vs open-source split); GRPO + RLVR owns reasoning. For practitioners, the choice depends on what you have. Preference data and a tight budget? DPO. Preference data and frontier-quality ambitions? PPO + RLHF. Verifiable rewards (math, code, theorem proving)? GRPO + RLVR.</p>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>In the broader arc of this series, the next piece is <a href="../../../../posts/series/how-llms-learn-to-reason/03-dpo/index.html"><strong>DPO</strong></a>, the offline alternative that displaced PPO in much of the open-source ecosystem. DPO is conceptually different from GRPO: it collapses the whole RL pipeline into a single supervised loss. But it solves a related problem.</p>
<p>There’s a broader point worth naming here. PPO is hard. The four-model setup, the value network’s instability, and the sensitivity to hyperparameters are real engineering challenges. For years, these kept serious RL research mostly confined to a handful of well-resourced labs. GRPO didn’t just simplify an algorithm. It made RL training accessible to teams that couldn’t have run a stable PPO pipeline if they tried. It is the story of an algorithm simple enough that more people could actually use it.</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-guo_etal_2025" class="csl-entry">
<span class="nocase">Guo, Daya, Dejian Yang, Haowei Zhang, et al.</span> 2025. <span>“DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2501.12948">https://arxiv.org/abs/2501.12948</a>.
</div>
<div id="ref-shao_etal_2024" class="csl-entry">
Shao, Zhihong, Peiyi Wang, Qihao Zhu, et al. 2024. <span>“DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/2402.03300">https://arxiv.org/abs/2402.03300</a>.
</div>
</div></section></div> ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/02-grpo/</guid>
  <pubDate>Mon, 02 Mar 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>PPO is REINFORCE plus five fixes</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/01-ppo/</link>
  <description><![CDATA[ 





<div class="series-nav">
<div class="series-label">
Part of a series
</div>
<div class="series-name">
How LLMs learn to reason
</div>
<ol type="1">
<li><a href="../00-reinforce-foundations/">REINFORCE: the world before the gradient</a></li>
<li><a href="../00-reinforce-gradient/">REINFORCE: the gradient that drives training</a></li>
<li><a href="../01-ppo/">PPO is REINFORCE plus five fixes</a></li>
<li><a href="../02-grpo/">GRPO: The algorithm behind reasoning models</a></li>
<li><a href="../03-dpo/">DPO: RLHF collapsed into one loss</a></li>
<li><a href="../04-r1/">R1-Zero was the result. R1 was the product</a></li>
<li><a href="../05-tool-use/">How reasoning models learn to use tools</a></li>
</ol>
</div>
<p>Most treatments of PPO either start with MDPs and Bellman equations and lose you in the abstract before the algorithm appears, or treat PPO as a black box. Neither helps when you actually want to understand it. Here’s the framing that finally made it stick for me: PPO <span class="citation" data-cites="schulman_etal_2017">(Schulman et al. 2017)</span> is REINFORCE with five fixes that, together, solve two problems.</p>
<p>The plan: walk through the two problems REINFORCE has, watch each PPO fix arise as a response to one of them, and end by tracing how the final objective decomposes all the way down to things you can actually compute.</p>
<section id="reinforce-in-one-breath" class="level2">
<h2 class="anchored" data-anchor-id="reinforce-in-one-breath">REINFORCE in one breath</h2>
<p>Reinforcement learning is the framework where an agent takes actions in an environment, the environment responds with reward and a new state, and the agent’s job is to find a strategy, a policy, that collects as much reward as possible. Unlike supervised learning, the agent generates its own training data by acting. Bad policies generate bad data. That’s what makes RL hard in a way classification never is.</p>
<p>The simplest policy gradient algorithm is REINFORCE. Run the policy. Observe what happens. Update the policy so good actions become more likely and bad ones less likely. The objective being maximized:</p>
<p><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<p>Reading the parts: <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> is the probability the policy assigns to the action that was actually taken in state <img src="https://latex.codecogs.com/png.latex?s_t">. <img src="https://latex.codecogs.com/png.latex?G_t"> is the <em>return</em> from time <img src="https://latex.codecogs.com/png.latex?t">: the discounted sum of rewards from that point to the end of the trajectory. It is treated as a fixed number when differentiating, so the gradient of this expression is the policy-gradient estimate.</p>
<section id="reinforce-as-supervised-learning-with-twists" class="level3">
<h3 class="anchored" data-anchor-id="reinforce-as-supervised-learning-with-twists">REINFORCE as supervised learning, with twists</h3>
<p>Here’s the framing that finally made REINFORCE click for me. Standard supervised cross-entropy classification minimizes:</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%7B%5Ctext%7Bsup%7D%7D%20=%20-%5Csum_i%20%5Clog%20p(y_i%20%5Cmid%20x_i),"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?y_i"> is the true label for <img src="https://latex.codecogs.com/png.latex?x_i">. Compare to REINFORCE:</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%7B%5Ctext%7BREINFORCE%7D%7D%20=%20-%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<p>Same form, two changes:</p>
<ol type="1">
<li><strong>No labels: use the sampled action as a pseudo-label.</strong> RL doesn’t tell us what the “correct” action was. So we use the action the policy actually sampled as a stand-in. The training signal becomes “do more of what you just did.” Which sounds insane, until you remember the second change.</li>
<li><strong>Weight each pseudo-labeled example by how the trajectory turned out.</strong> Supervised learning treats every example as equally important. Every label is correct by definition. REINFORCE weights each term by <img src="https://latex.codecogs.com/png.latex?G_t">. Positive <img src="https://latex.codecogs.com/png.latex?G_t"> scales the term up: do more of this. Negative <img src="https://latex.codecogs.com/png.latex?G_t"> flips its sign: do less.</li>
</ol>
<p>Read this way, REINFORCE is supervised learning where the agent generates its own pseudo-labels by sampling, and each pseudo-labeled example is weighted by how things turned out. That’s the whole conceptual content of the algorithm. The gradient mechanics (softmax derivatives, autograd, the chain rule) are <em>exactly</em> the gradient mechanics of supervised cross-entropy, scaled by <img src="https://latex.codecogs.com/png.latex?G_t">. If you’ve ever trained a classifier, you’ve already done most of the work.</p>
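<p>To make the “same gradient, scaled by the return” claim concrete, here is a minimal sketch in plain Python. It assumes a softmax policy over three actions; the logits, the sampled action, and the two return values are made-up numbers for illustration, and the helper names are mine, not from any library.</p>

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, label):
    """Gradient of -log p(label) with respect to the logits: p - onehot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def reinforce_grad(logits, action, G):
    """Gradient of -G * log pi(action): the cross-entropy gradient scaled by G."""
    return [G * g for g in cross_entropy_grad(logits, action)]


logits = [2.0, 0.5, -1.0]   # hypothetical policy outputs for one state
sampled_action = 1          # the pseudo-label: whatever the policy sampled

grad_good = reinforce_grad(logits, sampled_action, G=3.0)    # trajectory went well
grad_bad = reinforce_grad(logits, sampled_action, G=-3.0)    # trajectory went badly
```

<p>A gradient-descent step subtracts the gradient from the logits. With a positive return, the entry for the sampled action is negative, so its logit goes up (“do more of this”). With a negative return, every entry flips sign and the same step pushes the logit down.</p>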
</section>
<section id="the-training-loop" class="level3">
<h3 class="anchored" data-anchor-id="the-training-loop">The training loop</h3>
<p>The loop is correspondingly small:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> episode <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_episodes):</span>
<span id="cb1-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 1. Roll out a trajectory under the current policy</span></span>
<span id="cb1-3">    states, actions, rewards <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rollout(env, policy)</span>
<span id="cb1-4"></span>
<span id="cb1-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 2. Compute returns G_t for each timestep, working backward</span></span>
<span id="cb1-6">    returns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-7">    G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> r <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">reversed</span>(rewards):</span>
<span id="cb1-9">        G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> G</span>
<span id="cb1-10">        returns.insert(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, G)</span>
<span id="cb1-11"></span>
<span id="cb1-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3. Form the loss and update</span></span>
<span id="cb1-13">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> s_t, a_t, G_t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(states, actions, returns):</span>
<span id="cb1-15">        logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> policy(s_t)</span>
<span id="cb1-16">        log_probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.log_softmax(logits, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-17">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> G_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> log_probs[a_t]</span>
<span id="cb1-18"></span>
<span id="cb1-19">    optimizer.zero_grad()</span>
<span id="cb1-20">    loss.backward()</span>
<span id="cb1-21">    optimizer.step()</span></code></pre></div></div>
<p>Three things to notice. First, autograd does all the work: we never compute the policy gradient by hand. We construct a loss whose gradient <em>is</em> the policy gradient and let the framework handle it. Second, the negation in <code>loss = loss - G_t * log_probs[a_t]</code> is there because optimizers minimize by default, so we minimize <img src="https://latex.codecogs.com/png.latex?-J">, which is equivalent to maximizing <img src="https://latex.codecogs.com/png.latex?J">. Third, returns are computed in a single backward pass through the trajectory, <img src="https://latex.codecogs.com/png.latex?O(T)"> steps of the recursion <img src="https://latex.codecogs.com/png.latex?G_t%20=%20r_t%20+%20%5Cgamma%20G_%7Bt+1%7D">.</p>
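<p>The return recursion is easy to check by hand. The sketch below is a standalone version of step 2 of the loop, with made-up rewards and a made-up discount. One small difference from the loop above: it appends and reverses once at the end rather than calling <code>insert(0, ...)</code>, which gives the same list without shifting elements on every step. The function name is mine.</p>

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] using G_t = r_t + gamma * G_{t+1}, in one backward pass."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)      # O(1) append; the list is built back to front
    returns.reverse()          # single O(T) reversal at the end
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

<p>Reading the output right to left: the last return is just the last reward (2.0), the one before it is 0 + 0.5 × 2.0 = 1.0, and the first is 1 + 0.5 × 1.0 = 1.5.</p>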
</section>
<section id="the-two-problems" class="level3">
<h3 class="anchored" data-anchor-id="the-two-problems">The two problems</h3>
<p>REINFORCE works. People used it. Two problems become obvious the moment you scale it.</p>
<p><strong>Problem 1: high variance.</strong> The “total reward that followed an action” varies wildly across rollouts even when the action itself was fine. Take the same action twice from the same state under a stochastic policy and you might see very different total rewards because what happened <em>afterward</em> was different. The training signal jumps around. Learning is slow and unstable.</p>
<p><strong>Problem 2: data inefficiency.</strong> A trajectory is expensive to collect. For an LLM, sampling a 500-token response takes real GPU time. REINFORCE throws each one away after a single gradient update, because after that update the policy has changed and the old data is no longer “from the right distribution”: a sense we’ll make precise when we get to importance sampling.</p>
<p>PPO fixes both. The first three fixes attack variance. The last two attack data inefficiency. Each fix introduces machinery the next one needs, so they’re easiest to read in order.</p>
<div id="fig-ppo-five-fixes" class="quarto-float quarto-figure quarto-figure-center anchored" alt="A diagram. A REINFORCE box on the left is labelled as having two problems. It branches into two lanes. The top lane, Problem 1 High variance, contains three fixes in order: use the advantage, add a critic, estimate it with GAE. The bottom lane, Problem 2 Data inefficiency, contains two fixes in order: importance sampling, then clip the ratio. Both lanes converge into a PPO box on the right marked both problems solved. Small notes on each fix show how it leads to the next: the advantage needs a value, the critic must be combined with rewards, importance sampling ratios can explode and so must be clipped.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ppo-five-fixes-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/01-ppo/ppo_five_fixes.png" class="img-fluid figure-img" alt="A diagram. A REINFORCE box on the left is labelled as having two problems. It branches into two lanes. The top lane, Problem 1 High variance, contains three fixes in order: use the advantage, add a critic, estimate it with GAE. The bottom lane, Problem 2 Data inefficiency, contains two fixes in order: importance sampling, then clip the ratio. Both lanes converge into a PPO box on the right marked both problems solved. Small notes on each fix show how it leads to the next: the advantage needs a value, the critic must be combined with rewards, importance sampling ratios can explode and so must be clipped.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ppo-five-fixes-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: PPO is REINFORCE plus five fixes. Fixes 1–3 attack REINFORCE’s variance problem; fixes 4–5 attack its data-inefficiency problem. Each fix introduces machinery the next one needs.
</figcaption>
</figure>
</div>
</section>
</section>
<section id="ppo-as-five-fixes" class="level2">
<h2 class="anchored" data-anchor-id="ppo-as-five-fixes">PPO as five fixes</h2>
<section id="fix-1-use-the-advantage-not-the-raw-reward" class="level3">
<h3 class="anchored" data-anchor-id="fix-1-use-the-advantage-not-the-raw-reward">Fix 1: Use the advantage, not the raw reward</h3>
<p>The first variance-reduction trick is to stop asking “what reward followed this action?” and start asking “did this action do better than expected?”</p>
<p>A concrete anchor for the rest of the piece: imagine training a chatbot to answer the question “What is PPO?” The model generates a response token by token: “<em>PPO is a reinforcement learning algorithm…</em>”. Then at the end a reward model scores the whole response. Each token is one action. Each full response is one trajectory.</p>
<p>When the model picks “reinforcement” partway through, was that a good choice? It depends on what was available at that point. If most plausible alternatives at that position would have led to a coherent response anyway, the choice isn’t special. If most would have led somewhere worse, then “reinforcement” was a good pick we want to reinforce.</p>
<p>The “expected reward from this state” is the <em>value</em> of the state. The action’s <em>advantage</em> is how much better than the value the actual outcome was:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1">Advantage <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (what happened) − (what we'd expect on average from this state)</span></code></pre></div></div>
<p>Substituting advantage for raw reward in the policy gradient cuts variance dramatically. Actions that go as expected contribute roughly zero to the gradient. Only genuinely surprising actions, good or bad, drive updates. The noise from “all of these actions were fine, actually” stops contributing.</p>
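<p>A toy calculation shows the effect. The sketch below assumes a one-step problem with two equally likely actions whose rewards (10 and 11) are both “fine”. It computes the exact mean and variance of the gradient sample for the first logit, once weighted by raw reward and once by advantage. All numbers and helper names are invented for illustration.</p>

```python
def mean_and_variance(samples_with_probs):
    """Exact mean and variance of a discrete distribution given (value, prob) pairs."""
    mean = sum(p * x for x, p in samples_with_probs)
    second_moment = sum(p * x * x for x, p in samples_with_probs)
    return mean, second_moment - mean * mean


# Hypothetical one-step problem: two actions, each sampled with probability 0.5.
probs = [0.5, 0.5]
rewards = [10.0, 11.0]      # both actions are fine; action 1 is slightly better
value = sum(p * r for p, r in zip(probs, rewards))   # expected reward: 10.5


def grad_samples(weights):
    """Per-action sample of d log pi(a) / d logit_0, times the given weight."""
    # For a softmax policy, d log pi(a) / d logit_0 = onehot_0(a) - pi(0).
    score = [1.0 - probs[0], 0.0 - probs[0]]
    return [(score[a] * weights[a], probs[a]) for a in range(2)]


raw = mean_and_variance(grad_samples(rewards))
adv = mean_and_variance(grad_samples([r - value for r in rewards]))
print(raw)   # (-0.25, 27.5625)
print(adv)   # (-0.25, 0.0)
```

<p>Both estimators agree on the mean (the gradient says “shift a little toward action 1”), but the raw-reward version gets there by averaging samples near +5 and −5.5, while the advantage version’s samples are all −0.25. Real problems won’t drop to zero variance, but the direction of the effect is the same.</p>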
<p>The catch: to compute advantage, you need the value of each state. We don’t know it directly. So we <em>learn it</em>, with a second neural network.</p>
</section>
<section id="ppo-value-network-critic" class="level3">
<h3 class="anchored" data-anchor-id="ppo-value-network-critic">Fix 2: Add a value network (the critic)</h3>
<p>PPO trains two networks in parallel:</p>
<ul>
<li>The <strong>policy network</strong> (the actor) takes a state, outputs a distribution over actions. For LLMs, this <em>is</em> the language model.</li>
<li>The <strong>value network</strong> (the critic) takes a state, outputs a single number: the expected total reward from this state onward.</li>
</ul>
<p>For the chatbot, the value network looks at a partial response, e.g.&nbsp;<em>“PPO is a reinforcement learning”</em>, and outputs something like 0.6, meaning “responses that start this way tend to score around 0.6 from the reward model.” After the next token, the partial response becomes <em>“PPO is a reinforcement learning algorithm”</em> and the value might shift to a slightly higher score of 0.65, indicating that the response is heading somewhere good. The critic is essentially a running estimate of “how well is this generation going so far?”</p>
<p>The critic doesn’t act. It just judges. Taken together, the actor and critic give an <em>actor-critic</em> architecture, of which PPO is one specific instance.</p>
<p>How does the critic learn? Standard regression. After a rollout, you have actual rewards <img src="https://latex.codecogs.com/png.latex?r_0,%20r_1,%20%5Cldots,%20r_T">. From any state <img src="https://latex.codecogs.com/png.latex?s_t"> along the trajectory, the actual return is <img src="https://latex.codecogs.com/png.latex?G_t%20=%20r_t%20+%20%5Cgamma%20r_%7Bt+1%7D%20+%20%5Cldots%20+%20%5Cgamma%5E%7BT-t%7D%20r_T">. Train the critic to predict <img src="https://latex.codecogs.com/png.latex?G_t"> from <img src="https://latex.codecogs.com/png.latex?s_t"> using mean-squared-error. Over many trajectories the critic becomes a calibrated estimator, and the advantages it underwrites become reliable.</p>
<p>For LLMs, the value network is typically a copy of the policy architecture: same transformer body, with a small linear head outputting a scalar instead of a token distribution. Whether the bodies are <em>shared</em> (parameters trained jointly) or <em>separate copies</em> (parameters trained independently) varies by implementation.</p>
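<p>The regression itself is nothing exotic. As a rough sketch, here is a <em>tabular</em> critic (one number per state, stored in a dict) trained with gradient steps on squared error. This is a deliberate simplification: a real LLM critic is a neural network, and the states, returns, learning rate, and function name below are all assumptions for illustration.</p>

```python
def train_tabular_critic(episodes, lr=0.1, epochs=200):
    """Fit V(s) to observed returns by gradient steps on 0.5 * (V(s) - G)^2."""
    V = {}
    for _ in range(epochs):
        for state, G in episodes:
            v = V.get(state, 0.0)
            # Gradient of 0.5 * (v - G)^2 with respect to v is (v - G).
            V[state] = v - lr * (v - G)
    return V


# Hypothetical data: the same partial response scored 0.5 in one rollout, 0.7 in another.
prefix = "PPO is a reinforcement learning"
episodes = [(prefix, 0.5), (prefix, 0.7)]

V = train_tabular_critic(episodes)
print(round(V[prefix], 2))   # lands close to the mean return, 0.6
```

<p>The estimate settles near the average of the returns it has seen from that state, which is exactly what “expected total reward from this state onward” means.</p>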
</section>
<section id="fix-3-estimate-the-advantage-with-gae" class="level3">
<h3 class="anchored" data-anchor-id="fix-3-estimate-the-advantage-with-gae">Fix 3: Estimate the advantage with GAE</h3>
<p>There’s a subtle decision in <em>how</em> to compute the advantage from rollout data. Two extremes:</p>
<ul>
<li><strong>Use the actual rewards that followed.</strong> Accurate but noisy: at the mercy of every random thing that happened in the trajectory.</li>
<li><strong>Trust the value network’s predictions.</strong> Stable but biased: if the critic is wrong, your advantages are wrong.</li>
</ul>
<p>You can blend. Use the actual rewards for the next few steps, then have the critic estimate the rest. This is the <em>n-step</em> family of advantage estimators, parameterized by how many real-reward steps to use before deferring to the critic. Small <img src="https://latex.codecogs.com/png.latex?n"> trusts the critic. Large <img src="https://latex.codecogs.com/png.latex?n"> trusts the rewards.</p>
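<p>Here is a small sketch of the n-step estimator, under one assumption worth flagging: <code>values</code> carries one more entry than <code>rewards</code>, with the final entry being the value after the last step (0.0 for a terminal state). The reward and value numbers are invented, and the function name is mine.</p>

```python
def n_step_advantage(rewards, values, t, n, gamma):
    """n-step advantage at time t.

    rewards has length T; values has length T + 1, where values[T] is the
    value after the final step (0.0 for a terminal state).
    """
    T = len(rewards)
    n = min(n, T - t)                        # cannot look past the end of the trajectory
    real = sum(gamma ** k * rewards[t + k] for k in range(n))
    bootstrap = gamma ** n * values[t + n]   # the critic fills in the rest
    return real + bootstrap - values[t]


rewards = [0.0, 0.0, 1.0]          # hypothetical: reward only at the end
values = [0.5, 0.6, 0.8, 0.0]      # hypothetical critic outputs, terminal value 0
gamma = 1.0

one_step = n_step_advantage(rewards, values, t=0, n=1, gamma=gamma)   # leans on the critic
full = n_step_advantage(rewards, values, t=0, n=3, gamma=gamma)       # leans on the rewards
```

<p>With these numbers the 1-step estimate is about 0.1 (the critic’s next prediction minus its current one) and the full-trajectory estimate is 0.5 (the actual return minus the current prediction). Same state, same rollout, two different answers depending on whom you trust.</p>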
<p>Rather than picking one <img src="https://latex.codecogs.com/png.latex?n">, <strong>Generalized Advantage Estimation</strong> <span class="citation" data-cites="schulman_etal_2016">(Schulman et al. 2016)</span> takes an exponentially-weighted average across all <img src="https://latex.codecogs.com/png.latex?n">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B%5Ctext%7BGAE%7D%7D%20=%20(1-%5Clambda)%5Chat%7BA%7D_t%5E%7B(1)%7D%20+%20(1-%5Clambda)%5Clambda%5Chat%7BA%7D_t%5E%7B(2)%7D%20+%20(1-%5Clambda)%5Clambda%5E2%5Chat%7BA%7D_t%5E%7B(3)%7D%20+%20%5Cldots"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?(1-%5Clambda)"> prefactors normalize the weights to sum to 1 (geometric series). Small <img src="https://latex.codecogs.com/png.latex?%5Clambda"> puts most weight on the 1-step estimator (trust the critic); large <img src="https://latex.codecogs.com/png.latex?%5Clambda"> flattens the weights toward the full-trajectory estimator (trust the rewards). Default <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%200.95">.</p>
<p>The pleasing thing is that this weighted average collapses to a clean recursion. The intermediate identity that does the work is the <em>TD residual</em>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cdelta_t%20=%20r_t%20+%20%5Cgamma%20V(s_%7Bt+1%7D)%20-%20V(s_t)."></p>
<p>“TD” stands for <em>Temporal Difference</em>: the residual compares two value predictions (<img src="https://latex.codecogs.com/png.latex?V(s_t)"> and <img src="https://latex.codecogs.com/png.latex?V(s_%7Bt+1%7D)">) separated by one time step, after accounting for the actual reward observed during that step. If the critic were perfect, <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t"> would always be zero. When nonzero, <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t"> measures how wrong the critic was, and in which direction. The same number has two names because it plays two roles: it’s a <em>prediction error of the critic</em> (which is what trains the value network) and it’s the <em>simplest possible advantage estimate</em> (<img src="https://latex.codecogs.com/png.latex?%5Cdelta_t%20=%20%5Chat%7BA%7D_t%5E%7B(1)%7D">, the first row of the n-step ladder).</p>
<p>Now the GAE collapse. Each n-step advantage can be written as a sum of TD residuals, <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(n)%7D%20=%20%5Csum_%7Bk=0%7D%5E%7Bn-1%7D%20%5Cgamma%5Ek%20%5Cdelta_%7Bt+k%7D"> (the unpacking chain at the end walks through this for the 2-step case). Substituting into the GAE weighted average and collecting by <img src="https://latex.codecogs.com/png.latex?%5Cdelta">, the geometric series collapses each coefficient: <img src="https://latex.codecogs.com/png.latex?%5Cdelta_%7Bt+k%7D"> ends up with weight <img src="https://latex.codecogs.com/png.latex?(%5Cgamma%5Clambda)%5Ek">. So:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B%5Ctext%7BGAE%7D%7D%20=%20%5Cdelta_t%20+%20%5Cgamma%5Clambda%20%5C,%20%5Cdelta_%7Bt+1%7D%20+%20(%5Cgamma%5Clambda)%5E2%20%5C,%20%5Cdelta_%7Bt+2%7D%20+%20%5Cldots%20=%20%5Cdelta_t%20+%20%5Cgamma%5Clambda%20%5C,%20%5Chat%7BA%7D_%7Bt+1%7D%5E%7B%5Ctext%7BGAE%7D%7D."></p>
<p>That’s the entire GAE computation: a backward pass through the trajectory accumulating <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5Clambda">-decayed TD residuals.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1">advantages <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb3-2">gae <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb3-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">reversed</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(rewards))):</span>
<span id="cb3-4">    delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> rewards[t] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> values[t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> values[t]</span>
<span id="cb3-5">    gae <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> lam <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> gae</span>
<span id="cb3-6">    advantages.insert(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, gae)</span></code></pre></div></div>
<p>That’s it. GAE is one of the most important practical contributions in policy gradient methods. Most actor-critic implementations, PPO included, use it. But you don’t need to derive it to use it. Two knobs: <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> (discount factor, typically 0.99) and <img src="https://latex.codecogs.com/png.latex?%5Clambda"> (the GAE parameter, typically 0.95).</p>
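<p>If you’d rather check the collapse numerically than trust the algebra, the sketch below computes GAE two ways: the backward recursion, and the explicit sum of decayed TD residuals. It assumes <code>values</code> has one extra trailing entry for the state after the final step (the loop above needs the same thing to index <code>values[t + 1]</code> at the last timestep). The numbers and function names are made up.</p>

```python
def td_residuals(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has one extra entry."""
    return [rewards[t] + gamma * values[t + 1] - values[t] for t in range(len(rewards))]


def gae_recursive(rewards, values, gamma, lam):
    """Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}."""
    deltas = td_residuals(rewards, values, gamma)
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        gae = deltas[t] + gamma * lam * gae
        advantages[t] = gae
    return advantages


def gae_explicit(rewards, values, gamma, lam):
    """Direct sum: A_t = sum over k of (gamma * lam)^k * delta_{t+k}."""
    deltas = td_residuals(rewards, values, gamma)
    T = len(rewards)
    return [sum((gamma * lam) ** k * deltas[t + k] for k in range(T - t)) for t in range(T)]


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.0]   # terminal value appended as 0.0
a_rec = gae_recursive(rewards, values, gamma=0.99, lam=0.95)
a_sum = gae_explicit(rewards, values, gamma=0.99, lam=0.95)
```

<p>The two lists agree up to floating-point error. The extremes are worth trying too: <code>lam=0</code> returns the raw TD residuals, and <code>lam=1</code> with <code>gamma=1</code> returns the actual return minus the critic’s prediction.</p>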
<p>Fixes 1, 2, and 3 together complete the variance-reduction story. Fixes 4 and 5 attack the data-inefficiency problem.</p>
</section>
<section id="fix-4-importance-sampling-reusing-data" class="level3">
<h3 class="anchored" data-anchor-id="fix-4-importance-sampling-reusing-data">Fix 4: Importance sampling: reusing data</h3>
<p>We collected an expensive trajectory. We’d like to do many gradient steps on it, not one. The wrinkle that’s specific to RL: unlike supervised learning, where data sits in a fixed dataset, here you generate your training data by running the current policy. After one gradient step, the policy is different, and the trajectory we just used is now from the previous policy. Thus reusing this trajectory is, strictly speaking, off-policy.</p>
<p>Importance sampling is the statistical trick that corrects for this. Each old data point gets a weight equal to the ratio of “how likely is this action under the new policy” to “how likely was it under the old policy”:</p>
<p><img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20=%20%5Cfrac%7B%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%7D%7B%5Cpi_%7B%5Ctheta_%7B%5Ctext%7Bold%7D%7D%7D(a_t%20%5Cmid%20s_t)%7D."></p>
<p>If the new policy still likes the action just as much, the ratio is 1, and the data point counts normally. If the new policy now likes the action more, the ratio is greater than 1; if less, less than 1. Multiplied with the advantage, <img src="https://latex.codecogs.com/png.latex?r_t(%5Ctheta)%20%5Ccdot%20%5Chat%7BA%7D_t"> gives a corrected gradient signal that’s valid even though the data wasn’t generated by the current policy.</p>
<p>This is what lets you take many gradient steps on one batch of trajectories. Exactly the sample efficiency you want.</p>
<p>But there’s a problem. If the policy moves <em>a lot</em> from where the data was collected, those ratios can blow up. An action the old policy gave probability 0.01 to and the new policy gives probability 0.9 has ratio 90. One sample now contributes 90× as much as a normal one. Training becomes wildly unstable, sometimes catastrophically so: the policy can collapse to garbage and never recover.</p>
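<p>As a rough illustration of how quickly the weight grows, here is a minimal sketch that computes the ratio from log-probabilities, the form most implementations store. The helper name <code>importance_ratio</code> and the advantage value are assumptions made up for this example.</p>

```python
import math


def importance_ratio(log_prob_new: float, log_prob_old: float) -> float:
    """Return pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(log_prob_new - log_prob_old)


# The example from the text: the old policy gave 0.01, the new policy gives 0.9.
ratio = importance_ratio(math.log(0.9), math.log(0.01))

# A policy that has not moved gives ratio 1, so the sample counts normally.
unchanged = importance_ratio(math.log(0.3), math.log(0.3))

advantage = 1.0  # hypothetical advantage, chosen only for illustration
weighted_signal = ratio * advantage

print(round(ratio, 2))      # 90.0
print(round(unchanged, 2))  # 1.0
```

<p>A single sample weighted about 90 times more than its neighbours dominates the batch average, which is the instability described above.</p>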
<p>Importance sampling is a great tool, but only if you keep the new policy close to the old one. Which leads to the final ingredient and the actual contribution of the PPO paper.</p>
</section>
<section id="fix-5-clip-the-importance-ratio" class="level3">
<h3 class="anchored" data-anchor-id="fix-5-clip-the-importance-ratio">Fix 5: Clip the importance ratio</h3>
<p>The four fixes above (advantage, value network, GAE, importance sampling) were already standard prior to PPO.</p>
<p>To address the instability of importance sampling, TRPO (Trust Region Policy Optimization) added an explicit constraint: the new policy must stay within a KL-divergence bound of the old one. This works, but it requires solving a constrained optimization problem with second-order methods, and the implementation is a pain.</p>
<p>PPO’s contribution to RL is a single trick that gets <em>most</em> of the stability of TRPO without TRPO’s complexity: <em>If a sample is already pushing the policy in some direction, stop letting it push further.</em></p>
<p>Concretely:</p>
<p><img src="https://latex.codecogs.com/png.latex?J%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5C!%5Cleft%5B%5Cmin%5C!%5Cbig(r_t(%5Ctheta)%20%5C,%20%5Chat%7BA%7D_t,%20%5C;%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5C,%20%5Chat%7BA%7D_t%5Cbig)%5Cright%5D."></p>
<p>In words: take the importance-corrected advantage; also take a version where the ratio is clipped to <img src="https://latex.codecogs.com/png.latex?%5B1-%5Cepsilon,%201+%5Cepsilon%5D">; use whichever is smaller. The hyperparameter <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> controls “how far is too far”: it is typically 0.2, so the ratio can move freely within <img src="https://latex.codecogs.com/png.latex?%5B0.8,%201.2%5D"> before clipping kicks in.</p>
<p>The asymmetric <em>min</em> matters. The interesting question is when the min picks the unclipped value and when it picks the clipped one.</p>
<table class="caption-top table">
<colgroup>
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
<col style="width: 16%">
</colgroup>
<thead>
<tr class="header">
<th>Advantage</th>
<th>Ratio</th>
<th>Unclipped</th>
<th>Clipped</th>
<th>min picks</th>
<th>What this means</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D%20%3E%200"> (good)</td>
<td><img src="https://latex.codecogs.com/png.latex?r_t%20%3E%201.2"> (moved toward)</td>
<td><img src="https://latex.codecogs.com/png.latex?1.5%5Chat%7BA%7D"> (large +)</td>
<td><img src="https://latex.codecogs.com/png.latex?1.2%5Chat%7BA%7D"> (smaller +)</td>
<td><strong>clipped</strong></td>
<td>Already pushed right way; cap further reward</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D%20%3E%200"> (good)</td>
<td><img src="https://latex.codecogs.com/png.latex?r_t%20%3C%200.8"> (moved away)</td>
<td><img src="https://latex.codecogs.com/png.latex?0.5%5Chat%7BA%7D"> (small +)</td>
<td><img src="https://latex.codecogs.com/png.latex?0.8%5Chat%7BA%7D"> (larger +)</td>
<td><strong>unclipped</strong></td>
<td>Wrong direction; let full corrective gradient through</td>
</tr>
<tr class="odd">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D%20%3C%200"> (bad)</td>
<td><img src="https://latex.codecogs.com/png.latex?r_t%20%3E%201.2"> (moved toward)</td>
<td><img src="https://latex.codecogs.com/png.latex?1.5%5Chat%7BA%7D"> (very −)</td>
<td><img src="https://latex.codecogs.com/png.latex?1.2%5Chat%7BA%7D"> (less −)</td>
<td><strong>unclipped</strong></td>
<td>Wrong direction; let full corrective gradient through</td>
</tr>
<tr class="even">
<td><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D%20%3C%200"> (bad)</td>
<td><img src="https://latex.codecogs.com/png.latex?r_t%20%3C%200.8"> (moved away)</td>
<td><img src="https://latex.codecogs.com/png.latex?0.5%5Chat%7BA%7D"> (less −)</td>
<td><img src="https://latex.codecogs.com/png.latex?0.8%5Chat%7BA%7D"> (more −)</td>
<td><strong>clipped</strong></td>
<td>Already pushed right way; cap further punishment</td>
</tr>
</tbody>
</table>
<p>The pattern: when the policy has moved in the right direction, the min picks the clipped value, and we stop reinforcing that direction beyond the <img src="https://latex.codecogs.com/png.latex?%5Cepsilon"> band. When the policy has moved in the wrong direction, the min picks the unclipped value, and the full corrective gradient comes through.</p>
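<p>The four rows of the table can be reproduced with plain numbers. This is a minimal sketch in pure Python, not the training code: the function name <code>clipped_term</code> and the unit advantages are assumptions for illustration, and the ratios 1.5 and 0.5 are the ones the table uses.</p>

```python
def clipped_term(ratio: float, advantage: float, epsilon: float = 0.2):
    """Return the per-sample clipped objective and which branch the min picked."""
    unclipped = ratio * advantage
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    clipped = clipped_ratio * advantage
    value = min(unclipped, clipped)
    # The clipped branch wins only when it is strictly the smaller of the two.
    picked = "clipped" if unclipped > clipped else "unclipped"
    return value, picked


# (advantage, ratio) pairs in the same order as the table rows.
rows = [(+1.0, 1.5), (+1.0, 0.5), (-1.0, 1.5), (-1.0, 0.5)]
for advantage, ratio in rows:
    value, picked = clipped_term(ratio, advantage)
    print(f"A={advantage:+.0f}  r={ratio}  min={value:+.2f}  picks {picked}")
```

<p>When the clipped branch is picked, the value no longer depends on the ratio, so that sample contributes no gradient. That is the sense in which the objective stops reinforcing a direction the policy has already moved in.</p>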
<p>In PyTorch, the whole objective is four lines of code:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1">ratio <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.exp(log_probs_new <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> log_probs_old)  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># = pi_new / pi_old</span></span>
<span id="cb4-2">surr1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ratio <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> advantages</span>
<span id="cb4-3">surr2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.clamp(ratio, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> epsilon, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> epsilon) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> advantages</span>
<span id="cb4-4">policy_loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">min</span>(surr1, surr2).mean()</span></code></pre></div></div>
<p>That’s PPO. Five fixes, two problems solved.</p>
</section>
</section>
<section id="the-unpacking-chain" class="level2">
<h2 class="anchored" data-anchor-id="the-unpacking-chain">The unpacking chain</h2>
<p>I want to spend the rest of this piece on how to compute the clipped objective:</p>
<p><img src="https://latex.codecogs.com/png.latex?J%5E%7B%5Ctext%7BCLIP%7D%7D(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_t%5C!%5Cleft%5B%5Cmin%5C!%5Cbig(r_t(%5Ctheta)%20%5C,%20%5Chat%7BA%7D_t,%20%5C;%20%5Ctext%7Bclip%7D(r_t(%5Ctheta),%201-%5Cepsilon,%201+%5Cepsilon)%20%5C,%20%5Chat%7BA%7D_t%5Cbig)%5Cright%5D"></p>
<p>In particular, let’s drill through how to calculate the advantage <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t">. The advantage is the deepest piece because it is defined recursively, through several layers of intermediate quantities. The plan: start from the GAE definition at the top and unpack downward until we hit primitives we can actually compute.</p>
<p><strong>Level 1: GAE as a weighted average of n-step advantages:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B%5Ctext%7BGAE%7D%7D%20=%20(1-%5Clambda)%5C,%20%5Chat%7BA%7D_t%5E%7B(1)%7D%20+%20(1-%5Clambda)%5Clambda%5C,%20%5Chat%7BA%7D_t%5E%7B(2)%7D%20+%20(1-%5Clambda)%5Clambda%5E2%5C,%20%5Chat%7BA%7D_t%5E%7B(3)%7D%20+%20%5Cldots"></p>
<p>This is what GAE <em>is</em> by definition. The parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%5Cin%20%5B0,%201%5D"> controls a bias-variance trade-off. At <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%200">, GAE collapses to the 1-step estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(1)%7D">: take one real step of reward, then use the value function <img src="https://latex.codecogs.com/png.latex?V"> to estimate everything that comes after. This is low variance because only one stochastic reward enters, but it inherits whatever errors <img src="https://latex.codecogs.com/png.latex?V"> has. Larger <img src="https://latex.codecogs.com/png.latex?%5Clambda"> shifts weight toward longer-horizon estimates, which use more actual rewards and rely less on <img src="https://latex.codecogs.com/png.latex?V">. That lowers the bias but accumulates variance from the rewards themselves. The coefficients <img src="https://latex.codecogs.com/png.latex?(1-%5Clambda)%5Clambda%5E%7Bk-1%7D"> form a geometric series that sums to 1 for any <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%3C%201">, so GAE is a properly normalized weighted average.</p>
<p>The expression is in terms of n-step advantages <img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(n)%7D">, which themselves need unpacking. Conceptually, an n-step advantage estimates how much better the trajectory turned out than the value function had predicted, by looking <img src="https://latex.codecogs.com/png.latex?n"> steps ahead before bootstrapping with the value function.</p>
<p><strong>Level 2: n-step advantages as sums of TD residuals:</strong></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(1)%7D%20=%20%5Cdelta_t"></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(2)%7D%20=%20%5Cdelta_t%20+%20%5Cgamma%5Cdelta_%7Bt+1%7D"></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(3)%7D%20=%20%5Cdelta_t%20+%20%5Cgamma%5Cdelta_%7Bt+1%7D%20+%20%5Cgamma%5E2%5Cdelta_%7Bt+2%7D"></p>
<p>Each n-step advantage is a sum of TD residuals weighted by <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5Ek">. The TD residual <img src="https://latex.codecogs.com/png.latex?%5Cdelta_t"> is a 1-step “surprise”: the gap between what actually happened in one step and what the value function had predicted. Summing <img src="https://latex.codecogs.com/png.latex?n"> of these gives the <img src="https://latex.codecogs.com/png.latex?n">-step advantage. The n-step structure is now reduced to TD residuals. We still haven’t said what a TD residual <em>is</em>.</p>
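<p>Levels 1 and 2 can be checked against each other numerically. The sketch below treats the TD residuals as given numbers (made up for this example), builds the n-step advantages from them, takes the weighted average, and compares it with the usual backward recursion. One assumption is stated loudly: the episode ends after the last residual, so every estimate that would look past the end equals the longest available one, and the leftover weight is assigned to it. All function names are hypothetical.</p>

```python
def n_step_advantage(deltas, t, n, gamma):
    """Sum of the next n TD residuals starting at step t, weighted by gamma**k."""
    window = deltas[t:t + n]
    return sum(gamma ** k * d for k, d in enumerate(window))


def gae_weighted_average(deltas, t, gamma, lam):
    """Weighted average of n-step advantages with weights (1 - lam) * lam**(n - 1).

    Assumption: residuals past the end of the episode are zero, so the
    remaining weight lam**(horizon - 1) goes to the longest estimate.
    """
    horizon = len(deltas) - t
    total = 0.0
    for n in range(1, horizon):
        total += (1 - lam) * lam ** (n - 1) * n_step_advantage(deltas, t, n, gamma)
    total += lam ** (horizon - 1) * n_step_advantage(deltas, t, horizon, gamma)
    return total


def gae_recursive(deltas, gamma, lam):
    """Backward recursion: gae_t = delta_t + gamma * lam * gae_{t+1}."""
    advantages, gae = [], 0.0
    for delta in reversed(deltas):
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)
    return advantages


deltas = [0.5, -0.2, 1.0, 0.3]  # hypothetical TD residuals
gamma, lam = 0.99, 0.95

recursive = gae_recursive(deltas, gamma, lam)
averaged = [gae_weighted_average(deltas, t, gamma, lam) for t in range(len(deltas))]
print(recursive)
print(averaged)
```

<p>The two lists agree up to floating-point error, which is why implementations use the cheap recursion rather than forming every n-step estimate explicitly.</p>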
<p><strong>Level 3: TD residuals expanded into rewards and value predictions:</strong></p>
<p>A TD residual is <em>defined</em> as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cdelta_t%20=%20r_t%20+%20%5Cgamma%20V(s_%7Bt+1%7D)%20-%20V(s_t)"></p>
<p>Reading left to right: the actual reward at time <img src="https://latex.codecogs.com/png.latex?t">, plus the discounted value of where we ended up, minus what we had predicted before taking the action. A one-step prediction error of the value function.</p>
<p>Plugging the definition into the 2-step advantage:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BA%7D_t%5E%7B(2)%7D%20=%20%5Cdelta_t%20+%20%5Cgamma%5Cdelta_%7Bt+1%7D%20=%20%5Br_t%20+%20%5Cgamma%20V(s_%7Bt+1%7D)%20-%20V(s_t)%5D%20+%20%5Cgamma%5Br_%7Bt+1%7D%20+%20%5Cgamma%20V(s_%7Bt+2%7D)%20-%20V(s_%7Bt+1%7D)%5D"></p>
<p>Distributing the <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> in the second bracket:</p>
<p><img src="https://latex.codecogs.com/png.latex?=%20r_t%20+%20%5Cgamma%20V(s_%7Bt+1%7D)%20-%20V(s_t)%20+%20%5Cgamma%20r_%7Bt+1%7D%20+%20%5Cgamma%5E2%20V(s_%7Bt+2%7D)%20-%20%5Cgamma%20V(s_%7Bt+1%7D)"></p>
<p>The <img src="https://latex.codecogs.com/png.latex?+%5Cgamma%20V(s_%7Bt+1%7D)"> and <img src="https://latex.codecogs.com/png.latex?-%5Cgamma%20V(s_%7Bt+1%7D)"> cancel, leaving:</p>
<p><img src="https://latex.codecogs.com/png.latex?=%20r_t%20+%20%5Cgamma%20r_%7Bt+1%7D%20+%20%5Cgamma%5E2%20V(s_%7Bt+2%7D)%20-%20V(s_t)"></p>
<p>This is the standard 2-step formula. The cancellation is what is meant by <em>the interior value terms telescope away</em>: all the intermediate value predictions drop out, leaving only the rewards along the trajectory and the value function evaluated at the endpoints.</p>
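<p>The telescoping can be confirmed with numbers. In this small sketch the rewards and value predictions are invented for illustration; it computes the 2-step advantage once from TD residuals and once from the closed form above.</p>

```python
def td_residual(rewards, values, t, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return rewards[t] + gamma * values[t + 1] - values[t]


rewards = [1.0, 0.0, 2.0]      # hypothetical r_0, r_1, r_2
values = [0.5, 0.8, 1.5, 0.0]  # hypothetical V(s_0) ... V(s_3)
gamma = 0.99

# Level 2 form: delta_0 + gamma * delta_1
from_residuals = (
    td_residual(rewards, values, 0, gamma)
    + gamma * td_residual(rewards, values, 1, gamma)
)

# Closed form after cancellation: r_0 + gamma * r_1 + gamma**2 * V(s_2) - V(s_0)
closed_form = rewards[0] + gamma * rewards[1] + gamma ** 2 * values[2] - values[0]

print(from_residuals, closed_form)
```

<p>Note that <code>values[1]</code> appears in both residuals but not in the closed form: it is the interior term that cancels.</p>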
<p>At this level, GAE has been reduced to two computable building blocks: per-step rewards <img src="https://latex.codecogs.com/png.latex?r_t"> and critic predictions <img src="https://latex.codecogs.com/png.latex?V(s_t)">. The critic predictions are direct outputs of the value network: a forward pass. They don’t need further unpacking. The rewards do.</p>
<p><strong>Level 4: the per-step reward.</strong></p>
<p>In standard RL, <img src="https://latex.codecogs.com/png.latex?r_t"> is whatever the environment gives you. Atari emits a score; a board game emits +1 or -1 at the end. There is nothing to unpack: <img src="https://latex.codecogs.com/png.latex?r_t"> is a primitive.</p>
<p>For LLMs, this is the level where the application-specific machinery shows up. RLHF, RLVR, and other variants all attach at exactly this level: they each define <img src="https://latex.codecogs.com/png.latex?r_t"> differently. That definition is what turns a generic policy gradient into RLHF, RLVR, or any other LLM-RL variant. Everything above this level (GAE, n-step advantages, TD residuals, the clipped objective) stays the same.</p>
<p>Two common forms:</p>
<ul>
<li><strong>RLHF.</strong> A separate <em>reward model</em> is trained on human preference data (pairs of “this response is better than that one”). The reward model then scores each completion the policy generates, and that score is <img src="https://latex.codecogs.com/png.latex?r_t">.</li>
<li><strong>RLVR.</strong> A <em>verifier</em> checks the answer against ground truth: did the math come out right, did the code pass the tests, did the extracted JSON match the schema? The verifier emits a numeric score, and that is <img src="https://latex.codecogs.com/png.latex?r_t">.</li>
</ul>
<p>Most LLM-RL recipes also add a <em>KL penalty</em> to <img src="https://latex.codecogs.com/png.latex?r_t">: a term that penalizes the policy for drifting too far from a frozen reference (usually the base model before RL). This keeps the trained model from collapsing into degenerate high-reward outputs.</p>
<p>Apart from these substitutions, the algorithm is unchanged. The clipped objective, the GAE advantage, the value model, the policy update: all the same. Switching from RLHF to RLVR is switching what <img src="https://latex.codecogs.com/png.latex?r_t"> is, nothing more.</p>
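<p>To make the reward level concrete, here is one common way the pieces are combined, offered as a sketch under assumptions rather than as the recipe of any particular system. It assumes the reward model or verifier returns a single score per completion, that the score is credited to the final token, and that the KL penalty is estimated per token as the difference of log-probabilities scaled by a coefficient <code>beta</code>. The function name and all numbers are hypothetical.</p>

```python
def per_token_rewards(score, log_probs_policy, log_probs_ref, beta=0.1):
    """Build r_t for each generated token.

    Assumptions (not taken from the text): the sequence-level score lands on
    the last token, and the per-token KL estimate is log pi - log pi_ref.
    """
    rewards = []
    last = len(log_probs_policy) - 1
    for t, (lp, lp_ref) in enumerate(zip(log_probs_policy, log_probs_ref)):
        kl_penalty = beta * (lp - lp_ref)
        terminal = score if t == last else 0.0
        rewards.append(terminal - kl_penalty)
    return rewards


# Hypothetical three-token completion scored 1.0 by a verifier.
log_probs_policy = [-0.5, -1.0, -0.2]
log_probs_ref = [-0.7, -1.0, -0.9]
rewards = per_token_rewards(1.0, log_probs_policy, log_probs_ref)
print(rewards)
```

<p>Swapping the reward model for a verifier only changes where <code>score</code> comes from. The resulting list of per-token rewards feeds the TD residuals exactly as before.</p>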
<p>The full chain, top to bottom: clipped objective → GAE advantage → n-step advantages → TD residuals → rewards and value predictions → for LLMs, log-probs and a learned or verified reward signal. Each level answers “but how do you actually compute the thing one level up?” Eventually you bottom out at things produced by neural-network forward passes, plus whatever reward signal your problem provides.</p>
<p>This is the picture I wish I’d had when I first read the PPO paper. The clipped objective at the top, the unpacking chain underneath, and the application-specific reward sitting at the bottom waiting to be plugged in. Once you have it, the rest is just engineering.</p>
</section>
<section id="whats-next" class="level2">
<h2 class="anchored" data-anchor-id="whats-next">What’s next</h2>
<p>PPO was the original workhorse for RLHF, but the field has moved. Two algorithms have displaced it in different settings:</p>
<ul>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/03-dpo/index.html"><strong>DPO</strong></a> (Direct Preference Optimization) skips the reward model and the RL machinery entirely. Train directly on preference pairs with a clever loss derived from the Bradley-Terry math that justifies RLHF. Much simpler: no PPO, no rollouts, no value network. It also works surprisingly well. It’s displaced PPO in many open-source pipelines.</li>
<li><a href="../../../../posts/series/how-llms-learn-to-reason/02-grpo/index.html"><strong>GRPO</strong></a> (Group Relative Policy Optimization) is a PPO variant from DeepSeek that drops the value network and computes advantages by comparing rollouts to each other. Memory-efficient and well-suited for <em>verifiable</em> reward settings like math and code, where you don’t need a learned reward model, and a programmatic verifier provides the signal. GRPO is what powers most recent reasoning models.</li>
</ul>
<p>Both are easier to understand once PPO is in your head. DPO is “what if we collapsed PPO into a single supervised loss?” GRPO is “what if we kept PPO but dropped the critic?”</p>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-schulman_etal_2016" class="csl-entry">
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2016. <span>“High-Dimensional Continuous Control Using Generalized Advantage Estimation.”</span> <em>International Conference on Learning Representations</em>. <a href="https://arxiv.org/abs/1506.02438">https://arxiv.org/abs/1506.02438</a>.
</div>
<div id="ref-schulman_etal_2017" class="csl-entry">
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. <span>“Proximal Policy Optimization Algorithms.”</span> <em>arXiv Preprint</em>. <a href="https://arxiv.org/abs/1707.06347">https://arxiv.org/abs/1707.06347</a>.
</div>
</div></section></div> ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/01-ppo/</guid>
  <pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>REINFORCE: the gradient that drives training</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-gradient/</link>
  <description><![CDATA[ 





<!-- Shared series navigation. Each PART includes this file with a Quarto
     include shortcode pointing at ../_series.qmd (see any part's index.qmd
     for the exact syntax — do NOT repeat that shortcode here, it would
     recurse). Links are sibling-relative (../NN-slug/) so they resolve
     identically from any part. When you add a part, add one line here.
     Files starting with "_" are never rendered as their own page. -->
<div class="series-nav">
<div class="series-label">
Part of a series
</div>
<div class="series-name">
How LLMs learn to reason
</div>
<!-- Add parts as an ordered list below as you publish, e.g.
     1. [Part title](../01-slug/)
     2. [Part title](../02-slug/) -->
<ol type="1">
<li><a href="../00-reinforce-foundations/">REINFORCE: the world before the gradient</a></li>
<li><a href="../00-reinforce-gradient/">REINFORCE: the gradient that drives training</a></li>
<li><a href="../01-ppo/">PPO is REINFORCE plus five fixes</a></li>
<li><a href="../02-grpo/">GRPO: The algorithm behind reasoning models</a></li>
<li><a href="../03-dpo/">DPO: RLHF collapsed into one loss</a></li>
<li><a href="../04-r1/">R1-Zero was the result. R1 was the product</a></li>
<li><a href="../05-tool-use/">How reasoning models learn to use tools</a></li>
</ol>
</div>
<p><a href="../../../../posts/series/how-llms-learn-to-reason/00-reinforce-foundations/index.html">Part 1</a> built the world REINFORCE lives in and arrived at the objective: <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">, the thing we want to maximize. What it did not explain is <em>why</em> gradient ascent on that objective (equivalently, gradient descent on its negative) actually makes good actions more likely. That question lives one layer down, in how the gradient flows through the softmax to the logits and back to the parameters. This part derives that mechanism.</p>
<section id="the-objective-and-its-loss-form" class="level2">
<h2 class="anchored" data-anchor-id="the-objective-and-its-loss-form">1. The objective and its loss form</h2>
<p>Restating, with the notation now grounded:</p>
<p><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> is the objective we want to maximize.</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is the network’s parameters.</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> is the probability the policy assigns to action <img src="https://latex.codecogs.com/png.latex?a_t"> in state <img src="https://latex.codecogs.com/png.latex?s_t">.</li>
<li><img src="https://latex.codecogs.com/png.latex?G_t"> is the return from time <img src="https://latex.codecogs.com/png.latex?t">: positive when the trajectory was good, negative when bad.</li>
</ul>
<p>Neural networks are typically trained by <em>minimizing</em> a loss, so define</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20-J(%5Ctheta)%20=%20-%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<p>Minimizing <img src="https://latex.codecogs.com/png.latex?L"> is equivalent to maximizing <img src="https://latex.codecogs.com/png.latex?J">. The sign flip is purely a notational convenience for using minimize-by-default optimizers.</p>
</section>
<section id="why-the-loss-form-does-the-right-thing" class="level2">
<h2 class="anchored" data-anchor-id="why-the-loss-form-does-the-right-thing">2. Why the loss form does the right thing</h2>
<p>Look at one term in isolation: <img src="https://latex.codecogs.com/png.latex?L%20=%20-G%20%5Clog%20p">, where <img src="https://latex.codecogs.com/png.latex?p%20=%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">.</p>
<p><strong>Good action (<img src="https://latex.codecogs.com/png.latex?G%20%3E%200">).</strong> If <img src="https://latex.codecogs.com/png.latex?p"> increases, <img src="https://latex.codecogs.com/png.latex?%5Clog%20p"> increases (becomes less negative), so <img src="https://latex.codecogs.com/png.latex?-G%20%5Clog%20p"> decreases. Minimizing <img src="https://latex.codecogs.com/png.latex?L"> encourages <img src="https://latex.codecogs.com/png.latex?p"> to grow.</p>
<div id="fig-case1" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Good action case.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-case1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-gradient/case1.png" id="fig-case1" class="img-fluid figure-img" alt="Good action case.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-case1-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
<p><strong>Bad action (<img src="https://latex.codecogs.com/png.latex?G%20%3C%200">).</strong> Then <img src="https://latex.codecogs.com/png.latex?-G"> is positive, so <img src="https://latex.codecogs.com/png.latex?L%20=%20-G%20%5Clog%20p%20=%20(%5Ctext%7Bpositive%7D)%20%5Ccdot%20%5Clog%20p">. If <img src="https://latex.codecogs.com/png.latex?p"> decreases, <img src="https://latex.codecogs.com/png.latex?%5Clog%20p"> becomes more negative, so <img src="https://latex.codecogs.com/png.latex?L"> decreases. Minimizing <img src="https://latex.codecogs.com/png.latex?L"> encourages <img src="https://latex.codecogs.com/png.latex?p"> to shrink.</p>
<div id="fig-case2" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Bad action case.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-case2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-gradient/case2.png" id="fig-case2" class="img-fluid figure-img" alt="Bad action case.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-case2-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2
</figcaption>
</figure>
</div>
<p>So the form of <img src="https://latex.codecogs.com/png.latex?L"> encodes the right thing: minimizing it pushes good-action probabilities up and bad-action probabilities down. This is the easy part of the argument. It tells us that <em>if</em> probabilities move appropriately, the loss decreases. The harder question is whether gradient descent on <img src="https://latex.codecogs.com/png.latex?L"> actually causes them to move that way. For that, we need to look one layer deeper.</p>
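<p>The two cases can be checked with a few numbers. This is only a sanity check of the sign argument for a single term; the probabilities and returns are chosen arbitrarily, and <code>term_loss</code> is a hypothetical helper name.</p>

```python
import math


def term_loss(G: float, p: float) -> float:
    """One term of the REINFORCE loss: L = -G * log(p)."""
    return -G * math.log(p)


# Good action (G = +4): raising p from 0.7 to 0.8 lowers the loss.
good_before, good_after = term_loss(4.0, 0.7), term_loss(4.0, 0.8)

# Bad action (G = -4): lowering p from 0.7 to 0.6 lowers the loss.
bad_before, bad_after = term_loss(-4.0, 0.7), term_loss(-4.0, 0.6)

print(good_before, good_after)
print(bad_before, bad_after)
```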
</section>
<section id="a-useful-reframe-reinforce-is-supervised-learning-with-two-tweaks" class="level2">
<h2 class="anchored" data-anchor-id="a-useful-reframe-reinforce-is-supervised-learning-with-two-tweaks">3. A useful reframe: REINFORCE is supervised learning with two tweaks</h2>
<p>Before diving into the gradient mechanics, here’s a mental model that makes the REINFORCE loss feel less arbitrary and connects it to something you already know.</p>
<p>In standard supervised classification, the cross-entropy loss is</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%7B%5Ctext%7Bsup%7D%7D%20=%20-%5Csum_i%20%5Clog%20p(y_i%20%5Cmid%20x_i),"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?y_i"> is the true label for input <img src="https://latex.codecogs.com/png.latex?x_i">. Minimizing this pushes the network’s predicted probability of the correct class upward.</p>
<p>Now compare to REINFORCE:</p>
<p><img src="https://latex.codecogs.com/png.latex?L_%7B%5Ctext%7BREINFORCE%7D%7D%20=%20-%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<p>These are the <em>same form</em>, with two changes:</p>
<ol type="1">
<li><strong>Replace the label with the sampled action.</strong> We don’t know what action <em>should</em> have been taken. There’s no oracle telling us “the correct move from <img src="https://latex.codecogs.com/png.latex?(2,%203)"> was <img src="https://latex.codecogs.com/png.latex?%5Crightarrow">.” So we use the action the policy actually sampled as a kind of pseudo-label.</li>
<li><strong>Multiply each term by <img src="https://latex.codecogs.com/png.latex?G_t">.</strong> Supervised learning treats every example as equally important. Every label is “correct” by definition. REINFORCE doesn’t have that luxury, so it weights each pseudo-labeled example by how good the outcome turned out to be. Positive <img src="https://latex.codecogs.com/png.latex?G_t"> scales the term up (do <em>more</em> of this); negative <img src="https://latex.codecogs.com/png.latex?G_t"> flips its sign (do <em>less</em> of this).</li>
</ol>
<p>Read this way, REINFORCE is just supervised learning where (a) the agent generates its own pseudo-labels by sampling, and (b) each example is weighted by how good the outcome turned out to be.</p>
<p>This means the gradient mechanics we’re about to derive are <em>exactly</em> the gradient mechanics of supervised cross-entropy, scaled by <img src="https://latex.codecogs.com/png.latex?G_t">. If you’ve ever computed the gradient of cross-entropy through a softmax, you’ve already done most of the work. The next few sections are just making that work explicit and showing how the <img src="https://latex.codecogs.com/png.latex?G_t"> scaling rides along.</p>
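<p>The reframe is easy to see in code. In this sketch, with made-up probabilities and returns and hypothetical function names, setting every return to 1 turns the REINFORCE loss into ordinary cross-entropy on the sampled actions.</p>

```python
import math


def cross_entropy(probs_of_labels):
    """Supervised loss: minus the summed log-probability of the true labels."""
    return -sum(math.log(p) for p in probs_of_labels)


def reinforce_loss(probs_of_sampled_actions, returns):
    """Same sum, but each term is weighted by the return G_t."""
    return -sum(g * math.log(p) for g, p in zip(returns, probs_of_sampled_actions))


probs = [0.73, 0.40, 0.90]  # hypothetical probabilities of the sampled actions

# With all returns equal to 1, the two losses coincide.
as_supervised = reinforce_loss(probs, [1.0, 1.0, 1.0])
supervised = cross_entropy(probs)

# With real returns, each pseudo-labeled example is reweighted; a negative
# return flips the sign of its term.
weighted = reinforce_loss(probs, [4.0, -4.0, 0.5])

print(as_supervised, supervised, weighted)
```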
</section>
<section id="the-network-outputs-logits-not-probabilities" class="level2">
<h2 class="anchored" data-anchor-id="the-network-outputs-logits-not-probabilities">4. The network outputs logits, not probabilities</h2>
<p>The neural network does not directly output probabilities. It outputs <strong>logits</strong>, which are then converted to probabilities via softmax. With two actions:</p>
<p><img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)%20=%20%5Cfrac%7Be%5E%7Bz_%5Crightarrow%7D%7D%7Be%5E%7Bz_%5Crightarrow%7D%20+%20e%5E%7Bz_%5Cleftarrow%7D%7D."></p>
<p>For example, with <img src="https://latex.codecogs.com/png.latex?z_%5Crightarrow%20=%201.0"> and <img src="https://latex.codecogs.com/png.latex?z_%5Cleftarrow%20=%200.0">:</p>
<p><img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)%20=%20%5Cfrac%7Be%5E1%7D%7Be%5E1%20+%20e%5E0%7D%20%5Capprox%200.73,%20%5Cqquad%20p(%5Cleftarrow)%20%5Capprox%200.27."></p>
<p>Logits determine probabilities, and changing logits changes probabilities. Gradient descent updates the logits (and ultimately the weights that produce them):</p>
<p><img src="https://latex.codecogs.com/png.latex?z_%5Crightarrow%20%5Cleftarrow%20z_%5Crightarrow%20-%20%5Ceta%20%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z_%5Crightarrow%7D."></p>
<p>So the question becomes: what is <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20L%20/%20%5Cpartial%20z_%5Crightarrow">, and does its sign push the logit in a direction that makes the action’s probability move correctly?</p>
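<p>For reference, the two-action softmax above in a few lines of Python. Subtracting the largest logit before exponentiating is a standard numerical-stability step that does not change the result; the function name is an assumption for this sketch.</p>

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting a constant leaves the probabilities unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits from the text: z_right = 1.0, z_left = 0.0
p_right, p_left = softmax([1.0, 0.0])
print(round(p_right, 2), round(p_left, 2))  # 0.73 0.27
```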
</section>
<section id="the-chain-rule" class="level2">
<h2 class="anchored" data-anchor-id="the-chain-rule">5. The chain rule</h2>
<p>Focus on one step: <img src="https://latex.codecogs.com/png.latex?L%20=%20-G%20%5Clog%20p(%5Crightarrow)">. By the chain rule,</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z_%5Crightarrow%7D%20=%20-G%20%5Ccdot%20%5Cfrac%7B%5Cpartial%20%5Clog%20p(%5Crightarrow)%7D%7B%5Cpartial%20z_%5Crightarrow%7D."></p>
<p>We need <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20%5Clog%20p(%5Crightarrow)%20/%20%5Cpartial%20z_%5Crightarrow">.</p>
</section>
<section id="deriving-the-key-softmax-gradient" class="level2">
<h2 class="anchored" data-anchor-id="deriving-the-key-softmax-gradient">6. Deriving the key softmax gradient</h2>
<p>Starting from the softmax and using <img src="https://latex.codecogs.com/png.latex?%5Clog(A/B)%20=%20%5Clog%20A%20-%20%5Clog%20B">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clog%20p(%5Crightarrow)%20=%20%5Clog%5C!%5Cleft(%5Cfrac%7Be%5E%7Bz_%5Crightarrow%7D%7D%7Be%5E%7Bz_%5Crightarrow%7D%20+%20e%5E%7Bz_%5Cleftarrow%7D%7D%5Cright)%20=%20z_%5Crightarrow%20-%20%5Clog%5C!%5Cleft(e%5E%7Bz_%5Crightarrow%7D%20+%20e%5E%7Bz_%5Cleftarrow%7D%5Cright)."></p>
<p>Differentiate term by term. The first term gives <img src="https://latex.codecogs.com/png.latex?1">. For the second, let <img src="https://latex.codecogs.com/png.latex?u%20=%20e%5E%7Bz_%5Crightarrow%7D%20+%20e%5E%7Bz_%5Cleftarrow%7D">. Then <img src="https://latex.codecogs.com/png.latex?%5Cpartial%20u%20/%20%5Cpartial%20z_%5Crightarrow%20=%20e%5E%7Bz_%5Crightarrow%7D"> (the <img src="https://latex.codecogs.com/png.latex?e%5E%7Bz_%5Cleftarrow%7D"> piece doesn’t depend on <img src="https://latex.codecogs.com/png.latex?z_%5Crightarrow">), so</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20%5Clog%20u%7D%7B%5Cpartial%20z_%5Crightarrow%7D%20=%20%5Cfrac%7B1%7D%7Bu%7D%20%5Ccdot%20e%5E%7Bz_%5Crightarrow%7D%20=%20%5Cfrac%7Be%5E%7Bz_%5Crightarrow%7D%7D%7Be%5E%7Bz_%5Crightarrow%7D%20+%20e%5E%7Bz_%5Cleftarrow%7D%7D%20=%20p(%5Crightarrow)."></p>
<p>Putting it together:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20%5Clog%20p(%5Crightarrow)%7D%7B%5Cpartial%20z_%5Crightarrow%7D%20=%201%20-%20p(%5Crightarrow)."></p>
<p>This is a clean result: as the policy grows more confident in the action (<img src="https://latex.codecogs.com/png.latex?p%20%5Cto%201">), the gradient shrinks toward zero. There’s nothing left to push.</p>
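<p>One way to sanity-check the result above is to compare it against a finite-difference estimate. The sketch below does that for a two-action softmax. The logit values and the helper names <code>log_p_right</code> and <code>finite_difference</code> are illustrative assumptions, not part of the derivation.</p>

```python
import math

def log_p_right(z_right, z_left):
    """Log-probability of the 'right' action under a two-way softmax."""
    return z_right - math.log(math.exp(z_right) + math.exp(z_left))

def finite_difference(z_right, z_left, eps=1e-6):
    """Central-difference estimate of d log p(right) / d z_right."""
    up = log_p_right(z_right + eps, z_left)
    down = log_p_right(z_right - eps, z_left)
    return (up - down) / (2 * eps)

# Illustrative logits (assumed values): a gap of 1.0 gives p(right) of about 0.73.
z_right, z_left = 1.0, 0.0
p_right = math.exp(log_p_right(z_right, z_left))

numeric = finite_difference(z_right, z_left)
analytic = 1.0 - p_right  # the closed form derived above

print(f"p(right) = {p_right:.4f}")
print(f"numeric  = {numeric:.6f}")
print(f"analytic = {analytic:.6f}")
```

<p>The two numbers should agree to several decimal places. Raising <code>z_right</code> pushes both toward zero, which is the saturation effect described above.</p>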
</section>
<section id="the-gradient-on-the-loss" class="level2">
<h2 class="anchored" data-anchor-id="the-gradient-on-the-loss">7. The gradient on the loss</h2>
<p>Substituting back:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z_%5Crightarrow%7D%20=%20-G%5C,(1%20-%20p(%5Crightarrow))."></p>
<p>This is the core equation. With it, we can check that gradient descent does the right thing.</p>
<p>Use <img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)%20=%200.73"> and <img src="https://latex.codecogs.com/png.latex?1%20-%20p%20=%200.27">.</p>
<p><strong>Good action (<img src="https://latex.codecogs.com/png.latex?G%20=%20+4">):</strong> <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z_%5Crightarrow%7D%20=%20-4%20%5Ctimes%200.27%20=%20-1.08%20%5Cquad%5CRightarrow%5Cquad%20z_%5Crightarrow%20%5Cleftarrow%20z_%5Crightarrow%20+%201.08%5C,%5Ceta."> <img src="https://latex.codecogs.com/png.latex?z_%5Crightarrow"> increases, which pushes <img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)"> up.</p>
<p><strong>Bad action (<img src="https://latex.codecogs.com/png.latex?G%20=%20-4">):</strong> <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B%5Cpartial%20L%7D%7B%5Cpartial%20z_%5Crightarrow%7D%20=%20-(-4)%20%5Ctimes%200.27%20=%20+1.08%20%5Cquad%5CRightarrow%5Cquad%20z_%5Crightarrow%20%5Cleftarrow%20z_%5Crightarrow%20-%201.08%5C,%5Ceta."> <img src="https://latex.codecogs.com/png.latex?z_%5Crightarrow"> decreases, which pushes <img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)"> down.</p>
</section>
<section id="closing-the-loop" class="level2">
<h2 class="anchored" data-anchor-id="closing-the-loop">8. Closing the loop</h2>
<p>Increasing <img src="https://latex.codecogs.com/png.latex?z_%5Crightarrow"> enlarges the numerator of the softmax and pushes <img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)"> up; decreasing it pushes <img src="https://latex.codecogs.com/png.latex?p(%5Crightarrow)"> down. Combined with the gradient signs from the previous section:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?G%20%3E%200%20%5C;%5CRightarrow%5C;%20%5Cpartial%20L%20/%20%5Cpartial%20z_%5Crightarrow%20%3C%200%20%5C;%5CRightarrow%5C;%20z_%5Crightarrow%20%5Cuparrow%20%5C;%5CRightarrow%5C;%20p(%5Crightarrow)%20%5Cuparrow"></li>
<li><img src="https://latex.codecogs.com/png.latex?G%20%3C%200%20%5C;%5CRightarrow%5C;%20%5Cpartial%20L%20/%20%5Cpartial%20z_%5Crightarrow%20%3E%200%20%5C;%5CRightarrow%5C;%20z_%5Crightarrow%20%5Cdownarrow%20%5C;%5CRightarrow%5C;%20p(%5Crightarrow)%20%5Cdownarrow"></li>
</ul>
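<p>The two sign chains above can be replayed numerically. This is a minimal sketch that, like the derivation, moves only the logit for the “right” action. The starting logits and the learning rate are assumed values chosen so that the starting probability is close to 0.73.</p>

```python
import math

def p_right(z_right, z_left):
    """Two-way softmax probability of the 'right' action."""
    e_right, e_left = math.exp(z_right), math.exp(z_left)
    return e_right / (e_right + e_left)

def step_on_z_right(z_right, z_left, G, eta):
    """One gradient-descent step on L = -G * log p(right), moving z_right only."""
    grad = -G * (1.0 - p_right(z_right, z_left))  # dL/dz_right
    return z_right - eta * grad

# Assumed illustrative values: logits with a gap of 1.0 and learning rate 0.1.
z_right, z_left, eta = 1.0, 0.0, 0.1
before = p_right(z_right, z_left)

after_good = p_right(step_on_z_right(z_right, z_left, G=4.0, eta=eta), z_left)
after_bad = p_right(step_on_z_right(z_right, z_left, G=-4.0, eta=eta), z_left)

print(f"before: {before:.3f}  after G=+4: {after_good:.3f}  after G=-4: {after_bad:.3f}")
```

<p>A positive return should leave the probability higher than before, and a negative return should leave it lower.</p>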
</section>
<section id="the-general-update-rule" class="level2">
<h2 class="anchored" data-anchor-id="the-general-update-rule">9. The general update rule</h2>
<p>Everything above was for a single timestep and a single logit. For the full sum across the trajectory, the same logic gives</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20L%20=%20-%5Csum_t%20G_t%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t),"></p>
<p>and the gradient-descent step on <img src="https://latex.codecogs.com/png.latex?L"> becomes</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5Cleftarrow%20%5Ctheta%20-%20%5Ceta%20%5Cnabla_%5Ctheta%20L%20=%20%5Ctheta%20+%20%5Ceta%20%5Csum_t%20G_t%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<p>This is the <strong>REINFORCE update</strong>:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5CDelta%20%5Ctheta%20%5Cpropto%20G_t%20%5C,%20%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)."></p>
<p>Each parameter gets nudged in the direction <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">, the direction that would increase the probability of the action that was actually taken, scaled by <img src="https://latex.codecogs.com/png.latex?G_t">, the return that followed. Good actions (<img src="https://latex.codecogs.com/png.latex?G_t%20%3E%200">) get reinforced; bad actions (<img src="https://latex.codecogs.com/png.latex?G_t%20%3C%200">) get suppressed.</p>
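<p>To see the full-trajectory update in motion, here is a minimal sketch with a tabular softmax policy, where the parameters are one logit per state and action. The two-state corridor, the state names, and the function <code>reinforce_update</code> are hypothetical. The sketch relies on the standard softmax identity that the gradient of the log-probability with respect to each logit is an indicator for the chosen action minus that action's probability.</p>

```python
import math

ACTIONS = ["right", "left"]

def softmax(logits):
    """Turn a list of logits into probabilities (shifted for numerical stability)."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(theta, trajectory, eta):
    """Apply theta += eta * sum_t G_t * grad log pi(a_t | s_t) for one trajectory.

    theta maps each state to a list of logits, one per action.
    trajectory is a list of (state, action_index, reward) tuples.
    """
    # Undiscounted returns G_t, accumulated from the end of the episode backward.
    returns, running = [], 0.0
    for _, _, reward in reversed(trajectory):
        running += reward
        returns.append(running)
    returns.reverse()

    # Accumulate the whole gradient first so every term uses the same theta.
    grads = {state: [0.0] * len(logits) for state, logits in theta.items()}
    for (state, action, _), G in zip(trajectory, returns):
        probs = softmax(theta[state])
        for i, p in enumerate(probs):
            indicator = 1.0 if i == action else 0.0
            grads[state][i] += G * (indicator - p)  # d log pi / d logit_i

    for state in theta:
        theta[state] = [z + eta * g for z, g in zip(theta[state], grads[state])]
    return returns

# Hypothetical two-state corridor: both steps go right, the last one pays off.
theta = {"s0": [0.0, 0.0], "s1": [0.0, 0.0]}
trajectory = [("s0", 0, -1.0), ("s1", 0, 5.0)]
returns = reinforce_update(theta, trajectory, eta=0.1)
print(returns)
print({state: softmax(logits) for state, logits in theta.items()})
```

<p>Both actions were followed by a positive return, so both “right” logits rise and the policy leans further toward the actions it took.</p>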


</section>

 ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-gradient/</guid>
  <pubDate>Fri, 20 Feb 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>REINFORCE: the world before the gradient</title>
  <dc:creator>Yee Seng Chan</dc:creator>
  <link>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-foundations/</link>
  <description><![CDATA[ 





<p>These notes walk through REINFORCE end-to-end. This first half builds the conceptual world the algorithm lives in: what RL is, what an MDP is, what value functions are, what trajectories are, where the objective <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> comes from, and what every piece of notation in it means. The <a href="../../../../posts/series/how-llms-learn-to-reason/00-reinforce-gradient/index.html">second half</a> derives the gradient that actually drives training. Both halves are needed: one tells you <em>what</em> you’re maximizing, the other tells you <em>how</em> gradient descent accomplishes it.</p>
<section id="rl-is-a-different-kind-of-learning-problem" class="level2">
<h2 class="anchored" data-anchor-id="rl-is-a-different-kind-of-learning-problem">1. RL is a different kind of learning problem</h2>
<p>In supervised learning, you have a dataset of inputs and labels, and you train a model to map one to the other. Reinforcement learning has no dataset. You have:</p>
<ul>
<li>an <strong>environment</strong> with states,</li>
<li>an <strong>agent</strong> that takes actions and moves between states,</li>
<li>a <strong>reward signal</strong> that tells the agent (after the fact) how well things are going.</li>
</ul>
<p>The agent’s job is to figure out a strategy, a <strong>policy</strong>, that collects as much reward as possible over time. Unlike supervised learning, the data is generated by the agent’s own behavior, which means a bad policy generates bad data. This is what makes RL hard in a way classification never is.</p>
<p>Throughout these notes we’ll use a small grid world. The agent occupies a square and can move up, down, left, or right. A few squares are terminal: <img src="https://latex.codecogs.com/png.latex?+5"> in one corner, <img src="https://latex.codecogs.com/png.latex?+4"> in another, and a “penalty” square that punishes with <img src="https://latex.codecogs.com/png.latex?-1">. Most squares are empty.</p>
</section>
<section id="the-markov-decision-process" class="level2">
<h2 class="anchored" data-anchor-id="the-markov-decision-process">2. The Markov Decision Process</h2>
<p>The grid world is an instance of a more general structure called a <strong>Markov Decision Process</strong> (MDP). An MDP has:</p>
<ul>
<li>a set of states <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BS%7D">,</li>
<li>a set of actions <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BA%7D">,</li>
<li>a reward function <img src="https://latex.codecogs.com/png.latex?R(s,%20a)">: how much reward you get for taking action <img src="https://latex.codecogs.com/png.latex?a"> in state <img src="https://latex.codecogs.com/png.latex?s">,</li>
<li>transition dynamics <img src="https://latex.codecogs.com/png.latex?P(s'%20%5Cmid%20s,%20a)">: what state you end up in next.</li>
</ul>
<p>The basic transition mechanic is straightforward: at each timestep, the agent is in state <img src="https://latex.codecogs.com/png.latex?s">, picks action <img src="https://latex.codecogs.com/png.latex?a">, and the environment responds by giving reward <img src="https://latex.codecogs.com/png.latex?r"> and moving the agent to a new state <img src="https://latex.codecogs.com/png.latex?s'">. We write this as <img src="https://latex.codecogs.com/png.latex?s%20%5Cxrightarrow%7Ba%7D%20s'">.</p>
<p>Concretely in the grid world: if the agent is at <img src="https://latex.codecogs.com/png.latex?(2,%203)"> and chooses “right,” then <img src="https://latex.codecogs.com/png.latex?s%20=%20(2,%203)">, <img src="https://latex.codecogs.com/png.latex?a%20=%20%5Crightarrow">, <img src="https://latex.codecogs.com/png.latex?s'%20=%20(3,%203)">, and <img src="https://latex.codecogs.com/png.latex?r%20=%20-1"> (assuming a per-step fuel cost; more on that below). The next iteration starts from <img src="https://latex.codecogs.com/png.latex?s'%20=%20(3,%203)">, which becomes the new “current state,” and the cycle repeats until the agent hits a terminal square.</p>
<p>The next state <img src="https://latex.codecogs.com/png.latex?s'"> depends on which action <img src="https://latex.codecogs.com/png.latex?a"> was taken. From state <img src="https://latex.codecogs.com/png.latex?(2,%203)">:</p>
<ul>
<li>Action <img src="https://latex.codecogs.com/png.latex?%5Crightarrow"> leads to <img src="https://latex.codecogs.com/png.latex?s'%20=%20(3,%203)"></li>
<li>Action <img src="https://latex.codecogs.com/png.latex?%5Cuparrow"> leads to <img src="https://latex.codecogs.com/png.latex?s'%20=%20(2,%204)"></li>
<li>Action <img src="https://latex.codecogs.com/png.latex?%5Cleftarrow"> leads to <img src="https://latex.codecogs.com/png.latex?s'%20=%20(1,%203)"></li>
<li>Action <img src="https://latex.codecogs.com/png.latex?%5Cdownarrow"> leads to <img src="https://latex.codecogs.com/png.latex?s'%20=%20(2,%202)"></li>
</ul>
<p>In a deterministic environment like this grid world, <img src="https://latex.codecogs.com/png.latex?a"> uniquely determines <img src="https://latex.codecogs.com/png.latex?s'">. In a stochastic environment, taking action <img src="https://latex.codecogs.com/png.latex?a"> in state <img src="https://latex.codecogs.com/png.latex?s"> gives a <em>distribution</em> over possible next states. Think of a robot whose wheels sometimes slip. We’ll come back to this distinction when we write the Bellman equation in its general form.</p>
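<p>The deterministic case can be sketched as a plain function from state and action to next state. The coordinate convention, the grid size, and the choice to clamp moves at the walls are assumptions made for illustration; the notes don't specify what happens at the edge of the grid.</p>

```python
# Assumed convention: states are (x, y) pairs, "up" increases y, "right" increases x.
MOVES = {
    "right": (1, 0),
    "up": (0, 1),
    "left": (-1, 0),
    "down": (0, -1),
}

def next_state(state, action, size=5):
    """Deterministic transition: the action alone decides the next state.

    `size` is a hypothetical grid width and height; moves that would leave
    the grid are clamped so the agent stays on the board.
    """
    x, y = state
    dx, dy = MOVES[action]
    new_x = min(max(x + dx, 0), size - 1)
    new_y = min(max(y + dy, 0), size - 1)
    return (new_x, new_y)

for action in MOVES:
    print(action, next_state((2, 3), action))
```

<p>A stochastic environment would replace the single return value with a distribution over next states, for example by sampling a different move some fraction of the time.</p>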
<section id="where-rewards-come-from" class="level3">
<h3 class="anchored" data-anchor-id="where-rewards-come-from">Where rewards come from</h3>
<p>The reward function <img src="https://latex.codecogs.com/png.latex?R(s,%20a)"> deserves a closer look, because it’s the silent partner in the whole RL story.</p>
<p>The reward function is a <strong>fixed property of the environment</strong>, defined before training begins. It’s not learned. It’s not a parameter. The agent never modifies it. When the agent takes action <img src="https://latex.codecogs.com/png.latex?a"> from state <img src="https://latex.codecogs.com/png.latex?s"> and the environment returns reward <img src="https://latex.codecogs.com/png.latex?r_t">, the agent simply observes that number. It doesn’t see the rule that produced it.</p>
<p>In the grid world, you (the designer) wrote the rules: <img src="https://latex.codecogs.com/png.latex?+5"> at one terminal, <img src="https://latex.codecogs.com/png.latex?+4"> at another, <img src="https://latex.codecogs.com/png.latex?-1"> at the “penalty”, <img src="https://latex.codecogs.com/png.latex?-1"> for every step taken. That table is the reward function. The agent will spend its entire training experiencing that table’s outputs without ever being told the table exists.</p>
<p>This means <strong>defining the reward function is how you specify the task</strong>. Change the reward function, and you change the problem:</p>
<ul>
<li>Reward <img src="https://latex.codecogs.com/png.latex?+5"> at one corner → agent walks to that corner.</li>
<li>Reward <img src="https://latex.codecogs.com/png.latex?+1"> for staying alive each step → agent learns to survive.</li>
<li>Reward <img src="https://latex.codecogs.com/png.latex?-1"> everywhere → agent learns to terminate as fast as possible, possibly by walking into the “penalty”.</li>
</ul>
<p>Same environment, same states, same actions, but completely different behavior, because the reward function is different. This is why reward design is famously tricky: a poorly-specified reward gives you an agent that solves the wrong problem. (The classic example is a boat-racing agent that learned to drive in circles hitting respawning power-ups instead of finishing the race, because that gave more points.)</p>
</section>
</section>
<section id="trajectories-and-returns" class="level2">
<h2 class="anchored" data-anchor-id="trajectories-and-returns">3. Trajectories and returns</h2>
<p>The agent acting in the environment generates a <strong>trajectory</strong>: the full sequence of states, actions, and rewards from the start of an episode to the end:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%20(s_0,%20a_0,%20r_0,%5C;%20s_1,%20a_1,%20r_1,%5C;%20s_2,%20a_2,%20r_2,%5C;%20%5Cldots,%5C;%20s_T)."></p>
<p>The Greek letter <img src="https://latex.codecogs.com/png.latex?%5Ctau"> (“tau”) is the standard symbol for it. A trajectory is also sometimes called an “episode” or a “rollout.” All three terms mean the same thing: one run of the agent through the environment, from start to finish.</p>
<section id="the-return-g_t" class="level3">
<h3 class="anchored" data-anchor-id="the-return-g_t">The return <img src="https://latex.codecogs.com/png.latex?G_t"></h3>
<p>The <strong>return</strong> from time <img src="https://latex.codecogs.com/png.latex?t"> onward is the sum of future rewards:</p>
<p><img src="https://latex.codecogs.com/png.latex?G_t%20=%20r_t%20+%20r_%7Bt+1%7D%20+%20r_%7Bt+2%7D%20+%20%5Ccdots%20+%20r_%7BT-1%7D."></p>
<p>This <img src="https://latex.codecogs.com/png.latex?G_t"> is the most important quantity in REINFORCE. Three things to internalize about it.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?G_t"> is forward-looking.</strong> It only counts what happens <em>from <img src="https://latex.codecogs.com/png.latex?t"> onward</em>. Rewards collected before timestep <img src="https://latex.codecogs.com/png.latex?t"> don’t enter into it. Each timestep in a trajectory has its own <img src="https://latex.codecogs.com/png.latex?G_t">, and earlier timesteps generally have more future ahead of them.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?G_t"> is empirical, not predicted.</strong> It’s a number you measure after running an episode, by summing the rewards you actually got. It’s not a parameter, not an output of any model. Just an observation.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?G_t"> is different from <img src="https://latex.codecogs.com/png.latex?r_t">, and the distinction matters.</strong> <img src="https://latex.codecogs.com/png.latex?r_t"> is the <em>single-step</em> reward at timestep <img src="https://latex.codecogs.com/png.latex?t">: what the agent got right then. <img src="https://latex.codecogs.com/png.latex?G_t"> is the <em>cumulative</em> return from <img src="https://latex.codecogs.com/png.latex?t"> onward: the sum of all rewards from that point to the end of the episode.</p>
<p>To make this concrete, here’s a trajectory in the grid world:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%20%5Cbig((0,0),%20%5Crightarrow,%20-1,%5C;%20(1,0),%20%5Cuparrow,%20-1,%5C;%20(1,1),%20%5Crightarrow,%20-1,%5C;%20(2,1),%20%5Crightarrow,%20+5%5Cbig)."></p>
<p>The rewards are <img src="https://latex.codecogs.com/png.latex?r_0%20=%20-1,%20r_1%20=%20-1,%20r_2%20=%20-1,%20r_3%20=%20+5">. The returns at each timestep:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?G_0%20=%20-1%20+%20(-1)%20+%20(-1)%20+%205%20=%202"></li>
<li><img src="https://latex.codecogs.com/png.latex?G_1%20=%20-1%20+%20(-1)%20+%205%20=%203"></li>
<li><img src="https://latex.codecogs.com/png.latex?G_2%20=%20-1%20+%205%20=%204"></li>
<li><img src="https://latex.codecogs.com/png.latex?G_3%20=%205"></li>
</ul>
<p>Notice: <img src="https://latex.codecogs.com/png.latex?r_0%20=%20-1"> but <img src="https://latex.codecogs.com/png.latex?G_0%20=%202">. They’re not the same number, and they’re not measuring the same thing.</p>
<p>Earlier timesteps had to pay more fuel costs before reaching the <img src="https://latex.codecogs.com/png.latex?+5">, so their returns are smaller. From <img src="https://latex.codecogs.com/png.latex?(2,1)"> at the end, the agent is one step from the reward; from <img src="https://latex.codecogs.com/png.latex?(0,0)"> at the start, it had to walk all the way there.</p>
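<p>The same numbers fall out of a few lines of code. This sketch computes every return in one backward pass, using the fact that each return is the current reward plus the return that follows it. The function name <code>returns_from_rewards</code> is an illustrative choice.</p>

```python
def returns_from_rewards(rewards):
    """Undiscounted return G_t for every timestep: the sum of rewards from t onward."""
    returns = []
    running = 0
    for reward in reversed(rewards):  # walk backward so each G_t reuses G_{t+1}
        running += reward
        returns.append(running)
    returns.reverse()
    return returns

rewards = [-1, -1, -1, 5]  # r_0 .. r_3 from the worked trajectory
print(returns_from_rewards(rewards))  # [2, 3, 4, 5]
```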
<div id="fig-grid-world" class="quarto-float quarto-figure quarto-figure-center anchored" alt="A 4 by 4 grid world. The agent starts at the bottom-left cell and moves right, then up, then right, then right into a green plus-five goal cell, which is terminal. A red minus-one penalty cell and a green plus-four terminal cell sit in the top row, off the path. Each of the first three moves earns a reward of minus one (a fuel cost); the final move into the goal earns plus five. The return shown at each visited cell is 2, 3, 4, then 5: returns are larger for steps closer to the goal, because earlier steps pay more fuel before reaching the plus-five.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-grid-world-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-foundations/grid_world_trajectory.png" class="img-fluid figure-img" alt="A 4 by 4 grid world. The agent starts at the bottom-left cell and moves right, then up, then right, then right into a green plus-five goal cell, which is terminal. A red minus-one penalty cell and a green plus-four terminal cell sit in the top row, off the path. Each of the first three moves earns a reward of minus one (a fuel cost); the final move into the goal earns plus five. The return shown at each visited cell is 2, 3, 4, then 5: returns are larger for steps closer to the goal, because earlier steps pay more fuel before reaching the plus-five.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-grid-world-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The worked trajectory τ on the 4×4 grid world. Every move costs <img src="https://latex.codecogs.com/png.latex?r=-1"> (fuel) until the <img src="https://latex.codecogs.com/png.latex?+5"> terminal; the return <img src="https://latex.codecogs.com/png.latex?G_t"> sums rewards from step <img src="https://latex.codecogs.com/png.latex?t"> onward, so steps nearer the goal carry larger returns (<img src="https://latex.codecogs.com/png.latex?G_0=2"> rising to <img src="https://latex.codecogs.com/png.latex?G_3=5">).
</figcaption>
</figure>
</div>
</section>
<section id="why-g_t-and-not-r_t-is-what-matters" class="level3">
<h3 class="anchored" data-anchor-id="why-g_t-and-not-r_t-is-what-matters">Why <img src="https://latex.codecogs.com/png.latex?G_t"> and not <img src="https://latex.codecogs.com/png.latex?r_t"> is what matters</h3>
<p>Anticipating where REINFORCE is going: the gradient updates will weight each action by a return, not by an immediate reward. The reason is that the immediate reward <img src="https://latex.codecogs.com/png.latex?r_t"> doesn’t tell you whether the action you took was <em>good</em> in any meaningful sense. In the grid world, every step has reward <img src="https://latex.codecogs.com/png.latex?-1"> regardless of which direction you moved. The single-step reward gives you no information about whether you moved <em>toward</em> the <img src="https://latex.codecogs.com/png.latex?+5"> or <em>away</em> from it.</p>
<p>What does carry that information? The total reward you collected after taking the action. If you took action <img src="https://latex.codecogs.com/png.latex?a_t"> and ultimately got to the <img src="https://latex.codecogs.com/png.latex?+5">, then <img src="https://latex.codecogs.com/png.latex?a_t"> was probably part of a good plan. If you took action <img src="https://latex.codecogs.com/png.latex?a_t"> and ended up at the “penalty”, then <img src="https://latex.codecogs.com/png.latex?a_t"> was probably a bad choice.</p>
<p><img src="https://latex.codecogs.com/png.latex?G_t"> captures this. It’s the agent’s verdict on “how did things go after I took action <img src="https://latex.codecogs.com/png.latex?a_t">?”, which is exactly the right signal for deciding whether to make <img src="https://latex.codecogs.com/png.latex?a_t"> more or less likely in the future. We’ll come back to this when we form the objective.</p>
</section>
<section id="the-credit-assignment-problem" class="level3">
<h3 class="anchored" data-anchor-id="the-credit-assignment-problem">The credit assignment problem</h3>
<p>There’s a deeper problem hiding behind this design choice, and it has a name worth knowing.</p>
<p>Suppose the agent finally gets a <img src="https://latex.codecogs.com/png.latex?+5"> reward at step 20. <em>Which of the 20 actions deserves credit?</em> The action right before the reward? The one 15 steps earlier that put the agent on the right path? All of them? Some combination?</p>
<p>This is called the <strong>credit assignment problem</strong>, and it’s the central difficulty of reinforcement learning. In supervised learning, every input has its own label, so credit is unambiguous: image <img src="https://latex.codecogs.com/png.latex?x_i"> goes with label <img src="https://latex.codecogs.com/png.latex?y_i">. In RL, rewards are often sparse and delayed: you take many actions before learning whether any of them were good, and there’s no per-step label telling you which actions worked.</p>
<p><img src="https://latex.codecogs.com/png.latex?G_t">’s answer is brutally simple: assign every action the credit of <em>everything that came after it</em>. The action at step 15 gets credit for the <img src="https://latex.codecogs.com/png.latex?+5"> reward at step 20, because that reward is part of <img src="https://latex.codecogs.com/png.latex?G_%7B15%7D">. So does the action at step 1, because <img src="https://latex.codecogs.com/png.latex?G_1"> also includes that reward. This is unfair on a per-action basis because some of those early actions probably didn’t matter, but it averages out across many trajectories. Actions that <em>consistently</em> precede good outcomes will accumulate positive updates over many rollouts; actions that don’t, won’t. The noise washes out; the signal accumulates.</p>
<p>This is why REINFORCE needs lots of samples to work. Each individual trajectory gives you a noisy, unfair credit assignment. Only across many trajectories does the right policy emerge.</p>
</section>
</section>
<section id="values-and-the-bellman-equation" class="level2">
<h2 class="anchored" data-anchor-id="values-and-the-bellman-equation">4. Values and the Bellman equation</h2>
<p>Before talking about how to <em>learn</em> a policy, it helps to ask what makes a state “good.” A natural answer: a state is good if you can collect a lot of reward starting from it. Define the <strong>value</strong> of a state <img src="https://latex.codecogs.com/png.latex?s"> as the return you’d expect if you played optimally from there.</p>
<p>In the simplest version of the grid world, where only terminal squares give reward, every non-terminal state satisfies:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20V(s'_a),"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?s'_a"> is the state reached by taking action <img src="https://latex.codecogs.com/png.latex?a"> from <img src="https://latex.codecogs.com/png.latex?s">. The value of a state is the best value among its neighbors. Notice the action-dependence: <img src="https://latex.codecogs.com/png.latex?s'"> depends on which <img src="https://latex.codecogs.com/png.latex?a"> you take. Writing it as <img src="https://latex.codecogs.com/png.latex?s'_a"> (or sometimes <img src="https://latex.codecogs.com/png.latex?s'(s,%20a)">) makes that explicit and avoids the slightly sloppy <img src="https://latex.codecogs.com/png.latex?%5Cmax_a%20V(s')"> notation that hides the dependence.</p>
<p>This is the <strong>Bellman equation</strong> in its simplest form, and it lets you propagate values from the terminal squares throughout the grid.</p>
<p>There’s a wrinkle: with this rule alone, the agent has no incentive to hurry. Wandering for an arbitrarily long time before reaching the <img src="https://latex.codecogs.com/png.latex?+5"> is as good as walking straight there. To fix this, add a small negative per-step reward <img src="https://latex.codecogs.com/png.latex?r%20=%20-1">, a “fuel cost”:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20%5Cbig%5Br%20+%20V(s'_a)%5Cbig%5D."></p>
<p>Now the agent prefers shorter paths.</p>
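<p>Here is a rough sketch of how values propagate out from the terminals under the step-cost rule. The layout is hypothetical: only the <code>+5</code> cell at <code>(3, 1)</code> is taken from the worked trajectory, while the positions of the other two terminals are guesses. The reward convention follows that trajectory: a move into a terminal pays the terminal's reward, every other move pays the fuel cost, and terminals themselves have value zero because nothing follows them.</p>

```python
# Hypothetical 4x4 layout: only the +5 cell at (3, 1) comes from the worked
# trajectory; the positions of the +4 and -1 terminals are assumptions.
SIZE = 4
TERMINALS = {(3, 1): 5.0, (3, 3): 4.0, (1, 3): -1.0}
MOVES = [(1, 0), (0, 1), (-1, 0), (0, -1)]
STEP_COST = -1.0

def step(state, move):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    x = min(max(state[0] + move[0], 0), SIZE - 1)
    y = min(max(state[1] + move[1], 0), SIZE - 1)
    nxt = (x, y)
    # Entering a terminal pays its reward; any other move pays the fuel cost.
    reward = TERMINALS.get(nxt, STEP_COST)
    return nxt, reward

def value_iteration(sweeps=50):
    """Repeatedly apply V(s) = max_a [r + V(s'_a)] until values settle."""
    states = [(x, y) for x in range(SIZE) for y in range(SIZE)]
    V = {s: 0.0 for s in states}  # terminals stay at 0: nothing follows them
    for _ in range(sweeps):
        for s in states:
            if s in TERMINALS:
                continue
            V[s] = max(r + V[nxt] for nxt, r in (step(s, m) for m in MOVES))
    return V

V = value_iteration()
for y in reversed(range(SIZE)):  # print the top row first
    print(" ".join(f"{V[(x, y)]:5.1f}" for x in range(SIZE)))
```

<p>Along the worked path, the values at <code>(0, 0)</code>, <code>(1, 0)</code>, <code>(1, 1)</code>, and <code>(2, 1)</code> match the returns 2, 3, 4, and 5, because that path happens to be a shortest route to the best terminal in this layout.</p>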
<p>A second refinement is the <strong>discount factor</strong> <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Cin%20(0,%201%5D">, which says future rewards matter less than present ones, for the same reason a dollar today is worth more than a dollar next year. With both the step cost and the discount:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20%5Cbig%5Br%20+%20%5Cgamma%20V(s'_a)%5Cbig%5D."></p>
<p>You might wonder why we need <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> when we already have the step cost <img src="https://latex.codecogs.com/png.latex?r">. They look redundant: both make the agent prefer short paths in the grid world. But they’re solving different problems:</p>
<ul>
<li>The <strong>step cost</strong> makes long paths <em>expensive in absolute terms</em>. Every fuel-paying step subtracts 1 from the total return. A path that pays 3 fuel steps before the final move into <img src="https://latex.codecogs.com/png.latex?+5"> nets <img src="https://latex.codecogs.com/png.latex?5%20-%203%20=%202">, as in the worked trajectory above; one that pays 10 nets <img src="https://latex.codecogs.com/png.latex?5%20-%2010%20=%20-5">.</li>
<li>The <strong>discount factor</strong> makes future rewards <em>worth less than present ones</em>. A reward of <img src="https://latex.codecogs.com/png.latex?5"> received in 3 steps is worth <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5E3%20%5Ccdot%205"> today; received in 10 steps, it’s worth <img src="https://latex.codecogs.com/png.latex?%5Cgamma%5E%7B10%7D%20%5Ccdot%205">.</li>
</ul>
<p>The practical reasons <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> does work the step cost can’t:</p>
<ol type="1">
<li><p><strong><img src="https://latex.codecogs.com/png.latex?%5Cgamma"> encodes uncertainty about the future.</strong> A reward you might get in 100 steps is less trustworthy than one you’ll get in 2 steps. The world might change, the model might be wrong. <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> captures this: distant predictions get less weight because they’re less reliable. A step cost just charges you for moving; it doesn’t model uncertainty.</p></li>
<li><p><strong><img src="https://latex.codecogs.com/png.latex?%5Cgamma"> doesn’t require knowing the right magnitude.</strong> A step cost only works if you tune it. Set <img src="https://latex.codecogs.com/png.latex?r%20=%20-0.01"> in a grid where rewards are <img src="https://latex.codecogs.com/png.latex?%5Cpm%205"> and the agent barely cares about path length; set <img src="https://latex.codecogs.com/png.latex?r%20=%20-10"> and the agent refuses to move. <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is scale-free in this sense: <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%200.99"> creates a roughly-100-step planning horizon regardless of reward scale.</p></li>
<li><p><strong>They compose differently with positive rewards along the way.</strong> In a “stay alive” task where each surviving step gives <img src="https://latex.codecogs.com/png.latex?+1">, a <em>negative</em> step cost would subtract from a signal that’s supposed to be additive. <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> still works there: it just makes the agent prefer reward sooner rather than later.</p></li>
</ol>
<p>So: the step cost is a reward-design choice (you, specifying the task, decide moving is bad), while <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is a property of the agent’s planning horizon (the agent decides distant futures matter less). They live at different levels.</p>
<p>Once <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is in the picture, the return is also typically written with discounting:</p>
<p><img src="https://latex.codecogs.com/png.latex?G_t%20=%20r_t%20+%20%5Cgamma%20r_%7Bt+1%7D%20+%20%5Cgamma%5E2%20r_%7Bt+2%7D%20+%20%5Ccdots%20=%20%5Csum_%7Bk=0%7D%5E%7B%5Cinfty%7D%20%5Cgamma%5Ek%20r_%7Bt+k%7D."></p>
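<p>This is the version that shows up in most modern RL writing. In an episodic task, every reward after the terminal step counts as zero, so the infinite sum stops where the episode does. The undiscounted form from earlier is the special case <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%201">.</p>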
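<p>The discounted return can be computed with the same backward pass as the undiscounted one, multiplying the running total by the discount at each step. The sketch below reuses the worked trajectory's rewards. The value 0.9 is an arbitrary illustrative discount, and the last loop prints the common rule of thumb that the discount weights sum to one over one minus the discount, which is where the “roughly 100 steps” figure for 0.99 comes from.</p>

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed backward over a finished episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

rewards = [-1, -1, -1, 5]  # the worked grid-world trajectory

print(discounted_returns(rewards, gamma=1.0))   # undiscounted special case
print(discounted_returns(rewards, gamma=0.9))

# Rule of thumb: weights gamma**k sum to 1 / (1 - gamma), the effective horizon.
for gamma in (0.9, 0.99):
    print(gamma, round(1.0 / (1.0 - gamma)))
```

<p>With a discount below one, the return at the start of the path drops faster than the return near the goal, because the <code>+5</code> is discounted three times from the first step and not at all from the last.</p>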
</section>
<section id="the-general-bellman-equation-and-expectations" class="level2">
<h2 class="anchored" data-anchor-id="the-general-bellman-equation-and-expectations">5. The general Bellman equation and expectations</h2>
<p>In a stochastic environment, taking action <img src="https://latex.codecogs.com/png.latex?a"> in state <img src="https://latex.codecogs.com/png.latex?s"> doesn’t deterministically land you in <img src="https://latex.codecogs.com/png.latex?s'">: it gives a <em>distribution</em> over next states. The fully general Bellman optimality equation handles this with an expectation:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20%5Cmathbb%7BE%7D%5Cbig%5Br%20+%20%5Cgamma%20V(s')%20%5C,%5Cbig%7C%5C,%20s,%20a%5Cbig%5D,"></p>
<p>or written out as a sum:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20%5Csum_%7Bs'%7D%20P(s'%20%5Cmid%20s,%20a)%5Cbig%5BR(s,%20a,%20s')%20+%20%5Cgamma%20V(s')%5Cbig%5D."></p>
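<p>The two forms are the same equation. (Writing the reward as <img src="https://latex.codecogs.com/png.latex?R(s,%20a,%20s')"> lets it depend on the landing state as well; when it doesn’t, it reduces to the <img src="https://latex.codecogs.com/png.latex?R(s,%20a)"> from the MDP definition.) Going from the first to the second is just unpacking the expectation, so it’s worth being clear about what an expectation is.</p>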
<p>The expected value of a random variable <img src="https://latex.codecogs.com/png.latex?X"> is the weighted average of its possible values, where the weights are the probabilities:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5BX%5D%20=%20%5Csum_x%20P(X%20=%20x)%20%5Ccdot%20x."></p>
<p>That’s it. Roll a fair six-sided die: outcomes <img src="https://latex.codecogs.com/png.latex?1,%202,%203,%204,%205,%206"> each with probability <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B6%7D">, and <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5BX%5D%20=%203.5">. List every possible outcome, multiply by how likely it is, add up.</p>
<p>A <strong>conditional</strong> expectation <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5BX%20%5Cmid%20Y%20=%20y%5D"> is the same weighted average, but using the conditional probabilities <img src="https://latex.codecogs.com/png.latex?P(X%20=%20x%20%5Cmid%20Y%20=%20y)"> instead: “given that I know <img src="https://latex.codecogs.com/png.latex?Y%20=%20y">, what’s the expected value of <img src="https://latex.codecogs.com/png.latex?X">?”</p>
<p>Apply this to the Bellman expectation: the random thing is <img src="https://latex.codecogs.com/png.latex?s'"> (taking action <img src="https://latex.codecogs.com/png.latex?a"> in state <img src="https://latex.codecogs.com/png.latex?s"> gives a random next state), and the quantity being averaged is <img src="https://latex.codecogs.com/png.latex?r%20+%20%5Cgamma%20V(s')">. List all possible next states <img src="https://latex.codecogs.com/png.latex?s'">, weight each by <img src="https://latex.codecogs.com/png.latex?P(s'%20%5Cmid%20s,%20a)">, and sum:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D%5Cbig%5Br%20+%20%5Cgamma%20V(s')%20%5C,%5Cbig%7C%5C,%20s,%20a%5Cbig%5D%20=%20%5Csum_%7Bs'%7D%20P(s'%20%5Cmid%20s,%20a)%5Cbig%5BR(s,%20a,%20s')%20+%20%5Cgamma%20V(s')%5Cbig%5D."></p>
<p>Concrete example. Suppose the agent is in state <img src="https://latex.codecogs.com/png.latex?s">, takes action <img src="https://latex.codecogs.com/png.latex?a%20=%20%5Crightarrow">, but the environment is noisy:</p>
<ul>
<li>80% chance you actually go right, landing in <img src="https://latex.codecogs.com/png.latex?s'_1"> with reward <img src="https://latex.codecogs.com/png.latex?-1"></li>
<li>20% chance you slip up, landing in <img src="https://latex.codecogs.com/png.latex?s'_2"> with reward <img src="https://latex.codecogs.com/png.latex?-1"></li>
</ul>
<p>Then the expected value inside the Bellman equation is</p>
<p><img src="https://latex.codecogs.com/png.latex?0.8%20%5Ccdot%20%5Cbig%5B-1%20+%20%5Cgamma%20V(s'_1)%5Cbig%5D%20+%200.2%20%5Ccdot%20%5Cbig%5B-1%20+%20%5Cgamma%20V(s'_2)%5Cbig%5D."></p>
<p>Each possible outcome contributes its value, weighted by how likely it is.</p>
<p>In the deterministic case, <img src="https://latex.codecogs.com/png.latex?P(s'%20%5Cmid%20s,%20a)%20=%201"> for one specific <img src="https://latex.codecogs.com/png.latex?s'"> and <img src="https://latex.codecogs.com/png.latex?0"> for everything else. The sum collapses to a single term <img src="https://latex.codecogs.com/png.latex?r%20+%20%5Cgamma%20V(s')">, and the expectation disappears. That’s why the simpler form <img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20%5Br%20+%20%5Cgamma%20V(s'_a)%5D"> is fine for the grid world.</p>
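<p>The noisy-right example can be sketched in a few lines. The 80/20 split and the reward of -1 come from the example above; the discount factor and the two next-state values are made-up numbers chosen only so the arithmetic is concrete.</p>

```python
def expected_backup(outcomes, gamma):
    """Sum of P(s'|s,a) * [R(s,a,s') + gamma * V(s')] over the listed next states."""
    return sum(prob * (reward + gamma * v_next) for prob, reward, v_next in outcomes)


gamma = 0.9  # assumed discount factor

# Each tuple is (probability, reward, V(next state)).
# The V values 10.0 and 4.0 are hypothetical.
noisy_right = [(0.8, -1.0, 10.0), (0.2, -1.0, 4.0)]
deterministic_right = [(1.0, -1.0, 10.0)]

noisy_value = expected_backup(noisy_right, gamma)
deterministic_value = expected_backup(deterministic_right, gamma)

print(noisy_value, deterministic_value)
```

<p>With a single outcome of probability 1, the sum has one term and the function reduces to the deterministic backup.</p>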
</section>
<section id="v-vs-q-two-flavors-of-value" class="level2">
<h2 class="anchored" data-anchor-id="v-vs-q-two-flavors-of-value">6. <img src="https://latex.codecogs.com/png.latex?V"> vs <img src="https://latex.codecogs.com/png.latex?Q">: two flavors of value</h2>
<p>So far we’ve only talked about <img src="https://latex.codecogs.com/png.latex?V">, the value of a <em>state</em>. There’s a closely related object <img src="https://latex.codecogs.com/png.latex?Q">, the value of a <em>state-action pair</em>. Both come up constantly in RL, and they answer slightly different questions.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?V(s)">: the value of a state.</strong> Answers: “How good is it to be in state <img src="https://latex.codecogs.com/png.latex?s">?” Specifically, the expected return starting from <img src="https://latex.codecogs.com/png.latex?s"> and following the policy from there:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmathbb%7BE%7D%5Cbig%5BG_t%20%5C,%5Cbig%7C%5C,%20s_t%20=%20s%5Cbig%5D."></p>
<p>One number per state.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?Q(s,%20a)">: the value of a state-action pair.</strong> Answers: “How good is it to take action <img src="https://latex.codecogs.com/png.latex?a"> from state <img src="https://latex.codecogs.com/png.latex?s">?” The expected return if you take <img src="https://latex.codecogs.com/png.latex?a"> in state <img src="https://latex.codecogs.com/png.latex?s"> and <em>then</em> follow the policy:</p>
<p><img src="https://latex.codecogs.com/png.latex?Q(s,%20a)%20=%20%5Cmathbb%7BE%7D%5Cbig%5BG_t%20%5C,%5Cbig%7C%5C,%20s_t%20=%20s,%20a_t%20=%20a%5Cbig%5D."></p>
<p>One number per state-action pair. With 4 actions, <img src="https://latex.codecogs.com/png.latex?Q((2,%203),%20%5Ccdot)"> is four numbers, one for each action you might take from <img src="https://latex.codecogs.com/png.latex?(2,%203)">.</p>
<p>The two are related. <img src="https://latex.codecogs.com/png.latex?V"> is what you get when you average <img src="https://latex.codecogs.com/png.latex?Q"> over the actions the policy would take:</p>
<p><img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Csum_a%20%5Cpi(a%20%5Cmid%20s)%20%5C,%20Q(s,%20a)."></p>
<p>If the policy is greedy (always picks the best action), this collapses to <img src="https://latex.codecogs.com/png.latex?V(s)%20=%20%5Cmax_a%20Q(s,%20a)">. So <img src="https://latex.codecogs.com/png.latex?V"> is the summary, <img src="https://latex.codecogs.com/png.latex?Q"> is the breakdown.</p>
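<p>A small numeric sketch of that relationship, using hypothetical <code>Q</code> values for one state and an assumed four-action policy:</p>

```python
# Hypothetical Q(s, a) values for a single state s.
q_values = {"right": 4.0, "up": 2.0, "left": 0.0, "down": -2.0}

# Assumed stochastic policy pi(a | s); the probabilities sum to 1.
policy = {"right": 0.4, "up": 0.3, "left": 0.2, "down": 0.1}

# V(s) under the stochastic policy: probability-weighted average of Q.
v_under_policy = sum(policy[action] * q_values[action] for action in q_values)

# V(s) under a greedy policy: all probability mass on the best action.
v_greedy = max(q_values.values())

print(v_under_policy, v_greedy)
```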
<p>Why does this matter? Because <img src="https://latex.codecogs.com/png.latex?Q"> is more useful for <em>acting</em>. Suppose you have a value function and you need to choose an action.</p>
<ul>
<li><strong>With <img src="https://latex.codecogs.com/png.latex?V">:</strong> you’d need to look ahead one step for each candidate action, see what state you’d land in, and check the value there. This requires knowing the environment dynamics.</li>
<li><strong>With <img src="https://latex.codecogs.com/png.latex?Q">:</strong> you read off the values directly and pick the action with the highest one. No model of the environment needed.</li>
</ul>
<p>This is why value-based deep RL methods, which act directly from a learned value function, almost always learn <img src="https://latex.codecogs.com/png.latex?Q">, not <img src="https://latex.codecogs.com/png.latex?V">. <strong>DQN</strong> (Deep Q-Network, the famous 2013 Atari paper) learns <img src="https://latex.codecogs.com/png.latex?Q"> directly: the network takes a state and outputs one number per action; you act greedily by picking the argmax. You don’t need to know how the Atari game works internally; the <img src="https://latex.codecogs.com/png.latex?Q">-values tell you which button to press.</p>
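<p>To see the difference in what each approach needs, here is a minimal sketch. Everything in it is an assumption made for illustration: the toy <code>step_model</code>, the <code>MOVES</code> table, and all of the numbers. The point is only that acting from <code>V</code> requires a model of the dynamics, while acting from <code>Q</code> is a lookup.</p>

```python
MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}


def step_model(state, action):
    """Hypothetical known dynamics: deterministic move with a step cost of -1."""
    dx, dy = MOVES[action]
    return -1.0, (state[0] + dx, state[1] + dy)


def greedy_from_v(state, v, gamma):
    """Acting from V needs a one-step lookahead through the model."""
    def lookahead(action):
        reward, next_state = step_model(state, action)
        return reward + gamma * v[next_state]
    return max(MOVES, key=lookahead)


def greedy_from_q(q_row):
    """Acting from Q is just an argmax over stored numbers."""
    return max(q_row, key=q_row.get)


# Made-up values of the four neighbours of (0, 0).
v = {(1, 0): 3.0, (-1, 0): 1.0, (0, 1): 5.0, (0, -1): 0.0}
# Made-up Q values for state (0, 0), consistent with v and gamma = 0.9.
q_row = {"right": 1.7, "left": -0.1, "up": 3.5, "down": -1.0}

action_from_v = greedy_from_v((0, 0), v, gamma=0.9)
action_from_q = greedy_from_q(q_row)
print(action_from_v, action_from_q)
```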
</section>
<section id="why-neural-networks" class="level2">
<h2 class="anchored" data-anchor-id="why-neural-networks">7. Why neural networks?</h2>
<p>Iterating the Bellman update across a 16-square grid is fine. Iterating across the state space of Go, which has more legal positions than there are atoms in the observable universe, is not. You can’t store one number per state, let alone visit each state repeatedly to update it.</p>
<p>The fix is to use a neural network as a <strong>function approximator</strong> in place of a giant lookup table. The network takes a state as input and outputs either:</p>
<ul>
<li>a <strong>value estimate</strong> (<img src="https://latex.codecogs.com/png.latex?V"> or <img src="https://latex.codecogs.com/png.latex?Q">): “from this state, you can expect roughly this much return,” or</li>
<li>a <strong>policy</strong>: “from this state, here are the probabilities of each action.”</li>
</ul>
<p>These two choices give the two main families of deep RL.</p>
<p>A <strong>value network</strong> approximates <img src="https://latex.codecogs.com/png.latex?V"> or <img src="https://latex.codecogs.com/png.latex?Q">. It’s trained by enforcing the Bellman equation: the network’s prediction at <img src="https://latex.codecogs.com/png.latex?s"> should match <img src="https://latex.codecogs.com/png.latex?r%20+%20%5Cgamma"> times its prediction at the next state <img src="https://latex.codecogs.com/png.latex?s'">. Once trained, you act greedily: pick the action with the highest predicted value. DQN is the canonical example.</p>
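<p>A tabular stand-in can make "enforcing the Bellman equation" concrete. This is a sketch, not DQN itself: it omits the network, replay buffer, and target network, and the names <code>td_target</code> and <code>td_update</code> as well as the numbers are assumptions for illustration.</p>

```python
def td_target(reward, gamma, next_q_values, done):
    """Bellman target: r + gamma * max_a' Q(s', a'), with no bootstrap at episode end."""
    bootstrap = 0.0 if done else max(next_q_values)
    return reward + gamma * bootstrap


def td_update(current_q, target, step_size):
    """Move the current estimate a small step toward the target."""
    return current_q + step_size * (target - current_q)


# Hypothetical numbers for one transition (s, a, r, s').
next_q_values = [1.0, 3.0, 2.0, 0.0]
target = td_target(reward=-1.0, gamma=0.9, next_q_values=next_q_values, done=False)
new_q = td_update(current_q=0.5, target=target, step_size=0.1)

print(target, new_q)
```

<p>A value network plays the same game with a loss on the gap between prediction and target, rather than a direct table update.</p>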
<p>A <strong>policy network</strong> approximates the policy directly. It takes a state and outputs probabilities over actions. There’s no Bellman equation, no value estimation. You train the network so good actions become more likely and bad actions become less likely. This is the world REINFORCE lives in.</p>
<p>Why use one over the other? Policy networks have several advantages worth knowing:</p>
<ul>
<li>They naturally output <strong>stochastic policies</strong> (next section), which matters for exploration.</li>
<li>They handle <strong>continuous action spaces</strong> gracefully: a value network would need to argmax over an infinite set.</li>
<li>They’re often easier to train end-to-end with backpropagation, which is the whole point of writing <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> as something you can take a gradient of.</li>
</ul>
<p>Value networks tend to be more sample-efficient when they work, but harder to stabilize. The two approaches aren’t mutually exclusive: <a href="../../../../posts/series/how-llms-learn-to-reason/01-ppo/#ppo-value-network-critic">actor-critic</a> methods use both.</p>
</section>
<section id="stochastic-policies-and-exploreexploit" class="level2">
<h2 class="anchored" data-anchor-id="stochastic-policies-and-exploreexploit">8. Stochastic policies and explore/exploit</h2>
<p>A <strong>deterministic</strong> policy outputs the single best action: “from <img src="https://latex.codecogs.com/png.latex?(2,%203)">, always go right.” A <strong>stochastic</strong> policy outputs a distribution: “from <img src="https://latex.codecogs.com/png.latex?(2,%203)">, go right with probability <img src="https://latex.codecogs.com/png.latex?0.8">, up with <img src="https://latex.codecogs.com/png.latex?0.1">, down with <img src="https://latex.codecogs.com/png.latex?0.05">, left with <img src="https://latex.codecogs.com/png.latex?0.05">.”</p>
<p>Why prefer the stochastic version? Imagine the agent has found the <img src="https://latex.codecogs.com/png.latex?+4"> reward and learned to walk to it. With a deterministic policy, it will never deviate and will never discover the <img src="https://latex.codecogs.com/png.latex?+5"> reward sitting in another corner. Randomness is what lets the agent <strong>explore</strong>. A stochastic policy mostly <strong>exploits</strong> what it knows but occasionally tries something else, which is how it keeps improving.</p>
<p>This is also why the policy network outputs a <strong>softmax</strong> over actions rather than a single chosen action. The softmax gives a differentiable, naturally stochastic parameterization, which is exactly what the gradient mechanics in Part II need.</p>
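<p>For reference, a softmax can be sketched in plain Python as follows. The logits are hypothetical; in a real policy network they would be the outputs of the final layer.</p>

```python
import math


def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0, 0.0]  # hypothetical scores for right, up, down, left
probs = softmax(logits)
print(probs)
```

<p>Every action keeps a nonzero probability, so the policy never completely stops exploring, and small changes in the logits produce small changes in the probabilities.</p>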
</section>
<section id="the-policy-notation-pi_theta" class="level2">
<h2 class="anchored" data-anchor-id="the-policy-notation-pi_theta">9. The policy notation <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"></h2>
<p>We’ve referenced “the policy” repeatedly. It’s now time to give it a proper symbol.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"></strong> is the <strong>policy</strong>: the neural network that, given a state, outputs a probability distribution over actions. The Greek letter <img src="https://latex.codecogs.com/png.latex?%5Cpi"> is the standard symbol for a policy. The subscript <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> reminds you that the policy is parameterized by the network’s weights <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. Different weights produce different policies.</p>
<p>Three notations all show up in practice and mean closely related things:</p>
<ul>
<li><strong><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"></strong> on its own refers to the whole policy as an object: the full mapping from states to action distributions.</li>
<li><strong><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Ccdot%20%5Cmid%20s)"></strong> is the distribution over actions when the agent is in state <img src="https://latex.codecogs.com/png.latex?s">: for a 4-action grid world, that’s 4 numbers summing to 1.</li>
<li><strong><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a%20%5Cmid%20s)"></strong> is a single number: “the probability that policy <img src="https://latex.codecogs.com/png.latex?%5Cpi"> (with weights <img src="https://latex.codecogs.com/png.latex?%5Ctheta">) assigns to action <img src="https://latex.codecogs.com/png.latex?a"> when in state <img src="https://latex.codecogs.com/png.latex?s">.”</li>
</ul>
<p>Concretely, for the grid world: the network takes a state like <img src="https://latex.codecogs.com/png.latex?(2,%203)"> as input, runs it through some layers, ends with a softmax, and outputs <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Crightarrow%20%5Cmid%20(2,%203))%20=%200.4">, <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Cuparrow%20%5Cmid%20(2,%203))%20=%200.3">, <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Cleftarrow%20%5Cmid%20(2,%203))%20=%200.2">, <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Cdownarrow%20%5Cmid%20(2,%203))%20=%200.1">. Four probabilities summing to 1. Sample one of them to get the action the agent actually takes.</p>
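<p>The last step, sampling an action from those four probabilities, might look like this. The probabilities are the ones quoted above; the helper name <code>sample_action</code> and the fixed seed are assumptions made so the sketch is reproducible.</p>

```python
import random

# pi_theta(. | (2, 3)) from the text: four probabilities summing to 1.
policy_at_state = {"right": 0.4, "up": 0.3, "left": 0.2, "down": 0.1}


def sample_action(distribution, rng):
    """Draw one action according to the given probabilities."""
    actions = list(distribution)
    weights = [distribution[action] for action in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded only for reproducibility
draws = [sample_action(policy_at_state, rng) for _ in range(10_000)]

# Empirical frequencies should sit close to the policy probabilities.
frequencies = {a: draws.count(a) / len(draws) for a in policy_at_state}
print(frequencies)
```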
<section id="what-the-policy-actually-controls" class="level3">
<h3 class="anchored" data-anchor-id="what-the-policy-actually-controls">What the policy actually controls</h3>
<p>It’s worth being precise about what the policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> is for. The policy controls exactly one thing: <strong>the probability of choosing each action in each state</strong>. That’s it. Given a state, it produces a distribution. The agent samples from that distribution.</p>
<p>What the policy does <em>not</em> control:</p>
<ul>
<li><strong>The reward.</strong> The environment decides what reward to give. The policy can’t change this.</li>
<li><strong>The next state.</strong> The transition dynamics <img src="https://latex.codecogs.com/png.latex?P(s'%20%5Cmid%20s,%20a)"> are a property of the environment. If the agent walks right and the floor is icy, the agent slides.</li>
<li><strong>Which states are reachable.</strong> The structure of the world is given. The policy only influences which reachable states the agent <em>tends to visit</em>.</li>
</ul>
<p>The causal chain looks like this:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%5C;%5Clongrightarrow%5C;%20%5Cpi_%5Ctheta(%5Ccdot%20%5Cmid%20s)%20%5C;%5Clongrightarrow%5C;%20%5Ctext%7Baction%20sampled%7D%20%5C;%5Clongrightarrow%5C;%20%5Ctext%7Benvironment%20responds%7D%20%5C;%5Clongrightarrow%5C;%20%5Ctext%7Breward%20+%20next%20state%7D."></p>
<p>The policy controls the second arrow. Everything downstream is the environment doing its thing. But by setting up that arrow well, i.e.&nbsp;assigning high probability to good actions in each state, the policy biases what happens in the rest of the chain toward outcomes we want.</p>
<p>This is also why the gradient mechanics in Part II are expressed in terms of <img src="https://latex.codecogs.com/png.latex?%5Cnabla_%5Ctheta%20%5Clog%20%5Cpi_%5Ctheta(a%20%5Cmid%20s)">. The only thing the agent can adjust is action probabilities. The only thing the gradient can affect is which actions get sampled.</p>
<p>A useful contrast to hold in mind: a <strong>value function</strong> (<img src="https://latex.codecogs.com/png.latex?V"> or <img src="https://latex.codecogs.com/png.latex?Q">) tells you <em>how good</em> states are. It doesn’t tell you what to do. A <strong>policy</strong> tells you <em>what to do</em>. It doesn’t predict return. REINFORCE trains the policy directly, without ever learning a value function: the trajectories themselves provide the feedback signal.</p>
</section>
</section>
<section id="the-objective-from-a-single-trajectory" class="level2">
<h2 class="anchored" data-anchor-id="the-objective-from-a-single-trajectory">10. The objective from a single trajectory</h2>
<p>We can finally write down what REINFORCE is trying to do.</p>
<p>Run the agent. It produces a trajectory <img src="https://latex.codecogs.com/png.latex?%5Ctau">. For each timestep <img src="https://latex.codecogs.com/png.latex?t"> in that trajectory, you can compute two things:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?G_t">: the return that followed (you get this by summing rewards from <img src="https://latex.codecogs.com/png.latex?t"> onward).</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">: the probability the policy assigned to the action that was actually taken.</li>
</ul>
<p>The objective to maximize is:</p>
<p><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"></p>
<p>Notice the structure. Each term in the sum has two factors: the return <img src="https://latex.codecogs.com/png.latex?G_t">, and the log-probability of the action that produced it.</p>
<section id="how-to-read-this-sum" class="level3">
<h3 class="anchored" data-anchor-id="how-to-read-this-sum">How to read this sum</h3>
<p>For example, if a trajectory has 4 steps (<img src="https://latex.codecogs.com/png.latex?t%20=%200,%201,%202,%203">), the sum is:</p>
<p><img src="https://latex.codecogs.com/png.latex?G_0%20%5Clog%20%5Cpi_%5Ctheta(a_0%20%5Cmid%20s_0)%20+%20G_1%20%5Clog%20%5Cpi_%5Ctheta(a_1%20%5Cmid%20s_1)%20+%20G_2%20%5Clog%20%5Cpi_%5Ctheta(a_2%20%5Cmid%20s_2)%20+%20G_3%20%5Clog%20%5Cpi_%5Ctheta(a_3%20%5Cmid%20s_3)."></p>
<p>One term per timestep. Each term has two factors: the return <img src="https://latex.codecogs.com/png.latex?G_t"> from that timestep onward, and the log-probability of the action that was taken at that timestep.</p>
<p>Let’s compute it for the grid-world trajectory we used earlier. The returns were <img src="https://latex.codecogs.com/png.latex?G_0%20=%202,%20G_1%20=%203,%20G_2%20=%204,%20G_3%20=%205">. Suppose the current policy assigned these probabilities to the actions actually taken:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Crightarrow%20%5Cmid%20(0,0))%20=%200.4">, so <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta%20=%20%5Clog(0.4)%20%5Capprox%20-0.92"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Cuparrow%20%5Cmid%20(1,0))%20=%200.3">, so <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta%20%5Capprox%20-1.20"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Crightarrow%20%5Cmid%20(1,1))%20=%200.5">, so <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta%20%5Capprox%20-0.69"></li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(%5Crightarrow%20%5Cmid%20(2,1))%20=%200.7">, so <img src="https://latex.codecogs.com/png.latex?%5Clog%20%5Cpi_%5Ctheta%20%5Capprox%20-0.36"></li>
</ul>
<p>Plug in:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%20=%202(-0.92)%20+%203(-1.20)%20+%204(-0.69)%20+%205(-0.36)."></p>
<p><img src="https://latex.codecogs.com/png.latex?=%20-1.84%20-%203.60%20-%202.76%20-%201.80%20=%20-10.0."></p>
<p>That’s one number: the value of <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> for this specific trajectory under the current policy.</p>
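<p>The same arithmetic can be checked in a couple of lines. The returns and probabilities are the ones from the worked example; using unrounded logarithms gives a value that differs slightly from the hand calculation, which rounded each log to two decimals.</p>

```python
import math

returns = [2.0, 3.0, 4.0, 5.0]        # G_0 .. G_3 from the example
action_probs = [0.4, 0.3, 0.5, 0.7]   # pi_theta(a_t | s_t) for the actions taken

# One term per timestep: return times log-probability of the chosen action.
terms = [g * math.log(p) for g, p in zip(returns, action_probs)]
objective = sum(terms)

print([round(term, 2) for term in terms])
print(round(objective, 2))
```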
</section>
<section id="what-does-that-number-mean" class="level3">
<h3 class="anchored" data-anchor-id="what-does-that-number-mean">What does that number mean?</h3>
<p>On its own, the absolute value isn’t very meaningful. What matters is how the sum <em>changes</em> as <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> changes.</p>
<p>Here’s the intuition for why this particular form is what we want to maximize. Each term <img src="https://latex.codecogs.com/png.latex?G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> is doing something specific:</p>
<ul>
<li><p><strong>If <img src="https://latex.codecogs.com/png.latex?G_t"> is large and positive</strong> (the trajectory after this timestep was good), then <img src="https://latex.codecogs.com/png.latex?G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> is negative (the log of a probability is never positive) and gets <em>less negative</em> as <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> increases. So the sum increases when the policy raises the probability of <img src="https://latex.codecogs.com/png.latex?a_t">.</p></li>
<li><p><strong>If <img src="https://latex.codecogs.com/png.latex?G_t"> is negative</strong> (things went badly after this timestep), then <img src="https://latex.codecogs.com/png.latex?G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> is positive, and it <em>increases</em> when <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> decreases. So the sum increases when the policy lowers the probability of <img src="https://latex.codecogs.com/png.latex?a_t">.</p></li>
</ul>
<p>Putting those together: the sum is large when the policy assigns high probability to actions that led to good returns <em>and</em> low probability to actions that led to bad returns. Maximizing it pushes the policy in exactly the direction we want.</p>
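<p>A quick numeric check of the two cases, using made-up returns and probabilities: raising the probability of an action with a positive return increases its term, and lowering the probability of an action with a negative return also increases its term.</p>

```python
import math


def term(return_value, action_prob):
    """One summand of the objective: G_t * log pi_theta(a_t | s_t)."""
    return return_value * math.log(action_prob)


# Good outcome (G_t = +5): probability of the action goes up from 0.4 to 0.6.
good_before = term(5.0, 0.4)
good_after = term(5.0, 0.6)

# Bad outcome (G_t = -5): probability of the action goes down from 0.4 to 0.2.
bad_before = term(-5.0, 0.4)
bad_after = term(-5.0, 0.2)

print(good_before, good_after)
print(bad_before, bad_after)
```

<p>In both cases the second number printed on each line is the larger one, which is the direction maximization pushes.</p>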
<p>This is also why <img src="https://latex.codecogs.com/png.latex?G_t"> is the right multiplier here, not <img src="https://latex.codecogs.com/png.latex?r_t">. The return <img src="https://latex.codecogs.com/png.latex?G_t"> is the agent’s verdict on whether action <img src="https://latex.codecogs.com/png.latex?a_t"> was a good choice, accounting for everything that followed. The single-step reward <img src="https://latex.codecogs.com/png.latex?r_t"> would tell us nothing useful in environments where rewards are delayed.</p>
</section>
</section>
<section id="the-objective-as-an-expectation" class="level2">
<h2 class="anchored" data-anchor-id="the-objective-as-an-expectation">11. The objective as an expectation</h2>
<p>The single-trajectory form is what you actually compute in code. But to be precise about what <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)"> really <em>is</em>, you need to think of it as an expectation.</p>
<p>Each time you run the agent, you get a different trajectory, because the policy is stochastic and the environment may be too. The “true” objective averages over all the trajectories the policy could possibly produce:</p>
<p><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5Cbigg%5B%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%5Cbigg%5D."></p>
<p>Reading the new piece of notation:</p>
<ul>
<li><strong><img src="https://latex.codecogs.com/png.latex?%5Ctau%20%5Csim%20%5Cpi_%5Ctheta"></strong> reads as “trajectory <img src="https://latex.codecogs.com/png.latex?%5Ctau"> sampled from <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta">”: meaning <img src="https://latex.codecogs.com/png.latex?%5Ctau"> is generated by running the policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> in the environment.</li>
<li><strong><img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BE%7D_%7B%5Ctau%20%5Csim%20%5Cpi_%5Ctheta%7D%5B%5Ccdot%5D"></strong> reads as “the expectation, where the randomness comes from sampling trajectories from the policy.”</li>
</ul>
<p>So in plain English: “run the policy infinitely many times, compute the bracketed sum on each trajectory, and average the results.”</p>
<section id="expanding-the-expectation" class="level3">
<h3 class="anchored" data-anchor-id="expanding-the-expectation">Expanding the expectation</h3>
<p>The expectation is shorthand for a weighted average over trajectories. Let’s expand it.</p>
<p><strong>Step 1: What’s the probability of a specific trajectory <img src="https://latex.codecogs.com/png.latex?%5Ctau">?</strong></p>
<p>A trajectory is a sequence:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctau%20=%20(s_0,%20a_0,%20r_0,%5C;%20s_1,%20a_1,%20r_1,%5C;%20%5Cldots,%5C;%20s_T)."></p>
<p>For this exact sequence to occur, every step has to play out a particular way. The probability is the product of all the individual things that had to happen:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(%5Ctau%20%5Cmid%20%5Ctheta)%20=%20%5Crho_0(s_0)%20%5Ccdot%20%5Cprod_%7Bt=0%7D%5E%7BT-1%7D%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%20%5Ccdot%20P(s_%7Bt+1%7D%20%5Cmid%20s_t,%20a_t)."></p>
<p>Reading the pieces:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Crho_0(s_0)">: the probability that the episode started at <img src="https://latex.codecogs.com/png.latex?s_0">. (Some problems have a fixed start state; others sample from a distribution. <img src="https://latex.codecogs.com/png.latex?%5Crho_0"> covers both.)</li>
<li><img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">: the probability that the policy chose action <img src="https://latex.codecogs.com/png.latex?a_t"> in state <img src="https://latex.codecogs.com/png.latex?s_t">. (This is the only piece that depends on <img src="https://latex.codecogs.com/png.latex?%5Ctheta">.)</li>
<li><img src="https://latex.codecogs.com/png.latex?P(s_%7Bt+1%7D%20%5Cmid%20s_t,%20a_t)">: the probability that the environment transitioned to <img src="https://latex.codecogs.com/png.latex?s_%7Bt+1%7D"> given the state and action. (This is the environment dynamics: fixed, not learnable.)</li>
</ul>
<p>In a deterministic environment with a fixed start state, every <img src="https://latex.codecogs.com/png.latex?%5Crho_0"> and <img src="https://latex.codecogs.com/png.latex?P"> factor is either 0 or 1, so for any trajectory that can actually occur the probability simplifies to just <img src="https://latex.codecogs.com/png.latex?%5Cprod_t%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">: the product of the policy probabilities along the path.</p>
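<p>The product formula can be sketched directly. The policy probabilities reuse the four from the earlier worked example; the 0.8 transition probability for the "slippery" case and the function name <code>trajectory_probability</code> are assumptions for illustration.</p>

```python
def trajectory_probability(start_prob, policy_probs, transition_probs):
    """rho_0(s_0) times the product over t of pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    if len(policy_probs) != len(transition_probs):
        raise ValueError("need one transition probability per action")
    prob = start_prob
    for pi_t, p_t in zip(policy_probs, transition_probs):
        prob *= pi_t * p_t
    return prob


policy_probs = [0.4, 0.3, 0.5, 0.7]

# Deterministic environment, fixed start: every environment factor is 1.
deterministic = trajectory_probability(1.0, policy_probs, [1.0] * 4)

# Hypothetical noisy environment: each intended move succeeds with probability 0.8.
slippery = trajectory_probability(1.0, policy_probs, [0.8] * 4)

print(deterministic, slippery)
```

<p>Only the <code>policy_probs</code> factors depend on the network weights; the other factors are whatever the environment dictates.</p>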
<p><strong>Step 2: Plug into the expectation.</strong></p>
<p>The expectation sums over all possible trajectories, weighted by their probabilities:</p>
<p><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_%7B%5Ctau%7D%20P(%5Ctau%20%5Cmid%20%5Ctheta)%20%5Ccdot%20%5Cbigg%5B%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%5Cbigg%5D."></p>
<p>Substituting the expression for <img src="https://latex.codecogs.com/png.latex?P(%5Ctau%20%5Cmid%20%5Ctheta)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_%7B%5Ctau%7D%20%5Cbigg%5B%5Cunderbrace%7B%5Crho_0(s_0)%20%5Cprod_%7Bt=0%7D%5E%7BT-1%7D%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%20P(s_%7Bt+1%7D%20%5Cmid%20s_t,%20a_t)%7D_%7B%5Ctext%7Bprobability%20of%20this%20trajectory%7D%7D%5Cbigg%5D%20%5Ccdot%20%5Cbigg%5B%5Cunderbrace%7B%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)%7D_%7B%5Ctext%7Bvalue%20of%20this%20trajectory%7D%7D%5Cbigg%5D."></p>
<p>That’s the fully expanded form.</p>
<p>The “<img src="https://latex.codecogs.com/png.latex?%5Csum_%5Ctau">” is summing over <em>every possible trajectory</em>: every possible starting state, every possible sequence of actions the policy might take, every possible sequence of next states the environment might produce.</p>
</section>
<section id="why-nobody-computes-this-directly" class="level3">
<h3 class="anchored" data-anchor-id="why-nobody-computes-this-directly">Why nobody computes this directly</h3>
<p>In the grid world with 16 squares, 4 actions, and episodes of length up to 20 steps, the number of possible trajectories is roughly <img src="https://latex.codecogs.com/png.latex?4%5E%7B20%7D">, about a trillion, even before accounting for state randomness. In something like Atari or Go, it’s effectively infinite.</p>
<p>You can’t enumerate all trajectories, and you can’t compute <img src="https://latex.codecogs.com/png.latex?P(%5Ctau%20%5Cmid%20%5Ctheta)"> for each one either, because it depends on <img src="https://latex.codecogs.com/png.latex?P(s_%7Bt+1%7D%20%5Cmid%20s_t,%20a_t)">, the environment dynamics. You usually don’t even know those; you just experience them by stepping the simulator.</p>
<p>This is exactly why we sample. Instead of summing over all trajectories with their true probabilities, we:</p>
<ol type="1">
<li>Run the policy once. The environment naturally generates a trajectory with the right probability. We don’t need to know <img src="https://latex.codecogs.com/png.latex?P"> explicitly because the simulator embodies it.</li>
<li>Compute <img src="https://latex.codecogs.com/png.latex?%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)"> for that trajectory.</li>
<li>Treat this as a single sample from the expectation.</li>
<li>Average across many such samples, either within a batch or implicitly over many gradient steps.</li>
</ol>
<p>This is called a <strong>Monte Carlo estimate</strong>: using random samples to approximate an expectation.</p>
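<p>On a problem small enough to enumerate, the exact weighted sum and the sampled average can be compared side by side. The two-step toy problem below is entirely hypothetical: two actions with a state-independent policy, a reward of 1 for the "good" action, and no discounting. It is a sketch of the idea, not the grid world from this post.</p>

```python
import itertools
import math
import random

ACTION_PROBS = {"good": 0.7, "bad": 0.3}   # assumed fixed policy
REWARD = {"good": 1.0, "bad": 0.0}         # assumed reward per action
HORIZON = 2


def trajectory_value(actions):
    """Sum over t of G_t * log pi(a_t) for one action sequence (gamma = 1)."""
    rewards = [REWARD[a] for a in actions]
    return sum(
        sum(rewards[t:]) * math.log(ACTION_PROBS[a])
        for t, a in enumerate(actions)
    )


def exact_objective():
    """Enumerate every trajectory and weight its value by its probability."""
    total = 0.0
    for actions in itertools.product(ACTION_PROBS, repeat=HORIZON):
        prob = math.prod(ACTION_PROBS[a] for a in actions)
        total += prob * trajectory_value(actions)
    return total


def monte_carlo_objective(num_samples, rng):
    """Average the trajectory value over sampled rollouts."""
    names = list(ACTION_PROBS)
    weights = [ACTION_PROBS[a] for a in names]
    values = [
        trajectory_value(rng.choices(names, weights=weights, k=HORIZON))
        for _ in range(num_samples)
    ]
    return sum(values) / num_samples


exact = exact_objective()
estimate = monte_carlo_objective(20_000, random.Random(0))
print(exact, estimate)
```

<p>Here enumeration only has to cover four trajectories. The sampled version never touches the trajectory probabilities at all, which is what makes it usable when the dynamics are unknown.</p>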
</section>
</section>
<section id="putting-the-training-loop-together" class="level2">
<h2 class="anchored" data-anchor-id="putting-the-training-loop-together">12. Putting the training loop together</h2>
<p>We now have all the pieces. The REINFORCE training loop:</p>
<ol type="1">
<li><p><strong>Roll out a trajectory.</strong> Starting from some initial state, sample actions from the current policy <img src="https://latex.codecogs.com/png.latex?%5Cpi_%5Ctheta"> until the episode ends. Record states, actions, and rewards along the way.</p></li>
<li><p><strong>Compute the returns.</strong> For each timestep <img src="https://latex.codecogs.com/png.latex?t"> in the trajectory, sum the (discounted) rewards from <img src="https://latex.codecogs.com/png.latex?t"> onward to get <img src="https://latex.codecogs.com/png.latex?G_t">.</p></li>
<li><p><strong>Form the objective.</strong> Plug into <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)%20=%20%5Csum_t%20G_t%20%5Clog%20%5Cpi_%5Ctheta(a_t%20%5Cmid%20s_t)">. This is your single-sample estimate of the true expected objective.</p></li>
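<li><p><strong>Update the parameters.</strong> Take a gradient descent step on the loss <img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20-J(%5Ctheta)">, which is the same as a gradient ascent step on <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)">.</p></li>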
<li><p><strong>Repeat.</strong> With updated parameters, roll out a new trajectory and do it again.</p></li>
</ol>
<p>A useful intuition: the policy network is being shown its own past behavior and told <em>do more of what worked, less of what didn’t</em>. The return <img src="https://latex.codecogs.com/png.latex?G_t"> is the supervision signal: it plays the role that the label plays in supervised learning, except the agent generates it for itself by interacting with the environment.</p>
<section id="the-training-loop-in-code" class="level3">
<h3 class="anchored" data-anchor-id="the-training-loop-in-code">The training loop in code</h3>
<p>To make this concrete, here’s the entire REINFORCE training loop in PyTorch. The grid world is abstracted as <code>env</code>, and the policy is a small neural network.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> torch.nn.functional <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> F</span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Hyperparameters</span></span>
<span id="cb1-5">learning_rate <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-3</span></span>
<span id="cb1-6">gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.99</span></span>
<span id="cb1-7">num_episodes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10000</span></span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Policy network: state -&gt; action logits</span></span>
<span id="cb1-10">policy <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> PolicyNetwork()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># outputs logits over actions</span></span>
<span id="cb1-11">optimizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.optim.Adam(policy.parameters(), lr<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>learning_rate)</span>
<span id="cb1-12"></span>
<span id="cb1-13"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> episode <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(num_episodes):</span>
<span id="cb1-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 1. Roll out a trajectory</span></span>
<span id="cb1-15">    states, actions, rewards <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [], [], []</span>
<span id="cb1-16">    state <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> env.reset()</span>
<span id="cb1-17">    done <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb1-18">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">while</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">not</span> done:</span>
<span id="cb1-19">        logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> policy(state)                     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># action logits for this state</span></span>
<span id="cb1-20">        probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.softmax(logits, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># action probabilities</span></span>
<span id="cb1-21">        action <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.multinomial(probs, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>).item()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sample an action</span></span>
<span id="cb1-22">        next_state, reward, done <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> env.step(action)</span>
<span id="cb1-23"></span>
<span id="cb1-24">        states.append(state)</span>
<span id="cb1-25">        actions.append(action)</span>
<span id="cb1-26">        rewards.append(reward)</span>
<span id="cb1-27">        state <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> next_state</span>
<span id="cb1-28"></span>
<span id="cb1-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 2. Compute returns G_t for each timestep (working backward)</span></span>
<span id="cb1-30">    returns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> []</span>
<span id="cb1-31">    G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-32">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> r <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">reversed</span>(rewards):</span>
<span id="cb1-33">        G <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> r <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> G</span>
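<span id="cb1-34">        returns.append(G)                   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># O(1) append; insert(0, G) would cost O(T) per step</span></span>
<span id="cb1-35">    returns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.tensor(returns[::<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>])       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># reverse back into chronological order</span></span>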
<span id="cb1-36"></span>
<span id="cb1-37">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 3. Form the loss</span></span>
<span id="cb1-38">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># For each timestep, we want G_t * log pi(a_t | s_t).</span></span>
<span id="cb1-39">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We sum over t and negate (since optimizers minimize).</span></span>
<span id="cb1-40">    loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb1-41">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> s_t, a_t, G_t <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">zip</span>(states, actions, returns):</span>
<span id="cb1-42">        logits <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> policy(s_t)</span>
<span id="cb1-43">        log_probs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> F.log_softmax(logits, dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=-</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-44">        loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> loss <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> G_t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> log_probs[a_t]</span>
<span id="cb1-45"></span>
<span id="cb1-46">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 4. Update parameters</span></span>
<span id="cb1-47">    optimizer.zero_grad()</span>
<span id="cb1-48">    loss.backward()</span>
<span id="cb1-49">    optimizer.step()</span></code></pre></div></div>
<p>A few things worth flagging in this code:</p>
<p><strong>The negation.</strong> Because optimizers minimize by default, we minimize <img src="https://latex.codecogs.com/png.latex?-J(%5Ctheta)">, which is equivalent to maximizing <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)">. Hence <code>loss = loss - G_t * log_probs[a_t]</code> instead of <code>+</code>.</p>
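<p>A quick sanity check of the sign convention, as a sketch on a single state with made-up values (the parameters here are the logits themselves rather than a network, and <code>prob_after_one_step</code> is a hypothetical helper): one gradient-descent step on <code>-G * log_prob</code> raises the probability of the taken action when the return is positive and lowers it when the return is negative.</p>

```python
import torch
import torch.nn.functional as F

def prob_after_one_step(G, action=1, lr=0.1):
    """Take one SGD step on loss = -G * log pi(action); return (before, after) probabilities."""
    logits = torch.zeros(3, requires_grad=True)   # uniform policy over 3 actions
    optimizer = torch.optim.SGD([logits], lr=lr)

    before = F.softmax(logits, dim=-1)[action].item()

    loss = -G * F.log_softmax(logits, dim=-1)[action]  # negated objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    after = F.softmax(logits, dim=-1)[action].item()
    return before, after

before_pos, after_pos = prob_after_one_step(G=1.0)
before_neg, after_neg = prob_after_one_step(G=-1.0)
print(f"G=+1: {before_pos:.3f} -> {after_pos:.3f}")
print(f"G=-1: {before_neg:.3f} -> {after_neg:.3f}")
```

<p>Flipping the sign of the loss would reverse both outcomes, which is the bug the negation guards against.</p>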
<p><strong>Returns computed backward.</strong> The <code>for r in reversed(rewards)</code> loop is the standard way to compute returns in <img src="https://latex.codecogs.com/png.latex?O(T)"> time using the recursion <img src="https://latex.codecogs.com/png.latex?G_t%20=%20r_t%20+%20%5Cgamma%20G_%7Bt+1%7D">, starting from <code>G = 0</code> past the final step. A naive forward computation, which re-sums the remaining discounted rewards from scratch at every timestep, would be <img src="https://latex.codecogs.com/png.latex?O(T%5E2)">.</p>
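<p>To see that the two approaches agree, here is a small sketch comparing them on a made-up three-step episode. The function names and the choice of <code>gamma = 0.5</code> are for illustration only; the discount is picked so the arithmetic is easy to follow by hand.</p>

```python
def returns_backward(rewards, gamma):
    """O(T): one pass from the end using G_t = r_t + gamma * G_{t+1}."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()  # back to chronological order
    return returns

def returns_forward(rewards, gamma):
    """O(T^2): re-sum the discounted rewards from scratch for every t."""
    T = len(rewards)
    return [
        sum(gamma ** (k - t) * rewards[k] for k in range(t, T))
        for t in range(T)
    ]

rewards = [0.0, 0.0, 1.0]   # made-up episode: reward only at the last step
gamma = 0.5

print(returns_backward(rewards, gamma))  # [0.25, 0.5, 1.0]
print(returns_forward(rewards, gamma))   # [0.25, 0.5, 1.0]
```

<p>Reading the output right to left traces the recursion: the last step keeps its reward of 1.0, and each earlier step sees that reward discounted once more.</p>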
<p><strong>One trajectory per gradient step.</strong> This is REINFORCE in its purest form. In practice you’d batch multiple trajectories before each step to reduce variance, but the algorithm doesn’t require it.</p>
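<p>A minimal sketch of the batched variant, under stated assumptions: <code>CorridorEnv</code> is a hypothetical toy environment standing in for the grid world, a single linear layer stands in for <code>PolicyNetwork</code>, and <code>batch_size = 8</code> is an arbitrary choice. Unlike the loop above, this version stores each log-probability during the rollout instead of recomputing it afterward; the two are equivalent because the parameters do not change mid-rollout.</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorridorEnv:
    """Hypothetical toy environment: walk right from cell 0 to the last cell."""
    def __init__(self, length=5, max_steps=20):
        self.length, self.max_steps = length, max_steps

    def _obs(self):
        return F.one_hot(torch.tensor(self.pos), self.length).float()

    def reset(self):
        self.pos, self.steps = 0, 0
        return self._obs()

    def step(self, action):  # action 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        self.steps += 1
        reached_goal = self.pos == self.length - 1
        reward = 1.0 if reached_goal else -0.1
        done = reached_goal or self.steps >= self.max_steps
        return self._obs(), reward, done

def rollout_loss(policy, env, gamma):
    """Roll out one trajectory and return its negated REINFORCE objective."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        dist = torch.distributions.Categorical(logits=policy(state))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)

    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns = torch.tensor(returns[::-1])
    return -(returns * torch.stack(log_probs)).sum()

torch.manual_seed(0)
env = CorridorEnv()
policy = nn.Linear(5, 2)   # stand-in for PolicyNetwork
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
batch_size = 8             # trajectories per gradient step (assumed value)

for step in range(50):
    # Average the per-trajectory losses so the gradient is a batch mean.
    losses = [rollout_loss(policy, env, gamma=0.99) for _ in range(batch_size)]
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

<p>Because the trajectories are sampled independently, averaging <code>batch_size</code> of them divides the variance of the gradient estimate by roughly that factor, at the cost of proportionally more environment steps per update.</p>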
</section>
</section>
<section id="the-bridge-to-the-gradient-mechanics" class="level2">
<h2 class="anchored" data-anchor-id="the-bridge-to-the-gradient-mechanics">13. The bridge to the gradient mechanics</h2>
<p>This is the conceptual frame in which REINFORCE makes sense. We have an objective <img src="https://latex.codecogs.com/png.latex?J(%5Ctheta)">, written either as a single-trajectory sum or as an expectation over trajectories, and we want to maximize it. Equivalently, we minimize <img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20-J(%5Ctheta)"> by gradient descent.</p>
<p>What this frame <em>doesn’t</em> tell you is <em>why</em> gradient descent on <img src="https://latex.codecogs.com/png.latex?L"> actually causes good actions to become more probable. That question lives one layer down. It’s about how the gradient flows through the softmax to the logits, and from the logits back through the network to the parameters. That’s what <a href="../../../../posts/series/how-llms-learn-to-reason/00-reinforce-gradient/index.html">Part II</a> is about.</p>


</section>

 ]]></description>
  <category>How LLMs learn to reason</category>
  <guid>https://yeesengchan.com/posts/series/how-llms-learn-to-reason/00-reinforce-foundations/</guid>
  <pubDate>Mon, 16 Feb 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
