Links
- Lecture video: https://youtu.be/JdGFdViaOJk
- Course materials: lecture_17.py
Introduction: From "What" to "How" in Reinforcement Learning
This lecture transitions from the high-level concepts of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) to the low-level, practical mechanics of the algorithms that make them work. The central theme is a deep dive into the policy gradient method, the fundamental algorithm that allows a language model to be optimized directly for a reward signal.
The core principle is simple and powerful: If you can measure it, you can optimize it. While previous lectures established that RLVR is the key to surpassing human abilities on tasks with verifiable correctness (like math and coding), this session is dedicated to understanding how the model's parameters are actually updated to achieve this. We will build the entire process from first principles, starting with the mathematical derivation of the policy gradient, identifying its critical flaw (high variance), and then systematically introducing the concepts of baselines and advantage functions to create a stable and effective learning algorithm.
The lecture culminates in a detailed walkthrough of a practical implementation, focusing on Group Relative Policy Optimization (GRPO). This simplified yet powerful algorithm has become a cornerstone of recent state-of-the-art reasoning models. By the end, the "magic" of RL will be demystified, replaced by a clear, mechanical understanding of how a model learns to generate high-reward outputs through iterative optimization.
Part 1: Framing Language Generation as a Reinforcement Learning Problem
To apply RL, we must first translate the task of text generation into the standard RL framework.
- State (s): In the context of LLMs, the state is the immutable prompt concatenated with the sequence of tokens generated so far. For example, s_t = "Write a python function to sort a list. def sort_list(lst):".
- Action (a): An action is the act of generating the next single token from the vocabulary.
- Policy (π(a|s)): The policy is the language model itself, specifically a fine-tuned version of a base model. It is a function π_θ parameterized by weights θ, which takes the current state s and outputs a probability distribution over all possible next tokens a.
- Trajectory/Rollout (τ): A trajectory is a full sequence of states and actions, starting from the initial prompt and ending when the model generates an end-of-sequence token. It represents a complete, generated response.
- Reward (R): The reward is a scalar value assigned to a complete trajectory. In the RLVR setting, this reward is:
  - Verifiable: It can be computed programmatically without human judgment (e.g., is_correct(solution)).
  - Outcome-based: The reward is given only at the end of the generation, based on the final output.
  - Undiscounted: Unlike in traditional RL, there is no discounting factor γ, as the reward reflects the quality of the entire sequence.
- Transition Dynamics (T(s'|s,a)): The environment is deterministic. The next state s' is simply the current state s with the chosen action a (the next token) appended to it. This simplicity is a key difference from complex physical environments and allows for powerful planning and search at test time.
- The Objective: The goal of the RL algorithm is to find the optimal set of policy parameters θ* that maximizes the expected reward over the distribution of prompts and generated responses: maximize E[R]. A minimal code sketch of this framing follows the list.
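To make the mapping concrete, here is a minimal, illustrative sketch of the framing; the policy callable and the is_correct grader are hypothetical stand-ins, not code from lecture_17.py.

```python
# Illustrative sketch of the text-generation MDP (helper names are hypothetical).

def transition(state: str, action: str) -> str:
    """Deterministic dynamics: the next state is the current state with the action (token) appended."""
    return state + action

def rollout(policy, prompt: str, eos: str = "<eos>", max_tokens: int = 128):
    """Sample a trajectory: start from the prompt, sample tokens until EOS, return the full response."""
    state, actions = prompt, []
    for _ in range(max_tokens):
        action = policy(state)            # a ~ pi_theta(. | s); any callable returning the next token
        actions.append(action)
        state = transition(state, action)
        if action == eos:
            break
    return state, actions

# The verifiable, outcome-based reward is assigned only to the finished trajectory, e.g.:
# reward = float(is_correct(final_state))   # is_correct is a hypothetical grader
```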
Part 2: The Theory and Mechanics of Policy Gradient
The Naive Policy Gradient: A High-Variance Starting Point
The policy gradient theorem provides a direct way to calculate the gradient of the expected reward with respect to the model's parameters θ. Through a mathematical manipulation known as the log-derivative trick, the gradient can be expressed as an expectation:
∇_θ E[R] = E[ ∇_θ log π_θ(a|s) * R(s, a) ]
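To see where this comes from, write the expectation as a sum over responses and apply the identity ∇_θ π_θ = π_θ * ∇_θ log π_θ (the log-derivative trick):
∇_θ E[R] = ∇_θ Σ_a π_θ(a|s) R(s, a)
         = Σ_a π_θ(a|s) ∇_θ log π_θ(a|s) R(s, a)
         = E[ ∇_θ log π_θ(a|s) * R(s, a) ]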
This formulation is powerful because it allows us to estimate the gradient using samples from our own policy. This leads to the naive policy gradient algorithm:
- Sample: For a given prompt s, generate a response a by sampling from the current policy π_θ.
- Evaluate: Compute the reward R(s, a) for the generated response.
- Update: Update the policy parameters θ by taking a small step in the direction of the gradient estimate: ∇_θ log π_θ(a|s) * R(s, a).
The Intuition: This update rule is remarkably similar to the one used in Supervised Fine-Tuning (SFT). log π_θ(a|s) is simply the log-likelihood of the generated sequence. The update is therefore equivalent to performing SFT on the model's own outputs, but with a crucial difference: the loss for each sample is weighted by its reward. High-reward responses are strongly reinforced (their likelihood is increased), while zero-reward responses provide no learning signal at all.
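As a concrete illustration (not the lecture's exact code), the naive policy gradient can be written as a reward-weighted negative log-likelihood over sampled sequences, assuming the per-sequence log-probabilities have already been computed:

```python
import torch

def naive_policy_gradient_loss(seq_log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Reward-weighted negative log-likelihood.

    seq_log_probs: (batch,) sum of log pi_theta(a_t | s_t) over each generated sequence.
    rewards:       (batch,) scalar reward per sequence, treated as a constant (no gradient flows through it).
    """
    # Minimizing this loss ascends the policy-gradient estimate
    # grad = E[ grad_theta log pi_theta(a|s) * R(s, a) ].
    return -(rewards.detach() * seq_log_probs).mean()
```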
The Critical Flaw: High Variance. While this estimate is mathematically unbiased, its variance is enormous, especially in RLVR settings with sparse rewards (e.g., a binary 0/1 for correctness). The model might generate thousands of incorrect responses (reward 0), leading to zero gradient and no learning. Then, by sheer luck, it might stumble upon a correct response (reward 1), resulting in a massive gradient that can destabilize the entire training process. This makes learning extremely noisy, sample-inefficient, and slow.
The Solution: Baselines and the Advantage Function
To combat high variance, we introduce a baseline b(s). A baseline is a function that depends only on the state s (the prompt) and not on the action a (the response). We can subtract this from the reward in our gradient estimate without introducing any bias:
∇_θ E[R] = E[ ∇_θ log π_θ(a|s) * (R(s, a) - b(s)) ]
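The subtraction introduces no bias because the score function has zero expectation under the policy, so the baseline term vanishes in expectation:
E[ ∇_θ log π_θ(a|s) * b(s) ] = b(s) * Σ_a π_θ(a|s) ∇_θ log π_θ(a|s) = b(s) * ∇_θ Σ_a π_θ(a|s) = b(s) * ∇_θ 1 = 0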
The Intuition: Imagine a prompt where all possible responses yield a very high reward (e.g., between 90 and 100). The naive policy gradient would reinforce all of them. However, by setting a baseline b(s) equal to the average reward for that prompt (e.g., 95), we create a more meaningful learning signal. Responses with a reward of 99 get a positive update (99 - 95 = +4), while responses with a reward of 91 get a negative update (91 - 95 = -4). This teaches the model not just to find good solutions, but to find the best solutions.
The Advantage Function. The term R(s, a) - b(s) is an estimate of the Advantage Function A(s, a). The ideal baseline is the Value Function V(s) = E[R|s], which is the expected reward from a given state. The advantage A(s, a) = R(s, a) - V(s) therefore measures how much better or worse a specific action was compared to the average expected outcome. Using the advantage dramatically reduces variance and stabilizes training.
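As a quick, illustrative check (not from the lecture), a toy one-parameter policy choosing between the two responses from the example above shows how centering the reward preserves the expected gradient while collapsing its variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: pi_theta(a=1|s) = p, with score grad_theta log pi_theta(a|s) = (a - p)
# under a logistic parameterization. Rewards 91 and 99 match the prose example.
p = 0.5
actions = rng.binomial(1, p, size=100_000)
rewards = np.where(actions == 1, 99.0, 91.0)
scores = actions - p                           # grad_theta log pi_theta(a|s)

baseline = 95.0                                # b(s) ~= V(s), the mean reward for this prompt
grad_naive = scores * rewards                  # grad log pi * R
grad_centered = scores * (rewards - baseline)  # grad log pi * (R - b)

print(f"naive:    mean={grad_naive.mean():+.3f}  var={grad_naive.var():.1f}")
print(f"centered: mean={grad_centered.mean():+.3f}  var={grad_centered.var():.1f}")
# Both estimators have the same expectation (+2 here), but centering collapses the variance.
```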
Part 3: A Practical Implementation with Group Relative Policy Optimization (GRPO)
While powerful, estimating the value function V(s) typically requires training an additional, large neural network (a "critic"), which is a core component of complex algorithms like PPO. Group Relative Policy Optimization (GRPO) is a simpler algorithm that has become popular in RLVR because it cleverly avoids this.
The GRPO Advantage Estimate
GRPO's core innovation is its method for estimating the baseline V(s) on-the-fly without a critic model. The procedure is as follows (a code sketch follows the list):
- For a single prompt s, generate a group of G different responses by sampling from the current policy π_θ.
- Calculate the verifiable reward R_i for each of the G responses.
- The baseline V(s) is simply estimated as the mean of the rewards within that group: b(s) = mean(R_1, R_2, ..., R_G).
- The advantage for each response in the group is then calculated as: A_i = R_i - mean(R_group).
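A minimal sketch of this advantage computation, assuming the rewards are already collected into a (num_prompts, G) tensor; note that some GRPO implementations additionally normalize by the group's standard deviation, but the description above uses the mean only:

```python
import torch

def group_relative_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage estimate.

    group_rewards: (num_prompts, G) verifiable rewards for G sampled responses per prompt.
    Returns:       (num_prompts, G) advantages A_i = R_i - mean(R_group).
    """
    baseline = group_rewards.mean(dim=-1, keepdim=True)   # b(s) estimated per prompt
    return group_rewards - baseline

# Example: two prompts, G = 4 responses each, binary correctness rewards.
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(group_relative_advantages(rewards))
# -> [[ 0.5, -0.5, -0.5,  0.5],
#     [-0.25, -0.25, -0.25, 0.75]]
```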
The Full GRPO Training Loop
The lecture provides a code walkthrough that illustrates the end-to-end training process:
- Outer Loop (Epochs): Training is iterative. At the start of each major iteration (or epoch), the current policy model π_θ is frozen and becomes the reference model (π_ref). This reference model is used to compute a KL-divergence penalty, which regularizes the policy and prevents it from diverging too far from a stable, known distribution, a key technique for preventing catastrophic collapse during RL.
- Inner Loop (Training Steps): Within each epoch, multiple gradient update steps are performed (a schematic sketch follows the list).
  - a. Rollout Phase: For a batch of prompts, the current policy π_θ is used to generate a group of G responses for each prompt.
  - b. Reward & Advantage Calculation: The verifiable reward function is applied to every generated response. Then, for each prompt's group of responses, the group-relative advantage is calculated.
  - c. Loss Calculation: The total loss is computed. It consists of:
    - Policy Gradient Loss: The negative sum of the log-probabilities of each response, weighted by its calculated advantage.
    - (Optional) KL-Divergence Penalty: A term β * KL(π_θ || π_ref) is added to the loss to regularize the policy.
    - (Optional) PPO-Style Clipping: To prevent excessively large updates, the ratio of probabilities π_θ / π_ref can be clipped to a small window (e.g., [0.8, 1.2]), which is a core stabilizing mechanism from the PPO algorithm that is often incorporated into GRPO.
  - d. Parameter Update: The loss is backpropagated, and the policy model's parameters θ are updated via an optimizer like Adam.
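To tie the steps together, here is a schematic sketch of one inner-loop step under the description above; it is not the lecture's lecture_17.py, and the helpers generate_group, verifiable_reward, and sequence_log_prob are assumed placeholders.

```python
import torch

def grpo_step(policy, ref_policy, optimizer, prompts, G=8, beta=0.01, clip_eps=0.2):
    """One inner-loop step: rollout, reward, group-relative advantage, loss, parameter update."""
    total_loss = 0.0
    for prompt in prompts:
        # a. Rollout: sample a group of G responses from the current policy.
        responses = generate_group(policy, prompt, G)              # hypothetical helper

        # b. Reward & advantage: verifiable reward, centered within the group.
        rewards = torch.tensor([verifiable_reward(prompt, r) for r in responses])
        advantages = rewards - rewards.mean()

        for response, adv in zip(responses, advantages):
            # c. Loss: policy-gradient term with optional clipping and KL penalty.
            logp = sequence_log_prob(policy, prompt, response)      # sum of token log-probs
            with torch.no_grad():
                logp_ref = sequence_log_prob(ref_policy, prompt, response)
            ratio = torch.exp(logp - logp_ref)                      # clipped against the frozen reference, as above
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
            pg_loss = -torch.min(ratio * adv, clipped * adv)
            kl_penalty = beta * (logp - logp_ref)                   # crude per-sequence KL estimate
            total_loss = total_loss + pg_loss + kl_penalty

    # d. Parameter update.
    optimizer.zero_grad()
    total_loss = total_loss / (len(prompts) * G)
    total_loss.backward()
    optimizer.step()
    return float(total_loss)

# Outer loop: at the start of each epoch, freeze a copy of the policy as the reference model.
# for epoch in range(num_epochs):
#     ref_policy = copy.deepcopy(policy).eval()                     # requires `import copy`
#     for prompt_batch in prompt_batches:
#         grpo_step(policy, ref_policy, optimizer, prompt_batch)
```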
This entire process—rollout, reward, advantage estimation, and update—is repeated, iteratively refining the model's policy to increase the probability of generating high-reward, correct solutions. The lecture's coding example on a simple sorting task clearly demonstrates that using a baseline (centered rewards) is dramatically more effective and stable than the naive policy gradient approach.