Links
- Lecture video: https://youtu.be/JdGFdViaOJk
- Course materials: lecture_17.py
Introduction: From "What" to "How" in Reinforcement Learning
This lecture transitions from the high-level concepts of Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from Verifiable Rewards (RLVR) to the low-level, practical mechanics of the algorithms that make them work. The central theme is a deep dive into the policy gradient method, the fundamental algorithm that allows a language model to be optimized directly for a reward signal.
The core principle is simple and powerful: If you can measure it, you can optimize it. While previous lectures established that RLVR is the key to surpassing human abilities on tasks with verifiable correctness (like math and coding), this session is dedicated to understanding how the model's parameters are actually updated to achieve this. We will build the entire process from first principles, starting with the mathematical derivation of the policy gradient, identifying its critical flaw (high variance), and then systematically introducing the concepts of baselines and advantage functions to create a stable and effective learning algorithm.
The lecture culminates in a detailed walkthrough of a practical implementation, focusing on Group Relative Policy Optimization (GRPO). This simplified yet powerful algorithm has become a cornerstone of recent state-of-the-art reasoning models. By the end, the "magic" of RL will be demystified, replaced by a clear, mechanical understanding of how a model learns to generate high-reward outputs through iterative optimization.
Part 1: Framing Language Generation as a Reinforcement Learning Problem
To apply RL, we must first translate the task of text generation into the standard RL framework.
- State (s): In the context of LLMs, the state is the immutable prompt concatenated with the sequence of tokens generated so far. For example, s_t = "Write a python function to sort a list. def sort_list(lst):".
- Action (a): An action is the act of generating the next single token from the vocabulary.
- Policy (π(a|s)): The policy is the language model itself, specifically a fine-tuned version of a base model. It is a function π_θ parameterized by weights θ, which takes the current state s and outputs a probability distribution over all possible next tokens a.
- Trajectory/Rollout (τ): A trajectory is a full sequence of states and actions, starting from the initial prompt and ending when the model generates an end-of-sequence token. It represents a complete, generated response.
- Reward (R): The reward is a scalar value assigned to a complete trajectory. In the RLVR setting, this reward is:
  - Verifiable: It can be computed programmatically without human judgment (e.g., is_correct(solution)).
  - Outcome-based: The reward is given only at the end of the generation, based on the final output.
  - Undiscounted: Unlike in traditional RL, there is no discounting factor γ, as the reward reflects the quality of the entire sequence.
- Transition Dynamics (T(s'|s,a)): The environment is deterministic. The next state s' is simply the current state s with the chosen action a (the next token) appended to it. This simplicity is a key difference from complex physical environments and allows for powerful planning and search at test time.
- The Objective: The goal of the RL algorithm is to find the optimal set of policy parameters θ* that maximizes the expected reward over the distribution of prompts and generated responses: maximize E[R]. A minimal code sketch of this framing follows the list.
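To make the mapping concrete, here is a minimal, illustrative sketch of the framing; the policy callable and the is_correct grader are hypothetical stand-ins, not code from lecture_17.py.

```python
# Illustrative sketch of the text-generation MDP (helper names are hypothetical).

def transition(state: str, action: str) -> str:
    """Deterministic dynamics: the next state is the current state with the action (token) appended."""
    return state + action

def rollout(policy, prompt: str, eos: str = "<eos>", max_tokens: int = 128):
    """Sample a trajectory: start from the prompt, sample tokens until EOS, return the full response."""
    state, actions = prompt, []
    for _ in range(max_tokens):
        action = policy(state)            # a ~ pi_theta(. | s); any callable returning the next token
        actions.append(action)
        state = transition(state, action)
        if action == eos:
            break
    return state, actions

# The verifiable, outcome-based reward is assigned only to the finished trajectory, e.g.:
# reward = float(is_correct(final_state))   # is_correct is a hypothetical grader
```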
Part 2: The Theory and Mechanics of Policy Gradient
The Naive Policy Gradient: A High-Variance Starting Point
The policy gradient theorem provides a direct way to calculate the gradient of the expected reward with respect to the model's parameters θ. Through a mathematical manipulation known as the log-derivative trick, the gradient can be expressed as an expectation:
∇_θ E[R] = E[ ∇_θ log π_θ(a|s) * R(s, a) ]
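To see where this comes from, write the expectation as a sum over responses and apply the identity ∇_θ π_θ = π_θ * ∇_θ log π_θ (the log-derivative trick):
∇_θ E[R] = ∇_θ Σ_a π_θ(a|s) R(s, a)
         = Σ_a π_θ(a|s) ∇_θ log π_θ(a|s) R(s, a)
         = E[ ∇_θ log π_θ(a|s) * R(s, a) ]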
This formulation is powerful because it allows us to estimate the gradient using samples from our own policy. This leads to the naive policy gradient algorithm:
- Sample: For a given prompt s, generate a response a by sampling from the current policy π_θ.
- Evaluate: Compute the reward R(s, a) for the generated response.
- Update: Update the policy parameters θ by taking a small step in the direction of the gradient estimate: ∇_θ log π_θ(a|s) * R(s, a).
The Intuition: This update rule is remarkably similar to the one used in Supervised Fine-Tuning (SFT). log π_θ(a|s) is simply the log-likelihood of the generated sequence. The update is therefore equivalent to performing SFT on the model's own outputs, but with a crucial difference: the loss for each sample is weighted by its reward. High-reward responses are strongly reinforced (their likelihood is increased), while zero-reward responses provide no learning signal at all.
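As a concrete illustration (not the lecture's exact code), the naive policy gradient can be written as a reward-weighted negative log-likelihood over sampled sequences, assuming the per-sequence log-probabilities have already been computed:

```python
import torch

def naive_policy_gradient_loss(seq_log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Reward-weighted negative log-likelihood.

    seq_log_probs: (batch,) sum of log pi_theta(a_t | s_t) over each generated sequence.
    rewards:       (batch,) scalar reward per sequence, treated as a constant (no gradient flows through it).
    """
    # Minimizing this loss ascends the policy-gradient estimate
    # grad = E[ grad_theta log pi_theta(a|s) * R(s, a) ].
    return -(rewards.detach() * seq_log_probs).mean()
```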
The Critical Flaw: High Variance. While this estimate is mathematically unbiased, its variance is enormous, especially in RLVR settings with sparse rewards (e.g., a binary 0/1 for correctness). The model might generate thousands of incorrect responses (reward 0), leading to zero gradient and no learning. Then, by sheer luck, it might stumble upon a correct response (reward 1), resulting in a massive gradient that can destabilize the entire training process. This makes learning extremely noisy, sample-inefficient, and slow.
The Solution: Baselines and the Advantage Function
To combat high variance, we introduce a baseline b(s). A baseline is a function that depends only on the state s (the prompt) and not on the action a (the response). We can subtract this from the reward in our gradient estimate without introducing any bias:
∇_θ E[R] = E[ ∇_θ log π_θ(a|s) * (R(s, a) - b(s)) ]
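The subtraction introduces no bias because the score function has zero expectation under the policy, so the baseline term vanishes in expectation:
E[ ∇_θ log π_θ(a|s) * b(s) ] = b(s) * Σ_a π_θ(a|s) ∇_θ log π_θ(a|s) = b(s) * ∇_θ Σ_a π_θ(a|s) = b(s) * ∇_θ 1 = 0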
The Intuition: Imagine a prompt where all possible responses yield a very high reward (e.g., between 90 and 100). The naive policy gradient would reinforce all of them. However, by setting a baseline b(s) equal to the average reward for that prompt (e.g., 95), we create a more meaningful learning signal. Responses with a reward of 99 get a positive update (99 - 95 = +4), while responses with a reward of 91 get a negative update (91 - 95 = -4). This teaches the model not just to find good solutions, but to find the best solutions.
The Advantage Function. The term R(s, a) - b(s) is an estimate of the Advantage Function A(s, a). The ideal baseline is the Value Function V(s) = E[R|s], which is the expected reward from a given state. The advantage A(s, a) = R(s, a) - V(s) therefore measures how much better or worse a specific action was compared to the average expected outcome. Using the advantage dramatically reduces variance and stabilizes training.
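As a quick, illustrative check (not from the lecture), a toy one-parameter policy choosing between the two responses from the example above shows how centering the reward preserves the expected gradient while collapsing its variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy policy: pi_theta(a=1|s) = p, with score grad_theta log pi_theta(a|s) = (a - p)
# under a logistic parameterization. Rewards 91 and 99 match the prose example.
p = 0.5
actions = rng.binomial(1, p, size=100_000)
rewards = np.where(actions == 1, 99.0, 91.0)
scores = actions - p                           # grad_theta log pi_theta(a|s)

baseline = 95.0                                # b(s) ~= V(s), the mean reward for this prompt
grad_naive = scores * rewards                  # grad log pi * R
grad_centered = scores * (rewards - baseline)  # grad log pi * (R - b)

print(f"naive:    mean={grad_naive.mean():+.3f}  var={grad_naive.var():.1f}")
print(f"centered: mean={grad_centered.mean():+.3f}  var={grad_centered.var():.1f}")
# Both estimators have the same expectation (+2 here), but centering collapses the variance.
```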
Part 3: A Practical Implementation with Group Relative Policy Optimization (GRPO)
While powerful, estimating the value function V(s) typically requires training an additional, large neural network (a "critic"), which is a core component of complex algorithms like PPO. Group Relative Policy Optimization (GRPO) is a simpler algorithm that has become popular in RLVR because it cleverly avoids this.
The GRPO Advantage Estimate
GRPO's core innovation is its method for estimating the baseline V(s) on-the-fly without a critic model. The procedure is as follows (a code sketch follows the list):
- For a single prompt s, generate a group of G different responses by sampling from the current policy π_θ.
- Calculate the verifiable reward R_i for each of the G responses.
- The baseline V(s) is simply estimated as the mean of the rewards within that group: b(s) = mean(R_1, R_2, ..., R_G).
- The advantage for each response in the group is then calculated as: A_i = R_i - mean(R_group).
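A minimal sketch of this advantage computation, assuming the rewards are already collected into a (num_prompts, G) tensor; note that some GRPO implementations additionally normalize by the group's standard deviation, but the description above uses the mean only:

```python
import torch

def group_relative_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage estimate.

    group_rewards: (num_prompts, G) verifiable rewards for G sampled responses per prompt.
    Returns:       (num_prompts, G) advantages A_i = R_i - mean(R_group).
    """
    baseline = group_rewards.mean(dim=-1, keepdim=True)   # b(s) estimated per prompt
    return group_rewards - baseline

# Example: two prompts, G = 4 responses each, binary correctness rewards.
rewards = torch.tensor([[1., 0., 0., 1.],
                        [0., 0., 0., 1.]])
print(group_relative_advantages(rewards))
# -> [[ 0.5, -0.5, -0.5,  0.5],
#     [-0.25, -0.25, -0.25, 0.75]]
```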
The Full GRPO Training Loop
The lecture provides a code walkthrough that illustrates the end-to-end training process:
- Outer Loop (Epochs): Training is iterative. At the start of each major iteration (or epoch), the current policy model π_θ is frozen and becomes the reference model (π_ref). This reference model is used to compute a KL-divergence penalty, which regularizes the policy and prevents it from diverging too far from a stable, known distribution, a key technique for preventing catastrophic collapse during RL.
- Inner Loop (Training Steps): Within each epoch, multiple gradient update steps are performed (a schematic sketch follows the list).
  - a. Rollout Phase: For a batch of prompts, the current policy π_θ is used to generate a group of G responses for each prompt.
  - b. Reward & Advantage Calculation: The verifiable reward function is applied to every generated response. Then, for each prompt's group of responses, the group-relative advantage is calculated.
  - c. Loss Calculation: The total loss is computed. It consists of:
    - Policy Gradient Loss: The negative sum of the log-probabilities of each response, weighted by its calculated advantage.
    - (Optional) KL-Divergence Penalty: A term β * KL(π_θ || π_ref) is added to the loss to regularize the policy.
    - (Optional) PPO-Style Clipping: To prevent excessively large updates, the ratio of probabilities π_θ / π_ref can be clipped to a small window (e.g., [0.8, 1.2]), which is a core stabilizing mechanism from the PPO algorithm that is often incorporated into GRPO.
  - d. Parameter Update: The loss is backpropagated, and the policy model's parameters θ are updated via an optimizer like Adam.
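To tie the steps together, here is a schematic sketch of one inner-loop step under the description above; it is not the lecture's lecture_17.py, and the helpers generate_group, verifiable_reward, and sequence_log_prob are assumed placeholders.

```python
import torch

def grpo_step(policy, ref_policy, optimizer, prompts, G=8, beta=0.01, clip_eps=0.2):
    """One inner-loop step: rollout, reward, group-relative advantage, loss, parameter update."""
    total_loss = 0.0
    for prompt in prompts:
        # a. Rollout: sample a group of G responses from the current policy.
        responses = generate_group(policy, prompt, G)              # hypothetical helper

        # b. Reward & advantage: verifiable reward, centered within the group.
        rewards = torch.tensor([verifiable_reward(prompt, r) for r in responses])
        advantages = rewards - rewards.mean()

        for response, adv in zip(responses, advantages):
            # c. Loss: policy-gradient term with optional clipping and KL penalty.
            logp = sequence_log_prob(policy, prompt, response)      # sum of token log-probs
            with torch.no_grad():
                logp_ref = sequence_log_prob(ref_policy, prompt, response)
            ratio = torch.exp(logp - logp_ref)                      # clipped against the frozen reference, as above
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
            pg_loss = -torch.min(ratio * adv, clipped * adv)
            kl_penalty = beta * (logp - logp_ref)                   # crude per-sequence KL estimate
            total_loss = total_loss + pg_loss + kl_penalty

    # d. Parameter update.
    optimizer.zero_grad()
    total_loss = total_loss / (len(prompts) * G)
    total_loss.backward()
    optimizer.step()
    return float(total_loss)

# Outer loop: at the start of each epoch, freeze a copy of the policy as the reference model.
# for epoch in range(num_epochs):
#     ref_policy = copy.deepcopy(policy).eval()                     # requires `import copy`
#     for prompt_batch in prompt_batches:
#         grpo_step(policy, ref_policy, optimizer, prompt_batch)
```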
This entire process—rollout, reward, advantage estimation, and update—is repeated, iteratively refining the model's policy to increase the probability of generating high-reward, correct solutions. The lecture's coding example on a simple sorting task clearly demonstrates that using a baseline (centered rewards) is dramatically more effective and stable than the naive policy gradient approach.