Lecture 16 — Alignment: RL

6 min read

Overview: Beyond Human Preference – The Next Frontier of Alignment

This lecture explores the evolution of Reinforcement Learning (RL) in language models, moving beyond the paradigm of Reinforcement Learning from Human Feedback (RLHF). While RLHF is powerful for aligning models with subjective human preferences regarding style and safety, it suffers from a critical limitation: over-optimization. When a policy is trained for too long against a static, imperfect reward model (whether human- or AI-based), its performance eventually degrades as it learns to exploit the reward model's flaws rather than genuinely improving.

This lecture introduces a new paradigm: Reinforcement Learning with Verifiable Rewards (RLVR). This approach shifts the focus from subjective preferences to tasks with objectively correct, machine-verifiable outcomes, such as mathematics and coding. In these domains, the reward is not a proxy for quality but a direct signal of correctness (e.g., does the code pass the unit tests? Does the solution arrive at the correct final answer?). This allows for a more robust, scalable, and powerful application of RL, enabling models to significantly enhance their reasoning capabilities and push the state of the art, as demonstrated by groundbreaking models like DeepSeek-R1, Kimi K1.5, and Qwen3.


Part 1: Revisiting RLHF and its Limitations

Before diving into RLVR, the lecture recaps the key algorithms and failure modes of traditional preference-based alignment.

PPO vs. DPO: The Two Faces of RLHF

  • Proximal Policy Optimization (PPO): The original algorithm used in InstructGPT. It's a complex, "on-policy" RL method that involves training three separate models (policy, reward model, value function) and requires collecting new "rollouts" at each step. It is powerful but notoriously unstable and resource-intensive.
  • Direct Preference Optimization (DPO): A simpler, "offline" method that has largely replaced PPO. It achieves the same goal by reframing the RL objective into a simple pairwise loss function, eliminating the need for a separate reward model and on-policy sampling. It is more stable and much easier to implement.
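
To make this pairwise loss concrete, here is a minimal sketch in PyTorch. It assumes the summed per-response log-probabilities under the policy and a frozen reference model have already been computed; the tensor names and the beta value are illustrative, not a specific library's API.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """Pairwise DPO loss from summed per-response log-probabilities."""
        # Implicit reward of each response: how much more the policy prefers it
        # than the frozen reference model does, scaled by beta (the strength of
        # the KL penalty in the underlying RLHF objective).
        chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
        # Logistic loss pushes the chosen response's implicit reward above the
        # rejected one's.
        return -F.logsigmoid(chosen_margin - rejected_margin).mean()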

The Pitfalls of Optimizing a Proxy Reward

The central issue with RLHF is that the reward model is just an imperfect proxy for true quality. This leads to several problems:

  • Over-optimization: As shown in numerous studies, if you optimize a policy against a reward model for too long, the true win-rate against human preference peaks and then declines. The policy learns to "reward-hack" by generating outputs that exploit the reward model's biases (e.g., a preference for longer, more assertive responses) rather than actually getting better.
  • Mode Collapse and Poor Calibration: RLHF fine-tuning shifts the model's objective from accurately predicting the next token to maximizing reward. This often damages the model's calibration, making it overconfident in its answers. The entropy of its output distribution decreases, a phenomenon known as "mode collapse."
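
One practical way to watch for this collapse is to track the entropy of the policy's output distribution during training. Below is a minimal diagnostic sketch in PyTorch; it assumes access to the model's logits over the generated tokens, and the tensor names are illustrative.

    import torch

    def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """Average per-token entropy of the policy's output distribution.

        logits: [batch, seq_len, vocab] raw scores from the model.
        mask:   [batch, seq_len] with 1 for generated tokens, 0 elsewhere.
        A steady decline in this value over RLHF training is one symptom of
        mode collapse.
        """
        log_probs = torch.log_softmax(logits, dim=-1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
        return (entropy * mask).sum() / mask.sum()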

The fundamental limitation is that for subjective tasks, there is no ground truth, only a noisy preference signal. This motivates the shift to domains where a ground truth does exist.


Part 2: RLVR - Reinforcement Learning with Verifiable Rewards

RLVR focuses on tasks where the correctness of a model's output can be verified programmatically. This provides a clean, reliable reward signal that scales without human annotation.

Key Domains for RLVR

  • Mathematics: The final answer to a math problem can be checked for correctness.
  • Coding: A generated program can be executed against a set of unit tests to verify its functionality.
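
As a concrete illustration of the coding case, a toy verifiable-reward function might look like the sketch below. It assumes the model's output defines a function named solution and that test cases are (arguments, expected output) pairs; real pipelines run such checks inside a sandbox with time and memory limits rather than a bare exec.

    def code_reward(candidate_source: str, test_cases) -> float:
        """Binary verifiable reward for a coding problem.

        candidate_source: model-generated Python source assumed to define a
                          function named `solution` (an illustrative convention).
        test_cases:       iterable of (args_tuple, expected_output) pairs.
        Returns 1.0 only if every test passes; any failure or error gives 0.0.
        """
        namespace = {}
        try:
            exec(candidate_source, namespace)  # defines `solution`
            solution = namespace["solution"]
            for args, expected in test_cases:
                if solution(*args) != expected:
                    return 0.0
            return 1.0
        except Exception:
            return 0.0

For example, code_reward("def solution(x): return x * 2", [((3,), 6)]) evaluates to 1.0, while any failing or crashing candidate scores 0.0.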

The Rise of GRPO: A Simplified RL Algorithm

While PPO could be used for RLVR, many recent models have adopted a simpler algorithm called Group Relative Policy Optimization (GRPO).

  • How it Works: GRPO is a policy-gradient method that cleverly eliminates the need for a separate, learned value function (a major source of complexity in PPO). To estimate the "advantage" of a given response, it samples a group of G responses for a single prompt, computes their rewards, and then normalizes each response's reward by the mean and standard deviation of the group (a code sketch of both the original and corrected forms follows this list):
    Advantage(response_i) = (reward_i - mean(rewards_group)) / std(rewards_group)
  • Pros: It is much simpler and more memory-efficient than PPO because it doesn't require training a large value model.
  • Cons (and a Fix): As originally proposed, the division by the group standard deviation (together with a per-response length normalization in the loss) introduces a bias into the gradient updates. A corrected version, "Dr. GRPO," removes these terms, resulting in a more principled and empirically better-performing algorithm that avoids the length bias of the original.
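
A minimal sketch of the group-relative advantage computation for a single prompt is shown below; the dr_grpo flag name and the small epsilon are illustrative.

    import torch

    def grpo_advantages(rewards: torch.Tensor, dr_grpo: bool = False) -> torch.Tensor:
        """Group-relative advantages for G sampled responses to one prompt.

        rewards: shape [G], one scalar reward per response.
        Original GRPO divides by the group standard deviation; the Dr. GRPO
        correction keeps only mean-centering to avoid the induced bias.
        """
        centered = rewards - rewards.mean()
        if dr_grpo:
            return centered
        return centered / (rewards.std() + 1e-8)  # epsilon avoids division by zero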

Part 3: Case Studies in State-of-the-Art Reasoning Models

The lecture dissects the training pipelines of several recent models that have achieved breakthrough performance using RLVR.

1. DeepSeek-R1: The Model that Ignited the Reasoning Race

DeepSeek-R1 was a landmark release that demonstrated performance matching or exceeding top proprietary models on several reasoning benchmarks, using an open and relatively simple RLVR recipe.

  • Algorithm: The core is GRPO.
  • The Pipeline (Multi-stage Training):
    1. Reasoning SFT: The DeepSeek-V3 base model is first fine-tuned on a small amount of high-quality, long Chain-of-Thought (CoT) data. This "cold start" phase gives the model a strong initial reasoning ability and stabilizes the subsequent RL phase.
    2. Reasoning RL (GRPO): The SFT model is then optimized with GRPO. The reward is a binary signal for correctness on math and coding problems, plus a small penalty for language inconsistency in the CoT.
    3. General SFT/RLHF: After the reasoning is enhanced, the model undergoes a final alignment stage using standard SFT and RLHF on a general instruction-following dataset to ensure it remains a capable all-around assistant.
  • Key Finding (Distillation): The reasoning capability learned through this intensive RLVR process can be effectively distilled into other pre-trained models. By using R1 to generate 800k high-quality CoT solutions and then fine-tuning models like Qwen and Llama on this synthetic data, their reasoning performance was dramatically improved.
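
A sketch of what that distillation data-generation step could look like is given below. The helper names (teacher_generate, verify) are assumed interfaces rather than DeepSeek's actual tooling, and filtering samples through a verifier is one plausible way to keep the synthetic CoT data high-quality.

    def build_distillation_set(problems, teacher_generate, verify, n_samples=4):
        """Collect verified chain-of-thought solutions from an RL-trained teacher
        to use as SFT data for a smaller student model.

        teacher_generate(prompt, n) -> list of n completions   (assumed API)
        verify(problem, completion) -> bool                    (verifiable check)
        """
        sft_examples = []
        for problem in problems:
            for completion in teacher_generate(problem["prompt"], n_samples):
                if verify(problem, completion):
                    sft_examples.append({"prompt": problem["prompt"],
                                         "response": completion})
                    break  # keep one verified solution per problem
        return sft_examples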

2. Kimi K1.5: A Parallel Path to Excellence

Released contemporaneously with R1, Kimi K1.5 also achieved top-tier reasoning performance using a distinct but conceptually similar RLVR approach.

  • Data Curation: A key innovation was in data selection. They used a model-based difficulty filter, keeping only problems that a strong SFT model could not solve even with best-of-8 sampling (see the filtering sketch after this list). This focuses the RL training on the most challenging and informative problems.
  • RL Algorithm: They developed a custom policy-gradient algorithm inspired by DPO, using a squared-error loss and a reference-based reward model.
  • Length Control: To encourage more concise reasoning, they introduced an explicit length penalty into the reward function, incentivizing shorter CoTs for correct answers.
  • Advanced RL Infrastructure: Kimi's technical report details a sophisticated hybrid deployment framework using Megatron for training and vLLM for fast, parallelized rollouts, highlighting the critical systems engineering required for efficient RL at scale.
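
The difficulty filter referenced above might be implemented roughly as follows; sft_generate and verify are assumed helper interfaces, and the best-of-8 criterion follows the lecture's description.

    def difficulty_filter(problems, sft_generate, verify, k=8):
        """Keep only problems the SFT model cannot solve with best-of-k sampling.

        sft_generate(prompt, k) -> list of k sampled completions  (assumed API)
        verify(problem, completion) -> bool                       (verifiable check)
        Problems solved by at least one of the k samples are treated as too easy
        and dropped, so RL training concentrates on hard, informative cases.
        """
        hard_problems = []
        for problem in problems:
            samples = sft_generate(problem["prompt"], k)
            if not any(verify(problem, c) for c in samples):
                hard_problems.append(problem)
        return hard_problems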

3. Qwen3: Low-Data RLVR and Controllable Reasoning

The Qwen3 model demonstrates that significant reasoning improvements can be achieved with a surprisingly small amount of RL data.

  • The Playbook: Qwen3 follows the now-standard recipe: difficulty filtering of data, a long-CoT SFT "cold start," and then RLVR with GRPO on only ~4,000 carefully selected examples.
  • Thinking Mode Fusion: A novel technique to make the model's reasoning process controllable. They train the model on a mix of data with <think> and <no_think> tags. This allows the user at inference time to specify a "thinking budget" (e.g., max number of CoT tokens), after which the model is forced to conclude its reasoning and provide a final answer. This provides a direct trade-off between performance and latency.
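
A rough sketch of how such a thinking budget can be enforced at inference time is shown below. The generate function is an assumed completion-style API (returning newly generated text and halting on a stop string), and the forced-closing wording is illustrative; Qwen3's actual interruption text may differ.

    def generate_with_budget(generate, prompt, thinking_budget=512, answer_budget=256):
        """Force the model to wrap up its reasoning once the thinking budget is spent.

        generate(text, max_new_tokens, stop) -> str is an assumed API that returns
        the newly generated text, including the stop string if one was hit.
        """
        # Let the model reason inside <think> ... </think> up to the budget.
        thought = generate(prompt + "<think>\n",
                           max_new_tokens=thinking_budget,
                           stop=["</think>"])
        if "</think>" not in thought:
            # Budget exhausted: close the reasoning block ourselves so the model
            # must now produce its final answer (wording is illustrative).
            thought += "\nTime is up; I will answer based on my reasoning so far.\n</think>\n"
        answer = generate(prompt + "<think>\n" + thought,
                          max_new_tokens=answer_budget,
                          stop=None)
        return thought, answer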

Conclusion: The Future is Verifiable

RLVR represents a significant paradigm shift in how we align and enhance language models. By moving from noisy, subjective human preferences to clean, verifiable, and scalable reward signals, it unlocks a new phase of capability development, particularly in formal domains like math, science, and programming. The success of models like R1 and Kimi demonstrates that intensive optimization on verifiable tasks is a powerful method for eliciting and sharpening the latent reasoning abilities learned during pre-training. While RLHF remains essential for general helpfulness and safety, RLVR is proving to be the key to pushing the frontier of what these models can achieve in complex, logical domains.

Copyright 2025, Ran Ding