Links
- Lecture video: https://youtu.be/Dfu7vC9jo4w
- Course materials: lecture 15.pdf
Overview: From Raw Potential to Helpful and Harmless Assistants
This lecture addresses the final, crucial stage in creating a modern language model: alignment. While pre-training endows a model with a vast repository of knowledge and latent capabilities (think of a base model such as GPT-3), it remains a raw, unfiltered next-token predictor. Alignment is the process that transforms this base model into a helpful, harmless, and instruction-following assistant like ChatGPT.
The lecture details the standard, two-phase alignment pipeline pioneered by OpenAI and now used across the industry:
- Supervised Fine-Tuning (SFT): The initial phase, where the model learns the format of instruction-following and dialogue by imitating high-quality, human-written examples.
- Reinforcement Learning from Human Feedback (RLHF): The refinement phase, where the model's behavior is optimized to align with human preferences, moving beyond simple imitation to generate responses that are judged as being better, more helpful, or safer.
We will explore the data collection, training methodologies, and critical nuances of each phase, culminating in a deep dive into Direct Preference Optimization (DPO), a simpler and more stable alternative to traditional Reinforcement Learning that has revolutionized the field.
Part 1: Supervised Fine-Tuning (SFT) – Learning by Imitation
SFT is the first step in teaching a base model how to be a useful assistant. It is a standard supervised learning process.
The Data for SFT
- Format: The data consists of high-quality (prompt, response) pairs. These can be single-turn instructions or multi-turn conversational dialogues.
- Sources:
- Human-Written Data: Early efforts (like InstructGPT) and high-quality modern datasets (like OpenAssistant) relied on human crowdworkers to write both the prompts and the desired responses from scratch.
- Academic NLP Datasets: Collections like FLAN and Super-Natural Instructions programmatically converted thousands of existing NLP tasks (e.g., summarization, translation, Q&A) into an instruction format.
- Synthetic Data (Model-Generated): A major breakthrough was the use of powerful existing models (like GPT-3.5 or GPT-4) to generate vast quantities of instruction data. The Alpaca dataset, for example, was created by prompting GPT-3.5 with 175 human-written seed tasks to generate 52,000 new examples.
The SFT Process
The process is straightforward fine-tuning: the base model is trained on this dataset using a standard cross-entropy loss to maximize the likelihood of generating the target response given the prompt.
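As a minimal sketch (assuming a Hugging Face-style causal LM whose forward pass returns .logits, and a single pre-tokenized example), the SFT objective is just next-token cross-entropy on the response, with the prompt tokens masked out of the loss:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Next-token cross-entropy on the response only; prompt tokens are masked out."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1)   # (1, prompt_len + response_len)
    logits = model(input_ids).logits                            # (1, seq_len, vocab_size)

    # Shift so that the logits at position t predict the token at position t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Positions whose target is still a prompt token contribute no loss.
    shift_labels[:, : prompt_ids.size(-1) - 1] = -100

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```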
Key Insights and Challenges in SFT
- SFT Teaches Style and Format: SFT is incredibly effective at teaching the model how to behave like an assistant—how to follow instructions, structure answers with lists or code blocks, and adopt a helpful tone.
- SFT Does Not Teach New Knowledge: A critical insight, often referred to as the "hallucination and behavior cloning" problem, is that SFT is a poor method for teaching the model new factual knowledge.
- The Problem: Fine-tuning on a fact that isn't already strongly represented in the model's pre-trained "knowledge graph" doesn't teach it the fact. Instead, it teaches the model to hallucinate—to confidently state an answer even when it's just guessing. The model learns that the desired behavior in the SFT process is to always provide an answer, regardless of its internal confidence.
- The Takeaway: SFT should be used to elicit and format knowledge the model already possesses from pre-training, not to inject new facts.
- Safety Tuning: A small amount of SFT data is extremely effective for safety alignment. By fine-tuning on just a few hundred examples where the model is taught to refuse harmful or inappropriate requests (e.g., responding "I am sorry, I cannot tell you how."), its safety profile can be dramatically improved. However, this comes with the risk of over-refusal or "exaggerated safety," where the model starts refusing benign prompts (e.g., refusing to explain how to "kill" a Python process).
The Modern Approach: "Mid-training"
To leverage the scalability of pre-training with the quality of SFT data, many modern training recipes adopt a "mid-training" or "two-phase pre-training" approach. After the initial large-scale pre-training on web data, they continue training for a final phase where the data mixture is enriched with a significant portion of SFT and other high-quality data. This allows the model to absorb the instruction-following signal at scale without the risk of "catastrophic forgetting" that can occur with a short, separate fine-tuning stage.
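As an illustration only (none of these source names or proportions come from the lecture, and real recipes vary widely between labs), such a two-phase recipe can be thought of as two different mixture-weight settings over data sources:

```python
# Hypothetical illustration of mixture weights over data sources for a two-phase recipe.
PHASE_1_MIX = {"web_crawl": 0.90, "code": 0.07, "books": 0.03}            # large-scale pre-training
PHASE_2_MIX = {"web_crawl": 0.55, "code": 0.15, "books": 0.05,            # final "mid-training" phase,
               "sft_dialogues": 0.15, "curated_high_quality": 0.10}       # enriched with SFT-style data
```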
Part 2: Reinforcement Learning from Human Feedback (RLHF) – Learning from Preferences
While SFT teaches a model to imitate a specific set of "correct" answers, RLHF is a more powerful and scalable paradigm that optimizes the model to align with nuanced human preferences.
Why RLHF? The Gap Between Imitation and Preference
- Cost: It is far cheaper and easier for a human to rate which of two responses is better than to write a perfect response from scratch. This makes preference data collection more scalable.
- The Generation-Validation Gap: Humans are not always good at writing what they truly prefer. In summarization, for example, models fine-tuned to imitate human-written summaries have been rated lower than models optimized directly against human preferences. RLHF closes this gap.
The Classic RLHF Pipeline (InstructGPT)
The original process involves three steps:
- Step 1: Supervised Fine-Tuning (SFT): Start with a pre-trained base model and fine-tune it on a small, high-quality SFT dataset. This gives the model the basic ability to follow instructions.
- Step 2: Train a Reward Model (RM):
- A prompt is selected, and two different responses (y_w, the "winner," and y_l, the "loser") are generated by the SFT model.
- A human labeler indicates which response they prefer.
- This creates a large dataset of (prompt, winning_response, losing_response) tuples.
- A Reward Model (another LLM, initialized from the SFT model) is trained on this dataset. Its goal is to predict a scalar "reward" score for any given (prompt, response) pair. It is trained with a pairwise loss function to ensure that RM(prompt, y_w) > RM(prompt, y_l); a minimal sketch of this loss appears after this list.
- Step 3: Optimize the Policy with Reinforcement Learning (PPO):
- The SFT model (now called the "policy") is optimized using the Proximal Policy Optimization (PPO) algorithm.
- In this RL loop, the policy generates a response to a new prompt. The static reward model "judges" this response and provides a reward. PPO then updates the policy's weights to maximize this reward.
- KL-Divergence Penalty: A crucial component of this step is a KL-divergence penalty. The objective function penalizes the PPO policy if it deviates too far from the original SFT model. This acts as a regularizer, preventing the policy from "over-optimizing" to exploit flaws in the reward model and generating bizarre, unnatural text.
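As referenced above, here is a minimal sketch of the reward model's pairwise (Bradley-Terry style) objective from Step 2, assuming rm is any model that maps a (prompt, response) pair to a scalar score (e.g., a language model with a value head read out at the final token):

```python
import torch.nn.functional as F

def reward_model_loss(rm, prompt, chosen, rejected):
    """Pairwise loss: push RM(prompt, y_w) above RM(prompt, y_l)."""
    r_w = rm(prompt, chosen)    # scalar score for the preferred response
    r_l = rm(prompt, rejected)  # scalar score for the dispreferred response

    # -log sigmoid(r_w - r_l) is minimized when the winner reliably outscores the loser.
    return -F.logsigmoid(r_w - r_l).mean()
```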
The Problems with PPO
While effective, the PPO-based approach is notoriously complex, unstable, and difficult to implement. It involves training and coordinating multiple models, requires on-policy data collection, and is very sensitive to hyperparameters, making it a major engineering hurdle.
Part 3: Direct Preference Optimization (DPO) – RLHF Without the RL
Direct Preference Optimization (DPO) is a groundbreaking alternative that achieves the goal of RLHF without the complexity of reinforcement learning. It is simpler, more stable, and has become the new industry standard.
The DPO Derivation: From RL to a Simple Loss Function
The key insight of DPO is a mathematical reformulation of the RLHF objective. The derivation follows these conceptual steps:
- The constrained RL optimization problem has a known closed-form solution: the optimal policy π* is proportional to the reference policy π_ref times the exponentiated reward, exp(r(x, y) / β).
- This equation can be algebraically rearranged to express the reward model r(x, y) in terms of the optimal policy π* and the reference policy π_ref (written out below).
- This expression for the reward is then substituted directly into the pairwise reward model loss function from Step 2 of the classic RLHF pipeline.
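Written out explicitly in the standard DPO form (with β as the coefficient on the KL penalty and Z(x) a normalizing constant that depends only on the prompt):

π*(y|x) = (1/Z(x)) * π_ref(y|x) * exp(r(x, y) / β)

r(x, y) = β * log(π*(y|x) / π_ref(y|x)) + β * log Z(x)

Because the β * log Z(x) term is identical for both responses to the same prompt, it cancels inside the pairwise loss, which is what makes the substitution in the final step work.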
The result is a single, elegant loss function that directly optimizes the language model policy π_θ using only the reference model π_ref and the static preference dataset.
The DPO Loss Function and its Gradient
The DPO loss function is:
Loss_DPO = -E_(x, y_w, y_l)~D [ log σ( β * log(π_θ(y_w|x)/π_ref(y_w|x)) - β * log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]
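A minimal PyTorch-style sketch of this loss, assuming the log-probability of each full response (summed over its tokens, conditioned on the prompt) has already been computed under both the policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss given per-response log-probabilities summed over response tokens."""
    # Implicit rewards: beta * log(pi_theta(y|x) / pi_ref(y|x)) for each response.
    implicit_reward_w = beta * (policy_logp_w - ref_logp_w)
    implicit_reward_l = beta * (policy_logp_l - ref_logp_l)

    # -log sigmoid(reward_w - reward_l), averaged over the batch.
    return -F.logsigmoid(implicit_reward_w - implicit_reward_l).mean()
```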
Analyzing the gradient of this loss reveals its intuitive behavior:
- It increases the likelihood of the preferred (winner) response y_w.
- It decreases the likelihood of the dispreferred (loser) response y_l.
- The strength of these updates is implicitly weighted by how "wrong" the current policy is about the preference, scaled by the hyperparameter β (the explicit gradient is written out after this list).
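Concretely, the gradient from the DPO paper has the form (with the prompt x suppressed for brevity):

∇_θ Loss_DPO = -β * E [ σ( β * log(π_θ(y_l)/π_ref(y_l)) - β * log(π_θ(y_w)/π_ref(y_w)) ) * ( ∇_θ log π_θ(y_w) - ∇_θ log π_θ(y_l) ) ]

The sigmoid term is large exactly when the implicit reward of the loser exceeds that of the winner, i.e., when the policy currently ranks the pair incorrectly.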
The Impact of DPO
DPO simplifies the entire RLHF process into a single fine-tuning stage that is as stable and easy to implement as SFT. It requires no separate reward model, no sampling, and no complex RL machinery. Empirical results consistently show that DPO performs as well as, or even better than, a well-tuned PPO implementation, making it a far more accessible and efficient method for aligning models with human preferences.
Side Effects of RLHF: Over-Optimization and Mode Collapse
- Reward Overfitting: Optimizing a policy against a fixed reward model for too long can lead to "reward hacking," where the policy finds adversarial ways to maximize its score that do not correspond to genuine improvements. Performance on the true human preference distribution often peaks and then declines with excessive optimization.
- Mode Collapse and Poor Calibration: RLHF changes the model's objective from accurately modeling the true data distribution to maximizing a reward. This often leads to a loss of entropy and calibration. The model becomes overconfident, producing probability distributions that are no longer reliable measures of its actual uncertainty.