For years, RLHF meant: (1) SFT a reference model, (2) train a reward model from pairwise preferences (Bradley–Terry/BTL), (3) do KL-regularized policy optimization (usually PPO) against that reward with a reference-policy anchor (stay close to SFT). This is what InstructGPT [1] popularized: RL with a KL term to the reference model, using PPO [2] to stabilize updates. For a broader RL overview, see my companion note: Reinforcement Learning in LLMs – Why and How.
Direct Preference Optimization (DPO) showed you can skip the explicit reward model and skip PPO. Why? Because (a) human preferences are naturally modeled with Bradley–Terry [3], and (b) the max-entropy / KL-regularized RL objective [4] has a closed-form optimal policy—an exponential tilt of the reference policy. Combine those and you get a pure supervised objective on preference pairs.
Step 1 — Bradley–Terry: preferences are logistic in score differences
Given two completions $y^+$ (preferred) and $y^-$ (rejected) for prompt $x$, Bradley–Terry says [3]

$$
P(y^+ \succ y^- \mid x) \;=\; \sigma\big(r(x, y^+) - r(x, y^-)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
$$

i.e., a logistic model over the score difference. In RLHF, that score $r(x, y)$ has been the reward model’s output.
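To make the logistic concrete, here is a tiny sanity check (the scores are made up, not from any dataset): a one-nat score gap puts roughly 73% of the probability mass on the preferred completion.

```python
import math

def bt_preference_prob(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry: P(y+ preferred over y-) = sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_pos - score_neg)))

# Toy scores (made up): larger gaps -> more confident preferences.
print(bt_preference_prob(2.0, 1.0))  # gap 1.0 -> ~0.731
print(bt_preference_prob(3.0, 0.0))  # gap 3.0 -> ~0.953
```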
Step 2 — KL-regularized RL has a Boltzmann (exponential-tilt) solution
The classic KL-regularized objective (the one PPO approximates) is [2][4]

$$
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big].
$$

Its (nonparametric) maximizer is closed-form:

$$
\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big).
$$

Invert it to express reward in terms of a log-ratio:

$$
r(x, y) \;=\; \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x).
$$
That’s the max‑entropy principle at work: the solution is always an exponential-family tilt of the base measure (here, the reference policy) [5].
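A quick numerical check of that claim, as a toy sketch: a 4-way discrete “completion” space with made-up `pi_ref`, `reward`, and `beta` values (nothing here comes from the papers). The exponential tilt beats random perturbations of itself under the KL-regularized objective, and inverting it recovers the reward up to a per-prompt constant.

```python
import numpy as np

# Toy check of the exponential-tilt solution on a 4-action "bandit" (assumption:
# discrete actions stand in for completions y; all values below are illustrative).
rng = np.random.default_rng(0)
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
reward = np.array([1.0, 0.5, 0.0, -0.5])
beta = 0.5

# Closed-form maximizer: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)
tilt = pi_ref * np.exp(reward / beta)
pi_star = tilt / tilt.sum()

def objective(pi):
    # E_pi[r] - beta * KL(pi || pi_ref)
    return np.sum(pi * reward) - beta * np.sum(pi * np.log(pi / pi_ref))

# pi_star should beat random perturbations of itself.
for _ in range(1000):
    probe = pi_star + 0.05 * rng.normal(size=4)
    probe = np.clip(probe, 1e-6, None)
    probe /= probe.sum()
    assert objective(pi_star) >= objective(probe) - 1e-9

# Inverting the tilt recovers the reward up to a constant (beta * log Z).
recovered = beta * np.log(pi_star / pi_ref)
print(np.round(recovered - recovered[0], 6))  # matches reward - reward[0]
print(np.round(reward - reward[0], 6))
```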
Step 3 — DPO: plug the log-ratio into Bradley–Terry and train directly
DPO’s key move [6]: treat your current policy $\pi_\theta$ as the $\pi^*$ in the log-ratio above and use that log-ratio as the score in Bradley–Terry. That yields a purely supervised loss on preference pairs $(x, y^+, y^-)$:
Define

$$
\Delta_\theta(x, y^+, y^-) \;=\; \beta \left[ \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)} \;-\; \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right].
$$

(The awkward $\beta \log Z(x)$ term cancels in the difference because it depends only on the prompt.) Then minimize the binary logistic loss:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y^+,\, y^-) \sim \mathcal{D}}\Big[\log \sigma\big(\Delta_\theta(x, y^+, y^-)\big)\Big].
$$
That’s it. No reward model, no rollouts, no PPO. You just compute log-probs from your current policy and the frozen reference policy, and do standard gradient descent.
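A minimal sketch of that loss in PyTorch, assuming you already have summed per-sequence log-probabilities for each completion under both the trainable policy and the frozen reference (the tensor names and the default $\beta$ are illustrative, not the paper’s reference code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta: float = 0.1):
    """DPO loss from per-sequence (summed over tokens) log-probabilities.

    Each argument is a 1-D tensor of shape [batch]; the reference log-probs
    come from the frozen SFT model and carry no gradient.
    """
    # delta = beta * [(log pi_theta(y+|x) - log pi_ref(y+|x))
    #                 - (log pi_theta(y-|x) - log pi_ref(y-|x))]
    delta = beta * ((policy_logp_pos - ref_logp_pos)
                    - (policy_logp_neg - ref_logp_neg))
    # Binary logistic loss: -log sigmoid(delta), averaged over the batch.
    return -F.logsigmoid(delta).mean()
```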
Why DPO isn’t RL
- Objective: a supervised logistic loss on preference pairs—not expected return under a reward.
- Data: no on‑policy rollouts; you train on a fixed set of comparisons.
- Signal: log‑prob ratios to a frozen reference (no reward model, no critic).
- Constraint: the KL anchor is implicit via the reference log‑probs, not a penalty inside an RL objective.
In practice, you train DPO with the same stack as SFT (dataloaders + CE‑style loss), rather than an RL loop.
If you’re comparing gradient estimators for stochastic objectives, see: Reparameterization vs REINFORCE.
Why DPO displaced PPO in many stacks
- Simplicity & stability. Just a classification-style objective over pairs; no on-policy sampling or value baselines.
- Right inductive bias. KL anchoring is baked in via the reference-policy log-ratio (the “exponential tilt” solution of max-ent RL).
- Directly matches the data. We collect pairwise preferences; DPO trains directly on them (Bradley–Terry), instead of first distilling them into a reward model.
- Competitive quality. The DPO paper [6] reports results equal to or better than PPO-based RLHF on several alignment tasks, with far lower complexity.
Minimal “how-to” (pseudo-code)
- Data. Preference triples $(x, y^+, y^-)$.
- Models. Frozen $\pi_{\text{ref}}$ (your SFT) and trainable $\pi_\theta$ (init from $\pi_{\text{ref}}$).
- Compute $\Delta_\theta$ with temperature $\beta$ (tunable).
- Loss: $-\log \sigma(\Delta_\theta)$.
- Optimize $\theta$ with AdamW; no rollouts; standard LM training loop (see the sketch below).
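Putting the bullets together, here is a hedged sketch of the full loop. Assumptions: `policy` and `ref` are HuggingFace-style causal LMs, the dataloader yields pre-tokenized tensors in which prompt positions of the label tensors are masked with `-100`, and `dpo_loss` is the helper sketched earlier; none of this is the DPO paper’s reference implementation.

```python
import torch

def sequence_logprob(model, input_ids, labels):
    """Sum of completion-token log-probs under `model` (teacher forcing).
    Prompt positions in `labels` are assumed to be masked with -100 upstream."""
    logits = model(input_ids).logits[:, :-1]           # predict token t+1 from prefix
    targets = labels[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    mask = (targets != -100).float()                    # keep completion tokens only
    tok_logps = torch.gather(
        logps, 2, targets.clamp(min=0).unsqueeze(-1)).squeeze(-1)
    return (tok_logps * mask).sum(dim=-1)               # [batch]

def train_dpo(policy, ref, loader, beta=0.1, lr=1e-6, epochs=1):
    ref.eval()                                          # frozen reference (the SFT model)
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for ids_pos, labels_pos, ids_neg, labels_neg in loader:
            with torch.no_grad():                       # reference log-probs: no gradient
                ref_pos = sequence_logprob(ref, ids_pos, labels_pos)
                ref_neg = sequence_logprob(ref, ids_neg, labels_neg)
            pol_pos = sequence_logprob(policy, ids_pos, labels_pos)
            pol_neg = sequence_logprob(policy, ids_neg, labels_neg)
            loss = dpo_loss(pol_pos, pol_neg, ref_pos, ref_neg, beta)
            opt.zero_grad()
            loss.backward()
            opt.step()
```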
What is GRPO?
GRPO (Group Relative Policy Optimization) [7] is a policy‑gradient method that keeps PPO’s stabilizers (importance ratios, clipping, optional KL to a reference) but replaces a learned value baseline with group‑relative baselines—centering rewards within a group of completions sampled for the same prompt. This lowers variance without training a critic and pairs well with verifiable (outcome) rewards. For a broader tour of RL in LLMs and why verifiable rewards matter, see Reinforcement Learning in LLMs – Why and How.
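As a sketch of the group-relative baseline idea (assuming $G$ sampled completions per prompt, each scored by a verifier or reward function; the tensor shapes and the standard-deviation scaling are my own framing, and some variants skip that scaling):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Center (and scale) rewards within each group of completions for one prompt.

    rewards: [num_prompts, group_size] scalar rewards, e.g. from a programmatic checker.
    Returns advantages of the same shape, used in place of a learned critic's baseline.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)   # some implementations use only the mean-centering
```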
Where DPO shines
- Preference data fits naturally. Train directly on $(x, y^+, y^-)$ without fitting a reward model first.
- Built‑in KL anchoring. The log‑prob ratio to $\pi_{\text{ref}}$ bakes the “stay close to SFT” prior into the objective.
- Offline and simple. No on‑policy sampling, no value baselines; reuse your SFT stack.
- Stable and cheap. Small set of knobs ($\beta$, sampling temps) and fast iteration.
Where RL (e.g., GRPO) still helps
- Outcome / verifiable rewards. Math, code, or tasks with programmatic checkers benefit from exploration and outcome‑level credit.
- Beyond the dataset. On‑policy search for better behaviors, curricula, and hard negatives.
- Process signals and constraints. Rewarding reasoning traces, enforcing safety costs, or managing KL budgets/trust‑region updates.
- Multi‑objective trade‑offs. Explicit penalties/bonuses are easier to tune in RL training loops.
Recap
- Bradley–Terry models pairwise preferences with a logistic on score differences.
- Max-entropy KL-RL says the optimal policy is an exponential tilt of the reference; reward equals a $\beta$-scaled log-prob ratio (policy vs reference), up to a per-prompt constant.
- DPO plugs that log-ratio into Bradley–Terry and trains directly on comparisons.
- Result: no PPO, no reward model, competitive alignment quality, simpler training.
References
- [1] Ouyang, L., et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022).
- [2] Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347 (2017).
- [3] Bradley, R. A., & Terry, M. E. “Rank analysis of incomplete block designs: I. The method of paired comparisons.” Biometrika (1952).
- [4] Levine, S. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv preprint arXiv:1805.00909 (2018).
- [5] Jaynes, E. T. “Information Theory and Statistical Mechanics.” Physical Review (1957).
- [6] Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv preprint arXiv:2305.18290 (2023); NeurIPS 2023.
- [7] DeepSeek‑AI. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv preprint arXiv:2402.03300 (2024).