For years, RLHF meant: (1) SFT a reference model, (2) train a reward model from pairwise preferences (Bradley–Terry/BTL), (3) do KL-regularized policy optimization (usually PPO) against that reward with a reference-policy anchor (stay close to SFT). This is what InstructGPT [1] popularized: RL with a KL term to the reference model, using PPO [2] to stabilize updates.
Direct Preference Optimization (DPO) [6] showed you can skip the explicit reward model and skip PPO. Why? Because (a) human preferences are naturally modeled with Bradley–Terry [3], and (b) the max-entropy / KL-regularized RL objective [4][5] has a closed-form optimal policy: an exponential tilt of the reference policy. Combine those and you get a purely supervised objective on preference pairs.
Step 1 — Bradley–Terry: preferences are logistic in score differences
Given two completions $y^+$ (preferred) and $y^-$ (rejected) for prompt $x$, Bradley–Terry says [3]

$$P(y^+ \succ y^- \mid x) \;=\; \sigma\big(r(x, y^+) - r(x, y^-)\big) \;=\; \frac{1}{1 + \exp\!\big(-(r(x, y^+) - r(x, y^-))\big)},$$

i.e., a logistic model over the score difference. In RLHF, that score has been the reward model’s output.
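As a quick sanity check on the formula, here is a tiny Python sketch (the scores are made up and `bt_prob` is just an illustrative helper, not from any library): a score gap of 1.2 points corresponds to roughly a 77% preference probability.

```python
import math

def bt_prob(score_pref: float, score_rej: float) -> float:
    """Bradley-Terry: probability the preferred completion wins,
    i.e. a sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_pref - score_rej)))

# Made-up reward-model scores: a 1.2-point gap gives ~77% win probability.
print(bt_prob(2.0, 0.8))  # ~0.769
```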
Step 2 — KL-regularized RL has a Boltzmann (exponential-tilt) solution
The classic KL-regularized objective (the one PPO approximates) is [2][4]

$$\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi(\cdot \mid x)\,\big\|\,\pi_{\text{ref}}(\cdot \mid x)\big].$$

Its (nonparametric) maximizer is closed-form:

$$\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) \;=\; \sum_{y}\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big).$$

Invert it to express reward in terms of a log-ratio:

$$r(x, y) \;=\; \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \beta \log Z(x).$$
That’s the max-entropy principle at work: the solution is always an exponential-family tilt of the base measure (here, the reference policy) [5].
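One quick way to convince yourself of the closed form is a toy discrete example: build the exponential tilt of a reference distribution and check that no other distribution scores higher on the KL-regularized objective. The sketch below assumes a made-up four-outcome "vocabulary" with arbitrary rewards and reference probabilities; nothing about it is specific to language models.

```python
import numpy as np

beta = 0.5
r = np.array([1.0, 0.2, -0.5, 0.8])      # made-up rewards per outcome
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])  # made-up reference policy

def objective(pi):
    """E_pi[r] - beta * KL(pi || pi_ref), the KL-regularized objective."""
    return np.sum(pi * r) - beta * np.sum(pi * np.log(pi / pi_ref))

# Closed-form maximizer: exponential tilt of the reference policy.
pi_star = pi_ref * np.exp(r / beta)
pi_star /= pi_star.sum()

# No randomly drawn policy should beat the tilt.
rng = np.random.default_rng(0)
for _ in range(1_000):
    pi = rng.dirichlet(np.ones(4))
    assert objective(pi) <= objective(pi_star) + 1e-9

print("pi* =", pi_star.round(3), " objective =", round(objective(pi_star), 4))
```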
Step 3 — DPO: plug the log-ratio into Bradley–Terry and train directly
DPO’s key move [6]: treat your current policy $\pi_\theta$ as the $\pi^*$ in the log-ratio above and use that log-ratio as the score in Bradley–Terry. That yields a purely supervised loss on preference pairs $(x, y^+, y^-)$:
Define the implicit score as the scaled log-ratio difference

$$\Delta_\theta \;=\; \beta \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\text{ref}}(y^+ \mid x)} \;-\; \beta \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\text{ref}}(y^- \mid x)}.$$

(The intractable $\beta \log Z(x)$ term is the same for $y^+$ and $y^-$, so it cancels in the difference.) Then minimize the binary logistic loss:

$$\mathcal{L}_{\text{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y^+,\, y^-)}\big[\log \sigma(\Delta_\theta)\big].$$
That’s it. No reward model, no rollouts, no PPO. You just compute log-probs from your current policy and the frozen reference policy, and do standard gradient descent.
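For concreteness, here is a minimal PyTorch-style sketch of the loss (the function name and arguments are mine, not from the paper): it takes the summed per-completion log-probabilities under the trainable policy and the frozen reference and returns the mean $-\log \sigma(\Delta_\theta)$.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos: torch.Tensor,   # log pi_theta(y+ | x), summed over tokens
             policy_logp_neg: torch.Tensor,   # log pi_theta(y- | x)
             ref_logp_pos: torch.Tensor,      # log pi_ref(y+ | x)
             ref_logp_neg: torch.Tensor,      # log pi_ref(y- | x)
             beta: float = 0.1) -> torch.Tensor:
    """Mean DPO loss: -log sigmoid(beta * (log-ratio(y+) - log-ratio(y-)))."""
    delta = beta * ((policy_logp_pos - ref_logp_pos)
                    - (policy_logp_neg - ref_logp_neg))
    return -F.logsigmoid(delta).mean()
```

Here $\beta$ plays the role of the KL temperature from Step 2: a larger $\beta$ corresponds to a stronger pull toward the reference policy.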
Why DPO displaced PPO in many stacks
- Simplicity & stability. Just a classification-style objective over pairs; no on-policy sampling or value baselines.
- Right inductive bias. KL anchoring is baked in via the reference-policy log-ratio (the “exponential tilt” solution of max-ent RL).
- Directly matches the data. We collect pairwise preferences; DPO trains directly on them (Bradley–Terry), instead of first distilling them into a reward model.
- Competitive quality. The DPO paper [6] reports results equal to or better than PPO-based RLHF on several alignment tasks, with far lower complexity.
Minimal “how-to” (pseudo-code)
- Data. Preference triples $(x, y^+, y^-)$.
- Models. Frozen $\pi_{\text{ref}}$ (your SFT) and trainable $\pi_\theta$ (init from $\pi_{\text{ref}}$).
- Compute $\Delta_\theta$ with temperature $\beta$ (tunable).
- Loss: $-\log \sigma(\Delta_\theta)$.
- Optimize $\theta$ with AdamW; no rollouts; standard LM training loop.
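Putting those bullets together, here is a hedged end-to-end sketch of one training step. It assumes a Hugging Face-style causal LM that exposes `.logits`, and batches carrying tokenized prompt+completion pairs with a mask over completion tokens; the helper names and batch keys (`pos_ids`, `pos_cmask`, ...) are illustrative, not a fixed API.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, attention_mask, completion_mask):
    """Summed log-prob of the completion tokens of each sequence under `model`."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)          # predict token t+1 from prefix
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logp = logprobs.gather(-1, targets).squeeze(-1)     # [batch, seq_len - 1]
    return (token_logp * completion_mask[:, 1:]).sum(dim=-1)  # sum over completion tokens only

def dpo_step(policy, ref, batch, optimizer, beta=0.1):
    """One DPO update on a batch of (x, y+, y-) triples: no rollouts, no reward model."""
    with torch.no_grad():  # pi_ref stays frozen
        ref_pos = sequence_logprob(ref, batch["pos_ids"], batch["pos_attn"], batch["pos_cmask"])
        ref_neg = sequence_logprob(ref, batch["neg_ids"], batch["neg_attn"], batch["neg_cmask"])
    pol_pos = sequence_logprob(policy, batch["pos_ids"], batch["pos_attn"], batch["pos_cmask"])
    pol_neg = sequence_logprob(policy, batch["neg_ids"], batch["neg_attn"], batch["neg_cmask"])

    delta = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    loss = -F.logsigmoid(delta).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (illustrative checkpoint name and learning rate):
# policy = AutoModelForCausalLM.from_pretrained("my-sft-checkpoint")
# ref    = AutoModelForCausalLM.from_pretrained("my-sft-checkpoint").eval()
# optimizer = torch.optim.AdamW(policy.parameters(), lr=5e-7)
```

The reference model's forward pass runs under `torch.no_grad()` since $\pi_{\text{ref}}$ stays frozen; everything else is an ordinary LM training loop.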
Recap
- Bradley–Terry models pairwise preferences with a logistic on score differences.
- Max-entropy KL-RL says the optimal policy is an exponential tilt of the reference; reward equals a log-prob ratio (policy vs reference).
- DPO plugs that log-ratio into Bradley–Terry and trains directly on comparisons.
- Result: no PPO, no reward model, competitive alignment quality, simpler training.
References
- [1] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
- [2] Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347
- [3] Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
- [4] Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909
- [5] Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review, 106(4), 620–630.
- [6] Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290; NeurIPS 2023. https://arxiv.org/abs/2305.18290