From PPO to DPO: Why Pairwise Preferences + Max-Entropy Make RL Optional

Tracing the evolution from PPO-based RLHF to Direct Preference Optimization, through Bradley–Terry and the maximum entropy principle.


For years, RLHF meant: (1) SFT a reference model, (2) train a reward model from pairwise preferences (Bradley–Terry/BTL), (3) do KL-regularized policy optimization (usually PPO) against that reward with a reference-policy anchor (stay close to SFT). This is what InstructGPT [1] popularized: RL with a KL term to the reference model, using PPO [2] to stabilize updates.

Direct Preference Optimization (DPO) showed you can skip the explicit reward model and skip PPO. Why? Because (a) human preferences are naturally modeled with Bradley–Terry [3], and (b) the max-entropy / KL-regularized RL objective [4][5] has a closed-form optimal policy—an exponential tilt of the reference policy. Combine those and you get a pure supervised objective on preference pairs.


Step 1 — Bradley–Terry: preferences are logistic in score differences

Given two completions $y^+$ (preferred) and $y^-$ (rejected) for prompt $x$, Bradley–Terry says [3]

$ \Pr(y^+ \succ y^- \mid x) = \frac{e^{s(x,y^+)}}{e^{s(x,y^+)} + e^{s(x,y^-)}} = \sigma\!\big(s(x,y^+)-s(x,y^-)\big), $

i.e., a logistic model over the score difference. In classic RLHF, that score is the reward model’s output.
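
As a tiny numerical illustration (the scores here are made up, not from any model), the preference probability is just a sigmoid of the score gap:

```python
import math

def bradley_terry_prob(score_pos: float, score_neg: float) -> float:
    """P(y+ preferred over y-) under Bradley-Terry: sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(score_pos - score_neg)))

# Toy scores: a gap of 1.2 gives roughly a 77% preference probability.
print(bradley_terry_prob(2.0, 0.8))  # ~0.769
```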


Step 2 — KL-regularized RL has a Boltzmann (exponential-tilt) solution

The classic KL-regularized objective (the one RLHF approximately optimizes with PPO) is [2][4]

$ \max_{\pi} \; \mathbb{E}_{y\sim \pi(\cdot\mid x)}[r(x,y)] \;-\; \beta\, D_{\text{KL}}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big). $

Its (nonparametric) maximizer is closed-form:

$ \pi^*(y\mid x) \propto \pi_{\text{ref}}(y\mid x)\, \exp\!\Big(\tfrac{1}{\beta} r(x,y)\Big). $

Invert it to express reward in terms of a log-ratio:

$ r(x,y) = \beta \Big[\log \pi^*(y\mid x) - \log \pi_{\text{ref}}(y\mid x)\Big] + \text{const}(x). $

That’s the max-entropy principle at work: the solution is always an exponential-family tilt of the base measure (here, the reference policy) [5].
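
For completeness, the closed form follows from a one-line rewrite of the objective (a sketch; $Z(x)$ denotes the partition function, which is not used elsewhere in this post):

$ \mathbb{E}_{y\sim\pi}\big[r(x,y)\big] - \beta\, D_{\text{KL}}\!\big(\pi\,\|\,\pi_{\text{ref}}\big) = -\,\beta\, D_{\text{KL}}\!\Big(\pi \,\Big\|\, \tfrac{1}{Z(x)}\,\pi_{\text{ref}}(\cdot\mid x)\, e^{r(x,\cdot)/\beta}\Big) + \beta \log Z(x), \qquad Z(x)=\sum_y \pi_{\text{ref}}(y\mid x)\, e^{r(x,y)/\beta}. $

The last term does not depend on $\pi$, so the objective is maximized exactly when the KL term vanishes, i.e., at the exponential tilt $\pi^*$ above.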


Step 3 — DPO: plug the log-ratio into Bradley–Terry and train directly

DPO’s key move [6]: treat your current policy $\pi_\theta$ as the $\pi^*$ in the log-ratio above and use that log-ratio as the score in Bradley–Terry. The prompt-dependent constant $\text{const}(x)$ cancels in the score difference, since both completions share the same prompt. That yields a purely supervised loss on preference pairs $(x, y^+, y^-)$:

Define

$ \Delta_\theta(x,y^+,y^-) = \beta\Big( \big[\log \pi_\theta(y^+\!\mid x)-\log \pi_\theta(y^-\!\mid x)\big] -\big[\log \pi_{\text{ref}}(y^+\!\mid x)-\log \pi_{\text{ref}}(y^-\!\mid x)\big] \Big). $

Then minimize the binary logistic loss:

$ \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,y^+,y^-)}\Big[\log \sigma\big(\Delta_\theta(x,y^+,y^-)\big)\Big]. $

That’s it. No reward model, no rollouts, no PPO. You just compute log-probs from your current policy and the frozen reference policy, and do standard gradient descent.


Why DPO displaced PPO in many stacks

  • Simplicity & stability. Just a classification-style objective over pairs; no on-policy sampling or value baselines.
  • Right inductive bias. KL anchoring is baked in via the reference-policy log-ratio (the “exponential tilt” solution of max-ent RL).
  • Directly matches the data. We collect pairwise preferences; DPO trains directly on them (Bradley–Terry), instead of first distilling them into a reward model.
  • Competitive quality. The DPO paper [6] reports results on par with or better than PPO-based RLHF on several alignment tasks, with far lower complexity.

Minimal “how-to” (pseudo-code)

  1. Data. Preference triples $(x, y^+, y^-)$.
  2. Models. Frozen $\pi_{\text{ref}}$ (your SFT) and trainable $\pi_\theta$ (init from $\pi_{\text{ref}}$).
  3. Compute $\Delta_\theta$ with temperature $\beta$ (tunable).
  4. Loss: $-\log \sigma(\Delta_\theta)$.
  5. Optimize $\theta$ with AdamW; no rollouts; standard LM training loop (see the sketch below).
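
Here is a minimal sketch of steps 3–4 in PyTorch. It assumes you already have per-sequence log-probabilities summed over completion tokens; the function and argument names (`dpo_loss`, `policy_logp_pos`, etc.) are placeholders for illustration, not from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg,
             ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss from summed per-sequence log-probs of (y+, y-) under
    the trainable policy and the frozen reference policy."""
    # Delta_theta: beta * (policy log-ratio minus reference log-ratio)
    policy_logratio = policy_logp_pos - policy_logp_neg
    ref_logratio = ref_logp_pos - ref_logp_neg
    delta = beta * (policy_logratio - ref_logratio)
    # Binary logistic loss: -log sigma(Delta_theta), averaged over the batch
    return -F.logsigmoid(delta).mean()

# Toy usage with fabricated log-probs (shape: [batch]):
policy_logp_pos = torch.tensor([-12.3, -8.1])
policy_logp_neg = torch.tensor([-13.0, -9.5])
ref_logp_pos    = torch.tensor([-12.5, -8.4])
ref_logp_neg    = torch.tensor([-12.9, -9.2])
print(dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg))
```

In a full training loop, the reference log-probs are computed once under `torch.no_grad()` with the frozen SFT model, and the loss above is backpropagated only through the policy.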

Recap

  • Bradley–Terry models pairwise preferences with a logistic on score differences.
  • Max-entropy KL-RL says the optimal policy is an exponential tilt of the reference; reward equals a log-prob ratio (policy vs reference).
  • DPO plugs that log-ratio into Bradley–Terry and trains directly on comparisons.
  • Result: no PPO, no reward model, competitive alignment quality, simpler training.

References

  1. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
  2. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347
  3. Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika.
  4. Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909
  5. Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review.
  6. Rafailov, R., et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv:2305.18290; ICLR 2024. https://arxiv.org/abs/2305.18290