From PPO to DPO (and GRPO)

5 min read

PPO made RLHF work; DPO made it simple. This post derives DPO from the KL-regularized RLHF objective that PPO optimizes, explains why DPO is a supervised alternative (not RL), where it shines, and where RL (e.g., GRPO) still helps.


For years, RLHF meant: (1) SFT a reference model, (2) train a reward model from pairwise preferences (Bradley–Terry/BTL), (3) do KL-regularized policy optimization (usually PPO) against that reward with a reference-policy anchor (stay close to SFT). This is what InstructGPT [1] popularized: RL with a KL term to the reference model, using PPO [2] to stabilize updates. For a broader RL overview, see my companion note: Reinforcement Learning in LLMs – Why and How.

Direct Preference Optimization (DPO) showed you can skip the explicit reward model and skip PPO. Why? Because (a) human preferences are naturally modeled with Bradley–Terry [3], and (b) the max-entropy / KL-regularized RL objective [4] has a closed-form optimal policy—an exponential tilt of the reference policy. Combine those and you get a pure supervised objective on preference pairs.


Step 1 — Bradley–Terry: preferences are logistic in score differences

Given two completions $y^+$ (preferred) and $y^-$ (rejected) for prompt $x$, Bradley–Terry says [3]

$ \Pr(y^+ \succ y^- \mid x) = \frac{e^{s(x,y^+)}}{e^{s(x,y^+)} + e^{s(x,y^-)}} = \sigma\!\big(s(x,y^+)-s(x,y^-)\big), $

i.e., a logistic model over the score difference. In RLHF, that score has been the reward model’s output.
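
To make this concrete, here is a tiny numeric sketch (the scores below are made up for illustration):

```python
import math

def bt_preference_prob(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry: P(y+ preferred over y- | x) = sigmoid(s(x, y+) - s(x, y-))."""
    return 1.0 / (1.0 + math.exp(-(score_pos - score_neg)))

# Made-up reward-model scores for two completions of the same prompt.
print(bt_preference_prob(2.3, 1.1))  # ~0.77: the higher-scored completion wins about 77% of the time
```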


Step 2 — KL-regularized RL has a Boltzmann (exponential-tilt) solution

The classic KL-regularized objective (the one PPO-based RLHF optimizes) is [2][4]

$ \max_{\pi} \; \mathbb{E}_{y\sim \pi(\cdot\mid x)}[r(x,y)] \;-\; \beta\, D_{\text{KL}}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big). $

Its (nonparametric) maximizer is closed-form:

$ \pi^*(y\mid x) \propto \pi_{\text{ref}}(y\mid x)\, \exp\!\Big(\tfrac{1}{\beta} r(x,y)\Big). $

Invert it to express reward in terms of a log-ratio:

$ r(x,y) = \beta \Big[\log \pi^*(y\mid x) - \log \pi_{\text{ref}}(y\mid x)\Big] + \text{const}(x). $

That’s the max‑entropy principle at work: the solution is always an exponential-family tilt of the base measure (here, the reference policy) [5].
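
A toy sketch makes the tilt tangible (the reference probabilities and rewards below are illustrative): over a small discrete set of candidate completions, the optimal policy simply reweights the reference by $e^{r/\beta}$ and renormalizes.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta), renormalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Three candidate completions under a toy reference policy with toy rewards.
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
print(tilted_policy(ref_probs, rewards, beta=1.0))  # mild tilt toward the high-reward completion
print(tilted_policy(ref_probs, rewards, beta=0.1))  # low beta: nearly all mass on the highest reward
```

Large $\beta$ keeps the policy near $\pi_{\text{ref}}$; small $\beta$ lets the reward dominate.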


Step 3 — DPO: plug the log-ratio into Bradley–Terry and train directly

DPO’s key move [6]: treat your current policy $\pi_\theta$ as the $\pi^*$ in the log-ratio above and use that log-ratio as the score in Bradley–Terry. That yields a purely supervised loss on preference pairs $(x, y^+, y^-)$:

Define

$ \Delta_\theta(x,y^+,y^-) = \beta\Big( \big[\log \pi_\theta(y^+\!\mid x)-\log \pi_\theta(y^-\!\mid x)\big] -\big[\log \pi_{\text{ref}}(y^+\!\mid x)-\log \pi_{\text{ref}}(y^-\!\mid x)\big] \Big). $

Then minimize the binary logistic loss:

$ \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,y^+,y^-)}\Big[\log \sigma\big(\Delta_\theta(x,y^+,y^-)\big)\Big]. $

That’s it. No reward model, no rollouts, no PPO. You just compute log-probs from your current policy and the frozen reference policy, and do standard gradient descent.
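
With per-sequence log-probs in hand, the loss really is a few lines. A minimal PyTorch sketch (the tensor values are illustrative; in practice each log-prob is the sum of token log-probs of the completion given the prompt):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss from summed sequence log-probs (one scalar per example in the batch)."""
    # Delta_theta: policy log-ratio minus reference log-ratio, scaled by beta.
    delta = beta * ((policy_logp_pos - policy_logp_neg) - (ref_logp_pos - ref_logp_neg))
    # Binary logistic loss: -log sigmoid(Delta_theta).
    return -F.logsigmoid(delta).mean()

# Toy batch of two preference pairs (made-up log-probs).
loss = dpo_loss(
    policy_logp_pos=torch.tensor([-12.0, -9.5]),
    policy_logp_neg=torch.tensor([-13.0, -11.0]),
    ref_logp_pos=torch.tensor([-12.5, -10.0]),
    ref_logp_neg=torch.tensor([-12.8, -10.5]),
)
print(loss.item())
```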


Why DPO isn’t RL

  • Objective: a supervised logistic loss on preference pairs—not expected return under a reward.
  • Data: no on‑policy rollouts; you train on a fixed set of comparisons.
  • Signal: log‑prob ratios to a frozen reference (no reward model, no critic).
  • Constraint: the KL anchor is implicit via the reference log‑probs, not a penalty inside an RL objective.

In practice, you train DPO with the same stack as SFT (dataloaders + CE‑style loss), rather than an RL loop.

If you’re comparing gradient estimators for stochastic objectives, see: Reparameterization vs REINFORCE.


Why DPO displaced PPO in many stacks

  • Simplicity & stability. Just a classification-style objective over pairs; no on-policy sampling or value baselines.
  • Right inductive bias. KL anchoring is baked in via the reference-policy log-ratio (the “exponential tilt” solution of max-ent RL).
  • Directly matches the data. We collect pairwise preferences; DPO trains directly on them (Bradley–Terry), instead of first distilling them into a reward model.
  • Competitive quality. The DPO paper [6] reports equal or better results than PPO-based RLHF on several alignment tasks, with far lower complexity.

Minimal “how-to” (pseudo-code)

  1. Data. Preference triples $(x, y^+, y^-)$.
  2. Models. Frozen $\pi_{\text{ref}}$ (your SFT) and trainable $\pi_\theta$ (init from $\pi_{\text{ref}}$).
  3. Compute $\Delta_\theta$ with temperature $\beta$ (tunable).
  4. Loss: $-\log \sigma(\Delta_\theta)$.
  5. Optimize $\theta$ with AdamW; no rollouts; standard LM training loop (see the sketch below).
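
Putting the steps together, a sketch of the training loop. Assumptions: a Hugging Face-style causal LM whose forward pass returns `.logits`, a `loader` yielding tokenized preference triples, and a hypothetical `sequence_logprob` helper:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, completion_ids):
    """Hypothetical helper: sum of token log-probs of the completion given the prompt."""
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
    # Logits at position i predict token i+1, so slice to align with the completion tokens.
    logits = model(input_ids).logits[:, prompt_ids.shape[-1] - 1 : -1, :]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, completion_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

def train_dpo(policy, ref, loader, beta=0.1, lr=1e-6, num_steps=1000):
    ref.eval()                                  # frozen reference (the SFT model)
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _, (x, y_pos, y_neg) in zip(range(num_steps), loader):
        with torch.no_grad():                   # no gradients through the reference
            ref_pos = sequence_logprob(ref, x, y_pos)
            ref_neg = sequence_logprob(ref, x, y_neg)
        pol_pos = sequence_logprob(policy, x, y_pos)
        pol_neg = sequence_logprob(policy, x, y_neg)
        delta = beta * ((pol_pos - pol_neg) - (ref_pos - ref_neg))
        loss = -F.logsigmoid(delta).mean()      # same loss as above
        opt.zero_grad()
        loss.backward()
        opt.step()
```

No rollouts, no reward model, no critic: just forward passes and a logistic loss.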

What is GRPO?

GRPO (Group Relative Policy Optimization) [7] is a policy-gradient method that keeps PPO's stabilizers (importance ratios, clipping, optional KL to a reference) but replaces the learned value baseline with a group-relative baseline: rewards are centered (and scale-normalized) within a group of completions sampled for the same prompt. This lowers variance without training a critic and pairs well with verifiable (outcome) rewards, as in the sketch below. For a broader tour of RL in LLMs and why verifiable rewards matter, see Reinforcement Learning in LLMs – Why and How.
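
Here is a minimal sketch of the group-relative baseline (the importance-ratio and clipping machinery is standard PPO and omitted; the binary rewards below stand in for a verifier's pass/fail signal):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards has shape (num_prompts, group_size): one row per prompt, one column per sampled completion.
    The within-group mean is the baseline; dividing by the group std normalizes scale."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each; 1.0 = checker says "correct", 0.0 otherwise.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```

These advantages then take the place of critic-based advantages in the usual clipped policy-gradient update.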

Where DPO shines

  • Preference data fits naturally. Train directly on $(x, y^+, y^-)$ without fitting a reward model first.
  • Built‑in KL anchoring. The log‑prob ratio to $\pi_{\text{ref}}$ bakes the “stay close to SFT” prior into the objective.
  • Offline and simple. No on‑policy sampling, no value baselines; reuse your SFT stack.
  • Stable and cheap. Small set of knobs ($\beta$, sampling temps) and fast iteration.

Where RL (e.g., GRPO) still helps

  • Outcome / verifiable rewards. Math, code, or tasks with programmatic checkers benefit from exploration and outcome‑level credit.
  • Beyond the dataset. On‑policy search for better behaviors, curricula, and hard negatives.
  • Process signals and constraints. Rewarding reasoning traces, enforcing safety costs, or managing KL budgets/trust‑region updates.
  • Multi‑objective trade‑offs. Explicit penalties/bonuses are easier to tune in RL training loops.


Recap

  • Bradley–Terry models pairwise preferences with a logistic on score differences.
  • Max-entropy KL-RL says the optimal policy is an exponential tilt of the reference; reward equals a log-prob ratio (policy vs reference).
  • DPO plugs that log-ratio into Bradley–Terry and trains directly on comparisons.
  • Result: no PPO, no reward model, competitive alignment quality, simpler training.

References

  1. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022).
  2. Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347 (2017).
  3. Bradley, R. A., & Terry, M. E. “Rank analysis of incomplete block designs: I. The method of paired comparisons.” Biometrika (1952).
  4. Levine, S. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv preprint arXiv:1805.00909 (2018).
  5. Jaynes, E. T. “Information Theory and Statistical Mechanics.” Physical Review (1957).
  6. Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv preprint arXiv:2305.18290; ICLR 2024 (2023).
  7. DeepSeek‑AI. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv preprint arXiv:2402.03300 (2024).