From PPO to DPO (and GRPO)

5 min read

PPO made RLHF work; DPO made it simple. This post derives DPO from the KL-regularized RLHF objective that PPO optimizes, explains why DPO is a supervised alternative (not RL), where it shines, and where RL (e.g., GRPO) still helps.


For years, RLHF meant: (1) SFT a reference model, (2) train a reward model from pairwise preferences (Bradley–Terry/BTL), (3) do KL-regularized policy optimization (usually PPO) against that reward with a reference-policy anchor (stay close to SFT). This is what InstructGPT [1] popularized: RL with a KL term to the reference model, using PPO [2] to stabilize updates. For a broader RL overview, see my companion note: Reinforcement Learning in LLMs – Why and How.

Direct Preference Optimization (DPO) showed you can skip the explicit reward model and skip PPO. Why? Because (a) human preferences are naturally modeled with Bradley–Terry [3], and (b) the max-entropy / KL-regularized RL objective [4] has a closed-form optimal policy—an exponential tilt of the reference policy. Combine those and you get a pure supervised objective on preference pairs.


Step 1 — Bradley–Terry: preferences are logistic in score differences

Given two completions $y^+$ (preferred) and $y^-$ (rejected) for prompt $x$, Bradley–Terry says [3]

$ \Pr(y^+ \succ y^- \mid x) = \frac{e^{s(x,y^+)}}{e^{s(x,y^+)} + e^{s(x,y^-)}} = \sigma\!\big(s(x,y^+)-s(x,y^-)\big), $

i.e., a logistic model over the score difference. In RLHF, that score has been the reward model’s output.
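
To make this concrete, here is a tiny numeric sketch (the scores below are made up for illustration):

```python
import math

def bt_preference_prob(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry: P(y+ preferred over y- | x) = sigmoid(s(x, y+) - s(x, y-))."""
    return 1.0 / (1.0 + math.exp(-(score_pos - score_neg)))

# Made-up reward-model scores for two completions of the same prompt.
print(bt_preference_prob(2.3, 1.1))  # ~0.77: the higher-scored completion wins about 77% of the time
```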


Step 2 — KL-regularized RL has a Boltzmann (exponential-tilt) solution

The classic KL-regularized objective (the one PPO-based RLHF optimizes) is [2][4]

$ \max_{\pi} \; \mathbb{E}_{y\sim \pi(\cdot\mid x)}[r(x,y)] \;-\; \beta\, D_{\text{KL}}\!\big(\pi(\cdot\mid x)\,\|\,\pi_{\text{ref}}(\cdot\mid x)\big). $

Its (nonparametric) maximizer is closed-form:

$ \pi^*(y\mid x) \propto \pi_{\text{ref}}(y\mid x)\, \exp\!\Big(\tfrac{1}{\beta} r(x,y)\Big). $

Invert it to express reward in terms of a log-ratio:

$ r(x,y) = \beta \Big[\log \pi^*(y\mid x) - \log \pi_{\text{ref}}(y\mid x)\Big] + \text{const}(x). $

That’s the max‑entropy principle at work: the solution is always an exponential-family tilt of the base measure (here, the reference policy) [5].
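
A toy sketch makes the tilt tangible (the reference probabilities and rewards below are illustrative): over a small discrete set of candidate completions, the optimal policy simply reweights the reference by $e^{r/\beta}$ and renormalizes.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta), renormalized."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

# Three candidate completions under a toy reference policy with toy rewards.
ref_probs = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
print(tilted_policy(ref_probs, rewards, beta=1.0))  # mild tilt toward the high-reward completion
print(tilted_policy(ref_probs, rewards, beta=0.1))  # low beta: nearly all mass on the highest reward
```

Large $\beta$ keeps the policy near $\pi_{\text{ref}}$; small $\beta$ lets the reward dominate.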


Step 3 — DPO: plug the log-ratio into Bradley–Terry and train directly

DPO’s key move [6]: treat your current policy $\pi_\theta$ as the $\pi^*$ in the log-ratio above and use that log-ratio as the score in Bradley–Terry. That yields a purely supervised loss on preference pairs $(x, y^+, y^-)$:

Define

$ \Delta_\theta(x,y^+,y^-) = \beta\Big( \big[\log \pi_\theta(y^+\!\mid x)-\log \pi_\theta(y^-\!\mid x)\big] -\big[\log \pi_{\text{ref}}(y^+\!\mid x)-\log \pi_{\text{ref}}(y^-\!\mid x)\big] \Big). $

Then minimize the binary logistic loss:

$ \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,y^+,y^-)}\Big[\log \sigma\big(\Delta_\theta(x,y^+,y^-)\big)\Big]. $

That’s it. No reward model, no rollouts, no PPO. You just compute log-probs from your current policy and the frozen reference policy, and do standard gradient descent.
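
With per-sequence log-probs in hand, the loss really is a few lines. A minimal PyTorch sketch (the tensor values are illustrative; in practice each log-prob is the sum of token log-probs of the completion given the prompt):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss from summed sequence log-probs (one scalar per example in the batch)."""
    # Delta_theta: policy log-ratio minus reference log-ratio, scaled by beta.
    delta = beta * ((policy_logp_pos - policy_logp_neg) - (ref_logp_pos - ref_logp_neg))
    # Binary logistic loss: -log sigmoid(Delta_theta).
    return -F.logsigmoid(delta).mean()

# Toy batch of two preference pairs (made-up log-probs).
loss = dpo_loss(
    policy_logp_pos=torch.tensor([-12.0, -9.5]),
    policy_logp_neg=torch.tensor([-13.0, -11.0]),
    ref_logp_pos=torch.tensor([-12.5, -10.0]),
    ref_logp_neg=torch.tensor([-12.8, -10.5]),
)
print(loss.item())
```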


Why DPO isn’t RL

  • Objective: a supervised logistic loss on preference pairs—not expected return under a reward.
  • Data: no on‑policy rollouts; you train on a fixed set of comparisons.
  • Signal: log‑prob ratios to a frozen reference (no reward model, no critic).
  • Constraint: the KL anchor is implicit via the reference log‑probs, not a penalty inside an RL objective.

In practice, you train DPO with the same stack as SFT (dataloaders + CE‑style loss), rather than an RL loop.

If you’re comparing gradient estimators for stochastic objectives, see: Reparameterization vs REINFORCE.


Why DPO displaced PPO in many stacks

  • Simplicity & stability. Just a classification-style objective over pairs; no on-policy sampling or value baselines.
  • Right inductive bias. KL anchoring is baked in via the reference-policy log-ratio (the “exponential tilt” solution of max-ent RL).
  • Directly matches the data. We collect pairwise preferences; DPO trains directly on them (Bradley–Terry), instead of first distilling them into a reward model.
  • Competitive quality. The DPO paper [6] reports equal or better results than PPO-based RLHF on several alignment tasks, with far lower complexity.

Minimal “how-to” (pseudo-code)

  1. Data. Preference triples $(x, y^+, y^-)$.
  2. Models. Frozen $\pi_{\text{ref}}$ (your SFT) and trainable $\pi_\theta$ (init from $\pi_{\text{ref}}$).
  3. Compute $\Delta_\theta$ with temperature $\beta$ (tunable).
  4. Loss: $-\log \sigma(\Delta_\theta)$.
  5. Optimize $\theta$ with AdamW; no rollouts; standard LM training loop (see the sketch below).
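
Putting the steps together, a sketch of the training loop. Assumptions: a Hugging Face-style causal LM whose forward pass returns `.logits`, a `loader` yielding tokenized preference triples, and a hypothetical `sequence_logprob` helper:

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, prompt_ids, completion_ids):
    """Hypothetical helper: sum of token log-probs of the completion given the prompt."""
    input_ids = torch.cat([prompt_ids, completion_ids], dim=-1)
    # Logits at position i predict token i+1, so slice to align with the completion tokens.
    logits = model(input_ids).logits[:, prompt_ids.shape[-1] - 1 : -1, :]
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, completion_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

def train_dpo(policy, ref, loader, beta=0.1, lr=1e-6, num_steps=1000):
    ref.eval()                                  # frozen reference (the SFT model)
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    for _, (x, y_pos, y_neg) in zip(range(num_steps), loader):
        with torch.no_grad():                   # no gradients through the reference
            ref_pos = sequence_logprob(ref, x, y_pos)
            ref_neg = sequence_logprob(ref, x, y_neg)
        pol_pos = sequence_logprob(policy, x, y_pos)
        pol_neg = sequence_logprob(policy, x, y_neg)
        delta = beta * ((pol_pos - pol_neg) - (ref_pos - ref_neg))
        loss = -F.logsigmoid(delta).mean()      # same loss as above
        opt.zero_grad()
        loss.backward()
        opt.step()
```

No rollouts, no reward model, no critic: just forward passes and a logistic loss.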

What is GRPO?

GRPO (Group Relative Policy Optimization) [7] is a policy-gradient method that keeps PPO's stabilizers (importance ratios, clipping, optional KL to a reference) but replaces the learned value baseline with a group-relative baseline: rewards are centered (and scale-normalized) within a group of completions sampled for the same prompt. This lowers variance without training a critic and pairs well with verifiable (outcome) rewards, as in the sketch below. For a broader tour of RL in LLMs and why verifiable rewards matter, see Reinforcement Learning in LLMs – Why and How.
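
Here is a minimal sketch of the group-relative baseline (the importance-ratio and clipping machinery is standard PPO and omitted; the binary rewards below stand in for a verifier's pass/fail signal):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards has shape (num_prompts, group_size): one row per prompt, one column per sampled completion.
    The within-group mean is the baseline; dividing by the group std normalizes scale."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled completions each; 1.0 = checker says "correct", 0.0 otherwise.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [1.0, 1.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```

These advantages then take the place of critic-based advantages in the usual clipped policy-gradient update.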

Where DPO shines

  • Preference data fits naturally. Train directly on $(x, y^+, y^-)$ without fitting a reward model first.
  • Built‑in KL anchoring. The log‑prob ratio to $\pi_{\text{ref}}$ bakes the “stay close to SFT” prior into the objective.
  • Offline and simple. No on‑policy sampling, no value baselines; reuse your SFT stack.
  • Stable and cheap. Small set of knobs ($\beta$, sampling temps) and fast iteration.

Where RL (e.g., GRPO) still helps

  • Outcome / verifiable rewards. Math, code, or tasks with programmatic checkers benefit from exploration and outcome‑level credit.
  • Beyond the dataset. On‑policy search for better behaviors, curricula, and hard negatives.
  • Process signals and constraints. Rewarding reasoning traces, enforcing safety costs, or managing KL budgets/trust‑region updates.
  • Multi‑objective trade‑offs. Explicit penalties/bonuses are easier to tune in RL training loops.


Recap

  • Bradley–Terry models pairwise preferences with a logistic on score differences.
  • Max-entropy KL-RL says the optimal policy is an exponential tilt of the reference; reward equals a log-prob ratio (policy vs reference).
  • DPO plugs that log-ratio into Bradley–Terry and trains directly on comparisons.
  • Result: no PPO, no reward model, competitive alignment quality, simpler training.

References

  1. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022).
  2. Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347 (2017).
  3. Bradley, R. A., & Terry, M. E. “Rank analysis of incomplete block designs: I. The method of paired comparisons.” Biometrika (1952).
  4. Levine, S. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv preprint arXiv:1805.00909 (2018).
  5. Jaynes, E. T. “Information Theory and Statistical Mechanics.” Physical Review (1957).
  6. Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv preprint arXiv:2305.18290; ICLR 2024 (2023).
  7. DeepSeek‑AI. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv preprint arXiv:2402.03300 (2024).