Overview
Large language models (LLMs) are pretrained on internet-scale text with the simple goal of next-token prediction. That gives them fluent language, but not necessarily behavior aligned with human preferences or robust reasoning skills. Post-training — the phase where we refine a model’s behavior — is where reinforcement learning (RL) often comes in.
Supervised Fine-Tuning (SFT): Imitation
The first step is usually supervised fine-tuning (SFT). You collect prompt–answer pairs, often written by human annotators, and minimize the cross-entropy loss:

$$
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t} \log \pi_\theta\!\left(y^{*}_{t} \mid x,\, y^{*}_{<t}\right)
$$

Here $\pi_\theta$ is the model’s distribution and $y^{*}$ is the “gold” answer. This is per-token imitation: the model is told exactly what to output at every step.
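To make the per-token nature of this loss concrete, here is a minimal PyTorch-style sketch. It assumes a Hugging Face-style model whose forward pass exposes `.logits`; the tensor names and the `-100` masking convention are illustrative, not a specific library’s API.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Per-token cross-entropy against a fixed reference answer.

    input_ids:  (batch, seq_len) prompt + gold answer tokens
    target_ids: (batch, seq_len) the same sequence shifted left by one,
                with prompt positions masked out as -100
    """
    logits = model(input_ids).logits              # (batch, seq_len, vocab)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),         # flatten all token positions
        target_ids.view(-1),                      # gold next tokens
        ignore_index=-100,                        # skip masked prompt positions
    )
```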
This works when you can provide explicit demonstrations. But it assumes:
- There’s a single correct answer or reasoning path.
- Annotating data is feasible.
- Mimicking demonstrations is the end goal.
RLHF: Rewarding Preferences
Reinforcement learning from human feedback (RLHF) [1] changes the training signal. Instead of gold trajectories, humans compare outputs (A vs. B). A reward model $R_\phi(x, y)$ is trained from these comparisons, and the LLM is then optimized (often with PPO [2]) to maximize a KL-regularized objective [3]:

$$
\max_\theta \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ R_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)
$$

where $\pi_{\text{ref}}$ is the frozen SFT model and $\beta$ controls the strength of the penalty.
This objective:
- Pushes probability mass toward any output that humans prefer.
- Allows multiple valid completions, not just one reference.
- Balances optimization with a KL penalty to stay close to the SFT baseline.
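As a rough illustration of how this objective is estimated, below is a minimal sketch for a single sampled completion. It assumes you already have the reward model’s score and per-token log-probabilities under both the current policy and the frozen SFT reference; the function and argument names are illustrative. Real implementations optimize this with PPO [2], which adds clipping, a value baseline, and advantage estimation.

```python
import torch

def rlhf_objective(reward: float,
                   policy_logprobs: torch.Tensor,
                   ref_logprobs: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """KL-regularized sequence-level objective for one sampled completion.

    reward:          scalar from the reward model R_phi(x, y)
    policy_logprobs: (seq_len,) log pi_theta(y_t | x, y_<t) for the sampled tokens
    ref_logprobs:    (seq_len,) the same tokens scored by the frozen SFT model
    beta:            strength of the KL penalty
    """
    # Per-token log-ratio; its sum is a single-sample estimate of
    # KL(pi_theta || pi_ref) for this completion.
    kl_estimate = (policy_logprobs - ref_logprobs).sum()
    # Quantity to maximize: preferred outputs earn high reward,
    # drifting far from the SFT baseline is penalized.
    return reward - beta * kl_estimate
```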
The Common Feeling: “Isn’t This Just Another Loss Term?”
Many readers (myself included, at first) find RLHF underwhelming. If you squint, it looks like we just:
- Added a reward term (from the reward model).
- Added a regularization term (the KL penalty).
Isn’t that just another loss, like supervised training with extra seasoning?
In practice, yes — the optimizer just sees a loss function. But conceptually, there are two crucial differences:
- Unit of supervision:
  - SFT: per-token, the model must match the reference exactly.
  - RLHF: sequence-level, the model is scored on its own generated outputs.
- Exploration:
  - SFT never leaves the data distribution.
  - RLHF explores new completions and learns from feedback even when no demonstration exists.
It’s a subtle but important shift: from imitation to optimization.
Why the Distinction Matters in Reasoning
This difference is especially sharp for reasoning models.
- Supervised setup: You need annotated reasoning traces. If the model deviates from the trace, it gets punished.
- RL setup: You only need a correctness or preference signal at the end. The model can explore multiple reasoning paths and reinforce all the ones that succeed.
Formally, the policy gradient distributes the scalar reward across all tokens of the sampled trajectory [4]:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R(x, y) \sum_{t} \nabla_\theta \log \pi_\theta\big(y_t \mid x,\, y_{<t}\big) \right]
$$
This is why RLHF can train models to solve math or logic problems without ever providing full step-by-step traces.
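A minimal REINFORCE-style sketch [4] of this credit assignment, assuming a single end-of-sequence reward for a sampled reasoning trace (the names and the optional baseline are illustrative):

```python
import torch

def reinforce_loss(policy_logprobs: torch.Tensor,
                   reward: float,
                   baseline: float = 0.0) -> torch.Tensor:
    """REINFORCE surrogate loss for one sampled reasoning trace.

    policy_logprobs: (seq_len,) log pi_theta(y_t | x, y_<t) for the sampled tokens
    reward:          scalar, e.g. 1.0 if the final answer is correct, else 0.0
    baseline:        optional baseline to reduce variance (e.g. batch mean reward)
    """
    advantage = reward - baseline
    # The same scalar advantage weights every token's log-probability,
    # so minimizing this loss reinforces the entire successful trace.
    return -(advantage * policy_logprobs.sum())
```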
RLVR: Reinforcement Learning with Verifiers
A newer approach is RLVR — RL with verifiers.
- Instead of a reward model trained from human preferences, you use a programmatic verifier: e.g., a math checker, a compiler, or unit tests (e.g., CodeRL uses unit tests as rewards [5]).
- The model proposes reasoning chains; the verifier confirms correctness (pass/fail or graded score).
- The reward is automatic, scalable, and directly tied to task success.
This is a natural fit for reasoning tasks like math and coding, where correctness can be verified algorithmically. RLVR avoids the subjective noise of human-labeled preferences and focuses the optimization on true task success.
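As a toy illustration, the sketch below uses a last-number exact-match check as the verifier for a math task. Real verifiers are symbolic checkers, compilers, or unit-test harnesses (as in CodeRL [5]), but the resulting scalar plugs into the same policy-gradient update that RLHF uses with a learned reward model.

```python
import re

def math_verifier_reward(completion: str, gold_answer: str) -> float:
    """Programmatic reward: 1.0 if the extracted final answer matches, else 0.0.

    This toy version pulls the last number from the completion; production
    verifiers use symbolic math checkers, compilers, or unit-test suites.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0

# Example: the verifier's output replaces the learned reward model's score.
# reward = math_verifier_reward(sampled_completion, "42")
```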
Intuition
- SFT: “Copy the teacher’s solution exactly.”
- RLHF: “Write your own solution; if the grader likes it, you’ll be rewarded.”
- RLVR: “Write your own solution; the computer checks it — if it’s right, you’re rewarded.”
Conclusion
RL in LLM post-training can feel, at first, like we’re just designing new loss functions. But the shift in supervision level and the introduction of exploration are what make it different from SFT. For alignment tasks, RLHF turns human preferences into optimization signals. For reasoning, RLVR shows the full potential: models can explore reasoning paths freely, and objective verifiers provide precise, scalable rewards.
That’s why RL — in one form or another — remains central to shaping how models not only speak, but also reason.
References
1. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
2. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347
3. Levine, S. (2018). Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review. arXiv:1805.00909. https://arxiv.org/abs/1805.00909
4. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE). Machine Learning. https://link.springer.com/article/10.1007/BF00992696
5. Le, H., et al. (2022). CodeRL: Mastering Code Generation through Pretrained LMs and Deep Reinforcement Learning. NeurIPS. https://arxiv.org/abs/2207.01780