Reinforcement Learning in LLMs - Why and How


Why RL matters for LLMs and how to train them: SFT vs RLHF, verifiable rewards (RLVR), and a practical GRPO-style recipe.

Part 1: Why Do We Need RL in LLM Post-Training?

Large language models (LLMs) are pretrained on internet-scale text with the simple goal of next-token prediction. That gives them fluent language, but not necessarily behavior aligned with human preferences or robust reasoning skills. Post‑training — the phase where we refine a model’s behavior — is where reinforcement learning (RL) often comes in.

In this section we'll talk about how SFT, RLHF, and RL with verifiable rewards differ — and why sequence‑level rewards and exploration make RL essential for shaping LLM behavior and reasoning.

Supervised Fine-Tuning (SFT): Imitation

The first step is usually supervised fine-tuning (SFT). You collect prompt–answer pairs, often written by human annotators, and minimize cross-entropy loss:

$ \mathcal{L}_{\text{SFT}}(\theta) = - \sum_t \log \pi_\theta(y_t^* \mid x, y_{\lt t}^*) $

Here $\pi_\theta$ is the model’s distribution and $y^*$ is the “gold” answer. This is per-token imitation: the model is told exactly what to output at every step.
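
In code, this is just cross-entropy over the gold answer tokens. A minimal sketch (the model here is any callable that maps token ids to next-token logits; names, shapes, and the masking convention are illustrative):

import torch
import torch.nn.functional as F

def sft_loss(model, input_ids, target_ids):
    # input_ids:  [batch, seq] prompt + gold answer tokens (teacher forcing)
    # target_ids: [batch, seq] the same tokens shifted left by one position,
    #             with prompt positions set to -100 so they are ignored
    logits = model(input_ids)  # [batch, seq, vocab]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-100,  # only gold-answer tokens contribute to the loss
    )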

This works when you can provide explicit demonstrations. But it assumes:

  • There’s a single correct answer or reasoning path.
  • Annotating data is feasible.
  • Mimicking demonstrations is the end goal.

RLHF: Rewarding Preferences

Reinforcement learning from human feedback (RLHF) [1] changes the training signal. Instead of gold trajectories, humans compare outputs (A vs B). A reward model $R_\phi(x,y)$ is trained from these comparisons, and then the LLM is optimized (often with PPO [2]) to maximize a KL-regularized objective [3]:

$ \max_\theta \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} [ R_\phi(x,y) ] \; - \; \beta \, \text{KL}(\pi_\theta \| \pi_{\text{SFT}}) $

This objective:

  • Pushes probability mass toward any output that humans prefer.
  • Allows multiple valid completions, not just one reference.
  • Balances optimization with a KL penalty to stay close to the SFT baseline.
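
In practice (e.g. in InstructGPT-style RLHF [1]), the KL penalty is often folded directly into the reward each sampled completion receives before the policy-gradient step. A minimal sketch of that shaping (tensor names are illustrative):

def kl_shaped_reward(logp_policy, logp_sft, rm_score, beta=0.1):
    # logp_policy: [batch] log pi_theta(y | x) of each sampled completion
    # logp_sft:    [batch] log pi_SFT(y | x) of the same completions
    # rm_score:    [batch] reward-model scores R_phi(x, y)
    # Per-sample estimate of KL(pi_theta || pi_SFT) at the sampled completions
    kl_estimate = logp_policy - logp_sft
    # Fold the KL penalty into the reward the policy-gradient step will see
    return rm_score - beta * kl_estimate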

The Common Feeling: “Isn’t This Just Another Loss Term?”

Many readers (myself included, at first) find RLHF underwhelming. If you squint, it looks like we just:

  • Added a reward term (from the reward model).
  • Added a regularization term (the KL penalty).

Isn’t that just another loss, like supervised training with extra seasoning?

In practice, yes — the optimizer just sees a loss function. But conceptually, there are two crucial differences:

  1. Unit of supervision:

    • SFT: per-token, must match reference exactly.
    • RLHF: sequence-level, model gets scored on its own generated outputs.
  2. Exploration:

    • SFT never leaves the data distribution.
    • RLHF explores new completions, learns from feedback even if no demonstration exists.

It’s a subtle but important shift: from imitation to optimization.

Why the Distinction Matters in Reasoning

This difference is especially sharp for reasoning models.

  • Supervised setup: You need annotated reasoning traces. If the model deviates from the trace, it gets punished.
  • RL setup: You only need a correctness or preference signal at the end. The model can explore multiple reasoning paths and reinforce all the ones that succeed.

Formally, the policy gradient distributes the scalar reward across all tokens in the trajectory [4]:

$ \nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta} [ R(x,y) \nabla_\theta \log \pi_\theta(y|x) ]. $

This is why RLHF can train models to solve math or logic problems without ever providing full step-by-step traces.
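
As a loss you can actually backpropagate, this looks like the following sketch (token_log_probs would come from scoring a response the model sampled itself):

def reinforce_loss(token_log_probs, reward):
    # token_log_probs: [batch, resp_len] log pi_theta(y_t | x, y_<t) for a
    #                  self-generated response
    # reward:          [batch] a single outcome reward per full response
    seq_log_prob = token_log_probs.sum(dim=-1)  # log pi_theta(y | x)
    # Every token's gradient is scaled by the same sequence-level reward
    return -(reward * seq_log_prob).mean()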

RLVR: Reinforcement Learning with Verifiable Reward

When a task has programmatically checkable outcomes (e.g., math or code), we can use RL with verifiable rewards (RLVR).

  • Instead of a reward model trained from human preferences, you use programmatic verifiable rewards: e.g., a math checker, a compiler, or unit tests (e.g., CodeRL uses unit tests as rewards [5]).
  • The model proposes reasoning chains; a reward is computed by checking correctness (pass/fail or graded score).
  • The reward is automatic, scalable, and directly tied to task success.

This is a natural fit for reasoning tasks like math and coding, where correctness can be verified algorithmically. RLVR avoids the subjective noise of human-labeled preferences and focuses the optimization on true task success.
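
As an illustration, a verifiable reward for a math-style task can be as simple as comparing an extracted final answer against the ground truth (the "last number in the response" extraction rule below is a deliberately crude, made-up example):

import re

def math_reward(response: str, gold_answer: str) -> float:
    # Treat the last number in the response as the model's final answer
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold_answer else 0.0

# Example: math_reward("... so the total is 42", "42") -> 1.0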

Intuition

  • SFT: “Copy the teacher’s solution exactly.”
  • RLHF: “Write your own solution; if the grader likes it, you’ll be rewarded.”
  • RLVR: “Write your own solution; the computer checks it — if it’s right, you’re rewarded.”

Conclusion

RL in LLM post-training can feel, at first, like we’re just designing new loss functions. But the shift in supervision level and the introduction of exploration are what make it different from SFT. For alignment tasks, RLHF turns human preferences into optimization signals. For reasoning, RLVR shows the full potential: models can explore reasoning paths freely, and objective verifiable rewards provide precise, scalable signals.

That’s why RL — in one form or another — remains central to shaping how models not only speak, but also reason.

Part 2: How to Train LLMs with RL

This is a recap of Lecture 17 of Stanford CS336 (Spring 2025), which covered reinforcement learning (RL) for language models with a focus on policy gradient and GRPO. If you find this interesting, I highly recommend going through the full lecture materials.

RL Setup in the LLM Context

  • State ($s$): the prompt plus the generated tokens so far
  • Action ($a$): the next token generated
  • Reward ($R$): how good the response is — in this lecture we focus on outcome rewards (a single number after the full response), especially verifiable rewards (deterministic functions, e.g. exact match)
  • Policy ($\pi$): the language model itself, mapping states to token distributions
  • Trajectory: generating a full response and then receiving a reward

Unlike in robotics, transitions in the LM setting are deterministic ($s' = s + a$, i.e., the next state is just the current tokens with the new token appended), which makes planning and test-time compute feasible.
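
As a sketch, a full trajectory in this setup is just decoding to completion and then scoring once at the end (sample_next_token and verifier are placeholder callables for the policy's sampler and the outcome reward):

def rollout(prompt_tokens, sample_next_token, verifier, eos_id, max_new_tokens=128):
    # State s: the prompt plus everything generated so far
    state = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Action a: the next token, sampled from the policy pi(. | s)
        action = sample_next_token(state)
        # Deterministic transition: s' = s + a (just append the token)
        state = state + [action]
        if action == eos_id:
            break
    # Outcome reward: a single score for the full response
    reward = verifier(state)
    return state, reward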

Naive Policy Gradient: Learn from Rewarded Trajectories

The simplest starting point is the policy gradient of the expected reward:

$ \nabla_\theta \mathbb{E}[R] \;=\; \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \, R(s,a)] $

Interpretation: scale the gradient of the log-prob of a sampled response by its reward.

  • If $R = 1$ (correct), we update positively.
  • If $R = 0$ (incorrect), no update.

This is like doing supervised fine-tuning (SFT) on self-generated data, filtered by reward. But there are problems:

  • High variance: updates are noisy and unstable.
  • Sparse rewards: most responses get $R = 0$, leading to no gradient signal.
  • Shifting distribution: every update changes $\pi$, so the data distribution changes too.

As I noted in my personal lecture notes: we are training on our own trajectories whenever the outcome is good, with no demonstrations to anchor us.
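
With a binary reward, this "filtered SFT" view can be made literal: keep only the self-generated responses that scored $R = 1$ and do a cross-entropy update on them. A sketch (it assumes a model that maps a batch of prompts straight to response logits, like the toy model later in this post):

import torch
import torch.nn.functional as F

def filtered_sft_step(model, prompts, responses, rewards, optimizer):
    # With R in {0, 1}, the naive policy gradient reduces to cross-entropy
    # on the subset of self-generated responses that got reward 1.
    keep = rewards > 0
    if keep.sum() == 0:
        return None  # sparse rewards: no gradient signal this batch
    logits = model(prompts[keep])  # [kept, resp_len, vocab]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        responses[keep].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()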

Baselines and Advantage Functions

To reduce variance, we subtract a baseline $b(s)$:

$ \nabla_\theta \mathbb{E}[R] \;=\; \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \, (R - b(s))] $

  • If $b(s) = \mathbb{E}[R|s]$, then $R - b(s)$ becomes an estimate of the advantage:
    $ A(s,a) = Q(s,a) - V(s) $
  • Intuition: don’t just reward good responses; reward responses that are better than average for that state.
  • Centering or normalizing rewards across responses is a simple way to approximate this.
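
For example, centering the rewards of four responses sampled for the same prompt:

import torch

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # 4 samples for one prompt
baseline = rewards.mean()                     # b(s) = 0.5
advantages = rewards - baseline               # [ 0.5, -0.5, -0.5,  0.5]
# Correct responses are pushed up and incorrect ones pushed down,
# even though every raw reward was non-negative.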

GRPO: Group Relative Policy Optimization

GRPO is a simplification of PPO tailored for language models:

  • Instead of learning a critic $V(s)$, use the group of sampled responses per prompt to define a natural baseline.
  • Center rewards within the group, so updates depend on relative quality.
  • Still applies importance sampling ratios ($\pi / \pi_{\text{old}}$) and clipping to stabilize updates.
  • Optionally add a KL penalty against a reference model to prevent drift.
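
Written out in a simplified, sequence-level form (DeepSeekMath [7] applies the ratio and clipping per token), the GRPO surrogate for a group of $G$ responses $y_1, \dots, y_G$ to the same prompt is:

$ \hat{A}_i = \frac{R_i - \text{mean}(R_1,\dots,R_G)}{\text{std}(R_1,\dots,R_G)}, \qquad r_i = \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)} $

$ \mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G} \min\!\Big( r_i \hat{A}_i,\; \text{clip}(r_i, 1-\epsilon, 1+\epsilon)\, \hat{A}_i \Big) \;+\; \beta \, \text{KL}(\pi_\theta \,\|\, \pi_{\text{ref}}) $

The group-normalized $\hat{A}_i$ is exactly the kind of centered/normalized delta computed in the toy code later in this post.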

The Role of the Reference Model and Old Model

In GRPO (Group Relative Policy Optimization), the reference model and the old model each serve distinct stabilizing roles in the training loop:

1. The reference model $\pi_{\text{ref}}$
  • Defined in the algorithm (line 3): $\pi_{\text{ref}} \leftarrow \pi_\theta$
  • It’s essentially a frozen snapshot of the current policy before you begin a new round of updates.
  • Used to compute KL penalties or regularization so that the updated policy doesn’t drift too far away in one iteration.
  • Intuition: “Keep the new policy close to where it started this iteration, so we don’t collapse or diverge.”
2. The old model $\pi_{\text{old}}$
  • Defined in the inner loop (line 6): $\pi_{\text{old}} \leftarrow \pi_\theta$
  • This is the policy used to actually sample responses for a batch of prompts during this step.
  • Why? Because in policy gradient methods, you need log-probs from the behavior policy that generated the samples to compute correct importance-weighted updates.
  • Intuition: “Freeze a copy of the current policy to score the sampled responses consistently, even while we update $\pi_\theta$ multiple times.”
3. Why both?
  • Reference model ($\pi_{\text{ref}}$) anchors each outer iteration so that across iterations, the policy doesn’t drift uncontrollably.
  • Old model ($\pi_{\text{old}}$) anchors each minibatch of rollouts so that updates are on-policy (consistent with the responses that were actually sampled).

Analogy to PPO
  • PPO also keeps two things:
    • The behavior policy (like $\pi_{\text{old}}$) for importance sampling.
    • A clipping or KL penalty baseline (like $\pi_{\text{ref}}$) to stabilize updates across epochs.
  • GRPO follows the same design, but the “group relative” advantage estimator is plugged in.

TL;DR:

  • $\pi_{\text{ref}}$: a frozen snapshot at the start of the outer loop → prevents runaway drift via KL.
  • $\pi_{\text{old}}$: the policy used to generate responses for a batch → ensures consistent credit assignment when updating.
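
Putting the two snapshots into one outer iteration, the structure looks roughly like this (a sketch with placeholder callables for sampling, rewards, and the clipped surrogate loss):

import copy
import torch

def grpo_outer_iteration(policy, prompt_batches, sample_fn, reward_fn,
                         surrogate_loss_fn, optimizer, inner_steps=4):
    # Line 3 of the algorithm: freeze pi_ref once per outer iteration;
    # it only enters the loss through the KL penalty.
    pi_ref = copy.deepcopy(policy).eval()
    for prompts in prompt_batches:
        # Line 6: pi_old is whatever the policy was when this batch was
        # sampled; its log-probs stay frozen so later updates remain
        # importance-weighted against the distribution that generated the data.
        with torch.no_grad():
            responses, old_log_probs = sample_fn(policy, prompts)
            rewards = reward_fn(prompts, responses)
        for _ in range(inner_steps):
            loss = surrogate_loss_fn(policy, pi_ref, prompts, responses,
                                     old_log_probs, rewards)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy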

A Toy Example: Sorting Numbers

The lecture walked through a simple toy environment: prompts are lists of numbers, and the task is to generate sorted responses.

Reward functions

def sort_distance_reward(prompt, response):
    ground_truth = sorted(prompt)
    return sum(1 for x, y in zip(response, ground_truth) if x == y)

def sort_inclusion_ordering_reward(prompt, response):
    inclusion_reward = sum(1 for x in prompt if x in response)
    ordering_reward = sum(1 for x, y in zip(response, response[1:]) if x <= y)
    return inclusion_reward + ordering_reward

The first gives credit only for numbers that land in the correct sorted position; the second gives partial credit for including the right numbers and for adjacent pairs that are in order.

Model

A deliberately simple encoder–decoder with per-position encode/decode matrices:

import torch
import torch.nn as nn
from einops import einsum

class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, prompt_length, response_length):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.encode_weights = nn.Parameter(torch.randn(prompt_length, embedding_dim, embedding_dim))
        self.decode_weights = nn.Parameter(torch.randn(response_length, embedding_dim, embedding_dim))

    def forward(self, prompts):
        embeddings = self.embedding(prompts)  # [batch, pos, dim]
        # Mix all prompt positions into a single vector per example
        encoded = einsum(embeddings, self.encode_weights, "batch pos dim1, pos dim1 dim2 -> batch dim2")
        # Expand back out to one vector per response position
        decoded = einsum(encoded, self.decode_weights, "batch dim2, pos dim2 dim1 -> batch pos dim1")
        # Tie the output projection to the input embedding matrix
        logits = einsum(decoded, self.embedding.weight, "batch pos dim1, vocab dim1 -> batch pos vocab")
        return logits

This isn’t a real LM — but it’s small enough to show RL mechanics.

Key Training Functions

  • Compute deltas (advantage-like signals):

# Imports shared by the helper functions below
import torch
import torch.nn.functional as F
from einops import repeat

def compute_deltas(rewards, mode="centered_rewards"):
    # rewards: [batch, trial] outcome rewards for a group of sampled responses
    if mode == "rewards":
        return rewards
    if mode == "centered_rewards":
        return rewards - rewards.mean(dim=-1, keepdim=True)
    if mode == "normalized_rewards":
        return (rewards - rewards.mean(dim=-1, keepdim=True)) / (rewards.std(dim=-1, keepdim=True) + 1e-5)

  • Compute log-probs for responses:

def compute_log_probs(prompts, responses, model):
    logits = model(prompts)  # [batch, pos, vocab]
    log_probs = F.log_softmax(logits, dim=-1)
    log_probs = repeat(log_probs, "batch pos vocab -> batch trial pos vocab", trial=responses.shape[1])
    return log_probs.gather(dim=-1, index=responses.unsqueeze(-1)).squeeze(-1)  # [batch, trial, pos]

  • Compute loss (naive / unclipped / clipped):

def compute_loss(log_probs, deltas, mode="naive", old_log_probs=None):
    if mode == "naive":
        return -(log_probs * deltas[..., None]).mean()
    if mode == "unclipped":
        # Compute probability ratio from log-probs: exp(new - old)
        ratios = torch.exp(log_probs - old_log_probs)
        return -(ratios * deltas[..., None]).mean()
    if mode == "clipped":
        epsilon = 0.01
        ratios = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon)
        return -torch.min(ratios * deltas[..., None], clipped * deltas[..., None]).mean()
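
Putting the pieces together, one GRPO-style step on the sorting task might look like this (a sketch that reuses Model, sort_inclusion_ordering_reward, and the helpers above; hyperparameters are arbitrary, and there is no KL term because the toy compute_loss has none):

import torch
from torch.distributions import Categorical

vocab_size, dim, prompt_len, resp_len = 16, 32, 5, 5
batch, num_trials, inner_steps = 8, 4, 4

model = Model(vocab_size, dim, prompt_len, resp_len)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
prompts = torch.randint(0, vocab_size, (batch, prompt_len))

# Sample a group of responses per prompt with the current policy (pi_old)
with torch.no_grad():
    dist = Categorical(logits=model(prompts))                  # one categorical per position
    responses = dist.sample((num_trials,)).permute(1, 0, 2)    # [batch, trial, pos]
    old_log_probs = compute_log_probs(prompts, responses, model)

# Verifiable reward per response, then group-relative deltas
rewards = torch.tensor([
    [sort_inclusion_ordering_reward(p.tolist(), r.tolist()) for r in resp]
    for p, resp in zip(prompts, responses)
], dtype=torch.float)                                          # [batch, trial]
deltas = compute_deltas(rewards, mode="centered_rewards")

# Several clipped updates against the frozen old log-probs
for _ in range(inner_steps):
    log_probs = compute_log_probs(prompts, responses, model)
    loss = compute_loss(log_probs, deltas, mode="clipped", old_log_probs=old_log_probs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()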

Conclusions

  • Reward vs loss: Reward typically goes up, but the loss curve is misleading because the training data changes every iteration (on-policy samples).
  • Partial credit: Helps learning but risks local optima — models can exploit weak reward definitions.
  • Multiple models: RL training requires juggling $\pi$, $\pi_{\text{old}}$, and $\pi_{\text{ref}}$, making it more complex than pretraining.
  • Scaling: The toy code works on a laptop, but production RLHF/GRPO involves distributed systems, reward models, and careful KL balancing.

Supervised learning can only mimic human data. RL unlocks optimization against verifiable signals and human preferences. While toy demos like “sorting numbers” are simple, the mechanics mirror how large-scale RLHF and GRPO fine-tune frontier LLMs.

References

  1. Ouyang, L., et al. “Training language models to follow instructions with human feedback.” arXiv preprint arXiv:2203.02155 (2022).
  2. Schulman, J., et al. “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347 (2017).
  3. Levine, S. “Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review.” arXiv preprint arXiv:1805.00909 (2018).
  4. Williams, R. J. “Simple statistical gradient-following algorithms for connectionist reinforcement learning.” Machine Learning (1992).
  5. Le, H., et al. “CodeRL: Mastering Code Generation through Pretrained LMs and Deep Reinforcement Learning.” NeurIPS (2022).
  6. Rafailov, R., et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” arXiv preprint arXiv:2305.18290 (2023).
  7. DeepSeek-AI. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv preprint arXiv:2402.03300 (2024).
  8. Liu, Z., et al. “Understanding R1‑Zero‑Like Training: A Critical Perspective.” arXiv preprint arXiv:2503.20783 (2025).
  9. DeepSeek Team. “DeepSeek‑R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv preprint arXiv:2501.12948 (2025).
  10. Kimi Team. “Kimi k1.5: Scaling Reinforcement Learning with LLMs.” arXiv preprint arXiv:2501.12599 (2025).
  11. Qwen Team. “Qwen3 Technical Report.” arXiv preprint arXiv:2505.09388 (2025).