Reparameterization vs REINFORCE

Two ways to differentiate expectations: REINFORCE (score-function) vs reparameterization (pathwise), and why the pathwise estimator trains VAEs more stably.

Overview

When training VAEs you need gradients through random latent samples. There are two estimators for $\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)]$: (1) the score-function (REINFORCE) estimator [3], which is fully general but notoriously high-variance, and (2) the reparameterization (pathwise) estimator [1][2], which backprops through a deterministic transform of noise and usually has far lower variance. In practice, VAEs with Gaussian latents should almost always use reparameterization ($z=\mu+\sigma\odot\epsilon$) [1][2]; fall back to score-function methods only when a clean reparameterization is unavailable (e.g., truly discrete latents), or use relaxations like Gumbel-Softmax [4][5].

TL;DR

  • REINFORCE: unbiased but noisy; multiplies function value by log-derivative of density.
  • Reparameterization: injects randomness via an auxiliary variable, so you can use backprop; lower variance, but only available for reparameterizable distributions.

Two Estimators

REINFORCE (Score-Function)

Without reparameterization, you can still estimate the gradient of an expectation like

$ \nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] $

by moving the gradient inside the integral and applying the log-derivative trick $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$ = \mathbb{E}_{q_\phi(z)} \big[ f(z) \, \nabla_\phi \log q_\phi(z) \big]. $

This is the score-function estimator (aka REINFORCE trick) [3].

  • Pros: Works for any distribution (discrete or continuous).
  • Cons: Very high variance → training is noisy and unstable.
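
To make the estimator concrete, here is a minimal NumPy sketch of the score-function gradient for a 1D Gaussian $q_\phi(z)=\mathcal{N}(\mu,\sigma^2)$ with a toy objective; the function and parameter names (`score_function_grad`, `mu`, `sigma`, the sample count) are illustrative choices, not anything prescribed by [3].

```python
import numpy as np

def score_function_grad(mu, sigma, f, n_samples=1_000, seed=0):
    """Monte Carlo REINFORCE estimate of  d/d(mu) E_{z ~ N(mu, sigma^2)}[f(z)].

    Uses the identity  grad = E[ f(z) * d/d(mu) log q(z) ]  with
    d/d(mu) log N(z; mu, sigma^2) = (z - mu) / sigma^2.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=n_samples)   # z ~ q_phi(z)
    score = (z - mu) / sigma**2                 # score w.r.t. mu
    return np.mean(f(z) * score)                # unbiased, but noisy

# Toy check with f(z) = z^2: the true gradient is d/d(mu) E[z^2] = 2*mu.
print(score_function_grad(mu=1.0, sigma=1.0, f=lambda z: z**2))
```

Even in this 1D toy problem the estimate fluctuates noticeably from seed to seed, which is exactly the variance problem the rest of this post is about.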

Reparameterization (Pathwise)

If $z$ can be expressed as

$ z = g_\phi(\epsilon), \quad \epsilon \sim p(\epsilon), $

then

$ \nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{p(\epsilon)} \big[ \nabla_\phi f(g_\phi(\epsilon)) \big], $

so you can backprop through $g_\phi$ directly [1][2].

  • Pros: Much lower variance, gradients flow smoothly through the network.
  • Cons: Only works when you can reparameterize the distribution (e.g., Gaussian; categorical latents via the Gumbel-Softmax relaxation [4][5]; some other distributions admit implicit or generalized reparameterization gradients [9][10]).
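
For comparison, here is a sketch of the same toy objective with the pathwise estimator in PyTorch, where autograd differentiates straight through $z=\mu+\sigma\epsilon$; the variable names and sample count are again illustrative assumptions.

```python
import torch

# Parameters of q(z) = N(mu, sigma^2); we want gradients w.r.t. both.
mu = torch.tensor(1.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

# Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, 1).
eps = torch.randn(1_000)
z = mu + torch.exp(log_sigma) * eps

# Toy objective f(z) = z^2; backprop flows through the sample itself.
loss = (z ** 2).mean()
loss.backward()

print(mu.grad)         # close to the true gradient 2*mu = 2.0
print(log_sigma.grad)  # estimate of d/d(log sigma) E[z^2] = 2*sigma^2 = 2.0
```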

Why Pathwise Has Lower Variance

Key Difference

  • REINFORCE uses a global signal $f(z)$ multiplied by the score $\nabla_\phi\log q_\phi(z)$. It does not use $\nabla_z f(z)$.
  • Pathwise writes $z=g_\phi(\epsilon), \epsilon\sim p(\epsilon)$ and differentiates through the sample:
$ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] =\nabla_\phi \mathbb{E}_{\epsilon}[f(g_\phi(\epsilon))] =\mathbb{E}_{\epsilon}\!\big[\nabla_z f(z)\,\underbrace{\nabla_\phi g_\phi(\epsilon)}_{\text{local sensitivity}}\big]. $

This uses local derivatives of $f$ and typically tracks the objective’s curvature much more closely.

Intuition for Variance

  • Score-function term: $(f(z)-b)\,\nabla_\phi\log q_\phi(z)$ is a product of two random quantities; variance explodes when either varies a lot.
  • Pathwise term: $\nabla_z f(z)\,\nabla_\phi g_\phi(\epsilon)$ uses smoother local derivatives and avoids multiplying by a random score.

1D Gaussian Example (Why It Blows Up Fast)

Let $z\sim\mathcal{N}(\mu,1)$. Compare gradients w.r.t. $\mu$.

  • REINFORCE: $\nabla_\mu \log q(z)=z-\mu$. Estimator is

    $ g_{\text{SF}} = f(z)\,(z-\mu). $

    If $f$ has quadratic or heavier growth, $g_{\text{SF}}$ involves high-order moments of $z$ (e.g., $\mathbb{E}[z^3], \mathbb{E}[z^5],\dots$), so its variance is governed by even higher moments and becomes large.

  • Pathwise: $z=\mu+\epsilon,\ \epsilon\sim\mathcal{N}(0,1)$. Estimator is

    $ g_{\text{PW}} = \frac{\partial f(z)}{\partial z}\cdot \frac{\partial z}{\partial \mu}=\; f'(z). $

    Typically $\operatorname{Var}[f'(z)]$ is far smaller than $\operatorname{Var}[f(z)(z-\mu)]$. Example $f(z)=z^2$: the true gradient is $\nabla_\mu \mathbb{E}[z^2]=2\mu$.

    • REINFORCE sample: $z^2(z-\mu)$ (uses $z^3$), very high variance.
    • Pathwise sample: $2z$, variance $=4\operatorname{Var}(z)=4$ — tiny by comparison.
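
A quick numerical check of this comparison; the sample count and seed are arbitrary, and the per-sample variances quoted in the final comment are the analytical values for $\mu=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000
z = rng.normal(mu, 1.0, size=n)          # z ~ N(mu, 1)

g_sf = z**2 * (z - mu)                   # REINFORCE per-sample gradient for f(z) = z^2
g_pw = 2 * z                             # pathwise per-sample gradient

print("true gradient  :", 2 * mu)
print("REINFORCE  mean:", g_sf.mean(), " var:", g_sf.var())
print("pathwise   mean:", g_pw.mean(), " var:", g_pw.var())
# Both means approach 2.0, but the per-sample variances differ sharply:
# about 30 for the score-function samples vs. 4 for the pathwise samples.
```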

Formal View (Control Variates / Rao–Blackwell)

  • Reparameterization can be seen as a Rao–Blackwellization over the sampling path: it replaces the noisy global reward $f(z)$ with a term that conditions on the noise path and exploits differentiability. Conditioning in the Rao–Blackwell sense never increases variance, and in practice the pathwise form is almost always far less noisy than the raw score-function estimator; related control-variate estimators include VIMCO [6], REBAR [7], and RELAX [8].
  • Even with optimal baselines, REINFORCE typically remains higher-variance because it still relies on $f(z)$ rather than $\nabla_z f(z)$.

Practical Symptoms You’ve Likely Seen

  • Score-function estimates need lots of samples, aggressive baselines/advantage centering, and extra variance reduction (control variates, value functions).
  • Pathwise (reparameterization) usually trains stably with ~1 sample per datapoint in VAEs; see the sketch below.
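
As a concrete illustration of that single-sample regime, here is a hedged sketch of one objective evaluation for a small Gaussian-latent VAE in PyTorch; the architecture, layer sizes, and Bernoulli-style decoder are assumptions made only to keep the example self-contained, not a reference implementation from [1] or [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal Gaussian-latent VAE trained with the reparameterization trick."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                  # one noise sample per datapoint
        z = mu + torch.exp(0.5 * logvar) * eps      # z = mu + sigma * eps
        logits = self.dec(z)
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                           # negative ELBO (up to constants)

model = TinyVAE()
x = torch.rand(32, 784)                             # dummy batch with values in [0, 1]
loss = model(x)
loss.backward()                                     # gradients flow via the pathwise estimator
```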

Appendix: REINFORCE Identity (Proof + Baselines)

Goal: differentiate an expectation where the distribution depends on $\phi$:

$ J(\phi)=\mathbb{E}_{z\sim q_\phi(z)}[f(z)]. $

Under mild regularity conditions (so that differentiation and integration can be interchanged, e.g., by dominated convergence),

$ \nabla_\phi J(\phi) =\nabla_\phi \int f(z)\,q_\phi(z)\,dz =\int f(z)\,\nabla_\phi q_\phi(z)\,dz. $

Use the log-derivative trick $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$ \nabla_\phi J(\phi) =\int f(z)\,q_\phi(z)\,\nabla_\phi \log q_\phi(z)\,dz =\mathbb{E}_{q_\phi}\!\big[\,f(z)\,\nabla_\phi \log q_\phi(z)\,\big]. $

Two common refinements:

  • Baseline (control variate): for any constant $b$,
$ \mathbb{E}_{q_\phi}\!\big[(f(z)-b)\nabla_\phi \log q_\phi(z)\big] $

has the same expectation because $\mathbb{E}[\nabla_\phi \log q_\phi]=\nabla_\phi \int q_\phi=0$.

  • Optimal scalar baseline (for a given component) minimizes variance:
$ b^\star=\frac{\operatorname{Cov}(f,\,\nabla_\phi \log q_\phi)}{\operatorname{Var}(\nabla_\phi \log q_\phi)}. $
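
A small numerical sketch of these refinements for the 1D Gaussian example from earlier ($f(z)=z^2$, $\sigma=1$); estimating $b^\star$ from the same batch of samples is an illustrative shortcut (in practice a running average or a learned baseline is more common).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000
z = rng.normal(mu, 1.0, size=n)

f = z**2                       # f(z)
score = z - mu                 # d/d(mu) log q(z) when sigma = 1

g_plain = f * score                              # raw REINFORCE samples
b_star = np.cov(f, score)[0, 1] / score.var()    # sample estimate of b*
g_base = (f - b_star) * score                    # baseline-corrected samples

print("means:", g_plain.mean(), g_base.mean())   # both ~ 2*mu: still unbiased
print("vars :", g_plain.var(), g_base.var())     # the baseline shrinks variance
# For mu = 1 the optimal baseline cuts the variance from about 30 to about 18,
# but the estimator stays noisier than the pathwise samples 2*z (variance 4).
```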

References

  1. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR. https://arxiv.org/abs/1312.6114
  2. Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML. https://arxiv.org/abs/1401.4082
  3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE). Machine Learning. https://link.springer.com/article/10.1007/BF00992696
  4. Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel–Softmax. ICLR. https://arxiv.org/abs/1611.01144
  5. Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR. https://arxiv.org/abs/1611.00712
  6. Mnih, A., & Rezende, D. J. (2016). Variational Inference for Monte Carlo Objectives (VIMCO). ICML. https://arxiv.org/abs/1602.06725
  7. Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., & Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. NeurIPS. https://arxiv.org/abs/1703.07370
  8. Grathwohl, W., Choi, D., Wu, Y., Roeder, G., & Duvenaud, D. (2018). Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation (RELAX). ICLR. https://arxiv.org/abs/1711.00123
  9. Figurnov, M., Mohamed, S., & Mnih, A. (2018). Implicit Reparameterization Gradients. NeurIPS. https://arxiv.org/abs/1805.08498
  10. Ruiz, F. J. R., Titsias, M. K., & Blei, D. M. (2016). The Generalized Reparameterization Gradient. NeurIPS. https://arxiv.org/abs/1610.02287