Reparameterization vs REINFORCE

Two ways to differentiate expectations: REINFORCE (score-function) vs reparameterization (pathwise), and why the pathwise estimator trains VAEs more stably.

Overview

When training VAEs you need gradients through random latent samples. There are two estimators for $\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)]$: (1) the score-function (REINFORCE) estimator [3], which is fully general but notoriously high-variance, and (2) the reparameterization (pathwise) estimator [1][2], which backprops through a deterministic transform of noise and usually has far lower variance. In practice, VAEs with Gaussian latents should almost always use reparameterization ($z=\mu+\sigma\odot\epsilon$) [1][2]; fall back to score-function methods only when a clean reparameterization is unavailable (e.g., truly discrete latents), or use relaxations like Gumbel-Softmax [4][5].

TL;DR

  • REINFORCE: unbiased but noisy; multiplies function value by log-derivative of density.
  • Reparameterization: injects randomness via an auxiliary variable, so you can use backprop; lower variance, but only available for reparameterizable distributions.

Two Estimators

REINFORCE (Score-Function)

Without reparameterization, you can still estimate the gradient of an expectation like

$ \nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] $

by moving the gradient inside the integral and applying the log-derivative trick $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$ = \mathbb{E}_{q_\phi(z)} \big[ f(z) \, \nabla_\phi \log q_\phi(z) \big]. $

This is the score-function estimator (aka REINFORCE trick) [3].

  • Pros: Works for any distribution (discrete or continuous).
  • Cons: Very high variance → training is noisy and unstable.
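
To make the estimator concrete, here is a minimal NumPy sketch of the score-function gradient for a 1D Gaussian $q_\phi(z)=\mathcal{N}(\mu,\sigma^2)$ with a toy objective; the function and parameter names (`score_function_grad`, `mu`, `sigma`, the sample count) are illustrative choices, not anything prescribed by [3].

```python
import numpy as np

def score_function_grad(mu, sigma, f, n_samples=1_000, seed=0):
    """Monte Carlo REINFORCE estimate of  d/d(mu) E_{z ~ N(mu, sigma^2)}[f(z)].

    Uses the identity  grad = E[ f(z) * d/d(mu) log q(z) ]  with
    d/d(mu) log N(z; mu, sigma^2) = (z - mu) / sigma^2.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(mu, sigma, size=n_samples)   # z ~ q_phi(z)
    score = (z - mu) / sigma**2                 # score w.r.t. mu
    return np.mean(f(z) * score)                # unbiased, but noisy

# Toy check with f(z) = z^2: the true gradient is d/d(mu) E[z^2] = 2*mu.
print(score_function_grad(mu=1.0, sigma=1.0, f=lambda z: z**2))
```

Even in this 1D toy problem the estimate fluctuates noticeably from seed to seed, which is exactly the variance problem the rest of this post is about.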

Reparameterization (Pathwise)

If $z$ can be expressed as

$ z = g_\phi(\epsilon), \quad \epsilon \sim p(\epsilon), $

then

$ \nabla_\phi \, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{p(\epsilon)} \big[ \nabla_\phi f(g_\phi(\epsilon)) \big], $

so you can backprop through $g_\phi$ directly [1][2].

  • Pros: Much lower variance, gradients flow smoothly through the network.
  • Cons: Only works when you can reparameterize the distribution (e.g., Gaussian; categorical latents via the Gumbel-Softmax relaxation [4][5]; some other distributions admit implicit or generalized reparameterization gradients [9][10]).
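
For comparison, here is a sketch of the same toy objective with the pathwise estimator in PyTorch, where autograd differentiates straight through $z=\mu+\sigma\epsilon$; the variable names and sample count are again illustrative assumptions.

```python
import torch

# Parameters of q(z) = N(mu, sigma^2); we want gradients w.r.t. both.
mu = torch.tensor(1.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)

# Reparameterized sample: z = mu + sigma * eps with eps ~ N(0, 1).
eps = torch.randn(1_000)
z = mu + torch.exp(log_sigma) * eps

# Toy objective f(z) = z^2; backprop flows through the sample itself.
loss = (z ** 2).mean()
loss.backward()

print(mu.grad)         # close to the true gradient 2*mu = 2.0
print(log_sigma.grad)  # estimate of d/d(log sigma) E[z^2] = 2*sigma^2 = 2.0
```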

Why Pathwise Has Lower Variance

Key Difference

  • REINFORCE uses a global signal $f(z)$ multiplied by the score $\nabla_\phi\log q_\phi(z)$. It does not use $\nabla_z f(z)$.
  • Pathwise writes $z=g_\phi(\epsilon), \epsilon\sim p(\epsilon)$ and differentiates through the sample:
$ \nabla_\phi \mathbb{E}_{q_\phi}[f(z)] =\nabla_\phi \mathbb{E}_{\epsilon}[f(g_\phi(\epsilon))] =\mathbb{E}_{\epsilon}\!\big[\nabla_z f(z)\,\underbrace{\nabla_\phi g_\phi(\epsilon)}_{\text{local sensitivity}}\big]. $

This uses local derivatives of $f$ and typically tracks the objective’s curvature much more closely.

Intuition for Variance

  • Score-function term: $(f(z)-b)\,\nabla_\phi\log q_\phi(z)$ is a product of two random quantities; variance explodes when either varies a lot.
  • Pathwise term: $\nabla_z f(z)\,\nabla_\phi g_\phi(\epsilon)$ uses smoother local derivatives and avoids multiplying by a random score.

1D Gaussian Example (Why It Blows Up Fast)

Let $z\sim\mathcal{N}(\mu,1)$. Compare gradients w.r.t. $\mu$.

  • REINFORCE: $\nabla_\mu \log q(z)=z-\mu$. Estimator is

    $ g_{\text{SF}} = f(z)\,(z-\mu). $

    If $f$ has quadratic or heavier growth, $g_{\text{SF}}$ involves high-order moments of $z$ (e.g., $\mathbb{E}[z^3], \mathbb{E}[z^5],\dots$), so its variance is governed by even higher moments and becomes large.

  • Pathwise: $z=\mu+\epsilon,\ \epsilon\sim\mathcal{N}(0,1)$. Estimator is

    $ g_{\text{PW}} = \frac{\partial f(z)}{\partial z}\cdot \frac{\partial z}{\partial \mu}=\; f'(z). $

    Typically $\operatorname{Var}[f'(z)]$ is far smaller than $\operatorname{Var}[f(z)(z-\mu)]$. Example $f(z)=z^2$: the true gradient is $\nabla_\mu \mathbb{E}[z^2]=2\mu$.

    • REINFORCE sample: $z^2(z-\mu)$ (uses $z^3$), very high variance.
    • Pathwise sample: $2z$, variance $=4\operatorname{Var}(z)=4$ — tiny by comparison.
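
A quick numerical check of this comparison; the sample count and seed are arbitrary, and the per-sample variances quoted in the final comment are the analytical values for $\mu=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000
z = rng.normal(mu, 1.0, size=n)          # z ~ N(mu, 1)

g_sf = z**2 * (z - mu)                   # REINFORCE per-sample gradient for f(z) = z^2
g_pw = 2 * z                             # pathwise per-sample gradient

print("true gradient  :", 2 * mu)
print("REINFORCE  mean:", g_sf.mean(), " var:", g_sf.var())
print("pathwise   mean:", g_pw.mean(), " var:", g_pw.var())
# Both means approach 2.0, but the per-sample variances differ sharply:
# about 30 for the score-function samples vs. 4 for the pathwise samples.
```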

Formal View (Control Variates / Rao–Blackwell)

  • Reparameterization can be seen as a Rao–Blackwellization over the sampling path: it replaces the noisy global reward $f(z)$ with a term that conditions on the noise path and exploits differentiability. Conditioning in the Rao–Blackwell sense never increases variance, and in practice the pathwise form is almost always far less noisy than the raw score-function estimator; related control-variate estimators include VIMCO [6], REBAR [7], and RELAX [8].
  • Even with optimal baselines, REINFORCE typically remains higher-variance because it still relies on $f(z)$ rather than $\nabla_z f(z)$.

Practical Symptoms You’ve Likely Seen

  • Score-function estimates need lots of samples, aggressive baselines/advantage centering, and extra variance reduction (control variates, value functions).
  • Pathwise (reparameterization) usually trains stably with ~1 sample per datapoint in VAEs; see the sketch below.
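
As a concrete illustration of that single-sample regime, here is a hedged sketch of one objective evaluation for a small Gaussian-latent VAE in PyTorch; the architecture, layer sizes, and Bernoulli-style decoder are assumptions made only to keep the example self-contained, not a reference implementation from [1] or [2].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal Gaussian-latent VAE trained with the reparameterization trick."""

    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                  # one noise sample per datapoint
        z = mu + torch.exp(0.5 * logvar) * eps      # z = mu + sigma * eps
        logits = self.dec(z)
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl                           # negative ELBO (up to constants)

model = TinyVAE()
x = torch.rand(32, 784)                             # dummy batch with values in [0, 1]
loss = model(x)
loss.backward()                                     # gradients flow via the pathwise estimator
```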

Appendix: REINFORCE Identity (Proof + Baselines)

Goal: differentiate an expectation where the distribution depends on $\phi$:

$ J(\phi)=\mathbb{E}_{z\sim q_\phi(z)}[f(z)]. $

Under mild regularity conditions (so that differentiation and integration can be interchanged, e.g., by dominated convergence),

$ \nabla_\phi J(\phi) =\nabla_\phi \int f(z)\,q_\phi(z)\,dz =\int f(z)\,\nabla_\phi q_\phi(z)\,dz. $

Use the log-derivative trick $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:

$ \nabla_\phi J(\phi) =\int f(z)\,q_\phi(z)\,\nabla_\phi \log q_\phi(z)\,dz =\mathbb{E}_{q_\phi}\!\big[\,f(z)\,\nabla_\phi \log q_\phi(z)\,\big]. $

Two common refinements:

  • Baseline (control variate): for any constant $b$,
$ \mathbb{E}_{q_\phi}\!\big[(f(z)-b)\nabla_\phi \log q_\phi(z)\big] $

has the same expectation because $\mathbb{E}[\nabla_\phi \log q_\phi]=\nabla_\phi \int q_\phi=0$.

  • Optimal scalar baseline (for a given component) minimizes variance:
$ b^\star=\frac{\operatorname{Cov}(f,\,\nabla_\phi \log q_\phi)}{\operatorname{Var}(\nabla_\phi \log q_\phi)}. $
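
A small numerical sketch of these refinements for the 1D Gaussian example from earlier ($f(z)=z^2$, $\sigma=1$); estimating $b^\star$ from the same batch of samples is an illustrative shortcut (in practice a running average or a learned baseline is more common).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000
z = rng.normal(mu, 1.0, size=n)

f = z**2                       # f(z)
score = z - mu                 # d/d(mu) log q(z) when sigma = 1

g_plain = f * score                              # raw REINFORCE samples
b_star = np.cov(f, score)[0, 1] / score.var()    # sample estimate of b*
g_base = (f - b_star) * score                    # baseline-corrected samples

print("means:", g_plain.mean(), g_base.mean())   # both ~ 2*mu: still unbiased
print("vars :", g_plain.var(), g_base.var())     # the baseline shrinks variance
# For mu = 1 the optimal baseline cuts the variance from about 30 to about 18,
# but the estimator stays noisier than the pathwise samples 2*z (variance 4).
```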

References

  1. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR. https://arxiv.org/abs/1312.6114
  2. Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML. https://arxiv.org/abs/1401.4082
  3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE). Machine Learning. https://link.springer.com/article/10.1007/BF00992696
  4. Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel–Softmax. ICLR. https://arxiv.org/abs/1611.01144
  5. Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR. https://arxiv.org/abs/1611.00712
  6. Mnih, A., & Rezende, D. J. (2016). Variational Inference for Monte Carlo Objectives (VIMCO). ICML. https://arxiv.org/abs/1602.06725
  7. Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., & Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. NeurIPS. https://arxiv.org/abs/1703.07370
  8. Grathwohl, W., Choi, D., Wu, Y., Roeder, G., & Duvenaud, D. (2018). Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation (RELAX). ICLR. https://arxiv.org/abs/1711.00123
  9. Figurnov, M., Mohamed, S., & Mnih, A. (2018). Implicit Reparameterization Gradients. NeurIPS. https://arxiv.org/abs/1805.08498
  10. Ruiz, F. J. R., Titsias, M. K., & Blei, D. M. (2016). The Generalized Reparameterization Gradient. NeurIPS. https://arxiv.org/abs/1610.02287