Overview
When training VAEs, you need gradients through random latent samples. There are two standard estimators for $\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)]$: (1) the score-function (REINFORCE) estimator [3], which is fully general but notoriously high-variance, and (2) the reparameterization (pathwise) estimator [1][2], which backprops through a deterministic transform of noise and usually has far lower variance. In practice, VAEs with Gaussian latents should almost always use reparameterization ($z=\mu+\sigma\odot\epsilon$) [1][2] (a minimal sketch follows the TL;DR); fall back to score-function methods only when a clean reparameterization is unavailable (e.g., truly discrete latents), or use relaxations like Gumbel-Softmax [4][5].
TL;DR
- REINFORCE: unbiased but noisy; multiplies the function value by the log-derivative (score) of the density.
- Reparameterization: injects randomness via an auxiliary variable, so you can use backprop; lower variance, but only available for reparameterizable distributions.
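To make the recommendation above concrete, here is a minimal PyTorch sketch of the Gaussian reparameterization $z=\mu+\sigma\odot\epsilon$. The tensor values, names (`mu`, `log_var`), and the toy loss are illustrative assumptions, not code from [1] or [2].

```python
import torch

# Toy encoder outputs for one datapoint (illustrative values).
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.1, 0.2], requires_grad=True)

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I) carrying all the randomness.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Any differentiable function of z (a stand-in for the decoder / ELBO term).
loss = (z ** 2).sum()
loss.backward()

# Gradients flow through the deterministic transform back to mu and log_var.
print(mu.grad, log_var.grad)
```

Because the randomness lives in `eps`, ordinary backprop handles the rest; this is why the pathwise estimator is the default choice for Gaussian latents.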
Two Estimators
REINFORCE (Score-Function)
Without reparameterization, you can still estimate the gradient of an expectation like
$ \nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)] $
by moving the gradient inside:
$ \nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{q_\phi(z)}\big[f(z)\,\nabla_\phi \log q_\phi(z)\big]. $
This is the score-function estimator (aka the REINFORCE trick) [3]; a small code sketch follows the pros/cons below.
- Pros: Works for any distribution (discrete or continuous).
- Cons: Very high variance → training is noisy and unstable.
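A minimal NumPy sketch of the score-function estimator, assuming a 1D Gaussian $q_\mu(z)=\mathcal{N}(\mu,1)$ and a toy $f(z)=e^z$ (both choices are illustrative); the score is $\nabla_\mu \log q_\mu(z)=z-\mu$, and the true gradient is $\nabla_\mu\mathbb{E}[e^z]=e^{\mu+1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.5, 100_000

z = rng.normal(mu, 1.0, size=n)    # z ~ q_mu = N(mu, 1)
f = np.exp(z)                      # toy objective f(z) = e^z (illustrative choice)
score = z - mu                     # d/dmu log q_mu(z) for unit variance

grad_samples = f * score           # per-sample score-function (REINFORCE) gradients
print(grad_samples.mean())         # ≈ exp(mu + 0.5) ≈ 2.72, the true gradient
print(grad_samples.var())          # large; compare with the pathwise sketch below
```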
Reparameterization (Pathwise)
If $z$ can be expressed as a deterministic transform of parameter-free noise,
$ z = g_\phi(\epsilon), \qquad \epsilon \sim p(\epsilon), $
then
$ \nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)] = \mathbb{E}_{p(\epsilon)}\big[\nabla_\phi f(g_\phi(\epsilon))\big], $
so you can backprop through $g_\phi$ directly [1][2] (see the matching sketch after the pros/cons).
- Pros: Much lower variance; gradients flow smoothly through the network.
- Cons: Only works when you can reparameterize the distribution (e.g., Gaussian; categorical via Gumbel-Softmax as a relaxation [4][5]).
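For contrast, the pathwise estimator on the same toy setup ($z=\mu+\epsilon$, $f(z)=e^z$, both illustrative choices); since $\partial z/\partial\mu=1$, each per-sample gradient is just $f'(z)=e^z$, written out by hand here rather than via autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.5, 100_000

eps = rng.normal(0.0, 1.0, size=n)  # auxiliary noise, eps ~ N(0, 1)
z = mu + eps                        # reparameterized sample, dz/dmu = 1

grad_samples = np.exp(z)            # pathwise per-sample gradients: f'(z) * dz/dmu
print(grad_samples.mean())          # ≈ exp(mu + 0.5) ≈ 2.72, same true gradient
print(grad_samples.var())           # noticeably smaller than the score-function variance
```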
Why Pathwise Has Lower Variance
Key Difference
- REINFORCE uses a global signal $f(z)$ multiplied by the score $\nabla_\phi\log q_\phi(z)$. It does not use $\nabla_z f(z)$.
- Pathwise writes $z=g_\phi(\epsilon),\ \epsilon\sim p(\epsilon)$ and differentiates through the sample:
  $ \nabla_\phi f(g_\phi(\epsilon)) = \nabla_z f(z)\,\nabla_\phi g_\phi(\epsilon). $
  This uses local derivatives of $f$ and typically tracks the objective's curvature much more closely.
Intuition for Variance
- Score-function term: $(f(z)-b)\,\nabla_\phi\log q_\phi(z)$ is a product of two random quantities, so its variance blows up when either factor varies a lot.
- Pathwise term: $\nabla_z f(z)\,\nabla_\phi g_\phi(\epsilon)$ uses smoother local derivatives and avoids multiplying by a random score.
1D Gaussian Example (Why It Blows Up Fast)
Let $z\sim\mathcal{N}(\mu,1)$. Compare gradients w.r.t. $\mu$.
- REINFORCE: $\nabla_\mu \log q(z)=z-\mu$. The estimator is
  $ g_{\text{SF}} = f(z)\,(z-\mu). $
  If $f$ has quadratic or heavier growth, $g_{\text{SF}}$ involves high-order moments (e.g., $\mathbb{E}[z^3], \mathbb{E}[z^5],\dots$), so its variance scales with higher moments of $z$ → large.
- Pathwise: $z=\mu+\epsilon,\ \epsilon\sim\mathcal{N}(0,1)$. The estimator is
  $ g_{\text{PW}} = \frac{\partial f(z)}{\partial z}\cdot \frac{\partial z}{\partial \mu} = f'(z). $
  Typically $\operatorname{Var}[f'(z)]$ is far smaller than $\operatorname{Var}[f(z)(z-\mu)]$. Example $f(z)=z^2$: true $\nabla_\mu \mathbb{E}[z^2]=2\mu$ (checked numerically in the sketch below).
  - REINFORCE sample: $z^2(z-\mu)$ (involves $z^3$), very high variance.
  - Pathwise sample: $2z$, variance $=4\operatorname{Var}(z)=4$, tiny by comparison.
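A quick NumPy check of these numbers (a sketch; the seed, $\mu$, and sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000
z = rng.normal(mu, 1.0, size=n)           # z ~ N(mu, 1); true gradient is 2*mu

reinforce = z**2 * (z - mu)               # score-function samples for f(z) = z^2
pathwise = 2 * z                          # pathwise samples: f'(z), with dz/dmu = 1

print(reinforce.mean(), reinforce.var())  # mean ≈ 2*mu; variance ≈ 30 for mu = 1
print(pathwise.mean(), pathwise.var())    # mean ≈ 2*mu; variance ≈ 4
```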
Formal View (Control Variates / Rao–Blackwell)
- Reparameterization can be loosely viewed as a Rao–Blackwellization over the sampling path, replacing a noisy global reward with a quantity that leverages the differentiability of $f$. In most practical settings this gives a substantial variance reduction over the raw score-function form, though it is not a universal guarantee; related control-variate estimators include VIMCO [6], REBAR [7], and RELAX [8].
- Even with optimal baselines, REINFORCE typically remains higher-variance because it still relies on $f(z)$ rather than $\nabla_z f(z)$.
Practical Symptoms You’ve Likely Seen
- Score-function estimates need lots of samples, aggressive baselines/advantage centering, and extra variance reduction (control variates, value functions).
- Pathwise (reparameterization) usually trains stably with ~1 sample per datapoint in VAEs.
Appendix: REINFORCE Identity (Proof + Baselines)
Goal: differentiate an expectation where the distribution itself depends on $\phi$:
$ \nabla_\phi\, \mathbb{E}_{q_\phi(z)}[f(z)] = \nabla_\phi \int q_\phi(z)\, f(z)\, dz. $
Under mild regularity (dominated convergence, so gradient and integral can be interchanged),
$ \nabla_\phi \int q_\phi(z)\, f(z)\, dz = \int \nabla_\phi q_\phi(z)\, f(z)\, dz. $
Use the log-derivative trick $\nabla_\phi q_\phi = q_\phi \nabla_\phi \log q_\phi$:
$ \int \nabla_\phi q_\phi(z)\, f(z)\, dz = \mathbb{E}_{q_\phi(z)}\big[f(z)\,\nabla_\phi \log q_\phi(z)\big]. $
Two common refinements:
- Baseline (control variate): for any constant $b$, the modified estimator
  $ (f(z)-b)\,\nabla_\phi \log q_\phi(z) $
  has the same expectation, because $\mathbb{E}_{q_\phi}[\nabla_\phi \log q_\phi(z)]=\nabla_\phi \int q_\phi(z)\,dz=0$.
- Optimal scalar baseline (for a given component $\phi_i$) minimizes the variance (a sample-based version appears in the sketch below):
  $ b^\star = \frac{\mathbb{E}\big[f(z)\,(\nabla_{\phi_i} \log q_\phi(z))^2\big]}{\mathbb{E}\big[(\nabla_{\phi_i} \log q_\phi(z))^2\big]}. $
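A small NumPy sketch of the baseline idea, reusing the 1D Gaussian toy example from above ($q_\mu=\mathcal{N}(\mu,1)$, $f(z)=z^2$, all illustrative choices); for simplicity the optimal baseline is estimated from the same samples, which is fine for illustration but introduces a slight dependence.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.0, 100_000

z = rng.normal(mu, 1.0, size=n)                    # z ~ N(mu, 1)
f = z ** 2                                         # toy objective
score = z - mu                                     # d/dmu log q_mu(z)

plain = f * score                                  # vanilla REINFORCE samples
b_opt = np.mean(f * score**2) / np.mean(score**2)  # sample estimate of the optimal baseline b*
with_baseline = (f - b_opt) * score                # baseline-corrected samples

print(plain.mean(), with_baseline.mean())          # both ≈ 2*mu = 2
print(plain.var(), with_baseline.var())            # variance drops (≈30 → ≈14) but stays above the pathwise ≈4
```

This matches the point above: a good baseline helps, but the estimator still depends on $f(z)$ rather than $\nabla_z f(z)$, so it remains noisier than the pathwise estimator.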
References
1. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR. https://arxiv.org/abs/1312.6114
2. Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML. https://arxiv.org/abs/1401.4082
3. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning (REINFORCE). Machine Learning. https://link.springer.com/article/10.1007/BF00992696
4. Jang, E., Gu, S., & Poole, B. (2017). Categorical Reparameterization with Gumbel–Softmax. ICLR. https://arxiv.org/abs/1611.01144
5. Maddison, C. J., Mnih, A., & Teh, Y. W. (2017). The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR. https://arxiv.org/abs/1611.00712
6. Mnih, A., & Rezende, D. J. (2016). Variational Inference for Monte Carlo Objectives (VIMCO). ICML. https://arxiv.org/abs/1602.06725
7. Tucker, G., Mnih, A., Maddison, C. J., Lawson, J., & Sohl-Dickstein, J. (2017). REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. NeurIPS. https://arxiv.org/abs/1703.07370
8. Grathwohl, W., Choi, D., Wu, Y., Roeder, G., & Duvenaud, D. (2018). Backpropagation through the Void: Optimizing Control Variates for Black-Box Gradient Estimation (RELAX). ICLR. https://arxiv.org/abs/1711.00123
9. Figurnov, M., Mohamed, S., & Mnih, A. (2018). Implicit Reparameterization Gradients. NeurIPS. https://arxiv.org/abs/1805.08498
10. Ruiz, F. J. R., Titsias, M. K., & Blei, D. M. (2016). The Generalized Reparameterization Gradient. NeurIPS. https://arxiv.org/abs/1610.02287