Introduction
I originally wanted to write down a proof of the Gumbel-max trick, but soon realized it is really the same idea as a much more common problem: the exponential race. So, in this note let's start from this common problem and arrive at the Gumbel-max trick.
Competing Alarms
As preparation, let's first solve a probability problem:
There are $N$ independent clocks started simultaneously, such that the $i$-th clock rings after a random time $T_i \sim \text{Exp}(\lambda_i)$.
- (1) Let $X$ denote the random time after which some clock (i.e., any one of the clocks) rings; find the distribution of $X$.
- (2) Find the probability that the $i$-th clock rings first.
Let $X = \min\{T_1, T_2, \dots, T_N\}$, and let $F_X(t)$ and $F_{T_i}(t)$ denote the respective CDFs, where $F_{T_i}(t) = 1 - e^{-\lambda_i t}$.
Following the order statistics of $\min$ and the independence of the clocks, we have $P(X>t) = \prod_{i=1}^N P(T_i>t)$, or equivalently,

$$
1 - F_X(t) = \prod_{i=1}^N \left(1-F_{T_i}(t)\right) = \prod_{i=1}^N e^{-\lambda_i t} = e^{-\sum_{i=1}^N \lambda_i t}
$$
Therefore
$$
\begin{equation}
X \sim \text{Exp}\Big(\lambda_X = \sum_{i=1}^N \lambda_i\Big)
\label{part1}
\end{equation}
$$
i.e., the $\min$ of a set of independent exponential random variables is still an exponential random variable, with its rate $\lambda_X$ equal to the sum of the rates of those variables.
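Before moving on, here is a quick Monte Carlo sanity check of this result: a minimal NumPy sketch, with arbitrarily chosen example rates.

```python
import numpy as np

rng = np.random.default_rng(0)
lambdas = np.array([0.5, 1.0, 2.5])  # arbitrary example rates

# Each row simulates one round of the race: all N clocks started together.
T = rng.exponential(scale=1.0 / lambdas, size=(1_000_000, len(lambdas)))
X = T.min(axis=1)  # time of the first ring in each round

# If X ~ Exp(sum(lambdas)), then E[X] = 1 / sum(lambdas).
print(X.mean())             # ~0.25
print(1.0 / lambdas.sum())  # 0.25
```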
For the second part of the problem, we can consider two competing alarms $T_1$ and $T_2$ to begin with. Our goal is to find $P(T_1 < T_2)$. Writing $f_{T_i}$ for the PDF of $T_i$, we have

$$
\begin{split}
P(T_1 < T_2) & = \int_0^{+\infty} \int_{t_1}^{+\infty} f_{T_1}(t_1)\, f_{T_2}(t_2)\, dt_2\, dt_1 \\
&= \int_0^{+\infty} f_{T_1}(t_1) \left(1-F_{T_2}(t_1)\right) dt_1 \\
&= \int_0^{+\infty} \lambda_1 e^{-\lambda_1 t_1}\, e^{-\lambda_2 t_1}\, dt_1 \\
& = \frac{\lambda_1}{\lambda_1+\lambda_2}
\end{split}
$$
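A small numerical check of this two-alarm result (a sketch; the two rates are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2 = 2.0, 3.0  # arbitrary example rates

# Simulate many two-alarm races and count how often clock 1 rings first.
t1 = rng.exponential(scale=1.0 / lam1, size=1_000_000)
t2 = rng.exponential(scale=1.0 / lam2, size=1_000_000)

print((t1 < t2).mean())      # ~0.4
print(lam1 / (lam1 + lam2))  # 0.4
```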
Now, let's consider one specific clock $T_k$ racing against all the rest, denoted $T_{-k} = \min\{T_i\}_{i \neq k}$. According to $\ref{part1}$, we know that $T_{-k} \sim \text{Exp}(\sum_{i\neq k} \lambda_i)$. Using the result above, we have the solution for part (2) as follows:

$$
\begin{equation}
P(T_k \text{ rings first}) = P(T_k < T_{-k}) = \frac{\lambda_k}{\lambda_k+\sum_{i\neq k}\lambda_i} = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}
\label{part2}
\end{equation}
$$

Of course, we can do the integration directly and get the same result:

$$
\begin{split}
P(T_k < T_{-k}) & = \int_0^{+\infty} f_{T_k}(t_k) \left( \idotsint_{t_k}^{+\infty} \prod_{i\neq k} f_{T_i}(t_i)\, dt_i \right) dt_k \\
& = \int_0^{+\infty} f_{T_k}(t_k) \left( \prod_{i\neq k} \left(1-F_{T_i}(t_k)\right) \right) dt_k \\
& = \int_0^{+\infty} \lambda_k \exp{\left(-\lambda_k t_k\right)} \exp{\left(-\sum_{i \neq k}\lambda_i t_k\right)} dt_k \\
& = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}
\end{split}
$$

By the way, this setup, where multiple exponential random variables compete and we look for the first arrival, is also called an exponential race.

Exponential-Min Trick

I just made up the name "Exponential-Min"; the better name for this section is probably Sampling from Multinomial by Argmining. Suppose we have a set of positive numbers $[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]$. Correspondingly, we have a normalized probability vector $\vec{p}=[p_1, p_2, p_3, \dots, p_N]$, where $p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}$. This probability vector specifies a multinomial distribution over $N$ choices. Now, if we were to draw a sample from $\{1, 2, \dots, N\}$ according to this multinomial distribution specified by $\vec{p}$ (which is fundamentally specified by $[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]$), what should we do? Normally, we do the following:
- normalize the weights to get $\vec{p}$;
- draw $u \sim \text{Uniform}(0, 1)$;
- return the first index $k$ such that $\sum_{i=1}^{k} p_i \geq u$.

But that's the boring way. Now that we have this new Exponential-Min trick, we can do the following instead:
- draw $T_i \sim \text{Exp}(\lambda_i)$ independently for each $i$;
- return $k = \arg\min_i T_i$.

According to $\ref{part2}$, the index returned this way is distributed exactly according to $\vec{p}$. Thus, somehow we ended up Sampling from Multinomial by Argmining! A small simulation of both routes follows below.
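Here is a minimal NumPy sketch of both routes (the weights are arbitrary examples); the empirical frequencies should agree with $\vec{p}$:

```python
import numpy as np

rng = np.random.default_rng(2)
lambdas = np.array([1.0, 2.0, 3.0, 4.0])  # arbitrary unnormalized weights
p = lambdas / lambdas.sum()               # the multinomial it induces

n = 1_000_000

# The boring way: normalize and sample the categorical directly.
boring = rng.choice(len(lambdas), size=n, p=p)

# The Exponential-Min way: race N exponential clocks, report who rang first.
T = rng.exponential(scale=1.0 / lambdas, size=(n, len(lambdas)))
racy = T.argmin(axis=1)

print(np.bincount(boring) / n)  # ~[0.1, 0.2, 0.3, 0.4]
print(np.bincount(racy) / n)    # ~[0.1, 0.2, 0.3, 0.4]
```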
Gumbel Distribution

Now let's move from the Exponential distribution to the Gumbel distribution. The Gumbel distribution with unit scale ($\beta=1$) is parameterized by a location parameter $\mu$; $\text{Gumbel}(\mu)$ has the following CDF and PDF:

$$
\text{CDF: } F(x; \mu)=e^{-e^{-(x-\mu)}}
$$

$$
\text{PDF: } f(x; \mu) = e^{-\left((x-\mu)+e^{-(x-\mu)}\right)}
$$
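Inverting this CDF gives the standard way to draw Gumbel samples from uniforms: $u = e^{-e^{-(x-\mu)}}$ solves to $x = \mu - \log(-\log u)$. A quick sketch (the location value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = 1.5  # arbitrary location parameter

# Inverse-CDF sampling: u = exp(-exp(-(x - mu)))  =>  x = mu - log(-log(u)).
u = rng.uniform(size=1_000_000)
g = mu - np.log(-np.log(u))

# Check against the known mean of Gumbel(mu): mu + Euler-Mascheroni constant.
print(g.mean())             # ~2.077
print(mu + np.euler_gamma)  # 2.0772...
```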
Given a set of $N$ independent Gumbel random variables $G_i$, each with its own location parameter $\mu_i$, i.e. $G_i \sim \text{Gumbel}(\mu_i)$, the Gumbel distribution has two properties that are quite analogous to the exponential race example above:
- (1) $Z = \max\{G_1, G_2, \dots, G_N\}$ is still a Gumbel random variable, $Z \sim \text{Gumbel}(\mu_Z)$, with $\mu_Z = \log\sum_{i=1}^N e^{\mu_i}$;
- (2) the probability that $G_k$ is the largest is $\frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}$.

The proof of the first property is straightforward and similar to the above:

$$
F_Z(x) = \prod_{i=1}^N F_{G_i}(x) = \prod_{i=1}^N e^{-e^{-(x-\mu_i)}} = e^{-\sum_{i=1}^N e^{-(x-\mu_i)}} = e^{-e^{-x} \sum_{i=1}^N e^{\mu_i}} = e^{-e^{-(x-\mu_Z)}}
$$

Gumbel-Max Trick

Now we can tell a nearly identical, parallel story to the one in the section Exponential-Min Trick. And this section should really be called Sampling from Multinomial by Argmaxing. The main differences are:
- each choice is parameterized by a location $\mu_i$ (an unnormalized log-probability, or logit) instead of a rate $\lambda_i$; setting $\mu_i = \log \lambda_i$ recovers the earlier setup;
- we draw $G_i \sim \text{Gumbel}(\mu_i)$ and return $k = \arg\max_i G_i$, instead of argmining over exponential samples;
- the probability of returning $k$ is $\frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}} = \frac{\lambda_k}{\sum_{i=1}^N \lambda_i} = p_k$, the softmax of the locations, exactly as in the exponential race.

A sketch of this sampler follows below.
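A minimal NumPy sketch of the Gumbel-Max route, reusing the arbitrary weights from earlier; the empirical frequencies should again match the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(4)
lambdas = np.array([1.0, 2.0, 3.0, 4.0])  # same arbitrary weights as before
mu = np.log(lambdas)                      # logits
p = np.exp(mu) / np.exp(mu).sum()         # softmax == lambdas / lambdas.sum()

n = 1_000_000

# Gumbel-Max: perturb each logit with Gumbel(0) noise, then take the argmax.
gumbel_noise = -np.log(-np.log(rng.uniform(size=(n, len(mu)))))
samples = (mu + gumbel_noise).argmax(axis=1)

print(np.bincount(samples) / n)  # ~[0.1, 0.2, 0.3, 0.4]
print(p)                         # [0.1, 0.2, 0.3, 0.4]
```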
When is Gumbel-Max Trick Useful?

It seems like a lot of work to sample from a multinomial by argmaxing over Gumbel samples (or argmining over exponential samples). In what situation would we ever want to do this? The short answer is that the Gumbel-max trick allows us to make a sampling step differentiable; specifically, it makes sampling from a multinomial distribution differentiable. We'll take a closer look at this in a future post, but pause for a second and think about it: we are saying it is possible to differentiate through the action of drawing a discrete sample from a multinomial distribution! This was a pretty surprising, amazing possibility to me. Regarding downstream applications, differentiating through sampling is an important "trick" in neural-network-based variational inference in general. Multinomial discrete random variables are prevalent in many learning problems, and the Gumbel-max trick allows us to work with them in many interesting neural variational inference problems, which we will look into in future posts.
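As a tiny preview of that future post: the usual route to differentiability is to replace the hard $\arg\max$ with a temperature-controlled softmax, the relaxation known as Gumbel-softmax (a.k.a. the Concrete distribution). This sketch is illustrative only; the temperature $\tau$ and weights are arbitrary choices:

```python
import numpy as np

def gumbel_softmax_sample(mu, tau, rng):
    """Relaxed Gumbel-Max: softmax over Gumbel-perturbed logits.
    As tau -> 0 the output approaches a one-hot Gumbel-Max sample;
    unlike argmax, this is differentiable with respect to mu."""
    g = -np.log(-np.log(rng.uniform(size=mu.shape)))  # Gumbel(0) noise
    z = (mu + g) / tau
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(5)
mu = np.log(np.array([1.0, 2.0, 3.0, 4.0]))
print(gumbel_softmax_sample(mu, tau=0.1, rng=rng))  # nearly one-hot
print(gumbel_softmax_sample(mu, tau=5.0, rng=rng))  # smooth, near uniform
```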