Maximum Likelihood and Maximum Entropy


How MLE fits parameters and MaxEnt justifies exponential-family model forms.

Introduction

In most machine learning courses, we hear a lot about maximum likelihood estimation (MLE) and much less about the maximum entropy principle (MaxEnt). MLE is the workhorse of applied modeling, while MaxEnt is often left for advanced statistics or information theory. Yet the two are deeply connected. MLE explains how to fit parameters of a chosen model, while MaxEnt explains why exponential-family models like Gaussian, Bernoulli, logistic, and softmax appear so universally in machine learning. Together, they form the foundation of modern probabilistic modeling.


Maximum Likelihood Estimation (MLE)

MLE assumes we already have a family of models $p_\theta(x)$, parameterized by $\theta$. Given data $D=\{x_1,\dots,x_n\}$, the MLE picks the parameters that maximize the probability of the data:

$ \hat{\theta} = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i). $
Equivalently, MLE minimizes the KL divergence from the empirical distribution $\hat p$ to the model:
$ \hat\theta = \arg\min_\theta D_{\mathrm{KL}}(\hat p \,\|\, p_\theta). $
It’s simple, computationally convenient, and universally taught.
But: MLE starts from the assumption that we already know the correct form of the distribution.
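
To make this concrete, here is a minimal sketch (a toy example of mine, assuming numpy and scipy are available): fit a Gaussian by numerically minimizing the negative log-likelihood and compare with the closed-form MLE, the sample mean and (biased) sample standard deviation.

    # Toy sketch: Gaussian MLE, numerical optimization vs. closed form.
    # Assumes numpy and scipy; synthetic data, illustrative only.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.5, size=1000)        # synthetic data

    def neg_log_likelihood(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)                         # keep sigma > 0
        return 0.5 * np.sum(np.log(2 * np.pi * sigma**2) + (x - mu) ** 2 / sigma**2)

    res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

    print(mu_hat, sigma_hat)      # numerical MLE
    print(x.mean(), x.std())      # closed-form MLE: sample mean, biased sample std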


Maximum Entropy Principle (MaxEnt)

MaxEnt, introduced by Jaynes in 1957, takes a different approach. Suppose you only know partial information about a system — such as constraints on expectations (mean, variance, or feature averages). Which distribution should you pick? MaxEnt says to choose the one with the highest entropy subject to those constraints:

$ p^* = \arg\max_p H(p) \quad \text{s.t.} \quad \mathbb{E}_p[f_i(x)] = c_i. $
This ensures you are not making any assumptions beyond what is supported by the constraints.

  • No constraints beyond normalization (finite or bounded space) → uniform.
  • Known mean on $[0,\infty)$ → exponential.
  • Known mean and variance on $\mathbb{R}$ → Gaussian.
  • Known feature–label expectations → logistic regression.
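
As a quick numerical sanity check of the principle (a toy sketch of mine, assuming numpy and scipy): maximize entropy directly on a four-point space subject to a mean constraint, and verify that the optimizer already has the exponential (softmax-like) form derived in the appendix.

    # Toy sketch: maximize entropy on a finite space under a mean constraint.
    # Assumes numpy and scipy; the scores and target mean are arbitrary.
    import numpy as np
    from scipy.optimize import minimize

    s = np.array([0.0, 1.0, 2.0, 3.0])       # feature values f(x) on a 4-point space
    c = 1.2                                   # constraint: E_p[f] = c

    def neg_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return np.sum(p * np.log(p))

    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},   # normalization
        {"type": "eq", "fun": lambda p: p @ s - c},       # mean constraint
    ]
    res = minimize(neg_entropy, x0=np.full(4, 0.25), bounds=[(0, 1)] * 4,
                   constraints=constraints, method="SLSQP")
    p = res.x

    # For a MaxEnt solution, log p_k is affine in s_k, i.e. p_k ∝ exp(lambda * s_k):
    print(p)
    print(np.diff(np.log(p)) / np.diff(s))    # ≈ a single constant (the multiplier)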

Example: Logistic Regression

Let $y\in\{0,1\}$.

From MLE

  • Assume $P(y=1\mid x;\theta)=\sigma(\theta^\top x)$.
  • Fit $\theta$ by maximizing the data likelihood.

From MaxEnt (conditional)

  • Choose $P(y\mid x)$ to maximize conditional entropy subject to matching feature expectations.
  • The solution is log-linear:
    $ P(y\mid x)=\frac{\exp\!\big(y\,\theta^\top x\big)}{1+\exp(\theta^\top x)}\,, $
    which is exactly logistic regression.

So: MaxEnt → model form, MLE → parameter fitting.
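
Here is a small sketch that ties the two views together (toy data, plain gradient ascent, assuming only numpy): fit $\theta$ by MLE, then check that the MaxEnt feature-expectation constraints $\sum_i x_i\,(y_i-\sigma(\theta^\top x_i)) = 0$ hold at the optimum.

    # Toy sketch: logistic-regression MLE by gradient ascent, then a check that
    # the MaxEnt moment-matching conditions hold at the optimum. Assumes numpy;
    # data and step size are arbitrary choices for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    X = np.hstack([rng.normal(size=(n, 3)), np.ones((n, 1))])   # 3 features + intercept
    theta_true = np.array([1.0, -2.0, 0.5, 0.3])
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    theta = np.zeros(4)
    for _ in range(10000):                    # gradient ascent on the log-likelihood
        theta += 1.0 * (X.T @ (y - sigmoid(X @ theta))) / n

    # MaxEnt view: at the MLE, model feature expectations match the empirical ones.
    print(X.T @ (y - sigmoid(X @ theta)))     # ≈ 0 in every coordinate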


Why We Hear More About MLE

Most ML courses are focused on applied tasks: given a dataset and a model, how do you estimate parameters? That’s MLE. The MaxEnt principle is more foundational — it explains why those model families are natural in the first place. Unless you study information theory, statistics, or advanced ML, you might never see MaxEnt explicitly. Yet its fingerprints are everywhere: Boltzmann distributions in physics, softmax in neural nets, CRFs in NLP, Gaussian and exponential in stats.


Conclusion

MaxEnt selects the family: exponential-family densities are the unique solutions to “maximize entropy subject to linear moment constraints.” Once that family is fixed, MLE estimates the parameters by maximizing likelihood on data, which — by convex duality — solves the same moment-matching equations as the MaxEnt multipliers. That’s why logistic regression, softmax regression, CRFs, Gaussians, Poissons, etc., are simultaneously (i) MaxEnt solutions under natural constraints and (ii) MLE-friendly models in practice.


Appendix: Maximum Entropy — Core Results and Proofs

All logarithms are natural logs. Existence of a maximizer requires standard regularity (e.g., compact support or features that make the log-partition finite); we note these where relevant.

Notation and setup

Let $\mathcal{X}$ be a finite or measurable space. We seek a distribution $p$ that maximizes Shannon entropy

$ H(p)= -\int_{\mathcal{X}} p(x)\log p(x)\,dx \quad (\text{sum if }\mathcal{X} \text{ is finite}) $
subject to

  • normalization: $\int p=1$,
  • linear expectation constraints: $\int p(x) f_i(x)\,dx = c_i$ for $i=1,\dots,m$,
  • (optional) support constraints such as $p(x)=0$ outside a given set.

Entropy is strictly concave in $p$; linear constraints define a convex feasible set ⇒ the maximizer (if it exists) is unique.


Theorem 1 (Exponential family form)

Claim. The unique maximizer has the Gibbs/Boltzmann (exponential-family) form

$ p^*(x)=\frac{1}{Z(\lambda)} \exp\!\Big(\sum_{i=1}^m \lambda_i f_i(x)\Big), \quad Z(\lambda)=\int \exp\!\Big(\sum_i \lambda_i f_i(x)\Big)\,dx, $
for some Lagrange multipliers $\lambda\in\mathbb{R}^m$ chosen so that the constraints hold.

Proof. Form the Lagrangian

$ \mathcal{L}(p,\alpha,\lambda)= -\!\int p\log p + \alpha\!\Big(\!\int p -1\Big) + \sum_i \lambda_i \!\Big(\!\int p f_i - c_i\Big). $
Take the Gâteaux derivative w.r.t. $p$:
$ \frac{\delta \mathcal{L}}{\delta p}(x)= -(1+\log p(x)) + \alpha + \sum_i \lambda_i f_i(x)=0. $
Solve for $p$: $\log p(x) = \alpha-1 + \sum_i \lambda_i f_i(x)$. Exponentiate and absorb the constant $\alpha$ into $Z$. Uniqueness follows from strict concavity. ∎

Remarks.

  • If you measure entropy relative to a base/reference density $q(x)$ (equivalently, minimize $D_{\mathrm{KL}}(p\,\|\,q)$ subject to the same constraints), the solution is the exponential tilt of $q$:
    $ p^*(x)=\frac{q(x)\exp(\sum_i \lambda_i f_i(x))}{\int q(u)\exp(\sum_i \lambda_i f_i(u))\,du}. $
  • Dual variables $\lambda$ are chosen by solving the moment-matching equations $\partial \log Z(\lambda)/\partial \lambda_i = c_i$.
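
A small numerical sketch of the second remark (finite support, toy features and moments of my own choosing, assuming numpy and scipy): recover $\lambda$ by minimizing the convex function $\log Z(\lambda)-\lambda^\top c$, whose stationarity condition is exactly $\partial \log Z/\partial \lambda_i = c_i$.

    # Toy sketch: solve the moment-matching equations dlogZ/dlambda_i = c_i on a
    # finite space by minimizing the convex function logZ(lambda) - lambda·c.
    # Assumes numpy and scipy; features and target moments are arbitrary.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import logsumexp

    xs = np.arange(6)                        # finite space {0, ..., 5}
    F = np.stack([xs, xs**2])                # features f_1(x) = x, f_2(x) = x^2
    c = np.array([2.0, 5.5])                 # target moments E[x], E[x^2]

    def neg_dual(lam):
        return logsumexp(lam @ F) - lam @ c      # log Z(lambda) - lambda·c

    lam = minimize(neg_dual, x0=np.zeros(2)).x
    p = np.exp(lam @ F - logsumexp(lam @ F))     # Gibbs distribution for these multipliers

    print(F @ p)                                 # ≈ c: moments are matched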

Theorem 2 (KL duality view)

Claim. MaxEnt with constraints $\mathbb{E}_p[f_i]=c_i$ is equivalent to

$ \min_{p}\; D_{\mathrm{KL}}\!\left(p\,\Big\|\, \frac{\exp(\sum_i \lambda_i f_i)}{Z(\lambda)}\right) \quad\text{for the (unique) }\lambda\text{ that satisfies the constraints.} $

Proof sketch. Plug the Gibbs form from Theorem 1 into $-H(p) = \int p\log p$ and rearrange:

$ -H(p)= \int p \log\frac{p}{p^*} + \int p \log p^* = D_{\mathrm{KL}}(p\|p^*) + \sum_i \lambda_i \underbrace{\mathbb{E}_p[f_i]}_{=c_i} - \log Z(\lambda). $
Under the constraints, the last two terms are constants in $p$, so minimizing $-H(p)$ ⇔ minimizing $D_{\mathrm{KL}}(p\|p^*)$; the minimum is at $p=p^*$. ∎

Corollary (RLHF/DPO connection). Take a reference policy $\pi_{\rm ref}$ as the base density $q$ (per the first remark after Theorem 1), the reward $r(x,y)$ as the single feature, and $1/\beta$ as its multiplier. The optimizer is

$ \pi^*(y\mid x)= \frac{1}{Z(x)}\,\pi_{\rm ref}(y\mid x)\exp\!\Big(\tfrac{1}{\beta}r(x,y)\Big), $
i.e., KL-regularized reward maximization yields an exponential tilt of $\pi_{\rm ref}$.
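
A tiny numerical illustration (toy rewards and reference policy of my own choosing, assuming numpy): the optimum is $\pi_{\rm ref}$ reweighted by $\exp(r/\beta)$, and $\beta$ interpolates between staying at $\pi_{\rm ref}$ (large $\beta$) and putting all mass on the highest-reward action (small $\beta$).

    # Toy sketch: exponential tilt of a reference policy by exp(r / beta) over a
    # small discrete action set. Assumes numpy; policy and rewards are made up.
    import numpy as np

    pi_ref = np.array([0.5, 0.3, 0.2])      # reference policy over 3 actions
    r = np.array([0.0, 1.0, 3.0])           # reward of each action

    def tilted(beta):
        w = pi_ref * np.exp(r / beta)
        return w / w.sum()

    print(tilted(100.0))    # ≈ pi_ref: strong KL penalty, policy barely moves
    print(tilted(1.0))      # intermediate trade-off
    print(tilted(0.1))      # ≈ one-hot on the highest-reward action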


Canonical worked examples

Example A (Finite set ⇒ softmax)

Problem. $\mathcal{X}=\{1,\dots,K\}$. Maximize $H(p)$ s.t. $\sum_k p_k=1$ and $\sum_k p_k s_k = c$.

Solution. By Theorem 1,

$ p_k \propto \exp(\lambda s_k) \quad\Rightarrow\quad p_k = \frac{\exp(\lambda s_k)}{\sum_j \exp(\lambda s_j)}. $
This is softmax over scores $s_k$. Parameter $\lambda$ is chosen so the expected score equals $c$.
Special case: if instead of fixing $c$ you add a reward term $\beta^{-1}\sum_k p_k s_k$ to the entropy objective (i.e., maximize $H(p)+\beta^{-1}\mathbb{E}_p[s]$), you get $p_k\propto \exp(\tfrac{1}{\beta}s_k)$.
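
The only remaining work in Example A is the one-dimensional root find for $\lambda$; here is a sketch (toy scores of my own, assuming numpy and scipy):

    # Toy sketch: choose lambda so that the softmax over scores s has mean score c.
    # Assumes numpy and scipy; c must lie strictly between min(s) and max(s).
    import numpy as np
    from scipy.optimize import brentq

    s = np.array([-1.0, 0.0, 2.0, 4.0])
    c = 1.5

    def mean_score(lam):
        logits = lam * s
        p = np.exp(logits - logits.max())    # numerically stable softmax
        p /= p.sum()
        return p @ s

    lam = brentq(lambda l: mean_score(l) - c, -50.0, 50.0)
    print(lam, mean_score(lam))              # mean_score(lam) ≈ c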


Example B (Maximum entropy with mean only on $\mathbb{R}_{\ge 0}$ ⇒ Exponential)

Problem. Support $x\ge 0$. Known mean $\mathbb{E}[X]=\mu$.

Solution. Constraints: $\int_0^\infty p=1$, $\int_0^\infty x\,p(x)\,dx=\mu$. Theorem 1 with $f_1(x)=x$ gives $p(x)\propto e^{\lambda_1 x}$; normalizability on $[0,\infty)$ forces $\lambda_1<0$, and matching the mean fixes $\lambda_1=-1/\mu$:

$ p(x)=\frac{1}{\mu}\exp\!\Big(-\frac{x}{\mu}\Big). $
So the exponential distribution uniquely maximizes entropy on $[0,\infty)$ with fixed mean.
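
A quick numerical check (assuming scipy; the comparison distributions are arbitrary choices of mine with the same mean): the exponential has larger differential entropy than other common distributions on $[0,\infty)$ with mean $\mu$.

    # Toy sketch: the exponential has the largest differential entropy among some
    # common distributions on [0, inf) with the same mean. Assumes scipy.
    from scipy import stats

    mu = 2.0
    print(stats.expon(scale=mu).entropy())                  # 1 + log(mu) ≈ 1.693
    print(stats.gamma(a=2.0, scale=mu / 2).entropy())       # same mean, smaller entropy
    print(stats.uniform(loc=0.0, scale=2 * mu).entropy())   # uniform on [0, 2*mu], same mean, smaller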


Example C (Maximum entropy with mean and variance on $\mathbb{R}$ ⇒ Gaussian)

Problem. Support $\mathbb{R}$. Known $\mathbb{E}[X]=\mu$, $\mathbb{E}[X^2]=\mu^2+\sigma^2$.

Solution. Take features $f_1(x)=x$, $f_2(x)=x^2$. Theorem 1 yields

$ p(x) \propto \exp(\lambda_1 x + \lambda_2 x^2). $
For normalizability on $\mathbb{R}$ we must have $\lambda_2<0$. Completing the square and matching the two moments gives
$ p(x)=\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). $
Thus the Gaussian is the unique max-entropy distribution on $\mathbb{R}$ with fixed mean and variance.
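
A matching numerical check (assuming scipy; the comparison distributions are my own same-variance choices): the Gaussian beats other symmetric distributions on $\mathbb{R}$ with identical mean and variance.

    # Toy sketch: the Gaussian has the largest differential entropy among some
    # common distributions on R with the same mean and variance. Assumes scipy.
    import numpy as np
    from scipy import stats

    sigma = 1.5
    print(stats.norm(scale=sigma).entropy())                           # 0.5*log(2*pi*e*sigma^2)
    print(stats.laplace(scale=sigma / np.sqrt(2)).entropy())           # same variance, smaller
    print(stats.logistic(scale=sigma * np.sqrt(3) / np.pi).entropy())  # same variance, smaller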


Example D (CRFs / log-linear models)

For conditional modeling, maximize the conditional entropy subject to feature constraints:

$ \max_{p(\cdot|x)} H\big(p(\cdot|x)\big)\quad \text{s.t.}\quad \mathbb{E}_{p(y|x)}[f_i(x,y)] = \hat{c}_i(x). $
Result (by Theorem 1):
$ p(y|x) = \frac{1}{Z(x)}\exp\!\Big(\sum_i \theta_i f_i(x,y)\Big), $
which is the CRF/log-linear family. Logistic/softmax regression is the special case with local features and finite label sets.


Technical notes (for completeness)

  1. Existence/normalizability. In continuous cases the log-partition must be finite (e.g., a negative quadratic coefficient in Example C). Otherwise no finite-entropy maximizer exists.

  2. Uniqueness. Strict concavity of $H$ yields uniqueness whenever the feasible set is nonempty and closed, and the problem is well-posed (Slater’s condition holds for typical linear constraints).

  3. Dual problem. The dual maximizes

    $ g(\lambda)= -\log Z(\lambda) + \sum_i \lambda_i c_i, $
    which is concave in $\lambda$. The KKT/dual optimal $\lambda^*$ solves the moment-matching equations $\partial \log Z(\lambda)/\partial \lambda_i = c_i$.

  4. Regularized view. Maximizing $\mathbb{E}_p[f]$ minus $\beta D_{\mathrm{KL}}(p\|q)$ over $p$ (no hard constraints) yields the exponential tilt of $q$:

    $ p^*(x)=\frac{q(x)\exp\!\big(\tfrac{1}{\beta} f(x)\big)}{\int q(u)\exp\!\big(\tfrac{1}{\beta} f(u)\big)\,du}. $
    This is exactly the RLHF/DPO “reward minus KL” solution.


Further reading

  • E.T. Jaynes, Information Theory and Statistical Mechanics (1957); Probability Theory: The Logic of Science (2003).
  • Cover & Thomas, Elements of Information Theory, Ch. 12.
  • Berger, Statistical Decision Theory and Bayesian Analysis, Ch. 3.