Likelihood, Entropy & Information Criteria

1. Likelihood

What is Likelihood?

Likelihood answers:

Given the observed data, which parameter value makes those data most probable?

General Likelihood Function

For independent observations:

\[ L(\theta) = \prod_{i=1}^{n} f(y_i \mid \theta) \]

Notation

  • \(L(\theta)\) : Likelihood function
  • \(\theta\) : Parameter vector
  • \(f(y_i|\theta)\) : Probability density/mass
  • \(\prod\) : Product over observations
  • \(n\) : Sample size

Log-Likelihood

\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(y_i|\theta) \]

Why Log?

  • Converts product → sum
  • Easier differentiation
  • Numerically stable
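
The stability point can be seen numerically. The sketch below is illustrative only: the standard normal model, the sample size of 2000, and the helper name `normal_pdf` are assumptions chosen for the demonstration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed setup: 2000 draws from N(0, 1), evaluated at the true parameters.
y = rng.normal(loc=0.0, scale=1.0, size=2000)

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

densities = normal_pdf(y, 0.0, 1.0)

# Product of many values below 1 underflows to zero in double precision.
likelihood = np.prod(densities)
# Sum of logs stays finite and is safe to optimize.
log_likelihood = np.sum(np.log(densities))

print("L(theta)   =", likelihood)
print("l(theta)   =", log_likelihood)
```

The raw product is numerically zero, so it cannot be compared across parameter values, while the log-likelihood remains a usable finite number.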

Likelihood in Regression Models

General Regression Model

\[ Y_i = g(X_i, \beta) + \varepsilon_i \]

Assume:

\[ \varepsilon_i \sim f(0,\sigma^2) \]

Then the density of each observation,

\[ f(y_i|\beta,\sigma^2) \]

determines the likelihood.

Linear Regression Case

If:

\[ Y_i \sim N(\mu_i, \sigma^2) \]

where

\[ \mu_i = X_i^\top \beta \]

Log-likelihood:

\[ \ell = -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log(\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i - X_i^\top \beta)^2 \]

Important Components

  • \(\sigma^2\) : Error variance
  • \(X_i^\top \beta\) : Linear predictor
  • \(\sum (y_i - X_i^\top \beta)^2\) : Residual Sum of Squares

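Only the last term of \(\ell\) depends on \(\beta\), so maximizing \(\ell\) over \(\beta\) is the same as minimizing the Residual Sum of Squares.

OLS = MLE (for \(\beta\)) under normal errors.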
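
A small numerical check of this equivalence, as a sketch under assumed simulated data (the design matrix, true coefficients, and noise level are all made up for illustration). It evaluates the normal log-likelihood above at the least-squares solution and at a slightly perturbed coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Hypothetical data: intercept plus one predictor, normal errors.
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

def log_lik(beta, sigma2):
    """Normal linear-regression log-likelihood from the formula above."""
    rss = np.sum((y - X @ beta) ** 2)
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - rss / (2 * sigma2))

# Least-squares estimate of beta.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
# MLE of the error variance divides the RSS by n (not n - p).
sigma2_mle = np.sum((y - X @ beta_ols) ** 2) / n

ll_at_ols = log_lik(beta_ols, sigma2_mle)
ll_perturbed = log_lik(beta_ols + np.array([0.05, -0.05]), sigma2_mle)

print("log-likelihood at OLS estimate:", ll_at_ols)
print("log-likelihood after perturbing beta:", ll_perturbed)
```

Any move away from the OLS coefficients increases the RSS and therefore lowers the log-likelihood for a fixed \(\sigma^2\).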

2. History of Informational Entropy

Claude Shannon (1948)

Paper:

“A Mathematical Theory of Communication”

Goal:

  • Measure information
  • Measure uncertainty
  • Optimize communication systems

Shannon defined entropy as:

\[ H(X) = -\sum p(x)\log p(x) \]

Statistical vs Thermodynamic Entropy

Thermodynamic Entropy

From physics:

\[ S = k \log W \]

  • \(S\) : Entropy
  • \(k\) : Boltzmann constant
  • \(W\) : Number of microstates

Measures physical disorder.

Statistical Entropy (Shannon)

\[ H(X) = -\sum p(x)\log p(x) \]

Measures:

Uncertainty in probability distribution

Key Differences

Physics                       Statistics
----------------------------  ------------------------
Measures disorder             Measures uncertainty
Depends on physical states    Depends on probabilities
Units: Joules/Kelvin          Units: bits or nats

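Same mathematical form, different interpretation: with \(W\) equally likely microstates, \(p = 1/W\) and the Shannon formula reduces to \(\log W\).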

3. Shannon Entropy Calculation

Discrete Case

\[ H(X) = -\sum_x p(x)\log p(x) \]

Interpretation:

\[ H(X) = E[-\log p(X)] \]
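
As a worked illustration of the discrete formula, here is a minimal sketch. The function name `shannon_entropy` and the three example distributions are assumptions for this example; zero-probability outcomes are dropped using the usual convention \(0 \log 0 = 0\).

```python
import numpy as np

def shannon_entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (base 2 gives bits)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log 0 = 0
    h = -np.sum(p * np.log(p)) / np.log(base)
    return float(h) + 0.0             # "+ 0.0" turns a possible -0.0 into 0.0

fair = shannon_entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased = shannon_entropy([0.9, 0.1])    # less uncertain than a fair coin
certain = shannon_entropy([1.0, 0.0])   # no uncertainty at all

print(fair, biased, certain)
```

Passing `base=np.e` instead returns the entropy in nats.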

Continuous Case

\[ H(f) = -\int f(x)\log f(x)\,dx \]

  • \(\int\) : Integral
  • \(dx\) : Infinitesimal change
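
The integral can be approximated on a grid. The sketch below assumes a normal density with \(\sigma = 2\) (an arbitrary choice) and compares a simple Riemann sum with the known closed form for the normal distribution, \(\tfrac{1}{2}\log(2\pi e \sigma^2)\), in nats.

```python
import numpy as np

sigma = 2.0                                  # assumed standard deviation
x = np.linspace(-40.0, 40.0, 400001)         # wide grid so the tails are negligible
dx = x[1] - x[0]
f = np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Guard against log(0) if the density underflows; uses 0 * log 0 = 0.
integrand = np.zeros_like(f)
mask = f > 0
integrand[mask] = f[mask] * np.log(f[mask])

h_numeric = -np.sum(integrand) * dx          # Riemann-sum approximation of the integral
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

print("numeric:", h_numeric)
print("closed form:", h_closed)
```

Unlike the discrete case, this continuous entropy can be negative, for example when \(\sigma\) is small enough that \(2\pi e \sigma^2 < 1\).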

4. Lindley’s Information Measure

Dennis Lindley (1956)

Information gain:

\[ \text{Gain} = H(\text{Prior}) - H(\text{Posterior}) \]

Interpretation

  • Prior entropy → Uncertainty before data
  • Posterior entropy → Uncertainty after data
  • Difference → Learning from data

Learning = Reduction in entropy.
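
A small discrete illustration of the gain formula, as a sketch only: the three candidate coin biases, the uniform prior, and the observed counts are hypothetical. Note that this computes the entropy reduction for one particular data set; Lindley's measure is usually stated as the expectation of this quantity over possible data, and for a single data set the difference can be negative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with 0 * log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

thetas = np.array([0.2, 0.5, 0.8])       # hypothetical candidate values of P(heads)
prior = np.array([1 / 3, 1 / 3, 1 / 3])  # uniform prior over the candidates

heads, tails = 8, 2                      # hypothetical observed coin flips
likelihood = thetas ** heads * (1 - thetas) ** tails

# Bayes' rule: posterior is proportional to prior times likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()

gain = entropy(prior) - entropy(posterior)

print("posterior:", posterior)
print("information gain (nats):", gain)
```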

5. Kullback–Leibler Divergence

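Measures how far a distribution \(Q\) is from a reference distribution \(P\). It is not a true distance: it is not symmetric, so in general \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\).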

\[ D_{KL}(P||Q) = \sum_x p(x)\log\frac{p(x)}{q(x)} \]

Continuous Version

\[ D_{KL}(P||Q) = \int p(x)\log\frac{p(x)}{q(x)}dx \]

Expanded Form

\[ D_{KL} = \int p(x)\log p(x)dx - \int p(x)\log q(x)dx \]
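
For a discrete pair of distributions the same split into two terms can be checked directly. In this sketch the vectors `p` and `q` are arbitrary example values, and the helper `kl` assumes both are strictly positive.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # example "true" distribution
q = np.array([0.4, 0.4, 0.2])   # example approximating distribution

def kl(p, q):
    """D_KL(P || Q) in nats for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

# The two terms of the expanded form (sums replace integrals).
term_pp = np.sum(p * np.log(p))
term_pq = np.sum(p * np.log(q))

print("KL(P||Q):", kl(p, q))
print("expanded form:", term_pp - term_pq)
print("KL(Q||P):", kl(q, p))   # differs from KL(P||Q)
```

Swapping the arguments gives a different value, and the divergence of a distribution from itself is zero.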

Connection to Entropy

\[ D_{KL} = -H(P) - E_P[\log q(X)] \]

So:

KL divergence
= −Entropy − Expected log-likelihood under P
= Cross-entropy − Entropy

6. KL and Maximum Likelihood

True distribution: \(f_0\)

Model: \(f_\theta\)

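Minimize over \(\theta\):

\[ \min_\theta D_{KL}(f_0 || f_\theta) \]

The entropy of \(f_0\) does not depend on \(\theta\), so this is equivalent to:

\[ \max_\theta E_{f_0}[\log f_\theta(X)] \]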

Sample Version

\[ \max \sum_{i=1}^{n} \log f(x_i|\theta) \]

This is:

Maximum Likelihood Estimation

MLE ≈ KL minimization (the sample average replaces the expectation under \(f_0\)).
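
The link can be illustrated with a Bernoulli model, as a sketch under assumed settings: a true success probability of 0.3, a sample of 10,000 draws, and a grid search over candidate \(\theta\) values. None of these choices come from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 0.3                                  # assumed true parameter of f_0
x = rng.binomial(1, theta0, size=10_000)      # sample from f_0

grid = np.linspace(0.01, 0.99, 99)            # candidate theta values

# Exact KL(f_0 || f_theta) for two Bernoulli distributions.
kl = (theta0 * np.log(theta0 / grid)
      + (1 - theta0) * np.log((1 - theta0) / (1 - grid)))

# Average log-likelihood of the sample at each candidate theta.
xbar = x.mean()
avg_ll = xbar * np.log(grid) + (1 - xbar) * np.log(1 - grid)

theta_kl = grid[np.argmin(kl)]                # KL minimizer (uses the unknown f_0)
theta_mle = grid[np.argmax(avg_ll)]           # likelihood maximizer (uses data only)

print("KL minimizer:", theta_kl)
print("MLE on grid:", theta_mle)
```

The KL minimizer requires knowing \(f_0\), whereas the MLE uses only the sample; with enough data the two land on nearly the same \(\theta\).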

7. Akaike Information Criterion (AIC)

Akaike (1973)

Goal:

Estimate expected KL divergence.

Formula

\[ AIC = -2\ell(\hat{\theta}) + 2k \]

Why 2k?

  • \(k\) : Number of estimated parameters
  • First term → Fit
  • Second term → Bias correction
  • Corrects optimism in log-likelihood (the same data are used to fit and to evaluate the model)

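AIC ≈ Estimated expected KL divergence, up to a scale factor and an additive constant shared by all models fit to the same data. Only differences in AIC between models are meaningful.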

Lower AIC → Better model.
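
A minimal sketch of AIC-based comparison, assuming simulated data from a quadratic curve and polynomial regression models with normal errors. The helper name `gaussian_aic` and all data settings are illustrative. Here \(k\) counts the polynomial coefficients plus the error variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
# Hypothetical data: the true mean curve is quadratic.
y = 1.0 + 0.5 * x - 1.5 * x ** 2 + rng.normal(scale=0.5, size=n)

def gaussian_aic(degree):
    """AIC of a polynomial regression of the given degree with normal errors."""
    X = np.vander(x, degree + 1)                  # polynomial design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS = MLE for beta
    rss = np.sum((y - X @ beta) ** 2)
    sigma2 = rss / n                              # MLE of the error variance
    # Maximized log-likelihood: the RSS term simplifies to n/2 at the MLE.
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(sigma2) + 1.0)
    k = degree + 2                                # coefficients plus sigma^2
    return -2.0 * loglik + 2 * k

aic = {d: gaussian_aic(d) for d in range(1, 6)}
best = min(aic, key=aic.get)

for d, value in aic.items():
    print(f"degree {d}: AIC = {value:.1f}")
print("lowest AIC at degree", best)
```

The straight-line model fits poorly and gets a much higher AIC, while degrees above 2 improve the fit only slightly and pay the 2k penalty for each extra coefficient.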

Conceptual Flow

Likelihood → Log-likelihood → Entropy → KL Divergence → MLE → AIC

Final Summary

  • Likelihood measures model fit
  • Entropy measures uncertainty
  • Lindley measures information gain
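  • KL measures information lost when a model approximates the true distribution
  • MLE minimizes a sample estimate of KL divergence
  • AIC estimates expected KL divergence, with a correction for model complexity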

All are mathematically connected.