Likelihood answers:
Given the observed data, which parameter value makes that data most probable?
For independent observations:
\[ L(\theta) = \prod_{i=1}^{n} f(y_i \mid \theta) \]
\[ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(y_i|\theta) \]
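The two formulas above can be sketched in a few lines. This is a minimal illustration, not a general implementation: it assumes a normal model with known \(\sigma = 1\), and the data values and candidate grid are made up.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, mu, sigma=1.0):
    """L(theta): product of densities over independent observations."""
    return math.prod(normal_pdf(y, mu, sigma) for y in data)

def log_likelihood(data, mu, sigma=1.0):
    """l(theta): sum of log densities."""
    return sum(math.log(normal_pdf(y, mu, sigma)) for y in data)

# Made-up observations, for illustration only.
data = [4.8, 5.1, 5.3, 4.9, 5.4]

# Hypothetical candidate values for the mean parameter.
candidates = [4.0, 4.5, 5.0, 5.1, 5.5, 6.0]
for mu in candidates:
    print(f"mu={mu:.1f}  L={likelihood(data, mu):.3e}  "
          f"logL={log_likelihood(data, mu):.3f}")

# The candidate with the highest log-likelihood also has the highest likelihood.
best = max(candidates, key=lambda mu: log_likelihood(data, mu))
print("best candidate:", best)
```

Working with the sum of logs rather than the product avoids numerical underflow when \(n\) is large, and it leaves the maximizer unchanged because the logarithm is monotone.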
\[ Y_i = g(X_i, \beta) + \varepsilon_i \]
Assume:
\[ \varepsilon_i \sim f(0,\sigma^2) \]
Then the density of each observation,
\[ f(y_i|\beta,\sigma^2) \]
determines the likelihood.
If:
\[ Y_i \sim N(\mu_i, \sigma^2) \]
where
\[ \mu_i = X_i^\top \beta \]
Log-likelihood:
\[ \ell = -\frac{n}{2}\log(2\pi) -\frac{n}{2}\log(\sigma^2) -\frac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i - X_i^\top \beta)^2 \]
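Only the last term depends on \(\beta\), so maximizing \(\ell\) over \(\beta\) is the same as minimizing the sum of squared residuals.
OLS = MLE under normal errors.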
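A small numerical check of this equivalence, as a sketch under stated assumptions: a single predictor with an intercept, made-up data, and a coarse hypothetical grid of perturbations around the closed-form OLS estimate.

```python
import math

# Made-up data for a simple linear regression y = b0 + b1 * x + error.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

def ols(x, y):
    """Closed-form least-squares estimates for intercept and slope."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def rss(b0, b1):
    """Residual sum of squares."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

def log_lik(b0, b1, sigma2):
    """Normal log-likelihood, term by term as in the formula above."""
    return (-n / 2 * math.log(2 * math.pi)
            - n / 2 * math.log(sigma2)
            - rss(b0, b1) / (2 * sigma2))

b0, b1 = ols(x, y)
sigma2_hat = rss(b0, b1) / n  # MLE of sigma^2 (divides by n, not n - 2)

# Perturb the OLS estimate: no perturbed point has a higher log-likelihood.
grid = [(b0 + d0, b1 + d1)
        for d0 in (-0.1, 0.0, 0.1)
        for d1 in (-0.05, 0.0, 0.05)]
best = max(grid, key=lambda b: log_lik(b[0], b[1], sigma2_hat))
print("OLS estimate:", (b0, b1))
print("grid maximizer of log-likelihood:", best)
```

The grid only probes a neighborhood, so this illustrates the result rather than proving it.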
Paper:
Claude Shannon (1948), “A Mathematical Theory of Communication”
Goal:
Shannon defined entropy as:
\[ H(X) = -\sum p(x)\log p(x) \]
From physics:
\[ S = k \log W \]
Measures physical disorder.
\[ H(X) = -\sum p(x)\log p(x) \]
Measures uncertainty in a probability distribution.
| Physics | Statistics |
|---|---|
| Measures disorder | Measures uncertainty |
| Depends on physical states | Depends on probabilities |
| Units: Joules/Kelvin | Units: bits or nats |
Same mathematics — different interpretation.
\[ H(X) = -\sum_x p(x)\log p(x) \]
Interpretation:
\[ H(X) = E[-\log p(X)] \]
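The discrete definition can be sketched directly. The distributions below are made up, and the function uses base-2 logarithms so the result is in bits; swapping in `math.log` would give nats.

```python
import math

def entropy_bits(p):
    """H(X) = -sum p(x) log2 p(x); outcomes with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# Made-up example distributions.
fair = [0.5, 0.5]
biased = [0.9, 0.1]
certain = [1.0, 0.0]
uniform4 = [0.25] * 4

for name, p in [("fair coin", fair), ("biased coin", biased),
                ("certain outcome", certain), ("uniform over 4", uniform4)]:
    print(f"{name}: {entropy_bits(p):.4f} bits")
```

Entropy is largest for the uniform distribution and zero when one outcome is certain, which matches the reading of \(H\) as average surprise \(E[-\log p(X)]\).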
For a continuous density \(f\) (differential entropy):
\[ H(f) = -\int f(x)\log f(x)\,dx \]
Dennis Lindley (1956)
Information gain:
\[ \text{Gain} = H(\text{Prior}) - H(\text{Posterior}) \]
Learning = Reduction in entropy.
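A minimal sketch of this gain for one observed data set. The assumptions are for illustration only: three hypothetical candidate values for a coin's heads probability, a uniform prior over them, and made-up counts of 8 heads and 2 tails.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical candidate values for the heads probability, uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

# Made-up data: 8 heads and 2 tails.
heads, tails = 8, 2

# Bayes' rule: posterior is proportional to prior times likelihood.
unnorm = [pr * th ** heads * (1 - th) ** tails
          for pr, th in zip(prior, thetas)]
total = sum(unnorm)
posterior = [u / total for u in unnorm]

gain = entropy(prior) - entropy(posterior)
print("posterior:", [round(p, 4) for p in posterior])
print("information gain (nats):", round(gain, 4))
```

For a particular data set the difference can be negative if the data make the candidates look more evenly plausible than the prior did; averaging the gain over possible data sets gives a non-negative quantity.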
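Measures how far a distribution \(Q\) is from a reference distribution \(P\). It is not a true distance: it is asymmetric, so \(D_{KL}(P||Q) \neq D_{KL}(Q||P)\) in general.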
\[ D_{KL}(P||Q) = \sum_x p(x)\log\frac{p(x)}{q(x)} \]
\[ D_{KL}(P||Q) = \int p(x)\log\frac{p(x)}{q(x)}dx \]
\[ D_{KL} = \int p(x)\log p(x)dx - \int p(x)\log q(x)dx \]
\[ D_{KL} = -H(P) - E_P[\log q(X)] \]
So:
KL divergence
= −Entropy − Expected log-likelihood
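As a numerical check of the decomposition, the sketch below computes \(D_{KL}(P||Q)\) directly and as \(-H(P) - E_P[\log q(X)]\), using that \(\int p\log p = -H(P)\). The two discrete distributions are made up.

```python
import math

def kl(p, q):
    """D_KL(P||Q) = sum p(x) log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(P) = -sum p(x) log p(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def expected_log_q(p, q):
    """E_P[log q(X)]: expected log-likelihood of Q under P."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

direct = kl(p, q)
decomposed = -entropy(p) - expected_log_q(p, q)

print("direct:      ", direct)
print("decomposed:  ", decomposed)
print("reverse KL:  ", kl(q, p))  # differs from kl(p, q): KL is asymmetric
```

The entropy term depends only on \(P\), so for a fixed \(P\) a smaller divergence means a larger expected log-likelihood under \(Q\).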
True distribution: \(f_0\)
Model: \(f_\theta\)
Minimize:
\[ \min D_{KL}(f_0 || f_\theta) \]
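Since the entropy of \(f_0\) does not depend on \(\theta\), this is equivalent to:
\[ \max_\theta E_{f_0}[\log f_\theta(X)] \]
Replacing the expectation with the sample average (the factor \(1/n\) does not change the maximizer):
\[ \max_\theta \sum_{i=1}^{n} \log f(x_i|\theta) \]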
This is:
Maximum Likelihood Estimation
MLE = KL minimization, with the sample standing in for the true distribution.
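A sketch of this equivalence for a Bernoulli model. The assumptions are illustrative: a true success probability of 0.3, a hypothetical grid of candidate values, and a made-up sample of 100 observations whose proportion of ones happens to equal 0.3 exactly, so both criteria can be compared on the same grid.

```python
import math

p0 = 0.3  # assumed true success probability (illustrative)

def kl_bernoulli(p, theta):
    """D_KL(Bernoulli(p) || Bernoulli(theta)), in nats."""
    return (p * math.log(p / theta)
            + (1 - p) * math.log((1 - p) / (1 - theta)))

def log_lik(sample, theta):
    """Sum of log f(x_i | theta) for a 0/1 sample."""
    return sum(math.log(theta) if xi == 1 else math.log(1 - theta)
               for xi in sample)

# Made-up sample: 30 ones and 70 zeros.
sample = [1] * 30 + [0] * 70

# Hypothetical grid of candidate parameter values in (0, 1).
grid = [i / 100 for i in range(1, 100)]

theta_kl = min(grid, key=lambda t: kl_bernoulli(p0, t))
theta_mle = max(grid, key=lambda t: log_lik(sample, t))

print("KL minimizer:        ", theta_kl)
print("likelihood maximizer:", theta_mle)
```

With a real random sample the two values agree only approximately, and the gap shrinks as \(n\) grows.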
Akaike (1973)
Goal:
Estimate expected KL divergence.
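\[ AIC = -2\ell(\hat{\theta}) + 2k \]
where \(\ell(\hat{\theta})\) is the maximized log-likelihood and \(k\) is the number of estimated parameters.
AIC estimates the expected KL divergence only up to a scale factor and an additive constant shared by all candidate models, so only differences in AIC are meaningful.
Lower AIC → Better model, among candidates fitted to the same data.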
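A minimal sketch of an AIC comparison. The data are made up and both candidate models are hypothetical: a normal model with the mean fixed at 5.0 (only the variance is estimated, \(k = 1\)) and a normal model with both mean and variance estimated (\(k = 2\)).

```python
import math

def normal_max_loglik(data, mu):
    """Normal log-likelihood at the given mean, with sigma^2 set to its MLE."""
    n = len(data)
    sigma2 = sum((y - mu) ** 2 for y in data) / n
    # At the MLE of sigma^2 the residual term reduces to -n/2.
    return -n / 2 * math.log(2 * math.pi) - n / 2 * math.log(sigma2) - n / 2

def aic(loglik, k):
    """AIC = -2 * maximized log-likelihood + 2 * number of parameters."""
    return -2 * loglik + 2 * k

# Made-up observations.
data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 5.2]

# Model A (hypothetical): mean fixed at 5.0, variance estimated, k = 1.
aic_fixed = aic(normal_max_loglik(data, 5.0), k=1)

# Model B (hypothetical): mean and variance both estimated, k = 2.
mean_hat = sum(data) / len(data)
aic_free = aic(normal_max_loglik(data, mean_hat), k=2)

print("AIC, fixed mean:    ", round(aic_fixed, 3))
print("AIC, estimated mean:", round(aic_free, 3))
```

On these data the model with the estimated mean fits slightly better, but not by enough to offset the penalty for its extra parameter, so the fixed-mean model has the lower AIC.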
Likelihood
↓
Log-likelihood
↓
Entropy
↓
KL Divergence
↓
MLE
↓
AIC
All are mathematically connected.