| # | Topic |
|---|---|
| 01 | Likelihood |
| 02 | Information Entropy |
| 03 | Kullback-Leibler Divergence |
What is it? Why does it matter? How do we compute it?
Imagine you flip a coin 3 times and get: Heads, Heads, Tails
How likely is this if the coin is fair (p = 0.5)? What if p = 0.7?
Likelihood answers: given the data we observed, how plausible is each parameter value?
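A minimal sketch of this comparison, assuming independent flips. The helper name `bernoulli_likelihood` is hypothetical and only illustrates the calculation.

```python
def bernoulli_likelihood(p, flips):
    """Likelihood of an independent flip sequence when P(Heads) = p."""
    likelihood = 1.0
    for flip in flips:
        likelihood *= p if flip == "H" else (1.0 - p)
    return likelihood


flips = ["H", "H", "T"]
for p in (0.5, 0.7):
    print(f"p = {p}: L = {bernoulli_likelihood(p, flips):.3f}")
# p = 0.5: L = 0.125
# p = 0.7: L = 0.147
```

For this data, p = 0.7 is slightly more plausible than the fair coin.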
| Definition | |
|---|---|
| Probability | Parameter fixed → ask about data: \(P(\text{data} \mid \theta)\) |
| Likelihood | Data fixed → ask about parameter: \(L(\theta \mid \text{data})\) |
Warning
Likelihood is NOT a probability of the parameter — it is a function of the parameter given fixed data.
Same formula, different perspective!
For \(n\) independent observations:
\[L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i \mid \theta)\]
\[\ell(\theta) = \log L = \sum_{i=1}^n \log f(x_i \mid \theta)\]
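Why use log? The log turns the product into a sum, which is easier to differentiate and avoids numerical underflow. Because log is monotonic, the maximising \(\theta\) is unchanged.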
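One practical reason can be seen numerically. The sketch below uses hypothetical data (2000 observations, each with density value 0.5) to show the raw product collapsing to zero in double precision while the sum of logs stays usable.

```python
import math

# Hypothetical data: 2000 observations, each with density value 0.5.
densities = [0.5] * 2000

product = 1.0
for f in densities:
    product *= f  # shrinks below the smallest representable double

log_likelihood = sum(math.log(f) for f in densities)

print(product)                   # 0.0 (underflow)
print(f"{log_likelihood:.2f}")   # -1386.29
```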
Find \(\hat\theta\) that maximises \(L(\theta \mid \text{data})\).
Usually solve:
\[\frac{d}{d\theta}\,\ell(\theta) = 0\]
This gives the “best fit” parameter.
📌 You flip a coin 5 times and observe: H H H T H (4 Heads, 1 Tail)
Goal: estimate \(p\) = probability of Heads
Each flip is modelled as Bernoulli(\(p\)): \(P(H) = p\), \(\quad P(T) = 1 - p\)
\[L(p) = p \times p \times p \times (1-p) \times p = p^4(1-p)^1\]
Taking the log:
\[\ell(p) = 4\log(p) + 1\log(1-p)\]
\[\frac{d\ell}{dp} = \frac{4}{p} - \frac{1}{1-p} = 0\]
\[4(1-p) = p \quad\Rightarrow\quad \hat{p} = \frac{4}{5} = 0.8\]
Tip
MLE Answer: \(\hat{p} = 0.8\) — Intuitively correct! 4 out of 5 flips were Heads.
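The same answer can be checked numerically. This is a sketch using a simple grid search over \(p\) (an illustrative choice, not the only way to maximise \(\ell\)); the function name is hypothetical.

```python
import math


def coin_log_likelihood(p, heads, tails):
    """Log-likelihood of a Bernoulli(p) model for the observed counts."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


heads, tails = 4, 1
grid = [i / 1000 for i in range(1, 1000)]  # skip p = 0 and p = 1, where log is undefined
p_hat = max(grid, key=lambda p: coin_log_likelihood(p, heads, tails))
print(p_hat)  # 0.8
```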
Assume errors \(\sim \mathcal{N}(0,\sigma^2)\)
\[L(\beta,\sigma \mid y) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}\right)\]
Log-Likelihood:
\[\ell = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum(y_i - x_i^\top\beta)^2\]
Maximising \(\ell\) w.r.t. \(\beta\) is equivalent to minimising:
\[\sum(y_i - x_i^\top\beta)^2\]
Note
OLS = MLE under normality!
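A small numerical check of this equivalence, assuming a one-predictor model \(y = \beta_0 + \beta_1 x\) and made-up data. The closed-form OLS estimates are computed first, and then the Gaussian log-likelihood is confirmed to drop when either coefficient is nudged away from them.

```python
import math

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Closed-form OLS estimates for y = b0 + b1 * x.
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar


def gaussian_loglik(beta0, beta1, sigma2):
    """Gaussian log-likelihood of the linear model for the data above."""
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - rss / (2 * sigma2))


sigma2 = 1.0  # any fixed sigma^2 gives the same maximiser in beta
best = gaussian_loglik(b0, b1, sigma2)
perturbations = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
print(all(gaussian_loglik(b0 + d0, b1 + d1, sigma2) < best
          for d0, d1 in perturbations))  # True
```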
Binary outcome \(y_i \in \{0,1\}\), with:
\[\pi_i = P(y_i=1) = \frac{1}{1+e^{-x_i^\top\beta}}\]
Likelihood:
\[L(\beta \mid y) = \prod_i \pi_i^{y_i}(1-\pi_i)^{1-y_i}\]
Log-Likelihood:
\[\ell(\beta) = \sum_i \bigl[y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i)\bigr]\]
Unlike linear regression, there is no closed-form solution.
We use numerical optimisation: starting from an initial guess, the algorithm updates \(\beta\) iteratively until \(\ell(\beta)\) stops increasing.
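One possible iterative scheme is plain gradient ascent on \(\ell(\beta)\). The sketch below is an assumption for illustration: the data, step size, and iteration count are all hypothetical, and real software typically uses faster methods. The data are deliberately non-separable so that a finite maximiser exists.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(beta0, beta1, xs, ys):
    """Bernoulli log-likelihood of a one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        pi = sigmoid(beta0 + beta1 * x)
        total += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return total


def fit_logistic(xs, ys, step=0.02, iterations=20000):
    """Gradient ascent; the gradient components are sum(y - pi) and sum((y - pi) * x)."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(iterations):
        residuals = [y - sigmoid(beta0 + beta1 * x) for x, y in zip(xs, ys)]
        beta0 += step * sum(residuals)
        beta1 += step * sum(r * x for r, x in zip(residuals, xs))
    return beta0, beta1


# Hypothetical, non-separable data: hours studied and pass (1) / fail (0).
hours = [1, 2, 2, 3, 3, 4, 4, 5]
passed = [0, 0, 1, 0, 1, 1, 0, 1]

beta0_hat, beta1_hat = fit_logistic(hours, passed)
print(log_likelihood(0.0, 0.0, hours, passed))              # starting value
print(log_likelihood(beta0_hat, beta1_hat, hours, passed))  # larger after fitting
```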
📌 Predict Pass/Fail from hours studied. \(\beta_0 = -3\), \(\beta_1 = 1.5\)
| Student | Hours | Outcome | \(x^\top\beta\) | \(\pi\) | \(\log L\) |
|---|---|---|---|---|---|
| A | 2 | Fail | 0 | 0.500 | −0.693 |
| B | 3 | Pass | 1.5 | 0.818 | −0.201 |
| C | 4 | Pass | 3 | 0.953 | −0.049 |
\[\ell(\beta) = (-0.693) + (-0.201) + (-0.049) = -0.943\]
MLE finds \(\beta_0, \beta_1\) that make this as large (closest to 0) as possible.
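The per-student contributions can be recomputed directly from \(\beta_0 = -3\) and \(\beta_1 = 1.5\). A short sketch, with outcomes coded as 1 = Pass and 0 = Fail (the coding is an assumption of this example):

```python
import math

beta0, beta1 = -3.0, 1.5
students = [("A", 2, 0), ("B", 3, 1), ("C", 4, 1)]  # (name, hours, outcome)

total = 0.0
for name, hours, outcome in students:
    eta = beta0 + beta1 * hours            # linear predictor x'beta
    pi = 1.0 / (1.0 + math.exp(-eta))      # P(Pass)
    contribution = math.log(pi) if outcome == 1 else math.log(1.0 - pi)
    total += contribution
    print(f"{name}: x'beta = {eta:.1f}, pi = {pi:.3f}, log-lik = {contribution:.3f}")

print(f"log-likelihood = {total:.3f}")
```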
History · Shannon · Physics vs. Statistics · Lindley’s Measure
| Year | Person | Contribution |
|---|---|---|
| 1850s | Clausius | Thermodynamic entropy: \(dS = dQ/T\) |
| 1877 | Boltzmann | \(S = k \cdot \ln(W)\) — microstates |
| 1948 | Shannon | Information entropy — uncertainty in distributions |
| 1956 | Lindley | Bayesian information measure |
| 1957 | Jaynes | Maximum Entropy Principle |
\[H(X) = -\sum_x p(x)\cdot\log_2 p(x)\]
(Sum over all possible outcomes \(x\))
| Property | Detail |
|---|---|
| 📐 Units | \(\log_2\) → bits; \(\ln\) → nats; \(\log_{10}\) → hartleys |
| ⚖️ Range | \(H \geq 0\) always; \(H = 0\) means certainty |
| 🎯 Maximum | \(H\) is highest when all outcomes are equally likely |
| ❓ Interpretation | Average “surprise” per observation |
\(p(H) = 0.5\), \(\quad p(T) = 0.5\)
\[H(X) = -[0.5\log_2(0.5) + 0.5\log_2(0.5)]\] \[= -[0.5(-1)+0.5(-1)] = \mathbf{1.0 \text{ bit}}\]
➡️ Maximum uncertainty — you need exactly 1 bit to describe each flip.
\(p(H) = 0.9\), \(\quad p(T) = 0.1\)
\[H(X) = -[0.9\log_2(0.9) + 0.1\log_2(0.1)]\] \[= -[-0.137 + (-0.332)] = \mathbf{0.469 \text{ bits}}\]
➡️ Lower entropy = less surprise. Almost certain it will be Heads!
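Both coin calculations can be reproduced with a few lines. A minimal sketch; the function name `entropy` is this example's own, and outcomes with \(p = 0\) are skipped by the usual convention \(0 \log 0 = 0\).

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(f"{entropy([0.5, 0.5]):.3f}")  # 1.000  (fair coin)
print(f"{entropy([0.9, 0.1]):.3f}")  # 0.469  (biased coin)
```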
| | Physics | Statistics |
|---|---|---|
| Formula | \(S = k \cdot \ln(W)\) | \(H = -\sum p(x)\log p(x)\) |
| Units | Joules / Kelvin | Bits or Nats |
| Variable | \(W\) = microstates | \(p(x)\) = outcome probability |
| Concept | Disorder / irreversibility | Uncertainty / unpredictability |
| Subject | Gas molecules, heat | Probability distributions, data |
| Direction | Never decreases in an isolated system (2nd Law) | Can increase or decrease |
Note
Shannon chose the name “entropy” because Boltzmann’s formula has the same mathematical structure.
Lindley (1956): Expected information gained from an experiment:
\[I = H(\theta) - \mathbb{E}[H(\theta \mid x)]\]
\[= \text{Entropy(prior)} - \text{Expected Entropy(posterior)}\]
Big \(I\) → data was very informative
Small \(I\) → data told you little
Estimating if a coin is fair:
\[I = H(\text{prior}) - H(\text{posterior}) \quad\Rightarrow\quad \text{large = very useful data!}\]
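A toy version of this calculation, under strong simplifying assumptions that are not from the slides: \(\theta\) can take only two values (fair coin, 0.5, or heads-biased coin, 0.9), the prior is 50/50, and the experiment is a single flip.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


thetas = [0.5, 0.9]  # hypothetical candidate values of P(Heads)
prior = [0.5, 0.5]

expected_posterior_entropy = 0.0
for outcome in ("H", "T"):
    likelihoods = [t if outcome == "H" else 1.0 - t for t in thetas]
    evidence = sum(lik * pr for lik, pr in zip(likelihoods, prior))    # P(outcome)
    posterior = [lik * pr / evidence for lik, pr in zip(likelihoods, prior)]
    expected_posterior_entropy += evidence * entropy(posterior)

information = entropy(prior) - expected_posterior_entropy
print(f"I = {information:.3f} bits")  # I = 0.147 bits
```

A single flip resolves only a small part of the 1 bit of prior uncertainty; more flips would be more informative.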
How different are two probability distributions?
KL divergence measures how much distribution \(P\) differs from a reference distribution \(Q\).
Discrete:
\[D_{KL}(P \,\|\, Q) = \sum_x P(x)\,\log\!\left[\frac{P(x)}{Q(x)}\right]\]
Continuous:
\[D_{KL}(P \,\|\, Q) = \int p(x)\,\log\!\left[\frac{p(x)}{q(x)}\right]dx\]
| Property | Explanation |
|---|---|
| 🔢 Always ≥ 0 | \(D_{KL} = 0\) only when \(P = Q\) everywhere |
| ↔︎️ Not Symmetric | \(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\) — not a true distance! |
| 📡 Information Gain | Extra bits to encode \(P\) using a code built for \(Q\) |
| 🔗 Relation to MLE | Minimising \(D_{KL}(P\|Q)\) w.r.t. \(Q\), with \(P\) the distribution that generated the data, ≡ MLE |
📌 \(P\) = true weather distribution, \(Q\) = model’s prediction
| Outcome | \(P(x)\) | \(Q(x)\) |
|---|---|---|
| Sunny ☀️ | 0.50 | 0.40 |
| Cloudy ⛅ | 0.30 | 0.35 |
| Rainy 🌧️ | 0.20 | 0.25 |
| Outcome | \(P/Q\) | \(\log_2(P/Q)\) | \(P \cdot \log_2(P/Q)\) |
|---|---|---|---|
| Sunny ☀️ | 1.25 | 0.322 | 0.161 |
| Cloudy ⛅ | 0.857 | −0.222 | −0.067 |
| Rainy 🌧️ | 0.80 | −0.322 | −0.064 |
| Total | | | \(D_{KL} = 0.030\) bits |
\(D_{KL}(P \| Q) = 0.030\) bits
On average, you waste 0.030 extra bits per forecast because model \(Q\) deviates from \(P\).
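The weather calculation as a short sketch. The function assumes \(Q(x) > 0\) wherever \(P(x) > 0\); otherwise the divergence is infinite and this code would fail.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions on the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.50, 0.30, 0.20]  # Sunny, Cloudy, Rainy (true)
Q = [0.40, 0.35, 0.25]  # model's prediction

print(f"{kl_divergence(P, Q):.3f}")  # 0.030
print(kl_divergence(P, Q) == kl_divergence(Q, P))  # False: not symmetric
```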
\[\text{Minimising } D_{KL}(P_\text{true} \,\|\, Q_\text{model}) \;\equiv\; \text{MLE}\]
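A sketch of why this holds, writing \(Q_\theta\) for the model (the subscript is notation introduced here) and assuming \(x_1, \ldots, x_n\) are independent draws from \(P\):

\[D_{KL}(P \,\|\, Q_\theta) = \sum_x P(x)\log P(x) - \sum_x P(x)\log Q_\theta(x) = -H(P) - \mathbb{E}_P\bigl[\log Q_\theta(x)\bigr]\]

\(H(P)\) does not depend on \(\theta\), so

\[\arg\min_\theta D_{KL}(P \,\|\, Q_\theta) = \arg\max_\theta \mathbb{E}_P\bigl[\log Q_\theta(x)\bigr] \approx \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log Q_\theta(x_i) = \arg\max_\theta \ell(\theta)\]

The approximation replaces the expectation under \(P\) with the sample average, which is the log-likelihood divided by \(n\).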
\[\text{AIC} = 2k - 2\ell(\hat\theta)\]
\[\text{BIC} = k\log(n) - 2\ell(\hat\theta)\]
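As an illustration, both criteria can be evaluated for the 5-flip coin example (\(n = 5\), \(k = 1\) parameter, \(\hat{p} = 0.8\)). The function names are this sketch's own.

```python
import math


def aic(log_lik, k):
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    return k * math.log(n) - 2 * log_lik


# Maximised log-likelihood of the coin example: 4 Heads, 1 Tail, p_hat = 0.8.
log_lik = 4 * math.log(0.8) + 1 * math.log(0.2)

print(f"AIC = {aic(log_lik, k=1):.3f}")       # AIC = 7.004
print(f"BIC = {bic(log_lik, k=1, n=5):.3f}")  # BIC = 6.613
```

Lower values indicate a better trade-off between fit and complexity; the numbers only become meaningful when compared across competing models fitted to the same data.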
\[H(P,Q) = H(P) + D_{KL}(P\|Q)\]
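The identity can be verified numerically. A sketch using the weather distributions from the earlier example; all three helper names are illustrative.

```python
import math


def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.50, 0.30, 0.20]
Q = [0.40, 0.35, 0.25]

gap = cross_entropy(P, Q) - entropy(P)
print(abs(gap - kl_divergence(P, Q)) < 1e-12)  # True
```

Since \(H(P)\) is fixed by the data, minimising cross-entropy over \(Q\) and minimising \(D_{KL}(P\|Q)\) select the same model.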
📌 Two models predict class probabilities for 4 classes. Which is closer to true \(P\)?
| Class | \(P\) (True) | \(Q_1\) (Model A) | \(Q_2\) (Model B) |
|---|---|---|---|
| Dog 🐶 | 0.40 | 0.35 | 0.42 |
| Cat 🐱 | 0.30 | 0.30 | 0.28 |
| Bird 🐦 | 0.20 | 0.25 | 0.18 |
| Fish 🐟 | 0.10 | 0.10 | 0.12 |
| Class | \(P\log_2(P/Q_1)\) | \(P\log_2(P/Q_2)\) |
|---|---|---|
| Dog 🐶 | 0.077 | −0.028 |
| Cat 🐱 | 0.000 | 0.030 |
| Bird 🐦 | −0.064 | 0.030 |
| Fish 🐟 | 0.000 | −0.026 |
| \(D_{KL}\) | ≈ 0.013 bits | ≈ 0.006 bits |
Tip
Model B’s divergence is about 2.2× smaller than Model A’s, so it is closer to the true distribution → Select Model B.
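The comparison can be recomputed with base-2 logs throughout. A sketch in which the dictionary of candidate models and the selection step are illustrative choices:

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


P = [0.40, 0.30, 0.20, 0.10]  # Dog, Cat, Bird, Fish (true)
models = {
    "Model A": [0.35, 0.30, 0.25, 0.10],
    "Model B": [0.42, 0.28, 0.18, 0.12],
}

divergences = {name: kl_divergence(P, q) for name, q in models.items()}
for name, d in divergences.items():
    print(f"{name}: {d:.4f} bits")

best = min(divergences, key=divergences.get)
print(best)  # Model B
```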
| Concept | Formula | Role |
|---|---|---|
| Likelihood | \(L(\theta\mid\text{data}) = \prod f(x_i\mid\theta)\) | Estimate parameters via MLE |
| Entropy | \(H(X) = -\sum p\log p\) | Measure uncertainty |
| KL Divergence | \(D_{KL}(P\|Q) = \sum P\log(P/Q)\) | Compare distributions |
\[H(P,Q) = H(P) + D_{KL}(P\|Q)\]
Important
Minimising KL divergence from true \(P\) to model \(Q\)
≡ MLE of \(Q\)’s parameters
≡ Training neural networks with cross-entropy loss
Likelihood is NOT a probability — it is a function of the parameter given fixed data
Entropy measures uncertainty — same math as Boltzmann, applied to distributions
KL divergence is asymmetric — \(D_{KL}(P\|Q) = 0\) only if \(P = Q\)
Minimising KL ≡ MLE — AIC, cross-entropy loss, and neural network training all follow from this