Likelihood & Entropy

Rohan Dahal

Roadmap

| #  | Topic                       |
|----|-----------------------------|
| 01 | Likelihood                  |
| 02 | Information Entropy         |
| 03 | Kullback-Leibler Divergence |

PART 1

Likelihood

What is it? Why does it matter? How do we compute it?

What is Likelihood? — The Intuition

Imagine you flip a coin 3 times and get: Heads, Heads, Tails

How likely is this if the coin is fair (p = 0.5)? What if p = 0.7?

Likelihood answers: given the data we observed, how plausible is each parameter value?
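As a quick check of the question above, the likelihood of the sequence Heads, Heads, Tails can be evaluated at both candidate values of \(p\). This is a minimal sketch; the function name `bernoulli_likelihood` is illustrative, not part of the original slides.

```python
def bernoulli_likelihood(p, flips):
    """Likelihood of independent Bernoulli(p) flips given as a string such as 'HHT'."""
    likelihood = 1.0
    for flip in flips:
        # Each Head contributes p, each Tail contributes (1 - p).
        likelihood *= p if flip == "H" else (1.0 - p)
    return likelihood


flips = "HHT"
for p in (0.5, 0.7):
    print(f"p = {p}: L = {bernoulli_likelihood(p, flips):.3f}")
```

For this sequence \(p = 0.7\) gives the larger likelihood (0.147 against 0.125), so it is the more plausible of the two values given the data.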

Probability vs. Likelihood

Definition
Probability Parameter fixed → ask about data: \(P(\text{data} \mid \theta)\)
Likelihood Data fixed → ask about parameter: \(L(\theta \mid \text{data})\)

Warning

Likelihood is NOT a probability of the parameter — it is a function of the parameter given fixed data.

Same formula, different perspective!

The Likelihood Formula

For \(n\) independent observations:

\[L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i \mid \theta)\]

Log-Likelihood

\[\ell(\theta) = \log L = \sum_{i=1}^n \log f(x_i \mid \theta)\]

Why use log?

  • Products → Sums (simpler math)
  • Avoids numerical underflow (tiny × tiny → 0)
  • Same maximum as the original likelihood
  • \(\log\) is monotone increasing
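The underflow point can be seen with a small experiment. The sketch below assumes 100 hypothetical observations that each have density \(10^{-5}\); the numbers are made up purely to trigger the effect.

```python
import math

# Hypothetical per-observation densities (illustrative values only).
densities = [1e-5] * 100

# Direct product: the true value 1e-500 is far below the smallest
# representable double, so the running product collapses to 0.0.
product = 1.0
for d in densities:
    product *= d

# Sum of logs: the same information, but numerically well behaved.
log_likelihood = sum(math.log(d) for d in densities)

print(product)
print(log_likelihood)
```

The product is reported as exactly zero, while the log-likelihood stays a finite number near \(-1151\) that can still be compared across parameter values.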

Maximum Likelihood Estimation (MLE)

Find \(\hat\theta\) that maximises \(L(\theta \mid \text{data})\).

Usually solve:

\[\frac{d}{d\theta}\,\ell(\theta) = 0\]

When the solution is a maximum (negative second derivative), it gives the “best fit” parameter.

Coin Flip Example — Setup

📌 You flip a coin 5 times and observe: H H H T H (4 Heads, 1 Tail)

Goal: estimate \(p\) = probability of Heads

Each flip is modelled as Bernoulli(\(p\)): \(P(H) = p\), \(\quad P(T) = 1 - p\)

Coin Flip Example — Likelihood

\[L(p) = p \times p \times p \times (1-p) \times p = p^4(1-p)^1\]

Taking the log:

\[\ell(p) = 4\log(p) + 1\log(1-p)\]

Coin Flip Example — Maximise

\[\frac{d\ell}{dp} = \frac{4}{p} - \frac{1}{1-p} = 0\]

\[4(1-p) = p \quad\Rightarrow\quad \hat{p} = \frac{4}{5} = 0.8\]

Tip

MLE Answer: \(\hat{p} = 0.8\) — Intuitively correct! 4 out of 5 flips were Heads.
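The calculus result can be cross-checked numerically. The sketch below evaluates \(\ell(p) = 4\log(p) + \log(1-p)\) on a grid and picks the maximiser; a grid search is used here only for illustration and is not how MLE is normally computed.

```python
import math


def log_likelihood(p, heads=4, tails=1):
    """Bernoulli log-likelihood for the observed counts of Heads and Tails."""
    return heads * math.log(p) + tails * math.log(1.0 - p)


# Candidate values 0.001, 0.002, ..., 0.999 (the endpoints are excluded
# because log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat)
```

The grid maximiser agrees with the closed-form answer \(\hat{p} = 0.8\).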

Likelihood in Linear Regression

Assume independent errors \(\varepsilon_i = y_i - x_i^\top\beta \sim \mathcal{N}(0,\sigma^2)\)

\[L(\beta,\sigma \mid y) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}\right)\]

Log-Likelihood:

\[\ell = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum(y_i - x_i^\top\beta)^2\]

Linear Regression — OLS Connection

Maximising \(\ell\) w.r.t. \(\beta\) is equivalent to minimising:

\[\sum(y_i - x_i^\top\beta)^2\]

Note

OLS = MLE under normality!
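A small numerical check of this equivalence, as a sketch under stated assumptions: the data are hypothetical, the model has one predictor plus an intercept, and \(\sigma^2\) is held fixed (its value does not change which \(\beta\) maximises \(\ell\)).

```python
import math

# Hypothetical toy data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_fit(x, y):
    """Closed-form least-squares intercept and slope for one predictor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def gaussian_log_likelihood(b0, b1, sigma2, x, y):
    """Normal log-likelihood, matching the formula for ell above."""
    n = len(x)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - rss / (2 * sigma2))


b0, b1 = ols_fit(x, y)
sigma2 = 1.0  # fixed, arbitrary value
best = gaussian_log_likelihood(b0, b1, sigma2, x, y)

# Nudging the OLS coefficients in any direction should lower the log-likelihood.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
lower = [gaussian_log_likelihood(b0 + d0, b1 + d1, sigma2, x, y) < best
         for d0, d1 in nudges]
print(all(lower))
```

Every perturbation of the least-squares coefficients reduces \(\ell\), which is what the equivalence predicts.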

Likelihood in Logistic Regression

Binary outcome \(y_i \in \{0,1\}\), with:

\[\pi_i = P(y_i=1) = \frac{1}{1+e^{-x_i^\top\beta}}\]

Likelihood:

\[L(\beta \mid y) = \prod_i \pi_i^{y_i}(1-\pi_i)^{1-y_i}\]

Log-Likelihood:

\[\ell(\beta) = \sum_i \bigl[y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i)\bigr]\]

Logistic Regression — No Closed Form

Unlike linear regression, there is no closed-form solution.

We use numerical optimisation:

  • Newton-Raphson
  • Gradient descent

The algorithm maximises \(\ell(\beta)\) iteratively.
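The iterative idea can be sketched with plain gradient ascent on \(\ell(\beta)\). The data below are hypothetical and deliberately not perfectly separable (otherwise the MLE does not exist); the learning rate and step count are arbitrary choices that happen to work for this toy problem, and production software typically uses Newton-type methods instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(b0, b1, xs, ys):
    """Bernoulli log-likelihood for a one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        pi = sigmoid(b0 + b1 * x)
        total += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return total


def fit_gradient_ascent(xs, ys, lr=0.01, steps=20000):
    """Maximise the log-likelihood by repeatedly stepping along its gradient."""
    b0 = b1 = 0.0
    for _ in range(steps):
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        grad_b0 = sum(residuals)                              # d ell / d b0
        grad_b1 = sum(r * x for r, x in zip(residuals, xs))   # d ell / d b1
        b0 += lr * grad_b0
        b1 += lr * grad_b1
    return b0, b1


# Hypothetical data: hours studied and pass (1) / fail (0).
hours = [1, 2, 2, 3, 3, 4, 4, 5]
passed = [0, 0, 1, 0, 1, 1, 0, 1]

b0_hat, b1_hat = fit_gradient_ascent(hours, passed)
print(b0_hat, b1_hat, log_likelihood(b0_hat, b1_hat, hours, passed))
```

At convergence the gradient is essentially zero, which is the numerical counterpart of solving \(d\ell/d\beta = 0\).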

Logistic Regression — Numerical Example

📌 Predict Pass/Fail from hours studied. \(\beta_0 = -3\), \(\beta_1 = 1.5\)

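| Student | Hours | Outcome | \(x^\top\beta\) | \(\pi\) | \(\log L_i\) |
|---------|-------|---------|-----------------|---------|--------------|
| A       | 2     | Fail    | 0               | 0.500   | −0.693       |
| B       | 3     | Pass    | 1.5             | 0.818   | −0.201       |
| C       | 4     | Pass    | 3               | 0.953   | −0.048       |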

Logistic Regression — Log-Likelihood Sum

\[\ell(\beta) = (-0.693) + (-0.201) + (-0.048) = -0.942\]

MLE finds \(\beta_0, \beta_1\) that make this as large (closest to 0) as possible.
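The worked example can be reproduced in a few lines. This sketch uses the stated coefficients \(\beta_0 = -3\), \(\beta_1 = 1.5\) and encodes Pass as 1 and Fail as 0; the variable names are illustrative.

```python
import math

b0, b1 = -3.0, 1.5
# (student, hours studied, outcome) with 1 = Pass, 0 = Fail
students = [("A", 2, 0), ("B", 3, 1), ("C", 4, 1)]

total = 0.0
for name, hours, y in students:
    z = b0 + b1 * hours                  # linear predictor
    pi = 1.0 / (1.0 + math.exp(-z))      # P(Pass)
    # A Pass contributes log(pi); a Fail contributes log(1 - pi).
    contribution = math.log(pi) if y == 1 else math.log(1.0 - pi)
    total += contribution
    print(f"{name}: z = {z:.1f}, pi = {pi:.3f}, log-lik = {contribution:.3f}")

print(f"total = {total:.3f}")
```

Computed without intermediate rounding, the total is about \(-0.943\); the small difference from the tabulated sum comes from rounding the per-student values.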

PART 2

Information Entropy

History · Shannon · Physics vs. Statistics · Lindley’s Measure

A Brief History of Entropy

| Year  | Person    | Contribution                                      |
|-------|-----------|---------------------------------------------------|
| 1850s | Clausius  | Thermodynamic entropy: \(dS = dQ/T\)              |
| 1877  | Boltzmann | \(S = k \cdot \ln(W)\), counting microstates      |
| 1948  | Shannon   | Information entropy: uncertainty in distributions |
| 1956  | Lindley   | Bayesian information measure                      |
| 1957  | Jaynes    | Maximum Entropy Principle                         |

Shannon Entropy — The Formula

\[H(X) = -\sum_x p(x)\cdot\log_2 p(x)\]

(Sum over all possible outcomes \(x\))

Shannon Entropy — Properties

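| Property       | Detail                                                                                            |
|----------------|---------------------------------------------------------------------------------------------------|
| 📐 Units       | \(\log_2\) → bits; \(\ln\) → nats; \(\log_{10}\) → hartleys                                       |
| ⚖️ Range       | \(H \geq 0\) always; \(H = 0\) means certainty                                                    |
| 🎯 Maximum     | \(H\) is highest when all outcomes are equally likely (\(H = \log_2 K\) bits for \(K\) outcomes)  |
| Interpretation | Average “surprise” per observation                                                                |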

Entropy Example — Fair Coin

\(p(H) = 0.5\), \(\quad p(T) = 0.5\)

\[H(X) = -[0.5\log_2(0.5) + 0.5\log_2(0.5)]\] \[= -[0.5(-1)+0.5(-1)] = \mathbf{1.0 \text{ bit}}\]

➡️ Maximum uncertainty — you need exactly 1 bit to describe each flip.

Entropy Example — Biased Coin

\(p(H) = 0.9\), \(\quad p(T) = 0.1\)

\[H(X) = -[0.9\log_2(0.9) + 0.1\log_2(0.1)]\] \[= -[-0.137 + (-0.332)] = \mathbf{0.469 \text{ bits}}\]

➡️ Lower entropy = less surprise. Almost certain it will be Heads!
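Both coin examples can be checked with a short helper. This is a minimal sketch; `entropy_bits` is an illustrative name, and zero-probability outcomes are skipped using the usual convention \(0 \log 0 = 0\).

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


print(entropy_bits([0.5, 0.5]))            # fair coin: 1.0
print(round(entropy_bits([0.9, 0.1]), 3))  # biased coin: 0.469
```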

Physics vs. Statistical Entropy — Formulas

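|          | Physics                       | Statistics                      |
|----------|-------------------------------|---------------------------------|
| Formula  | \(S = k \cdot \ln(W)\)        | \(H = -\sum p(x)\log p(x)\)     |
| Units    | Joules / Kelvin               | Bits or nats                    |
| Variable | \(W\) = number of microstates | \(p(x)\) = outcome probability  |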

Physics vs. Statistical Entropy — Concepts

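|           | Physics                                         | Statistics                      |
|-----------|-------------------------------------------------|---------------------------------|
| Concept   | Disorder / irreversibility                      | Uncertainty / unpredictability  |
| Subject   | Gas molecules, heat                             | Probability distributions, data |
| Direction | Never decreases in an isolated system (2nd Law) | Can increase or decrease        |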

Note

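Shannon’s formula has the same mathematical form as the Gibbs entropy of statistical mechanics; Boltzmann’s \(S = k \cdot \ln(W)\) is the special case of \(W\) equally likely microstates, which is why the name “entropy” carried over.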

Lindley’s Information Measure — Formula

Lindley (1956): Expected information gained from an experiment:

\[I = H(\theta) - \mathbb{E}[H(\theta \mid x)]\]

\[= \text{Entropy(prior)} - \text{Expected Entropy(posterior)}\]

Lindley’s Information Measure — Intuition

  • Before the experiment: high uncertainty about \(\theta\) → high \(H(\text{prior})\)
  • After seeing data \(x\): posterior concentrates → lower \(H(\text{posterior})\)
  • Lindley’s \(I\) = how much uncertainty was reduced by data

Big \(I\) → data was very informative
Small \(I\) → data told you little

Lindley’s Measure — Coin Example

Estimating if a coin is fair:

  • Prior: Uniform over \(p \in [0,1]\) → \(H(\text{prior})\) is high
  • After 100 flips (60 H, 40 T): Posterior concentrates near \(p = 0.6\)
  • \(H(\text{posterior})\) is much lower

\[I = H(\text{prior}) - H(\text{posterior}) \quad\Rightarrow\quad \text{large = very useful data!}\]
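The coin example can be made concrete with a discretised sketch. The assumptions here are loud ones: \(p\) is restricted to 1000 equally spaced grid points so that ordinary (discrete) Shannon entropy applies, and the code computes the entropy reduction for this one observed dataset rather than Lindley’s expectation over all possible datasets.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


n_bins = 1000
# Bin midpoints, so p = 0 and p = 1 (where log is undefined) are avoided.
grid = [(i + 0.5) / n_bins for i in range(n_bins)]

# Uniform prior over the grid.
prior = [1.0 / n_bins] * n_bins

# Posterior is proportional to prior * likelihood = p^60 (1 - p)^40,
# computed in log space and normalised for numerical stability.
heads, tails = 60, 40
log_weights = [heads * math.log(p) + tails * math.log(1.0 - p) for p in grid]
max_log = max(log_weights)
weights = [math.exp(lw - max_log) for lw in log_weights]
normaliser = sum(weights)
posterior = [w / normaliser for w in weights]

gain = entropy_bits(prior) - entropy_bits(posterior)
print(f"H(prior) = {entropy_bits(prior):.2f} bits")
print(f"H(posterior) = {entropy_bits(posterior):.2f} bits")
print(f"information gained = {gain:.2f} bits")
```

Under this discretisation the 100 flips remove roughly 2.3 bits of uncertainty about \(p\), and the posterior mass sits around \(p = 0.6\).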

PART 3

Kullback-Leibler Divergence

How different are two probability distributions?

What is KL Divergence?

KL divergence measures how much distribution \(P\) differs from a reference distribution \(Q\).

Discrete:

\[D_{KL}(P \,\|\, Q) = \sum_x P(x)\,\log\!\left[\frac{P(x)}{Q(x)}\right]\]

Continuous:

\[D_{KL}(P \,\|\, Q) = \int p(x)\,\log\!\left[\frac{p(x)}{q(x)}\right]dx\]

KL Divergence — Key Properties

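| Property            | Explanation                                                                                                                               |
|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| 🔢 Always ≥ 0       | \(D_{KL} = 0\) only when \(P = Q\) everywhere                                                                                             |
| ↔︎️ Not symmetric    | In general \(D_{KL}(P\Vert Q) \neq D_{KL}(Q\Vert P)\), so it is not a true distance                                                       |
| 📡 Information gain | Expected extra bits to encode samples from \(P\) using a code built for \(Q\)                                                             |
| 🔗 Relation to MLE  | With \(P\) the empirical distribution of the data, minimising \(D_{KL}(P\Vert Q_\theta)\) over the model parameters \(\theta\) ≡ MLE      |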
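The asymmetry is easy to see numerically. The two distributions below are hypothetical (a biased coin against a fair one), chosen only because the gap between the two directions is large.

```python
import math


def kl_bits(p, q):
    """Discrete KL divergence D(p || q) in bits; terms with p(x) = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]  # hypothetical biased coin
q = [0.5, 0.5]  # hypothetical fair coin

print(round(kl_bits(p, q), 3))  # 0.531
print(round(kl_bits(q, p), 3))  # 0.737
print(kl_bits(p, p))            # 0.0
```

Swapping the arguments changes the answer, and the divergence of a distribution from itself is zero.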

KL Divergence — Weather Example Setup

📌 \(P\) = true weather distribution, \(Q\) = model’s prediction

| Outcome   | \(P(x)\) | \(Q(x)\) |
|-----------|----------|----------|
| Sunny ☀️  | 0.50     | 0.40     |
| Cloudy ⛅ | 0.30     | 0.35     |
| Rainy 🌧️  | 0.20     | 0.25     |

KL Divergence — Weather Calculation

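| Outcome   | \(P/Q\) | \(\log_2(P/Q)\) | \(P \cdot \log_2(P/Q)\) |
|-----------|---------|-----------------|-------------------------|
| Sunny ☀️  | 1.25    | 0.322           | 0.161                   |
| Cloudy ⛅ | 0.857   | −0.222          | −0.067                  |
| Rainy 🌧️  | 0.80    | −0.322          | −0.064                  |
| Total     |         |                 | \(D_{KL} = 0.030\) bits |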

KL Divergence — Interpretation

\(D_{KL}(P \| Q) = 0.030\) bits

On average, you waste 0.030 extra bits per forecast because model \(Q\) deviates from \(P\).

  • Closer to 0 → Better model
  • If \(Q = P\) exactly → \(D_{KL} = 0\) → no wasted information
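The weather calculation can be reproduced directly from the two distributions. A minimal sketch, with the outcome order Sunny, Cloudy, Rainy:

```python
import math

p = [0.50, 0.30, 0.20]  # true weather distribution P
q = [0.40, 0.35, 0.25]  # model's prediction Q

# Per-outcome contributions P(x) * log2(P(x) / Q(x)).
terms = [pi * math.log2(pi / qi) for pi, qi in zip(p, q)]
kl = sum(terms)

for name, term in zip(["Sunny", "Cloudy", "Rainy"], terms):
    print(f"{name}: {term:.3f}")
print(f"D_KL = {kl:.3f} bits")
```

Individual terms can be negative (Cloudy and Rainy here), but their sum cannot be.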

KL & Model Selection — The Connection

\[\text{Minimising } D_{KL}(P_\text{true} \,\|\, Q_\text{model}) \;\equiv\; \text{MLE}\]

AIC

\[\text{AIC} = 2k - 2\ell(\hat\theta)\]

  • \(k\) = number of parameters
  • Penalises model complexity
  • Derived as an estimate of the expected KL divergence between the fitted model and the truth (up to a constant)
  • Smaller AIC = better model

BIC

\[\text{BIC} = k\log(n) - 2\ell(\hat\theta)\]

  • \(n\) = sample size, \(k\) = number of parameters
  • Stronger penalty for complexity than AIC once \(\log(n) > 2\), i.e. \(n \geq 8\)
  • Prefers simpler models
  • Smaller BIC = better model
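Both criteria are simple functions of the maximised log-likelihood. As an illustrative sketch, the code below applies them to the earlier coin example (\(k = 1\) parameter, \(n = 5\) flips, \(\hat{p} = 0.8\)); the function names are assumptions made for this example.

```python
import math


def aic(log_lik, k):
    """AIC = 2k - 2 * maximised log-likelihood."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """BIC = k * log(n) - 2 * maximised log-likelihood."""
    return k * math.log(n) - 2 * log_lik


# Coin example: 4 Heads, 1 Tail, evaluated at the MLE p_hat = 0.8.
log_lik = 4 * math.log(0.8) + 1 * math.log(0.2)

print(f"AIC = {aic(log_lik, k=1):.3f}")
print(f"BIC = {bic(log_lik, k=1, n=5):.3f}")
```

With only five observations \(\log(n) < 2\), so BIC’s penalty is actually the smaller one here; it overtakes AIC’s penalty as the sample grows.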

Cross-Entropy Loss

\[H(P,Q) = H(P) + D_{KL}(P\|Q)\]

  • Used throughout machine learning
  • Minimising cross-entropy ≡ minimising KL divergence from true \(P\)
  • Basis for neural network training
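The decomposition can be verified numerically. The sketch below reuses the weather distributions from the KL example and checks that cross-entropy equals entropy plus KL divergence; the helper names are illustrative.

```python
import math


def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_bits(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.50, 0.30, 0.20]
q = [0.40, 0.35, 0.25]

lhs = cross_entropy_bits(p, q)
rhs = entropy_bits(p) + kl_bits(p, q)
print(lhs, rhs)
```

Because \(H(P)\) does not depend on the model, lowering the cross-entropy by changing \(Q\) lowers the KL divergence by the same amount.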

KL Model Comparison — Setup

📌 Two models predict class probabilities for 4 classes. Which is closer to true \(P\)?

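| Class   | \(P\) (True) | \(Q_1\) (Model A) | \(Q_2\) (Model B) |
|---------|--------------|-------------------|-------------------|
| Dog 🐶  | 0.40         | 0.35              | 0.42              |
| Cat 🐱  | 0.30         | 0.30              | 0.28              |
| Bird 🐦 | 0.20         | 0.25              | 0.18              |
| Fish 🐟 | 0.10         | 0.10              | 0.12              |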

KL Model Comparison — Results

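| Class      | \(P\log_2(P/Q_1)\) | \(P\log_2(P/Q_2)\) |
|------------|--------------------|--------------------|
| Dog 🐶     | 0.077              | −0.028             |
| Cat 🐱     | 0.000              | 0.030              |
| Bird 🐦    | −0.064             | 0.030              |
| Fish 🐟    | 0.000              | −0.026             |
| \(D_{KL}\) | ≈ 0.013 bits       | ≈ 0.006 bits       |

Tip

Model B’s divergence is roughly half of Model A’s (about 2.2× smaller), so it is closer to the true distribution → Select Model B.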
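The comparison can be recomputed from the setup table. This sketch uses base-2 logarithms throughout and the class order Dog, Cat, Bird, Fish.

```python
import math


def kl_bits(p, q):
    """Discrete KL divergence D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p_true = [0.40, 0.30, 0.20, 0.10]
model_a = [0.35, 0.30, 0.25, 0.10]
model_b = [0.42, 0.28, 0.18, 0.12]

kl_a = kl_bits(p_true, model_a)
kl_b = kl_bits(p_true, model_b)

print(f"Model A: {kl_a:.4f} bits")
print(f"Model B: {kl_b:.4f} bits")
print(f"ratio A/B: {kl_a / kl_b:.1f}")
```

With these inputs Model A comes out near 0.013 bits and Model B near 0.006 bits, so Model B is the closer of the two.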

How It All Connects

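| Concept       | Formula                                                    | Role                        |
|---------------|------------------------------------------------------------|-----------------------------|
| Likelihood    | \(L(\theta\mid\text{data}) = \prod f(x_i\mid\theta)\)      | Estimate parameters via MLE |
| Entropy       | \(H(X) = -\sum p\log p\)                                   | Measure uncertainty         |
| KL Divergence | \(D_{KL}(P\Vert Q) = \sum P\log(P/Q)\)                     | Compare distributions       |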

The Unifying Insight

\[H(P,Q) = H(P) + D_{KL}(P\|Q)\]

Important

Minimising KL divergence from true \(P\) to model \(Q\)
≡ MLE of \(Q\)’s parameters
≡ Training neural networks with cross-entropy loss

Key Takeaways

  1. Likelihood is NOT a probability — it is a function of the parameter given fixed data

  2. Entropy measures uncertainty — same math as Boltzmann, applied to distributions

  3. KL divergence is asymmetric, and \(D_{KL}(P\|Q) = 0\) only if \(P = Q\)

  4. Minimising KL ≡ MLE — AIC, cross-entropy loss, and neural network training all follow from this