Likelihood & Entropy

Rohan Dahal

Roadmap

#	Topic
01	Likelihood
02	Information Entropy
03	Kullback-Leibler Divergence

PART 1

Likelihood

What is it? Why does it matter? How do we compute it?

What is Likelihood? — The Intuition

Imagine you flip a coin 3 times and get: Heads, Heads, Tails

How likely is this if the coin is fair (p = 0.5)? What if p = 0.7?

Likelihood answers: given the data we observed, how plausible is each parameter value?
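As a quick check of the coin question above, the sketch below evaluates the likelihood of Heads, Heads, Tails at both candidate values of \(p\). The function name `sequence_likelihood` is illustrative and not part of the slides.

```python
def sequence_likelihood(flips, p):
    """Likelihood of an observed H/T sequence under a Bernoulli(p) coin."""
    likelihood = 1.0
    for flip in flips:
        # Each Heads contributes p, each Tails contributes (1 - p).
        likelihood *= p if flip == "H" else (1.0 - p)
    return likelihood

flips = ["H", "H", "T"]
for p in (0.5, 0.7):
    print(f"p = {p}: L = {sequence_likelihood(flips, p):.3f}")
# p = 0.5: L = 0.125
# p = 0.7: L = 0.147
```

The data are more plausible under \(p = 0.7\) than under a fair coin, although neither number is a probability of \(p\) itself.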

Probability vs. Likelihood

	Definition
Probability	Parameter fixed → ask about data: \(P(\text{data} \mid \theta)\)
Likelihood	Data fixed → ask about parameter: \(L(\theta \mid \text{data})\)

Warning

Likelihood is NOT a probability of the parameter — it is a function of the parameter given fixed data.

Same formula, different perspective!

The Likelihood Formula

For \(n\) independent observations, each with density (or probability mass function) \(f(x \mid \theta)\):

\[L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i \mid \theta)\]

Log-Likelihood

\[\ell(\theta) = \log L = \sum_{i=1}^n \log f(x_i \mid \theta)\]

Why use log?

Products → Sums (simpler math)
Avoids numerical underflow (tiny × tiny → 0)
Same maximiser as the original likelihood, because \(\log\) is monotone increasing
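The underflow point can be seen in a few lines. This is a minimal sketch with made-up numbers: 2000 observations, each assumed to have density value 0.01 under the model.

```python
import math

# Hypothetical setup: 2000 observations, each with density 0.01 under the model.
densities = [0.01] * 2000

# The raw product underflows: 1e-4000 is far below the smallest positive float.
product = 1.0
for f in densities:
    product *= f

# The sum of logs stays comfortably representable.
log_likelihood = sum(math.log(f) for f in densities)

print(product)                   # 0.0
print(f"{log_likelihood:.2f}")   # -9210.34
```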

Maximum Likelihood Estimation (MLE)

Find \(\hat\theta\) that maximises \(L(\theta \mid \text{data})\).

Usually solve:

\[\frac{d}{d\theta}\,\ell(\theta) = 0\]

A solution that is a maximum (rather than a minimum or saddle point) is the “best fit” parameter \(\hat\theta\).

Coin Flip Example — Setup

📌 You flip a coin 5 times and observe: H H H T H (4 Heads, 1 Tail)

Goal: estimate \(p\) = probability of Heads

Each flip is modelled as Bernoulli(\(p\)): \(P(H) = p\), \(\quad P(T) = 1 - p\)

Coin Flip Example — Likelihood

\[L(p) = p \times p \times p \times (1-p) \times p = p^4(1-p)^1\]

Taking the log:

\[\ell(p) = 4\log(p) + 1\log(1-p)\]

Coin Flip Example — Maximise

\[\frac{d\ell}{dp} = \frac{4}{p} - \frac{1}{1-p} = 0\]

\[4(1-p) = p \quad\Rightarrow\quad \hat{p} = \frac{4}{5} = 0.8\]

Tip

MLE Answer: \(\hat{p} = 0.8\) — Intuitively correct! 4 out of 5 flips were Heads.
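The calculus result can be checked numerically. The sketch below, assuming a simple grid search rather than any particular optimiser, evaluates \(\ell(p)\) on a grid and picks the largest value.

```python
import math

def log_likelihood(p, heads=4, tails=1):
    """Bernoulli log-likelihood for the observed counts."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Grid excludes 0 and 1, where the logarithm is undefined.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

print(p_hat)  # 0.8
```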

Likelihood in Linear Regression

Assume \(y_i = x_i^\top\beta + \varepsilon_i\) with independent errors \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\)

\[L(\beta,\sigma \mid y) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - x_i^\top\beta)^2}{2\sigma^2}\right)\]

Log-Likelihood:

\[\ell = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_i(y_i - x_i^\top\beta)^2\]

Linear Regression — OLS Connection

Maximising \(\ell\) w.r.t. \(\beta\) is equivalent to minimising:

\[\sum_i(y_i - x_i^\top\beta)^2\]

Note

OLS = MLE under normality!
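A small numerical illustration of the OLS connection, using hypothetical toy data (not from the slides) with one predictor and an intercept. The closed-form least-squares fit is compared against nearby coefficient values under the Gaussian log-likelihood.

```python
import math

# Hypothetical toy data: y is roughly 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n = len(xs)

# Closed-form OLS for one predictor plus intercept.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

def sse(b0, b1):
    """Sum of squared residuals."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def gaussian_log_likelihood(b0, b1, sigma2):
    return (-n / 2 * math.log(2 * math.pi)
            - n / 2 * math.log(sigma2)
            - sse(b0, b1) / (2 * sigma2))

sigma2 = 0.05  # any fixed variance gives the same ranking of beta values
best = gaussian_log_likelihood(intercept, slope, sigma2)
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    # Moving away from the OLS fit raises the SSE and lowers the log-likelihood.
    assert gaussian_log_likelihood(intercept + db0, slope + db1, sigma2) < best

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
# intercept = 1.04, slope = 1.99
```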

Likelihood in Logistic Regression

Binary outcome \(y_i \in \{0,1\}\), with:

\[\pi_i = P(y_i=1) = \frac{1}{1+e^{-x_i^\top\beta}}\]

Likelihood:

\[L(\beta \mid y) = \prod_i \pi_i^{y_i}(1-\pi_i)^{1-y_i}\]

Log-Likelihood:

\[\ell(\beta) = \sum_i \bigl[y_i\log(\pi_i) + (1-y_i)\log(1-\pi_i)\bigr]\]

Logistic Regression — No Closed Form

Unlike linear regression, there is no closed-form solution.

We use numerical optimisation:

Newton-Raphson
Gradient descent

The algorithm maximises \(\ell(\beta)\) iteratively.
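As a rough sketch of the iterative idea, the code below runs plain gradient ascent on \(\ell(\beta)\) for one predictor plus an intercept. The data, learning rate and step count are assumptions chosen for illustration; real software typically uses Newton-type methods instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        pi = sigmoid(b0 + b1 * x)
        total += y * math.log(pi) + (1 - y) * math.log(1.0 - pi)
    return total

def fit_logistic(xs, ys, learning_rate=0.05, steps=20000):
    """Gradient ascent on the logistic log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (y_i - pi_i) times each input.
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += learning_rate * sum(residuals)
        b1 += learning_rate * sum(r * x for r, x in zip(residuals, xs))
    return b0, b1

# Hypothetical, non-separable data: hours studied and pass (1) / fail (0).
hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(hours, passed)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"log-likelihood at start: {log_likelihood(0.0, 0.0, hours, passed):.3f}")
print(f"log-likelihood at fit:   {log_likelihood(b0, b1, hours, passed):.3f}")
```

The data are deliberately not perfectly separable; with separable data the log-likelihood has no finite maximiser and the coefficients would keep growing.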

Logistic Regression — Numerical Example

📌 Predict Pass/Fail from hours studied. \(\beta_0 = -3\), \(\beta_1 = 1.5\)

Student	Hours	Outcome	\(x^\top\beta\)	\(\pi\)	\(\log L\)
A	2	Fail	0	0.500	−0.693
B	3	Pass	1.5	0.818	−0.201
C	4	Pass	3	0.953	−0.048

Logistic Regression — Log-Likelihood Sum

\[\ell(\beta) = (-0.693) + (-0.201) + (-0.048) = -0.942\]

MLE finds \(\beta_0, \beta_1\) that make this as large (closest to 0) as possible.
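The table and the sum can be reproduced directly. The sketch below uses the slide’s \(\beta_0 = -3\), \(\beta_1 = 1.5\); the variable names are illustrative.

```python
import math

beta0, beta1 = -3.0, 1.5
students = [("A", 2, 0), ("B", 3, 1), ("C", 4, 1)]  # (name, hours, passed)

total = 0.0
for name, hours, passed in students:
    z = beta0 + beta1 * hours
    pi = 1.0 / (1.0 + math.exp(-z))
    # Pass contributes log(pi); Fail contributes log(1 - pi).
    contribution = math.log(pi) if passed else math.log(1.0 - pi)
    total += contribution
    print(f"{name}: z = {z:.1f}, pi = {pi:.3f}, log L = {contribution:.3f}")

print(f"total = {total:.3f}")
```

At full precision the total prints as −0.943; the −0.942 above comes from adding the three rounded entries.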

PART 2

Information Entropy

History · Shannon · Physics vs. Statistics · Lindley’s Measure

A Brief History of Entropy

Year	Person	Contribution
1850s	Clausius	Thermodynamic entropy: \(dS = dQ/T\)
1877	Boltzmann	\(S = k \cdot \ln(W)\) — microstates
1948	Shannon	Information entropy — uncertainty in distributions
1956	Lindley	Bayesian information measure
1957	Jaynes	Maximum Entropy Principle

Shannon Entropy — The Formula

\[H(X) = -\sum_x p(x)\cdot\log_2 p(x)\]

(Sum over all possible outcomes \(x\))

Shannon Entropy — Properties

Property	Detail
📐 Units	\(\log_2\) → bits; \(\ln\) → nats; \(\log_{10}\) → hartleys
⚖️ Range	\(H \geq 0\) always; \(H = 0\) means certainty
🎯 Maximum	\(H\) is highest, \(\log_2 n\) bits for \(n\) outcomes, when all outcomes are equally likely
❓ Interpretation	Average “surprise” per observation

Entropy Example — Fair Coin

\(p(H) = 0.5\), \(\quad p(T) = 0.5\)

\[H(X) = -[0.5\log_2(0.5) + 0.5\log_2(0.5)]\] \[= -[0.5(-1)+0.5(-1)] = \mathbf{1.0 \text{ bit}}\]

➡️ Maximum uncertainty — you need exactly 1 bit to describe each flip.

Entropy Example — Biased Coin

\(p(H) = 0.9\), \(\quad p(T) = 0.1\)

\[H(X) = -[0.9\log_2(0.9) + 0.1\log_2(0.1)]\] \[= -[-0.137 + (-0.332)] = \mathbf{0.469 \text{ bits}}\]

➡️ Lower entropy = less surprise. Almost certain it will be Heads!
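Both coin examples can be checked with a short helper. This is a minimal sketch; the name `entropy_bits` is an assumption, not a library function.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(f"{entropy_bits([0.5, 0.5]):.3f}")  # 1.000
print(f"{entropy_bits([0.9, 0.1]):.3f}")  # 0.469
```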

Physics vs. Statistical Entropy — Formulas

	Physics	Statistics
Formula	\(S = k \cdot \ln(W)\)	\(H = -\sum p(x)\log p(x)\)
Units	Joules / Kelvin	Bits or Nats
Variable	\(W\) = microstates	\(p(x)\) = outcome probability

Physics vs. Statistical Entropy — Concepts

	Physics	Statistics
Concept	Disorder / irreversibility	Uncertainty / unpredictability
Subject	Gas molecules, heat	Probability distributions, data
Direction	Never decreases in an isolated system (2nd Law)	Can increase or decrease

Note

Shannon chose the name “entropy” because his formula has the same mathematical structure as Boltzmann’s: with \(W\) equally likely microstates, \(p = 1/W\) and \(-\sum p\ln p = \ln W\).

Lindley’s Information Measure — Formula

Lindley (1956): Expected information gained from an experiment:

\[I = H(\theta) - \mathbb{E}[H(\theta \mid x)]\]

\[= \text{Entropy(prior)} - \text{Expected Entropy(posterior)}\]

Lindley’s Information Measure — Intuition

Before the experiment: high uncertainty about \(\theta\) → high \(H(\text{prior})\)
After seeing data \(x\): posterior concentrates → lower \(H(\text{posterior})\)
Lindley’s \(I\) = how much uncertainty was reduced by data

Big \(I\) → data was very informative
Small \(I\) → data told you little

Lindley’s Measure — Coin Example

Estimating if a coin is fair:

Prior: Uniform over \(p \in [0,1]\) → \(H(\text{prior})\) is high
After 100 flips (60 H, 40 T): Posterior concentrates near \(p = 0.6\)
\(H(\text{posterior})\) is much lower

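\[I = H(\text{prior}) - H(\text{posterior}) \quad\Rightarrow\quad \text{large = very useful data!}\]

This is the reduction for the one dataset actually observed; Lindley’s \(I\) averages it over all datasets the experiment could produce.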
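To put rough numbers on the coin example, the sketch below makes one simplifying assumption: \(p\) is restricted to a grid of 101 values so that ordinary discrete entropy applies. It then computes the entropy reduction for the observed 60 Heads and 40 Tails.

```python
import math

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumption: p lives on a grid 0.00, 0.01, ..., 1.00 with a uniform prior.
grid = [i / 100 for i in range(101)]
prior = [1.0 / len(grid)] * len(grid)

# Posterior is proportional to prior times likelihood p^60 (1 - p)^40.
unnormalised = [w * p**60 * (1.0 - p) ** 40 for w, p in zip(prior, grid)]
evidence = sum(unnormalised)
posterior = [u / evidence for u in unnormalised]

h_prior = entropy_bits(prior)
h_posterior = entropy_bits(posterior)
print(f"H(prior)     = {h_prior:.2f} bits")
print(f"H(posterior) = {h_posterior:.2f} bits")
print(f"reduction    = {h_prior - h_posterior:.2f} bits")
```

The exact figures depend on the grid spacing, so they are best read as an illustration of the direction and rough size of the effect.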

PART 3

Kullback-Leibler Divergence

How different are two probability distributions?

What is KL Divergence?

KL divergence measures how much distribution \(P\) differs from a reference distribution \(Q\).

Discrete:

\[D_{KL}(P \,\|\, Q) = \sum_x P(x)\,\log\!\left[\frac{P(x)}{Q(x)}\right]\]

Continuous:

\[D_{KL}(P \,\|\, Q) = \int p(x)\,\log\!\left[\frac{p(x)}{q(x)}\right]dx\]

KL Divergence — Key Properties

Property	Explanation
🔢 Always ≥ 0	\(D_{KL} = 0\) only when \(P = Q\) everywhere
↔︎️ Not Symmetric	\(D_{KL}(P\|Q) \neq D_{KL}(Q\|P)\) in general, so it is not a true distance!
📡 Information Gain	Average extra bits per outcome to encode samples from \(P\) using a code built for \(Q\)
🔗 Relation to MLE	With \(P\) the empirical data distribution, minimising \(D_{KL}(P\|Q_\theta)\) over \(\theta\) ≡ MLE

KL Divergence — Weather Example Setup

📌 \(P\) = true weather distribution, \(Q\) = model’s prediction

Outcome	\(P(x)\)	\(Q(x)\)
Sunny ☀️	0.50	0.40
Cloudy ⛅	0.30	0.35
Rainy 🌧️	0.20	0.25

KL Divergence — Weather Calculation

Outcome	\(P/Q\)	\(\log_2(P/Q)\)	\(P \cdot \log_2(P/Q)\)
Sunny ☀️	1.25	0.322	0.161
Cloudy ⛅	0.857	−0.222	−0.067
Rainy 🌧️	0.80	−0.322	−0.064
Total			\(D_{KL} = 0.030\) bits

KL Divergence — Interpretation

\(D_{KL}(P \| Q) = 0.030\) bits

On average, you waste 0.030 extra bits per forecast because model \(Q\) deviates from \(P\).

Closer to 0 → Better model
If \(Q = P\) exactly → \(D_{KL} = 0\) → no wasted information
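The weather calculation can be reproduced with a short function. This is a sketch under the assumption that both distributions are listed over the same outcomes in the same order; `kl_divergence_bits` is an illustrative name.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two discrete distributions on the same outcomes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.50, 0.30, 0.20]  # sunny, cloudy, rainy (true)
q = [0.40, 0.35, 0.25]  # sunny, cloudy, rainy (model)

print(f"{kl_divergence_bits(p, q):.4f}")  # 0.0299
print(f"{kl_divergence_bits(q, p):.4f}")  # reverse direction gives a different value
```

Printing four decimals makes the asymmetry visible; at three decimals the two directions happen to round to the same number for this example.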

KL & Model Selection — The Connection

\[\text{Minimising } D_{KL}(P_\text{true} \,\|\, Q_\text{model}) \;\equiv\; \text{MLE}\]
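One way to see the equivalence, assuming the true \(P\) is replaced by the empirical distribution \(\hat P\) of the \(n\) observations and the model is \(Q_\theta(x) = f(x \mid \theta)\):

\[D_{KL}(\hat P \,\|\, Q_\theta) = \sum_x \hat P(x)\log \hat P(x) - \sum_x \hat P(x)\log f(x \mid \theta) = -H(\hat P) - \frac{1}{n}\,\ell(\theta)\]

The first term does not depend on \(\theta\), so minimising the divergence over \(\theta\) is the same as maximising \(\ell(\theta)\).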

AIC

\[\text{AIC} = 2k - 2\ell(\hat\theta)\]

\(k\) = number of parameters
Penalises model complexity
Derived as an estimate of expected KL divergence from the true distribution (up to a constant)
Smaller AIC = better model

BIC

\[\text{BIC} = k\log(n) - 2\ell(\hat\theta)\]

\(n\) = sample size, \(k\) = number of parameters
Stronger penalty for complexity than AIC once \(\log(n) > 2\), i.e. \(n \geq 8\)
Prefers simpler models
Smaller BIC = better model
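Both criteria are one-liners once the maximised log-likelihood is known. The sketch below applies them to the earlier coin data (4 Heads, 1 Tail), comparing a fair-coin model with no free parameters against the fitted model with \(k = 1\); the comparison itself is an illustration, not part of the slides.

```python
import math

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

def coin_log_likelihood(p, heads=4, tails=1):
    return heads * math.log(p) + tails * math.log(1.0 - p)

n = 5
ll_fair = coin_log_likelihood(0.5)  # fair coin, k = 0
ll_mle = coin_log_likelihood(0.8)   # p estimated by MLE, k = 1

print(f"fair: AIC = {aic(ll_fair, 0):.2f}, BIC = {bic(ll_fair, 0, n):.2f}")
print(f"MLE : AIC = {aic(ll_mle, 1):.2f}, BIC = {bic(ll_mle, 1, n):.2f}")
# fair: AIC = 6.93, BIC = 6.93
# MLE : AIC = 7.00, BIC = 6.61
```

With only five flips, \(\log(5) \approx 1.61\) is below 2, so BIC’s penalty is lighter than AIC’s here and the two criteria disagree; with eight or more observations BIC becomes the stricter of the two.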

Cross-Entropy Loss

\[H(P,Q) = H(P) + D_{KL}(P\|Q)\]

Used throughout machine learning
Minimising cross-entropy over \(Q\) ≡ minimising KL divergence from true \(P\), because \(H(P)\) does not depend on \(Q\)
Basis for neural network training
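The identity can be verified numerically. This sketch reuses the weather distributions from the earlier example and computes all three quantities in bits.

```python
import math

p = [0.50, 0.30, 0.20]  # weather example: true distribution
q = [0.40, 0.35, 0.25]  # weather example: model

entropy = -sum(pi * math.log2(pi) for pi in p)
cross_entropy = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))
kl = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

# The identity holds up to floating-point rounding.
assert math.isclose(cross_entropy, entropy + kl)
print(f"H(P) = {entropy:.3f}, D_KL = {kl:.3f}, H(P,Q) = {cross_entropy:.3f}")
# H(P) = 1.485, D_KL = 0.030, H(P,Q) = 1.515
```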

KL Model Comparison — Setup

📌 Two models predict class probabilities for 4 classes. Which is closer to true \(P\)?

Class	\(P\) (True)	\(Q_1\) (Model A)	\(Q_2\) (Model B)
Dog 🐶	0.40	0.35	0.42
Cat 🐱	0.30	0.30	0.28
Bird 🐦	0.20	0.25	0.18
Fish 🐟	0.10	0.10	0.12

KL Model Comparison — Results

Class	\(P\log_2(P/Q_1)\)	\(P\log_2(P/Q_2)\)
Dog 🐶	0.077	−0.028
Cat 🐱	0.000	0.030
Bird 🐦	−0.064	0.030
Fish 🐟	0.000	−0.026
\(D_{KL}\)	≈ 0.013 bits	≈ 0.006 bits

Tip

Model B’s divergence is about 2.2× smaller, so it is closer to the true distribution → Select Model B.
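The comparison can be recomputed from the setup table. This is a minimal sketch in bits (\(\log_2\)); the helper name is illustrative.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two discrete distributions on the same classes."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.40, 0.30, 0.20, 0.10]   # dog, cat, bird, fish
model_a = [0.35, 0.30, 0.25, 0.10]
model_b = [0.42, 0.28, 0.18, 0.12]

kl_a = kl_divergence_bits(p_true, model_a)
kl_b = kl_divergence_bits(p_true, model_b)
print(f"Model A: {kl_a:.3f} bits")
print(f"Model B: {kl_b:.3f} bits")
print("Select:", "Model A" if kl_a < kl_b else "Model B")
```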

How It All Connects

Concept	Formula	Role
Likelihood	\(L(\theta\mid\text{data}) = \prod f(x_i\mid\theta)\)	Estimate parameters via MLE
Entropy	\(H(X) = -\sum p\log p\)	Measure uncertainty
KL Divergence	\(D_{KL}(P\|Q) = \sum P\log(P/Q)\)	Compare distributions

The Unifying Insight

\[H(P,Q) = H(P) + D_{KL}(P\|Q)\]

Important

Minimising KL divergence from true \(P\) to model \(Q\)
≡ MLE of \(Q\)’s parameters
≡ Training neural networks with cross-entropy loss

Key Takeaways

Likelihood is NOT a probability of the parameter; it is a function of the parameter given fixed data
Entropy measures uncertainty — same math as Boltzmann, applied to distributions
KL divergence is asymmetric and never negative; \(D_{KL}(P\|Q) = 0\) only if \(P = Q\)
Minimising KL ≡ MLE — AIC, cross-entropy loss, and neural network training all follow from this