Maximum Likelihood Estimation

Rohan Dahal

The core problem:

Unknown parameter: \(\theta\)
We collect data: \(X_1, X_2, \ldots, X_n\)
Goal: estimate \(\theta\) from the data

“We’re trying to figure out an unknown parameter \(\theta\). We don’t know it directly, so we collect data and use that to estimate it.”

Data Setup

Our data \(X_1, X_2, \ldots, X_n\) are:

Independent: knowing one observation tells us nothing about another
Identically distributed: all drawn from the same distribution
Together these two properties are abbreviated i.i.d.; each \(X_i\) follows the density / probability mass function \(f(x \mid \theta)\)

This structure lets us use probability to model the data systematically.

Key Idea: Likelihood

	Probability	Likelihood
Fixed	\(\theta\)	Data
Variable	Data	\(\theta\)
Question	What data might we see?	What \(\theta\) fits this data?

Likelihood = \(P(\text{data} \mid \theta)\), but we flip the question:

Given the data we observed, what value of \(\theta\) makes it most probable?

The Likelihood Function

\[L(\theta) = f(X_1, X_2, \ldots, X_n \mid \theta)\]

Because the observations are independent, we multiply:

\[\boxed{L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)}\]

“Multiplying individual probabilities gives the total likelihood of the entire dataset.”
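
The product formula can be sketched in a few lines of code. This is only an illustration under an assumed model, \(\mathcal{N}(\theta, 1)\); the function names `normal_pdf` and `likelihood` and the data values are hypothetical and not part of the slides.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x (assumed model)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(theta, data):
    """L(theta): product of the individual densities f(x_i | theta)."""
    return math.prod(normal_pdf(x, theta) for x in data)

# Made-up observations, used only for illustration.
data = [4.8, 5.1, 5.3, 4.9, 5.4]

# A theta close to the data gives a larger likelihood than one far away.
print(likelihood(5.0, data) > likelihood(3.0, data))
```

The same data are scored under two candidate values of \(\theta\); only \(\theta\) changes, which is exactly the "flipped question" of the likelihood view.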

Visual Intuition: The Likelihood Curve

The peak of the likelihood curve, \(L(\theta)\) plotted against \(\theta\), is our best estimate of \(\theta\).

MLE Definition

The Maximum Likelihood Estimator is the value of \(\theta\) that maximizes \(L(\theta)\):

\[\hat{\theta} = \underset{\theta}{\arg\max} \ L(\theta)\]

It’s the \(\theta\) value where the likelihood curve reaches its highest point — the most “believable” parameter given what we observed.
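
One rough way to see the arg max in action is to evaluate the likelihood on a grid of candidate values and keep the largest. The sketch below assumes an \(\mathcal{N}(\mu, 1)\) model and made-up data; the grid range and step size are arbitrary choices for illustration.

```python
import math

def likelihood(mu, data):
    """Likelihood of mu under an assumed N(mu, 1) model."""
    return math.prod(
        math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)
        for x in data
    )

# Made-up observations, used only for illustration.
data = [4.8, 5.1, 5.3, 4.9, 5.4]

# Candidate values 4.00, 4.01, ..., 6.00.
grid = [4.0 + 0.01 * k for k in range(201)]

# arg max over the grid: the candidate with the largest likelihood.
theta_hat = max(grid, key=lambda mu: likelihood(mu, data))
print(round(theta_hat, 2))
```

A grid search only approximates the maximizer to within the grid spacing, so it is a way to build intuition rather than a replacement for solving the problem analytically.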

How to Find the MLE

Write the likelihood \(L(\theta) = \prod_{i=1}^n f(X_i \mid \theta)\)
Take the log → log-likelihood \(\ell(\theta) = \sum_{i=1}^n \log f(X_i \mid \theta)\)
Differentiate \(\dfrac{d\ell}{d\theta}\)
Set equal to zero \(\dfrac{d\ell}{d\theta} = 0\)
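Solve for \(\hat{\theta}\), then confirm it is a maximum (for example, \(\dfrac{d^2\ell}{d\theta^2} < 0\) at \(\hat{\theta}\))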
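
When the equation \(d\ell/d\theta = 0\) has no convenient closed form, the same recipe can be carried out numerically. The sketch below is one possible approach, not the method of the slides: it assumes a hypothetical dataset of 7 successes in 10 Bernoulli trials and finds the root of the derivative by bisection.

```python
def score(p, k=7, n=10):
    """Derivative of the Bernoulli log-likelihood k*log(p) + (n-k)*log(1-p)."""
    return k / p - (n - k) / (1.0 - p)

# The derivative is positive near 0 and negative near 1, and it is
# decreasing in between, so bisection can locate its single root.
lo, hi = 1e-6, 1.0 - 1e-6
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid   # root lies to the right of mid
    else:
        hi = mid   # root lies at or to the left of mid

p_hat = 0.5 * (lo + hi)
print(round(p_hat, 6))
```

Because the derivative changes sign from positive to negative at the root, the root is a maximum of the log-likelihood, not a minimum.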

Why Use Log-Likelihood?

\(\log(ab) = \log a + \log b\), so products become sums and derivatives get easier. Because \(\log\) is strictly increasing, the maximum stays in the same place.
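
Both points can be checked numerically. The sketch below assumes an \(\mathcal{N}(\mu, 1)\) model with made-up and simulated data. It also illustrates an extra practical benefit not stated above: a product of many small densities underflows in floating point, while the sum of logs stays finite.

```python
import math
import random

def pdf(x, mu):
    """Density of an assumed N(mu, 1) model at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood(mu, data):
    return math.prod(pdf(x, mu) for x in data)

def log_likelihood(mu, data):
    return sum(math.log(pdf(x, mu)) for x in data)

# Same maximizer: log is strictly increasing, so the peak does not move.
small = [0.3, -0.2, 0.5, 0.1, 0.4]            # made-up observations
grid = [k / 100 for k in range(-100, 101)]    # candidates -1.00 ... 1.00
best_l = max(grid, key=lambda mu: likelihood(mu, small))
best_ll = max(grid, key=lambda mu: log_likelihood(mu, small))
print(best_l == best_ll)

# Numerical stability: a long product underflows, the sum of logs does not.
random.seed(0)
big = [random.gauss(0.0, 1.0) for _ in range(2000)]
print(likelihood(0.0, big))       # product of 2000 values, each below 0.4
print(log_likelihood(0.0, big))   # finite negative number
```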

Worked Example: Normal Distribution

Let \(X_i \sim \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\). Find \(\hat{\mu}\).

Log-likelihood: \[\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2\]

Differentiate & set to zero: \[\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = 0\]

Solve: \[\boxed{\hat{\mu}_{MLE} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i}\]

The sample mean is the MLE for the normal mean — reassuringly intuitive!
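
The result can be sanity-checked numerically. In this sketch the data values and the "known" variance `sigma2` are made up for illustration; the two functions mirror the log-likelihood and derivative formulas above.

```python
import math

def log_likelihood(mu, data, sigma2):
    """Normal log-likelihood with known variance sigma2."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sq / (2.0 * sigma2)

def score(mu, data, sigma2):
    """Derivative of the log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma2

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # made-up observations
sigma2 = 0.25                      # variance assumed known

mu_hat = sum(data) / len(data)     # the sample mean

# The derivative vanishes (up to rounding) at the sample mean ...
print(abs(score(mu_hat, data, sigma2)) < 1e-9)

# ... and moving away from it in either direction lowers the log-likelihood.
print(all(
    log_likelihood(mu_hat, data, sigma2) > log_likelihood(mu_hat + d, data, sigma2)
    for d in (-0.5, -0.1, 0.1, 0.5)
))
```

Note that the value of `sigma2` rescales the log-likelihood but does not affect where it peaks, which matches the fact that \(\sigma^2\) drops out of the solution.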

Estimating the Probability of a Coin Flip

We have a biased coin.
We want to estimate the probability \(p\) of landing heads, where \(p\) is between 0 and 1.
We flip the coin 5 times, and the results are:
- 3 heads (H)
- 2 tails (T)

We will use Maximum Likelihood Estimation (MLE) to estimate \(p\).

Step 1: Define the Likelihood Function

The likelihood function gives the probability of observing the data (3 heads, 2 tails) for a given value of \(p\).

The likelihood \(L(p)\) is the product of probabilities for each flip:

\[L(p) = p^3 \times (1 - p)^2\]

Where:
- \(p^3\) is the probability of flipping 3 heads.
- \((1 - p)^2\) is the probability of flipping 2 tails.

We will now move to the next step of maximizing this likelihood to find \(p\).

Step 2: Find the Log-Likelihood and Maximize

Log-Likelihood Function:

To simplify the maximization process, we take the logarithm of the likelihood function:

\[\ell(p) = 3 \log(p) + 2 \log(1 - p)\]

We differentiate the log-likelihood with respect to \(p\) and set it equal to zero to maximize the function:

\[\frac{d\ell(p)}{dp} = \frac{3}{p} - \frac{2}{1 - p} = 0\]

Step 3: Solve for \(p\)

Now, solve the equation:

\[\frac{3}{p} = \frac{2}{1 - p}\]

Rearrange and solve for \(p\):

\[3(1 - p) = 2p\]

\[3 - 3p = 2p\]

\[3 = 5p\]

\[\boxed{\hat{p}_{MLE} = \frac{3}{5} = 0.6}\]
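
As a supplementary check, not part of the derivation above, the second derivative of the log-likelihood confirms that this critical point is a maximum rather than a minimum:

\[\frac{d^2\ell(p)}{dp^2} = -\frac{3}{p^2} - \frac{2}{(1 - p)^2} < 0 \quad \text{for all } 0 < p < 1\]

Both terms are negative on the whole interval, so \(\ell(p)\) is concave and \(p = 3/5\) is its unique maximizer.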

Conclusion

Thus, the Maximum Likelihood Estimate (MLE) for the probability of heads is \(\hat{p} = 0.6\), which is exactly the observed fraction of heads (3 out of 5).
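
The coin result can also be checked with a simple grid search over the log-likelihood \(3\log p + 2\log(1 - p)\). This is a minimal sketch; the grid of candidate values is an arbitrary choice for illustration.

```python
import math

def log_likelihood(p, heads=3, tails=2):
    """Log-likelihood of p for the observed flips (3 heads, 2 tails)."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Candidates 0.01, 0.02, ..., 0.99 (0 and 1 are excluded because
# log(0) is undefined).
grid = [k / 100 for k in range(1, 100)]

# Pick the candidate with the largest log-likelihood.
p_hat = max(grid, key=log_likelihood)
print(p_hat)
```

The grid maximizer agrees with the closed-form answer, the fraction of heads in the sample.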

Summary

Step	What We Do
Setup	Assume \(X_i \overset{iid}{\sim} f(x \mid \theta)\)
Likelihood	\(L(\theta) = \prod_i f(X_i \mid \theta)\)
Log-trick	\(\ell(\theta) = \sum_i \log f(X_i \mid \theta)\)
Optimize	\(\hat{\theta} = \arg\max_\theta \ \ell(\theta)\)
Solve	Set \(\frac{d\ell}{d\theta} = 0\), solve for \(\hat{\theta}\)

MLE gives us the parameter value that makes our observed data as probable as possible.