The core problem:
“We’re trying to figure out an unknown parameter \(\theta\). We don’t know it directly, so we collect data and use that to estimate it.”
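Our data \(X_1, X_2, \ldots, X_n\) are independent and identically distributed (iid): each observation is drawn from the same distribution \(f(x \mid \theta)\), independently of the others.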
This structure lets us use probability to model the data systematically.
| | Probability | Likelihood |
|---|---|---|
| Fixed | \(\theta\) | Data |
| Variable | Data | \(\theta\) |
| Question | What data might we see? | What \(\theta\) fits this data? |
Likelihood = \(P(\text{data} \mid \theta)\), but we flip the question:
Given the data we observed, what value of \(\theta\) makes it most probable?
\[L(\theta) = f(X_1, X_2, \ldots, X_n \mid \theta)\]
Because the observations are independent, we multiply:
\[\boxed{L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)}\]
“Multiplying individual probabilities gives the total likelihood of the entire dataset.”
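The product formula can be sketched in a few lines of Python. This is only an illustration: the Bernoulli density, the function names, and the 0/1 data are assumptions chosen for simplicity, not part of the notes.

```python
import math

def bernoulli_pmf(x, theta):
    """P(X = x | theta) for a single 0/1 observation (illustrative density)."""
    return theta if x == 1 else 1.0 - theta

def likelihood(data, theta, density=bernoulli_pmf):
    """L(theta): product of per-observation densities, valid under the iid assumption."""
    return math.prod(density(x, theta) for x in data)

# Made-up observations, used only to show how L(theta) changes with theta.
data = [1, 0, 1, 1, 0]
for theta in (0.2, 0.5, 0.6, 0.8):
    print(theta, likelihood(data, theta))
```

Any other density with the signature `density(x, theta)` could be passed in instead; the product structure stays the same.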
The peak of the likelihood curve \(L(\theta)\) is our best estimate of \(\theta\).
The Maximum Likelihood Estimator is the value of \(\theta\) that maximizes \(L(\theta)\):
\[\hat{\theta} = \underset{\theta}{\arg\max} \ L(\theta)\]
It’s the \(\theta\) value where the likelihood curve reaches its highest point — the most “believable” parameter given what we observed.
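In practice we maximize the log-likelihood \(\ell(\theta) = \log L(\theta)\). Since \(\log(ab) = \log a + \log b\), products become sums and derivatives get easier, and because \(\log\) is strictly increasing, the maximum stays in the same place.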
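A small numerical check of that claim, as a sketch: the likelihood and the log-likelihood are evaluated on a grid of candidate values and should peak at the same point. The Bernoulli model, the made-up data, and the grid resolution are all assumptions for illustration.

```python
import math

def likelihood(data, theta):
    """Product of Bernoulli probabilities for 0/1 data."""
    return math.prod(theta if x == 1 else 1.0 - theta for x in data)

def log_likelihood(data, theta):
    """Sum of log-probabilities: the log turns the product into a sum."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

# Made-up 0/1 observations (5 ones, 3 zeros).
data = [1, 1, 0, 1, 0, 1, 1, 0]

# Candidate theta values strictly inside (0, 1), so log() is always defined.
grid = [i / 1000 for i in range(1, 1000)]

best_by_L = max(grid, key=lambda t: likelihood(data, t))
best_by_logL = max(grid, key=lambda t: log_likelihood(data, t))
print(best_by_L, best_by_logL)  # both maximizers should coincide
```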
Let \(X_i \sim \mathcal{N}(\mu, \sigma^2)\) with known \(\sigma^2\). Find \(\hat{\mu}\).
Log-likelihood: \[\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2\]
Differentiate & set to zero: \[\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu) = 0\]
Solve: \[\boxed{\hat{\mu}_{MLE} = \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i}\]
The sample mean is the MLE for the normal mean — reassuringly intuitive!
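The result can be sanity-checked numerically. The sketch below evaluates the derivative from the derivation at the sample mean and compares the log-likelihood there with nearby values; the data and the "known" variance are made-up numbers, and the function names are hypothetical.

```python
import math

def log_likelihood(data, mu, sigma2):
    """Normal log-likelihood with known variance sigma2."""
    n = len(data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma2))

def score(data, mu, sigma2):
    """Derivative of the log-likelihood: (1 / sigma^2) * sum(x_i - mu)."""
    return sum(x - mu for x in data) / sigma2

data = [4.8, 5.1, 5.6, 4.9, 5.3]  # made-up observations
sigma2 = 0.25                      # variance assumed known

mu_hat = sum(data) / len(data)     # the sample mean

print("mu_hat:", mu_hat)
print("score at mu_hat:", score(data, mu_hat, sigma2))  # zero up to rounding
print("ell at mu_hat:", log_likelihood(data, mu_hat, sigma2))
print("ell at mu_hat + 0.1:", log_likelihood(data, mu_hat + 0.1, sigma2))
```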
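Suppose we flip a coin 5 times and observe 3 heads and 2 tails. Let \(p\) be the probability of heads. We will use Maximum Likelihood Estimation (MLE) to estimate \(p\).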
The likelihood function gives the probability of observing the data (a particular sequence with 3 heads and 2 tails) for a given value of \(p\).
The likelihood \(L(p)\) is the product of probabilities for each flip:
\[L(p) = p^3 \times (1 - p)^2\]
Where:

- \(p^3\) is the probability of flipping 3 heads.
- \((1 - p)^2\) is the probability of flipping 2 tails.
We will now move to the next step of maximizing this likelihood to find \(p\).
Log-Likelihood Function:
To simplify the maximization process, we take the logarithm of the likelihood function:
\[\ell(p) = 3 \log(p) + 2 \log(1 - p)\]
We differentiate the log-likelihood with respect to \(p\) and set it equal to zero to maximize the function:
\[\frac{d\ell(p)}{dp} = \frac{3}{p} - \frac{2}{1 - p} = 0\]
Now, solve the equation:
\[\frac{3}{p} = \frac{2}{1 - p}\]
Rearrange and solve for \(p\):
\[3(1 - p) = 2p\]
\[3 - 3p = 2p\]
\[3 = 5p\]
\[\boxed{p = \frac{3}{5} = 0.6}\]
Thus, the Maximum Likelihood Estimate (MLE) for the probability of heads is 0.6.
This matches the observed fraction of heads, 3 out of 5: for coin flips, the MLE of the probability of heads is simply the sample proportion.
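Setting the derivative to zero only locates a stationary point. As a brief check that it is a maximum, the second derivative of the log-likelihood can be examined:

\[\frac{d^2\ell(p)}{dp^2} = -\frac{3}{p^2} - \frac{2}{(1 - p)^2} < 0 \quad \text{for all } 0 < p < 1\]

Since this is negative everywhere on \((0, 1)\), \(\ell(p)\) is concave there and \(p = 0.6\) is its unique maximizer.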
| Step | What We Do |
|---|---|
| Setup | Assume \(X_i \overset{iid}{\sim} f(x \mid \theta)\) |
| Likelihood | \(L(\theta) = \prod_i f(X_i \mid \theta)\) |
| Log-trick | \(\ell(\theta) = \sum_i \log f(X_i \mid \theta)\) |
| Optimize | \(\hat{\theta} = \arg\max_\theta \ \ell(\theta)\) |
| Solve | Set \(\frac{d\ell}{d\theta} = 0\), solve for \(\hat{\theta}\) |
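When the equation in the last step has no closed-form solution, the "Optimize" step can be carried out numerically instead. The sketch below is one possible approach, not the method used in these notes: it applies a ternary search (a swapped-in technique that assumes the log-likelihood is concave) to the coin data of 3 heads and 2 tails. The function names, search bounds, and tolerance are illustrative assumptions.

```python
import math

def log_likelihood(p, heads=3, tails=2):
    """Coin-flip log-likelihood: heads * log(p) + tails * log(1 - p)."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

def argmax_concave(f, lo, hi, tol=1e-6):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1   # the maximum cannot lie to the left of m1
        else:
            hi = m2   # the maximum cannot lie to the right of m2
    return (lo + hi) / 2

# Stay strictly inside (0, 1) so that log() is always defined.
p_hat = argmax_concave(log_likelihood, 1e-6, 1 - 1e-6)
print(round(p_hat, 4))
```

The numerical answer should agree with the analytical solution up to the chosen tolerance; for non-concave log-likelihoods a different optimizer would be needed.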
MLE gives us the parameter value that makes our observed data as probable as possible.
Maximum Likelihood Estimation