Likelihood
- A common and fruitful approach to statistics is to assume that the data arises from a family of distributions indexed by a parameter that represents a useful summary of the distribution
- The likelihood of a collection of data is the joint density evaluated as a function of the parameters with the data fixed
- Likelihood analysis of data uses the likelihood to perform inference regarding the unknown parameter
Likelihood
Given a statistical probability mass function or density, say \( f(x, \theta) \), where \( \theta \) is an unknown parameter, the likelihood is \( f \) viewed as a function of \( \theta \) for a fixed, observed value of \( x \).
Interpretations of likelihoods
The likelihood has the following properties:
- Ratios of likelihood values measure the relative evidence of one value of the unknown parameter to another.
- Given a statistical model and observed data, all of the relevant information contained in the data regarding the unknown parameter is contained in the likelihood.
- If \( \{X_i\} \) are independent random variables, then their likelihoods multiply. That is, the likelihood of the parameters given all of the \( X_i \) is simply the product of the individual likelihoods.
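As a small illustration of the multiplication property (a sketch with hypothetical data, using `dbinom` for the Bernoulli mass function):

```r
# For independent Bernoulli flips, the joint likelihood is the product of the
# individual likelihoods
x <- c(1, 0, 1, 1)                               # hypothetical observed flips
theta <- 0.5                                     # candidate parameter value
prod(dbinom(x, 1, theta))                        # product of individual likelihoods
theta^sum(x) * (1 - theta)^(length(x) - sum(x))  # same value, written directly
```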
Example
- Suppose that we flip a coin with success probability \( \theta \)
- Recall that the mass function for \( x \) is
\[
f(x,\theta) = \theta^x(1 - \theta)^{1 - x} ~~~\mbox{for}~~~ \theta \in [0,1].
\]
where \( x \) is either \( 0 \) (Tails) or \( 1 \) (Heads)
- Suppose that the result is a head
- The likelihood is
\[
{\cal L}(\theta, 1) = \theta^1 (1 - \theta)^{1 - 1} = \theta ~~~\mbox{for} ~~~ \theta \in [0,1].
\]
- Therefore, \( {\cal L}(.5, 1) / {\cal L}(.25, 1) = 2 \)
- There is twice as much evidence supporting the hypothesis that \( \theta = .5 \) over the hypothesis that \( \theta = .25 \) (checked numerically in the sketch below)
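As a quick check (a sketch, not part of the original slides), the same ratio can be computed with R's `dbinom`, since the Bernoulli mass function is the binomial with one trial:

```r
# Likelihood ratio for a single head: L(.5, 1) / L(.25, 1)
dbinom(1, 1, 0.5) / dbinom(1, 1, 0.25)   # 0.5 / 0.25 = 2
```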
Example continued
- Suppose now that we flip our coin from the previous example 4 times and get the sequence 1, 0, 1, 1
- The likelihood is:
\[
\begin{aligned}
{\cal L}(\theta, 1, 0, 1, 1) & = \theta^1 (1 - \theta)^{1 - 1} \times \theta^0 (1 - \theta)^{1 - 0} \\
& \quad \times \theta^1 (1 - \theta)^{1 - 1} \times \theta^1 (1 - \theta)^{1 - 1} \\
& = \theta^3 (1 - \theta)^1
\end{aligned}
\]
- This likelihood only depends on the total number of heads and the total number of tails; we might write \( {\cal L}(\theta, 1, 3) \) for shorthand
- Now consider \( {\cal L}(.5, 1, 3) / {\cal L}(.25, 1, 3) \approx 5.33 \)
- There is over five times as much evidence supporting the hypothesis that \( \theta = .5 \) over the hypothesis that \( \theta = .25 \) (see the sketch below)
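This ratio can also be checked with `dbinom` (a sketch; the binomial coefficient \( \binom{4}{3} \) appears in both numerator and denominator, so it cancels and leaves the ratio unchanged):

```r
# Likelihood ratio for 3 heads and 1 tail: L(.5, 1, 3) / L(.25, 1, 3)
dbinom(3, 4, 0.5) / dbinom(3, 4, 0.25)   # approximately 5.33
```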
Plotting likelihoods
- Generally, we want to consider all the values of \( \theta \) between 0 and 1
- A likelihood plot displays \( \theta \) on the horizontal axis against \( {\cal L}(\theta,x) \) on the vertical axis
- Because the likelihood measures relative evidence, dividing the curve by its maximum value (or any other value for that matter) does not change its interpretation
```r
# Likelihood for 3 heads in 4 flips, scaled by its maximum (attained at p = 3/4)
pvals <- seq(0, 1, length = 1000)
plot(pvals, dbinom(3, 4, pvals) / dbinom(3, 4, 3/4), type = "l",
     frame = FALSE, lwd = 3, xlab = "p", ylab = "likelihood / max likelihood")
```
Maximum likelihood
- The value of \( \theta \) where the curve reaches its maximum has a special meaning
- It is the value of \( \theta \) that is most well supported by the data
- This point is called the maximum likelihood estimate (or MLE) of \( \theta \)
\[
MLE = \mathrm{argmax}_\theta {\cal L}(\theta, x).
\]
- Another interpretation of the MLE is that it is the value of \( \theta \) that would make the data that we observed most probable
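For the coin example with 3 heads in 4 flips, the maximum can also be found numerically (a sketch using R's `optimize`; the closed-form answer is \( 3/4 \)):

```r
# Numerical MLE for 3 heads in 4 flips
lik <- function(theta) dbinom(3, 4, theta)                 # likelihood as a function of theta
optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum  # about 0.75
```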
Some results
- If \( X_1, \ldots, X_n \stackrel{iid}{\sim} N(\mu, \sigma^2) \), then the MLE of \( \mu \) is \( \bar X \) and the MLE of \( \sigma^2 \) is the biased sample variance estimate (verified numerically in the sketch after this list).
- If \( X_1,\ldots, X_n \stackrel{iid}{\sim} Bernoulli(p) \) then the MLE of \( p \) is \( \bar X \) (the sample proportion of 1s).
- If \( X_i \stackrel{iid}{\sim} Binomial(n_i, p) \) then the MLE of \( p \) is \( \frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n n_i} \) (the proportion of successes over all trials).
- If \( X \sim Poisson(\lambda t) \) then the MLE of \( \lambda \) is \( X/t \).
- If \( X_i \stackrel{iid}{\sim} Poisson(\lambda t_i) \) then the MLE of \( \lambda \) is \( \frac{\sum_{i=1}^n X_i}{\sum_{i=1}^n t_i} \).
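The first result can be checked by simulation (a sketch; the data, seed, and the choice to fix \( \sigma = 1 \) are assumptions made just for illustration):

```r
# Check that the numerical MLE of mu matches the sample mean for iid normal data
set.seed(1)                                    # hypothetical seed, for reproducibility
x <- rnorm(100, mean = 2, sd = 1)              # simulated data (assumed for illustration)
loglik <- function(mu) sum(dnorm(x, mean = mu, sd = 1, log = TRUE))
optimize(loglik, interval = c(-10, 10), maximum = TRUE)$maximum
mean(x)                                        # essentially the same value
```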
Example
- You observed 5 failure events over 94 days of monitoring a nuclear pump.
- Assuming the count of failures is Poisson, plot the likelihood for the failure rate \( \lambda \)
```r
# Likelihood for lambda (failures per day), scaled by its maximum, which is
# attained at the MLE where 94 * lambda = 5
lambda <- seq(0, 0.2, length = 1000)
likelihood <- dpois(5, 94 * lambda) / dpois(5, 5)
plot(lambda, likelihood, frame = FALSE, lwd = 3, type = "l",
     xlab = expression(lambda))
# Vertical red line at the MLE, 5/94; horizontal segments mark the 1/16 and 1/8
# reference likelihood intervals
lines(rep(5/94, 2), 0 : 1, col = "red", lwd = 3)
lines(range(lambda[likelihood > 1/16]), rep(1/16, 2), lwd = 2)
lines(range(lambda[likelihood > 1/8]), rep(1/8, 2), lwd = 2)
```
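As a follow-up (a sketch reusing the objects from the code above), the MLE and the endpoints of the 1/8 reference interval can be read off directly:

```r
5 / 94                             # MLE of lambda, about 0.053 failures per day
range(lambda[likelihood > 1/8])    # endpoints of the 1/8 likelihood interval
```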