9.3 - Maximum likelihood estimation

Fisher saw room for improvement in the MoM estimators

R.A. Fisher

[Pearson’s “Method of Moments”] is without question of great practical utility, [wherein] different forms of frequency curves are fitted by calculating as many moments of the sample as there are parameters to be evaluated. The parameters chosen are those of an infinite population of the specified type having the same moments as those calculated from the sample.

[F]or that class of distribution to which the method can be applied, it has not been shown, except in the case of the normal curve, that the best values will be obtained by the method of moments. The method will, in these cases, certainly be serviceable in yielding an approximation, but to discover whether this approximation is a good or a bad one, and to improve it, if necessary, a more adequate criterion is required.

Fisher: On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London. p 321. 1922.

The likelihood

Fisher conceived of the joint distribution of the data values as the likelihood of observing the sample for a given parameter value \(\theta\)

For a jointly discrete i.i.d. sample:

\[L(\theta) = p(y_1,y_2,...,y_n;\theta) = \prod_{i=1}^n p(y_i;\theta)\]

For a jointly continuous i.i.d. sample:

\[L(\theta) = f(y_1,y_2,...,y_n;\theta) = \prod_{i=1}^n f_Y(y_i;\theta)\]

It then makes sense to choose, as the estimator, the value of \(\theta\) that maximizes this likelihood: i.e. the parameter value that makes the observed data “most likely.”

The MLE

The maximum likelihood estimator of \(\theta\), \(\hat\theta_{MLE}\), is the value of \(\theta\) that maximizes the likelihood function for a given sample of observed data.
Formally:

\[\hat\theta_{MLE} = \arg\!\max_{\theta}L(\theta)\]

Example: Poisson

Suppose we have a sample of \(n=3\) from a \(POI(\lambda)\) distribution
Observed values: \(y_1 = 6; y_2 = 8; y_3 = 10\).
Consider the following possibilities for \(\lambda\): \(\lambda = 9\) or \(\lambda= 7\).
Which of these is a better “guess” as to the true \(\lambda\) that led to these data?

\[L(\lambda) = p(6;\lambda)p(8;\lambda)p(10;\lambda) = \frac{e^{-\lambda}\lambda^6}{6!}\times \frac{e^{-\lambda}\lambda^8}{8!} \times \frac{e^{-\lambda}\lambda^{10}}{10!} = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]

\[L(7) = \frac{e^{-3\cdot 7}7^{24}}{6!8!10!} = 0.00138\]

\[L(9) = \frac{e^{-3\cdot 9}9^{24}}{6!8!10!} = 0.00142\]

Thus, the observed data are more likely when \(\lambda = 9\) than when \(\lambda =7\).

Better than “peck and check”

Of course, rather than just evaluating \(L(\lambda)\) for two values we could consider it over all possible values:

library(tidyverse)


L <- \(lam) {
 y <- c(6,8,10)
 likelihood <- exp(-3*lam)*lam^sum(y)/prod(factorial(y))
 return(likelihood)
}

L(7)

[1] 0.001378964

L(9)

[1] 0.001423158

ggplot() + 
  xlim(c(1,15))+ ylim(c(0,0.0017))+
  geom_function(fun = L) + 
  labs(y=expression(L(lambda)),
       x = expression(lambda)
       )+ 
  theme_classic(base_size = 16)

Where does this likelihood have its maximum?

Motivating the log-likelihood

Need to maximize:

\[L(\lambda) = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]

Involves products of probabilities: involves numerically tiny values, especially for large \(n\)! (Here, \(n=3\) only)
To maximize with calculus, must use product rule = nasty

Logs have nice properties, like:

\[\log(a\cdot b\cdot c) = \log(a)+\log(b)+\log(c)\] \[\log(a^b) = b\log(a)\]

\[\ln(e^x) = x \Leftarrow \mbox{ simplifies a lot of well-known pdfs!}\]

Since \(\log(\cdot)\) is a monotone-increasing function, the value of \(\theta\) that maximizes \(\log(L(\theta))\) is the same \(\theta\) that maximizes \(L(\theta)\)
As the log “stretches” out tiny numbers, results in much more stable numerical optimization (when required)

Maximizing the log-likelihood

Using calculus to find the \(\lambda\) that maximizes

\[L(\lambda) = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]

\[\ln(L(\lambda)) = \ln(e^{-3\lambda}) + \ln(\lambda^{24}) -\ln(6!8!10!)=-3\lambda + 24 \ln (\lambda) -\ln(6!8!10!)\]

\[\frac{d}{d\lambda}\ln(L(\lambda)) = -3+\frac{24}{\lambda}\stackrel{set}{=}0 \Rightarrow \hat\lambda_{MLE} = 8\]

Finding Poisson MLE in general

Now consider finding the MLE of \(\lambda\) given a sample \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} POI(\lambda)\)

\[L(\lambda) = \prod_{i=1}^n p(y_i;\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^ny_i}}{\prod_{i=1}^ny_i!}\]

\[\ln(L(\lambda)) = \ln\left(e^{-n\lambda}\right) + \ln\left(\lambda^{\sum_{i=1}^ny_i}\right)-\ln\left(\prod_{i=1}^ny_i!\right)\]

\[ = -n\lambda + \sum_{i=1}^ny_i\ln(\lambda)-\ln\left(\prod_{i=1}^ny_i!\right)\]

\[\frac{d}{d\lambda}\ln(L(\lambda)) = -n+ \frac{\sum_{i=1}^ny_i}{\lambda} \stackrel{set}{=}0\]

\[\Rightarrow \hat\lambda_{MLE} = \frac{\sum_{i=1}^ny_i}{n}\]

\(\therefore \hat\lambda_{MLE} = \bar Y\)

Using numerical optimization

For the previous problem, we could find the MLE relatively easily using calculus
For many problems, the MLE does not exist in closed form and must be estimated using numerical optimization
For single-parameter optimization:

Write a log-likelikhood function for single parameter \(\theta\), with second argument for sample
Use optimize() to perform single-parameter optimization. Arguments:
- f: name of your log-likelihood function
- ...: additional arguments to your log-likelihood function, most likely a sample vector
- interval: interval over which to search for the maximum
- maximum=TRUE: by default, optimize() minimizes. Toggle this to maximize.

Numerical optimization example

Writing the log-likelihood function:

pois_loglik <- function(lambda, y) {
  n <- length(y)
  ll <- -n*lambda + sum(y)*log(lambda) - sum(log(factorial(y)))
  return(ll)
}

Simulating a sample:

set.seed(12345)
n <- 20
ysamp <- rpois(20, lambda = 2)

What should the MLE be?

mean(ysamp)

[1] 2.2

Using optimize() to find MLE:

optimize(pois_loglik, y = ysamp,
         interval = c(0,5),
         maximum = TRUE)

$maximum
[1] 2.200017

$objective
[1] -36.27721

Plotting the log-likelihood function:

ggplot()+
  geom_function(fun= \(l) pois_loglik(l,ysamp)) + 
  xlim(c(0.1,5))+ 
  theme_classic(base_size = 18) +
  labs(x=expression(lambda),
       y = expression(ln(L(lambda)))
       ) + 
  geom_vline(aes(xintercept = 2.2), linetype=2)

Exponential rate MLE

Suppose \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\lambda)\)
What is the MLE of \(\lambda\)?

\[L(\lambda) = \prod_{i=1}^n f(y_i;\lambda) = \prod_{i=1}^n \lambda e^{-\lambda y_i}= \lambda^n e^{-\lambda\sum_{i=1}^ny_i}\]

\[\ln(L(\lambda)) = \ln\left(\lambda^n\right) + \ln\left(e^{-\lambda\sum_{i=1}^ny_i}\right)\]

\[ = n \ln(\lambda) - \lambda\sum_{i=1}^ny_i\]

\[\frac{d}{d\lambda}\ln(L(\lambda)) = \frac{n}{\lambda}- \sum_{i=1}^ny_i \stackrel{set}{=}0\]

\[\Rightarrow \hat\lambda_{MLE} = \frac{n}{\sum_{i=1}^ny_i}\]

\(\therefore \hat\lambda_{MLE} = \frac{1}{\bar Y}\)

Exponential scale MLE

What if we use the scale parameterization instead, i.e. suppose \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\)?
What is the MLE of \(\beta\)?

\[L(\beta) = \prod_{i=1}^n f(y_i;\beta) = \prod_{i=1}^n \frac{1}{\beta} e^{-y_i/\beta}= \beta^{-n} e^{-\sum_{i=1}^ny_i/\beta}\]

\[\ln(L(\beta)) = -n \ln(\beta) - \frac{\sum_{i=1}^ny_i}{\beta}\]

\[\frac{d}{d\beta}\ln(L(\beta)) = -\frac{n}{\beta}+ \frac{\sum_{i=1}^ny_i}{\beta^2} \stackrel{set}{=}0\]

\[\Rightarrow \hat\beta_{MLE} = \frac{\sum_{i=1}^ny_i}{n}\]

\(\therefore \hat\beta_{MLE} = \bar Y\)

Invariance of MLEs

If \(\hat\theta\) is the MLE of \(\theta\), then \(\tau(\hat\theta)\) is the MLE of \(\tau(\theta)\).

Exponential example (\(\tau(x) = 1/x\)):

\[\hat\lambda = \frac{1}{\bar Y} \mbox{ is the MLE of } \lambda\]

\[\tau(\hat\lambda) = \frac{1}{\hat\lambda} = \frac{1}{1/\bar Y} \mbox{ is the MLE of } \frac{1}{\lambda} = \tau(\lambda) = \beta\]

Uniform upper bound MLE

Suppose \(Y_1,...,Y_n\) are i.i.d. \(\sim UNIF(0,\theta)\).
Find the MLE of \(\theta\).

Rote approach:

\[L(\theta) = \prod_{i=1}^n f(y_i;\theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}\]

\[\ln(L(\theta)) = -n\ln(\theta)\]

\[\frac{d}{d\theta}\ln(L(\theta)) = -\frac{n}{\theta}\stackrel{set}{=}0 \Rightarrow \hat\theta_{MLE} = -\frac{n}{0} = -\infty...?\]

Indicator functions

When \(\theta\) parameterizes the support, we must incorporate this into our likelihood function with indicator functions:

\[L(\theta) = \prod_{i=1}^n f(y_i;\theta) = \prod_{i=1}^n \frac{1}{\theta} 1(y_i >0) 1(y_i < \theta)\]

\[= \frac{1}{\theta^n}\cdot 1(y_{(1)}>0)\cdot 1(y_{(n)}< \theta)\]

Not all likelihoods are concave!

\[L(\theta) = \frac{1}{\theta^n}\cdot 1(\theta>y_{(n)})\]

\(L(\theta)\) is discontinuous in \(\theta\).
\(L(\theta)\) is non-differentiable at its maximum (hence the infinite derivative when attempting to set it equal to 0).
\(L(\theta)\) achieves its maximum at \(\theta = y_{(n)}\).
Value of maximized likelihood:

\[L(y_{(n)}) = y_{(n)}^{-n}\]

\[\therefore \hat\theta_{MLE} = Y_{(n)}\]

Make sure to sketch the likelihood function and identify its nonzero form (monotone increasing? decreasing?) and where it achieves its maximum when you have non-concave likelihoods (when \(\theta\) parameterizes the support)

Normal MLEs

Suppose \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)\)
Find the MLEs of \(\mu\) and \(\sigma^2\).

\[L(\mu,\sigma^2) = \prod_{i=1}^n f(y_i;\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(y_i-\mu)^2/2\sigma^2}= (2\pi\sigma^2)^{-n/2} e^{-\sum_{i=1}^n(y_i-\mu)^2/2\sigma^2}\]

\[\ln(L(\mu,\sigma^2)) = -\frac{n}{2}\ln(\sigma^2)- \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^2}-\frac{n}{2}\ln(2\pi)\]

\[\mbox{Eq. 1: } \frac{\partial}{\partial \mu}\ln(L(\mu,\sigma^2))=\frac{\sum_{i=1}^n(y_i-\mu)}{\sigma^2} \stackrel{set}{=}0 \]

\[\mbox{Eq. 2: }\frac{\partial}{\partial \sigma^2}\ln(L(\mu,\sigma^2))=-\frac{n}{2\sigma^2}+ \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^4}\stackrel{set}{=}0\]

Solving

\[\mbox{Eq. 1: } \frac{\sum_{i=1}^n(y_i-\mu)}{\sigma^2} =0\] \[ \Rightarrow \sum_{i=1}^ny_i-n\mu=0\] \[\Rightarrow n\bar y -n\mu =0\]

\[\Rightarrow \hat\mu_{MLE} = \bar y\]

Plug into:

\[\mbox{Eq. 2: }-\frac{n}{2\sigma^2}+ \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^4}=0\]

\[\Rightarrow \frac{\sum_{i=1}^n(y_i-\hat\mu)^2}{2\sigma^4} = \frac{n}{2\sigma^2}\] \[\Rightarrow \hat\sigma^2_{MLE} = \frac{\sum_{i=1}^n(y_i-\bar y)^2}{n}\]

Beta MLEs

Suppose \(Y_1,...,Y_n\) are i.i.d. \(\sim BETA(\alpha,\beta)\).
Identify the system of equations needed to solve to find the MLEs of \(\alpha\) and \(\beta\).

\[\small L(\alpha,\beta) = \prod_{i=1}^n f(y_i;\alpha,\beta) = \prod_{i=1}^n \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} y_i^{\alpha-1}(1-y_i)^{\beta-1} = \left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^n\prod_{i=1}^ny_i^{\alpha-1}\prod_{i=1}^n(1-y_i)^{\beta-1}\]

\[\small \ln(L(\alpha,\beta)) = n\ln(\Gamma(\alpha+\beta)) -n\ln(\Gamma(\alpha))-n\ln(\Gamma(\beta)) +(\alpha-1)\sum_{i=1}^n\ln(y_i)+ (\beta-1)\sum_{i=1}^n\ln(1-y_i)\]

\[\small \frac{\partial}{\partial \alpha} \ln(L(\alpha,\beta)) = \frac{n}{\Gamma(\alpha+\beta)}\frac{\partial}{\partial \alpha} \Gamma(\alpha+\beta)-\frac{n}{\Gamma(\alpha)}\frac{\partial}{\partial \alpha} \Gamma(\alpha) + \sum_{i=1}^n\ln(y_i)...?\]

\[\small \frac{\partial}{\partial \beta} \ln(L(\alpha,\beta)) = \frac{n}{\Gamma(\alpha+\beta)}\frac{\partial}{\partial \beta} \Gamma(\alpha+\beta)-\frac{n}{\Gamma(\beta)}\frac{\partial}{\partial \beta} \Gamma(\beta) + \sum_{i=1}^n\ln(1-y_i)...?\]

Investigating \(\ln(L(\alpha,\beta))\) visually

Multi-parameter log-likelihood functions should have a vector of parameters as the first argument
Writing the beta log-likelihood function, making use of dbeta(, log=TRUE) to evaluate the log-likelihood contributions from each data point:

beta_loglik <- function(theta,ysamp) {
  alpha <- theta[1]
  beta <- theta[2]
  n <- length(ysamp)
  ll <-  sum(dbeta(ysamp, shape1=alpha, shape2=beta,log = TRUE))
  return(ll)
}

Investigating \(\ln(L(\alpha,\beta))\) visually

Simulating a single sample of size \(n=200\) from a \(BETA(5,2)\):

set.seed(1234)
n <- 200
mysamp <- rbeta(200, shape1=5, shape2 = 2)
head(mysamp)

[1] 0.8993290 0.6549274 0.4466458 0.9780859 0.6196458 0.6009996

Evaluating the value of the log-likelihood for a grid of \((\alpha,\beta)\) values:

alphaseq <- seq(2,8,l=100)
betaseq <- seq(1,6,l=100)
(loglik_grid <- expand.grid(alpha = alphaseq, beta = betaseq)
              %>% mutate(y = list(mysamp))
              %>% as_tibble()
              %>% mutate(loglik = pmap_dbl(list(alpha,beta,y), \(a,b,y) 
                                           beta_loglik(c(a,b), y)
                                           )
                         )
) %>% head()

# A tibble: 6 × 4
  alpha  beta y           loglik
  <dbl> <dbl> <list>       <dbl>
1  2        1 <dbl [200]>   66.8
2  2.06     1 <dbl [200]>   68.4
3  2.12     1 <dbl [200]>   69.9
4  2.18     1 <dbl [200]>   71.1
5  2.24     1 <dbl [200]>   72.3
6  2.30     1 <dbl [200]>   73.2

Plotting the 2-dimensional log-likelihood:

ggplot(data = loglik_grid,
          aes(x=alpha, y= beta, z = loglik)
       ) + 
  geom_contour_filled(binwidth=2)+
  geom_contour(binwidth=10,col='grey')+
  labs(x=expression(alpha), y = expression(beta))+
  guides(fill='none')+
  theme_classic(base_size = 22)

Using `optim()` to find MLEs

optim() is the workhorse R function for multi-dimensional optimization. It does minimization only.
Important arguments:
- par: initial values. Usually we just make a guess. A good idea to try several starting points!
- fn: the name of the function to be optimized. The first argument must be a VECTOR of parameters over which to optimize.
- ...: Additional arguments to the log-likelihood function, including the data.
- control: a list of arguments to control optim. One important control parameter is fnscale=-1, indicating we want to minimize the negative log-likelihood.

Implementing `optim()` for simulated data

Finding the MLEs:

optim(par = c(1,1), 
      fn = beta_loglik, 
      ysamp = mysamp, 
      control=list(fnscale=-1)
      )

$par
[1] 4.827151 1.876892

$value
[1] 96.00696

$counts
function gradient 
     275       NA 

$convergence
[1] 0

$message
NULL

Histogram of sample, with fitted beta curve:

ggplot()+ 
  geom_histogram(aes(x = mysamp, y = after_stat(density)), 
                 binwidth=0.02,
                 center=0.01,
                 fill='grey',
                 col='black'
                 ) +
  geom_function(fun = \(y) 
                dbeta(y, 
                      shape1 =  4.827151, 
                      shape2 = 1.876892)
                ) + 
  theme_classic(base_size = 18) + 
  labs(x='y')

Comparing MLEs to MoMs

Recall:

\[\hat\alpha_{MOM} = \left( \frac{\bar Y_1 - \bar Y_2}{S^2}\right)\bar Y_1\]

\[ \hat\beta_{MOM} = \left( \frac{\bar Y_1 - \bar Y_2}{S^2}\right) (1-\bar Y_1)\]

Method-of-moments estimators:

ybar1 <- mean(mysamp)
ybar2 <- mean(mysamp^2)
S2 <- ybar2-ybar1^2
(alpha_hat_mom <- ybar1*(ybar1-ybar2)/S2)

[1] 4.685153

(beta_hat_mom <- (1-ybar1)*(ybar1-ybar2)/S2)

[1] 1.809314

Maximum likelihood estimators:

optim(par = c(1,1), 
      fn = beta_loglik, 
      ysamp = mysamp, 
      control=list(fnscale=-1)
      )$par

[1] 4.827151 1.876892

Comparing the fits

ggplot()+ 
  geom_histogram(aes(x = mysamp, y = after_stat(density)), 
                 binwidth=0.02,
                 center=0.01,
                 fill='grey',
                 col='black'
                 ) +
  geom_function(fun = \(y) 
                dbeta(y,  shape1 =  4.827151,  shape2 = 1.876892),
                aes(color='Fitted beta with MLEs')
                ) + 
  geom_function(fun = \(y) 
                dbeta(y,   shape1 =  4.685153,  shape2 = 1.809314),
                aes(color='Fitted beta with MoMs')
                )+
  theme_classic(base_size = 18) + 
  labs(x='y', color='')

Pearson was not impressed

Karl Pearson

Image source: Wikipedia

Wasting your time fitting curves by moments, eh?

To my astonishment, [Fisher’s method] depends on first working out the constants of the frequency curve by the Method of Moments, and then superposing on it, by what Fisher terms the “Method of Maximum Likelihood,” a further approximation to obtain, what he holds he will thus get, “more efficient values” of the curve constants. The additional computing work after the determination by moments is as long as, I think longer than, the first fitting by moments. The practical value therefore turns on the gain in accuracy obtained by the increased labour.

Pearson: Method of Moments and Method of Maximum Likelihood. Biometrika p 34-35 1936.

9.3 - Maximum likelihood estimation

Fisher saw room for improvement in the MoM estimators

The likelihood

The MLE

Example: Poisson

Better than “peck and check”

Motivating the log-likelihood

Maximizing the log-likelihood

Finding Poisson MLE in general

Using numerical optimization

Numerical optimization example

Exponential rate MLE

Exponential scale MLE

Invariance of MLEs

Uniform upper bound MLE

Indicator functions

Not all likelihoods are concave!

Normal MLEs

Solving

Beta MLEs

Investigating \(\ln(L(\alpha,\beta))\) visually

Investigating \(\ln(L(\alpha,\beta))\) visually

Using optim() to find MLEs

Implementing optim() for simulated data

Comparing MLEs to MoMs

Comparing the fits

Pearson was not impressed

Using `optim()` to find MLEs

Implementing `optim()` for simulated data