9.3 - Maximum likelihood estimation

Fisher saw room for improvement in the MoM estimators

R.A. Fisher

Image source: University of Adelaide

[Pearson’s “Method of Moments”] is without question of great practical utility, [wherein] different forms of frequency curves are fitted by calculating as many moments of the sample as there are parameters to be evaluated. The parameters chosen are those of an infinite population of the specified type having the same moments as those calculated from the sample.

[F]or that class of distribution to which the method can be applied, it has not been shown, except in the case of the normal curve, that the best values will be obtained by the method of moments. The method will, in these cases, certainly be serviceable in yielding an approximation, but to discover whether this approximation is a good or a bad one, and to improve it, if necessary, a more adequate criterion is required.

  • Fisher: On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London. p 321. 1922.

The likelihood

  • Fisher conceived of the joint distribution of the data values as the likelihood of observing the sample for a given parameter value \(\theta\)
  • For a jointly discrete i.i.d. sample:

\[L(\theta) = p(y_1,y_2,...,y_n;\theta) = \prod_{i=1}^n p(y_i;\theta)\]

  • For a jointly continuous i.i.d. sample:

\[L(\theta) = f(y_1,y_2,...,y_n;\theta) = \prod_{i=1}^n f_Y(y_i;\theta)\]

  • It then makes sense to choose, as the estimator, the value of \(\theta\) that maximizes this likelihood: i.e. the parameter value that makes the observed data “most likely.”

The MLE

  • The maximum likelihood estimator of \(\theta\), \(\hat\theta_{MLE}\), is the value of \(\theta\) that maximizes the likelihood function for a given sample of observed data.
  • Formally:

\[\hat\theta_{MLE} = \arg\!\max_{\theta}L(\theta)\]

Example: Poisson

  • Suppose we have a sample of \(n=3\) from a \(POI(\lambda)\) distribution
  • Observed values: \(y_1 = 6; y_2 = 8; y_3 = 10\).
  • Consider the following possibilities for \(\lambda\): \(\lambda = 9\) or \(\lambda= 7\).
  • Which of these is a better “guess” as to the true \(\lambda\) that led to these data?

\[L(\lambda) = p(6;\lambda)p(8;\lambda)p(10;\lambda) = \frac{e^{-\lambda}\lambda^6}{6!}\times \frac{e^{-\lambda}\lambda^8}{8!} \times \frac{e^{-\lambda}\lambda^{10}}{10!} = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]

\[L(7) = \frac{e^{-3\cdot 7}7^{24}}{6!8!10!} = 0.00138\]

\[L(9) = \frac{e^{-3\cdot 9}9^{24}}{6!8!10!} = 0.00142\]

  • Thus, the observed data are more likely when \(\lambda = 9\) than when \(\lambda =7\).

Better than “peck and check”

Of course, rather than just evaluating \(L(\lambda)\) for two values we could consider it over all possible values:

library(tidyverse)


L <- \(lam) {
 y <- c(6,8,10)
 likelihood <- exp(-3*lam)*lam^sum(y)/prod(factorial(y))
 return(likelihood)
}

L(7)
[1] 0.001378964
L(9)
[1] 0.001423158
ggplot() + 
  xlim(c(1,15))+ ylim(c(0,0.0017))+
  geom_function(fun = L) + 
  labs(y=expression(L(lambda)),
       x = expression(lambda)
       )+ 
  theme_classic(base_size = 16)

Where does this likelihood have its maximum?

Motivating the log-likelihood

  • Need to maximize:

\[L(\lambda) = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]

  • Involves products of probabilities: involves numerically tiny values, especially for large \(n\)! (Here, \(n=3\) only)
  • To maximize with calculus, must use product rule = nasty
  • Logs have nice properties, like:

\[\log(a\cdot b\cdot c) = \log(a)+\log(b)+\log(c)\] \[\log(a^b) = b\log(a)\]

\[\ln(e^x) = x \Leftarrow \mbox{ simplifies a lot of well-known pdfs!}\]

  • Since \(\log(\cdot)\) is a monotone-increasing function, the value of \(\theta\) that maximizes \(\log(L(\theta))\) is the same \(\theta\) that maximizes \(L(\theta)\)
  • As the log “stretches” out tiny numbers, results in much more stable numerical optimization (when required)

Maximizing the log-likelihood

Using calculus to find the \(\lambda\) that maximizes

\[L(\lambda) = \frac{e^{-3\lambda}\lambda^{24}}{6!8!10!}\]

\[\ln(L(\lambda)) = \ln(e^{-3\lambda}) + \ln(\lambda^{24}) -\ln(6!8!10!)=-3\lambda + 24 \ln (\lambda) -\ln(6!8!10!)\]

\[\frac{d}{d\lambda}\ln(L(\lambda)) = -3+\frac{24}{\lambda}\stackrel{set}{=}0 \Rightarrow \hat\lambda_{MLE} = 8\]

Finding Poisson MLE in general

  • Now consider finding the MLE of \(\lambda\) given a sample \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} POI(\lambda)\)

\[L(\lambda) = \prod_{i=1}^n p(y_i;\lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{y_i}}{y_i!} = \frac{e^{-n\lambda}\lambda^{\sum_{i=1}^ny_i}}{\prod_{i=1}^ny_i!}\]

\[\ln(L(\lambda)) = \ln\left(e^{-n\lambda}\right) + \ln\left(\lambda^{\sum_{i=1}^ny_i}\right)-\ln\left(\prod_{i=1}^ny_i!\right)\]

\[ = -n\lambda + \sum_{i=1}^ny_i\ln(\lambda)-\ln\left(\prod_{i=1}^ny_i!\right)\]

\[\frac{d}{d\lambda}\ln(L(\lambda)) = -n+ \frac{\sum_{i=1}^ny_i}{\lambda} \stackrel{set}{=}0\]

\[\Rightarrow \hat\lambda_{MLE} = \frac{\sum_{i=1}^ny_i}{n}\]

  • \(\therefore \hat\lambda_{MLE} = \bar Y\)

Using numerical optimization

  • For the previous problem, we could find the MLE relatively easily using calculus
  • For many problems, the MLE does not exist in closed form and must be estimated using numerical optimization
  • For single-parameter optimization:
  1. Write a log-likelikhood function for single parameter \(\theta\), with second argument for sample
  2. Use optimize() to perform single-parameter optimization. Arguments:
    • f: name of your log-likelihood function
    • ...: additional arguments to your log-likelihood function, most likely a sample vector
    • interval: interval over which to search for the maximum
    • maximum=TRUE: by default, optimize() minimizes. Toggle this to maximize.

Numerical optimization example

Writing the log-likelihood function:

pois_loglik <- function(lambda, y) {
  n <- length(y)
  ll <- -n*lambda + sum(y)*log(lambda) - sum(log(factorial(y)))
  return(ll)
}

Simulating a sample:

set.seed(12345)
n <- 20
ysamp <- rpois(20, lambda = 2)

What should the MLE be?

mean(ysamp)
[1] 2.2

Using optimize() to find MLE:

optimize(pois_loglik, y = ysamp,
         interval = c(0,5),
         maximum = TRUE)
$maximum
[1] 2.200017

$objective
[1] -36.27721

Plotting the log-likelihood function:

ggplot()+
  geom_function(fun= \(l) pois_loglik(l,ysamp)) + 
  xlim(c(0.1,5))+ 
  theme_classic(base_size = 18) +
  labs(x=expression(lambda),
       y = expression(ln(L(lambda)))
       ) + 
  geom_vline(aes(xintercept = 2.2), linetype=2)

Exponential rate MLE

  • Suppose \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\lambda)\)
  • What is the MLE of \(\lambda\)?

\[L(\lambda) = \prod_{i=1}^n f(y_i;\lambda) = \prod_{i=1}^n \lambda e^{-\lambda y_i}= \lambda^n e^{-\lambda\sum_{i=1}^ny_i}\]

\[\ln(L(\lambda)) = \ln\left(\lambda^n\right) + \ln\left(e^{-\lambda\sum_{i=1}^ny_i}\right)\]

\[ = n \ln(\lambda) - \lambda\sum_{i=1}^ny_i\]

\[\frac{d}{d\lambda}\ln(L(\lambda)) = \frac{n}{\lambda}- \sum_{i=1}^ny_i \stackrel{set}{=}0\]

\[\Rightarrow \hat\lambda_{MLE} = \frac{n}{\sum_{i=1}^ny_i}\]

  • \(\therefore \hat\lambda_{MLE} = \frac{1}{\bar Y}\)

Exponential scale MLE

  • What if we use the scale parameterization instead, i.e. suppose \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} EXP(\beta)\)?
  • What is the MLE of \(\beta\)?

\[L(\beta) = \prod_{i=1}^n f(y_i;\beta) = \prod_{i=1}^n \frac{1}{\beta} e^{-y_i/\beta}= \beta^{-n} e^{-\sum_{i=1}^ny_i/\beta}\]

\[\ln(L(\beta)) = -n \ln(\beta) - \frac{\sum_{i=1}^ny_i}{\beta}\]

\[\frac{d}{d\beta}\ln(L(\beta)) = -\frac{n}{\beta}+ \frac{\sum_{i=1}^ny_i}{\beta^2} \stackrel{set}{=}0\]

\[\Rightarrow \hat\beta_{MLE} = \frac{\sum_{i=1}^ny_i}{n}\]

  • \(\therefore \hat\beta_{MLE} = \bar Y\)

Invariance of MLEs

  • If \(\hat\theta\) is the MLE of \(\theta\), then \(\tau(\hat\theta)\) is the MLE of \(\tau(\theta)\).
  • Exponential example (\(\tau(x) = 1/x\)):

\[\hat\lambda = \frac{1}{\bar Y} \mbox{ is the MLE of } \lambda\]

\[\tau(\hat\lambda) = \frac{1}{\hat\lambda} = \frac{1}{1/\bar Y} \mbox{ is the MLE of } \frac{1}{\lambda} = \tau(\lambda) = \beta\]

Uniform upper bound MLE

  • Suppose \(Y_1,...,Y_n\) are i.i.d. \(\sim UNIF(0,\theta)\).
  • Find the MLE of \(\theta\).

Rote approach:

\[L(\theta) = \prod_{i=1}^n f(y_i;\theta) = \prod_{i=1}^n \frac{1}{\theta} = \frac{1}{\theta^n}\]

\[\ln(L(\theta)) = -n\ln(\theta)\]

\[\frac{d}{d\theta}\ln(L(\theta)) = -\frac{n}{\theta}\stackrel{set}{=}0 \Rightarrow \hat\theta_{MLE} = -\frac{n}{0} = -\infty...?\]

Indicator functions

  • When \(\theta\) parameterizes the support, we must incorporate this into our likelihood function with indicator functions:

\[L(\theta) = \prod_{i=1}^n f(y_i;\theta) = \prod_{i=1}^n \frac{1}{\theta} 1(y_i >0) 1(y_i < \theta)\]

\[= \frac{1}{\theta^n}\cdot 1(y_{(1)}>0)\cdot 1(y_{(n)}< \theta)\]

Not all likelihoods are concave!

\[L(\theta) = \frac{1}{\theta^n}\cdot 1(\theta>y_{(n)})\]

  • \(L(\theta)\) is discontinuous in \(\theta\).
  • \(L(\theta)\) is non-differentiable at its maximum (hence the infinite derivative when attempting to set it equal to 0).
  • \(L(\theta)\) achieves its maximum at \(\theta = y_{(n)}\).
  • Value of maximized likelihood:

\[L(y_{(n)}) = y_{(n)}^{-n}\]

\[\therefore \hat\theta_{MLE} = Y_{(n)}\]

Make sure to sketch the likelihood function and identify its nonzero form (monotone increasing? decreasing?) and where it achieves its maximum when you have non-concave likelihoods (when \(\theta\) parameterizes the support)

Normal MLEs

  • Suppose \(Y_1,...,Y_n \stackrel{i.i.d.}{\sim} N(\mu,\sigma^2)\)
  • Find the MLEs of \(\mu\) and \(\sigma^2\).

\[L(\mu,\sigma^2) = \prod_{i=1}^n f(y_i;\mu,\sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(y_i-\mu)^2/2\sigma^2}= (2\pi\sigma^2)^{-n/2} e^{-\sum_{i=1}^n(y_i-\mu)^2/2\sigma^2}\]

\[\ln(L(\mu,\sigma^2)) = -\frac{n}{2}\ln(\sigma^2)- \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^2}-\frac{n}{2}\ln(2\pi)\]

\[\mbox{Eq. 1: } \frac{\partial}{\partial \mu}\ln(L(\mu,\sigma^2))=\frac{\sum_{i=1}^n(y_i-\mu)}{\sigma^2} \stackrel{set}{=}0 \]

\[\mbox{Eq. 2: }\frac{\partial}{\partial \sigma^2}\ln(L(\mu,\sigma^2))=-\frac{n}{2\sigma^2}+ \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^4}\stackrel{set}{=}0\]

Solving

\[\mbox{Eq. 1: } \frac{\sum_{i=1}^n(y_i-\mu)}{\sigma^2} =0\] \[ \Rightarrow \sum_{i=1}^ny_i-n\mu=0\] \[\Rightarrow n\bar y -n\mu =0\]

\[\Rightarrow \hat\mu_{MLE} = \bar y\]

Plug into:

\[\mbox{Eq. 2: }-\frac{n}{2\sigma^2}+ \frac{\sum_{i=1}^n(y_i-\mu)^2}{2\sigma^4}=0\]

\[\Rightarrow \frac{\sum_{i=1}^n(y_i-\hat\mu)^2}{2\sigma^4} = \frac{n}{2\sigma^2}\] \[\Rightarrow \hat\sigma^2_{MLE} = \frac{\sum_{i=1}^n(y_i-\bar y)^2}{n}\]

Beta MLEs

  • Suppose \(Y_1,...,Y_n\) are i.i.d. \(\sim BETA(\alpha,\beta)\).
  • Identify the system of equations needed to solve to find the MLEs of \(\alpha\) and \(\beta\).

\[\small L(\alpha,\beta) = \prod_{i=1}^n f(y_i;\alpha,\beta) = \prod_{i=1}^n \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} y_i^{\alpha-1}(1-y_i)^{\beta-1} = \left(\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\right)^n\prod_{i=1}^ny_i^{\alpha-1}\prod_{i=1}^n(1-y_i)^{\beta-1}\]

\[\small \ln(L(\alpha,\beta)) = n\ln(\Gamma(\alpha+\beta)) -n\ln(\Gamma(\alpha))-n\ln(\Gamma(\beta)) +(\alpha-1)\sum_{i=1}^n\ln(y_i)+ (\beta-1)\sum_{i=1}^n\ln(1-y_i)\]

\[\small \frac{\partial}{\partial \alpha} \ln(L(\alpha,\beta)) = \frac{n}{\Gamma(\alpha+\beta)}\frac{\partial}{\partial \alpha} \Gamma(\alpha+\beta)-\frac{n}{\Gamma(\alpha)}\frac{\partial}{\partial \alpha} \Gamma(\alpha) + \sum_{i=1}^n\ln(y_i)...?\]

\[\small \frac{\partial}{\partial \beta} \ln(L(\alpha,\beta)) = \frac{n}{\Gamma(\alpha+\beta)}\frac{\partial}{\partial \beta} \Gamma(\alpha+\beta)-\frac{n}{\Gamma(\beta)}\frac{\partial}{\partial \beta} \Gamma(\beta) + \sum_{i=1}^n\ln(1-y_i)...?\]

Investigating \(\ln(L(\alpha,\beta))\) visually

  • Multi-parameter log-likelihood functions should have a vector of parameters as the first argument
  • Writing the beta log-likelihood function, making use of dbeta(, log=TRUE) to evaluate the log-likelihood contributions from each data point:
beta_loglik <- function(theta,ysamp) {
  alpha <- theta[1]
  beta <- theta[2]
  n <- length(ysamp)
  ll <-  sum(dbeta(ysamp, shape1=alpha, shape2=beta,log = TRUE))
  return(ll)
}

Investigating \(\ln(L(\alpha,\beta))\) visually

  • Simulating a single sample of size \(n=200\) from a \(BETA(5,2)\):
set.seed(1234)
n <- 200
mysamp <- rbeta(200, shape1=5, shape2 = 2)
head(mysamp)
[1] 0.8993290 0.6549274 0.4466458 0.9780859 0.6196458 0.6009996
  • Evaluating the value of the log-likelihood for a grid of \((\alpha,\beta)\) values:
alphaseq <- seq(2,8,l=100)
betaseq <- seq(1,6,l=100)
(loglik_grid <- expand.grid(alpha = alphaseq, beta = betaseq)
              %>% mutate(y = list(mysamp))
              %>% as_tibble()
              %>% mutate(loglik = pmap_dbl(list(alpha,beta,y), \(a,b,y) 
                                           beta_loglik(c(a,b), y)
                                           )
                         )
) %>% head()
# A tibble: 6 × 4
  alpha  beta y           loglik
  <dbl> <dbl> <list>       <dbl>
1  2        1 <dbl [200]>   66.8
2  2.06     1 <dbl [200]>   68.4
3  2.12     1 <dbl [200]>   69.9
4  2.18     1 <dbl [200]>   71.1
5  2.24     1 <dbl [200]>   72.3
6  2.30     1 <dbl [200]>   73.2
  • Plotting the 2-dimensional log-likelihood:
ggplot(data = loglik_grid,
          aes(x=alpha, y= beta, z = loglik)
       ) + 
  geom_contour_filled(binwidth=2)+
  geom_contour(binwidth=10,col='grey')+
  labs(x=expression(alpha), y = expression(beta))+
  guides(fill='none')+
  theme_classic(base_size = 22)

Using optim() to find MLEs

  • optim() is the workhorse R function for multi-dimensional optimization. It does minimization only.
  • Important arguments:
    • par: initial values. Usually we just make a guess. A good idea to try several starting points!
    • fn: the name of the function to be optimized. The first argument must be a VECTOR of parameters over which to optimize.
    • ...: Additional arguments to the log-likelihood function, including the data.
    • control: a list of arguments to control optim. One important control parameter is fnscale=-1, indicating we want to minimize the negative log-likelihood.

Implementing optim() for simulated data

  • Finding the MLEs:
optim(par = c(1,1), 
      fn = beta_loglik, 
      ysamp = mysamp, 
      control=list(fnscale=-1)
      )
$par
[1] 4.827151 1.876892

$value
[1] 96.00696

$counts
function gradient 
     275       NA 

$convergence
[1] 0

$message
NULL
  • Histogram of sample, with fitted beta curve:
ggplot()+ 
  geom_histogram(aes(x = mysamp, y = after_stat(density)), 
                 binwidth=0.02,
                 center=0.01,
                 fill='grey',
                 col='black'
                 ) +
  geom_function(fun = \(y) 
                dbeta(y, 
                      shape1 =  4.827151, 
                      shape2 = 1.876892)
                ) + 
  theme_classic(base_size = 18) + 
  labs(x='y')

Comparing MLEs to MoMs

  • Recall:

\[\hat\alpha_{MOM} = \left( \frac{\bar Y_1 - \bar Y_2}{S^2}\right)\bar Y_1\]

\[ \hat\beta_{MOM} = \left( \frac{\bar Y_1 - \bar Y_2}{S^2}\right) (1-\bar Y_1)\]

Method-of-moments estimators:

ybar1 <- mean(mysamp)
ybar2 <- mean(mysamp^2)
S2 <- ybar2-ybar1^2
(alpha_hat_mom <- ybar1*(ybar1-ybar2)/S2)
[1] 4.685153
(beta_hat_mom <- (1-ybar1)*(ybar1-ybar2)/S2)
[1] 1.809314

Maximum likelihood estimators:

optim(par = c(1,1), 
      fn = beta_loglik, 
      ysamp = mysamp, 
      control=list(fnscale=-1)
      )$par
[1] 4.827151 1.876892

Comparing the fits

ggplot()+ 
  geom_histogram(aes(x = mysamp, y = after_stat(density)), 
                 binwidth=0.02,
                 center=0.01,
                 fill='grey',
                 col='black'
                 ) +
  geom_function(fun = \(y) 
                dbeta(y,  shape1 =  4.827151,  shape2 = 1.876892),
                aes(color='Fitted beta with MLEs')
                ) + 
  geom_function(fun = \(y) 
                dbeta(y,   shape1 =  4.685153,  shape2 = 1.809314),
                aes(color='Fitted beta with MoMs')
                )+
  theme_classic(base_size = 18) + 
  labs(x='y', color='')

Pearson was not impressed

Karl Pearson

Image source: Wikipedia

Wasting your time fitting curves by moments, eh?

To my astonishment, [Fisher’s method] depends on first working out the constants of the frequency curve by the Method of Moments, and then superposing on it, by what Fisher terms the “Method of Maximum Likelihood,” a further approximation to obtain, what he holds he will thus get, “more efficient values” of the curve constants. The additional computing work after the determination by moments is as long as, I think longer than, the first fitting by moments. The practical value therefore turns on the gain in accuracy obtained by the increased labour.

  • Pearson: Method of Moments and Method of Maximum Likelihood. Biometrika p 34-35 1936.