MT5762 Lecture 15

C. Donovan

Introduction to Maximum Likelihood

(max) Likelihood - simple and intuitive view

A simple start

  • start with just a discrete random variable \( Y \), so the probability mass function gives probabilities directly for a value of \( Y \)
  • Consider 2 independent identically distributed random variables \( Y_1 \) and \( Y_2 \) which are Poisson distributed, and which we can observe

  • If we know the mean \( \theta \) of the Poisson from which they have come, then we can write down the probability of getting observed values \( y_1 \) and \( y_2 \):

\[ P ( y_1, y_2 | \theta ) = \frac{\theta ^ {y_1} e^{- \theta} }{y_1 !} \times \frac{\theta ^ {y_2} e^{- \theta} }{y_2 !} \]
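
As a quick check of this expression (with made-up values, say \( y_1=2 \), \( y_2=0 \) and a known mean of \( \theta=1 \)), R's dpois() gives the two Poisson probabilities directly:

  # joint probability of two independent Poisson observations (made-up values)
  y1 <- 2; y2 <- 0; theta <- 1

  dpois(y1, theta) * dpois(y2, theta)   # about 0.068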

Maximum Likelihood (ML)

  • suppose that we have the two observations, \( y_1 \) and \( y_2 \), but don't know the parameter \( \theta \).

  • In some sense the most 'likely' value for \( \theta \) is the one which maximises the previous expression, i.e. the value of \( \theta \) which makes our observed data look most probable

  • When expressions like previous contain known data, but unknown parameters (like \( \theta \)), they are known as likelihoods for the unknown parameters

  • Estimating a parameter by maximising its likelihood (given some observed data) is known as maximum likelihood estimation
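
For example (continuing the made-up \( y_1=2 \), \( y_2=0 \) from above), evaluating the same expression over a few candidate values of \( \theta \) shows which candidate makes the observed data most probable:

  # likelihood of the made-up data at a few candidate values of theta
  candidates <- c(0.5, 1, 2, 3)
  sapply(candidates, function(th) dpois(y1, th) * dpois(y2, th))
  # the largest value is at theta = 1, which is the mean of y1 and y2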

The ML process

  • set up the likelihood for your observed data assuming a general form of probability function \( f(X|\boldsymbol{\theta}) \) (e.g. Poisson, Binomial, etc.)

  • Note \( \boldsymbol{\theta} \) can be a vector of several parameters \( \boldsymbol{\theta}=[ \theta_1, \theta_2, ...] \). Let us ignore \( X \) for the time being, so use \( f(X|\boldsymbol{\theta})=f(\boldsymbol{\theta}) \)

The ML process

  • multiply together the functions for calculating probabilities/densities for each observed data value, i.e. \( \prod f(\boldsymbol{\theta}) \)

  • take logs of this to simplify (i.e. we now have \( g(\boldsymbol{\theta})=\sum \log(f(\boldsymbol{\theta})) \)), then differentiate with respect to the parameter of interest (\( \theta_1 \), say)

The ML process

  • solve \[ \frac{d(g(\boldsymbol{\theta}))}{d\theta_1}=0 \] for \( \theta_1 \), i.e. find the value of \( \theta_1 \) where the log-likelihood function peaks

The ML process

  • To actually get a log-likelihood value for some value of \( \boldsymbol{\theta} \), insert the observed values \( x_1, \ldots, x_n \) into the likelihood from step 2

  • we are usually interested in the log-likelihood at the MLE, e.g. this is what we use when comparing competing models.

  • If we are dealing with iid variables, then we can often derive the form of the MLE for a general set of observed values.

Note that the theory applies to continuous RVs just as well as discrete ones - just use PDFs to create the likelihood function.
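
As a minimal sketch of the continuous case (a made-up Normal sample, with \( \sigma \) assumed known and fixed at 1, grid-searching over candidate means):

  # continuous RV: the likelihood is built from the Normal density (sigma fixed at 1)
  set.seed(1)
  contSample <- rnorm(10, mean=3, sd=1)

  candidateMeans <- seq(0, 6, by=0.01)
  loglik <- sapply(candidateMeans, function(m) sum(dnorm(contSample, mean=m, sd=1, log=TRUE)))

  candidateMeans[which.max(loglik)]   # close to mean(contSample)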

Example: dead soldiers

  • A very (in)famous dataset relating to count data is that of Bortkiewicz, published in 1898.
  • Bortkiewicz was an Economist/Statistician who took an interest in the numbers of Prussian Cavalry men dying after being kicked in the head by their horses.
  • The data was collected across several cavalry corps over a 20 year period (some 90+ fatal kicks to the head!).

Example: dead soldiers

  • Suppose this data represents the reports from one cavalry corps over this period: 10, 6, 2, 1, 1, where the first number is the number of years with no deaths, the second the number of years with 1 reported death, and so on.
  • Assuming a Poisson distribution is appropriate, calculate the Maximum Likelihood Estimate for the yearly death rate parameter \( \lambda \).

Example: dead soldiers - analytically

  • Our data comprises 20 points, i.e. \( \{0,0,\cdots,1,1,\cdots,2,2,3,4\} \).
  • There are several ways to achieve this that differ in the details, but the process remains the same.
  • First construct the likelihood function as a product of the probability mass function (the standard Poisson) for each of the \( n \) data points:

\[ \prod_{i=1}^n \frac{\lambda^{x_i}e^{-\lambda}}{x_i!} \]

Example: dead soldiers - analytically

Taking logs here will simplify matters

\[ \begin{align*} \sum_{i=1}^n \log \left( \frac{\lambda^{x_i}e^{-\lambda}}{x_i!}\right) &= \sum_{i=1}^n\left(x_i\log(\lambda)-\lambda-\log(x_i!)\right)\\ &= \sum_{i=1}^n x_i\log(\lambda)-n\lambda-\sum_{i=1}^n\log(x_i!) \end{align*} \]

Example: dead soldiers - analytically

differentiate wrt parameter of interest then solve for said parameter

\[ \begin{align*} \sum_{i=1}^n\frac{x_i}{\lambda}-n &= 0\\ \Rightarrow\quad \sum_{i=1}^n\frac{x_i}{\lambda} &= n\\ \Rightarrow\quad \hat{\lambda} &= \frac{\sum_{i=1}^n x_i}{n} \end{align*} \]

which is (as you perhaps expected) simply the mean of the data.

Example: dead soldiers - analytically

So, summing the data gives \( 0\times 10+1\times 6 + 2\times 2 + 3\times 1 + 4\times 1=17 \) and thus \( \hat{\lambda}=17/20=0.85 \) - slightly less than one fatal kick to the head per year. So, given the assumption of a Poisson process, the Poisson distribution that best fits our data is the one with a mean of 0.85; this is the most likely value of \( \lambda \).

Example: dead soldiers - by computer

First enter the data and a range of parameter values to search over (here 0 to 5 in increments of 0.01)

# Basic MLE calculation for Poisson data

  # the 20 yearly death counts
  data <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,2,2,3,4)

  # candidate values for the Poisson mean (lambda)
  theta<- seq(0, 5, by=0.01)

  # likelihood of the data at each candidate value
  likelihood<- numeric(length(theta))

  for (i in 1:length(theta)) {
    likelihood[i] <- prod(dpois(data, theta[i]))
  }

Example: dead soldiers - by computer

Then, for each potential parameter value (generically \( \theta \); here it plays the role of the Poisson \( \lambda \)), calculate the likelihood of the data.

Which looks like this - the likelihood of the data for different values of \( \theta \). We want the maximum.

[Plot: likelihood of the data across the grid of \( \theta \) values, peaking near 0.85]

Example: dead soldiers - by computer

From our grid search, the answer is about the mean of the data. The sample mean is the theoretical MLE for \( \theta \) in this problem.

  mleindicator<- which(likelihood==max(likelihood))

  mle<-theta[mleindicator]

  mean(data)
[1] 0.85
  mle
[1] 0.85

Example: dead soldiers - by computer

Usually we deal with log-likelihoods: they are easier to work with because the product becomes a sum. The result, however, will be the same:

  theta<- seq(0, 5, by=0.01)

  loglikelihood<- numeric(length(theta))

  for (i in 1:length(theta)) {
    loglikelihood[i] <- sum(log(dpois(data, theta[i])))
  }

  mleindicator<- which(loglikelihood==max(loglikelihood))

  mle<-theta[mleindicator]

  mean(data)
[1] 0.85
  mle
[1] 0.85

Example: dead soldiers - by computer

Plotted, the log-likelihood peaks in the same place:

[Plot: log-likelihood of the data across the grid of \( \theta \) values, peaking at the same value]

  mleindicator<- which(loglikelihood==max(loglikelihood))

  mle<-theta[mleindicator]

  mean(data)
[1] 0.85
  mle
[1] 0.85

So what?

Likelihood has a wide range of uses in statistics:

  • it is often used as the basis of parameter estimation in models, e.g. for a regression \( Y=\beta_0+\beta_1 X \): what are the most likely values of \( \beta_0 \) and \( \beta_1 \) given a set of data \( x_1, x_2, \ldots, x_n \), assuming Normal errors? i.e. \( Y\sim N(\mu, \sigma) \) where \( \mu=E[Y|X]=\beta_0+\beta_1 X \) (the likelihood for this case is written out after this list).
  • comparing competing models: which model is best supported by the data, i.e. which is more likely to have generated the data we observed (e.g. likelihood ratios).
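
Written out for the regression case, the likelihood is just a product of Normal densities evaluated at the residuals:

\[ L(\beta_0, \beta_1, \sigma \,|\, \mathbf{x}, \mathbf{y}) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - (\beta_0 + \beta_1 x_i))^2}{2\sigma^2}\right) \]

For a fixed \( \sigma \), maximising the log of this over \( \beta_0 \) and \( \beta_1 \) is equivalent to minimising the residual sum of squares - which is why the two grid searches that follow land on the same answer.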

Regression by grid-search on parameters

Let's try looking at simple regression. We'll

  • Optimise using the sum of squared residuals, i.e. minimise this
  • Re-pose it as a (log-)likelihood problem and optimise, i.e. maximise this

Here we predict RunTime as a function of Oxygen - data that relates to running and metabolic rates.

Optimising a regression by brute force: OLS

library(tidyverse)
library(fields)
  #~ OLS regression style problem

  # read in some data
  fitnessData<- read_csv("data/fitness.csv")

  # want to look at RunTime as a function of Oxygen
  plot(fitnessData$Oxygen, fitnessData$RunTime)

  # could also write like an equation
  # plot(RunTime~Oxygen, data=fitnessData)


  #~ simple regression
  simpleReg<- lm(RunTime~Oxygen, data=fitnessData)

[Plot: RunTime against Oxygen for the fitness data]

Optimising a regression by brute force: OLS

So we aim to estimate this:


Call:
lm(formula = RunTime ~ Oxygen, data = fitnessData)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.39059 -0.47174  0.05551  0.46331  1.20112 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  21.2243     1.1678  18.175  < 2e-16 ***
Oxygen       -0.2245     0.0245  -9.166 4.59e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7148 on 29 degrees of freedom
Multiple R-squared:  0.7434,    Adjusted R-squared:  0.7345 
F-statistic: 84.01 on 1 and 29 DF,  p-value: 4.585e-10

Optimising a regression by brute force: OLS

Set up a grid to search over

  #~ grid search over parameters
  slopeGrid<- seq(-1,1, length=200)

  intGrid<- seq(10,50, length=200)

  gridSearch<- expand.grid(intGrid, slopeGrid)

Optimising a regression by brute force: OLS

Then set up the search - the thing I minimise is the residual sum of squares, \( \sum\left(y-\hat{y}\right)^2 = \sum\left(y - (\beta_0 + \beta_1 x)\right)^2 \)

  errorSSfunction<- function(x, y, intercept, slope){sum((y-(intercept+slope*x))^2)}

This just passes every intercept/slope pair in our grid through that function:

  surfaceVals<- apply(gridSearch, 1, function(q){
    errorSSfunction(fitnessData$Oxygen, fitnessData$RunTime, q[1], q[2])})

Optimising a regression by brute force: OLS

So, which of the parameter pairs had the smallest residual sum of squares (RSS)?

  minPoint<- gridSearch[which(surfaceVals==min(surfaceVals)),]

  minPoint
          Var1       Var2
15457 21.25628 -0.2261307
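
For comparison, the exact least-squares coefficients can be pulled straight from the simpleReg fit above - the grid-search values agree to roughly the resolution of the grid:

  # exact OLS coefficients from the earlier lm() fit
  coef(simpleReg)
  # (Intercept) roughly 21.22 and Oxygen roughly -0.224, as in the summary shown earlier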

Optimising a regression by brute force: OLS

Another view

  # t() so the matrix orientation matches the axes (slope across, intercept up)
  image.plot(slopeGrid, intGrid, t(matrix(surfaceVals, ncol=200)))

  abline(h=minPoint$Var1, v=minPoint$Var2, col='white', lwd = 3)

[Plot: image of the RSS surface over slope and intercept, with the minimum marked in white]

Optimising a regression by brute force: OLS

And another view

  persp(matrix(surfaceVals, ncol=200), phi=30, theta=30)

[Plot: perspective view of the RSS surface]

Optimising a regression by brute force: OLS

It's a weird surface - we can actually make things easier for ourselves. If we centre the data at the means of \( x \) and \( y \) (this makes the intercept and slope estimates uncorrelated - details not important here), we can solve for each parameter separately.

For the slope:

[Plot: RSS against candidate slope values for the centred data]
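
A minimal sketch of that centred, one-parameter search (re-using fitnessData; after centring, the intercept estimate is zero, so only the slope needs a grid):

  # centre both variables at their means, then grid-search over the slope alone
  xc <- fitnessData$Oxygen - mean(fitnessData$Oxygen)
  yc <- fitnessData$RunTime - mean(fitnessData$RunTime)

  slopeOnly <- seq(-1, 1, length=2000)
  rssCentred <- sapply(slopeOnly, function(b) sum((yc - b*xc)^2))

  slopeOnly[which.min(rssCentred)]   # close to the lm() slope of about -0.2245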

Optimising a regression by brute force: Likelihood

We can use the same approach, but now the objective function will be log-likelihood, which we try to maximise.

  # re-pose in log-likelihood terms
  # note we assume a fixed sd for the errors, something that is estimated in practice
  # - however we should get the same intercept/slope
  regLikfunction<- function(x, y, intercept, slope){res<- y-(intercept+slope*x); sum(log(dnorm(res)))}

  • Note here we assume Normality of the errors - so the log-likelihood is calculated from the residuals (I'm assuming \( \sigma=1 \), not important).
  • This sum of log-likelihoods goes up or down depending on \( \beta_0 \) and \( \beta_1 \). Up is good.

Optimising a regression by brute force: Likelihood

Again, search over a grid of parameters, find the maximum:

 #~ grid search over parameters
  slopeGrid<- seq(-1,1, length=200)

  intGrid<- seq(10,50, length=200)

  gridSearch<- expand.grid(intGrid, slopeGrid)

  surfaceVals<- apply(gridSearch, 1, function(q){
    regLikfunction(fitnessData$Oxygen, fitnessData$RunTime, q[1], q[2])})

  maxPoint<- gridSearch[which(surfaceVals==max(surfaceVals, na.rm=T)),]

  maxPoint
          Var1       Var2
15457 21.25628 -0.2261307

Optimising a regression by brute force: Likelihood

Which can be viewed like this:

  surfaceVals<- ifelse(is.infinite(surfaceVals), NA, surfaceVals)

  # again t() so the matrix orientation matches the axes
  image.plot(slopeGrid, intGrid, t(matrix(surfaceVals, ncol=200)))

  abline(h=maxPoint$Var1, v=maxPoint$Var2, col='white', lwd = 2)

[Plot: image of the log-likelihood surface over slope and intercept, with the maximum marked in white]

Optimising a regression by brute force: Likelihood

So we are finding the peak of this surface - somewhat like the RSS surface inverted

  persp(z = matrix(surfaceVals, ncol=200), phi=30, theta=30)

[Plot: perspective view of the log-likelihood surface]

Optimising a regression by brute force: Likelihood

More generally we try to exploit the gradient structure of these objective functions.

  • Here I calculate the 1st derivative of the log-likelihood based on the Normal distribution (with respect to \( \mu \), for a fixed \( \sigma \))
  • Then I make up a sample, Normally distributed, with mean 4 and sd 2

  # clearly inefficient - should use gradient search
  # first derivative of the Normal log-likelihood with respect to mu (its root is the MLE)
  firstDerivNorm<- function(data, sigma, mu){1/(sigma^2)*sum(data)-1/(sigma^2)*length(data)*mu}

  mySample<- rnorm(10,4,2)

  mean(mySample); sd(mySample)
[1] 3.412602
[1] 1.489074
  possibleMeans<- seq(-1,5, length=500)

Optimising a regression by brute force: Likelihood

  • We see this returns the sample mean (a Normal distribution with this mean makes our data most likely)
  • We're relying on there not being other, better, optima outside our search range
  • In practice there are better search methods (proper gradient methods, e.g. Newton-Raphson) - a quick sketch using R's built-in optim() follows the plot below.

[Plot: first derivative of the Normal log-likelihood across candidate means, crossing zero at the sample mean]
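
As a sketch of handing this over to a proper optimiser (quasi-Newton, so gradient-based) - here maximising the Normal log-likelihood for \( \mu \) of mySample, with the sd fixed at the generating value of 2:

  # maximise the Normal log-likelihood for mu with a quasi-Newton (BFGS) optimiser
  # (sd fixed at 2, the value used to generate mySample)
  negLogLik <- function(mu) -sum(dnorm(mySample, mean=mu, sd=2, log=TRUE))

  optim(par=0, fn=negLogLik, method="BFGS")$par   # essentially mean(mySample)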

Recap and look-forwards

We've covered:

  • Maximum likelihood in a bit more detail
  • Somewhat conceptually, how this finds our parameter estimates

Next:

  • Building more complex linear models e.g. interactions
  • Model comparisons and selection, including AIC