C. Donovan
A simple start
Consider two independent, identically distributed random variables \( Y_1 \) and \( Y_2 \) which are Poisson distributed, and which we can observe.
If we know the mean \( \theta \) of the Poisson from which they have come, then we can write down the probability of getting observed values \( y_1 \) and \( y_2 \):
\[ P ( y_1, y_2 | \theta ) = \frac{\theta ^ {y_1} e^{- \theta} }{y_1 !} \times \frac{\theta ^ {y_2} e^{- \theta} }{y_2 !} \]
Now suppose that we have the two observations, \( y_1 \) and \( y_2 \), but don't know the parameter \( \theta \).
In some sense the most 'likely' value for \( \theta \) is the one which maximises the previous expression, i.e. the value of \( \theta \) which makes our observed data look most probable.
When expressions like the one above contain known data but unknown parameters (like \( \theta \)), they are known as likelihoods for the unknown parameters.
Estimating a parameter by maximising its likelihood (given some observed data) is known as maximum likelihood estimation.
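As a quick sketch of this idea (the observed values here, \( y_1=2 \) and \( y_2=3 \), are made up purely for illustration), we can evaluate the joint probability over a grid of candidate \( \theta \) values and see where it peaks:
# hypothetical observations, for illustration only
y1 <- 2; y2 <- 3
# candidate values of theta
theta <- seq(0.1, 10, by = 0.1)
# joint probability P(y1, y2 | theta) at each candidate theta
jointProb <- dpois(y1, theta) * dpois(y2, theta)
# the theta making the data look most probable - here it lands on the mean of y1 and y2 (2.5)
theta[which.max(jointProb)]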
The general recipe:
1. Set up the likelihood for your observed data assuming a general form of probability function \( f(X|\boldsymbol{\theta}) \) (e.g. Poisson, Binomial etc.). Note \( \boldsymbol{\theta} \) can be a vector of several parameters \( \boldsymbol{\theta}=[ \theta_1, \theta_2, ...] \). Let us ignore \( X \) for the time being, and write \( f(X|\boldsymbol{\theta})=f(\boldsymbol{\theta}) \).
2. Multiply together the probability/density functions for each potential data value, i.e. \( \prod f(\boldsymbol{\theta}) \).
3. Take logs of this to simplify (so we now have \( g(\boldsymbol{\theta})=\sum \log(f(\boldsymbol{\theta})) \)), then differentiate with respect to the parameter of interest (\( \theta_1 \) say).
To actually get a log-likelihood value for some value of \( \boldsymbol{\theta} \), insert the observed values \( x_1,...,x_n \) of \( X \) into the product from step 2.
We are usually interested in the log-likelihood at the MLE, e.g. this is where we might compare competing models.
If we are dealing with iid variables, then we can often derive the form of MLE for a general value of \( X \).
Note that the theory applies to continuous RVs just as well as discrete ones - just use PDFs to create the likelihood function.
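In symbols, for iid observations \( x_1,...,x_n \) the recipe above is:
\[ L(\boldsymbol{\theta}) = \prod_{i=1}^n f(x_i|\boldsymbol{\theta}), \qquad \ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^n \log f(x_i|\boldsymbol{\theta}), \qquad \frac{\partial \ell(\boldsymbol{\theta})}{\partial \theta_1} = 0 \;\; \text{at the MLE.} \]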
As an example, for \( n \) iid Poisson(\( \lambda \)) observations \( x_1,...,x_n \) the likelihood is
\[ \prod_{i=1}^n \frac{\lambda^{x_i}e^{-\lambda}}{x_i!} \]
Taking logs here will simplify matters
\[ \begin{align*} \sum_{i=1}^n \log \left( \frac{\lambda^{x_i}e^{-\lambda}}{x_i!}\right)&\quad = \quad \sum_{i=1}^n\left(\log(\lambda^{x_i})-\lambda-\log(x_i!)\right)\\ &\quad = \quad \sum_{i=1}^nx_i\log(\lambda)-n\lambda-\sum_{i=1}^n\log(x_i!)\\ \end{align*} \]
Differentiate with respect to the parameter of interest, set the result to zero, and solve for that parameter:
\[ \begin{align*} \sum_{i=1}^n\frac{x_i}{\lambda}-n=0&\quad \Rightarrow \quad \sum_{i=1}^n\frac{x_i}{\lambda}=n\\ &\quad \Rightarrow \quad \hat{\lambda}=\frac{\sum_{i=1}^n x_i}{n} \end{align*} \]
which is (as you perhaps expected) simply the mean of the data.
So, summing the data gives \( 0\times 10+1\times 6 + 2\times 2 + 3\times 1 + 4\times 1=17 \) and thus \( \lambda=17/20=0.85 \). Slightly less than one fatal kick to the head per year. So given the assumption of a Poisson process, the Poisson distribution that best fits our data is one with a mean of 0.85; it is the most likely value of \( \lambda \).
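A quick check of that arithmetic in R:
# sum of the observed counts divided by the number of observations
(0*10 + 1*6 + 2*2 + 3*1 + 4*1)/20
[1] 0.85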
First enter the data and a range of parameter values to search over (here 0 to 5 in increments of 0.01):
# Basic MLE calculation for Poisson data
data <- c(0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,2,2,3,4)
theta <- seq(0, 5, by = 0.01)
likelihood <- numeric(length(theta))

for (i in 1:length(theta)) {
  likelihood[i] <- prod(dpois(data, theta[i]))
}
Then, for each potential parameter value (the generic \( \theta \) here plays the role of our Poisson \( \lambda \)), the loop calculates the likelihood of the data.
Which looks like this - the likelihood of the data given different values of \( \theta \). We want the max.
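A minimal sketch of how a plot like that might be produced from the theta and likelihood vectors above:
plot(theta, likelihood, type = "l", xlab = "theta", ylab = "Likelihood")
# mark the maximising value
abline(v = theta[which.max(likelihood)], lty = 2)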
From our grid search, the answer is about the mean of the data. The sample mean is the theoretical MLE for \( \theta \) in this problem.
mleindicator<- which(likelihood==max(likelihood))
mle<-theta[mleindicator]
mean(data)
[1] 0.85
mle
[1] 0.85
Usually we deal with log-likelihoods; they are easier to work with because the product becomes a sum. The result will be the same:
theta <- seq(0, 5, by = 0.01)
loglikelihood <- numeric(length(theta))

for (i in 1:length(theta)) {
  loglikelihood[i] <- sum(log(dpois(data, theta[i])))
}
mleindicator<- which(loglikelihood==max(loglikelihood))
mle<-theta[mleindicator]
mean(data)
[1] 0.85
mle
[1] 0.85
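As an aside, the same calculation can be written more compactly: dpois has a log argument, and which.max picks out the index of the maximum (a sketch equivalent to the loop above):
loglikelihood <- sapply(theta, function(t) sum(dpois(data, t, log = TRUE)))
theta[which.max(loglikelihood)]
[1] 0.85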
Likelihood has a wide range of uses in statistics.
Let's try looking at simple regression. Here we predict RunTime as a function of Oxygen - data that relates to running and metabolic rates.
library(tidyverse)
library(fields)
#~ OLS regression style problem
# read in some data
fitnessData<- read_csv("data/fitness.csv")
# want to look at RunTime as a function of Oxygen
plot(fitnessData$Oxygen, fitnessData$RunTime)
# could also write like an equation
# plot(RunTime~Oxygen, data=fitnessData)
#~ simple regression
simpleReg<- lm(RunTime~Oxygen, data=fitnessData)
So we aim to estimate this (the summary of the fitted model):
Call:
lm(formula = RunTime ~ Oxygen, data = fitnessData)
Residuals:
Min 1Q Median 3Q Max
-1.39059 -0.47174 0.05551 0.46331 1.20112
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.2243 1.1678 18.175 < 2e-16 ***
Oxygen -0.2245 0.0245 -9.166 4.59e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7148 on 29 degrees of freedom
Multiple R-squared: 0.7434, Adjusted R-squared: 0.7345
F-statistic: 84.01 on 1 and 29 DF, p-value: 4.585e-10
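For comparison with the grid searches that follow, the fitted coefficients can also be pulled straight from the lm object (a small usage sketch):
# intercept and slope from the least-squares fit (about 21.22 and -0.22, as in the summary above)
coef(simpleReg)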
Set up a grid to search over
#~ grid search over parameters
slopeGrid<- seq(-1,1, length=200)
intGrid<- seq(10,50, length=200)
gridSearch<- expand.grid(intGrid, slopeGrid)
Then set up the objective - the thing we minimise is the residual sum of squares, \( \sum_i (y_i-\hat{y}_i)^2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2 \):
errorSSfunction<- function(x, y, intercept, slope){sum((y-(intercept+slope*x))^2)}
This simply passes each intercept/slope pair in the grid through that function:
surfaceVals<- apply(gridSearch, 1, function(q){
errorSSfunction(fitnessData$Oxygen, fitnessData$RunTime, q[1], q[2])})
So, which of the parameter pairs had the smallest residual sum of squares (RSS)?
minPoint<- gridSearch[which(surfaceVals==min(surfaceVals)),]
minPoint
Var1 Var2
15457 21.25628 -0.2261307
Another view
# the surface matrix has intercepts down the rows and slopes across the columns, so transpose for plotting
image.plot(slopeGrid, intGrid, t(matrix(surfaceVals, ncol=200)))
abline(h=minPoint[,1], v=minPoint[,2], col='white', lwd = 3)
And another view
persp(matrix(surfaceVals, ncol=200), phi=30, theta=30)
It's a weird surface - we can actually make things easier for ourselves. If we centre the data at the mean of \( x \) and \( y \) (this makes the intercept and slope uncorrelated - details not important here), we can just solve for each one separately.
For the slope, this gives the familiar least-squares estimate \( \hat{\beta}_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2} \) (the intercept on the original scale is then \( \bar{y}-\hat{\beta}_1\bar{x} \)).
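A quick sketch of that idea using the fitness data from above (centre both variables, then search over the slope alone; the grid here is just for illustration):
# centre the predictor and response at their means
xc <- fitnessData$Oxygen - mean(fitnessData$Oxygen)
yc <- fitnessData$RunTime - mean(fitnessData$RunTime)
# one-dimensional grid search over the slope only
slopeOnly <- seq(-1, 1, length = 2000)
ssOnly <- sapply(slopeOnly, function(b) sum((yc - b*xc)^2))
slopeOnly[which.min(ssOnly)]  # close to the lm slope of about -0.22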
We can use the same approach, but now the objective function is the log-likelihood, which we try to maximise.
# re-pose in log-likelihood terms
# note assume a fixed sd for the error variance, something that is estimated in practice - however should get same int/slope
regLikfunction<- function(x, y, intercept, slope){
  res <- y - (intercept + slope*x)
  sum(log(dnorm(res)))
}
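As an aside, dnorm has a log argument; using it gives the same log-likelihood but avoids the log(0) = -Inf underflow that has to be patched up before plotting below (regLikfunction2 is just an illustrative name):
regLikfunction2 <- function(x, y, intercept, slope){
  res <- y - (intercept + slope*x)
  sum(dnorm(res, log = TRUE))
}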
Again, search over a grid of parameters, find the maximum:
#~ grid search over parameters
slopeGrid<- seq(-1,1, length=200)
intGrid<- seq(10,50, length=200)
gridSearch<- expand.grid(intGrid, slopeGrid)
surfaceVals<- apply(gridSearch, 1, function(q){
regLikfunction(fitnessData$Oxygen, fitnessData$RunTime, q[1], q[2])})
maxPoint<- gridSearch[which(surfaceVals==max(surfaceVals, na.rm=T)),]
maxPoint
Var1 Var2
15457 21.25628 -0.2261307
Which can be viewed like this:
# replace any -Inf log-likelihoods (from underflow) with NA so the plot scales sensibly
surfaceVals<- ifelse(is.infinite(surfaceVals), NA, surfaceVals)
# as before, transpose so intercepts run up the y-axis and slopes along the x-axis
image.plot(slopeGrid, intGrid, t(matrix(surfaceVals, ncol=200)))
abline(h=maxPoint[,1], v=maxPoint[,2], col='white', lwd = 2)
So we are finding the peak of this surface - somewhat like the RSS surface inverted.
persp(z = matrix(surfaceVals, ncol=200), phi=30, theta=30)
More generally we try to exploit the gradient structure of these objective functions.
# clearly inefficient - should use gradient search
# derivative of the normal log-likelihood with respect to mu: (sum(x) - n*mu)/sigma^2
firstDerivNorm<- function(data, sigma, mu){(sum(data) - length(data)*mu)/sigma^2}
mySample<- rnorm(10,4,2)
mean(mySample); sd(mySample)
[1] 3.412602
[1] 1.489074
possibleMeans<- seq(-1,5, length=500)
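A minimal sketch of how this grid might then be used: evaluate the derivative at each candidate mean (plugging in the sample standard deviation for sigma, which is an assumption here) and take the candidate where it is closest to zero:
# derivative evaluated at each candidate mean
derivVals <- sapply(possibleMeans, function(m) firstDerivNorm(mySample, sd(mySample), m))
# the (approximate) root of the derivative - this should sit near mean(mySample)
possibleMeans[which.min(abs(derivVals))]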
We've covered: the idea of a likelihood, maximum likelihood estimation via calculus, and MLEs found by direct grid search over the (log-)likelihood surface.
Next: