To begin with, this chapter explores cases where linear regression is not a suitable method for modelling data. These cases arise when the data do not meet the assumption of constant variance. Several examples of such data are given. In all of these cases, the relationship between the response and the explanatory variable is also typically non-linear because of the boundaries on the response variable.
Several data sets are introduced that fall into the categories of data, described above, that cannot be modelled with a linear model. A range of different models is recommended, with each data set requiring a different systematic component and a random component that follows a different distribution.
With all of these different models, separate theories (fitting algorithms, inference procedures, diagnostic tools, etc.) would need to be developed for each. However, an alternative approach is to work more generally. This is possible because the distributions mentioned above, along with others, can be grouped into a ‘family’ of distributions termed the exponential dispersion model (or EDM). As a result, a common set of estimation algorithms, inference methods and diagnostic tools can be applied to all of these models. This idea of applying a general body of theory to a range of distributions is the principle of generalised linear models (GLMs). Further advantages of GLMs are that the mean-variance relationship can be selected independently of the scale on which the predictors enter the model, and that there is freedom to select the distribution that best suits the response variable, giving a better approximation to its probability distribution.
Having discussed where linear regression is inappropriate and the resulting need for GLMs, parameter estimation is naturally the next topic of interest. Least squares is a suitable criterion for fitting models to data where the response variable is roughly normally distributed. However, as already mentioned, GLMs allow many non-normal distributions, which calls for a more general estimation methodology called ‘maximum likelihood’.
The idea behind maximum likelihood estimation is to select the values of the unknown parameters that maximise the probability (density) of the observed data; it can be used whenever a probability distribution has been proposed for a given data set.
For example, suppose \(y_1,\dots,y_n\) are independent observations from an exponential distribution with rate parameter \(\theta\). The exponential distribution has the probability density function: \[P(y;\theta)=\theta\exp(-y\theta)\] Therefore, the joint probability density function of \(y_1,\dots,y_n\) is: \[P(y_1,\dots,y_n;\theta)=\prod_{i=1}^nP(y_i;\theta)=\theta^n\exp(-n\bar y\theta)\] where \(\bar y\) is the arithmetic mean of the \(y_i\). The above is called the likelihood function and is commonly expressed as:
\[L(\theta;y)=\prod_{i=1}^nP(y_i;\theta)=\theta^n\exp(-n\bar y\theta)\] For this example, maximum likelihood estimation sets out to find the value of \(\theta\) that maximises this joint probability function. The estimate of \(\theta\) that maximises the likelihood function is known as the maximum likelihood estimate (MLE), denoted \(\hat\theta\). It can be shown that \(\hat\theta=1/\bar y\); how it is calculated will be shown later. When calculating the maximum likelihood estimate it is common, and often easier, to work with the log of the likelihood function, which for this example is:
\[\ell(\theta;y)=\log L(\theta;y)=n(\log\theta-\bar y\theta)\] To put all of this into practice, Catherine Kleier from Regis University and I collected count data for alpine plants growing with Raoulia albosericea (cushion plants) in one-square-metre plots on Mount Ruapehu. We were interested in the facilitative effects of cushion plants on other plant species and therefore wanted to know whether the size of the cushion plant influenced the number of other plants growing in each plot. We defined our response variable as:
\(y=\) number of non-cushion plants.
The unknown parameter is the expected number of non-cushion plants, which will be written as \(\mu\) because \(E[y]=\mu\); it is estimated by the sample mean \(\bar y=\frac{1}{n}\sum_{i=1}^ny_i\). Ultimately, we are interested in the relationship between \(\mu\) and cushion plant size; for the moment, however, we will treat all observations as equivalent.
Given that we wish to model counts, we will use the Poisson distribution which has the probability function:
\[P(y;\mu)=\frac{\exp(-\mu)\mu^y}{y!}\]
The first few rows of the cushion-plant data are shown in the table below. To see maximum likelihood in action, it is easiest to start with a simpler data set: the Quilpie rainfall data (from the GLMsData package), in which the response \(y\) is a binary indicator of whether rainfall exceeded 10 mm, so each observation follows a Bernoulli distribution with probability function \(P(y;\mu)=\mu^y(1-\mu)^{1-y}\). Evaluating the log-likelihood in R at a few candidate values of \(\mu\) shows that it peaks around 0.5, so the MLE \(\hat\mu\) is close to 0.5.
| plot | site | cushion_size | plant_count |
|---|---|---|---|
| 1 | High | 1112 | 28 |
| 2 | High | 2500 | 28 |
| 3 | High | 1800 | 28 |
| 4 | High | 1069 | 30 |
| 5 | High | 465 | 15 |
library(GLMsData)  # the package that contains the quilpie data set
data(quilpie); names(quilpie)
## [1] "Year" "Rain" "SOI" "Phase" "Exceed" "y"
mu <- c(0.2, 0.4, 0.5, 0.6, 0.8) # Candidate values to test
ll <- rep(0, 5) # A place-holder for the log-likelihood values
for (i in 1:5)  # Bernoulli log-likelihood at each candidate value of mu
  ll[i] <- sum( dbinom(quilpie$y, size=1, prob=mu[i], log=TRUE))
data.frame(Mu=mu, LogLikelihood=ll)
## Mu LogLikelihood
## 1 0.2 -63.69406
## 2 0.4 -48.92742
## 3 0.5 -47.13401
## 4 0.6 -48.11649
## 5 0.8 -60.92148
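The same idea applies to the cushion-plant counts, with the Poisson probability function given above taking the place of the Bernoulli. A minimal sketch, assuming the counts sit in a data frame called cushion with the column plant_count from the table above (the data-frame name is hypothetical):

mu.pois <- c(20, 25, 28, 30, 35)   # candidate values for the mean count
ll.pois <- sapply(mu.pois, function(m) sum( dpois(cushion$plant_count, lambda=m, log=TRUE) ))
data.frame(Mu=mu.pois, LogLikelihood=ll.pois)  # the log-likelihood peaks near the sample mean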
A practical way to find the maximum likelihood estimate is to use what is known as the score equation. The score equation is the derivative of the log-likelihood function set equal to zero; in other words, it specifies where the log-likelihood function has a gradient of zero. Solving the score equation for the parameter of interest can, in theory, give an incorrect result for the maximum likelihood estimate when the log-likelihood function is not unimodal. However, where the log-likelihood function is unimodal and continuously differentiable, it will yield the maximum likelihood estimate of that parameter.
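As a quick illustration, return to the exponential example from earlier. Differentiating the log-likelihood \(\ell(\theta;y)=n(\log\theta-\bar y\theta)\) gives the score function, and setting it to zero yields the MLE quoted above: \[U(\theta)=\frac{d\ell(\theta;y)}{d\theta}=n\left(\frac{1}{\theta}-\bar y\right)=0\quad\Rightarrow\quad\hat\theta=\frac{1}{\bar y}\]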
Continuing the Quilpie data example:
For the Bernoulli distribution, the log-likelihood function is: \[\ell(\mu;y)=\sum_{i=1}^n\left[y_i\log\mu+(1-y_i)\log(1-\mu)\right]\] Differentiating with respect to \(\mu\) gives the score function \[U(\mu)=\sum_{i=1}^n\left[\frac{y_i}{\mu}-\frac{1-y_i}{1-\mu}\right]=\frac{n(\bar y-\mu)}{\mu(1-\mu)}\] and therefore the score equation is: \[\frac{n(\bar y-\hat\mu)}{\hat\mu(1-\hat\mu)}=0\] Solving for \(\hat\mu\), we find that \(\hat\mu=\bar y\), the sample mean.
Therefore \(\hat\mu\) for our Quilpie data can be solved by taking the sample mean:
muhat <- mean(quilpie$y); muhat
## [1] 0.5147059
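As a quick cross-check (a minimal sketch, assuming quilpie has been loaded as above; loglik() is a helper defined here, not part of the data set), the same value is obtained by maximising the Bernoulli log-likelihood numerically:

loglik <- function(mu) sum( dbinom(quilpie$y, size=1, prob=mu, log=TRUE) )
optimize(loglik, interval=c(0.01, 0.99), maximum=TRUE)$maximum  # should be close to muhat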
The first derivative of the log-likelihood function enabled us to determine the maximum likelihood estimate; now we will use the second derivative to measure how well-defined that estimate is. This quantity is known as the observed information and is denoted: \[J(\mu)=-\frac{d^2\ell(\mu;y)}{d\mu^2}\] \(J(\mu)\) is a function of the observed data. If \(J(\mu)\) is large then \(\hat\mu\) is a precise estimate of \(\mu\); if it is small then \(\hat\mu\) is not a very well-defined estimate of \(\mu\).
Expected information, also called the Fisher information, is defined as: \[I(\mu)=E[J(\mu)]\] \(I(\mu)\) is a property of the model: it measures the average information about the parameter contained in data generated from the model at that parameter value. A key feature of \(I(\mu)\) is that it equals the variance of the score function.
Applying this to the Quilpie data:
\[\frac{d^2\ell(\mu;y)}{d\mu^2}=n\,\frac{-\mu(1-\mu)-(\bar y-\mu)(1-2\mu)}{\mu^2(1-\mu)^2}\] Therefore: \[J(\mu)=n\,\frac{\mu(1-\mu)+(\bar y-\mu)(1-2\mu)}{\mu^2(1-\mu)^2}\] Evaluating at \(\mu=\hat\mu=\bar y\): \[J(\hat\mu)=\frac{n}{\hat\mu(1-\hat\mu)}\] Because \(E[\bar y]=\mu\), the term involving \((\bar y-\mu)\) has expectation zero, so: \[I(\mu)=E[J(\mu)]=\frac{n}{\mu(1-\mu)}\] And so we see that \(J(\hat\mu)=I(\hat\mu)\): the observed and expected information coincide when evaluated at \(\mu=\hat\mu\).
Evaluating the Fisher information for the Quilpie data in R:
n <- length( quilpie$y )
Info <- n / (muhat *(1-muhat))
c(muhat=muhat, FisherInfo=Info)
## muhat FisherInfo
## 0.5147059 272.2354978
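As a hedged numerical check (reusing the loglik() helper defined earlier, with quilpie and muhat as above), the observed information can also be approximated by a central finite difference of the log-likelihood; it should come out close to the analytic value of about 272.2:

h <- 1e-4
J.numeric <- -( loglik(muhat + h) - 2*loglik(muhat) + loglik(muhat - h) ) / h^2
J.numeric  # approximately equal to Info computed above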
It can be shown that
\[var[\hat\mu]\approx1/I(\mu)\] Therefore
\[SE[\hat\mu]\approx1/\sqrt{I(\mu)}\] Hence, the Fisher information is a measure of how well-defined the MLE is.
Calculating the estimated standard error for the Quilpie data we get:
1/sqrt(Info)
## [1] 0.06060767
We now extend these ideas from a single parameter to regression models. Assume the linear predictor:
\[\eta_i=\beta_0+\beta_1x_{i1}+...+\beta_px_{ip}\] The mean \(\mu_i\) is dependent on the linear predictor such that \(g(\mu_i)=\eta_i\) where \(g()\) is a known link function.
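The link function used later for the Quilpie data is the logit, which maps a probability in (0, 1) onto the whole real line. A minimal sketch of the link and its inverse, using R’s built-in binomial() family object:

logit <- binomial()$linkfun   # g(mu) = log( mu / (1 - mu) )
expit <- binomial()$linkinv   # inverse link: exp(eta) / (1 + exp(eta))
logit(0.5)   # eta = 0 when mu = 0.5
expit(0)     # mu = 0.5 when eta = 0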
The log-likelihood function for regression models is:
\[\ell(\beta_0,\beta_1,\dots,\beta_p;y)=\sum_{i=1}^n\log P(y_i;\mu_i,\phi)\] where \(\phi\) is the dispersion parameter that specifies the variance of \(y_i\).
The score functions can be denoted as:
\[U(\beta_j)=\frac{\partial\ell}{\partial\beta_j}=\sum_{i=1}^n\frac{\partial\log P(y_i;\mu_i,\phi)}{\partial\mu_i}\frac{\partial\mu_i}{\partial\beta_j}\] with each unknown regression parameter \(\beta_j\) having its own score equation \(U(\hat\beta_j)=0\).
Going back to the Quilpie data, we will look at the relationship between SOI and the probability of rainfall exceeding 10mm (\(\mu\)). Given that \(\mu\) is restricted to values between zero and one, a suitable systematic component is:
\[\log\frac{\mu}{1-\mu}=\eta=\beta_0+\beta_1x\]
This systematic component has two parameters that need to be estimated, \(\beta_0\) and \(\beta_1\), therefore we need two score equations which will take the forms:
\[U(\hat\beta_0)=\sum_{i=1}^n(y_i-\hat\mu_i)=0\] \[U(\hat\beta_1)=\sum_{i=1}^n(y_i-\hat\mu_i)x_i=0\] where \(\log[\hat\mu_i/(1-\hat\mu_i)]=\hat\beta_0+\hat\beta_1x_i\)
Computing the entries of the information matrix for the Quilpie model (with the canonical logit link the observed and expected information coincide, just as we saw for \(J(\hat\mu)\) and \(I(\hat\mu)\) above), we get:
\[I_{00}(\beta)=-\frac{\partial U(\beta_0)}{\partial\beta_0}=\sum_{i=1}^n\mu_i(1-\mu_i)\] \[I_{11}(\beta)=-\frac{\partial U(\beta_1)}{\partial\beta_1}=\sum_{i=1}^n\mu_i(1-\mu_i)x_i^2\] \[I_{01}(\beta)=I_{10}(\beta)=-\frac{\partial U(\beta_1)}{\partial\beta_0}=\sum_{i=1}^n\mu_i(1-\mu_i)x_i\]
\[var[\hat\beta_j]\approx\left[I(\beta)^{-1}\right]_{jj}\] \[SE[\hat\beta_j]\approx\left[I(\beta)^{-1}\right]_{jj}^{1/2}\] where \(I(\beta)^{-1}\) is the inverse of the information matrix; the variance of each estimate is the corresponding diagonal element of this inverse.
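These are the quantities that R’s glm() works with when fitting the model. A minimal sketch for the Quilpie model above (assuming quilpie has been loaded as before): the coefficient standard errors reported by summary() are the square roots of the diagonal of the inverse information matrix, which we can verify by building the matrix from the formulas above using the fitted probabilities.

fit <- glm(y ~ SOI, data=quilpie, family=binomial)  # logit link is the default
coef(fit)                        # maximum likelihood estimates of beta0 and beta1

mu.hat <- fitted(fit)            # fitted probabilities mu_i
x <- quilpie$SOI
I00 <- sum( mu.hat*(1-mu.hat) )
I11 <- sum( mu.hat*(1-mu.hat)*x^2 )
I01 <- sum( mu.hat*(1-mu.hat)*x )
InfoMat <- matrix( c(I00, I01, I01, I11), nrow=2 )

sqrt( diag( solve(InfoMat) ) )   # standard errors built from the information matrix
sqrt( diag( vcov(fit) ) )        # should match, up to small numerical differences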