March 19, 2019

Generalised linear models

Week 4-5

Generalised linear models (GLMs) have three basic components:

  1. a distribution for the response variable
  2. a predictor function that wraps up all the regression parameters and covariates
  3. a link function relating the expected value of the response to the predictor function.

A GLM models the expected value of the response on an unbounded transformed scale rather than on its original scale (e.g. bounded between 0 and 1 for binomial data) and uses the transformed scale to test the effect of the predictor function. The two scales are coupled by a link function that allows back-transformation to the original scale while preserving the distributional requirements. For example, with a binomial response variable (e.g. dead/alive), the link function ensures that the model predictions and their confidence intervals lie between 0 and 1 when we back-transform them onto the original scale of the response variable.
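
As a minimal sketch of this coupling, the logit link and its inverse can be taken directly from the binomial family object in base R. The probabilities and the object names (fam, p, eta, p_back, extreme) are illustrative choices, not part of the course material.

```r
# Logit link for a binomial response: maps probabilities in (0, 1)
# onto the whole real line, and back again.
fam <- binomial(link = "logit")

p      <- c(0.01, 0.25, 0.50, 0.75, 0.99)  # illustrative probabilities
eta    <- fam$linkfun(p)                   # link scale: log(p / (1 - p))
p_back <- fam$linkinv(eta)                 # back-transformed to the original scale

# Any value on the link scale, however extreme, maps back into (0, 1)
extreme <- fam$linkinv(c(-10, 0, 10))

print(round(eta, 3))
print(p_back)
print(extreme)
```

Because the back-transformation is applied to the predictions and to both ends of a confidence interval computed on the link scale, neither can leave the 0 to 1 range.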

Generalised linear model specification

The glm function is very similar to the familiar lm function but has a family argument to specify the distribution of the response variable and the associated model errors. Along with the distribution, we specify a link function, which ensures that the distributional requirements are satisfied during back-transformation of the model predictions onto the original scale.


## Binomial GLM (logistic regression) for binary data
glm(y ~ x, data = ..., family = binomial(link = "logit"))

## Poisson GLM for count data
glm(y ~ x, data = ..., family = poisson(link = "log"))

GLMs for count data

Counts have a lower bound (zero) and often approximately follow a Poisson distribution. The Poisson distribution has only one parameter, the mean (also called \(\lambda\)), and assumes that the variance equals the mean.
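
The mean-variance relationship can be checked with a quick simulation. This is only a sketch: the value of lambda, the sample size and the seed are arbitrary choices made for illustration.

```r
# Simulated Poisson counts: the sample mean and the sample variance
# should both be close to lambda.
set.seed(1)
lambda <- 4
y <- rpois(10000, lambda = lambda)

print(c(mean = mean(y), variance = var(y)))
print(min(y))  # counts are never negative
```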

Overdispersion in Poisson GLMs

The underlying assumption that the variance equals the mean rarely holds true for real-world data. Commonly, the variance is greater than the mean, and this phenomenon is called overdispersion. If your data are overdispersed, model inference is biased: the standard errors around the parameter estimates are too small, which inflates the test statistics (z-values) and thus gives spuriously low P-values.

One needs to distinguish between apparent and real overdispersion. Apparent overdispersion can result from:

  • too many zeros
  • a wrong link function
  • outlier(s)
  • nonlinear patterns
  • missing covariates or interactions
  • transformation of covariates
  • ignoring the dependency structure

Real overdispersion is inherent to your data: the variance may simply be naturally greater than the mean, or it may be caused by many zeros, clustered observations or unaccounted-for correlations among observations.

Detecting overdispersion

A ratio of the residual deviance over the residual degrees of freedom that is much higher than 1 indicates overdispersion.
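
A minimal sketch of this check on simulated data follows. The helper dispersion_ratio, the simulated covariate and the two response vectors are hypothetical names introduced here; the overdispersed counts are drawn from a negative binomial distribution purely for illustration.

```r
# Compare the deviance / df ratio for Poisson and overdispersed counts.
set.seed(42)
n  <- 500
x  <- runif(n)
mu <- exp(0.5 + 1.2 * x)

y_pois <- rpois(n, lambda = mu)          # variance equals the mean
y_over <- rnbinom(n, mu = mu, size = 1)  # variance = mu + mu^2

# Hypothetical helper: residual deviance over residual degrees of freedom
dispersion_ratio <- function(fit) deviance(fit) / df.residual(fit)

fit_pois <- glm(y_pois ~ x, family = poisson(link = "log"))
fit_over <- glm(y_over ~ x, family = poisson(link = "log"))

ratio_pois <- dispersion_ratio(fit_pois)  # expected to be close to 1
ratio_over <- dispersion_ratio(fit_over)  # expected to be well above 1

print(c(poisson = ratio_pois, overdispersed = ratio_over))
```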

Dealing with overdispersion

We can run a quasi-likelihood model that adjusts the standard errors (and hence the t- and P-values) for the amount of overdispersion present. The dispersion parameter is reported in the summary model output. However, this option does not allow the use of likelihood-based model comparison tools such as the AIC or likelihood ratio tests.
A better alternative is to use a negative binomial distribution, whose variance reduces to \(\lambda\) if no overdispersion is present. Negative binomial GLMs allow likelihood-based model comparison procedures.
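
The variance assumptions behind the two options can be written out as follows. The symbols \(\phi\) (dispersion parameter) and \(\theta\) (negative binomial size parameter) are introduced here for illustration; \(\theta\) follows the parameterisation used by glm.nb in MASS.

\[
\text{quasi-Poisson:}\quad \operatorname{Var}(Y) = \phi\,\lambda
\qquad\qquad
\text{negative binomial:}\quad \operatorname{Var}(Y) = \lambda + \frac{\lambda^{2}}{\theta}
\]

With \(\phi = 1\), or as \(\theta \to \infty\), both reduce to the Poisson variance \(\lambda\).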


## Fit a quasi-likelihood GLM
glm(y ~ x, data = ..., family = quasipoisson(link = "log"))

library(MASS)
## Fit a negative binomial GLM
glm.nb(y ~ x, data = ...)
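
As a sketch of what these adjustments do in practice, the three models can be compared on simulated overdispersed counts. The data frame d, the helper se and the simulation settings are assumptions made for illustration only.

```r
library(MASS)  # provides glm.nb

set.seed(7)
n <- 400
d <- data.frame(x = runif(n))
d$y <- rnbinom(n, mu = exp(0.5 + 1.0 * d$x), size = 1.5)  # overdispersed counts

fit_pois  <- glm(y ~ x, data = d, family = poisson(link = "log"))
fit_quasi <- glm(y ~ x, data = d, family = quasipoisson(link = "log"))
fit_nb    <- glm.nb(y ~ x, data = d)

# Hypothetical helper: extract the standard errors from a model summary
se <- function(fit) coef(summary(fit))[, "Std. Error"]

# Estimated dispersion parameter of the quasi-Poisson model
phi <- summary(fit_quasi)$dispersion

# Poisson and quasi-Poisson share their point estimates; the quasi-Poisson
# standard errors are the Poisson ones multiplied by sqrt(phi).
se_table <- cbind(poisson = se(fit_pois),
                  quasipoisson = se(fit_quasi),
                  negbin = se(fit_nb))

print(round(se_table, 3))
print(phi)
print(AIC(fit_pois, fit_nb))  # likelihood-based comparison is available
print(AIC(fit_quasi))         # NA: no likelihood for a quasi family
```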

Binomial data - Logistic regression

Modelling binomial (binary) data is commonly referred to as logistic regression. We are dealing with binary or binomial data when the response variable can take on only two levels, such as dead/alive, absent/present, male/female, or parasitised/unparasitised. The binary nature of the data matches the binomial distribution, which describes the number of successes in a sequence of independent success/failure experiments.

If we examine, say, 100 toads in a habitat for the occurrence of parasites, then this can be viewed as 100 success/failure experiments ('parasitised' may be regarded as success or failure depending on the context).

The glm function accepts the binomial response variable in one of three forms:

  • as a two-level factor such as yes/no or alive/dead, which is treated as 0/1 data (the first factor level counts as failure)
  • as a proportion, in which case the number of trials (n) must be provided as a vector of weights using the 'weights' argument
  • as a two-column matrix holding the number of successes in the first and the number of failures in the second column
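
The three forms can be sketched side by side on the same simulated data. The data frames groups and long, the dose covariate and all counts are hypothetical and only serve to show that the three specifications lead to the same coefficient estimates.

```r
set.seed(3)

# Hypothetical grouped data: 8 dose levels with 25 individuals each
groups <- data.frame(dose = 1:8, n = rep(25, 8))
groups$dead  <- rbinom(8, size = groups$n, prob = plogis(-2 + 0.5 * groups$dose))
groups$alive <- groups$n - groups$dead

# The same data with one row per individual and a two-level factor;
# the first level ("alive") is treated as failure, "dead" as success.
long <- data.frame(
  dose   = rep(rep(groups$dose, 2), times = c(groups$dead, groups$alive)),
  status = factor(rep(c("dead", "alive"),
                      times = c(sum(groups$dead), sum(groups$alive))),
                  levels = c("alive", "dead"))
)

# (1) two-level factor
fit_factor <- glm(status ~ dose, data = long, family = binomial(link = "logit"))

# (2) proportion plus number of trials as weights
fit_prop <- glm(dead / n ~ dose, data = groups, weights = n,
                family = binomial(link = "logit"))

# (3) two-column matrix: successes first, failures second
fit_matrix <- glm(cbind(dead, alive) ~ dose, data = groups,
                  family = binomial(link = "logit"))

print(rbind(factor = coef(fit_factor),
            proportion = coef(fit_prop),
            matrix = coef(fit_matrix)))
```

The residual deviance and degrees of freedom differ between the individual-level and the grouped fits, so the choice of form still matters when checking model fit.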

With a categorical predictor variable, we simply obtain a bar plot showing the proportions, whereas a continuous predictor variable often leads to a sigmoidal (S-shaped) curve of fitted probabilities.
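
A fitted curve of this kind can be sketched as follows for a continuous predictor. The simulated data, the coefficients and the object names (d, grid, p_hat) are assumptions made for illustration.

```r
set.seed(21)
n <- 200
d <- data.frame(x = runif(n, -3, 3))
d$y <- rbinom(n, size = 1, prob = plogis(0.5 + 1.5 * d$x))  # simulated 0/1 response

fit <- glm(y ~ x, data = d, family = binomial(link = "logit"))

# Predicted probabilities on a regular grid, back-transformed to the 0-1 scale
grid <- data.frame(x = seq(-3, 3, length.out = 100))
grid$p_hat <- predict(fit, newdata = grid, type = "response")

# Raw 0/1 observations with the fitted probability curve on top
plot(d$x, d$y, xlab = "x", ylab = "Probability of success")
lines(grid$x, grid$p_hat)
```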

GLMs for continuous data

GLMs can also be used to model normally distributed data, but more often GLMs are used to analyse continuous data that follow other distributions such as the gamma, inverse Gaussian or exponential distribution, which are all associated with positive continuous variables. Gamma and inverse Gaussian distributed data are both right-skewed (long tail to the right).

Gamma distributed data

We can use quantile-quantile plots to determine whether our data follow a gamma or inverse Gaussian rather than a normal distribution.
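
One way to draw such plots in base R is sketched below for the gamma case. The data are simulated, the gamma parameters are estimated with a simple method-of-moments step (an assumption of this sketch, not necessarily the approach used in the course), and the correlation between sample and theoretical quantiles is added as a rough numeric summary.

```r
set.seed(11)
y <- rgamma(300, shape = 2, rate = 0.5)  # simulated right-skewed data

# Method-of-moments estimates of the gamma parameters
shape_hat <- mean(y)^2 / var(y)
rate_hat  <- mean(y) / var(y)

probs <- ppoints(length(y))  # plotting positions for the theoretical quantiles

op <- par(mfrow = c(1, 2))
qqnorm(y, main = "Normal Q-Q plot")
qqline(y)
qqplot(qgamma(probs, shape = shape_hat, rate = rate_hat), y,
       xlab = "Theoretical gamma quantiles", ylab = "Sample quantiles",
       main = "Gamma Q-Q plot")
abline(0, 1)  # points close to this line support the gamma distribution
par(op)

# Rough numeric summary: correlation of sample and theoretical quantiles
r_norm  <- cor(sort(y), qnorm(probs, mean = mean(y), sd = sd(y)))
r_gamma <- cor(sort(y), qgamma(probs, shape = shape_hat, rate = rate_hat))
print(c(normal = r_norm, gamma = r_gamma))
```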

Gamma GLM

Environmental monitoring data often follow a gamma (or inverse Gaussian) distribution.
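
A gamma GLM is fitted with the same glm call as before, using the Gamma family. The sketch below uses simulated data and a log link; the covariate, the coefficients and the choice of link are assumptions made for illustration.

```r
set.seed(5)
n <- 300
d <- data.frame(x = runif(n, 0, 2))
mu    <- exp(1 + 0.8 * d$x)  # true mean on the original scale
shape <- 3
d$y <- rgamma(n, shape = shape, rate = shape / mu)  # positive, right-skewed, mean = mu

## Gamma GLM with a log link
fit_gamma <- glm(y ~ x, data = d, family = Gamma(link = "log"))

print(coef(fit_gamma))
print(summary(fit_gamma)$dispersion)  # roughly 1 / shape for gamma data

# Predictions on the original scale are always positive
pred <- predict(fit_gamma, newdata = data.frame(x = c(0, 1, 2)), type = "response")
print(pred)
```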

Beta Regression

Analysing percentage (proportion) data lacking the underlying counts

Proportion data can be modelled using a binomial GLM if the underlying counts are available. Percentage or proportion data without underlying counts, such as percentage land cover or percentage body fat, cannot be modelled with binomial GLMs. However, these data are bounded by zero and 100% (or zero and one), which invalidates the use of an ordinary linear model unless a logit transformation is applied to the response variable. The modern approach is to use a beta regression model to analyse this type of data.
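
Both options can be sketched on simulated proportions. The logit-transformed linear model uses base R only; the beta regression step assumes that the optional betareg package is installed and is skipped otherwise. All data and object names are hypothetical.

```r
set.seed(9)
n <- 200
d <- data.frame(x = runif(n))
mu  <- plogis(-1 + 2 * d$x)  # true mean proportion
phi <- 20                    # precision of the simulated beta distribution
d$y <- rbeta(n, shape1 = mu * phi, shape2 = (1 - mu) * phi)  # proportions in (0, 1)

## Option 1: ordinary linear model on the logit-transformed response
fit_lm  <- lm(qlogis(y) ~ x, data = d)
new     <- data.frame(x = c(0, 0.5, 1))
pred_lm <- plogis(predict(fit_lm, newdata = new))  # back-transformed into (0, 1)
print(pred_lm)

## Option 2: beta regression (only runs if 'betareg' is available)
if (requireNamespace("betareg", quietly = TRUE)) {
  fit_beta <- betareg::betareg(y ~ x, data = d)
  print(coef(fit_beta))
}
```

Note that the logit transformation is undefined for proportions of exactly zero or one, so such values would need separate handling before either approach is applied.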