10/19/2018

Overdispersion in Poisson GLMs

The Poisson distribution has only one parameter, the mean (also called or μ), and assumes constant variance that equals the mean, which, however, rarely holds true for real-world data. Commonly, the variance is greater than the mean and this phenomenon is called overdispersion. If your data is overdispersed, model inference is biased. Overdispersion commonly results in smaller standard errors around the parameter estimates, which produces larger t-values and thus gives spuriously low P-values. One needs to distinguish between apparent and real overdisperion. Apparent overdisperion can result from:

  • too many zeros
  • a wrong link function
  • outlier(s)

Overdispersion in Poisson GLMs

  • nonlinear patterns
  • missing covariates or interactions
  • transformation of covariates
  • ignoring the dependency structure

Real overdispersion is inherent to your data and can simply occur because the variation is naturally greater than the mean or it may be caused by many zeros, clustered observations or unaccounted for correlations among observations.

In case of underdispersion use weighted Poisson GLMs, because underdispersion that is unaccounted for may yield statistically insignificant parameter estimates whereas in fact the parameters are significant.

Overdispersion in Poisson GLMs

If you are unable to pinpoint the source of the overdispersion and to resolve it, then you need to revert to quasi-likelihood estimation, or as a more powerful alternative, use the negative binomial distribution.

One can apply quasi-likelihood estimation by choosing the ‘quasipoisson’ distribution in the glm model specification. However, with this option only the standard errors of the parameter estimates are corrected NOT the actual parameters, meaning that those are still biased! The negative binomial distribution has a correction for overdispersion built-in. The distribution function of the negative binomial distribution has two parameters, the mean \(\mu\) and the dispersion parameter \(k\). The variance for the negative binomial distribution is given by:

\(var(Y) = \mu + \frac{\mu^{2}}{k}\)

Overdispersion in Poisson GLMs

Overdispersion occurs if the variance is greater than the mean. The second term in the above variance statement determines the amount of overdispersion. Overdispersion is indirectly determined by \(k\). If \(K\) is large compared to \(\mu^{2}\), then the term \(\mu^{2}/k\) approaches 0 and the variance boils down to \(\mu\), i.e., in these cases the negative binomial converges to the Poisson distribution.

The negative binomial distribution is realised in the glm.nb function in the R package MASS, which comes with the base installation (but needs to be loaded first). The default link is a logarithmic link (log) but the square root (sqrt) and identity links are also available.

Overdispersion in Poisson GLMs

We will explore the phenomenon of overdispersion with the crabs data set (package glm2).

The data comprises four variables for each of 173 female horseshoe crabs. The reponse variable is a count of the ‘Satellites’, the number of male partners in addition to the female’s primary partner. The ‘width’ of the female (cm), ‘Dark’ a binary variable indicating whether the female is dark coloured, and ‘GoodSpine’ also a binary variable referring to the female’s spine condition.

The crabs data set

library(glm2)
library(ggplot2)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:glm2':
## 
##     crabs
data(crabs)
str(crabs)
## 'data.frame':    200 obs. of  8 variables:
##  $ sp   : Factor w/ 2 levels "B","O": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex  : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ index: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ FL   : num  8.1 8.8 9.2 9.6 9.8 10.8 11.1 11.6 11.8 11.8 ...
##  $ RW   : num  6.7 7.7 7.8 7.9 8 9 9.9 9.1 9.6 10.5 ...
##  $ CL   : num  16.1 18.1 19 20.1 20.3 23 23.8 24.5 24.2 25.2 ...
##  $ CW   : num  19 20.8 22.4 23.1 23 26.5 27.1 28.4 27.8 29.3 ...
##  $ BD   : num  7 7.4 7.7 8.2 8.2 9.8 9.8 10.4 9.7 10.3 ...

Plotting the data

library(glm2)
data(crabs)
## Data exploration
ggplot(crabs, aes(x = Width, y = Satellites)) + geom_point() + 
  facet_wrap(facets = ~ Dark + GoodSpine)

Overdispersion in Poisson GLMs

The data show a considerable amount of variation and a weak increasing trend in satellites with increasing carapax width. Let’s see what the model can figure out.

## Poisson regression model
m1 <- glm(Satellites ~ Width * Dark * GoodSpine, data = crabs, family = poisson)
summary(m1)

Overdispersion in Poisson GLMs

It looks as if the width \(\times\) dark \(\times\) spine interaction is significant. However, a closer examination of the summary output indicates overdispersion: the residual deviance is much higher than the residual degrees of freedom.

We can test this more formally by running a quasi-likelihood model thatadjusts the standard errors (and hence the z- and P-values) for the amount of overdispersion present. The dispersion parameter is reported in the summary model output. A dispersion parameter substantially larger than 1 is indicative of overdisperion.

Quasi-likelihood Poisson GLMs

m2 <- glm(Satellites ~ Width * Dark * GoodSpine, data = crabs, family = quasipoisson)
summary(m2)

The dispersion parameter

How is overdispersion estimated? Overdispersion is commonly signified by the Greek letter \(\phi\). If overdisperion is present, then the residual deviance (\(D\)) over \(\phi\) (\(D\)/\(\phi\)) is Chi-square distributed with n - p degrees of freedom (p being the number of model paramters), which allows us to come up with the following estimator for \(\phi\)):

\(\widehat{\phi} = \frac{D}{n - p}\)

The dispersion parameter

This is the dispersion parameter, that the ‘quasipoisson’ model returns in the summary output.

Negative binomial GLM

Let’s now fit a negative binomial GLM. Recall that the negative binomial distribution has a built-in term to deal with overdispersion.

## Negative binomial regression model
m3 <- glm.nb(Satellites ~ Width * Dark * GoodSpine, data = crabs)
summary(m3)

AIC-based model comparison

Is the negative binomial model an improvement over the Poisson model?

AIC(m1, m3)
#    df      AIC
# m1  8 920.7867
# m3  9 763.5984