Introduction

The Poisson regression model naturally arises when we want to model the average number of occurrences per unit of time or space. For example, the incidence of rare cancer, the number of car crossing at the crossroad, or the number of earthquakes.

One feature of the Poisson distribution is that the mean equals the variance. However, over- or underdispersion happens in Poisson models, where the variance is larger or smaller than the mean value, respectively. In reality, overdispersion happens more frequently with a limited amount of data.

The overdispersion issue affects the interpretation of the model. It is necessary to address the problem in order to avoid the wrong estimation of the coefficients.

In this post, I am going to discuss some basic methods to adjust for the overdispersion phenomenon in the Poisson regression model. The implementation will be shown in R codes. I hope this article is helpful.

Suppose we want to model count responses Yi using a vector of predictors xi. We know that the response variable Yi follows a Poisson distribution with parameter μi.

Yi follows the Poisson distribution

where the probability function is

The Poisson PMF

The log link function is used to link the linear combination of the predictors, Xi with the Poisson parameter μi.

The Poisson regression model.

Getting started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro labs, openintro.

Let’s load the packages.

The tidyverse “umbrella” package which houses a suite of many different R packages: for data wrangling and data visualization.
The faraway R package: Functions and Datasets for Books by Julian Faraway
The AER R package: Applied Econometrics with R.
The MASS R package: Support Functions and Datasets for Venables and Ripley’s MASS.

library(faraway)
library(tidyverse)
library(AER)
library(MASS)

Model building

Poission vs Quasipoission

Let’s build a simple model with the example introduced in Faraway’s book.

data(gala)
gala = gala[,-2]
pois_mod = glm(Species ~ .,family=poisson,gala)
summary(pois_mod)

## 
## Call:
## glm(formula = Species ~ ., family = poisson, data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.2752  -4.4966  -0.9443   1.9168  10.1849  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
## Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
## Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
## Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
## Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
## Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: 889.68
## 
## Number of Fisher Scoring iterations: 5

Don’t be fooled by the super significant coefficients. Those beautiful p-values are exactly the consequences of the overdispersion issue. We can check the overdispersion either visually or quantitatively.

Let’s first plot out the estimated variance against the mean.

plot(log(fitted(pois_mod)),log((gala$Species-fitted(pois_mod))^2),xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2),pch=20,col="blue")
abline(0,1) ## 'varianc = mean' line

We can see that the majority of the variance is larger than the mean, which is a warning of overdispersion.

Quantitatively, the dispersion parameter φ can be estimated using Pearson’s Chi-squared statistic and the degree of freedom.

Dispersion calculation

Quantitatively, the dispersion parameter φ can be estimated using Pearson’s Chi-squared statistic and the degree of freedom.

Estimation of the dispersion parameter

When φ is larger than 1, it is overdispersion. To manually calculate the parameter, we use the code below.

dp = sum(residuals(pois_mod,type ="pearson")^2)/pois_mod$df.residual
dp

## [1] 31.74914

which gives us 31.74914 and confirms this simple Poisson model has the overdispersion problem.

Alternatively, we can apply a significance test directly on the fitted model to check the overdispersion.

dispersiontest(pois_mod)

## 
##  Overdispersion test
## 
## data:  pois_mod
## z = 3.3759, p-value = 0.0003678
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   25.39503

This overdispersion test reports the significance of the overdispersion issue within the model.

We can check how much the coefficient estimations are affected by overdispersion.

summary(pois_mod,dispersion = dp)

## 
## Call:
## glm(formula = Species ~ ., family = poisson, data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.2752  -4.4966  -0.9443   1.9168  10.1849  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.1548079  0.2915897  10.819  < 2e-16 ***
## Area        -0.0005799  0.0001480  -3.918 8.95e-05 ***
## Elevation    0.0035406  0.0004925   7.189 6.53e-13 ***
## Nearest      0.0088256  0.0102621   0.860    0.390    
## Scruz       -0.0057094  0.0035251  -1.620    0.105    
## Adjacent    -0.0006630  0.0001653  -4.012 6.01e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 31.74914)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: 889.68
## 
## Number of Fisher Scoring iterations: 5

Now around half of the predictors become insignificant, which changes the entire interpretation of the model.

Alright, let’s address the problem in the following two ways.

Allow Dispersion Estimation

A simple way to adjust the overdispersion is as straightforward as to estimate the dispersion parameter within the model. This could be done via the quasi-families in R.

qpoi_mod = glm(Species ~ .,family=quasipoisson, gala)
summary(qpoi_mod)

## 
## Call:
## glm(formula = Species ~ ., family = quasipoisson, data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.2752  -4.4966  -0.9443   1.9168  10.1849  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.1548079  0.2915901  10.819 1.03e-10 ***
## Area        -0.0005799  0.0001480  -3.918 0.000649 ***
## Elevation    0.0035406  0.0004925   7.189 1.98e-07 ***
## Nearest      0.0088256  0.0102622   0.860 0.398292    
## Scruz       -0.0057094  0.0035251  -1.620 0.118380    
## Adjacent    -0.0006630  0.0001653  -4.012 0.000511 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 31.74921)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5

We can see that the dispersion parameter is estimated to be 31.74921, which is very close to our manual calculation as aforementioned. This procedure tells us that only three of the predictors’ coefficients are significant.

Replace Poisson with Negative Binomial

Another way to address the overdispersion in the model is to change our distributional assumption to the Negative binomial in which the variance is larger than the mean.

Let’s implement the negative binomial model in R.

nb_mod = glm.nb(Species ~ .,data = gala)
summary(nb_mod)

## 
## Call:
## glm.nb(formula = Species ~ ., data = gala, init.theta = 1.674602286, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1344  -0.8597  -0.1476   0.4576   1.8416  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.9065247  0.2510344  11.578  < 2e-16 ***
## Area        -0.0006336  0.0002865  -2.211 0.027009 *  
## Elevation    0.0038551  0.0006916   5.574 2.49e-08 ***
## Nearest      0.0028264  0.0136618   0.207 0.836100    
## Scruz       -0.0018976  0.0028096  -0.675 0.499426    
## Adjacent    -0.0007605  0.0002278  -3.338 0.000842 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6746) family taken to be 1)
## 
##     Null deviance: 88.431  on 29  degrees of freedom
## Residual deviance: 33.196  on 24  degrees of freedom
## AIC: 304.22
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.675 
##           Std. Err.:  0.442 
## 
##  2 x log-likelihood:  -290.223

It is a better fit to the data because the ratio of deviance over degrees of freedom is only slightly larger than 1 here.

Conclusions

A. Overdispersion can affect the interpretation of the poisson model.

B. To avoid the overdispersion issue in our model, we can use a quasi-family to estimate the dispersion parameter.

C. We can also use the negative binomial instead of the poisson model.

References

https://biometry.github.io/APES/LectureNotes/2016-JAGS/Overdispersion/OverdispersionJAGS.pdf

Faraway, Julian J. Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models. CRC press, 2016.

GENERALIZED LINEAR MODELS-Adjust for Overdispersion in Poisson Regression

Eliminating the false positive detections of the significant coefficients in the Poisson model and the implementations in R

Nasif Hossain

2022-08-19