The Poisson regression model naturally arises when we want to model the average number of occurrences per unit of time or space. For example, the incidence of rare cancer, the number of car crossing at the crossroad, or the number of earthquakes.
One feature of the Poisson distribution is that the mean equals the variance. However, over- or underdispersion happens in Poisson models, where the variance is larger or smaller than the mean value, respectively. In reality, overdispersion happens more frequently with a limited amount of data.
The overdispersion issue affects the interpretation of the model. It is necessary to address the problem in order to avoid the wrong estimation of the coefficients.
In this post, I am going to discuss some basic methods to adjust for the overdispersion phenomenon in the Poisson regression model. The implementation will be shown in R codes. I hope this article is helpful.
Suppose we want to model count responses Yi using a vector of predictors xi. We know that the response variable Yi follows a Poisson distribution with parameter μi.
Yi follows the Poisson distribution
where the probability function is
The Poisson PMF
The log link function is used to link the linear combination of the predictors, Xi with the Poisson parameter μi.
The Poisson regression model.
In this lab, we will explore and visualize the data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro labs, openintro.
Let’s load the packages.
The tidyverse “umbrella” package which houses a suite of many different R packages: for data wrangling and data visualization.
The faraway R
package: Functions
and Datasets for Books by Julian Faraway
The AER R
package: Applied
Econometrics with R.
The MASS R
package: Support
Functions and Datasets for Venables and Ripley’s MASS.
library(faraway)
library(tidyverse)
library(AER)
library(MASS)
Let’s build a simple model with the example introduced in Faraway’s book.
data(gala)
gala = gala[,-2]
pois_mod = glm(Species ~ .,family=poisson,gala)
summary(pois_mod)
##
## Call:
## glm(formula = Species ~ ., family = poisson, data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2752 -4.4966 -0.9443 1.9168 10.1849
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Don’t be fooled by the super significant coefficients. Those beautiful p-values are exactly the consequences of the overdispersion issue. We can check the overdispersion either visually or quantitatively.
Let’s first plot out the estimated variance against the mean.
plot(log(fitted(pois_mod)),log((gala$Species-fitted(pois_mod))^2),xlab=expression(hat(mu)),ylab=expression((y-hat(mu))^2),pch=20,col="blue")
abline(0,1) ## 'varianc = mean' line
We can see that the majority of the variance is larger than the mean, which is a warning of overdispersion.
Quantitatively, the dispersion parameter φ can be estimated using Pearson’s Chi-squared statistic and the degree of freedom.
Quantitatively, the dispersion parameter φ can be estimated using Pearson’s Chi-squared statistic and the degree of freedom.
Estimation of the dispersion parameter
When φ is larger than 1, it is overdispersion. To manually calculate the parameter, we use the code below.
dp = sum(residuals(pois_mod,type ="pearson")^2)/pois_mod$df.residual
dp
## [1] 31.74914
which gives us 31.74914 and confirms this simple Poisson model has the overdispersion problem.
Alternatively, we can apply a significance test directly on the fitted model to check the overdispersion.
dispersiontest(pois_mod)
##
## Overdispersion test
##
## data: pois_mod
## z = 3.3759, p-value = 0.0003678
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 25.39503
This overdispersion test reports the significance of the overdispersion issue within the model.
We can check how much the coefficient estimations are affected by overdispersion.
summary(pois_mod,dispersion = dp)
##
## Call:
## glm(formula = Species ~ ., family = poisson, data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2752 -4.4966 -0.9443 1.9168 10.1849
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.1548079 0.2915897 10.819 < 2e-16 ***
## Area -0.0005799 0.0001480 -3.918 8.95e-05 ***
## Elevation 0.0035406 0.0004925 7.189 6.53e-13 ***
## Nearest 0.0088256 0.0102621 0.860 0.390
## Scruz -0.0057094 0.0035251 -1.620 0.105
## Adjacent -0.0006630 0.0001653 -4.012 6.01e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 31.74914)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Now around half of the predictors become insignificant, which changes the entire interpretation of the model.
Alright, let’s address the problem in the following two ways.
A simple way to adjust the overdispersion is as straightforward as to estimate the dispersion parameter within the model. This could be done via the quasi-families in R.
qpoi_mod = glm(Species ~ .,family=quasipoisson, gala)
summary(qpoi_mod)
##
## Call:
## glm(formula = Species ~ ., family = quasipoisson, data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2752 -4.4966 -0.9443 1.9168 10.1849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1548079 0.2915901 10.819 1.03e-10 ***
## Area -0.0005799 0.0001480 -3.918 0.000649 ***
## Elevation 0.0035406 0.0004925 7.189 1.98e-07 ***
## Nearest 0.0088256 0.0102622 0.860 0.398292
## Scruz -0.0057094 0.0035251 -1.620 0.118380
## Adjacent -0.0006630 0.0001653 -4.012 0.000511 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 31.74921)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5
We can see that the dispersion parameter is estimated to be 31.74921, which is very close to our manual calculation as aforementioned. This procedure tells us that only three of the predictors’ coefficients are significant.
Another way to address the overdispersion in the model is to change our distributional assumption to the Negative binomial in which the variance is larger than the mean.
Let’s implement the negative binomial model in R.
nb_mod = glm.nb(Species ~ .,data = gala)
summary(nb_mod)
##
## Call:
## glm.nb(formula = Species ~ ., data = gala, init.theta = 1.674602286,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1344 -0.8597 -0.1476 0.4576 1.8416
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.9065247 0.2510344 11.578 < 2e-16 ***
## Area -0.0006336 0.0002865 -2.211 0.027009 *
## Elevation 0.0038551 0.0006916 5.574 2.49e-08 ***
## Nearest 0.0028264 0.0136618 0.207 0.836100
## Scruz -0.0018976 0.0028096 -0.675 0.499426
## Adjacent -0.0007605 0.0002278 -3.338 0.000842 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.6746) family taken to be 1)
##
## Null deviance: 88.431 on 29 degrees of freedom
## Residual deviance: 33.196 on 24 degrees of freedom
## AIC: 304.22
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.675
## Std. Err.: 0.442
##
## 2 x log-likelihood: -290.223
It is a better fit to the data because the ratio of deviance over degrees of freedom is only slightly larger than 1 here.
A. Overdispersion can affect the interpretation of the poisson model.
B. To avoid the overdispersion issue in our model, we can use a quasi-family to estimate the dispersion parameter.
C. We can also use the negative binomial instead of the poisson model.
https://biometry.github.io/APES/LectureNotes/2016-JAGS/Overdispersion/OverdispersionJAGS.pdf
Faraway, Julian J. Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models. CRC press, 2016.