This week we started and made good progress on Poisson Regression and interpreting the output produced in R
I will use the data set we used earlier in the week about tortoise’s on the Galapagos Islands.
library(faraway)
## Warning: package 'faraway' was built under R version 3.6.3
data(gala)
attach(gala)
The goal of this data set is to use the variables to predict the number of tortoise species on an island.
My first instinct is to try and use a linear regression to model this relationship but after some thought this linear model instinct is faulty.
A standard linear regression would allow for there to be negative tortoise species on a particular island which is not possible.
This realization brings me to a Poisson Regression which will return values greater than or equal to 0
modPos <- glm(Species ~ Area + Elevation + Nearest, family = poisson)
summary(modPos)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.1900 -6.1715 -2.7125 0.7063 21.4237
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.548e+00 3.933e-02 90.211 < 2e-16 ***
## Area -5.529e-05 1.890e-05 -2.925 0.00344 **
## Elevation 1.588e-03 5.040e-05 31.502 < 2e-16 ***
## Nearest 5.921e-03 1.466e-03 4.039 5.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.7 on 29 degrees of freedom
## Residual deviance: 1797.8 on 26 degrees of freedom
## AIC: 1966.7
##
## Number of Fisher Scoring iterations: 5
It is important to understand the meaning of the coefficients. Take the coefficient for Elevation for example. In this model, an increase of 1 meter in elavation results in an increase of .001588 in the log mean number of species of tortioses holding Area and Nearest constant.
It can also be interpretted as a (e^.001588 - 1) % increase in the mean number of species of tortoises with every additional meter of elevation holding other variables constant again.
Doing 3 Wald Test analyses, we see that all three variables I have selected are significant.
But something we need to check is the residual deviance because the test stat seems large.
pchisq(1797.8, df = 26, lower.tail = FALSE)
## [1] 0
In fact, the test stat is so large that R rounds the p-value to 0. That means my model is really not so good.
A halfnorm plot of the residuals will show outliers but I know the issue lies somewhere else.
A key characteristic of Poisson distribution and regression is that the mean and variance are equal; however, I believe this may not be the case for my model.
To see I need to calculate the dispersion parameter.
disp <- sum(residuals(modPos, type ="pearson")^2 / modPos$df.residual)
Yikes! The variance is approximately 76 times greater than the mean.
summary(modPos, dispersion = disp)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.1900 -6.1715 -2.7125 0.7063 21.4237
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.548e+00 3.437e-01 10.323 < 2e-16 ***
## Area -5.529e-05 1.652e-04 -0.335 0.737819
## Elevation 1.588e-03 4.404e-04 3.605 0.000312 ***
## Nearest 5.921e-03 1.281e-02 0.462 0.643966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 76.36298)
##
## Null deviance: 3510.7 on 29 degrees of freedom
## Residual deviance: 1797.8 on 26 degrees of freedom
## AIC: 1966.7
##
## Number of Fisher Scoring iterations: 5
Another group of Wald Tests results in a totally different conclusion. Here only Elevation is a significant variable and we can justify drop the variables Area and Nearest.
Drop in deviance tests are another way to compare nested models other than Wald Tests.
Let’s say we have a second Poisson regression model that is slightly larger than my original 3 predictor model:
modPoslarger <- glm(Species ~ Area + Elevation + Scruz + Nearest, family = poisson)
summary(modPoslarger)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Scruz + Nearest, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.429 -6.064 -1.679 2.049 16.582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.702e+00 4.015e-02 92.20 <2e-16 ***
## Area -2.314e-04 2.263e-05 -10.22 <2e-16 ***
## Elevation 2.147e-03 6.587e-05 32.60 <2e-16 ***
## Scruz -1.273e-02 7.058e-04 -18.04 <2e-16 ***
## Nearest 2.678e-02 1.660e-03 16.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.7 on 29 degrees of freedom
## Residual deviance: 1341.5 on 25 degrees of freedom
## AIC: 1512.3
##
## Number of Fisher Scoring iterations: 5
Now to calculate drop in deviance:
dropinDev <- deviance(modPos) - deviance(modPoslarger)
dispL <- sum(residuals(modPoslarger, type ="pearson")^2 / modPos$df.residual)
teststat <- dropinDev/(dispL*1)
pf(teststat, df1 = 1, 25, lower.tail = FALSE)
## [1] 0.01122023
This small p-value indicates we should go for the larger/more complex of the two models.