Introduction
As chapter 5 of “Extending the Linear Model with R” by Julian J. Faraway explained Poisson regression can be a really useful tool if we know how and when to use it. For this demonstration I will look at a regression model for the gala dataset from the faraway package. It pertains to the species diversity on the Galapagos Island. There are altogether 7 variables in the data set and I will use poisson regression to define the relationship between the number of plant species with other variables in the dataset.
What is Poisson Regression?
Poisson regression is used to model response variables (Y-values) that are counts. It tells us which explanatory variables have a statistically significant effect on the response variable. Poisson regression is more suited for cases where the response variable is a small integer.
Access Data
## Species Endemics Area Elevation Nearest Scruz Adjacent
## Baltra 58 23 25.09 346 0.6 0.6 1.84
## Bartolome 31 21 1.24 109 0.6 26.3 572.33
## Caldwell 3 3 0.21 114 2.8 58.7 0.78
## Champion 25 9 0.10 46 1.9 47.4 0.18
## Coamano 2 1 0.05 77 1.9 1.9 903.82
## Daphne.Major 18 11 0.34 119 8.0 8.0 1.84
Species diversity on the Galapagos Islands
Description
There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. The original dataset contained several missing values which have been filled for convenience.
Usage
data(gala)
Format
The dataset contains the following variables
Species
the number of plant species found on the island
Endemics
the number of endemic species
Area
the area of the island (km\(^2\))
Elevation
the highest elevation of the island (m)
Nearest
the distance from the nearest island (km)
Scruz
the distance from Santa Cruz island (km)
Adjacent
the area of the adjacent island (square km)
Source
M. P. Johnson and P. H. Raven (1973) “Species number and endemism: The Galapagos Archipelago revisited” Science, 179, 893-895
Based on what we see in this description we can say that Species is the response variable. Lets check out the basic summary of the dataset.
## Endemics Area Elevation Nearest
## Min. : 0.00 Min. : 0.010 Min. : 25.00 Min. : 0.20
## 1st Qu.: 7.25 1st Qu.: 0.258 1st Qu.: 97.75 1st Qu.: 0.80
## Median :18.00 Median : 2.590 Median : 192.00 Median : 3.05
## Mean :26.10 Mean : 261.709 Mean : 368.03 Mean :10.06
## 3rd Qu.:32.25 3rd Qu.: 59.237 3rd Qu.: 435.25 3rd Qu.:10.03
## Max. :95.00 Max. :4669.320 Max. :1707.00 Max. :47.40
## Scruz Adjacent
## Min. : 0.00 Min. : 0.03
## 1st Qu.: 11.03 1st Qu.: 0.52
## Median : 46.65 Median : 2.59
## Mean : 56.98 Mean : 261.10
## 3rd Qu.: 81.08 3rd Qu.: 59.24
## Max. :290.20 Max. :4669.32
Now we are done exploring the data, lets generate a histogram for Species inorder to check if the variable follow the poisson distribution.
hist(gala$Species, breaks = 10, xlab = "Species count", main = "Distribution of Species", prob = TRUE)
lines(density(gala$Species, na.rm = T), col = "red", lwd = 4)
As we done with the preliminary analysis, I can now apply poisson regression as shown below.
fit <- glm(gala$Species ~ gala$Endemics+gala$Area+gala$Elevation+gala$Nearest+gala$Scruz+gala$Adjacent, data = gala, family = poisson())
summary(fit)
##
## Call:
## glm(formula = gala$Species ~ gala$Endemics + gala$Area + gala$Elevation +
## gala$Nearest + gala$Scruz + gala$Adjacent, family = poisson(),
## data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9919 -2.9305 -0.4296 1.3254 7.4735
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.828e+00 5.958e-02 47.471 < 2e-16 ***
## gala$Endemics 3.388e-02 1.741e-03 19.459 < 2e-16 ***
## gala$Area -1.067e-04 3.741e-05 -2.853 0.00433 **
## gala$Elevation 2.638e-04 1.934e-04 1.364 0.17264
## gala$Nearest 1.048e-02 1.611e-03 6.502 7.91e-11 ***
## gala$Scruz -6.835e-04 5.802e-04 -1.178 0.23877
## gala$Adjacent 4.539e-05 4.800e-05 0.946 0.34437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 313.36 on 23 degrees of freedom
## AIC: 488.19
##
## Number of Fisher Scoring iterations: 5
Based on the above analysis, Endemics, Area and Nearest should be enough to build the right poisson regression model.
fit.reduced <- glm(gala$Species~gala$Endemics+gala$Area+gala$Nearest, data = gala, family = poisson())
summary(fit.reduced)
##
## Call:
## glm(formula = gala$Species ~ gala$Endemics + gala$Area + gala$Nearest,
## family = poisson(), data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.085 -3.061 -0.593 2.095 6.770
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.868e+00 4.883e-02 58.729 < 2e-16 ***
## gala$Endemics 3.551e-02 7.327e-04 48.462 < 2e-16 ***
## gala$Area -4.542e-05 1.568e-05 -2.896 0.00378 **
## gala$Nearest 9.289e-03 1.319e-03 7.043 1.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 330.84 on 26 degrees of freedom
## AIC: 499.67
##
## Number of Fisher Scoring iterations: 5
We can see that each parameter is significant at P < 0.05 level.
Now let’s interpret the model parameters. We can either examine the coefficients from above or we can use the coef() function as below.
## (Intercept) gala$Endemics gala$Area gala$Nearest
## 2.8678751411 0.0355078212 -0.0000454177 0.0092886047
The regression parameter of 0.0355 for Endemics indicates that a one-unit increase in the variable is associated with a 0.04 increase in the log mean number of Species, holding other variables constant.However, it is much easier to interpret the regression coefficients in the original scale of the dependent variable (number of Species, rather than log number of Species).
## (Intercept) gala$Endemics gala$Area gala$Nearest
## 17.5995818 1.0361458 0.9999546 1.0093319
Using the above steps, we obtained a Poisson regression model for predicting the number of plant species on the Galapagos Islands.From the above findings, I can say that one unit increase in Area multiples the expected number of species by 0.9999, and a unit increase in the number of endemic species represented by Endemics multiplies the number of species by 1.0361.