Libraries

library(faraway)
## Warning: package 'faraway' was built under R version 3.6.3

Introduction

As chapter 5 of “Extending the Linear Model with R” by Julian J. Faraway explained Poisson regression can be a really useful tool if we know how and when to use it. For this demonstration I will look at a regression model for the gala dataset from the faraway package. It pertains to the species diversity on the Galapagos Island. There are altogether 7 variables in the data set and I will use poisson regression to define the relationship between the number of plant species with other variables in the dataset.

What is Poisson Regression?

Poisson regression is used to model response variables (Y-values) that are counts. It tells us which explanatory variables have a statistically significant effect on the response variable. Poisson regression is more suited for cases where the response variable is a small integer.

Access Data

data("gala")
head(gala)
##              Species Endemics  Area Elevation Nearest Scruz Adjacent
## Baltra            58       23 25.09       346     0.6   0.6     1.84
## Bartolome         31       21  1.24       109     0.6  26.3   572.33
## Caldwell           3        3  0.21       114     2.8  58.7     0.78
## Champion          25        9  0.10        46     1.9  47.4     0.18
## Coamano            2        1  0.05        77     1.9   1.9   903.82
## Daphne.Major      18       11  0.34       119     8.0   8.0     1.84

Species diversity on the Galapagos Islands

Description

There are 30 Galapagos islands and 7 variables in the dataset. The relationship between the number of plant species and several geographic variables is of interest. The original dataset contained several missing values which have been filled for convenience.

Usage

data(gala)

Format

The dataset contains the following variables

Species

the number of plant species found on the island

Endemics

the number of endemic species

Area

the area of the island (km\(^2\))

Elevation

the highest elevation of the island (m)

Nearest

the distance from the nearest island (km)

Scruz

the distance from Santa Cruz island (km)

Adjacent

the area of the adjacent island (square km)

Source

M. P. Johnson and P. H. Raven (1973) “Species number and endemism: The Galapagos Archipelago revisited” Science, 179, 893-895

Based on what we see in this description we can say that Species is the response variable. Lets check out the basic summary of the dataset.

summary(gala[c(-1)])
##     Endemics          Area            Elevation          Nearest     
##  Min.   : 0.00   Min.   :   0.010   Min.   :  25.00   Min.   : 0.20  
##  1st Qu.: 7.25   1st Qu.:   0.258   1st Qu.:  97.75   1st Qu.: 0.80  
##  Median :18.00   Median :   2.590   Median : 192.00   Median : 3.05  
##  Mean   :26.10   Mean   : 261.709   Mean   : 368.03   Mean   :10.06  
##  3rd Qu.:32.25   3rd Qu.:  59.237   3rd Qu.: 435.25   3rd Qu.:10.03  
##  Max.   :95.00   Max.   :4669.320   Max.   :1707.00   Max.   :47.40  
##      Scruz           Adjacent      
##  Min.   :  0.00   Min.   :   0.03  
##  1st Qu.: 11.03   1st Qu.:   0.52  
##  Median : 46.65   Median :   2.59  
##  Mean   : 56.98   Mean   : 261.10  
##  3rd Qu.: 81.08   3rd Qu.:  59.24  
##  Max.   :290.20   Max.   :4669.32

Now we are done exploring the data, lets generate a histogram for Species inorder to check if the variable follow the poisson distribution.

hist(gala$Species, breaks = 10, xlab = "Species count", main = "Distribution of Species", prob = TRUE)
lines(density(gala$Species, na.rm = T), col = "red", lwd = 4)

boxplot(gala$Species, main = "Boxplot of Species")

As we done with the preliminary analysis, I can now apply poisson regression as shown below.

fit <- glm(gala$Species ~ gala$Endemics+gala$Area+gala$Elevation+gala$Nearest+gala$Scruz+gala$Adjacent, data = gala, family = poisson())
summary(fit)
## 
## Call:
## glm(formula = gala$Species ~ gala$Endemics + gala$Area + gala$Elevation + 
##     gala$Nearest + gala$Scruz + gala$Adjacent, family = poisson(), 
##     data = gala)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9919  -2.9305  -0.4296   1.3254   7.4735  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     2.828e+00  5.958e-02  47.471  < 2e-16 ***
## gala$Endemics   3.388e-02  1.741e-03  19.459  < 2e-16 ***
## gala$Area      -1.067e-04  3.741e-05  -2.853  0.00433 ** 
## gala$Elevation  2.638e-04  1.934e-04   1.364  0.17264    
## gala$Nearest    1.048e-02  1.611e-03   6.502 7.91e-11 ***
## gala$Scruz     -6.835e-04  5.802e-04  -1.178  0.23877    
## gala$Adjacent   4.539e-05  4.800e-05   0.946  0.34437    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  313.36  on 23  degrees of freedom
## AIC: 488.19
## 
## Number of Fisher Scoring iterations: 5

Based on the above analysis, Endemics, Area and Nearest should be enough to build the right poisson regression model.

fit.reduced <- glm(gala$Species~gala$Endemics+gala$Area+gala$Nearest, data = gala, family = poisson())
summary(fit.reduced)
## 
## Call:
## glm(formula = gala$Species ~ gala$Endemics + gala$Area + gala$Nearest, 
##     family = poisson(), data = gala)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.085  -3.061  -0.593   2.095   6.770  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    2.868e+00  4.883e-02  58.729  < 2e-16 ***
## gala$Endemics  3.551e-02  7.327e-04  48.462  < 2e-16 ***
## gala$Area     -4.542e-05  1.568e-05  -2.896  0.00378 ** 
## gala$Nearest   9.289e-03  1.319e-03   7.043 1.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  330.84  on 26  degrees of freedom
## AIC: 499.67
## 
## Number of Fisher Scoring iterations: 5

We can see that each parameter is significant at P < 0.05 level.

Now let’s interpret the model parameters. We can either examine the coefficients from above or we can use the coef() function as below.

coef(fit.reduced)
##   (Intercept) gala$Endemics     gala$Area  gala$Nearest 
##  2.8678751411  0.0355078212 -0.0000454177  0.0092886047

The regression parameter of 0.0355 for Endemics indicates that a one-unit increase in the variable is associated with a 0.04 increase in the log mean number of Species, holding other variables constant.However, it is much easier to interpret the regression coefficients in the original scale of the dependent variable (number of Species, rather than log number of Species).

exp(coef(fit.reduced))
##   (Intercept) gala$Endemics     gala$Area  gala$Nearest 
##    17.5995818     1.0361458     0.9999546     1.0093319

Using the above steps, we obtained a Poisson regression model for predicting the number of plant species on the Galapagos Islands.From the above findings, I can say that one unit increase in Area multiples the expected number of species by 0.9999, and a unit increase in the number of endemic species represented by Endemics multiplies the number of species by 1.0361.