Markdown Author: Jessie Bell, 2023

Libraries used: MASS

Answers: GGgreen

Generalized Linear Models (GzLM)

Count Data

Let’s assume you have been counting fish on coral reefs, so your data are count data. If the counts are large they may well look pretty normal, but count data have some important characteristics: counts are integers, whereas the normal distribution is for continuous data that can take any fractional value. Counts also cannot be negative (although they can be zero), whereas the normal distribution models stochastic processes that draw both zeros and negative numbers.

Statisticians have described many distributions for counts, but one of the simplest is the Poisson distribution. It is a model of non-negative integers (0, 1, 2, ...). It has one parameter, \(\lambda\), which is both its mean and its variance. Let’s see what that looks like with some simple R code to draw random numbers from two Poisson distributions:

I. Random Numbers

n <- 1000
set.seed(42)
x1 <- rpois(n, lambda = 1)
x10 <- rpois(n, lambda = 10)
mean(x1)
## [1] 0.976
var(x1)
## [1] 0.9663904
mean(x10)
## [1] 10.083
var(x10) #notice that the poisson has about equal mean and variances above (mean 10, variance 10 for x10 and mean almost 1 and variance 1 for x1)
## [1] 10.75086

II. Poisson Distributions

par(mfrow= c(1,2))
hist(x1, xlim = c(0, 25), breaks = seq(0, 25, by = 1), col = "#c276f6")
hist(x10, xlim = c(0, 25), breaks = seq(0, 25, by = 1), col = "#00b9be")

If lambda is 1 what do you find? What about if lambda is 10?

So far our Poisson model only has one parameter, a mean (and variance) we refer to as λ. But what if we wanted the mean to change? For instance, we might have counted fish on different types of coral reefs and we want to test whether there are different abundances on each type of reef. Or we might have counted fish across a gradient of pollution and we want to know how their numbers change from low to high pollution. We will call these hypothesized causes of changes in fish counts ‘covariates’. Others might call them explanatory variables, treatments (in experiments), or predictor variables.

We will use Generalized Linear Models, so we can include the covariates as variables in a linear equation; after all, that is what we do with linear regression (and general linear models).

\(Y \sim \beta_0 + \beta_1 X + \epsilon\)

Let’s generate some sample data ourselves. We will assume pollution is measured on a zero to one (low to high) scale, and that the mean number of fish with no pollution (0) = 4 and that on average there are no fish when pollution level = 0.5 (half the maximum).

n <- 50
beta <- -8
alpha <- 4
x <- seq(0, 1, length.out = n)

ymean <- alpha + beta*x # the y = mx + b formula for simplicity; think of it as Y(fish count) ~ B0 + B1(pollution) + E
plot(x, ymean, type = "l", col="#00b3dc", xlab="Pollution level", ylab = "Number of fish counted")
abline(h=0, lty=2, lwd=2)

There is something odd about this model: we are predicting negative fish (on average) for pollution levels over 0.5.

It gets even worse if we model sampling with a normal distribution:

set.seed(55)
yobs_normal <- ymean + rnorm(n) #Here we're adding random variation to the mean, using the mean count above
plot(x, ymean, type = 'l', xlab = "Pollution level", ylab = "Number of fish counted")
points(x, yobs_normal)
abline(h = 0, lty = 2, lwd = 2)

We can solve this by using a link function and a non-normal distribution. (see GLM in next tab)

Generalized Linear Models

Link functions elegantly solve the problem of using linear models with non-normal data. There are many types of link functions, but we will look at one that is popular for count data: the log link.

n <- 50 #sample size
x <- seq(0, 1, length.out = n) #an evenly spaced sequence from 0 to 1
gamma <- -3.2 #effect of pollution, equivalent to B1 or B for illustration
alpha <- 4 #intercept = mean count at zero pollution

yexp <- alpha*exp(gamma*x)

plot(x, yexp, type = "l", xlab = "Pollution level", ylab="Number of fish counted", col="#abd7c2")
abline(h=0, lty = 2, lwd = 2)

Here we have the equation \(y = \alpha e^{\gamma x}\), which is the same as the linear equation for log(y): \(\log(y) = \log(\alpha) + \gamma x\).

Note we retained alpha = 4 in both, because in both equations alpha is the expected value at a pollution level of zero.

The slope parameter was changed in the log-linear equation to gamma because it is not a direct analogue of our slope parameter beta above.

One of the nice things about the log-linear equation is that the slope parameter now represents multiples of change. For instance, gamma = -3.2 means abundance declines about 25-fold (= 1/exp(-3.2)) when going from a pollution level of 0 to 1, and about five-fold when going from a pollution level of 0 to 0.5 (= 1/exp(-3.2*0.5)).
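You can verify these multiples directly in R:

1/exp(-3.2) #about a 25-fold decline going from pollution 0 to 1
1/exp(-3.2*0.5) #about a 5-fold decline going from pollution 0 to 0.5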

Now we can use this exponential curve as the mean (and variance!) of a Poisson distribution.

n <- 50 #sample size
x <- seq(0, 1, length.out = n) #an evenly spaced sequence from 0 to 1
gamma <- -3.2 #effect of pollution, equivalent to B1 or B for illustration
alpha <- 4 #intercept = mean count at zero pollution

yexp <- alpha*exp(gamma*x)

yobs_pois <- rpois(n, yexp)
#Just as we drew random numbers from a normal dist. above, here we draw them from a Poisson dist.
#Look up rpois() if you don't know what this function is doing
plot(x, yexp, type = 'l', col="#ef766e", xlab = "Pollution level", ylab = "Number of fish counted", ylim = c(0, 8))
points(x, yobs_pois)

Fitting the GzLM

Since we introduced some random variability into our generated data, we can treat the data from the last section as a “sample”. We can now use a generalized linear model to estimate the effect of the pollution covariate using R’s glm() function. The syntax is pretty much the same as with lm(), but we now specify the family and link function: this is the third component characteristic of generalized linear models (in addition to the random component (the Ys) and the systematic component (the Xs)).

m1 <- glm(yobs_pois ~ x, family = poisson(link = "log"))
summary(m1)
## 
## Call:
## glm(formula = yobs_pois ~ x, family = poisson(link = "log"))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.4097     0.1926   7.319 2.50e-13 ***
## x            -3.3456     0.5625  -5.947 2.73e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 94.763  on 49  degrees of freedom
## Residual deviance: 49.485  on 48  degrees of freedom
## AIC: 123.71
## 
## Number of Fisher Scoring iterations: 5
coef(m1)
## (Intercept)           x 
##    1.409704   -3.345646

The values we printed give the estimates for the intercept and slope coefficients (alpha and gamma). How do these relate to the input data for (log)alpha and gamma?

We have specified above the type of distribution to use (family = poisson()) and which link to use. “log” is in fact the default choice for the Poisson family, but I put it there so you know you can change it.

Technically we would say we fit a Generalized Linear Model with Poisson errors and a log link function. We talk about Poisson errors (not Poisson data), because it is the leftover variation after we fit the model (i.e., the error or residuals) that we are assuming is Poisson distributed, in the same way we assume normally distributed residuals with a Gaussian distribution.

The model actually doesn’t make any assumptions about how the raw data are distributed, so long as they are integers and non-negative. Note that the data can contain zeros, but the mean of the Poisson is always >0.

What do the coefficients mean? Remember the coefficients are on the log scale. So the mean abundance at a pollution level of zero is the exponentiated value of the intercept, and a change in pollution from 0 to 1 corresponds to a 1/exp(slope)-fold decline in fish abundance. You can check these numbers with the math below.

exp(coef(m1)[1]) #Very close to 4, as set up in our data generation
## (Intercept) 
##    4.094745
1/(exp(coef(m1)[2])) #Similar to the 25 times decline mentioned above
##        x 
## 28.37889

We can plot the fitted model with standard errors along with our “true mean” from our simulated data.

ypredict <- predict(m1, type = "response", se = TRUE)
plot(x, yexp, type = 'l', xlab = "Pollution level",
     ylab = "Number of fish counted",
     ylim = c(0, 8))
lines(x, ypredict$fit, lwd = 2, col = "red", lty = 2)
#Add lines for standard errors
lines(x, ypredict$fit + ypredict$se.fit, lty = 3, col = "red") #Pull the se out of the predicted values
lines(x, ypredict$fit - ypredict$se.fit, lty = 3, col = "red")
#plot observations
points(x, yobs_pois)
legend('topright', legend = c("True mean", "Estimated mean"),
       lwd = 2, lty = c(1,2), col = c("black", "red"))

You can see the fitted line falls close to the ‘true’ line, and the standard errors are pretty tight around our best estimate.

The fitting algorithm itself is attempting to maximize the log-likelihood of the model parameters given the observations (in technical speak). You can refresh your understanding of maximum likelihood estimation if needed.
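As a minimal sketch of what that means, using only the objects fitted above: the log-likelihood reported by glm() for a Poisson model is the sum of the Poisson log-densities of the observations at their fitted means.

#The Poisson log-likelihood of the observations given the fitted means
sum(dpois(yobs_pois, lambda = fitted(m1), log = TRUE))
logLik(m1) #should match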

We wanted to fit a linear function to data that can’t be less than zero, because linear functions are convenient to work with. So we used a log link function to describe the mean and to ensure that the mean is always greater than zero. We ended up with a model where the slope describes multiples of change in fish abundance over the pollution gradient. So the model itself is actually multiplicative, not additive.

If you think about it, natural processes that generate counts often are multiplicative, not additive. For instance, we may talk about ‘fish multiplying’ when they breed, because population growth can be exponential.

So our mathematically convenient link function actually ended up being a better description of the natural process.

The last thing we want to check is whether our residuals are Poisson distributed (an assumption of a GzLM with a Poisson distributional family). What this means for Pearson residuals (the residual divided by the square root of the variance) is that they have constant spread, just like normal residuals. Examining the residuals for these models is more complicated in reality, and if you end up down this road, you can evaluate it further.
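A minimal sketch of what a Pearson residual is, using only objects created above: for a Poisson model the variance equals the mean, so the Pearson residual is just the raw residual divided by the square root of the fitted mean.

ep_manual <- (yobs_pois - fitted(m1)) / sqrt(fitted(m1)) #(observed - fitted)/sqrt(variance); for the Poisson, variance = mean
all.equal(ep_manual, resid(m1, type = "pearson"), check.attributes = FALSE) #should be TRUE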

par(mfrow=c(2,2))
plot(m1)

Transform or nah?

What’s the difference between a log link and log transforming your data? The answer is instructive as to how link functions work.

Take the data we generated above and fit two GLMs (you will have to add a small number, a kludge factor, so you can log the zeros; not ideal, but a common practice):

yobsplus <- yobs_pois+0.1
model1 <- glm(yobsplus ~ x, family = gaussian(link = "log"))
model2 <- glm(log(yobsplus) ~ x, family = gaussian(link = "identity"))
summary(model1)
## 
## Call:
## glm(formula = yobsplus ~ x, family = gaussian(link = "log"))
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.3943     0.1330  10.487 5.22e-14 ***
## x            -2.9649     0.5925  -5.004 7.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.295963)
## 
##     Null deviance: 120.000  on 49  degrees of freedom
## Residual deviance:  62.205  on 48  degrees of freedom
## AIC: 158.81
## 
## Number of Fisher Scoring iterations: 6
summary(model2)
## 
## Call:
## glm(formula = log(yobsplus) ~ x, family = gaussian(link = "identity"))
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0466     0.3223   3.247  0.00213 ** 
## x            -3.4193     0.5554  -6.156 1.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.337798)
## 
##     Null deviance: 114.919  on 49  degrees of freedom
## Residual deviance:  64.214  on 48  degrees of freedom
## AIC: 160.4
## 
## Number of Fisher Scoring iterations: 2

In the first model we fitted a Gaussian distribution (normally distributed errors) with a log link. In the second we fitted a Gaussian distribution to the logged response variable (log(y)) with the identity link (the identity link is no link at all; functionally this is just an lm).

Now compare the results. Notice that the estimate of the slope is quite different. Why is this?

ypredict1 <- predict(model1, type = "response", se = TRUE)
ypredict2 <- predict(model2, type = "response", se = TRUE)
plot(x, yexp, type = 'l', xlab = "Pollution level",
     ylab = "Number of fish counted",
     ylim = c(-2, 8))
lines(x, ypredict1$fit, lwd = 2, col = "#e76bf3", lty = 2)
lines(x, ypredict2$fit, lwd = 2, col = "steelblue", lty = 2)
#Add lines for standard errors (note: use each model's own se.fit)
lines(x, ypredict1$fit + ypredict1$se.fit, lty = 3, col = "#e76bf3")
lines(x, ypredict1$fit - ypredict1$se.fit, lty = 3, col = "#e76bf3")

lines(x, ypredict2$fit + ypredict2$se.fit, lty = 3, col = "steelblue")
lines(x, ypredict2$fit - ypredict2$se.fit, lty = 3, col = "steelblue")
#plot observations
points(x, yobs_pois)


The model with the log link (model1 above) fits the mean on the log scale, but the Gaussian errors are on the natural scale. So the residual (or error) variance is constant for all mean values of y.

The model with the logged data and identity link fits both the mean and the variance on the log scale. So if we back-transform log(y) to y (exponentiate it), the variance changes with the mean.

So a log link isn’t the same as a log transformation. The transformation changes the raw data. The link function doesn’t touch the raw data; instead, you can think of it as a transformation of the model for the mean of the raw data.
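One practical consequence, sketched with the two models above: predict(model2, type = "response") returns fits of log(y), not y, so to compare the two models on the natural scale you have to exponentiate model2's fits (strictly, exponentiating a mean on the log scale gives a biased estimate of the mean on the natural scale, but it serves for a quick comparison).

fit1 <- predict(model1, type = "response") #already on the scale of y
fit2 <- exp(predict(model2, type = "response")) #fits of log(y), back-transformed to the scale of y
head(cbind(fit1, fit2))

This is also why the steelblue line in the plot above dips below zero: it is the fit of log(y) drawn on the natural-scale axes.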

Binomial Response

Uta Lizard

In the 1980s a group of researchers did groundbreaking work on ecosystem subsidies looking at spiders on island ecosystems. As part of an investigation of what controls spider populations on these islands, Polis et al. (1988) recorded the physical and biological characteristics of islands in the Gulf of California. One of the outcomes was an analysis of the presence/absence (PA) of a spider predator, the Uta lizard. The predictor was island perimeter to area ratio (RATIO). Because the response variable (Uta lizard presence/absence) is binary, the data are non-normal and we cannot use an lm. Conduct a logistic regression to determine whether perimeter:area ratio is a good predictor of the presence/absence of Uta lizards.

Explore the data on your own. Fit an appropriate model (see code below). Plot expected outcomes. (There are some chunks of code below to help you along with some of the details.)

polis <- read.csv("polis.csv")
head(polis)
##       ISLAND RATIO PA
## 1       Bota 15.41  1
## 2     Cabeza  5.63  1
## 3    Cerraja 25.92  1
## 4 Coronadito 15.17  0
## 5     Flecha 13.04  1
## 6   Gemelose 18.85  0

When you build your model, name it polis.glm, which will facilitate some of the steps below. To fit the logistic regression, we use the glm() function and the family argument, here family = binomial, to account for the binary nature of the data. What is the canonical link for the binomial family? If you don’t know, you can use the help to look up the family argument. Besides the different fitting function, glm(), the syntax is pretty much the same as with lm().

polis.glm <- glm(PA ~ RATIO, family=binomial, data=polis)

summary(polis.glm)
## 
## Call:
## glm(formula = PA ~ RATIO, family = binomial, data = polis)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   3.6061     1.6953   2.127   0.0334 *
## RATIO        -0.2196     0.1005  -2.184   0.0289 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.287  on 18  degrees of freedom
## Residual deviance: 14.221  on 17  degrees of freedom
## AIC: 18.221
## 
## Number of Fisher Scoring iterations: 6

Interpreting results

The canonical link for the binomial family is the logit link function, which is commonly used in logistic regression models.

Interpretation of the coefficients: the logistic regression coefficient for the predictor (RATIO) is interpreted on the log-odds scale. Here the coefficient for RATIO is -0.22, which implies that for a one-unit increase in RATIO, the log-odds of Uta lizard presence decrease by 0.22.

The p-value for the coefficient tests the null hypothesis that the coefficient is equal to zero. A p-value of 0.029 indicates that RATIO is statistically significant.
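Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate:

exp(coef(polis.glm)) #each one-unit increase in RATIO multiplies the odds of presence by exp(-0.22), about 0.80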


Just as with lm() and the Poisson glm() above, we get output for the intercept and slope. Here the “slope” (which is actually a log odds ratio) uses the z statistic (the standard normal deviate) to test for a difference from 0. The z value is the ratio of the estimated coefficient to its standard error: it measures how many standard errors the estimated coefficient is away from zero. A larger absolute z value indicates stronger evidence against a coefficient of zero. Here it is a little more than 2 standard errors from zero, which means the probability is < 0.05; indeed, the p-value is 0.029.
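You can recover the z value yourself from the coefficient table:

cf <- coef(summary(polis.glm)) #the coefficient table from summary()
cf[, "Estimate"] / cf[, "Std. Error"] #matches the z value column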

Interpretation of logistic regression is not as straightforward as other forms of regression because the concept of the slope is different. Here we have a presence (1) or absence (0), and we are trying to find the level of the predictor at which we are more likely to have one or the other (the threshold for predicting a 1 vs a 0 is typically set at the 50% point). The standard logistic function for predicting the outcome of an observation given a predictor variable (x) is an s-shaped curve defined as p = exp(y) / [1 + exp(y)], where y is the linear predictor. So the model estimates a “slope”, but it is really a change in the log-odds. There is more information in any advanced modeling book, should you need to go down this road.

So, we can say that perimeter to area ratio (RATIO) is a significant predictor of Uta lizard presence, and that the probability of lizard presence decreases with increasing perimeter:area ratio (-0.22), where the fitted equation is p = exp(3.61 - 0.22 * RATIO) / [1 + exp(3.61 - 0.22 * RATIO)].
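A handy by-product: the RATIO at which the predicted probability crosses 0.5 is where the linear predictor equals zero, i.e. -intercept/slope:

-coef(polis.glm)[1] / coef(polis.glm)[2] #about 16.4, the 50% point on the curve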

We can calculate the predicted values based on the fitted model (make sure you know what each step is doing here).

#Create a new data frame on which to make predictions
xs<-data.frame(RATIO=seq(0,70, length.out=1000))

polis.predict<-predict(polis.glm, type="response", se.fit=TRUE, newdata=xs)

xs$predict <- polis.predict$fit
#To get confidence intervals, calculate the fit at each point plus/minus the SE at each point--this can be vectorized:
#E.g. to get the upper confidence band
upper <- polis.predict$fit + polis.predict$se.fit

lower <- polis.predict$fit - polis.predict$se.fit

xs$upper <- upper

xs$lower <- lower


We can generate a plot using base R, top, or ggplot, bottom, which will hopefully make this more clear.

# Plot the logistic regression curve with confidence intervals
plot(polis$RATIO, polis$PA, pch=20, xlab="Perimeter to area ratio", ylab="Uta presence/absence")
lines(xs$RATIO, polis.predict$fit, col = "#00bfc4", lwd = 2)  # Logistic regression curve
lines(xs$RATIO, xs$upper, col = "grey", lty = 2)  # Upper confidence band
lines(xs$RATIO, xs$lower, col = "grey", lty = 2)  # Lower confidence band

So, when the perimeter to area ratio reaches about 16 (= 3.61/0.22, where the predicted probability crosses 0.5), the expected presence of Uta lizards drops off.

Overdispersion and Deviance

#Overdispersion: the variance of the response is greater than what's assumed by the model. This is more of a problem in Poisson-type models, but the code is demonstrated here.
#Pearson chi^2 p-value to evaluate dispersion (overdispersion):
pp <- sum(resid(polis.glm, type="pearson")^2)
1-pchisq(pp, polis.glm$df.residual)
## [1] 0.5715331
#0.57 is a very high p-value: no evidence of overdispersion
#Deviance (G^2) p-value:
1-pchisq(polis.glm$deviance, polis.glm$df.residual)
## [1] 0.6514215
#To evaluate the strength of the association (an R^2 equivalent, but remember it is not really R^2 because we are not using squared error)
1-(polis.glm$deviance/polis.glm$null.deviance)
## [1] 0.4590197

HW

Amphibian Roadkill

RK <- read.table("RoadKills.txt", header = T)

The data set consists of roadkills of amphibian species at 52 sites along a road in Portugal. A scatterplot of the response variable (total roadkills, TOT.N) against a possible explanatory variable, distance to the natural park (D.PARK), is given in Figure 1. The biological interpretation of ‘distance to the park’ is given in Chapter 16.


1. 🐠

Good
plot(RK$D.PARK, RK$TOT.N, xlab = "Distance to park", ylab = "Road kills", pch=20, col="#e76bf3")
**Figure 1:** Scatterplot of amphibian road kills versus distance (in metres) to a nearby Natural Park. The data are counts, and there seems to be a non-linear, perhaps exponential, relationship between roadkills and D.PARK. Also note that the variation is larger for larger values of roadkills. Taken together, this gives us all the ingredients for a Poisson GLM. Starting with only D.PARK as an explanatory variable, and ignoring the other 10 explanatory variables, is a pedagogical choice for presenting Poisson GLM in a textbook and is not a general recommendation for analysing these data


  1. \(Y_i\), the number of killed animals at site \(i\), is Poisson distributed with mean \(\mu_i\).

  2. The systematic part is given by \(\eta(\text{D.PARK}_i) = \alpha + \beta \times \text{D.PARK}_i\).

  3. There is a logarithmic link between the mean of \(Y_i\) and the predictor function \(\eta(\text{D.PARK}_i)\).

As a result of these three steps, we have

\(Y_i \sim \text{Poisson}(\mu_i)\)

\(E(Y_i) = \mu_i\) and \(\text{var}(Y_i) = \mu_i\)

\(\log(\mu_i) = \alpha + \beta \times \text{D.PARK}_i\), or equivalently \(\mu_i = e^{\alpha + \beta \times \text{D.PARK}_i}\)


I. Model Selection

You can use glm() for an ordinary linear regression as long as you specify family = gaussian.

#Make a model
M1 <- glm(TOT.N ~ D.PARK, family = poisson, data = RK)

summary(M1) #the first 2 lines tell which model has been fitted. Basic math on the residuals is next. 
## 
## Call:
## glm(formula = TOT.N ~ D.PARK, family = poisson, data = RK)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.316e+00  4.322e-02   99.87   <2e-16 ***
## D.PARK      -1.059e-04  4.387e-06  -24.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.4  on 51  degrees of freedom
## Residual deviance:  390.9  on 50  degrees of freedom
## AIC: 634.29
## 
## Number of Fisher Scoring iterations: 4
# The estimates for the intercept and slope are 4.32 and -0.000106, respectively. We get a z stat (because we know the variance) and a corresponding p-value (for testing the null hypothesis that the slope is 0). We also get an AIC that can help us with model selection. 

## Keep in mind that the distance to the park is expressed in meters. Expressing it in km instead would rescale the slope and avoid estimates with so many leading zeros. 



Deviance

The null and residual deviance are essentially the maximum likelihood equivalents of the total sum of squares and the residual sum of squares, respectively. For the Poisson GLM, the residual deviance is defined as twice the difference between the log-likelihood of a model that provides a perfect fit (the saturated model) and the log-likelihood of the fitted model. There is no \(R^2\) in GLM models, so you can determine the explained deviance like this:

\(\dfrac{nullDeviance-residualDeviance}{nullDeviance} * 100\)

#so if we use this with our model: 

(M1$null.deviance - M1$deviance)/M1$null.deviance * 100
## [1] 63.51681

So the explanatory variable distance to the park explains 63.5% of the variation in road kills. Dobson (2002) called this proportional increase in explained deviance the pseudo R2. The smaller the residual deviance, the better the model. Some statistics programs also quote a p-value, as the residual deviance is supposedly Chi-square distributed with n – p degrees of freedom, where p is the number of regression parameters in the model and n the number of observations. However, using the residual deviance as a goodness-of-fit measure is not without controversy; see McCullagh and Nelder (1989, pp. 118–119). They argue that (at least for the binomial GLM) a large value of the residual deviance cannot always be seen as evidence of a poor fit. The residual deviance is also sometimes simply called the deviance (see the calculation above).
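As a check on the definition above, a sketch using only M1 and the RK data already loaded: the residual deviance is twice the difference between the log-likelihood of the saturated model (each \(\mu_i\) set to the observation itself) and that of the fitted model.

ll_sat <- sum(dpois(RK$TOT.N, lambda = RK$TOT.N, log = TRUE)) #saturated model: a perfect fit
ll_fit <- sum(dpois(RK$TOT.N, lambda = fitted(M1), log = TRUE)) #the fitted model
2 * (ll_sat - ll_fit) #matches M1's residual deviance of 390.9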



Sketching the Fitted Values

Calculate the predicted values and add them as a line.

MyData=data.frame(D.PARK=seq(from=0,to=25000,by=1000))

#predict() for model predictions, on the scale of the link (log)
G<-predict(M1,
           newdata=MyData,
           type="link",
           se=T)


F<-exp(G$fit) #back-transform the predicted fit values from the log scale to the response scale

FSEUP<-exp(G$fit+1.96*G$se.fit)

FSELOW<-exp(G$fit-1.96*G$se.fit)
plot(RK$D.PARK, RK$TOT.N, xlab = "Distance to park", ylab = "Road kills", pch=20, col="#e76bf3")
lines(MyData$D.PARK,F,lty=1)
lines(MyData$D.PARK,FSEUP,lty=2)
lines(MyData$D.PARK,FSELOW,lty=2)
**Figure 2:**  Observed roadkills with a fitted Poisson GLM curve (solid line) and 95% confidence bands (dotted lines). Note the clear exponential shape of the curve. For smaller fitted values, there are groups of residuals above and below the fitted line. This is not good, and we need to deal with this in the model validation.



II. Model Selection in GLM

So far, we have only discussed the interpretation of the model in terms of an exponential curve fitted through a set of points; we now concentrate on things like model selection, hypothesis testing, and model validation. However, applying a model selection with only one explanatory variable is a bit unrealistic, so we now add a few more explanatory variables.


The amphibian roadkills data set contains 17 explanatory variables; a list of these variables and their abbreviations is given in Table 16.1 of Zuur et al. Some of the explanatory variables were square root transformed because of large values. Using variance inflation factors (VIF, Appendix A), a sub-selection of nine variables is made in Chapter 16, and we use the same sub-selection here (see Table 16.2 and the code below).

#set up premade functions from Zuur
source("HighstatLibV10.R.txt") 

#The resulting VIF values are given in Table 16.2. As explained in Appendix A, a cut-off value of 5 or even 3 can be used to remove collinear variables; we used 3. To find a set of explanatory variables that does not contain collinearity, we removed one variable at a time, recalculated the VIF values, and repeated this process until all VIF values were smaller than 3. As a result, MONT, P.EDGE, N.PATCH, L.SDI, and SQ.URBAN were dropped. This means that we have 12 remaining explanatory variables. This is still a large number of variables!

#get the data you want
Z<-cbind(RK$OPEN.L, 
         RK$SQ.OLIVE, 
         RK$MONT.S,
         RK$MONT,
         RK$SQ.POLIC, 
         RK$SQ.SHRUB, 
         RK$SQ.URBAN,
         RK$SQ.WATRES, 
         RK$L.WAT.C, 
         RK$L.D.ROAD, 
         RK$SQ.LPROAD, 
         RK$D.WAT.RES, 
         RK$SQ.DWATCOUR, 
         RK$D.PARK, 
         RK$N.PATCH, 
         RK$P.EDGE, 
         RK$L.SDI)
#17 variables above
#use Zuur corvif() function
corvif(Z)
## 
## 
## Variance inflation factors
## 
##          GVIF
## V1  13.651024
## V2   1.245046
## V3  19.299391
## V4   1.763165
## V5   2.221564
## V6   1.570653
## V7   1.946437
## V8  13.796522
## V9  10.797232
## V10  3.261683

Note: this is still a relatively high number of explanatory variables for a data set with only 52 observations! A Poisson GLM for the roadkills data with nine variables is specified in a similar way to Equation (9.4), except that the systematic part now contains all nine explanatory variables (we have no biological reason to believe there are interactions).


Model Code

The following R code implements the Poisson GLM with nine explanatory variables.

RK$SQ.POLIC<-sqrt(RK$POLIC)
RK$SQ.WATRES<-sqrt(RK$WAT.RES)
RK$SQ.URBAN<-sqrt(RK$URBAN)
RK$SQ.OLIVE<-sqrt(RK$OLIVE)
RK$SQ.LPROAD<-sqrt(RK$L.P.ROAD)
RK$SQ.SHRUB<-sqrt(RK$SHRUB)
RK$SQ.DWATCOUR<-sqrt(RK$D.WAT.COUR)

#The full model (read above for VIF on why these variables were kept)
M2<-glm(TOT.N~OPEN.L+MONT.S+SQ.POLIC+
         SQ.SHRUB+SQ.WATRES+L.WAT.C+SQ.LPROAD+
         SQ.DWATCOUR+D.PARK,family=poisson,data=RK) #9 variables chosen

summary(M2)
## 
## Call:
## glm(formula = TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + 
##     SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR + D.PARK, family = poisson, 
##     data = RK)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.749e+00  1.567e-01  23.935  < 2e-16 ***
## OPEN.L      -3.025e-03  1.580e-03  -1.915 0.055531 .  
## MONT.S       8.697e-02  1.359e-02   6.398 1.57e-10 ***
## SQ.POLIC    -1.787e-01  4.676e-02  -3.822 0.000133 ***
## SQ.SHRUB    -6.112e-01  1.176e-01  -5.197 2.02e-07 ***
## SQ.WATRES    2.243e-01  7.050e-02   3.181 0.001468 ** 
## L.WAT.C      3.355e-01  4.127e-02   8.128 4.36e-16 ***
## SQ.LPROAD    4.517e-01  1.348e-01   3.351 0.000804 ***
## SQ.DWATCOUR  7.355e-03  4.879e-03   1.508 0.131629    
## D.PARK      -1.301e-04  5.936e-06 -21.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.44  on 51  degrees of freedom
## Residual deviance:  270.23  on 42  degrees of freedom
## AIC: 529.62
## 
## Number of Fisher Scoring iterations: 5


The Optimal Model

We want to know which explanatory variables are important, and because some terms are not significant, it is time for model selection. The process is similar to the one used for linear regression (Appendix A). We can use either a selection criterion like the AIC or a hypothesis testing approach. Automatic forward, backward, and combined forward/backward selection can be applied with the command step(M2); this backward selection indicates that no term should be dropped.

step(M2)
## Start:  AIC=529.62
## TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + SQ.DWATCOUR + D.PARK
## 
##               Df Deviance     AIC
## <none>             270.23  529.62
## - SQ.DWATCOUR  1   272.50  529.89
## - OPEN.L       1   273.93  531.32
## - SQ.WATRES    1   280.02  537.41
## - SQ.LPROAD    1   281.25  538.64
## - SQ.POLIC     1   285.53  542.92
## - SQ.SHRUB     1   298.31  555.70
## - MONT.S       1   306.89  564.28
## - L.WAT.C      1   335.47  592.86
## - D.PARK       1   838.09 1095.48
## 
## Call:  glm(formula = TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + 
##     SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR + D.PARK, family = poisson, 
##     data = RK)
## 
## Coefficients:
## (Intercept)       OPEN.L       MONT.S     SQ.POLIC     SQ.SHRUB    SQ.WATRES  
##   3.7493885   -0.0030250    0.0869656   -0.1787178   -0.6111864    0.2242561  
##     L.WAT.C    SQ.LPROAD  SQ.DWATCOUR       D.PARK  
##   0.3354676    0.4517172    0.0073554   -0.0001301  
## 
## Degrees of Freedom: 51 Total (i.e. Null);  42 Residual
## Null Deviance:       1071 
## Residual Deviance: 270.2     AIC: 529.6
For the hypothesis testing approach, we have three options:
1. Test the null hypothesis H0: βi = 0 using the z-statistic.

This is the equivalent of the t-statistic in linear regression. This approach suggests dropping SQ.DWATCOUR first, as it is the least significant term, then refitting the model and seeing whether any non-significant terms remain.

summary(M2) #drop the least significant term 
## 
## Call:
## glm(formula = TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + 
##     SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR + D.PARK, family = poisson, 
##     data = RK)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.749e+00  1.567e-01  23.935  < 2e-16 ***
## OPEN.L      -3.025e-03  1.580e-03  -1.915 0.055531 .  
## MONT.S       8.697e-02  1.359e-02   6.398 1.57e-10 ***
## SQ.POLIC    -1.787e-01  4.676e-02  -3.822 0.000133 ***
## SQ.SHRUB    -6.112e-01  1.176e-01  -5.197 2.02e-07 ***
## SQ.WATRES    2.243e-01  7.050e-02   3.181 0.001468 ** 
## L.WAT.C      3.355e-01  4.127e-02   8.128 4.36e-16 ***
## SQ.LPROAD    4.517e-01  1.348e-01   3.351 0.000804 ***
## SQ.DWATCOUR  7.355e-03  4.879e-03   1.508 0.131629    
## D.PARK      -1.301e-04  5.936e-06 -21.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.44  on 51  degrees of freedom
## Residual deviance:  270.23  on 42  degrees of freedom
## AIC: 529.62
## 
## Number of Fisher Scoring iterations: 5
M3 <- glm(TOT.N~OPEN.L+MONT.S+SQ.POLIC+SQ.SHRUB+SQ.WATRES+L.WAT.C+SQ.LPROAD+D.PARK,family=poisson,data=RK) #8 variables chosen

summary(M3) #everything is now significant! WOW
## 
## Call:
## glm(formula = TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + 
##     SQ.WATRES + L.WAT.C + SQ.LPROAD + D.PARK, family = poisson, 
##     data = RK)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.852e+00  1.405e-01  27.405  < 2e-16 ***
## OPEN.L      -3.464e-03  1.544e-03  -2.243 0.024890 *  
## MONT.S       8.927e-02  1.344e-02   6.643 3.08e-11 ***
## SQ.POLIC    -1.583e-01  4.492e-02  -3.525 0.000424 ***
## SQ.SHRUB    -6.160e-01  1.185e-01  -5.200 2.00e-07 ***
## SQ.WATRES    2.080e-01  6.917e-02   3.006 0.002645 ** 
## L.WAT.C      3.113e-01  3.817e-02   8.155 3.50e-16 ***
## SQ.LPROAD    4.936e-01  1.315e-01   3.753 0.000175 ***
## D.PARK      -1.286e-04  5.837e-06 -22.033  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.4  on 51  degrees of freedom
## Residual deviance:  272.5  on 43  degrees of freedom
## AIC: 529.89
## 
## Number of Fisher Scoring iterations: 5
2. drop1()

Use the drop1(M2, test = "Chi") command, which drops one explanatory variable at a time and applies an analysis of deviance test to each. We explain this process below.

The model containing all explanatory variables has a deviance of 270.23. If we drop OPEN.L, the deviance is 273.93: a difference of 3.69. The statistic \(X^2 = 3.69\) approximately follows a Chi-square distribution with 1 degree of freedom, which gives a p-value of about 0.055. This can be double-checked with the R command:


1-pchisq(3.69 ,1)
## [1] 0.05473962

Note that the analysis of deviance does not give exactly the same p-value as the z-statistic. This is because both tests are approximate. If in doubt, use the analysis of deviance test. The advantage of using the analysis of deviance test is that it also gives a p-value for a nominal variable.

drop1(M2, test="Chi")
## Single term deletions
## 
## Model:
## TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + SQ.DWATCOUR + D.PARK
##             Df Deviance     AIC    LRT  Pr(>Chi)    
## <none>           270.23  529.62                     
## OPEN.L       1   273.93  531.32   3.69 0.0546474 .  
## MONT.S       1   306.89  564.28  36.66 1.410e-09 ***
## SQ.POLIC     1   285.53  542.92  15.30 9.181e-05 ***
## SQ.SHRUB     1   298.31  555.70  28.08 1.167e-07 ***
## SQ.WATRES    1   280.02  537.41   9.79 0.0017539 ** 
## L.WAT.C      1   335.47  592.86  65.23 6.648e-16 ***
## SQ.LPROAD    1   281.25  538.64  11.02 0.0009009 ***
## SQ.DWATCOUR  1   272.50  529.89   2.27 0.1319862    
## D.PARK       1   838.09 1095.48 567.85 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3. anova()

Use the anova(M2) command, which applies a series of analysis of deviance tests by removing each term sequentially. We explain at the end of Subsection 9.6.5 how this process works.

The same p-value for OPEN.L can be obtained by fitting a model with all explanatory variables (which is M2), a model without OPEN.L, and then use the anova command to compare the two models with an analysis of deviance. This is done with the following R code:

M3 <- glm(TOT.N ~ MONT.S + SQ.POLIC + D.PARK +
          SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD +
          SQ.DWATCOUR, family = poisson, data = RK)

anova(M2, M3, test = "Chi")
## Analysis of Deviance Table
## 
## Model 1: TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + SQ.DWATCOUR + D.PARK
## Model 2: TOT.N ~ MONT.S + SQ.POLIC + D.PARK + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + SQ.DWATCOUR
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        42     270.23                       
## 2        43     273.93 -1  -3.6928  0.05465 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you use this output in a paper or report, then you should write that the difference in deviance is 3.69 and approximately follows a Chi-square distribution with 1 degree of freedom. We have seen papers where a Chi-square distribution with 43 degrees of freedom was quoted from the output above, which is clearly wrong! Be careful when using the command anova(M2); it applies an analysis of deviance test, but now the terms are removed sequentially and the order depends on the order they were typed. This is useful if all explanatory variables are independent or if the last term is an interaction.

Steps 2 and 3 are similar to the anova and drop1 functions in linear regression, except that in linear regression we used an F test based on residual sum of squares of a full and a nested model. A nested model is defined as a model that is obtained from the full model by setting certain parameters equal to 0. We do not have residual sum of squares in Poisson GLM. Well, actually we do, but they are not used in these tests (residuals are discussed in Section 9.8). Instead, we use the residual deviance of two nested models.

Summary: Check dispersion before selecting model!

Using the drop1 function, we decided to remove the variable SQ.DWATCOUR. Refitting the model resulted in all explanatory variables being significant at the 5% level. This suggests that we are finished with the model selection process and can proceed to model validation. However, things are never that easy. The summary output presented above had a small sentence that said: ‘Dispersion parameter for poisson family taken to be 1’. This does not mean that the dispersion really is 1; it just says it was taken as 1.

We promised more misery, and overdispersion is the next stage. In the next section, we show that all the results presented in this section can be put in the bin because of overdispersion. If you analyse your own data, you should always check for overdispersion first, before doing any model selection or interpreting the results. The reason we did not start by looking at overdispersion was that we wanted to make sure you could read the output and judge whether there is overdispersion. For your own data, always start by checking for overdispersion and act accordingly. This is discussed in the next section.

III. Checking Dispersion

Overdispersion: when the variance is larger than the mean.

You can quickly correct for it using the following quasi-poisson code:

M4<- glm(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC+
         SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD+
         SQ.DWATCOUR + D.PARK, family = quasipoisson, data = RK)

You can see the only difference is specifying the family option as quasipoisson instead of poisson. This gives the impression that there is a quasi-Poisson distribution, but there is no such thing! All we do here is specify the mean and variance relationship and an exponential link between the expected values and explanatory variables. It is a software issue to call this ‘quasipoisson’. Do not write in your report or paper that you used a quasi-Poisson distribution. Just say that you did a Poisson GLM, detected overdispersion, and corrected the standard errors using a quasi-GLM model where the variance is given by φ × µ, where µ is the mean and φ the dispersion parameter.

To get the numerical output for this model, use summary(M4), which gives the output below. Note that the ratio of the residual deviance to the degrees of freedom is still larger than 1, but that is no longer a problem as we now allow for overdispersion. The dispersion parameter φ is estimated as 5.93. This means that all standard errors have been multiplied by 2.43 (the square root of 5.93), and as a result, most parameters are no longer significant! We can now move on to model selection.

summary(M4)
## 
## Call:
## glm(formula = TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + 
##     SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR + D.PARK, family = quasipoisson, 
##     data = RK)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.749e+00  3.814e-01   9.830 1.86e-12 ***
## OPEN.L      -3.025e-03  3.847e-03  -0.786  0.43604    
## MONT.S       8.697e-02  3.309e-02   2.628  0.01194 *  
## SQ.POLIC    -1.787e-01  1.139e-01  -1.570  0.12400    
## SQ.SHRUB    -6.112e-01  2.863e-01  -2.135  0.03867 *  
## SQ.WATRES    2.243e-01  1.717e-01   1.306  0.19851    
## L.WAT.C      3.355e-01  1.005e-01   3.338  0.00177 ** 
## SQ.LPROAD    4.517e-01  3.282e-01   1.376  0.17597    
## SQ.DWATCOUR  7.355e-03  1.188e-02   0.619  0.53910    
## D.PARK      -1.301e-04  1.445e-05  -9.004 2.33e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 5.928003)
## 
##     Null deviance: 1071.44  on 51  degrees of freedom
## Residual deviance:  270.23  on 42  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5
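The dispersion parameter reported by summary() is estimated from the Pearson residuals; a minimal check using the model above:

sum(resid(M4, type = "pearson")^2) / M4$df.residual #phi-hat, about 5.93
sqrt(5.928003) #about 2.43, the factor by which the standard errors grew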

2. Quasi🐠

Better
# begin with the quasi model you made in the last section to deal with overdispersion

drop1(M4,test="F") #drop SQ.DWATCOUR first from model
## Single term deletions
## 
## Model:
## TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + SQ.DWATCOUR + D.PARK
##             Df Deviance F value   Pr(>F)    
## <none>           270.23                     
## OPEN.L       1   273.93  0.5739 0.452926    
## MONT.S       1   306.89  5.6970 0.021574 *  
## SQ.POLIC     1   285.53  2.3776 0.130585    
## SQ.SHRUB     1   298.31  4.3635 0.042814 *  
## SQ.WATRES    1   280.02  1.5217 0.224221    
## L.WAT.C      1   335.47 10.1389 0.002735 ** 
## SQ.LPROAD    1   281.25  1.7129 0.197727    
## SQ.DWATCOUR  1   272.50  0.3526 0.555802    
## D.PARK       1   838.09 88.2569    7e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M5<- glm(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC+
         SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD+
         D.PARK, family = quasipoisson, data = RK)

drop1(M5, test="F") #get rid of OPEN.L
## Single term deletions
## 
## Model:
## TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + D.PARK
##           Df Deviance F value    Pr(>F)    
## <none>         272.50                      
## OPEN.L     1   277.60  0.8038   0.37496    
## MONT.S     1   311.91  6.2181   0.01657 *  
## SQ.POLIC   1   285.62  2.0707   0.15739    
## SQ.SHRUB   1   300.59  4.4319   0.04115 *  
## SQ.WATRES  1   281.22  1.3765   0.24716    
## L.WAT.C    1   339.08 10.5064   0.00230 ** 
## SQ.LPROAD  1   286.31  2.1794   0.14716    
## D.PARK     1   843.43 90.0904 4.105e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M6<- glm(TOT.N ~ MONT.S + SQ.POLIC+
         SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD+
         D.PARK, family = quasipoisson, data = RK)
drop1(M6, test="F")
## Single term deletions
## 
## Model:
## TOT.N ~ MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + D.PARK
##           Df Deviance  F value    Pr(>F)    
## <none>         277.60                       
## MONT.S     1   321.03   6.8845   0.01191 *  
## SQ.POLIC   1   298.88   3.3741   0.07299 .  
## SQ.SHRUB   1   310.04   5.1428   0.02831 *  
## SQ.WATRES  1   286.46   1.4056   0.24215    
## L.WAT.C    1   339.09   9.7473   0.00317 ** 
## SQ.LPROAD  1   292.85   2.4176   0.12715    
## D.PARK     1   950.86 106.7149 2.429e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#keep doing this until you find that all you need is D.PARK in your model! WOW, so long

drop1(M5, test="F") #get rid of OPEN.L
## Single term deletions
## 
## Model:
## TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + 
##     SQ.LPROAD + D.PARK
##           Df Deviance F value    Pr(>F)    
## <none>         272.50                      
## OPEN.L     1   277.60  0.8038   0.37496    
## MONT.S     1   311.91  6.2181   0.01657 *  
## SQ.POLIC   1   285.62  2.0707   0.15739    
## SQ.SHRUB   1   300.59  4.4319   0.04115 *  
## SQ.WATRES  1   281.22  1.3765   0.24716    
## L.WAT.C    1   339.08 10.5064   0.00230 ** 
## SQ.LPROAD  1   286.31  2.1794   0.14716    
## D.PARK     1   843.43 90.0904 4.105e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M7<- glm(TOT.N ~ MONT.S + SQ.POLIC+
         SQ.SHRUB + L.WAT.C + SQ.LPROAD+
         D.PARK, family = quasipoisson, data = RK)
drop1(M7, test="F") #SQ.LPROAD is a goner
## Single term deletions
## 
## Model:
## TOT.N ~ MONT.S + SQ.POLIC + SQ.SHRUB + L.WAT.C + SQ.LPROAD + 
##     D.PARK
##           Df Deviance  F value    Pr(>F)    
## <none>         286.46                       
## MONT.S     1   322.79   5.7065  0.021157 *  
## SQ.POLIC   1   308.82   3.5122  0.067421 .  
## SQ.SHRUB   1   310.05   3.7052  0.060578 .  
## L.WAT.C    1   339.10   8.2680  0.006144 ** 
## SQ.LPROAD  1   297.73   1.7691  0.190197    
## D.PARK     1   952.16 104.5733 2.569e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M8<- glm(TOT.N ~ MONT.S + SQ.POLIC+
         SQ.SHRUB + L.WAT.C + D.PARK, family = quasipoisson, data = RK)
drop1(M8, test="F") #SQ.SHRUB is a goner
## Single term deletions
## 
## Model:
## TOT.N ~ MONT.S + SQ.POLIC + SQ.SHRUB + L.WAT.C + D.PARK
##          Df Deviance  F value    Pr(>F)    
## <none>        297.73                       
## MONT.S    1   332.50   5.3731  0.024950 *  
## SQ.POLIC  1   314.89   2.6527  0.110205    
## SQ.SHRUB  1   314.85   2.6457  0.110664    
## L.WAT.C   1   350.38   8.1353  0.006483 ** 
## D.PARK    1   953.56 101.3296 3.298e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M9<- glm(TOT.N ~ MONT.S + SQ.POLIC+
         L.WAT.C + D.PARK, family = quasipoisson, data = RK)
drop1(M9, test="F") #SQ.POLIC is a goner
## Single term deletions
## 
## Model:
## TOT.N ~ MONT.S + SQ.POLIC + L.WAT.C + D.PARK
##          Df Deviance  F value    Pr(>F)    
## <none>        314.85                       
## MONT.S    1   341.41   3.9644   0.05230 .  
## SQ.POLIC  1   335.99   3.1556   0.08214 .  
## L.WAT.C   1   354.53   5.9233   0.01880 *  
## D.PARK    1   997.18 101.8565 2.378e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M10<- glm(TOT.N ~ MONT.S + L.WAT.C + D.PARK, family = quasipoisson, data = RK)
drop1(M10, test="F") #MONT.S is a goner
## Single term deletions
## 
## Model:
## TOT.N ~ MONT.S + L.WAT.C + D.PARK
##         Df Deviance  F value    Pr(>F)    
## <none>       335.99                       
## MONT.S   1   362.28   3.7559   0.05852 .  
## L.WAT.C  1   364.70   4.1012   0.04843 *  
## D.PARK   1  1067.39 104.4902 1.228e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M11<- glm(TOT.N ~ L.WAT.C + D.PARK, family = quasipoisson, data = RK)
drop1(M11, test="F") #L.WAT.C is a goner
## Single term deletions
## 
## Model:
## TOT.N ~ L.WAT.C + D.PARK
##         Df Deviance F value    Pr(>F)    
## <none>       362.28                      
## L.WAT.C  1   390.90  3.8708   0.05481 .  
## D.PARK   1  1070.90 95.8447 4.053e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
M12<- glm(TOT.N ~ D.PARK, family = quasipoisson, data = RK)
drop1(M12, test="F") #seems we have found the best model?
## Single term deletions
## 
## Model:
## TOT.N ~ D.PARK
##        Df Deviance F value    Pr(>F)    
## <none>       390.9                      
## D.PARK  1   1071.4  87.049 1.572e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#ignoring overdispersion can lead to a VERY DIFFERENT MODEL. YIKES!
MyData=data.frame(D.PARK=seq(from=0,to=25000,by=1000))

#predict() for model predictions, on the scale of the link (log)
G<-predict(M12,
           newdata=MyData,
           type="link",
           se=T)


F<-exp(G$fit) #back-transform M12's predicted fit values from the log scale to the response scale

FSEUP<-exp(G$fit+1.96*G$se.fit)

FSELOW<-exp(G$fit-1.96*G$se.fit)
plot(RK$D.PARK, RK$TOT.N, xlab = "Distance to park", ylab = "Road kills", pch=20, col="#e76bf3")
lines(MyData$D.PARK,F,lty=1)
lines(MyData$D.PARK,FSEUP,lty=2)
lines(MyData$D.PARK,FSELOW,lty=2)
**Figure 3:** Fitted line of the optimal quasi-Poisson model using only D.PARK as the explanatory variable. Its estimated parameters, standard errors, etc. are given below and the fitted line is presented in this figure. Note that the confidence intervals around the line are now larger than before due to the overdispersion correction.


summary(M12)
## 
## Call:
## glm(formula = TOT.N ~ D.PARK, family = quasipoisson, data = RK)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.316e+00  1.194e-01  36.156  < 2e-16 ***
## D.PARK      -1.058e-04  1.212e-05  -8.735 1.24e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.630148)
## 
##     Null deviance: 1071.4  on 51  degrees of freedom
## Residual deviance:  390.9  on 50  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
IV. Model Validation

Just as in linear regression, we have to apply model validation after we have decided on the optimal GLM, and the residuals are an important tool for this. The method is the same for both GLM and GAM, but we focus just on GLM here.

There are three types of residuals (actually more; see McCullagh and Nelder 1989):

  1. Pearson residuals: not looking for normality but for lack of fit. Should not show a pattern.

  2. Deviance residuals (the R default): not looking for normality but for lack of fit. Should not show a pattern.

  3. Ordinary (response) residuals: looking for normality.

Illustrating Residuals

To explain model validation, we use the optimal quasi-Poisson GLM for the amphibian roadkills data. Recall that there was an overdispersion of 7.63 and that the only significant explanatory variable was D.PARK. Figure 4 shows the standard output from a plot command, and Figure 5 contains the response residuals, Pearson residuals, scaled Pearson residuals (the Pearson residuals divided by the square root of the overdispersion parameter), and the deviance residuals. Both figures indicate that there is a clear pattern in the residuals. Note that it is hard to detect any differences between the Pearson and deviance residuals. Some additional exploration of the residuals against other explanatory variables and spatial locations is done in Chapter 16. As in linear regression, we can also use leverage and the Cook's distance statistic; there are no influential observations. The following R code was used to produce Figures 4 and 5.

Unfortunately, the function resid ignores the overdispersion, so we need to manually divide the Pearson residuals by the square root of 7.63 (or calculate these residuals from scratch, as we do here). The rest of the code plots the residuals and should be self-explanatory.

op <- par(mfrow = c(2, 2))
plot(M12)
**Figure 4:** Standard output from a GLM function applied on the amphibian roadkills data obtained by the plot command


M12<-glm(TOT.N~D.PARK,family=quasipoisson,data=RK)
EP=resid(M12,type="pearson")
ED=resid(M12,type="deviance")
mu=predict(M12,type="response")
E=RK$TOT.N-mu
EP2=E/sqrt(7.630148*mu) #scaled Pearson residuals: resid() ignores the overdispersion, so divide by the square root of phi (7.63) times mu
op <- par(mfrow=c(2,2))
plot(x=mu,y=E,main="Response residuals")
plot(x=mu,y=EP,main="Pearson residuals")
plot(x=mu,y=EP2,main="Pearson residuals scaled")
plot(x=mu,y=ED,main="Deviance residuals")
**Figure 5:** Response residuals (observed minus fitted values, also called ordinary residuals), Pearson residuals, scaled Pearson residuals (the overdispersion is taken into account), and the deviance residuals for the optimal quasi-Poisson model applied on the amphibian roadkills data


par(op)
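The claim above that there are no influential observations can also be checked directly; a quick sketch using M12:

max(cooks.distance(M12)) #values near 1 would flag influential sites
which(hatvalues(M12) > 2 * mean(hatvalues(M12))) #a common rule of thumb for high-leverage sites, if any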

3. -💜💙 (NEGATIVE BINOMIAL)

BEST

(at least until Ch. 16, where the hurdle model, zero-inflated Poisson, and zero-inflated negative binomial models are applied)

In the previous sections of this chapter, we applied a Poisson GLM to the amphibian roadkills data set and found an overdispersion of 7.63. Consequently, all standard errors were corrected by multiplying them by the square root of 7.63 when we applied the quasi-Poisson model. An alternative approach is to apply the negative binomial model. In Chapters 11 and 16, a negative binomial GAM is applied to the amphibian roadkills data, but for illustration purposes we apply the negative binomial GLM here.

The variables

We start with the nine explanatory variables again.

library(MASS) #for the glm.nb function
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
M13 <- glm.nb(TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR + D.PARK, link = "log", data = RK)

summary(M13)
## 
## Call:
## glm.nb(formula = TOT.N ~ OPEN.L + MONT.S + SQ.POLIC + SQ.SHRUB + 
##     SQ.WATRES + L.WAT.C + SQ.LPROAD + SQ.DWATCOUR + D.PARK, data = RK, 
##     link = "log", init.theta = 5.517795385)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.951e+00  4.145e-01   9.532   <2e-16 ***
## OPEN.L      -9.419e-03  3.245e-03  -2.903   0.0037 ** 
## MONT.S       5.846e-02  3.481e-02   1.679   0.0931 .  
## SQ.POLIC    -4.618e-02  1.298e-01  -0.356   0.7221    
## SQ.SHRUB    -3.881e-01  2.883e-01  -1.346   0.1784    
## SQ.WATRES    1.631e-01  1.675e-01   0.974   0.3301    
## L.WAT.C      2.076e-01  9.636e-02   2.154   0.0312 *  
## SQ.LPROAD    5.944e-01  3.214e-01   1.850   0.0644 .  
## SQ.DWATCOUR -1.489e-05  1.139e-02  -0.001   0.9990    
## D.PARK      -1.235e-04  1.292e-05  -9.557   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.5178) family taken to be 1)
## 
##     Null deviance: 213.674  on 51  degrees of freedom
## Residual deviance:  51.803  on 42  degrees of freedom
## AIC: 390.11
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.52 
##           Std. Err.:  1.41 
## 
##  2 x log-likelihood:  -368.107
#The output is similar to the Poisson GLM output, except we also get a parameter theta, which is the k in the negative binomial variance function. We also get its standard error, but care is needed with its use as the interval is not symmetric and we are testing on the boundary. Note that as half of the regression parameters are not significant at the 5% level, a model selection is required.
AIC

The available tools for a model selection are similar to those we have seen in the previous section: hypothesis testing and using a model selection tool like the AIC. For hypothesis testing, we can use:

  1. The z-statistic (table above).

  2. Analysis of deviance tables obtained by the anova(M13, test = "Chi") command (this does sequential testing).

  3. Drop each term in turn and compare the full model with a nested model using the drop1(M13, test = "Chi") command.

  4. Manually specify a nested model, call it for example M14, and use the command anova(M13, M14, test = "Chi").

An automatic backward (or forward) selection procedure based on the AIC can be applied with the command step(M13) or stepAIC(M13). This is a major advantage over the quasi-Poisson approach, where we do not have a likelihood function and therefore cannot use the AIC or automatic selection procedures. A negative binomial model can also be overdispersed, and the approach described earlier of using the ratio of the residual deviance to the degrees of freedom can be used; in this case, there is only a small amount of overdispersion. A quasi-negative binomial option does not exist.
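The deviance-to-degrees-of-freedom check mentioned here is quick to compute for M13:

M13$deviance / M13$df.residual #51.803 / 42, about 1.23: only mild overdispersion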

Results
M14<-glm.nb(TOT.N~OPEN.L+D.PARK,link="log",data=RK) #negative binom model

summary(M14) #z-stats
## 
## Call:
## glm.nb(formula = TOT.N ~ OPEN.L + D.PARK, data = RK, link = "log", 
##     init.theta = 4.13284466)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.6717034  0.1641768  28.455   <2e-16 ***
## OPEN.L      -0.0093591  0.0031952  -2.929   0.0034 ** 
## D.PARK      -0.0001119  0.0000113  -9.901   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(4.1328) family taken to be 1)
## 
##     Null deviance: 170.661  on 51  degrees of freedom
## Residual deviance:  51.839  on 49  degrees of freedom
## AIC: 387.43
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  4.133 
##           Std. Err.:  0.980 
## 
##  2 x log-likelihood:  -379.432
drop1(M14,test="Chi") #The output from the drop1 function is given below. Both explanatory variables are significant at the 5% level.
## Single term deletions
## 
## Model:
## TOT.N ~ OPEN.L + D.PARK
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      51.839 385.43                      
## OPEN.L  1   59.731 391.32   7.891  0.004967 ** 
## D.PARK  1  154.595 486.19 102.756 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#theta is the parameter k from the variance function (see page 233 for full equation)

par(mfrow=c(2,2))
plot(M14) #validation plots for the negative binomial model

Summary (negative binom is the winner!)

Note that the analysis of deviance results give slightly different p-values compared to the z-statistics, but the biological conclusions are similar. The graphical validation plots are presented in the figure above and do not show any problems.

The model seems to suggest that the further away you are from the park, the fewer the roadkills. Open land cover also has a negative effect on roadkill numbers. So, which model is better, the quasi-Poisson or the negative binomial GLM? The answer is simple: the quasi-Poisson model has patterns in the residuals and the negative binomial has no patterns. THE NEGATIVE BINOMIAL IS THE PREFERRED MODEL.

Compare Poisson and negative binomial

Adding OPEN.L as an explanatory variable to the quasi-Poisson model does not remove the pattern. A bonus of the negative binomial GLM is that the AIC is defined, which allows us to do automatic selection procedures.

If the residual graphs do not show a clear winner, then you can also apply a test to compare the NB and Poisson GLMs; they are nested. The variance of the Poisson is \(\text{var}(Y_i) = \mu_i\), and for the NB we have \(\text{var}(Y_i) = \mu_i + \mu_i^2/k\). We can also write the variance of the NB model as \(\text{var}(Y_i) = \mu_i + \alpha \times \mu_i^2\). The models give the same variance if \(\alpha = 0\), so we can use a likelihood ratio test with null hypothesis \(H_0: \alpha = 0\). However, we are testing on the boundary again (the alternative is \(H_1: \alpha > 0\)). We saw a similar problem when we tested the significance of a random effect in Chapter 5, and the same solution of dividing the p-value by 2 can be applied. The Poisson model with OPEN.L and D.PARK is fitted with:

M15 <- glm(TOT.N ~ OPEN.L + D.PARK, family = poisson, data = RK)


llhNB = logLik(M14)
llhPoisson  =logLik(M15)
d <- 2 * (llhNB - llhPoisson)
pval <- pchisq(as.numeric(d), df=1, lower.tail=FALSE)/2

d
## 'log Lik.' 244.66 (df=4)
pval
## [1] 1.895066e-55

The statistic is equal to 244.66, and the p-value is p < 0.001. Note that we divided the p-value by 2. Hence, there is strong support for the negative binomial model. The same result can be obtained with the command odTest(M14) from the pscl package, which is not part of the base installation.

The amphibian roadkills data set is further analysed in Chapter 16. A comparison of the Poisson, quasi-Poisson, negative binomial, and three alternative models for when there are lots of zeros (the hurdle model, zero-inflated Poisson, and zero-inflated negative binomial models) is presented in Chapter 11.