An example with count data.

Let’s assume you have been counting fish on coral reefs, so your data are counts. If the counts are large they may well look pretty normal, but count data have some important characteristics: counts are integers, whereas the normal distribution is for continuous data that can take any fractional value. Counts also can’t be less than zero, whereas the normal distribution models stochastic processes that can produce negative values. A count can be 0, but it can never be negative.

Statisticians have described many distributions for counts, but one of the simplest is the Poisson distribution. It is a model of non-negative integers (0, 1, 2, …). It has one parameter, λ, which is both its mean and its variance. Let’s see what that looks like with some simple R code to draw random numbers from two Poisson distributions:

n <- 1000
set.seed(42)
x1 <- rpois(n, lambda = 1)
x10 <- rpois(n, lambda = 10)
mean(x1)
## [1] 0.976
var(x1)
## [1] 0.9663904
mean(x10)
## [1] 10.083
var(x10)
## [1] 10.75086

We just sampled random numbers from two Poisson distributions with means of 1 and 10. Notice that the means and variances of each are approximately equal (not exactly equal, because we drew a finite random sample). You can think of this sampling from the Poisson as a model of count data. Let’s see what the distributions look like:

par(mfrow=c(1,2))
hist(x1, xlim = c(0, 25), breaks = seq(0, 25, by = 1))
hist(x10, xlim = c(0, 25), breaks = seq(0, 25, by = 1))

If lambda is 1 what do you find? What about if lambda is 10?

  • When lambda is 1, the probability mass is greatest at 0 and 1 and then decreases roughly exponentially. When lambda is 10, the distribution starts to resemble a normal distribution.
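You can check this against the exact Poisson probabilities with dpois():

dpois(0:4, lambda = 1)   # P(0) = P(1) = exp(-1), about 0.37, then the mass falls off
dpois(8:12, lambda = 10) # the mass is centred near 10 and looks roughly bell-shaped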

So far our Poisson model only has one parameter, a mean (and variance) we refer to as λ. But what if we wanted the mean to change? For instance, we might have counted fish on different types of coral reefs and we want to test whether there are different abundances on each type of reef. Or we might have counted fish across a gradient of pollution and we want to know how their numbers change from low to high pollution. We will call these hypothesized causes of changes in fish counts ‘covariates’. Others might call them explanatory variables, treatments (if experiments) or predictor variables.

We will use Generalized Linear Models, so we can include the covariates as variables in a linear equation; after all, that is what we do with linear regression (and general linear models).

\(Y \sim \beta_0 + \beta_1 X + \epsilon\), or \(Y = a + BX\) for simplicity below.

Let’s generate some sample data ourselves. We will assume pollution is measured on a zero-to-one (low to high) scale, that the mean number of fish at zero pollution is 4, and that on average there are no fish when the pollution level reaches 0.5 (half the maximum).

n <- 50
beta <- -8 #effect of polluted reefs
alpha <- 4 #intercept = mean at 'zero' pollution
x <- seq(0,1, length.out = n) #pollution levels
ymean <- alpha + beta*x # The y = mx + b formulation for simplicity; think of it as fish count ~ B0 + B1*pollution + error
plot(x, ymean, type = "l", xlab = "Pollution Level", ylab ="Number of Fish Counted")
abline(h = 0, lty = 2, lwd = 2)

There is something odd about this model: we are predicting negative fish (on average) for pollution levels over 0.5.

It gets even worse if we model sampling with a normal distribution:

set.seed(55)
yobs_normal <- ymean + rnorm(n) #Here we're adding random variation to the mean, using the mean count above
plot(x, ymean, type = "l", xlab = "Pollution Level", ylab = "Number of Fish Counted")
points(x, yobs_normal)
abline(h = 0, lty = 2, lwd = 2)

Generalized Linear Models

Link functions elegantly solve the problem of using linear models with non-normal data. There are many types of link functions, but we will look at one that is popular for count data: the log link.

gamma <- -3.2 # Effect of polluted reefs (equivalent to B1, or B) for illustration
alpha <- 4 # Intercept = mean at 'zero' pollution
yexp <- alpha * exp(gamma*x)
plot(x, yexp, type = "l", xlab = "Pollution Level", ylab = "Number of Fish Counted")
abline(h = 0, lty = 2, lwd = 2)

Here we have the equation \(y = \alpha e^{\gamma x}\), which is the same as the linear equation for \(\log(y)\): \(\log(y) = \log(\alpha) + \gamma x\). Note we retained alpha = 4 in both, because in both equations alpha is the expected value at a pollution level of zero.

The slope parameter was changed in the log-linear equation to gamma because it is not a direct analogue of our slope parameter beta above.
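You can verify the log-linear form directly: on the log scale the exponential curve is a straight line with intercept log(alpha) and slope gamma.

plot(x, log(yexp), type = "l", xlab = "Pollution Level", ylab = "log(Number of Fish)")
coef(lm(log(yexp) ~ x)) # recovers log(4) ~ 1.386 and gamma = -3.2 exactly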

One of the nice things about the log-linear equation is that the slope parameter now represents multiples of change. For instance, gamma = -3.2 means abundance declines about 25-fold (= 1/exp(-3.2)) going from a pollution level of 0 to 1, and about five-fold (= 1/exp(-3.2*0.5)) going from a pollution level of 0 to 0.5.
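A quick check of those multiples:

1 / exp(-3.2)       # ~24.5-fold decline going from pollution 0 to 1
1 / exp(-3.2 * 0.5) # ~5-fold decline going from pollution 0 to 0.5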

Now we can use this exponential curve as the mean (and variance!) of a Poisson distribution.

yobs_pois <- rpois(n, yexp) # Just as we drew random numbers from a normal distribution above, we now draw counts from a Poisson distribution
# You can look up ?rpois if you don't know what this function is doing
plot(x, yexp, type = "l", xlab = "Pollution Level", ylab = "Number of Fish Counted", ylim = c(0, 8))
points(x, yobs_pois)

Note that the fish counts are all non-negative. And, as the means get smaller, the variances also get smaller (a consequence of lambda being both the mean and the variance).

In the above example we made up the “true” population mean for the purposes of explanation. But in the real world, we sample and hope our sample is a good enough representation of the population to draw inference.

Fitting the GzLM

Since we introduced some random variability into our generated data, we can pretend the above was a “sample”. We can now use a generalized linear model to estimate the effect of the pollution covariate with R’s glm() function. The syntax is much the same as with lm(), but we now specify the family and link function, the third component that is characteristic of generalized linear models (in addition to the random (Y) and systematic (X) components).

m1 <- glm(yobs_pois ~ x, family = poisson(link = "log"))
summary(m1)
## 
## Call:
## glm(formula = yobs_pois ~ x, family = poisson(link = "log"))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.4097     0.1926   7.319 2.50e-13 ***
## x            -3.3456     0.5625  -5.947 2.73e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 94.763  on 49  degrees of freedom
## Residual deviance: 49.485  on 48  degrees of freedom
## AIC: 123.71
## 
## Number of Fisher Scoring iterations: 5
coef(m1)
## (Intercept)           x 
##    1.409704   -3.345646

The values we printed give the estimates for the intercept and slope coefficients. How do these relate to the log(alpha) and gamma we used to generate the data?

We have specified above the type of distribution to use (family = poisson()) and which link to use. “log” is in fact the default choice for the Poisson family, but I put it there so you know you can change it.

Technically we would say we fit a Generalized Linear Model with Poisson errors and a log link function. We talk about Poisson errors (not Poisson data), because it is the leftover variation after we fit the model (i.e., the error or residuals) that we are assuming is Poisson distributed, in the same way we assume normally distributed residuals with a Gaussian distribution.

The model actually doesn’t make any assumptions about how the raw data are distributed, so long as they are integers and non-negative. Note that the data can contain zeros, but the mean of the Poisson is always >0.

What do the coefficients mean? Remember they are on the log scale. So the mean abundance at a pollution level of zero is the exponentiated value of the intercept, and a change in pollution from 0 to 1 corresponds to an estimated 1/exp(slope)-fold decline in fish abundance. You can check these numbers with the math below.

exp(coef(m1)[1]) #Very close to 4, as set up in our data generation
## (Intercept) 
##    4.094745
1/exp(coef(m1)[2]) #Similar to the 25 times decline mentioned above
##        x 
## 28.37889

We can plot the fitted model with standard errors along with our “true mean” from our simulated data.

ypredict <- predict(m1, type = "response", se.fit = TRUE)
plot(x, yexp, type = "l",
     xlab = "Pollution Level",
     ylab = "Number of Fish Counted",
     ylim = c(0,8))
lines(x, ypredict$fit, lwd = 2, col = "red", lty = 2)
  # Add lines for standard errors
lines(x, ypredict$fit + ypredict$se.fit, lty = 3, col = "red") #Pull the SE out of the predicted values
lines(x, ypredict$fit - ypredict$se.fit, lty = 3, col = "red")
  # Plot observations
points(x, yobs_pois)
legend("topright", legend = c("True Mean", "Estimated Mean"), lwd = 2, lty = c(1,2), col = c("black", "red"))

You can see the fitted line falls close to the ‘true’ line, and the standard errors are pretty tight around our best estimate.

The fitting algorithm itself is attempting to maximize the log-likelihood of the parameters given the observations (in technical speak). You can refresh your understanding of maximum likelihood estimation if needed.
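As a sanity check on that idea, the Poisson log-likelihood evaluated at the fitted coefficients matches what glm() reports via logLik():

eta <- coef(m1)[1] + coef(m1)[2] * x # the fitted linear predictor (log scale)
sum(dpois(yobs_pois, lambda = exp(eta), log = TRUE)) # log-likelihood by hand
logLik(m1) # should agree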

We wanted to fit a linear function to data that can’t be less than zero, because linear functions are convenient to work with. So we used a log link function to describe the mean and to ensure that the mean is always greater than zero. We ended up with a model where the slope describes multiples of change in fish abundance over the pollution gradient. So the model itself is actually multiplicative, not additive.

If you think about it, natural processes that generate counts often are multiplicative, not additive. For instance, we may talk about ‘fish multiplying’ when they breed, because population growth can be exponential.

So our mathematically convenient link function actually ended up being a better description of the natural process.

The last thing we want to check is whether our residuals are Poisson distributed (an assumption of a GzLM with a Poisson distributional family). What this means for Pearson residuals (the residual divided by the square root of the variance) is that they should have constant spread, just like normal residuals. Examining the residuals for these models is more complicated in reality, and if you end up down this road, you can evaluate it further.

plot(m1)
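To see what the Pearson residual is doing, you can compute it by hand; for a Poisson model the variance equals the mean, so the divisor is sqrt(mu):

mu <- fitted(m1) # fitted means on the response scale
pearson_by_hand <- (yobs_pois - mu) / sqrt(mu)
all.equal(unname(pearson_by_hand), unname(resid(m1, type = "pearson")))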

To Transform or to use GzLM?

What’s the difference between a log link and log-transforming your data? The answer is insightful for understanding how link functions work.

Take the data we generated above and fit two GLMs (you will have to add a small number, a kludge factor, so you can log the zeros; not ideal but a common practice).

yobsplus <- yobs_pois + 0.1
model1 <- glm(yobsplus ~ x, family = gaussian(link = "log"))
model2 <- glm(log(yobsplus) ~ x, family = gaussian(link = "identity"))
summary(model1)
## 
## Call:
## glm(formula = yobsplus ~ x, family = gaussian(link = "log"))
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.3943     0.1330  10.487 5.22e-14 ***
## x            -2.9649     0.5925  -5.004 7.95e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.295963)
## 
##     Null deviance: 120.000  on 49  degrees of freedom
## Residual deviance:  62.205  on 48  degrees of freedom
## AIC: 158.81
## 
## Number of Fisher Scoring iterations: 6
summary(model2)
## 
## Call:
## glm(formula = log(yobsplus) ~ x, family = gaussian(link = "identity"))
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.0466     0.3223   3.247  0.00213 ** 
## x            -3.4193     0.5554  -6.156 1.45e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.337798)
## 
##     Null deviance: 114.919  on 49  degrees of freedom
## Residual deviance:  64.214  on 48  degrees of freedom
## AIC: 160.4
## 
## Number of Fisher Scoring iterations: 2

In the first model we fitted a Gaussian distribution (normally distributed errors) with a log link. In the second we fitted a Gaussian distribution to the logged response variable (log(y)) with the identity link (which is no link at all; functionally this is just an lm).

Now compare the results. Notice that the estimate of the slope is quite different. Why is this?

ypredict1 <- predict(model1, type = "response", se.fit = TRUE)
ypredict2 <- predict(model2, type = "response", se.fit = TRUE)
plot(x, yexp, type = "l",
     xlab = "Pollution Level",
     ylab = "Number of Fish Counted",
     ylim = c(-2,8))
lines(x, ypredict1$fit, lwd = 2, col = "red", lty = 2)
lines(x, ypredict2$fit, lwd = 2, col = "blue", lty = 2)
  # Add lines for standard errors (pull the SE out of each model's predictions)
lines(x, ypredict1$fit + ypredict1$se.fit, lty = 3, col = "red")
lines(x, ypredict1$fit - ypredict1$se.fit, lty = 3, col = "red")

lines(x, ypredict2$fit + ypredict2$se.fit, lty = 3, col = "blue")
lines(x, ypredict2$fit - ypredict2$se.fit, lty = 3, col = "blue")

  # Plot observations

points(x, yobs_pois)

The model with the log link is fitting the mean on the log scale, but the Gaussian errors are on the natural scale. So the residual (or error) variance will be constant across all mean values of y.

The model with the logged data and identity link is fitting both the mean and the variance on the log scale. So if we back-transform log(y) to y (exponentiate it), the variance will change with the mean.

So a log link isn’t the same as a log transformation. The transformation changes the raw data. The link function doesn’t touch the raw data, instead you can think of it as a transformation of the model for the mean of the raw data.
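You can see this distinction in predict(): on the link scale the model returns log(mu), and type = "response" simply exponentiates it; the raw data are untouched either way.

head(predict(model1, type = "link")) # the model for the mean, on the log scale
head(exp(predict(model1, type = "link"))) # back-transformed by hand...
head(predict(model1, type = "response")) # ...identical to the response-scale predictions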

Presence/Absence: A Binomial Response

In the 1980s a group of researchers did groundbreaking work on ecosystem subsidies by looking at spiders on island ecosystems. As part of an investigation into what controls spider populations on these islands, Polis et al. (1988) recorded the physical and biological characteristics of islands in the Gulf of California. One of the outcomes was an analysis of the presence/absence (PA) of a spider predator, the Uta lizard. The predictor was the island perimeter-to-area ratio (RATIO). Because the response variable (Uta lizard presence/absence) is binary, the data are non-normal and we cannot use lm(). Conduct a logistic regression to determine whether perimeter:area ratio is a good predictor of presence/absence of Uta lizards.

Explore the data on your own. Fit an appropriate model (see code below). Plot expected outcomes. (There are some chunks of code below to help you along with some of the details.)

library(readr) # for read_csv(); also loaded with the tidyverse
polis <- read_csv("Data/polis.csv")
## Rows: 19 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): ISLAND
## dbl (2): RATIO, PA
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(polis)
## # A tibble: 6 × 3
##   ISLAND     RATIO    PA
##   <chr>      <dbl> <dbl>
## 1 Bota       15.4      1
## 2 Cabeza      5.63     1
## 3 Cerraja    25.9      1
## 4 Coronadito 15.2      0
## 5 Flecha     13.0      1
## 6 Gemelose   18.8      0

When you build your model, name it polis.glm, which will facilitate some of the steps below. To fit the logistic regression, we use the glm() function and the family argument, here family = binomial, to account for the binary nature of the data. What is the canonical link for the binomial family? If you don’t know, you can use the help to look up the family argument. Other than the different function, glm(), used to fit generalized linear models, the fitting syntax is pretty much the same as with lm().

# ?family

  • The canonical link for the binomial family is "logit".

Fit a binomial glm for PA~RATIO.

polis.glm <- glm(PA ~ RATIO, data = polis, family = binomial(link = "logit"))
summary(polis.glm)
## 
## Call:
## glm(formula = PA ~ RATIO, family = binomial(link = "logit"), 
##     data = polis)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   3.6061     1.6953   2.127   0.0334 *
## RATIO        -0.2196     0.1005  -2.184   0.0289 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26.287  on 18  degrees of freedom
## Residual deviance: 14.221  on 17  degrees of freedom
## AIC: 18.221
## 
## Number of Fisher Scoring iterations: 6

Just like for lm() and for glm() with a Poisson distribution above, we get output for the intercept and slope. Here the “slope” (which is actually the log-odds ratio) uses the z statistic (the standard normal deviate) to test for differences from 0. The z value is the ratio of the estimated coefficient to its standard error: it measures how many standard errors the estimated coefficient is away from zero, and a larger absolute z value indicates a more statistically significant coefficient. Here it is a little more than 2 standard errors from zero, which means the probability is < 0.05. Indeed, the p-value is 0.029.
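A quick check of that ratio from the coefficient table:

cf <- summary(polis.glm)$coefficients
cf[, "Estimate"] / cf[, "Std. Error"] # reproduces the z value column
2 * pnorm(-abs(cf[, "z value"])) # and the two-sided p-values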

Interpretation of logistic regression is not as straightforward as other forms of regression because the concept of the slope is different. Here, we have a presence (1) or absence (0) and we’re trying to find the level of the predictor at which we are more likely to have one or the other (the cutoff between predicting a 1 vs a 0 is typically set at the 50% point). The standard logistic regression function for predicting the outcome of an observation given a predictor variable (x) is an s-shaped curve defined as p = exp(y) / [1 + exp(y)]. So, the model estimates a “slope” but it is really the log-odds ratio. There is more information in any advanced modeling book, should you need to go down this road.
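For instance, the back-transformation by hand agrees with R’s built-in inverse logit, plogis():

eta <- 3.6061 - 0.2196 * 20 # linear predictor at RATIO = 20, using the fitted coefficients
exp(eta) / (1 + exp(eta)) # predicted probability of presence, about 0.31
plogis(eta) # the same, via the built-in inverse logit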

So, we can say that perimeter-to-area ratio (RATIO) is a significant predictor of Uta lizard presence and that the probability of lizard presence decreases with increasing perimeter:area ratio (slope = -0.22), where the regression equation is p = exp(3.61 - 0.22*RATIO) / [1 + exp(3.61 - 0.22*RATIO)].

We can calculate the predicted values based on the fitted model (make sure you know what each step is doing here).

# Create a new data frame on which to make predictions
xs <- data.frame(RATIO = seq(0,70, length.out = 1000))
polis.predict <- predict(polis.glm, type = "response", se.fit = TRUE, newdata = xs)


polis.predict <- data.frame(c(xs, polis.predict))
head(polis.predict)
##        RATIO       fit     se.fit residual.scale
## 1 0.00000000 0.9735597 0.04363801              1
## 2 0.07007007 0.9731608 0.04410907              1
## 3 0.14014014 0.9727560 0.04458412              1
## 4 0.21021021 0.9723453 0.04506318              1
## 5 0.28028028 0.9719286 0.04554626              1
## 6 0.35035035 0.9715058 0.04603336              1
# To get the confidence bands, you will need the fit at each point plus or minus the SE at each point - this can be vectorized:
# e.g. to get the upper and lower confidence bands
upper <- polis.predict$fit + polis.predict$se.fit
lower <- polis.predict$fit - polis.predict$se.fit

We can generate a plot using ggplot, which will hopefully make this clearer.

library(ggplot2) # if not already loaded via the tidyverse
ggplot(polis.predict, aes(x = RATIO)) +
  geom_line(aes(y = fit)) +
  geom_line(aes(y = fit + se.fit), lty = 3, col = "purple") +
  geom_line(aes(y = fit - se.fit), lty = 3, col = "purple") +
  geom_point(data = polis, aes(x = RATIO, y = PA), size = 2) +
  xlab("Island Perimeter to Area Ratio") +
  ylab("Presence/Absence of Uta Lizard")

So, when the perimeter-to-area ratio exceeds about 16 (where the predicted probability of presence crosses 0.5), the expected presence of Uta lizards drops off.
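That drop-off point is where the linear predictor equals zero, i.e. at -intercept/slope:

-coef(polis.glm)[1] / coef(polis.glm)[2] # RATIO at which predicted p = 0.5, about 16.4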

There are a few other things we might want to assess:

Overdispersion: the variance of the response is greater than what’s assumed by the model. This is more of a problem in Poisson-type models, but I demonstrate the code here.

# To calculate Pearson chi^2 p-value to evaluate dispersion (overdispersion):

pp <- sum(resid(polis.glm, type = "pearson")^2)
1 - pchisq(pp, polis.glm$df.residual)
## [1] 0.5715331

0.57 is a very high p-value: no evidence of overdispersion.

# To calculate the deviance (G^2) p-value
1 - pchisq(polis.glm$deviance, polis.glm$df.residual)
## [1] 0.6514215
# To evaluate the strength of the association (an R^2 equivalent, but remember it is not really R^2 because we are not using squared error)
1 - (polis.glm$deviance / polis.glm$null.deviance)
## [1] 0.4590197

HW

Problem

Follow the example in Zuur et al. 2009 regarding roadkills. The data set consists of amphibian roadkills at 52 sites along a road in Portugal. Using the roadkills data, you will model the number of amphibian roadkills (TOT.N) using the explanatory variables: distance to the natural park (D.PARK), montado with shrubs (MONT.S), polyculture (POLIC), and distance to water reservoirs (D.WAT.RES). You should decide between assuming a normal distribution or a Poisson distribution and fit the appropriate model.

Go through the normal steps to evaluate your data, assess model assumptions, fit model(s), select the best-fitting model, and plot your outcomes, with a conclusion stated in plain language about what you find.

You may use the reference for guidance. While this dataset is used as an example there, these particular covariates are not necessarily used in combination; nevertheless, you can use the process and steps provided to work your way through fitting and evaluating a model. Some functions may have changed since the publication of the book.

Assessing assumptions.

A couple of variables, namely distance to the natural park (D.PARK) and distance to water reservoirs (D.WAT.RES), appear to be roughly normally distributed. However, montado with shrubs (MONT.S) and polyculture (POLIC) are much more indicative of a Poisson distribution with a low lambda. All variables are greater than or equal to zero in their values, and moreover, the response variable is count data, so a Poisson distribution is the most appropriate for the model.
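A quick way to eyeball those distributions (a sketch, assuming the roadkill data frame from Zuur et al. 2009 is loaded with the variable names used here):

par(mfrow = c(2, 3))
hist(roadkill$TOT.N, main = "TOT.N (response)")
hist(roadkill$D.PARK, main = "D.PARK")
hist(roadkill$D.WAT.RES, main = "D.WAT.RES")
hist(roadkill$MONT.S, main = "MONT.S")
hist(roadkill$POLIC, main = "POLIC")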

variables              r            t.stat       p.value
D.PARK * D.WAT.RES     0.5585286    4.761258     0.0000169
POLIC * D.WAT.RES     -0.2971373   -2.200463     0.0324215
D.PARK * POLIC        -0.3211196   -2.397641     0.0202770

Three explanatory variables are significantly correlated with each other: distance to water reservoirs, distance to natural park, and polyculture. Since only one pair (distance to water reservoirs and distance to natural park) has a correlation coefficient greater than 0.5, I will include just that interaction in the initial model.

For a GLM, the variance does not need to be constant (and for a Poisson model it cannot be, since the variance equals the mean).

Fitting the model.

I first fit a Generalized Linear Model with Poisson errors.

roadkill.glm <- glm(TOT.N ~ MONT.S + POLIC + D.PARK*D.WAT.RES, data = roadkill, family = poisson)
summary(roadkill.glm)
## 
## Call:
## glm(formula = TOT.N ~ MONT.S + POLIC + D.PARK * D.WAT.RES, family = poisson, 
##     data = roadkill)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       3.982e+00  9.314e-02  42.754  < 2e-16 ***
## MONT.S            5.706e-02  1.280e-02   4.459 8.25e-06 ***
## POLIC             1.195e-02  1.452e-02   0.823  0.41041    
## D.PARK           -9.985e-05  9.464e-06 -10.551  < 2e-16 ***
## D.WAT.RES         6.536e-04  1.383e-04   4.727 2.28e-06 ***
## D.PARK:D.WAT.RES -2.866e-08  1.075e-08  -2.666  0.00768 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.44  on 51  degrees of freedom
## Residual deviance:  339.56  on 46  degrees of freedom
## AIC: 590.95
## 
## Number of Fisher Scoring iterations: 5

Three of the four explanatory variables are significant in the model: montado with shrubs (\(z = 4.46, p < 0.001\)), distance to natural park (\(z = -10.55, p < 0.001\)), and distance to water reservoirs (\(z = 4.73, p < 0.001\)). The interaction between distance to natural park and distance to water reservoirs was also significant (\(z = -2.67, p = 0.008\)). Polyculture did not contribute significantly to the model (\(z = 0.82, p = 0.41\)).

Model selection.

I made models including all main effects and with interactions between correlated variables, then used AIC comparison to select the best fit (a sketch of this workflow follows the table below). I used the drop1() function on the model with no interactions, and manually created models with variations of interacting variables. After testing models with polyculture interactions, I dropped polyculture from subsequent models as it had no significant effect in the original model.

     df   AIC         formula
m1   6    590.9489    TOT.N ~ MONT.S + POLIC + D.PARK * D.WAT.RES
m2   5    596.4076    TOT.N ~ MONT.S + POLIC + D.PARK + D.WAT.RES
m3   9    493.4396    TOT.N ~ MONT.S + POLIC * D.PARK * D.WAT.RES
m4   6    524.4848    TOT.N ~ MONT.S + POLIC * D.PARK + D.WAT.RES
m5   6    596.8439    TOT.N ~ MONT.S + POLIC * D.WAT.RES + D.PARK
m6   5    589.6117    TOT.N ~ MONT.S + D.PARK * D.WAT.RES
m7   4    605.9879    TOT.N ~ D.PARK * D.WAT.RES
m8   3    1196.5657   TOT.N ~ MONT.S + D.WAT.RES
m9   3    610.0853    TOT.N ~ MONT.S + D.PARK

Through AIC comparison, model “m3” (\(TOT.N \sim MONT.S + POLIC * D.PARK * D.WAT.RES\)) is the best fit for the data (AIC = 493.44). This model includes interactions between distance to natural park, distance to water reservoirs, and polyculture.
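A minimal sketch of that workflow (assuming the roadkill data frame is loaded and the models are named as in the table above):

m1 <- glm(TOT.N ~ MONT.S + POLIC + D.PARK * D.WAT.RES, data = roadkill, family = poisson)
m3 <- glm(TOT.N ~ MONT.S + POLIC * D.PARK * D.WAT.RES, data = roadkill, family = poisson)
drop1(m1, test = "Chisq") # single-term deletions from the initial model
AIC(m1, m3)               # lower AIC indicates the better-supported model
summary(m3)               # the best-fit model; output below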

## 
## Call:
## glm(formula = TOT.N ~ MONT.S + POLIC * D.PARK * D.WAT.RES, family = poisson, 
##     data = roadkill)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             4.513e+00  1.104e-01  40.872  < 2e-16 ***
## MONT.S                  5.265e-02  1.373e-02   3.836 0.000125 ***
## POLIC                  -3.546e-01  6.087e-02  -5.826 5.69e-09 ***
## D.PARK                 -1.632e-04  1.211e-05 -13.474  < 2e-16 ***
## D.WAT.RES               3.189e-04  1.664e-04   1.916 0.055323 .  
## POLIC:D.PARK            1.080e-04  1.480e-05   7.300 2.89e-13 ***
## POLIC:D.WAT.RES        -1.133e-04  1.696e-04  -0.668 0.504052    
## D.PARK:D.WAT.RES        1.231e-08  1.309e-08   0.940 0.347097    
## POLIC:D.PARK:D.WAT.RES -5.060e-08  1.677e-08  -3.018 0.002544 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.44  on 51  degrees of freedom
## Residual deviance:  236.05  on 43  degrees of freedom
## AIC: 493.44
## 
## Number of Fisher Scoring iterations: 5

Plots show real data (black points) with each main effect on the x axis. Green lines show the model predictions, +/- standard error.

There don’t appear to be any strong patterns in the residuals.

Total number of amphibian roadkills is best predicted by a generalized linear model with main effects of montado with shrubs, distance to natural park, distance to water reservoirs, and polyculture, with interaction terms for the latter three variables. The predicted number of roadkills generally decreases with increased distance to natural parks or water reservoirs, and increases then decreases as polyculture and montado with shrubs increase.

I understand I need to compare the Pearson and deviance residuals to the fitted values and the explanatory variables to look for patterns, but I cannot figure out how to calculate the dispersion parameter, which I need to calculate the scaled Pearson residuals.
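For what it’s worth, one common estimate of the dispersion parameter is the Pearson chi-square statistic divided by the residual degrees of freedom (a sketch, assuming the selected model is stored as m3):

phi <- sum(resid(m3, type = "pearson")^2) / m3$df.residual # estimated dispersion
scaled_pearson <- resid(m3, type = "pearson") / sqrt(phi)  # scaled Pearson residuals
plot(fitted(m3), scaled_pearson) # look for patterns against the fitted values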

Concept

Explain why generalized linear models are preferred to transformations. You must use your own words to describe this, but feel free to consult sources (just be sure to cite them, where relevant).


While transformations can sometimes allow for simpler linear models, generalized linear models are often more flexible and interpretable than models fit to transformed data. For one, if an explanatory variable is categorical, it is difficult to transform the data, especially if the categories are non-sequential in nature. Transformed data can be hard to interpret, especially when there are multiple explanatory variables, and their relation to the response variable can be unclear. Furthermore, not all violations of normality in the errors can be corrected with simple transformations.

Using generalized linear models requires only a single transformation: that of the model linking the explanatory variables to the response. We can therefore use the input data in their original state, regardless of distribution or homoskedasticity, and use a probability distribution appropriate to the response variable to translate the model onto the correct scale.