Generalised Linear Models

Welcome

This RMarkdown document contains code snippets to accompany the “Session One: Introduction to Linear Modelling” workshop by Dr Ruth Kelly (TCD) & Dr Hannah White (UCD)

This RMarkdown file provides worked examples of Generalised Linear Models for the most common types of ecological data, count data and proportion data. And a further example of a Generalised Linear Mixed Model.

All datasets needed to conduct the analysis are stored this same github folder as this code file at : https://github.com/Ruth-R-Kelly/GLM_workshop_IEA_2019 or can be automatically loaded in R from packages using the code provided.

The slides from the presentations given at the workshop are also contained in this folder. In a single morning and code file, it’s impossible to give more the a brief overview of this topic.

Generalised linear models

The key difference between General Linear Models and Generalised Linear Models is that in Generalised Linear Models are designed for use with non-normal response data.

For these models you must specify an alternative distribution which represents your data (these are known as families):

Gamma - For non-normal continous data, bounded at zero
Binomial - for presence/absence data or proportions
Poisson - Count data
Negative Binomial - Overdispersed countdata
Zero-Inflated Poisson/Negative Binomial for data with a lot of zeros.

Writing these models in R is like writing a linear model except that you use the function ‘glm’ and must now specify the distribution ‘family’ and often the link function for that family, for all those listed above ‘log’ is the most commonly used link function except the Binomial for which a ‘logit’ link is most commonly used.

Note: As previously, normality should be judged based on the distribution of the model residuals (not the raw data).

Installing required R packages

The following R packages will need to be downloaded in advance of running this code file. They can be installed by removing the ‘#’ symbol infront of the line of code in the code chunk below.

# Install the package 'MASS' on the computer. 
# You will need to remove the # at the start of the next line to run this. 
# install.packages("MASS")

Loading R packages

Installing R packages downloads ‘packages’ (essentially sets of functions) onto your computer. When you want to use the functions in a particular package you need to load that package into the R workspace, to do this use the function library(). The next code chunk loads the packages that are needed to run the rest of this code file.

# Load the package 'MASS' into the R workspace
library("MASS")

Citing R packages

Because you should always do so… (the people who wrote these functions are heros who have made our lives as ecologists a million times easier). They have also made it easy to find the right paper to reference! :)

## 
## To cite the MASS package in publications use:
## 
##   Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics
##   with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     title = {Modern Applied Statistics with S},
##     author = {W. N. Venables and B. D. Ripley},
##     publisher = {Springer},
##     edition = {Fourth},
##     address = {New York},
##     year = {2002},
##     note = {ISBN 0-387-95457-0},
##     url = {http://www.stats.ox.ac.uk/pub/MASS4},
##   }

GLMs with Count Data

First example: Amphibian roadkill, (ARR) from Zuur et al. 2009

The data set consists of roadkills of amphibian species at 52 sites along a road in Portugal. Here, we examine the relationship between distance from a national park, and the number of amphibians recorded. The variable ‘distance from a national park’, is denoted by the column ‘D.PARK’. The number of amphibians is denoted by the column ‘TOT.N’.

For this workshop example I will only use these two variables, a much more complicated model with many terms could also be fitted depending on the question of interest.

## Loading in a dataset stored as a .txt file
setwd("~/PhD/Workshops/IEA Stats/GLM workshop/GLM_workshop_IEA_2019-master")
AAR <- read.table("RoadKills.txt", header = T)
## Show column names
names(AAR)

##  [1] "Sector"       "X"            "Y"            "BufoCalamita"
##  [5] "TOT.N"        "S.RICH"       "OPEN.L"       "OLIVE"       
##  [9] "MONT.S"       "MONT"         "POLIC"        "SHRUB"       
## [13] "URBAN"        "WAT.RES"      "L.WAT.C"      "L.D.ROAD"    
## [17] "L.P.ROAD"     "D.WAT.RES"    "D.WAT.COUR"   "D.PARK"      
## [21] "N.PATCH"      "P.EDGE"       "L.SDI"

## Show variable types. 
# Note D.Park is a numeric variable, TOT.N is an integer (i.e. count data whole numbers only)
str(AAR)

## 'data.frame':    52 obs. of  23 variables:
##  $ Sector      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ X           : int  260181 259914 259672 259454 259307 259189 259092 258993 258880 258767 ...
##  $ Y           : int  256546 256124 255688 255238 254763 254277 253786 253296 252809 252322 ...
##  $ BufoCalamita: int  5 1 40 27 67 56 27 37 8 16 ...
##  $ TOT.N       : int  22 14 65 55 88 104 49 66 26 47 ...
##  $ S.RICH      : int  3 4 6 5 4 7 7 7 7 6 ...
##  $ OPEN.L      : num  22.7 24.7 30.1 50.3 43.6 ...
##  $ OLIVE       : num  60.3 40.8 23.7 14.9 35.4 ...
##  $ MONT.S      : num  0 0 0.258 1.783 2.431 ...
##  $ MONT        : num  0.653 0.161 10.918 26.454 11.33 ...
##  $ POLIC       : num  4.811 2.224 1.946 0.625 0.791 ...
##  $ SHRUB       : num  0.406 0.735 0.474 0.607 0.173 ...
##  $ URBAN       : num  7.787 27.15 28.086 0.831 2.452 ...
##  $ WAT.RES     : num  0.043 0.182 0.453 0.026 0 0.039 0.114 0.224 0.177 0 ...
##  $ L.WAT.C     : num  0.583 1.419 2.005 1.924 2.167 ...
##  $ L.D.ROAD    : num  3330 2587 2150 4223 2219 ...
##  $ L.P.ROAD    : num  1.975 1.761 1.25 0.666 0.653 ...
##  $ D.WAT.RES   : num  252.1 139.6 59.2 277.8 967.8 ...
##  $ D.WAT.COUR  : num  735 134.1 269 48.8 126.1 ...
##  $ D.PARK      : num  250 741 1240 1740 2232 ...
##  $ N.PATCH     : num  122 96 67 63 59 49 35 55 52 26 ...
##  $ P.EDGE      : num  554 457 432 421 408 ...
##  $ L.SDI       : num  1.8 1.89 1.93 1.86 1.82 ...

## Show summary of all columns - Note: that TOT.N is always above 0
summary(AAR)

##      Sector            X                Y           BufoCalamita   
##  Min.   : 1.00   Min.   :258260   Min.   :231911   Min.   : 0.000  
##  1st Qu.:13.75   1st Qu.:258668   1st Qu.:238007   1st Qu.: 0.000  
##  Median :26.50   Median :259120   Median :244162   Median : 3.000  
##  Mean   :26.50   Mean   :259241   Mean   :244211   Mean   : 8.135  
##  3rd Qu.:39.25   3rd Qu.:259935   3rd Qu.:250496   3rd Qu.: 8.250  
##  Max.   :52.00   Max.   :260370   Max.   :256546   Max.   :67.000  
##      TOT.N            S.RICH          OPEN.L           OLIVE       
##  Min.   :  2.00   Min.   :1.000   Min.   : 0.738   Min.   : 0.000  
##  1st Qu.:  7.00   1st Qu.:3.000   1st Qu.:15.702   1st Qu.: 0.000  
##  Median : 17.50   Median :5.000   Median :28.533   Median : 0.674  
##  Mean   : 25.90   Mean   :4.692   Mean   :36.175   Mean   : 7.350  
##  3rd Qu.: 34.25   3rd Qu.:6.000   3rd Qu.:51.765   3rd Qu.: 8.474  
##  Max.   :104.00   Max.   :8.000   Max.   :97.574   Max.   :60.333  
##      MONT.S           MONT           POLIC             SHRUB        
##  Min.   :0.000   Min.   : 0.00   Min.   : 0.0000   Min.   :0.00000  
##  1st Qu.:0.000   1st Qu.:25.05   1st Qu.: 0.0000   1st Qu.:0.02975  
##  Median :0.000   Median :44.23   Median : 0.0600   Median :0.09300  
##  Mean   :1.077   Mean   :48.12   Mean   : 0.6025   Mean   :0.22881  
##  3rd Qu.:1.175   3rd Qu.:75.40   3rd Qu.: 0.3257   3rd Qu.:0.20950  
##  Max.   :9.426   Max.   :95.08   Max.   :11.2630   Max.   :1.74400  
##      URBAN           WAT.RES          L.WAT.C          L.D.ROAD     
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   : 105.3  
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.8135   1st Qu.:1062.1  
##  Median : 0.195   Median :0.0425   Median :1.5500   Median :2098.0  
##  Mean   : 2.771   Mean   :0.3213   Mean   :1.5617   Mean   :2119.1  
##  3rd Qu.: 1.551   3rd Qu.:0.2135   3rd Qu.:2.1990   3rd Qu.:2983.8  
##  Max.   :28.086   Max.   :6.3090   Max.   :3.9510   Max.   :4891.5  
##     L.P.ROAD        D.WAT.RES         D.WAT.COUR          D.PARK       
##  Min.   :0.5710   Min.   :  59.17   Min.   :  15.18   Min.   :  250.2  
##  1st Qu.:0.6450   1st Qu.: 381.91   1st Qu.:  91.58   1st Qu.: 6573.4  
##  Median :0.6810   Median : 674.61   Median : 196.01   Median :12719.6  
##  Mean   :0.9645   Mean   : 703.52   Mean   : 288.68   Mean   :12680.9  
##  3rd Qu.:1.1818   3rd Qu.: 958.20   3rd Qu.: 359.58   3rd Qu.:18780.1  
##  Max.   :2.9600   Max.   :1883.00   Max.   :1165.00   Max.   :24884.8  
##     N.PATCH           P.EDGE          L.SDI      
##  Min.   : 15.00   Min.   :125.2   Min.   :1.124  
##  1st Qu.: 19.00   1st Qu.:197.8   1st Qu.:1.423  
##  Median : 28.00   Median :257.4   Median :1.602  
##  Mean   : 34.40   Mean   :289.1   Mean   :1.595  
##  3rd Qu.: 40.25   3rd Qu.:377.6   3rd Qu.:1.800  
##  Max.   :122.00   Max.   :553.9   Max.   :1.963

## Draw a simple plot of the relationship between parks and number of amphibians
plot( TOT.N~ D.PARK, data = AAR, 
     pch = 16, col = "steelblue")

## Draw a histogram of the response variable, (breaks specifies the number of bars) 
hist(AAR$TOT.N, col = "pink", breaks = 10, 
     main = "Histogram of amphibian counts")

First we run a poisson GLM this is the standard first approach for count data.

The next code chunk runs this GLM and checks for over-dispersion in the residuals. If the poisson family is a good choice then we expect ‘theta’ to be close to 1 and not more than 2. If it is more than 2, we can either try adding more predictor variables to the model (not shown here), or account for this over-dispersion using a negative binomial family (shown below).

# function glm, fits a GLM with different family and links specified
mod_ARR <- glm(TOT.N ~ D.PARK, data = AAR, family = "poisson"(link = "log"))
summary(mod_ARR)

## 
## Call:
## glm(formula = TOT.N ~ D.PARK, family = poisson(link = "log"), 
##     data = AAR)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -8.1100  -1.6950  -0.4708   1.4206   7.3337  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.316e+00  4.322e-02   99.87   <2e-16 ***
## D.PARK      -1.059e-04  4.387e-06  -24.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1071.4  on 51  degrees of freedom
## Residual deviance:  390.9  on 50  degrees of freedom
## AIC: 634.29
## 
## Number of Fisher Scoring iterations: 4

#### Check for overdispersion by dividing the residual deviance by the residual df
theta <- mod_ARR$deviance/mod_ARR$df.residual
theta

## [1] 7.817937

Here, the theta value at 7.8 is much too high indicating that the data is too over-dispersed for a poisson distribution.

Therefore, we should try using a “negative binomial” family to model the data. We do this using the function glm.nb() from the package “MASS”.

# glm.nb fits negative binomial family only, therefore no need to 
# specify a negative binomial family in the code
mod_ARR_NB <- glm.nb(TOT.N ~ D.PARK, data = AAR)
summary(mod_ARR_NB)

## 
## Call:
## glm.nb(formula = TOT.N ~ D.PARK, data = AAR, init.theta = 3.681040094, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4160  -0.8289  -0.2116   0.4800   2.1346  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.411e+00  1.548e-01   28.50   <2e-16 ***
## D.PARK      -1.161e-04  1.137e-05  -10.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.681) family taken to be 1)
## 
##     Null deviance: 155.445  on 51  degrees of freedom
## Residual deviance:  54.742  on 50  degrees of freedom
## AIC: 393.09
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.681 
##           Std. Err.:  0.891 
## 
##  2 x log-likelihood:  -387.093

#### Check for overdispersion by dividing the residual deviance by the residual df
theta <- mod_ARR_NB$deviance/mod_ARR_NB$df.residual
theta

## [1] 1.094839

### very close to one.  Negative binomial distribution worked..

Further model validation. Now that we have checked for overdispersion and fixed this we should now do a bit more validation. This time looking at the model residuals to see that they show no obvious patterns or heteroscadicity.

First, we plot the “deviance residuals” against the “fitted values” and check that there is pattern in these

Then we plot the “deviance residuals” against each predictor varible in the model. In this case there is only one. Again, we don’t want to see any pattern here.

If you do see an obvious patterns you may need to a) consider non-linear relationships, b) add more explanatory varibles to the model, and/or c) concider using gls or glmm to account for nested structures or non-constant variance. See FAQ for more options here.

#scatterplot of residuals vs fitted values
plot(residuals(mod_ARR_NB, type = "deviance"), fitted(mod_ARR_NB),
     pch = 16, col = "steelblue")

#scatterplot of residuals vs predictor variable
plot(residuals(mod_ARR_NB, type = "deviance"), AAR$D.PARK,
       pch = 16, col = "dark green")

# If you have variables that you have not included, or information of times or spatial locations also plot these against the residual deviance. 
# In this case for example we could see if there's any latitudinal pattern
# in the residuals by plotting them against the column ARR$Y which is the 
# geographic Y coordinates of the data collection points. 
plot(residuals(mod_ARR_NB, type = "deviance"), AAR$Y,
     pch = 16, col = "grey40")

Know that we’re happy with our model we can look at our summary table and plot our result.

Here’s one example of how to do that, which uses the inbuilt predict function in R..

summary(mod_ARR_NB)

## 
## Call:
## glm.nb(formula = TOT.N ~ D.PARK, data = AAR, init.theta = 3.681040094, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4160  -0.8289  -0.2116   0.4800   2.1346  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.411e+00  1.548e-01   28.50   <2e-16 ***
## D.PARK      -1.161e-04  1.137e-05  -10.21   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.681) family taken to be 1)
## 
##     Null deviance: 155.445  on 51  degrees of freedom
## Residual deviance:  54.742  on 50  degrees of freedom
## AIC: 393.09
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.681 
##           Std. Err.:  0.891 
## 
##  2 x log-likelihood:  -387.093

### predict the results of the model on the original data scale 
# (i.e accounting for the log-link in this case)
pred_ambh <- predict(mod_ARR_NB, type = "response", se.fit = TRUE)
str(pred_ambh)

## List of 3
##  $ fit           : Named num [1:52] 80 75.5 71.3 67.3 63.5 ...
##   ..- attr(*, "names")= chr [1:52] "1" "2" "3" "4" ...
##  $ se.fit        : Named num [1:52] 12.2 11.2 10.2 9.3 8.5 ...
##   ..- attr(*, "names")= chr [1:52] "1" "2" "3" "4" ...
##  $ residual.scale: num 1

#### plot the raw data so that the original data can be viewed
plot(AAR$TOT.N~AAR$D.PARK, ylab = "No. of roadkill", xlab = "Distance from park",
     col = "grey30", pch = 16)
### Add predicted number of amphibians from the model
lines(pred_ambh$fit~AAR$D.PARK, type = "l", col = "red", lwd = 1.5)
### Add lines indicating standard error of model estimated means
lines(pred_ambh$fit +pred_ambh$se.fit~AAR$D.PARK, type = "l", col = "blue", lwd = 1, lty = 2) # upper standard error
lines(pred_ambh$fit - pred_ambh$se.fit~AAR$D.PARK, type = "l", col = "blue", lwd = 1, lty = 2) # lower standard error

An example with Presence/Absence data.

In this example I use a dataset collected accross peatland sites in Nothern Ireland in the year following a severe fire. The data shows the number of Meadow Pipits counted in 10 minute spot counts, one year post fire. The data was collected at 122 locations, accross 6 sites, within each site there were burnt and unburnt locations surveyed.

For this example, I examine the presence or absence of meadow pipits in burnt and unburnt areas, within three habitat classes.

## Loading in a dataset stored as a .csv file
setwd("~/PhD/Workshops/IEA Stats/GLM workshop/GLM_workshop_IEA_2019-master")
MP <- read.csv("meadow_pipits.csv", header = T)
## show number of rows and columns in the dataset. 
dim(MP)

## [1] 122   6

## Show column names
names(MP)

## [1] "Site"        "Burnt"       "Habitat"     "date2"       "MP"         
## [6] "MP_Presence"

## Show variable types. 
str(MP)

## 'data.frame':    122 obs. of  6 variables:
##  $ Site       : Factor w/ 6 levels "CH","EM","GR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Burnt      : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 2 ...
##  $ Habitat    : Factor w/ 3 levels "Bog","DH","WH": 1 1 1 1 1 1 1 3 1 1 ...
##  $ date2      : int  206 206 242 242 206 242 206 242 207 243 ...
##  $ MP         : int  0 0 0 1 2 0 0 4 0 0 ...
##  $ MP_Presence: int  0 0 0 1 1 0 0 1 0 0 ...

## Show summary of all columns 
summary(MP)

##  Site    Burnt  Habitat      date2           MP         MP_Presence    
##  CH:18   N:56   Bog:80   Min.   :174   Min.   :0.000   Min.   :0.0000  
##  EM:34   Y:66   DH :22   1st Qu.:200   1st Qu.:0.000   1st Qu.:0.0000  
##  GR:11          WH :20   Median :219   Median :0.000   Median :0.0000  
##  MN:20                   Mean   :222   Mean   :1.221   Mean   :0.4344  
##  SA:19                   3rd Qu.:243   3rd Qu.:2.000   3rd Qu.:1.0000  
##  SH:20                   Max.   :285   Max.   :9.000   Max.   :1.0000

## show number of burnt and unburnt sites where meadow pipits were observed. 
table(MP$Burnt, MP$MP_Presence)

##    
##      0  1
##   N 29 27
##   Y 40 26

## Draw a histogram of the response variable, (breaks specifies the number of bars) 
hist(MP$MP_Presence, col = "pink", 
     main = "Histogram meadow pipit data")

Here, we run a binomial GLM with a logit link function this is the standard first approach for presence/absence data.

# function glm, fits a GLM with different family and links specified
mod_MP <- glm(MP_Presence ~ Burnt + Habitat, data = MP, family = "binomial"(link = "logit"))
summary(mod_MP)

## 
## Call:
## glm(formula = MP_Presence ~ Burnt + Habitat, family = binomial(link = "logit"), 
##     data = MP)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.263  -1.026  -0.946   1.202   1.498  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05854    0.29820  -0.196    0.844
## BurntY      -0.36233    0.36870  -0.983    0.326
## HabitatDH   -0.30782    0.49913  -0.617    0.537
## HabitatWH    0.25792    0.50290   0.513    0.608
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 167.02  on 121  degrees of freedom
## Residual deviance: 165.25  on 118  degrees of freedom
## AIC: 173.25
## 
## Number of Fisher Scoring iterations: 4

###

Checking the the residuals of a binomial GLM (presence/absence) data

Examine the deviance residuals, values greater than 2 indicate a lack of fit.

### extract model residuals and summarise their distribution. 
dev_resid <- resid(mod_MP, type = "deviance")
summary(dev_resid)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.26317 -1.02631 -0.94605 -0.04237  1.20237  1.49830

Here this is fine.

Examining the model summary shows that there is no significant difference between burnt and unburnt areas or between habitat types in terms of the presence or absence of meadow pipits.

A handy trick, in order to get a significance value a categorical variable such as ‘habitats’ as a whole instead of comparisons between each level, e.g. bog vs dry heath, dry heath vs. wet heath we can run the model with and without the variable habitat and compare the two models using the function ‘anova’. When you are running larger models you may find it convenient to do this using the function ‘Anova’ in the package ‘car’ (e.g. Anova(mod_MP, type = “III”)).

# print model summary
summary(mod_MP)

## 
## Call:
## glm(formula = MP_Presence ~ Burnt + Habitat, family = binomial(link = "logit"), 
##     data = MP)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.263  -1.026  -0.946   1.202   1.498  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05854    0.29820  -0.196    0.844
## BurntY      -0.36233    0.36870  -0.983    0.326
## HabitatDH   -0.30782    0.49913  -0.617    0.537
## HabitatWH    0.25792    0.50290   0.513    0.608
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 167.02  on 121  degrees of freedom
## Residual deviance: 165.25  on 118  degrees of freedom
## AIC: 173.25
## 
## Number of Fisher Scoring iterations: 4

# To check the overall significance of the inclusion of habitat as a variable in the model. 
mod_burn_only <- glm(MP_Presence ~ Burnt, data = MP, family = "binomial"(link = "logit"))
### Compare these two models (with and without habitat included)
anova(mod_MP, mod_burn_only, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: MP_Presence ~ Burnt + Habitat
## Model 2: MP_Presence ~ Burnt
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       118     165.25                     
## 2       120     166.06 -2 -0.81554   0.6651

### Here, the p value indicates that there is no significant difference 
### between the models with and without habitat.