This RMarkdown document contains code snippets to accompany the “Session One: Introduction to Linear Modelling” workshop by Dr Ruth Kelly (TCD) & Dr Hannah White (UCD)
This RMarkdown file provides worked examples of Generalised Linear Models for the most common types of ecological data, count data and proportion data. And a further example of a Generalised Linear Mixed Model.
All datasets needed to conduct the analysis are stored this same github folder as this code file at : https://github.com/Ruth-R-Kelly/GLM_workshop_IEA_2019 or can be automatically loaded in R from packages using the code provided.
The slides from the presentations given at the workshop are also contained in this folder. In a single morning and code file, it’s impossible to give more the a brief overview of this topic.
For these models you must specify an alternative distribution which represents your data (these are known as families):
Writing these models in R is like writing a linear model except that you use the function ‘glm’ and must now specify the distribution ‘family’ and often the link function for that family, for all those listed above ‘log’ is the most commonly used link function except the Binomial for which a ‘logit’ link is most commonly used.
Note: As previously, normality should be judged based on the distribution of the model residuals (not the raw data).
The following R packages will need to be downloaded in advance of running this code file. They can be installed by removing the ‘#’ symbol infront of the line of code in the code chunk below.
# Install the package 'MASS' on the computer.
# You will need to remove the # at the start of the next line to run this.
# install.packages("MASS")
Installing R packages downloads ‘packages’ (essentially sets of functions) onto your computer. When you want to use the functions in a particular package you need to load that package into the R workspace, to do this use the function library(). The next code chunk loads the packages that are needed to run the rest of this code file.
# Load the package 'MASS' into the R workspace
library("MASS")
Because you should always do so… (the people who wrote these functions are heros who have made our lives as ecologists a million times easier). They have also made it easy to find the right paper to reference! :)
##
## To cite the MASS package in publications use:
##
## Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics
## with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0
##
## A BibTeX entry for LaTeX users is
##
## @Book{,
## title = {Modern Applied Statistics with S},
## author = {W. N. Venables and B. D. Ripley},
## publisher = {Springer},
## edition = {Fourth},
## address = {New York},
## year = {2002},
## note = {ISBN 0-387-95457-0},
## url = {http://www.stats.ox.ac.uk/pub/MASS4},
## }
First example: Amphibian roadkill, (ARR) from Zuur et al. 2009
The data set consists of roadkills of amphibian species at 52 sites along a road in Portugal. Here, we examine the relationship between distance from a national park, and the number of amphibians recorded. The variable ‘distance from a national park’, is denoted by the column ‘D.PARK’. The number of amphibians is denoted by the column ‘TOT.N’.
For this workshop example I will only use these two variables, a much more complicated model with many terms could also be fitted depending on the question of interest.
## Loading in a dataset stored as a .txt file
setwd("~/PhD/Workshops/IEA Stats/GLM workshop/GLM_workshop_IEA_2019-master")
AAR <- read.table("RoadKills.txt", header = T)
## Show column names
names(AAR)
## [1] "Sector" "X" "Y" "BufoCalamita"
## [5] "TOT.N" "S.RICH" "OPEN.L" "OLIVE"
## [9] "MONT.S" "MONT" "POLIC" "SHRUB"
## [13] "URBAN" "WAT.RES" "L.WAT.C" "L.D.ROAD"
## [17] "L.P.ROAD" "D.WAT.RES" "D.WAT.COUR" "D.PARK"
## [21] "N.PATCH" "P.EDGE" "L.SDI"
## Show variable types.
# Note D.Park is a numeric variable, TOT.N is an integer (i.e. count data whole numbers only)
str(AAR)
## 'data.frame': 52 obs. of 23 variables:
## $ Sector : int 1 2 3 4 5 6 7 8 9 10 ...
## $ X : int 260181 259914 259672 259454 259307 259189 259092 258993 258880 258767 ...
## $ Y : int 256546 256124 255688 255238 254763 254277 253786 253296 252809 252322 ...
## $ BufoCalamita: int 5 1 40 27 67 56 27 37 8 16 ...
## $ TOT.N : int 22 14 65 55 88 104 49 66 26 47 ...
## $ S.RICH : int 3 4 6 5 4 7 7 7 7 6 ...
## $ OPEN.L : num 22.7 24.7 30.1 50.3 43.6 ...
## $ OLIVE : num 60.3 40.8 23.7 14.9 35.4 ...
## $ MONT.S : num 0 0 0.258 1.783 2.431 ...
## $ MONT : num 0.653 0.161 10.918 26.454 11.33 ...
## $ POLIC : num 4.811 2.224 1.946 0.625 0.791 ...
## $ SHRUB : num 0.406 0.735 0.474 0.607 0.173 ...
## $ URBAN : num 7.787 27.15 28.086 0.831 2.452 ...
## $ WAT.RES : num 0.043 0.182 0.453 0.026 0 0.039 0.114 0.224 0.177 0 ...
## $ L.WAT.C : num 0.583 1.419 2.005 1.924 2.167 ...
## $ L.D.ROAD : num 3330 2587 2150 4223 2219 ...
## $ L.P.ROAD : num 1.975 1.761 1.25 0.666 0.653 ...
## $ D.WAT.RES : num 252.1 139.6 59.2 277.8 967.8 ...
## $ D.WAT.COUR : num 735 134.1 269 48.8 126.1 ...
## $ D.PARK : num 250 741 1240 1740 2232 ...
## $ N.PATCH : num 122 96 67 63 59 49 35 55 52 26 ...
## $ P.EDGE : num 554 457 432 421 408 ...
## $ L.SDI : num 1.8 1.89 1.93 1.86 1.82 ...
## Show summary of all columns - Note: that TOT.N is always above 0
summary(AAR)
## Sector X Y BufoCalamita
## Min. : 1.00 Min. :258260 Min. :231911 Min. : 0.000
## 1st Qu.:13.75 1st Qu.:258668 1st Qu.:238007 1st Qu.: 0.000
## Median :26.50 Median :259120 Median :244162 Median : 3.000
## Mean :26.50 Mean :259241 Mean :244211 Mean : 8.135
## 3rd Qu.:39.25 3rd Qu.:259935 3rd Qu.:250496 3rd Qu.: 8.250
## Max. :52.00 Max. :260370 Max. :256546 Max. :67.000
## TOT.N S.RICH OPEN.L OLIVE
## Min. : 2.00 Min. :1.000 Min. : 0.738 Min. : 0.000
## 1st Qu.: 7.00 1st Qu.:3.000 1st Qu.:15.702 1st Qu.: 0.000
## Median : 17.50 Median :5.000 Median :28.533 Median : 0.674
## Mean : 25.90 Mean :4.692 Mean :36.175 Mean : 7.350
## 3rd Qu.: 34.25 3rd Qu.:6.000 3rd Qu.:51.765 3rd Qu.: 8.474
## Max. :104.00 Max. :8.000 Max. :97.574 Max. :60.333
## MONT.S MONT POLIC SHRUB
## Min. :0.000 Min. : 0.00 Min. : 0.0000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:25.05 1st Qu.: 0.0000 1st Qu.:0.02975
## Median :0.000 Median :44.23 Median : 0.0600 Median :0.09300
## Mean :1.077 Mean :48.12 Mean : 0.6025 Mean :0.22881
## 3rd Qu.:1.175 3rd Qu.:75.40 3rd Qu.: 0.3257 3rd Qu.:0.20950
## Max. :9.426 Max. :95.08 Max. :11.2630 Max. :1.74400
## URBAN WAT.RES L.WAT.C L.D.ROAD
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. : 105.3
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.8135 1st Qu.:1062.1
## Median : 0.195 Median :0.0425 Median :1.5500 Median :2098.0
## Mean : 2.771 Mean :0.3213 Mean :1.5617 Mean :2119.1
## 3rd Qu.: 1.551 3rd Qu.:0.2135 3rd Qu.:2.1990 3rd Qu.:2983.8
## Max. :28.086 Max. :6.3090 Max. :3.9510 Max. :4891.5
## L.P.ROAD D.WAT.RES D.WAT.COUR D.PARK
## Min. :0.5710 Min. : 59.17 Min. : 15.18 Min. : 250.2
## 1st Qu.:0.6450 1st Qu.: 381.91 1st Qu.: 91.58 1st Qu.: 6573.4
## Median :0.6810 Median : 674.61 Median : 196.01 Median :12719.6
## Mean :0.9645 Mean : 703.52 Mean : 288.68 Mean :12680.9
## 3rd Qu.:1.1818 3rd Qu.: 958.20 3rd Qu.: 359.58 3rd Qu.:18780.1
## Max. :2.9600 Max. :1883.00 Max. :1165.00 Max. :24884.8
## N.PATCH P.EDGE L.SDI
## Min. : 15.00 Min. :125.2 Min. :1.124
## 1st Qu.: 19.00 1st Qu.:197.8 1st Qu.:1.423
## Median : 28.00 Median :257.4 Median :1.602
## Mean : 34.40 Mean :289.1 Mean :1.595
## 3rd Qu.: 40.25 3rd Qu.:377.6 3rd Qu.:1.800
## Max. :122.00 Max. :553.9 Max. :1.963
## Draw a simple plot of the relationship between parks and number of amphibians
plot( TOT.N~ D.PARK, data = AAR,
pch = 16, col = "steelblue")
## Draw a histogram of the response variable, (breaks specifies the number of bars)
hist(AAR$TOT.N, col = "pink", breaks = 10,
main = "Histogram of amphibian counts")
First we run a poisson GLM this is the standard first approach for count data.
The next code chunk runs this GLM and checks for over-dispersion in the residuals. If the poisson family is a good choice then we expect ‘theta’ to be close to 1 and not more than 2. If it is more than 2, we can either try adding more predictor variables to the model (not shown here), or account for this over-dispersion using a negative binomial family (shown below).
# function glm, fits a GLM with different family and links specified
mod_ARR <- glm(TOT.N ~ D.PARK, data = AAR, family = "poisson"(link = "log"))
summary(mod_ARR)
##
## Call:
## glm(formula = TOT.N ~ D.PARK, family = poisson(link = "log"),
## data = AAR)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.1100 -1.6950 -0.4708 1.4206 7.3337
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.316e+00 4.322e-02 99.87 <2e-16 ***
## D.PARK -1.059e-04 4.387e-06 -24.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 1071.4 on 51 degrees of freedom
## Residual deviance: 390.9 on 50 degrees of freedom
## AIC: 634.29
##
## Number of Fisher Scoring iterations: 4
#### Check for overdispersion by dividing the residual deviance by the residual df
theta <- mod_ARR$deviance/mod_ARR$df.residual
theta
## [1] 7.817937
Here, the theta value at 7.8 is much too high indicating that the data is too over-dispersed for a poisson distribution.
Therefore, we should try using a “negative binomial” family to model the data. We do this using the function glm.nb() from the package “MASS”.
# glm.nb fits negative binomial family only, therefore no need to
# specify a negative binomial family in the code
mod_ARR_NB <- glm.nb(TOT.N ~ D.PARK, data = AAR)
summary(mod_ARR_NB)
##
## Call:
## glm.nb(formula = TOT.N ~ D.PARK, data = AAR, init.theta = 3.681040094,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4160 -0.8289 -0.2116 0.4800 2.1346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.411e+00 1.548e-01 28.50 <2e-16 ***
## D.PARK -1.161e-04 1.137e-05 -10.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.681) family taken to be 1)
##
## Null deviance: 155.445 on 51 degrees of freedom
## Residual deviance: 54.742 on 50 degrees of freedom
## AIC: 393.09
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.681
## Std. Err.: 0.891
##
## 2 x log-likelihood: -387.093
#### Check for overdispersion by dividing the residual deviance by the residual df
theta <- mod_ARR_NB$deviance/mod_ARR_NB$df.residual
theta
## [1] 1.094839
### very close to one. Negative binomial distribution worked..
Further model validation. Now that we have checked for overdispersion and fixed this we should now do a bit more validation. This time looking at the model residuals to see that they show no obvious patterns or heteroscadicity.
First, we plot the “deviance residuals” against the “fitted values” and check that there is pattern in these
Then we plot the “deviance residuals” against each predictor varible in the model. In this case there is only one. Again, we don’t want to see any pattern here.
If you do see an obvious patterns you may need to a) consider non-linear relationships, b) add more explanatory varibles to the model, and/or c) concider using gls or glmm to account for nested structures or non-constant variance. See FAQ for more options here.
#scatterplot of residuals vs fitted values
plot(residuals(mod_ARR_NB, type = "deviance"), fitted(mod_ARR_NB),
pch = 16, col = "steelblue")
#scatterplot of residuals vs predictor variable
plot(residuals(mod_ARR_NB, type = "deviance"), AAR$D.PARK,
pch = 16, col = "dark green")
# If you have variables that you have not included, or information of times or spatial locations also plot these against the residual deviance.
# In this case for example we could see if there's any latitudinal pattern
# in the residuals by plotting them against the column ARR$Y which is the
# geographic Y coordinates of the data collection points.
plot(residuals(mod_ARR_NB, type = "deviance"), AAR$Y,
pch = 16, col = "grey40")
Know that we’re happy with our model we can look at our summary table and plot our result.
Here’s one example of how to do that, which uses the inbuilt predict function in R..
summary(mod_ARR_NB)
##
## Call:
## glm.nb(formula = TOT.N ~ D.PARK, data = AAR, init.theta = 3.681040094,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4160 -0.8289 -0.2116 0.4800 2.1346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.411e+00 1.548e-01 28.50 <2e-16 ***
## D.PARK -1.161e-04 1.137e-05 -10.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(3.681) family taken to be 1)
##
## Null deviance: 155.445 on 51 degrees of freedom
## Residual deviance: 54.742 on 50 degrees of freedom
## AIC: 393.09
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 3.681
## Std. Err.: 0.891
##
## 2 x log-likelihood: -387.093
### predict the results of the model on the original data scale
# (i.e accounting for the log-link in this case)
pred_ambh <- predict(mod_ARR_NB, type = "response", se.fit = TRUE)
str(pred_ambh)
## List of 3
## $ fit : Named num [1:52] 80 75.5 71.3 67.3 63.5 ...
## ..- attr(*, "names")= chr [1:52] "1" "2" "3" "4" ...
## $ se.fit : Named num [1:52] 12.2 11.2 10.2 9.3 8.5 ...
## ..- attr(*, "names")= chr [1:52] "1" "2" "3" "4" ...
## $ residual.scale: num 1
#### plot the raw data so that the original data can be viewed
plot(AAR$TOT.N~AAR$D.PARK, ylab = "No. of roadkill", xlab = "Distance from park",
col = "grey30", pch = 16)
### Add predicted number of amphibians from the model
lines(pred_ambh$fit~AAR$D.PARK, type = "l", col = "red", lwd = 1.5)
### Add lines indicating standard error of model estimated means
lines(pred_ambh$fit +pred_ambh$se.fit~AAR$D.PARK, type = "l", col = "blue", lwd = 1, lty = 2) # upper standard error
lines(pred_ambh$fit - pred_ambh$se.fit~AAR$D.PARK, type = "l", col = "blue", lwd = 1, lty = 2) # lower standard error
In this example I use a dataset collected accross peatland sites in Nothern Ireland in the year following a severe fire. The data shows the number of Meadow Pipits counted in 10 minute spot counts, one year post fire. The data was collected at 122 locations, accross 6 sites, within each site there were burnt and unburnt locations surveyed.
For this example, I examine the presence or absence of meadow pipits in burnt and unburnt areas, within three habitat classes.
## Loading in a dataset stored as a .csv file
setwd("~/PhD/Workshops/IEA Stats/GLM workshop/GLM_workshop_IEA_2019-master")
MP <- read.csv("meadow_pipits.csv", header = T)
## show number of rows and columns in the dataset.
dim(MP)
## [1] 122 6
## Show column names
names(MP)
## [1] "Site" "Burnt" "Habitat" "date2" "MP"
## [6] "MP_Presence"
## Show variable types.
str(MP)
## 'data.frame': 122 obs. of 6 variables:
## $ Site : Factor w/ 6 levels "CH","EM","GR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Burnt : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 2 2 ...
## $ Habitat : Factor w/ 3 levels "Bog","DH","WH": 1 1 1 1 1 1 1 3 1 1 ...
## $ date2 : int 206 206 242 242 206 242 206 242 207 243 ...
## $ MP : int 0 0 0 1 2 0 0 4 0 0 ...
## $ MP_Presence: int 0 0 0 1 1 0 0 1 0 0 ...
## Show summary of all columns
summary(MP)
## Site Burnt Habitat date2 MP MP_Presence
## CH:18 N:56 Bog:80 Min. :174 Min. :0.000 Min. :0.0000
## EM:34 Y:66 DH :22 1st Qu.:200 1st Qu.:0.000 1st Qu.:0.0000
## GR:11 WH :20 Median :219 Median :0.000 Median :0.0000
## MN:20 Mean :222 Mean :1.221 Mean :0.4344
## SA:19 3rd Qu.:243 3rd Qu.:2.000 3rd Qu.:1.0000
## SH:20 Max. :285 Max. :9.000 Max. :1.0000
## show number of burnt and unburnt sites where meadow pipits were observed.
table(MP$Burnt, MP$MP_Presence)
##
## 0 1
## N 29 27
## Y 40 26
## Draw a histogram of the response variable, (breaks specifies the number of bars)
hist(MP$MP_Presence, col = "pink",
main = "Histogram meadow pipit data")
Here, we run a binomial GLM with a logit link function this is the standard first approach for presence/absence data.
# function glm, fits a GLM with different family and links specified
mod_MP <- glm(MP_Presence ~ Burnt + Habitat, data = MP, family = "binomial"(link = "logit"))
summary(mod_MP)
##
## Call:
## glm(formula = MP_Presence ~ Burnt + Habitat, family = binomial(link = "logit"),
## data = MP)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.263 -1.026 -0.946 1.202 1.498
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05854 0.29820 -0.196 0.844
## BurntY -0.36233 0.36870 -0.983 0.326
## HabitatDH -0.30782 0.49913 -0.617 0.537
## HabitatWH 0.25792 0.50290 0.513 0.608
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 167.02 on 121 degrees of freedom
## Residual deviance: 165.25 on 118 degrees of freedom
## AIC: 173.25
##
## Number of Fisher Scoring iterations: 4
###
Checking the the residuals of a binomial GLM (presence/absence) data
### extract model residuals and summarise their distribution.
dev_resid <- resid(mod_MP, type = "deviance")
summary(dev_resid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.26317 -1.02631 -0.94605 -0.04237 1.20237 1.49830
Here this is fine.
Examining the model summary shows that there is no significant difference between burnt and unburnt areas or between habitat types in terms of the presence or absence of meadow pipits.
A handy trick, in order to get a significance value a categorical variable such as ‘habitats’ as a whole instead of comparisons between each level, e.g. bog vs dry heath, dry heath vs. wet heath we can run the model with and without the variable habitat and compare the two models using the function ‘anova’. When you are running larger models you may find it convenient to do this using the function ‘Anova’ in the package ‘car’ (e.g. Anova(mod_MP, type = “III”)).
# print model summary
summary(mod_MP)
##
## Call:
## glm(formula = MP_Presence ~ Burnt + Habitat, family = binomial(link = "logit"),
## data = MP)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.263 -1.026 -0.946 1.202 1.498
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.05854 0.29820 -0.196 0.844
## BurntY -0.36233 0.36870 -0.983 0.326
## HabitatDH -0.30782 0.49913 -0.617 0.537
## HabitatWH 0.25792 0.50290 0.513 0.608
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 167.02 on 121 degrees of freedom
## Residual deviance: 165.25 on 118 degrees of freedom
## AIC: 173.25
##
## Number of Fisher Scoring iterations: 4
# To check the overall significance of the inclusion of habitat as a variable in the model.
mod_burn_only <- glm(MP_Presence ~ Burnt, data = MP, family = "binomial"(link = "logit"))
### Compare these two models (with and without habitat included)
anova(mod_MP, mod_burn_only, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: MP_Presence ~ Burnt + Habitat
## Model 2: MP_Presence ~ Burnt
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 118 165.25
## 2 120 166.06 -2 -0.81554 0.6651
### Here, the p value indicates that there is no significant difference
### between the models with and without habitat.
Books: Mixed Effects Models and Extensions in Ecology with R by Zuur et al. 2009
And/or for a more hands on coding approach: Data Analysis with R statistical software by Thomas et al. 2017
Articles:
Bolker, et al. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution 24: 127-135.
Zuur AF, Ieno EN, Elphick CS. et al. A protocol for data exploration to avoid common statistical problems. Methods Ecol Evol 1: 3-14
As an online resource with details on almost everything we recommend http://glmm.wikidot.com/ by the inimitable Ben Bolker.