This material corresponds to Week 4 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.
Note: it’s a good idea to load libraries in a separate code chunk, so you don’t have to mess with the setup chunk again.
library(ggplot2)
library(dplyr)
library(ggfortify)
library(arm)
This document covers the basics of linear models from Beckerman et al. 2017 Chapter 5 (pp 108-129). Chi-square tests and two-sample t-tests are also covered in Chapter 5 but are not demo’ed in this script.
Then, in this lab we introduce the general linear model (GLM), focusing on two simplest cases, linear regression and ANOVA. The goal of a linear model is to explain or predict a response variable (Y) as a linear function of one or more explanatory variables (X).
Linear models assume:
We will work with two datasets:
Plant growth under different soil moisture conditions Growth of Daphnia under different parasite treatments
Ensure you are working in your R project with a data folder. Our fisrt dataset is:
test1 <- read.csv("data/plant.growth.rate.csv",header=T)
glimpse(test1)
## Rows: 50
## Columns: 2
## $ soil.moisture.content <dbl> 0.4696876, 0.5413106, 1.6979915, 0.8255799, 0.85…
## $ plant.growth.rate <dbl> 21.31695, 27.03072, 38.98937, 30.19529, 37.06547…
head(test1)
## soil.moisture.content plant.growth.rate
## 1 0.4696876 21.31695
## 2 0.5413106 27.03072
## 3 1.6979915 38.98937
## 4 0.8255799 30.19529
## 5 0.8568173 37.06547
## 6 1.6082405 43.24982
The second dataset includes a categorical variable (parasite). We convert it to a factor so that R recognizes it as a categorical predictor.
test2 <- read.csv("data/Daphniagrowth.csv",header=T)
test2$parasite <- as.factor(test2$parasite)
glimpse(test2)
## Rows: 40
## Columns: 3
## $ parasite <fct> control, control, control, control, control, control, cont…
## $ rep <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, …
## $ growth.rate <dbl> 1.0747092, 1.2659016, 1.3151563, 1.0757519, 1.1967619, 1.3…
levels(test2$parasite) # show ordering of parasite levels
## [1] "control" "Metschnikowia bicuspidata"
## [3] "Pansporella perplexa" "Pasteuria ramosa"
So we have two datasets. Please take into account we are going to use them for different proposes.
Before fitting any model, we visualize the data. Here, the question is: Does soil moisture content predict plant growth rate? Try to replicate this graph by your own…
This is how we propose the model in R.fitting a simple linear regression model…
mod1 <- lm(data=test1,plant.growth.rate~soil.moisture.content)
Inspecting the Model…
names(mod1) # view components of mod1
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
mod1$coefficients # see the coefficients
## (Intercept) soil.moisture.content
## 19.34846 12.74954
anova(mod1) # ANOVA table (partitioning variance)
## Analysis of Variance Table
##
## Response: plant.growth.rate
## Df Sum Sq Mean Sq F value Pr(>F)
## soil.moisture.content 1 2521.15 2521.15 156.08 < 2.2e-16 ***
## Residuals 48 775.35 16.15
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1) # full model summary
##
## Call:
## lm(formula = plant.growth.rate ~ soil.moisture.content, data = test1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9089 -3.0747 0.2261 2.6567 8.9406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.348 1.283 15.08 <2e-16 ***
## soil.moisture.content 12.750 1.021 12.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.019 on 48 degrees of freedom
## Multiple R-squared: 0.7648, Adjusted R-squared: 0.7599
## F-statistic: 156.1 on 1 and 48 DF, p-value: < 2.2e-16
It is critical to check whether assumptions are met.
autoplot(mod1, smooth.colour = NA)
This generates diagnostic plots:
Now let’s add the fitted regression line to the scatterplot.
ggplot(data = test1, aes(x = soil.moisture.content, y = plant.growth.rate)) +
geom_point() + # raw data points
geom_smooth(method = "lm", se = TRUE) + # regression line + confidence band
ylab("Plant growth rate (mm/week)") +
xlab("Soil moisture content") +
theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
What this shows:
Plot the data first (always; try to do so on your own):
Please ask yourself, Why this plot?
R uses treatment contrasts by default, the intercept is the mean of the reference group, and each coefficient is a difference from that reference.
# Make sure the reference level is what you want in the equations:
test2$parasite <- relevel(test2$parasite, ref = "control")
levels(test2$parasite)
## [1] "control" "Metschnikowia bicuspidata"
## [3] "Pansporella perplexa" "Pasteuria ramosa"
What means ref = “control”, the intercept is…
It is like coefficients for Pasteuria, Pansporella, and Metsch are their mean minus mean(Control).
mod2 <- lm(growth.rate ~ parasite, data = test2)
anova(mod2) # omnibus F test: do any groups differ?
## Analysis of Variance Table
##
## Response: growth.rate
## Df Sum Sq Mean Sq F value Pr(>F)
## parasite 3 3.1379 1.04597 32.325 2.571e-10 ***
## Residuals 36 1.1649 0.03236
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2) # shows intercept + group differences (vs Control)
##
## Call:
## lm(formula = growth.rate ~ parasite, data = test2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41930 -0.09696 0.01408 0.12267 0.31790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.21391 0.05688 21.340 < 2e-16 ***
## parasiteMetschnikowia bicuspidata -0.41275 0.08045 -5.131 1.01e-05 ***
## parasitePansporella perplexa -0.13755 0.08045 -1.710 0.0959 .
## parasitePasteuria ramosa -0.73171 0.08045 -9.096 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1799 on 36 degrees of freedom
## Multiple R-squared: 0.7293, Adjusted R-squared: 0.7067
## F-statistic: 32.33 on 3 and 36 DF, p-value: 2.571e-10
If you want to see the dummy coding R used:
model.matrix(mod2)[1:8, ]
## (Intercept) parasiteMetschnikowia bicuspidata parasitePansporella perplexa
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## 7 1 0 0
## 8 1 0 0
## parasitePasteuria ramosa
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
autoplot(mod2, smooth.colour = NA)
Compare something before going further… Look at the means vs Estimate at
the summary
# raw means per group (descriptive)
sumDat <- test2 %>%
group_by(parasite) %>%
summarize(meanGR = mean(growth.rate), .groups = "drop")
sumDat
## # A tibble: 4 × 2
## parasite meanGR
## <fct> <dbl>
## 1 control 1.21
## 2 Metschnikowia bicuspidata 0.801
## 3 Pansporella perplexa 1.08
## 4 Pasteuria ramosa 0.482
summary(mod2)
##
## Call:
## lm(formula = growth.rate ~ parasite, data = test2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41930 -0.09696 0.01408 0.12267 0.31790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.21391 0.05688 21.340 < 2e-16 ***
## parasiteMetschnikowia bicuspidata -0.41275 0.08045 -5.131 1.01e-05 ***
## parasitePansporella perplexa -0.13755 0.08045 -1.710 0.0959 .
## parasitePasteuria ramosa -0.73171 0.08045 -9.096 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1799 on 36 degrees of freedom
## Multiple R-squared: 0.7293, Adjusted R-squared: 0.7067
## F-statistic: 32.33 on 3 and 36 DF, p-value: 2.571e-10
Your approach is great—here it is with two small tweaks: 1. Use shape = 23 (diamond filled) which ggplot understands. 2. Compute and plot model predictions with CIs cleanly.
#mod2 <- lm(growth.rate ~ parasite, data = test2)
# model-based estimates and SE for each group
sumDat <- test2 %>%
group_by(parasite) %>%
summarize(meanGR = mean(growth.rate))
# alternatively, create data frame with estimates, standard errors, and CI
estimates <- expand.grid(parasite = levels(as.factor(test2$parasite)))
estimates$growth.rate <- predict(mod2, newdata=estimates,se.fit=TRUE)$fit
estimates$sem <- predict(mod2, newdata=estimates,se.fit=TRUE)$se.fit
estimates$ci <- estimates$sem*2
estimates
## parasite growth.rate sem ci
## 1 control 1.2139088 0.05688381 0.1137676
## 2 Metschnikowia bicuspidata 0.8011541 0.05688381 0.1137676
## 3 Pansporella perplexa 1.0763551 0.05688381 0.1137676
## 4 Pasteuria ramosa 0.4822030 0.05688381 0.1137676
fit: the predicted mean for each group, sem: the standard error of that prediction, ci: the width of the 95% confidence interval.
ggplot(test2, aes(x = parasite, y = growth.rate, colour = parasite)) +
geom_jitter(width = 0.1, alpha = 0.5) +
geom_point(data = estimates,
aes(y = growth.rate), # <- usar growth.rate
size = 4, shape = 23, fill = "white", color = "black") +
geom_errorbar(data = estimates,
aes(y = growth.rate,
ymin = growth.rate - ci,
ymax = growth.rate + ci), # <- usar growth.rate
width = 0.2, color = "black") +
xlab("Parasite") + ylab("Growth rate") +
theme_classic() + coord_flip() +
guides(colour = "none")