This material corresponds to Week 4 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.

Load libraries

Note: it’s a good idea to load libraries in a separate code chunk, so you don’t have to mess with the setup chunk again.

library(ggplot2)      
library(dplyr)
library(ggfortify)
library(arm)

Overview

This document covers the basics of linear models from Beckerman et al. 2017 Chapter 5 (pp 108-129). Chi-square tests and two-sample t-tests are also covered in Chapter 5 but are not demo’ed in this script.

Then, in this lab we introduce the general linear model (GLM), focusing on two simplest cases, linear regression and ANOVA. The goal of a linear model is to explain or predict a response variable (Y) as a linear function of one or more explanatory variables (X).

Linear models assume:

  1. Linearity of the relationship
  2. Normally distributed residuals
  3. Constant variance (homoscedasticity)
  4. Independent observations

We will work with two datasets:

Plant growth under different soil moisture conditions Growth of Daphnia under different parasite treatments

Acquire data

Ensure you are working in your R project with a data folder. Our fisrt dataset is:

test1 <- read.csv("data/plant.growth.rate.csv",header=T)
glimpse(test1)
## Rows: 50
## Columns: 2
## $ soil.moisture.content <dbl> 0.4696876, 0.5413106, 1.6979915, 0.8255799, 0.85…
## $ plant.growth.rate     <dbl> 21.31695, 27.03072, 38.98937, 30.19529, 37.06547…
head(test1)
##   soil.moisture.content plant.growth.rate
## 1             0.4696876          21.31695
## 2             0.5413106          27.03072
## 3             1.6979915          38.98937
## 4             0.8255799          30.19529
## 5             0.8568173          37.06547
## 6             1.6082405          43.24982

The second dataset includes a categorical variable (parasite). We convert it to a factor so that R recognizes it as a categorical predictor.

test2 <- read.csv("data/Daphniagrowth.csv",header=T)
test2$parasite <- as.factor(test2$parasite)
glimpse(test2)
## Rows: 40
## Columns: 3
## $ parasite    <fct> control, control, control, control, control, control, cont…
## $ rep         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, …
## $ growth.rate <dbl> 1.0747092, 1.2659016, 1.3151563, 1.0757519, 1.1967619, 1.3…
levels(test2$parasite)  # show ordering of parasite levels
## [1] "control"                   "Metschnikowia bicuspidata"
## [3] "Pansporella perplexa"      "Pasteuria ramosa"

So we have two datasets. Please take into account we are going to use them for different proposes.

Plants and linear regression

Before fitting any model, we visualize the data. Here, the question is: Does soil moisture content predict plant growth rate? Try to replicate this graph by your own…

Plant modelling

This is how we propose the model in R.fitting a simple linear regression model…

mod1 <- lm(data=test1,plant.growth.rate~soil.moisture.content)

Inspecting the Model…

names(mod1) # view components of mod1
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
mod1$coefficients # see the coefficients
##           (Intercept) soil.moisture.content 
##              19.34846              12.74954
anova(mod1) # ANOVA table (partitioning variance)
## Analysis of Variance Table
## 
## Response: plant.growth.rate
##                       Df  Sum Sq Mean Sq F value    Pr(>F)    
## soil.moisture.content  1 2521.15 2521.15  156.08 < 2.2e-16 ***
## Residuals             48  775.35   16.15                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod1) # full model summary
## 
## Call:
## lm(formula = plant.growth.rate ~ soil.moisture.content, data = test1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9089 -3.0747  0.2261  2.6567  8.9406 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             19.348      1.283   15.08   <2e-16 ***
## soil.moisture.content   12.750      1.021   12.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.019 on 48 degrees of freedom
## Multiple R-squared:  0.7648, Adjusted R-squared:  0.7599 
## F-statistic: 156.1 on 1 and 48 DF,  p-value: < 2.2e-16

Model Diagnostics

It is critical to check whether assumptions are met.

autoplot(mod1, smooth.colour = NA)

This generates diagnostic plots:

  1. Residuals vs Fitted: should show no clear pattern (linearity assumption).
  2. Normal Q-Q plot: residuals should follow a straight line (normality).
  3. Scale-Location plot: residual spread should be even (homoscedasticity).
  4. Residuals vs Leverage: identifies influential points.

Plant re-plot

Now let’s add the fitted regression line to the scatterplot.

ggplot(data = test1, aes(x = soil.moisture.content, y = plant.growth.rate)) +
  geom_point() +                                   # raw data points
  geom_smooth(method = "lm", se = TRUE) +          # regression line + confidence band
  ylab("Plant growth rate (mm/week)") +
  xlab("Soil moisture content") +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

What this shows:

  1. Black points: the observed data (soil moisture vs. plant growth).
  2. Blue line: the fitted linear regression line from lm(). This is the line that minimizes the squared differences between observed and predicted values.
  3. Shaded band: the 95% confidence interval around the fitted line, showing the uncertainty in the estimated relationship.

Daphnia (one-way ANOVA)

Plot the data first (always; try to do so on your own):

Please ask yourself, Why this plot?

Set the reference level (what the intercept means)

R uses treatment contrasts by default, the intercept is the mean of the reference group, and each coefficient is a difference from that reference.

# Make sure the reference level is what you want in the equations:
test2$parasite <- relevel(test2$parasite, ref = "control")
levels(test2$parasite)
## [1] "control"                   "Metschnikowia bicuspidata"
## [3] "Pansporella perplexa"      "Pasteuria ramosa"

What means ref = “control”, the intercept is…

It is like coefficients for Pasteuria, Pansporella, and Metsch are their mean minus mean(Control).

Fit the model (one-way ANOVA = linear model with a factor)

mod2 <- lm(growth.rate ~ parasite, data = test2)

anova(mod2)   # omnibus F test: do any groups differ?
## Analysis of Variance Table
## 
## Response: growth.rate
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## parasite   3 3.1379 1.04597  32.325 2.571e-10 ***
## Residuals 36 1.1649 0.03236                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod2) # shows intercept + group differences (vs Control)
## 
## Call:
## lm(formula = growth.rate ~ parasite, data = test2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41930 -0.09696  0.01408  0.12267  0.31790 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.21391    0.05688  21.340  < 2e-16 ***
## parasiteMetschnikowia bicuspidata -0.41275    0.08045  -5.131 1.01e-05 ***
## parasitePansporella perplexa      -0.13755    0.08045  -1.710   0.0959 .  
## parasitePasteuria ramosa          -0.73171    0.08045  -9.096 7.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1799 on 36 degrees of freedom
## Multiple R-squared:  0.7293, Adjusted R-squared:  0.7067 
## F-statistic: 32.33 on 3 and 36 DF,  p-value: 2.571e-10

If you want to see the dummy coding R used:

model.matrix(mod2)[1:8, ]
##   (Intercept) parasiteMetschnikowia bicuspidata parasitePansporella perplexa
## 1           1                                 0                            0
## 2           1                                 0                            0
## 3           1                                 0                            0
## 4           1                                 0                            0
## 5           1                                 0                            0
## 6           1                                 0                            0
## 7           1                                 0                            0
## 8           1                                 0                            0
##   parasitePasteuria ramosa
## 1                        0
## 2                        0
## 3                        0
## 4                        0
## 5                        0
## 6                        0
## 7                        0
## 8                        0

Diagnostic plots (assumptions)

autoplot(mod2, smooth.colour = NA)

Compare something before going further… Look at the means vs Estimate at the summary

# raw means per group (descriptive)
sumDat <- test2 %>%
  group_by(parasite) %>%
  summarize(meanGR = mean(growth.rate), .groups = "drop")

sumDat
## # A tibble: 4 × 2
##   parasite                  meanGR
##   <fct>                      <dbl>
## 1 control                    1.21 
## 2 Metschnikowia bicuspidata  0.801
## 3 Pansporella perplexa       1.08 
## 4 Pasteuria ramosa           0.482
summary(mod2)
## 
## Call:
## lm(formula = growth.rate ~ parasite, data = test2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41930 -0.09696  0.01408  0.12267  0.31790 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        1.21391    0.05688  21.340  < 2e-16 ***
## parasiteMetschnikowia bicuspidata -0.41275    0.08045  -5.131 1.01e-05 ***
## parasitePansporella perplexa      -0.13755    0.08045  -1.710   0.0959 .  
## parasitePasteuria ramosa          -0.73171    0.08045  -9.096 7.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1799 on 36 degrees of freedom
## Multiple R-squared:  0.7293, Adjusted R-squared:  0.7067 
## F-statistic: 32.33 on 3 and 36 DF,  p-value: 2.571e-10

Re-plot with model-based means and 95% CIs

Your approach is great—here it is with two small tweaks: 1. Use shape = 23 (diamond filled) which ggplot understands. 2. Compute and plot model predictions with CIs cleanly.

#mod2 <- lm(growth.rate ~ parasite, data = test2)
# model-based estimates and SE for each group
sumDat <- test2 %>%
  group_by(parasite) %>%
  summarize(meanGR = mean(growth.rate))


# alternatively, create data frame with estimates, standard errors, and CI
estimates <- expand.grid(parasite = levels(as.factor(test2$parasite)))

estimates$growth.rate <- predict(mod2, newdata=estimates,se.fit=TRUE)$fit

estimates$sem <- predict(mod2, newdata=estimates,se.fit=TRUE)$se.fit

estimates$ci <- estimates$sem*2

estimates
##                    parasite growth.rate        sem        ci
## 1                   control   1.2139088 0.05688381 0.1137676
## 2 Metschnikowia bicuspidata   0.8011541 0.05688381 0.1137676
## 3      Pansporella perplexa   1.0763551 0.05688381 0.1137676
## 4          Pasteuria ramosa   0.4822030 0.05688381 0.1137676

fit: the predicted mean for each group, sem: the standard error of that prediction, ci: the width of the 95% confidence interval.

ggplot(test2, aes(x = parasite, y = growth.rate, colour = parasite)) +
  geom_jitter(width = 0.1, alpha = 0.5) +
  geom_point(data = estimates,
             aes(y = growth.rate),      # <- usar growth.rate
             size = 4, shape = 23, fill = "white", color = "black") +
  geom_errorbar(data = estimates,
                aes(y = growth.rate,
                    ymin = growth.rate - ci,
                    ymax = growth.rate + ci),  # <- usar growth.rate
                width = 0.2, color = "black") +
  xlab("Parasite") + ylab("Growth rate") +
  theme_classic() + coord_flip() +
  guides(colour = "none")