Practicum 3: ANOVA

1. Loading data: ChickWeight

Today we will be exploring the ChickWeight dataset. Here’s the description provided in the R help file ?ChickWeight

The ChickWeight data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks.

weight: a numeric vector giving the body weight of the chick (gm)
Time: a numeric vector giving the number of days since birth when the measurement was made
Chick: an ordered factor with levels 18 < … < 48 giving a unique identifier for the chick. The ordering of the levels groups chicks on the same diet together and orders them according to their final weight (lightest to heaviest) within diet
Diet: a factor with levels 1, …, 4 indicating which experimental diet the chick received

Let’s read in the data

data(ChickWeight)
summary(ChickWeight)

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

Conducting the ANOVA

Our research question is: Does diet affect weight of chicks? This is easily done using ANOVA. The function to use is aov(), NOT anova() which does something else (ANOVA is essentially an F test which can be used for a lot of things).

chicks_aov <- aov(weight ~ Diet, data = ChickWeight)
summary(chicks_aov)

##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## Diet          3  155863   51954   10.81 6.43e-07 ***
## Residuals   574 2758693    4806                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results are highly significant! But we forgot to check the four assumptions of linear models (we should have done that BEFORE doing the ANOVA actually)

Checking for violations of assumptions

The four assumptions of linear models are: 1. Linearity 2. Homoscedasticity 3. Independence 4. Normality

Linearity is seldom meaningful to check when x is a categorical variable. We’ll use diagnostic plots (plot) to examine Homoscedasticity and Normality. And we’ll explore the data further to see if any violations of independence are committed.

# Diagnostic plots
plot(chicks_aov)

The first plot (residuals vs fitted values) suggests that the homoscedasticity assumption is met. But the second plot (qq plot) shows that the normality assumption is likely violated (look at the tails that deviate quite far from the standard normal)

Further evidence for this can be obtained by looking at a histogram of the data:

hist(ChickWeight$weight)

Next, let’s look at independence violations.

summary(ChickWeight)

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

It looks like many chicks were measured repeatedly. This is a clear case of violation of independence (data from the same Chick are not independent of each other). Here are a few ways we can visualize it:

table(ChickWeight$Time, ChickWeight$Chick)

##     
##      18 16 15 13 9 20 10 8 17 19 4 6 11 3 1 12 2 5 14 7 24 30 22 23 27 28 26 25
##   0   1  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   2   1  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   4   0  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   6   0  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   8   0  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   10  0  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   12  0  1  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   14  0  0  1  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   16  0  0  0  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   18  0  0  0  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   20  0  0  0  1 1  1  1 1  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##   21  0  0  0  1 1  1  1 0  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##     
##      29 21 33 37 36 31 39 38 32 40 34 35 44 45 43 41 47 49 46 50 42 48
##   0   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   2   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   4   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   6   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   8   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   10  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   12  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   14  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   16  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   18  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
##   20  1  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1
##   21  1  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1

plot(weight ~ Time, data = ChickWeight, type = "n")
for(i in unique(ChickWeight$Chick)) {
  lines(weight ~ Time, data = ChickWeight, 
        subset = Chick==i, 
        lwd = 2, col = as.numeric(Diet)
        )
}
legend('topleft', legend = unique(ChickWeight$Diet), 
         col = as.numeric(unique(ChickWeight$Diet)), lwd = 2,
       title = "Diet"
       )

Clearly the chicks are growing and changing in weight through time. It is both violation of independence to use the entire dataset in an ANOVA, as well as illogical to compare data from day 1 with day 21.

Repeating the analysis with final weights

It makes much more sense to analyse only the final weights (on day 21) of the chicks. It’s a severe loss in data, but the correct analysis to do, that is both biologically sensible and statistically valid.

oldchicks <- subset(ChickWeight, Time==21)
summary(oldchicks)

##      weight           Time        Chick    Diet  
##  Min.   : 74.0   Min.   :21   13     : 1   1:16  
##  1st Qu.:167.0   1st Qu.:21   9      : 1   2:10  
##  Median :205.0   Median :21   20     : 1   3:10  
##  Mean   :218.7   Mean   :21   10     : 1   4: 9  
##  3rd Qu.:266.0   3rd Qu.:21   17     : 1         
##  Max.   :373.0   Max.   :21   19     : 1         
##                               (Other):39

table(oldchicks$Time, oldchicks$Chick)

##     
##      18 16 15 13 9 20 10 8 17 19 4 6 11 3 1 12 2 5 14 7 24 30 22 23 27 28 26 25
##   21  0  0  0  1 1  1  1 0  1  1 1 1  1 1 1  1 1 1  1 1  1  1  1  1  1  1  1  1
##     
##      29 21 33 37 36 31 39 38 32 40 34 35 44 45 43 41 47 49 46 50 42 48
##   21  1  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1

We are left with 45, from the original 578. But we now no longer have any independence violations (there aren’t any duplicate observations from any individual chick)

Chick weights for this subset of chicks also looks normally distributed

hist(oldchicks$weight)

ANOVA

old_aov <- aov(weight ~ Diet, data = oldchicks)
summary(old_aov)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Diet         3  57164   19055   4.655 0.00686 **
## Residuals   41 167839    4094                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It looks like diet has a significant effect on chick weight at 21 days (F(3, 41) = 5, p = 0.007). << Note that to report these statistics, I wrote in Rmarkdown: F(summary(old_aov)[[1]]$Df[1], summary(old_aov)[[1]]$Df[2]) = round(summary(old_aov)[[1]]$F[1], 0), p = round(summary(old_aov)[[1]]$P[1], 3)

Post-hoc test

We want to determine now exactly which diet(s) gave rise to significantly higher chick weights. To do so, we need to do a post-hoc test, which controls for multiple comparisons.

TukeyHSD(old_aov)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ Diet, data = oldchicks)
## 
## $Diet
##          diff        lwr       upr     p adj
## 2-1  36.95000  -32.11064 106.01064 0.4868095
## 3-1  92.55000   23.48936 161.61064 0.0046959
## 4-1  60.80556  -10.57710 132.18821 0.1192661
## 3-2  55.60000  -21.01591 132.21591 0.2263918
## 4-2  23.85556  -54.85981 102.57092 0.8486781
## 4-3 -31.74444 -110.45981  46.97092 0.7036249

It looks like the only significant difference was between diet 3 and diet 1. We want to present our findings in a barplot according to conventions taught in the lecture. First, we need to write out the bar annotations (there are packages to do this, but let’s do it the old-fashioned way–manually).

Diet 1 and diet 3 are produce significantly different chick weights, so they don’t share any letters.

letters_df <- tibble(
  Diet    = levels(oldchicks$Diet),
  Letters = c("A", "AB", "BC", "AB")
)
letters_df

## # A tibble: 4 × 2
##   Diet  Letters
##   <chr> <chr>  
## 1 1     A      
## 2 2     AB     
## 3 3     BC     
## 4 4     AB

# Create a summary dataframe
summary_df <- oldchicks %>%
  group_by(Diet) %>%
  summarise(
    mean_weight = mean(weight), # of weight
    se_weight   = sd(weight) / sqrt(n()), # and standard error (this is the actual formula written out)
    .groups = "drop"
  ) %>%
  left_join(letters_df, by = "Diet") # join it to the annotations we created in the previous line of code

ggplot(summary_df, aes(x = Diet, y = mean_weight)) +
  geom_col(fill = "gold", colour = "black", width = 0.7) +
  geom_errorbar(
    aes(ymin = mean_weight - se_weight,
        ymax = mean_weight + se_weight),
    width = 0.2
  ) +
  geom_text(
    aes(
      label = Letters,
      y = mean_weight + se_weight + 20
    ),
    size = 5
  ) +
  labs(
    x = "Diet",
    y = "Chick weight at day 21 (g)"
  ) +
  theme_classic(base_size = 14)

We can do the same in base R:

# Create dataframe with letter annotations
letters_df <- data.frame(
  Diet    = levels(oldchicks$Diet),
  Letters = c("A", "AB", "BC", "AB"),
  stringsAsFactors = FALSE
)
letters_df

##   Diet Letters
## 1    1       A
## 2    2      AB
## 3    3      BC
## 4    4      AB

# Compute mean and SE
mean_weight <- tapply(oldchicks$weight, oldchicks$Diet, mean)
se_weight <- tapply(oldchicks$weight, oldchicks$Diet, function(x) sd(x) / sqrt(length(x)))
mean_weight

##        1        2        3        4 
## 177.7500 214.7000 270.3000 238.5556

se_weight

##        1        2        3        4 
## 14.67552 24.70944 22.64904 14.44925

# Merge diet, means and SEs into a single dataframe
summary_df <- data.frame(
  Diet        = names(mean_weight),
  mean_weight = mean_weight,
  se_weight   = se_weight,
  stringsAsFactors = FALSE
)
summary_df <- merge(summary_df, letters_df, by = "Diet")
summary_df

##   Diet mean_weight se_weight Letters
## 1    1    177.7500  14.67552       A
## 2    2    214.7000  24.70944      AB
## 3    3    270.3000  22.64904      BC
## 4    4    238.5556  14.44925      AB

# Bar plot
bp <- barplot(
  height = summary_df$mean_weight,
  names.arg = summary_df$Diet,
  ylim = c(0, max(summary_df$mean_weight + summary_df$se_weight) + 30),
  ylab = "Chick weight at day 21 (g)",
  xlab = "Diet",
  col  = "pink",
  border = "black"
)

# Error bars
arrows(
  x0 = bp,
  y0 = summary_df$mean_weight - summary_df$se_weight,
  x1 = bp,
  y1 = summary_df$mean_weight + summary_df$se_weight,
  angle = 90,
  code  = 3,
  length = 0.05
)

# Annotations
text(
  x = bp,
  y = summary_df$mean_weight + summary_df$se_weight + 20,
  labels = summary_df$Letters,
  cex = 1.2
)