Today we will be exploring the ChickWeight dataset. Here’s the
description provided in the R help file ?ChickWeight
The ChickWeight data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks.
weight: a numeric vector giving the body weight of
the chick (gm)
Time: a numeric vector giving the number of days
since birth when the measurement was made
Chick: an ordered factor with levels 18 < … <
48 giving a unique identifier for the chick. The ordering of the levels
groups chicks on the same diet together and orders them according to
their final weight (lightest to heaviest) within diet
Diet: a factor with levels 1, …, 4 indicating which
experimental diet the chick received
Let’s read in the data
data(ChickWeight)
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
Our research question is: Does diet affect weight of chicks? This is
easily done using ANOVA. The function to use is aov(), NOT
anova() which does something else (ANOVA is essentially an
F test which can be used for a lot of things).
chicks_aov <- aov(weight ~ Diet, data = ChickWeight)
summary(chicks_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 3 155863 51954 10.81 6.43e-07 ***
## Residuals 574 2758693 4806
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results are highly significant! But we forgot to check the four assumptions of linear models (we should have done that BEFORE doing the ANOVA actually)
The four assumptions of linear models are: 1. Linearity 2. Homoscedasticity 3. Independence 4. Normality
Linearity is seldom meaningful to check when x is a categorical
variable. We’ll use diagnostic plots (plot) to examine
Homoscedasticity and Normality. And we’ll explore the data further to
see if any violations of independence are committed.
# Diagnostic plots
plot(chicks_aov)
The first plot (residuals vs fitted values) suggests that the homoscedasticity assumption is met. But the second plot (qq plot) shows that the normality assumption is likely violated (look at the tails that deviate quite far from the standard normal)
Further evidence for this can be obtained by looking at a histogram of the data:
hist(ChickWeight$weight)
Next, let’s look at independence violations.
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
It looks like many chicks were measured repeatedly. This is a clear case of violation of independence (data from the same Chick are not independent of each other). Here are a few ways we can visualize it:
table(ChickWeight$Time, ChickWeight$Chick)
##
## 18 16 15 13 9 20 10 8 17 19 4 6 11 3 1 12 2 5 14 7 24 30 22 23 27 28 26 25
## 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 8 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 10 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 12 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 14 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 16 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 18 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 20 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## 29 21 33 37 36 31 39 38 32 40 34 35 44 45 43 41 47 49 46 50 42 48
## 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 8 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 10 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 12 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 14 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 16 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 20 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
plot(weight ~ Time, data = ChickWeight, type = "n")
for(i in unique(ChickWeight$Chick)) {
lines(weight ~ Time, data = ChickWeight,
subset = Chick==i,
lwd = 2, col = as.numeric(Diet)
)
}
legend('topleft', legend = unique(ChickWeight$Diet),
col = as.numeric(unique(ChickWeight$Diet)), lwd = 2,
title = "Diet"
)
Clearly the chicks are growing and changing in weight through time. It is both violation of independence to use the entire dataset in an ANOVA, as well as illogical to compare data from day 1 with day 21.
It makes much more sense to analyse only the final weights (on day 21) of the chicks. It’s a severe loss in data, but the correct analysis to do, that is both biologically sensible and statistically valid.
oldchicks <- subset(ChickWeight, Time==21)
summary(oldchicks)
## weight Time Chick Diet
## Min. : 74.0 Min. :21 13 : 1 1:16
## 1st Qu.:167.0 1st Qu.:21 9 : 1 2:10
## Median :205.0 Median :21 20 : 1 3:10
## Mean :218.7 Mean :21 10 : 1 4: 9
## 3rd Qu.:266.0 3rd Qu.:21 17 : 1
## Max. :373.0 Max. :21 19 : 1
## (Other):39
table(oldchicks$Time, oldchicks$Chick)
##
## 18 16 15 13 9 20 10 8 17 19 4 6 11 3 1 12 2 5 14 7 24 30 22 23 27 28 26 25
## 21 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##
## 29 21 33 37 36 31 39 38 32 40 34 35 44 45 43 41 47 49 46 50 42 48
## 21 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
We are left with 45, from the original 578. But we now no longer have any independence violations (there aren’t any duplicate observations from any individual chick)
Chick weights for this subset of chicks also looks normally distributed
hist(oldchicks$weight)
old_aov <- aov(weight ~ Diet, data = oldchicks)
summary(old_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Diet 3 57164 19055 4.655 0.00686 **
## Residuals 41 167839 4094
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It looks like diet has a significant effect on chick weight at 21
days (F(3, 41) = 5, p = 0.007). << Note that to report these
statistics, I wrote in Rmarkdown:
F(summary(old_aov)[[1]]$Df[1], summary(old_aov)[[1]]$Df[2]) = round(summary(old_aov)[[1]]$F[1], 0), p = round(summary(old_aov)[[1]]$P[1], 3)
We want to determine now exactly which diet(s) gave rise to significantly higher chick weights. To do so, we need to do a post-hoc test, which controls for multiple comparisons.
TukeyHSD(old_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = weight ~ Diet, data = oldchicks)
##
## $Diet
## diff lwr upr p adj
## 2-1 36.95000 -32.11064 106.01064 0.4868095
## 3-1 92.55000 23.48936 161.61064 0.0046959
## 4-1 60.80556 -10.57710 132.18821 0.1192661
## 3-2 55.60000 -21.01591 132.21591 0.2263918
## 4-2 23.85556 -54.85981 102.57092 0.8486781
## 4-3 -31.74444 -110.45981 46.97092 0.7036249
It looks like the only significant difference was between diet 3 and diet 1. We want to present our findings in a barplot according to conventions taught in the lecture. First, we need to write out the bar annotations (there are packages to do this, but let’s do it the old-fashioned way–manually).
Diet 1 and diet 3 are produce significantly different chick weights, so they don’t share any letters.
letters_df <- tibble(
Diet = levels(oldchicks$Diet),
Letters = c("A", "AB", "BC", "AB")
)
letters_df
## # A tibble: 4 × 2
## Diet Letters
## <chr> <chr>
## 1 1 A
## 2 2 AB
## 3 3 BC
## 4 4 AB
# Create a summary dataframe
summary_df <- oldchicks %>%
group_by(Diet) %>%
summarise(
mean_weight = mean(weight), # of weight
se_weight = sd(weight) / sqrt(n()), # and standard error (this is the actual formula written out)
.groups = "drop"
) %>%
left_join(letters_df, by = "Diet") # join it to the annotations we created in the previous line of code
ggplot(summary_df, aes(x = Diet, y = mean_weight)) +
geom_col(fill = "gold", colour = "black", width = 0.7) +
geom_errorbar(
aes(ymin = mean_weight - se_weight,
ymax = mean_weight + se_weight),
width = 0.2
) +
geom_text(
aes(
label = Letters,
y = mean_weight + se_weight + 20
),
size = 5
) +
labs(
x = "Diet",
y = "Chick weight at day 21 (g)"
) +
theme_classic(base_size = 14)
We can do the same in base R:
# Create dataframe with letter annotations
letters_df <- data.frame(
Diet = levels(oldchicks$Diet),
Letters = c("A", "AB", "BC", "AB"),
stringsAsFactors = FALSE
)
letters_df
## Diet Letters
## 1 1 A
## 2 2 AB
## 3 3 BC
## 4 4 AB
# Compute mean and SE
mean_weight <- tapply(oldchicks$weight, oldchicks$Diet, mean)
se_weight <- tapply(oldchicks$weight, oldchicks$Diet, function(x) sd(x) / sqrt(length(x)))
mean_weight
## 1 2 3 4
## 177.7500 214.7000 270.3000 238.5556
se_weight
## 1 2 3 4
## 14.67552 24.70944 22.64904 14.44925
# Merge diet, means and SEs into a single dataframe
summary_df <- data.frame(
Diet = names(mean_weight),
mean_weight = mean_weight,
se_weight = se_weight,
stringsAsFactors = FALSE
)
summary_df <- merge(summary_df, letters_df, by = "Diet")
summary_df
## Diet mean_weight se_weight Letters
## 1 1 177.7500 14.67552 A
## 2 2 214.7000 24.70944 AB
## 3 3 270.3000 22.64904 BC
## 4 4 238.5556 14.44925 AB
# Bar plot
bp <- barplot(
height = summary_df$mean_weight,
names.arg = summary_df$Diet,
ylim = c(0, max(summary_df$mean_weight + summary_df$se_weight) + 30),
ylab = "Chick weight at day 21 (g)",
xlab = "Diet",
col = "pink",
border = "black"
)
# Error bars
arrows(
x0 = bp,
y0 = summary_df$mean_weight - summary_df$se_weight,
x1 = bp,
y1 = summary_df$mean_weight + summary_df$se_weight,
angle = 90,
code = 3,
length = 0.05
)
# Annotations
text(
x = bp,
y = summary_df$mean_weight + summary_df$se_weight + 20,
labels = summary_df$Letters,
cex = 1.2
)