One of the things I learned in Roger Peng’s excellent R Programming course was how to use tapply to very easily apply a function to each subset of a data frame.

For example, let’s use the ChickWeight data set.

library(datasets)
data(ChickWeight)

To find out what’s in one of R’s “built-in” data sets, we can request help:

help(ChickWeight)
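
Besides the help page, a quick look at the data frame itself can be useful. This is just a small sketch using base R's str() and head(); the column names in the comments are the ones the rest of this post relies on.

```r
library(datasets)

# Show the column names and types of the data frame
str(ChickWeight)

# Show the first few rows: weight, Time, Chick, Diet
head(ChickWeight)
```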

To see how the chicks’ diets relate to their weight, we can calculate the mean weight of the chicks for each of the diets.

meanWeights <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)
print(meanWeights)
##        1        2        3        4 
## 102.6455 122.6167 142.9500 135.2627
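
To make it clearer what tapply is doing here, the sketch below uses a tiny made-up vector (the values and group labels are hypothetical, not taken from ChickWeight). It computes the group means with tapply and then again with split() and sapply(), which is roughly the same split-then-apply idea written out in two steps.

```r
# Hypothetical toy data, not from ChickWeight
weights <- c(10, 20, 30, 40, 50, 60)
groups  <- factor(c("a", "a", "b", "b", "b", "c"))

# tapply splits `weights` by `groups` and applies mean() to each piece
byTapply <- tapply(weights, groups, mean)

# The same computation spelled out: split into a list, then apply mean()
bySplit <- sapply(split(weights, groups), mean)

print(byTapply)
print(bySplit)
```

Both results hold one mean per group, named by the factor level; tapply just saves the intermediate step.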

To make sure the weight differences among the diets are not unduly confounded by age, let’s also check the mean time of the weight measurements, i.e., the mean number of days after birth on which the chicks were weighed.

meanTimes <- tapply(ChickWeight$Time, ChickWeight$Diet, mean)
print(meanTimes)
##        1        2        3        4 
## 10.48182 10.91667 10.91667 10.75424

It looks as though the mean weights and mean times are fairly strongly correlated across the four diets, although four points are not much to go on.

cor(meanTimes, meanWeights)
## [1] 0.7910637

However, every diet group was weighed through the same last day of the study (at age 21 days). We know this based on the following:

maxTimes <- tapply(ChickWeight$Time, ChickWeight$Diet, max)
print(maxTimes)
##  1  2  3  4 
## 21 21 21 21

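Since the measurements for every diet run to the same final day, and the heaviest weights are most likely recorded late in the study, the maximum weights give us a rough comparison of the diets that is less affected by when the chicks were weighed: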

maxWeights <- tapply(ChickWeight$weight, ChickWeight$Diet, max)
print(maxWeights)
##   1   2   3   4 
## 305 331 373 322
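
The maximum is driven by a single heavy chick per diet, so another way to compare like with like is to keep only the day-21 measurements and take the mean per diet. This is a sketch of that idea rather than part of the original analysis; the names finalDay and finalMeanWeights are my own.

```r
library(datasets)

# Keep only the measurements taken on the final day of the study
finalDay <- ChickWeight[ChickWeight$Time == 21, ]

# Mean weight per diet, using only the day-21 measurements
finalMeanWeights <- tapply(finalDay$weight, finalDay$Diet, mean)
print(finalMeanWeights)

# Number of day-21 measurements per diet, to see how many chicks each mean rests on
print(table(finalDay$Diet))
```

The table of counts is worth a glance, since a mean based on only a handful of chicks deserves less trust.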

Based on this result, diet #3 looks like the diet that’s optimized for getting us the biggest chickens, while diet #1 might be preferred by chicks who dream of becoming fashion models. (Henny day now.)