One of the things I learned in Roger Peng’s excellent R Programming Course was how to use tapply to very easily apply a function to each subset of a data frame column, with the subsets defined by a grouping factor.
For example, let’s use the ChickWeight data set.
library(datasets)
data(ChickWeight)
To find out what’s in one of R’s “built-in” data sets, we can request help:
help(ChickWeight)
To see how the chicks’ diets relate to their weight, we can calculate the mean weight of the chicks for each of the diets.
meanWeights <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)
print(meanWeights)
## 1 2 3 4
## 102.6455 122.6167 142.9500 135.2627
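As a rough sketch of what tapply is doing here (my own illustration, not something taken from the course), the same means can be computed by first splitting the weights by diet and then applying mean to each piece. The variable names below are just ones I made up for the example.

```r
library(datasets)

# Split the weight vector into one piece per diet level
weightsByDiet <- split(ChickWeight$weight, ChickWeight$Diet)

# Apply mean to each piece; the result is a named numeric vector
meanWeightsSplit <- sapply(weightsByDiet, mean)

# tapply does the split-and-apply in a single call
meanWeightsTapply <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)

print(meanWeightsSplit)

# Compare the values only, ignoring the differing attributes
print(all.equal(as.vector(meanWeightsSplit), as.vector(meanWeightsTapply)))
```

The two approaches should agree; tapply is simply the more compact way to write it.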
To make sure the weight differences among the diets are not unduly confounded by age, let’s also check the mean time of the weight measurements, i.e., the mean number of days after birth at which the chicks were weighed.
meanTimes <- tapply(ChickWeight$Time, ChickWeight$Diet, mean)
print(meanTimes)
## 1 2 3 4
## 10.48182 10.91667 10.91667 10.75424
It looks as though the mean weights and mean times are fairly strongly correlated across the four diets.
cor(meanTimes, meanWeights)
## [1] 0.7910637
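One caveat worth keeping in mind: that correlation is computed from only four points, one per diet. A quick way to gauge how much it can be trusted is cor.test from the stats package. This is only a sketch I’m adding for illustration, and corTest is a name I chose for the example.

```r
library(datasets)

meanWeights <- tapply(ChickWeight$weight, ChickWeight$Diet, mean)
meanTimes <- tapply(ChickWeight$Time, ChickWeight$Diet, mean)

# as.vector() drops the array attributes so cor.test() sees plain numeric vectors
corTest <- cor.test(as.vector(meanTimes), as.vector(meanWeights))

print(corTest$estimate)   # the same Pearson correlation that cor() reports
print(corTest$parameter)  # degrees of freedom: n - 2, so 2 for four diets
print(corTest$p.value)    # how surprising this correlation would be by chance
```

With so few degrees of freedom, even a large correlation coefficient is weak evidence, so it’s best read as a hint rather than a finding.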
However, every diet group was weighed through the same final day of the study (at age 21 days). We know this because the maximum time is the same for each diet:
maxTimes <- tapply(ChickWeight$Time, ChickWeight$Diet, max)
print(maxTimes)
## 1 2 3 4
## 21 21 21 21
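A maximum of 21 for each diet only tells us that at least one chick per diet was weighed on day 21. As a hedged follow-up check (my own addition, with made-up variable names), we can count how many chicks each diet started with and how many weighings were actually recorded on day 21:

```r
library(datasets)

# Number of distinct chicks in each diet group
chicksPerDiet <- tapply(ChickWeight$Chick, ChickWeight$Diet,
                        function(x) length(unique(x)))

# Number of weighings recorded on day 21 in each diet group
# (summing a logical vector counts its TRUE values)
day21PerDiet <- tapply(ChickWeight$Time == 21, ChickWeight$Diet, sum)

print(chicksPerDiet)
print(day21PerDiet)
```

If the two rows differ for a diet, some of that diet’s chicks have no day-21 measurement, which is worth knowing before comparing final weights.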
Since every diet group was followed to the same final day, age is less of a concern, and we can compare the heaviest weight recorded under each diet:
maxWeights <- tapply(ChickWeight$weight, ChickWeight$Diet, max)
print(maxWeights)
## 1 2 3 4
## 305 331 373 322
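The maximum picks out the single heaviest weighing in each diet group, so it can be swayed by one unusually large chick. A complementary view, sketched here as my own illustration (finalDay and meanFinalWeights are names I invented), is to keep only the day-21 rows and take the mean weight per diet:

```r
library(datasets)

# Keep only the weighings taken on the final day of the study
finalDay <- ChickWeight[ChickWeight$Time == 21, ]

# Mean final-day weight for each diet
meanFinalWeights <- tapply(finalDay$weight, finalDay$Diet, mean)

print(meanFinalWeights)
```

Because every row in finalDay comes from a chick of the same age, the diets are being compared on equal footing.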
Based on this result, diet #3 looks like the diet that’s optimized for getting us the biggest chickens, while diet #1 might be preferred by chicks who dream of becoming fashion models. (Henny day now.)