Analysis of tooth growth data.

The data frame “ToothGrowth” consists of three variables: a dependent variable named ‘len’ and two independent variables named ‘dose’ and ‘supp’.

A summary of the data by dose and supp:

library(plyr)
ddply(ToothGrowth, .(supp, dose), function(x) c(summary(x$len), Std.dev. = round(sd(x$len),2)))
##   supp dose Min. 1st Qu. Median  Mean 3rd Qu. Max. Std.dev.
## 1   OJ  0.5  8.2    9.70  12.25 13.23   16.18 21.5     4.46
## 2   OJ  1.0 14.5   20.30  23.45 22.70   25.65 27.3     3.91
## 3   OJ  2.0 22.4   24.58  25.95 26.06   27.08 30.9     2.66
## 4   VC  0.5  4.2    5.95   7.15  7.98   10.90 11.5     2.75
## 5   VC  1.0 13.6   15.27  16.50 16.77   17.30 22.5     2.52
## 6   VC  2.0 18.5   23.38  25.95 26.14   28.80 33.9     4.80

Histograms of the data by dose and supp:

library(ggplot2)
ggplot(ToothGrowth, aes(len)) + geom_histogram(binwidth = 2) + facet_grid(dose ~ supp)

There is such a clear difference between doses that it does not need to be tested, but do the two supp groups differ significantly from each other when the dose is held constant? Assuming a constant variance, and that the groups are not paired (we don’t know what the variables mean), the 95% confidence interval for the difference in mean len between the supp groups (OJ minus VC) at each dose is:

# Unpaired (the t.test default), pooled-variance 95% CI for OJ minus VC at each dose
t_df <- ddply(ToothGrowth, ~dose, function(x) t.test(len ~ supp, var.equal = TRUE, data = x)$conf.int)
colnames(t_df)[2:3] <- c("95% CI lower", "95% CI upper")

t_df
##   dose 95% CI lower 95% CI upper
## 1  0.5     1.770262     8.729738
## 2  1.0     2.840692     9.019308
## 3  2.0    -3.722999     3.562999

These t.test confidence intervals indicate, at the 5% significance level, that mean len differs between the supp groups when dose is 0.5 or 1.0, with OJ giving the larger mean in both cases. At dose 2.0 there is no significant difference, as the 95% confidence interval includes zero.
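
As a check on where these intervals come from, the pooled-variance interval for a single dose can be rebuilt by hand. The following is a minimal sketch for dose 0.5 under the same assumptions as above (equal variances, unpaired groups). The names low, oj, vc, sp2, se, tq and manual_ci are illustrative and do not appear in the analysis above.

```r
# Subset to the lowest dose and split len by supplement group.
low <- subset(ToothGrowth, dose == 0.5)
oj <- low$len[low$supp == "OJ"]
vc <- low$len[low$supp == "VC"]
n1 <- length(oj)
n2 <- length(vc)

# Pooled variance under the equal-variance assumption.
sp2 <- ((n1 - 1) * var(oj) + (n2 - 1) * var(vc)) / (n1 + n2 - 2)

# Standard error of the difference in means, and the two-sided
# 97.5% t quantile on n1 + n2 - 2 degrees of freedom.
se <- sqrt(sp2 * (1 / n1 + 1 / n2))
tq <- qt(0.975, df = n1 + n2 - 2)

# 95% confidence interval for mean(OJ) - mean(VC).
manual_ci <- (mean(oj) - mean(vc)) + c(-1, 1) * tq * se
manual_ci
```

The result should agree with the dose 0.5 row of the table above (roughly 1.77 to 8.73), since t.test with var.equal = TRUE uses the same pooled standard error and degrees of freedom.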