Basic Inferential Analysis on ToothGrowth Data

Overview

In this analysis, we examine the ToothGrowth dataset that ships with R. This dataset records the length of odontoblasts (the cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, or 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC). We study the distribution of length for each delivery method and each dose level, and compare their effectiveness.

Loading data and basic exploratory data analysis

We load the ToothGrowth (TG) data and perform some exploratory analysis.

data(ToothGrowth)
TG <- ToothGrowth

str(TG)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

with(TG,table(supp,dose))

##     dose
## supp 0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

We also draw a boxplot of len for each supp and dose group (see figure 2).

library(ggplot2)
# One panel per dose level; within each panel, one box per supplement
g3 <- ggplot(TG, aes(x = supp, y = len)) + geom_boxplot() + facet_grid(. ~ dose)
print(g3)

Figure 2: Length vs. supp at all dose levels.

Summary of data

From the previous results, we see that the TG data consists of 60 observations of 3 variables: len, supp, and dose. Grouping by supp and dose separates the data into 6 groups, each with 10 observations.

By inspecting figure 2, we can see that for each supplement, a higher dose corresponds to greater length. For dose = 0.5 and dose = 1, OJ appears to produce a stronger response in length than VC, while for dose = 2, there is no substantial difference in len between the two supplements. Note that a boxplot displays medians and quartiles rather than means, so this is a visual impression only.
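Since the boxplot shows medians rather than means, the group means can be tabulated directly to check this reading. The following is a minimal sketch; the names mean_table and oj_minus_vc are introduced here for illustration and are not part of the original analysis.

```r
# Mean tooth length for every supp x dose combination.
data(ToothGrowth)

# Rows are supplements (OJ, VC), columns are dose levels ("0.5", "1", "2").
mean_table <- with(ToothGrowth, tapply(len, list(supp, dose), mean))
print(round(mean_table, 2))

# Difference in means (OJ minus VC) at each dose level.
oj_minus_vc <- mean_table["OJ", ] - mean_table["VC", ]
print(round(oj_minus_vc, 2))
```

A positive entry in oj_minus_vc means OJ has the larger sample mean at that dose. Whether such a difference is statistically meaningful is a separate question that the t-tests in the next section address.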

Confidence intervals and hypothesis testing

We now calculate a 95% confidence interval for the mean of len in each supp and dose group, using the one-sample t.test function.

ci <- aggregate(len~supp+dose, TG, function (x) t.test(x)$conf.int)
ci

##   supp dose     len.1     len.2
## 1   OJ  0.5 10.039717 16.420283
## 2   VC  0.5  6.015176  9.944824
## 3   OJ  1.0 19.902273 25.497727
## 4   VC  1.0 14.970657 18.569343
## 5   OJ  2.0 24.160686 27.959314
## 6   VC  2.0 22.707910 29.572090

Here len.1 and len.2 are the lower bound and upper bound of the 95% confidence interval for each group, respectively.
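To make clear what t.test computes here, the interval for a single group can be reproduced by hand as the sample mean plus or minus a t quantile times the standard error. This is a sketch for the OJ, dose = 1 group only; the variable names (x, half_width, manual_ci, builtin_ci) are illustrative.

```r
# Reproduce the 95% t-based confidence interval for one group by hand.
data(ToothGrowth)
x <- with(ToothGrowth, len[supp == "OJ" & dose == 1])
n <- length(x)

# Half-width = t quantile (n - 1 degrees of freedom) * standard error of the mean.
half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual_ci <- mean(x) + c(-1, 1) * half_width

# The same interval as reported by t.test.
builtin_ci <- as.numeric(t.test(x)$conf.int)

print(rbind(manual = manual_ci, t.test = builtin_ci))
```

The two rows should agree, and they should match row 3 of the table above (OJ at dose 1.0).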

There are many hypotheses we could test with this dataset. We test just one of them here as an example.

From figure 2, we can see that at dose = 1, OJ seems more effective than VC in enhancing length. We use a one-sided two-sample t-test to test that hypothesis. Here, the null hypothesis is that OJ is not more effective than VC, i.e., the difference in mean length (OJ minus VC) is not greater than 0.

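# Lengths at dose = 1 for each supplement
g1 <- TG$len[TG$dose == 1 & TG$supp == 'OJ']
g2 <- TG$len[TG$dose == 1 & TG$supp == 'VC']
# One-sided test of mean(g1) > mean(g2); t.test defaults to Welch (unequal variances)
t.test(g1, g2, alternative = 'greater')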

## 
##  Welch Two Sample t-test
## 
## data:  g1 and g2
## t = 4.0328, df = 15.358, p-value = 0.0005192
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  3.356158      Inf
## sample estimates:
## mean of x mean of y 
##     22.70     16.77
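
The statistic, degrees of freedom, and p-value in this output can be recomputed from the group means and variances, which shows what the Welch test does. The sketch below uses the standard Welch statistic and the Welch-Satterthwaite approximation for the degrees of freedom; the names v1, v2, t_stat, welch_df, and p_value are illustrative.

```r
# Recompute the Welch two-sample t-test for dose = 1 by hand.
data(ToothGrowth)
g1 <- with(ToothGrowth, len[dose == 1 & supp == "OJ"])
g2 <- with(ToothGrowth, len[dose == 1 & supp == "VC"])

# Squared standard error of each group mean.
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)

# Welch t statistic: difference in means over the combined standard error.
t_stat <- (mean(g1) - mean(g2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

# One-sided p-value for the alternative "difference in means is greater than 0".
p_value <- pt(t_stat, df = welch_df, lower.tail = FALSE)

print(c(t = t_stat, df = welch_df, p = p_value))
```

These values should match the t, df, and p-value lines reported by t.test above.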

With a p-value of 0.0005192, well below the 0.05 significance level, we reject the null hypothesis and conclude that ‘OJ’ is more effective than ‘VC’ at dose = 1. We reach the same conclusion at dose = 0.5, but for dose = 2, we fail to reject the null hypothesis.
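
The statements about dose = 0.5 and dose = 2 can be checked by repeating the same one-sided Welch test at every dose level. This is a minimal sketch, assuming a 0.05 significance level for each test and applying no correction for multiple comparisons; the names doses, p_values, and reject_null are illustrative.

```r
# Repeat the one-sided Welch test (OJ > VC) at each dose level.
data(ToothGrowth)
doses <- sort(unique(ToothGrowth$dose))

p_values <- sapply(doses, function(d) {
  oj <- with(ToothGrowth, len[dose == d & supp == "OJ"])
  vc <- with(ToothGrowth, len[dose == d & supp == "VC"])
  t.test(oj, vc, alternative = "greater")$p.value
})
names(p_values) <- doses

# TRUE where the null hypothesis is rejected at the (assumed) 0.05 level.
reject_null <- p_values < 0.05

print(signif(p_values, 4))
print(reject_null)
```

Because three tests are run, a stricter analysis might adjust the p-values, for example with p.adjust(p_values, method = "holm"); that step is not part of the original analysis.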

Conclusions

By analyzing the TG dataset, we reach the following conclusions. For dose = 0.5 and dose = 1, orange juice is more effective than ascorbic acid in enhancing length. For dose = 2, we do not observe a statistically significant difference between the two delivery methods.