Inferential Analysis: ToothGrowth Data

1. Overview

We investigated the ToothGrowth data, which records the effect of vitamin C on tooth growth. We wanted to see which delivery method, orange juice or ascorbic acid (a form of vitamin C), has the greater effect on tooth growth. To do so, we ran a two-sample Student’s t-test comparing the two supplement groups at each dose.

2. Basic Summaries & Some EDAs

Let’s load the ToothGrowth data and see what it looks like.

data("ToothGrowth")
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

The ToothGrowth data is a data frame with 60 observations of 3 variables.

  • len : numeric variable for the length of odontoblasts, arbitrary units
  • supp : factor variable with 2 levels

     OJ : Orange Juice
     VC : ascorbic acid (a form of vitamin C)
  • dose : numeric variable for dose, mg/day

head(ToothGrowth, 3) ; tail(ToothGrowth, 3) ; table(ToothGrowth$supp)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
##     len supp dose
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2
## 
## OJ VC 
## 30 30

The first 30 rows have supp level “VC”, and the last 30 have “OJ”.

summary(ToothGrowth$len)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20   13.08   19.25   18.81   25.28   33.90

This is a summary of the len variable. Its mean is 18.81 and its median is 19.25.

with(ToothGrowth, tapply(len, supp, mean))
##       OJ       VC 
## 20.66333 16.96333

These are the group means for each supp level. The mean of “OJ” is greater than that of “VC”.
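Because the later comparisons are made dose by dose, it can help to break the same means down by both supp and dose. The following is a minimal sketch using aggregate(); the name cell_means is ours and is not part of the original analysis.

```r
# Sketch: mean tooth length for every supp x dose combination.
data("ToothGrowth")

# One row per (supp, dose) cell; the len column holds the cell mean
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
cell_means
```

Each cell contains 10 observations, so every mean here is based on the same sample size.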

library(ggplot2)
g <- ggplot(data=ToothGrowth, aes(x=dose, y=len))
g1 <- g + geom_point() + facet_wrap(~supp)
g2 <- g1 + stat_smooth(method="lm")  
g2

The figure above shows len against dose for each supp group, with a fitted regression line. As the group means suggested, the “OJ” line sits higher overall than the “VC” line. However, the slope for “VC” is steeper than that for “OJ”.
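The slopes read off the plot can also be checked numerically. As a rough sketch, we fit a separate simple linear regression of len on dose within each supp group, which is the same kind of line that stat_smooth(method="lm") draws in each facet. The name slopes is ours.

```r
# Sketch: per-group regression slopes of len on dose.
data("ToothGrowth")

slopes <- sapply(
  split(ToothGrowth, ToothGrowth$supp),
  function(d) unname(coef(lm(len ~ dose, data = d))["dose"])
)
slopes  # named vector with one slope per supp level (OJ, VC)
```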

g <- ggplot(data=ToothGrowth, aes(x=dose, y=len))
g1 <- g + geom_violin(aes(fill=supp)) + facet_wrap(~dose)
g1

At the 0.5 and 1.0 doses, OJ values are higher than VC values and are also more spread out. At the 2.0 dose, however, the pattern is reversed.
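The spread seen in the violins can be summarised with the standard deviation of len in each supp and dose cell. This is only a sketch to accompany the plot; the name spread is ours.

```r
# Sketch: standard deviation of len for each supp (rows) and dose (columns).
data("ToothGrowth")

spread <- with(ToothGrowth, tapply(len, list(supp, dose), sd))
round(spread, 2)
```

A larger value in the OJ row than in the VC row for a given dose column means the OJ lengths vary more at that dose.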

3. Hypothesis Tests: Comparing the Two Group Means at Each Dose

tg1 <- subset(ToothGrowth, dose==0.5)
tg2 <- subset(ToothGrowth, dose==1.0)
tg3 <- subset(ToothGrowth, dose==2.0)

We used subset() to split the ToothGrowth data by dose. Each subset (10 observations per supp group) is too small to rely on the CLT, so we used Student’s t-test, in its Welch form since var.equal=FALSE. The significance level (alpha) for every test is 0.05. Since we wanted to know whether OJ is larger than VC, we performed one-sided tests. Therefore,

  • Null hypothesis : The mean length for OJ is the same as that for VC.
  • Alternative hypothesis : The mean length for OJ is greater than that for VC.

## One sided t test for 0.5 dose
# The two groups are independent (unpaired), which is the default
t.test(len ~ supp, var.equal=FALSE,
       alternative = "greater", data=tg1)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.003179
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  2.34604     Inf
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

The first test is for the 0.5 dose. The p-value is much smaller than the significance level of 0.05, so we reject the null hypothesis and conclude that the mean of OJ is significantly greater than that of VC.
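To make the numbers in the t.test() output above less of a black box, here is a sketch that recomputes the Welch statistic, its approximate degrees of freedom (Welch–Satterthwaite), and the one-sided p-value by hand for the 0.5 dose. The variable names are ours; the results should agree with the reported t, df, and p-value up to rounding.

```r
# Sketch: reproduce the Welch two-sample t-test for the 0.5 dose by hand.
data("ToothGrowth")
tg1 <- subset(ToothGrowth, dose == 0.5)
oj <- tg1$len[tg1$supp == "OJ"]
vc <- tg1$len[tg1$supp == "VC"]

# Squared standard error contributed by each group
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Test statistic: difference in means over its standard error
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# One-sided p-value for the alternative "OJ mean is greater"
p_value <- pt(t_stat, df_welch, lower.tail = FALSE)

c(t = t_stat, df = df_welch, p.value = p_value)
```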

## One sided t test for 1.0 dose
# The two groups are independent (unpaired), which is the default
t.test(len ~ supp, var.equal=FALSE,
       alternative = "greater", data=tg2)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.0005192
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  3.356158      Inf
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

The next test is for the 1.0 dose. Again, the p-value is small enough to reject the null hypothesis, so we conclude that the mean of OJ is significantly greater than that of VC at this dose as well.

## One sided t test for 2.0 dose
# The two groups are independent (unpaired), which is the default
t.test(len ~ supp, var.equal=FALSE,
       alternative = "greater", data=tg3)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.5181
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -3.1335     Inf
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

The last test is for the 2.0 dose. Here, the p-value is larger than the significance level of 0.05, so we cannot reject the null hypothesis for the 2.0 dose. The sample estimates show that the mean of VC is in fact slightly larger than that of OJ.
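The three tests can also be run in a single pass, which makes the decisions easier to compare side by side. The sketch below loops over the dose values and collects the one-sided p-values; the names doses, p_values, and reject are ours, and alpha is 0.05 as in the text.

```r
# Sketch: one-sided Welch t-test (OJ > VC) at each dose, collected together.
data("ToothGrowth")

doses <- c(0.5, 1.0, 2.0)
p_values <- sapply(doses, function(d) {
  t.test(len ~ supp, data = subset(ToothGrowth, dose == d),
         alternative = "greater", var.equal = FALSE)$p.value
})
names(p_values) <- doses

# Decision at alpha = 0.05: TRUE means the null hypothesis is rejected
reject <- p_values < 0.05

data.frame(dose = doses, p.value = signif(p_values, 4), reject = reject)
```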

4. Conclusion

We conclude that at the 0.5 and 1.0 doses, orange juice has a greater effect on tooth growth than ascorbic acid. At the 2.0 dose, however, we cannot say that the effect of orange juice is greater than that of ascorbic acid. We drew these conclusions from one-sided t-tests, which rest on the assumptions listed below.

  • The variances of the two groups are not assumed to be equal.
  • The subjects in the two groups are not paired.
  • The significance level for each test is 0.05.