ToothGrowth Analysis

SONJA OFFWOOD

Introduction

This document performs hypothesis tests on the ToothGrowth dataset that ships with R. The dataset records tooth lengths measured under various doses of two supplements, “Orange Juice” and “Vitamin C”. The tests in this document check whether the supplement used has a significant impact on the tooth length of the subject.

Preprocessing the data

We first need to load the required packages for our analysis, as well as load the data into R:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.1
library(knitr)
## Warning: package 'knitr' was built under R version 3.2.2
data(ToothGrowth)
x=ToothGrowth

Let's have a quick look at the structure of the data, as well as a summary of the data:

str(x)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

We can see here that the dataset consists of 3 variables: the tooth length (len), the supplement (supp, either “OJ” or “VC”) and the dose of the supplement that was used (dose).

summary(x)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Let's perform some exploratory analysis by plotting the data:

g = ggplot(data=x, aes(x=factor(dose),y=len, fill=factor(supp)))+ geom_dotplot(binaxis = "y")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.title=element_text(size=14)) + theme(plot.title = element_text(size=16,face="bold"))
g = g + labs(title=expression("Supplement and Dosage Impact on Tooth Length"), x="Dosage", y="Tooth Length")
print(g)
## stat_bindot: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

This plot shows a clear trend between tooth length and dosage: the higher the dosage, the longer the tooth. We are, however, interested in the impact of the supplement on the tooth length. Let's have a look at a boxplot of the data by supplement type:

g2= ggplot(data=x, aes(x=supp, y=len))+ geom_boxplot(aes(fill=supp))
g2 = g2 + theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.title=element_text(size=14)) + theme(plot.title = element_text(size=16,face="bold"))
g2 = g2 + xlab("Supplement type") + ylab("Tooth length")+ ggtitle(" Boxplot of tooth length by supplement type ")
print(g2)
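
As a numerical companion to the two plots, the group means can be tabulated by supplement and dose. This is a minimal sketch using base R's `aggregate`; the object name `group_means` is introduced here for illustration only and is not part of the original analysis.

```r
data(ToothGrowth)

# Mean tooth length for each supplement/dose combination
# (group_means is a hypothetical name used only in this sketch)
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(group_means)
```

Each row of `group_means` holds the mean of `len` for one of the six supplement/dose groups.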

Hypothesis Testing

Hypothesis on supplement only

First, we test the null hypothesis that the mean tooth length is equal for the two supplements, Orange Juice and Vitamin C, versus the alternative hypothesis that the mean tooth length differs between the two supplements.
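
In symbols, the hypotheses and the statistic computed by `t.test` with `var.equal = FALSE` (Welch's two-sample t-test) can be written as follows. The notation is an assumption introduced here: the subscripts OJ and VC mark the two supplement groups, with sample means \bar{x}, sample variances s^2 and group sizes n.

```latex
H_0:\ \mu_{OJ} = \mu_{VC}
\qquad \text{versus} \qquad
H_1:\ \mu_{OJ} \neq \mu_{VC}

t = \frac{\bar{x}_{OJ} - \bar{x}_{VC}}
         {\sqrt{\dfrac{s_{OJ}^{2}}{n_{OJ}} + \dfrac{s_{VC}^{2}}{n_{VC}}}}
```

The degrees of freedom reported by the test are not an integer because they come from the Welch-Satterthwaite approximation rather than from a pooled variance.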

test = t.test(len ~ supp, data= x, var.equal = FALSE, paired=FALSE ,conf.level = .95)
result = data.frame( "t-statistic"  = test$statistic, 
                       "df" = test$parameter,
                        "p-value"  = test$p.value,
                        "OJ mean" = test$estimate[1],
                        "VC mean" = test$estimate[2],
                         row.names = "Orange Juice vs Vitamin C ")

kable(x = round(result,3),align = 'c' ,
      caption = "Two sample t-test for tooth growth by supplement (excl.Dose)")

Two sample t-test for tooth growth by supplement (excl.Dose)

                             t.statistic     df     p.value   OJ.mean   VC.mean
Orange Juice vs Vitamin C       1.915      55.309    0.061    20.663    16.963

resultCI = data.frame("lower CL" = test$conf.int[1],
                      "upper CL" = test$conf.int[2],
                      row.names = "Confidence Interval")
kable(x = round(resultCI,3),align = 'c' ,
      caption = "Confidence Interval for two sample t-test for tooth growth by supplement (excl.Dose)")

Confidence Interval for two sample t-test for tooth growth by supplement (excl.Dose)

                       lower.CL   upper.CL
Confidence Interval     -0.171     7.571

We perform this test at the 95% confidence level (5% significance level) and do not assume equal variances. From the above results we fail to reject the null hypothesis: at the 5% level there is not enough evidence to conclude that the mean tooth length differs between the two supplements. Note that this is not the same as being 95% sure that the means are equal. Any of the following leads to this conclusion:

  • The p-value (0.061) is above 5%.
  • Zero lies inside the 95% confidence interval (-0.171, 7.571).
  • The absolute value of the t-statistic (1.915) is below the two-sided critical value of about 2.00 for 55.3 degrees of freedom.

The easiest method is to compare the p-value with the significance level, so this approach is used for the rest of the document.
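
The agreement between the three decision criteria can be checked directly. The following is a minimal sketch, not part of the original analysis; the names `alpha`, `t_crit` and the `reject_by_*` flags are introduced here for illustration.

```r
data(ToothGrowth)

# Welch two-sample t-test, as in the analysis above
test <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE,
               paired = FALSE, conf.level = 0.95)

alpha <- 0.05  # significance level matching the 95% confidence level

# Criterion 1: p-value below the significance level
reject_by_p <- test$p.value < alpha

# Criterion 2: zero lies outside the confidence interval
reject_by_ci <- test$conf.int[1] > 0 || test$conf.int[2] < 0

# Criterion 3: |t| exceeds the two-sided critical value
t_crit <- qt(1 - alpha / 2, df = test$parameter)
reject_by_t <- unname(abs(test$statistic) > t_crit)

# All three flags agree, because they are equivalent views of the same test
print(c(p.value = reject_by_p, conf.int = reject_by_ci, t.stat = reject_by_t))
```

For a two-sided test the three flags always take the same value, which is why comparing the p-value with the significance level is sufficient.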

Hypothesis on supplement and dosage

We now perform a similar test on three subsets of the data, treating the dosages separately. For each of the dosages 0.5, 1 and 2, we test the null hypothesis that the mean tooth length is equal for the two supplements, Orange Juice and Vitamin C, versus the alternative hypothesis that the mean tooth length differs between the two supplements.

We again do not assume equal variances and perform each test at the 95% confidence level.

# One subset per dose level
x2 = subset(x, dose == 0.5)
x3 = subset(x, dose == 1)
x4 = subset(x, dose == 2)

test2 = t.test(len ~ supp, data= x2, var.equal = FALSE, paired=FALSE ,conf.level = .95)
result2 = data.frame( "t-statistic"  = test2$statistic, 
                       "df" = test2$parameter,
                        "p-value"  = test2$p.value,
                        "lower CL" = test2$conf.int[1],
                        "upper CL" = test2$conf.int[2],
                        "OJ mean" = test2$estimate[1],
                        "VC mean" = test2$estimate[2],
                         row.names = "Orange Juice vs Vitamin C (Dose=0.5)")

test3 = t.test(len ~ supp, data= x3, var.equal = FALSE, paired=FALSE ,conf.level = .95)
result3 = data.frame( "t-statistic"  = test3$statistic, 
                       "df" = test3$parameter,
                        "p-value"  = test3$p.value,
                        "lower CL" = test3$conf.int[1],
                        "upper CL" = test3$conf.int[2],
                        "OJ mean" = test3$estimate[1],
                        "VC mean" = test3$estimate[2],
                         row.names = "Orange Juice vs Vitamin C (Dose=1)")

test4 = t.test(len ~ supp, data= x4, var.equal = FALSE, paired=FALSE ,conf.level = .95)
result4 = data.frame( "t-statistic"  = test4$statistic, 
                       "df" = test4$parameter,
                        "p-value"  = test4$p.value,
                        "lower CL" = test4$conf.int[1],
                        "upper CL" = test4$conf.int[2],
                        "OJ mean" = test4$estimate[1],
                        "VC mean" = test4$estimate[2],
                         row.names = "Orange Juice vs Vitamin C (Dose=2)")
kable(x = round(result2,3),align = 'c' ,
      caption = "Two sample t-test for tooth growth by supplement (Dose=0.5)")

Two sample t-test for tooth growth by supplement (Dose=0.5)

                                        t.statistic     df     p.value   lower.CL   upper.CL   OJ.mean   VC.mean
Orange Juice vs Vitamin C (Dose=0.5)       3.17       14.969    0.006     1.719      8.781      13.23     7.98

kable(x = round(result3,3),align = 'c' ,
      caption = "Two sample t-test for tooth growth by supplement (Dose=1)")

Two sample t-test for tooth growth by supplement (Dose=1)

                                      t.statistic     df     p.value   lower.CL   upper.CL   OJ.mean   VC.mean
Orange Juice vs Vitamin C (Dose=1)       4.033      15.358    0.001     2.802      9.058      22.7      16.77

kable(x = round(result4,3),align = 'c' ,
      caption = "Two sample t-test for tooth growth by supplement (Dose=2)")

Two sample t-test for tooth growth by supplement (Dose=2)

                                      t.statistic     df     p.value   lower.CL   upper.CL   OJ.mean   VC.mean
Orange Juice vs Vitamin C (Dose=2)      -0.046      14.04     0.964     -3.798     3.638      26.06     26.14

We conclude from the p-values above that:

  • For Dose 0.5, we reject the null hypothesis (p-value = 0.006 < 5%): there is enough evidence of a difference in mean tooth length between Orange Juice and Vitamin C.
  • For Dose 1, we reject the null hypothesis (p-value = 0.001 < 5%): there is enough evidence of a difference in mean tooth length between Orange Juice and Vitamin C.
  • For Dose 2, we do not reject the null hypothesis (p-value = 0.964 > 5%): there is not enough evidence of a difference in mean tooth length between Orange Juice and Vitamin C.
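
The three per-dose tests repeat the same steps, so they can also be run in a single pass. This is a minimal sketch of that idea, assuming the same Welch test settings as above; the names `welch_by_dose` and `results` are hypothetical and introduced only for this example.

```r
data(ToothGrowth)

# Run the Welch t-test on one dose subset and collect the key numbers
# (welch_by_dose is a hypothetical helper, not part of the original analysis)
welch_by_dose <- function(d) {
  tt <- t.test(len ~ supp, data = d, var.equal = FALSE,
               paired = FALSE, conf.level = 0.95)
  data.frame(t.statistic = unname(tt$statistic),
             df          = unname(tt$parameter),
             p.value     = tt$p.value,
             lower.CL    = tt$conf.int[1],
             upper.CL    = tt$conf.int[2],
             OJ.mean     = unname(tt$estimate[1]),
             VC.mean     = unname(tt$estimate[2]))
}

# split() yields one data frame per dose level (0.5, 1, 2)
results <- do.call(rbind, lapply(split(ToothGrowth, ToothGrowth$dose),
                                 welch_by_dose))
print(round(results, 3))
```

Each row of `results` corresponds to one dose level, in the order 0.5, 1, 2, so the per-dose p-values can be compared with the 5% level side by side.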

Conclusion

The above tests indicate that only for dosages of 0.5 and 1 is there enough evidence of a difference in mean tooth length between the orange juice and vitamin C supplements; at both of these dosages the orange juice group has the higher mean. For dosage 2, and for the data as a whole when dose is ignored, there is not enough evidence that the supplement has an impact on the tooth length of the subjects.