SONJA OFFWOOD
This document performs hypothesis tests on the ToothGrowth dataset that ships with R. The dataset contains tooth length measurements taken under various doses of two supplements, “Orange Juice” and “Vitamin C”. The tests in this document check whether the supplement used has a significant impact on the tooth length of the subject.
We first need to load the required packages for our analysis, as well as load the data into R:
library(ggplot2)
library(knitr)
data(ToothGrowth)
x=ToothGrowth
Let's have a quick look at the structure of the data, as well as a summary of the data:
str(x)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
We can see here that the dataset consists of 3 variables: the tooth length, the supplement (either “OJ” or “VC”) and the dose of the supplement that was used.
summary(x)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
Let's perform some exploratory analysis by plotting the data:
g = ggplot(data=x, aes(x=factor(dose),y=len, fill=factor(supp)))+ geom_dotplot(binaxis = "y")
g = g + theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.title=element_text(size=14)) + theme(plot.title = element_text(size=16,face="bold"))
g = g + labs(title=expression("Supplement and Dosage Impact on Tooth Length"), x="Dosage", y="Tooth Length")
print(g)
## stat_bindot: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
This plot shows a clear trend between tooth length and dosage: the higher the dosage, the longer the tooth. We are, however, interested in the impact of the supplement on the tooth length. Let's have a look at a boxplot of the data by supplement type:
g2= ggplot(data=x, aes(x=supp, y=len))+ geom_boxplot(aes(fill=supp))
g2 = g2 + theme(axis.text.x = element_text(angle = 90, hjust = 1), axis.title=element_text(size=14)) + theme(plot.title = element_text(size=16,face="bold"))
g2 = g2 + xlab("Supplement type") + ylab("Tooth length")+ ggtitle(" Boxplot of tooth length by supplement type ")
print(g2)
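To put numbers next to the two plots, the mean tooth length in each supplement and dose group can be tabulated. This is a minimal sketch; the name `cell_means` is illustrative and not part of the original analysis.

```r
# Load the built-in dataset so the sketch runs on its own.
data(ToothGrowth)

# Mean tooth length for every supplement/dose combination (2 x 3 = 6 cells).
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(cell_means)
```

Reading down the `len` column for a fixed supplement shows the dose trend, while comparing the two supplements at a fixed dose shows the effect examined in the tests that follow.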
First, we test the null hypothesis that the mean tooth length is equal for the two supplements, Orange Juice and Vitamin C, versus the alternative hypothesis that the mean tooth length differs between the two supplements.
# Welch two-sample t-test; independent (unpaired) samples are the default
test = t.test(len ~ supp, data = x, var.equal = FALSE, conf.level = .95)
result = data.frame( "t-statistic" = test$statistic,
"df" = test$parameter,
"p-value" = test$p.value,
"OJ mean" = test$estimate[1],
"VC mean" = test$estimate[2],
row.names = "Orange Juice vs Vitamin C ")
kable(x = round(result,3),align = 'c' ,
caption = "Two sample t-test for tooth growth by supplement (excl.Dose)")
| | t.statistic | df | p.value | OJ.mean | VC.mean |
|---|---|---|---|---|---|
| Orange Juice vs Vitamin C | 1.915 | 55.309 | 0.061 | 20.663 | 16.963 |
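As a cross-check on the table above, the Welch statistic, its degrees of freedom and the two-sided p-value can be recomputed by hand from the group means and variances. This is a sketch under the same assumptions as the test (independent groups, unequal variances); the variable names are illustrative only.

```r
data(ToothGrowth)

# Tooth lengths for each supplement, pooled over all doses.
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean.
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic: difference in means over the combined standard error.
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

print(round(c(t = t_stat, df = df_welch, p = p_value), 3))
```

The rounded values should agree with the `t.test()` results reported in the table.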
resultCI = data.frame("lower CL" = test$conf.int[1],
"upper CL" = test$conf.int[2],
row.names = "Confidence Interval")
kable(x = round(resultCI,3),align = 'c' ,
caption = "Confidence Interval for two sample t-test for tooth growth by supplement (excl.Dose)")
| | lower.CL | upper.CL |
|---|---|---|
| Confidence Interval | -0.171 | 7.571 |
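We perform this test at the 95% confidence level and assume that the variances are not equal. From the above results we do not reject the null hypothesis: at the 5% significance level there is not enough evidence of a difference in mean tooth length between the two supplements. Note that failing to reject the null hypothesis does not prove that the means are equal. We reach this conclusion by any of the following: the p-value (0.061) is greater than 0.05; the 95% confidence interval (-0.171, 7.571) contains zero; or the absolute value of the t-statistic (1.915) is below the critical value of the t distribution with 55.309 degrees of freedom.
The easiest method is to compare the p-value with the significance level (0.05), so this approach is used for the rest of the document.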
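For a two-sided test, three decision rules lead to the same answer: comparing the p-value with the significance level, checking whether the confidence interval contains zero, and comparing the t-statistic with the critical value. The sketch below evaluates all three on the pooled test; `alpha`, `reject_p`, `reject_ci` and `reject_t` are illustrative names, not part of the original analysis.

```r
data(ToothGrowth)

alpha <- 0.05
test <- t.test(len ~ supp, data = ToothGrowth,
               var.equal = FALSE, conf.level = 1 - alpha)

# Rule 1: reject when the p-value is below the significance level.
reject_p <- test$p.value < alpha

# Rule 2: reject when the confidence interval excludes zero.
reject_ci <- test$conf.int[1] > 0 || test$conf.int[2] < 0

# Rule 3: reject when |t| exceeds the two-sided critical value.
t_crit <- qt(1 - alpha / 2, df = test$parameter)
reject_t <- unname(abs(test$statistic) > t_crit)

print(c(p_value = reject_p, conf_int = reject_ci, critical = reject_t))
```

All three entries are `FALSE` here, matching the decision not to reject the null hypothesis.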
We now perform a similar test on three subsets of the data, treating the dosages separately. For each of the dosages 0.5, 1 and 2, we test the null hypothesis that the mean tooth length is equal for the two supplements, Orange Juice and Vitamin C, versus the alternative hypothesis that the mean tooth length differs between the two supplements.
We again assume that the variances are not equal and perform each test at the 95% confidence level.
x2=subset(x, x$dose==0.5)
x3=subset(x, x$dose==1)
x4=subset(x, x$dose==2)
test2 = t.test(len ~ supp, data = x2, var.equal = FALSE, conf.level = .95)
result2 = data.frame( "t-statistic" = test2$statistic,
"df" = test2$parameter,
"p-value" = test2$p.value,
"lower CL" = test2$conf.int[1],
"upper CL" = test2$conf.int[2],
"OJ mean" = test2$estimate[1],
"VC mean" = test2$estimate[2],
row.names = "Orange Juice vs Vitamin C (Dose=0.5)")
test3 = t.test(len ~ supp, data = x3, var.equal = FALSE, conf.level = .95)
result3 = data.frame( "t-statistic" = test3$statistic,
"df" = test3$parameter,
"p-value" = test3$p.value,
"lower CL" = test3$conf.int[1],
"upper CL" = test3$conf.int[2],
"OJ mean" = test3$estimate[1],
"VC mean" = test3$estimate[2],
row.names = "Orange Juice vs Vitamin C (Dose=1)")
test4 = t.test(len ~ supp, data = x4, var.equal = FALSE, conf.level = .95)
result4 = data.frame( "t-statistic" = test4$statistic,
"df" = test4$parameter,
"p-value" = test4$p.value,
"lower CL" = test4$conf.int[1],
"upper CL" = test4$conf.int[2],
"OJ mean" = test4$estimate[1],
"VC mean" = test4$estimate[2],
row.names = "Orange Juice vs Vitamin C (Dose=2)")
kable(x = round(result2,3),align = 'c' ,
caption = "Two sample t-test for tooth growth by supplement (Dose=0.5)")
| | t.statistic | df | p.value | lower.CL | upper.CL | OJ.mean | VC.mean |
|---|---|---|---|---|---|---|---|
| Orange Juice vs Vitamin C (Dose=0.5) | 3.17 | 14.969 | 0.006 | 1.719 | 8.781 | 13.23 | 7.98 |
kable(x = round(result3,3),align = 'c' ,
caption = "Two sample t-test for tooth growth by supplement (Dose=1)")
| | t.statistic | df | p.value | lower.CL | upper.CL | OJ.mean | VC.mean |
|---|---|---|---|---|---|---|---|
| Orange Juice vs Vitamin C (Dose=1) | 4.033 | 15.358 | 0.001 | 2.802 | 9.058 | 22.7 | 16.77 |
kable(x = round(result4,3),align = 'c' ,
caption = "Two sample t-test for tooth growth by supplement (Dose=2)")
| | t.statistic | df | p.value | lower.CL | upper.CL | OJ.mean | VC.mean |
|---|---|---|---|---|---|---|---|
| Orange Juice vs Vitamin C (Dose=2) | -0.046 | 14.04 | 0.964 | -3.798 | 3.638 | 26.06 | 26.14 |
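We conclude from the p-values above that we reject the null hypothesis at dosages 0.5 (p = 0.006) and 1 (p = 0.001), and do not reject it at dosage 2 (p = 0.964).
The above tests indicate that only for dosages of 0.5 and 1 is there enough evidence to suggest a difference in the mean tooth length between orange juice and vitamin C supplements, with orange juice giving the higher mean in both cases. For dosage 2 (as well as overall, when dose is ignored) there is not enough evidence to suggest that the supplement has an impact on the tooth length of the subjects.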
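The three per-dose tests repeat the same steps, so they can also be written as one helper applied to each dose. This is only a sketch of an alternative layout; the function name `welch_by_dose` and the combined table `by_dose` are hypothetical and do not appear in the original analysis.

```r
data(ToothGrowth)

# Run the Welch test within one dose level and return a one-row summary.
welch_by_dose <- function(d) {
  tt <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ],
               var.equal = FALSE, conf.level = 0.95)
  data.frame(dose        = d,
             t.statistic = unname(tt$statistic),
             df          = unname(tt$parameter),
             p.value     = tt$p.value,
             lower.CL    = tt$conf.int[1],
             upper.CL    = tt$conf.int[2],
             OJ.mean     = unname(tt$estimate[1]),
             VC.mean     = unname(tt$estimate[2]))
}

# Stack the three one-row results into a single table.
by_dose <- do.call(rbind, lapply(c(0.5, 1, 2), welch_by_dose))
print(round(by_dose, 3))
```

Collecting the results in one data frame makes it easy to compare the p-values side by side or to pass the whole table to `kable()` in a single call.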