Overview

This document reports an analysis of the ToothGrowth dataset in R, carried out as part of the project for the Statistical Inference course offered by Johns Hopkins University through Coursera. The objective is to use the techniques learned in the course to analyze the data and summarize the findings. First, the data is loaded into the R environment and some exploratory data analysis is done. Based on visual inspection of the data and the charts, hypotheses are formed regarding the impact of the supplements and their doses on tooth growth. These hypotheses are then tested, and the results are given below.

Assumptions:

  1. It is assumed that the samples for the trials were selected so as to randomize the effects of age, diet, and other factors that could influence tooth growth.
  2. Since there is no reason to believe that the variance is uniform across the groups, unequal variances are assumed for the tests, based on the variances obtained within the subgroups of the sample.
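
The second assumption refers to the variances within the subgroups. As a minimal sketch of how those could be inspected (the names `subgroup_var` and `var_len` are illustrative and not part of the original analysis), base R's `aggregate()` can compute the sample variance of `len` for each supplement and dose combination:

```r
# Sample variance of tooth length within each supplement/dose subgroup.
# `subgroup_var` and `var_len` are hypothetical names used for illustration.
data(ToothGrowth)

subgroup_var <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = var)
names(subgroup_var)[3] <- "var_len"

# One row per subgroup (2 supplements x 3 doses)
subgroup_var
```

If the values in `var_len` differ noticeably between subgroups, that supports using the unequal-variance (Welch) form of the t test rather than the pooled form.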

Loading and exploratory analysis:

From visual inspection of the data and the charts, it appears that both the supplement and the dose affect tooth growth in the subjects. Chart 1 shows that, for either supplement, tooth growth increases with the dose. However, although OJ appears to have a larger impact at lower doses, that difference diminishes at higher doses. Chart 2 shows box plots of tooth length by supplement. Chart 2 seems to suggest that OJ has a higher impact on tooth length than VC, but we shall test this in the subsequent section.

#Load the data
data(ToothGrowth)
#summarize the contents
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
library(dplyr)
#convert to a tibble (tbl_df() is deprecated in current dplyr; as_tibble() replaces it)
tg_df <- as_tibble(ToothGrowth)
tgLenSumm <- group_by(tg_df, supp, dose)%>%summarise(mnLen = mean(len))
library(ggplot2)
g1<- ggplot(tgLenSumm, aes(dose, mnLen, color = supp)) + geom_line() + labs(x = "dose", y = "Mean tooth growth length") + ggtitle("Chart 1: Tooth growth vs Dose by Supplements")
g1

g2 <- ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_boxplot(aes(fill = supp)) + labs(x = "Supplement", y = "Length") + ggtitle("Chart 2: Tooth growth by Supplements")
g2
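
The observation that the gap between OJ and VC narrows at higher doses can also be checked numerically. The following is a minimal sketch using base R only; the names `mean_len` and `oj_minus_vc` are illustrative and do not appear in the original analysis:

```r
# Mean tooth length for each supplement (rows) and dose (columns).
# `mean_len` and `oj_minus_vc` are hypothetical names used for illustration.
data(ToothGrowth)

mean_len <- with(ToothGrowth,
                 tapply(len, list(supp = supp, dose = dose), mean))

# Difference in mean length (OJ minus VC) at each dose
oj_minus_vc <- mean_len["OJ", ] - mean_len["VC", ]

round(mean_len, 2)
round(oj_minus_vc, 2)
```

A difference that is clearly positive at doses 0.5 and 1 and close to zero at dose 2 would be consistent with what Chart 1 suggests.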

Hypothesis testing:

Impact of supplements:

Null hypothesis: The mean tooth growth is the same across the two supplements
Alternate hypothesis: The mean tooth growth is higher for OJ than VC
The R code for the t test with unequal variances is given below. The intervals reported are two-sided 95% confidence intervals for the mean of OJ minus the mean of VC. For the test that ignores dose, the confidence interval contains zero, so there is not enough evidence to reject the null hypothesis at the 5% level.

However, when the comparison is made within each dose, there is evidence to reject the null at doses 0.5 and 1, as shown in tests 2 and 3. The confidence intervals lie entirely above zero, so the means are significantly different. For a dose of 2, however, the confidence interval contains zero, so we cannot reject the null.

#test 1 ignoring doses
#unpaired is the default; recent R versions reject 'paired' in the formula interface
t.test(len ~ supp, var.equal = FALSE, data = tg_df)$conf.int
## [1] -0.1710156  7.5710156
## attr(,"conf.level")
## [1] 0.95
#test 2 dose = 0.5
t.test(len ~ supp, var.equal = FALSE, data = tg_df[tg_df$dose == 0.5,])$conf.int
## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95
#test 3 dose = 1
t.test(len ~ supp, var.equal = FALSE, data = tg_df[tg_df$dose == 1,])$conf.int
## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95
#test 4 dose = 2
t.test(len ~ supp, var.equal = FALSE, data = tg_df[tg_df$dose == 2,])$conf.int
## [1] -3.79807  3.63807
## attr(,"conf.level")
## [1] 0.95
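
The alternate hypothesis above is one-sided (OJ higher than VC), while the intervals reported are two-sided. As a hedged sketch of how a one-sided Welch test could be run for the same four comparisons, the helper `one_sided_p` and the vector `p_values` below are illustrative names and not part of the original analysis:

```r
# One-sided Welch t tests of H1: mean(OJ) > mean(VC).
# "OJ" is the first level of supp, so alternative = "greater"
# tests whether (mean of OJ) - (mean of VC) is greater than zero.
# `one_sided_p` and `p_values` are hypothetical names used for illustration.
data(ToothGrowth)

one_sided_p <- function(d) {
  t.test(len ~ supp, data = d,
         alternative = "greater", var.equal = FALSE)$p.value
}

p_values <- c(
  all = one_sided_p(ToothGrowth),                             # ignoring dose
  sapply(split(ToothGrowth, ToothGrowth$dose), one_sided_p)   # per dose
)

round(p_values, 4)
```

When the estimated difference is in the hypothesized direction, a one-sided p-value is half the two-sided one, so the comparison that ignores dose can fall below 0.05 here even though its two-sided interval contains zero. Which form to report depends on whether the direction of the alternative was fixed before looking at the data.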

Impact of dosage:

As there are three different doses, we need three different tests. We can generalize the hypotheses as
Null hypothesis: The mean tooth growth is equal between lower and higher doses
Alternate hypothesis: The mean tooth growth is higher for higher doses
The confidence intervals for all three t tests below lie entirely below zero. Each interval is for the mean at the lower dose minus the mean at the higher dose, so a negative interval means that growth is higher at the higher dose, and we can reject the null in all three cases. Therefore we can conclude that tooth growth increases with dosage.

##test 5, between doses 1 and 0.5
t.test(len ~ dose, var.equal = FALSE, data = tg_df[tg_df$dose == 1 | tg_df$dose == 0.5,])$conf.int
## [1] -11.983781  -6.276219
## attr(,"conf.level")
## [1] 0.95
##test 6, between doses 2 and 1
t.test(len ~ dose, var.equal = FALSE, data = tg_df[tg_df$dose == 2 | tg_df$dose == 1,])$conf.int
## [1] -8.996481 -3.733519
## attr(,"conf.level")
## [1] 0.95
##test 7, between doses 2 and 0.5 (largely redundant: with tests 5 and 6 both significant in the same direction, test 7 is expected to be significant as well)
t.test(len ~ dose, var.equal = FALSE, data = tg_df[tg_df$dose == 2 | tg_df$dose == 0.5,])$conf.int
## [1] -18.15617 -12.83383
## attr(,"conf.level")
## [1] 0.95
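
The intervals in tests 5 to 7 are negative because `t.test()` with a formula subtracts the second group mean from the first, and the groups are ordered by dose. The sketch below illustrates this for doses 0.5 and 1; the names `low_vs_high`, `res`, `diff_low_minus_high` and `ci_high_minus_low` are illustrative and not part of the original analysis:

```r
# Show the direction of the difference behind the dose intervals.
# All object names here are hypothetical, chosen for illustration.
data(ToothGrowth)

low_vs_high <- subset(ToothGrowth, dose %in% c(0.5, 1))
res <- t.test(len ~ dose, data = low_vs_high, var.equal = FALSE)

# Group means, with the lower dose listed first
res$estimate

# Point estimate of (mean at dose 0.5) - (mean at dose 1)
diff_low_minus_high <- unname(res$estimate[1] - res$estimate[2])

# Interval for (mean at dose 0.5) - (mean at dose 1), as reported by t.test()
res$conf.int

# The same interval expressed as (mean at dose 1) - (mean at dose 0.5)
ci_high_minus_low <- -rev(as.numeric(res$conf.int))
ci_high_minus_low
```

Flipping the sign and order of the bounds gives the interval for higher minus lower dose, which is the direction stated in the alternate hypothesis.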

Conclusions:

  1. Tooth growth increases with the dosage of the supplements (tests 5, 6 and 7).
  2. When dosage is ignored, there is no significant difference at the 5% level between the tooth growth for the two supplements, as the two-sided 95% interval contains zero (test 1).
  3. At doses 0.5 and 1, OJ results in significantly higher tooth growth than VC (tests 2 and 3).
  4. At a dose of 2, there is no significant difference between the impact of VC and OJ (test 4).