Synopsis

This report is part of the project for the Statistical Inference class in the Johns Hopkins Data Science Specialization on Coursera.

This report analyzes the ToothGrowth data in the R datasets package. The goals of this analysis are:

The ToothGrowth data

The ToothGrowth dataset records the effect of vitamin C on tooth growth in guinea pigs. The response is the length of odontoblasts (the cells responsible for tooth growth) in 10 guinea pigs at each combination of three dose levels of vitamin C (0.5, 1, and 2 mg) and two delivery methods (orange juice or ascorbic acid).

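This data frame has 60 observations and 3 variables: len (numeric, tooth length), supp (factor, supplement type: OJ for orange juice or VC for ascorbic acid), and dose (numeric, dose in mg).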

The following R code compactly displays the internal structure of the ToothGrowth dataset:

library(datasets)   ## Loading the package "datasets"
data(ToothGrowth)   ## Loading the data
str(ToothGrowth)    ## Looking at the dataset variables
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
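
As a quick check of the design described above, the following sketch cross-tabulates supp against dose. The name cellCounts is introduced here only for illustration. With 60 observations spread over 2 supplements and 3 doses, each cell should hold 10 guinea pigs.

```r
library(datasets)

## Count observations for each supplement-by-dose combination
cellCounts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(cellCounts)
```

A balanced table like this means every group compared later has the same sample size.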

Basic exploratory data analyses

The following scatterplot shows approximately how the variable len varies with the variable dose for each type of supp:

library(ggplot2)
## Scatterplot of tooth length against dose, colored by supplement
ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_point(aes(color = supp)) +
  ## dose is numeric here, so it needs a continuous x scale
  scale_x_continuous("Dosage in mg", breaks = c(0.5, 1, 2)) +
  scale_y_continuous("Length of Teeth") +
  ggtitle("Tooth Length by Dose for each Supplement")

[Figure: scatterplot of tooth length against dose, colored by supplement]

Basic summary of the data

The summary of the len variable for each level of the factor supp is:

tapply(ToothGrowth$len, ToothGrowth$supp, FUN=summary)
## $OJ
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.2    15.5    22.7    20.7    25.7    30.9 
## 
## $VC
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.2    11.2    16.5    17.0    23.1    33.9
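
The summary above pools all three doses within each supplement. A minimal sketch of a finer breakdown, assuming the mean is the statistic of interest, uses aggregate to compute the mean length for every supplement-by-dose group. The name groupMeans is illustrative and not part of the original analysis.

```r
library(datasets)

## Mean tooth length for every supplement-by-dose group
groupMeans <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(groupMeans)
```

The result has one row per group, which makes it easy to compare the two supplements at a fixed dose.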

The following plot shows data summary for each supplement:

## Boxplot of tooth length by dose, one panel per supplement
ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot() + facet_grid(. ~ supp) +
  theme(axis.text.x = element_text(angle = -90, vjust = 0.4, hjust = 1)) +
  scale_x_discrete("Dosage in mg") + scale_y_continuous("Length of Teeth") +
  ggtitle("Box Plot of Tooth Length by Dose for each Supplement")

[Figure: box plots of tooth length by dose, one panel per supplement]

Comparing tooth growth by supp and dose

The hypothesis tests are performed with the t.test function in R.
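
To illustrate how t.test is used, here is a minimal sketch of a single comparison: orange juice against ascorbic acid at the 0.5 mg dose. The names ojLow, vcLow, and singleTest are chosen here for illustration. By default t.test runs Welch's two-sided, unpaired test, which does not assume equal variances.

```r
library(datasets)

## Tooth lengths at the 0.5 mg dose, one vector per supplement
ojLow <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 0.5]
vcLow <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]

## t.test() defaults to Welch's two-sided test (var.equal = FALSE, paired = FALSE)
singleTest <- t.test(ojLow, vcLow)
singleTest$p.value    ## p-value for the null hypothesis of equal means
singleTest$conf.int   ## 95% confidence interval for mean(ojLow) - mean(vcLow)
```

The confidence interval refers to the first group's mean minus the second group's mean, so its sign shows which group has the larger mean.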

The analysis makes the following assumptions:

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
## Split len into the six supp-by-dose groups (10 observations each);
## with() replaces attach(), which can mask other objects
suppDoseGroups <- as.data.frame(with(ToothGrowth, split(len, list(supp, dose))))
## All 15 pairs of groups, one pair per column
groupPairs <- combn(names(suppDoseGroups), 2)
combinationNames <- paste(groupPairs[1, ], groupPairs[2, ], sep = "~")
hypothesisTest <- matrix(data = NA, nrow = length(combinationNames), ncol = 3,
                         dimnames = list(combinationNames, c("P-value", "Conf low", "Conf high")))
for (k in seq_along(combinationNames)) {
  ## Run the t-test once per pair and keep the p-value and confidence interval
  test <- t.test(suppDoseGroups[[groupPairs[1, k]]], suppDoseGroups[[groupPairs[2, k]]])
  hypothesisTest[k, ] <- c(test$p.value, test$conf.int)
}
hypothesisTest
##                 P-value Conf low Conf high
## OJ.0.5~VC.0.5 6.359e-03    1.719   8.78094
## OJ.0.5~OJ.1   8.785e-05  -13.416  -5.52437
## OJ.0.5~VC.1   4.601e-02   -7.008  -0.07189
## OJ.0.5~OJ.2   1.324e-06  -16.335  -9.32476
## OJ.0.5~VC.2   7.196e-06  -17.264  -8.55648
## VC.0.5~OJ.1   3.655e-08  -17.921 -11.51851
## VC.0.5~VC.1   6.811e-07  -11.266  -6.31429
## VC.0.5~OJ.2   1.362e-11  -20.618 -15.54182
## VC.0.5~VC.2   4.682e-08  -21.902 -14.41849
## OJ.1~VC.1     1.038e-03    2.802   9.05785
## OJ.1~OJ.2     3.920e-02   -6.531  -0.18856
## OJ.1~VC.2     9.653e-02   -7.564   0.68433
## VC.1~OJ.2     2.361e-07  -11.720  -6.85967
## VC.1~VC.2     9.156e-05  -13.054  -5.68573
## OJ.2~VC.2     9.639e-01   -3.798   3.63807

Almost all p-values are below 0.05, and the corresponding confidence intervals do not contain zero. For those comparisons the null hypothesis of equal means can be rejected: the difference in mean tooth length between the two groups is significant. There are two exceptions: orange juice versus ascorbic acid at the 2 mg dose (OJ.2~VC.2, p = 0.964), and orange juice at 1 mg versus ascorbic acid at 2 mg (OJ.1~VC.2, p = 0.097).
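
The comparisons that do not reach the 0.05 level can also be picked out programmatically. The sketch below recomputes the 15 Welch p-values on its own, so that it runs independently. The names groups, groupPairs, and pValues are illustrative, and 0.05 is the significance level assumed in the discussion above.

```r
library(datasets)

## Six supplement-by-dose groups, named like "OJ.0.5"
groups <- with(ToothGrowth, split(len, list(supp, dose)))

## Welch t-test p-value for each of the 15 pairs of groups
groupPairs <- combn(names(groups), 2)
pValues <- apply(groupPairs, 2,
                 function(p) t.test(groups[[p[1]]], groups[[p[2]]])$p.value)
names(pValues) <- paste(groupPairs[1, ], groupPairs[2, ], sep = "~")

## Comparisons that fail to reach the 0.05 level
names(pValues)[pValues >= 0.05]
```

Filtering the named vector this way avoids reading the exceptions off the printed table by eye.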

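For the same supplement, the p-value shrinks as the gap between doses widens (8.785e-05 for OJ.0.5~OJ.1 against 1.324e-06 for OJ.0.5~OJ.2, for example). The confidence intervals for the lower-dose mean minus the higher-dose mean also lie entirely below zero. This indicates that increasing the dose has a positive effect on tooth growth.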

Conclusions

The main conclusions are: