Overview and Summary Statistics

In this second part of the Statistical Inference course project, the objective is to carry out some exploratory data analysis, then build confidence intervals and run hypothesis tests on the R dataset ToothGrowth (The Effect of Vitamin C on Tooth Growth in Guinea Pigs), which records the length of odontoblasts in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).

The dataset comes from Crampton, E. W. (1947) The growth of the odontoblast of the incisor teeth as a criterion of vitamin C intake of the guinea pig. The Journal of Nutrition 33(5): 491-504. http://jn.nutrition.org/content/33/5/491.full.pdf

We will start by loading the dataset and seeing its structure:

library(datasets)
toothgrowth <- ToothGrowth
attach(toothgrowth)
str(toothgrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Now that the dataset is loaded, I will start with a bit of exploratory data analysis, comparing the length of odontoblasts in the guinea pigs by dose and supplement:

boxplot(len ~ dose,
        boxwex = 0.25, at = 1:3 - 0.2,
        subset = supp == "VC", col = "blue",
        main = "Odontoblast length of Guinea Pigs",
        xlab = "Dose in mgs of Vitamin C",
        ylab = "Odontoblast Length",
        xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
boxplot(len ~ dose, add = TRUE,
        boxwex = 0.25, at = 1:3 + 0.2,
        subset = supp == "OJ", col = "red")
legend(2, 9, c("Ascorbic acid", "Orange juice"),
       fill = c("blue", "red"))

Subsetting the data by supplement with the dplyr package, we can also calculate some summaries to enrich our conclusions:

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
toothoj <- filter(toothgrowth, supp == "OJ") #Orange Juice subset
toothvc <- filter(toothgrowth, supp == "VC") #Ascorbic Acid subset

summary(toothoj$len) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20   15.52   22.70   20.66   25.72   30.90
summary(toothvc$len)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20   11.20   16.50   16.96   23.10   33.90

From the boxplot and the summaries we can see that higher doses of vitamin C seem to promote growth of the odontoblasts: the larger the dose, the greater the length.

The effect also seems to be more pronounced when the vitamin C is given as orange juice rather than pure ascorbic acid, except at the 2 mg dose, where the interquartile range and the variability are much larger for ascorbic acid.

There is also an outlier in the 1 mg ascorbic acid group: one guinea pig with a length of 22.5, which is much larger than the others in this group.
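
The pattern described above can also be checked numerically. The following is a minimal sketch, not part of the original analysis, that tabulates the mean and interquartile range of len for each supplement and dose using base R's tapply. The object names cell_means and cell_iqr are introduced here purely for illustration.

```r
library(datasets)

# Mean odontoblast length for each supplement (rows) and dose (columns)
cell_means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))

# Interquartile range for the same cells, to compare spread between groups
cell_iqr <- with(ToothGrowth, tapply(len, list(supp, dose), IQR))

print(round(cell_means, 2))
print(round(cell_iqr, 2))
```

Reading across a row shows how the mean changes with dose within one supplement; reading down a column compares the two delivery methods at a fixed dose.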

Confidence Intervals and Hypothesis Tests

Now it is time to use statistical inference to see whether the suppositions we arrived at above hold up under further scrutiny.

First, we need to check some assumptions to decide which test to use, and establish the null and alternative hypotheses.

par(mfrow=c(1,2))

hist(toothoj$len, col="blue",
     main="Odontoblast Length in OJ\nsupplemented Guinea Pigs",
     xlab="Length of Odontoblasts")

hist(toothvc$len, col="red",
     main="Odontoblast Length in A.Acid\nsupplemented Guinea Pigs",
     xlab="Length of Odontoblasts")

var(toothoj$len)
## [1] 43.63344
var(toothvc$len)
## [1] 68.32723

In terms of assumptions, the histograms suggest that neither supplement's sample follows a normal distribution. The samples are also small (30 animals per supplement). The sample variances differ (43.63 for orange juice versus 68.33 for ascorbic acid), so we will not assume equal variances. Since each guinea pig received only one supplement, with 30 different animals per supplement, it is reasonable to treat the samples as independent. Given the non-normality, a non-parametric test would be the best choice here; however, among the tests we learnt in this class, the independent two-sample t-test with unequal variances is the next best option.

The null hypothesis in this case is that the means in length of odontoblasts of both populations (guinea pigs supplemented with ascorbic acid and orange juice) are equal. The alternative hypothesis is that they are different.
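
For reference, the hypotheses and the unequal-variance (Welch) statistic that t.test computes when var.equal = FALSE can be written as follows. The symbols are notation introduced here, not taken from the original report: \mu_1 and \mu_2 are the population means of the two groups being compared, \bar{x}_i, s_i^2 and n_i are the sample mean, sample variance and sample size of group i, and \nu is the approximate degrees of freedom.

```latex
H_0 : \mu_1 = \mu_2 \qquad \text{versus} \qquad H_a : \mu_1 \neq \mu_2

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
```

The p-value is then obtained from a t distribution with \nu degrees of freedom.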

We use t.test to compare by dose and by supplement, after first subsetting the data by dose (Appendix #1).

First, comparing between different doses within orange juice and within ascorbic acid (Appendix #2):

From the p-values and the confidence intervals of these t-tests, we can reject the null hypothesis at the 5% significance level for every pair of doses within each supplement: all p-values are below 0.05 and none of the 95% confidence intervals contains zero. This is strong evidence that higher doses of vitamin C lead to greater odontoblast lengths.
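
The within-supplement dose comparisons can also be run more compactly. As a sketch under stated assumptions, the code below uses pairwise.t.test from the stats package with pool.sd = FALSE, so that each pair of doses is compared with a separate unequal-variance t-test, and with p.adjust.method = "none", so that the p-values are unadjusted and should correspond to the two-vector t.test calls. This is an alternative to the calls in the appendix, not the original method; the helper name dose_pvalues is hypothetical.

```r
library(datasets)

# Unadjusted pairwise Welch t-tests between dose levels within one supplement.
# dose_pvalues is an illustrative helper, not part of the original report.
dose_pvalues <- function(supplement) {
  d <- subset(ToothGrowth, supp == supplement)
  pairwise.t.test(d$len, factor(d$dose),
                  p.adjust.method = "none",  # report raw p-values
                  pool.sd = FALSE)$p.value   # separate variances per pair
}

oj_p <- dose_pvalues("OJ")
vc_p <- dose_pvalues("VC")
print(signif(oj_p, 4))
print(signif(vc_p, 4))
```

Each result is a lower-triangular matrix of p-values, with one entry per pair of doses.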

Next, comparing orange juice against ascorbic acid at the same dose (Appendix #3):

From these tests, we can conclude at the 95% confidence level that, consistent with what we saw in the EDA, orange juice yields significantly greater odontoblast length than ascorbic acid at the 0.5 mg and 1 mg doses (p-values of 0.0064 and 0.0010, with confidence intervals entirely above zero). At the 2 mg dose there does not appear to be a significant difference between the two delivery methods (p-value of 0.96, with a confidence interval that contains zero).
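
The supplement comparison at each dose can also be written as a single loop. The sketch below is an illustrative alternative to the separate calls in the appendix, assuming the formula interface of t.test on the built-in ToothGrowth data; p_by_dose is a name introduced here for illustration.

```r
library(datasets)

# Welch two-sample t-test of OJ versus VC at each dose level
doses <- sort(unique(ToothGrowth$dose))
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = subset(ToothGrowth, dose == d),
         var.equal = FALSE)$p.value
})
names(p_by_dose) <- doses

print(signif(p_by_dose, 4))
```

A p-value below 0.05 at a given dose indicates a significant difference between the two delivery methods at that dose.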

Appendix

Appendix #1

toothvc0.5 <- filter(toothvc, dose == 0.5)
toothvc1.0 <- filter(toothvc, dose == 1.0)
toothvc2.0 <- filter(toothvc, dose == 2.0)

toothoj0.5 <- filter(toothoj, dose == 0.5)
toothoj1.0 <- filter(toothoj, dose == 1.0)
toothoj2.0 <- filter(toothoj, dose == 2.0)

Appendix #2

t.test(toothoj0.5$len, toothoj1.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -13.415634  -5.524366
## attr(,"conf.level")
## [1] 0.95
t.test(toothoj0.5$len, toothoj1.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 8.784919e-05
t.test(toothoj1.0$len, toothoj2.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -6.5314425 -0.1885575
## attr(,"conf.level")
## [1] 0.95
t.test(toothoj1.0$len, toothoj2.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 0.03919514
t.test(toothoj0.5$len, toothoj2.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -16.335241  -9.324759
## attr(,"conf.level")
## [1] 0.95
t.test(toothoj0.5$len, toothoj2.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 1.323784e-06
t.test(toothvc0.5$len, toothvc1.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -11.265712  -6.314288
## attr(,"conf.level")
## [1] 0.95
t.test(toothvc0.5$len, toothvc1.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 6.811018e-07
t.test(toothvc1.0$len, toothvc2.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -13.054267  -5.685733
## attr(,"conf.level")
## [1] 0.95
t.test(toothvc1.0$len, toothvc2.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 9.155603e-05
t.test(toothvc0.5$len, toothvc2.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -21.90151 -14.41849
## attr(,"conf.level")
## [1] 0.95
t.test(toothvc0.5$len, toothvc2.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 4.681577e-08

Appendix #3

t.test(toothoj0.5$len, toothvc0.5$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95
t.test(toothoj0.5$len, toothvc0.5$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 0.006358607
t.test(toothoj1.0$len, toothvc1.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95
t.test(toothoj1.0$len, toothvc1.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 0.001038376
t.test(toothoj2.0$len, toothvc2.0$len, paired=FALSE, var.equal=FALSE)$conf.int
## [1] -3.79807  3.63807
## attr(,"conf.level")
## [1] 0.95
t.test(toothoj2.0$len, toothvc2.0$len, paired=FALSE, var.equal=FALSE)$p.value
## [1] 0.9638516