Statistical Inference Course Project Part 2

Overview

In this part, the ToothGrowth data set from the R datasets package is analyzed. The analysis should cover the following aspects:

Load the ToothGrowth data and perform some basic exploratory data analyses.
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
State your conclusions and the assumptions needed for your conclusions.

Basic Exploratory Analysis

# 0. Load the necessary packages (only ggplot2 is used in this report;
#    ToothGrowth ships with base R, so no working directory is needed)
library(ggplot2)

# 1. Load the data
data("ToothGrowth")

# 2. Basic summary
dim(ToothGrowth)

## [1] 60  3

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

unique(ToothGrowth$supp)

## [1] VC OJ
## Levels: OJ VC

unique(ToothGrowth$dose)

## [1] 0.5 1.0 2.0

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Basic Summary

Box plots drawn with the ggplot2 package show how tooth length varies with supplement type and dose level.

## Plot
## supp vs. len
p1 <- ggplot(data = ToothGrowth, aes(x = supp, y = len))
p1 + geom_boxplot(aes(fill = supp)) + facet_grid(~dose) + xlab("Supplement") +
ylab("Tooth Length") + scale_fill_discrete(name="Supplement") +
ggtitle("Supplement vs. Tooth Length by Different Dose")

## dose vs. len
p2 <- ggplot(data = ToothGrowth, aes(x = dose, y = len))
p2 + geom_boxplot(aes(fill = as.factor(dose))) + facet_grid(~supp) +
xlab("Dose") + ylab("Tooth Length") + scale_fill_discrete(name="Dose") + 
ggtitle("Dose vs. Tooth Length by Different Supplement")
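
As a numeric complement to the box plots, the group means can be tabulated directly. The following is a minimal sketch using base R only; the variable names `cell_means`, `mean_by_supp`, and `mean_by_dose` are introduced here for illustration and are not part of the original analysis.

```r
# Group means of tooth length (base R only).
data("ToothGrowth")

# Mean length for each supplement/dose combination (6 cells)
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

# Marginal means by supplement and by dose
mean_by_supp <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
mean_by_dose <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)

print(cell_means)
print(round(mean_by_supp, 3))
print(round(mean_by_dose, 3))
```

The marginal means are the same group means that the t-tests in the next section report under "sample estimates".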

Statistical Inference

First, a two-sample t-test compares len between the two supp groups.

t.test(len ~ supp, alternative = "two.sided", data = ToothGrowth)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

A p-value below 0.05 is taken as evidence of a statistically significant difference. Here the p-value is 0.06, slightly above that threshold, and the 95% confidence interval (-0.17 to 7.57) contains zero. The null hypothesis, that the two supplements produce the same mean tooth length, therefore cannot be rejected at the 5% level.
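
To make the numbers in the test output less of a black box, the Welch statistic, its Welch-Satterthwaite degrees of freedom, the p-value, and the confidence interval can be recomputed by hand. This is a sketch for illustration, not the author's method; the names `oj`, `vc`, `se2_oj`, `se2_vc`, `t_stat`, `df_welch`, `p_val`, and `ci` are assumptions introduced here.

```r
data("ToothGrowth")

oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)
se_diff <- sqrt(se2_oj + se2_vc)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_val <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df_welch) * se_diff

# Cross-check against t.test(), which uses the Welch test by default
ref <- t.test(len ~ supp, data = ToothGrowth)
stopifnot(isTRUE(all.equal(unname(ref$statistic), t_stat)))
stopifnot(isTRUE(all.equal(unname(ref$parameter), df_welch)))
stopifnot(isTRUE(all.equal(ref$p.value, p_val)))
stopifnot(isTRUE(all.equal(as.numeric(ref$conf.int), ci)))
```

The `stopifnot()` calls pass silently when the hand computation agrees with `t.test()`.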

Second, the impact of dose on len is explored. Because dose has three levels, a t-test is conducted for each pair of levels.

## 1st group: 0.5 vs. 1.0
t.test(len ~ dose, alternative = "two.sided", 
       data = subset(ToothGrowth, dose %in% c(0.5, 1.0)))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

## 2nd group: 1.0 vs. 2.0
t.test(len ~ dose, alternative = "two.sided", 
       data = subset(ToothGrowth, dose %in% c(1.0, 2.0)))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

## 3rd group: 0.5 vs. 2.0
t.test(len ~ dose, alternative = "two.sided", 
       data = subset(ToothGrowth, dose %in% c(0.5, 2.0)))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100
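
The three pairwise comparisons above repeat the same call with a different subset each time. As a hedged alternative, the pairs can be generated with `combn()` and tested in a loop; `dose_pairs` and `pair_p` are names introduced here for illustration. Note that `combn()` yields the pairs in the order 0.5 vs. 1, 0.5 vs. 2, 1 vs. 2, which differs from the order used in the text.

```r
data("ToothGrowth")

# All unordered pairs of dose levels, as a list of length-2 vectors
dose_pairs <- combn(sort(unique(ToothGrowth$dose)), 2, simplify = FALSE)

# Run a Welch two-sample t-test for each pair and keep the p-value
pair_p <- sapply(dose_pairs, function(pair) {
  t.test(len ~ dose, alternative = "two.sided",
         data = subset(ToothGrowth, dose %in% pair))$p.value
})

# Label each p-value with the pair it belongs to
names(pair_p) <- sapply(dose_pairs, paste, collapse = " vs. ")
print(pair_p)
```

Naming the result vector keeps each p-value tied to its comparison, which helps when the values are later passed to `p.adjust()`.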

Because this is a multiple-comparisons scenario, the p-values are adjusted with p.adjust().

## p value of 1st group
p.value <- t.test(len ~ dose, alternative = "two.sided", 
                  data = subset(ToothGrowth, dose %in% c(0.5, 1.0)))$p.value
## p value of 2nd group
p.value[2] <- t.test(len ~ dose, alternative = "two.sided", 
                  data = subset(ToothGrowth, dose %in% c(1.0, 2.0)))$p.value
## p value of 3rd group
p.value[3] <- t.test(len ~ dose, alternative = "two.sided", 
                  data = subset(ToothGrowth, dose %in% c(0.5, 2.0)))$p.value
## p value adjust
p.adjust(p.value, method = "fdr", n = length(p.value))

## [1] 1.902451e-07 1.906430e-05 1.319257e-13

Even without adjustment, all three p-values are far below the 0.05 threshold, and they remain so after the FDR adjustment. The null hypothesis of equal mean tooth length can be rejected for every pair of dose levels, so dose has a statistically significant effect on tooth length. More specifically, mean tooth length increases with dose: 10.605 at 0.5, 19.735 at 1.0, and 26.100 at 2.0.
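
For readers curious about what the "fdr" adjustment does, it is an alias for the Benjamini-Hochberg method in `p.adjust()`: each p-value is multiplied by n divided by its ascending rank, and a running minimum taken from the largest p-value downward keeps the adjusted values monotone. The sketch below applies this by hand to the rounded p-values printed in the test outputs above (so the last digits differ slightly from the adjusted values shown earlier); `p_raw`, `ord`, `ranks`, and `p_bh` are names introduced here for illustration.

```r
# Unadjusted p-values as printed above, in the order
# 0.5 vs. 1.0, 1.0 vs. 2.0, 0.5 vs. 2.0
p_raw <- c(1.268e-07, 1.906e-05, 4.398e-14)
n <- length(p_raw)

# Sort from largest to smallest; the largest p-value has rank n
ord <- order(p_raw, decreasing = TRUE)
ranks <- n:1

# Scale each p-value by n / rank, then enforce monotonicity
scaled <- p_raw[ord] * n / ranks
adj_sorted <- pmin(1, cummin(scaled))

# Put the adjusted values back in the original order
p_bh <- numeric(n)
p_bh[ord] <- adj_sorted

# Cross-check against the built-in implementation
stopifnot(isTRUE(all.equal(p_bh, p.adjust(p_raw, method = "fdr"))))
print(p_bh)
```

With only three tests and p-values this small, the adjustment changes each value by a factor of at most 3, which is why the conclusions are unaffected.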

Conclusion

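Before drawing any conclusions, the following assumptions need to be stated:

The sample data, ToothGrowth, is representative of the whole population.
The samples are large enough for the Central Limit Theorem to apply, so the sample means are approximately normally distributed.
The groups being compared are independent (unpaired) samples, and their variances are not assumed to be equal, as in the Welch t-tests used above.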

The results can be summarized as follows:

Mean tooth length is higher with OJ (20.66) than with VC (16.96), but the difference is not statistically significant at the 0.05 level (p = 0.06), so an effect of supplement type cannot be established from this test.
Dose level, on the other hand, has a statistically significant effect on tooth growth: a higher dose is associated with greater tooth length.