Part 2: Basic Inferential Data Analysis

# preload library
library(ggplot2)
library(gridExtra)  # to make multiplots
## Warning: package 'gridExtra' was built under R version 3.4.3

1. Load the ToothGrowth data and perform some basic exploratory data analyses

data(ToothGrowth)

Let’s look at the data and its characteristics

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Let’s look at the data with only 1 category: dose and supp:

g1 <- qplot(dose, len, data = ToothGrowth) + geom_boxplot(aes(fill = factor(dose)))
g2 <- qplot(supp, len, data = ToothGrowth) + geom_boxplot(aes(fill = supp))

grid.arrange(arrangeGrob(g1, g2, ncol = 2, nrow = 1), bottom = "Figure 1: len variation with respect to dose (left) and supp (right)")

Now let’s look closer at each domain, ie. supp and dose, and their corresponding subdomains, ie. dose, supp.

g3 <- qplot(dose, len, data = ToothGrowth, facets = ~supp, ylab = "Tooth length") + 
    geom_boxplot(aes(fill = factor(dose))) + labs(fill = "Dose")

g4 <- qplot(supp, len, data = ToothGrowth, facets = ~dose, ylab = "Tooth length") + 
    geom_boxplot(aes(fill = supp)) + labs(fill = "Supp")

grid.arrange(arrangeGrob(g3, g4, ncol = 2, nrow = 1), bottom = "Figure 2: len variation by subdivision with respect to supp (left) and dose (right)")

2. Provide a basic summary of the data

3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose

I will compare tooth growth len with respect to 2 categories: supp and dose. Because the generality of the comparison is similar between the cases, I will use two different methods:
* confidence interval for comparing tooth growth len vs. supp (a)
* hypothesis test for comparing tooth growth len vs. dose (b)

Please note that during exploratory analysis above, we have an impression that supp OJ is more effective than VC, and respectively similar with dose 0.5, 1.0 and 2.0. However, for generality, I assume we’re looking if they’re different. Therefore, in the following comparisons, I use 2-tail test.

Especially for this data, since we observe similar effects at dose = 2.0, it is worthy to do an additional specific test between 2 supp(s) at this dose. For this test, I use 2 sided test because we don’t know if any of them is larger/smaller than the other. (c)

a. Compare tooth growth len by supp using Confidence intervals

# extract data (only first column)
lenOJ <- subset(ToothGrowth, supp == "OJ")[, 1]
lenVC <- subset(ToothGrowth, supp == "VC")[, 1]

From figure 1 (fight), we can assume and test if OJ performs better than VC. In this case, the test is equivilent to a one sided test. The confidence interval of that test is:

(mean(lenOJ) - mean(lenVC)) + c(-1, 1) * qt(0.95, length(lenVC) + 
    length(lenOJ) - 2) * sqrt((var(lenVC) + var(lenOJ))/2) * 
    sqrt(1/length(lenVC) + 1/length(lenOJ))
## [1] 0.4708204 6.9291796

We see that zero doesn’t fall in the confidence interval. This suggests our hypothesis is not true. Therefore OJ actually does perform better than VC.
On the other hand, if we just want to check if the two supp(s) produce different tooth growth, we need to use a two sided test, the confidence interval for that hypothesis is:

(mean(lenOJ) - mean(lenVC)) + c(-1, 1) * qt(0.975, length(lenVC) + 
    length(lenOJ) - 2) * sqrt((var(lenVC) + var(lenOJ))/2) * 
    sqrt(1/length(lenVC) + 1/length(lenOJ))
## [1] -0.1670064  7.5670064

Now zero lies in the confidence interval. It means that there is no significant difference between these supp(s).

b. Compare tooth growth len by dose using hypothesis test

Ideally, we need to make 3 tests, comparing:
* dose 0.5 vs dose 1.0
* dose 1.0 vs dose 2.0
* dose 0.5 vs dose 2.0

In all of them, the null hypothesis is that there is no difference between the features we’re looking at. For example, in the first test, \(H_0=\) “there is no difference between dose 0.5 and dose 1.0”. I use \(\alpha=0.05\).

# extract data
len05 <- subset(ToothGrowth, dose == 0.5)[, 1]
len10 <- subset(ToothGrowth, dose == 1)[, 1]
len20 <- subset(ToothGrowth, dose == 2)[, 1]

Compare dose 0.5 vs. dose 1.0

t.test(len05, len10, alternative = "two.sided", paired = FALSE, 
    var.equal = FALSE, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  len05 and len10
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean of x mean of y 
##    10.605    19.735

The test returns an extremely low probability (1.27E-7) for the hypothesis of no difference between dose 0.5 and dose 1.0. With this result, we reject the null. Given what we observed in figure 1, we can conclude that dose 1.0 has significant higher effect on tooth growth compared to dose 0.5.

Compare dose 1.0 vs. dose 2.0

t.test(len10, len20, alternative = "two.sided", paired = FALSE, 
    var.equal = FALSE, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  len10 and len20
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean of x mean of y 
##    19.735    26.100

Again, the test returns an extremely low probability (1.91E-5) for the hypothesis of no difference between dose 1.0 and dose 2.0. With this result, we reject the null. Given what we observed in figure 1, we can conclude that dose 2.0 has significant higher effect on tooth growth compared to dose 1.0.

From these 2 tests, we can infer that dose 2.0 has significant higher effect on tooth growth compare to dose 0.5. The last test with these 2 dose(s) is not necessary.

c. Compare tooth growth by supp at dose 2.0 using hypothesis test

# extract data
lenOJ20 <- subset(ToothGrowth, supp == "OJ" & dose == 2)[, 1]
lenVC20 <- subset(ToothGrowth, supp == "VC" & dose == 2)[, 1]

# run test
t.test(lenVC20, lenOJ20, alternative = "two.sided", paired = FALSE, 
    var.equal = FALSE, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  lenVC20 and lenOJ20
## t = 0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.63807  3.79807
## sample estimates:
## mean of x mean of y 
##     26.14     26.06

This test returns a very high probability (96%). It is also supported by the fact that zero stays within the 95% confidence interval. This result suggsets we fail to reject to null. In other words, there is no significant difference of tooth growth between supp OJ and VC at dose 2.0 mg.

4. State your conclusions and the assumptions needed for your conclusions