Part 2

Title: Analysis on ToothGrowth dataset

Overview

In this analysis, we first describe the ToothGrowth dataset to understand its structure. We then carry out statistical tests to compare tooth growth by supplement type (supp) and by dose.

Load the ToothGrowth dataset from the R datasets package.

library(datasets)
data(ToothGrowth)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Exploratory Data Analysis

Let's look at a summary of the variables in the dataset.

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
# Look at the frequency table of the dosage
table(ToothGrowth$dose)
## 
## 0.5   1   2 
##  20  20  20

From the above results, we can see that there are 2 groups in the variable supp (‘OJ’ and ‘VC’) and 3 groups in the variable dose (0.5, 1.0 and 2.0), with 30 observations per supplement and 20 per dose.
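
The group levels can also be listed directly instead of being read off the summary. The following is a minimal sketch in base R; the variable names dose_levels, supp_levels and cell_counts are illustrative and are not part of the original analysis.

```r
library(datasets)

# Distinct dose values, sorted in ascending order
dose_levels <- sort(unique(ToothGrowth$dose))

# Levels of the supplement factor
supp_levels <- levels(ToothGrowth$supp)

# Cross-tabulate supplement by dose to see the number of observations per cell
cell_counts <- table(ToothGrowth$supp, ToothGrowth$dose)

print(dose_levels)
print(supp_levels)
print(cell_counts)
```

The cross-tabulation is useful because it shows whether the design is balanced across supplement and dose, which the two separate frequency counts cannot show on their own.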

Next, we explore the effect of these groups on tooth length using boxplots (the plotting code is in Appendix B).

From the boxplot of supplement vs tooth length (Figure 1), we observe a difference between the median tooth length of the two supplement groups. However, we should not draw a conclusion from this finding until statistical testing has been performed to support it.

# Compute the difference in mean tooth length (OJ minus VC)
mean.diff = mean(ToothGrowth[ToothGrowth$supp == 'OJ',]$len) - mean(
  ToothGrowth[ToothGrowth$supp == 'VC',]$len)
sprintf('Difference of mean: %.2f', mean.diff)
## [1] "Difference of mean: 3.70"
# Compute the difference in median tooth length (OJ minus VC)
med.diff = median(ToothGrowth[ToothGrowth$supp == 'OJ',]$len) - median(
  ToothGrowth[ToothGrowth$supp == 'VC',]$len)
sprintf('Difference of median: %.2f', med.diff)
## [1] "Difference of median: 6.20"

As for the dose, the boxplot of dose vs tooth length (Figure 2) suggests that tooth length differs between dosage groups, in both mean and median.
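
The per-dose means and medians can be computed explicitly rather than judged from the plot. This is a small sketch using base R's tapply; the data frame name dose_summary is an assumption made for illustration.

```r
library(datasets)

# Mean and median tooth length for each dose level
dose_means   <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)
dose_medians <- tapply(ToothGrowth$len, ToothGrowth$dose, median)

# Combine into one small table, one row per dose
dose_summary <- data.frame(
  dose   = as.numeric(names(dose_means)),
  mean   = as.vector(dose_means),
  median = as.vector(dose_medians)
)
print(dose_summary)
```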

Statistical Testing

We now run statistical tests to check these observations. Two sets of hypothesis tests are covered here.

  • Hypothesis on supplement:

    H0: Supplement type has no effect on tooth growth, which means mOJ = mVC; H1: Supplement type does affect tooth growth, which means mOJ - mVC ≠ 0.

  • Hypothesis on dosage:

    H0: Supplement dosage has no effect on tooth growth, which means m0.5 = m1.0; H1: Supplement dosage does affect tooth growth, which means m0.5 - m1.0 ≠ 0. (The same procedure is repeated to compare (m0.5, m2.0) and (m1.0, m2.0).)

Hypothesis testing on Supplement

We start with the hypothesis testing on supplement effect.

Assumption

Before proceeding, we state the assumptions required by the statistical test that we will use.

  1. Each sampling group is representative of its population.
  2. The variances of the two sampling groups are not assumed to be equal.
  3. The samples in each sampling group are i.i.d. (independent and identically distributed).
  4. The sampling distribution of the mean is approximately normal.
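
The unequal-variance assumption can be examined informally by comparing the sample variances of the two groups. The following is only a rough sketch, not a formal test, and the name var_ratio is illustrative.

```r
library(datasets)

# Sample size and sample variance of tooth length per supplement group
group_n   <- tapply(ToothGrowth$len, ToothGrowth$supp, length)
group_var <- tapply(ToothGrowth$len, ToothGrowth$supp, var)

print(group_n)
print(group_var)

# Ratio of the larger to the smaller variance; a ratio close to 1 would
# suggest that an equal-variance test is also plausible
var_ratio <- max(group_var) / min(group_var)
print(var_ratio)
```

Choosing the unequal-variance (Welch) form is the more conservative option, since it does not rely on the two variances being the same.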

Subset the dataset according to the group of interest.

library(dplyr)

# pull() returns the tooth lengths as plain numeric vectors
len_OJ <- filter(ToothGrowth, supp == 'OJ') %>% pull(len)
len_VC <- filter(ToothGrowth, supp == 'VC') %>% pull(len)

Perform a Welch two-sample t-test on the two subsets.

t.test(len_OJ, len_VC, var.equal = FALSE, paired = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len_OJ and len_VC
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

From the t-test result, we learn that:

  1. The p-value is 0.06063, which is larger than alpha = 0.05, so we fail to reject the null hypothesis.
  2. The 95% confidence interval (-0.171, 7.571) includes 0, which is consistent with the conclusion drawn from the p-value.
  3. The sample means of the OJ and VC groups are 20.66 and 16.96 respectively.
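
To show where the t statistic, degrees of freedom, p-value and confidence interval come from, the Welch test can be reproduced by hand. This is a sketch using the standard Welch-Satterthwaite formulas; the variable names (se2_x, t_stat, and so on) are chosen here for illustration.

```r
library(datasets)

x <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
y <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error contributed by each group
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
se    <- sqrt(se2_x + se2_y)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_val <- 2 * pt(-abs(t_stat), df)
ci    <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se

print(c(t = t_stat, df = df, p = p_val))
print(ci)
```

These values should agree with the output of t.test when it is called with var.equal = FALSE.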

Hypothesis testing on dosage

Next, we test the effect of dosage on tooth growth.

Before proceeding, we state the assumptions required by the permutation test that we will use.

  1. Each sampling group is representative of its population.
  2. The samples in each sampling group are i.i.d. (independent and identically distributed).

Dosage 0.5 vs 1.0

We will start with dosage 0.5 vs 1.0.

sample.subset = ToothGrowth[ToothGrowth$dose %in% c(0.5,1.0), ]
len = sample.subset$len

Perform a permutation test with 10000 permutations.

observed.group = sample.subset$dose

## x = tooth length vector, g = dosage vector, c1 = group 1, c2 = group 2
testStat = function(x, g, c1, c2) mean(x[g == c1]) - mean(x[g == c2]) 

observed.diff = testStat(len, observed.group, 0.5, 1.0)
permutations = sapply(1:10000, function(i) testStat(len, sample(sample.subset$dose), 
                                                    0.5, 1.0))
p.value = mean(abs(permutations)>=abs(observed.diff))

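From the permutation test, the estimated p-value is 0, meaning that none of the 10000 permuted differences was as extreme as the observed one. We therefore reject the null hypothesis that the 0.5 and 1.0 dosage groups have the same mean tooth length.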
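
The permutation procedure can be wrapped in a reusable function. The following is a minimal sketch: the helper name perm_test, the seed value, and the "+1" correction (which counts the observed labeling as one permutation, so the estimated p-value is never exactly 0) are assumptions added here and are not part of the original code.

```r
library(datasets)

# Two-sided permutation test for a difference in group means.
# x: numeric responses, g: group labels, c1/c2: the two labels to compare,
# n_perm: number of random relabelings (hypothetical helper for illustration)
perm_test <- function(x, g, c1, c2, n_perm = 10000) {
  stat <- function(labels) mean(x[labels == c1]) - mean(x[labels == c2])
  observed <- stat(g)
  # Shuffle the group labels and recompute the statistic each time
  permuted <- replicate(n_perm, stat(sample(g)))
  # Count the observed labeling as one permutation to avoid a p-value of 0
  p <- (sum(abs(permuted) >= abs(observed)) + 1) / (n_perm + 1)
  list(observed = observed, p.value = p)
}

set.seed(1)  # arbitrary seed, only for reproducibility
sub <- ToothGrowth[ToothGrowth$dose %in% c(0.5, 1.0), ]
res <- perm_test(sub$len, sub$dose, 0.5, 1.0)
print(res)
```

With this correction, the smallest p-value that can be reported is 1 / (n_perm + 1), which makes it explicit that the precision of the estimate is limited by the number of permutations.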

Dosage 0.5 vs 2.0

Next, we compare dosage 0.5 vs 2.0.

sample.subset = ToothGrowth[ToothGrowth$dose %in% c(0.5,2.0), ]
len = sample.subset$len

Perform a permutation test with 10000 permutations.

observed.group = sample.subset$dose
observed.diff = testStat(len, observed.group, 0.5, 2.0)

permutations = sapply(1:10000, function(i) testStat(len, sample(sample.subset$dose)
                                                    , 0.5, 2.0))

p.value = mean(abs(permutations)>=abs(observed.diff))

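From the permutation test, the estimated p-value is 0, meaning that none of the 10000 permuted differences was as extreme as the observed one. We therefore reject the null hypothesis that the 0.5 and 2.0 dosage groups have the same mean tooth length.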

Dosage 1.0 vs 2.0

Finally, we compare dosage 1.0 vs 2.0.

sample.subset = ToothGrowth[ToothGrowth$dose %in% c(1.0,2.0), ]
len = sample.subset$len

Perform a permutation test with 10000 permutations.

observed.group = sample.subset$dose
observed.diff = testStat(len, observed.group, 1.0, 2.0)

permutations = sapply(1:10000, function(i) testStat(len, sample(sample.subset$dose)
                                                    , 1.0, 2.0))
p.value = mean(abs(permutations)>=abs(observed.diff))

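From the permutation test, the estimated p-value is 0, meaning that none of the 10000 permuted differences was as extreme as the observed one. We therefore reject the null hypothesis that the 1.0 and 2.0 dosage groups have the same mean tooth length.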
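
As a possible cross-check on the three permutation tests, the same pairwise comparisons can be run with Welch t-tests in a single loop. This sketch is not part of the original analysis; the names pairs and welch_p are illustrative.

```r
library(datasets)

dose_levels <- sort(unique(ToothGrowth$dose))

# Every unordered pair of dose levels; each column of the matrix is one pair
pairs <- combn(dose_levels, 2)

# Welch two-sample t-test p-value for each pair of doses
welch_p <- apply(pairs, 2, function(p) {
  a <- ToothGrowth$len[ToothGrowth$dose == p[1]]
  b <- ToothGrowth$len[ToothGrowth$dose == p[2]]
  t.test(a, b, var.equal = FALSE)$p.value
})
names(welch_p) <- apply(pairs, 2, paste, collapse = " vs ")
print(welch_p)
```

Agreement between the two approaches would suggest that the result does not depend on the choice of test.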

Conclusion

From the results of the above statistical tests, we conclude that the effect of supplement type on tooth growth is not statistically significant at the 0.05 level. However, when we compare the mean differences between dosage groups, we find that the effect of dose on tooth growth is statistically significant. This result leads us to believe that dose may play a more dominant role than supplement type in affecting tooth growth.

Appendix B

R code for Figures 1 and 2

library(ggplot2)
library(cowplot)

 g1 = ggplot(data = ToothGrowth) + geom_boxplot(mapping = aes(x = supp, y = len, 
                                                              fill = as.factor(supp))) + 
   theme_bw(base_size = 8) + 
   labs(x = 'Supplements', y = 'Tooth Length', 
        title = 'Figure 1\nBoxplot of Supplement vs Tooth Length') + 
   guides(fill = guide_legend(title = 'Supplement')) 
 
 g2 = ggplot(data = ToothGrowth) + 
   geom_boxplot(mapping = aes(x = as.factor(dose), y = len, fill = as.factor(dose))) + 
   theme_bw(base_size = 8) + 
   labs(x = 'Dosage', y = 'Tooth Length', title = 'Figure 2\nBoxplot of Dose vs Tooth Length') + 
   guides(fill = guide_legend(title = 'Dose'))

plot_grid(g1, g2, align = 'h', scale = .6)