In this analysis, we first perform a descriptive analysis of the ToothGrowth dataset to understand the data. We then carry out statistical tests to compare tooth growth by supplement type (supp) and dose.
Load the ToothGrowth dataset from the R datasets package.
library(datasets)
data(ToothGrowth)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
Let's look at the summary of the variables in the dataset.
summary(ToothGrowth)
##       len        supp         dose
##  Min.   : 4.20   OJ:30   Min.   :0.500
##  1st Qu.:13.07   VC:30   1st Qu.:0.500
##  Median :19.25           Median :1.000
##  Mean   :18.81           Mean   :1.167
##  3rd Qu.:25.27           3rd Qu.:2.000
##  Max.   :33.90           Max.   :2.000
## Look at the frequency table of the dosage
table(ToothGrowth$dose)
##
## 0.5 1 2
## 20 20 20
From the above results, we can see that there are 2 groups in the variable supp (‘OJ’ and ‘VC’) and 3 groups in the variable dose (0.5, 1.0 and 2.0), with 20 observations in each dose group.
Next, we explore the effect of these groups on tooth length using boxplots (Figures 1 and 2).
From the boxplot by supplement, we observe a difference between the median tooth lengths of the two supplement groups. However, we will not draw a conclusion from this finding until statistical testing has been performed to support it.
# Compute the mean difference
mean.diff = mean(ToothGrowth[ToothGrowth$supp == 'OJ',]$len) - mean(
ToothGrowth[ToothGrowth$supp == 'VC',]$len)
sprintf('Difference of mean: %.2f', mean.diff)
## [1] "Difference of mean: 3.70"
med.diff = median(ToothGrowth[ToothGrowth$supp == 'OJ',]$len) - median(
ToothGrowth[ToothGrowth$supp == 'VC',]$len)
sprintf('Difference of median: %.2f', med.diff)
## [1] "Difference of median: 6.20"
As for the dose, the boxplot suggests that both the mean and the median tooth length differ between dosage groups.
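To put numbers on what the dose boxplot suggests, the per-dose mean and median can be tabulated directly. This is a minimal sketch using base R only; the names `dose_mean`, `dose_median` and `dose_summary` are introduced here for illustration and are not part of the original analysis.

```r
library(datasets)
data(ToothGrowth)

# Mean and median tooth length within each dose group
dose_mean   <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)
dose_median <- tapply(ToothGrowth$len, ToothGrowth$dose, median)

# Collect both statistics in one small table (illustrative name)
dose_summary <- data.frame(
  dose   = names(dose_mean),
  mean   = as.numeric(dose_mean),
  median = as.numeric(dose_median)
)
print(dose_summary)
```

A table like this only describes the sample; whether the differences are statistically significant is addressed by the tests that follow.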
Now we carry out some statistical tests to verify our hypotheses. There are 2 sets of hypothesis tests that we will cover here.
Hypothesis on supplement:
H0: Supplement type has no effect on tooth length growth, which means mOJ = mVC; H1: Supplement type does affect tooth length growth, which means mOJ - mVC ≠ 0.
Hypothesis on dosage:
H0: Supplement dosage has no effect on tooth length growth, which means m0.5 = m1.0; H1: Supplement dosage does affect tooth length growth, which means m0.5 - m1.0 ≠ 0. (The same procedure is repeated to compare (m0.5, m2.0) and (m1.0, m2.0).)
We start with the hypothesis testing on supplement effect.
Before proceeding any further, we note that some assumptions are made as prerequisites for the statistical test that we will use.
Subset the dataset according to the group of interest.
library(dplyr)
# pull() extracts len as a plain numeric vector for each supplement group
len_OJ <- filter(ToothGrowth, supp == 'OJ') %>% pull(len)
len_VC <- filter(ToothGrowth, supp == 'VC') %>% pull(len)
Perform a two-sample t-test (unequal variances, unpaired) on the two subsets.
t.test(len_OJ, len_VC, var.equal = FALSE, paired = FALSE)
##
## Welch Two Sample t-test
##
## data: len_OJ and len_VC
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
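From the t-test result, we learnt that the p-value (0.061) is greater than 0.05 and the 95% confidence interval (-0.17 to 7.57) contains 0, so we fail to reject the null hypothesis at the 5% significance level.
Next, we test the hypothesis on the effect of dosage on tooth length growth.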
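The quantities in the Welch output above can be reproduced by hand, which makes it clearer where the t statistic, the fractional degrees of freedom and the confidence interval come from. The following is a sketch using the standard Welch-Satterthwaite formulas; the variable names (`se2_x`, `t_stat`, `welch_df`, and so on) are illustrative and not part of the original analysis.

```r
library(datasets)
data(ToothGrowth)

x <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
y <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
se    <- sqrt(se2_x + se2_y)

# Welch t statistic: difference in means over its standard error
t_stat <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the mean difference
p_value <- 2 * pt(-abs(t_stat), welch_df)
ci      <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, welch_df) * se

print(c(t = t_stat, df = welch_df, p.value = p_value))
print(ci)
```

These values should agree with what `t.test(x, y)` reports, since that function uses the Welch test by default for two samples.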
Before proceeding any further, we note that some assumptions are made as prerequisites for the statistical test that we will use.
We will start with dosage 0.5 vs 1.0.
sample.subset = ToothGrowth[ToothGrowth$dose %in% c(0.5,1.0), ]
len = sample.subset$len
Perform a permutation test with 10000 repetitions.
observed.group = sample.subset$dose
## x = tooth length vector, g = dosage vector, c1 = group 1, c2 = group 2
testStat = function(x, g, c1, c2) mean(x[g == c1]) - mean(x[g == c2])
observed.diff = testStat(len, observed.group, 0.5, 1.0)
permutations = sapply(1:10000, function(i) testStat(len, sample(sample.subset$dose),
0.5, 1.0))
p.value = mean(abs(permutations)>=abs(observed.diff))
From the permutation test, we learnt that the estimated p-value is 0: none of the 10000 permuted differences was as extreme as the observed difference. We therefore reject the null hypothesis.
Proceed with dosage 0.5 vs 2.0
sample.subset = ToothGrowth[ToothGrowth$dose %in% c(0.5,2.0), ]
len = sample.subset$len
Perform a permutation test with 10000 repetitions.
observed.group = sample.subset$dose
observed.diff = testStat(len, observed.group, 0.5, 2.0)
permutations = sapply(1:10000, function(i) testStat(len, sample(sample.subset$dose)
, 0.5, 2.0))
p.value = mean(abs(permutations)>=abs(observed.diff))
From the permutation test, we learnt that the estimated p-value is 0: none of the 10000 permuted differences was as extreme as the observed difference. We therefore reject the null hypothesis.
Proceed with dosage 1.0 vs 2.0
sample.subset = ToothGrowth[ToothGrowth$dose %in% c(1.0,2.0), ]
len = sample.subset$len
Perform a permutation test with 10000 repetitions.
observed.group = sample.subset$dose
observed.diff = testStat(len, observed.group, 1.0, 2.0)
permutations = sapply(1:10000, function(i) testStat(len, sample(sample.subset$dose)
, 1.0, 2.0))
p.value = mean(abs(permutations)>=abs(observed.diff))
From the permutation test, we learnt that the estimated p-value is 0: none of the 10000 permuted differences was as extreme as the observed difference. We therefore reject the null hypothesis.
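The three dose comparisons above repeat the same steps, so they could be wrapped in a single helper. The sketch below is one possible way to do this; the function `perm_test`, its arguments, and the fixed seed are assumptions added for illustration (the original analysis does not set a seed, so its permutation draws are not reproducible).

```r
library(datasets)
data(ToothGrowth)

# Hypothetical helper: two-sided permutation test for a difference in means
# x = response vector, g = grouping vector, c1/c2 = the two groups to compare
perm_test <- function(x, g, c1, c2, n_perm = 10000, seed = 1) {
  set.seed(seed)                      # assumption: fixed seed for reproducibility
  keep <- g %in% c(c1, c2)            # restrict to the two groups of interest
  x <- x[keep]
  g <- g[keep]
  stat <- function(x, g) mean(x[g == c1]) - mean(x[g == c2])
  observed <- stat(x, g)
  # Shuffle the group labels and recompute the statistic each time
  perms <- replicate(n_perm, stat(x, sample(g)))
  list(observed = observed,
       p.value  = mean(abs(perms) >= abs(observed)))
}

# Run all three pairwise dose comparisons
dose_pairs <- list(c(0.5, 1.0), c(0.5, 2.0), c(1.0, 2.0))
results <- lapply(dose_pairs, function(p) {
  perm_test(ToothGrowth$len, ToothGrowth$dose, p[1], p[2])
})
names(results) <- c("0.5 vs 1.0", "0.5 vs 2.0", "1.0 vs 2.0")
print(sapply(results, function(r) r$p.value))
```

An estimated p-value of exactly 0 here means only that no permutation out of `n_perm` was as extreme as the observed difference, so it is better read as "smaller than about 1/10000" than as a true zero.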
From the results of the above statistical tests, we conclude that the effect of supplement type on tooth growth is not statistically significant. However, when we compare the mean differences attributed to dosage, we find that the effect of dose on tooth growth is statistically significant. This result leads us to believe that dose may play a more dominant role in tooth growth than supplement type.
R code for Figures 1 and 2
library(ggplot2)
library(cowplot)
g1 = ggplot(data = ToothGrowth) + geom_boxplot(mapping = aes(x = supp, y = len,
fill = as.factor(supp))) +
theme_bw(base_size = 8) +
labs(x = 'Supplements', y = 'Tooth Length',
title = 'Figure 1\nBoxplot of Supplement vs Tooth Length') +
guides(fill = guide_legend(title = 'Supplement'))
g2 = ggplot(data = ToothGrowth) +
geom_boxplot(mapping = aes(x = as.factor(dose), y = len, fill = as.factor(dose))) +
theme_bw(base_size = 8) +
labs(x = 'Dosage', y = 'Tooth Length', title = 'Figure 2\nBoxplot of Dose vs Tooth Length') +
guides(fill = guide_legend(title = 'Dose'))
plot_grid(g1, g2, align = 'h', scale = .6)