This document proposes a basic exploratory analysis of the ToothGrowth data in the R datasets package. It uses hypothesis tests based on confidence intervals to compare tooth growth by delivery methods and dose levels.
The ToothGrowth R dataset measures the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).
library(datasets)
data(ToothGrowth)
toothGrowth <- ToothGrowth
names(toothGrowth) <- c("length", "method", "dose")
toothGrowth$dose <- as.factor(toothGrowth$dose)
str(toothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ length: num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ method: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose : Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
head(toothGrowth)
## length method dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
summary(toothGrowth)
## length method dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
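The overall summary hides how the 60 observations split across the six method and dose combinations. As a minimal sketch, the per-group means and sizes could be tabulated with `aggregate`. It uses the dataset's original column names (`len` and `supp`, which this report renames to `length` and `method`); the variable names `group_means` and `group_sizes` are illustrative only.

```r
library(datasets)

# Mean tooth length for each delivery method and dose combination.
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

# Number of animals in each combination.
group_sizes <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = length)

print(group_means)
print(group_sizes)
```

A balanced design (equal group sizes) makes the pairwise comparisons later in the report easier to interpret.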
The following box plots show the median, quartiles, range, and outliers of tooth length, grouped by dose level and by delivery method.
library(ggplot2)
library(grid)
library(gridExtra)
# Tooth length by dose, all delivery methods combined
p1 <- ggplot(toothGrowth, aes(x = dose, y = length, color = dose)) +
  geom_boxplot() +
  theme(legend.position = "none")
# Tooth length by dose, one panel per delivery method
p2 <- ggplot(toothGrowth, aes(x = dose, y = length, color = dose)) +
  geom_boxplot() +
  facet_grid(. ~ method)
# Tooth length by delivery method, all doses combined
p3 <- ggplot(toothGrowth, aes(x = method, y = length, color = method)) +
  geom_boxplot() +
  theme(legend.position = "none")
# Tooth length by delivery method, one panel per dose
p4 <- ggplot(toothGrowth, aes(x = method, y = length, color = method)) +
  geom_boxplot() +
  facet_grid(. ~ dose)
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2, widths = c(1.5, 2.5))
The exploratory analysis suggests:
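This section compares tooth growth by delivery method and dose using hypothesis tests based on confidence intervals. Three relationships are analysed: delivery method and tooth length, dose level and tooth length, and delivery method and tooth length at each dose level.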
This section analyses the relationship between the delivery method and tooth length.
The null hypothesis is that mean tooth length does not differ between the two delivery methods.
The grouping factor for the t-test is the delivery method. It has exactly two levels, so no data preparation is required. The two groups of observations are assumed to have unequal variances.
# Welch two-sample t-test; the groups are unpaired, which is the default
test <- t.test(length ~ method, var.equal = FALSE, data = toothGrowth)
test$conf.int[1:2]
## [1] -0.1710156 7.5710156
test$p.value
## [1] 0.06063451
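The 95% confidence interval contains zero and the p-value is greater than 0.05. Consequently, the null hypothesis cannot be rejected at the 5% significance level. This does not show that the delivery methods are equivalent, only that the data do not provide enough evidence of a difference.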
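To make the link between the confidence interval and the p-value concrete, the sketch below recomputes both by hand using the usual unequal-variance (Welch) formulas. It is an illustration under the assumption that `t.test` with `var.equal = FALSE` follows the Welch-Satterthwaite approximation, and it uses the dataset's original column names (`len`, `supp`). All variable names are illustrative.

```r
library(datasets)

oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Variance of each sample mean (the variances are not pooled).
var_mean_oj <- var(oj) / length(oj)
var_mean_vc <- var(vc) / length(vc)

# Standard error of the difference in means.
se_diff <- sqrt(var_mean_oj + var_mean_vc)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (var_mean_oj + var_mean_vc)^2 /
  (var_mean_oj^2 / (length(oj) - 1) + var_mean_vc^2 / (length(vc) - 1))

# 95% confidence interval for mean(OJ) - mean(VC).
mean_diff <- mean(oj) - mean(vc)
manual_ci <- mean_diff + c(-1, 1) * qt(0.975, df_welch) * se_diff

# Two-sided p-value for the null hypothesis of equal means.
manual_p <- 2 * pt(-abs(mean_diff / se_diff), df_welch)

print(manual_ci)
print(manual_p)
```

The interval contains zero exactly when the two-sided p-value exceeds 0.05, which is why the two criteria always agree in this report.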
This section analyses the relationship between the dose level and tooth length.
The null hypothesis is that mean tooth length does not differ between dose levels.
The grouping factor for the t-test is the dose level. It has more than two levels, so the data must be prepared before running a t-test: one unpaired subset of observations is built for every pair of dose levels.
dose.level.5_10 <- subset(toothGrowth, dose %in% c(.5, 1.0))
dose.level.5_20 <- subset(toothGrowth, dose %in% c(.5, 2.0))
dose.level.10_20 <- subset(toothGrowth, dose %in% c(1.0, 2.0))
The t-tests below assume unequal variances between the two groups of observations. The first compares the 0.5 and 1 mg/day doses.
# 0.5 vs 1 mg/day
test <- t.test(length ~ dose, var.equal = FALSE, data = dose.level.5_10)
test$conf.int[1:2]
## [1] -11.983781 -6.276219
test$p.value
## [1] 1.268301e-07
The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently, the null hypothesis is rejected at the 5% significance level.
# 0.5 vs 2 mg/day
test <- t.test(length ~ dose, var.equal = FALSE, data = dose.level.5_20)
test$conf.int[1:2]
## [1] -18.15617 -12.83383
test$p.value
## [1] 4.397525e-14
The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently, the null hypothesis is rejected at the 5% significance level.
# 1 vs 2 mg/day
test <- t.test(length ~ dose, var.equal = FALSE, data = dose.level.10_20)
test$conf.int[1:2]
## [1] -8.996481 -3.733519
test$p.value
## [1] 1.90643e-05
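The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently, the null hypothesis is rejected at the 5% significance level.
These t-tests indicate a statistically significant relationship between dose level and tooth length. Each confidence interval lies entirely below zero, so mean tooth length is greater at the higher dose in every pair.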
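Running three pairwise tests raises the chance of at least one false positive, so a multiplicity adjustment is a reasonable sanity check. The sketch below is not part of the original analysis: it enumerates the dose pairs with `combn`, reruns the same Welch t-tests, and applies a Holm correction with `p.adjust`. It uses the dataset's original column names (`len`, `dose`), and the variable names are illustrative.

```r
library(datasets)

# Enumerate every pair of dose levels: (0.5, 1), (0.5, 2), (1, 2).
dose_pairs <- combn(sort(unique(ToothGrowth$dose)), 2, simplify = FALSE)

# Welch t-test for each pair; keep the raw p-value.
raw_p <- sapply(dose_pairs, function(pair) {
  pair_data <- subset(ToothGrowth, dose %in% pair)
  t.test(len ~ dose, data = pair_data, var.equal = FALSE)$p.value
})
names(raw_p) <- sapply(dose_pairs, paste, collapse = " vs ")

# Holm adjustment for the three comparisons.
adjusted_p <- p.adjust(raw_p, method = "holm")

print(adjusted_p)
```

With p-values as small as those reported above, the adjustment does not change the conclusion for any of the three pairs.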
This section takes a closer look at the relationship between the delivery method and tooth length, this time within each dose level.
The null hypothesis is that, at a given dose level, mean tooth length does not differ between the two delivery methods.
The grouping factor for the t-test is the delivery method, which has two levels. The dose level has more than two, so the data must be prepared before running a t-test: one unpaired subset of observations is built for each dose level.
dose.level.5 <- subset(toothGrowth, dose %in% c(.5))
dose.level.10 <- subset(toothGrowth, dose %in% c(1.0))
dose.level.20 <- subset(toothGrowth, dose %in% c(2.0))
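The t-tests below assume unequal variances between the two groups of observations. The first compares the delivery methods at the 0.5 mg/day dose.
# OJ vs VC at 0.5 mg/day
test <- t.test(length ~ method, var.equal = FALSE, data = dose.level.5)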
test$conf.int[1:2]
## [1] 1.719057 8.780943
test$p.value
## [1] 0.006358607
The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently, the null hypothesis is rejected at the 5% significance level.
# OJ vs VC at 1 mg/day
test <- t.test(length ~ method, var.equal = FALSE, data = dose.level.10)
test$conf.int[1:2]
## [1] 2.802148 9.057852
test$p.value
## [1] 0.001038376
The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently, the null hypothesis is rejected at the 5% significance level.
# OJ vs VC at 2 mg/day
test <- t.test(length ~ method, var.equal = FALSE, data = dose.level.20)
test$conf.int[1:2]
## [1] -3.79807 3.63807
test$p.value
## [1] 0.9638516
The 95% confidence interval contains zero and the p-value is greater than 0.05. Consequently, the null hypothesis cannot be rejected at the 5% significance level.
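The three per-dose comparisons follow the same pattern, so they can be collected into one table for side-by-side reading. The following is a minimal sketch that splits the data by dose and applies the same Welch t-test to each subset. It uses the dataset's original column names (`len`, `supp`, `dose`), and the names `per_dose` and `per_dose_table` are illustrative.

```r
library(datasets)

# One Welch t-test (OJ vs VC) per dose level.
per_dose <- lapply(split(ToothGrowth, ToothGrowth$dose), function(dose_data) {
  result <- t.test(len ~ supp, data = dose_data, var.equal = FALSE)
  c(ci_lower = result$conf.int[1],
    ci_upper = result$conf.int[2],
    p_value  = result$p.value)
})

# One row per dose level, one column per statistic.
per_dose_table <- do.call(rbind, per_dose)

print(per_dose_table)
```

Each row corresponds to one of the tests above, with the interval expressed as mean(OJ) minus mean(VC).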
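The t-tests indicate that orange juice yields a greater mean tooth length than ascorbic acid at the 0.5 and 1 mg/day doses, while no significant difference between delivery methods is detected at 2 mg/day.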
This analysis leads to the following conclusions, based on the above assumptions: