Analysis of the ToothGrowth R dataset

Synopsis

This document presents a basic exploratory analysis of the ToothGrowth data in the R datasets package. It uses hypothesis tests based on confidence intervals to compare tooth growth by delivery method and dose level.

Assumptions

The observed guinea pigs are representative of the entire population of guinea pigs.
The guinea pigs were randomly selected and observed.
The variances of the two groups compared in each t-test are assumed to be unequal.
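
The unequal-variance assumption can be inspected informally by comparing sample variances. The sketch below is only an illustration and not part of the original analysis; it reads the dataset directly and therefore uses the original column names (len, supp) rather than the renamed ones. The variable name group_var is hypothetical.

```r
library(datasets)

# Sample variance of tooth length for each delivery method
# (original column names: len = length, supp = delivery method)
group_var <- tapply(ToothGrowth$len, ToothGrowth$supp, var)
print(group_var)

# Ratio of the larger to the smaller variance, as a rough descriptive check
print(max(group_var) / min(group_var))
```

Treating the variances as unequal is the cautious choice here, because the unequal-variance t-test remains valid when the variances happen to be equal.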

Data processing

The ToothGrowth R dataset records the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).
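
The design described above can be checked with a cross-tabulation. This is a minimal sketch that reads the dataset directly, so it uses the original column names (supp, dose); the variable name design is hypothetical.

```r
library(datasets)

# Count the animals in each delivery method x dose combination
design <- table(method = ToothGrowth$supp, dose = ToothGrowth$dose)
print(design)

# Total number of observations
print(sum(design))
```

Each of the six combinations should contain the same number of animals, which is what makes the group comparisons later in the document balanced.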

Loading

library(datasets)
data(ToothGrowth)

toothGrowth <- ToothGrowth
names(toothGrowth) <- c("length", "method", "dose")
toothGrowth$dose <- as.factor(toothGrowth$dose)

Exploratory data analysis

str(toothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ length: num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ method: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose  : Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...

head(toothGrowth)

##   length method dose
## 1    4.2     VC  0.5
## 2   11.5     VC  0.5
## 3    7.3     VC  0.5
## 4    5.8     VC  0.5
## 5    6.4     VC  0.5
## 6   10.0     VC  0.5

summary(toothGrowth)

##      length      method   dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

The following box plots show the median, quartiles, range and outliers of tooth length by dose and by delivery method.

library(ggplot2)
library(grid)
library(gridExtra)

p1 <- ggplot(toothGrowth, aes(x=dose, y=length, color=dose)) + 
    geom_boxplot() +
    theme(legend.position = "none")

p2 <- ggplot(toothGrowth, aes(x=dose, y=length, color=dose)) + 
    geom_boxplot() +
    facet_grid(. ~ method)

p3 <- ggplot(toothGrowth, aes(x=method, y=length, color=method)) + 
    geom_boxplot() +
    theme(legend.position = "none")

p4 <- ggplot(toothGrowth, aes(x=method, y=length, color=method)) + 
    geom_boxplot() +
    facet_grid(. ~ dose)

grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2, widths=c(1.5, 2.5))

The exploratory analysis suggests:

A positive relationship between the dosage level and the tooth growth.
No clear overall relationship between the delivery method and the tooth growth.
Greater tooth length with the orange juice delivery method than with ascorbic acid at the 0.5 and 1.0 dosage levels.
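
The observations above can be cross-checked numerically with a table of group means. The sketch below is illustrative only; it reads the dataset directly, so it uses the original column names (len, supp, dose), and group_means is a hypothetical variable name.

```r
library(datasets)

# Mean tooth length for each delivery method (rows) and dose (columns)
group_means <- tapply(
  ToothGrowth$len,
  list(method = ToothGrowth$supp, dose = ToothGrowth$dose),
  mean
)
print(round(group_means, 2))

# Mean tooth length by dose only, ignoring the delivery method
print(tapply(ToothGrowth$len, ToothGrowth$dose, mean))
```

Reading across a row shows how the mean changes with dose; reading down a column compares the two delivery methods at a fixed dose.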

Hypothesis testing with confidence intervals

This section compares tooth growth by delivery method and dose using hypothesis tests based on confidence intervals. The following relationships are analysed:

Delivery method and tooth length
Dosage level and tooth length
Delivery method and tooth length inside dose levels

Delivery method and tooth growth

This section examines whether the delivery method is related to the tooth length.

The null hypothesis is that mean tooth length does not differ between the two delivery methods.

T-test

The grouping factor for the t-test is the delivery method. Since it has exactly two levels, no data preparation is required. The two groups are unpaired and assumed to have unequal variances.

# Unpaired (the default) two-sample t-test with unequal variances
test <- t.test(length ~ method, var.equal = FALSE, data = toothGrowth)
test$conf.int[1:2]

## [1] -0.1710156  7.5710156

test$p.value

## [1] 0.06063451

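The 95% confidence interval contains zero and the p-value is greater than 0.05. Consequently the null hypothesis cannot be rejected at the 5% significance level. This does not prove that the delivery method has no effect; the data simply do not provide enough evidence of one.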
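
To make the link between the test and its confidence interval explicit, the interval can be recomputed by hand from the group means, the standard error of their difference, and the Welch-Satterthwaite degrees of freedom. This is a sketch for illustration; it reads the dataset directly (original column names len and supp), and the names oj, vc, v_oj, v_vc and manual_ci are hypothetical.

```r
library(datasets)

# Tooth lengths per delivery method (original column names: len, supp)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean
v_oj <- var(oj) / length(oj)
v_vc <- var(vc) / length(vc)

# Standard error of the difference in means (variances not pooled)
se <- sqrt(v_oj + v_vc)

# Welch-Satterthwaite approximation of the degrees of freedom
df <- (v_oj + v_vc)^2 /
  (v_oj^2 / (length(oj) - 1) + v_vc^2 / (length(vc) - 1))

# Two-sided 95% confidence interval for mean(OJ) - mean(VC)
manual_ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df) * se
print(manual_ci)
```

The interval should agree with test$conf.int reported above. Because it straddles zero, a difference of zero between the two delivery methods remains compatible with the data at the 5% level.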

Dosage and tooth growth

This section examines whether the dosage level is related to the tooth length.

The null hypothesis is that mean tooth length does not differ between dose levels.

T-test

The grouping factor for the t-test is the dosage level. Because it has more than two levels, the data must be prepared before running a t-test: one subset of unpaired observations is built for each pair of dosage levels.

dose.level.5_10 <- subset(toothGrowth, dose %in% c(.5, 1.0))
dose.level.5_20 <- subset(toothGrowth, dose %in% c(.5, 2.0))
dose.level.10_20 <- subset(toothGrowth, dose %in% c(1.0, 2.0))

Each test assumes unequal variances between the two groups of observations.

# Dose 0.5 versus dose 1.0
test <- t.test(length ~ dose, var.equal = FALSE, data = dose.level.5_10)
test$conf.int[1:2]

## [1] -11.983781  -6.276219

test$p.value

## [1] 1.268301e-07

The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently the null hypothesis is rejected at the 5% significance level.

# Dose 0.5 versus dose 2.0
test <- t.test(length ~ dose, var.equal = FALSE, data = dose.level.5_20)
test$conf.int[1:2]

## [1] -18.15617 -12.83383

test$p.value

## [1] 4.397525e-14

The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently the null hypothesis is rejected at the 5% significance level.

# Dose 1.0 versus dose 2.0
test <- t.test(length ~ dose, var.equal = FALSE, data = dose.level.10_20)
test$conf.int[1:2]

## [1] -8.996481 -3.733519

test$p.value

## [1] 1.90643e-05

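The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently the null hypothesis is rejected at the 5% significance level.

These t-tests show a statistically significant difference in mean tooth length between every pair of dosage levels. All three intervals are entirely negative, so the higher dose is associated with the greater tooth length in each comparison.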
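
The three pairwise comparisons can also be generated in a loop instead of building each subset by hand. The following is a sketch of one possible approach, not the method used above; it reads the dataset directly (original column names len and dose), and the names tg, dose_pairs and pairwise are hypothetical.

```r
library(datasets)

tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# Every unordered pair of dose levels: 0.5/1, 0.5/2 and 1/2
dose_pairs <- combn(levels(tg$dose), 2, simplify = FALSE)

# Run one unequal-variance t-test per pair and collect the results
pairwise <- do.call(rbind, lapply(dose_pairs, function(pair) {
  pair_data <- droplevels(tg[tg$dose %in% pair, ])
  tt <- t.test(len ~ dose, data = pair_data, var.equal = FALSE)
  data.frame(
    comparison = paste(pair, collapse = " vs "),
    lower = tt$conf.int[1],
    upper = tt$conf.int[2],
    p.value = tt$p.value
  )
}))
print(pairwise)
```

Collecting the intervals in one data frame makes it easier to compare them side by side and avoids copy-and-paste errors between the three tests.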

Delivery method and tooth length inside dose levels

This section takes a closer look at the relationship between the delivery method and the tooth length, this time within each dosage level.

The null hypothesis is that, for a given dosage level, mean tooth length does not differ between the two delivery methods.

T-test

The grouping factor for the t-test is the delivery method, which has two levels. Because the comparison is made within each dosage level, the data must first be split into one subset of unpaired observations per dosage level.

dose.level.5 <- subset(toothGrowth, dose %in% c(.5))
dose.level.10 <- subset(toothGrowth, dose %in% c(1.0))
dose.level.20 <- subset(toothGrowth, dose %in% c(2.0))

Each test assumes unequal variances between the two groups of observations.

# Orange juice versus ascorbic acid at dose 0.5
test <- t.test(length ~ method, var.equal = FALSE, data = dose.level.5)
test$conf.int[1:2]

## [1] 1.719057 8.780943

test$p.value

## [1] 0.006358607

The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently the null hypothesis is rejected at the 5% significance level.

# Orange juice versus ascorbic acid at dose 1.0
test <- t.test(length ~ method, var.equal = FALSE, data = dose.level.10)
test$conf.int[1:2]

## [1] 2.802148 9.057852

test$p.value

## [1] 0.001038376

The 95% confidence interval does not contain zero and the p-value is smaller than 0.05. Consequently the null hypothesis is rejected at the 5% significance level.

# Orange juice versus ascorbic acid at dose 2.0
test <- t.test(length ~ method, var.equal = FALSE, data = dose.level.20)
test$conf.int[1:2]

## [1] -3.79807  3.63807

test$p.value

## [1] 0.9638516

The 95% confidence interval contains zero and the p-value is greater than 0.05. Consequently the null hypothesis cannot be rejected at the 5% significance level.

The t-tests indicate:

a statistically significant difference in tooth length between the delivery methods at the 0.5 and 1.0 dosage levels, in favour of orange juice.
no statistically significant difference in tooth length between the delivery methods at the 2.0 dosage level.
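
The per-dose comparison can be written more compactly by splitting the data by dose and applying the same test to each piece. This is a sketch under the same assumptions as above (unpaired groups, unequal variances); it reads the dataset directly (original column names len, supp, dose), and by_dose and per_dose are hypothetical names.

```r
library(datasets)

# One data frame per dose level, named "0.5", "1" and "2"
by_dose <- split(ToothGrowth, ToothGrowth$dose)

# Unequal-variance t-test of delivery method within each dose level;
# the result has one row per dose and one column per statistic
per_dose <- t(sapply(by_dose, function(d) {
  tt <- t.test(len ~ supp, data = d, var.equal = FALSE)
  c(lower = tt$conf.int[1], upper = tt$conf.int[2], p.value = tt$p.value)
}))
print(per_dose)
```

Rows whose interval excludes zero correspond to the dose levels at which the delivery methods differ significantly.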

Conclusion

This analysis leads to the following conclusions, based on the above assumptions:

Tooth length increases significantly with the dosage level.
Overall, the delivery method has no statistically significant effect on tooth length.
At the 2.0 dosage level, the delivery method has no statistically significant effect on tooth length.
At the 0.5 and 1.0 dosage levels, orange juice is associated with significantly greater tooth length than ascorbic acid.