knitr::opts_chunk$set(echo = TRUE)
data("ToothGrowth"); TG <- ToothGrowth
library(ggplot2)

Synopsis

In this project we investigate tooth length in 60 guinea pigs using the standard R dataset ToothGrowth. Each animal received one of three doses of vitamin C (0.5, 1, or 2 mg/day) by one of two delivery methods: orange juice (OJ) or ascorbic acid (VC). We investigate whether the dose and the supplement type have an impact on tooth length.

Exploratory analysis

From the description in R Help we know that the data set has 60 observations on 3 variables: len - tooth length (units are not provided), supp - supplement type (VC or OJ), dose - dose in milligrams/day (0.5, 1 and 2 mg/day). Now, let’s show summary statistics for each dose and for each supplement type.

with(TG, summary(len[dose == 0.5]));with(TG, summary(len[dose == 1]));with(TG, summary(len[dose == 2]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   7.225   9.850  10.600  12.250  21.500
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.60   16.25   19.25   19.74   23.38   27.30
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.52   25.95   26.10   27.83   33.90

The summaries above show that the mean and median increase with the dose. The minimum and maximum values also increase with the dose.
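As a compact alternative to the three separate calls, the per-dose statistics can also be collected in a single table. This is a minimal sketch in base R; the name dose_stats is illustrative and not part of the original analysis.

```r
# Sketch: per-dose summary statistics gathered in one matrix (base R only).
TG <- ToothGrowth

# One column per dose level, one row per statistic
dose_stats <- sapply(split(TG$len, TG$dose), function(x) {
  c(mean = mean(x), median = median(x), max = max(x))
})
print(round(dose_stats, 2))
```

Reading across a row makes the increase with dose visible at a glance.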

with(TG, summary(len[supp == "VC"])); with(TG, summary(len[supp == "OJ"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20   11.20   16.50   16.96   23.10   33.90
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20   15.52   22.70   20.66   25.72   30.90

The summaries above show that the mean and median are larger for orange juice; however, the maximum value is larger for Vitamin C. Now, let’s create a scatterplot of tooth length by dose with a fitted regression line.

g <- ggplot(TG, aes(dose, len)) + geom_point(shape=1) 
g +  geom_smooth(method=lm, color = "red", se = FALSE) + theme_bw() + 
labs( x= "Dose (mg/day)", y = "Tooth length", title = "Scatterplot of tooth length per dose") +
     theme(plot.title = element_text(size = 8), axis.title.x = element_text(size = 7), 
           axis.title.y = element_text(size = 7))

The plot above shows again that the tooth length increases as the dose increases. Now, let’s create a scatterplot of the tooth length by the supplement.

g <- ggplot(TG, aes(as.numeric(supp), len)) + geom_point(shape=1)
g + geom_smooth(method=lm, color = "red", se = FALSE) + theme_bw() + 
    labs( x= "Supplement: 1 - orange juice, 2 - Vitamin C", y = "Tooth length",
    title = "Scatterplot of tooth length per supplement type") +
    scale_x_continuous(breaks = seq(1, 2)) + theme(plot.title = element_text(size = 8), 
    axis.title.x = element_text(size = 7), axis.title.y = element_text(size = 7))

Even though the maximum value for Vitamin C (2) is greater than for orange juice (1), the regression line shows that on average the tooth growth is greater if a guinea pig is given orange juice rather than Vitamin C. All these exploratory steps lead to two questions: 1. Is there a significant difference in tooth growth depending on the dose size? 2. Is there a significant difference in tooth growth depending on the supplement type?
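Because supp is a categorical factor, a box plot is a possible alternative to a regression line drawn over numeric codes. The sketch below uses base graphics rather than ggplot2 so that it runs without extra packages; the name supp_medians is illustrative.

```r
# Sketch: box plot of tooth length by supplement type (base graphics).
TG <- ToothGrowth

bp <- boxplot(len ~ supp, data = TG,
              xlab = "Supplement type", ylab = "Tooth length",
              main = "Tooth length per supplement type")

# boxplot() returns the plotted statistics; row 3 of $stats holds the medians
supp_medians <- setNames(bp$stats[3, ], bp$names)
print(supp_medians)
```

The boxes show the median and spread of each group directly, which a line through two points cannot do.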

Hypothesis tests

I will use a two-sample, two-sided t-test to test both questions. The groups are not paired, as each guinea pig was given one dose and one supplement only.

Question one: is there a significant difference in tooth growth depending on the dose size?

Let’s define the null and alternative hypotheses: H0: the difference between the population mean tooth lengths for dose 2 mg/day and dose 0.5 mg/day is equal to zero. Ha: the difference between the population mean tooth lengths for dose 2 mg/day and dose 0.5 mg/day is not equal to zero.

Let’s create two groups, g05 for dose 0.5 mg/day and g2 for dose 2 mg/day and check their variances.

g05 <- TG$len[TG$dose == 0.5]; g2 <- TG$len[TG$dose == 2]
var05 <- round(sd(g05)^2, 2); var2 <- round(sd(g2)^2, 2)

So the variance for the group with dose 0.5 mg/day is 20.25 and the variance for the group with dose 2 mg/day is 14.24. The ratio is around 1.4, so we assume they are equal. Let’s do our test then.
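The equal-variance assumption could also be checked formally instead of judging the ratio by eye. The sketch below uses the F test from var.test(); it is an addition to the original analysis, and the F test is itself sensitive to departures from normality, so its result is only a rough guide.

```r
# Sketch: F test of the equal-variance assumption for the two dose groups.
TG  <- ToothGrowth
g05 <- TG$len[TG$dose == 0.5]
g2  <- TG$len[TG$dose == 2]

# H0: the ratio of the two population variances equals 1
vt <- var.test(g05, g2)
vt$estimate  # sample variance ratio var(g05) / var(g2)
vt$p.value   # a large p-value gives no evidence against equal variances
```

If the assumption looked doubtful, dropping var.equal = TRUE from t.test() would give Welch's test, which does not require equal variances.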

test1 <- t.test(g2, g05, var.equal = TRUE, paired = FALSE)
test1$conf.int; test1$p.value
## [1] 12.83648 18.15352
## attr(,"conf.level")
## [1] 0.95
## [1] 2.837553e-14

The confidence interval does not include 0 and the p-value is very low, almost zero. We reject the null hypothesis. The data suggest that the tooth growth is larger for the dose of 2 mg/day than for the dose of 0.5 mg/day.
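The test above compares only the lowest and the highest dose. As a possible extension that is not part of the original analysis, all three dose pairs could be compared at once with a multiplicity correction. The sketch below assumes a Bonferroni adjustment and the default pooled standard deviation of pairwise.t.test().

```r
# Sketch: all pairwise dose comparisons with Bonferroni-adjusted p-values.
TG <- ToothGrowth

pw <- pairwise.t.test(TG$len, factor(TG$dose),
                      p.adjust.method = "bonferroni")

# Lower-triangular matrix of adjusted p-values (NA above the diagonal)
print(pw$p.value)
```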

Question two: is there a significant difference in tooth growth based on supplement?

Let’s define the null and alternative hypotheses: H0: the difference between the population mean tooth lengths for orange juice and Vitamin C is equal to zero. Ha: the difference between the population mean tooth lengths for orange juice and Vitamin C is not equal to zero.

Again, I will use a two-sample, two-sided t-test to test the hypothesis. The groups are not paired, as each guinea pig was given one supplement only. Let’s create two groups, gOJ for orange juice and gVC for Vitamin C, and check their variances.

gOJ <- TG$len[TG$supp == "OJ"]; gVC <- TG$len[TG$supp == "VC"]
varOJ <- round(sd(gOJ)^2, 2); varVC <- round(sd(gVC)^2, 2)

So the variance for the group with orange juice is 43.63 and the variance for the group with Vitamin C is 68.33. The ratio is around 1.6, so we assume they are equal. Let’s do our test then.

test2 <- t.test(gOJ, gVC, var.equal = TRUE, paired = FALSE)
test2$conf.int; test2$p.value
## [1] -0.1670064  7.5670064
## attr(,"conf.level")
## [1] 0.95
## [1] 0.06039337

The confidence interval includes 0 and the p-value of 0.06 is greater than 0.05. We fail to reject the null hypothesis at the 5% significance level (95% confidence level). The data do not provide sufficient evidence that the population mean tooth lengths for orange juice and Vitamin C differ.
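Since the variance ratio for the two supplement groups is larger than for the dose groups, a robustness check without the equal-variance assumption may be worthwhile. The sketch below runs Welch's t-test through the formula interface of t.test(); it is an addition to the original analysis, and the name welch is illustrative.

```r
# Sketch: Welch two-sample t-test (no equal-variance assumption).
TG <- ToothGrowth

# var.equal defaults to FALSE, which gives Welch's test; groups are OJ and VC
welch <- t.test(len ~ supp, data = TG)
welch$conf.int
welch$p.value
```

If this test leads to the same decision as the pooled-variance test, the conclusion does not hinge on the equal-variance assumption.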


Appendix

During the exploratory analysis I found that the supplement type has a different effect depending on the dose. Let’s take a look at the plot below.

library(lattice)
xyplot(len ~ supp | dose, TG, layout = c(3,1))

It is clearly visible that the tooth growth is larger for orange juice at the smaller doses; however, for the maximum dose used (2 mg/day, the last panel) the pattern starts changing!
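The pattern in the panels can also be read from the mean tooth length in each supplement-by-dose cell. This is a minimal sketch in base R; the name cell_means is illustrative.

```r
# Sketch: mean tooth length for every supplement/dose combination.
TG <- ToothGrowth

cell_means <- tapply(TG$len, list(supp = TG$supp, dose = TG$dose), mean)
print(round(cell_means, 2))

# Difference OJ minus VC within each dose level
print(round(cell_means["OJ", ] - cell_means["VC", ], 2))
```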

The summaries show the following:

with(TG, summary(len[dose == 2 & supp == "OJ"])); with(TG, summary(len[dose == 2 & supp == "VC"]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22.40   24.58   25.95   26.06   27.08   30.90
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.50   23.38   25.95   26.14   28.80   33.90

The median is exactly the same in both cases, the mean for Vitamin C is slightly higher, and the maximum value is again higher for Vitamin C. This probably explains why the null hypothesis that the difference between the mean tooth lengths for orange juice and Vitamin C is equal to zero was not rejected. We would probably need a test that accounts for both dose and supplement type in this case.
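One possible way to look at dose and supplement type together, which is not part of the original analysis, is a two-way model with an interaction term, followed by a separate supplement comparison within each dose. The sketch below treats dose as a categorical factor, which is an assumption; the names dose_f, p_interaction and p_by_dose are illustrative.

```r
# Sketch: two-way model with interaction, plus per-dose supplement tests.
TG <- ToothGrowth
TG$dose_f <- factor(TG$dose)  # assumption: dose treated as categorical

# Does the supplement effect depend on the dose? (interaction term)
fit <- lm(len ~ supp * dose_f, data = TG)
tab <- anova(fit)
print(tab)
p_interaction <- tab["supp:dose_f", "Pr(>F)"]

# Welch t-test of OJ vs VC within each dose level (unadjusted p-values)
p_by_dose <- sapply(split(TG, TG$dose), function(d) {
  t.test(len ~ supp, data = d)$p.value
})
print(p_by_dose)
```

A small interaction p-value would support the observation that the supplement effect changes with dose. The per-dose p-values are not corrected for multiple comparisons and should be read with that in mind.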