This second part of the project is aimed at testing and evaluating hypotheses based on data, using the statistical methods discussed in the course. The assignment refers to the ‘TootGrowth’ dataset described below. It investigates: (i) whether the delivery method makes a difference in the outcome; (ii) whether dosage makes a difference; and (iii) what “treatement(s)” appears to be the most effective. I first perform some Exploratory Data Analysis, after having properly processed the data, and then conduct formal statistical testing about differences in means, correcting for the number of the hypotheses tested. Findings suggest that Vitamin C dosage is linked with statistically significant gains in cells’ length, with no significant difference between delivery methods for the highest (and most effective) dose level.
The ‘TootGrowth’ dataset2 contains measurements3 on 60 guinea pigs in total. The response or outcome variable, coded as ‘len’, is the length of odontoblasts4. The covariates are ‘dose’, which is formally a numeric variable indicating vitamin C mg/day dosage given to the subjects, and ‘supp’, a categorical variable (factor) indicating delivery method for the drug, with two “levels” (orange juice, OJ, or ascorbic acid, VC). Since only 3 different doses were administered, it can be considered a factor variable as well. Thus, the data are effectively split into either 2 or 3 major groups, according to delivery method and dosage, respectively, for a total of 6 subsamples (dose-supp pairs).
The preliminary data processing is minimal, since there are no missing values in the TootGrowth dataset. The data manipulation necessary for the analysis (subsetting and grouping, for the most part) is carefully reported in the Appendix, where I also report a call to the ‘str’ function, a routinary general ‘summary’ call, and summaries by group.
A boxplot (Figure 1 in the Appendix) helps visualize information about the distribution of the response variable of interest, length, and aids intuition about the hypotheses I want to test, namely that there are differences in outcome based both on delivery method and dosage. In particular, it suggest that OJ might be better for low and medium dosages, while there is no clear-cut difference for high dosage (2 mg/day), and that higher dosages perform better. I assume we want to understand what is most effective in favoring growth; as such, I am interested in testing the following Null Hypotheses, for a total of 6 tests to be performed: whether a higher dose is associated with a statistically significant longer length (high vs medium and medium vs low dosages); whether there is an overall difference between the subjects given orange juice or ascorbic acid; and which supp-dose combination is the more effective (by comparing OJ-High to OJ-Medium, VC-High to VC-Medium, and testing whether OJ-High and VC-High are really different).
Normality A full array of Shapiro-Wilk normality test is reported in the Appendix. Following Royston (1995)5, since no test reported a p-value greater than 0.01, we fail to reject the Null Hypothesis; we thus assume normality of the data.
Variances If Welch tests are used, unequal variances are allowed. We need no further assumption on variances6.
Unpaired Since each animal received one of the three dosages by one of the two delivery methods, there is no overlapping between the (sub-)samples. In other words, the data are unpaired, and we consider the (sub-)samples as independent samples.
Methodology As such, Welch t-tests with unequal variances are performed, correcting for multiplicity.
In setups such as the present one, where there is no pre-determined target for the experiment/data analysis, and the outcome of interest can be compared in different ways, we are at risk of incurring into the Multiplicity Problem, an issue that arises whenever multiple comparisons are at hand. The more statistical hypotheses we test, the more likely we are to get an incorrect inference, purely by randomness. There are numerous methods to try to deal with this issue; as seen in class, I’ll apply the Bonferroni correction and the BH-adjustments to my tests’ p-values.
The classical approach, the Bonferroni correction, is directed towards control of the Family-Wise Error Rate (FWER) “in a strong sense”7. While reported for completeness, for the issue at hand the “BH” (Benjamini-Hochberg) approach8, which controls the False Discovery Rate (FDR), is more appropriate.
The table below reports the Welch t-test p-values, as well as the Bonferroni-corrected and the BH-adjusted ones9.
| Test | p-value | Bonferroni | BH-adjusted | |
|---|---|---|---|---|
| Dosage | High vs Medium | 0.00001906\(^{***}\) | 0.00011439\(^{***}\) | 0.00005719\(^{***}\) |
| Medium vs Low | 0.00000013\(^{***}\) | 0.00000076\(^{***}\) | 0.00000076\(^{***}\) | |
| Delivery method | OJ vs VC (overall) | 0.06063451\(^{*}\) | 0.36380705 | 0.07276141\(^{*}\) |
| Combinations | OJ-High vs OJ-Medium | 0.03919514\(^{**}\) | 0.23517085 | 0.05879271\(^{*}\) |
| VC-High vs VC-Medium | 0.00009156\(^{***}\) | 0.00054934\(^{***}\) | 0.00018311\(^{***}\) | |
| OJ-High vs VC-High | 0.96385159 | 1.00000000 | 0.96385159 |
From the table we see that the differences in mean length between dosages of vitamin C are both statistically significant at the 1% level10. The significance of the overall difference between mean length by delivery methods gets eaten away by the Bonferroni correction, and at most is significant at the 10% level. Differences between high and medium dose levels are significant even within delivery method subgroups, though only at 10% for the BH-corrected OJ one. As it seems intuitive from the boxplot, there is no real difference between delivery methods for the high dose level.
Results show that while delivery method might have some bearing on the observed differences in mean cells length, it is not that much statistically significant. On the other hand, the level of Vitamin C dosage (mg/day) leads to statistically significant differences, suggesting a direct relationship between dose and cells’ length. Each delivery method at high dosages “performed better” (greater mean lenght) than at medium dosage, in particular. There does not seem to be any statistically appreciable difference between the combinations with associated highest mean (and median) length, namely the Orange Juice-High Dosage and Ascorbic Acid (VC)-High Dosage subsamples.
library(datasets)
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
library(data.table)
dg <- ToothGrowth
setDT(dg)
# Summary of length 'len'...
by.supp <- dg[, as.list(summary(len)), by = supp] ; by.supp
## supp Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1: VC 4.2 11.200 16.5 16.96333 23.100 33.9
## 2: OJ 8.2 15.525 22.7 20.66333 25.725 30.9
by.dose <- dg[, as.list(summary(len)), by = dose] ; by.dose
## dose Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1: 0.5 4.2 7.225 9.85 10.605 12.250 21.5
## 2: 1.0 13.6 16.250 19.25 19.735 23.375 27.3
## 3: 2.0 18.5 23.525 25.95 26.100 27.825 33.9
by.both <- dg[, as.list(summary(len)), by = c("supp", "dose")] ; by.both
## supp dose Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1: VC 0.5 4.2 5.950 7.15 7.98 10.900 11.5
## 2: VC 1.0 13.6 15.275 16.50 16.77 17.300 22.5
## 3: VC 2.0 18.5 23.375 25.95 26.14 28.800 33.9
## 4: OJ 0.5 8.2 9.700 12.25 13.23 16.175 21.5
## 5: OJ 1.0 14.5 20.300 23.45 22.70 25.650 27.3
## 6: OJ 2.0 22.4 24.575 25.95 26.06 27.075 30.9
OJ.all <- ToothGrowth[ToothGrowth$supp=="OJ", ]
VC.all <- ToothGrowth[ToothGrowth$supp=="VC", ]
OJ.low <- ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==0.5, ]
VC.low <- ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==0.5, ]
OJ.med <- ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==1, ]
VC.med <- ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==1, ]
OJ.h <- ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==2, ]
VC.h <- ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==2, ]
OJ.lm <- ToothGrowth[ToothGrowth$supp=="OJ" &
(ToothGrowth$dose==0.5 | ToothGrowth$dose==1), ]
VC.lm <- ToothGrowth[ToothGrowth$supp=="VC" &
(ToothGrowth$dose==0.5 | ToothGrowth$dose==1), ]
low.all <- ToothGrowth[ToothGrowth$dose==0.5, ]
med.all <- ToothGrowth[ToothGrowth$dose==1, ]
high.all <- ToothGrowth[ToothGrowth$dose==2, ]
lm.all <- ToothGrowth[ToothGrowth$dose==0.5 | ToothGrowth$dose==1, ]
print("Please see full version on RPubs for the code")
## [1] "Please see full version on RPubs for the code"
# Overall
shap1.1 <- shapiro.test(ToothGrowth$len)
# Delivery method 'supp'
shap2.1 <- shapiro.test(OJ.all$len)
shap2.2 <- shapiro.test(VC.all$len)
# Dosage 'dose'
shap3.1 <- shapiro.test(low.all$len)
shap3.2 <- shapiro.test(med.all$len)
shap3.3 <- shapiro.test(high.all$len)
# 6 subsamples
shap4.1 <- shapiro.test(OJ.low$len)
shap4.2 <- shapiro.test(OJ.med$len)
shap4.3 <- shapiro.test(OJ.h$len)
shap4.4 <- shapiro.test(VC.low$len)
shap4.5 <- shapiro.test(VC.med$len)
shap4.6 <- shapiro.test(VC.h$len)
shapiro.results <- data.frame(stat = rep(0,12), pval = rep(0,12))
shapiro.results$stat <- rbind(shap1.1$statistic, shap2.1$statistic, shap2.2$statistic,
shap3.1$statistic, shap3.2$statistic, shap3.3$statistic,
shap4.1$statistic, shap4.2$statistic, shap4.3$statistic,
shap4.4$statistic, shap4.5$statistic, shap4.6$statistic)
shapiro.results$pval <- rbind(shap1.1$p.value, shap2.1$p.value, shap2.2$p.value,
shap3.1$p.value, shap3.2$p.value, shap3.3$p.value,
shap4.1$p.value, shap4.2$p.value, shap4.3$p.value,
shap4.4$p.value, shap4.5$p.value, shap4.6$p.value)
shapiro.cols <- c("W-statistics", "p-value")
shapiro.rows <- c("All sample", "OJ (all)", "VC (all)",
"Low dose", "Medium dose", "High dose",
"OJ-Low", "OJ-Medium", "OJ-High",
"VC-Low", "VC-Medium", "VC-High")
colnames(shapiro.results) <- shapiro.cols
rownames(shapiro.results) <- shapiro.rows
shapiro.results
## W p-value
## All sample 0.9674286 0.10910049
## OJ (all) 0.9178434 0.02358773
## VC (all) 0.9656715 0.42844646
## Low dose 0.9406451 0.24660149
## Medium dose 0.9313431 0.16388214
## High dose 0.9777535 0.90191150
## OJ-Low 0.8927435 0.18204079
## OJ-Medium 0.9266004 0.41529828
## OJ-High 0.9625752 0.81479083
## VC-Low 0.8899969 0.16956273
## VC-Medium 0.9083405 0.26978551
## VC-High 0.9732764 0.91944968
library(ggpubr)
## Loading required package: ggplot2
highVSmed <- t.test(x = high.all$len, y = med.all$len,
alternative = "two.sided", var.equal = FALSE)
medVSlow <- t.test(x = med.all$len, y = low.all$len,
alternative = "two.sided", var.equal = FALSE)
overall <- t.test(x = OJ.all$len, y = VC.all$len,
alternative = "two.sided", var.equal = FALSE)
lm.OJvsVC <- t.test(x = OJ.lm$len, y = VC.lm$len,
alternative = "two.sided", var.equal = FALSE)
OJ.highVSmed <- t.test(x = OJ.h$len, y = OJ.med$len,
alternative = "two.sided", var.equal = FALSE)
VC.highVSmed <- t.test(x = VC.h$len, y = VC.med$len,
alternative = "two.sided", var.equal = FALSE)
high.OJvsVC <- t.test(x = OJ.h$len, y = VC.h$len,
alternative = "two.sided", var.equal = FALSE)
pvals <- c(highVSmed$p.value, medVSlow$p.value, overall$p.value,
OJ.highVSmed$p.value, VC.highVSmed$p.value, high.OJvsVC$p.value)
round(pvals, 8)
## [1] 0.00001906 0.00000013 0.06063451 0.03919514 0.00009156 0.96385159
# Bonferroni
bonf <- p.adjust(pvals, method = "bonferroni")
round(bonf, 8)
## [1] 0.00011439 0.00000076 0.36380705 0.23517085 0.00054934 1.00000000
# Benjamini-Hochberg
bh.pvals <- p.adjust(pvals, method = "BH")
round(bh.pvals, 8)
## [1] 0.00005719 0.00000076 0.07276141 0.05879271 0.00018311 0.96385159
print("Test for overall difference in means between High and Medium dosage")
## [1] "Test for overall difference in means between High and Medium dosage"
highVSmed
##
## Welch Two Sample t-test
##
## data: high.all$len and med.all$len
## t = 4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.733519 8.996481
## sample estimates:
## mean of x mean of y
## 26.100 19.735
print("Test for overall difference in means between Medium and Low dosage")
## [1] "Test for overall difference in means between Medium and Low dosage"
medVSlow
##
## Welch Two Sample t-test
##
## data: med.all$len and low.all$len
## t = 6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.276219 11.983781
## sample estimates:
## mean of x mean of y
## 19.735 10.605
print("Test for overall difference in means by delivery method: OJ vs VC")
## [1] "Test for overall difference in means by delivery method: OJ vs VC"
overall
##
## Welch Two Sample t-test
##
## data: OJ.all$len and VC.all$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
# print("Test for difference in means for low-medium dosage (OJ vs VC)")
# lm.OJvsVC
print("Test for difference in means for OJ-High vs OJ-Medium")
## [1] "Test for difference in means for OJ-High vs OJ-Medium"
OJ.highVSmed
##
## Welch Two Sample t-test
##
## data: OJ.h$len and OJ.med$len
## t = 2.2478, df = 15.842, p-value = 0.0392
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1885575 6.5314425
## sample estimates:
## mean of x mean of y
## 26.06 22.70
print("Test for difference in means for VC-High vs VC-Medium")
## [1] "Test for difference in means for VC-High vs VC-Medium"
VC.highVSmed
##
## Welch Two Sample t-test
##
## data: VC.h$len and VC.med$len
## t = 5.4698, df = 13.6, p-value = 9.156e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.685733 13.054267
## sample estimates:
## mean of x mean of y
## 26.14 16.77
print("Test for difference in means for High dosage, OJ vs VC")
## [1] "Test for difference in means for High dosage, OJ vs VC"
high.OJvsVC
##
## Welch Two Sample t-test
##
## data: OJ.h$len and VC.h$len
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean of x mean of y
## 26.06 26.14
library(ggplot2)
library(RColorBrewer)
CBfriendly <- c("#E69F00", "#56B4E9") # We should all try to make colorblindness friendliness the default
gg <- ggplot(data = ToothGrowth,
aes(x = as.factor(dose), y = len, fill = supp)) +
scale_fill_manual(values = CBfriendly) +
geom_boxplot() +
xlab("Vitamin C Dose (mg/day)") +
ylab("Tooth Cells Length") +
ggtitle("Boxplot of Tooth Cells Length by Delivery Method & Dose") +
theme_light() +
theme(legend.position = "top",
legend.title = element_text(color = "darkgreen", size = 14),
legend.text = element_text(color = "darkred", size = 14)
) +
theme(plot.title = element_text(size = 16, hjust = 0.5)) +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12))
gg
Please note that blank space and division in sections were used to make the report more readable. Moreover, the Appendix shows some code chunks and extensive output for clarity and reproducibility. Please do factor these: overall, length requirements are (more or less) met!↩︎
See ?TootGrowth (R help page) for further details.↩︎
See first chunk in the Appendix.↩︎
Henceforth “tooth cells”, for simplicity. I apologize to biologists, vets & doctors alike for this.↩︎
Royston, P. (1995), Remark AS R94: A Remark on Algorithm AS 181: The W test for Normality, Applied Statistics, 44, 547–551. doi: 10.2307/2986146.↩︎
Just eyeballing the data already makes it difficult to argue for equal variances. For the sake of completeness, we should perform a F-test for equality of variances. Paraphrasing Fermat, I have a truly marvelous output of this, which this margin is too narrow to contain↩︎
See Benjamini, Y. & Hochberg, Y. (1995), Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing, Journal of the Royal Statistical Society, Series B (Methodological), Vol 57 No. 1, 289-300. The authors specify that “control of the FWER is important when a conclusion from the various individual inferences is likely to be erroneous when at least one of them is”. The Bonferroni correction is the most conservative approach, thus.↩︎
ibidem. The “BH” approach has more power, as the authors show.↩︎
Legend for statistical significance levels: \(^{***}\) 1%, \(^{**}\) 5%, \(^{*}\) 10%↩︎
They’re actually significant up to the 0.0001% level!↩︎