In this paper we perform a basic exploratory data analysis of the R dataset ToothGrowth, included in the R package datasets, which collects data on the effect of Vitamin C on the length of guinea pigs’ odontoblast cells, which are responsible for tooth growth. The analyzed data show an association between the size of the Vitamin C dose consumed by the guinea pigs and the length of the odontoblast cells, regardless of the delivery method.
We are going to use the following packages:
library(ggplot2); library(gridExtra); library(pander); library(xtable)
We load the dataset and take a look at its structure.
data("ToothGrowth"); str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
The dataset is a data frame with 3 variables and 60 observations. The variables included in this dataset are:
- $len: length of odontoblast cells, responsible for tooth growth, in microns. Numeric.
- $supp: delivery method: orange juice (OJ) or ascorbic acid (VC). Factor.
- $dose: daily milligrams received by the guinea pigs: 0.5, 1 or 2 mg/day. Numeric.
We transform $dose into a factor, because we will not use the numerical properties of the different dose sizes.
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
Let’s see how the samples are distributed across the levels of the factors.
We have six different experimental treatments, the result of combining the levels of $supp and $dose, each with the same sample size (\(n = 10\)). This is called a balanced two-factor factorial design. It is important to note, for the purposes of the analysis that we will carry out, that the samples of each experimental treatment are independent of each other.
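The cell counts behind this statement can be checked with a simple cross-tabulation. The following is a minimal sketch; the names `tg`, `cell_counts` and `is_balanced` are ours and do not appear in the original analysis.

```r
# Work on a private copy so the ToothGrowth object used elsewhere is untouched
tg <- datasets::ToothGrowth
tg$dose <- as.factor(tg$dose)

# Rows: delivery method, columns: dose; each cell counts the animals in a treatment
cell_counts <- table(tg$supp, tg$dose)
print(cell_counts)

# The design is balanced when every treatment has the same number of observations
is_balanced <- length(unique(as.vector(cell_counts))) == 1
```

If `is_balanced` is `TRUE` and the table has two rows and three columns, we have the six equally sized treatments described above.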
Let’s see a summary chart that shows how the length of odontoblast cells varies across $supp and $dose. The bottom and top of the boxes in the plot are the first and third quartiles, the line inside the box is the median, and the mean is indicated numerically and marked with a dash.
This plot shows the main effects of the factors, that is, the changes in the length of the odontoblast cells associated with a change in the level of delivery method or dose size.
It seems that the average length of odontoblast cells is slightly higher when the delivery method is orange juice. On the other hand, the relation between $len and $dose seems clearer: the higher the dose, the greater the length of the cells.
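The averages that the plot annotates can also be computed directly. This is a small sketch using base R's `tapply()`; the object names (`tg`, `mean_by_supp`, `mean_by_dose`) are our own choices for illustration.

```r
# Private copy of the data with dose as a factor, as in the analysis
tg <- datasets::ToothGrowth
tg$dose <- as.factor(tg$dose)

# Mean cell length for each level of one factor, ignoring the other factor
mean_by_supp <- tapply(tg$len, tg$supp, mean)
mean_by_dose <- tapply(tg$len, tg$dose, mean)

print(round(mean_by_supp, 2))
print(round(mean_by_dose, 2))
```

Comparing the two vectors gives a numerical view of the main effects: a modest gap between delivery methods and a steady increase with dose.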
Let’s now see whether the average variations that we observed across the levels of one factor are maintained at each level of the second factor.
In this plot we can see whether the variations in length across the levels of one factor are the same at all levels of the other factor. We are looking for an interaction between the factors.
With regard to the method of delivery, orange juice seems to be more effective than ascorbic acid at doses of 0.5 or 1 mg/day. When the dose reaches 2 mg/day, there seems to be no difference between the effects of the two methods. From the point of view of the dose, it seems that an increase in dosage is always associated with an increase in the length of the odontoblast cells, although in the case of orange juice increasing the dose from 1 to 2 mg/day does not seem to have a big effect.
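One way to inspect this pattern numerically is to tabulate the mean of each treatment and take the difference between delivery methods at every dose. The sketch below is illustrative only; `tg`, `cell_means` and `oj_minus_vc` are names we introduce here.

```r
# Private copy of the data with dose as a factor
tg <- datasets::ToothGrowth
tg$dose <- as.factor(tg$dose)

# 2 x 3 matrix of treatment means: rows are delivery methods, columns are doses
cell_means <- tapply(tg$len, list(tg$supp, tg$dose), mean)
print(round(cell_means, 2))

# Advantage of orange juice over ascorbic acid at each dose;
# roughly equal values across doses would suggest no interaction
oj_minus_vc <- cell_means["OJ", ] - cell_means["VC", ]
print(round(oj_minus_vc, 2))
```

A difference that shrinks towards zero at the highest dose is what the plot suggests visually.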
To check whether the differences we’ve seen in the plots above are significant, we run several t-tests, one for each pair of levels. So on the $supp variable, which has two levels, we perform a single t-test, and on $dose, which has three levels, we have to perform three t-tests. We also want to explore the interaction effects between factors, so we have to run the t-test on the $supp variable for each level of $dose, and the three t-tests on the $dose variable for each level of $supp.
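As an illustration of a single comparison, the sketch below runs the test for the main effect of delivery method and pulls out the two quantities we report. It assumes Welch's unequal-variance test (`var.equal = FALSE`); the names `welch_supp`, `p_value` and `ci` are ours.

```r
# Private copy of the data
tg <- datasets::ToothGrowth

# Welch two-sample t-test of cell length between the two delivery methods
welch_supp <- t.test(len ~ supp, data = tg, var.equal = FALSE)

# p-value and 95% confidence interval for the difference in means (OJ minus VC)
p_value <- welch_supp$p.value
ci <- welch_supp$conf.int
print(round(c(p = p_value, lower = ci[1], upper = ci[2]), 3))
```

The remaining comparisons follow the same pattern, applied to the relevant subset of the data.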
The following table shows the p-value and the 95% confidence interval for the mean difference between treatments. Because the vast majority of the tests are significant, only the t-tests that fail to reject the null hypothesis (that the difference between the compared means is zero) are highlighted, in italics.
| Comparison | 2nd factor level | P value | 95% CI lower | 95% CI upper |
|---|---|---|---|---|
| **Main effects** | | | | |
| *Orange juice vs Ascorbic acid* | | *0.061* | *-0.171* | *7.571* |
| 0.5 mg/day vs 1 mg/day | | 0.000 | -11.984 | -6.276 |
| 0.5 mg/day vs 2 mg/day | | 0.000 | -18.156 | -12.834 |
| 1 mg/day vs 2 mg/day | | 0.000 | -8.996 | -3.734 |
| **Interaction** | | | | |
| Orange juice vs Ascorbic acid | 0.5 mg/day | 0.006 | 1.719 | 8.781 |
| Orange juice vs Ascorbic acid | 1 mg/day | 0.001 | 2.802 | 9.058 |
| *Orange juice vs Ascorbic acid* | *2 mg/day* | *0.964* | *-3.798* | *3.638* |
| 0.5 mg/day vs 1 mg/day | Orange juice | 0.000 | -13.416 | -5.524 |
| 0.5 mg/day vs 2 mg/day | Orange juice | 0.000 | -16.335 | -9.325 |
| 1 mg/day vs 2 mg/day | Orange juice | 0.039 | -6.531 | -0.189 |
| 0.5 mg/day vs 1 mg/day | Ascorbic acid | 0.000 | -11.266 | -6.314 |
| 0.5 mg/day vs 2 mg/day | Ascorbic acid | 0.000 | -21.902 | -14.418 |
| 1 mg/day vs 2 mg/day | Ascorbic acid | 0.000 | -13.054 | -5.686 |
The data suggest that both delivery method and dosage have an effect on the length of the odontoblast cells. Overall, although the effects of orange juice seem somewhat greater than those of ascorbic acid, the t-test does not yield a significant result at \(\alpha = 0.05\): the p-value is \(0.061\) and the 95% confidence interval includes zero. However, if we analyze the difference between the two methods depending on the dose delivered, we observe significant differences in favor of orange juice at doses of 0.5 and 1 mg/day. At a dose of 2 mg/day no difference between orange juice and ascorbic acid is observed. In this case, therefore, we can say that there is an interaction between the two factors.
On the other hand, the results of the t-tests show a clear relationship between dose and the length of the cells, regardless of delivery method.
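The dose-specific comparison of delivery methods can be reproduced compactly by looping over the doses. This is a minimal sketch under the same Welch t-test assumption; the names `tg`, `dose_levels`, `p_by_dose` and `significant` are ours.

```r
# Private copy of the data; dose is kept numeric here for simple subsetting
tg <- datasets::ToothGrowth
dose_levels <- sort(unique(tg$dose))

# One Welch t-test of orange juice vs ascorbic acid within each dose
p_by_dose <- sapply(dose_levels, function(d) {
  t.test(len ~ supp, data = tg[tg$dose == d, ], var.equal = FALSE)$p.value
})
names(p_by_dose) <- paste(dose_levels, "mg/day")
print(round(p_by_dose, 3))

# Flag the doses at which the two methods differ at alpha = 0.05
significant <- p_by_dose < 0.05
```

A flag that changes from one dose to another is consistent with the interaction described above, although it is not by itself a formal test of interaction.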
Here you can find the code that was used to build the plots and tables.
Table 1 - Frequency table for Delivery method and Dose
# Extra LaTeX header rows, both placed above the table body (position 0)
addtorow <- list()
addtorow$pos <- list(0, 0)
addtorow$command <- c("& \\multicolumn{3}{c}{Dose} \\\\\n",
                      "Delivery method & 0.5 mg/day & 1 mg/day & 2 mg/day \\\\\n")
freqtable <- xtable(table(ToothGrowth$supp, ToothGrowth$dose),
                    caption = "Frequency table for Delivery method and Dose")
print(freqtable, add.to.row = addtorow, include.colnames = FALSE, comment = FALSE,
      caption.placement = "top")
Table 2 - T-test table
### Run t-tests
# Welch two-sample t-tests. Unpaired samples are the t.test() default; the
# `paired` argument is not passed because recent R releases reject it in the
# formula interface.
# Main effects: delivery method, then each pair of doses
ttest1 <- t.test(len ~ supp, var.equal = FALSE, data = ToothGrowth)
ttest2 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(0.5, 1)))
ttest3 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(0.5, 2)))
ttest4 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(1, 2)))
# Interaction: delivery method within each dose
ttest5 <- t.test(len ~ supp, var.equal = FALSE, data = subset(ToothGrowth, dose == 0.5))
ttest6 <- t.test(len ~ supp, var.equal = FALSE, data = subset(ToothGrowth, dose == 1))
ttest7 <- t.test(len ~ supp, var.equal = FALSE, data = subset(ToothGrowth, dose == 2))
# Interaction: each pair of doses within orange juice
ttest8 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(0.5, 1) & supp == "OJ"))
ttest9 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(0.5, 2) & supp == "OJ"))
ttest10 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(1, 2) & supp == "OJ"))
# Interaction: each pair of doses within ascorbic acid
ttest11 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(0.5, 1) & supp == "VC"))
ttest12 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(0.5, 2) & supp == "VC"))
ttest13 <- t.test(len ~ dose, var.equal = FALSE, data = subset(ToothGrowth, dose %in% c(1, 2) & supp == "VC"))
### Extract values
# Apply `stat` to each of the 13 test objects and format the result to three decimals
extract <- function(stat) {
  vapply(1:13, function(i) sprintf("%.3f", round(stat(get(paste0("ttest", i))), 3)),
         character(1))
}
pvalue <- extract(function(tt) tt$p.value)
confintlow <- extract(function(tt) tt$conf.int[1])
confintupp <- extract(function(tt) tt$conf.int[2])
### Add information about test
# Insert empty cells for the two section rows ("Main effects" and "Interaction")
pvalue <- append(append(pvalue, NA, 0), NA, 5)
confintlow <- append(append(confintlow, NA, 0), NA, 5)
confintupp <- append(append(confintupp, NA, 0), NA, 5)
ttestint <- c(rep("", 6), "0.5 mg/day", "1 mg/day", "2 mg/day",
rep("Orange juice", 3), rep("Ascorbic acid", 3))
ttestpair <- c("Main effects", "Orange juice vs Ascorbic acid",
"0.5 mg/day vs 1 mg/day", "0.5 mg/day vs 2 mg/day", "1 mg/day vs 2 mg/day",
"Interaction", rep("Orange juice vs Ascorbic acid", 3), rep(c("0.5 mg/day vs 1 mg/day",
"0.5 mg/day vs 2 mg/day", "1 mg/day vs 2 mg/day"), 2))
ttestdf <- data.frame(ttestpair, ttestint, pvalue, confintlow,
confintupp)
colnames(ttestdf) <- c("**Comparison**", "**2nd factor level**",
                       "**P.value**", "**Conf.int.lower**", "**Conf.int.upper**")
### Table 2
# Bold the two section rows
emphasize.strong.rows(c(1, 6))
# Italicize the non-significant tests: row 2 (delivery method, main effect)
# and row 9 (delivery method at 2 mg/day)
emphasize.italics.rows(c(2, 9))
pandoc.table(ttestdf, style = "simple", round = c(0, 0, 3, 3, 3), split.table = Inf,
             justify = c("left", "left", "right", "right", "right"), missing = "",
             caption = "_T-tests performed with their corresponding p values and 95% confidence intervals._")
Figure 1 - Main effects plots
# Note: `fun` and after_stat() replace the deprecated `fun.y` and `..y..` (ggplot2 >= 3.3.0)
plot1 <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
  geom_boxplot() +
  ggtitle("Length of odontoblast cells\nby delivery method") +
  xlab("Delivery method") + ylab("Length of odontoblast cells") +
  scale_x_discrete(labels = c("Orange juice", "Ascorbic acid")) +
  scale_y_continuous(breaks = c(10, 20, 30),
                     labels = c(expression(10 ~ mu * m), expression(20 ~ mu * m), expression(30 ~ mu * m))) +
  theme(plot.title = element_text(size = 9, face = "bold"),
        axis.text = element_text(size = 7), axis.title = element_text(size = 9),
        legend.position = "none") +
  # Mark each group mean with a dash (shape 45) and print its value above it
  stat_summary(fun = mean, colour = "black", geom = "point", shape = 45, size = 5) +
  stat_summary(fun = mean, colour = "black", geom = "text", vjust = -0.8, size = 2,
               aes(label = round(after_stat(y), digits = 2)))
plot2 <- ggplot(ToothGrowth, aes(x = dose, y = len, fill = dose)) +
  geom_boxplot() +
  ggtitle("Length of odontoblast cells\nby dose size") +
  xlab("Dose size") + ylab(NULL) +
  scale_x_discrete(labels = c("0.5 mg/day", "1 mg/day", "2 mg/day")) +
  scale_y_continuous(breaks = c(10, 20, 30),
                     labels = c(expression(10 ~ mu * m), expression(20 ~ mu * m), expression(30 ~ mu * m))) +
  theme(plot.title = element_text(size = 9, face = "bold"),
        axis.text = element_text(size = 7), axis.title = element_text(size = 9),
        legend.position = "none") +
  stat_summary(fun = mean, colour = "black", geom = "point", shape = 45, size = 5) +
  stat_summary(fun = mean, colour = "black", geom = "text", vjust = -0.8, size = 2,
               aes(label = round(after_stat(y), digits = 2)))
grid.arrange(plot1, plot2, ncol = 2)
Figure 2 - Interaction effects plots
# Note: `fun` and after_stat() replace the deprecated `fun.y` and `..y..` (ggplot2 >= 3.3.0)
dose_labels <- c(`0.5` = "0.5 mg/day", `1` = "1 mg/day", `2` = "2 mg/day")
plot3 <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp)) +
  geom_boxplot() +
  ggtitle("Length by delivery method,\naccording to dose") +
  xlab("Delivery method") + ylab("Length of odontoblast cells") +
  scale_x_discrete(labels = c("Orange\njuice", "Ascorbic\nacid")) +
  scale_y_continuous(breaks = c(10, 20, 30),
                     labels = c(expression(10 ~ mu * m), expression(20 ~ mu * m), expression(30 ~ mu * m))) +
  theme(plot.title = element_text(size = 9, face = "bold"),
        axis.text = element_text(size = 7), axis.title = element_text(size = 9),
        legend.position = "none") +
  # One panel per dose
  facet_grid(. ~ dose, labeller = as_labeller(dose_labels)) +
  stat_summary(fun = mean, colour = "black", geom = "point", shape = 45, size = 5) +
  stat_summary(fun = mean, colour = "black", geom = "text", vjust = -0.8, size = 2,
               aes(label = round(after_stat(y), digits = 2)))
supp_labels <- c(OJ = "Orange juice", VC = "Ascorbic acid")
plot4 <- ggplot(ToothGrowth, aes(x = dose, y = len, fill = dose)) +
  geom_boxplot() +
  ggtitle("Length by dose,\naccording to delivery method") +
  xlab("Dose size") + ylab(NULL) +
  scale_x_discrete(labels = c("0.5 mg/day", "1 mg/day", "2 mg/day")) +
  scale_y_continuous(breaks = c(10, 20, 30),
                     labels = c(expression(10 ~ mu * m), expression(20 ~ mu * m), expression(30 ~ mu * m))) +
  theme(plot.title = element_text(size = 9, face = "bold"),
        axis.text = element_text(size = 7), axis.title = element_text(size = 9),
        legend.position = "none") +
  # One panel per delivery method
  facet_grid(. ~ supp, labeller = as_labeller(supp_labels)) +
  stat_summary(fun = mean, colour = "black", geom = "point", shape = 45, size = 5) +
  stat_summary(fun = mean, colour = "black", geom = "text", vjust = -0.8, size = 2,
               aes(label = round(after_stat(y), digits = 2)))
grid.arrange(plot3, plot4, ncol = 2)