A FIRST LOOK AT THE ToothGrowth DATASET.

1. OVERVIEW

The database “ToothGrgowth”, The Effect of Vitamin C on Tooth Growth in Guinea Pigs, contains the odontoblast lengths for 60 guinea pigs. Each animal received one of three doses of vitamin C, through one of two delivery methods (Bliss, 1952). We will see this in detail.

2. LOAD AND QUICK EXPLORATION

Let’s load the database and make a first visual exploration of the behavior of its variables and their possible relationships.

There appear to be greater tooth growths with orange juice (OJ) than with ascorbic acid (VC). And the higher the dose is given, the greater the growth.

3. BASIC SUMMARY

Let’s now look at the descriptive statistics for each of the levels of these two factors.

*Summary of Tooth length by dose and by supply method*
Item	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
supp = OJ	8.2	15.53	22.7	20.66	25.73	30.9
supp = VC	4.2	11.2	16.5	16.96	23.1	33.9
dose = 0.5	4.2	7.225	9.85	10.61	12.25	21.5
dose = 1.0	13.6	16.25	19.25	19.73	23.38	27.3
dose = 2.0	18.5	23.52	25.95	26.1	27.82	33.9

Indeed, with orange juice (OJ) there were higher growths than with ascorbic acid (VC), and the distribution of growth lengths is shifted towards higher values. With the doses, an even displacement of the length distribution is observed, towards higher values with increasing doses.

4. INSIDE COMPARISONS

Let’s compare these trends in more detail, and look at the possible interactions between the two factors.

The first four graphs have basically already been described. The last three graphs allow us to observe that there is no marked effect of interaction between the two factors, except perhaps for the fact that with the maximum dose (2) there is no difference between the two delivery methods.

Let’s check this by applying hypothesis tests on all the possible pairs of comparisons that the box plots suggest make sense. Between doses 0.5 and 1, and between 1 and 2. Between the two delivery methods. And between the four interaction: the two comparisons changing the dose for orange juice, and both for ascorbic acid.

*Summary of paired comparisons, with p-adjusted values, Benjamini-Hochberg (FDR) method*
Item	LI	LS	Pval	Var1m	Var2m	Pval.Adj
.5 - 1	6.28	11.98	0	10.61	19.74	0
1 - 2	3.73	9	0	19.74	26.1	0
OJ - VC	-7.57	0.17	0.06	20.66	16.96	0.06
OJ.5 - OJ1	5.52	13.42	0	13.23	22.7	0
OJ1 - OJ2	0.19	6.53	0.04	22.7	26.06	0.05
VC.5 - VC1	6.31	11.27	0	7.98	16.77	0
VC1 - VC2	5.69	13.05	0	16.77	26.14	0

There is only one comparison that is not significant, there is no overall difference between the two delivery methods. We note that the other six comparisons appear to be statistically significant initially, but by correcting to avoid the false discovery rate, there is one (going from dose 1 to dose 2 with orange juice) that is no longer significant.

5. MY CONCLUSIONS

With paired comparisons, good information can be obtained by analyzing data sets, while considering the adjustments that avoid false discoveries.

BIBLIOGRAPHY

Bliss, C. I. 1952. The Statistics of Bioassays. With special reference to the vitamins. Academic Press, USA. 628 pp.

APENDIX

##======================= This is All the Code ======================

#---< Packages uses
library(GGally)
library(pander)

#---< Load the data & Preprocess
dataTG      <- ToothGrowth
str(dataTG)
dataTG$dose <- as.factor(dataTG$dose)

#---< EDA
ggpairs(dataTG)

#---< Distribution Summary
df <- rbind(
     summary(subset(dataTG, supp=="OJ" )$len), 
     summary(subset(dataTG, supp=="VC" )$len), 
     summary(subset(dataTG, dose=="0.5")$len), 
     summary(subset(dataTG, dose=="1"  )$len), 
     summary(subset(dataTG, dose=="2"  )$len))
df <- cbind(Item=c("supp = OJ",  "supp = VC", 
                   "dose = 0.5", "dose = 1.0", "dose = 2.0"),
            as.data.frame(df))
pander(df, round=3, split.tables=120, 
       caption="*Summary of Tooth length by dose and by supply method*")

#---> Comparisons- Graphics: Factors
par(mfrow=c(1, 2))
ggplot(dataTG, aes(y=len, x=dose, fill=dose)) + geom_boxplot() + 
        ylab("Tooth length") + xlab("DOSE delivered") + 
        ggtitle("Tooth growth by DOSE delivered") +
        stat_summary(fun=mean, geom="point", shape=7, size=2) +
        theme(legend.position="none")

ggplot(dataTG, aes(x=len, color=dose)) +
        geom_histogram(alpha=.8, binwidth=2, position="dodge") +
        facet_grid((dose~.)) + theme(legend.position="none")

par(mfrow=c(1, 2))
ggplot(dataTG, aes(y=len, x=supp, fill=supp)) + geom_boxplot() + 
        ylab("Tooth length") + xlab("Delivery METHOD") + 
        ggtitle("Tooth growth by Delivery METHOD") +
        stat_summary(fun=mean, geom="point", shape=7, size=2) +
        theme_bw() + theme(legend.position="none")

ggplot(dataTG, aes(x=len, color=supp)) +
        geom_histogram(alpha=.8, binwidth=2, position="dodge") +
        facet_grid((supp~.)) + theme(legend.position="none")

#---< Comparisons- Graphics: Interactions
par(mfrow=c(2, 1))
ggplot(dataTG, aes(y=len, x=dose, fill=dose)) + geom_boxplot() +
     facet_wrap(~ supp, ncol=3) + ylab("Tooth length") + 
     xlab("DOSE delivered") + 
     ggtitle("Tooth growth by METHOD and DOSE") +
     stat_summary(fun=mean, geom="point", shape=7, size=2) +
     theme_bw() + theme(legend.position="none")

ggplot(dataTG, aes(y=len, x=supp, fill=supp)) + geom_boxplot() +
     facet_wrap(~ dose, ncol=3) + ylab("Tooth length") + 
     xlab("Delivery METHOD") + 
     ggtitle("Tooth growth by DOSE and METHOD") +
     stat_summary(fun=mean, geom="point", shape=7, size=2) +
     theme_bw() + theme(legend.position="none")

par(mfrow=c(1, 1))
ggplot(dataTG, aes(x=len, color=supp)) +
        geom_histogram(alpha=.8, binwidth=2, position="dodge") +
        facet_grid((dose~supp)) + theme(legend.position="none")

# Subsets
d1 <- subset(dataTG, dose=="0.5")$len
d2 <- subset(dataTG, dose=="1"  )$len
d3 <- subset(dataTG, dose=="2"  )$len

s1 <- subset(dataTG, supp=="OJ" )$len
s2 <- subset(dataTG, supp=="VC" )$len

s1d1 <- subset(dataTG, supp=="OJ" & dose=="0.5")$len
s1d2 <- subset(dataTG, supp=="OJ" & dose=="1"  )$len
s1d3 <- subset(dataTG, supp=="OJ" & dose=="2"  )$len
s2d1 <- subset(dataTG, supp=="VC" & dose=="0.5")$len
s2d2 <- subset(dataTG, supp=="VC" & dose=="1"  )$len
s2d3 <- subset(dataTG, supp=="VC" & dose=="2"  )$len

# Confidence interval and p-Value function for comparison
CI.pV <- function(set1, set2, alter="two.side", var.eq=F) {
        tmp <- t.test(set2, set1, alternative=alter, var.equal=var.eq)
        li  <- tmp$conf.int[1]
        ls  <- tmp$conf.int[2]
        pv  <- tmp$p.value
        eX  <- tmp$estimate[2]
        eY  <- tmp$estimate[1]
        #---
        return(data.frame(LI=li, LS=ls, Pval=pv, Var1m=eX, Var2m=eY))
}

df <- rbind(
# Comparisons- Tests: Factors
CI.pV(d1, d2),  # 1. DOSES: .5 - 1
CI.pV(d2, d3),  # 2. DOSES: 1 - 2
CI.pV(s1, s2),  # 3- SUPPLAY: OJ - VC

# Comparisons- Test: Interactions
CI.pV(s1d1, s1d2),  # 4. (OJ-.5) - (OJ-1)
CI.pV(s1d2, s1d3),  # 5. (OJ-1) - (OJ-2)
CI.pV(s2d1, s2d2),  # 6. (VC-.5) - (VC-1)
CI.pV(s2d2, s2d3))  # 7. (VC-1) - (VC-2)

df <- cbind(Item=c(".5 - 1", "1 - 2", "OJ - VC", 
                   "OJ.5 - OJ1", "OJ1 - OJ2", "VC.5 - VC1", "VC1 - VC2"), df)
row.names(df) <- NULL
df$Pval.Adj   <- p.adjust(df$Pval, method="BH")
pander(df, round=2, caption="*Summary of paired comparisons, with p-adjusted values, Benjamini-Hochberg (FDR) method*")

#====================================================================