This report tries to answer the following question: do delivery method and/or dose affect tooth growth in guinea pigs? The question is examined using hypothesis tests and confidence intervals.
The analysis will be based on the R ‘ToothGrowth’ dataset.
The R help is a bit confusing about the number of guinea pigs and whether there are any groups. A quick Internet search yields the following result:
“The Crampton paper makes it clear that these data are 60 distinct guinea pigs, as odontoblasts measurements were taken under microscope for each guinea pig after the guinea pigs were sacrificed and has their teeth removed.
Perhaps the ToothGrowth description could be modified to read ‘The response is the length of odontoblasts (teeth) in each of 60 guinea pigs, 10 for each combination of dose level of Vitamin C (0.5, 1, and 2 mg) and delivery method (orange juice or ascorbic acid)’.”
Taken from bugs.r-project.org
First some basic information about the dataset:
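# First load the data (dplyr provides tbl_df, filter, group_by and %>%)
library(dplyr)
data(ToothGrowth)
# tbl_df() is deprecated in recent dplyr versions; as_tibble() is the current equivalent
data <- tbl_df(ToothGrowth)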
# What are the dimensions
dim(ToothGrowth)
## [1] 60 3
# What are the variables
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
We can break the data down by dose and supplement type (the plot is adapted from the R help page):
par(oma=c(0,0,3,0))
coplot(len ~ dose | supp, data = ToothGrowth, panel = panel.smooth,
ylab = c("Tooth growth"),
xlab = c("dose",""))
title(main="ToothGrowth data: length vs dose, given type of supplement", outer=TRUE)
Orange juice seems to be the more effective delivery method, but is the difference significant? Tooth length also seems to increase with dose, but is that significant? In addition, the variance appears to differ between groups.
group_by(data, supp, dose) %>% summarise(variance=var(len))
## Source: local data frame [6 x 3]
## Groups: supp [?]
##
## supp dose variance
## (fctr) (dbl) (dbl)
## 1 OJ 0.5 19.889000
## 2 OJ 1.0 15.295556
## 3 OJ 2.0 7.049333
## 4 VC 0.5 7.544000
## 5 VC 1.0 6.326778
## 6 VC 2.0 23.018222
Now that we have visualized the data, we can make the following assumptions: the 60 guinea pigs are distinct, so the groups are independent and the tests are unpaired; and the variances differ between groups, so equal variances are not assumed.
Furthermore, we assume that the underlying population for each question is symmetric about its mean and of Gaussian shape (not skewed). For example, if we could give orange juice to every guinea pig, the resulting tooth lengths would follow a balanced, Gaussian-shaped distribution. Finally, because the groups are small, t-tests are used, which is more conservative than a normal approximation.
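The Gaussian assumption can be checked informally. The sketch below is not part of the original analysis: it assumes that a Shapiro-Wilk test per supplement/dose cell is an acceptable rough check, and the names `cells` and `shapiro_p` are illustrative. With only 10 animals per cell the test has little power, so a large p-value is weak evidence at best.

```r
# Rough check of the Gaussian assumption (an illustration, not the report's method)
data(ToothGrowth)

# Split tooth lengths into the six supplement/dose cells (10 animals each)
cells <- split(ToothGrowth$len,
               list(ToothGrowth$supp, ToothGrowth$dose))

# Shapiro-Wilk p-value per cell; small values hint at non-normality
shapiro_p <- sapply(cells, function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 3))
```

A normal Q-Q plot per cell (`qqnorm` followed by `qqline`) would be a graphical alternative to the same check.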
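In all statistical tests, we will consider the following one-sided hypotheses, where mu is the mean of the first group and mu0 is the mean of the group it is compared with: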
H0: mu = mu0
Ha: mu > mu0
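To make the hypotheses concrete, here is a minimal sketch of the unequal-variance (Welch) two-sample t-test computed by hand for orange juice versus ascorbic acid, then compared with `t.test`. The variable names (`x`, `y`, `se`, `dof`, and so on) are illustrative and do not come from the report.

```r
# Welch two-sample t-test by hand, compared with t.test()
data(ToothGrowth)
x <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
y <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Standard error of the difference in means (unequal variances)
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)

# Test statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / se
dof <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided p-value for Ha: mean(x) > mean(y), and the lower 95% bound
p_value <- pt(t_stat, dof, lower.tail = FALSE)
lower <- (mean(x) - mean(y)) - qt(0.95, dof) * se

# Reference result from the built-in test
ref <- t.test(x, y, alternative = "greater", var.equal = FALSE)
print(c(by_hand = p_value, t.test = ref$p.value))
print(c(by_hand = lower, t.test = ref$conf.int[1]))
```

Because the alternative is one-sided, the confidence interval has only a lower bound; its upper end is reported as `Inf`.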
First, let's compare tooth growth between delivery methods over all doses.
lenOJ <- dplyr::filter(data,supp=="OJ")$len
lenVC <- dplyr::filter(data,supp=="VC")$len
t <- t.test(lenOJ,lenVC, paired=FALSE, var.equal=FALSE,alt="greater")
df <- data.frame(pvalues=t$p.value, conf=paste("[",round(t$conf[1],3), ",",
round(t$conf[2],3),"]"))
row.names(df) <- c(""); print(df)
## pvalues conf
## 0.03031725 [ 0.468 , Inf ]
The p-value is lower than the alpha level of 0.05, so H0 (both means are equal) is rejected in favor of Ha (the orange juice mean is greater than the ascorbic acid mean). In addition, the confidence interval does not contain 0.
Now let's check whether the previous result holds at each dose.
dosages <- c(2.0,1.0,0.5);pValues <- c();confInt <- c()
for (d in dosages) {
lenOJ2 <- dplyr::filter(data,supp=="OJ", dose==d)$len
lenVC2 <- dplyr::filter(data,supp=="VC", dose==d)$len
t <- t.test(lenOJ2,lenVC2,paired=FALSE, var.equal=FALSE,alt="greater")
strp <- paste("[",round(t$conf[1],3), ",", round(t$conf[2],3),"]")
pValues <- c(pValues,t$p.value); confInt <- c(confInt,strp)
}
df <- data.frame(pvalues=pValues, conf=confInt)
row.names(df) <- dosages
print(df)
## pvalues conf
## 2 0.5180742056 [ -3.133 , Inf ]
## 1 0.0005191879 [ 3.356 , Inf ]
## 0.5 0.0031793034 [ 2.346 , Inf ]
For doses 0.5 and 1.0 the tendency is confirmed; for dose 2.0, however, the p-value is high and the confidence interval contains 0. Further investigation shows that at this dose, mean tooth growth is greater with ascorbic acid than with orange juice.
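The further investigation mentioned above can be reproduced by computing the mean tooth length in each supplement/dose cell. This is a minimal base-R sketch; the names `cell_means` and `diffs` are illustrative and not taken from the report.

```r
data(ToothGrowth)

# Mean tooth length for each supplement/dose combination
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(cell_means)

# Difference OJ - VC at each dose; a negative value means VC is higher
oj <- cell_means$len[cell_means$supp == "OJ"]
vc <- cell_means$len[cell_means$supp == "VC"]
diffs <- setNames(oj - vc, unique(cell_means$dose))
print(diffs)
```

A negative entry in `diffs` marks a dose at which the ascorbic acid mean exceeds the orange juice mean.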
In the following tests we will consider only the 0.5 and 2.0 dosages.
First, let's compare tooth growth between the two doses over all delivery methods.
len2 <- dplyr::filter(data,dose==2.0)$len
len5 <- dplyr::filter(data,dose==0.5)$len
t <- t.test(len2,len5,paired=FALSE, var.equal=FALSE,alt="greater")
df <- data.frame(pvalues=t$p.value, conf=paste("[",round(t$conf[1],3), ",",
round(t$conf[2],3),"]"))
row.names(df) <- c(""); print(df)
## pvalues conf
## 2.198762e-14 [ 13.279 , Inf ]
The p-value is very small, so H0 (no difference) is rejected: mean tooth growth is greater at the 2.0 dose than at the 0.5 dose.
Let's now check whether this difference still holds for each delivery method.
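# Named 'supplements' so the vector is not confused with the 'supp' column used inside filter()
supplements <- c("OJ","VC");pValues <- c();confInt <- c()
for (s in supplements) {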
len2 <- dplyr::filter(data,supp==s, dose==2.0)$len
len5 <- dplyr::filter(data,supp==s, dose==0.5)$len
t <- t.test(len2,len5,paired=FALSE, var.equal=FALSE,alt="greater")
strp <- paste("[",round(t$conf[1],3), ",", round(t$conf[2],3),"]")
pValues <- c(pValues,t$p.value); confInt <- c(confInt,strp)
}
df <- data.frame(pvalues=pValues, conf=confInt)
row.names(df) <- c("Orange juice", "Ascorbic acid")
print(df)
## pvalues conf
## Orange juice 6.618919e-07 [ 9.948 , Inf ]
## Ascorbic acid 2.340789e-08 [ 15.086 , Inf ]
In both cases the p-values are small, so the difference is significant. In addition, the confidence intervals do not contain 0.
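The same test-and-format pattern is repeated several times in this report. As a sketch only, it could be wrapped in a small helper; `one_sided_welch` is a hypothetical name introduced here for illustration, and the version below uses base R subsetting instead of dplyr.

```r
# Hypothetical helper (name and layout are illustrative, not from the report):
# one-sided Welch t-test of x against y, returning the p-value and the
# confidence interval formatted as a string
one_sided_welch <- function(x, y, digits = 3) {
  res <- t.test(x, y, paired = FALSE, var.equal = FALSE,
                alternative = "greater")
  data.frame(pvalue = res$p.value,
             conf = paste("[", round(res$conf.int[1], digits), ",",
                          round(res$conf.int[2], digits), "]"))
}

data(ToothGrowth)

# Dose 2.0 versus dose 0.5 within each delivery method
by_supp <- lapply(c(OJ = "OJ", VC = "VC"), function(s) {
  sub <- ToothGrowth[ToothGrowth$supp == s, ]
  one_sided_welch(sub$len[sub$dose == 2.0], sub$len[sub$dose == 0.5])
})
result <- do.call(rbind, by_supp)
print(result)
```

Spelling out `alternative` and `conf.int` in full avoids relying on R's partial matching of argument and element names.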
Orange juice appears to be the more effective delivery method, but only up to a certain dose: at 2.0 the difference between the two methods is no longer significant.
This was already suggested by the exploratory data analysis and is confirmed by the statistical tests.
Here are some remarks:
There is definitely a relationship between dose and tooth growth. All tests comparing the 2.0 and 0.5 doses were significant.