Statistical Analysis of the ToothGrowth Data Set

By MAS, Feb 2019

Overview

This report uses hypothesis testing (t-tests) to assess the relationship between tooth growth and vitamin C administration by dose and delivery method in a sample of 60 guinea pigs from the ToothGrowth dataset. In the study, each animal received a vitamin C dosage of 0.5 mg/day, 1.0 mg/day, or 2.0 mg/day, delivered either as orange juice (OJ) or as ascorbic acid (VC).

Exploratory Data Analysis

First, load the “ToothGrowth” data set and assess it using str and summary.

## Load ToothGrowth data set
data("ToothGrowth")

## Get the structure
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

From str we see the data object is structured as 60 rows by 3 columns. The 3 columns are: len which contains the dependent variable describing tooth growth, supp which labels how the vitamin C was delivered (VC or OJ), and dose which labels the dosage of vitamin C.

## Get the summary
summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

From the summary, we see that there are no NA values. There are 60 samples split into 30 that were given OJ and 30 that were given VC. The statistical breakdown of tooth length values shows that the measurements range from 4.20 to 33.90, with a mean of 18.81 and a median of 19.25. The statistical breakdown for the dose column is not useful since the values are discrete and divided equally among 0.5, 1.0, and 2.0 mg/day.
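The equal split across delivery methods and doses, and the absence of missing values, can be checked directly. The following is a minimal sketch using base R only; the object names cell_counts and n_missing are illustrative and not part of the original analysis.

```r
## Cross-tabulate delivery method against dose to check that the design is balanced
data("ToothGrowth")
cell_counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(cell_counts)

## Count missing values across all three columns
n_missing <- sum(is.na(ToothGrowth))
print(n_missing)
```

Each of the six supp-by-dose cells should contain the same number of animals if the doses are divided equally within each delivery method.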

Summary of the Data

We can use the information obtained from the exploratory data analysis to construct an overall visualization of the data. Specifically, we plot tooth length by delivery method, with one facet per dose.

library(ggplot2)

g <- ggplot(ToothGrowth, aes(x=supp, y=len)) + 
        geom_boxplot(aes(group=supp, fill=supp)) + geom_jitter(width=0.1) +
        facet_grid(. ~ dose) + 
        xlab("Delivery Method (supp)") + ylab("Tooth Length") +
        ggtitle("Tooth Length as a Function of Delivery and Dose") +
        theme(plot.title = element_text(face="bold", hjust=0.5, size=12))  

g
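A numeric summary can complement the boxplots. The sketch below tabulates the mean and standard deviation of tooth length for each combination of delivery method and dose using base R; the names group_means and group_sds are illustrative assumptions, not objects from the original analysis.

```r
data("ToothGrowth")

## Mean tooth length per delivery method (rows) and dose (columns)
group_means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))

## Standard deviation of tooth length for the same groups
group_sds <- with(ToothGrowth, tapply(len, list(supp, dose), sd))

round(group_means, 2)
round(group_sds, 2)
```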

Hypothesis Testing

From the plot above, it looks as if tooth growth may be associated with both dose and delivery method. Because some of the sample means are close, we need hypothesis tests to determine whether the apparent differences are statistically significant. We use two-sample, two-sided, independent-group t-tests. Specifically, we use Welch's two-sample t-test, which does not assume that the two groups have equal variances.
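To make the Welch test concrete, the sketch below computes the test statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand for one example comparison (VC at 0.5 vs 1.0 mg/day) and compares them with the output of t.test. The variable names (x, y, vx, vy, t_manual, df_manual, p_manual) are illustrative and not part of the original analysis.

```r
data("ToothGrowth")

## Two example groups: VC delivery at 0.5 and 1.0 mg/day
x <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 1.0]

## Squared standard error of each group mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

## Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

## Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

## t.test() performs the Welch test by default (var.equal = FALSE)
welch <- t.test(x, y)
c(manual = p_manual, t.test = welch$p.value)
```

The two p-values should agree, since t.test with its default arguments applies the same formulas.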

Assumptions for Hypothesis Testing:

1. The data are IID.
2. The data follow a normal distribution.
3. The data are randomly sampled from the population.
4. Neither the sample nor population distributions are skewed.
5. There is no pairing of data.
6. The variances of the groups being compared are not assumed to be equal.
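The normality assumption can be given a rough check. The sketch below runs a Shapiro-Wilk test on each supp-by-dose group; this check is an addition for illustration and was not part of the original analysis, and the name shapiro_p is an assumption. With only a handful of animals per group the test has little power, so a large p-value is weak evidence of normality at best.

```r
data("ToothGrowth")

## Shapiro-Wilk p-value for each delivery method (rows) and dose (columns)
shapiro_p <- with(ToothGrowth,
                  tapply(len, list(supp, dose),
                         function(x) shapiro.test(x)$p.value))

round(shapiro_p, 3)
```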

Hypothesis Testing: Increasing Dose Leads to Increased Tooth Length

First, let’s consider only one delivery method at a time so that the comparison of tooth length vs. dose is controlled. Define the null hypothesis, H0, as no difference in mean tooth length between doses. We test it against the alternative hypothesis, Ha, that there is a difference. To begin, we separate the data into VC and OJ subsets.

library(dplyr)

vc_df <- filter(ToothGrowth, supp=="VC")
oj_df <- filter(ToothGrowth, supp=="OJ")

Now perform t-tests for all dose combinations. Tables of the results are shown below for VC and OJ, respectively (R code in Appendix).

Comparison	p-value	95% confidence interval
VC 0.5 mg/day vs 1 mg/day	6.81e-07	-11.27 : -6.31
VC 0.5 mg/day vs 2 mg/day	4.68e-08	-21.9 : -14.42
VC 1 mg/day vs 2 mg/day	9.16e-05	-13.05 : -5.69

Comparison	p-value	95% confidence interval
OJ 0.5 mg/day vs 1 mg/day	8.78e-05	-13.42 : -5.52
OJ 0.5 mg/day vs 2 mg/day	1.32e-06	-16.34 : -9.32
OJ 1 mg/day vs 2 mg/day	0.0392	-6.53 : -0.19

Each confidence interval is for the difference in mean tooth length between the two doses, computed as the lower dose minus the higher dose. For VC, all three intervals are entirely negative and exclude 0, indicating that mean tooth length is greater at the higher dose. The p-values for VC delivery are all below the conventional alpha value of 0.05. This means that, if the null hypothesis were true, the probability of observing a difference in mean tooth length at least as large as the one we observed would be below 5%. Thus, we reject the null hypothesis with a type-I error rate of 5%. We therefore conclude that increasing the dose of vitamin C via ascorbic acid (VC) delivery likely results in increased tooth length.

We see similar results for OJ delivery. Thus, we conclude that increasing the dose of vitamin C via orange juice (OJ) delivery likely results in increased tooth length. It should be noted that the evidence for a difference between 1.0 and 2.0 mg/day is weaker, with a p-value of 0.039 and a confidence interval whose upper bound of -0.19 is quite close to 0. If we took the more conservative approach of accounting for multiple comparisons using the Bonferroni correction, we would scale our alpha to 0.05/3 ≈ 0.017. The p-value of 0.039 exceeds this threshold, so we would fail to reject the null hypothesis for this comparison and would not conclude that increasing the OJ dose from 1.0 to 2.0 mg/day increases tooth length.
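The Bonferroni correction can equivalently be applied to the p-values rather than to alpha. The sketch below uses p.adjust on the three OJ p-values reported above, rounded to three significant figures; the names oj_p and oj_p_adj are illustrative and not part of the original analysis.

```r
## p-values for the three OJ dose comparisons, as reported above
## (rounded to three significant figures)
oj_p <- c("0.5 vs 1" = 8.78e-05,
          "0.5 vs 2" = 1.32e-06,
          "1 vs 2"   = 0.0392)

## Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1
oj_p_adj <- p.adjust(oj_p, method = "bonferroni")

## Compare the adjusted p-values against the unscaled alpha of 0.05
oj_p_adj < 0.05
```

Comparing an adjusted p-value against 0.05 leads to the same decision as comparing the raw p-value against 0.05/3.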

Hypothesis Testing: OJ Delivery Increases Tooth Length More than VC Delivery

To test this hypothesis, we perform t-tests again, this time comparing OJ and VC delivery at a constant dose. We define the null hypothesis, H0, to be that there is no difference in mean tooth length between the OJ and VC delivery methods at the same dose. We test it against the alternative hypothesis, Ha, that there is a difference.

Comparison	p-value	95% confidence interval
VC vs OJ at 0.5 mg/day	0.00636	-8.78 : -1.72
VC vs OJ at 1 mg/day	0.00104	-9.06 : -2.8
VC vs OJ at 2 mg/day	0.964	-3.64 : 3.8

Using the same reasoning as in the analysis above, the p-values and confidence intervals show that mean tooth length after OJ delivery is statistically significantly larger than mean tooth length after VC delivery at doses of 0.5 mg/day and 1.0 mg/day. Specifically, the p-values are below 0.05, and the confidence intervals for the difference (VC minus OJ) are entirely negative and do not contain 0. We therefore reject the null hypothesis for these two cases. For 2.0 mg/day, the p-value is above 0.05 and the confidence interval contains 0. Therefore, we fail to reject the null hypothesis and conclude that there is no statistically significant difference in mean tooth length between OJ and VC delivery at 2.0 mg/day.

Conclusions

Both vitamin C delivery method and dosage affect tooth growth in guinea pigs, based on t-tests performed with alpha = 0.05 and no correction for multiple comparisons. Increasing the dose of vitamin C via ascorbic acid (VC) delivery likely results in increased tooth length across all dose comparisons. Increasing the dose of vitamin C via orange juice (OJ) delivery likewise likely results in increased tooth length across all dose comparisons. OJ results in more tooth growth than VC at the 0.5 mg/day and 1.0 mg/day doses, but there is no statistically significant difference between OJ and VC delivery at the 2.0 mg/day dose. These conclusions are based on the assumptions listed above:

1. The data are IID.
2. The data follow a normal distribution.
3. The data are randomly sampled from the population.
4. Neither the sample nor population distributions are skewed.
5. There is no pairing of data.
6. The variances of the groups being compared are not assumed to be equal.

Appendix

R Code for first and second tables, respectively

library(kableExtra)

## Dose pairs (mg/day) to compare within each delivery method
inds <- list(c(0.5, 1.0), c(0.5, 2.0), c(1.0, 2.0))

## Run a Welch t-test for each dose pair in data (a single delivery method).
## statindex picks the element of the t.test() result to return:
##   3 = p-value, 4 = confidence interval formatted as "lower : upper"
get_stats <- function(ind, data, statindex) {
        output <- list()
        for(entry in ind){
                ## Row label, e.g. "VC 0.5 mg/day vs 1 mg/day"
                name <- paste0(as.character(data$supp[1]), " ", entry[1],
                               " mg/day vs ", entry[2], " mg/day")
                teststat <- t.test(data$len[data$dose==entry[1]],
                                   data$len[data$dose==entry[2]])[statindex]
                if(statindex == 3){
                        output[name] <- teststat
                }
                else if(statindex == 4){
                        output[name] <- paste(round(teststat[[1]][1],2),":",
                                              round(teststat[[1]][2],2))
                }
        }
        return(output)
}
vc_p_values <- get_stats(inds, vc_df, 3)
vc_confidence_interval <- get_stats(inds, vc_df, 4)
dt_vc <- data.frame(cbind(vc_p_values, vc_confidence_interval))
dt_vc %>% kable() %>% kable_styling()

oj_p_values <- get_stats(inds, oj_df, 3)
oj_confidence_interval <- get_stats(inds, oj_df, 4)
dt_oj <- data.frame(cbind(oj_p_values, oj_confidence_interval))
dt_oj %>% kable() %>% kable_styling()

R Code for third table

inds_dose <- c(0.5, 1.0, 2.0)

## Run a Welch t-test comparing data1 and data2 (one delivery method each)
## at every dose in ind. statindex picks the element of the t.test() result:
##   3 = p-value, 4 = confidence interval formatted as "lower : upper"
compare_supp <- function(ind, data1, data2, statindex) {
        output <- list()
        for(entry in ind){
                ## Row label, e.g. "VC vs OJ at 0.5 mg/day"
                name <- paste0(as.character(data1$supp[1]), " vs ",
                               as.character(data2$supp[1]), " at ",
                               entry, " mg/day")
                teststat <- t.test(data1$len[data1$dose==entry],
                                   data2$len[data2$dose==entry])[statindex]
                if(statindex == 3){
                        output[name] <- teststat
                }
                else if(statindex == 4){
                        output[name] <- paste(round(teststat[[1]][1],2),":",
                                              round(teststat[[1]][2],2))
                }
        }
        return(output)
}
        
p_values <- compare_supp(inds_dose, vc_df, oj_df, 3)
confidence_interval <- compare_supp(inds_dose, vc_df, oj_df, 4)
dt <- data.frame(cbind(p_values, confidence_interval))
dt %>% kable() %>% kable_styling()