Part 2: Basic Inferential Data Analysis Instructions

In the second portion of the project, I analyzed the ToothGrowth data in the R datasets package.

        data(ToothGrowth)

First, I wanted to see a basic summary of the data:

        # Provide a basic summary of the data.
        str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
        summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

This data frame contains n=60 observations of three variables: len, supp, and dose. The supp variable is a factor with two levels, “OJ” and “VC”, with n=30 observations each. The len variable is numeric and ranges from 4.20 to 33.90. The dose variable is also numeric and ranges from 0.5 to 2.0. After a quick search (the dataset's help page, ?ToothGrowth, says the same), I found that len is “Tooth Length”, supp is “Supplement Type” (orange juice or ascorbic acid), and dose is “Dose (in milligrams per day)”.
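
To make the structure described above concrete, here is a minimal sketch that tabulates how many observations fall in each supplement-by-dose cell and computes the mean tooth length per cell. The names `counts` and `cell_means` are my own illustrative choices, not part of the original analysis.

```r
data(ToothGrowth)

# Number of observations in each supplement-by-dose cell
counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(counts)

# Mean tooth length in each supplement-by-dose cell
cell_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(cell_means)
```

A balanced design like this one (equal cell counts) makes the later group comparisons easier to interpret.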

Here is a visual of the differences in tooth length by dose and supplement type:

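        library(ggplot2)  # provides qplot() and geom_boxplot()
        # One panel per dose; boxes are filled by supplement type
        qplot(supp, len, data = ToothGrowth, facets = ~dose,
              main = "Tooth length by supplement type and dosage (mg)",
              xlab = "Supplement type", ylab = "Tooth length") +
            geom_boxplot(aes(fill = supp))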

As you can see, the boxes for the two supplement types overlap at the 0.5 mg and 2 mg doses. The largest difference in tooth length between supplement types appears at the 1 mg dose. Overall, tooth length was highest in the 2 mg dose group, regardless of the supplement type administered.

Since it appears the variances of tooth growth differ by supplement and possibly by dose, we will need to perform Welch’s t-tests for unequal variances when comparing tooth length by supplement and by dose.
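
The plot only suggests that the variances differ, so here is a small sketch that computes the sample variance of tooth length within each supplement group and within each dose group. The variable names `var_by_supp` and `var_by_dose` are illustrative assumptions of mine.

```r
data(ToothGrowth)

# Sample variance of tooth length within each supplement type
var_by_supp <- tapply(ToothGrowth$len, ToothGrowth$supp, var)
print(var_by_supp)

# Sample variance of tooth length within each dose level
var_by_dose <- tapply(ToothGrowth$len, ToothGrowth$dose, var)
print(var_by_dose)
```

If the group variances are clearly different, Welch's test is the safer choice; if they are close, it costs very little compared with the pooled-variance test.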

The hypothesis tests performed below assume the following:

- Observations are independent and identically distributed (IID)
- Variances of tooth growth differ by supplement and dose
- The distribution of tooth growth is normal (Gaussian)
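
The normality assumption can be checked informally. The sketch below uses the Shapiro-Wilk test (`shapiro.test()` from base R), which is not one of the techniques used elsewhere in this report; I include it only as a rough diagnostic, and the name `normality_p` is my own.

```r
data(ToothGrowth)

# Shapiro-Wilk p-value for tooth length within each dose group.
# A small p-value would suggest a departure from normality.
normality_p <- tapply(ToothGrowth$len, ToothGrowth$dose,
                      function(x) shapiro.test(x)$p.value)
print(normality_p)
```

With only 20 observations per dose group, this check has limited power, so it is best read alongside the boxplots rather than on its own.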

I compared tooth growth (len) by type of vitamin supplement (supp) using a hypothesis test with H0: no difference in mean tooth growth between supplement types.

        # Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.
        # (Only use the techniques from class, even if there are other approaches worth considering.)
        # Since the variances appear to be unequal, we need a Welch's t-test.
        # paired = FALSE is the default, so it is omitted from the formula call.
        t.test(len ~ supp, var.equal = FALSE, data = ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333
        # Store the test object so the p-value can be reported in the text
        ttest <- t.test(len ~ supp, var.equal = FALSE, data = ToothGrowth)
        pvalue <- round(ttest$p.value, 9)

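Since the p-value for this test (0.0606345) is greater than 0.05, we fail to reject the null hypothesis that mean tooth length does not differ by supplement type. Consistent with this, the 95% confidence interval for the difference in means (-0.17 to 7.57) contains 0.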
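
The same conclusion can be read from the confidence interval. As a minimal sketch (the names `ci` and `contains_zero` are illustrative), the interval can be pulled out of the test object and checked against 0:

```r
data(ToothGrowth)

# Welch two-sample t-test (var.equal = FALSE is the default)
ttest <- t.test(len ~ supp, data = ToothGrowth)

# 95% confidence interval for mean(OJ) - mean(VC)
ci <- ttest$conf.int
print(ci)

# TRUE when the interval includes 0, i.e. no difference is plausible
contains_zero <- ci[1] < 0 && ci[2] > 0
print(contains_zero)
```

A two-sided test at the 5% level rejects the null exactly when the 95% interval excludes 0, so the two checks always agree.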

However, what if our alternative hypothesis is that the OJ supplement produces greater tooth growth than the VC supplement?

In this test, we specify a one-sided test:

        # What about the hypothesis that mean tooth length is greater when given the OJ supplement?
        # alternative = "greater" tests whether mean(OJ) - mean(VC) > 0
        oneside <- t.test(len ~ supp, alternative = "greater", var.equal = FALSE, data = ToothGrowth)
        pvalueone <- round(oneside$p.value, 9)

Performing a one-sided test, we find a p-value of 0.0303173 < 0.05, so we reject the null hypothesis that the true difference in means (OJ minus VC) is not greater than 0. This supports the claim that, on average, the OJ supplement group has greater tooth length than the VC supplement group.
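
The one-sided p-value is exactly half of the two-sided one here because the t statistic is positive. The sketch below recomputes both p-values by hand from the t statistic and degrees of freedom using `pt()`; the variable names are my own.

```r
data(ToothGrowth)

welch <- t.test(len ~ supp, data = ToothGrowth)
t_stat <- unname(welch$statistic)  # Welch t statistic
dof <- unname(welch$parameter)     # Welch-Satterthwaite degrees of freedom

# Upper-tail area only: alternative is mean(OJ) - mean(VC) > 0
p_one_sided <- pt(t_stat, dof, lower.tail = FALSE)

# Both tails: alternative is mean(OJ) - mean(VC) != 0
p_two_sided <- 2 * pt(abs(t_stat), dof, lower.tail = FALSE)

print(c(one_sided = p_one_sided, two_sided = p_two_sided))
```

This is also why the direction of a one-sided test should be chosen before looking at the data: halving the p-value after the fact makes significance easier to reach.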

Next, I examined whether tooth growth differed significantly by dose, with H0: no difference in mean tooth growth between doses. Note: due to the project specifications, I ran multiple pairwise t-tests instead of an ANOVA or regression.

        # The variance for the 0.5 dose looks slightly larger, so use unequal variances to be safe.
        # We compare the 0.5 dose vs. the 1.0 dose,
        # then the 1.0 dose vs. the 2.0 dose, then the 0.5 dose vs. the 2.0 dose.

        # First, create three subsets so that dose has only two levels in each:
        halfvsone <- subset(ToothGrowth, dose < 2)

        onevstwo <- subset(ToothGrowth, dose > 0.5)

        halfvstwo <- subset(ToothGrowth, dose == 0.5 | dose == 2)

        # 0.5 vs 1 mg (paired = FALSE is the default, so it is omitted):
        halfvsonetest <- t.test(len ~ dose, var.equal = FALSE, data = halfvsone)
        p.5vs1 <- round(halfvsonetest$p.value, 7)

        # 1 vs 2 mg:
        onevstwotest <- t.test(len ~ dose, var.equal = FALSE, data = onevstwo)
        p1vs2 <- round(onevstwotest$p.value, 5)

        # 0.5 vs 2 mg:
        halfvstwotest <- t.test(len ~ dose, var.equal = FALSE, data = halfvstwo)
        p.5vs2 <- halfvstwotest$p.value

As the t-tests show, dose significantly affects mean tooth length, and we reject the null hypothesis (that dose does not affect tooth length) in all three tests. The 0.5 mg dose group had a significantly lower mean tooth length than the 1.0 mg dose group (p = 0.0000001) and the 2.0 mg dose group (p < 0.0001). Furthermore, the 1.0 mg dose group had a significantly lower mean tooth length than the 2.0 mg dose group (p = 0.00002). In sum, a higher dose is associated with greater tooth length.
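
Because three pairwise tests are run on the same data, one reasonable extra check is a multiple-comparison adjustment. The sketch below is not part of the original analysis: it reruns the three Welch tests in a loop over dose pairs and applies a Bonferroni correction with `p.adjust()`. The names `dose_pairs`, `raw_p`, and `adj_p` are my own.

```r
data(ToothGrowth)

# All pairs of dose levels: (0.5, 1), (0.5, 2), (1, 2)
dose_pairs <- combn(sort(unique(ToothGrowth$dose)), 2)

# Welch t-test p-value for each pair of doses
raw_p <- apply(dose_pairs, 2, function(pr) {
  two_doses <- ToothGrowth[ToothGrowth$dose %in% pr, ]
  t.test(len ~ dose, data = two_doses)$p.value
})
names(raw_p) <- apply(dose_pairs, 2, paste, collapse = " vs ")

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
adj_p <- p.adjust(raw_p, method = "bonferroni")
print(adj_p)
```

All three adjusted p-values remain well below 0.05, so the conclusion about dose does not depend on whether the correction is applied.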