Basic Inferential Data Analysis

Joseph Bloomquist 05-20-2024

Overview

This is the second portion of the Johns Hopkins Statistical Inference course project.

In this report we will:

Load ToothGrowth data and perform some basic exploratory data analysis
Provide a basic summary of the data
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose
State conclusions and the assumptions needed for those conclusions

Exploratory Data Analysis

First, we will load the data. We know this data is already clean and workable, so we will skip the data cleaning process.

data("ToothGrowth")
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

The structure shows 60 observations of three variables: tooth length (len), a delivery-method factor (supp) with two levels, “OJ” (orange juice) and “VC” (ascorbic acid), and dose, which takes three values (0.5, 1, and 2 mg/day).

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

The summary confirms the two levels of supp and shows an even split between the delivery methods, with 30 observations each.
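The summary does not show how the observations are spread across dose levels. One way to check this, sketched here with base R's table(), is to cross-tabulate delivery method against dose. The variable names counts and doseLevels are illustrative and not part of the original analysis.

```r
data("ToothGrowth")

# Cross-tabulate delivery method against dose to see the group sizes
counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(counts)

# Distinct dose levels present in the data
doseLevels <- sort(unique(ToothGrowth$dose))
print(doseLevels)
```

In the standard ToothGrowth data set, each of the six supp and dose combinations should contain 10 observations, so every per-dose comparison later in the report is a 10 vs. 10 test.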

Which delivery method of Vitamin C provided the biggest gain in tooth length?

We will first break the data into subsets based on the delivery method.

ojData <- subset(ToothGrowth, supp == "OJ")
vcData <- subset(ToothGrowth, supp == "VC")

Now break it down further by dosage.

halfDoseOJ <- ojData[ojData$dose == 0.5,]
fullDoseOJ <- ojData[ojData$dose == 1,]
doubleDoseOJ <- ojData[ojData$dose == 2.0,]

halfDoseVC <- vcData[vcData$dose == 0.5,]
fullDoseVC <- vcData[vcData$dose == 1,]
doubleDoseVC <- vcData[vcData$dose == 2.0,]
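As a complement to the manual subsets, the group means and spreads can be tabulated in one step. This is a minimal sketch using base R's aggregate() and merge(); the names groupMeans, groupSDs, and groupSummary are illustrative and do not appear in the original analysis.

```r
data("ToothGrowth")

# Mean and standard deviation of length for each supp/dose combination
groupMeans <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
groupSDs   <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = sd)

# Combine into one table; the shared column "len" gets the suffixes below
groupSummary <- merge(groupMeans, groupSDs, by = c("supp", "dose"),
                      suffixes = c(".mean", ".sd"))
print(groupSummary)
```

The len.mean column should match the sample estimates reported by the t-tests in the sections that follow.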

Let’s see what each group looks like visually.

Both groups compared

boxplot(ojData$len, vcData$len, main = "OJ vs. VC", sub = "All Doses", names = c("OJ", "VC"), ylab = "Length")

With all doses pooled, the boxplot suggests that OJ is more effective: its median length is higher. We can test this using a two-sample t-test.

t.test(ojData$len, vcData$len)

## 
##  Welch Two Sample t-test
## 
## data:  ojData$len and vcData$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

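Since the p-value (0.06063) is greater than the conventional alpha level of 0.05, and the 95% confidence interval (-0.171 to 7.571) contains zero, we do not reject the null hypothesis of equal means. There is not enough evidence to conclude that the delivery methods differ when all doses are pooled. No significant difference at this level.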
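The decision rule used above can also be applied programmatically by pulling the p-value and confidence interval out of the object that t.test() returns. The sketch below uses the formula interface, which should be equivalent to the call above because OJ is the first level of supp; the names overallTest, alpha, rejectNull, and ciExcludesZero are illustrative.

```r
data("ToothGrowth")

# Formula interface: compares len between the two levels of supp (OJ, then VC)
overallTest <- t.test(len ~ supp, data = ToothGrowth)

# Pull out the pieces used in the interpretation
pValue <- overallTest$p.value
ci     <- overallTest$conf.int

alpha <- 0.05
rejectNull     <- pValue < alpha            # decision by p-value
ciExcludesZero <- ci[1] > 0 || ci[2] < 0    # same decision by confidence interval

cat("p-value:", signif(pValue, 4), "\n")
cat("95% CI:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
cat("Reject null at alpha = 0.05:", rejectNull, "\n")
```

For a two-sided test at the matching confidence level, the two checks agree: the interval excludes zero exactly when the p-value falls below alpha.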

How does breaking it down by dosage affect the results?

0.5 mg/day Dose

boxplot(halfDoseOJ$len, halfDoseVC$len, names = c("OJ", "VC"), ylab = "Length", main = "OJ vs. VC", sub = "0.5 mg/day Dose")

At the 0.5 mg/day dose, the boxplot shows “OJ” clearly ahead. Let’s test that hypothesis.

t.test(halfDoseOJ$len, halfDoseVC$len)

## 
##  Welch Two Sample t-test
## 
## data:  halfDoseOJ$len and halfDoseVC$len
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean of x mean of y 
##     13.23      7.98

Since the p-value (0.006359) is well below the conventional alpha level of 0.05, and the 95% confidence interval (1.719 to 8.781) lies entirely above zero, we reject the null hypothesis of equal means. Significant difference in favor of OJ at the 0.5 mg/day dose.

1.0 mg/day Dose

boxplot(fullDoseOJ$len, fullDoseVC$len, names = c("OJ", "VC"), ylab = "Length", main = "OJ vs. VC", sub = "1.0 mg/day Dose")

“OJ” appears to be ahead again. Let’s test!

t.test(fullDoseOJ$len, fullDoseVC$len)

## 
##  Welch Two Sample t-test
## 
## data:  fullDoseOJ$len and fullDoseVC$len
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean of x mean of y 
##     22.70     16.77

Since the p-value (0.001038) is well below the conventional alpha level of 0.05, and the 95% confidence interval (2.802 to 9.058) lies entirely above zero, we reject the null hypothesis of equal means. Significant difference in favor of OJ at the 1.0 mg/day dose.

2.0 mg/day Dose

boxplot(doubleDoseOJ$len, doubleDoseVC$len, names = c("OJ", "VC"), ylab = "Length", main = "OJ vs. VC", sub = "2.0 mg/day Dose")

The two groups look roughly even at this dose. Let’s test again.

t.test(doubleDoseVC$len, doubleDoseOJ$len)

## 
##  Welch Two Sample t-test
## 
## data:  doubleDoseVC$len and doubleDoseOJ$len
## t = 0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.63807  3.79807
## sample estimates:
## mean of x mean of y 
##     26.14     26.06

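The p-value (0.9639) is far above the conventional alpha level of 0.05, and the 95% confidence interval (-3.638 to 3.798) is nearly centered on zero, so we do not reject the null hypothesis. Note that this call passes VC first, so the estimated difference is VC minus OJ; for a two-sided test this does not change the p-value. Failing to reject does not prove that the means are equal, but at this dose there is no evidence that the delivery method matters, even though both groups show their highest mean lengths here. No significant difference at this level.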
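Running a separate test at each dose means three tests are made on the same data set, which raises the chance of a false positive. The sketch below repeats the per-dose Welch tests in a loop and applies a Holm correction with p.adjust(). The correction is an addition for illustration, not part of the original analysis, and the names doses, rawP, and adjustedP are hypothetical.

```r
data("ToothGrowth")

doses <- sort(unique(ToothGrowth$dose))

# Welch t-test of OJ vs. VC within each dose level
rawP <- sapply(doses, function(d) {
  t.test(len ~ supp, data = subset(ToothGrowth, dose == d))$p.value
})
names(rawP) <- paste0("dose_", doses)

# Holm adjustment for running three tests on the same data set
adjustedP <- p.adjust(rawP, method = "holm")

print(round(cbind(raw = rawP, holm = adjustedP), 4))
```

The raw column should reproduce the three p-values reported above; the holm column can then be compared against 0.05 in the same way.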

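What is the most efficient combination to increase tooth growth?

Based on these data, “OJ” produced significantly longer teeth than “VC” at the 0.5 and 1.0 mg/day doses, while at 2.0 mg/day the two methods were statistically indistinguishable. The largest sample means occur at 2.0 mg/day (26.06 for OJ and 26.14 for VC), so “OJ” at 1.0 mg/day is not the combination with the greatest growth; it is better described as the most efficient one, reaching a mean length of 22.70 at half the maximum dose. Differences between dose levels were not formally tested here.

These conclusions assume that the observations are independent, that the groups are unpaired, and that tooth length is approximately normally distributed within each group. Equal variances are not assumed, since t.test defaults to the Welch test.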
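The tests in this report compare delivery methods within a dose; they do not compare doses directly. As a minimal sketch of how that gap could be checked, the code below runs a Welch t-test on OJ at 1.0 vs. OJ at 2.0. No result is quoted here because this comparison was not part of the original analysis, and the name ojDoseTest is illustrative.

```r
data("ToothGrowth")

ojData <- subset(ToothGrowth, supp == "OJ")

# Welch t-test: OJ at dose 1.0 vs. OJ at dose 2.0
ojDoseTest <- t.test(ojData$len[ojData$dose == 1],
                     ojData$len[ojData$dose == 2])
print(ojDoseTest)
```

The same pattern would apply to any other pair of dose levels, or to the VC group, by changing the subset and the dose values.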