Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.
library(datasets)
library(ggplot2)
#Perform exploratory analysis of the dataset to better understand its contents
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
unique(ToothGrowth$dose)
## [1] 0.5 1.0 2.0
Initial observation shows that there are two supplement types (OJ and VC) and three dose levels (0.5, 1.0, and 2.0) that may affect tooth length. The plot below suggests that dose is clearly a factor, but the relationship between supplement and length is less clear.
# Initial plot of ToothGrowth data
ggplot(aes(x=dose, y = len), data = ToothGrowth) + geom_point(aes(color = supp))
The points overlap and the pattern is difficult to read from the plot above, so let’s compare the two supplements separately at each dose.
# Tooth growth by supplement and dose
ggplot(aes(x=supp, y=len), data= ToothGrowth) + geom_boxplot(aes(fill=supp)) + facet_wrap(~dose)
It now appears that the supplements differ noticeably at the lower doses, but at the highest dose of 2.0 their results are comparable.
We can use confidence intervals and/or hypothesis tests to compare tooth growth by supplement and dose. We will start by calculating the mean and standard deviation of length for each combination of dose and supplement.
library(plyr)
tooth_means <- ddply(ToothGrowth, .(dose, supp), summarize, mean=mean(len), sd=sd(len))
print(tooth_means)
## dose supp mean sd
## 1 0.5 OJ 13.23 4.459709
## 2 0.5 VC 7.98 2.746634
## 3 1.0 OJ 22.70 3.910953
## 4 1.0 VC 16.77 2.515309
## 5 2.0 OJ 26.06 2.655058
## 6 2.0 VC 26.14 4.797731
Our dataset has 10 observations for each combination, so we will compute a 95% confidence interval for each group mean using the t distribution with 9 degrees of freedom.
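For reference, the calculation that follows is the standard two-sided t interval for a mean. Written out, with the OJ group at dose 0.5 as a worked example (the symbols are notation introduced here, not part of the original code):

```latex
\[
  \bar{x} \;\pm\; t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}},
  \qquad n = 10,\quad t_{0.975,\,9} \approx 2.262
\]
% Worked example: OJ at dose 0.5 (mean 13.23, sd 4.459709)
\[
  13.23 \;\pm\; 2.262 \times \frac{4.459709}{\sqrt{10}}
  \;\approx\; 13.23 \pm 3.19
  \;=\; (10.04,\ 16.42)
\]
```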
OJ5error <- qt(0.975,df=9)*4.459709/sqrt(10)
OJ5left <- 13.23-OJ5error
OJ5right <- 13.23+OJ5error
VC5error <- qt(0.975,df=9)*2.746634/sqrt(10)
VC5left <- 7.98-VC5error
VC5right <- 7.98+VC5error
OJ1error <- qt(0.975,df=9)*3.910953/sqrt(10)
OJ1left <- 22.7-OJ1error
OJ1right <- 22.7+OJ1error
VC1error <- qt(0.975,df=9)*2.515309/sqrt(10)
VC1left <- 16.77-VC1error
VC1right <- 16.77+VC1error
OJ2error <- qt(0.975,df=9)*2.655058/sqrt(10)
OJ2left <- 26.06-OJ2error
OJ2right <- 26.06+OJ2error
VC2error <- qt(0.975,df=9)*4.797731/sqrt(10)
VC2left <- 26.14-VC2error
VC2right <- 26.14+VC2error
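The six intervals above can also be computed directly from the data rather than from the rounded summary values. The following is a minimal sketch under that assumption; `ci_mean` and `tooth_ci` are hypothetical names introduced here for illustration.

```r
library(datasets)

# Hypothetical helper: two-sided t interval for the mean of a numeric vector
ci_mean <- function(x, level = 0.95) {
  n <- length(x)
  err <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - err, upper = mean(x) + err)
}

# Split tooth length into the six supplement/dose groups
# (group names take the form "OJ.0.5", "VC.0.5", "OJ.1", ...)
groups <- split(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose))

# One row per group, with the lower and upper bound of the interval
tooth_ci <- t(sapply(groups, ci_mean))
print(round(tooth_ci, 2))
```

This avoids retyping means and standard deviations by hand, at the cost of hiding the arithmetic that the explicit version makes visible.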
Now let’s simulate a thousand random draws from a normal distribution for each supplement and dose, using the respective sample mean and standard deviation, and mark the 95% confidence interval for the mean on each histogram.
OJ_half <- rnorm(1000, 13.23, 4.459709)
VC_half <- rnorm(1000, 7.98, 2.746634)
OJ_1 <- rnorm(1000, 22.70, 3.910953)
VC_1 <- rnorm(1000, 16.77, 2.515309)
OJ_2 <- rnorm(1000, 26.06, 2.655058)
VC_2 <- rnorm(1000, 26.14, 4.797731)
par(mfrow=c(3, 2))
hist(OJ_half, col="red", breaks=40)
abline(v = OJ5left, col="green")
abline(v = OJ5right, col="green")
hist(VC_half, col="blue", breaks=40)
abline(v = VC5left, col="green")
abline(v = VC5right, col="green")
hist(OJ_1, col="red", breaks=40)
abline(v = OJ1left, col="green")
abline(v = OJ1right, col="green")
hist(VC_1, col="blue", breaks=40)
abline(v = VC1left, col="green")
abline(v = VC1right, col="green")
hist(OJ_2, col="red", breaks=40)
abline(v = OJ2left, col="green")
abline(v = OJ2right, col="green")
hist(VC_2, col="blue", breaks=40)
abline(v = VC2left, col="green")
abline(v = VC2right, col="green")
The thousand simulated draws for each combination of dose and supplement give roughly Gaussian histograms, as expected since they were generated with rnorm. The green lines on each plot mark the 95% confidence interval for the mean: we are 95% confident that the true mean length for that group lies between them. Note that these intervals describe the group means, not the range in which 95% of individual test subjects would fall. The intervals are as follows.
OJ 0.5: 10.04 to 16.42
VC 0.5: 6.02 to 9.94
OJ 1.0: 19.90 to 25.50
VC 1.0: 14.97 to 18.57
OJ 2.0: 24.16 to 27.96
VC 2.0: 22.71 to 29.57
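It is worth keeping in mind that a confidence interval for the mean is much narrower than the spread of individual values. A rough check, assuming the OJ 0.5 group is normally distributed with its sample mean and standard deviation (the seed is arbitrary and the variable names are introduced here for illustration):

```r
set.seed(1)  # arbitrary seed, only for reproducibility

m <- 13.23      # sample mean for OJ at dose 0.5
s <- 4.459709   # sample standard deviation for OJ at dose 0.5
n <- 10

# Half-width of the 95% confidence interval for the mean
err <- qt(0.975, df = n - 1) * s / sqrt(n)

# Share of simulated individual values that land inside that interval
draws <- rnorm(1000, m, s)
inside <- mean(draws > m - err & draws < m + err)
print(inside)

# Approximate range covering 95% of individual values under normality
individual_range <- qnorm(c(0.025, 0.975), mean = m, sd = s)
print(round(individual_range, 2))
```

Under these assumptions only about half of the simulated individuals fall between the green lines, which is why the intervals should be read as statements about the group mean.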
We can see clearly that OJ and VC are most similar, and in fact nearly identical, at the dose of 2.0, whereas the intervals for the two supplements do not overlap at all at doses 0.5 and 1.0. This still doesn’t indicate clearly whether dose and supplement act independently of each other or interact. Right now, the data indicates that a high dose of either supplement is beneficial for tooth growth, whereas at lower doses OJ is associated with more growth than VC. Our hypothesis then is that the supplement does not impact tooth growth but the dose does.
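One way this hypothesis could be checked is with two-sample t-tests. The following is a sketch rather than a definitive analysis: it assumes Welch's unequal-variance test (the default in `t.test`) is appropriate and treats the groups as independent; `p_by_dose` and `dose_test` are names introduced here.

```r
library(datasets)

# Supplement effect: Welch t-test of length by supplement at each dose
doses <- c(0.5, 1.0, 2.0)
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = subset(ToothGrowth, dose == d))$p.value
})
names(p_by_dose) <- doses
print(p_by_dose)

# Dose effect: compare the lowest and highest doses, pooling both supplements
dose_test <- t.test(len ~ dose,
                    data = subset(ToothGrowth, dose %in% c(0.5, 2.0)))
print(dose_test$conf.int)
print(dose_test$p.value)
```

A p-value below 0.05 for a given comparison would be evidence against the null hypothesis of equal means for that pair of groups. Because several tests are run, a multiple-comparison adjustment such as `p.adjust` may be worth considering.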