By Daniel Perez
We are going to analyze the ToothGrowth data in the R datasets package.
Load the ToothGrowth data and perform some basic exploratory data analyses.
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.
State your conclusions and the assumptions needed for your conclusions.
We will begin by loading the required libraries to do the analysis.
library(ggplot2)
library(datasets)
library(dplyr)
Then let's load the data and check its structure (using str). We also look at the first six rows (using head) to get a sense of what the data set looks like.
data(ToothGrowth)
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
From these first steps we know that the data set has 60 observations in three columns. The first column, len, is numeric and represents tooth length. The second column, supp, is a factor and represents the supplement type (OJ or VC). The last column, dose, is numeric and represents the dose in milligrams.
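As a quick check of the design, we can cross-tabulate supplement against dose. This is a small sketch that is not part of the original analysis, and the name design is ours.

```r
library(datasets)

# Count the observations in each supplement/dose combination
design <- table(ToothGrowth$supp, ToothGrowth$dose)
print(design)
```

Each of the six cells holds 10 observations, so the groups we compare later are balanced.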
Let's summarize the data by supplement and dose. First we group the data set by supplement and dose, then we summarise the values in each group to get the minimum, first quartile, median, mean, third quartile, maximum and standard deviation.
ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(min = min(len),
            FirstQ = quantile(len, 0.25),
            median = median(len),
            mean = mean(len),
            ThirdQ = quantile(len, 0.75),
            max = max(len),
            sd = sd(len))
## Source: local data frame [6 x 9]
## Groups: supp
##
## supp dose min FirstQ median mean ThirdQ max sd
## 1 OJ 0.5 8.2 9.700 12.25 13.23 16.175 21.5 4.459709
## 2 OJ 1.0 14.5 20.300 23.45 22.70 25.650 27.3 3.910953
## 3 OJ 2.0 22.4 24.575 25.95 26.06 27.075 30.9 2.655058
## 4 VC 0.5 4.2 5.950 7.15 7.98 10.900 11.5 2.746634
## 5 VC 1.0 13.6 15.275 16.50 16.77 17.300 22.5 2.515309
## 6 VC 2.0 18.5 23.375 25.95 26.14 28.800 33.9 4.797731
Here's a box plot that displays the summary of the data graphically.
bPlot <- ggplot(ToothGrowth, aes(x = dose, y = len)) + facet_grid(.~supp) + geom_boxplot(aes(fill = factor(dose)))
bPlot <- bPlot + labs(x="Dose", y = "Length", title="Dose vs Length by Supplement")
print(bPlot)
Looking at the plot, tooth length appears to increase with dose. Orange juice (OJ) also appears to be associated with greater length than ascorbic acid (VC), at least at the two lower doses.
We first test the hypothesis that an increase in dosage results in an increase in length.
Let's assume that the data are not paired, that the individuals' responses are independent, and that each individual was measured exactly once. Let us also assume that the variances are equal across groups, since all the animals come from the same population of guinea pigs and we assume that they were assigned to the treatments through proper randomization.
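The equal variance assumption can be examined rather than taken on faith. The sketch below is not part of the original analysis: it prints the standard deviation at each dose, runs an F test on the two lower doses (var.test, which itself assumes normal data), and computes a Welch interval that does not rely on equal variances. The names sd_by_dose, low_vs_mid and welch are ours.

```r
library(datasets)

# Sample standard deviation of tooth length at each dose level
sd_by_dose <- tapply(ToothGrowth$len, ToothGrowth$dose, sd)
print(sd_by_dose)

# F test for equality of two variances, here for the 0.5 and 1 mg groups
low_vs_mid <- subset(ToothGrowth, dose %in% c(0.5, 1))
print(var.test(len ~ dose, data = low_vs_mid))

# Welch's t-test drops the equal variance assumption altogether
welch <- t.test(len ~ dose, var.equal = FALSE, data = low_vs_mid)
print(welch$conf.int)
```

If the Welch interval is close to the pooled one, the conclusions do not hinge on the equal variance assumption.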
Let's subset the data by dosage.
d.5 <- subset(ToothGrowth, dose == 0.5)
d1 <- subset(ToothGrowth, dose == 1)
d2 <- subset(ToothGrowth, dose == 2)
To test our hypothesis we will run a series of t-tests (t.test), comparing each pair of dosages. Recall that we treat the samples as unpaired, which is the default in t.test, and that we assume equal variances (var.equal = TRUE).
t.test(len ~ dose, var.equal = TRUE, data = rbind(d.5, d1))  # unpaired by default
##
## Two Sample t-test
##
## data: len by dose
## t = -6.4766, df = 38, p-value = 1.266e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983748 -6.276252
## sample estimates:
## mean in group 0.5 mean in group 1
## 10.605 19.735
The 95 percent confidence interval for the difference in means (0.5 mg minus 1 mg) lies entirely below zero, so the mean length for subjects given the 1 mg dose is significantly higher than for those given the 0.5 mg dose.
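To see where this interval comes from, here is a sketch that rebuilds it by hand from the pooled variance formula, using the same equal variance assumption. The variable names (x, y, sp2, se, manual_ci) are ours.

```r
library(datasets)

x <- ToothGrowth$len[ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$dose == 1]
nx <- length(x)
ny <- length(y)

# Pooled variance: the two sample variances weighted by their degrees of freedom
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)

# Standard error of the difference in means
se <- sqrt(sp2 * (1 / nx + 1 / ny))

# 95 percent interval based on the t distribution with nx + ny - 2 degrees of freedom
manual_ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, nx + ny - 2) * se
print(manual_ci)
```

The result should agree with the interval reported by t.test above.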
t.test(len ~ dose, var.equal = TRUE, data = rbind(d.5, d2))  # unpaired by default
##
## Two Sample t-test
##
## data: len by dose
## t = -11.799, df = 38, p-value = 2.838e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.15352 -12.83648
## sample estimates:
## mean in group 0.5 mean in group 2
## 10.605 26.100
Here the confidence interval is again entirely below zero: subjects given the 2 mg dose show a greater mean length than those given the 0.5 mg dose.
t.test(len ~ dose, var.equal = TRUE, data = rbind(d1, d2))  # unpaired by default
##
## Two Sample t-test
##
## data: len by dose
## t = -4.9005, df = 38, p-value = 1.811e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.994387 -3.735613
## sample estimates:
## mean in group 1 mean in group 2
## 19.735 26.100
Finally, the mean length for individuals given the 2 mg dose is significantly higher than for those given the 1 mg dose.
These results tell us that the null hypothesis of no relation between dosage and length can be rejected in favor of our alternative hypothesis: there is a positive relation between dose and length.
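Because we ran three tests on the same data, it is worth checking that the conclusion survives a multiple comparison correction. The sketch below applies a Bonferroni adjustment to the three p-values reported above. The correction is our addition, not part of the original analysis, and the names p_raw and p_bonf are ours.

```r
# p-values copied from the three t-test outputs above
p_raw <- c("0.5 vs 1" = 1.266e-07,
           "0.5 vs 2" = 2.838e-14,
           "1 vs 2"   = 1.811e-05)

# Bonferroni multiplies each p-value by the number of tests (capped at 1)
p_bonf <- p.adjust(p_raw, method = "bonferroni")
print(p_bonf)

# Which comparisons remain significant at the 5 percent level?
print(p_bonf < 0.05)
```

All three adjusted p-values stay far below 0.05, so the correction does not change the conclusion.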
For our next analysis, recall that the box plot led us to hypothesize that individuals given orange juice experienced more growth in length than those given ascorbic acid.
Let's subset the data by supplement.
oj <- subset(ToothGrowth, supp == "OJ")
vc <- subset(ToothGrowth, supp == "VC")
Let's perform a t-test (t.test), maintaining our previous assumptions.
t.test(len ~ supp, var.equal = TRUE, data = rbind(oj, vc))  # unpaired by default
##
## Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1670064 7.5670064
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
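Our confidence interval contains 0 and the p-value (0.06) is above 0.05, so we cannot reject the null hypothesis at the 5 percent level. When all doses are pooled, the data do not provide sufficient evidence of a difference in mean length between the two supplements. This is not the same as showing that the supplements have the same effect.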
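The comparison above pools the three doses, which can hide a supplement effect that shows up only at some doses, as the summary table hints. One way to look into this, sketched below under the same unpaired, equal variance assumptions, is to repeat the supplement test within each dose. This goes beyond the original analysis, and the names by_dose and supp_by_dose are ours.

```r
library(datasets)

# Run the OJ vs VC t-test separately within each dose level
by_dose <- lapply(split(ToothGrowth, ToothGrowth$dose), function(d) {
  tt <- t.test(len ~ supp, var.equal = TRUE, data = d)
  c(lower = tt$conf.int[1], upper = tt$conf.int[2], p.value = tt$p.value)
})

# One row per dose: confidence limits for mean(OJ) - mean(VC) and the p-value
supp_by_dose <- do.call(rbind, by_dose)
print(supp_by_dose)
```

Each row should be read with the small group sizes in mind (10 animals per supplement at each dose), and the multiple comparison caveat applies here as well.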