Statistical Inference

Course Project

Inferential Analysis

By Daniel Perez

We are going to analyze the ToothGrowth data in the R datasets package.

Load the ToothGrowth data and perform some basic exploratory data analyses.
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.
State your conclusions and the assumptions needed for your conclusions.

We will begin by loading the required libraries to do the analysis.

library(ggplot2)
library(datasets)
library(dplyr)

Then let's load the data and check its structure (using str). We also look at the first few rows (using head) to get a sense of what the data set looks like.

data(ToothGrowth)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

From these first steps we know that the data set has 60 observations and three columns. The first column, len, is numeric and represents tooth length. The second, supp, is a factor giving the supplement type (OJ or VC). The last, dose, is numeric and represents the dose in milligrams per day.
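Before summarising, it can help to confirm how the observations are spread across the treatment combinations. The following is a minimal sketch of such a check; the names cell_counts and n_missing are illustrative and not part of the original analysis.

```r
library(datasets)

# Cross-tabulate supplement type against dose to see how many
# animals fall into each treatment combination.
cell_counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(cell_counts)

# Count missing values per column; zeros mean nothing needs imputing.
n_missing <- colSums(is.na(ToothGrowth))
print(n_missing)
```

Equal counts in every cell indicate a balanced design, which makes the group comparisons later in the report easier to interpret.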

Let's summarize the data by supplement and dose. First we group the data set by supplement and dose, then we summarise each group with the minimum, first quartile, median, mean, third quartile, maximum and standard deviation of tooth length.

ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(min = min(len), FirstQ = quantile(len, 0.25), median = median(len),
            mean = mean(len), ThirdQ = quantile(len, 0.75), max = max(len), sd = sd(len))

## Source: local data frame [6 x 9]
## Groups: supp
## 
##   supp dose  min FirstQ median  mean ThirdQ  max       sd
## 1   OJ  0.5  8.2  9.700  12.25 13.23 16.175 21.5 4.459709
## 2   OJ  1.0 14.5 20.300  23.45 22.70 25.650 27.3 3.910953
## 3   OJ  2.0 22.4 24.575  25.95 26.06 27.075 30.9 2.655058
## 4   VC  0.5  4.2  5.950   7.15  7.98 10.900 11.5 2.746634
## 5   VC  1.0 13.6 15.275  16.50 16.77 17.300 22.5 2.515309
## 6   VC  2.0 18.5 23.375  25.95 26.14 28.800 33.9 4.797731

Here's a box plot that displays the summary of the data graphically.

bPlot <- ggplot(ToothGrowth, aes(x = dose, y = len)) + facet_grid(.~supp) + geom_boxplot(aes(fill = factor(dose)))
bPlot <- bPlot + labs(x="Dose", y = "Length", title="Dose vs Length by Supplement")
print(bPlot)

[Figure: box plots of tooth length by dose, one panel per supplement (OJ, VC)]

By looking at the plot we can hypothesize that a larger dose is associated with greater tooth length. It also seems that orange juice is associated with greater length than ascorbic acid.

Dose

We can then hypothesize that an increase in dosage results in an increase in length.

Let's assume that the data are not paired: each individual received a single treatment and was measured once and only once, so the observations in the groups being compared are independent. Let us also assume that the groups have equal variances, on the grounds that the guinea pigs come from the same population and were assigned to treatments through proper randomization.
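The equal-variance assumption can be probed rather than taken on faith. The sketch below, which is an addition and not part of the original analysis, compares the pooled interval with the Welch (unequal-variance) interval for the 0.5 mg and 1 mg groups and runs an F test on the two variances. The names low_doses, pooled and welch are illustrative.

```r
library(datasets)

# Keep only the two lowest dose groups for this comparison.
low_doses <- subset(ToothGrowth, dose %in% c(0.5, 1))

# Same comparison under both variance assumptions.
pooled <- t.test(len ~ dose, data = low_doses, var.equal = TRUE)
welch  <- t.test(len ~ dose, data = low_doses, var.equal = FALSE)

# Side-by-side 95% confidence intervals for the difference in means.
print(rbind(pooled = pooled$conf.int, welch = welch$conf.int))

# F test for the ratio of the two group variances.
print(var.test(len ~ dose, data = low_doses))
```

If the two intervals are close and the F test gives no strong evidence of unequal variances, the pooled test is a reasonable choice. Otherwise the Welch version is the safer default.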

Let's subset the data by dosage.

d.5 <- subset(ToothGrowth, dose == 0.5)
d1 <- subset(ToothGrowth, dose == 1)
d2 <- subset(ToothGrowth, dose == 2)

To test our hypothesis we will run a series of t-tests (t.test) comparing each pair of dosages. Remember that we are assuming unpaired data and equal variances.

0.5 mg vs 1 mg

# Unpaired is the default for the formula interface; newer R releases reject an explicit paired argument here.
t.test(len~dose, var.equal = TRUE, data=rbind(d.5, d1))

## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 38, p-value = 1.266e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983748  -6.276252
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

The 95 percent confidence interval for the difference in means (0.5 mg minus 1 mg) lies entirely below zero, so the mean tooth length of subjects given a 1 mg dose is significantly higher than that of subjects given a 0.5 mg dose.
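To make the source of this interval concrete, here is a sketch that rebuilds the pooled-variance confidence interval by hand from the standard two-sample formula. The variable names (x, y, sp2, se, dof, manual_ci) are illustrative only.

```r
library(datasets)

# Tooth lengths for the two dose groups being compared.
x <- ToothGrowth$len[ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$dose == 1]
n_x <- length(x)
n_y <- length(y)

# Pooled variance: the two sample variances weighted by their degrees of freedom.
dof <- n_x + n_y - 2
sp2 <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / dof

# Standard error of the difference in means under equal variances.
se <- sqrt(sp2 * (1 / n_x + 1 / n_y))

# 95% interval: difference in means plus or minus the t quantile times the standard error.
manual_ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, dof) * se
print(manual_ci)
```

The result should match the interval printed by t.test above, up to rounding.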

0.5 mg vs 2 mg

t.test(len~dose, var.equal = TRUE, data=rbind(d.5, d2))

## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 38, p-value = 2.838e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15352 -12.83648
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

Here the confidence interval is again entirely below zero, showing that subjects given a 2 mg dose had a greater mean tooth length than those given a 0.5 mg dose.

1 mg vs 2 mg

t.test(len~dose, var.equal = TRUE, data=rbind(d1, d2))

## 
##  Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 38, p-value = 1.811e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.994387 -3.735613
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

Finally, the mean tooth length of individuals given a 2 mg dose was significantly higher than that of individuals given a 1 mg dose.

Together, these results tell us that the null hypothesis of no relation between dosage and tooth length can be rejected in favor of our alternative hypothesis: there is a positive relation between dose and tooth length.
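Because three tests were run on the same data, one may want to check that the conclusion survives a multiple-comparison correction. The sketch below applies a Bonferroni adjustment to the three p-values reported above. The correction is an addition to the original analysis, and the names p_raw and p_bonf are illustrative.

```r
# p-values copied from the three pairwise t-test outputs above.
p_raw <- c("0.5 vs 1" = 1.266e-07,
           "0.5 vs 2" = 2.838e-14,
           "1 vs 2"   = 1.811e-05)

# Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_bonf <- p.adjust(p_raw, method = "bonferroni")
print(p_bonf)

# Which comparisons remain significant at the 5% level after adjustment?
print(p_bonf < 0.05)
```

Bonferroni is a conservative choice. Passing a different method to p.adjust, such as "holm", would be a less strict alternative.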

Supplement

For our next analysis, recall from the box plot our hypothesis that individuals given orange juice experienced more tooth growth than those given ascorbic acid.

Let's subset the data by supplement.

oj <- subset(ToothGrowth, supp == "OJ")
vc <- subset(ToothGrowth, supp == "VC")

Let's perform a t-test (t.test), maintaining our previous assumptions.

t.test(len~supp, var.equal = TRUE, data=rbind(oj, vc))

## 
##  Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1670064  7.5670064
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Our confidence interval contains 0 and the p-value (0.06) is above 0.05, so we cannot reject the null hypothesis at the 5% level. Pooled over all doses, the data do not show a statistically significant difference in tooth length between individuals that were fed different supplements.
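The test above pools all doses together, while the summary table suggests the gap between supplements may differ by dose. One possible follow-up, sketched here as an addition to the original analysis, repeats the supplement comparison within each dose level under the same equal-variance assumption. The names doses and supp_by_dose are illustrative.

```r
library(datasets)

# Run the OJ vs VC comparison separately within each dose level,
# keeping the equal-variance assumption used in the rest of the report.
doses <- sort(unique(ToothGrowth$dose))
supp_by_dose <- sapply(doses, function(d) {
  res <- t.test(len ~ supp, data = subset(ToothGrowth, dose == d),
                var.equal = TRUE)
  c(dose = d,
    ci_lower = res$conf.int[1],
    ci_upper = res$conf.int[2],
    p_value = res$p.value)
})

# One row per dose: confidence interval for mean(OJ) - mean(VC) and its p-value.
print(t(supp_by_dose))
```

Intervals that exclude zero at some doses but not others would indicate that the supplement effect depends on dose, which the pooled test cannot show.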