Statistical Inference Course Project

Overview

In this project we will analyze the ToothGrowth data from the R datasets package.
According to the R documentation, the dataset records “the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs”, each of which received “one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods” (also called “supplement type”): orange juice, coded as ‘OJ’, or ascorbic acid, a form of vitamin C, coded as ‘VC’.

Basic Exploratory Data Analyses

In this part, we will take a brief look at the structure and characteristics of the data.

# loading and exploring the data
data("ToothGrowth")
str( ToothGrowth )

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

# building a contingency table
table( ToothGrowth[,2:3] )

##     dose
## supp 0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

# exploring the averages of length by supplement type
aggregate( len ~ supp, data=ToothGrowth, FUN=mean )

##   supp      len
## 1   OJ 20.66333
## 2   VC 16.96333

# exploring the averages of length by dose
aggregate( len ~ dose, data=ToothGrowth, FUN=mean )

##   dose    len
## 1  0.5 10.605
## 2  1.0 19.735
## 3  2.0 26.100
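
The group means say nothing about spread, which matters later when choosing between an equal-variance and an unequal-variance t-test. As a minimal sketch (not part of the original analysis), the same `aggregate` call can report the standard deviation per group. The names `sd_by_supp` and `sd_by_dose` are introduced here for illustration only.

```r
# Sketch: within-group spread of tooth length, using base R only
data("ToothGrowth")

# standard deviation of length within each supplement type
sd_by_supp <- aggregate( len ~ supp, data=ToothGrowth, FUN=sd )

# standard deviation of length within each dose level
sd_by_dose <- aggregate( len ~ dose, data=ToothGrowth, FUN=sd )

print( sd_by_supp )
print( sd_by_dose )
```

If the standard deviations differ noticeably between groups, that supports using a test that does not pool the variances.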

# Making a figure of two box plots to visualize the above findings
par( mfcol=c(1,2) )
boxplot( len ~ supp, data=ToothGrowth, cex.axis=0.75, ylab="tooth length", xlab="supplement",
         main="Tooth Growth and Supplements", col=c("orange","linen"), sep=":" )
legend("bottomleft", c("Orange Juice","Ascorbic Acid"), fill = c("orange","linen"), cex=0.75)

boxplot( len~dose, data=ToothGrowth, cex.axis=0.75, ylab="tooth length", xlab="dose (mg/day)",
          main="Tooth Growth and Dosage", col=c("green","yellow","red"), sep=":" )
legend( "bottomright", c("0.5 mg/day","1 mg/day","2 mg/day"), fill=c("green","yellow","red"),
        cex=0.85 )

# exploring the averages of length by supplement type and dose
aggregate( len ~ supp + dose, data=ToothGrowth, FUN=mean )

##   supp dose   len
## 1   OJ  0.5 13.23
## 2   VC  0.5  7.98
## 3   OJ  1.0 22.70
## 4   VC  1.0 16.77
## 5   OJ  2.0 26.06
## 6   VC  2.0 26.14

# Making a figure of two box plots to visualize the above findings
par( mfcol=c(1,2) )
boxplot( len ~ dose+supp, data=ToothGrowth, cex.axis=0.75, ylab="tooth length",
         xlab="dose:supplement (mg/day)", main="Tooth Growth: Supplement and its dosage",
         col=c("orange","orange","orange","linen","linen","linen"), sep=":" )
legend("bottomright", c("Orange Juice","Ascorbic Acid"), fill = c("orange","linen"), cex=0.75)

boxplot( len ~ supp+dose, data=ToothGrowth, cex.axis=0.75, ylab="tooth length",
         xlab="supplement:dose (mg/day)",
         main="Tooth Growth: Dosage and its supplement", col=c("orange","linen"), sep=":" )
legend("bottomright", c("Orange Juice","Ascorbic Acid"), fill = c("orange","linen"), cex=0.85)

Basic Summary

The basic exploratory data analysis shows that:

- the dataset has 60 measurements (of the length of odontoblasts) in the variable \(len\)
- these observations are split into independent groups by two factors: dose level (\(dose\)) and supplement type (\(supp\))
- as per project instructions, we will use these groups “to compare tooth growth by supp and dose”
- according to the contingency table, we can select the following groups for comparison:
  - the OJ and VC delivery methods (disregarding dosage), 30 observations each
  - the 0.5, 1.0 and 2.0 dose levels (disregarding delivery method), 20 observations each
- observations in these groups are not paired
- the sample mean for OJ looks greater than the one for VC (20.663 vs 16.963)
  - \(H_0\) is \(\mu_{OJ} = \mu_{VC}\) and \(H_a\) is \(\mu_{OJ} > \mu_{VC}\)
- the sample means appear to increase (10.605, 19.735, 26.100) with increasing dose
  - \(H_0\) for dose levels 2.0 and 1.0 is \(\mu_{2.0} = \mu_{1.0}\) and \(H_a\) is \(\mu_{2.0} > \mu_{1.0}\)
  - \(H_0\) for dose levels 1.0 and 0.5 is \(\mu_{1.0} = \mu_{0.5}\) and \(H_a\) is \(\mu_{1.0} > \mu_{0.5}\)

Hypothesis Tests

We will use the Welch Two Sample t-test at the default 95% confidence level.
The tests are one-sided because our alternative hypotheses are directional: “greater” (\(\mu_1 > \mu_0\)) or “less” (\(\mu_1 < \mu_0\)). With the formula interface, t.test takes the difference as the first group minus the second, in factor-level order. The dose comparisons therefore use “less”: the lower dose comes first, and “the mean at the lower dose is less than the mean at the higher dose” is the same statement as the “greater” alternatives listed in the summary.
We assume unequal variances (a different variance per group) as no relevant information is available.
We also assume that the groups are independent.
The null hypothesis is \(\mu_1 = \mu_0\) (\(\mu_1 - \mu_0 = 0\)), i.e. there is no difference between the population means. We will reject it in favor of the alternative hypothesis if the t-test p-value is smaller than 0.05 (\(\alpha = 0.05\)).
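
To make the test less of a black box, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed by hand for the OJ versus VC comparison and checked against `t.test`. This is a sketch only; the variable names (`oj`, `vc`, `se2_oj`, `se2_vc`, `t_stat`, `df_welch`, `p_manual`) are introduced here for illustration.

```r
# Sketch: Welch two-sample t-test computed by hand (one-sided, "greater")
data("ToothGrowth")
oj <- ToothGrowth$len[ ToothGrowth$supp == "OJ" ]
vc <- ToothGrowth$len[ ToothGrowth$supp == "VC" ]

# squared standard error of each group mean (variances are NOT pooled)
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# t statistic: difference in sample means over its standard error
t_stat <- ( mean(oj) - mean(vc) ) / sqrt( se2_oj + se2_vc )

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- ( se2_oj + se2_vc )^2 /
            ( se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1) )

# one-sided p-value for the alternative "mean OJ > mean VC"
p_manual <- pt( t_stat, df_welch, lower.tail=FALSE )

# the built-in test should agree with the manual values
tt <- t.test( len ~ supp, data=ToothGrowth, alternative="greater", var.equal=FALSE )
all.equal( unname(tt$statistic), t_stat )
all.equal( unname(tt$parameter), df_welch )
all.equal( tt$p.value, p_manual )
```

Because the variances are not pooled, the degrees of freedom are generally not an integer, which is why the test output reports a fractional value.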

Hypothesis for the supplement OJ vs VC

# test for the supplement OJ vs VC
t.test( len ~ supp, data=ToothGrowth, alternative="greater", var.equal=FALSE )

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

With this test (p-value 0.03032 \(< \alpha = 0.05\)), we have statistical evidence to reject the null hypothesis, \(\mu_{OJ} = \mu_{VC}\), in favor of the alternative hypothesis, \(\mu_{OJ} > \mu_{VC}\).
This suggests that orange juice (OJ) is a more effective delivery method of vitamin C than ascorbic acid (VC) for tooth growth in guinea pigs when dosage is disregarded.
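
The comparison above pools all three doses, while the exploratory table of means by supplement and dose shows almost identical averages for OJ and VC at 2 mg/day. A possible follow-up, sketched here and not part of the original analysis, repeats the same one-sided Welch test within each dose level. The names `by_dose` and `p_by_dose` are hypothetical.

```r
# Sketch: OJ vs VC compared separately at each dose level
data("ToothGrowth")

# one data frame per dose level (list names are "0.5", "1" and "2")
by_dose <- split( ToothGrowth, ToothGrowth$dose )

# one-sided Welch test (mean OJ > mean VC) within each dose, keeping the p-value
p_by_dose <- sapply( by_dose, function(d) {
  t.test( len ~ supp, data=d, alternative="greater", var.equal=FALSE )$p.value
} )

round( p_by_dose, 4 )
```

Each of these tests uses only 10 observations per group, so they have less power than the pooled comparison and should be read with that in mind.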

Hypotheses for the dose levels 2.0 vs 1.0 and 1.0 vs 0.5

# test for the dose levels 2.0 vs 1.0
t.test( len ~ dose, data=ToothGrowth, alternative="less", var.equal=FALSE,
        subset=dose %in% c(1.0,2.0)  )

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf -4.17387
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

# test for the dose levels 1.0 vs 0.5
t.test( len ~ dose, data=ToothGrowth, alternative="less", var.equal=FALSE,
        subset=dose %in% c(0.5,1.0)  )

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -6.753323
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

Both tests give p-values (9.532e-06 and 6.342e-08) below \(\alpha = 0.05\), so we have statistical evidence to reject the null hypotheses, \(\mu_{2.0} = \mu_{1.0}\) and \(\mu_{1.0} = \mu_{0.5}\), in favor of the alternative hypotheses, \(\mu_{2.0} > \mu_{1.0}\) and \(\mu_{1.0} > \mu_{0.5}\).
This suggests that increasing the dose of vitamin C (disregarding delivery method) increases tooth growth in guinea pigs.
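
The calls above pass `alternative="less"` because the formula interface puts the lower dose first. As a sketch of an equivalent formulation, the two groups can be passed as vectors with the higher dose first, so that `alternative="greater"` matches the hypothesis as it was written. The names `len_dose2`, `len_dose1` and `res` are introduced here for illustration.

```r
# Sketch: dose 2.0 vs 1.0 with the higher dose as the first group
data("ToothGrowth")
len_dose2 <- ToothGrowth$len[ ToothGrowth$dose == 2.0 ]
len_dose1 <- ToothGrowth$len[ ToothGrowth$dose == 1.0 ]

# the difference is now mean(dose 2.0) - mean(dose 1.0), so "greater" is the alternative
res <- t.test( len_dose2, len_dose1, alternative="greater", var.equal=FALSE )

res$statistic   # same magnitude as in the formula call, opposite sign
res$p.value     # same p-value as in the formula call
```

Only the sign of the t statistic and the direction of the confidence bound change; the conclusion is the same.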

Conclusions

- Orange juice (OJ) appears to be a more effective delivery method of vitamin C than ascorbic acid (VC) for tooth growth in guinea pigs when dosage is disregarded.
- Increasing the dose of vitamin C appears to increase tooth growth in guinea pigs when delivery method is disregarded.

These conclusions rest on the assumptions that the groups are independent random samples and that each group has its own variance (unequal variances). Each test was run with a Type I error rate of \(\alpha = 0.05\), i.e. a 5% chance of rejecting a null hypothesis that is in fact true.
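
Because three tests were run, the chance of at least one false rejection across all of them is higher than for a single test. One hedged way to check whether the conclusions survive is a Holm adjustment with the base function `p.adjust`. This step is not part of the original analysis. The p-values are copied from the test outputs above, and the names `p_raw` and `p_holm` are hypothetical.

```r
# Sketch: Holm adjustment of the three reported p-values
p_raw <- c( supp_OJ_vs_VC = 0.03032,
            dose_2_vs_1   = 9.532e-06,
            dose_1_vs_0.5 = 6.342e-08 )

# Holm's step-down method controls the family-wise error rate
p_holm <- p.adjust( p_raw, method="holm" )

p_holm
p_holm < 0.05   # all three remain below 0.05 after adjustment
```

The supplement comparison has the largest p-value, so Holm leaves it unchanged at 0.03032, and the two dose comparisons stay far below the threshold.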