Statistical Inference Project 2

echo = TRUE
library(ggplot2)

This report deals with analysis of the ToothGrowth data in the R data sets package. The data is set of 60 observations, length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1 and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).

Data loading and processing

Data ToothGrowth will be loaded and analysed here.

data(ToothGrowth)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

plot(ToothGrowth)

The structure of the data set suggests the dose variable is numeric, but in fact, we have seen that actually there are only three dose levels of Vitamin C (0.5, 1 and 2 mg). A more clear view of this, is the third row (or the third column, they are equivalent) of the scatter plot matrix, provided above. We will transform the variable dose to a factor variable

ToothGrowth$dose <- as.factor(ToothGrowth$dose)
summary(ToothGrowth$dose)

## 0.5   1   2 
##  20  20  20

Exploratory Data Analysis

So, the mean of the len variable is 18.8133, but this refers to all the 60 observations of the data set. The mean for the two levels of the supply method (supp) are described as below.

meansupp = split(ToothGrowth$len, ToothGrowth$supp)
sapply(meansupp, mean)

##       OJ       VC 
## 20.66333 16.96333

ggplot(aes(x=supp, y=len), data=ToothGrowth) + geom_boxplot(aes(fill=supp))+ 
        xlab("Supplement type") +ylab("Tooth length")

e will now check the effect of vitamin C dose on tooth length. So first we will calculate the means of the three levels of dosage.

meandose = split(ToothGrowth$len, ToothGrowth$dose)
sapply(meandose, mean)

##    0.5      1      2 
## 10.605 19.735 26.100

ggplot(aes(x=dose, y=len), data=ToothGrowth) + geom_boxplot(aes(fill=dose)) + 
        xlab("Dose in miligrams") +ylab("Tooth length")

Can we tell if the tooth length of the guinea pigs is greater when the delivery methods is the orange juice compared to the tooth length when the delivery method is the ascorbic acid? To answer that question we will use the provided data and perform a statistical t test for the difference if the means of two independent groups.

len<-ToothGrowth$len
supp<-ToothGrowth$supp
dose<-ToothGrowth$dose

Let Ho be the null hypotheses of equal means between the two groups, versus the alternative hypothesis that the two means are different. This means that a two sided test will be performed, and so the confidence interval is absolutely equivalent to the t-test.

First we will check the variances to see if they are close enough to be assumed to be equal

sapply(meansupp, var)

##       OJ       VC 
## 43.63344 68.32723

Variances of the two groups seem pretty far, so in the test we will not assume equality of variances.

t.test(len[supp=="OJ"], len[supp=="VC"], paired = FALSE, var.equal = FALSE)

## 
##  Welch Two Sample t-test
## 
## data:  len[supp == "OJ"] and len[supp == "VC"]
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

As we can see from the results provided the p-value of this test is 0.06, which at a significance level of 5%, we have not enough evidence to reject the null hypothesis, which subsequently means that we cannot assume that the means of delivery of the vitamin has en effect on tooth length.

The confidence interval of the test is: (-0.171, 7.571), so we can see that the confidence interval contains zero (0), as the test proved.

We will test of the tooth length of the group with vitamin C dosage We will test all three combinations, but let us start with the test whether the mean tooth length of the of the group with vitamin C dose of 2mg, is equal to the group with vitamin C dosage of 1mg.

t.test(len[dose==2], len[dose==1], paired = FALSE, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  len[dose == 2] and len[dose == 1]
## t = 4.9005, df = 38, p-value = 1.811e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.735613 8.994387
## sample estimates:
## mean of x mean of y 
##    26.100    19.735

As we can see from the results provided the p-value of this test is 0, which at a significance level of 5%, we have enough evidence to reject the null hypothesis, which subsequently means that we can assume that the means of dosage change from 1mg to 2mg creates an positive effect on tooth length, tooth length increases.

The confidence interval of the test is: (3.7356, 8.9944), so we can see that the confidence interval does not contain zero (0), as the test proved.

We will go on with the test whether the mean tooth length of the of the group with vitamin C dose of 1mg, is equal to the group with vitamin C dosage of 0.5mg.

t.test(len[dose==1], len[dose==0.5], paired = FALSE, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  len[dose == 1] and len[dose == 0.5]
## t = 6.4766, df = 38, p-value = 1.266e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.276252 11.983748
## sample estimates:
## mean of x mean of y 
##    19.735    10.605

The confidence interval of the test is: (6.2763, 11.9837), so we can see that the confidence interval does not contain zero (0), as the test proved.

We thing that is totally redundant to check if the mean tooth length differs between the groups of vitamin C dosage of 0.5 and 2mg, so we will just mention the 95% confidence interval, to prove that the value of zero is not included. (12.8365, 18.1535)

Multiple Comparision

Up to now we have made 4 tests, and thus we have calculated 4 different pvalues for those tests.

We will now check the multiple comparisons by adjusting for

Family-wise error rate (FWER)
False discovery rate (FDR)

# Store pvalues in a vector
pValues<-c(t.test(len[supp=="OJ"], len[supp=="VC"], paired = FALSE, var.equal = FALSE)$p.value,
           t.test(len[dose==2], len[dose==1], paired = FALSE, var.equal = TRUE)$p.value,
           t.test(len[dose==1], len[dose==0.5], paired = FALSE, var.equal = TRUE)$p.value,
           t.test(len[dose==2], len[dose==0.5], paired = FALSE, var.equal = TRUE)$p.value)

# Controls FWER
sum(p.adjust(pValues, method = "bonferroni") < 0.05)

## [1] 3

round(p.adjust(pValues, method = "bonferroni"),3)

## [1] 0.243 0.000 0.000 0.000

# Controls FDR
sum(p.adjust(pValues, method = "BH") < 0.05)

## [1] 3

round(p.adjust(pValues, method = "BH"),3)

## [1] 0.061 0.000 0.000 0.000

We can see that only the first p-value is considered big, but that was logical as even in the 5% significance level it was not small enough to reject the null hypothesis, so in the plots below, it is the p-value of the first test we made that shows up in the top right corner of the plots, and the other three falling one on another at the bottom left corner.

par(mfrow = c(1, 2))
plot(pValues, p.adjust(pValues, method = "bonferroni"), pch = 19)
plot(pValues, p.adjust(pValues, method = "BH"), pch = 19)

The results were very logical and quite predictable I would say, as it was shown from the box plots at the beginning.

Conclusions

Supplement type has no effect on tooth growth.
Increasing the dose level leads to increased tooth growth.

Assumptions