Overview

In this project we carry out an initial exploratory analysis of the ToothGrowth data set. This exploration gives us a basic understanding of the data, from which we formulate hypotheses that we then test using basic statistical inference techniques, namely confidence intervals and hypothesis tests.

Exploratory analysis

The ToothGrowth data set https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/ToothGrowth.html comes from a study of the factors that affect tooth growth in guinea pigs. The data consist of 60 observations of 3 variables:

data(ToothGrowth)
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
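
Because summary() treats dose as a continuous variable, it does not show how the 60 observations are spread over the combinations of delivery method and dose. A minimal sketch of a complementary check follows; the object names design and dose_levels are illustrative and not part of the original analysis.

```r
# Cross-tabulate delivery method against dose to see how many animals
# fall into each combination (illustrative object names).
data(ToothGrowth)
design <- table(ToothGrowth$supp, ToothGrowth$dose)
print(design)

# dose is stored as a number; list its distinct values
dose_levels <- sort(unique(ToothGrowth$dose))
print(dose_levels)
```

Equal counts in every cell would indicate a balanced design, which makes the group comparisons that follow easier to interpret.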

The first variable, len, is the tooth length and is the response variable of the experiment. It was measured in response to 3 dose levels of vitamin C (0.5, 1 and 2 mg/day) administered by 2 delivery methods: orange juice (OJ) and ascorbic acid (VC). The following figure summarizes the data set using boxplots.

# Boxplots for ascorbic acid (VC), shifted left of each dose position
boxplot(len ~ dose, data = ToothGrowth, subset = supp == "VC",
        boxwex = 0.15, at = 1:3 - 0.2, col = "blue",
        main = "Tooth growth of guinea pigs",
        xlab = "Dose of vitamin C (in mg)", ylab = "Tooth length",
        xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
# Boxplots for orange juice (OJ), added to the same plot and shifted right
boxplot(len ~ dose, data = ToothGrowth, subset = supp == "OJ", add = TRUE,
        boxwex = 0.15, at = 1:3 + 0.2, col = "green")
# Vertical lines separating the three dose groups
abline(v = c(1.5, 2.5), lwd = 1, col = "black", lty = 1)
legend(1.9, 10.1, c("Ascorbic acid (VC)", "Orange juice (OJ)"),
       fill = c("blue", "green"))

These summary plots suggest that, regardless of the delivery method, higher doses of vitamin C are associated with greater tooth growth. However, at least at low doses, orange juice (OJ) as a delivery method seems to have a stronger effect on growth than ascorbic acid (VC).
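
The visual impression from the boxplots can be backed by the group means. The following is a small sketch, assuming the mean is an adequate summary for each group; the names means and oj_minus_vc are illustrative.

```r
# Mean tooth length for every combination of delivery method and dose.
data(ToothGrowth)
means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))
print(round(means, 2))

# Difference in mean length, OJ minus VC, at each dose
oj_minus_vc <- means["OJ", ] - means["VC", ]
print(round(oj_minus_vc, 2))
```

Rows correspond to the delivery method and columns to the dose, so reading along a row shows the dose effect and reading down a column shows the delivery-method effect.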

Regarding the analysis, it is worth noting that box plots are already descriptive statistics tools: they represent the data through their quantiles without making any assumption about the underlying statistical distribution. In many cases, when the box plots show a clear result, formal hypothesis testing may add little.

Confidence intervals & Hypothesis tests

Effect of the dose independently of the delivery method

In these first hypothesis tests, the null hypothesis is that there is no difference in mean tooth growth (length) between doses. We perform two such tests, comparing doses 0.5 and 1.0, and doses 1.0 and 2.0.

Doses 0.5 and 1.0

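# Welch two-sample t test. paired = FALSE is the default and is left out
# because recent R releases reject 'paired' in the formula interface.
result <- t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(0.5, 1.0)), var.equal = FALSE)
# 95% confidence interval for mean(dose 0.5) - mean(dose 1.0)
c(result$conf.int[1], result$conf.int[2])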
## [1] -11.983781  -6.276219

Doses 1.0 and 2.0

# Welch two-sample t test (samples are unpaired by default)
result <- t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(1.0, 2.0)), var.equal = FALSE)
# 95% confidence interval for mean(dose 1.0) - mean(dose 2.0)
c(result$conf.int[1], result$conf.int[2])
## [1] -8.996481 -3.733519

In both cases the 95% confidence interval for the difference in means (lower dose minus higher dose) lies entirely below zero, so we reject the null hypothesis that the mean lengths for the two doses are equal. We conclude that, for the doses tested, a higher dose is associated with a statistically significant increase in mean tooth length.
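
The sign of these intervals depends on the order in which t.test() subtracts the groups. The sketch below prints the group means and the p-value alongside each comparison; compare_doses is a hypothetical helper introduced only for illustration and is not part of the original analysis.

```r
data(ToothGrowth)

# Hypothetical helper: Welch t test of len between two dose levels.
compare_doses <- function(d1, d2) {
  two_doses <- ToothGrowth[ToothGrowth$dose %in% c(d1, d2), ]
  t.test(len ~ dose, data = two_doses, var.equal = FALSE)
}

for (pair in list(c(0.5, 1.0), c(1.0, 2.0))) {
  res <- compare_doses(pair[1], pair[2])
  cat("Doses", pair[1], "vs", pair[2], "\n")
  # Group means; the interval is for the first mean minus the second
  print(res$estimate)
  cat("p-value:", format.pval(res$p.value), "\n")
}
```

Since the lower dose is the first group, a negative interval means the higher dose has the larger mean length.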

Effect of the delivery method independently of dose

# Welch two-sample t test (samples are unpaired by default)
result <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)
# 95% confidence interval for mean(OJ) - mean(VC)
c(result$conf.int[1], result$conf.int[2])
## [1] -0.1710156  7.5710156

In this case zero, the value of the mean difference under the null hypothesis, falls inside the confidence interval. Therefore, overall, we cannot reject at the 5% significance level the null hypothesis that the mean lengths for the two delivery methods are equal. Note that this is not the same as showing that the means are equal. The same reasoning based on confidence intervals applies to the remaining tests.
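
Reading the confidence interval and reading the p-value are two views of the same test. As a rough illustration, the following sketch checks that the 95% interval contains zero exactly when the p-value is above 0.05; the names overall and contains_zero are illustrative.

```r
data(ToothGrowth)
overall <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)

print(overall$estimate)  # mean length for OJ and for VC
print(overall$p.value)   # to be compared with the 0.05 significance level

# Does the 95% confidence interval contain zero?
ci <- overall$conf.int
contains_zero <- ci[1] < 0 && ci[2] > 0

# Both criteria lead to the same decision about the null hypothesis
print(contains_zero == (overall$p.value > 0.05))
```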

Effect of the delivery method at each dose

Dose 0.5

# Welch two-sample t test (samples are unpaired by default)
result <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == 0.5, ], var.equal = FALSE)
# 95% confidence interval for mean(OJ) - mean(VC) at dose 0.5
c(result$conf.int[1], result$conf.int[2])
## [1] 1.719057 8.780943

Dose 1.0

# Welch two-sample t test (samples are unpaired by default)
result <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == 1.0, ], var.equal = FALSE)
# 95% confidence interval for mean(OJ) - mean(VC) at dose 1.0
c(result$conf.int[1], result$conf.int[2])
## [1] 2.802148 9.057852

Dose 2.0

# Welch two-sample t test (samples are unpaired by default)
result <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == 2.0, ], var.equal = FALSE)
# 95% confidence interval for mean(OJ) - mean(VC) at dose 2.0
c(result$conf.int[1], result$conf.int[2])
## [1] -3.79807  3.63807
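
The three per-dose tests follow the same pattern, so they can also be collected into one table. This is only a sketch of an alternative layout, with ci_by_dose as an illustrative name; it should reproduce the intervals printed above.

```r
data(ToothGrowth)
doses <- sort(unique(ToothGrowth$dose))

# One Welch t test per dose; each interval is for mean(OJ) - mean(VC)
ci_by_dose <- t(sapply(doses, function(d) {
  res <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ],
                var.equal = FALSE)
  c(dose = d, lower = res$conf.int[1], upper = res$conf.int[2],
    p.value = res$p.value)
}))
print(ci_by_dose)
```

Following the reasoning used for the earlier tests, the intervals for doses 0.5 and 1.0 lie above zero, while the interval for dose 2.0 contains zero.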

Assumptions & Conclusions

Assumptions

To apply Student's t test to the difference between the means of two populations, we assume that the two samples are independent of each other (unpaired, paired=FALSE, which is the default) and that the population variances may differ (var.equal=FALSE). In every case we applied the default Welch two-sample test with a confidence level of 95% and checked whether the confidence interval contains 0 (corresponding to a difference of 0 between the means of the tested populations).
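
To make the Welch test less of a black box, the interval for the overall delivery-method comparison can be rebuilt by hand. The sketch below assumes the usual Welch standard error and Welch-Satterthwaite degrees of freedom; all object names are illustrative.

```r
data(ToothGrowth)
x <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
y <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Variance of each sample mean (variances are not pooled)
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Standard error of the difference and Welch-Satterthwaite degrees of freedom
se <- sqrt(vx + vy)
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 95% confidence interval for mean(x) - mean(y)
manual_ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se
reference_ci <- as.numeric(t.test(x, y, var.equal = FALSE)$conf.int)

print(manual_ci)
print(all.equal(manual_ci, reference_ci))
```

The manual interval should agree with the one returned by t.test(), since var.equal = FALSE selects this unpooled form.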

Conclusions