In this project we perform an initial exploratory analysis of the ToothGrowth data set. This initial exploration gives us a basic understanding of the data, which allows us to formulate hypotheses that we then test using basic statistical inference techniques such as confidence interval estimation and hypothesis testing.
The ToothGrowth data set (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/ToothGrowth.html) comes from a study of the factors that affect tooth growth in guinea pigs. The data consist of 60 observations of 3 variables:
data(ToothGrowth)
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000  
The first variable, len, is the tooth length, which is the response variable of the experiment. Tooth growth is measured in response to 3 dose levels of vitamin C (0.5, 1 and 2 mg) administered by 2 delivery methods: orange juice (OJ) and ascorbic acid (VC). In the following figure we summarize the information in this data set using box plots.
# Box plots for ascorbic acid (VC), shifted slightly left of each dose position
boxplot(len ~ dose, data = ToothGrowth, boxwex = 0.15, at = 1:3 - 0.2, subset = supp == "VC", col = "blue", main = "Tooth growth of guinea pigs", xlab = "Dose of vitamin C (in mg)", ylab = "Tooth length", xlim = c(0.5, 3.5), ylim = c(0, 35), yaxs = "i")
# Box plots for orange juice (OJ), added to the same plot and shifted slightly right
boxplot(len ~ dose, data = ToothGrowth, add = TRUE, boxwex = 0.15, at = 1:3 + 0.2, subset = supp == "OJ", col = "green")
# Vertical lines separating the three dose levels
abline(v = 1.5, lwd = 1, col = "black", lty = 1)
abline(v = 2.5, lwd = 1, col = "black", lty = 1)
legend(1.9, 10.1, c("Ascorbic acid (VC)", "Orange juice (OJ)"), fill = c("blue", "green"))
From these summary plots we can see that, independently of the delivery method, increasing doses of vitamin C favored tooth growth. However, at least at low doses, orange juice (OJ) as a delivery method seems to have a stronger effect on growth than ascorbic acid (VC).
Regarding the analysis, it is worth noting that box plots are themselves descriptive statistics tools that represent the data through their quantiles, without making any assumption about the underlying statistical distribution. In many cases, when the results from box plots are clear, hypothesis testing might be unnecessary.
In the first hypothesis tests, the null hypothesis is that there is no difference in mean tooth length between doses. We perform two of these tests: dose 0.5 vs. dose 1.0, and dose 1.0 vs. dose 2.0.
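As a numerical companion to the box plots, the group means can be tabulated per delivery method and dose. This is only a minimal sketch in base R; the name `group_means` is introduced here for illustration, and note that the box plots display medians rather than means.

```r
# Mean tooth length for each delivery method (rows) and dose (columns).
data(ToothGrowth)
group_means <- with(ToothGrowth,
                    tapply(len, list(supp = supp, dose = dose), mean))
round(group_means, 2)
```

If the visual impression is right, the OJ row should exceed the VC row at doses 0.5 and 1, with the two rows close together at dose 2.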
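# Welch two-sample t-test: dose 0.5 vs. dose 1.0
result <- t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(0.5, 1.0)), var.equal = FALSE, paired = FALSE)
# 95% confidence interval for the difference in means (dose 0.5 minus dose 1.0)
c(result$conf.int[1], result$conf.int[2])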
## [1] -11.983781 -6.276219
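# Welch two-sample t-test: dose 1.0 vs. dose 2.0
result <- t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(1.0, 2.0)), var.equal = FALSE, paired = FALSE)
# 95% confidence interval for the difference in means (dose 1.0 minus dose 2.0)
c(result$conf.int[1], result$conf.int[2])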
## [1] -8.996481 -3.733519
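In both cases the 95% confidence interval for the difference in means lies entirely below zero, so we reject the null hypothesis at the 5% significance level: the mean lengths at the two doses are not equal. Because each interval is for the mean at the lower dose minus the mean at the higher dose, we conclude that, over the tested doses, an increase in dose leads to a statistically significant increase in tooth length.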
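The sign of these intervals depends on the order in which `t.test` takes the two groups. The following sketch (the names `low_vs_mid` and `mean_diff` are illustrative) prints the group means and the difference the interval refers to. It relies on the defaults of `t.test`, which are an unpaired test without the equal-variance assumption.

```r
data(ToothGrowth)
low_vs_mid <- t.test(len ~ dose,
                     data = subset(ToothGrowth, dose %in% c(0.5, 1.0)))
# Estimated group means; the lower dose is listed first.
low_vs_mid$estimate
# Difference the interval refers to: mean at dose 0.5 minus mean at dose 1.0.
mean_diff <- unname(low_vs_mid$estimate[1] - low_vs_mid$estimate[2])
mean_diff
low_vs_mid$conf.int
```

A negative difference here therefore means that the higher dose has the larger mean tooth length.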
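# Welch two-sample t-test: OJ vs. VC, all doses pooled
result <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE, paired = FALSE)
# 95% confidence interval for the difference in means (OJ minus VC)
c(result$conf.int[1], result$conf.int[2])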
## [1] -0.1710156 7.5710156
Here we compare the two delivery methods over all doses. In this case the value zero for the difference in means (the null hypothesis) falls inside the confidence interval. Therefore, overall, we cannot reject at the 5% significance level the null hypothesis that the mean lengths for the two delivery methods are equal. The same reasoning based on confidence intervals is applied to the remaining tests, which compare the delivery methods at each dose separately.
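For a two-sided t-test, checking whether the 95% confidence interval contains zero is equivalent to checking whether the p-value is at least 0.05. The sketch below makes that correspondence explicit for the pooled comparison; the variable names are illustrative only.

```r
data(ToothGrowth)
by_supp <- t.test(len ~ supp, data = ToothGrowth)
ci <- by_supp$conf.int
# TRUE when zero lies inside the 95% confidence interval.
contains_zero <- ci[1] <= 0 && 0 <= ci[2]
# TRUE when the two-sided test rejects at the 5% level.
reject_null <- by_supp$p.value < 0.05
contains_zero
reject_null
```

The two flags should always be logical opposites of each other, whichever pair of groups is tested.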
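# Welch two-sample t-test: OJ vs. VC at dose 0.5
result <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == 0.5, ], var.equal = FALSE, paired = FALSE)
# 95% confidence interval for the difference in means (OJ minus VC)
c(result$conf.int[1], result$conf.int[2])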
## [1] 1.719057 8.780943
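# Welch two-sample t-test: OJ vs. VC at dose 1.0
result <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == 1.0, ], var.equal = FALSE, paired = FALSE)
# 95% confidence interval for the difference in means (OJ minus VC)
c(result$conf.int[1], result$conf.int[2])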
## [1] 2.802148 9.057852
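# Welch two-sample t-test: OJ vs. VC at dose 2.0
result <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == 2.0, ], var.equal = FALSE, paired = FALSE)
# 95% confidence interval for the difference in means (OJ minus VC)
c(result$conf.int[1], result$conf.int[2])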
## [1] -3.79807 3.63807
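The three per-dose comparisons follow the same pattern, so they can also be written as a single loop over the dose levels. This is an alternative sketch, not the code used above; `ci_by_dose` is an illustrative name, and the call relies on the `t.test` defaults (unpaired, unequal variances).

```r
data(ToothGrowth)
# One Welch t-test (OJ vs. VC) per dose level; keep only the 95% interval.
ci_by_dose <- sapply(split(ToothGrowth, ToothGrowth$dose), function(d) {
  as.numeric(t.test(len ~ supp, data = d)$conf.int)
})
rownames(ci_by_dose) <- c("lower", "upper")
round(ci_by_dose, 3)
```

Each column of the resulting matrix corresponds to one dose and should reproduce the interval printed for that dose above.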
To apply Student's t-test to the difference between the means of two populations, we assume that the populations are independent of each other (paired=FALSE) and we do not assume that their variances are equal (var.equal=FALSE). In every case we applied the default Welch two-sample test with a confidence level of 95% and checked whether the interval contains 0 (corresponding to a difference of 0 between the means of the tested populations).
Independently of the delivery method, the supply of higher doses of vitamin C to guinea pigs favored tooth growth.
If we do not take the doses into consideration, we find no statistically significant overall difference between the delivery methods in their effect on tooth growth in guinea pigs.
If we partition the data by dose, we find that the delivery method significantly affects tooth growth at doses of 0.5 and 1.0 mg, while there is no significant effect at a dose of 2.0 mg.
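To make the Welch procedure concrete, the sketch below recomputes the 95% confidence interval for the pooled OJ vs. VC comparison by hand, using the unpooled standard error and the Welch-Satterthwaite degrees of freedom, and prints it next to the result of `t.test`. All variable names are illustrative.

```r
data(ToothGrowth)
x <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
y <- ToothGrowth$len[ToothGrowth$supp == "VC"]
# Squared standard error of each sample mean (variances are not pooled).
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
se <- sqrt(se2_x + se2_y)
# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))
# 95% confidence interval for mean(OJ) - mean(VC).
ci_manual <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, welch_df) * se
ci_manual
as.numeric(t.test(len ~ supp, data = ToothGrowth)$conf.int)
```

The two printed intervals should agree up to floating-point rounding.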