In this part of the project, we will analyze the dataset ToothGrowth. This dataset is made of 60 observations of tooth growth of Guinea pigs and the effect of dose of vitamin C on the growth. The delivery method is a two-values factor variable : “OJ” (Orange Juice) and VC (Ascorbi Acid, a form of Vitamin C). The dose given contains three different values : 0.5, 1, and 2 mg/day. The measures are done on 10 Guinea pigs.
Let us show below the structure of the data frame, and the first values.
library(datasets)
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
data(ToothGrowth)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
We check if there are NA values and display the summary information for this dataset :
sum(!complete.cases(ToothGrowth))
## [1] 0
So there is no row with NA value. Let us have a look on the data summary :
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
In this section, we will give a summary exploratory analysis
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
p <- ggplot(ToothGrowth, aes(x=dose, y=len,fill=dose)) + geom_boxplot() + ggtitle("Fig. 1 : dispersion of tooth growth by dose") + xlab("dose in mg") + ylab("tooth length")
p
We can see that the higher the dose is, the longer the teeth are. We can notice that for a dose of 1 mg, the mean is nearly twice than for dose 0.5 mg. The progression is then lower when the dose is 2 mg. The position of the boxes are really different; this is a first clue for suggesting that the tooth length depends a lot of the dose.
Let’s now look at the influence of the delivery method.
p1 <- ggplot(ToothGrowth, aes(x=supp, y=len,fill=supp)) + geom_boxplot() + ggtitle("Dispersion of tooth growth by delivery method") + xlab("delivery method") + ylab("tooth length")
p1
The boxes are quite similar. However, the median is much higher for Orange Juice than for Vitamin C; that means that for Orange Juice, the high values are more numerous than the low values, in comparison with Vitamin C.
Now we will test if the delivery mode has an influence on the tooth growth. nul hypothesis H0 could be formulated as follows :
H0 : The delivery mode of Vitamin C does not have any influence on the tooth growth
dose <- ToothGrowth$dose
supp <- ToothGrowth$supp
len <- ToothGrowth$len
t.test(len[supp == "VC"],len[supp == "OJ"], paired=FALSE)
##
## Welch Two Sample t-test
##
## data: len[supp == "VC"] and len[supp == "OJ"]
## t = -1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.5710156 0.1710156
## sample estimates:
## mean of x mean of y
## 16.96333 20.66333
This first test shows the following :
Now let’s try to test the influence of the dose on the tooth growth
t.test(len[dose == 0.5],len[dose == 1], paired=FALSE)
##
## Welch Two Sample t-test
##
## data: len[dose == 0.5] and len[dose == 1]
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean of x mean of y
## 10.605 19.735
In this test, we can clearly see that the quantity of Vitamin C has an impact on the tooth growth :
An identical conclusion can be taken comparing the length of dose = 1 and dose = 2. This could already be detected from the boxplot above.