In this project we will analyze the ToothGrowth data in the R datasets package.
library(datasets) # Load the library
TG <- ToothGrowth # Assigning the data into a new dataframe
str(TG) # Looking at the structure of the datasets and the variables
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
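After looking at the dataset and its help page (via the command ?ToothGrowth), it turns out that the dataset comes from an experiment measuring the length of odontoblasts (the cells responsible for tooth growth) in 60 guinea pigs. Each animal received vitamin C at one of three dose levels (0.5, 1 and 2 mg/day) by one of two delivery methods: orange juice, coded as OJ, or ascorbic acid (a form of vitamin C), coded as VC.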
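As a quick check of that design, the sketch below cross-tabulates delivery method against dose. The name `design` is introduced here for illustration only; it is not part of the original analysis.

```r
library(datasets)

# Count observations for each combination of delivery method and dose
design <- table(ToothGrowth$supp, ToothGrowth$dose)
design

# Total number of animals across all cells
sum(design)
```

Each of the six cells holds 10 animals, so the groups are balanced.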
library(ggplot2) # Load ggplot for some exploratory analysis
library(pastecs) # Loading pastecs to get some descriptive statistics
## Loading required package: boot
stat.desc(TG$len) # Basic summary of the len data
## nbr.val nbr.null nbr.na min max
## 60.0000000 0.0000000 0.0000000 4.2000000 33.9000000
## range sum median mean SE.mean
## 29.7000000 1128.8000000 19.2500000 18.8133333 0.9875223
## CI.mean.0.95 var std.dev coef.var
## 1.9760276 58.5120226 7.6493152 0.4065901
Among these basic descriptive statistics, the range (29.7) and the standard deviation (7.65) stand out, because they show high variability in the data. This raises a question: do the dose and the delivery method have an effect on tooth length? Let’s look at some graphics to explore that idea.
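# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
ggplot(TG, aes(x = supp, y = len, fill = supp)) + geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3) + # mark the group mean
  scale_fill_discrete(labels = c("Orange Juice", "Vitamin C"))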
In the boxplot of the two methods, orange juice seems to have the stronger effect: the mean len of the guinea pigs is 20.663 for orange juice and 16.963 for vitamin C, a difference of 3.7. But that’s not the final conclusion. Let’s look one level further down, by supp and dose.
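The two group means quoted above can be reproduced with base R. This is a minimal sketch; `supp_means` and `mean_diff` are names chosen here for illustration.

```r
library(datasets)

# Mean tooth length for each delivery method
supp_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
round(supp_means, 3)

# Difference in means, orange juice minus vitamin C
mean_diff <- unname(supp_means["OJ"] - supp_means["VC"])
mean_diff
```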
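# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
ggplot(TG, aes(x = interaction(supp, dose), y = len, fill = supp)) + geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 3) + # mark the group mean
  scale_fill_discrete(labels = c("Orange Juice", "Vitamin C"))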
It seems that vitamin C and orange juice have a similar effect at a dose of 2 mg/day. But let’s do some hypothesis testing to draw the final conclusions.
For the summary of the data, we are only interested in the mean and the standard deviation of len by dose and method.
library(reshape2)
dcast(TG, dose ~ supp, value.var = "len", fun.aggregate = mean) # Mean
## dose OJ VC
## 1 0.5 13.23 7.98
## 2 1.0 22.70 16.77
## 3 2.0 26.06 26.14
dcast(TG, dose ~ supp, value.var = "len", fun.aggregate = sd) # Standard deviation
## dose OJ VC
## 1 0.5 4.459709 2.746634
## 2 1.0 3.910953 2.515309
## 3 2.0 2.655058 4.797731
First hypothesis:
- The mean len differs between the two delivery methods.
Assuming that it’s a randomized experiment with independent groups and unequal variances, the following code (Welch’s two-sample t-test) serves us well.
# `paired` is omitted: unpaired is the default, and newer R releases reject it in the formula interface
first_test <- t.test(len ~ supp, var.equal = FALSE, data = TG)
first_test$conf.int
## [1] -0.1710156 7.5710156
## attr(,"conf.level")
## [1] 0.95
first_test$p.value
## [1] 0.06063451
The results show no statistically significant difference between the two methods: the 95% confidence interval contains 0 and the p-value (0.061) is greater than the alpha level of 0.05.
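To make the unequal-variance (Welch) calculation behind this test explicit, here is a minimal by-hand sketch. All variable names are introduced here for illustration; the result should agree with what t.test reports for the same comparison.

```r
library(datasets)

oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Standard error of the difference in means (variances not pooled)
se_diff <- sqrt(se2_oj + se2_vc)
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# 95% confidence interval for the difference and two-sided p-value
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df_welch) * se_diff
p_value <- 2 * pt(-abs(t_stat), df_welch)

ci
p_value
```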
Second hypothesis:
- The mean len differs between the two delivery methods within each dose level.
Assuming that it’s a randomized experiment with independent groups and unequal variances, the following code serves us well.
dose0.5 <- TG[TG$dose == 0.5,]
dose1 <- TG[TG$dose == 1,]
dose2 <- TG[TG$dose == 2,]
# `paired` is omitted: unpaired is the default, and newer R releases reject it in the formula interface
test0.5 <- t.test(len ~ supp, var.equal = FALSE, data = dose0.5)
test0.5$conf.int
## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95
test0.5$p.value
## [1] 0.006358607
test1 <- t.test(len ~ supp, var.equal = FALSE, data = dose1) # unpaired by default
test1$conf.int
## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95
test1$p.value
## [1] 0.001038376
test2 <- t.test(len ~ supp, var.equal = FALSE, data = dose2) # unpaired by default
test2$conf.int
## [1] -3.79807 3.63807
## attr(,"conf.level")
## [1] 0.95
test2$p.value
## [1] 0.9638516
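The three per-dose tests can also be run in a single pass by splitting the data on dose. This is only a compact sketch of the same comparison, not a different method; `by_dose`, `dose_tests`, `p_values` and `conf_ints` are names chosen here for illustration.

```r
library(datasets)

# One data frame per dose level, named "0.5", "1" and "2"
by_dose <- split(ToothGrowth, ToothGrowth$dose)

# Welch two-sample t-test of len by delivery method within each dose
dose_tests <- lapply(by_dose, function(d) {
  t.test(len ~ supp, data = d, var.equal = FALSE)
})

# Collect the p-values and the 95% confidence intervals (one row per dose)
p_values <- sapply(dose_tests, function(tt) tt$p.value)
conf_ints <- t(sapply(dose_tests, function(tt) as.numeric(tt$conf.int)))
colnames(conf_ints) <- c("lower", "upper")

round(p_values, 4)
round(conf_ints, 3)
```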
After looking at the three hypothesis tests above, there are three main conclusions:
- There is a statistically significant difference between the delivery methods at doses of 0.5 mg/day and 1 mg/day in the tooth length (len) of the guinea pigs. There is no statistically significant difference between the methods when the dose is 2 mg/day.
- Orange juice shows its clearest advantage over vitamin C at 1 mg/day, where the p-value is smallest, as suggested by the hypothesis tests and the boxplot above.
- In the pooled test of the first hypothesis, the 2 mg/day group, where the methods do not differ, masked the difference between the two methods seen at the other two doses.