Basic Inferential Data Analysis

Project instructions

In this project we analyze the ToothGrowth data from the R datasets package.

Loading the data

Load the ToothGrowth data and perform some basic exploratory data analysis

library(datasets) # Load the package that ships the ToothGrowth data
TG <- ToothGrowth # Assign the data to a new data frame
str(TG) # Look at the structure of the dataset and its variables

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

After inspecting the dataset and its help page (?ToothGrowth), the data come from an experiment that measured the length of odontoblasts (the cells responsible for tooth growth) in 60 guinea pigs. Each animal received vitamin C at one of three dose levels (0.5, 1 and 2 mg/day) by one of two delivery methods: orange juice, coded as OJ, or ascorbic acid (a form of vitamin C), coded as VC.
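To make the design described above concrete, the observations can be counted for each combination of delivery method and dose. This is only a quick check, not part of the original analysis; the name cell_counts is chosen here for illustration.

```r
library(datasets)

# Count observations in each supplement-by-dose cell of the design
cell_counts <- table(supp = ToothGrowth$supp, dose = ToothGrowth$dose)
cell_counts
```

A balanced table (the same count in every cell) means the comparisons between groups are not distorted by unequal group sizes.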

library(ggplot2) # Load ggplot for some exploratory analysis
library(pastecs) # Loading pastecs to get some descriptive statistics

## Loading required package: boot

stat.desc(TG$len) # Basic summary of the len data

##      nbr.val     nbr.null       nbr.na          min          max 
##   60.0000000    0.0000000    0.0000000    4.2000000   33.9000000 
##        range          sum       median         mean      SE.mean 
##   29.7000000 1128.8000000   19.2500000   18.8133333    0.9875223 
## CI.mean.0.95          var      std.dev     coef.var 
##    1.9760276   58.5120226    7.6493152    0.4065901

After looking at some basic descriptive statistics, the range (29.7) and the standard deviation (7.65) stand out, because they point to high variability in the data. This raises a question: do the dose and the delivery method affect tooth length? Let’s look at some graphics to explore that idea.

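ggplot(TG, aes(x = supp, y = len, fill = supp)) + geom_boxplot() + # Boxplot of len by method
        stat_summary(fun = "mean", geom = "point", shape = 20, size = 3) + # Mark each group mean (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
        scale_fill_discrete(labels = c("Orange Juice", "Vitamin C"))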

Judging by the boxplot of the two methods, orange juice seems to have the stronger effect: the mean len for guinea pigs given orange juice is 20.663, against 16.963 for vitamin C, a difference of 3.7. But that is not the final conclusion. Let’s go one level deeper and look at len by supp and dose.
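The two group means quoted above can be reproduced directly. The following is a minimal sketch using base R; the names supp_means and mean_diff are chosen here for illustration.

```r
library(datasets)

# Mean tooth length for each delivery method
supp_means <- tapply(ToothGrowth$len, ToothGrowth$supp, mean)
round(supp_means, 3)

# Difference between the orange juice and vitamin C means
mean_diff <- unname(supp_means["OJ"] - supp_means["VC"])
round(mean_diff, 1)
```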

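ggplot(TG, aes(x = interaction(supp, dose), y = len, fill = supp)) + geom_boxplot() + # One box per method and dose
        stat_summary(fun = "mean", geom = "point", shape = 20, size = 3) + # Mark each group mean (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
        scale_fill_discrete(labels = c("Orange Juice", "Vitamin C"))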

It seems that vitamin C and orange juice have a similar effect at a dose of 2 mg/day. But let’s do some hypothesis testing before drawing final conclusions.

Summary of the data

For the summary of the data, we are only interested in the mean and the standard deviation of len by dose and method.

library(reshape2)
dcast(TG, dose ~ supp, value.var = "len", fun.aggregate = mean) # Mean

##   dose    OJ    VC
## 1  0.5 13.23  7.98
## 2  1.0 22.70 16.77
## 3  2.0 26.06 26.14

dcast(TG, dose ~ supp, value.var = "len", fun.aggregate = sd)   # Standard deviation

##   dose       OJ       VC
## 1  0.5 4.459709 2.746634
## 2  1.0 3.910953 2.515309
## 3  2.0 2.655058 4.797731
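The same two summary tables can be built without reshape2. The sketch below uses base R’s tapply() as an alternative to dcast(); it is not the approach used in this report, and the names mean_table and sd_table are chosen here for illustration.

```r
library(datasets)

# Dose-by-method tables of the mean and standard deviation of len,
# built with tapply() instead of reshape2::dcast()
mean_table <- with(ToothGrowth, tapply(len, list(dose = dose, supp = supp), mean))
sd_table   <- with(ToothGrowth, tapply(len, list(dose = dose, supp = supp), sd))

round(mean_table, 2)
round(sd_table, 2)
```

The result is a matrix rather than a data frame, so a single cell can be read by name, for example mean_table["0.5", "OJ"].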

Hypothesis testing

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)

First hypothesis:
- There is a difference in mean tooth length between the two delivery methods.

Assuming that it’s a randomized experiment and that the variances are not equal, the following code serves us well.

first_test <- t.test(len ~ supp, var.equal = FALSE, data = TG) # Welch two-sample test (unpaired by default)
first_test$conf.int

## [1] -0.1710156  7.5710156
## attr(,"conf.level")
## [1] 0.95

first_test$p.value # Two-sided p-value

## [1] 0.06063451

The results show no statistically significant difference between the two methods: the 95% confidence interval contains 0 and the p-value (0.061) is greater than an alpha level of 0.05.
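To show where these numbers come from, the unequal-variance (Welch) test can be reproduced by hand: the t statistic divides the difference in means by its standard error, and the degrees of freedom come from the Welch-Satterthwaite approximation. This is an illustrative sketch; all variable names are chosen here and are not part of the original analysis.

```r
library(datasets)

oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each group mean
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)
se_diff <- sqrt(se2_oj + se2_vc)

# Welch t statistic for the difference in means
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval
p_manual  <- 2 * pt(-abs(t_stat), df = df_welch)
ci_manual <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df_welch) * se_diff

p_manual
ci_manual
```

Both values should agree with the p.value and conf.int components returned by t.test() with var.equal = FALSE.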

Second hypothesis:
- There is a difference in mean tooth length between the two delivery methods at each dose level.

Assuming that it’s a randomized experiment and that the variances are not equal, the following code serves us well.

dose0.5 <- TG[TG$dose == 0.5,]
dose1 <- TG[TG$dose == 1,]
dose2 <- TG[TG$dose == 2,]
test0.5 <- t.test(len ~ supp, var.equal = FALSE, data = dose0.5) # Unpaired Welch test at 0.5 mg/day
test0.5$conf.int

## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95

test0.5$p.value

## [1] 0.006358607

test1 <- t.test(len ~ supp, var.equal = FALSE, data = dose1) # Unpaired Welch test at 1 mg/day
test1$conf.int

## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95

test1$p.value

## [1] 0.001038376

test2 <- t.test(len ~ supp, var.equal = FALSE, data = dose2) # Unpaired Welch test at 2 mg/day
test2$conf.int

## [1] -3.79807  3.63807
## attr(,"conf.level")
## [1] 0.95

test2$p.value

## [1] 0.9638516
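The three per-dose tests can also be collected into a single table, which makes them easier to compare side by side. The sketch below is an illustration only: the helper test_by_dose and the column names are hypothetical, and the Holm-adjusted column is an addition that this report does not use, included on the assumption that a correction for running three tests is of interest.

```r
library(datasets)

# Run the Welch test of len by supp within one dose level and
# return the confidence interval and p-value as a named vector
test_by_dose <- function(d) {
  res <- t.test(len ~ supp, var.equal = FALSE,
                data = ToothGrowth[ToothGrowth$dose == d, ])
  c(dose = d,
    ci_lower = res$conf.int[1],
    ci_upper = res$conf.int[2],
    p_value = res$p.value)
}

# One row per dose level
dose_results <- as.data.frame(t(sapply(c(0.5, 1, 2), test_by_dose)))

# Holm adjustment for multiple comparisons (an assumption, not in the report)
dose_results$p_holm <- p.adjust(dose_results$p_value, method = "holm")

dose_results
```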

After looking at the three hypothesis tests above, there are three main conclusions:

- There is a statistically significant difference between the delivery methods at doses of 0.5 mg/day and 1 mg/day in the tooth length (len) of guinea pigs, with orange juice giving the higher mean. There is no statistically significant difference between the methods when the dose is 2 mg/day.
- Orange juice outperforms vitamin C most clearly at 1 mg/day (p-value of about 0.001), as suggested by the hypothesis test and the boxplot above.
- Pooling all doses in the first test masked the difference between the methods at 0.5 and 1 mg/day, because at 2 mg/day the two methods perform almost identically.

Alejandro Cadavid Romero
