The main aim of this project is to perform an exploratory data analysis of the ToothGrowth data in the R datasets package, and then to run hypothesis tests on the data to gain deeper insight into it.

Loading the dataset

library(ggplot2)
library(datasets)
data(ToothGrowth)  # load the data explicitly; every later call names the data frame, so attach() is not needed
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

From the summary above, the dataset has 60 observations of 3 variables: the tooth length (len), the method by which the supplement was given (supp), and the dosage level (dose).
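As a quick check on the structure described above, the observations can be cross-tabulated by delivery method and dose. This is a minimal sketch using only base R; the name `cell_counts` is introduced here for illustration and is not part of the original analysis.

```r
library(datasets)

# Cross-tabulate delivery method against dose to see the size of each group
cell_counts <- with(ToothGrowth, table(supp, dose))
cell_counts

# Distinct dose values present in the data
sort(unique(ToothGrowth$dose))
```

Equal counts in every cell would suggest a balanced design, which makes the group comparisons later in the report easier to interpret.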

Exploratory Data Analysis

Dosage_Levels <- factor(ToothGrowth$dose)
g <- ggplot(data = ToothGrowth, aes(dose, len)) +
     geom_boxplot(aes(group = dose, fill = Dosage_Levels)) +
     ggtitle("Tooth Length for Different Dosage Levels") +
     xlab("Dosage Levels") +
     ylab("Tooth Length")

g

From the above plot it can be seen that the median tooth length (the line inside each box) differs between dosage levels, and that tooth length increases as the dosage level increases.
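A boxplot draws medians rather than means, so the pattern in the plot can be backed up with numbers by computing both statistics for each dosage level. The sketch below uses base R only; `dose_means` and `dose_medians` are names chosen here for illustration.

```r
library(datasets)

# Mean and median tooth length at each dosage level
dose_means   <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)
dose_medians <- tapply(ToothGrowth$len, ToothGrowth$dose, median)

# One row per statistic, one column per dosage level
rbind(mean = dose_means, median = dose_medians)
```

If both rows increase from left to right, the numerical summary agrees with the visual impression from the boxplot.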

Supplement_Type <- factor(ToothGrowth$supp, labels = c("Orange Juice", "Ascorbic Acid"))
g <- ggplot(data = ToothGrowth, aes(Supplement_Type, len)) +
     geom_boxplot(aes(group = supp, fill = Supplement_Type)) +
     ggtitle("Tooth Length for Different Supplement Types") +
     xlab("Supplement Type") +
     ylab("Tooth Length")

g

From the above plot it can be seen that the median tooth length differs between the two supplement types, and that tooth length tends to be greater when the vitamin C is delivered in orange juice than when it is given as ascorbic acid.

ToothGrowth$Dosage_Levels <- factor(ToothGrowth$dose)
ToothGrowth$Supplement_Type <- factor(ToothGrowth$supp, labels = c("Orange Juice", "Ascorbic Acid"))
g <- ggplot(data = ToothGrowth, aes(Supplement_Type, len)) +
     geom_boxplot(aes(fill = Supplement_Type)) +
     facet_grid(~ dose) +   # one panel per dosage level
     ggtitle("Tooth Length for Different Supplement Types across Dosage Levels") +
     xlab("Supplement Type") +   # the x axis within each panel is the supplement type
     ylab("Tooth Length")

g
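The faceted plot above can be complemented with a table of group means for every combination of supplement type and dosage level. This is a small sketch in base R that works from the original `supp` and `dose` columns; `cell_means` is a name introduced here for illustration.

```r
library(datasets)

# Mean tooth length for every supplement-by-dose combination
cell_means <- with(ToothGrowth,
                   tapply(len, list(supp = supp, dose = dose), mean))
round(cell_means, 2)

# Difference in means (OJ minus VC) at each dosage level
round(cell_means["OJ", ] - cell_means["VC", ], 2)
```

Reading the differences across the columns shows whether the gap between the two delivery methods stays the same or changes as the dose increases.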

Hypothesis Testing

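Two new columns, Dosage_Levels and Supplement_Type, were added to the data set above by converting the dosage levels and the supplement types to factors. They are used as the grouping variables in the tests below.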

Performing t tests for Dosage Levels

Comparing Dosage levels 0.5 and 1

data1 <- subset(ToothGrowth, Dosage_Levels %in% c(0.5, 1))
# Unpaired (the default) Welch test, i.e. without assuming equal variances
t.test(len ~ Dosage_Levels, var.equal = FALSE, data = data1)
## 
##  Welch Two Sample t-test
## 
## data:  len by Dosage_Levels
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

The mean tooth length differs between dosage levels 0.5 and 1, and is greater for dosage level 1 (19.735 versus 10.605). The 95% confidence interval for the difference in means does not contain 0 and the p-value (1.268e-07) is far below 0.05, so we reject the null hypothesis that the two means are equal.
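To make the test above less of a black box, the Welch statistic, its degrees of freedom, and the confidence interval can be recomputed by hand. The sketch below assumes the standard Welch formulas (unpooled standard error and Welch-Satterthwaite degrees of freedom); the variable names are chosen here for illustration.

```r
library(datasets)

x <- ToothGrowth$len[ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$dose == 1]

# Squared standard errors of each group mean (variances are not pooled)
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)
se_diff <- sqrt(se2_x + se2_y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / se_diff
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

c(t = t_stat, df = df_welch, p = p_value)
ci
```

The printed values should agree with the `t.test` output for dosage levels 0.5 and 1. The same steps apply to the other pairs of dosage levels by changing the two subsets.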

Comparing Dosage levels 1 and 2

data2 <- subset(ToothGrowth, Dosage_Levels %in% c(1, 2))
# Unpaired (the default) Welch test, i.e. without assuming equal variances
t.test(len ~ Dosage_Levels, var.equal = FALSE, data = data2)
## 
##  Welch Two Sample t-test
## 
## data:  len by Dosage_Levels
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

The mean tooth length differs between dosage levels 1 and 2, and is greater for dosage level 2 (26.100 versus 19.735). The 95% confidence interval for the difference in means does not contain 0 and the p-value (1.906e-05) is far below 0.05, so we reject the null hypothesis that the two means are equal.

Comparing Dosage levels 0.5 and 2

data3 <- subset(ToothGrowth, Dosage_Levels %in% c(0.5, 2))
# Unpaired (the default) Welch test, i.e. without assuming equal variances
t.test(len ~ Dosage_Levels, var.equal = FALSE, data = data3)
## 
##  Welch Two Sample t-test
## 
## data:  len by Dosage_Levels
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

The mean tooth length differs between dosage levels 0.5 and 2, and is greater for dosage level 2 (26.100 versus 10.605). The 95% confidence interval for the difference in means does not contain 0 and the p-value (4.398e-14) is far below 0.05, so we reject the null hypothesis that the two means are equal.
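Because three separate tests were run on the dosage levels, one optional extra step is to adjust the p-values for multiple comparisons. The sketch below is not part of the original analysis: it assumes a Bonferroni correction and uses `pairwise.t.test` from base R with `pool.sd = FALSE`, so that each pair is compared with its own unpooled test, as in the tests above.

```r
library(datasets)

# Unpooled t-tests for every pair of dosage levels,
# with Bonferroni-adjusted p-values
pw <- pairwise.t.test(ToothGrowth$len, factor(ToothGrowth$dose),
                      pool.sd = FALSE, p.adjust.method = "bonferroni")
pw$p.value
```

The result is a lower-triangular matrix of adjusted p-values, one per pair of dosage levels. If they all remain below 0.05, the conclusions of the three tests are unchanged by the adjustment.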

Performing t tests for Supplement Types

# Unpaired (the default) Welch test, i.e. without assuming equal variances
t.test(len ~ Supplement_Type, var.equal = FALSE, data = ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  len by Supplement_Type
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
##  mean in group Orange Juice mean in group Ascorbic Acid 
##                    20.66333                    16.96333

The sample mean tooth length is higher when the vitamin C is delivered in orange juice (20.66) than when it is given as ascorbic acid (16.96). However, the 95% confidence interval for the difference in means contains 0 and the p-value (0.06063) is above 0.05, so we cannot reject the null hypothesis that the two means are equal at the 5% significance level.
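The test above pools all dosage levels together, while the faceted boxplot suggests the gap between the supplement types may depend on the dose. One way to look into this, sketched below as an optional extension rather than as part of the original analysis, is to repeat the Welch test within each dosage level. The name `p_by_dose` is introduced here for illustration.

```r
library(datasets)

# Welch t-test of OJ versus VC within each dosage level;
# split() yields one data frame per dose, named "0.5", "1" and "2"
p_by_dose <- sapply(split(ToothGrowth, ToothGrowth$dose), function(d) {
  t.test(len ~ supp, data = d, var.equal = FALSE)$p.value
})
round(p_by_dose, 4)
```

Each element is the two-sided p-value for the supplement comparison at one dosage level. Since this adds three more tests, the multiple-comparison caveat applies here as well.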

Conclusions based on the Analysis

Assumptions used for the Conclusion