Overview

This document completes the Statistical Inference course assignment, part of Coursera’s Data Science Certification by Johns Hopkins University. The project has two parts; this report covers Part 2.

As requested, each PDF report should be at most 3 pages long, with up to 3 additional pages of supporting material allowed as an appendix if needed.

Part 2 - Inferential Data Analysis

Overview

Perform exploratory analysis on the ToothGrowth R dataset - The Effect of Vitamin C on Tooth Growth in Guinea Pigs. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Requested demonstrations

  1. Load the ToothGrowth data and perform some basic exploratory data analyses
  2. Provide a basic summary of the data.
  3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.
  4. State your conclusions and the assumptions needed for your conclusions.

Given conditions and assumptions

  1. For all confidence intervals, use a 95% confidence.
  2. The measurements are not paired.
  3. Only use the techniques from class.
  4. The populations are independent; there was no crossover between the subjects and dosage.
  5. The subjects are truly selected at random; there are no confounding factors influencing the results.
  6. The variances are not equal (var.equal=FALSE).

Tasks 1/2 - Load the data and perform some basic exploratory data analyses

Here is the summary of the dataset:

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

From the summary above we can see that half of the cases received the dose via the VC method and the other half via the OJ method. Tooth length ranges from 4.20 to 33.90, and the dose ranges from 0.5 to 2.0 mg/day.
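
The balance noted above can be checked directly. The following is a minimal sketch using only base R; the variable names `design`, `len_range`, and `dose_range` are illustrative and not part of the original report.

```r
# ToothGrowth ships with base R (package 'datasets'), so no extra packages are needed.
data("ToothGrowth")

# Cross-tabulate delivery method against dose to confirm the design is balanced.
design <- table(ToothGrowth$supp, ToothGrowth$dose)
print(design)

# Observed range of tooth length and of dose.
len_range  <- range(ToothGrowth$len)
dose_range <- range(ToothGrowth$dose)
print(len_range)
print(dose_range)
```

Each supplement/dose cell should hold the same number of animals, which is what makes the later group comparisons straightforward.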

Using the mean length (lenght_mean), the standard deviation of length (lenght_sd), and the number of observations (count), we may summarize the data in three ways:

  • First, by dosage and supplement type together
## # A tibble: 6 × 5
## # Groups:   supp [2]
##   supp   dose lenght_mean lenght_sd count
##   <fct> <dbl>       <dbl>     <dbl> <int>
## 1 OJ      0.5       13.2       4.46    10
## 2 OJ      1         22.7       3.91    10
## 3 OJ      2         26.1       2.66    10
## 4 VC      0.5        7.98      2.75    10
## 5 VC      1         16.8       2.52    10
## 6 VC      2         26.1       4.80    10
  • Or by supplement type only (ignoring dosage)
## # A tibble: 2 × 4
##   supp  lenght_mean lenght_sd count
## * <fct>       <dbl>     <dbl> <int>
## 1 OJ           20.7      6.61    30
## 2 VC           17.0      8.27    30
  • Finally, by dosage only (ignoring supplement type)
## # A tibble: 3 × 4
##    dose lenght_mean lenght_sd count
## * <dbl>       <dbl>     <dbl> <int>
## 1   0.5        10.6      4.50    20
## 2   1          19.7      4.42    20
## 3   2          26.1      3.77    20
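
The per-cell summary above can also be reproduced with base R alone. The sketch below additionally attaches a t-based 95% interval to each cell mean; the helper name `mean_ci` and the added interval columns are illustrative assumptions, not part of the original analysis.

```r
data("ToothGrowth")

# Hypothetical helper: mean, sd, n and a t-based interval for one group of values.
mean_ci <- function(x, level = 0.95) {
  n  <- length(x)
  m  <- mean(x)
  s  <- sd(x)
  # Half-width of the interval: t quantile times the standard error of the mean.
  hw <- qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)
  c(mean = m, sd = s, n = n, lower = m - hw, upper = m + hw)
}

# Split tooth length by supplement and dose, then apply the helper to each cell.
cells <- split(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose))
cell_summary <- t(sapply(cells, mean_ci))
print(round(cell_summary, 2))
```

With only 10 animals per cell the intervals are fairly wide, which is worth keeping in mind when reading the raw means.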

At first glance, OJ appears to be a more effective delivery method than VC. Tooth length also appears to increase with the dose of vitamin C.

Since a picture is worth a thousand words, let’s explore this with a scatter plot of tooth length against dosage, colored by supplement:

The higher the dosage, the longer the tooth. The plot shows that tooth length is similar for both supplements at 2 mg, but it suggests that OJ has a bigger impact on tooth growth than VC at the lower doses.

Task 3 - Use confidence intervals and/or hypothesis tests to compare tooth growth by supplement and dose.

  • First, is there any difference between OJ and VC across all dosage levels combined?
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

NO, there is not -> The 95% confidence interval includes 0 and the p-value is 0.06, which is greater than 0.05, so the difference is not significant at the 5% level.
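
To make the numbers in this output less of a black box, the following sketch recomputes the Welch statistic, its degrees of freedom, the p-value and the interval by hand in base R. The variable names are illustrative; the formulas are the standard unpooled-variance t statistic and the Welch-Satterthwaite approximation.

```r
data("ToothGrowth")

oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error contributed by each group.
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic: difference in means over the unpooled standard error.
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for mean(OJ) - mean(VC).
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df_welch) * sqrt(se2_oj + se2_vc)

print(c(t = t_stat, df = df_welch, p = p_value))
print(ci)
```

The values should agree with the t.test output above up to rounding.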

  • Next, for each dosage level, is there a difference between OJ and VC?
DOSE   Confidence Interval   P-Value   Evidence
0.5    1.7191 to 8.7809      0.00636   YES - There is - The confidence interval does not include zero
1.0    2.8021 to 9.0579      0.00104   YES - There is - The confidence interval does not include zero
2.0    -3.7981 to 3.6381     0.96385   NO - There is not - The confidence interval includes zero
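
The per-dose comparison can also be generated in a loop rather than test by test. This is a minimal base R sketch; the data frame `by_dose` and its column names are illustrative and do not appear in the original code.

```r
data("ToothGrowth")

doses <- sort(unique(ToothGrowth$dose))

# One Welch test (OJ vs VC) per dose level.
# The formula interface treats the two groups as independent samples.
by_dose <- do.call(rbind, lapply(doses, function(d) {
  tt <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ],
               var.equal = FALSE, conf.level = 0.95)
  data.frame(dose    = d,
             lower   = round(tt$conf.int[1], 4),
             upper   = round(tt$conf.int[2], 4),
             p_value = round(tt$p.value, 5),
             # TRUE when the interval lies entirely on one side of zero.
             differs = tt$conf.int[1] > 0 || tt$conf.int[2] < 0)
}))
print(by_dose)
```

Building the table from the test objects avoids copying numbers by hand and keeps the "includes zero" decision tied to the interval itself.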
  • Now, to see whether there is a difference between dosage levels, OJ and VC are analysed separately.
DOSES       Suppl   Interval             P-Value         Evidence
0.5 / 1.0   OJ      -13.416 to -5.524    8.7849191e-05   YES - There is - The confidence interval does not include zero
1.0 / 2.0   OJ      -6.531 to -0.189     0.0391951       YES - There is - The confidence interval does not include zero
0.5 / 1.0   VC      -11.266 to -6.314    6.8110177e-07   YES - There is - The confidence interval does not include zero
1.0 / 2.0   VC      -13.054 to -5.686    9.1556031e-05   YES - There is - The confidence interval does not include zero
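
The same comparisons between adjacent dose levels can be sketched as a nested loop over supplement and dose pair. The names `dose_pairs` and `dose_tests` are illustrative assumptions; only base R is used.

```r
data("ToothGrowth")

# Adjacent dose pairs to compare within each supplement.
dose_pairs <- list(c(0.5, 1), c(1, 2))

dose_tests <- do.call(rbind, lapply(c("OJ", "VC"), function(s) {
  do.call(rbind, lapply(dose_pairs, function(p) {
    # Keep one supplement and exactly two dose levels, so len ~ dose has two groups.
    sub <- ToothGrowth[ToothGrowth$supp == s & ToothGrowth$dose %in% p, ]
    tt  <- t.test(len ~ dose, data = sub, var.equal = FALSE, conf.level = 0.95)
    data.frame(supp    = s,
               doses   = paste(p, collapse = " / "),
               lower   = round(tt$conf.int[1], 3),
               upper   = round(tt$conf.int[2], 3),
               p_value = signif(tt$p.value, 3))
  }))
}))
print(dose_tests)
```

The intervals are negative because each test estimates the lower-dose mean minus the higher-dose mean; an interval entirely below zero means the higher dose gave longer teeth.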

Task 4 - Conclusions.

Based on the analysis so far, it is fair to conclude the following:

  1. Tooth length increases as the dosage increases, regardless of supplement method.
  2. There is no statistically significant difference between the OJ and VC supplement methods at the 2.0 mg dosage.
  3. The OJ supplement method leads to more tooth growth than the VC method at the 0.5 mg and 1.0 mg dosages.
  4. Based on conclusions 2 and 3, OJ appears to be the better delivery method for increasing tooth length at a given dose of vitamin C, but its advantage disappears at the highest dose tested (2.0 mg).

The assumptions required for these conclusions are stated in the project description above.

APPENDIX - CODE

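# We will need ggplot2 to draw the plot
library(ggplot2)
# dplyr is handy for manipulating the data
library(dplyr)

# Load ToothGrowth data
data("ToothGrowth")
# Basic summary of every column
summary(ToothGrowth)

# Mean, sd and count of len for every supplement/dose combination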
summarise_all <- ToothGrowth %>% 
    group_by(supp,dose) %>%
    summarize(lenght_mean=mean(len), lenght_sd=sd(len), count = n())
print(summarise_all)

summarise_suplement <- ToothGrowth %>% 
    group_by(supp) %>%
    summarize(lenght_mean=mean(len), lenght_sd=sd(len), count = n())
print(summarise_suplement)

summarise_dose <- ToothGrowth %>% 
    group_by(dose) %>%
    summarize(lenght_mean=mean(len), lenght_sd=sd(len), count = n())
print(summarise_dose)

# Calculate the mean len for every dose and supp combination
len_avg <- aggregate(len ~ ., data = ToothGrowth, mean)

# Now plot the tooth length (len) relative to dosage and supplement
g <- ggplot(data = ToothGrowth, aes(x = dose, y = len))
# size and alpha are fixed values, so they are set outside aes()
g <- g + geom_point(aes(colour = supp), size = 1, alpha = 0.6)
g <- g + geom_line(data = len_avg, aes(group = supp, colour = supp))
g <- g + labs(title = "Tooth Length relative to Dosage and Supplement")
print(g)
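
# Welch t test of OJ vs VC across all dosage levels.
# paired is left at its default (unpaired); recent R versions reject it in the formula interface.
t.test(len ~ supp, var.equal=FALSE, data=ToothGrowth)

# Labels for the Evidence column of the result tables
yes_lbl <- "**YES - There is** - The confidence interval does not include zero"
no_lbl <-  "**NO - There is not** - The confidence interval includes zero"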

# Welch t tests of OJ vs VC at each dose level (unpaired by default)
vcoj_05 <- t.test(len ~ supp, var.equal=FALSE, data=ToothGrowth[ToothGrowth$dose==0.5, ])
vcoj_10 <- t.test(len ~ supp, var.equal=FALSE, data=ToothGrowth[ToothGrowth$dose==1.0, ])
vcoj_20 <- t.test(len ~ supp, var.equal=FALSE, data=ToothGrowth[ToothGrowth$dose==2.0, ])

vcoj_05_itvl <- paste(round(vcoj_05$conf.int[1],4), " to ", round(vcoj_05$conf.int[2],4))
vcoj_05_pvl  <- round(vcoj_05$p.value,5)

vcoj_10_itvl <- paste(round(vcoj_10$conf.int[1],4), " to ", round(vcoj_10$conf.int[2],4))
vcoj_10_pvl  <- round(vcoj_10$p.value,5)

vcoj_20_itvl <- paste(round(vcoj_20$conf.int[1],4), " to ", round(vcoj_20$conf.int[2],4))
vcoj_20_pvl  <- round(vcoj_20$p.value,5)

# Welch t tests between adjacent dose levels, separately for each supplement.
# Note the naming: *_20 excludes the 2.0 dose (compares 0.5 vs 1.0),
# and *_05 excludes the 0.5 dose (compares 1.0 vs 2.0).
oj_20 <- t.test(len ~ dose, var.equal=FALSE, data=filter(ToothGrowth, dose < 2, supp=="OJ"))
oj_05 <- t.test(len ~ dose, var.equal=FALSE, data=filter(ToothGrowth, dose > 0.5, supp=="OJ"))
vc_20 <- t.test(len ~ dose, var.equal=FALSE, data=filter(ToothGrowth, dose < 2, supp=="VC"))
vc_05 <- t.test(len ~ dose, var.equal=FALSE, data=filter(ToothGrowth, dose > 0.5, supp=="VC"))

oj_20_itvl <- paste(round(oj_20$conf.int[1],3), " to ", round(oj_20$conf.int[2],3))
oj_05_itvl <- paste(round(oj_05$conf.int[1],3), " to ", round(oj_05$conf.int[2],3))
vc_20_itvl <- paste(round(vc_20$conf.int[1],3), " to ", round(vc_20$conf.int[2],3))
vc_05_itvl <- paste(round(vc_05$conf.int[1],3), " to ", round(vc_05$conf.int[2],3))