Project Instructions

There are two parts to this project:

  1. A simulation exercise
  2. Basic inferential data analysis

I will create a report to answer the questions presented in the project rubric. Given the nature of the series, I’ll use ‘knitr’ to create the report and convert it to a PDF. Each PDF report will be no more than 3 pages, with up to 3 pages of supporting appendix material if needed (code, figures, etc.).

Part 2: Basic Inferential Data Analysis

Overview

In the second portion of this project, I am going to analyze the ToothGrowth data in the R datasets package.

Questions

  1. Load the ToothGrowth data and perform some basic exploratory data analyses
  2. Provide a basic summary of the data
  3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering)
  4. State your conclusions and the assumptions needed for your conclusions

Question 1: Load the ToothGrowth data and perform some basic exploratory data analyses

Load the ToothGrowth data in the R datasets package and gain insight into the data by using the ‘head’, ‘tail’, and ‘str’ functions.

data("ToothGrowth")
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
tail(ToothGrowth)
##     len supp dose
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
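
As a complement to the calls above, a cross-tabulation can show how the 60 observations are spread across supplement types and dose levels. The sketch below is an illustrative addition rather than part of the original report. The names `tg` and `cell_counts` are hypothetical, and the code works on a fresh copy of the dataset so that it does not depend on any other step.

```r
# Work on a fresh copy of the dataset (the name `tg` is an assumption)
tg <- datasets::ToothGrowth

# Count observations for each supplement type / dose level combination
cell_counts <- table(supp = tg$supp, dose = tg$dose)
print(cell_counts)

# Check whether any values are missing
any_missing <- anyNA(tg)
print(any_missing)
```

A balanced table here would mean that each pairwise comparison later in the report uses groups of equal size.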

Question 2: Provide a basic summary of the data

Summarize the ToothGrowth data by using the ‘summary’ function.

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
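
The overall summary does not break tooth length down by group. A minimal sketch of a per-group summary follows, assuming base R only. The names `tg`, `cell_means`, `cell_sds`, and `group_summary` are hypothetical and not part of the original report.

```r
# Fresh copy of the dataset (the name `tg` is an assumption)
tg <- datasets::ToothGrowth

# Mean and standard deviation of tooth length per supp/dose combination
cell_means <- aggregate(len ~ supp + dose, data = tg, FUN = mean)
cell_sds <- aggregate(len ~ supp + dose, data = tg, FUN = sd)
names(cell_means)[3] <- "mean_len"
names(cell_sds)[3] <- "sd_len"

# Combine both statistics into one table with one row per group
group_summary <- merge(cell_means, cell_sds, by = c("supp", "dose"))
print(group_summary)
```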

First, load the ggplot2 package. Convert the dose variable from numeric to factor, then visualize tooth length as a function of dose.

library(ggplot2)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
ggplot(ToothGrowth, aes(x = dose, y = len)) + geom_boxplot(aes(fill = dose))

Visualize tooth length as a function of supplement type.

ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_boxplot(aes(fill = supp))

Question 3: Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose

Check for a difference in mean tooth length between the two supplement types. By default, t.test performs a Welch test, which does not assume equal variances in the two groups.

t.test(len ~ supp, data = ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

The p-value is approximately 0.06, which is greater than 0.05, and the 95% confidence interval contains zero. Thus, at the 5% significance level, we fail to reject the null hypothesis that mean tooth length is the same for both supplement types.
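
To make the reported statistic, degrees of freedom, and interval less opaque, the Welch test by supplement type can be reproduced by hand. This is a sketch under the assumption that the standard Welch-Satterthwaite formula is what t.test applies by default. All variable names below are hypothetical.

```r
# Fresh copy of the dataset (the name `tg` is an assumption)
tg <- datasets::ToothGrowth
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# Variance of each sample mean
v_oj <- var(oj) / length(oj)
v_vc <- var(vc) / length(vc)

# Standard error of the difference in means and the t statistic
se_diff <- sqrt(v_oj + v_vc)
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
welch_df <- (v_oj + v_vc)^2 /
  (v_oj^2 / (length(oj) - 1) + v_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for the difference
p_value <- 2 * pt(-abs(t_stat), df = welch_df)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df = welch_df) * se_diff

print(c(t = t_stat, df = welch_df, p = p_value))
print(ci)
```

The printed values should agree with the t.test output above up to rounding.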

Create three subsets, one for each pair of dose levels, in order to check for group differences.

ToothGrowth.doses_0.5_1.0 <- subset(ToothGrowth, dose %in% c(0.5, 1.0))
ToothGrowth.doses_0.5_2.0 <- subset(ToothGrowth, dose %in% c(0.5, 2.0))
ToothGrowth.doses_1.0_2.0 <- subset(ToothGrowth, dose %in% c(1.0, 2.0))

Check for group differences due to different dose levels of (0.5, 1.0). Assume unequal variances between the two groups.

t.test(len ~ dose, data = ToothGrowth.doses_0.5_1.0)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

Check for group differences due to different dose levels of (0.5, 2.0). Assume unequal variances between the two groups.

t.test(len ~ dose, data = ToothGrowth.doses_0.5_2.0)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

Check for group differences due to different dose levels of (1.0, 2.0). Assume unequal variances between the two groups.

t.test(len ~ dose, data = ToothGrowth.doses_1.0_2.0)
## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

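For all three of the above t-tests, the resulting p-value is less than 0.05 and the 95% confidence interval does not contain zero. Each interval lies entirely below zero because the difference is computed as the lower-dose mean minus the higher-dose mean. Thus, we reject the null hypothesis in each case and conclude that a higher dose level is associated with greater mean tooth length.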
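
The three pairwise comparisons can also be collected into one table, which makes it easier to check the p-values and intervals side by side. The following is a sketch, not the method used in the original report. The names `tg`, `dose_pairs`, and `results` are hypothetical, and the code starts from a fresh copy of the data with dose still numeric.

```r
# Fresh copy of the dataset with numeric dose (the name `tg` is an assumption)
tg <- datasets::ToothGrowth

# All pairs of dose levels: (0.5, 1), (0.5, 2), (1, 2)
dose_pairs <- combn(sort(unique(tg$dose)), 2, simplify = FALSE)

# Run a Welch t-test for each pair and keep the key results
results <- do.call(rbind, lapply(dose_pairs, function(pair) {
  pair_data <- tg[tg$dose %in% pair, ]
  pair_data$dose <- factor(pair_data$dose)
  tt <- t.test(len ~ dose, data = pair_data)
  data.frame(
    low_dose = pair[1],
    high_dose = pair[2],
    ci_lower = tt$conf.int[1],
    ci_upper = tt$conf.int[2],
    p_value = tt$p.value
  )
}))
print(results)
```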

Question 4: State your conclusions and the assumptions needed for your conclusions

Conclusions

  1. Supplement type has no statistically significant effect on tooth growth at the 5% level (p ≈ 0.06)
  2. Increasing the dose level leads to increased tooth growth

Assumptions

  1. The experiment was done with random assignment of guinea pigs to the different dose levels and supplement types to control for confounders that might affect the outcome.
  2. Members of the sample population, i.e. the 60 guinea pigs, are representative of the entire population of guinea pigs. This assumption allows us to generalize the results.
  3. For the t-tests, the variances are assumed to be different for the two groups being compared. This assumption is weaker than assuming that the variances are equal.
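
To illustrate the third assumption, the unequal-variance (Welch) test can be set beside the pooled-variance test for the supplement comparison. This is a sketch only. The names `tg`, `welch_test`, and `pooled_test` are hypothetical, and `var.equal = TRUE` is the t.test argument that requests the pooled version.

```r
# Fresh copy of the dataset (the name `tg` is an assumption)
tg <- datasets::ToothGrowth

# Welch test (default): variances are not assumed equal
welch_test <- t.test(len ~ supp, data = tg)

# Pooled test: variances are assumed equal
pooled_test <- t.test(len ~ supp, data = tg, var.equal = TRUE)

# Compare degrees of freedom and p-values for the two versions
comparison <- data.frame(
  test = c("Welch", "pooled"),
  df = c(welch_test$parameter, pooled_test$parameter),
  p_value = c(welch_test$p.value, pooled_test$p.value)
)
print(comparison)
```

Because both supplement groups have the same number of observations, the two versions share the same t statistic and differ only in their degrees of freedom.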