Overview

Part II: Basic Inferential Data Analysis
In this part of the project, I analyze the ToothGrowth data in the R datasets package.
Dataset description: The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Part II

## load libraries and set constants 
library(RColorBrewer)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## load and perform exploratory analysis
tg = datasets::ToothGrowth
str(tg)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

par(mfrow=c(1, 1))
cols = brewer.pal(n = 11, name = "RdBu")
plot(tg$dose, tg$len, col=tg$supp, main = "Tooth length by dose and supplement type",
     ylab="Length", xlab="Dose (mg/day)")
## legend colors use the same factor codes as col=tg$supp in the plot above
legend(x="bottomright", legend=levels(tg$supp), fill=seq_along(levels(tg$supp)))

Load the ToothGrowth data and perform some basic exploratory data analyses: In addition to inspecting the dataset's structure with str(), I plotted tooth length against dose, with points colored by supplement type.

## provide basic summary of data
summary(tg)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

tg %>% group_by(dose, supp) %>%
  summarise(n=n(), avg_length=mean(len), aggr=sum(len), std_dev=sd(len), std_err=std_dev/sqrt(n))

## `summarise()` regrouping output by 'dose' (override with `.groups` argument)

## # A tibble: 6 x 7
## # Groups:   dose [3]
##    dose supp      n avg_length  aggr std_dev std_err
##   <dbl> <fct> <int>      <dbl> <dbl>   <dbl>   <dbl>
## 1   0.5 OJ       10      13.2  132.     4.46   1.41 
## 2   0.5 VC       10       7.98  79.8    2.75   0.869
## 3   1   OJ       10      22.7  227      3.91   1.24 
## 4   1   VC       10      16.8  168.     2.52   0.795
## 5   2   OJ       10      26.1  261.     2.66   0.840
## 6   2   VC       10      26.1  261.     4.80   1.52

par(mfrow=c(1, 2))
## mean tooth length for each supplement type (named by factor level)
lenSupp = tapply(tg$len, tg$supp, mean)
barplot(lenSupp, col=cols, main="Average tooth length by\nsupplement type")
## mean tooth length for each dose level
lenDose = tapply(tg$len, tg$dose, mean)
barplot(lenDose, col=cols, main="Average tooth length by\ndose (in mg/day)")

Provide a basic summary of the data: Above, summary() gives an overall summary of the data, and the grouped table breaks down each dose-supplement combination, reporting the group size, average length, total length, standard deviation, and standard error of the mean. In addition, I created two bar plots, one showing the average tooth length by supplement type and the other the average tooth length by dose.
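As a small check on the grouped table, the standard error of one group can be recomputed by hand as the standard deviation divided by the square root of the group size. The sketch below does this for the OJ group at 0.5 mg/day using base R only; the variable names (oj_low, n_obs, std_err) are illustrative and not part of the original analysis.

```r
## recompute the summary statistics for one dose-supplement group by hand
tg <- datasets::ToothGrowth

## tooth lengths for orange juice at the lowest dose
oj_low <- tg$len[tg$supp == "OJ" & tg$dose == 0.5]

n_obs   <- length(oj_low)          # group size
avg_len <- mean(oj_low)            # average length
std_dev <- sd(oj_low)              # sample standard deviation
std_err <- std_dev / sqrt(n_obs)   # standard error of the mean

print(c(n = n_obs, avg_length = avg_len, std_dev = std_dev, std_err = std_err))
```

The printed values should agree with the first row of the grouped table, up to rounding.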

## use confidence intervals/hypothesis tests to compare tooth growth by supp and dose
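## Welch two-sample t test: OJ vs VC, pooled over all doses
supp = t.test(len ~ supp, data = tg)
## equal-variance two-sample t tests between adjacent dose levels
## (groups are independent, so the default unpaired test is used)
dose1 = t.test(len ~ dose, data = tg, subset = dose %in% c(0.5, 1.0),
       var.equal = TRUE, alternative = "two.sided")
dose2 = t.test(len ~ dose, data = tg, subset = dose %in% c(1.0, 2.0),
       var.equal = TRUE, alternative = "two.sided")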

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
The p-value of the t test comparing tooth growth by supplement type is > 0.05 (p-value = 0.0606345), so the difference in mean length between the two supplement types is not significant at the 5% level (we fail to reject the null hypothesis). In contrast, the p-values of both t tests comparing tooth length by dose are < 0.05 (p-value = 1.266297 × 10^-7 for 0.5 vs. 1 mg/day; p-value = 1.8108285 × 10^-5 for 1 vs. 2 mg/day), indicating that each increase in dose level has a significant effect on mean tooth length (we reject the null hypothesis).
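The same conclusions can also be read off the 95% confidence intervals stored in each test object. The following is a minimal sketch, assuming the same three tests as above; the object names (supp_test, low_mid, mid_high, results) are illustrative and not part of the original analysis.

```r
## collect p-values and 95% confidence intervals from the three t tests
tg <- datasets::ToothGrowth

## Welch t test for supplement type; equal-variance t tests for adjacent doses
supp_test <- t.test(len ~ supp, data = tg)
low_mid   <- t.test(len ~ dose, data = tg, subset = dose %in% c(0.5, 1),
                    var.equal = TRUE)
mid_high  <- t.test(len ~ dose, data = tg, subset = dose %in% c(1, 2),
                    var.equal = TRUE)

tests <- list(supp_test, low_mid, mid_high)
results <- data.frame(
  comparison = c("OJ vs VC", "0.5 vs 1 mg/day", "1 vs 2 mg/day"),
  p_value    = sapply(tests, function(t) t$p.value),
  ci_lower   = sapply(tests, function(t) t$conf.int[1]),
  ci_upper   = sapply(tests, function(t) t$conf.int[2])
)
print(results)
```

An interval that contains zero corresponds to failing to reject the null hypothesis at the 5% level; an interval that excludes zero corresponds to rejecting it.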

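Assumptions: the observations in the dataset are a representative sample of the population as a whole; the groups being compared are independent (the tests are unpaired); and, for the two dose comparisons, the groups have equal variance (the var.equal option is turned on in those tests).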

Statistical Inference Course Project

Fabiola

9/1/2020
