Part II: Basic Inferential Data Analysis
In this part of the project, I analyze the ToothGrowth data from the R datasets package.
Dataset description: The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, or 2 mg/day) by one of two delivery methods: orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC).
## load libraries and set constants
library(RColorBrewer)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## load and perform exploratory analysis
tg = datasets::ToothGrowth
str(tg)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
par(mfrow=c(1, 1))
cols = brewer.pal(n = 11, name = "RdBu")
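## map each supplement type to an explicit color so points and legend agree
supp_cols = c(OJ = "black", VC = "red")
plot(tg$dose, tg$len, col=supp_cols[as.character(tg$supp)], main = "Tooth length by dose and supplement type",
ylab="Length", xlab="Dose (mg/day)")
legend(x="bottomright", legend=names(supp_cols), fill = supp_cols)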
## provide basic summary of data
summary(tg)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
tg %>% group_by(dose, supp) %>% summarise(n=n(), avg_length=mean(len), aggr=sum(len), std_dev=sd(len), std_err=std_dev/sqrt(n))
## `summarise()` regrouping output by 'dose' (override with `.groups` argument)
## # A tibble: 6 x 7
## # Groups: dose [3]
## dose supp n avg_length aggr std_dev std_err
## <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.5 OJ 10 13.2 132. 4.46 1.41
## 2 0.5 VC 10 7.98 79.8 2.75 0.869
## 3 1 OJ 10 22.7 227 3.91 1.24
## 4 1 VC 10 16.8 168. 2.52 0.795
## 5 2 OJ 10 26.1 261. 2.66 0.840
## 6 2 VC 10 26.1 261. 4.80 1.52
par(mfrow=c(1, 2))
lenSupp = as.data.frame(lapply(split(tg$len, tg$supp), mean))
barplot(c(lenSupp[1, 1], lenSupp[1, 2]), col=cols,
main="Average tooth length by\nsupplement type", names.arg=c("OJ", "VC"))
lenDose = as.data.frame(lapply(split(tg$len, tg$dose), mean))
barplot(c(lenDose[1, 1], lenDose[1, 2], lenDose[1, 3]), col=cols,
main="Average tooth length by \n dose (in mg/day)",names.arg=c("0.5", "1", "2"))
## use confidence intervals/hypothesis tests to compare tooth growth by supp and dose
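## Welch two-sample t test (unequal variances) for supplement type
supp = t.test(len ~ supp, data = tg)
## pooled-variance t tests for adjacent dose levels; `paired` is omitted because
## the formula interface is unpaired by default and newer R releases reject it there
dose1 = t.test(len ~ dose, data = tg, subset = dose %in% c(0.5, 1.0),
var.equal = TRUE, alternative = "two.sided")
dose2 = t.test(len ~ dose, data = tg, subset = dose %in% c(1.0, 2.0),
var.equal = TRUE, alternative = "two.sided")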
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there are other approaches worth considering.)
The t test comparing tooth length by supplement type gives a p-value above 0.05 (p = 0.0606345), so the difference in mean length between OJ and VC is not significant at the 5% level and we fail to reject the null hypothesis of equal means. In contrast, both t tests comparing tooth length across adjacent dose levels give p-values well below 0.05 (p = 1.266297 × 10^{-7} for 0.5 vs. 1 mg/day; p = 1.8108285 × 10^{-5} for 1 vs. 2 mg/day), so we reject the null hypothesis: a higher dose is associated with a significantly greater mean tooth length.
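The figures quoted above can be read directly from the objects that t.test() returns. The following is a minimal, self-contained sketch rather than part of the original analysis: it refits the three tests with the same variance settings (leaving out `paired`, since the formula interface is unpaired by default) and gathers each p-value and 95% confidence interval into one data frame. The names `tests` and `results` are introduced here purely for illustration.

```r
# Sketch: refit the three tests and collect p-values and confidence intervals.
tg <- datasets::ToothGrowth

# Welch test for supplement type; pooled-variance tests for adjacent doses
supp  <- t.test(len ~ supp, data = tg)
dose1 <- t.test(len ~ dose, data = tg[tg$dose %in% c(0.5, 1), ], var.equal = TRUE)
dose2 <- t.test(len ~ dose, data = tg[tg$dose %in% c(1, 2), ], var.equal = TRUE)

# `tests` and `results` are illustrative names, not from the original report
tests <- list(supp = supp, dose_0.5_vs_1 = dose1, dose_1_vs_2 = dose2)

results <- data.frame(
  comparison = names(tests),
  p_value    = sapply(tests, function(x) x$p.value),
  # interval for mean(first group) - mean(second group)
  ci_lower   = sapply(tests, function(x) x$conf.int[1]),
  ci_upper   = sapply(tests, function(x) x$conf.int[2]),
  row.names  = NULL
)

# Flag comparisons that are significant at the 5% level
results$significant <- results$p_value < 0.05
print(results)
```

A 95% interval that contains zero goes together with a two-sided p-value above 0.05 for the same test, so the two columns offer a quick consistency check on the conclusions.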
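Assumptions: the observations in the dataset are an accurate representation of the population as a whole; the groups being compared are independent (unpaired) samples; and, for the dose comparisons, the two groups have equal variances (those tests set var.equal).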
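Because the dose comparisons are run with var.equal switched on, one way to see how much the conclusion depends on equal variances is to rerun them as Welch tests and compare the p-values. This is a hedged sketch and not part of the original analysis; `dose_p`, `dose_pairs`, and `sensitivity` are hypothetical names used only here.

```r
# Sketch: compare pooled-variance and Welch p-values for the dose comparisons.
tg <- datasets::ToothGrowth

# Hypothetical helper: p-value of the two-sample t test for one pair of doses
dose_p <- function(pair, pooled) {
  sub <- tg[tg$dose %in% pair, ]
  t.test(len ~ dose, data = sub, var.equal = pooled)$p.value
}

dose_pairs <- list("0.5 vs 1" = c(0.5, 1), "1 vs 2" = c(1, 2))

sensitivity <- data.frame(
  comparison = names(dose_pairs),
  pooled     = sapply(dose_pairs, dose_p, pooled = TRUE),   # var.equal = TRUE
  welch      = sapply(dose_pairs, dose_p, pooled = FALSE),  # var.equal = FALSE
  row.names  = NULL
)
print(sensitivity)
```

If both columns stay far below 0.05, the dose conclusion does not hinge on the equal-variance setting.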