This is the second part of a project for the Statistical Inference course, which is part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.
The report performs some basic inferential analyses of the ToothGrowth data from the R datasets package, examining the effect of vitamin C dose and delivery method on tooth growth.
library(data.table); library(ggplot2); library(plotly)
The ToothGrowth R help page states that 60 guinea pigs each received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice or ascorbic acid.
Load the ToothGrowth data set, and look at its structure (Appendix: code structure):
Classes 'data.table' and 'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
- attr(*, ".internal.selfref")=<externalptr>
The ToothGrowth R help page describes the variables in ToothGrowth as follows:
len: numeric, tooth length
supp: factor, supplement type (VC as vitamin C or OJ as orange juice)
dose: numeric, dose in milligrams/day
Visualize the distribution of tooth len (center and variability) by supp and dose (Appendix: code boxplot):
It looks like both supp and dose affect tooth growth: len tends to be higher with the Orange Juice supplement and increases with the dose level.
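The visual impression above can be checked numerically. The following is a minimal sketch, not part of the original report, that computes the mean len for each supp/dose combination with base R; the name group_means is introduced here only for illustration.

```r
# Illustrative check, not part of the original analysis:
# mean tooth length for every supplement/dose combination.
tg <- datasets::ToothGrowth   # built-in data set, untouched copy

group_means <- aggregate(len ~ supp + dose, data = tg, FUN = mean)
print(group_means)
```

In the built-in data the group means rise with dose for both supplements, and Orange Juice has the higher mean at doses 0.5 and 1, while at dose 2 the two supplements are nearly equal.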
Get summary statistics for the variables (Appendix: code summary):
len supp dose
Min. : 4.20 OJ:30 Min. :0.500
1st Qu.:13.07 VC:30 1st Qu.:0.500
Median :19.25 Median :1.000
Mean :18.81 Mean :1.167
3rd Qu.:25.27 3rd Qu.:2.000
Max. :33.90 Max. :2.000
Perform Student's t-tests with the probability of rejecting the null hypothesis when it is true (the \(Type\ I\ error\ rate\)) fixed at \(\alpha=0.05\) (the significance level).
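To make the meaning of \(\alpha\) concrete, the following minimal sketch (not part of the original analysis) simulates many two-sample t-tests in which the null hypothesis is true by construction and records how often it is rejected. The group size, the number of replications, and the seed are illustrative assumptions.

```r
# Illustrative simulation, not part of the original report:
# both samples come from the same normal distribution, so H0 is true.
set.seed(2024)       # arbitrary seed, for reproducibility only
n_per_group <- 30    # assumed; matches the 30 animals per supplement
n_sims <- 2000       # assumed number of simulated experiments
alpha <- 0.05

# p-value of a Welch t-test for each simulated experiment
p_values <- replicate(
  n_sims,
  t.test(rnorm(n_per_group), rnorm(n_per_group))$p.value
)

# Share of simulated experiments that wrongly reject H0
type1_rate <- mean(p_values < alpha)
print(type1_rate)
```

The printed proportion should land near 0.05, up to simulation noise: this is the long-run share of false rejections that the significance level controls.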
To conduct hypothesis t-tests like that, the following assumptions must be fulfilled:
First, perform the relevant t-test for the mean difference between the supp groups, \(\mu_{OJ}-\mu_{VC}\) (Appendix: code testsupp):
\(H_0:\mu_{OJ}=\mu_{VC}\) (i.e. the mean tooth lengths for the two supplements are equal) versus
\(H_a:\mu_{OJ}\neq\mu_{VC}\) (i.e. the mean tooth lengths for the two supplements differ)
Welch Two Sample t-test
data: len by supp
t = 1.9153, df = 55.309, p-value = 0.06063
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1710156 7.5710156
sample estimates:
mean in group OJ mean in group VC
20.66333 16.96333
Conclusions from these results:
the \(95\%\) confidence interval (-0.171, 7.571) for \(\mu_{OJ}-\mu_{VC}\) contains zero (i.e. a mean difference of zero is among the values consistent with the data at the \(95\%\) confidence level), and also
the p-value of 0.0606 is greater than \(\alpha=0.05\) (i.e. if \(H_0\) were true, a difference at least this extreme would be observed about \(6\%\) of the time, which is not rare enough to reject \(H_0\) at the \(5\%\) level).
So this t-test fails to reject the null hypothesis \(H_0:\mu_{OJ}=\mu_{VC}\), and hence it cannot be concluded that the mean tooth length differs between the supplement types Orange Juice and Vitamin C.
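The numbers reported by the Welch test can be reproduced step by step. The sketch below is an illustration rather than the report's own code: it computes the statistic \(t=(\bar{x}_{OJ}-\bar{x}_{VC})/\sqrt{s_{OJ}^2/n_{OJ}+s_{VC}^2/n_{VC}}\), the Welch-Satterthwaite degrees of freedom, the two-sided p-value, and the \(95\%\) confidence interval. All variable names are introduced here for illustration.

```r
# Illustrative hand computation of the Welch two-sample t-test (OJ vs VC).
tg <- datasets::ToothGrowth
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# Squared standard error of each group mean
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)
se_diff <- sqrt(se2_oj + se2_vc)

# Test statistic for the difference in means (OJ minus VC)
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df_welch) * se_diff

print(c(t = t_stat, df = df_welch, p = p_value))
print(ci)
```

These values should agree with the t.test output shown above (t = 1.9153, df = 55.309, p-value = 0.06063).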
Second, perform the relevant t-tests for the mean difference between each pair of dose groups, \(\mu_{dose_{i}}-\mu_{dose_{j}}\) (Appendix: code dose):
\(H_0:\mu_{dose_{i}}=\mu_{dose_{j}}\) (the mean tooth lengths for the two dose levels are equal) versus
\(H_a:\mu_{dose_{i}}\neq\mu_{dose_{j}}\) (the mean tooth lengths for the two dose levels differ)
Conclusions from these results:
all three \(95\%\) confidence intervals
(-11.9838, -6.2762) for dose levels \(0.5\ and\ 1.0\)
(-8.9965, -3.7335) for dose levels \(1.0\ and\ 2.0\)
(-18.1562, -12.8338) for dose levels \(0.5\ and\ 2.0\)
lie entirely below zero (i.e. a mean difference of zero is not consistent with the data at the \(95\%\) confidence level), and also
all three p-values
0.0000001 for dose levels \(0.5\ and\ 1.0\)
0.0000191 for dose levels \(1.0\ and\ 2.0\)
\(4.4\cdot 10^{-14}\) for dose levels \(0.5\ and\ 2.0\)
are less than the significance level \(\alpha=0.05\) (i.e. if \(H_0\) were true, differences at least this extreme would be very unlikely).
So these t-tests reject \(H_0\) in favour of the alternative \(H_a:\mu_{dose_{i}}\neq\mu_{dose_{j}}\), which suggests that the mean tooth length differs between every pair of the dose levels \(0.5,\ 1.0\) and \(2.0\), with the higher dose giving the larger mean in each pair.
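The three pairwise comparisons can also be produced in one pass and collected into a table. This is a minimal sketch under the same setup as the report (Welch t-tests via t.test); the names dose_pairs and results are hypothetical and not from the original code.

```r
# Illustrative sketch: run the Welch t-test for every pair of dose levels
# and tabulate the confidence interval and p-value of each comparison.
tg <- datasets::ToothGrowth

# All pairs of distinct dose levels: (0.5, 1), (0.5, 2), (1, 2)
dose_pairs <- combn(sort(unique(tg$dose)), 2, simplify = FALSE)

results <- do.call(rbind, lapply(dose_pairs, function(pr) {
  tt <- t.test(len ~ dose, data = tg[tg$dose %in% pr, ])
  data.frame(
    dose_low  = pr[1],
    dose_high = pr[2],
    ci_lower  = tt$conf.int[1],   # interval for mean(low) - mean(high)
    ci_upper  = tt$conf.int[2],
    p_value   = tt$p.value
  )
}))
print(results)
```

Each row corresponds to one of the comparisons listed above; a negative interval means the lower dose has the smaller mean tooth length.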
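Summarizing, it can be concluded that:
the supplement type (Orange Juice versus Vitamin C) shows no statistically significant effect on mean tooth length at the \(5\%\) significance level;
the dose level has a statistically significant effect on mean tooth length, which increases with the dose.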
It is also worth recalling that the conclusions above can be drawn ONLY under the assumptions that:
structure
# Convert the built-in data frame to a data.table and inspect its structure
ToothGrowth <- data.table(ToothGrowth)
str(ToothGrowth)
boxplot
# Facet labels for the supplement types and dose levels
supp.labs <- c("supplement: Orange Juice", "supplement: Vitamin C")
names(supp.labs) <- c("OJ", "VC")
dose.labs <- c("dose: 0.5", "dose: 1", "dose: 2")
names(dose.labs) <- c("0.5", "1", "2")
boxplot <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_boxplot(aes(fill = supp), colour = "darkgrey",
               position = position_dodge(0.9)) +
  scale_fill_manual(values = c("violet", "purple")) +
  facet_grid(dose ~ supp,
             labeller = labeller(dose = dose.labs, supp = supp.labs)) +
  labs(title = "Tooth Length by Dosage/Supplement (interactive)",
       x = "dose (mg/day)",
       y = "tooth length")
# Convert to an interactive plotly figure and hide the legend
gbox <- ggplotly(boxplot) %>% layout(showlegend = FALSE)
summary
summary(ToothGrowth)
supp
# Welch two-sample t-test of len between the supplement types
supp <- t.test(len ~ supp, data = ToothGrowth)
psupp <- supp$p.value
intsupp <- supp$conf.int
# Print the test result
supp
dose
# Welch two-sample t-tests of len for each pair of dose levels
dose1 <- t.test(len ~ dose, data =
                  ToothGrowth[ToothGrowth$dose %in% c(0.5, 1.0), ])
dose2 <- t.test(len ~ dose, data =
                  ToothGrowth[ToothGrowth$dose %in% c(1.0, 2.0), ])
dosetot <- t.test(len ~ dose, data =
                    ToothGrowth[ToothGrowth$dose %in% c(0.5, 2.0), ])
# Extract p-values and confidence intervals for the report text
pdose1 <- dose1$p.value
pdose2 <- dose2$p.value
pdosetot <- dosetot$p.value
intdose1 <- dose1$conf.int
intdose2 <- dose2$conf.int
intdosetot <- dosetot$conf.int