What Big Teeth You Have!

Overview

This is the second part of a project for the Statistical Inference course, which is part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.

The report performs a basic inferential analysis of the ToothGrowth data in the R datasets package, examining the effect of vitamin C dose and delivery method on tooth growth.


library(data.table); library(ggplot2); library(plotly)

1. Basic exploratory data analyses

The ToothGrowth R help page states that 60 guinea pigs each received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods: orange juice or ascorbic acid.

Load the ToothGrowth data set, and look at its structure (Appendix: code structure):

Classes 'data.table' and 'data.frame':  60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 - attr(*, ".internal.selfref")=<externalptr>

The ToothGrowth R help page describes the variables as follows:

len: numeric, tooth length
supp: factor, supplement type (VC for ascorbic acid, a form of vitamin C, or OJ for orange juice)
dose: numeric, dose in milligrams/day

Visualize how the centre and spread of tooth len vary with supp and dose (Appendix: code boxplot).

Both supp and dose appear to affect tooth growth: len tends to be higher for orange juice (OJ) than for ascorbic acid (VC), and it increases with dose level.
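
The visual impression can be cross-checked numerically. The sketch below is not part of the original analysis: it assumes the untouched `datasets::ToothGrowth` data frame and uses base R's `aggregate` to tabulate the mean and standard deviation of len for each supp/dose cell. The names `tg` and `cell_stats` are illustrative.

```r
# Illustrative cross-check (not from the original report):
# mean and standard deviation of tooth length per supp/dose cell.
tg <- datasets::ToothGrowth

cell_stats <- aggregate(len ~ supp + dose, data = tg,
                        FUN = function(x) c(mean = mean(x), sd = sd(x)))

# aggregate() stores both statistics in one matrix column;
# flatten it into ordinary columns len.mean and len.sd.
cell_stats <- do.call(data.frame, cell_stats)
print(cell_stats)
```

Reading down the len.mean column by dose, and across by supp, gives the same comparison as the boxplots, in numbers.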

2. Basic summary of the data

Get summary statistics for the variables (Appendix: code summary):

      len        supp         dose      
 Min.   : 4.20   OJ:30   Min.   :0.500  
 1st Qu.:13.07   VC:30   1st Qu.:0.500  
 Median :19.25           Median :1.000  
 Mean   :18.81           Mean   :1.167  
 3rd Qu.:25.27           3rd Qu.:2.000  
 Max.   :33.90           Max.   :2.000
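
Because `summary` treats dose as a continuous variable, its mean (1.167) says little about a design with three fixed dose levels. As a hedged complement, not taken from the original report, the sketch below cross-tabulates supp against dose to show how many animals fall into each cell. It works on `datasets::ToothGrowth` directly, and the name `cell_counts` is illustrative.

```r
# Illustrative complement to summary(): animals per supp/dose cell.
tg <- datasets::ToothGrowth

cell_counts <- table(supp = tg$supp, dose = tg$dose)
print(cell_counts)
```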

3. Confidence intervals and hypothesis tests: tooth growth by supp and dose

Perform two-sided two-sample t-tests (Welch's version, the default of R's t.test, which does not assume equal variances), fixing the probability of rejecting the null hypothesis when it is true (the \(Type\ I\ error\ rate\)) at \(\alpha=0.05\) (the significance level).

For such t-tests to be valid, the following assumptions must hold:

the animals are a representative sample from the population of interest,
the data are \(iid\) within each group (i.e. the animals were randomly assigned to dose level and supp type),
the data in each group are roughly symmetric and mound shaped.
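
The last assumption can be probed informally. The sketch below is an addition, not something the original report did: it runs base R's `shapiro.test` on len within each supp/dose cell of `datasets::ToothGrowth`. With so few animals per cell the test has little power, so a large p-value only means that no gross departure from normality was detected. The name `normality_p` is illustrative.

```r
# Illustrative assumption check (not part of the original analysis):
# Shapiro-Wilk p-value for tooth length in every supp/dose cell.
tg <- datasets::ToothGrowth

normality_p <- tapply(tg$len,
                      list(supp = tg$supp, dose = tg$dose),
                      function(x) shapiro.test(x)$p.value)
print(round(normality_p, 3))
```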

First, perform the t-test for the mean difference between the supp groups, \(\mu_{OJ}-\mu_{VC}\) (Appendix: code supp):

\(H_0:\mu_{OJ}=\mu_{VC}\) (i.e. the mean tooth lengths for the two supplements are equal) versus

\(H_a:\mu_{OJ}\neq\mu_{VC}\) (i.e. the mean tooth lengths for the two supplements differ)


    Welch Two Sample t-test

data:  len by supp
t = 1.9153, df = 55.309, p-value = 0.06063
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1710156  7.5710156
sample estimates:
mean in group OJ mean in group VC 
        20.66333         16.96333

Conclusions from the results:

the \(95\%\) confidence interval (-0.171, 7.571) for \(\mu_{OJ}-\mu_{VC}\) contains zero (i.e. a zero mean difference is among the values compatible with the data; over repeated sampling, intervals built this way cover the true difference about \(95\%\) of the time), and also
the p-value 0.0606 is greater than \(\alpha=0.05\) (i.e. if \(H_0\) were true, a test statistic at least this extreme would occur in about \(6\%\) of samples, which is more often than the \(5\%\) threshold).

So this t-test fails to reject the null hypothesis \(H_0:\mu_{OJ}=\mu_{VC}\), and hence the data do not support the conclusion that mean tooth length differs between the supplement types orange juice and ascorbic acid.
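
To make the reported numbers less of a black box, the Welch test for supp can be reproduced by hand. The sketch below is an illustration added here, assuming `datasets::ToothGrowth`. It computes the statistic \(t=(\bar{x}_{OJ}-\bar{x}_{VC})/\sqrt{s_{OJ}^2/n_{OJ}+s_{VC}^2/n_{VC}}\), the Welch-Satterthwaite degrees of freedom, the two-sided p-value and the \(95\%\) confidence interval. All variable names are illustrative.

```r
# Illustrative hand computation of the Welch two-sample t-test for supp.
tg <- datasets::ToothGrowth
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# Squared standard errors of the two group means
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

diff_means <- mean(oj) - mean(vc)
se_diff    <- sqrt(se2_oj + se2_vc)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat   <- diff_means / se_diff
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for mu_OJ - mu_VC
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci      <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

print(c(t = t_stat, df = df_welch, p = p_value, lower = ci[1], upper = ci[2]))
```

The values should agree with the t.test output shown above (t = 1.9153, df = 55.309, p-value = 0.06063).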

Second, perform the t-tests for the mean difference between each pair of dose groups, \(\mu_{dose_{i}}-\mu_{dose_{j}}\) (Appendix: code dose):

\(H_0:\mu_{dose_{i}}=\mu_{dose_{j}}\) (the mean tooth lengths for the two dose levels are equal) versus

\(H_a:\mu_{dose_{i}}\neq\mu_{dose_{j}}\) (the mean tooth lengths for the two dose levels differ)

Conclusions from the results:

all three \(95\%\) confidence intervals for \(\mu_{dose_{i}}-\mu_{dose_{j}}\) (lower dose minus higher dose)

(-11.9838, -6.2762) for dose levels \(0.5\ and\ 1.0\)

(-8.9965, -3.7335) for dose levels \(1.0\ and\ 2.0\)

(-18.1562, -12.8338) for dose levels \(0.5\ and\ 2.0\)

lie entirely below zero (i.e. a zero mean difference is not among the values compatible with the data, and in each pair the higher dose has the larger mean), and also
all three p-values

0.0000001 for dose levels \(0.5\ and\ 1.0\)

0.0000191 for dose levels \(1.0\ and\ 2.0\)

\(4.4\cdot 10^{-14}\) for dose levels \(0.5\ and\ 2.0\)

are less than the significance level \(\alpha=0.05\) (i.e. if \(H_0\) were true, differences at least this extreme would be very unlikely).

So these t-tests reject \(H_0\) in favour of the alternative \(H_a:\mu_{dose_{i}}\neq\mu_{dose_{j}}\), which suggests that mean tooth length differs between every pair of dose levels \(0.5,\ 1.0\) and \(2.0\).
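
The three dose comparisons can also be collected into one table. The sketch below is an alternative to the three separate calls in the Appendix, not the author's code: it assumes `datasets::ToothGrowth`, enumerates the dose pairs with `combn`, and runs the same default (Welch) `t.test` on each pair. The names `dose_pairs` and `dose_tests` are illustrative.

```r
# Illustrative loop over all pairs of dose levels (Welch t-test per pair).
tg <- datasets::ToothGrowth

dose_pairs <- combn(c(0.5, 1, 2), 2)  # one column per pair of dose levels

dose_tests <- apply(dose_pairs, 2, function(pair) {
  tt <- t.test(len ~ dose, data = tg[tg$dose %in% pair, ])
  c(dose_a  = pair[1],
    dose_b  = pair[2],
    lower   = tt$conf.int[1],  # CI for mean(dose_a) - mean(dose_b)
    upper   = tt$conf.int[2],
    p_value = tt$p.value)
})

# One row per comparison
dose_tests <- as.data.frame(t(dose_tests))
print(dose_tests)
```

Each row corresponds to one of the intervals and p-values listed above. Because three tests are run at \(\alpha=0.05\) each, the chance of at least one false rejection across the family is somewhat higher than \(5\%\), although the p-values here are far below any usual corrected threshold.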

4. Conclusions/ assumptions needed for the conclusions

In summary:

at the \(5\%\) significance level there is insufficient evidence that the delivery method (orange juice versus ascorbic acid) affects guinea pig tooth growth,
a higher vitamin C dose is associated with greater mean tooth length across the three dose levels studied.

These conclusions hold ONLY under the assumptions that:

the guinea pigs are a representative sample from the population of interest,
the data are \(iid\), roughly symmetric and mound shaped.

Appendix

`structure`

ToothGrowth <- data.table(ToothGrowth)
str(ToothGrowth)

`boxplot`

# Facet labels for supplement type and dose level
supp.labs <- c("supplement: Orange Juice", "supplement: Vitamin C")
names(supp.labs) <- c("OJ", "VC")
dose.labs <- c("dose: 0.5", "dose: 1", "dose: 2")
names(dose.labs) <- c("0.5", "1", "2")
# One box per supp/dose panel
boxplot <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_boxplot(aes(fill = supp), colour = "darkgrey",
               position = position_dodge(0.9)) +
  scale_fill_manual(values = c("violet", "purple")) +
  facet_grid(dose ~ supp,
             labeller = labeller(dose = dose.labs, supp = supp.labs)) +
  labs(title = "Tooth Length by Dosage/Supplement (interactive)",
       x = "dose (mg/day)",
       y = "tooth length")
# Convert to an interactive plotly widget, hide the legend, and display it
gbox <- ggplotly(boxplot) %>% layout(showlegend = FALSE)
gbox

`summary`

summary(ToothGrowth)

`supp`

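# Welch two-sample t-test of len between the two supplement types
supp <- t.test(len ~ supp, data = ToothGrowth)
psupp <- supp$p.value      # two-sided p-value
intsupp <- supp$conf.int   # 95% confidence interval for mu_OJ - mu_VC
supp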

`dose`

# Welch two-sample t-tests of len for each pair of dose levels
dose1 <- t.test(len ~ dose, data =
                  ToothGrowth[ToothGrowth$dose %in% c(0.5, 1.0), ])
dose2 <- t.test(len ~ dose, data =
                  ToothGrowth[ToothGrowth$dose %in% c(1.0, 2.0), ])
dosetot <- t.test(len ~ dose, data =
                    ToothGrowth[ToothGrowth$dose %in% c(0.5, 2.0), ])
# Two-sided p-values
pdose1 <- dose1$p.value
pdose2 <- dose2$p.value
pdosetot <- dosetot$p.value
# 95% confidence intervals (lower dose mean minus higher dose mean)
intdose1 <- dose1$conf.int
intdose2 <- dose2$conf.int
intdosetot <- dosetot$conf.int