Overview

This is the second part of a project for the Statistical Inference course, which is part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.

The report performs a basic inferential analysis of the ToothGrowth data from the R datasets package to examine the effect of vitamin C dose and delivery method on tooth growth.


library(data.table); library(ggplot2); library(plotly)

1. Basic exploratory data analyses

The ToothGrowth R help page states that 60 guinea pigs each received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid.

Load the ToothGrowth data set, and look at its structure (Appendix: code structure):

Classes 'data.table' and 'data.frame':  60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The ToothGrowth R help page describes the variables as follows:

  • len: numeric, tooth length
  • supp: factor, supplement type (VC for ascorbic acid, a form of vitamin C, or OJ for orange juice)
  • dose: numeric, dose in milligrams/day

Visualize the location and variability of tooth len by supp and dose with boxplots (Appendix: code boxplot):

The plot suggests that both supp and dose affect tooth growth: len tends to be larger with orange juice and to increase with dose level.
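The visual impression can be cross-checked numerically. The following is a minimal sketch, not part of the original analysis, that tabulates the mean len for each supp/dose combination with base R's aggregate(). The names tg, group_means and mean_len are illustrative only.

```r
# Work on a plain data.frame copy of the built-in data set
tg <- datasets::ToothGrowth

# Mean tooth length for every supplement/dose combination
group_means <- aggregate(len ~ supp + dose, data = tg, FUN = mean)
names(group_means)[3] <- "mean_len"

print(group_means)
```

If the boxplot reading is right, mean_len should rise with dose within each supplement type.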

2. Basic summary of the data

Get summary statistics for the variables (Appendix: code summary):

      len        supp         dose      
 Min.   : 4.20   OJ:30   Min.   :0.500  
 1st Qu.:13.07   VC:30   1st Qu.:0.500  
 Median :19.25           Median :1.000  
 Mean   :18.81           Mean   :1.167  
 3rd Qu.:25.27           3rd Qu.:2.000  
 Max.   :33.90           Max.   :2.000  

3. Confidence intervals and hypothesis tests: tooth growth by supp and dose

Perform two-sided t-tests (Welch's version, the default of R's t.test, which does not assume equal variances), fixing the probability of rejecting the null hypothesis when it is true (the Type I error rate) at \(\alpha=0.05\) (the significance level).

For such t-tests to be valid, the following assumptions must hold:

  • the objects are a representative sample from the same population,
  • the data are \(iid\) (i.e. the objects were randomly assigned to the different dose levels and supp types),
  • the data are roughly symmetric and mound shaped.
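Some of these assumptions can be probed informally. The sketch below is an addition, not part of the original report: it counts the observations in each supp/dose cell and, as one possible check of the "mound shaped" condition, computes a Shapiro-Wilk p-value per cell with shapiro.test(). The choice of Shapiro-Wilk and the names tg, cell_counts and shapiro_p are assumptions made for illustration.

```r
tg <- datasets::ToothGrowth

# Number of guinea pigs in every supplement/dose cell
cell_counts <- table(tg$supp, tg$dose)
print(cell_counts)

# Shapiro-Wilk normality p-value for each cell (small p hints at non-normality)
shapiro_p <- aggregate(len ~ supp + dose, data = tg,
                       FUN = function(x) shapiro.test(x)$p.value)
names(shapiro_p)[3] <- "shapiro_p"
print(shapiro_p)
```

With only a handful of animals per cell, such a test has little power, so a large p-value is weak reassurance rather than proof of normality. Representativeness and random assignment cannot be checked from the data at all.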

First, perform the t-test for the mean difference between the supp groups, \(\mu_{OJ}-\mu_{VC}\) (Appendix: code supp):

\(H_0:\mu_{OJ}=\mu_{VC}\) (i.e. the mean tooth lengths for the two supplements are equal) versus

\(H_a:\mu_{OJ}\neq\mu_{VC}\) (i.e. the mean tooth lengths for the two supplements differ)


    Welch Two Sample t-test

data:  len by supp
t = 1.9153, df = 55.309, p-value = 0.06063
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1710156  7.5710156
sample estimates:
mean in group OJ mean in group VC 
        20.66333         16.96333 

Conclusions from the results:

  • the \(95\%\) confidence interval (-0.171, 7.571) contains zero (i.e. a zero mean difference is among the values compatible with the data; in repeated sampling, intervals built this way cover the true mean difference about \(95\%\) of the time), and also

  • the p-value of 0.0606 is greater than \(\alpha=0.05\) (i.e. if \(H_0\) were true, a test statistic at least this extreme would occur with probability of about \(6\%\), which exceeds the \(5\%\) threshold).

So this t-test fails to reject the null hypothesis \(H_0:\mu_{OJ}=\mu_{VC}\), and hence one cannot conclude that the mean tooth length differs between the supplement types orange juice and vitamin C (ascorbic acid).
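The quantities in the t.test output for the supp comparison can be reproduced by hand, which makes explicit what the Welch test computes. This is a sketch under the assumption that the standard Welch formulas apply (unpooled standard error and Welch-Satterthwaite degrees of freedom); all variable names are illustrative.

```r
tg <- datasets::ToothGrowth
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# Squared standard error of each group mean (variances are not pooled)
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)
se_diff <- sqrt(se2_oj + se2_vc)

# Welch t statistic for H0: mu_OJ = mu_VC
t_stat <- (mean(oj) - mean(vc)) / se_diff

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for mu_OJ - mu_VC
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df = df_welch) * se_diff

print(c(t = t_stat, df = df_welch, p = p_value))
print(ci)
```

The printed values should agree with the t, df, p-value and confidence interval that t.test(len ~ supp, data = ToothGrowth) reports.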

Second, perform the t-tests for the mean difference between each pair of dose groups, \(\mu_{dose_{i}}-\mu_{dose_{j}}\) (Appendix: code dose):

\(H_0:\mu_{dose_{i}}=\mu_{dose_{j}}\) (the mean tooth lengths for the two dose levels are equal) versus

\(H_a:\mu_{dose_{i}}\neq\mu_{dose_{j}}\) (the mean tooth lengths for the two dose levels differ)

Conclusions from the results:

  • all three \(95\%\) confidence intervals

    (-11.9838, -6.2762) for dose levels \(0.5\) and \(1.0\)

    (-8.9965, -3.7335) for dose levels \(1.0\) and \(2.0\)

    (-18.1562, -12.8338) for dose levels \(0.5\) and \(2.0\)

    lie entirely below zero (i.e. a zero mean difference is not among the values compatible with the data at the \(95\%\) confidence level), and also

  • all three p-values

    0.0000001 for dose levels \(0.5\) and \(1.0\)

    0.0000191 for dose levels \(1.0\) and \(2.0\)

    \(4.4\cdot 10^{-14}\) for dose levels \(0.5\) and \(2.0\)

    are less than the significance level \(\alpha=0.05\) (i.e. if \(H_0\) were true, differences at least this extreme would be very unlikely).

So these t-tests reject \(H_0\) in favour of the alternative \(H_a:\mu_{dose_{i}}\neq\mu_{dose_{j}}\), which suggests that the mean tooth length differs between every pair of dose levels \(0.5,\ 1.0\) and \(2.0\).
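The three pairwise comparisons can also be run in a loop and collected into one table. The sketch below is an illustrative alternative to three separate t.test calls, not the code used for the report; dose_pairs, dose_results and the column names are hypothetical.

```r
tg <- datasets::ToothGrowth

# All unordered pairs of dose levels: (0.5, 1), (0.5, 2), (1, 2)
dose_pairs <- combn(c(0.5, 1, 2), 2, simplify = FALSE)

# One Welch t-test per pair; keep the confidence interval and p-value
dose_results <- do.call(rbind, lapply(dose_pairs, function(p) {
  pair_data <- tg[tg$dose %in% p, ]
  tt <- t.test(len ~ dose, data = pair_data)
  data.frame(dose_low  = p[1],
             dose_high = p[2],
             ci_lower  = tt$conf.int[1],
             ci_upper  = tt$conf.int[2],
             p_value   = tt$p.value)
}))

print(dose_results)
```

Each row estimates the mean of the lower dose minus the mean of the higher dose, so a confidence interval entirely below zero indicates longer teeth at the higher dose.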

4. Conclusions/ assumptions needed for the conclusions


In summary, it can be concluded that:

  • the data provide no statistically significant evidence (at \(\alpha=0.05\)) that supplement type affects guinea pig tooth growth,
  • increasing the vitamin C dose level leads to greater tooth growth in guinea pigs.

It is also worth recalling that these conclusions hold only under the assumptions that:

  • the guinea pigs are a representative sample from the same population,
  • the data are \(iid\), roughly symmetric and mound shaped.

Appendix

structure

ToothGrowth <- data.table(ToothGrowth)
str(ToothGrowth)

boxplot

# Readable facet labels for supplement type and dose
supp.labs <- c("supplement: Orange Juice", "supplement: Vitamin C")
names(supp.labs) <- c("OJ", "VC")
dose.labs <- c("dose: 0.5", "dose: 1", "dose: 2")
names(dose.labs) <- c("0.5", "1", "2")
# One boxplot of len per dose/supp panel
# (named p_box to avoid masking base R's boxplot())
p_box <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_boxplot(aes(fill = supp), colour = "darkgrey",
               position = position_dodge(0.9)) +
  scale_fill_manual(values = c("violet", "purple")) +
  facet_grid(dose ~ supp,
             labeller = labeller(dose = dose.labs, supp = supp.labs)) +
  labs(title = "Tooth Length by Dosage/Supplement (interactive)",
       x = "dose (mg/day)",
       y = "tooth length")
# Convert to an interactive plotly widget and hide the legend
gbox <- ggplotly(p_box) %>% layout(showlegend = FALSE)

summary

summary(ToothGrowth)

supp

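# Welch two-sample t-test of len between the two supplement types
supp <- t.test(len ~ supp, data = ToothGrowth)
psupp <- supp$p.value      # p-value
intsupp <- supp$conf.int   # 95% confidence interval
supp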

dose

# Welch two-sample t-tests of len for each pair of dose levels
dose1 <- t.test(len ~ dose, data =
                  ToothGrowth[ToothGrowth$dose %in% c(0.5, 1.0), ])
dose2 <- t.test(len ~ dose, data =
                  ToothGrowth[ToothGrowth$dose %in% c(1.0, 2.0), ])
dosetot <- t.test(len ~ dose, data =
                    ToothGrowth[ToothGrowth$dose %in% c(0.5, 2.0), ])
# p-values
pdose1 <- dose1$p.value
pdose2 <- dose2$p.value
pdosetot <- dosetot$p.value
# 95% confidence intervals
intdose1 <- dose1$conf.int
intdose2 <- dose2$conf.int
intdosetot <- dosetot$conf.int