Synopsis

This is part two of the programming assignment for Week 04 of the Coursera class Statistical Inference. In this project, we analyze the ToothGrowth data set, which reports the effect of Vitamin C on tooth growth in guinea pigs. The Vitamin C was delivered by one of two methods: Orange Juice (coded as OJ) or ascorbic acid (coded as VC and referred to as “Vitamin C” in this report). Each delivery method was given at three dose levels: 0.5, 1 and 2 mg per day. We compare tooth growth by delivery method and by dose.

Data Preparation

We first load the required libraries, including the datasets package that provides the ToothGrowth data set.

require(datasets)
require(ggplot2)
require(dplyr)
require(knitr)

Data Summary

The dataset ToothGrowth contains three columns:

df <- ToothGrowth
str(df)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
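
To see how the 60 observations are spread over the two delivery methods and three doses, a quick cross-tabulation can help. The following is a minimal sketch using base R only; the name `cell_counts` is introduced here for illustration and is not part of the original analysis.

```r
# Load the built-in data set (same data as `df` in the report).
df <- datasets::ToothGrowth

# Number of guinea pigs per combination of delivery method and dose.
cell_counts <- with(df, table(supp, dose))
print(cell_counts)

# Summary statistics of the response variable.
print(summary(df$len))
```

Each of the six cells should contain the same number of animals, which is what makes the per-dose comparisons later in the report balanced.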

Exploratory Analysis

First, we want to analyze which of the two Vitamin C administration methods is more effective, regardless of the dose of Vitamin C:

Tooth Growth by delivery method

Figure @ref(fig:boxplot-01) suggests that, overall, Orange Juice is a more effective delivery method for Vitamin C than Vitamin C given directly (as ascorbic acid). We test this hypothesis in Section Hypothesis testing including Confidence Intervals.

Let’s now have a look at the dose-dependency:

Tooth Growth by dosage of Vitamin C and delivery method

The plot in figure @ref(fig:boxplot-02) suggests that Orange Juice is the more effective delivery method only for the two lower doses. For the highest dose, Vitamin C given directly seems to be as effective as Orange Juice. We test this hypothesis in Section Hypothesis testing including Confidence Intervals.
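
The impression from the box plots can be backed up with the group means. The sketch below is one possible way to compute them with base R; the names `group_means` and `diff_oj_vc` are illustrative and do not appear in the original code.

```r
df <- datasets::ToothGrowth

# Mean tooth length for each delivery method (rows) and dose (columns).
group_means <- with(df, tapply(len, list(supp, dose), mean))
print(round(group_means, 2))

# Difference in means (Orange Juice minus Vitamin C) at each dose.
diff_oj_vc <- group_means["OJ", ] - group_means["VC", ]
print(round(diff_oj_vc, 2))
```

A positive difference means that the Orange Juice group had the longer teeth on average at that dose.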

Hypothesis testing including Confidence Intervals

In Section Exploratory Analysis, we have seen that Orange Juice seems to be the more effective delivery method of Vitamin C overall; however, this advantage seems to depend on the dosage level. We now test these hypotheses. To do this, we apply two-sample t-tests (Welch’s version, which does not assume equal variances) to see whether the evidence is compatible with the null hypothesis that the mean tooth growth is the same in the two groups.

Methodological disclaimer: Strictly speaking, the t-test assumes that the data in each group is drawn from a normal distribution. The histograms suggest that this is not the case here (see Distribution of the data); however, t-tests are fairly robust to moderate departures from normality, so the tests should still give a good indication of whether the evidence supports the claims made in the following sub-sections. We suspect that the p-values given in the following tables are somewhat optimistic, i.e. that the true p-values are somewhat higher than the ones reported.
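
One way to gauge how much the normality assumption matters is to run a rank-based test alongside the t-test. The sketch below uses the Wilcoxon rank-sum test from base R as a cross-check; this is a technique swapped in for illustration, not part of the original analysis, and the variable names are our own.

```r
df <- datasets::ToothGrowth

# Welch two-sample t-test, as used in the report (all doses pooled).
welch <- t.test(len ~ supp, data = df, var.equal = FALSE)

# Wilcoxon rank-sum test as a distribution-free cross-check.
# exact = FALSE requests the normal approximation up front, because the
# data contain ties and an exact p-value cannot be computed for them.
rank_sum <- wilcox.test(len ~ supp, data = df, exact = FALSE)

p_values <- c(welch = welch$p.value, wilcoxon = rank_sum$p.value)
print(round(p_values, 4))
```

If both p-values point in the same direction, the conclusion is unlikely to hinge on the normality assumption alone.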

Orange Juice is not the more effective delivery method of Vitamin C overall (at significance level 0.05)

Here, we have performed a two-sample t-test comparing all observations in the “Orange Juice” group with all observations in the “Vitamin C” group, i.e. all doses are pooled. For the code that generates the table, see Code for the generation of table @ref(tab:test-overall).

two sample t-test for tooth growth by supplement for all doses combined
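
| dose | t statistic | p-value | CI lower bound | CI upper bound | OJ mean | VC mean |
|:-----|------------:|--------:|---------------:|---------------:|--------:|--------:|
| all  |      1.9153 |  0.0606 |         -0.171 |          7.571 | 20.6633 | 16.9633 |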

The result of the test given in table @ref(tab:test-overall) is somewhat ambiguous: the p-value of 0.0606 is close to 0.05 but slightly above it, and the 95% confidence interval for the difference in means (OJ minus VC) just includes 0. Whether you conclude that Orange Juice is the more effective delivery method therefore depends on where you set the significance level. At level 0.05, you would not reject the null hypothesis that the mean tooth growth is the same for Orange Juice and Vitamin C; at level 0.07, you would reject it and conclude that Orange Juice is more effective.
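
To make the numbers in the table less of a black box, the Welch statistic, its degrees of freedom and the p-value can be recomputed by hand. The following is a sketch under the assumption that the report's test is the unequal-variance (Welch) test, as indicated by `var.equal = FALSE` in the appendix; all variable names are illustrative.

```r
df <- datasets::ToothGrowth
oj <- df$len[df$supp == "OJ"]
vc <- df$len[df$supp == "VC"]

# Squared standard errors of the two group means.
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic for the difference in means (OJ minus VC).
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation of the degrees of freedom.
welch_df <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df = welch_df)
print(round(c(t = t_stat, df = welch_df, p = p_value), 4))

# Decision at the two significance levels discussed in the text.
alpha_levels <- c(0.05, 0.07)
reject <- p_value < alpha_levels
names(reject) <- paste0("alpha = ", alpha_levels)
print(reject)
```

The hand-computed statistic and p-value should agree with the output of `t.test(len ~ supp, data = df)`.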

Orange Juice is the more effective delivery method of Vitamin C for low doses, but not the highest dose

Here, we have performed a separate two-sample t-test for each dose, comparing the “Orange Juice” group with the “Vitamin C” group. Each test therefore only compares guinea pigs that were given the same dose of Vitamin C. For the code that generates the table, see Code for the generation of table @ref(tab:test-dose).

two sample t-test for tooth growth by supplement for each dose
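
| dose | t statistic | p-value | CI lower bound | CI upper bound | OJ mean | VC mean |
|-----:|------------:|--------:|---------------:|---------------:|--------:|--------:|
|  0.5 |      3.1697 |  0.0064 |         1.7191 |         8.7809 |   13.23 |    7.98 |
|  1.0 |      4.0328 |  0.0010 |         2.8021 |         9.0579 |   22.70 |   16.77 |
|  2.0 |     -0.0461 |  0.9639 |        -3.7981 |         3.6381 |   26.06 |   26.14 |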

The hypothesis tests summarized in table @ref(tab:test-dose) show that Orange Juice is more effective for the two lower doses (p-values below 0.01). For the highest dose, however, there is virtually no difference in effectiveness between the two administration methods (p-value above 0.95). The confidence intervals lead to the same conclusion: 0 lies outside the confidence intervals for the two lower doses, but well inside the confidence interval for the highest dose.
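
The link between the confidence intervals and the test decisions can be checked programmatically. The sketch below asks, for each dose, whether the 95% confidence interval for the difference in means contains 0; the name `zero_in_ci` is introduced here for illustration.

```r
df <- datasets::ToothGrowth
doses <- sort(unique(df$dose))

# For each dose, run the Welch t-test and check whether 0 lies inside
# the 95% confidence interval for the difference in means (OJ minus VC).
zero_in_ci <- sapply(doses, function(d) {
  ci <- t.test(len ~ supp, data = df[df$dose == d, ])$conf.int
  ci[1] <= 0 && 0 <= ci[2]
})
names(zero_in_ci) <- doses
print(zero_in_ci)
```

A value of `FALSE` corresponds to rejecting the null hypothesis of equal means at the 0.05 level, `TRUE` to not rejecting it.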

Conclusion

As we have seen in the previous sections, the data suggests that Orange Juice is a more effective delivery method for Vitamin C than Vitamin C given directly at the lower doses (0.5 and 1 mg/day), but not at 2 mg/day. Of course, this holds only for guinea pigs, and since only 60 guinea pigs were studied (10 per combination of delivery method and dose), all conclusions should be treated with caution. In addition, the data does not appear to be normally distributed, which means that the t-tests may yield overly optimistic p-values.

Appendix

Distribution of the data

As noted in Hypothesis testing including Confidence Intervals, the data does not appear to be normally distributed. Here we show the distribution of the Tooth Growth data by supplement as well as by supplement and dose.

require(ggplot2)
# Histogram of tooth length, one panel per supplement
g <- 
    ggplot(data=df,aes(x=len)) +
    geom_histogram(bins = 30) +  # 30 bins is the ggplot2 default, stated explicitly
    facet_grid(.~supp, labeller = labeller(supp=c("OJ" = "Orange Juice", "VC" = "Vitamin C"))) +
    labs(x="Tooth Length",y="count",title="Distribution of Tooth Growth by supplement") +
    theme(plot.title = element_text(hjust = 0.5))
g
Distribution of Tooth Growth data by supplement

require(ggplot2)
# Histogram of tooth length, one panel per combination of dose and supplement
g <- 
    ggplot(data=df,aes(x=len)) +
    geom_histogram(bins = 30) +  # 30 bins is the ggplot2 default, stated explicitly
    facet_grid(dose~supp, labeller = labeller(supp=c("OJ" = "Orange Juice", "VC" = "Vitamin C"),dose=c("0.5"="dose 0.5","1"="dose 1","2"="dose 2"))) +
    labs(x="Tooth Length",y="count",title="Distribution of Tooth Growth by supplement and dose") +
    theme(plot.title = element_text(hjust = 0.5))
g
Distribution of Tooth Growth data by supplement and dose

In conclusion, the data does not appear to be normally distributed; however, t-tests are fairly robust to moderate departures from normality. We suspect that the p-values given in the main text are somewhat optimistic, i.e. the true p-values are probably somewhat higher than the ones given in the tables.
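
The histograms can be complemented by a formal check. As a rough sketch, the code below runs a Shapiro-Wilk test per supplement and draws normal Q-Q plots with base R graphics; neither is part of the original analysis, and the name `shapiro_p` is our own. With only 30 observations per group, such tests have limited power, so the result should be read as a hint rather than a verdict.

```r
df <- datasets::ToothGrowth

# Shapiro-Wilk normality test for each supplement group;
# small p-values indicate a departure from normality.
shapiro_p <- tapply(df$len, df$supp, function(x) shapiro.test(x)$p.value)
print(round(shapiro_p, 4))

# Normal Q-Q plots, one panel per supplement; points far from the
# reference line indicate deviations from a normal distribution.
old_par <- par(mfrow = c(1, 2))
for (s in levels(df$supp)) {
  x <- df$len[df$supp == s]
  qqnorm(x, main = paste("Normal Q-Q plot:", s))
  qqline(x)
}
par(old_par)
```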

Code for the generation of figures and tables

Code for the generation of figure @ref(fig:boxplot-01)

ggplot(ToothGrowth, aes(supp, len, fill = supp)) +
      geom_boxplot() +
      labs(title = "Tooth growth of 60 guinea pigs by delivery method",
           x = "Delivery Method", 
           y = "Tooth Length") +
      scale_x_discrete(breaks=c("OJ","VC"),labels = c("Orange Juice","Vitamin C")) + 
      theme(plot.title = element_text(hjust = 0.5)) +
      theme(legend.position="none")

Code for the generation of figure @ref(fig:boxplot-02)

ggplot(ToothGrowth, aes(factor(dose), len, fill = factor(dose))) +
      geom_boxplot() +
      facet_grid(.~supp, labeller = as_labeller(
            c("OJ" = "Orange Juice", 
              "VC" = "Vitamin C"))) +
      labs(title = "Tooth growth of 60 guinea pigs by dosage of Vitamin C and delivery method",
           x = "Dose in mg/day", 
           y = "Tooth Length") +
      scale_fill_discrete(name = "Dosage of Vitamin C in mg/day") +
      theme(plot.title = element_text(hjust = 0.5)) # + theme(legend.position="bottom")

Code for the generation of table @ref(tab:test-overall)

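# Welch two-sample t-test (unequal variances) of len by supp, all doses pooled.
# paired = FALSE is the default and is omitted here, because recent R versions
# reject the 'paired' argument in the formula interface.
test <- t.test(len ~ supp, data = df, var.equal = FALSE, conf.level = .95)
resultOverall <- data.frame( dose="all", test$statistic, test$p.value,
                        test$conf.int[1],test$conf.int[2],
                        test$estimate[1],test$estimate[2])
row.names(resultOverall) <- NULL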
colNames <- c("dose","t statistic","p-value",
              "CI lower bound","CI upper bound",
              "OJ mean","VC mean")
kable(x = cbind(resultOverall[,1],round(resultOverall[,-1],4)),
      col.names = colNames,
      caption = "two sample t-test for tooth growth by supplement for all doses combined")

Code for the generation of table @ref(tab:test-dose)

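doses <- sort(unique(df$dose))
# Welch two-sample t-test of len by supp, restricted to one dose level.
# paired = FALSE is the default and is omitted here, because recent R versions
# reject the 'paired' argument in the formula interface.
performTest <- function(dosage){
  my.data <- df %>% filter(dose==dosage)
  test <- t.test(len ~ supp, data = my.data, var.equal = FALSE, conf.level = .95)
  result <- data.frame( dose=dosage,
                        test$statistic, test$p.value,
                        test$conf.int[1],test$conf.int[2],
                        test$estimate[1],test$estimate[2])
  result
  }
results <- lapply(doses,performTest)
# All entries are numeric, so the list can be flattened into one row per dose.
results <- data.frame(matrix(unlist(results), nrow=length(results),byrow=TRUE))
row.names(results) <- NULL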
colNames <- c("dose","t statistic","p-value",
              "CI lower bound","CI upper bound",
              "OJ mean","VC mean")

kable(x = round(results,4),
      col.names = colNames,
      caption = "two sample t-test for tooth growth by supplement for each dose")