Versions & Dates

Analysis Overview

Overview: Part II of the course project for Statistical Inference, one of ten courses in the Data Science Specialization by Johns Hopkins University, produces a basic inferential analysis of the ToothGrowth dataset. The dataset records the length of odontoblasts (the cells responsible for tooth growth) in 60 guinea pigs, each given one of two supplement types at one of three dose levels.

Objectives: The present analysis provides code and narrative for exploratory data analysis of dataset ToothGrowth, a basic summary of the data, and Student’s T-Test p-values and confidence intervals to reach conclusions on the significant effects of supplement type and dosage.

Packages: The tidyverse packages dplyr and ggplot2, as well as package knitr, are installed automatically if they are not detected. All three packages are then loaded with function library().
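
The setup step is described above but its code is not shown. The following is a minimal sketch of one common install-if-missing pattern; the names pkgs and missing_pkgs and the choice of CRAN mirror are assumptions for illustration, and the report's actual setup chunk may differ.

```r
# Hypothetical setup chunk: install any missing packages, then attach all three
pkgs <- c("dplyr", "ggplot2", "knitr")

# requireNamespace() returns FALSE for packages that are not installed
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only what is missing (the mirror URL is an assumed default)
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}

# Attach each package by its name
invisible(lapply(pkgs, library, character.only = TRUE))
```

Checking with requireNamespace() before installing avoids re-downloading packages on every render.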

Exploratory Data Analysis & Summary

The ToothGrowth dataset is part of R's built-in datasets package, which is attached by default in a standard session (or explicitly with function library()). The following commands provide a series of values important for understanding the dimensions, classes, and structure of the data.

Structure & Descriptive Statistics

Structure: A call to function str() provides a valuable overview of the dimensions, variable names, and variable classes of dataset ToothGrowth:

str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

In sum, the data consist of 60 observations of 3 variables: len, or “length”, supp, or “supplement type”, and dose, or the dosage of the supplement in milligrams per day. These and additional details are available only in the data’s documentation, retrievable with a call to function help():

help(ToothGrowth)

Descriptive Statistics: Function summary() further provides the minimum, first quartile, median, mean, third quartile, and maximum of the numeric variables len and dose, along with the number of observations at each level of the factor supp:

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
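
summary() reports quartiles rather than the interquartile range itself, and it does not break len down by group. As a brief sketch using only base R, both can be obtained as follows; the object names len_iqr and group_stats are illustrative and do not appear in the original analysis.

```r
# Work on a fresh copy of the built-in data
tg <- datasets::ToothGrowth

# IQR of tooth length: third quartile minus first quartile
len_iqr <- IQR(tg$len)
len_iqr

# Mean and standard deviation of len for each supplement-dose combination
group_stats <- aggregate(len ~ supp + dose, data = tg,
                         FUN = function(x) c(mean = mean(x), sd = sd(x)))
group_stats
```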

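Factor Levels: Function unique() returns the distinct values of the categorical variables supp and dose. Because c() combines them into a single numeric vector, the supp factor is shown as its underlying integer codes rather than its labels: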

print(c("supp" = unique(ToothGrowth$supp), 
        "dose" = unique(ToothGrowth$dose)))
## supp1 supp2 dose1 dose2 dose3 
##   2.0   1.0   0.5   1.0   2.0

Here, we see there are two levels for supp, coded 1 and 2. Per the str() output above, code 1 is level “OJ” (“Orange Juice”) and code 2 is level “VC” (“Vitamin C”); code 2 is listed first only because the data begin with the VC group. There are three values of dose: 0.5 mg, 1.0 mg, and 2.0 mg.
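
Since c() strips the labels from a factor, the categories are easier to read when each variable is queried separately. A minimal sketch using base R only:

```r
# Work on a fresh copy of the built-in data
tg <- datasets::ToothGrowth

# Labels of the supp factor, in the order of their integer codes (1, 2)
levels(tg$supp)

# Distinct numeric dose values, in ascending order
sort(unique(tg$dose))

# Number of observations in each supplement-dose cell
table(tg$supp, tg$dose)
```

The cross-tabulation also confirms whether the design is balanced across supplement types and doses.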

Initial Observations: The first 6 observations are retrieved with a call to head(). The last 6 observations, retrievable with function tail(), are omitted because they add little further insight:

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Exploratory Data Visualization

Scatter Plot: Figure 1, a simple scatter plot with fitted linear regression lines, depicts a positive relationship between len (length) and dose (dosage). “Vitamin C” supplements are associated with less tooth growth than “Orange Juice” supplements at lower doses, though their efficacy converges as the value of dose increases. Figure 1 is available in the Appendix.

Box Plot: A simple box plot in Figure 2 shows findings similar to those of the scatter plot in Figure 1, with more emphasis on the range of len (length) values for each dose and supp (supplement). Figure 2 is available in the Appendix.

Initial Assumptions: Given the structure and range of the data, and the generally positive relationship between len (length) and dose for each supp (supplement type), it is reasonable to suspect that statistically significant differences exist, warranting the following investigation.

Hypothesis Tests

Overview: The following tests employ Welch's two-sample t-test, the default for function t.test(), to test for statistical significance at the 5% level. In each test, the null hypothesis is that the two group means of len (length) are equal, and the alternative hypothesis is that they differ. One test compares the two levels of supp (supplement type), and three tests compare each pair of dose levels.

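Tooth Length & Supplement Type: The following T-Test calculates the t-statistic, p-value, and confidence interval for the null hypothesis that mean len (length) is equal for “Orange Juice” and “Vitamin C”, against the two-sided alternative that the two means differ. Table 1 reports the results: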

Table 1: Test Statistics for Length & Supplement

T-Statistic   P-Values    Lower CI     Upper CI
1.915268      0.0606345   -0.1710156   7.571016

Findings: Though we would reject the null hypothesis at the 10% significance level, a p-value of 0.0606 exceeds the 5% threshold, so we fail to reject the null hypothesis. Consistent with this, the 95% confidence interval contains zero. I.e. there is no statistically significant difference in mean len (length) between the two levels of supp (supplement type).
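
To make the mechanics of the test concrete, the sketch below recomputes the Welch statistic, degrees of freedom, p-value, and 95% confidence interval by hand and compares them with t.test(). All object names are illustrative assumptions; only base R functions are used.

```r
# Work on a fresh copy of the built-in data
tg <- datasets::ToothGrowth
oj <- tg$len[tg$supp == "OJ"]
vc <- tg$len[tg$supp == "VC"]

# Variance of each group's sample mean
v_oj <- var(oj) / length(oj)
v_vc <- var(vc) / length(vc)

# Welch t-statistic: difference in means over its standard error
se <- sqrt(v_oj + v_vc)
t_stat <- (mean(oj) - mean(vc)) / se

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_oj + v_vc)^2 /
  (v_oj^2 / (length(oj) - 1) + v_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value <- 2 * pt(-abs(t_stat), df_welch)
ci <- (mean(oj) - mean(vc)) + c(-1, 1) * qt(0.975, df_welch) * se

# Reference result from the built-in function
ref <- t.test(len ~ supp, data = tg)
all.equal(t_stat, unname(ref$statistic))
```

Because “OJ” is the first factor level, both the manual calculation and t.test() take the difference as “Orange Juice” minus “Vitamin C”.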

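Tooth Length & Dose Size: dose has been divided into three pairwise comparisons, labelled as ranges: 0.5-1.0 mg, 0.5-2.0 mg, and 1.0-2.0 mg.

The final T-Tests calculate the t-statistic, p-value, and confidence interval for each comparison, under the null hypothesis that mean len (length) is equal at the two dose levels, against the two-sided alternative that the means differ.

Table 2 reports the results of these tests: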

Table 2: Test Statistics for Length & Dose

Dose Range   T-Statistic   P-Values    Lower CI     Upper CI
0.5-1.0 mg   -6.476648     0.0000001   -11.983781   -6.276219
0.5-2.0 mg   -11.799046    0.0000000   -18.156167   -12.833834
1.0-2.0 mg   -4.900484     0.0000191   -8.996481    -3.733519

Findings: For all three dose comparisons, p-values far below the 5% significance level lead the present study to reject the null hypothesis in favor of each alternative hypothesis. I.e. mean len (length) differs significantly between every pair of dose levels. The t-statistics and confidence intervals are negative because each test subtracts the higher-dose mean from the lower-dose mean, so tooth length increases with dose.

Additional Assumptions: Because only four tests are performed, the author assumes that multiple hypothesis correction is unnecessary. Moreover, these findings assume that unobserved covariates or so-called “confounding variables” have not significantly influenced the experiment to collect the ToothGrowth data.
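
The assumption about multiple testing can be checked rather than taken on faith. The sketch below is not part of the original analysis; it recomputes the four p-values and adjusts them with base R's p.adjust(), using both the Holm and Bonferroni methods. The helper dose_p() and the other object names are hypothetical.

```r
# Work on a fresh copy of the built-in data
tg <- datasets::ToothGrowth

# Hypothetical helper: Welch p-value for len between two dose levels
dose_p <- function(a, b) {
  t.test(len ~ dose, data = tg[tg$dose %in% c(a, b), ])$p.value
}

# Raw p-values for the one supplement test and three dose tests
p_raw <- c(supp      = t.test(len ~ supp, data = tg)$p.value,
           d05_vs_10 = dose_p(0.5, 1),
           d05_vs_20 = dose_p(0.5, 2),
           d10_vs_20 = dose_p(1, 2))

# Adjust for four simultaneous tests
p_holm <- p.adjust(p_raw, method = "holm")
p_bonf <- p.adjust(p_raw, method = "bonferroni")

round(cbind(raw = p_raw, holm = p_holm, bonferroni = p_bonf), 7)
```

Even under the conservative Bonferroni method, the largest dose p-value reported in Table 2 (0.0000191) multiplied by four remains far below 0.05, so the dose conclusions do not depend on this assumption.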

Appendix

Figures

Figure 1 Visualization & Code: “Tooth Length by Dose & Supplement Type”

ggplot(data = ToothGrowth, 
       aes(x = dose, y = len, color = supp)) +
  geom_jitter(width = 0.05, height = 0, alpha = 0.8) +
  geom_line(stat = "smooth", method = "lm", alpha = 0.25, lwd = 1.1) +
  ggtitle(label = "Figure 1: Tooth Length by Dose & Supplement Type",
          subtitle = "Orange Juice v. Vitamin C Supplements") +
  scale_color_discrete(name = "Supplement") +
  labs(x = "Dosage", y = "Tooth Length") +
  theme_classic()

Figure 1 shows a generally positive relationship between tooth length and dose size, controlling for supplement type. Though “Vitamin C” seems less efficacious at lower doses, efficacy converges with “Orange Juice” at higher dose sizes.

Figure 2 Visualization & Code: “Tooth Length by Dose & Supplement Type”

# Convert dose to a factor inside aes() so the ToothGrowth data frame is left unmodified
ggplot(data = ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot(alpha = 0.6) +
  ggtitle(label = "Figure 2: Tooth Length by Dose & Supplement Type",
          subtitle = "Orange Juice v. Vitamin C Supplements") +
  scale_fill_discrete(name = "Supplement") +
  labs(x = "Dosage", y = "Tooth Length") +
  theme_classic()

Figure 2 also shows the positive association between dose size and tooth length, emphasizing a smaller range for “Vitamin C” at lower doses compared to “Orange Juice”. As efficacy converges at the largest dose, 2.0 mg, “Vitamin C” shows a much wider range, while “Orange Juice” is concentrated.

Tables

Table 1 Code: “Test Statistics for Length & Supplement”

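# Welch two-sample t-test of len by supplement type
len_supp <- t.test(len ~ supp, data = ToothGrowth)

# Collect the results in a one-row data frame; conf.int is spelled out in full
ls_table <- data.frame("T-Statistic" = len_supp$statistic,
                       "P-Value" = len_supp$p.value,
                       "Lower CI" = len_supp$conf.int[1],
                       "Upper CI" = len_supp$conf.int[2],
                       row.names = "")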

kable(ls_table, caption = "Table 1: Test Statistics for Length & Supplement",
      col.names = c("T-Statistic", "P-Values",
                    "Lower CI", "Upper CI"))

Table 2 Code: “Test Statistics for Length & Dose”

# Subset the data to each pair of dose levels (filter() and %>% require dplyr)
tg_05_10 <- ToothGrowth %>% filter(dose != 2.0)
tg_05_20 <- ToothGrowth %>% filter(dose != 1.0)
tg_10_20 <- ToothGrowth %>% filter(dose != 0.5)

# Welch two-sample t-tests, stored under separate names so the subsets are not overwritten
tt_05_10 <- t.test(len ~ dose, data = tg_05_10)
tt_05_20 <- t.test(len ~ dose, data = tg_05_20)
tt_10_20 <- t.test(len ~ dose, data = tg_10_20)

# Print very small p-values in fixed rather than scientific notation
options(scipen = 999)

tg_table <- data.frame("Dose Range" = c("0.5-1.0 mg",
                                        "0.5-2.0 mg",
                                        "1.0-2.0 mg"),
                       "T-Statistic" = c(tt_05_10$statistic,
                                         tt_05_20$statistic,
                                         tt_10_20$statistic),
                       "P-Values" = c(tt_05_10$p.value,
                                      tt_05_20$p.value,
                                      tt_10_20$p.value),
                       "Lower CI" = c(tt_05_10$conf.int[1],
                                      tt_05_20$conf.int[1],
                                      tt_10_20$conf.int[1]),
                       "Upper CI" = c(tt_05_10$conf.int[2],
                                      tt_05_20$conf.int[2],
                                      tt_10_20$conf.int[2]))

kable(tg_table, caption = "Table 2: Test Statistics for Length & Dose",
      col.names = c("Dose Range", "T-Statistic", "P-Values",
                    "Lower CI", "Upper CI"))