This analysis is part 2 of 2 of the Statistical Inference course project in Johns Hopkins University’s Data Science Specialization. This second part investigates the ToothGrowth dataset, providing a general exploratory analysis before performing hypothesis tests to determine whether there are statistically significant differences in tooth growth by vitamin C dosage and delivery method.
For this section, we will be working with the ToothGrowth dataset from the datasets package. Let’s briefly load and preview the data to see what we’re working with:
library(tidyverse)
library(magrittr)
library(datasets)
data("ToothGrowth")
dim(ToothGrowth)
## [1] 60 3
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
To provide some more context, here is a brief description of the dataset taken from its official R documentation help page:
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
Let’s create some quick summary tables to get a better idea of how the values are distributed within each group. Keep in mind that we are trying to figure out what effect the dosage and the supplement have on odontoblast length, and thus on tooth length. First, let’s group by supplement and dose:
ToothGrowth %>%
group_by(supp, dose) %>%
summarise(mean = mean(len),
sd = sd(len),
count = n())
## # A tibble: 6 × 5
## # Groups: supp [2]
## supp dose mean sd count
## <fct> <dbl> <dbl> <dbl> <int>
## 1 OJ 0.5 13.2 4.46 10
## 2 OJ 1 22.7 3.91 10
## 3 OJ 2 26.1 2.66 10
## 4 VC 0.5 7.98 2.75 10
## 5 VC 1 16.8 2.52 10
## 6 VC 2 26.1 4.80 10
Grouping by supplement only:
ToothGrowth %>%
group_by(supp) %>%
summarise(mean = mean(len),
sd = sd(len),
count = n())
## # A tibble: 2 × 4
## supp mean sd count
## <fct> <dbl> <dbl> <int>
## 1 OJ 20.7 6.61 30
## 2 VC 17.0 8.27 30
Grouping by dose only:
ToothGrowth %>%
group_by(dose) %>%
summarise(mean = mean(len),
sd = sd(len),
count = n())
## # A tibble: 3 × 4
## dose mean sd count
## <dbl> <dbl> <dbl> <int>
## 1 0.5 10.6 4.50 20
## 2 1 19.7 4.42 20
## 3 2 26.1 3.77 20
Now let’s visualize the data by group using a box plot:
ToothGrowth %>% ggplot(aes(x = as.factor(dose), y = len, color = reorder(supp, len))) +
geom_boxplot() +
scale_color_manual(values = c("#0661DA", "#EF8C07")) +
theme_light() +
labs(title = "Odontoblasts Length by Vitamin C Dosage and Delivery Method",
x = "Dosage per day (mg)",
y = "Odontoblasts Length (pm)",
color = "Delivery Method")
Just by looking at this graph, we can see some general trends, namely:
Orange juice (generally) seems to exhibit higher odontoblast lengths compared to its ascorbic acid counterpart
The higher the vitamin C dosage given to guinea pigs, the longer their odontoblasts
While these trends and differences may at a glance seem definitive or clear, in order to truly say whether or not the dosage and supplements actually increase tooth length, we must seek statistical significance in the form of confidence intervals and hypothesis testing.
For the following hypothesis tests, we will be using the t.test() function. Because we do not assume that the groups share a common variance (we set var.equal = FALSE), these are Welch’s t-tests rather than Student’s t-tests. For each test, the null hypothesis is that the two group means are equal, and the alternative is that they differ. We reject the null hypothesis when the P-value is below the significance level (P-value < \(\alpha\)), which would indicate that the dose or supplement is associated with a difference in guinea pig tooth length. The significance level (\(\alpha\)) for every test is fixed at 0.05, which corresponds to a confidence level of 95%.
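To make the Welch test less of a black box, here is a minimal sketch that recomputes the test statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand for two of the dose groups, then compares the result with t.test(). The variable names (x, y, se2_x, se2_y, t_stat, df, p_manual, welch) are illustrative and do not appear elsewhere in this analysis.

```r
library(datasets)

# Two groups: the 0.5 and 1 mg/day dose levels, pooled over delivery method
x <- ToothGrowth$len[ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$dose == 1.0]

# Squared standard error of each group mean; no pooled variance is used
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_stat), df)

# The built-in test should agree with the manual calculation
welch <- t.test(x, y, var.equal = FALSE)
all.equal(p_manual, welch$p.value)
```

Unlike Student’s test, the degrees of freedom here are generally not a whole number, which is why the printed output of a Welch test shows a fractional df.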
To begin, let’s test whether odontoblast length differs significantly between guinea pigs given 0.5 mg/day and 1 mg/day of vitamin C. The first statement assigns the result of the test to an object (in this case, doses_first_pair). The result is a list of length 10, containing components such as the p-value, test statistic, and confidence interval, all accessible using the “$” operator. The next line extracts the p-value and checks whether it is smaller than \(\alpha\) = 0.05. If it is, the code returns a logical “TRUE” value, which means that we should reject our null hypothesis that the mean tooth length is the same for the two groups.
Running the first test:
# Welch two-sample t-test; independent (unpaired) samples are the default
doses_first_pair <- t.test(len ~ dose,
                           data = subset(ToothGrowth, dose %in% c(0.5, 1.0)),
                           var.equal = FALSE)
doses_first_pair$p.value < 0.05
## [1] TRUE
As you can see, we get a “TRUE” value back, which means that we can reject our null hypothesis. In other words, there is statistically significant evidence that the mean odontoblast length differs between guinea pigs given 0.5 mg/day and those given 1 mg/day of vitamin C.
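A bare TRUE or FALSE hides how large the difference is. As a small sketch, the same test object can be inspected for its group means, confidence interval, and exact p-value. The test is re-created here so that the snippet runs on its own.

```r
library(datasets)

doses_first_pair <- t.test(len ~ dose,
                           data = subset(ToothGrowth, dose %in% c(0.5, 1.0)),
                           var.equal = FALSE)

names(doses_first_pair)      # components available through `$`
doses_first_pair$estimate    # mean length in each of the two dose groups
doses_first_pair$conf.int    # 95% CI for the difference in group means
doses_first_pair$p.value     # the p-value behind the TRUE/FALSE check
```

Reporting the estimate and interval alongside the decision makes it easier to judge whether a significant difference is also a practically meaningful one.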
Now let’s check whether the means differ between 1 mg/day and 2 mg/day of vitamin C:
# Welch two-sample t-test; independent (unpaired) samples are the default
doses_second_pair <- t.test(len ~ dose,
                            data = subset(ToothGrowth, dose %in% c(1.0, 2.0)),
                            var.equal = FALSE)
doses_second_pair$p.value < 0.05
## [1] TRUE
Once again, we reject our null hypothesis. There is statistically significant evidence that the mean odontoblast length differs between guinea pigs given 1 mg/day and those given 2 mg/day of vitamin C.
Taken together with the group means above, these tests indicate that larger doses of vitamin C correspond to longer odontoblasts. Now let’s examine whether the delivery method (ascorbic acid vs. orange juice) makes any meaningful difference. First, we will test the delivery method across all dosages combined:
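The two tests above cover adjacent dose levels only. As a sketch of how all three pairs (including 0.5 vs. 2 mg/day) could be checked in one call, base R’s pairwise.t.test() can run Welch tests for every pair of groups. Setting p.adjust.method = "none" is a choice made here only so that the p-values stay comparable with the individual t.test() calls. The object name dose_pairs is illustrative.

```r
library(datasets)

# Welch tests for every pair of dose levels:
#  - pool.sd = FALSE keeps each group's variance separate
#  - p.adjust.method = "none" leaves the p-values unadjusted
dose_pairs <- pairwise.t.test(ToothGrowth$len,
                              ToothGrowth$dose,
                              pool.sd = FALSE,
                              p.adjust.method = "none")

dose_pairs$p.value          # lower-triangular matrix of pairwise p-values
dose_pairs$p.value < 0.05   # same decision rule as used in the text
```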
# Welch two-sample t-test; independent (unpaired) samples are the default
method <- t.test(len ~ supp,
                 data = ToothGrowth,
                 var.equal = FALSE)
method$p.value < 0.05
## [1] FALSE
As you can see, we get a FALSE value back! We cannot reject our null hypothesis. In other words, there is insufficient evidence to conclude that the mean odontoblast length differs between guinea pigs that received their vitamin C through ascorbic acid and those that received it through orange juice (when all dosage levels are pooled together).
Let’s now examine whether this result changes when we compare the delivery methods at each dosage level separately. Testing for differences in supplement for subjects given 0.5 mg/day of vitamin C:
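Failing to reject at the 5% level is equivalent to the 95% confidence interval for the difference in means containing zero. The short sketch below checks that equivalence for the pooled delivery-method test. The names ci and contains_zero are illustrative.

```r
library(datasets)

method <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)

ci <- method$conf.int                       # 95% CI for mean(OJ) - mean(VC)
contains_zero <- ci[1] <= 0 && ci[2] >= 0   # TRUE when 0 lies inside the interval

# "Interval contains 0" and "p-value >= 0.05" express the same decision
contains_zero == (method$p.value >= 0.05)
```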
# Welch two-sample t-test; independent (unpaired) samples are the default
method_0.5 <- t.test(len ~ supp,
                     data = ToothGrowth[ToothGrowth$dose == 0.5, ],
                     var.equal = FALSE)
method_0.5$p.value < 0.05
## [1] TRUE
Our P-value is now lower than our alpha value, so we reject our null hypothesis. There is statistically significant evidence that the mean odontoblast length differs between guinea pigs given ascorbic acid and those given orange juice at 0.5 mg/day of vitamin C.
Testing for differences in supplement for subjects given 1mg/day of vitamin C:
# Welch two-sample t-test; independent (unpaired) samples are the default
method_1 <- t.test(len ~ supp,
                   data = ToothGrowth[ToothGrowth$dose == 1.0, ],
                   var.equal = FALSE)
method_1$p.value < 0.05
## [1] TRUE
We reject our null hypothesis. There is statistically significant evidence that the mean odontoblast length differs between guinea pigs given ascorbic acid and those given orange juice at 1 mg/day of vitamin C.
And finally, testing for differences in supplement for subjects given 2mg/day of vitamin C:
# Welch two-sample t-test; independent (unpaired) samples are the default
method_2 <- t.test(len ~ supp,
                   data = ToothGrowth[ToothGrowth$dose == 2.0, ],
                   var.equal = FALSE)
method_2$p.value < 0.05
## [1] FALSE
We fail to reject our null hypothesis. In other words, there is insufficient evidence to conclude that the mean odontoblast length differs between guinea pigs given ascorbic acid and those given orange juice at 2 mg/day of vitamin C.
Now that we have run hypothesis tests on the relevant pairs of dosage and delivery method of vitamin C in guinea pigs, we can state our conclusions with more confidence. Overall, we discovered a few things:
Higher vitamin C dosage is associated with longer odontoblasts, and the differences between adjacent dose levels are statistically significant when the delivery methods are pooled.
There is no conclusive evidence that either ascorbic acid or orange juice is more effective than the other at increasing odontoblast length when vitamin C dosage is not considered.
There is a statistically significant difference in odontoblast length between ascorbic acid and orange juice when guinea pigs are given vitamin C doses of 0.5 mg/day and 1 mg/day, but not 2 mg/day.
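The three per-dose tests repeat the same call with a different subset each time. As a compact sketch, they can be run in a loop and collected into one table. The names doses, p_values, and by_dose are illustrative.

```r
library(datasets)

doses <- sort(unique(ToothGrowth$dose))

# One Welch test of len ~ supp per dose level
p_values <- sapply(doses, function(d) {
  t.test(len ~ supp,
         data = ToothGrowth[ToothGrowth$dose == d, ],
         var.equal = FALSE)$p.value
})

# Collect the decisions at alpha = 0.05 in a single table
by_dose <- data.frame(dose = doses,
                      p_value = p_values,
                      reject_null = p_values < 0.05)
by_dose
```

Keeping the p-values next to the TRUE/FALSE decisions shows at a glance how far each comparison sits from the 0.05 threshold.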
Note: When speaking about “conclusive evidence” in the context of this analysis, I am using the term as it pertains to this particular dataset and sample size. In reality, there should ideally be much larger and more conclusive studies done to accurately determine such results.