In this assignment, we will conduct an exploratory analysis of the ToothGrowth dataset. After a quick summary analysis of the dataset, we will investigate whether the tooth growth of guinea pigs is affected by dosage and/or delivery method. We will use Hypothesis Testing to conduct the investigation, and then publish our conclusions.
The R Documentation for the ToothGrowth dataset provides the following description:
“The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).”
The dataset has 3 variables:
- len: A numeric variable that captures tooth length
- supp: A factor variable that captures the delivery method by which the vitamin C supplement was fed to the guinea pigs
- dose: A numeric variable that captures the dosage amount fed to the guinea pigs: 0.5, 1, or 2 mg/day

Before we perform any summary analysis on the data, let’s save the ToothGrowth dataset as a data frame called tg, and convert dose to a factor variable.
Let’s summarize tg using the summary() function.
| len | supp | dose |
|---|---|---|
| Min. : 4.20 | OJ:30 | 0.5:20 |
| 1st Qu.:13.07 | VC:30 | 1 :20 |
| Median :19.25 | | 2 :20 |
| Mean :18.81 | | |
| 3rd Qu.:25.27 | | |
| Max. :33.90 | | |
Let’s also look at a table combining the supp and dose variables:
| | 0.5 | 1 | 2 |
|---|---|---|---|
| OJ | 10 | 10 | 10 |
| VC | 10 | 10 | 10 |
The above summary data enable us to make the following observations:
Before we decide whether to compare tooth growth by delivery method or dosage quantity, let’s run a quick box-plot analysis on the data. A perusal of the box-plot leads us to hypothesise that both delivery method and dosage may be affecting tooth growth in guinea pigs.
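As a numerical complement to the box-plot, the group means can be tabulated. The sketch below is not part of the original analysis: it uses base R’s `aggregate()` on the built-in `ToothGrowth` data, and `group_means` is only an illustrative name.

```r
# Illustrative sketch (not from the original report):
# mean tooth length for each delivery method / dose combination.
data(ToothGrowth)

group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(group_means)
```

If the means rise with dose within each delivery method, and differ between OJ and VC at a given dose, that is consistent with the impression given by the box-plot. It is not a formal test.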
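We will test the following concrete claims using hypothesis testing: (1) orange juice (OJ) is a more effective delivery method than ascorbic acid (VC), and (2) a dosage of 2 mg/day is more effective than a dosage of 1 mg/day.
In the first case, each sample will contain 30 guinea pigs: all OJ guinea pigs vs. all VC guinea pigs. In the second case, each sample will contain 20 guinea pigs: all 1 mg/day guinea pigs vs. all 2 mg/day guinea pigs.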
Before we perform Hypothesis Testing, however, let’s check whether the data is normal.
First, let’s look at a panel plot that compares the distribution of the data by delivery method (supp).
Then, let’s look at a panel plot that compares the distribution of the data by dosage (dose).
Both plots suggest that the data, while not normal, is symmetrical and not heavily skewed. In such a case, we can use the t-test to analyse the data, provided each sample contains at least 15 observations (see footnote 2). Since each sample in both of our planned analyses contains at least 20 guinea pigs, we can safely perform t-test analysis.
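The visual check above could be supplemented with a formal normality test. The sketch below is an assumption on our part rather than a step from the original analysis: it applies the Shapiro-Wilk test (`shapiro.test()` from base R) to len within each delivery method and each dose level. The names `sw_by_supp` and `sw_by_dose` are illustrative.

```r
# Illustrative sketch (not from the original report):
# Shapiro-Wilk p-values for len within each group.
data(ToothGrowth)

sw_by_supp <- tapply(ToothGrowth$len, ToothGrowth$supp,
                     function(x) shapiro.test(x)$p.value)
sw_by_dose <- tapply(ToothGrowth$len, ToothGrowth$dose,
                     function(x) shapiro.test(x)$p.value)

print(round(sw_by_supp, 3))
print(round(sw_by_dose, 3))
```

A small p-value for a group would indicate a detectable departure from normality in that group. With samples of this size the test has limited power, so it should be read alongside the plots rather than instead of them.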
Other assumptions:
Orange juice (OJ) is a more effective delivery method than ascorbic acid (VC), in terms of its effect on tooth growth.
This will be our alternate hypothesis (\(H_{a}\)). The null hypothesis (\(H_{0}\)) then posits that OJ is as effective a delivery method as VC.
Let \(\mu_{1}\) denote the mean odontoblast length (len) in guinea pigs who received dosage through ascorbic acid (VC), and \(\mu_{2}\) denote the mean len in guinea pigs who received dosage through orange juice (OJ). In this case: \(H_{0}: \mu_{2} = \mu_{1}\) and \(H_{a}: \mu_{2} > \mu_{1}\).
We will run a one-sided t-test at the 95% confidence level, i.e. with \(\alpha = 0.05\). Let’s run the test in R.
We get a p-value of 0.0303173. Since \(0.0303173 < 0.05 (\alpha)\), we will reject the Null Hypothesis.
Thus, we reject the claim that Orange juice (OJ) is as effective a delivery method as ascorbic acid (VC), in terms of its effect on tooth growth.
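To make the arithmetic behind this p-value explicit, the sketch below recomputes the unequal-variance (Welch) t statistic by hand and cross-checks it against `t.test()`. It works directly on the built-in `ToothGrowth` data. The variable names (`oj`, `vc`, `t_stat`, `df_welch`, `p_manual`) are illustrative and do not appear in the report’s own code.

```r
# Illustrative sketch: Welch t-test for H_a: mean(OJ) > mean(VC), computed by hand.
data(ToothGrowth)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error of each sample mean
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Test statistic: difference in means over its standard error
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# One-sided p-value: upper tail of the t distribution
p_manual <- pt(t_stat, df = df_welch, lower.tail = FALSE)

# Cross-check against the built-in test
p_ttest <- t.test(oj, vc, alternative = "greater", var.equal = FALSE)$p.value
print(c(t = t_stat, df = df_welch, p_manual = p_manual, p_ttest = p_ttest))
```

The two p-values should agree, which confirms that the reported figure is the upper-tail probability of the Welch statistic.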
A dosage of 2 mg/day is more effective than a dosage of 1 mg/day, in terms of its positive effect on tooth growth.
This will be our alternate hypothesis (\(H_{a}\)). The null hypothesis (\(H_{0}\)) then posits that a 2 mg/day dosage is as effective as a 1 mg/day dosage.
Let \(\mu_{1}\) denote the mean odontoblast length (len) in guinea pigs who received a dosage of 1 mg/day, and \(\mu_{2}\) denote the mean len in guinea pigs who received a dosage of 2 mg/day. In this case: \(H_{0}: \mu_{2} = \mu_{1}\) and \(H_{a}: \mu_{2} > \mu_{1}\).
We will run a more stringent one-sided t-test at the 99% confidence level, since we don’t want to spend additional money on increased dosage unless the case for rejecting \(H_{0}\) is very strong. The confidence level will be 0.99, and \(\alpha = 0.01\). Let’s run the test in R.
We get a p-value of \(9.5321476\times 10^{-6}\). Since \(9.5321476\times 10^{-6} < 0.01 (\alpha)\), we will reject the Null Hypothesis.
Thus, we reject the claim that a dosage of 2 mg/day is as effective as a dosage of 1 mg/day, in terms of its effect on tooth growth.
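The same comparison can also be expressed with the formula interface of `t.test()`, which additionally reports a one-sided confidence bound for the difference in means. This is a sketch on the built-in `ToothGrowth` data, not the report’s own code, and `dose_12` and `test2` are illustrative names. Because the groups are ordered as dose 1 then dose 2, the tested difference is mean(1 mg/day) minus mean(2 mg/day), so the alternative is `"less"`.

```r
# Illustrative sketch: 2 mg/day vs. 1 mg/day via the formula interface.
data(ToothGrowth)
dose_12 <- subset(ToothGrowth, dose %in% c(1, 2))

# Difference tested is mean(dose 1) - mean(dose 2);
# "less" therefore corresponds to H_a: mean(dose 2) > mean(dose 1).
test2 <- t.test(len ~ dose, data = dose_12,
                alternative = "less", var.equal = FALSE, conf.level = 0.99)

print(test2$p.value)
print(test2$conf.int)  # one-sided 99% interval for mean(1) - mean(2)
```

If the upper end of the interval lies below zero, the data are consistent with the 2 mg/day mean exceeding the 1 mg/day mean at the 99% confidence level, which matches the decision to reject \(H_{0}\).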
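We reject both null hypotheses \(H_{0}\): that OJ is as effective a delivery method as VC, and that a dosage of 2 mg/day is as effective as a dosage of 1 mg/day.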
However, before accepting the alternate hypotheses (\(H_{a}\)), we need to make the following assumption:
The probability of failing to reject \(H_{0}\) when \(H_{a}\) is true (Type II error, denoted by \(\beta\)) is lower than the acceptable threshold we set for it (say, 5%).
Once we have made the above assumption, we can confidently conclude that both delivery method and dosage affect tooth growth in guinea pigs.
Furthermore, we can conclude that:
- Orange juice is a more effective delivery method than ascorbic acid.
- The higher the dosage, the higher the tooth growth (see footnote 3).
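The assumption about the Type II error rate \(\beta\) made above could be examined rather than taken on faith. The sketch below is a rough, hypothetical illustration and not part of the original analysis. It uses `power.t.test()` from base R for the delivery-method comparison, treats the observed difference in means as if it were the true effect, and uses a pooled standard deviation, so it assumes equal variances, unlike the Welch test used in the report. The names `delta_hat`, `sd_pooled`, `pw` and `beta_hat` are illustrative.

```r
# Illustrative sketch: approximate power (1 - beta) for the OJ vs. VC comparison.
# Assumption: the observed difference in means equals the true effect size.
data(ToothGrowth)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

delta_hat <- mean(oj) - mean(vc)
# Equal group sizes, so the pooled variance is the simple average of the two variances
sd_pooled <- sqrt((var(oj) + var(vc)) / 2)

pw <- power.t.test(n = length(oj), delta = delta_hat, sd = sd_pooled,
                   sig.level = 0.05, type = "two.sample",
                   alternative = "one.sided")

beta_hat <- 1 - pw$power
print(c(power = pw$power, beta = beta_hat))
```

Comparing `beta_hat` with the chosen threshold (say, 5%) gives a rough indication of whether the assumption is plausible. Power estimated after the fact from the observed effect should be interpreted with caution.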
SECTION 3: Boxplot of Tooth Growth:
ggplot(tg, aes(supp, len)) +
  geom_boxplot(aes(fill = supp)) +
  facet_wrap( ~ dose) +
  labs(title = "Tooth growth box-plot",
       x = "Dosage (panel) and delivery method (box-plot)",
       y = "Tooth growth")
SECTION 3: Density Plot of Tooth Growth:
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0);
# linewidth replaces size for line widths (ggplot2 >= 3.4.0)
ggplot(tg, aes(len, after_stat(density))) +
  geom_histogram(bins = 20, fill = "tomato2", colour = "black") +
  geom_density(linewidth = 2) +
  facet_grid(supp ~ .) +
  labs(title = "Tooth growth density by method",
       x = "Tooth growth in units",
       y = "Density of tooth growth")
ggplot(tg, aes(len, after_stat(density))) +
  geom_histogram(bins = 20, fill = "tomato2", colour = "black") +
  geom_density(linewidth = 2) +
  facet_grid(dose ~ .) +
  labs(title = "Tooth growth density by dosage",
       x = "Tooth growth in units",
       y = "Density of tooth growth")
SECTION 2: Loading the “tidyverse” set of packages:
# Install and load the tidyverse set of packages
# install.packages("tidyverse")  # remove the comment sign if tidyverse is not already installed
library(tidyverse)
SECTION 2: Saving ToothGrowth as data frame “tg”:
tg <- as_tibble(ToothGrowth)
tg$dose <- factor(tg$dose)
SECTION 2: Summary of tg:
# Load the kableExtra package
library(kableExtra)
knitr::kable(
  summary(tg),
  align = "ccc"
) %>%
  kable_styling(full_width = TRUE)
SECTION 2: Table combining delivery method & dosage:
knitr::kable(
  table(tg$supp, tg$dose),
  align = "ccc"
) %>%
  kable_styling(full_width = TRUE)
SECTION 4: Hypothesis Test 1 - Delivery method impact on tooth growth:
# Create separate data frames by delivery method
tg1 <- tg[tg$supp == "VC",]; tg2 <- tg[tg$supp == "OJ",]
# Isolate the "len" variable of each data frame
tg1len <- tg1$len; tg2len <- tg2$len
# Perform the t-test (H_a: mean OJ > mean VC) and extract the p-value
pval1 <- t.test(tg2len, tg1len, paired = FALSE, var.equal = FALSE,
                alternative = "greater", conf.level = 0.95)$p.value
SECTION 4: Hypothesis Test 2 - Dosage impact on tooth growth:
# Create separate data frames by dosage
tgdose2 <- tg[tg$dose == 2,]; tgdose1 <- tg[tg$dose == 1,]
# Isolate the "len" variable of each data frame
tgdose2_len <- tgdose2$len; tgdose1_len <- tgdose1$len
# Perform the t-test (H_a: mean 2 mg/day > mean 1 mg/day) and extract the p-value
pval2 <- t.test(tgdose2_len, tgdose1_len, paired = FALSE,
                var.equal = FALSE, alternative = "greater",
                conf.level = 0.99)$p.value
1. This report is based on an assignment for the online course “Statistical Inference” on coursera.org
2. Statistics For Business and Economics, Anderson et al. (https://www.cengage.com/c/statistics-for-business-economics-14e-anderson/9781337901062PF)
3. The p-value for the difference between the means of 1 mg/day and 0.5 mg/day is 0.00000006, i.e. less than 0.01. We’ve already shown in the assignment that the p-value for the difference between the 2 mg/day and 1 mg/day dosage means is \(9.5321476\times 10^{-6}\), also < \(\alpha\).