The information below was prepared for the Statistical Inference Course Project from Coursera’s Data Science by Johns Hopkins University. The project investigates the exponential distribution and compares it with the Central Limit Theorem. The results are illustrated through simulation and explanatory text, as follows:
# Load libraries
library(ggplot2)
library(tidyverse)
library(datasets)
library(gridExtra)
library(GGally)
library(dplyr)
library(knitr)
library(kableExtra)
Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.
The data set records tooth growth in guinea pigs that were given one of two supplement types at different doses.
data("ToothGrowth")
tg <- ToothGrowth
str(tg)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
unique(tg$dose)
## [1] 0.5 1.0 2.0
There are only three unique doses, so we convert dose to a factor.
tg$dose <- factor(tg$dose)
str(tg)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...
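With dose treated as a categorical variable, it is easy to check how many animals fall into each supplement and dose combination. The following is a minimal sketch in base R; the name `cell_counts` is introduced here for illustration and is not part of the original analysis.

```r
# Load the built-in data set (ships with the datasets package).
data("ToothGrowth")

# Cross-tabulate supplement type against dose (as a factor) to count
# the observations in each group. `cell_counts` is an illustrative name.
cell_counts <- table(ToothGrowth$supp, factor(ToothGrowth$dose))
print(cell_counts)
```

Equal counts in every cell would indicate a balanced design, which makes the group means directly comparable.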
Now let’s look at the summary statistics.
summary(tg)
## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
ggplot(tg,
aes(x = dose,
y = len,
fill = supp)) +
geom_boxplot() +
labs(title = "Guinea Pig tooth length",
subtitle = "Based on Dose and Supplement Type",
y = "Tooth Length",
x = "Dose (mg/day)",
fill = "Supplement Type") +
theme_test()
From the graph we can see that the two supplements differ most at the 0.5 and 1 mg/day doses and very little at 2 mg/day. As the dose increases, the tooth length increases as well.
Setting aside the 2 mg/day dose, the plot shows that the OJ supplement has the greater impact on tooth growth at both the 0.5 and 1 mg/day doses.
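The dose trend read off the boxplot can also be checked numerically. This is a minimal sketch in base R; `dose_means` is a name introduced here for illustration, and the sketch pools both supplements, which the original analysis does not do.

```r
# Load the built-in data set.
data("ToothGrowth")

# Mean tooth length at each dose, ignoring supplement type.
# `dose_means` is an illustrative name.
dose_means <- tapply(ToothGrowth$len, ToothGrowth$dose, mean)
print(round(dose_means, 2))

# Differences between consecutive doses; positive values mean
# the average length rises with dose.
print(round(diff(dose_means), 2))
```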
Now we are going to look at these differences in more detail.
stats <- tg %>%
group_by(supp,
Dose = dose) %>%
summarise(mean = mean(len),
.groups = "drop") %>%
spread(supp,
mean) %>%
mutate(Diff = abs(VC - OJ))
kable(stats,
align = "lccrr") %>%
kable_styling(bootstrap_options = "bordered",
full_width = FALSE) %>%
row_spec(row = 0,
bold = TRUE)
| Dose | OJ | VC | Diff |
|---|---|---|---|
| 0.5 | 13.23 | 7.98 | 5.25 |
| 1 | 22.70 | 16.77 | 5.93 |
| 2 | 26.06 | 26.14 | 0.08 |
As we observed in the boxplot, at the 2 mg/day dose there is very little difference in tooth growth regardless of the supplement. This makes it harder to determine which supplement is more effective at this dose.
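Because the Diff column is computed with abs(), it does not show which supplement has the larger mean. A small sketch in base R that keeps the sign (OJ minus VC) is shown here; the names `group_means` and `signed_diff` are illustrative only.

```r
# Load the built-in data set.
data("ToothGrowth")

# Mean tooth length for every dose (rows) and supplement (columns).
group_means <- with(ToothGrowth, tapply(len, list(dose, supp), mean))

# Signed difference: positive when OJ has the larger mean,
# negative when VC does.
signed_diff <- group_means[, "OJ"] - group_means[, "VC"]
print(round(cbind(group_means, Diff = signed_diff), 2))
```

A positive value favours OJ and a negative value favours VC, so the direction of each difference can be compared with the sign of the confidence intervals in the t-tests.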
# Filter the data based on dose
dose_half <- filter(tg,
dose == 0.5)
dose_one <- filter(tg,
dose == 1)
dose_two <- filter(tg,
dose == 2)
# Calculating the t-tests
t.test(len ~ supp,
dose_half)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
t.test(len ~ supp,
dose_one)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
t.test(len ~ supp,
dose_two)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
# Organise in a table
Dose <- c(0.5, 1, 2)
P_value <- c(0.0064, 0.0010, 0.9639)
Conf_int <- c("1.72, 8.78", "2.80, 9.06", "-3.80, 3.64")
Decision <- c("Reject null", "Reject null", "Do not reject null")
TG_ttest<- data.frame(Dose, Conf_int, P_value, Decision)
kable(TG_ttest,
align = "lccrr") %>%
kable_styling(bootstrap_options = "bordered",
full_width = FALSE) %>%
row_spec(row = 0,
bold = TRUE)
| Dose | Conf_int | P_value | Decision |
|---|---|---|---|
| 0.5 | 1.72, 8.78 | 0.0064 | Reject null |
| 1.0 | 2.80, 9.06 | 0.0010 | Reject null |
| 2.0 | -3.80, 3.64 | 0.9639 | Do not reject null |
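The table above was typed in by hand from the printed test output. As a hedged alternative, the same values can be pulled directly from the objects that t.test() returns, which avoids copying errors. The sketch below uses base R only; `alpha`, `rows`, and `ttest_table` are names introduced for illustration, and the 0.05 threshold is the conventional significance level assumed here.

```r
# Load the built-in data set.
data("ToothGrowth")

# Significance level assumed for this sketch.
alpha <- 0.05

# Run one Welch t-test per dose and collect the pieces of interest.
rows <- lapply(sort(unique(ToothGrowth$dose)), function(d) {
  tt <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])
  data.frame(
    Dose     = d,
    Conf_int = sprintf("%.2f, %.2f", tt$conf.int[1], tt$conf.int[2]),
    P_value  = round(tt$p.value, 4),
    Decision = if (tt$p.value < alpha) "Reject null" else "Do not reject null",
    stringsAsFactors = FALSE
  )
})

# Stack the per-dose rows into a single data frame.
ttest_table <- do.call(rbind, rows)
print(ttest_table)
```

The resulting data frame can be passed to kable() in the same way as the hand-built one.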
As expected, the p-values for doses 0.5 and 1.0 mg/day are very small because of the large differences in mean tooth length between the two supplements.
Thus, for doses 0.5 and 1.0 mg/day, the p-values are smaller than 0.05, so we reject the null hypothesis of equal mean tooth length. This means that the type of supplement has a statistically significant effect at these doses, with orange juice (OJ) being more effective. For the 2.0 mg/day dose, we cannot reject the null hypothesis because the p-value is greater than 0.05.
The central assumption for the results is that the sample is representative of the population, and that no other variable is affecting tooth length.
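For the t-tests, two assumptions are made, as reflected in the "Welch Two Sample t-test" output above: the two supplement groups are independent (unpaired), and their variances are not assumed to be equal.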
In summary, the t-tests indicate that supplement type OJ is more effective than VC at doses of 1.0 mg/day or less. At 2.0 mg/day, there is no statistically significant difference between the supplement types.
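The "Welch Two Sample t-test" header in the output indicates that the tests were unpaired and did not assume equal variances, which are the defaults of t.test(). The sketch below spells those settings out for the 0.5 mg/day dose and, for comparison only, runs a pooled-variance test. The names `oj`, `vc`, `welch`, and `pooled` are illustrative, and the pooled test is not part of the original analysis.

```r
# Load the built-in data set and keep the 0.5 mg/day rows.
data("ToothGrowth")
dose_half <- ToothGrowth[ToothGrowth$dose == 0.5, ]

# Split tooth length by supplement type.
oj <- dose_half$len[dose_half$supp == "OJ"]
vc <- dose_half$len[dose_half$supp == "VC"]

# Welch test: unpaired groups, variances not assumed equal
# (these are the defaults, written out explicitly here).
welch <- t.test(oj, vc, paired = FALSE, var.equal = FALSE)

# Pooled-variance test for comparison: assumes equal variances.
pooled <- t.test(oj, vc, paired = FALSE, var.equal = TRUE)

# Compare degrees of freedom and p-values of the two variants.
print(c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter)))
print(c(welch_p = welch$p.value, pooled_p = pooled$p.value))
```

If the two variants lead to the same decision, the conclusion does not hinge on the equal-variance assumption.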