In this project, we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. We will investigate the distribution of averages of 40 exponentials, with a thousand simulations.
# Set Working Directory
setwd("~/Desktop")
# Importing Key Libraries
library(ggplot2)
library(reshape2)
library(lattice)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First, we conduct 1,000 simulations of the average of 40 exponential distribution, and collect the mean and variance of each simulation.
# Setting seed to ensure reproducibility
set.seed(123)
# Conducting 1000 simulations of 40 exponential distributions
lambda = 0.2
nosim <- 1000; n <- 40
exp_mean <- rep(NA, nosim); exp_var <- rep(NA, nosim)
for (i in 1:nosim){
exp_mean[i] <- mean(rexp(n, rate = lambda))
exp_var[i] <- var(rexp(n, rate = lambda))/n
}
The sample mean from the simulations is 5.0069281, while the theoretical mean (generated by taking 1/rate) of the distribution is 5.
The sample variance from the simulation is 0.6111245, while the theoretical variance of the distribution is 25. In this case, as the sample variance measures the variability of 40 exponential distributions from the population instead of only 1 distribution, it is likely to be lower than the theoretical variance by a factor of 40. This implies that as the sample size approaches infinity, the sample mean, 5.0069281, approaches the theoretical mean of 5.
First, we plot the distribution of the average of 40 exponential distributions from our previous simulations. Next, we plot the average of 40 exponential distributions along side the normal distribution to compare the 2 distributions. In addition, we also create confidence intervals for both the mean and the variance of the normalized distribution.
# Plotting the average of 40 exponential distributions
hist(exp_mean, xlab = "Mean", ylab = "Frequency Count", col = "blue",
main = "Average of 40 Exponential Distributions (1000 simulations)")
To check whether our distribution follows a Normal Distribution, we check whether the 1 and 2 standard deviations from the mean captures 68% and 95% of the distribution.
# 1 standard deviation
mean_1sd <- mean(exp_mean) + 1 * sqrt(mean(exp_var))
actual_1sd <- quantile(exp_mean, c(0.84))
print(mean_1sd); print(actual_1sd)
## [1] 5.788673
## 84%
## 5.756918
# 2 standard deviation
mean_2sd <- mean(exp_mean) + 2 * sqrt(mean(exp_var))
actual_2sd <- quantile(exp_mean, c(0.975))
print(mean_2sd); print(actual_2sd)
## [1] 6.570417
## 97.5%
## 6.668337
As the difference between our distribution and a normal distribution appears to be close to 0, we can conclude that our distribution does indeed follow a normal distribution.
Lastly, we can also conduct a formal test (Shapiro-Wilk test) to verify whether our distribution adheres to a normal distribution.
p_value <- shapiro.test(exp_mean)[2]
stat_sig <- if (p_value<0.05){
"does not follow a normal distribution"} else {
"follows a normal distribution"}
Under the Shapiro-Wilk test of normality, our distribution does not follow a normal distribution.
We begin by loading the ToothGrowth dataset, and generating a basic summary of the data.
library(datasets); data("ToothGrowth")
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
We plot the data to get a better understanding of how the dataframe looks:
facet_plt <- xyplot(len ~ supp | factor(dose), data = ToothGrowth,
xlab = "Supplement", ylab = "Tooth Growth",
main = "Impact of Supplement on Tooth Growth by Dosage")
facet_plt
From the plot, it appears that OJ tends to do better than VC when the dosage is low, but the effect is muted when the dosage is high.
We conduct our hypothesis tests (with the use of confidence intervals) to compare tooth growth by the different supplements, while controlling for dosage.
pVal <- rep(NA, length(unique(ToothGrowth$dose)))
stat_sig <- rep(NA, length(unique(ToothGrowth$dose)))
x <- 1
# Impact of Tooth Growth by Supplement, controlled for Dosage
for (i in unique(ToothGrowth$dose)){
df <- ToothGrowth[ToothGrowth$dose == i,]
test_result <- t.test(len ~ supp, data = df, paired = F, var_equal = F)
pVal[x] <- test_result$p.value
conf <- test_result$conf
stat_sig[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
"statistically significant"
} else {
"not statistically significant"
}
x <- x + 1
}
Under a dosage of 0.5, the difference tooth growth across the two supplements is statistically significant.
Under a dosage of 1, the difference tooth growth across the two supplements is statistically significant.
Under a dosage of 2, the difference tooth growth across the two supplements is not statistically significant.
We now plot the data in a different perspective, to identify the impact of dosage on tooth growth, controlled for supplement
facet_plt2 <- xyplot(len ~ as.factor(dose) | supp, data = ToothGrowth,
xlab = "Dosage (mg/day)", ylab = "Tooth Growth",
main = "Impact of Dosage on Tooth Growth by Supplement")
facet_plt2
We proceed to conduct a hypothesis tests to compare tooth growth by dosage, while controlling for the types of supplements used.
pVal2 <- rep(NA, length(unique(ToothGrowth$dose)) * length(unique(ToothGrowth$supp)))
stat_sig2 <- rep(NA, length(unique(ToothGrowth$dose)) * length(unique(ToothGrowth$supp)))
x <- 1
# Impact of Tooth Growth by Dosage, controlled for Supplement
for (i in unique(ToothGrowth$supp)){
df <- filter(ToothGrowth, dose %in% c("0.5", "1") & supp == i)
test_result <- t.test(len ~ dose, data = df, paired = F, var_equal = F)
pVal2[x] <- test_result$p.value
conf <- test_result$conf
stat_sig2[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
"statistically significant"
} else {
"not statistically significant"
}
x <- x + 1
}
for (i in unique(ToothGrowth$supp)){
df <- filter(ToothGrowth, dose %in% c("1", "2") & supp == i)
test_result <- t.test(len ~ dose, data = df, paired = F, var_equal = F)
pVal2[x] <- test_result$p.value
conf <- test_result$conf
stat_sig2[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
"statistically significant"
} else {
"not statistically significant"
}
x <- x + 1
}
for (i in unique(ToothGrowth$supp)){
df <- filter(ToothGrowth, dose %in% c("0.5", "2") & supp == i)
test_result <- t.test(len ~ dose, data = df, paired = F, var_equal = F)
pVal2[x] <- test_result$p.value
conf <- test_result$conf
stat_sig2[x] <- if ((conf[1] > 0 & conf[2] > 0)|(conf[1] < 0 & conf[2] < 0)){
"statistically significant"
} else {
"not statistically significant"
}
x <- x + 1
}
For VC:
Increasing the dosage from 0.5 to 1 is statistically significant, under a significance of 5%.
Increasing the dosage from 1 to 2 is statistically significant, under a significance of 5%.
Increasing the dosage from 0.5 to 2 is statistically significant, under a significance of 5%.
For OJ:
Increasing the dosage from 0.5 to 1 is statistically significant, under a significance of 5%.
Increasing the dosage from 1 to 2 is statistically significant, under a significance of 5%.
Increasing the dosage from 0.5 to 2 is statistically significant, under a significance of 5%.
All in all, higher dosage of vitamin C, regardless of the delivery methods, result in faster tooth growth. While orange juice tend to result in faster tooth growth at lower levels of Vitamin C intake as compared to ascorbic acid, the effects fade off as dosage levels increases.