The project is to investigate the exponential distribution in R and compare it with the Central Limit Theorem.
* The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter.
* The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.
* Set lambda = 0.2 for all of the simulations.
* Investigate the distribution of averages of 40 exponentials, with a thousand simulations.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:
1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.
Due to the central limit theorem (CLT), the distribution of averages of 40 exponentials is very close to a normal distribution.
Prepare the workspace by loading the libraries. Define the simulation parameters, as described in the task. Simulate the sample data.
#Load the libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
#Define simulation parameters
Lambda <- .2
Exponential <- 40
Sims <- 1000
set.seed(12345)
#Simulate
SimData <- replicate(Sims, rexp(Exponential, Lambda))
Calculate the means for the sample dataset, as well as the actual and theoretical means.
#Calculate simulation, actual and theoretical means
MeanSimData <- apply(SimData, 2, mean)
ActualMean <- mean(MeanSimData)
TheoryMean <- 1/Lambda
#MeanSimData <- data.frame(MeanSimData, TheoryMean)
print(c("The Actual mean is", ActualMean, "while the Theorteical meanis", TheoryMean), quote = FALSE)
## [1] The Actual mean is 4.97197196424051
## [3] while the Theorteical meanis 5
As printed above, the actual mean from the sample data is very close to the theoretical mean of normal data.
PlotSimData <- data.frame(MeanSimData)
hist(MeanSimData, breaks = 10, col = "lightsteelblue2",
xlim=c(3, 8.5), main = "Simulation Mean Distribution",
xlab = "Sample Mean")
#Plot Actual and Theoretical Means
abline(v=ActualMean, col="red", lwd=2)
abline(v=TheoryMean, col="blue", lwd=2)
legend(6, 260, legend=c("Actual Mean", "Theoretical Mean"),
col=c("red", "blue"), lty=1:1, lwd=2, cex=1)
The above graph depicts the mean of the sample distribution with both the actual and theoretical means identified. As stated previously,the actual and theoretical means are very close.
Calculate the standard deviations and variance for the sample dataset, both as actual and theoretical.
#Calculate actual and theoretical standard deviation and variance
ActualSD <- sd(MeanSimData)
TheorySD <- ((1/Lambda) * (1/sqrt(Exponential)))
ActualVar <- var(MeanSimData)
TheoryVar <- TheorySD^2
#Print Results
print(c("The standard deviation of the distribution is", ActualSD,
"with a theorteical standard deviation of", TheorySD,
"While the actual variance is", ActualVar,
"with a theorteical variance of", TheoryVar), quote = FALSE)
## [1] The standard deviation of the distribution is
## [2] 0.771645582353234
## [3] with a theorteical standard deviation of
## [4] 0.790569415042095
## [5] While the actual variance is
## [6] 0.595436904765261
## [7] with a theorteical variance of
## [8] 0.625
As printed above, the actual and theoretical of both the standard deviation and the variance are very close, indicating that the simulated data approximates the expected normal results.
According to the Central Limit Theorem, the averages of samples follow normal distribution, which has been continuously demonstrated by the closeness in values of the actual vs theoretical in multiple variables (mean, standard deviation, variance), and which continues to be depicted in the following graph.
#Plot the simulation distribution
hist(MeanSimData, breaks = 50, col = "lightblue",
main = "Simulation Distribution",
xlab = "Simulation Mean", freq = FALSE)
lines(density(MeanSimData), col="orchid", lwd=2)
# Normal distribution line creation
x <- seq(min(MeanSimData), max(MeanSimData), length=2*Exponential)
y <- dnorm(x, mean=1/Lambda, sd=sqrt(((1/Lambda)/sqrt(Exponential))^2))
lines(x, y, pch=22, col="purple4", lwd=2)
legend(6.3, .56, legend=c("Actual Variance", "Theoretical Variance"),
col=c("orchid", "purple4"), lty=1:1, lwd=2, cex=1)
The graph overlays the bell curves of the actual and theoretical distributions on top of the simulated data. Thereby visually reinforcing that the sampled data approximates normal distribution.
The ToothGrowth dataset contains data regarding the effect of vitamin C on the tooth growth of Guinea pigs. Specifically, the data includes the tooth length in each of 10 guinea pigs at three dosage levels of Vitamin C (0.5, 1, and 2 mg) and two delivery methods (orange juice or ascorbic acid).
This data frame has 60 observations and 3 variables: + len - Tooth length (numeric). + supp - Supplement type: VC-“Vitamin C”" or OJ-“orange juice” (factor). + dose - Dose in milligrams (numeric).
#Load the Libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(datasets)
#Load the Data & Explore
data(ToothGrowth)
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
# Exploration of the effect of dosed supplements on tooth growth
ggplot(ToothGrowth, aes(x = dose, y = len)) +
geom_point(aes(color=supp))
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
tapply(ToothGrowth$len, ToothGrowth$supp, FUN=summary)
## $OJ
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.20 15.53 22.70 20.66 25.73 30.90
##
## $VC
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.20 11.20 16.50 16.96 23.10 33.90
The median best growth comes from the Oranje Juice, while the single highest numbers come from Ascorbic Acid.
The following plot shows data summary for each supplement:
qplot(data = ToothGrowth, supp, len, facets = ~dose,
main = "Supplement & Dose Effect on Guinea Pig's Tooth Growth",
xlab = "Supplement", ylab = "Tooth Length") +
geom_boxplot(aes(fill = supp)) +
theme(legend.position = "none")
Null Hypothesis is that there will be no difference in tooth growth caused by the dosage. Alternatively, we hypothesize there will be more tooth growth as the dosage increases.
#Tooth Growth Exploration Based on Dose Comparison
LowDose <- ToothGrowth$len[ToothGrowth$dose == .5]
MedDose <- ToothGrowth$len[ToothGrowth$dose == 1]
HigDose <- ToothGrowth$len[ToothGrowth$dose == 2]
t.test(LowDose, MedDose, alternative = "less", paired = FALSE,
var.equal = FALSE, conf.level = .95)
##
## Welch Two Sample t-test
##
## data: LowDose and MedDose
## t = -6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -6.753323
## sample estimates:
## mean of x mean of y
## 10.605 19.735
t.test(MedDose, HigDose, alternative = "less", paired = FALSE,
var.equal = FALSE, conf.level = .95)
##
## Welch Two Sample t-test
##
## data: MedDose and HigDose
## t = -4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -4.17387
## sample estimates:
## mean of x mean of y
## 19.735 26.100
Conclusion is that we must reject the null hypothesis due to the fact that the p-value is lower than the default error value in both instances. In both instances, the p-value is close to zero, and so it is highly likely that the dosage does have an effect on tooth growth, such that a higher dosage results in increased tooth growth.
Null Hypothesis is that there will be no difference in tooth growth caused by the supplement. Alternatively, we hypothesize there will be more tooth growth when using orange juice as the supplement.
#Tooth Growth Exploration Based on Supplement
OJ = ToothGrowth$len[ToothGrowth$supp == "OJ"]
VC = ToothGrowth$len[ToothGrowth$supp == "VC"]
t.test(OJ, VC, alternative = "greater", paired = FALSE,
var.equal = FALSE, conf.level = .95)
##
## Welch Two Sample t-test
##
## data: OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.4682687 Inf
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
Conclusion is that we must reject the null hypothesis due to the fact that the p-value is lower than the default error value as there is only a 3% chance to obtain extreme differences in tooth growth based solely on the supplement.
Null Hypothesis is that there will be no difference in tooth growth caused by the supplement at the 2mg dosage. Alternatively, we hypothesize there will be a difference based on supplement at the highest dosage level.
#Tooth Growth Exploration Based on Supplement with Highest Dosage
HiOJ = ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 2]
HiVC = ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 2]
t.test(HiOJ, HiVC, alternative = "two.sided", paired = FALSE,
var.equal = FALSE, conf.level = .95)
##
## Welch Two Sample t-test
##
## data: HiOJ and HiVC
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean of x mean of y
## 26.06 26.14
Conclusion is that we cannot reject the null hypothesis, as its p-value is higher than the error tolerance, and there is insufficient data to demonstrate a difference in tooth growth based on the supplement at the highest doage.