Joseph Bloomquist
05-20-2024
This is the Johns Hopkins Statistical Inference course project. It consists of two parts: a simulation exercise and a basic inferential data analysis. It is presented in PDF format and runs to no more than 3 pages.
In this project we will:
Show where the distribution is centered and compare it to the theoretical center of the distribution
Show how variable it is and compare it to the theoretical variance of the distribution
Perform an exploratory data analysis of at least a single plot or table highlighting basic features of the data
Perform some relevant confidence intervals and/or tests and interpret them within context.
Investigate the exponential distribution in R and compare it with the Central Limit Theorem.
Investigate the distribution of averages of 40 exponentials using a thousand simulations
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.
For these simulations, we preset a few variables and fix a seed for reproducibility.
set.seed(052024)
lambda = 0.2
exponentials = 40
simulations = 1000
totalMeans = 0
sMean = 0
tMean = 0
sVar = 0
tVar = 0
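First we simulate 1,000 samples, each consisting of 40 draws from an exponential distribution with rate lambda, and record the mean of each sample.
totalMeans = NULL  # start empty so the placeholder 0 is not counted as a simulated mean
for (i in 1 : simulations) totalMeans = c(totalMeans, mean(rexp(exponentials, lambda)))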
Now that we have simulated data, we can calculate the sample mean and the theoretical mean and compare the two.
First we will grab our sample mean:
sMean <- mean(totalMeans)
sMean
## [1] 5.004291
Now we calculate our theoretical mean:
tMean <- 1/lambda
tMean
## [1] 5
Comparison Plot
hist(totalMeans, main = "Sample Mean vs. Theoretical Mean", col = "lightblue", breaks = 50)
abline(v=sMean, col = "green", lwd = 2)
abline(v=tMean, col = "red", lwd = 2)
legend("topleft", pch=15, col = c("green", "red"), legend = c(paste("Sample Mean -", round(sMean, 6)), paste("Theoretical Mean -", tMean)))
As we can see, the two means are almost indistinguishable on the plot. The sample mean differs from the theoretical mean of 5 by less than 0.01.
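The word "close" can be made a little more concrete with an approximate 95% confidence interval for the center of the simulated means. The following is only a minimal sketch: it runs a fresh simulation with its own arbitrary seed, so its numbers will not match the ones printed above, and the names means, se and ci are illustrative rather than part of the original analysis.

```r
# Sketch only: a fresh simulation with an arbitrary seed (not the report's seed)
set.seed(1)
lambda <- 0.2   # rate, as in the report
n      <- 40    # draws per sample
sims   <- 1000  # number of simulated samples

means <- replicate(sims, mean(rexp(n, lambda)))

# Standard error of the average of the simulated means
se <- sd(means) / sqrt(sims)

# Normal-approximation 95% interval around the average of the simulated means
ci <- mean(means) + c(-1, 1) * qnorm(0.975) * se
ci

# Theoretical mean for comparison
1 / lambda
```

If the theoretical mean falls inside the interval, the simulation gives no evidence that the distribution of means is centered anywhere other than 1/lambda.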
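To calculate the sample variance we use R's built-in var() function. The formula given for exponentials, 1/lambda^2, is the variance of a single exponential draw, so it does not describe the variance of the sample means. The mean of 40 independent draws has variance (1/lambda^2)/exponentials, which is the same quantity as (lambda * sqrt(exponentials))^-2, so that is the theoretical value used here.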
sVar <- var(totalMeans)
sVar
## [1] 0.6658285
tVar <- (lambda * sqrt(exponentials))^-2
tVar
## [1] 0.625
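The relationship between the single-draw variance and the variance of the mean can be checked numerically. This is a small sketch that assumes the same lambda and sample size as above. The names n, varSingle, varMean and varReport are introduced here for illustration only.

```r
# Assumed values, matching the settings used earlier in this report
lambda <- 0.2
n      <- 40

varSingle <- 1 / lambda^2            # variance of one exponential draw
varMean   <- varSingle / n           # variance of the mean of n independent draws
varReport <- (lambda * sqrt(n))^-2   # the form used in this report

c(varSingle = varSingle, varMean = varMean, varReport = varReport)

# TRUE when the two expressions for the variance of the mean agree
isTRUE(all.equal(varMean, varReport))
```

Dividing by n is the step that 1/lambda^2 on its own leaves out, which is why it appeared to give the wrong result.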
Comparison Plot
# Each variance is a single number, so the two are compared as bars rather than as a histogram
varPlot <- barplot(c(sVar, tVar), names.arg = c("Sample Var", "Theoretical Var"), main = "Sample Variance vs. Theoretical Variance", col = c("green", "red"), ylim = c(0, 1))
abline(h = tVar, col = "red", lwd = 2, lty = 2)
text(varPlot, c(sVar, tVar), labels = round(c(sVar, tVar), 4), pos = 3, col = "black")
legend("topright", legend = paste("Difference of:", round(sVar - tVar, 2)), bty = "n")
How are the sample means distributed?
hist(totalMeans, main="Mean distribution", col="green", breaks=50, prob=TRUE)
lines(density(totalMeans), lwd=3, col="red")
Although every mean comes from a sample of 40 draws from the same right-skewed exponential distribution, the 1,000 sample means collectively resemble a normal distribution, as the Central Limit Theorem predicts.
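One way to make the comparison with the normal distribution more direct is to overlay the normal curve that the Central Limit Theorem suggests, with mean 1/lambda and standard deviation 1/(lambda * sqrt(40)). The sketch below assumes a fresh simulation with an arbitrary seed, so it is not a reproduction of the figures above. The names means, z and coverage are illustrative.

```r
# Sketch only: a fresh simulation with an arbitrary seed (not the report's seed)
set.seed(1)
lambda <- 0.2
n      <- 40
sims   <- 1000

means <- replicate(sims, mean(rexp(n, lambda)))

# Histogram of the simulated means on the density scale
hist(means, breaks = 50, prob = TRUE, col = "lightblue",
     main = "Sample means with normal curve", xlab = "Mean of 40 exponentials")

# Normal density suggested by the Central Limit Theorem
curve(dnorm(x, mean = 1 / lambda, sd = 1 / (lambda * sqrt(n))),
      add = TRUE, col = "red", lwd = 2)

# Share of standardized means within +/- 1.96; roughly 0.95 if the normal fit is good
z <- (means - 1 / lambda) / (1 / (lambda * sqrt(n)))
coverage <- mean(abs(z) <= qnorm(0.975))
coverage
```

A coverage value near 0.95 is consistent with approximate normality, although with only 40 draws per sample a slight right skew in the histogram is to be expected.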