Statistical Inference Course project

Joseph Bloomquist

05-20-2024

Overview

This is the course project for the Johns Hopkins Statistical Inference course. It consists of two parts: a simulation exercise and a basic inferential data analysis. It will be presented in PDF format with no more than 3 pages.

In this report we will:

  • Show where the distribution is centered and compare it to the theoretical center of the distribution

  • Show how variable it is and compare it to the theoretical variance of the distribution

  • Perform an exploratory data analysis of at least a single plot or table highlighting basic features of the data

  • Perform some relevant confidence intervals and/or tests and interpret them within context.

  • Investigate the exponential distribution in R and compare it with the Central Limit Theorem.

  • Investigate the distribution of averages of 40 exponentials using a thousand simulations

  • Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.

Simulations

For these simulations, we have some preset variables and a seed for reproduction purposes.

set.seed(052024)
lambda = 0.2
exponentials = 40
simulations = 1000
totalMeans = NULL  # start empty so that no placeholder 0 ends up among the simulated means
sMean = 0
tMean = 0
sVar = 0
tVar = 0

First we run the simulation: each of the 1000 iterations draws 40 exponential random values with rate lambda and stores the mean of that sample in totalMeans.

for (i in 1 : simulations) totalMeans = c(totalMeans, mean(rexp(exponentials, lambda)))
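The loop above grows a vector one element at a time. As a minimal sketch of an alternative, assuming the same rate, sample size, and number of simulations, all draws can be generated at once and averaged row by row. The names nObs, nSims, draws, and simMeans are illustrative and do not appear in the original code.

```r
# Sketch: vectorized version of the simulation (all names here are illustrative)
set.seed(52024)   # same seed value as above, written without the leading zero
lambda <- 0.2     # rate of the exponential distribution
nObs   <- 40      # observations per simulated sample
nSims  <- 1000    # number of simulated samples

# One row per simulation, one column per exponential draw
draws <- matrix(rexp(nSims * nObs, rate = lambda), nrow = nSims, ncol = nObs)

# Mean of each row gives one sample mean per simulation
simMeans <- rowMeans(draws)

length(simMeans)
mean(simMeans)
```

Because the matrix is filled column by column, the individual means differ from those produced by the loop even with the same seed, but they follow the same distribution.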

Sample Mean vs. Theoretical Mean

Now that we have simulated data, we can compare the mean of the simulated sample means with the theoretical mean of the distribution.

First we will grab our sample mean:

sMean <- mean(totalMeans)
sMean
## [1] 5.004291

Now we calculate our theoretical mean:

tMean <- 1/lambda
tMean
## [1] 5

Comparison Plot

hist(totalMeans, main = "Sample Mean vs. Theoretical Mean", col = "lightblue", breaks = 50)
abline(v=sMean, col = "green", lwd = 2)
abline(v=tMean, col = "red", lwd = 2)
legend("topleft", pch=15, col = c("green", "red"), legend = c(paste("Sample Mean -", round(sMean, 6)), paste("Theoretical Mean -", tMean)))

As we can see, the two lines are visually indistinguishable. The sample mean printed above (5.004291) differs from the theoretical mean of 5 by only about 0.004.
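To judge whether a gap of this size is unremarkable, it helps to compare it with the standard error of an average of 1000 sample means. The following is a rough sketch under the assumption that the 1000 simulated means are independent; the names n, nSims, sdOfMean, seOfAverage, and sMeanReported are illustrative, and sMeanReported simply copies the value printed above.

```r
# Sketch: how large a gap between sample and theoretical mean is expected?
lambda <- 0.2
n      <- 40     # exponentials per sample
nSims  <- 1000   # number of simulated sample means
tMean  <- 1 / lambda

# Standard deviation of a single mean of n exponentials
sdOfMean <- (1 / lambda) / sqrt(n)

# Standard error of the average of nSims such means
seOfAverage <- sdOfMean / sqrt(nSims)
seOfAverage

# Gap reported above, expressed in standard errors
sMeanReported <- 5.004291
(sMeanReported - tMean) / seOfAverage
```

A gap well under one standard error is what we would expect from simulation noise alone.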

Sample Variance vs. Theoretical Variance

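To calculate the sample variance, we use the built-in var function. The formula 1/lambda^2 gives the variance of a single exponential observation, not of a mean, which is why it does not match the simulated values. Because each simulated value is the mean of 40 independent exponentials, its theoretical variance is 1/lambda^2 divided by the number of exponentials, which is exactly what (lambda * sqrt(exponentials))^-2 computes.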

sVar <- var(totalMeans)
sVar
## [1] 0.6658285
tVar <- (lambda * sqrt(exponentials))^-2
tVar
## [1] 0.625
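The theoretical value can be written out as a short derivation, assuming the 40 draws X_1, ..., X_n in each sample are independent exponentials with rate lambda (so each has variance 1/lambda^2). The symbols n and X_i are notation introduced here for illustration.

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{1}{n^2}\cdot\frac{n}{\lambda^2}
  = \frac{1}{n\lambda^2}
  = \frac{1}{40 \cdot 0.2^2}
  = 0.625
```

This agrees with the value of tVar printed above.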

Comparison Plot

# Compare the two variances side by side as bars
bp <- barplot(c(sVar, tVar), names.arg = c("Sample Var", "Theoretical Var"), main = "Sample Variance vs. Theoretical Variance", col = c("green", "red"), ylim = c(0, 1))
# Print each value above its bar
text(bp, c(sVar, tVar), labels = round(c(sVar, tVar), 4), pos = 3)
# Report the difference under the title
mtext(paste("Difference of:", round(sVar - tVar, 2)), side = 3)

Distribution

How are the sample means distributed?

hist(totalMeans, main="Mean distribution", col="green", breaks=50, prob=TRUE)
lines(density(totalMeans), lwd=3, col="red")
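One way to make the comparison with a normal distribution explicit is to overlay the normal density that the Central Limit Theorem suggests, with mean 1/lambda and standard deviation (1/lambda)/sqrt(40), and to add a normal Q-Q plot. This is a self-contained sketch rather than part of the original analysis; it regenerates the means under the name simMeans (illustrative) so that it can be run on its own.

```r
# Sketch: compare simulated means with the normal density suggested by the CLT
set.seed(52024)
lambda <- 0.2
n      <- 40

# 1000 means of n exponential draws each
simMeans <- replicate(1000, mean(rexp(n, rate = lambda)))

# Histogram on the density scale with the theoretical normal curve on top
hist(simMeans, main = "Mean distribution vs. normal density",
     col = "lightblue", breaks = 50, prob = TRUE)
curve(dnorm(x, mean = 1 / lambda, sd = (1 / lambda) / sqrt(n)),
      add = TRUE, lwd = 3, col = "red")

# Normal Q-Q plot: points close to the line indicate approximate normality
qqnorm(simMeans)
qqline(simMeans, col = "red")
```

If the histogram follows the red curve and the Q-Q points stay near the line, the simulated means are behaving approximately normally.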

Conclusion

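Even though the underlying observations come from a skewed exponential distribution, the means of samples of 40 observations collectively resemble a normal distribution, as the Central Limit Theorem predicts. The distribution of means is centered near the theoretical mean of 5, and its variance is close to the theoretical value of 0.625.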