This will be a simulation exercise comparing the exponential distribution to the Central Limit Theorem.
The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:
In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
First always set your seed so the simulation is reproducible. Here I’ll use 33, my favorite seed! Store all the known variables we’re going to work with in the sim.
The exponential distribution can be simulated in R with rexp(n, lambda).
Do a thousand simulations.
set.seed(33)
lambda <- 0.2
n <- 40
expo_sims <- replicate(1000, rexp(n, lambda))
means_expo_sims <- apply(expo_sims, 2, mean)
We’re told the theoretical mean of the distribution is 1/lambda. The sample mean is just the mean of the 1,000 simulations we just did.
theo_mean <- 1/lambda
theo_mean
## [1] 5
sample_mean <- mean(means_expo_sims)
sample_mean
## [1] 4.964431
As you can see the theoretical mean and the sample mean are very close.
hist(means_expo_sims, xlab = "Simulated Mean Buckets", main = "Distribution of Simulated Means", col = "grey")
abline(v=theo_mean, col = 2, lwd = 2)
abline(v=sample_mean, col=4, lwd = 2)
legend("topright", legend = c("theo_mean", "sample_mean"), pch = "|", col = c(2,4))
We’re told the standard deviation of the theoretical distribution is 1/lambda. To calculate the variance we square the standard deviation and divide that by the n. For our sample we can just square the calculated standard deviation.
theo_var <- ((1/lambda)^2)/n
theo_var
## [1] 0.625
sample_var <- (sd(means_expo_sims))^2
sample_var
## [1] 0.6456619
The sample variation is fairly close to the theoretical variation. I prefer to compare standard deviations instead of variances since they aren’t squared and we’ll need them for part 3 anyway so here’s how we get those:
theo_sd <- 1/(lambda * sqrt(n))
theo_sd
## [1] 0.7905694
sample_sd <- sd(means_expo_sims)
sample_sd
## [1] 0.8035309
Again, fairly close.
To do this I’ll just use the same histogram I used in part one but break it up into more buckets. I’ll also add the prob = TRUE argument and a line showing the density.
hist(means_expo_sims, xlab = "Simulated Mean Buckets", main = "Distribution of Simulated Means", col = "grey", prob = TRUE, breaks = 100)
lines(density(means_expo_sims))
That looks pretty normal to me. To be more confident we can look at the confidence intervals!
theo_conf <- theo_mean + c(-1,1) * 1.96 * sqrt(theo_var)/sqrt(n)
theo_conf
## [1] 4.755 5.245
sample_conf <- mean(means_expo_sims) + c(-1,1)*1.96*sd(means_expo_sims)/sqrt(n)
sample_conf
## [1] 4.715414 5.213447
Those are close as well.
Finally, since visual tests aren’t always reliable we can do something fancy and use the Shapiro-Wilk method.
shapiro.test(means_expo_sims)
##
## Shapiro-Wilk normality test
##
## data: means_expo_sims
## W = 0.99052, p-value = 4.752e-06
Since that results in a really low p-value we can confidently say the data is normally distributed.