This report will be an analysis of the central limit theorem using an exponential distribution. We will repeatedly take the mean of a sample of the exponential distribution with the sample size equal to 40. Then we will compare that mean and variance to the theoretical mean and variance and attempt to determine if our simulation is a normal distribution.

Simulations

We’ll begin by simulating a vector of means taken from the exponential distribution using the rexp function in R. We will also store the theoretical mean and standard deviation given the central limit theorem for a sample size of 40.

set.seed(1648) # for reproducability
# create an empty vector for the means of exponential samples
clt <- NULL 
n <- 40 # sample size
lambda <- 0.2 # always use lambda = 0.2 for these sims

# take the mean of 40 samples of the exponential distribution. repeat 1,000 times
for (i in 1:1000) {
      clt <- c(clt, mean(rexp(n, lambda))) 
}

The theoretical mean for this distribution of means is 1 / lambda or 5. The theoretical standard deviation is (1 / lambda) / sqrt(n) where n is the sample size. To find the variance we can simply square the standard deviation. This works out to 0.625. We’ll store these in variables.

theoretical_mean <- 1/lambda 
theoretical_sd <- (1/lambda)/sqrt(n) 

Sample and Theoretical Mean Comparison

The mean from our simulated distribution is 5.0323247 which is very close to the theoretical mean of 5. Let’s use a histogram to explain this visually.

hist(clt, xlab='Sample Mean', main="Histogram of Sample Means from an Exponential Distribution (n=40)", col='beige')
abline(v=mean(clt), lwd=3, col='darkslategray4')
abline(v=theoretical_mean, lwd=3, col='firebrick')
legend(c("Sample", "Theoretical"),x='topright', lwd=c(3,3), col=c('darkslategray4', 'firebrick'))

We can see that the sample and theoretical means nearly overlap.

Sample Variance versus Theoretical Variance

The variance for our simulated data is 0.6004745 while the theoretical variance is 0.625. These variances are very close but clearly not identical. This is so-called Monte Carlo error because we ran ‘only’ a thousand simulations. We would need to run an infinite number of simulations to ensure a variance exactly identical to the theoretical variance. But unfortunately there are only 24 hours in a day.

Simulated Distribution is Approximately Normal

Let’s begin with some exploratory graphing to determine the type of distribution. We’ll plot a histogram of the actual values of the data. Alongside the histogram we’ll create a density estimate for the distribution and plot it as well.

par(mfrow=c(1,2))
hist(clt, breaks=60, col='beige', main="Histogram of Simulated Means")
plot(density(clt), main="Density Estimate")
polygon(density(clt), col='beige', border='firebrick', lty=2)

Both figures appear to show the ‘bell-curve’ of a normal distribution so we know we’re not radically off base. Let’s take a look at the mean and median, which should be similar in a normal distribution. the mean is 0.043904 units larger which in percentage terms is just a difference of 0.9%.

Finally we can use the pnorm function in R to determine the probabilities of producing the largest and smallest values in our simulated vector. The probability of a mean being larger than the maximum we simulated is 0.00018 while the probability of a mean being smaller than our smallest simulated is 0.0026. These probabilities are very small but we’ve run a thousand simulations and would expect to see probabilites roughly this small in a normal distribution.

APPENDIX

# difference of mean/median
diff <- mean(clt) - median(clt)
# percentage difference of mean/median
percent_diff <- round(((mean(clt)-median(clt)) / mean(clt))*100, 1)
max_prob <- 1 - pnorm(max(clt), mean(clt), sd(clt))
min_prob <- pnorm(min(clt), mean(clt), sd(clt))
sim_var <- var(clt)
theoretical_var <- theoretical_sd^2