The Central Limit Theorem Through Simulations of the Exponential Distribution

Overview

Using the exponential distribution with a \(\lambda\) of 0.2 in R, we’ll investigate the distribution of averages of 40 exponentials with 1000 simulations. We’ll then compare the result to the standard normal distribution and show it’s relation to the Central Limit Theorem.

Simulations

Using R we’re able to sample \(n\) observations from the exponential distribution with a given \(\lambda\) using the function rexp(n, rate) where the “rate” is the value of \(\lambda\). Recall that the mean and the standard deviation for a population that is exponentially distributed are both equal to \(\frac{1}{\lambda}\). Suppose that we sample an exponential distribution 40 times with a \(\lambda\) equal to 0.2, we’d expect the sample mean to be: \[\bar{x} = \frac{1}{.2} = \frac{1}{(\frac{1}{5})} = 5\] since the sample mean is an unbiased estimator of the population mean. Let’s use R to see what the mean and sample variance are for a single sample of 40 exponentials first:

# set the number of observations to sample from the distribution
n <- 40

# set the rate (lambda)
lambda <- 0.2

# generate n random exponentials
exp.sample <- rexp(n, lambda)

# review the random sample
exp.sample

##  [1]  3.83902306  5.35611684 10.64106219  3.24221560  1.90378750
##  [6]  3.08071518  0.03153491  1.96488926  2.36330023  2.33897270
## [11]  4.66649127  0.60881578  1.05637021  5.44699237  0.69263043
## [16]  2.32673472 26.05838792  5.58354405  0.17908087 11.56929049
## [21] 11.28872172  0.06045262 10.85667788  0.92323700  3.29037310
## [26]  0.09800588 15.24851430  4.23711433  1.78240312  3.93940952
## [31]  0.12631843  8.29013005  0.48828973  1.17292176  4.66318056
## [36]  0.31577051 11.66608135  0.19277948  0.97572940  3.69916563

# show the mean of the sample
mean(exp.sample)

## [1] 4.406631

# show the standard deviation of the sample
sd(exp.sample)

## [1] 5.287921

Looking at our example, we see that the mean was 4.4066308 and the standard deviation was 5.2879213, both of which are relatively “close” to 5. What happens if we run this experiment over and over again? To find out we’ll create a simulation model that generates a random sample of 40 exponentials 1000 times.

# set the number of simulations
n.sim <- 1000 

# create an exponential distribution simulation model function
      # let x be the number of exponentials to sample
      # let y be the rate (lambda)
      # let z be the number of samples to run in the simulation
exp.sim.model <- function(x, y, z) {
      
      # run the simulation z times
      for (i in 1:z) {
            
            # store the x exponentials from the exponential 
            # distribution with y rate (lambda)
            temp.sim <- rexp(x,y)
            
            # record the mean and standard deviation
            temp.df <- data.frame(mean.sim = mean(temp.sim), var.sim = var(temp.sim))
            
            # store all the results in a data frame
            if (i == 1) {
                  df <- temp.df
            } else {
                  df <- rbind(df, temp.df)
            }
      } 
      
      # return the results
      df
      
}

# run the model for 40 observations (n), rate of 0.2 (lamba), 
# 1000 times (n.sim)
exp.sim.means <- exp.sim.model(n, lambda, n.sim)

# show the first 6 observations of the resulting data.frame
head(exp.sim.means)

##   mean.sim  var.sim
## 1 5.032894 15.04838
## 2 5.864351 31.50588
## 3 6.058256 41.76799
## 4 3.936927 11.36737
## 5 6.621163 22.49229
## 6 4.779643 30.21649

Sample Mean versus Theorectical Mean

Now that we’ve run our simulation model 1000 times, we can compare the sample means to the theoretical mean. With a lambda of 0.2 we have already shown that the population mean for our exponential distribution is 5, and we know from the Central Limit Theorem that the mean itself is an independent and identically distributed random variable that is centered around the population mean, that is the theoretical mean of the sample means taken during our simulation is the population mean. Let’s plot our sample means in a histogram and see how it’s distributed.

library(ggplot2)
g <- ggplot(data = exp.sim.means, aes(x = mean.sim))
g <- g + geom_histogram(binwidth = diff(range(exp.sim.means$mean.sim))/30, fill="purple",
      color="black")  
g <- g + geom_vline(xintercept = 1/lambda, color="orange")
g <- g + labs(title = "Histogram of Means for 1000 Samples of 40 Exponentials with a 
      Rate of 0.2", x = "Sample Mean", y = "Sample Count")
g

The orange line in the above plot is the population mean of 5 when our \(\lambda\) is equal to 0.2. From the plot we can clearly see that the means of the samples from the exponential distribution do center around the population mean of 5 as expected by the Central Limit Theorem, furthermore we may now calculate the mean of the 1000 sample means and we see that this number is indeed very close to 5 at 4.9996467.

Sample Variance versus Theorectical Variance

The Central Limit Theorem also tells us that the distribution of the sample means should theoretically have a standard deviation equal to the standard error of the mean: \(\frac{\sigma}{\sqrt{n}}\). In our case we know that the standard deviation of the exponential distribution is equal to \(\frac{1}{\lambda}\) which we have shown previously to equal 5, therefore the sample means should have a standard deviation equal to \(\frac{5}{\sqrt{40}}\) or a variance of \(\frac{5}{8}\). If we take the variance of the sample means we get 0.670042, which is very close to the theoretical variance of 0.625.

# Theorectical Variance from CLT
(((1/lambda))/sqrt(40))^2

## [1] 0.625

# Variance from the simulation
var(exp.sim.means$mean.sim)

## [1] 0.670042

Distribution

So the mean of the sample means are centered around the population mean, and the variance of the sample means are asymptotic to the population variance divided by the number of observations in each sample. To show that these two facts imply that the distribution of the sample means is normally distributed we’ll create Z scores from each of our sample means in our simulation of 1000 samples of 40 exponentials. To do this we’ll take our sample mean and subtract the population mean and then divide that quantity by the standard error of the sample mean: \[Z = \frac{\bar{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

The result should be normally distributed with a mean of zero and a variance of zero, which is the standard normal distribution for a large \(n\). To show this is the case we’ll close with a graph comparing the density of our transformed sample means with the density of the standard normal distribution.

library(plyr)
# create Z scores from the sample means
beta <- 1/lambda
exp.sim.means <- mutate(exp.sim.means, Z.score = (mean.sim - beta)/(beta/sqrt(n)))

# plot the Z scores and the density of the standard normal distribution
g <- ggplot(data = exp.sim.means, aes(x = Z.score)) 
g <- g + geom_histogram(binwidth = diff(range(exp.sim.means$Z.score))/30, fill="purple", 
            color="black", aes(y = ..density..)) 
g <- g + stat_function(fun = dnorm, size = 2) 
g <- g + labs(title="Histogram of Normalized Exponential Distribution Sample Means \n 
      Compared to the Standard Normal Density", x="Noralized Sample Means (Z Scores)")
g

We see from the histogram above that the properly normalized distribution of the sample means is approximately the same as the standard normal distribution whose density function is shown as the thick black line, which clearly demonstrates the Central Limit Theorem even with an \(n\) of only 40 observations in our simulated samples of the exponential distribution.