Distribution of Averages: Central Limit Theorem

Overview

This report investigates the Central Limit Theorem (CLT) through the distribution of averages of 40 exponential random variables. The aim is to show that, over 1000 simulations, the mean of the simulated averages is close to the theoretical mean predicted by the CLT, and that the variability of the simulated averages is close to the theoretical variability predicted by the CLT.

Simulation

The random variable being sampled is exponential with rate parameter \(\lambda = 0.2\). Samples of size 40 are drawn and the mean is determined. This is done 1000 times to generate a distribution of the averages. The code for collecting the sample data through simulation is provided below.

lambda <- 0.2 # given rate for the exponential distribution
n <- 40 # sample size: number of exponentials averaged per sample
trials <- 1000 # number of samples of size n
set.seed(1284) # fix the seed so the results are reproducible
exps <- rexp(n * trials, lambda) # draw all n * trials exponential RVs at once
expMat <- matrix(exps, n, trials) # one sample of 40 RVs per column, 1000 columns
expMeans <- colMeans(expMat) # mean of each sample of 40 RVs
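The layout of the matrix matters for the step above: `matrix()` fills column by column, so each column should hold one sample of 40 consecutive draws. The following is a minimal sketch that re-runs the simulation and checks that layout. The name `firstSampleMean` is introduced here for illustration only and is not part of the original analysis.

```r
# Re-create the simulation from the report
lambda <- 0.2
n <- 40
trials <- 1000
set.seed(1284)
exps <- rexp(n * trials, lambda)
expMat <- matrix(exps, n, trials)
expMeans <- colMeans(expMat)

# matrix() fills column-wise, so column 1 holds the first n draws.
# firstSampleMean is a hypothetical name used only for this check.
firstSampleMean <- mean(exps[1:n])

stopifnot(all(dim(expMat) == c(n, trials)))                 # 40 rows, 1000 columns
stopifnot(isTRUE(all.equal(expMeans[1], firstSampleMean)))  # column 1 = first sample
stopifnot(length(expMeans) == trials)                       # one average per sample
```

If any of these checks failed, the averages would be taken over the wrong grouping of draws, for example across rows instead of columns.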

Sample Mean vs. Theoretical Mean

Per the Central Limit Theorem, the theoretical mean of a distribution of averages is the mean of the random variable being sampled. Since the random variable being sampled is \(X \sim \mathrm{Exp}(0.2)\), the theoretical mean is \(1/\lambda = 1/0.2 = 5\). The plot below shows the histogram of the sampled averages. The solid red line marks the mean of the sampled averages, and the dashed blue line marks the theoretical mean. The two lines nearly coincide, as expected.
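The theoretical mean can also be obtained directly from linearity of expectation. The short derivation below is a sketch in notation that is assumed here rather than taken from the report: \(X_1, \dots, X_n\) denote the \(n = 40\) exponential draws in one sample and \(\bar{X}_n\) denotes their average.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
E\left[\bar{X}_n\right]
  = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
  = \frac{1}{n}\cdot n \cdot \frac{1}{\lambda}
  = \frac{1}{\lambda}
  = \frac{1}{0.2}
  = 5
```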

The actual values of the simulated sample mean and the theoretical mean are:

print(data.frame(Simulated = mean(expMeans), Theoretical = 1/lambda))

##   Simulated Theoretical
## 1  5.047547           5

Sample Variance vs. Theoretical Variance

Here variance is investigated through the standard deviation and the standard error. Per the Central Limit Theorem, the theoretical standard deviation of a distribution of averages (the standard error) is \(\sigma/\sqrt{n}\), where \(\sigma\) is the standard deviation of the random variable being sampled and \(n\) is the sample size. For an exponential random variable the standard deviation equals the mean, so \(\sigma = 1/\lambda\). Since the random variable being sampled is \(X \sim \mathrm{Exp}(0.2)\), the theoretical standard error is \((1/\lambda)/\sqrt{40} = (1/0.2)/\sqrt{40} \approx 0.79\). The plot below again shows the histogram of the sampled averages. The solid red line marks one simulated standard deviation above the mean of the sampled averages, and the dashed blue line marks one theoretical standard error above the theoretical mean. The two lines are again very close, as expected.
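The standard error formula follows from the variance of a sum of independent draws. The derivation below is a sketch under the assumption that the 40 draws in each sample are independent. The symbols \(X_i\) and \(\bar{X}_n\) (the individual draws and their average) are notation assumed here, not taken from the report.

```latex
\operatorname{Var}\left(\bar{X}_n\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{n}
  = \frac{1}{\lambda^2 n}
  = \frac{25}{40}
  = 0.625,
\qquad
\operatorname{SE}\left(\bar{X}_n\right)
  = \frac{\sigma}{\sqrt{n}}
  = \frac{1}{\lambda\sqrt{n}}
  = \frac{5}{\sqrt{40}}
  \approx 0.7906
```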

The actual values of the simulated sample standard deviation and the theoretical standard error are:

print(data.frame(Simulated = sd(expMeans), Theoretical = 1/lambda/sqrt(n)))

##   Simulated Theoretical
## 1 0.8052922   0.7905694

Distribution

Per the Central Limit Theorem, the distribution of the averages of a sampled random variable should approach a normal distribution with the mean and standard error discussed above. The figure below shows two plots. The left plot shows a histogram of the sampled random variables themselves (not the averages), with the theoretical exponential density in red; the draws clearly follow the exponential distribution. The right plot shows a histogram of the averages of samples of 40 random variables, with the theoretical normal density in red. The histogram follows the normal curve closely, so the distribution of averages is approximately normal.
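A numerical check can complement the visual comparison. The sketch below is not part of the original analysis. It standardizes the simulated averages with the theoretical mean and standard error, then measures the share that falls within the central 95% range of a standard normal. The names `z` and `coverage` are hypothetical, introduced only for this illustration.

```r
# Re-create the simulated averages from the report
lambda <- 0.2
n <- 40
trials <- 1000
set.seed(1284)
expMeans <- colMeans(matrix(rexp(n * trials, lambda), n, trials))

# Standardize using the theoretical mean (1/lambda) and standard error
meanTh <- 1 / lambda
sdTh <- 1 / lambda / sqrt(n)
z <- (expMeans - meanTh) / sdTh   # hypothetical name: standardized averages

# Share of standardized averages inside the central 95% normal interval;
# for an approximately normal distribution this should be near 0.95
coverage <- mean(abs(z) <= qnorm(0.975))
print(coverage)

# Normal Q-Q plot: points near the reference line suggest approximate normality
qqnorm(z)
qqline(z, col = "red")
```

Some departure in the upper tail of the Q-Q plot would not be surprising, since the exponential distribution is right-skewed and averages of 40 draws can retain part of that skew.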

Appendix

This appendix includes all the code used to create the plots in the above report.

Code for sample mean vs. theoretical mean plot.

hist(expMeans, breaks = 15, xlab = "Averages of Exponential Random Variables", 
     main = NULL, col = "papayawhip", freq = FALSE)
meanSim <- mean(expMeans) # mean of the simulated averages
meanTh <- 1/lambda # theoretical mean
abline(v = meanSim, col = "red", lwd = 3)
abline(v = meanTh, col = "blue", lty = 2, lwd = 3)
legend("topright", legend = c("Simulated Mean", "Theoretical Mean"), 
       col = c("red", "blue"), lty = c(1, 2), lwd = c(3, 3))

Code for sample standard deviation vs. theoretical standard error plot.

hist(expMeans, breaks = 15, xlab = "Averages of Exponentials", 
     main = NULL, col = "papayawhip", freq = FALSE)
sdSim <- sd(expMeans) # standard deviation of the simulated averages
sdTh <- 1/lambda/sqrt(n) # theoretical standard error
# meanSim and meanTh are defined in the previous code block
abline(v = meanSim + sdSim, col = "red", lwd = 3)
abline(v = meanTh + sdTh, col = "blue", lty = 2, lwd = 3)
legend("topright", legend = c("Simulated SD", "Theoretical SD"), 
       col = c("red", "blue"), lty = c(1, 2), lwd = c(3, 3))

Code for distribution plot.

par(mfrow = c(1, 2))
hist(exps, breaks = 30, freq = FALSE, xlab = "Exponential Random Variables", 
     main = NULL)
x <- seq(0, 60, 0.1)
lines(x, dexp(x, rate = lambda), col = "red") # theoretical exponential density

hist(expMeans, breaks = 30, freq = FALSE, xlab = "Averages of Exponential Random Variables", 
     main = NULL)
x <- seq(2, 9, 0.1)
lines(x, dnorm(x, mean = meanTh, sd = sdTh), col = "red") # theoretical normal density