The Central Limit Theorem states that the distribution of averages of independent identically distributed random variables become that of a standard normal distribution as the sample size increases. So when we subtract the estimate(for example estimated sample mean) from the population mean and divide by the standard error of the estimate, we should get a normalized statistic. And when we do this over and over n times (where n is a large number) we will get an approximation of the standard normal distribution.
In theory, even if our starting distribution is far from normal (say for example an exponential distribution) as long as we simulate the distribution over and over and repeatedly take the sample mean; we should get a normal distribution from which we can standardize.
Let us start with an exponential distribution with a rate of 0.2 and n of 40.
set.seed(100464)
hist(rexp(40,0.2), main = "Distribution of Exponential Distribution with a Rate of 0.2" , col = "Blue")
The theoretical mean of the exponential distribution is 1/lambda, in this case 1/0.2 = 5. The theoretical standard deviation of the exponential distribution is also 1/lambda, 5 in this case. This would imply that the variance is 5^2 or 25.
Let us now use the Central Limit Theorem to try and create a normal distribution from the means of the exponential distribution. To do it, we will simulate 1000 different exponential distributions with 40 exponentials each. We will take the mean of each of these 1000 samples and plot them. The resulting plot should be a normal distribution.
##creating an empty thing
expSim1 = NULL
cumMeans = NULL
cumMeans2 = NULL
##repeat simulation 1000 times
for(i in 1:1000){
## rbind the simulations together
expSim1 <- rbind(expSim1,rexp(40,0.2))
## recording the changes in cumulative mean for variance analysis later on
cumMeans <- rbind(cumMeans, mean(expSim1[i,]))
cumMeans2 <- rbind(cumMeans2, sum(cumMeans)/nrow(cumMeans))
}
The matrix “expSim1” contains 1000 rows with 40 columns. Each row is one simulation with 40 exponentials from the exponential distribution with rate 0.2.
Now that we have simulated an exponential distribution 1000 times, let’s find the mean and standard deviation and compare it to that of the theoretical. Remember that the theoretical mean and standard deviation of the exponential distribution is 1/rate, in our case 1/0.2 = 5. The theoretical variance should be 1/lambda^2; 25 in our case.
paste("The mean is", round(mean(apply(expSim1, 1, mean)),3))
## [1] "The mean is 5"
paste("The standard deviation is", round(sd(apply(expSim1, 1, sd)),3))
## [1] "The standard deviation is 1.064"
paste("The variance is",round(sd(apply(expSim1, 1, sd)),3)^2 )
## [1] "The variance is 1.132096"
We can see that although the mean of 1000 means of the simulated exponential distributions is similar to the theoretical mean of one exponential distribution, the standard deviation and variance are significantly less. The variance of our large collection of averages of 40 exponentials is 1.06 vs a theoretical variance of 25 for the exponential distribution.
t.test(expSim1,y = c(5,5,5,5),paired = FALSE)
##
## Welch Two Sample t-test
##
## data: expSim1 and c(5, 5, 5, 5)
## t = -0.0085701, df = 39999, p-value = 0.9932
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04913673 0.04870891
## sample estimates:
## mean of x mean of y
## 4.999786 5.000000
We see that the mean of our simulations is equal to the theoretical mean of the exponential distribution within 95% confidence.
If we look at the distribution, we will more clearly see the central limit theorem at play.
par(mfrow = c(2,1), mar = c(3,2,1,1), pin = c(3,1.45), cex = 0.7)
expSim2 <- rexp(1000,0.2)
hist(expSim2, main = "Distribution of a Large Collection of Random Exponentials", col = "Blue")
abline(v = 5, lwd = 5, col = "red")
hist(apply(expSim1,1,mean), main = "Distribution of a Large Collection of Averages of 40 Exponentials", col = "Green")
abline(v = 5, lwd = 4, col = "red")
abline(v = mean(expSim1), lwd = 4, col = "purple")
The theoretical mean is marked with the red line. The purple line marks the center of the distribution for the large collection of averages of 40 exponentials (our simulation). We can very clearly see that by taking the averages of 40 exponentials over and over 1000 times, we can plot the averages and get an approximately normal distribution centered very nearly around where the theoretical mean of the exponential distribution would be. The charecteristic bell shaped curve of the normal distribution is very apparent in the second plot. In sharp contrast, the plot of 1000 exponentials is skewed to the right, far from a bell shaped curve.
To illustrate the differences in variance from the theoretical variance and our sample variance, look at the following plot.
plot(cumMeans2, type = 'n', main = "Variance from the Theoretical Mean")
lines(rexp(1000,0.2), col = "yellow")
lines(cumMeans2, col = "blue")
abline(h = 5, col = "red", lwd = 3)
The Blue line is the cumulative mean of our simulation as it proceeded from the first simulation of 40 exponentials to the 1000th simulation. The Yellow line shows how varied a normal exponential distribution of 1000 exponentials is. The Red line is the theoretical mean of our exponential distribution. We can see very clearly the difference in the variance of 25 (yellow line) and the variance of 1.05 (blue line). In our simulation, as we simulated more and more averages of 40 exponentials, our cumulative mean became less varied and eventually converges to the theoretical mean. In the regular exponential distribution, still of lambda of 0.2, we can see the huge amount of variance.
Overall, we can conclude that the distribution of means of 40 exponentials behaves as predicted by the Central Limit Theorem. The ending plot is a normal distribution centered around the theoretical mean with a low variance.