The central limit theorem states that, for sufficiently large samples drawn from a population with finite variance, the mean of the sample means is approximately equal to the population mean. Furthermore, the sample means follow an approximately normal distribution, with variance approximately equal to the population variance divided by the sample size. (Source: Investopedia)
In this project, we illustrate the central limit theorem by sampling from an exponential distribution with the rate set at lambda = 0.2. We show that the means of 1000 sets of 40 random exponentials follow an approximately normal distribution, with the mean of the 1000 means approximately equal to the theoretical mean, and the variance of the 1000 means approximately equal to the theoretical variance divided by the sample size of 40.
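As a compact restatement of the claim being tested, the following uses notation that does not appear elsewhere in this report: the symbols $X_i$, $\bar{X}_n$, $\mu$ and $\sigma^2$ are introduced here purely for illustration. It applies the standard mean and variance of an exponential distribution to the rate and sample size used in this project.

```latex
\begin{aligned}
\bar{X}_n &= \frac{1}{n}\sum_{i=1}^{n} X_i,
  \qquad X_i \sim \mathrm{Exp}(\lambda)\ \text{independent},\\[4pt]
\mu &= \frac{1}{\lambda} = \frac{1}{0.2} = 5,
  \qquad \sigma^2 = \frac{1}{\lambda^2} = 25,\\[4pt]
\bar{X}_{40} &\;\overset{\text{approx.}}{\sim}\;
  \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)
  = \mathcal{N}\!\left(5,\ \frac{25}{40}\right)
  = \mathcal{N}(5,\ 0.625).
\end{aligned}
```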
1000 means of random exponentials (i.e. random variables sampled from an exponential distribution with the given rate of 0.2) will be simulated. The simulation method will be to call the R function “rexp” 1000 times, each time with the argument “rate” set to 0.2 and the argument “n” (sample size) set to 40. Each call will return a vector of 40 random exponentials. All these results will be saved in a 1000 × 40 matrix, and the means of the 1000 rows (the simulation means) will be stored in a vector of length 1000. This vector will be used for comparisons against the theoretical values and distribution.
library(ggplot2)
library(gridExtra)
lambda <- 0.2
simTotal <- 1000
sampleSize <- 40
expMatrix <- matrix(nrow = simTotal, ncol = sampleSize)
set.seed(1)
# Each row holds one sample of 40 random exponentials
for (sim in 1:simTotal) {
  expMatrix[sim, ] <- rexp(n = sampleSize, rate = lambda)
}
#write.csv(expMatrix, "expMatrix.csv")
# Pre-allocate a numeric vector for the 1000 row means
avgVector <- numeric(simTotal)
for (sim in 1:simTotal) {
  avgVector[sim] <- mean(expMatrix[sim, ])
}
#write.csv(avgVector, "avgVector.csv")
The 1000 simulation means have been stored in “avgVector”, to be used in the comparisons below.
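The row-by-row simulation described above can also be written without explicit loops. The following is a minimal sketch, not the method used in this report; the names `expMatrixVec` and `avgVectorVec` are hypothetical and chosen so they do not clash with the objects above. Because the random numbers are drawn in the same order, it should reproduce the same matrix as the loop for the same seed.

```r
# Sketch of a vectorised alternative; parameter values mirror the report.
lambda <- 0.2
simTotal <- 1000
sampleSize <- 40

set.seed(1)
# Draw all 40,000 values in one call and fill the matrix row by row,
# so that row i holds the i-th sample of 40 exponentials.
expMatrixVec <- matrix(rexp(n = simTotal * sampleSize, rate = lambda),
                       nrow = simTotal, ncol = sampleSize, byrow = TRUE)

# rowMeans() computes all 1000 sample means in a single call.
avgVectorVec <- rowMeans(expMatrixVec)
```

The loop version is arguably easier to read alongside the description of the method, while `rowMeans()` is the more idiomatic choice once the matrix exists.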
theo_mean <- 1/lambda
print(theo_mean)
## [1] 5
sim_mean <- mean(avgVector)
print(sim_mean)
## [1] 4.990025
The mean of the 1000 simulated means, i.e. the simulation mean (4.99), is within about 0.01 of the theoretical mean (5.00).
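One way to judge whether this gap is small is to compare it with the standard error expected for an average of 1000 sample means. The sketch below is an illustrative check rather than part of the original analysis; `var_of_mean`, `se_grand_mean` and `z` are hypothetical names, and the simulation mean is entered as the value reported above.

```r
lambda <- 0.2
simTotal <- 1000
sampleSize <- 40
sim_mean <- 4.990025  # simulation mean as reported above

theo_mean <- 1 / lambda

# Variance of a single sample mean under the CLT: (1/lambda)^2 / n
var_of_mean <- (1 / lambda)^2 / sampleSize

# Standard error of the average of simTotal independent sample means
se_grand_mean <- sqrt(var_of_mean / simTotal)

# Gap between simulation and theory, in standard-error units
z <- (sim_mean - theo_mean) / se_grand_mean

print(se_grand_mean)
print(z)
```

With these inputs the standard error works out to 0.025 and the gap to roughly -0.4 standard errors, which is well within ordinary sampling variation.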
theo_var <- ((1/lambda)^2)/sampleSize
print(theo_var)
## [1] 0.625
sim_var <- var(avgVector)
print(sim_var)
## [1] 0.6111165
The variance of the 1000 simulated means, i.e. the simulation variance (0.611), is close to the theoretical population variance divided by the sample size (25/40 = 0.625), a difference of about 0.014, or roughly 2%.
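For reference, the division by the sample size follows from the independence of the draws within each sample. The short derivation below uses symbols ($X_i$, $\bar{X}_n$, $\sigma^2$) that are assumptions of this sketch and not notation from the report.

```latex
\begin{aligned}
\operatorname{Var}\!\left(\bar{X}_n\right)
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\,\sigma^2}{n^2}
   = \frac{\sigma^2}{n},\\[4pt]
\text{with } \sigma^2 &= \frac{1}{\lambda^2} = 25,\ n = 40:
  \qquad \operatorname{Var}\!\left(\bar{X}_{40}\right) = \frac{25}{40} = 0.625 .
\end{aligned}
```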
# Draw 1000 individual exponentials for comparison with the 1000 sample means
set.seed(1)
expVector <- rexp(n = 1000, rate = lambda)
#write.csv(expVector, "expVector.csv")
gg1 <-
  ggplot(as.data.frame(expVector), aes(x = expVector)) +
  geom_histogram(binwidth = 2, color = 'black', fill = 'white') +
  xlab('Random Exponential') +
  ylab('Frequency') +
  ggtitle('Random Exponential Distribution') +
  theme(text = element_text(size = 8))
gg2 <-
  ggplot(as.data.frame(avgVector), aes(x = avgVector)) +
  geom_histogram(binwidth = 0.25, color = 'black', fill = 'white') +
  xlab('Mean of 40 Random Exponentials') +
  ylab('Frequency') +
  ggtitle('Simulated Mean Distribution') +
  theme(text = element_text(size = 8))
#grid.arrange(gg1, gg2, ncol = 2)
grid.arrange(ggplotGrob(gg1),
             ggplotGrob(gg2),
             ncol = 2,
             widths = unit(c(0.5, 0.5), "npc"))
The distribution of the 1000 simulated means is roughly symmetric and bell-shaped, and so looks far more Gaussian than the strongly right-skewed distribution of 1000 individual random exponentials.
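The visual comparison can be complemented with a simple numeric measure of asymmetry. The sketch below is an illustrative addition under stated assumptions: it regenerates its own data instead of reusing the objects above, and the helper `skewness` along with the names `rawDraws`, `sampleMeans`, `skew_raw` and `skew_means` are hypothetical. A normal distribution has skewness 0, an exponential distribution has skewness 2, and the mean of n independent exponentials has skewness 2/sqrt(n), about 0.32 for n = 40.

```r
lambda <- 0.2
sampleSize <- 40

set.seed(1)
# 1000 individual exponentials
rawDraws <- rexp(n = 1000, rate = lambda)
# 1000 means of 40 exponentials each (one sample per matrix row)
sampleMeans <- rowMeans(matrix(rexp(n = 1000 * sampleSize, rate = lambda),
                               nrow = 1000))

# Moment-based sample skewness: third central moment over
# the second central moment raised to the power 3/2.
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

skew_raw <- skewness(rawDraws)
skew_means <- skewness(sampleMeans)

print(c(raw = skew_raw, means = skew_means))
```

A skewness for the sample means that is much closer to 0 than that of the raw draws is consistent with the histograms, although it does not amount to a formal test of normality.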
The central limit theorem as described in the Overview is supported by:
- Close proximity of the simulation mean to the theoretical mean,
- Close proximity of the simulation variance to the theoretical variance divided by the sample size, and
- The approximately Gaussian shape of the distribution of simulation means.