Demonstrating the Central Limit Theorem

Synopsis

The Central Limit Theorem states that if you take sufficiently large independent samples from a population with mean \(\mu\) and finite standard deviation \(\sigma\), then the sample means are approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\), where \(n\) is the sample size. This analysis demonstrates the Central Limit Theorem using the Exponential distribution as an example.
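As a worked instance of the statement above, the following lines specialise it to the Exponential distribution, using the values \(\lambda = 0.2\) and \(n = 40\) that the simulation in this analysis adopts. The symbols \(X_i\) and \(\bar{X}_n\) (an individual draw and the mean of one sample) are notation introduced here for illustration.

```latex
X_i \sim \mathrm{Exp}(\lambda)
  \;\Longrightarrow\;
  \mu = \frac{1}{\lambda}, \qquad \sigma = \frac{1}{\lambda}

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i
  \;\overset{\text{approx.}}{\sim}\;
  \mathcal{N}\!\left(\frac{1}{\lambda},\; \frac{1}{\lambda^{2} n}\right)

\lambda = 0.2,\; n = 40:
  \quad \mu = 5,
  \quad \frac{\sigma^{2}}{n} = \frac{25}{40} = 0.625,
  \quad \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.79
```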

Simulations

First we simulate 1000 samples of size \(n = 40\) from an Exponential distribution with rate parameter \(\lambda = 0.2\). The data are arranged in a 40 x 1000 matrix, so that each column holds one sample.

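n <- 40          # sample size
m <- 1000        # number of samples
lambda <- 0.2    # rate parameter of the Exponential distribution
set.seed(28122021)
# n rows by m columns: each column is one sample of size n
data <- matrix(rexp(n * m, rate = lambda), n, m)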
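Because the orientation of the matrix matters for every later step, it may be worth confirming it directly. The following is a minimal sketch that regenerates the data with the same parameters and seed; the names `dims` and `grand_mean` are introduced here purely for illustration.

```r
# Same parameters and seed as the simulation above
n <- 40
m <- 1000
lambda <- 0.2
set.seed(28122021)
data <- matrix(rexp(n * m, rate = lambda), n, m)

# matrix(x, n, m) builds n rows and m columns,
# so each of the 1000 columns is one sample of 40 draws
dims <- dim(data)
print(dims)  # 40 rows, 1000 columns

# The mean of all n * m draws should sit close to 1/lambda = 5
grand_mean <- mean(data)
print(grand_mean)
```

Since samples are stored in columns, per-sample statistics are computed with a margin of 2 in `apply`, or equivalently with `colMeans`.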

Sample Mean vs. Theoretical Mean

We seek to demonstrate that as the number of samples increases, the mean of sample means converges to the mean of the original distribution from which the samples were drawn.

We compute the mean of each sample and then the running mean of those sample means: the mean of sample mean 1, then of sample means 1 and 2, then of sample means 1, 2 and 3, and so on.

smeans <- colMeans(data)   # one mean per sample (column)
means <- numeric(m)        # preallocate the running means
for (i in seq_len(m)) {
        means[i] <- mean(smeans[1:i])   # mean of the first i sample means
}
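The running mean can also be computed without a loop. The sketch below is one possible alternative, not the method used in this analysis: it regenerates the data with the same parameters and compares the loop against a `cumsum`-based version. The names `means_loop`, `means_vec` and `same` are illustrative.

```r
n <- 40
m <- 1000
lambda <- 0.2
set.seed(28122021)
data <- matrix(rexp(n * m, rate = lambda), n, m)
smeans <- colMeans(data)

# Loop version: mean of the first i sample means, for i = 1, ..., m
means_loop <- numeric(m)
for (i in seq_len(m)) {
  means_loop[i] <- mean(smeans[1:i])
}

# Vectorised version: cumulative sum divided by the running count
means_vec <- cumsum(smeans) / seq_along(smeans)

# The two agree up to floating-point tolerance
same <- isTRUE(all.equal(means_loop, means_vec))
print(same)
```

The vectorised form avoids recomputing each prefix from scratch, which matters little for 1000 samples but scales better for larger simulations.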

Next arrange the results in a data frame with the index of how many samples were used to calculate the mean of sample means. So the ith row contains the mean of the sample means from samples 1 through i.

df <- data.frame(index = 1:m, data = means)

Now plot the mean of sample means against the index.

library(ggplot2)
ggplot(df, aes(x = index, y = data)) +
        geom_line() +
        geom_abline(slope = 0, intercept = 1/lambda, col = "red") + 
        ggtitle("Convergence of sample means to population mean")

Note how, as the number of samples increases, the mean of sample means converges to the theoretical mean from the original distribution (overlaid in red). In this case, the original distribution was an Exponential with rate parameter \(\lambda = 0.2\) which has a theoretical mean of \(1/\lambda = 5\).

Sample Variance vs. Theoretical Variance

We seek to demonstrate that as the number of samples increases, the variance of the sample means converges to \(\sigma^2/n\), where \(\sigma^2\) is the variance of the original population distribution and \(n\) is the sample size.

We compute the variance of the mean of sample 1, then samples 1 and 2, then samples 1, 2 and 3 and so on. Note that since this calculation divides by \(j-1\) where \(j\) is the number of samples, we will not get a value for the case where we have just 1 sample mean.

vars <- numeric(m)   # preallocate the running variances
for (i in seq_len(m)) {
        vars[i] <- var(smeans[1:i])   # variance of the first i sample means (NA for i = 1)
}
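To make the point about the missing first value concrete, the sketch below recomputes the running variance with `sapply` on data regenerated with the same parameters, checks that the first entry is `NA`, and measures how far the final value is from the theoretical \(25/40 = 0.625\). The names `running_var`, `first_is_na`, `theoretical` and `final_gap` are illustrative and not part of the analysis code.

```r
n <- 40
m <- 1000
lambda <- 0.2
set.seed(28122021)
data <- matrix(rexp(n * m, rate = lambda), n, m)
smeans <- colMeans(data)

# Variance of the first i sample means, for i = 1, ..., m
running_var <- sapply(seq_along(smeans), function(i) var(smeans[1:i]))

# var() of a single value divides by 0 degrees of freedom and returns NA
first_is_na <- is.na(running_var[1])
print(first_is_na)

# Theoretical variance of the sample mean: sigma^2 / n = (1/lambda)^2 / n
theoretical <- (1 / lambda)^2 / n
final_gap <- abs(running_var[m] - theoretical)
print(c(theoretical = theoretical, final = running_var[m], gap = final_gap))
```

That single `NA` is what produces the one-row warning when the running variance is plotted as a line.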

Next arrange the results in a data frame with the index of how many samples were used to calculate the variance of sample means. So the ith row contains the variance of the sample means from samples 1 through i.

df <- data.frame(index = 1:m, data = vars)

Now plot the variance of sample means against the index.

ggplot(df, aes(x = index, y = data)) +
        geom_line() +
        geom_abline(slope = 0, intercept = (1/lambda)^2/n, col = "red") + 
        ggtitle("Convergence of variance of sample means")

## Warning: Removed 1 row(s) containing missing values (geom_path).

Note how, as the number of samples increases, the variance of sample means converges to the theoretical variance computed from the original distribution (overlaid in red). In this case, the original distribution was an Exponential with rate parameter \(\lambda = 0.2\), meaning the theoretical variance of sample means is \((1/\lambda)^2/n = 25/40 = 0.625\).

Gaussian Distribution

Finally, we seek to show that the distribution of sample means is approximately Gaussian.

We plot a histogram of the sample means and overlay their estimated density in red, together with the theoretical normal density as a dashed line.

hist(smeans, prob = TRUE, col = "light blue", main = "Sample Means of Exponential(0.2)", ylim = c(0, 0.6))
lines(density(smeans), lwd = 2, col = "red")
curve(dnorm(x, mean = 1/lambda, sd= (1/lambda)/sqrt(n)), lty = "dashed", lwd = 2, add=TRUE)

The sample means follow an approximately normal distribution centered at the population mean \(1/\lambda = 5\) with variance \((1/\lambda)^2/n = 0.625\).
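A histogram can hide departures from normality in the tails, so a complementary check may be useful. The sketch below is an optional addition under the same simulation parameters: it standardises the sample means using the theoretical mean and standard error, draws a normal Q-Q plot, and computes the share of standardised means that fall inside the central 95% normal interval. The names `mu`, `se`, `z` and `coverage` are illustrative.

```r
n <- 40
m <- 1000
lambda <- 0.2
set.seed(28122021)
data <- matrix(rexp(n * m, rate = lambda), n, m)
smeans <- colMeans(data)

# Standardise with the theoretical mean and standard error
mu <- 1 / lambda
se <- (1 / lambda) / sqrt(n)
z <- (smeans - mu) / se

# Normal Q-Q plot: points near the reference line suggest approximate normality
qqnorm(z, main = "Q-Q plot of standardised sample means")
qqline(z, col = "red")

# Under a normal approximation, about 95% of z values lie within +/- 1.96
coverage <- mean(abs(z) < qnorm(0.975))
print(coverage)
```

With \(n = 40\) the means of Exponential samples are still slightly right-skewed, so some curvature in the upper tail of the Q-Q plot would not be surprising.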

Summary

In summary, we have shown by simulation that the means of samples of size \(n\) taken from a distribution with mean \(\mu\) and variance \(\sigma^2\) are approximately normally distributed with mean \(\mu\) and variance \(\sigma^2/n\). Thus we have demonstrated the Central Limit Theorem for the Exponential distribution.