Illustrating the Central Limit Theorem: A Simulation Exercise

Marilyn Moucharite

January, 2024

Overview

This report will demonstrate the concept of the Central Limit Theorem which predicts that given a large enough sample size and independent random sampling of a population with a finite variance, the distribution of the sample mean values will approach a normal distribution even if the population from which the samples are drawn is not normally distributed. This demonstration will include running 1,000 simulations of an exponential distribution each containing 40 observations then calculating the mean value of those 40 observations for each simulation resulting in a data set containing 1,000 mean values of exponential distributions.

This report will include the following information:

  1. Show the sample mean and compare it to the theoretical mean of the distribution.

  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

  3. Show that the distribution is approximately normal.

In point 3, we will focus on the difference between the distribution of a large collection of observations randomly drawn using a function produces observations with an exponential distribution and the distribution of a large collection of mean values of 40 observations.

The report consists of two parts:

  1. A simulation exercise

  2. Basic inferential data analysis

#Part 1: Simulation Exercise

Simulations

The exponential distribution can be simulated in R with rexp(n, lambda) where n is the number of random observations drawn from the exponential distribution and lambda is the rate parameter (long term average number of events divided by the time period in which the events occur). The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. We will run 1,000 simulations and set the value of lambda to 0.2 for all of the simulations.

library(ggplot2)
library(knitr)


#utility function for histogram title formatting
#provided by @Richie https://stackoverflow.com/a/3935429/4767610
wrapper <- function(x, ...) 
{
  paste(strwrap(x, ...), collapse = "\n")
}

#set the seed value so the simulation can be reproduced
set.seed(5)

n_obs<-40  #number of observations
lambda=0.2  #rate
n_sims<-1000 #number of simulations

#function to create an exponential distribution of 40 (n_obs) observations then calculate the mean of the observation values
simExpoMean <- function(n, lambda){
    mean(rexp(n,lambda))
}

#call the simExpoMean function 1,000 times 
#create a data frame of the mean values 
sim_data <- data.frame(ncol=2,nrow=n_sims)
names(sim_data) = c("Sample Num","Mean")
for (i in 1:n_sims)
{
    sim_data[i,1] <- i
    sim_data[i,2] <- simExpoMean(n_obs,lambda)
}

#create a data frame of simulated data to demonstrate the distribution of data generated with the exponential function
expo_data<-data.frame(obs=rexp(n_obs*n_sims,lambda))

#create a histogram showing the distribution of values generated using the exponential function
#we expect this data to have an exponential distribution
#we will compare this distribution to the distribution of the mean values calculated from each of the 1,000 simulations
ggplot(expo_data, aes(obs))+
  geom_histogram(alpha=0.5, position="identity", color="darkblue", fill="lightblue", binwidth=0.25)+
  theme(plot.title = element_text(hjust = 0.5))+
  labs(x="Observation values", 
              y="Frequency",
              title=wrapper("Figure 1: Distribution of 40,000 observations randomly generated using exponential function",60))

Figure 1 illustrates that the sample data, generated using the exponential function with 1,000 simulations each containing 40 observations, has an exponential distribution (not surprising). We will next evaluate a data set that comprises the sample mean values from each of the 1,000 simulations to compare the mean of that data to the theoretical mean.

Sample Mean versus Theoretical Mean

#calculate the sample and theoretical mean values
sampleMean<-mean(sim_data$Mean)
theoreticalMean<-1/lambda

compare_means <-data.frame("Mean"=c(sampleMean,theoreticalMean), 
                     row.names = c("Sample Mean","Theoretical mean"))

compare_means
##                      Mean
## Sample Mean      5.043053
## Theoretical mean 5.000000

The value of the sample mean is 5.043053 which is, as expected, close the the value of the theoretical mean of 5.

Next, we will examine the distribution of the data set that comprises the sample mean values from each of the 1,000 simulations.

ggplot(sim_data, aes(Mean))+
  geom_histogram(alpha=0.5, position="identity", color="darkblue", fill="lightblue", binwidth=0.25)+
  scale_color_manual(name="group", values=c("r" = "red", "b"="green"), labels=c("g"="Sample mean", "r"="Theoretical mean")) +
  scale_fill_manual(name="Mean values",values=c("red","green"),labels=c("Theoretical Mean","Sample Mean")) +
  geom_vline(xintercept = theoreticalMean, color="red",show.legend=TRUE, linewidth = 1.5, col="Theoretical Mean")+
  geom_vline(xintercept = sampleMean, color="green", show.legend=TRUE, linewidth = 1.5, col="Sample Mean")+
  labs(x="Sample mean values", 
              y="Frequency",
              title=wrapper("Figure 2: Distribution of mean values from 1,000 samples of a population with an exponential distribution",60),
              subtitle="Theoretical Mean Intercept in Red and Sample Mean Intercept in Green")
## Warning: Duplicated aesthetics after name standardisation: colour
## Duplicated aesthetics after name standardisation: colour

The distribution of the sample means (Figure 2) appears to represent the normal distribution. This result is consistent the Central Limit Theorem. Figure 1 shows that the population from which the sample means were drawn has an exponential distribution. The Central Limit Theorem states that a set of mean values from independent random samples will approximate a normal distribution even though the population from which the samples were drawn may not have a normal distribution.

The close proximity of the green line, which represents the sample mean, to the red line which, represents the theoretical mean, is what we expect.

Sample Variance versus Theoretical Variance

#calculate the sample and theoretical 
sampleVariance<-var(sim_data$Mean)
theoreticalVariance<- (1/lambda)^2/n_obs

compare_variance <-data.frame("Variance"=c(sampleVariance,theoreticalVariance), 
                     row.names = c("Sample variance","Theoretical variance"))

compare_variance
##                       Variance
## Sample variance      0.6026047
## Theoretical variance 0.6250000

We expect the sample variance and the theoretical variance to be close in value because, as demonstrated in Figure 2, the distribution of means from the sample approximates a normal distribution and therefore the sample mean and the theoretical mean are close in value. The mean and variance are both parameters of the normal distribution so we also expect the sample variance and theoretical variance to be close in value.

The distribution of means is approximately normal

ggplot(sim_data, aes(Mean))+
        geom_histogram(aes(y=after_stat(density)), alpha=.5, position="identity", fill="lightblue", col="darkblue", binwidth=0.25)+
        geom_density(color="red", linewidth=1)+
        stat_function(fun = dnorm, color = "green", args = list(mean = theoreticalMean, sd = sqrt(theoreticalVariance)))+
          labs(x="Sample mean", 
              y="Frequency",
              title=wrapper("Figure 3: Sample means with Distribution Sample Means (red curve) and Normal Distribution (green curve)",68))

Figure 3 shows that the distribution of the sample mean values (red line) is very close to the normal distribution (green line) indicating that the sample mean distribution approximates a normal distribution.

qqnorm(sim_data$Mean, main ="Figure 4: Normal probability plot")
 qqline(sim_data$Mean,col = "red", lwd = 2)

Figure 4 shows that the to the simulation data plot is close to the straight line created from a sample representing a normal distribution (red line). This indicates that the data containing mean values from the simulations approximates a normal distribution.