Verifying the Central Limit Theorem

Overview

In this project I investigate the exponential distribution and compare it with the Central Limit Theorem (CLT) using simulation in R. I compare:

the sample mean with the theoretical mean of the distribution.
the sample variance with the theoretical variance of the distribution.
the distribution of a large collection of random exponentials with the distribution of a large collection of averages of 40 exponentials.

Simulation

The goal is to simulate the exponential distribution with rate parameter lambda (named lamda in the code). The theoretical mean and the theoretical standard deviation of the exponential distribution are both 1/lambda. I use lambda = 0.2 for all the simulations. To simulate the data, I call the function rexp(n, lamda) inside a for loop to generate 1000 means, each the mean of 40 random exponential variables.

set.seed(1)
## setting the parameters
lamda = 0.2
n = 40
sd = 1/lamda
mean = 1/lamda
## Simulating 1000 averages of 40 exponential variables 
## with lamda = 0.2 using for loop. 
mean_sample = NULL
for (i in 1:1000) mean_sample = c(mean_sample, mean(rexp(n, lamda)))
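
The loop above grows mean_sample one element at a time. As a sketch of an equivalent, more compact approach (not the method used in this report), replicate() can evaluate the same expression 1000 times. The name mean_sample_alt is introduced here for illustration only.

```r
set.seed(1)
lamda <- 0.2   # rate parameter, same value as above
n <- 40        # number of exponentials per average

# replicate() evaluates the expression 1000 times and collects the results
mean_sample_alt <- replicate(1000, mean(rexp(n, lamda)))

length(mean_sample_alt)    # number of simulated averages
summary(mean_sample_alt)   # quick look at their spread
```

Because the random draws are made in the same order, this should reproduce the values from the loop when the same seed is used.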

Comparison

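The CLT tells us that, for a large enough sample, the sample mean is approximately normally distributed, with mean equal to the population mean it estimates and variance equal to the population variance divided by the number of observations. Its standard deviation is therefore the population standard deviation divided by the square root of the number of observations.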

## CLT ==> mean_sample ~ N(mean, sd^2/n)
Theoretical_mean <- 1/lamda
Theoretical_var <- sd^2/n
sample_mean_mean <- mean(mean_sample)
sample_mean_var <- var(mean_sample)
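
To make the theoretical values concrete, the arithmetic can be worked through step by step. This is only a small sketch; the names sigma, var_of_mean and se_of_mean are introduced here for illustration and are not used elsewhere in the report.

```r
lamda <- 0.2
n <- 40

sigma <- 1 / lamda                # population standard deviation: 5
var_of_mean <- sigma^2 / n        # variance of the sample mean: 25 / 40 = 0.625
se_of_mean <- sqrt(var_of_mean)   # standard deviation of the sample mean, about 0.79

c(var_of_mean = var_of_mean, se_of_mean = se_of_mean)
```

Note that dnorm() expects the standard deviation (se_of_mean here), not the variance.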

Sample Mean Vs. Theoretical Mean

The theoretical mean of the sample mean is 5, and the mean of the 1000 simulated sample means is 4.9900252. As the number of simulations grows, the average of the simulated means should converge to the theoretical mean (this follows from the law of large numbers). Even with only 1000 simulations, the two values are almost equal.
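
One possible way to quantify "almost equal" is to compare the gap with the standard error of the average of the simulated means. The sketch below regenerates the simulation so that it runs on its own. The names sims, se_sim and interval, and the multiplier 1.96 (an approximate 95% normal interval), are choices made here and are not part of the original analysis.

```r
set.seed(1)
lamda <- 0.2
n <- 40
sims <- replicate(1000, mean(rexp(n, lamda)))

# Standard error of the average of the simulated means
se_sim <- sd(sims) / sqrt(length(sims))

# Approximate 95% interval around the simulated average
interval <- mean(sims) + c(-1, 1) * 1.96 * se_sim
interval

# Does the interval cover the theoretical mean 1/lamda?
1 / lamda >= interval[1] && 1 / lamda <= interval[2]
```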

In the histogram below, the green vertical line is the mean of the simulated sample means and the red line is the theoretical mean of the sample mean. The two lines almost coincide, which is consistent with the theory.

hist(mean_sample, col="grey", main = "Sample Mean Vs. Theoretical Mean", prob=T)
## normal density implied by the CLT: mean 5, sd = sqrt(0.625)
curve(dnorm(x,Theoretical_mean,sqrt(Theoretical_var)),col="blue",lty=2, lwd=2,add=T)
abline(v = sample_mean_mean, col = "green", lwd = 4)
abline(v=Theoretical_mean, col="red", lwd=2)

Sample Variance Vs. Theoretical Variance

The theoretical variance of the sample mean is 0.625 and the variance of the simulated sample means is 0.6111165. Again, the values are almost equal.

The green vertical lines in the histogram below mark one theoretical standard deviation on either side of the theoretical mean (5), and the red lines mark one sample standard deviation on either side of the mean of the simulated sample means. As expected, the two pairs of lines almost overlap, which is consistent with the CLT.

hist(mean_sample, col="grey", 
     main = "Sample Variance Vs. Theoretical Variance", prob=T)
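## normal density implied by the CLT: mean 5, sd = sqrt(0.625)
curve(dnorm(x,Theoretical_mean,sqrt(Theoretical_var)),col="blue",lty=2,lwd=2,add=T)
abline(v = Theoretical_mean + c(-1,1)*sqrt(Theoretical_var), col = "green", lwd = 3)
abline(v = sample_mean_mean + c(-1,1)*sqrt(sample_mean_var), col="red", lwd=2)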
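
Matching means and variances does not by itself show that the shape is normal. A normal Q-Q plot is one common way to check the shape. The sketch below is self-contained and was not part of the original analysis; the names sims and qq are introduced here for illustration.

```r
set.seed(1)
lamda <- 0.2
n <- 40
sims <- replicate(1000, mean(rexp(n, lamda)))

# Points close to the reference line suggest an approximately normal shape
qq <- qqnorm(sims, main = "Normal Q-Q Plot of Sample Means")
qqline(sims, col = "red", lwd = 2)

# Correlation between theoretical and observed quantiles (closer to 1 is better)
cor(qq$x, qq$y)
```

Some curvature in the tails is to be expected, because the mean of 40 exponentials is still slightly right-skewed.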

Exponential Distribution Vs. Distribution of the Sample Mean of Exponential Distribution

First, let us see what the distributions of 10, 100, 1000, and 10000 random exponentials look like.

set.seed(5)
Exp_dist_10 <- rexp(10, 0.2 )
Exp_dist_100 <- rexp(100, 0.2 )
Exp_dist_1000 <- rexp(1000, 0.2 )
Exp_dist_10000 <- rexp(10000, 0.2 )

plot(density(Exp_dist_10000), col="dark blue", lty=1, lwd=2, 
     main="Exponential Distribution",xlim=c(3,40), xlab="X")
lines(density(Exp_dist_1000), lty=1, lwd=2, col="red")
lines(density(Exp_dist_100), lty=1, lwd=2, col="green")
lines(density(Exp_dist_10), lty=1, lwd=2, col="blue")
curve(dexp(x,0.2), add=T, col="black", lty=3,lwd=4)
abline(v=mean, col="brown", lty=2, lwd =1)

## line types and colours match the curves drawn above
legend("topright", lty = c(3, 1, 1, 1, 1), lwd = 2,
  legend= c("Pop_Dist","sim = 10", "sim = 100", "sim = 1000", "sim = 10000"),  
  col=c("black", "blue", "green", "red","dark blue"))

As you can see, the empirical distribution of the simulated exponentials converges towards the population distribution as we increase the number of simulated values.
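
This convergence can also be put into numbers. As a rough sketch, the Kolmogorov-Smirnov statistic from ks.test() measures the largest gap between the empirical CDF of a sample and the exponential CDF pexp(x, lamda). The names sizes and ks_dist are introduced here for illustration, and the samples are drawn afresh rather than reused from the plot.

```r
set.seed(5)
lamda <- 0.2
sizes <- c(10, 100, 1000, 10000)

# For each sample size, the largest gap between the empirical CDF
# and the theoretical exponential CDF with rate lamda
ks_dist <- sapply(sizes, function(m) {
  x <- rexp(m, lamda)
  unname(ks.test(x, "pexp", lamda)$statistic)
})

data.frame(sizes, ks_dist)
```

The gap should generally shrink as the number of simulated values grows, although it need not decrease at every single step.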

Now let's do the same with the distribution of the sample mean. By the CLT, the sample mean is approximately normally distributed, with mean = 5 and variance = 0.625 in this case. Let's simulate collections of 5, 10, 100, and 1000 sample means, each the mean of n = 40 exponentials with lamda = 0.2.

set.seed(4)
sample_mean_5 = NULL
for (i in 1:5) sample_mean_5 = c(sample_mean_5, mean(rexp(n, lamda)))
sample_mean_10 = NULL
for (i in 1:10) sample_mean_10 = c(sample_mean_10, mean(rexp(n, lamda)))
sample_mean_100 = NULL
for (i in 1:100) sample_mean_100 = c(sample_mean_100, mean(rexp(n, lamda)))
sample_mean_1000 = NULL
for (i in 1:1000) sample_mean_1000 = c(sample_mean_1000, mean(rexp(n, lamda)))

plot(density(sample_mean_1000), col="grey", lwd=2, 
     main="Distribution of Sample Mean",ylim=c(0,0.65), xlab="Sample Mean")
lines(density(sample_mean_100), col="red")
lines(density(sample_mean_10), col="green")
lines(density(sample_mean_5), col="blue")
## dnorm() takes the standard deviation, so use sqrt(0.625), not the variance
curve(dnorm(x,5,sqrt(0.625)), add=T, col="black", lty=3,lwd=2)
abline(v=mean(mean_sample), col="red", lty=2, lwd=2)

legend("topright", lty = c(3, 1, 1, 1, 1), lwd = 2,
  legend= c("Theoretical", "sim = 5", "sim =10", "sim=100", "sim=1000"), 
  col=c( "black","blue", "green", "red","grey"))

As the figure shows, the estimated density of the sample mean gets closer to the theoretical normal distribution as we increase the number of simulated sample means. This is consistent with the CLT, which says that the distribution of the sample mean is approximately normal when the sample size (here n = 40) is large enough.
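
The number of simulations and the sample size n play different roles: more simulations give a better picture of the distribution of the sample mean, while a larger n makes that distribution itself closer to normal. The sketch below illustrates the second effect through skewness, which is 0 for a normal distribution and 2 / sqrt(n) for the mean of n exponentials. The helper skew(), the sample sizes 5, 40 and 200, and the choice of 2000 simulations are assumptions made here for illustration.

```r
set.seed(4)
lamda <- 0.2

# Sample skewness, written out by hand (base R has no skewness function)
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# 2000 simulated sample means for each sample size n
sizes <- c(5, 40, 200)
skews <- sapply(sizes, function(n) {
  skew(replicate(2000, mean(rexp(n, lamda))))
})

# Simulated skewness next to the theoretical value 2 / sqrt(n)
data.frame(n = sizes, simulated = skews, theoretical = 2 / sqrt(sizes))
```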