Author: Bas Baccarne
This report investigates the Central Limit Theorem (CLT) in relation to the exponential distribution. By drawing random samples from this distribution and comparing their statistics with the theoretical quantities of the distribution, we show how the CLT allows us to estimate the theoretical (or population) mean and variance.
First, we set the basic parameters for the simulation of exponentials.
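lambda = .2                      # rate parameter of the exponential distribution
mean = 1/lambda; sd = 1/lambda   # theoretical mean and sd (these variables shadow the function names, but calls to mean() and sd() still work)
nosim = 1000                     # number of simulated samples
n = 40                           # observations per sample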
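For reference, the values assigned to `mean` and `sd` above follow from the standard properties of the exponential distribution with rate λ. Written out for λ = 0.2 (textbook facts, not results derived in this report):

```latex
f(x) = \lambda e^{-\lambda x}\ (x \ge 0),
\qquad \mathbb{E}[X] = \frac{1}{\lambda} = 5,
\qquad \operatorname{sd}(X) = \frac{1}{\lambda} = 5,
\qquad \operatorname{Var}(X) = \frac{1}{\lambda^2} = 25
```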
Next we run the simulation. We first define a sample function, then run it 1,000 times, storing the empirical averages and empirical variances in the new variables ‘means’ and ‘variances’, so that each variable holds 1,000 values.
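# Draw one sample of n exponentials with rate lambda
draw_sample <- function() rexp(n, lambda)
means <- numeric(nosim)
variances <- numeric(nosim)
for (i in seq_len(nosim)) {
        x <- draw_sample()       # draw once so mean and variance describe the same sample
        means[i] <- mean(x)
        variances[i] <- var(x)
}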
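The loop above can also be written without an explicit loop. The following is a minimal sketch under the same parameters; the names `samples`, `means_vec` and `variances_vec`, and the fixed seed, are illustrative assumptions and not part of the original analysis.

```r
# Parameters repeated here so the sketch runs on its own
lambda <- 0.2
nosim <- 1000
n <- 40

set.seed(42)  # assumption: a fixed seed, added only for reproducibility

# One row per simulated sample: nosim rows of n exponentials each
samples <- matrix(rexp(nosim * n, rate = lambda), nrow = nosim, ncol = n)

# Row-wise mean and variance, so both statistics come from the same sample
means_vec <- rowMeans(samples)
variances_vec <- apply(samples, 1, var)
```

Storing all draws in one matrix keeps each sample available for later inspection, at the cost of holding 40,000 values in memory, which is negligible here.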
As we can see in the graph below, the distribution of empirical averages resembles a normal curve. The average of the empirical sample means estimates the theoretical mean of the distribution (population) quite well.
Theoretical mean:
mean
## [1] 5
Mean of the sample means:
round(mean(means),3)
## [1] 5.021
library(ggplot2)
qplot(means) +
        geom_vline(xintercept = mean, color="lightseagreen") +
        geom_vline(xintercept = mean(means), color="tomato2") +
        annotate("text", x = 7, y = 85, label = paste("theoretical mean =",mean), color="lightseagreen") +
        annotate("text", x = 7, y = 80, label = paste("sample mean =",round(mean(means),2)), color="tomato2") +
        labs(title="comparing sample and theoretical means") + 
        labs(x = "sample means", y = "frequency")
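For reference, the CLT statement behind this plot can be written out for the parameters used here (λ = 0.2, n = 40). This is the standard textbook form rather than something derived in the report:

```latex
\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right),
\qquad \mu = \frac{1}{\lambda} = 5,
\qquad \frac{\sigma}{\sqrt{n}} = \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.79
```

Under this approximation, the histogram of sample means should be centered near 5 with a standard deviation of roughly 0.79.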
In a similar fashion, we plot the variances of our simulation sample below. As we can see, the average of the empirical sample variances estimates the theoretical variance of the distribution (population) quite well.
Theoretical variance:
sd^2
## [1] 25
Mean of the sample variances:
round(mean(variances),3)
## [1] 24.522
qplot(variances) +
        geom_vline(xintercept = sd^2, color="lightseagreen") +
        geom_vline(xintercept = mean(variances), color="tomato2") +
        annotate("text", x = 65, y = 90, label = paste("theoretical variance =",sd^2), color="lightseagreen") +
        annotate("text", x = 65, y = 82, label = paste("sample variance =",round(mean(variances),2)), color="tomato2") +
        labs(title="comparing sample and theoretical variances") + 
        labs(x = "sample variances", y = "frequency")
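The agreement between the two lines follows from the unbiasedness of the sample variance. R's `var()` uses the n - 1 denominator, so for the parameters used here:

```latex
S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2,
\qquad
\mathbb{E}\!\left[S^2\right] = \sigma^2 = \frac{1}{\lambda^2} = 25
```

This holds for any sample size, so the average of many sample variances is expected to land near 25 even though the individual sample variances are right-skewed.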
Finally, we compare the distribution of a large collection (n = 40,000) of random exponentials with the distribution of a large collection (n = 1,000) of averages of 40 exponentials. The first image shows the distribution of 40,000 random exponentials, the second shows the distribution of 1,000 averages of 40 random exponentials.
qplot(rexp(40000, lambda)) +
        labs(title="distribution of 40,000 random exponentials (lambda=0.2)") +
        labs(x = "value", y = "frequency")
qplot(means) +
        labs(title="distribution of 1,000 averages of 40 random exponentials (lambda=0.2)") +
        labs(x = "average values", y = "frequency")
The distributions above show the CLT in action. The raw exponentials are strongly right-skewed and do not follow a normal curve, even with 40,000 draws. The averages of 40 exponentials, in contrast, are approximately normally distributed and centered near the population mean, which is why the mean of these averages estimates the population mean well.
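The visual comparison above can be complemented with a rough numerical check. The sketch below is one possible approach, not part of the original analysis: it compares the sample skewness of the raw draws with that of the averages, and the spread of the averages with the CLT prediction. The helper `skewness()`, the variable names and the seed are assumptions introduced for illustration.

```r
# Parameters repeated here so the sketch runs on its own
lambda <- 0.2
nosim <- 1000
n <- 40

set.seed(42)  # assumption: a fixed seed, added only for reproducibility

# 40,000 raw exponentials, then grouped into 1,000 samples of 40
raw <- rexp(nosim * n, rate = lambda)
avgs <- rowMeans(matrix(raw, nrow = nosim, ncol = n))

# Hypothetical helper: simple moment-based sample skewness
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(raw)    # theory: 2 for an exponential distribution
skew_avgs <- skewness(avgs)  # theory: 2 / sqrt(n), much closer to 0

# Spread of the averages versus the CLT prediction sigma / sqrt(n)
sd_avgs <- sd(avgs)
expected_se <- (1 / lambda) / sqrt(n)
```

A skewness near 0 for `avgs` alongside a skewness near 2 for `raw` is consistent with the averages being approximately normal while the raw data are not. A normal Q-Q plot via `qqnorm(avgs)` would give a complementary visual check.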