An investigation of the exponential distribution, compared to the Central Limit Theorem

Author: Bas Baccarne

Overview

This report investigates the Central Limit Theorem (CLT) in relation to the exponential distribution. By drawing random samples from this distribution and comparing their statistics with the theoretical quantities of the distribution, we show how the CLT allows us to estimate the theoretical (or population) mean and variance.

Simulations

First, we set the basic parameters for the simulation of exponentials.

  • The lambda parameter for the distribution is 0.2.
  • The theoretical mean and sd for this distribution are both 1/lambda.
  • Our simulation contains 1,000 random samples, each of size n = 40, drawn from this distribution.
lambda = .2
mean = 1/lambda; sd = 1/lambda
nosim = 1000
n = 40
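
The quantities the CLT predicts for the mean of 40 exponentials follow directly from these parameters. The lines below are a minimal sketch; the names mu, sigma, se_mean and var_mean are introduced here for illustration and are not part of the original code.

```r
lambda <- 0.2
n <- 40

mu <- 1 / lambda            # theoretical mean of a single exponential
sigma <- 1 / lambda         # theoretical standard deviation of a single exponential
se_mean <- sigma / sqrt(n)  # standard deviation of the mean of n draws
var_mean <- sigma^2 / n     # variance of the mean of n draws

round(c(mu = mu, sigma = sigma, se_mean = se_mean, var_mean = var_mean), 3)
```

Under these assumptions the sample means should scatter around 5 with a standard deviation of roughly 0.79.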

Next, we run the simulation. We first define a sample function, then call it 1,000 times and store the empirical means and empirical variances in the new variables ‘means’ and ‘variances’, so that each variable holds 1,000 values.

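# draw one sample of n exponentials with rate lambda
draw_sample <- function() rexp(n, lambda)
means <- numeric(nosim)
variances <- numeric(nosim)
for(i in seq_len(nosim)){
        x <- draw_sample()      # one draw per iteration, used for both statistics
        means[i] <- mean(x)
        variances[i] <- var(x)
        }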
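
The same simulation can also be written without a loop, by drawing all values at once into a matrix with one row per sample. This is a sketch of an alternative, not the code that produced the results in this report; the seed and the names draws, means_vec and variances_vec are assumptions chosen here so that the run is reproducible and does not overwrite the variables above.

```r
set.seed(1)  # arbitrary seed, only so that repeated runs give identical numbers

lambda <- 0.2
nosim <- 1000
n <- 40

# nosim x n matrix: each row is one sample of n exponentials
draws <- matrix(rexp(nosim * n, rate = lambda), nrow = nosim, ncol = n)

means_vec <- rowMeans(draws)           # one mean per sample
variances_vec <- apply(draws, 1, var)  # one variance per sample

length(means_vec)
length(variances_vec)
```

In this version the mean and the variance of each row are computed from the same 40 draws.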

Sample Mean versus Theoretical Mean

As the graph below shows, the distribution of the empirical averages resembles a normal curve. The average of the sample means (5.021) is close to the theoretical mean of the distribution (5).

Theoretical mean:

mean
## [1] 5

Mean of the sample means:

round(mean(means),3)
## [1] 5.021
library(ggplot2)
qplot(means) +
        geom_vline(xintercept = mean, color="lightseagreen") +
        geom_vline(xintercept = mean(means), color="tomato2") +
        annotate("text", x = 7, y = 85, label = paste("theoretical mean =",mean), color="lightseagreen") +
        annotate("text", x = 7, y = 80, label = paste("sample mean =",round(mean(means),2)), color="tomato2") +
        labs(title="comparing sample and theoretical means") + 
        labs(x = "sample means", y = "frequency")
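
The CLT predicts the spread of the sample means as well as their center: their standard deviation should be close to sigma / sqrt(n). The sketch below checks this on a fresh simulation; the seed and the names sim_means, empirical_se and theoretical_se are assumptions made for this illustration.

```r
set.seed(2)  # arbitrary seed for reproducibility

lambda <- 0.2
n <- 40
nosim <- 1000

# means of nosim independent samples of size n
sim_means <- replicate(nosim, mean(rexp(n, rate = lambda)))

empirical_se <- sd(sim_means)             # spread observed in the simulation
theoretical_se <- (1 / lambda) / sqrt(n)  # sigma / sqrt(n), about 0.79

round(c(empirical = empirical_se, theoretical = theoretical_se), 3)
```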

Sample Variance versus Theoretical Variance

In a similar fashion, we plot the variances of our simulated samples below. The average of the sample variances (24.522) is close to the theoretical variance of the distribution (25).

Theoretical variance:

sd^2
## [1] 25

Mean of the sample variances:

round(mean(variances),3)
## [1] 24.522
qplot(variances) +
        geom_vline(xintercept = sd^2, color="lightseagreen") +
        geom_vline(xintercept = mean(variances), color="tomato2") +
        annotate("text", x = 65, y = 90, label = paste("theoretical variance =",sd^2), color="lightseagreen") +
        annotate("text", x = 65, y = 82, label = paste("sample variance =",round(mean(variances),2)), color="tomato2") +
        labs(title="comparing sample and theoretical variances") + 
        labs(x = "sample variances", y = "frequency")
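
One reason the average of the sample variances lands near 25 is that R's var() divides by n - 1, which makes it an unbiased estimator of the population variance. The sketch below contrasts it with the version that divides by n; the seed and the names s2_unbiased and s2_biased are assumptions made for this illustration.

```r
set.seed(3)  # arbitrary seed for reproducibility

lambda <- 0.2
n <- 40
nosim <- 1000

# var() uses the n - 1 denominator
s2_unbiased <- replicate(nosim, var(rexp(n, rate = lambda)))

# the same sums of squares divided by n instead of n - 1
s2_biased <- s2_unbiased * (n - 1) / n

round(c(unbiased = mean(s2_unbiased),
        biased = mean(s2_biased),
        theoretical = 1 / lambda^2), 2)
```

With n = 40 the two versions differ only by the factor 39/40, so the gap is small at this sample size.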

Distribution

Finally, we compare the distribution of a large collection (n = 40,000) of random exponentials with the distribution of a large collection (n = 1,000) of averages of 40 exponentials. The first image shows the distribution of 40,000 random exponentials, the second shows the distribution of 1,000 averages of 40 random exponentials.

qplot(rexp(40000, lambda)) +
        labs(title="distribution of 40,000 random exponentials (lambda=0.2)") +
        labs(x = "value", y = "frequency")

qplot(means) +
        labs(title="distribution of 1,000 averages of 40 random exponentials (lambda=0.2)") +
        labs(x = "average values", y = "frequency")

The distributions above show the CLT in action. The individual exponentials are strongly right-skewed and look nothing like a normal curve, whereas the averages of 40 exponentials are approximately normally distributed and centered near the population mean of 1/lambda = 5.
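
A histogram only gives a visual impression of normality. One way to check it more directly is to standardize the sample means with the theoretical mean and standard error and compare them with a standard normal. This is a minimal sketch on a fresh simulation; the seed and the names sim_means, z and coverage are assumptions made for this illustration.

```r
set.seed(4)  # arbitrary seed for reproducibility

lambda <- 0.2
n <- 40
nosim <- 1000

sim_means <- replicate(nosim, mean(rexp(n, rate = lambda)))

# standardize with the theoretical mean and the theoretical standard error
z <- (sim_means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# share of standardized means within 1.96 of zero; about 0.95 for a normal
coverage <- mean(abs(z) <= 1.96)
coverage

# normal Q-Q plot: points close to the reference line indicate approximate normality
qqnorm(z)
qqline(z)
```

With only 40 draws per average, some right skew inherited from the exponential is likely to remain visible in the upper tail of the Q-Q plot.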