Introduction

In this project we investigate the exponential distribution through the lens of the Central Limit Theorem (CLT). The mean of an exponential distribution is 1/lambda and the standard deviation is also 1/lambda, where lambda is the rate parameter. We set lambda = 0.2 throughout this analysis and investigate the distribution of the means of samples of size n = 40 drawn from this distribution. The CLT states that, for sufficiently large n, the sampling distribution of the mean is approximately normal: Xbar ~ N(mu = 1/lambda, sigma = 1/(lambda*sqrt(n))).
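As a quick check of the arithmetic behind these formulas, the following minimal sketch evaluates the theoretical mean, standard deviation and variance of the sample mean for lambda = 0.2 and n = 40. The variable names are illustrative and are not used elsewhere in the report.

```r
# Theoretical parameters of the sampling distribution of the mean.
# All names below are illustrative, chosen for this sketch only.
lambda <- 0.2            # rate parameter, as in the text
n      <- 40             # sample size, as in the text

pop_mean <- 1 / lambda   # population mean of the exponential distribution
pop_sd   <- 1 / lambda   # population standard deviation (equal to the mean)

mean_of_means <- pop_mean            # E[Xbar]
sd_of_means   <- pop_sd / sqrt(n)    # SD(Xbar) = sigma / sqrt(n)
var_of_means  <- pop_sd^2 / n        # Var(Xbar) = sigma^2 / n

# mean = 5, sd = 0.791, var = 0.625 (rounded to 3 decimals)
round(c(mean = mean_of_means, sd = sd_of_means, var = var_of_means), 3)
```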

Using simulation and accompanying explanatory text, we illustrate the key properties of this distribution. In particular, we

1. show the sample mean and compare it to the theoretical mean of the distribution.

2. show how variable the sample is (via its variance) and compare it to the theoretical variance of the distribution.

3. show that the distribution is approximately normal.

Generating data from an exponential distribution

The following R code generates the simulated data:

library(ggplot2)   # ggplot() and related plotting functions
library(gridExtra) # grid.arrange() for side-by-side plots
set.seed(3421) # to facilitate reproducibility
lambda <- 0.2 # rate parameter; population mean and standard deviation are both 1/0.2 = 5
ssize <- 40 # sample size
nsim <- 1000 # number of simulated samples
draws <- data.frame(x = rexp(nsim * ssize, lambda))

To illustrate the characteristic exponential shape, we present a density histogram of the simulated draws with the theoretical density curve overlaid.

ggplot(draws, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), alpha = .20, binwidth = 0.8, colour = "black") +
  stat_function(fun = dexp, args = list(rate = lambda), size = 1, colour = "red") +
  xlim(0, 60)

Figure 1. Exponential distribution density curve overlaid on the simulated data.

Next we calculate and store the mean for each of the 1000 simulated samples (each of size 40): the sampling distribution of the means.

# reshape the draws into nsim rows of ssize values each, then take the mean of every row
drawnmeans <- data.frame(x = apply(matrix(draws$x, nsim), 1, mean))
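The reshape-then-average step can be easier to follow on a toy input. The sketch below uses a made-up vector of 12 values in place of the simulated draws (all names are illustrative) and shows that apply() over rows gives the same result as the built-in rowMeans(). Because the draws are iid, it does not matter which draws end up in which row.

```r
# Toy illustration of the reshape-then-average step (names are illustrative).
toy_nsim  <- 3        # number of samples
toy_ssize <- 4        # size of each sample
toy_x     <- 1:12     # stand-in for the simulated draws

# matrix() fills column by column, so row i holds toy_x[i], toy_x[i + 3], ...
toy_m <- matrix(toy_x, toy_nsim)
toy_m

# Row means via apply() and via the equivalent built-in rowMeans()
means_apply    <- apply(toy_m, 1, mean)
means_rowmeans <- rowMeans(toy_m)

means_apply                               # 5.5 6.5 7.5
all.equal(means_apply, means_rowmeans)    # TRUE
```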

Results

Comparing the means of the sampling distribution of means and the theoretical distribution

The sample mean is an unbiased estimator of the population mean, so the sampling distribution of the mean is centred on the population mean of the exponential distribution, which in this case is 1/0.2 = 5.

The mean of the simulated sampling distribution is:

Xbar <- mean(drawnmeans$x) 
Xbar
## [1] 5.001363

which is very close to the theoretical value of 5.

Below is a density plot of the distribution of means, with the mean of the distribution indicated:

ggplot(drawnmeans, aes(x = x)) + 
geom_density() +
geom_vline(xintercept=Xbar, size = 1, color = 'red') 

Figure 2. Sampling distribution of means of exponential iid random variables with rate lambda = 0.2 and sample size of 40. The red line marks the mean of the simulated distribution; its theoretical value is 5.

Comparing the variances of the sampling distribution of the means and the theoretical distribution

For independent draws, the standard deviation of the sampling distribution of the mean is the standard deviation of the original population divided by the square root of the sample size: SD(Xbar) = sigma/sqrt(n) = 5/sqrt(40) = 0.791 to 3 significant figures. The corresponding theoretical variance is Var(Xbar) = sigma^2/n = 25/40 = 0.625.

The variance of the simulated sample means is:

Xvar <- var(drawnmeans$x) 
Xvar
## [1] 0.6302084

which is indeed close to the theoretical variance of 0.625.
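To put a number on "close", the following sketch computes the relative error of the simulated mean and variance against their theoretical values. The simulated values are copied from the output shown above; the variable names are illustrative.

```r
# Simulated values as reported in the output above
sim_mean <- 5.001363   # mean of the simulated sample means
sim_var  <- 0.6302084  # variance of the simulated sample means

# Theoretical values
theo_mean <- 5         # 1/lambda
theo_var  <- 25 / 40   # sigma^2 / n = 0.625

rel_err_mean <- abs(sim_mean - theo_mean) / theo_mean
rel_err_var  <- abs(sim_var  - theo_var)  / theo_var

# Relative errors in percent: about 0.03 for the mean and 0.83 for the variance
round(100 * c(mean = rel_err_mean, var = rel_err_var), 2)
```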

How well has the CLT performed?

In order to compare the sampling distribution of means to the approximate distribution predicted by the Central Limit Theorem (CLT), we present the simulated distribution as a histogram with the relevant normal distribution curve.

mu    <- 1/lambda  # theoretical mean of the sample means
sigma <- 1/lambda  # population standard deviation
ggplot(drawnmeans, aes(x = x)) + 
  geom_histogram(aes(y = after_stat(density)), alpha = .10, binwidth = 0.1, colour = "black") +
  stat_function(geom = "line", fun = dnorm, args = list(mean = mu, sd = sigma/sqrt(ssize)), size = 2, colour = "red")

Figure 3. Sampling distribution of means for samples of 40 iid exponential random variables with lambda = 0.2. The red curve is the normal density with the theoretical mean 1/lambda and standard deviation sigma/sqrt(40).

The normal distribution appears to be a good qualitative match for the simulated distribution. To check this further, we also present a Q-Q plot of the standardized sample means:

par(mar=c(4,4,0,0)+0.1,mgp=c(2,1,0))
s.var <- (drawnmeans$x - mu) / (sigma / sqrt(ssize))  # standardize the sample means: subtract the mean, divide by the SD
qqnorm(s.var, main = NULL)
qqline(s.var)

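Figure 4. Q-Q plot of the standardized sampling distribution of the means of iid exponential random variables. The reference line is the one drawn by qqline(), which passes through the first and third quartiles of the sample and of the standard normal distribution.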
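The comparison a Q-Q plot makes can also be tabulated. The sketch below is a self-contained illustration rather than a reproduction of the figure: it uses an arbitrary seed (not the one used above), simulates a fresh set of 1000 sample means, and compares a few empirical quantiles of the standardized means with the corresponding standard normal quantiles. Since the mean of 40 exponential draws is still slightly right-skewed, small differences in the tails would not be surprising.

```r
set.seed(1)      # arbitrary seed for this sketch, not the one used in the report
lambda <- 0.2    # rate parameter, as in the text
ssize  <- 40     # sample size, as in the text
nsim   <- 1000   # number of simulated samples, as in the text

# Simulate nsim sample means and standardize them with the theoretical mean and SD
means <- rowMeans(matrix(rexp(nsim * ssize, lambda), nsim))
z     <- (means - 1 / lambda) / ((1 / lambda) / sqrt(ssize))

# Compare empirical quantiles of z with standard normal quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
comparison <- data.frame(
  prob        = probs,
  empirical   = as.numeric(quantile(z, probs)),
  theoretical = qnorm(probs)
)
comparison$difference <- comparison$empirical - comparison$theoretical
comparison
```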

Finally, it is worth repeating this entire process many times and examining how the mean and variance of the sampling distribution vary over 500 repetitions of the complete simulation:

set.seed(4512)

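lambda <- 0.2  # exponential rate parameter
nsim <- 1000   # number of samples (each of size ssize = 40) per repetition
draws  <- data.frame( x = rexp(nsim * ssize, lambda))  # not used below; kept so the random number stream is unchanged

nrep <- 500                 # number of repetitions of the whole simulation
dist.mean <- numeric(nrep)  # mean of the sample means, one value per repetition
dist.var  <- numeric(nrep)  # variance of the sample means, one value per repetition
for (i in seq_len(nrep)) {
  x <- apply(matrix(rexp(nsim * ssize, lambda), nsim), 1, mean)
  dist.mean[i] <- mean(x)
  dist.var[i]  <- var(x)
}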

data2plot <- data.frame(dist.mean,dist.var)

plot1 <- ggplot(data2plot, aes(dist.mean)) + geom_density() +
  geom_vline(xintercept=mean(dist.mean), size = 1, color = 'red') 

plot2 <- ggplot(data2plot, aes(dist.var)) + geom_density() +
  geom_vline(xintercept=mean(dist.var), size = 1, color = 'red')

grid.arrange(plot1, plot2, ncol=2)

Figure 5. Density plots of the sampling-distribution means (left) and variances (right) over 500 repetitions of the entire simulation, each consisting of 1000 samples of size 40 from an exponential distribution with mean and standard deviation 1/0.2 = 5. The red lines mark the averages over the repetitions.

The averages of both of these distributions match the theoretical parameter values to 3 significant figures.

signif(mean(dist.mean),3)
## [1] 5
signif(mean(dist.var),3)
## [1] 0.625
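The repetition step can also be written more compactly by wrapping one full simulation in a function and calling it with replicate(). This is a sketch under stated assumptions, not the code used to produce the results above: the function name simulate_once is hypothetical, the seed is arbitrary, and only 50 repetitions are run to keep it quick.

```r
# One full simulation: nsim samples of size ssize, returning the mean and
# variance of the resulting sample means. The function name is illustrative.
simulate_once <- function(nsim, ssize, lambda) {
  means <- rowMeans(matrix(rexp(nsim * ssize, lambda), nsim))
  c(mean = mean(means), var = var(means))
}

set.seed(1)  # arbitrary seed for this sketch
reps <- replicate(50, simulate_once(nsim = 1000, ssize = 40, lambda = 0.2))

dim(reps)        # 2 rows (mean, var) by 50 columns (one per repetition)
rowMeans(reps)   # averages over repetitions; expected to be near 5 and 0.625
```

Returning a named vector from the function means replicate() simplifies the results to a matrix with one row per statistic, so the averages come from a single rowMeans() call.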

Summary

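We can conclude that, for lambda = 0.2 and samples of size 40, the CLT describes the simulated sampling distribution of the mean very well: the simulated mean (5.001) and variance (0.630) are close to the theoretical values of 5 and 0.625, and the histogram and Q-Q plot are consistent with a normal distribution.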