The exponential distribution and the Central Limit Theorem (CLT)

Synopsis

In this document, we want to illustrate the Central Limit Theorem via the exponential distribution. Using R, the exponential distribution can be simualated with the function rexp(n, lambda), where n is the number randomly generated numbers and lambda is the rate parameter. The mean of an exponential distribution is 1/lambda, and, most interestingly, the standard deviation is 1/lambda, as well. In this essay, we set lambda = 0.2. There are three topics that shall be discussed:

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Preliminaries

First, we load the packages ggplot2 and gridExtra that will help us to illustrate our points made visually.

library(ggplot2)
library(gridExtra)

Next, we create an object and assign the value 0 to it. We will use this object afterwards.

mns <- NULL

Now, using a for loop, we create 1000 samples of size 40 (lambda = 0.2) and collect the resulting 1000 means in the object mns; then we change the object into a data frame.

for(i in 1:1000) {
    mns = c(mns, mean(rexp(40, 0.2)))
}
mns <- data.frame(mns)

With lambda = 0.2, we know about the Theoretical Mean and the Theoretical Standard Error(and consequently, the Variance). The empirical values can be derived from taking the mean of the first (and only) column of the data frame mns. The Standard Error in this case is the standard deviation and, hence, the variance is the standard error squared.

lambda <- 0.2

Theoretical_Mean <- 1/lambda 
Theoretical_SError <- (1/lambda)*(1/sqrt(40)) 
Theoretical_Variance <- Theoretical_SError^2

Empirical_Mean <- mean(mns[,1])
Empirical_SError <- sd(mns[,1])
Empirical_Variance <- Empirical_SError^2

1.Show the sample mean and compare it to the theoretical mean of the distribution.

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

After defining the parameters from above, we can use ggplot2 in order to illustrate, how far off the empirical numbers deviate from the actual theoretical numbers. We will use the data.frame mns, apply a density histogram as geom to it, label the axes and the overall plot and mark both the theoretical and empirical means and variances. The graphical depiction of both the means and the distributions shows that the sampling did a pretty good job: the empirical mean is very close to the theoretical mean and both curves overlap each other to a large extent.

f <- ggplot(mns, aes(mns))
f <- f + geom_histogram(binwidth = lambda, fill = "blue", color = "red", aes(y = ..density..))
f <- f + labs(title="Plot of 1000 Exponential Distributions of size 40", x="Averages of 1000 Distributions of size 40", y="Density")
f <- f + geom_vline(xintercept=Empirical_Mean, color="black", size = 3.0)
f <- f + stat_function(fun=dnorm,args=list(mean=Empirical_Mean, sd=Empirical_SError),color = "black", size = 5.0)  
f <- f + geom_vline(xintercept=Theoretical_Mean,size=1.0, color="green", size = 3.0) 
f <- f + stat_function(fun=dnorm,args=list(mean=Theoretical_Mean, sd=Theoretical_SError),color = "green", size = 1.0)
print(f)

In addition, we calculate two confidence intervals, one for the empirical mean and one for the theoretical mean. Comparing the two intervals supports the claims we have just made as regards the fit between the theoretical and the empirical distribution.

Empirical_Mean + c(-1, 1)*1*Empirical_SError

## [1] 4.218817 5.757710

Theoretical_Mean + c(-1, 1)*1*Theoretical_SError

## [1] 4.209431 5.790569

3. Show that the distribution is approximately normal.

The following plot illustrates the difference of what we just calculated to just taking the mean of 1000 random exponential distributions.

s <- rexp(1000, 0.2)

s_mean <- mean(s)
s_std <- sd(s)

s <- data.frame(s)

g <- ggplot(s, aes(s))
g <- g + geom_histogram(binwidth = 0.2, fill = "blue", color = "red", aes(y = ..density..))
g <- g + labs(title="Plot of 1000 random exponentials", x="Distribution of 1000 random exponentials", y="Density")
g <- g + geom_vline(xintercept=s_mean, color="black", size = 1.0)
g <- g + geom_density(size=1.0)
grid.arrange(f, g)

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Here we can see, the effect of the Law of Large Numbers: as a sample size grows, its mean will get closer and closer to the average of the whole population (in our case this is 1/0.2 = 5). However, the distribution is quite skewed (as we would expect from exponentials) and far from being normally distributed. The plot on the left side, which is the plot from above is almost normally distributed.