Overview

In this project, I investigate the exponential distribution in R and compare it with the Central Limit Theorem (CLT). The CLT states that the distribution of averages of independent and identically distributed (IID) variables becomes that of a standard normal distribution as sample size increases. I demonstrate that the CLT appears to hold for an ensemble of averages for the exponential function, which is a function that is not normal.

Simulations

In standard mathematical notation, the exponential function is expressed as \(f(x) = \lambda e^{-\lambda x}\) for \(x \ge 0\), where \(\lambda\) is the rate parameter. The exponential distribution can be simulated in R with with rexp(n, \(\lambda\)). The mean of the exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\). I set \(\lambda = 0.2\) for all of the simulations, and investigate the distribution of averages of 40 exponentials.

As instructed in the assignment, we do 1000 simulations using rexp(40, rate = 0.2). We plot a histogram and assign the probability densities to cltExpo.

set.seed(3000)
mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(40,0.2)))
cltExpo <- hist(mns, prob=TRUE, plot=FALSE, breaks = 20)

Sample mean versus theoretical mean

I compare the mean of the sample mns, mean(mns), with the calculated mean, \(1/\lambda = 1/(0.2) = 5\):

mn <- mean(mns)
mn
## [1] 5.014507

which is not far off from the theoretical value of 5.

Sample variance versus theoretical variance

Next I compare the variance of the sample mns, variance(mns), with the theoretical variance, \(\sigma ^{2}/n\). Because \(\sigma\) is the standard deviation, which is \(1/\lambda = 1/(0.2) = 5\), and \(n = 40\), then the theoretical variance is \(5^{2}/40 = 0.625\). Here is what we find for the variance from our simulation:

variance <- var(mns)
variance
## [1] 0.6015614

which is again not far from from the theoretical value of 0.625.

Normal distribution and confidence intervals

Now we want to show that the distribution of the means is approximately normal. The plot of the 1000 means above looks mounded about the mean value of 5 approximately symmetrically, but we need to show that this is the case more quantitatively. Let’s take the standard deviation from the set of means that we generated, by using sd(mns). Then we can generate a normal distribution that will closely “fit” the histogram of the averaged means (mns), albeit with some deviations, as visual inspection of the plot shows.

stdevMns <- sd(mns)
## Generate a normal distribution, but do not plot it
normCurve <- rnorm(1000, mean = 5, stdevMns)
randNormal <- hist(normCurve, plot=FALSE, breaks=40)
## Plot our distribution of means of the exponential function
cltExpo <- hist(mns, prob=TRUE, plot=TRUE, breaks = 20)
## And overplot the curve for the normal distribution
curve(dnorm(x, mean(normCurve), sd(normCurve)), add=TRUE, col="darkblue", lwd=2)

Let’s show more quantitatively that this distribution is approximately normal. For a normal distribution, we should get about 68% of the probability density that falls within one standard deviation on either side of the mean value of 5, and it should be symmetrically distributed. So, we should see a probability density of approximately .340 in the band marked by one standard deviation below 5, and .340 in the band up to one standard deviation above 5:

cltSd1indUpper <- which(mns > 5  & mns < 5 + stdevMns)
length(cltSd1indUpper)/length(mns)
## [1] 0.338
cltSd1indLower <- which(mns < 5  & mns > 5 - stdevMns)
length(cltSd1indLower)/length(mns)
## [1] 0.344

If we look in the band of two standard deviations on either side of the mean, then each should contain a probability density around 0.475 if it is like a normal distribution:

cltSd2indUpper <- which(mns > 5 & mns < 5 + 2 * stdevMns)
length(cltSd2indUpper)/length(mns)
## [1] 0.467
cltSd2indLower <- which(mns < 5 & mns > 5 - 2 * stdevMns)
length(cltSd2indLower)/length(mns)
## [1] 0.491

Finally, if we look in the band of three standard deviations on either side, then each should contain a probability density around 0.495 in order to approximate a normal distribution:

cltSd3indUpper <- which(mns > 5 & mns < 5 + 3 * stdevMns)
length(cltSd3indUpper)/length(mns)
## [1] 0.493
cltSd3indLower <- which(mns < 5 & mns > 5 - 3 * stdevMns)
length(cltSd3indLower)/length(mns)
## [1] 0.503

By contrast, the plot of the random exponential distribution with the same normal curve overlay as we used above shows that that not only is the ensemble of random exponentials not close to a normal distribution, it is not even symmetrical about the mean!

hist(rexp(1000,0.2), prob=TRUE, breaks = 40)
curve(dnorm(x, mean(normCurve), sd(normCurve)), add=TRUE, col="darkblue", lwd=2)

So, these simulations have shown how means of non-normal functions tend towards a normal distribution.