Central Limit Theorem for the Exponential Distribution

Overview

In this project I will investigate the distribution of the mean of exponentials in R and compare it with the Central Limit Theorem in the following aspects:

Compare the sample mean to the theoretical mean of the distribution.
Compare the sample variance of the mean to its theoretical variance.
Show that the distribution of the mean is approximately normal.

The Central Limit Theorem

The Central Limit Theorem (CLT) states that the distribution of averages of independent and identically distributed (iid) variables, properly normalized, becomes that of a standard normal as the sample size increases.
The result is that \[\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}= \frac{\sqrt n (\bar X_n - \mu)}{\sigma} = \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}\] has a distribution like that of a standard normal for large \(n\).

Simulations

In probability theory and statistics, the exponential distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate.

The probability density function is \[ f(x; \lambda) = \left\{\begin{array}{ll} \lambda e ^ {-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{array} \right. \]

The mean of an exponentially distributed random variable \(X\) with the rate parameter \(\lambda\) is given by \[ E[X] = \frac{1}{\lambda} \]

The variance of \(X\) is given by \[ VAR[X] = \frac{1}{\lambda^2} \]

so the standard deviation is also \(\frac{1}{\lambda}\).
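
As a quick sanity check of these formulas, the sketch below draws one large exponential sample and compares its mean and standard deviation with \(\frac{1}{\lambda}\). The seed, the sample size of 100,000 and the variable names are illustrative choices, not part of the analysis that follows.

```r
# Illustrative check: sample moments of an exponential versus 1 / lambda.
set.seed(1)                        # arbitrary seed, only for reproducibility
lambda <- 0.2
x <- rexp(1e5, rate = lambda)      # 100,000 draws (assumed size, large on purpose)

theoretical <- 1 / lambda          # the mean and the sd both equal 1 / lambda
c(mean = mean(x), sd = sd(x), theoretical = theoretical)
```

Both sample values should land close to 5, up to simulation noise.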

Set \(\lambda\) = 0.2 for all of the simulations.

The following plot shows the distribution of 1000 simulated means, each computed from 40 exponentials with \(\lambda\) = 0.2.

set.seed(643)
nosim <- 1000
n <- 40
lambda <- 0.2
mns = NULL
for (i in 1 : nosim) mns = c(mns, mean(rexp(n, lambda)))
hist(mns, main="1000 Simulations of the mean of 40 exponentials", density=10, xlab="Sample mean (lambda = 0.2)", breaks=20)

Sample Mean versus Theoretical Mean

The mean of exponential distribution with \(\lambda\) = 0.2 is \[ E[X] = \frac{1}{\lambda} = \frac{1}{0.2} = 5 \]

The sample mean is

mean(mns)

## [1] 4.991655
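
To judge whether a gap of this size is plausible, one can compare it with the standard error of the average of all 1000 simulated means, which is \(\frac{1/\lambda}{\sqrt{n \cdot nosim}}\). The sketch below does that arithmetic; `reported_mean` is copied from the output above, and the other names are illustrative.

```r
# How far is the reported sample mean from 1 / lambda, in standard errors?
lambda <- 0.2
n <- 40            # exponentials per simulated mean
nosim <- 1000      # number of simulated means
reported_mean <- 4.991655

# Standard error of the average of nosim means of n exponentials: 5 / 200 = 0.025
se_grand_mean <- (1 / lambda) / sqrt(n * nosim)

gap_in_se <- abs(reported_mean - 1 / lambda) / se_grand_mean
gap_in_se          # about 0.33 standard errors
```

A gap of about a third of a standard error is well within ordinary simulation noise.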

In the following plot the blue line is the theoretical mean and the red dashed line is the sample mean from the simulation. The two lines almost overlap.

library(ggplot2)
g <- ggplot() + aes(mns) + geom_histogram(binwidth = .2, colour = "darkgreen", fill = "white" ) 
g <- g + scale_x_continuous(breaks = 2:8)
g <- g + geom_vline(xintercept = 5, colour = "blue", size = 1) 
g <- g + geom_vline(xintercept = mean(mns), colour = "red", size = 1, linetype = "longdash")  
g <- g + labs(x = "Mean", y = "Count", title = "1000 Simulations of the mean of 40 exponentials")
g

Sample Variance versus Theoretical Variance

The expected variance of \(X\) is given by \[ VAR[X] = \frac{1}{\lambda ^ 2} = \frac{1}{0.2^2} = 25 \]

The standard error of the mean with a sample size of 40 is \[ SE = \sqrt{\frac{VAR[X]}{n}} = \sqrt{\frac{25}{40}} \approx 0.791 \]

The theoretical variance of the sample mean is \[ SE ^ 2 = \frac{VAR[X]}{n} = \frac{25}{40} = 0.625 \]

The sample variance of the mean is

var(mns)

## [1] 0.6237204

The sample variance of the mean is very close to the theoretical variance.
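
The arithmetic behind this comparison can be reproduced in a few lines. In this sketch `reported_var` is copied from the output above; the remaining names are illustrative.

```r
# Theoretical quantities for the mean of n exponentials with rate lambda.
lambda <- 0.2
n <- 40
var_x <- 1 / lambda^2          # variance of a single exponential: 25
se_mean <- sqrt(var_x / n)     # standard error of the mean: about 0.791
var_mean <- var_x / n          # theoretical variance of the mean: 0.625

# Value printed by var(mns) in the report above.
reported_var <- 0.6237204
rel_error <- abs(reported_var - var_mean) / var_mean   # about 0.002, i.e. 0.2%
c(se_mean = se_mean, var_mean = var_mean, rel_error = rel_error)
```

The simulated variance differs from the theoretical value by roughly 0.2%.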

Distribution versus normal distribution

The following plot compares the normal density (blue line) with the kernel density estimate of the simulated means (red line). The approximation is not exact; the next section examines whether it improves with a larger sample size.

library(ggplot2)
g <- ggplot() + aes(mns) + geom_histogram(aes(y =..density..), binwidth=.2, colour = "darkgreen", fill = "white" ) 
g <- g + scale_x_continuous(breaks = 2:8)
# Overlay the normal density implied by the CLT: mean 1/lambda = 5, sd sqrt(25/40)
g <- g + stat_function(fun = dnorm, colour = "blue", args = list(mean = 5, sd = sqrt(25 / 40))) + geom_density(colour = "red")
g <- g + labs(x = "Mean", y = "Density", title = "1000 Simulations of the mean of 40 exponentials")
g
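
A normal Q-Q plot is another common way to look at the same question. The sketch below is an illustrative addition: it regenerates 1000 means of 40 exponentials with `replicate()` and uses base R's `qqnorm()` and `qqline()`. Points that follow the reference line indicate approximate normality, while curvature in the tails points to the remaining skewness.

```r
# Illustrative normality check with a Q-Q plot (base R graphics).
set.seed(643)
lambda <- 0.2
mns <- replicate(1000, mean(rexp(40, rate = lambda)))   # 1000 means of 40 draws

qqnorm(mns, main = "Normal Q-Q plot of 1000 simulated means")
qqline(mns, col = "blue")   # reference line through the first and third quartiles
```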

Exponential distribution CLT

Let \(X_i\) be the \(i^{th}\) draw from the exponential distribution with rate \(\lambda\).
The sample mean, \(\bar X_n\), is the average of \(n\) such draws.
\(E[X_i] = \frac{1}{\lambda}\) and \(Var(X_i) = \frac{1}{\lambda^2}\).
The standard error of the mean is \(\sqrt{\frac{1}{\lambda^2 n}} = \frac{1}{\lambda \sqrt{n}}\).
Then \[ \frac{\bar X_n - 1/\lambda}{1 / (\lambda \sqrt{n})} = \left(\bar X_n - \frac{1}{\lambda}\right) \lambda \sqrt{n} \] will be approximately standard normal for large \(n\).
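
A minimal sketch of this standardization in R, taking \(\frac{1}{\lambda}\) as the mean and \(\frac{1}{\lambda \sqrt{n}}\) as the standard error of the mean. The helper name `standardize_mean` and the seed are hypothetical choices made for illustration.

```r
# Standardize the mean of an exponential sample: (mean - mu) / (sigma / sqrt(n)).
standardize_mean <- function(x, lambda) {
  n <- length(x)
  mu <- 1 / lambda                  # theoretical mean
  se <- 1 / (lambda * sqrt(n))      # theoretical standard error of the mean
  (mean(x) - mu) / se
}

set.seed(2)                         # arbitrary seed
z <- replicate(1000, standardize_mean(rexp(40, rate = 0.2), lambda = 0.2))
c(mean = mean(z), sd = sd(z))       # should be near 0 and 1
```

If the normal approximation is reasonable, the standardized values have mean close to 0 and standard deviation close to 1.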

Simulation results, \(n = 40, 80, 120\)

The following plot compares the distribution of the standardized mean for sample sizes of 40, 80 and 120. The distribution approaches that of a standard normal as the sample size increases.

set.seed(643)
nosim <- 1000
lambda <- 0.2
cfunc <- function(x, n)  (mean(x) - 5) * lambda * sqrt(n)
dat <- data.frame(
  x = c(apply(matrix(rexp(nosim * 40, lambda), nosim), 1, cfunc, 40),
        apply(matrix(rexp(nosim * 80, lambda), nosim), 1, cfunc, 80),
        apply(matrix(rexp(nosim * 120, lambda), nosim), 1, cfunc, 120)
        ),
  size = factor(rep(c(40, 80, 120), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..)) + scale_fill_brewer(palette="Spectral")
g <- g + stat_function(fun = dnorm, size = 2)
g + facet_grid(. ~ size)
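
The visual impression can be backed by a number. The mean of \(n\) iid exponentials has skewness \(\frac{2}{\sqrt{n}}\), which shrinks toward the normal value of 0 as \(n\) grows. The sketch below is an illustrative addition: `skewness` is a small hypothetical helper (not from a package), and the seed and the 5000 replications are assumptions.

```r
# Illustrative check: skewness of simulated means versus 2 / sqrt(n).
set.seed(3)                         # arbitrary seed
lambda <- 0.2
sizes <- c(40, 80, 120)

# Simple moment-based sample skewness (hypothetical helper).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

observed <- sapply(sizes, function(n) {
  means <- replicate(5000, mean(rexp(n, rate = lambda)))
  skewness(means)
})

data.frame(n = sizes, observed = observed, theoretical = 2 / sqrt(sizes))
```

The observed values should track the theoretical column and decrease with the sample size, up to simulation noise.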

Summary

In this report I investigated the distribution of averages of 40 exponentials. The results show that the sample mean closely matches the theoretical mean, and that the sample variance of the mean closely matches its theoretical variance. The distribution of the sample mean is well approximated by the normal distribution, and the approximation improves as the sample size increases.