Exponential Distribution and the Centra Limit Theorem (A Simulation Exercise)

The aim of this project is to show the power of the Central Limit Theorem, one of the most important concept and central tenet in Statistics. Normally , we all assume that the theoretical distribution of the population that we sample from is normal or approximately so. I that respect we further assume that the mean of that sample will also be normally distributed. The power of the Central Limit Theorem is that even if theoretical distribution is not normal we can still assume that the sample mean is normally distributed. That will be shown by simulating from the exponential distribution, which is very different from the normal as can be seen from Figure 3. The main mass is to the right of 0 and the vicinity, is heavily skewed to the right, and mainly is not “bell” shaped. The exponential density function is:
\[ f(X;\lambda)=\lambda e^{-\lambda x}, \] where the mean of the distribution is \(\frac{1}{\lambda}\), the variance is \(\frac{1}{\lambda^2}\), and the standard deviation is again \(\frac{1}{\lambda}\). In this report \(\lambda\) will be set equal to 0.2 , therefore the theoretical mean will be 1/.2 = 5 or \(\mu = 5\) and variance will be \(\sigma^2 = 1/.2^2 = 25\), finally the standard deviation is \(\sigma = 1/.2 = 5\). The simulation experiments will be simple - sample from the exponential distribution, calculate the mean and variance. Then repeat this 1000 times (in this report the number of simulations will be set equal to 1000), whereby each iteration will record the mean and the variance. Plot the histogram of the 1000 results, and make conclusions. Finally, compare the distribution of the simulation of the 1000 exponentially distributed numbers with that of the distribution of the 1000 means of samples of size 40.

Comparison Sample Mean with the Teoretical Mean of the distribution

According to the Law of Large Numbers the sample mean in the limit should converge to the mean of the population. We can see here that even with one sample of 40 exponentials the sample mean is pretty close to the theoretical \(\mu = 5\).

nosim = 1000; lambda = .2; n = 40
set.seed(0); mean(rexp(n,lambda))

## [1] 4.776259

In order to investigate this relation we can look at the distribution of the 1000 of means of samples of different szes - i.e. n = 10,40,90. The histograms of the simulation can be seen in Figure 1 in the appendix, with the accompanying code. It can clearly be seen that the mean of the distributions is cetered at the theoretical mean (marked by the vertical lines). Also, it can e seen that as n increases the the results are more and more closer to the theoretical mean.

Comparison of the variability of the sample (via variance) with the theoretical variance of the population

The variance of the sample mean is centered at the variance of the population from which we are sampling. The variance of the sample mean is itself a random variable and the nice property is that it estimates the variance of the population. We take a single sample here of 40 exponentials and calculate the \(\sigma^2\) and \(\sigma\).

set.seed(0); var(rexp(n,lambda))

## [1] 23.90902

set.seed(0); sd(rexp(n,lambda))

## [1] 4.889685

As can be seen, the resuls are pretty close to the theoretical parameters. In order to investigate this further we will simulate 1000 samples of n=10,40,90 and measure their variances. Then plot the histograms of the different cases is shown in Figure 2. It can be concluded that sample vaiances are even closer to the theoretical ones then that of the means, and of course as the sample size increases the relation gets even stronger. So sample variance estimates pretty good the population variance.

Emphasizing the normality of the distribution of the mean of samples of the exponential distribution

The sample mean is itself a random variable which has a mean equal to the mean of the population, but the variance of that distribution is in fact \(\frac{\sigma^2}{n}\) or the standard deviation is \(\frac{\sigma}{\sqrt n}\). With our parameters that theoretical variance would be 25/40 = 0.625. It could be observed that this variance is much samller than the variance of the population. And we can take one sample:

set.seed(0); var(rexp(n,lambda))/n

## [1] 0.5977255

Again - pretty close. Or we can simulate 1000 n = 40 samples and calculate the variance of that 1000 means:

means = NULL; set.seed(0)
for (i in 1 : nosim) {means = c(means, mean(rexp(n,lambda)))}
var(means)

## [1] 0.6181582

Of course the number now is much closer to the theoretical value. In order to further investigate this relation we can plot the histogram of the means and compare it to the distribution of the population from which we are sampling. That is illustrated in Figure 3. It clearly shows that both distributions are centered at the population mean \(\mu = 5\), as indicated by the red horisontal line. Also, the distribution of the mean of samples is clearly normally shaped, whereas the population distribution is very different from normal. On the other hand also can be observed that the exponential distribution is much more spread out than that of the means of samples.

Appendix

Figure 1

Distribution of the means of 1000 samples of size 10, 40 and 90 from the exponential distribution.

Figure 2

Distribution of the variances of 1000 samples of size 10, 40 and 90 from the exponential distribution.

Figure 3

Histogram of 1000 draws from exponential distribution(left) and histogram of the means of 1000 of 40 exponentials(right).

Code Figure 1

library(ggplot2); set.seed(0)
dat <- data.frame(
    x = c(apply(matrix(rexp(nosim * (n-30), lambda), 
                       nosim), 1, mean),
          apply(matrix(rexp(nosim * n, lambda), 
                       nosim), 1, mean),
          apply(matrix(rexp(nosim * (n+50), lambda), 
                       nosim), 1, mean)
    ),
    size = factor(rep(c(10, 40, 90), rep(nosim, 3))))

g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, 
                                                           binwidth=.2, colour = "black") 
g <- g + geom_vline(xintercept = 1/lambda, size = 2) 
g + facet_grid(. ~ size) #+ annotate("text", x = 7, y = 40, label = "Some text")

Code Figure 2

set.seed(0)
dat <- data.frame(
    x = c(apply(matrix(rexp(nosim * (n-30), lambda), 
                       nosim), 1, var),
          apply(matrix(rexp(nosim * n, lambda), 
                       nosim), 1, var),
          apply(matrix(rexp(nosim * (n+50), lambda), 
                       nosim), 1, var)
    ),
    size = factor(rep(c(10, 40, 90), rep(nosim, 3))))

g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20, 
                                                           binwidth=7, colour = "black")
g <- g + geom_vline(xintercept = (1/lambda)^2, size = 2)
g + facet_grid(. ~ size)

Code Figure 3

set.seed(0)
dat <- data.frame(
    x = c(rexp(nosim, lambda),
          apply(matrix(rexp(nosim * n, lambda), 
                       nosim), 1, mean)
              ),
    simulation = factor(rep(c("exp dist", "sample dist"), rep(nosim, 2))))

g <- ggplot(dat, aes(x = x, fill = simulation)) + geom_histogram( alpha = .20, 
                                                            binwidth=.5, colour = "black")
g <- g + geom_vline(xintercept = 1/lambda, size = 1,color="red")
g + facet_grid(.~simulation) + theme_bw()

Acknowledgement/Reference

It must be acknowledged here that most of the code in this project is a modified version from prof. Brian Caffo’s lectures from the course “Statistical Inference” at Coursera.

Course Project Statistical Inference

Tycho Nikolov