Exponentially Distributed Data - The LLN and CLT

by epigenus

December 2014

Introduction

The exponential distribution is characterized by a rate parameter \(\lambda\). For a given \(\lambda\), the mean is \(\mu\) = 1/\(\lambda\) and the standard deviation is \(\sigma\) = 1/\(\lambda\). According to the Law of Large Numbers (LLN), the sample mean and sample standard deviation of simulated values should approach the theoretical mean and standard deviation of the underlying distribution as the number of samples \(\ n\) gets large (\(\ n \to \infty\)). According to the Central Limit Theorem (CLT), the distribution of averages of \(\ k\) sampled values approaches a normal distribution as \(\ k\) gets large.

To show the effects of the LLN and CLT, we will generate simulated, exponentially distributed data and analyze it in two different ways. For this simulation we will use \(\lambda\) = 0.2, \(\ n\) = 5000, and \(\ k\) = 2, 10, and 40. The results are stated numerically and shown graphically using Hadley Wickham’s ggplot2 library. Note: for reproducibility, we will set a specific random seed.

Basic requirements:

    require(ggplot2)
    set.seed(123456)
    n <- 5000
    lambda <- 0.2

The Law of Large Numbers (LLN)

First we simulate \(\ n\) = 5000 samples from an exponential distribution with \(\lambda\) = 0.2. We also store the sample mean and sample standard deviation.

    expdata <- rexp(n, lambda)
    expmean <- mean(expdata)
    expsd <- sd(expdata)

The density plot of our simulated data is shown below, with the sample mean marked in red. As we can see, the exponential distribution is highly skewed.

plot of chunk expgraph
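
The plotting code for this figure is not shown in the report. A minimal sketch of how such a plot could be produced is given below; the object name `expgraph`, the fill colour, and the axis labels are assumptions made for illustration, not the author's original chunk.

```r
library(ggplot2)

# Re-create the simulated data so this sketch runs on its own.
set.seed(123456)
n <- 5000
lambda <- 0.2
expdata <- rexp(n, lambda)
expmean <- mean(expdata)

# Density of the simulated values, with the sample mean as a red vertical line.
# Styling choices here are illustrative assumptions.
expgraph <- ggplot(data.frame(x = expdata), aes(x = x)) +
    geom_density(fill = "grey80") +
    geom_vline(xintercept = expmean, colour = "red") +
    labs(x = "Simulated value", y = "Density")
print(expgraph)
```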

We wish to compare our simulated data to the ideal exponential distribution. Using \(\lambda\) = 0.2, the population mean and standard deviation of an exponential distribution are \(\mu\) = 5 and \(\sigma\) = 5. The sample mean, \(\mu_s\), and the sample standard deviation, \(\ s\), for our simulated data are reported below.

    expmean

## [1] 5.05

    expsd

## [1] 4.99

As we expected from the Law of Large Numbers, the sample mean is close to the population mean (\(\mu_s \approx \mu\)), and the sample standard deviation is close to the population standard deviation (\(\ s \approx \sigma\)), for a large sample size (\(\ n\) = 5000).
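
The comparison can also be made explicit in code. The sketch below tabulates the population values 1/\(\lambda\) next to the sample statistics and their absolute differences; the data frame `comparison` and its column names are hypothetical names introduced here.

```r
# Re-create the simulated data so this sketch runs on its own.
set.seed(123456)
n <- 5000
lambda <- 0.2
expdata <- rexp(n, lambda)

# Theoretical mean and standard deviation of an exponential distribution.
mu <- 1 / lambda
sigma <- 1 / lambda

# Side-by-side comparison of population and sample statistics.
comparison <- data.frame(
    statistic = c("mean", "sd"),
    population = c(mu, sigma),
    sample = c(mean(expdata), sd(expdata))
)
comparison$abs_error <- abs(comparison$sample - comparison$population)
print(comparison)
```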

To show the effect of the LLN as a function of sample size, we calculate the cumulative mean of our data as the number of observations grows.

    expmeans <- cumsum(expdata)/(1:n)

The effect of the Law of Large Numbers on the sample mean is shown graphically below. The population mean is highlighted by the dashed red line.

plot of chunk cummeangraph
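
The code behind this figure is also not shown. One way it might be drawn with ggplot2 is sketched below, assuming a plain line plot of the cumulative mean with a dashed red reference line at 1/\(\lambda\); the name `cummeangraph` and the labels are illustrative.

```r
library(ggplot2)

# Re-create the simulated data and cumulative means so this sketch runs on its own.
set.seed(123456)
n <- 5000
lambda <- 0.2
expdata <- rexp(n, lambda)
expmeans <- cumsum(expdata) / (1:n)

# Cumulative mean against number of observations, with the population
# mean 1/lambda drawn as a dashed red horizontal line.
cummeangraph <- ggplot(data.frame(size = 1:n, mean = expmeans),
                       aes(x = size, y = mean)) +
    geom_line() +
    geom_hline(yintercept = 1 / lambda, colour = "red", linetype = "dashed") +
    labs(x = "Number of observations", y = "Cumulative mean")
print(cummeangraph)
```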

From this graph we can clearly see that for a small number of observations the sample mean deviates widely from the population mean. But, as the number of observations increases, the sample mean converges to the population mean.

Overall, the Law of Large Numbers implies that, for a large enough sample size, the sample mean and sample standard deviation closely approximate the mean and standard deviation of the population the data is sampled from.

The Central Limit Theorem (CLT)

According to the CLT, normalized averages of sampled data should approach a standard normal distribution as the number of values in each average gets large. To illustrate this we will sample data from an exponentially distributed population, as we did above, but we will take averages over \(\ k\) = 2, 10, and 40 samples, generating \(\ n\) = 5000 averages for each \(\ k\), and normalize these averages in the following way: \(\ y = \lambda \sqrt{k} (\bar{X} - 1/\lambda)\)

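    # Standardize the mean of k exponential values: (mean - mu) / (sigma / sqrt(k))
    normfunc <- function(x, k) lambda*sqrt(k)*(mean(x) - 1/lambda)
    # Each matrix has n rows of k values; each row yields one normalized average
    expavgdata <- data.frame(
        x = c(apply(matrix(rexp(n*2, lambda), n), 1, normfunc, 2),
              apply(matrix(rexp(n*10, lambda), n), 1, normfunc, 10),
              apply(matrix(rexp(n*40, lambda), n), 1, normfunc, 40)
              ),
        size = factor(rep(c(2, 10, 40), rep(n, 3))))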
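
The normalization is the usual standardization of a sample mean. A short derivation, writing \(k\) for the number of values in each average (the second argument passed to `normfunc`) and using \(\mu = \sigma = 1/\lambda\) for the exponential distribution:

```latex
\begin{aligned}
E[\bar{X}] &= \mu = \frac{1}{\lambda}, \qquad
\operatorname{sd}(\bar{X}) = \frac{\sigma}{\sqrt{k}} = \frac{1}{\lambda\sqrt{k}}, \\
y &= \frac{\bar{X} - E[\bar{X}]}{\operatorname{sd}(\bar{X})}
   = \frac{\bar{X} - 1/\lambda}{1/(\lambda\sqrt{k})}
   = \lambda\sqrt{k}\left(\bar{X} - \frac{1}{\lambda}\right).
\end{aligned}
```

By construction \(y\) has mean 0 and standard deviation 1 for every \(k\); what changes with \(k\) is how close its shape is to the normal curve.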

We plot the resulting data below. Each panel is labelled by the number of averaged data points and shows a histogram with the sample density curve. The standard normal density curve is overlaid in each panel for reference.

plot of chunk cltgraph
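
The chunk that draws this figure is not included in the report. The sketch below shows one possible ggplot2 construction: a density-scaled histogram, the sample density curve, and the standard normal curve in red, faceted by the number of averaged points. The names `sizes` and `cltgraph`, the bin count, and the colours are assumptions. The sketch uses `after_stat()`, which requires a recent ggplot2 (3.3.0 or later), and because it re-seeds and draws only these values, its numbers will not match the report's exactly.

```r
library(ggplot2)

set.seed(123456)
n <- 5000
lambda <- 0.2

# Standardize the mean of k exponential values.
normfunc <- function(x, k) lambda * sqrt(k) * (mean(x) - 1 / lambda)

# n normalized averages for each k, stacked into one data frame.
sizes <- c(2, 10, 40)
expavgdata <- data.frame(
    x = unlist(lapply(sizes, function(k)
        apply(matrix(rexp(n * k, lambda), n), 1, normfunc, k))),
    size = factor(rep(sizes, each = n))
)

# One panel per k: histogram on the density scale, sample density curve,
# and the standard normal density (red) for reference.
cltgraph <- ggplot(expavgdata, aes(x = x)) +
    geom_histogram(aes(y = after_stat(density)), bins = 40,
                   fill = "grey80", colour = "white") +
    geom_density() +
    stat_function(fun = dnorm, colour = "red") +
    facet_wrap(~ size) +
    labs(x = "Normalized average", y = "Density")
print(cltgraph)
```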

As we can clearly see, if we average over only 2 samples, the sample distribution is still highly skewed and similar to the underlying population distribution (compare the first panel to the plot of simulated data we produced previously). But as the number of averaged points increases, the sample density plot conforms to the standard normal curve. By the time the number of averaged points reaches 40, the data is nearly indistinguishable from the standard normal distribution. Specifically, the mean of the normalized data averaged over 40 samples is \(\mu_{avg}\) = 0.0087 and the sample standard deviation is \(\ s_{avg}\) = 0.9991, or very nearly standard normal.
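
Summary statistics like these can be computed per group with `tapply`. The sketch below regenerates the normalized averages on its own, so its values will be close to, but not identical to, the figures quoted above; `sizes` and `avgsummary` are hypothetical names.

```r
set.seed(123456)
n <- 5000
lambda <- 0.2

# Standardize the mean of k exponential values.
normfunc <- function(x, k) lambda * sqrt(k) * (mean(x) - 1 / lambda)

sizes <- c(2, 10, 40)
expavgdata <- data.frame(
    x = unlist(lapply(sizes, function(k)
        apply(matrix(rexp(n * k, lambda), n), 1, normfunc, k))),
    size = factor(rep(sizes, each = n))
)

# Mean and standard deviation of the normalized averages for each k.
avgsummary <- data.frame(
    size = levels(expavgdata$size),
    mean = as.numeric(tapply(expavgdata$x, expavgdata$size, mean)),
    sd = as.numeric(tapply(expavgdata$x, expavgdata$size, sd))
)
print(avgsummary)
```

Because of the normalization, the mean and standard deviation should sit near 0 and 1 for every group; the skewness visible in the plots is what separates small from large \(k\).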

This is as expected. The Central Limit Theorem implies that the normalized averages approach the standard normal distribution as the number of averaged samples grows, so their mean approaches 0 and their standard deviation approaches 1 (\(\mu_{avg} \approx 0\), \(\ s_{avg} \approx 1\)). The averages themselves, before normalization, center on the population mean \(\mu\). But unlike the individual samples considered under the Law of Large Numbers, their standard deviation does not approach the population standard deviation \(\sigma\). It is \(\sigma/\sqrt{k}\) instead, which is why the normalization rescales by \(\lambda\sqrt{k}\).
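
The spread of averages before any normalization can be checked directly. The sketch below, which is an illustrative addition and not part of the original analysis, compares the observed standard deviation of raw averages with \(\sigma/\sqrt{k}\) for each \(k\); the names `rawsd` and `spread` are hypothetical.

```r
set.seed(123456)
n <- 5000
lambda <- 0.2
sigma <- 1 / lambda
sizes <- c(2, 10, 40)

# Standard deviation of n raw (un-normalized) averages of k values each.
rawsd <- sapply(sizes, function(k)
    sd(rowMeans(matrix(rexp(n * k, lambda), n))))

# Observed spread next to the theoretical value sigma / sqrt(k).
spread <- data.frame(
    k = sizes,
    observed_sd = rawsd,
    sigma_over_sqrt_k = sigma / sqrt(sizes)
)
print(spread)
```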