Synopsis

The goal of this report is to investigate the exponential distribution in R and compare it with the Central Limit Theorem. To that effect it will investigate the distribution of 1000 averages of 40 exponential observations with the lambda parameter set to 0.2.

Overview of the exponential distribution

The exponential distribution PDF has one parameter: lambda. In this report it will be set to 0.2 and hold fixed.

Its theoretical mean and variance are 1/lambda or 5 in our case.

The standard error (SE) of the mean statistic is sigma/sqrt(n) or 5/sqrt(40) or 0.7905694. Its variance is SE^2 or 0.625

First we set some constants:

set.seed(12345)
rate <- 0.2
n <- 40
ns <- 1000
se <- sqrt(0.625)

Here is an example of exponential distribution with lambda=0.2 (1000 samples):

plot(density(rexp(ns, rate=rate)), main="Exponential distribution with lambda = 0.2",
     xlab="x")
abline(v=5, col="blue")
Figure 1: Exponential distribution with lambda = 0.2. We can see the theoretical mean at 5.

Figure 1: Exponential distribution with lambda = 0.2. We can see the theoretical mean at 5.

Data simulation

  1. We simulate 1000 samples of 40 observations each from an exponential distribution of rate 0.2 and compute the means
  2. We simulate 1000 samples of 40 sample means of 40 observations each from an exponential distribution of rate 0.2 and compute the variances
#First get the sample means
samples <- matrix(rexp(ns*n, rate=0.2), ncol=n)
sampleMeans <- apply(samples,1,mean)

#Now the variances
sampleMeansVars = rep(0,ns)
for (i in 1:ns) {
    samples <- matrix(rexp(n*n, rate=0.2), ncol=n)
     var <- var(apply(samples,1,mean))
     sampleMeansVars[i] <- var
}

Analysis

Comparison of the theoretical mean and sample mean

The theoretical mean is 1/lambda, so in our case 5.

The sample mean of one sample of size 1000 is 4.956, which is close to the theoretical mean of 5:

sampleMean <- mean(matrix(samples, ncol=1)[1:1000])
cat(sampleMean)
## 4.956496

The average of multiple sample means of sample size 40 is 4.977, which is even closer to the theoretical mean of 5. So the mean statistic distribution is centered at 5:

sampleMean <- mean(sampleMeans)
cat(sampleMean)
## 4.97704
means <- cumsum(sampleMeans)/(1:ns)
ggplot(data.frame(x=1:ns, y=means), aes(x=x, y=y)) +
       geom_hline(yintercept = 5) + geom_line() +
       labs(x = "Number of obs", y = "Cumulative mean") +
       ggtitle("Convergence of the mean of samples of exponentials towards 5") +
       theme_bw()
Figure 2: Convergence of the empirical mean to the theoretical mean

Figure 2: Convergence of the empirical mean to the theoretical mean

Comparison of the theoretical variance (SE) of the sample means to the empirical variance

The theoretical variance of the mean estimate is the SE^2, which is (sigma/n)^2 or 25/40 = 0.625

The variance of one sample of 1000 sample means is 0.601, which is close the the theoretical variance of 0.625:

sampleMeanVariance <- var(sampleMeans)
cat(sampleMeanVariance)
## 0.6010114

If we increase the number of simulations to 1000, this difference will be reduced even if the sample size is smaller (40).

sampleMeanVariance <- mean(sampleMeansVars)
cat(sampleMeanVariance)
## 0.6271657

So the empirical estimate of the SE of 1000 sets of 40 sample means is 0.627, which is closer the the theoretical variance of 0.625.

vars <- cumsum(sampleMeansVars)/(1:ns)
ggplot(data.frame(x=1:ns, y=vars), aes(x=x, y=y)) +
    geom_hline(yintercept = se^2) + geom_line() +
    labs(x = "Number of obs", y = "Cumulative mean") +
    ggtitle("Convergence of the variance of mean of samples of size 40") +
    theme_bw()
Figure 2: Convergence of the empirical variance to the theoretical variance

Figure 2: Convergence of the empirical variance to the theoretical variance

Normality of the mean statistic

The Central Limit Theorem (CLT) states that with a large enough set of samples, the distribution of the mean is normal. In addition, if the sample size is big enough, the distribution will have a small variance.

Graphical overview

#First we plot the empirical mean distribution
hist(sampleMeans,prob=TRUE, main="Histogram of mean statistics", 
     xlab="Sample means", ylim = c(0,0.6))

#Then we plot the theoretical statistic distribution
x <- seq(0,10, length=ns)
lines(x, dnorm(x,5,sqrt(se)),col="blue")
Figure 3: Normality of the mean statistic for the exponential distribution. In blue the theoretical mean statistic distribution.

Figure 3: Normality of the mean statistic for the exponential distribution. In blue the theoretical mean statistic distribution.

We can see that the empirical mean statistic distribution (boxplot) has a gaussian shape which is not the case of the exponential distribution (beginning of the document) that is skewed to the left.

The blue curve is the theoretical sample mean distribution: X’ ~ norm(5,sqrt(0.625))

Confidence intervals

Finally we can compare the two 95% confidence intervals.

First the theoretical:

5 + c(-1,1) * qnorm(.95) * se
## [1] 3.699629 6.300371

Then the empirical:

quantile(sampleMeans, prob=c(.05,0.95))
##       5%      95% 
## 3.761442 6.252184

Both are pretty close.