SI Part 1, Exponential Distribution

The goal is to simulate the exponential distribution and compare the simulation results to the theoretical (analytical) distribution. The sample size will be 40 (n = 40) and we will conduct 1,000 trials. The rate parameter is 0.2; i.e., \(\lambda\) = 0.2:

set.seed(46)
lambda <- 0.2 # rate param of exponential distribution
n <- 40 # "investigate the distribution of averages of 40 exponential(0.2)s."
rows <- 1000 # "you will need to do a thousand or so simulated averages of 40 exponentials."

An elegant property of the exponential distribution is that both its mean and standard devaition are \(1/\lambda\). We run the simulation:

# simulation of rexp() into 1000 rows of 40; sample.means is vector of 1,000 sample (n=40) means
# avg.sm is the average of the 1,000 sample.means
sim <- rexp(n*rows, lambda)
m <- matrix(sim, rows)
sample.means <- apply(m, 1, mean)
avg.sm <- mean(sample.means); sd.sm <- sd(sample.means)
# Theoretical moments
exp.mu <- 1/lambda; exp.sd <- 1/lambda
clt.sm.sd <- exp.sd/sqrt(n)

It is important to note we are focused on the distribution of sample means. The chart below superimposes a normal distribution (in red) over the empirical distribution (in blue) of sample means. And this shows the power of the central limit theorem! Despite a non-normal underlying distribution, the sample mean is approximately normal (this chart was the hardest part of the project for me …):

library(ggplot2)
qplot(sample.means, main="Distributions of sample means \n(BLUE is empirical, RED is normal)", 
      xlab="Sample means", ylab="Probability density (pdf)", geom = 'blank') +   
    geom_line(aes(y = ..density.., colour = 'Empirical', size= 1.5),stat = 'density') +  
    stat_function(fun = dnorm, aes(colour = 'Normal', size = 1.5), 
                  arg = list(mean=exp.mu, sd=clt.sm.sd)) +
    geom_histogram(aes(y = ..density..), binwidth = 0.1, alpha = 0.5, fill="blue4") +
    geom_vline(xintercept = 5, linetype = "dashed", color = "red2") +
    geom_vline(xintercept = avg.sm, color="blue2") +
    scale_color_manual(values=c("Empirical"="blue1", "Normal"="red2")) +
    theme(legend.position="none")

In order to “2. Show how variable it is and compare it to the theoretical variance of the distribution,” let’s compare the simulated distributional moments to their analytical equivalents:

The average of sampled means is 4.9769854 which compares to \(1/\lambda\) = 5
The standard error (i.e., the standard deviation) of the sampled mean is 0.7677428 which compares to \(\frac{{1}/{\lambda }\;}{\sqrt{40}}\) = 0.7905694

We are asked to “3. Show that the distribution is approximately normal.” A common visual technique to compare data/theoretical quantiles is a Q-Q plot (see [Wikipedia entry][http://en.wikipedia.org/wiki/Q%E2%80%93Q_plot]). Also, it only requires two lines of gglot code!

qqnorm(sample.means)
qqline(sample.means)

Conclusion

Although we ran 1,000 trials, it is worth noting the sample size is only 40. This shows the amazing magic of the CLT: the sampling distribution of the sample mean is approximately normal, which is remarkable (I think!) considering the underlying distribution is not even nearly normal. Cool beans.

SI Part 1, Exponential Distribution

David Harper

Saturday, November 22, 2014

Conclusion