OVERVIEW

This project explores, via simulation and comparison with theory, the asymptotic convergence to normality of a distribution of means of 40 random iid draws from an exponential distribution with rate \(\lambda = 0.2\).

The simulation consists of taking 1,000 means of 40 randomly generated exponentials. We explore the mean and variance of this distribution of means.

As expected from the Central Limit Theorem (CLT), the resulting distribution is approximately normal, despite its origins in the positively-skewed exponential distribution.

PART 1. THEORETICAL VS. SAMPLE MEANS

Theoretical Values

Since we expect, per the CLT, that the distribution of means will be normal:

  • the expected value of the mean of a distribution of means is \(E[\bar X] = \mu = 1/\lambda\)
  • the expected value of the variance of a distribution of means is \(Var(\bar X) = \sigma^2/n\)
  • the standard deviation of a distribution of means is therefore \(\sigma/\sqrt{n}\), and so the standard error of the mean is: \[SE_{mean} = \frac{1/\lambda}{\sqrt{n}}\]

Since we will be using the rate \(\lambda = 0.2\), the theoretical values in this particular simulation are:

  • mean of the mean: \[\mu = 1/0.2 = 5\]
  • variance of the mean: \[\frac{\sigma^2}{n} = \frac{1/0.2^2}{40} = 0.625\]
  • standard error of the mean: \[SE_{mean} = \frac{1/0.2}{\sqrt{40}} = 0.7905694\]

Simulating 1,000 means of 40 random exponentials:

We perform a unique simulation of 1,000 means of 40 random exponentials and plot the distribution of these simulated means against the theoretical normal distribution using the theoretical values.

As can be seen, in this simulation, the sample mean is just shy of the theoretical mean. Most similar simulations yield closer means – the set.seed() value (0134) was chosen so as to provide a perhaps meaningful difference and challenge normality in part 3 of the project.

PART 2. THEORETICAL VS. SAMPLE VARIANCES

A quick summary call shows the variability of the mean in our simulation. The error in estimating the mean can get as high as 58.18 % (max. value), but half of the error is contained around 21% of the mean value.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.063   4.412   4.898   4.958   5.465   7.909

The following table contrasts the theoretical values we defined in part 1 with the simulated values from our distribution of means:

Variability of the Mean
Mean Variance Standard Deviation
Theoretical \(\mu = 5\) \(\sigma^2/n = 0.625\) \(\sigma/\sqrt{n} = 0.791\)
Sample \(\bar X = 4.958\) \(s^2/n = 0.601\) \(s/\sqrt{n} = 0.775\)
Difference (value) \(0.042\) \(0.024\) \(0.016\)
Relative Difference \(0.834 \%\) \(3.84 \%\) \(2.023 \%\)

As expected, measures of spread deviate further than measures of center from their respective theoretical values, in an almost 1:2:4 ratio for mean:sd:var estimates (respectively).

PART 3: APPROACHING NORMALITY

The set.seed value chosen generates a distribution of means that is only approximately normal, retaining some of the skew present in the exponential distribution from which the iid draws originated. For comparison, here is the p.d.f. of the exponential distribution with \(\lambda = 0.2\), the density of a simulation of 1,000 random exponentials from this distrubution, and the previous density of the 1,000 means of 40 random exponentials and its associated normal curve.

As the CLT predicts, the distribution of means of 40 exponentials approximates the normal distribution. Further proof of this asymptotic behavior can be examined in the following normality plots:


APPENDIX: R Code

Part 1

# theoretical values
n <- 40; lambda <- 0.2 
theory_mean <- 1/lambda 
theory_var <- (1/lambda^2)/n
theory_sd <- (1/lambda)/sqrt(n)
# simulation 1
sims <- 1000
set.seed(0134)
simulation <- apply(matrix(rexp(sims*40, lambda), sims), 1, mean) 
sample_mean <- mean(simulation)
sample_var <- var(simulation)
sample_sd <- sd(simulation)
# density plot 1
par(mar=c(2, 4, 3, 1))
plot(density(simulation), col="red", main="Simulated Vs. Theoretical Distribution of 
        Means", ylab="density", xlab="", cex.main=0.9)
abline(v=sample_mean, col="red", lty="dashed")
curve(dnorm(x, mean=theory_mean, sd=theory_sd), add=TRUE, col="black")
abline(v=theory_mean, col="black", lty="dashed")
text(x=3.4, y=.4, labels="Sample Simulation", col="red", cex=0.8)
text(x=6, y=.4, labels="Normal p.d.f.", col="black", cex=0.7)
text(x=4.8, y=.18, expression(bar(x) == 4.958), col="red", srt=90)
text(x=5.15, y=.14, expression(mu == 5), col="black", srt=90)

Part 2

summary(simulation)

Part 3

# simulation 2 
set.seed(0134)
simulation2 <- rexp(sims, lambda) 
sample2_mean <- mean(simulation2)
# density plot 2
par(mar=c(2, 4, 2, 0.5))
plot(density(simulation2), col="blue", lwd=2, main="Exponentials and Means of 
        Exponentials", cex.main=0.8, cex.axis=0.7, ylab="density", xlab="", 
        xlim=c(-0.05,15), ylim=c(0,0.55))
curve(dexp(x, lambda), add=TRUE, col="black")
lines(density(simulation), col="red", lwd=2)
curve(dnorm(x, mean=5, sd=theory_sd), add=TRUE, col="black")
legend(10, 0.5, legend=c("Exponentials", "Means of Exponentials", "Theoretical Curves"), 
        col=c("blue", "red", "black"), cex=0.66, lty=1)
text(x=0.2, y=0.22, labels="Exponential", cex=0.5)
text(x=6.2, y=0.44, labels="Normal", cex=0.5)
# q-q plots
par(mfrow=c(1,2))
qqnorm(simulation2, pch = 1, frame = FALSE, main="Exponentials", cex.main=0.9, 
        cex.axis=0.6, cex.lab=0.6, col="blue", cex=0.5)
qqline(simulation2, probs = c(0.001, 0.999), col = "steelblue", lwd = 2)
qqnorm(simulation, pch = 1, frame = FALSE, main="Means of Exponentials", cex.main=0.9, 
        cex.axis=0.6, cex.lab=0.6, col="red", cex=0.5)
qqline(simulation, probs = c(0.001, 0.999), col = "steelblue", lwd = 2)