Summary and Conclusion

This is an assignment for the Coursera course Statistical Inference. The goal of this assignment is to explore the central limit theorem as it applies to pseudorandom samples from the exponential distribution.

This analysis shows that samples of 40 observations pulled from an exponential distribution are approximately normally distributed. It also shows that with a larger sample sizes, the average of the observations approaches the population mean. I hope you find this interesting and insightful.

Jeff Hebert


The official assignment:

The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should

  1. Show where the distribution is centered at and compare it to the theoretical center of the distribution.
  2. Show how variable it is and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal.
  4. Evaluate the coverage of the confidence interval for 1/lambda: mean +/- 1.96 * StDev/sqrt(n).

What does the Exponential Distribution look like

Before we look at sample means, let’s look at the underlying distribution. This is a skewed distribution. Theoretically, the mean and standard deviation are 1/lambda. Let’s double check that the mean and standard deviation = 5.

sample_lambda <- 0.2
sample_size <- 40
sample_count <- 1000
exp_dist <- rexp(1e6, sample_lambda)
mean(exp_dist); sd(exp_dist)
## [1] 5.002898
## [1] 5.001157
hist(exp_dist, breaks = 80, xlim=c(0,30), ylim=c(0,200000))
lines((seq(1:31)-1),dexp(x = (seq(1:31)-1),sample_lambda)*1e6, col="blue", lwd=2)

Simulating Samples of the Exponential Distribution

OK, now that we know the ‘true’ population, let’s simulate some samples. We create 1000 sets of 40 samples from the exponential distribution. I set the seed to 123 so you can produce the same data if you want to play along at home. Be sure to double check the mean of your sample means to make sure it is close to 5. Great!

set.seed(123) # Reproduce same samples
sample_data <- matrix(rexp(sample_count*sample_size, rate=sample_lambda), sample_count, sample_size)

sample_means <- apply(sample_data, 1, mean)
mean(sample_means)
## [1] 5.011911

Now we can work on the first two objectives. How does the

  1. Show where the distribution is centered at and compare it to the theoretical center of the distribution.
  2. Show how variable it is and compare it to the theoretical variance of the distribution.

Let’s calculate the theoretical mean of the sample means and compare with the actual mean of the sample means. The sample mean is 5.01, which is very close to the theoretical mean of 5.0. Great!

sample_theo_mean <- 1/sample_lambda
sample_mean <- mean(sample_means)
sample_theo_mean; sample_mean
## [1] 5
## [1] 5.011911

Now let’s calculate the theoretical standard deviation of sample means and compare with the actual standard deviation of the sample means. The sample standard deviation is 0.79, which is very close to the theoretical standard deviation of 0.78. Very good!

sample_theo_sd <- 1/sample_lambda / sqrt(sample_size)
sample_sd <- sd(sample_means)
sample_theo_sd; sample_sd
## [1] 0.7905694
## [1] 0.7802751

Distribution of Sample Means

The next objective is to compare the distribution of sample means with a normal distribution.

  1. Show that the distribution is approximately normal.

To check this, let’s look at a histogram of the data and a normal quantile plot. We can add a density plot for normal distribution to compare with the distribution of sample means. These two plots show that the distribution of sample means is approximately normal. Also, the mean and standard deviation of the sample means matches very closely with the theoretical values.

Convergence of Sample Means to Population Mean

The final objective is a little tricky. The idea is to look at the confidence interval of the mean and compare it with the mean we calculate for various sample sizes. In this case, a picture is worth 10,000 words.

  1. Evaluate the coverage of the confidence interval for 1/lambda: mean +/- 1.96 * StDev/sqrt(n).

The central limit theorem indicates that the mean of the sample means will approach the population mean. The rate of convergence is related to the population variation and the number of values in the sample. This plot shows the sample mean of a sample pulled from an exponential distribution. For samples less than 20 observations, the mean is not very close to the population mean of 5. However, with more than 100 observations the sample mean is very close to 5. In the plot, the solid line is the population mean of 5. The dashed lines represent the 95% confidence interval for samples of different sizes pulled from the population.

get_exp_sample_mean <- function(count = 40, seed = 123){
    set.seed(seed) # Reproduce same samples
    sample_data <- rexp(count, rate=sample_lambda)
    return(mean(sample_data))
}
get_exp_sample_sd <- function(count = 40, seed = 123){
    set.seed(seed) # Reproduce same samples
    sample_data <- rexp(count, rate=sample_lambda)
    return(sd(sample_data))
}

sample_size <- seq(1:1000)
sample_mean <- as.numeric(lapply(sample_size, get_exp_sample_mean))
sample_sd <- as.numeric(lapply(sample_size, get_exp_sample_sd))
sample_low_interval <- sample_mean - 1.96 * sample_sd / sqrt(sample_size)
sample_high_interval <- sample_mean + 1.96 * sample_sd / sqrt(sample_size)

Finally, let’s check to see how many of our samples actually contain the true population mean. To do this, we need to calculate sample intervals for each of our original 1000 simulation samples. Then we can calculate the amount of intervals that contain the true mean of 5. It looks like 92.6% of the intervals actually contain 5. We expected 95% of the intervals to contain 5. This discrepancy is likely due to the relatively small sample size of 40 observations drawn from this population. The chart above shows that larger samples are more stable around the mean.

sample_sds <- apply(sample_data, 1, sd)
sample_size <- 40
sample_low_interval <- sample_means - 1.96 * sample_sds / sqrt(sample_size)
sample_high_interval <- sample_means + 1.96 * sample_sds / sqrt(sample_size)

correct_rate <- ifelse(sample_low_interval < 5 & sample_high_interval > 5, 1,0)
mean(correct_rate)
## [1] 0.926