Overview

The goal of this paper is to illustrate the power of the Central Limit Theorem (CLT): the sampling distribution of the sample mean approaches a normal distribution centered on the population mean as the sample size grows, even when the underlying population is not normally distributed. I use the mean and the variance of the Exponential distribution to illustrate this point.

Sample Mean vs. Theoretical Mean

The following code creates a histogram of 1,000 random observations from an Exponential distribution, with a vertical blue line at the theoretical mean. The Exponential distribution is parameterized by its rate, lambda, and both its mean and its standard deviation equal 1/lambda. In this case, the mean is 1/lambda = 1/0.2 = 5. As you can see, this strongly right-skewed population in no way resembles a normal distribution.

set.seed(100)
lambda <- 0.2
exp_dist <- rexp(1000, lambda)
exp_dist_mean <- 1/lambda
hist(exp_dist, main = "Exponential Distribution Histogram", xlab = "Observation Value")
abline(v = exp_dist_mean, col = "blue", lwd = 4)

Next, I repeatedly draw a sample of size 40 (without replacement) from the same population, exp_dist, and record each sample's mean; doing this 1,000 times gives a vector of sample means (sample_means). The histogram of sample_means closely resembles a Normal distribution centered near the theoretical mean of the original population. (Again, the mean is marked by the vertical blue line at 5.)

# Draw 1,000 samples of size 40 from exp_dist and store the mean of each
sample_means <- numeric(length(exp_dist))
for(i in seq_along(exp_dist)) {
      sample_means[i] <- mean(sample(exp_dist, 40))
}
hist(sample_means, main = "Histogram of Sample Means", xlab = "Sample Mean Value")
abline(v = exp_dist_mean, col = "blue", lwd = 4)
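
As a numerical companion to the histogram, the sketch below rebuilds the same objects and compares the center of the sample means with both the theoretical mean and the mean of the simulated population. It is a minimal, self-contained sketch; the names n_draws, sample_size and centers are illustrative and not part of the code above. Because the samples are drawn from exp_dist rather than from the Exponential distribution itself, the sample means should center on mean(exp_dist), which is itself only an estimate of 1/lambda.

```r
set.seed(100)
lambda <- 0.2
exp_dist <- rexp(1000, lambda)

n_draws <- 1000      # illustrative: number of samples to draw
sample_size <- 40    # observations per sample

# Mean of each of n_draws samples taken (without replacement) from exp_dist
sample_means <- replicate(n_draws, mean(sample(exp_dist, sample_size)))

# Compare the three centers side by side
centers <- c(theoretical  = 1/lambda,
             population   = mean(exp_dist),
             sample_means = mean(sample_means))
print(centers)
```

The last two entries should agree closely with each other, and both should sit near the theoretical value of 5.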

To drive the point home, below is another Exponential distribution with the same lambda, this time with 10,000 observations.

hist(rexp(10000, lambda), main = "Exponential Distribution", xlab = "Observation Value")
abline(v = 1/lambda, col = "blue", lwd = 4)

Running the same sample-mean code as above, this time drawing from a freshly generated Exponential population of 10,000 observations (the rate is still lambda, as defined in the first code chunk), gives a sampling distribution that is even closer to normal. Each mean is now computed from a sample of size 100 rather than 40, so the sample means cluster more tightly around the theoretical mean of the original population distribution (1/lambda or, in this case, 5).

# 10,000 sample means, each from a sample of size 100 drawn from a fresh population of 10,000
sample_means_100 <- numeric(10000)
for(i in 1:10000) {
      sample_means_100[i] <- mean(sample(rexp(10000, lambda), 100))
}
hist(sample_means_100, main = "Histogram of Sample Means (large sample)", xlab = "Sample Mean Value")
abline(v = exp_dist_mean, col = "blue", lwd = 4)
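
The narrowing of the histogram as the sample size grows can also be checked numerically. The sketch below is a simplified variant of the code above, not a copy of it: it draws each sample directly with rexp() instead of sampling from a stored population, and the names n_sims and sim_means are illustrative. It compares the observed spread of the sample means with sigma/sqrt(n) for n = 40 and n = 100.

```r
set.seed(100)
lambda <- 0.2
n_sims <- 5000   # illustrative number of simulated samples per sample size

# Simulate n_sims sample means, each from n Exponential(lambda) observations
sim_means <- function(n) replicate(n_sims, mean(rexp(n, lambda)))

means_40  <- sim_means(40)
means_100 <- sim_means(100)

# Observed spread of the sample means vs. the theoretical sigma / sqrt(n)
observed    <- c(n40 = sd(means_40), n100 = sd(means_100))
theoretical <- c(n40 = (1/lambda)/sqrt(40), n100 = (1/lambda)/sqrt(100))
print(rbind(observed, theoretical))
```

Under these assumptions the spread for n = 100 should be visibly smaller than for n = 40, and both should land near their theoretical values.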

Sample Variance vs. Theoretical Variance

Recall that we set lambda equal to 0.2 at the beginning of the paper. Since both the mean and the standard deviation (sigma) of an Exponential distribution equal 1/lambda, the theoretical standard deviation and variance of our population distribution are 1/lambda and 1/lambda^2, respectively.

sigma <- 1/lambda
variance <- sigma^2
sigma
## [1] 5
variance
## [1] 25
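
For reference, the values of sigma and variance above follow from the Exponential density f(x) = lambda * exp(-lambda * x) for x >= 0. A short worked derivation, using the standard moments of that density:

```latex
\begin{aligned}
E[X]   &= \int_0^\infty x \,\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}, \\
E[X^2] &= \int_0^\infty x^2 \,\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}, \\
\operatorname{Var}(X) &= E[X^2] - \bigl(E[X]\bigr)^2
        = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.
\end{aligned}
```

With lambda = 0.2 this gives a mean of 5, a variance of 25, and a standard deviation of 5, matching the output above.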

Now compare these two theoretical values with the actual standard deviation and variance of our 1,000 simulated observations; the observed values below are close to the theoretical values calculated above.

sd(exp_dist)
## [1] 5.062016
var(exp_dist)
## [1] 25.624

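Now, the CLT states that the sampling distribution of the mean is approximately normal, with variance equal to the variance of the population divided by n, the number of observations that went into each mean. The standard error of the mean is the square root of that variance, sigma/sqrt(n). Below, you can see both the theoretical variance of the sample mean and how the observed variance of sample_means comes close to it; the standard errors are the square roots of these two values.

# variance of the sample mean: theoretical (sigma^2 / n) vs. observed
theo_var_mean <- variance/40
theo_var_mean
## [1] 0.625
actual_var_mean <- var(sample_means)
actual_var_mean
## [1] 0.6467231
# standard error of the mean = square root of the variance of the sample mean
theo_stand_error <- sqrt(theo_var_mean)
actual_stand_error <- sqrt(actual_var_mean)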
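
The theoretical value 0.625 can be traced back to a one-line variance calculation. The derivation below assumes the n observations in each sample are independent draws with common variance sigma^2, which is only approximately true here because each sample is drawn without replacement from a finite set of 1,000 observations.

```latex
\operatorname{Var}(\bar{X}_n)
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{n},
\qquad
\operatorname{SE}(\bar{X}_n) = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}.
```

With sigma = 5 and n = 40, sigma^2/n = 25/40 = 0.625 and sigma/sqrt(n) = 5/sqrt(40), which is approximately 0.791.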

You can see this visually when I overlay a normal density curve on the histogram of sample_means. This is the same histogram of sample_means as above, now drawn on the density scale, with a normal density curve added and dotted blue lines marking 1, 2, and 3 standard errors above and below the mean of 5.

hist(sample_means, main = "Histogram of Sample Means with Normal Curve", xlab = "Sample Mean Value", freq = FALSE, ylim = c(0,0.55))
curve(dnorm(x, mean = 5, sd = sd(sample_means)), from = 2, to = 9, add = TRUE, col = "red", lwd = 3)
# the standard error is the standard deviation of the sample means, not their variance
stand_error_vector <- 5 + c(-3,-2,-1,1,2,3)*sd(sample_means)
abline(v = stand_error_vector, col = "blue", lwd = 2, lty = 3)
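
The dotted lines at 1, 2, and 3 standard errors suggest a simple numerical check: for a normal distribution, roughly 68%, 95%, and 99.7% of values fall within those bands. The sketch below is a self-contained variant that draws each sample directly with rexp() and uses the theoretical standard error; the names n_sims, sim_means and coverage are illustrative and not part of the code above.

```r
set.seed(100)
lambda <- 0.2
n <- 40
n_sims <- 10000   # illustrative number of simulated samples

mu <- 1/lambda               # theoretical mean
se <- (1/lambda)/sqrt(n)     # theoretical standard error, sigma / sqrt(n)

# Means of n_sims samples, each of n Exponential(lambda) observations
sim_means <- replicate(n_sims, mean(rexp(n, lambda)))

# Share of simulated means within k standard errors of mu, for k = 1, 2, 3
observed <- sapply(1:3, function(k) mean(abs(sim_means - mu) <= k * se))
# Corresponding probabilities for an exactly normal distribution
normal <- pnorm(1:3) - pnorm(-(1:3))

coverage <- rbind(observed, normal)
colnames(coverage) <- c("1 SE", "2 SE", "3 SE")
print(round(coverage, 3))
```

Small gaps between the two rows are expected, since means of 40 Exponential observations are still slightly right-skewed.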