The Central Limit Theorem (CLT) states that the distribution of averages of iid variables (properly normalized) approaches that of a standard normal as the sample size increases. This means that the test statistic \[\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}= \frac{\sqrt n (\bar X_n - \mu)}{\sigma} = \frac{\mbox{Estimate} - \mbox{Mean of estimate}}{\mbox{Std. Err. of estimate}}\] has a distribution like that of a standard normal for large \(n\).
In the class lectures, this was demonstrated with simulated die rolls: one experiment consisted of a simulated die being thrown n times, producing a sample of size n. The test statistic was calculated for this sample using the equation above, and the experiment was repeated 1000 times. It was shown that the distribution of the 1000 test statistics generated in this way approximated the Standard Normal distribution, and that when n (the sample size) was increased, the distribution of the 1000 test statistics resembled a Standard Normal even more closely. By the Central Limit Theorem, as n approaches \(\infty\), the resulting distribution approaches \(N(0,1)\). A minimal sketch of that demonstration appears below.
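For reference, here is a minimal sketch of what that lecture demonstration might look like in R (the sample size n_die = 10 and the variable names are our own illustrative choices, not taken from the lecture):

# Sketch of the lecture demo: each experiment rolls a fair die n_die times
# and standardizes the resulting sample mean.
# For a fair die, mu = 3.5 and sigma = sqrt(35/12) (about 1.71).
n_die <- 10
nosim_die <- 1000
die_means <- replicate(nosim_die, mean(sample(1:6, n_die, replace = TRUE)))
z <- (die_means - 3.5) / (sqrt(35/12) / sqrt(n_die))
hist(z, probability = TRUE)   # should look roughly standard normal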
In this exercise, we will repeat these experiments with the Exponential distribution. The Central Limit Theorem applies regardless of the underlying distribution (provided it has finite variance): no matter what distribution the sample comes from, the distribution of its properly normalized sample mean approaches the Standard Normal.
Let’s start by simulating a sample of values taken from an Exponential distribution. An Exponential distribution describes the waiting time between events in a Poisson process with rate \(\lambda\).
We can simulate sampling from the exponential distribution in R with the rexp function, with \(\lambda\) as the rate parameter. Here are some characteristics of our simulation:
By calling rexp \(n \times nosim2\) times, we generate one big sample of that size, which we then split into smaller sample sets of size 40 in order to explore the distribution of sample means. In the code below, after sampling from the distribution \(n \times nosim2\) times, we arrange the values into two matrices, one of size nosim x n and another of size nosim2 x n, so that each row is a simulation and each column is a member of that sample. This gives two collections of sample sets, one containing 1000 simulations and another containing 2000; the larger one will be used later to show the limiting behavior of the sample sets as the number of simulations increases.
lambda <- 0.2                             # rate parameter of the exponential
mu <- 1 / lambda                          # theoretical mean, 1/lambda = 5
n <- 40                                   # size of each sample set
nosim <- 1000                             # number of simulations (first set)
nosim2 <- 2000                            # number of simulations (second set)
simulations <- rexp(n * nosim2, lambda)   # one big pool of 80000 draws
simulations40 <- matrix(simulations[1:(n * nosim)], nosim)   # 1000 rows of 40
simulations40by2000 <- matrix(simulations, nosim2)           # 2000 rows of 40
The mean of the exponential distribution is \(\frac{1}{\lambda}\). But due to the randomness inherent in sampling, any given sample mean will probably not be exactly equal to the population mean. As the sample size increases, the variance of the sample mean decreases, meaning it is more likely to be close to the population mean. In the limit \(n \to \infty\), the sample mean converges to the population mean: \(\bar X_n \to \mu\). Because of this, the mean estimator \(\bar X_n\) is said to be consistent.
We can visually show this limiting (asymptotic) behavior by plotting a progression of \(\bar X_n\) for samples of increasing n. This is a demonstration of the Law of Large Numbers, which is related to the CLT.
# average the first sample, then the first 2, then the first 3, etc...
# i.e. iteratively averaging sample sets of increasing size.
means <- cumsum(simulations[1:1000]) / (1 : 1000)
library(ggplot2)
g <- ggplot(data.frame(x = 1 : 1000, y = means), aes(x = x, y = y))
g <- g + geom_hline(yintercept = mu, color="blue") + geom_line(size = 1, color='cyan3')
g <- g + labs(x = "Sample size", y = "Sample mean", title="Sample mean as a function of sample size")
g
In this plot, the blue horizontal line is the population mean (5), and the jagged line is the value of the sample mean as a function of the sample size. For small sample sizes (up to 10 or so), the sample mean varies wildly, but by the time the sample size reaches 1000, the mean has stabilized to within about 0.1 of \(\mu\). This nicely demonstrates the limiting behavior of the sample mean.
A related consequence is that if we generate one mean per sample set of size 40, and treat those means as new random variables with their own distribution, then the mean of that new distribution (the distribution of the means) is an estimate of the original population mean. Even though the individual sample means vary considerably from the population mean because of the small sample size, the mean of means is a much closer estimate.
X_bars <- apply(simulations40, 1, mean)   # one mean per row (per sample set of 40)
mean(X_bars)                              # mean of the 1000 sample means
## [1] 5.019248
If we were to extend this to a larger number of sample sets or to larger sample sets, we would see this number also approach the population mean as its limiting behavior. In other words, both the population distribution and the mean-of-sample-sets distribution have the same mean.
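As a quick check, we can do the same with the larger set of 2000 simulations generated above; we would expect the result to also be close to the population mean of 5 (though any single realization may or may not be strictly closer than the 1000-simulation estimate):

X_bars2000 <- apply(simulations40by2000, 1, mean)   # one mean per sample set of 40
mean(X_bars2000)                                    # should be close to mu = 5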
Another way of showing the limiting (asymptotic) behavior above is through the sample variance. The sample variance, \(s^2\), is also a consistent estimator of the population variance, \(\sigma^2\), which in the case of the Exponential distribution is \(\frac{1}{\lambda^2}\).
Although, as stated above, the mean of the population distribution and the mean of the mean-of-sample-sets distribution are equivalent, the variances of these two distributions differ; they are related by \[SE^2 = \frac{s^2}{n}\] where \(SE\) is the standard deviation of the mean-of-sample-sets distribution.
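We can check this relation numerically against the simulations already generated (a rough check; for finite samples the two sides agree only approximately):

var(X_bars)                           # variance of the 1000 sample means
var(simulations[1:(n * nosim)]) / n   # pooled sample variance divided by n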
The CLT states that the mean-of-sample-sets distribution is approximately normally distributed. Its mean is the population mean \(\mu\) (estimated by \(\bar X\)) and its standard deviation is \(\frac{\sigma}{\sqrt n}\) (estimated by \(\frac{s}{\sqrt n}\)), so we can characterize its distribution as approximately \(N(\bar X, \frac{s}{\sqrt n})\).
Since the mean-of-sample-sets standard deviation is inversely proportional to the square root of the sample size, if we run simulations with larger sample sizes, the standard deviation will be smaller, and the individual sample means will cluster more tightly around the population mean.
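Since \(SE = \frac{s}{\sqrt n}\), quadrupling the sample size halves the standard error. For example (the alternative sample sizes 160 and 640 here are purely illustrative, not part of the simulation above):

(1 / lambda) / sqrt(c(40, 160, 640))   # approx 0.79, 0.40, 0.20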
For the simulations generated above, the theory predicts a standard deviation of the mean-of-sample-sets distribution, called the Standard Error, of:
s <- 1 / lambda     # population standard deviation of the exponential, 1/lambda = 5
SE <- s / sqrt(n)   # theoretical standard error of the mean for n = 40
SE
## [1] 0.7905694
Now let’s manually calculate the standard deviation of \(\bar X\) and see that it is roughly equivalent to the Standard Error:
sd(X_bars)   # empirical standard deviation of the 1000 sample means
## [1] 0.8029065
whereas the population standard deviation, \(1/\lambda = 5\), is much larger.
In order to show graphically that this mean-of-sample-sets distribution is normally distributed, we can normalize it by subtracting its mean and dividing by its standard deviation. This transforms it into (approximately) a Standard Normal distribution, \(N(0,1)\). So let’s do that normalization, plot the histogram of the transformed distribution, and show how it compares to an exact Standard Normal distribution by overlaying its density:
cfunc <- function(x, n) (mean(x) - mu) / (s / sqrt(n))   # normalize a sample's mean
dat <- data.frame(x = apply(simulations40, 1, cfunc, n))
g <- ggplot(dat, aes(x = x)) +
geom_histogram(alpha = .20, binwidth=.3, colour = "black", aes(y = ..density..))
g + stat_function(fun = dnorm, size = 2)
If we increase the number of simulations run, the histogram resembles the Standard Normal even more closely.
# reuse cfunc to normalize the means of the 2000-simulation set
dat <- data.frame(x = apply(simulations40by2000, 1, cfunc, n))
g <- ggplot(dat, aes(x = x)) +
geom_histogram(alpha = .20, binwidth=.3, colour = "black", aes(y = ..density..))
g + stat_function(fun = dnorm, size = 2)
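As a supplementary diagnostic (not part of the original comparison), a Q-Q plot of the normalized means against theoretical normal quantiles would show the points falling close to a straight line if the distribution is approximately normal:

qqnorm(dat$x)   # quantiles of the normalized means vs. standard normal quantiles
qqline(dat$x)   # reference line through the first and third quartiles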