By Andreas Schaetti
In this report the Central Limit Theorem (CLT) is visualized with examples using the exponential distribution. Specifically, the mean and variance of the average over samples of size 40 from the exponential distribution are investigated. It is shown that the averages are distributed approximately according to a normal distribution.
The exponential distribution describes the time between events in a Poisson process. For example, it models the time between clicks of a geiger counter. It is defined by the equation \(f(x) = \lambda e^{-\lambda x}\). The only parameter is the rate \(\lambda\). The mean and standard deviation are both equal to \(\frac{1}{\lambda}\). The distribution is only defined on the interval \([0, \infty)\).
The CLT states that a properly normalized distribution of averages of independent and identically distributed random variables approximates a standard normal distribution for a large number of variables.
Simulations are performed for 1000 sample averages, where every sample contains 40 events randomly drawn from an exponential distribution with \(\lambda = 0.2\). For each simulation, 40 random numbers are drawn from the exponential distribution. Mean and variance are calculated for each sample and stored in a separate vector. The code is given in the appendix.
The distribution of the sample averages is centered at the theoretical mean and is approximately symmetric about the mean. Also plotted is the mean over all sample averages. This is the prediction of the theoretical mean.
mean.error <- (mean(means) - mean.theory) / mean.theory
The relative error in predicting the theoretical mean is 0.525 %. Thus the prediction is very close to the truth.
The same plot is repeated for the variance of the sample averages. However, the distribution is not precisely centered around the theoretical variance and it is skewed towards higher values. It does not resemble a normal distribution. The CLT only predicts a normal distribution for the sample averages, not the sample variances.
variance.error <- (mean(variances) - variance.theory) / variance.theory
The relative error in predicting the theoretical variance is -0.688 %.
If we sample the exponential distribution many times the histogram of the sampled values follows closely the theoretical distribution. The expected value for each bin is sampled at the central position of the bin. The distribution looks much different from a normal distribution: it is not centered at its mean and it is highly skewed towards lower values.
In contrast, the distribution of sample averages closely resembles a normal distribution. In addition, if the sample averages are centered at the mean and normalized by the standard deviation, the distribution approximates a standard normal distribution (with mean = 0 and standard deviation = 1).
Another approach to judge whether the data is normally distributed is by comparing the quantiles of the standard normal distribution with the quantiles of the normalized data. This is called a QQ plot. Except for the very lowest and highest quantiles, the quantiles of the distribution of sample averages are very close to the quantiles of the normal distribution.
means <- vector("numeric")
variances <- vector("numeric")
for (i in 1 : nsimulations)
{
sample <- rexp(nevents, rate = lambda)
means <- c(means, mean(sample))
variances <- c(variances, var(sample))
}
hist(means,
breaks = 30,
main = "Histogram of sample averages",
xlab = "Average of sample")
abline(v = mean.theory, lwd = 3, col="red")
abline(v = mean(means), lwd = 3, col="blue", lty=2)
legend("topright", c("Theoretical mean", "Average mean"),
bty = "n", lwd = 3, col = c("red", "blue"), lty=c(1, 2))
hist(variances,
breaks=50,
main="Histogram of sample variances",
xlab="Variance of sample")
abline(v = variance.theory, lwd=3, col="red")
abline(v = mean(variances), lwd = 3, col="blue", lty=2)
legend("topright", c("Theoretical variance", "Average variance"),
bty = "n", lwd = 3, col = c("red", "blue"), lty=c(1, 2))
single.distribution <- rexp(nsimulations, rate = lambda)
hist(single.distribution,
breaks = 50,
freq = FALSE,
main ="Sampled vs theoretical exponential distribution",
xlab ="Value of exponential distribution")
xvals <- seq(0, 50, by=0.5)
lines(xvals,
dexp(xvals, rate = lambda),
lwd=2,
col="red")
abline(v = mean.theory, lwd=3, col="blue")
legend("topright", c("Theoretical distribution", "Theoretical mean"),
bty = "n", lwd = 3, col = c("red", "blue"))
means.norm <- (means - mean(means)) / sd(means)
par(mfrow=c(1, 2))
hist(means.norm,
breaks = 40,
freq = FALSE,
main = "Normalized sample averages",
xlab = "Value of normal distribution")
xvals <- seq(-4, 4, by=0.1)
lines(xvals, dnorm(xvals), lwd = 2, col="red")
xvals <- seq(0, 1, len=100)
quantiles.norm <- qnorm(xvals)
quantiles.data <- quantile(means.norm, xvals)
plot(quantiles.norm, quantiles.data,
pch=21,
xlim=c(-2.4, 2.4),
ylim=c(-2.4, 2.4),
main="QQ plot for sample averages",
xlab="Quantiles of std normal",
ylab="Quantiles of normalized data")
abline(a=0, b=1, lwd=2, col="red")