The Central Limit Theorem (CLT) states that if \(X_i\), \(i = 1, \dots, n\), are independent, identically distributed (iid) random variables with mean \(\mu\) and finite standard deviation \(\sigma\), then for large \(n\) the distribution of their average is approximately normal with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\).
Formally: let \(X^{*} = \frac{\sum_{i=1}^{n} X_i}{n}\). Then, for every \(x\),
\[ P\left( \frac{X^{*} - \mu}{\sigma / \sqrt{n}} < x \right) \rightarrow \Phi(x), \quad \mathrm{as} \quad n \rightarrow \infty \]
where \(\Phi(x)\) is the distribution function of the standard normal distribution \(N(0, 1)\). Equivalently, for large \(n\), \(P\left( X^{*} < x \right) \approx F(x)\), where \(F(x)\) is the distribution function of the normal distribution \(N(\mu, \frac{\sigma^2}{n})\).
The CLT holds regardless of the distribution of the original iid \(X_i\), \(i = 1, \dots, n\), provided their variance is finite. The aim of this document is to illustrate the theorem by investigating the distribution of averages of exponentially distributed random variables.
The exponential distribution can be simulated in R with rexp(n, \(\lambda\)), where \(\lambda\) is the rate parameter. The mean of the exponential distribution is \(\frac{1}{\lambda}\), and its standard deviation is also \(\frac{1}{\lambda}\). In our simulations \(\lambda = 0.2\), and we investigate the distribution of averages of 40 exponentials using 1,000 simulations.
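# Fix the random seed so the simulation is reproducible
set.seed(1234)
sim <- 1000     # number of simulated averages
n <- 40         # number of exponentials per average
lambda <- 0.2   # rate parameter of the exponential distribution
# One simulation per row; named exp_samples to avoid masking base::exp
exp_samples <- matrix(rexp(sim * n, lambda), nrow = sim, ncol = n)
# Average of each row: a vector of 1000 sample means
avg <- rowMeans(exp_samples)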
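For reference, the theoretical values implied by these parameters follow directly from the formulas for the exponential distribution. This is a worked calculation that uses only \(\lambda = 0.2\) and \(n = 40\):

\[ \mu = \frac{1}{\lambda} = \frac{1}{0.2} = 5, \qquad \sigma = \frac{1}{\lambda} = 5, \qquad \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906, \qquad \frac{\sigma^2}{n} = \frac{25}{40} = 0.625 \]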
The vector avg contains the 1000 averages of 40 exponentials with rate \(\lambda = 0.2\). The mean of these averages should be approximately \(\mu\), and their variance should be approximately \(\frac{\sigma^2}{n}\); as the number of simulated averages grows, the sample mean and variance converge to these theoretical values. The table below summarizes these values.
| Statistic | Exponential Distribution (Theoretical) | Average of 40 (Theoretical) | Average of 40 (Simulated) |
|---|---|---|---|
| Mean | 5 | 5 | 4.9742388 |
| Standard Deviation | 5 | 0.7905694 | 0.7713431 |
| Variance | 25 | 0.625 | 0.5949702 |
Indeed, the simulated mean and standard deviation of the averages are very close to their theoretical counterparts. Because the variance is the square of the standard deviation, its relative error is roughly twice as large: the simulated standard deviation is about 2.4% below the theoretical value, whereas the simulated variance is about 4.8% below.
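The values in the table can be reproduced along the following lines. This is a minimal sketch and not necessarily the code used to build the table: the names theo, simulated and comparison are illustrative, and the simulation is repeated here so that the block runs on its own.

```r
# Re-create the simulation so this block is self-contained
set.seed(1234)
sim <- 1000
n <- 40
lambda <- 0.2
avg <- rowMeans(matrix(rexp(sim * n, lambda), nrow = sim, ncol = n))

# Theoretical values for the average of n exponentials
mu <- 1 / lambda       # mean of one exponential
sigma <- 1 / lambda    # standard deviation of one exponential
theo <- c(mean = mu, sd = sigma / sqrt(n), var = sigma^2 / n)

# Corresponding sample statistics of the simulated averages
simulated <- c(mean = mean(avg), sd = sd(avg), var = var(avg))

# Side-by-side comparison with the relative error of each statistic
comparison <- data.frame(theoretical = theo,
                         simulated = simulated,
                         rel_error = (simulated - theo) / theo)
print(comparison)
```

The rel_error column makes the comparison between the standard deviation and the variance explicit, since both are expressed on the same relative scale.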
library(ggplot2)
First we investigate the density function of the exponential distribution with rate 0.2. Figure 1 shows the histogram of 10,000 simulated values from this distribution (salmon) together with the theoretical density function (black line). As expected, the histogram closely follows the theoretical density.
To demonstrate the Central Limit Theorem, Figure 2 shows the histogram of the simulated averages (salmon) together with the density function of the normal distribution whose mean and standard deviation are those of the theoretical average (black line). The green and blue vertical lines mark the theoretical mean and the mean of the observed averages, respectively. The figure shows that the distribution of the averages is well approximated by the normal density, and that the two means are again very close to each other.
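# Figure 1: histogram of 10,000 exponential draws with the theoretical density
set.seed(123)
s <- 10000
data <- data.frame(x = rexp(s, lambda))
g <- ggplot(data = data, mapping = aes(x = x))
# Scale the histogram to a density so it is comparable with dexp
g <- g + geom_histogram(binwidth = .7, colour = "black", fill = "salmon",
                        aes(y = after_stat(density)))
# args must be a list; dexp takes the rate as its `rate` argument
g <- g + stat_function(fun = dexp, args = list(rate = lambda), linewidth = 1)
g <- g + labs(title = "Figure 1: Exponential Distribution") + theme_bw()
g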
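# Figure 2: histogram of the simulated averages with the normal density.
# thmean, avgsd and avgmean are not defined in the code shown; the three
# definitions below are assumptions based on the text and the table.
thmean <- 1 / lambda             # theoretical mean of the averages
avgsd <- (1 / lambda) / sqrt(n)  # theoretical sd of the averages (assumed)
avgmean <- mean(avg)             # mean of the simulated averages
g2 <- ggplot(data = as.data.frame(avg), mapping = aes(x = avg))
g2 <- g2 + geom_histogram(binwidth = .2, colour = "black", fill = "salmon",
                          aes(y = after_stat(density)))
g2 <- g2 + stat_function(fun = dnorm, args = list(mean = thmean, sd = avgsd),
                         linewidth = 1, aes(colour = "Normal Density"))
g2 <- g2 +
  geom_vline(aes(xintercept = thmean, colour = "Theoretical Mean"), linewidth = 1) +
  geom_vline(aes(xintercept = avgmean, colour = "Simulated Mean"), linewidth = 1)
g2 <- g2 + scale_colour_manual(name = "Legend",
                               values = c("Theoretical Mean" = "green3",
                                          "Simulated Mean" = "blue",
                                          "Normal Density" = "black"))
g2 <- g2 + labs(title = "Figure 2: Normal Distribution Function and the Histogram of the Averages",
                x = "x") + theme_bw()
g2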
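As a numeric complement to the histogram of the averages, one could compare the empirical distribution function of the standardized averages with the standard normal distribution function. The sketch below is an illustration under the same parameters and is not part of the original analysis; the names z, grid, cdf_table and max_gap are hypothetical, and the simulation is repeated so that the block runs standalone.

```r
# Re-create the simulation so this block is self-contained
set.seed(1234)
sim <- 1000
n <- 40
lambda <- 0.2
avg <- rowMeans(matrix(rexp(sim * n, lambda), nrow = sim, ncol = n))

# Standardize the averages with the theoretical mean and standard deviation
mu <- 1 / lambda
se <- (1 / lambda) / sqrt(n)
z <- (avg - mu) / se

# Empirical CDF of the standardized averages vs. the standard normal CDF
grid <- seq(-3, 3, by = 0.5)
ecdf_z <- ecdf(z)
cdf_table <- data.frame(x = grid,
                        empirical = ecdf_z(grid),
                        normal = pnorm(grid))
print(cdf_table)

# Largest absolute gap on the grid; small values indicate a good approximation
max_gap <- max(abs(cdf_table$empirical - cdf_table$normal))
print(max_gap)

# Share of averages within 1.96 standard errors of mu (about 0.95 under normality)
coverage <- mean(abs(z) <= 1.96)
print(coverage)
```

A small max_gap and a coverage near 0.95 would be consistent with the normal approximation. Some residual gap is expected at \(n = 40\) because the exponential distribution is skewed.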