This document will show how R languaje is used to generate random samples of exponential distributions, and test if their behavior fits what predicts the Central Limit Theorem (CLT).
Broadly speaking, the CLT states that, under certain conditions, the means of samples of independent random variables will follow a normal distribution, regardless of the type of distribution they come from. Among the above conditions, we note, first, that both the expected value and variance must be well-defined, and secondly, the number of samples must be large enough.
Therefore, if we have \(n\) samples of random variables that come from distributions whose expected value is \(\mu\) and whose variance is \(\sigma^2\), CLT states that their averages will follow approximately a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\). That is to say: \(\overline{X}_n \approx N(\mu, \sigma^2/n)\).
The exponential distribution is frequently used to model the time elapsed between independents events, and it is described by the parameter \(\lambda\), which is the number of events per unit time. The expected value of an exponential distribution is given by \(E[X] = 1/\lambda\), like the standard deviation, \(\sigma = 1/\lambda\).
In R is possible to generate random deviates of exponential distributions with rexp
function, where you can set the size of the sample, and the rate parameter, \(\lambda\). To test the CLT we going to draw 5,000 samples of exponential distributions, all of them with \(n = 40\) and \(\lambda = 0.2\), and then we calculate the average of each one and save them on the expmeans
object.
set.seed(1234)
expmatrix <- matrix(rexp(40*5000, 0.2), 5000)
expmeans <- apply(expmatrix, 1, mean)
Now we have a sample of 5,000 means of exponential distributions with \(n = 40\) and \(\lambda = 0.2\) saved in a vector called expmeans
. The CLT states that the mean of this distribution should be the expected value of the distribution from with it comes, that is
\[\frac{1}{\lambda} = \frac{1}{0.2} = 5\]
Let’s to check it calculating the mean of our means vector expmeans
.
mean(expmeans)
## [1] 5.003888
The mean of our sample is pretty close to the theoretical mean. Let’s see it graphically.
Histogram drawn with the vector of the 5,000 exponential distributions means.
It seem that our distributions of means is centered around 5, like we can see in the histogram. Finally, we will calculate the 95% confidence interval for this average.
(mean(expmeans) + c(-1, 1) * qnorm(0.975) * sd(expmeans)/sqrt(length(expmeans)))
## [1] 4.981958 5.025818
We estimate the average mean of the sample as \(4.98\) to \(5.03\) with 95% confidence. The interval includes the mean that CLT predicts, \(5\), which is further evidence that, so far, the theory is being fulfilled.
According to the CLT, the distribution of our vector of averages should be, approximately, like a normal distribution, with mean \(\mu\), in this case \(1/\lambda\), as we have just seen, and variance \(\sigma^2/n\), where \(\sigma^2\) would be \(1/\lambda^2\), so
\[\frac{\sigma^2}{n} = \frac{1/\lambda^2}{n} = \frac{25}{40} = 0.625\]
Let’s to check it by calculating the variance of our means vector expmeans
.
var(expmeans)
## [1] 0.6259561
Again, the variance of our sample is very close to the theoretical variance predicted by CLT.
Finally, we are going to test the fit of our distributions of means, expmeans
, to a normal distribution. We will do it in two ways, with a simple visual test and with a statistical test. First, let’s see the degree of fit to a normal distribution.
Left: The more the bins fit within the red line, the better fit to a normal distribution. Right: Histogram of 5,000 exponential random deviates with \(\lambda = 0.2\).
The shape of the histogram of means fits nicely to the shape which would have a normal distribution with mean \(5\) and standard deviation \(\sqrt{0.625}\), and it is very different from the sample of exponential random deviates with \(\lambda = 0.2\).
Now let’s perform a Shapiro-Wilk test, which test the null hypothesis that the sample of means come from a normal distribution.
shapiro.test(expmeans)
##
## Shapiro-Wilk normality test
##
## data: expmeans
## W = 0.99463, p-value = 1.058e-12
The P-value obtained is much lower than 0.05, so we should reject the null hypothesis and accept that our mean vector expmeans
does not come from a normal distribution. In any case, we should keep in mind that when working with large samples, small deviations can lead to a rejection of the null hypothesis, although the data fit well with what is expected from a normal distribution, as in this case.
Here you can find the code that was used to build the plots.
Figure 1 - Histogram of exponential distributions means plot
library(ggplot2)
ggplot(data.frame(x = expmeans), aes(x = x)) + geom_histogram(colour = "darkslategray3",
fill = "darkslategray3") + scale_x_continuous(breaks = c(1:8)) +
geom_vline(xintercept = 5, colour = "black") + ggtitle("Histogram of exponential distributions means") +
xlab("Means") + ylab("Count") + theme(plot.title = element_text(size = 9,
face = "bold"), axis.text = element_text(size = 6), axis.title = element_text(size = 9))
Figure 2 - Histogram of means with normal curve plot and exponential distribution
plot1 <- ggplot(data.frame(x = expmeans), aes(x = x)) + geom_histogram(aes(y = ..density..),
colour = "darkslategray3", fill = "darkslategray3") + scale_x_continuous(breaks = c(1:8)) +
stat_function(fun = dnorm, color = "red", args = list(mean = 5,
sd = sqrt(0.625))) + ggtitle("Histogram of means and normal curve") +
xlab("Means") + ylab("Density") + theme(plot.title = element_text(size = 9,
face = "bold"), axis.text = element_text(size = 6), axis.title = element_text(size = 9))
set.seed(5678)
expdist <- rexp(5000, 0.2)
plot2 <- ggplot(data.frame(x = expdist), aes(x = x)) + geom_histogram(aes(y = ..density..),
colour = "darkslategray3", fill = "darkslategray3") + ggtitle("Histogram of exponential distribution") +
xlab("Values") + ylab("Density") + theme(plot.title = element_text(size = 9,
face = "bold"), axis.text = element_text(size = 6), axis.title = element_text(size = 9))
library(gridExtra)
grid.arrange(plot1, plot2, ncol = 2)