In this exercise, we conduct simulation on exponential random variables, study the distribution of the mean, and compare with the theoretical values given by the Central Limit Theorem (CLT).
First we generate the simulation data with rexp, with the random seed set to 4689. We generate 1000 batches of simulation, each with 40 random exponentials. We then calculate the mean of each batch and store them in mns.
set.seed(4689)
lambda <- 0.2
n = 40
B = 1000
x <- matrix(rexp(n * B, lambda), B, n)
mns40 <- apply(x,1,mean)
mns <- data.frame(mns40)
We now calculate the sample mean and compare it the the theoretical mean.
sample.mean <- mean(mns$mns40)
sample.mean
## [1] 5.025818
According to CLT, the theoretical mean is the mean of the exponential distribution, i.e.
theoretical.mean <- 1 / lambda
theoretical.mean
## [1] 5
We see that they are quite close to each other. In fact, the relative error is:
(sample.mean - theoretical.mean) / theoretical.mean
## [1] 0.005163502
We can see that the sample mean is consistant witht the theoretical mean.
sample.var = var(mns$mns40)
sample.var
## [1] 0.6457689
According to CLT, the theoretical sd is s/sqrt(n), where s is the sd of the original exponential distribution. Therefore,
theoretical.var = (1 / lambda) ^ 2 / n
theoretical.var
## [1] 0.625
And the relative error in variation is:
(sample.var - theoretical.var) / theoretical.var
## [1] 0.03323021
Again we can see that the sample variation is consistant with the theoretical value.
We now use the density plot to show the distribution of the means, and compare it with the normal distribution. See Figure 1.
library(ggplot2)
g <- ggplot(mns, aes(x=mns40)) + geom_density() + labs(x='Means') + geom_vline(xintercept = 1/lambda) + stat_function(fun = dnorm, args = list(mean = theoretical.mean, sd = sqrt(theoretical.var)), col = 'red')
print(g)
The distribution of means. Here the red curve show the theoretical distribution, i.e. a normal distribution with mean = 1/lambda and sd = 1/lambda, and the black curve shows the density of the simulation data.
We can see that despite minor random deviation, the overall shape and distribution of the means is consistant with the normal distribution with mean = 1/lambda and sd = 1/lambda / sqrt(n). The vertical line at x=5 shows the theoretical center of the distribution, and the sample distribution is roughly symmetrical about the line.
Using the sample statistics, we can compute the confidence interval for the mean:
sample.mean + c(-1,1) *qt(0.975, n-1) * sqrt(sample.var) / sqrt(n)
## [1] 4.768815 5.282820
which is consistant with the theoretical value 1/lambda = 5.
A better way to test if the sample follows the normal distribution is to use the Shapiro-Wilk test. The null hypothesis for the test is that the sample is normally distributed.
shapiro.test(mns$mns40)
##
## Shapiro-Wilk normality test
##
## data: mns$mns40
## W = 0.99522, p-value = 0.003141
Here we see that with a p-value of 0.003141, we reject the null hypothesis, which means that the sample is probably not normally distributed. This is likely due to the fact that the population size for each batch is too small for CLT to kick in.
We demonstrate this by repeating the sampling procedure with different batch sizes, 40, 80, 160, 320, 640. We then calculate the mean of each batch, and form 1000 samples of mean for each batch sizes. We normalize the distributions to make them easier to compare. We can see that by using a increasing population size, the sample distribution gets closer to the normal distribution.
For n = 640, We perform the Shapiro-Wilk test:
shapiro.test(mns$mns640)
##
## Shapiro-Wilk normality test
##
## data: mns$mns640
## W = 0.99917, p-value = 0.9455
With p = 0.9455, we can no longer reject the null hypothesis for the SW test. CLT rules!