The goal of this report is to investigate the exponential distribution in R and compare it with the Central Limit Theorem. To that effect it will investigate the distribution of 1000 averages of 40 exponential observations with the lambda parameter set to 0.2.
The exponential distribution PDF has one parameter: lambda. In this report it will be set to 0.2 and hold fixed.
Its theoretical mean and variance are 1/lambda or 5 in our case.
The standard error (SE) of the mean statistic is sigma/sqrt(n) or 5/sqrt(40) or 0.7905694. Its variance is SE^2 or 0.625
First we set some constants:
set.seed(12345)
rate <- 0.2
n <- 40
ns <- 1000
se <- sqrt(0.625)
Here is an example of exponential distribution with lambda=0.2 (1000 samples):
plot(density(rexp(ns, rate=rate)), main="Exponential distribution with lambda = 0.2",
xlab="x")
abline(v=5, col="blue")
Figure 1: Exponential distribution with lambda = 0.2. We can see the theoretical mean at 5.
#First get the sample means
samples <- matrix(rexp(ns*n, rate=0.2), ncol=n)
sampleMeans <- apply(samples,1,mean)
#Now the variances
sampleMeansVars = rep(0,ns)
for (i in 1:ns) {
samples <- matrix(rexp(n*n, rate=0.2), ncol=n)
var <- var(apply(samples,1,mean))
sampleMeansVars[i] <- var
}
The theoretical mean is 1/lambda, so in our case 5.
The sample mean of one sample of size 1000 is 4.956, which is close to the theoretical mean of 5:
sampleMean <- mean(matrix(samples, ncol=1)[1:1000])
cat(sampleMean)
## 4.956496
The average of multiple sample means of sample size 40 is 4.977, which is even closer to the theoretical mean of 5. So the mean statistic distribution is centered at 5:
sampleMean <- mean(sampleMeans)
cat(sampleMean)
## 4.97704
means <- cumsum(sampleMeans)/(1:ns)
ggplot(data.frame(x=1:ns, y=means), aes(x=x, y=y)) +
geom_hline(yintercept = 5) + geom_line() +
labs(x = "Number of obs", y = "Cumulative mean") +
ggtitle("Convergence of the mean of samples of exponentials towards 5") +
theme_bw()
Figure 2: Convergence of the empirical mean to the theoretical mean
The theoretical variance of the mean estimate is the SE^2, which is (sigma/n)^2 or 25/40 = 0.625
The variance of one sample of 1000 sample means is 0.601, which is close the the theoretical variance of 0.625:
sampleMeanVariance <- var(sampleMeans)
cat(sampleMeanVariance)
## 0.6010114
If we increase the number of simulations to 1000, this difference will be reduced even if the sample size is smaller (40).
sampleMeanVariance <- mean(sampleMeansVars)
cat(sampleMeanVariance)
## 0.6271657
So the empirical estimate of the SE of 1000 sets of 40 sample means is 0.627, which is closer the the theoretical variance of 0.625.
vars <- cumsum(sampleMeansVars)/(1:ns)
ggplot(data.frame(x=1:ns, y=vars), aes(x=x, y=y)) +
geom_hline(yintercept = se^2) + geom_line() +
labs(x = "Number of obs", y = "Cumulative mean") +
ggtitle("Convergence of the variance of mean of samples of size 40") +
theme_bw()
Figure 2: Convergence of the empirical variance to the theoretical variance
The Central Limit Theorem (CLT) states that with a large enough set of samples, the distribution of the mean is normal. In addition, if the sample size is big enough, the distribution will have a small variance.
#First we plot the empirical mean distribution
hist(sampleMeans,prob=TRUE, main="Histogram of mean statistics",
xlab="Sample means", ylim = c(0,0.6))
#Then we plot the theoretical statistic distribution
x <- seq(0,10, length=ns)
lines(x, dnorm(x,5,sqrt(se)),col="blue")
Figure 3: Normality of the mean statistic for the exponential distribution. In blue the theoretical mean statistic distribution.
We can see that the empirical mean statistic distribution (boxplot) has a gaussian shape which is not the case of the exponential distribution (beginning of the document) that is skewed to the left.
The blue curve is the theoretical sample mean distribution: X’ ~ norm(5,sqrt(0.625))
Finally we can compare the two 95% confidence intervals.
First the theoretical:
5 + c(-1,1) * qnorm(.95) * se
## [1] 3.699629 6.300371
Then the empirical:
quantile(sampleMeans, prob=c(.05,0.95))
## 5% 95%
## 3.761442 6.252184
Both are pretty close.