Investigation of the Central Limit Theorem - Distribution of Means and Standard Deviation of Samples of the Exponential Distribution

Eugene Hickey

Overview

This project ains to investigate the central limit theorem (CLT). Specifically it will examine how the statistical parameters of samples from an exponential distribution will follow a normal distribution.

Simulations

The following code was used to generate the data. It creates 1000 samples of forty random numbers each. Within each sample, the numbers follow an exponential distribution with mean 5 and standard deviation 5 (i.e. \(\lambda = 0.2\)).

set.seed(314159)
lambda = 0.2
sampleNo = 1000
sampleSize = 40
data <- matrix((rexp(sampleNo*sampleSize, lambda)), 
               nrow=sampleNo, 
               ncol=sampleSize, 
               byrow=TRUE)

The plot below shows histograms of the values from the first few samples. The underlying exponential distribution is evident, even though, with only 40 values in each sample, the histograms are quite noisy.

panels <- matrix(data=c(0, 0.5, 0.35, 1.,
                        0.5, 1, 0.35, 1.,
                        0, 0.5, 0, 0.65,
                        0.5, 1, 0, 0.65) ,nrow=4, ncol=4, byrow=TRUE)
plot.new()
par(bg=NA)
n = split.screen(figs=panels, erase=TRUE)
screen(4)
hist(data[1,], breaks=10, main="", xlab="")
screen(3)
hist(data[2,], breaks=10, main="", xlab="")
screen(2)
hist(data[3,], breaks=10, main="", xlab="", axes=FALSE)
screen(1)
hist(data[4,], breaks=10, main="", xlab="", axes=FALSE)
t = mtext("              Histogram of Values from the First Four Samples", line=2,
      side=3, outer=FALSE, cex=1, col="red")

Next, the means of all 1000 samples are calculated, i.e. we take the mean of the 40 values in each of the 1000 samples:

mns = apply(X=data, MARGIN=1, FUN=mean)
head(mns, 10)

##  [1] 6.147614 5.071306 5.254685 5.399566 5.940491 4.732895 5.134425
##  [8] 5.430949 5.273040 4.430560

The plot below shows a histogram of means. Also shown is the observed mean of the sample means and the theoretical mean of means.

hist(mns, breaks=30, col="lightblue", xlab = "sample Means", main = "Histogram of the Means from Sample of the Exponential Distribution")
abline(v = 1/lambda, col = "darkgreen", lwd = 2, lty = 4)
abline(v = mean(mns), col="red", lwd=2, lty=3)
legend("topright", 
       legend=c("theoretical mean", "observed mean"), 
       lty=c(4, 3), 
       lwd=c(2,2), 
       col=c("darkgreen", "red"))

Sample Mean vs Theoretical Mean

The observed mean of the 1000 samples is compared to the theoretical mean.

means <- c(mean(mns), 1/lambda)
names(means) <- c("Observed Mean", "    Theoretical Mean")
confidence <- mean(mns) + c(-1, 1)*qt(df=39, p=0.975)*sd(mns)/sqrt(40)
text <- paste("The Confidence Interval for the Mean is from ", 
              round(confidence[1], 2), 
              " to ", 
              round(confidence[2], 2))
means

##        Observed Mean     Theoretical Mean 
##              4.98902              5.00000

text

## [1] "The Confidence Interval for the Mean is from  4.73  to  5.25"

As can be seem, the sample mean is close to the theoretical mean and is with the confidence interval calculated using a t-95% quantile.

Sample Variance vs Theoretical Variance

The observed variance of the 1000 samples is compared to the theoretical variance.

variance <- c(var(mns), 1/(lambda)^2/(40-1))
names(variance) <- c("Observed Variance", "    Theoretical Variance")
chisq_lower = (sampleNo-1)*sd(mns)^2/(qchisq(0.975, 999))
chisq_upper = (sampleNo-1)*sd(mns)^2/(qchisq(0.025, 999))
text <- paste("The Confidence Interval for the Variance is from ", 
              round(chisq_lower, 3), 
              " to ", 
              round(chisq_upper, 3))
variance

##        Observed Variance     Theoretical Variance 
##                0.6568530                0.6410256

text

## [1] "The Confidence Interval for the Variance is from  0.603  to  0.718"

As can be seem, the sample variance is close to the theoretical variance and is within the confidence interval calculated using a \(\chi^2\) 95% quantile.

Histogram of the Means of the 1000 Samples

First of all, we convert the means of the 1000 samples to standard normal for by subtracting the mean of means, \(\bar{X}\), and by dividing by the standard deviation, \(\sigma\).

mns <- (mns-mean(mns))/sd(mns)

Below we plot a histogram showing the means of each of the 1000 samples. Even though the samples each follow the exponential distribution, their means follow a normal distribution.

hist = hist(mns, breaks = 30, prob=TRUE)
normal_scale = max(hist$counts)/0.4 
# this is match the amplitude of the histogram to normal function

curve(dnorm(x, 0, 1), -3, 3, col = "blue", add = TRUE)
abline(v=0, lwd=5, lty=2, col="red")
legend("topright", 
       legend=c("theoretical mean", "normal distribution"), 
       lty=c(2, 1), 
       lwd=c(5,1), 
       col=c("red", "blue"))

To test that this histogram of means is indeed normally distributed, we carried out a Shapiro-Wilk Test and got a surprising result:

shapiro = shapiro.test(mns)
shapiro

## 
##  Shapiro-Wilk normality test
## 
## data:  mns
## W = 0.99018, p-value = 3.161e-06

As can be seen, the test is quite conclusive. The low p-value (< 0.05) means we should reject the Null Hypothesis and conclude that the sample means are not normally distributed. Not what I was expecting and I’ not sure how to explain this.

Q-Q Plot

To further investigate ibution of the sample means, we produced a q-q plot. It is shown below:

qqnorm(mns)
qqline(mns, col="red", lwd=3, lty=7)

The shape of the plot, the fact the sample quantiles are above the red (normal) line at the extreme ends of the plot, suggest that the distribution is right-skewed.