This project ains to investigate the central limit theorem (CLT). Specifically it will examine how the statistical parameters of samples from an exponential distribution will follow a normal distribution.
The following code was used to generate the data. It creates 1000 samples of forty random numbers each. Within each sample, the numbers follow an exponential distribution with mean 5 and standard deviation 5 (i.e. \(\lambda = 0.2\)).
set.seed(314159)
lambda = 0.2
sampleNo = 1000
sampleSize = 40
data <- matrix((rexp(sampleNo*sampleSize, lambda)),
nrow=sampleNo,
ncol=sampleSize,
byrow=TRUE)
The plot below shows histograms of the values from the first few samples. The underlying exponential distribution is evident, even though, with only 40 values in each sample, the histograms are quite noisy.
panels <- matrix(data=c(0, 0.5, 0.35, 1.,
0.5, 1, 0.35, 1.,
0, 0.5, 0, 0.65,
0.5, 1, 0, 0.65) ,nrow=4, ncol=4, byrow=TRUE)
plot.new()
par(bg=NA)
n = split.screen(figs=panels, erase=TRUE)
screen(4)
hist(data[1,], breaks=10, main="", xlab="")
screen(3)
hist(data[2,], breaks=10, main="", xlab="")
screen(2)
hist(data[3,], breaks=10, main="", xlab="", axes=FALSE)
screen(1)
hist(data[4,], breaks=10, main="", xlab="", axes=FALSE)
t = mtext(" Histogram of Values from the First Four Samples", line=2,
side=3, outer=FALSE, cex=1, col="red")
Next, the means of all 1000 samples are calculated, i.e. we take the mean of the 40 values in each of the 1000 samples:
mns = apply(X=data, MARGIN=1, FUN=mean)
head(mns, 10)
## [1] 6.147614 5.071306 5.254685 5.399566 5.940491 4.732895 5.134425
## [8] 5.430949 5.273040 4.430560
The plot below shows a histogram of means. Also shown is the observed mean of the sample means and the theoretical mean of means.
hist(mns, breaks=30, col="lightblue", xlab = "sample Means", main = "Histogram of the Means from Sample of the Exponential Distribution")
abline(v = 1/lambda, col = "darkgreen", lwd = 2, lty = 4)
abline(v = mean(mns), col="red", lwd=2, lty=3)
legend("topright",
legend=c("theoretical mean", "observed mean"),
lty=c(4, 3),
lwd=c(2,2),
col=c("darkgreen", "red"))
The observed mean of the 1000 samples is compared to the theoretical mean.
means <- c(mean(mns), 1/lambda)
names(means) <- c("Observed Mean", " Theoretical Mean")
confidence <- mean(mns) + c(-1, 1)*qt(df=39, p=0.975)*sd(mns)/sqrt(40)
text <- paste("The Confidence Interval for the Mean is from ",
round(confidence[1], 2),
" to ",
round(confidence[2], 2))
means
## Observed Mean Theoretical Mean
## 4.98902 5.00000
text
## [1] "The Confidence Interval for the Mean is from 4.73 to 5.25"
As can be seem, the sample mean is close to the theoretical mean and is with the confidence interval calculated using a t-95% quantile.
The observed variance of the 1000 samples is compared to the theoretical variance.
variance <- c(var(mns), 1/(lambda)^2/(40-1))
names(variance) <- c("Observed Variance", " Theoretical Variance")
chisq_lower = (sampleNo-1)*sd(mns)^2/(qchisq(0.975, 999))
chisq_upper = (sampleNo-1)*sd(mns)^2/(qchisq(0.025, 999))
text <- paste("The Confidence Interval for the Variance is from ",
round(chisq_lower, 3),
" to ",
round(chisq_upper, 3))
variance
## Observed Variance Theoretical Variance
## 0.6568530 0.6410256
text
## [1] "The Confidence Interval for the Variance is from 0.603 to 0.718"
As can be seem, the sample variance is close to the theoretical variance and is within the confidence interval calculated using a \(\chi^2\) 95% quantile.
First of all, we convert the means of the 1000 samples to standard normal for by subtracting the mean of means, \(\bar{X}\), and by dividing by the standard deviation, \(\sigma\).
mns <- (mns-mean(mns))/sd(mns)
Below we plot a histogram showing the means of each of the 1000 samples. Even though the samples each follow the exponential distribution, their means follow a normal distribution.
hist = hist(mns, breaks = 30, prob=TRUE)
normal_scale = max(hist$counts)/0.4
# this is match the amplitude of the histogram to normal function
curve(dnorm(x, 0, 1), -3, 3, col = "blue", add = TRUE)
abline(v=0, lwd=5, lty=2, col="red")
legend("topright",
legend=c("theoretical mean", "normal distribution"),
lty=c(2, 1),
lwd=c(5,1),
col=c("red", "blue"))
To test that this histogram of means is indeed normally distributed, we carried out a Shapiro-Wilk Test and got a surprising result:
shapiro = shapiro.test(mns)
shapiro
##
## Shapiro-Wilk normality test
##
## data: mns
## W = 0.99018, p-value = 3.161e-06
As can be seen, the test is quite conclusive. The low p-value (< 0.05) means we should reject the Null Hypothesis and conclude that the sample means are not normally distributed. Not what I was expecting and I’ not sure how to explain this.
To further investigate ibution of the sample means, we produced a q-q plot. It is shown below:
qqnorm(mns)
qqline(mns, col="red", lwd=3, lty=7)
The shape of the plot, the fact the sample quantiles are above the red (normal) line at the extreme ends of the plot, suggest that the distribution is right-skewed.