Statistical Inference Project: Properties of a distribution of means

We are asked to illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 numbers drawn from an exponential distribution with \( \lambda \)=0.2. The mean of exponential distribution is \( \frac{1}{\lambda} \) and the standard deviation is also also \( \frac{1}{\lambda} \).

I begin by setting lambda = 0.2 and drawing 40 random numbers from the exponential dsitribution (using the rexp() function) 1000 times. I calculate the mean of each of the 1000 samples, and plot.

rm(list = ls())
lambda=0.2
sims=1000
ns=40
mp<-matrix(rexp(sims*ns, lambda), sims)
mexp<-apply(mp, 1, mean)
vexp<-sd(mexp)
mx=(max(mexp)+1)
hist(mexp,breaks=51, main="Histogram of means of 1000 samples of size 40", ylab="density", xlab="sample mean", xlim=c(min(mexp),mx))

medmexp<-mean(mexp)
tmean<-1/lambda
abline(v=medmexp, col='red', lwd=3)
abline(v=tmean, col='blue',lwd=3)
legend(max(mexp)-2.5,55, c("mean of 1000 samples", "population mean"), lty=1, col=c("red", "blue"), lwd=3,bty="n", cex=1)

plot of chunk unnamed-chunk-1

The distribution of these means is centered on the true population mean. As we see in the graph above, the mean of all the samples, 5.0384808, approximates the known population mean of \( \frac{1}{\lambda}= \) 5.

strmexp<-sd(mexp)
stract<-1/(lambda*sqrt(ns))
starts<-medmexp-strmexp
ends<-medmexp+strmexp
tstarts<-medmexp-stract
tends<-medmexp+stract
hist(mexp,breaks=51, main="Histogram of means of 1000 samples of size 40", ylab="density", xlab="sample mean",xlim=c(min(mexp),mx))
segments(x0=starts,y0=5, x=ends, y=5,col = "red", lty = 1, lwd=3)
segments(x0=tstarts,y0=10, x=tends, y=10,col = "blue", lty = 1, lwd=3)
legend(max(mexp)-2.5,55, c("mean +/- 1 standard error of samples", "mean +/- 1 predicted standard error"), lty=1, col=c("red", "blue"), lwd=3,bty="n", cex=.85)

plot of chunk unnamed-chunk-2

The standard error of this distribution of means, 0.7994724, has an expected value of \( \frac{1}{\lambda \sqrt{n}}= \) 0.7905694. The equation indicates that as n, the size of each of our samples, becomes large, the standard error of this distribution will tend to zero, and we will have a very precise estimate of the population mean.

mpone<-matrix(rexp(sims*ns, lambda))
mnpone<-mean(mpone)
par(mfrow=c(1,2))
hist(mpone,breaks=51, main="40000 random draws", ylab="probability", xlab="values", prob=TRUE)
curve(dnorm(x, mean=mean(mpone), sd=sd(mpone)), col="red",lwd=2,add=TRUE)
legend(5,.15, ("normal distribution"), lty=1, col="red", bty="n",cex=.85,lwd=2)
hist(mexp,breaks=51, main="1000 means of random draws", xlab="means", ylab="probability", prob=TRUE)
curve(dnorm(x, mean=medmexp, sd=strmexp), col="red",lwd=2,add=TRUE)

plot of chunk unnamed-chunk-3

The law of large numbers tells us that many random samples will provide a good estimate of the population mean, and we can see that here, where the distribution of 40000 random numbers from the exponential distribution has a mean of 5.0399964. Remember the true population mean is \( \frac{1}{\lambda}= \) 5, and the overall mean of then means of 1000 samples of size 40 is 5.0384808.

However, the central limit theorem tells us that the distribution of a set of means will be normal, regardless of the underlying distribution that the samples come from. We see here, that the distribution of 1000 sample means (n=40) from the exponential distribution is well approximated by a normal distribution, but the distribution of 40000 random numbers from the exponential distribution is not.