Overview

The central limit theorem in the theory of probability asserts that “the distribution of the arithmetic mean of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution”.[1]

In this report we illustrate the central limit theorem using the exponential distribution.

The Simulations

Set Up Values and the R Functions to Generate the Mean of Each Distribution

We draw a random sample of \(40\) exponential values and compute its mean with the composition of R functions mean(rexp(n, rate)), where n \(= 40\) and the rate is fixed at lambda \(= 0.2\).

We repeat this random sampling \(1,000\) times to obtain a sampling distribution of \(1,000\) averages, each the mean of \(40\) exponentials.

The \(1,000\) averages are generated by the following R script.

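set.seed(75)                   # fix the seed so the simulation is reproducible
numberOfExponentials <- 40     # size n of each random sample
lambda <- 0.2                  # rate of the exponential distribution
averages <- numeric(1000)      # preallocate the vector of sample means
for (i in 1:1000) {
        averages[i] <- mean(rexp(numberOfExponentials, lambda))
}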

We use the R function set.seed(...) for reproducibility of our random sampling.
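
To see what the seed provides, the following minimal sketch runs the simulation twice with the same seed and checks that both runs agree. The helper name simulate_averages and its arguments are hypothetical, introduced only for illustration, and replicate() is used here in place of the explicit loop.

```r
# Hypothetical helper (not part of the original report): draws `reps` samples
# of size `n` from an exponential distribution and returns their means.
simulate_averages <- function(seed, reps = 1000, n = 40, rate = 0.2) {
  set.seed(seed)                          # same seed => same random stream
  replicate(reps, mean(rexp(n, rate)))    # one mean per simulated sample
}

run1 <- simulate_averages(75)
run2 <- simulate_averages(75)             # same seed as run1
run3 <- simulate_averages(76)             # a different seed, for contrast

identical(run1, run2)   # TRUE: the same seed reproduces the same averages
identical(run1, run3)   # expected to be FALSE, since the random streams differ
```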

Sample Mean Versus the Theoretical Mean of the Sampling Distribution

According to the documentation of the R function rexp(n, rate), the mean of the exponential distribution is equal to 1/rate. Since the rate is fixed at lambda \(= 0.2\) in each of the \(1,000\) calls to rexp(...), the mean of the underlying distribution is \(1/0.2 = 5\). The expected value of a sample mean equals the mean of the underlying distribution, so the theoretical mean of the sampling distribution is also theoretical_Mean \(= 5\).

The sample mean of the sampling distribution is the mean of the \(1,000\) generated averages: sample_Mean = mean(averages) = 5.04, which is very close to the theoretical mean of \(5\).

The histogram produced by the first script in the appendix shows the positions of these two means.
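
The comparison of the two means can be sketched as follows. The variable names are illustrative assumptions, and replicate() stands in for the loop; no particular output is claimed here beyond the two means being close.

```r
set.seed(75)
lambda <- 0.2                                    # rate used throughout the report
n <- 40                                          # size of each sample
averages <- replicate(1000, mean(rexp(n, lambda)))

theoretical_Mean <- 1 / lambda                   # mean of the exponential distribution
sample_Mean <- mean(averages)                    # mean of the 1,000 simulated averages

# Absolute gap between the two means; it should be small relative to the mean
abs(sample_Mean - theoretical_Mean)
```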

On Approximately Normal Distribution

Normal distributions are symmetric around their mean.[3]

Visually, the blue curve fitted to the histogram of the sampling distribution is approximately bell-shaped and centered at the mean. Hence, the sampling distribution of the 1,000 averages is almost symmetric around the mean.

The mean, median, and mode of a normal distribution are equal.[3]

mean(averages) = 5.04 and median(averages) = 5.01. They are almost equal.
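
Another rough check of symmetry is the sample skewness, which is 0 for a perfectly symmetric distribution. The sketch below computes it by hand in base R; the variable names are assumptions, and the moment-based formula is only one of several common definitions of sample skewness.

```r
set.seed(75)
averages <- replicate(1000, mean(rexp(40, 0.2)))

# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2
centered <- averages - mean(averages)
skewness <- mean(centered^3) / mean(centered^2)^1.5

skewness                              # values near 0 indicate near-symmetry
mean(averages) - median(averages)     # a positive gap suggests a slight right skew
```

A small positive skewness is to be expected here, because the exponential distribution itself is right-skewed and a sample of size 40 does not remove that skew entirely.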

Normal distributions are denser in the center and less dense in the tails.[3]

The histogram of the distribution of 1,000 averages shows that it is denser in the center and less dense in the tails.

Approximately 68% and 95% of the area of a normal distribution is within one and two standard deviations of the mean, respectively.[3]

limit1 <- mean(averages) + c(-1, 1) * sd(averages)      # mean -/+ 1 sd
limit2 <- mean(averages) + 2 * c(-1, 1) * sd(averages)  # mean -/+ 2 sd

# Proportion of the averages in the closed interval [ll, ul]
area <- function(ul, ll) {
        mean(averages >= ll & averages <= ul)
}

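Area within one standard deviation = round(area(limit1[2], limit1[1]), 2) = 0.7 = 70\(\%\), close to the 68\(\%\) expected for a normal distribution.

Area within two standard deviations = round(area(limit2[2], limit2[1]), 2) = 0.95 = 95\(\%\), matching the 95\(\%\) expected for a normal distribution.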
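
For reference, the exact normal benchmarks behind the 68% and 95% figures can be computed with pnorm(), the normal cumulative distribution function in base R. This sketch is an addition for comparison only, and within_k is a hypothetical helper name.

```r
# Probability that a normal variable lies within k standard deviations of its mean
within_k <- function(k) pnorm(k) - pnorm(-k)

round(within_k(1), 4)   # 0.6827
round(within_k(2), 4)   # 0.9545
```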

Sample Variance versus the Theoretical Variance of the Sampling Distribution

The variability of the distribution of the \(1,000\) averages is measured by the sample variance, which is the square of the standard deviation of the averages. Thus, sample_Variance = sd(averages)^2 = var(averages) = 0.67.

The theoretical variance of the sampling distribution is the variance of the underlying distribution divided by the size of each sample, \(n = 40\) (not by the number of simulated samples, \(1,000\)). The standard deviation of the exponential distribution is \(\sigma = 1/\lambda = 5\), the same as its mean,[2] so

theoretical_Variance = \(\left(\dfrac{\sigma}{\sqrt{n}}\right)^2=\left(\dfrac{5}{\sqrt{40}}\right)^2=0.625\).

The sample variance (\(0.67\)) is close to the theoretical variance (\(0.625\)). Thus, the averages in the sampling distribution with mean = \(5.04\) and variance = \(0.67\) are spread around the mean about as much as values from the normal distribution with mean = \(5\) and variance = \(0.625\). See the two histograms produced by the second script in the appendix.
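
The variance of a mean of n independent draws is the variance of the underlying distribution divided by n, where n is the size of each sample (here 40), not the number of simulated samples. The following sketch, with assumed variable names, computes that value for the exponential distribution, whose standard deviation is 1/lambda, and sets it beside the variance of freshly simulated averages.

```r
set.seed(75)
lambda <- 0.2
n <- 40                                            # size of each sample
averages <- replicate(1000, mean(rexp(n, lambda)))

sigma <- 1 / lambda                                # sd of the exponential distribution
theoretical_Variance <- sigma^2 / n                # 25 / 40 = 0.625
sample_Variance <- var(averages)                   # variance of the simulated averages

c(theoretical = theoretical_Variance, sample = sample_Variance)
```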

References

[1] http://www.math.uah.edu/stat/sample/CLT.html

[2] http://en.wikipedia.org/wiki/Exponential_distribution

[3] http://onlinestatbook.com/2/normal_distribution/intro.html

Supporting Appendix Material

R Script to Plot the Histogram of the 1,000 Averages, the Fitted Normal Curve, the Theoretical Mean, and the Sample Mean

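x <- averages
sample_Mean <- mean(x)
# Histogram of the averages; the y-axis extends below 0 to match the vertical lines
h <- hist(x,
          col = "red",
          xlab = "Averages",
          main = "Histogram of the 1,000 Averages",
          ylim = c(-15, 275))
# Normal density with the sample mean and sd, rescaled from density to counts
xfit <- seq(min(x), max(x), length.out = 1000)
yfit <- dnorm(xfit, mean = sample_Mean, sd = sd(x))
yfit <- yfit * diff(h$mids[1:2]) * length(x)
lines(xfit, yfit, col = "blue", lwd = 2)
# Vertical line at the theoretical mean (yellow)
lines(c(5, 5), c(-15, 275), col = "yellow", lwd = 4)
# Vertical line drawn 0.01 to the right of the sample mean (orange)
lines(c(sample_Mean + 0.01, sample_Mean + 0.01), c(-15, 275), col = "orange", lwd = 4)
mtext(c("theoretical mean (yellow line) = 5", "sample mean (orange line) = 5.04"),
      side = 1, line = 2, at = c(3, 7))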

R Script to Plot the Histograms and Fitted Curves for the Sample and the Theoretical Variances

x <- averages
sample_Mean <- mean(x)
# Reference sample: 1,000 normal draws with the theoretical mean (5) and the
# theoretical variance of the mean of 40 exponentials (25/40 = 0.625)
set.seed(100)
b <- rnorm(1000, mean = 5, sd = 5 / sqrt(40))
par(mfrow = c(2, 1))   # stack the two plots vertically
# Top panel: the simulated averages with a fitted normal curve scaled to counts
h1 <- hist(x,
           col = "red",
           xlab = "Averages",
           main = "Histogram of the 1,000 Averages with sample_Mean = 5.04\nand sample_Variance = 0.67",
           ylim = c(0, 250))
xfit1 <- seq(min(x), max(x), length.out = 1000)
yfit1 <- dnorm(xfit1, mean = sample_Mean, sd = sd(x))
yfit1 <- yfit1 * diff(h1$mids[1:2]) * length(x)
lines(xfit1, yfit1, col = "blue", lwd = 2)
# Bottom panel: the normal reference sample with its fitted curve
h2 <- hist(b,
           col = "red",
           xlab = "Averages",
           main = "Histogram of a Normal Distribution of size = 1000,\ntheoretical_Mean = 5 and theoretical_Variance = 0.625",
           ylim = c(0, 350))
xfit2 <- seq(min(b), max(b), length.out = 1000)
yfit2 <- dnorm(xfit2, mean = mean(b), sd = sd(b))
yfit2 <- yfit2 * diff(h2$mids[1:2]) * length(b)
lines(xfit2, yfit2, col = "blue", lwd = 2)