The central limit theorem in the theory of probability asserts that “the distribution of the arithmetic mean of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution”.[1]
In this report we shall illustrate the assertion of the Central Limit Theorem using the exponential distribution.
R Functions to Generate the Mean of Each Distribution
We consider a random sample of \(40\) exponential values (exponentials) and compute its mean with the composed R call mean(rexp(n, rate)), where n\(= 40\) and the rate is fixed at lambda\(= 0.2\).
We repeat this random sampling \(1,000\) times to generate a sampling distribution of \(1,000\) averages, each the mean of \(40\) exponentials.
The \(1,000\) averages are generated by the following R script.
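set.seed(75)
numberOfExponentials <- 40   # size of each random sample
lambda <- 0.2                # rate of the exponential distribution
averages = numeric(1000)     # one slot per simulated average
for(i in 1:1000){
averages[i] = mean(rexp(numberOfExponentials, lambda))
}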
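The same simulation can be written without an explicit loop index. The following is a minimal sketch, assuming the same seed, sample size, and rate as in the text; the names n_exponentials, rate_lambda, n_simulations, and averages_alt are hypothetical and do not appear in the original script.

```r
# Sketch: generate the 1,000 averages with replicate() instead of a for loop.
set.seed(75)
n_exponentials <- 40     # size of each random sample (assumed from the text)
rate_lambda <- 0.2       # rate of the exponential distribution (assumed from the text)
n_simulations <- 1000    # number of averages to generate

# replicate() evaluates the expression n_simulations times and collects the results
averages_alt <- replicate(n_simulations, mean(rexp(n_exponentials, rate_lambda)))

length(averages_alt)     # number of simulated averages
mean(averages_alt)       # mean of the simulated averages
```

replicate() returns a numeric vector here because each evaluation yields a single number, so the result can be used wherever the vector of averages is needed.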
We use the R function set.seed(...) for reproducibility of our random sampling.
According to the documentation of the R function rexp(n, rate), the mean of the exponential distribution is equal to 1/rate. Since the rate is set at lambda\(= 0.2\) in each of the \(1,000\) calls of rexp(...), the theoretical mean of our distribution is theoretical_Mean\(= 1/0.2 = 5\).
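The relation mean = 1/rate can be checked empirically. The sketch below is only an illustration, assuming a single large sample of \(100,000\) exponentials; the names rate_demo, theoretical_mean_demo, large_sample, and empirical_mean_demo are hypothetical and are not part of the report's scripts.

```r
# Sketch: compare the theoretical mean 1/rate with the mean of one large sample.
set.seed(75)
rate_demo <- 0.2
theoretical_mean_demo <- 1 / rate_demo        # mean of an exponential with this rate

large_sample <- rexp(100000, rate_demo)       # one large sample of exponentials
empirical_mean_demo <- mean(large_sample)     # should lie near the theoretical mean

c(theoretical = theoretical_mean_demo, empirical = empirical_mean_demo)
```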
The sample mean of the distribution is the mean of the generated \(1,000\) averages. Thus, it is equal to sample_Mean = mean(averages) = 5.04 which is very close to \(5\), the theoretical mean.
The histogram on the next page shows the respective positions of these two means.
Visually, the blue curve fitted to the sampling distribution is approximately bell-shaped and centered near the mean. Hence, the sampling distribution of the 1,000 averages is almost symmetric around the mean.
mean(averages) = 5.04 and median(averages) = 5.01. They are almost equal.
The histogram of the distribution of 1,000 averages shows that it is denser in the center and less dense in the tails.
limit1 <- mean(averages) + c(-1,1) * sd(averages) # 1 sd
limit2 <- mean(averages) + 2*c(-1,1) * sd(averages) # 2 sds
# proportion of the averages that lie in the interval [ll, ul]
area <- function(ul,ll){
(sum(as.numeric(averages<=ul)) - sum(as.numeric(averages<ll)))/length(averages)
}
area within one standard deviation = round(area(limit1[2], limit1[1]), 2) = 0.7 = 70\(\%\).
area within two standard deviations = round(area(limit2[2], limit2[1]), 2) = 0.95 = 95\(\%\).
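For comparison, a normal distribution has about \(68\%\) of its area within one standard deviation of the mean and about \(95\%\) within two. The sketch below places the observed proportions next to these normal values. It regenerates the averages so that it runs on its own, assuming the seed, sample size, and rate from the text; the names averages_demo, prop_within, and normal_within are hypothetical.

```r
# Sketch: observed proportions within k standard deviations vs. the normal values.
set.seed(75)
averages_demo <- replicate(1000, mean(rexp(40, 0.2)))

# proportion of x lying within k sample standard deviations of the sample mean
prop_within <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

# area of a normal distribution within k standard deviations of its mean
normal_within <- function(k) pnorm(k) - pnorm(-k)

observed <- c(prop_within(averages_demo, 1), prop_within(averages_demo, 2))
expected <- c(normal_within(1), normal_within(2))
round(rbind(observed, expected), 2)
```

Taking the mean of a logical vector gives the proportion of TRUE values, which avoids dividing by a hard-coded sample count.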
The variability of the distribution of the \(1,000\) averages is measured by the sample variance, which is the square of the standard deviation of the averages. Thus, sample_Variance = sd(averages)^2 = var(averages) = 0.67.
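The theoretical variance of the sampling distribution of the averages is the square of the standard error, that is, the standard deviation of the exponential distribution divided by the square root of the sample size. For the exponential distribution the standard deviation equals the mean, 1/lambda\(= 5\), and each average is computed from n\(= 40\) exponentials, so
theoretical_Variance = \(\left(\dfrac{1/\text{lambda}}{\sqrt{n}}\right)^2=\left(\dfrac{5}{\sqrt{40}}\right)^2=0.625\).
The sample variance, \(0.67\), is close to the theoretical variance, \(0.625\). Thus, the averages in the sampling distribution with mean = \(5.04\) and variance = \(0.67\) are spread around the mean about as widely as a normal distribution with mean = \(5\) and variance = \(0.625\). See the two histograms below.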
[1] http://www.math.uah.edu/stat/sample/CLT.html
[2] http://en.wikipedia.org/wiki/Exponential_distribution
[3] http://onlinestatbook.com/2/normal_distribution/intro.html
R Scripts to Plot the Histogram of the 1,000 Averages, the Curve, the Theoretical Mean, and the Sample Mean (for Supporting Appendix Material)
x <- averages
sample_Mean<- mean(x)
h <- hist(x,
col="red",
xlab="Averages",
main="Histogram of the 1,000 Averages",
ylim = c(-15,275))
xfit <- seq(min(x),max(x),length=1000)
yfit <- dnorm(xfit, mean=sample_Mean, sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit,yfit, col="blue",lwd=2)
lines(c(5,5), c(-15,275), col="yellow",lwd=4)
lines(c(sample_Mean+0.01, sample_Mean+0.01), c(-15,275), col="orange", lwd=4)
mtext(c("theoretical mean (yellow line) = 5","sample mean (orange line) = 5.04"),
side=1,line=2,at=c(3,7))
R Scripts to Plot the Respective Histograms and the Curves of the Sample and the Theoretical Variances (for Appendix Supporting Material)
x <- averages
sample_Mean<- mean(x)
set.seed(100)
# normal sample with the theoretical mean 5 and the theoretical variance (5/sqrt(40))^2 = 0.625
b <- rnorm(1000, mean=5, sd=sqrt(0.625))
par(mfrow=c(2,1))
h1 <- hist(x,
col = "red",
xlab = "Averages",
main = "Histogram of the 1,000 Averages with sample_Mean = 5.04
and sample_Variance = 0.67",
ylim = c(0, 250))
xfit1 <- seq(min(x),max(x),length=1000)
yfit1 <- dnorm(xfit1, mean=sample_Mean, sd=sd(x))
yfit1 <- yfit1*diff(h1$mids[1:2])*length(x)
lines(xfit1,yfit1, col="blue",lwd=2)
h2 <- hist(b,
col ="red",
xlab ="Averages",
main ="Histogram of a Normal Distribution of size = 1000,
theoretical_Mean = 5 and theoretical_Variance = 0.625",
ylim = c(0, 350))
xfit2 <- seq(min(b),max(b),length=1000)
yfit2 <- dnorm(xfit2, mean=mean(b), sd=sd(b))
yfit2 <- yfit2*diff(h2$mids[1:2])*length(b)
lines(xfit2,yfit2, col="blue",lwd=2)