Overview

There are several well-studied and well-understood distributions in current probability theory. Each distribution is shaped differently and is definied by one or more parameters that guide the probability of obtaining certain results. In probability theory the Central Limit Theorem (CLT) states that if means are generated from combinations of randomly-generated values then those means are distributed according to the normal distribution regardless of the distribution from which the original values were generated.

Setup

In order to show the truth of this through the monte-carlo method we’ll generate a lot of random values based on some simple parameters and then we’ll check how the results work out statistically. First we’ll set up some initial values. The set.seed command allows for reproducibility of the random results, lambda=.2 is a parameter for our distribution, meanOf=40 refers to how many values will go into creating our sample mean, and noSim=1000 is how many simulations we’ll run. The code is in the appendix.

We generate the values from the exponential distribution with rate parameter lambda.

Because the rate is \(\lambda = 0.2\) the theoretical mean of our exponential distribution will be \(\frac{1}{\lambda} = 5\). Note that the (red) theoretical and (blue) computed means are nearly identical.

In order to work with the results I put the numbers in a 1000 by 40 matrix.

By taking the mean (arithmetic average) across each row of the matrix we’ll get 1000 means derived from 40 randomly generated values each.

(1.) Sample mean (of the results) compared to the theoretical mean (calculated)

Let \(X_1, ..., X_{40}\) be the values generated.
Then \(\frac{1}{40} \cdotp (X_1 + ... + X_{40})\) is the arithmetic mean of those values.
The expectation of this is \(E[\frac{1}{40} \cdotp (X_1 + ... + X_{40})]\) \(= \frac{1}{40} \cdotp (E[X_1] + ... + E[X_{40}])\) \(= \frac{1}{40} \cdotp (\frac{1}{.2} + ... + \frac{1}{.2})\) \(= \frac{1}{40} \cdotp 40 \cdotp 5 = 5\)

So the theoretical mean of the means is 5. A graph of a scatterplot of all 1000 of the points (means) can be found in the appendix (black dots on grey background).

A histogram of the means of the rows is more useful. The figure is in the appendix (yellow bars). The vertical red line of the theoretical mean and the vertical blue line of the calculated mean are in nearly the same place. This is because the calculated mean 5.0300071 is very close to the theoretical mean of 5.

(2.) Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

The variance of our set of means is 0.6144395, the standard deviation of the means is its square root, 0.7838619, and the variance of the original set of values is 25.5509947. Remember that because the original data is from an exponential random distribution the theoretical variance for the exponential random variables is \(\frac{1}{\lambda^2} = 25\).

And the theoretical variance of a mean of 40 exponential random variables is \(Var(\frac{1}{n} \cdotp (X_1 + ... + X_{40}))\) \(=\frac{1}{n^2} \cdotp Var(X_1 + ... + X_{40})\)
[Because the points are (theoretically) independent the variance of the sum is the sum of the variances.] \(= \frac{1}{n^2} \cdotp (Var(X_1) + ... + Var(X_{40}))\) \(= \frac{1}{40^2} \cdotp (40 \cdotp 25)\) \(= \frac{25}{40} = \frac{5}{8} = .625\)

So we see that the variance of the vector of means, \(0.6144395\) is pretty close to the theoretical computed variance 0.625 of the means.

(3.) Show that the distribution is approximately normal.

The CLT says that what is normally distributed is \(\frac{\bar{X} - \mu}{\sqrt{\frac{25}{40}}}\). We’ll start with a histogram of the means with a red curve overlaid to show the theoretical normal distribution values. The normal distribution curve is generated using the adjustments to mean and standard deviation of the set of means.

The means certainly look approximately normally distributed. In R the qqplot shows another visualization. It graphs the quantiles of the theoretical normal distribution against the quantiles of the sample distribution. Points (quantile indicators) closer to the (red) central line indicate good approximations of normality.

We can also run the Shapiro-Wilk statistical test for normality. Its code can be found in the appendix.

The low p value of 0.0005229 would indicate that at significance level of 99.9477118% we would reject the null hypothesis that the data are normally distributed. However, the large numer of values included (1000) makes the Shapiro-Wilk test unreliable.

The skewness of 0.2533957 compared to the value of 0 for the normal distribution shows that the calculated means are skewed (are more frequently) slightly to the right of the mean of the normal distribution.

Kurtosis is a measure of how “tailed” the data is. The kurtosis here is 2.8118617. That’s very close to the 3 that we’d get for the standard normal, so normality seems a reasonable assumption.

In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Here are 1000 exponentials compared to 1000 means of 40 random values each.

The means are similar (both approximately 5) while the variances and distributions are significantly different.

Conclusion

We conclude that for n random variables following a non-normal distribution the mean of those n is approximately normally distributed.

Appendix

Here’s setting the initial parameters.

set.seed(20190331)
lambda <- .2
meanOf <- 40
noSim <- 1000

Then generating the values.

values <- rexp(n = meanOf * noSim, rate = lambda)

The sample mean of the set of exponentially-distributed values.

mean(values)

Code and plot of the means with arbitrary index. The horizontal red line displays the theoretical mean, and the horizontal blue line indicates the calculated mean.

library(ggplot2)
g1 <-
        qplot(x = 1:1000,
              y = means,
              ylab = "Mean")
g1 + geom_hline(yintercept = 1 / lambda,
                lwd = 2,
                color = "red") + geom_hline(yintercept = mean(means),
                                            lwd = 2,
                                            color = "blue") +
        theme(axis.title.x = element_blank())

Code for histogram of the means.

g2 <-
        qplot(
                means,
                geom = "histogram",
                binwidth = .1,
                fill = I("yellow"),
                col = I("black"),
                xlab = "Mean"
        )
g2.1 <-
        g2 + geom_vline(xintercept = 1/lambda,
                        lwd = 2,
                        color = "red") + geom_vline(xintercept = mean(means),
                                                    lwd = 2,
                                                    color = "blue")
g2.1

Code for the Shapiro-Wilk test.

SWtest <- shapiro.test(normDist)

The code for Skewness and Kurtosis.

skewness(normDist)
kurtosis(normDist)

Here is the creation code for the histogram of exponentials vs. means.

val1 <- cbind(data.frame(values[1:noSim]), "Values")
names(val1) <- c("val", "tag")
val2 <- cbind(data.frame(means), "Means")
names(val2) <- c("val", "tag")
val12 <- rbind(val1, val2)

Analysis of Means of Random Variables and the Central Limit Theorem

Thomas Hill III

April 3, 2019