Statistical Inference: Part 1 - Simulation Exercise

Overview

This paper assesses the Central Limit Theorem (CLT) prediction that the means of sets of random samples drawn from the same parent distribution will themselves tend to a normal distribution. It does so by reporting the results of simulations in which random samples are drawn from the exponential distribution as implemented in R. According to the Law of Large Numbers (LLN) and the CLT, the distribution of the means of these sets will tend toward a normal distribution with \(\mu_{means} \approx \mu_{pdf}\) and \(\sigma_{means} \approx \frac{\sigma_{pdf}}{\sqrt{n}}\), where \(\mu_{pdf}\) and \(\sigma_{pdf}\) are the mean and standard deviation of the parent distribution and \(n\) is the size of each sample set.

Simulations

The Probability Density Function (pdf) for an exponential distribution is given by \(f(x) = \lambda {e}^{- \lambda x}\) for \(x \geq 0\), where \(\mu_{pdf} = \sigma_{pdf} = \frac{1}{\lambda} = 5\) (assuming \(\lambda = 0.2\)). In R, random samples from the exponential distribution are generated by the function rexp(n, rate), where n is the number of random values to be returned and rate is the value of \(\lambda\) (the pdf itself is available as dexp). The following code chunk first sets the working directory, the random number seed (to ensure results are reproducible) and the parameters controlling the random sample generation. It then computes and displays a histogram of a random sample of 1000 values from the exponential distribution and a histogram of the means of 1000 sets of 40 random samples from the exponential distribution.

## This code chunk has been redacted since it contains a solution to Project Q1 for DSS:SI.
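The simulation described above can be sketched along the following lines. This is a minimal illustration and not the redacted chunk: the seed value, the variable names and the plotting choices (bin counts, colours) are all assumptions, and the working-directory step is omitted.

```r
# Illustrative sketch only; the seed and all variable names are assumptions.
set.seed(1234)           # arbitrary seed, fixed so the run is reproducible

lambda <- 0.2            # rate parameter of the exponential distribution
n      <- 40             # size of each sample set
nsets  <- 1000           # number of sample sets (and size of the single sample)

# A single sample of 1000 exponential variates.
single_sample <- rexp(nsets, rate = lambda)

# 1000 sets of 40 variates each: one set per row, then the mean of each row.
sets  <- matrix(rexp(nsets * n, rate = lambda), nrow = nsets, ncol = n)
means <- rowMeans(sets)

# Side-by-side density histograms with the theoretical curves overlaid.
op <- par(mfrow = c(1, 2))
hist(single_sample, breaks = 40, freq = FALSE, col = "lightblue",
     main = "1000 exponential draws", xlab = "x")
curve(dexp(x, rate = lambda), add = TRUE, lty = 2)
hist(means, breaks = 40, freq = FALSE, col = "lightblue",
     main = "Means of 1000 sets of 40", xlab = "mean")
curve(dnorm(x, mean = 1 / lambda, sd = (1 / lambda) / sqrt(n)),
      add = TRUE, lty = 2)
par(op)
```

Storing one set per matrix row lets rowMeans compute all 1000 means in a single vectorised call, which avoids an explicit loop.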

The left hand side (LHS) histogram in the figure above shows the probability distribution of a single set of 1000 random numbers sampled from the exponential distribution with \(\lambda = 0.2\). As expected, the distribution (blue bars) of this sample approximately follows an exponential curve. To illustrate this, the dashed black line overlaid on the LHS histogram shows the pdf for an exponential distribution with \(\lambda = 0.2\).

The right hand side (RHS) histogram in the figure above shows the probability distribution of the measured mean values of 1000 sets of 40 random numbers sampled from the exponential distribution with \(\lambda = 0.2\) (blue bars). It is clear from the RHS histogram that the means are approximately normally distributed around a value of 5, as predicted by the CLT since \(\mu_{pdf} = \frac{1}{\lambda} = 5\). Also recall that the CLT predicts that the variance of the distribution of the means will be approximately \(\sigma_{means}^2 \approx \frac{\sigma_{pdf}^2}{n} = 0.625\) (for \(\sigma_{pdf} = \frac{1}{\lambda} = \frac{1}{0.2} = 5\) and \(n = 40\)). To illustrate this, the dashed black line overlaid on the RHS histogram shows the pdf for a normal distribution with \(\mu_{means} = 5\) and \(\sigma^2_{means} = 0.625\) (\(\sigma_{means} \approx 0.791\)). For a numerical assessment of the mean and variance values of the observed distribution, see the appendix to this report.
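For reference, the arithmetic behind the predicted variance and standard deviation of the means, using the values stated above (\(\lambda = 0.2\), \(n = 40\)), is:

\[
\sigma_{means}^2 \approx \frac{\sigma_{pdf}^2}{n} = \frac{(1/\lambda)^2}{n} = \frac{5^2}{40} = 0.625,
\qquad
\sigma_{means} \approx \sqrt{0.625} \approx 0.791
\]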

Normal Distribution Tests

A large number of tests may be used to assess whether an observed distribution is at least consistent with normal. It is not generally considered possible to prove that a distribution is normal; rather, what can be assessed are deviations from the normal distribution.

## Skewness = 0.306     Kurtosis = 3.253

## This code chunk has been redacted since it contains a solution to Project Q1 for DSS:SI.

Two procedures commonly used for assessing the normality of a distribution are Q-Q plots and the skewness-kurtosis (S-K) statistics. The code chunks above draw a Q-Q plot for the 1000 means of sets of 40 random numbers described above and compute their skewness and kurtosis. The Q-Q plot shows blue dots for the quantiles of the observed distribution of the 1000 means (y-axis) as a function of those expected from a normal distribution (x-axis). The overall linearity of the plot, by comparison with the dashed black line, suggests that the distribution of the means is a good approximation to normal. For a normal distribution the skewness (3rd standardized moment) is 0 and the kurtosis (4th standardized moment, a measure of tail weight) is 3. The simulation results give 0.306 and 3.253, respectively. The small positive skewness is expected at this sample size: the mean of \(n\) exponential samples follows a gamma distribution with skewness \(\frac{2}{\sqrt{n}} \approx 0.316\) and kurtosis \(3 + \frac{6}{n} = 3.15\) for \(n = 40\), both of which tend to the normal values as \(n\) increases. These S-K results are therefore close to the normal values, and the Q-Q plot and S-K results together support the view that the distribution of the means of the sets of random samples from an exponential distribution is approximately normal.
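A Q-Q plot and moment-based S-K statistics of this kind can be sketched in base R as follows. This is an illustration under stated assumptions, not the redacted chunk: the seed and variable names are hypothetical, the means are regenerated so the block runs on its own, and qqline (which draws through the first and third quartiles) may differ from the reference line used in the original figure.

```r
# Illustrative sketch; `means` is regenerated here so the block runs on its own.
set.seed(1234)                       # arbitrary seed (assumption)
lambda <- 0.2
n      <- 40
means  <- rowMeans(matrix(rexp(1000 * n, rate = lambda), nrow = 1000))

# Q-Q plot of the observed means against normal quantiles.
qqnorm(means, col = "blue", main = "Normal Q-Q plot of the 1000 means")
qqline(means, lty = 2)               # reference line through the quartiles

# Standardise with the population-style (divide by N) standard deviation,
# then take the 3rd and 4th moments. Kurtosis is non-excess: normal gives 3.
centred  <- means - mean(means)
z        <- centred / sqrt(mean(centred^2))
skewness <- mean(z^3)
kurtosis <- mean(z^4)
cat(sprintf("Skewness = %.3f     Kurtosis = %.3f\n", skewness, kurtosis))
```

Computing the moments directly keeps the sketch free of add-on packages; the exact figures depend on the seed, so they will not necessarily match the values reported above.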

Conclusion

In light of the results presented and discussed above, it is reasonable to conclude that the CLT prediction of a normal distribution for the means of sets of random samples from an exponential distribution is fulfilled.

Appendix: Mean & Variance as functions of Sample Size and Iterations

The \(\mu_{means}\) and \(\sigma_{means}^2\) shown by the means of the sets of random samples from an exponential distribution are computed and displayed as functions of random sample size and number of iterations using the following code chunk.

## LHS mean (1k x 40)      = 5.067                   RHS mean (1k x 40)     = 5.015
## LHS variance (1k x 40)  = 0.626                   RHS variance (1k x 40) = 0.614

## This code chunk has been redacted since it contains a solution to Project Q1 for DSS:SI.

Both the LHS and RHS \(\mu_{means}\) and \(\sigma_{means}^2\) are close to the expected values of 5 and 0.625, respectively, for 1000 means of sets of 40 random samples from an exponential distribution with \(\lambda = 0.2\). In the curves for \(\mu_{means}\) and \(\sigma_{means}^2\) plotted in the charts, the \(\mu_{means}\) values converge rapidly on the theoretical values in both charts (blue solid lines converge on black dashed lines). The convergence of \(\sigma_{means}^2\) in the LHS chart (red dotted line converging on black dashed line) is harder to see than the equivalent convergence in the RHS chart, but this is just a consequence of the varying sample sizes (i.e. changing \(n\) values) on the LHS, since the theoretical variance \(\frac{\sigma_{pdf}^2}{n}\) itself changes with \(n\).
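The dependence of the mean and variance on sample size and on number of iterations can be explored with a sketch such as the one below. It is not the redacted chunk: the helper sim_means, the seed, and the grids of sample sizes and iteration counts are all assumptions chosen for illustration, and the results are tabulated, not charted.

```r
# Illustrative sketch; seed, grids and the helper name are assumptions.
set.seed(1234)
lambda <- 0.2

# Hypothetical helper: means of `nsets` sets, each of `n` exponential draws.
sim_means <- function(nsets, n, rate) {
  rowMeans(matrix(rexp(nsets * n, rate = rate), nrow = nsets, ncol = n))
}

# (a) Vary the sample size n, with 1000 sets per size.
#     The theoretical variance (1 / lambda)^2 / n changes with n.
sizes   <- seq(5, 40, by = 5)
by_size <- t(sapply(sizes, function(n) {
  m <- sim_means(1000, n, lambda)
  c(n = n, mean = mean(m), variance = var(m),
    theory_var = (1 / lambda)^2 / n)
}))

# (b) Fix n = 40 and vary the number of sets (iterations).
#     The theoretical variance stays at 0.625 throughout.
iters   <- c(10, 50, 100, 500, 1000)
by_iter <- t(sapply(iters, function(k) {
  m <- sim_means(k, 40, lambda)
  c(iterations = k, mean = mean(m), variance = var(m),
    theory_var = (1 / lambda)^2 / 40)
}))

print(round(by_size, 3))
print(round(by_iter, 3))
```

Listing the theoretical variance beside each observed value makes the moving target in case (a) explicit, which is why convergence there is harder to judge by eye than in case (b).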