Overview

This is the first part of a project for the Statistical Inference course, which is part of Coursera's Data Science and Data Science: Statistics and Machine Learning Specializations.

The report investigates the sample of exponential distribution means and compares it with the conclusion of the Central Limit Theorem, viz.: the distribution of averages of \(iid\) variables approaches \(\bar X_{n} \sim N(\mu, \frac{\sigma^2}{n})\) as the sample size \(n\) increases.

Here, a thousand random variables are simulated, each equal to the average of \(n=40\) exponentials with rate \(\lambda=0.2\), and the properties of this distribution are illustrated.
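
For the exponential distribution both the mean and the standard deviation equal \(1/\lambda\), so with the parameters above the CLT statement specializes to the following (these are the theoretical values used in Sections 1 and 2 below):

\[
\mu = \sigma = \frac{1}{\lambda} = \frac{1}{0.2} = 5,
\qquad
\bar X_{40} \approx N\left(\mu, \frac{\sigma^{2}}{n}\right) = N\left(5, \frac{25}{40}\right) = N(5, 0.625)
\]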


  • Code chunks can be displayed by clicking the Code button

library(data.table); library(ggplot2)

Simulations

  • To ensure reproducibility, the seed was set to \(2112\)
  • Then, the required sample was created:
    • averages of \(1000\) simulated exponential samples (\(nsim=1000\)):
      • each sample is of size \(n=40\),
      • the rate parameter is \(\lambda = 0.2\)

This was implemented with the replicate() and rexp() functions.

  • the means were saved in the data table esample (Appendix: code simulation); a condensed version of that chunk is shown below
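
A condensed, commented version of the Appendix chunk simulation:

library(data.table)

set.seed(2112)   # fixed seed for reproducibility
lambda <- 0.2    # rate parameter of the exponential distribution
n      <- 40     # size of each simulated exponential sample
nsim   <- 1000   # number of simulated samples

# each replication draws n exponentials and stores their mean,
# so esample ends up holding nsim sample means in the column 'mean'
esample <- data.table(mean = replicate(nsim, mean(rexp(n, lambda))))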

1. Sample Mean vs Theoretical Mean

The sample mean is 5.0012, while the theoretical mean, calculated as \(1/\lambda\), is 5. So the difference \(theoretical\ mean - sample\ mean =\) -0.0012, i.e. the sample mean and the theoretical mean match up closely (Appendix: code mean).

Sample Mean   Theoretical Mean   Difference
5.0012        5                  -0.0012

The simulations are explored with a histogram of the observation averages, with the values of the sample mean and the theoretical mean highlighted (Appendix: code mplot).

The plot also shows that the center of the distribution of averages of 40 exponentials is very close to the theoretical center of the distribution.

2. Sample Variance vs Theoretical Variance

The sample variance is 0.6201, while the theoretical variance, calculated as \(VAR(\bar X_{n}) = \frac{1}{\lambda^2 n}\), is 0.625. So the difference \(theoretical\ variance - sample\ variance =\) 0.0049, i.e. the variance of the sample means and the theoretical variance match up closely (Appendix: code variance).

Sample Mean Variance   Theoretical Variance   Difference
0.6201                 0.625                  0.0049
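
The theoretical value follows directly from the simulation parameters:

\[
VAR(\bar X_{n}) = \frac{\sigma^{2}}{n} = \frac{1}{\lambda^{2} n} = \frac{1}{0.2^{2}\cdot 40} = \frac{1}{1.6} = 0.625 = \frac{5}{8}
\]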

The simulations are explored with the density of the observation means, overlaid with a density curve fitted to the data and the theoretical normal density \(N(\mu, sd^2)\), where \(\mu=1/\lambda\) and \(sd=\frac{1}{\lambda \sqrt n}\) (Appendix: code varplot).

The plot also shows that the variance of the sample mean is very close to the theoretical variance of the distribution.

3. Distribution: approximately normal

First, as the plots above show, the sample data are quite symmetric and not skewed, and the sample mean and variance are close to their theoretical values.

Second, the sample mean, median and mode match up closely to each other (Appendix: code mmm):

Sample Mode   Sample Median   Sample Mean
5.204         4.9779          5.0012

And finally, the q-q plot comparing the theoretical and observed quantiles shows a nearly linear pattern (Appendix: code qqplot):

All of the above leads to the conclusion that the simulated distribution of 1000 averages of 40 random exponential variables with \(\lambda=0.2\) approaches the normal distribution \(N(5, \frac{5}{8})\). So, it is shown how the Central Limit Theorem works.
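
As an additional, purely illustrative check (not part of the original analysis), a few sample quantiles of the simulated means can be compared with the corresponding quantiles of \(N(5, \frac{5}{8})\); close agreement supports the approximate normality seen in the q-q plot:

# assumes esample, lambda and n from the Appendix chunk simulation
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
round(quantile(esample[, mean], probs), 4)                        # observed quantiles
round(qnorm(probs, mean = 1/lambda, sd = 1/(lambda*sqrt(n))), 4)  # theoretical quantiles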

Appendix

simulation

set.seed(2112)
lambda <- 0.2
n <- 40
nsim <- 1000
esample <- data.table(mean = replicate(nsim, mean(rexp(n, lambda))))

mean

smean <- mean(esample[,mean])
tmean <- 1/lambda
diff <- tmean - smean

mplot

line.data <- data.table(x = c(smean,tmean),
                        means = c("Simulation","Theoretical"),
                        stringsAsFactors = FALSE)
mplot <- ggplot(esample, aes(mean)) +
        geom_histogram(colour = "darkgrey", fill = "cornflowerblue",binwidth = 0.2) +
        geom_vline(aes(xintercept = x, colour = means), line.data,
                   size=c(3,1.5)) +
        labs(title = "Sample Mean versus Theoretical Mean",
             subtitle = "Histogram of observations means") +
        scale_colour_manual(values = c("violet","purple")) +
        guides(colour = guide_legend(override.aes = list(size = 2)))

variance

svar <- var(esample[, mean])   # variance of the simulated sample means
tvar <- 1/(lambda^2*n)         # theoretical variance of the sample mean

varplot

varplot <- ggplot(esample, aes(x=mean)) +
        geom_histogram(colour = "darkgrey", aes(y=..density..),
                       fill = "cornflowerblue", binwidth = 0.5) +
        geom_vline(aes(xintercept = x, colour = means), line.data,
                   size=c(3,1.5)) +
        geom_density(size = 2, color = "violet") +
        stat_function(fun = dnorm, args = list(mean = tmean, sd = sqrt(tvar)),
                                 colour = "purple", size=1.5) +
        labs(title = "Sample Mean Variance versus Theoretical Variance",
             subtitle = "Density of observations means") +
        scale_colour_manual(values = c("violet","purple")) +
        guides(colour = guide_legend(override.aes = list(size = 2)))

mmm

# mode: the most frequent value of a vector (the first such value in case of ties)
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}
smode <- getmode(esample[,mean])
smed <- median(esample[,mean])

qqplot

qqnorm(esample[,mean], col="cornflowerblue",
                 main = "Sample Quantiles versus Theoretical Quantiles")
qqline(esample[,mean], col = "brown", lwd = 3)