Overview

In this project, part of the Johns Hopkins Data Science Specialization, we compare two well-known distributions, the exponential and the normal, and explore how they are connected through the Central Limit Theorem (CLT).

Simulations

We set a seed for reproducibility and define variables for the number of simulations to perform (1000), the rate parameter of the exponential distribution (lambda = 0.2), and the size of each sample of IID draws (n = 40).
require("ggplot2")
## Loading required package: ggplot2
set.seed(1729)
n.sims <- 1000   #number of simulated samples
lambda <- 0.2    #rate parameter of the exponential distribution
n <- 40          #number of draws per sample
#num.sims stores the mean of each simulated sample
num.sims <- numeric(n.sims)
for (i in seq_len(n.sims)) {
    #draw one sample of n IID exponential values
    draws <- rexp(n = n, rate = lambda)
    #write the mean of the sample to the index of the iteration
    num.sims[i] <- mean(draws)
}
meanOfMeans <- mean(num.sims)
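
The loop above can also be written without explicit iteration. The following is a minimal sketch, not the code used to produce the results reported here; the names `n.sims`, `sim.draws` and `sim.means` are introduced only for illustration.

```r
# Illustrative sketch: a vectorized version of the simulation loop.
# n.sims, sim.draws and sim.means are hypothetical names, not part of the original analysis.
set.seed(1729)
lambda <- 0.2   # rate parameter of the exponential distribution
n <- 40         # draws per sample
n.sims <- 1000  # number of simulated samples

# One row per simulated sample, one column per draw
sim.draws <- matrix(rexp(n.sims * n, rate = lambda),
                    nrow = n.sims, ncol = n, byrow = TRUE)

# The mean of each row is the mean of one sample of n draws
sim.means <- rowMeans(sim.draws)

length(sim.means)
mean(sim.means)
```

Storing all draws in a matrix trades memory for brevity; for 1000 samples of 40 draws this is negligible.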

Sample Mean vs. Theoretical Mean

\(\bar X_n\) is approximately \(N(\mu, \sigma^2 / n)\)

The law of large numbers tells us that as we collect more and more observations (in this context, means of IID draws from an exponential distribution), the sample mean (the mean of means) converges to its estimand, the population or theoretical mean. In this case, the population mean is 1/lambda = 5 (where lambda = 0.2), and our estimate, the mean of the collected means, is about 4.998.
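
The value of the theoretical mean follows from linearity of expectation. As a short worked step, writing \(X_1, \dots, X_n\) for the IID exponential draws in one sample (notation introduced here for illustration), each with mean \(1/\lambda\):

\[
E[\bar X_n] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \frac{1}{\lambda} = \frac{1}{\lambda} = \frac{1}{0.2} = 5
\]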

We observe this convergence in our data: the mean of the 1000 sample means is 4.998, very close to the population mean 1/lambda = 5. We can also see evidence of the Central Limit Theorem by plotting the density of the sample means, with a vertical blue line marking the theoretical mean (5). The density is roughly bell-shaped and centered near 5; with a larger sample size ‘n’, it would look even closer to a normal distribution.

print(1/lambda)
## [1] 5
print(meanOfMeans)
## [1] 4.997828
ggplot(as.data.frame(num.sims), aes(x = num.sims)) + geom_density() + geom_vline(xintercept = 1/lambda, size = .75, color = "blue") + labs(x = "sample mean", y = "density")
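
The claim that the density of the means looks more normal as ‘n’ grows can be checked numerically. The sketch below is an illustration under assumptions of our own: it uses skewness as a rough measure of departure from normality (a normal distribution has skewness 0), picks the sample sizes 5, 40 and 200 arbitrarily, and defines a small helper `skewness()` that is not part of the original analysis. For the mean of n IID exponential draws the theoretical skewness is 2/sqrt(n), which is included for comparison.

```r
# Illustrative sketch: skewness of simulated sample means for several sample sizes.
# skewness(), sample.sizes and skew.by.n are hypothetical names used only here.
set.seed(1729)
lambda <- 0.2
n.sims <- 1000

# Moment-based sample skewness (third central moment over variance^1.5)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

sample.sizes <- c(5, 40, 200)  # arbitrary choices for illustration

skew.by.n <- sapply(sample.sizes, function(n) {
  # Simulate n.sims sample means, each from n exponential draws
  means <- replicate(n.sims, mean(rexp(n, rate = lambda)))
  skewness(means)
})

data.frame(n = sample.sizes,
           simulated = skew.by.n,
           theoretical = 2 / sqrt(sample.sizes))
```

Skewness that shrinks toward 0 as n increases is what we would expect if the distribution of means is approaching a normal shape.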

Sample Variance vs. Theoretical Variance

As we simulate more and more samples and take their means, the mean of the sample means converges to the population mean, and the variance of the sample means likewise converges to its theoretical value, sigma^2/n = 1/(lambda^2 * n). Agreement between the two is consistent with the \(N(\mu, \sigma^2 / n)\) approximation given by the Central Limit Theorem.
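
As a short worked step, the theoretical variance of the sample mean can be derived from the independence of the draws. Here \(X_1, \dots, X_n\) denote the IID exponential draws in one sample (notation introduced for illustration), each with variance \(\sigma^2 = 1/\lambda^2\):

\[
\mathrm{Var}(\bar X_n) = \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n} = \frac{1}{\lambda^2 n} = \frac{25}{40} = 0.625,
\qquad
\sqrt{\mathrm{Var}(\bar X_n)} = \frac{1}{\lambda \sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906
\]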

ssd <- 1/(lambda*sqrt(n)) #theoretical standard deviation of the sample mean, sigma/sqrt(n)
ssd
## [1] 0.7905694
psd <- sd(num.sims) #observed standard deviation of the simulated sample means
psd
## [1] 0.7772278

The theoretical standard deviation of the sample mean is 1/(lambda * sqrt(n)), or about 0.79, whereas the observed standard deviation of the simulated sample means is about 0.78. The two values are close.

Further, squaring the standard deviations gives the corresponding variances: 0.625 in theory and 0.604 in the simulation. The latter is, of course, simply the variance of the simulated means stored in num.sims.

sample_var <- ssd^2 #theoretical variance of the sample mean, 1/lambda^2/n
sample_var
## [1] 0.625
pop_var <- psd^2 #observed variance of the simulated sample means, same as var(num.sims)
pop_var
## [1] 0.604083
var(num.sims)
## [1] 0.604083

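Finally, we overlay two curves: a normal density (in blue) with mean equal to the mean of the simulated means and standard deviation 1/(lambda * sqrt(n)), and the density of our simulated sample means. The goldenrod vertical line marks the mean of the simulated means. The two curves are close, showing that the distribution of means of exponential draws is approximately normal.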

ggplot(as.data.frame(num.sims), aes(x = num.sims)) + geom_density() + stat_function(geom = "line", fun = dnorm, args = list(mean = meanOfMeans, sd = 1/(lambda*sqrt(n))), size = 1.5, color = "blue") + geom_vline(xintercept = meanOfMeans, color = "goldenrod")
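
The visual comparison can be complemented with a simple numeric check. The sketch below is an illustration rather than part of the original analysis: it re-simulates the sample means and computes the share that fall within 1.96 theoretical standard deviations of the theoretical mean. If the normal approximation is reasonable, this share should be close to 0.95. The names `sim.means`, `mu`, `se` and `coverage` are hypothetical.

```r
# Illustrative sketch: how often do simulated means fall inside the
# interval implied by the normal approximation N(mu, sigma^2 / n)?
set.seed(1729)
lambda <- 0.2
n <- 40
n.sims <- 1000

# Simulate n.sims sample means, each from n exponential draws
sim.means <- replicate(n.sims, mean(rexp(n, rate = lambda)))

mu <- 1 / lambda               # theoretical mean
se <- 1 / (lambda * sqrt(n))   # theoretical standard deviation of the sample mean

# Proportion of means within qnorm(0.975) (about 1.96) standard deviations of mu
coverage <- mean(abs(sim.means - mu) <= qnorm(0.975) * se)
coverage
```

A normal Q-Q plot of the simulated means (for example with `qqnorm()` and `qqline()`) would be another way to look at the same question.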

Conclusion

We have demonstrated that, even when sampling from a non-normal distribution, the distribution of sample means is approximately normal for a sufficiently large sample size (n). The mean of the simulated sample means was close to the population mean, and their variance was close to the theoretical variance of the sample mean, sigma^2/n. The underlying exponential draws remain non-normal; it is the distribution of their means that approaches the normal. This simulation illustrates both the consistency of the sample mean and the Central Limit Theorem.