Simulations

This section simulates 1000 means of random exponentials, i.e. of random variables sampled from an exponential distribution with the given rate of 0.2. The R function “rexp” is called 1000 times, each time with argument “rate” set to 0.2 and argument “n” (sample size) set to 40, so every call returns a vector of 40 random exponentials. The results are saved in a 1000 x 40 matrix, and the means of its 1000 rows (the simulation means) are stored in a vector of length 1000. This vector is compared below against the theoretical values and distribution.

library(ggplot2)
library(gridExtra)

lambda <- 0.2

simTotal <- 1000
sampleSize <- 40
expMatrix <- matrix(nrow = simTotal, ncol = sampleSize)

set.seed(1)
for (sim in 1:simTotal) {
    expMatrix[sim,] <- rexp(n = sampleSize, rate = lambda)
}
#write.csv(expMatrix, "expMatrix.csv")

# Preallocate a numeric vector for the row means
avgVector <- numeric(simTotal)
for (sim in 1:simTotal) {
    avgVector[sim] <- mean(expMatrix[sim,])
}
#write.csv(avgVector, "avgVector.csv")

The 1000 simulation means have been stored in “avgVector”, to be used in the comparisons below.
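
As a cross-check, the same simulation can be sketched without explicit loops. This is an alternative formulation, not the code used for the results in this report. It assumes that a single call to “rexp” draws from the same random number stream as 1000 separate calls, so that filling the matrix row by row gives the same values. The names “expMatrixVec” and “avgVectorVec” are introduced here for illustration only.

```r
lambda <- 0.2
simTotal <- 1000
sampleSize <- 40

# Draw all 40000 values in one call and fill the matrix row by row,
# so that row i holds the i-th sample of 40 exponentials.
set.seed(1)
expMatrixVec <- matrix(rexp(n = simTotal * sampleSize, rate = lambda),
                       nrow = simTotal, ncol = sampleSize, byrow = TRUE)

# rowMeans() computes the mean of every row at once.
avgVectorVec <- rowMeans(expMatrixVec)
```

The loop version makes the "one call per simulation" structure explicit, while this form is shorter and usually faster; either is reasonable for a simulation of this size.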

————————————————————————————————–

Sample Mean vs. Theoretical Mean

theo_mean <- 1/lambda
print(theo_mean)

## [1] 5

sim_mean <- mean(avgVector)
print(sim_mean)

## [1] 4.990025

The mean of the 1000 simulated means, i.e. the simulation mean (4.99), is within 0.01 (about 0.2%) of the theoretical mean 1/lambda (5.00).
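
One way to judge the size of this gap is to compare it with the standard error expected for an average of 1000 simulated means. The sketch below reuses the values printed above; “mc_se” and “z” are names introduced here for illustration, and 0.625 is the theoretical variance of a single mean of 40 exponentials, (1/lambda)^2 / 40.

```r
theo_mean <- 5          # 1/lambda, as printed above
sim_mean  <- 4.990025   # mean of the 1000 simulated means, as printed above
theo_var  <- 0.625      # theoretical variance of one mean of 40 exponentials
simTotal  <- 1000

# Standard error of the average of simTotal independent simulated means
mc_se <- sqrt(theo_var / simTotal)

# Gap between simulation and theory, in units of that standard error
z <- (sim_mean - theo_mean) / mc_se

print(mc_se)
print(z)
```

The standard error is 0.025, so the observed gap of about 0.01 is roughly 0.4 standard errors, which is consistent with ordinary simulation noise.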

————————————————————————————————–

Sample Variance vs. Theoretical Variance

theo_var <- ((1/lambda)^2)/sampleSize
print(theo_var)

## [1] 0.625

sim_var <- var(avgVector)
print(sim_var)

## [1] 0.6111165

The variance of the 1000 simulated means, i.e. the simulation variance (0.611), is within about 2% of the theoretical variance of the exponential distribution divided by the size of each sample (25/40 = 0.625).
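
For reference, both theoretical values follow from standard properties of the exponential distribution and of means of independent draws. The notation is introduced here for illustration: X_1, ..., X_n are independent exponentials with rate lambda = 0.2, n = 40, and X-bar is their mean.

```latex
\begin{aligned}
E[X_i] &= \frac{1}{\lambda} = 5, &
\operatorname{Var}(X_i) &= \frac{1}{\lambda^2} = 25, \\
E[\bar{X}] &= \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{\lambda} = 5, &
\operatorname{Var}(\bar{X}) &= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{1}{n\lambda^2} = \frac{25}{40} = 0.625.
\end{aligned}
```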

————————————————————————————————–

Distribution of 1000 Random Exponentials vs. Distribution of 1000 Simulated Means

# Draw 1000 individual exponentials for comparison with the 1000 means
set.seed(1)
expVector <- rexp(n = 1000, rate = lambda)
#write.csv(expVector, "expVector.csv")

gg1 <- 
    ggplot(as.data.frame(expVector), aes(x = expVector)) +
    geom_histogram(binwidth = 2, color = 'black', fill = 'white') +
    xlab('Random Exponential') +
    ylab('Frequency') +
    ggtitle('Random Exponential Distribution') +
    theme(text = element_text(size = 8))

gg2 <- 
    ggplot(as.data.frame(avgVector), aes(x = avgVector)) +
    geom_histogram(binwidth = 0.25, color = 'black', fill = 'white') +
    xlab('Mean of 40 Random Exponentials') +
    ylab('Frequency') +
    ggtitle('Simulated Mean Distribution') +
    theme(text = element_text(size = 8))

#grid.arrange(gg1, gg2, ncol = 2)

grid.arrange(ggplotGrob(gg1), 
             ggplotGrob(gg2), 
             ncol = 2, 
             widths = unit(c(0.5, 0.5), "npc"))

The distribution of the 1000 simulated means is roughly symmetric and bell-shaped around 5, and so looks far more Gaussian than the strongly right-skewed distribution of 1000 random exponentials.
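
The visual impression can be backed by a number. The sketch below computes the sample skewness, which is 0 for a Gaussian distribution and 2 for an exponential distribution; for a mean of 40 independent exponentials the theoretical skewness is 2/sqrt(40), about 0.32. The helper “skewness” and the names “simMeans” and “rawDraws” are defined here for illustration and are not part of the original analysis; the simulation is regenerated so that the block runs on its own.

```r
# Hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
    m <- mean(x)
    mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

lambda <- 0.2

# 1000 samples of 40 exponentials, one sample per row
set.seed(1)
simMatrix <- matrix(rexp(n = 1000 * 40, rate = lambda),
                    nrow = 1000, byrow = TRUE)
simMeans <- rowMeans(simMatrix)

# 1000 individual exponentials
set.seed(1)
rawDraws <- rexp(n = 1000, rate = lambda)

skew_raw <- skewness(rawDraws)     # individual exponentials
skew_means <- skewness(simMeans)   # means of 40 exponentials
print(c(raw = skew_raw, means = skew_means))
```

A skewness for the means that is much closer to 0 than the skewness of the individual draws points the same way as the histograms. A normal quantile plot (base R “qqnorm”) would be another way to check this.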

————————————————————————————————–

Conclusion

The central limit theorem as described in the Overview is supported by:

- Close proximity of the simulation mean to the theoretical mean,

- Close proximity of the simulation variance to the theoretical variance divided by sample size, and

- The approximately Gaussian shape of the distribution of simulation means.

The Central Limit Theorem (CLT) illustrated by multiple sampling from an exponential distribution

Statistical Inference Project - Data Science Specialization @ Coursera.org (Johns Hopkins University)

Author: Mercia Carolina Wentzel
