Versions & Dates

R Version: 3.5.1

RStudio Version: 1.1.456

Date Authored: 2018-08-11

Overview: Statistical Inference Course Project I

Overview: Part I of the course project for Statistical Inference, one of ten courses for the Data Science Specialization by Johns Hopkins University, aims to demonstrate the Central Limit Theorem (CLT) by simulating 1,000 randomly-generated sets of 40 exponential distributions and comparing the “theoretical” (provided) mean and standard deviation to the “sample” means and variances of the “sample” (simulated) distributions.

Objectives: The present analysis provides code and narrative to store provided parameters, e.g. lambda, simulate exponential distribution means and variances, compare “sample” and “theoretical” means and variances, and provide visualizations both for comparison to theoretical and sample statistics, as well as comparison to Gaussian (Normal) Distribution demonstrating CLT.

Packages: Tidyverse packages dplyr and ggplot2, as well as package knitr, if undetected, are installed automatically. Both packages are loaded with function library(). Package versions are provided dynamically below:

Versions:

Simulations

Assigning Provided Parameters

The following calls function set.seed() to ensure all quasi-randomly-generated numbers are exactly replicated for reproducibility. lambda is defined using provided “theoretical” parameters, as well as the provided number of exponential distributions, n_exp, and provided number of simulations, n_sim:

set.seed(1)
lambda <- 0.2
n_exp <- 40
n_sim <- 1000

Simulating Exponential Distributions

Base function rexp() in conjunction with primitive methods mean() and var() to 1,000 means and variances of 40 exponential distributions. Objects sim_mns and sim_vrs are initialized to receive simulated mean values and variance values in the subsequent loop function, respectively:

sim_mns <- NULL                                      
sim_vrs <- NULL
 
for (i in 1:n_sim){
  sim_mns <- c(sim_mns, mean(rexp(n = n_exp, rate = lambda)))
  sim_vrs <- c(sim_vrs, var(rexp(n = n_exp, rate = lambda))) 
}

Objects sim_mns and sim_vrs now contain 1,000 values, the first 6 of which are demonstrated with calls to function head():

head(sim_mns)
## [1] 4.860372 4.279204 5.196446 5.995896 4.027864 4.874493
head(sim_vrs)
## [1] 25.85143 17.12385 16.49944 23.09948 14.74029 16.35456

Sample vs. Theoretical Mean

Overview

The mean of “Sample” (simulated) means and “Theoretical” (provided) mean are defined in objects sample_mn and theory_mn, respectively. Following, the absolute difference and percent difference are calculated and stored in objects abs_diff_mn and per_diff_mn, respectively:

sample_mn <- mean(sim_mns)
theory_mn <- 1/lambda  

abs_diff_mn <- abs(theory_mn - sample_mn)
per_diff_mn <- abs_diff_mn / theory_mn * 100

Findings

The Sample Mean, Theoretical Mean, Absolute Difference, and Percent Difference (%) are shown in Table 1:

Table 1: Sample v. Theoretical Mean Values
Sample Mean Theoretical Mean Absolute Difference Percent Difference (%)
4.996854 5 0.0031457 0.0629148

In sum, the mean value of our 1,000 “Sample” (simulated) distributions of 40 exponentials is remarkably similar to the “Theoretical” (provided) mean, 1 / lambda, with a percent difference of 0.0629148 %.

Visualizations

In order to visualize the distribution of simulated means in ggplot2, they are coerced into data.frame object mean_df. Figure 1 depicts a histogram of the distribution of means of simulated exponential distributions is provided below, including the simulated “Sample” mean and provided “Theoretical” mean. Note the remarkable similarity between both values and the Gaussian shape of the distribution attributable to the Central Limit Theorem:

Figure 1 depicts the remarkable similarity between the mean value of simulated “Sample” means and the provided “Theoretical” mean.

Sample vs. Theoretical Variance

Overview

The mean of “Sample” (simulated) variance and “Theoretical” (provided) variance are defined in objects sample_var_mn and theory_var, respectively. Following, the absolute difference and percent difference are calculated and stored in objects abs_diff_var and per_diff_var, respectively.

sample_var_mn <- mean(sim_vrs)
theory_var <- (1 / lambda)^2

abs_diff_var <- abs(theory_var - sample_var_mn)
per_diff_var <- abs_diff_var / sample_var_mn * 100

Findings

The Sample Variance, Theoretical Variance, Absolute Difference, and Percent Difference (%) are shown in Table 2:

Table 2: Sample v. Theoretical Variance Values
Sample Variance Theoretical Variance Absolute Difference Percent Difference (%)
25.68205 25 0.6820479 2.655738

In sum, the mean of the variances of our 1,000 “Sample” (simulated) distributions of 40 exponentials is remarkably similar to the “Theoretical” (provided) variance, 1 / lambda, with a percent difference of 2.6557378 %.

Visualizations

In order to visualize the distribution of simulated variances in ggplot2, they are coerced into data.frame object vars_df, while the provided “Theoretical” variance is replicated and stored in data.frame object theory_vars_df. Figure 2 depicts a histogram of the distribution of variances of simulated exponential distributions is provided below, including the simulated “Sample” variance and provided “Theoretical” variance. Note the remarkable similarity between both values:

Figure 2 depicts the remarkable similarity between the mean value of simulated “Sample” variance and provided “Theoretical” variance.

Approximate Normality

Visualizations

While the latter of the previous histograms depicts the distribution of average variances for our 1,000 simulations of exponential distributions, which somewhat resembled a right-skewed Gaussian (Normal) distribution, the former is remarkably Gaussian. Observe Figure 3, overlaid with a distribution (labeld “Normal”) with “Theoretical” (provided) parameters and adjusted for “Sample” (simulated) size, while juxtaposed with probability density function geom_density() for the distribution of simulation means (labeled “Sample”):

Figure 3 demonstrates that the mean values of each simulation clearly approximate a Gaussian or Normal distribution, represented by the orange “Normal” curve and “Sample” probability density function.

Discussion & Conclusions

Despite our 1,000 simulations being comprised of 40 randomly-generated exponential values, the mean values of each simulation are approximately normally distributed (demonstrated in Figure 3), demonstrating the Central Limit Theorem. According to Wikipedia:

“The central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a”bell curve“) even if the original variables themselves are not normally distributed.”

Moreover, Wikipedia provides an example of CLT that is identical to the above use case:

“Suppose that a sample is obtained containing a large number of observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic average of the observed values is computed. If this procedure is performed many times, the central limit theorem says that the computed values of the average will be distributed according to a normal distribution.”

In sum, in obtaining a large number of sets of randomly-generated values and calculating the arithmetic mean of each set, we demonstrate CLT, emphasized above in Figure 3. Further information on CLT and source of the above quotes are found on the Wikipedia page “Central Limit Theorem”.

Appendix

Tables

Table 1 Code: “Sample v. Theoretical Mean Values”

mn_table <- data.frame("Sample Mean" = sample_mn,
                       "Theoretical Mean" = theory_mn,
                       "Absolute Difference" = abs_diff_mn,
                       "Percent Difference (%)" = per_diff_mn,
                       row.names = "")

Table 2 Code: “Sample v. Theoretical Variance Values”

vars_table <- data.frame("Mean Sample Variance" = sample_var_mn,
                         "Theoretical Variance" = theory_var,
                         "Absolute Difference" = abs_diff_var,
                         "Percent Difference" = per_diff_var,
                         row.names = "")

Figures

Figure 1 Code: “Theoretical v. Sample Means of 1,000 Simulations”

mean_df <- data.frame("Simulations" = sim_mns)
 
ggplot(data = mean_df, aes(x = Simulations)) +
  geom_histogram(binwidth = 0.1, fill = "lightgrey") +
  geom_vline(aes(xintercept = 5L, color = "Theoretical"), lwd = 1.5, 
             linetype = 5, show.legend = TRUE) +
  geom_vline(aes(xintercept = mean(Simulations), color = "Sample"), lwd = 1.5, 
             linetype = 3, show.legend = TRUE) +
  ggtitle(label = "Figure 1: Theoretical v. Sample Means of 1,000 Simulations", 
          subtitle = "40 Exponential Distributions per Simulation") +
  labs(y = "Count", color = "Mean") +
  theme_classic()  

Figure 2 Code: “Theoretical v. Sample Variance of 1,000 Simulations”

vars_df <- data.frame("Variances" = sim_vrs)
theory_vars_df <- data.frame("Variances" = rep(theory_var, n_sim))
 
ggplot(data = vars_df, aes(x = Variances)) +
  geom_histogram(bins = 50, fill = "lightgrey") +
  geom_vline(aes(xintercept = mean(Variances), color = "Sample"), 
             lwd = 1.5, linetype = 5, show.legend = TRUE) +
  geom_vline(aes(xintercept = 25L, color = "Theoretical"), lwd = 1.5, 
             linetype = 3, show.legend = TRUE) +
  ggtitle(label = "Figure 2: Theoretical v. Sample Variance of 1,000 Simulations", 
          subtitle = "40 Exponential Distributions Each") +
  guides(color = guide_legend(title = "Variance")) +
  labs(ylab("Count")) +
  theme_classic()

Figure 3 Code: “Sample Mean Approximation to Gaussian Distribution”

distro <- ggplot(data = mean_df, aes(x = Simulations))
distro <- distro +
  geom_histogram(bins = 50, fill = "skyblue", alpha = 0.5, 
                 aes(y = ..density.., fill = ..count.. )) +
  stat_function(fun = dnorm, color = "darkorange1", lwd = 1.05, 
                args = list(mean = mean(mean_df$Simulations),
                            sd = sd(mean_df$Simulations))) +
  geom_density(color = "steelblue2", lwd = 1.05)

distro <- distro +
  ggtitle(label = "Figure 3: Sample Mean Approximation to Gaussian Distribution", 
          subtitle = "Sample Histogram and Density v. Normal Distribution") +
  labs(x = "Density", y = "Simulation Mean") +
  guides(fill = FALSE ) +
  theme_classic()

distro <- distro +
  annotate(geom = "text", x = 5.7, y = 0.45, color = "darkorange1", label = "Normal") +
  annotate(geom = "text", x = 6, y = 0.35, color = "steelblue2", label = "Sample")

distro