R Version: 3.5.1
RStudio Version: 1.1.456
Date Authored: 2018-08-11
Overview: Part I of the course project for Statistical Inference, one of ten courses for the Data Science Specialization by Johns Hopkins University, aims to demonstrate the Central Limit Theorem (CLT) by simulating 1,000 randomly-generated sets of 40 exponential distributions and comparing the “theoretical” (provided) mean and standard deviation to the “sample” means and variances of the “sample” (simulated) distributions.
Objectives: The present analysis provides code and narrative to store provided parameters, e.g. lambda, simulate exponential distribution means and variances, compare “sample” and “theoretical” means and variances, and provide visualizations both for comparison to theoretical and sample statistics, as well as comparison to Gaussian (Normal) Distribution demonstrating CLT.
Packages: Tidyverse packages dplyr and ggplot2, as well as package knitr, if undetected, are installed automatically. Both packages are loaded with function library(). Package versions are provided dynamically below:
Versions:
dplyr: 0.7.6knitr: 1.20ggplot2: 3.0.0The following calls function set.seed() to ensure all quasi-randomly-generated numbers are exactly replicated for reproducibility. lambda is defined using provided “theoretical” parameters, as well as the provided number of exponential distributions, n_exp, and provided number of simulations, n_sim:
set.seed(1)
lambda <- 0.2
n_exp <- 40
n_sim <- 1000
Base function rexp() in conjunction with primitive methods mean() and var() to 1,000 means and variances of 40 exponential distributions. Objects sim_mns and sim_vrs are initialized to receive simulated mean values and variance values in the subsequent loop function, respectively:
sim_mns <- NULL
sim_vrs <- NULL
for (i in 1:n_sim){
sim_mns <- c(sim_mns, mean(rexp(n = n_exp, rate = lambda)))
sim_vrs <- c(sim_vrs, var(rexp(n = n_exp, rate = lambda)))
}
Objects sim_mns and sim_vrs now contain 1,000 values, the first 6 of which are demonstrated with calls to function head():
head(sim_mns)
## [1] 4.860372 4.279204 5.196446 5.995896 4.027864 4.874493
head(sim_vrs)
## [1] 25.85143 17.12385 16.49944 23.09948 14.74029 16.35456
The mean of “Sample” (simulated) means and “Theoretical” (provided) mean are defined in objects sample_mn and theory_mn, respectively. Following, the absolute difference and percent difference are calculated and stored in objects abs_diff_mn and per_diff_mn, respectively:
sample_mn <- mean(sim_mns)
theory_mn <- 1/lambda
abs_diff_mn <- abs(theory_mn - sample_mn)
per_diff_mn <- abs_diff_mn / theory_mn * 100
The Sample Mean, Theoretical Mean, Absolute Difference, and Percent Difference (%) are shown in Table 1:
| Sample Mean | Theoretical Mean | Absolute Difference | Percent Difference (%) | |
|---|---|---|---|---|
| 4.996854 | 5 | 0.0031457 | 0.0629148 |
In sum, the mean value of our 1,000 “Sample” (simulated) distributions of 40 exponentials is remarkably similar to the “Theoretical” (provided) mean, 1 / lambda, with a percent difference of 0.0629148 %.
In order to visualize the distribution of simulated means in ggplot2, they are coerced into data.frame object mean_df. Figure 1 depicts a histogram of the distribution of means of simulated exponential distributions is provided below, including the simulated “Sample” mean and provided “Theoretical” mean. Note the remarkable similarity between both values and the Gaussian shape of the distribution attributable to the Central Limit Theorem:
Figure 1 depicts the remarkable similarity between the mean value of simulated “Sample” means and the provided “Theoretical” mean.
The mean of “Sample” (simulated) variance and “Theoretical” (provided) variance are defined in objects sample_var_mn and theory_var, respectively. Following, the absolute difference and percent difference are calculated and stored in objects abs_diff_var and per_diff_var, respectively.
sample_var_mn <- mean(sim_vrs)
theory_var <- (1 / lambda)^2
abs_diff_var <- abs(theory_var - sample_var_mn)
per_diff_var <- abs_diff_var / sample_var_mn * 100
The Sample Variance, Theoretical Variance, Absolute Difference, and Percent Difference (%) are shown in Table 2:
| Sample Variance | Theoretical Variance | Absolute Difference | Percent Difference (%) | |
|---|---|---|---|---|
| 25.68205 | 25 | 0.6820479 | 2.655738 |
In sum, the mean of the variances of our 1,000 “Sample” (simulated) distributions of 40 exponentials is remarkably similar to the “Theoretical” (provided) variance, 1 / lambda, with a percent difference of 2.6557378 %.
In order to visualize the distribution of simulated variances in ggplot2, they are coerced into data.frame object vars_df, while the provided “Theoretical” variance is replicated and stored in data.frame object theory_vars_df. Figure 2 depicts a histogram of the distribution of variances of simulated exponential distributions is provided below, including the simulated “Sample” variance and provided “Theoretical” variance. Note the remarkable similarity between both values:
Figure 2 depicts the remarkable similarity between the mean value of simulated “Sample” variance and provided “Theoretical” variance.
While the latter of the previous histograms depicts the distribution of average variances for our 1,000 simulations of exponential distributions, which somewhat resembled a right-skewed Gaussian (Normal) distribution, the former is remarkably Gaussian. Observe Figure 3, overlaid with a distribution (labeld “Normal”) with “Theoretical” (provided) parameters and adjusted for “Sample” (simulated) size, while juxtaposed with probability density function geom_density() for the distribution of simulation means (labeled “Sample”):
Figure 3 demonstrates that the mean values of each simulation clearly approximate a Gaussian or Normal distribution, represented by the orange “Normal” curve and “Sample” probability density function.
Despite our 1,000 simulations being comprised of 40 randomly-generated exponential values, the mean values of each simulation are approximately normally distributed (demonstrated in Figure 3), demonstrating the Central Limit Theorem. According to Wikipedia:
“The central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a”bell curve“) even if the original variables themselves are not normally distributed.”
Moreover, Wikipedia provides an example of CLT that is identical to the above use case:
“Suppose that a sample is obtained containing a large number of observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic average of the observed values is computed. If this procedure is performed many times, the central limit theorem says that the computed values of the average will be distributed according to a normal distribution.”
In sum, in obtaining a large number of sets of randomly-generated values and calculating the arithmetic mean of each set, we demonstrate CLT, emphasized above in Figure 3. Further information on CLT and source of the above quotes are found on the Wikipedia page “Central Limit Theorem”.
mn_table <- data.frame("Sample Mean" = sample_mn,
"Theoretical Mean" = theory_mn,
"Absolute Difference" = abs_diff_mn,
"Percent Difference (%)" = per_diff_mn,
row.names = "")
vars_table <- data.frame("Mean Sample Variance" = sample_var_mn,
"Theoretical Variance" = theory_var,
"Absolute Difference" = abs_diff_var,
"Percent Difference" = per_diff_var,
row.names = "")
mean_df <- data.frame("Simulations" = sim_mns)
ggplot(data = mean_df, aes(x = Simulations)) +
geom_histogram(binwidth = 0.1, fill = "lightgrey") +
geom_vline(aes(xintercept = 5L, color = "Theoretical"), lwd = 1.5,
linetype = 5, show.legend = TRUE) +
geom_vline(aes(xintercept = mean(Simulations), color = "Sample"), lwd = 1.5,
linetype = 3, show.legend = TRUE) +
ggtitle(label = "Figure 1: Theoretical v. Sample Means of 1,000 Simulations",
subtitle = "40 Exponential Distributions per Simulation") +
labs(y = "Count", color = "Mean") +
theme_classic()
vars_df <- data.frame("Variances" = sim_vrs)
theory_vars_df <- data.frame("Variances" = rep(theory_var, n_sim))
ggplot(data = vars_df, aes(x = Variances)) +
geom_histogram(bins = 50, fill = "lightgrey") +
geom_vline(aes(xintercept = mean(Variances), color = "Sample"),
lwd = 1.5, linetype = 5, show.legend = TRUE) +
geom_vline(aes(xintercept = 25L, color = "Theoretical"), lwd = 1.5,
linetype = 3, show.legend = TRUE) +
ggtitle(label = "Figure 2: Theoretical v. Sample Variance of 1,000 Simulations",
subtitle = "40 Exponential Distributions Each") +
guides(color = guide_legend(title = "Variance")) +
labs(ylab("Count")) +
theme_classic()
distro <- ggplot(data = mean_df, aes(x = Simulations))
distro <- distro +
geom_histogram(bins = 50, fill = "skyblue", alpha = 0.5,
aes(y = ..density.., fill = ..count.. )) +
stat_function(fun = dnorm, color = "darkorange1", lwd = 1.05,
args = list(mean = mean(mean_df$Simulations),
sd = sd(mean_df$Simulations))) +
geom_density(color = "steelblue2", lwd = 1.05)
distro <- distro +
ggtitle(label = "Figure 3: Sample Mean Approximation to Gaussian Distribution",
subtitle = "Sample Histogram and Density v. Normal Distribution") +
labs(x = "Density", y = "Simulation Mean") +
guides(fill = FALSE ) +
theme_classic()
distro <- distro +
annotate(geom = "text", x = 5.7, y = 0.45, color = "darkorange1", label = "Normal") +
annotate(geom = "text", x = 6, y = 0.35, color = "steelblue2", label = "Sample")
distro