In this part of the project, the distribution of simulated means of exponential samples is compared with what the Central Limit Theorem predicts. The simulated dataset consists of one thousand runs, each the average of forty exponential draws. The exponential rate parameter (lambda) is 0.2, and the simulation seed is set to 288 for reproducibility. Familiarity with the Central Limit Theorem and with the mean and standard deviation of the exponential distribution is assumed, and a basic understanding of R programming would be helpful.
The exponential distribution will be simulated by:
## set variables to human legible names
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
## set seed (for reproducibility)
set.seed(288)
## compile simulation data
simulation_data <- matrix(rexp(simulation_count*exponential_count, rate = lambda_rate_parameter), simulation_count, exponential_count)
## compile average run data
simulated_averages <- rowMeans(simulation_data)
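As a quick sanity check on the simulation step, the sketch below rebuilds the matrix with the same settings and confirms that each row holds one run of forty draws, so that `rowMeans` returns one average per run. The comparison against `apply` and the name `row_means_via_apply` are illustrative additions, not part of the original analysis.

```r
## rebuild the simulation with the settings used above
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulation_data <- matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter),
                          simulation_count, exponential_count)
simulated_averages <- rowMeans(simulation_data)

## one row per run, one column per exponential draw
dim(simulation_data)

## rowMeans should agree with taking mean() over each row
row_means_via_apply <- apply(simulation_data, 1, mean)
all.equal(simulated_averages, row_means_via_apply)
```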
The plot below compares the mean of the simulated averages with the theoretical mean. The simulated mean is 5.001088, and the theoretical mean is 1/lambda = 5. These values are very close to one another, which is consistent with the Central Limit Theorem.
## calculate means
theoretical_mean <- 1 / lambda_rate_parameter
simulated_mean <- mean(simulated_averages)
## create base-plot histogram
hist(simulated_averages, col = "wheat", breaks = 24, prob = TRUE, ylab = "density", xlab = " ", main = "Distribution of means")
## insert mean lines
abline(v = theoretical_mean, col = "purple") ## theoretical mean
abline(v = simulated_mean, col = "blue", lwd = 2) ## simulated mean
## insert key
legend('right', c("simulated", "theoretical"), lty = c(1, 1), lwd = c(2, 1), col = c("blue", "purple"))
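To put "very close" on a numeric footing, the sketch below expresses the gap between the two means in units of the standard error of the simulated mean. It reruns the simulation so that it stands alone; the names `standard_error` and `gap_in_standard_errors` are introduced here for illustration only.

```r
## rerun the simulation with the settings used above
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulated_averages <- rowMeans(matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter),
                                      simulation_count, exponential_count))

theoretical_mean <- 1 / lambda_rate_parameter
simulated_mean <- mean(simulated_averages)

## standard error of the mean of the 1000 simulated averages
standard_error <- sd(simulated_averages) / sqrt(simulation_count)

## gap between simulated and theoretical mean, in standard errors
gap_in_standard_errors <- (simulated_mean - theoretical_mean) / standard_error
gap_in_standard_errors
```

A gap within roughly two standard errors is what sampling variation alone would typically produce.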
The variances are calculated below. The simulated variance is 0.64356, and the theoretical variance is (1/lambda)^2 / 40 = 0.625. These values are close to one another, which is again consistent with the Central Limit Theorem.
## calculate variance
simulated_variance <- var(simulated_averages)
theoretical_variance <- (1 / lambda_rate_parameter)^2 / exponential_count
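The theoretical value follows from the variance of a sample mean, sigma^2 / n, where sigma = 1/lambda for the exponential distribution. The sketch below evaluates that formula for a few sample sizes to show how the variance of the mean shrinks as more exponentials are averaged. The sizes 10 and 160 are illustrative assumptions; only 40 is used in this analysis.

```r
lambda_rate_parameter <- 0.2

## standard deviation of a single exponential draw
sigma <- 1 / lambda_rate_parameter

## variance of the mean of n draws is sigma^2 / n
sample_sizes <- c(10, 40, 160)   # 10 and 160 are illustrative only
variance_of_mean <- sigma^2 / sample_sizes
names(variance_of_mean) <- sample_sizes
variance_of_mean   # values: 2.5, 0.625, 0.15625
```

Quadrupling the number of averaged draws cuts the variance of the mean to a quarter.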
Given the results above, we next assess whether the simulated averages are approximately normally distributed, as the Central Limit Theorem predicts. The histogram and density estimate below show an approximately normal shape:
## load library
library(ggplot2)
## plot a histogram of the simulated averages with a kernel density overlay
data <- data.frame(simulated_averages)
normal_distribution <- ggplot(data, aes(x = simulated_averages))
normal_distribution <- normal_distribution + geom_histogram(aes(y = ..density..), color = "grey", fill = "wheat")
normal_distribution + geom_density(color = "purple")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
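The density curve above is estimated from the simulated data itself. A more direct comparison overlays the normal density that the Central Limit Theorem predicts, with mean 1/lambda and standard deviation (1/lambda)/sqrt(n). The sketch below does this with base graphics to avoid extra package dependencies; it reruns the simulation so that it stands alone, and the name `theoretical_sd`, the axis label, the colors, and the break count are illustrative choices.

```r
## rerun the simulation with the settings used above
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulated_averages <- rowMeans(matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter),
                                      simulation_count, exponential_count))

## parameters of the normal distribution predicted by the Central Limit Theorem
theoretical_mean <- 1 / lambda_rate_parameter
theoretical_sd <- (1 / lambda_rate_parameter) / sqrt(exponential_count)

## histogram on the density scale with the theoretical normal curve on top
hist(simulated_averages, breaks = 24, prob = TRUE, col = "wheat",
     xlab = "mean of 40 exponentials", main = "Simulated means with theoretical normal density")
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd), add = TRUE, col = "purple", lwd = 2)
```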
As a final check that the simulated data is approximately normal and in agreement with the Central Limit Theorem, a normal Q-Q plot is shown below. This plot also shows that the distribution of the simulated averages is approximately normal.
qqnorm(simulated_averages)
qqline(simulated_averages)
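A numeric complement to the Q-Q plot is to count how many simulated averages fall within 1.96 theoretical standard deviations of the theoretical mean; under a normal distribution, about 95% should. The sketch below is an illustrative addition that reruns the simulation so it stands alone; the names `lower_bound`, `upper_bound`, and `coverage` are introduced here and are not part of the original analysis.

```r
## rerun the simulation with the settings used above
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulated_averages <- rowMeans(matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter),
                                      simulation_count, exponential_count))

## normal distribution predicted by the Central Limit Theorem
theoretical_mean <- 1 / lambda_rate_parameter
theoretical_sd <- (1 / lambda_rate_parameter) / sqrt(exponential_count)

## central 95% interval of that normal distribution
lower_bound <- theoretical_mean - 1.96 * theoretical_sd
upper_bound <- theoretical_mean + 1.96 * theoretical_sd

## share of simulated averages inside the interval (expected to be near 0.95)
coverage <- mean(simulated_averages >= lower_bound & simulated_averages <= upper_bound)
coverage
```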
This simulated data analysis does not contradict the Central Limit Theorem. In fact, the simulated data is in agreement with the theorem, as three forms of analysis have shown that the simulated averages are approximately normal.
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.0 digest_0.6.8 MASS_7.3-40 grid_3.2.0
## [5] plyr_1.8.3 gtable_0.1.2 formatR_1.2 magrittr_1.5
## [9] scales_0.2.5 evaluate_0.7 stringi_0.5-5 reshape2_1.4.1
## [13] rmarkdown_0.6.1 labeling_0.3 proto_0.3-10 tools_3.2.0
## [17] stringr_1.0.0 munsell_0.4.2 yaml_2.1.13 colorspace_1.2-6
## [21] htmltools_0.2.6 knitr_1.10.5