Statistical Inference Course Project I

A Simulation Exercise

written by: Jeanna Clark

Overview

In this part of the project, the distribution of averages of simulated exponential draws is compared with what the Central Limit Theorem predicts. The simulated dataset consists of one thousand runs, each the average of forty exponential draws. The exponential rate parameter (lambda) is 0.2, and the simulation seed is set to 288 to allow for reproducibility. Familiarity with the Central Limit Theorem and with the mean and standard deviation of the exponential distribution is assumed, and a basic understanding of R programming would be helpful.


Simulations

The exponential distribution will be simulated by:

## set variables to human legible names
lambda_rate_parameter <- 0.2
exponential_count <- 40 
simulation_count <- 1000

## set seed (i.e., for reproducibility)
set.seed(288)

## compile simulation data
simulation_data <- matrix(rexp(simulation_count*exponential_count, rate = lambda_rate_parameter), simulation_count, exponential_count)

## compile average run data
simulated_averages <- rowMeans(simulation_data)
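Before analyzing the averages, it can help to confirm that the simulation has the intended shape. The following is a minimal sketch, not part of the original analysis: it repeats the setup above so that it runs on its own, then checks that there is one row per simulation, one column per exponential draw, and one average per row.

```r
## repeat the setup above so this sketch is self-contained
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)

simulation_data <- matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter), simulation_count, exponential_count)
simulated_averages <- rowMeans(simulation_data)

## one row per simulation, one column per exponential draw
dim(simulation_data)

## one average per simulation
length(simulated_averages)

## the first average equals the mean of the first row (up to floating point)
all.equal(simulated_averages[1], mean(simulation_data[1, ]))
```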

Sample Mean versus Theoretical Mean

The sample mean and the theoretical mean are compared in the plot below. The simulated data mean is 5.001088, and the theoretical mean is 1/lambda = 5. These values are very close to one another, which is consistent with the Central Limit Theorem: the distribution of averages should be centered on the population mean.

## calculate means
theoretical_mean <- 1/lambda_rate_parameter
simulated_mean <- mean(simulated_averages)

## create base-plot histogram 
hist(simulated_averages, col = "wheat", breaks = 24, prob = TRUE, ylab = "density", xlab = " ", main = "Distribution of means")

## insert mean lines
abline(v = theoretical_mean, col = "purple") ## theoretical mean
abline(v = simulated_mean, col = "blue", lwd = 2, lty = 2) ## simulated mean (dashed, to match the key)

## insert key
legend('right', c("simulated", "theoretical"), lty = c(2,1), col = c("blue", "purple"))
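One way to make "very close" more concrete is to express the gap between the two means in units of standard error. The sketch below is an illustration added here, not part of the original analysis; the names standard_error and z_score are introduced for this example only. It assumes the 1,000 averages are independent, each with variance (1/lambda)^2/40, so that their overall mean has standard deviation sqrt(0.625/1000) = 0.025.

```r
## repeat the setup above so this sketch is self-contained
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulation_data <- matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter), simulation_count, exponential_count)
simulated_averages <- rowMeans(simulation_data)

theoretical_mean <- 1 / lambda_rate_parameter
simulated_mean <- mean(simulated_averages)

## variance of one average of 40 exponentials
theoretical_variance <- (1 / lambda_rate_parameter)^2 / exponential_count

## standard error of the mean of the 1,000 averages (hypothetical helper name)
standard_error <- sqrt(theoretical_variance / simulation_count)

## gap between simulated and theoretical mean, in standard errors
z_score <- (simulated_mean - theoretical_mean) / standard_error
z_score
```

With the simulated mean reported above (5.001088), the gap of 0.001088 is far smaller than one standard error of 0.025.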


Sample Variance versus Theoretical Variance

The variances are calculated below. The simulated variance is 0.64356, and the theoretical variance is (1/lambda)^2/40 = 0.625. These values are close to one another, which is consistent with the Central Limit Theorem.

## calculate variance
simulated_variance <- var(simulated_averages)
theoretical_variance <- (1/lambda_rate_parameter)^2/exponential_count
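The theoretical variance used above follows from the variance of a mean of independent draws. The short derivation below is added for illustration; the symbols are notation introduced here, with \bar{X}_n for the average of n = 40 exponentials, \sigma = 1/\lambda for the standard deviation of a single exponential, and \lambda = 0.2.

```latex
\[
\operatorname{Var}\left(\bar{X}_n\right)
  = \frac{\sigma^2}{n}
  = \frac{(1/\lambda)^2}{n}
  = \frac{(1/0.2)^2}{40}
  = \frac{25}{40}
  = 0.625
\]
```

The corresponding standard deviation of the averages is sqrt(0.625), or roughly 0.79.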

Distribution

Based on the results above and on our understanding of the Central Limit Theorem, we now assess whether the simulated averages are approximately normally distributed. The histogram and density estimate below show that they are:

## load library
library(ggplot2)

## plot a histogram of the averages with a density estimate overlaid
data <- data.frame(simulated_averages)
normal_distribution <- ggplot(data, aes(x = simulated_averages))
normal_distribution <- normal_distribution + geom_histogram(aes(y = ..density..), color = "grey", fill = "wheat")
normal_distribution + geom_density(color = "purple")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
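The density estimate above is computed from the simulated data itself, so a complementary check is to overlay the normal curve that the Central Limit Theorem predicts. The following base-graphics sketch is an addition, not part of the original analysis; it assumes a normal distribution with mean 1/lambda and standard deviation (1/lambda)/sqrt(40), and the names theoretical_sd and averages_histogram are introduced only for this example.

```r
## repeat the setup above so this sketch is self-contained
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulated_averages <- rowMeans(matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter), simulation_count, exponential_count))

## parameters of the normal distribution predicted by the Central Limit Theorem
theoretical_mean <- 1 / lambda_rate_parameter
theoretical_sd <- (1 / lambda_rate_parameter) / sqrt(exponential_count)

## density-scaled histogram of the simulated averages
averages_histogram <- hist(simulated_averages, col = "wheat", breaks = 24, prob = TRUE, ylab = "density", xlab = "average of 40 exponentials", main = "Averages with theoretical normal curve")

## overlay the theoretical normal density
curve(dnorm(x, mean = theoretical_mean, sd = theoretical_sd), add = TRUE, col = "purple", lwd = 2)
```

If the averages are approximately normal, the histogram bars should follow the purple curve closely, with at most a slight skew to the right.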

As a final check that the simulated averages are approximately normal and in agreement with the Central Limit Theorem, a Normal Q-Q Plot is shown below. The points fall close to the reference line, which also indicates that the distribution is approximately normal.

qqnorm(simulated_averages)
qqline(simulated_averages)
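The contrast between the raw exponential draws and their averages can also be summarized with a single number. The sketch below is an addition, not part of the original analysis: it compares sample skewness, using a hypothetical helper sample_skewness defined here. An exponential distribution has skewness 2, while the average of 40 independent exponentials has skewness 2/sqrt(40), about 0.32, so the averages should be much closer to the symmetric shape of a normal distribution.

```r
## repeat the setup above so this sketch is self-contained
lambda_rate_parameter <- 0.2
exponential_count <- 40
simulation_count <- 1000
set.seed(288)
simulation_data <- matrix(rexp(simulation_count * exponential_count, rate = lambda_rate_parameter), simulation_count, exponential_count)
simulated_averages <- rowMeans(simulation_data)

## hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

## skewness of all raw exponential draws (theoretical value: 2)
raw_skewness <- sample_skewness(as.vector(simulation_data))

## skewness of the 1,000 averages (theoretical value: 2 / sqrt(40))
averages_skewness <- sample_skewness(simulated_averages)

c(raw = raw_skewness, averages = averages_skewness)
```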

The results of this simulated data analysis are consistent with the Central Limit Theorem. The sample mean and variance are close to their theoretical values, and the histogram and the Q-Q plot both show that the simulated averages are approximately normal.


Codebook

  • Description of data: simulated exponential distribution, i.e., rexp(n, lambda), with lambda = 0.2 for all simulations. The analysis uses the distribution of averages of 40 exponentials from 1,000 simulations.
  • Data source: RStudio
  • Simulation Exercise: “In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations. Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should
  1. Show the sample mean and compare it to the theoretical mean of the distribution.
  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal.
  In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials. As a motivating example, compare the distribution of 1000 random uniforms
      hist(runif(1000))
  and the distribution of 1000 averages of 40 random uniforms
      mns = NULL
      for (i in 1 : 1000) mns = c(mns, mean(runif(40)))
      hist(mns)
  This distribution looks far more Gaussian than the original uniform distribution! This exercise is asking you to use your knowledge of the theory given in class to relate the two distributions. Confused? Try re-watching video lecture 07 for a starter on how to complete this project.” (sourced from course rubric: https://class.coursera.org/statinference-031/human_grading/view/courses/975164/assessments/4/submissions)

Publication information

  • Author: Jeanna Clark
  • RPubs username: asclepiusgal
  • Github username: asclepiusgal
  • Course: Statistical Inference - Hopkins on Coursera
  • Project Rubric: URL
  • Target audience: written for course peers to grade
  • Copyright: August 2015
  • Published on RPubs: URL
  • Session info:
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.0      digest_0.6.8     MASS_7.3-40      grid_3.2.0      
##  [5] plyr_1.8.3       gtable_0.1.2     formatR_1.2      magrittr_1.5    
##  [9] scales_0.2.5     evaluate_0.7     stringi_0.5-5    reshape2_1.4.1  
## [13] rmarkdown_0.6.1  labeling_0.3     proto_0.3-10     tools_3.2.0     
## [17] stringr_1.0.0    munsell_0.4.2    yaml_2.1.13      colorspace_1.2-6
## [21] htmltools_0.2.6  knitr_1.10.5
  • Published: 2015-08-19 11:14:09