Statistical Inference Project - Part 1

Overview

From the project instructions:

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should:

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Simulations

First we set the seed to ensure reproducibility, then we set the value for lambda, the number of exponentials, and the number of simulations. We can then run the simulations, calculate the means, and show how these means fall in a histogram.

# Showing system info for basic reproducibility
sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
##  [5] yaml_2.2.0      Rcpp_1.0.3      stringi_1.4.4   rmarkdown_2.1  
##  [9] knitr_1.27      stringr_1.4.0   xfun_0.12       digest_0.6.23  
## [13] rlang_0.4.4     evaluate_0.14

# Setting seed
set.seed(2020)

# Setting parameters
lambda <- 0.2
n <- 40
sims <- 1000

# Running simulations
sims_ran <- replicate(sims, rexp(n, lambda))

# Means of the simulations
sims_means <- apply(sims_ran, 2, mean)

# Plotting a simple histogram to show the distribution of the means
hist(sims_means, breaks = 50, main = "Means of 1000 Simulations of the Exponential Distribution", xlab = "Mean")

We can see from the histogram that the distribution looks fairly Gaussian.

Question 1: Sample Mean vs Theoretical Mean

The theoretical mean for an exponential distribution with lambda = 0.2 is 1/0.2 = 5 (mu = 1/lambda). We can show via the same histogram above where mean = 5 fits, as well as show the sample mean of the simulations and how it compares to the theoretical mean of 5.

# Histogram of sample means
hist(sims_means, breaks = 50, main = "Histogram of Sample Means vs. Theoretical Mean of 5", xlab = "Sample Means")

# Theoretical Mean of 5
abline(v = 5, lwd = 4, col = "blue")

# Calculating the sample mean of the simulations
mean(sims_means)

## [1] 5.033948

# Sample Mean on the histogram
abline(v = mean(sims_means), lwd = 4, col = "red")

We see that the sample mean (5.03, red vertical line) is very close to the theoretical mean (5, blue vertical line)

Question 2: Sample Variance vs. Theoretical Variance

The theoretical variance for an exponential distribution with lambda = 0.2 is (1/lambda)^2/n. The theoretical and sample variances can be calculated very simply. I do that below, as well as show the difference between the two. We can see that they are very close to eachother.

# Calculating the theoretical variance
round((1/lambda)^2/n, 3)

## [1] 0.625

# Calculating the sample variance
round(var(sims_means),3)

## [1] 0.607

# Showing the difference
round((1/lambda)^2/n, 3) - round(var(sims_means),3)

## [1] 0.018

Question 3: Show that the distribution is Approximately Normal

The Central Limit Theorem states that for sample sizes greater than approximately 30, the means begin to follow a normal distribution regardless of the underlying probability distribution from which the means came. We can see how close our means distribution is to normal here by overlaying a normal line against a sample line

# Repeating the means histogram
hist(sims_means, breaks = 50, main = "Means of 1000 Simulations of the Exponential Distribution", xlab = "Mean", prob = TRUE, xlim = c(2,8))

# Adding a sample distribution curbe
lines(density(sims_means), lwd = 4, col = "purple")

# Adding a Normal distribution curve for contrast
x <- seq(min(sims_means), max(sims_means), length = 2*n)
y <- dnorm(x, mean = 1/lambda, sd = sqrt((1/lambda)^2/n))
lines(x, y, lwd = 4, col = "green")

Conclusion

We can see that the distribution is very close to Normal. The green curve represents Normality. The purple curve represents the distribution of the sample means. This is due to the our high sample/simulation size (n = 1000) and the Central Limit Theorem.

Statistical Inference Project - Part 1 - Simulation