Title: "Simulation and Inferential Data Analysis Project"
author: "Ken Peters"
date: "7/23/2020"
output: html_document

In this project we investigate the exponential distribution in R and compare the results with the predictions of the Central Limit Theorem. We examine the distribution of averages of 40 exponentials with rate lambda = 0.2, based on one thousand simulations.

The following table compares the observed sample mean with the theoretical mean of the distribution of averages. It also compares the observed sample standard deviation and variance with their theoretical values.

                     Theoretical Values   Observed Values
Mean                          5.0000000         4.9840809
Standard Deviation            0.7905694         0.7996708
Variance                      0.6250000         0.6394734
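
The theoretical column can be reproduced by hand. The short derivation below assumes independent draws from an exponential distribution with rate λ = 0.2 and a sample size of n = 40, the values set in the appendix code, and uses the standard facts that a single exponential has mean 1/λ and standard deviation 1/λ.

```latex
\begin{align*}
E[\bar{X}] &= \frac{1}{\lambda} = \frac{1}{0.2} = 5 \\
\operatorname{sd}(\bar{X}) &= \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906 \\
\operatorname{Var}(\bar{X}) &= \frac{(1/\lambda)^2}{n} = \frac{25}{40} = 0.625
\end{align*}
```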

The following histogram shows both the theoretical mean and the sample mean.

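Next, we overlay a normal density, with the theoretical mean and standard deviation from the table, on the histogram of the simulated means. The histogram follows the curve closely, so the distribution of the means is approximately normal.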

Finally, we compare the distribution of a large collection of random exponentials with the distribution of a large collection of averages of 40 exponentials.

The distribution of averages looks far more Gaussian than the original exponential distribution, as the Central Limit Theorem predicts.

APPENDIX

Code for the simulation

## Install the needed packages (if missing), load them, and initialize the variables
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)
if (!requireNamespace("knitr", quietly = TRUE)) install.packages("knitr")
library(knitr)
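lambda <- 0.2        ## rate of the exponential distribution
meanth <- 1/lambda   ## theoretical mean of a single exponential
sdth <- 1/lambda     ## theoretical standard deviation of a single exponential
n <- 40              ## number of exponentials averaged in each simulation
set.seed(456)
nsims <- 1000        ## number of simulations
## Preallocate the result vector, then store the mean of each sample of n exponentials
mns <- numeric(nsims)
for (i in seq_len(nsims)) mns[i] <- mean(rexp(n, lambda))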
mean(mns)
## [1] 4.984081
var(mns)
## [1] 0.6394734
sd(mns)
## [1] 0.7996708

Create the table that compares the observed sample mean, standard deviation, and variance of the simulated averages with their theoretical values.

## Theoretical mean, standard deviation, and variance of the mean of n exponentials
th <- c(meanth, sdth/sqrt(n), sdth^2/n)
obsv <- c(mean(mns), sd(mns), var(mns))
df <- data.frame(th, obsv)
colnames(df) <- c("Theoretical Values", "Observed Values")
rownames(df) <- c("Mean", "Standard Deviation", "Variance")
kable(df)

                     Theoretical Values   Observed Values
Mean                          5.0000000         4.9840809
Standard Deviation            0.7905694         0.7996708
Variance                      0.6250000         0.6394734

Create a histogram that shows both the theoretical mean and the sample mean

hist(mns, main = "Histogram of a Simulation of \n1000 exponential distribution means",
     xlab = "Means from Samples of Size 40", col = "khaki", breaks = 30)
## Dashed line: theoretical mean; solid line: observed mean
abline(v = meanth, col = "coral", lwd = 4, lty = 2)
abline(v = mean(mns), col = "blue4", lwd = 2)
legend("topright", c("Theoretical Mean", "Observed Mean"),
       lty = c(2, 1), lwd = c(4, 2),
       bty = "n", col = c("coral", "blue4"))

Next, we overlay a Normal Distribution onto our Histogram of the Simulation.

hist(mns, main = "Distribution of a Simulation of \n1000 exponential \ndistribution means",
     prob = TRUE, xlab = "Means from Samples of Size 40", col = "green2", breaks = 30)
## Normal density with the theoretical mean and standard deviation of the sample mean
xfit <- seq(min(mns), max(mns), length.out = 100)
yfit <- dnorm(xfit, mean = meanth, sd = sdth/sqrt(n))
lines(xfit, yfit, col = "dodgerblue2", lwd = 3)

Finally, we create two histograms to compare the distribution of a large collection of random exponentials with the distribution of a large collection of averages of 40 exponentials.

par(mfrow = c(1, 2))
## Draw the exponential sample once so that the histogram, the density curve,
## and the mean line all describe the same data
exps <- rexp(1000, lambda)
hist(exps, main = "Histogram of a Simulation of \n1000 values from \nan exponential distribution \nno means",
     prob = TRUE, xlab = "No means", col = "khaki", breaks = 30)
xexp <- seq(min(exps), max(exps), length.out = 100)
lines(xexp, dexp(xexp, lambda), col = "dodgerblue2", lwd = 4)
abline(v = mean(exps), col = "coral", lwd = 4, lty = 1)
hist(mns, main = "Distribution of a Simulation of \n1000 exponential \ndistribution means",
     prob = TRUE, xlab = "Means from Samples of Size 40", col = "green2", breaks = 30)
abline(v = mean(mns), col = "coral", lwd = 4)
xfit <- seq(min(mns), max(mns), length.out = 100)
yfit <- dnorm(xfit, mean = meanth, sd = sdth/sqrt(n))
lines(xfit, yfit, col = "dodgerblue2", lwd = 3)
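
The visual comparison of the two histograms can be complemented by a numerical one. The sketch below is not part of the original analysis. It defines a hypothetical helper, skewness, that computes the sample skewness (the mean of the cubed standardized values), and applies it to a fresh set of raw exponentials and of sample means. It reuses lambda = 0.2, n = 40, 1000 simulations, and the seed 456 as assumptions carried over from the code above. An exponential distribution has a theoretical skewness of 2, while the mean of n independent exponentials has skewness 2/sqrt(n), so the means should come out much closer to the symmetric value of 0.

```r
## Hypothetical helper (not in the original report): sample skewness,
## computed as the mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

lambda <- 0.2   ## rate of the exponential distribution (assumed, as above)
n <- 40         ## exponentials averaged per simulation
nsims <- 1000   ## number of simulations
set.seed(456)

## 1000 sample means, each computed from n exponentials
mns <- replicate(nsims, mean(rexp(n, lambda)))
## 1000 raw exponential draws for comparison
raw <- rexp(nsims, lambda)

skew_raw <- skewness(raw)
skew_means <- skewness(mns)
c(raw = skew_raw, means = skew_means)
```

A skewness near 0 for the means and a clearly positive skewness for the raw draws would agree with what the histograms suggest.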