Statistical Inference (statinference-007)

Synopsis

From the course website: “The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s. You should 1. Show where the distribution is centered at and compare it to the theoretical center of the distribution. 2. Show how variable it is and compare it to the theoretical variance of the distribution. 3. Show that the distribution is approximately normal.

Note that for point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials. "

Creating the data (simulation)

Set environment variables

# Load Libraries for use
library(plyr)
library(ggplot2)

# Simulation parameters
set.seed(2014)
runs = 1000
n = 40
lambda = 0.2

This is an exercise of understanding the distribution of the mean from a number (n) of draws from an exponential distribution with rate constant lambda. The random number generator was seeded with “2014” simply for reproducibility of the results if run again at a later time.

Running the simulation

#generate the data, see above section for parameter values
distMeans = NULL
for (i in 1 : runs) distMeans = c(distMeans, mean(rexp(n, lambda)))

1. Show where the distribution is centered at and compare it to the theoretical center of the distribution.

# Calculate the mean of the means ;)
sampleMean=mean(distMeans)

# Calculate the theoretical mean (1/lambda)
theoryMean = 1/lambda

# For plotting, put means in a data frame
# Thank you to: http://stackoverflow.com/questions/12545322/add-vline-to-existing-plot-and-have-it-appear-in-ggplot2-legend
meanData <- data.frame(Means=c("Sample Mean", "Theoretical Mean"), vals=c(sampleMean, theoryMean))

# Generate a histogram and add vertical lines to show sample and theoritical mean on the same plot
# Thank you to http://www.r-bloggers.com/ggplot2-cheatsheet-for-visualizing-distributions/
ggplot(NULL, aes(x=distMeans)) + 
  geom_histogram(aes(y = ..density..), color="black", fill=NA, binwidth=.2) +
  geom_density(color="blue") +
  geom_vline(data=meanData, aes(xintercept=vals, linetype=Means,colour = Means),show_guide = TRUE)

In theory, the distribution should be centered at 1/lambda, which is: 5
The sample means generated by our simulation has a mean of: 5.0400101

2. Show how variable it is and compare it to the theoretical variance of the distribution.

Let’s first look at how the data is distributed:

summary(distMeans)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.565   4.474   4.990   5.040   5.551   7.751

Looking at the shape of the distribution, it looks pretty normal (and we’ll confirm this with a q-q plot in the next step), so I’ll get a meaningful standard deviation and variance returned.

Sample Standard Deviation:

sd(distMeans)

## [1] 0.7876914

Theoretical Standard Deviation:

(1/lambda)/sqrt(n)

## [1] 0.7905694

Sample Variance:

var(distMeans)

## [1] 0.6204578

Theoretical Variance:

1/(lambda^2*n)

## [1] 0.625

Lastly, given the shape of the histogram, which looks very normal (to be confirmed by a q-q plot below) we can use the t-test just to see the 95% confidence interval. With large n and normality assumed, the t and z tests are for the most part identical, and for n=1000, a reasonable estimate.

t.test(distMeans)$conf.int

## [1] 4.99113 5.08889
## attr(,"conf.level")
## [1] 0.95

As seen from the above values, both the sample standard deviation and variance are extremely close to their theoretical values. Looking at the 95% confidence interval we’re pretty confident that our mean of means is within a narrow window that includes the theoretical mean as well.

3. Show that the distribution is approximately normal.

The best graphical way (and easy way with R) is to do a quantile-quantile (Q-Q) plot.

qqnorm(distMeans);qqline(distMeans)

Looking at the Q-Q plot of the data, it is very clear that it approximates a straight line. The interpretation of this result is that the data is normally distributed.

Statistical Inference (statinference-007) – Project 1: Simulation

Brock Webb @brockwebb

November 21, 2014