Statistical Inference Project - Part I: Sample Mean vs Theoretical Mean

Author: chenghueylin
Year/Month: 2015/12


Overview

This project investigates the exponential distribution in R and compares the behavior of its sample means with what the Central Limit Theorem predicts. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda, and its standard deviation is also 1/lambda.
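
As a quick illustration of the two facts above, the following sketch draws a large exponential sample and compares its empirical mean and standard deviation with 1/lambda. The seed, the sample size of 100000, and the names lambda_demo and x are assumptions made only for this example; they are not part of the project simulation.

```r
## Sanity check (illustrative): with a large sample, the empirical mean and
## standard deviation of rexp() should both be close to 1/lambda.
set.seed(1)                      # arbitrary seed, used only for reproducibility
lambda_demo <- 0.2               # hypothetical name; same rate as in this report
x <- rexp(100000, lambda_demo)   # 100000 independent exponential draws

## Empirical mean and sd next to the theoretical value 1/lambda = 5
c(mean = mean(x), sd = sd(x), theoretical = 1 / lambda_demo)
```

Both empirical values should land near 5, differing only by sampling noise.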

Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should
       1. Show the sample mean and compare it to the theoretical mean of the distribution.
       2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
       3. Show that the distribution is approximately normal.

library(ggplot2)    ## for plotting functions

Simulations

lambda <- 0.2;     ## set lambda
n <- 40;           ## set 40 exponentials
nosim <- 1000;     ## set No. of simulations to 1000

set.seed(345);     ## set seed for consistent simulations

## Create a 1000 x 40 matrix of exponential draws from rexp(n*nosim, lambda):
## each row is one simulation of 40 exponentials
simulatedData <- matrix(rexp(n*nosim, lambda), nrow=nosim, ncol=n);

## Calculate the average/mean of each row (40 exponentials): sample mean
simMeans <- apply(simulatedData, 1, mean);
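
The row-wise averaging step can also be sketched with rowMeans(), which is a common alternative to apply(..., 1, mean). The block below is self-contained and rebuilds a matrix of the same shape under the same seed; the names demoData and demoMeans are hypothetical and are used only so that the objects above are not overwritten.

```r
## Self-contained sketch: rebuild a matrix of the same shape and confirm that
## rowMeans() agrees with apply(..., 1, mean).
set.seed(345)
demoData <- matrix(rexp(40 * 1000, 0.2), nrow = 1000, ncol = 40)  # hypothetical name

dim(demoData)                     # rows = simulations, columns = draws per simulation

demoMeans <- rowMeans(demoData)   # one average per simulation (per row)

## Compare with the apply() approach, up to floating-point tolerance
all.equal(demoMeans, apply(demoData, 1, mean))
```

Either approach yields one mean per simulation; rowMeans() is usually faster on large matrices, though at this size the difference is negligible.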

Sample Mean versus Theoretical Mean

Show the sample mean and compare it to the theoretical mean of the distribution.

## Calculate sample mean
sampleMean <- round(mean(simMeans),3);

## Calculate theoretical mean
theoreticalMean <- round(1/lambda,3);

## Mean distribution of 1000 simulations 
hist(simMeans, col="blue", main="Histogram of 1000 means of 40 sample exponentials", xlab="Sample Mean", ylab="Frequency");

## highlight the 2 means we are comparing
abline(v=sampleMean, col="green", lwd=6, lty=2);
abline(v=theoreticalMean, col="red", lwd=3);

Conclusion: The mean of the 1000 simulated averages is 4.989, and the theoretical mean is 1/lambda = 5. The center of the distribution of averages of 40 exponentials (green dashed line) is very close to the theoretical center (red solid line).

Sample Variance versus Theoretical Variance

Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

## Variance of the simulated sample means
sampleVar <- round(var(simMeans), 3);
## Theoretical variance of the mean of n exponentials: (1/lambda)^2 / n
theoreticalVar <- round((1/lambda)^2/n,3);

## Calculate sd of the sample mean
sampleSD <- round(sd(simMeans),3);
## Calculate Theoretical sd
theoreticalSD <- round((1/lambda)/sqrt(n),3);

Conclusion: The variance of the simulated sample means is 0.622 and the theoretical variance is (1/lambda)^2/n = 0.625, so the two are very close. The standard deviation of the simulated sample means is 0.789 and the theoretical standard deviation is (1/lambda)/sqrt(n) = 0.791, so the difference is again very small.
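
For reference, the theoretical values used here follow from the standard result for the variance of a mean of independent draws. The derivation below is a short sketch; the symbols $X_i$ (a single exponential draw), $\bar{X}$ (the mean of $n$ draws), and $\sigma$ (the standard deviation of one draw) are notation introduced only for this illustration.

$$
\operatorname{Var}(\bar{X})
= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{\sigma^2}{n}
= \frac{(1/\lambda)^2}{n}
= \frac{25}{40}
= 0.625
$$

$$
\operatorname{SD}(\bar{X}) = \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.791
$$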

Distribution

Show that the distribution is approximately normal.
To show that the distribution is approximately normal, we plot the density histogram of the simulated sample means and overlay the normal density (a bell curve) with mean 1/lambda = 5 and standard deviation (1/lambda)/sqrt(n), about 0.791, to see whether the two distributions are aligned.

simMeans_df <- data.frame(simMeans);
names(simMeans_df) <- c("simMean")
ggplot(simMeans_df, aes(x=simMean)) +
       labs(x="simulated mean", title="Distribution of Averages of Samples vs Theoretical Mean") +
       geom_histogram(aes(y=..density..), color="green", fill="blue", size=1, binwidth=0.2) + 
       geom_density(color="green", size=1) +
       ## the parameter is `args` (not `arg`); it passes mean and sd on to dnorm
       stat_function(fun=dnorm, args=list(mean=theoreticalMean, sd=theoreticalSD), color = "red", size=1) +
       geom_vline(xintercept=theoreticalMean, color="red", size=1);

Conclusion: The green curve is the estimated density of the averages of the simulated samples. The red curve is the normal density with mean 5 and standard deviation 0.791, the values implied by lambda = 0.2 and n = 40. The figure shows that the two curves, green and red, are well aligned, so the distribution of the simulated sample means is approximately normal.
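
A normal Q-Q plot is another common way to judge approximate normality. The sketch below is self-contained and regenerates the sample means with the same seed and parameters used in this report; the name demoMeans is hypothetical, and the base-graphics functions qqnorm() and qqline() are used here as a supplementary check, not as part of the original analysis.

```r
## Illustrative normality check with a normal Q-Q plot (base graphics).
set.seed(345)
demoMeans <- rowMeans(matrix(rexp(40 * 1000, 0.2), nrow = 1000))  # hypothetical name

## Sample quantiles of the means against theoretical normal quantiles
qqnorm(demoMeans, main = "Normal Q-Q plot of 1000 means of 40 exponentials")

## Reference line through the first and third quartiles
qqline(demoMeans, col = "red", lwd = 2)
```

Points lying close to the red line are consistent with approximate normality. Some curvature in the tails would not be surprising, since the mean of 40 exponentials still retains a mild right skew.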
