Statistical Inference Course Project Part 1

The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. Set lambda = 0.2 for all of the simulations. In this simulation, you will investigate the distribution of averages of 40 exponential(0.2)s. Note that you will need to do a thousand or so simulated averages of 40 exponentials. Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponential(0.2)s.

In drawing \(n = 40\) samples from the exponential distribution with rate \(\lambda = 0.2\), let \(X_i\) denote the outcome of draw \(i\).
The expected value or mean for \(X_i\) is \(\mu = \frac{1}{\lambda} = \frac{1}{0.2} = 5\)
The standard deviation for \(X_i\) is \(\sigma = \frac{1}{\lambda} = \frac{1}{0.2} = 5\)
The CLT states that repeatedly generating the averages \(\bar X\) of \(n\) independent and identically distibuted samples (from any distribution) simulates a normal random variable \(\bar X_n \sim \mbox{N}(\mu, \frac{\sigma^2}{n})\).
Therefore, if we repeatedly (e.g. \(1000\) times) draw \(n = 40\) exponentials with rate \(\lambda = 0.2\), take their mean \(\bar X\), and plot a histogram, we expect a distribution approximating a normal distribution with \(\mu_t = \mu = 5\), and standard deviation \(\sigma_t = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} = \frac{5}{\sqrt{40}} = 0.79\).

## Declare constants
lambda <- 0.2
mut <- 1/lambda
n <- 40
sigmat <- 1/lambda/sqrt(n)
nosim <- 1000
## Compute matrix of nosim rows of n exponential(lambda) samples
expmatrix <- matrix(rexp(n * nosim, lambda), nosim)
## Compute data frame with nosim rows of sample means
meandf <- data.frame(
    meanofn <- apply(expmatrix, 1, mean)
)

library(ggplot2)
## Define colours
orange <- "#E69F00"
blue <- "#56B4E9"
## Construct base plot
g <- ggplot(meandf, aes(x = meanofn)) + xlab("Mean") + ylab("Density")
## Add sample data distribution
g <- g + geom_histogram(alpha=0.20, binwidth=0.1, fill=orange, colour=orange, aes(y=..density..))
## Add title and render
g +  ggtitle("Fig. 1. Sample data")

plot of chunk unnamed-chunk-2

1. Show where the distribution is centered at and compare it to the theoretical center of the distribution

We display vertical lines for the mean of the sample data (orange) and the theoretical mean \(\mu_t = \mu = 5\) (blue, dashed).

## Add mean of sample data
g <- g + geom_vline(data=meandf, aes(xintercept=mean(meanofn)), colour=orange, size=1)
## Add theoretical mean
g <- g + geom_vline(aes(xintercept=mut), colour=blue, linetype="dashed", size=1)
## Add title and render
g + ggtitle("Fig. 2. Added samples and theoretical means")

plot of chunk unnamed-chunk-3

2. Show how variable it is and compare it to the theoretical variance of the distribution

We display vertical lines for the \(\pm\) sd of the sample data (orange) as well as those for the theoretical sd \(\sigma_t = \frac{\sigma}{\sqrt{n}} = 0.79\) (blue, dotted).

## Add sd of sample data
g <- g + geom_vline(data=meandf, aes(xintercept=mean(meanofn)-sd(meanofn)), colour=orange, size=1)
g <- g + geom_vline(data=meandf, aes(xintercept=mean(meanofn)+sd(meanofn)), colour=orange, size=1)
## Add theoretical sd
g <- g + geom_vline(aes(xintercept=mut-sigmat), colour=blue, linetype="dotted", size=1)
g <- g + geom_vline(aes(xintercept=mut+sigmat), colour=blue, linetype="dotted", size=1)
## Add title and render
g + ggtitle("Fig. 3. Added sample data and theoretical sds")

plot of chunk unnamed-chunk-4

3. Show that the distribution is approximately normal

We overlay the normal distribution parametrized by \(\mu_t\) and \(\sigma_t^2\) (blue).

## Add normal distribution
g <- g + stat_function(fun=dnorm, arg=list(mean=mut, sd=sigmat), colour=blue)
## Add title and render
g +  ggtitle("Fig. 4. Added normal distribution")

plot of chunk unnamed-chunk-5

4. Evaluate the coverage of the confidence interval for \(\frac{1}{\lambda}\): \(\bar X \pm 1.96 \frac{S}{\sqrt{n}}\)

We reuse the previously computed matrix with \(1000\) rows of \(n\) exponential(\(\lambda\)) samples. For each row, we determine whether \(\frac{1}{\lambda}\) is within the confidence interval computed for that row \(\bar X \pm 1.96 \frac{S}{\sqrt{n}}\).

The results are collected in a vector of length \(1000\) with values \(0\) or \(1\). The sum of this vector divided by its length is de coverage ratio for the confidence interval for \(\frac{1}{\lambda}\).

coverage <- apply(expmatrix, 1, function(x) {
  ll <- mean(x) - qnorm(.975) * sd(x) / sqrt(n)
  ul <- mean(x) + qnorm(.975) * sd(x) / sqrt(n)
  mean(ll < 1/lambda & ul > 1/lambda)
})
round(sum(coverage)/length(coverage),2)

## [1] 0.92