Statistical Inference Project - Simulation Exercise

Overview

This project is part of Statistical Inference courses by John Hopkins University. This project will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. We will investigate the distribution of averages of 40 exponentials.

Simulations

Load package

library(ggplot2)

We then initialize the simulation controlling variables.

nosim <- 1000
n <- 40
lambda <- 0.2

## Create a matrix with 1000 simulations (nosim = 1000) with rexp (random generation) is 40

Set the seed of the Random Number Generator so that the analysis is reproducible

set.seed(15)

Create a matrix with thousands rows corresponding to 1000 simulations and forty column corressponding to 40 random generations

simulationMatrix <- matrix(rexp(nosim * n, rate = lambda), nosim)

Create a vector of thousand rows containing the mean of each row of the simulationMatrix

simulationMean <- rowMeans(simulationMatrix)

Then we will create a data frame that contain the whole data

simulationData <- data.frame(cbind(simulationMatrix, simulationMean))

We will plot the simulation data to visualize

ggplot(data = simulationData, aes(simulationData$simulationMean)) +
  geom_histogram(breaks = seq(2,9, by = 0.2), col = "blue", aes(fill = ..count..)) +
  labs(title = "Histogram of Mean Distribution", x = "Distribution Mean", y = "Frequency") +
  geom_vline(aes(xintercept = mean(simulationData$simulationMean), color = "red", linetype = "dashed", size =1))

Sample Mean versus Theoretical Mean

Calculate actual mean of the simulated mean sample data

actualMean <- mean(simulationMean)
actualMean

## [1] 4.980535

And the theoretical mean is calculated by

theoreticalMean <- (1/lambda)
theoreticalMean

## [1] 5

We can see that the sample mean is very close to the theoretical mean

Sample Variance Versus Theoretical Variance:

Calculate the actual variance of the simulated data

actualVar <- var(simulationMean)
actualVar

## [1] 0.6429192

And the theoretical variance is calcualted by

theoreticalVar <- ((1/lambda)^2)/n
theoreticalVar

## [1] 0.625

Again, the actual variance is very close to the theoretical variance

Distributions:

To prove that the simulated mean sample data approximately follows the Normal distribution, we perform the following three steps:

Step 1: Create an approximate normal distribution and see how the sample data alligns with it.

ggplot(simulationData, aes(x = simulationMean)) +
  geom_histogram(aes(y = ..density..), colour = "black", fill = "blue") +
  geom_density(colour = "red", size = 1)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Step 2: Compare the 95% confidence intervals of the simulated mean sample data and the theoretical normally distributed data

actualConfInterval <- actualMean +c(-1,1)*1.96*sqrt(actualVar)/sqrt(n)

print(paste("Actual coverage of the confidence interval for +/-1.96sd = ", round(actualConfInterval, 3)))

## [1] "Actual coverage of the confidence interval for +/-1.96sd =  4.732"
## [2] "Actual coverage of the confidence interval for +/-1.96sd =  5.229"

theoreticalConfInterval <- theoreticalMean + c(-1, 1)*1.96*sqrt(theoreticalVar)/sqrt(n)

print(paste("Theoretical coverage of the confidence interval for +/-1.96sd =", round(theoreticalConfInterval, 3)))

## [1] "Theoretical coverage of the confidence interval for +/-1.96sd = 4.755"
## [2] "Theoretical coverage of the confidence interval for +/-1.96sd = 5.245"

Step 3: q-q plot for quantile

qqnorm(simulationMean)
qqline(simulationMean)

We can see that the actual quantiles are close to the theoretical quantiles. Thus, the distribution is approximately

Statistical Inference Project - Simulation Exercise - John Hopkins

Hoang Lam (Nancy Lam)

December 18, 2018