Statistical Inference Course Project Part 1: Simulation Exercise

Overview

In Part 1 of this project, we investigate the exponential distribution in R and compare the distribution of its sample averages with what the Central Limit Theorem predicts. With lambda = 0.2 for all simulations, we examine the distribution of averages of 40 exponentials over 1000 simulations.

Simulations

# Setting pre-defined parameters
lambda <- 0.2
n <- 40
sims <- 1:1000
set.seed(123)

# Loading necessary R packages
if(!require(ggplot2)){install.packages('ggplot2')}; library(ggplot2)

## Loading required package: ggplot2

# Simulating 1000 averages of 40 exponentials each (one average per row)
population <- data.frame(x=sapply(sims, function(x) {mean(rexp(n, lambda))}))

# Plotting the histogram
hist.pop <- ggplot(population, aes(x=x)) +
  geom_histogram(aes(y=after_stat(count), fill=after_stat(count))) +  # after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  labs(title="Histogram of Averages of 40 Exponentials over 1000 Simulations", y="Frequency", x="Mean")
hist.pop

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sample Mean versus Theoretical Mean

# Tabulating the sample mean and theoretical mean
sample.mean <- mean(population$x)
theoretical.mean <- 1/lambda
cbind(sample.mean, theoretical.mean)

##      sample.mean theoretical.mean
## [1,]    5.011911                5

# Checking the 95% confidence interval for sample mean
t.test(population$x)[4]

## $conf.int
## [1] 4.963824 5.059998
## attr(,"conf.level")
## [1] 0.95

As shown above, the sample mean (5.012) and the theoretical mean (5) are very close in value. The 95% confidence interval for the mean of the simulated averages runs from 4.96 to 5.06, which contains the theoretical mean.
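
The interval reported by t.test() can also be reproduced by hand, which makes explicit what it measures. The following is a minimal sketch, assuming the same parameters and seed as the simulation above; the names means, N, se and manual.ci are introduced here for illustration only and are not part of the original analysis.

```r
# Re-create the simulated averages (assumed: same lambda, n and seed as above)
lambda <- 0.2
n <- 40
set.seed(123)
means <- sapply(1:1000, function(i) mean(rexp(n, lambda)))

# 95% t-interval by hand: mean +/- t quantile * standard error of the mean
N <- length(means)
se <- sd(means) / sqrt(N)
manual.ci <- mean(means) + c(-1, 1) * qt(0.975, df = N - 1) * se

# The two intervals should agree
manual.ci
as.numeric(t.test(means)$conf.int)
```

Because the standard error shrinks with the number of simulated averages (1000 here), the interval is narrow. It describes uncertainty about the mean of the averages, not the spread of the individual averages.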

Sample Variance versus Theoretical Variance

# Tabulating the sample variance and theoretical variance
sample.variance <- var(population$x)
theoretical.variance <- ((1/lambda)^2)/n
cbind(sample.variance, theoretical.variance)

##      sample.variance theoretical.variance
## [1,]       0.6004928                0.625

As shown above, the sample variance (0.600) is close to the theoretical variance (0.625), differing by about 4%.
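
For reference, the theoretical value used above follows from the variance of an average of n independent draws, together with the fact that an exponential distribution with rate lambda has standard deviation 1/lambda. Written out with the parameters of this project (the symbols \bar{X} and \sigma are notation introduced here):

```latex
\operatorname{Var}(\bar{X})
  = \frac{\sigma^2}{n}
  = \frac{(1/\lambda)^2}{n}
  = \frac{(1/0.2)^2}{40}
  = \frac{25}{40}
  = 0.625
```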

Distribution

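# Plotting the distribution of the averages (blue) against the theoretical normal distribution (red)
gg <- ggplot(population, aes(x=x)) +
  geom_histogram(aes(y=after_stat(density), fill=after_stat(density))) +  # after_stat() replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)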
  labs(title="Histogram of Averages of 40 Exponentials over 1000 Simulations", y="Density", x="Mean") +
  geom_density(colour="blue") +
  geom_vline(xintercept=sample.mean, colour="blue", linetype="dashed") +
  stat_function(fun=dnorm, args=list(mean=1/lambda, sd=sqrt(theoretical.variance)), color="red") +
  geom_vline(xintercept=theoretical.mean, color="red", linetype="dashed")
gg

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

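The plot overlays the estimated density of the 1000 averages (blue) on a normal density with mean 1/lambda = 5 and variance 0.625 (red), with dashed lines marking the sample mean and the theoretical mean. The two curves and the two means are very close, which suggests that the distribution of averages of 40 exponentials is approximately normal, as the Central Limit Theorem predicts.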
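
A visual overlay can be complemented by a rough numerical check. The sketch below is one possible approach, not part of the original analysis: it standardises the averages using the theoretical mean and standard deviation, then compares the share falling within 1 and 2 standard deviations with the shares expected under a normal distribution. The names means, z, within1 and within2 are hypothetical, and the simulation is re-created under the assumption of the same parameters and seed.

```r
# Re-create the simulated averages (assumed: same lambda, n and seed as above)
lambda <- 0.2
n <- 40
set.seed(123)
means <- sapply(1:1000, function(i) mean(rexp(n, lambda)))

# Standardise with the theoretical mean (1/lambda) and
# theoretical standard deviation of the average ((1/lambda)/sqrt(n))
z <- (means - 1/lambda) / ((1/lambda) / sqrt(n))

# Observed share of averages within 1 and 2 standard deviations
within1 <- mean(abs(z) <= 1)
within2 <- mean(abs(z) <= 2)

# Compare with the shares implied by an exact normal distribution
c(within1 = within1, normal1 = pnorm(1) - pnorm(-1),
  within2 = within2, normal2 = pnorm(2) - pnorm(-2))

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(means)
qqline(means, col = "red")
```

Observed shares close to the normal shares, and Q-Q points close to the reference line, would support the conclusion drawn from the plot. With n = 40, some right skew inherited from the exponential distribution may still be visible in the upper tail.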