Statistical Inference Course Project: Part 1

Overview

This project will look at the exponential distribution in R and apply the Central Limit Theorem, as studied in this course. A simulation of 1000 averages of 40 exponential distributions will be analyzed. The mean, variance, and distribution of this simulation will be recorded.

Simulations

First let’s begin by creating our sample of 40 exponential distributions simulated 1000 times. Lambda = 0.2 will be set for all simulations.

set.seed(2020)
lambda <- 0.2 #set lamba
n <- 40 #set number of exponentials
sims <- 1000 #set number of simulatons
exp_sim <- replicate(sims, rexp(n, lambda))

exp_means <- apply(exp_sim, 2, mean)

hist(exp_means, breaks = 40, xlim = c(2,8), main = "Distribution of Simulation Means") #quick exploratory histogram

Things to note:

exp_sim is a numeric matrix such that each column is a simulation with 40 rows of exponentials.
exp_means is the vector of 1000 averages of 40 random exponentials.
The exploratory histogram shows that the data is indeed slightly Guassian.

Sample Mean vs. Theoretical Mean

Now let’s compare means. Theoretically the mean of an exponential distribution is 1/lambda. Comparatively we will calculate the sample mean of our simulation using mean function.

theo_mean <- 1/lambda #theoretical mean

print(paste("Theoretical Mean:", theo_mean))

## [1] "Theoretical Mean: 5"

real_mean <- round(mean(exp_means), 3) #sample mean, rounded to 3 digits

print(paste("Sample Mean:", real_mean))

## [1] "Sample Mean: 5.034"

So, the theoretical mean is very close to the actual sample mean of the distribution! Let’s see how this looks in our histogram:

hist(exp_means, breaks = 40, xlim = c(2,8), main = "Distribution of Simulation Means")

abline(v=real_mean, lwd = 2, col = "orange")
abline(v=theo_mean, lwd = 2, col = "green")

legend("topright", lwd = 2, col = c("orange", "green"), legend = c("Sample Mean", "Theoretical Mean"))

Sample Variance vs. Theoretical Variance

Let’s compare variances. Theoretically the standard deviation of an exponential distribution is 1/lambda, as well. So following the Central Limit Theorem, the distribution of averages has a variance of 1/(n * lambda^2), where n is the sample size (i.e. n = 40).

Comparatively we will calculate the sample variance of our simulation using var function.

theo_var <- 1/(n * lambda^2) #theoretical variance

print(paste("Theoretical Variance:", theo_var))

## [1] "Theoretical Variance: 0.625"

real_var <- round(var(exp_means), 3) #sample variance, rounded to 3 digits

print(paste("Sample Variance:", real_var))

## [1] "Sample Variance: 0.607"

The theoretical variance is a good estimator for the actual sample variance.

Distribution

According to the Central Limit Theorem, the distribution of averages of iid variable becomes standard normal as sample size increases. The distribution of averages has mean mu and variance sigma^2/n.

Using the CLT, the simulation should be approximately normal with mean 1/lamba and variance 1/(n * lambda^2), where lambda = 0.2 and n = 40.

The plot below illustrates the data with a density smoother (pink). Compare that, the dotted-line showing the approximate normal distribution (blue) using the theoretical mean and variances.

hist(exp_means, breaks = 40, xlim = c(2,8), main = "Distribution of Simulation Means", xlab = "mean", probability = TRUE)

lines(density(exp_means), lwd=3, col="pink") #density smoother of simulation

#Fit a normal approximation using the theoretical values
x <- seq(min(exp_means), max(exp_means), length=2*n)
y <- dnorm(x, mean=real_mean, sd=sqrt(real_var))
lines(x, y, col="blue", lwd=3, lty = 3)

legend("topright", lwd = 3, col = c("pink", "blue"), legend = c("Simulation Density", "Normal Density"), lty = c(1,3))

Following the CLT, the simulation is approximately normal, and as sample size (n) increases, it will become more normal.