Executive Summary

This report is a course project for the Statistical Inference course in the Data Science Specialization offered by Johns Hopkins University on Coursera.

The project consists of two parts:

Part 1: Simulation Exercise Instructions

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

1.- Show the sample mean and compare it to the theoretical mean of the distribution.

Simulation Run

# Constants: rate parameter, sample size per average, number of simulations
lambda <- 0.2
n <- 40
sim <- 1000

# Set seed to create reproducibility
set.seed(12345)

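# Simulation: each row holds one sample of n exponentials
Exp_Sim <- matrix(rexp(sim*n, lambda), nrow = sim, ncol = n)
# Average each row to get one sample mean per simulation
Exp_Mean <- apply(Exp_Sim, 1, mean)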

Exponential Distribution Mean (Theoretical Mean)

Theo_Mean <- 1/lambda
print(Theo_Mean)
## [1] 5

Mean of the Simulated Exponential Random Variable Means (Sample Mean)

Sample_Mean <- mean(Exp_Mean)
print(Sample_Mean)
## [1] 4.971972
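
One way to judge whether the gap between the sample mean and the theoretical mean is small is to express it in units of the standard error of the mean of the simulated averages. The sketch below repeats the simulation with the same constants and seed; the names se_of_mean and gap_in_se are illustrative and not part of the original analysis.

```r
# Repeat the simulation with the same constants and seed as above
lambda <- 0.2
n <- 40
sim <- 1000
set.seed(12345)
Exp_Sim <- matrix(rexp(sim * n, lambda), nrow = sim, ncol = n)
Exp_Mean <- rowMeans(Exp_Sim)

Theo_Mean <- 1 / lambda
Sample_Mean <- mean(Exp_Mean)

# Standard error of the mean of the sim averages (illustrative name)
se_of_mean <- sd(Exp_Mean) / sqrt(sim)

# Gap between sample and theoretical mean, in standard-error units
gap_in_se <- abs(Sample_Mean - Theo_Mean) / se_of_mean
print(gap_in_se)
```

A gap of no more than about two standard errors is what simulation noise alone would typically produce.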

Visualization of both distributions and their means.

# Histograms
par(mfrow = c(2,1))

# Distribution of the Simulated Sample of Exponential Random Variables
hist(Exp_Sim, breaks = 40, xlim=c(0,30), 
     main = "Distribution of the Simulated Sample of Exponential Random Variables", 
     xlab = "Exponential Variables", 
     ylab = "Frequency", 
     col = "khaki1")
abline(v = 1/lambda, lty = 1, lwd = 4, col = "gray13")
legend("topright", lty = 1, lwd = 4, col = "gray13", legend = "Theoretical Mean")

# Distribution of the Simulated Exponential Random Variable Means
hist(Exp_Mean, breaks = 40, xlim=c(3,9), 
     main = "Distribution of the Simulated Exponential Random Variable Means", 
     xlab = "Means", 
     ylab = "Frequency", 
     col = "khaki1")
abline(v = mean(Exp_Mean), lty = 1, lwd = 4, col = "gray13")
legend("topright", lty = 1, lwd = 4, col = "gray13", legend = "Sample Mean")

2.- Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

Theoretical Variance of the Distribution of Averages (Theoretical Variance)

# Variance of the mean of n exponentials: sigma^2 / n, with sigma = 1/lambda
Theo_Var <- ((1/lambda)/sqrt(n))^2
print(Theo_Var)
## [1] 0.625
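
The value above follows from the variance of an average of independent draws. As a short worked derivation, writing \bar{X}_n for the average of n independent exponentials with rate \lambda (notation introduced here for illustration only):

```latex
\operatorname{Var}(\bar{X}_n)
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{n}
  = \frac{(1/\lambda)^2}{n}
  = \frac{(1/0.2)^2}{40}
  = \frac{25}{40}
  = 0.625
```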

Variance of the Distribution of Averages (Sample Variance)

Sample_Var <- var(Exp_Mean)
print(Sample_Var)
## [1] 0.6157926

The two variances are close: the sample variance (0.6158) is within about 1.5% of the theoretical variance (0.625).
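
The theoretical variance of an average is the population variance divided by the sample size, so it should shrink as the sample size grows. The sketch below checks this for a few sample sizes; the helper var_of_means and the sizes 10 and 160 are assumptions added for illustration, and only n = 40 comes from the analysis above.

```r
lambda <- 0.2
sim <- 1000
set.seed(12345)

# Hypothetical helper: variance of sim averages, each over `size` exponentials
var_of_means <- function(size) {
  samples <- matrix(rexp(sim * size, lambda), nrow = sim, ncol = size)
  var(rowMeans(samples))
}

# Compare simulated and theoretical variances for several sample sizes
sizes <- c(10, 40, 160)
comparison <- data.frame(
  n = sizes,
  simulated = sapply(sizes, var_of_means),
  theoretical = (1 / lambda)^2 / sizes
)
print(comparison)
```

Each fourfold increase in sample size should cut the variance of the averages to roughly a quarter.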

3.- Show that the distribution is approximately normal.

The graph below shows a density histogram of the simulated means, overlaid with a normal curve. The curve's mean is the theoretical exponential mean (1/lambda), and its standard deviation is the theoretical standard deviation divided by the square root of the sample size, as given by the Central Limit Theorem. The curve fits the histogram relatively well.

# Density histogram of the simulated means
hist(Exp_Mean, breaks = 40,
     main = "Distribution of the Simulated Exponential Random Variable Means",
     freq = FALSE, xlab = "Means", ylab = "Density", col = "khaki1")

# Normal Distribution Line
# 40 evenly spaced x values spanning the range of the simulated means
xfit <- seq(min(Exp_Mean), max(Exp_Mean), length.out = 40)
# Normal densities at those x values, using the CLT mean and standard deviation
yfit <- dnorm(xfit, mean = 1/lambda, sd = 1/lambda/sqrt(n))
# Overlay the normal curve on the histogram
lines(xfit, yfit, lty = 1, lwd = 2, col = "black")
legend(6.5, 0.50, lty = 1, lwd = 2, col = "black", legend = "Normal Distribution", cex = 0.75)

Using the mean and standard deviation given by the Central Limit Theorem, we can also see what the normal curve looks like on the frequency scale of the simulated means.

# Frequency histogram of the simulated means (h keeps the bin breaks)
h <- hist(Exp_Mean, breaks = 40,
          main = "Distribution of the Simulated Exponential Random Variable Means",
          xlab = "Means", ylab = "Frequency", col = "khaki1")

# Normal Distribution Line
# 40 evenly spaced x values spanning the range of the simulated means
xfit2 <- seq(min(Exp_Mean), max(Exp_Mean), length.out = 40)
# Normal densities at those x values, using the CLT mean and standard deviation
yfit2 <- dnorm(xfit2, mean = 1/lambda, sd = 1/lambda/sqrt(n))
# Convert densities to expected counts: density * bin width * number of means
yfit2 <- yfit2*diff(h$breaks[1:2])*length(Exp_Mean)
# Overlay the scaled normal curve on the histogram
lines(xfit2, yfit2, lty = 1, lwd = 2, col = "black")
legend(6.5, 45, lty = 1, lwd = 2, col = "black", legend = "Normal Distribution", cex = 0.75)
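
A normal Q-Q plot is a common complementary check of approximate normality. The sketch below is an addition to the original analysis: it repeats the simulation with the same constants and seed, standardizes the means with the Central Limit Theorem's mean and standard deviation, and plots them against normal quantiles. The names z, qq and qq_cor are illustrative.

```r
lambda <- 0.2
n <- 40
sim <- 1000
set.seed(12345)
Exp_Mean <- rowMeans(matrix(rexp(sim * n, lambda), nrow = sim, ncol = n))

# Standardize with the CLT mean and standard deviation
z <- (Exp_Mean - 1 / lambda) / (1 / lambda / sqrt(n))

# Normal Q-Q plot; points near the line y = x suggest approximate normality
qq <- qqnorm(z, main = "Normal Q-Q Plot of the Standardized Means")
abline(0, 1, lty = 1, lwd = 2, col = "gray13")

# Correlation between theoretical and sample quantiles (illustrative summary)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Because averages of 40 exponentials are still slightly right-skewed, some departure from the line in the upper tail would not be surprising.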