Overview

This document completes the Statistical Inference course assignment, part of Coursera’s Data Science certification by Johns Hopkins University. There are two parts to this project.

As requested, each PDF report should be at most 3 pages long, with up to 3 additional pages of supporting material allowed as an appendix if needed.

Part 1 - Simulation exercise

Overview

Investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. By investigating the distribution of averages of 40 exponentials, illustrate via simulation and associated explanatory text the properties of the distribution:

Requested demonstrations

  1. Show the sample mean and compare it to the theoretical mean of the distribution.
  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal. Focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Given conditions

  1. Set lambda = 0.2 for all of the simulations.
  2. Investigate the distribution of averages of 40 exponentials.
  3. Do 1000 simulations.

Create the test bench and obtain the data

set.seed(28)      # Seed R's random number generator so the simulation is reproducible
lambda <- 0.2     # The provided lambda parameter
exp_num <- 40     # The number of exponentials
simu_num <- 1000  # And the number of simulations

#Run the simulations
simulation <- replicate(simu_num, rexp(exp_num, lambda))

#Calculate the means of the exponential simulation
means <- apply(simulation, 2, mean)
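As a quick sanity check on the set-up above, the following sketch rebuilds the simulation on its own and confirms the shape of the result. It assumes the same seed and parameters as the report; the names sim_dim, means_apply, means_col and same_means are introduced here for illustration only.

```r
# Self-contained re-creation of the simulation set-up
set.seed(28)
lambda <- 0.2
exp_num <- 40
simu_num <- 1000

simulation <- replicate(simu_num, rexp(exp_num, lambda))

# replicate() stores each run as a column:
# 40 rows (draws per run) by 1000 columns (runs)
sim_dim <- dim(simulation)
print(sim_dim)

# colMeans() is a vectorised equivalent of apply(simulation, 2, mean)
means_apply <- apply(simulation, 2, mean)
means_col <- colMeans(simulation)
same_means <- isTRUE(all.equal(means_apply, means_col))
print(same_means)
```

Because each column holds one run of 40 exponentials, taking the mean over margin 2 gives one average per simulation.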

Question 1: Comparison between Sample Mean and Theoretical Mean

The mean of the exponential distribution is equal to 1/lambda. As our lambda is equal to 0.2, the theoretical mean should be 5. The graphical representation below depicts the relationship with our simulation mean.

sample_mean <- mean(means)    # Get the final mean figure
theory_mean <- 1/lambda       # The exponential theoretical mean

# Now plot a histogram to see things clearly
r <- hist(means, main = "Exponential Sample Means - (Simulated)", col = "lightgray", breaks = 100)
abline(v = sample_mean, lwd = 4,  col = "red")
abline(v = theory_mean, lwd = 4,  col = "blue3")

# And show our numbers
text(6.5, 30, paste("Sample mean = ", round(sample_mean,2)), col="red")
text(6.5, 28, paste("Theoretical mean = ", round(theory_mean,2)), col="blue3")

The plot above shows that the sample mean is very close to the theoretical mean of 5.
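One way to put a number on "very close" is to compare the gap between the two means with the Monte Carlo standard error of the simulated average. The sketch below is self-contained and assumes the same seed and parameters as the report; abs_diff, mc_se and ci are names introduced here, and the interval relies on a normal approximation.

```r
# Self-contained re-creation of the simulated means
set.seed(28)
lambda <- 0.2
exp_num <- 40
simu_num <- 1000
means <- colMeans(replicate(simu_num, rexp(exp_num, lambda)))

sample_mean <- mean(means)
theory_mean <- 1 / lambda

# Absolute gap between the simulated and the theoretical mean
abs_diff <- abs(sample_mean - theory_mean)

# Monte Carlo standard error of the average of the 1000 means
mc_se <- sd(means) / sqrt(simu_num)

# Approximate 95% interval for the centre of the simulated distribution
ci <- sample_mean + c(-1, 1) * qnorm(0.975) * mc_se

print(c(abs_diff = abs_diff, mc_se = mc_se))
print(ci)
```

A gap of the same order as mc_se is what random variation alone would be expected to produce.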

Question 2: Comparison between Sample Variance and Theoretical Variance

To find the variance, first find the standard deviation of the mean of \(n\) exponentials, which is equal to \((1/\lambda)/\sqrt{n}\). Then square the standard deviation to obtain the variance.
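For reference, the formula follows from the variance of an average of independent draws. Writing \(X_1, \dots, X_n\) for the \(n\) independent exponentials with rate \(\lambda\) and \(\bar{X}\) for their average (notation introduced here), each \(X_i\) has variance \(1/\lambda^2\), so:

\[
\operatorname{Var}(\bar{X}) = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i) = \frac{1}{n \lambda^2},
\qquad
\operatorname{sd}(\bar{X}) = \frac{1/\lambda}{\sqrt{n}}
\]

With \(\lambda = 0.2\) and \(n = 40\):

\[
\operatorname{Var}(\bar{X}) = \frac{25}{40} = 0.625,
\qquad
\operatorname{sd}(\bar{X}) = \sqrt{0.625} \approx 0.7906
\]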

theory_sd <- (1/lambda)/sqrt(exp_num)
sample_sd <- sd(means)

theory_var <- theory_sd^2
sample_var <- sample_sd^2

Parameter Type    Standard Deviation    Variance
Theoretical       0.7906                0.625
Sample            0.7867                0.6188

As with the mean, the sample and theoretical variances are close to each other.
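The comparison table can be assembled directly from the computed values. This is a minimal sketch, assuming the same seed and parameters as the report; the data frame comparison and its column names are introduced here for illustration.

```r
# Self-contained re-creation of the simulated means
set.seed(28)
lambda <- 0.2
exp_num <- 40
simu_num <- 1000
means <- colMeans(replicate(simu_num, rexp(exp_num, lambda)))

theory_sd <- (1 / lambda) / sqrt(exp_num)
sample_sd <- sd(means)

# One row per parameter type, rounded to four decimals
comparison <- data.frame(
  parameter_type = c("Theoretical", "Sample"),
  standard_deviation = round(c(theory_sd, sample_sd), 4),
  variance = round(c(theory_sd^2, sample_sd^2), 4)
)
print(comparison)
```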

Question 3: Distribution

According to the Central Limit Theorem, the means of the simulated samples should follow an approximately normal distribution. This section investigates whether the distribution of the averages of 40 exponentials is indeed approximately normal.

# Provide the histogram of the simulation as background
h <- hist(means, main = "Normal Distribution vs. Simulation", col = "lightgray", breaks = 100)

# Factor that rescales a density to the histogram's count axis
# (number of observations times the bin width)
count_scale <- length(means) * diff(h$breaks)[1]

# Overlay the normal distribution
xfit <- seq(min(means), max(means), length.out = 100)
yfit <- dnorm(xfit, mean = 1/lambda, sd = (1/lambda)/sqrt(exp_num))
lines(xfit, yfit * count_scale, lwd = 2, col = "red")

# And the simulation density
den <- density(means)
lines(den$x, den$y * count_scale, lwd = 2, col = "blue")

# Provide the legend
text(6.5, 30, "Normal Distribution", col="red")
text(6.5, 28, "Simulation Density", col="blue")

The plot above shows that the distribution of the means of randomly sampled exponentials closely follows the normal distribution. By the Central Limit Theorem, the approximation improves as the number of exponentials averaged in each sample (here 40) increases; running more than 1000 simulations only makes the histogram a smoother estimate of that sampling distribution.
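To highlight the difference between a large collection of raw exponentials and a large collection of averages of 40 exponentials, the sketch below compares their skewness. An exponential distribution has skewness 2, while the average of 40 independent exponentials has skewness 2/sqrt(40), about 0.32, which is much closer to the 0 of a normal distribution. The helper skewness() and the names raw_draws, skew_raw, skew_means and within_95 are introduced here; since the raw draws use the random stream first, the means differ slightly from those in the report.

```r
set.seed(28)
lambda <- 0.2
exp_num <- 40
simu_num <- 1000

# 1000 single exponentials versus 1000 averages of 40 exponentials
raw_draws <- rexp(simu_num, lambda)
means <- colMeans(replicate(simu_num, rexp(exp_num, lambda)))

# Moment-based sample skewness (hypothetical helper; 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(raw_draws)
skew_means <- skewness(means)
print(c(raw = skew_raw, means = skew_means))

# Share of means within 1.96 theoretical standard deviations of 1/lambda;
# a normal distribution would put about 95% of its mass there
theory_sd <- (1 / lambda) / sqrt(exp_num)
within_95 <- mean(abs(means - 1 / lambda) <= qnorm(0.975) * theory_sd)
print(within_95)
```

The raw draws stay strongly right-skewed however many are collected, whereas the averages are nearly symmetric, which is the behaviour the Central Limit Theorem describes.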