This document completes the Statistical Inference Course Assignment, part of Coursera’s Data Science Certification by Johns Hopkins University. The project has two parts.
As requested, each PDF report should be at most 3 pages long, with up to 3 additional pages of supporting material allowed as an appendix if needed.
Investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. By investigating the distribution of averages of 40 exponentials, illustrate via simulation and associated explanatory text the properties of the distribution:
Create the test bench and obtain the data
set.seed(28) # Setting the seed of R's random number generator before simulating makes the results reproducible
lambda <- 0.2 # The provided rate parameter
exp_num <- 40 # The number of exponentials per sample
simu_num <- 1000 # The number of simulations
# Run the simulations: each column of the matrix holds one sample of 40 exponentials
simulation <- replicate(simu_num, rexp(exp_num, lambda))
# Calculate the mean of each simulated sample (one mean per column)
means <- colMeans(simulation)
The mean of the exponential distribution is equal to 1/lambda. As our
lambda is equal to 0.2, the theoretical mean should be 5. The graphical
representation below depicts the relationship with our simulation
mean.
sample_mean <- mean(means) # Get the final mean figure
theory_mean <- 1/lambda # The exponential theoretical mean
# Plot a histogram of the simulated means to compare both values visually
r <- hist(means, main = "Exponential Sample Means - (Simulated)", col = "lightgray", breaks = 100)
abline(v = sample_mean, lwd = 4, col = "red")
abline(v = theory_mean, lwd = 4, col = "blue3")
# And show our numbers
text(6.5, 30, paste("Sample mean = ", round(sample_mean,2)), col="red")
text(6.5, 28, paste("Theoretical mean = ", round(theory_mean,2)), col="blue3")
Our code above shows that the sample mean is very close to the theoretical mean of 5.
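The closeness of the two means can also be checked numerically. The following is a minimal sketch that re-runs the simulation with the same settings as above and compares the gap between the sample mean and the theoretical mean with the standard error of the sample mean. The names `se_grand` and `abs_diff` are illustrative and not part of the original analysis.

```r
# Re-create the simulated means with the same settings as the report
set.seed(28)
lambda <- 0.2
exp_num <- 40
simu_num <- 1000
means <- replicate(simu_num, mean(rexp(exp_num, lambda)))

theory_mean <- 1 / lambda   # theoretical mean of the exponential distribution
sample_mean <- mean(means)  # mean of the 1000 simulated means

# Illustrative check (an assumption of this sketch, not part of the report):
# standard error of the average of the simulated means
se_grand <- sd(means) / sqrt(simu_num)
abs_diff <- abs(sample_mean - theory_mean)

cat("Sample mean:        ", round(sample_mean, 3), "\n")
cat("Theoretical mean:   ", theory_mean, "\n")
cat("Absolute difference:", round(abs_diff, 3), "\n")
cat("Standard error:     ", round(se_grand, 3), "\n")
```

A difference of only a few standard errors is what one would expect from random variation alone.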
To find the variance, first find the standard deviation of the sample mean. The exponential distribution itself has standard deviation \(1/\lambda\), so the mean of \(n\) independent exponentials has standard deviation \((1/\lambda)/\sqrt{n}\). Squaring this standard deviation gives the variance.
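For reference, the variance of the mean of \(n\) independent exponentials can be worked out step by step. This is a standard result, written out here with the values used in this report (\(\lambda = 0.2\), \(n = 40\)):

\[
\operatorname{Var}(\bar{X})
= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{1}{n\lambda^2}
= \frac{25}{40}
= 0.625,
\qquad
\operatorname{sd}(\bar{X}) = \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906
\]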
theory_sd <- (1/lambda)/sqrt(exp_num)
sample_sd <- sd(means)
theory_var <- theory_sd^2
sample_var <- sample_sd^2
| Parameter Type | Standard Deviation | Variance |
|---|---|---|
| Theoretical | 0.7906 | 0.625 |
| Sample | 0.7867 | 0.6188 |
As with the mean, the sample and theoretical variances are close to each other.
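The comparison table can be assembled directly from the computed values rather than typed by hand. The sketch below assumes the same simulation settings as above. The data frame name `summary_table` and its column names are hypothetical choices made for this illustration.

```r
# Re-create the simulated means with the same settings as the report
set.seed(28)
lambda <- 0.2
exp_num <- 40
simu_num <- 1000
means <- replicate(simu_num, mean(rexp(exp_num, lambda)))

theory_sd <- (1 / lambda) / sqrt(exp_num)  # standard deviation of the sample mean
sample_sd <- sd(means)                     # standard deviation of the simulated means

# Hypothetical summary table; the column names are illustrative
summary_table <- data.frame(
  Parameter_Type     = c("Theoretical", "Sample"),
  Standard_Deviation = round(c(theory_sd, sample_sd), 4),
  Variance           = round(c(theory_sd^2, sample_sd^2), 4)
)
print(summary_table)
```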
According to the Central Limit Theorem, the means of the simulated samples should follow an approximately normal distribution. This section investigates whether the distribution of the simulated means (not the exponential distribution itself, which is strongly skewed) is approximately normal.
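# Provide the histogram of the simulation as background
h <- hist(means, main = "Normal Distribution X Simulation", col = "lightgray", breaks = 100)
# Factor that converts a density into expected counts: number of means times bin width
count_scale <- length(means) * diff(h$breaks)[1]
# Overlay the normal distribution predicted by the Central Limit Theorem
xfit <- seq(min(means), max(means), length = 100)
yfit <- dnorm(xfit, mean = 1/lambda, sd = (1/lambda)/sqrt(exp_num))
lines(xfit, yfit*count_scale, lwd=2, col="red")
# And the simulation density, on the same count scale
den <- density(means)
lines(den$x, den$y*count_scale, lwd=2, col="blue")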
# Provide the legend
text(6.5, 30, "Normal Distribution", col="red")
text(6.5, 28, "Simulation Density", col="blue3")
The plot above shows that the distribution of the means of randomly sampled exponentials closely follows the normal distribution. Note that the Central Limit Theorem concerns the sample size (here, 40 exponentials per mean): increasing that size brings the distribution of the means closer to normal, whereas increasing the number of simulations beyond 1000 only yields a smoother estimate of the same distribution.
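One way to probe how the sample size affects normality is to compare the skewness of the simulated means for several sample sizes. The sketch below is illustrative and not part of the original analysis. The sample sizes in `sizes` are arbitrary choices, and `skewness` is a small helper defined here because base R does not provide one. For the mean of \(n\) independent exponentials the theoretical skewness is \(2/\sqrt{n}\), which shrinks toward 0 (the skewness of a normal distribution) as \(n\) grows.

```r
set.seed(28)
lambda <- 0.2
simu_num <- 1000

# Helper: sample skewness as the mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Illustrative sample sizes (an assumption of this sketch); 40 is the report's value
sizes <- c(5, 40, 400)

# Simulated skewness of the sample means for each sample size
skew_sim <- sapply(sizes, function(n) {
  skewness(replicate(simu_num, mean(rexp(n, lambda))))
})

# Theoretical skewness of the mean of n independent exponentials
skew_theory <- 2 / sqrt(sizes)

print(data.frame(n = sizes,
                 simulated = round(skew_sim, 3),
                 theoretical = round(skew_theory, 3)))
```

A Q-Q plot of the simulated means, for example with `qqnorm(means)` followed by `qqline(means)`, gives a complementary visual check.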