“Statistical Inference Course - Assignment 1”

(Part of Coursera - Johns Hopkins University - Data Science Specialization)
author: “Aled Evans”
Project Overview
The project investigates the exponential distribution in R and compares the distribution of its sample means with what the Central Limit Theorem predicts. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. Lambda is set to 0.2 throughout.

Simulations

The simulation fixes lambda = 0.2, which determines both the mean and the SD: the mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Lambda = 0.2 is used for all of the simulations, which investigate the distribution of averages of 40 exponentials. The number of simulations for this investigation is 1000.
# load ggplot for use in analysis
library(ggplot2)
# set the constants for the analysis
lambda <- 0.2 # lambda for rexp
n <- 40 # number of exponentials
numberSimulations <- 1000 # the number of simulations/tests
# set the seed to create a reproducible analysis
set.seed(523532)
# run the simulation: the result is a matrix with numberSimulations rows and n columns
expoDistributions <- matrix(data = rexp(n * numberSimulations, lambda), nrow = numberSimulations)
# take the mean of each row, i.e. the mean of each set of 40 exponentials
expoDistDataMeans <- data.frame(means = apply(expoDistributions, 1, mean))
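
The statement at the start of this section, that the mean and standard deviation of the exponential distribution both equal 1/lambda, can be spot-checked with a large sample. The following is a minimal sketch and not part of the original analysis; the seed, the sample size of 100000 and the names checkSample, checkMean and checkSD are assumptions chosen only for illustration.

```r
# Illustrative check: draw a large exponential sample and compare its
# mean and SD with the theoretical value 1/lambda.
set.seed(1)                      # arbitrary seed, used only for this sketch
lambda <- 0.2                    # same rate as in the analysis
checkSample <- rexp(100000, rate = lambda)  # hypothetical helper variable
checkMean <- mean(checkSample)   # should be close to 1/lambda
checkSD <- sd(checkSample)       # should also be close to 1/lambda
# Show the two estimates next to the theoretical value
c(mean = checkMean, sd = checkSD, theoretical = 1 / lambda)
```

Both estimates should land near 5, which is why a single constant, lambda, is enough to set up the whole analysis.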

Sample Values and Theoretical Values

Calculate the sample mean, stored as meanExpoMeans.
# Calculate sample mean
meanExpoMeans <- mean(expoDistDataMeans$means)
Calculate the sample SD (standard deviation), stored as sdExp.
# Calculate and store sample SD
sdExp <- sd(expoDistDataMeans$means)
Calculate the variance of the sample means, stored as VarianceExp.
# Calculate and store variance of sample means
VarianceExp <- var(expoDistDataMeans$means)

# Display the mean, SD and variance of the 1000 simulated means
meanExpoMeans
## [1] 5.030837
sdExp
## [1] 0.7948557
VarianceExp
## [1] 0.6317955

Theoretical values - for use in visual comparison

Calculate the theoretical mean, standard deviation and variance of the mean of 40 exponentials. These are stored for the visual comparison of sample and theoretical values later.
# Calculate and store theoretical mean
theoreticalMean <- 1/lambda
# Theoretical standard deviation of the mean of n exponentials: (1/lambda) / sqrt(n)
theoreticalSD <- (1/lambda) / sqrt(n)
# Theoretical variance - calculate and store
theoreticalVar <- theoreticalSD^2
# Call each of the theoretical values to display in the analysis
theoreticalMean
## [1] 5
theoreticalSD
## [1] 0.7905694
theoreticalVar
## [1] 0.625
As the output above shows, the sample and theoretical values are close but not identical: the mean is 5.031 vs. 5, the standard deviation is 0.795 vs. 0.791, and the variance is 0.632 vs. 0.625.
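
The theoretical values follow from the standard result for the mean of n independent draws with common standard deviation sigma. As a short worked derivation, writing \bar{X} for the mean of n = 40 exponentials with lambda = 0.2 (the symbol \bar{X} is notation introduced here, not used elsewhere in the report):

```latex
\begin{aligned}
E[\bar{X}] &= \frac{1}{\lambda} = \frac{1}{0.2} = 5 \\
\operatorname{Var}(\bar{X}) &= \frac{\sigma^2}{n} = \frac{(1/\lambda)^2}{n} = \frac{25}{40} = 0.625 \\
\operatorname{SD}(\bar{X}) &= \frac{1/\lambda}{\sqrt{n}} = \frac{5}{\sqrt{40}} \approx 0.7906
\end{aligned}
```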

Plot to illustrate comparison between theoretical values and simulated values

# Base plot: map the simulated means to the x axis
# (after_stat() and linewidth need ggplot2 3.4.0 or later; older versions use ..density.. and size)
meansPlot <- ggplot(expoDistDataMeans, aes(x = means))
# Plot sample means as a density-scaled histogram in purple
meansPlot <- meansPlot + geom_histogram(aes(y = after_stat(density)), binwidth = lambda, fill = "purple", color = "purple")
# Label the histogram plot
meansPlot <- meansPlot + labs(title = "Distribution of Means of 40 Exponentials - 1000 simulations", x = "Mean of each sample of 40 exponentials", y = "Density")
# Plot a normal curve with the sample mean and sample SD in yellow
meansPlot <- meansPlot + stat_function(fun = dnorm, args = list(mean = meanExpoMeans, sd = sdExp), color = "yellow", linewidth = 1.0)
# Plot the sample mean as a yellow long-dashed line ("F1" is a custom dash pattern)
meansPlot <- meansPlot + geom_vline(xintercept = meanExpoMeans, color = "yellow", linewidth = 0.8, linetype = "F1")
# Plot the theoretical normal distribution of the mean in green
meansPlot <- meansPlot + stat_function(fun = dnorm, args = list(mean = theoreticalMean, sd = theoreticalSD), color = "green", linewidth = 0.8)
# Plot the theoretical mean as a green dotted line
meansPlot <- meansPlot + geom_vline(xintercept = theoreticalMean, color = "green", linewidth = 1.0, linetype = "dotted")
meansPlot

For visual comparison to aid in the analysis:
The sample mean is displayed in yellow as a long-dashed line. The theoretical mean is a green dotted line.
The theoretical distribution of the sample mean (a normal curve with mean 5 and SD 0.791) is displayed as a green line. The normal curve built from the sample mean and sample SD is displayed as a yellow line, drawn over the purple histogram of the simulated means.
The plot shows that, as the central limit theorem predicts, the means of 40 exponentials are approximately normally distributed. The sample curve in yellow lies very close to the theoretical curve in green, so the difference between the sample values and the theoretical values is small.
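
The visual impression of normality can be backed up with a simple numeric check. The sketch below is not part of the original analysis: it re-creates the simulated means with the same seed and constants, standardises them with the theoretical mean and SD, and compares the share of means within one and two standard deviations with the shares a normal distribution predicts (about 0.683 and 0.954). The names simMeans, z, within1 and within2 are assumptions introduced for illustration.

```r
# Illustrative sketch: re-create the simulated means, then compare their
# coverage with what a normal distribution predicts.
set.seed(523532)                 # same seed as the analysis
lambda <- 0.2                    # rate parameter
n <- 40                          # exponentials per simulation
numberSimulations <- 1000        # number of simulations
# One row per simulation; rowMeans() gives the mean of each row
simMeans <- rowMeans(matrix(rexp(n * numberSimulations, lambda),
                            nrow = numberSimulations))
theoreticalMean <- 1 / lambda
theoreticalSD <- (1 / lambda) / sqrt(n)
# Standardise the means using the theoretical values
z <- (simMeans - theoreticalMean) / theoreticalSD
within1 <- mean(abs(z) <= 1)     # share of means within 1 SD
within2 <- mean(abs(z) <= 2)     # share of means within 2 SD
# Show the observed shares next to the normal-theory shares
c(within1 = within1, normal1 = pnorm(1) - pnorm(-1),
  within2 = within2, normal2 = pnorm(2) - pnorm(-2))
```

If the observed shares sit close to the normal-theory shares, that supports the reading of the plot; a normal Q-Q plot of simMeans (qqnorm followed by qqline) would be another reasonable check.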
Final Summary
1 - The analysis calculates the sample mean (from 1000 simulations, each the mean of 40 exponentials) and compares it to the theoretical mean.
2 - The variance of the sample means has been compared with the theoretical variance of the mean, and both distributions have been plotted for a visual comparison.
3 - The distribution of the sample means has been shown to be approximately normal. The yellow curve (sample) is a very close approximation of the theoretical normal distribution curve (green).