Synopsis

This is part of the project for the Statistical Inference class in the Johns Hopkins Data Science Specialization by Coursera.

The exponential distribution can be simulated in R with rexp(n, lambda) function where lambda is the rate parameter and n is the number of observations. The mean of the exponential distribution is 1/lambda and the standard deviation is also also 1/lambda. It sets lambda = 0.2 for all of the simulations. In this simulation, it investigates the distribution of averages of 40 exponential(0.2)s.

This report Illustrates via simulation the properties of the distribution of the mean of 40 exponential(0.2)s.
It shows:

Simulating the Exponential Distribution

First it needs to generate the simulation according to parameters specified by the project. The number of observations is 40 and lambda is 0.2. It uses set.seed function to ensure all results are reproducible.

Furthermore, it use replicate function to simulate evaluation of the mean of 40 exponential distributions. In this case, it sets the number of simulations with 1500.

The following R code performs the simulation:

## Simulating the Exponential Distribution
#
#### Set parameters: lambda, number of observations and number of simulations
lambda = 0.2
samples = 40
nrSimulations = 1500
#
## Set set.seed function to get reproducible results
set.seed(50)
#
## Get the mean of each set of 40 and the means of each simulation
allMeans <- replicate(nrSimulations, mean(rexp(samples, lambda)))

Where the distribution is centered and the theoretical center of the distribution

The distribution is centered in the mean. Thus, the mean of theoretical exponential distribution is 1/lambda and the mean of the simulate data is the mean of all simulations. The following R code demonstrates that the distribution is centered approximately in the theoretical center of the exponential distribution.

## Theoretical center
theoCenter <- 1/lambda
theoCenter
## [1] 5
## Center of the simulations
mean(allMeans)
## [1] 4.966

Then, the center of the distribution simulated is approximately the theoretical center of the exponential distribution.

How variable it is relative to the theoretical variance of the distribution

The theoretical variance of the exponential distribution is (1/lambda)^2/(n), where n is the number of observations (“samples” in R code). The theoretical standard deviation is (1/lambda)/sqrt(n).

The simulated data has variance calculated by the var function in R and a standard deviation calculated by the sd function. The following R code calculates these values:

## Theoretical Variance and standard deviation
theoVariance <-  (1/lambda)^2/(samples)
theoVariance
## [1] 0.625
## Theoretical standard deviation
theoStandardDeviation <- (1/lambda)/sqrt(samples)
theoStandardDeviation
## [1] 0.7906
## Variance of the simulations
varianceSimulations <- var(allMeans)
varianceSimulations
## [1] 0.6296
## Standard deviation of the simulations
standardDeviationSimulations <-sd(allMeans)
standardDeviationSimulations
## [1] 0.7935

These values demonstrates that the theoretical variance of the distribution is similar to variance of the simulations.

The distribution is approximately normal

The normality of the simulated distribution can be verified by the Shapiro-Wilk normality test. When it gets an approximate p-value < 0.1, the distribution is approximately normal.

The following R code performs the Shapiro-Wilk normality test for the simulated distribution:

## Performing Shapiro-Wilk normality test
shapiro.test(allMeans)
## 
##  Shapiro-Wilk normality test
## 
## data:  allMeans
## W = 0.9936, p-value = 4.585e-06

How it gets a p-value < 0.1, then the simulated distribution is approximately normal. This is graphically perceptible in the following plot:

library(ggplot2)
ggplot(data = data.frame(allMeans), aes(x = allMeans)) +
        geom_histogram(binwidth = 0.2, aes(y= ..density..), fill = NA, color = "black") + 
        geom_vline(xintercept = mean(allMeans), size = 1, color = "green",linetype="dashed") + 
        geom_vline(xintercept = theoCenter, size = 1, color = "red",linetype="dashed") + 
        stat_function(fun = dnorm, color = "green", size = 1,
                      arg = list(mean = mean(allMeans), sd = standardDeviationSimulations)) +
        stat_function(fun = dnorm, color = "red", size = 1,
                      arg = list(mean = theoCenter, sd = theoStandardDeviation)) +
        ylab("density") + xlab("mean") +
        ggtitle(expression(atop("Simulated distribution and theoretical distribution", 
                                atop(italic("Green - Simulated, Red - Theoretical"), ""))))  +
        scale_x_continuous(breaks = seq(0, 8, 1)) 

plot of chunk plot