Overview

In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem, using simulations. This paper are part of Statistical Inference course project from Data Science specialization, by John Hopkins

We will verify:

The sample mean and compare it to the theoretical mean of the distribution.
How variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
See if the distribution is approximately a normal distribution.

Simulations

The exponential distribution will be simulated using rexp(n, lambda) where lambda is the rate parameter.

In this study we will generate 1000 simulations (averages) with 40 random exponential samples each. We’ll define \(\lambda=0.2\) (lambda=0.2) for all of the simulations.

Generating the averages

## simulation parameters
n <- 1000 # simulations
s <- 40 # sample size
lambda <- 0.2 # exponential rate
## simulation data
dt <- NULL # null initial value
for (i in 1:n) dt <- c (dt, mean(rexp(s,lambda)) ) # 1000 means of 40 random exponential sample

Sample Mean versus Theoretical Mean

The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\), so, the expected value for the mean is \(1/0.2 = 5\).

Let’s see the means distribution in the simulation.

library(ggplot2)

g <- ggplot(as.data.frame(dt), aes(x = dt))
g <- g + geom_histogram(color="black", alpha=.1, fill="blue", binwidth=.2)
g <- g + geom_vline(xintercept=mean(dt), color="blue")
g <- g + geom_vline(xintercept=1/lambda, color="red")
g <- g + xlab("Average from 40 random exp sample")
g <- g + ylab("Frequency")
g <- g + ggtitle("Average of means (blue) vs Theorical mean (red)")
g <- g + theme(plot.title=element_text(size=rel(1.2), face="bold"))
g

We can see that at this level, the average means is very close to the theorical mean.

data.frame(average.means=mean(dt), theorical.mean=1/lambda)

##   average.means theorical.mean
## 1      4.999525              5

Sample Variance versus Theoretical Variance

Lets see what is the differenc between the thorical variance and the sample variance from the simulation. The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\), so, the expected value for the variance is \((1/\lambda)^2/n\) or \(0.2^{-2}/n = 0.625\).

data.frame(
  means.variance = round(var(dt),3),
  theorical.variance = (1/lambda)^2/s
  )

##   means.variance theorical.variance
## 1          0.686              0.625

Distribution

To comparing the therical distribution with the sample simulation we can visuzlize three curves: 1. The sample probability density 2. The normal distribution with the mean and standard error from sample data 3. The theorical nomal distribution from exponential parameters

Visualizing sample simulation density and the data set

# ggplot, data set is the simulation
g <- ggplot(data.frame(x=dt), aes(x = x))
g <- g + ggtitle("Mean simulation and sample density")
g <- g + labs(x = "Average of means", y = "P Density")
g <- g + theme(plot.title=element_text(size=rel(1.2), face="bold"))

# visualizing the simulation distribution
g <- g + geom_histogram(position="identity", fill="blue", color="black", alpha=0.1, 
                        binwidth=0.2, aes(y= ..density..))
g <- g + geom_density(size=1, colour="blue") #distribution
g <- g + geom_vline(xintercept=mean(dt), color="blue", size=1) # mean
g

Visualizing the three density curves

## visualizing the theorical and simulation distribution

# ggplot, data set is the simulation
g <- ggplot(data.frame(x=dt), aes(x = x))
g <- g + ggtitle("Simulation Distribution (blue/yellow) vs Theorical Distribution (red)")
g <- g + labs(x = "mean", y = "frequency")
g <- g + theme(plot.title=element_text(size=rel(1.2), face="bold"))

# visualizing the simulation distribution
g <- g + geom_density(size=1, colour="blue") #distribution
g <- g + geom_vline(xintercept=mean(dt), color="blue", size=1) # mean

# visualizing the theorical distribution mean = 1/lambda sd = 1/lambda
g <- g + stat_function(fun = dnorm, colour = "red", args=list(mean=1/lambda), size = 1, linetype=2); # dnorm with therical parameters
g <- g + geom_vline(xintercept=1/lambda, color="red", linetype=2, size=1) # theorical mean
g <- g + geom_vline(xintercept=1/lambda+c(-1,1)*sqrt((1/lambda)^2/s), color="red", linetype=2) #theorical sd

# visualizing the simulation distribution assuming dnorm wiht mean(dt) and sd(dt)
g <- g + stat_function(fun = dnorm, colour = "yellow", args=list(mean=mean(dt,sd=sd(dt))), size = 1, linetype=2); #dnorm with simulation parameters
g <- g + geom_vline(xintercept=mean(dt)+c(-1,1)*sd(dt), color="yellow") # +/- sd 

# chart
g

We can see that the simulation distribution follow very close to the theorical parameters.

Comparation between exponential and the Central Limit Theorem

Giuliano Sposito