In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem, using simulations. This paper are part of Statistical Inference course project from Data Science specialization, by John Hopkins
We will verify:
The exponential distribution will be simulated using rexp(n, lambda) where lambda is the rate parameter.
In this study we will generate 1000 simulations (averages) with 40 random exponential samples each. We’ll define \(\lambda=0.2\) (lambda=0.2) for all of the simulations.
Generating the averages
## simulation parameters
n <- 1000 # simulations
s <- 40 # sample size
lambda <- 0.2 # exponential rate
## simulation data
dt <- NULL # null initial value
for (i in 1:n) dt <- c (dt, mean(rexp(s,lambda)) ) # 1000 means of 40 random exponential sample
The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\), so, the expected value for the mean is \(1/0.2 = 5\).
Let’s see the means distribution in the simulation.
library(ggplot2)
g <- ggplot(as.data.frame(dt), aes(x = dt))
g <- g + geom_histogram(color="black", alpha=.1, fill="blue", binwidth=.2)
g <- g + geom_vline(xintercept=mean(dt), color="blue")
g <- g + geom_vline(xintercept=1/lambda, color="red")
g <- g + xlab("Average from 40 random exp sample")
g <- g + ylab("Frequency")
g <- g + ggtitle("Average of means (blue) vs Theorical mean (red)")
g <- g + theme(plot.title=element_text(size=rel(1.2), face="bold"))
g
We can see that at this level, the average means is very close to the theorical mean.
data.frame(average.means=mean(dt), theorical.mean=1/lambda)
## average.means theorical.mean
## 1 4.999525 5
Lets see what is the differenc between the thorical variance and the sample variance from the simulation. The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\), so, the expected value for the variance is \((1/\lambda)^2/n\) or \(0.2^{-2}/n = 0.625\).
data.frame(
means.variance = round(var(dt),3),
theorical.variance = (1/lambda)^2/s
)
## means.variance theorical.variance
## 1 0.686 0.625
To comparing the therical distribution with the sample simulation we can visuzlize three curves: 1. The sample probability density 2. The normal distribution with the mean and standard error from sample data 3. The theorical nomal distribution from exponential parameters
# ggplot, data set is the simulation
g <- ggplot(data.frame(x=dt), aes(x = x))
g <- g + ggtitle("Mean simulation and sample density")
g <- g + labs(x = "Average of means", y = "P Density")
g <- g + theme(plot.title=element_text(size=rel(1.2), face="bold"))
# visualizing the simulation distribution
g <- g + geom_histogram(position="identity", fill="blue", color="black", alpha=0.1,
binwidth=0.2, aes(y= ..density..))
g <- g + geom_density(size=1, colour="blue") #distribution
g <- g + geom_vline(xintercept=mean(dt), color="blue", size=1) # mean
g
## visualizing the theorical and simulation distribution
# ggplot, data set is the simulation
g <- ggplot(data.frame(x=dt), aes(x = x))
g <- g + ggtitle("Simulation Distribution (blue/yellow) vs Theorical Distribution (red)")
g <- g + labs(x = "mean", y = "frequency")
g <- g + theme(plot.title=element_text(size=rel(1.2), face="bold"))
# visualizing the simulation distribution
g <- g + geom_density(size=1, colour="blue") #distribution
g <- g + geom_vline(xintercept=mean(dt), color="blue", size=1) # mean
# visualizing the theorical distribution mean = 1/lambda sd = 1/lambda
g <- g + stat_function(fun = dnorm, colour = "red", args=list(mean=1/lambda), size = 1, linetype=2); # dnorm with therical parameters
g <- g + geom_vline(xintercept=1/lambda, color="red", linetype=2, size=1) # theorical mean
g <- g + geom_vline(xintercept=1/lambda+c(-1,1)*sqrt((1/lambda)^2/s), color="red", linetype=2) #theorical sd
# visualizing the simulation distribution assuming dnorm wiht mean(dt) and sd(dt)
g <- g + stat_function(fun = dnorm, colour = "yellow", args=list(mean=mean(dt,sd=sd(dt))), size = 1, linetype=2); #dnorm with simulation parameters
g <- g + geom_vline(xintercept=mean(dt)+c(-1,1)*sd(dt), color="yellow") # +/- sd
# chart
g
We can see that the simulation distribution follow very close to the theorical parameters.