In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should
We will generate a serie of 1000 simulations. Each simulation will contain the averages of 40 exponentials rexp(40, 0.2)
## Warning: package 'ggplot2' was built under R version 3.3.3
#constants
set.seed(100)
nbSim <- 1000
lambda <- 0.2
n <- 40
simExp <- data.frame(ncol=2,nrow=1000)
names(simExp) <- c("rowID","Mean")
simExpCUm <- data.frame(ncol=2,nrow=1000)
names(simExpCUm) <- c("rowID","CumMean")
for (i in 1:nbSim)
{
simExp[i,1] <- i
simExp[i,2] <- mean(rexp(n,lambda))
simExpCUm[i,1] <- i
simExpCUm[i,2] <- mean(simExp[,2],na.rm=TRUE)
}
We’ve now generated our simulated data let’s compare the theorical mean with the sample mean
sampleMean <- mean(simExp[,2])
theoryMean <- 1/lambda
paste("The sample mean is ", sampleMean ," and the theorical mnean is ", theoryMean)
## [1] "The sample mean is 4.9997019268744 and the theorical mnean is 5"
The simulated mean is very close to the theoretical value of 5. Let’s plot the cumulative mean and see how we’re getting closer to 5 as we increase the number of simulations.
g<- ggplot(simExpCUm, aes(x = rowID, y = CumMean),na.rm=TRUE) +
geom_line(size=2)+
geom_hline(yintercept = 1 / lambda, color = "red", size = 1) +
scale_y_continuous(limits=c(4, 6)) +
theme(plot.title = element_text(size=12, face="bold", vjust=2, hjust=0.5)) +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))+
labs(title="Cumulative Sample Mean vs Theoretical Population Mean") +
labs(x = "Number of Simulations", y = "Cumulative Sample Mean")
suppressWarnings(print(g))
We will now compare the variance the sample means variance with the theoretical varience. The variance of the sample means estimates the variance of the population by using the varience of the 1000 entries in the means vector times the sample size, 40. That is, ??2=Var(samplemeans)×N.
sampleVar <- var(simExp$Mean )
#The theoretical variance is given by ??2=(1/lambda)^2.
theoryVar <- (1/lambda)^2/n
paste("The sample variance ", sampleVar ," and the theorical variance is ", theoryVar)
## [1] "The sample variance 0.643244224882213 and the theorical variance is 0.625"
Again both of these variance are quite close to each other.
ggplot(simExp, aes(x=simExp$Mean )) +
geom_histogram(aes(y = ..density..),
color = "black", fill = "lightblue", binwidth = 0.1)+
geom_density(aes(color = "Simulated "))+
stat_function(aes(colour = "Theoretical"),
fun = dnorm, args = list(mean = 1/lambda, sd = (1/lambda/sqrt(n)))) +
scale_colour_manual("Legend title", values = c("black", "red"))+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))+
labs(title="Distribution of Sample Mean vs Theoretical Distribution Mean") +
labs(x = "Density", y = "Mean")
In addition to the mean comparison and variance comparison we can conclude that the sample mean follows a normal distribution. This can be explained by the central limit theorem that states that given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
We have now shown that the mean, variance and the distribution of the simulated exponential distribution values are close to a normal distribution. We can also confirm it by looking at the confience intervals:
sampleCI <- round(sampleMean + c(-1,1)*1.96*sd(simExp$Mean)/sqrt(n),3)
sampleCI
## [1] 4.751 5.248
theoryCI <- round(theoryMean + c(-1,1)*1.96*sqrt(theoryVar)/sqrt(n),3)
theoryCI
## [1] 4.755 5.245