In this project the exponential distribution in R will be compared with the Central Limit Theorem. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Lambda will be set to 0.2 for all of the simulations. Will focus on the investiagion of the distribution of averages of 40 exponential withing 1000 simulations
First lets run 1000 simulations and compute the mean of the 40 random generated numbers
#Set seed for reproducibility
set.seed(123)
lambda <- 0.2 #Expected mean
theor_mean <- 1/lambda #Theoretical mean
theor_sigma <- 1/lambda #Theoretica standard deviation
n <- 40 #Nº of observations
expected_var <- theor_sigma^2/n
simulation=NULL
for (i in 1 : 1000) {
simulation = c(simulation, mean(rexp(40,lambda)))
}
results <- data.frame(simulation,size=40)
hist(simulation,breaks = 40, main="Distribution of 1000 random exponentials", xlab="Means")
#Mean from simulations
mean(simulation)
## [1] 5.011911
#Expected mean from exponential distribution
print(theor_mean <- 1/lambda)
## [1] 5
Difference <- (mean(simulation)/(1/lambda))*100 -1
As seen from the data, the simulated mean is close to the theoretical one which is 5.
# Theoretical variance
print(expected_var <- theor_sigma^2/n)
## [1] 0.625
# Simulated variance
var(simulation)
## [1] 0.6004928
The variance of the simulation, 0.6 is close to the theoretical variance, which is 0.625.
According to the central limit theorem, which statesthat if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement , then the distribution of the sample means will be approximately normally distributed (according to: https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Probability/BS704_Probability12.html) This should be the case of the previous simulation.
hist(simulation,breaks = 20, main="Distribution of 1000 random exponentials", xlab="means"
,probability = TRUE)
x<-seq(0,100,0.01)
curve(dnorm(x, mean = mean(simulation), sd=sd(simulation)), col="green", lwd=2, add = TRUE)
From the graph it is clear that the simulated data is approximately normal. This should be more obvious if we run the simulations more times, for example 20000 times. Les see what happens.
simulation2=NULL
for (i in 1 : 20000) {
simulation2 = c(simulation2, mean(rexp(40,lambda)))
}
hist(simulation2, breaks = 50, main="Distribution of 20.000 random exponentials", xlab="means"
,probability = TRUE)
x<-seq(0,100,0.01)
curve(dnorm(x, mean = mean(simulation2), sd=sd(simulation2)), col="blue", lwd=2, add = TRUE)
The final plot shows a more normal distribution than the previous one. This shows that with more data or simulations the distribution is more related to a normal distribution.