This analysis performs a test in order to observe the consequances predicted by the Central Limit Theorem when anlysing the mean distribution of a set of iid comming from a non-normal distribution, in this case, the exponential distribution
The analysis will cover 3 aspects:
Set libraries
library(ggplot2)
library(plyr)
1000 sets of 40 randomly generated exponential values are simulated. Sets are numbered using the “group” variable
set.seed(123)
sim <- data.frame(value=numeric(), group=numeric())
for (i in 1:1000) {sim <- rbind(sim,data.frame(value=rexp(40,0.2), group=rep(i,40)))}
The theoretical value of the simulation mean is 1/rate parameter(lambda), being labmda=0.2 the theoretical distribution mean is 5. The comparison against the obtained values are:
For the exponential distribution:
g <- ggplot(sim, aes(x = value))
g <- g + geom_histogram(fill = "black",binwidth=1)
cuts <- rbind(data.frame(Values="Theoretical mean", mean=5),
data.frame(Values="Empirical mean", mean=mean(sim$value)))
g <- g + geom_vline(data=cuts,
aes(xintercept=mean,
linetype=Values,
colour = Values),
size=1,
show_guide=TRUE)
g <- g + ylab("# observations") + xlab("")
g <- g + labs(title = paste('Empirical mean = ', round(mean(sim$value),4)))
g
For the simultaed means:
The simulated values are aggregated, obtaining a total of 1000 means of 40 exponentially distributed random variables (each):
meansim <- data.frame(value=with(sim, tapply(value,group,mean)))
The theoretical mean of the mean distribution is also 1/lambda, 5 in our case
h <- ggplot(meansim, aes(x = value))
h <- h + geom_histogram(fill = "black",binwidth=0.2)
cuts <- rbind(data.frame(Values="Theoretical mean", mean=5),
data.frame(Values="Empirical mean", mean=mean(meansim$value)))
h <- h + geom_vline(data=cuts,
aes(xintercept=mean,
linetype=Values,
colour = Values),
size=1,
show_guide=TRUE)
h <- h + ylab("# observations") + xlab("")
h <- h + labs(title = paste('Empirical mean = ', round(mean(meansim$value),4)))
h
The analysis shows that the empirical mean is not far from the theoretical mean. With 1000 observations comming from 40 draws each, this is already a very small difference. Increasing the amount of simulated values will bring both means closer and closer, in accordance with the CLT.
Note that the distribution is looking nurmally distributed.
Following the CLT, the variance of the mean distribution should get closer to sigma^2/n when the number of iterations increase, where:
The comparison is done as follows:
c(Theoretical_variance=5^2/40, Empirical_variance=var(meansim$value))
## Theoretical_variance Empirical_variance
## 0.6250000 0.6004928
As it can be seen, both numbers are already quite close. Increasing the amount of iteration
Finally, in order to completly check the Central Limit Theorem, the mean distribution should follow approximately a normal distribution with mean = 5 and variance = 5^2/40
This is shown by overlaping this theoretical distribution to the empirical mean distribution:
i <- ggplot(meansim, aes(x = value))
i <- i + geom_density(size=1)
i <- i + stat_function(fun = dnorm, arg=list(mean=5, sd=sqrt(0.625)), color="red",size=1)
i <- i + ylab("# observations") + xlab("")
i <- i + labs(title = "Comparison with theoretical normal distribution")
i
As it can be seen, the sample distribution is very close to the theoretical distribution predicted by the CLT.