Central Limit Theorem analysis

Synopsis

This analysis performs a test in order to observe the consequances predicted by the Central Limit Theorem when anlysing the mean distribution of a set of iid comming from a non-normal distribution, in this case, the exponential distribution

The analysis will cover 3 aspects:

Comparison of the sample and theoretical mean of the distribution.
Comparison of the sample and theoretical variance of the distribution.
Show that the distribution is approximately normal.

Simulations

Set libraries

library(ggplot2)
library(plyr)

1000 sets of 40 randomly generated exponential values are simulated. Sets are numbered using the “group” variable

set.seed(123)
sim <- data.frame(value=numeric(), group=numeric())
for (i in 1:1000) {sim <- rbind(sim,data.frame(value=rexp(40,0.2), group=rep(i,40)))}

1. Sample Mean vs. Theoretical Mean

The theoretical value of the simulation mean is 1/rate parameter(lambda), being labmda=0.2 the theoretical distribution mean is 5. The comparison against the obtained values are:

For the exponential distribution:

g <- ggplot(sim, aes(x = value))
g <- g + geom_histogram(fill = "black",binwidth=1) 
cuts <- rbind(data.frame(Values="Theoretical mean", mean=5),
              data.frame(Values="Empirical mean", mean=mean(sim$value)))
g <- g + geom_vline(data=cuts, 
             aes(xintercept=mean, 
                 linetype=Values,
                 colour = Values),
                 size=1,
                 show_guide=TRUE)
g <- g + ylab("# observations") + xlab("")
g <- g + labs(title = paste('Empirical mean = ', round(mean(sim$value),4)))
g

For the simultaed means:

The simulated values are aggregated, obtaining a total of 1000 means of 40 exponentially distributed random variables (each):

meansim <- data.frame(value=with(sim, tapply(value,group,mean)))

The theoretical mean of the mean distribution is also 1/lambda, 5 in our case

h <- ggplot(meansim, aes(x = value))
h <- h + geom_histogram(fill = "black",binwidth=0.2) 
cuts <- rbind(data.frame(Values="Theoretical mean", mean=5),
              data.frame(Values="Empirical mean", mean=mean(meansim$value)))
h <- h + geom_vline(data=cuts, 
             aes(xintercept=mean, 
                 linetype=Values,
                 colour = Values),
                 size=1,
                 show_guide=TRUE)
h <- h + ylab("# observations") + xlab("")
h <- h + labs(title = paste('Empirical mean = ', round(mean(meansim$value),4)))
h

The analysis shows that the empirical mean is not far from the theoretical mean. With 1000 observations comming from 40 draws each, this is already a very small difference. Increasing the amount of simulated values will bring both means closer and closer, in accordance with the CLT.

Note that the distribution is looking nurmally distributed.

2. Sample Variance vs. Theoretical Variance

Following the CLT, the variance of the mean distribution should get closer to sigma^2/n when the number of iterations increase, where:

sigma is the theoretical variance of the exponential ditribution: 1/lambda: 5
n is the number of values summarised : 40

The comparison is done as follows:

c(Theoretical_variance=5^2/40, Empirical_variance=var(meansim$value))

## Theoretical_variance   Empirical_variance 
##            0.6250000            0.6004928

As it can be seen, both numbers are already quite close. Increasing the amount of iteration

3. Sample distribution

Finally, in order to completly check the Central Limit Theorem, the mean distribution should follow approximately a normal distribution with mean = 5 and variance = 5^2/40

This is shown by overlaping this theoretical distribution to the empirical mean distribution:

i <- ggplot(meansim, aes(x = value))
i <- i + geom_density(size=1) 
i <- i + stat_function(fun = dnorm, arg=list(mean=5, sd=sqrt(0.625)), color="red",size=1)
i <- i + ylab("# observations") + xlab("")
i <- i + labs(title = "Comparison with theoretical normal distribution")
i

As it can be seen, the sample distribution is very close to the theoretical distribution predicted by the CLT.