Overview

The central limit theorem (CLT) states that the distribution of the sum (or average) of a large number of independent, identically distributed random variables with finite variance is approximately normal, regardless of the shape of the underlying distribution (http://www.math.uah.edu/stat/sample/CLT.html). This project simulates exponentially distributed random variables and compares the resulting distribution of sample means with the normal distribution to check that statement.
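
In symbols, one common way to write the statement is sketched below. The notation (X_i for the individual values, n for the sample size, mu and sigma for the mean and standard deviation of the underlying distribution) is introduced here for illustration and does not come from the original text.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{d}\; N(0, 1)
\quad \text{as } n \to \infty
```

For an exponential distribution with rate lambda, both mu and sigma equal 1/lambda, so the standardized quantity is (mean - 1/lambda) / ((1/lambda) / sqrt(n)).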

Set up for the graphical library in R

options("scipen"=10)
if (!require(ggplot2)) {
  install.packages("ggplot2")
  library(ggplot2)
}

Simulations

The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda, and its standard deviation is also 1/lambda. All simulations below use lambda = 0.2.

lambda <- 0.2
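
As a quick sanity check of the two formulas above, a large exponential sample should have a mean and a standard deviation both close to 1/lambda. This is only a sketch: the seed and the sample size of 100000 are arbitrary choices made for this illustration.

```r
set.seed(1)                        # arbitrary seed, assumed here for reproducibility
lambda <- 0.2
x <- rexp(100000, rate = lambda)   # one large sample of exponential values

# Empirical mean and standard deviation next to the theoretical value 1/lambda
c(mean = mean(x), sd = sd(x), theoretical = 1 / lambda)
```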

The data set consists of 1000 simulated samples, each containing 40 exponentially distributed random values. The values are stored in a matrix with one sample per row.

rexp_data <- matrix(rexp(40 * 1000, lambda), ncol=40)
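
The report does not fix a random seed, so its numbers change on every run. A reproducible variant might look like the sketch below. The seed value and the names n_values and n_samples are hypothetical and were chosen for this example only.

```r
set.seed(2024)      # hypothetical seed; any fixed value makes the run repeatable
lambda <- 0.2
n_values <- 40      # values per sample (hypothetical name)
n_samples <- 1000   # number of samples (hypothetical name)

# One row per sample, one column per value within the sample
rexp_data <- matrix(rexp(n_values * n_samples, rate = lambda), ncol = n_values)
dim(rexp_data)
```

With a different seed, the results would differ slightly from those printed in this report.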

The next step is to calculate the mean and standard deviation of each sample (each row of the matrix).

means <- apply(rexp_data, 1, mean)
sds <- apply(rexp_data, 1, sd)
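
For the row means, base R also offers rowMeans(), which should give the same values as apply(..., 1, mean) and is usually faster on large matrices. The sketch below regenerates its own data so that it runs on its own; the variable names means_apply and means_fast are made up for the comparison.

```r
lambda <- 0.2
rexp_data <- matrix(rexp(40 * 1000, rate = lambda), ncol = 40)

means_apply <- apply(rexp_data, 1, mean)   # approach used in this report
means_fast  <- rowMeans(rexp_data)         # vectorised alternative

# TRUE if both approaches agree up to floating-point tolerance
all.equal(means_apply, means_fast)
```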

These per-sample statistics are the data used in the rest of the analysis.

Sample vs theoretical means

The theoretical mean of the exponential distribution is:

1 / lambda
## [1] 5

The mean of the simulations:

mean(means)
## [1] 4.987228

The two numbers are very close. The histogram below shows the distribution of the sample means:

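# Map the sample means to the x axis; the histogram layer is added next
m <- ggplot(data.frame(means = means), aes(x = means))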
m <- m + geom_histogram(fill="green", colour="blue")
m <- m + theme_bw()
m <- m + geom_vline(xintercept = 1 / lambda, colour = "red", size = 2)
m <- m + labs(x = "Sample mean values", y="", title="Sample mean distribution")
m

The red vertical line, which marks the theoretical mean, sits at the centre of the histogram. The average of the sample means is therefore a good estimate of the theoretical mean.
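
To put a number on how close the two values are, the gap can be compared with the standard error of the average of all 40 * 1000 simulated values, which is (1/lambda) / sqrt(40 * 1000). This is a rough sketch: the value 4.987228 is copied from the output above, and the names se_grand_mean and observed_gap are introduced only for this example.

```r
lambda <- 0.2

# Standard error of the mean of all 40 * 1000 exponential values
se_grand_mean <- (1 / lambda) / sqrt(40 * 1000)
se_grand_mean

# Gap between the simulated and theoretical means, in units of that standard error
observed_gap <- abs(4.987228 - 1 / lambda)
observed_gap / se_grand_mean
```

A ratio below 1 means the simulated mean lies within one standard error of the theoretical mean.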

Sample vs theoretical variance

The theoretical variance of the exponential distribution is:

(1 / lambda) ^ 2
## [1] 25

The mean of the sample variances (the squared sample standard deviations) is:

mean(sds ^ 2)
## [1] 24.95424

The two numbers are very close. The histogram below shows the distribution of the sample variances:

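# Map the sample variances to the x axis; the histogram layer is added next
m <- ggplot(data.frame(variances = sds ^ 2), aes(x = variances))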
m <- m + geom_histogram(fill="green", colour="blue")
m <- m + theme_bw()
m <- m + geom_vline(xintercept = (1 / lambda) ^ 2, colour = "red", size = 2)
m <- m + labs(x = "Sample variance values", y="", title="Sample variance distribution")
m

As the graph shows, the red vertical line (which marks the theoretical variance) lies close to the mean of the distribution of sample variances.
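
The comparison above concerns the variance of individual exponential values. A related quantity, not computed in the original analysis, is the variance of the sample means themselves, which for samples of size 40 should be close to (1/lambda)^2 / 40 = 0.625. The sketch below checks this on freshly simulated data; the seed and the name theoretical_var_of_mean are assumptions made for this example.

```r
set.seed(123)   # arbitrary seed, assumed here for reproducibility
lambda <- 0.2
means <- rowMeans(matrix(rexp(40 * 1000, rate = lambda), ncol = 40))

# Theoretical variance of the mean of 40 exponential values
theoretical_var_of_mean <- (1 / lambda) ^ 2 / 40

c(simulated = var(means), theoretical = theoretical_var_of_mean)
```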

Distribution

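To investigate the shape of the distribution of the sample means, normalize the means variable by subtracting the theoretical mean 1/lambda and dividing by the standard error (1/lambda)/sqrt(40):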

mnormalized <- (means - (1/lambda)) / ((1/lambda)/sqrt(40))

The result is:

summary(mnormalized)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.82700 -0.71570 -0.06243 -0.01616  0.58090  3.91200

The mean of the normalized sample means is close to 0, and the three quartiles are close to those of the standard normal distribution N(0,1), which has mean 0, standard deviation 1 and quartiles of approximately -0.674, 0 and 0.674. For comparison, here is the summary of 1000 values drawn from N(0,1):

summary(rnorm(1000))
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.11300 -0.62290  0.01513  0.02100  0.65830  3.30400
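
Rather than comparing against another random sample, the theoretical quartiles of N(0,1) can be obtained directly with qnorm(), and the standard deviation of the normalized means can be checked against 1. The sketch below regenerates the data so that it runs on its own; the seed and the names theoretical_q and empirical_q are assumptions made for this example.

```r
set.seed(42)   # arbitrary seed, assumed here for reproducibility
lambda <- 0.2
means <- rowMeans(matrix(rexp(40 * 1000, rate = lambda), ncol = 40))
mnormalized <- (means - 1 / lambda) / ((1 / lambda) / sqrt(40))

# Theoretical quartiles of N(0, 1) and empirical quartiles of the normalized means
theoretical_q <- qnorm(c(0.25, 0.5, 0.75))
empirical_q <- quantile(mnormalized, probs = c(0.25, 0.5, 0.75))
rbind(theoretical = theoretical_q, empirical = empirical_q)

# Mean and standard deviation, expected to be near 0 and 1
c(mean = mean(mnormalized), sd = sd(mnormalized))
```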

The graph below compares the density of the normalized sample means with the standard normal density to show that they have the same shape:

m <- ggplot(data.frame(mnormalized = mnormalized), aes(x = mnormalized))
m <- m + geom_line(aes(y = ..density.., colour = 'The exponential distribution simulations'), stat = 'density')
m <- m + stat_function(fun = dnorm, aes(colour = 'Normal'))
m <- m + geom_histogram(aes(y = ..density..), alpha = 0.6, fill="black", colour="blue")
m <- m + scale_colour_manual(name = 'Density', values = c('red', 'blue'))
m <- m + theme_bw()
m <- m + labs(x = "Value", y="Density", title="Theoretical/Normalized sample mean comparison")
m

The graph suggests that the distribution of the normalized sample means is almost identical to the standard normal distribution, since the shape and the main parameters are very close.
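
A normal Q-Q plot is another common way to judge closeness to normality, although it is not part of the original analysis. The sketch below uses base R graphics rather than ggplot2 so that it has no package dependencies, and regenerates the data with an arbitrary seed. Points lying near the reference line, and a correlation close to 1 between theoretical and sample quantiles, would support the conclusion above.

```r
set.seed(7)   # arbitrary seed, assumed here for reproducibility
lambda <- 0.2
means <- rowMeans(matrix(rexp(40 * 1000, rate = lambda), ncol = 40))
mnormalized <- (means - 1 / lambda) / ((1 / lambda) / sqrt(40))

# qqnorm() draws the plot and returns the theoretical (x) and sample (y) quantiles
qq <- qqnorm(mnormalized, main = "Normal Q-Q plot of normalized sample means")
qqline(mnormalized, col = "red")   # reference line through the quartiles

# Correlation between theoretical and sample quantiles
cor(qq$x, qq$y)
```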

Conclusion

The simulations of the exponential distribution provided enough data to illustrate the central limit theorem. The main characteristics of the resulting distribution of sample means are very close to those of the normal distribution, as shown in the graphs and in the R calculations above.