Introduction

The following analysis is for the purposes of the course project of Coursera Statistical Inference course. The scope of the analysis is to investigate the exponential distribution in R and compare it with the Central Limit Theorem. The fist step is to simulate 1000 random exponential distributions and 1000 averages of 40 random exponential distributions. The second step is to compare the sample mean (the mean of the 1000 averages of 40 random exponential) with the theoretical mean of the exponential distribution. The theoretical mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. The lambda is set to 0.2 for all of the simulations. The results of the analysis confirm the Central Limit Theory. The code for the analysis is given in the Appendix.

Simulations

First the simulation parameters are set. The number of simulations is set to 1000, lambda is set to 0.2 and n is set to 40. Also the set.seed command is used for reasons of reproducibility.

The first simulation of 1000 exponential is run and the results are displayed in a histogram (See Appendix figure 1). The exponential distribution is simulated with rexp(nosim, lambda), stored as data.frame and plotted using ggplot2. As it can been seen by the graph the frequency is high at the start of x axis and as we move away from it the frequency falls.

The second simulation is that of the 1000 samples of 40 random exponentials. The result is also given in a form of a histogram (See Appendix figure 2. In order to store the 1000 means of 40 exponential samples, an empty object, named “tme” is created. By using the “for” function, 1000 means of 40 exponential distributions, are created with “mean(rexp(n, lambda)” and stored to “tme”. The results are again plotted using ggplot2. As it can be seen from the plot the distribution looks like normal and the mean is close to 5.

Comparison of sample mean with theoretical mean

The theoretical and sample mean are calculated and presented in the following table. The theoretical mean is equal to 1/lambda.

## Theoretical Mean      Sample Mean 
##            5.000            4.996

The sample mean is 4.996 very close to the theoretical mean wich is 5.

Comparison of sample variance with theoretical variance

The theoretical and sample variance are calculated and presented in the following table. The theoretical variance is equal to (1/lambda)^2/n.

## Theoretical Variance      Sample Variance 
##                0.625                0.646

The sample variance (0.646) is very close to the theoretical variance (0.625).

Graphical Comparison of Distributions

The first comparison is that between the distributions of the 1000 exponantials against the 1000 means of 4o exponentials. The difference is clear. The distribution of the 1000 exponentials is more concentrated in between 0-5. The distribution of 1000 means of 40 distributions looks like normal distribution, since it is concentrated around the mean, which is 5.

The second comparison is between the distribution of 1000 means of 40 exponentials and the normal distribution with mean of 5 and a standard deviation of 0.625. The two distributions are very close. Not identical but it can be said that the distribution of 1000 means of 40 exponentials follows N ~ (1/lambda,(1/lambda)/sqrt(n)).

Conclusion

The above just confirm the Central Limit Theorem, which states that the sampling distribution of the sample mean approximates the normal distribution, regardless of the distribution of the population from which the samples are drawn, if the sample size is sufficiently large.

Appendix

Code

Set parametres

nosim <- 1000
lambda <- 0.2
n <- 40
set.seed(345)

Simulation 1 and Figure 1

exp <- as.data.frame(rexp(nosim, lambda))

m <- ggplot(exp, aes(x =rexp(nosim, lambda), xlab = "Value"))
m <- m + geom_histogram(aes(y=..density..), colour="black", fill = "grey")
m + geom_density(colour="blue", size=1) + xlab("Value") + ggtitle("Figure 1. Distribution of 1000 random exponential")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Simulation 2 and Figure 2

mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(n, lambda)))
tme <- as.data.frame(mns)

m <- ggplot(tme, aes(x = mns, xlab = "Value"))
m <- m + geom_histogram(aes(y=..density..), colour="black", fill = "grey")
m + geom_density(colour="red", size=1) + xlab("Value") + ggtitle("Figure 2. Distribution of 1000 averages of 40 random exponential") + geom_vline(xintercept = mean(mns), colour="red", size =1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Mean comparison

sm <- round(mean(mns),3)
tm <- 1/lambda
means <- as.table(c(tm, sm)) 
names(means) <- c("Theoretical Mean", "Sample Mean")
means

Variance Comparison

sv <- round(var(mns),3)
tv <- (1/lambda)^2/n
variances <- as.table(c(tv, sv)) 
names(variances) <- c("Theoretical Variance", "Sample Variance")
variances

Data manipulation and graphs for graph comparison.

exp$label <- "1000 exponentials"
tme$label <- "1000 means of 40 exponentials"
names(exp)[1] <- "value"
names(tme)[1] <- "value"
plot <- rbind(tme, exp)


library(ggplot2)
qplot(value, data=plot, geom="density", fill=label, alpha=I(.5),main="1000 exponentials vs 1000 means of 40 exponentials", xlab="Value", ylab="Density")
norm <- rnorm(nosim,tm,sqrt(tv))
norm <- as.data.frame(norm)
norm$label <- "Normal"
names(norm)[1] <- "value"
plot2 <- rbind(tme,norm)

qplot(value, data=plot2, geom="density", fill=label, alpha=I(.5),main="1000 exponentials vs Normal Distribution", xlab="Value", ylab="Density")