The following analysis is for the purposes of the course project of Coursera Statistical Inference course. The scope of the analysis is to investigate the exponential distribution in R and compare it with the Central Limit Theorem. The fist step is to simulate 1000 random exponential distributions and 1000 averages of 40 random exponential distributions. The second step is to compare the sample mean (the mean of the 1000 averages of 40 random exponential) with the theoretical mean of the exponential distribution. The theoretical mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. The lambda is set to 0.2 for all of the simulations. The results of the analysis confirm the Central Limit Theory. The code for the analysis is given in the Appendix.
First the simulation parameters are set. The number of simulations is set to 1000, lambda is set to 0.2 and n is set to 40. Also the set.seed command is used for reasons of reproducibility.
The first simulation of 1000 exponential is run and the results are displayed in a histogram (See Appendix figure 1). The exponential distribution is simulated with rexp(nosim, lambda), stored as data.frame and plotted using ggplot2. As it can been seen by the graph the frequency is high at the start of x axis and as we move away from it the frequency falls.
The second simulation is that of the 1000 samples of 40 random exponentials. The result is also given in a form of a histogram (See Appendix figure 2. In order to store the 1000 means of 40 exponential samples, an empty object, named “tme” is created. By using the “for” function, 1000 means of 40 exponential distributions, are created with “mean(rexp(n, lambda)” and stored to “tme”. The results are again plotted using ggplot2. As it can be seen from the plot the distribution looks like normal and the mean is close to 5.
The theoretical and sample mean are calculated and presented in the following table. The theoretical mean is equal to 1/lambda.
## Theoretical Mean Sample Mean
## 5.000 4.996
The sample mean is 4.996 very close to the theoretical mean wich is 5.
The theoretical and sample variance are calculated and presented in the following table. The theoretical variance is equal to (1/lambda)^2/n.
## Theoretical Variance Sample Variance
## 0.625 0.646
The sample variance (0.646) is very close to the theoretical variance (0.625).
The first comparison is that between the distributions of the 1000 exponantials against the 1000 means of 4o exponentials. The difference is clear. The distribution of the 1000 exponentials is more concentrated in between 0-5. The distribution of 1000 means of 40 distributions looks like normal distribution, since it is concentrated around the mean, which is 5.
The second comparison is between the distribution of 1000 means of 40 exponentials and the normal distribution with mean of 5 and a standard deviation of 0.625. The two distributions are very close. Not identical but it can be said that the distribution of 1000 means of 40 exponentials follows N ~ (1/lambda,(1/lambda)/sqrt(n)).
The above just confirm the Central Limit Theorem, which states that the sampling distribution of the sample mean approximates the normal distribution, regardless of the distribution of the population from which the samples are drawn, if the sample size is sufficiently large.
Set parametres
nosim <- 1000
lambda <- 0.2
n <- 40
set.seed(345)
Simulation 1 and Figure 1
exp <- as.data.frame(rexp(nosim, lambda))
m <- ggplot(exp, aes(x =rexp(nosim, lambda), xlab = "Value"))
m <- m + geom_histogram(aes(y=..density..), colour="black", fill = "grey")
m + geom_density(colour="blue", size=1) + xlab("Value") + ggtitle("Figure 1. Distribution of 1000 random exponential")
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Simulation 2 and Figure 2
mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(n, lambda)))
tme <- as.data.frame(mns)
m <- ggplot(tme, aes(x = mns, xlab = "Value"))
m <- m + geom_histogram(aes(y=..density..), colour="black", fill = "grey")
m + geom_density(colour="red", size=1) + xlab("Value") + ggtitle("Figure 2. Distribution of 1000 averages of 40 random exponential") + geom_vline(xintercept = mean(mns), colour="red", size =1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
Mean comparison
sm <- round(mean(mns),3)
tm <- 1/lambda
means <- as.table(c(tm, sm))
names(means) <- c("Theoretical Mean", "Sample Mean")
means
Variance Comparison
sv <- round(var(mns),3)
tv <- (1/lambda)^2/n
variances <- as.table(c(tv, sv))
names(variances) <- c("Theoretical Variance", "Sample Variance")
variances
Data manipulation and graphs for graph comparison.
exp$label <- "1000 exponentials"
tme$label <- "1000 means of 40 exponentials"
names(exp)[1] <- "value"
names(tme)[1] <- "value"
plot <- rbind(tme, exp)
library(ggplot2)
qplot(value, data=plot, geom="density", fill=label, alpha=I(.5),main="1000 exponentials vs 1000 means of 40 exponentials", xlab="Value", ylab="Density")
norm <- rnorm(nosim,tm,sqrt(tv))
norm <- as.data.frame(norm)
norm$label <- "Normal"
names(norm)[1] <- "value"
plot2 <- rbind(tme,norm)
qplot(value, data=plot2, geom="density", fill=label, alpha=I(.5),main="1000 exponentials vs Normal Distribution", xlab="Value", ylab="Density")