Part 1. Simulation Exercise

Overview

This part illustrate the simulation of exponential distribution and the Central Limit Theorem. I’ll run 1,000 simulations of 40 exponential distributions and compare its statistics with theoretical statistics. Finally I’ll compare its distribution with the normal one.

Simulations

I will run 1,000 simulations, each is the distribution of averages of 40 exponentials. The mean and standard deviation of the simulations are store in 2 columns (namely mean and sd) of the output dataframe sim.

n <- 40
lambda <- 0.2
simno <- 1000

# create empty simulation matrix, row = simulation, col = mean, var of each simulation
sim <- data.frame(matrix(ncol = 2, nrow = simno))
colnames(sim) <- c("mean", "sd")

# run the simulations
for (i in 1 : simno){
    roll <- rexp(n, lambda)
    sim[i,1] <- mean(roll)
    sim[i,2] <- sd(roll)
}

Sample Mean versus Theoretical Mean

I now plot a histogram of the mean of the simulation. The expected theoretical mean of the exponential distribution is also represented by a vertical red line:

hist(sim$mean, main="Histogram of simulation mean", xlab="Mean")
abline(v=1/lambda,col="red", lwd=5)

We can clearly see that the histogram of the simulation mean is pretty normal with the mean is centered around 5. This agrees with the theoretical mean of the exponential distribution (red line) at 5 (which is 1/0.2).

Sample Variance versus Theoretical Variance

I now plot a histogram of the standard deviation of the simulation. The expected theoretical standard deviation of the exponential distribution is also represented by a vertical red line:

hist(sim$sd, main="Histogram of simulation standard deviation", xlab="Standard deviation")
abline(v=1/lambda,col="red", lwd=5)

In this plot, we can see that the histogram of the simulation standard deviation is somewhat normal with the mean is centered around 5 (could be a little higher or lower). This is close to the theoretical standard deviation of the exponential distribution (red line) at 5 (which is 1/0.2).
Please note that the code may give slightly different actual plots because the data is created randomly during simulation and the histogram you see may not look so normal. However, the comparison can be made as described above.

Distribution

Now I test with the difference between the distribution of a collection of 1,000 random exponentials and the distribution of 1,000 collections of averages of 40 random exponentials. Two of them will be shown respectively below.
On the left plot (one collection of 1,000 random distributions), I plot the mean of the observed collections as a vertical red line).
On the remaining plot of average, I overlay the corresponding theoretical normal distribution with mean and standard variation of 5 (dashed blue curve). Again, the theoretical mean is marked by a vertical red line.

par(mfrow=c(1,2))

temp <- rexp(1000, lambda)
hist(temp, main="Hist. of random exp. dist.", xlab="Mean")
abline(v=mean(temp),col="red", lwd = 3)

h <- hist(sim$mean, ylim=c(0,250), xlim=c(1,9), main="Hist. of avg. of random exp. mean", xlab="Mean") 
abline(v=1/lambda,col="red", lwd = 3)
xfit <- seq(1, 9, length = 100) # extract x limit
yfit <- dnorm(xfit, mean = mean(sim$mean), sd = sd(sim$mean)) # create normal distribution from mean and sd
yfit <- yfit * diff(h$mids[1:2]) * length(sim$mean) # scale disttribution by cell width and length of data
lines(xfit, yfit, col = "blue", lwd = 3, lty=2)

We can see from the left plot that although a single collection of a large number of distribution can provide the characteristic of that distribution (simulation mean is close to theoretical mean of 5), we cannot infer any further interesting from it.
On the other hand, by performing a lot of simulation and investigate the distribution of these simulations, we can derive its characteric. The plotted histogram is very close to the dashed blue curve. It clearly indicates that the distribution of an average of simulations is close to a normal distribution.

Statistical Inference Course Project - part 1

Hung Dinh

April 29, 2018