Statistical Inference

Course Project

Simulation Excercise

By Daniel Perez

The Central Limit Theorem states that the arithmetic mean of a sufficiently large number of iterates of independent random variables will be approximately normally distributed regardless of the underlying distribution. This analysis will investigate the exponential distribution and compare it with the central limit theorem. This will be done by simulating in R 1000 groups of 40 randomly generated exponential variables with \( \lambda = 0.2 \).

The exponential distribution is described by the rate of change \( \lambda \). It has a mean (\( \mu \)) and standard deviation (\( \sigma \)) equal to \( 1/\lambda \).

lambda = 0.2
mean = 1/lambda
sd = 1/lambda

Let's plot the density distribution to see what it looks like:

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.1.2

x <- seq(0,100, length = 100)
dx <- dexp(x, rate = lambda)
px <- pexp(x, rate = lambda)
xdata <- data.frame(x = x,dx = dx, px = px)
xplot <- ggplot(xdata)+ geom_line(aes(x,dx, color = "Density")) + geom_line(aes(x,px, color = "Distribution"))
xplot <- xplot + labs(title = "Exponential Distribution", x = "X", y = "")
xplot <- xplot + scale_colour_manual("", values = c("red", "blue"))
print(xplot)

plot of chunk unnamed-chunk-2

Let's declare and initialize some useful variables:

trials = 1000
observations = 40

Next lets set the seed to ensure reproducibility:

set.seed(2015)

With the seed set, we are ready to start generating random exponential by using the function rexp and the parameters lambda (\( \lambda \)) for the rate of change and observations for the number of observations. We will replicate this for the amount given by the variable trials.

simulations <- replicate(trials, rexp(observations,lambda))

Here's what one of these trials look like (rounded to 2 decimal places):

t(matrix(round(simulations,2)[1:40],10))

##       [,1] [,2]  [,3] [,4]  [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 13.97 7.47  5.53 3.73 14.15 6.65 2.42 0.26 3.14  6.10
## [2,]  1.92 6.60 10.37 1.65  2.38 2.78 0.24 7.02 2.52  3.60
## [3,]  6.26 0.64 10.71 4.64  0.02 1.77 0.73 3.47 1.89  3.76
## [4,]  7.04 7.73  2.30 1.31  7.66 0.15 4.67 2.72 4.84  0.74

Now that we have our simulations let's proceed to apply the mean function over the 2^nd dimension (columns).

sample_means <- apply(simulations, 2, mean)

Here's a table with the first 10 means of the 1000 trials (rounded to 2 decimal places):

t(matrix(round(sample_means,2)[1:10],10))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 4.39 3.81 5.87 5.77 6.05  5.8 5.61  4.2 4.54  5.36

As we can see in the previous table, it looks like the trial means have a mean of 5. Let's take the mean of means of the 1000 trials.

simulation_mean <- mean(sample_means)

The mean of means of the 1000 trials is 5.01.

Let's proceed to get the variance of the means of the 1000 trials with var.

simulation_variance <- var(sample_means)

The variance of the 1000 trials is 0.63.

Does the Central Limit Theorem work? The Central Limit Theorem states that for a large enough sample, the distribution of the mean will be equal to the population mean (\( \mu \)) with a variance (\( \sigma^2/n \)). Let's compare our results with the predicted values.

difference_theoretical_simulation_mean  <- simulation_mean/mean - 1
difference_theoretical_simulation_mean

## [1] 0.002312687

difference_theoretical_simulation_variance <- simulation_variance*observations/sd^2 - 1
difference_theoretical_simulation_variance

## [1] 0.002078767

So our difference between the theoretical and the simulated values are less than \( 1% \) for the mean and variance respectively.

Finally let's graph the histogram of the mean of means of the exponential distribution with the normal distribution given by the theoretical values and the normal distribution given by the values that resulted from the simulation.

dSM <- data.frame(sample_means)
g <- ggplot(dSM, aes(x = sample_means))
g <- g + geom_histogram(aes(y = ..density..),alpha = 0.6)
g <- g + stat_function(fun = dnorm,args = list(mean = mean, sd = sd/sqrt(observations)), aes(color = "Theoretical"), linetype = "solid", size = 1.0)
g <- g + stat_function(fun = dnorm,args = list(mean = simulation_mean, sd = sqrt(simulation_variance)), aes(color = "Sample"), linetype = "dashed", size = 1.0)
g <- g + labs(title = "Density Distribution for Means of Exponential Distribution", x = "Mean", y = "Density")
g <- g + scale_colour_manual("Density Distributions", values = c("red", "blue"))
print(g)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk unnamed-chunk-12

Since the number of trials is sufficiently large, the Central Limit Theorem predicts accurately the distribution of the mean even though it comes from a population with a different distribution.