Statistical Inference Simulation Exercise

Zhe Jiang

Nov 01, 2018

Overview

In this small project, we will verify the central limit theorem,the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows, by simulating 1000 times of independent exponential distribtuion with same parameter lambda.

Simulations

rexp function in R randomly generates n number of observations with rate.
In this project, we set lambda equal to 0.2 and generate only 40 observations each simulation.

lambda = 0.2
rexp(n = 40, rate = lambda)

##  [1]  2.35594193  6.02310082  0.20420453 13.37912024  0.26619074
##  [6]  0.32783879  3.12983217  2.58486433  2.63800652  5.44826376
## [11]  3.77357217 10.09575759  8.06069962  5.03137106  0.67227798
## [16]  6.34508826  1.90373186  1.14968793 10.21417893  1.07566223
## [21]  5.85848640  7.85698840  3.09210497  4.36301125  1.91983381
## [26]  3.85763688  9.25295209  1.06953542  0.83450683  2.23291497
## [31]  0.06239337  8.15578652  0.76182808  1.78292414 10.69468391
## [36]  1.65239326  3.30710943  6.57529662  2.34480067 14.67795577

As it shows, it randomly generates 40 observations, and they are expected be exponentially distruibuted. However, in this project, we are interested in mean of each simulation, and below is merely one observation.

mean(rexp(n = 40, rate = lambda))

## [1] 5.529037

replicate function in R repeat n times simulations. Let’s make 1000 simulations and see how the result is. Also, for the purpose of reproducability, let’s set seed.

set.seed(123)
mns <- replicate(n = 1000, mean(rexp(40, rate = lambda)))

Since we made 1000 observations, the mean of each observations should be normaly distributed with mean equals to mean of each identical indenpendent distribution, which is 1/lambda. Let’s see theoretical mean

1/lambda

## [1] 5

And let’s see the observed mean

mean(mns)

## [1] 5.011911

Now, let’s visualize the result in graph.

library(ggplot2)
mns = data.frame(mns)
ggplot(mns, aes(x = mns)) +   
    geom_histogram(bins = 25, aes(y=..density..), colour = "black", fill = "green") + 
    geom_density(aes(y =..density..)) +
    geom_vline(xintercept = mean(mns$mns), colour = "red") +  # actual mean line
    geom_vline(xintercept = 1/lambda, colour = "blue") + # theoretical mean line
    scale_x_continuous(breaks = c(3,4,5,6,7,8)) + 
    ylab("Density") + xlab("mns") + ggtitle("Histogram of random generated exp")

As the graph shows that the distribution of 1000 observations are very close to normal distribution, and the actual mean (red line) is very close to theoretical mean (blue line).

Now, let’s see the difference between theoretical and observed variance.

1/(lambda^2)/40

## [1] 0.625

var(mns$mns)

## [1] 0.6004928

The difference between theoretical and actual is not minimal. The reason for this gap is having small observations.
If there is more observatons, the variance should be close to actual variance. Let’s do 1000 observations this time, and see the differnce

n = 1000
abs(var(replicate(n = 1000, mean(rexp(n, rate = lambda)))) - 1/(lambda^2*n))

## [1] 0.0008110179

Now, the difference became much smaller, and it proves that we are getting close to “truth” if we getting more and more sample.

Conclusion

We conclude that the sum of the identical independent distributions are close to normal distribution with same mean value and var/sample size.