Statistical Inference Project

Exponential Distribution and CLT

Goal

We will investigate the exponential distribution in R and will apply the Central Limit Theorem (CLT) to the distribution of averages of 40 exponentials with \(\lambda\) = .2

Exponential distribution

This is how the PDF for the exponential distribution looks like:

hist(rexp(n=10000, rate = .2), breaks=30, main="Probability Density Function",
  xlab="rate=0.2, n=10.000")

Analysis

We will use base and ggplot2 libraries for the graphical representations (see appendix)

We generate first the data of our simulations (see appendix)

Sample Mean vs Theoretical Mean

We calculate the mean of our previous simulations:

meansim<-rowMeans(fmat(1000)) #See appendix for details on fmat function
mean(meansim)

## [1] 5.014378

And in the otherside, the theoretical mean of the exponential distribution is \(\mu\) = 1/\(\lambda\), as \(\lambda\) = .2:
\(\mu\) = 5

Sample mean and theoretical mean are almost the same (5)

Here with a graphical view:

hist(meansim, breaks=30,main="Distribution of sample means", xlab="rate=.2, n=1000")
abline(v=5, col="purple") #Line of the theoretical mean
axis(3, at=5, labels = 5, pos=3, col.axis="purple")

Sample Variance vs Theoretical Variance

We calculate the Variance of our simulations:

meansim<-rowMeans(fmat(1000))
var(meansim)

## [1] 0.6463562

The theoretical variance of the exponential distribution is \(\sigma^2\) = 1/(\(\lambda^2 * n\)), as \(\lambda\) = .2 and n = 40:
\(\sigma^2\) = .625

Sample variance and theoretical mean are almost the same (.625)

Distribution is approximately normal

The Central Limit Theorem (CLT) tell us that \(\bar X_n\) is approximately \(N(\mu, \sigma^2 / n)\)

The exponential distribution has \(\mu\) = 1/\(\lambda\)
The exponential distribution has \(\sigma\) = 1/\(\lambda\)
As by our analysis n = 40

As we increment the number of simulations we see that the exponential distribution gets closer to a Normal distribution

g <- ggplot(dat, aes(x = x, fill = simulations)) + 
        geom_histogram(binwidth=.3, colour = "black", aes(y = ..density..)) 
g <- g + stat_function(fun = dnorm, args=list(mean=mean_expd, sd=sd_expd/sqrt(n_exp)),           size = 2)
g + facet_grid(. ~ simulations)

Appendix

Library for graphical representations

library(ggplot2)

Set our variables

n_exp <- 40 #number of samples
lambda <- .2
mean_expd <- 1/lambda #mean of an exponential distribution
sd_expd <- 1/lambda #standard deviation of an exponential distribution
set.seed(111) #set seed

Create function to generate simulations

#Funtion that creates a matrix with a predefined simulations of a number 
#of 40 (default) exponentials
fmat<-function(simulations, n_exp=40){
        matrix(rexp(n=simulations * n_exp, rate = .2), simulations)
}

Generate data for our simulations

sim_a <- 10 #simulations
sim_b <- 100
sim_c <- 1000
sim_d <- 10000

#Data frame with different amount of simulations
dat <- data.frame(
        x = c(apply(fmat(sim_a), 1, mean), apply(fmat(sim_b), 1, mean),
              apply(fmat(sim_c), 1, mean), apply(fmat(sim_d), 1, mean)),
        simulations = factor(c(rep(sim_a, each=sim_a), rep(sim_b, each=sim_b),
                             rep(sim_c, each=sim_c), rep(sim_d, each=sim_d))))

Statistical Inference Project - Part 1

Javier Sáenz

19 November 2015