Using R to compare the Exponential Distribution with the Central Limit Theorem

Overview

This document is the final report of the Peer Assessment project - Part 1 from Coursera’s course Statistical Inference, as part of the Specialization in Data Science. It was built up in RStudio, using its knitr functions, meant to be published in pdf format.
Here, we investigate the exponential distribution in R and compare it with the Central Limit Theorem.
The exponential distribution can be simulated in R with the command rexp(n, lambda) where lambda is the rate parameter.

The Exponential Distribution

In probability theory and statistics, the Exponential Distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate.

Probability Density Function

The probability density function (pdf) of an exponential distribution is

\(f(x;\lambda) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0, \\0 & x < 0.\end{cases}\)

The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\).

Visualizing the Exponential Distribution

The graphical behavior of the Exponential Distribution can be better visualized with a small simulation of 200 data shown in the graphs below:

n <- 200
x <- 1:n
lambda <- 0.2
y <- rexp(n,lambda)
z <- data.frame(x,y)

graph01 <- ggplot(z, aes(y)) +
           geom_histogram(binwidth = 1, colour = "#999999", fill= "#000099") +
           ggtitle(expression(atop("Exponential Distribution",
                              atop(italic("n = 200, lambda = 0.2"),"")))) +
           xlab("rexp(n,lambda)")     

graph02 <- ggplot(z, aes(y)) +
           geom_line(stat = "density") +
           stat_function(fun = dexp, arg=list(rate=lambda), colour = "red") +
           ggtitle(expression(atop("Exponential Distribution",
                              atop(italic("n = 200, lambda = 0.2"),"")))) +
           xlab("rexp(n,lambda)")

grid.arrange(graph01, graph02, ncol = 2)

Exponential Distribution Data Simulation

For all simulations, we set \(\lambda\)(lambda) = 0.2 and the seed as 1234, using the comand set.seed(123).
We will investigate the distribution of averages of 40 exponential datasets. Note that we will use a thousand simulations to guarantee data consistency and then compare the distribution they generate with both the normal distribution and the results that are predicted by the Central Limit Theorem..

The matrix with the 1000 datasets (rows) and 40 random exponential data each (columns) is obtained by:

numberSimul <- 1000
lambda <- 0.2
numberDatasets <- 40

set.seed(123)
dataExp <- rexp(numberSimul * numberDatasets, rate=lambda)
simulData <- matrix(dataExp, numberSimul, numberDatasets)
simulDataFR <- as.data.frame(simulData)

The vector with the averages of each dataset is:

simulMean <- rowMeans(simulData)
countMean <- 1:numberSimul
simulMeanFR <- data.frame(countMean, simulMean)

Analysis

1) Sample Mean versus Theoretical Mean

The distribution of averages has sample mean (\(\bar x\)) equal to the population mean (\(\mu\)). The sample data mean is :

sampleMean <- mean(simulMean)
sampleMean

## [1] 5.011911

In an Exponential Distribution, the predicted mathematical mean is \(\mu=1/\lambda\), where \(\mu= 1/\lambda = 1/0.2 = 5.00\).

The distribution of the averages of the datasets can be seen in the following graph:

ggplot(simulMeanFR, aes(simulMean)) +
       geom_histogram(binwidth = 0.1, colour = "#999999", fill="#000099") +
       geom_vline(xintercept = sampleMean, colour = "red", size=2) +
       ggtitle(expression(atop("Distribution of Averages",
          atop(italic("1000 datasets of 40 exponentials, lambda = 0.2"),"")))) +
       xlab("averages")

In this case, we can conclude they are almost exactly the same as predicted.

2) Sample Variance versus Theoretical Variance

The distribution of averages has its variance as \(\sigma^2 = s^2 / n\) , or :

sampleSD  <- sd(simulMean)
sampleVar <- var(simulMean)
sampleVar

## [1] 0.6088292

The theoretical variance is defined as \(\sigma^2 = (1 / \lambda) ^ 2 / n\), or \(\sigma^2 = (1 / 0.2)^2 / 40 = 0.625\).

And, once again, the two numers (0.609 and 0.625) are very close to each other, as predicted by the CLT.

3) Sample Distribution approximation to Normal Distribution

The approximation of the sample distribution to a Normal Distribution can be visually seen with the following graph:

ggplot(simulMeanFR, aes(simulMean) ) +
        geom_histogram(aes(y = ..density..), binwidth=0.1, 
                               colour = "black", fill = "#F8766D") +
        geom_density(stat = "density", colour="blue",size=1.2) +
        geom_vline(xintercept = sampleMean, colour = "blue", size=1) +
        geom_vline(xintercept = 5, colour = "red", size=1) +
        stat_function(fun = dnorm, arg = list(mean = sampleMean, sd = sampleSD),
                      colour = "red", size=1.5) +
        ggtitle(expression(atop("Distribution of Averages",
            atop(italic("1000 datasets of 40 exponentials,lambda = 0.2"),"")))) +
        xlab("averages")

The averages are almos coincident and the shapes of the sample and the theoretical normal curves are very close.
The normality plot will look like :

qqnorm(simulMean)
qqline(simulMean, col = "red", lwd = 1)

Which are also very close.In any case, it is also recommended to perform a normality test to check the assumptions.
A Kolmogorov-Smirnov test can be used for this example.

ks.test(simulMean, "pnorm", mean = sampleMean, sd = sampleSD)

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  simulMean
## D = 0.0206, p-value = 0.7881
## alternative hypothesis: two-sided

With such a high value for p, there is no evidence that we should doubt the distribution is normal, as stated by the Central Limit Theorem.

Statistical Inference Project - Part 1

Marcelo Tardelli

November 18, 2015