Comparing the Exponential Distribution to the CLT

Statistical Inference - Progamming Assignment No. 1

Author: João Ascensão

1. Overview

In the present paper we will investigate the exponential distribution in R in order to apply the Central Limit Theorem (CLT). We conclude that, despite the underlying distribution, the distribution of the averages is almost normal, centered around the population mean and with variance (1/lambda)² /n, approximately.

2. Simulations

2.1. Distribution of 1,000 random exponentials

We will start by simulating 1,000 random exponentials using rexp. Then, we will plot the resulting histogram using ggplot.

From the exercise, we need to set our rate to .2. Also, we are fixing the seed for the sake of reproducibility.

lambda <- .2 ## set rate
set.seed(2) ## set seed
randExps <- rexp(n=1000, rate=lambda) ## generates 1000 random exps
results <- data.frame(randExps); rm(randExps) ## saves generated exps to data frame results

The following code generates a histogram of the resulting values, overlaid with a transparent density curve.

library(ggplot2) ## load ggplot2
plotExps <- ggplot(results, aes(x=randExps)) +
               geom_histogram(aes(y=..density..), ## histogram with density instead of count on y-axis
                              binwidth=.5,
                              colour="#949494",
                              fill="#b8adad") +
               geom_density(alpha=.2, fill="#FF6666") + ## overlay with density plot
               labs(x="Value", y="Density\n") +
               ggtitle("Distribution of 1,000 Exponentials\n")
## print plot
plotExps

plot of chunk randExpsPlot

2.2. Distribution of 1,000 averages of 40 random exponentials

In order perform the simulations, we will use a for loop and continuously save each of the resulting averages to a single vector called expMeans.

expMeans <- NULL ## creates var expMeans to be used in the loop
for(i in 1:1000) {
    expMeans <- c(expMeans, mean(rexp(n=40, rate=lambda)))
}
results <- cbind(results, expMeans); rm(expMeans) ## binds expMeans vector to results data frame

Find below the code to plot the resulting histogram, with the respective density.

plotMeans <- ggplot(results, aes(x=expMeans)) +
               geom_histogram(aes(y=..density..), ## histogram with density instead of count on y-axis
                              binwidth=.5,
                              colour="#949494",
                              fill="#b8adad") +
               geom_density(alpha=.2, fill="#FF6666") + ## overlay with density plot
               labs(x="Mean", y="Density\n") +
               ggtitle("Distribution of the Mean of 40 Exponentials\n")
## print plot
plotMeans

plot of chunk expMeansPlot

The resulting distribution looks Gaussian indeed, and very different from the first one. Let's plot the densities together.

xValues <- c(results$randExps, results$expMeans) ## creates a vector including all results
Distribution <- c(rep("Exponentials", times=1000), rep("Means", times=1000)) ## repeats the id of each distribution 1,000 times each
rearrangedData <- data.frame(xValues, Distribution) ## binds the previous vectors, values with labels
## plots both densities together
ggplot(rearrangedData, aes(x=xValues, colour=Distribution)) + 
    geom_density(alpha=0.2) +
    labs(x="Value", y="Density\n") +
    ggtitle("Comparing the Distributions\n")

plot of chunk comparisonPlot

3. Sample Mean vs. Theoretical Mean

Theoretically, the mean of the exponential distribution (our underlying distribution), as per the exercise, is equal to 1/lambda.

theoMean <- 1/lambda ## computes the mean of the exponential distribution
theoMean ## prints the mean

## [1] 5

In order to compare both, we need to compute our sample mean.

library(dplyr)
resultsAverages <- rearrangedData %>%
                       group_by(Distribution) %>%
                           summarise(Average=mean(xValues))
kable(resultsAverages)

Distribution	Average
Exponentials	5.180107
Means	5.015567

From the all the above, we infer that the sample mean is very close to the theoritical mean of the underlying distribution.

We can use ggplot to visualize this relationship.

ggplot(rearrangedData, aes(x=xValues, colour=Distribution)) + 
    geom_density(alpha=0.2) +
    geom_vline(data=resultsAverages, aes(xintercept=Average, colour=Distribution), linetype="dashed") +
    labs(x="Value", y="Density\n") +
    ggtitle("Comparing the Means of the Distributions\n")

plot of chunk sampleMeanPlot

It is said that the sample mean is an unbiased estimator of the population mean, because its expected value is equal to the population mean.

4. Sample Variance vs. Theoretical Variance

Like the mean, the standard deviation of a population following an exponential distribution is equal to 1/lambda (or 1/rate).

Therefore, the variance is equal to (1/lamba)² or 1/lamba^2. Let's compute in R.

theoreticalVar <- 1/lambda^2 ## computes the variance of the underlying population
theoreticalVar ## prints calculated the variance

## [1] 25

Let's compute our sample variance (in fact, the standard error of the mean), so we can compare both.

resultsVars <- rearrangedData %>%
                   group_by(Distribution) %>%
                       summarise(Variance=var(xValues))
kable(resultsVars)

Distribution	Variance
Exponentials	25.8288385
Means	0.6780073

The variance of the underlying distribution (exponential) is much larger than the standard error of our means. Let's look at it graphically.

We will plot the average plus/minus one standard deviation for each distribution, code can be found below.

## gets the data to a suitable format for plotting
plotVar <- data.frame("Value"=resultsAverages$Average+c(-1,1,1,-1)*sqrt(resultsVars$Variance), 
                      "Distribution"=rep(c("Exponentials", "Means"), times=2))

## plots the densities with corresponding measures of standard deviation
ggplot(rearrangedData, aes(x=xValues, colour=Distribution)) + 
    geom_density(alpha=0.2) +
    geom_vline(data=plotVar, aes(xintercept=Value, colour=Distribution), linetype="dashed") +
    labs(x="Value", y="Density\n") +
    ggtitle("Comparing the Standard Deviations of the Distributions\n")

plot of chunk sampleVarPlot

But is there a relationship between the variances? Let's use R to find out.

relVars <- resultsVars$Variance[1]/resultsVars$Variance[2]
## prints the result
relVars

## [1] 38.09522

This value is very close to our number of exponentials for each itineration, since the standard error of the mean is approximately equal to the variance divided by n.

5. Distribution

n <- 40 ## number of exponentials
## performs and labels 1000 normals
normSim <- data.frame(xValues=rnorm(n=1000, mean=1/lambda, sd=(1/lambda)^2/n), 
                      Distribution=rep("Normal", times=1000))
rearrangedData <- rbind(rearrangedData, normSim) ## binds with existing results
rearrangedData <- rearrangedData %>%
                      filter(Distribution==c("Means", "Normal")) ## filters out the exponentials
## code below plots both densities together
ggplot(rearrangedData, aes(x=xValues, colour=Distribution)) + 
    geom_density(alpha=0.2) +
    labs(x="Value", y="Density\n") +
    ggtitle("Sample Distribution Vs. Theoretical Distribution (Normal)\n")

plot of chunk Distribution

We can say the distribution of the means is approximately normal in shape and both distributions are centered around the same value.

The normal distribution is, however, more concentrated around the mean and the sample distribution has heavier tails.