Overview

This is the course project for the Statistical Inference course from Johns Hopkins on Coursera. It investigates the exponential distribution in R and compares simulation results with the predictions of the Central Limit Theorem. With lambda fixed at 0.2 for all simulations, the distribution of averages of 40 exponentials across 1000 simulations is evaluated and shown to be approximately normal.

Assumptions

The assumption for this data is that the simulated draws are independent and identically distributed, which is what rexp produces.

Simulations

We begin by setting the initial variables as provided in the project instructions. The data (means_exponentials) is then created as a data.frame so that it can be used directly with ggplot2.

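library(ggplot2)
set.seed(1337)        # fix the seed so the results are reproducible
lambda <- 0.2         # rate parameter of the exponential distribution
n <- 40               # number of exponentials per sample
simulations <- 1000   # number of simulated samples

# Each column of `samples` holds one simulated sample of n exponentials
samples <- replicate(simulations, rexp(n, lambda))
# One average per simulation (column means)
means_exponentials <- data.frame(x = apply(samples, 2, mean))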
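
As a quick sanity check on the shape of the simulated data, the sketch below repeats the setup and confirms that replicate returns one column per simulation. The name means_alt is illustrative, and colMeans is shown only as an equivalent alternative to the apply call, not as the method used in this report.

```r
set.seed(1337)
lambda <- 0.2
n <- 40
simulations <- 1000

samples <- replicate(simulations, rexp(n, lambda))

# Rows are the n draws within a sample, columns are the simulations
dim(samples)

# colMeans gives the same per-simulation averages as apply(samples, 2, mean)
means_alt <- colMeans(samples)
length(means_alt)
all.equal(means_alt, apply(samples, 2, mean))
```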

The histogram below gives a first exploratory look at the simulated means.

means_histogram <- ggplot(means_exponentials, aes(x = x)) + 
  geom_histogram(bins = 30) +  # 30 is the ggplot2 default; set explicitly to avoid the bin-choice message
  labs(title="Distribution of Averages of 40 Exponentials over 1000 Simulations",
       y="Frequency",
       x="Mean")
means_histogram

Sample Mean versus Theoretical Mean

Question 1: Show the sample mean and compare it to the theoretical mean of the distribution.
The theoretical mean of the exponential distribution is 1/lambda:

(theoretical_mean <- 1/lambda)
## [1] 5
(sample_mean <- mean(means_exponentials$x))
## [1] 5.055995

The sample mean is about 5.06, which is within about 1.1% of the theoretical mean of 5. The histogram below marks both means on the distribution to highlight how close they are.

means_histogram + 
  geom_vline(aes(xintercept = sample_mean, col = "Sample")) +
  geom_vline(aes(xintercept = 1/lambda, col = "Theoretical")) +
  scale_color_manual(name = "Means", values = c("Sample" = "red", "Theoretical" = "blue"))

Confidence Interval

We can also compute a 95% confidence interval for the mean of the simulated averages:

t.test(means_exponentials$x)$conf.int
## [1] 5.005797 5.106193
## attr(,"conf.level")
## [1] 0.95

The confidence interval is narrow, with a width of only about 0.1. Note that its lower bound, 5.006, lies just above the theoretical mean of 5.
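
To make explicit what t.test computes here, the following sketch rebuilds the interval by hand as the mean plus or minus a t quantile times the standard error. The names x, N, se, ci_manual and ci_ttest are illustrative, and the simulation is repeated so the block runs on its own.

```r
set.seed(1337)
x <- colMeans(replicate(1000, rexp(40, 0.2)))

N <- length(x)            # number of simulated means
se <- sd(x) / sqrt(N)     # standard error of their average

# 95% interval: mean +/- t quantile (N - 1 degrees of freedom) * standard error
ci_manual <- mean(x) + c(-1, 1) * qt(0.975, df = N - 1) * se
ci_ttest <- as.numeric(t.test(x)$conf.int)

ci_manual
all.equal(ci_manual, ci_ttest)
```

The interval is narrow because the standard error divides by the square root of the 1000 simulated means, not because each individual mean is precise.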

Sample Variance versus Theoretical Variance

Question 2: Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Each exponential has variance (1/lambda)^2, so the theoretical variance of the mean of n exponentials is (1/lambda)^2 / n:

(theoretical_variance <- (1/lambda)^2/n)
## [1] 0.625
(sample_variance <- var(means_exponentials$x))
## [1] 0.6543703

As shown, the sample variance, 0.654, is close to the theoretical variance, 0.625.

We can also compare the corresponding standard deviations.

(theoretical_sd <- (1/lambda)/sqrt(n))
## [1] 0.7905694
(sample_sd <- sd(means_exponentials$x))
## [1] 0.8089316

These are close as well, as expected, since they are intrinsically tied to the variances above.
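
The 1/n scaling of the variance of the mean can be checked with a small sketch that repeats the simulation for several sample sizes. The sizes 10 and 160 and the names sizes, sim_var and theo_var are illustrative choices, not part of the project instructions.

```r
set.seed(1337)
lambda <- 0.2
sizes <- c(10, 40, 160)

# Variance of 1000 simulated means for each sample size
sim_var <- sapply(sizes, function(n) {
  var(colMeans(replicate(1000, rexp(n, lambda))))
})

# Theoretical variance of the mean: (1/lambda)^2 / n
theo_var <- (1/lambda)^2 / sizes

data.frame(n = sizes, theoretical = theo_var, simulated = sim_var)
```

Quadrupling the sample size should cut the variance of the mean to roughly a quarter, in both the theoretical and the simulated columns.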

Distribution

Question 3: Show that the distribution is approximately normal.
The central limit theorem states that the distribution of sample means approaches a normal distribution as the sample size grows, so the 1000 simulated means should be approximately normal.

ggplot(means_exponentials, aes(x = x)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), fill = "gray", bins = 30) +
  geom_density(aes(col = "Sample")) +
  stat_function(fun = dnorm,
                args = list(mean = 1/lambda, sd = sqrt(theoretical_variance)),
                aes(col = "Normal")) +
  scale_color_manual(name = "Distributions", values = c("Sample" = "red", "Normal" = "blue")) +
  labs(title="Distribution of Averages of 40 Exponentials over 1000 Simulations",
       y="Density",
       x="Mean")

As we can see in this graph, the distribution of means closely follows a normal distribution. There is a slight difference on the right tail of the sample distribution, which reflects the skewness that remains in averages of only 40 exponentials. Increasing the number of simulations would only give a smoother estimate of this same distribution; per the central limit theorem, it is increasing the sample size beyond 40 that would bring the distribution of means closer to normal.
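
As a numeric complement to the plot, the sketch below standardizes the simulated means and checks two things: the share falling within 1.96 standard errors of the theoretical mean (about 0.95 for a normal distribution) and the skewness, which is 0 for a normal distribution. The skewness helper is hand-written for illustration rather than taken from a package, and the names x, z, coverage and sample_skew are assumptions of this sketch. For the mean of n exponentials the theoretical skewness is 2/sqrt(n), which is small but positive at n = 40 and matches the slight right tail seen in the plot.

```r
set.seed(1337)
lambda <- 0.2
n <- 40
x <- colMeans(replicate(1000, rexp(n, lambda)))

# Standardize with the theoretical mean and standard error of the mean
z <- (x - 1/lambda) / ((1/lambda) / sqrt(n))

# Share of standardized means inside +/- 1.96
coverage <- mean(abs(z) <= qnorm(0.975))

# Moment estimator of skewness (hand-written helper)
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5
sample_skew <- skewness(x)
theoretical_skew <- 2 / sqrt(n)   # shrinks toward 0 as n grows

c(coverage = coverage, sample_skew = sample_skew, theoretical_skew = theoretical_skew)
```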