Overview

From the instructions:

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.

# The tidyverse package automatically loads several usefull packages at once
library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# Figure captions:
library(captioner)
fig_caption <- captioner::captioner("Figure")

# Simulation parameters
lambda <- 0.2
N <- 40 # size of each distriburion
nsim <- 1000 # number of simulations
set.seed(1)

Simulations

# simululate 1000 random exponential ditributions of size 40: 
mat <- c()
means <- c()
sds <-  c()

for (i in 1:nsim) {
  # keep all values
  sample_exp <- rexp(N, lambda)
  mat <- c(mat, sample_exp)
  
  # keep the means and SDs for eaxh simulated sample
  mu <- mean(sample_exp)
  means <- c(means, mu)
  
  s <- sd(sample_exp)
  sds <- c(sds, s)
}

# every col is a new distribution
mat <- matrix(mat, nrow=N)

# c((1/lambda) / sqrt(N), sd(means))

Results

Objective 1

Show the sample mean and compare it to the theoretical mean of the distribution.

According to the central limit theorem, the distribution of means of a large number of samples from ANY distribution should be approximatively normal. In this excecise, I generated 1000 samples from an exponential distribution with lambda = 0.2. If lambda = 0.2, the means and SDs of each of the samples should be close to 1/lambda = 1/0.2 = 5.0. The means should be normally distributed around this value, with a SD of (1/lambda) / sqrt(N) = 0.79.

list(`theoretical mean` = 1/lambda, `mean of sample means` = mean(means))

## $`theoretical mean`
## [1] 5
## 
## $`mean of sample means`
## [1] 4.990025

mat %>% as.data.frame.matrix() %>% tidyr::gather("Sample", "Values") %>%
  ggplot(aes(x = `Values`)) +
  geom_density(aes(group=`Sample`), size=0.025) +
  geom_density(color="blue") +
  # ggpubr::stat_central_tendency(type = "mean", color="blue") +
  geom_vline(xintercept = mean(means), color="black") + 
  theme(legend.position = "none") +
  annotate("text", x = 60, y=0.3, hjust=1, vjust=1,
           label=paste0("mean of means = ", scales::number(mean(means), 0.001), "\n",
                        "mean of SDs (SE) = ", scales::number(mean(sds), 0.001), "\n",
                        "theoretical mean and SE = ", scales::number(1/lambda, 0.1))) +
  labs(caption = paste0("black lines: each random distribution\n",
                        "blue line: overall distribution\n",
                        "| : mean of sample means")) +
  scale_x_continuous("Samples' values", breaks = seq(0, 100, 10), limits=c(0, 60)) + 
  scale_y_continuous("Density", limits=c(0, 0.3))

Figure 1: The chart shows the density of each of the 1000 sampled distributions (black lines) and of the overall distribution (size 40*1000, blue line). The mean of sample means is shown as a vertical line.

Objective 2

Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

The theoretical variance should be (1/lambda)^2 / N = 0.625. The variance of the distribution of generated means is 0.611, which is close the theoretical value. I expect that, with larger number of simulations and / or larger samples, it should become even closer.

list(`theoretical variance` = (1/lambda)^2 / N, `generated variance` = var(means))

## $`theoretical variance`
## [1] 0.625
## 
## $`generated variance`
## [1] 0.6111165

Objective 3

Show that the distribution is approximately normal. Focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

data.frame(M = means) %>%
  ggplot(aes(x=M)) + 
  geom_density(fill="grey", color="grey50") + 
  # ggpubr::stat_overlay_normal_density(lty="dashed") +
  # ggpubr::stat_central_tendency(type = "mean", lty="dashed") +
  geom_vline(xintercept = 1/lambda) +
  stat_function(geom = "line", fun = dnorm, lty="dashed",
                args = list(mean = 1/lambda, sd = (1/lambda) / sqrt(N))) +
  annotate("text", x = 2, y=0.5, hjust=0, vjust=1,
           label=paste0("mean of means = ", scales::number(mean(means), 0.001), "\n",
                        "theoretical mean and SE = ", scales::number(1/lambda, 0.1))) +
  labs(caption = paste0("gray area: distribution of sampled means\n",
                        "--- : normal distribution with the theoretical mean and SD \n",
                        "| : theoretical mean of means")) +
  scale_x_continuous("Samples' means", breaks = 0:10, limits=c(2, 8)) +
  scale_y_continuous("Density")

Figure 2: The chart compares the distribution of the sample means with a normal distribution with the theoretical parameters (mean=5.0, sd=0.79). The generated distribution looks very similar to the expected gaussian distribution.

Extra

As the nuber of sample increases, the sample mean and SD converge to the theoretical value of 5.0.

data.frame(x=1:nsim, `Mean`=cumsum(means)/1:nsim, `SD` = cumsum(sds)/1:nsim) %>%
  tidyr::gather(key="Measure", value="y", -c("x")) %>%
  ggplot(aes(x=x, y=y)) +
  geom_hline(yintercept = 1/lambda) +
  geom_line(aes(color=`Measure`)) +
  scale_x_continuous("Sample number") + 
  scale_y_continuous("Cummulative Average Value", breaks = seq(4, 6, 0.25), limits=c(4, 5.5)) + 
  annotate("text", x = 1000, y=5, hjust=1, vjust=-1,
           label=paste0("theoretical mean and SE = ", scales::number(1/lambda, 0.1)))

Figure 3: Sample mean (red) and SD (blue) converge to the theoretical value.

Simulation Exercise Instructions