From the instructions:
In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.
Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials.
# The tidyverse package automatically loads several usefull packages at once
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Figure captions:
library(captioner)
fig_caption <- captioner::captioner("Figure")
# Simulation parameters
lambda <- 0.2
N <- 40 # size of each distriburion
nsim <- 1000 # number of simulations
set.seed(1)
# simululate 1000 random exponential ditributions of size 40:
mat <- c()
means <- c()
sds <- c()
for (i in 1:nsim) {
# keep all values
sample_exp <- rexp(N, lambda)
mat <- c(mat, sample_exp)
# keep the means and SDs for eaxh simulated sample
mu <- mean(sample_exp)
means <- c(means, mu)
s <- sd(sample_exp)
sds <- c(sds, s)
}
# every col is a new distribution
mat <- matrix(mat, nrow=N)
# c((1/lambda) / sqrt(N), sd(means))
Show the sample mean and compare it to the theoretical mean of the distribution.
According to the central limit theorem, the distribution of means of a large number of samples from ANY distribution should be approximatively normal. In this excecise, I generated 1000 samples from an exponential distribution with lambda = 0.2. If lambda = 0.2, the means and SDs of each of the samples should be close to 1/lambda = 1/0.2 = 5.0. The means should be normally distributed around this value, with a SD of (1/lambda) / sqrt(N) = 0.79.
list(`theoretical mean` = 1/lambda, `mean of sample means` = mean(means))
## $`theoretical mean`
## [1] 5
##
## $`mean of sample means`
## [1] 4.990025
mat %>% as.data.frame.matrix() %>% tidyr::gather("Sample", "Values") %>%
ggplot(aes(x = `Values`)) +
geom_density(aes(group=`Sample`), size=0.025) +
geom_density(color="blue") +
# ggpubr::stat_central_tendency(type = "mean", color="blue") +
geom_vline(xintercept = mean(means), color="black") +
theme(legend.position = "none") +
annotate("text", x = 60, y=0.3, hjust=1, vjust=1,
label=paste0("mean of means = ", scales::number(mean(means), 0.001), "\n",
"mean of SDs (SE) = ", scales::number(mean(sds), 0.001), "\n",
"theoretical mean and SE = ", scales::number(1/lambda, 0.1))) +
labs(caption = paste0("black lines: each random distribution\n",
"blue line: overall distribution\n",
"| : mean of sample means")) +
scale_x_continuous("Samples' values", breaks = seq(0, 100, 10), limits=c(0, 60)) +
scale_y_continuous("Density", limits=c(0, 0.3))
Figure 1: The chart shows the density of each of the 1000 sampled distributions (black lines) and of the overall distribution (size 40*1000, blue line). The mean of sample means is shown as a vertical line.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
The theoretical variance should be (1/lambda)^2 / N = 0.625. The variance of the distribution of generated means is 0.611, which is close the theoretical value. I expect that, with larger number of simulations and / or larger samples, it should become even closer.
list(`theoretical variance` = (1/lambda)^2 / N, `generated variance` = var(means))
## $`theoretical variance`
## [1] 0.625
##
## $`generated variance`
## [1] 0.6111165
Show that the distribution is approximately normal. Focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.
data.frame(M = means) %>%
ggplot(aes(x=M)) +
geom_density(fill="grey", color="grey50") +
# ggpubr::stat_overlay_normal_density(lty="dashed") +
# ggpubr::stat_central_tendency(type = "mean", lty="dashed") +
geom_vline(xintercept = 1/lambda) +
stat_function(geom = "line", fun = dnorm, lty="dashed",
args = list(mean = 1/lambda, sd = (1/lambda) / sqrt(N))) +
annotate("text", x = 2, y=0.5, hjust=0, vjust=1,
label=paste0("mean of means = ", scales::number(mean(means), 0.001), "\n",
"theoretical mean and SE = ", scales::number(1/lambda, 0.1))) +
labs(caption = paste0("gray area: distribution of sampled means\n",
"--- : normal distribution with the theoretical mean and SD \n",
"| : theoretical mean of means")) +
scale_x_continuous("Samples' means", breaks = 0:10, limits=c(2, 8)) +
scale_y_continuous("Density")
Figure 2: The chart compares the distribution of the sample means with a normal distribution with the theoretical parameters (mean=5.0, sd=0.79). The generated distribution looks very similar to the expected gaussian distribution.
As the nuber of sample increases, the sample mean and SD converge to the theoretical value of 5.0.
data.frame(x=1:nsim, `Mean`=cumsum(means)/1:nsim, `SD` = cumsum(sds)/1:nsim) %>%
tidyr::gather(key="Measure", value="y", -c("x")) %>%
ggplot(aes(x=x, y=y)) +
geom_hline(yintercept = 1/lambda) +
geom_line(aes(color=`Measure`)) +
scale_x_continuous("Sample number") +
scale_y_continuous("Cummulative Average Value", breaks = seq(4, 6, 0.25), limits=c(4, 5.5)) +
annotate("text", x = 1000, y=5, hjust=1, vjust=-1,
label=paste0("theoretical mean and SE = ", scales::number(1/lambda, 0.1)))
Figure 3: Sample mean (red) and SD (blue) converge to the theoretical value.