The document provides an analysis of exponential distribution with the central limit theorem applied to it. The code used to generate the data and plots is shown in Appendix section.
Let’s simulate exponential distribution sample of 40,000 observations and rate parameter \(\lambda = 0.2\) (see Appendix 1. Exponential Distribution Sample Simulation).
Such sample distribution is build according exponential distribution probability density function which has the form
\[ f(x;\lambda) = \left\{ \begin{array}{ll} \lambda \exp^{-\lambda x} & \mbox{if } x \geq 0, \\ 0 & \mbox{if } x < 0. \end{array} \right. \]
where \(x\) is a random variable and \(\lambda\) is a rate parameter.
Now we’re going to perform next steps (see Appendix 2. Sample Means Calculation):
Split the simulated sample distribution into 1,000 samples, so we have 40 observations per each sample.
Calculate mean for each sample.
The expected value of the exponential distribution population is
\[E[X] = \frac{1}{\lambda} = \frac{1}{0.2} = 5\]
As we can see our sample mean has almost identical value of 5.02. Our sample standard deviation (0.82) is almost equal to the estimated standard error, which can be calculated by taking the exponential distribution population standard deviation and applying it to the distribution sample size
\[\frac{\sigma}{\sqrt{n}} = \frac{1/\lambda}{\sqrt{n}} = \frac{1/0.2}{\sqrt{40}} \approx 0.79\]
where \(n\) is the number of sample observations per calculated mean.
All these calculations reaffirm the central limit theorem (CLT). It states that as sample size, \(n\), gets larger, the distribution of the difference between the sample average \(\bar{X}_n\) and its limit \(\mu\), when multiplied by the factor \(\sqrt{n}\) (that is \(\sqrt{n} (\bar{X}_n - \mu)\)), approximates the normal distribution with mean 0 and variance \(\sigma^2\). For large enough \(n\), the distribution of \(\bar{X}_n\) is close to the normal distribution with mean \(\mu\) and variance \(\sigma^2 / n\). Hence we may claim \(\bar{X} \sim N(\mu, \sigma^2 / n)\) for any large enough \(\bar{X}\). The CLT is also applicable for the independent and identically distributed (IID) sample of averages, like our one is. Let’s apply the CLT to our sample means distribution.
We’ll now going to show that our exponential sample means distribution is the standard normal. To do this we need to “shift” our sample distribution to the standard normal. This is done by (see Appendix 3. Apply CLT to Exponential Sample Means Distribution):
subtracting exponential distribution mean from the sample mean;
dividing obtained result by estimated standard error.
\[\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{\bar{X} - 1 / \lambda}{\frac{1}{\lambda} / \sqrt{n}}\]
As CLT states for the distribution of averages of IID variables, our normalized sample exponential distribution of means become approximately standard normal with mean 0.03 and variance 1.06 (i.e. \(\bar{X} \sim N(\mu, \sigma^2)\)). We may compare our normal distribution to simply 1,000 random exponent values distribution. It doesn’t look like bell-shaped. Due to exponential nature it has a peak on its leftmost side and a long tail on the right with mean 5.01 and variance 25.25.
library(ggplot2)
library(scales)
## Rate parameter of the exponential distribution to be simulated
lambda <- 0.2
set.seed(1028380L)
## Exponential distribution sample
exponential_sample <-
rexp(n = sample_means_quantity * observations_per_sample, rate = lambda)
ggplot() +
geom_histogram(
mapping = aes(x = exponential_sample),
color = "gray50",
bins = 50
) +
stat_function(
mapping = aes(x = exponential_sample),
fun = function(x)
{
## To draw exponential density distribution along the "count"
## y-scale instead of "density" 'dexp()' result vector should be
## multiplied by 'exponential_sample' observations quantity
dexp(x = x, rate = lambda) * length(exponential_sample)
}
) +
scale_y_continuous(labels = comma) +
labs(
title = "Exponential Distribution Sample",
x = "value",
y = "number of occurences"
)
library(dplyr)
## Sample averages of 'observations_per_sample' exponential distribution
## simulations
sample_averages <-
matrix(
data = exponential_sample,
nrow = sample_means_quantity,
ncol = observations_per_sample
) %>%
apply(MARGIN = 1, FUN = mean)
ggplot() +
geom_histogram(
mapping = aes(x = sample_averages, y = ..density..),
color = "gray50",
bins = 40
) +
stat_function(
mapping = aes(x = sample_averages),
size = 2,
fun = dnorm,
args = list(mean = mean(sample_averages), sd = sd(sample_averages))
) +
geom_vline(
mapping = aes(
xintercept = c(
1 / lambda,
mean(sample_averages),
(1 / lambda) + ((1 / lambda) / sqrt(observations_per_sample)),
mean(sample_averages) + sd(sample_averages)
),
color = c(
"exponential distribution mean",
"sample mean",
"estimated std. error",
"sample std. deviation"
)
)
) +
scale_color_manual(name = "", values = c("blue", "red", "green", "purple")) +
geom_vline(
color = "blue",
xintercept = (1 / lambda) - ((1 / lambda) / sqrt(observations_per_sample))
) +
geom_vline(
color = "purple",
xintercept = mean(sample_averages) - sd(sample_averages)
) +
theme(legend.position = "bottom") +
labs(title = "Sample Means Distribution", x = "sample average")
library(gridExtra)
sample_averages_normalized <-
(sample_averages - 1 / lambda) / ((1 / lambda) / sqrt(observations_per_sample))
sample_means_normalized_plot <-
ggplot() +
geom_histogram(
mapping = aes(x = sample_averages_normalized, y = ..density..),
color = "gray50",
bins = 40
) +
stat_function(
mapping = aes(x = sample_averages_normalized),
size = 2,
fun = dnorm,
args = list(
mean = mean(sample_averages_normalized),
sd = sd(sample_averages_normalized)
)
) +
geom_vline(
color = c("blue", "red", "red"),
xintercept = c(
mean(sample_averages_normalized),
mean(sample_averages_normalized) + sd(sample_averages_normalized),
mean(sample_averages_normalized) - sd(sample_averages_normalized)
)
) +
labs(
title = "Sample Means Distribution Normalized",
x = "sample average"
)
set.seed(122830L)
## Set of 1000 random exponentials to show their distribution
random_exponentials <- rexp(n = 1000, rate = lambda)
random_exponentials_plot <-
ggplot() +
geom_histogram(
mapping = aes(x = random_exponentials, y = ..density..),
color = "gray50",
bins = 40
) +
stat_function(
mapping = aes(x = random_exponentials),
size = 2,
fun = dexp,
args = list(rate = lambda)
) +
geom_vline(
mapping = aes(
xintercept = c(
mean(random_exponentials),
mean(random_exponentials) + sd(random_exponentials)
),
color = c(
"sample mean",
"sample std. deviation"
)
)
) +
scale_color_manual(name = "", values = c("blue", "red")) +
geom_vline(
color = "red",
xintercept = mean(random_exponentials) - sd(random_exponentials)
) +
theme(legend.position = "bottom") +
labs(title = "Random Exponentials Distribution", x = "sample value")
grid.arrange(sample_means_normalized_plot, random_exponentials_plot)