The central limit theorem states that the sampling distribution of the mean (the distribution of sample means across many samples) is approximately normal, centred at the population mean, with standard error equal to the population standard deviation divided by the square root of the sample size.
This demonstration shows how the means of sufficiently many samples drawn from the exponential distribution have a distribution that approximates the normal distribution.
The exponential distribution has mean and standard deviation of \(1/\lambda\).
We will simulate 1000 samples, each of size 40, from an exponential distribution with rate \(\lambda=0.2\).
knitr::opts_chunk$set(cache=FALSE, echo = TRUE)
library(ggplot2); library(ggthemes); library(kableExtra); library(moments)
count_sims <- 1000
lambda <- 0.2
n <- 40
Using rexp, we create the samples and then compute the average of each one.
set.seed(123456)
mean_sim_rows <- rowMeans(
  matrix(
    rexp(n = count_sims * n, rate = lambda),
    count_sims,
    n
  )
)
The theoretical mean of the exponential distribution is \(\mu = 1/\lambda\). In this case, with \(\lambda = 0.2\), the mean is \(\mu = 1/0.2 = 5\).
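The matrix-based simulation above can be cross-checked with a more literal formulation: draw one sample, take its mean, and repeat. The following is a minimal sketch under that assumption; the names ending in `_chk` are hypothetical and are not part of the report.

```r
# Cross-check of the simulation using replicate() instead of a matrix.
# All `_chk` names are hypothetical, introduced only for this illustration.
set.seed(123456)
count_sims_chk <- 1000   # number of simulated samples
n_chk <- 40              # observations per sample
lambda_chk <- 0.2        # rate of the exponential distribution

# Draw one sample of size n_chk, take its mean, and repeat count_sims_chk times.
means_chk <- replicate(count_sims_chk, mean(rexp(n_chk, rate = lambda_chk)))

length(means_chk)   # one mean per simulated sample
mean(means_chk)     # expected to sit near 1 / lambda_chk
```

Because matrix() fills column by column, the individual sample means from this version will generally differ from those in `mean_sim_rows`, but they are draws from the same sampling distribution.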
The following code plots the distribution of sample means and calculates the overall mean of those values.
df_plot <- as.data.frame(mean_sim_rows)
g <- ggplot(df_plot, aes(mean_sim_rows)) +
  theme_minimal() +
  geom_histogram(
    alpha = 0.75,
    bins = 30,
    aes(y = after_stat(density)),
    position = "identity",
    fill = "orange",
    col = "darkgrey"
  ) +
  geom_line(
    aes(y = after_stat(density)),
    colour = "red",
    stat = "density",
    size = 1.5,
    alpha = 0.6
  ) +
  geom_vline(
    aes(xintercept = 1 / lambda,
        colour = "Theoretical")
  ) +
  geom_vline(
    aes(xintercept = mean(mean_sim_rows),
        colour = "Sample")
  ) +
  ggtitle(NULL) +
  xlab("Sample mean") +
  ylab("Density") +
  scale_color_manual(
    name = "Means",
    values = c(Theoretical = "blue", Sample = "green")
  ) +
  theme(plot.margin = margin(0, 1, 0.5, 0.5, "cm"))
df_tbl <- data.frame("Mean" = c(mean(mean_sim_rows), 1 / lambda))
rownames(df_tbl) <- c("Simulation", "Theoretical")
tbl_means <- df_tbl %>% kbl() %>% kable_styling()

| | Mean |
|---|---|
| Simulation | 5.022915 |
| Theoretical | 5.000000 |
The simulated mean of 5.022915 is very close to the theoretical value of 5, and the distribution of the sample means is approaching a normal (Gaussian) bell curve.
The theoretical variance of the mean of a sample of size \(n = 40\) from the exponential distribution is given by \(\sigma^2_{\bar{X}} = \frac{(1/\lambda)^2}{n}=\frac{(1/0.2)^2}{40}=0.625\).
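The step from the population variance to the variance of the sample mean relies on the draws being independent and identically distributed. Writing \(\bar{X}\) for the mean of a sample \(X_1, \dots, X_n\) of size \(n = 40\) (notation introduced here for illustration), the reasoning can be sketched as:

\[
\operatorname{Var}(\bar{X})
= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
= \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
= \frac{\sigma^2}{n}
= \frac{(1/\lambda)^2}{n}
= \frac{25}{40}
= 0.625
\]

Taking the square root gives the standard error named in the central limit theorem, \(\sigma/\sqrt{n} = 5/\sqrt{40} \approx 0.79\).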
The sample variance can be found by applying the var function to the sample means:
sim_var <- var(mean_sim_rows)
The variance of the simulated means is 0.6557463, close to the theoretical value of 0.625. The small difference between the simulated and theoretical variances would tend to decrease with more simulations (and, in absolute terms, with a larger sample size).
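The claim that the gap narrows with more simulations can be explored directly. The sketch below is an illustration only: `var_of_means`, `sim_sizes` and the seed are hypothetical choices, not part of the original analysis, and the exact figures will depend on the seed.

```r
# Sketch: variance of simulated sample means for increasing numbers of simulations.
# `var_of_means`, `sim_sizes` and the seed are hypothetical, chosen for illustration.
set.seed(2024)
lambda <- 0.2                    # same rate as in the report
n <- 40                          # same sample size as in the report
theo_var <- (1 / lambda)^2 / n   # theoretical variance of the sample mean

# Simulate `count_sims` samples of size n and return the variance of their means.
var_of_means <- function(count_sims) {
  draws <- matrix(rexp(count_sims * n, rate = lambda), nrow = count_sims, ncol = n)
  var(rowMeans(draws))
}

sim_sizes <- c(100, 1000, 10000, 100000)
sim_vars <- sapply(sim_sizes, var_of_means)

# Tabulate the simulated variance and its distance from the theoretical value.
data.frame(
  count_sims = sim_sizes,
  simulated_var = sim_vars,
  abs_error = abs(sim_vars - theo_var)
)
```

The error need not fall at every step, since each row is a fresh random draw, but it should trend towards zero as `count_sims` grows.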
First we can compare the density function of the sampled means with the theoretical normal distribution:
g <- ggplot(df_plot, aes(mean_sim_rows)) +
  theme_minimal() +
  stat_function(
    fun = dnorm,
    aes(colour = "Theoretical"),
    size = 1.5,
    args = list(
      mean = 1 / lambda,
      sd = sqrt((1 / lambda)^2 / n)
    )
  ) +
  geom_density(
    aes(colour = "Sample"),
    size = 1.5
  ) +
  ggtitle(NULL) +
  xlab("Sample mean") +
  ylab("Density") +
  scale_color_manual(
    name = "Means",
    values = c(Theoretical = "blue", Sample = "green")
  ) +
  theme(plot.margin = margin(0, 0, 0.5, 0.5, "cm"))
Overall, the distribution shows a good approximation to the normal with the following notes:
sim_median <- median(mean_sim_rows)
sim_skew <- skewness(mean_sim_rows)
The latter point can be confirmed by examining the median (4.9616838), which lies below the simulated mean, and the skewness (0.4634158), which is positive; both indicate a small right skew.
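A positive skew is expected here rather than being a quirk of the simulation. The sum of \(n\) independent exponential draws follows a gamma distribution with shape \(n\), whose skewness is \(2/\sqrt{n}\), and rescaling the sum to a mean leaves skewness unchanged. The sketch below compares that value with a simulated one; `skew_moment` is a hypothetical base-R helper written for this illustration, not a function from the moments package, and the seed and simulation count are arbitrary.

```r
# Sketch: moment-based skewness in base R versus the value 2 / sqrt(n)
# expected for the mean of n exponential draws.
# `skew_moment` is a hypothetical helper, not part of the moments package.
skew_moment <- function(x) {
  d <- x - mean(x)            # deviations from the mean
  mean(d^3) / mean(d^2)^1.5   # third central moment over variance^(3/2)
}

set.seed(2024)                # arbitrary seed, for reproducibility only
lambda <- 0.2                 # same rate as in the report
n <- 40                       # same sample size as in the report

# 100000 simulated samples of size n, one row per sample.
means_big <- rowMeans(matrix(rexp(100000 * n, rate = lambda), ncol = n))

theo_skew <- 2 / sqrt(n)      # skewness of a gamma distribution with shape n
c(simulated = skew_moment(means_big), theoretical = theo_skew)
```

With only 1000 simulations, as in the report, the estimated skewness is noticeably noisier, so some distance from \(2/\sqrt{40} \approx 0.32\) is to be expected.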
We can also use a Q-Q plot to compare the sample means with the normal distribution. Points from a normally distributed sample line up along the reference line.
ggplot(df_plot, aes(sample = mean_sim_rows)) +
  theme_minimal() +
  ggtitle("Normal Q-Q Plot") +
  xlab("Theoretical Quantiles") +
  ylab("Sample Quantiles") +
  stat_qq(colour = "green", alpha = 0.3, size = 2.5) +
  stat_qq_line(colour = "blue")
For the bulk of the data, the sample means line up well with the normal distribution, while the tails show the small right skew mentioned previously.
Taking the means of 1000 samples of 40 random values from the exponential distribution has provided a good approximation of the normal distribution.
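To make the Q-Q comparison concrete, the coordinates behind such a plot can be computed by hand: the ordered sample values are paired with standard normal quantiles at evenly spread probabilities. The sketch below uses base R's `qqnorm()` and `ppoints()` as the reference; the seed and the names `means_qq`, `sample_q` and `theo_q` are assumptions made for this illustration.

```r
# Sketch: the coordinates behind a normal Q-Q plot, computed by hand.
# `means_qq`, `sample_q` and `theo_q` are hypothetical names for this illustration.
set.seed(2024)   # arbitrary seed, for reproducibility only
lambda <- 0.2    # same rate as in the report
n <- 40          # same sample size as in the report
means_qq <- rowMeans(matrix(rexp(1000 * n, rate = lambda), ncol = n))

sample_q <- sort(means_qq)                   # ordered sample means (y axis)
theo_q <- qnorm(ppoints(length(means_qq)))   # standard normal quantiles (x axis)

# Base R's qqnorm() returns the same pairs, in the order of the input vector.
qq <- qqnorm(means_qq, plot.it = FALSE)
all.equal(sort(qq$x), theo_q)
all.equal(sort(qq$y), sample_q)

# A correlation close to 1 indicates the points lie near a straight line.
cor(theo_q, sample_q)
```

A high correlation summarises the bulk of the points; departures in the tails, such as the right skew noted above, are easier to see in the plot itself than in a single number.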