\(~\)
library('tidyverse') #for ggplot2
library('ggpubr') # to build qqplot
\(~\)
This project will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The averages of 40 exponentials will be analyzed.
\(~\)
The code for generating random exponential distribution in R is rexp(n,lamda) where n refers to the sample size and lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. In our exercise, lambda is set to 0.2 for all the simulations.
\(~\)
\(~\)
This graph visualizes exponential distributions. The seed is set so that the results are reproducible. The random samples are drawn from exponential distribution generated using rexp(n) with the sample size (n) ranging from 10,100,1000,10000. The default value of 1 is used as the rate parameter. The histogram plots show that, as sample size increases, the sample distribution approaches to be an exponential distribution.
set.seed(1)
par(mfrow=c(2,2))
set.seed(1)
hist(rexp(10))
hist(rexp(100))
hist(rexp(1000))
hist(rexp(10000))
\(~\)
Generate the distribution of the averages of 40 exponentials with lambda (rate parameter) = 0.2.
set.seed(1)
mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(40,0.2)))
data <- data.frame(mns, size = 40)
\(~\)
# Sample Means
sampleMeans <- mean(data$mns)
# theoretical mean of exponential distribution is 1/lambda
theoreticalMeans <- 1/0.2
data.frame(
metric = c("sample mean", "theoretical mean"),
value = c(sampleMeans, theoreticalMeans)
)
## metric value
## 1 sample mean 4.990025
## 2 theoretical mean 5.000000
The theoretical mean of exponential distribution is 1/lambda. The mean of the averages of 40 exponential (4.999)is nearly identical to theoretical mean (5.00)
\(~\)
data %>%
ggplot(aes(x= mns))+
geom_histogram(aes(y = ..density..),
binwidth =.25, fill = "#69b3a2", color = "black") +
geom_text(aes(x = mean(mns), label="\nsample mean", y = 0.2),
colour="black",angle = 90, size = 5)+
geom_vline(aes(xintercept = mean(mns)), size = 1.2, colour = "black") +
labs(
title = "Distribution of the averages of 40 random exponentials (1000 simulations)",
x = "Averages",
y = "Density"
)+
scale_y_continuous(limits = c(0, 0.6)) +
theme_bw() +
theme(legend.position = "none")
\(~\)
# Sample Means
sampleVariance <- var(data$mns)
# theoretical mean of exponential distribution is 1/lambda
theoreticalVariance <- ((1/0.2)/sqrt(40))^2
data.frame(
metric = c("sample variance", "theoretical variance"),
value = c(sampleVariance, theoreticalVariance)
)
## metric value
## 1 sample variance 0.6111165
## 2 theoretical variance 0.6250000
The standard deviation of the exponential distribution is 1/lambda. The variance is calculated using the standard deviation(sd) and sample size(n) (var = (sd/sqrt(n))^2 ). The theoretical variance is 0.625.
The sample variance (0.6111165)is nearly identical to theoretical mean (0.625)
\(~\)
\(~\)
## 10000 simulations - increase sample size
set.seed(1)
mns = NULL
for (i in 1 : 10000) mns = c(mns, mean(rexp(40,0.2)))
data2 <- data.frame(mns, size = 40)
\(~\)
Sim.1000 <- ggqqplot(data$mns)
Sim.10000 <- ggqqplot(data2$mns)
figure <- ggarrange(Sim.1000, Sim.10000,
labels = c("1000 simulations", "10000 simulations"),
ncol = 2, nrow = 1)
figure
QQplot plots theoretical quantiles against the actual quantiles of our
variable. If our variable follows a normal distribution, the quantiles
of our variable must be perfectly in line with the “theoretical” normal
quantiles: a straight line on the QQ Plot tells us we have a normal
distribution.
\(~\)
data %>%
ggplot(aes(x= mns))+
geom_histogram(aes(y = ..density..),
binwidth =.25, fill = "#69b3a2", color = "black") +
geom_text(aes(x = mean(mns), label="\nsample mean", y = 0.2),
colour="black",angle = 90, size = 5)+
geom_vline(aes(xintercept = mean(mns)), size = 1.2, colour = "black") +
stat_function(fun = dnorm, args = list(mean = mean(data$mns), sd = sd(data$mns))) +
labs(
title = "Distribution of the averages of 40 random exponentials (1000 simulations)",
x = "Averages",
y = "Density"
)+
scale_y_continuous(limits = c(0, 0.6)) +
theme_bw() +
theme(legend.position = "none")
The stat.function generates a bell curve, the graph above is not symmetrical, as expected with normal distribution. We can see two 2 apexes.
\(~\)
data2 %>%
ggplot(aes(x= mns))+
geom_histogram(aes(y = ..density..),
binwidth =.25, fill = "#69b3a2", color = "black") +
geom_text(aes(x = mean(mns), label="\nsample mean", y = 0.2),
colour="black",angle = 90, size = 5)+
geom_vline(aes(xintercept = mean(mns)), size = 1.2, colour = "black") +
stat_function(fun = dnorm, args = list(mean = mean(data2$mns), sd = sd(data2$mns))) +
labs(
title = "Distribution of the averages of 40 random exponentials (10000 simulations)",
x = "Averages",
y = "Density"
)+
scale_y_continuous(limits = c(0, 0.6)) +
theme_bw() +
theme(legend.position = "none")
When generating histogram with bell curve for 10,000 simulation, the
graph appears to be nearly symmetrical about the sample means.
\(~\)
This exercise illustrates the Central Limit Theorem which states that the distribution of averages of independent and identically distributed (IID) variables becomes that of a standard normal as the sample size increases.