\(~\)

library('tidyverse') #for ggplot2 
library('ggpubr') # to build qqplot 

\(~\)

Overview

This project will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The averages of 40 exponentials will be analyzed.

\(~\)

Simulations

The code for generating random exponential distribution in R is rexp(n,lamda) where n refers to the sample size and lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. In our exercise, lambda is set to 0.2 for all the simulations.

\(~\)

Introduction to Exponential Distibutions

\(~\)

This graph visualizes exponential distributions. The seed is set so that the results are reproducible. The random samples are drawn from exponential distribution generated using rexp(n) with the sample size (n) ranging from 10,100,1000,10000. The default value of 1 is used as the rate parameter. The histogram plots show that, as sample size increases, the sample distribution approaches to be an exponential distribution.

set.seed(1)

par(mfrow=c(2,2))

set.seed(1)

hist(rexp(10))
hist(rexp(100))
hist(rexp(1000))
hist(rexp(10000))

\(~\)

Distribution for averages of 40 exponentials

Generate the distribution of the averages of 40 exponentials with lambda (rate parameter) = 0.2.

set.seed(1)

mns = NULL
for (i in 1 : 1000) mns = c(mns, mean(rexp(40,0.2)))

data <- data.frame(mns, size = 40)

\(~\)

Sample Mean versus Theoretical Mean:

# Sample Means 
sampleMeans <-  mean(data$mns)

# theoretical mean of exponential distribution is 1/lambda 
theoreticalMeans <- 1/0.2 

data.frame(
   metric =  c("sample mean", "theoretical mean"),  
   value = c(sampleMeans, theoreticalMeans)
          ) 
##             metric    value
## 1      sample mean 4.990025
## 2 theoretical mean 5.000000

The theoretical mean of exponential distribution is 1/lambda. The mean of the averages of 40 exponential (4.999)is nearly identical to theoretical mean (5.00)

\(~\)

data %>% 
  ggplot(aes(x= mns))+
  geom_histogram(aes(y = ..density..),
                binwidth =.25, fill = "#69b3a2", color = "black") + 
  
  geom_text(aes(x = mean(mns), label="\nsample mean", y = 0.2),
            colour="black",angle = 90, size = 5)+
  
  geom_vline(aes(xintercept = mean(mns)), size = 1.2, colour = "black") +
  
  labs(
    title = "Distribution of the averages of 40 random exponentials (1000 simulations)", 
    x = "Averages", 
    y  = "Density"
    )+
            
  scale_y_continuous(limits = c(0, 0.6)) +
  
  theme_bw() +
  theme(legend.position = "none")

\(~\)

Sample Variance versus Theoretical Variance

# Sample Means 
sampleVariance <-  var(data$mns)

# theoretical mean of exponential distribution is 1/lambda 
theoreticalVariance <- ((1/0.2)/sqrt(40))^2

data.frame(
   metric =  c("sample variance", "theoretical variance"),  
   value = c(sampleVariance, theoreticalVariance)
) 
##                 metric     value
## 1      sample variance 0.6111165
## 2 theoretical variance 0.6250000

The standard deviation of the exponential distribution is 1/lambda. The variance is calculated using the standard deviation(sd) and sample size(n) (var = (sd/sqrt(n))^2 ). The theoretical variance is 0.625.

The sample variance (0.6111165)is nearly identical to theoretical mean (0.625)

\(~\)

Determine whether data follows Normal Distribution

\(~\)

## 10000 simulations - increase sample size 
set.seed(1)

mns = NULL
for (i in 1 : 10000) mns = c(mns, mean(rexp(40,0.2)))
data2 <- data.frame(mns, size = 40)

\(~\)

qqplots

Sim.1000 <- ggqqplot(data$mns)
Sim.10000 <- ggqqplot(data2$mns)

figure <- ggarrange(Sim.1000, Sim.10000,
                    labels = c("1000 simulations", "10000 simulations"),
                    ncol = 2, nrow = 1)
figure

QQplot plots theoretical quantiles against the actual quantiles of our variable. If our variable follows a normal distribution, the quantiles of our variable must be perfectly in line with the “theoretical” normal quantiles: a straight line on the QQ Plot tells us we have a normal distribution.

\(~\)

Histogram bell curve

data %>% 
  ggplot(aes(x= mns))+
  geom_histogram(aes(y = ..density..),
                binwidth =.25, fill = "#69b3a2", color = "black") + 
  
  geom_text(aes(x = mean(mns), label="\nsample mean", y = 0.2),
            colour="black",angle = 90, size = 5)+
  
  geom_vline(aes(xintercept = mean(mns)), size = 1.2, colour = "black") +
  
  stat_function(fun = dnorm, args = list(mean = mean(data$mns), sd = sd(data$mns))) + 

  
  labs(
    title = "Distribution of the averages of 40 random exponentials (1000 simulations)", 
    x = "Averages", 
    y  = "Density"
    )+
            
  scale_y_continuous(limits = c(0, 0.6)) +
  
  theme_bw() +
  theme(legend.position = "none")

The stat.function generates a bell curve, the graph above is not symmetrical, as expected with normal distribution. We can see two 2 apexes.

\(~\)

Histogram bell curve: 10000 Simulations

data2 %>% 
  ggplot(aes(x= mns))+
  geom_histogram(aes(y = ..density..),
                binwidth =.25, fill = "#69b3a2", color = "black") + 
  
  geom_text(aes(x = mean(mns), label="\nsample mean", y = 0.2),
            colour="black",angle = 90, size = 5)+
  
  geom_vline(aes(xintercept = mean(mns)), size = 1.2, colour = "black") +
  
  stat_function(fun = dnorm, args = list(mean = mean(data2$mns), sd = sd(data2$mns))) + 

  
  labs(
    title = "Distribution of the averages of 40 random exponentials (10000 simulations)", 
    x = "Averages", 
    y  = "Density"
    )+
            
  scale_y_continuous(limits = c(0, 0.6)) +
  
  theme_bw() +
  theme(legend.position = "none")

When generating histogram with bell curve for 10,000 simulation, the graph appears to be nearly symmetrical about the sample means.

\(~\)

Conclusion

This exercise illustrates the Central Limit Theorem which states that the distribution of averages of independent and identically distributed (IID) variables becomes that of a standard normal as the sample size increases.