Synopsis

These sheet has one goal :

it’s a simulation exercise.

Investigation of the exponential distribution in R and comparison with the Central Limit Theorem. The investigation is done with 1000 sample of exponentiel distribution. Each sample contains 40 observations.

Initialization code

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
my_plot_fun <- function(my_data) {
  my_data <- as.data.frame(my_data)
  ggplot(my_data,aes(x = my_data)) + theme_classic() +
theme(plot.title = element_text(color="#ADA717", size=18, face="bold.italic",hjust=0.5),
 axis.title.x = element_text(color="#993333", size=18, face="bold"),
 axis.text.x = element_text(color="#993333", size=14,vjust = 0),
 axis.title.y = element_text(color="darkgreen", size=18, face="bold"),
 axis.text.y = element_text(face="bold", color="darkgreen", size=14),
 legend.text = element_text(size=12),
 legend.title = element_text(size=20))}

Simulation Exercise

Means of each sample

  1. Simulation of samples, each sample contain 40 exponential distribution. The mean of each sample is stored in the object my_means
set.seed(10)
# Set lambda = 0.2 for all of the simulations. 
lambda <- .2
# Number of total simulations/samples
N <- 1000
# Number of observations in each sample
numb <- 40
my_means = NULL
# By a for loop, add the mean of exponential distribution of 40 observations and a rate of .2
for (i in 1 : N) {
  my_means <- c(my_means, mean(rexp(n = numb,rate = lambda)))
  }
# Calculation of the average of all the thousand means
mean_exp <- mean(my_means)
  1. What contains my_means ?
print(paste("The maximum is : ",round(max(my_means),4),". The minimum is : ",
            round(min(my_means),4),". The mean is : ",round(mean(my_means),4),sep = ""))
## [1] "The maximum is : 8.1401. The minimum is : 3.1262. The mean is : 5.0451"
count <- sum(my_means-qnorm(p = .975,mean = 5,sd = (1/lambda)/sqrt(numb))>0)
count <- count + sum(my_means-qnorm(p = .025,mean = 5,sd = (1/lambda)/sqrt(numb))<0)
print(count/1000*100)
## [1] 4.5
  1. First conclusion
There is 45 means that are out of the 95% confidence interval of the normal distribution. This represent 4.5% of the number of means instead of 5%.
There is slightly less means on the tails that it should be if the distribution of means was normal distributed.
  1. Here a plot that resume all above :
mpt <- my_plot_fun(my_means) +
  scale_x_continuous(breaks=seq(round(min(my_means))-1, round(max(my_means))+1, by = .5),
                     limits =c(2.5,9)) +
  geom_histogram(alpha=0.6,binwidth=.15,colour="black", fill="#935430") +
  geom_vline(aes(xintercept=mean_exp, colour = "Simulation mean", 
                 linetype="Simulation mean"), size=1) +
  geom_vline(aes(xintercept=5, colour="Normal mean", linetype="Normal mean"), size=1)
## Warning: Removed 2 rows containing missing values (geom_bar).


Variance of each sample

  1. The code below calculate the variance of each sample and store it in the object my_variances :
set.seed(10)
my_variances = NULL
# By a for loop, add the variance of exponential distribution of 40 observations and a rate of .2
for (i in 1 : N) {
  my_variances <- c(my_variances, var(rexp(n = numb,rate = lambda)))
}
mpt2 <- my_plot_fun(my_variances) +
  xlim(0,100)+
  geom_histogram(alpha=0.6,binwidth=2,colour="black", fill="#235440") +
  geom_vline(aes(xintercept=25, colour = "Theoretical variance", linetype="Theoretical variance"),
             size=1)
## Warning: Removed 2 rows containing missing values (geom_bar).

  1. Calculation of the variance of all the thousand means: what is the variance of the distribution of the averages ? :
var_exp <- var(my_means)
The theoretical variance of one exponential distribution is (1/lambda)^2=25. And the theoretical variance of the averages is (1/lambda)^2/numb=0.625.
  1. Comparison : variance of distribution of averages vs theoretical variance
print(paste("The theoretical variance for averages is :",(1/lambda)^2/numb,", against 
            the simulation variance : ",var_exp,"."))
## [1] "The theoretical variance for averages is : 0.625 , against \n            the simulation variance :  0.637254390249431 ."
The values are very close, this confirm the CLT.

Density of the distribution of means

  1. By drawing the density function of means, it’s possible to evaluate the normality of the distribution :
mpt3 <- my_plot_fun(my_means) +
  scale_x_continuous(breaks=seq(round(min(my_means))-1,round(max(my_means)) +1, by = .5),
                     limits =c(1,9) ) +
  geom_histogram(aes(y=..density..),alpha=0.6,binwidth=.15,colour="black")+
   stat_function(fun = dnorm,n = 100, args = list(mean = 5, sd = (1/lambda)/sqrt(numb)),
                 geom = 'area', alpha = 0.5,
                 aes(colour = "Normal distribution",fill="Normal distribution"))+
  geom_density(alpha=.5,aes(colour = "Simulated data",fill="Simulated data")) 
## Warning: Removed 2 rows containing missing values (geom_bar).




Second conclusion
The density of the averages seems to follow a normal distribution : the maximum is slightly deviated to the left. However, the rest seems to fit very well.