Overview

In this project, we investigate the exponential distribution in R and compare the distribution of its sample means with what the Central Limit Theorem predicts.

The exponential distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate (lambda).
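As a quick sanity check of this definition, the sketch below draws a large exponential sample and compares its mean and standard deviation with 1/lambda, which is the theoretical value of both. The seed, the sample size of 100,000 and the variable names are assumptions made only for this illustration.

```r
# Illustrative check: mean and sd of an exponential sample are both close to 1/lambda
set.seed(2024)            # arbitrary seed, assumed only for reproducibility
lambda <- 0.2
x <- rexp(100000, rate = lambda)

# Compare empirical summaries with the theoretical value 1/lambda
check <- c(empirical_mean = mean(x),
           empirical_sd   = sd(x),
           theoretical    = 1 / lambda)
print(check)
```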

For our simulation, the rate (lambda) is given as 0.2. We draw a sample of n = 40 exponential values, take its mean, and repeat this 1,000 times. We then compare the distribution of these sample means with the normal distribution, as a check of the Central Limit Theorem.

Simulations

The rexp(n, rate) function generates n random values from an exponential distribution with the given rate. In the R code below, we generate a sample of 40 random numbers, take its mean, and repeat this 1,000 times.

#Generate a sample of 40 random variables from exponential distribution
#Find the mean of each sample
#Repeat the simulation 1000 times
#Take the mean of the samples (distribution mean)

#Create the data frame: one row per simulation round
#(data.frame(nrow=1000, ncol=2) would create a single row with columns named nrow and ncol)
sim_data <- data.frame(Index = 1:1000, Mean = numeric(1000))

for (i in 1:1000)
{
  sim_data$Mean[i] <- mean(rexp(40, 0.2))  #Mean of each round (sample) of simulation
}
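The same simulation can also be written without an explicit loop. The sketch below is one possible vectorised variant, not the code used for the results in this report: it draws all 40,000 values at once and takes row means. The seed and the names draws and sim_data_vec are assumptions introduced for illustration.

```r
# Sketch: vectorised version of the simulation (names are illustrative)
set.seed(2024)            # arbitrary seed, assumed only for reproducibility
lambda <- 0.2
n      <- 40              # values per sample
n_sims <- 1000            # number of simulation rounds

# Draw everything at once: one row per simulation round, n columns per row
draws <- matrix(rexp(n_sims * n, rate = lambda), nrow = n_sims, ncol = n)

# Row means are the 1,000 sample means
sim_data_vec <- data.frame(Index = seq_len(n_sims), Mean = rowMeans(draws))
head(sim_data_vec)
```

Because the matrix is filled column by column, the individual sample means differ from those produced by the loop even under the same seed, but they follow the same distribution.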

Sample Mean versus Theoretical Mean

Taking the mean of the 1,000 sample means gives us the simulated (distribution) mean. It is calculated below from the observed data and compared with the theoretical mean, which is 1/lambda.

#Calculate the mean of the simulated data
sim_mean <- mean(sim_data$Mean)
print(sim_mean)
## [1] 5.0122

The theoretical mean follows directly from lambda.

#Theoretical mean is 1/lambda
theo_mean <- 1/0.2
print(theo_mean)
## [1] 5

We observe that the simulated mean (5.0122) is very close to the theoretical mean (5).

If we plot the simulated sample means together with the simulated and theoretical means, we get the following histogram.

#Plot a histogram of the mean distribution
library(ggplot2)
p <- ggplot(data = sim_data, aes(x = Mean)) + geom_histogram(binwidth = 0.1, color = 'blue')
p <- p + labs(title = "Exponential Simulation", x = "Sample Mean", y = "Frequency")
p <- p + theme(plot.title = element_text(hjust = 0.5))
#Vertical lines mark the simulated and theoretical means
p <- p + geom_vline(aes(xintercept = sim_mean, colour = "Simulated Mean"))
p <- p + geom_vline(aes(xintercept = theo_mean, colour = "Theoretical Mean"))
p <- p + scale_colour_manual("", breaks = c("Simulated Mean", "Theoretical Mean"), values = c("red", "yellow"))
print(p)

Sample Variance versus Theoretical Variance

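We can use the var function in R to find the variance of the observed sample means. For the exponential distribution, the theoretical variance is 1/lambda^2. Because each observed value is the mean of 40 draws, the theoretical variance of the sample mean is (1/lambda^2)/40.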

#Variance of the observed (simulated) values
sim_var <- var(sim_data$Mean)
print(sim_var)
## [1] 0.632258
#Theoretical variance of the sample mean: (1/lambda^2)/n
theo_var<- (1/(0.2^2))/40
print(theo_var)
## [1] 0.625

By comparison, the difference between the observed variance (0.632) and the theoretical variance (0.625) is very small.
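The division by 40 comes from a standard result: the mean of n independent draws from a population with variance sigma^2 has variance sigma^2/n. The sketch below spells out that arithmetic for lambda = 0.2 and n = 40. The variable names are assumptions introduced for illustration.

```r
# Sketch: where the theoretical variance of the sample mean comes from
lambda <- 0.2
n      <- 40

pop_var     <- 1 / lambda^2        # variance of a single exponential draw
var_of_mean <- pop_var / n         # variance of the mean of n independent draws
se_of_mean  <- sqrt(var_of_mean)   # standard error of the sample mean

print(c(pop_var = pop_var, var_of_mean = var_of_mean, se_of_mean = se_of_mean))
```

The value of var_of_mean is the quantity the report stores in theo_var.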

Show that the distribution is approximately normal

The Central Limit Theorem states that, for a population with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution.

To check this, we can plot a density histogram of the observed sample means.

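#Plot density histogram with a normal curve on top
library(ggplot2)
#after_stat(density) replaces the deprecated ..density.. notation (ggplot2 3.3.0 or later)
p <- ggplot(data = sim_data, aes(x = Mean)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.2, fill = 'red', color = 'black')
#Normal density with the mean and standard deviation of the simulated sample means
p <- p + stat_function(fun = dnorm, color = "blue", args = list(mean = mean(sim_data$Mean), sd = sd(sim_data$Mean)))
p <- p + labs(title = "Exponential Simulation Density", x = "Sample Mean", y = "Density")
p <- p + theme(plot.title = element_text(hjust = 0.5))
print(p)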

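In the R code above, we superimposed a normal density curve, using the mean and standard deviation of the simulated sample means, on the histogram of the observed values. The histogram follows the curve closely, which indicates that the distribution of the sample means is very close to a normal distribution.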
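A numeric check can complement the visual comparison. The sketch below standardises the sample means with the theoretical mean and standard error, then reports the share falling within 1 and 1.96 standard errors; for a normal distribution these shares are about 68% and 95%. It reruns the simulation so that it is self-contained, and the seed and variable names are assumptions made for illustration, so the numbers will not match the run above exactly.

```r
# Sketch: compare the spread of standardised sample means with normal benchmarks
set.seed(2024)            # arbitrary seed, assumed only for reproducibility
lambda <- 0.2
n      <- 40
n_sims <- 1000

means <- replicate(n_sims, mean(rexp(n, rate = lambda)))

mu <- 1 / lambda                  # theoretical mean of the sample mean
se <- (1 / lambda) / sqrt(n)      # theoretical standard error of the sample mean
z  <- (means - mu) / se           # standardised sample means

within_1   <- mean(abs(z) <= 1)     # normal benchmark: about 0.68
within_196 <- mean(abs(z) <= 1.96)  # normal benchmark: about 0.95
print(c(within_1 = within_1, within_196 = within_196))
```

A normal Q-Q plot, for example qqnorm(means) followed by qqline(means), is another common way to inspect the same question.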