Statistical Inference Project - Investigating the exponential distribution in R and comparing it with the Central Limit Theorem

Overview

In this project, the exponential distribution is investigated in R and compared with the Central Limit Theorem. The mean of the exponential distribution is 1/lambda and its standard deviation is also 1/lambda. Lambda is set to 0.2 for all of the simulations. The distribution of averages of 40 exponentials is investigated using a thousand simulations.

Using simulation and accompanying explanatory text, we illustrate the following properties of the distribution of the mean of 40 exponentials:

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Simulations

Defining the values for the exponential data simulation

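lambda <- 0.2   # rate parameter of the exponential distribution
nexp <- 40      # number of exponentials averaged in each simulation
nosim <- 1000   # number of simulations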

1. Show the sample mean and compare it to the theoretical mean of the distribution.

Simulate the mean of 40 exponentials, repeated over 1000 simulations.

set.seed(111)

# Preallocate one row per simulation
simulateMean <- data.frame(SimIndex = integer(nosim), Mean = numeric(nosim))

for (i in 1:nosim) {
  simulateMean[i, 1] <- i                          # simulation index
  simulateMean[i, 2] <- mean(rexp(nexp, lambda))   # mean of nexp exponentials
}

samplemean <- mean(simulateMean$Mean)
theormean <- 1/lambda

The sample (simulated) mean of the distribution is 5.0256195, which is close to the theoretical mean of 5.

The plot below shows the spread of the means over 1000 simulations. The red line is the theoretical mean and the black line is the sample mean.
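One way to make "close" more concrete is to compare the gap between the simulated and theoretical means with the Monte Carlo standard error of the simulated mean. The following is a minimal sketch, assuming the same seed and parameters as above. The names `means`, `se`, `ci` and `covers` are introduced here for illustration and are not part of the original analysis, and `replicate()` is used in place of the loop.

```r
# Illustrative sketch: the names means, se, ci and covers are assumptions
set.seed(111)
lambda <- 0.2
nexp <- 40
nosim <- 1000

# One mean of nexp exponentials per simulation
means <- replicate(nosim, mean(rexp(nexp, lambda)))

# Monte Carlo standard error of the average of the simulated means
se <- sd(means) / sqrt(nosim)

# Approximate 95% interval for the expected value of the mean of 40 exponentials
ci <- mean(means) + c(-1, 1) * qnorm(0.975) * se

# Does the interval cover the theoretical mean 1/lambda?
covers <- ci[1] <= 1 / lambda && 1 / lambda <= ci[2]
print(ci)
print(covers)
```

If the interval covers 1/lambda, the difference between the two means is within what simulation noise alone would explain.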

library(ggplot2)

# Vertical lines for the two means, labelled so that they appear in the legend
meanlines <- data.frame(Type = c("TheoreticalMean", "SampleMean"),
                        Value = c(theormean, samplemean))

ggplot(simulateMean, aes(x = Mean)) +
  geom_histogram(aes(y = after_stat(density)), alpha = .9, col = 'green', binwidth = .1) +
  geom_vline(data = meanlines, aes(xintercept = Value, colour = Type)) +
  labs(x = "Means spread", y = "Density", colour = NULL,
       title = "Simulation of Exponential distribution means") +
  scale_colour_manual(breaks = c("TheoreticalMean", "SampleMean"),
                      values = c("TheoreticalMean" = "red", "SampleMean" = "black"))

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

simvariance <- var(simulateMean$Mean)
theorvariance <- (1/lambda)^2/nexp

The sample (simulated) variance is 0.6069798, which is close to the theoretical variance of 0.625.
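The theoretical value used above follows from the variance of an average of independent draws. As a short derivation, write $\bar{X}$ for the mean of $n = 40$ independent exponentials $X_1, \dots, X_n$ with rate $\lambda = 0.2$ (the symbols $X_i$, $\bar{X}$ and $n$ are notation introduced here, not used in the original text):

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{1}{n}\cdot\frac{1}{\lambda^{2}}
  = \frac{(1/0.2)^{2}}{40}
  = 0.625
```

The corresponding standard deviation of the mean is the square root of this value, approximately 0.79.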

3. Show that the distribution is approximately normal.

Here we focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

We simulate a large collection (1000) of random exponentials and plot its distribution. The plot shows that the distribution of 1000 random exponentials does not resemble a normal distribution.

set.seed(111)
# Preallocate, then draw 1000 individual exponentials, one per row
simulateexpval <- data.frame(Index = integer(1000), Value = numeric(1000))
for (i in 1:1000) {
  simulateexpval[i, 1] <- i
  simulateexpval[i, 2] <- rexp(n = 1, lambda)
}

ggplot(simulateexpval, aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)), alpha = .9, col = 'green', binwidth = .5) +
  labs(x = "Random Exponential values", y = "Density", title = "Simulation of Exponential distribution")

Now we plot the distribution of a large collection (1000) of averages of 40 exponentials and overlay normal density curves based on the theoretical (red) and sample (black) mean and variance.

From this plot, we see that the distribution of averages of exponentials resembles a normal distribution and is centered at the sample mean, which lies close to the theoretical mean. This agrees with the Central Limit Theorem, which states that the distribution of averages of independent, identically distributed samples with finite variance approaches a normal distribution as the sample size grows.

# dnorm() expects a standard deviation, so take the square root of each variance
ggplot(simulateMean, aes(x = Mean)) +
  geom_histogram(aes(y = after_stat(density)), alpha = .9, col = 'green', binwidth = .1) +
  stat_function(fun = dnorm, color = "black", linewidth = 1.2, args = list(mean = samplemean, sd = sqrt(simvariance))) +
  stat_function(fun = dnorm, color = "red", args = list(mean = theormean, sd = sqrt(theorvariance))) +
  geom_vline(xintercept = theormean, colour = "red", linewidth = .7) +
  geom_vline(xintercept = samplemean, colour = "black", linewidth = .7) +
  labs(x = "Means spread", y = "Density", title = "Simulation of Exponential distribution means")
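A numeric complement to the histograms is sample skewness, which is 0 for a normal distribution and 2 for an exponential distribution. The sketch below is an illustration under the same parameters as above. The helper `skewness()` and the variables `raw` and `avgs` are hypothetical names defined here in base R, not part of the original analysis.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(111)
lambda <- 0.2
nexp <- 40

raw <- rexp(1000, lambda)                           # 1000 individual exponentials
avgs <- replicate(1000, mean(rexp(nexp, lambda)))   # 1000 averages of 40 exponentials

skew_raw <- skewness(raw)
skew_avgs <- skewness(avgs)
print(c(raw = skew_raw, averages = skew_avgs))
```

For averages of 40 exponentials the theoretical skewness is 2/sqrt(40), about 0.32, so the averages should be much closer to symmetric than the raw values, though not perfectly normal.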

In the next simulation, we increase the number of simulations from 1000 to 10000.

This plot shows that the normal curve based on the sample mean and variance moves much closer to the theoretical one.

lambda <- 0.2
nexp <- 40
nosim <- 10000

set.seed(111)

# Preallocate one row per simulation
simulateMean <- data.frame(SimIndex = integer(nosim), Mean = numeric(nosim))

for (i in 1:nosim) {
  simulateMean[i, 1] <- i                          # simulation index
  simulateMean[i, 2] <- mean(rexp(nexp, lambda))   # mean of nexp exponentials
}

samplemean <- mean(simulateMean$Mean)
simvariance <- var(simulateMean$Mean)

# dnorm() expects a standard deviation, so take the square root of each variance
ggplot(simulateMean, aes(x = Mean)) +
  geom_histogram(aes(y = after_stat(density)), alpha = .9, col = 'green', binwidth = .1) +
  stat_function(fun = dnorm, color = "black", linewidth = 1.2, args = list(mean = samplemean, sd = sqrt(simvariance))) +
  stat_function(fun = dnorm, color = "red", args = list(mean = theormean, sd = sqrt(theorvariance))) +
  geom_vline(xintercept = theormean, colour = "red", linewidth = .7) +
  geom_vline(xintercept = samplemean, colour = "black", linewidth = .7) +
  labs(x = "Means spread", y = "Density", title = "Simulation of Exponential distribution means")
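To see how the number of simulations affects the estimates, the following sketch repeats the simulation for several simulation counts and tabulates the resulting mean and variance of the averages. It is an illustration only: the function `simulate_means()` and the data frame `summary_tbl` are hypothetical names introduced here. Increasing the number of simulations tightens the estimates of the mean and variance, while the shape of the distribution of averages is governed by the number of exponentials per average (40).

```r
# Hypothetical helper: means of nexp exponentials, repeated nosim times
simulate_means <- function(nosim, nexp = 40, lambda = 0.2) {
  replicate(nosim, mean(rexp(nexp, lambda)))
}

set.seed(111)
sizes <- c(100, 1000, 10000)

# One summary row per simulation count
summary_tbl <- do.call(rbind, lapply(sizes, function(n) {
  m <- simulate_means(n)
  data.frame(nosim = n, mean = mean(m), variance = var(m))
}))

# Theoretical values for comparison: mean 1/lambda = 5, variance (1/lambda)^2/40 = 0.625
print(summary_tbl)
```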

Appendix: The online RPub document is available at the RPub Repository.