Project Overview

In this project we will investigate the Central Limit Theorem (CLT) for exponential distribution. According to the Central Limit Theorem under certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed, regardless of the distribution of the random variables. We will test the theorem with exponential distribution by simulating 1000 samples of size 40 and will compare the mean and variance of the distribution to the theoretical mean and variance of the distribution.

Loading Necessary Libraries

library(ggplot2)
library(knitr)
library(printr)

Simulation

In this section we perform a simulation of 1000 samples of size 40 from an exponential distribution with parameter \(\lambda = 0.2\). The exponential distribution can be simulated in R with rexp(n, lambda) where lambda \(\lambda\) is the rate parameter. The mean of exponential distribution is \(1/\lambda\) and the standard deviation is also \(1/\lambda\).

In the following code we generate the samples and calculate the average of each sample.

no_simulation <- 1000   # number of simulations 
lambda <-  0.2 
n <- 40             # sample size

set.seed(123)
simulated_data <- matrix(rexp(n= no_simulation*n,rate=lambda), no_simulation, n)
sample_mean <- rowMeans(simulated_data)

Sample Mean versus Theoretical Mean

Assuming sample size of \(n\), we can prove that the theoretical mean of the average of samples will be : \(\mu_\bar{x}=\dfrac{1}{\lambda}\). Table below shows that the average from sample means and the theoretical mean are very close in value.

actual_mean <- mean(sample_mean) 
        
theoretical_mean <- 1/ lambda

result1 <-data.frame("Mean"=c(actual_mean,theoretical_mean), 
                     row.names = c("Mean from the samples ","Theoretical mean"))

kable(x = round(result1,3),align = 'c')
Mean
Mean from the samples 5.012
Theoretical mean 5.000

Histogram below presents the distribution of sample means. The vertical line drawn also to show that the relative potion of the mean of distribution and theoretical mean. You can observe that the two lines almost overlap showing that theoretical mean matches mean of the distribution.

sampleMean_data <- as.data.frame (sample_mean)

ggplot(sampleMean_data, aes(sample_mean))+
        geom_histogram(alpha=.5, position="identity", fill="blue", col="blue")+
        geom_vline(xintercept = theoretical_mean, colour="darkorange4", linetype = "longdash",show_guide=TRUE)+
        geom_vline(xintercept = actual_mean, colour="green", linetype = "longdash", show_guide=TRUE)+
        ggtitle ("Histogram of the sample means ")+
        xlab("Sample mean")+
        ylab("Frequency")

Sample Variance versus Theoretical Variance

Assuming sample size of \(n\), the theoretical variance of the average of samples can be calculated as : \(\sigma^{2}_\bar{x}=\dfrac{(1/\lambda)^2}{n}\) . Table below shows that the variance of sample means and the theoretical variance are very close in value.

actual_variance <- var(sample_mean) 
        
theoretical_variance <- (1/ lambda)^2 /n 
 
result2 <-data.frame("Variance"=c(actual_variance, theoretical_variance), 
                     row.names = c("Variance from the sample ","Theoretical variance"))

kable(x = round(result2,3),align = 'c')
Variance
Variance from the sample 0.609
Theoretical variance 0.625

Distribution

According to the central limit theorem (CLT), the averages of samples follow normal distribution. In this section we verify this theorem by comparing the distribution of the samples means and normal distribution.

The Figure below shows the distribution computed using the histogram and the fitting normal curve. The normal density plot shown with green color and is created using the theoretical mean and standard deviation. This plot clearly shows that the distribution of the sample means almost matches the normal distribution.

ggplot(sampleMean_data, aes(sample_mean))+
        geom_histogram(aes(y=..density..), alpha=.5, position="identity", fill="blue", col="blue")+
        geom_density(colour="burlywood4", size=1)+
        stat_function(fun = dnorm, colour = "green", args = list(mean = theoretical_mean, sd = sqrt(theoretical_variance)))+
        ggtitle ("Histogram of sample means with the fitting normal curve ")+
        xlab("Sample mean")+
        ylab("Frequency")

Also we create a Normal Probability Plot of Residuals below to confirm the fact that the distribution of sample means matches the theoretical normal distribution.

qqplot.data <- function (vec) # argument: vector of numbers
{
  # following four lines from base R's qqline()
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y)/diff(x)
  int <- y[1L] - slope * x[1L]

  d <- data.frame(resids = vec)

  ggplot(d, aes(sample = resids)) + stat_qq(col="blue") + geom_abline(slope = slope, intercept = int, col="burlywood4")

}

qqplot.data (sample_mean) +ggtitle ("Normal probability plot ")

Both histogram and the normal probability plot show that distribution of averages is approximately normal.

Confidence Interval

Finally, we evaluate the coverage of the confidence interval for averages using the formula : \(\mu_{\bar{x}} \pm z_{0.95} * \sigma_{\bar{x}} = \mu_{\bar{x}} \pm z_{0.95} \frac{1/\lambda}{\sqrt{n}}\) for different values of \(\lambda\)

no_simulation <- 1000
n <- 40 

lambdavals <- seq(4, 6, by=0.01)

coverage <- sapply(lambdavals, function(lambda){
        
        mu_hats  <- rowMeans(matrix(rexp(n= no_simulation*n,rate=lambda), no_simulation, n))
        
        ll <- mu_hats - qnorm(0.975) * sqrt((mu_hats)^2/n)
        ul <- mu_hats + qnorm(0.975) * sqrt((mu_hats)^2/n)
        mean(ll < 1/lambda & ul > 1/lambda )
})

qplot(lambdavals, coverage) + geom_hline(yintercept=0.95)

The graph shows a perfect coverage of 95% confidence interval for different values of \(lambda\)