Statistical Inference Course Project : Exponential Distribution in R vs Central Limit Theorem

Overview

In this project we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. We will investigate the distribution of averages of 40 exponentials. Note that we are needed to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. We should

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

Simulations

We will run a series of 1000 simulations to create a data set for comparison to theory. Each simulation will contain 40 observations and the expoential distribution function will be set to “rexp(40, 0.2)”.

We simulate 1000 samples for each size 40 with exponential distribution lambda=0.2 by using rexp(n, lambda). The mean of exponential distribution is 1/lambda. The standard deviation is also 1/lambda. We generate the samples and calculate the average of each sample.

library(ggplot2)
library(knitr)

no_simulation <- 1000   # number of simulations 
lambda <-  0.2 
n <- 40             # sample size


simulated_data <- matrix(rexp(n= no_simulation*n,rate=lambda), no_simulation, n)
sample_mean <- rowMeans(simulated_data)

Sample Mean vs Theoretical Mean

The theoretical mean of the average of samples will be : 1/lambda .The following shows that the average from sample means and the theoretical mean are very close.

Sample Mean : The sample mean or empirical mean and the sample covariance are statistics computed from a collection (the sample) of data on one or more random variables.

actual_mean <- mean(sample_mean) 
theoretical_mean <- 1/ lambda

result_1 <-data.frame("Mean"=c(actual_mean,theoretical_mean), 
                     row.names = c("Mean from the samples ","Theoretical mean"))

result_1

The simulation mean of 4.983227 is close to the theoretical value of 5. Histogram plot of the exponential distribution n = 1000

sampleMean_data <- as.data.frame (sample_mean)

 ggplot(sampleMean_data, aes(sample_mean)) + 
   geom_histogram(alpha=.5, position="identity", col="black", fill = "white") + 
   geom_vline(xintercept = theoretical_mean, colour="red",show.legend=TRUE) + 
   geom_vline(xintercept = actual_mean, colour="blue", show.legend=TRUE) + 
   ggtitle ("Histogram of the sample means ") + 
   xlab("Sample mean")+ylab("Density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sample Variance versus Theoretical Variance

The theoretical variance of the average of samples will be (1/lambda)^2/n. The following shows that the variance of sample means and the theoretical variance are very close in value.

actual_variance <- var(sample_mean) 
        
theoretical_variance <- (1/ lambda)^2 /n 
 
result_2 <-data.frame("Variance"=c(actual_variance, theoretical_variance), 
                     row.names = c("Variance from the sample ","Theoretical variance"))

result_2

Distributions

According to the central limit theorem (CLT), the averages of samples follow normal distribution.

This following plot shows that the distribution of the sample means almost matches the normal distribution. Also we create a Normal Probability Plot of Residuals below to confirm the fact that the distribution of sample means matches the theoretical normal distribution.

ggplot(sampleMean_data, aes(sample_mean)) +
        geom_histogram(aes(y=..density..), alpha=.5, position="identity", fill="white", col="black") +
        geom_density(colour="red", size=1) +
        stat_function(fun = dnorm, colour = "blue", args = list(mean = theoretical_mean, sd = sqrt(theoretical_variance))) +
        ggtitle ("Histogram of Sample Means with Fitting Normal Curve ") +
        xlab("Sample mean") +
        ylab("Density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

 qqnorm(sample_mean, main ="Normal Probability Plot")
 qqline(sample_mean,col = "blue")

Both histogram and the normal probability plot show that distribution of averages is approximately normal.