Statistical Inference Course Project: Exponential Distribution in R vs. the Central Limit Theorem

Overview

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and its standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

We illustrate, via simulation and associated explanatory text, the properties of the distribution of the mean of 40 exponentials. We should:

1. Show the sample mean and compare it to the theoretical mean of the distribution.
2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
3. Show that the distribution is approximately normal.

Simulation

We run 1000 simulations to create a data set for comparison with theory. Each simulation contains 40 observations from the exponential distribution with λ = 0.2, which is equivalent to calling rexp(40, 0.2) once per simulation.

The mean of the exponential distribution is 1/λ, and its standard deviation is also 1/λ. We generate all 1000 samples of size 40 with rexp(n, rate), arrange them as the rows of a matrix, and calculate the average of each sample.
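As a quick sanity check on the stated mean and standard deviation, the following sketch draws a large exponential sample and compares its mean and standard deviation with 1/λ. The seed, the sample size of 100000, and the names draws and check are assumptions made for illustration; they are not part of the original analysis.

```r
# Sketch: check that rexp() draws have mean and sd close to 1/lambda.
set.seed(1)                          # arbitrary seed (assumption), for reproducibility
lambda <- 0.2                        # same rate as in the project
draws <- rexp(100000, rate = lambda) # one large sample of raw exponentials

check <- c(theoretical    = 1 / lambda,
           simulated_mean = mean(draws),
           simulated_sd   = sd(draws))
print(round(check, 3))
```

Both simulated values should land near 5, which is why the same quantity 1/λ is used for the mean and the standard deviation in the sections that follow.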

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.3

library(knitr)

## Warning: package 'knitr' was built under R version 3.5.3

no_simulation <- 1000   # number of simulations 
lambda <-  0.2 
n <- 40             # sample size


simulated_data <- matrix(rexp(n= no_simulation*n,rate=lambda), no_simulation, n)
sample_mean <- rowMeans(simulated_data)

Sample Mean VS Theoretical Mean

The theoretical mean of the sample averages is 1/λ. The following shows that the average of the sample means and the theoretical mean are very close.

actual_mean <- mean(sample_mean) 
theoretical_mean <- 1/ lambda

result1 <-data.frame("Mean"=c(actual_mean,theoretical_mean), 
                     row.names = c("Mean from the samples ","Theoretical mean"))

result1

##                            Mean
## Mean from the samples  5.033723
## Theoretical mean       5.000000
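One way to judge how close the two means are is the Monte Carlo standard error of the average of the 1000 sample means. The sketch below re-simulates the data under an arbitrary seed (an assumption; the original run is unseeded, so its numbers will differ) and uses the hypothetical names sim_means, mc_se, and z_score.

```r
# Sketch: how far is the simulated mean from 1/lambda, in standard-error units?
set.seed(2024)                       # arbitrary seed (assumption)
lambda <- 0.2
n <- 40                              # sample size
no_simulation <- 1000                # number of simulations

# One row per simulated sample of size n
sim_means <- rowMeans(matrix(rexp(no_simulation * n, rate = lambda),
                             no_simulation, n))

# Standard error of the average of the sample means
mc_se <- sd(sim_means) / sqrt(no_simulation)

# Distance between simulated and theoretical mean, in standard errors
z_score <- (mean(sim_means) - 1 / lambda) / mc_se
print(c(mean = mean(sim_means), mc_se = mc_se, z_score = z_score))
```

With the theoretical standard deviation of a sample mean, (1/λ)/sqrt(n) ≈ 0.79, the standard error is about 0.79/sqrt(1000) ≈ 0.025, so the gap of roughly 0.034 in the table above is a little over one standard error.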

The simulation mean of 5.033723 is close to the theoretical value of 5. The histogram below shows the 1000 sample means, with the theoretical mean marked in red and the simulated mean in green.

sampleMean_data <- as.data.frame (sample_mean)

ggplot(sampleMean_data, aes(sample_mean)) +
        geom_histogram(alpha = .5, position = "identity", col = "black") +
        # red line: theoretical mean; green line: mean of the simulated sample means
        geom_vline(xintercept = theoretical_mean, colour = "red") +
        geom_vline(xintercept = actual_mean, colour = "green") +
        ggtitle("Histogram of the sample means") +
        xlab("Sample mean") +
        ylab("Count")   # bars show counts, not densities

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Sample Variance VS Theoretical Variance

The theoretical variance of the sample averages is (1/λ)^2/n. The following shows that the variance of the sample means and the theoretical variance are very close in value.

actual_variance <- var(sample_mean)

# theoretical value: (1/0.2)^2 / 40 = 0.625
theoretical_variance <- (1/ lambda)^2 /n

result2 <-data.frame("Variance"=c(actual_variance, theoretical_variance), 
                     row.names = c("Variance from the sample ","Theoretical variance"))

result2
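To illustrate why the theoretical variance carries the factor 1/n, the following sketch repeats the simulation for several sample sizes and compares the variance of the sample means with (1/λ)^2/n. The seed, the extra sample sizes 10 and 160, and the names sizes, var_of_means, and scaling are assumptions added for illustration only.

```r
# Sketch: the variance of the sample mean shrinks like (1/lambda)^2 / n.
set.seed(42)                         # arbitrary seed (assumption)
lambda <- 0.2
sizes <- c(10, 40, 160)              # hypothetical sample sizes; 40 is the project's n

# For each sample size, simulate 1000 samples and take the variance of their means
var_of_means <- sapply(sizes, function(k) {
  means <- rowMeans(matrix(rexp(1000 * k, rate = lambda), nrow = 1000, ncol = k))
  var(means)
})

scaling <- data.frame(n           = sizes,
                      simulated   = var_of_means,
                      theoretical = (1 / lambda)^2 / sizes)
print(scaling)
```

Each fourfold increase in n should cut the variance of the sample mean to roughly a quarter, in line with the theoretical column.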

Distribution

According to the central limit theorem (CLT), the averages of samples are approximately normally distributed when the sample size is large enough.
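A simple way to see the CLT at work is to compare the skewness of raw exponential draws with the skewness of the means of 40 draws. The exponential distribution has skewness 2, and the mean of n independent draws has skewness 2/sqrt(n), so averaging pulls the distribution toward symmetry. The helper sample_skewness and the other names in this sketch are hypothetical, and the seed is an arbitrary assumption.

```r
# Sketch: averaging reduces skewness from about 2 to about 2 / sqrt(n).
set.seed(7)                          # arbitrary seed (assumption)
lambda <- 0.2
n <- 40

# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_draws <- rexp(1000, rate = lambda)                                 # 1000 raw exponentials
means_40  <- rowMeans(matrix(rexp(1000 * n, rate = lambda), 1000, n))  # 1000 means of 40

skew <- c(raw_draws    = sample_skewness(raw_draws),
          means_of_40  = sample_skewness(means_40),
          theory_means = 2 / sqrt(n))
print(round(skew, 3))
```

The means are still slightly right-skewed at n = 40, which is why the text describes the distribution as approximately, not exactly, normal.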

The following plot shows that the distribution of the sample means closely matches the normal distribution: the red curve is the kernel density estimate of the sample means, and the green curve is the normal density with mean 1/λ and variance (1/λ)^2/n. We also create a normal Q-Q plot of the sample means below to check that their distribution is close to the theoretical normal distribution.

ggplot(sampleMean_data, aes(sample_mean))+
        geom_histogram(aes(y=..density..), alpha=.5, position="identity", fill="white", col="black")+
        geom_density(colour="red", size=1)+
        stat_function(fun = dnorm, colour = "green", args = list(mean = theoretical_mean, sd = sqrt(theoretical_variance)))+
        ggtitle("Histogram of sample means with the fitted normal curve")+
        xlab("Sample mean")+
        ylab("Density")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(sample_mean, main ="Normal probability plot")
qqline(sample_mean, col = 3)
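As a numeric complement to the plots, the sketch below standardizes simulated sample means with the theoretical mean and standard error and checks what fraction fall within one and two standard errors of 1/λ. For a normal distribution these fractions are about 0.683 and 0.954. The seed and the names means_40, se, z, and coverage are assumptions for illustration; the data are re-simulated, so the values will not match the unseeded run above exactly.

```r
# Sketch: compare empirical coverage of the sample means with normal coverage.
set.seed(99)                         # arbitrary seed (assumption)
lambda <- 0.2
n <- 40

means_40 <- rowMeans(matrix(rexp(1000 * n, rate = lambda), 1000, n))

se <- (1 / lambda) / sqrt(n)         # theoretical sd of a sample mean
z  <- (means_40 - 1 / lambda) / se   # standardized sample means

coverage <- c(within_1_se = mean(abs(z) < 1),
              within_2_se = mean(abs(z) < 2))
print(rbind(simulated = coverage,
            normal    = c(pnorm(1) - pnorm(-1), pnorm(2) - pnorm(-2))))
```

Fractions close to the normal row support the visual impression from the histogram and the Q-Q plot.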