PART 1 : Simulation in R

Investigate the exponential distribution and compare it with the Central Limit Theorem

Overview

In this project we: - generate 40 exponentials as our sample - resample with replacement 1000 times - calculate both theoritical and sample statistics and compare

We illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. We will:

1- Show the sample mean and compare it to the theoretical mean of the distribution. 2- Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution. 3- Show that the distribution is approximately normal.

library(data.table)
library(ggplot2)
library(stats)
echo = TRUE # display the code
results = 'asis' # display the output without formatting

Set some of the variables as predefined in the project

set.seed(777) 
lambda <- 0.2 # the rate
n <- 40 # exponential sample size 
nsim <- 1000 # no of simulations

Create the data table

sampleData <- matrix(rexp(nsim*n, lambda), nrow=nsim)
dim(sampleData)
## [1] 1000   40

Now we have a matrix with 1000 rows and 40 columns of random data with exponential distribution

Comapring means

Show the sample mean and compare it to the theoretical mean of the distribution

Theoritical mean

th_mean <- 1/lambda
th_mean
## [1] 5

Sample mean

sim_mean <- rowMeans(sampleData) # calculate the average for each row
sam_mean <- mean(sim_mean)
sam_mean
## [1] 4.969789

Calculate the difference between the means

mean_diff <- th_mean - sam_mean
mean_diff
## [1] 0.03021104

The difference between the theoritical mean (expected mean) and the sample mean is 0.03021104.

Comparing variances

Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution

Theoritical variance

th_variance <- (1/lambda)^2 / n
th_variance
## [1] 0.625

Sample variance

var(sim_mean)
## [1] 0.6028996

The theoritical variance (expected variance) and the sample variance are very close.

Comparing standard deviations

Doing this comparison because we are going to use the results for plotting

Theoritical sd

th_sd <- 1/(lambda * sqrt(n))
th_sd
## [1] 0.7905694

Sample sd

sam_sd <- sd(sim_mean)
sam_sd
## [1] 0.7764661

Here as well, the difference between the expected sd and the sample sd is insignificant.

Plot the distribution

Since the theoritical statistics are very close to sample statistics the thickness of the lines for the plots are going to be different to be able to identify them in case of an overlap.

df <- data.frame(sim_mean)
plot_1 <- ggplot(df) + 
        geom_histogram(mapping = aes(x = sim_mean , y = stat(density)), color = "black", fill = "lightgray") +
        geom_vline(xintercept = th_mean, size = 2, color = "red") +
        geom_vline(xintercept = sam_mean, size = 1, color = "blue") 
plot_1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As expected, theoritical mean (red) and sample mean (blue) are overlapping.

Now, let’s plot the standard deviations superimposed on the same plot of the means

plot_2 <- plot_1 +
        stat_function(fun = dnorm, args = list(mean = th_mean, sd = th_sd), color = "red", size = 2) +
        stat_function(fun = dnorm, args = list(mean = sam_mean, sd = sam_sd), color = "blue", size = 1)
plot_2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As you can see, the distribution of means of 40 exponential distributions is close to a normal distribution. Compare this graph with the one below for 1000 random exponentials:

hist(rexp(1000, lambda), breaks = 20)

Comparing Confidence Interval

Theoritical confidence intervals

th_CI <- th_mean + c(-1,1) * 1.96 * sqrt(th_variance) / sqrt(n)
th_CI
## [1] 4.755 5.245

Sample confidence interval

sam_CI <- mean(sim_mean) + c(-1,1) * 1.96 * sd(sim_mean)/sqrt(n)
sam_CI
## [1] 4.729160 5.210418

The intervals are close as well.

Type of distribution

Show that the distribution is approximately normal.

Method 1

using QQ plot

Perform a visual inspection using Q-Q plots The Q-Q plot shows that normality is probably a reasonably good approximation.

library(car)
## Loading required package: carData
qqPlot(sim_mean)

## [1] 291  98

If the data is normally distributed, the points in the QQ-normal plot lie on a straight diagonal line. As we can see in the above graph.

Let’s plot one for the exponential distribution so that we can see the difference.

qqPlot(rexp(1000, lambda))

## [1] 455 629

Method 2

using QQ norm

qqnorm is a generic function the default method of which produces a normal QQ plot of the values in y. qqline adds a line to a normal quantile-quantile plot which passes through the first and third quartiles.

qqnorm(sim_mean) ; qqline(sim_mean, col = "red")

qqnorm(rexp(1000, lambda)); qqline(rexp(1000, lambda), col = "red")

Conclusion

By comparing the graph of the distribution of 1000 random exponentials and the the graph of the distribution of 1000 means of 40 exponential distributions, we can see how Cnetral Limit Theorem manifests as a normal distribution.