Statistical Inference - Simulation Exercise

Overview

This document reports the comparison between exponential distribution and Central Limit Theorem. It simulates the distribution of averages of 40 exponentials that are performed over 1000 simulations.

Load needed libraries

## Load needed libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Simulation

The exponential distribution will be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations.

Here variables starting with ‘t’ is the theoritical value
while variables starting with ‘s’ is related to the sample population e.g.
tmean -> theoritical mean
smean -> sample mean

# rate parameter to be used in rexp function.
lambda <- 0.2

# no of exponentials.
n <- 40

# for consistent seeding of random numbers
set.seed(1000)

sims = NULL
for (i in 1 : 1000) sims = c(sims, mean(rexp(n, lambda)))
hist(sims)

#sims

Sample Vs Theoritical Mean

tmean <- 1 / lambda
smean <- mean(sims)

tmean : 5
smean : 4.9869634

Sample Vs Theoritical Variance

Given that the theoritical standard deviation (sigma) is given as 1 / lambda (i.e. 5), then the variance should be sigma^2 (i.e. 25).

For the sample size n = 40, the variance should be the standard deviation n times.

tsdev <- 1 / lambda
tvar <- tsdev^2

ssdev <- sd(sims)
svar <- var(sims) * n 

#svar2 <- ssdev^2 * n
#print(ssdev)
#print(svar2)

tvariance : 25
svariance : 26.1737217

Distribution

Below is a plot showing the distribution of the means, and also an overlay of a normal distribution function. This shows that the sample mean follows the population mean of 5. The distribution of the sample mean (red line in chart) is closely the same as the distribution of the Population mean (blue line in chart)

#convert vector of means to a dataframe so that ggplot can work with them
sims <- data.frame(sims)
colnames(sims) <- c("Means")

# plot using ggplot a histogram of the 
a <- ggplot(sims, aes(Means))     
a + geom_histogram(aes(y=..density..), alpha=.5, position="identity", fill="light blue", color = "black") + stat_function(fun = dnorm, args = list(mean = tmean), color = "blue") + stat_function(fun = dnorm, args = list(mean = smean), color = "red") + xlab("Sample Mean") + ylab("Frequency")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.