Statistical Inference Course Project Part 1

Overview

The purpose of this project is to investigate the exponential distribution in R and to compare the distribution of its sample means with what the Central Limit Theorem predicts. The R function rexp() makes the simulation straightforward.
The exponential distribution is commonly used to describe the waiting time between successive events that occur independently at a constant average rate. Its probability density function is \[f(x;\lambda) = \lambda e^{-\lambda x}, \quad x \geq 0\] where \(\lambda\) is the rate parameter, i.e. the expected number of events per unit time. The expected value of \(x\) depends on \(\lambda\) and is \(E[x] = \frac{1}{\lambda}\). The standard deviation is also \(\frac{1}{\lambda}\).
In this project, \(\lambda = 0.2\) is used for all simulations. The distribution of averages of 40 exponentials is investigated, and the simulation is repeated 1000 times.
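
As a quick numerical check of the exponential density \(\lambda e^{-\lambda x}\) and of the mean and standard deviation \(\frac{1}{\lambda}\), the following sketch uses base R's dexp() and integrate(). It is not part of the original analysis, and the names x, closed.form, mean.int, var.int and sd.int are illustrative assumptions.

```r
# Illustrative check only (not part of the original analysis)
lambda <- 0.2

# dexp() should agree with the closed form lambda * exp(-lambda * x)
x <- c(0, 1, 5, 10)
closed.form <- lambda * exp(-lambda * x)
all.equal(dexp(x, rate = lambda), closed.form)

# Mean: integral of x * f(x) over [0, Inf)
mean.int <- integrate(function(x) x * dexp(x, rate = lambda), 0, Inf)$value

# Variance: integral of (x - mean)^2 * f(x); its square root is the sd
var.int <- integrate(function(x) (x - mean.int)^2 * dexp(x, rate = lambda),
                     0, Inf)$value
sd.int <- sqrt(var.int)

# Both numerical values should be close to 1 / lambda
c(mean = mean.int, sd = sd.int, theoretical = 1 / lambda)
```

Numerical integration is used here only to avoid relying on random draws; the simulation itself uses rexp().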

Simulation

The parameters described above are fixed at the start of the simulation.

# Don't forget to set the working directory
# 0. Load all the necessary packages
# For background on the exponential distribution, see
# https://en.wikipedia.org/wiki/Exponential_distribution
library(dplyr)
library(lubridate)
library(ggplot2)

# 1. Simulation parameters set up 
lambda <- 0.2
n <- 40
iteration <- 1000
## Set seed for reproducible analysis
set.seed(2)

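# 2. Simulate
## Draw `iteration` samples of size n; replicate() returns an n x iteration matrix
expo <- replicate(iteration, rexp(n, lambda))
## Compute the mean of each sample (one sample per column)
mean.expo <- apply(expo, 2, mean)
## Store the means in a data frame for plotting with ggplot2
mean.expo <- data.frame(mean.expo)
names(mean.expo) <- c("Simulation.Values")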
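
The layout of the simulated matrix matters for the averaging step: each column holds one sample of 40 exponentials, so the means are taken column-wise. The sketch below repeats the simulation with the same parameters and shows that base R's colMeans() gives the same values as apply(expo, 2, mean). The names means.apply and means.col are illustrative assumptions.

```r
# Illustrative alternative to apply(); same parameters as in the text
lambda <- 0.2
n <- 40
iteration <- 1000
set.seed(2)

# replicate() returns an n x iteration matrix: one sample per column
expo <- replicate(iteration, rexp(n, lambda))
dim(expo)

# Column means computed two ways
means.apply <- apply(expo, 2, mean)
means.col <- colMeans(expo)

# The two vectors agree up to floating-point tolerance
all.equal(means.apply, means.col)
```

colMeans() is a common choice for this step because it is vectorised, but either form works for a matrix of this size.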

Sample Mean vs. Theoretical Mean

The theoretical mean of the exponential distribution is \(E[x] = \frac{1}{\lambda}\), and the average of 40 exponentials has the same expected value. The sample mean of the simulated averages is computed with the mean() function. The results are shown below.

sample.mean <- mean(mean.expo$Simulation.Values)
theore.mean <- 1 / lambda

sample.mean

## [1] 5.016356

theore.mean

## [1] 5

This can also be illustrated with a histogram of the simulated means, where the red vertical line marks the theoretical mean.

p <- ggplot(data = mean.expo, aes(x = Simulation.Values))
p + geom_histogram(aes(y = ..density..), 
                   binwidth = .1, color = "black", fill = "white") + 
        geom_vline(aes(xintercept = 5), color = "red", size = 2) +
        labs(x = "Simulation Values", y = "Density") + 
        ggtitle("The Distribution of Mean of an Exponential Distribution")

Unsurprisingly, the two values are very close: 5.016 for the simulation versus 5 in theory.
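
To put a scale on how close the two means are, one option is to express the difference in standard errors. The sketch below uses the values reported above and assumes the 1000 x 40 draws are independent; the names se.grand and z are illustrative.

```r
# Values reported in the text
sample.mean <- 5.016356
lambda <- 0.2
n <- 40
iteration <- 1000
theore.mean <- 1 / lambda

# The overall mean averages n * iteration independent exponentials,
# so its standard error is sigma / sqrt(n * iteration)
se.grand <- (1 / lambda) / sqrt(n * iteration)

# Observed deviation expressed in standard errors
z <- (sample.mean - theore.mean) / se.grand

c(se = se.grand, deviation = sample.mean - theore.mean, z = z)
```

Under these assumptions the standard error is 0.025, so the observed difference of about 0.016 is roughly 0.65 standard errors, well within what random variation would produce.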

Sample Variance vs. Theoretical Variance

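The theoretical standard deviation of the mean of \(n\) observations (the standard error) is given by \[ S = \frac{\sigma}{\sqrt n} \] where in this case \(\sigma = \frac{1}{\lambda}\). The theoretical variance is its square, \(\frac{\sigma^2}{n}\). The comparison here is made on the standard deviation scale, with the sample value computed by the function sd().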

sample.sd <- sd(mean.expo$Simulation.Values)
theore.sd <- (1 / lambda)/sqrt(n)

sample.sd

## [1] 0.818004

theore.sd

## [1] 0.7905694

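Again, the simulation result (0.818) is close to the theoretical result (0.791), a difference of about 3.5%. Squaring these gives variances of about 0.669 and 0.625 respectively.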
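
The same comparison can be made directly on the variance scale with var(). The sketch below repeats the simulation with the parameters and seed from the text; the names sample.var and theore.var are illustrative assumptions, and colMeans() is used in place of apply() for brevity.

```r
# Same parameters and seed as in the text
lambda <- 0.2
n <- 40
iteration <- 1000
set.seed(2)

# Means of `iteration` samples of size n (one sample per column)
mean.expo <- colMeans(replicate(iteration, rexp(n, lambda)))

# Sample variance of the means and its theoretical value sigma^2 / n
sample.var <- var(mean.expo)
theore.var <- (1 / lambda)^2 / n

c(sample = sample.var, theoretical = theore.var)

# var() is the square of sd(), so both scales tell the same story
all.equal(sample.var, sd(mean.expo)^2)
```

With \(\lambda = 0.2\) and \(n = 40\), the theoretical variance is \(25 / 40 = 0.625\).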

Sample Distribution vs. Theoretical Distribution

The similarity of the two distributions can be examined by plotting them together.

p + geom_histogram(aes(y = ..density..), 
                   binwidth = .1, color = "black", fill = "white") + 
        geom_density(alpha = .3, fill = "#458B00", color = "green") + 
        stat_function(fun = dnorm, args = list(mean = theore.mean, 
                                               sd = theore.sd), color = "red") +
        labs(x = "Simulation Values", y = "Density") + 
        ggtitle("The Distribution of Mean of an Exponential Distribution")

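As shown in the graph, the shaded area is the estimated density of the simulated sample means, while the red curve is the theoretical distribution that the Central Limit Theorem predicts, a normal distribution with mean \(\frac{1}{\lambda}\) and standard deviation \(\frac{1}{\lambda \sqrt n}\). The two are clearly similar. Increasing the number of iterations (currently iteration = 1000) would make the simulated curve smoother, but it would converge to the true distribution of the mean of 40 exponentials, which is slightly right-skewed. The agreement with the normal curve improves as the sample size \(n\) increases.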
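
One rough way to see how the closeness to the normal curve depends on the sample size is to look at the skewness of the simulated means, which is 0 for a normal distribution. The sketch below is an illustration rather than part of the original analysis: the sample sizes in sizes, the larger iteration count, and the helper skewness() are all assumptions introduced here. It relies on the fact that the mean of \(n\) independent exponentials follows a gamma distribution with skewness \(\frac{2}{\sqrt n}\).

```r
# Illustrative sketch: skewness of the sample mean for several sample sizes
lambda <- 0.2
iteration <- 10000   # assumption: more iterations than in the text, to reduce noise
set.seed(2)

# Sample skewness: third central moment divided by the sd cubed
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical sample sizes chosen for illustration
sizes <- c(5, 40, 200)

sample.skew <- sapply(sizes, function(n) {
  # n x iteration matrix of exponentials; column means are the sample means
  means <- colMeans(matrix(rexp(n * iteration, lambda), nrow = n))
  skewness(means)
})

# Theoretical skewness of the mean of n exponentials
theore.skew <- 2 / sqrt(sizes)

data.frame(n = sizes, sample.skew, theore.skew)
```

The skewness shrinks as \(n\) grows, whereas raising the iteration count only makes the estimate of each skewness more precise.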