Simulating Exponential Distribution - Coursera Statistical Inference Course Project

Overview

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Simulations

The code in the Appendix simulates 1000 sets of 40 samples each from a exponential distribution with lambda rate of 0.2 using the sapply function. It is collected to a tbl_df format for use with dplyr package later on. Also means of each simulation run of 40 samples are calculated.

Sample Statistics Compared to Theoretical values

Here is the summary of values from our simulation

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00004  1.45800  3.52700  5.02000  7.00100 48.25000

When comparing the results from simulation to the theoretical values. We can see that they are close to each other. For example, theoretical variance is 25 and we get about 24.55 from our simulation

##                   Mean          SD   Variance
## Simulated   5.02020855  4.95487795 24.5508155
## Theoretical 5.00000000  5.00000000 25.0000000
## Difference  0.02020855 -0.04512205 -0.4491845

Distributions and Results

Distribution of simulated values from the expotential distribution, indeed looks like an right-skewed exponential distribution. This is shown in the first plot of the Appendix.

When we calculate sample means and plot them, their distribution looks like a normal distribution as it should. This can be seen in the second plot of the Appendix. There the red line shows the mean of means and the green dashed line is the theoretical mean (5).

In our simulation, the differences of the sample statistics and theoretical values are marginal. Here the distribution of samples is a good estimate for the theoretical exponential distribution.

Appendix

# Libraries and options
library(knitr)
library(dplyr)
library(ggplot2)

# Simulate data
set.seed(5678)

n <- 40
lambda <-  0.2
nSimulations <- 1000

dt <- tbl_df(data.frame(group = as.factor(rep(1:nSimulations, each = n)))) %>%
    mutate(x = as.numeric(sapply(1 : nSimulations, function(x) rexp(n, lambda))))

# Calculate means per run
dtMeans <- dt %>%
    group_by(group) %>%
    summarise(mean = mean(x))

# Compare mean, standard deviation and variance of sample means to theoretical values
theoreticalMean <- 1 / lambda
xMean <- mean(dt$x)
meanDiff <- xMean - theoreticalMean

theoreticalSD <- 1 / lambda
xSD <- sd(dt$x)
sdDiff <- xSD - theoreticalSD

theoreticalVar <- 1 / lambda^2
xVar <- var(dt$x)
varDiff <- xVar - theoreticalVar

dfComparison <- data.frame(Mean = c(xMean, theoreticalMean, meanDiff), 
           SD = c(xSD, theoreticalSD, sdDiff), 
           Variance = c(xVar, theoreticalVar, varDiff), 
           row.names = c('Simulated', 'Theoretical', 'Difference'))

## Histogram  of x
plot1 <- ggplot(dt, aes(x)) +
        geom_histogram(binwidth = 0.1) +
        labs(title = 'Distribution of All Samples')

## Histogram of sample means
plot2 <- ggplot(dtMeans, aes(mean)) +
    geom_histogram(binwidth = 0.15) +
    geom_vline(aes(xintercept = mean(dtMeans$mean)), colour = 'red') +
    geom_vline(aes(xintercept = theoreticalMean), colour = 'green', linetype = 'dashed') +
    labs(title = 'Distribution of Sample Means')