Statistical Inference Course Project

Part 1: Simulation Exercise Instructions

In this project I investigated the exponential distribution in R and compared the behavior of its sample means with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. I set lambda = 0.2 for all of the simulations. I investigated the distribution of averages of 40 exponentials across 1,000 simulations. Note: before starting, I installed and loaded the following packages in R: R.utils, rmarkdown, knitr, tidyverse, ggplot2, and UsingR.
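
As a quick sanity check on the stated mean and standard deviation, a minimal sketch along the following lines can be run first. The sample size of 100,000 and the name `draws` are illustrative assumptions, not part of the assignment.

```r
# Illustrative check (not part of the original analysis): a large sample of
# exponentials should have a mean and a standard deviation both near 1/lambda.
lambda <- 0.2
set.seed(1234)
draws <- rexp(100000, rate = lambda)  # sample size is an arbitrary choice

mean(draws)  # expected to be near 1/lambda = 5
sd(draws)    # also expected to be near 1/lambda = 5
```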

I illustrated via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials:

Displayed the sample mean and compared it to the theoretical mean of the distribution.
Demonstrated how variable the sample is (via variance) and compared it to the theoretical variance of the distribution.
Displayed that the distribution is approximately normal.

        # Load the plotting package used for the figures below
        library(ggplot2)

        # Using pre-defined parameters
        lambda = 0.2
        n = 40
        sims = 1:1000
        set.seed(1234)
        # From directions: 
        # The exponential distribution can be simulated in R with rexp(n, lambda) 
        # where lambda is the rate parameter. 
        # The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.
        
        # Simulate the sampling distribution of the mean:
        # each of the 1000 entries is the mean of n = 40 exponential draws
        pop = data.frame(x=sapply(sims, function(i) {
                mean(rexp(n, lambda))
        }))

        # Plot the histogram
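        # after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0);
        # bins is set explicitly to the default of 30
        ggplot(pop, aes(x=x)) +
                geom_histogram(aes(y=after_stat(count), fill=after_stat(count)), 
                               bins = 30, color = "black") + 
                labs(title="Distribution of Means of 40 Exponentials", subtitle="~ 1000 Simulations", 
                     y="Frequency Count", x="Mean")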

I used the pre-defined parameters and set the seed to 1234. Each of the 1,000 simulated means is computed from 40 independent draws from rexp, and the histogram shows that these means are centered around 5. The distribution appears to approximate a normal distribution; however, the tails are not perfectly symmetric.

# Show the sample mean and compare it to the theoretical mean of the distribution.
        sample.mean = round(mean(pop$x), 2)
        theo.mean = round(1/lambda, 2)

The sample mean (4.97) approximates the theoretical mean (5).

# Check the 95% confidence interval for the sample mean
        t.test(pop$x)

## 
##  One Sample t-test
## 
## data:  pop$x
## t = 208.23, df = 999, p-value < 0.00000000000000022
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  4.927362 5.021116
## sample estimates:
## mean of x 
##  4.974239

        # Use the full element name rather than relying on partial matching of $conf
        ci = t.test(pop$x)$conf.int
        lowerci = round(ci[1], 2)
        upperci = round(ci[2], 2)

The 95% CI for the sample mean (4.93, 5.02) includes the theoretical mean (5).
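
To make explicit what t.test computes here, the interval can also be rebuilt by hand from the mean, the standard error, and a t quantile. The following is a minimal sketch under the same parameters and seed; the names `means`, `se`, `tq`, and `manual.ci` are illustrative and do not appear in the original code.

```r
# Hypothetical re-computation of the 95% t interval by hand, for illustration.
lambda <- 0.2
n <- 40
set.seed(1234)
means <- sapply(1:1000, function(i) mean(rexp(n, lambda)))

m  <- mean(means)
se <- sd(means) / sqrt(length(means))    # standard error of the mean
tq <- qt(0.975, df = length(means) - 1)  # two-sided 95% critical value
manual.ci <- c(m - tq * se, m + tq * se)

manual.ci
as.numeric(t.test(means)$conf.int)       # should agree with manual.ci
```

Note that this interval describes the mean of the 1,000 simulated averages, so it narrows as the number of simulations grows.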

# Show how variable the sample is (via variance) and compare it to the theoretical variance of the 
# distribution.
        
        sample.var = round(var(pop$x), 2)
        theo.var = round(((1/lambda)^2)/n, 2)

The sample variance (0.57) is close to the theoretical variance of the mean, (1/lambda)^2/n = 0.625 (displayed rounded as 0.62).
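
The theoretical value follows from the variance of a mean of n independent draws, sigma^2/n, with sigma = 1/lambda. A short sketch of that arithmetic, without rounding, is given here; `sigma`, `theo.var.exact`, and `theo.se` are names introduced only for this illustration.

```r
# Illustrative arithmetic for the variance of the mean of n exponentials.
lambda <- 0.2
n <- 40
sigma <- 1 / lambda            # sd of a single exponential draw: 5
theo.var.exact <- sigma^2 / n  # variance of the mean of n draws: 25/40 = 0.625
theo.se <- sigma / sqrt(n)     # corresponding standard deviation of the mean

theo.var.exact
theo.se
```

The value 0.625 appears as 0.62 in the text above, most likely because R's round() sends an exact tie to the even digit.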

# Show that the distribution is approximately normal.
        
        # Plot the sample mean and var vs. theoretical mean and var:
        # We plot the density rather than the count because the overlaid
        # geom_density and dnorm curves are on the density scale. They would be
        # flattened along the bottom of the y-axis (<1) if the y-axis was count.
        # The normal curve uses the exact sd, (1/lambda)/sqrt(n), not the rounded variance.
        ggplot(pop, aes(x=x)) +
                geom_histogram(aes(y=after_stat(density), fill = after_stat(density)), 
                               bins = 30, color = "black") +
                labs(title="Sample Distribution of Means", 
                     subtitle = "Against Theoretical Distribution", 
                     y = "Density", 
                     x = "Mean", 
                     caption = "Black = Sample Mean, Red = Theoretical Mean") +
                geom_density(color = "black") +
                geom_vline(xintercept = sample.mean, color = "black", linetype = "dashed") +
                stat_function(fun = dnorm, args = list(mean = 1/lambda, sd = (1/lambda)/sqrt(n)), 
                              color = "red") +
                geom_vline(xintercept = theo.mean, color = "red", linetype = "dashed")

Evaluating the figure, the distribution of the means of 40 exponentials, simulated 1,000 times, approximates a normal distribution with mean 1/lambda = 5 and variance (1/lambda)^2/n. You can see this by comparing the shape of the black density curve with the shape of the red normal curve. The sample mean (black vertical dashed line) is slightly lower than the theoretical mean (red vertical dashed line).
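
As a complement to the visual comparison, a normal Q-Q comparison offers a second, rough check of normality. This is a different technique from the histogram overlay used in the report, sketched here with base R only; the names `means`, `qq`, and `qq.cor` are illustrative assumptions.

```r
# Illustrative normality check via a normal Q-Q comparison (base R only).
lambda <- 0.2
n <- 40
set.seed(1234)
means <- sapply(1:1000, function(i) mean(rexp(n, lambda)))

qq <- qqnorm(means, plot.it = FALSE)  # theoretical vs. sample quantiles
qq.cor <- cor(qq$x, qq$y)             # close to 1 when the points lie near a line
qq.cor

# To draw the plot instead of only computing the quantiles:
# qqnorm(means); qqline(means, col = "red")
```

A correlation near 1 is consistent with approximate normality, while curvature at the ends of the plotted Q-Q line would point to the tail asymmetry noted earlier.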


A. Johnson

10 August 2019
