Statistical Inference Course Project Part 1

Title

Testing the Central Limit Theorem with the Exponential Distribution.

Overview

In this project we will investigate the exponential distribution in R and check that the averages of simulated samples behave as the Central Limit Theorem predicts, and that the simulated means and standard deviations match their theoretical values.

The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda, and the standard deviation is also 1/lambda.
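
As a quick sanity check of these two facts, the sketch below draws one large sample and compares its mean and standard deviation with 1/lambda. The sample size of 100000 and the seed are arbitrary choices for illustration and are not part of the assignment.

```r
# Illustrative check; the seed and the sample size are assumptions
set.seed(42)
lambda <- 0.2
x <- rexp(100000, rate = lambda)

theoretical <- 1 / lambda   # both the mean and the sd of the distribution
c(theoretical = theoretical, sample_mean = mean(x), sample_sd = sd(x))
```

Both sample statistics should land close to 5, and they get closer as the sample size grows.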

Simulations

We will set lambda = 0.2 (as assigned) for all of the simulations. We will investigate the distribution of averages of 40 exponentials. We will do a thousand simulations.

Refer to the appendix to see the actual simulation code being run. In the code, “sim” is the number of simulations and “n” is the number of exponentials averaged in each simulation.

The simulation in the appendix creates a numeric vector in which each entry is the average of 40 rexp() draws, repeated 1000 times. It also creates a similar vector of sample standard deviations, each computed from a separate set of 40 draws. All data referenced in the rest of the report come from this simulation.

Sample Mean versus Theoretical Mean

# Calculate theoretical and real means
theomean <- 1/lambda
realmean <- mean(expavg)

The theoretical mean of the exponential distribution is 1/lambda. At lambda = 0.2 the theoretical mean is 5. The mean of the 1000 simulated averages is 5.0309717.

Sample Standard Deviation versus Theoretical Standard Deviation

# Calculate theoretical and real standard deviations
theosd <- 1/lambda
realsd <- mean(expsd)

The theoretical standard deviation of the exponential distribution is 1/lambda as well. At lambda = 0.2 the theoretical standard deviation is 5. The average of the 1000 sample standard deviations (one for each set of 40 exponentials) is 4.9262426.

Sample Variance versus Theoretical Variance

# Calculate theoretical and real variance
theovar <- (1/lambda)^2/n
realvar <- var(expavg)

The variance of the average of n independent exponentials is the population variance divided by n, that is (1/lambda)^2/n, so the standard error of the mean is (1/lambda)/sqrt(n). At lambda = 0.2 and n = 40 the theoretical variance of the sample mean is 25/40 = 0.625. The variance of the 1000 simulated means is 0.6454564.
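
The arithmetic behind these theoretical values can be reproduced directly. This is a small sketch using the same lambda and n as the report; the names sigma, var_mean and se are introduced here for illustration only.

```r
lambda <- 0.2
n <- 40

sigma <- 1 / lambda        # population standard deviation: 5
var_mean <- sigma^2 / n    # variance of the sample mean: 25 / 40 = 0.625
se <- sigma / sqrt(n)      # standard error: the square root of var_mean

c(var_mean = var_mean, se = se)
```

The standard error is about 0.79, so most simulated means should fall within roughly 1.6 of the theoretical mean of 5.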

Distribution of the Means

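By the Central Limit Theorem, the distribution of the 1000 simulated means should be approximately normal, centered at 1/lambda = 5 with variance (1/lambda)^2/n = 0.625 (rather than a standard normal, which has mean 0 and variance 1). Let’s plot our simulation to see if that’s true.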

This is very close to the familiar bell curve. Increasing the number of simulations will smooth the histogram, while increasing the number of exponentials averaged in each simulation (n) will bring the distribution of the means closer to a normal curve.
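
One way to check the shape numerically, rather than by eye, is to standardize the simulated means with the theoretical mean and standard error and compare the result with a standard normal. The sketch below is self-contained and uses base R only; its seed is an arbitrary assumption, so its draws differ from those behind the numbers reported above.

```r
# Sketch only: the seed is arbitrary, so the draws differ from the report's
set.seed(2024)
lambda <- 0.2
n <- 40
sim <- 1000

# Each entry is the mean of n exponential draws
means <- replicate(sim, mean(rexp(n, lambda)))

# Standardize with the theoretical mean and standard error
z <- (means - 1 / lambda) / ((1 / lambda) / sqrt(n))

# If the normal approximation holds, z has mean near 0 and sd near 1,
# and the points in the Q-Q plot fall close to the reference line
c(mean_z = mean(z), sd_z = sd(z))
qqnorm(z)
qqline(z)
```

Some departure from the line in the tails is expected, because the average of 40 exponentials is still mildly right-skewed.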

Appendix of all Simulation Code

# rexp(n, lambda) draws n exponential random values with rate lambda.
# We will request 40 samples, and then take their mean.
# This procedure will be completed 1,000 times.
lambda <- 0.2
sim <- 1000
n <- 40

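# Preallocate the result vectors instead of growing them with append()
expavg <- numeric(sim)
expsd <- numeric(sim)

# Set seed for reproducibility
set.seed(1337)
for (i in seq_len(sim)) {
    # Note: the mean and the standard deviation use separate draws
    expavg[i] <- mean(rexp(n, lambda))
    expsd[i] <- sd(rexp(n, lambda))
}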

# Calculate theoretical and real means
theomean <- 1/lambda
realmean <- mean(expavg)

# Calculate theoretical and real standard deviations
theosd <- 1/lambda
realsd <- mean(expsd)

# Calculate theoretical and real variance
theovar <- (1/lambda)^2/n
realvar <- var(expavg)

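# Plot the data as a histogram and overlay a kernel density estimate.
# after_stat() and linewidth replace the deprecated ..density.. and size
# (requires ggplot2 >= 3.4.0).
library(ggplot2)

df <- as.data.frame(expavg)

g <- ggplot(df, aes(expavg)) +
    geom_histogram(aes(y = after_stat(density)), binwidth = 0.1) +
    geom_density(color = "red", linewidth = 2)

print(g)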