Statistical Inference Project Part 1: Central Limit Theorem & Exponential Distribution

Overview

The simulations below demonstrate the CLT in action for the exponential distribution. We took 1000 random samples of size 40 from the exponential distribution with lambda = 0.2 to create a distribution of averages. First, we compare the theoretical mean (1/lambda) with the observed mean of the distribution of averages. Second, we compare the theoretical variance, (1/lambda)^2/n, with the observed variance of that distribution. Third, we show that the distribution of sample means is approximately normal.

Simulations

The simulation below draws 1000 random samples of size 40 from the exponential distribution with lambda = 0.2. It then takes the mean of each sample and stores it in the variable averages, which represents the distribution of averages. The simulation also draws a separate random sample of 1000 values from the exponential distribution itself for comparison purposes later.

# setting parameters and seed
lambda = 0.2
set.seed(123)


# 1000 random draws from the exponential distribution with rate lambda = 0.2
exponentialdist <- rexp(1000, rate = lambda)

# samples the exponential distribution 1000 times with 40 observations in each sample,
# and stores the mean of each sample
averages <- numeric(1000)
for (i in 1:1000){
        averages[i] <- mean(rexp(40, lambda))
}
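
The loop above can also be written with replicate(), a common R idiom for repeated simulation. The sketch below is self-contained and uses its own variable names (n_obs, n_sims, averages_loop, and averages_rep are illustrative and not part of the original analysis). It checks that both approaches give the same 1000 means when started from the same seed. Because it omits the initial rexp(1000, ...) draw, its values are not expected to match the figures reported in this document.

```r
# Illustrative sketch: n_obs, n_sims, averages_loop and averages_rep are
# hypothetical names, not objects from the original analysis
lambda <- 0.2
n_obs  <- 40     # observations per sample
n_sims <- 1000   # number of simulated samples

# loop version with a preallocated result vector
set.seed(123)
averages_loop <- numeric(n_sims)
for (i in seq_len(n_sims)) {
        averages_loop[i] <- mean(rexp(n_obs, rate = lambda))
}

# replicate() version, restarted from the same seed
set.seed(123)
averages_rep <- replicate(n_sims, mean(rexp(n_obs, rate = lambda)))

# both versions consume the random number stream in the same order
identical(averages_loop, averages_rep)
```

Either form works; replicate() is shorter, while the loop makes the preallocation explicit.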

Sample Mean VS Theoretical Mean

First we calculated the theoretical and sample means of the distribution of averages. The plot below shows the theoretical mean and sample mean on the distribution. They are represented by a solid orange vertical line and a dashed cyan vertical line, respectively.

library(ggplot2)
# theoretical mean of the exponential distribution (1/lambda)
Theoretical_Mean <- 1/lambda

# showing sample mean of distribution of mean of 40 exponentials
Sample_Mean <- mean(averages)

# plot comparing location of sample mean and theoretical mean on distribution
g <- ggplot(as.data.frame(averages), aes(averages))
g <- g + geom_histogram(binwidth = .2, color = "black", fill = "purple") +
        labs(x = "Means from Samples of Size 40 from Exponential Distribution",
             y = "Frequency", title = "Simulated Sample Means Distribution")



# adding reference lines
g <- g + geom_vline(xintercept = Theoretical_Mean, color = "orange", size = 1)
g <- g + geom_vline(xintercept = Sample_Mean, color = "cyan", size = 1, lty = 2)

g

The theoretical mean of the exponential distribution is 5.
The sample mean of the distribution of sample means is 5.0129624. The mean of the distribution of sample means is slightly larger than, but very close to, the theoretical mean. This is expected: the sample mean is an unbiased estimator of the population mean, and averaging over 1000 simulated samples leaves little sampling error.
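
As a rough check on how close "very close" is, the gap between the two means can be expressed in standard-error units. The sketch below uses only the values reported above; the variable names are illustrative, and it assumes the 1000 simulated means are independent.

```r
# Values reported in this document (names here are illustrative)
theoretical_mean <- 5
observed_mean    <- 5.0129624
lambda <- 0.2
n_obs  <- 40     # observations per sample
n_sims <- 1000   # number of simulated sample means

# standard deviation of a single mean of 40 exponentials
se_one_mean <- (1 / lambda) / sqrt(n_obs)

# standard error of the average of 1000 such means (assumes independence)
se_grand_mean <- se_one_mean / sqrt(n_sims)

# gap between observed and theoretical mean, in standard-error units
z_gap <- (observed_mean - theoretical_mean) / se_grand_mean
z_gap
```

The gap is about half a standard error, which is well within ordinary simulation noise.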

Sample Variance VS Theoretical Variance

We calculated the theoretical and sample variances and standard deviations of the distribution of averages. The plot produced below shows the theoretical standard deviation and sample standard deviation on the distribution. They are represented by solid orange and dashed cyan vertical lines, respectively. As you can see, the standard deviations and therefore the variances are approximately the same.

# calculating theoretical standard deviation of the exponential distribution (1/lambda)
exp_sd <- 1/lambda
# calculating theoretical standard deviation of the mean of 40 exponentials
theoretical_sd <- exp_sd/sqrt(40)


var(averages)

## [1] 0.6039681

theoretical_sd

## [1] 0.7905694

sample_sd <- sd(averages)

# making a density histogram
v <- ggplot(as.data.frame(averages), aes(x = averages)) + geom_histogram(binwidth = .2, 
        color = "black", fill = "purple", aes(y = ..density..)) +
        labs(x = "Means from Samples of Size 40 from Exponential Distribution",
             y = "Density", title = "Simulated Sample Means Distribution Comparing Variances")

# adding theoretical (black) and observed (red) normal density curves
v <- v + stat_function(fun = dnorm, args = list(mean = Theoretical_Mean, sd = theoretical_sd), color = "black")
v <- v + stat_function(fun = dnorm, args = list(mean = Sample_Mean, sd = sample_sd), color = "red")


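# theoretical lines at every whole standard deviation from -3 to 3;
# sample lines only at -3 and +3 standard deviations
theory_sd_scale <- Theoretical_Mean + c(-3:3) * theoretical_sd
sample_sd_scale <- Sample_Mean + c(-3,3) * sample_sd

# adding reference lines showing standard deviations from the mean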
v <- v + geom_vline(xintercept = theory_sd_scale, color = "orange", size = 1)
v <- v + geom_vline(xintercept = sample_sd_scale, color = "cyan", size = 1, lty = 2)
v

The standard deviation produced by our repeated sampling is so close to the theoretical standard deviation that the difference is hard to see on the plot. Three standard deviations from the mean, the dashed cyan lines (observed) sit only slightly inside the solid orange lines (theoretical), which reflects the slightly smaller observed standard deviation.

The table below summarizes the means and variances.

# summarizing the means and variances in a table
mean_vars <- data.frame(Theoretical = c(Theoretical_Mean, (theoretical_sd)^2), Observed = c(Sample_Mean, sample_sd^2))
row.names(mean_vars) <- c("Means", "Variances")
mean_vars

##           Theoretical  Observed
## Means           5.000 5.0129624
## Variances       0.625 0.6039681

The observed mean is slightly higher than the theoretical mean, and the observed variance is slightly lower than the theoretical variance. The important point is that, because the distribution of averages is built from a large number of simulated samples (1000), its mean and variance land close to their theoretical values.
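
For reference, the theoretical variance in the table follows directly from the formula (1/lambda)^2/n. The sketch below works through that arithmetic and expresses the observed variance reported above as a relative gap; the variable names are illustrative and not part of the original analysis.

```r
lambda <- 0.2
n_obs  <- 40     # observations per sample

# variance of one exponential draw is (1/lambda)^2
pop_var <- (1 / lambda)^2          # 25

# variance of the mean of n_obs independent draws
var_of_mean <- pop_var / n_obs     # 0.625

# relative gap between the observed variance reported above and theory
observed_var <- 0.6039681
rel_gap <- (observed_var - var_of_mean) / var_of_mean
round(100 * rel_gap, 1)            # about -3.4 (percent)
```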

Distributions

The figures below help demonstrate that the distribution of 1000 sample means, with n = 40 and lambda = 0.2, is approximately normal.

v + labs(title = "Looking For Normality")

qqnorm(averages)
qqline(averages, col = "blue")

By the CLT, averages of sufficiently large samples are approximately normally distributed around the population mean. We can tell that our distribution of averages is close to this theoretical normal curve because its shape follows the curve and because the theoretical and sample means and variances practically overlap.

Also, the Q-Q plot above shows little deviation from the normal distribution, which is represented by the blue line.
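
The separate sample of raw exponential draws taken in the simulation step gives a useful contrast here. One way to quantify it, offered as a sketch and not as part of the original analysis, is to compare sample skewness: a normal distribution has skewness 0, while the exponential distribution has skewness 2. The skewness() helper and the names raw_draws and sample_means are hypothetical; the draws are regenerated in the same order as the simulation above so that the block runs on its own.

```r
# hypothetical helper: sample skewness (third standardized moment)
skewness <- function(x) {
        centered <- x - mean(x)
        mean(centered^3) / mean(centered^2)^1.5
}

lambda <- 0.2
set.seed(123)

# same draw order as the simulation: raw sample first, then 1000 means of 40
raw_draws    <- rexp(1000, rate = lambda)
sample_means <- replicate(1000, mean(rexp(40, rate = lambda)))

skewness(raw_draws)      # theory for the exponential: 2
skewness(sample_means)   # theory for means of 40: 2 / sqrt(40), about 0.32
```

A much smaller skewness for the sample means than for the raw draws would be consistent with the averages being far closer to normal than the underlying exponential data.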