Title: Statistical Inference Course Project_Part 1 | Author: Anna Huynh | Date: 11/24/2020

Overview

This project investigates the exponential distribution in R and compares the distribution of its sample averages with what the Central Limit Theorem (CLT) predicts. It consists of two parts:

Part 1: A simulation exercise.
Part 2: Basic inferential data analysis.

Part 1: Simulation Exercise

Definition

The Central Limit Theorem (CLT) is one of the most important theorems in all of statistics. It states that the distribution of averages of iid (independent and identically distributed) variables, properly normalized, approaches that of a standard normal as the sample size increases. In practice, the CLT tells us that for a large enough sample size, averages are approximately normally distributed, centered at the population mean with variance equal to the population variance divided by the sample size.
The exponential distribution is the probability distribution of the time between events in a Poisson point process. (Wikipedia)

Simulations

The exponential distribution can be simulated in R with rexp(n, lambda), where n is the number of observations and lambda is the rate parameter. The mean (mu) of the exponential distribution is 1/lambda and the standard deviation (s) is also 1/lambda. lambda is set to 0.2 for all of the simulations. The quantity of interest is the distribution of averages of 40 exponentials, estimated from 1000 simulated averages.
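
As a quick check of these two facts, the following minimal sketch draws one large exponential sample and compares its mean and standard deviation with 1/lambda. The seed, the draw count of 100000, and the variable names are assumptions made for illustration; they are not part of the original analysis.

```r
# Illustrative check; the seed and the draw count are arbitrary choices
set.seed(1)
lambda <- 0.2
draws <- rexp(100000, rate = lambda)  # 100000 exponential observations

theoretical <- 1 / lambda             # both the mean and the sd equal 1/lambda
c(mean = mean(draws), sd = sd(draws), theoretical = theoretical)
```

Both estimates should land near 5, the value of 1/lambda for lambda = 0.2.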

1. Sample Mean versus Theoretical Mean

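# Setting values
lambda <- 0.2      # rate parameter of the exponential distribution
nSim <- 1000       # number of simulated averages
sampleSize <- 40   # number of exponentials in each average

# Each element of means is the average of sampleSize exponentials
means <- numeric(nSim)
for (i in 1:nSim) {
    means[i] <- mean(rexp(sampleSize, lambda))
}
# freq = FALSE puts the histogram on the density scale so the density curve is visible
hist(means, breaks=40, col = "green", freq = FALSE)
rug(means)
lines(density(means))
abline(v=1/lambda, col="magenta", lwd=4) # The magenta line shows the theoretical mean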

print(mean(means))

## [1] 5.035921

Figure 01: Sample Mean versus Theoretical Mean

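The expected (theoretical) mean of the population is 1/λ = 1/0.2 = 5.
The mean of the sample means in the run shown above is 5.035921. No random seed is set, so this value changes slightly from run to run.

Observation: The mean of the sample means differs from the theoretical mean by about 0.036, or roughly 0.7%.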
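
To judge how close "close" is, the gap can be compared with its standard error. The mean of 1000 averages of 40 exponentials is the mean of 40000 exponentials, so its standard error is (1/lambda) / sqrt(40 * 1000). The sketch below does this arithmetic; the name nSim and the hard-coded observed value (copied from the printed output above) are assumptions for illustration.

```r
lambda <- 0.2
sampleSize <- 40
nSim <- 1000                 # number of simulated averages (assumed name)
observed <- 5.035921         # mean of sample means printed above

# Standard error of the mean of nSim averages of sampleSize exponentials
se <- (1 / lambda) / sqrt(sampleSize * nSim)
gap <- observed - 1 / lambda
c(se = se, gap = gap, gap_in_se = gap / se)
```

The standard error is 0.025, so the observed gap is about 1.4 standard errors, which is within ordinary sampling variation.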

2. Sample Variance versus Theoretical Variance

Theoretical variance of the sample mean = s² / n = (1/0.2)² / 40 = 0.625
Sample variance of the sample means in the run shown below = 0.602873

var(means)

## [1] 0.602873

Observation: The variance of the sample means (0.602873) is within about 3.5% of the theoretical variance (0.625).
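
The s² / n relationship can also be checked across sample sizes. The following is a minimal sketch, not part of the original analysis: the seed, the sample sizes 10, 40 and 160, and the variable names are assumptions chosen for illustration.

```r
set.seed(2)                  # arbitrary seed, for reproducibility only
lambda <- 0.2
nSim <- 1000                 # number of simulated averages per sample size
sizes <- c(10, 40, 160)      # sample sizes chosen for illustration

# Variance of nSim simulated averages for each sample size
simulated <- sapply(sizes, function(n) {
    var(replicate(nSim, mean(rexp(n, lambda))))
})
theoretical <- (1 / lambda)^2 / sizes
data.frame(n = sizes, simulated = simulated, theoretical = theoretical)
```

Each fourfold increase in sample size should cut the variance of the averages to roughly a quarter, matching the theoretical values 2.5, 0.625 and 0.15625.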

3. Distribution: Showing That the Distribution of Sample Means Is Approximately Normal

# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")

library(ggplot2)
# Standardize the sample means using the theoretical mean and standard error
z <- (means - 1/lambda) / ((1/lambda) / sqrt(sampleSize))
d <- data.frame(z = z)
g <- ggplot(d, aes(sample = z))
# linewidth needs ggplot2 >= 3.4; use size = 2 in older versions
g <- g + geom_abline(linewidth = 2, col = "lightblue") # Standard normal reference line (y = x)
g <- g + stat_qq(color="black",size=2,alpha=1/2) # Quantiles of the standardized sample means
g <- g + geom_vline(xintercept = qnorm(0.975)) # 97.5% quantile of the standard normal
g <- g + geom_hline(yintercept = quantile(z, 0.975)) # 97.5% quantile of the sample means
g <- g + labs(x="Theoretical Quantiles",y="Sample Quantiles")
g <- g + ggtitle("Normal Q-Q Plot of Standardized Sample Means")
g

Figure 02: Quantile-quantile plot against the standard normal for the distribution of averages of 40 exponentials. The closer the points lie to the line, the better the fit to a normal distribution. Unlike a large collection of raw exponentials, which is strongly right-skewed, the plot suggests a high degree of normality for the averages.
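
A numeric complement to the figure is to compare the skewness of raw exponentials with the skewness of averages of 40 exponentials. A normal distribution has skewness 0, an exponential has skewness 2, and the average of n exponentials has skewness 2 / sqrt(n), about 0.32 for n = 40. The sketch below is an illustration only: the seed, the skewness helper, and the variable names are assumptions and do not appear in the original analysis.

```r
set.seed(3)                  # arbitrary seed, for reproducibility only
lambda <- 0.2
sampleSize <- 40
nSim <- 1000

raw <- rexp(nSim, lambda)                                # 1000 single exponentials
avgs <- replicate(nSim, mean(rexp(sampleSize, lambda)))  # 1000 averages of 40

# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skewness(raw), averages = skewness(avgs))
```

The averages should show far less skew than the raw draws, while the small remaining skew explains why a Q-Q plot can still bend slightly in the tails.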