Statistical Inference Course Project
Peer-graded Assignment
This course project is available on GitHub
The Central Limit Theorem states that if you have a population with mean \(\mu\) and standard deviation \(\sigma\) and take sufficiently large random samples from the population (generally sample sizes greater than 30), then the distribution of the sample means will be approximately normally distributed about the population mean \(\mu\) - no matter the shape of the population distribution.
This project explores the Central Limit Theorem using the exponential distribution in R. The theoretical normal distribution will be compared to the distribution of calculated means of samples from the exponential distribution.
Load packages used in this analysis.
if (!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
## Loading required package: ggplot2
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
Display session information.
sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.1.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.1 knitr_1.22 magrittr_1.5 tidyselect_0.2.5
## [5] munsell_0.5.0 colorspace_1.4-1 R6_2.4.0 rlang_0.3.4
## [9] stringr_1.4.0 plyr_1.8.4 dplyr_0.8.1 tools_3.6.0
## [13] grid_3.6.0 gtable_0.3.0 xfun_0.7 withr_2.1.2
## [17] htmltools_0.3.6 assertthat_0.2.1 yaml_2.2.0 lazyeval_0.2.2
## [21] digest_0.6.18 tibble_2.1.1 crayon_1.3.4 purrr_0.3.2
## [25] glue_1.3.1 evaluate_0.13 rmarkdown_1.12 stringi_1.4.3
## [29] compiler_3.6.0 pillar_1.4.0 scales_1.0.0 pkgconfig_2.0.2
Perform 1000 simulations, each with 40 samples of an exponential distribution. The 40 samples will be used to calculate the arithmetic mean and variance and then compared to the theoretical estimates.
To make the data reproducible, a seed will be set. Also, set the control parameters \(\lambda = 0.2\) (the rate) and \(n = 40\) (number of samples).
# set seed for reproducability
set.seed(062000)
# set sampling values:
lambda <- 0.2 # rate parameter
n <- 40 # number of samples (exponentials) in each simulation
numSimulations <- 1000 # number of simulations
# simulate the population
simMeans <- data.frame(expMean = sapply(1 : numSimulations, function(x) {mean(rexp(n, lambda))}))
According to the Central Limit Theorem, the distribution of the sample means will be approximately normally distributed with a mean equal to the population mean \(\mu\) of the underlying distribution. Because the underlying distribution in this simulation is exponential, the theoretical mean of the exponential distribution will be compared to the corresponding sample mean of the simulation. For an exponential distribution, the theoretical mean is equal to \(\frac{1}{\lambda}\).
Calculate the sample mean and theoretical mean across all 1000 simulations of 40 samples from an exponential distribution where \(\lambda = 0.2\).
# calculate sample mean and theoretical mean
sampleMean <- mean(simMeans$expMean)
theoMean <- 1/lambda
compMeans <- data.frame(sampleMean, theoMean)
names(compMeans) <- c("Sample Mean", "Theoretical Mean")
print(compMeans)
## Sample Mean Theoretical Mean
## 1 4.950877 5
As part of the data analysis, also perform a one sample t-test to check the 95% confidence interval for the sample mean.
t.test(simMeans$expMean, conf.level = 0.95)
##
## One Sample t-test
##
## data: simMeans$expMean
## t = 198.48, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 4.901927 4.999826
## sample estimates:
## mean of x
## 4.950877
Display a histogram to show the averages of the 40 exponentials over 1000 simulations. Include the sample mean and theoretical mean for comparison.
# plot the distribution (sample mean versus theoretical mean)
expSimulationMeansChart <- ggplot(simMeans, aes(x = expMean, y = ..count..)) +
geom_histogram(binwidth = 0.15, color = "white", fill = rgb(0.2,0.7,0.1,0.4)) +
geom_vline(aes(xintercept = sampleMean, color = "sample"), size = 0.50) +
geom_vline(aes(xintercept = theoMean, color = "theoretical"), size = 0.50) +
xlab("Mean") +
ylab("Frequency") +
theme(plot.title = element_text(size = 14, hjust = 0.5)) +
scale_color_manual(name = "Means", values = c(sample = "blue", theoretical = "red")) +
ggtitle("Distribution of Exponential Simulation Means")
print(expSimulationMeansChart)
The sample mean came out to be 4.9508767 while the theoretical mean is 5. As shown in the above chart, the mean of the sample means of exponentials (blue vertical line) is very close to the theoretical mean of an exponential distribution (red vertical line). We can also see that with a 95% confidence interval, the sampled mean is between 4.9019272 and 4.9998263 which closely match.
In the same manner used to compare the Sample Mean and Theoretical Mean, the Sample Variance will be compared to the Theoretical Variance.
The theoretical variance is \(\frac{(\frac{1}{\lambda})^2}{n}\).
# calculate sample variance and theoretical variance
sampleVariance <- var(simMeans$expMean)
theoVariance <- ((1/lambda)^2)/n
compVariance <- data.frame(sampleVariance, theoVariance)
names(compVariance) <- c("Sample Variance", "Theoretical Variance")
print(compVariance)
## Sample Variance Theoretical Variance
## 1 0.6222257 0.625
The sample variance came out to be 0.6222257 which is very close to the theoretical variance 0.625.
Determine whether the exponential distribution is approximately normally distributed about the population mean. According to the Central Limit Theorem, the means of the sample simulations should follow a normal distribution.
# plot the distribution
expSimulationMeansChart <- ggplot(simMeans, aes(x = expMean)) +
geom_histogram(aes(y = ..density..), binwidth = 0.15, color = "white", fill = rgb(0.2,0.7,0.1,0.4)) +
geom_vline(aes(xintercept = sampleMean, color = "sample"), size = 0.50) +
geom_vline(aes(xintercept = theoMean, color = "theoretical"), size = 0.50) +
xlab("Mean") +
ylab("Density") +
theme(plot.title = element_text(size = 14, hjust = 0.5)) +
scale_color_manual(name = "Means", values = c(sample = "blue", theoretical = "red")) +
stat_function(fun = dnorm, args = list(mean = sampleMean, sd = sqrt(sampleVariance)), color = "blue", size = 1.0) +
stat_function(fun = dnorm, args = list(mean = theoMean, sd = sqrt(theoVariance)), color = "red", size = 1.0, linetype = "dashed") +
ggtitle("Distribution of Exponential Simulation Means")
print(expSimulationMeansChart)
As shown in the above plot, the distribution of means of the sampled exponential distribution appear to follow a normal distribution.
The density of the sampled data is shown by the light green bars. The dotted red line represents a normal distribution which is very close to the sample distribution colored in blue.