Course Project Statistical
InferencePart 1 of the Course Project aims to show an understanding of the central limit theorem (CLT), comparing simulation results with theoretical expectations. The activity is based on a sample of 1,000 means generated by 40 numbers (with an exponential distribution profile with lambda 0.2). Then comparing those values (the sample and theoretical) to prove the CLT. As a result, a graph was plotted showing the normality of the sampleβs average, confirming the CLT.
Please find the Requirements and Settings to reproduce this experiment in the APPENDIX section or Forking the Github Repository.
According to Part 1 of the Course Project, the parameters of this assignment should be:
# Parameters
simulations <- 1000; sample_size <- 40; lambda <- 0.2
Generating random data to create a dataset to answer the tasks posted on β1. Objectivesβ.
# Adapting the source code given in the instructions.
mean_sample_exponential <- c()
for (i in 1 : 1000) {
mean_sample_exponential = c(mean_sample_exponential,
base::mean(stats::rexp(n = 40, rate = 0.2)))
}
The mean_sample_exponential is a vector, 1000 in length,
representing the mean of each simulation of 40 samples.
Please find below the answer to each question of the exponential function dataset.
Task 1: Show the sample mean and compare it to the theoretical mean of the distribution.
Theoretical Mean: The Theoretical mean of an exponential distribution is the inverse of lambda.
# Calculating the Theoretical Mean.
theoretical_mean <- 1 / lambda
So in this exercise, the theoretical mean is 5.
Sample Mean: The sample mean is shown below.
# Calculating the Sample Mean of each simulation.
sample_mean <- mean(mean_sample_exponential)
Thus, the sample mean is 5.007542.
Conclusions
Based on the results above, those means are very close due to the significant amount of samples and simulations. This exercise shows the concepts of the Central Limit Theorem.
Task 2: Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
To calculate the variance is necessary one step before calculating the standard deviation. Thus, this section is divided into 2 steps.
Theoretical Standard Deviation: The standard deviation is calculated analytically.
# Using the formula to calculate Theoretical Standard Deviation
theoretical_sd <- (1/lambda)/(base::sqrt(sample_size))
So in this exercise, the theoretical standard deviation is 0.7905694.
Sample Standard Deviation: The sample standard deviation is shown below
# Using the sample to calculate the empirical value.
sample_sd <- stats::sd(mean_sample_exponential)
The sample standard deviation is 0.774678.
Using the standard deviation calculated above. It is possible to calculate the variances.
Theoretical Variance: The Variance is the square of the standard deviation
# Calculating the Theoretical variance.
theoretical_varicane <- theoretical_sd^2
So in this exercise the theorical variance is 0.625.
Sample Variance:
# Calculating the sample variance.
sample_variance <- stats::sd(mean_sample_exponential)^2
The sample variance is 0.6001259.
Conclusions
Based on the results above, those variances are very close due to the significant amount of samples and simulations. This exercise shows the concepts of the Central Limit Theorem.
Task 3: Show that the distribution is approximately normal.
This question will be quickly answered using a graph.
Comparison between Theoretical and Simulation outcomes. Graph Source Code in Appendix.
Conclusions
The histogram from the mean of 1000 simulations has proved the so-called bell curve, which we could understand as a normal distribution. To ensure this understanding, I have plotted two lines, the red one is the simulation, and the blue one is the theoretical density curve (normal). From the results of this graph, the behavior of the Central Limit Theorem is clear.
In order to reproduce this Course Project in any environment, please
find below the Packages, Seed definition and
SessionInfo().
# Loading libraries
library(ggplot2)
# Force results to be in English
Sys.setlocale("LC_ALL", "English.utf8")
# Set seed
set.seed(2022)
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 rmarkdown_2.14
##
## loaded via a namespace (and not attached):
## [1] highr_0.9 bslib_0.3.1 compiler_4.2.0 pillar_1.7.0
## [5] jquerylib_0.1.4 tools_4.2.0 digest_0.6.29 lubridate_1.8.0
## [9] jsonlite_1.8.0 evaluate_0.15 lifecycle_1.0.1 tibble_3.1.7
## [13] gtable_0.3.0 pkgconfig_2.0.3 rlang_1.0.3 DBI_1.1.3
## [17] cli_3.3.0 rstudioapi_0.13 yaml_2.3.5 xfun_0.31
## [21] fastmap_1.1.0 withr_2.5.0 dplyr_1.0.9 stringr_1.4.0
## [25] knitr_1.39.3 generics_0.1.2 sass_0.4.1 vctrs_0.4.1
## [29] tidyselect_1.1.2 grid_4.2.0 glue_1.6.2 R6_2.5.1
## [33] fansi_1.0.3 purrr_0.3.4 magrittr_2.0.3 codetools_0.2-18
## [37] scales_1.2.0 htmltools_0.5.2 ellipsis_0.3.2 assertthat_0.2.1
## [41] colorspace_2.0-3 utf8_1.2.2 stringi_1.7.6 munsell_0.5.0
## [45] crayon_1.5.1
# Density Histogram of averages
hist(mean_sample_exponential, breaks = 75,
main = "Comparison between Theoretical and Simulation outcomes",
xlab = "Sample Average", ylab = "Frequency")
# Draw a line of Density of the averages of samples
lines(stats::density(mean_sample_exponential)$x,
stats::density(mean_sample_exponential)$y * 100, # Converting density into absolute value.
lwd = 2,
col = "red")
# Theoretical center of distribution. In other words, the theoretical mean.
abline(v = 1/lambda, col="orange", lwd = 2)
# Theoretical density of the averages of the simulations samples
xfit <- base::seq(min(mean_sample_exponential),
base::max(mean_sample_exponential),
length = 100)
yfit <- stats::dnorm(x = xfit,
mean = 1/lambda,
sd = (1/lambda/sqrt(sample_size)))
# Draw a line of Theoretical
lines(xfit,
yfit * 100, # Converting density into absolute value.
pch = 20,
col = "blue",
lty = 4,
lwd = 2)
# Add legend in the histogram
legend('topright',
c("simulation", "normal", "mean"),
lty = c(1,2,1),
col = c("red", "blue", "orange"),
lwd = 2)
According to the QQ plot, the points are close to the blue line, which means the average of the exponential sample follow a normal distribution by its quantiles.
# Creating a Q-q plot
qqnorm(mean_sample_exponential, pch = 1, frame = FALSE)
qqline(mean_sample_exponential, col = "tomato", lwd = 2)