This is the first part of a project for the Statistical Inference course, which is part of Coursera’s Data Science and Data Science: Statistics and Machine Learning Specializations.
The report investigates the distribution of sample means of exponential random variables and compares it with what the Central Limit Theorem predicts, namely that the distribution of averages of \(iid\) variables approaches \(\bar X_{n} \sim N(\mu, \frac{\sigma^2}{n})\) as the sample size \(n\) increases.
A thousand random variables are simulated, each equal to the average of \(n=40\) exponentials with rate \(\lambda=0.2\), and the properties of their distribution are illustrated.
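As a quick worked check of what the theorem predicts for this setup, the following uses only the values stated above (\(\lambda=0.2\), \(n=40\)) together with the standard facts that an exponential variable has mean \(1/\lambda\) and standard deviation \(1/\lambda\):

\[
\mu = \frac{1}{\lambda} = 5, \qquad \sigma = \frac{1}{\lambda} = 5, \qquad \frac{\sigma^2}{n} = \frac{25}{40} = 0.625, \qquad \sqrt{0.625} \approx 0.79
\]

So the averages would be expected to cluster around 5 with a standard deviation of roughly 0.79, if the normal approximation holds at \(n=40\).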
library(data.table); library(ggplot2)
Simulations
The simulation was implemented with the replicate() and rexp() functions (Appendix: code simulation).
The sample mean is 5.0012 and the theoretical mean, calculated as \(1/\lambda\), is 5. The difference \(theoretical\ mean-sample\ mean\) is therefore -0.0012, i.e. the sample mean and the theoretical mean match closely (Appendix: code mean).
| Sample Mean | Theoretical Mean | Difference |
|---|---|---|
| 5.0012 | 5 | -0.0012 |
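To put the size of this difference in context, here is a rough calculation. It assumes that the 1000 simulated averages are independent and that each has the theoretical variance \(\frac{1}{\lambda^2 n} = 0.625\); under those assumptions the standard error of the sample mean is

\[
SE = \sqrt{\frac{0.625}{1000}} = 0.025, \qquad \frac{|{-0.0012}|}{0.025} \approx 0.05
\]

so the observed difference is about one twentieth of a standard error, well within ordinary simulation noise.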
The simulations are explored with a histogram of the averages, on which the sample mean and the theoretical mean are highlighted (Appendix: code mplot).
The plot also shows that the center of the distribution of averages of 40 exponentials is very close to the theoretical center of the distribution.
The sample variance of the averages is 0.6201 and the theoretical variance, calculated as \(Var(\bar X_{n}) = \frac{1}{\lambda^2 n}\), is 0.625. The difference \(theoretical\ variance-sample\ variance\) is therefore 0.0049, i.e. the variance of the sample means and the theoretical variance match closely (Appendix: code variance).
| Sample Mean Variance | Theoretical Variance | Difference |
|---|---|---|
| 0.6201 | 0.625 | 0.0049 |
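The theoretical variance formula can be derived in one line. The sketch below assumes, as in the simulation, that the \(X_i\) are independent exponentials with common variance \(\sigma^2 = 1/\lambda^2\):

\[
Var(\bar X_{n}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} = \frac{1}{\lambda^2 n}
\]

Independence is what allows the variance of the sum to be written as the sum of the variances in the second step.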
The simulations are also explored with the density of the averages, overlaid with a curve fitted to the data and with the theoretical normal density \(N(\mu,sd^2)\), where \(\mu=1/\lambda, sd=\frac{1}{\lambda \sqrt n}\) (Appendix: code varplot).
The plot also shows that the variance of the sample mean is very close to the theoretical variance of the distribution.
First, as the plots above show, the sample data are quite symmetric and not noticeably skewed, and the sample mean and variance are close to their theoretical values.
Second, the sample mean, median and mode match up closely (Appendix: code mmm):
| Sample Mode | Sample Median | Sample Mean |
|---|---|---|
| 5.204 | 4.9779 | 5.0012 |
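Because the averages are continuous, repeated values are unlikely, so a mode based on counting identical values has little to count. One common alternative, shown here as a minimal sketch and not as the method used in this report, is to take the peak of a kernel density estimate. The helper name `density_mode` and the seed are assumptions made for illustration only.

```r
# Sketch: estimate the mode of a continuous sample from a kernel density.
set.seed(1)                                    # arbitrary seed, not the report's
means <- replicate(1000, mean(rexp(40, 0.2)))  # 1000 averages of 40 exponentials

density_mode <- function(x) {                  # hypothetical helper name
  d <- density(x)                              # Gaussian kernel, default bandwidth
  d$x[which.max(d$y)]                          # grid point where the density peaks
}

kde_mode <- density_mode(means)
print(c(mode = kde_mode, median = median(means), mean = mean(means)))
```

The result depends on the bandwidth chosen by `density()`, so it should be read as an approximate location of the peak rather than an exact value.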
Finally, the q-q plot comparing the theoretical and observed quantiles is nearly linear (Appendix: code qqplot).
All of the above leads to the conclusion that the simulated distribution of 1000 averages of 40 random exponential variables with \(\lambda=0.2\) is approximately normal, \(N(5, \frac{5}{8})\), which illustrates how the Central Limit Theorem works.
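The dependence on the sample size \(n\) can be illustrated with a small base R sketch. It repeats the simulation for several values of \(n\) and compares the variance and skewness of the averages with their theoretical values, \(\frac{1}{\lambda^2 n}\) and \(\frac{2}{\sqrt n}\) (the skewness of a mean of \(n\) exponentials). The helper `summarise_means`, the seed, the choice of \(n\) values and the larger number of replications are assumptions for illustration and are not part of the original analysis.

```r
# Sketch: how the distribution of averages changes with the sample size n.
set.seed(1)       # arbitrary seed, not the one used in the report
lambda <- 0.2
nsim   <- 5000    # assumption: more replications than the report's 1000

summarise_means <- function(n) {  # hypothetical helper name
  means   <- replicate(nsim, mean(rexp(n, lambda)))
  centred <- means - mean(means)
  c(n           = n,
    sample_var  = var(means),
    theory_var  = 1 / (lambda^2 * n),
    sample_skew = mean(centred^3) / mean(centred^2)^1.5,  # moment-based skewness
    theory_skew = 2 / sqrt(n))
}

# One row per sample size
convergence <- t(sapply(c(10, 40, 160), summarise_means))
print(round(convergence, 3))
```

Both the variance and the skewness should shrink as \(n\) grows, which is the sense in which the distribution of averages approaches a normal one.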
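simulation
set.seed(2112)  # fixed seed so the results are reproducible
lambda <- 0.2   # rate of the exponential distribution
n <- 40         # number of exponentials in each average
nsim <- 1000    # number of simulated averages
# one row per simulation: the mean of n exponentials
esample <- data.table(mean = replicate(nsim, mean(rexp(n, lambda))))
mean
smean <- mean(esample[, mean])  # sample mean of the simulated averages
tmean <- 1/lambda               # theoretical mean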
diff <- tmean - smean
mplot
# positions and labels of the two vertical reference lines
line.data <- data.table(x = c(smean, tmean),
                        means = c("Simulation", "Theoretical"),
                        stringsAsFactors = FALSE)
mplot <- ggplot(esample, aes(mean)) +
  geom_histogram(colour = "darkgrey", fill = "cornflowerblue", binwidth = 0.2) +
  geom_vline(aes(xintercept = x, colour = means), line.data,
             size = c(3, 1.5)) +
  labs(title = "Sample Mean versus Theoretical Mean",
       subtitle = "Histogram of observations means") +
  scale_colour_manual(values = c("violet", "purple")) +
  guides(colour = guide_legend(override.aes = list(size = 2)))
variance
# var() on the column vector gives a scalar; var(esample) returns a 1x1 matrix
svar <- var(esample[, mean])
tvar <- 1/(lambda^2*n)  # theoretical variance of the mean
varplot
varplot <- ggplot(esample, aes(x = mean)) +
  geom_histogram(colour = "darkgrey", aes(y = ..density..),
                 fill = "cornflowerblue", binwidth = 0.5) +
  geom_vline(aes(xintercept = x, colour = means), line.data,
             size = c(3, 1.5)) +
  # density fitted to the simulated averages
  geom_density(size = 2, color = "violet") +
  # theoretical normal density N(tmean, tvar)
  stat_function(fun = dnorm, args = list(mean = tmean, sd = sqrt(tvar)),
                colour = "purple", size = 1.5) +
  labs(title = "Sample Mean Variance versus Theoretical Variance",
       subtitle = "Density of observations means") +
  scale_colour_manual(values = c("violet", "purple")) +
  guides(colour = guide_legend(override.aes = list(size = 2)))
mmm
# most frequent value; when all values are distinct (typical for continuous
# data) every count is 1 and the first element is returned
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
smode <- getmode(esample[, mean])
smed <- median(esample[, mean])
qqplot
qqnorm(esample[, mean], col = "cornflowerblue",
       main = "Sample Quantiles versus Theoretical Quantiles")
# lwd is the graphical parameter for line width
qqline(esample[, mean], col = "brown", lwd = 3.0)