The aim of this project is to show the power of the Central Limit Theorem, one of the most important concept and central tenet in Statistics. Normally , we all assume that the theoretical distribution of the population that we sample from is normal or approximately so. I that respect we further assume that the mean of that sample will also be normally distributed. The power of the Central Limit Theorem is that even if theoretical distribution is not normal we can still assume that the sample mean is normally distributed. That will be shown by simulating from the exponential distribution, which is very different from the normal as can be seen from Figure 3. The main mass is to the right of 0 and the vicinity, is heavily skewed to the right, and mainly is not “bell” shaped. The exponential density function is:
\[
f(X;\lambda)=\lambda e^{-\lambda x},
\] where the mean of the distribution is \(\frac{1}{\lambda}\), the variance is \(\frac{1}{\lambda^2}\), and the standard deviation is again \(\frac{1}{\lambda}\). In this report \(\lambda\) will be set equal to 0.2 , therefore the theoretical mean will be 1/.2 = 5 or \(\mu = 5\) and variance will be \(\sigma^2 = 1/.2^2 = 25\), finally the standard deviation is \(\sigma = 1/.2 = 5\). The simulation experiments will be simple - sample from the exponential distribution, calculate the mean and variance. Then repeat this 1000 times (in this report the number of simulations will be set equal to 1000), whereby each iteration will record the mean and the variance. Plot the histogram of the 1000 results, and make conclusions. Finally, compare the distribution of the simulation of the 1000 exponentially distributed numbers with that of the distribution of the 1000 means of samples of size 40.
According to the Law of Large Numbers the sample mean in the limit should converge to the mean of the population. We can see here that even with one sample of 40 exponentials the sample mean is pretty close to the theoretical \(\mu = 5\).
nosim = 1000; lambda = .2; n = 40
set.seed(0); mean(rexp(n,lambda))
## [1] 4.776259
In order to investigate this relation we can look at the distribution of the 1000 of means of samples of different szes - i.e. n = 10,40,90. The histograms of the simulation can be seen in Figure 1 in the appendix, with the accompanying code. It can clearly be seen that the mean of the distributions is cetered at the theoretical mean (marked by the vertical lines). Also, it can e seen that as n increases the the results are more and more closer to the theoretical mean.
The variance of the sample mean is centered at the variance of the population from which we are sampling. The variance of the sample mean is itself a random variable and the nice property is that it estimates the variance of the population. We take a single sample here of 40 exponentials and calculate the \(\sigma^2\) and \(\sigma\).
set.seed(0); var(rexp(n,lambda))
## [1] 23.90902
set.seed(0); sd(rexp(n,lambda))
## [1] 4.889685
As can be seen, the resuls are pretty close to the theoretical parameters. In order to investigate this further we will simulate 1000 samples of n=10,40,90 and measure their variances. Then plot the histograms of the different cases is shown in Figure 2. It can be concluded that sample vaiances are even closer to the theoretical ones then that of the means, and of course as the sample size increases the relation gets even stronger. So sample variance estimates pretty good the population variance.
The sample mean is itself a random variable which has a mean equal to the mean of the population, but the variance of that distribution is in fact \(\frac{\sigma^2}{n}\) or the standard deviation is \(\frac{\sigma}{\sqrt n}\). With our parameters that theoretical variance would be 25/40 = 0.625. It could be observed that this variance is much samller than the variance of the population. And we can take one sample:
set.seed(0); var(rexp(n,lambda))/n
## [1] 0.5977255
Again - pretty close. Or we can simulate 1000 n = 40 samples and calculate the variance of that 1000 means:
means = NULL; set.seed(0)
for (i in 1 : nosim) {means = c(means, mean(rexp(n,lambda)))}
var(means)
## [1] 0.6181582
Of course the number now is much closer to the theoretical value. In order to further investigate this relation we can plot the histogram of the means and compare it to the distribution of the population from which we are sampling. That is illustrated in Figure 3. It clearly shows that both distributions are centered at the population mean \(\mu = 5\), as indicated by the red horisontal line. Also, the distribution of the mean of samples is clearly normally shaped, whereas the population distribution is very different from normal. On the other hand also can be observed that the exponential distribution is much more spread out than that of the means of samples.
Distribution of the means of 1000 samples of size 10, 40 and 90 from the exponential distribution.
Distribution of the variances of 1000 samples of size 10, 40 and 90 from the exponential distribution.
Histogram of 1000 draws from exponential distribution(left) and histogram of the means of 1000 of 40 exponentials(right).
Code Figure 1
library(ggplot2); set.seed(0)
dat <- data.frame(
x = c(apply(matrix(rexp(nosim * (n-30), lambda),
nosim), 1, mean),
apply(matrix(rexp(nosim * n, lambda),
nosim), 1, mean),
apply(matrix(rexp(nosim * (n+50), lambda),
nosim), 1, mean)
),
size = factor(rep(c(10, 40, 90), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20,
binwidth=.2, colour = "black")
g <- g + geom_vline(xintercept = 1/lambda, size = 2)
g + facet_grid(. ~ size) #+ annotate("text", x = 7, y = 40, label = "Some text")
Code Figure 2
set.seed(0)
dat <- data.frame(
x = c(apply(matrix(rexp(nosim * (n-30), lambda),
nosim), 1, var),
apply(matrix(rexp(nosim * n, lambda),
nosim), 1, var),
apply(matrix(rexp(nosim * (n+50), lambda),
nosim), 1, var)
),
size = factor(rep(c(10, 40, 90), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20,
binwidth=7, colour = "black")
g <- g + geom_vline(xintercept = (1/lambda)^2, size = 2)
g + facet_grid(. ~ size)
Code Figure 3
set.seed(0)
dat <- data.frame(
x = c(rexp(nosim, lambda),
apply(matrix(rexp(nosim * n, lambda),
nosim), 1, mean)
),
simulation = factor(rep(c("exp dist", "sample dist"), rep(nosim, 2))))
g <- ggplot(dat, aes(x = x, fill = simulation)) + geom_histogram( alpha = .20,
binwidth=.5, colour = "black")
g <- g + geom_vline(xintercept = 1/lambda, size = 1,color="red")
g + facet_grid(.~simulation) + theme_bw()
It must be acknowledged here that most of the code in this project is a modified version from prof. Brian Caffo’s lectures from the course “Statistical Inference” at Coursera.