In this project I will verify the Central Limit Theorem with simulated exponential samples. I will do a thousand simulations and investigate the sample of averages of 40 exponentials, by comparing the sample mean, variance and distribution with theoretical counterpoints respectively.
Suppose \(\{X_1, X_2, ..., X_n\}\) is a random sample of size n - that is, a sequence of independent and identically distributed (i.i.d.) random variables drawn from distributions of expected values given by \(\mu\) and finite variances given by \(\sigma^2\). The sample mean of the random sample is denoted by \(\bar{X}\), where \(\bar{X} = \frac{1}{n} \sum_{i=1}^{\infty} {X_i}\). Then as \(n \rightarrow \infty\) (n approaches infinity), the random variables \(\sqrt{n} (\bar{X} - \mu)\) approximate the normal distribution with mean 0 and variance \(\sigma^2\). That is, \(\sqrt{n} (\bar{X} - \mu) \overset{d}{\rightarrow} N(0, \sigma^2) \Leftrightarrow \frac{\sqrt{n} }{\sigma} (\bar{X} - \mu) \overset{d}{\rightarrow} N(0, 1)\).
In probability theory and statistics, the exponential distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. The probability density function (pdf) of an exponential distribution is \[f(X=x|\lambda) = \lambda e^{-\lambda x}, ~x\geqslant0\].
Its expectatation \(E(X) = 1/\lambda\), and its variance \(Var(X) = 1/\lambda^2\) (So its standard deviation \(s(X) = 1/\lambda\)).
lambda <- 0.2
nosim <- 1000
n <- 40
set.seed(1) # make the results reproducible
data <- matrix(rexp(n*nosim,lambda), nrow = nosim, ncol = n)
str(data)
## num [1:1000, 1:40] 3.776 5.908 0.729 0.699 2.18 ...
data.mean <- apply(data, 1, mean)
In order to make the simulation reproducible, I set the seed as 1 before simulation. With the function rexp, I simulated \(1000 \times 40\) exponentials with paramter \(\lambda=0.2\), which are stored in the \(1000 \times 40\) matrix named data. The average of each rows of exponentials are calculated respectively and saved in the vector named data.mean.
mean(data.mean) # sample mean of the distribution of averages of 40 exponentials
## [1] 4.990025
1/lambda # theoretical mean of the distribution of averages of 40 exponentials
## [1] 5
According to Central Limit Theorem, the therotical mean of the distribution of averages of 40 exponentials should be the same with the mean of random exponentials, which is \(1/\lambda\).
In the output, the sample mean (4.99) of the distribution of averages of 40 exponentials is very close to the theoretical mean (5).
var(data.mean) # sample variance of the distribution of averages of 40 exponentials
## [1] 0.6177072
(1/lambda) ^2/n # theoretical variance of the distribution of averages of 40 exponentials
## [1] 0.625
According to Central Limit Theorem, the therotical variance of the distribution of averages of 40 exponentials should be the variance of random exponentials divided by square root of sample size, which is \(1/(\lambda \sqrt{n})\).
In the output, the sample variance (0.618) of the distribution of averages of 40 exponentials is very close to the theoretical variance (0.625).
data.normalize <- sqrt(n) * (data.mean - 1/lambda) / (1/lambda)
ks.test(data.normalize,"pnorm")
##
## One-sample Kolmogorov-Smirnov test
##
## data: data.normalize
## D = 0.039084, p-value = 0.09423
## alternative hypothesis: two-sided
qqplot.data <- function(vec) # argument: vector of numbers
{
# following four lines from base R's qqline()
y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
d <- data.frame(resids = vec)
ggplot(d, aes(sample = resids)) +
stat_qq(color = "steelblue") +
geom_abline(slope = slope, intercept = int, linetype = 2) +
ggtitle("Q-Q plot of the data") +
theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
theme(axis.title.y = element_text(size = 13) ) +
theme(axis.title.x = element_text(size = 13) )
}
qqplot.data(data.normalize)
The 1000 averages of 40 exponentials are normalized (Z-transformed) with mean \(1/\lambda\) and variance \(1/(\lambda \sqrt{n})\) in the following analysis.
The Kolmogorov-Smirnov test (K-S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K-S test), or to compare two samples (two-sample K-S test). Here I use one-sample K-S test to expolore if normalized 1000 averages of 40 exponentials follows standard normal distribution. Since the p-value is 0.094, at alpha level of 0.05 it implies the sample distribution is approximately normal.
The Quantile-Quantile Plot (Q-Q Plot), showing only tiny tail effects, also indicates approximate normality.
library(ggplot2)
g <- ggplot(data.frame(data.normalize), aes(x = data.normalize) )
g <- g + geom_histogram(binwidth = .3, color = "blue", fill = "gold", aes(y = ..density..))
g <- g + stat_function(fun = dnorm, size = 1.5, linetype = 2)
g <- g + labs(x = "normalized sample")
g <- g + labs(title = "Sample Distribution and Theoretical Distribution")
g <- g + theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))
g
The histogram shows the shape of the sample distribution of averages of 40 exponentials. And A standard normal distribution is represented as the black dashed bell curve in the plot. From the plot, we can see the normalized sample distribution, \(\lambda\sqrt{n} (\bar{X} - 1/\lambda)\), is approximately standard normal distribution. In other words, averages of 40 exponentials approximately follows a normal distribution with mean \(1/\lambda\) and variance \(1/(\lambda \sqrt{n})\).