Simulation of Central Limit Theorem with Exponential Samples

Catalog

Overview

In this project I will verify the Central Limit Theorem with simulated exponential samples. I will do a thousand simulations and investigate the sample of averages of 40 exponentials, by comparing the sample mean, variance and distribution with theoretical counterpoints respectively.

Central Limit Theorem

Suppose \(\{X_1, X_2, ..., X_n\}\) is a random sample of size n - that is, a sequence of independent and identically distributed (i.i.d.) random variables drawn from distributions of expected values given by \(\mu\) and finite variances given by \(\sigma^2\). The sample mean of the random sample is denoted by \(\bar{X}\), where \(\bar{X} = \frac{1}{n} \sum_{i=1}^{\infty} {X_i}\). Then as \(n \rightarrow \infty\) (n approaches infinity), the random variables \(\sqrt{n} (\bar{X} - \mu)\) approximate the normal distribution with mean 0 and variance \(\sigma^2\). That is, \(\sqrt{n} (\bar{X} - \mu) \overset{d}{\rightarrow} N(0, \sigma^2) \Leftrightarrow \frac{\sqrt{n} }{\sigma} (\bar{X} - \mu) \overset{d}{\rightarrow} N(0, 1)\).

Exponential distribution

In probability theory and statistics, the exponential distribution is the probability distribution that describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. The probability density function (pdf) of an exponential distribution is \[f(X=x|\lambda) = \lambda e^{-\lambda x}, ~x\geqslant0\].
Its expectatation \(E(X) = 1/\lambda\), and its variance \(Var(X) = 1/\lambda^2\) (So its standard deviation \(s(X) = 1/\lambda\)).

Simulation

lambda <- 0.2
nosim <- 1000
n <- 40
set.seed(1)   # make the results reproducible
data <- matrix(rexp(n*nosim,lambda), nrow = nosim, ncol = n)
str(data)

##  num [1:1000, 1:40] 3.776 5.908 0.729 0.699 2.18 ...

data.mean <- apply(data, 1, mean)

In order to make the simulation reproducible, I set the seed as 1 before simulation. With the function rexp, I simulated \(1000 \times 40\) exponentials with paramter \(\lambda=0.2\), which are stored in the \(1000 \times 40\) matrix named data. The average of each rows of exponentials are calculated respectively and saved in the vector named data.mean.

Sample Mean versus Theoretical Mean

mean(data.mean)  # sample mean of the distribution of averages of 40 exponentials

## [1] 4.990025

1/lambda         # theoretical mean of the distribution of averages of 40 exponentials

## [1] 5

According to Central Limit Theorem, the therotical mean of the distribution of averages of 40 exponentials should be the same with the mean of random exponentials, which is \(1/\lambda\).

In the output, the sample mean (4.99) of the distribution of averages of 40 exponentials is very close to the theoretical mean (5).

Sample Variance versus Theoretical Variance

var(data.mean)   # sample variance of the distribution of averages of 40 exponentials

## [1] 0.6177072

(1/lambda) ^2/n   # theoretical variance of the distribution of averages of 40 exponentials

## [1] 0.625

According to Central Limit Theorem, the therotical variance of the distribution of averages of 40 exponentials should be the variance of random exponentials divided by square root of sample size, which is \(1/(\lambda \sqrt{n})\).

In the output, the sample variance (0.618) of the distribution of averages of 40 exponentials is very close to the theoretical variance (0.625).

Sample Distribution versus Theoretical Distribution

1. Kolmogorov-Smirnov Test and Quantile-Quantile Plot

data.normalize <- sqrt(n) * (data.mean - 1/lambda) / (1/lambda)
ks.test(data.normalize,"pnorm")

## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  data.normalize
## D = 0.039084, p-value = 0.09423
## alternative hypothesis: two-sided

qqplot.data <- function(vec)  # argument: vector of numbers
{
  # following four lines from base R's qqline()
  y <- quantile(vec[!is.na(vec)], c(0.25, 0.75))
  x <- qnorm(c(0.25, 0.75))
  slope <- diff(y)/diff(x)
  int <- y[1L] - slope * x[1L]
  d <- data.frame(resids = vec)
  ggplot(d, aes(sample = resids)) + 
    stat_qq(color = "steelblue") + 
    geom_abline(slope = slope, intercept = int, linetype = 2) +
    ggtitle("Q-Q plot of the data") +
    theme(plot.title = element_text(size = 15, face = "bold", hjust = 0.5)) +
    theme(axis.title.y = element_text(size = 13) )  +
    theme(axis.title.x = element_text(size = 13) )
}
qqplot.data(data.normalize)

The 1000 averages of 40 exponentials are normalized (Z-transformed) with mean \(1/\lambda\) and variance \(1/(\lambda \sqrt{n})\) in the following analysis.

The Kolmogorov-Smirnov test (K-S test or KS test) is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (one-sample K-S test), or to compare two samples (two-sample K-S test). Here I use one-sample K-S test to expolore if normalized 1000 averages of 40 exponentials follows standard normal distribution. Since the p-value is 0.094, at alpha level of 0.05 it implies the sample distribution is approximately normal.

The Quantile-Quantile Plot (Q-Q Plot), showing only tiny tail effects, also indicates approximate normality.

2. Distribution Visualization

library(ggplot2)
g <- ggplot(data.frame(data.normalize), aes(x = data.normalize) )
g <- g + geom_histogram(binwidth = .3, color = "blue", fill = "gold", aes(y = ..density..)) 
g <- g + stat_function(fun = dnorm, size = 1.5, linetype = 2)
g <- g + labs(x = "normalized sample") 
g <- g + labs(title = "Sample Distribution and Theoretical Distribution")
g <- g + theme(plot.title = element_text(size = 13, face = "bold", hjust = 0.5))
g

The histogram shows the shape of the sample distribution of averages of 40 exponentials. And A standard normal distribution is represented as the black dashed bell curve in the plot. From the plot, we can see the normalized sample distribution, \(\lambda\sqrt{n} (\bar{X} - 1/\lambda)\), is approximately standard normal distribution. In other words, averages of 40 exponentials approximately follows a normal distribution with mean \(1/\lambda\) and variance \(1/(\lambda \sqrt{n})\).