The Central Limit Theorem

Oleksandr Fialko
11/01/2017

The central limit theorem (CLT) states that, for sufficiently large samples drawn independently from a population with finite variance, the distribution of the sample mean is approximately normal. The mean of that distribution equals the population mean, and its variance equals the population variance divided by the sample size. This holds regardless of the shape of the population distribution.
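
In symbols, one common way to write the statement is the following. The notation \( X_i \), \( \mu \), \( \sigma^2 \) and \( n \) is introduced here for illustration and does not come from the original text; the \( X_i \) are assumed independent and identically distributed with mean \( \mu \) and finite variance \( \sigma^2 \).

\[
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1) \quad \text{as } n \to \infty
\]

For large \( n \) this means that \( \bar{X}_n \) is approximately distributed as \( \mathcal{N}(\mu, \sigma^2/n) \).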

The cartoon is taken from here

Binomial distribution

Let's flip a biased coin a thousand times, with success probability 0.2:

sample_size <- 1000
data <- rbinom(n = sample_size, size = 1, prob = 0.2)

The theoretical mean is \( 0.2 \) and the theoretical variance is \( 0.2(1-0.2) = 0.16 \). The sample estimates are close to both:

c(mean(data),var(data))
[1] 0.2050000 0.1631381
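
As a rough check of how close the sample mean should be to \( 0.2 \), one can compute the standard error of the mean for a single sample of 1000 flips. This is a minimal sketch; the names `p`, `n` and `se` are introduced here for illustration and are not part of the original code.

```r
p <- 0.2                      # success probability used above
n <- 1000                     # number of flips in one sample
se <- sqrt(p * (1 - p) / n)   # standard deviation of the sample mean
se                            # about 0.0126
```

The observed mean of 0.205 differs from 0.2 by 0.005, which is well within one standard error.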

Many observations

Now let's repeat our experiment many times

num_obs <- 1000
flips <- rbinom(sample_size * num_obs, 1, 0.2)

and store the results in a matrix:

data <- matrix(flips,nrow = num_obs)

Calculate the mean of each row, that is, of each repetition of the experiment:

means <- apply(data,1,mean)

CLT in action

By the CLT, the means should follow an approximately Gaussian distribution with mean \( 0.2 \) and variance \( 0.16/1000 = 1.6 \times 10^{-4} \), which the histogram of the means, overlaid with this Gaussian density, confirms.
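
The expected mean and variance can also be checked numerically. The sketch below regenerates the data so that it runs on its own; the fixed seed is an arbitrary assumption added for reproducibility, and `rowMeans` is used as a shorthand for `apply(data, 1, mean)`.

```r
set.seed(1)                                  # arbitrary seed (assumption, not in the original)
sample_size <- 1000                          # flips per repetition
num_obs <- 1000                              # number of repetitions
flips <- rbinom(sample_size * num_obs, 1, 0.2)
means <- rowMeans(matrix(flips, nrow = num_obs))  # one mean per repetition

c(mean(means), 0.2)                          # empirical vs. theoretical mean
c(var(means), 0.16 / sample_size)            # empirical vs. theoretical variance
```

Both pairs should agree closely, with the variance estimate fluctuating by a few percent from run to run.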

I have also created a Shiny application that demonstrates the CLT using other distributions.

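sig <- 1.6e-4                        # variance of the sample mean: 0.16/1000
x <- seq(0.15, 0.25, 0.001)          # grid of values around the expected mean
y <- exp(-(x - 0.2)^2 / (2 * sig))   # unnormalized Gaussian centered at 0.2
y <- y / sum(y) / 0.001              # normalize so the curve integrates to 1 on the grid
hist(means, freq = FALSE)            # histogram of the means on a density scale
lines(x, y, col = 'red', lwd = 2)    # overlay the Gaussian predicted by the CLT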
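
The hand-normalized curve can be compared with R's built-in normal density `dnorm`. This is a sketch that assumes the same grid and variance as the plotting code; `y_dnorm` is a name introduced here for illustration.

```r
sig <- 1.6e-4                                    # variance of the sample mean
x <- seq(0.15, 0.25, 0.001)                      # same grid as in the plot
y <- exp(-(x - 0.2)^2 / (2 * sig))               # unnormalized Gaussian
y <- y / sum(y) / 0.001                          # numerical normalization on the grid
y_dnorm <- dnorm(x, mean = 0.2, sd = sqrt(sig))  # closed-form normal density
max(abs(y - y_dnorm))                            # small compared with the peak of the curve
```

Since the two curves nearly coincide, `lines(x, y_dnorm, col = 'red', lwd = 2)` could be used for the overlay instead of the manual normalization.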
