Central Limit Theorem

What is it?

From Wikipedia - the central limit theorem (CLT) states that, given certain conditions, the arithmetic mean of a sufficiently large number of iterates of independent random variables, each with a well-defined (finite) expected value and finite variance, will be approximately normally distributed, regardless of the underlying distribution.

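In simpler words: whatever the underlying distribution (as long as its mean and variance are finite), the means of repeated random samples will be approximately normally distributed, provided each sample is large enough.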

From StatTrek - the sampling distribution of the mean of any independent, random variable will be normal or nearly normal, if the sample size is large enough.
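Both quotations describe the same result. In its classical form for independent, identically distributed variables (a special case of the wording above) it can be written as follows. The symbols X_i, \mu, \sigma^2 and \bar{X}_n are notation introduced here for illustration and do not appear in the quoted sources.

```latex
X_1, \dots, X_n \ \text{i.i.d.},\quad
\mathbb{E}[X_i] = \mu,\quad
\operatorname{Var}(X_i) = \sigma^2 < \infty
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \xrightarrow{\;d\;} \mathcal{N}(0, \sigma^2),
\qquad
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i .
```

Informally, for large n the sample mean behaves like a normal variable centred on \mu with standard deviation \sigma / \sqrt{n}.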

How large is “large enough”? The answer depends on two factors.

Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample points will be required.
The shape of the underlying population. The more closely the original population resembles a normal distribution, the fewer sample points will be required.

In practice, some statisticians say that a sample size of 30 is large enough when the population distribution is roughly bell-shaped. Others recommend a sample size of at least 40. But if the original population is distinctly not normal (e.g., is badly skewed, has multiple peaks, and/or has outliers), researchers like the sample size to be even larger.
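To make the "shape of the underlying population" point concrete, here is a minimal sketch in base R. It assumes an exponential population (rate 1) as the badly skewed example, and the helper names `skewness` and `sample_means` are made up for this illustration. It measures how lopsided the distribution of sample means still is at a few sample sizes.

```r
# Sketch: how the shape of sample means changes with sample size
# for a badly skewed population (exponential, rate 1; an assumed example).
set.seed(1)

# Sample skewness (third standardized moment); helper defined for illustration
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Draw `reps` samples of size n and return their means
sample_means <- function(n, reps = 10000) {
  rowMeans(matrix(rexp(reps * n, rate = 1), nrow = reps))
}

sizes <- c(2, 30, 200)
skews <- sapply(sizes, function(n) skewness(sample_means(n)))
names(skews) <- paste0("n=", sizes)
print(round(skews, 2))
```

For this population, theory gives a skewness of 2 / sqrt(n) for the sample mean, so the printed values should shrink towards 0 (a symmetric, bell-like shape) as n grows, and a sample size of 30 still leaves visible skew.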

Let’s see a few examples.

1. Uniform Distribution

Let’s assume a class of 100 students, with marks simulated from a uniform distribution.

# Use ggplot for prettier graphs
library(ggplot2)

# Set Seed 
set.seed(500)

# Create 100 values between 45 and 91 randomly
x <- runif(100 ,45,91)
#Plot a histogram
qplot (x, fill = I("pink"))

#Display the mean
(original_mean <- mean(x))

## [1] 66.76623

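What if we take a random sample of 30?

# sample() returns the drawn marks themselves, so x does not need to be indexed
sample1 <- sample(x, 30, replace = TRUE)
#Plot a histogram
qplot(sample1, fill = I("dodgerblue2"))

#Display the mean
mean(sample1)

## [1] 70.10866

The recorded mean is 70.11 whereas the original mean is 66.77, a difference of 3.34. That run indexed x by the sampled marks, x[sample(x, 30, replace = T)], which can only reach positions 45 to 90 of x; sample(x, 30, replace = TRUE) draws from all 100 marks, so a rerun will print a different value.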

What the Central Limit Theorem states is that if we take a large number of such samples, their means will be approximately normally distributed around the population mean, so the mean of means will be close to the population mean.

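## Central Limit Theorem
# Means of 1000 samples of 30 marks each, drawn with replacement
y <- numeric(1000)
for (i in 1:1000) {
  y[i] <- mean(sample(x, 30, replace = TRUE))
}
qplot(y, fill = I("firebrick4"))

mean(y)

## [1] 70.23237

The recorded mean of means is 70.23 whereas the original mean is 66.77, a difference of 3.46. The histogram shows a bell shape, as the theorem predicts.

That gap is too large to be sampling noise over 1000 means. The recorded run used x[sample(30, x, replace = T)], where sample(30, ...) draws positions from 1 to 30, so every draw came from the first 30 marks. With sample(x, 30, replace = TRUE), which draws from all 100 marks, the mean of means should land within a few tenths of the population mean.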
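How tightly the sample means cluster around the population mean can be checked against the usual standard-error formula, sd / sqrt(n). The sketch below is self-contained and regenerates its own marks under the hypothetical name `marks`, so it does not depend on the x used above. Because the samples are drawn with replacement from exactly these 100 values, the spread of the marks is computed with divisor 100.

```r
# Sketch: compare the observed spread of sample means with sd / sqrt(n).
set.seed(500)
marks <- runif(100, 45, 91)   # stand-in for the class marks (assumed name)
n <- 30

# Means of 5000 samples of size n, drawn with replacement from the 100 marks
means <- replicate(5000, mean(sample(marks, n, replace = TRUE)))

# Spread of the marks themselves (divisor 100: we resample exactly these values)
pop_sd <- sqrt(mean((marks - mean(marks))^2))

observed_se  <- sd(means)
predicted_se <- pop_sd / sqrt(n)
cat("observed:", round(observed_se, 2), " predicted:", round(predicted_se, 2), "\n")
cat("mean of means minus mean of marks:", round(mean(means) - mean(marks), 2), "\n")
```

The two spreads should agree closely, and the gap between the mean of means and the mean of the marks should be small compared with either of them.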

But what if we reduce the sample size to 10? And what if we take only 100 iterations with a sample size of 30?

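# Sample size 10, 1000 iterations
y2 <- numeric(1000)
for (i in 1:1000) {
  y2[i] <- mean(sample(x, 10, replace = TRUE))
}
qplot(y2, fill = I("black"))

mean(y2)

## [1] 77.17426

# Sample size 30, 100 iterations
y3 <- numeric(100)
for (i in 1:100) {
  y3[i] <- mean(sample(x, 30, replace = TRUE))
}
qplot(y3, fill = I("yellow"))

mean(y3)

## [1] 70.40265

Mean with 10 samples and 1000 iterations: 77.17

Difference: 10.4

Mean with 30 samples and 100 iterations: 70.4

Difference: 3.63

Original mean: 66.77

These recorded figures come from runs that used x[sample(10, x, replace = T)] and x[sample(30, x, replace = T)], which draw only from the first 10 and the first 30 marks, so the gaps mostly reflect those subsets. With sample(x, n, replace = TRUE), both means of means should sit close to 66.77. What changes is the spread: a smaller sample size widens the histogram of means, and fewer iterations make it rougher.

In scenarios where simulation is possible, this property works great. But in reality, taking 1000 or even 100 samples is not practical.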
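The separate roles of sample size and number of iterations can be seen side by side in a small table. This is a sketch under assumptions: it regenerates its own marks, and `resample_means` is a hypothetical helper written for this example, not part of any package.

```r
# Sketch: sample size controls the spread of the sample means;
# the number of iterations controls how precisely that spread is estimated.
set.seed(500)
marks <- runif(100, 45, 91)

# Hypothetical helper: means of `iters` samples of size `n`, with replacement
resample_means <- function(v, n, iters) {
  replicate(iters, mean(sample(v, n, replace = TRUE)))
}

settings <- data.frame(n = c(10, 30, 30), iters = c(1000, 1000, 100))

results <- do.call(rbind, lapply(seq_len(nrow(settings)), function(i) {
  m <- resample_means(marks, settings$n[i], settings$iters[i])
  data.frame(n = settings$n[i], iters = settings$iters[i],
             mean_of_means = mean(m), sd_of_means = sd(m))
}))
results$gap <- results$mean_of_means - mean(marks)
print(round(results, 2))
```

In every row the gap to the mean of the marks should be small. The sd_of_means column is what shrinks as the sample size goes from 10 to 30.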

Let’s take a more realistic 5 iterations. We will try sample sizes of 10, 20, 30, 40 and 50 to see which gives a mean of means closest to the population mean of 66.77.

# Initialise vectors for sample sizes 10, 20, 30, 40, 50
y4 <- numeric(5)
y5 <- numeric(5)
y6 <- numeric(5)
y7 <- numeric(5)
y8 <- numeric(5)

for (i in 1:5) {
  y4[i] <- mean(sample(x, 10, replace = TRUE))
  y5[i] <- mean(sample(x, 20, replace = TRUE))
  y6[i] <- mean(sample(x, 30, replace = TRUE))
  y7[i] <- mean(sample(x, 40, replace = TRUE))
  y8[i] <- mean(sample(x, 50, replace = TRUE))
}

# Mean of means for each sample size
sapply(list(y4, y5, y6, y7, y8), mean)

The means are as follows:

Original mean: 66.77

Mean with 10 samples for 5 iterations: 77.85

Mean with 20 samples for 5 iterations: 72.13

Mean with 30 samples for 5 iterations: 68.83

Mean with 40 samples for 5 iterations: 69.81

Mean with 50 samples for 5 iterations: 68.12

In the recorded run, around 30 samples seems close enough, considering we are only doing 5 iterations; 30 is the rule-of-thumb number many statisticians go by. These figures were recorded with x[sample(n, x, replace = T)], which draws from the first n marks only, and with just 5 iterations the ranking of sample sizes will change from run to run.

Also keep in mind that we simulated the 100 student scores from a uniform distribution. In reality, marks are often much closer to a normal distribution, in which case the distribution of sample means looks normal with even fewer iterations or smaller samples.

2. Normal Distribution

# Create 100 values from a normal distribution with mean 65 and standard deviation 15
x <- rnorm(100, mean = 65, sd = 15)
#Plot a histogram
qplot (x, fill = I("pink"))

#Display the mean
(original_mean <- mean(x))

## [1] 63.773

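What if we take a random sample of 30?

sample1 <- sample(x, 30, replace = TRUE)
#Plot a histogram
qplot(sample1, fill = I("dodgerblue2"))

#Display the mean
mean(sample1)

## [1] 63.97986

In the recorded run (made with x[sample(x, 30, replace = T)]), a single sample gives a mean of 63.98, very close to the population mean of 63.77. How close one sample lands depends mainly on the spread of the marks and the sample size, roughly 15 / sqrt(30) ≈ 2.7 marks here, rather than on the population being normal. What a near-normal population buys is that the sample means are approximately normally distributed even for small samples.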
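One way to check how close a single sample mean usually gets is to repeat the single-sample experiment many times for both populations. The sketch below does that under assumed settings (its own simulated populations, 5000 repetitions, sample size 30); the names `populations` and `typical_error` are introduced here for illustration.

```r
# Sketch: how far a single sample mean of 30 typically lands from the
# population mean, for a uniform and a normal population.
set.seed(500)
populations <- list(
  uniform = runif(100, 45, 91),
  normal  = rnorm(100, mean = 65, sd = 15)
)

# Typical absolute error of one sample mean, estimated over 5000 single samples
typical_error <- sapply(populations, function(p) {
  errs <- replicate(5000, mean(sample(p, 30, replace = TRUE)) - mean(p))
  mean(abs(errs))
})
print(round(typical_error, 2))
```

Both entries should come out at a similar size, a couple of marks, because the typical error is set by sd / sqrt(30) in each case rather than by the shape of the population.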