Arguably, the most important concept in statistics is the Central Limit Theorem. It states that for any population distribution with mean \(\mu\) and finite standard deviation \(\sigma\), if we repeatedly draw a random sample of a sufficiently large size (which we’ll call ‘n’), take its mean, and repeat that process a very large number of times (which we’ll call ‘t’), we end up with a set of values that is approximately normally distributed around \(\mu\) with a standard deviation of \(\frac{\sigma}{\sqrt{n}}\), which is known as the Standard Error.
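In symbols: if \(\bar{X}\) denotes the mean of a random sample of size \(n\) drawn from that population, the CLT says that for sufficiently large \(n\),
\[ \bar{X} \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right), \qquad se(\bar{X}) = \frac{\sigma}{\sqrt{n}} \]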
Many people are very confused by the Central Limit Theorem (or CLT) when they first learn about it. We hear language like “standard deviation of the sampling distribution of the sample mean” and have trouble wrapping our heads around it. We either give it up as a bad job or learn just enough to plug numbers into formulas on math exams and accept vague statements like “your sample size should be at least 30” without digging deeper.
I personally find visualization very helpful when learning something new, and the CLT is a case where it pays off especially well. To do this, we need to build a simulation. We definitely want to do this with a computer, because a computer can run thousands of trials in a matter of moments and do all the heavy maths for us. We’ll use R to program the simulation and R’s ggplot2 package to visualize the results. I’ll show all the code I’m using so you can follow along, and I’ll make it available on github here. I encourage you to grab it and play with it yourself!
First things first: we need to make sure our R session is configured correctly and set our random number seed so our results are reproducible.
require(ggplot2)
require(knitr)
set.seed(7759)
Now we need to create our population. Note that in the “real world” we’ll never have perfect knowledge of the population we’re studying, but it’ll be a huge help here because we can use it to see how well the CLT works.
It’s important to realize that the CLT works for ANY population distribution with a finite mean and variance, no matter how wacky its shape is. To show this, let’s test it against a pretty weird population. Here’s how we’ll create it:
# Create a non-normal (bimodal) population of 100,000 values from a mixture of two normal distributions
pop1 <- rnorm(20000, mean = 10, sd = 3)
pop2 <- rnorm(80000, mean = 70, sd = 10)
pop <- c(pop1, pop2)
mu <- mean(pop) #calculate the population mean
sigma <- sd(pop) #calculate the population standard deviation
rm(pop1, pop2) #clean up
Let’s take a look at what we’ve got:
popdf <- as.data.frame(pop)
hg <- ggplot(popdf, aes(x = pop)) + geom_histogram(colour = "black", fill = "steelblue") +
ggtitle("Histogram of Population") + xlab("value")
hg
Excellent. Note that we’ve calculated our population mean \(\mu = 58.01\) and our standard deviation \(\sigma = 25.65\). We’ll be referring back to these later.
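If you’re running the code yourself with the same seed, a quick print should reproduce those numbers (rounded to two decimals):
round(c(mu = mu, sigma = sigma), 2) #should agree with the values quoted above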
Time to run the simulation. In order to really see what’s going on with the CLT, we’re going to run this simulation a bunch of times with different values for ‘n’ (the size of each sample we take) and ‘t’ (how many trials we run, i.e. how many samples we draw). We’ll store all this data in a table so we can then visualize it. Note that this is almost certainly not the most efficient way of coding this simulation (there’s a more compact sketch after the simulation code below), but I think it makes it pretty easy to understand how it all works.
n <- c(1, 5, 10, 30, 50, 100) #set up the sample sizes
t <- c(10, 100, 1000, 10000) #set up number of trials in simulation
df <- data.frame() #initialize our empty data frame
# Run the simulation
for(i in n) { #for each sample size n...
  col <- c()
  for(j in t) { #...we loop through each value of t...
    trial <- 1:j
    counter <- j #set up an egg timer based on whichever t value we're on
    value <- c()
    while(counter > 0) { #...and draw a sample of n observations from the population...
      bucket <- sample(pop, i, replace = TRUE)
      xbar <- mean(bucket) #calculate the sample mean...
      value <- c(value, xbar) #...and add it to a vector
      counter <- counter - 1 #egg timer counts down and loops back until it hits 0
    }
    sbar <- sd(value) #calculate the standard deviation of the sample means (the observed standard error)
    col <- cbind(trial, value, sbar, i, j) #stick all the info together...
    df <- rbind(df, col) #...and attach it to our master data frame
  } #and we do it again for the next set of values until we're done!
}
rm(col, bucket, value, counter, i, j, n, sbar, t, xbar, trial) #clean up
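As I mentioned, the nested loops above aren’t the most efficient approach. If you’d like something more compact, here’s a rough alternative sketch using replicate() and expand.grid(); it builds an equivalent "tall" data frame (the names grid and df2 are just placeholders I’ve chosen):
# Hypothetical compact alternative: same simulation, built with replicate()
grid <- expand.grid(samples = c(1, 5, 10, 30, 50, 100),
                    trials  = c(10, 100, 1000, 10000))
df2 <- do.call(rbind, lapply(seq_len(nrow(grid)), function(k) {
  #draw 'trials' samples of size 'samples' and record each sample mean
  xbars <- replicate(grid$trials[k], mean(sample(pop, grid$samples[k], replace = TRUE)))
  data.frame(trial = seq_along(xbars), value = xbars, sdev = sd(xbars),
             samples = grid$samples[k], trials = grid$trials[k])
}))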
# Let's take a look!
str(df)
## 'data.frame': 66660 obs. of 5 variables:
## $ trial: num 1 2 3 4 5 6 7 8 9 10 ...
## $ value: num 78.4 103.8 76.8 5.9 57.1 ...
## $ sbar : num 30.5 30.5 30.5 30.5 30.5 ...
## $ i : num 1 1 1 1 1 1 1 1 1 1 ...
## $ j : num 10 10 10 10 10 10 10 10 10 10 ...
head(df, n = 25) #the full table is too big to look at but we can take a peek at the first few rows.
## trial value sbar i j
## 1 1 78.437079 30.48892 1 10
## 2 2 103.799899 30.48892 1 10
## 3 3 76.757030 30.48892 1 10
## 4 4 5.895178 30.48892 1 10
## 5 5 57.115890 30.48892 1 10
## 6 6 70.078027 30.48892 1 10
## 7 7 75.922762 30.48892 1 10
## 8 8 65.917111 30.48892 1 10
## 9 9 64.930066 30.48892 1 10
## 10 10 10.149737 30.48892 1 10
## 11 1 78.937776 25.46287 1 100
## 12 2 71.634467 25.46287 1 100
## 13 3 51.884210 25.46287 1 100
## 14 4 66.170830 25.46287 1 100
## 15 5 68.572041 25.46287 1 100
## 16 6 64.730679 25.46287 1 100
## 17 7 80.156125 25.46287 1 100
## 18 8 74.827322 25.46287 1 100
## 19 9 69.140215 25.46287 1 100
## 20 10 55.787072 25.46287 1 100
## 21 11 10.639685 25.46287 1 100
## 22 12 82.069167 25.46287 1 100
## 23 13 77.277233 25.46287 1 100
## 24 14 83.392655 25.46287 1 100
## 25 15 11.403050 25.46287 1 100
Looking at the resulting data frame (which we’ve named ‘df’) isn’t very exciting. But, now let’s visualize the results and see what we can see:
# We tidy up our data frame to get it ready for graphing. Note that we built it in "tall"
# form so it's already structured for ggplot
names(df) <- c("trial#", "value", "sdev", "samples", "trials")
# Creating the plot
g <- ggplot(df, aes(x = value)) + geom_density(fill = "steelblue") +
facet_grid(samples ~ trials, labeller = label_both) +
ggtitle("Demonstrating The Central Limit Theorem With Simulation") +
geom_vline(xintercept = mu, linetype = "dashed")
g
Let’s take some time to really digest what we’re looking at here. Each panel represents a simulation with a different combination of sample size ‘n’ and number of trials ‘t’. For example, the top left panel shows the results of drawing a sample of size 1 from the population, taking its mean (which for n = 1 is just the value itself), repeating the process a total of 10 times, then plotting the resulting density function**. Now let’s look at the bottom right panel. Here we’re drawing a sample of size 100 from the population, taking its mean, repeating the process a total of 10,000 times, then plotting the density function.
**side note: we’re using a density function rather than a histogram to make the y-scales work better. If we used a histogram the scale differences between t = 10 and t = 10,000 would result in many panels appearing to be blank!
Remember that our population mean \(\mu = 58.01\). If the CLT holds, we’d expect our plotted density functions to take on a normal, bell-curve-like shape centered around \(\mu\). We’ve drawn a dashed vertical line at \(\mu\) on our graph so we can easily see that this appears to be happening. Take a moment and think about whether increasing ‘n’ or increasing ‘t’ has a greater effect on the shape and ‘tightness’ of the curve around \(\mu\). Notice also that things really start looking good when ‘n’ is 30 or greater, which gives us some intuition for the common rule of thumb that sample sizes should ideally be 30 or more.
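If you want an extra check beyond eyeballing the density curves, here’s a quick sketch using base R’s qqnorm() on one of the larger panels, say n = 30 and t = 10,000 (xbars30 is just a throwaway name):
# Normal Q-Q plot of the simulated sample means for n = 30, t = 10000
xbars30 <- df$value[df$samples == 30 & df$trials == 10000]
qqnorm(xbars30, main = "Q-Q plot of sample means (n = 30, t = 10000)")
qqline(xbars30) #points hugging this line indicate an approximately normal shape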
What about the standard error? Remember that this is a shorthand way of saying “the standard deviation of the sampling distribution of the sample mean”, which doesn’t exactly roll off the tongue. According to the CLT, the standard error should equal \(\frac{\sigma}{\sqrt{n}}\). This makes sense if we think about it: as our sample size ‘n’ increases, each sample mean will tend to land closer to the population mean \(\mu\), so the resulting distribution of means will be more tightly clumped around it. We can see this happening in our graph. But how well does the CLT’s assertion that the standard error is \(\frac{\sigma}{\sqrt{n}}\) hold up against our observed results?
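Before we build the full comparison table, here’s a quick spot check on a single panel (n = 100, t = 10,000) using the data frame we already have; obs_se is just a placeholder name:
# Spot check: observed sd of the sample means vs. the CLT prediction for n = 100
obs_se <- sd(df$value[df$samples == 100 & df$trials == 10000])
c(observed = obs_se, predicted = sigma / sqrt(100)) #these should land very close together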
Let’s create a table of the 24 observed standard deviations of the sample means (6 values of ‘n’ * 4 values of ‘t’) we got from our simulation.
# create data frame of simulated sample standard deviations
m <- matrix(unique(df$sdev), nrow = 4, ncol = 6)
sdf <- as.data.frame(m, row.names = c("t10", "t100", "t1000", "t10000"))
names(sdf) <- c("s1", "s5", "s10", "s30", "s50", "s100")
sdf <- t(sdf) #transposed to match our graph better
rm(m) #clean up
kable(sdf)
|  | t10 | t100 | t1000 | t10000 |
|---|---|---|---|---|
| s1 | 30.488919 | 25.462874 | 26.776761 | 25.453177 |
| s5 | 12.390729 | 11.109437 | 11.285796 | 11.520970 |
| s10 | 6.963056 | 8.270599 | 8.286171 | 8.182902 |
| s30 | 4.087005 | 4.313087 | 4.661867 | 4.676186 |
| s50 | 4.252869 | 3.806862 | 3.745439 | 3.592716 |
| s100 | 2.275072 | 2.734845 | 2.585293 | 2.573638 |
Now let’s calculate what the CLT tells us these values should be for each value of ‘n’. We’re using the standard error formula:
\[ se = \frac{\sigma}{\sqrt{n}} \]
# Calculate our expected standard error from the population standard deviation
exvals <- sigma/sqrt(c(1, 5, 10, 30, 50, 100))
dfex <- as.data.frame(exvals, row.names = c("s1", "s5", "s10", "s30", "s50", "s100"))
names(dfex) <- "Predicted Standard Deviations"
kable(dfex)
|  | Predicted Standard Deviations |
|---|---|
| s1 | 25.653426 |
| s5 | 11.472561 |
| s10 | 8.112326 |
| s30 | 4.683653 |
| s50 | 3.627942 |
| s100 | 2.565343 |
Just eyeballing it, they look pretty close. Let’s calculate the ratio of each observed value to the predicted standard error. The closer the two values are, the closer \(\frac{obs}{se}\) will be to 1.
# Express each observed value as a ratio of the predicted standard error
pexval <- sdf/exvals
Finally, we’ll subtract 1 from these values to calculate the distance each observation is from the calculated standard error and round the results to 3 decimal places to make it easier to read and interpret. If \(\frac{obs}{se} - 1 \approx 0\) then the simulation has provided evidence that the standard error calculation is valid.
# Subtract 1 to show distance from ideal and round to 3 decimals to make it readable
kable(round(pexval - 1, 3))
|  | t10 | t100 | t1000 | t10000 |
|---|---|---|---|---|
| s1 | 0.188 | -0.007 | 0.044 | -0.008 |
| s5 | 0.080 | -0.032 | -0.016 | 0.004 |
| s10 | -0.142 | 0.020 | 0.021 | 0.009 |
| s30 | -0.127 | -0.079 | -0.005 | -0.002 |
| s50 | 0.172 | 0.049 | 0.032 | -0.010 |
| s100 | -0.113 | 0.066 | 0.008 | 0.003 |
Once ‘t’ gets high enough, the observed values get very close to the CLT’s predicted standard error, which suggests that running a simulation with an even higher ‘t’ would bring them closer still.
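If you want to test that for yourself, here’s a quick one-off sketch (not part of the simulation above) that pushes ‘t’ up to 100,000 for a single sample size and compares the result to \(\frac{\sigma}{\sqrt{30}}\); xbars_big is a throwaway name:
# One-off check with a much larger t (100,000 trials, n = 30)
xbars_big <- replicate(100000, mean(sample(pop, 30, replace = TRUE)))
sd(xbars_big) / (sigma / sqrt(30)) #should come out very close to 1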
Through the power of simulation, we’ve visualized the Central Limit Theorem in action and seen direct evidence that it is valid. Hopefully this demonstration has helped provide some insight into how the CLT works. I encourage you to monkey around with the parameters: change the ‘n’, ‘t’, and seed values and run some more experiments!