Central Limit Theorem Visualized
April James Palermo
2023-06-25
Intro
The central limit theorem is a primary concept in probability theory that has applications in many areas of
statistics.
The central limit theorem states that if you have a population with mean μ and standard deviation
σ and take sufficiently large random samples from the population with replacement, then the
distribution of the sample means will be approximately normally distributed. This will hold true
regardless of whether the source population is normal or skewed, provided the sample size is
sufficiently large (usually n > 30). (LaMorte, 2016)
Put simply, it means that if you take a large enough sample size you can use assumptions from the normal
distribution to make inferences, even if the population you are sampling from is not normally distributed.
We can visualize this using simulations. Let’s generate a sample from an exponential distribution.
exponential_population <- rexp(100000)
# Create histogram
hist(exponential_population)Histogram of exponential_population
exponential_population
Frequency
0 2 4 6 8 10 12 14
0 20000 40000 60000
1
# View the first 100 values
exponential_population[1:100]
## [1] 1.021420203 0.698846637 0.230344569 1.125318001 0.214058104 0.809094006
## [7] 0.768993227 0.185062981 0.848800521 0.442284249 0.705711233 0.494117017
## [13] 0.681895133 1.113334544 2.238097517 0.676328768 0.357036500 0.318299998
## [19] 1.973217713 0.038775117 0.754847746 2.449630307 0.607052263 0.245286069
## [25] 0.975712093 1.758369857 0.987079609 1.124864893 1.536624826 2.484413219
## [31] 0.438689952 0.020732639 0.638129859 1.905802613 2.236279258 1.194328279
## [37] 1.772907946 0.227517834 0.269432545 1.358788465 0.022795807 1.108755641
## [43] 1.377392091 0.156010434 0.827379383 0.240955117 0.481678377 0.058278495
## [49] 0.686649813 0.297594023 0.666388515 0.764499105 1.478031928 0.867132050
## [55] 0.972330206 1.428553048 0.383916857 1.296721676 0.241552247 0.919308922
## [61] 0.079250519 0.652509212 1.306592660 1.166858869 2.867425985 0.873779650
## [67] 1.387650156 3.064533286 0.199100057 0.320643250 1.274504468 2.063898723
## [73] 0.234148304 6.209300145 0.609590197 0.130943286 1.823048372 1.364293726
## [79] 2.601843966 1.392299559 0.234483520 0.489223579 0.480168186 0.815513288
## [85] 1.924655661 0.032626613 0.452030270 0.174517500 1.387740182 0.375357188
## [91] 0.223378924 0.037905969 0.005958029 0.965610557 0.391087098 0.112160693
## [97] 0.371659569 4.375257987 0.587934098 1.879095979
# Summary stats
summary(exponential_population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000001 0.286986 0.691188 0.999883 1.387432 13.255734
boxplot(exponential_population)0 2 4 6 8 10 12
We can see that the distribution has a long tail, mean of 1, and median 0.69.
According to the CLT if we take repeated samples with replacement from this population, the sample means
will be normally distributed (provided the sample size is sufficient). We will start with samples of 100 random
observations repeated 10 times then observe the sample means.
samples <- array()
for(i in 1:10){
s <- sample(exponential_population, 100)
2
mu <- mean(s)
samples[i] <- mu
}
samples
## [1] 0.9751920 1.1270565 1.0494598 1.0562399 0.8667646 1.0053993 0.9727791
## [8] 1.1400212 0.9490733 0.9948863
hist(samples)Histogram of samples
samples
Frequency
0.85 0.90 0.95 1.00 1.05 1.10 1.15
0 1 2 3
The distribution looks more normal than exponantial, but not quite there yet. In order for CLT to hold, we
need to sample the population a sufficiently large number of times. Let’s try 100 samples:
samples <- array()
for(i in 1:100){
s <- sample(exponential_population, 100)
mu <- mean(s)
samples[i] <- mu
}
samples
## [1] 0.9661506 0.9541129 0.9352923 1.0195576 0.9199149 0.9668843 1.0750143
## [8] 0.9302495 1.1545761 0.9346216 1.1522422 0.8881713 0.9399624 0.9253655
## [15] 1.0732246 1.0930584 1.1056995 1.2316355 0.9812106 1.2076004 1.0224202
## [22] 0.9284465 0.9133240 1.0633815 0.9288824 0.8853675 0.9608395 1.0928315
## [29] 0.9588437 0.9661512 1.0374025 1.1394718 1.0512028 1.0663262 0.8399453
## [36] 0.9783965 0.8719886 1.1117454 0.9579027 0.8806490 0.9533344 0.9674913
## [43] 0.9570956 1.0282709 0.9763893 0.8468531 0.9446544 1.1264172 1.0581555
3
## [50] 1.0289698 1.0280887 0.9249991 0.9776168 1.0588844 1.1200003 1.0930015
## [57] 0.8803831 0.8710322 0.8633251 1.0711742 0.8719949 1.1868083 1.0192422
## [64] 0.9555150 1.1748106 1.1278829 0.8207299 1.1137419 0.9919652 0.9629441
## [71] 0.8935032 1.0193064 1.1072296 1.1201330 1.0125745 0.9572441 1.0556847
## [78] 1.0859512 0.9703408 1.0866412 0.9824862 0.9919420 0.9375259 0.9202279
## [85] 0.9517425 0.9405721 1.1357943 0.8123255 1.0247732 1.0302315 1.0609358
## [92] 1.0378936 0.7831558 0.7865689 1.0156439 0.8263444 0.9086112 0.9609420
## [99] 0.9791703 0.9375027
hist(samples)Histogram of samples
samples
Frequency
0.8 0.9 1.0 1.1 1.2
0 5 10 15 20
This looks even moreso normal, with tails being much smaller and the center of the distribution having more
definition. Let’s try 100,000:
samples <- array()
for(i in 1:5000){
s <- sample(exponential_population, 100)
mu <- mean(s)
samples[i] <- mu
}
hist(samples)
4
Histogram of samples
samples
Frequency
0.8 1.0 1.2 1.4
0 200 400 600 800For comparison, let’s plot samples from a normal distribution:
hist(rnorm(5000, mean=1))Histogram of rnorm(5000, mean = 1)
rnorm(5000, mean = 1)
Frequency
−2 0 2 4
0 200 400 600 800 1000
As mentioned, this works for any distribution. To demonstrate, let’s simulate data from other distributions
and visualize.
5
# Function for simulating data from a given distribution
generate_dist <- function(name, sample_func, n, ss=0.1*n, sims){
dist <- {}
dist$name <- name
dist$obs <- sample_func(n)
dist$ss <- ss
dist$sample_means <- array()
for(i in seq(1:sims)){
s <- sample(dist$obs, ss, replace=TRUE)
mu <- mean(s)
dist$sample_means[i] <- mu
}
return(dist)
}
# Modified sample functions to set default parameters
rpois_mod <- function(n){return(rpois(n=n, lambda=1))}
beta_05_05 <- function(n){return(rbeta(n=n, shape1=0.5, shape2=0.5))}
beta_5_1 <- function(n){return(rbeta(n=n, shape1=5, shape2=1))}
# Function for generating a list containing data from several distributions
generate_dist_list <- function(n, ss, sims){
list_out <- list(
generate_dist("Normal(0,1)", rnorm, n, ss, sims),
generate_dist("Exponential(1)", rexp, n, ss, sims),
generate_dist("Uniform(0,1)", runif, n, ss, sims),
generate_dist("Poisson(1)", rpois_mod, n, ss, sims),
generate_dist("Beta(0.5,0.5)", beta_05_05, n, ss, sims),
generate_dist("Beta(5,1)", beta_5_1, n, ss, sims)
)
return(list_out)
}
# Create distribution list
dist_list <- generate_dist_list(1000000, ss=50, sims=10000)
# Set up and iterate through data list to plot
par(mfrow=c(2,3))
for(dist in dist_list){
n_size <- format(length(dist$obs), big.mark=",")
hist(dist$obs, main=paste0(dist$name, " \n (N=", n_size, ")"))
}
6
Normal(0,1)
(N=1,000,000)
dist$obs
Frequency
−4 −2 0 2 4
0 100000
Exponential(1)
(N=1,000,000)
dist$obs
Frequency
0 2 4 6 8 12
0e+00 2e+05 4e+05
Uniform(0,1)
(N=1,000,000)
dist$obs
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0 20000 50000
Poisson(1)
(N=1,000,000)
dist$obs
Frequency
0 2 4 6 8
0e+00 2e+05
Beta(0.5,0.5)
(N=1,000,000)
dist$obs
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0 60000 140000
Beta(5,1)
(N=1,000,000)
dist$obs
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0 100000What happens to sample mean distributions if we use a small sample size (n=5)?
dist_list <- generate_dist_list(1000000, ss=5, sims=10000)
par(mfrow=c(2,3))
for(dist in dist_list){
ss <- format(dist$ss)
sims <- format(length(dist$sample_means), big.mark=",")
hist(dist$sample_means,
main=paste0(dist$name, " Sample Dist. \n (n=", dist$ss, ", ", sims, " simulations)"))
}
7
Normal(0,1) Sample Dist.
(n=5, 10,000 simulations)
dist$sample_means
Frequency
−2 −1 0 1
0 500 1500
Exponential(1) Sample Dist.
(n=5, 10,000 simulations)
dist$sample_means
Frequency
0 1 2 3
0 500 1500
Uniform(0,1) Sample Dist.
(n=5, 10,000 simulations)
dist$sample_means
Frequency
0.2 0.4 0.6 0.8
0 500 1000
Poisson(1) Sample Dist.
(n=5, 10,000 simulations)
dist$sample_means
Frequency
0.0 1.0 2.0 3.0
0 500 1500
Beta(0.5,0.5) Sample Dist.
(n=5, 10,000 simulations)
dist$sample_means
Frequency
0.0 0.2 0.4 0.6 0.8 1.0
0 400 800 1200
Beta(5,1) Sample Dist.
(n=5, 10,000 simulations)
dist$sample_means
Frequency
0.5 0.6 0.7 0.8 0.9 1.0
0 1000 2500A large number of sample means from small samples still appears normal when the population is normally
distributed, but you can see that samples from other distributions have non-normal traits. Exponential,
Poisson, and Beta(α=5, β=1) all show skew, while Beta(α=0.5, β=0.5) and Uniform(0,1) sample mean
distributions have heavier tails.
Repeated again, this time with a larger sample size (n=100):
dist_list <- generate_dist_list(1000000, ss=100, sims=10000)
par(mfrow=c(2,3))
for(dist in dist_list){
ss <- format(dist$ss)
sims <- format(length(dist$sample_means), big.mark=",")
hist(dist$sample_means,
main=paste0(dist$name, " Sample Dist. \n (n=", dist$ss, ", ", sims, " simulations)"))
}
8
Normal(0,1) Sample Dist.
(n=100, 10,000 simulations)
dist$sample_means
Frequency
−0.4 0.0 0.2 0.4
0 500 1500
Exponential(1) Sample Dist.
(n=100, 10,000 simulations)
dist$sample_means
Frequency
0.6 0.8 1.0 1.2 1.4
0 500 1500
Uniform(0,1) Sample Dist.
(n=100, 10,000 simulations)
dist$sample_means
Frequency
0.40 0.50 0.60
0 1000 2000
Poisson(1) Sample Dist.
(n=100, 10,000 simulations)
dist$sample_means
Frequency
0.6 0.8 1.0 1.2 1.4
0 500 1500
Beta(0.5,0.5) Sample Dist.
(n=100, 10,000 simulations)
dist$sample_means
Frequency
0.35 0.45 0.55 0.65
0 500 1500
Beta(5,1) Sample Dist.
(n=100, 10,000 simulations)
dist$sample_means
Frequency
0.78 0.82 0.86
0 400 1000Back to normality!
#What caused the paragraph to be indented? The command > caused the paragraph to be indented.
#What command(s) caused the bold heading? The command ## caused the bold heading.
9