Visualization of the Central Limit Theorem (CLT)

The central limit theorem (CLT) states that if a large number of random samples are drawn from a population with finite variance, the distribution of the sample means is approximately normal, even if the population itself is not normally distributed. The approximation improves as the sample size grows; a sample size of 30 or more is a common rule of thumb.

Features of the CLT

  • The mean, median and mode of the sample means coincide at the center of the distribution
  • About 68% of the sample means lie within 1 standard deviation of the mean
  • About 95% of the sample means lie within 1.96 standard deviations of the mean
  • About 99% of the sample means lie within 2.58 standard deviations of the mean
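
The 68%, 95% and 99% figures above are properties of the normal distribution itself, so they can be checked with pnorm() before running any simulation. The following is a minimal sketch; the helper name prob_within is hypothetical and used only for this illustration.

```r
# Probability that a standard normal value lies within k standard deviations
# of the mean: P(-k <= Z <= k) = pnorm(k) - pnorm(-k).
# `prob_within` is a hypothetical helper name, not part of the original code.
prob_within = function(k) {
  pnorm(k) - pnorm(-k)
}

coverage = sapply(c(1, 1.96, 2.58), prob_within)
round(coverage * 100, 1) # percentages for 1, 1.96 and 2.58 SD
```

The three results round to 68.3%, 95.0% and 99.0%, which is where the multipliers 1, 1.96 and 2.58 come from.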

z-values calculation

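Two quantities describe any point on a distribution: the quantile (a value on the x-axis) and the cumulative probability (the area under the curve to the left of that value). The y-axis of the curve shows density, not probability.

In a standard normal distribution, z-values are the quantiles (points on the x-axis) and p-values are the corresponding probabilities (the area under the curve up to a given quantile). qnorm() converts a probability into a z-value, and pnorm() converts a z-value into a cumulative probability.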

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
qnorm(p = 0.95) # one-tailed at the 5% level; returns the z-value (quantile)
## [1] 1.644854
qnorm(p = 0.975) # two-tailed at the 5% level (1 - 0.025 = 0.975)
## [1] 1.959964

P-values from a normal distribution

pnorm(1.959964)
## [1] 0.975
pnorm(2.58)
## [1] 0.99506
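
The two functions above are inverses of each other, which a short round trip makes visible. This is a small sketch; the variable names z, p and two_tailed are chosen here for illustration.

```r
z = qnorm(0.975) # quantile that leaves 97.5% of the area to its left
p = pnorm(z)     # converting back recovers the probability 0.975

# Two-tailed area outside +/- z: 2.5% in each tail, 5% in total
two_tailed = 2 * (1 - pnorm(abs(z)))

c(z = z, p = p, two_tailed = two_tailed)
```

This is why 1.96 is paired with a two-tailed 5% level, while 1.645 is paired with a one-tailed 5% level.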

Create a hypothetical population and draw samples from it

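set.seed(99) # fix the random seed so the results are reproducible
population = runif(n = 1000000, min = 0, max = 100) # uniform (non-normal) population
n_sample = 100000 # number of samples to draw
sample_size = 30 # observations per sample

sample_means = numeric(n_sample) # pre-allocate the result vector

# Draw each sample with replacement and store its mean
for (i in seq_len(n_sample)) {
  current_sample = sample(population, size = sample_size, replace = TRUE)
  sample_means[i] = mean(current_sample)
}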
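
The loop above can also be written with replicate(), which evaluates an expression repeatedly and collects the results. This is only a sketch of an alternative, run on a smaller population to keep it quick; demo_population, n_draws, draw_size and demo_means are hypothetical names and sizes, not part of the original code.

```r
set.seed(99)
demo_population = runif(n = 100000, min = 0, max = 100) # smaller than the original, for speed
n_draws = 2000  # number of samples (the original uses 100000)
draw_size = 30  # observations per sample

# replicate() evaluates the expression n_draws times and returns a numeric vector
demo_means = replicate(
  n_draws,
  mean(sample(demo_population, size = draw_size, replace = TRUE))
)

length(demo_means) # one mean per sample
```

Separate variable names are used so that running the sketch does not overwrite population or sample_means.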

View the distribution using ggplot

ggplot() works from a data frame and a set of aesthetic mappings, so the sample means are first placed in a data frame.

plot_data = data.frame(Means = sample_means)

ggplot(data = plot_data, aes(x = Means)) +
  geom_histogram(bins = 60, aes(y = after_stat(density)), color = 'black', fill = 'blue', alpha = 0.5) +
  geom_density(color = 'red', linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem (CLT)", subtitle = "CLT using uniform population", x = "Sample means", y = "Density or relative frequency")
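
To judge how close the histogram is to a normal curve, a theoretical normal density with the same mean and SD can be drawn on top of it. The sketch below assumes ggplot2 3.4.0 or later (for linewidth and after_stat()) and uses a smaller simulation; curve_population, curve_means, curve_data and curve_plot are hypothetical names used only here.

```r
library(ggplot2)

set.seed(99)
curve_population = runif(n = 100000, min = 0, max = 100)
curve_means = replicate(2000, mean(sample(curve_population, size = 30, replace = TRUE)))
curve_data = data.frame(Means = curve_means)

# Histogram of the sample means plus a normal curve that uses their own mean and SD
curve_plot = ggplot(data = curve_data, aes(x = Means)) +
  geom_histogram(bins = 40, aes(y = after_stat(density)), color = 'black', fill = 'blue', alpha = 0.5) +
  stat_function(fun = dnorm,
                args = list(mean = mean(curve_means), sd = sd(curve_means)),
                color = 'red', linewidth = 1.5) +
  theme_bw() +
  labs(x = "Sample means", y = "Density")

curve_plot
```

Unlike geom_density(), which smooths the observed data, stat_function() draws the curve the CLT predicts, so any mismatch between the bars and the line is visible.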

Verify the normal distribution features of this curve

Range of sample means

min(sample_means)
## [1] 27.77249
max(sample_means)
## [1] 72.77555

Mean and SD of sample means

mean(sample_means)
## [1] 50.00928
sd(sample_means)
## [1] 5.270791
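
The simulated SD can be compared with what theory predicts. For a uniform population on [min, max], the standard deviation is (max - min) / sqrt(12), and the standard deviation of the sample mean (the standard error) is the population SD divided by sqrt(n). A short sketch, with variable names chosen here for illustration:

```r
pop_min = 0
pop_max = 100
n_per_sample = 30 # same sample size as in the simulation

# SD of a uniform distribution on [pop_min, pop_max]
theoretical_sd = (pop_max - pop_min) / sqrt(12)

# Standard error of the mean: population SD divided by sqrt(sample size)
standard_error = theoretical_sd / sqrt(n_per_sample)

c(theoretical_sd = theoretical_sd, standard_error = standard_error)
```

The standard error comes out at about 5.27, close to the simulated value of 5.270791, and the theoretical mean of 50 is close to the simulated 50.00928.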

Range within one SD

mean(sample_means) - sd(sample_means)
## [1] 44.73848
mean(sample_means) + sd(sample_means)
## [1] 55.28007

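We will categorize the sample means into three classes: below, within and above 1 SD of the mean. The breaks are hard-coded values close to the mean ± 1 SD computed above (they differ from it in the third decimal place).

table(cut(sample_means, breaks = c(-Inf, 44.73292, 55.28573, Inf), labels = c("Up to 44.73", "Within 44.73 and 55.28", "Greater than 55.28"))) * 100 / length(sample_means)
## 
##            Up to 44.73 Within 44.73 and 55.28     Greater than 55.28 
##                 16.017                 68.011                 15.972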

We see that about 68% (68.011% in the table above) of the sample means fall within 1 SD of their mean, with roughly 16% in each tail.
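
The same check can be made without hard-coded breaks, and extended to 1.96 and 2.58 SD, by converting each sample mean into its distance from the mean in SD units. This is a sketch on a smaller simulation; check_population, check_means, z_scores and within_pct are hypothetical names, and the exact percentages will vary with the seed and the number of samples.

```r
set.seed(99)
check_population = runif(n = 100000, min = 0, max = 100)
check_means = replicate(5000, mean(sample(check_population, size = 30, replace = TRUE)))

# Distance of every sample mean from the overall mean, in SD units
z_scores = (check_means - mean(check_means)) / sd(check_means)

# Share of sample means within 1, 1.96 and 2.58 SD of the mean, in percent
within_pct = sapply(c(1, 1.96, 2.58), function(k) mean(abs(z_scores) <= k) * 100)
names(within_pct) = c("1 SD", "1.96 SD", "2.58 SD")
within_pct
```

Because mean() of a logical vector gives the proportion of TRUE values, no table() or cut() call is needed.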

Creating a standard normal distribution from the distribution of the sample means

plot_data = data.frame(Means = sample_means)

# scale() returns a one-column matrix, so convert it to a plain numeric vector
plot_data$standardized_means = as.numeric(scale(sample_means))

ggplot(data = plot_data, aes(x = standardized_means)) +
  geom_histogram(bins = 60, aes(y = after_stat(density)), color = 'black', fill = 'blue', alpha = 0.5) +
  geom_density(color = 'red', linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem (CLT)", subtitle = "CLT using uniform population", x = "Standardized values", y = "Density or relative frequency") +
  scale_x_continuous(breaks = c(-2.58, -1.96, 0, +1.96, +2.58))
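
Standardizing with scale() subtracts the mean and divides by the SD, so the result has mean 0 and SD 1. The sketch below confirms this on a small hypothetical vector (raw_values, z_values and manual_z are names introduced only for this illustration).

```r
set.seed(99)
raw_values = runif(n = 1000, min = 0, max = 100) # hypothetical data for illustration

# scale() centers and rescales; as.numeric() drops the matrix shape it returns
z_values = as.numeric(scale(raw_values))

# The same standardization written out by hand
manual_z = (raw_values - mean(raw_values)) / sd(raw_values)

all.equal(z_values, manual_z)               # both routes give the same values
c(mean = mean(z_values), sd = sd(z_values)) # mean is 0 (up to rounding error), SD is 1
```

Standardizing changes only the units of the x-axis, not the shape of the histogram, which is why the breaks at ±1.96 and ±2.58 can be read directly as z-values.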