Visualization of the central limit theorem

The central limit theorem (CLT) states that if a large number of random samples (each of size 30 or more, by a common rule of thumb) are drawn from a population with finite variance, the distribution of the sample means will be approximately normal, even if the population itself is not normally distributed.

Features of CLT

  • The mean, median, and mode of the sample means coincide at the center of the distribution
  • About 68% of the sample means lie within 1 standard deviation of their mean
  • About 95% of the sample means lie within 1.96 standard deviations of their mean
  • About 99% of the sample means lie within 2.58 standard deviations of their mean
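
The percentages above come from the standard normal distribution. As a quick check, the area between -k and +k standard deviations can be computed with pnorm(). This is a minimal sketch; the helper name `area_within` is introduced here for illustration and is not part of base R.

```r
# Area under the standard normal curve between -k and +k standard deviations.
# `area_within` is a hypothetical helper name used only in this example.
area_within <- function(k) pnorm(k) - pnorm(-k)

# Evaluate at the three thresholds listed above
round(area_within(c(1, 1.96, 2.58)), 3)
```

The three results are roughly 0.68, 0.95, and 0.99, matching the list.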

z-values calculation

In a distribution, two kinds of values are important: quantiles (points on the x-axis) and probabilities (areas under the curve; the y-axis itself shows density).

In a normal distribution, z-values are the quantiles (points on the x-axis), and p-values are probabilities (the area under the curve to the left of a given quantile).

qnorm(p = 0.95)  # one-tailed critical z-value at the 5% level
## [1] 1.644854
qnorm(p = 0.975) # two-tailed critical z-value at the 5% level
## [1] 1.959964

p-values from a normal distribution

pnorm(1.959964)
## [1] 0.975
pnorm(2.58)
## [1] 0.99506
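
qnorm() and pnorm() are inverses of each other, and a two-sided tail probability can be obtained from a z-value by doubling the area beyond |z|. A minimal sketch, with variable names chosen here for illustration:

```r
z <- qnorm(0.975)                   # quantile that leaves 2.5% in the upper tail
p_lower <- pnorm(z)                 # back to the cumulative probability (0.975)
p_two_sided <- 2 * pnorm(-abs(z))   # total area in both tails beyond |z|

c(z = z, p_lower = p_lower, p_two_sided = p_two_sided)
```

Here `p_two_sided` comes out at 0.05, which is why 1.96 is the usual cut-off for a two-tailed test at the 5% level.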

Create a hypothetical population

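set.seed(99)  # make the simulation reproducible
population = runif(n = 1000000, min = 0, max = 100)  # uniform, clearly non-normal population
n_sample = 100000   # number of samples to draw
sample_size = 30    # observations per sample

sample_means = numeric(n_sample)  # pre-allocate the result vector

for (i in 1:n_sample) {
  current_sample = sample(population, size = sample_size, replace = TRUE)
  sample_means[i] = mean(current_sample)  # store the mean of each sample
}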
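
The same loop can be written more compactly with replicate(). The following is a sketch on a smaller scale than the original (fewer values and fewer samples, so it runs quickly); the object names `pop_small` and `means_small` are introduced here and are not part of the analysis above.

```r
set.seed(99)
pop_small <- runif(n = 100000, min = 0, max = 100)  # smaller hypothetical population

# Draw 5000 samples of size 30 and keep the mean of each one
means_small <- replicate(5000, mean(sample(pop_small, size = 30, replace = TRUE)))

length(means_small)
summary(means_small)
```

replicate() evaluates the expression once per repetition and collects the results into a vector, so no pre-allocation or index variable is needed.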

View the distribution using ggplot

ggplot() works from a data frame and an aesthetic mapping.

library(tidyverse)
plot_data = data.frame(Means = sample_means)
ggplot(data = plot_data, aes(x = Means)) +
  geom_histogram(bins = 60, aes(y = after_stat(density)), color = "black", fill = "blue", alpha = 0.5) +
  geom_density(color = "red", linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem", subtitle = "CLT using uniform population", x = "Sample mean", y = "Density or relative frequency")
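
To judge how close the histogram is to a normal curve, a normal density with the same mean and standard deviation can be drawn on top of it. The sketch below regenerates a smaller set of sample means so that it runs on its own, drawing directly from runif() rather than from the stored population; `demo_means` and `demo_data` are names introduced here, and a ggplot2 version of 3.4.0 or later is assumed.

```r
library(ggplot2)

set.seed(99)
# Means of 5000 samples of size 30 drawn directly from Uniform(0, 100)
demo_means <- replicate(5000, mean(runif(30, min = 0, max = 100)))
demo_data <- data.frame(Means = demo_means)

p <- ggplot(demo_data, aes(x = Means)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60,
                 color = "black", fill = "blue", alpha = 0.5) +
  # Normal density with the same mean and sd as the simulated means
  stat_function(fun = dnorm,
                args = list(mean = mean(demo_means), sd = sd(demo_means)),
                color = "red", linewidth = 1.5) +
  theme_bw() +
  labs(x = "Sample mean", y = "Density")
p
```

If the CLT approximation is good, the red curve should follow the tops of the histogram bars closely.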

Check the normal distribution features of this curve.

Range of sample means

min(sample_means)
## [1] 27.77249
max(sample_means)
## [1] 72.77555
mean(sample_means)
## [1] 50.00928
sd(sample_means)
## [1] 5.270791
mean(sample_means) - sd(sample_means)
## [1] 44.73848
mean(sample_means) + sd(sample_means)
## [1] 55.28007
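
These values can be compared with what theory predicts. For a Uniform(0, 100) population the mean is 50 and the standard deviation is 100/sqrt(12), so the standard deviation of the means of samples of size 30 (the standard error) should be close to that value divided by sqrt(30). A short sketch, with variable names chosen here for illustration:

```r
pop_min <- 0
pop_max <- 100
n <- 30                                     # sample size used in the simulation

pop_mean  <- (pop_min + pop_max) / 2        # mean of Uniform(min, max)
pop_sd    <- (pop_max - pop_min) / sqrt(12) # sd of Uniform(min, max)
std_error <- pop_sd / sqrt(n)               # expected sd of the sample means

c(pop_mean = pop_mean, pop_sd = pop_sd, std_error = std_error)
```

The expected standard error is about 5.27, which agrees with the sd(sample_means) value reported above.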

We will categorize the sample means into three classes, using break points close to the mean ± 1 sd computed above.

table(cut(sample_means, breaks = c(-Inf, 44.73298, 55.25235, +Inf), labels = c("up to 44.73", "within 44.73 and 55.25", "greater than 55.25"))) * 100 / n_sample
## 
##            up to 44.73 within 44.73 and 55.25     greater than 55.25 
##                 16.017                 67.844                 16.139

We see that about 68% (67.844%) of the sample means fall within 1 sd of their mean.
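
The same check can be made for all three thresholds without hard-coding the break points, by standardizing the means and counting how many fall within k standard deviations. The sketch regenerates a smaller set of sample means so that it runs on its own; `check_means` and `within_k` are hypothetical names used only here.

```r
set.seed(99)
# Means of 20000 samples of size 30 drawn directly from Uniform(0, 100)
check_means <- replicate(20000, mean(runif(30, min = 0, max = 100)))

# z-scores of the simulated sample means
z <- (check_means - mean(check_means)) / sd(check_means)

# Percentage of sample means within k standard deviations of their mean
within_k <- sapply(c(1, 1.96, 2.58), function(k) 100 * mean(abs(z) <= k))
names(within_k) <- c("1 sd", "1.96 sd", "2.58 sd")
round(within_k, 1)
```

The percentages should land near 68, 95, and 99, with small differences from run to run because of sampling noise.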

Creating a standard normal distribution from the distribution of the sample means

library(tidyverse)
plot_data = data.frame(Means = sample_means)

# scale() returns a one-column matrix, so convert it to a plain numeric vector
plot_data$standardized_means = as.numeric(scale(sample_means))
ggplot(data = plot_data, aes(x = standardized_means)) +
  geom_histogram(bins = 60, aes(y = after_stat(density)), color = "black", fill = "blue", alpha = 0.5) +
  geom_density(color = "red", linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem", subtitle = "CLT using uniform population", x = "Standardized values", y = "Density or relative frequency") +
  scale_x_continuous(breaks = c(-2.58, -1.96, 0, +1.96, +2.58))
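
scale() centers the values on their mean and divides by their standard deviation, so the standardized means have mean 0 and standard deviation 1. A small sketch with a made-up vector `x` (not taken from the simulation) shows that the manual calculation gives the same result:

```r
x <- c(42.1, 47.5, 50.3, 53.8, 58.9)   # made-up values for illustration

z_manual <- (x - mean(x)) / sd(x)      # standardize by hand
z_scale  <- as.numeric(scale(x))       # scale() returns a one-column matrix

all.equal(z_manual, z_scale)           # compare the two results
c(mean = mean(z_manual), sd = sd(z_manual))
```

After standardizing, the break points -2.58, -1.96, 1.96, and 2.58 on the x-axis can be read directly as z-values.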

table(cut(sample_means, breaks = c(-Inf, 44.74292, 55.2573, Inf), labels = c("up to 44.73", "within 44.73 and 55.28", "greater than 55.28")))