The central limit theorem states that if a large number of random samples (each of size 30 or more, by a common rule of thumb) are drawn from a population with finite variance, the distribution of the sample means will be approximately normal, even if the population itself is not normally distributed.
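In symbols, the statement can be sketched as follows. The symbols are introduced here for illustration and are not from the original text: the population is assumed to have mean \mu and finite standard deviation \sigma, and \bar{X}_n is the mean of a sample of size n.

```latex
\bar{X}_n \;\overset{\text{approx.}}{\sim}\; \mathcal{N}\!\left(\mu,\ \frac{\sigma^2}{n}\right)
\quad\text{for large } n,
\qquad\text{equivalently}\qquad
Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \;\overset{\text{approx.}}{\sim}\; \mathcal{N}(0, 1).
```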
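For any distribution, two quantities are important to know: quantile values (points on the x-axis) and probabilities (areas under the curve; the y-axis itself shows density, not probability).
In a standard normal distribution, a z-value is a quantile (a point on the x-axis), and the matching cumulative probability p is the area under the curve to the left of that quantile. qnorm() converts a probability into a z-value, and pnorm() converts a z-value back into a probability.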
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
qnorm(p = 0.95) # for one-tail, produces z-value (quantile value)
## [1] 1.644854
qnorm(p = 0.975) # for two-tail at the 5% level (1 - 0.025 = 0.975)
## [1] 1.959964
pnorm(1.959964)
## [1] 0.975
pnorm(2.58)
## [1] 0.99506
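set.seed(99) # make the simulation reproducible
# Population: one million draws from a uniform distribution on [0, 100] (clearly non-normal)
population = runif(n = 1000000, min = 0, max = 100)
n_sample = 100000 # number of samples to draw
sample_size = 30 # observations per sample
sample_means = numeric(n_sample) # preallocate storage for the sample means
for (i in 1:n_sample) {
  # draw one sample with replacement and store its mean
  current_sample = sample(population, size = sample_size, replace = TRUE)
  sample_means[i] = mean(current_sample)
}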
ggplot() needs a data frame and aesthetic mappings, so the sample means are first placed in a data frame.
plot_data = data.frame(Means = sample_means)
ggplot(data = plot_data, aes(x = Means)) +
  geom_histogram(bins = 60, aes(y = after_stat(density)), color = 'black', fill = 'blue', alpha = 0.5) +
  geom_density(color = 'red', linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem (CLT)", subtitle = "CLT using uniform population", x = "Sample means", y = "Density or relative frequency")
Range of sample means
min(sample_means)
## [1] 27.77249
max(sample_means)
## [1] 72.77555
Mean and SD of sample means
mean(sample_means)
## [1] 50.00928
sd(sample_means)
## [1] 5.270791
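These simulated values can be checked against theory. The sketch below assumes the population is a continuous Uniform(0, 100), whose mean is (min + max) / 2 and whose standard deviation is (max - min) / sqrt(12). The variable names are illustrative and are not part of the original code.

```r
# Theoretical values for a Uniform(0, 100) population (assumed continuous uniform)
pop_min <- 0
pop_max <- 100
n <- 30                                       # sample size used in the simulation

pop_mean  <- (pop_min + pop_max) / 2          # 50
pop_sd    <- (pop_max - pop_min) / sqrt(12)   # about 28.87
std_error <- pop_sd / sqrt(n)                 # about 5.27, the expected SD of the sample means

c(mean = pop_mean, sd = pop_sd, se = std_error)
```

The simulated mean (50.00928) and SD (5.270791) of the sample means are very close to these theoretical values of 50 and about 5.27.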
Range within one SD
mean(sample_means) - sd(sample_means)
## [1] 44.73848
mean(sample_means) + sd(sample_means)
## [1] 55.28007
We will sort the sample means into three classes, using cut points close to the mean ± 1 SD values computed above.
table(cut(sample_means, breaks = c(-Inf, 44.73292, 55.28573, Inf), labels = c("Upto 44.73", "Within 44.73 and 55.28", "Greater than 55.28")))*100/100000
##
## Upto 44.73 Within 44.73 and 55.28 Greater than 55.28
## 16.017 68.011 15.972
We see that about 68% (68.011%) of the sample means fall within 1 SD of their mean, which is close to the 68.27% expected under a normal distribution.
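The share within one SD can also be computed directly from the data, without typing cut points by hand. The sketch below is illustrative only: prop_within_one_sd is a hypothetical helper name, and the small simulation stands in for the sample_means vector built earlier.

```r
# Hypothetical helper: share of values lying within one SD of their own mean
prop_within_one_sd <- function(x) {
  mean(abs(x - mean(x)) <= sd(x))
}

# Stand-in data: means of 10,000 samples of size 30 from Uniform(0, 100)
set.seed(1)
demo_means <- replicate(10000, mean(runif(30, min = 0, max = 100)))

observed <- prop_within_one_sd(demo_means)   # share observed in the simulation
expected <- pnorm(1) - pnorm(-1)             # share under a normal curve, about 0.6827

c(observed = observed, expected = expected)
```

With the objects from this document, prop_within_one_sd(sample_means) gives the same kind of figure as the middle class of the table.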
plot_data = data.frame(Means = sample_means)
# scale() centers to mean 0 and rescales to SD 1; as.numeric() drops the matrix attributes
plot_data$standardized_means = as.numeric(scale(sample_means))
ggplot(data = plot_data, aes(x = standardized_means)) +
  geom_histogram(bins = 60, aes(y = after_stat(density)), color = 'black', fill = 'blue', alpha = 0.5) +
  geom_density(color = 'red', linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem (CLT)", subtitle = "CLT using uniform population", x = "Standardized values", y = "Density or relative frequency") +
  scale_x_continuous(breaks = c(-2.58, -1.96, 0, 1.96, 2.58))