The central limit theorem (CLT) states that if many random samples (each of size 30 or more, a common rule of thumb) are drawn from a population with finite variance, the distribution of the sample means will be approximately normal, even if the population itself is not normally distributed.
Two quantities are important for any distribution: quantiles (points on the x-axis) and probabilities (areas under the curve; the y-axis itself shows density).
In a standard normal distribution, z-values are the quantiles (points on the x-axis), and p-values are cumulative probabilities (the area under the curve to the left of a given quantile).
qnorm(p = 0.95) # one-tailed: z-value with 95% of the area to its left
## [1] 1.644854
qnorm(p = 0.975) # two-tailed: z-value with 2.5% of the area in each tail
## [1] 1.959964
pnorm(1.959964)
## [1] 0.975
pnorm(2.58)
## [1] 0.99506
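As a quick check of how the two functions relate, the sketch below shows that qnorm() and pnorm() undo each other, and that roughly 95% of the area lies between the two-tailed critical values. The variable names are illustrative and are not part of the original analysis.

```r
# qnorm() maps a cumulative probability to a z-value; pnorm() maps it back.
p <- 0.975
z <- qnorm(p)      # upper two-tailed critical value for alpha = 0.05
back <- pnorm(z)   # recovers the probability p

# Area between -z and +z under the standard normal curve
central_area <- pnorm(z) - pnorm(-z)

print(z)
print(back)
print(central_area)
```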
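set.seed(99) # make the simulation reproducible
# Population: one million draws from a uniform distribution on [0, 100]
population = runif(n = 1000000, min = 0, max = 100)
n_sample = 100000 # number of samples to draw
sample_size = 30 # observations per sample
sample_means = numeric(n_sample) # pre-allocate the result vector
for (i in 1:n_sample) {
  # Draw one sample with replacement and store its mean
  current_sample = sample(population, size = sample_size, replace = TRUE)
  sample_means[i] = mean(current_sample)
}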
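The same kind of simulation can also be written without an explicit loop by using replicate(). This is only a sketch on a smaller scale: small_population, n_rep, and means_replicate are made-up names, and the sizes are assumed values chosen to run quickly.

```r
set.seed(99)
# Smaller, illustrative population and number of samples (assumed values)
small_population <- runif(n = 10000, min = 0, max = 100)
n_rep <- 2000
size_per_sample <- 30

# replicate() evaluates the expression n_rep times and collects the results
means_replicate <- replicate(
  n_rep,
  mean(sample(small_population, size = size_per_sample, replace = TRUE))
)

length(means_replicate)
summary(means_replicate)
```

The loop version pre-allocates a vector and fills it in place; replicate() hides that bookkeeping and returns the vector of means directly.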
ggplot() always requires a data frame and an aesthetic mapping.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
plot_data = data.frame(Means = sample_means)
ggplot(data = plot_data, aes(x = Means)) +
  # after_stat(density) puts the histogram on the density scale
  geom_histogram(aes(y = after_stat(density)), bins = 60, color = "black", fill = "blue", alpha = 0.5) +
  geom_density(color = "red", linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem", subtitle = "CLT using uniform population", x = "Sample mean", y = "Density or relative frequency")
Range of sample means
min(sample_means)
## [1] 27.77249
max(sample_means)
## [1] 72.77555
mean(sample_means)
## [1] 50.00928
sd(sample_means)
## [1] 5.270791
mean(sample_means) - sd(sample_means)
## [1] 44.73848
mean(sample_means) + sd(sample_means)
## [1] 55.28007
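For a uniform population on [0, 100], theory gives a mean of 50 and a standard deviation of 100/sqrt(12), and the CLT predicts that the sample means have a standard deviation (the standard error) of sd/sqrt(n). The sketch below computes these values so they can be compared with the simulated mean and sd above; the variable names are illustrative.

```r
# Parameters of the uniform population and the sample size used above
pop_min <- 0
pop_max <- 100
obs_per_sample <- 30

# Mean and standard deviation of a Uniform(pop_min, pop_max) distribution
theoretical_mean <- (pop_min + pop_max) / 2
theoretical_sd <- (pop_max - pop_min) / sqrt(12)

# Standard error of the mean predicted by the CLT
standard_error <- theoretical_sd / sqrt(obs_per_sample)

print(theoretical_mean)
print(theoretical_sd)
print(standard_error)
```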
We will categorize the sample means into three classes.
# Breaks are hard-coded values close to mean - 1 sd and mean + 1 sd; the result is in percent
table(cut(sample_means, breaks = c(-Inf, 44.73298, 55.25235, Inf), labels = c("up to 44.73", "44.73 to 55.25", "greater than 55.25"))) * 100 / 100000
##
##        up to 44.73     44.73 to 55.25 greater than 55.25
##             16.017             67.844             16.139
We see that about 68% (67.844%) of the sample means fall within 1 sd of their mean, close to the 68.27% expected under a normal distribution.
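The share of values within 1 sd can also be computed without typing the break points by hand. The sketch below uses a hypothetical helper, pct_within_sd(), and a small illustrative simulation that draws directly from runif() instead of from the stored population, so its result will not match the table above exactly.

```r
# Hypothetical helper: percentage of values within k standard deviations of the mean
pct_within_sd <- function(x, k = 1) {
  z <- (x - mean(x)) / sd(x)
  100 * mean(abs(z) <= k)
}

# Illustrative data: means of 5000 samples of size 30 from Uniform(0, 100)
set.seed(99)
demo_means <- replicate(5000, mean(runif(30, min = 0, max = 100)))

observed_pct <- pct_within_sd(demo_means, k = 1)
normal_pct <- 100 * (pnorm(1) - pnorm(-1))  # about 68.27 under a normal curve

print(observed_pct)
print(normal_pct)
```

Applied to the original vector, the call would be pct_within_sd(sample_means).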
library(tidyverse)
plot_data = data.frame(Means = sample_means)
# scale() returns a one-column matrix, so convert it to a plain numeric vector
plot_data$standardized_means = as.numeric(scale(sample_means))
ggplot(data = plot_data, aes(x = standardized_means)) +
  geom_histogram(aes(y = after_stat(density)), bins = 60, color = "black", fill = "blue", alpha = 0.5) +
  geom_density(color = "red", linewidth = 1.5) +
  theme_bw() +
  labs(title = "Central Limit Theorem", subtitle = "CLT using uniform population", x = "Standardized values", y = "Density or relative frequency") +
  # Mark the two-tailed critical z-values for 95% and 99%
  scale_x_continuous(breaks = c(-2.58, -1.96, 0, 1.96, 2.58))
table(cut(sample_means, breaks = c(-Inf, 44.74292, 55.2573, Inf),
  labels = c("up to 44.73", "within 44.73 and 55.28", "greater than 55.28")))