The Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science.
The CLT states that the sample mean (\(\bar{x}\)) of a probability distribution sample is a random variable with a mean value given by population mean \(\mu\) and standard deviation \(\sigma_{\bar{x}}\) (standard error) given by population standard deviation \(\sigma\) divided by \(\sqrt{n}\)
i.e. \(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\), where n is the sample size.
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531384 28.4 1183607 63.3 NA 669265 35.8
## Vcells 976737 7.5 8388608 64.0 16384 1840452 14.1
cat("\f") # Clear the console
# dev.off() # Clear par(mfrow=c(3,2)) type commands
?runif # runif generates random deviates; read the arguments required
myunif <- runif(n = 10000,
min = 10,
max = 20
)
myunif[1:16] # check if created correctly, run summary commands if required. head(myunif)
## [1] 15.27919 13.62529 14.89611 11.82282 13.57619 18.27487 16.20547 18.53198
## [9] 12.40289 11.18536 15.69295 16.25366 14.33508 11.26731 13.19037 13.06659
mu <- mean(myunif) # check mean of actual distribution
mu
## [1] 14.97481
sigma <- sd(myunif) # check sd of actual distribution
sigma
## [1] 2.884187
library("psych")
## Warning: package 'psych' was built under R version 4.2.3
describe(myunif) # summary statistics
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 14.97 2.88 14.96 14.97 3.71 10 20 10 0.01 -1.21 0.03
### Plot out the distribution
hist(x = myunif,
main = "Histogram of Uniform Distribution (min=10, max=20, n=10,000)",
xlab = ""
)
?matrix # creates a matrix from the given set of values.
?rep # replicates the values in x lenth times
# fill in the matrix with 0s
z <- matrix(data = rep(x = 0,
times = 10000
),
nrow = 10000,
ncol = 1)
z[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
describe(z)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 0 0 0 0 0 0 0 0 NaN NaN 0
Try reading the loop from inside out.
?sample # makes a sample of the specified size from the elements of x using either with or without replacement.
# sample(x, size, replace = FALSE, prob = NULL)
for (i in 1:10000){
z[i,] <- mean(sample( x = myunif, # a vector of elements from which to choose, or a positive integer.
size = 100, # a non-negative integer giving the number of items to choose.
replace = TRUE # should sampling be with replacement?
)
)
}
z[1:16]
## [1] 14.58196 14.86110 15.24264 14.80985 15.43189 14.64520 15.05815 15.67749
## [9] 15.01629 14.01945 14.55968 14.68105 14.50573 15.05990 15.04674 14.79897
describe(z)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 14.98 0.29 14.98 14.98 0.29 13.66 16.08 2.42 0 -0.03 0
hist(z, xlab = "", main = "Histogram of Sample Mean (n = 100)")
CONTAINING SAMPLE MEAN OF A 2 obs SAMPLE,
CONTAINING SAMPLE MEAN OF A 6 obs BIG SAMPLE,
CONTAINING SAMPLE MEAN OF A 30 obs BIGGER SAMPLE, AND
CONTAINING SAMPLE MEAN OF A 1000 obs BIGGEST SAMPLES
z <- matrix(data = rep(x = 0,
times = 40000
),
nrow = 10000,
ncol = 4)
Take a random sample of 2 observations from the uniform distribution, find its mean and store it in column 1, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 1 each time.
Take a random sample of 6 observations from the uniform distribution, find its mean and store it in column 2, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column21 each time.
Take a random sample of 30 observations from the uniform distribution, find its mean and store it in column 3, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 3 each time.
Take a random sample of 1000 observations from the uniform distribution, find its mean and store it in column 4, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 4 each time.
Try reading the loop from outside to inside this time, and i indexes rows and j indexes columns -
n <- c(2, 6, 30, 1000) # let n be the sample size
#larger sample size should have closer sample mean to true population mean
# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.
for (j in 1:4){ # indexing columns of matrix, where each column will represent different sample size
for (i in 1:10000){ # indexes the rows of matrix
z[i,j] <- mean(sample( x = myunif, # compute mean and assign
size = n[j],
replace = TRUE
)
)
}
}
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. :10.10 Min. :10.88 Min. :13.10 Min. :14.65
## 1st Qu.:13.51 1st Qu.:14.14 1st Qu.:14.62 1st Qu.:14.91
## Median :14.97 Median :14.95 Median :14.98 Median :14.97
## Mean :14.97 Mean :14.96 Mean :14.98 Mean :14.97
## 3rd Qu.:16.40 3rd Qu.:15.79 3rd Qu.:15.33 3rd Qu.:15.04
## Max. :19.99 Max. :19.14 Max. :16.85 Max. :15.34
# matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000. 10,000 iterations.
par(mfrow=c(3,2))
length(myunif)
## [1] 10000
hist(x = myunif,
main = "Histogram of Uniform Distribution, N=10,000",
xlab = ""
)
?hist
for (k in 1:4){
hist(x = z[,k], # matrix column 1-4 which contain sample means of randomly chosen samples from uniform dist
main = "Histogram of Sample Mean of Uniform Distribution",
xlim = c(10, 20), # plot the domain of X axis from x=10 to x=20 only
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}
Recall sample mean is population mean if CLT holds. Also, standard deviation of sample means (standard error) is just the population standard deviation divided by square root of smaple size. Using these two facts, we can backtrack population mean and population standard deviation from the sample moments.
z[1:16] # matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000. 10,000 iterations.
## [1] 17.47448 17.15370 15.73808 11.16293 13.06245 12.82670 12.86843 15.91598
## [9] 11.82471 13.71621 16.02936 13.46811 12.64761 14.76458 14.76267 14.97740
mu # original uniform distribution mean
## [1] 14.97481
colMeans(z) # [1] 14.99702 15.01788 15.02452 15.01863
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 14.97367 14.96318 14.97835 14.97389
##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################
sigma # original uniform distribution sd
## [1] 2.884187
?apply # Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
apply(X = z,
MARGIN = 2, # function applied on columns, not rows as c(1, 2) indicates rows and columns respectively
FUN = sd) # [1] 2.03311048 1.18419671 0.52746969 0.09101246
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 2.03644517 1.19496139 0.52399152 0.09187381
##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################
apply(X = z,
MARGIN = 2,
FUN = sd) * c(sqrt(2),
sqrt(6),
sqrt(30),
sqrt(1000)
) # [1] 2.875252 2.900678 2.889070 2.878067
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 2.879968 2.927046 2.870020 2.905305
sigma
## [1] 2.884187
Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science. The CLT states that the sample mean of a probability distribution sample is a random variable with a mean value given by population mean and standard deviation given by population standard deviation divided by square root of N, where N is the sample size.
# rm(list = ls()) # Clear environment
# gc() # Clear unused memory
# cat("\f") # Clear the console
# dev.off() # Clear par(mfrow=c(3,2)) type commands
mypoiss <- rpois(n = 10000,
lambda = 1)
mu2 <- mean(mypoiss) # 1.0004
mu2
## [1] 0.9892
sd2 <-sd(mypoiss) # 1.011188
sd2
## [1] 0.9964852
hist(x = mypoiss, main = "Histogram of Poisson Distribution (lambda=1)")
##
Step 2: Create an empty matrix that will store means of different sample
sizes, say 6 different sample sizes of 2, 6, 30, 50, 200, 1000
z2 <- matrix(data = rep(x = 0,
times = 60000
),
nrow = 10000,
ncol = 6
)
z2[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
for (j in 1:6){ # indexing columns of matrix, where each column will represent different sample size
for (i in 1:10000){ # indexes the rows of moatrix
n2 <- c( 2 , 6 , 30, 50, 200, 1000) # let n be the sample size - larger sample size should have closer sample mean to true population mean
z2[i,j] <- mean(sample(x = mypoiss,
size = n2[j],
replace = TRUE
)
)
}
}
colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30","Sample size=50", "Sample size=200", "Sample size=1000")
summary(z2)
## Sample size=2 Sample size=6 Sample size=30 Sample size=50
## Min. :0.0000 Min. :0.0000 Min. :0.3667 Min. :0.4400
## 1st Qu.:0.5000 1st Qu.:0.6667 1st Qu.:0.8667 1st Qu.:0.9000
## Median :1.0000 Median :1.0000 Median :0.9667 Median :0.9800
## Mean :0.9966 Mean :0.9857 Mean :0.9865 Mean :0.9899
## 3rd Qu.:1.5000 3rd Qu.:1.1667 3rd Qu.:1.1000 3rd Qu.:1.0800
## Max. :5.0000 Max. :2.8333 Max. :1.7667 Max. :1.5400
## Sample size=200 Sample size=1000
## Min. :0.7500 Min. :0.8580
## 1st Qu.:0.9400 1st Qu.:0.9680
## Median :0.9900 Median :0.9890
## Mean :0.9896 Mean :0.9886
## 3rd Qu.:1.0400 3rd Qu.:1.0100
## Max. :1.3000 Max. :1.1120
# matrix row containing means of poisson distribution of different sample sizes - col 1 has rpoiss sample size of 2, col 2 has rpoiss sample size 6, col 3 has rpoiss sample size 30, and col 4 has rpoiss sample size 50, col 5 has rpoiss sample size 200, and col 6 has rpoiss sample size 1000.
# means are taken 10,000 times / generated 10,000 times from the same distribution / iterations.
par(mfrow = c(4,2))
hist(x = mypoiss,
main = "Histogram of Poisson Distribution (lambda=1)"
)
for (k in 1:6){
hist(x = z2[,k],
main = "Histogram of Mean of Poisson Dist (lambda=1)",
xlim = c(-1, 3),
ylim = c(0,10000),
xlab = paste0("Column ", k, " from matrix")
)
} # main = paste("Mean of " ,z[,k] )
# dev.off() # Clear par(mfrow=c(3,2)) type commands
for (k in 1:6){
hist(x = z2[,k],
main = "Histogram of Mean of Poisson Dist (lambda=1)",
xlim = c(-2, 4), # keeps the x axis same in all charts
xlab = paste0("Column ", k, " from matrix")
)
} # main = paste("Mean of " ,z[,k] )
mu2 # original uniform distribution mean
## [1] 0.9892
colMeans(z2) # [1] 14.99702 15.01788 15.02452 15.01863
## Sample size=2 Sample size=6 Sample size=30 Sample size=50
## 0.9966000 0.9857000 0.9865067 0.9898740
## Sample size=200 Sample size=1000
## 0.9895905 0.9886167
##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################
sd2
## [1] 0.9964852
##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################
apply(X = z2,
MARGIN = 2, # columns
FUN = sd) * c(sqrt(2),
sqrt(6),
sqrt(30),
sqrt(50),
sqrt(200),
sqrt(1000)
) ## [1] 0.9885683 0.9982089 0.9895905 0.9905761 0.9955966 1.0060817
## Sample size=2 Sample size=6 Sample size=30 Sample size=50
## 0.9996383 0.9985353 0.9934904 0.9819469
## Sample size=200 Sample size=1000
## 0.9949110 0.9969903