The Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science.
The CLT states that the sample mean (\(\bar{x}\)) of a probability distribution sample is a random variable with a mean value given by population mean \(\mu\) and standard deviation \(\sigma_{\bar{x}}\) (standard error) given by population standard deviation \(\sigma\) divided by \(\sqrt{n}\)
i.e. \(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\), where n is the sample size.
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528228 28.3 1174578 62.8 NA 669288 35.8
## Vcells 972128 7.5 8388608 64.0 16384 1840040 14.1
cat("\f") # Clear the console
# dev.off() # Clear par(mfrow=c(3,2)) type commands
?runif # runif generates random deviates; read the arguments required
myunif <- runif(n = 10000,
min = 10,
max = 20
)
myunif[1:16] # check if created correctly, run summary commands if required. head(myunif)
## [1] 15.28508 15.29126 19.97242 19.15795 11.34572 14.97238 19.06715 14.32130
## [9] 18.91481 12.75889 15.51272 16.80669 17.98116 19.52528 15.05994 12.35413
mu <- mean(myunif) # check mean of actual distribution
mu
## [1] 14.99099
sigma <- sd(myunif) # check sd of actual distribution
sigma
## [1] 2.903908
library("psych")
describe(myunif) # summary statistics
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 14.99 2.9 15.01 14.99 3.76 10 20 10 0 -1.22 0.03
### Plot out the distribution
hist(x = myunif,
main = "Histogram of Uniform Distribution (min=10, max=20, n=10,000)",
xlab = ""
)
?matrix # creates a matrix from the given set of values.
?rep # replicates the values in x lenth times
# fill in the matrix with 0s
z <- matrix(data = rep(x = 0,
times = 10000
),
nrow = 10000,
ncol = 1)
z[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
describe(z)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 0 0 0 0 0 0 0 0 NaN NaN 0
Try reading the loop from inside out.
?sample # makes a sample of the specified size from the elements of x using either with or without replacement.
# sample(x, size, replace = FALSE, prob = NULL)
for (i in 1:10000){
z[i,] <- mean(sample( x = myunif, # a vector of elements from which to choose, or a positive integer.
size = 100, # a non-negative integer giving the number of items to choose.
replace = TRUE # should sampling be with replacement?
)
)
}
z[1:16]
## [1] 15.01953 14.90143 15.34522 15.00622 15.70078 15.13112 14.58994 15.17121
## [9] 14.98000 14.77428 14.89862 14.97955 15.02438 14.63983 14.92752 15.35812
describe(z)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 14.99 0.29 14.99 15 0.3 14 16.02 2.02 -0.04 -0.1 0
hist(z, xlab = "", main = "Histogram of Sample Mean (n = 100)")
CONTAINING SAMPLE MEAN OF A 2 obs SAMPLE,
CONTAINING SAMPLE MEAN OF A 6 obs BIG SAMPLE,
CONTAINING SAMPLE MEAN OF A 30 obs BIGGER SAMPLE, AND
CONTAINING SAMPLE MEAN OF A 1000 obs BIGGEST SAMPLES
z <- matrix(data = rep(x = 0,
times = 40000
),
nrow = 10000,
ncol = 4)
Take a random sample of 2 observations from the uniform distribution, find its mean and store it in column 1, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 1 each time.
Take a random sample of 6 observations from the uniform distribution, find its mean and store it in column 2, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column21 each time.
Take a random sample of 30 observations from the uniform distribution, find its mean and store it in column 3, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 3 each time.
Take a random sample of 1000 observations from the uniform distribution, find its mean and store it in column 4, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 4 each time.
Try reading the loop from outside to inside this time, and i indexes rows and j indexes columns -
n <- c(2, 6, 30, 1000) # let n be the sample size
#larger sample size should have closer sample mean to true population mean
# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.
for (j in 1:4){ # indexing columns of matrix, where each column will represent different sample size
for (i in 1:10000){ # indexes the rows of matrix
z[i,j] <- mean(sample( x = myunif, # compute mean and assign
size = n[j],
replace = TRUE
)
)
}
}
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. :10.07 Min. :10.65 Min. :12.91 Min. :14.62
## 1st Qu.:13.51 1st Qu.:14.20 1st Qu.:14.62 1st Qu.:14.93
## Median :14.98 Median :14.99 Median :14.99 Median :14.99
## Mean :14.98 Mean :15.01 Mean :14.98 Mean :14.99
## 3rd Qu.:16.47 3rd Qu.:15.81 3rd Qu.:15.34 3rd Qu.:15.05
## Max. :19.97 Max. :19.11 Max. :16.84 Max. :15.32
# matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000. 10,000 iterations.
par(mfrow=c(3,2))
length(myunif)
## [1] 10000
hist(x = myunif,
main = "Histogram of Uniform Distribution, N=10,000",
xlab = ""
)
?hist
for (k in 1:4){
hist(x = z[,k], # matrix column 1-4 which contain sample means of randomly chosen samples from uniform dist
main = "Histogram of Sample Mean of Uniform Distribution",
xlim = c(10, 20), # plot the domain of X axis from x=10 to x=20 only
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}
Recall sample mean is population mean if CLT holds. Also, standard deviation of sample means (standard error) is just the population standard deviation divided by square root of smaple size. Using these two facts, we can backtrack population mean and population standard deviation from the sample moments.
z[1:16] # matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000. 10,000 iterations.
## [1] 13.47066 17.79353 14.01069 10.83835 15.40102 12.55278 16.49548 14.46722
## [9] 10.23553 14.44267 13.11832 18.13630 15.44076 13.57534 17.87807 14.80485
mu # original uniform distribution mean
## [1] 14.99099
colMeans(z) # [1] 14.99702 15.01788 15.02452 15.01863
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 14.98463 15.00541 14.98281 14.99077
##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################
sigma # original uniform distribution sd
## [1] 2.903908
?apply # Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
apply(X = z,
MARGIN = 2, # function applied on columns, not rows as c(1, 2) indicates rows and columns respectively
FUN = sd) # [1] 2.03311048 1.18419671 0.52746969 0.09101246
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 2.05097055 1.18846580 0.53071114 0.09290199
##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################
apply(X = z,
MARGIN = 2,
FUN = sd) * c(sqrt(2),
sqrt(6),
sqrt(30),
sqrt(1000)
) # [1] 2.875252 2.900678 2.889070 2.878067
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 2.900510 2.911135 2.906825 2.937819
sigma
## [1] 2.903908
Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science. The CLT states that the sample mean of a probability distribution sample is a random variable with a mean value given by population mean and standard deviation given by population standard deviation divided by square root of N, where N is the sample size.
# rm(list = ls()) # Clear environment
# gc() # Clear unused memory
# cat("\f") # Clear the console
# dev.off() # Clear par(mfrow=c(3,2)) type commands
mypoiss <- rpois(n = 10000,
lambda = 1)
mu2 <- mean(mypoiss) # 1.0004
mu2
## [1] 0.985
sd2 <-sd(mypoiss) # 1.011188
sd2
## [1] 1.004329
hist(x = mypoiss, main = "Histogram of Poisson Distribution (lambda=1)")
##
Step 2: Create an empty matrix that will store means of different sample
sizes, say 6 different sample sizes of 2, 6, 30, 50, 200, 1000
z2 <- matrix(data = rep(x = 0,
times = 60000
),
nrow = 10000,
ncol = 6
)
z2[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
for (j in 1:6){ # indexing columns of matrix, where each column will represent different sample size
for (i in 1:10000){ # indexes the rows of moatrix
n2 <- c( 2 , 6 , 30, 50, 200, 1000) # let n be the sample size - larger sample size should have closer sample mean to true population mean
z2[i,j] <- mean(sample(x = mypoiss,
size = n2[j],
replace = TRUE
)
)
}
}
colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30","Sample size=50", "Sample size=200", "Sample size=1000")
summary(z2)
## Sample size=2 Sample size=6 Sample size=30 Sample size=50
## Min. :0.0000 Min. :0.0000 Min. :0.3667 Min. :0.440
## 1st Qu.:0.5000 1st Qu.:0.6667 1st Qu.:0.8667 1st Qu.:0.880
## Median :1.0000 Median :1.0000 Median :0.9667 Median :0.980
## Mean :0.9886 Mean :0.9820 Mean :0.9857 Mean :0.986
## 3rd Qu.:1.5000 3rd Qu.:1.1667 3rd Qu.:1.1000 3rd Qu.:1.080
## Max. :4.5000 Max. :3.0000 Max. :1.7333 Max. :1.540
## Sample size=200 Sample size=1000
## Min. :0.7100 Min. :0.8700
## 1st Qu.:0.9350 1st Qu.:0.9640
## Median :0.9850 Median :0.9850
## Mean :0.9852 Mean :0.9854
## 3rd Qu.:1.0350 3rd Qu.:1.0070
## Max. :1.2650 Max. :1.1120
# matrix row containing means of poisson distribution of different sample sizes - col 1 has rpoiss sample size of 2, col 2 has rpoiss sample size 6, col 3 has rpoiss sample size 30, and col 4 has rpoiss sample size 50, col 5 has rpoiss sample size 200, and col 6 has rpoiss sample size 1000.
# means are taken 10,000 times / generated 10,000 times from the same distribution / iterations.
par(mfrow = c(4,2))
hist(x = mypoiss,
main = "Histogram of Poisson Distribution (lambda=1)"
)
for (k in 1:6){
hist(x = z2[,k],
main = "Histogram of Mean of Poisson Dist (lambda=1)",
xlim = c(-1, 3),
ylim = c(0,10000),
xlab = paste0("Column ", k, " from matrix")
)
} # main = paste("Mean of " ,z[,k] )
# dev.off() # Clear par(mfrow=c(3,2)) type commands
for (k in 1:6){
hist(x = z2[,k],
main = "Histogram of Mean of Poisson Dist (lambda=1)",
xlim = c(-2, 4), # keeps the x axis same in all charts
xlab = paste0("Column ", k, " from matrix")
)
} # main = paste("Mean of " ,z[,k] )
mu2 # original uniform distribution mean
## [1] 0.985
colMeans(z2) # [1] 14.99702 15.01788 15.02452 15.01863
## Sample size=2 Sample size=6 Sample size=30 Sample size=50
## 0.9886000 0.9820500 0.9857167 0.9859860
## Sample size=200 Sample size=1000
## 0.9851815 0.9853921
##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################
sd2
## [1] 1.004329
##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################
apply(X = z2,
MARGIN = 2, # columns
FUN = sd) * c(sqrt(2),
sqrt(6),
sqrt(30),
sqrt(50),
sqrt(200),
sqrt(1000)
) ## [1] 0.9885683 0.9982089 0.9895905 0.9905761 0.9955966 1.0060817
## Sample size=2 Sample size=6 Sample size=30 Sample size=50
## 1.0188444 0.9993915 1.0019764 1.0041849
## Sample size=200 Sample size=1000
## 1.0107349 1.0009749