In class, we saw that the Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science.
The CLT states that the sample mean \(\bar{x}\) of a random sample drawn from a probability distribution is itself a random variable, with expected value equal to the population mean \(\mu\) and standard deviation \(\sigma_{\bar{x}}\) (the standard error) equal to the population standard deviation \(\sigma\) divided by \(\sqrt{n}\),
i.e. \(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\), where \(n\) is the sample size.
However, the CLT is a bit more general than that.
To me, the CLT means that if I take randomly chosen observations from any population distribution (it need not be normal) to create a representative sample, then as long as my sample is "large enough" (roughly \(n \geq 30\)), a sample statistic will be a close approximation of the true, unknown population statistic. In fact, we can go a step further and make a stronger statement about the distribution of the sample statistic too: the sample statistic will be approximately normally distributed around the unknown, unobservable population statistic.
The CLT is not restricted to the sample mean; I provide visual evidence for both the sample mean and the sample 10th percentile below.
Thus, the CLT allows us to do inferential statistics while relying on just a single sample, where \(n \ll N\).
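Before the full exercise below, here is a quick numerical check of the standard-error formula. This is a minimal, self-contained sketch; the exponential population, the sample size of 25, and the 5,000 replications are illustrative choices of mine, not part of the exercise that follows.
set.seed(1)
pop <- rexp(n = 1e5, rate = 1)    # a skewed "population"; its true sd is 1
n_check <- 25
xbar <- replicate(n = 5000,
                  expr = mean(sample(x = pop, size = n_check, replace = TRUE)))
sd(xbar)                 # simulated standard error of the sample mean
sd(pop) / sqrt(n_check)  # CLT prediction: sigma / sqrt(n)
The two numbers should be very close, even though the population is far from normal.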
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528119 28.3 1174267 62.8 NA 669288 35.8
## Vcells 971272 7.5 8388608 64.0 16384 1840035 14.1
cat("\f") # Clear the console
# dev.off() # Clear par(mfrow=c(3,2)) type commands
library("psych")
??lognormal # find lognormal by textual search in R
?stats::Lognormal # read up the syntax once you know the exact name
mydist <- rlnorm(n = 10000,
                 meanlog = 1,
                 sdlog = 1
                 )
describe(mydist) # check mean and sd of actual distribution
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 4.48 5.56 2.79 3.43 2.48 0.07 98.77 98.7 4.84 43.91 0.06
mu <- mean(mydist) # store mean of actual distribution
mu
## [1] 4.481729
sigma <- sd(mydist) # store sd of actual distribution
sigma
## [1] 5.564187
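As a check, these simulated moments can be compared with the closed-form moments of a log-normal distribution: mean = exp(meanlog + sdlog^2/2) and sd = sqrt((exp(sdlog^2) - 1) * exp(2*meanlog + sdlog^2)).
# Closed-form moments of Lognormal(meanlog = 1, sdlog = 1), for comparison
exp(1 + 1^2/2)                         # theoretical mean, approx. 4.48
sqrt((exp(1^2) - 1) * exp(2*1 + 1^2))  # theoretical sd, approx. 5.87
The simulated mean lands almost exactly on 4.48; the simulated sd will wander around 5.87 from run to run, because the heavy right tail makes the sample sd noisy even with 10,000 draws.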
### Plot out the distribution
hist(x = mydist,
     main = "Histogram of Log Normal Dist \n (meanlog = 1, sdlog = 1, n = 10,000)",
     xlab = ""
     )
To understand the distribution better, look at the histogram of its logarithm; the log of a log-normal variable is normally distributed, so this should look bell-shaped:
hist(x = log(mydist),
     main = "Histogram of Logged - Log Normal Dist \n (meanlog = 1, sdlog = 1, n = 10,000)",
     xlab = ""
     )
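Accordingly, log(mydist) should have mean close to meanlog = 1 and sd close to sdlog = 1; a quick check using base R's mean() and sd():
mean(log(mydist)) # should be close to meanlog = 1
sd(log(mydist))   # should be close to sdlog = 1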
Next, pull repeated random samples from mydist, compute each sample's mean, and store the results in a 10,000 x 4 matrix z, with
column 1 containing the sample mean of a 2 obs sample,
column 2 containing the sample mean of a 6 obs big sample,
column 3 containing the sample mean of a 30 obs bigger sample, and
column 4 containing the sample mean of a 1,000 obs biggest sample,
one sample mean per row. To start, all items of the matrix contain a value of 0.
z <- matrix(data = rep(x = 0,
                       times = 40000
                       ),
            nrow = 10000,
            ncol = 4)
# i indexes rows and j indexes columns
# let n be the sample size
n <- c(2, 6, 30, 1000)
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.
for (j in 1:4){ # each column will represent a different sample size
  for (i in 1:10000){ # indexes the rows of the matrix
    z[i,j] <- mean( # compute the mean and assign to z
      sample( # a sample (of size 2, 6, 30, or 1000) pulled from the original distribution, 10,000 times per column
        x = mydist,
        size = n[j],
        replace = TRUE
      )
    )
  }
}
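As an aside, the same matrix can be filled without explicit loops using sapply() and replicate(). This is just an equivalent sketch; the values will differ from z because new random samples are drawn.
z_alt <- sapply(X = n,                 # the four sample sizes: 2, 6, 30, 1000
                FUN = function(sz) {
                  replicate(n = 10000,
                            expr = mean(sample(x = mydist,
                                               size = sz,
                                               replace = TRUE)))
                })
dim(z_alt) # 10,000 rows by 4 columns, same shape as z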
If this intuition is right, the mean of a sample should get closer to the mean of the actual log-normal distribution as the sample size increases, and
the distribution of the sample mean should get 'tighter' around the true population mean.
The z matrix contains means of samples of different sizes drawn from the log-normal distribution:
col 1 has 10,000 sample means where the sample size was 2,
col 2 has 10,000 sample means where the sample size was 6, and so on.
Let's check our intuition:
# original distribution mean and sd
mu
## [1] 4.481729
sigma
## [1] 5.564187
# actual means from the data
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. : 0.2667 Min. : 0.7738 Min. : 1.908 Min. :3.879
## 1st Qu.: 2.0544 1st Qu.: 2.9773 1st Qu.: 3.761 1st Qu.:4.363
## Median : 3.4179 Median : 3.9922 Median : 4.346 Median :4.477
## Mean : 4.4737 Mean : 4.4768 Mean : 4.470 Mean :4.481
## 3rd Qu.: 5.5789 3rd Qu.: 5.4309 3rd Qu.: 5.026 3rd Qu.:4.597
## Max. :51.5775 Max. :21.1216 Max. :10.845 Max. :5.200
par(mfrow=c(3,2))
length(mydist)
## [1] 10000
hist(x = mydist,
     main = "Histogram of Log Normal Dist, N=10,000",
     xlab = ""
     )
# matrix columns 1-4 contain sample means of randomly chosen samples from the log-normal dist
for (k in 1:4){
  hist(x = z[,k],
       main = "Histogram of Sample Mean of Log Normal Dist",
       xlim = c(0, 30), # restrict the x-axis to the range 0 to 30
       xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
  )
}
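The CLT also predicts the shape of these histograms: for n = 30 the sample means should be roughly normal with mean mu and sd sigma/sqrt(30). As a quick check (a sketch, plotted on its own rather than in the 3-by-2 grid above), we can overlay that normal density on the corresponding histogram:
par(mfrow = c(1, 1))
hist(x = z[, 3],               # sample means with n = 30
     freq = FALSE,             # density scale, so the normal curve is comparable
     main = "Sample Means (n = 30) vs. CLT Normal Approximation",
     xlab = "Sample mean")
curve(expr = dnorm(x, mean = mu, sd = sigma / sqrt(30)),
      add = TRUE,
      lwd = 2)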
Recall that, if the CLT holds, the sample mean is centered at the population mean.
Also, the standard deviation of the sample means (the standard error) is just the population standard deviation divided by the square root of the sample size.
Using these two facts, we can back out the population mean and population standard deviation from the sample moments:
the original distribution's mean is approximated by the average of the sample means, and
the original distribution's sd is recovered by multiplying the standard deviation of the sample means by \(\sqrt{n}\) (since \(\sigma = \sigma_{\bar{x}}\sqrt{n}\)), where \(n\) is
the sample size.
mu # original log-normal distribution mean
## [1] 4.481729
apply(X = z,
      MARGIN = 2, # function applied on columns
      FUN = mean)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 4.473751 4.476783 4.470226 4.481333
sigma # original log-normal distribution sd
## [1] 5.564187
apply(X = z,
      MARGIN = 2,
      FUN = sd) * c(sqrt(2),
                    sqrt(6),
                    sqrt(30),
                    sqrt(1000)
                    ) # multiply each column's sd (the standard error) by sqrt(n) to recover sigma
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 5.608554 5.385537 5.546037 5.544563
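This is exactly what makes single-sample inference possible, as claimed at the top: from one sample of size \(n\), the CLT gives an approximate 95% confidence interval \(\bar{x} \pm 1.96\, s/\sqrt{n}\). A sketch with a single sample of 30 observations (the 1.96 cutoff assumes the normal approximation is adequate, which for this heavy-tailed distribution is only rough at \(n = 30\)):
# Inference from a single sample (a sketch): an approximate 95% CI for mu
one_sample <- sample(x = mydist, size = 30, replace = TRUE)
xbar_1 <- mean(one_sample)
se_1   <- sd(one_sample) / sqrt(30)   # estimated standard error
c(lower = xbar_1 - 1.96 * se_1,
  upper = xbar_1 + 1.96 * se_1)       # should cover mu roughly 95% of the time
mu                                    # true population mean, for comparison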
To recap: the CLT states that the sample mean is a random variable centered at the population mean, with standard deviation equal to the population standard deviation divided by the square root of the sample size.
But the CLT can be applied to other sample statistics as well, such as the 10th percentile; it is not restricted to the sample mean.
??quantile
?quantile # produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1
percentile_10 <- quantile(x = mydist,    # numeric vector whose sample quantiles are wanted
                          probs = c(.10) # numeric vector of probabilities with values in [0,1]
                          )
percentile_10
## 10%
## 0.77446
# 10,000 by 5 matrix of zeros
z2 <- matrix(data = rep(x = 0,
                        times = 50000
                        ),
             nrow = 10000,
             ncol = 5)
n <- c(2, 6, 30, 120, 1000)
for (j in 1:5){
  for (i in 1:10000){
    z2[i,j] <- quantile( # CHANGE FROM MEAN TO QUANTILE
      x = sample(x = mydist,
                 size = n[j],
                 replace = TRUE
                 ),
      probs = c(.1)
    )
  }
}
## SUMMARY STATS - should be centered at percentile_10
percentile_10
## 10%
## 0.77446
colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=120", "Sample size=1000")
summary(z2)
## Sample size=2 Sample size=6 Sample size=30 Sample size=120
## Min. : 0.1198 Min. :0.1633 Min. :0.2521 Min. :0.4398
## 1st Qu.: 1.2442 1st Qu.:0.8206 1st Qu.:0.6941 1st Qu.:0.7106
## Median : 2.0460 Median :1.1626 Median :0.8395 Median :0.7907
## Mean : 2.6209 Mean :1.3115 Mean :0.8732 Mean :0.7962
## 3rd Qu.: 3.3207 3rd Qu.:1.6328 3rd Qu.:1.0238 3rd Qu.:0.8693
## Max. :22.4286 Max. :6.5521 Max. :2.2215 Max. :1.3193
## Sample size=1000
## Min. :0.6182
## 1st Qu.:0.7531
## Median :0.7750
## Mean :0.7754
## 3rd Qu.:0.8008
## Max. :0.9428
## GRAPHICALLY - should be centered at percentile_10
par(mfrow=c(3,2))
hist(x = mydist,
     main = "Histogram of Log Normal Dist, N=10,000",
     xlab = ""
     )
for (k in 1:5){
  hist(x = z2[,k], # CHANGE TO z2
       main = "Histogram of Sample 10th Percentile of Log Normal Dist",
       xlim = c(0, 15),
       xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
  )
}
Note that the CLT does not really kick in until the sample size is about 30 or more; for smaller samples, the distribution of the sample 10th percentile is still visibly skewed and not centered tightly on the population value.
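One way to see this more directly is with normal QQ plots of the simulated 10th percentiles: the heavy curvature at \(n = 2\) and \(n = 6\) largely disappears by \(n = 30\) and beyond. A quick sketch:
# Normal QQ plots of the simulated 10th percentiles: the bend fades as n grows
par(mfrow = c(2, 3))
for (k in 1:5){
  qqnorm(y = z2[, k],
         main = paste0("QQ plot, sample size ", n[k]))
  qqline(y = z2[, k])
}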