Central Limit Theorem

I. Definition and Proof

In class, we saw that

The Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science.

The CLT states that the sample mean (\(\bar{x}\)) of a probability distribution sample is a random variable with a mean value given by population mean \(\mu\) and standard deviation \(\sigma_{\bar{x}}\) (standard error) given by population standard deviation \(\sigma\) divided by \(\sqrt{n}\)

i.e. \(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\), where n is the sample size.

Proof

However, it is a bit more general than that !!!

My definition -

  • For me, CLT means that from any population distribution (need not be normal), if I take randomly chosen observations to create a representative sample, as long as my sample is “large enough” or \(n \geq 30\), any sample statistic will be a close approximation of the true unknown population statistic. In fact, we can go a step further and make a stronger statement about the distribution of the sample statistic too - the sample statistic will be normally distributed around the unknown / unobservable population statistic !

  • CLT is not restricted to the sample mean (and I provide visual proof for both sample mean and sample 10th percentile below).

  • Thus, CLT will allow us to do inferential statistics by just relying on a single sample, where \(n<<N\).

II. Log Normal Distribution: Mean

Step 0: Clear the work space, load packages

rm(list = ls()) # Clear environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528119 28.3    1174267 62.8         NA   669288 35.8
## Vcells 971272  7.5    8388608 64.0      16384  1840035 14.1
cat("\f")       # Clear the console
# dev.off()     # Clear par(mfrow=c(3,2)) type commands 

library("psych")

Step 1: Start with a distribution, say lognormal with meanlog and sdlog of 1.

  • Maybe you can try a uniform distribution with true mean of 15 and min/max of 10/20 respectively.
??lognormal        # find lognormal by textual search in R

?stats::Lognormal  # read up the syntax once you know the exact name 

mydist <- rlnorm(n       = 10000, 
                 meanlog = 1, 
                 sdlog   = 1 
                ) 

describe(mydist)  # check mean and sd of actual distribution
##    vars     n mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 10000 4.48 5.56   2.79    3.43 2.48 0.07 98.77  98.7 4.84    43.91 0.06
mu <- mean(mydist)      # store mean of actual distribution
mu
## [1] 4.481729
sigma <- sd(mydist)     # store sd of actual distribution
sigma
## [1] 5.564187
### Plot out the distribution  
hist(x    = mydist, 
     main = "Histogram of Log Normal Dist \n (meanlog = 1, sdlog = 1, n = 10,000)",
     xlab = ""
     )

To understand the distribution -

hist(x    = log(mydist), 
     main = "Histogram of Logged - Log Normal Dist \n (meanlog = 1, sdlog = 1, n = 10,000)",
     xlab = ""
     )

Step 2: Create random samples from mydist, create and store their mean

  • CONTAINING SAMPLE MEAN OF A 2 obs SAMPLE,

  • CONTAINING SAMPLE MEAN OF A 6 obs BIG SAMPLE,

  • CONTAINING SAMPLE MEAN OF A 30 obs BIGGER SAMPLE, AND

  • CONTAINING SAMPLE MEAN OF A 1000 obs BIGGEST SAMPLES

Create the Empty Matrix -

All items contain a value of 0.

z <- matrix(data = rep(x     = 0, 
                       times = 40000
                       ), 
            nrow = 10000, 
            ncol = 4)

Replace values of the null matrix -

# i indexes rows and j indexes columns

# let n be the sample size 
n <- c(2, 6, 30, 1000) 

# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.  



for (j in 1:4){         # each column will represent different sample size
 
   for (i in 1:10000){  # indexes the rows of matrix
  
       z[i,j] <- mean(  # compute mean and assign to z
                       sample(                    # Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.
                              x       = mydist, 
                              size    = n[j], 
                              replace = TRUE
                            )
                      )
    }
}

If intuition is right, mean of sample should get closer to mean of actual lognormal distribution as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true population mean.

Step 3: Check if CLT holds

z matrix contains means of log normal distribution of different sample sizes -

  • col 1 has 10,000 sample means where sample size was 2,

  • col 2 has 10,000 sample means where sample size was 6,…

With summary commands

Lets check our intuition above -

# original distribution mean and sd
mu 
## [1] 4.481729
sigma 
## [1] 5.564187
# actual means from the data
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 

summary(z)  
##  Sample size=2     Sample size=6     Sample size=30   Sample size=1000
##  Min.   : 0.2667   Min.   : 0.7738   Min.   : 1.908   Min.   :3.879   
##  1st Qu.: 2.0544   1st Qu.: 2.9773   1st Qu.: 3.761   1st Qu.:4.363   
##  Median : 3.4179   Median : 3.9922   Median : 4.346   Median :4.477   
##  Mean   : 4.4737   Mean   : 4.4768   Mean   : 4.470   Mean   :4.481   
##  3rd Qu.: 5.5789   3rd Qu.: 5.4309   3rd Qu.: 5.026   3rd Qu.:4.597   
##  Max.   :51.5775   Max.   :21.1216   Max.   :10.845   Max.   :5.200

With graphs

par(mfrow=c(3,2))

length(mydist)
## [1] 10000
hist(x    = mydist, 
     main = "Histogram of Log Normal Dist, N=10,000",
     xlab = ""
     )

# matrix column 1-4 which contain sample means of randomly chosen samples from log normal dist
for (k in 1:4){
    hist(x    = z[,k],    
         main = "Histogram of Sample Mean of Log Normal Dist", 
         xlim = c(0, 30), # plot the domain of X axis from x=0 to x=10 only
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }  

Step 6: Backtracing population mean and sd

  • Recall sample mean is population mean if CLT holds.

  • Also, standard deviation of sample means (standard error) is just the population standard deviation divided by square root of sample size.

Using these two facts, we can backtrack population mean and population standard deviation from the sample moments.

Original distribution mean is approximated by sample mean. Furthermore, original distribution’s sd is approximated by adjusting for sample sd by \(\sqrt n\) division, where n is the sample size.

mu      # original uniform distribution mean
## [1] 4.481729
apply(X = z,
      MARGIN = 2,  # function applied on columns
      FUN = mean)    
##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         4.473751         4.476783         4.470226         4.481333
sigma   # original uniform distribution sd
## [1] 5.564187
apply(X = z,
      MARGIN = 2,
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(1000)
                    ) # [1] 2.875252 2.900678 2.889070 2.878067
##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         5.608554         5.385537         5.546037         5.544563

Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science. The CLT states that the sample mean of a probability distribution sample is a random variable with a mean value given by population mean and standard deviation given by population standard deviation divided by square root of N, where N is the sample size.

Thus, as N goes to infinity,the uncertainty or standard deviation of the mean goes to zero. This means that the larger the size of our dataset, the better, as larger samples lead to smaller variance error.

III. Log Normal Distribution: 10th percentile

CLT can be applied to other sample statistics such as say 10th percentile, and is not necessarily restricted to the sample mean.

??quantile
?quantile   # produces sample quantiles corresponding to the given probabilities. The smallest observation corresponds to a probability of 0 and the largest to a probability of 1

percentile_10 <- quantile(x = mydist,    # numeric vector whose sample quantiles are wanted,
                          probs = c(.10) # numeric vector of probabilities with values in [0,1][0,1].
                          ) 
percentile_10
##     10% 
## 0.77446
# 10,000 by 5 empty matrix
z2 <- matrix(data = rep(x     = 0, 
                        times = 50000
                        ), 
             nrow = 10000, 
             ncol = 5)

n <- c(2, 6, 30, 120, 1000) 
for (j in 1:5){        
   for (i in 1:10000){  
       z2[i,j] <- quantile( # CHANGE FROM MEAN TO QUANTILE
                            x     = sample( x       = mydist, 
                                            size    = n[j], 
                                            replace = TRUE
                                        ),
                            probs = c(.1) 
                       )
    }
}

## SUMMARY STATS - should be centered at percentile_10
percentile_10
##     10% 
## 0.77446
colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=120", "Sample size=1000") 

summary(z2)  
##  Sample size=2     Sample size=6    Sample size=30   Sample size=120 
##  Min.   : 0.1198   Min.   :0.1633   Min.   :0.2521   Min.   :0.4398  
##  1st Qu.: 1.2442   1st Qu.:0.8206   1st Qu.:0.6941   1st Qu.:0.7106  
##  Median : 2.0460   Median :1.1626   Median :0.8395   Median :0.7907  
##  Mean   : 2.6209   Mean   :1.3115   Mean   :0.8732   Mean   :0.7962  
##  3rd Qu.: 3.3207   3rd Qu.:1.6328   3rd Qu.:1.0238   3rd Qu.:0.8693  
##  Max.   :22.4286   Max.   :6.5521   Max.   :2.2215   Max.   :1.3193  
##  Sample size=1000
##  Min.   :0.6182  
##  1st Qu.:0.7531  
##  Median :0.7750  
##  Mean   :0.7754  
##  3rd Qu.:0.8008  
##  Max.   :0.9428
## GRAPHICALLY  - should be centered at percentile_10

par(mfrow=c(3,2))

hist(x    = mydist, 
     main = "Histogram of Log Normal Dist, N=10,000",
     xlab = ""
     )

for (k in 1:5){
    hist(x    = z2[,k],    # CHNAGE TO z2
         main = "Histogram of Sample 10 Percentile of Log Normal Dist", 
         xlim = c(0, 15), 
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }  

Note that you see CLT does not really kick in till sample size is 30 or more.