Central Limit Theorem

Definition and Proof

The Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science.

The CLT states that the sample mean (\(\bar{x}\)) of a probability distribution sample is a random variable with a mean value given by population mean \(\mu\) and standard deviation \(\sigma_{\bar{x}}\) (standard error) given by population standard deviation \(\sigma\) divided by \(\sqrt{n}\)

i.e. \(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\), where n is the sample size.

Proof

I. UNIFORM DISTRIBUTION

Step 0: Clear the work space

rm(list = ls()) # Clear environment
gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 528228 28.3    1174578 62.8         NA   669288 35.8
## Vcells 972128  7.5    8388608 64.0      16384  1840040 14.1

cat("\f")       # Clear the console

# dev.off()     # Clear par(mfrow=c(3,2)) type commands

Step 1: Start with a distribution, say uniform with true mean of 15 and min/max of 10/20 respectively

?runif                 # runif generates random deviates; read the arguments required

myunif <- runif(n  = 10000, 
                min = 10, 
                max = 20 
                ) 
myunif[1:16]            # check if created correctly, run summary commands if required.  head(myunif)

##  [1] 15.28508 15.29126 19.97242 19.15795 11.34572 14.97238 19.06715 14.32130
##  [9] 18.91481 12.75889 15.51272 16.80669 17.98116 19.52528 15.05994 12.35413

mu <- mean(myunif)      # check mean of actual distribution
mu

## [1] 14.99099

sigma <- sd(myunif)     # check sd of actual distribution
sigma

## [1] 2.903908

library("psych")
describe(myunif)        # summary statistics

##    vars     n  mean  sd median trimmed  mad min max range skew kurtosis   se
## X1    1 10000 14.99 2.9  15.01   14.99 3.76  10  20    10    0    -1.22 0.03

### Plot out the distribution  
hist(x    = myunif, 
     main = "Histogram of Uniform Distribution (min=10, max=20, n=10,000)",
     xlab = ""
     )

Step 2: Create an empty matrix of 10000 rows and 1 column in which you will store the means of randomly drawn samples (of size 100)

Create 10000 row and 1 column matrix with 0 entries only

?matrix     # creates a matrix from the given set of values.
?rep        # replicates the values in x lenth times

# fill in the matrix with 0s
z <- matrix(data = rep(x     = 0, 
                       times = 10000
                       ), 
            nrow = 10000, 
            ncol = 1)

z[1:16]

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

describe(z)

##    vars     n mean sd median trimmed mad min max range skew kurtosis se
## X1    1 10000    0  0      0       0   0   0   0     0  NaN      NaN  0

Step 3: Take a random sample of 100 observations from the uniform distribution, find its mean and store it in row 1 of matrix z. Repeat 9,999 times - storing the mean of sample of size 100 in row i of matrix z.

Try reading the loop from inside out.

?sample # makes a sample of the specified size from the elements of x using either with or without replacement.
# sample(x, size, replace = FALSE, prob = NULL)

for (i in 1:10000){ 
    z[i,] <- mean(sample( x        = myunif, # a vector of elements from which to choose, or a positive integer. 
                          size     = 100,    # a non-negative integer giving the number of items to choose.
                          replace  = TRUE    # should sampling be with replacement?
                        )
                )
  }

z[1:16]

##  [1] 15.01953 14.90143 15.34522 15.00622 15.70078 15.13112 14.58994 15.17121
##  [9] 14.98000 14.77428 14.89862 14.97955 15.02438 14.63983 14.92752 15.35812

describe(z)

##    vars     n  mean   sd median trimmed mad min   max range  skew kurtosis se
## X1    1 10000 14.99 0.29  14.99      15 0.3  14 16.02  2.02 -0.04     -0.1  0

hist(z, xlab = "", main = "Histogram of Sample Mean (n = 100)")

Step 4: Generalise the code to use different sample sizes instead of just \(n=1000\) above

EXPAND COLUMNS OF MATRIX TO —

CONTAINING SAMPLE MEAN OF A 2 obs SAMPLE,
CONTAINING SAMPLE MEAN OF A 6 obs BIG SAMPLE,
CONTAINING SAMPLE MEAN OF A 30 obs BIGGER SAMPLE, AND
CONTAINING SAMPLE MEAN OF A 1000 obs BIGGEST SAMPLES

4 a ) EXPANDING COLUMNS OF NULL MATRIX CODE —

z <- matrix(data = rep(x     = 0, 
                       times = 40000
                       ), 
            nrow = 10000, 
            ncol = 4)

4 b ) REPLACE VALUES OF NULL MATRIX —

Take a random sample of 2 observations from the uniform distribution, find its mean and store it in column 1, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 1 each time.
Take a random sample of 6 observations from the uniform distribution, find its mean and store it in column 2, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column21 each time.
Take a random sample of 30 observations from the uniform distribution, find its mean and store it in column 3, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 3 each time.
Take a random sample of 1000 observations from the uniform distribution, find its mean and store it in column 4, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 4 each time.

Try reading the loop from outside to inside this time, and i indexes rows and j indexes columns -

n <- c(2, 6, 30, 1000) # let n be the sample size
#larger sample size should have closer sample mean to true population mean

# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.  
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.

for (j in 1:4){        # indexing columns of matrix, where each column will represent different sample size
 
   for (i in 1:10000){  # indexes the rows of matrix
  
       z[i,j] <- mean(sample( x       = myunif,  # compute mean and assign
                              size    = n[j], 
                              replace = TRUE
                            )
                      )
    }
}

Step 5: If intuition is right, mean of sample should get closer to mean of actual uniform distribution as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

5A. With summary commands

colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 
summary(z)

##  Sample size=2   Sample size=6   Sample size=30  Sample size=1000
##  Min.   :10.07   Min.   :10.65   Min.   :12.91   Min.   :14.62   
##  1st Qu.:13.51   1st Qu.:14.20   1st Qu.:14.62   1st Qu.:14.93   
##  Median :14.98   Median :14.99   Median :14.99   Median :14.99   
##  Mean   :14.98   Mean   :15.01   Mean   :14.98   Mean   :14.99   
##  3rd Qu.:16.47   3rd Qu.:15.81   3rd Qu.:15.34   3rd Qu.:15.05   
##  Max.   :19.97   Max.   :19.11   Max.   :16.84   Max.   :15.32

# matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000.   10,000 iterations.

5B. With graphs

par(mfrow=c(3,2))

length(myunif)

## [1] 10000

hist(x    = myunif, 
     main = "Histogram of Uniform Distribution, N=10,000",
     xlab = ""
     )

?hist

for (k in 1:4){
    hist(x    = z[,k],     # matrix column 1-4 which contain sample means of randomly chosen samples from uniform dist
         main = "Histogram of Sample Mean of Uniform Distribution", 
         xlim = c(10, 20), # plot the domain of X axis from x=10 to x=20 only
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }

Step 6: Backtracing population mean and sd

Recall sample mean is population mean if CLT holds. Also, standard deviation of sample means (standard error) is just the population standard deviation divided by square root of smaple size. Using these two facts, we can backtrack population mean and population standard deviation from the sample moments.

z[1:16] # matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000.   10,000 iterations.

##  [1] 13.47066 17.79353 14.01069 10.83835 15.40102 12.55278 16.49548 14.46722
##  [9] 10.23553 14.44267 13.11832 18.13630 15.44076 13.57534 17.87807 14.80485

mu      # original uniform distribution mean

## [1] 14.99099

colMeans(z)              # [1] 14.99702 15.01788 15.02452 15.01863

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         14.98463         15.00541         14.98281         14.99077

##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################

sigma   # original uniform distribution sd

## [1] 2.903908

?apply # Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
apply(X = z,
      MARGIN = 2,  # function applied on columns, not rows as c(1, 2) indicates rows and columns respectively
      FUN = sd)    # [1] 2.03311048 1.18419671 0.52746969 0.09101246

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##       2.05097055       1.18846580       0.53071114       0.09290199

##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################


apply(X = z,
      MARGIN = 2,
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(1000)
                    ) # [1] 2.875252 2.900678 2.889070 2.878067

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         2.900510         2.911135         2.906825         2.937819

sigma

## [1] 2.903908

Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science. The CLT states that the sample mean of a probability distribution sample is a random variable with a mean value given by population mean and standard deviation given by population standard deviation divided by square root of N, where N is the sample size.

Thus, as N goes to infinity,the uncertainty or standard deviation of the mean goes to zero. This means that the larger the size of our dataset, the better, as larger samples lead to smaller variance error.

II. Poisson Distribution

Step 0: Clear the work space

# rm(list = ls()) # Clear environment
# gc()            # Clear unused memory
# cat("\f")       # Clear the console
# dev.off()       # Clear par(mfrow=c(3,2)) type commands

Step 1: Create the distribution

mypoiss <- rpois(n      = 10000, 
                 lambda = 1) 

mu2 <- mean(mypoiss) # 1.0004
mu2

## [1] 0.985

sd2 <-sd(mypoiss)    #  1.011188
sd2

## [1] 1.004329

hist(x = mypoiss, main = "Histogram of Poisson Distribution (lambda=1)")

## Step 2: Create an empty matrix that will store means of different sample sizes, say 6 different sample sizes of 2, 6, 30, 50, 200, 1000

Create 10000 row and 6 column matrix with 0 entries only -

z2 <- matrix(data = rep(x     = 0, 
                        times = 60000
                        ), 
             nrow = 10000, 
             ncol = 6
             )

z2[1:16]

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

STEP 3: Take 10,000 random samples of sample size 2,6,30,50 and 1000 respectively, compute their mean, and place the sample mean in the matrix z (column j) to plot it later

for (j in 1:6){       # indexing columns of matrix, where each column will represent different sample size
  for (i in 1:10000){ # indexes the rows of moatrix
    n2 <- c( 2 , 6 , 30, 50, 200, 1000) # let n be the sample size - larger sample size should have closer sample mean to true population mean
    z2[i,j] <-  mean(sample(x       = mypoiss, 
                            size    = n2[j], 
                            replace = TRUE
                            )
                      )
    }
}

Step 4: If intuition is right, mean of sample should get closer to actual uniform dist mean as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

Step 5: Confirm intution above with either Plot or with summary commands

colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30","Sample size=50", "Sample size=200", "Sample size=1000") 
summary(z2)

##  Sample size=2    Sample size=6    Sample size=30   Sample size=50 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.3667   Min.   :0.440  
##  1st Qu.:0.5000   1st Qu.:0.6667   1st Qu.:0.8667   1st Qu.:0.880  
##  Median :1.0000   Median :1.0000   Median :0.9667   Median :0.980  
##  Mean   :0.9886   Mean   :0.9820   Mean   :0.9857   Mean   :0.986  
##  3rd Qu.:1.5000   3rd Qu.:1.1667   3rd Qu.:1.1000   3rd Qu.:1.080  
##  Max.   :4.5000   Max.   :3.0000   Max.   :1.7333   Max.   :1.540  
##  Sample size=200  Sample size=1000
##  Min.   :0.7100   Min.   :0.8700  
##  1st Qu.:0.9350   1st Qu.:0.9640  
##  Median :0.9850   Median :0.9850  
##  Mean   :0.9852   Mean   :0.9854  
##  3rd Qu.:1.0350   3rd Qu.:1.0070  
##  Max.   :1.2650   Max.   :1.1120

# matrix row containing means of poisson distribution of different sample sizes - col 1 has rpoiss sample size of 2, col 2 has rpoiss sample size 6, col 3 has rpoiss sample size 30, and col 4 has rpoiss sample size 50, col 5 has rpoiss sample size 200, and col 6 has rpoiss sample size 1000.  
# means are taken 10,000 times / generated 10,000 times from the same distribution / iterations.  


par(mfrow = c(4,2))
hist(x    = mypoiss, 
     main = "Histogram of Poisson Distribution (lambda=1)"
     )


for (k in 1:6){
  hist(x = z2[,k], 
       main = "Histogram of Mean of Poisson Dist (lambda=1)",
       xlim = c(-1, 3), 
       ylim = c(0,10000),
       xlab = paste0("Column ", k, " from matrix") 
      )
} # main = paste("Mean of " ,z[,k] )

RUN THIS CHUNKS IF YOU WANT BIGGER GRAPHS

# dev.off()       # Clear par(mfrow=c(3,2)) type commands 

for (k in 1:6){
       hist(x = z2[,k], 
       main = "Histogram of Mean of Poisson Dist (lambda=1)",
       xlim = c(-2, 4),  # keeps the x axis same in all charts
       xlab = paste0("Column ", k, " from matrix") 
       )
} # main = paste("Mean of " ,z[,k] )

Step 6: Backtracing population mean and sd

mu2      # original uniform distribution mean

## [1] 0.985

colMeans(z2)              # [1] 14.99702 15.01788 15.02452 15.01863

##    Sample size=2    Sample size=6   Sample size=30   Sample size=50 
##        0.9886000        0.9820500        0.9857167        0.9859860 
##  Sample size=200 Sample size=1000 
##        0.9851815        0.9853921

##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################


sd2

## [1] 1.004329

##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################


apply(X = z2,
      MARGIN = 2,  # columns
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(50), 
                    sqrt(200), 
                    sqrt(1000)
                    ) ## [1] 0.9885683 0.9982089 0.9895905 0.9905761 0.9955966 1.0060817

##    Sample size=2    Sample size=6   Sample size=30   Sample size=50 
##        1.0188444        0.9993915        1.0019764        1.0041849 
##  Sample size=200 Sample size=1000 
##        1.0107349        1.0009749

Central Limit Theorem

Arvind Sharma

2023-10-31

Central Limit Theorem

Definition and Proof

I. UNIFORM DISTRIBUTION

Step 0: Clear the work space

Step 1: Start with a distribution, say uniform with true mean of 15 and min/max of 10/20 respectively

Step 2: Create an empty matrix of 10000 rows and 1 column in which you will store the means of randomly drawn samples (of size 100)

Create 10000 row and 1 column matrix with 0 entries only

Step 3: Take a random sample of 100 observations from the uniform distribution, find its mean and store it in row 1 of matrix z. Repeat 9,999 times - storing the mean of sample of size 100 in row i of matrix z.

Step 4: Generalise the code to use different sample sizes instead of just \(n=1000\) above

EXPAND COLUMNS OF MATRIX TO —

4 a ) EXPANDING COLUMNS OF NULL MATRIX CODE —

4 b ) REPLACE VALUES OF NULL MATRIX —

Step 5: If intuition is right, mean of sample should get closer to mean of actual uniform distribution as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

5A. With summary commands

5B. With graphs

Step 6: Backtracing population mean and sd

II. Poisson Distribution

Step 0: Clear the work space

Step 1: Create the distribution

Create 10000 row and 6 column matrix with 0 entries only -

STEP 3: Take 10,000 random samples of sample size 2,6,30,50 and 1000 respectively, compute their mean, and place the sample mean in the matrix z (column j) to plot it later

Step 4: If intuition is right, mean of sample should get closer to actual uniform dist mean as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

Step 5: Confirm intution above with either Plot or with summary commands

RUN THIS CHUNKS IF YOU WANT BIGGER GRAPHS

Step 6: Backtracing population mean and sd