Central Limit Theorem

Definition and Proof

The Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science.

The CLT states that the sample mean (\(\bar{x}\)) of a probability distribution sample is a random variable with a mean value given by population mean \(\mu\) and standard deviation \(\sigma_{\bar{x}}\) (standard error) given by population standard deviation \(\sigma\) divided by \(\sqrt{n}\)

i.e. \(\sigma_{\bar{x}}=\dfrac{\sigma}{\sqrt{n}}\), where n is the sample size.

Proof

I. UNIFORM DISTRIBUTION

Step 0: Clear the work space

rm(list = ls()) # Clear environment
gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 531384 28.4    1183607 63.3         NA   669265 35.8
## Vcells 976737  7.5    8388608 64.0      16384  1840452 14.1

cat("\f")       # Clear the console

# dev.off()     # Clear par(mfrow=c(3,2)) type commands

Step 1: Start with a distribution, say uniform with true mean of 15 and min/max of 10/20 respectively

?runif                 # runif generates random deviates; read the arguments required

myunif <- runif(n  = 10000, 
                min = 10, 
                max = 20 
                ) 
myunif[1:16]            # check if created correctly, run summary commands if required.  head(myunif)

##  [1] 15.27919 13.62529 14.89611 11.82282 13.57619 18.27487 16.20547 18.53198
##  [9] 12.40289 11.18536 15.69295 16.25366 14.33508 11.26731 13.19037 13.06659

mu <- mean(myunif)      # check mean of actual distribution
mu

## [1] 14.97481

sigma <- sd(myunif)     # check sd of actual distribution
sigma

## [1] 2.884187

library("psych")

## Warning: package 'psych' was built under R version 4.2.3

describe(myunif)        # summary statistics

##    vars     n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 10000 14.97 2.88  14.96   14.97 3.71  10  20    10 0.01    -1.21 0.03

### Plot out the distribution  
hist(x    = myunif, 
     main = "Histogram of Uniform Distribution (min=10, max=20, n=10,000)",
     xlab = ""
     )

Step 2: Create an empty matrix of 10000 rows and 1 column in which you will store the means of randomly drawn samples (of size 100)

Create 10000 row and 1 column matrix with 0 entries only

?matrix     # creates a matrix from the given set of values.
?rep        # replicates the values in x lenth times

# fill in the matrix with 0s
z <- matrix(data = rep(x     = 0, 
                       times = 10000
                       ), 
            nrow = 10000, 
            ncol = 1)

z[1:16]

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

describe(z)

##    vars     n mean sd median trimmed mad min max range skew kurtosis se
## X1    1 10000    0  0      0       0   0   0   0     0  NaN      NaN  0

Step 3: Take a random sample of 100 observations from the uniform distribution, find its mean and store it in row 1 of matrix z. Repeat 9,999 times - storing the mean of sample of size 100 in row i of matrix z.

Try reading the loop from inside out.

?sample # makes a sample of the specified size from the elements of x using either with or without replacement.
# sample(x, size, replace = FALSE, prob = NULL)

for (i in 1:10000){ 
    z[i,] <- mean(sample( x        = myunif, # a vector of elements from which to choose, or a positive integer. 
                          size     = 100,    # a non-negative integer giving the number of items to choose.
                          replace  = TRUE    # should sampling be with replacement?
                        )
                )
  }

z[1:16]

##  [1] 14.58196 14.86110 15.24264 14.80985 15.43189 14.64520 15.05815 15.67749
##  [9] 15.01629 14.01945 14.55968 14.68105 14.50573 15.05990 15.04674 14.79897

describe(z)

##    vars     n  mean   sd median trimmed  mad   min   max range skew kurtosis se
## X1    1 10000 14.98 0.29  14.98   14.98 0.29 13.66 16.08  2.42    0    -0.03  0

hist(z, xlab = "", main = "Histogram of Sample Mean (n = 100)")

Step 4: Generalise the code to use different sample sizes instead of just \(n=1000\) above

EXPAND COLUMNS OF MATRIX TO —

CONTAINING SAMPLE MEAN OF A 2 obs SAMPLE,
CONTAINING SAMPLE MEAN OF A 6 obs BIG SAMPLE,
CONTAINING SAMPLE MEAN OF A 30 obs BIGGER SAMPLE, AND
CONTAINING SAMPLE MEAN OF A 1000 obs BIGGEST SAMPLES

4 a ) EXPANDING COLUMNS OF NULL MATRIX CODE —

z <- matrix(data = rep(x     = 0, 
                       times = 40000
                       ), 
            nrow = 10000, 
            ncol = 4)

4 b ) REPLACE VALUES OF NULL MATRIX —

Take a random sample of 2 observations from the uniform distribution, find its mean and store it in column 1, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 1 each time.
Take a random sample of 6 observations from the uniform distribution, find its mean and store it in column 2, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column21 each time.
Take a random sample of 30 observations from the uniform distribution, find its mean and store it in column 3, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 3 each time.
Take a random sample of 1000 observations from the uniform distribution, find its mean and store it in column 4, row 1 of matrix z. Repeat 9,999 times filling in different rows i of column 4 each time.

Try reading the loop from outside to inside this time, and i indexes rows and j indexes columns -

n <- c(2, 6, 30, 1000) # let n be the sample size
#larger sample size should have closer sample mean to true population mean

# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.  
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.

for (j in 1:4){        # indexing columns of matrix, where each column will represent different sample size
 
   for (i in 1:10000){  # indexes the rows of matrix
  
       z[i,j] <- mean(sample( x       = myunif,  # compute mean and assign
                              size    = n[j], 
                              replace = TRUE
                            )
                      )
    }
}

Step 5: If intuition is right, mean of sample should get closer to mean of actual uniform distribution as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

5A. With summary commands

colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 
summary(z)

##  Sample size=2   Sample size=6   Sample size=30  Sample size=1000
##  Min.   :10.10   Min.   :10.88   Min.   :13.10   Min.   :14.65   
##  1st Qu.:13.51   1st Qu.:14.14   1st Qu.:14.62   1st Qu.:14.91   
##  Median :14.97   Median :14.95   Median :14.98   Median :14.97   
##  Mean   :14.97   Mean   :14.96   Mean   :14.98   Mean   :14.97   
##  3rd Qu.:16.40   3rd Qu.:15.79   3rd Qu.:15.33   3rd Qu.:15.04   
##  Max.   :19.99   Max.   :19.14   Max.   :16.85   Max.   :15.34

# matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000.   10,000 iterations.

5B. With graphs

par(mfrow=c(3,2))

length(myunif)

## [1] 10000

hist(x    = myunif, 
     main = "Histogram of Uniform Distribution, N=10,000",
     xlab = ""
     )

?hist

for (k in 1:4){
    hist(x    = z[,k],     # matrix column 1-4 which contain sample means of randomly chosen samples from uniform dist
         main = "Histogram of Sample Mean of Uniform Distribution", 
         xlim = c(10, 20), # plot the domain of X axis from x=10 to x=20 only
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }

Step 6: Backtracing population mean and sd

Recall sample mean is population mean if CLT holds. Also, standard deviation of sample means (standard error) is just the population standard deviation divided by square root of smaple size. Using these two facts, we can backtrack population mean and population standard deviation from the sample moments.

z[1:16] # matrix row containing means of uniform distribution of different sample sizes - col 1 has runif sample size of 2, col 2 has runif sample size 6, col 3 has runif sample size 30, and col 4 has runif sample size 1000.   10,000 iterations.

##  [1] 17.47448 17.15370 15.73808 11.16293 13.06245 12.82670 12.86843 15.91598
##  [9] 11.82471 13.71621 16.02936 13.46811 12.64761 14.76458 14.76267 14.97740

mu      # original uniform distribution mean

## [1] 14.97481

colMeans(z)              # [1] 14.99702 15.01788 15.02452 15.01863

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         14.97367         14.96318         14.97835         14.97389

##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################

sigma   # original uniform distribution sd

## [1] 2.884187

?apply # Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
apply(X = z,
      MARGIN = 2,  # function applied on columns, not rows as c(1, 2) indicates rows and columns respectively
      FUN = sd)    # [1] 2.03311048 1.18419671 0.52746969 0.09101246

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##       2.03644517       1.19496139       0.52399152       0.09187381

##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################


apply(X = z,
      MARGIN = 2,
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(1000)
                    ) # [1] 2.875252 2.900678 2.889070 2.878067

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         2.879968         2.927046         2.870020         2.905305

sigma

## [1] 2.884187

Central Limit Theorem (CLT) is one of the most important theorems in statistics and data science. The CLT states that the sample mean of a probability distribution sample is a random variable with a mean value given by population mean and standard deviation given by population standard deviation divided by square root of N, where N is the sample size.

Thus, as N goes to infinity,the uncertainty or standard deviation of the mean goes to zero. This means that the larger the size of our dataset, the better, as larger samples lead to smaller variance error.

II. Poisson Distribution

Step 0: Clear the work space

# rm(list = ls()) # Clear environment
# gc()            # Clear unused memory
# cat("\f")       # Clear the console
# dev.off()       # Clear par(mfrow=c(3,2)) type commands

Step 1: Create the distribution

mypoiss <- rpois(n      = 10000, 
                 lambda = 1) 

mu2 <- mean(mypoiss) # 1.0004
mu2

## [1] 0.9892

sd2 <-sd(mypoiss)    #  1.011188
sd2

## [1] 0.9964852

hist(x = mypoiss, main = "Histogram of Poisson Distribution (lambda=1)")

## Step 2: Create an empty matrix that will store means of different sample sizes, say 6 different sample sizes of 2, 6, 30, 50, 200, 1000

Create 10000 row and 6 column matrix with 0 entries only -

z2 <- matrix(data = rep(x     = 0, 
                        times = 60000
                        ), 
             nrow = 10000, 
             ncol = 6
             )

z2[1:16]

##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

STEP 3: Take 10,000 random samples of sample size 2,6,30,50 and 1000 respectively, compute their mean, and place the sample mean in the matrix z (column j) to plot it later

for (j in 1:6){       # indexing columns of matrix, where each column will represent different sample size
  for (i in 1:10000){ # indexes the rows of moatrix
    n2 <- c( 2 , 6 , 30, 50, 200, 1000) # let n be the sample size - larger sample size should have closer sample mean to true population mean
    z2[i,j] <-  mean(sample(x       = mypoiss, 
                            size    = n2[j], 
                            replace = TRUE
                            )
                      )
    }
}

Step 4: If intuition is right, mean of sample should get closer to actual uniform dist mean as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

Step 5: Confirm intution above with either Plot or with summary commands

colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30","Sample size=50", "Sample size=200", "Sample size=1000") 
summary(z2)

##  Sample size=2    Sample size=6    Sample size=30   Sample size=50  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.3667   Min.   :0.4400  
##  1st Qu.:0.5000   1st Qu.:0.6667   1st Qu.:0.8667   1st Qu.:0.9000  
##  Median :1.0000   Median :1.0000   Median :0.9667   Median :0.9800  
##  Mean   :0.9966   Mean   :0.9857   Mean   :0.9865   Mean   :0.9899  
##  3rd Qu.:1.5000   3rd Qu.:1.1667   3rd Qu.:1.1000   3rd Qu.:1.0800  
##  Max.   :5.0000   Max.   :2.8333   Max.   :1.7667   Max.   :1.5400  
##  Sample size=200  Sample size=1000
##  Min.   :0.7500   Min.   :0.8580  
##  1st Qu.:0.9400   1st Qu.:0.9680  
##  Median :0.9900   Median :0.9890  
##  Mean   :0.9896   Mean   :0.9886  
##  3rd Qu.:1.0400   3rd Qu.:1.0100  
##  Max.   :1.3000   Max.   :1.1120

# matrix row containing means of poisson distribution of different sample sizes - col 1 has rpoiss sample size of 2, col 2 has rpoiss sample size 6, col 3 has rpoiss sample size 30, and col 4 has rpoiss sample size 50, col 5 has rpoiss sample size 200, and col 6 has rpoiss sample size 1000.  
# means are taken 10,000 times / generated 10,000 times from the same distribution / iterations.  


par(mfrow = c(4,2))
hist(x    = mypoiss, 
     main = "Histogram of Poisson Distribution (lambda=1)"
     )


for (k in 1:6){
  hist(x = z2[,k], 
       main = "Histogram of Mean of Poisson Dist (lambda=1)",
       xlim = c(-1, 3), 
       ylim = c(0,10000),
       xlab = paste0("Column ", k, " from matrix") 
      )
} # main = paste("Mean of " ,z[,k] )

RUN THIS CHUNKS IF YOU WANT BIGGER GRAPHS

# dev.off()       # Clear par(mfrow=c(3,2)) type commands 

for (k in 1:6){
       hist(x = z2[,k], 
       main = "Histogram of Mean of Poisson Dist (lambda=1)",
       xlim = c(-2, 4),  # keeps the x axis same in all charts
       xlab = paste0("Column ", k, " from matrix") 
       )
} # main = paste("Mean of " ,z[,k] )

Step 6: Backtracing population mean and sd

mu2      # original uniform distribution mean

## [1] 0.9892

colMeans(z2)              # [1] 14.99702 15.01788 15.02452 15.01863

##    Sample size=2    Sample size=6   Sample size=30   Sample size=50 
##        0.9966000        0.9857000        0.9865067        0.9898740 
##  Sample size=200 Sample size=1000 
##        0.9895905        0.9886167

##########################################################################################
### original distribution mean is approximated well by small (n=2) sample and/or big (n=1000) sample equally well ###
##########################################################################################


sd2

## [1] 0.9964852

##########################################################################################
### original distribution sd is approximated by adjusting for sample sd by sqrt n, where n is the sample size###
##########################################################################################


apply(X = z2,
      MARGIN = 2,  # columns
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(50), 
                    sqrt(200), 
                    sqrt(1000)
                    ) ## [1] 0.9885683 0.9982089 0.9895905 0.9905761 0.9955966 1.0060817

##    Sample size=2    Sample size=6   Sample size=30   Sample size=50 
##        0.9996383        0.9985353        0.9934904        0.9819469 
##  Sample size=200 Sample size=1000 
##        0.9949110        0.9969903

Central Limit Theorem

Arvind Sharma

2024-04-16

Central Limit Theorem

Definition and Proof

I. UNIFORM DISTRIBUTION

Step 0: Clear the work space

Step 1: Start with a distribution, say uniform with true mean of 15 and min/max of 10/20 respectively

Step 2: Create an empty matrix of 10000 rows and 1 column in which you will store the means of randomly drawn samples (of size 100)

Create 10000 row and 1 column matrix with 0 entries only

Step 3: Take a random sample of 100 observations from the uniform distribution, find its mean and store it in row 1 of matrix z. Repeat 9,999 times - storing the mean of sample of size 100 in row i of matrix z.

Step 4: Generalise the code to use different sample sizes instead of just \(n=1000\) above

EXPAND COLUMNS OF MATRIX TO —

4 a ) EXPANDING COLUMNS OF NULL MATRIX CODE —

4 b ) REPLACE VALUES OF NULL MATRIX —

Step 5: If intuition is right, mean of sample should get closer to mean of actual uniform distribution as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

5A. With summary commands

5B. With graphs

Step 6: Backtracing population mean and sd

II. Poisson Distribution

Step 0: Clear the work space

Step 1: Create the distribution

Create 10000 row and 6 column matrix with 0 entries only -

STEP 3: Take 10,000 random samples of sample size 2,6,30,50 and 1000 respectively, compute their mean, and place the sample mean in the matrix z (column j) to plot it later

Step 4: If intuition is right, mean of sample should get closer to actual uniform dist mean as sample size increases, and the distribution of sample mean should get ‘tighter’ around the true popultion mean.

Step 5: Confirm intution above with either Plot or with summary commands

RUN THIS CHUNKS IF YOU WANT BIGGER GRAPHS

Step 6: Backtracing population mean and sd