Law of Large Numbers & Central Limit Theorem

Please go to the Dropbox folder (link in Course Introduction and Materials) and find the R Markdown file (.Rmd extension) for the Central Limit Theorem (CLT) code we saw in class.

Then -

0.

Begin by setting the seed in R. The recommended way to specify a seed is set.seed(seed = 42), where seed can be any single value that is interpreted as an integer (42 here, but you can use your favorite number instead).

0.Solution:

set.seed(seed = 7) #my favorite number!

1.

Please Google and describe the Law of Large Numbers in your own words.

1.Solution:

The Law of Large Numbers (LLN) states that as you repeat an experiment independently more and more times and average the results, the average gets closer and closer to the true expected value. In other words, a sample mean computed from a large number of independent observations is a reliable estimate of the population mean, while a mean computed from only a few observations can be far off. In the social sciences, we often begin with a pilot study with a small sample size and then re-run the experiment with a larger sample; the LLN is the reason the estimate from the larger sample is expected to be closer to the true population value. The LLN assumes the observations are drawn at random from the population of interest, which is why we also replicate the study with different subsets of the population before generalizing the findings to the entire population.
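
As a rough illustration of the LLN, the sketch below tracks the running average of simulated fair die rolls. The die, the number of rolls, and the checkpoint values are assumptions chosen for illustration; they are not part of the class code.

```r
# Minimal sketch of the Law of Large Numbers (illustrative values only)
set.seed(7)

n_rolls <- 100000                                    # hypothetical number of die rolls
rolls   <- sample(x = 1:6, size = n_rolls, replace = TRUE)

# Running average after 1, 2, ..., n_rolls rolls
running_mean <- cumsum(rolls) / seq_len(n_rolls)

true_mean <- mean(1:6)                               # expected value of a fair die (3.5)

# Absolute error of the running average at a few checkpoints
checkpoints <- c(10L, 100L, 1000L, 10000L, 100000L)
lln_error   <- abs(running_mean[checkpoints] - true_mean)
names(lln_error) <- as.character(checkpoints)
print(lln_error)
```

The error does not have to shrink at every single step, but by the last checkpoint the running average should sit very close to 3.5.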

2.

Please explain the CLT in your own words. You can, and should, read your textbook and/or online references to understand what the CLT is, its uses, et cetera. Furthermore, if you find any useful resource, include it in your post so that the rest of the class can have a look at it too. E.g. - Aspect 3 in the section Overview of four aspects. (1:16) in the YouTube video substantiates my claim more formally that xxx..., Josh Starmer at StatQuest... , JB Statistics Intro to CLT, significance/uses of CLT, Wikipedia, ...

2.Solution:

The Central Limit Theorem (CLT) states that, with a large enough sample size, the sampling distribution of the sample mean is approximately normal, regardless of the shape of the underlying distribution (e.g., Poisson, binomial, exponential), provided the observations are independent and the population has a finite variance. The sampling distribution is centered on the population mean, and its standard deviation (the standard error) equals the population standard deviation divided by the square root of the sample size, so the sampling error decreases as the sample size increases. In other words, as the sample size increases (a common rule of thumb in the social sciences is more than 30), the histogram of sample means takes on a bell-curve shape that becomes narrower around the population mean.

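
A small sketch of the CLT statement, assuming an Exponential(1) population (mean 1, standard deviation 1) purely as an example of a skewed distribution. The sample size of 50 and the 5,000 replications are arbitrary illustrative choices.

```r
# Minimal sketch of the CLT with a skewed population (illustrative values only)
set.seed(7)

n_rep <- 5000   # hypothetical number of repeated samples
n_obs <- 50     # hypothetical size of each sample
rate  <- 1      # Exponential(1): population mean 1, population sd 1

# One sample mean per repeated sample
sample_means <- replicate(n_rep, mean(rexp(n = n_obs, rate = rate)))

# Standardise with the population mean and the standard error sd / sqrt(n)
z_scores <- (sample_means - 1 / rate) / ((1 / rate) / sqrt(n_obs))

# If the CLT approximation is good, these should be near 0 and 1 ...
print(c(mean = mean(z_scores), sd = sd(z_scores)))

# ... and roughly 95% of the standardised means should fall within +/- 1.96
coverage <- mean(abs(z_scores) < 1.96)
print(coverage)
```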

3.

What are the similarities and differences between LLN and CLT? Write a few lines.

3.Solution:

The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are similar in that both describe how the sample mean behaves as the sample size grows, and both are what allow us to make inferences about the general population from a sample. However, they differ in what they say: the LLN says that the sample mean converges to the population mean as the sample size (or the number of repetitions of the experiment) increases, whereas the CLT describes the shape of the sample mean's distribution around the population mean, which is approximately normal when the sample size is large enough.

For example, let’s imagine that I’m starting an ice cream factory and my business plan is to sell only five ice cream flavors. To decide on the flavors, I randomly send out surveys to residents living in my state. Many of the responses name orange sherbet as the favorite, a fruit and flavor that is native and unique to my state. Arguably, the sample is small and not representative of all 50 US states at this point. However, the more I sample residents from other US states and increase the sample size, the more likely it is that the distribution of flavors will change, with common flavors like chocolate and vanilla taking the largest shares and less common flavors, like orange sherbet, falling toward the bottom of the ranking. With the LLN, a larger number of randomly sampled responses from across the US brings the share of respondents who prefer each flavor closer to its true nationwide share, while the CLT states that, with enough responses, each sample share is approximately normally distributed around the true share, so we can also say how far off our estimate is likely to be.
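
The survey example can be sketched in R as follows. The true nationwide share of 0.30, the survey sizes, and the number of simulated surveys are all made-up values for illustration.

```r
# Minimal sketch of the ice cream survey (all numbers are hypothetical)
set.seed(7)

p_true       <- 0.30              # assumed true share who prefer chocolate
survey_sizes <- c(20, 200, 2000)  # assumed numbers of respondents per survey
n_surveys    <- 5000              # simulated surveys per survey size

# Each column holds the sample share from n_surveys surveys of one size
p_hat <- sapply(survey_sizes,
                function(n) rbinom(n = n_surveys, size = n, prob = p_true) / n)
colnames(p_hat) <- paste0("n=", survey_sizes)

# LLN: the typical distance between the sample share and the true share shrinks
mean_abs_error <- colMeans(abs(p_hat - p_true))

# CLT: the spread of the sample share is close to sqrt(p * (1 - p) / n)
observed_sd <- apply(X = p_hat, MARGIN = 2, FUN = sd)
clt_sd      <- sqrt(p_true * (1 - p_true) / survey_sizes)

print(rbind(mean_abs_error, observed_sd, clt_sd))
```

The first row speaks to the LLN (the estimate gets closer to the truth), and the last two rows speak to the CLT (the size of the remaining error is predictable).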

4.

Pick any distribution apart from the normal, uniform, or Poisson. You can look up the distribution on Wikipedia and/or read how to implement it in R (what parameters are required to generate the distribution).

Please describe this distribution first in 5 lines.

4.Solution:

Binomial distributions look at the total count of outcomes in a succession of trials; each trial permits only two possible mutually exclusive outcomes (i.e., binary). For example, the number of people in a survey who choose chocolate ice cream as their favorite flavor would follow a binomial distribution. In other words, it models the number of successes in a fixed number of independent Bernoulli trials that share the same probability of success. The parameters required to model a binomial distribution are: 1) the number of trials (n), and 2) the probability of success on each trial (p); the random variable x is the number of successes observed in those n trials. In R, n and p are the size and prob arguments of rbinom().
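
A short sketch of how the binomial distribution is handled in R, assuming a hypothetical survey of 10 people, each of whom picks chocolate with probability 0.3. It checks the textbook mean n * p and variance n * p * (1 - p) against the probability mass function and against simulated draws.

```r
# Minimal sketch of the binomial distribution in R (hypothetical n and p)
n_trials  <- 10    # assumed number of people surveyed
p_success <- 0.3   # assumed probability that one person picks chocolate

# dbinom() gives P(X = x) for x = 0, 1, ..., n_trials
x_values <- 0:n_trials
pmf      <- dbinom(x = x_values, size = n_trials, prob = p_success)

# Theoretical mean and variance of a binomial random variable
theo_mean <- n_trials * p_success
theo_var  <- n_trials * p_success * (1 - p_success)

# The same quantities computed directly from the probability mass function
pmf_mean <- sum(x_values * pmf)
pmf_var  <- sum((x_values - pmf_mean)^2 * pmf)

# rbinom() simulates draws; their average should be close to theo_mean
set.seed(7)
draws <- rbinom(n = 100000, size = n_trials, prob = p_success)

print(c(theo_mean = theo_mean, pmf_mean = pmf_mean, simulated_mean = mean(draws)))
print(c(theo_var = theo_var, pmf_var = pmf_var, simulated_var = var(draws)))
```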

5 A.

Then, apply the CLT to the sample mean of this chosen distribution in R (adapt our class R code, or you can find alternative code on the web too).

5A. Solution:

rm(list = ls()) # Clear environment
gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 537496 28.8    1201002 64.2         NA   669400 35.8
## Vcells 995660  7.6    8388608 64.0      16384  1851617 14.2

cat("\f")       # Clear console

set.seed(7)
binom.data <- rbinom(10000, size = 10000, prob = 0.5) # 10,000 draws from Binomial(size = 10000, prob = 0.5); used as the population below

library(psych)
describe(binom.data)

##    vars     n    mean    sd median trimmed   mad  min  max range  skew kurtosis
## X1    1 10000 5000.55 50.59   5000 5000.51 50.41 4794 5201   407 -0.01     0.02
##      se
## X1 0.51

summary(binom.data)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4794    4966    5000    5001    5035    5201

mu <- mean(binom.data)      # check mean of actual distribution
mu

## [1] 5000.546

sigma <- sd(binom.data)     # check sd of actual distribution
sigma

## [1] 50.59332

#visualize
hist(x = binom.data,
     main = "Histogram of the Binomial Distribution with n = 10,000 and p = 0.5 ",
     xlab = "")

z <- matrix(data = rep(x     = 0, 
                       times = 40000
                       ), 
            nrow = 10000, 
            ncol = 4)

# i indexes rows and j indexes columns

# let n be the sample size 
#sample mean of 2 obs, 6 obs, 30, 1000 
n <- c(2, 6, 30, 1000) 

for (j in 1:4){
  for (i in 1:10000){ 
    z[i,j] <- mean(
                  sample( x        = binom.data,  
                          size     = n[j],   
                          replace  = TRUE   
                        )
                )
  }
}

colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 
summary(z)

##  Sample size=2  Sample size=6  Sample size=30 Sample size=1000
##  Min.   :4860   Min.   :4925   Min.   :4963   Min.   :4995    
##  1st Qu.:4976   1st Qu.:4986   1st Qu.:4994   1st Qu.:4999    
##  Median :5000   Median :5000   Median :5001   Median :5001    
##  Mean   :5000   Mean   :5000   Mean   :5001   Mean   :5001    
##  3rd Qu.:5024   3rd Qu.:5014   3rd Qu.:5007   3rd Qu.:5002    
##  Max.   :5148   Max.   :5085   Max.   :5034   Max.   :5006

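# the sample means stay centered on the population mean; their spread shrinks as sample size increases

# note: hist(z) pools the sample means from all four sample sizes into one histogram
hist(z, xlab = "", main = "Histogram of Sample Mean")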

#check CLT with graph
par(mfrow=c(3,2))
length(binom.data)

## [1] 10000

hist(x    = binom.data, 
     main = "Histogram of Binom Distribution, N=10,000",
     xlab = ""
     )

# matrix column 1-4 which contain sample means of randomly chosen samples from binom dist
for (k in 1:4){
    hist(x    = z[,k],    
         main = "Histogram of Sample Mean of Binom Dist, N=10,000", 
         xlim = range(binom.data), # same x-axis range as the population in every panel
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
}

#check sample mean to population mean
mu

## [1] 5000.546

#vs

apply(X = z,
      MARGIN = 2,  # function applied on columns
      FUN = mean)

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         5000.181         5000.241         5000.686         5000.556

#check sample sd to population sd
sigma

## [1] 50.59332

#vs

apply(X = z,
      MARGIN = 2,
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(1000)
                    )

##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##         50.03404         50.63752         50.55205         50.31923

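# sd of the sample means times sqrt(n) recovers sigma, i.e. the standard error is sigma / sqrt(n)
# so the larger the sample size, the smaller the sampling error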
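
For reference, the standard deviation of a Binomial(size = 10000, prob = 0.5) variable and the standard errors the CLT predicts for the four sample sizes can be computed directly. This is a small sketch; the names sigma_theory, n_sizes, and se_theory are introduced here only for illustration.

```r
# Theoretical benchmark for the simulation above (illustrative variable names)
size <- 10000
prob <- 0.5

# sd of a binomial random variable: sqrt(size * prob * (1 - prob))
sigma_theory <- sqrt(size * prob * (1 - prob))

# Standard error of the sample mean for each sample size: sigma / sqrt(n)
n_sizes   <- c(2, 6, 30, 1000)
se_theory <- sigma_theory / sqrt(n_sizes)
names(se_theory) <- paste0("Sample size=", n_sizes)

print(sigma_theory)
print(round(se_theory, 2))
```

The simulated sigma of 50.59 and the rescaled standard deviations reported above (all between about 50.0 and 50.6) are close to this theoretical value.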

5 B.

Alternatively, apply the CLT on any other sample statistic like say the sample median, sample 25th percentile or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.

Does the central limit theorem hold as expected? Please elaborate (at least 3 points).

You can post a few pictures to substantiate your claim while answering the CLT part above. Make sure there are comments in your code to explain and walk the reader through your logic.

5B. Solution:

percentile_25 <- quantile(x = binom.data,    # numeric vector whose sample quantiles are wanted
                          probs = c(.25)     # numeric vector of probabilities with values in [0, 1]
                          ) 
percentile_25

##  25% 
## 4966

z2 <- matrix(data = rep(x     = 0, 
                        times = 50000
                        ), 
             nrow = 10000, 
             ncol = 5)

n <- c(2, 6, 30, 120, 1000) 
for (j in 1:5){        
   for (i in 1:10000){  
       z2[i,j] <- quantile( # CHANGE FROM MEAN TO QUANTILE
                            x     = sample( x       = binom.data, 
                                            size    = n[j], 
                                            replace = TRUE
                                        ),
                            probs = c(.25) 
                       )
    }
}


# label each column with its sample size
colnames(z2) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=120", "Sample size=1000") 

summary(z2)

##  Sample size=2  Sample size=6  Sample size=30 Sample size=120 Sample size=1000
##  Min.   :4849   Min.   :4876   Min.   :4919   Min.   :4944    Min.   :4959    
##  1st Qu.:4962   1st Qu.:4958   1st Qu.:4960   1st Qu.:4963    1st Qu.:4965    
##  Median :4987   Median :4974   Median :4968   Median :4967    Median :4966    
##  Mean   :4987   Mean   :4974   Mean   :4968   Mean   :4967    Mean   :4966    
##  3rd Qu.:5012   3rd Qu.:4991   3rd Qu.:4976   3rd Qu.:4971    3rd Qu.:4968    
##  Max.   :5126   Max.   :5076   Max.   :5014   Max.   :4995    Max.   :4974

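The 25th percentile of the population is 4966. As you can see with the five sample sizes, the sampling distribution of the sample 25th percentile behaves much as the CLT describes for the sample mean. First, its center moves toward the population value as the sample size grows: the mean of the sample 25th percentiles is 4987 at a sample size of 2, 4968 at 30, and 4966 at 1000. Second, its spread shrinks: the range narrows from 4849 to 5126 at a sample size of 2 down to 4959 to 4974 at a sample size of 1000. Third, the mean and median of each column agree to the nearest integer, which is consistent with a roughly symmetric, bell-shaped sampling distribution; the histograms below check this graphically. Strictly speaking, the classical CLT is a statement about the sample mean, but sample quantiles such as the median, 25th, or 80th percentile are typically also approximately normally distributed in large samples. With very small samples the sample 25th percentile is a biased estimate of the population value (4987 vs. 4966 at a sample size of 2), so a sample size of about 30 or more is needed here before the estimate is close.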

## check graphically

par(mfrow=c(3,2))

hist(x    = binom.data, 
     main = "Histogram of Binom Dist, N=10,000",
     xlab = ""
     )

for (k in 1:5){
    hist(x    = z2[,k],   #z2 now
         main = "Histogram of Sample 25 Percentile of Binom Dist", 
         xlim = range(binom.data), # same x-axis range as the population in every panel
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }
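
As a rough cross-check on the 25th-percentile simulation, the sketch below compares the simulated spread of a sample 25th percentile with the large-sample formula sqrt(p * (1 - p) / n) / f(q), where f is the population density at the true quantile q. It assumes a continuous Normal(5000, 50) population as a stand-in for Binomial(10000, 0.5); that substitution, the sample size, and the number of replications are assumptions made for illustration.

```r
# Minimal sketch: large-sample behaviour of a sample quantile
# (Normal(5000, 50) is an assumed stand-in for the binomial population)
set.seed(7)

mu_pop    <- 5000
sigma_pop <- 50
p         <- 0.25   # quantile of interest (25th percentile)
n_obs     <- 1000   # hypothetical sample size
n_rep     <- 2000   # hypothetical number of repeated samples

# True 25th percentile of the assumed population
q_true <- qnorm(p, mean = mu_pop, sd = sigma_pop)

# Sample 25th percentile from each repeated sample
sample_q <- replicate(n_rep,
                      quantile(x     = rnorm(n_obs, mean = mu_pop, sd = sigma_pop),
                               probs = p,
                               names = FALSE))

# Large-sample standard error of a sample quantile
se_asym <- sqrt(p * (1 - p) / n_obs) / dnorm(q_true, mean = mu_pop, sd = sigma_pop)

print(c(true_quantile    = q_true,
        mean_of_sample_q = mean(sample_q),
        observed_sd      = sd(sample_q),
        asymptotic_sd    = se_asym))
```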

Week 5 Discussion

Jiwon Ban

2024-04-21