Week5_Discussion5

0. Begin by setting a seed in R. The recommended way to specify a seed is set.seed(seed = 42), where seed can be any single value that is interpreted as an integer (42 here, but you can use your favorite number instead).

set.seed(777)

1. Please Google and describe Law of Large Numbers in your own words.

The Law of Large Numbers (LLN) states that as the sample size grows, the sample mean gets closer to the population mean. Put another way, a larger sample represents the population better. Casinos are a good example. A single play is a sample of 1, and on any given play the gambler may win and the casino lose. Every game has a small built-in edge for the house, though, so as players keep gambling and the number of plays grows, the average result per play settles toward that expected loss. The odds themselves do not change over time; the LLN just makes the long-run average predictable, which is why the casino reliably comes out ahead. This is described in this YouTube video: https://www.youtube.com/watch?v=RXY-WN0ahiw.
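The casino example can be sketched in a few lines of R. This is only an illustration under my own assumptions: a $1 bet on red on an American roulette wheel (18 red pockets out of 38), a seed of 42, and the names p_win, payoffs and running_mean, none of which come from the discussion prompt.

```r
set.seed(42)                                  # arbitrary seed for this sketch

p_win    <- 18 / 38                           # assumed game: $1 on red, American wheel
n_plays  <- 100000                            # number of simulated plays
expected <- p_win * 1 + (1 - p_win) * (-1)    # expected payoff per play (-2/38)

# Each play pays +1 with probability p_win and -1 otherwise
payoffs <- sample(c(1, -1), size = n_plays, replace = TRUE,
                  prob = c(p_win, 1 - p_win))

# Average payoff per play after 1, 2, ..., n_plays plays
running_mean <- cumsum(payoffs) / seq_along(payoffs)

# Early averages jump around; late ones sit near the expected value
running_mean[c(10, 100, 1000, n_plays)]
expected
```

The first few running averages can easily be positive (the gambler is ahead), but the last one should be very close to the expected payoff.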

2. Please explain CLT in your own words.

Many statistical tests are based on the Central Limit Theorem (CLT). The CLT is about the sampling distribution of the mean, which is the distribution of the means of repeated samples taken from a population. As we usually don’t know the true population mean, we need to take samples to estimate it. The CLT states that, for independent observations from a population with finite variance, the distribution of the sample mean becomes approximately normal as the sample size n grows. This applies regardless of the shape of the distribution we take the samples from (uniform, exponential, etc.). The theorem is a statement about means (or sums), so you need to calculate a mean from each sample. A common rule of thumb is that a sample size of at least 30 is enough, although strongly skewed distributions may need more. There are 4 aspects to the CLT: the observations in a sample must be independent; as the sample size increases, the distribution of sample means gets closer to normal; the original distribution can be any shape, as long as its variance is finite; and to compare against the standard normal, we standardize the sample mean by subtracting the population mean and dividing by the standard error (the population standard deviation divided by the square root of n).
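As a rough check of the standardization step, here is a minimal sketch that assumes an exponential population with rate 1 (so the population mean and standard deviation are both 1), a sample size of 30, and 10,000 repetitions. The variable names are mine and are only for illustration.

```r
set.seed(42)            # arbitrary seed for this sketch

n     <- 30             # observations per sample
reps  <- 10000          # number of samples drawn
mu    <- 1              # mean of an exponential(rate = 1) population
sigma <- 1              # standard deviation of the same population

# One sample mean per repetition
means <- replicate(reps, mean(rexp(n, rate = 1)))

# Standardize: subtract the population mean, divide by the standard error
z_scores <- (means - mu) / (sigma / sqrt(n))

# If the normal approximation is good, these should be near 0, 1 and 0.95
c(mean = mean(z_scores), sd = sd(z_scores))
mean(abs(z_scores) <= 1.96)
```

The last line is the share of standardized means within 1.96 of zero, which would be about 95% for an exactly normal distribution.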

3. What are the similarities and differences between LLN and CLT?

The LLN and CLT are both fundamental concepts in inferential statistics, and both describe what happens to the sample mean as the sample size increases. They are separate results, though, rather than one being a special case of the other. The LLN states that as the sample size increases, the sample mean converges in probability to the population mean. Put another way, as the sample size or number of trials increases, the probability that our sample mean differs from the population mean by more than a tiny amount approaches zero. So the LLN is focused on where the sample mean ends up. The CLT states that as the sample size increases, the distribution of the sample mean tends toward the normal distribution, regardless of the distribution from which we are sampling (provided its variance is finite). So the CLT is focused on the shape and spread of the distribution of sample means around the population mean. This is helpful in statistics because we can make inferences about a population mean from a large sample without knowing the population’s distribution.
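One way to see the two results side by side is to look at the spread of the sample mean with and without rescaling. The sketch below assumes a uniform(0, 1) population (mean 0.5, standard deviation sqrt(1/12)) and 2,000 repetitions per sample size; these choices and the names are my own.

```r
set.seed(42)                       # arbitrary seed for this sketch

mu    <- 0.5                       # mean of uniform(0, 1)
sigma <- sqrt(1 / 12)              # standard deviation of uniform(0, 1)
sizes <- c(10, 100, 1000)          # sample sizes to compare

result <- sapply(sizes, function(n) {
  means <- replicate(2000, mean(runif(n)))
  c(sd_of_mean   = sd(means),                     # shrinks toward 0 (LLN)
    sd_of_scaled = sd(sqrt(n) * (means - mu)))    # stays near sigma (CLT)
})
colnames(result) <- paste0("n=", sizes)

round(result, 4)
sigma
```

The first row shrinking is the LLN at work: the sample mean piles up on the population mean. The second row staying near sigma is the CLT side: once the deviations are multiplied by sqrt(n), their distribution keeps a stable spread and looks normal.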

4. Pick any distribution other than normal, uniform, or Poisson. You can read about the distribution on Wikipedia and/or read how to implement the distribution in R (what parameters are required to generate the distribution). Please describe this distribution first in 5 lines.

The chi-square distribution is a continuous distribution on the non-negative numbers. It is the distribution of the sum of the squares of k independent standard normal random variables, where k is the degrees of freedom; its mean is k and its variance is 2k. It is right-skewed, and the skew shrinks as k grows. It is used in goodness-of-fit tests, in tests of independence between categorical variables, and in confidence intervals for the variance and standard deviation of a normally distributed variable. In a goodness-of-fit test, we calculate how much each observed count differs from its expected count, square those differences, divide each by the expected count, and add them up; the chi-square distribution then tells us how likely a deviation that large would be by chance. In R, rchisq(n, df) generates n values and only needs the degrees of freedom df.
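The "sum of squared standard normals" definition can be checked directly against rchisq(). This is a small sketch assuming 5 degrees of freedom and 10,000 draws; the names from_normals and from_rchisq are made up for the example.

```r
set.seed(42)        # arbitrary seed for this sketch

df   <- 5           # degrees of freedom
reps <- 10000       # number of simulated values

# Build chi-square values by hand: sum of df squared standard normals
from_normals <- replicate(reps, sum(rnorm(df)^2))

# Built-in generator for comparison
from_rchisq <- rchisq(reps, df = df)

# Both should be close to the theoretical mean (df) and variance (2 * df)
rbind(from_normals = c(mean = mean(from_normals), var = var(from_normals)),
      from_rchisq  = c(mean = mean(from_rchisq),  var = var(from_rchisq)),
      theory       = c(mean = df,                 var = 2 * df))
```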

5A. Then, apply the CLT on the sample mean of this chosen distribution in R.

#first let's clear our environment and console
rm(list = ls())
gc()

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 538397 28.8    1203567 64.3         NA   669420 35.8
## Vcells 996196  7.7    8388608 64.0      16384  1851760 14.2

cat("\f")

if (!is.null(dev.list())) dev.off()   # close any open graphics device

# Step 1: draw the chi-square values that will serve as our population
set.seed(53535)                 # Set seed for reproducibility
N <- 10000                      # Number of values to draw

y_rchisq <- rchisq(N, df = 5)   # Draw N chi-square distributed values
head(y_rchisq)                  # Print the first values to the RStudio console

## [1] 3.277758 5.952375 4.849289 3.994664 2.721948 5.649542

hist(y_rchisq,                  # Histogram of the randomly drawn chi-square values
     breaks = 100,
     main = "")

# Step 2: Create an empty matrix that will store the sample means for 4 different sample sizes: 2, 6, 30, 1000

z <- matrix(data = 0,
            nrow = 10000,
            ncol = 4)

# Step 3: fill the matrix of zeros with sample means
n <- c(2, 6, 30, 1000) # let n be the sample sizes
# a larger sample size should give a sample mean closer to the true population mean

# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.  
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.

for (j in 1:4){        # indexing columns of matrix, where each column will represent different sample size
 
   for (i in 1:10000){  # indexes the rows of matrix
  
       z[i,j] <- mean(sample( x       = y_rchisq,  # compute mean and assign
                              size    = n[j], 
                              replace = TRUE
                            )
                      )
    }
}

# check summary stats: every column should average close to the population mean (about 5 for df = 5), and the distribution of the sample mean should get 'tighter' around the true population mean as the sample size increases.

colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 
summary(z)

##  Sample size=2     Sample size=6    Sample size=30  Sample size=1000
##  Min.   : 0.3883   Min.   : 1.590   Min.   :3.212   Min.   :4.583   
##  1st Qu.: 3.3643   1st Qu.: 4.080   1st Qu.:4.581   1st Qu.:4.917   
##  Median : 4.5987   Median : 4.882   Median :4.958   Median :4.982   
##  Mean   : 4.9560   Mean   : 4.990   Mean   :4.978   Mean   :4.983   
##  3rd Qu.: 6.1783   3rd Qu.: 5.768   3rd Qu.:5.351   3rd Qu.:5.049   
##  Max.   :18.0506   Max.   :11.278   Max.   :7.416   Max.   :5.380
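The tightening seen in the summary can be compared with what theory predicts: the standard deviation of the sample mean should be about the population standard deviation divided by sqrt(n). The sketch below is self-contained, so it draws its own stand-in population (pop) instead of reusing y_rchisq, uses sapply() and replicate() in place of the nested loop, and runs only 2,000 repetitions to keep it quick. All of these are my assumptions, not part of the original code.

```r
set.seed(42)                              # arbitrary seed for this sketch

pop  <- rchisq(10000, df = 5)             # stand-in for y_rchisq
n    <- c(2, 6, 30, 1000)                 # same sample sizes as above
reps <- 2000                              # fewer repetitions than the 10,000 above

# One column of sample means per sample size
sample_means <- sapply(n, function(size) {
  replicate(reps, mean(sample(pop, size = size, replace = TRUE)))
})
colnames(sample_means) <- paste0("Sample size=", n)

# Simulated spread of the sample mean next to sd(pop) / sqrt(n)
rbind(simulated_sd = apply(sample_means, 2, sd),
      theory_sd    = sd(pop) / sqrt(n))
```

If the two rows agree, the narrowing ranges in the summary table are what the standard error formula predicts.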

# each matrix column contains 10,000 sample means drawn from the chi-square values - col 1 uses sample size 2, col 2 uses sample size 6, col 3 uses sample size 30, and col 4 uses sample size 1000.

#check CLT with chi squared with graphs

par(mfrow=c(3,2))

length(y_rchisq)

## [1] 10000

hist(x    = y_rchisq, 
     main = "Histogram of Chi Square Distribution, N=10,000",
     xlab = ""
     )

for (k in 1:4){
    hist(x    = z[,k],     # matrix column 1-4 which contain sample means of randomly chosen samples from chi square dist
         main = "Histogram of Sample Mean of Chi Square Distribution", 
         xlim = c(0, 25), # plot the domain of X axis from x=0 to x=25 only
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }
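Histograms show the shape by eye; a numeric check is the skewness of the sample means, which should move toward 0 (the skewness of a normal distribution) as the sample size grows. This is a rough sketch under my own assumptions: it draws fresh samples straight from rchisq() rather than resampling y_rchisq, uses 5,000 repetitions, and defines a small hypothetical helper called skewness() because base R does not provide one.

```r
set.seed(42)                               # arbitrary seed for this sketch

# Hypothetical helper: moment-based sample skewness
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

n <- c(2, 6, 30, 1000)                     # same sample sizes as above

# Skewness of 5,000 sample means for each sample size
skew_by_n <- sapply(n, function(size) {
  skewness(replicate(5000, mean(rchisq(size, df = 5))))
})
names(skew_by_n) <- paste0("n=", n)

round(skew_by_n, 3)
```

The values should fall steadily from clearly positive at n = 2 toward roughly zero at n = 1000, matching the histograms becoming more symmetric.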

5B. Alternatively, apply the CLT on any other sample statistic like say the sample median, sample 25th percentile or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.

#adjust the function to calculate the median instead of the mean

n <- c(2, 6, 30, 1000) # let n be the sample size

# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.  
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.

for (j in 1:4){        # indexing columns of matrix, where each column will represent different sample size
 
   for (i in 1:10000){  # indexes the rows of matrix
  
       z[i,j] <- median(sample( x       = y_rchisq,  # compute median and assign
                                size    = n[j], 
                                replace = TRUE
                              )
                        )
    }
}

#check the summary to see if central limit theorem holds with median 
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 
summary(z)

##  Sample size=2     Sample size=6    Sample size=30  Sample size=1000
##  Min.   : 0.2984   Min.   : 1.103   Min.   :2.058   Min.   :3.853   
##  1st Qu.: 3.3324   1st Qu.: 3.545   1st Qu.:3.921   1st Qu.:4.273   
##  Median : 4.6265   Median : 4.384   Median :4.362   Median :4.357   
##  Mean   : 4.9693   Mean   : 4.519   Mean   :4.379   Mean   :4.355   
##  3rd Qu.: 6.2093   3rd Qu.: 5.359   3rd Qu.:4.806   3rd Qu.:4.441   
##  Max.   :17.1616   Max.   :12.389   Max.   :7.423   Max.   :4.795

par(mfrow=c(3,2))

length(y_rchisq)

## [1] 10000

hist(x    = y_rchisq, 
     main = "Histogram of Chi Square Distribution, N=10,000",
     xlab = ""
     )

for (k in 1:4){
    hist(x    = z[,k],     # matrix column 1-4 which contain sample medians of randomly chosen samples from chi square dist
         main = "Histogram of Sample Median of Chi Square Distribution", 
         xlim = c(0, 25), # plot the domain of X axis from x=0 to x=25 only
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }

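It appears that yes, the sample median also becomes approximately normally distributed as the sample size increases. Strictly speaking, the CLT itself is a statement about means; the sample median has its own, similar large-sample result. Note that the sample medians center on the population median (about 4.35 for a chi-square distribution with 5 degrees of freedom) rather than on the population mean of 5, because the chi-square distribution is skewed and its mean and median differ. For sample size 2, the median of two values is the same as their mean, which is why the first column still averages close to 5.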
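The median columns in the summary settle near 4.36 rather than 5, and that can be checked against the theoretical median from qchisq(). The sketch also compares the spread of the sample median with the usual large-sample approximation 1 / (2 * f(m) * sqrt(n)), where f(m) is the density at the median. It assumes fresh draws from rchisq() with df = 5, a sample size of 1000, and 5,000 repetitions; the names are illustrative only.

```r
set.seed(42)                                   # arbitrary seed for this sketch

df <- 5                                        # degrees of freedom
n  <- 1000                                     # observations per sample

true_median <- qchisq(0.5, df = df)            # theoretical median of the distribution

# One sample median per repetition
medians <- replicate(5000, median(rchisq(n, df = df)))

# Large-sample approximation to the standard deviation of the sample median
asymptotic_sd <- 1 / (2 * dchisq(true_median, df = df) * sqrt(n))

c(true_mean       = df,
  true_median     = true_median,
  mean_of_medians = mean(medians),
  simulated_sd    = sd(medians),
  asymptotic_sd   = asymptotic_sd)
```

The sample medians should average close to the theoretical median, not the mean, and their spread should be close to the approximation.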