1. Please Google and describe Law of Large Numbers in your own words.

#The Law of Large Numbers says that the more times you repeat a random experiment, the closer the average of your results gets to the expected value. If you keep flipping a fair coin, the proportion of heads (and of tails) gets closer and closer to 50% as the number of flips grows, even though the raw counts of heads and tails need not become equal. This means that estimates based on larger data sets are more reliable: the sample average becomes a better and better guide to the true average.
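
#The coin-flip example can be sketched in R as below. This is only an illustrative simulation: the seed, the number of flips, and the variable names are my own assumptions, not part of the class code.

```r
# Simulate fair coin flips and track the running proportion of heads.
set.seed(1)                                        # arbitrary seed, for reproducibility only
n_flips <- 100000                                  # illustrative number of flips
flips   <- rbinom(n_flips, size = 1, prob = 0.5)   # 1 = heads, 0 = tails
heads   <- cumsum(flips)                           # number of heads after each flip
running_prop <- heads / seq_len(n_flips)           # proportion of heads after each flip

# Proportion of heads at a few checkpoints: it should settle near 0.5.
checkpoints <- c(10, 100, 1000, 10000, 100000)
print(data.frame(flips = checkpoints, prop_heads = running_prop[checkpoints]))

# The gap |heads - tails| in raw counts need not shrink, even though the proportion settles.
count_gap <- abs(2 * heads - seq_len(n_flips))
print(count_gap[checkpoints])
```

#Printing the raw count gap next to the proportion is a deliberate choice: it shows that the law is a statement about averages (proportions), not about counts evening out.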

2. Please explain CLT in your own words.

#The Central Limit Theorem (CLT) is a crucial statistical principle that explains why the distribution of sample means tends to resemble a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has a finite variance. In essence, if you repeatedly draw sufficiently large independent samples from such a population and calculate their means, those means will approximately follow a normal (bell-shaped) curve. This normal distribution has a mean equal to the population mean and a standard deviation equal to the population standard deviation divided by the square root of the sample size, known as the standard error. The theorem holds even if the original data are skewed, bunched up, or spread out in some particular pattern. The CLT is essential in statistics because it justifies using normal probability calculations to make inferences about population parameters based on sample data. This is what makes hypothesis testing and confidence interval construction reliable for large samples, even when the underlying population distribution is unknown.
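
#In symbols, a standard way to write the theorem is given below. The notation is my own choice: X_1, ..., X_n are independent draws from a population with mean \mu and finite variance \sigma^2, which is the usual assumption for this form of the theorem.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad
\frac{\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr)}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1), \qquad
\operatorname{SE}\bigl(\bar{X}_n\bigr) = \frac{\sigma}{\sqrt{n}}
```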

3. What are the similarities and differences between LLN and CLT?

#The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) are both about how sample averages behave in large samples, but they describe different things. The LLN says that as the sample size gets bigger, the sample average gets closer to the true population mean, which means the average is a reliable way to estimate it. Meanwhile, the CLT describes how the sample averages are spread around that mean from one sample to the next. It says that whatever the population looks like to start with (as long as its variance is finite), if you take a big enough sample, the distribution of the averages will look more and more like a bell curve.

#Both theorems need large samples to work, and both help us make predictions and draw conclusions about the whole population based on only some of its members. But while the LLN is about where the average ends up (the true mean), the CLT is about the shape and width of the spread of averages around it. So the LLN tells us the estimate is accurate in the long run, and the CLT tells us how much error to expect at a given sample size.
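
#A small simulation can show both ideas side by side. This is a sketch under my own assumptions: I use an exponential population with rate 1 (so the true mean and standard deviation are both 1), and the seed, replication count, and sample sizes are arbitrary.

```r
# LLN vs CLT for sample means of an Exponential(rate = 1) population (mean 1, sd 1).
set.seed(2)                      # arbitrary seed
reps  <- 5000                    # number of samples drawn per sample size
sizes <- c(10, 100, 1000)        # illustrative sample sizes

results <- sapply(sizes, function(n) {
  means <- replicate(reps, mean(rexp(n, rate = 1)))   # one sample mean per replication
  c(n             = n,
    mean_of_means = mean(means),    # LLN: the means centre on the true mean, 1
    sd_of_means   = sd(means),      # CLT: the spread of the means ...
    theory_se     = 1 / sqrt(n))    # ... should be close to sigma / sqrt(n)
})
print(round(t(results), 4))
```

#The mean_of_means column illustrates the LLN, while the match between sd_of_means and theory_se illustrates the spread that the CLT describes.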

4. Pick up any distribution apart from normal, uniform or poisson. You can Wikipedia about the distribution and/or read how to implement the distribution in R (what parameters are required to generate the distribution). Please describe this distribution first in 5 lines.

#The Chi-Square distribution describes the sum of the squares of 𝑘 independent standard normal random variables. It is primarily used in hypothesis testing and in constructing confidence intervals, especially in tests of independence and goodness-of-fit for categorical data. Its shape depends on a single parameter, the degrees of freedom 𝑘 (the df argument of rchisq in R): the mean is 𝑘, the variance is 2𝑘, and the distribution is right-skewed for small 𝑘 but becomes more symmetrical and closer to normal as 𝑘 increases. It is a special case of the Gamma distribution (shape 𝑘/2, scale 2) and is always non-negative, starting from zero and extending to the right indefinitely. Chi-square distributions are fundamental in inferential statistics, where many test statistics follow them approximately when the sample size is large.
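
#The sum-of-squares definition can be checked with a short simulation. This is a minimal sketch: k = 5 matches the df used later in this post, while the seed, the number of draws, and the variable names are my own assumptions.

```r
# Build chi-square draws from squared standard normals and compare with rchisq().
set.seed(3)                                   # arbitrary seed
k       <- 5                                  # degrees of freedom
n_draws <- 100000                             # illustrative number of draws

z_normals <- matrix(rnorm(n_draws * k), nrow = n_draws, ncol = k)
sum_sq    <- rowSums(z_normals^2)             # each row: sum of k squared N(0, 1) values
direct    <- rchisq(n_draws, df = k)          # draws from R's chi-square generator

# The simulated mean and variance should be near k and 2k.
print(c(mean_sum_sq = mean(sum_sq), var_sum_sq = var(sum_sq),
        theory_mean = k, theory_var = 2 * k))

# Quantiles from both constructions next to the exact qchisq() values.
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
print(round(rbind(sum_sq = quantile(sum_sq, probs),
                  rchisq = quantile(direct, probs),
                  qchisq = qchisq(probs, df = k)), 3))
```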

#5A  Then, apply the CLT on the sample mean of this chosen distribution in R (adapt our class R code, or you can find an alternative code on the web too).
rm(list = ls())
gc()

##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 525055 28.1    1169910 62.5   660494 35.3
## Vcells 956357  7.3    8388608 64.0  1769528 13.6

set.seed(42)                                          # Set seed for reproducibility
N <- 10000                                             # Specify sample size

y_rchisq <- rchisq(N, df = 5)          # Draw N chi squared distributed values
head(y_rchisq)

## [1] 8.817420 2.562234 4.136585 1.485791 3.705466 2.356456

hist(y_rchisq,                                         
     breaks = 100,
     main = "")

z <- matrix(data = rep(x     = 0, 
                       times = 50000
                       ), 
            nrow = 10000, 
            ncol = 5)
n <- c(2, 10, 50, 200, 3000)
for (j in 1:5){        # all five sample sizes; 1:4 leaves the size=3000 column at 0, as in the summary recorded below
 
   for (i in 1:10000){  
  
       z[i,j] <- mean(sample( x       = y_rchisq,  
                              size    = n[j], 
                              replace = TRUE
                            )
                      )
    }
}

colnames(z) <- c("size=2", "size=10", "size=50", "size=200", "size=3000")
summary(z)

##      size=2           size=10          size=50         size=200       size=3000
##  Min.   : 0.3036   Min.   : 1.877   Min.   :3.483   Min.   :4.210   Min.   :0  
##  1st Qu.: 3.3488   1st Qu.: 4.277   1st Qu.:4.706   1st Qu.:4.852   1st Qu.:0  
##  Median : 4.6525   Median : 4.935   Median :5.004   Median :5.002   Median :0  
##  Mean   : 4.9909   Mean   : 4.993   Mean   :5.012   Mean   :5.006   Mean   :0  
##  3rd Qu.: 6.3096   3rd Qu.: 5.643   3rd Qu.:5.306   3rd Qu.:5.158   3rd Qu.:0  
##  Max.   :20.7787   Max.   :10.482   Max.   :6.728   Max.   :5.866   Max.   :0

par(mfrow=c(3,2))

length(y_rchisq)

## [1] 10000

hist(x    = y_rchisq, 
     main = "Hist of Chi_Square Distribution",
     xlab = ""
     )


for (k in 1:5){        # one histogram per sample size, filling the remaining five panels
    hist(x    = z[,k],     
         main = "Hist of Sample Mean of Chi Square Distribution", 
         xlim = c(0, 25), 
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }
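
#Beyond the histograms, the CLT can be checked numerically. The sketch below is my own addition, not the class code: it draws fresh chi-square samples instead of resampling y_rchisq, defines a simple moment-based skewness helper (not from a package), and compares the spread and skewness of the sample means with their theoretical values sqrt(10 / n) and sqrt(8 / 5) / sqrt(n) for df = 5.

```r
# Numeric check of the CLT for sample means of a chi-square(df = 5) population.
set.seed(4)                                               # arbitrary seed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3   # simple moment-based estimate
reps  <- 5000                                             # samples per sample size
sizes <- c(2, 10, 50, 200)                                # illustrative sample sizes

check <- sapply(sizes, function(n) {
  # Each column of the matrix is one sample of size n; colMeans gives its mean.
  means <- colMeans(matrix(rchisq(n * reps, df = 5), nrow = n))
  c(n           = n,
    sd_sim      = sd(means),
    sd_theory   = sqrt(10 / n),            # population variance is 2 * df = 10
    skew_sim    = skewness(means),
    skew_theory = sqrt(8 / 5) / sqrt(n))   # population skewness is sqrt(8 / df)
})
print(round(t(check), 3))
```

#Skewness shrinking toward 0 as n grows is one simple sign that the sampling distribution is becoming more symmetric, as a normal distribution would be.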

5B. Alternatively, apply the CLT on any other sample statistic like say the sample median, sample 25th percentile or even the sample 80th percentile. This may be marginally harder than the last part, but you can try to submit both.

n <- c(2, 10, 50,200,3000)
y_rchisq <- rchisq(N, df = 5) 
chi_square.data <- y_rchisq        
percentile_25 <- quantile(x = chi_square.data,    # numeric vector whose sample quantiles are wanted
                          probs = c(.25)          # numeric vector of probabilities with values in [0, 1]
                          ) 
percentile_25

##      25% 
## 2.678725

z <- matrix(data = rep(x     = 0, 
                        times = 50000
                        ), 
             nrow = 10000, 
             ncol = 5)

n <- c(2, 10, 50, 200, 3000) 
for (j in 1:5){        
   for (i in 1:10000){  
       z[i,j] <- quantile( # CHANGE FROM MEAN TO QUANTILE
                            x     = sample( x       = chi_square.data, 
                                            size    = n[j], 
                                            replace = TRUE
                                        ),
                            probs = c(.25) 
                       )
    }
}

colnames(z) <- c("Sample size=2", "Sample size=10", "Sample size=50", "Sample size=200", "Sample size=3000") 

summary(z)

##  Sample size=2     Sample size=10   Sample size=50  Sample size=200
##  Min.   : 0.2427   Min.   :0.5683   Min.   :1.477   Min.   :1.949  
##  1st Qu.: 2.7877   1st Qu.:2.3498   1st Qu.:2.478   1st Qu.:2.566  
##  Median : 3.8442   Median :2.8977   Median :2.723   Median :2.696  
##  Mean   : 4.1791   Mean   :2.9952   Mean   :2.750   Mean   :2.701  
##  3rd Qu.: 5.2712   3rd Qu.:3.5493   3rd Qu.:3.005   3rd Qu.:2.832  
##  Max.   :17.7281   Max.   :7.5732   Max.   :4.478   Max.   :3.529  
##  Sample size=3000
##  Min.   :2.499   
##  1st Qu.:2.651   
##  Median :2.679   
##  Mean   :2.683   
##  3rd Qu.:2.719   
##  Max.   :2.909

par(mfrow=c(3,2))


hist(x    = chi_square.data, 
     main = "Hist of Chi_Square Distribution",
     xlab = ""
     )


for (k in 1:5){
    hist(x    = z[,k],     
         main = "Histogram of Sample 25th Percentile of Chi-Square Dist", 
         xlim = c(0, 25), 
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
  }

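# Yes, a CLT-like result holds for the 25th percentile as well. Strictly, the classical CLT is a statement about sample means; the sample 25th percentile is covered by a related result (asymptotic normality of sample quantiles), which applies here because the chi-square density is positive at its 25th percentile. Note that the mean and the 25th percentile are not equal, even in a normal distribution (only the mean and the median coincide there), so each statistic converges to its own target. In the summary above, the average sample 25th percentile falls from 4.18 at sample size 2 to 2.68 at sample size 3000, approaching 2.679, the 25th percentile of the 10,000 draws being resampled. At the same time the spread shrinks and the sampling distribution becomes more symmetric: at size 2 it is still clearly right-skewed (maximum 17.73 against a median of 3.84), while at size 3000 the quartiles (2.651 and 2.719) sit almost evenly around the median (2.679). The histograms show the same movement towards normality as the sample size grows.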
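
# The large-sample result for quantiles can also be checked against theory. The standard approximation says the sample p-th quantile is roughly normal around the true quantile q with standard error sqrt(p * (1 - p) / n) / f(q), where f is the population density. The sketch below is my own addition under stated assumptions: it draws fresh chi-square(5) samples rather than resampling chi_square.data, and the seed, n = 200, and the variable names are arbitrary choices.

```r
# Compare the simulated spread of the sample 25th percentile with the asymptotic formula.
set.seed(5)                     # arbitrary seed
p    <- 0.25                    # quantile level
dof  <- 5                       # chi-square degrees of freedom
n    <- 200                     # illustrative sample size
reps <- 5000                    # number of samples

q_true    <- qchisq(p, df = dof)                                  # true 25th percentile
se_theory <- sqrt(p * (1 - p) / n) / dchisq(q_true, df = dof)     # asymptotic standard error

# Each column is one sample of size n; take the 25th percentile of each column.
samples <- matrix(rchisq(n * reps, df = dof), nrow = n)
q_hat   <- apply(samples, 2, quantile, probs = p)

print(c(true_q25 = q_true, mean_sim = mean(q_hat),
        sd_sim = sd(q_hat), se_theory = se_theory))
```

# If the approximation is good at this sample size, sd_sim and se_theory should be close, and mean_sim should sit near true_q25.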

Discussion5

ANDI XU

2024-04-21
