set.seed(seed = 42)
rand_nums <- runif(10)
print(rand_nums)
## [1] 0.9148060 0.9370754 0.2861395 0.8304476 0.6417455 0.5190959 0.7365883
## [8] 0.1346666 0.6569923 0.7050648
The Law of Large Numbers states that as the sample size increases, the sample mean converges to the population mean. This occurs because larger sample sizes limit the effects of random fluctuations, making the sample more representative of the population.
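A small standalone sketch can illustrate this convergence. The choice of runif (uniform on [0, 1], population mean 0.5) and the running-mean plot are just illustrative assumptions, run separately from the main exercise below so they do not disturb its seed.
# Illustrative sketch of the Law of Large Numbers: the running mean of
# uniform draws (population mean 0.5) settles toward 0.5 as n grows.
x <- runif(10000)
running_mean <- cumsum(x) / seq_along(x)
plot(running_mean, type = "l", xlab = "Sample size", ylab = "Running mean",
     main = "LLN: running mean of uniform draws")
abline(h = 0.5, col = "red", lwd = 2)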
The Central Limit Theorem (CLT) states that the distribution of a sample mean can be approximated by a near-normal distribution if the sample size is sufficiently large, provided the observations are independent and identically distributed. It is an important concept in statistics because it allows statisticians to apply procedures that assume a normal distribution. It can also be applied to both discrete and continuous data, which gives it broad flexibility in its applications.
Source: https://bootcamp.umass.edu/blog/quality-management/central-limit-theorem
Both the LoLN and the CLT state that larger samples from a population yield data that are more representative of that population, and both are used to make inferences about an entire population from the findings of a smaller sample. The key difference is what each describes: the LoLN says the sample mean approaches the population mean as the sample size grows, while the CLT says the distribution of the sample mean approaches a normal distribution as the sample size grows. In that sense, the CLT is more intrinsically tied to the normal distribution than the LoLN in its result, its application, and the shape of the distribution it describes.
The Gamma distribution is a probability distribution that models the time it takes for a certain number of events to occur (waiting times). Its parameters are shape and scale: shape relates to how many events you are waiting for, and scale is the average time per event. The Chi-square distribution is a special case of the Gamma distribution, with shape k/2 (where k is the integer degrees of freedom) and scale 2. Gamma distributions are considered flexible and are often used in Bayesian statistics, queuing theory, and survival analysis to model waiting times and uncertainty.
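That relationship can be verified directly in R; the values of k and x_vals below are arbitrary illustrative choices.
# Quick check: a chi-square with k degrees of freedom has the same density
# as a Gamma with shape = k/2 and scale = 2.
k <- 4
x_vals <- seq(0.1, 15, by = 0.1)
all.equal(dchisq(x_vals, df = k),
          dgamma(x_vals, shape = k / 2, scale = 2))  # should return TRUE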
5A. Then, apply the CLT to the sample mean of this chosen distribution in R.
## Step 1: Create the distribution
shape <- 2   # Shape parameter
scale <- 1   # Scale parameter
n <- 10000   # Number of samples
# Name the arguments explicitly: rgamma's third positional argument is rate,
# not scale (with scale = 1 the two coincide, but naming them avoids confusion)
population <- rgamma(n, shape = shape, scale = scale)
hist(x = population, main = 'Histogram of Gamma Distribution')
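As an optional check, not part of the assigned steps, the theoretical Gamma density can be overlaid on this histogram; the histogram is redrawn on the density scale (probability = TRUE) so the curve lines up.
# Optional check: overlay the theoretical Gamma(shape = 2, scale = 1) density
# on the population histogram, drawn on the density scale.
hist(population, probability = TRUE,
     main = 'Histogram of Gamma Distribution', xlab = 'x')
curve(dgamma(x, shape = shape, scale = scale), col = 'blue', lwd = 2, add = TRUE)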
## Step 2: Create an empty matrix to store sample means
sample_n <- c(2, 6, 30, 50, 200, 1000) #Recommended sample sizes
z2 <- matrix(0, nrow = n, ncol = length(sample_n))
z2[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Step 3: Generate sample means for previously created sample sizes
for (j in seq_along(sample_n)) {
  for (i in 1:n) {
    # Draw a sample of size sample_n[j] with replacement and store its mean
    z2[i, j] <- mean(sample(x = population, size = sample_n[j],
                            replace = TRUE))
  }
}
z2[1:16]
## [1] 3.0412766 1.6856202 0.8310131 4.3073299 2.3505171 0.8469918 1.7042939
## [8] 1.9000347 1.6099917 1.9414896 0.8434196 2.4363454 2.5411472 1.0746662
## [15] 1.2179040 1.8383885
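The same matrix of sample means can also be built a bit more compactly with replicate() and sapply(); this is just an equivalent alternative sketch, not a required change to the loop above.
# Equivalent, more compact alternative: one column of sample means per
# sample size, built with replicate() instead of nested index loops.
z2_alt <- sapply(sample_n, function(size) {
  replicate(n, mean(sample(population, size = size, replace = TRUE)))
})
dim(z2_alt)  # n rows (10000) by length(sample_n) columns (6)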
## Step 4: Establish the population mean and standard deviation for subsequent plots
pop_mean <- mean(population)
pop_sd <- sd(population)
pop_mean
## [1] 2.004068
pop_sd
## [1] 1.432217
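These empirical values sit close to the theoretical moments of a Gamma(shape = 2, scale = 1) distribution, which can be computed directly as a quick added check.
# Theoretical moments of Gamma(shape = 2, scale = 1) for comparison:
# mean = shape * scale, variance = shape * scale^2, so sd = sqrt(shape) * scale.
shape * scale         # theoretical mean: 2
sqrt(shape) * scale   # theoretical sd: sqrt(2), about 1.414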
## Step 5: Show that the distribution of the sample mean approaches a
## normal distribution as the sample size increases.
# Assign column names to z2 and inspect the CLT intuition with summary()
colnames(z2) <- c("Sample size = 2", "Sample size = 6", "Sample size = 30",
                  "Sample size = 50", "Sample size = 200", "Sample size = 1000")
summary(z2)
## Sample size = 2 Sample size = 6 Sample size = 30 Sample size = 50
## Min. :0.1457 Min. :0.5194 Min. :1.107 Min. :1.310
## 1st Qu.:1.2718 1st Qu.:1.5794 1st Qu.:1.823 1st Qu.:1.867
## Median :1.8469 Median :1.9292 Median :1.994 Median :1.996
## Mean :2.0122 Mean :1.9889 Mean :2.007 Mean :2.004
## 3rd Qu.:2.5644 3rd Qu.:2.3594 3rd Qu.:2.179 3rd Qu.:2.134
## Max. :8.0680 Max. :4.6939 Max. :3.060 Max. :2.836
## Sample size = 200 Sample size = 1000
## Min. :1.632 Min. :1.817
## 1st Qu.:1.935 1st Qu.:1.973
## Median :2.001 Median :2.004
## Mean :2.003 Mean :2.004
## 3rd Qu.:2.071 3rd Qu.:2.034
## Max. :2.475 Max. :2.178
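Beyond the means, the CLT also predicts the spread of the sample means: the standard deviation of each column should shrink like the population standard deviation divided by the square root of the sample size. The short comparison below is a quick added check using the objects already created.
# CLT spread check: the sd of the sample means in each column should be
# close to the population sd divided by sqrt(sample size).
round(apply(z2, 2, sd), 4)          # empirical sd of the sample means
round(pop_sd / sqrt(sample_n), 4)   # CLT prediction: pop_sd / sqrt(n)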
# Configure the plotting layout (two histograms per page)
par(mfrow = c(2, 1))
# Loop through the columns of z2 and create a histogram of the sample means
for (j in 1:ncol(z2)) {
  hist(z2[, j], probability = TRUE, col = rgb(.2, .4, .6, .5),
       main = colnames(z2)[j], xlab = 'Sample mean')
  # Include a red line at the population mean
  abline(v = pop_mean, col = 'red', lwd = 2)
  # Include a normal distribution curve using the mean and standard deviation
  # of the sample means to contrast results
  curve(dnorm(x, mean = mean(z2[, j]), sd = sd(z2[, j])), col = 'blue',
        lwd = 2, add = TRUE)
}
Does the central limit theorem hold as expected?
The central limit theorem does hold as expected. The distribution of the sample mean begins to resemble a normal distribution as the sample size increases, even though the original random variable was not normally distributed. This confirms that, regardless of the initial distribution, the sample mean is approximately normally distributed when the sample size is large enough. A common rule of thumb is that sample sizes should exceed thirty for the approximation to hold, and that holds true in my plots.
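As one more optional visual check, a normal Q-Q plot of the sample means for the largest sample size can be drawn: if the normal approximation is good, the points should fall close to the reference line.
# Optional visual check: Q-Q plot of the sample means for sample size 1000
# against a normal distribution; points near the line support the CLT.
par(mfrow = c(1, 1))
qqnorm(z2[, "Sample size = 1000"], main = "Normal Q-Q plot, sample size = 1000")
qqline(z2[, "Sample size = 1000"], col = "red", lwd = 2)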