#The Law of Large Numbers is basically the idea that the more independent observations you collect, the closer the sample average gets to its expected value. If you keep flipping a fair coin, the proportion of heads (and of tails) will get closer and closer to 50% as you keep flipping, even though the raw counts of heads and tails need not even out. This means that when you run experiments or look at data sets, estimates based on more observations are more stable and more reliable.
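# A minimal sketch of the coin-flip example above, assuming a fair coin.
# The seed and the names lln_flips / lln_running are illustrative choices, not part of the class code.
set.seed(1)                                              # arbitrary seed, only for reproducibility
lln_flips <- rbinom(10000, size = 1, prob = 0.5)         # 1 = heads, 0 = tails
lln_running <- cumsum(lln_flips) / seq_along(lln_flips)  # proportion of heads after each flip
lln_running[c(10, 100, 1000, 10000)]                     # proportion of heads at a few checkpoints
# The later checkpoints should sit closer to 0.5 than the early ones tend to.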
#The Central Limit Theorem (CLT) is a crucial statistical principle that explains why the distribution of sample means tends to resemble a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the population has a finite variance. In essence, if you repeatedly draw sufficiently large samples from such a population and calculate their means, those means will approximately follow a normal (bell-shaped) curve. This normal distribution will have a mean equal to the population mean and a standard deviation, known as the standard error, equal to the population standard deviation divided by the square root of the sample size. The theorem holds even if the original data are skewed, bunched up, or spread out in any particular pattern. The CLT is essential in statistics because it justifies using normal probability calculations to make inferences about population parameters based on sample data. This is what makes hypothesis testing and confidence interval construction possible even when the underlying population distribution is unknown.
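# A small sketch of the standard-error claim above, using a skewed population chosen only for illustration:
# Exponential(rate = 1), which has mean 1 and standard deviation 1. The names clt_n and clt_means are hypothetical.
set.seed(2)                                                 # arbitrary seed, only for reproducibility
clt_n <- 40                                                 # size of each sample
clt_means <- replicate(5000, mean(rexp(clt_n, rate = 1)))   # 5000 sample means
mean(clt_means)                                             # should be close to the population mean, 1
sd(clt_means)                                               # should be close to the standard error below
1 / sqrt(clt_n)                                             # theoretical standard error: sigma / sqrt(n)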
#The Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) both describe how sample averages behave as the sample size grows, but they answer different questions. The LLN says that as the sample size gets bigger, the sample average gets closer to the true population mean, which is why the average is a reliable estimate. Meanwhile, the CLT describes how sample averages vary from one sample to the next. It says that no matter what the population looks like to start with (as long as its variance is finite), if you take big enough samples, the distribution of their averages will look more and more like a bell curve.
#Both results rely on large samples, and they help us make predictions and draw conclusions about the whole population based on only some of its members. But while the LLN is about how close the average is to the true answer, the CLT is about how the averages are spread around that answer. So the LLN tells us the estimate is accurate in the long run, and the CLT tells us how much it can vary from sample to sample.
#The Chi-Square distribution is the distribution of the sum of the squares of 𝑘 independent standard normal random variables. It is primarily used in hypothesis testing and in constructing confidence intervals, especially in tests of independence and goodness-of-fit for categorical data. Its shape depends on the degrees of freedom 𝑘: it is right-skewed for small 𝑘 and becomes more symmetrical and closer to a normal distribution as 𝑘 increases. It is a special case of the Gamma distribution (shape 𝑘/2, scale 2) and takes only non-negative values, starting from zero and extending to the right indefinitely. Chi-square distributions are fundamental in inferential statistics, particularly because many test statistics follow them approximately when sample sizes are large.
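# A quick sketch of the definition above: sums of k squared standard normals compared with the
# chi-square mean (k) and variance (2k), plus a check of the Gamma special case.
# The seed and the chk_* names are illustrative assumptions.
set.seed(3)                                           # arbitrary seed, only for reproducibility
chk_k <- 5                                            # degrees of freedom
chk_z <- matrix(rnorm(20000 * chk_k), ncol = chk_k)   # 20000 rows of k standard normal draws
chk_sumsq <- rowSums(chk_z^2)                         # sum of squares per row
mean(chk_sumsq)                                       # should be close to k
var(chk_sumsq)                                        # should be close to 2 * k
chk_x <- c(0.5, 2, 5, 10)                             # a few points at which to compare densities
all.equal(dchisq(chk_x, df = chk_k),
          dgamma(chk_x, shape = chk_k / 2, rate = 1 / 2))  # chi-square(k) is Gamma(k/2, rate 1/2)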
#5A Then, apply the CLT on the sample mean of this chosen distribution in R (adapt our class R code, or you can find an alternative code on the web too).
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 525055 28.1 1169910 62.5 660494 35.3
## Vcells 956357 7.3 8388608 64.0 1769528 13.6
set.seed(42) # Set seed for reproducibility
N <- 10000 # Specify sample size
y_rchisq <- rchisq(N, df = 5) # Draw N chi squared distributed values
head(y_rchisq)
## [1] 8.817420 2.562234 4.136585 1.485791 3.705466 2.356456
hist(y_rchisq,
breaks = 100,
main = "")
z <- matrix(data = rep(x = 0,
times = 50000
),
nrow = 10000,
ncol = 5)
n <- c(2, 10, 50,200,3000)
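# Note: this loop fills only the first four columns of z. Column 5 (size=3000) keeps its
# initial value of 0, as the summary below shows. Use 1:5 to simulate that size as well.
for (j in 1:4){
  for (i in 1:10000){
    # mean of a bootstrap-style sample of size n[j] drawn from the simulated population
    z[i,j] <- mean(sample(x = y_rchisq,
                          size = n[j],
                          replace = TRUE
                          )
                   )
  }
}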
colnames(z) <- c("size=2", "size=10", "size=50", "size=200", "size=3000")
summary(z)
## size=2 size=10 size=50 size=200 size=3000
## Min. : 0.3036 Min. : 1.877 Min. :3.483 Min. :4.210 Min. :0
## 1st Qu.: 3.3488 1st Qu.: 4.277 1st Qu.:4.706 1st Qu.:4.852 1st Qu.:0
## Median : 4.6525 Median : 4.935 Median :5.004 Median :5.002 Median :0
## Mean : 4.9909 Mean : 4.993 Mean :5.012 Mean :5.006 Mean :0
## 3rd Qu.: 6.3096 3rd Qu.: 5.643 3rd Qu.:5.306 3rd Qu.:5.158 3rd Qu.:0
## Max. :20.7787 Max. :10.482 Max. :6.728 Max. :5.866 Max. :0
par(mfrow=c(3,2))
length(y_rchisq)
## [1] 10000
hist(x = y_rchisq,
main = "Hist of Chi_Square Distribution",
xlab = ""
)
for (k in 1:4){
  hist(x = z[,k],
       main = "Hist of Sample Mean of Chi Square Distribution",
       xlim = c(0, 25),
       xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
       )
}
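# A rough check of the spread seen in the summary above, as a sketch rather than part of the class code.
# Assumption: the resampled population is close to a true chi-square with df = 5, whose standard deviation
# is sqrt(2 * 5). The clt_* names are illustrative. No random numbers are drawn here.
clt_sizes <- c(2, 10, 50, 200, 3000)                      # same sample sizes as n
clt_se <- sqrt(2 * 5) / sqrt(clt_sizes)                   # theoretical standard error of the mean
clt_iqr <- (qnorm(0.75) - qnorm(0.25)) * clt_se           # interquartile range if the means were exactly normal
round(data.frame(size = clt_sizes, se = clt_se, normal_iqr = clt_iqr), 3)
# For size 200 this gives an interquartile range of about 0.30, in line with the
# observed 3rd Qu. minus 1st Qu. in the summary (5.158 - 4.852).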
n <- c(2, 10, 50,200,3000)
y_rchisq <- rchisq(N, df = 5)
chi_square.data <- y_rchisq
percentile_25 <- quantile(x = chi_square.data, # numeric vector whose sample quantiles are wanted
probs = c(.25) # numeric vector of probabilities with values in [0,1]
)
percentile_25
## 25%
## 2.678725
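# For comparison, a sketch of the theoretical 25th percentile of a chi-square distribution with df = 5.
# The name q25_theory is illustrative. The sample value above should land close to it.
q25_theory <- qchisq(0.25, df = 5)   # theoretical quantile
q25_theory
pchisq(q25_theory, df = 5)           # sanity check: recovers the probability 0.25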
z <- matrix(data = rep(x = 0,
times = 50000
),
nrow = 10000,
ncol = 5)
n <- c(2, 10, 50, 200, 3000)
for (j in 1:5){
  for (i in 1:10000){
    z[i,j] <- quantile( # CHANGE FROM MEAN TO QUANTILE
      x = sample(x = chi_square.data,
                 size = n[j],
                 replace = TRUE
                 ),
      probs = c(.25)
    )
  }
}
colnames(z) <- c("Sample size=2", "Sample size=10", "Sample size=50", "Sample size=200", "Sample size=3000")
summary(z)
## Sample size=2 Sample size=10 Sample size=50 Sample size=200
## Min. : 0.2427 Min. :0.5683 Min. :1.477 Min. :1.949
## 1st Qu.: 2.7877 1st Qu.:2.3498 1st Qu.:2.478 1st Qu.:2.566
## Median : 3.8442 Median :2.8977 Median :2.723 Median :2.696
## Mean : 4.1791 Mean :2.9952 Mean :2.750 Mean :2.701
## 3rd Qu.: 5.2712 3rd Qu.:3.5493 3rd Qu.:3.005 3rd Qu.:2.832
## Max. :17.7281 Max. :7.5732 Max. :4.478 Max. :3.529
## Sample size=3000
## Min. :2.499
## 1st Qu.:2.651
## Median :2.679
## Mean :2.683
## 3rd Qu.:2.719
## Max. :2.909
par(mfrow=c(3,2))
hist(x = chi_square.data,
main = "Hist of Chi_Square Distribution",
xlab = ""
)
for (k in 1:5){
  hist(x = z[,k],
       main = "Hist of Sample 25th Percentile of Chi_Square Distribution",
       xlim = c(0, 25),
       xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
       )
}
# Yes, the sample 25th percentile behaves much like the sample mean here, although strictly speaking the CLT is a statement about sample means; the approximate normality of a sample quantile follows from a related large-sample result. Note that in a normal distribution the mean equals the median, not the 25th percentile, so the two statistics converge to different values: about 5 for the mean and about 2.68 for the 25th percentile. As the sample size increases, both statistics concentrate around their theoretical values, and their sampling distributions look increasingly normal. The histograms show this: they become narrower and more symmetric as the sample size grows.
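# A sketch of the large-sample result for sample quantiles: for a sample of size n, the p-th sample quantile
# is approximately normal with standard error sqrt(p * (1 - p) / n) / f(q), where q is the true quantile and
# f is the population density. Assumption: the population is chi-square with df = 5. The q_* names are illustrative.
q_p <- 0.25                                         # quantile level
q_sizes <- c(2, 10, 50, 200, 3000)                  # same sample sizes as n
q_val <- qchisq(q_p, df = 5)                        # true 25th percentile
q_dens <- dchisq(q_val, df = 5)                     # density at that percentile
q_se <- sqrt(q_p * (1 - q_p) / q_sizes) / q_dens    # approximate standard error per sample size
round(data.frame(size = q_sizes, se = q_se), 3)
# The approximation is only meaningful for the larger sizes. It can be compared with the spread
# of each column in the summary of the sample 25th percentiles.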