set.seed(777)
The Law of Large Numbers (LLN) states that as the sample size grows, the sample mean gets closer to the population mean. Put another way, the sample better represents the population as the sample becomes larger. Take casinos as an example: a single play by a gambler is a sample of 1. On any given play the gambler may win, in which case the casino loses. However, as players continue to gamble at the casino and the number of plays grows, the average result per play settles near its expected value, which favors the casino, so the gamblers collectively lose more than they win. This is described in this YouTube video: https://www.youtube.com/watch?v=RXY-WN0ahiw.
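The casino example can be sketched with a short simulation. This is only an illustration under assumed numbers: the even-money bet with an 18/38 chance of winning (American-roulette-style odds), the seed, and the variable names are hypothetical and not taken from the video.

```r
# Hypothetical even-money bet: the gambler wins $1 with probability 18/38,
# otherwise loses $1 (odds chosen only for illustration)
set.seed(1)
p_win <- 18 / 38
expected_value <- p_win * 1 + (1 - p_win) * (-1)   # about -0.053 per bet

n_bets <- 100000
outcomes <- sample(c(1, -1), size = n_bets, replace = TRUE,
                   prob = c(p_win, 1 - p_win))

# Running average of the gambler's result after 1, 2, ..., n_bets bets
running_mean <- cumsum(outcomes) / seq_len(n_bets)

# Early averages can land on either side of zero; later ones settle near expected_value
running_mean[c(10, 100, 1000, 100000)]
```

With few bets the running average is noisy, which is why an individual gambler can come out ahead; with many bets it should sit close to the small negative expected value.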
Many statistical tests are based on the Central Limit Theorem (CLT). The CLT concerns the sampling distribution of the mean, which is the distribution of the means of repeated samples taken from a population. As we usually don’t know the true population mean, we need to take samples to estimate it. The CLT states that the distribution of the sample mean becomes approximately normal as the sample size n grows. This applies regardless of the shape of the distribution we sample from (uniform, exponential, etc.), provided it has a finite variance. The CLT is a statement about sample means, so you need to calculate a mean from each sample. As a rule of thumb, the normal approximation is usually considered adequate when the sample size is at least 30, although heavily skewed distributions can need more. There are 4 aspects to the CLT: the sampled observations must be independent; as the sample size increases, the distribution of sample means becomes closer to normal; the original distribution can be any shape, and the larger the sample size, the closer the distribution of sample means will be to normal; and to standardize a sample mean we subtract the population mean and divide by the standard error (the population standard deviation divided by the square root of n).
The LLN and CLT are both fundamental concepts in inferential statistics. They are closely related but answer different questions. The LLN states that as the sample size increases, the sample mean converges in probability to the population mean. Put another way, as the sample size or number of trials increases, the probability that our sample mean differs from the population mean by more than a tiny amount approaches zero. So the LLN is focused on where the sample mean ends up. The CLT states that as the sample size increases, the distribution of the sample mean tends toward the normal distribution, regardless of the distribution from which we are sampling (as long as its variance is finite). So the CLT is focused on the shape and spread of the distribution of sample means. This is helpful in statistics because we can infer information about a population from a large sample without knowing its distribution.
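In symbols, the two results can be written side by side. This is the standard textbook formulation rather than anything specific to this report, and it assumes independent, identically distributed draws $X_1, \dots, X_n$ with mean $\mu$ and finite variance $\sigma^2$:

```latex
\begin{align*}
\bar{X}_n &= \frac{1}{n}\sum_{i=1}^{n} X_i \\
\text{LLN:}\quad & \lim_{n\to\infty} P\left(\left|\bar{X}_n - \mu\right| > \varepsilon\right) = 0
  \quad \text{for every } \varepsilon > 0 \\
\text{CLT:}\quad & \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{\;d\;} N(0,1)
  \quad \text{as } n \to \infty
\end{align*}
```

The LLN line says the sample mean concentrates at $\mu$; the CLT line says that its fluctuations around $\mu$, once scaled by the standard error $\sigma/\sqrt{n}$, look standard normal.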
The Chi Square distribution is a continuous distribution that helps us test the goodness of fit of a model to observed data. It is the distribution of a sum of squared independent standard normal random variables, with the number of terms in the sum giving the degrees of freedom. It is also used to test whether two categorical variables are independent, and to build confidence intervals for the variance and standard deviation of a normally distributed random variable. For a Chi Square goodness-of-fit test, we’d calculate how much each observed count differed from its expected count, square those differences, divide each by the expected count, and then add them up. Comparing that total with the Chi Square distribution tells us how likely it is to get deviations at least that large from the expected outcomes.
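The goodness-of-fit calculation described above can be sketched in a few lines. The die-roll counts below are made up purely for illustration, and the variable names are hypothetical; `pchisq()` and `chisq.test()` are base R functions.

```r
# Hypothetical data: 120 rolls of a die, testing whether it is fair
observed <- c(18, 22, 16, 25, 19, 20)
expected <- rep(sum(observed) / 6, 6)     # 20 rolls per face under a fair die

# Pearson statistic: squared deviations, each scaled by its expected count, then summed
chi_sq_stat <- sum((observed - expected)^2 / expected)   # 2.5 for these counts
dof <- length(observed) - 1                              # 5 degrees of freedom

# Probability of a statistic at least this large if the die really is fair
p_value <- pchisq(chi_sq_stat, df = dof, lower.tail = FALSE)

# Base R's chisq.test() should report the same statistic and p-value
builtin <- chisq.test(observed, p = rep(1 / 6, 6))
c(manual = chi_sq_stat, builtin = unname(builtin$statistic))
```

A large p-value here would mean the observed counts are consistent with a fair die; a small one would suggest the deviations are bigger than chance alone usually produces.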
#first let's clear our environment and console
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 538397 28.8 1203567 64.3 NA 669420 35.8
## Vcells 996196 7.7 8388608 64.0 16384 1851760 14.2
cat("\f")
graphics.off() # close any open graphics devices
#Step 1: draw random values from the chi-square distribution
set.seed(53535) # Set seed for reproducibility
N <- 10000 # Specify sample size
y_rchisq <- rchisq(N, df = 5) # Draw N chi squared distributed values
head(y_rchisq) # Print values to RStudio console
## [1] 3.277758 5.952375 4.849289 3.994664 2.721948 5.649542
hist(y_rchisq, # Plot of randomly drawn chisq density
breaks = 100,
main = "")
#Step 2: Create a zero-filled matrix that will store the sample means for 4 different sample sizes: 2, 6, 30, 1000
z <- matrix(data = rep(x = 0,
times = 40000
),
nrow = 10000,
ncol = 4)
#replace values of null matrix
n <- c(2, 6, 30, 1000) # let n be the sample size
#larger sample size should have closer sample mean to true population mean
# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.
for (j in 1:4){ # indexing columns of matrix, where each column will represent different sample size
for (i in 1:10000){ # indexes the rows of matrix
z[i,j] <- mean(sample( x = y_rchisq, # compute mean and assign
size = n[j],
replace = TRUE
)
)
}
}
#check summary stats to see the sample mean get closer to the mean of the actual chi squared distribution (5, the degrees of freedom) as sample size increases. The distribution of the sample mean should also get 'tighter' around the true population mean.
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. : 0.3883 Min. : 1.590 Min. :3.212 Min. :4.583
## 1st Qu.: 3.3643 1st Qu.: 4.080 1st Qu.:4.581 1st Qu.:4.917
## Median : 4.5987 Median : 4.882 Median :4.958 Median :4.982
## Mean : 4.9560 Mean : 4.990 Mean :4.978 Mean :4.983
## 3rd Qu.: 6.1783 3rd Qu.: 5.768 3rd Qu.:5.351 3rd Qu.:5.049
## Max. :18.0506 Max. :11.278 Max. :7.416 Max. :5.380
# each matrix column contains sample means from the chi squared distribution for one sample size - col 1 has sample size 2, col 2 has sample size 6, col 3 has sample size 30, and col 4 has sample size 1000. Each column holds 10,000 iterations.
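The 'tightening' seen in the summary can be compared with what theory predicts: the standard deviation of the sample mean should be close to sigma / sqrt(n). The sketch below is a self-contained variation on the loop above, not the same code: it draws fresh values with `rchisq()` instead of resampling `y_rchisq`, uses fewer repetitions, and the seed and variable names are hypothetical.

```r
set.seed(2024)                     # hypothetical seed, only so the sketch is repeatable
dof <- 5
sample_sizes <- c(2, 6, 30, 1000)
n_reps <- 2000                     # fewer repetitions than above, to keep the sketch fast

# For each sample size, draw n_reps samples from chi-square(dof) and keep their means;
# the result is an n_reps x 4 matrix with one column per sample size
sim_means <- sapply(sample_sizes, function(n) {
  replicate(n_reps, mean(rchisq(n, df = dof)))
})

observed_sd <- apply(sim_means, 2, sd)

# sigma / sqrt(n), using the chi-square variance sigma^2 = 2 * dof
theoretical_sd <- sqrt(2 * dof) / sqrt(sample_sizes)

round(rbind(sample_size = sample_sizes, observed_sd, theoretical_sd), 3)
```

If the two rows of standard deviations roughly agree, the shrinking spread in the summary table is behaving as the CLT's standard-error term suggests.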
#check CLT with chi squared with graphs
par(mfrow=c(3,2))
length(y_rchisq)
## [1] 10000
hist(x = y_rchisq,
main = "Histogram of Chi Square Distribution, N=10,000",
xlab = ""
)
for (k in 1:4){
hist(x = z[,k], # matrix column 1-4 which contain sample means of randomly chosen samples from chi square dist
main = "Histogram of Sample Mean of Chi Square Distribution",
xlim = c(0, 25), # plot the domain of X axis from x=0 to x=25 only
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}
#adjust the function to calculate the median instead of the mean
n <- c(2, 6, 30, 1000) # let n be the sample size
# i indexes rows and j indexes columns
# As column j increases from 1 to 4 (1-2-3-4), sample size also increases from 2 to 1000 (2-6-30-1000) respectively.
# Sample (of size 2-6-30-1000) pulled out from the original distribution i=10000 times.
for (j in 1:4){ # indexing columns of matrix, where each column will represent different sample size
for (i in 1:10000){ # indexes the rows of matrix
z[i,j] <- median(sample( x = y_rchisq, # compute median and assign
size = n[j],
replace = TRUE
)
)
}
}
#check the summary to see if central limit theorem holds with median
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. : 0.2984 Min. : 1.103 Min. :2.058 Min. :3.853
## 1st Qu.: 3.3324 1st Qu.: 3.545 1st Qu.:3.921 1st Qu.:4.273
## Median : 4.6265 Median : 4.384 Median :4.362 Median :4.357
## Mean : 4.9693 Mean : 4.519 Mean :4.379 Mean :4.355
## 3rd Qu.: 6.2093 3rd Qu.: 5.359 3rd Qu.:4.806 3rd Qu.:4.441
## Max. :17.1616 Max. :12.389 Max. :7.423 Max. :4.795
par(mfrow=c(3,2))
length(y_rchisq)
## [1] 10000
hist(x = y_rchisq,
main = "Histogram of Chi Square Distribution, N=10,000",
xlab = ""
)
for (k in 1:4){
hist(x = z[,k], # matrix column 1-4 which contain sample medians of randomly chosen samples from chi square dist
main = "Histogram of Sample Median of Chi Square Distribution",
xlim = c(0, 25), # plot the domain of X axis from x=0 to x=25 only
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}
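It appears that yes, the sample median also becomes approximately normally distributed as the sample size grows. Strictly speaking, the CLT is a statement about sample means; the approximate normality of the sample median is a separate large-sample result. Note also that the medians settle around 4.35, the median of this Chi Square distribution, rather than around its mean of 5, because the Chi Square distribution is right-skewed and its mean and median differ.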
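One way to check what the sample medians are converging to is to compare them with the population median of the Chi Square distribution and with the usual large-sample formula for the spread of a sample median, 1 / (2 * f(m) * sqrt(n)), where f is the density and m the population median. This is a self-contained sketch that draws directly from `rchisq()` rather than reusing `y_rchisq`; the seed, repetition count, and variable names are hypothetical.

```r
dof <- 5
pop_mean   <- dof                      # mean of a chi-square equals its degrees of freedom
pop_median <- qchisq(0.5, df = dof)    # population median, about 4.35

# Large-sample standard deviation of the sample median for samples of size n
n <- 1000
asymptotic_sd <- 1 / (2 * dchisq(pop_median, df = dof) * sqrt(n))

set.seed(99)                           # hypothetical seed for this sketch
sim_medians <- replicate(2000, median(rchisq(n, df = dof)))

c(pop_mean = pop_mean,
  pop_median = pop_median,
  mean_of_medians = mean(sim_medians),
  sd_of_medians = sd(sim_medians),
  asymptotic_sd = asymptotic_sd)
```

The simulated medians should centre on the population median rather than the population mean, which is the gap visible between the mean-based and median-based summary tables.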