When we draw a sample for testing, it is usually small relative to the population. As we add more and more observations, the sample comes to represent the population more closely, and the sample mean gets closer to the average of the entire population. This is the law of large numbers: the more tests, trials, and experiments we run, the closer our sample average tends to be to the true average.
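As a rough illustration of this idea, the sketch below tracks the running mean of simulated draws as more observations are added. The exponential population with mean 1, the seed, and the names `draws`, `running_mean`, and `checkpoints` are all assumptions chosen for the example, not part of the analysis that follows.

```r
set.seed(42)

# illustrative population: exponential with rate 1, so the true mean is 1
draws = rexp(n = 10000, rate = 1)

# mean of the first 1, 2, 3, ... observations
running_mean = cumsum(draws) / seq_along(draws)

# inspect the running mean at a few sample sizes
checkpoints = c(10, 100, 1000, 10000)
data.frame(n = checkpoints, running_mean = running_mean[checkpoints])
```

The running mean can wander noticeably at small n and should settle near 1 by the last checkpoint.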
We know that different populations’ measures fall into different patterns, or distribution shapes. It is important that we choose the correct distribution to calculate sample statistics and ultimately draw conclusions about the population. However, as our sample grows, the distribution of the sample mean can be approximated with the normal distribution no matter the true population distribution, provided the population has a finite variance. This is the central limit theorem.
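Both the law of large numbers (LLN) and the central limit theorem (CLT) apply to large samples and tell us something about the sample mean. The LLN tells us that the sample mean will converge to the population mean as the sample grows. The CLT tells us that the distribution of the sample mean, taken over repeated samples, approaches a normal distribution as the sample size gets larger. It does not say that the sample itself becomes normally distributed.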
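Stated a little more formally, for independent draws $X_1, \dots, X_n$ from a population with mean $\mu$ and finite variance $\sigma^2$ (this notation is introduced here for illustration and is not taken from the text above), the two results read:

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \;\xrightarrow{\;p\;}\; \mu
\qquad \text{(LLN)}
$$

$$
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{\;d\;}\; N(0, 1)
\qquad \text{(CLT)}
$$

In practical terms, for large $n$ the sample mean is approximately $N(\mu, \sigma^2 / n)$: centered on the population mean, with a standard deviation that shrinks like $1/\sqrt{n}$.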
I have chosen the Chi-square distribution. Chi-square is a continuous probability distribution that is commonly used in hypothesis testing because of its relationship to the standard normal distribution. Chi-square can have different degrees of freedom, represented by k. If the values from a random sample of a standard normal distribution were squared, we would have a Chi-square distribution with k = 1. If the values from two independent standard normal samples were squared and added together, we would have Chi-square with k = 2.
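The construction just described can be checked with a short simulation. This is only a sketch: the choice of k = 2, the seed, the number of draws, and the names `z`, `chi_constructed`, and `chi_builtin` are assumptions made for the example.

```r
set.seed(1)

k = 2             # degrees of freedom (illustrative choice)
n_draws = 10000   # number of simulated values

# each row holds k independent standard normal draws
z = matrix(rnorm(n_draws * k), nrow = n_draws, ncol = k)

# squaring and summing across each row gives one Chi-square(k) value per row
chi_constructed = rowSums(z^2)

# draws from R's built-in Chi-square generator, for comparison
chi_builtin = rchisq(n = n_draws, df = k)

rbind(
  constructed = c(mean = mean(chi_constructed), var = var(chi_constructed)),
  builtin     = c(mean = mean(chi_builtin), var = var(chi_builtin))
)
```

Both rows should land near a mean of k and a variance of 2k, up to simulation noise.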
There are not many everyday examples whose observations follow a Chi-square distribution, so it is not often used to describe real-world situations directly. It has only one parameter, k, the degrees of freedom.
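Because k is the only parameter, it fixes every moment of the distribution. A small sketch of the standard formulas (mean k, variance 2k, skewness sqrt(8 / k), excess kurtosis 12 / k), evaluated at k = 4; the vector name `theoretical` is an assumption made for the example.

```r
k = 4   # degrees of freedom

# standard moments of a Chi-square distribution with k degrees of freedom
theoretical = c(
  mean            = k,
  sd              = sqrt(2 * k),
  skew            = sqrt(8 / k),
  excess_kurtosis = 12 / k
)

round(theoretical, 2)
```

For k = 4 this gives a mean of 4, a standard deviation of about 2.83, and a skewness of about 1.41, which serve as reference points for the summary statistics of a simulated sample.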
library("psych")
set.seed(10)
# generate a random sample with the Chi-square distribution, with 4 degrees of freedom
chi_sample = rchisq(n = 10000, df = 4)
# summary statistics for my sample
describe(chi_sample)
##    vars     n mean   sd median trimmed mad  min   max range skew kurtosis   se
## X1    1 10000    4 2.82   3.35    3.64 2.4 0.03 20.97 20.94 1.37     2.55 0.03
# plot histogram of my sample
hist(x = chi_sample, main = "Histogram of Chi-Square Distribution With k = 4")
# create a matrix of 10000 rows and 1 column to store the means of randomly drawn samples of size 100
chi_matrix = matrix(data = rep(x = 0, times = 10000), nrow = 10000, ncol = 1)
# take a random sample of 100 observations from the chi-square distribution, find its mean, and store it in row 1 of my matrix. Repeat 9,999 times - storing the mean of sample of size 100 in row i of chi_matrix
for (i in 1:10000) {
  chi_matrix[i, ] = mean(sample(x = chi_sample, size = 100, replace = TRUE))
}
# summary statistics of my matrix of sample means
describe(chi_matrix)
##    vars     n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 10000 3.99 0.28   3.99    3.99 0.28 2.94 5.16  2.22 0.07     0.07  0
# plot histogram of means from my sample matrix
hist(chi_matrix, xlab = "", main = "Histogram of Sample Mean (n = 100)")
# create a zero-filled matrix with 4 columns, one per sample size, to fill with sample means
chi_2_matrix = matrix(data = rep(x = 0, times = 40000), nrow = 10000, ncol = 4)
# larger sample size should have closer sample mean to true population mean
# 1 - take a random sample of 2 observations from the chi-square distribution, find its mean, and store it in column 1, row 1 of matrix chi_2_matrix (repeat 9,999 times)
# 2 - take a random sample of 6 observations from the chi-square distribution, find its mean, and store it in column 2, row 1 of matrix chi_2_matrix (repeat 9,999 times)
# 3 - take a random sample of 30 observations from the chi-square distribution, find its mean, and store it in column 3, row 1 of matrix chi_2_matrix (repeat 9,999 times)
# 4 - take a random sample of 1000 observations from the chi-square distribution, find its mean, and store it in column 4, row 1 of chi_2_matrix (repeat 9,999 times)
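# sample sizes for columns 1 to 4, as listed in the comments above
sample_sizes = c(2, 6, 30, 1000)
for (j in 1:4) {
  for (i in 1:10000) {
    # size must be passed explicitly: without it, sample() draws all length(chi_sample) = 10000 values,
    # which is how the near-identical column summaries printed below were produced
    chi_2_matrix[i, j] = mean(sample(x = chi_sample, size = sample_sizes[j], replace = TRUE))
  }
}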
# if intuition is correct, the mean of the sample should get closer to the mean of the actual chi-square distribution as sample size increases, and the distribution of the sample mean should get "tighter" around the true population mean
colnames(chi_2_matrix) = c("Sample Size = 2", "Sample Size = 6", "Sample Size = 30", "Sample Size = 1000")
summary(chi_2_matrix)
##  Sample Size = 2 Sample Size = 6 Sample Size = 30 Sample Size = 1000
##  Min.   :3.899   Min.   :3.897   Min.   :3.900    Min.   :3.906
##  1st Qu.:3.979   1st Qu.:3.979   1st Qu.:3.979    1st Qu.:3.979
##  Median :3.997   Median :3.998   Median :3.998    Median :3.998
##  Mean   :3.998   Mean   :3.998   Mean   :3.998    Mean   :3.999
##  3rd Qu.:4.016   3rd Qu.:4.017   3rd Qu.:4.017    3rd Qu.:4.018
##  Max.   :4.099   Max.   :4.099   Max.   :4.110    Max.   :4.112
# shown with graphs
par(mfrow = c(3, 2))
hist(x = chi_sample, main = "Histogram of Chi Square", xlab = "")
for (k in 1:4) {
  hist(x = chi_2_matrix[, k], main = "Histogram of Sample Mean of Chi-Square Distribution", xlab = paste0("Column ", k, " from matrix"))
}
colMeans(chi_2_matrix)
##    Sample Size = 2    Sample Size = 6   Sample Size = 30 Sample Size = 1000
##           3.997509           3.997740           3.997753           3.998531
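The column means at the end all sit close to 4, the mean of a Chi-square distribution with k = 4, which is what the LLN predicts. The printed summaries, however, are nearly identical across the four columns. That is what results when sample() is called without a size argument, since each draw then resamples all 10,000 values, so these numbers do not by themselves show the effect of sample size. When samples of 2, 6, 30, and 1000 observations are actually drawn, the sample means cluster more tightly around 4 as the sample size grows, and their histograms lose the right skew of the Chi-square distribution and approach our familiar bell curve, which is what the CLT predicts.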
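One way to make the effect of sample size visible in numbers is to compare the spread of the sample means with the value the CLT suggests, sd / sqrt(n). The sketch below is a hedged illustration rather than the original analysis: it regenerates the sample with the same seed and parameters used earlier, passes an explicit size to sample(), and uses 2,000 repetitions instead of 10,000 to keep the run short. The names `sample_sizes`, `n_reps`, `mean_matrix`, and `comparison` are assumptions made for the example.

```r
set.seed(10)
chi_sample = rchisq(n = 10000, df = 4)

sample_sizes = c(2, 6, 30, 1000)
n_reps = 2000   # repetitions per sample size (illustrative choice)

# one column of resampled means per sample size
mean_matrix = sapply(sample_sizes, function(n) {
  replicate(n_reps, mean(sample(x = chi_sample, size = n, replace = TRUE)))
})
colnames(mean_matrix) = paste("n =", sample_sizes)

# observed spread of the sample means vs. the CLT approximation sd / sqrt(n)
comparison = rbind(
  observed_sd  = apply(mean_matrix, 2, sd),
  predicted_sd = sd(chi_sample) / sqrt(sample_sizes)
)
round(comparison, 3)
```

The two rows should track each other closely, with both shrinking as n grows. Calling hist() on each column of `mean_matrix` shows the matching change in shape.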
set.seed(10)
library("miscTools")
# create a new empty matrix to fill with the median
chi_3_matrix = matrix(data = rep(x = 0, times = 40000), nrow = 10000, ncol = 4)
# generate samples of differing sizes, but this time calculate the median, not the mean
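# sample sizes for columns 1 to 4
sample_sizes = c(2, 6, 30, 1000)
for (j in 1:4) {
  for (i in 1:10000) {
    # size must be passed explicitly: without it, sample() draws all length(chi_sample) = 10000 values,
    # which is how the near-identical column medians printed below were produced
    chi_3_matrix[i, j] = median(sample(x = chi_sample, size = sample_sizes[j], replace = TRUE))
  }
}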
colnames(chi_3_matrix) = c("Sample Size = 2", "Sample Size = 6", "Sample Size = 30", "Sample Size = 1000")
colMedians(chi_3_matrix)
##    Sample Size = 2    Sample Size = 6   Sample Size = 30 Sample Size = 1000
##           3.353092           3.353375           3.353144           3.353966
median(chi_sample)
## [1] 3.353144
# with graphs
par(mfrow = c(3, 2))
hist(x = chi_sample, main = "Histogram of Chi Square", xlab = "")
for (k in 1:4) {
  hist(x = chi_3_matrix[, k], main = "Histogram of Sample Median of Chi-Square Distribution", xlab = paste0("Column ", k, " from matrix"))
}
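My initial thought was that the CLT would not make sense for the median of our data, given the way this sampling takes place. Selecting two random values from our set of 10,000 and finding their median could lead to very different results from draw to draw, especially given that the minimum is 0.03204, the maximum is 20.96961, and the median is 3.35314. However, this experimenting has led me to some interesting results. The column medians all sit within 0.001 of the median of chi_sample, and the value in the "Sample Size = 30" column, 3.353144, is an exact match. Although this appears to be luck of the draw, it is very interesting. One caution applies when reading the printed numbers: the four columns are nearly identical to one another, which is what happens when sample() is called without a size argument and each draw resamples all 10,000 values, so they do not by themselves show the effect of sample size. Strictly speaking, the CLT is a statement about the sample mean, but a comparable large-sample result holds for the sample median of a continuous distribution such as Chi-square, so we should expect the median histograms to look more normal as the sample grows too. Although it is not the easiest to see, the graphs do appear to become more normal in shape as the sample increases, though not to the same extent that the sample mean graphs did.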
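The behavior of the sample median can be examined in the same way. The sketch below is an illustration under stated assumptions, not the analysis above: it draws directly from the Chi-square distribution with k = 4 rather than resampling chi_sample, uses 2,000 repetitions, and compares the observed spread with the standard large-sample approximation for a sample median, 1 / (2 f(m) sqrt(n)), where m is the population median and f is the density. The names `pop_median`, `median_matrix`, `observed_sd`, and `asymptotic_sd` are assumptions made for the example.

```r
set.seed(10)

k = 4
pop_median = qchisq(p = 0.5, df = k)   # true median of Chi-square(k)
sample_sizes = c(2, 6, 30, 1000)
n_reps = 2000                          # repetitions per sample size (illustrative choice)

# one column of sample medians per sample size, drawn fresh from the population
median_matrix = sapply(sample_sizes, function(n) {
  replicate(n_reps, median(rchisq(n = n, df = k)))
})
colnames(median_matrix) = paste("n =", sample_sizes)

observed_sd = apply(median_matrix, 2, sd)

# large-sample approximation to the sd of a sample median: 1 / (2 * f(m) * sqrt(n))
asymptotic_sd = 1 / (2 * dchisq(x = pop_median, df = k) * sqrt(sample_sizes))

round(rbind(
  mean_of_medians = colMeans(median_matrix),
  observed_sd     = observed_sd,
  asymptotic_sd   = asymptotic_sd
), 3)
```

The approximation is only expected to be good for large n, so the n = 1000 column is the fairest comparison. The small-sample columns show how far the median can stray when only a few values are drawn.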