Week 4 Discussion - Central Limit Theorem

1. Please Google and describe Law of Large Numbers in your own words.

When we generate samples for testing, we usually do not have a very big number in our sample group. However, as we add more and more to our sample group, it grows to become closer to the actual population. As it grows, the sample mean gets closer to the average of the entire population because our sample group is getting closer to representing the population. The more tests, trials, and experiments that we run, the more accurate our sample average will be.
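As a minimal sketch of this idea (not part of the original exercise), the running average of simulated fair die rolls can be tracked as the number of rolls grows. The die, the seed, and the names `rolls`, `running_mean`, and `lln_table` are all assumptions chosen for illustration. The true mean of a fair die is 3.5.

```r
# Illustrative sketch: running mean of fair die rolls (true mean = 3.5)
set.seed(1)                                   # arbitrary seed, for reproducibility only
rolls <- sample(x = 1:6, size = 100000, replace = TRUE)

# running_mean[n] is the average of the first n rolls
running_mean <- cumsum(rolls) / seq_along(rolls)

# distance from the true mean at a few sample sizes
checkpoints <- c(10, 100, 1000, 10000, 100000)
lln_table <- data.frame(
  n = checkpoints,
  running_mean = running_mean[checkpoints],
  abs_error = abs(running_mean[checkpoints] - 3.5)
)
print(lln_table)
```

The error at any single checkpoint is random, so it does not have to shrink at every step, but it should be small by the last row.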

2. Please explain CLT in your own words.

We know that different populations’ measures fall into different patterns, or distribution shapes. It is important that we choose the correct distribution to calculate sample statistics and ultimately draw conclusions about the population. However, as our sample size grows, the distribution of the sample mean can be approximated by the normal distribution no matter what the true population distribution is, as long as that distribution has a finite variance.

3. What are the similarities and differences between LLN and CLT?

Both the LLN and CLT apply to large samples and tell us something about the sample mean. The LLN tells us that the sample mean will converge to the population mean as the sample grows, and the CLT tells us that the distribution of the sample mean will approach a normal distribution as the sample gets larger.
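The difference can also be written in standard notation. The symbols here are assumptions introduced for illustration: X_1, ..., X_n are independent draws from a population with mean \mu and finite variance \sigma^2, and \bar{X}_n is their sample mean. The first line is the (weak) LLN, a statement about where the sample mean ends up. The second line is the CLT, a statement about the shape of its distribution around that value.

```latex
\begin{align*}
\text{LLN:} \quad & \bar{X}_n \xrightarrow{\;p\;} \mu
  && \text{as } n \to \infty \\
\text{CLT:} \quad & \frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma} \xrightarrow{\;d\;} N(0, 1)
  && \text{as } n \to \infty
\end{align*}
```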

4. Pick up any distribution apart from normal, uniform, or poisson. Please describe this distribution first in five lines.

I have chosen the Chi-square distribution. Chi-square is a continuous probability distribution that is commonly used in hypothesis testing because of its relationship to the standard normal distribution. Chi-square distributions differ in their degrees of freedom, represented by k. If the values from a random sample of a standard normal distribution were squared, we would have a Chi-square distribution with k = 1. If the values from two independent standard normal samples were squared and added together, we would have a Chi-square distribution with k = 2.

There are not many everyday examples whose observations follow a Chi-square distribution, so it is not often used to describe real-world situations. It has only one parameter, k, to indicate the degrees of freedom.
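The link to the standard normal distribution can be checked with a short simulation. This is only a sketch under assumed settings: the seed, the number of draws, and the names `z` and `chi_from_normals` are illustrative. It squares and sums k independent standard normal values per row, then compares the result with what a Chi-square distribution with k degrees of freedom should give (mean k, variance 2k).

```r
# Illustrative sketch: build Chi-square values from squared standard normals
set.seed(2)            # arbitrary seed, for reproducibility only
k <- 4                 # degrees of freedom
n_draws <- 50000       # assumed number of simulated values

# each row holds k independent standard normal draws
z <- matrix(rnorm(n_draws * k), nrow = n_draws, ncol = k)

# sum of k squared standard normals, one value per row
chi_from_normals <- rowSums(z^2)

mean(chi_from_normals)   # theory: k
var(chi_from_normals)    # theory: 2 * k

# quartiles of the simulated values next to the theoretical Chi-square quartiles
probs <- c(0.25, 0.5, 0.75)
print(rbind(simulated = quantile(chi_from_normals, probs),
            theoretical = qchisq(probs, df = k)))
```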

5. Then, apply the CLT on the sample mean of this chosen distribution in R.

library("psych")
set.seed(10)
# generate a random sample with the Chi-square distribution, with 4 degrees of freedom
chi_sample = rchisq(n = 10000, df = 4)

# summary statistics for my sample
describe(chi_sample)

##    vars     n mean   sd median trimmed mad  min   max range skew kurtosis   se
## X1    1 10000    4 2.82   3.35    3.64 2.4 0.03 20.97 20.94 1.37     2.55 0.03

# plot histogram of my sample
hist(x = chi_sample, main = "Histogram of Chi-Square Distribution With k = 4")

# create a matrix of 10000 rows and 1 column to store the means of randomly drawn samples of size 100
chi_matrix = matrix(data = rep(x = 0, times = 10000), nrow = 10000, ncol = 1)

# take a random sample of 100 observations (with replacement) from chi_sample, find its mean, and store it in row 1 of my matrix. Repeat 9,999 times - storing the mean of each sample of size 100 in row i of chi_matrix

for (i in 1:10000) {
  chi_matrix[i, ] = mean(sample(x = chi_sample, size = 100, replace = TRUE))
}

# summary statistics of my matrix of sample means
describe(chi_matrix)

##    vars     n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 10000 3.99 0.28   3.99    3.99 0.28 2.94 5.16  2.22 0.07     0.07  0

# plot histogram of means from my sample matrix
hist(chi_matrix, xlab = "", main = "Histogram of Sample Mean (n = 100)")

# create a zero matrix with 4 columns, one per sample size, to fill with sample means
chi_2_matrix = matrix(data = rep(x = 0, times = 40000), nrow = 10000, ncol = 4)

# larger sample size should have closer sample mean to true population mean 
# 1 - take a random sample of 2 observations from the chi-square distribution, find its mean, and store it in column 1, row 1 of matrix chi_2_matrix (repeat 9,999 times)
# 2 - take a random sample of 6 observations from the chi-square distribution, find its mean, and store it in column 2, row 1 of matrix chi_2_matrix (repeat 9,999 times)
# 3 - take a random sample of 30 observations from the chi-square distribution, find its mean, and store it in column 3, row 1 of matrix chi_2_matrix (repeat 9,999 times)
# 4 - take a random sample of 1000 observations from the chi-square distribution, find its mean, and store it in column 4, row 1 of chi_2_matrix (repeat 9,999 times)

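# sample sizes for columns 1 to 4; size must be passed explicitly, because sample() without size draws all length(x) = 10000 values, which gives four near-identical columns like those in the summary printed below
sample_sizes = c(2, 6, 30, 1000)

for (j in 1:4) {
  for (i in 1:10000) {
    chi_2_matrix[i, j] = mean(sample(x = chi_sample, size = sample_sizes[j], replace = TRUE))
  }
}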

# if intuition is correct, the mean of the sample should get closer to the mean of the actual chi-square distribution as sample size increases, and the distribution of the sample mean should get "tighter" around the true population mean

colnames(chi_2_matrix) = c("Sample Size = 2", "Sample Size = 6", "Sample Size = 30", "Sample Size = 1000")
summary(chi_2_matrix)

##  Sample Size = 2 Sample Size = 6 Sample Size = 30 Sample Size = 1000
##  Min.   :3.899   Min.   :3.897   Min.   :3.900    Min.   :3.906     
##  1st Qu.:3.979   1st Qu.:3.979   1st Qu.:3.979    1st Qu.:3.979     
##  Median :3.997   Median :3.998   Median :3.998    Median :3.998     
##  Mean   :3.998   Mean   :3.998   Mean   :3.998    Mean   :3.999     
##  3rd Qu.:4.016   3rd Qu.:4.017   3rd Qu.:4.017    3rd Qu.:4.018     
##  Max.   :4.099   Max.   :4.099   Max.   :4.110    Max.   :4.112

# shown with graphs
par(mfrow = c(3, 2))
hist(x = chi_sample, main = "Histogram of Chi Square", xlab = "")
for (k in 1:4) {
  hist(x = chi_2_matrix[, k], main = "Histogram of Sample Mean of Chi-Square Distribution", xlab = paste0("Column ", k, " from matrix"))
}

colMeans(chi_2_matrix)

##    Sample Size = 2    Sample Size = 6   Sample Size = 30 Sample Size = 1000 
##           3.997509           3.997740           3.997753           3.998531

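We can see from the graphs, as well as from the column means at the end, that the sample means center on roughly 4, the mean of a Chi-square distribution with 4 degrees of freedom. The LLN says this centering should tighten as the sample size grows. Furthermore, the histograms of the sample means for the larger sample sizes are much more symmetric than the right-skewed histogram of the original sample and look like our familiar bell curve, which is consistent with the CLT.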
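The tightening and the loss of skew can also be put into numbers. The sketch below is an assumption-laden illustration rather than the original analysis: it draws directly from `rchisq` instead of resampling `chi_sample`, uses 2000 repetitions instead of 10000 to run quickly, and defines a small hypothetical helper `skew`. For a Chi-square distribution with k degrees of freedom, theory gives a standard deviation of sqrt(2k / n) and a skewness of sqrt(8 / (n k)) for the mean of n draws.

```r
# Illustrative sketch: spread and skewness of the sample mean by sample size
set.seed(3)                          # arbitrary seed, for reproducibility only
k <- 4                               # degrees of freedom
sample_sizes <- c(2, 6, 30, 1000)
n_reps <- 2000                       # assumed number of repetitions per sample size

# one vector of n_reps sample means for each sample size
sim_means <- lapply(sample_sizes, function(n) {
  replicate(n_reps, mean(rchisq(n, df = k)))
})

# hypothetical helper: moment-based sample skewness
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

clt_table <- data.frame(
  n = sample_sizes,
  simulated_sd = sapply(sim_means, sd),
  theory_sd = sqrt(2 * k / sample_sizes),
  simulated_skew = sapply(sim_means, skew),
  theory_skew = sqrt(8 / (sample_sizes * k))
)
print(clt_table)
```

A shrinking standard deviation reflects the LLN side of the story, while a skewness that moves toward 0 reflects the distribution of the mean becoming more symmetric, as the CLT predicts.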

6. Alternatively, apply the CLT on any other sample statistic like say the sample median, sample 25th percentile, or even the sample 80th percentile. Does the central limit theorem hold as expected? Please elaborate (at least 3 points).

set.seed(10)
library("miscTools")
# create a new empty matrix to fill with the median

chi_3_matrix = matrix(data = rep(x = 0, times = 40000), nrow = 10000, ncol = 4)

# generate samples of differing sizes, but this time calculate the median, not the mean
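# sample sizes for columns 1 to 4; size must be passed explicitly, because sample() without size draws all length(x) = 10000 values, which gives four near-identical columns like those printed below
sample_sizes = c(2, 6, 30, 1000)

for (j in 1:4) {
  for (i in 1:10000) {
    chi_3_matrix[i, j] = median(sample(x = chi_sample, size = sample_sizes[j], replace = TRUE))
  }
}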

colnames(chi_3_matrix) = c("Sample Size = 2", "Sample Size = 6", "Sample Size = 30", "Sample Size = 1000")
colMedians(chi_3_matrix)

##    Sample Size = 2    Sample Size = 6   Sample Size = 30 Sample Size = 1000 
##           3.353092           3.353375           3.353144           3.353966

median(chi_sample)

## [1] 3.353144

# with graphs
par(mfrow = c(3, 2))
hist(x = chi_sample, main = "Histogram of Chi Square", xlab = "")
for (k in 1:4) {
  hist(x = chi_3_matrix[, k], main = "Histogram of Sample Median of Chi-Square Distribution", xlab = paste0("Column ", k, " from matrix"))
}

My initial thought was that the CLT would not make sense for the median of our data, given the way this sampling takes place. Selecting two random values from our set of 10000 and finding their median could lead to very different results, especially given that the minimum is 0.03204, the maximum is 20.96961, and the median is 3.35314. However, this experiment has led me to some interesting results. The CLT does appear to hold. The median of the sample medians lands within about 0.001 of the median of chi_sample (3.353144) in every column. Surprisingly, the median of the medians in the "Sample Size = 30" column is an exact match. Although this appears to be luck of the draw, it is very interesting. Although it is not the easiest to see, the graphs do appear to become more normal in shape as the sample size increases, though not to the same extent as the sample mean graphs did.
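One way to probe this further is to compare the spread of simulated sample medians with the standard large-sample result for the median, which says the sample median is approximately normal around the population median m with standard deviation 1 / (2 f(m) sqrt(n)), where f is the population density. The sketch below is an illustration under assumptions, not the original analysis: it draws directly from `rchisq`, uses 2000 repetitions, and the names `sim_medians` and `median_table` are made up for this example. The approximation is only expected to be close for the larger sample sizes.

```r
# Illustrative sketch: spread of the sample median by sample size
set.seed(4)                          # arbitrary seed, for reproducibility only
k <- 4                               # degrees of freedom
sample_sizes <- c(2, 6, 30, 1000)
n_reps <- 2000                       # assumed number of repetitions per sample size

# population median of the Chi-square distribution with k degrees of freedom
pop_median <- qchisq(0.5, df = k)

# one vector of n_reps sample medians for each sample size
sim_medians <- lapply(sample_sizes, function(n) {
  replicate(n_reps, median(rchisq(n, df = k)))
})

median_table <- data.frame(
  n = sample_sizes,
  mean_of_medians = sapply(sim_medians, mean),
  simulated_sd = sapply(sim_medians, sd),
  # large-sample approximation: 1 / (2 * f(m) * sqrt(n))
  asymptotic_sd = 1 / (2 * dchisq(pop_median, df = k) * sqrt(sample_sizes))
)
print(pop_median)
print(median_table)
```

For small samples such as n = 2 the median is just the average of the two values, so its distribution is still clearly skewed, which fits the observation that the median histograms look less normal than the mean histograms.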