0. Set the seed.

set.seed(seed = 17)

1. Describe the Law of Large Numbers

We know that when rolling a fair six-sided die, the probability of any given value being rolled is \(\frac{1}{6}\), and that the expected value of this uniform distribution is 3.5. However, in practice, rolling such a die six times would not guarantee that each number is rolled exactly once. Over hundreds, thousands, or even millions of repetitions, though, we would expect to see a much more even distribution of the numbers rolled. This is what the Law of Large Numbers describes: as the sample size of an experiment increases, the sample mean converges to the population mean. In other words, for random variable X, sample size n, sample mean \(\bar{X}_n\), and population mean \(\mu\), \(\bar{X}_n \to \mu\) as \(n \to \infty\).
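The convergence described above can be sketched with a short simulation. This is only an illustration: the seed, the number of rolls, and the variable names `rolls` and `running_mean` are assumptions chosen for the example, not part of the assignment.

```r
# Sketch: running mean of simulated fair-die rolls (seed and roll count are illustrative)
set.seed(17)
rolls <- sample(x = 1:6, size = 100000, replace = TRUE)

# Mean of the first i rolls, for every i from 1 to 100,000
running_mean <- cumsum(rolls) / seq_along(rolls)

# Running mean after 10, 1,000, and 100,000 rolls
running_mean[c(10, 1000, 100000)]

# Distance from the expected value 3.5 at the end of the run
abs(running_mean[100000] - 3.5)
```

The early values can wander noticeably, while the value at the end of the run should sit very close to 3.5.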

2. Explain the Central Limit Theorem

The Central Limit Theorem describes how, with a sufficiently large sample size, the distribution of the sample mean \(\bar{X}\) can be modeled with a normal distribution, whatever the distribution of the underlying random variable X (provided it has a finite variance). A common rule of thumb for the cutoff is a sample size of 30 or more. Once this is reached, we can take the mean of the sampling distribution to be the population mean (\(\mu_{\bar{X}} = \mu\)) and its standard deviation, the standard error, to be the ratio of the population standard deviation to the square root of the sample size (\(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)). Using dice as the example, if we were to roll two six-sided dice instead of one, the sum (which comes from adding two uniform distributions) has mean 7 and standard deviation \(\frac{\sqrt{210}}{6}\) (\(\approx 2.4152\)), and its triangular shape already leans toward a bell curve. As more dice are added to each roll, the distribution of the sum, and of the average, resembles a normal distribution more and more closely.
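The dice figures above can be checked directly. The sketch below first enumerates the 36 equally likely outcomes of two dice, then simulates averages of 30 dice to compare their spread with \(\frac{\sigma}{\sqrt{n}}\). The seed, the choice of 30 dice, the 10,000 repetitions, and all variable names are assumptions made for illustration.

```r
# Every equally likely outcome of rolling two fair dice (36 sums)
sums <- as.vector(outer(X = 1:6, Y = 1:6, FUN = "+"))

mean_sum <- mean(sums)                      # exact mean of the sum
sd_sum <- sqrt(mean((sums - mean_sum)^2))   # population sd (divides by 36, not 35)
c(mean = mean_sum, sd = sd_sum, closed_form = sqrt(210) / 6)

# Averages of 30 dice: the spread should be close to sigma / sqrt(30),
# where sigma = sqrt(35 / 12) is the standard deviation of a single die
set.seed(17)
means_30 <- replicate(n = 10000,
                      expr = mean(sample(x = 1:6, size = 30, replace = TRUE)))
c(simulated = sd(means_30), predicted = sqrt(35 / 12) / sqrt(30))
```

A histogram of `means_30` would be one way to see the bell shape that the theorem describes.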

(This website helped me understand the key differences between the LLN and the CLT: https://www.geeksforgeeks.org/maths/central-limit-theorem/)

3. What are the similarities and differences between the Law of Large Numbers and the Central Limit Theorem?

Both the Law of Large Numbers and the Central Limit Theorem describe what happens to the sample mean as the sample size grows, and both center on the population mean. Where they differ is that the Law of Large Numbers only states that the sample mean converges to the population mean, whereas the Central Limit Theorem describes the shape and spread of the sample mean's distribution around the population mean: approximately normal, with standard deviation \(\frac{\sigma}{\sqrt{n}}\). With the Central Limit Theorem, the distribution of the underlying random variable is usually unknown and, as long as its variance is finite, irrelevant to the fact that the sample mean can eventually be modeled with a normal distribution.

4. Describe any distribution other than normal, uniform, or Poisson.

The hypergeometric distribution is a discrete probability distribution. Given a finite population in which a fixed number of elements have a desired trait (the successes), this distribution describes the probability of drawing a given number of successes in a fixed number of draws made without replacement. A common use is determining the probability of pulling one or more specific cards from a deck within a number of draws, such as when trying to find the last card of a given suit to complete a flush in poker. The key aspects of this distribution are that the population size is fixed and sampling is done without replacement, so the number of successes in the sample can never exceed the number of successes in the population.
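As a rough illustration of the description above, the sketch below uses base R's `dhyper()` with the same parameters as the `rhyper()` call in part 5 (5 successes, 35 failures, 8 draws). The variable names are assumptions chosen for readability. The theoretical mean works out to \(8 \cdot \frac{5}{40} = 1\) and the theoretical standard deviation to about 0.8473.

```r
successes <- 5    # elements in the population with the desired trait
failures  <- 35   # elements without it
draws     <- 8    # items drawn without replacement
pop_size  <- successes + failures

# Probability of each possible number of successes in the sample
x <- 0:min(successes, draws)
p <- dhyper(x = x, m = successes, n = failures, k = draws)
round(p, 4)

# Theoretical mean and standard deviation of the distribution
theo_mean <- draws * successes / pop_size
theo_sd <- sqrt(draws * (successes / pop_size) * (failures / pop_size) *
                (pop_size - draws) / (pop_size - 1))
c(mean = theo_mean, sd = theo_sd)

# Probability of drawing at least one success in the 8 draws
1 - dhyper(x = 0, m = successes, n = failures, k = draws)
```

These theoretical values give a reference point for the simulated mean and standard deviation computed in part 5.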

5. Apply the Central Limit Theorem on the sample mean and standard deviation of the hypergeometric distribution.

# 10,000 simulated draws: population of 5 successes and 35 failures, 8 items drawn each time
my_hyper <- rhyper(nn = 10000, m = 5, n = 35, k = 8)
my_hyper[1:16]
##  [1] 0 3 1 2 1 1 0 0 2 0 1 0 2 2 3 2
mu <- mean(my_hyper)
mu
## [1] 0.9959
sigma <- sd(my_hyper)     
sigma
## [1] 0.8399725
library("psych")
describe(my_hyper)
##    vars     n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 10000    1 0.84      1    0.94 1.48   0   5     5 0.56    -0.06 0.01
hist(x = my_hyper, main = "Histogram of Hypergeometric Distribution (K = 5, N = 40, n = 8)", xlab = "")

# Preallocate a 10,000 x 1 matrix of zeros to hold the sample means
z <- matrix(data = 0, nrow = 10000, ncol = 1)
z[1:16]
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
describe(z)
##    vars     n mean sd median trimmed mad min max range skew kurtosis se
## X1    1 10000    0  0      0       0   0   0   0     0  NaN      NaN  0
# Each entry of z is the mean of 100 values resampled (with replacement) from my_hyper
for (i in 1:10000){
   z[i,] <- mean(sample(x=my_hyper, size = 100, replace = TRUE))
}
z[1:16]
##  [1] 0.95 0.90 1.08 0.87 0.96 1.19 0.86 0.95 1.02 1.10 0.93 0.99 1.05 1.04 1.02
## [16] 1.01
describe(z)
##    vars     n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 10000    1 0.08      1       1 0.09 0.67 1.33  0.66 0.07     0.01  0
hist(z, xlab="", main="Histogram of Sample Mean (n = 100)")

# Preallocate a 10,000 x 4 matrix of zeros: one column per sample size
z <- matrix(data = 0, 
            nrow = 10000, 
            ncol = 4)
n <- c(2, 6, 30, 1000)
for (j in 1:4){        
 
   for (i in 1:10000){  
  
       z[i,j] <- mean(sample( x       = my_hyper, 
                              size    = n[j], 
                              replace = TRUE
                            )
                      )
    }
}
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000") 
summary(z)  
##  Sample size=2    Sample size=6    Sample size=30   Sample size=1000
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.4667   Min.   :0.8930  
##  1st Qu.:0.5000   1st Qu.:0.8333   1st Qu.:0.9000   1st Qu.:0.9780  
##  Median :1.0000   Median :1.0000   Median :1.0000   Median :0.9960  
##  Mean   :0.9867   Mean   :0.9952   Mean   :0.9943   Mean   :0.9957  
##  3rd Qu.:1.5000   3rd Qu.:1.1667   3rd Qu.:1.1000   3rd Qu.:1.0140  
##  Max.   :3.5000   Max.   :2.5000   Max.   :1.5667   Max.   :1.0890
par(mfrow=c(3,2))
length(my_hyper)
## [1] 10000
hist(x = my_hyper, main = "Histogram of Hypergeometric Distribution (K = 5, N = 40, n = 8)", xlab = "")
for (k in 1:4){
    hist(x    = z[,k],    
         main = "Histogram of Sample Mean of Hypergeometric Distribution", 
         xlim = c(0, 3.5),  # sample means range from 0 to 3.5 (see summary above)
         xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)") 
         )
}  
z[1:16]
##  [1] 1.5 0.0 0.5 0.5 1.0 0.0 1.5 2.5 1.0 0.5 1.5 0.5 1.0 1.5 1.5 0.0
mu
## [1] 0.9959
colMeans(z)
##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##        0.9866500        0.9951833        0.9942533        0.9957202
sigma
## [1] 0.8399725
apply(X = z, MARGIN = 2, FUN = sd)
##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##       0.58824432       0.34570761       0.15205320       0.02649662
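# Multiply each column's standard deviation by sqrt(n);
# if the standard error is sigma / sqrt(n), this should recover sigma
apply(X = z,
      MARGIN = 2,
      FUN = sd) * c(sqrt(2), 
                    sqrt(6), 
                    sqrt(30), 
                    sqrt(1000)
                    )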
##    Sample size=2    Sample size=6   Sample size=30 Sample size=1000 
##        0.8319031        0.8468072        0.8328297        0.8378968
sigma
## [1] 0.8399725

5c. Does the Central Limit Theorem hold as expected?

c(mu, colMeans(z))
##                     Sample size=2    Sample size=6   Sample size=30 
##        0.9959000        0.9866500        0.9951833        0.9942533 
## Sample size=1000 
##        0.9957202
c(sigma, apply(X = z, MARGIN = 2, FUN = sd) * c(sqrt(2), sqrt(6), sqrt(30), sqrt(1000)))
##                     Sample size=2    Sample size=6   Sample size=30 
##        0.8399725        0.8319031        0.8468072        0.8328297 
## Sample size=1000 
##        0.8378968

In the case of the mean, yes. The difference between the mean of the sample means at a sample size of 1000 (0.9957202) and the population mean (0.9959) is about 0.00018. If we were to round both values to three decimal places, the two would be identical (0.996).

For the standard deviation, my first reading was that the Central Limit Theorem does not seem to hold: we would expect the standard deviation of the sample means to scale inversely with the square root of the sample size, but the values shown above all sit close to the population standard deviation. On a closer look, those values are the column standard deviations after being multiplied by \(\sqrt{n}\). The unscaled standard deviations (0.588, 0.346, 0.152, and 0.026 for sample sizes 2, 6, 30, and 1000) do shrink as the sample size grows, and multiplying each by \(\sqrt{n}\) returns roughly 0.83 to 0.85 against a population value of 0.84, which is what \(\frac{\sigma}{\sqrt{n}}\) predicts. So the theorem appears to hold for the standard deviation as well. I look forward to hearing from my fellow students on whether they read the results the same way.
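One way to check the standard-deviation question is to compare the unscaled standard deviation of the sample means against \(\frac{\sigma}{\sqrt{n}}\) for each sample size. The sketch below regenerates the data so that it runs on its own. The names `sizes`, `observed_se`, and `predicted_se` and the 5,000 repetitions per sample size are assumptions made for illustration, so the exact figures will differ slightly from the output above.

```r
set.seed(17)
my_hyper <- rhyper(nn = 10000, m = 5, n = 35, k = 8)
sigma <- sd(my_hyper)
sizes <- c(2, 6, 30, 1000)

# Standard deviation of 5,000 resampled means, for each sample size
observed_se <- sapply(X = sizes, FUN = function(sample_size) {
  sd(replicate(n = 5000,
               expr = mean(sample(x = my_hyper, size = sample_size, replace = TRUE))))
})

# What the Central Limit Theorem predicts for the same quantity
predicted_se <- sigma / sqrt(sizes)

# A ratio near 1 in every column suggests the sigma / sqrt(n) scaling holds
round(rbind(sizes, observed_se, predicted_se,
            ratio = observed_se / predicted_se), 4)
```

If the ratios all come out close to 1, the standard deviation of the sample means is shrinking with \(\sqrt{n}\) as expected, and multiplying it by \(\sqrt{n}\) simply recovers \(\sigma\).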