set.seed(seed = 17)
We know that when rolling a fair six-sided die, the probability of any given value being rolled is \(\frac{1}{6}\), and that the expected value of this uniform distribution is 3.5. In practice, however, rolling such a die six times does not guarantee that each number comes up exactly once. Over hundreds, thousands, or even millions of rolls, though, we would expect the observed proportions of each number to even out. This is what the Law of Large Numbers describes: as the sample size of an experiment increases, the sample mean converges on the population mean. In other words, for a random variable X with population mean \(\mu\) and sample mean \(\bar{X}_n\) taken over n observations, \(\lim_{n \to \infty}\left(\bar{X}_n - \mu\right) = 0\), where the limit is understood in probability (weak law) or almost surely (strong law).
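The convergence described above can be watched directly by tracking the running mean of simulated die rolls. This is a minimal sketch rather than part of the analysis below: the names `rolls`, `running_mean`, and `checkpoints` are hypothetical, and the seed is chosen only for reproducibility. Because it draws random numbers, it is best run in a separate session so that it does not shift the random number stream used by the later code.

```r
# Standalone sketch: running mean of simulated fair-die rolls.
set.seed(17)  # arbitrary seed, used only so the sketch is reproducible

# Simulate 100,000 rolls of a fair six-sided die
rolls <- sample(x = 1:6, size = 100000, replace = TRUE)

# Mean of the first 1, 2, 3, ... rolls
running_mean <- cumsum(rolls) / seq_along(rolls)

# Running mean after 10, 100, 1,000, 10,000, and 100,000 rolls
checkpoints <- c(10, 100, 1000, 10000, 100000)
running_mean[checkpoints]

# Distance from the expected value 3.5 after all rolls
abs(running_mean[100000] - 3.5)
```

The early checkpoints can sit noticeably away from 3.5, while the later ones should settle close to it.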
The Central Limit Theorem describes how, with a sufficiently large sample size, the sample mean \(\bar{X}\) of a random variable X can be modeled with a normal distribution, whatever the distribution of X itself (provided its variance is finite). A common rule of thumb puts the cutoff at a sample size of 30 or more. Once this is reached, we can take the mean of the sampling distribution to equal the population mean (\(\mu_{\bar{X}} = \mu\)) and its standard deviation, usually called the standard error, to be the ratio of the population standard deviation to the square root of the sample size (\(\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}\)). Using dice as the example, if we roll two six-sided dice instead of one, the sum has mean 7 and standard deviation \(\frac{\sqrt{210}}{6}\) (\(\approx 2.4152\)), and its distribution is triangular, already more peaked in the middle than the flat distribution of a single die. As more dice are added to each roll, the distribution of their sum (or their mean) looks more and more like a normal distribution.
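To make the dice example concrete, the sketch below averages 30 simulated dice at a time and compares the spread of those averages with \(\frac{\sigma}{\sqrt{n}}\), using \(\sigma = \sqrt{35/12}\) for a single fair die. The names `sigma_die`, `n_dice`, and `sample_means` are hypothetical, and the choice of 10,000 repetitions is an assumption made for illustration. As with any simulation, it is best run on its own so it does not disturb the seeded results below.

```r
# Standalone sketch: sample means of 30 dice versus the CLT prediction.
set.seed(17)  # arbitrary seed, used only so the sketch is reproducible

sigma_die <- sqrt(35 / 12)  # population standard deviation of one fair die
n_dice <- 30                # number of dice averaged in each sample

# Draw 10,000 samples of 30 dice and keep the mean of each sample
sample_means <- replicate(
  n = 10000,
  expr = mean(sample(x = 1:6, size = n_dice, replace = TRUE))
)

mean(sample_means)           # should be close to the population mean 3.5
sd(sample_means)             # observed spread of the sample means
sigma_die / sqrt(n_dice)     # spread predicted by the Central Limit Theorem
```

The last two values should be close to each other, while both are much smaller than `sigma_die` itself.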
(This website helped me understand the key differences between the LLN and the CLT: https://www.geeksforgeeks.org/maths/central-limit-theorem/)
Both the Law of Large Numbers and the Central Limit Theorem concern the behavior of the sample mean as the sample size grows. Where they differ is in what they describe: the Law of Large Numbers states only that the sample mean converges on the population mean, whereas the Central Limit Theorem describes the shape and spread of the sample mean's distribution around that value, which is approximately normal with standard deviation \(\frac{\sigma}{\sqrt{n}}\). For the Central Limit Theorem, the distribution of the underlying variable is often unknown and, as long as its variance is finite, does not change this result.
The hypergeometric distribution is a discrete probability distribution. Given a finite population in which a fixed number of elements have a desired trait (the successes), it describes the probability of drawing a given number of successes in a fixed number of draws made without replacement. A familiar use is finding the probability of drawing one or more specific cards from a deck within a number of draws, such as when trying to find the last card of a given suit to complete a flush in poker. The key aspects of this distribution are that the population size is fixed and that sampling is done without replacement, so the number of successes in the sample can never exceed the number of successes in the population.
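As a check on the description above, the probability of x successes can be written as \(\frac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}}\) and compared with R's built-in `dhyper()`. The sketch below uses the same parameters as the simulation that follows (5 successes, 35 failures, 8 draws); the variable names `successes`, `failures`, `draws`, `pmf`, `manual`, `theo_mean`, and `theo_sd` are hypothetical and chosen to avoid clashing with names used later.

```r
# Standalone sketch: hypergeometric probabilities and theoretical moments.
successes <- 5    # K: elements in the population with the desired trait
failures  <- 35   # N - K: elements without the trait
draws     <- 8    # n: number of draws without replacement
pop_size  <- successes + failures  # N = 40

# Possible numbers of successes in the sample: 0 up to min(K, n)
x <- 0:min(successes, draws)

# Probabilities from R's density function and from the counting formula
pmf    <- dhyper(x = x, m = successes, n = failures, k = draws)
manual <- choose(successes, x) * choose(failures, draws - x) /
  choose(pop_size, draws)
all.equal(pmf, manual)

# Theoretical mean and standard deviation of the distribution
theo_mean <- draws * successes / pop_size
theo_sd <- sqrt(draws * (successes / pop_size) * (failures / pop_size) *
                  (pop_size - draws) / (pop_size - 1))
c(mean = theo_mean, sd = theo_sd)
```

With these parameters the theoretical mean is 1 and the theoretical standard deviation is about 0.847, which gives a reference point for the simulated values below.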
my_hyper <- rhyper(nn = 10000, m = 5, n = 35, k = 8)
my_hyper[1:16]
## [1] 0 3 1 2 1 1 0 0 2 0 1 0 2 2 3 2
mu <- mean(my_hyper)
mu
## [1] 0.9959
sigma <- sd(my_hyper)
sigma
## [1] 0.8399725
library("psych")
describe(my_hyper)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 1 0.84 1 0.94 1.48 0 5 5 0.56 -0.06 0.01
hist(x = my_hyper, main = "Histogram of Hypergeometric Distribution (K = 5, N = 40, n = 8)", xlab = "")
z <- matrix(data = rep(x = 0, times = 10000), nrow = 10000, ncol = 1)
z[1:16]
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
describe(z)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 0 0 0 0 0 0 0 0 NaN NaN 0
for (i in 1:10000){
z[i,] <- mean(sample(x=my_hyper, size = 100, replace = TRUE))
}
z[1:16]
## [1] 0.95 0.90 1.08 0.87 0.96 1.19 0.86 0.95 1.02 1.10 0.93 0.99 1.05 1.04 1.02
## [16] 1.01
describe(z)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 10000 1 0.08 1 1 0.09 0.67 1.33 0.66 0.07 0.01 0
hist(z, xlab="", main="Histogram of Sample Mean (n = 100)")
z <- matrix(data = rep(x = 0,
times = 40000
),
nrow = 10000,
ncol = 4)
n <- c(2, 6, 30, 1000)
for (j in 1:4){
for (i in 1:10000){
z[i,j] <- mean(sample( x = my_hyper,
size = n[j],
replace = TRUE
)
)
}
}
colnames(z) <- c("Sample size=2", "Sample size=6", "Sample size=30", "Sample size=1000")
summary(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## Min. :0.0000 Min. :0.0000 Min. :0.4667 Min. :0.8930
## 1st Qu.:0.5000 1st Qu.:0.8333 1st Qu.:0.9000 1st Qu.:0.9780
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.9960
## Mean :0.9867 Mean :0.9952 Mean :0.9943 Mean :0.9957
## 3rd Qu.:1.5000 3rd Qu.:1.1667 3rd Qu.:1.1000 3rd Qu.:1.0140
## Max. :3.5000 Max. :2.5000 Max. :1.5667 Max. :1.0890
par(mfrow=c(3,2))
length(my_hyper)
## [1] 10000
hist(x = my_hyper, main = "Histogram of Hypergeometric Distribution (K = 5, N = 40, n = 8)", xlab = "")
for (k in 1:4){
hist(x = z[,k],
main = "Histogram of Sample Mean of Hypergeometric Distribution",
xlim = range(z), # the sample means lie between 0 and 3.5, so c(10, 20) would leave every panel empty
xlab = paste0("Sample Size ", n[k], " (Column ", k, " from matrix)")
)
}
z[1:16]
## [1] 1.5 0.0 0.5 0.5 1.0 0.0 1.5 2.5 1.0 0.5 1.5 0.5 1.0 1.5 1.5 0.0
mu
## [1] 0.9959
colMeans(z)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 0.9866500 0.9951833 0.9942533 0.9957202
sigma
## [1] 0.8399725
apply(X = z, MARGIN = 2, FUN = sd)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 0.58824432 0.34570761 0.15205320 0.02649662
apply(X = z,
MARGIN = 2,
FUN = sd) * c(sqrt(2),
sqrt(6),
sqrt(30),
sqrt(1000)
)
## Sample size=2 Sample size=6 Sample size=30 Sample size=1000
## 0.8319031 0.8468072 0.8328297 0.8378968
sigma
## [1] 0.8399725
c(mu, colMeans(z))
## Sample size=2 Sample size=6 Sample size=30
## 0.9959000 0.9866500 0.9951833 0.9942533
## Sample size=1000
## 0.9957202
c(sigma, apply(X = z, MARGIN = 2, FUN = sd) * c(sqrt(2), sqrt(6), sqrt(30), sqrt(1000)))
## Sample size=2 Sample size=6 Sample size=30
## 0.8399725 0.8319031 0.8468072 0.8328297
## Sample size=1000
## 0.8378968
In the case of the mean, yes. The difference between the mean of the sample means at a sample size of 1000 (0.9957202) and the population mean (0.9959) is about 0.00018. Rounded to three decimal places, the two are identical at 0.996.
For the standard deviation, the Central Limit Theorem also appears to hold, although the tables need careful reading. The standard deviations of the sample means (about 0.588, 0.346, 0.152, and 0.026 for sample sizes 2, 6, 30, and 1000) shrink as the sample size grows, which is what the theorem predicts. The values that look as though they are converging on the population standard deviation are those same standard deviations multiplied by \(\sqrt{n}\). Since the theorem predicts a standard deviation of \(\frac{\sigma}{\sqrt{n}}\) for the sample mean, multiplying by \(\sqrt{n}\) should give back \(\sigma \approx 0.84\), and it does at every sample size (0.832, 0.847, 0.833, and 0.838). I look forward to hearing from my fellow students on whether this reading matches theirs.
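One way to probe the standard-deviation question is to put the observed spread of the sample means next to the value \(\frac{\sigma}{\sqrt{n}}\) that the Central Limit Theorem predicts. The sketch below is self-contained and regenerates its own data with the same `rhyper()` parameters used earlier; the names `population`, `sigma_pop`, `sample_sizes`, `sd_of_means`, and `comparison` are hypothetical and not part of the analysis above.

```r
# Standalone sketch: observed versus predicted spread of the sample means.
set.seed(17)  # arbitrary seed, used only so the sketch is reproducible

# Simulated population of hypergeometric draws (same parameters as above)
population <- rhyper(nn = 10000, m = 5, n = 35, k = 8)
sigma_pop <- sd(population)

sample_sizes <- c(2, 6, 30, 1000)

# For each sample size, take 10,000 resamples and record the sd of their means
sd_of_means <- sapply(sample_sizes, function(size) {
  sd(replicate(
    n = 10000,
    expr = mean(sample(x = population, size = size, replace = TRUE))
  ))
})

# Side-by-side comparison with the CLT prediction sigma / sqrt(n)
comparison <- data.frame(
  sample_size  = sample_sizes,
  observed_sd  = sd_of_means,
  predicted_sd = sigma_pop / sqrt(sample_sizes)
)
comparison$ratio <- comparison$observed_sd / comparison$predicted_sd
comparison
```

If the `ratio` column stays close to 1 at every sample size, the spread of the sample means is shrinking at the \(\frac{1}{\sqrt{n}}\) rate the theorem describes.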