set.seed(seed = 407)
rm(list = ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 618117 33.1 1415010 75.6 702077 37.5
## Vcells 1153710 8.9 8388608 64.0 1927822 14.8
cat("\f")
The Law of Large Numbers states that, over a large number of independent trials, the observed frequency of an outcome will eventually settle at a stable value approximating its true probability. For example, flipping a coin ten times and recording the percentage of heads may not yield exactly 50%, but if one flips a coin ten thousand times, the observed percentage will be almost exactly 50%.
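As a quick illustration, the following sketch simulates ten thousand fair-coin flips and tracks the running proportion of heads, which settles near the true probability of 0.5:
# Minimal sketch of the LLN: the running proportion of heads drifts toward 0.5
flips <- rbinom(10000, size = 1, prob = 0.5) # 10,000 fair coin flips
running_prop <- cumsum(flips) / seq_along(flips)
plot(running_prop, type = "l", xlab = "Number of Flips", ylab = "Proportion of Heads",
main = "Running Proportion of Heads")
abline(h = 0.5, lty = 2) # the true probability of heads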
The Central Limit Theorem (CLT) conveys a similar idea to the Law of Large Numbers, but is more specific. The CLT states that with a sufficiently large sample of independent observations, the point estimate from the sample approximately follows a normal distribution centered at the expected value \(\mu\). For a proportion, a general rule is that the sample size \(n\) is large enough when \(np \geq 10\) and \(n(1-p) \geq 10\) (the success-failure condition).
The main difference lies in the specificity of the two ideas. The LLN speaks to the general phenomenon of true probabilities emerging after many trials, while the CLT quantifies this phenomenon in the context of sampling by describing the shape and spread of the sampling distribution. In practice, the CLT allows a researcher to quantify the reliability of their findings.
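To see the CLT in action before turning to the spam example, here is a minimal sketch (the sample size n = 100 and proportion p = 0.3 below are arbitrary illustrative values): sample proportions from repeated samples pile up in a roughly bell-shaped histogram centered at the true proportion.
# Minimal CLT sketch: 5,000 sample proportions from samples of size 100, p = 0.3
demo_props <- rbinom(5000, size = 100, prob = 0.3) / 100
hist(demo_props, breaks = 25, xlab = "Sample Proportion",
main = "Sampling Distribution of a Sample Proportion (n=100, p=0.3)")
abline(v = 0.3, lty = 2) # centered near the true proportion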
The following example, adapted from Statology, illustrates the binomial distribution. We will look at the number of emails received in a day that would be considered spam, assuming that, broadly speaking, spam makes up about 4% of daily email. We can model this example with several sample sizes.
p <- 0.04 #Probability of any given email being spam
n1 <- 20 #Small Sample Size
n2 <- 100 #Medium Sample Size
n3 <- 300 #Large Sample Size
x <- (0:40)
plot(x, dbinom(x, n1, p), type = "h", lwd = 5, ylab = "Probability", xlab = "# of Spam Emails",
main = "Distribution of Daily Spam Emails (n=20)")
plot(x, dbinom(x, n2, p), type = "h", lwd = 5, ylab = "Probability", xlab = "# of Spam Emails",
main = "Distribution of Daily Spam Emails (n=100)")
plot(x, dbinom(x, n3, p), type = "h", lwd = 5, ylab = "Probability", xlab = "# of Spam Emails",
main = "Distribution of Daily Spam Emails (n=200)")
Let's first check each of the three cases above to see whether the Central Limit Theorem applies:
#Small Sample Size (n=20)
(n1*p >= 10) & (n1*(1-p) >= 10)
## [1] FALSE
#Medium Sample Size (n=100)
(n2*p >= 10) & (n2*(1-p) >= 10)
## [1] FALSE
#Large Sample Size (n=300)
(n3*p >= 10) & (n3*(1-p) >= 10)
## [1] TRUE
Only the largest sample (n=300) meets the success-failure condition of the CLT, so we will analyze that one. With that condition met, we can say that the distribution of daily spam counts will be approximately normal. Knowing the true value \(p = 0.04\), we expect the mean \(\mu\) and standard error \(SE\) to be:
\(\mu = np = 300 \times 0.04 = 12\)
\(SE = \sqrt{np(1-p)} = \sqrt{300 \times 0.04 \times 0.96} \approx 3.394\)
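We can confirm this arithmetic directly in R (the variables mu and se below are just convenience names for these quantities):
mu <- n3 * p #expected number of daily spam emails: 12
se <- sqrt(n3 * p * (1 - p)) #standard error of the daily count: ~3.394
mu
se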
Given that the example meets the conditions for the CLT, we can use the parameters calculated above to plot a normal distribution approximating the daily number of spam emails.
x <- (0:25)
plot(x, dnorm(x, mean = 12, sd = 3.394), type = 'b', ylab = "Probability", xlab = "# of Spam Emails",
main = "Normal Approximation of Daily Spam Emails (n=300)" )
Having constructed a normal approximation using the Central Limit Theorem, let's randomly generate a sample year and see how it compares.
sample_data <- rbinom(365, n3, p) #simulate the daily spam count for each day of a year
hist(sample_data)
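For a more direct visual comparison, we can re-plot the histogram on a density scale and overlay the normal curve from above (a small sketch reusing the same \(\mu\) and \(SE\)):
#Density histogram of the simulated year with the normal approximation overlaid
hist(sample_data, freq = FALSE, xlab = "# of Spam Emails",
main = "Simulated Year of Daily Spam Emails (n=300)")
curve(dnorm(x, mean = 12, sd = 3.394), add = TRUE, lwd = 2)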
Though not perfect, the histogram does roughly resemble the normal distribution. Let's check the sample mean and see how it compares.
mean(sample_data) #check the sample mean
## [1] 12.0137
The sample mean is well within one standard error of the projected mean, suggesting that the approximation is a good one. In real life we would never know the true population mean, but in that case we could use confidence intervals to draw conclusions. Either way, the Central Limit Theorem is an invaluable tool for quantifying the reliability of results.
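To make "well within one standard error" concrete, we can express the distance between the sample mean and the projected mean in units of the standard error calculated earlier:
#Distance of the sample mean from the projected mean, in standard errors
abs(mean(sample_data) - 12) / 3.394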