The Central Limit Theorem

2024-09-22

What is the Central Limit Theorem?

The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population’s distribution, as long as the population has a finite variance.
The CLT applies even if the original data are skewed or otherwise non-normal.

Why is the CLT important?

The Central Limit Theorem is important because it allows one to make inferences about population parameters based on samples using normal probability models even when the population distribution isn’t normal.
It also helps one construct confidence intervals and perform hypothesis tests.

Equations and Math

Standard Deviation of the Sample Mean (Standard Error): \(SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}\)
Sample Mean Standardized: \(Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\)
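If \(X_1, X_2, \ldots, X_n\) are independent and identically distributed random variables with mean \(\mu\) and finite variance \(\sigma^2\), then as \(n\) grows the distribution of the standardized sample mean \(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\) approaches the standard normal distribution \(N(0, 1)\). Equivalently, \(\bar{X}\) is approximately \(N(\mu, \sigma^2/n)\) for large \(n\).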
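The standardization above can be checked numerically. The following is a minimal sketch, not part of the original slides: it assumes an Exponential(rate = 1) population (so \(\mu = 1\) and \(\sigma = 1\)), and the sample size `n`, the replicate count `reps`, and the seed are arbitrary illustrative choices.

```r
# Sketch: standardize sample means drawn from a skewed population
# and compare them with a standard normal distribution.
# Assumption: Exponential(rate = 1) population, so mu = 1 and sigma = 1.
set.seed(1)
n <- 50        # size of each sample (illustrative choice)
reps <- 5000   # number of simulated samples (illustrative choice)
mu <- 1
sigma <- 1

# Sample mean of each simulated sample
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

# Standardized sample means: Z = (xbar - mu) / (sigma / sqrt(n))
z <- (xbar - mu) / (sigma / sqrt(n))

# If the normal approximation is reasonable, these are near 0 and 1
print(c(mean = mean(z), sd = sd(z)))

# Share of standardized means within +/- 1.96; a standard normal gives about 0.95
print(mean(abs(z) < 1.96))
```

Increasing `n` should bring the standardized means closer to \(N(0, 1)\). With a skewed population such as this one, small values of `n` tend to leave visible asymmetry in the tails.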

Example of CLT

One example of the CLT is rolling a die. A fair die has a uniform population distribution over the values \(1, 2, 3, 4, 5, 6\), so we can compute the sample mean of many dice rolls and see that the distribution of the sample mean is approximately normal, even though the individual rolls are not.
Population Mean and Variance: \(\mu = \frac{1+2+3+4+5+6}{6} = 3.5\) and \(\sigma^2 = \frac{(1-3.5)^2 + (2-3.5)^2 + \cdots + (6-3.5)^2}{6} = \frac{35}{12} \approx 2.92\)
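The population values can be reproduced in R and combined with the standard error formula. This is a small sketch under stated assumptions: the sample size `n = 100` matches the simulation used later, while the observed mean `xbar_obs = 3.8` is a hypothetical value chosen only to illustrate the standardization.

```r
# Population mean and variance of a fair six-sided die
faces <- 1:6
mu <- mean(faces)                 # 3.5
sigma2 <- mean((faces - mu)^2)    # 35/12, about 2.92 (divides by 6, not 5)

# Standard error of the mean for n rolls
n <- 100                          # assumed sample size
se <- sqrt(sigma2 / n)            # about 0.171

# Z-score of a hypothetical observed sample mean
xbar_obs <- 3.8                   # hypothetical value, for illustration only
z <- (xbar_obs - mu) / se         # about 1.76

print(c(mu = mu, sigma2 = sigma2, se = se, z = z))
```

Note that `var(faces)` is not used here, because R's `var()` divides by \(n - 1\) and so returns the sample variance rather than the population variance.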

Simulating the Dice Example in ggplot

Code for Simulating the Dice Example in ggplot

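library(ggplot2)

set.seed(42)  # make the simulation reproducible
# 1000 samples of 100 rolls each; keep the mean of every sample
sample_means <- replicate(1000, mean(sample(1:6, 100, replace = TRUE)))
ggplot(data.frame(x = sample_means), aes(x)) +
  geom_histogram(binwidth = 0.1, fill = "magenta", alpha = 0.6) +
  labs(title = "Distribution of Sample Means",
       x = "Sample Mean", y = "Frequency")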

Comparing Different Sample Sizes in the Dice Example

Code for Comparing Different Sample Sizes in the Dice Example

# Same simulation with only 10 rolls per sample, for comparison with
# sample_means (n = 100) from the previous simulation
small_sample_means <- replicate(1000, mean(sample(1:6, 10, replace = TRUE)))
comparison <- data.frame(
  mean = c(small_sample_means, sample_means),
  sample_size = factor(rep(c("n = 10", "n = 100"), each = 1000))
)

ggplot(comparison, aes(x = mean, fill = sample_size)) +
  geom_histogram(position = "identity", alpha = 0.6, binwidth = 0.2) +
  facet_wrap(~sample_size) +
  labs(title = "Comparison of Sample Mean Distributions",
       x = "Sample Mean", y = "Frequency")

Line Graph of Sample Means for Various Sample Sizes in Plotly

Code for Line Graph of Sample Means for Various Sample Sizes in Plotly

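library(plotly)

set.seed(42)  # set the seed before any random draws so the results are reproducible

sample_sizes <- seq(10, 1000, by = 10)
# One column per sample size, each holding 1000 simulated sample means
means <- sapply(sample_sizes, function(size) {
  replicate(1000, mean(sample(1:6, size, replace = TRUE)))
})
mom <- colMeans(means)  # mean of the sample means for each sample size

plot_ly(x = ~sample_sizes, y = ~mom, type = 'scatter',
        mode = 'lines') %>%
  layout(title = "Mean of Sample Means with Various Sample Sizes",
         xaxis = list(title = "Sample Size"),
         yaxis = list(title = "Mean of Sample Means"))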