2025-03-17

Introduction to the Central Limit Theorem

  • The Central Limit Theorem (CLT) is a key concept in statistics.
  • It states that, given a sufficiently large sample size, the sampling distribution of the sample mean will be approximately normal.
  • This property allows us to use the normal distribution to approximate probabilities involving the sample mean.

Conditions for CLT

  • Conditions to use CLT:

    1) Samples are random and independent.

    2) The sample size (n) is sufficiently large (a common rule of thumb is n ≥ 30; strongly skewed populations may need more).

CLT Formula

\[ \bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n} \right) \]

  • Where:
    • \(\mu\) is the population mean
    • \(\sigma^2\) is the population variance
    • \(n\) is the sample size
  • Essentially, the sample mean will be approximately normally distributed, centered at \(\mu\) with standard deviation (standard error) \(\sigma / \sqrt{n}\).
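
  • As a rough worked illustration of the formula, the sketch below assumes an Exponential population with rate 1 (so \(\mu = 1\) and \(\sigma^2 = 1\)) and \(n = 100\). These values and the 1.2 threshold are illustrative choices, not part of the slides. It compares the CLT approximation of \(P(\bar{X}_n > 1.2)\) with a simulated estimate.

```r
# Assumed population: Exponential(rate = 1), so mu = 1 and sigma = 1.
mu <- 1
sigma <- 1
n <- 100

# Standard error of the sample mean: sigma / sqrt(n)
se <- sigma / sqrt(n)

# CLT approximation of P(sample mean > 1.2), using N(mu, sigma^2 / n)
p_clt <- pnorm(1.2, mean = mu, sd = se, lower.tail = FALSE)

# Simulation check: share of 10,000 simulated sample means above 1.2
set.seed(1)
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))
p_sim <- mean(sample_means > 1.2)

print(c(clt = p_clt, simulated = p_sim))
```

  • The CLT value is about 0.023. The simulated value should be close but not identical, since the normal model is only an approximation for a skewed population.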

The Normal Distribution

  • Defined by mean \(\mu\) and standard deviation \(\sigma\).
  • Probability density function:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

  • Approximate share of values within a given number of standard deviations of the mean (the empirical rule):
    • About \(68\%\) within \(\mu \pm \sigma\)
    • About \(95\%\) within \(\mu \pm 2\sigma\)
    • About \(99.7\%\) within \(\mu \pm 3\sigma\)
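
  • These percentages can be checked with the normal cumulative distribution function. The short sketch below uses the standard normal, since the coverage within \(k\) standard deviations does not depend on \(\mu\) or \(\sigma\).

```r
# Probability that a normal variable falls within k standard deviations of its mean.
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)

print(round(coverage, 4))
# 0.6827 0.9545 0.9973
```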

Visualizing CLT

  • Let’s start with a non-normal population distribution. Below is an Exponential Distribution.
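
  • One possible way to draw such a population curve is sketched here. The rate of 1, the plotting range, and the object names `pop` and `p_pop` are assumptions, since the slides do not say how the original figure was produced.

```r
library(ggplot2)

# Assumption: Exponential population with rate = 1 (not stated in the slides).
pop <- data.frame(x = seq(0, 6, by = 0.01))
pop$density <- dexp(pop$x, rate = 1)

# Strongly right-skewed curve: highest at 0 and decaying towards the right.
p_pop <- ggplot(pop, aes(x = x, y = density)) +
  geom_line() +
  labs(title = "Exponential Population (rate = 1)",
       x = "Value",
       y = "Density")
print(p_pop)
```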

Visualizing CLT 2

  • Now, we’ll take repeated random samples from this non-normal distribution and compute the mean of each sample. Notice how, as the sample size increases, the distribution of the sample means looks more and more normal.

  • First, the sample means from samples of size n = 5.

  • Next, the sample means from samples of size n = 100.
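
  • The slides do not show how `df5` and `df100` were built. A minimal sketch, assuming an Exponential population with rate 1 and 1,000 repeated samples per sample size (both are assumptions, as is the helper name `sample_means`), might look like this:

```r
# Assumptions: Exponential(rate = 1) population and 1,000 repeated samples
# per sample size; neither value is given in the slides.
set.seed(123)
n_reps <- 1000

# Draw one sample of size n, take its mean, and repeat this reps times.
sample_means <- function(n, reps = n_reps) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

# One sample mean per row, in a column named `mean`.
df5 <- data.frame(mean = sample_means(5))
df100 <- data.frame(mean = sample_means(100))

# Standard deviation of the sample means for each sample size.
print(sapply(list(n5 = df5$mean, n100 = df100$mean), sd))
```

  • Both sets of means should centre near the population mean of 1, with the n = 100 means much less spread out than the n = 5 means.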

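library(ggplot2)

# df5 and df100 are assumed to hold one sample mean per row in a column named `mean`.

# Histogram of sample means for samples of size n = 5
ggplot(df5, aes(x = mean)) +
  geom_histogram(bins = 30) +
  labs(title = "Sample Size 5 (< 30)",
       x = "Sample Mean",
       y = "Frequency")

# Histogram of sample means for samples of size n = 100
ggplot(df100, aes(x = mean)) +
  geom_histogram(bins = 30) +
  labs(title = "Sample Size 100 (>= 30)",
       x = "Sample Mean",
       y = "Frequency")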

Density vs Sample Size

  • Next, we’ll look at a plot showing how the distribution of the sample mean becomes narrower and more concentrated around \(\mu\) as the sample size (n) increases.

Density Plot: Sample Mean Distributions for Different Sample Sizes
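
  • A plot of this kind could be produced roughly as follows. The Exponential(rate = 1) population, the 1,000 repetitions, the sample sizes 5, 30, and 100, and the names `dens_df` and `p_dens` are all assumptions chosen for illustration.

```r
library(ggplot2)

# Assumptions: Exponential(rate = 1) population, 1,000 repetitions, and the
# sample sizes below; none of these are specified in the slides.
set.seed(2025)
sizes <- c(5, 30, 100)

# One block of 1,000 simulated sample means per sample size, stacked row-wise.
dens_df <- do.call(rbind, lapply(sizes, function(n) {
  data.frame(n = factor(n), mean = replicate(1000, mean(rexp(n, rate = 1))))
}))

# Overlaid density curves, one per sample size.
p_dens <- ggplot(dens_df, aes(x = mean, colour = n)) +
  geom_density() +
  labs(title = "Sample Mean Distributions for Different Sample Sizes",
       x = "Sample Mean",
       y = "Density",
       colour = "Sample size (n)")
print(p_dens)
```

  • Larger sample sizes should give taller, narrower curves centred near 1, reflecting the standard error \(\sigma / \sqrt{n}\) shrinking as n grows.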

Summary

  • The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches normality as sample size increases, regardless of the original population’s shape (provided its variance is finite).
  • The CLT’s main use is to justify normal probability models for sample means, even when starting with non-normal populations.