Central Limit Theorem

2025-03-17

Introduction to the Central Limit Theorem

The Central Limit Theorem (CLT) is a key concept in statistics.
It states that, given a sufficiently large sample size, the sampling distribution of the sample mean will be approximately normal.
This property allows us to use the normal distribution to approximate probabilities for sample means, even when the population itself is not normal.

Conditions for CLT

Conditions to use CLT:

1) Samples are random and independent.

2) The sample size (n) is sufficiently large (a common rule of thumb is n ≥ 30).

CLT Formula

\[ \bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n} \right) \]

Where:
- \(\mu\) is the population mean
- \(\sigma^2\) is the population variance
- \(n\) is the sample size

Essentially, the sample mean will be approximately normally distributed with mean \(\mu\) and standard deviation \(\sigma / \sqrt{n}\) (the standard error).
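
As a worked illustration of the formula, the sketch below approximates the probability that a sample mean exceeds a threshold. The values \(\mu = 50\), \(\sigma = 10\), \(n = 25\) and the threshold 52 are made up for this example and do not come from the slides.

```r
# Hypothetical population parameters, chosen only for illustration
mu    <- 50   # population mean
sigma <- 10   # population standard deviation
n     <- 25   # sample size

# Standard error of the sample mean: sigma / sqrt(n)
se <- sigma / sqrt(n)

# P(sample mean > 52) under the approximation N(mu, sigma^2 / n)
p_exceeds <- pnorm(52, mean = mu, sd = se, lower.tail = FALSE)
round(p_exceeds, 4)
```

Here the standard error is 2, so 52 sits one standard error above the mean and the probability comes out at roughly 0.159.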

The Normal Distribution

The normal distribution is defined by its mean \(\mu\) and standard deviation \(\sigma\).
Its probability density function is:

\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

Approximate share of values within one, two, and three standard deviations of the mean (the empirical rule):
- \(68\%\) within \(\mu \pm \sigma\)
- \(95\%\) within \(\mu \pm 2\sigma\)
- \(99.7\%\) within \(\mu \pm 3\sigma\)
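
These percentages can be checked numerically with the normal CDF. The helper name `within_k_sd` below is introduced only for this sketch.

```r
# Probability that a normal variable lies within k standard deviations of its mean.
# The result does not depend on mu or sigma, so the standard normal is used.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

coverage <- sapply(1:3, within_k_sd)
round(coverage, 4)
```

The three values are about 0.6827, 0.9545 and 0.9973, which round to the familiar 68%, 95% and 99.7%.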

Visualizing CLT

Let’s start with a non-normal population. Below is an exponential distribution, which is strongly right-skewed.
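
One way such a population could be simulated and plotted is sketched here. The rate of 1, the 10,000 draws, the seed, and the names `population` and `p_pop` are assumptions for illustration, not values taken from the original slides.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, for reproducibility only

# Assumed population: 10,000 draws from an exponential distribution with rate 1
population <- data.frame(x = rexp(10000, rate = 1))

# Right-skewed histogram: most values near 0, with a long right tail
p_pop <- ggplot(population, aes(x = x)) +
  geom_histogram(bins = 50) +
  labs(title = "Exponential Population (rate = 1)",
       x = "Value",
       y = "Frequency")
p_pop
```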

Visualizing CLT 2

Now, we’ll take repeated random samples from this non-normal distribution and compute the mean of each. Notice how, as the sample size increases, the distribution of the sample means looks more and more normal.
The first histogram uses samples of size n = 5.
The second uses samples of size n = 100.
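
The plotting code in this section expects data frames named `df5` and `df100`, each with a `mean` column. The slides do not show how they were built, so the following is only a sketch of one possibility, assuming 1,000 replicates per sample size and an exponential population with rate 1. The helper `sample_means` is hypothetical.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Draw `reps` samples of size n from an exponential population (rate = 1 assumed)
# and return their means in a data frame with a column named `mean`.
sample_means <- function(n, reps = 1000, rate = 1) {
  data.frame(mean = replicate(reps, mean(rexp(n, rate = rate))))
}

df5   <- sample_means(5)
df100 <- sample_means(100)

# The spread of the sample means should be close to sigma / sqrt(n),
# where sigma = 1 / rate = 1 for this population.
c(sd_n5 = sd(df5$mean), sd_n100 = sd(df100$mean))
```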

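library(ggplot2)

# df5 and df100 are data frames with one row per sample and a `mean` column
# holding that sample's mean (sample sizes 5 and 100, respectively).

# Sample means for n = 5: still visibly right-skewed
ggplot(df5, aes(x = mean)) +
  geom_histogram(bins = 30) +
  labs(title = "Sample Size 5 (< 30)",
       x = "Sample Mean",
       y = "Frequency")

# Sample means for n = 100: close to a symmetric bell shape
ggplot(df100, aes(x = mean)) +
  geom_histogram(bins = 30) +
  labs(title = "Sample Size 100 (>= 30)",
       x = "Sample Mean",
       y = "Frequency")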

Density vs Sample Size

Next, we’ll look at a plot that shows how the distribution of the sample mean becomes narrower and more concentrated around the population mean as the sample size (n) increases.

Density Plot: Sample Mean Distributions for Different Sample Sizes
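
A density plot of this kind could be produced along the following lines. The sample sizes 5, 30 and 100, the 1,000 replicates, the exponential population with rate 1, and the names `means_df` and `p_density` are illustrative assumptions rather than the settings used for the original figure.

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, for reproducibility only

sizes <- c(5, 30, 100)  # sample sizes to compare (illustrative choice)
reps  <- 1000           # number of sample means per sample size (assumed)

# One row per simulated sample mean, labelled with its sample size
means_df <- do.call(rbind, lapply(sizes, function(n) {
  data.frame(n = n,
             mean = replicate(reps, mean(rexp(n, rate = 1))))
}))
means_df$n <- factor(means_df$n)  # treat sample size as a category for plotting

# Overlaid densities: larger n gives a narrower, taller, more symmetric curve
p_density <- ggplot(means_df, aes(x = mean, fill = n)) +
  geom_density(alpha = 0.4) +
  labs(title = "Sample Mean Distributions for Different Sample Sizes",
       x = "Sample Mean",
       y = "Density",
       fill = "Sample size (n)")
p_density
```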

Summary

The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches normality as sample size increases, regardless of the original population’s shape (provided the population variance is finite).
The CLT’s main use is that it lets us apply normal probability models to sample means, even when starting with non-normal populations.