Background

In a recent presentation, some customers asked why so many of our assumptions about data depend on it being normally distributed. They correctly pointed out that not all data is normally distributed.

The Central Limit Theorem helps us overcome this problem. It says that even if the original data is not normal, the means of repeated random samples from that data will be approximately normally distributed, as long as each sample is reasonably large.

This allows us to use techniques and models that work well with normal data, such as Z-Scores, T-Scores and P-Values, to describe and analyze the sample mean.
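To make that concrete, here is a minimal sketch in base R of a Z-Score and P-Value for a sample mean. Every number in it (a population mean of 50, a standard deviation of 10, a sample of 40 and a sample mean of 52) is made up for illustration and does not come from the customer data.

```r
# Hypothetical question: is a sample mean of 52 unusual if the population
# mean is 50 and the population standard deviation is 10?
pop_mean <- 50    # assumed population mean
pop_sd   <- 10    # assumed population standard deviation
n        <- 40    # assumed sample size
x_bar    <- 52    # made-up observed sample mean

# Standard error: the spread of the sample mean predicted by the CLT
std_error <- pop_sd / sqrt(n)

# Z-Score: how many standard errors the sample mean is from the population mean
z <- (x_bar - pop_mean) / std_error

# Two-sided P-Value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(c(z = z, p_value = p_value))
```

The calculation only uses the normal distribution for the sample mean, not for the individual observations, which is exactly what the Central Limit Theorem allows.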

This may sound complicated, but it is easier to understand with some examples. Let’s look at some data that is not normal and see how the Central Limit Theorem works by taking random samples from it.

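Right Skewed Distribution

# Packages used throughout: tidyverse (tibble, dplyr, ggplot2) and gridExtra
library(tidyverse)
library(gridExtra)

# Y is a convex function of evenly spaced X values, so most values sit at the
# low end and a long tail stretches to the right
ActualDF = tibble(X = seq(1, 10, by = 0.01)) %>%
  mutate(Y = X^2 - 1 * X + 100)

ActualPlot = ActualDF %>%
  ggplot() +
  geom_histogram(aes(Y), bins = 30, fill = "dark blue") +
  theme_classic() +
  theme(plot.title = element_text(size=11)) +
  labs(title = "Right Skewed Distribution - Actual Values",
       x = "", y = "Count")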

ActualDF_Sample = tibble(
  S = sapply(1:10000, function(i) mean(sample(ActualDF$Y, size = 50, replace = TRUE)))
)

SamplePlot = ActualDF_Sample %>%
  ggplot() +
  geom_histogram(aes(S), bins = 30, fill = "pink") +
  theme_classic() +
  theme(plot.title = element_text(size=11)) +
  labs(title = "10K Sample Means from Skewed Dist.",
       x = "", y = "Count")

grid.arrange(ActualPlot, SamplePlot, ncol=2,
             top="CLT - Simulation")

After taking the means of 10,000 random samples of size 50 from our skewed distribution, we end up with an approximately normal distribution of sample means, shown in pink on the right.
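The histogram is a visual check. As a numeric check, the sketch below rebuilds the same Y values in base R and compares the simulation with what the Central Limit Theorem predicts: the sample means should center on the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size. The seed and the variable names are our own choices for this illustration.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Rebuild the same Y values as the example above, in base R
X <- seq(1, 10, by = 0.01)
Y <- X^2 - X + 100

n <- 50
sample_means <- replicate(10000, mean(sample(Y, size = n, replace = TRUE)))

# Y is the whole population here, so divide by N rather than N - 1
pop_sd <- sqrt(mean((Y - mean(Y))^2))

comparison <- c(
  population_mean = mean(Y),
  mean_of_means   = mean(sample_means),
  predicted_se    = pop_sd / sqrt(n),   # CLT prediction
  observed_se     = sd(sample_means)    # what the simulation produced
)
print(round(comparison, 2))
```

If the two means and the two standard errors land close to each other, the simulation is behaving the way the theorem says it should.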

Left Skewed Distribution

# Same kind of curve, then flipped (max(Y)*1.5 - Y) so that most values sit
# at the high end and the long tail stretches to the left
ActualDF = tibble(X = seq(1, 10, by = 0.01)) %>%
  mutate(Y = X^2 - 1 * X + 20,
         Y = max(Y)*1.5 - Y)

ActualPlot = ActualDF %>%
  ggplot() +
  geom_histogram(aes(Y), bins = 30, fill = "dark blue") +
  theme_classic() +
  theme(plot.title = element_text(size=11)) +
  labs(title = "Left Skewed Distribution - Actual Values",
       x = "", y = "Count")

ActualDF_Sample = tibble(
  S = sapply(1:10000, function(i) mean(sample(ActualDF$Y, size = 50, replace = TRUE)))
)

SamplePlot = ActualDF_Sample %>%
  ggplot() +
  geom_histogram(aes(S), bins = 30, fill = "pink") +
  theme_classic() +
  theme(plot.title = element_text(size=11)) +
  labs(title = "10K Sample Means from Skewed Dist.",
       x = "", y = "Count")

grid.arrange(ActualPlot, SamplePlot, ncol=2,
             top="CLT - Simulation")

The same thing happens here: the means of 10,000 random samples of size 50 from this skewed distribution form an approximately normal distribution, shown in pink on the right.

Bimodal Distribution

# Two normal clusters of 500 values each, centered at 100 and 150
ActualDF = bind_rows(
  tibble(Y = rnorm(500, mean = 100, sd = 15)),
  tibble(Y = rnorm(500, mean = 150, sd = 15))
)

ActualPlot = ActualDF %>%
  ggplot() +
  geom_histogram(aes(Y), bins = 30, fill = "dark blue") +
  theme_classic() +
  theme(plot.title = element_text(size=11)) +
  labs(title = "Bimodal Distribution - Actual Values",
       x = "", y = "Count")

ActualDF_Sample = tibble(
  S = sapply(1:10000, function(i) mean(sample(ActualDF$Y, size = 50, replace = TRUE)))
)

SamplePlot = ActualDF_Sample %>%
  ggplot() +
  geom_histogram(aes(S), bins = 30, fill = "pink") +
  theme_classic() +
  theme(plot.title = element_text(size=11)) +
  labs(title = "10K Sample Means from Bimodal Dist.",
       x = "", y = "Count")


grid.arrange(ActualPlot, SamplePlot, ncol=2,
             top="CLT - Simulation")

Finally, after taking the means of 10,000 random samples of size 50 from our bimodal distribution, we again end up with an approximately normal distribution of sample means, shown in pink on the right.

From all of these examples you can see that, given a large enough sample, we can apply the assumptions and statistical methods of normal distributions to the sample mean of nearly any actual distribution.
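One caveat worth checking for yourself is sample size. The sketch below, in base R, reuses the skewed Y values from the first example and measures the skewness of the sample means for samples of 2, 10 and 50. The skewness helper, the seed and the three sample sizes are our own additions for illustration. A skewness near 0 indicates a symmetric, bell-like shape.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

# Same skewed Y values as the first example
X <- seq(1, 10, by = 0.01)
Y <- X^2 - X + 100

# Skewness as the third standardized moment (0 for a symmetric distribution)
skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

sizes <- c(2, 10, 50)

# For each sample size, simulate 10,000 sample means and measure their skewness
skew_by_n <- sapply(sizes, function(n) {
  means <- replicate(10000, mean(sample(Y, size = n, replace = TRUE)))
  skewness(means)
})
names(skew_by_n) <- paste0("n=", sizes)

print(round(skew_by_n, 3))
```

The skewness should shrink toward 0 as the sample size grows, which is why very small samples from a heavily skewed source deserve extra care.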

Keep in mind that these methods are useful when you:

  1. Have a limited amount of the total data and need to draw conclusions about the data that you don’t have.

  2. Have all of the data to date but need to draw conclusions about the data that you will collect in the future.
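As a sketch of the first case, suppose we only hold 50 observations from a much larger, skewed data set. The example below is a hypothetical setup in base R: the exponential "population", its mean of about 20 and the seed are assumptions made for illustration. It uses t.test() to put a 95% confidence interval around the mean of the data we can't see.

```r
set.seed(7)  # arbitrary seed, only so the run is repeatable

# Hypothetical full data set that we pretend we cannot see (skewed, mean about 20)
population <- rexp(100000, rate = 1 / 20)

# The limited sample we actually have
observed <- sample(population, size = 50)

# 95% confidence interval for the population mean, based on the t distribution
ci <- t.test(observed, conf.level = 0.95)$conf.int

print(mean(observed))
print(ci)
```

The interval is a statement about the unseen population mean, not about individual values, and it leans on the sample mean being approximately normal.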

In those rare cases where you already have 100% of the data and are simply describing it, these methods aren’t useful or even required, because you don’t need to make assumptions about data you can’t see.

Regardless of the shape of your data’s distribution, you should…

Be Savvy