Data Skeptic is a podcast that covers topics in data science. It alternates between interview episodes and shorter mini-episodes that cover basic data science concepts. Episode #77 (Polich 2015) introduces a topic from statistics, the Central Limit Theorem (CLT). To illustrate the idea, Kyle described some graphics, but the description was a bit hard to follow. This document is my attempt to produce those graphics. The source for the document is on GitHub (Ferrucci 2016).
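The CLT says that, for a distribution X with finite variance, the means of repeated batches of independent samples from X are approximately normally distributed, and the approximation improves as the batch size grows. The amazing thing is that this works even if X itself is not normally distributed.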
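Stated a little more formally, under the usual textbook assumptions: the draws X_1, ..., X_n are independent and identically distributed, with mean μ and finite variance σ². These symbols are introduced here for illustration and do not come from the episode.

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
$$

In practical terms, for batches of n = 100 the batch means should cluster around μ with a standard deviation of about σ/√100 = σ/10.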
To show off the CLT, we need some sort of distribution that we can take repeated samples of - say, 100 samples in a batch - and then compute the mean. Then we repeat that process - take 100 samples, compute the mean - and look at the distribution of the mean values. The most surprising result happens when the original distribution has some odd shape - something asymmetrical, rather than a typical bell curve.
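The batch-and-average procedure can be sketched in a few lines. This is a minimal Python illustration, not the code behind this document (that is written in R); the exponential source distribution, the batch counts, and the helper names `batch_means` and `skewness` are all assumptions chosen for the example. Skewness is used as a rough measure of asymmetry: it is near 0 for a symmetric shape such as a bell curve.

```python
import random
import statistics


def draw_exponential(rng):
    """Assumed source distribution: exponential with mean 1 (right-skewed)."""
    return rng.expovariate(1.0)


def batch_means(draw, batch_size=100, n_batches=2000, seed=0):
    """Return n_batches means, each computed from batch_size fresh draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(batch_size))
        for _ in range(n_batches)
    ]


def skewness(values):
    """Sample skewness: the mean of the cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


rng = random.Random(1)
raw = [draw_exponential(rng) for _ in range(20000)]
means = batch_means(draw_exponential)

print(f"skewness of raw samples: {skewness(raw):.2f}")
print(f"skewness of batch means: {skewness(means):.2f}")
```

The raw samples should come out strongly skewed, while the batch means should be much closer to symmetric, which is the effect the CLT describes.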
The hypothetical data set used in the podcast episode had to do with birds’ habit of throwing away some of their food each day. The day after throwing away a lot of food, a bird is likely to eat more and so waste less. Linh Da offered a complicating factor: some birds have a crop, where they store food. If Yoshi, the bird from the episode, stores food in his crop on a particular day, then on the following day he might tend to waste more food, since he can eat from his crop.
A simple model of bird food waste uses two normal distributions: a low-mean normal, used when the previous day had a high amount of waste, and a high-mean normal, used in the opposite case, when the previous day had a low amount of waste. Plotted as a time series, the result oscillates, and the histogram of the data shows two distinct peaks.
Have a look at birdmodel.R, and the comments therein, for more details about the model.
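For readers who want the gist without opening the R source, here is a rough Python sketch of the two-normal idea. It is not a port of birdmodel.R: the function name `simulate_waste`, the means, the standard deviation, the high/low threshold, and the clamp at zero are all invented for illustration.

```python
import random


def simulate_waste(n_days, low_mean=3.0, high_mean=8.0, sd=1.5,
                   threshold=5.5, seed=0):
    """Simulate daily waste for one bird (all parameters are illustrative).

    After a high-waste day, draw from the low-mean normal;
    after a low-waste day, draw from the high-mean normal.
    """
    rng = random.Random(seed)
    waste = []
    previous_high = rng.random() < 0.5  # arbitrary starting state
    for _ in range(n_days):
        mean = low_mean if previous_high else high_mean
        today = max(0.0, rng.gauss(mean, sd))  # assumption: waste is never negative
        waste.append(today)
        previous_high = today > threshold
    return waste


days = simulate_waste(1000)

# Count how often consecutive days fall on opposite sides of the threshold.
switches = sum((a > 5.5) != (b > 5.5) for a, b in zip(days, days[1:]))
print(f"fraction of day-to-day switches: {switches / (len(days) - 1):.2f}")
```

A switch fraction close to 1 corresponds to the alternating pattern in the time series, and the two means are what produce the two peaks in the histogram.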
100 simulated days of bird waste. Notice that days tend to alternate between high and low waste amounts.
1000 simulated days of bird waste - histogram.
Stacked Histogram from 100 birds over 100 days.
Rotated Histogram…
Each bird’s mean waste over its simulated days is a random variable, and the CLT predicts that this variable is approximately normally distributed. Let’s have a look.
Histogram of means: 5,000 birds, 10,000 days of waste
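A scaled-down version of this experiment can be sketched in Python as follows. The bird model and its parameters are the same illustrative assumptions as before, not the values in birdmodel.R, and the counts (500 birds, 200 days each) are reduced so the example runs quickly. In this model the days within one bird are correlated, so treat this as an empirical look rather than a textbook application of the theorem.

```python
import random
import statistics


def mean_waste(n_days, rng, low_mean=3.0, high_mean=8.0, sd=1.5,
               threshold=5.5):
    """Mean daily waste for one simulated bird (illustrative parameters)."""
    previous_high = rng.random() < 0.5  # arbitrary starting state
    total = 0.0
    for _ in range(n_days):
        mean = low_mean if previous_high else high_mean
        today = max(0.0, rng.gauss(mean, sd))  # assumption: no negative waste
        total += today
        previous_high = today > threshold
    return total / n_days


rng = random.Random(42)
means = [mean_waste(200, rng) for _ in range(500)]

# Crude text histogram: 10 equal-width bins from the smallest to largest mean.
lo, hi = min(means), max(means)
n_bins = 10
counts = [0] * n_bins
for m in means:
    idx = min(int((m - lo) / (hi - lo) * n_bins), n_bins - 1)
    counts[idx] += 1

for i, c in enumerate(counts):
    left_edge = lo + i * (hi - lo) / n_bins
    print(f"{left_edge:6.3f} | {'#' * (c // 5)}")

print(f"mean of means: {statistics.fmean(means):.3f}")
print(f"sd of means:   {statistics.stdev(means):.3f}")
```

Even though a single bird's daily waste has two peaks, the per-bird means should pile up in a single hump.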
This does look pretty normal, but is it? During the podcast Kyle hinted that there would be a mini-episode to answer that question… some sort of normality test. For now, I’ll be content with a qualitative result.
Ferrucci, Aaron. 2016. “Visualizing Data Skeptic’s CLT Mini-Episode.” https://github.com/aaronferrucci/birdfoodwaste.
Polich, Kyle. 2015. “The Central Limit Theorem.” Data Skeptic Podcast. http://dataskeptic.com/epnotes/ep77_central-limit-theorem.php.