Example 1: Central Limit Theorem

So why is the normal distribution so normal?

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and well-defined variance, will be approximately normally distributed.

Wikipedia

This is easily demonstrated by a little simulation: first, we’ll create a random variable called ‘variable’. It will contain 10,000 random values ranging from -1 to +1 (this is R code, and you’ll learn to do some of this in your lab this week!):

variable = runif(n=10000, min=-1, max=+1)

And we’ll plot a histogram of this variable:
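If you want to draw this yourself, a minimal sketch in base R might look like the following. The `set.seed()` call, the number of bins and the plot title are my own illustrative choices, not part of the original example:

```r
set.seed(1)  # assumption: fixed seed so the figure is reproducible
variable <- runif(n = 10000, min = -1, max = 1)

# Draw the histogram; breaks = 40 is an arbitrary suggestion for the bin count
h <- hist(variable, breaks = 40,
          main = "One uniform variable", xlab = "value")
```

The bars should all come out at roughly the same height, since every value between -1 and +1 is equally likely.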

As you can see, the variable is pretty much uniformly distributed. Now we’ll add another random variable to this random variable:

variable = variable + runif(n=10000, min=-1, max=+1)

And we’ll plot it again…

How does that work? Well, let’s think about it for a second…

Ok, now let’s repeat this a few more times, and see what happens…
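One way to sketch the repetition is with a simple loop. The total of 12 uniform variables, the seed and the overlaid normal curve are assumptions added for illustration; "a few more times" could just as well mean some other number:

```r
set.seed(1)        # assumption: fixed seed for reproducibility
n_values <- 10000
n_terms  <- 12     # assumption: 12 uniform variables in total

variable <- runif(n = n_values, min = -1, max = 1)
for (i in seq_len(n_terms - 1)) {
  # add one more independent uniform variable on each pass
  variable <- variable + runif(n = n_values, min = -1, max = 1)
}

# Each uniform(-1, 1) term has mean 0 and variance 1/3, so the sum of
# n_terms of them has mean 0 and variance n_terms / 3.
hist(variable, breaks = 50, freq = FALSE,
     main = "Sum of 12 uniform variables", xlab = "value")
curve(dnorm(x, mean = 0, sd = sqrt(n_terms / 3)), add = TRUE, lwd = 2)
```

The curve is the normal density with the same mean and variance as the sum, so you can judge by eye how close the histogram gets to it.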

Example 2: Central Limit Theorem Visualized

Here’s a cool ‘web-based’ visualization of the Central Limit Theorem. Watch what happens as we increase the number of bins…

In the example here, at every triangle, the ball has a 50/50 shot of going to the left or to the right. You can also think of it like coin flips, where the number of coin flips is (bins - 1).
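The coin-flip view is easy to simulate in R. In this sketch the number of bins, the number of balls and the seed are all assumed values; each ball's final bin is just the count of "rights" out of (bins - 1) fair left/right choices:

```r
set.seed(1)         # assumption: fixed seed for reproducibility
n_bins  <- 11       # assumption: a board with 11 bins (10 rows of triangles)
n_balls <- 10000    # assumption: number of balls dropped

# Each ball makes (n_bins - 1) 50/50 choices; the number of "rights"
# is the index (0 .. n_bins - 1) of the bin it lands in.
bin <- rbinom(n = n_balls, size = n_bins - 1, prob = 0.5)

# Count balls per bin, keeping bins that happen to be empty
counts <- table(factor(bin, levels = 0:(n_bins - 1)))
barplot(counts, xlab = "bin", ylab = "number of balls")
```

Try increasing `n_bins` and re-running it to mimic what the visualization does.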

Example 3: Simpson’s Paradox

If you’ll recall from last week, a Simpson’s Paradox is…

a paradox in probability and statistics, in which a trend appears in different groups of data but disappears or reverses when these groups are combined.

Wikipedia

You might also recall that I mentioned a study that looked at the relationship between SAT scores and teacher salaries (“Getting what you pay for: the debate over equity in public school expenditures” (1999), Journal of Statistics Education 7(2)). Well, it turns out we can use the data from that study to explore a Simpson’s Paradox ourselves:

Here are the first few rows of a data frame describing average SAT scores by state, with several additional explanatory variables:

head(SAT)
##        state expend ratio salary frac verbal math  sat
## 1    Alabama  4.405  17.2 31.144    8    491  538 1029
## 2     Alaska  8.963  17.6 47.951   47    445  489  934
## 3    Arizona  4.778  19.3 32.175   27    448  496  944
## 4   Arkansas  4.459  17.1 28.934    6    482  523 1005
## 5 California  4.992  24.0 41.078   45    417  485  902
## 6   Colorado  5.443  18.4 34.571   29    462  518  980

To get an idea of this Simpson’s Paradox, we’ll have a look at the relationship between SAT scores and teacher salary. To see this, we’ll plot sat as a function of salary:

We’ll also add a trend line (regression line) so you can see the ‘modeled’ relationship (we’ll do a lot of this in this course!):
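A minimal sketch of the scatterplot and trend line in base R follows. To keep it self-contained it uses only the six rows printed by `head(SAT)` above, stored in a data frame I've called `sat6` (a made-up name); with the full data you would use `SAT` itself:

```r
# Only the six rows shown by head(SAT) above; the full data has more states.
sat6 <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Scatterplot of sat as a function of salary
plot(sat ~ salary, data = sat6, xlab = "salary", ylab = "sat")

fit <- lm(sat ~ salary, data = sat6)  # simple linear regression
abline(fit)                           # add the fitted trend line to the plot
coef(fit)                             # intercept and slope
```

Even on these six rows the fitted slope for `salary` comes out negative, though six points are far too few to draw conclusions from.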

This seems counter-intuitive, doesn’t it? The trend line suggests that states that pay their teachers more have lower SAT scores. And yet the numbers don’t lie… so what’s going on?

What is happening here is something called confounding, and we can see it when we ‘control for’ the fraction of students actually taking the SAT. One way to do this is to ‘group’ the data by the fraction of students taking the SAT. Here we’ve grouped the states into those with ‘low’ (0%-22%), ‘medium’ (23%-49%), or ‘high’ (50%-81%) percentages of students taking the SAT (to give you some perspective, I’ve also added labels to the points):

And now when we plot the ‘trend line(s)’, we see an entirely different picture!
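The grouping and the per-group trend lines can be sketched with `cut()` and `split()`. Again this uses only the six rows printed by `head(SAT)` earlier, in a made-up data frame `sat6`; the column name `frac_group` is also my own. With so few rows the ‘high’ group is empty, so treat this as a demonstration of the mechanics rather than of the result:

```r
# Only the six rows shown by head(SAT); the full data has more states.
sat6 <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado"),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Bin frac into the three ranges used in the text: 0-22, 23-49, 50-81
sat6$frac_group <- cut(sat6$frac,
                       breaks = c(0, 22, 49, 81),
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)

# Fit sat ~ salary separately within each non-empty group; keep the slopes
by_group <- split(sat6, sat6$frac_group, drop = TRUE)
slopes <- sapply(by_group,
                 function(d) coef(lm(sat ~ salary, data = d))[["salary"]])

# Plot the points coloured by group, label them, and draw one line per group
plot(sat ~ salary, data = sat6, col = as.integer(sat6$frac_group), pch = 19)
text(sat6$salary, sat6$sat, labels = sat6$state, pos = 3, cex = 0.7)
for (d in by_group) {
  abline(lm(sat ~ salary, data = d), col = as.integer(d$frac_group[1]))
}
slopes
```

Run on the complete `SAT` data frame, the same steps give one fitted line for each of the three groups.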

Why do you think there are three lines here? How is it possible that by ‘controlling for’ the fraction of students taking SATs, the estimated relationships change direction like this?

References

  1. http://terpconnect.umd.edu/~toh/spectrum/SignalsAndNoise.html
  2. http://blog.vctr.me/posts/central-limit-theorem.html
  3. http://www.r-bloggers.com/example-9-20-visualizing-simpsons-paradox/