So why is the normal distribution so normal?
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and well-defined variance, will be approximately normally distributed.
This is easily demonstrated by a little simulation: first, we’ll create a random variable called ‘variable’. It will contain 10,000 random values ranging from -1 to +1 (this is R code, and you’ll learn to do some of this in your lab this week!):
variable = runif(n=10000, min=-1, max=+1)
And we’ll plot a histogram of this variable.
As you can see, the variable is pretty much uniformly distributed. Now we’ll add another random variable to this random variable:
variable = variable + runif(n=10000, min=-1, max=+1)
And we’ll plot it again…
How does that work? Well, let’s think about it for a second…
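One way to think about it: a sum near 0 can come from many combinations (a large positive draw cancelled by a large negative one, two small draws, and so on), but a sum near -2 or +2 only happens when both draws are extreme. Here is a quick sketch to check that intuition. The names `x`, `y`, and `total` and the seed are just illustrative choices, not part of the lab:

```r
set.seed(1)  # arbitrary seed, only so the numbers are reproducible

x <- runif(n = 10000, min = -1, max = 1)
y <- runif(n = 10000, min = -1, max = 1)
total <- x + y  # ranges from -2 to +2

# Compare the middle tenth of the range with the outer tenth
mean(abs(total) < 0.2)  # share of sums close to 0
mean(abs(total) > 1.8)  # share of sums close to -2 or +2

hist(total, breaks = 40, main = "Sum of two uniform variables")
```

The first share should come out much larger than the second, which is why the histogram piles up in the middle instead of staying flat.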
Ok, now let’s repeat this a few more times, and see what happens…
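If you want to try the repetition yourself, a loop does the job. This is only a sketch: `n_terms` (how many uniform variables we add up) is a value picked for illustration, and the overlaid curve assumes the normal approximation we are trying to demonstrate:

```r
set.seed(2)    # arbitrary seed for reproducibility
n_terms <- 12  # illustrative choice; try other values

total <- numeric(10000)
for (i in seq_len(n_terms)) {
  total <- total + runif(n = 10000, min = -1, max = 1)
}

# A uniform(-1, 1) variable has variance 1/3, so the sum of n_terms
# independent ones has standard deviation sqrt(n_terms / 3)
hist(total, breaks = 50, freq = FALSE,
     main = "Sum of 12 uniform variables")
curve(dnorm(x, mean = 0, sd = sqrt(n_terms / 3)), add = TRUE, lwd = 2)
```

Changing `n_terms` to 1, 2, 3, ... lets you watch the histogram move from flat, to triangular, to bell-shaped.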
Here’s a cool ‘web-based’ visualization of the Central Limit Theorem. Watch what happens as we increase the number of bins…
In the example here, at every triangle, the ball has a 50/50 shot of going to the left or to the right. You can also think of it like coin flips, where the number of coin flips is (bins - 1).
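The coin-flip view is easy to simulate in R. In this sketch, `bins`, `balls`, and the other names are illustrative, and each ball’s final bin is simply the number of times it bounced right:

```r
set.seed(3)  # arbitrary seed for reproducibility

bins  <- 11        # illustrative number of bins
flips <- bins - 1  # one 50/50 bounce per row of triangles
balls <- 10000

# Number of "right" bounces for each ball, from 0 to flips
final_bin <- rbinom(n = balls, size = flips, prob = 0.5)

# Count the balls in every bin, including bins that ended up empty
counts <- table(factor(final_bin, levels = 0:flips))
barplot(counts, xlab = "bin", ylab = "number of balls")
```

Increasing `bins` adds more coin flips per ball, and the bar plot looks more and more like a bell curve.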
If you’ll recall from last week, Simpson’s Paradox is…
a paradox in probability and statistics, in which a trend appears in different groups of data but disappears or reverses when these groups are combined.
You might also recall that I mentioned a study that looked at the relationship between SAT scores and teacher salaries (“Getting what you pay for: the debate over equity in public school expenditures” (1999), Journal of Statistics Education 7(2)). Well, it turns out we can use the data from that study to explore Simpson’s Paradox ourselves:
Here are the first few rows of a data frame describing average SAT scores by state, with several additional explanatory variables:
head(SAT)
## state expend ratio salary frac verbal math sat
## 1 Alabama 4.405 17.2 31.144 8 491 538 1029
## 2 Alaska 8.963 17.6 47.951 47 445 489 934
## 3 Arizona 4.778 19.3 32.175 27 448 496 944
## 4 Arkansas 4.459 17.1 28.934 6 482 523 1005
## 5 California 4.992 24.0 41.078 45 417 485 902
## 6 Colorado 5.443 18.4 34.571 29 462 518 980
To get an idea of this Simpson’s Paradox, we’ll have a look at the relationship between SAT scores and teacher salary by plotting sat as a function of salary:
We’ll also add a trend line (regression line) so you can see the ‘modeled’ relationship (we’ll do a lot of this in this course!):
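Here is a minimal sketch of that kind of plot. So that it runs on its own, it uses only the six rows printed above, typed into a small data frame called `sat_head` (an illustrative name), so the fitted line will not match the one for the full data:

```r
# The six rows shown by head(SAT), keeping only the columns we need
sat_head <- data.frame(
  state  = c("Alabama", "Alaska", "Arizona", "Arkansas",
             "California", "Colorado"),
  salary = c(31.144, 47.951, 32.175, 28.934, 41.078, 34.571),
  frac   = c(8, 47, 27, 6, 45, 29),
  sat    = c(1029, 934, 944, 1005, 902, 980)
)

# Scatter plot of sat as a function of salary
plot(sat ~ salary, data = sat_head, xlab = "salary", ylab = "sat")

# Fit a simple linear regression and draw it as the trend line
fit <- lm(sat ~ salary, data = sat_head)
abline(fit)
coef(fit)  # intercept and slope of the trend line
```

With the full data frame loaded, replacing `sat_head` with `SAT` fits the same model to all of the states.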
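The trend line slopes downward: states that pay their teachers more tend to have lower average SAT scores. This seems counter-intuitive, doesn’t it? Yet the numbers don’t lie… so what’s going on?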
What is happening here is something called confounding, and we can see it when we ‘control for’ the fraction of students actually taking SATs. One way to do this is to ‘group’ the data by the fraction of students taking the SATs. Here we’ve grouped the states into states with ‘low’ (0%-22%), ‘medium’ (23%-49%), or ‘high’ (50%-81%) percentages of students taking SATs (to give you some perspective, I’ve also added labels to the points):
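The grouping itself can be done with `cut()`. As a sketch, here it is applied to the `frac` values of the six states printed earlier. The cut points follow the ranges above, while the names `frac_group` and `grouped` are illustrative:

```r
state <- c("Alabama", "Alaska", "Arizona", "Arkansas",
           "California", "Colorado")
frac  <- c(8, 47, 27, 6, 45, 29)  # percent of students taking the SAT

# Intervals are [0, 22], (22, 49], and (49, 81]
frac_group <- cut(frac,
                  breaks = c(0, 22, 49, 81),
                  labels = c("low", "medium", "high"),
                  include.lowest = TRUE)

grouped <- data.frame(state, frac, frac_group)
grouped
table(frac_group)  # how many of these states fall into each group
```

None of these six states lands in the ‘high’ group, so you need the full data to see all three groups.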
And now when we plot the ‘trend line(s)’, we see an entirely different picture!
Why do you think there are three lines here? How is it possible that by ‘controlling for’ the fraction of students taking SATs, the estimated relationships change direction like this?
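A toy example may help with that last question. The numbers below are made up purely for illustration (they are not the SAT data): within each group y goes up with x, but the groups with larger x values start from a lower baseline, so the pooled trend points the other way:

```r
# Made-up data: three groups of three points each
toy <- data.frame(
  group = rep(c("low", "medium", "high"), each = 3),
  x     = 1:9
)
# Each group has its own baseline; within a group, y rises one-for-one with x
toy$y <- toy$x + rep(c(14, 8, 2), each = 3)

# One trend line through all of the points: the slope is negative
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]
pooled_slope  # -0.8

# One trend line per group: every slope is positive
group_slopes <- sapply(split(toy, toy$group),
                       function(d) coef(lm(y ~ x, data = d))[["x"]])
group_slopes  # 1 for every group
```

The pooled line mostly picks up the differences between the groups, while the per-group lines pick up what happens inside each group.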