September 15, 2016
Note: Tuesday Quora session with Hadley Wickham
I have a random person locked in a room. If you can correctly guess their height (within 5 inches) I'll let them go unharmed.
How will you guess their height?
Without any other knowledge, the best you can do is guess the average height of people: you expect the height to be the mean height, or expected height (\(E(h)\)).
If I tell you this person is an NBA player, or a 6 year old, you would want to adjust your guess, but you'd still go with a conditional average.
In math terms, if we have to estimate a height without any details:
\[\hat{h} = \bar{h}\]
Bringing in one piece of information (e.g. NBA status), we would add a term that adjusts our estimate given that knowledge.
\[\hat{h}|NBA = \bar{h} + x\]
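Here's a minimal numerical sketch of those two guesses (all heights and group labels are made up, in inches):

```python
import numpy as np

# Hypothetical heights, labeled by group (made-up numbers).
heights = np.array([69, 66, 71, 80, 82, 79, 45, 46, 44])
groups = np.array(["adult", "adult", "adult", "nba", "nba", "nba", "kid", "kid", "kid"])

# With no other information, guess the overall mean: E(h).
print(heights.mean())

# Told the person is an NBA player, guess the conditional mean: E(h | NBA).
print(heights[groups == "nba"].mean())
```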
What does \(x\) represent?
We already know how a line works:
\[ y = mx + b\]
In econometrics we switch the order of the terms (\(y = b + mx\)),
rename \(b\) and \(m\) (\(\beta_0\) and \(\beta_1\)),
and add in the stochastic* error term (\(u\)).
\[ y = \beta_0 + \beta_1 x + u \]
*("Stochastic" is fancy speak for "random".)
\[ y = \beta_0 + \beta_1 x + u \]
\(\beta_0\) is the intercept of our line… it tells us what value \(y\) would take if \(x=0\).
\(\beta_1\) is the slope of our line. It tells us how \(y\) changes if \(x\) changes by one unit.
Our dependent variable is \(y\) and our independent variable is \(x\).
\[ y = \beta_0 + \beta_1 x + u \]
This model is too simple to be used in practice. (Possible exception: well designed randomized trials.)
But it will help us understand the fundamentals, so we're going to dissect it before moving on to more realistic settings.
\[ yield = \beta_0 + \beta_1 fertilizer + u\]
The error term (or "residual") captures the stuff that makes yield a bit more or less because of natural variation, e.g. rainfall, land quality, parasites, etc.
A well behaved error term is normally distributed (making inference easy), with a mean of 0 (meaning that these unobserved factors balance out).
We call it an error term because it represents how far off our model is from the actual data (\(u \equiv y - \beta_0 - \beta_1 x\))
What is the effect of education on wage?
\[wage = \beta_0 + \beta_1 educ + u \]
Here \(u\) includes the effects of everything that isn't part of the \(educ\) variable.
If we estimate the model and get these results:
\[\hat{wage} = 3 + 0.58 educ\]
What is the value of one extra year of school?
What is the expected wage for someone with a bachelor's degree?
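As a quick sketch of how to read the fitted line (assuming \(educ\) is measured in years of schooling, and treating a bachelor's degree as 16 years; the units of wage aren't specified here):

```python
# Fitted line from above: wage-hat = 3 + 0.58 * educ
def predicted_wage(educ):
    return 3 + 0.58 * educ

# One extra year of schooling moves the predicted wage by the slope, 0.58.
print(round(predicted_wage(13) - predicted_wage(12), 2))  # 0.58

# Predicted wage with a bachelor's degree, assuming that means educ = 16.
print(round(predicted_wage(16), 2))  # 12.28
```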
The expected value of \(u\) is 0. That means that the effects of unobserved factors should balance out.
\(E(u|x) = 0\) is a stronger assumption than \(\bar{u} = E(u) = 0\). Why? (hint: Try graphing out alternatives in your notes.)
\(E(u|x) = 0\) means that there aren't levels of \(x\) that are systematically above or below our line.
If \(E(u|x)\) was +1 for low levels of x, -2 for medium levels, and +1 for high levels, then it would mean our model is estimating a value of \(y\) that is too high in the middle, and too low at the ends.
If the error is independent of \(x\), then our model can recover the true relationship between \(x\) and \(y\). If the error is not independent of \(x\), then our estimates will be biased.
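A minimal simulation sketch of that point (all numbers made up), using the fact shown later in these notes that the OLS slope is the covariance of \(x\) and \(y\) over the variance of \(x\): when the error is built to correlate with \(x\), the fitted slope drifts away from the true \(\beta_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
beta0, beta1 = 1.0, 2.0  # "true" parameters for the simulation

x = rng.normal(size=n)

# Case 1: E(u|x) = 0 -- the error is unrelated to x.
u_good = rng.normal(size=n)
y_good = beta0 + beta1 * x + u_good

# Case 2: the error contains something correlated with x (like IQ and schooling).
u_bad = 1.5 * x + rng.normal(size=n)
y_bad = beta0 + beta1 * x + u_bad

def ols_slope(x, y):
    """OLS slope: sample covariance of x and y over sample variance of x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(ols_slope(x, y_good))  # close to 2.0, the true slope
print(ols_slope(x, y_bad))   # close to 3.5 -- biased away from the true slope
```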
For our example:
\[wage = \beta_0 + \beta_1 educ + u \]
\(\hat{\beta_1}\) is likely to overstate the effect of education (i.e. schooling) because we're not controlling for the effects of variables like IQ, parents' backgrounds, etc.
For our example:
\[wage = \beta_0 + \beta_1 educ + u \]
IQ, etc. will probably correlate with years of schooling. \(\hat{\beta_1}\) is what our model will estimate; it will fit the data, but it won't recover the true value of \(\beta_1\).
The assumption that \(E(u|x) = 0\) implies that:
\[E(y|x) = E(\beta_0 + \beta_1 x + u|x)\]
\[E(y|x) = E(\beta_0|x) + E(\beta_1 x|x) + E(u|x)\]
\[E(y|x) = \beta_0 + \beta_1 x + 0\]
This means that our line (in theory) is going through the conditional average value of the dependent variable.
Now, to better understand what's going on, we'll consider the simplest case (\(y\) as a function of a single \(x\) variable). We're going to build a theory of an ideal case, and work out what the best way to estimate the true relationship would be under those ideal circumstances.
We want to make our estimate as good as possible. But what does that mean?
It's going to have to do with our error term.
We're going to draw a line through some dots. This line is \(\hat{y}\) and it represents our estimate of \(y|x\).
We know that few of our points will actually be on the line. Most estimates will be wrong. But our errors will be random and unpredictable (if we could predict them, we would account for them in our model!).
We can quantify how far off our estimate is for each estimate (i.e. each \(\hat{y},x\) pair) by subtracting that estimate from the observed value of \(y\) (i.e. what the data shows).
\[\hat{u_i} = y_i - \hat{y_i}\] \[\hat{u_i} = y_i - \hat{\beta_0} - \hat{\beta_1}x_i\]
We assumed that \(E(u)=0\).
Since \(E(u) = \sum_{i=1}^{n}u_i/n\) we know that \(n \times E(u) = \sum_i u_i\). And since the expected error is 0, the sum of errors must also be 0. Any specific error will be positive or negative, but they'll all wash out.
So any estimator that doesn't screw up the error term is going to minimize the sum of errors. That doesn't narrow down our list of candidates for estimators of our parameters (the \(\beta\)'s).
We can't help but minimize the sum of errors: because some errors are negative, that route goes nowhere. But what if we modified the errors so they were always positive?
The sum of the absolute values of the errors (\(|u|\)) would be a better candidate, because minimizing it requires that our line actually pass close to the data points.
One thing we want out of our estimator is for it to reliably reflect the population parameters. That is, if we could get information on everyone (and even better: everyone who ever was and will ever be), the \(\beta\) values that would match that data shouldn't be different from the average values we would get by estimating the parameters on lots of random samples.
We want our estimator to be unbiased.
Given an unbiased estimator, we want something that comes up with estimates that are consistently close to the true/population values of our \(\beta\) parameters. We prefer estimators with relatively little variance.
i.e. Since we're going to be working with a sample, we want the estimates from that sample to be as close to the true values as possible.
Given our theoretical assumptions (the "Gauss-Markov assumptions"), and restricting ourselves to linear models (which are much easier to deal with in general), there is a type of estimator that is unbiased and efficient (i.e. has lower variance than other candidate estimators).
The Best Linear Unbiased Estimator (BLUE) is the Ordinary Least Squares (OLS) estimator.
We will estimate \(\beta_0\) and \(\beta_1\) by drawing a line that minimizes the Sum of Squared Residuals (SSR).
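Here's a sketch of what "minimizing the SSR" means in practice (made-up data, loosely echoing the wage example): the OLS line has the smallest SSR, and nudging the intercept or slope in any direction makes it larger.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3 + 0.58 * x + rng.normal(size=100)  # made-up data

def ssr(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1 * x."""
    return np.sum((y - b0 - b1 * x) ** 2)

# OLS picks the intercept/slope pair with the smallest SSR.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

print(ssr(b0_hat, b1_hat))        # the minimized SSR
print(ssr(b0_hat, b1_hat + 0.1))  # any other line does worse
print(ssr(b0_hat + 0.5, b1_hat))

# The OLS residuals also sum to (essentially) zero, as assumed earlier.
print(np.sum(y - b0_hat - b1_hat * x))
```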
Note: some economists call this the sum of squared errors, or SSE. This is unfortunate because we're also interested in the explained sum of squares (SSE) which those economists would call the sum of squares due to regression (SSR). That is, sometimes SSR is SSE and vice versa, depending on who an econometrician learned econometrics from. Sometimes they all become RSS and ESS. Economists should take linguistics classes.
The counterpart to the SSR (Sum of Squared Residuals or Residual Sum of Squares) is:
the Explained Sum of Squares (SSE).
The SSR + SSE together gives us the Total Sum of Squares (SST).
\[SST \equiv \sum_{i=1}^{n}(y_i - \bar{y})^2\]
\[SSE \equiv \sum_{i=1}^{n}(\hat{y_i} - \bar{y})^2\]
\[SSR \equiv \sum_{i=1}^{n}(\hat{u_i})^2\]
\[SST \equiv \sum_{i=1}^{n}(y_i - \bar{y})^2\]
Here we're looking at the difference between what we observe (\(y_i\)) and the average (\(\bar{y}\)). In other words, how far off is \(y\) from its average value? Squaring the term means that we aren't making the mistake of adding negative and positive numbers and feeling good when they add up to 0.
Squaring also means that outliers are treated as more important than points that are close to average.
\(SST/(n-1)\) gives us the sample variance of \(y\).
\[SSE \equiv \sum_{i=1}^{n}(\hat{y_i} - \bar{y})^2\]
Now we're comparing how far our estimated values of \(y\) (\(\hat{y_i}\)) are from average. Our estimates are explaining that \(y\) is sometimes above or below average because of the effects of \(x\).
\[SSR \equiv \sum_{i=1}^{n}(\hat{u_i})^2\]
Finally, we're asking how far off our estimates are from the actual observed values of \(y\). Another way of writing this is:
\[SSR \equiv \sum_{i=1}^{n}(y_i - \hat{y_i})^2\]
Our standard "goodness-of-fit" measure is \(R^2\) which is defined as
\[R^2 \equiv \frac{SSE}{SST} = 1 - \frac{SSR}{SST}\]
The closer \(R^2\) is to 1, the better our model fits the data. But we don't want to overfit.
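A small numerical check of that decomposition (made-up data again): SST splits into SSE + SSR, both forms of the \(R^2\) formula agree, and SST/(n-1) matches the sample variance of \(y\).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 3 + 0.58 * x + rng.normal(size=200)  # made-up data

b1_hat, b0_hat = np.polyfit(x, y, deg=1)
y_hat = b0_hat + b1_hat * x

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)         # sum of squared residuals

print(np.isclose(sst, sse + ssr))             # True: SST = SSE + SSR
print(sse / sst, 1 - ssr / sst)               # the two R^2 formulas agree
print(sst / (len(y) - 1), np.var(y, ddof=1))  # SST/(n-1) is the sample variance of y
```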
We don't want to just try lots of different lines and choose the best one; we want a method that works in general (as long as our assumptions about independence, error distribution, etc. hold).
\[\hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]
This is equivalent to looking at the covariance of \(x\) and \(y\) divided by the variance of \(x\) (with division by \(n\) cancelling out for numerator and denominator).
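A sketch checking that equivalence numerically (made-up data): the textbook formula, the covariance/variance ratio, and a library fit all give the same slope.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 3 + 0.58 * x + rng.normal(size=100)  # made-up data

# Textbook formula for the OLS slope.
beta1_formula = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Sample covariance of x and y divided by sample variance of x.
beta1_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# A library fit, for comparison.
beta1_polyfit = np.polyfit(x, y, deg=1)[0]

print(beta1_formula, beta1_cov, beta1_polyfit)  # all three agree
```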