This is the fourth challenge question for the course. Remember that it’s meant to be challenging and to stretch your skills, but you can always ask for help or collaborate with your classmates.
People don’t randomly choose whether or not to go to college: they choose (subject to constraints) to go. The same factors that influence a person to go to college are also likely to affect their lifetime earnings trajectory. This poses a challenge for estimating the returns to a college education, since the sample of college attendees is biased in favor of the kinds of people who would go to college. Worst of all, many of those factors are not measured in most datasets, and they are poorly enough understood that we’ll probably never be sure we’ve measured everything. So you usually can’t correct for this selection bias simply by including the omitted variables in the regression.
You’ve been asked to measure the net benefit people will receive, in terms of earnings, from a program which will help more people go to college. To do that, you need to estimate the causal ATE (average treatment effect) of college on earnings, which means you need some way to deal with this selection bias. You don’t have data on the unobservables which are correlated with both selection into college (the “treatment”) and earnings.
What’s an econometrician to do?
The “instrumental variables” approach is one solution to this problem. The key insight is to use an instrumental variable (IV), \(Z\), which is correlated with the treatment \(X\) (here, college attendance) but uncorrelated with those unobservable factors (call them \(U\)), so that the only way it affects the outcome \(Y\) (here, earnings) is through \(X\). Mathematically, the following conditions must be true about \(Z\) for it to be a valid IV:
\(E[XZ] \neq 0\): the correlation between \(X\) and \(Z\) must be non-zero. This is often called the “inclusion condition”. It is easy to test by calculating the correlation between \(X\) and \(Z\): if it’s clearly non-zero, you can say the inclusion condition is satisfied. Another method is to estimate \(X = \alpha + \beta Z + \epsilon\) and see if the coefficient on \(Z\) is statistically significant. If it is, then you know that \(Z\) is correlated with \(X\). (This is basically the same as calculating the correlation, but a bit better because you get standard errors and p-values.)

\(E[ZU] = 0\): \(Z\) must be uncorrelated with the unobservables \(U\), so that it affects \(Y\) only through \(X\). This is the “exclusion restriction”. Unlike the inclusion condition, it cannot be tested directly, because \(U\) is unobserved; you have to argue for it based on your knowledge of the setting.
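For concreteness, here is a minimal sketch of the inclusion test in R on made-up toy data (the variable names and numbers below are placeholders I’ve invented, not the assignment data):

# Toy data: a candidate instrument z that partly determines the treatment x
set.seed(1)
z <- rbinom(500, 1, 0.5)
x <- 0.4 * z + rnorm(500)

cor(x, z)          # inclusion test: should be clearly non-zero
summary(lm(x ~ z)) # same idea, but with a standard error and p-value on z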
The DAG below formalizes this graphically:
\(X\) causally affects \(Y\), but both \(X\) and \(Y\) are affected by an unobserved variable \(U\). The solution is to use another variable, \(Z\), which directly affects \(X\) but does not directly affect \(Y\) (\(Z\) can only affect \(Y\) through \(X\)). By using \(Z\) to predict \(X\), we recover the \(X\)-that-would-have-been-but-for-the-influence-of-\(U\), which we can then use without fear of bias.
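If you want to reproduce the DAG yourself, here is one possible sketch; it assumes the ggdag package, which is just one of several tools you could use for this:

library(ggdag)

# U confounds X and Y; Z affects Y only through X
iv_dag <- dagify(Y ~ X + U,
                 X ~ Z + U,
                 exposure = "X", outcome = "Y", latent = "U")

ggdag(iv_dag) + theme_dag()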
Implementing an IV is “straightforward”:

1. Estimate the “first stage” regression of \(X\) on \(Z\), \(X = \alpha + \beta Z + \epsilon\), and save the fitted values \(\hat{X}\).
2. Estimate the “second stage” regression of \(Y\) on those fitted values, \(Y = \alpha + \beta \hat{X} + \epsilon\). The coefficient on \(\hat{X}\) is the IV estimate of the causal effect of \(X\) on \(Y\).
Intuitively, in a situation with selection bias, the variation in \(Y\) is driven by a combination of variation in \(X\) and variation in \(U\). An IV allows you to recover the causal ATE of \(X\) on \(Y\) by “purging” \(X\) of the variation driven by \(U\), and using only the “clean” variation which is driven by \(Z\).
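To see that logic in action, here is a self-contained sketch on simulated data (the data-generating process below is invented purely for illustration): the naive regression is biased by the simulated confounder u, while the two-stage procedure recovers the true effect of 2.

# Simulated data with a known causal effect of x on y equal to 2
set.seed(123)
n <- 10000
u <- rnorm(n)                          # unobserved confounder
z <- rbinom(n, 1, 0.5)                 # instrument: shifts x, unrelated to u
x <- 1 + 0.6 * z + u + rnorm(n)        # "treatment"
y <- 3 + 2 * x + 2 * u + rnorm(n)      # outcome: true effect of x is 2

coef(lm(y ~ x))                        # naive OLS: biased upward by u

x_hat <- fitted(lm(x ~ z))             # first stage: keep only the z-driven variation in x
coef(lm(y ~ x_hat))                    # second stage: coefficient on x_hat is close to 2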
Fortunately, you believe you have an IV that might work: whether or not someone lived in the same city as a college (any college) during their senior year of high school. Living near a college in their senior year increases the chance that someone attends college, but it isn’t correlated with average annual earnings except through its influence on college attendance.1
You want to make sure the IV method works, so you’re going to work with a simulated dataset where you know that the causal ATE of college on earnings is \(\$15,000\). The dataset is college_data.csv. The data describe a simulated sample of 20,000 individuals:
earnings: Average annual earnings at the midpoint of the person’s career, in dollars.
college: Whether someone attended college (1) or not (0).
live_near_college: Whether someone lived near a college in their senior year of high school (1) or not (0).

Your tasks are as follows (a code sketch covering these steps appears after the list):

1. Plot earnings, with college on the x-axis.
2. Estimate the naive OLS regression of earnings on college: \[ earnings = \alpha + \beta college + \epsilon \]
3. Estimate the first stage by regressing college on live_near_college. Save the model as first_stage_model (e.g., as shown below). Does live_near_college pass the inclusion test?
   first_stage_model <- lm(college ~ live_near_college, data=college_data)
4. Explain why live_near_college is likely to satisfy the exclusion restriction.
5. Generate predicted values from first_stage_model using the predict function. Save your predicted output as college_hat, and create a new dataframe containing college_data and college_hat. (Hint: Read the output from ?predict. Use newdata=college_data. All you need are the predicted values of \(X\), i.e., college.)
6. Estimate the second stage: \[ earnings = \alpha + \beta college\_hat + \epsilon .\]
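For reference, here is a rough sketch of what those steps might look like in R, assuming college_data.csv is in your working directory (object names other than first_stage_model and college_hat are my own choices; treat this as a guide for checking your work rather than the one right answer):

college_data <- read.csv("college_data.csv")

# Naive OLS: biased by selection into college
naive_model <- lm(earnings ~ college, data = college_data)

# First stage: does living near a college predict attendance?
first_stage_model <- lm(college ~ live_near_college, data = college_data)
summary(first_stage_model)   # inclusion test: look at the coefficient on live_near_college

# Predicted values of college from the first stage
college_hat <- predict(first_stage_model, newdata = college_data)
college_data_iv <- cbind(college_data, college_hat)

# Second stage: regress earnings on the predicted values
second_stage_model <- lm(earnings ~ college_hat, data = college_data_iv)
coef(second_stage_model)     # should land near the true ATE of $15,000

In practice you’d usually do both stages in one call with a dedicated 2SLS routine (for example, ivreg from the AER package), which also produces correct standard errors; the manual second stage gives the right point estimate, but its standard errors are not valid.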
The IV method was first devised by Sewall Wright, a geneticist, in the 1930s. It was soon applied to economics (by Philip Wright, Sewall’s father) to resolve a paradox in demand estimation, where unobserved factors like marketing and product quality were causing estimated demand curves to slope upwards (i.e., higher prices apparently causing higher demand, reflected in a positive \(\beta\) in \(Q = \alpha + \beta P + \epsilon\)).2 At the time, the issues of unobservables and omitted variables bias were not as well understood as they are today, and the positive coefficients caused a lot of confusion. The IV approach, however, gave downward-sloping demand curves. It is now standard practice to use IVs to estimate demand curves.
The IV technique is arguably one of the most important methods in the modern econometrician’s toolkit. From a purely statistical or “non-causal” predictive standpoint, it’s a bit counterintuitive: the fitted values \(\hat{X}\) carry some regression error on top of whatever measurement error is already in \(X\). So how can using \(\hat{X}\) instead of \(X\) give us a better estimate? It’s like we’re throwing away data in order to do better, and it works! I think this kind of thinking is a hallmark of good economic statistics: rather than simply falling back on the old statistical maxims about Central Limit Theorems and sample size (tired), we think carefully about the variation in the data and how it connects to the underlying problem, so that we use only the data whose variation is free of confounding influences (wired). The “naive” statistical approach of “just get more data” wouldn’t necessarily help us here. In fact, it might make things worse if the new data were also contaminated by confounding influences.
My introduction to this tool came from a David Card paper, “Using geographic variation in college proximity to estimate the return to schooling”. In it, Card does something very similar to what we did in this exercise to estimate the returns to schooling. One of his key (and somewhat counterintuitive) findings was that the returns to a college education are highest for people whose parents didn’t have a college education, even though those people earn less overall. In the years since Card wrote the paper, this result has become much more widely accepted.
In a real application, you’d be controlling for other factors as well, such as parental education and income. These are easy to accommodate in the IV framework, so we’ll ignore them for now.↩
This history is a bit unclear—textual analysis suggests the father wrote the relevant appendix of the paper in which the method was presented. But the son was the statistical mastermind. The current sense is that the two collaborated on the method, applying it to problems each found relevant—the father studied pig iron, while the son studied pigs. This slide deck has more information if you’re curious.↩