library(tidyverse)
library(haven)
library(broom)
library(ggdag)
library(estimatr)
library(fixest)
library(modelsummary)
<- read_stata("data/Dee04.dta") voting
Does education increase civic engagement?
Social scientists have long been interested in the causal effects of education. We’ve seen a ton of examples of economists looking at the causal effect of education on wages or income. Political scientists, not surprisingly, are less interested in income and more interested in the effect of education on civil behavior. For instance, does education make people vote more?
For this example, we’ll use a subset of data from Dee (2004) (preprocessed and cleaned by Miller (2021)) to explore whether college education causes people to register to vote. (See another version of this at Steve’s website)
- Treatment =
college
: A binary variable indicating if the person attended a junior, community, or 4-year college by 1984 - Outcome =
register
: A binary variable indicating if the person is currently registered to vote - Instrument =
distance
: Miles from respondent’s high school to the nearest 2-year college
Exploratory data analysis
What proportion of college attendees are registered to vote? Group by register
and college
, summarize to get the number or rows in each group, then add a column that calculates the proportion.
<- voting %>%
group_props group_by(register, college) %>%
summarize(total = n()) %>%
group_by(college) %>%
mutate(prop = total / sum(total))
`summarise()` has grouped output by 'register'. You can override using the
`.groups` argument.
group_props
# A tibble: 4 × 4
# Groups: college [2]
register college total prop
<dbl> <dbl> <int> <dbl>
1 0 0 1780 0.426
2 0 1 1257 0.249
3 1 0 2399 0.574
4 1 1 3791 0.751
Visualize these proportions with ggplot()
and geom_col()
ggplot(group_props, aes(x = factor(college), y = prop, fill = factor(register))) +
geom_col()
Naive model
Run a super wrong and naive model that estimates the effect of college attendance on voter registration (register ~ college
).
<- lm(register ~ college, data = voting)
model_naive tidy(model_naive)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.574 0.00714 80.4 0
2 college 0.177 0.00965 18.3 1.03e-73
For fun, make a scatterplot of this relationship:
ggplot(voting, aes(x = college, y = register)) +
geom_point(alpha = 0.05, position = position_jitter(width = 0.1, height = 0.1)) +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
TODO: Interpret this finding. Why is this estimate wrong?
Bonus fun! Logistic regression
<- glm(register ~ college, family = binomial(link = "logit"), data = voting)
model_logit tidy(model_logit)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.298 0.0313 9.54 1.43e-21
2 college 0.805 0.0451 17.8 3.33e-71
# College attendance is associated with a 0.805 increase in the log odds of registering to vote, whatever that means
# Odds ratio
exp(0.805)
[1] 2.236696
# College attendees are 2.23 times more likely to register to vote
# plogis((intercept + coefficient)) - plogis(intercept)
plogis((0.298 + 0.805)) - plogis(0.298)
[1] 0.1768683
# College attendance leads to a 17.6 percentage point increase in the probability of registering to vote
library(marginaleffects)
|>
model_logit comparisons(by = "college")
Term Contrast college Estimate Std. Error z Pr(>|z|) S
college mean(1) - mean(0) 0 0.177 0.00978 18.1 <0.001 240.8
college mean(1) - mean(0) 1 0.177 0.00978 18.1 <0.001 240.8
2.5 % 97.5 %
0.158 0.196
0.158 0.196
Columns: rowid, term, contrast, college, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Type: response
# "Divide by 4" trick/rule
0.805 / 4
[1] 0.20125
Distance to college as an instrument
We’re stuck with endogeneity. There are things that cause both education and voter registration that confound the relationship, and we can’t control for all of them.
In his paper, Dee (2004) uses distance to the nearest college as an instrument to help remove this exogeneity. He essentially creates this DAG (though without actually making a DAG):
At first glance, this feels like it could be a good instrument:
- Relevance (Z → X and cor(Z, X) ≠ 0): Distance to college should be associated with college attendance. The closer a college is, the cheapter it is to attend, and the more opportunity there is to attend.
- Excludability (Z → X → Y and Z not → Y and cor(Z, Y | X) = 0): Distance affects college attendance which affects voting registration. But distance should influence voting registration only because people go to college (and no other reason).
- Exogeneity (U not → Z and Cor(Z, U) = 0): Colleges should exist before students exist; students and their voting patterns don’t influence whether pre-existing colleges exist (i.e. there should be no arrows going into the instrument node).
Let’s check these conditions
Relevance
See if there’s correlation between the instrument (distance
) and the treatment (college
). Use the cor()
function (and cor.test()
if you want a p-value):
<- lm(college ~ distance, data = voting)
model_check_relevance tidy(model_check_relevance)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.609 0.00773 78.8 0
2 distance -0.00637 0.000592 -10.8 7.35e-27
%>%
voting summarize(relevance = cor(distance, college))
# A tibble: 1 × 1
relevance
<dbl>
1 -0.111
cor.test(voting$distance, voting$college)
Pearson's product-moment correlation
data: voting$distance and voting$college
t = -10.764, df = 9225, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.13147886 -0.09117564
sample estimates:
cor
-0.111373
Plot the relationship between the instrument and treatment:
ggplot(voting, aes(x = distance, y = college)) +
geom_point(alpha = 0.01) +
geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'
Yep!
Exclusion
See if there’s a relationship between the instrument (distance
) and the outcome (register
).
<- lm(distance ~ register, data = voting)
model_check_exclusion tidy(model_check_exclusion)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 10.2 0.158 64.3 0
2 register -0.620 0.193 -3.22 0.00130
%>%
voting summarize(relevance = cor(distance, register))
# A tibble: 1 × 1
relevance
<dbl>
1 -0.0335
cor.test(voting$distance, voting$register)
Pearson's product-moment correlation
data: voting$distance and voting$register
t = -3.2165, df = 9225, p-value = 0.001302
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.05383772 -0.01307421
sample estimates:
cor
-0.03346988
To help check the “only through” condition, see if there’s a relationship between distance
and other possibly confounding variables (like black
, female
, hispanic
, and so on). If it’s related, that’s a good sign that there’s an arrow between those nodes, thus breaking the exclusion requirement.
%>%
voting summarize(rel_black = cor(distance, black),
rel_female = cor(distance, female),
rel_hispanic = cor(distance, hispanic))
# A tibble: 1 × 3
rel_black rel_female rel_hispanic
<dbl> <dbl> <dbl>
1 -0.0899 -0.0128 -0.0708
Nope :(
Exogeneity
There’s no statistical test here. Instead we have to tell a theoretical story that distance is uncorrelated with anything else in the DAG.
In theory, living closer to a college should explain or increase the likelihood of attending college, but shouldn’t in turn influence outcomes as an adult, like the propensity to vote.
But in real life, that’s not the case! Black and Hispanic Americans are more likely to live in urban areas, and there are more colleges in urban areas, and race/ethnicity are correlated with voting patterns.
IV estimation
Let’s pretend that this is a good instrument, regardless of whatever we concluded above.
By hand, to make your life miserable
Make a first stage model that predicts college attendance based on distance to college. Control for black
, hispanic
, and female
, since they’re potential confounders (and because Dee originally did that too).
<- lm(college ~ distance + black + hispanic + female,
first_stage data = voting)
Plug the original dataset into the first stage model with augment_columns()
to generate predicted values of college
(or the exogenous part of college
):
<- augment_columns(first_stage, voting) %>%
voting_with_prediction rename(college_hat = .fitted)
Make a second stage model that estimates the effect of predicted college on voter registration, also controlling for black
, hispanic
, and female
.
What is the causal effect of attending college?
<- lm(register ~ college_hat + black + hispanic + female,
second_stage data = voting_with_prediction)
tidy(second_stage)
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.527 0.0460 11.5 2.94e-30
2 college_hat 0.229 0.0816 2.80 5.06e- 3
3 black 0.0686 0.0154 4.46 8.47e- 6
4 hispanic 0.0340 0.0151 2.25 2.42e- 2
5 female 0.00586 0.0102 0.576 5.65e- 1
All at once, to make your life wonderful
Use iv_robust()
to run a 2SLS model all at the same time:
<- iv_robust(register ~ college + black + hispanic + female |
model_all_at_once + black + hispanic + female,
distance data = voting, diagnostics = TRUE)
tidy(model_all_at_once)
term estimate std.error statistic p.value conf.low
1 (Intercept) 0.527284479 0.04596064 11.4725220 2.907318e-30 0.437191451
2 college 0.228786550 0.08139531 2.8108074 4.952157e-03 0.069233727
3 black 0.068604828 0.01476781 4.6455670 3.438411e-06 0.039656662
4 hispanic 0.034046791 0.01501458 2.2675813 2.337778e-02 0.004614883
5 female 0.005863362 0.01004211 0.5838773 5.593171e-01 -0.013821401
conf.high df outcome
1 0.61737751 9222 register
2 0.38833937 9222 register
3 0.09755299 9222 register
4 0.06347870 9222 register
5 0.02554812 9222 register
summary(model_all_at_once)
Call:
iv_robust(formula = register ~ college + black + hispanic + female |
distance + black + hispanic + female, data = voting, diagnostics = TRUE)
Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 0.527284 0.04596 11.4725 2.907e-30 0.437191 0.61738 9222
college 0.228787 0.08140 2.8108 4.952e-03 0.069234 0.38834 9222
black 0.068605 0.01477 4.6456 3.438e-06 0.039657 0.09755 9222
hispanic 0.034047 0.01501 2.2676 2.338e-02 0.004615 0.06348 9222
female 0.005863 0.01004 0.5839 5.593e-01 -0.013821 0.02555 9222
Multiple R-squared: 0.0348 , Adjusted R-squared: 0.03438
F-statistic: 6.717 on 4 and 9222 DF, p-value: 2.149e-05
Diagnostics:
numdf dendf value p.value
Weak instruments 1 9222 137.149 <2e-16 ***
Wu-Hausman 1 9221 0.381 0.537
Overidentifying 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use feols()
to run a 2SLS model all the same time:
# y ~ exogenous | fe | endogenous ~ instrument
<- feols(register ~ black + hispanic + female | college ~ distance,
model_feols data = voting)
tidy(model_feols)
# A tibble: 5 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.527 0.0452 11.7 3.36e-31
2 fit_college 0.229 0.0803 2.85 4.38e- 3
3 black 0.0686 0.0151 4.53 6.00e- 6
4 hispanic 0.0340 0.0149 2.29 2.20e- 2
5 female 0.00586 0.0100 0.586 5.58e- 1
# model_feols$iv_first_stage
# summary(model_feols, stage = 1)
Causal effect
TODO: What is the final causal effect of college attendance on registering to vote?
modelsummary(list(second_stage, model_all_at_once, model_feols))
| (1) | (2) | (3) |
---|---|---|---|
(Intercept) | 0.527 | 0.527 | 0.527 |
(0.046) | (0.046) | (0.045) | |
college_hat | 0.229 | ||
(0.082) | |||
black | 0.069 | 0.069 | 0.069 |
(0.015) | (0.015) | (0.015) | |
hispanic | 0.034 | 0.034 | 0.034 |
(0.015) | (0.015) | (0.015) | |
female | 0.006 | 0.006 | 0.006 |
(0.010) | (0.010) | (0.010) | |
college | 0.229 | ||
(0.081) | |||
fit_college | 0.229 | ||
(0.080) | |||
Num.Obs. | 9227 | 9227 | 9227 |
R2 | 0.003 | 0.035 | 0.035 |
R2 Adj. | 0.002 | 0.034 | 0.034 |
AIC | 12234.6 | 11933.2 | 11931.2 |
BIC | 12277.4 | 11976.0 | 11966.9 |
Log.Lik. | -6111.297 | ||
F | 6.370 | ||
RMSE | 0.47 | 0.46 | 0.46 |
Std.Errors | IID |
Assuming this is a good instrument, going to college causes a 22 percentage point increase in the probability of registering to vote.