Does education increase civic engagement?

Published

March 28, 2024

Social scientists have long been interested in the causal effects of education. We’ve seen a ton of examples of economists looking at the causal effect of education on wages or income. Political scientists, not surprisingly, are less interested in income and more interested in the effect of education on civil behavior. For instance, does education make people vote more?

For this example, we’ll use a subset of data from Dee (2004) (preprocessed and cleaned by Miller (2021)) to explore whether college education causes people to register to vote. (See another version of this at Steve’s website)

library(tidyverse)
library(haven)
library(broom)
library(ggdag)
library(estimatr)
library(fixest)
library(modelsummary)

voting <- read_stata("data/Dee04.dta")

Exploratory data analysis

What proportion of college attendees are registered to vote? Group by register and college, summarize to get the number or rows in each group, then add a column that calculates the proportion.

group_props <- voting %>% 
  group_by(register, college) %>% 
  summarize(total = n()) %>% 
  group_by(college) %>% 
  mutate(prop = total / sum(total))
`summarise()` has grouped output by 'register'. You can override using the
`.groups` argument.
group_props
# A tibble: 4 × 4
# Groups:   college [2]
  register college total  prop
     <dbl>   <dbl> <int> <dbl>
1        0       0  1780 0.426
2        0       1  1257 0.249
3        1       0  2399 0.574
4        1       1  3791 0.751

Visualize these proportions with ggplot() and geom_col()

ggplot(group_props, aes(x = factor(college), y = prop, fill = factor(register))) +
  geom_col()

Naive model

Run a super wrong and naive model that estimates the effect of college attendance on voter registration (register ~ college).

model_naive <- lm(register ~ college, data = voting)
tidy(model_naive)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.574   0.00714      80.4 0       
2 college        0.177   0.00965      18.3 1.03e-73

For fun, make a scatterplot of this relationship:

ggplot(voting, aes(x = college, y = register)) +
  geom_point(alpha = 0.05, position = position_jitter(width = 0.1, height = 0.1)) + 
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

TODO: Interpret this finding. Why is this estimate wrong?

Bonus fun! Logistic regression

model_logit <- glm(register ~ college, family = binomial(link = "logit"), data = voting)
tidy(model_logit)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.298    0.0313      9.54 1.43e-21
2 college        0.805    0.0451     17.8  3.33e-71
# College attendance is associated with a 0.805 increase in the log odds of registering to vote, whatever that means

# Odds ratio
exp(0.805)
[1] 2.236696
# College attendees are 2.23 times more likely to register to vote

# plogis((intercept + coefficient)) - plogis(intercept)
plogis((0.298 + 0.805)) - plogis(0.298)
[1] 0.1768683
# College attendance leads to a 17.6 percentage point increase in the probability of registering to vote

library(marginaleffects)
model_logit |> 
  comparisons(by = "college")

    Term          Contrast college Estimate Std. Error    z Pr(>|z|)     S
 college mean(1) - mean(0)       0    0.177    0.00978 18.1   <0.001 240.8
 college mean(1) - mean(0)       1    0.177    0.00978 18.1   <0.001 240.8
 2.5 % 97.5 %
 0.158  0.196
 0.158  0.196

Columns: rowid, term, contrast, college, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted 
Type:  response 
# "Divide by 4" trick/rule
0.805 / 4
[1] 0.20125

Distance to college as an instrument

We’re stuck with endogeneity. There are things that cause both education and voter registration that confound the relationship, and we can’t control for all of them.

In his paper, Dee (2004) uses distance to the nearest college as an instrument to help remove this exogeneity. He essentially creates this DAG (though without actually making a DAG):

At first glance, this feels like it could be a good instrument:

  1. Relevance (Z → X and cor(Z, X) ≠ 0): Distance to college should be associated with college attendance. The closer a college is, the cheapter it is to attend, and the more opportunity there is to attend.
  2. Excludability (Z → X → Y and Z not → Y and cor(Z, Y | X) = 0): Distance affects college attendance which affects voting registration. But distance should influence voting registration only because people go to college (and no other reason).
  3. Exogeneity (U not → Z and Cor(Z, U) = 0): Colleges should exist before students exist; students and their voting patterns don’t influence whether pre-existing colleges exist (i.e. there should be no arrows going into the instrument node).

Let’s check these conditions

Relevance

See if there’s correlation between the instrument (distance) and the treatment (college). Use the cor() function (and cor.test() if you want a p-value):

model_check_relevance <- lm(college ~ distance, data = voting)
tidy(model_check_relevance)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  0.609    0.00773       78.8 0       
2 distance    -0.00637  0.000592     -10.8 7.35e-27
voting %>% 
  summarize(relevance = cor(distance, college))
# A tibble: 1 × 1
  relevance
      <dbl>
1    -0.111
cor.test(voting$distance, voting$college)

    Pearson's product-moment correlation

data:  voting$distance and voting$college
t = -10.764, df = 9225, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.13147886 -0.09117564
sample estimates:
      cor 
-0.111373 

Plot the relationship between the instrument and treatment:

ggplot(voting, aes(x = distance, y = college)) +
  geom_point(alpha = 0.01) +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Yep!

Exclusion

See if there’s a relationship between the instrument (distance) and the outcome (register).

model_check_exclusion <- lm(distance ~ register, data = voting)
tidy(model_check_exclusion)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   10.2       0.158     64.3  0      
2 register      -0.620     0.193     -3.22 0.00130
voting %>% 
  summarize(relevance = cor(distance, register))
# A tibble: 1 × 1
  relevance
      <dbl>
1   -0.0335
cor.test(voting$distance, voting$register)

    Pearson's product-moment correlation

data:  voting$distance and voting$register
t = -3.2165, df = 9225, p-value = 0.001302
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.05383772 -0.01307421
sample estimates:
        cor 
-0.03346988 

To help check the “only through” condition, see if there’s a relationship between distance and other possibly confounding variables (like black, female, hispanic, and so on). If it’s related, that’s a good sign that there’s an arrow between those nodes, thus breaking the exclusion requirement.

voting %>% 
  summarize(rel_black = cor(distance, black),
            rel_female = cor(distance, female),
            rel_hispanic = cor(distance, hispanic))
# A tibble: 1 × 3
  rel_black rel_female rel_hispanic
      <dbl>      <dbl>        <dbl>
1   -0.0899    -0.0128      -0.0708

Nope :(

Exogeneity

There’s no statistical test here. Instead we have to tell a theoretical story that distance is uncorrelated with anything else in the DAG.

In theory, living closer to a college should explain or increase the likelihood of attending college, but shouldn’t in turn influence outcomes as an adult, like the propensity to vote.

But in real life, that’s not the case! Black and Hispanic Americans are more likely to live in urban areas, and there are more colleges in urban areas, and race/ethnicity are correlated with voting patterns.

IV estimation

Let’s pretend that this is a good instrument, regardless of whatever we concluded above.

By hand, to make your life miserable

Make a first stage model that predicts college attendance based on distance to college. Control for black, hispanic, and female, since they’re potential confounders (and because Dee originally did that too).

first_stage <- lm(college ~ distance + black + hispanic + female,
                  data = voting)

Plug the original dataset into the first stage model with augment_columns() to generate predicted values of college (or the exogenous part of college):

voting_with_prediction <- augment_columns(first_stage, voting) %>% 
  rename(college_hat = .fitted)

Make a second stage model that estimates the effect of predicted college on voter registration, also controlling for black, hispanic, and female.

What is the causal effect of attending college?

second_stage <- lm(register ~ college_hat + black + hispanic + female,
                   data = voting_with_prediction)
tidy(second_stage)
# A tibble: 5 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  0.527      0.0460    11.5   2.94e-30
2 college_hat  0.229      0.0816     2.80  5.06e- 3
3 black        0.0686     0.0154     4.46  8.47e- 6
4 hispanic     0.0340     0.0151     2.25  2.42e- 2
5 female       0.00586    0.0102     0.576 5.65e- 1

All at once, to make your life wonderful

Use iv_robust() to run a 2SLS model all at the same time:

model_all_at_once <- iv_robust(register ~ college + black + hispanic + female |
                                 distance + black + hispanic + female,
                               data = voting, diagnostics = TRUE)
tidy(model_all_at_once)
         term    estimate  std.error  statistic      p.value     conf.low
1 (Intercept) 0.527284479 0.04596064 11.4725220 2.907318e-30  0.437191451
2     college 0.228786550 0.08139531  2.8108074 4.952157e-03  0.069233727
3       black 0.068604828 0.01476781  4.6455670 3.438411e-06  0.039656662
4    hispanic 0.034046791 0.01501458  2.2675813 2.337778e-02  0.004614883
5      female 0.005863362 0.01004211  0.5838773 5.593171e-01 -0.013821401
   conf.high   df  outcome
1 0.61737751 9222 register
2 0.38833937 9222 register
3 0.09755299 9222 register
4 0.06347870 9222 register
5 0.02554812 9222 register
summary(model_all_at_once)

Call:
iv_robust(formula = register ~ college + black + hispanic + female | 
    distance + black + hispanic + female, data = voting, diagnostics = TRUE)

Standard error type:  HC2 

Coefficients:
            Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper   DF
(Intercept) 0.527284    0.04596 11.4725 2.907e-30  0.437191  0.61738 9222
college     0.228787    0.08140  2.8108 4.952e-03  0.069234  0.38834 9222
black       0.068605    0.01477  4.6456 3.438e-06  0.039657  0.09755 9222
hispanic    0.034047    0.01501  2.2676 2.338e-02  0.004615  0.06348 9222
female      0.005863    0.01004  0.5839 5.593e-01 -0.013821  0.02555 9222

Multiple R-squared:  0.0348 ,   Adjusted R-squared:  0.03438 
F-statistic: 6.717 on 4 and 9222 DF,  p-value: 2.149e-05

Diagnostics:
                 numdf dendf   value p.value    
Weak instruments     1  9222 137.149  <2e-16 ***
Wu-Hausman           1  9221   0.381   0.537    
Overidentifying      0    NA      NA      NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Use feols() to run a 2SLS model all the same time:

# y ~ exogenous | fe | endogenous ~ instrument
model_feols <- feols(register ~ black + hispanic + female | college ~ distance,
                     data = voting)
tidy(model_feols)
# A tibble: 5 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  0.527      0.0452    11.7   3.36e-31
2 fit_college  0.229      0.0803     2.85  4.38e- 3
3 black        0.0686     0.0151     4.53  6.00e- 6
4 hispanic     0.0340     0.0149     2.29  2.20e- 2
5 female       0.00586    0.0100     0.586 5.58e- 1
# model_feols$iv_first_stage
# summary(model_feols, stage = 1)

Causal effect

TODO: What is the final causal effect of college attendance on registering to vote?

modelsummary(list(second_stage, model_all_at_once, model_feols))

(1)

(2)

(3)

(Intercept)

0.527

0.527

0.527

(0.046)

(0.046)

(0.045)

college_hat

0.229

(0.082)

black

0.069

0.069

0.069

(0.015)

(0.015)

(0.015)

hispanic

0.034

0.034

0.034

(0.015)

(0.015)

(0.015)

female

0.006

0.006

0.006

(0.010)

(0.010)

(0.010)

college

0.229

(0.081)

fit_college

0.229

(0.080)

Num.Obs.

9227

9227

9227

R2

0.003

0.035

0.035

R2 Adj.

0.002

0.034

0.034

AIC

12234.6

11933.2

11931.2

BIC

12277.4

11976.0

11966.9

Log.Lik.

-6111.297

F

6.370

RMSE

0.47

0.46

0.46

Std.Errors

IID

Assuming this is a good instrument, going to college causes a 22 percentage point increase in the probability of registering to vote.

References

Dee, Thomas S. 2004. “Are There Civics Returns to Education?” Journal of Public Economics 88 (9–10): 1697–1720. https://doi.org/10.1016/j.jpubeco.2003.11.002.
Miller, Steve. 2021. Stevedata: Steve’s Toy Data for Teaching about a Variety of Methodological, Social, and Political Topics. https://CRAN.R-project.org/package=stevedata.