Causality: Regression Discontinuity

Case Study: Impact of minimum legal drinking age on death rates

Author

Mike Aguilar | https://www.linkedin.com/in/mike-aguilar-econ/

Introduction

This exercise is based on the study of changes in the minimum legal drinking age, which is detailed in Mastering Metrics.

The motivation for the study is an attempt by policy makers to reduce death of young drivers who are under the influence of alcohol. The policy makers want to use a natural experiment via regression discontinuity.

Specifically, we have data for a time and location wherein the minimum legal drinking age was 21yrs. We want to compare the death rates of young people above and below that age in order to reveal the impact this drinking rule has on deaths. If we can show that the death rate is higher for drivers over 21yrs old, the policy makers can use that as evidence that there is a relationship between drinking and driving deaths.

Note: the study is not designed to investigate the death of young drivers who are drinking illegally.

Key data items:

unit of observation (i): individuals are grouped according to age by 30day increments. e.g. everyone aged 19yrs and 0days old through 19yrs and 30days old is captured in observation 1; everyone aged 19yrs and 31days through 19yrs and 60days is captured in observation 2, etc..
mva: death rate from motor vehicle accidents (# of deaths per 100K people per year); e.g. 31 for observation i implies that there were 31 deaths from motor vehicle accidents in a year per 100K people within age group i.
agecell: average age of individuals within group i

Using this document

Throughout this code example, you’ll see several questions indicated by “Q”. Each question is followed by a solution, indicated by “A”.

The best way to learn this material is through active participation. I suggest that you attempt to formulate your answers to each question before viewing the prepared “A” answer.

Setting up the DAG

Q: What is the treatment, outcome, treatment effect, treated and control groups?

Treatment: being over 21yrs old
Outcome: motor vehicle death rate
Treatment Effect: The impact of being 21yrs or older on the likelihood of dying in a car accident
Treated Group: those 21 and older
Control Group: those younger than 21

Q: Is the treatment variable available in your dataframe within the format you need? Is the outcome variable available in your dataframe within the format you need?

The outcome is presented in a fine manner
The treatment needs to be created. We have the running variable (agecell), but not the treatment variable (being 21 or older).

Housekeeping

knitr::opts_chunk$set(echo = TRUE)
rm(list=ls()) # clear workspace
cat("\014")  # clear console

library("dplyr")


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library("ggplot2")
library("tidyverse")

Warning: package 'tidyverse' was built under R version 4.3.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ lubridate 1.9.2     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
✔ readr     2.1.4

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data

load(file="mlda.rda")

Clean the data

Follow the authors of the original study by dropping the NAs and cleaning up the extraneous items we won’t need for our study.

mlda<-na.omit(mlda)
mlda<-mlda %>%
  select(-contains("fitted"))

Create the treatment variable

mlda<-mutate(mlda,over21 = as.factor(agecell >= 21))

EDA

In the following, we will focus on motor vehicle accidents (mva) unless otherwise stated.

Summarize the outcome

summary(mlda$mva)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  26.86   30.12   31.64   31.62   33.10   36.39

Summarize the running variable

summary(mlda$agecell)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  19.07   20.03   21.00   21.00   21.97   22.93

Summarize the treatment variable

summary(mlda$over21)

FALSE  TRUE 
   24    24

ATE via DiM

Summarize the outcome for treat / control groups

mlda%>%
  group_by(over21) %>%
  summarise(MVA.min = min(mva), MVA.mean = mean(mva), MVA.max=max(mva))

# A tibble: 2 × 4
  over21 MVA.min MVA.mean MVA.max
  <fct>    <dbl>    <dbl>   <dbl>
1 FALSE     29.7     32.5    36.4
2 TRUE      26.9     30.8    36.3

Q: What does the summary output suggest about the treatment effect?

A: Being younger than 21yrs is associated with higher death rates. This suggests a negative ATE.

Q: Compute the ATE via Difference in Means (DiM).

mean.young <- mlda %>%
  filter(agecell<21) %>%
  summarize(mean.young = mean(mva))
mean.old <- mlda %>%
  filter(agecell>=21) %>%
  summarize(mean.young = mean(mva))
ATE.DiM = mean.old - mean.young
ATE.DiM

  mean.young
1  -1.677356

Q: Interpret this ATE estimate.

A: Among everyone in the sample, being 21yrs or older lowered the death rate by 1.677 percentage points relative to what it would have potentially been had they been younger than 21yrs old.

Q: Does this support the policy maker’s efforts?

A: No. This is the opposite conclusion they were expecting / hoping to see. They wanted to show that being permitted to drink increases the death rate from driving.

Q: What concerns do you have about this DiM estimate?

A: There is likely OVB. Older drivers are more experienced, and so we would expect their death rate to be lower than younger drivers.

Regression Discontinuity

Motivation

Q: How should you structure a sharp RD to address the concerns you expressed above? Explain why this may work.

A: We can control for these age related omitted confounders (e.g. experience) by including the running variable agecell into the specification.

Diagnostics

Discontinuity at Threshold

Q: Does the following graph support / debunk a sharp RD approach?

varlist <- c("all" = "All Causes",
             "mva" = "Motor Vehicle Accidents",
             "internal" = "Internal Causes")

mlda %>%
  select(agecell, over21, one_of(names(varlist))) %>%
  gather(response, value, -agecell, -over21, na.rm = TRUE) %>%
  mutate(response = recode(response, !!!as.list(varlist))) %>%
  ggplot(aes(x = agecell, y = value)) +
  geom_point() +
  geom_smooth(mapping = aes(group = over21), se = FALSE, method = "lm",
              formula = y ~ poly(x, 2)) +
  geom_smooth(mapping = aes(group = over21), se = FALSE, method = "lm",
              formula = y ~ x, color = "black") +
  facet_grid(response ~ ., scales = "free_y") +
  labs(y = "Mortality rate (per 100,000)", x = "Age")

Focus on the last panel.
Supports.
The discontinuity at 21yrs old is evidence that RD might be appropriate.
The DiM estimate was detecting the general downward slope of the regression line. However, it overlooked the break at 21yrs old.
Interestingly, there is very little evidence of a break at 21yrs old for internal causes of death. This provides some circumstantial evidence that driving is indeed important.

Manipulation

Q: How can we search for possible manipulation of the treatment variable? Implement that method and comment on the presence of manipulation.

hist(mlda$agecell)

There is a uniform distribution of drivers across the age groups.
This suggests there is no manipulation
By contrast, if we saw a lot of people just over 21yrs old and very few just younger, that might suggest there is manipulation
This lack of manipulation makes sense given the running variable is something as immutable as age. Of course, people can lie about their age, but that is difficult to do on an official driver license.

ATE

Q: What is the numerical estimate of the sharp RD ATE? Is it statistically significant? Interpret that value within our case. Is the finding supportive of the policy maker’s efforts?

sharp.RD<-lm(mva~over21+agecell, data = mlda)
summary(sharp.RD)


Call:
lm(formula = mva ~ over21 + agecell, data = mlda)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.5318 -0.8494 -0.1800  0.7577  3.3094 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  95.4814     6.7549  14.135  < 2e-16 ***
over21TRUE    4.5340     0.7680   5.904 4.34e-07 ***
agecell      -3.1488     0.3372  -9.337 4.26e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.329 on 45 degrees of freedom
Multiple R-squared:  0.7025,    Adjusted R-squared:  0.6893 
F-statistic: 53.14 on 2 and 45 DF,  p-value: 1.419e-12

4.5340
Across everyone in our sample, being 21yrs or older causes the death rate to be 4.5 percentage points higher than it otherwise would have been if they were younger than 21yrs old.
This supports the policy maker’s efforts.

Nolinearities

Q: What might be the motivation for including a quadratic term on the running variable in the specification?

A: Possibility to improve goodness of fit around the threshold

Q: What is the estimate of the ATE with a quadratic term? Does this adjustment of the model seem useful?

mlda<-mlda%>%
  mutate(agesq=agecell^2)
sharp.RD.quad<-lm(mva~over21+agecell+agesq, data = mlda)
summary(sharp.RD.quad)


Call:
lm(formula = mva ~ over21 + agecell + agesq, data = mlda)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4970 -0.8068 -0.2596  0.7981  3.1540 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -28.3775    71.5235  -0.397    0.693    
over21TRUE    4.5340     0.7513   6.035    3e-07 ***
agecell       8.6820     6.8106   1.275    0.209    
agesq        -0.2817     0.1620  -1.739    0.089 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.3 on 44 degrees of freedom
Multiple R-squared:  0.7217,    Adjusted R-squared:  0.7027 
F-statistic: 38.03 on 3 and 44 DF,  p-value: 2.778e-12

The agecell and agesq variables are both insignificant (at the 5% level).
This doesn’t mean we should drop both.
It appears that the information contained within the running variable is now being essentially split among these two covariates.
Notice that the treatment effect is unchanged. That is a good sign that we have not introduced any new information that is sharpening our estimates.
We can proceed with a linear specification for ease.

Q: Can you run a joint hypothesis test to determine if the information contained within the running variable and its squared are jointly important?

library(car)

Warning: package 'car' was built under R version 4.3.2

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:purrr':

    some

The following object is masked from 'package:dplyr':

    recode

linearHypothesis(sharp.RD.quad,c("agecell=0","agesq=0"),test="F")

Linear hypothesis test

Hypothesis:
agecell = 0
agesq = 0

Model 1: restricted model
Model 2: mva ~ over21 + agecell + agesq

  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     46 233.579                                  
2     44  74.407  2    159.17 47.063 1.175e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A: - We reject the null that at least one of these two variables (age or age2) is important.
- This supports our notion of keeping the running variable, and doing so linearly is fine.

Differing Running Marginal Effects

Q: Test whether the marginal effect of age on mva differs for those over vs under 21yrs old.

In order to isolate the ATE, we need to recenter the running variable around the threshold.

mlda <- mutate(mlda, agerecenter = agecell - 21)
sharp.RD.interact2<-lm(mva~over21+agerecenter+over21*agerecenter, data = mlda)
summary(sharp.RD.interact2)


Call:
lm(formula = mva ~ over21 + agerecenter + over21 * agerecenter, 
    data = mlda)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4124 -0.7774 -0.2913  0.8495  3.2378 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             29.9292     0.5308  56.390  < 2e-16 ***
over21TRUE               4.5340     0.7506   6.041 2.94e-07 ***
agerecenter             -2.5676     0.4661  -5.508 1.77e-06 ***
over21TRUE:agerecenter  -1.1624     0.6592  -1.763   0.0848 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.299 on 44 degrees of freedom
Multiple R-squared:  0.7222,    Adjusted R-squared:  0.7032 
F-statistic: 38.13 on 3 and 44 DF,  p-value: 2.671e-12

The interaction effect doesn’t appear terribly significant.
Importantly, the ATE hasn’t changed, implying that this modification is not important to this aspect of the model.

Narrowing the caliper

Q: What are the (dis)advantages of narrowing the caliper?

Advantage: sharper estimates because less extraneous factors that can vary away from the threshold
Disadvantage: fewer observations

Q: What is the ATE estimate with a caliper set at 20-21yrs old. You can use the simple linear model.

sharp.RD.caliper<-lm(mva~over21+agecell, data = mlda,subset = agecell >20 & agecell<22)
summary(sharp.RD.caliper)


Call:
lm(formula = mva ~ over21 + agecell, data = mlda, subset = agecell > 
    20 & agecell < 22)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1088 -0.8810 -0.3918  0.7157  3.1297 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 102.4206    19.7941   5.174 3.98e-05 ***
over21TRUE    4.7593     1.0981   4.334 0.000292 ***
agecell      -3.4683     0.9651  -3.594 0.001708 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.341 on 21 degrees of freedom
Multiple R-squared:  0.4736,    Adjusted R-squared:  0.4234 
F-statistic: 9.445 on 2 and 21 DF,  p-value: 0.001186

The ATE is now 4.753
The sign, siginificance,and basic magnitude are the same
This gives us confidence in our results.