Causal Inference with Linear Regression for Beginners

Class 4

Causal Inference

Predictive vs Prescriptive

In Data Analytics there are often three types of answers

Descriptive - Aim is to aggregate and describe your current data (a snapshot)
- Tables, Charts, Maps, Tableau
Predictive - Aim is to predict the dependent variable. How will it change in the near future?
- Machine Learning
- Time Series Forecast
Prescriptive - Aim is to explain the dependent variable. What is the effect of your advertising campaign? Why are workers leaving your firm?
- What this course is about!

Linear Regression Models for Causal Inference

In these notes we will cover:

Randomized Experiments
Difference-in-Differences
Instrumental Variables
Regression Discontinuity
Maybe Matching and Propensity Score Matching

Why does Data Science need Causal Inference?

A/B testing is not always available

What are the effects of spanking on labor outcomes?
What are the long-term effects of smoking in utero?
What are the long-term effects of fertility in your 20’s vs 30’s?
What are the effects of crack cocaine on productivity?

Who wants to sign up to be randomly assigned in one of these experiments?

Randomized control trials can be very expensive!

“one study found 28 Phase III RCTs funded by the National Institute of Neurological Disorders and Stroke prior to 2000 with a total cost of US\$335 million, for a mean cost of US\$12 million per RCT.”

Why does Data Science need Causal Inference?

Big Data does not solve all problems

The most famous case is the Google Flu Trends (GFT) Algorithm.

GFT was meant to be an early warning system for flu season, at times outperforming the CDC.

But, then it went bad—and failed spectacularly—missing at the peak of the 2013 flu season by 140 percent. GFT went quickly from the poster child of big data to the poster child of the foibles of big data – of big data hubris.

As Tim Harford concluded in his article:

“Big data” has arrived, but big insights have not. The challenge now is to solve new problems and gain new answers – without making the same old statistical mistakes on a grander scale than ever.

Why does Data Science need Causal Inference?

Machines are only as smart as we train them to be.

For example, I can run a supervised machine learning algorithm that shows the computer a series of cats and flowers.

The program does a good job and classifies cats and flowers with 98% accuracy.

Then I show it a picture of a dog. What happens?

Two Types of Causal Questions

Two types of causal questions (Gelman and Imbens 2013):

Reverse causal inference: search for causes of effects (Why?).
- Why does the United States perform so poorly in Math standardized exams?
Forward causal questions: estimation of effects of causes (What if …?).
- Does a teacher’s IQ affect students’ performance? Does class size?

Causal effects

We are motivated by why questions but, when conducting our analysis, we tend to proceed by addressing what if questions.

Examples:

How does taking this course affect your earnings in three years?
- Note that this is different from a predictive question: “What will be the earnings of students taking this course?”
If Uber increases prices, how would it affect demand?
Does the death penalty decrease crime rates?
Would it be profitable for a firm to allow employees to work from home? (Yahoo 2013)
Are employees more satisfied if they are informed about the salaries of their colleagues? (Card, Mas, Moretti and Saez 2012)

Correlation Does not Imply Causation

Two well-known properties
1. Correlation does not imply causation
2. y can cause x even if x takes place before y
A less well-known point: a correlation between x and y can arise in several ways
- y can cause x
- x can cause y
- z can cause both x and y
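
The third case, where z drives both x and y, can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names and the coefficients 2 and 3 are assumptions chosen for illustration. By construction x has no effect on y, yet the two are strongly correlated until z is controlled for.

```r
# Simulated data: z is a hypothetical confounder that drives both x and y.
set.seed(1)
n <- 10000
z <- rnorm(n)
x <- 2 * z + rnorm(n)   # x depends on z only
y <- 3 * z + rnorm(n)   # y depends on z only; x has no effect on y

# The raw correlation is strong even though x does not cause y
raw_cor <- cor(x, y)

# Regressing y on x alone attributes the influence of z to x
naive_slope <- unname(coef(lm(y ~ x))["x"])

# Controlling for z recovers the true (zero) effect of x
adjusted_slope <- unname(coef(lm(y ~ x + z))["x"])

print(c(raw_cor = raw_cor, naive = naive_slope, adjusted = adjusted_slope))
```

With these assumed coefficients the population value of the naive slope is Cov(x, y)/Var(x) = 6/5, while the slope after adjusting for z should be close to zero.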

Examples

Potentially confusing examples:

Red cars are more likely to get involved in accidents
People that sleep less tend to live longer
Students living in households where there are more books tend to have higher GPAs.
Countries that eat more chocolate receive more Nobel prizes (Messerli 2012)

Randomized Experiments

How to Estimate Causal Effects

In the physical sciences:

Often, one can answer this type of question by running an experiment on a specific unit.
Example: Galileo Galilei’s Leaning Tower of Pisa experiment
- Necessary conditions (ceteris paribus, other things being equal):
- Temporal stability: the response does not change if we change the moment when the treatment is applied.
- Causal transience: the response of one treatment is not affected by prior exposure of the unit to the other treatment.
- Unit homogeneity: homogeneity with respect to treatments and responses.

How to Estimate Causal Effects

In the social sciences:

None of these assumptions is plausible.
We use the statistical solution: estimate average causal effect of the treatment over the population of units.
Intuition: “all other things being equal” conditions are likely to be satisfied on average across treated and non-treated units if the treatment is randomly assigned.

Health Insurance Experiment

Suppose we are interested in the effect of health insurance on a person’s health

Let’s think of a treatment (getting insurance) for individual $i$ as a binary random variable $D_i \in \{0, 1\}$, with potential outcomes (counterfactuals) $Y_{0i}$ and $Y_{1i}$.

$Y_{1i}$ = A measure of person $i$’s health given they have insurance ($D_i=1$).

$Y_{0i}$ = A measure of person $i$’s health given they do not have insurance ($D_i=0$).

The individual treatment effect is $Y_{1i} - Y_{0i}$

Unfortunately, for $i$, we only observe $Y_{1i}$ if $D_i$ = 1 and $Y_{0i}$ if $D_i$ = 0

For any individual $i$, we only observe $Y_i=D_iY_{1i}+(1-D_i)Y_{0i}$

AVERAGE TREATMENT EFFECT

The problem is we cannot observe you as both having and not having insurance.

The solution is to look for the AVERAGE TREATMENT EFFECT (ATE)! \[E[Y_{1i}] - E[Y_{0i}]\] But a naive comparison of averages does not tell us what we want to know: \[E[Y_{1i}|D_{i} = 1] - E[Y_{0i}|D_{i} = 0] \] \[ =\underbrace{E[Y_{1i}|D_{i} = 1] - E[Y_{0i}|D_{i} = 1]}_{\text{ATT}}+\underbrace{E[Y_{0i}|D_{i} = 1] - E[Y_{0i}|D_{i} = 0]}_{\text{Selection Bias}} \] The first term is the average treatment effect on the treated (ATT); it equals the ATE only when the treated and the untreated would respond to treatment in the same way on average.

Average Treatment Effect on the Treated

The average treatment effect (ATE) and the average treatment effect on the treated (ATT) need not be the same, and the distinction is sometimes important.

They will be the same only if the average treatment effect is homogeneous across groups: \[E[Y_{1i} - Y_{0i}|D_i = 1] = E[Y_{1i} - Y_{0i}|D_i = 0] = E[Y_{1i} - Y_{0i}]\]

This holds, in particular, when the treatment is assigned randomly.

Random Assignment

We want to understand what would have happened to the treated in the absence of treatment and thus overcome the selection problem…

Solution : Random assignment

Random Assignment

Random assignment makes $D_i$ independent of potential outcomes, hence:

the selection effect is zeroed out and
the treatment effect on the treated is equal to the ATE.
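
A small simulation can make the role of random assignment concrete. The sketch below uses made-up potential outcomes with a constant treatment effect of 5; the variable names, the self-selection rule, and all numbers are assumptions for illustration only. It compares the naive difference in means when healthier people select into insurance with the same difference when insurance is assigned by coin flip.

```r
set.seed(42)
n <- 10000
y0 <- rnorm(n, mean = 50, sd = 10)   # health without insurance
y1 <- y0 + 5                         # health with insurance: true effect = 5

# Case 1: self-selection. Healthier people are more likely to buy insurance.
d_self <- as.numeric(y0 + rnorm(n, sd = 10) > 50)
y_self <- d_self * y1 + (1 - d_self) * y0
naive_self <- mean(y_self[d_self == 1]) - mean(y_self[d_self == 0])
att_self   <- mean(y1[d_self == 1]) - mean(y0[d_self == 1])
bias_self  <- mean(y0[d_self == 1]) - mean(y0[d_self == 0])

# Case 2: random assignment by coin flip
d_rand <- rbinom(n, 1, 0.5)
y_rand <- d_rand * y1 + (1 - d_rand) * y0
naive_rand <- mean(y_rand[d_rand == 1]) - mean(y_rand[d_rand == 0])
bias_rand  <- mean(y0[d_rand == 1]) - mean(y0[d_rand == 0])

print(c(naive_self = naive_self, att_self = att_self, bias_self = bias_self,
        naive_rand = naive_rand, bias_rand = bias_rand))
```

In the self-selected case the naive difference equals the effect on the treated plus a selection term, exactly as in the decomposition; under random assignment the selection term should be close to zero, so the naive difference should be close to 5.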

Types of randomized experiments

A randomized experiment is designed and implemented consciously by social scientists. It entails the conscious use of a treatment and a control group with random assignment.

Lab experiments
- The effect of feedback on relative performance (Azmat and Iriberri 2012)
Field experiments (lab-in-the-field or artefactual experiments)
- The effect of feedback on relative performance (Bagues et al., work in progress)
Natural experiments - have a source of variation that is “as if” randomly assigned, but this variation was not part of a conscious randomized treatment and control design
- Vietnam-era service effect on education and earnings (Flory, Leibbrandt and List 2010)

Who uses Experiments in Business?

Tech Companies like Google, Facebook, and Amazon are positioned to use experiments.

They embraced the idea of “Data-Based Management”, where the results of experiments are taken over the advice of HiPPOs (Highest Paid Person’s Opinion)

The A/B Test: Inside the Technology That’s Changing the Rules of Business, Wired Magazine, April 2012

In Praise of Data-Driven Management (AKA “Why You Should be Skeptical of HiPPO’s”)

Potential drawbacks of RCTs

Experiments provide a very transparent and simple empirical strategy and they solve the selection bias. However, there are a number of potential problems:

Problems of implementation
- Compliance and attrition
- Cost, political issues (policy makers need to acknowledge ignorance),…
Ethical issues
- Note that the ethical argument is not obvious when (i) the treatment cannot be applied to everybody (maybe due to some budget constraints) and (ii) the optimal assignment rule is unknown.
Hawthorne effect
- The Illumination Experiment (Landsberger 1950, Levitt and List 2011)
- Audit study in France (Behaghel et al. 2015)

Rand Health Insurance Study

Conducted between 1974 and 1982

Randomly assigned thousands of non-elderly individuals and families to different insurance plan designs

Plans ranged from free care to $1,000 deductible (basically) with variations in between

Comparable deductible today is at least $4,000

Studied effects on health spending and health outcomes

Rand Health Insurance

Plan Types
plantype	n	pct
Catastrophic	759	0.1918120
Deductible	881	0.2226434
Coinsurance	1022	0.2582765
Free	1295	0.3272681

Patient Characteristics

Patient Characteristics
variable	Mean	Std. Dev.
age	32.36	12.92
blackhisp	0.17	0.38
educper	12.10	2.88
female	0.56	0.50
ghindx	70.86	14.91
hosp	0.12	0.32
income1cpi	31603.21	18148.25
mhix	75.50	14.75

Patient Characteristics by Plan

Plan Demographics - Catastrophic
response	(Intercept)	Coinsurance	Deductible	Free
age	32.4 (0.485)	0.966 (0.655)	0.561 (0.676)	0.435 (0.614)
blackhisp	0.172 (0.0199)	-0.0269 (0.025)	-0.0188 (0.0266)	-0.0281 (0.0245)
educper	12.1 (0.14)	-0.0613 (0.186)	-0.157 (0.191)	-0.263 (0.183)
female	0.56 (0.0118)	-0.0247 (0.0153)	-0.0231 (0.016)	-0.0379 (0.015)
ghindx	70.9 (0.694)	0.211 (0.922)	-1.44 (0.952)	-1.31 (0.872)
hosp	0.115 (0.0117)	-0.00249 (0.0152)	0.00449 (0.016)	0.00117 (0.0146)
income1cpi	31,603 (1,073)	970 (1,391)	-2,104 (1,386)	-976 (1,346)
mhix	75.5 (0.696)	1.07 (0.872)	0.454 (0.911)	0.433 (0.826)
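
A table of this shape can be produced by regressing each characteristic on plan dummies, with the Catastrophic plan as the omitted group. The sketch below shows the idea on simulated data, since the RAND data are not loaded in these notes: the data frame `dat`, its columns, and all numbers are hypothetical. The intercept is the mean of the reference group and each coefficient is a difference in means relative to it.

```r
set.seed(7)
n <- 2000
plans <- c("Catastrophic", "Coinsurance", "Deductible", "Free")
# Hypothetical data: plan assigned at random, age unrelated to plan
dat <- data.frame(
  plantype = factor(sample(plans, n, replace = TRUE)),
  age      = rnorm(n, mean = 32, sd = 13)
)

# Factor levels are alphabetical, so Catastrophic is the reference group
fit <- lm(age ~ plantype, data = dat)
group_means <- tapply(dat$age, dat$plantype, mean)

# Intercept = mean age in the Catastrophic group;
# each coefficient = that plan's mean minus the Catastrophic mean
print(round(coef(fit), 3))
print(round(group_means - group_means["Catastrophic"], 3))
```

Under random assignment the differences should be small relative to their standard errors, which is what a balance table is meant to check.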

Patient Health by Plan

Patient Health Spending by Plan
response	(Intercept)	Cost Sharing	Deductible	Free
ftf	2.78 (0.178)	0.481 (0.24)	0.193 (0.247)	1.66 (0.248)
inpdol_inf	388 (44.9)	92.5 (72.8)	72.2 (68.6)	116 (59.8)
out_inf	248 (14.8)	59.8 (20.7)	41.8 (20.8)	169 (19.9)
tot_inf	636 (54.5)	152 (84.6)	114 (79.1)	285 (72.4)
totadm	0.0991 (0.00785)	0.0023 (0.0108)	0.0159 (0.0109)	0.0288 (0.0105)

References

Experiments and Potential Outcomes MM, Chapter 1

J. Angrist, D. Lang, and P. Oreopoulos, “Incentives and Services for College Achievement: Evidence from a Randomized Trial”, American Economic Journal: Applied Economics, Jan. 2009.

A. Aron-Dine, L. Einav, and A. Finkelstein, “The RAND Health Insurance Experiment Three Decades Later”, J. of Economic Perspectives 27 (Winter 2013), 197-222.

R.H. Brook, et al., “Does Free Care Improve Adults’ Health?”, New England J. of Medicine 309 (Dec. 8, 1983), 1426-1434.

S. Taubman, et al., “Medicaid Increases Emergency-Department Use: Evidence from Oregon’s Health Insurance Experiment”, Science, Jan 2, 2014.

Difference in Differences

An Alternative Approach

We saw previously that RCTs are the ideal empirical study.

When an RCT is unavailable, then provided we observe enough covariates to eliminate all forms of selection and omitted variable bias, we can use regression to estimate accurate causal effects.

Alternative Strategies

But sometimes we find ourselves in a situation where an RCT is not feasible, and it is impossible to observe all the important ways in which the treated and control units differ.

In this case, there are three additional empirical strategies typically used:
- Difference in Differences
- Instrumental Variables
- Regression Discontinuity

Today, we will look at dif-in-dif.

Framework

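Recall the potential outcome framework. When we estimate a treatment-control contrast, what we get is: \[E(Y|D=1)-E(Y|D=0)=\delta+E(Y_0|D=1)-E(Y_0|D=0)\] where $\delta$ is the average treatment effect on the treated, which coincides with the ATE when treatment effects are homogeneous.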

This equation says that the average of the treated group minus the average of the control group is the average treatment effect plus selection bias (AKA omitted variable bias in the regression framework).

Parallel Trends

We will now explore another way to get rid of the selection bias.

Suppose we have data on the outcome variable for our treatment and control group from the previous period. Call this $Y_{pre}$.

Now suppose further that: \[E(Y_0|D=1)-E(Y_{pre}|D=1)=E(Y_0|D=0)-E(Y_{pre}|D=0)\] This assumption is known as the parallel trends assumption and is crucial for getting compelling estimates in the dif-in-dif framework.

What does this assumption mean?

Parallel Trends

It says that if the treatment group had never been treated, the average change in the outcome variable would have been identical to the average change in the outcome variable for the control group.

How plausible this assumption is depends upon the given study you are examining.

For now, let’s assume it is true, and see how this can help us kill the selection bias.

Difference in Difference

Suppose instead of just comparing the average treatment outcome to the average control outcome, we use the pre-period data to compare the average change in the treatment group to the average change in the control group.

That is, we calculate: \[E(Y|D=1)-E(Y_{pre}|D=1)-[E(Y|D=0)-E(Y_{pre}|D=0)]\] This is a difference in difference or dif-in-dif estimate.
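
To see why this removes the selection bias, the steps can be written out. This is a sketch using only the notation already introduced, with $\delta = E(Y_1 - Y_0|D=1)$, so that $E(Y|D=1) = \delta + E(Y_0|D=1)$ and $E(Y|D=0) = E(Y_0|D=0)$:

\[\begin{align*}
& E(Y|D=1)-E(Y_{pre}|D=1)-[E(Y|D=0)-E(Y_{pre}|D=0)] \\
&= \delta + [E(Y_0|D=1)-E(Y_{pre}|D=1)] - [E(Y_0|D=0)-E(Y_{pre}|D=0)] \\
&= \delta
\end{align*}\]

The last step uses the parallel trends assumption, under which the two bracketed changes are equal and cancel.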

Difference in Differences: Animated

Difference in Differences: Differences in Means Example

Difference in Differences: Graphical Approach

\[\begin{align*}Y&= \beta_0 + \beta_1*[Time] + \beta_2*[Intervention] \\ &+ \beta_3*[Time*Intervention] + \beta_4*[Covariates]+\epsilon\end{align*}\]
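
The link between this regression and the four group means can be made explicit. As a sketch, assume that $Time$ and $Intervention$ are 0/1 dummies (post-period and treatment group) and set the covariates aside; then the conditional means are

\[\begin{align*}
E[Y|Time=0, Intervention=0] &= \beta_0 \\
E[Y|Time=1, Intervention=0] &= \beta_0 + \beta_1 \\
E[Y|Time=0, Intervention=1] &= \beta_0 + \beta_2 \\
E[Y|Time=1, Intervention=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{align*}\]

so the change in the intervention group minus the change in the comparison group is

\[ [(\beta_0 + \beta_1 + \beta_2 + \beta_3) - (\beta_0 + \beta_2)] - [(\beta_0 + \beta_1) - \beta_0] = \beta_3 \]

Under these assumptions the coefficient on the interaction term, $\beta_3$, is the dif-in-dif estimate.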

Parallel Trend Assumption

The parallel trend assumption is the most critical of the assumptions needed to ensure the internal validity of DID models, and it is the hardest to fulfill.

It requires that in the absence of treatment, the difference between the ‘treatment’ and ‘control’ group is constant over time.

Although there is no statistical test for this assumption, visual inspection is useful when you have observations over many time points.

It has also been proposed that the smaller the time period tested, the more likely the assumption is to hold.

Violation of parallel trend assumption will lead to biased estimation of the causal effect.

Parallel Trend Assumption: Met

Parallel Trend Assumption: Violated

Dif-in-Dif in Practice

Case study: who pays for mandated childbirth coverage?

When the government mandates that employers provide benefits, who is really footing the bill?

Is it the employer?
Or is it the employee who pays for it indirectly in the form of a pay cut?

This analysis was first conducted in 1994 by Jonathan Gruber, an MIT professor who serves as the director of the Health Care Program at the National Bureau of Economic Research (NBER). To date, The Incidence of Mandated Maternity Benefits remains one of the most influential papers in healthcare economics.

Timeline

Understanding the timeline is important for identifying the causal effect:

Before 1978: there was limited health care coverage for childbirth.

1975-1979: a subset of states passed laws, mandating the health care coverage of childbirth.

Starting in 1978: federal legislation mandates the health care coverage of childbirth for all states.

Difference-in-Differences Table

The Legendary Triple Dif

Dif-in-Dif in R - Manual Calculation

require(foreign)
eitc<-read.dta("https://github.com/CausalReinforcer/Stata/raw/master/eitc.dta")
# Create two additional dummy variables to indicate before/after
# and treatment/control groups.

# the EITC went into effect in the year 1994
eitc$post93 = as.numeric(eitc$year >= 1994)

# The EITC only affects women with at least one child, so the
# treatment group will be all women with children.
eitc$anykids = as.numeric(eitc$children >= 1)

# Compute the four data points needed in the DID calculation:
a = sapply(subset(eitc, post93 == 0 & anykids == 0, select=work), mean)
b = sapply(subset(eitc, post93 == 0 & anykids == 1, select=work), mean)
c = sapply(subset(eitc, post93 == 1 & anykids == 0, select=work), mean)
d = sapply(subset(eitc, post93 == 1 & anykids == 1, select=work), mean)

# Compute the effect of the EITC on the employment of women with children:
(d-c)-(b-a)

##       work 
## 0.04687313

Dif-in-Dif in R - Regression

\[work=\beta_0+\delta_0 post93+\beta_1 anykids+\delta_1(anykids*post93)+\epsilon\]

reg1 = lm(work ~ post93 + anykids + post93*anykids, data = eitc)
summary(reg1)

## 
## Call:
## lm(formula = work ~ post93 + anykids + post93 * anykids, data = eitc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5755 -0.4908  0.4245  0.5092  0.5540 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.575460   0.008845  65.060  < 2e-16 ***
## post93         -0.002074   0.012931  -0.160  0.87261    
## anykids        -0.129498   0.011676 -11.091  < 2e-16 ***
## post93:anykids  0.046873   0.017158   2.732  0.00631 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4967 on 13742 degrees of freedom
## Multiple R-squared:  0.0126, Adjusted R-squared:  0.01238 
## F-statistic: 58.45 on 3 and 13742 DF,  p-value: < 2.2e-16
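
The interaction coefficient above matches the manual calculation, and this is not a coincidence: with two dummies and their interaction the regression is saturated, so it reproduces the four cell means exactly. The sketch below checks this on simulated data rather than the EITC file; the variable names and the built-in effect of 0.05 are assumptions for illustration.

```r
set.seed(123)
n <- 4000
post  <- rbinom(n, 1, 0.5)    # 1 = after the policy change
treat <- rbinom(n, 1, 0.5)    # 1 = treatment group
# Binary outcome with a built-in effect of 0.05 for treated units after the change
p <- 0.55 - 0.10 * treat + 0.01 * post + 0.05 * treat * post
y <- rbinom(n, 1, p)

# Manual DID from the four cell means
cell <- function(t, g) mean(y[post == t & treat == g])
did_manual <- (cell(1, 1) - cell(0, 1)) - (cell(1, 0) - cell(0, 0))

# Regression DID: coefficient on the interaction term
fit <- lm(y ~ post * treat)
did_reg <- unname(coef(fit)["post:treat"])

print(c(manual = did_manual, regression = did_reg))
```

The two numbers agree up to floating-point error. The practical advantage of the regression form is that it also delivers a standard error and allows covariates to be added.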

Create Plot

# Take the average value of 'work' by year, separately for each group
minfo = aggregate(eitc$work, list(eitc$year, eitc$anykids == 1), mean)
# Rename column headings (variables)
names(minfo) = c("YR", "Treatment", "LFPR")

# Attach a new column with group labels, based on the Treatment indicator
# rather than on row positions
minfo$Group = ifelse(minfo$Treatment, "Single women, children",
                     "Single women, no children")
#minfo

library(ggplot2)    # package for creating nice plots

# qplot() is deprecated in recent ggplot2 releases, so call ggplot() directly
ggplot(minfo, aes(x = YR, y = LFPR, colour = Group)) +
  geom_point() + geom_line() +
  labs(x = "Year", y = "Labor Force Participation Rate") +
  geom_vline(xintercept = 1994)

Strengths and Limitations

Strengths

Intuitive interpretation
Can obtain causal effect using observational data if assumptions are met
Can use either individual- or group-level data
Comparison groups can start at different levels of the outcome. (DID focuses on change rather than absolute levels)
Accounts for change/change due to factors other than intervention

Limitations

Requires baseline data & a non-intervention group
Cannot use if intervention allocation determined by baseline outcome
Cannot use if comparison groups have different outcome trends (Abadie 2005 has proposed a solution)
Cannot use if composition of groups pre/post change are not stable

BEST PRACTICES

Be sure outcome trend did not influence allocation of the treatment/intervention
Acquire more data points before and after to test parallel trend assumption
Use linear probability model to help with interpretability
Be sure to examine composition of population in treatment/intervention and control groups before and after intervention
Use robust standard errors to account for autocorrelation between pre/post in same individual
Perform sub-analysis to see if intervention had similar/different effect on components of the outcome

Instrumental Variables

Introduction (1)

As discussed many times before, correlation between the error terms and regressors is a serious threat to internal validity.
We know this can happen because of omitted variables, measurement errors, and simultaneous causality.
Instrumental variables (IV) regression is an approach to eliminate any inconsistency in our estimation because of correlation with the error terms.

Introduction (2)

We can imagine that a regressor $X$ has two parts: one that is correlated with $u$ and another that is not.
We use instrumental variables (or instruments, for short) to isolate the uncorrelated part and use it in our estimation.
Let’s use a Venn-Diagram illustrating the variation in
- the Outcome Variable ($Y$)
- the Treatment Variable ($X$) and
- the Instrumental Variable ($Z$)

Visual Concept

The green circle is the unexplained variation.
The red circle is the part of $X$ that is correlated with the error.
The blue circle is the part of $X$ that is uncorrelated with the error.
The orange circle shows where $X$ and $Z$ are correlated.
The purple circle shows where $X$, $Y$, and $Z$ are correlated.

Education Example

\[ln(wage)= \beta_0+\beta_1*EDUC+\beta_2*EXP +\beta_3*EXP^2+...+u\]

What is $u$?

Ability
Motivation
Everything Else that affects wage

Further, we can think of Education as a function: \[ EDUC=f(GENDER,S-E, location, Ability, Motivation)\]

What influences Log wages?

\[\begin{align*} ln(wage) &=\beta_0+\beta_1*EDUC(X,Ability,Motivation)+\beta_2*EXP \\ &+\beta_3*EXP^2+...+u(Ability,Motivation)\end{align*}\]

Increased Ability is associated with increases in Education and $u$.

What looks like an effect due to an increase in Education may be an increase in Ability.

The estimate of $\beta_1$ picks up the effect of Education and the hidden effect of Ability.

An Exogenous Influence

\[\begin{align*} ln(wage) &=\beta_0+\beta_1*EDUC(Z,X,Ability,Motivation)+\beta_2*EXP \\ &+\beta_3*EXP^2+...+u(Ability,Motivation)\end{align*}\]

A variable Z is associated with an increase in Education, but does not affect the error $u$.

A change in Education that is induced by a change in Z reflects only Education, not Ability or Motivation. For changes in Z, we can therefore estimate the changes in log wages that are caused by education.

Z is an Instrumental Variable

Instrumental Variables Regression

Three important threats to internal validity are:

omitted variable bias from a variable that is correlated with X but is unobserved, so cannot be included in the regression;
simultaneous causality bias (X causes Y, Y causes X);
errors-in-variables bias (X is measured with error)

Instrumental variables regression can eliminate bias when $E(u|X) \ne 0$ by using an instrumental variable, Z

The IV Estimator with a Single Regressor and a Single Instrument

The IV Model

Let our population regression model be

\[ Y_i = \beta_0 + \beta_1 X_i + u_i,~~i=1,\dots,n \]

and let the variable $Z_i$ be an instrumental variable that isolates the part of $X_i$ that is uncorrelated with $u_i$.

Endogeneity and Exogeneity

We define variables that are correlated with the error term as endogenous
We define variables that are uncorrelated with the error term as exogenous
Exogenous variables are those that are determined outside our model
Endogenous variables are determined from within our model. For example, in the case of simultaneous causality, both our dependent variable $Y$ and the regressor $X$ are endogenous.

Conditions for a Valid Instrument

Instrument relevance condition: $Corr(Z_i, X_i) \ne 0$
Instrument exogeneity condition: $Corr(Z_i, u_i) = 0$

The Two Stage Least Squares Estimator (1)

If a valid instrument, $Z$, is available, we are able to estimate the coefficient $\beta_1$ using two stage least squares (TSLS).

First, we decompose $X$ into its two components to isolate the component that is uncorrelated with the error terms. \[ X_i = \underbrace{\pi_0 + \pi_1 Z_i}_{\text{uncorrelated component}} + v_i \] Since we don’t observe $\pi_0$ and $\pi_1$, we need to estimate them using OLS: \[ \hat{X}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i \]

The Two Stage Least Squares Estimator (2)

Then, we use this uncorrelated component to estimate $\beta_1$: we regress $Y_i$ on $\hat{X}_i$ using OLS to estimate $\beta_0^{TSLS}$ and $\beta_1^{TSLS}$.
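
The two stages can be carried out by hand with `lm()`. The sketch below does so on simulated data in which the true $\beta_1$ is 1.5 and $X$ is endogenous by construction; all variable names and coefficients are assumptions for illustration. It also checks that, with one regressor and one instrument, the two-stage estimate equals the ratio of sample covariances.

```r
set.seed(2024)
n <- 5000
z <- rnorm(n)                           # instrument
u <- rnorm(n)                           # structural error
x <- 1 + 0.8 * z + 0.6 * u + rnorm(n)   # x is endogenous: it depends on u
y <- 2 + 1.5 * x + u                    # true beta_1 = 1.5

# OLS is inconsistent because Cov(x, u) != 0
b_ols <- unname(coef(lm(y ~ x))["x"])

# Stage 1: regress x on z and keep the fitted values
x_hat <- fitted(lm(x ~ z))
# Stage 2: regress y on the fitted values
b_tsls <- unname(coef(lm(y ~ x_hat))["x_hat"])

# With one regressor and one instrument, TSLS equals the covariance ratio
b_ratio <- cov(z, y) / cov(z, x)

print(c(ols = b_ols, tsls = b_tsls, ratio = b_ratio))
```

The standard errors reported by the second `lm()` call are not valid, because they ignore that `x_hat` is itself estimated. In practice a dedicated routine such as `ivreg()` from the AER package should be used for inference.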

Example: Philip Wright’s Problem

Suppose we wanted to estimate the price elasticity in the log-log model

\[ \ln(Q_i^{butter}) = \beta_0 + \beta_1 \ln(P_i^{butter}) + u_i \]

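If we had a sample of $n$ observations of quantity demanded and the equilibrium price, we could run an OLS regression to estimate the elasticity coefficient $\beta_1$. However, because price and quantity are determined simultaneously by supply and demand, this OLS estimate would be biased.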

Supply and Demand (1)

Supply and Demand (2)

Supply and Demand (3)

IV Animated

Application to the Demand for Cigarettes (1)

Consider the question of imposing a tax on cigarettes in order to reduce illness and deaths from smoking, in addition to other negative externalities on society.
If the elasticity of demand with respect to cigarette prices were known, it would be easy to design a policy to reduce smoking by any desired proportion.
We need to estimate this elasticity, but we cannot rely on a simple regression of demand on prices; we need to use an instrumental variable.

Application to the Demand for Cigarettes (2)

We have cross-sectional data from 48 US states in 1995 with the variables

$SalesTax_i$: Which we use as an instrument.
$Q_i^{cigarettes}$: Cigarette consumption: number of packs sold/capita in state $i$
$P_i^{cigarettes}$: Average real price/pack including all taxes

Application to the Demand for Cigarettes (3)

Before we can carry out TSLS estimation we must check the validity of our instrument.

First, instrument relevance: since higher sales taxes lead to higher cigarette prices, this condition is plausibly satisfied.
Second, instrument exogeneity: since sales taxes are driven by questions of public finance and politics, and not by questions of cigarette consumption, this condition is also plausibly satisfied.

Application to the Demand for Cigarettes (4)

Statistical software conceals the various steps needed for IV regression, but it can be useful to demonstrate the steps here

The first stage regression yields \[ \begin{alignat*}{2} \widehat{\ln(P_i^{cigarettes})} = &4.63 + &&~0.031 SalesTax_i \\ &(0.03) &&~(0.005) \end{alignat*} \]

with $R^2 = 0.47$.

Data

suppressMessages(library("AER"))
suppressMessages(library("plm"))
data("CigarettesSW")

Data (2)

cig.data=CigarettesSW
cig.data$packpc=cig.data$packs # packs per capita; the column in CigarettesSW is named packs
cig.data$ravgprs <- cig.data$price/cig.data$cpi # real average price
cig.data$rtax <- cig.data$tax/cig.data$cpi # real average cig tax
cig.data$rtaxs <- cig.data$taxs/cig.data$cpi # real average total tax
cig.data$rtaxso <- cig.data$rtaxs - cig.data$rtax # instrument

The first stage regression:

first.stage.res <- lm(log(ravgprs) ~ rtaxso, data=cig.data, subset=(year == 1995))
coeftest(first.stage.res, vcov.=vcovHAC(first.stage.res))

## 
## t test of coefficients:
## 
##              Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept) 4.6165463  0.0285440 161.7343 < 2.2e-16 ***
## rtaxso      0.0307289  0.0048623   6.3198 9.588e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Application to the Demand for Cigarettes (5)

In the second stage, $\ln(Q_i^{cigarettes})$ is regressed on $\widehat{\ln(P_i^{cigarettes})}$ \[ \begin{alignat*}{3} \widehat{\ln(Q_i^{cigarettes})} = &9.72 - &&~1.08 \widehat{\ln(P_i^{cigarettes})} \\ &(1.53) &&~(0.32) \end{alignat*} \]

IV regression in R

library(AER)
iv.res <- ivreg(log(packpc) ~ log(ravgprs) | rtaxso, data=cig.data, subset=(year == 1995))
coeftest(iv.res, vcov.=vcovHAC(iv.res))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.71988    1.52719  6.3646 8.211e-08 ***
## log(ravgprs) -1.08359    0.31869 -3.4002  0.001401 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Application to the Demand for Cigarettes (6)

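The estimated elasticity of about -1.08 suggests a strong relationship between prices and demand: a 1 percent increase in price is associated with roughly a 1.08 percent decrease in consumption.

But our assumption of exogeneity might not be valid.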

Consider income: states with higher income might not need to rely on taxes for revenue and there is presumably an effect of income on consumption of cigarettes.

Application to the Demand for Cigarettes (1)

As stated before, when we regressed quantity demanded on prices, using sales taxes as an instrument, we might have correlation with the error since state income might be correlated with sales taxes. So, now let’s add an exogenous variable for income:

\[ \begin{alignat*}{5} \widehat{\ln(Q_i^{cigarettes})} = &9.43 - &&1.14 \widehat{\ln(P_i^{cigarettes})} + &&0.21\ln(Inc_i) \\ &(1.26) &&(0.37) && (0.31) \end{alignat*} \]

Application to the Demand for Cigarettes in R (1)

cig.data$perinc <- cig.data$income/(cig.data$population * cig.data$cpi) # real per capita income

iv.res.2 <- ivreg(log(packpc) ~ log(ravgprs) + log(perinc) 
                  | rtaxso + log(perinc), data=cig.data, 
                  subset=(year == 1995))
coeftest(iv.res.2, vcov.=vcovHAC(iv.res.2))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.43066    1.25464  7.5166 1.757e-09 ***
## log(ravgprs) -1.14338    0.37064 -3.0848  0.003477 ** 
## log(perinc)   0.21452    0.31145  0.6888  0.494509    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Application to the Demand for Cigarettes (2)

Now instead of using only one instrument we can use two: $SalesTax_i$ and $CigTax_i$, so $m = 2$, making this model overidentified.

\[ \begin{alignat*}{5} \widehat{\ln(Q_i^{cigarettes})} = &9.89 - &&1.28 \widehat{\ln(P_i^{cigarettes})} + &&0.28\ln(Inc_i) \\ &(0.96) &&(0.25) &&(0.25) \end{alignat*} \]

Application to the Demand for Cigarettes in R (2)

iv.res.3 <- ivreg(log(packpc) ~ log(ravgprs) + log(perinc) 
                  | rtaxso + rtax + log(perinc), data=cig.data, 
                  subset=(year == 1995))
coeftest(iv.res.3, vcov.=vcovHAC(iv.res.3)) # For Robust Standard Errors

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   9.89496    0.96599 10.2434 2.435e-13 ***
## log(ravgprs) -1.27742    0.25299 -5.0493 7.805e-06 ***
## log(perinc)   0.28040    0.25461  1.1013    0.2766    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Checking Instrument Validity

How do you know you have good instruments?

Assumption #1: Instrument Relevance

The relevance of an instrument (the more the variation in $X$ is explained by the instrument) plays the same role as sample size: it produces more accurate estimators.
Instruments that do not explain a lot of the variation in $X$ are called weak instruments. For example, distance of the state from cigarette manufacturing.

Why Are Weak Instruments a Problem? (1)

If instruments are weak, it is like having a very small sample size: the approximation of the estimator’s distribution to the normal distribution is very poor.
Recall that with a single regressor and instrument \[ \hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \overset{p}{\longrightarrow} \frac{Cov(Z_i, Y_i)}{Cov(Z_i, X_i)} = \beta_1 \]
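
The final equality follows from the two instrument conditions. As a short sketch, substitute the population model $Y_i = \beta_0 + \beta_1 X_i + u_i$ into the numerator:

\[\begin{align*}
Cov(Z_i, Y_i) &= Cov(Z_i, \beta_0 + \beta_1 X_i + u_i) \\
&= \beta_1 Cov(Z_i, X_i) + Cov(Z_i, u_i) \\
&= \beta_1 Cov(Z_i, X_i)
\end{align*}\]

The last line uses instrument exogeneity, $Corr(Z_i, u_i) = 0$. Dividing by $Cov(Z_i, X_i)$, which is non-zero by instrument relevance, gives $\beta_1$. When the instrument is weak this denominator is close to zero, which is the source of the problem.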

Why Are Weak Instruments a Problem? (2)

Suppose that the instrument is completely irrelevant so that $Cov(Z_i, X_i) = 0$. Then \[ s_{ZX} \overset{p}{\longrightarrow} Cov(Z_i, X_i) = 0 \] so the denominator of $Cov(Z_i,Y_i)/Cov(Z_i,X_i)$ is zero, and the sampling distribution of $\hat{\beta}_1^{TSLS}$ is not normal, even in large samples.
A similar problem arises with instruments that are not completely irrelevant but are weak.

Checking for Weak Instruments

To check for weak instruments, compute the $F$-statistic testing the hypothesis that the coefficients on all the instruments are zero in the first stage.
A rule of thumb is not to worry about weak instruments if the first-stage $F$-statistic is greater than 10.
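
The rule of thumb can be applied directly to the first-stage regression. The sketch below uses simulated data with one deliberately strong and one deliberately weak instrument; the names `z_strong` and `z_weak`, the helper `first_stage_F`, and the coefficients are assumptions for illustration. With no other exogenous regressors in the first stage, the overall F-statistic reported by `summary()` is the statistic in question (in its homoskedasticity-only form).

```r
set.seed(99)
n <- 500
z_strong <- rnorm(n)
z_weak   <- rnorm(n)
x <- 0.8 * z_strong + 0.02 * z_weak + rnorm(n)

# First-stage F for a regression of x on a set of instruments.
# With no other exogenous regressors, the overall F-statistic of the
# first stage tests that all instrument coefficients are zero.
first_stage_F <- function(formula) {
  unname(summary(lm(formula))$fstatistic["value"])
}

F_strong <- first_stage_F(x ~ z_strong)
F_weak   <- first_stage_F(x ~ z_weak)

print(c(F_strong = F_strong, F_weak = F_weak))
print(c(strong_above_10 = F_strong > 10, weak_above_10 = F_weak > 10))
```

When the first stage also contains included exogenous regressors, the relevant statistic is the F-test on the instruments only, not the overall F of the regression.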

What to Do with Weak Instruments

If the model is overidentified and we have both weak and strong instruments, it is best to drop the weak instruments
However, if the model is exactly identified, it is not possible to drop any instruments.
In this case, we should try to find stronger instruments (not a very easy task) or use the weak instruments with methods other than TSLS that are less sensitive to weak instruments.

Assumption #2: Instrument Exogeneity

If the instruments are not exogenous then TSLS estimators will suffer from inconsistency.
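
To see where the inconsistency comes from, write the structural equation as $Y_i = \beta_0 + \beta_1 X_i + u_i$ (the error term $u_i$ is notation assumed here) and substitute it into the probability limit of the single-instrument estimator:

\[ \hat{\beta}_1^{TSLS} = \frac{s_{ZY}}{s_{ZX}} \overset{p}{\longrightarrow} \frac{Cov(Z_i, Y_i)}{Cov(Z_i, X_i)} = \beta_1 + \frac{Cov(Z_i, u_i)}{Cov(Z_i, X_i)} \]

The second term vanishes only when $Cov(Z_i, u_i) = 0$. If it does not, the error does not shrink as the sample grows, and a weak instrument (small $Cov(Z_i, X_i)$) magnifies it.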

Can Exogeneity be Statistically Tested?

If the model is exactly identified, we cannot test for exogeneity.
If the model is overidentified, it is possible to test the hypothesis that the “extra” instruments are exogenous, under the assumption that there are enough valid instruments to identify the coefficients of interest (the test of overidentifying restrictions).

The Overidentifying Restrictions Test (The J-Statistic) (1)

Suppose you have a model with one endogenous regressor and two instruments.
We can test for exogeneity by running two IV regressions, one with each of the instruments.
If they are both valid, we should expect our estimates to be close; otherwise we should be suspicious of the validity of one or both of the instruments.

The Overidentifying Restrictions Test (The J-Statistic) (2)

We can test for exogeneity using the $J$-statistic. We do this by estimating the following regression

\[ \begin{aligned} \hat{u}_i^{TSLS} = {} & \delta_0 + \delta_1 Z_{1i} + \cdots + \delta_m Z_{mi} \\ & + \delta_{m+1} W_{1i} + \cdots + \delta_{m+r} W_{ri} + e_i \end{aligned} \]

and using an $F$-test for $\delta_1 = \cdots = \delta_m = 0$.
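
As a sketch of how the statistic is formed (following the standard textbook treatment, with $k$ denoting the number of endogenous regressors, a symbol assumed here): under the null hypothesis that all instruments are exogenous, and with homoskedastic errors,

\[ J = mF \overset{d}{\longrightarrow} \chi^2_{m-k} \]

where $F$ is the homoskedasticity-only $F$-statistic for $\delta_1 = \cdots = \delta_m = 0$ and $m$ is the number of instruments. A large $J$ is evidence against the null. With exact identification ($m = k$) the degrees of freedom are zero, which is why the test is unavailable in that case.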

Overidentification Test within the Application to the Demand for Cigarettes

Instrument Validity (1)

In our previous TSLS we used two instruments: $SalesTax_i$ and $CigTax_i$, and one exogenous regressor: state income.

There are still concerns about the exogeneity of $CigTax_i$: there could be state specific characteristics that influence both cigarette taxes and cigarette consumption.

Instrument Validity (2)

Luckily, we have panel data so we can eliminate state fixed effects.
To simplify matters we will focus on the differences between 1985 and 1995.
We regress $[\ln(Q_{i,1995}^{cigarettes}) - \ln(Q_{i,1985}^{cigarettes})]$ on $[\ln(P_{i,1995}^{cigarettes}) - \ln(P_{i,1985}^{cigarettes})]$ and $[\ln(Inc_{i,1995}) - \ln(Inc_{i,1985})]$
- We use as instruments $[SalesTax_{i,1995} - SalesTax_{i,1985}]$ and $[CigTax_{i,1995} - CigTax_{i,1985}]$

We are actually going to run a state fixed effects regression using all the data.

# plm() selects the estimator with model=; "within" is the fixed effects estimator
panel.iv.res.1 <- plm(log(packpc) ~ log(ravgprs) + log(perinc) 
                      | rtaxso + log(perinc), data=cig.data, 
                      model="within", effect="individual", 
                      index=c("state", "year"))
coeftest(panel.iv.res.1, vcov.=vcovHC(panel.iv.res.1))

## 
## t test of coefficients:
## 
##               Estimate Std. Error t value  Pr(>|t|)    
## log(ravgprs) -1.072460   0.168316 -6.3717 8.011e-08 ***
## log(perinc)  -0.079004   0.254929 -0.3099     0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First Stage $F$-test

panel.1st.stage.res.1 <- lm(log(ravgprs) ~ rtaxso, data=cig.data)
lht(panel.1st.stage.res.1, "rtaxso = 0", vcov.=vcovHAC)

## Linear hypothesis test
## 
## Hypothesis:
## rtaxso = 0
## 
## Model 1: restricted model
## Model 2: log(ravgprs) ~ rtaxso
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1     95                        
## 2     94  1 112.64 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Different IV

panel.iv.res.2 <- plm(log(packpc) ~ log(ravgprs) + log(perinc)
                      | rtax + log(perinc), data=cig.data, 
                      model="within", effect="individual", 
                      index=c("state", "year"))
coeftest(panel.iv.res.2, vcov.=vcovHC(panel.iv.res.2))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## log(ravgprs) -1.36316    0.16975 -8.0303 2.668e-10 ***
## log(perinc)   0.34247    0.24179  1.4164    0.1634    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First Stage $F$-test

panel.1st.stage.res.2 <- lm(log(ravgprs) ~ rtax, data=cig.data)
lht(panel.1st.stage.res.2, "rtax = 0", vcov.=vcovHAC)

## Linear hypothesis test
## 
## Hypothesis:
## rtax = 0
## 
## Model 1: restricted model
## Model 2: log(ravgprs) ~ rtax
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1     95                        
## 2     94  1 302.12 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This time we include both instruments

panel.iv.res.3 <- plm(log(packpc) ~ log(ravgprs) 
                      + log(perinc) | rtaxso + rtax + log(perinc),
                      data=cig.data, model="within", effect="individual",
                      index=c("state", "year"))
coeftest(panel.iv.res.3, vcov.=vcovHC(panel.iv.res.3))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## log(ravgprs) -1.26750    0.15872 -7.9858 3.103e-10 ***
## log(perinc)   0.20378    0.23261  0.8761    0.3855    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First Stage $F$-test

panel.1st.stage.res.3 <- lm(log(ravgprs) ~ rtaxso + rtax, data=cig.data)
lht(panel.1st.stage.res.3, c("rtaxso = 0", "rtax = 0"), vcov.=vcovHAC)

## Linear hypothesis test
## 
## Hypothesis:
## rtaxso = 0
## rtax = 0
## 
## Model 1: restricted model
## Model 2: log(ravgprs) ~ rtaxso + rtax
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df     F    Pr(>F)    
## 1     95                       
## 2     93  2 165.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hansen J-test:

j.test.reg <- lm(panel.iv.res.3$residuals ~ rtaxso + rtax + log(perinc), data=cig.data)
lht(j.test.reg, c("rtaxso = 0", "rtax = 0"), vcov.=vcovHAC)

## Linear hypothesis test
## 
## Hypothesis:
## rtaxso = 0
## rtax = 0
## 
## Model 1: restricted model
## Model 2: panel.iv.res.3$residuals ~ rtaxso + rtax + log(perinc)
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F Pr(>F)
## 1     94                 
## 2     92  2 0.0012 0.9988
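
The $F$-statistic above can be converted into a $J$-statistic. The sketch below assumes the conversion $J = mF$ with $m - k$ degrees of freedom. This strictly holds for the homoskedasticity-only $F$, so applying it to the HAC-based value reported above is an approximation. Here $m = 2$ instruments (rtaxso and rtax) and $k = 1$ endogenous regressor (log(ravgprs)).

```r
f_stat <- 0.0012   # F-statistic reported in the output above
m <- 2             # number of instruments: rtaxso and rtax
k <- 1             # number of endogenous regressors: log(ravgprs)

j_stat  <- m * f_stat
p_value <- pchisq(j_stat, df = m - k, lower.tail = FALSE)

j_stat
p_value
```

Under these assumptions the p-value stays far above conventional significance levels, so the test gives no evidence against the exogeneity of the instruments.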

References

J. Angrist, “Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records,” American Economic Review, June 1990.

J. Angrist and A. Krueger, “Does Compulsory School Attendance Affect Schooling and Earnings?”, Quarterly Journal of Economics 106, November 1991.

J. Angrist, et al., “Who benefits from KIPP?”, J. of Policy Analysis and Management, Fall 2012.

J. Angrist, V. Lavy, and A. Schlosser, “Multiple Experiments for the Quantity and Quality of Children”, Journal of Labor Economics 28, October 2010.

Regression Discontinuity Design

Quick Review of diff-in-diff

This document replicates Table 4.1 and Figures 4.2, 4.4, and 4.5 from Mastering Metrics (based on data from Carpenter and Dobkin 2009).

Will adding controls affect diff-in-diff estimates if treatment assignment was random?

Answer = Not unless you’ve added ‘bad controls’, which are controls also affected by treatment.

Quick Review of diff-in-diff

Once you’ve added bad controls, you’re no longer estimating the causal effect of treatment.

Controls that are exogenous will just improve precision; they shouldn’t affect the estimates.
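
A short simulation illustrates the point about exogenous controls. Everything in this sketch is assumed for illustration (the variable names, the sample size, and a true treatment effect of 2). Treatment is randomly assigned, and the control w is unrelated to treatment.

```r
# Simulated data; every number here is an assumption for illustration
set.seed(123)
n     <- 8000
treat <- rbinom(n, 1, 0.5)    # randomly assigned treatment group
post  <- rbinom(n, 1, 0.5)    # post-treatment period indicator
w     <- rnorm(n)             # exogenous control, unaffected by treatment
y     <- 1 + 0.5 * treat + 0.3 * post + 2 * treat * post + 1.5 * w + rnorm(n)

did_plain   <- lm(y ~ treat * post)
did_control <- lm(y ~ treat * post + w)

# Estimate and standard error on the interaction term (true effect is 2)
cols <- c("Estimate", "Std. Error")
est <- rbind(
  plain   = summary(did_plain)$coefficients["treat:post", cols],
  control = summary(did_control)$coefficients["treat:post", cols]
)
est
```

The two point estimates should be close to each other, while the standard error is smaller once w absorbs part of the residual variation.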

Quick Review of diff-in-diff

What are some standard falsification tests you might want to run with diff-in-diff?

Answer

Compare ex-ante characteristics of treated & untreated
Check timing of treatment effect
Run regression using dep. variables that shouldn’t be affected by treatment (if it is what we think it is)
Check whether reversal of treatment has opposite effect
Triple-difference estimation

Quick Review of diff-in-diff

If you find ex-ante differences between treated and untreated, is internal validity gone?

Answer = Not necessarily, but it could suggest non-random assignment of treatment that is problematic. E.g. observations with characteristic ‘z’ are more likely to be treated, and observations with this characteristic are also likely to be trending differently for other reasons.

Quick Review of diff-in-diff

Does the absence of a pre-trend in diff-in-diff ensure that the parallel trends assumption holds and causal inferences can be made?

Answer = Sadly, no. We can never prove causality with 100% confidence. It could be that the trend was going to change after treatment for reasons unrelated to treatment.

Quick Review of diff-in-diff

How are triple differences helpful in reducing concerns about violation of the parallel trends assumption?

Answer = Before, an “identification policeman” would just need a story about why the treated might be trending differently after the event for other reasons. Now, he/she would need a story about why that different trend would be particularly true for the subset of firms that are more sensitive to treatment.
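
One way to write the triple-difference regression, as a sketch with assumed notation: let $Treat_i$ indicate the treated group, $Post_t$ the post-event period, and $Sens_i$ the subset expected to be more sensitive to treatment (all three names are hypothetical labels, not from the notes).

\[ \begin{aligned} y_{it} = {} & \beta_0 + \beta_1 Treat_i + \beta_2 Post_t + \beta_3 Sens_i + \beta_4 (Treat_i \times Post_t) + \beta_5 (Treat_i \times Sens_i) \\ & + \beta_6 (Post_t \times Sens_i) + \beta_7 (Treat_i \times Post_t \times Sens_i) + e_{it} \end{aligned} \]

Here $\beta_7$ is the triple-difference estimate: how much larger the diff-in-diff is for the sensitive subset than for everyone else. Any confounding trend common to all treated observations is absorbed by $\beta_4$.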

Basic idea of RDD

The basic idea of regression discontinuity design (RDD) is the following:

Observations (e.g. firm, individual, etc.) are “treated” based on a known cutoff rule.
The cutoff is what creates the discontinuity.

Researcher is interested in how this treatment affects outcome variable of interest, $y$.

Examples of RDD settings

If you think about it, these types of cutoff rules are commonplace in finance
- A borrower FICO score > 620 makes securitization of the loan more likely
  - Keys, et al (QJE 2010)
- Accounting variable x exceeding some threshold causes loan covenant violation
  - Roberts and Sufi (JF 2009)

RDD is like difference-in-difference.

Has similar flavor to diff-in-diff natural experiment setting in that you can illustrate identification with a figure
Plot outcome y against independent variable that determines treatment assignment, x.
Should observe sharp, discontinuous change in y at the cutoff value of x.

But, RDD is different.

RDD has some key differences.
- Assignment to treatment is NOT random;
- Assignment is based on value of x
- When treatment only depends on x (what I’ll later call “sharp RDD”), there is no overlap in treatment & controls; i.e. we never observe the same x for a treatment and a control

Randomized Experiment

RDD randomization assumption

Assignment to treatment and control isn’t random, but whether an individual observation near the cutoff is treated is assumed to be random.
- i.e. researcher assumes that observations (e.g. firm, person, etc.) can’t perfectly manipulate their x value
- Therefore, whether an observation’s x falls immediately above or below key cutoff x is random

Two types of RDD

Sharp RDD

Assignment to treatment only depends on x; i.e. if $x \geq x'$ you are treated with probability 1

Fuzzy RDD

Having $x \geq x'$ only increases the probability of treatment; i.e. other factors (besides x) will influence whether you are actually treated or not
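
In symbols, as a sketch with $D_i$ denoting the treatment indicator (notation assumed here), sharp RDD means

\[ P(D_i = 1 \mid x_i) = \begin{cases} 1 & \text{if } x_i \geq x' \\ 0 & \text{if } x_i < x' \end{cases} \]

while fuzzy RDD only requires a jump in the treatment probability at the cutoff:

\[ \lim_{x \downarrow x'} P(D_i = 1 \mid x_i = x) > \lim_{x \uparrow x'} P(D_i = 1 \mid x_i = x) \]

Sharp RDD is the special case in which this jump equals one.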

Sharp RDD

Fuzzy RDD

Sharp versus Fuzzy RDD

This subtle distinction affects exactly how you estimate the causal effect of treatment

With Sharp RDD, we will basically compare average $y$ immediately above and below $x'$
With fuzzy RDD, the average change in y around the threshold understates the causal effect [why?]
- Answer = Comparison assumes all observations above the threshold were treated, but this isn’t true; if they had all been treated, the observed change in y would be even larger; we will need to rescale based on the change in the probability of treatment
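
The rescaling can be illustrated with simulated data. In this sketch everything is assumed for illustration: the cutoff is at 0, the treatment probability jumps from 0.2 to 0.8 at the cutoff, the true treatment effect is 2, and the bandwidth h = 0.25 is picked by hand. The jump in y is divided by the jump in the treatment probability.

```r
# Simulated fuzzy design; all values are assumptions for illustration
set.seed(2024)
n      <- 20000
cutoff <- 0
x      <- runif(n, -1, 1)                   # assignment variable
above  <- as.numeric(x >= cutoff)           # 1 if at or above the cutoff
d      <- rbinom(n, 1, 0.2 + 0.6 * above)   # P(treated) jumps from 0.2 to 0.8
y      <- 1 + 0.5 * x + 2 * d + rnorm(n)    # true treatment effect is 2

# Keep observations within a bandwidth h of the cutoff
h   <- 0.25
dat <- data.frame(y, d, above, xc = x - cutoff)
dat <- dat[abs(dat$xc) <= h, ]

# Local linear regressions: the coefficient on `above` is the jump at the cutoff
jump_y <- unname(coef(lm(y ~ above * xc, data = dat))["above"])
jump_d <- unname(coef(lm(d ~ above * xc, data = dat))["above"])

c(jump_in_y = jump_y, jump_in_prob = jump_d, rescaled = jump_y / jump_d)
```

The raw jump in y is well below the true effect because only some observations change treatment status at the cutoff. Dividing by the jump in the treatment probability moves the estimate back toward the true effect.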

Parallel Trends

Bias vs Noise

RDD Animated

RDD in R

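library(AER)
library(foreign)     # read.dta() for Stata files
library(rdd)         # RDestimate()
library(stargazer)
AEJfigs <- read.dta("AEJfigs.dta")
# all = mortality rate from all causes of death
AEJfigs$age <- AEJfigs$agecell - 21                     # age centered at the cutoff
AEJfigs$over21 <- ifelse(AEJfigs$agecell >= 21, 1, 0)   # indicator for age 21 and over
reg.1 <- RDestimate(all ~ agecell, data=AEJfigs, cutpoint=21)
plot(reg.1)
title(main="All Causes of Death", xlab="AGE",
      ylab="Mortality rate from all causes (per 100,000)")
summary(reg.1)       # prints the estimates table shown next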

RDD in R

## 
## Call:
## RDestimate(formula = all ~ agecell, data = AEJfigs, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value
## LATE       1.6561     40            9.001     1.480       6.080  
## Half-BW    0.8281     20            9.579     1.914       5.004  
## Double-BW  3.3123     48            7.953     1.278       6.223  
##            Pr(>|z|)      
## LATE       1.199e-09  ***
## Half-BW    5.609e-07  ***
## Double-BW  4.882e-10  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       33.08  3         36          3.799e-10
## Half-BW    29.05  3         16          2.078e-06
## Double-BW  32.54  3         44          6.129e-11

RDD in R

## 
## Call:
## RDestimate(formula = mva ~ agecell, data = AEJfigs, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value
## LATE       1.2109     30            4.977     1.0590      4.700  
## Half-BW    0.6054     14            4.956     1.3767      3.600  
## Double-BW  2.4218     48            4.566     0.7086      6.444  
##            Pr(>|z|)      
## LATE       2.607e-06  ***
## Half-BW    3.182e-04  ***
## Double-BW  1.162e-10  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       13.32  3         26          3.692e-05
## Half-BW    12.76  3         10          1.879e-03
## Double-BW  26.99  3         44          9.322e-10

RDD in R

## 
## Call:
## RDestimate(formula = internal ~ agecell, data = AEJfigs, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value
## LATE       0.8809     22            1.4128    0.8206      1.722  
## Half-BW    0.4405     10            1.8691    1.0203      1.832  
## Double-BW  1.7618     42            0.7652    0.6179      1.239  
##            Pr(>|z|)   
## LATE       0.08513   .
## Half-BW    0.06698   .
## Double-BW  0.21553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F       Num. DoF  Denom. DoF  p        
## LATE        6.830  3         18          5.734e-03
## Half-BW     1.765  3          6          5.068e-01
## Double-BW  22.695  3         38          2.750e-08

List of Papers that use RDD

[Table images: RDtable5aaa.png, RDtable5aa.png, RDtable5a.png, RDtable5.png]

References

C. Carpenter and C. Dobkin, “The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the MLDA”, American Economic Journal: Applied Economics 1 (2009), 164-182.

A. Abdulkadiroglu, et al., “The Elite Illusion: Achievement Effects at Boston and New York Exam Schools”, Econometrica, 2014.

Sources of Bias

Omitted Variable Bias: a third variable that is left out of the regression affects both X and Y.
Simultaneity Bias: Y can cause X and X can cause Y.
Measurement Error: people lie or have imperfect recall.
Sample Selection Bias: Not all the data are available, or the data are available in an endogenous way.
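
Omitted variable bias can be illustrated with simulated data. In this sketch all names and coefficients are assumed for illustration: a third variable w affects both x and y, and the true effect of x on y is 2. The regression that leaves out w attributes part of w's effect to x.

```r
# Simulated data; coefficients are assumptions for illustration
set.seed(7)
n <- 10000
w <- rnorm(n)                        # third variable (omitted in the short regression)
x <- 0.5 * w + rnorm(n)              # x is correlated with w
y <- 1 + 2 * x + 3 * w + rnorm(n)    # true effect of x on y is 2

short <- unname(coef(lm(y ~ x))["x"])       # omits w
long  <- unname(coef(lm(y ~ x + w))["x"])   # includes w
c(short = short, long = long)
```

The short regression overstates the effect of x, while the long regression recovers a value close to 2.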

Examples of Bias

Omitted Variable Bias: Books and Academic Performance
Simultaneity Bias: Price and Quantity
Measurement Error:
- How much do you weigh?
- How tall are you?
- How much do you earn in a year?
- Did you round to make these estimates? Did you round up or down?
Sample Selection Bias:
- You sample a bunch of college students to find out if HS GPA affects College GPA
- Planes and WWII (my favorite example)
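
Returning to the measurement error examples above, a simulation shows what noisy self-reports do to a regression slope. This is a sketch under assumed values: the true regressor and the reporting noise both have variance 1, and the true slope is 2. Classical measurement error of this kind pulls the estimated slope toward zero.

```r
# Simulated data; all values are assumptions for illustration
set.seed(11)
n      <- 10000
income <- rnorm(n)                     # true value (standardized)
report <- income + rnorm(n)            # self-reported value with recall or rounding noise
y      <- 1 + 2 * income + rnorm(n)    # true slope is 2

true_fit  <- unname(coef(lm(y ~ income))["income"])
noisy_fit <- unname(coef(lm(y ~ report))["report"])
c(true_x = true_fit, noisy_x = noisy_fit)
```

With equal signal and noise variances, the slope on the reported value is roughly half the true slope.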