lm_1 <- lm(int_rt ~ 1, data = loan_data)
lm_bin_0 <- lm(int_rt ~ 0 + fico_bin, data = loan_data)
lm_bin_1 <- lm(int_rt ~ 1 + fico_bin, data = loan_data)
larger sample \(\rightarrow\) more precise estimate \(\rightarrow\) closer to objective truth
chance that experiment will fail to find significant effects, even if treatment effect really does exist
Two views of the world when looking at natural experiments:
fixing selection on observables often makes selection on unobservables worse
“last resort design”
randomization requires \((Y_i(1), Y_i(0)) \perp D_i\)
SOO Assumes: \((Y_i(1), Y_i(0)) \perp D_i \mid X_i\)
the “modern approach to propensity score matching” is OLS with inverse probability weighting
instead of picking a nearest neighbor or looking for an exact match, we can weight each observation by the inverse probability of its observed treatment status (more weight on units whose treatment status is unlikely given their observables)
treated units assigned weight \(\frac{1}{\hat{p}(X_i)}\)
untreated units assigned weight \(\frac{1}{1-\hat{p}(X_i)}\), where \(\hat{p}(X_i)\) is the estimated propensity score
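A minimal sketch of this weighting in R, assuming a hypothetical data frame `df` with binary treatment `treat`, covariates `x1` and `x2`, and outcome `y`:

```r
# estimate the propensity score with a logit (hypothetical variable names)
ps_model <- glm(treat ~ x1 + x2, data = df, family = binomial)
df$pscore <- predict(ps_model, type = "response")

# treated units get weight 1 / p-hat, untreated units get weight 1 / (1 - p-hat)
df$ipw <- ifelse(df$treat == 1, 1 / df$pscore, 1 / (1 - df$pscore))

# weighted OLS of the outcome on treatment; the coefficient on treat is the
# IPW estimate of the treatment effect under selection on observables
ipw_fit <- lm(y ~ treat, data = df, weights = ipw)
summary(ipw_fit)
```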
an approach to causal identification that relies on assignment rules
idea: individuals on either side of the cutoff are essentially identical, but the treatment they receive differs because of some arbitrary assignment/cutoff rule
four “ingredients”
causal question of interest: key element for causal data analysis
outcome variable (Y)
treatment variable (T)
assignment variable (X, also called “running” var)
assignment variable is used to determine treatment
if X exceeds a certain threshold, treatment is given
Sharp regression discontinuity
to get \(\tau\), compare units with \(D_i=0 \text{ and }D_i=1\) exactly at the cutoff
\[ \tau^{RD}=E[Y_i(1)-Y_i(0)|X_i=c] \]
but since we cannot observe the counterfactual, we estimate this by taking limits as x approaches c from above and from below
\[ \hat\tau^{RD} = \lim_{x \downarrow c}E[Y_i(1)|X_i=x]-\lim_{x \uparrow c}E[Y_i(0)|X_i=x] \]
RD comes close to mimicking random assignment, but without true randomization, we need to satisfy certain identification assumptions
Testing the ID assumptions
Manipulation Test
we are assuming that \(X_i-c\) (i.e., how far you are from the cutoff) is as good as randomly assigned in the neighborhood of c
if this is true, then units cannot sort around c (i.e., students cannot manipulate scores, politicians cannot change voter support)
we test this by looking at the distribution of \(X_i\)
if we see “strange” behavior around the cutoff (e.g., bunching just above it), we worry that there is manipulation: people just below the cutoff may be able to do something to push themselves just over it and receive the treatment (see the sketch below)
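A rough visual version of this check, assuming a hypothetical running variable `x` in `df` and cutoff `c0`:

```r
# look for bunching in the distribution of the running variable near the cutoff
hist(df$x, breaks = 50, main = "Distribution of running variable", xlab = "X")
abline(v = c0, col = "red", lty = 2)   # mark the cutoff
# a formal density (McCrary-style) test is available in the rddensity package
```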
Covariates Smoothness Test
we assume that no other variables change discontinuously around cutoff
goal: estimate the difference in outcomes for just-treated vs just-untreated individuals
implement this by comparing the average outcomes just below and just above the cutoff
Regression-Based RD: \(Y_i=\alpha+\tau D_i + \epsilon_i \text{ for } c-h \le X_i \le c+h, \text{ where } D_i=1(X_i \ge c)\)
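A minimal sketch of this regression, assuming a hypothetical `df` with outcome `y`, running variable `x`, cutoff `c0`, and a chosen bandwidth `h`:

```r
# keep only observations within h of the cutoff and define treatment by the rule
rd_sample <- subset(df, x >= c0 - h & x <= c0 + h)
rd_sample$D <- as.integer(rd_sample$x >= c0)

# the coefficient on D is the local difference in means, i.e. the RD estimate of tau
rd_fit <- lm(y ~ D, data = rd_sample)
summary(rd_fit)
```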
Choosing h
we want h to be small, since the RD estimate is causal only at c (a narrow window keeps the comparison local)
we also want h to be big enough to include a reasonable number of observations, since standard errors blow up if we use only data right at c
example of bias-variance tradeoff
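An illustrative sketch of the tradeoff, re-estimating the simple RD regression at a few hypothetical bandwidths and comparing estimates and standard errors:

```r
# as h grows, the standard error shrinks, but the estimate uses data farther from c
for (h in c(1, 2, 5, 10)) {            # hypothetical bandwidths on the scale of x
  s <- subset(df, x >= c0 - h & x <= c0 + h)
  s$D <- as.integer(s$x >= c0)
  fit <- lm(y ~ D, data = s)
  cat("h =", h,
      " tau-hat =", round(coef(fit)["D"], 3),
      " se =", round(summary(fit)$coefficients["D", "Std. Error"], 3), "\n")
}
```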
Functional Form
a simple difference in means ignores any relationship between \(Y_i\) and \(X_i\)
if we know that \(Y_i(X_i)\) is a linear function, we can improve on this by controlling for the underlying relationship: \(Y_i=\alpha +\tau D_i+\beta(X_i-c)+\epsilon_i\)
we can also allow for different slopes on either side of the cutoff: \(Y_i=\alpha + \tau D_i + \beta_1(X_i-c)+\beta_2(X_i-c)D_i+\epsilon_i\)
can also use nonlinear functional forms
\(Y_i=\alpha +\tau D_i+f(X_i)+\epsilon_i\)
\(Y_i=\alpha + \tau D_i + f(X_i)+f(X_i)D_i+\epsilon_i\)
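A sketch of the linear and interacted-slope specifications, reusing the hypothetical `df`, `y`, `x`, `c0`, and `h` from above:

```r
rd_sample <- subset(df, x >= c0 - h & x <= c0 + h)
rd_sample$D  <- as.integer(rd_sample$x >= c0)
rd_sample$xc <- rd_sample$x - c0                      # centered running variable

rd_linear <- lm(y ~ D + xc, data = rd_sample)         # common slope on both sides
rd_slopes <- lm(y ~ D + xc + xc:D, data = rd_sample)  # separate slopes on each side
summary(rd_slopes)                                    # coefficient on D is tau-hat
```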
Interpretation of RD Results
Sometimes, treatment status doesn’t change by 100% at the cutoff
some units above the cutoff may remain untreated, and some units below it may get treated
crossing c changes the probability of treatment rather than determining it outright (the “fuzzy” RD case)
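In this fuzzy case, the standard estimand (not derived in these notes) scales the jump in outcomes at the cutoff by the jump in the probability of treatment:
\[ \hat\tau^{FRD} = \frac{\lim_{x \downarrow c}E[Y_i|X_i=x]-\lim_{x \uparrow c}E[Y_i|X_i=x]}{\lim_{x \downarrow c}E[D_i|X_i=x]-\lim_{x \uparrow c}E[D_i|X_i=x]} \]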
Time Series Data: observation on a single unit over time
Repeated Cross-Section Data: repeated sampling of different units over time (e.g., Census surveys)
Panel Data: multiple observations of the same unit over time
Overarching goal: think about how information on the time dimension helps us address the selection problem in causal inference
Causal inference with cross-sectional data is hard
people, firms, etc. are different from one another
can only get a clean comparison when you have a (quasi-) experimental setup, such as an experiment or RD
key insight: rather than comparing \(i \text{ to } j, \text{ compare } i \text{ in } t \text{ to } i \text{ in } t-1\)
Cross-Sectional estimator
compare \(i\) to \(j\) (static)
suffers from selection bias (\(i\) and \(j\) are systematically different)
Post vs pre estimator
compare \(i\) to itself over time
Suffers from time-varying unobservables (AKA non-zero trends)
KEY: treatment happened! probably for a reason
Difference in Difference (DD)
uses across-unit, within-time comparisons and within-unit, across-time comparisons
essentially compares treated to untreated units over time
Consider treated unit i’s data-generating process:
\[ Y_{it} = \tau D_t + \beta X_i + \gamma U_i + \theta V_t \]
Untreated unit j’s data-generating process:
\[ Y_{jt} = \beta X_j + \gamma U_j + \theta V_t \]
Variable Definitions:
\(X\): time-invariant observable characteristics (e.g., gender)
\(U\): time-invariant unobservable characteristics (e.g., preferences)
\(V_t\): time-varying unobservable characteristics
(Suppose all observables are controlled for)
Post-vs-Pre Estimate for Treated Units:
\[ Y_{i,t=1} - Y_{i,t=0} = \tau(D_{t=1} - D_{t=0}) + \beta(X_i - X_i) + \gamma(U_i - U_i) + \theta(V_{t=1} - V_{t=0}) \]
Post-vs-Pre Estimate for Untreated Units:
\[ Y_{j,t=1} - Y_{j,t=0} = \beta(X_j - X_j) + \gamma(U_j - U_j) + \theta(V_{t=1} - V_{t=0}) \]
Difference-in-Differences (DD) Estimator:
\[ \begin{aligned} \widehat{\text{DD}}_\tau &= (Y_{i,t=1} - Y_{i,t=0}) - (Y_{j,t=1} - Y_{j,t=0}) \\ &= \tau(D_{t=1} - D_{t=0}) \\ &= \tau \end{aligned} \]
ID Assumption: the counterfactual trend of the treated unit equals the observed trend of the untreated unit (parallel trends)
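A minimal sketch of the DD regression in R, assuming a hypothetical two-period panel `df` with outcome `y`, a treated-group indicator `treated`, and a post-period indicator `post`:

```r
# the coefficient on treated:post is the DD estimate of tau;
# treated absorbs time-invariant group differences, post absorbs the common trend
dd_fit <- lm(y ~ treated + post + treated:post, data = df)
summary(dd_fit)
```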