Instruments with Heterogeneous Effects: Bias, Monotonicity, and Localness

Nick Huntington-Klein

October 08, 2019

IV is at a Weird Stage

  • We’re more skeptical of instruments (rainfall, anyone?) but developing estimators robust to validity violations (Kolesár et al., 2015; Windmeijer et al., 2018)
  • We know about weak instruments, but in practice, IVs are heavily underpowered and cluster-sensitive (Young, 2018; Andrews et al., 2018)
  • Standard methods don’t identify what we thought (Aronow & Samii 2016; Goodman-Bacon 2018; Gibbons, Suárez Serrato, & Urbancic 2019). What about IV?

This Paper

  • I look at first-stage effect heterogeneity in IV
  • I define the identified LATE (it’s not the ATE for compliers!)
  • I propose simple estimators that reduce bias and can be more robust to violations of monotonicity
  • OR uncover the ATE for compliers
  • Take advantage of developments in modeling effect heterogeneity (Top-K \(\tau\)-Path, Causal Forest)
  • In an applied replication with a strong instrument and an \(N \approx 33k\) subsample, these estimators reduce mean absolute deviation by about 15%

One-X One-Z Derivation

  • The value-added is in the estimator and in examining bias, so let’s keep the setup simple

\[ y_i = x_i\beta_i + \varepsilon_i \] \[ x_i = z_i\gamma_i + \nu_i \]

  • Assume all controls partialled out, \(Cov(z_i,\varepsilon_i) = 0, Cov(\gamma_i,\beta_i)\neq 0\)

Standard IV

\[\hat{\beta}^{IV} = \frac{N\widehat{Cov}(z,y)}{N\widehat{Cov}(z,x)}\]

(dropped subscript indicates a vector)

\[ N\widehat{Cov}(z,y) = \sum_i z_iy_i = \sum_i (z_i^2 \gamma_i\beta_i+z_i\nu_i\beta_i+z_i\varepsilon_i) \] \[ N\widehat{Cov}(z,x) = \sum_i z_ix_i = \sum_i (z_i^2\gamma_i + z_i\nu_i) \]

Standard IV

  • In expectation, \(E(z'\varepsilon) = E(z'\nu)= Cov(z,\gamma) = Cov(z,\beta)=0\), so

\[ E(\hat{\beta}^{IV}) = \frac{E(\gamma'\beta)}{E(\gamma)} \]

  • A weighted average of the \(\beta_i\)s, where the weights are \(\gamma_i\), not 1 for compliers and 0 otherwise
  • See Imbens & Angrist (1994) - this isn’t new
  • But appears to have been forgotten, even by Angrist, Imbens, & Rubin (1996)
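A minimal numpy sketch of this identification result (the DGP below is illustrative, not the paper’s): the ratio-of-covariances estimator converges to the \(\gamma\)-weighted average \(E(\gamma'\beta)/E(\gamma)\), which differs from the unweighted ATE whenever \(Cov(\gamma,\beta)\neq 0\).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Illustrative DGP: first-stage strength gamma_i correlated with beta_i
gamma = rng.choice([0.0, 0.5, 1.0], size=N)
beta = 1.0 + gamma                      # Cov(gamma, beta) > 0
z = rng.normal(size=N)
x = z * gamma + rng.normal(size=N)      # x_i = z_i gamma_i + nu_i
y = x * beta + rng.normal(size=N)       # y_i = x_i beta_i + eps_i

beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(beta_iv)                                 # ~ E(gamma*beta)/E(gamma) = 1.83
print(np.mean(gamma * beta) / np.mean(gamma))  # the gamma-weighted target
print(np.mean(beta))                           # ATE = 1.5, which IV does not recover
```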

Bias

  • What does IV bias look like under generalized first-stage heterogeneity?

\[ \frac{\sum_i z_i\nu_i\beta_i}{\sum_i z_ix_i}+\frac{\sum_iz_i\varepsilon_i}{\sum_i z_ix_i} \]

  • The first term would ordinarily vanish, but since \(Cov(\gamma,\beta)\neq 0\) we have \(E(x'\beta)\neq E(x)E(\beta)\), so it doesn’t
  • An idea… can we increase the denominator to reduce bias? What would happen if we dropped a \(\gamma_i = 0\) observation?
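A quick illustration of that idea, with assumed endogeneity (\(Cov(\nu,\varepsilon)>0\), not specified above): dropping the \(\gamma_i = 0\) observations leaves the estimand unchanged (those observations carry zero weight) but strengthens the first stage and shrinks the small-sample bias. Medians are reported because the just-identified IV estimator has no finite mean.

```python
import numpy as np

rng = np.random.default_rng(1)
reps, N = 2_000, 200
full, dropped = [], []

for _ in range(reps):
    gamma = rng.choice([0.0, 0.5], size=N)     # half the sample: no first stage
    beta = 1.0 + gamma                          # Cov(gamma, beta) > 0
    z = rng.normal(size=N)
    nu = rng.normal(size=N)
    eps = nu + rng.normal(size=N)               # assumed endogeneity
    x = z * gamma + nu
    y = x * beta + eps

    keep = gamma != 0                           # infeasible: uses the true gamma
    full.append(np.sum(z * y) / np.sum(z * x))
    dropped.append(np.sum(z[keep] * y[keep]) / np.sum(z[keep] * x[keep]))

# Both target E(gamma*beta)/E(gamma) = 1.5; the trimmed version is less biased
print(np.median(full), np.median(dropped))
```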

Modeling First-Stage Heterogeneity

  • Strengthening the relationship between \(z\) and \(x\) will reduce bias
  • Model variation in \(\gamma_i\) directly - allow the relationship to be strong where it’s strong, and drop/downplay where it’s weak
  • What is identified if we allow, in our regression, the effect of \(z_i\) to vary by groups \(g_i\)?

\[ x_i = z_i \sum_g \gamma_g I_{gi} +\nu_i \]
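A sketch of this group-interacted first stage as two-step 2SLS, assuming all controls and constants are already partialled out (the function name is mine):

```python
import numpy as np

def group_2sls(y, x, z, g):
    """2SLS where the effect of z on x varies by group g, i.e. using
    z interacted with group dummies as the instrument set.
    Assumes controls (and constants) are already partialled out."""
    xhat = np.zeros_like(x, dtype=float)
    for grp in np.unique(g):
        m = g == grp
        gamma_g = np.sum(z[m] * x[m]) / np.sum(z[m] ** 2)  # group first stage
        xhat[m] = gamma_g * z[m]                           # fitted x within group
    return np.sum(xhat * y) / np.sum(xhat * x)             # second stage
```

With groups that track the true \(\gamma_i\), this estimator targets the weighted average on the next slide, with weights \(\gamma_i\gamma_g\) rather than \(\gamma_i\).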

Modeling First-Stage Heterogeneity

  • Using similar calculations,

\[ E(\hat{\beta}^{2SLS}) = \frac{E(\sum_i\beta_i\gamma_i\sum_g\gamma_gI_{gi})}{E(\sum_g\gamma_g^2N_g)} \]

  • A weighted average with weights \(\gamma_i\gamma_g\) instead of just \(\gamma_i\) - a Super-Local Average Treatment Effect (SLATE)
  • You may have run this analysis without realizing you were getting a SLATE
  • SLATE \(\prec\) ATE… but is SLATE \(\prec\) LATE?

Benefits

  • Instead of requiring monotonicity to avoid negative weights, we require only monotonicity-within-group
  • Under i.i.d., the bias term will generally be smaller:

\[ \frac{\sum_gE\left(\hat{\gamma}_g^2\left(\sum_iz_i^2(\nu_i\beta_i +\varepsilon_i)^2I_{gi}\right)\right)}{\sum_g E\left(\hat{\gamma}_g^4\left(\sum_i z_i^4I_{gi}\right)\right)} \]

Weighting

  • The previous analysis implies a tradeoff between localness and bias
  • Once we have our estimates of \(\gamma_i\) , we can control the degree of that tradeoff by weighting by a function of \(\gamma_i\)
  • Under regular IV with weights, the treatment effect weights are \(w_i^2\gamma_i\)

\[ E(\hat{\beta}^{WIV}) = \frac{E((WW\gamma)'\beta)}{E(WW\gamma)} \]

Weighting

  • If \(Cov(w,\gamma)>0\) then this intuitively should reduce bias
  • \(w_i=I(\gamma_i\neq 0)\) is standard - “only run the analysis in places the IV is relevant”
  • \(w_i=(F_{\gamma_i})^p, p \neq 0\) controls bias-variance tradeoff and applies in multi-instrument setting. \(F_\gamma\) is a first-stage \(F\)-statistic except the numerator assumes \(\gamma=\gamma_i\) for the whole sample
  • Under monotonicity, \(p = 1/4\) identifies same effect as groups, and \(p = -1/4\) recovers “ATE for compliers”
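A single-instrument sketch of this weighting scheme, given estimates \(\hat{\gamma}_i\) (passed in directly here). The \(F_{\gamma_i}\) construction below, substituting \(\gamma_i\) into the numerator of the pooled first-stage \(F\), is my reading of the definition above rather than a formula taken from the paper; weights enter squared, matching the \(w_i^2\gamma_i\) treatment-effect weights on the previous slide.

```python
import numpy as np

def weighted_iv(y, x, z, gamma_i, p):
    """Weighted IV with w_i = F(gamma_i)^p; p = 0 reduces to
    w_i = I(gamma_i != 0)."""
    gamma_hat = np.sum(z * x) / np.sum(z ** 2)     # pooled first stage
    sigma2 = np.mean((x - gamma_hat * z) ** 2)     # first-stage residual variance
    F_i = gamma_i ** 2 * np.sum(z ** 2) / sigma2   # F-stat, numerator uses gamma_i
    w = np.zeros_like(F_i)
    nz = gamma_i != 0
    w[nz] = F_i[nz] ** p
    return np.sum(w**2 * z * y) / np.sum(w**2 * z * x)
```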

Bias under Weighting

  • How does selection of \(p\) affect bias \(\zeta\) in single-\(x\) single-\(z\) setting?

\[ \small \begin{eqnarray} \frac{\partial \zeta}{\partial p} &=& \frac{\sum_{i|\gamma_i \neq 0}\log(F_{\gamma_i})(F_{\gamma_i})^pz_i(\nu_i\beta_i+\varepsilon_i)}{\sum_i(F_{\gamma_i})^p z_ix_i} \nonumber \\ && - \zeta \frac{\sum_{i|\gamma_i \neq 0}\log(F_{\gamma_i})(F_{\gamma_i})^p z_ix_i}{\sum_i(F_{\gamma_i})^p z_ix_i} \nonumber \end{eqnarray} \]

  • Smaller samples/bigger bias to start: increasing \(p\) reduces bias

  • Bigger samples/smaller bias to start: more likely that increasing \(p\) increases bias

Feasible Estimators

  • How can we actually do this if we don’t know \(\gamma_i\), or the groups over which \(\gamma_i\) varies?
  • Note that overfitting is not a concern (Belloni et al. 2014)!
  • We know plenty of ways to model effect heterogeneity with covariates - hierarchical models, interaction terms…
  • But how can we do this either without an extensive model of effect heterogeneity, or with a model that can handle high dimensionality?

Feasible Estimators

  • GroupSearch: Just greedy - draw group assignments randomly a bunch of times, estimate the first stage with groups, and keep the grouping that gives the biggest \(F\)-stat (sketched after this list). No need to actually model the effect
  • Top-K \(\tau\)-Path (Sampath et al. 2015, 2016) - use a concordance matrix to determine agreement between two variables. Sort observations from “highest contribution to agreement” to “lowest”, then compare to null. Split into “positive effect”, “negative effect”, “null effect” groups
  • Causal Forest (Athey & Imbens 2016) - repeatedly split data on covariates to maximize difference in treatment effects between groups. Estimate individual effect, then make groups from quantiles of the effect
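A schematic reading of GroupSearch (assumptions: assignments drawn uniformly at random, a no-intercept first stage, a homoskedastic \(F\); the paper’s implementation may differ):

```python
import numpy as np

def group_search(x, z, n_groups=4, n_draws=200, seed=0):
    """Draw random group assignments, fit a first stage with
    group-specific gamma_g, keep the assignment with the largest F."""
    rng = np.random.default_rng(seed)
    best_F, best_g = -np.inf, None
    for _ in range(n_draws):
        g = rng.integers(n_groups, size=len(x))
        xhat = np.zeros_like(x, dtype=float)
        for grp in range(n_groups):
            m = g == grp
            if not m.any():
                continue
            xhat[m] = (np.sum(z[m] * x[m]) / np.sum(z[m] ** 2)) * z[m]
        rss = np.sum((x - xhat) ** 2)
        F = (np.sum(xhat ** 2) / n_groups) / (rss / (len(x) - n_groups))
        if F > best_F:
            best_F, best_g = F, g
    return best_g, best_F
```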

Simulation

  • With our infeasible and feasible estimators (GroupSearch, TKTP) in hand, let’s see how this actually performs with simulated data

\[ y_i = x_i\beta_i + 2w_i+\varepsilon_i \] \[ x_i = z_i\gamma_i + w_i + \nu_i \] \[ z_i, w_i, \varepsilon_i, \nu_i \sim N(0,1) \]

  • Four groups: A, B, C, and D. \(\beta = \{1, 2, 3, 4\}\) and \(\gamma = \{0, .075, .15, .223\}\).
  • OLS bias is \(1\), median first-stage F-statistic is \(10\) at \(N = 1600\).
  • I generate \(1,000\) simulated samples at each \(N \in \{100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600\}\)
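A sketch of this DGP; the slide does not spell out the endogeneity linking \(\varepsilon\) and \(\nu\) (which generates the OLS bias), so the shared shock below is an assumption:

```python
import numpy as np

def simulate(N, rng):
    """One sample from the simulation DGP above."""
    g = rng.integers(4, size=N)                       # groups A, B, C, D
    beta = np.array([1.0, 2.0, 3.0, 4.0])[g]
    gamma = np.array([0.0, 0.075, 0.15, 0.223])[g]
    z, w = rng.normal(size=N), rng.normal(size=N)
    u = rng.normal(size=N)                            # assumed common shock:
    nu, eps = u, u                                    # eps, nu ~ N(0,1), correlated
    x = z * gamma + w + nu
    y = x * beta + 2 * w + eps
    return y, x, z, w, g

rng = np.random.default_rng(0)
for N in (100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600):
    for rep in range(1_000):
        y, x, z, w, g = simulate(N, rng)
        # ... apply the infeasible and feasible estimators to each sample ...
```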

Base with Feasible Estimators

How Many Groups?

z is Invalid

Gamma is Invalid

vs. other Small-Sample Methods

Clustering

  • Given Young (2018), we should be concerned about the sensitivity of the method to clusters and clustering
  • Following him, I use the data generating process

\[ x_i = z_i\gamma_i + \lambda_c(\eta_c+ w_i +\nu_i)/\sqrt{2} \] \[ y_i = x_i\beta_i + \lambda_c(\eta_c +2w_i +\varepsilon_i)/\sqrt{2} \]

where \(\lambda_c\) is a \(z_i\) value from cluster \(c\), and \(\eta_c\) is a randomly selected \(\varepsilon_i\) value from cluster \(c\). \(\lambda_c\) and \(\eta_c\) are the same for all members of cluster \(c\)
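A sketch of this clustered DGP, reusing the simulation’s \(\beta\) and \(\gamma\) groups (how \(\lambda_c\) and \(\eta_c\) are drawn within clusters is my reading of the description above):

```python
import numpy as np

def simulate_clustered(n_clusters, cluster_size, rng):
    """Young (2018)-style clustered DGP from the slide above."""
    N = n_clusters * cluster_size
    c = np.repeat(np.arange(n_clusters), cluster_size)
    g = rng.integers(4, size=N)
    beta = np.array([1.0, 2.0, 3.0, 4.0])[g]
    gamma = np.array([0.0, 0.075, 0.15, 0.223])[g]
    z, w, eps, nu = rng.normal(size=(4, N))
    # lambda_c: a z value from cluster c; eta_c: an eps value from cluster c
    lam = np.array([rng.choice(z[c == k]) for k in range(n_clusters)])[c]
    eta = np.array([rng.choice(eps[c == k]) for k in range(n_clusters)])[c]
    x = z * gamma + lam * (eta + w + nu) / np.sqrt(2)
    y = x * beta + lam * (eta + 2 * w + eps) / np.sqrt(2)
    return y, x, z, c
```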

Clustering

Recovering the ATE

  • We can use weighting to recover the ATE among compliers
  • But the performance of the weighting estimator hasn’t been particularly strong
  • However, at large sample sizes, the very small \(p = -1/4\) weight may reduce bias further

Recovering the ATE

Application

  • I apply the group-interaction estimator to Angrist, Battistin, & Vuri (2017)
  • This paper suggests that Italian small-classroom benefits might be due to teachers cheating
  • It uses a Maimonides-style instrument for class size combined with a “monitored test” indicator
  • I use GroupSearch and Causal Forest, presenting math results only here
  • Sampling is done by cluster
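Since sampling is by cluster, any resampling keeps whole clusters together; a minimal sketch (the function name is mine):

```python
import numpy as np

def cluster_sample(c, rng):
    """Draw clusters with replacement and return the row indices
    of all observations in the drawn clusters."""
    ids = np.unique(c)
    draw = rng.choice(ids, size=len(ids), replace=True)
    return np.concatenate([np.flatnonzero(c == k) for k in draw])
```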

First-Stage Heterogeneity

Main Results

| | All Italy | North or Central | South |
|---|---|---|---|
| Class Size \(\times\) Monitored | -0.035 | -0.039\(^{*}\) | -0.035 |
| | (0.024) | (0.021) | (0.060) |
| Class Size \(\times\) Not Monitored | -0.066\(^{***}\) | -0.042\(^{**}\) | -0.143\(^{***}\) |
| | (0.021) | (0.018) | (0.053) |
| Monitored | -0.174\(^{***}\) | -0.082\(^{**}\) | -0.395\(^{***}\) |
| | (0.041) | (0.038) | (0.096) |
| Weak IV F: Monitored | 44691 | 34569 | 12093 |
| Weak IV F: Not Monitored | 23072 | 19291 | 5552 |

GroupSearch (5 Groups)

| | All Italy | North or Central | South |
|---|---|---|---|
| Class Size \(\times\) Monitored | -0.036 | -0.038\(^{*}\) | -0.038 |
| | (0.024) | (0.021) | (0.060) |
| Class Size \(\times\) Not Monitored | -0.066\(^{***}\) | -0.041\(^{**}\) | -0.144\(^{***}\) |
| | (0.021) | (0.018) | (0.053) |
| Monitored | -0.173\(^{***}\) | -0.081\(^{**}\) | -0.389\(^{***}\) |
| | (0.041) | (0.038) | (0.096) |
| Weak IV F: Monitored | 8943 | 6919 | 2423 |
| Weak IV F: Not Monitored | 4615 | 3858 | 1111 |

Causal Forest (5 Quantiles)

| | All Italy | North or Central | South |
|---|---|---|---|
| Class Size \(\times\) Monitored | -0.029 | -0.041\(^{**}\) | -0.034 |
| | (0.023) | (0.021) | (0.060) |
| Class Size \(\times\) Not Monitored | -0.069\(^{***}\) | -0.042\(^{**}\) | -0.142\(^{***}\) |
| | (0.021) | (0.018) | (0.053) |
| Monitored | -0.191\(^{***}\) | -0.076\(^{**}\) | -0.393\(^{***}\) |
| | (0.039) | (0.037) | (0.094) |
| Weak IV F: Monitored | 9435 | 7081 | 2465 |
| Weak IV F: Not Monitored | 4793 | 3933 | 1140 |

Comparing Bias

Conclusion

  • In social science, most effects are probably heterogeneous
  • In IV this can affect what is identified and the small-sample bias
  • But by modeling heterogeneity directly, we can considerably reduce that bias and improve robustness to monotonicity violations
  • Importantly, these methods are extremely simple and can be implemented anywhere linear IV can
  • But their power is improved through the use of new advances in heterogeneous effects modeling