Instruments with Heterogeneous Effects: Bias, Monotonicity, and Localness
Nick Huntington-Klein
October 08, 2019
IV is at a Weird Stage
- We’re more skeptical of instruments (rainfall, anyone?) but developing estimators robust to validity violations (Kolesár et al., 2015; Windmeijer et al., 2018)
- We know about weak instruments, but in practice, IVs are heavily underpowered and cluster-sensitive (Young, 2018; Andrews et al., 2018)
- Standard methods don’t identify what we thought (Aronow & Samii 2016; Goodman-Bacon 2018; Gibbons, Suárez Serrato, & Urbancic 2019). What about IV?
This Paper
- I look at first-stage effect heterogeneity in IV
- I define the identified LATE (it’s not the ATE for compliers!)
- I propose simple estimators that reduce bias and can be more robust to violations of monotonicity
- OR recover the ATE for compliers
- Take advantage of developments in modeling effect heterogeneity (Top-K \(\tau\)-Path, Causal Forest)
- In an applied replication with a strong instrument and an \(N \approx 33k\) subsample, these estimators can reduce mean absolute deviation by about 15%
One-X One-Z Derivation
- The value added is in the estimator and the bias analysis, so let’s keep the model simple
\[ y_i = x_i\beta_i + \varepsilon_i \] \[ x_i = z_i\gamma_i + \nu_i \]
- Assume all controls partialled out, \(Cov(z_i,\varepsilon_i) = 0, Cov(\gamma_i,\beta_i)\neq 0\)
Standard IV
\[\hat{\beta}^{IV} = \frac{N\widehat{Cov}(z,y)}{N\widehat{Cov}(z,x)}\]
(dropped subscript indicates a vector)
\[ N\widehat{Cov}(z,y) = \sum_i z_iy_i = \sum_i (z_i^2 \gamma_i\beta_i+z_i\nu_i\beta_i+z_i\varepsilon_i) \] \[ N\widehat{Cov}(z,x) = \sum_i z_ix_i = \sum_i (z_i^2\gamma_i + z_i\nu_i) \]
Standard IV
- In expectation, \(E(z'\varepsilon) = E(z'\nu)= Cov(z,\gamma) = Cov(z,\beta)=0\), so
\[ E(\hat{\beta}^{IV}) = \frac{E(\gamma'\beta)}{E(\gamma)} \]
- A weighted average of the \(\beta_i\)s, where the weights are \(\gamma_i\), not 1 for compliers and 0 otherwise
- See Imbens & Angrist (1994) - this isn’t new
- But appears to have been forgotten, even by Angrist, Imbens, & Rubin (1995)
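- A minimal numpy sketch of this point (the DGP and numbers here are mine, purely illustrative): with \(Cov(\gamma,\beta)>0\), IV converges to \(E(\gamma'\beta)/E(\gamma)\), not to the unweighted ATE

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000_000
beta = rng.normal(2.0, 1.0, N)      # heterogeneous treatment effects, ATE = 2
gamma = 0.1 + 0.05 * beta           # first stage correlated with beta
z = rng.normal(size=N)
x = z * gamma + rng.normal(size=N)
y = x * beta + rng.normal(size=N)

iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(iv)                                       # ~2.25
print(np.mean(gamma * beta) / np.mean(gamma))   # E(gamma*beta)/E(gamma) = 2.25
print(beta.mean())                              # ATE = 2, not what IV finds
```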
Bias
- What does IV bias look like under generalized first-stage heterogeneity?
\[ \frac{\sum_i z_i\nu_i\beta_i}{\sum_i z_ix_i}+\frac{\sum_iz_i\varepsilon_i}{\sum_i z_ix_i} \]
- The first term would usually vanish in expectation, but because \(Cov(\gamma,\beta)\neq 0\) we have \(E(x'\beta)\neq E(x)E(\beta)\), so it doesn’t
- An idea… can we increase the denominator to reduce bias? What would happen if we dropped a \(\gamma_i = 0\) observation?
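- A quick numerical check of that idea (my own toy DGP with a known \(\gamma_i = 0\) group, so the "drop" step is infeasible in practice): dropping the irrelevant half of the sample leaves the estimand at \(2\) but shrinks the deviation around it

```python
import numpy as np

rng = np.random.default_rng(1)

def iv(z, x, y):
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

def one_draw(N=400):
    gamma = np.repeat([0.0, 0.3], N // 2)   # half the sample has no first stage
    beta = np.repeat([1.0, 2.0], N // 2)
    z, w, e, v = rng.normal(size=(4, N))
    x = z * gamma + w + v                   # w makes x endogenous
    y = x * beta + 2 * w + e
    keep = gamma != 0                       # infeasible: gamma_i known here
    return iv(z, x, y), iv(z[keep], x[keep], y[keep])

draws = np.array([one_draw() for _ in range(2000)])
# estimand E(gamma*beta)/E(gamma) = 2 either way; compare mean abs deviation
print(np.abs(draws - 2.0).mean(axis=0))     # the dropped version deviates less
```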
Modeling First-Stage Heterogeneity
- Strengthening the relationship between \(z\) and \(x\) will reduce bias
- Model variation in \(\gamma_i\) directly - allow the relationship to be strong where it’s strong, and drop/downplay where it’s weak
- What is identified if we allow, in our regression, the effect of \(z_i\) to vary by groups \(g_i\)?
\[ x_i = z_i \sum_g \gamma_g I_{gi} +\nu_i \]
Modeling First-Stage Heterogeneity
- Using similar calculations,
\[ E(\hat{\beta}^{2SLS}) = \frac{E(\sum_i\beta_i\gamma_i\sum_g\gamma_gI_{gi})}{E(\sum_g\gamma_g^2N_g)} \]
- A weighted average with weights \(\gamma_i\gamma_g\) instead of just \(\gamma_i\) - a Super-Local Average Treatment Effect (SLATE)
- You may have run this analysis without realizing you were getting a SLATE
- SLATE \(\prec\) ATE… but is SLATE \(\prec\) LATE?
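- A sketch of the group-interaction estimator (group values and DGP are illustrative, not the paper’s): instrument \(x\) with \(z\) interacted with group dummies, then run 2SLS by hand

```python
import numpy as np

rng = np.random.default_rng(2)
N, G = 400_000, 4
g = rng.integers(G, size=N)
gamma = np.array([0.0, 0.1, 0.2, 0.3])[g]   # first stage varies by group
beta = np.array([1.0, 2.0, 3.0, 4.0])[g]
z, w, e, v = rng.normal(size=(4, N))
x = z * gamma + w + v
y = x * beta + 2 * w + e

Z = z[:, None] * (g[:, None] == np.arange(G))   # z x group dummies, N x G
gamma_hat = np.linalg.lstsq(Z, x, rcond=None)[0]
x_hat = Z @ gamma_hat                           # fitted first stage
print((x_hat @ y) / (x_hat @ x_hat))            # second stage: the SLATE

# with gamma constant within group, the SLATE weights reduce to gamma_g^2:
wts = gamma ** 2
print((wts * beta).sum() / wts.sum())           # ~3.57, vs plain-IV LATE ~3.33
```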
Benefits
- Instead of requiring monotonicity to avoid negative weights, we require only monotonicity-within-group
- Under i.i.d., the bias term will generally be smaller:
\[ \frac{\sum_gE\left(\hat{\gamma}_g^2\left(\sum_iz_i^2(\nu_i\beta_i +\varepsilon_i)^2I_{gi}\right)\right)}{\sum_g E\left(\hat{\gamma}_g^4\left(\sum_i z_i^4I_{gi}\right)\right)} \]
Weighting
- The previous analysis implies a tradeoff between localness and bias
- Once we have our estimates of \(\gamma_i\), we can control the degree of that tradeoff by weighting by a function of \(\gamma_i\)
- Under regular IV with weights, the treatment effect weights are \(w_i^2\gamma_i\)
\[ E(\hat{\beta}^{WIV}) = \frac{E((WW\gamma)'\beta)}{E(WW\gamma)} \]
Weighting
- If \(Cov(w,\gamma)>0\), this should intuitively reduce bias
- \(w_i=I(\gamma_i\neq 0)\) is standard - “only run the analysis in places the IV is relevant”
- \(w_i=(F_{\gamma_i})^p, p \neq 0\) controls the bias-variance tradeoff and applies in the multi-instrument setting. \(F_{\gamma_i}\) is a first-stage \(F\)-statistic, except the numerator assumes \(\gamma=\gamma_i\) for the whole sample
- Under monotonicity, \(p = 1/4\) identifies the same effect as the group estimator, and \(p = -1/4\) recovers the “ATE for compliers”
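- A sketch of weighted IV under the formula above. It is infeasible as written (it uses the true \(\gamma_i\)), and \(\gamma_i^{\pm 1/2}\) stands in for \((F_{\gamma_i})^{\pm 1/4}\) since \(F_{\gamma_i}\) scales with \(\gamma_i^2\); the DGP is illustrative

```python
import numpy as np

rng = np.random.default_rng(3)
N = 400_000
g = rng.integers(4, size=N)
gamma = np.array([0.0, 0.1, 0.2, 0.3])[g]
beta = np.array([1.0, 2.0, 3.0, 4.0])[g]
z, w, e, v = rng.normal(size=(4, N))
x = z * gamma + w + v
y = x * beta + 2 * w + e

def weighted_iv(wt):          # treatment-effect weights become wt^2 * gamma
    ww = wt ** 2
    return (ww * z * y).sum() / (ww * z * x).sum()

print(weighted_iv(np.ones(N)))                  # plain IV: LATE ~3.33
print(weighted_iv((gamma != 0).astype(float)))  # same estimand, less noise
print(weighted_iv(np.sqrt(gamma)))              # weights gamma^2: SLATE ~3.57
wt = np.zeros(N)
wt[gamma > 0] = gamma[gamma > 0] ** -0.5        # weights = 1 for compliers
print(weighted_iv(wt))                          # ~3.0: ATE for compliers
```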
Bias under Weighting
- How does selection of \(p\) affect bias \(\zeta\) in single-\(x\) single-\(z\) setting?
\[
\small
\begin{aligned}
\frac{\partial \zeta}{\partial p} ={}& \frac{\sum_{i|\gamma_i \neq 0}\log(F_{\gamma_i})(F_{\gamma_i})^p z_i(\nu_i\beta_i+\varepsilon_i)}{\sum_i(F_{\gamma_i})^p z_ix_i} \\
&- \zeta \frac{\sum_{i|\gamma_i \neq 0}\log(F_{\gamma_i})(F_{\gamma_i})^p z_ix_i}{\sum_i(F_{\gamma_i})^p z_ix_i}
\end{aligned}
\]
Feasible Estimators
- How can we actually do this if we don’t know \(\gamma_i\), or the groups over which \(\gamma_i\) varies?
- Note that overfitting the first stage is not a concern (Belloni et al. 2014)!
- We know plenty of ways to model effect heterogeneity with covariates - hierarchical models, interaction terms…
- But how can we do this either without needing an extensive model of effect heterogeneity, or with a model built in a way that handles high dimensionality?
Feasible Estimators
- GroupSearch: just greedy - pick groups randomly a bunch of times, estimate the first stage with group interactions, and keep the grouping that gives the biggest \(F\)-stat. No need to actually model the effect (a sketch follows this list)
- Top-K \(\tau\)-Path (Sampath et al. 2015, 2016) - use a concordance matrix to determine agreement between two variables. Sort observations from “highest contribution to agreement” to “lowest”, then compare to null. Split into “positive effect”, “negative effect”, “null effect” groups
- Causal Forest (Athey & Imbens 2016) - repeatedly split data on covariates to maximize difference in treatment effects between groups. Estimate individual effect, then make groups from quantiles of the effect
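- A minimal sketch of GroupSearch as described above; the grouping scheme here (random cutpoints on a single covariate) is one plausible reading, not necessarily the paper’s

```python
import numpy as np

rng = np.random.default_rng(4)

def first_stage_F(z, x, g, G):
    """F-statistic for the first stage x ~ z x group-dummies."""
    Z = z[:, None] * (g[:, None] == np.arange(G))
    coef = np.linalg.lstsq(Z, x, rcond=None)[0]
    fitted = Z @ coef
    rss = ((x - fitted) ** 2).sum()
    ess = ((fitted - fitted.mean()) ** 2).sum()
    return (ess / G) / (rss / (len(x) - G))

def group_search(z, x, c, G=5, tries=200):
    """Greedy: random cutpoints on covariate c, keep the best-F grouping."""
    best = (-np.inf, None)
    for _ in range(tries):
        cuts = np.sort(rng.choice(c, G - 1, replace=False))
        g = np.searchsorted(cuts, c)
        F = first_stage_F(z, x, g, G)
        if F > best[0]:
            best = (F, g)
    return best

# toy data where the first stage really does vary with an observed covariate
N = 5000
c = rng.normal(size=N)
gamma = 0.3 * (c > 0)                    # z only moves x when c > 0
z, v = rng.normal(size=(2, N))
x = z * gamma + v
print(group_search(z, x, c)[0])          # F for the best grouping found
```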
Simulation
- With our infeasible and feasible estimators (GroupSearch, TKTP) in hand, let’s see how this actually performs with simulated data
\[ y_i = x_i\beta_i + 2w_i+\varepsilon_i \] \[ x_i = z_i\gamma_i + w_i + \nu_i \] \[ z_i, w_i, \varepsilon_i, \nu_i \sim N(0,1) \]
- Four groups: A, B, C, and D. \(\beta = \{1, 2, 3, 4\}\) and \(\gamma = \{0, .075, .15, .223\}\).
- OLS bias is \(1\), median first-stage F-statistic is \(10\) at \(N = 1600\).
- I generate \(1,000\) simulated samples at each of \(N \in \{100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600\}\)
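- A sketch of this DGP (equal group shares are my assumption, and I use fewer draws than the \(1,000\) above):

```python
import numpy as np

def simulate(N, rng):
    g = rng.integers(4, size=N)                  # groups A, B, C, D
    beta = np.array([1.0, 2.0, 3.0, 4.0])[g]
    gamma = np.array([0.0, 0.075, 0.15, 0.223])[g]
    z, w, e, v = rng.normal(size=(4, N))
    x = z * gamma + w + v
    y = x * beta + 2 * w + e
    return z, x, y

rng = np.random.default_rng(5)
for N in [100, 400, 1600, 6400, 25600]:
    est = [np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
           for z, x, y in (simulate(N, rng) for _ in range(200))]
    print(N, round(float(np.median(est)), 3))    # median IV estimate by N
```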
Base with Feasible Estimators

[Figure]

How Many Groups?

[Figure]

z is Invalid

[Figure]

Gamma is Invalid

[Figure]

vs. other Small-Sample Methods

[Figure]
Clustering
- Given Young (2018), we should be concerned about the method’s sensitivity to clusters and clustering
- Following him, I use the data generating process
\[ x_i = z_i\gamma_i + \lambda_c(\eta_c + w_i + \nu_i)/\sqrt{2} \] \[ y_i = x_i\beta_i + \lambda_c(\eta_c + 2w_i + \varepsilon_i)/\sqrt{2} \]
where \(\lambda_c\) is a randomly selected \(z_i\) value from cluster \(c\) and \(\eta_c\) is a randomly selected \(\varepsilon_i\) value from cluster \(c\). \(\lambda_c\) and \(\eta_c\) are the same for all members of cluster \(c\)
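- A sketch of this clustered DGP (the cluster count and size are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
C, m = 50, 20                            # 50 clusters of 20: illustrative only
N = C * m
cluster = np.repeat(np.arange(C), m)     # observations sorted by cluster
g = rng.integers(4, size=N)
beta = np.array([1.0, 2.0, 3.0, 4.0])[g]
gamma = np.array([0.0, 0.075, 0.15, 0.223])[g]
z, w, e, v = rng.normal(size=(4, N))

# lambda_c: a z value drawn from each cluster; eta_c: an epsilon value;
# both constant within cluster
pick = rng.integers(m, size=C) + np.arange(C) * m
lam, eta = z[pick][cluster], e[pick][cluster]
x = z * gamma + lam * (eta + w + v) / np.sqrt(2)
y = x * beta + lam * (eta + 2 * w + e) / np.sqrt(2)

print(np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1])   # IV on one clustered draw
```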
Clustering

[Figure]
Recovering the ATE
- We can use weighting to recover the ATE among compliers
- But the performance of the weighting estimator hasn’t been particularly strong
- However, at large sample sizes, the very small \(p = -1/4\) weight may reduce bias further
Recovering the ATE

[Figure]
Application
- I apply the group-interaction estimator to Angrist, Battistin, & Vuri (2017)
- This paper suggests that Italian small-classroom benefits might be due to teachers cheating
- It uses a Maimonides-style instrument for class size combined with a “monitored test” indicator
- I use GroupSearch and Causal Forest, presenting math results only here
- Sampling is done by cluster
First-Stage Heterogeneity

[Figure]
Main Results
| | All Italy | North or Central | South |
|---|---|---|---|
| Class Size \(\times\) Monitored | -0.035 | -0.039\(^{*}\) | -0.035 |
| | (0.024) | (0.021) | (0.060) |
| Class Size \(\times\) Not Monitored | -0.066\(^{***}\) | -0.042\(^{**}\) | -0.143\(^{***}\) |
| | (0.021) | (0.018) | (0.053) |
| Monitored | -0.174\(^{***}\) | -0.082\(^{**}\) | -0.395\(^{***}\) |
| | (0.041) | (0.038) | (0.096) |
| Weak IV F: Monitored | 44691 | 34569 | 12093 |
| Weak IV F: Not Monitored | 23072 | 19291 | 5552 |

GroupSearch (5 Groups)

| | All Italy | North or Central | South |
|---|---|---|---|
| Class Size \(\times\) Monitored | -0.036 | -0.038\(^{*}\) | -0.038 |
| | (0.024) | (0.021) | (0.060) |
| Class Size \(\times\) Not Monitored | -0.066\(^{***}\) | -0.041\(^{**}\) | -0.144\(^{***}\) |
| | (0.021) | (0.018) | (0.053) |
| Monitored | -0.173\(^{***}\) | -0.081\(^{**}\) | -0.389\(^{***}\) |
| | (0.041) | (0.038) | (0.096) |
| Weak IV F: Monitored | 8943 | 6919 | 2423 |
| Weak IV F: Not Monitored | 4615 | 3858 | 1111 |

Causal Forest (5 Quantiles)

| | All Italy | North or Central | South |
|---|---|---|---|
| Class Size \(\times\) Monitored | -0.029 | -0.041\(^{**}\) | -0.034 |
| | (0.023) | (0.021) | (0.060) |
| Class Size \(\times\) Not Monitored | -0.069\(^{***}\) | -0.042\(^{**}\) | -0.142\(^{***}\) |
| | (0.021) | (0.018) | (0.053) |
| Monitored | -0.191\(^{***}\) | -0.076\(^{**}\) | -0.393\(^{***}\) |
| | (0.039) | (0.037) | (0.094) |
| Weak IV F: Monitored | 9435 | 7081 | 2465 |
| Weak IV F: Not Monitored | 4793 | 3933 | 1140 |

Standard errors in parentheses.
Comparing Bias

[Figure]
Conclusion
- In social science, most effects are probably heterogeneous
- In IV this can affect what is identified and the small-sample bias
- But by modeling heterogeneity directly, we can considerably reduce that bias and improve robustness to monotonicity violations
- Importantly, these methods are extremely simple and can be implemented anywhere linear IV can
- But their power is improved through the use of new advances in heterogeneous effects modeling