Nonparametric Bounds in Two-Sample Summary-Data Mendelian Randomization

class: center, top, .title-slide, title-slide

# Nonparametric Bounds in Two-Sample Summary-Data Mendelian Randomization
## Some Cautionary Tales for Practice .vsmall[(slides at <a href="https://rpubs.com/rmtrane/bdo_pres" class="uri">https://rpubs.com/rmtrane/bdo_pres</a>)]
### Ralph Møller Trane
### University of Wisconsin–Madison 
### 2021-04-30

---

# Setup

Does `$X$` cause `$Y$`? (We will only consider binary `$X,Y$`)

Formally, we want to learn something about `$\text{ATE} = E[Y^1 - Y^0]$`. (Since `$Y$` is binary, `$-1 \le \text{ATE} \le 1$`.)

Tough question if we cannot rule out the existence of unmeasured confounders.

---

# Instrumental Variables

.pull-left[
We can estimate the ATE if we can find an instrument `$Z$` that satisfies the following:

(A1) *(Relevance)*: `$Z \not\perp X$` 
(A2) *(Independent instrument)*: `$Z \perp U$` 
(A3) *(Exclusion restriction)*: `$Y^{z,x} = Y^{z',x} = Y^{x}$` for all `$x,z,z'$` 
(A4) *(Conditional ignorability of `$X,Z$` given `$U$`)*: `$Y^{z,x} \perp Z, X | U$`

]

.pull-right[
Examples:

* <a name=cite-leigh_instrumental_2004></a>[Leigh and Schembri (2004)](https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214) use cigarette price as an instrument to estimate the causal effect of smoking on SF-12 health scores.
* <a name=cite-bloom_benefits_1997></a>[Bloom, Orr, Bell, Cave, Doolittle, Lin, and Bos (1997)](https://www.jstor.org/stable/146183?origin=crossref) use the random assignment of admission to a training program to assess the causal effect of that program on earnings. 
* Many more: <a name=cite-angrist_instrumental_2001></a>[Angrist and Krueger (2001)](https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69)
]
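
As a rough illustration of (A1)-(A4), the sketch below simulates one data-generating process that satisfies them. The function name `simulate_iv` and all numeric coefficients are made up for illustration and are not taken from the cited studies. It shows why unmeasured confounding is a problem: the naive contrast `$E[Y | X = 1] - E[Y | X = 0]$` lands far from the true ATE.

```python
import numpy as np

def simulate_iv(n=200_000, seed=1):
    """Simulate (Z, X, Y) with an unmeasured confounder U (hypothetical coefficients)."""
    rng = np.random.default_rng(seed)
    u = rng.random(n) < 0.5                      # unmeasured confounder U
    z = rng.random(n) < 0.5                      # (A2): Z is drawn independently of U
    x = rng.random(n) < 0.2 + 0.3 * z + 0.4 * u  # (A1): X depends on Z (and on U)
    # (A3)/(A4): potential outcomes depend on U only, never on Z
    y0 = rng.random(n) < 0.1 + 0.5 * u
    y1 = rng.random(n) < 0.3 + 0.5 * u
    y = np.where(x, y1, y0)                      # observed outcome
    true_ate = y1.mean() - y0.mean()             # known only because we simulated
    naive = y[x].mean() - y[~x].mean()           # confounded by U
    return true_ate, naive

true_ate, naive = simulate_iv()
print(f"true ATE = {true_ate:.3f}, naive contrast = {naive:.3f}")
```

With these coefficients the true ATE is 0.2 by construction, while the naive contrast comes out at roughly 0.4.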

---

# Mendelian Randomization

In recent years, the use of genetic markers as IVs has gained traction. This is called *Mendelian Randomization*.

This builds on Gregor Mendel's observation that alleles are assigned randomly at fertilization.

For example,

* `$Z$` = some SNP
* `$X$` = high cholesterol
* `$Y$` = incidence of heart attack
* `$U$` = environmental risk factors

There are many ways of estimating causal effects using IVs. Most rely on additional strong modeling assumptions.

The IV model itself can be used to obtain firm nonparametric bounds on the ATE.

---
layout: true

# Nonparametric bounds

---

<a name=cite-manski_nonparametric_1990></a>[Manski (1990)](#bib-manski_nonparametric_1990) showed that for a binary instrument

`$$\small
\max \left\{\begin{array}{c}
\max_z -P(Y = 0, X = 1 | Z = z) - P(Y = 1, X = 0 | Z = z) \\
\max_{z_1 \neq z_2} P(Y = 1 | Z = z_1) - P(Y = 1 | Z = z_2) - P(Y = 1, X = 0 | Z = z_1) - P(Y = 0, X = 1 | Z = z_2)
\end{array}\right\} \\
\le \text{ATE} \le \\
\small \min \left\{\begin{array}{c}
\min_z P(Y = 1, X = 1 | Z = z) + P(Y = 0, X = 0 | Z = z) \\
\min_{z_1 \neq z_2} P(Y = 1 | Z = z_1) - P(Y = 1 | Z = z_2) + P(Y = 0, X = 0 | Z = z_1) + P(Y = 1, X = 1 | Z = z_2)
\end{array}\right\}$$`

<a name=cite-balke_bounds_1997></a>[Balke and Pearl (1997)](https://doi.org/10.1080/01621459.1997.10474074) showed that the width of these bounds is always less than `$1 - \text{ST}$` (important!), where

`$$\text{ST} = |P(X = 1|Z=1) - P(X = 1|Z=0)|$$`

(Bounds for arbitrary categorical instruments presented in <a name=cite-richardson_ace_2014></a>[Richardson and Robins (2014)](https://arxiv.org/abs/1410.0470))
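
A minimal sketch of bounds of this type, assuming the one-sample distribution is stored as an array `p[z, x, y]` `$= P(X = x, Y = y | Z = z)$` (this layout and the function names are illustrative choices, not from a package). It bounds `$E[Y^1]$` and `$E[Y^0]$` separately at every instrument level and then combines the tightest limits across levels.

```python
import numpy as np

def one_sample_bounds(p):
    """Return (lower, upper) for the ATE from p[z, x, y] = P(X=x, Y=y | Z=z)."""
    p = np.asarray(p, dtype=float)
    p_x1 = p[:, 1, :].sum(axis=1)   # P(X = 1 | Z = z)
    p_x0 = p[:, 0, :].sum(axis=1)   # P(X = 0 | Z = z)
    # For every z: P(Y=1, X=1 | z) <= E[Y^1] <= P(Y=1, X=1 | z) + P(X=0 | z)
    ey1_lo = np.max(p[:, 1, 1])
    ey1_hi = np.min(p[:, 1, 1] + p_x0)
    # For every z: P(Y=1, X=0 | z) <= E[Y^0] <= P(Y=1, X=0 | z) + P(X=1 | z)
    ey0_lo = np.max(p[:, 0, 1])
    ey0_hi = np.min(p[:, 0, 1] + p_x1)
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

def strength(p):
    """ST: largest difference in P(X = 1 | Z = z) between two instrument levels."""
    p_x1 = np.asarray(p, dtype=float)[:, 1, :].sum(axis=1)
    return p_x1.max() - p_x1.min()

# Perfect compliance (X = Z) and Y = X: the bounds collapse to the point ATE = 1.
p = np.array([[[1.0, 0.0], [0.0, 0.0]],
              [[0.0, 0.0], [0.0, 1.0]]])
lower, upper = one_sample_bounds(p)
print(lower, upper, strength(p))
```

For bounds computed this way, the width never exceeds `$1 - \text{ST}$`, so a weak instrument can only give wide bounds.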

---

In many MR analyses, we do not have data on `$(X,Y) | Z$`. Instead, we rely on GWAS results, which give information about `$X|Z$` and `$Y|Z$` separately.

Fortunately, bounds using `$P(X|Z)$` and `$P(Y|Z)$` have been derived <a name=cite-ramsahai_causal_2012></a>([Ramsahai, 2012](#bib-ramsahai_causal_2012)), but their behavior is not well known.

Our main question: what can we learn about causal effects using nonparametric bounds in two-sample MR studies?

---

Width of many two-sample bounds vs. strength of instruments. Each dot represents bounds based on a set of values for `$P(X|Z)$` and `$P(Y|Z)$`. Black: simulated values. Colored: real data.

**Result**: under additional assumptions, width `$\le 2(1-\text{ST})$`.

.small[
(For multi-leveled IV: `$\text{ST} = \max_{z_1 \neq z_2} | P(X = 1 | Z = z_1) - P(X = 1 | Z = z_2)|$`.)
]
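
To get a feel for why two-sample bounds are wide, here is a conservative sketch that uses only `$P(X = 1 | Z = z)$` and `$P(Y = 1 | Z = z)$`. It replaces each unknown joint probability by its Fréchet limits. This is an assumption-laden stand-in, not necessarily the sharp expressions of Ramsahai (2012), and the function name is hypothetical.

```python
import numpy as np

def two_sample_bounds(px1, py1):
    """Conservative ATE bounds from px1[z] = P(X=1 | Z=z) and py1[z] = P(Y=1 | Z=z)."""
    px1 = np.asarray(px1, dtype=float)
    py1 = np.asarray(py1, dtype=float)
    px0 = 1.0 - px1
    # Frechet limits: max(0, py1 - px0) <= P(Y=1, X=1 | z) <= min(py1, px1), so
    # max(0, py1 - px0) <= E[Y^1] <= min(py1, px1) + px0 = min(py1 + px0, 1).
    ey1_lo = np.max(np.maximum(0.0, py1 - px0))
    ey1_hi = np.min(np.minimum(1.0, py1 + px0))
    # Same argument with the roles of X = 0 and X = 1 swapped for E[Y^0].
    ey0_lo = np.max(np.maximum(0.0, py1 - px1))
    ey0_hi = np.min(np.minimum(1.0, py1 + px1))
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo

# Illustrative (made-up) summary data with ST = 0.7 - 0.2 = 0.5.
lower, upper = two_sample_bounds([0.2, 0.7], [0.3, 0.5])
print(lower, upper)
```

With these made-up inputs the bounds are roughly (-0.3, 0.7): they have width `$2(1 - \text{ST}) = 1$` and cover zero, so even the direction of the effect is not determined.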

---

The two-sample bounds are also unable to detect the direction of the effect in real data:

.pull-left[A: Two-sample IV bounds for the ATE of smoking on the incidence of lung cancer.]

.pull-right[B: Two-sample IV bounds for the ATE of high cholesterol on the incidence of heart attack.]

---
 
Conclusion: we pay a price when using two-sample rather than one-sample data.
--
 That price is information about `$\text{Cov}(X,Y | Z = z)$`:

`$$
P(X = x, Y = y | Z = z) = P(X = x | Z = z)P(Y = y | Z = z) + (2\cdot I[x = y] - 1)\text{Cov}(X, Y | Z = z)
$$`

Given the observed values of `$P(X = x | Z = z)$` and `$P(Y = y | Z = z)$`, we can derive inequalities that `$\text{Cov}(X,Y | Z = z)$` must satisfy for the resulting distribution of `$(X,Y)|Z$` to be compatible with the IV model.

So we can get a sense of the information lost due to the two-sample design by drawing random, valid values of `$\text{Cov}(X,Y | Z = z)$` and reconstructing the corresponding one-sample bounds.
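
The reconstruction step can be sketched as follows. As assumptions for illustration: covariances are drawn uniformly within their Fréchet range, and validity is checked with Pearl's instrumental inequality, a necessary condition for the IV model that stands in here for the exact inequalities used in the talk. All function names are hypothetical.

```python
import numpy as np

def cov_range(px1, py1):
    """Range of Cov(X, Y | Z=z) that keeps all four cell probabilities non-negative."""
    lo = max(0.0, px1 + py1 - 1.0) - px1 * py1
    hi = min(px1, py1) - px1 * py1
    return lo, hi

def joint_from_marginals(px1, py1, cov):
    """q[x, y] = P(X=x | z) P(Y=y | z) + (2 I[x == y] - 1) Cov(X, Y | z)."""
    px = np.array([1.0 - px1, px1])
    py = np.array([1.0 - py1, py1])
    sign = np.array([[1.0, -1.0], [-1.0, 1.0]])   # 2 * I[x == y] - 1
    return np.outer(px, py) + sign * cov

def pearl_inequality_holds(p, tol=1e-12):
    """Check sum_y max_z P(X=x, Y=y | Z=z) <= 1 for each x, with p indexed [z, x, y]."""
    return bool(np.all(p.max(axis=0).sum(axis=1) <= 1.0 + tol))

def sample_joints(px1, py1, n_draws=1000, seed=0):
    """Draw joints p[z, x, y] consistent with the marginals; keep those passing the check."""
    rng = np.random.default_rng(seed)
    kept = []
    for _ in range(n_draws):
        p = np.stack([joint_from_marginals(a, b, rng.uniform(*cov_range(a, b)))
                      for a, b in zip(px1, py1)])
        if pearl_inequality_holds(p):
            kept.append(p)
    return kept

joints = sample_joints([0.2, 0.7], [0.3, 0.5], n_draws=100)
print(len(joints), "candidate one-sample distributions")
```

Each retained array is one candidate one-sample distribution that agrees with the two-sample summaries, and one-sample bounds can then be computed from it.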

---

Reconstructed one-sample bounds based on two-sample bounds. 1000 one-sample bounds in each panel. Simulated data.

---

.pull-left[
Figure: possible one-sample (trivariate) IV bounds, panels A and B.
]

.pull-right[

Possible one-sample IV bounds for the ATE of

A. smoking on the incidence of lung cancer

B. high cholesterol on the incidence of heart attack

]

---
layout: false

# Lessons Learned

* Two-sample data yield much more conservative (wider) bounds than one-sample data

* In practice, the genetic markers used as instruments are just too weak to give informative bounds

* Bound-based analysis does not, on its own, seem to be terribly useful in a two-sample MR study

* However, it might be useful in addition to other sorts of analyses:
    - check if an effect estimate based on a different IV method is within the bounds
    - bound effect size if direction is already well known

.small[
(Slides created using [`xaringan`](https://bookdown.org/yihui/rmarkdown/xaringan.html). Theme available as RStudio skeleton here: https://github.com/rmtrane/XaringanForUWMadison)
]

---
layout: false

# References

<a name=bib-balke_bounds_1997></a>[Balke, A. and J.
Pearl](#cite-balke_bounds_1997) (1997). "Bounds on Treatment Effects
from Studies with Imperfect Compliance". In: _Journal of the American
Statistical Association_ 92.439, pp. 1171-1176. ISSN: 0162-1459. DOI:
[10.1080/01621459.1997.10474074](https://doi.org/10.1080%2F01621459.1997.10474074).
URL:
[https://doi.org/10.1080/01621459.1997.10474074](https://doi.org/10.1080/01621459.1997.10474074)
(visited on Feb. 05, 2020).

<a name=bib-bloom_benefits_1997></a>[Bloom, H. S., L. L. Orr, S. H.
Bell, et al.](#cite-bloom_benefits_1997) (1997). "The Benefits and
Costs of JTPA Title II-A Programs: Key Findings from the National Job
Training Partnership Act Study". In: _The Journal of Human Resources_
32.3, p. 549. ISSN: 0022166X. DOI:
[10.2307/146183](https://doi.org/10.2307%2F146183). URL:
[https://www.jstor.org/stable/146183?origin=crossref](https://www.jstor.org/stable/146183?origin=crossref)
(visited on Apr. 30, 2021).

<a name=bib-leigh_instrumental_2004></a>[Leigh, J. and M.
Schembri](#cite-leigh_instrumental_2004) (2004). "Instrumental
variables technique: cigarette price provided better estimate of
effects of smoking on SF-12". In: _Journal of Clinical
Epidemiology_ 57.3, pp. 284-293. ISSN: 08954356. DOI:
[10.1016/j.jclinepi.2003.08.006](https://doi.org/10.1016%2Fj.jclinepi.2003.08.006).
URL:
[https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214](https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214)
(visited on Apr. 30, 2021).

<a name=bib-manski_nonparametric_1990></a>[Manski, C.
F.](#cite-manski_nonparametric_1990) (1990). "Nonparametric Bounds on
Treatment Effects". In: _The American Economic Review_ 80.2, pp.
319-323. ISSN: 0002-8282.

<a name=bib-ramsahai_causal_2012></a>[Ramsahai, R.
R.](#cite-ramsahai_causal_2012) (2012). "Causal Bounds and Observable
Constraints for Non-Deterministic Models". In: _J. Mach. Learn. Res._
13, pp. 829-848. ISSN: 1532-4435.

---

# References (cont.)

<a name=bib-angrist_instrumental_2001></a>[Angrist, J. D. and A. B.
Krueger](#cite-angrist_instrumental_2001) (2001). "Instrumental
Variables and the Search for Identification: From Supply and Demand to
Natural Experiments". In: _Journal of Economic Perspectives_ 15.4,
pp. 69-85. ISSN: 0895-3309. DOI:
[10.1257/jep.15.4.69](https://doi.org/10.1257%2Fjep.15.4.69). URL:
[https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69](https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69)
(visited on Apr. 30, 2021).

<a name=bib-richardson_ace_2014></a>[Richardson, T. S. and J. M.
Robins](#cite-richardson_ace_2014) (2014). "ACE Bounds; SEMs with
Equilibrium Conditions". In: _Statistical Science_ 29.3, pp. 363-366.
ISSN: 0883-4237. DOI:
[10.1214/14-STS485](https://doi.org/10.1214%2F14-STS485). arXiv:
[1410.0470](https://arxiv.org/abs/1410.0470).