Nonparametric Bounds in Two-Sample Summary-Data Mendelian Randomization

class: center, top, .title-slide, title-slide

# Nonparametric Bounds in Two-Sample Summary-Data Mendelian Randomization
## Some Cautionary Tales for Practice .vsmall[(slides at <a href="https://rpubs.com/rmtrane/jsm_presentation" class="uri">https://rpubs.com/rmtrane/jsm_presentation</a>)]
### Ralph Møller Trane
### University of Wisconsin–Madison 
### 2021-08-12

---

# Highlights

* Previously, nonparametric IV bounds have been thoroughly studied when data on exposure, outcome, and instrument are collected at once

* Many MR studies use two-sample data, i.e. data on instrument and exposure are separate from data on instrument and outcome

* Simulations and real data show that two-sample bounds are generally much wider than one-sample bounds, making them less useful

* The loss of information we see when going from one-sample to two-sample bounds varies a lot

* Generally, nonparametric bounds by themselves might be of limited use in two-sample MR studies.

---
layout: true

# Setup

---
 
Does some (binary) `$X$` cause (binary) `$Y$`? (We will only consider binary `$X$`, `$Y$`.)

Formally, want to learn something about `$\text{ATE} = E[Y^1 - Y^0] = E[Y^1] - E[Y^0]$`.

Tough question if we cannot rule out the existence of unmeasured confounders.
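
To make the difficulty concrete, here is a minimal sketch with a made-up data-generating model. The binary confounder `U` and every probability below are hypothetical, chosen only for illustration. It computes the ATE from the potential outcomes and compares it with the naive contrast `$E[Y | X = 1] - E[Y | X = 0]$`, which is all we could compute if `$U$` were unmeasured.

```python
# Hypothetical model: all numbers are illustrative assumptions.
P_U = 0.5  # P(U = 1), with U an unmeasured binary confounder


def p_x_given_u(u):
    """P(X = 1 | U = u): U makes exposure more likely."""
    return 0.2 + 0.6 * u


def p_y_potential(x, u):
    """P(Y^x = 1 | U = u): exposure adds 0.2, U adds 0.5."""
    return 0.1 + 0.2 * x + 0.5 * u


def true_ate():
    """E[Y^1] - E[Y^0], averaging the potential outcomes over U."""
    return sum(
        pu * (p_y_potential(1, u) - p_y_potential(0, u))
        for u, pu in ((0, 1 - P_U), (1, P_U))
    )


def naive_contrast():
    """E[Y | X = 1] - E[Y | X = 0], computed without adjusting for U."""
    def mean_y_given_x(x):
        num = den = 0.0
        for u, pu in ((0, 1 - P_U), (1, P_U)):
            px = p_x_given_u(u) if x == 1 else 1 - p_x_given_u(u)
            num += pu * px * p_y_potential(x, u)  # P(U=u, X=x, Y=1)
            den += pu * px                        # P(U=u, X=x)
        return num / den
    return mean_y_given_x(1) - mean_y_given_x(0)


print(f"true ATE       = {true_ate():.3f}")
print(f"naive contrast = {naive_contrast():.3f}")
```

In this toy model the ATE is 0.2 while the naive contrast is 0.5, so ignoring `$U$` overstates the effect.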

---

.pull-left[

We can estimate the ATE if we can find an instrument `$Z$`. Formally, `$Z$` should satisfy

* (A1) `$Z \not\perp X$` *(Relevance)*
* (A2) `$Z \perp U$` *(Independent instrument)*
* (A3) `$Y^{z,x} = Y^{z',x} = Y^{x}$` for all `$x,z,z'$` *(Exclusion restriction)*
* (A4) `$Y^{z,x} \perp Z, X | U$` *(Conditional ignorability of `$X,Z$` given `$U$`)*

]

.pull-right[
Examples:

* <a name=cite-leigh_instrumental_2004></a>[Leigh and Schembri (2004)](#bib-leigh_instrumental_2004) use cigarette price as an instrument to estimate the causal effect of smoking on health status as measured by the SF-12.
* <a name=cite-bloom_benefits_1997></a>[Bloom, Orr, Bell, et al. (1997)](#bib-bloom_benefits_1997) use the random assignment of admission to a job training program to assess the causal effect of that program on earnings. 
* Many more: <a name=cite-angrist_instrumental_2001></a>[Angrist and Krueger (2001)](#bib-angrist_instrumental_2001), <a name=cite-hernan_instruments_2006></a>[Hernán and Robins (2006)](#bib-hernan_instruments_2006), <a name=cite-baiocchi_instrumental_2014></a>[Baiocchi, Cheng, and Small (2014)](#bib-baiocchi_instrumental_2014).
]

---
layout: true

# Non-parametric bounds

---

There are many ways of estimating causal effects using IVs. Many rely on strong modeling assumptions, for example the underlying logistic regression model in GWAS.

The IV model itself can be used to obtain firm bounds on the ATE. <a name=cite-manski_nonparametric_1990></a>[Manski (1990)](#bib-manski_nonparametric_1990) showed that for a binary instrument

`$$\small
\max \left\{\begin{array}{c}
\max_z -P(Y = 0, X = 1 | Z = z) - P(Y = 1, X = 0 | Z = z) \\
\max_{z_1 \neq z_2} P(Y = 1 | Z = z_1) - P(Y = 1 | Z = z_2) - P(Y = 1, X = 0 | Z = z_1) - P(Y = 0, X = 1 | Z = z_2)
\end{array}\right\} \\
\le \text{ATE} \le \\
\small \min \left\{\begin{array}{c}
\min_z P(Y = 1, X = 1 | Z = z) + P(Y = 0, X = 0 | Z = z) \\
\min_{z_1 \neq z_2} P(Y = 1 | Z = z_1) - P(Y = 1 | Z = z_2) + P(Y = 0, X = 0 | Z = z_1) + P(Y = 1, X = 1 | Z = z_2)
\end{array}\right\}$$`

<a name=cite-balke_bounds_1997></a>[Balke and Pearl (1997)](#bib-balke_bounds_1997) showed that the width of these bounds never exceeds `$1 - \text{ST}$`, where `$\text{ST} = |P(X = 1|Z=1) - P(X = 1|Z=0)|$` measures the strength of the instrument.

Bounds for arbitrary categorical instruments are presented in <a name=cite-richardson_ace_2014></a>[Richardson and Robins (2014)](#bib-richardson_ace_2014).
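
As a rough illustration of how bounds of this type are computed, the sketch below bounds `$E[Y^1]$` and `$E[Y^0]$` separately from `$P(X, Y | Z = z)$` and then combines them. The function names and the numbers in `joint` are hypothetical, and this is not meant as a reference implementation.

```python
def one_sample_bounds(joint):
    """Bounds on the ATE from one-sample data.

    joint[z][(x, y)] = P(X = x, Y = y | Z = z) for each instrument level z.
    """
    # E[Y^1] lies between P(Y=1, X=1 | z) and 1 - P(Y=0, X=1 | z) for every z.
    ey1_lo = max(p[(1, 1)] for p in joint.values())
    ey1_hi = min(1 - p[(1, 0)] for p in joint.values())
    # E[Y^0] lies between P(Y=1, X=0 | z) and 1 - P(Y=0, X=0 | z) for every z.
    ey0_lo = max(p[(0, 1)] for p in joint.values())
    ey0_hi = min(1 - p[(0, 0)] for p in joint.values())
    return ey1_lo - ey0_hi, ey1_hi - ey0_lo


def strength(joint):
    """ST: largest gap in P(X = 1 | Z = z) across instrument levels."""
    p_x1 = [p[(1, 0)] + p[(1, 1)] for p in joint.values()]
    return max(p_x1) - min(p_x1)


# Hypothetical conditional distributions for a binary instrument.
joint = {
    0: {(0, 0): 0.50, (0, 1): 0.30, (1, 0): 0.15, (1, 1): 0.05},
    1: {(0, 0): 0.20, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.60},
}
lower, upper = one_sample_bounds(joint)
print(f"ATE in [{lower:.2f}, {upper:.2f}], width {upper - lower:.2f}, "
      f"1 - ST = {1 - strength(joint):.2f}")
```

For these made-up numbers the bounds are about `[0.10, 0.55]`, so the width (0.45) stays below `$1 - \text{ST} = 0.5$`.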

---
layout: false

# Two-Sample Mendelian Randomization

In recent years, the use of genetic markers as IVs has gained traction. This is called *Mendelian Randomization*. It builds on Gregor Mendel's observation that alleles are distributed randomly at fertilization <a name=cite-davey_smith_mendelian_2003></a><a name=cite-lawlor_mendelian_2008></a>([Davey Smith and Ebrahim, 2003](#bib-davey_smith_mendelian_2003); [Lawlor, Harbord, Sterne, et al., 2008](#bib-lawlor_mendelian_2008)).

For example (as per <a name=cite-voight_plasma_2012></a>[Voight, Peloso, Orho-Melander, et al. (2012)](#bib-voight_plasma_2012)),

* `$Z$` = LIPG Asn396Ser
* `$X$` = has high HDL cholesterol?
* `$Y$` = ever had heart attack?
* `$U$` = all other risk factors

In many MR analyses, we do not have data on `$(X,Y) | Z$`. Instead, we rely on GWAS results, which give information about `$X|Z$` and `$Y|Z$` separately.

Fortunately, bounds using `$P(X|Z)$` and `$P(Y|Z)$` have been derived <a name=cite-ramsahai_causal_2012></a>([Ramsahai, 2012](#bib-ramsahai_causal_2012)), but their behavior is not well known.

Our main question: **what can we learn from nonparametric bounds of causal effects in two-sample MR studies?**

---

# Result 1: Length of Nonparametric Bounds from Two-Sample MR

Width of many two-sample bounds vs. strength of instruments. Each dot represents bounds based on a set of values for `$P(X|Z)$` and `$P(Y|Z)$`. 
Black: simulated values. Colored: real data.

**Result**: under (A1)-(A4) and additional assumptions, width less than `$2(1-\text{ST})$`.

.small[
(For multi-leveled IV: `$\text{ST} = \max_{z_1 \neq z_2} | P(X = 1 | Z = z_1) - P(X = 1 | Z = z_2)|$`.)
]
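
A small sketch of what this cap means for a weak instrument. It assumes ST is the largest absolute difference in `$P(X = 1 | Z = z)$` across instrument levels, in line with the binary definition given earlier; the three-level instrument and its probabilities are hypothetical.

```python
def instrument_strength(p_x1_given_z):
    """ST = max over pairs z1 != z2 of |P(X=1 | Z=z1) - P(X=1 | Z=z2)|."""
    values = list(p_x1_given_z.values())
    # The largest pairwise absolute difference is max minus min.
    return max(values) - min(values)


# Hypothetical P(X = 1 | Z = z) for a SNP coded as 0, 1, or 2 allele copies.
p_x1_given_z = {0: 0.30, 1: 0.33, 2: 0.37}

st = instrument_strength(p_x1_given_z)
width_cap = 2 * (1 - st)  # cap on the two-sample width from Result 1
print(f"ST = {st:.2f}, cap on two-sample width = {width_cap:.2f}")
```

Since the ATE always lies in `$[-1, 1]$`, an interval of width 2, a cap of 1.86 rules out very little.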

---
# Illustration of Result 1

Due to very wide bounds, we are unable to detect direction when using real data, and generally learn very little:

.pull-left[A: Two-sample IV bounds for the ATE of smoking on the incidence of lung cancer.]

.pull-right[B: Two-sample IV bounds for the ATE of high cholesterol on the incidence of heart attack.]

Note: results based on GWAS.

---
# Interpretation of Result 1

Conclusion: we pay a price when using two-sample rather than one-sample data.

Question: how much information is lost due to the two-sample design?

---
layout: true

# Quantifying Information Loss

---

In one-sample data, we get `$\color{blue}{P(X = x, Y = y | Z = z)}$`

In two-sample data, we get `$\color{red}{P(X = x | Z = z), P(Y = y | Z = z)}$`.

What we really lose is information about `$\text{Cov}(X,Y | Z = z)$`!

**IF** we knew `$\text{Cov}(X, Y | Z = z)$`, we could go from two-sample information to one-sample information:

`$$\color{blue}{P(X = x, Y = y | Z = z)} = \color{red}{P(X = x | Z = z)P(Y = y | Z = z)} + (2\cdot I[x = y] - 1)\text{Cov}(X, Y | Z = z)$$`

We obtain *potential* one-sample bounds based on the two-sample data by randomly drawing valid values of `$\text{Cov}(X,Y|Z=z)$`.

By doing so repeatedly, we get a sense of what information might have been obtained from one-sample data nonparametric bounds.
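
A minimal sketch of one such draw, under two assumptions that are not from the slides: "valid" here only means that all four cell probabilities are non-negative, and the covariance is drawn uniformly over that range. The margins below are hypothetical.

```python
import random


def cov_range(p_x, p_y):
    """Values of Cov(X, Y | Z = z) that keep all four cell probabilities >= 0."""
    low = max(-p_x * p_y, -(1 - p_x) * (1 - p_y))
    high = min(p_x * (1 - p_y), (1 - p_x) * p_y)
    return low, high


def joint_from_margins(p_x, p_y, cov):
    """P(X=x, Y=y | Z=z) = P(X=x|z) P(Y=y|z) + (2*I[x=y] - 1) * cov."""
    joint = {}
    for x in (0, 1):
        for y in (0, 1):
            px = p_x if x == 1 else 1 - p_x
            py = p_y if y == 1 else 1 - p_y
            joint[(x, y)] = px * py + (1 if x == y else -1) * cov
    return joint


def draw_joint(p_x, p_y, rng):
    """Draw one candidate joint distribution compatible with the margins."""
    low, high = cov_range(p_x, p_y)
    return joint_from_margins(p_x, p_y, rng.uniform(low, high))


rng = random.Random(2021)
# Hypothetical two-sample input: z -> (P(X=1 | Z=z), P(Y=1 | Z=z)).
margins = {0: (0.30, 0.10), 1: (0.37, 0.12)}
candidate = {z: draw_joint(p_x, p_y, rng) for z, (p_x, p_y) in margins.items()}
for z, joint in candidate.items():
    print(z, {cell: round(p, 3) for cell, p in joint.items()})
```

Each `candidate` can then be passed to a one-sample bounds routine; repeating the draw many times gives the spread of potential one-sample bounds.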

---

We reconstruct 1000 one-sample bounds from each of nine sets of two-sample bounds. Simulated data.

---

.pull-left[
(Figure: possible one-sample IV bounds, panels A and B.)
]

.pull-right[

Possible one-sample IV bounds for the ATE of

A. smoking on the incidence of lung cancer

B. high cholesterol on the incidence of heart attack

]

---

Conclusions:

1. Hard to characterize the information loss in general

2. Information lost due to two-sample design might not be the limiting factor for nonparametric bounds in MR analyses

---
layout: false

# Lessons Learned

Lesson 1: Two-sample data result in much wider (more conservative) bounds than one-sample data

Lesson 2: In practice, the genetic markers used as instruments are just too weak to give informative bounds

Lesson 3: Bound-based analysis does not, on its own, seem to be terribly useful in a two-sample MR study

Lesson 4: However, it might be useful in addition to other sorts of analyses:
* check if an effect estimate based on a different IV method is within the bounds
* bound effect size if direction is already well known

---
layout: false

# References

<a name=bib-angrist_instrumental_2001></a>[Angrist, J. D. and A. B.
Krueger](#cite-angrist_instrumental_2001) (2001). "Instrumental
Variables and the Search for Identification: From Supply and Demand to
Natural Experiments". In: _Journal of Economic Perspectives_ 15.4,
pp. 69-85. ISSN: 0895-3309. DOI:
[10.1257/jep.15.4.69](https://doi.org/10.1257%2Fjep.15.4.69). URL:
[https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69](https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69)
(visited on Apr. 30, 2021).

<a name=bib-baiocchi_instrumental_2014></a>[Baiocchi, M., J. Cheng, and
D. S. Small](#cite-baiocchi_instrumental_2014) (2014). "Instrumental
Variable Methods for Causal Inference". In: _Statistics in Medicine_ 33.13, pp.
2297-2340. ISSN: 02776715. DOI:
[10.1002/sim.6128](https://doi.org/10.1002%2Fsim.6128).

<a name=bib-balke_bounds_1997></a>[Balke, A. and J.
Pearl](#cite-balke_bounds_1997) (1997). "Bounds on Treatment Effects
from Studies with Imperfect Compliance". In: _Journal of the American
Statistical Association_ 92.439, pp. 1171-1176. ISSN: 0162-1459. DOI:
[10.1080/01621459.1997.10474074](https://doi.org/10.1080%2F01621459.1997.10474074).
URL:
[https://doi.org/10.1080/01621459.1997.10474074](https://doi.org/10.1080/01621459.1997.10474074)
(visited on Feb. 05, 2020).

<a name=bib-bloom_benefits_1997></a>[Bloom, H. S., L. L. Orr, S. H.
Bell, et al.](#cite-bloom_benefits_1997) (1997). "The Benefits and
Costs of JTPA Title II-A Programs: Key Findings from the National Job
Training Partnership Act Study". In: _The Journal of Human Resources_
32.3, p. 549. ISSN: 0022166X. DOI:
[10.2307/146183](https://doi.org/10.2307%2F146183). URL:
[https://www.jstor.org/stable/146183?origin=crossref](https://www.jstor.org/stable/146183?origin=crossref)
(visited on Apr. 30, 2021).

<a name=bib-davey_smith_mendelian_2003></a>[Davey Smith, G. and S.
Ebrahim](#cite-davey_smith_mendelian_2003) (2003). "'Mendelian
Randomization': Can Genetic Epidemiology Contribute to Understanding
Environmental Determinants of Disease?" In: _International Journal
of Epidemiology_ 32.1, pp. 1-22. ISSN: 0300-5771. DOI:
[10.1093/ije/dyg070](https://doi.org/10.1093%2Fije%2Fdyg070).

<a name=bib-hernan_instruments_2006></a>[Hernán, M. A. and J. M.
Robins](#cite-hernan_instruments_2006) (2006). "Instruments for Causal
Inference: An Epidemiologist's Dream?" In: _Epidemiology_ 17.4, pp.
360-372. ISSN: 1044-3983. DOI:
[10.1097/01.ede.0000222409.00878.37](https://doi.org/10.1097%2F01.ede.0000222409.00878.37).

---

# References (cont.)

<a name=bib-lawlor_mendelian_2008></a>[Lawlor, D. A., R. M. Harbord, J.
A. C. Sterne, et al.](#cite-lawlor_mendelian_2008) (2008). "Mendelian
Randomization: Using Genes as Instruments for Making Causal Inferences
in Epidemiology". In: _Statistics in Medicine_ 27.8, pp.
1133-1163. ISSN: 0277-6715. DOI:
[10.1002/sim.3034](https://doi.org/10.1002%2Fsim.3034).

<a name=bib-leigh_instrumental_2004></a>[Leigh, J. and M.
Schembri](#cite-leigh_instrumental_2004) (2004). "Instrumental
variables technique: cigarette price provided better estimate of
effects of smoking on SF-12". In: _Journal of Clinical
Epidemiology_ 57.3, pp. 284-293. ISSN: 08954356. DOI:
[10.1016/j.jclinepi.2003.08.006](https://doi.org/10.1016%2Fj.jclinepi.2003.08.006).
URL:
[https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214](https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214)
(visited on Apr. 30, 2021).

<a name=bib-manski_nonparametric_1990></a>[Manski, C.
F.](#cite-manski_nonparametric_1990) (1990). "Nonparametric Bounds on
Treatment Effects". In: _The American Economic Review_ 80.2, pp.
319-323. ISSN: 0002-8282.

<a name=bib-ramsahai_causal_2012></a>[Ramsahai, R.
R.](#cite-ramsahai_causal_2012) (2012). "Causal Bounds and Observable
Constraints for Non-Deterministic Models". In: _Journal of Machine Learning Research_
13, pp. 829-848. ISSN: 1532-4435.

<a name=bib-richardson_ace_2014></a>[Richardson, T. S. and J. M.
Robins](#cite-richardson_ace_2014) (2014). "ACE Bounds; SEMs with
Equilibrium Conditions". In: _Statistical Science_ 29.3, pp. 363-366.
ISSN: 0883-4237. DOI:
[10.1214/14-STS485](https://doi.org/10.1214%2F14-STS485). arXiv:
[1410.0470](https://arxiv.org/abs/1410.0470).

<a name=bib-voight_plasma_2012></a>[Voight, B. F., G. M. Peloso, M.
Orho-Melander, et al.](#cite-voight_plasma_2012) (2012). "Plasma HDL
cholesterol and risk of myocardial infarction: a mendelian
randomisation study". In: _The Lancet_ 380.9841, pp. 572-580. ISSN:
01406736. DOI:
[10.1016/S0140-6736(12)60312-2](https://doi.org/10.1016%2FS0140-6736%2812%2960312-2).
URL:
[https://linkinghub.elsevier.com/retrieve/pii/S0140673612603122](https://linkinghub.elsevier.com/retrieve/pii/S0140673612603122)
(visited on Aug. 09, 2021).