class: center, top, .title-slide, title-slide

# Nonparametric Bounds in Two-Sample Summary-Data Mendelian Randomization

## Some Cautionary Tales for Practice

.vsmall[(slides at https://rpubs.com/rmtrane/jsm_presentation)]

### Ralph Møller Trane

### University of Wisconsin–Madison

### 2021-08-12

---

# Highlights

* Previously, nonparametric IV bounds have been thoroughly studied for the case where data on exposure, outcome, and instrument are collected in a single sample
* Many MR studies use two-sample data, i.e. data on instrument and exposure come from a different sample than data on instrument and outcome
* Simulations and real data show that two-sample bounds are generally much wider than one-sample bounds, making them less useful
* The amount of information lost when going from one-sample to two-sample bounds varies a lot
* Generally, nonparametric bounds by themselves might be of limited use in two-sample MR studies

---

layout: true

# Setup

---

Does some (binary) `\(X\)` cause (binary) `\(Y\)`? (We will only consider binary `\(X\)`, `\(Y\)`.)

Formally, we want to learn something about `\(\text{ATE} = E[Y^1 - Y^0] = E[Y^1] - E[Y^0]\)`.

This is a tough question if we cannot rule out the existence of unmeasured confounders.

<img src="data:image/png;base64,#JSMpresentation_files/figure-html/unnamed-chunk-2-1.png" width="700px" height="400px" style="display: block; margin: auto;" />

---

.pull-left[
We can estimate the ATE if we can find an instrument `\(Z\)` such that

<img src="data:image/png;base64,#JSMpresentation_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

Formally, `\(Z\)` should satisfy

(A1) `\(Z \not\perp X\)` *(Relevance)*</br>
(A2) `\(Z \perp U\)` *(Independent instrument)*</br>
(A3) `\(Y^{z,x} = Y^{z',x} = Y^{x}\)` for all `\(x,z,z'\)` *(Exclusion restriction)*</br>
(A4) `\(Y^{z,x} \perp Z, X | U\)` *(Conditional ignorability of `\(X,Z\)` given `\(U\)`)*
]

--

.pull-right[
Examples:

* <a name=cite-leigh_instrumental_2004></a>[Leigh and Schembri (2004)](#bib-leigh_instrumental_2004) use tobacco tax level as an instrument to estimate the causal effect of smoking on lung cancer.
* <a name=cite-bloom_benefits_1997></a>[Bloom, Orr, Bell, et al.
(1997)](#bib-bloom_benefits_1997) use the random assignment of admission to a job training program to assess the causal effect of that program on earnings.
* Many more: <a name=cite-angrist_instrumental_2001></a>[Angrist and Krueger (2001)](#bib-angrist_instrumental_2001), <a name=cite-hernan_instruments_2006></a>[Hernán and Robins (2006)](#bib-hernan_instruments_2006), <a name=cite-baiocchi_instrumental_2014></a>[Baiocchi, Cheng, and Small (2014)](#bib-baiocchi_instrumental_2014).
]

---

layout: true

# Nonparametric Bounds

---

There are many ways of estimating causal effects using IVs. Many rely on strong modeling assumptions, for example the underlying logistic regression model in GWAS.

--

The IV model itself can be used to obtain firm bounds on the ATE. <a name=cite-manski_nonparametric_1990></a>[Manski (1990)](#bib-manski_nonparametric_1990) showed that for a binary instrument

`$$\small \max \left\{\begin{array}{c} \max_z -P(Y = 0, X = 1 | Z = z) - P(Y = 1, X = 0 | Z = z) \\ \max_{z_1 \neq z_2} P(Y = 1 | Z = z_1) - P(Y = 1 | Z = z_2) - P(Y = 1, X = 0 | Z = z_1) - P(Y = 0, X = 1 | Z = z_2) \end{array}\right\} \\ \le \text{ATE} \le \\ \small \min \left\{\begin{array}{c} \min_z P(Y = 1, X = 1 | Z = z) + P(Y = 0, X = 0 | Z = z) \\ \min_{z_1 \neq z_2} P(Y = 1 | Z = z_1) - P(Y = 1 | Z = z_2) + P(Y = 0, X = 0 | Z = z_1) + P(Y = 1, X = 1 | Z = z_2) \end{array}\right\}$$`

<a name=cite-balke_bounds_1997></a>[Balke and Pearl (1997)](#bib-balke_bounds_1997) showed that the width of these bounds is at most `\(1 - ST\)`, where `\(ST = |P(X = 1|Z=1) - P(X = 1|Z=0)|\)`.

Bounds for arbitrary categorical instruments are presented in <a name=cite-richardson_ace_2014></a>[Richardson and Robins (2014)](#bib-richardson_ace_2014).

---

layout: false

# Two-Sample Mendelian Randomization

In recent years, the use of genetic markers as IVs has gained traction. This is called *Mendelian Randomization*.
It builds on Gregor Mendel's observation that alleles are distributed randomly at fertilization <a name=cite-davey_smith_mendelian_2003></a><a name=cite-lawlor_mendelian_2008></a>([Davey Smith and Ebrahim, 2003](#bib-davey_smith_mendelian_2003); [Lawlor, Harbord, Sterne, et al., 2008](#bib-lawlor_mendelian_2008)).

--

For example (as per <a name=cite-voight_plasma_2012></a>[Voight, Peloso, Orho-Melander, et al. (2012)](#bib-voight_plasma_2012)),

* `\(Z\)` = LIPG Asn396Ser
* `\(X\)` = has high HDL cholesterol?
* `\(Y\)` = ever had heart attack?
* `\(U\)` = all other risk factors

--

Many MR analyses do not have data on `\((X,Y) | Z\)`. Instead, they rely on GWAS results, which give information about `\(X|Z\)` and `\(Y|Z\)` separately.

Fortunately, bounds using `\(P(X|Z)\)` and `\(P(Y|Z)\)` have been derived <a name=cite-ramsahai_causal_2012></a>([Ramsahai, 2012](#bib-ramsahai_causal_2012)), but their behavior is not well known.

--

Our main question: **what can we learn from nonparametric bounds of causal effects in two-sample MR studies?**

---

# Result 1: Length of Nonparametric Bounds from Two-Sample MR

Width of many two-sample bounds vs. strength of instruments. Each dot represents bounds based on a set of values for `\(P(X|Z)\)` and `\(P(Y|Z)\)`.
</br>
Black: simulated values. Colored: real data.

<center>
<img src="data:image/png;base64,#/Users/ralphtrane/Documents/ACEBounds/JSMpresentation/pip_figure.png" height="375"/>
</center>

**Result**: under (A1)-(A4) and additional assumptions, the width is less than `\(2(1-\text{ST})\)`.

.small[
(For multi-leveled IV: `\(\text{ST} = \max_{z_1 \neq z_2} | P(X = 1 | Z = z_1) - P(X = 1 | Z = z_2)|\)`.)
]

---

# Illustration of Result 1

Due to very wide bounds, we are unable to detect the direction of the effect when using real data, and generally learn very little:

.pull-left[A: Two-sample IV bounds for the ATE of smoking on the incidence of lung cancer.]
.pull-right[B: Two-sample IV bounds for the ATE of high cholesterol on the incidence of heart attack.]

<img src="data:image/png;base64,#/Users/ralphtrane/Documents/ACEBounds/figures/png/example_analyses/bivariate_bounds.png" height="400" class="imgcenter"/>

Note: results based on GWAS.

---

# Interpretation of Result 1

Conclusion: we pay a price when using two-sample rather than one-sample data.

Question: how much information is lost due to the two-sample design?

---

layout: true

# Quantifying Information Loss

---

In <span style="color: blue">one-sample</span> data, we get `\(\color{blue}{P(X = x, Y = y | Z = z)}\)`.

In <span style="color: red">two-sample</span> data, we get `\(\color{red}{P(X = x | Z = z), P(Y = y | Z = z)}\)`.

--

What we really lose is information about `\(\text{Cov}(X,Y | Z = z)\)`!

**IF** we knew `\(\text{Cov}(X, Y | Z = z)\)`, we could go from two-sample information to one-sample information:

$$
\color{blue}{P(X = x, Y = y | Z = z)} = \color{red}{P(X = x | Z = z)P(Y = y | Z = z)} + (2\cdot I[x = y] - 1)\text{Cov}(X, Y | Z = z)
$$

--

We obtain *potential* <span style="color: blue">one-sample</span> bounds based on the <span style="color: red">two-sample</span> data by randomly drawing valid values of `\(\text{Cov}(X,Y|Z=z)\)`. By doing so repeatedly, we get a sense of what information might have been obtained from nonparametric bounds based on one-sample data.

<!-- From the IV model, we find constraints on `\(\text{Cov}(X,Y|Z)\)` depending on `\(P(X|Z)\)` and `\(P(Y|Z)\)`. Randomly choosing valid values of `\(\text{Cov}(X,Y|Z)\)`, we can reconstruct potential one-sample bounds, and get a sense of the information lost due to two-sample study design. -->

---

We reconstruct 1000 one-sample bounds from each of nine sets of two-sample bounds. Simulated data.
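
The reconstruction step can be sketched in Python as follows. This is a minimal illustration, not the code behind the figures. It assumes binary `\(X\)` and `\(Y\)`, draws each covariance uniformly (an assumption), and only requires that the four cell probabilities are nonnegative, ignoring any further constraints the IV model puts on the covariance. The ATE bounds combine, over all instrument levels, the worst-case bounds on `\(E[Y^1]\)` and `\(E[Y^0]\)`. All function names are hypothetical.

```python
import itertools
import random


def joint_from_marginals(px1, py1, cov):
    """P(X=x, Y=y | Z=z), keyed by (x, y), from P(X=1|z), P(Y=1|z), Cov(X,Y|z)."""
    joint = {}
    for x, y in itertools.product((0, 1), repeat=2):
        px = px1 if x == 1 else 1 - px1
        py = py1 if y == 1 else 1 - py1
        sign = 1 if x == y else -1  # this is (2 * I[x = y] - 1)
        joint[(x, y)] = px * py + sign * cov
    return joint


def cov_range(px1, py1):
    """Covariances that keep all four cell probabilities nonnegative.

    Assumption: further restrictions implied by the IV model are ignored.
    """
    lo = max(-px1 * py1, -(1 - px1) * (1 - py1))
    hi = min(px1 * (1 - py1), (1 - px1) * py1)
    return lo, hi


def ate_bounds(joints):
    """One-sample ATE bounds from a list of joint tables, one per level of Z."""
    # E[Y^1] lies between P(Y=1,X=1|z) and P(Y=1,X=1|z) + P(X=0|z) for every z.
    y1_lo = max(j[(1, 1)] for j in joints)
    y1_hi = min(j[(1, 1)] + j[(0, 0)] + j[(0, 1)] for j in joints)
    # E[Y^0] lies between P(Y=1,X=0|z) and P(Y=1,X=0|z) + P(X=1|z) for every z.
    y0_lo = max(j[(0, 1)] for j in joints)
    y0_hi = min(j[(0, 1)] + j[(1, 0)] + j[(1, 1)] for j in joints)
    return y1_lo - y0_hi, y1_hi - y0_lo


def sample_one_sample_bounds(px1_by_z, py1_by_z, n_draws=1000, seed=0):
    """Draw covariances at random and return the implied one-sample bounds."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        joints = []
        for px1, py1 in zip(px1_by_z, py1_by_z):
            lo, hi = cov_range(px1, py1)
            joints.append(joint_from_marginals(px1, py1, rng.uniform(lo, hi)))
        draws.append(ate_bounds(joints))
    return draws
```

For example, `sample_one_sample_bounds([0.3, 0.6], [0.2, 0.25])` returns 1000 `(lower, upper)` pairs for a binary instrument with made-up values of `\(P(X = 1 | Z = z)\)` and `\(P(Y = 1 | Z = z)\)`.
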
<center>
<img src="data:image/png;base64,#/Users/ralphtrane/Documents/ACEBounds/figures/png/trivariate_bounds_subset_plot.png" height="500"/>
</center>

---

.pull-left[
<img src="data:image/png;base64,#/Users/ralphtrane/Documents/ACEBounds/figures/png/example_analyses/trivariate_bounds.png" height="550"/>
]

.pull-right[
</br></br>
Possible one-sample IV bounds for the ATE of

A. smoking on the incidence of lung cancer

B. high cholesterol on the incidence of heart attack
]

---

Conclusions:

1. Hard to characterize the information loss in general
2. Information lost due to two-sample design might not be the limiting factor for nonparametric bounds in MR analyses

---

layout: false

# Lessons Learned

Lesson 1: Two-sample data result in much more conservative bounds than one-sample data

Lesson 2: In practice, the genetic markers used as instruments are just too weak to give informative bounds

Lesson 3: Bound-based analysis does not, on its own, seem to be terribly useful in a two-sample MR study

Lesson 4: However, it might be useful in addition to other sorts of analyses:

* check if an effect estimate based on a different IV method is within the bounds
* bound effect size if direction is already well known

---

layout: false

# References

<a name=bib-angrist_instrumental_2001></a>[Angrist, J. D. and A. B. Krueger](#cite-angrist_instrumental_2001) (2001). "Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments". En. In: _Journal of Economic Perspectives_ 15.4, pp. 69-85. ISSN: 0895-3309. DOI: [10.1257/jep.15.4.69](https://doi.org/10.1257%2Fjep.15.4.69). URL: [https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69](https://pubs.aeaweb.org/doi/10.1257/jep.15.4.69) (visited on Apr. 30, 2021).

<a name=bib-baiocchi_instrumental_2014></a>[Baiocchi, M., J. Cheng, and D. S. Small](#cite-baiocchi_instrumental_2014) (2014). "Instrumental Variable Methods for Causal Inference". En.
In: _Statistics in Medicine_ 33.13, pp. 2297-2340. ISSN: 02776715. DOI: [10.1002/sim.6128](https://doi.org/10.1002%2Fsim.6128).

<a name=bib-balke_bounds_1997></a>[Balke, A. and J. Pearl](#cite-balke_bounds_1997) (1997). "Bounds on Treatment Effects from Studies with Imperfect Compliance". In: _Journal of the American Statistical Association_ 92.439, pp. 1171-1176. ISSN: 0162-1459. DOI: [10.1080/01621459.1997.10474074](https://doi.org/10.1080%2F01621459.1997.10474074). URL: [https://doi.org/10.1080/01621459.1997.10474074](https://doi.org/10.1080/01621459.1997.10474074) (visited on Feb. 05, 2020).

<a name=bib-bloom_benefits_1997></a>[Bloom, H. S., L. L. Orr, S. H. Bell, et al.](#cite-bloom_benefits_1997) (1997). "The Benefits and Costs of JTPA Title II-A Programs: Key Findings from the National Job Training Partnership Act Study". In: _The Journal of Human Resources_ 32.3, p. 549. ISSN: 0022166X. DOI: [10.2307/146183](https://doi.org/10.2307%2F146183). URL: [https://www.jstor.org/stable/146183?origin=crossref](https://www.jstor.org/stable/146183?origin=crossref) (visited on Apr. 30, 2021).

<a name=bib-davey_smith_mendelian_2003></a>[Davey Smith, G. and S. Ebrahim](#cite-davey_smith_mendelian_2003) (2003). "'Mendelian Randomization': Can Genetic Epidemiology Contribute to Understanding Environmental Determinants of Disease?" En. In: _International Journal of Epidemiology_ 32.1, pp. 1-22. ISSN: 0300-5771. DOI: [10.1093/ije/dyg070](https://doi.org/10.1093%2Fije%2Fdyg070).

<a name=bib-hernan_instruments_2006></a>[Hernán, M. A. and J. M. Robins](#cite-hernan_instruments_2006) (2006). "Instruments for Causal Inference: An Epidemiologist's Dream?" En. In: _Epidemiology_ 17.4, pp. 360-372. ISSN: 1044-3983. DOI: [10.1097/01.ede.0000222409.00878.37](https://doi.org/10.1097%2F01.ede.0000222409.00878.37).

---

# References (cont.)

<a name=bib-lawlor_mendelian_2008></a>[Lawlor, D. A., R. M. Harbord, J. A. C. Sterne, et al.](#cite-lawlor_mendelian_2008) (2008).
"Mendelian Randomization: Using Genes as Instruments for Making Causal Inferences in Epidemiology". Eng. In: _Statistics in Medicine_ 27.8, pp. 1133-1163. ISSN: 0277-6715. DOI: [10.1002/sim.3034](https://doi.org/10.1002%2Fsim.3034). <a name=bib-leigh_instrumental_2004></a>[Leigh, J. and M. Schembri](#cite-leigh_instrumental_2004) (2004). "Instrumental variables technique: cigarette price provided better estimate of effects of smoking on SF-12". En. In: _Journal of Clinical Epidemiology_ 57.3, pp. 284-293. ISSN: 08954356. DOI: [10.1016/j.jclinepi.2003.08.006](https://doi.org/10.1016%2Fj.jclinepi.2003.08.006). URL: [https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214](https://linkinghub.elsevier.com/retrieve/pii/S0895435603003214) (visited on Apr. 30, 2021). <a name=bib-manski_nonparametric_1990></a>[Manski, C. F.](#cite-manski_nonparametric_1990) (1990). "Nonparametric Bounds on Treatment Effects". In: _The American Economic Review_ 80.2, pp. 319-323. ISSN: 0002-8282. <a name=bib-ramsahai_causal_2012></a>[Ramsahai, R. R.](#cite-ramsahai_causal_2012) (2012). "Causal Bounds and Observable Constraints for Non-Deterministic Models". In: _J. Mach. Learn. Res._ 13, pp. 829-848. ISSN: 1532-4435. <a name=bib-richardson_ace_2014></a>[Richardson, T. S. and J. M. Robins](#cite-richardson_ace_2014) (2014). "ACE Bounds; SEMs with Equilibrium Conditions". In: _Statistical Science_ 29.3, pp. 363-366. ISSN: 0883-4237. DOI: [10.1214/14-STS485](https://doi.org/10.1214%2F14-STS485). arXiv: [1410.0470](https://arxiv.org/abs/1410.0470). <a name=bib-voight_plasma_2012></a>[Voight, B. F., G. M. Peloso, M. Orho-Melander, et al.](#cite-voight_plasma_2012) (2012). "Plasma HDL cholesterol and risk of myocardial infarction: a mendelian randomisation study". En. In: _The Lancet_ 380.9841, pp. 572-580. ISSN: 01406736. DOI: [10.1016/S0140-6736(12)60312-2](https://doi.org/10.1016%2FS0140-6736%2812%2960312-2). 
URL: [https://linkinghub.elsevier.com/retrieve/pii/S0140673612603122](https://linkinghub.elsevier.com/retrieve/pii/S0140673612603122) (visited on Aug. 09, 2021).