Introduction

This is an introduction to conducting a Mendelian randomization. You can read through the document and follow the process along, and the embedded code can also be transferred to your own version of R (or RStudio) if you want to replicate these analyses.

For reference, the guidance we’ll be using for this analysis comes from ‘The Interleukin-6 Receptor Mendelian Randomisation Analysis Consortium’. This group published an influential paper in The Lancet in 2012 entitled “The interleukin-6 receptor as a target for prevention of coronary heart disease: a mendelian randomisation analysis” (PubMed ID: 22421340).

First Step: Loading the Requisite Libraries

Easy first step is just loading the libraries that you will need for these analyses. I’ve also included additional libraries that I sometimes use to clean up datasets in an easier way, or to organize the data throughout this process.

library(TwoSampleMR)
library(utils)
library(data.table)
library(dplyr)
library(tidyverse)

Second Step: Identifying our Exposure and outcome

In the context of the study we are using to guide our process, we are really trying to proxy the effects of tocilizumab, a monocolonal antibody that blocks the interleukin-6 receptor (IL6R). This has important implications in atherosclerosis, where interleukin-6 (IL-6) is an established key player. IL-6, a cytokine produced mainly by T cells, macrophages, and adipocytes, promotes inflammation via binding to the IL6R. When IL6 binds to IL6R, the resulting inflammatory cascade can be ‘quantified’ in circulating blood of patients by measuring downstream products, namely fibrinogen and C-reactive protein. This inflammatory cascade results in an elevated risk for many atherosclerotic cardiovascular diseases - we are going to focus specifically on coronary artery disease (CAD).

So, to review: our question is, will a genetically proxied blockade of IL6R be associated with a reduced risk of CAD? Our exposure is the genetic proxy of IL6R inhibition, and our outcome is the risk of CAD.

Third Step: The Genetic Proxy Exposure

Now, the complicated part starts. We need to design a genetic proxy for IL6R inhibition. When we want to replicate the effects of a drug, we have a few options to guide the creation of the genetic proxy. These are: 1) effects on gene expression, 2) effects on protein expression, 3) effects on a function of the drug target, and 4) effects on a downstream marker of the drug target. In theory, all four are great ways to design a genetic proxy for a drug. In practice, really only 1, 2 and 4 are viable. There are millions of genetic variants, and simply nowhere near enough information on the actual functional impact of each possible variant that has been tested and validated in preclinical models. So, option number 3 is rarely feasible.

For this particular group, they opted to go for the 4th route - effects on a downstream marker of the drug target. If the drug is targeting IL6R, then we have numerous downstream targets. The authors of the Lancet publication selected C-reactive protein (CRP) and fibrinogen. Using genetic studies, they sought to identify a genetic variant (or multiple genetic variants) that were associated with these markers. Importantly, they wanted to find variants that were located in the gene-encoding region for IL6R; this strengthens the likelihood that the genetic proxy will reflect the effects of pharmacological inhibition, instead of simply arriving at the same outcome through a different mechanistic pathway.

In this case, we are going to replace the variant that they selected, since the one they selected is hard to detect with modern systems. They used a variant called rs7529229, we are going to use rs12730935. Our variant is in strong linkage disequilibrium (r2 ~ 0.94-0.99) with their variant, which means that these variants are often inherited together. If that is the case, it is accepted in Mendelian randomization studies to replace variants with other variants as long as they are in strong linkage disequilibrium (r2 > 0.90).

So, we are going to pull a genome-wide association study (GWAS) that is looking for variants associated with C-reactive protein. If you look at the code that I use, in order to find the variants that I am interested in, I could go straight to looking for the variant I’ve already identified.

exp_dat <- extract_instruments("ebi-a-GCST90029070")
exp_dat <- exp_dat[exp_dat$SNP == "rs12730935",  ]
print(exp_dat)

If I didn’t have a pre-selected variant, I could also isolate the variants that I am looking at to just a specific region that is around where the gene-encoding region is for IL6R. In this case, IL6R can be found between base positions 154405343 - 154469450, so I will limit my search to this region with a ‘buffer zone’ around this region of 200,000 positions. That means I’ll search between 154205343 - 154669450. We are also going to relax our linkage disequilibrium assumption. Normally, this is set to r2 < 0.001; for drug-target studies using Mendelian randomization, this could limit our statistical power since we are already restricting our ‘scope’ to a single region in the entire human genome. Therefore, we often relax this assumption, and instead use r2 < 0.2.

exp_dat_alt <- extract_instruments("ebi-a-GCST90029070", r2=0.2)
exp_dat_alt <- exp_dat_alt[exp_dat_alt$chr.exposure == 1, ]
exp_dat_alt <- exp_dat_alt[exp_dat_alt$pos.exposure >= 154205343,  ]
exp_dat_alt <- exp_dat_alt[exp_dat_alt$pos.exposure <= 154669450,  ]
print(exp_dat_alt)

So, now we have collected two different exposures that both provide genetic proxies of IL6R inhibition by showing that these variants we’ve selected are associated with a downstream outcome, C-reactive protein.

Fourth Step: Selecting our Outcome

Our outcome is more straightforward. We just have to select a GWAS that looked at variants associated with CAD. In this case, I’m going to use a GWAS from Mbatchou et al. that looked at variants associated with CAD in 352,063 European individuals. If you want to get fancy, you can select multiple outcomes to verify that whatever results you get are consistently showing up, regardless of which GWAS you pick for your outcome. You could also test different outcomes, if you wanted to simultaneously assess the effects of genetically proxied IL6R inhibition on CAD, ischemic stroke, atrial fibrillation, etc.

For the purposes of what we are doing at the moment, I’m going to stick with just testing CAD as our outcome. I’m going to run two different lines of code here, since we have two exposures. Remember, our first exposure (which I assigned to “exp_dat”) is just the variant rs12730935. Our second exposure (which I assigned to “exp_dat_alt”) is a selection of variants that were significantly associated with C-reactive protein and were also found within the gene-encoding region for IL6R (plus or minus 200,000 bases). So, I am going to name my outcomes “out_dat” and “out_dat_alt”, accordingly.

We then harmonize the exposure and outcome data, which prepares us to run our analyses.

out_dat <- extract_outcome_data(snps=exp_dat$SNP, outcomes=c("ebi-a-GCST90013864"))

out_dat_alt <- extract_outcome_data(snps=exp_dat_alt$SNP, outcomes=c("ebi-a-GCST90013864"))


final_dat <- harmonise_data(exp_dat, out_dat)

final_dat_alt <- harmonise_data(exp_dat_alt, out_dat_alt)

Fifth Step: Running our Analyses

This is fun step where you finally get to see the results. We are first going to run our single variant exposure data (the exposure with just rs12730935). Remember, we are using the exposure as the “variant-associated reduction in C-reactive protein” and our outcome as the “variant-associated change in CAD risk”. So, to interpret our final results, we are looking at the “change in CAD risk per standard deviation INCREASE in C-reactive protein”. If we want to change that to the “change in CAD risk per standard deviation DECREASE in C-reactive protein”, then we just need to multiply our results by -1.

The second set of analyses are the same general interpretation, but now we are using multiple variants instead of a single variant. That means we now have to consider the strength of these variants - will there be horizontal pleiotropy, or weak variants, etc.? Our sensitivity analyses will help us examine that possibility.

res_1 <- mr(final_dat)
print(res_1)

res_alt <- mr(final_dat_alt)
print(res_alt)

So, right away you can see a few things from our results. The first set of results is straightforward - we can’t do too many fancy sensitivity analyses because we only have one genetic variant. For our analysis, one variant-associated standard deviation increase in C-reactive protein resulted in a significant in risk. The beta value was 0.377, which we exponentiate to get an odds ratio of 1.46. This means that the significant increase in risk was a 46% increase. To convert that now (because remember, we are reducing C-reactive protein when we inhibit IL6R so it is actually the opposite), we multiple 0.377 by negative 1. Once we exponentiate that, we find that one variant-associated REDUCTION in C-reactive protein is associated with an odds ratio of 0.69, or a 31% reduced risk.

For our second set of results, we find the same general concept. Our primary results from analyses involving multiple genetic variants is the “inverse variance weighted” method. It is considered the most robust and reliable, so we almost always start with that as our primary results. Then, we can use weighted median (which assumes that up to half of the included variants were weak and can be discarded) and MR-Egger (which takes into account the possibility of horizontal pleiotropy). In this case, all three of those were significant and in the same direction, suggesting strong findings relating IL6R inhibition (quantified with changes in C-reactive protein) to CAD risk.

We can do one more check on our second set of analyses. There were 39 genetic variants in this exposure, so we can actually run a method that lets us see if some of those are outliers. If there are outliers, this approach will actually remove those and re-run the analyses to provide “outlier-adjusted” results. This approach is called Mendelian Randomization Pleiotropy RESidual Sum and Outlier, or MR-PRESSO.

adjusted_res_alt <- run_mr_presso(final_dat_alt)
print(adjusted_res_alt)

In this case, we see that no outliers were detected so all “outlier-corrected” results are set to “NA”. That provides another level of robust evidence to our findings suggesting that IL6R inhibition would reduce CAD risk.