We’re very excited about GenomicSEM and want to showcase some of the possible future extensions we have coming up. The following document describes the integration of GenomicSEM and Bayesian network inference. It should be viewed as a pre-pre-print: it describes the integration of two methods and provides a single illustrative, and successful, simulation. That simulation validates the method only in a very minimal sense (one example does not cover all use cases). If you want to use these methods, feel free to, but we recommend you simulate data similar to the data you intend to use in your real analysis and validate that the method works for your specific application! You would be using a developmental branch of our package, so please report any issues and check in with us about the stability of the code.
Let's install and load all required packages, including the development version of GenomicSEM (make sure to switch back to the original version for production!):
library(devtools)
install_github("MichelNivard/GenomicSEM",ref = "dev_v3")
require(GenomicSEM)
require(pcalg)
require(corpcor)
require(MASS)
One of the big advantages of publishing GenomicSEM as an R package is that it allows us to integrate GenomicSEM with other R packages, drawing on R's large ecosystem of packages for interesting statistical models. In this document we present a way to take the genetic covariance matrix between a set of traits suspected to be causally related and perform Bayesian network inference on it; the analysis produces a likely network describing the causal relationships between the traits of interest.
The networks produced by most Bayesian network inference algorithms are called directed acyclic graphs (DAGs), and they describe the causal relationships between traits. DAGs have some rules: they are acyclic, that is, there are no "circles" in the network, so you can't have "a" causing "b" causing "c" causing "a". This is a serious limitation, as bidirectional causation and more complicated recursive relationships are likely part of the set of causal relationships most people are interested in. For now it's important to keep this limitation in mind!
Let's start with a miniature example of 4 related variables. For now we use simulated data; later we switch to analyses based on GWAS summary data.
t1 <- rnorm(1000)
t2 <- rnorm(1000)
t3 <- .4*t1 + .5*t2 + rnorm(1000)
t4 <- .6*t3 + rnorm(1000)
dataset <- cbind.data.frame(t1,t2,t3,t4)
We made 4 variables: the first (t1) and the second (t2) cause the third (t3), which in turn causes the fourth (t4). This means that while all 4 variables correlate, we can figure out that t1 doesn't directly influence t4 by conditioning on t3.
# all variables are dependent
cor(dataset)
## t1 t2 t3 t4
## t1 1.00000000 0.02637316 0.2932722 0.1940716
## t2 0.02637316 1.00000000 0.4196568 0.2301954
## t3 0.29327216 0.41965680 1.0000000 0.5887943
## t4 0.19407157 0.23019544 0.5887943 1.0000000
# But t4, conditional on t3, is independent of t1
reg <- lm(t4 ~t3 + t1,data=dataset)
summary(reg)
##
## Call:
## lm(formula = t4 ~ t3 + t1, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5012 -0.6666 0.0217 0.6843 3.1145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009254 0.032036 0.289 0.773
## t3 0.597756 0.027494 21.742 <2e-16 ***
## t1 0.028811 0.032943 0.875 0.382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.012 on 997 degrees of freedom
## Multiple R-squared: 0.3472, Adjusted R-squared: 0.3459
## F-statistic: 265.1 on 2 and 997 DF, p-value: < 2.2e-16
Conditional independence is a key concept in Bayesian network learning: it is leveraged to figure out the web of causal relationships in a dataset, because when we do not know the true structure of the network we have to search over the possible structures.
There is a whole literature on such structure-learning algorithms, which is far beyond the scope of this tutorial (Google is your friend). Basically, there are algorithms that quickly eliminate very implausible networks and converge on (a set of) possible networks given your data. Let's take one of these algorithms out for a spin: the PC algorithm, as implemented in the pcalg package. This particular algorithm works based solely on the correlation matrix between traits and the sample size. It leverages the fact that from the correlation matrix we can obtain the partial correlations, which describe the relationship between each pair of variables conditional on all other variables.
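For intuition, here is what those partial correlations look like for the toy data above, using cor2pcor from the corpcor package we loaded earlier (the exact numbers will vary from run to run because the data are simulated):
# partial correlations: each pairwise relation conditional on all remaining variables
round(cor2pcor(cor(dataset)), 2)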
cor <- cor(dataset)
n <- 1000
network <- pc(suffStat = list(C=cor,n=n),indepTest = gaussCItest,labels = c("t1","t2","t3","t4"),alpha=0.01 )
plot(network, main="Toy DAG")
## Loading required namespace: Rgraphviz
Wow! In this admittedly very simple example the algorithm correctly inferred the DAG, relying only on the correlation matrix and the sample size to recover the true (but very simple!) network. One of the big limitations is that if there are unobserved confounders, that is, variables you haven't measured which influence the variables you have, things can go wrong. Let's see how an unmeasured confounder can mess up the results.
I simulate the same network, but add a 5th variable (confounder “c” in the code below) which influences both t1 and t4, but is not measured and therefore doesn’t appear in our dataset.
c <- rnorm(1000)
t1 <- rnorm(1000) + .4*c
t2 <- rnorm(1000)
t3 <- .4*t1 + .5*t2 + rnorm(1000)
t4 <- .6*t3 + rnorm(1000) + .5*c
dataset <- cbind.data.frame(t1,t2,t3,t4)
cor <- cor(dataset)
n <- 1000
network <- pc(suffStat = list(C=cor,n=n),indepTest = gaussCItest,labels = c("t1","t2","t3","t4"),alpha=0.01 )
plot(network, main="Toy DAG in presence of a confounder")
Note how the algorithm, in the presence of an unobserved confounder, erroneously infers a causal relation between t1 and t4. The risk of confounding should give us pause when deciding which variables to include in our models. Bearing these limitations in mind, the techniques and algorithms used in Bayesian network inference are valuable when used responsibly to infer candidate causal models from observational data.
Genomic SEM offers users the option to estimate a genetic correlation matrix between traits for which GWAS summary data are available. This allows us to consider a variety of variables, like volumetric MRI measures, psychopathology, and risk factors such as smoking propensity or risk-seeking behavior, concurrently with disease endpoints like cardiac disease or cancers. Clearly an exciting prospect!
To link GenomicSEM to the pcalg package for Bayesian network inference we need 2 things: sufficient statistics and a conditional independence test.
Since GenomicSEM (or rather the underlying LD score regression analysis) produces estimates of genetic correlation, it's logical to base our independence test on the correlation matrix. But because each cell in the correlation matrix is based on 2 GWASs with different sample sizes and potentially unknown sample overlap, we need to account for this in our inference. To do so we re-sample the correlation matrix from its sampling variance/covariance matrix as produced by GenomicSEM. Our sufficient statistics are therefore 250 (re)sampled correlation matrices (sort of like bootstrap re-samples), and our independence test is based on aggregating across these re-sampled genetic correlation matrices.
This step is complex and takes a while. If you don't have genetic data you can skip it and just download the results I generated with my simulation to try GenomicSEM and Bayesian network learning.
Download the GWAS results generated from these steps HERE (260 Mb) if you have no raw genotypes to simulate for yourself.
First we must simulate realistic data. To do so we sample 30,000 "causal" SNPs for each of the 6 traits we will use in our network. We sample and simulate only from HapMap 3 SNPs, drawing from the plink-format .bim file, which contains all the genomic loci present in our dataset:
## generate SNPlists randomly:
bim <- read.table("simulation_set.bim")
snplist.1 <- sample(bim[,2],30000)
snplist.2 <- sample(bim[,2],30000)
snplist.3 <- sample(bim[,2],30000)
snplist.4 <- sample(bim[,2],30000)
snplist.5 <- sample(bim[,2],30000)
snplist.6 <- sample(bim[,2],30000)
write.table(snplist.1,"snplist.1",quote = F,row.names = F,col.names = F)
write.table(snplist.2,"snplist.2",quote = F,row.names = F,col.names = F)
write.table(snplist.3,"snplist.3",quote = F,row.names = F,col.names = F)
write.table(snplist.4,"snplist.4",quote = F,row.names = F,col.names = F)
write.table(snplist.5,"snplist.5",quote = F,row.names = F,col.names = F)
write.table(snplist.6,"snplist.6",quote = F,row.names = F,col.names = F)
The next step is to simulate actual phenotypes for 40,000 unrelated individuals for which I have genotypes (sampled from UK Biobank data). On a Linux machine with GCTA installed (get it here), I run the following commands in the terminal:
############ STEP 1 simulate:
./gcta64 --bfile simulation_set --simu-qt --simu-causal-loci snplist.1 --simu-hsq 0.5 --simu-rep 1 --out sim.1
./gcta64 --bfile simulation_set --simu-qt --simu-causal-loci snplist.2 --simu-hsq 0.5 --simu-rep 1 --out sim.2
./gcta64 --bfile simulation_set --simu-qt --simu-causal-loci snplist.3 --simu-hsq 0.5 --simu-rep 1 --out sim.3
./gcta64 --bfile simulation_set --simu-qt --simu-causal-loci snplist.4 --simu-hsq 0.5 --simu-rep 1 --out sim.4
./gcta64 --bfile simulation_set --simu-qt --simu-causal-loci snplist.5 --simu-hsq 0.5 --simu-rep 1 --out sim.5
./gcta64 --bfile simulation_set --simu-qt --simu-causal-loci snplist.6 --simu-hsq 0.5 --simu-rep 1 --out sim.6
This will take a while…. We then read the simulated phenotypes into R and create a (network) relationship between the variables:
# read phenotypes after simulation:
ph1 <- read.table("sim.1.phen")
ph2 <- read.table("sim.2.phen")
ph3 <- read.table("sim.3.phen")
ph4 <- read.table("sim.4.phen")
ph5 <- read.table("sim.5.phen")
ph6 <- read.table("sim.6.phen")
P1 <- scale(ph1[,3])
P2 <- scale(ph2[,3])
P3 <- scale(ph3[,3])
P4 <- scale(ph4[,3])
P5 <- scale(ph5[,3])
P6 <- scale(ph6[,3])
## Make a causal model:
# 1 and 2 cause 3, 3 causes 4, and 4 and 5 cause 6
P3a <- P3 + .3*P1 + .4*P2
P4a <- P4 + .7*P3a
P6a <- P6 + .5*P4a + .3*P5
Then we save these new composite phenotypes:
# the FID/IID columns are identical across the simulated .phen files, so we reuse ph1's
phe1 <- cbind(ph1[,1:2],P1)
phe2 <- cbind(ph1[,1:2],P2)
phe3 <- cbind(ph1[,1:2],P3a)
phe4 <- cbind(ph1[,1:2],P4a)
phe5 <- cbind(ph1[,1:2],P5)
phe6 <- cbind(ph1[,1:2],P6a)
write.table(x =phe1,file="one.pheno",row.names = F,col.names = F)
write.table(x =phe2,file="two.pheno" ,row.names = F,col.names = F)
write.table(x =phe3,file="three.pheno" ,row.names = F,col.names = F)
write.table(x =phe4,file="four.pheno" ,row.names = F,col.names = F)
write.table(x =phe5,file="five.pheno" ,row.names = F,col.names = F)
write.table(x =phe6,file="six.pheno" ,row.names = F,col.names = F)
We then use the following bash code to run a GWAS on the phenotypes in Plink 1.9:
./plink --pheno one.pheno --assoc --bfile simulation_set --threads 12 --allow-no-sex --out one
./plink --pheno two.pheno --assoc --bfile simulation_set --threads 12 --allow-no-sex --out two
./plink --pheno three.pheno --assoc --bfile simulation_set --threads 12 --allow-no-sex --out three
./plink --pheno four.pheno --assoc --bfile simulation_set --threads 12 --allow-no-sex --out four
./plink --pheno five.pheno --assoc --bfile simulation_set --threads 12 --allow-no-sex --out five
./plink --pheno six.pheno --assoc --bfile simulation_set --threads 12 --allow-no-sex --out six
Finally we paste together the association results and a ref file (which contains the .bim file with the following header: "chr.no rs.no nill BPno A1 A2") so that GenomicSEM can read them:
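If you need to build the ref file yourself, here is a minimal sketch in R. This is an assumption about the pipeline rather than code from it: the column names are simply the header quoted above, and the .qassoc files come out in the same SNP order as the .bim, so the bash paste commands below line the files up row by row.
# hypothetical sketch: turn the .bim file into the "ref" file with the header named above
bim <- read.table("simulation_set.bim")
colnames(bim) <- c("chr.no", "rs.no", "nill", "BPno", "A1", "A2")
write.table(bim, "ref", quote = FALSE, row.names = FALSE)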
paste ref one.qassoc > one.sums
paste ref two.qassoc > two.sums
paste ref three.qassoc > three.sums
paste ref four.qassoc > four.sums
paste ref five.qassoc > five.sums
paste ref six.qassoc > six.sums
Download the GWAS results generated from the steps above HERE (260 Mb) if you wish to run the code below.
Take a moment to realize that we fit a network model to variables we don't need to measure ourselves; we just need the summary statistics from a GWAS of each variable we want to include.
We are ready to perform GenomicSEM on GWAS results for 6 traits. In "production", i.e. if this were for a paper, I would advise additional simulations varying 1. the heritability of the phenotypes, 2. the sample overlap between the GWASs (here all are run on the same 40,000 genotypes), and 3. the sample sizes of the GWASs. You can use the code above to aid you in these additional simulations.
traits <- c("one.sums","two.sums","three.sums","four.sums","five.sums","six.sums")
munge(files = traits ,hm3 = "w_hm3.noMHC.snplist",
N = rep(40000,6),
trait.names = c("one","two","three",
"four","five","six"),
info.filter = .9,maf.filter = .05)
covstruct <- ldsc(traits =c("one.sumstats.gz","two.sumstats.gz","three.sumstats.gz",
"four.sumstats.gz","five.sumstats.gz","six.sumstats.gz"),
trait.names= c("one","two","three","four","five","six"),
ld = "eur_w_ld_chr/", wld="eur_w_ld_chr/",
sample.prev = NA,
population.prev = NA )
GenomicSEM computes the genetic covariance matrix between traits, and the parameter variance/covariance matrix of that estimated covariance matrix (confusing, right?). We can use these in a Bayesian network inference analysis, using the new GenomicSEM functions to create appropriate sufficient statistics and to perform a conditional independence test.
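Conceptually, the sufficient statistics are built by resampling the genetic covariance matrix from its sampling distribution. Below is a minimal sketch of that idea, assuming the ldsc() output stores the genetic covariance matrix in covstruct$S and the sampling covariance matrix of its unique (lower-triangle, column-major) elements in covstruct$V; the actual GSEM.suffStat implementation may differ in its details.
# sketch only -- not the GSEM.suffStat source; the element ordering of V is an assumption
k <- ncol(covstruct$S)
est <- covstruct$S[lower.tri(covstruct$S, diag = TRUE)]          # unique elements of S
draws <- MASS::mvrnorm(n = 250, mu = est, Sigma = covstruct$V)   # 250 resampled estimate vectors
cor_list <- lapply(seq_len(nrow(draws)), function(i) {
  Sd <- matrix(0, k, k)
  Sd[lower.tri(Sd, diag = TRUE)] <- draws[i, ]
  Sd <- Sd + t(Sd) - diag(diag(Sd))   # rebuild the full symmetric covariance matrix
  cov2cor(Sd)                         # convert to a genetic correlation matrix
})
In practice we simply call the packaged functions: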
suffstats <- GSEM.suffStat(suffStat = covstruct)
pc_network <- pc(suffstats,indepTest = GSEM.indepTest,alpha = .05,
labels = c("one","two","three","four","five","six"))
plot(pc_network, main="simulated genomicSEM DAG")
Hey presto! We have estimated the DAG we simulated! That's pretty neat! Note that the PC algorithm doesn't provide us with parameter estimates; for these we can translate the final DAG into a SEM regression model, which we can fit using GenomicSEM. The resulting model gives us parameter estimates and, importantly, model fit statistics.
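In this small example the model is easy to write out by hand (as below), but the regression statements can also be derived from the fitted pc object. A rough sketch follows, assuming as(graph, "matrix") encodes an edge from i to j as a 1 in row i, column j, and that all edges in the estimated graph came out directed (undirected CPDAG edges are skipped here and would have to be oriented by hand):
library(graph)  # for coercing the graphNEL object inside the pc fit to an adjacency matrix
amat <- as(pc_network@graph, "matrix")
model_lines <- lapply(colnames(amat), function(child) {
  parents <- rownames(amat)[amat[, child] == 1 & amat[child, ] == 0]  # keep directed edges only
  if (length(parents) == 0) NULL else paste(child, "~", paste(parents, collapse = " + "))
})
cat(paste(unlist(model_lines), collapse = "\n"))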
# Fit the resulting DAG as an observed-variable SEM model.
model <- '
three ~ one + two
four ~ three
six ~ four + five
'
fit <- usermodel(covstruct,estimation = "DWLS",model=model)
## [1] "Running primary model"
## [1] "Calculating model chi-square"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed
## 3.1
fit
## $modelfit
## chisq df p_chisq AIC CFI SRMR
## df 18.16929 10 0.05217419 40.16929 0.9977824 0.01891372
##
## $results
## lhs op rhs Unstandardized_Estimate Unstandardized_SE
## 1 three ~ one 0.2522421 0.04299386
## 2 three ~ two 0.3942366 0.02784543
## 3 four ~ three 0.6011611 0.02626740
## 4 six ~ four 0.5324948 0.02840304
## 5 six ~ five 0.1927543 0.02788483
## 27 one ~~ one 0.4750233 0.03427202
## 28 two ~~ two 0.4451210 0.01770129
## 29 three ~~ three 0.3586305 0.01644986
## 30 four ~~ four 0.3080154 0.01552483
## 31 five ~~ five 0.4533114 0.01956517
## 32 six ~~ six 0.3016736 0.01312094
## Standardized_Est Standardized_SE
## 1 0.2566494 0.04374506
## 2 0.3882940 0.02742570
## 3 0.5955655 0.02602291
## 4 0.5410819 0.02886108
## 5 0.1928651 0.02790086
## 27 0.9999992 0.07214809
## 28 1.0000000 0.03976736
## 29 0.7815861 0.03585022
## 30 0.6588393 0.03320733
## 31 1.0000001 0.04316056
## 32 0.6662543 0.02897795
The model fit is reasonable (CFI > .98 and the chi-square test not significant), as we would expect since we fitted the true data-generating model. The simulation above is preliminary evidence that the method works in the best of circumstances. Again, if you want to test this further, or use these techniques in a paper, useful scenarios to consider are: no sample overlap between the GWASs, a smaller or larger number of causal loci, unequal heritability between traits, differing sample sizes between GWASs, the presence of confounding traits, or the presence of a small set of confounding genetic loci.
Now notice that doing these analyses based on GWAS summary data allows us considerable freedom to model a large variety of traits! We really only need a GWAS of reasonable sample size (N or effective N > 20,000 as a rule of thumb) for each variable we want to include.
I am working on a (reasonable) real-world example, preferably one with strong priors on some of the causal directions. Send me your suggestions (4 to 6 variables, each of which must have been GWASed) if you have any!