Performing a GWAS-by-subtraction in GenomicSEM.

The preprint by Demange et al. (here on BioRxiv) uses GenomicSEM and the input GWAS of educational attainment (EA) and cognitive performance (CP) to derive two new GWASes: cognitive (Cog) and non-cognitive (NonCog) contributions to educational attainment. We feel this is a very valuable analysis as it allows us to evaluate the genetic architecture and correlates of traits, other than cognition, that contribute to educational success. For a full account please refer to the paper, the present document is a tutorial on GWAS-by-subtraction which will enable you to better understand our work but perhaps more importantly run your own GWAS-by-subtraction.

The tutorial is easier if you have some experience using GenomicSEM, so go here for a general introduction to GenomicSEM and basic tutorials. This code will not download LD scores, the HapMap3 reference file or the 1000Genomes based allele frequency reference file. Those are found here, here and here. Most people will have downloaded those files if they work with either LD score regression or GenomicSEM, they are big and it makes little sense to have multiple copies so we omit them from the code below. At the beginning of the tutorial we assume the following files/folders are in your working directory (or you can change the paths to them in the code below):

“w_hm3.noMHC.snplist” (hapmap SNPs excluding MHC)
“reference.1000G.maf.0.005.txt” (reference MAF)
“eur_w_ld_chr/” (folder with LD scores)

If you just want the Cog/NonCog GWAS summary statistics to compute PRS or run other analyses you can download these here.

If you need help with the code reach out to us on our GenomicSEM google group, we’ll try to help you out as soon as we can.

This text is not meant to offer a complete account of all code used in the paper (we will make a github repository available for that), but rather is an account of GWAS-by-subtraction as we would explain it to you at a GenomicSEM workshop.

The basic structural equation model.

Let’s first discuss the nature of the model we will be estimating. In Figure 1 (left, the entire model describes the relationship between 3 observed variables (in squares) which are cognitive performance, educational attainment and a SNP, and 2 latent variables which we call Cog and NonCog. In GenomicSEM we always only work with GWAS summary statistics, so to estimate this model we need GWAS summary data for cognitive performance (CP) and educational attainment (EA) which you can download here.

Figure 1: Overal pathmodel and smaller path models highlighting the cognitive (red) and non-cognitive (blue) path from SNP to EA.

Figure 1: Overal pathmodel and smaller path models highlighting the cognitive (red) and non-cognitive (blue) path from SNP to EA.

A critical component of the model is that it allows for 2 paths from the SNP to educational attainment. The first path (red path in Figure 1 top right) corresponds to the SNP’s effect on EA via a latent variable (Cog). The latent variable Cog loads on both EA as well as CP: it reflects the cognitive genetic effect on educational attainment. The second path from SNP to educational attainment (blue path Figure 1 bottom right) reflects genetic effects on EA, that are not correlated at all with cognitive performance (NonCog).

Let’s break down the model into its critical pieces, we have the following 5 relationships: CP and EA are regressed on Cog, EA is also regressed on NonCog. The two latent variables Cog and NonCog are then regressed on the SNP. We break this down into two code examples: in the first we fit the model without a SNP, and in the second we add the SNP to the model.

Prepare the data

We first need to download and process the data, if you’re in linux or on mac (and have installed wget) just run the code below, on windows or on a mac without wget use the links to manually download:

# Get EA:
# Get CP:

We must process the GWAS summary data we just downloaded (a process called “munging” in both LD score regression and GenomicSEM), the processing entails omitting SNPs not in the references file (found at the link mentioned at the top of this page), with low INFO or low minor allele frequency.

      info.filter = 0.9, 
      maf.filter = 0.01)

      info.filter = 0.9,
      maf.filter = 0.01

Having made sure the summary data are in a uniform format and underwent minimal QC, we move on to run multivariable LD score regression. In this step we compute the genetic covariance matrix between traits and the standard errors associated with the entries in the covariance matrix. This is the last preliminary step and it makes sense to save the result as you will likely not rerun this step. We suppress the output of the munge function and some other functions for this tutorial but for most functions GenomicSEM will print the exact steps it is taking to the R terminal and/or a log file so you can make sure things are going as expected.

traits <- c("CP.sumstats.gz","EA.sumstats.gz")
sample.prev <- c(NA,NA)
population.prev <- c(NA,NA)
wld <- "eur_w_ld_chr/"
trait.names<-c("CP", "EA")

LDSCoutput <- ldsc(traits, 

save(LDSCoutput, file="LDSCoutputCogNonCog.RData")

Run the SEM model without SNPs

Having processed the GWAS summary data, and computed the genetic covariance between the traits, we can fit our first GenomicSEM model, which is the model depicted in Figure 1a but omits the SNPs. We specify the 2 latent variables Cog (C) and NonCog (NC) as a function of EA and CP: (C=~ NA*EA + start(0.4)*CP and NC=~NA*EA), and further impose constraints that make sure all (co)variance in EA and CP is captured by the latent variables and that the latent variables have a variance of 1 (NC~~1*NC and C~~1*C) and a covariance of 0 (C~~0*NC ).


model<-'C=~NA*EA + start(0.4)*CP

         CP ~~ 0*EA

## [1] "Running primary model"
## [1] "Calculating model chi-square"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## Warning in lav_partable_flat(FLAT, blocks = "group", meanstructure = meanstructure, : duplicated elements in model syntax have been ignored:
##      V2 ~~ V1
## [1] "Calculating SRMR"
## elapsed 
##   0.323
## $modelfit
##    chisq df p_chisq AIC CFI        SRMR
## df    NA  0      NA  NA   1 1.04818e-09
## $results
##   lhs op rhs Unstand_Est          Unstand_SE STD_Genotype
## 1   C =~  CP   0.4464830 0.00760378990216596    1.0000000
## 2   C =~  EA   0.2298708 0.00535873579410749    0.6859947
## 3   C ~~   C   1.0000000                        1.0000000
## 4  NC =~  EA   0.2438145 0.00436859738964451    0.7276065
## 5  NC ~~  NC   1.0000000                        1.0000000
##      STD_Genotype_SE   STD_All
## 1 0.0170304116227263 1.0000000
## 2 0.0159918750218389 0.6859947
## 3                    1.0000000
## 4  0.013037041917133 0.7276065
## 5                    1.0000000
save(output, file="Modeloutput.Rdata" )

The results are presented as per usual in GenomicSEM for each freely estimated parameter we present an estimate and a standard error, and their standardized equivalents. Model fit statistics are presented but uninformative in this case as the model is “saturated”. Note estimates will differ from the preprint as we cannot share summary data from 23andme, but you can request the summary data directly from them.

Clean the SNPs for GWAS

Now we need to preprocess the summary statistics for a GWAS, where the model is fit for each SNP in the genome. To make this step computationally feasible for this tutorial, we shorten the input GWAS to only the top 999 SNPs, using unix commmand. Note that since the files are sorted on P-value we won’t grab the same SNPS from the two files, but there is sufficient overlap for the purpose of this tutorial.

head -1000  GWAS_CP_all.txt > CP_1000.txt
head -1000  GWAS_EA_excl23andMe.txt > EA_1000.txt

We prepare the top 999 SNPS for analysis in GenomicSEM. If you run a GWAS-by-subtraction with other GWAS, you may have to change the parameters of the sumstats function (for example if your input GWAS is based on logistic or linear regression, etc.). These parameters do matter, please check what to use with the original GenomicSEM paper and tutorial. If in doubt you can reach out to us! This step also filters SNPs for which no allele frequency information is available in the reference file, as well as with INFO < 0.6 or MAF < 0.01.

files = c("CP_1000.txt", "EA_1000.txt")
ref = "reference.1000G.maf.0.005.txt"
trait.names = c("CP","EA")
se.logit = c(F,F)
info.filter = 0.6
maf.filter = 0.01

p_sumstats<-sumstats(files, ref, trait.names, se.logit, info.filter, maf.filter, OLS=c(T,T),linprob=NULL, prop=NULL, N=c(257828,766345))
save(p_sumstats, file="Sumstats.RData")

Run the GWAS-by-subtraction

We are now ready to run the GenomicSEM GWAS-by-subtraction. Additional to the model ran above, we introduce the SNP variable. The SNP is regressed on the latent variables C and NC concurrently ( e.g. C~SNP and N~CNP).

# model with SNP
model<-'C=~NA*EA +start(0.2)*EA + start(0.5)*CP
         NC=~NA*EA +start(0.2)*EA


         CP ~~ 0*EA

#Run the Genomic SEM GWAS
outputGWAS<-userGWAS(covstruc=LDSCoutput,SNPs=p_sumstats,estimation="DWLS",model=model,sub =c("C~SNP","NC~SNP"))# printwarn = FALSE
## [1] "Please note that an update was made to userGWAS on 11/21/19 so that it combines addSNPs and userGWAS."
## This is lavaan 0.6-5
## lavaan is BETA software! Please report any bugs.
## elapsed 
##  22.801
save(outputGWAS, file="outputGWAS.RData")

You have now succesfully performed a GWAS for the first ~999 SNPs (500 of which end up being filtered because they were not in both input files)! The outputGWAS object will contain 2 sets of GWAS output which are the SNP effects on C (Cog) and on NC (NonCog).

The top 5 lines of the of the NonCog GWAS should look something like this (if suppressing warnings):

##            SNP CHR       BP      MAF A1 A2 lhs op rhs free label
## 5   rs13086611   3 49385417 0.335984  A  T  NC  ~ SNP    5      
## 588 rs17080528   3 49389842 0.335984  C  T  NC  ~ SNP    5      
## 590 rs13090388   3 49391082 0.335984  C  T  NC  ~ SNP    5      
## 593  rs1050450   3 49394834 0.335984  G  A  NC  ~ SNP    5      
## 595  rs1800668   3 49395757 0.335984  G  A  NC  ~ SNP    5      
## 51   rs3811699   3 49396360 0.335984  T  C  NC  ~ SNP    5      
##             est          SE Z_Estimate Pval_Estimate error
## 5   -0.03963906 0.008653982  -4.580442  4.639937e-06     0
## 588 -0.04637792 0.008659324  -5.355837  8.516140e-08     0
## 590 -0.04676222 0.008659884  -5.399867  6.669026e-08     0
## 593 -0.04647800 0.008659422  -5.367333  7.990938e-08     0
## 595 -0.04635707 0.008659465  -5.353341  8.634490e-08     0
## 51  -0.04661164 0.008659361  -5.382803  7.333459e-08     0