HDL in GenomicSEM

High-definition likelihood (HDL) inference is an exciting new method to estimate heritability, and genetic correlation with higher precision then LD score regression (LDSC). LDSC models the relation between the covariance of a SNP’s Z statistics for pairs of traits and that SNP’s LD score. The HDL extends the LDSC method by modelling the relation between covariances among Z statistics for pairs of traits across multiple SNPs and a full matrix of cross-SNP LD scores. You should read the paper for all the details:

Ning, Z., Pawitan, Y., & Shen, X. (2020). High-definition likelihood inference of genetic correlations across human complex traits (pp. 1-6). Nature Genetics.

HDL is of obvious value to people that want to analyze GWAS summary statistics. There is intuitive and accessible code to run HDL found here:https://github.com/zhenin/HDL

Practically using HDL means you can analyze GWAS sumstats from smaller GWASs and still have enough power to estimate genetic correlations. Like LDSC, HDL compute the standard errors of heritability and coheritability estimates using a block jacknife. In order to appropriately include HDL estimates in GenomicSEM, we extended the block jacknife to estimate the dependencies between estimation errors across heritability and coheritability estimates, just as we have previously done for LDSC.

We wrote a basic multivariate version of HDL as a function in GenomicSEM (hdl()). It can ingest GWAS summary statistics you have prepared for analysis in GenomicSEM using the munge() function.

If you end up using the hdl() function in a paper, we would highly appreciate if you cite Ning et al., as it is the direct product of their intellectual labor.

Analysis example for hdl() in GenomicSEM

First you will need to (re)install GenomicSEM, the install instructions are HERE

Then we need to download the LD references matrices for HDL, I opted for the HapMap2 set of SNPs, which is like 15Gb, so a long download and quite a bit of diskspace! You can download these files here: https://www.dropbox.com/s/kv5zhgu274wg9z5/UKB_imputed_hapmap2_SVD_eigen99_extraction.tar.gz?dl=0

Again unzip these files and place in your working directory.

For this example I am using GWAS sumstats for BMI Yengo et al : https://portals.broadinstitute.org/collaboration/giant/images/6/63/Meta-analysis_Wood_et_al%2BUKBiobank_2018.txt.gz

GWAS summary data for educational attainment: Lee et al: https://www.dropbox.com/s/ho58e9jmytmpaf8/GWAS_EA_excl23andMe.txt?dl=0

GWAS summary data for schizophrenia: Ripke et al. (SCZ2): https://www.med.unc.edu/pgc/download-results/scz/

You can process these summary statistics with the munge() functino as per instructions on the GenomicSEM wiki

If you want to proceed quickly and without having to download and process all files yourself you can downlaod the files I minged directly HERE

IMPORTANT I found some NaN values in BMI summary statistics, which I had to remove, I did so by searching for them in a text editor and deleting the entire line.

Having collected all the files we need we go ahead and run hdl() in genomicSEM. Note hdl() will warn you that overlap between the BMI sumstats and the references file it low (~98.%). If we were running this as a real analysis for publication, we would download the full GWAS sumstats, and select the exact set of SNPs. hdl() will run for about 15 minutes maybe a bit more, if you want faster estimation (because you have many traits for example) consider running hdl() with the argument method= set to ”piecewise” which is a faster implementation of hdl() which has some additional benefits we will outline soon. Code for both versions is included, as is a run with LD score regression as implemented in genomicSEM.

# run HDL
hdl.covstruct <- hdl(c("BMI.sumstats","EA3.sumstats.gz","scz2.sumstats.gz"),sample.prev = c(NA,NA,.5),population.prev = c(NA,NA,0.01) ,trait.names=c("BMI","EA3","SCZ"),LD.path="UKB_imputed_hapmap2_SVD_eigen99_extraction/", method = "jackknife")
  
# a faster variant of HDL
hdl.covstruct.2 <- hdl(c("BMI.sumstats","EA3.sumstats.gz","scz2.sumstats.gz"),sample.prev = c(NA,NA,.5),population.prev = c(NA,NA,0.01) ,trait.names=c("BMI","EA3","SCZ"),LD.path="UKB_imputed_hapmap2_SVD_eigen99_extraction/", method = "piecewise")  
  
  
# run LDSC for comparisson
ldsc.covstruct <- ldsc(traits =c("BMI.sumstats","EA3.sumstats.gz","scz2.sumstats.gz"),sample.prev = c(NA,NA,.5),population.prev = c(NA,NA,0.01) ,ld = "eur_w_ld_chr/",wld = "eur_w_ld_chr/",trait.names =c("BMI","EA3","SCZ"))

Having computed the genetic covariance matrix, and its variance covariance matrix, we can proceed and fit a genomicSEM model. Here we simply compute the genetic correlations, so you can compare them with output from the original HDL or ldsc implementations if you wish.

  # Model to compute correlations:
  
  
  model <- '
  lSCZ =~ NA*SCZ
  lEA3 =~ NA*EA3
  lBMI =~ NA*BMI
  
  SCZ ~~ 0*SCZ + 0*EA3 + 0*BMI
  EA3 ~~ 0*EA3 + 0*BMI
  BMI ~~ 0*BMI
  
  lSCZ ~~ 1*lSCZ + lEA3 + lBMI
  lEA3 ~~ 1*lEA3 + lBMI
  lBMI ~~ 1*lBMI
  '

  
## fit a model for the correlations based on HDL:
hdl.cors <-  usermodel(covstruc = hdl.covstruct ,estimation = "DWLS",model = model)
## [1] "Running primary model"
## [1] "Calculating model chi-square"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed 
##   0.738
hdl.cors
## $modelfit
##    chisq df p_chisq AIC CFI         SRMR
## df    NA  0      NA  NA   1 1.456728e-09
## 
## $results
##     lhs op  rhs Unstandardized_Estimate Unstandardized_SE Standardized_Est
## 1  lSCZ =~  SCZ              0.42670246       0.007709009       1.00000000
## 2  lEA3 =~  EA3              0.31105812       0.003807877       1.00000000
## 3  lBMI =~  BMI             -0.37420119       0.004785996       1.00000000
## 11 lSCZ ~~ lEA3              0.06347467       0.015422844       0.06347467
## 12 lSCZ ~~ lBMI              0.13108164       0.013900687      -0.13108164
## 14 lEA3 ~~ lBMI              0.29887999       0.015921505      -0.29887999
##    Standardized_SE
## 1       0.01806647
## 2       0.01224169
## 3       0.01278990
## 11      0.01542284
## 12      0.01390069
## 14      0.01592150

and for comparison let’s look at the results using the "piecewise" implementation of HDL, and ldsc():

## fit a model for the correlations based on with HDL piecewise (faster version):
 hdl.cors2 <-  usermodel(covstruc = hdl.covstruct.2,estimation = "DWLS",model = model)
## [1] "Running primary model"
## [1] "Calculating model chi-square"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed 
##   0.808
 hdl.cors2
## $modelfit
##    chisq df p_chisq AIC CFI         SRMR
## df    NA  0      NA  NA   1 5.220787e-09
## 
## $results
##     lhs op  rhs Unstandardized_Estimate Unstandardized_SE Standardized_Est
## 1  lSCZ =~  SCZ             -0.42929800       0.009449058       1.00000000
## 2  lEA3 =~  EA3              0.32898487       0.004492512       1.00000000
## 3  lBMI =~  BMI              0.38906525       0.005118436       1.00000000
## 11 lSCZ ~~ lEA3             -0.04779385       0.013969207       0.04779385
## 12 lSCZ ~~ lBMI              0.12676083       0.015809034      -0.12676083
## 14 lEA3 ~~ lBMI             -0.18557739       0.013052721      -0.18557739
##    Standardized_SE
## 1       0.02201049
## 2       0.01365568
## 3       0.01315573
## 11      0.01396921
## 12      0.01580903
## 14      0.01305272
 ## fit a model for the correlations based on ldsc for comparison:
 ldsc.cors <-  usermodel(covstruc = ldsc.covstruct,estimation = "DWLS",model = model)
## [1] "Running primary model"
## [1] "Calculating model chi-square"
## [1] "Calculating CFI"
## [1] "Calculating Standardized Results"
## [1] "Calculating SRMR"
## elapsed 
##   0.627
 ldsc.cors
## $modelfit
##    chisq df p_chisq AIC CFI         SRMR
## df    NA  0      NA  NA   1 2.007542e-10
## 
## $results
##     lhs op  rhs Unstandardized_Estimate Unstandardized_SE Standardized_Est
## 1  lSCZ =~  SCZ              0.50090672       0.010097916       1.00000000
## 2  lEA3 =~  EA3              0.33509115       0.004504385       1.00000000
## 3  lBMI =~  BMI             -0.45722380       0.006857845       1.00000000
## 11 lSCZ ~~ lEA3              0.06172632       0.019153422       0.06172632
## 12 lSCZ ~~ lBMI              0.11110016       0.017700927      -0.11110016
## 14 lEA3 ~~ lBMI              0.26925026       0.013625589      -0.26925026
##    Standardized_SE
## 1       0.02015927
## 2       0.01344227
## 3       0.01499888
## 11      0.01915342
## 12      0.01770093
## 14      0.01362559

Lets collect the relevant results into one dataframe

df <-  round(cbind(ldsc.cors$results[4:6,6:7],hdl.cors$results[4:6,6:7],hdl.cors2$results[4:6,6:7]),3)

colnames(df) <- c("rg ldsc","se ldsc","rg hdl","se hdl","rg hdl piecewise","se hdl piecewise")
rownames(df) <- c("cor(SCZ,EA)","cor(SCZ,BMI)","cor(BMI,EA)")

knitr::kable(df)
rg ldsc se ldsc rg hdl se hdl rg hdl piecewise se hdl piecewise
cor(SCZ,EA) 0.062 0.019 0.063 0.015 0.048 0.014
cor(SCZ,BMI) -0.111 0.018 -0.131 0.014 -0.127 0.016
cor(BMI,EA) -0.269 0.014 -0.299 0.016 -0.186 0.013

As you can see very similar results, but in this instance the standard errors aren’t that much smaller for hdl() then for ldsc() which could be due to the somewhat low overlap between the BMI sumstats and the reference LD, or to slight mismatch between the UKB reference LD and these GWAS, which are based on UKB and many other cohorts. Nonetheless, you now have HDL at your disposal which will certainly be valuable in low power situations. Note that with out of sample LD hdl() may esitmate somewhat lower heritability, and a higher intercept, this means you should NOT use the intercept to shrink effect sizes if you run a GWAS in GenomicSEM.