The preprint by de la Fuente et al. (2021) introduce a novel multivariate method for the joint analysis of direct GWAS and proxy GWAX summary data of complex diseases that relaxes standard assumptions and recovers unbiased estimates of common variant SNP heritability and of individual SNP effects under a variety of conditions.
The proposed meta-analytic approach is implemented in Genomic SEM, and allows estimating a multivariate model of genetic risk for complex diseases using summary data from case-control GWAS and proxy-phenotype GWAX. For illustrative purposes hereforth we will use two data sources: 1) direct case-control GWAS of Alzheimer’s disease from the International Genetics of Alzheimer’s Project (IGAP), and 2) proxy-phenotype GWAX summary data of maternal and paternal history of diseases from UKB. In our multivariate model, the total genetic propensity towards disease risk is represented as a latent factor, F, that is specified to affect the direct GWAS phenotype and the two GWAX phenotypes according to the following system of regression equations:
Or more compactly as Y=ΛF+U, where Λ is a vector of coefficients relating F to measured phenotypes Y, and U constitutes residual genetic propensities toward each of the measured phenotypes that are independent of F, and uncorrelated with one another and with F. This model is represented as a path diagram below (excluding the dashed portion).
Figure 1. Path diagram for multivariate genetic analysis of direct GWAS and proxy GWAX of complex diseases. We present the minimal identification constraint λdirect=1 such that the variance of the factor corresponds to the meta-analytic heritability estimate on scale of the direct GWAS. The dashed portion of the diagram represents the portion of the model that is specified to produce meta-analytic estimates for effects of individual SNPs.
We specify the model with the minimal identification constraint that λdirect=1 such that F takes on the scale of the direct GWAS phenotype, and σF2 can be interpreted as an unbiased estimate of hSNP2 of the meta-analyzed phenotype in the direct GWAS metric. Under this parameterization the departure of λmat and λpat from .5 indicates departure of the empirical data from the standard GWAX model.
The multivariate model can be expanded to include the effect of an individual genetic variant, x, on the latent factor, F (Figure 1 including the dashed portion). Such a model comprises the following two sets of equations:
where γ is an unstandardized regression coefficient and e is a residual. Such a model can be run for all available SNPs, one at a time, such that a complete set of meta-analytic summary statistics for F, can be produced from the relaxed multivariate model.
In this example we will use direct case-control GWAS of AD from IGAP and GWAX of maternal and paternal AD from the UK Biobank Biobank. To run the model you will need to install the Genomic SEM package (installation guide here), and have the LD scores folder (“eur_w_ld_chr/”), the HapMap3 reference file(“w_hm3.noMHC.snplist”), and the 1000Genomes reference file (“reference.1000G.maf.0.005.txt”) in your working directory. All these file can be downloaded here.
After getting your working directory ready with all the required files, we will start by munging our summary statistics with the HapMap3 reference file using the munge function from the Genomic SEM package. This function prepares the original summary statistics files in LDSC format, performing some basic QC on the genetic variants based on INFO scores and MAF. The package defaults filter out SNPs with INFO scores < 0.9 and MAF < 0.01, but these filters can be manually specified by the user. Before munging our files, we will take some additional steps for this particular analysis. Firstly, we will filter out the LD block of the APOE region, to prevent SNPs with extra large effects to affect LDSC slope estimation.
library(data.table)
#------------------ IGAP SUMSTATS (Kunkle et al., 2019) -----------------#
AD_IGAPII <- fread("Kunkle_etal_Stage1_results.txt",data.table = F)
# Exclude SNPs on chromosome 19 containing the linkage disequilibrium block of the APOE gene
AD_IGAPII <- AD_IGAPII[AD_IGAPII$SNP %notin%
AD_IGAPII[AD_IGAPII$chr == 19 & (AD_IGAPII$pos >= 45116911 & AD_IGAPII$pos <= 46318605), "SNP"],]
# Write .txt file with the pruned AD sumstats
write.table(AD_IGAPII,"AD_IGAPII_NOAPOE.txt",quote = F, sep = " ", row.names = F, col.names = T)
#------------------ GWAX SUMSTATS (Marioni et al., 2019) -----------------#
#------------------------ UKB Maternal AD ------------------------#
UKB_Maternal <- fread("1_UKB_AD_maternal_summary_output_June2019.txt",data.table = F)
# Exclude SNPs on chromosome 19 containing the linkage disequilibrium block of the APOE gene
UKB_Maternal <- UKB_Maternal[UKB_Maternal$SNP %notin%
UKB_Maternal[UKB_Maternal$CHR == 19 & (UKB_Maternal$BP >= 45116911 & UKB_Maternal$BP <= 46318605), "SNP"],]
# Write .txt file with the pruned AD sumstats
write.table(UKB_Maternal,"UKB_Maternal_NOAPOE.txt",quote = F, sep = " ", row.names = F, col.names = T)
#------------------------ UKB Paternal AD ------------------------#
UKB_Paternal <- fread("2_UKB_AD_paternal_summary_output_June2019.txt",data.table = F)
# Exclude SNPs on chromosome 19 containing the linkage disequilibrium block of the APOE gene
UKB_Paternal <- UKB_Paternal[UKB_Paternal$SNP %notin%
UKB_Paternal[UKB_Paternal$CHR == 19 & (UKB_Paternal$BP >= 45116911 & UKB_Paternal$BP <= 46318605), "SNP"],]
# Write .txt file with the pruned AD sumstats
write.table(UKB_Paternal,"UKB_Paternal_NOAPOE.txt",quote = F, sep = " ", row.names = F, col.names = T)
Next we need to calculate the effective sample size associated with the traits. Effective sample size refers to 4v(1-v)n, where v is the sample prevalence. This is particularly important for IGAP’s summary statistics, as they comprise multiple case-control cohorts with diverse case ascertainment. A full description tutorial on calculating sum of effective sample size and preparing GWAS summary statistics can be found in the Genomic SEM wiki. For this summary statistics we will calculate the sum of effective sample sizes across contributing cohorts as follows:
#Kunkle et al. 2019 AD GWAS Sum of Neffs
#ADGC consortium
ncases <- c(532,1549,727,894,304,286,213,268,6,27,9,666,658,491,256,1798,80,132,696,13,295,59,323,668,596,1177,1255,339,38,73)
ncontrols <- c(1571,512,156,586,377,505,338,173,112,144,141,712,1046,738,189,1568,48,153,762,233,769,217,181,365,170,1126,829,187,94,560)
n <- ncases+ncontrols
pi <- ncases/n
EffN <- 4*pi*(1-pi)*n
ADGC_EffN <- sum(EffN)
#CHARGE consortium
ncases <- c(95,277,450,330,985)
ncontrols <- c(2708,169,1702,3910,4985)
n <- ncases+ncontrols
pi <- ncases/n
EffN <- 4*pi*(1-pi)*n
CHARGE_EffN <- sum(EffN)
#EADI consortium
ncases <- c(2240)
ncontrols <- c(6631)
n <- ncases+ncontrols
pi <- ncases/n
EffN <- 4*pi*(1-pi)*n
EADI_EffN <- sum(EffN)
#GERAD I (MRC,ARUK,BONN,WASHU)
ncases <- sum(1008,939,551,423,256)
ncontrols <- sum(873,82,37,156,6129)
n <- ncases+ncontrols
pi <- ncases/n
EffN <- 4*pi*(1-pi)*n
GERAD_I_EffN <- sum(EffN)
#GERAD II (cases: NIMH,UCL PRION, UCL LASER, 1958BC, KORA, HNR)
ncases <- sum(127,82,47)
ncontrols <- sum(5342,434,353)
n <- ncases+ncontrols
pi <- ncases/n
EffN <- 4*pi*(1-pi)*n
GERAD_II_EffN <- sum(EffN)
#Total sum of effective Ns across contributing consortia
IGAP_total_effN <- round(sum(ADGC_EffN,CHARGE_EffN,EADI_EffN,GERAD_I_EffN))
For the UKB maternal and paternal proxy-phenotype summary statistics we calculate the effective sample size as 4v(1-v)n:
#UKB maternal Neff
ncases <- 27696
ncontrols <- 260980
n <- ncases+ncontrols
pi <- ncases/n
UKBmat_EffN <- 4*pi*(1-pi)*n
#UKB paternal Neff
ncases <- 14338
ncontrols <- 245941
n <- ncases+ncontrols
pi <- ncases/n
UKBpat_EffN <- 4*pi*(1-pi)*n
After calculating the effective sample sizes we munge the summary statistics with the reference file. Make sure to check the munge .log files to make sure that everything is going as expected.
library(GenomicSEM)
# Munge summary statistics excluding block of the APOE gene with HapMap3 ref file
#IGAP sumstats
munge(files=c("AD_IGAPII_NOAPOE.txt"), hm3 = "w_hm3.noMHC.snplist",
trait.names=c("AD_IGAP_NOAPOE"),N=IGAP_total_effN, info.filter = 0.9, maf.filter = 0.01)
#UKB maternal sumstats
munge(files=c("UKB_Maternal_NOAPOE.txt"), hm3 = "w_hm3.noMHC.snplist",
trait.names=c("UKB_Maternal_NOAPOE"),N=UKBmat_EffN, info.filter = 0.9, maf.filter = 0.01)
#UKB paternal sumstats
munge(files=c("UKB_Paternal_NOAPOE.txt"), hm3 = "w_hm3.noMHC.snplist",
trait.names=c("UKB_Paternal_NOAPOE"),N=UKBpat_EffN, info.filter = 0.9, maf.filter = 0.01)
And finally, we run multivariate LDSC regression on the munged summary statistics to estimate the empiric genetic covariance matrix S and its associated sampling covariance matrix V, which allows computing standard errors for the parameter estimates. For a more detailed explanation on the Genomic SEM ldsc function follow this link to the wiki. Note that we input sample prevalence = 0.5 as we are using the (sum of) effective sample sizes, which reflects the total sample size had the proportion of cases and controls in the sample been equal. Based on epidemiological data we set the population prevalence to .05. Make sure to check the LDSC .log file to make sure that everything is going as expected.
##################### Multivariate Model of Direct GWAS and GWAX of Alzheimer's Disease ##########################
######## Multivariate LDSC results ########
traits <- c("AD_IGAP_II_NOAPOE.sumstats.gz","UKB_Maternal_NOAPOE.sumstats.gz","UKB_Paternal_NOAPOE.sumstats.gz")
sample.prev <- c(.5,.5,.5)#Set sample prevalence at .5 (we are using effective sample size)
population.prev <- c(0.05,0.05,0.05)#Set population prevalence across contributing cohorts
ld <- "eur_w_ld_chr/"#LD folder
wld <- "eur_w_ld_chr/"#LD weights folder
trait.names<-c("AD_IGAP_II","AD_UKB_Maternal","AD_UKB_Paternal")#Set trait names
LDSCoutput <- ldsc(traits, sample.prev, population.prev, ld, wld, trait.names, stand = T)#Run multivariate LDSC
save(LDSCoutput,file = "LDSCoutput_AD_GWAX.RData")#Save LDSC output
Now we have all we need to estimate AD heritability using the Genomic SEM relaxed model. The first thing to do is to specify the model using the syntax below:
#Define the model using factor loading parameterization
commonfactor.model<-'AD =~ 1*AD_IGAP_II + AD_UKB_Maternal + AD_UKB_Paternal
AD~~NA*AD'
where AD represents the latent factor of AD liability, which captures the shared genetic variance among the GWAS and GWAX indicators. Note that we use factor loading parameterization by fixing the unstandardized factor loading of the IGAP case-control GWAS summary statistics to 1, so that the AD latent factor is in the scale of the case-control summary statistics. This model allows freely estimating the λ attenuation coefficients (i.e., the factor loadings) for the GWAX indicators, which will inform about potential departures from the simple GWAX model assumption (i.e., λmat=λpat = .5). Now we will fit the model using the output from the multivariate LDSC regression we ran before.
#Fit the model
ADFoutput<-usermodel(covstruc = LDSCoutput, estimation = "DWLS", model = commonfactor.model,
CFIcalc = TRUE, std.lv = FALSE, imp_cov = FALSE)
#Check parameter estimates
ADFoutput$results
## lhs op rhs Unstand_Est Unstand_SE
## 6 AD =~ AD_UKB_Maternal 0.462814184 0.168476663565629
## 7 AD =~ AD_UKB_Paternal 0.365882961 0.121864082249674
## 4 AD ~~ AD 0.069471299 0.0297020365461447
## 1 AD_IGAP_II ~~ AD_IGAP_II 0.003960514 0.0280669868516446
## 2 AD_UKB_Maternal ~~ AD_UKB_Maternal 0.001930155 0.00601804164677808
## 3 AD_UKB_Paternal ~~ AD_UKB_Paternal 0.003098379 0.00690842366521341
## 5 AD =~ AD_IGAP_II 1.000000000
## STD_Genotype STD_Genotype_SE STD_All p_value
## 6 0.96728830 0.352118648243169 0.94084162 0.006013357
## 7 0.89042901 0.296573836416133 0.86608374 0.002678727
## 4 0.94606543 0.404484590447722 0.94606543 0.019338509
## 1 0.05393458 0.382218359476506 0.05393458 0.887783572
## 2 0.11481705 0.357988841174752 0.11481705 0.748416507
## 3 0.24989895 0.557197226630599 0.24989895 0.653797546
## 5 1.00000000 0.97265895 NA
The heritability estimate corresponds with the unstandardized estimate of the AD latent factor variance (i.e., AD~~AD Unstand_Est), which is 0.069 (h2=6.9%). The model output also indicates some deviations from the simple GWAX model, as the attenuation cofficients for the maternal (λmat=.463) and paternal (λpat= .366) GWAX are different from the expected value of .5.
Before running multivariate GWAS we need to prepare the summary statistics so that the same allele is the reference allele across SNPs and contributing cohorts. This preparation is performed by the sumstats function, which transforms the SNP effect sizes and SEs such that they are scaled relative to unit-variance scaled phenotypes. The Genomic SEM sumstats function then takes the scaled effect sizes and SEs and combined them into a single listwise deleted file. For a more detailed explanation on the sumstats check out the tutorial in the Genomic SEM wiki.
#Prepare sumstats for multivariate GWAS
#Define original sumstats file names
files=c("Kunkle_etal_Stage1_results.txt", "1_UKB_AD_maternal_GWAXbetas.txt", "2_UKB_AD_paternal_GWAXbetas.txt")
#1000 Genomes based allele frequency reference file
ref= "reference.1000G.maf.0.005.txt"
#Trait names !!! These should match those previously used for LDSC
trait.names=c("AD_IGAP_II","AD_UKB_Maternal","AD_UKB_Paternal")
#In this case the effect sizes are logistic betas and SEs, make sure to check this
se.logit=c(T,T,T)
#Filter SNPs with info scores < .6
info.filter=.6
#Filter SNPs with MAF < .01
maf.filter=0.01
#Run sumstats function
AD_sumstats <- sumstats(files=files,ref=ref,trait.names=trait.names,se.logit=se.logit,
info.filter=info.filter,maf.filter=maf.filter,keep.indel=FALSE,parallel=FALSE,cores=NULL)
#Write sumstats file ready for common factor GWAS
write.table(AD_sumstats, file = "AD_GWAX_SUMSTATS_Sept2021.txt",
quote = F, row.names = F, col.names = T)
The next step is to combine the output from the sumstats function with the LDSC output in order to run the multivariate GWAS relaxed model of Direct GWAS and proxy GWAX incopoprating individual SNP effects. To do so, we first define the model syntax for the model (see path diagram in Figure 1 above), and then run the userGWAS function:
model<-"AD =~ 1*AD_IGAP_II + AD_UKB_Maternal + AD_UKB_Paternal
AD~start(0.0001)*SNP"
#Specify what parts of the model you want to save
sub<-"AD~SNP"
#Run the multivariate GWAS using parallel processing
AD_output <- userGWAS(covstruc=LDSCoutput,SNPs=AD_sumstats,model=model,
estimation = "DWLS",sub=sub,printwarn = TRUE,smooth_check=TRUE,parallel = T,cores = 100)
#Save the output
write.csv(AD_output, file = "AD_output_sumstats.csv")
Now let’s check the output of the multivariate GWAS:
AD_output[1:10,]
## SNP CHR BP MAF A1 A2 lhs op rhs free label est
## 1 rs72631875 1 705882 0.0775348 G A AD ~ SNP 1 NA 0.030598444
## 2 rs4951859 1 729679 0.1590460 C G AD ~ SNP 1 NA -0.003675313
## 3 rs148120343 1 730087 0.0497018 T C AD ~ SNP 1 NA 0.021068044
## 4 rs141242758 1 734349 0.1123260 T C AD ~ SNP 1 NA 0.010102302
## 5 rs79010578 1 736289 0.1371770 T A AD ~ SNP 1 NA 0.000620636
## 6 rs3094315 1 752566 0.1600400 G A AD ~ SNP 1 NA 0.002459805
## 7 rs3131972 1 752721 0.1610340 A G AD ~ SNP 1 NA 0.002279826
## 8 rs2073813 1 753541 0.1232600 G A AD ~ SNP 1 NA 0.004801025
## 9 rs3131969 1 754182 0.1282310 A G AD ~ SNP 1 NA -0.002018143
## 10 rs3131968 1 754192 0.1282310 A G AD ~ SNP 1 NA -0.002070208
## SE Z_Estimate Pval_Estimate chisq chisq_df chisq_pval
## 1 0.018562277 1.64842083 0.09926634 0.5725457 7 0.9991351
## 2 0.010661921 -0.34471397 0.73030941 6.4676341 7 0.4863252
## 3 0.018295120 1.15156637 0.24949931 0.5323781 7 0.9993191
## 4 0.011885244 0.84998691 0.39533237 4.8399680 7 0.6794853
## 5 0.011605189 0.05347918 0.95735012 3.5568668 7 0.8291660
## 6 0.009383923 0.26212972 0.79322143 3.5790831 7 0.8267795
## 7 0.009465002 0.24086905 0.80965661 3.8007694 7 0.8024177
## 8 0.010543450 0.45535615 0.64885303 3.1750254 7 0.8683526
## 9 0.010282281 -0.19627383 0.84439585 2.6044637 7 0.9190276
## 10 0.010285114 -0.20128199 0.84047808 2.5843661 7 0.9206114
## AIC error warning Z_smooth
## 1 6.572546 0 0 0
## 2 12.467634 0 0 0
## 3 6.532378 0 0 0
## 4 10.839968 0 0 0
## 5 9.556867 0 0 0
## 6 9.579083 0 0 0
## 7 9.800769 0 0 0
## 8 9.175025 0 0 0
## 9 8.604464 0 0 0
## 10 8.584366 0 0 0
Each row presents the AD liability factor regressed on the SNP listed in the “SNP” column. The “est” column lists the corresponding effect. The “SE” column lists the sandwich corrected standard error associated with this effect. The “Z_Estimate” column lists the ratio of est/SE. The “Pval_Estimate” column lists the corresponding p-value of Z_Estimate. The fail column indicates whether a particular run did not converge. The warning column will list whether a particular run raised any warnings in lavaan.