Background and reasoning

Available data: In the following analysis, we use freely available genome-wide association study (GWAS) summary statistics from Linner et al. (2019) on risky behaviors (https://doi.org/10.1038/s41588-018-0309-3). In this study, the authors performed GWASs on the propensity to speeding, alcohol consumption, smoking, and the number of self-reported sexual partners. In addition to the four traits, they extracted the first principal component (PC) of these four traits in a principal component analysis (PCA), and performed a GWAS on this first PC to create a summary of overall risky behaviors. The summary statistics are available on https://www.thessgac.org/data .

It is important to note that the GWAS summary statistics for ever smoking status were derived from UKB as well as TAG, whereas all the others (including the PC sumstats) were obtained from UKB only.

Aim of the analysis: Using Linner’s summary statistics of speeding, drinking, smoking and sexual partners, we will perform linkage disequilibirum score regression (LDSC) and obtain the genetic correlation matrix from the four risky behaviors in Linner et al. (2019). Using PCA, we will extract the first PC from this genetic correlation matrix. Resulting standardised loadings will be used to weight composite scores that summarise SNP effects (z-scores) contributed by the four risky behaviors. We will use a modified version of the R package provided by Baselman et al. (2019) (https://www.ncbi.nlm.nih.gov/pubmed/30643256) to create composite scores which also allows to correct for sample overlap.

Validation of the resuling composite scores: To validate the results, we will compare SNP effects from Linner’s phenotypic PC GWAS with effect sizes extracted in our analysis. A perfect correlation between our composite scores that we created using summary statistics only (called GWAMA in this document) and the PC sumstats would indicate that we captured the same effects as Linner et al. obtained in their PCA with phenotypic data. Validating the composite scores serves as a sanity check and is crucial as to our knowledge this kind of approach is unprecedented.

Obtain genetic correlation matrix

After downloading the summary statistics, we use the genomic SEM package in R to pre-process and munge the files (Grotzinger et al., 2019; https://www.nature.com/articles/s41562-019-0566-x). Using the prepared sumstats files, we perform LDSC to obtain a genetic covariance/ correlation matrix.

Step 1: Prepare and munge files

library(data.table)
library(devtools)
library(GenomicSEM)
library(knitr)
library(psych)
# set work directory to where raw GWAS are stored
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_raw")

# list all files in this directory, only first four are the risky behaviors
gwas_files<-list.files()
gwas_files<-gwas_files[1:4]

#load all files listed in gwas_files
for (i in 1:length(gwas_files)) assign(gwas_files[i],fread(gwas_files[i],header=T,data.table=F))

# double check which files were loaded in 
ls(pattern=".txt")

# change column names &
# set wd to the directory the files with the new names should be saved
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_header")

for(i in gwas_files){
file<-get(i)
print(dim(file))
print(head(file))
names(file)<-c("SNP","CHR","POS","A1","A2","eaf_A1","beta","se","p")
print(head(file))
fwrite(file,file=i, quote=FALSE,col.names=TRUE,row.names=F,sep=" ")
}

#################
## munge
#################
files<-gwas_files

# as reference SNP list use 1000G reference genome (HapMap 3)
# wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
hm3<-"w_hm3.noMHC.snplist.txt"

trait.names<-c("speeding","drinking","smoking","sexual_partners")

N<-c(404291,414343,518633,370711)

munge(files=files,
      hm3=hm3,
      trait.names = trait.names,
      N=N)

Step 2: Perform linkage disequilibrium score regression (LDSC)

# download and unpack SNP weights: 
# wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
# tar -jxvf eur_w_ld_chr.tar.bz2
# download software to unzip .tar document on windows
##############################################################
# setwd to munged files
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged")

# specifications for ldsc function
traits<-c("speeding.sumstats.gz","drinking.sumstats.gz","smoking.sumstats.gz","sexual_partners.sumstats.gz")
ld<- "eur_w_ld_chr/"
wld<- "eur_w_ld_chr/"
trait.names<-c("speeding","drinking","smoking","sexual_partners")
sample.prev<-c(NA,NA,.59,NA)

# https://bmjopen.bmj.com/content/4/12/e005663 
# this paper says that 50% of population had never smoked (but they use UKB ...?)
# difficult to find papers on ever smoked rather than regular smokers
population.prev<-c(NA,NA,.5,NA)

LDSCoutput_risk_taking<-ldsc(traits=traits,
                             sample.prev=sample.prev,
                             population.prev=population.prev,
                             ld=ld,wld=wld,
                             trait.names = trait.names)

#LDSCoutput_risk_taking$S is going to be the covariance matrix
#LDSCoutput_risk_taking$V is the sampling covariance matrix as expected by Lavaan
save(LDSCoutput_risk_taking, file="risk_taking.RData")

Visualise the genetic correlation matrix

Step 1: Extract covariance and correlation matrix

load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")
##standardise covariance matrix
covmatrix<-LDSCoutput_risk_taking$S
cormatrix<-round(cov2cor(LDSCoutput_risk_taking$S),digits=2)

Step 2: Create a heatmap from the genetic correlation matrix

Sanity check:

For comparison with our matrix, we contacted Linner et al. (2019) who shared their genetic correlation matrix:

They found very similar genetic correlations between the four risky behaviors using LDSC and we suggest that our pre-processing was successful.

Step 3: Display LDSC intercepts to quantify sample overlap between the four risky behaviors

load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")
##standardise covariance matrix
intercepts<-data.matrix(round(LDSCoutput_risk_taking$I,digits = 2))
colnames(intercepts)<-c("speeding","drinking","smoking","sexual_partners")
rownames(intercepts)<-c("speeding","drinking","smoking","sexual_partners")

Principal Component Analysis

Eigen decomposition

We perform eigen decomposition on the correlation matrix using the eigen function. We aim to obtain standardised loadings of the four risky behaviors on the first PC, by standardising the eigenvectors of the first PC with their corresponding eigenvalue.

Eigenvectors

#extract eigenvectors and eigenvalues
eigenvectors<-round(eigen(cormatrix)$vectors,digits=3)

\[\begin{bmatrix} -0.37&0.855&0.246&-0.267 \\ -0.496&0.089&-0.841&0.197 \\ -0.544&-0.48&0.111&-0.679 \\ -0.567&-0.176&0.469&0.654 \\ \end{bmatrix}\]

Eigenvalues and explained variance

eigenvalues<-round(eigen(cormatrix)$values,digits = 3)
explained_variance<-round(eigenvalues/sum(eigenvalues)*100,digits = 3)
library(knitr)
values<-cbind(c("1st PC","2nd PC","3rd PC","4th PC"),eigenvalues,explained_variance)
kable(values, caption="Eigenvalues and explained variance (%)")

Eigenvalues and explained variance (%)
	eigenvalues	explained_variance
1st PC	1.997	49.913
2nd PC	0.896	22.394
3rd PC	0.687	17.171
4th PC	0.421	10.522

Standardised loadings

We standardise the eigenvectors with the eigenvalues, in order to obtain standardised loadings.

The standardised loadings from our genetic correlation matrix are: \[\begin{bmatrix} -0.523&0.809&0.204&-0.173 \\ -0.701&0.084&-0.697&0.128 \\ -0.769&-0.454&0.092&-0.441 \\ -0.801&-0.167&0.389&0.424 \\ \end{bmatrix}\]

Create weighted composite scores

We prepare the files to satisfy the expected format of the multivariate_GWAMA function by Baselman et al. (2019). Note that the function expects exactly nine columns (SNPID,CHR,BP,EA,OA,EAF,N,Z,P) and that effect sizes must be indicated as z-scores instead of betas. We slightly modify the function from Baselman et al. and, using the standardised loadings extracted in the PCA, we create our weighted z-scores summarising SNP effects contributed by the four risky behaviors.

Step 1: Format the files to be included in weighted score

# read in gwas files
gwas_files<-list.files()
gwas_files<-gwas_files[1:4]

for (i in 1:length(gwas_files)) assign(gwas_files[i],fread(gwas_files[i],header=T,data.table=F))
ls(pattern=".txt")

# generate N column
AUTOMOBILE_SPEEDING_PROPENSITY_GWAS.txt$N<-404291
DRINKS_PER_WEEK_GWAS.txt$N<-414343
EVER_SMOKER_GWAS_MA_UKB_TAG.txt$N<-518633
NUMBER_SEXUAL_PARTNERS_GWAS.txt$N<-370711
  
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAMA_format")

# transform beta into z-scores
# z-score=beta/stand_error # https://www.biostars.org/p/140292/
# re-name columns and include only the 9 columns needed by the function

for(i in gwas_files){
  file<-get(i)
  file$Z<-file$beta/file$se
  print(dim(file))
  print(head(file))
  names(file)<-c("SNPID","CHR","BP","EA","OA","EAF","beta","se","P","N","Z")
  print(head(file))
  file<-file[,c("SNPID","CHR","BP","EA","OA","EAF","N","Z","P")]
  print(dim(file))
  print(head(file))
  fwrite(file,file=i, quote=FALSE,col.names=TRUE,row.names=F,sep=" ")
}

Step 2: Modify GWAMA function

In the equation formulated by Baselman et al. (2019), we replace the weights in their equation which is the square root of N times the heritability, with standardised loadings times the square root of N:

Change w = sqrt(N*h2)to w = sqrt(N)*standardised_loadings

If we only replace heritabilities with standardised loadings in the arguments of the function, we will get errors for two reasons:

It won’t allow negative loadings (because we are replacing h2 with potentially negative loadings and negative heritability wouldn’t make sense)
It won’t allow to calculate our weights, as there is no sqrt() of a negative value: sqrt(-number)=NaN

I have made the following changes to address these problems:

1) Disable sanity check for negative heritabilites:

source("https://github.com/baselmans/multivariate_GWAMA/blob/master/Test_Data/N_weighted_GWAMA.function.1_2_6.R?raw=TRUE")
my_GWAMA<-multivariate_GWAMA
trace(my_GWAMA,edit=T)

I deleted lines 217-221, which disables the sanity check to stop the function from running when loadings are not in the expected h2 format (h2<0 & h2>1).

2) Modify code to calculate our weights

Original code:

# line 663
W<-t(t(N) * h2)
# line 665
sqrt_W<-sqrt(W)

As we replace h2 with negative loadings, this code produces NaNs.

Modified code:

w = t(t(sqrt(N))*h2)
sqrt_W<-w

Step 3: Run GWAMA with modified function

# Make an empty list with length equal to the number of input files
dat<-vector("list",4)

# Read in the data 
n<-0
for(file in gwas_files){
  n<-n+1
  print(n)
  dat[[n]]<-fread(gwas_files[n],data.table = F)
}

# load covariance matrix from LDSC
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")

# select the matrix of LDSC intercepts 
# this matrix should be in the same order as data read in into dat (done it all automatic)
CTI<-as.matrix(LDSCoutput_risk_taking$I)

# load correlation matrix
cormatrix<-cov2cor(LDSCoutput_risk_taking$S)

# eigen decomposition
eigenvectors<-eigen(cormatrix)$vectors
eigenvalues<-eigen(cormatrix)$values

# calculate standardised loadings
loadings<-as.vector(eigenvectors%*%sqrt(diag(eigenvalues))[,1])

# run my modified GWAMA function
my_GWAMA(x=dat,
          cov_Z=CTI,
          h2=loadings,
          out=".",
          name="my_GWAMA_weights_changed",
          output_gz=F,
          check_columns=F)

Running the function can take a while. It will create one file containing the composite summary stats and a log file to keep track of operations.

For example, the log file shows the number of total and significant SNPs:

“14,701 SNPs reached genome-wide significance (Minimum P-value: 3.1405e-41)”
“Successfully created summary statistics for 11,424,118 SNPs”

The function does not test for significance. It only counts the number of SNP’s below a certain p-value threshold. The default threshold used here is p < 5e-08, but can be changed by the user.

Evaluate the resulting composite summary stats (labelled GWAMA)

As explained above, we chose the Linner et al. (2019) summary statistics as they performed a GWAS on the first PC, in addition to the four risky behaviors. The PC sumstats can be used as a reference for how well GWAMA represents a summary of SNP effects contributed by the four risky behaviors: speeding, drinking, smoking and sexual partners. If our composite GWAMA scores highly correlate with the PC sumstats, we can be confident that GWAMA is capturing the same effects without having used phenotypic data at all.

We will perform genetic correlations between GWAMA, the PC and the four risky behaviors summary stats. We will also plot beta and z statistics contained in GWAMA against those contained in the other summary statistics and calculate their correlations.

It is important to note that the directions indicated by the standardised loadings are arbitrary and it would not change the signal contained in the statistics if we flipped the signs of the loadings as long as this is done consistently across all loadings. In the case of this analysis, we obtained negatively associated indices, because we weighted the SNPs with negative standardised loadings. However, for future analyses we suggest to scale the composite scores in the most sensible direction to make results more intuitively interpretable.

Prepare and munge GWAMA and PC files for LDSC

# set work directory to where raw GWAS are stored
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_raw")

# store file name in gwas_file
gwas_file<-list.files(pattern="RISK_PC")

# load into R
Risk_PC<-fread(gwas_file,header=T,data.table = F)

# rename columns
names(Risk_PC)<-c("SNP","CHR","POS","A1","A2","eaf_A1","beta","se","p")

# and set wd to the directory I want to save the files with the new names
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA")

# write table 
fwrite(Risk_PC,file="Risk_PC_GWAS.txt", quote=FALSE,col.names=TRUE,row.names=F,sep=" ")


# prepare GWAMA file #
# set work directory to where GWAMA file is stored
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAMA_format")
# store file name in gwas_file
gwas_file<-list.files(pattern="weighted_GWAMA.results.txt")
# load into R
GWAMA<-fread(gwas_file,header=T,data.table = F)
# rename columns
names(GWAMA)<-c("CPTID","SNPID","CHR","BP","EA","OA","EAF","MAF","N","N_obs","Direction","BETA","SE","Z","p")
GWAMA<-GWAMA[,c("SNPID","CHR","BP","EA","OA","EAF","MAF","N","BETA","SE","Z","p")]
# setwd to where you want to save GWAMA
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA")
# write table 
fwrite(GWAMA,file="GWAMA_header.txt", quote=FALSE,col.names=TRUE,row.names=F,sep=" ")

# munge PC and GWAMA files ##

#load dependencies for genomic SEM
library(devtools)
library(GenomicSEM)

files<-c("Risk_PC_GWAS.txt","GWAMA_header.txt")

# use 1000G reference genome (HapMap 3)
# wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
hm3<-"C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_header/w_hm3.noMHC.snplist.txt"

trait.names<-c("Risk_PC","GWAMA")

N<-c(315894,NA)

munge(files=files,
      hm3=hm3,
      trait.names = trait.names,
      N=N)

Perform LDSC with GWAMA, PC and original risk taking GWAS sumstats

setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged")

traits<-c("speeding.sumstats.gz","drinking.sumstats.gz","smoking.sumstats.gz","sexual_partners.sumstats.gz","C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/Risk_PC.sumstats.gz","C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/GWAMA.sumstats.gz")

ld<- "eur_w_ld_chr/"
wld<- "eur_w_ld_chr/"
trait.names<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")
sample.prev<-c(NA,NA,.59,NA,NA,NA)
population.prev<-c(NA,NA,.5,NA,NA,NA)

LDSCoutput_evaluate_GWAMA<-ldsc(traits=traits,
                                 sample.prev=sample.prev,
                                 population.prev=population.prev,
                                 ld=ld,wld=wld,
                                 trait.names = trait.names)

setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA")
save(LDSCoutput_evaluate_GWAMA, file="Evaluate_GWAMA.RData")

Heatmap of genetic correlations between GWAMA, PC and risky behaviors

Note the strong genetic correlation between our GWAMA composite scores and the PC sumstats. It suggests that we capture the same effects with the weighted composite scores, as was captured with the PC sumstats.

Display the LDSC intercepts between the four risky behaviors, GWAMA and PC

LDSC intercepts represent sample overlap. We would expect traits compared with themselves to have a perfect sample overlap (1 or very close to 1). When comparing GWAMA with the four risky behaviors, we expect substantial sample overlap as all four have contributed to the sumstats contained in GWAMA. And as PC and GWAMA in theory capture the same signal, we expect an even stronger sample overlap between these two traits.

load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/Evaluate_GWAMA.RData")
##standardise covariance matrix
intercepts<-data.matrix(round(LDSCoutput_evaluate_GWAMA$I,digits = 2))
colnames(intercepts)<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")
rownames(intercepts)<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")

Evaluate GWAMA with standardised loadings from the correlation matrix

Scatterplots & correlations between z-scores ad beta

In the following plots, the blue line indicates the central correlation line and the red line is the line of identity with an intercept of 1 and a slope of -1: x=-y. Note: The GWAMA results are negative as we have weighted them with negative standardised loadings and it would not change the signal contained in the GWAMA sumstats if we multiplied all standardised loadings by -1.

GWAMA vs. PC sumstats

z statistics

beta

GWAMA vs. speeding sumstats

z statistics

beta

GWAMA vs. drinking

z statistics

beta

GWAMA vs. smoking

z statistics

beta

GWAMA vs. sexual partners

z statistics

beta

Create GWAMA with loadings from genetic covariance matrix

We also checked whether using standardised loadings from a covariance instead of a correlation matrix would obtain similar GWAMA scores. In order to do so, we ran the same code as displayed in “Create weighted composite scores” -> “Step 3: Run GWAMA with modified function”, but without standardising the covariance matrix obtained in LDSC.

## create composite scores from covariance matrix
source("https://github.com/baselmans/multivariate_GWAMA/blob/master/Test_Data/N_weighted_GWAMA.function.1_2_6.R?raw=TRUE")
library(data.table)

# modify multivariate_GWAMA function as done in "Step 2: Modified GWAMA function"

# Make an empty list with length equal to the number of input files
dat<-vector("list",4)

# Read in the data 
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAMA_format")

gwas_files<-c("AUTOMOBILE_SPEEDING_PROPENSITY_GWAS.txt","DRINKS_PER_WEEK_GWAS.txt","EVER_SMOKER_GWAS_MA_UKB_TAG.txt","NUMBER_SEXUAL_PARTNERS_GWAS.txt")

n<-0
for(file in gwas_files){
  n<-n+1
  print(n)
  dat[[n]]<-fread(gwas_files[n],data.table = F)
}

# load covariance matrix from LDSC
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")

# select the matrix of LDSC intercepts 
CTI<-as.matrix(LDSCoutput_risk_taking$I)

# load covariance!! matrix
covmatrix<-LDSCoutput_risk_taking$S

# eigen decomposition
eigenvectors<-eigen(covmatrix)$vectors
eigenvalues<-eigen(covmatrix)$values

# calculate standardised loadings
loadings<-as.vector(eigenvectors%*%sqrt(diag(eigenvalues))[,1])

# run my modified GWAMA function
my_GWAMA(x=dat,
         cov_Z=CTI,
         h2=loadings,
         out=".",
         name="my_GWAMA_cov_matrix",
         output_gz=F,
         check_columns=F)

Resulting composite scores weighted with standardised loadings from the covariance matrix were similar to composite scores with loadings from the correlation matrix. To summarise, the genetic correlation between GWAMA and the PC sumstats slighly declined (rcor = -0.992 vs. rcov = -0.954). The correlation between GWAMA and the smoking sumstats slightly increased (rcor = -0.826 vs. rcov = -0.913), but decreased for the speeding sumstats (rcor = -0.466 vs. rcov = -0.336).

Perform and validate principal component analysis in GWAS summary statistics

Anna Elisabeth Furtjes, King’s College London

12/03/2020

Background and reasoning

Obtain genetic correlation matrix

Step 1: Prepare and munge files

Step 2: Perform linkage disequilibrium score regression (LDSC)

Visualise the genetic correlation matrix

Step 1: Extract covariance and correlation matrix

Step 2: Create a heatmap from the genetic correlation matrix

Sanity check:

For comparison with our matrix, we contacted Linner et al. (2019) who shared their genetic correlation matrix:

Step 3: Display LDSC intercepts to quantify sample overlap between the four risky behaviors

Principal Component Analysis

Eigen decomposition

Eigenvectors

Eigenvalues and explained variance

Standardised loadings

Create weighted composite scores

Step 1: Format the files to be included in weighted score

Step 2: Modify GWAMA function

1) Disable sanity check for negative heritabilites:

2) Modify code to calculate our weights

Step 3: Run GWAMA with modified function

Evaluate the resulting composite summary stats (labelled GWAMA)

Prepare and munge GWAMA and PC files for LDSC

Perform LDSC with GWAMA, PC and original risk taking GWAS sumstats

Heatmap of genetic correlations between GWAMA, PC and risky behaviors

Display the LDSC intercepts between the four risky behaviors, GWAMA and PC

Evaluate GWAMA with standardised loadings from the correlation matrix

Scatterplots & correlations between z-scores ad beta

GWAMA vs. PC sumstats

z statistics

beta

GWAMA vs. speeding sumstats

z statistics

beta

GWAMA vs. drinking

z statistics

beta

GWAMA vs. smoking

z statistics

beta

GWAMA vs. sexual partners

z statistics

beta

Create GWAMA with loadings from genetic covariance matrix

Evaluate GWAMA with standardised loadings from covariance matrix

Scatterplots & correlations between z-scores and beta

GWAMA vs. PC

z statistics

beta

GWAMA vs. speeding

z statistics

beta

GWAMA vs. drinking

z statistics

beta

GWAMA vs. smoking

z statistics

beta

GWAMA vs. sexual partners

z statistics

beta