Available data: In the following analysis, we use freely available genome-wide association study (GWAS) summary statistics from Linner et al. (2019) on risky behaviors (https://doi.org/10.1038/s41588-018-0309-3). In this study, the authors performed GWASs on the propensity to speeding, alcohol consumption, smoking, and the number of self-reported sexual partners. In addition to the four traits, they extracted the first principal component (PC) of these four traits in a principal component analysis (PCA), and performed a GWAS on this first PC to create a summary of overall risky behaviors. The summary statistics are available on https://www.thessgac.org/data .
It is important to note that the GWAS summary statistics for ever smoking status were derived from UKB as well as TAG, whereas all the others (including the PC sumstats) were obtained from UKB only.
Aim of the analysis: Using Linner’s summary statistics of speeding, drinking, smoking and sexual partners, we will perform linkage disequilibirum score regression (LDSC) and obtain the genetic correlation matrix from the four risky behaviors in Linner et al. (2019). Using PCA, we will extract the first PC from this genetic correlation matrix. Resulting standardised loadings will be used to weight composite scores that summarise SNP effects (z-scores) contributed by the four risky behaviors. We will use a modified version of the R package provided by Baselman et al. (2019) (https://www.ncbi.nlm.nih.gov/pubmed/30643256) to create composite scores which also allows to correct for sample overlap.
Validation of the resuling composite scores: To validate the results, we will compare SNP effects from Linner’s phenotypic PC GWAS with effect sizes extracted in our analysis. A perfect correlation between our composite scores that we created using summary statistics only (called GWAMA in this document) and the PC sumstats would indicate that we captured the same effects as Linner et al. obtained in their PCA with phenotypic data. Validating the composite scores serves as a sanity check and is crucial as to our knowledge this kind of approach is unprecedented.
After downloading the summary statistics, we use the genomic SEM package in R to pre-process and munge the files (Grotzinger et al., 2019; https://www.nature.com/articles/s41562-019-0566-x). Using the prepared sumstats files, we perform LDSC to obtain a genetic covariance/ correlation matrix.
library(data.table)
library(devtools)
library(GenomicSEM)
library(knitr)
library(psych)
# set work directory to where raw GWAS are stored
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_raw")
# list all files in this directory, only first four are the risky behaviors
gwas_files<-list.files()
gwas_files<-gwas_files[1:4]
#load all files listed in gwas_files
for (i in 1:length(gwas_files)) assign(gwas_files[i],fread(gwas_files[i],header=T,data.table=F))
# double check which files were loaded in
ls(pattern=".txt")
# change column names &
# set wd to the directory the files with the new names should be saved
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_header")
for(i in gwas_files){
file<-get(i)
print(dim(file))
print(head(file))
names(file)<-c("SNP","CHR","POS","A1","A2","eaf_A1","beta","se","p")
print(head(file))
fwrite(file,file=i, quote=FALSE,col.names=TRUE,row.names=F,sep=" ")
}
#################
## munge
#################
files<-gwas_files
# as reference SNP list use 1000G reference genome (HapMap 3)
# wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
hm3<-"w_hm3.noMHC.snplist.txt"
trait.names<-c("speeding","drinking","smoking","sexual_partners")
N<-c(404291,414343,518633,370711)
munge(files=files,
hm3=hm3,
trait.names = trait.names,
N=N)
# download and unpack SNP weights:
# wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
# tar -jxvf eur_w_ld_chr.tar.bz2
# download software to unzip .tar document on windows
##############################################################
# setwd to munged files
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged")
# specifications for ldsc function
traits<-c("speeding.sumstats.gz","drinking.sumstats.gz","smoking.sumstats.gz","sexual_partners.sumstats.gz")
ld<- "eur_w_ld_chr/"
wld<- "eur_w_ld_chr/"
trait.names<-c("speeding","drinking","smoking","sexual_partners")
sample.prev<-c(NA,NA,.59,NA)
# https://bmjopen.bmj.com/content/4/12/e005663
# this paper says that 50% of population had never smoked (but they use UKB ...?)
# difficult to find papers on ever smoked rather than regular smokers
population.prev<-c(NA,NA,.5,NA)
LDSCoutput_risk_taking<-ldsc(traits=traits,
sample.prev=sample.prev,
population.prev=population.prev,
ld=ld,wld=wld,
trait.names = trait.names)
#LDSCoutput_risk_taking$S is going to be the covariance matrix
#LDSCoutput_risk_taking$V is the sampling covariance matrix as expected by Lavaan
save(LDSCoutput_risk_taking, file="risk_taking.RData")
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")
##standardise covariance matrix
covmatrix<-LDSCoutput_risk_taking$S
cormatrix<-round(cov2cor(LDSCoutput_risk_taking$S),digits=2)
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")
##standardise covariance matrix
intercepts<-data.matrix(round(LDSCoutput_risk_taking$I,digits = 2))
colnames(intercepts)<-c("speeding","drinking","smoking","sexual_partners")
rownames(intercepts)<-c("speeding","drinking","smoking","sexual_partners")
We perform eigen decomposition on the correlation matrix using the eigen function. We aim to obtain standardised loadings of the four risky behaviors on the first PC, by standardising the eigenvectors of the first PC with their corresponding eigenvalue.
#extract eigenvectors and eigenvalues
eigenvectors<-round(eigen(cormatrix)$vectors,digits=3)
\[\begin{bmatrix} -0.37&0.855&0.246&-0.267 \\ -0.496&0.089&-0.841&0.197 \\ -0.544&-0.48&0.111&-0.679 \\ -0.567&-0.176&0.469&0.654 \\ \end{bmatrix}\]
eigenvalues<-round(eigen(cormatrix)$values,digits = 3)
explained_variance<-round(eigenvalues/sum(eigenvalues)*100,digits = 3)
library(knitr)
values<-cbind(c("1st PC","2nd PC","3rd PC","4th PC"),eigenvalues,explained_variance)
kable(values, caption="Eigenvalues and explained variance (%)")
| eigenvalues | explained_variance | |
|---|---|---|
| 1st PC | 1.997 | 49.913 |
| 2nd PC | 0.896 | 22.394 |
| 3rd PC | 0.687 | 17.171 |
| 4th PC | 0.421 | 10.522 |
We standardise the eigenvectors with the eigenvalues, in order to obtain standardised loadings.
The standardised loadings from our genetic correlation matrix are: \[\begin{bmatrix} -0.523&0.809&0.204&-0.173 \\ -0.701&0.084&-0.697&0.128 \\ -0.769&-0.454&0.092&-0.441 \\ -0.801&-0.167&0.389&0.424 \\ \end{bmatrix}\]
We prepare the files to satisfy the expected format of the multivariate_GWAMA function by Baselman et al. (2019). Note that the function expects exactly nine columns (SNPID,CHR,BP,EA,OA,EAF,N,Z,P) and that effect sizes must be indicated as z-scores instead of betas. We slightly modify the function from Baselman et al. and, using the standardised loadings extracted in the PCA, we create our weighted z-scores summarising SNP effects contributed by the four risky behaviors.
# read in gwas files
gwas_files<-list.files()
gwas_files<-gwas_files[1:4]
for (i in 1:length(gwas_files)) assign(gwas_files[i],fread(gwas_files[i],header=T,data.table=F))
ls(pattern=".txt")
# generate N column
AUTOMOBILE_SPEEDING_PROPENSITY_GWAS.txt$N<-404291
DRINKS_PER_WEEK_GWAS.txt$N<-414343
EVER_SMOKER_GWAS_MA_UKB_TAG.txt$N<-518633
NUMBER_SEXUAL_PARTNERS_GWAS.txt$N<-370711
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAMA_format")
# transform beta into z-scores
# z-score=beta/stand_error # https://www.biostars.org/p/140292/
# re-name columns and include only the 9 columns needed by the function
for(i in gwas_files){
file<-get(i)
file$Z<-file$beta/file$se
print(dim(file))
print(head(file))
names(file)<-c("SNPID","CHR","BP","EA","OA","EAF","beta","se","P","N","Z")
print(head(file))
file<-file[,c("SNPID","CHR","BP","EA","OA","EAF","N","Z","P")]
print(dim(file))
print(head(file))
fwrite(file,file=i, quote=FALSE,col.names=TRUE,row.names=F,sep=" ")
}
In the equation formulated by Baselman et al. (2019), we replace the weights in their equation which is the square root of N times the heritability, with standardised loadings times the square root of N:
Change w = sqrt(N*h2)to w = sqrt(N)*standardised_loadings
If we only replace heritabilities with standardised loadings in the arguments of the function, we will get errors for two reasons:
It won’t allow negative loadings (because we are replacing h2 with potentially negative loadings and negative heritability wouldn’t make sense)
It won’t allow to calculate our weights, as there is no sqrt() of a negative value: sqrt(-number)=NaN
I have made the following changes to address these problems:
source("https://github.com/baselmans/multivariate_GWAMA/blob/master/Test_Data/N_weighted_GWAMA.function.1_2_6.R?raw=TRUE")
my_GWAMA<-multivariate_GWAMA
trace(my_GWAMA,edit=T)
I deleted lines 217-221, which disables the sanity check to stop the function from running when loadings are not in the expected h2 format (h2<0 & h2>1).
Original code:
# line 663
W<-t(t(N) * h2)
# line 665
sqrt_W<-sqrt(W)
As we replace h2 with negative loadings, this code produces NaNs.
Modified code:
w = t(t(sqrt(N))*h2)
sqrt_W<-w
# Make an empty list with length equal to the number of input files
dat<-vector("list",4)
# Read in the data
n<-0
for(file in gwas_files){
n<-n+1
print(n)
dat[[n]]<-fread(gwas_files[n],data.table = F)
}
# load covariance matrix from LDSC
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")
# select the matrix of LDSC intercepts
# this matrix should be in the same order as data read in into dat (done it all automatic)
CTI<-as.matrix(LDSCoutput_risk_taking$I)
# load correlation matrix
cormatrix<-cov2cor(LDSCoutput_risk_taking$S)
# eigen decomposition
eigenvectors<-eigen(cormatrix)$vectors
eigenvalues<-eigen(cormatrix)$values
# calculate standardised loadings
loadings<-as.vector(eigenvectors%*%sqrt(diag(eigenvalues))[,1])
# run my modified GWAMA function
my_GWAMA(x=dat,
cov_Z=CTI,
h2=loadings,
out=".",
name="my_GWAMA_weights_changed",
output_gz=F,
check_columns=F)
Running the function can take a while. It will create one file containing the composite summary stats and a log file to keep track of operations.
For example, the log file shows the number of total and significant SNPs:
“14,701 SNPs reached genome-wide significance (Minimum P-value: 3.1405e-41)”
“Successfully created summary statistics for 11,424,118 SNPs”
The function does not test for significance. It only counts the number of SNP’s below a certain p-value threshold. The default threshold used here is p < 5e-08, but can be changed by the user.
As explained above, we chose the Linner et al. (2019) summary statistics as they performed a GWAS on the first PC, in addition to the four risky behaviors. The PC sumstats can be used as a reference for how well GWAMA represents a summary of SNP effects contributed by the four risky behaviors: speeding, drinking, smoking and sexual partners. If our composite GWAMA scores highly correlate with the PC sumstats, we can be confident that GWAMA is capturing the same effects without having used phenotypic data at all.
We will perform genetic correlations between GWAMA, the PC and the four risky behaviors summary stats. We will also plot beta and z statistics contained in GWAMA against those contained in the other summary statistics and calculate their correlations.
It is important to note that the directions indicated by the standardised loadings are arbitrary and it would not change the signal contained in the statistics if we flipped the signs of the loadings as long as this is done consistently across all loadings. In the case of this analysis, we obtained negatively associated indices, because we weighted the SNPs with negative standardised loadings. However, for future analyses we suggest to scale the composite scores in the most sensible direction to make results more intuitively interpretable.
# set work directory to where raw GWAS are stored
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_raw")
# store file name in gwas_file
gwas_file<-list.files(pattern="RISK_PC")
# load into R
Risk_PC<-fread(gwas_file,header=T,data.table = F)
# rename columns
names(Risk_PC)<-c("SNP","CHR","POS","A1","A2","eaf_A1","beta","se","p")
# and set wd to the directory I want to save the files with the new names
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA")
# write table
fwrite(Risk_PC,file="Risk_PC_GWAS.txt", quote=FALSE,col.names=TRUE,row.names=F,sep=" ")
# prepare GWAMA file #
# set work directory to where GWAMA file is stored
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAMA_format")
# store file name in gwas_file
gwas_file<-list.files(pattern="weighted_GWAMA.results.txt")
# load into R
GWAMA<-fread(gwas_file,header=T,data.table = F)
# rename columns
names(GWAMA)<-c("CPTID","SNPID","CHR","BP","EA","OA","EAF","MAF","N","N_obs","Direction","BETA","SE","Z","p")
GWAMA<-GWAMA[,c("SNPID","CHR","BP","EA","OA","EAF","MAF","N","BETA","SE","Z","p")]
# setwd to where you want to save GWAMA
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA")
# write table
fwrite(GWAMA,file="GWAMA_header.txt", quote=FALSE,col.names=TRUE,row.names=F,sep=" ")
# munge PC and GWAMA files ##
#load dependencies for genomic SEM
library(devtools)
library(GenomicSEM)
files<-c("Risk_PC_GWAS.txt","GWAMA_header.txt")
# use 1000G reference genome (HapMap 3)
# wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2
hm3<-"C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_header/w_hm3.noMHC.snplist.txt"
trait.names<-c("Risk_PC","GWAMA")
N<-c(315894,NA)
munge(files=files,
hm3=hm3,
trait.names = trait.names,
N=N)
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged")
traits<-c("speeding.sumstats.gz","drinking.sumstats.gz","smoking.sumstats.gz","sexual_partners.sumstats.gz","C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/Risk_PC.sumstats.gz","C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/GWAMA.sumstats.gz")
ld<- "eur_w_ld_chr/"
wld<- "eur_w_ld_chr/"
trait.names<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")
sample.prev<-c(NA,NA,.59,NA,NA,NA)
population.prev<-c(NA,NA,.5,NA,NA,NA)
LDSCoutput_evaluate_GWAMA<-ldsc(traits=traits,
sample.prev=sample.prev,
population.prev=population.prev,
ld=ld,wld=wld,
trait.names = trait.names)
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA")
save(LDSCoutput_evaluate_GWAMA, file="Evaluate_GWAMA.RData")
Note the strong genetic correlation between our GWAMA composite scores and the PC sumstats. It suggests that we capture the same effects with the weighted composite scores, as was captured with the PC sumstats.
LDSC intercepts represent sample overlap. We would expect traits compared with themselves to have a perfect sample overlap (1 or very close to 1). When comparing GWAMA with the four risky behaviors, we expect substantial sample overlap as all four have contributed to the sumstats contained in GWAMA. And as PC and GWAMA in theory capture the same signal, we expect an even stronger sample overlap between these two traits.
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/Evaluate_GWAMA.RData")
##standardise covariance matrix
intercepts<-data.matrix(round(LDSCoutput_evaluate_GWAMA$I,digits = 2))
colnames(intercepts)<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")
rownames(intercepts)<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")
In the following plots, the blue line indicates the central correlation line and the red line is the line of identity with an intercept of 1 and a slope of -1: x=-y. Note: The GWAMA results are negative as we have weighted them with negative standardised loadings and it would not change the signal contained in the GWAMA sumstats if we multiplied all standardised loadings by -1.
We also checked whether using standardised loadings from a covariance instead of a correlation matrix would obtain similar GWAMA scores. In order to do so, we ran the same code as displayed in “Create weighted composite scores” -> “Step 3: Run GWAMA with modified function”, but without standardising the covariance matrix obtained in LDSC.
## create composite scores from covariance matrix
source("https://github.com/baselmans/multivariate_GWAMA/blob/master/Test_Data/N_weighted_GWAMA.function.1_2_6.R?raw=TRUE")
library(data.table)
# modify multivariate_GWAMA function as done in "Step 2: Modified GWAMA function"
# Make an empty list with length equal to the number of input files
dat<-vector("list",4)
# Read in the data
setwd("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAMA_format")
gwas_files<-c("AUTOMOBILE_SPEEDING_PROPENSITY_GWAS.txt","DRINKS_PER_WEEK_GWAS.txt","EVER_SMOKER_GWAS_MA_UKB_TAG.txt","NUMBER_SEXUAL_PARTNERS_GWAS.txt")
n<-0
for(file in gwas_files){
n<-n+1
print(n)
dat[[n]]<-fread(gwas_files[n],data.table = F)
}
# load covariance matrix from LDSC
load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/GWAS_munged/risk_taking.RData")
# select the matrix of LDSC intercepts
CTI<-as.matrix(LDSCoutput_risk_taking$I)
# load covariance!! matrix
covmatrix<-LDSCoutput_risk_taking$S
# eigen decomposition
eigenvectors<-eigen(covmatrix)$vectors
eigenvalues<-eigen(covmatrix)$values
# calculate standardised loadings
loadings<-as.vector(eigenvectors%*%sqrt(diag(eigenvalues))[,1])
# run my modified GWAMA function
my_GWAMA(x=dat,
cov_Z=CTI,
h2=loadings,
out=".",
name="my_GWAMA_cov_matrix",
output_gz=F,
check_columns=F)
Resulting composite scores weighted with standardised loadings from the covariance matrix were similar to composite scores with loadings from the correlation matrix. To summarise, the genetic correlation between GWAMA and the PC sumstats slighly declined (rcor = -0.992 vs. rcov = -0.954). The correlation between GWAMA and the smoking sumstats slightly increased (rcor = -0.826 vs. rcov = -0.913), but decreased for the speeding sumstats (rcor = -0.466 vs. rcov = -0.336).
In the following plots, the blue line indicates the central correlation line and the red line is the line of identity with an intercept of 1 and a slope of -1: x=-y. Note: The GWAMA results are negative as we have weighted them with negative standardised loadings and it would not change the signal contained in the GWAMA sumstats if we multiplied all standardised loadings by -1.