Aim of this document:

The primary documentation of the analysis under investigation can be found at https://rpubs.com/annaFurtjes/genomicPCA , which we are evaluating in greater detail in this document. When plotting effect sizes from the genomic PCA approach against effect sizes contained in the risky behaviors GWASs (by Linner et al, 2019, https://doi.org/10.1038/s41588-018-0309-3), we noticed “mirrored” patterns in these plots. To understand whether this impacts the validity of our new genomic PCA method, we aimed to examine how the pattern emerges.

Reminder of the background and reasoning of genomic PCA

Available data: In the following analysis, we use freely available genome-wide association study (GWAS) summary statistics from Linner et al. (2019) on risky behaviors (https://doi.org/10.1038/s41588-018-0309-3). In this study, the authors performed GWASs on the propensity to speeding, alcohol consumption, smoking, and the number of self-reported sexual partners. In addition to the four traits, they extracted the first principal component (PC) of these four traits in a principal component analysis (PCA), and performed a GWAS on this first PC to create a summary of overall risky behaviors. The summary statistics are available on https://www.thessgac.org/data .

It is important to note that the GWAS summary statistics for ever smoking status were derived from UKB as well as TAG, whereas all the others (including the PC sumstats) were obtained from UKB only.

Aim of the analysis: Using Linner’s summary statistics of speeding, drinking, smoking and sexual partners, we will perform linkage disequilibirum score regression (LDSC) and obtain the genetic correlation matrix from the four risky behaviors in Linner et al. (2019). Using PCA, we will extract the first PC from this genetic correlation matrix. Resulting standardised loadings will be used to weight composite scores that summarise SNP effects (z-scores) contributed by the four risky behaviors. We will use a modified version of the R package provided by Baselman et al. (2019) (https://www.ncbi.nlm.nih.gov/pubmed/30643256) to create composite scores which also allows to correct for sample overlap.

Validation of the resuling composite scores: To validate the results, we will compare SNP effects from Linner’s phenotypic PC GWAS with effect sizes extracted in our analysis. A perfect correlation between our genetic scores that we created using summary statistics only (called GWAMA in this document) and the PC sumstats would indicate that we captured the same effects as Linner et al. obtained in their PCA with phenotypic data. Validating the genetic scores from genomic PCA serves as a sanity check and is crucial as to our knowledge this kind of approach is unprecedented.

Heatmap genetic correlations

The following heatmap shows the association between our genomic PCA scores, Linner’s PC GWASs and the four risky behaviors GWASs.

Note the strong genetic correlation between our GWAMA scores and the PC sumstats from Linner et al. It suggests that we capture the same effects with GWAMA, as was captured with the PC sumstats.

Heatmap LDSC intercepts

load("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Evaluate_GWAMA/Evaluate_GWAMA.RData")
##standardise covariance matrix
intercepts<-data.matrix(round(LDSCoutput_evaluate_GWAMA$I,digits = 2))
colnames(intercepts)<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")
rownames(intercepts)<-c("speeding","drinking","smoking","sexual_partners","PC","GWAMA")

Note that the intercepts along the diagnonal are 1 (or very close to 1). As we compare each trait with itself, we expect perfect sample overlap. Also, the heatmap shows that there is substantial sample overlap between the four risky behaviors and the PC as well as the GWAMA sumstats.

Genomic PCA evaluation

In order to evaluate genomic PCA, we plotted SNP z-scores and beta statistics contained in Linner’s summary statistics against the z-scores and beta statistics generated with genomic PCA. We noticed that the plots seemed “mirrored”, which was particularly obvious at the extreme ends of the plots. In the following, we aim to demonstrate that the mirrored patterns in the plots come from the fact that SNPs in high LD have very similar effect sizes, but the effect allele is arbitrary. As a result, SNPs in high LD can have positive or negative effect sizes.

GWAMA vs. PC sumstats

z statistics

beta

GWAMA vs. speeding sumstats

z statistics

beta

GWAMA vs. drinking

z statistics

beta

GWAMA vs. smoking

z statistics

beta

GWAMA vs. sexual partners

z statistics

beta

How can we explain the “mirrored” patterns?

In order to assess whether the mirrored patterns appear with arbitrary effect alleles, we have selected individual SNPs that exhibit similar patterns at extreme ends of the distribution in the sexual partners vs. GWAMA plot. The red circles below indicate the selected SNPs.

This is the code we selected the SNPs with:

top<-all_sumstats[which(all_sumstats$sexual_z<=(-5)&all_sumstats$GWAMA_z>=12.5),]
top$pos<-"top"

bottom<-all_sumstats[which(all_sumstats$sexual_z>=5&all_sumstats$GWAMA_z<=(-12.5)),]
bottom$pos<-"bottom"

save<-rbind(top,bottom)

This selection resulted in 191 SNPs in the bottom circle and 207 SNPs in the top circle.

To test for high LD and arbitrary effect alleles, we asked two questions:

  1. Are the SNPs from the top circle in high LD with the SNPs in the bottom?

  2. Is the effect allele in a top SNP associated with the other allele in a bottom SNP, and vice versa? If so, it would explain why one SNP ended up with a positive effect size and the other one with a negative effect size.

High LD between SNPs from extreme ends of the distribution

This website (https://ldlink.nci.nih.gov/?tab=home) allows to “Investigate correlated alleles for a pair of variants in high LD” by inserting two rs-ids at a time. As this must be done manually, we have randomly selected 5 SNPs from the top and bottom circle respectively (see image above) and noted their correlation (R2) indicating linkage disequilibrium between these SNPs.

Linkage disequilibrium between SNPs

These are the 10 selected SNPs (5 top, 5 bottom circle):

No. SNP Position
1 rs10433525 bottom
2 rs10511076 bottom
3 rs10511081 bottom
4 rs10511083 bottom
5 rs10511087 bottom
6 rs10433499 top
7 rs10433500 top
8 rs10511075 top
9 rs10511078 top
10 rs10511085 top

The following table indicates the correlation (R2) between these SNPs on the genome, which were obtained from the website mentioned above.

## Warning: package 'knitr' was built under R version 3.6.1
LD/ correlation (R2) between 10 SNPs
SNP_No 1 2 3 4 5 6 7 8 9 10
1 1.00 0.95 0.95 0.95 0.95 1.00 0.95 0.95 1.00 1.00
2 0.95 1.00 1.00 1.00 1.00 0.95 0.89 1.00 0.95 1.00
3 0.95 1.00 1.00 1.00 1.00 0.95 0.88 1.00 0.95 1.00
4 0.95 1.00 1.00 1.00 1.00 0.95 0.89 1.00 0.95 1.00
5 0.95 1.00 1.00 1.00 1.00 0.95 0.89 1.00 0.95 1.00
6 1.00 0.95 0.95 0.95 0.95 1.00 0.83 0.95 1.00 0.95
7 0.95 0.89 0.88 0.89 0.98 0.83 1.00 0.88 0.83 0.88
8 0.95 1.00 1.00 1.00 1.00 0.95 0.88 1.00 0.95 1.00
9 1.00 0.95 0.95 0.95 0.95 1.00 0.83 0.95 1.00 0.95
10 1.00 1.00 1.00 1.00 1.00 0.95 0.88 1.00 0.95 1.00

It shows that the selected SNPs are all very closely correlated on the genome, regardless of whether they are positioned in the top or bottom end of the distribution.

Is the effect allele from a “top SNP” associated with the other allele from a “bottom SNP”, and vice versa?

For further investigation, we have randomly selected a few SNPs to investigate whether effect alleles are arbitrary in our results. The information in the following table is taken from the GWAMA and the sexual partners sumstats. The column Position indicates if this SNP can be found in the top or the bottom red circle of the plot.

No. SNP Position EA OA GWAMA_z sexual_z
1 rs10433525 bottom T C -13.27 7.36
2 rs10511076 bottom A G -13.32 7.29
3 rs10511085 top A C 13.29 -7.25
4 rs10511078 top A G 13.22 -7.29

Based on this information, we would expect that the effect allele in SNP 1 (T) is associated with the effect allele in SNP 2 (A), but the effect allele in SNP 1 (T) is associated with the other allele in SNP 3 (C).

When entering the SNP IDs in the LDpair function on the website, our expectations are confirmed. For example:

  1. rs10433525(T) allele is correlated with rs10511076(A) allele
  2. rs10433525(T) allele is correlated with rs10511085(C) allele
  3. rs10433525(T) allele is correlated with rs10511078(G) allele

Signed correlations between the SNPs

The R package TwoSampleMR allows to extract correlations between SNPs for more than only two SNPs at once. If our hypothesis on the mirrored patterns in the plot were true, we would expect two triangular regions of highly positive heat and a middle region of highly negative heat when plotting the SNP correlations.

library(data.table)
library(tidyr)
alleles<-fread("C:/Users/anna/OneDrive - King's College London/PhD/genomic_pca_simulation_started_Jan2020/Rmarkdown/mirroredSNPswithallele0303.txt",header=T,data.table=F)
snps<-c("rs10433525","rs10511076","rs10511081","rs10511083","rs10511087","rs10433499","rs10433500","rs10511075","rs10511078","rs10511085")
#snps<-c("rs10433499","rs10433500","rs10511075","rs10511076","rs10511081","rs10511083")
snps<-alleles[alleles$SNP %in% snps,]

# keep snpid, ea and oa columns
#snps<-snps[,c("SNP","EA","OA")]

#use MR package to calculate correlations
library(TwoSampleMR)
corr<-ld_matrix(snps[,1],with_alleles = T)
names_corr<-unlist(dimnames(corr)[1])

#attach effect and other allele to name as done in ld_matrix
GWAMA<-paste0(snps$SNP,"_",snps$EA)
GWAMA<-paste0(GWAMA,"_",snps$OA)

# melt corr
melted_cor<-reshape2::melt(corr)

# create columns showing if effect and other allele agreed with reference file used in ld_matrix function
melted_cor$comvar1<-0
for(i in GWAMA){
  melted_cor$comvar1<-with(melted_cor, ifelse(Var1==i,comvar1+1,comvar1))
}

melted_cor$comvar2<-0
for(i in GWAMA){
  melted_cor$comvar2<-with(melted_cor, ifelse(Var2==i,comvar2+1,comvar2))
}


# give cells the correct sign with regard to effect and other allele in GWAMA 
melted_cor$valuesigned<-with(melted_cor, ifelse(comvar1==1&comvar2==0|comvar1==0&comvar2==1,value*(-1),value))

melted_cor<-melted_cor[,c("Var1","Var2","valuesigned")]

melted_cor<-separate(melted_cor,Var1,into="Var1",sep="_")
melted_cor<-separate(melted_cor,Var2,into="Var2",sep="_")
melted_cor$valuesigned<-round(melted_cor$valuesigned,digits=2)

signed_matrix<-reshape2::dcast(melted_cor,Var1~Var2)
rownames(signed_matrix)<-signed_matrix$Var1
list<-c("rs10433499","rs10433500","rs10511085","rs10511075","rs10511078","rs10433525","rs10511076","rs10511081","rs10511083","rs10511087")
signed_matrix<-signed_matrix[list,list]

Reminder of the position of the SNPs in the matrix

SNP Position
rs10433499 top
rs10433500 top
rs10511085 top
rs10511075 top
rs10511078 top
rs10433525 bottom
rs10511076 bottom
rs10511081 bottom
rs10511083 bottom
rs10511087 bottom

Based on this evidence, we infer that the mirrored patterns in the GWAMA plots can be explained by arbitrary effect alleles which is a property of LD patterns on the genome and does not invalidate the genomic PCA method. The signs in the heatmap are as expected from their position in the plot (top vs. bottom). SNPs that have associated effect alleles in GWAMA show positive associations in the heatmap and SNPs that have their effect allele associated with the other allele in the other SNPs are negatively correlated.