Author: Charleen D. Adams
GTEx coded their eQTL analysis such that the effect allele is the alternate allele. But the alternate allele is not always the minor allele. The effect sizes (“betas”) for the eQTLs refer to the slope of the linear regressions “computed as the effect of the alternative allele (ALT) relative to the reference allele (REF) in the human genome reference (i.e., the eQTL effect allele is the ALT allele)” – which is not always the minor allele.
In order to know which of the alleles is the true minor allele, I needed to obtain annotation information for the SNPs.
SNPnexussqldf to merge EAF back into the master RP dataset# Total number eQTLs across the RP genes
dim(stringent_RP_sig_beta)
## [1] 2418 45
# Number of unique eQTLs
snp=unique(stringent_RP_sig_beta$SNP)
length(snp)
## [1] 996
SNPnexusSNPnexus is an online annotation tool for SNPs (with an option for GRCh38/hg38).
I entered our list of 996 unique eQTLs (threshold qval<0.05 & P<5x10-6) into SNPnexus and obtained annotation data for which allele is the reference, alternate, and minor, along with the minor allele frequencey (MAF).
(Side note: Separate SNPnexus annotation files also provide genomic consequence and deleteriosness for these SNPs. I can make plots with this data later…)
# Read in the dummy copy of the SNPnexus results called my_data2
my_data2 <- read.table("C:/Users/charl/Dropbox/Harvard/ribosomal_proteins/Main_Tables_Paper/v8/gen_coords_v8_unique.txt",
header=TRUE, sep='\t',stringsAsFactors = FALSE)
# How many of the unique eQTLs can we annotate?
dim(my_data2)
## [1] 985 11
985/996 #99%
## [1] 0.9889558
# Create a dummy variable called "match" for whether the minor allele is the reference allele (1=yes)
my_data2<-within(my_data2, match <- ifelse (Minor.Allele == REF.Allele,1,0))
table(my_data2$match)
##
## 0 1
## 740 245
# Create the EAF variable by reasoning that if the minor allele is the references allele, then
# 1) the EAF is 1-MAF for our data
# 2) otherwise, the EAF is the MAF
my_data2$Minor.Allele.Global.Frequency=as.numeric(my_data2$Minor.Allele.Global.Frequency)
my_data2$EAF <- my_data2$Minor.Allele.Global.Frequency
my_data2$EAF <- ifelse(my_data2$match=="1", 1-my_data2$Minor.Allele.Global.Frequency,
my_data2$Minor.Allele.Global.Frequency)
my_data2$dbSNP=as.character(my_data2$dbSNP)
# sqldf to merge (ignore warning)
merged=sqldf("select *
from stringent_RP_sig_beta left join my_data2
on stringent_RP_sig_beta.SNP = my_data2.dbSNP",
row.names = TRUE)
merged_EAF <- merged[order(merged$SNP),]
hist(merged$EAF, col = c("green"),
main="Histogram of EAF values for ribosomal protein eQTLs")
summary(merged$EAF)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0024 0.1216 0.2919 0.3279 0.5132 0.9740 139
In the table, “effect_allele” refers to the alternate allele in GTEx. The “other_allele” is the reference allele. The variable “match” signifies “1” when the reference allele is the minor allele and “0” when the alternate allele is the minor allele.
library(DT)
my_table_cols=c('tissue','n','position','chr','gene_name','SNP', 'EAF','match','effect_allele', 'other_allele','beta','pval', 'qval','se' )
interactive_merged=merged[my_table_cols]
datatable(interactive_merged, options = list(pageLength = 5))
Most of the SNPs not annotated in SNPnexus appear to be small indels, except for one for prostate (row id=‘987634’) that is a long interspersed nuclear element (LINE), and oddly, is coded by GTEx as the reference. (It will get dropped in MR analyses for prostate.)
I noticed one of the SNPs in the GTEx files was labeled with two rsids: rs4909,rs12012747. This meant it was not originally found in the SNPnexus lookup. I went in an verified that the two rsids refer to the same SNP and chose rs12012747 to label it in our master file.
merged_EAF=merged
#### One SNP has two rsids: change to the rsid that has MAF info: rs12012747
merged_EAF$SNP=as.character(merged_EAF$SNP)
merged_EAF$SNP=ifelse(merged_EAF$SNP=='rs4909,rs12012747', 'rs12012747', merged_EAF$SNP)
merged_EAF$match=ifelse(merged_EAF$SNP=='rs12012747', 0, merged_EAF$match)
merged_EAF$EAF=ifelse(merged_EAF$SNP=='rs12012747', 0.331126, merged_EAF$EAF)
### Rename the 'eaf' variable in the master file to reflect that it is the MAF in GTEx (not the actual EAF; the actual EAF is labled 'EAF')
colnames(merged_EAF)[31] <- "maf_in_GTEx"
write.csv(merged_EAF, 'merged_EAF.csv')
‘The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from: the GTEx Portal on 04/01/2020’
I’ve started a Github repository to organize the files and scripts for this project. If Github refuses to open through the link, copy this link into your browser: https://github.com/charleendadams/ribosomal_protein_eQTLs