Legacy notes: contains learning for LDSC and Odyssey

Overview

Annotated script to obtain genetic correlations with LD Score regression (ldsc)
 Includes:                                                                                                                                                                    
 1. Set up on Odyssey                                                                                                                                                           
 2. Data set up and running of ldsc, includes preliminary genetic correltion results for: longevity & breast cancer; 
 breast cancer & lung cancer; longevity on lung cancer;   
 anti-social personality disorder (ASP) & longevity; & a mention that blood plasma levels of some RP/MRPs were attempted with ldsc & breast cancer & longevity, 
 but failed (possibly due to small sample size &/or that levels of the RPs/MRPs are not that polygenic)                                                                              
 3. List of GWAS/eQTL data sources                                                                                                                                            
 4. Bibliography for ldsc, SMR, and breast cancer mechanisms (Panis)                                                                                                          
 5. Set up for SMR (linux executable file, changing path name, editing .bashrc)                                                                                               
 6. Command-line basics/tips                                                                                                                                                  
 7. Data management for obtaining rsids from chr & position; reading-in & cleaning a vcf file in R                                                                               
 8. MR analysis of eqtlGen (blood) on breast cancer (replication of MR eQTL in blood on breast cancer)                                                                        
 9. Gene-level meta-analyses with eqtlGen on breast cancer in blood                                                                                                           
 10. MR analysis of eqtlGen (blood) on longevity                                                                                                                              
 11. Gene-level meta-analyses with eqtlGen on longevity in blood

Odyssey set-up

Set-up for working on the Harvard cluster ("Odyssey")

Steps:                                                                                                                                    
 1. Permission from Odyssey to use it: provides log-in and password:                                                                       
 https://docs.rc.fas.harvard.edu/kb/how-do-i-get-a-research-computing-account/                                                             
 2. Way to transfer files from personal computer to Odyssey                                                                                  
 3. Use of Interactive Rstudio (nice interface for Rstudio on Odyssey; gwas files are too big for personal computer)                       
 4. Must download the VNP Cisco client to connect to Interactive Rstudio: https://docs.rc.fas.harvard.edu/kb/vpn-setup/                    
 5. Shell interface: terminal window for Mac or PuTTy (https://www.putty.org) or WinSCP for Windows (https://winscp.net/eng/index.php)     
 6. Questions/problems with logging in, email FAS IT help: rchelp@fas.harvard.edu                                                          

## Documentation to remember

+------------------- Helpful Documentation: --------------------+
| https://docs.rc.fas.harvard.edu/kb/quickstart-guide/          |
| https://docs.rc.fas.harvard.edu/kb/running-jobs/              |
| https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/ |
+---------------------------------------------------------------+

## FilzeZilla

+--- FileZilla launch to transfer files from home computer to Odyssey: ---+
| pw: P _ _ _ d_r1_3# & Google authenticator pw...                        |
| port 22: login.rc.fas.harvard.edu;                                      |
| user: cdadams; Interactive; SFTP - SSH File Transfer Protocol           |
+-------------------------------------------------------------------------+

## Interactive Rstudio

+------------------ Interactive Rstudio: -------------------------------------------+
| Must have downloaded the appropriate Cisco Client (use Chrome or Firefox) for VPN |
| Cisco vpn.rc.fas.harvard.edu                                                      |         
| username: cdadams@fasrc                                                           |  
| pw: P _ _ _ d_r1_3                                                                |
| Google authenticator pw...                                                        |
|                                                                                   | 
| Website for Interactive R: https://vdi.rc.fas.harvard.edu                         |  
| Second log-in needed for going to the interactive applications:                   |
| username: cdadams                                                                 |
| pw: P _ _ _ d_r1_3#                                                               | 
|                                                                                   |
| Select Rstudio; Partition: shared; Memory Allocation in GB: 96; Number of core: 1 |
| Number of GPUs: 0; Allocated Time: 48:00:00                                       | 
| R version to be loaded with Rstudio: R/3.6.3-fasrc01 Core                         |
| Location of your R_LIBS_USER : $HOME/apps/R_3.6.3:$R_LIBS_USER                    |
| script to be executed before starting Rstudio:                                    |
| export R_LIBS_USER=$HOME/apps/R_3.6.3:$R_LIBS_USER                                |
| (Instructions for setting up the R_LIBS_USER:                                     |
| https://docs.rc.fas.harvard.edu/kb/r-packages/)                                   |
|                                                                                   |
| Can use the Rstudio for Bioconductor & Tidyverse (newer version of R)             |
| When installing through Interactive RStudio/Bioconductor & Tidyverse 4.0.3,       |
| gmp is already configured                                                         | 
| as is the directory for installing R packages.                                    |
| All packages installed via Interactive RStudio, however, must be                  |
| installed separately with R/4.0.2-fasrc01.                                        |
| For some of these gmp needs to be loaded prior: module load gmp/6.1.2-fasrc01     |
+-----------------------------------------------------------------------------------+

Running a job on Odyssey

# Running a job (not using this here, but will in the future):
# Example .sh script                                                                                  
                                                                                                        
#!/bin/bash                                                                                         
#SBATCH -n 1                # Number of cores (-n)                                                  
#SBATCH -N 1                # Ensure that all cores are on one Node (-N)                            
#SBATCH -t 0-00:10          # Runtime in D-HH:MM, minimum of 10 minutes                             
#SBATCH -p serial_requeue   # Partition to submit to                                                
#SBATCH --mem=100           # Memory pool for all cores (see also --mem-per-cpu)                    
#SBATCH -o myoutput_%j.out  # File to which STDOUT will be written, %j inserts jobid                
#SBATCH -e myerrors_%j.err  # File to which STDERR will be written, %j inserts jobid                
module load perl/5.26.1-fasrc01 #Load Perl module                                                    
perl -e 'print "Hi there.\n"'

MISC Odyssey

slurm commands: https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/                       
                                                                                                      
# Mac terminal launch to Odyssey                                                                      
ssh cdadams@login.rc.fas.harvard.edu                                                                  
# pw: P _ _ _ d_r1_3# &; Google authenticator pw...                                                   
                                                                                                      
# Run an interactive terminal session                                                                 
# website: https://docs.rc.fas.harvard.edu/kb/running-jobs/#Interactive_jobs_and_salloc               
salloc -p test --mem 5000 -t 0-06:00

LDSC

# Explainer and source: https://github.com/bulik/ldsc/wiki/Heritability-and-Genetic-Correlation

# create a directory in home directory for the ldsc environment (I called this 'ld'): /n/home04/cdadams/ld/
mkdir ld

# load the version of python that works with anaconda (required to run ldsc)
# I googled to learn what version of python worked with anaconda on Odyssey

#Generally, there are two ways to find out what modules are on Odyssey: 

# 1. Go to this website and search: https://portal.rc.fas.harvard.edu/p3/build-reports/
# 2. In Odyssey, run: "module avail" (without the quotes)
# That will bring up a massive list. It's easier to search on the website...
 
module load python/2.7.14-fasrc01

# clone ldsc 
git clone https://github.com/bulik/ldsc.git

# go to ldsc folder within /n/home04/cdadams/ld/
cd ldsc

# set up the environment
conda env create --file environment.yml
source activate ldsc

# run ldsc help commands to see if the set-up worked
# The period before the forward slash is necessary   

./ldsc.py -h
./munge_sumstats.py -h

# read-in European ld scores and 1000genomes SNPlist
#ld scores can also be generated by ldsc; however, there is no need to do this since they have already been generated 
#for Europeans; also, the 1000genomes SNPlist is curated already; these two files are standardly used 

wget https://data.broadinstitute.org/alkesgroup/LDSCORE/eur_w_ld_chr.tar.bz2
wget https://data.broadinstitute.org/alkesgroup/LDSCORE/w_hm3.snplist.bz2

# untar the ld score and unzip the snplist
tar -jxvf eur_w_ld_chr.tar.bz2
bunzip2 w_hm3.snplist.bz2

# command to unzip the longevity GWAS if needed (but can remain zipped)
#gunzip lifegen_phase2_bothpl_alldr_2017_09_18.tsv.gz

# format the files for ldsc (generates file ending in .sumstats.gz)
# below are formatting for longevity, rp/mrp protein gwas, and anti-social personality (ASP) gwas, breast cancer (BCAC), and lung cancer
# in order to know how to set the files up, open them in Interactive Rstudio and look at how they are coded and/or clean them
# (cleaning is essential for GWAS that come in vcf vs txt format) 

# All the steps above are preliminary set-up. I re-emphasize that Interactive Rstudio is needed to look at the GWAS files and format them. More about that below.

# For my convenience, I added the "module load" command again here, since this isn't a bash script. It's a text file that I'm using to run bits 
# and pieces of code for various purposes. A "bash script" is a script that can be written that includes command-line code and R. It can then be 
# submitted to Odyssey to run the script. This only works if you know everything in the script is accurate. 

module load python/2.7.14-fasrc01

## File formating for LDSC 

# These are python/ldsc command-line mini-scripts (commands) to format GWAS data into the way that ldsc needs. I got the commands from the ldsc wiki. 
# ("munge" just appears to mean "format"). No time to explain each of the commands, designated with the "--". Read the ldsc wiki (Google it; it's on Github). 

# Format GWAS for ldsc: breast cancer (BCAC)
./munge_sumstats.py \
--sumstats bcac_master3.txt \
--snp id \
--ignore var_name \
--a1 Effect.Gwas \
--a2 Baseline.Gwas \
--N 247173 \
--out bcac_final \
--chunksize 500000 \
--p P.value.Gwas \
--signed-sumstats Beta,0 \
--merge-alleles w_hm3.snplist

# Format GWAS for ldsc: longevity
./munge_sumstats.py \
--sumstats lifegen_phase2_bothpl_alldr_2017_09_18.tsv.gz \
--snp rsid \
--ignore snpid \
--a1 a1 \
--a2 a0 \
--N 1012240  \
--out long_3 \
--chunksize 500000 \
--p p \
--signed-sumstats beta1,0 \
--merge-alleles w_hm3.snplist 

# Format GWAS for ldsc: lung cancer
./munge_sumstats.py \
--sumstats lc.gwas2.txt \
--snp SNP \
--a1 Allele1 \
--a2 Allele2 \
--out lc \
--chunksize 500000 \
--p pval \
--N 27209 \
--signed-sumstats Beta,0 \
--merge-alleles w_hm3.snplist

# Format nuc eQTL for ldsc: 
./munge_sumstats.py \
--sumstats nucs_eqtlGen.txt \
--snp SNP \
--ignore SNPChr \
--a1 AssessedAllele \
--a2 OtherAllele \
--N 30596  \
--out nucs \
--chunksize 500000 \
--p Pvalue \
--signed-sumstats Zscore,0 \
--merge-alleles w_hm3.snplist 

# Format GWAS for ldsc: anti-social personality disorder (ASP)
./munge_sumstats.py \
--sumstats BroadABC2.txt \
--snp RS \
--ignore SNP \
--a1 A1 \
--a2 A2 \
--out ASP \
--chunksize 500000 \
--p P-value \
--signed-sumstats BETA,0 \
--merge-alleles w_hm3.snplist

# Initially, I also loaded in and formated a number of GWAS for RP/MRP plasma protein levels. These only contained 3303 people each.
# I ran ldsc on these with breast cancer and longevity, but ldsc couldn't obtain genetic correlations. I don't know if this is because
# RP/MRP protein levels aren't that polygenic or if the GWAS were too small. The authors of ldsc note that it doesn't perform well on 
# sample sizes <3000....

## LDSC 
# LD Score Regression (if the mean chi^2 value provided in the log is <1.02, data not suitable for ldsc)
# None of the rp/mrp gwas were suitable for ldsc

# LD Score Regression to obtain the genetic correlation between two traits
# This is similar to polygenic risk scores; provides an estimate of how traits are related; gives direction of effect;
# Free from environmental sources of confounding

# LD Score Regression: longevity and breast cancer 
./ldsc.py \
--rg long_3.sumstats.gz,bcac_final.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out long_3_bcac_final
less -s long_3_bcac_final.log

# Genetic Correlation
#-------------------
# Genetic Correlation: -0.1136 (0.0505)
# Z-score: -2.2499
# P: 0.0245

# LD Score Regression: breast cancer and lung cancer 
./ldsc.py \
--rg bcac_final.sumstats.gz,lc.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out bcac_final_lc
less bcac_final_lc.log
less -s bcac_final_lc.log

# Genetic Correlation: breast cancer and lung cancer 
#-------------------
# Genetic Correlation: 0.2911 (0.1052)
# Z-score: 2.767
# P: 0.0057

# LD Score Regression: longevity and lung cancer
./ldsc.py \
--rg long_3.sumstats.gz,lc.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out long_lc
less -s long_lc.log

# Genetic Correlation: longevity and lung cancer
#-------------------
# Genetic Correlation: -0.4428 (0.0738)
# Z-score: -5.997
# P: 2.0102e-09

# LD Score Regression: nucs and breast cancer (nada)
./ldsc.py \
--rg nucs.sumstats.gz,bcac_final.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out nucs_bcac
less -s nucs_bcac.log

# Genetic Correlation: nucs and breast cancer (nada)
#-------------------
# Genetic Correlation: 0.0456 (0.0669)
# Z-score: 0.6815
# P: 0.4956

# LD Score Regression: nucs and longevity (nada)
./ldsc.py \
--rg nucs.sumstats.gz,long_3.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out nucs_long
less -s nucs_long.log

# Genetic Correlation: nucs and longevity (nada)
#-------------------
# Genetic Correlation: -0.0241 (0.0386)
# Z-score: -0.6252
# P: 0.5318

# LD Score Regression: breast cancer and ASP (nada)
./ldsc.py \
--rg ASP.sumstats.gz,bcac_final.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out ASP_bcac
less ASP_bcac.log

# Genetic Correlation: breast cancer and ASP (nada)
#-------------------
# Genetic Correlation: 0.0799 (0.0956)
# Z-score: 0.8354
# P: 0.4035

# LD Score Regression: longevity and ASP (exciting)
./ldsc.py \
--rg ASP.sumstats.gz,long_3.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out ASP
less ASP_long.log

# Genetic Correlation:longevity and ASP 
#-------------------
# Genetic Correlation: -0.3345 (0.0623)
# Z-score: -5.3669
# P: 8.0083e-08

# I have included ASP because I had the file and its a large GWAS. 
# When I initially ran ldsc, I had included the flag -a1-inc, thinking it told ldsc to code the a1 allele 
# as the risk allele. However, -a1-inc is intended only if the effect estimates are signless; sometimes
# GWAS results are provided such that the a1 allele always INCREASES the trait or risk for disease. This is
# different than how GTEx, for example, is coded, where the risk allele can increase or decrease expression. 
# As such, I re-ran the analyses to determine how much of a difference it made on the findings. The overall 
# results held for the genetic correlation between longevity and breast cancer, breast cancer and lung cancer, 
# and lung cancer on longevity. The p-values attenuated, but remained significant (P<0.05). 

## Heritiability (h2)
# h2 estimates can be obtained for individual traits, but h2 is biased low by GC correction
# performed by the GWAS authors. In contrast, genetic correlation from ldsc between two traits isn't affected, even
# when the h2 is biased.

# Heritability
./ldsc.py \
--h2 long_3.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out long_h2

# Heritability
./ldsc.py \
--h2 bcac_fianl.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out bc_h2

# Heritability
./ldsc.py \
--h2 mrpl33.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out mrpl33_h2

# Heritability
./ldsc.py \
--h2 RNMTL1.sumstats.gz \
--ref-ld-chr eur_w_ld_chr/ \
--w-ld-chr eur_w_ld_chr/ \
--out RNMTL1_h2

## Partitioned h2 with ldsc (might be useful later)

# I read-in some of the example files, unzipped them, and tried to get partitioned h2 for longevity; 
# didn't work; not sure why: maybe the 'baseline.' (. after file?)

wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase1_baseline_ldscores.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase1_frq.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/weights_hm3_no_hla.tgz
wget https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase1_cell_type_groups.tgz

tar zxvf 1000G_Phase1_baseline_ldscores.tgz
tar zxvf 1000G_Phase1_frq.tgz
tar zxvf weights_hm3_no_hla.tgz
tar zxvf 1000G_Phase1_cell_type_groups.tgz

cd 1000G_frq/ 
gunzip *.gz
cp -a /n/home04/cdadams/ldsc/ldsc/1000G_frq/. /n/home04/cdadams/ldsc/ldsc/

cd weights_hm3_no_hla/ 
gunzip *.gz
cp -a /n/home04/cdadams/ldsc/ldsc/weights_hm3_no_hla/. /n/home04/cdadams/ldsc/ldsc/

cd baseline/ 
gunzip *.gz
cp -a /n/home04/cdadams/ldsc/ldsc/baseline/. /n/home04/cdadams/ldsc/ldsc/

./ldsc.py \
    --h2 long.sumstats.gz \
    --ref-ld-chr baseline.\ 
    --w-ld-chr weights.\
    --overlap-annot\
    --frqfile-chr 1000G.mac5eur.\
    --out long_baseline

GWAS/eQTL data, Part 1

Website for getting any journal article from Harvard library (requires Harvard Key, two-step verification)

http://p.lib.harvard.edu/discovery/journals.html

Breast cancer (BCAC 2020):

From: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7808397/pdf/nihms-1572778.pdf
133384 cases + 113789 controls
On personal computer, I compressed the file:
If I wanted to keep the original, I’d use the -k command:
gzip -k icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt
gzip icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt

Formatting vcf files in R

see example: “/n/home04/cdadams/ld/ldsc/lc/lc.r”
contains: reading in a vcd file, removing the extra content, matching the colnames to the data, and formatting the data for ldsc

Blood eQTLGen in 30,000 people

https://www.eqtlgen.org/cis-eqtls.html
Amazing resource: data are in binary format, though, and set up for SMR analysis. (I have a fas help ticket in
to help get SMR loaded on the cluster. There are Unix executable files and source code….

GTEx full summary data

https://console.cloud.google.com/storage/browser/gtex-resources/
These are massive files and are available in pay-per-order buckets. I’m not sure what kind of files they are or how we would store them.
But if we could get them, we could do tissue-specific SMR, once I learn it.

Longevity GWAS summary statistics

https://datashare.ed.ac.uk/handle/10283/3209

GWAS/eQTL data, Part 2

GWAS of plasma proteins in 3301 people. Contains 15-20 levels of ribosomal protein GWAS. Possibly useful for two-sample MR (not big enough for LDSC). Data available through a searchable Dropbox link: https://app.box.com/s/u3flbp13zjydegrxjb2uepagp1vb6bj2 Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
GWAS of longevity (includes LDSC analyses). Data available here: https://datashare.ed.ac.uk/handle/10283/3209 Timmers, P. R. et al. Genomics of 1 million parent lifespans implicates novel pathways and common diseases and distinguishes survival chances. eLife 15, (2019).
GWAS of breast cancer in 133384 cases + 113789 controls (http://bcac.ccge.medschl.cam.ac.uk) Zhang, H. et al. Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses. Nat Genet 52, 572–581 (2020).
GWAS of lung cancer in 11,348 cases and 15,861 controls. Data available through (https://gwas.mrcieu.ac.uk/datasets/). Wang, Y. et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet 46, 736–741 (2014).
UK Biobank: Population-based cohort in the UK on 500,000 people, includes many GWAS (including some for asthma, age, self-reported cancer, the metabolome, etc.). The MRC-IEU have gathered the GWAS for these and performed their own GWAS analyses of the primary data. The fully summary statistics are downloadable as vcf files (https://gwas.mrcieu.ac.uk/datasets/). Collins, R. What makes UK Biobank special? Lancet 31, 1173–1174 (2012).
Gene-Tissue Expression (GTEx) Project: Tissue-specific expression, genotypes, eQTL analyses in about 50 post-mortem tissues. Provides both easily downloadable significant findings (“eGenes”) and full summary statistics, though these are pay-per-order and large (https://console.cloud.google.com/storage/browser/gtex-resources/). SMR would be used to analyze the data. Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, (2020).
eQTLGen Consortium’s generation of full eQTL summary statistics in blood on 31,684 individuals. Includes cis, trans, and seQTLs in binary format for use with SMR (https://www.eqtlgen.org/cis-eqtls.html). Although set up for SMR, the data are large enough to perform LDSC, if gene sets (e.g., nucleolar biology genes) can be extracted out of binary format and made into a text file in the format used by LDSC. This would be a highly efficient and powerful way to determine whether nucleolar biology gene expression in blood is etiologically related to cancer and longevity. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv (2018) doi:10.1101/447367.

Methods papers

Methods papers for integrating GWAS/eQTL summary statistics: Linkage Disequilibrium Score (LDSC) Regression and Summary-Data-Based Mendelian Randomization (SMR).

Main LDSC paper: LDSC identifies genetic correlation between complex traits, which is useful for etiological insights and prioritizing causal relationships. The technique is similar to MR in that the estimates are free from most sources of environmental confounding and similar to polygenic-risk scores (only easier to do and more efficient). Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat Genet 47, 1236–1241 (2015).
Using LDSC: The authors demonstrate use of LDSC to explore the relationships between lung function (FEV) and lung cancer. Paper includes some eQTL analyses. Kachuri, L. et al. Immune-mediated genetic pathways resulting in pulmonary function impairment increase lung cancer susceptibility. Nat Commun 11, (2020).
Main SMR paper: Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 48, 481–487 (2016).
Using SMR: eQTLGen Consortium’s generation of full eQTL summary statistics in blood on 31,684 individuals. Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv (2018) doi:10.1101/447367.

Combined bibliography

Papers sent to Lemos Lab by Carolina Panis (breast cancer mechanisms) and me in March:

LaKind, J. S., Wilkins, A. A. & Bates, M. N. Human breast biomonitoring and environmental chemicals: use of breast tissues and fluids in breast cancer etiologic research. J Expo Sci Environ Epidemiol vol. 17 525–540 (2007).
Benz, C. C. Impact of aging on the biology of breast cancer. Crit Rev Oncol Hematol vol. 66 65–74 (2008).
Mishra, S. K., Sengupta, D., Sar, P. & Bhargava, D. K. Molecular basis of aging and breast cancer. J Cancer Sci Ther vol. 5 069–074 (2013).
van der Plaat, D. A. et al. Occupational exposure to pesticides is associated with differential DNA methylation. Occup Environ Med 75, 427–435 (2018).
Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet 48, 481–487 (2016).
Võsa, U. et al. Unraveling the polygenic architecture of complex traits using blood eQTL meta-analysis. bioRxiv (2018) doi:10.1101/447367.
Kachuri, L. et al. Immune-mediated genetic pathways resulting in pulmonary function impairment increase lung cancer susceptibility. Nat Commun 11, (2020).
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat Genet 47, 1236–1241 (2015).
Sun, B. B. et al. Genomic atlas of the human plasma proteome. Nature 558, 73–79 (2018).
Timmers, P. R. et al. Genomics of 1 million parent lifespans implicates novel pathways and common diseases and distinguishes survival chances. eLife 15, (2019).
Zhang, H. et al. Genome-wide association study identifies 32 novel breast cancer susceptibility loci from overall and subtype-specific analyses. Nat Genet 52, 572–581 (2020).
Collins, R. What makes UK Biobank special? Lancet 31, 1173–1174 (2012).
Aguet, F. et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, (2020).
Wang, Y. et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet 46, 736–741 (2014).

SMR

Another method for working with publicly available summary statistics, similar to ldsc

SMR (Summary-data-based Mendelian Randomization)

https://cnsgenomics.com/software/smr/#Download
I grabbed the linux.zip executable file, which had correct permissions: smr_Linux.zip.
I used unzip to extract the executable.
I was able to run it with: ./smr_Linux

I have to refer to the executable executable by its path to get it to run it, and want to use this as a normal command. - To do so, I made a bin directory in my home direcoryand added it to my export PATH=\(PATH:\)HOME/bin/ - I used nano to open my .bashrc (which I had modified for another purpose earlier in the year) - I added PATH=\(PATH:\)HOME/bin/ - That added the bin directory into my home folder as a path that Linux uses to search for executables. - I put the smr_Linux executable in there so that its available for to use from the commandline by just typing smr_Linux

SMR is a command-line method for performing MR with eQTL data. I’d like to learn how (if possible) to run SMR for the nucleolar biology gene expression in relation to breast cancer, longevity, lung cancer, and asthma.

Command-line basics

# Converting from tsv to csv and from csv to txt
sed 's/'$'\t''/,/g' lifegen_phase2_bothpl_alldr_2017_09_18.tsv > file.csv
awk -v RS=, '$1' file.csv > lifegen.txt

# Changing column headers
sed -e '1s/a1/A1/' -e '1s/a0/A2/' lifegen_phase2_bothpl_alldr_2017_09_18.tsv > mod.tsv
sed -e '1s/rsid/SNP/' -e '1s/a0/A2/' mod.tsv > mod2.tsv

# Master proteome GWAS file, from which I tried getting RPS3A (learning --- didn't use this)
# These files were large. I removed them. Was easier to download each RP protein GWAS individually
# But this is what I tried:
wget http://metabolomics.helmholtz-muenchen.de/pgwas/download/suhre.pgwas.koraf4.tar

mkdir rps3a
tar -C /n/home04/cdadams/ld/ldsc/rps3a -xvf suhre.pgwas.koraf4.tar
tar -xvf suhre.pgwas.koraf4.tar
cat *out.gz > rps3a_all
mv rps3a_all rps3a_all.gz
gzip -cd rps3a_all.gz | more
gzip -cd rps3a_all.gz | less -s
gunzip rps3a_all.gz
less -s ps3a_all

# merge all .gz files and keep header
cat *v1.tsv.gz | sed '2,${ /^#/d; }' >mrpl33_2
mv mrpl33_2 mrpl33_2.gz

# copy from one folder into another 
cp mrpl33_master.txt /n/home04/cdadams/ld/ldsc/mrpl33_master.txt
cp RPS3A_master.txt /n/home04/cdadams/ld/ldsc/RPS3A_master.txt

# find out how  many lines a file has
wc -l bcac_master.txt

# Load Plink (however; didn't use it)
module load plink/1.90-fasrc01

# Makes bed, bim, fam files from VCF (useful for running a GWAS: individual-level data; didn't use it)
plink --vcf ieu-a-966.vcf.gz

Data management

Example R code for obtaining rsids from chromosome and position - Has to be done one-by-one for each chromosome with SNPlocs.Hsapiens.dbSNP.20120608 and uses HG37. - Likely a way to write a script to loop through each, but it was frankly faster to do each chromosome individually.

setwd("/n/home04/cdadams/ld/ldsc/MRPL52")
# Data for RP protein GWAS: http://www.phpc.cam.ac.uk/ceu/proteins/

# What packages are loaded?
(.packages())

# NOTE: when installing through Interactive RStudio/Bioconductor & Tidyverse 4.0.3, gmp is already configured
# as is the directory for installing R packages. All packages installed via Interactive RStudio must be 
# installed separately with R/4.0.2-fasrc01. For some of these gmp needs to be loaded prior: module load gmp/6.1.2-fasrc01

install_github("jrs95/hyprcoloc", build_opts = c("--no-resave-data", "--no-manual"), build_vignettes = TRUE)
BiocManager::install("graph")
BiocManager::install("RBGL")
BiocManager::install("SNPlocs.Hsapiens.dbSNP.20120608")
BiocManager::install("EnsDb.Hsapiens.v79")
# install.packages("graph")
# install.packages("RBGL")
install.packages("parallel")
install.packages("reshape2")
install.packages("magick")
install.packages("tm")
install.packages("devtools")
install.packages("remotes")
devtools::install_github("oncogenetics/oncofunco")
devtools::install_github("MRCIEU/TwoSampleMR")
devtools::install_github("MRCIEU/MRInstruments")
remotes::install_github("WSpiller/RadialMR")
install.packages("grid")
install.packages("gridExtra")
install.packages("MASS")
install.packages("vcd")
install.packages("Hmisc")
install.packages("htmltools")
install.packages("stringr")
install.packages("e1071")
install.packages("rcompanion")
install.packages("car")
install.packages("sqldf")
install.packages("kableExtra")
install.packages("gplots")
install.packages("graphics")
install.packages("GGally")
install.packages("CCA")
install.packages("CCP")
install.packages("lme4")
install.packages("qqman")
install.packages("dplyr")
install.packages("biomaRt")
install.packages("data.table")
install.packages("DataCombine")
install.packages("RColorBrewer")
install.packages("tidyverse")
install.packages("packcircles")
install.packages("ggplot2")
install.packages("viridis")
install.packages("igraph")
install.packages("CMplot")
install.packages("corrplot")
install.packages("foreach")
install.packages("Cairo")
install.packages("caret")
install.packages("readr")
install.packages("vcfR")
install.packages("Vennerable", repos="http://R-Forge.R-project.org")
library(graph)
library(RBGL)
library(Vennerable)
library(magick)
library(reshape2)
library(parallel)
library(devtools)
library(tm)
library(TwoSampleMR)
library(MRInstruments) 
library(RadialMR)
library(hyprcoloc)
library(SNPlocs.Hsapiens.dbSNP.20120608)
library(oncofunco)
library(grid)
library(gridExtra)
library(MASS)
library(vcd)
library(Hmisc)
library(htmltools)
library(stringr) 
library(e1071)
library(rcompanion)
library(car)
library(sqldf)
library(kableExtra)
library(gplots)
library(graphics)
library(GGally)
library(CCA)
library(lme4)
library(CCP)
library(qqman)
library(dplyr)
#library(tidyr)
library(biomaRt)
library(data.table)
library(DataCombine)
library(RColorBrewer)
library(tidyverse)
library(packcircles)
library(ggplot2)
library(viridis)
library(igraph)
library(CMplot)
library(metaviz)
library(corrplot)
library(foreach)
library(Cairo)
library(caret)
library(EnsDb.Hsapiens.v79)
library(readr)
library(vcfR)

# options(defaultPackages=c(getOption("defaultPackages"),
#                           "mypackage1","mypackage2", ... [etc.]))

setwd("/n/home04/cdadams/ld/ldsc/MRPL52")
list.files()

# read-in the MRPL52 chromosome files
c1=read.table(gzfile("MRPL52.7123.25.3_chrom_1_meta_final_v1.tsv.gz"), header=TRUE)
c1$p=10^(c1$log.P.)
ch1_snps <- getSNPlocs("ch1", as.GRanges=TRUE)
mypos <- c1$position
idx <- match(mypos, start(ch1_snps))
rsids <- mcols(ch1_snps)$RefSNP_id[idx]
rsids2=(paste("rs", rsids, sep=""))
c1$SNP=rsids2
c1_clean=c1[which(c1$SNP!="rsNA"),]

c2=read.table(gzfile("MRPL52.7123.25.3_chrom_2_meta_final_v1.tsv.gz"), header=TRUE)   
c2$p=10^(c2$log.P.)
ch2_snps <- getSNPlocs("ch2", as.GRanges=TRUE)
mypos <- c2$position
idx <- match(mypos, start(ch2_snps))
rsids <- mcols(ch2_snps)$RefSNP_id[idx]
rsids2=(paste("rs", rsids, sep=""))
c2$SNP=rsids2
c2_clean=c2[which(c2$SNP!="rsNA"),]

c3=read.table(gzfile("MRPL52.7123.25.3_chrom_3_meta_final_v1.tsv.gz"), header=TRUE)   
c3$p=10^(c3$log.P.)
ch3_snps <- getSNPlocs("ch3", as.GRanges=TRUE)
mypos <- c3$position
idx <- match(mypos, start(ch3_snps))
rsids <- mcols(ch3_snps)$RefSNP_id[idx]
rsids3=(paste("rs", rsids, sep=""))
c3$SNP=rsids3
c3_clean=c3[which(c3$SNP!="rsNA"),]

c4=read.table(gzfile("MRPL52.7123.25.3_chrom_4_meta_final_v1.tsv.gz"), header=TRUE)   
c4$p=10^(c4$log.P.)
ch4_snps <- getSNPlocs("ch4", as.GRanges=TRUE)
mypos <- c4$position
idx <- match(mypos, start(ch4_snps))
rsids <- mcols(ch4_snps)$RefSNP_id[idx]
rsids4=(paste("rs", rsids, sep=""))
c4$SNP=rsids4
c4_clean=c4[which(c4$SNP!="rsNA"),]

c5=read.table(gzfile("MRPL52.7123.25.3_chrom_5_meta_final_v1.tsv.gz"), header=TRUE)   
c5$p=10^(c5$log.P.)
ch5_snps <- getSNPlocs("ch5", as.GRanges=TRUE)
mypos <- c5$position
idx <- match(mypos, start(ch5_snps))
rsids <- mcols(ch5_snps)$RefSNP_id[idx]
rsids5=(paste("rs", rsids, sep=""))
c5$SNP=rsids5
c5_clean=c5[which(c5$SNP!="rsNA"),]

c6=read.table(gzfile("MRPL52.7123.25.3_chrom_6_meta_final_v1.tsv.gz"), header=TRUE)   
c6$p=10^(c6$log.P.)
ch6_snps <- getSNPlocs("ch6", as.GRanges=TRUE)
mypos <- c6$position
idx <- match(mypos, start(ch6_snps))
rsids <- mcols(ch6_snps)$RefSNP_id[idx]
rsids6=(paste("rs", rsids, sep=""))
c6$SNP=rsids6
c6_clean=c6[which(c6$SNP!="rsNA"),]

c7=read.table(gzfile("MRPL52.7123.25.3_chrom_7_meta_final_v1.tsv.gz"), header=TRUE)   
c7$p=10^(c7$log.P.)
ch7_snps <- getSNPlocs("ch7", as.GRanges=TRUE)
mypos <- c7$position
idx <- match(mypos, start(ch7_snps))
rsids <- mcols(ch7_snps)$RefSNP_id[idx]
rsids7=(paste("rs", rsids, sep=""))
c7$SNP=rsids7
c7_clean=c7[which(c7$SNP!="rsNA"),]

c8=read.table(gzfile("MRPL52.7123.25.3_chrom_8_meta_final_v1.tsv.gz"), header=TRUE)   
c8$p=10^(c8$log.P.)
ch8_snps <- getSNPlocs("ch8", as.GRanges=TRUE)
mypos <- c8$position
idx <- match(mypos, start(ch8_snps))
rsids <- mcols(ch8_snps)$RefSNP_id[idx]
rsids8=(paste("rs", rsids, sep=""))
c8$SNP=rsids8
c8_clean=c8[which(c8$SNP!="rsNA"),]

c9=read.table(gzfile("MRPL52.7123.25.3_chrom_9_meta_final_v1.tsv.gz"), header=TRUE)   
#c9=read.table("MRPL52.7123.25.3_chrom_9_meta_final_v1.tsv", header=TRUE)   
head(c9)
c9$p=10^(c9$log.P.)
ch9_snps <- getSNPlocs("ch9", as.GRanges=TRUE)
mypos <- c9$position
idx <- match(mypos, start(ch9_snps))
rsids <- mcols(ch9_snps)$RefSNP_id[idx]
rsids9=(paste("rs", rsids, sep=""))
c9$SNP=rsids9
c9_clean=c9[which(c9$SNP!="rsNA"),]

c10=read.table(gzfile("MRPL52.7123.25.3_chrom_10_meta_final_v1.tsv.gz"), header=TRUE)   
c10$p=10^(c10$log.P.)
ch10_snps <- getSNPlocs("ch10", as.GRanges=TRUE)
mypos <- c10$position
idx <- match(mypos, start(ch10_snps))
rsids <- mcols(ch10_snps)$RefSNP_id[idx]
rsids10=(paste("rs", rsids, sep=""))
c10$SNP=rsids10
c10_clean=c10[which(c10$SNP!="rsNA"),]

c11=read.table(gzfile("MRPL52.7123.25.3_chrom_11_meta_final_v1.tsv.gz"), header=TRUE)   
c11$p=10^(c11$log.P.)
ch11_snps <- getSNPlocs("ch11", as.GRanges=TRUE)
mypos <- c11$position
idx <- match(mypos, start(ch11_snps))
rsids <- mcols(ch11_snps)$RefSNP_id[idx]
rsids11=(paste("rs", rsids, sep=""))
c11$SNP=rsids11
c11_clean=c11[which(c11$SNP!="rsNA"),]

c12=read.table(gzfile("MRPL52.7123.25.3_chrom_12_meta_final_v1.tsv.gz"), header=TRUE)   
c12$p=10^(c12$log.P.)
ch12_snps <- getSNPlocs("ch12", as.GRanges=TRUE)
mypos <- c12$position
idx <- match(mypos, start(ch12_snps))
rsids <- mcols(ch12_snps)$RefSNP_id[idx]
rsids12=(paste("rs", rsids, sep=""))
c12$SNP=rsids12
c12_clean=c12[which(c12$SNP!="rsNA"),]

c13=read.table(gzfile("MRPL52.7123.25.3_chrom_13_meta_final_v1.tsv.gz"), header=TRUE)   
c13$p=10^(c13$log.P.)
ch13_snps <- getSNPlocs("ch13", as.GRanges=TRUE)
mypos <- c13$position
idx <- match(mypos, start(ch13_snps))
rsids <- mcols(ch13_snps)$RefSNP_id[idx]
rsids13=(paste("rs", rsids, sep=""))
c13$SNP=rsids13
c13_clean=c13[which(c13$SNP!="rsNA"),]

c14=read.table(gzfile("MRPL52.7123.25.3_chrom_14_meta_final_v1.tsv.gz"), header=TRUE)   
c14$p=10^(c14$log.P.)
ch14_snps <- getSNPlocs("ch14", as.GRanges=TRUE)
mypos <- c14$position
idx <- match(mypos, start(ch14_snps))
rsids <- mcols(ch14_snps)$RefSNP_id[idx]
rsids14=(paste("rs", rsids, sep=""))
c14$SNP=rsids14
c14_clean=c14[which(c14$SNP!="rsNA"),]

c15=read.table(gzfile("MRPL52.7123.25.3_chrom_15_meta_final_v1.tsv.gz"), header=TRUE)   
c15$p=10^(c15$log.P.)
ch15_snps <- getSNPlocs("ch15", as.GRanges=TRUE)
mypos <- c15$position
idx <- match(mypos, start(ch15_snps))
rsids <- mcols(ch15_snps)$RefSNP_id[idx]
rsids15=(paste("rs", rsids, sep=""))
c15$SNP=rsids15
c15_clean=c15[which(c15$SNP!="rsNA"),]

c16=read.table(gzfile("MRPL52.7123.25.3_chrom_16_meta_final_v1.tsv.gz"), header=TRUE)   
c16$p=10^(c16$log.P.)
ch16_snps <- getSNPlocs("ch16", as.GRanges=TRUE)
mypos <- c16$position
idx <- match(mypos, start(ch16_snps))
rsids <- mcols(ch16_snps)$RefSNP_id[idx]
rsids16=(paste("rs", rsids, sep=""))
c16$SNP=rsids16
c16_clean=c16[which(c16$SNP!="rsNA"),]

c17=read.table(gzfile("MRPL52.7123.25.3_chrom_17_meta_final_v1.tsv.gz"), header=TRUE)   
c17$p=10^(c17$log.P.)
ch17_snps <- getSNPlocs("ch17", as.GRanges=TRUE)
mypos <- c17$position
idx <- match(mypos, start(ch17_snps))
rsids <- mcols(ch17_snps)$RefSNP_id[idx]
rsids17=(paste("rs", rsids, sep=""))
c17$SNP=rsids17
c17_clean=c17[which(c17$SNP!="rsNA"),]

c18=read.table(gzfile("MRPL52.7123.25.3_chrom_18_meta_final_v1.tsv.gz"), header=TRUE)   
c18$p=10^(c18$log.P.)
ch18_snps <- getSNPlocs("ch18", as.GRanges=TRUE)
mypos <- c18$position
idx <- match(mypos, start(ch18_snps))
rsids <- mcols(ch18_snps)$RefSNP_id[idx]
rsids18=(paste("rs", rsids, sep=""))
c18$SNP=rsids18
c18_clean=c18[which(c18$SNP!="rsNA"),]

c19=read.table(gzfile("MRPL52.7123.25.3_chrom_19_meta_final_v1.tsv.gz"), header=TRUE)   
c19$p=10^(c19$log.P.)
ch19_snps <- getSNPlocs("ch19", as.GRanges=TRUE)
mypos <- c19$position
idx <- match(mypos, start(ch19_snps))
rsids <- mcols(ch19_snps)$RefSNP_id[idx]
rsids19=(paste("rs", rsids, sep=""))
c19$SNP=rsids19
c19_clean=c19[which(c19$SNP!="rsNA"),]

c20=read.table(gzfile("MRPL52.7123.25.3_chrom_20_meta_final_v1.tsv.gz"), header=TRUE)   
c20$p=10^(c20$log.P.)
ch20_snps <- getSNPlocs("ch20", as.GRanges=TRUE)
mypos <- c20$position
idx <- match(mypos, start(ch20_snps))
rsids <- mcols(ch20_snps)$RefSNP_id[idx]
rsids20=(paste("rs", rsids, sep=""))
c20$SNP=rsids20
c20_clean=c20[which(c20$SNP!="rsNA"),]

c21=read.table(gzfile("MRPL52.7123.25.3_chrom_21_meta_final_v1.tsv.gz"), header=TRUE)   
c21$p=10^(c21$log.P.)
ch21_snps <- getSNPlocs("ch21", as.GRanges=TRUE)
mypos <- c21$position
idx <- match(mypos, start(ch21_snps))
rsids <- mcols(ch21_snps)$RefSNP_id[idx]
rsids21=(paste("rs", rsids, sep=""))
c21$SNP=rsids21
c21_clean=c21[which(c21$SNP!="rsNA"),]

c22=read.table(gzfile("MRPL52.7123.25.3_chrom_22_meta_final_v1.tsv.gz"), header=TRUE)   
c22$p=10^(c22$log.P.)
ch22_snps <- getSNPlocs("ch22", as.GRanges=TRUE)
mypos <- c22$position
idx <- match(mypos, start(ch22_snps))
rsids <- mcols(ch22_snps)$RefSNP_id[idx]
rsids22=(paste("rs", rsids, sep=""))
c22$SNP=rsids22
c22_clean=c22[which(c22$SNP!="rsNA"),]

MRPL52_master=rbind(c1_clean, c2_clean, c3_clean, c4_clean, c5_clean, c6_clean, 
                    c7_clean, c8_clean, c9_clean, c10_clean,
                    c11_clean, c12_clean, c13_clean, c14_clean, c15_clean, c16_clean, 
                    c17_clean, c18_clean, c19_clean, c20_clean, c21_clean, c22_clean)

colnames(MRPL52_master)[6]="Beta"

write.table(MRPL52_master, file = "/n/home04/cdadams/ld/ldsc/MRPL52_master.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

# Save as a compressed file:
# gz1 <- gzfile("/n/home04/cdadams/ld/ldsc/MRPL52_master.gz", "w")
# write.csv(MRPL52_master, gz1)
# close(gz1)

Dealing with vcf files

Read-in the lung cancer GWAS

# R code for reading-in and cleaning a vcf file and formating the lung cancer gwas
setwd("/n/home04/cdadams/ld/ldsc/lc")
library(strex)

# read two times the vcf file, first for the columns names, second for the data
tmp_vcf<-readLines(gzfile("/n/home04/cdadams/ld/ldsc/ieu-a-966.vcf.gz"))
tmp_vcf_data<-read.table(gzfile("/n/home04/cdadams/ld/ldsc/ieu-a-966.vcf.gz"), stringsAsFactors = FALSE)
head(tmp_vcf_data)

# filter for the columns names
tmp_vcf<-tmp_vcf[-(grep("#CHROM",tmp_vcf)+1):-(length(tmp_vcf))]
vcf_names<-unlist(strsplit(tmp_vcf[length(tmp_vcf)],"\t"))
names(tmp_vcf_data)<-vcf_names

head(tmp_vcf_data)
dim(tmp_vcf_data)
colnames(tmp_vcf_data)[1]="chromosome"
colnames(tmp_vcf_data)[3]="SNP"
colnames(tmp_vcf_data)[4]="Allele2"
colnames(tmp_vcf_data)[5]="Allele1"
colnames(tmp_vcf_data)[10]="mix"

tmp_vcf_data$Beta=sub("\\:.*", "",  tmp_vcf_data$mix)
head(tmp_vcf_data)

#tmp_vcf_data$SE.1=sub('.*:', '', tmp_vcf_data$mix)

tmp_vcf_data$SE.1=str_before_nth(tmp_vcf_data$mix, ":", 2)
head(tmp_vcf_data)
tmp_vcf_data$SE=sub('.*:', '', tmp_vcf_data$SE.1)

tmp_vcf_data$p.1=str_before_nth(tmp_vcf_data$mix, ":", 3)
tmp_vcf_data$p=sub('.*:', '', tmp_vcf_data$p.1)

tmp_vcf_data$eaf.1=str_before_nth(tmp_vcf_data$mix, ":", 4)
tmp_vcf_data$eaf=sub('.*:', '', tmp_vcf_data$eaf.1)

tmp_vcf_data$N.1=str_before_nth(tmp_vcf_data$mix, ":", 5)
tmp_vcf_data$N=sub('.*:', '', tmp_vcf_data$N.1)

tmp_vcf_data$p=as.numeric(as.character(tmp_vcf_data$p))
tmp_vcf_data$Beta=as.numeric(as.character(tmp_vcf_data$Beta))
tmp_vcf_data$N=as.numeric(as.character(tmp_vcf_data$N))
tmp_vcf_data$eaf=as.numeric(as.character(tmp_vcf_data$eaf))
tmp_vcf_data$SE=as.numeric(as.character(tmp_vcf_data$SE))
head(tmp_vcf_data)
tmp_vcf_data$pval=10^(-tmp_vcf_data$p)
tmp_vcf_data$pval2=10^(tmp_vcf_data$p)
colnames(tmp_vcf_data)

myvars=c("chromosome","POS","SNP","Allele2","Allele1","Beta","SE","pval","eaf","N")
str(tmp_vcf_data)
lc=tmp_vcf_data[myvars]
head(lc)
summary(lc$pval)

write.table(lc, file = "/n/home04/cdadams/ld/ldsc/lc.gwas2.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

MR of eqtlGen on BC

The intention here is to replicate the MR findings from GTEx in blood on breast cancer using eqtlGen
(I also extracted the NUC genes from eqtlGen and obtained genetic correlations for them with breast cancer and longevity.
Unfortuantely, those are null. It’s likely the case that I need to learn SMR and parititioned h2.

setwd('/n/home04/cdadams/ld/ldsc/')
(.packages())

# NOTE: when installing through Interactive RStudio/Bioconductor & Tidyverse 4.0.3, gmp is already configured
# as is the directory for installing R packages. All packages installed via Interactive RStudio must be 
# installed separately with R/4.0.2-fasrc01. For some of these gmp needs to be loaded prior: module load gmp/6.1.2-fasrc01

library(graph)
library(RBGL)
library(Vennerable)
library(magick)
library(reshape2)
library(parallel)
library(devtools)
library(tm)
library(TwoSampleMR)
library(MRInstruments) 
library(RadialMR)
library(hyprcoloc)
library(SNPlocs.Hsapiens.dbSNP.20120608)
library(oncofunco)
library(grid)
library(gridExtra)
library(MASS)
library(vcd)
library(Hmisc)
library(htmltools)
library(stringr) 
library(e1071)
library(rcompanion)
library(car)
library(sqldf)
library(kableExtra)
library(gplots)
library(graphics)
library(GGally)
library(CCA)
library(lme4)
library(CCP)
library(qqman)
library(dplyr)
#library(tidyr)
library(biomaRt)
library(data.table)
library(DataCombine)
library(RColorBrewer)
library(tidyverse)
library(packcircles)
library(ggplot2)
library(viridis)
library(igraph)
library(CMplot)
library(metaviz)
library(corrplot)
library(foreach)
library(Cairo)
library(caret)
library(EnsDb.Hsapiens.v79)
library(readr)
library(vcfR)

# options(defaultPackages=c(getOption("defaultPackages"),
#                           "mypackage1","mypackage2", ... [etc.]))
setwd('/n/home04/cdadams/ld/ldsc/')
#####################################################################################################
# Read-in the BCAC data
# I had compressed the file before transfering to the cluster: 
# gzip icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt
# An Excel data dictionary accompanies the BCAC data
# /n/home04/cdadams/ld/ldsc//data_dictionary_summary_stats_bc_risk_2020.xlsx

#Data dictiony
# iCOGS-Oncoarray-GWAS-Meta-Overall-summary-results 
# Label Meaning
# var_name  The variant name is coded as chromosome_position_non-effect-allele_effect-allele
# Effect.Gwas   The effect allele in GWAS
# Baseline.Gwas The non effect allele in GWAS
# Freq.Gwas Effect allele frequency
# beta.Gwas Log odds ratio (OR) for effect alllele in GWAS data
# SE.Gwas   standard error of the log OR estimate in GWAS data
# P.value.Gwas  P-value of the log OR estimate in GWAS data
# SNP.iCOGS SNPs name in iCOGS panel. For SNPs with rsid, the name is coded as rsid:position:non-effect-allele:effect-allele; 
# For SNPs without rsid, the name is similarly as var_name 

bcac=read.table(gzfile("icogs_onco_gwas_meta_overall_breast_cancer_summary_level_statistics.txt.gz"), header=TRUE)
head(bcac)
colnames(bcac)
bcac$id=gsub( ":.*$", "", bcac$SNP.iCOGs)
dim(bcac)
bcac2=bcac[which(startsWith(bcac$id, 'rs')),]   
head(bcac2)
dim(bcac2)
colnames(bcac2)[5]="Beta"

#save.image("longevity_bcac.Rdata")
write.table(bcac2, file = "/n/home04/cdadams/ld/ldsc/bcac_master3.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)
# Read-in the bcac file formated for ldsc
bcac3=read.table(file = "/n/home04/cdadams/ld/ldsc/bcac_master3.txt", sep = "\t", header=TRUE)
#####################################################################################################
# Read-in the NUC lists
# First, transfer the files from personal computer to /n/home04/cdadams/ld/ldsc/
#MR of NUC expression in eQTLGen blood on BC

rp_list_expanded=read.csv('/n/home04/cdadams/ld/ldsc/expanded_verified.csv')#copied to ~/Dropbox/Harvard/ribosomal_proteins/Main_Tables_Paper/v8/august_visuals/December
rib_biogenesis=read.csv('/n/home04/cdadams/ld/ldsc/rib_biogenesis.csv')
head(rib_biogenesis)
rib_biogenesis$gene=as.character(rib_biogenesis$gene)
rib_biogenesis$gene=substr(rib_biogenesis$gene,1,nchar(rib_biogenesis$gene)-1)
nucleolus=read.csv('/n/home04/cdadams/ld/ldsc/nucleolus.csv')
nucleolus$gene.=as.character(nucleolus$gene.)
nucleolus$gene=substr(nucleolus$gene,1,nchar(nucleolus$gene)-1)
nucleolus$gene.=NULL
combo=rbind(rib_biogenesis,nucleolus)
combo2=combo %>% distinct(gene, .keep_all = TRUE)
rp_in_nuc=combo2[which(combo2$gene %in% rp_list_expanded$all_RP),]
rp_list_expanded$gene=rp_list_expanded$all_RP
rp_list_expanded$all_RP=NULL
combo3=rbind(combo2,rp_list_expanded)
combo4=combo3 %>% distinct(gene, .keep_all = TRUE)
dim(combo4)
write.csv(combo4, '/n/home04/cdadams/ld/ldsc/combo4.csv')#I'd copied over another version of this for the initial ldsc analysis 3/10

combo4=read.csv('/n/home04/cdadams/ld/ldsc/combo4.csv')
#####################################################################################################
# Working with the eQTLGen file
setwd('/n/home04/cdadams/ld/ldsc/blood-eQTLs')
#####################################################################################################
# This README accompanies the files with cis-eQTL results from eQTLGen
# 
# Files
# -----
# File with full cis-eQTL results: 2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz
# File with significant (FDR<0.05) cis-eQTL results: 2019-12-11-cis-eQTLsFDR0.05-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz
# 
# Column Names
# ------------
# Pvalue - P-value
# SNP - SNP rs ID
# SNPChr - SNP chromosome
# SNPPos - SNP position
# AssessedAllele - Assessed allele, the Z-score refers to this allele
# OtherAllele - Not assessed allele
# Zscore - Z-score
# Gene - ENSG name (Ensembl v71) of the eQTL gene
# GeneSymbol - HGNC name of the gene
# GeneChr - Gene chromosome
# GenePos - Centre of gene position
# NrCohorts - Total number of cohorts where this SNP-gene combination was tested
# NrSamples - Total number of samples where this SNP-gene combination was tested
# FDR - False discovery rate estimated based on permutations
# BonferroniP - P-value after Bonferroni correction
# 
# Additional information
# ----------------------
# These files contain all cis-eQTL results from eQTLGen, accompanying the article.
# 19,250 genes that showed expression in blood were tested.
# Every SNP-gene combination with a distance <1Mb from the center of the gene and  tested in at least 2 cohorts was included.
# Associations where SNP/proxy positioned in Illumina probe were not removed from combined analysis.
# UPDATE LOG
# ----------
# 2018-10-19: Initial data release
# 2018-12-19: In the current README, following file names have been fixed and updated:
# 2019-12-20: Cis-eQTLs are now updated to have a 2-cohort filter: every cis-eQTL must be tested in at least 2 cohorts to be reported

#On cluster, I gzipped the eqtlGen file: gzip 2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt
#this is because I kept getting error messages both here & on the cluster saying ther was no room on the disk.
#I moved unnecessary files to a folder in my personal computer DB: /Users/charleenadams/Dropbox/Harvard/longevity/survival/odyssey_transfer/
#This has the gz files for the protein GWAS, the eQTLGen SMR file, and the eQTLGen EAF file.
#I launched an interacted terminal session with increased memory: 
# salloc -p test  --mem 5000 -t 0-06:00

eqtlGen=read.table(gzfile("2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz"), header=TRUE)   
head(eqtlGen)

# 127341799 expected rows (I think the read-in worked)
dim(eqtlGen)

# Extract out the NUC genes
nuc.list=read.csv('nuc.list.csv')
nucs=eqtlGen[which(eqtlGen$GeneSymbol %in% nuc.list$gene),]
head(eqtlGen$Gene)

write.table(nucs, file = "/n/home04/cdadams/ld/ldsc/nucs_eqtlGen.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

gznucs <- gzfile("/n/home04/cdadams/ld/ldsc/nucs.gz", "w")
write.csv(nucs, gznucs)
close(gznucs)

# Format for ldsc (ldsc was null for nucs on bc and longevity)
# Format nuc eQTL for ldsc: 
# ./munge_sumstats.py \
# --sumstats nucs_eqtlGen.txt \
# --snp SNP \
# --ignore SNPChr \
# --a1 AssessedAllele \
# --a2 OtherAllele \
# --N 30596  \
# --out nucs \
# --chunksize 500000 \
# --p Pvalue \
# --signed-sumstats Zscore,0 \
# --merge-alleles w_hm3.snplist 

#####################################################################################################
# Must merge the MAF file (2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt) with the nucs_eqtlGen.txt
# to get the effect allele frequencies

# the allele frequencies are needed for the calcuation of the beta and se from the z scores
# the methods for this are provided by: Zhu 2016: https://www.nature.com/articles/ng.3538

# Beta= z / sqrt(2p(1− p)(n + z^2)) and
# SE= 1 / sqrt(2p(1− p)(n + z^2))
# 
# Where p is the frequency of the imputed SNP, 
# you could use out reference panel to calculate p.

# This README accompanies the file with allele frequencies from eQTLGen cis-eQTL analysis: 
# 2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt.gz
# 
# Column Names
# ------------
#   SNP - SNP rs ID
# hg19_chr - chr number
# hg19_pos - SNP position (hg19)
# AlleleA - Other allele
# AlleleB - Assessed allele
# allA_total - Total allele count of genotype AA
# allAB_total - Total allele count of genotype AB
# allB_total - Total allele count of genotype BB
# AlleleB_all - Allele frequency of assessed allele
# 
# Additional information
# ----------------------
# The allele frequencies were calculated using reported allele counts from all cohorts except 
# Framingham Heart Study, because the related samples are present in this cohort.

#Read-in "2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt" file 
nucs_eqtl_MAF=read.table(file = "/n/home04/cdadams/ld/ldsc/2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt", sep = "\t", header=TRUE)
nucs_total <- merge(nucs_eqtlGen,nucs_eqtl_MAF,by="SNP")

n=30596
nucs_total$Beta=nucs_total$Zscore/sqrt(2*(nucs_total$AlleleB_all)*(1-nucs_total$AlleleB_all)*(n+nucs_total$Zscore^2))
nucs_total$SE=1/sqrt(2*(nucs_total$AlleleB_all)*(1-nucs_total$AlleleB_all)*(n+nucs_total$Zscore^2))

# Beta= z / sqrt(2p(1− p)(n + z^2)) and
# SE= 1 / sqrt(2p(1− p)(n + z^2))
# 
# Where p is the frequency of the imputed SNP, 
# you could use out reference panel to calculate p.

write.table(nucs_total, file = "/n/home04/cdadams/ld/ldsc/nucs_total_maf_beta_se.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

#####################################################################################################
# Read-in for MR formatting
exposure_nuc_eqtlGen <- read_exposure_data(
  filename = "/n/home04/cdadams/ld/ldsc/nucs_total_maf_beta_se.txt",
  sep = '\t',
  snp_col = 'SNP',
  beta_col = 'Beta',
  se_col = 'SE',
  effect_allele_col = 'AssessedAllele',
  #phenotype_col = 'tissue',
  #units_col = 'tissue',
  other_allele_col = 'OtherAllele',
  eaf_col = 'AlleleB_all',
  #samplesize_col = 'samplesize',
  #ncase_col = 'ncase',
  #ncontrol_col = 'ncontrol',
  gene_col = 'GeneSymbol',
  pval_col = 'Pvalue'
)

exposure_nuc_eqtlGen_p=exposure_nuc_eqtlGen[which(exposure_nuc_eqtlGen$pval.exposure<0.000005),]
write.csv(exposure_nuc_eqtlGen_p, 'exposure_nuc_eqtlGen_p.csv')

exposure_nuc_eqtlGen_p=read.csv('exposure_nuc_eqtlGen_p.csv')

# Read-in the outcome bcac for MR (formatted for ldsc)
# # Format GWAS for ldsc: breast cancer (BCAC)
# ./munge_sumstats.py \
# --sumstats bcac_master3.txt \
# --snp id \
# --ignore var_name \
# --a1 Effect.Gwas \
# --a2 Baseline.Gwas \
# --N 247173 \
# --out bcac_final \
# --chunksize 500000 \
# --p P.value.Gwas \
# --signed-sumstats Beta,0 \
# --merge-alleles w_hm3.snplist

bc <- read_outcome_data(
  snps = exposure_nuc_eqtlGen_p$SNP,
  filename = "/n/home04/cdadams/ld/ldsc/bcac_master3.txt",
  sep = '\t',
  snp_col = 'id',
  beta_col = 'Beta',
  se_col = 'SE.Gwas',
  effect_allele_col = 'Effect.Gwas',
  #phenotype_col = 'tissue',#tissue
  #units_col = 'tissue',
  other_allele_col = 'Baseline.Gwas',
  eaf_col = 'Freq.Gwas',
  #samplesize_col = 'samplesize',
  #ncase_col = 'ncase',
  #ncontrol_col = 'ncontrol',
  #gene_col = 'gene_name',
  pval_col = 'P.value.Gwas'
)

#Harmonize the alleles btw the two gwas
nuc_eqtlGen_bc <- harmonise_data(exposure_nuc_eqtlGen_p, bc, action = 2)
dim(nuc_eqtlGen_bc)
write.csv(nuc_eqtlGen_bc, '/n/home04/cdadams/ld/ldsc/nuc_eqtlGen_bc.csv')#can simply read this in for the MR

nuc_eqtlGen_bc=read.csv('nuc_eqtlGen_bc.csv')

single_res=mr_singlesnp(nuc_eqtlGen_bc)
single_res=single_res[order(single_res$SNP),]

single_res2=single_res[3:124270,]#remove the meta-analyses

#Merge the dat and single results files
single_merged=merge(nuc_eqtlGen_bc, single_res2, 
                    by=c("SNP", "exposure", "outcome"), all = TRUE)
single_merged2<-subset(single_merged, (!is.na(single_merged$p)))

#FDR correction
p5=single_merged2$p
fdr5=p.adjust(p5, method = "fdr", n = length(p5))
single_merged2$fdr=fdr5
dim(single_merged2)

#Create the "concordance" variable
single_merged2$concord1=ifelse(single_merged2$beta.exposure<0 & single_merged2$beta.outcome>0, "concord", "no")
single_merged2$concord2=ifelse(single_merged2$beta.exposure>0 & single_merged2$beta.outcome<0, "concord", single_merged2$concord1)
table(single_merged2$concord2)

#Order by pvalue
single_merged2=single_merged2[order(single_merged2$p),]
write.csv(single_merged2, 'nucs_eqtlGen_bc_fdr_no_clump.csv')
single_merged2=read.csv('nucs_eqtlGen_bc_fdr_no_clump.csv')

# Re-run the analysis but clump to prioritize unique findings for the gene-level 
# meta-analyses
nuc_eqtlGen_bc=read.csv('/n/home04/cdadams/ld/ldsc/nuc_eqtlGen_bc.csv')
nuc_eqtlGen_bc_clump=clump_data(nuc_eqtlGen_bc)
single_res_clump=mr_singlesnp(nuc_eqtlGen_bc_clump)
single_res_clump=single_res_clump[order(single_res_clump$SNP),]
head(single_res_clump)
dim(single_res_clump)

single_res_clump2=single_res_clump[3:447,]#remove the meta-analyses
head(single_res_clump_clump2)
dim(single_res_clump2)

#Merge the dat and single results files
single_merged_clump=merge(nuc_eqtlGen_bc_clump, single_res_clump2, 
                    by=c("SNP", "exposure", "outcome"), all = TRUE)
dim(single_merged_clump)
head(single_merged_clump)
single_merged_clump2<-subset(single_merged_clump, (!is.na(single_merged_clump$p)))
dim(single_merged_clump2)

#FDR correction
p5_clump=single_merged_clump2$p
fdr5_clump=p.adjust(p5_clump, method = "fdr", n = length(p5_clump))
single_merged_clump2$fdr_clump=fdr5_clump
dim(single_merged_clump2)

#Create the "concordance" variable
single_merged_clump2$concord1=ifelse(single_merged_clump2$beta.exposure<0 & single_merged_clump2$beta.outcome>0, "concord", "no")
single_merged_clump2$concord2=ifelse(single_merged_clump2$beta.exposure>0 & single_merged_clump2$beta.outcome<0, "concord", single_merged_clump2$concord1)
table(single_merged_clump2$concord2)
head(single_merged_clump2)

#Order by pvalue
single_merged_clump_clump2=single_merged_clump2[order(single_merged_clump2$p),]
sigs_clump2=single_merged_clump2[which(single_merged_clump2$p<0.05),]
sigs_clump2$gene.exposure=droplevels(sigs_clump2$gene.exposure)
write.csv(single_merged_clump2, 'nucs_eqtlGen_bc_fdr_CLUMPED.csv')
#######################################################################
# Perform meta-analyses (with non-clumped set; clumped at level of the gene) 
#of top findings from clumped data: largely null except for Wald ratios and two SNP IVWs
single_merged_clump2=read.csv('nucs_eqtlGen_bc_fdr_CLUMPED.csv')
sigs_clump2=single_merged_clump2[which(single_merged_clump2$p<0.05),]
sigs_clump2$gene.exposure=droplevels(sigs_clump2$gene.exposure)
sigs_clump2$gene.exposure

#Format the id.exposure column to contain the gene names so that master file is sortable.
nuc_eqtlGen_bc=read.csv('/n/home04/cdadams/ld/ldsc/nuc_eqtlGen_bc.csv')
nuc_eqtlGen_bc$id.exposure=nuc_eqtlGen_bc$gene.exposure
nuc_eqtlGen_bc$id.outcome="breast cancer"

# Get the clumped rps23
big_rps23=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="RPS23"),]
big_rps23=clump_data(big_rps23)
mr_meta_big_rps23=mr(big_rps23)
mr_meta_big_rps23

# Get the clumped EXOSC6
big_EXOSC6=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="EXOSC6"),]#no sig
big_EXOSC6=clump_data(big_EXOSC6)
mr_meta_big_EXOSC6=mr(big_EXOSC6)
mr_meta_big_EXOSC6

# Get the clumped GTF3A
big_GTF3A=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="GTF3A"),] #two snps but sig
dim(big_GTF3A)
big_GTF3A=clump_data(big_GTF3A)
dim(big_GTF3A)
mr_meta_big_GTF3A=mr(big_GTF3A)
mr_meta_big_GTF3A

# Get the clumped RRNAD1
big_RRNAD1=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="RRNAD1"),] #not sig
big_RRNAD1=clump_data(big_RRNAD1)
mr_meta_big_RRNAD1=mr(big_RRNAD1)
mr_meta_big_RRNAD1

# Get the clumped RPP30
big_RPP30=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="RPP30"),] # sig but only 1
big_RPP30=clump_data(big_RPP30)
mr_meta_big_RPP30=mr(big_RPP30)
mr_meta_big_RPP30

# Get the clumped DAP3
big_DAP3=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="DAP3"),] # sig but only 1
big_DAP3=clump_data(big_DAP3)
mr_meta_big_DAP3=mr(big_DAP3)
mr_meta_big_DAP3

# Get the clumped THUMPD1
big_THUMPD1=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="THUMPD1"),] # not sig
big_THUMPD1=clump_data(big_THUMPD1)
mr_meta_big_THUMPD1=mr(big_THUMPD1)
mr_meta_big_THUMPD1

# Get the clumped MRPL41
big_MRPL41=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="MRPL41"),] # sig but only 1
big_MRPL41=clump_data(big_MRPL41)
mr_meta_big_MRPL41=mr(big_MRPL41)
mr_meta_big_MRPL41

# Get the clumped MRPL21
big_MRPL21=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="MRPL21"),] # 7 but not sig
big_MRPL21=clump_data(big_MRPL21)
mr_meta_big_MRPL21=mr(big_MRPL21)
mr_meta_big_MRPL21

# Get the clumped NCK1
big_NCK1=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="NCK1"),] # sig but only 1
big_NCK1=clump_data(big_NCK1)
mr_meta_big_NCK1=mr(big_NCK1)
mr_meta_big_NCK1

# Get the clumped ERI1
big_ERI1=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="ERI1"),] # sig but only 1
big_ERI1=clump_data(big_ERI1)
mr_meta_big_ERI1=mr(big_ERI1)
mr_meta_big_ERI1

# Get the clumped NOL8
big_NOL8=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="NOL8"),] # not sig
big_NOL8=clump_data(big_NOL8)
mr_meta_big_NOL8=mr(big_NOL8)
mr_meta_big_NOL8

# Get the clumped NSUN5P2
big_NSUN5P2=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="NSUN5P2"),] # not sig
big_NSUN5P2=clump_data(big_NSUN5P2)
mr_meta_big_NSUN5P2=mr(big_NSUN5P2)
mr_meta_big_NSUN5P2

# Get the clumped HELB
big_HELB=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="HELB"),] # not sig
big_HELB=clump_data(big_HELB)
mr_meta_big_HELB=mr(big_HELB)
mr_meta_big_HELB

# Get the clumped MRPL4
big_MRPL4=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="MRPL4"),] # not sig
big_MRPL4=clump_data(big_MRPL4)
mr_meta_big_MRPL4=mr(big_MRPL4)
mr_meta_big_MRPL4

# Get the clumped URB2
big_URB2=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="URB2"),] # not sig
big_URB2=clump_data(big_URB2)
mr_meta_big_URB2=mr(big_URB2)
mr_meta_big_URB2

# Get the clumped PPARGC1A
big_PPARGC1A=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="PPARGC1A"),] # not sig
big_PPARGC1A=clump_data(big_PPARGC1A)
mr_meta_big_PPARGC1A=mr(big_PPARGC1A)
mr_meta_big_PPARGC1A

# Get the clumped MRPL19
big_MRPL19=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="MRPL19"),] # sig but only 1
big_MRPL19=clump_data(big_MRPL19)
mr_meta_big_MRPL19=mr(big_MRPL19)
mr_meta_big_MRPL19

# Get the clumped KAT2B
big_KAT2B=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="KAT2B"),] # not sig
big_KAT2B=clump_data(big_KAT2B)
mr_meta_big_KAT2B=mr(big_KAT2B)
mr_meta_big_KAT2B

# Get the clumped IMP4
big_IMP4=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="IMP4"),] # not sig
big_IMP4=clump_data(big_IMP4)
mr_meta_big_IMP4=mr(big_IMP4)
mr_meta_big_IMP4

# Get the clumped MRPL10
big_MRPL10=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="MRPL10"),] # not sig
big_MRPL10=clump_data(big_MRPL10)
mr_meta_big_MRPL10=mr(big_MRPL10)
mr_meta_big_MRPL10

# Get the clumped MPV17L2
big_MPV17L2=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="MPV17L2"),] # sig but only 1
big_MPV17L2=clump_data(big_MPV17L2)
mr_meta_big_MPV17L2=mr(big_MPV17L2)
mr_meta_big_MPV17L2

# Get the clumped TEX10
big_TEX10=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="TEX10"),] # not sig 
big_TEX10=clump_data(big_TEX10)
mr_meta_big_TEX10=mr(big_TEX10)
mr_meta_big_TEX10

# Get the clumped EIF2AK2
big_EIF2AK2=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="EIF2AK2"),] # not sig 
big_EIF2AK2=clump_data(big_EIF2AK2)
mr_meta_big_EIF2AK2=mr(big_EIF2AK2)
mr_meta_big_EIF2AK2

# Get the clumped RPL8
big_RPL8=nuc_eqtlGen_bc[which(nuc_eqtlGen_bc$gene.exposure=="RPL8"),] # not sig 
big_RPL8=clump_data(big_RPL8)
mr_meta_big_RPL8=mr(big_RPL8)
mr_meta_big_RPL8

sigs_clump2$gene.exposure
#ls to get the names of the mr_met_big* results; copy and format
ls()
#Remove the objects in the environment that aren't the gene-level meta-analytic MR results
rm(list= ls()[!(ls() %in% c("mr_meta_big_DAP3",          
                            "mr_meta_big_EIF2AK2",        "mr_meta_big_ERI1",           "mr_meta_big_EXOSC6",        
                            "mr_meta_big_GTF3A",          "mr_meta_big_HELB",           "mr_meta_big_IMP4",          
                            "mr_meta_big_KAT2B",          "mr_meta_big_MPV17L2",        "mr_meta_big_MRPL10",        
                            "mr_meta_big_MRPL19",         "mr_meta_big_MRPL21",         "mr_meta_big_MRPL4",         
                            "mr_meta_big_MRPL41",         "mr_meta_big_NCK1",           "mr_meta_big_NOL8",          
                            "mr_meta_big_NSUN5P2",        "mr_meta_big_PPARGC1A",       "mr_meta_big_RPL8",          
                            "mr_meta_big_RPP30",          "mr_meta_big_RRNAD1",         "mr_meta_big_TEX10",         
                            "mr_meta_big_THUMPD1",        "mr_meta_big_URB2",           "mr_meta_big_rps23"))])
DF_obj <- lapply(ls(), get)
str(DF_obj)
nms <- c("mr_meta_big_DAP3",          
         "mr_meta_big_EIF2AK2",        "mr_meta_big_ERI1",           "mr_meta_big_EXOSC6",        
         "mr_meta_big_GTF3A",          "mr_meta_big_HELB",           "mr_meta_big_IMP4",          
         "mr_meta_big_KAT2B",          "mr_meta_big_MPV17L2",        "mr_meta_big_MRPL10",        
         "mr_meta_big_MRPL19",         "mr_meta_big_MRPL21",         "mr_meta_big_MRPL4",         
         "mr_meta_big_MRPL41",         "mr_meta_big_NCK1",           "mr_meta_big_NOL8",          
         "mr_meta_big_NSUN5P2",        "mr_meta_big_PPARGC1A",       "mr_meta_big_RPL8",          
         "mr_meta_big_RPP30",          "mr_meta_big_RRNAD1",         "mr_meta_big_TEX10",         
         "mr_meta_big_THUMPD1",        "mr_meta_big_URB2",           "mr_meta_big_rps23")
names(DF_obj) = nms
for( i in 1:length(DF_obj))
  write.csv(DF_obj[i],paste0(nms[i],'bc.csv'))

#Merge all the meta-analytic results
merged <- reduce(DF_obj, full_join)
head(merged, n=20)
write.csv(merged, 'metas_eqtlGen_bc.csv')

GTEx blood

# Read-in the GTEx master eqtls file to obtain the n for blood (n=3288 vs 30596 for eqtlGen)

load("/n/home04/cdadams/ld/ldsc/V8_master_tissues_n.Rdata")
colnames(master_v8_n)
colnames(master_v8_n)[2] <- "chromosome"
colnames(master_v8_n)[31] <- "MAF"
master_v8_n$eaf=ifelse(master_v8_n$ref_factor=='1', master_v8_n$MAF, 1-master_v8_n$MAF)
table(master_v8_n$tissue)
gtex_blood=master_v8_n[which(master_v8_n$tissue=="Whole_Blood"),]
gtex_blood$n
write.csv(gtex_blood, '/n/home04/cdadams/ld/ldsc/gtex_blood.csv')

gtex_blood=read.csv('/n/home04/cdadams/ld/ldsc/gtex_blood.csv')
gtex_blood_combo4=gtex_blood[which(gtex_blood$gene_name %in% combo4$gene),]
write.csv(gtex_blood_combo4, 'gtex_blood_nuc.csv')

MR eqtlGen on longevity

Legacy meta-analysis. As I noted, at first I selected top hits and ran the meta-analyses one-by-one.

I made a note to develop a loop later
I did later use a loop (and it is available in other repositories of the meta-analyses of eqtlGen on bc and longevity)

#####################################################################################################
# Read-in the GTEx master eqtls file to obtain the n for blood (n=3288 vs 30596 for eqtlGen)
# This is just a reminder of the code for working with the entire GTEx egene dataset and that there are only 3288 samples for blood
load("/n/home04/cdadams/ld/ldsc/V8_master_tissues_n.Rdata")
colnames(master_v8_n)
colnames(master_v8_n)[2] <- "chromosome"
colnames(master_v8_n)[31] <- "MAF"
master_v8_n$eaf=ifelse(master_v8_n$ref_factor=='1', master_v8_n$MAF, 1-master_v8_n$MAF)
table(master_v8_n$tissue)
gtex_blood=master_v8_n[which(master_v8_n$tissue=="Whole_Blood"),]
gtex_blood$n
write.csv(gtex_blood, '/n/home04/cdadams/ld/ldsc/gtex_blood.csv')

gtex_blood=read.csv('/n/home04/cdadams/ld/ldsc/gtex_blood.csv')
gtex_blood_combo4=gtex_blood[which(gtex_blood$gene_name %in% combo4$gene),]
write.csv(gtex_blood_combo4, 'gtex_blood_nuc.csv')

# MR of eqtlGen (blood in >30,000 people) on longevity
setwd('/n/home04/cdadams/ld/ldsc/')
(.packages())

# NOTE: when installing through Interactive RStudio/Bioconductor & Tidyverse 4.0.3, gmp is already configured
# as is the directory for installing R packages. All packages installed via Interactive RStudio must be 
# installed separately with R/4.0.2-fasrc01. For some of these gmp needs to be loaded prior: module load gmp/6.1.2-fasrc01

library(graph)
library(RBGL)
library(Vennerable)
library(magick)
library(reshape2)
library(parallel)
library(devtools)
library(tm)
library(TwoSampleMR)
library(MRInstruments) 
library(RadialMR)
library(hyprcoloc)
library(SNPlocs.Hsapiens.dbSNP.20120608)
library(oncofunco)
library(grid)
library(gridExtra)
library(MASS)
library(vcd)
library(Hmisc)
library(htmltools)
library(stringr) 
library(e1071)
library(rcompanion)
library(car)
library(sqldf)
library(kableExtra)
library(gplots)
library(graphics)
library(GGally)
library(CCA)
library(lme4)
library(CCP)
library(qqman)
library(dplyr)
#library(tidyr)
library(biomaRt)
library(data.table)
library(DataCombine)
library(RColorBrewer)
library(tidyverse)
library(packcircles)
library(ggplot2)
library(viridis)
library(igraph)
library(CMplot)
library(metaviz)
library(corrplot)
library(foreach)
library(Cairo)
library(caret)
library(EnsDb.Hsapiens.v79)
library(readr)
library(vcfR)

# options(defaultPackages=c(getOption("defaultPackages"),
#                           "mypackage1","mypackage2", ... [etc.]))

setwd('/n/home04/cdadams/ld/ldsc/')
# Readme for the longevity gwas
# This dataset contains HRC-imputed GWAS summary statistics on parental survival, under the assumption of common effect sizes between fathers and mothers. 
# rsid � reference SNP cluster ID; 
# snpid � SNP chromosome and position ID; 
# chr � Chromosome; 
# pos � Base-pair position on chromosome (GRCh37); 
# a1 � the effect allele; 
# a0 � the reference allele; 
# n � Number of phenotypes analysed for the SNP. Note that combined parental residuals from� UK Biobank are counted once, and LifeGen meta-analysed father and mother statistics are counted once each, and as a #result, the total number of phenotypes analysed is actually greater than the number listed. 
# freq1 � frequency of the effect allele; 
# beta1 � log hazard protection ratio (negation of log hazard ratio) for carrying one copy of the a1 allele in self; 
# se � Standard Error; 
# p � the P value for the Wald test of association between imputed dosage and cox model residual; direction � direction of effect in UK Biobank and LifeGen; 
# info � imputation info score; 
# freq_se � standard error of effect allele frequency between UK Biobank and LifeGen; 
# min_freq1 � minimum effect allele frequency across UK Biobank and LifeGen; 
# max_freq1 � maximum effect allele frequency across UK Biobank and LifeGen.

# Read-in the longevity gwas data
long=read.table(gzfile("lifegen_phase2_bothpl_alldr_2017_09_18.tsv.gz"), header=TRUE)

# save.image("longevity_long.Rdata")
write.table(long, file = "/n/home04/cdadams/ld/ldsc/long_for_mr.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

# Read-in the NUC lists
# First, transfer the files from personal computer to /n/home04/cdadams/ld/ldsc/
# MR of NUC expression in eQTLGen blood on BC

rp_list_expanded=read.csv('/n/home04/cdadams/ld/ldsc/expanded_verified.csv')#copied to ~/Dropbox/Harvard/ribosomal_proteins/Main_Tables_Paper/v8/august_visuals/December
rib_biogenesis=read.csv('/n/home04/cdadams/ld/ldsc/rib_biogenesis.csv')
head(rib_biogenesis)
rib_biogenesis$gene=as.character(rib_biogenesis$gene)
rib_biogenesis$gene=substr(rib_biogenesis$gene,1,nchar(rib_biogenesis$gene)-1)
nucleolus=read.csv('/n/home04/cdadams/ld/ldsc/nucleolus.csv')
nucleolus$gene.=as.character(nucleolus$gene.)
nucleolus$gene=substr(nucleolus$gene,1,nchar(nucleolus$gene)-1)
nucleolus$gene.=NULL
combo=rbind(rib_biogenesis,nucleolus)
combo2=combo %>% distinct(gene, .keep_all = TRUE)
rp_in_nuc=combo2[which(combo2$gene %in% rp_list_expanded$all_RP),]
rp_list_expanded$gene=rp_list_expanded$all_RP
rp_list_expanded$all_RP=NULL
combo3=rbind(combo2,rp_list_expanded)
combo4=combo3 %>% distinct(gene, .keep_all = TRUE)
write.csv(combo4, '/n/home04/cdadams/ld/ldsc/combo4.csv')#I'd copied over another version of this for the initial ldsc analysis 3/10
combo4=read.csv('/n/home04/cdadams/ld/ldsc/combo4.csv')

# Working with the eQTLGen file
setwd('/n/home04/cdadams/ld/ldsc/blood-eQTLs')
# This README accompanies the files with cis-eQTL results from eQTLGen
# 
# Files
# -----
# File with full cis-eQTL results: 2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz
# File with significant (FDR<0.05) cis-eQTL results: 2019-12-11-cis-eQTLsFDR0.05-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz
# 
# Column Names
# ------------
# Pvalue - P-value
# SNP - SNP rs ID
# SNPChr - SNP chromosome
# SNPPos - SNP position
# AssessedAllele - Assessed allele, the Z-score refers to this allele
# OtherAllele - Not assessed allele
# Zscore - Z-score
# Gene - ENSG name (Ensembl v71) of the eQTL gene
# GeneSymbol - HGNC name of the gene
# GeneChr - Gene chromosome
# GenePos - Centre of gene position
# NrCohorts - Total number of cohorts where this SNP-gene combination was tested
# NrSamples - Total number of samples where this SNP-gene combination was tested
# FDR - False discovery rate estimated based on permutations
# BonferroniP - P-value after Bonferroni correction
# 
# Additional information
# ----------------------
# These files contain all cis-eQTL results from eQTLGen, accompanying the article.
# 19,250 genes that showed expression in blood were tested.
# Every SNP-gene combination with a distance <1Mb from the center of the gene and  tested in at least 2 cohorts was included.
# Associations where SNP/proxy positioned in Illumina probe were not removed from combined analysis.
# UPDATE LOG
# ----------
# 2018-10-19: Initial data release
# 2018-12-19: In the current README, following file names have been fixed and updated:
# 2019-12-20: Cis-eQTLs are now updated to have a 2-cohort filter: every cis-eQTL must be tested in at least 2 cohorts to be reported

# On cluster, I gzipped the eqtlGen file: gzip 2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt
# this is because I kept getting error messages both here & on the cluster saying ther was no room on the disk.
# I moved unnecessary files to a folder in my personal computer DB: /Users/charleenadams/Dropbox/Harvard/longevity/survival/odyssey_transfer/
# This has the gz files for the protein GWAS, the eQTLGen SMR file, and the eQTLGen EAF file.
# I launched an interacted terminal session with increased memory: 
# salloc -p test  --mem 5000 -t 0-06:00

eqtlGen=read.table(gzfile("2019-12-11-cis-eQTLsFDR-ProbeLevel-CohortInfoRemoved-BonferroniAdded.txt.gz"), header=TRUE)   

#127341799 expected rows (I think the read-in worked)

#Extract out the NUC genes
nuc.list=read.csv('nuc.list.csv')
nucs=eqtlGen[which(eqtlGen$GeneSymbol %in% nuc.list$gene),]
write.table(nucs, file = "/n/home04/cdadams/ld/ldsc/nucs_eqtlGen.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

gznucs <- gzfile("/n/home04/cdadams/ld/ldsc/nucs.gz", "w")
write.csv(nucs, gznucs)
close(gznucs)

# Format for ldsc (ldsc was null for nucs on bc and longevity)
# Format nuc eQTL for ldsc: 
# ./munge_sumstats.py \
# --sumstats nucs_eqtlGen.txt \
# --snp SNP \
# --ignore SNPChr \
# --a1 AssessedAllele \
# --a2 OtherAllele \
# --N 30596  \
# --out nucs \
# --chunksize 500000 \
# --p Pvalue \
# --signed-sumstats Zscore,0 \
# --merge-alleles w_hm3.snplist 

# Must merge the MAF file (2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt) with the nucs_eqtlGen.txt
# to get the effect allele frequencies

# the allele frequencies are needed for the calcuation of the beta and se from the z scores
# the methods for this are provided by: Zhu 2016: https://www.nature.com/articles/ng.3538

# Beta= z / sqrt(2p(1− p)(n + z^2)) and
# SE= 1 / sqrt(2p(1− p)(n + z^2))
# 
# Where p is the frequency of the imputed SNP, 
# you could use out reference panel to calculate p.

# This README accompanies the file with allele frequencies from eQTLGen cis-eQTL analysis: 
# 2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt.gz
# 
# Column Names
# ------------
#   SNP - SNP rs ID
# hg19_chr - chr number
# hg19_pos - SNP position (hg19)
# AlleleA - Other allele
# AlleleB - Assessed allele
# allA_total - Total allele count of genotype AA
# allAB_total - Total allele count of genotype AB
# allB_total - Total allele count of genotype BB
# AlleleB_all - Allele frequency of assessed allele
# 
# Additional information
# ----------------------
# The allele frequencies were calculated using reported allele counts from all cohorts except 
#Framingham Heart Study, because the related samples are present in this cohort.

#Read-in "2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt" file 
nucs_eqtl_MAF=read.table(file = "/n/home04/cdadams/ld/ldsc/2018-07-18_SNP_AF_for_AlleleB_combined_allele_counts_and_MAF_pos_added.txt", sep = "\t", header=TRUE)
nucs_total <- merge(nucs_eqtlGen,nucs_eqtl_MAF,by="SNP")

n=30596
nucs_total$Beta=nucs_total$Zscore/sqrt(2*(nucs_total$AlleleB_all)*(1-nucs_total$AlleleB_all)*(n+nucs_total$Zscore^2))
nucs_total$SE=1/sqrt(2*(nucs_total$AlleleB_all)*(1-nucs_total$AlleleB_all)*(n+nucs_total$Zscore^2))

# Beta= z / sqrt(2p(1− p)(n + z^2)) and
# SE= 1 / sqrt(2p(1− p)(n + z^2))
# 
# Where p is the frequency of the imputed SNP, 
# you could use out reference panel to calculate p.

write.table(nucs_total, file = "/n/home04/cdadams/ld/ldsc/nucs_total_maf_beta_se.txt", sep = "\t",
            row.names = FALSE, col.names = TRUE, quote=FALSE)

# Read-in for MR formatting
exposure_nuc_eqtlGen <- read_exposure_data(
  filename = "/n/home04/cdadams/ld/ldsc/nucs_total_maf_beta_se.txt",
  sep = '\t',
  snp_col = 'SNP',
  beta_col = 'Beta',
  se_col = 'SE',
  effect_allele_col = 'AssessedAllele',
  #phenotype_col = 'tissue',
  #units_col = 'tissue',
  other_allele_col = 'OtherAllele',
  eaf_col = 'AlleleB_all',
  #samplesize_col = 'samplesize',
  #ncase_col = 'ncase',
  #ncontrol_col = 'ncontrol',
  gene_col = 'GeneSymbol',
  pval_col = 'Pvalue'
)

exposure_nuc_eqtlGen_p=exposure_nuc_eqtlGen[which(exposure_nuc_eqtlGen$pval.exposure<0.000005),]
write.csv(exposure_nuc_eqtlGen_p, 'exposure_nuc_eqtlGen_p.csv')

exposure_nuc_eqtlGen_p=read.csv('exposure_nuc_eqtlGen_p.csv')

# Format GWAS for ldsc: longevity
# ./munge_sumstats.py \
# --sumstats lifegen_phase2_bothpl_alldr_2017_09_18.tsv.gz \
# --snp rsid \
# --ignore snpid \
# --a1 a1 \
# --a2 a0 \
# --N 1012240  \
# --out long_3 \
# --chunksize 500000 \
# --p p \
# --signed-sumstats beta1,0 \
# --merge-alleles w_hm3.snplist 

long_outcome <- read_outcome_data(
  snps = exposure_nuc_eqtlGen_p$SNP,
  filename = "/n/home04/cdadams/ld/ldsc/long_for_mr.txt",
  sep = '\t',
  snp_col = 'rsid',
  beta_col = 'beta1',
  se_col = 'se',
  effect_allele_col = 'a1',
  #phenotype_col = 'tissue',#tissue
  #units_col = 'tissue',
  other_allele_col = 'a0',
  eaf_col = 'freq1',
  #samplesize_col = 'samplesize',
  #ncase_col = 'ncase',
  #ncontrol_col = 'ncontrol',
  #gene_col = 'gene_name',
  pval_col = 'p'
)

#Harmonize the alleles btw the two gwas
nuc_eqtlGen_long <- harmonise_data(exposure_nuc_eqtlGen_p, long_outcome , action = 2)
write.csv(nuc_eqtlGen_long, 'nuc_eqtlGen_long.csv')#can simply read this in for the MR

nuc_eqtlGen_long=read.csv('nuc_eqtlGen_long.csv')
single_res=mr_singlesnp(nuc_eqtlGen_long)
single_res=single_res[order(single_res$SNP),]

single_res2=single_res[3:122829,]#remove the meta-analyses

#Merge the dat and single results files
single_merged=merge(nuc_eqtlGen_long, single_res2, 
                    by=c("SNP", "exposure", "outcome"), all = TRUE)
single_merged2<-subset(single_merged, (!is.na(single_merged$p)))
dim(single_merged2)

#FDR correction
p5=single_merged2$p
fdr5=p.adjust(p5, method = "fdr", n = length(p5))
single_merged2$fdr=fdr5
dim(single_merged2)

#Create the "concordance" variable
single_merged2$concord1=ifelse(single_merged2$beta.exposure<0 & single_merged2$beta.outcome>0, "concord", "no")
single_merged2$concord2=ifelse(single_merged2$beta.exposure>0 & single_merged2$beta.outcome<0, "concord", single_merged2$concord1)
table(single_merged2$concord2)

#Order by pvalue
single_merged2=single_merged2[order(single_merged2$p),]
write.csv(single_merged2, 'nucs_eqtlGen_long_fdr_no_clump.csv')
#######################################################################
# Re-perform clumping the entire set to find the unique findings to be used 
# for prioritizing the gene-level meta-analyses

# Use the harmonized dataset
nuc_eqtlGen_long=read.csv('nuc_eqtlGen_long.csv')

# Clump
nuc_eqtlGen_long_clump=clump_data(nuc_eqtlGen_long)

single_res_clump=mr_singlesnp(nuc_eqtlGen_long_clump)
single_res_clump=single_res_clump[order(single_re_clumps$SNP),]
single_res_clump2=single_res_clump[3:457,]#remove the meta-analyses

# Merge the dat and single results files
single_merged_clump=merge(nuc_eqtlGen_long_clump, single_res_clump2, 
                    by=c("SNP", "exposure", "outcome"), all = TRUE)
single_merged_clump2<-subset(single_merged_clump, (!is.na(single_merged_clump$p)))

# FDR correction
p5_clump=single_merged_clump2$p
fdr5_clump=p.adjust(p5_clump, method = "fdr", n = length(p5_clump))
single_merged_clump2$fdr=fdr5_clump

# Create the "concordance" variable
single_merged_clump2$concord1=ifelse(single_merged_clump2$beta.exposure<0 & single_merged_clump2$beta.outcome>0, "concord", "no")
single_merged_clump2$concord2=ifelse(single_merged_clump2$beta.exposure>0 & single_merged_clump2$beta.outcome<0, "concord", single_merged_clump2$concord1)

# Order by pvalue
single_merged_clump2=single_merged_clump2[order(single_merged_clump2$p),]
write.csv(single_merged_clump2, 'nucs_eqtlGen_long_fdr_CLUMPED.csv')

sigs=single_merged_clump2[which(single_merged_clump2$p<0.05),]
sigs$gene.exposure

# Perform meta-analyses of top findings: some significant findings
# try loop...

# Format the id.exposure column to contain the gene names so that master file is sortable.
nuc_eqtlGen_long$id.exposure=nuc_eqtlGen_long$gene.exposure
nuc_eqtlGen_long$id.outcome="longevity"

# Get the clumped NOP14
big_NOP14=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="NOP14"),] #not sig
big_NOP14=clump_data(big_NOP14)
mr_meta_big_NOP14=mr(big_NOP14)
mr_meta_big_NOP14

# Get the clumped MCRS1
big_MCRS1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MCRS1"),] #not sig
big_MCRS1=clump_data(big_MCRS1)
mr_meta_big_MCRS1=mr(big_MCRS1)
mr_meta_big_MCRS1

# Get the clumped SRFBP1
big_SRFBP1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="SRFBP1"),] #sig but only 1
big_SRFBP1=clump_data(big_SRFBP1)
mr_meta_big_SRFBP1=mr(big_SRFBP1)
mr_meta_big_SRFBP1

# Get the clumped RPL28
big_RPL28=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RPL28"),] #SIGNIFICANT! decreases longevity
big_RPL28=clump_data(big_RPL28)
mr_meta_big_RPL28=mr(big_RPL28)
mr_meta_big_RPL28

# Get the clumped DAP3
big_DAP3=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="DAP3"),] # sig but only 1
big_DAP3=clump_data(big_DAP3)
mr_meta_big_DAP3=mr(big_DAP3)
mr_meta_big_DAP3

# Get the clumped EIF2S1
big_EIF2S1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EIF2S1"),] # sig but only 1
big_EIF2S1=clump_data(big_EIF2S1)
mr_meta_big_EIF2S1=mr(big_EIF2S1)
mr_meta_big_EIF2S1

# Get the clumped NOP14
big_NOP14=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="NOP14"),] # not sig
big_NOP14=clump_data(big_NOP14)
mr_meta_big_NOP14=mr(big_NOP14)
mr_meta_big_NOP14

# Get the clumped FBLL1
big_FBLL1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="FBLL1"),] # sig but only 1
big_FBLL1=clump_data(big_FBLL1)
mr_meta_big_FBLL1=mr(big_FBLL1)
mr_meta_big_FBLL1

big_XPO1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="XPO1"),] # sig but only one
big_XPO1=clump_data(big_XPO1)
mr_meta_big_XPO1=mr(big_XPO1)
mr_meta_big_XPO1

# Get the clumped NSUN5P2
big_NSUN5P2=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="NSUN5P2"),] # not sig 
big_NSUN5P2=clump_data(big_NSUN5P2)
mr_meta_big_NSUN5P2=mr(big_NSUN5P2)
mr_meta_big_NSUN5P2

# Get the clumped ERCC2
big_ERCC2=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="ERCC2"),] # not sig
big_ERCC2=clump_data(big_ERCC2)
mr_meta_big_ERCC2=mr(big_ERCC2)
mr_meta_big_ERCC2

# Get the clumped MRPL35
big_MRPL35=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MRPL35"),] # not sig
big_MRPL35=clump_data(big_MRPL35)
mr_meta_big_MRPL35=mr(big_MRPL35)
mr_meta_big_MRPL35

# Get the clumped NOP58
big_NOP58=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="NOP58"),] # not sig
big_NOP58=clump_data(big_NOP58)
mr_meta_big_NOP58=mr(big_NOP58)
mr_meta_big_NOP58

# Get the clumped EXOSC6
big_EXOSC6=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EXOSC6"),] # signficant! 
big_EXOSC6=clump_data(big_EXOSC6)
mr_meta_big_EXOSC6=mr(big_EXOSC6)
mr_meta_big_EXOSC6

# Get the clumped GTPBP10
big_GTPBP10=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="GTPBP10"),] # not sig
big_GTPBP10=clump_data(big_GTPBP10)
mr_meta_big_GTPBP10=mr(big_GTPBP10)
mr_meta_big_GTPBP10

# Get the clumped HEATR1
big_HEATR1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="HEATR1"),] # sig but only 1
big_HEATR1=clump_data(big_HEATR1)
mr_meta_big_HEATR1=mr(big_HEATR1)
mr_meta_big_HEATR1

# Get the clumped EIF2AK4
big_EIF2AK4=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EIF2AK4"),] # significant!!!
big_EIF2AK4=clump_data(big_EIF2AK4)
mr_meta_big_EIF2AK4=mr(big_EIF2AK4)
mr_meta_big_EIF2AK4

# Get the clumped EIF2A
big_EIF2A=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EIF2A"),] # not sig 
big_EIF2A=clump_data(big_EIF2A)
mr_meta_big_EIF2A=mr(big_EIF2A)
mr_meta_big_EIF2A

# Get the clumped EIF3C
big_EIF3C=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EIF3C"),] # significant!
big_EIF3C=clump_data(big_EIF3C)
mr_meta_big_EIF3C=mr(big_EIF3C)
mr_meta_big_EIF3C

# Get the clumped RPF2
big_RPF2=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RPF2"),] # significant!
big_RPF2=clump_data(big_RPF2)
mr_meta_big_RPF2=mr(big_RPF2)
mr_meta_big_RPF2

# Get the clumped RPL11
big_RPL11=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RPL11"),] # sig but only 1
big_RPL11=clump_data(big_RPL11)
mr_meta_big_RPL11=mr(big_RPL11)
mr_meta_big_RPL11

# Get the clumped NCK1
big_NCK1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="NCK1"),] # sig but only 1
big_NCK1=clump_data(big_NCK1)
mr_meta_big_NCK1=mr(big_NCK1)
mr_meta_big_NCK1

# Get the clumped EXOSC8
big_EXOSC8=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EXOSC8"),] # not sig 
big_EXOSC8=clump_data(big_EXOSC8)
mr_meta_big_EXOSC8=mr(big_EXOSC8)
mr_meta_big_EXOSC8

# Get the clumped MRPL18
big_MRPL18=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MRPL18"),] # not sig 
big_MRPL18=clump_data(big_MRPL18)
mr_meta_big_MRPL18=mr(big_MRPL18)
mr_meta_big_MRPL18

# Get the clumped RPL31
big_RPL31=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RPL31"),] # sig but only 1 
big_RPL31=clump_data(big_RPL31)
mr_meta_big_RPL31=mr(big_RPL31)
mr_meta_big_RPL31

# Get the clumped ETF1
big_ETF1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="ETF1"),] # sig but only 1 
big_ETF1=clump_data(big_ETF1)
mr_meta_big_ETF1=mr(big_ETF1)
mr_meta_big_ETF1

# Get the clumped RPL29
big_RPL29=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RPL29"),] # not sig
big_RPL29=clump_data(big_RPL29)
mr_meta_big_RPL29=mr(big_RPL29)
mr_meta_big_RPL29

# Get the clumped KAT2B
big_KAT2B=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="KAT2B"),] # not sig
big_KAT2B=clump_data(big_KAT2B)
mr_meta_big_KAT2B=mr(big_KAT2B)
mr_meta_big_KAT2B

# Get the clumped SURF6
big_SURF6=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="SURF6"),] # not sig
big_SURF6=clump_data(big_SURF6)
mr_meta_big_SURF6=mr(big_SURF6)
mr_meta_big_SURF6

# Get the clumped APOD
big_APOD=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="APOD"),] # not sig
big_APOD=clump_data(big_APOD)
mr_meta_big_APOD=mr(big_APOD)
mr_meta_big_APOD

# Get the clumped EXOSC8
big_EXOSC8=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="EXOSC8"),] # not sig
big_EXOSC8=clump_data(big_EXOSC8)
mr_meta_big_EXOSC8=mr(big_EXOSC8)
mr_meta_big_EXOSC8

# Get the clumped MRPS33
big_MRPS33=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MRPS33"),] # not sig
big_MRPS33=clump_data(big_MRPS33)
mr_meta_big_MRPS33=mr(big_MRPS33)
mr_meta_big_MRPS33

# Get the clumped MRPL47
big_MRPL47=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MRPL47"),] # sig but only 1
big_MRPL47=clump_data(big_MRPL47)
mr_meta_big_MRPL47=mr(big_MRPL47)
mr_meta_big_MRPL47

# Get the clumped MRPS7
big_MRPS7=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MRPS7"),] # SIGNFICANT
big_MRPS7=clump_data(big_MRPS7)
mr_meta_big_MRPS7=mr(big_MRPS7)
mr_meta_big_MRPS7

# Get the clumped RICTOR
big_RICTOR=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RICTOR"),] # sig but only 1
big_RICTOR=clump_data(big_RICTOR)
mr_meta_big_RICTOR=mr(big_RICTOR)
mr_meta_big_RICTOR

# Get the clumped MPHOSPH6
big_MPHOSPH6=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MPHOSPH6"),] # not sig

big_MPHOSPH6=clump_data(big_MPHOSPH6)

mr_meta_big_MPHOSPH6=mr(big_MPHOSPH6)
mr_meta_big_MPHOSPH6

# Get the clumped SART1
big_SART1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="SART1"),] # sig but only 1
big_SART1=clump_data(big_SART1)
mr_meta_big_SART1=mr(big_SART1)
mr_meta_big_SART1

# Get the clumped LTV1 
big_LTV1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="LTV1"),] # sig but only 1
big_LTV1=clump_data(big_LTV1)
mr_meta_big_LTV1=mr(big_LTV1)
mr_meta_big_LTV1 

# Get the clumped METTL5 
big_METTL5=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="METTL5"),] # not sig
big_METTL5=clump_data(big_METTL5)
mr_meta_big_METTL5=mr(big_METTL5)
mr_meta_big_METTL5 

# Get the clumped FBL 
big_FBL=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="FBL"),] # sig but only 1
big_FBL=clump_data(big_FBL)
mr_meta_big_FBL=mr(big_FBL)
mr_meta_big_FBL 

# Get the clumped BMS1 
big_BMS1=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="BMS1"),] # sig but only 1
big_BMS1=clump_data(big_BMS1)
mr_meta_big_BMS1=mr(big_BMS1)
mr_meta_big_BMS1 

# Get the clumped BMS1 
big_MRPS9=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="MRPS9"),] # not sig
big_MRPS9=clump_data(big_MRPS9)
mr_meta_big_MRPS9=mr(big_MRPS9)
mr_meta_big_MRPS9 

# Get the clumped WDR55 
big_WDR55=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="WDR55"),] # not sig
big_WDR55=clump_data(big_WDR55)
mr_meta_big_WDR55=mr(big_WDR55)
mr_meta_big_WDR55 

# Get the clumped RPL37 
big_RPL37=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="RPL37"),] # sig but only 1
big_RPL37=clump_data(big_RPL37)
mr_meta_big_RPL37=mr(big_RPL37)
mr_meta_big_RPL37 

# Get the clumped ZNHIT3 
big_ZNHIT3=nuc_eqtlGen_long[which(nuc_eqtlGen_long$gene.exposure=="ZNHIT3"),] # not sig
big_ZNHIT3=clump_data(big_ZNHIT3)
mr_meta_big_ZNHIT3=mr(big_ZNHIT3)
mr_meta_big_ZNHIT3 

#Remove the objects in the environment that aren't the gene-level meta-analytic MR results
rm(list= ls()[!(ls() %in% c("mr_meta_big_APOD",      "mr_meta_big_BMS1",       "mr_meta_big_DAP3",       "mr_meta_big_EIF2A",     
                            "mr_meta_big_EIF2AK4",    "mr_meta_big_EIF2S1",     "mr_meta_big_EIF3C",    
                            "mr_meta_big_ERCC2",      "mr_meta_big_ETF1",       "mr_meta_big_EXOSC6",    
                            "mr_meta_big_EXOSC8",     "mr_meta_big_FBL",        "mr_meta_big_FBLL1",     
                            "mr_meta_big_GTPBP10",    "mr_meta_big_HEATR1",     "mr_meta_big_KAT2B",     
                            "mr_meta_big_LTV1",       "mr_meta_big_MCRS1",      "mr_meta_big_METTL5",    
                            "mr_meta_big_MPHOSPH6",   "mr_meta_big_MRPL18",     "mr_meta_big_MRPL35",    
                            "mr_meta_big_MRPL47",     "mr_meta_big_MRPS33",     "mr_meta_big_MRPS7",     
                            "mr_meta_big_MRPS9",      "mr_meta_big_NCK1",       "mr_meta_big_NOP14",     
                            "mr_meta_big_NOP58",      "mr_meta_big_NSUN5P2",    "mr_meta_big_RICTOR",    
                            "mr_meta_big_RPF2",       "mr_meta_big_RPL11",      "mr_meta_big_RPL28",     
                            "mr_meta_big_RPL29",      "mr_meta_big_RPL31",      "mr_meta_big_RPL37",     
                            "mr_meta_big_SART1",      "mr_meta_big_SRFBP1",     "mr_meta_big_SURF6",     
                            "mr_meta_big_WDR55",      "mr_meta_big_XPO1",       "mr_meta_big_ZNHIT3"))])

DF_obj <- lapply(ls(), get)
str(DF_obj)
nms <- c("mr_meta_big_APOD",      "mr_meta_big_BMS1",       "mr_meta_big_DAP3",       "mr_meta_big_EIF2A",     
         "mr_meta_big_EIF2AK4",    "mr_meta_big_EIF2S1",     "mr_meta_big_EIF3C",    
         "mr_meta_big_ERCC2",      "mr_meta_big_ETF1",       "mr_meta_big_EXOSC6",    
         "mr_meta_big_EXOSC8",     "mr_meta_big_FBL",        "mr_meta_big_FBLL1",     
         "mr_meta_big_GTPBP10",    "mr_meta_big_HEATR1",     "mr_meta_big_KAT2B",     
         "mr_meta_big_LTV1",       "mr_meta_big_MCRS1",      "mr_meta_big_METTL5",    
         "mr_meta_big_MPHOSPH6",   "mr_meta_big_MRPL18",     "mr_meta_big_MRPL35",    
         "mr_meta_big_MRPL47",     "mr_meta_big_MRPS33",     "mr_meta_big_MRPS7",     
         "mr_meta_big_MRPS9",      "mr_meta_big_NCK1",       "mr_meta_big_NOP14",     
         "mr_meta_big_NOP58",      "mr_meta_big_NSUN5P2",    "mr_meta_big_RICTOR",    
         "mr_meta_big_RPF2",       "mr_meta_big_RPL11",      "mr_meta_big_RPL28",     
         "mr_meta_big_RPL29",      "mr_meta_big_RPL31",      "mr_meta_big_RPL37",     
         "mr_meta_big_SART1",      "mr_meta_big_SRFBP1",     "mr_meta_big_SURF6",     
         "mr_meta_big_WDR55",      "mr_meta_big_XPO1",       "mr_meta_big_ZNHIT3")
names(DF_obj) = nms
for( i in 1:length(DF_obj))
  write.csv(DF_obj[i],paste0(nms[i],'.csv'))

#Merge all the meta-analytic results
merged <- reduce(DF_obj, full_join)
write.csv(merged, 'metas_eqtlGen_long.csv')