0.1 Data

0.1.1 UK Biobank Pharma Proteomics Project (UKB-PPP) Proteins

2940 summary statistics; Europeans; both GRCh37 (hg19) and GRCh38

  • Inflammation: 736
  • Cardiometabolic: 736
  • Oncology: 735
  • Neurology: 733

0.1.2 METSIM

1391 summary statistics; Finnish men; GRCh38; tsv files

  • Amino Acid: 215
  • Carbohydrate: 25
  • Cofactors and Vitamins: 38
  • Energy: 10
  • Lipid: 548
  • Nucleotide: 42
  • Partially Characterized: 16
  • Peptide: 42
  • Uncharacterized: 292
  • Xenobiotics: 163

0.1.3 Nightingale

249 summary statistics; Europeans; GRCh37; vcf files

I don’t have a breakdown by class, but the files include:

  • Lipoprotein Subclasses:
    • Includes particle concentrations and composition
  • Lipoprotein Particle Size
  • Apolipoprotein A-I and B
  • Multiple Cholesterol and Triglyceride Measures
  • Albumin
  • Various Fatty Acids
  • Low-Molecular-Weight Metabolites:
    • Amino acids (including branched-chain and aromatic)
    • Glycolysis-related measures
    • Ketone bodies

0.1.4 Clinical phenotype(s)

  • Coronary Heart Disease (CHD) GRCh37; vcf file; unformatted and downloaded

| Phenotype | Abbreviation | Dataset | Author | PubMed ID | Sample Size (N) | Number of Cases (N Cases) | Population |
|---|---|---|---|---|---|---|---|
| Coronary Heart Disease | CHD | ieu-a-7 | Nikpay | 26343387 | 184,305 | 60,801 | Mixed |

NOTE: Remember the Jurgens data…

  • e.g., for BMI: “/Users/charleenadams/acy1_bmi/GWAS_sumstats_EUR__invnorm_bmi__TOTALsample.tsv”

0.1.5 HyPrColoc sample overlap considerations

I revisited Staley’s R Markdown example, examined the paper’s supplementary material on their simulation approach, and test-ran the analysis with the matrices required to account for sample overlap: LD matrices, sample overlap matrices, and correlation matrices of betas. Unfortunately, incorporating this adjustment into our pipeline is not feasible.

While generating these matrices dynamically is straightforward, running the model with the correlation matrix makes the runtime impractical. The authors acknowledge this in both the supplement and the R Markdown exercise, cautioning that adjusting for sample overlap may not be necessary. They provide several key arguments against it:

  • Risk of False Negatives – Colocalization is inherently more likely for correlated traits, and adjusting for overlap might remove true signals.

  • False Positives May Be Overstated – Even when assuming no sample overlap, correct colocalization clusters were still detected.

  • Computational Cost vs. Benefit – HyPrColoc was designed for large-scale analyses, but incorporating correlations drastically increases complexity without strong evidence of improving accuracy.

This aligns with a key takeaway from James’s R Markdown exercise:

“When we incorrectly assumed that the traits were measured in studies with no overlapping participants, we correctly detected the three clusters of colocalized traits—at a fraction of the computational cost of accounting for correlations between summary data.”


0.1.6 Future Steps

I have not yet added CHD to the analysis or formatted the data, as I’m prioritizing benchmarking proteins and METSIM metabolites. The plan is to integrate CHD and other clinical phenotypes once that is complete.

The CHD data was generated from the CARDIoGRAMplusC4D chip, which included Europeans and South Asians. The authors of HyPrColoc state that their method assumes all traits in a model derive from the same population, yet they used this dataset themselves. That raises a few questions (in the colloquial sense, not in the petitio principii sense):

  • Which is it? Can HyPrColoc handle traits from different ancestries or not?
  • Was this an oversight or an intentional decision?

UKB-PPP, METSIM, and CHD CARDIoGRAMplusC4D appear reasonably independent. We can consider adding in UKB Nightingale or Nightingale summary statistics from a smaller but more broadly European (vs UKB) cohort. Regardless, METSIM remains the priority for now. Likewise, we could add in proteins subset to the cis-regions of proteins of interest.


0.2 Directories


1 Bioinformatic preparation

1.1 Fetch metabolite summary statistics

1.2 Format UKB-PPP pQTLs

I had previously obtained these (see https://rpubs.com/YodaMendel/1243451 for an example of how to programmatically get files from UKB-PPP).

1.2.1 Untar all 2940

1.2.2 Add rsIDs by CHR

  • ~28 hours run time parallelized, including Mac sleeping overnight 😞
    • Next time use caffeinate -i ./add_rsids.sh
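
As a rough illustration of what the per-chromosome rsID step does, the Python sketch below attaches an rsID to each row by looking up chromosome and position. It is not the contents of add_rsids.sh: the function name `add_rsids`, the column names `CHROM` and `GENPOS`, and the in-memory lookup are assumptions made for the example, and the positions and rsIDs are made up. Matching on position alone also ignores alleles, which a real pipeline would probably need to check.

```python
def add_rsids(rows, lookup, chrom_col="CHROM", pos_col="GENPOS"):
    """Return copies of rows with an 'rsid' field looked up by (chrom, pos).

    Rows without a match get rsid=None so they can be inspected or dropped later.
    Column names are assumptions, not confirmed against the real files.
    """
    annotated = []
    for row in rows:
        key = (str(row[chrom_col]), int(row[pos_col]))
        annotated.append({**row, "rsid": lookup.get(key)})
    return annotated


# Toy data: positions and rsIDs are invented for illustration only.
toy_lookup = {("1", 1000): "rs_toy_1", ("1", 2000): "rs_toy_2"}
toy_rows = [
    {"CHROM": "1", "GENPOS": "1000", "BETA": 0.12},
    {"CHROM": "1", "GENPOS": "3000", "BETA": -0.05},
]
annotated_rows = add_rsids(toy_rows, toy_lookup)
```

Building one lookup per chromosome keeps memory bounded, which is presumably why the step was run by CHR in the first place.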

1.2.3 🚫 DON’T 🚫 merge UKB-PPP chromosome files: draft script!!!

Merging takes too long, so keep the files split by chromosome. The script is retained because it likely works and we might need the chromosomes merged if we try out a non-cis-region approach later.

1.2.4 Create cis-regions

The method I devised below obtains cis-regions (500 kb upstream and downstream of the TSS, using Ensembl) for 2808 of the 2940 files (96%).
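
The window arithmetic behind a cis-region can be sketched in a few lines of Python. This is a minimal sketch, not the method used here: the function names, the `chrom` and `pos` keys, and the toy TSS value are assumptions, and the Ensembl lookup that supplies the TSS is left out.

```python
def cis_window(tss, flank=500_000):
    """Return (start, end) for a window of `flank` bp either side of the TSS.

    The start is clamped at 1 so genes near a chromosome start stay valid.
    """
    return max(1, tss - flank), tss + flank


def subset_to_cis(rows, chrom, tss, flank=500_000):
    """Keep rows on `chrom` whose position falls inside the cis window."""
    start, end = cis_window(tss, flank)
    return [
        r for r in rows
        if str(r["chrom"]) == str(chrom) and start <= int(r["pos"]) <= end
    ]


# Toy rows with made-up positions; the TSS of 1,000,000 is hypothetical.
toy_rows = [
    {"chrom": "1", "pos": 600_000},
    {"chrom": "1", "pos": 1_600_000},
    {"chrom": "2", "pos": 900_000},
]
kept = subset_to_cis(toy_rows, chrom="1", tss=1_000_000)
```

The same `subset_to_cis` call could be reused when subsetting the METSIM files to a protein's cis-region, provided both datasets are on the same genome build.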

1.3 Debugging headers

Adding in the rsIDs created headers with mixed delimiters, so I fixed that.
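
A header with mixed delimiters can be repaired by splitting on any run of candidate delimiters and rejoining with one. The sketch below assumes the mix was spaces, tabs, and commas and that tab is the target delimiter; neither is confirmed here, and `normalize_header` is a hypothetical helper name.

```python
import re


def normalize_header(line, out_delim="\t"):
    """Split a header on runs of spaces, tabs, or commas and rejoin with one delimiter."""
    fields = [f for f in re.split(r"[ \t,]+", line.strip()) if f]
    return out_delim.join(fields)


# Toy header mixing a space, a tab, and a comma.
fixed = normalize_header("CHROM GENPOS\tID,rsid\n")
```

This only makes sense when no column name itself contains a space or comma, so it is worth checking the field count against the first data row afterwards.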

1.4 Format Nightingale VCF files: draft script

2 Analysis with just one cis-region: PCSK9 with p-value threshold criterion (P<5E-8)

2.1 Subset METSIM files by PCSK9 cis-region

2.2 Filter METSIM files on at least one SNP with P<5E-8 in PCSK9 cis-region

2.2.1 Prep for HyPrColoc
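
HyPrColoc takes a matrix of effect estimates and a matrix of standard errors, with SNPs as rows and traits as columns. The Python sketch below shows one way to assemble those two matrices from per-trait lookups restricted to the SNPs every trait shares. It is illustrative only: the analysis itself runs in R, and the function name, input layout, and toy values are assumptions.

```python
def build_matrices(traits):
    """Build SNP-by-trait beta and SE matrices over the SNPs shared by all traits.

    traits: {trait_name: {rsid: (beta, se)}}
    Returns (snp_ids, trait_names, betas, ses); rows follow snp_ids,
    columns follow trait_names.
    """
    if not traits:
        return [], [], [], []
    trait_names = list(traits)
    shared = set.intersection(*(set(t) for t in traits.values()))
    snp_ids = sorted(shared)
    betas = [[traits[t][s][0] for t in trait_names] for s in snp_ids]
    ses = [[traits[t][s][1] for t in trait_names] for s in snp_ids]
    return snp_ids, trait_names, betas, ses


# Toy inputs with made-up effect sizes.
toy_traits = {
    "PCSK9": {"rs1": (0.10, 0.01), "rs2": (0.20, 0.02)},
    "metabolite_x": {"rs2": (0.30, 0.03), "rs3": (0.40, 0.04)},
}
snp_ids, trait_names, betas, ses = build_matrices(toy_traits)
```

Restricting to shared SNPs means a single sparse trait can shrink the region for every other trait, so it may be worth logging how many SNPs each trait contributes before intersecting.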

2.3 Harmonizing PCSK9 and METSIM metabolites with at least one rsID with P<5E-8

We prioritized which METSIM metabolites to examine using a p-value threshold: a metabolite was retained only if at least one SNP in the respective cis-region had a p-value below the threshold.
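
The retention rule can be written as a one-line test per metabolite. The sketch below assumes each metabolite's cis-region p-values are already available as plain floats; the function names and toy values are hypothetical.

```python
def has_signal(pvalues, threshold=5e-8):
    """True if at least one p-value in the region is below the threshold."""
    return any(p < threshold for p in pvalues)


def select_metabolites(region_pvalues, threshold=5e-8):
    """Return names of metabolites with at least one SNP below the threshold.

    region_pvalues: {metabolite_name: [p-values within the cis-region]}
    """
    return [
        name for name, pvals in region_pvalues.items()
        if has_signal(pvals, threshold)
    ]


# Toy p-values, invented for illustration.
toy_pvalues = {"met_a": [0.2, 3e-9], "met_b": [1e-6, 0.5], "met_c": [0.9]}
strict = select_metabolites(toy_pvalues, threshold=5e-8)
lenient = select_metabolites(toy_pvalues, threshold=5e-6)
```

Passing the threshold as a parameter lets the same function serve both the P<5E-8 and the P<5E-6 runs.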

2.4 HyPrColoc of PCSK9 and METSIM metabolites with at least one SNP with P<5E-8

We conducted a colocalization analysis using HyPrColoc to explore shared genetic loci influencing PCSK9 and ?? METSIM metabolites.

3 Analysis of PCSK9 on METSIM metabolites (ignoring p-value threshold criterion)

3.1 Harmonizing PCSK9 and all METSIM metabolites (ignoring p-value criterion)
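
Harmonizing here means making sure every trait reports its effect for the same allele. A minimal Python sketch of that idea follows, assuming each dataset is keyed by rsID with an effect allele, other allele, and beta. It flips the sign of beta when the alleles are swapped and drops SNPs whose alleles do not match; strand flips and palindromic SNPs are not handled, and all names and values are hypothetical rather than taken from the pipeline.

```python
def harmonize(ref, other):
    """Align `other` to the effect allele used in `ref`.

    ref, other: {rsid: (effect_allele, other_allele, beta)}
    Returns {rsid: (effect_allele, other_allele, beta)} for SNPs in both
    datasets, with beta sign-flipped where the alleles were swapped.
    SNPs with non-matching alleles are dropped.
    """
    out = {}
    for rsid, (ea, oa, beta) in other.items():
        if rsid not in ref:
            continue
        ref_ea, ref_oa, _ = ref[rsid]
        if (ea, oa) == (ref_ea, ref_oa):
            out[rsid] = (ea, oa, beta)
        elif (ea, oa) == (ref_oa, ref_ea):
            out[rsid] = (ref_ea, ref_oa, -beta)
    return out


# Toy data with made-up alleles and effects.
toy_ref = {"rs1": ("A", "G", 0.1), "rs2": ("C", "T", 0.2), "rs3": ("A", "C", 0.3)}
toy_other = {
    "rs1": ("A", "G", 0.5),   # same orientation
    "rs2": ("T", "C", 0.4),   # swapped: beta is flipped
    "rs3": ("A", "T", 0.1),   # alleles disagree: dropped
    "rs4": ("G", "A", 0.2),   # not in reference: dropped
}
harmonized = harmonize(toy_ref, toy_other)
```

Standard errors are unaffected by the flip, so they can be carried through unchanged alongside the harmonized betas.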

3.2 HyPrColoc of PCSK9 and all METSIM metabolites (ignoring p-value criterion)

3.2.1 Results: HyPrColoc of PCSK9 and all metabolites (ignoring p-value)

4 Benchmarking loop of selected proteins on METSIM metabolites (P<5E-8 threshold)

4.1 Identify 25 proteins in list from Usman

4.2 Loop to make cis-regions for 24 proteins (minus ACE2)

4.3 Filter METSIM files on at least one SNP with P<5E-8 in respective cis-region

4.4 Loop to harmonize selected cis-region proteins and METSIM metabolites with at least one rsID with P<5E-8

4.5 HyPrColoc of selected proteins and METSIM metabolites (at least one SNP with P<5E-8)

We conducted a colocalization analysis using HyPrColoc to explore shared genetic loci influencing the selected proteins and METSIM metabolites, restricted to METSIM files with at least one SNP at P<5E-8 in the respective cis-region.

5 Benchmarking loop of selected proteins on METSIM metabolites (P<5E-6 threshold)

5.1 Filter METSIM files on at least one SNP with P<5E-6

5.2 Loop to harmonize selected cis-region proteins and METSIM metabolites with at least one rsID with P<5E-6

5.3 HyPrColoc of selected proteins and METSIM metabolites with at least one SNP with P<5E-6

6 Comparison of HyPrColoc results for selected proteins using P<5E-8 and P<5E-6 filters
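
One simple way to compare the two runs is to reduce each to a set of colocalized (protein, metabolite) pairs and take set differences. The sketch below assumes the results have already been reduced to such pairs; the function name and the pairs themselves are invented and are not results from this analysis.

```python
def compare_hits(strict_hits, lenient_hits):
    """Partition colocalized pairs by which threshold run reported them."""
    strict_set, lenient_set = set(strict_hits), set(lenient_hits)
    return {
        "both": strict_set & lenient_set,
        "strict_only": strict_set - lenient_set,
        "lenient_only": lenient_set - strict_set,
    }


# Toy pairs, invented for illustration only.
toy_strict = [("protein_a", "met_1"), ("protein_b", "met_2")]
toy_lenient = [("protein_a", "met_1"), ("protein_b", "met_3")]
comparison = compare_hits(toy_strict, toy_lenient)
```

Pairs found only under the stricter filter are worth a closer look, since relaxing the threshold adds traits to each model and can change how clusters form.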

7 Shiny app to deliver results

RPubs doesn’t let me share files, but this can be done with rsconnect and a Shiny app.

Go here for the results: https://yodamendel.shinyapps.io/hypr_results_deliver/