2940 summary statistics; Europeans; both GRCh19/38
#!/bin/sh
# Directory containing the files
dir_path="/Users/charleenadams/ukbppp"
# Temporary file for storing category counts
temp_counts=$(mktemp)
# Loop through all .tar files in the directory
for file in "$dir_path"/*.tar; do
# Extract the category from the filename
category=$(echo "$file" | sed -E 's/.*_v1_([A-Za-z]+)(_II)?\.tar/\1/')
# Output the category to the temp file
echo "$category" >> "$temp_counts"
done
# Count the occurrences of each category and save to output
sort "$temp_counts" | uniq -c | awk '{print $1, $2}' | sort -nr > "$dir_path/category_counts.txt"
# Clean up
rm "$temp_counts"
echo "Category counts saved to $dir_path/category_counts.txt"
1391 summary statistics; Finnish men; GRCh38; tsv files
249 summary statistics; Europeans; GRCh37; vcf files
Don’t have a breakdown by class, but do know the files include:
| Phenotype | Abbreviation | Dataset | Author | PubMed ID | Sample Size (N) | Number of Cases (N Cases) | Population |
|---|---|---|---|---|---|---|---|
| Coronary Heart Disease | CHD | ieu-a-7 | Nikpay | 26343387 | 184,305 | 60,801 | Mixed |
NOTE: Remember the Jurgens data…
I revisited the Staley’s R Markdown example, examined the paper’s supplementary material on their simulation approach, and test-ran the analysis with the required matrices for accounting for sample overlap—LD matrices, sample overlap matrices, and correlation matrices of betas. Unfortunately, incorporating this adjustment into our pipeline is not feasible.
While generating these matrices dynamically is straightforward, running the model with the correlation matrix is computationally prohibitive—the runtime would be impractical. The authors acknowledge this in both the supplement and the R Markdown exercise, cautioning that adjusting for sample overlap may not be necessary. They provide several key arguments against it:
Risk of False Negatives – Colocalization is inherently more likely for correlated traits, and adjusting for overlap might remove true signals.
False Positives May Be Overstated – Even when assuming no sample overlap, correct colocalization clusters were still detected.
Computational Cost vs. Benefit – HyPrColoc was designed for large-scale analyses, but incorporating correlations drastically increases complexity without strong evidence of improving accuracy.
This aligns with a key takeaway from James’s R Markdown exercise:
“When we incorrectly assumed that the traits were measured in studies with no overlapping participants, we correctly detected the three clusters of colocalized traits—at a fraction of the computational cost of accounting for correlations between summary data.”
I have not yet added CHD to the analysis or formatted the data, as I’m prioritizing benchmarking proteins and METSIM metabolites. The plan is to integrate CHD and other clinical phenotypes once that is complete.
The CHD data was generated from the CARDIoGRAMplusC4D chip, which included Europeans and South Asians. The authors of HyPrColoc state that their method assumes all traits in a model derive from the same population, yet they used this dataset themselves. That raises a few questions (in the colloquial sense, not in the petitio principii sense):
UKP-PPP, METSIM, and CHD CARDIoGRAMplusC4D appear reasonably independent. We can consider adding in UKB Nightingale or Nightingale summary statistics from a smaller but more broadly European (vs UKB) cohort. Regardless, METSIM remains the priority for now. Likewise, we could add in proteins subset to the cis-regions of proteins of interest.
Main: /Users/charleenadams/coloc_clin_mol
UKB-PPP files:
METSIM files: /Users/charleenadams/metsim_files
Nightingale files: /Users/charleenadams/met_d_files
Selected clinical phenotype files: /Users/charleenadams/clinical_files
#!/bin/bash
#nohup ./fetch_met_d_files.sh > fetch_met_d_files.log 2>&1 &
cd /Users/charleenadams/met_d_files
# Directory to save files
output_dir="/Users/charleenadams/met_d_files"
mkdir -p "$output_dir"
cd "$output_dir" || exit
# Base URL
base_url="https://gwas.mrcieu.ac.uk/files/"
# File containing IDs (phenocodes)
phenocode_file="/Users/charleenadams/met_d_files/met_d_ids.tsv"
# Log file for fetched files
log_file="${output_dir}/fetch_met_d_files.log"
touch "$log_file"
# Function to fetch files
fetch_file() {
local prefix="$1"
# Check if both .vcf.gz and .vcf.gz.tbi exist
if [[ ! -f "${prefix}.vcf.gz" || ! -f "${prefix}.vcf.gz.tbi" ]]; then
echo "Fetching: ${prefix}" >> "$log_file"
wget -q "${base_url}${prefix}/${prefix}.vcf.gz" -O "${prefix}.vcf.gz"
wget -q "${base_url}${prefix}/${prefix}.vcf.gz.tbi" -O "${prefix}.vcf.gz.tbi"
echo "Completed: ${prefix}" >> "$log_file"
else
echo "Skipped (already exists): ${prefix}" >> "$log_file"
fi
}
export -f fetch_file
export base_url
export log_file
# Run parallel downloads using IDs from the file
awk '{print $1}' "$phenocode_file" | parallel -j 10 fetch_file {}
#!/bin/bash
#nohup ./fetch_metsim_files.sh > fetch_metsim_files.log 2>&1 &
# Directory to save files
output_dir="/Users/charleenadams/metsim_files"
mkdir -p "$output_dir"
# Base URL for summary statistics download
base_url="https://pheweb.org/metsim-metab/download"
# Phenocode file (list of phenocodes from the TSV file)
phenocode_file="/Users/charleenadams/metsim_files/phecode.tsv"
# Navigate to the output directory
cd "$output_dir"
# Function to download summary statistics
fetch_file() {
local phenocode="$1"
wget -q "${base_url}/${phenocode}" -O "${phenocode}.txt.gz"
}
# Export function and variable for parallel execution
export -f fetch_file
export base_url
# Run parallel downloads using phenocodes from the file
cat "$phenocode_file" | parallel -j 10 fetch_file {}
I had previously obtained these (see https://rpubs.com/YodaMendel/1243451) for an example of how to programmatically get files from UKB-PPP.
#!/bin/bash
# Define the directory containing your tar files
tar_directory="/Users/charleenadams/ukbppp"
# Define the output directory for extracted files
output_directory="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
# Define the log file for tracking progress
log_file="/Users/charleenadams/ukbppp/extraction.log"
# Create the output directory if it doesn't exist
mkdir -p "$output_directory"
# Create or clear the log file
> "$log_file"
# Function to extract a single tar file
extract_tar() {
local tar_file="$1"
local output_directory="$2"
local log_file="$3"
# Log the start of extraction
echo "Starting extraction: $tar_file" >> "$log_file"
# Extract the tar file into the output directory
if tar -xf "$tar_file" -C "$output_directory"; then
echo "Completed extraction: $tar_file" >> "$log_file"
else
echo "Failed extraction: $tar_file" >> "$log_file"
fi
}
export -f extract_tar
# Use `find` to locate all `.tar` files and process them in parallel
find "$tar_directory" -maxdepth 1 -type f -name "*.tar" | parallel -j 16 extract_tar {} "$output_directory" "$log_file"
echo "Extraction complete. Log saved to $log_file."
# Run with:
#chmod +x untar.sh
#nohup ./untar.sh >untar.log 2>&1 &
ps -p 97839 -o stat,etime
caffeinate -i ./add_rsids.sh#!/bin/bash
# Base directory where your data is located
base_dir="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
# Directory where RSID map files are located
map_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/maps"
# Log file to track progress
log_file="$base_dir/add_rsids.log"
echo "Log started on $(date)" > "$log_file"
# Function to process each file
process_file() {
local file="$1"
local dir
dir=$(dirname "$file")
local chrom
chrom=$(basename "$file" | sed -E 's/.*chr([0-9XY]+).*/\1/')
local map_file="$map_dir/olink_rsid_map_mac5_info03_b0_7_chr${chrom}_patched_v2.tsv.gz"
local output_file="${file%.gz}_rsid_added.gz"
{
echo "Processing file: $file (Chromosome $chrom)"
if [[ -n "$chrom" && -f "$map_file" ]]; then
echo " Using RSID map: $map_file"
# Remove existing rsid_added file to avoid overwrite prompts
rm -f "$output_file"
# Join the original file with the RSID map based on the ID column
gunzip -c "$file" | \
awk 'BEGIN {OFS="\t"}
NR==FNR {map[$1]=$4; genpos37[$1]=$5; genpos38[$1]=$6; next}
($3 in map) {print $0, map[$3], genpos37[$3], genpos38[$3]}
!($3 in map) {print $0, ".", ".", "."}' <(gunzip -c "$map_file") - | \
gzip -c > "$output_file"
if [[ -s "$output_file" ]]; then
echo " RSID added with GENPOS37 and GENPOS38. Output saved to $output_file"
# Delete the original file if the RSID-added file was successfully created
rm -f "$file"
echo " Deleted original file: $file"
else
echo " Warning: $output_file is empty! Original file not deleted."
fi
else
echo " Error: RSID map file for chromosome $chrom not found or chromosome not identified."
fi
} >> "$log_file" 2>&1
}
export -f process_file
export map_dir
export log_file
# Find all .gz files and process them in parallel
find "$base_dir" -type f -name "*.gz" | parallel -j16 process_file
echo "Log ended on $(date)" >> "$log_file"
echo "Processing complete. Check $log_file for details."
#Run with:
#nano add_rsids.sh
#chmod +x add_rsids.sh
#nohup ./add_rsids.sh > add_rsids2.log 2>&1 &
#ps -p 54107 -o stat,etime
Takes too long; keep as chromosomes; retained because it likely works and we might need the chromosomes merged if we try out a non-cis-region approach later.
#!/bin/bash
# Define directories
base_dir="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
output_dir="/Users/charleenadams/merged_ukbppp"
log_file="${output_dir}/merge_rsid_files.log"
# Create output directory if it doesn't exist
mkdir -p "$output_dir"
# Start log file
echo "Processing started on $(date)" > "$log_file"
# Function to process each subdirectory
process_directory() {
local dir="$1"
# Extract the subdirectory name for naming the merged file
subdir_name=$(basename "$dir")
merged_file="${output_dir}/${subdir_name}_merged_rsid_added.gz"
echo "Processing directory: $dir" | tee -a "$log_file"
# Find all *_rsid_added.gz files in the current directory
rsid_files=("${dir}"/*_rsid_added.gz)
# Check if there are files to merge
if [[ ${#rsid_files[@]} -gt 0 ]]; then
# Extract header from the first file
gunzip -c "${rsid_files[0]}" | head -n 1 > temp_header.txt
# Process each file, excluding the header, and append to a temporary content file
for file in "${rsid_files[@]}"; do
gunzip -c "$file" | tail -n +2 >> temp_content.txt
done
# Concatenate header and content into the final merged file
cat temp_header.txt temp_content.txt | gzip -c > "$merged_file"
# Clean up temporary files
rm temp_header.txt temp_content.txt
# Confirm if the merged file was created successfully and isn’t empty
if [[ -s "$merged_file" ]]; then
echo " Merged file created: $merged_file" | tee -a "$log_file"
else
echo " Warning: Merged file is empty for directory $dir" | tee -a "$log_file"
fi
else
echo " No '_rsid_added.gz' files found in directory $dir" | tee -a "$log_file"
fi
}
export -f process_directory
export log_file output_dir
# Find all subdirectories in base_dir and process them in parallel
find "$base_dir" -mindepth 1 -maxdepth 1 -type d -print0 | parallel -0 -j 15 process_directory
echo "Processing completed on $(date)" >> "$log_file"
echo "Check $log_file for details."
# Run with:
# Script located at: /Users/charleenadams/coloc_clin_mol/scripts
#nano chrom_merge.sh
#chmod +x chrom_merge.sh
#nohup ./chrom_merge.sh > chrom_merge.log 2>&1 &
#ps -p 20853 -o stat,etime
The method I devised below obtains cis-regions (500KB up and downstream of TSSes using Ensembl) for 2808 of 2940 (96%) of the files.
#!/bin/bash
# Directories
base_dir="/Users/charleenadams/ukbppp/coloc_clin_mol_extracted"
results_dir="/Users/charleenadams/cis_ukbppp"
mkdir -p "$results_dir"
# Log file
log_file="$results_dir/process_files.log"
echo "Processing started on $(date)" > "$log_file"
# Function to process each directory
process_directory() {
local dir="$1"
local protein_name
protein_name=$(basename "$dir" | cut -d '_' -f1) # Extract protein name (e.g., ITGB6)
echo "Processing directory: $dir (Protein: $protein_name)" | tee -a "$log_file"
# Retrieve TSS for the protein using Ensembl REST API
tss_info=$(curl -s "https://rest.ensembl.org/lookup/symbol/human/${protein_name}?content-type=application/json")
if [[ -z "$tss_info" ]]; then
echo " Error: TSS not found for $protein_name" | tee -a "$log_file"
return
fi
tss=$(echo "$tss_info" | jq '.start') # Extract TSS (start position)
chrom=$(echo "$tss_info" | jq -r '.seq_region_name') # Extract chromosome
# Check if TSS and chromosome are valid
if [[ -z "$tss" || -z "$chrom" ]]; then
echo " Error: Invalid TSS or chromosome for $protein_name" | tee -a "$log_file"
return
fi
# Define the region (500kb upstream and downstream)
start=$((tss - 500000))
end=$((tss + 500000))
if ((start < 0)); then start=0; fi
# Locate the chromosome file for the TSS
chrom_file=$(find "$dir" -type f -name "discovery_chr${chrom}_*.gz")
if [[ -z "$chrom_file" ]]; then
echo " Error: Chromosome file not found for $protein_name (chr $chrom)" | tee -a "$log_file"
return
fi
# Filter the chromosome file for the region and save results
output_file="${results_dir}/${protein_name}_cis_region.gz"
gunzip -c "$chrom_file" | awk -v chrom="$chrom" -v start="$start" -v end="$end" \
'NR==1 || ($1 == chrom && $2 >= start && $2 <= end)' | gzip -c > "$output_file"
if [[ -s "$output_file" ]]; then
echo " Cis-region file created: $output_file" | tee -a "$log_file"
else
echo " Warning: Cis-region file is empty for $protein_name" | tee -a "$log_file"
fi
}
export -f process_directory
export results_dir log_file
# Run the process in parallel for all directories
find "$base_dir" -mindepth 1 -maxdepth 1 -type d | parallel -j15 process_directory {}
echo "Processing completed on $(date)" >> "$log_file"
echo "Check $log_file for details."
# Run with:
#nano cis_regions.sh
#chmod +x cis_regions.sh
#nohup ./cis_regions.sh > cis_regions.log 2>&1 &
#ps -p 20853 -o stat,etime
Adding in the rsIDs created headers with mixed delimiters, so I fixed that.
#!/bin/bash
input_dir="/Users/charleenadams/cis_ukbppp"
output_dir="/Users/charleenadams/cis_ukbppp/cleaned"
mkdir -p "$output_dir"
for file in "$input_dir"/*_cis_region.gz; do
if [[ -f "$file" ]]; then
output_file="$output_dir/$(basename "$file" .gz)_cleaned.gz"
echo "Processing $file -> $output_file"
gunzip -c "$file" | \
awk 'BEGIN {OFS="\t"} {for (i=1; i<=NF; i++) $i=$i} 1' | \
gzip > "$output_file"
else
echo "Skipping $file: Not a valid file."
fi
done
echo "Processing complete. Cleaned files are saved in $output_dir."
du -sh /Users/charleenadams/cis_ukbppp/* | grep -v '4.0K' | wc -l
du -sh /Users/charleenadams/cis_ukbppp/* | grep -v '4.0K' > non_4K_files.txt
#!/bin/bash
# Define the directory containing the .vcf.gz files
input_dir="/Users/charleenadams/met_d_files"
output_dir="/Users/charleenadams/met_d_files/formatted"
mkdir -p "$output_dir"
# Log file for tracking progress
log_file="$output_dir/format_vcf.log"
echo "Log started on $(date)" > "$log_file"
# Function to process each file
process_vcf() {
local file="$1"
local base_name
base_name=$(basename "$file" .vcf.gz)
local output_file="${output_dir}/${base_name}_formatted.vcf.gz"
echo "Processing $file..." | tee -a "$log_file"
# Extract the header and process the data
gunzip -c "$file" | \
awk -v OFS="\t" '
BEGIN {process = 0}
/^#CHROM/ {
process = 1;
print "CHROM", "POS19", "ID", "ALLELE0", "ALLELE1", "QUAL", "FILTER", "INFO", "BETA", "SE", "LOG10P", "A1FREQ", "rsid", "GENPOS";
}
process && !/^#/ {
split($9, format_keys, ":");
split($10, format_values, ":");
# Create new columns based on FORMAT field
for (i = 1; i <= length(format_keys); i++) {
if (format_keys[i] == "ES") BETA = format_values[i];
if (format_keys[i] == "SE") SE = format_values[i];
if (format_keys[i] == "LP") LOG10P = format_values[i];
if (format_keys[i] == "AF") A1FREQ = format_values[i];
if (format_keys[i] == "ID") rsid = format_values[i];
}
# Map columns to required headers
print $1, $2, rsid, $4, $5, $6, $7, $8, BETA, SE, LOG10P, A1FREQ, rsid, $2;
}' | gzip -c > "$output_file"
# Confirm processing success
if [[ -s "$output_file" ]]; then
echo " Successfully processed: $output_file" | tee -a "$log_file"
rm "$file"
else
echo " Warning: Processing failed for $file" | tee -a "$log_file"
fi
}
export -f process_vcf
export output_dir
export log_file
# Find all .vcf.gz files (excluding .tbi and .tbi.1 files) and process them in parallel
find "$input_dir" -type f -name "*.vcf.gz" ! -name "*.tbi*" | parallel -j4 process_vcf
echo "Log ended on $(date)" >> "$log_file"
echo "Processing complete. Check $log_file for details."
#!/bin/bash
#run with:
#cd /Users/charleenadams/coloc_clin_mol/scripts
#chmod +x subset_PCSK9_metsim.sh
#nohup ./subset_PCSK9_metsim.sh > subset_PCSK9_metsim.log 2>&1 &
#ps -p 42761 -o stat,etime
# Input files and directories
cis_file="/Users/charleenadams/cis_ukbppp/cleaned/PCSK9_cis_region_cleaned.gz"
metsim_dir="/Users/charleenadams/metsim_files"
output_dir="/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed"
log_file="subset_PCSK9_metsim.log"
rsid_file="pcsk9_rsid_list.txt"
# Create output directory
mkdir -p "$output_dir"
# Extract rsid list from the cis-region file (column 15) and store in a file
gzcat "$cis_file" | awk 'NR > 1 {print $15}' | sort | uniq > "$rsid_file"
if [[ ! -s "$rsid_file" ]]; then
echo "Error: No rsid entries found in $cis_file" >> "$log_file"
exit 1
fi
# Export variables for parallel processing
export metsim_dir output_dir log_file rsid_file
# Function to process a single METSIM file
process_metsim_file() {
local metsim_file=$1
local base_name=$(basename "$metsim_file" .txt.gz)
local output_file="$output_dir/${base_name}_cis_PCSK9.txt"
gzcat "$metsim_file" | awk '
BEGIN {
while ((getline < "'$rsid_file'") > 0) lookup[$1] = 1;
close("'$rsid_file'");
}
NR == 1 {print $0; next} # Print header
$5 in lookup {print $0} # RSIDs are in column 5 in METSIM
' > "$output_file"
if [[ ! -s "$output_file" ]]; then
echo "Warning: No matching rows for $metsim_file" >> "$log_file"
rm "$output_file"
else
echo "Processed $metsim_file" >> "$log_file"
fi
}
export -f process_metsim_file
# Use GNU parallel to process files
find "$metsim_dir" -name "*.gz" | parallel -j $(nproc) process_metsim_file {}
# Ran interactively in Rstudio
# Load necessary libraries
library(data.table)
# Directory where the txt files reside
directory <- "/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed"
# Subdirectory where to copy files with pval < 5e-8
subdirectory <- "p_val_metsim_cis_PCSK9_fixed"
# Ensure subdirectory exists
if (!dir.exists(file.path(directory, subdirectory))) {
dir.create(file.path(directory, subdirectory))
}
# Get all .txt files in the directory
files <- list.files(path = directory, pattern = "\\.txt$", full.names = TRUE)
# Process each file
for (file in files) {
cat("Processing file:", basename(file), "\n")
# Read the file
data <- fread(file)
# Check if any p-value is less than 5e-8
if (any(data$pval < 5e-8, na.rm = TRUE)) {
cat("File", basename(file), "has pval < 5e-8\n")
file.copy(from = file, to = file.path(directory, subdirectory, basename(file)))
cat("Copied", basename(file), "to", subdirectory, "\n")
} else {
cat("No match found in", basename(file), "\n")
}
}
cat("Script completed\n")
Tricky. Do NOT under any circumstances select “update all” when prompted to install newer versions.
# Follow these installation steps exactly
remotes::install_version("Rcpp", version = "1.0.12", repos = "http://cran.us.r-project.org")
remotes::install_version("RcppEigen", version = "0.3.3.9.3", repos = "http://cran.us.r-project.org")
Sys.setenv(CC = "/opt/homebrew/opt/gcc/bin/gcc-14")
Sys.setenv(CXX = "/opt/homebrew/opt/gcc/bin/g++-14")
devtools::install_github("jrs95/hyprcoloc",
build_opts = c("--no-resave-data", "--no-manual"),
build_vignettes = TRUE)
# Load the necessary libraries
library(hyprcoloc)
# Regression coefficients from ten studies (Traits 1-5, 6-8 & 9-10 colocalize)
betas_t <- hyprcoloc::test.betas
ses_t <- hyprcoloc::test.ses
# Trait names and SNP IDs
traits_t <- paste0("T", 1:10)
rsid_t <- rownames(betas_t)
str(rsid_t)
# Colocalization analysis
test <- hyprcoloc(betas_t, ses_t, trait.names = traits_t, snp.id = rsid_t, snpscores = TRUE)
___
# Define number of traits and SNPs
num_traits <- 800
num_snps <- 1062 # Adjust if necessary
# Generate new trait names
traits_t <- paste0("T", 1:num_traits)
# Expand betas_t matrix (adding random noise to mimic new traits)
betas_t_expanded <- cbind(
betas_t, # Existing data
matrix(
rnorm(num_snps * (num_traits - ncol(betas_t)),
mean = rowMeans(betas_t, na.rm = TRUE),
sd = apply(betas_t, 1, sd, na.rm = TRUE)),
nrow = num_snps
)
)
# Expand ses_t matrix (adding noise)
ses_t_expanded <- cbind(
ses_t, # Existing data
matrix(
rnorm(num_snps * (num_traits - ncol(ses_t)),
mean = rowMeans(ses_t, na.rm = TRUE),
sd = apply(ses_t, 1, sd, na.rm = TRUE)),
nrow = num_snps
)
)
# Assign new column names
colnames(betas_t_expanded) <- traits_t
colnames(ses_t_expanded) <- traits_t
# Ensure row names are preserved
rownames(betas_t_expanded) <- rownames(betas_t)
rownames(ses_t_expanded) <- rownames(ses_t)
# Run colocalization analysis with 800 traits
test <- hyprcoloc(betas_t_expanded, ses_t_expanded, trait.names = traits_t, snp.id = rsid_t, snpscores = TRUE)
# Print summary of results
print(test)
We chose to prioritize which MESTIM metabolites to examine using a p-value theshold with the criterion being that at least one SNP in a respective cis-region being less than the threshold.
# Ran interactively in Rstudio
# Could do:
# nohup Rscript ./harmonize_PCSK9_metsim_pval_subsets_parallel.R > harmonize_PCSK9_metsim_pval_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime
# Load required libraries
library(data.table)
library(TwoSampleMR)
# Custom logging function
custom_log <- function(level, message) {
msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
cat(msg, "\n") # Also print to console
}
# Directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed/p_val_metsim_cis_PCSK9_fixed"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/"
# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
tryCatch({
# Read file and convert to data.frame
df <- fread(file_path, sep = "\t", header = TRUE)
df <- as.data.frame(df) # Ensure it is a data.frame
# Format data for TwoSampleMR
formatted_df <- format_data(df,
type = ifelse(is_cis, "exposure", "outcome"),
snp_col = ifelse(is_cis, "rsid", "rsids"),
beta_col = ifelse(is_cis, "BETA", "beta"),
se_col = ifelse(is_cis, "SE", "sebeta"),
effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"),
other_allele_col = ifelse(is_cis, "ALLELE0", "ref"),
eaf_col = ifelse(is_cis, "A1FREQ", "maf"),
pval_col = ifelse(is_cis, "LOG10P", "pval"),
chr_col = ifelse(is_cis, "CHROM", "chrom"),
pos_col = ifelse(is_cis, "POS38", "pos"),
phenotype_col = basename(file_path))
# Set exposure or outcome names
if (is_cis) {
formatted_df$id.exposure <- basename(file_path)
} else {
formatted_df$id.outcome <- basename(file_path)
}
return(formatted_df)
}, error = function(e) {
custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
return(NULL)
})
}
# Main function using TwoSampleMR
main <- function() {
# Create output directory if it doesn't exist
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
# Select the cis-region file
cis_file <- file.path(cis_region_dir, "PCSK9_cis_region_cleaned.gz")
# Get the list of METSIM subset files
metsim_files <- list.files(metsim_dir, pattern = "*.txt", full.names = TRUE)
custom_log("INFO", paste("Processing cis-region file:", cis_file))
custom_log("INFO", paste("Processing METSIM files:", paste(metsim_files, collapse = ", ")))
# Read and format the cis-region file
cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
if (is.null(cis_df)) {
custom_log("ERROR", "Failed to read cis-region file.")
return()
}
# Harmonize each METSIM file with the cis-region file
harmonized_results <- lapply(metsim_files, function(metsim_file) {
metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
if (!is.null(metsim_df)) {
tryCatch({
harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
return(harmonized)
}, error = function(e) {
custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
return(NULL)
})
}
return(NULL)
})
# Combine harmonized results
harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
if (length(harmonized_results) > 0) {
final_results <- rbindlist(harmonized_results)
output_file <- file.path(output_dir, "harmonized_results_PCSK9.txt")
fwrite(final_results, file = output_file, sep = "\t")
custom_log("INFO", paste("Saved harmonized results to", output_file))
} else {
custom_log("WARNING", "No harmonized results were generated.")
}
}
# Run the main function
main()
We conducted a colocalization analysis using HyPrColoc to explore shared genetic loci influencing PCSK9 and ?? METSIM metabolites.
# Run interactively in Rstudio
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)
# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
print(names(phenotype_map)) # Confirm structure of phenotype_map
# Define file path to harmonized results: note, this directory contains the files with at least one SNP P<5E-8
file_path <- "/Users/charleenadams/harmonized_results_fixed/harmonized_results_PCSK9.txt"
#file_path <- "/Users/charleenadams/harmonized_results/harmonized_results_PCSK9.txt"
# Read the harmonized data
data <- fread(file_path)
# Remove all characters after and including the first `_`
data$id.outcome <- sub("_.*", "", data$id.outcome)
data$id.exposure <- sub("_.*", "", data$id.exposure)
# Ensure both data frames are data.tables
setDT(data)
setDT(phenotype_map)
# Merge phenostring from phenotype_map into data using id.outcome and phenocode
data <- merge(data, phenotype_map[, .(phenocode, phenostring)],
by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)
# Rename the phenostring column to METSIM_name
setnames(data, "phenostring", "METSIM_name")
# Remove asterisks (*) from the METSIM_name column
data$METSIM_name <- gsub("\\*", "", data$METSIM_name)
cat("Data read successfully. Total rows:", nrow(data), "\n")
# Ensure the data contains the required columns
required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
if (!all(required_columns %in% colnames(data))) {
stop("The harmonized data is missing required columns!")
}
# Summarize pval.outcome for each id.outcome
pval_summary <- data[, .(
min_pval = min(pval.outcome, na.rm = TRUE),
max_pval = max(pval.outcome, na.rm = TRUE),
mean_pval = mean(pval.outcome, na.rm = TRUE),
median_pval = median(pval.outcome, na.rm = TRUE),
count = .N # Count number of SNPs per id.outcome
), by = id.outcome]
# Display results
print(pval_summary)
pval_summary <- pval_summary[order(pval_summary$min_pval),]
# Extract unique traits and common SNPs
traits <- unique(c(data$id.exposure, data$METSIM_name))
common_snps <- unique(data$SNP)
cat("Number of traits:", length(traits), "Number of SNPs:", length(common_snps), "\n")
# Filter data to include only common SNPs
data <- data[SNP %in% common_snps]
# Reshape data to create a long table for matrix population
long_data <- rbindlist(list(
data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
), use.names = TRUE)
# Ensure no duplicate SNP-trait combinations
long_data <- unique(long_data, by = c("SNP", "trait"))
cat("Rows in reshaped long_data:", nrow(long_data), "\n")
# Create matrices for betas and ses
betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
snps <- betas_mat$SNP
# Convert matrices to the appropriate format for hyprcoloc
rownames(betas_mat) <- betas_mat$SNP
rownames(ses_mat) <- ses_mat$SNP
betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])
# Restore the SNP column as rownames
rownames(betas_mat) <- snps
rownames(ses_mat) <- snps
# Confirm rownames are correctly set
head(rownames(betas_mat))
cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")
# Check for rows with missing values
rows_with_na_betas <- apply(betas_mat, 1, function(row) any(is.na(row)))
rows_with_na_ses <- apply(ses_mat, 1, function(row) any(is.na(row)))
# Filter matrices to keep only rows with no missing values
betas_mat <- betas_mat[!rows_with_na_betas, , drop = FALSE]
ses_mat <- ses_mat[!rows_with_na_ses, , drop = FALSE]
cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")
# Run hyprcoloc
cat("Running hyprcoloc...\n")
betas <- betas_mat
ses <- ses_mat
traits_names <- colnames(betas_mat)
traits_names
snp_ids <- rownames(betas_mat)
# Uniform priors = FALSE
results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE,uniform.priors=FALSE)
# Static table with kable
kable(
results_unipriorsFALSE$results,
caption = "HyprColoc Results with Uniform Priors = FALSE",
align = 'c'
)
# Uniform priors = TRUE
results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors=TRUE)
kable(
results_unipriorsTRUE$results,
caption = "HyprColoc Results with Uniform Priors = TRUE",
align = 'c'
)
# Without assuming uniform priors and with reg.thresh and align.thresh
results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, uniform.priors = FALSE,snpscores = TRUE,
reg.thresh = 0.95, align.thresh = 0.95
)
# Input
traits_string3 <- results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95[["results"]][["traits"]][1]
# Split the string by commas and count the number of elements
num_traits3 <- length(strsplit(traits_string3, ", ")[[1]])
# Print the result
cat("Number of traits:", num_traits3, "\n")
kable(
results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$results,
caption = "HyprColoc Results with Uniform Priors = FALSE and regional and alignment thresholds = 0.95",
align = 'c'
)
# Save the results as tabs in Excel
# Define file path for the output Excel workbook
output_file <- "/Users/charleenadams/pcsk9_hyprcoloc_corrected_results_summary_with_snpscores_fixed.xlsx"
# Initialize a new workbook
wb <- createWorkbook()
# Function to add a result set to the workbook
add_results_to_workbook <- function(workbook, result_data, sheet_name) {
# Extract the results data frame
results_df <- result_data$results
# Create a worksheet and write the data
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, results_df)
# Apply styling
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
fgFill = "#4F81BD", textDecoration = "Bold")
addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(results_df), gridExpand = TRUE)
}
# Function to add SNP scores to the workbook with rsIDs
add_snpscores_to_workbook <- function(workbook, snp_scores, rsIDs, sheet_name) {
# Convert SNP scores to a data table and include rsIDs
snp_scores_df <- as.data.table(snp_scores)
colnames(snp_scores_df) <- paste0("SNP_Score_Set_", seq_len(ncol(snp_scores_df)))
snp_scores_df$rsID <- rsIDs
# Reorder columns to put rsID first
setcolorder(snp_scores_df, c("rsID", setdiff(names(snp_scores_df), "rsID")))
# Create a worksheet and write the data
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, snp_scores_df)
# Apply styling
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
fgFill = "#4F81BD", textDecoration = "Bold")
addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
}
# Add results and SNP scores for each analysis
# Ensure rsIDs are available as rownames in the matrices
rsIDs <- rownames(betas_mat)
# results_unipriorsFALSE
add_results_to_workbook(wb, results_unipriorsFALSE, "Results_Unipriors_FALSE")
add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, rsIDs, "SNP_Scores_Unipriors_FALSE")
# results_unipriorsTRUE
add_results_to_workbook(wb, results_unipriorsTRUE, "Results_Unipriors_TRUE")
add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, rsIDs, "SNP_Scores_Unipriors_TRUE")
# results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95
add_results_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95,
"Unipriors_FALSE_Thresh_95")
add_snpscores_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$snpscores,
rsIDs, "SNP_Scores_Thresh_95")
# Save the workbook
saveWorkbook(wb, output_file, overwrite = TRUE)
# Confirm completion
cat("Results and SNP scores saved to:", output_file, "\n")
# Ran interactively in rstudio
# Could do:
# nohup Rscript ./harmonize_PCSK9_metsim_pval_subsets_parallel.R > harmonize_PCSK9_metsim_pval_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime
# Load required libraries
library(data.table)
library(TwoSampleMR)
# Custom logging function
custom_log <- function(level, message) {
msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
cat(msg, "\n") # Also print to console
}
# Directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files/metsim_cis_PCSK9_fixed/"
output_dir <- "/Users/charleenadams/harmonized_results_fixed_all/"
# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
tryCatch({
# Read file and convert to data.frame
df <- fread(file_path, sep = "\t", header = TRUE)
df <- as.data.frame(df) # Ensure it is a data.frame
# Format data for TwoSampleMR
formatted_df <- format_data(df,
type = ifelse(is_cis, "exposure", "outcome"),
snp_col = ifelse(is_cis, "rsid", "rsids"),
beta_col = ifelse(is_cis, "BETA", "beta"),
se_col = ifelse(is_cis, "SE", "sebeta"),
effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"),
other_allele_col = ifelse(is_cis, "ALLELE0", "ref"),
eaf_col = ifelse(is_cis, "A1FREQ", "maf"),
pval_col = ifelse(is_cis, "LOG10P", "pval"),
chr_col = ifelse(is_cis, "CHROM", "chrom"),
pos_col = ifelse(is_cis, "POS38", "pos"),
phenotype_col = basename(file_path))
# Set exposure or outcome names
if (is_cis) {
formatted_df$id.exposure <- basename(file_path)
} else {
formatted_df$id.outcome <- basename(file_path)
}
return(formatted_df)
}, error = function(e) {
custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
return(NULL)
})
}
# Main function using TwoSampleMR
main <- function() {
# Create output directory if it doesn't exist
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
# Select the cis-region file
cis_file <- file.path(cis_region_dir, "PCSK9_cis_region_cleaned.gz")
# Get the list of METSIM subset files
metsim_files <- list.files(metsim_dir, pattern = "*.txt", full.names = TRUE)
custom_log("INFO", paste("Processing cis-region file:", cis_file))
custom_log("INFO", paste("Processing METSIM files:", paste(metsim_files, collapse = ", ")))
# Read and format the cis-region file
cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
if (is.null(cis_df)) {
custom_log("ERROR", "Failed to read cis-region file.")
return()
}
# Harmonize each METSIM file with the cis-region file
harmonized_results <- lapply(metsim_files, function(metsim_file) {
metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
if (!is.null(metsim_df)) {
tryCatch({
harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
return(harmonized)
}, error = function(e) {
custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
return(NULL)
})
}
return(NULL)
})
# Combine harmonized results
harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
if (length(harmonized_results) > 0) {
final_results <- rbindlist(harmonized_results)
output_file <- file.path(output_dir, "harmonized_results_PCSK9.txt")
fwrite(final_results, file = output_file, sep = "\t")
custom_log("INFO", paste("Saved harmonized results to", output_file))
} else {
custom_log("WARNING", "No harmonized results were generated.")
}
}
# Run the main function
main()
This is technically doable but not feasible for a pipeline. I believe it took over 8 hours.
# Run interactively in Rstudio
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)
# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
print(names(phenotype_map)) # Confirm structure of phenotype_map
# Define file path to harmonized results
file_path <- "/Users/charleenadams/harmonized_results_fixed_all/harmonized_results_PCSK9.txt"
#file_path <- "/Users/charleenadams/harmonized_results/harmonized_results_PCSK9.txt"# legacy
# Read the harmonized data
data <- fread(file_path)
# Remove all characters after and including the first `_`
data$id.outcome <- sub("_.*", "", data$id.outcome)
data$id.exposure <- sub("_.*", "", data$id.exposure)
# Ensure both data frames are data.tables
setDT(data)
setDT(phenotype_map)
# Merge phenostring from phenotype_map into data using id.outcome and phenocode
data <- merge(data, phenotype_map[, .(phenocode, phenostring)],
by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)
# Rename the phenostring column to METSIM_name
setnames(data, "phenostring", "METSIM_name")
# Remove asterisks (*) from the METSIM_name column
data$METSIM_name <- gsub("\\*", "", data$METSIM_name)
cat("Data read successfully. Total rows:", nrow(data), "\n")
# Ensure the data contains the required columns
required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
if (!all(required_columns %in% colnames(data))) {
stop("The harmonized data is missing required columns!")
}
# Extract unique traits and common SNPs
traits <- unique(c(data$id.exposure, data$METSIM_name))
common_snps <- unique(data$SNP)
cat("Number of traits:", length(traits), "Number of SNPs:", length(common_snps), "\n")
# Filter data to include only common SNPs
data <- data[SNP %in% common_snps]
# Reshape data to create a long table for matrix population
long_data <- rbindlist(list(
data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
), use.names = TRUE)
# Ensure no duplicate SNP-trait combinations
long_data <- unique(long_data, by = c("SNP", "trait"))
cat("Rows in reshaped long_data:", nrow(long_data), "\n")
# Create matrices for betas and ses
betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
snps <- betas_mat$SNP
# Convert matrices to the appropriate format for hyprcoloc
rownames(betas_mat) <- betas_mat$SNP
rownames(ses_mat) <- ses_mat$SNP
betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])
# Restore the SNP column as rownames
rownames(betas_mat) <- snps
rownames(ses_mat) <- snps
# Confirm rownames are correctly set
head(rownames(betas_mat))
cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")
# Check for rows with missing values
rows_with_na_betas <- apply(betas_mat, 1, function(row) any(is.na(row)))
rows_with_na_ses <- apply(ses_mat, 1, function(row) any(is.na(row)))
# Filter matrices to keep only rows with no missing values
betas_mat <- betas_mat[!rows_with_na_betas, , drop = FALSE]
ses_mat <- ses_mat[!rows_with_na_ses, , drop = FALSE]
cat("Dimensions of beta matrix:", dim(betas_mat), "\n")
cat("Dimensions of SE matrix:", dim(ses_mat), "\n")
# This noted-out solution assumes no effect of the missing SNP on the metabolite, which may be false. So, I removed all rows with any missing values instead, which would risk eliminating a causal SNP if in a row with some missing values. Perhaps six of one, half a dozen of the other. The same candidate causal SNP is found regardless.
# betas_mat_filtered[is.na(betas_mat_filtered)] <- 0
# ses_mat_filtered[is.na(ses_mat_filtered)] <- 1e6 # Large SE for low confidence
# Run hyprcoloc
cat("Running hyprcoloc...\n")
betas <- betas_mat
ses <- ses_mat
traits_names <- colnames(betas_mat)
traits_names
snp_ids <- rownames(betas_mat)
# Uniform priors = FALSE
results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE,uniform.priors=FALSE)
# Static table with kable
kable(
results_unipriorsFALSE$results,
caption = "HyprColoc Results with Uniform Priors = FALSE",
align = 'c'
)
# Uniform priors = TRUE
results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors=TRUE)
kable(
results_unipriorsTRUE$results,
caption = "HyprColoc Results with Uniform Priors = TRUE",
align = 'c'
)
# Without assuming uniform priors and with reg.thresh and align.thresh
results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, uniform.priors = FALSE,snpscores = TRUE,
reg.thresh = 0.95, align.thresh = 0.95
)
# Input
traits_string3 <- results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95[["results"]][["traits"]][1]
# Split the string by commas and count the number of elements
num_traits3 <- length(strsplit(traits_string3, ", ")[[1]])
# Print the result
cat("Number of traits:", num_traits3, "\n")
kable(
results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$results,
caption = "HyprColoc Results with Uniform Priors = FALSE and regional and alignment thresholds = 0.95",
align = 'c'
)
# Save the results as tabs in Excel
# Define file path for the output Excel workbook
output_file <- "/Users/charleenadams/pcsk9_hyprcoloc_corrected_results_summary_with_snpscores_fixed.xlsx"
# Initialize a new workbook
wb <- createWorkbook()
# Function to add a result set to the workbook
add_results_to_workbook <- function(workbook, result_data, sheet_name) {
# Extract the results data frame
results_df <- result_data$results
# Create a worksheet and write the data
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, results_df)
# Apply styling
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
fgFill = "#4F81BD", textDecoration = "Bold")
addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(results_df), gridExpand = TRUE)
}
# Function to add SNP scores to the workbook with rsIDs
add_snpscores_to_workbook <- function(workbook, snp_scores, rsIDs, sheet_name) {
# Convert SNP scores to a data table and include rsIDs
snp_scores_df <- as.data.table(snp_scores)
colnames(snp_scores_df) <- paste0("SNP_Score_Set_", seq_len(ncol(snp_scores_df)))
snp_scores_df$rsID <- rsIDs
# Reorder columns to put rsID first
setcolorder(snp_scores_df, c("rsID", setdiff(names(snp_scores_df), "rsID")))
# Create a worksheet and write the data
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, snp_scores_df)
# Apply styling
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
fgFill = "#4F81BD", textDecoration = "Bold")
addStyle(workbook, sheet_name, header_style, rows = 1, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
}
# Add results and SNP scores for each analysis
# Ensure rsIDs are available as rownames in the matrices
rsIDs <- rownames(betas_mat)
# results_unipriorsFALSE
add_results_to_workbook(wb, results_unipriorsFALSE, "Results_Unipriors_FALSE")
add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, rsIDs, "SNP_Scores_Unipriors_FALSE")
# results_unipriorsTRUE
add_results_to_workbook(wb, results_unipriorsTRUE, "Results_Unipriors_TRUE")
add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, rsIDs, "SNP_Scores_Unipriors_TRUE")
# results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95
add_results_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95,
"Unipriors_FALSE_Thresh_95")
add_snpscores_to_workbook(wb, results_uniform_priors_FALSE_reg.thresh.95_align.thresh.95$snpscores,
rsIDs, "SNP_Scores_Thresh_95")
# Save the workbook
saveWorkbook(wb, output_file, overwrite = TRUE)
# Confirm completion
cat("Results and SNP scores saved to:", output_file, "\n")
library(data.table)
# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
print(names(phenotype_map)) # Confirm structure of phenotype_map
head(phenotype_map)
Results_Unipriors_FALSE <- fread("/Users/charleenadams/coloc_clin_mol/results/pcsk9/all/Results_Unipriors_FALSE.csv")
Results_Unipriors_FALSE
Results_Unipriors_TRUE <- fread("/Users/charleenadams/coloc_clin_mol/results/pcsk9/all/Results_Unipriors_TRUE.csv")
Results_Unipriors_TRUE
Results_Unipriors_FALSE_Thresh_95 <- fread("/Users/charleenadams/coloc_clin_mol/results/pcsk9/all/Results_Unipriors_FALSE_Thresh_95.csv")
Results_Unipriors_FALSE_Thresh_95
str(Results_Unipriors_FALSE_Thresh_95)
Results_Unipriors_FALSE_Thresh_95$traits
str(Results_Unipriors_FALSE_Thresh_95$traits[1])
split_metabolites <- function(metabolite_string) {
split_list <- c() # Initialize empty list for results
current_part <- "" # Buffer to accumulate valid metabolite names
open_parens <- 0 # Tracks nesting level
prev_char <- "" # Stores the previous character
# Loop through characters
for (char in strsplit(metabolite_string, "")[[1]]) {
if (char == "(") open_parens <- open_parens + 1
if (char == ")") open_parens <- open_parens - 1
# 🔥 Fix: Don't split if the comma is followed by a digit (like 1,2- or 2,3-)
if (char == "," && open_parens == 0 && !grepl("\\d", prev_char)) {
split_list <- c(split_list, trimws(current_part)) # Save previous part
current_part <- "" # Reset buffer
} else {
current_part <- paste0(current_part, char) # Accumulate valid characters
}
prev_char <- char # Update previous character
}
# Add last part
split_list <- c(split_list, trimws(current_part))
return(split_list)
}
# Results_Unipriors_FALSE_Thresh_95$traits[1]
metabolite_string_Results_Unipriors_FALSE_Thresh_95 <- Results_Unipriors_FALSE_Thresh_95$traits[1]
split_list_FALSE_Thresh_95 <- split_metabolites(metabolite_string_Results_Unipriors_FALSE_Thresh_95)
print(split_list_FALSE_Thresh_95)
metabolite_string_Results_Unipriors_FALSE <- Results_Unipriors_FALSE$traits[1]
split_list_FALSE <- split_metabolites(metabolite_string_Results_Unipriors_FALSE)
print(split_list_FALSE)
metabolite_string_Results_Unipriors_TRUE <- Results_Unipriors_TRUE$traits[1]
split_list_TRUE <- split_metabolites(metabolite_string_Results_Unipriors_TRUE)
print(split_list_TRUE)
24/25 available
cd /Users/charleenadams/cis_ukbppp/cleaned
# Find the 25 proteins
ls | grep -E '^(ACY1|APOE|CD36|IL6|ACE2|NTproBNP|LEPR|AGXT|LEP|LRP11|LPL|REN|UMOD|FABP4|IL1RL1|FAP|IL6R|ANGPTL3|APOL1|ASAH1|KLKB1|PDGFRA|SIGLEC9|WFIKKN2|MAP4K5|NPPB)_' | wc -l
# List them
ls | grep -E '^(ACY1|APOE|CD36|IL6|ACE2|NTproBNP|LEPR|AGXT|LEP|LRP11|LPL|REN|UMOD|FABP4|IL1RL1|FAP|IL6R|ANGPTL3|APOL1|ASAH1|KLKB1|PDGFRA|SIGLEC9|WFIKKN2|MAP4K5|NPPB)_'
#!/bin/bash
# Run with:
# cd /Users/charleenadams/coloc_clin_mol/scripts
# chmod +x loop_subset_24_parallel.sh
# nohup ./loop_subset_24_parallel.sh > loop_subset_24_parallel.log 2>&1 &
# ps -p 2562 -o stat,etime
set -x
# Define directories
cis_dir="/Users/charleenadams/cis_ukbppp/cleaned"
metsim_dir="/Users/charleenadams/metsim_files"
output_base_dir="/Users/charleenadams/metsim_files"
# Identify the specific 24 proteins dynamically
# remove ACE2
cis_files=$(ls "$cis_dir" | grep -E '^(ACY1|APOE|CD36|IL6|NTproBNP|LEPR|AGXT|LEP|LRP11|LPL|REN|UMOD|FABP4|IL1RL1|FAP|IL6R|ANGPTL3|APOL1|ASAH1|KLKB1|PDGFRA|SIGLEC9|WFIKKN2|MAP4K5|NPPB)_')
# Loop through each identified cis-region file
for cis_file in $cis_files; do
protein_name=$(echo "$cis_file" | cut -d'_' -f1)
output_dir="${output_base_dir}/${protein_name}_fixed"
rsid_file="${protein_name}_rsid_list.txt"
# Create output directory
mkdir -p "$output_dir"
# Extract RSID list from column 15 of the cis-region file
gzcat "$cis_dir/$cis_file" | awk 'NR > 1 {print $15}' | sort | uniq > "$rsid_file"
# Check if RSID file is empty
if [[ ! -s "$rsid_file" ]]; then
echo "Error: No RSIDs found in $cis_file" >> "${protein_name}.log"
continue
fi
export output_dir rsid_file
# Function to process a single METSIM file
process_metsim_file() {
local metsim_file=$1
local base_name=$(basename "$metsim_file" .txt.gz)
local output_file="$output_dir/${base_name}_cis_${protein_name}.txt"
gzcat "$metsim_file" | awk '
BEGIN {
while ((getline < "'$rsid_file'") > 0) lookup[$1] = 1;
close("'$rsid_file'");
}
NR == 1 {print $0; next} # Print header
$5 in lookup {print $0} # RSIDs are in column 5 in METSIM
' > "$output_file"
if [[ ! -s "$output_file" ]]; then
echo "Warning: No matching rows for $metsim_file" >> "${protein_name}.log"
rm "$output_file"
else
echo "Processed $metsim_file" >> "${protein_name}.log"
fi
}
export -f process_metsim_file
# Use GNU parallel to process METSIM files in parallel
find "$metsim_dir" -name "*.gz" | parallel -j $(nproc) process_metsim_file {}
done
# ran interactively in Rstudio
# Load necessary libraries
library(data.table)
# Parent directory where the protein-specific folders are located
parent_directory <- "/Users/charleenadams/metsim_files"
# Explicit list of valid subdirectories to process
valid_subdirectories <- c("LEP_fixed", "FABP4_fixed", "MAP4K5_fixed", "IL1RL1_fixed",
"REN_fixed", "IL6_fixed", "LRP11_fixed", "ASAH1_fixed", "FAP_fixed",
"APOE_fixed", "ACY1_fixed", "AGXT_fixed", "KLKB1_fixed", "APOL1_fixed",
"NPPB_fixed", "SIGLEC9_fixed", "LEPR_fixed", "UMOD_fixed", "CD36_fixed",
"ANGPTL3_fixed", "WFIKKN2_fixed", "IL6R_fixed", "LPL_fixed", "PDGFRA_fixed")
# Get all valid subdirectories
subdirectories <- file.path(parent_directory, valid_subdirectories)
subdirectories <- subdirectories[dir.exists(subdirectories)]
# Process each subdirectory
for (subdir in subdirectories) {
# Extract the protein name from the directory name
protein_name <- basename(subdir)
# Define the subdirectory for significant p-values
significant_subdir <- file.path(subdir, paste0("p_val_metsim_cis_", protein_name))
# Ensure the subdirectory exists
if (!dir.exists(significant_subdir)) {
dir.create(significant_subdir)
cat("Created directory:", significant_subdir, "\n")
}
# Get all .txt files in the subdirectory
files <- list.files(path = subdir, pattern = "\\.txt$", full.names = TRUE)
# Process each file
for (file in files) {
cat("Processing file:", basename(file), "in", protein_name, "\n")
# Read the file
data <- fread(file)
# Check if any p-value is less than 5e-8
if (any(data$pval < 5e-8, na.rm = TRUE)) {
cat("File", basename(file), "has pval < 5e-8\n")
file.copy(from = file, to = file.path(significant_subdir, basename(file)))
cat("Copied", basename(file), "to", significant_subdir, "\n")
} else {
cat("No match found in", basename(file), "\n")
}
}
}
cat("Script completed\n")
# Ran interactively in rstudio
# could do:
# nohup Rscript ./harmonize_loop24_metsim_pval_subsets_parallel.R > harmonize_loop24_metsim_pval_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime
# Load required libraries
library(data.table)
library(TwoSampleMR)
# Custom logging function
custom_log <- function(level, message) {
msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
cat(msg, "\n") # Also print to console
}
# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
tryCatch({
# Read file and convert to data.frame
df <- fread(file_path, sep = "\t", header = TRUE)
df <- as.data.frame(df) # Ensure it is a data.frame
# Format data for TwoSampleMR
formatted_df <- format_data(df,
type = ifelse(is_cis, "exposure", "outcome"),
snp_col = ifelse(is_cis, "rsid", "rsids"),
beta_col = ifelse(is_cis, "BETA", "beta"),
se_col = ifelse(is_cis, "SE", "sebeta"),
effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"),
other_allele_col = ifelse(is_cis, "ALLELE0", "ref"),
eaf_col = ifelse(is_cis, "A1FREQ", "maf"),
pval_col = ifelse(is_cis, "LOG10P", "pval"),
chr_col = ifelse(is_cis, "CHROM", "chrom"),
pos_col = ifelse(is_cis, "POS38", "pos"),
phenotype_col = basename(file_path))
# Set exposure or outcome names
if (is_cis) {
formatted_df$id.exposure <- basename(file_path)
} else {
formatted_df$id.outcome <- basename(file_path)
}
return(formatted_df)
}, error = function(e) {
custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
return(NULL)
})
}
# Define directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/"
# List of valid protein subdirectories (ensures only the right ones are processed)
valid_proteins <- c("LEP", "FABP4", "MAP4K5", "IL1RL1", "REN", "IL6", "LRP11", "ASAH1", "FAP",
"APOE", "ACY1", "AGXT", "KLKB1", "APOL1", "NPPB", "SIGLEC9", "LEPR",
"UMOD", "CD36", "ANGPTL3", "WFIKKN2", "IL6R", "LPL", "PDGFRA")
# Iterate through each protein's directory
for (protein in valid_proteins) {
# Define paths
protein_fixed_dir <- file.path(metsim_dir, paste0(protein, "_fixed"))
pval_subdir <- file.path(protein_fixed_dir, paste0("p_val_metsim_cis_", protein, "_fixed"))
cis_file <- file.path(cis_region_dir, paste0(protein, "_cis_region_cleaned.gz"))
# Check if pval_subdir exists before proceeding
if (!dir.exists(pval_subdir)) {
custom_log("WARNING", paste("Skipping", pval_subdir, "- Directory does not exist."))
next
}
# Get list of METSIM subset files
metsim_files <- list.files(pval_subdir, pattern = "\\.txt$", full.names = TRUE)
if (length(metsim_files) == 0) {
custom_log("WARNING", paste("No files found in", pval_subdir, "- Skipping."))
next
}
# Read and format the cis-region file
if (!file.exists(cis_file)) {
custom_log("ERROR", paste("Missing cis-region file:", cis_file, "- Skipping."))
next
}
custom_log("INFO", paste("Processing", protein, "with cis-region file:", cis_file))
cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
if (is.null(cis_df)) {
custom_log("ERROR", paste("Failed to read cis-region file for", protein, "- Skipping."))
next
}
# Harmonize each METSIM file with the cis-region file
harmonized_results <- lapply(metsim_files, function(metsim_file) {
metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
if (!is.null(metsim_df)) {
tryCatch({
harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
return(harmonized)
}, error = function(e) {
custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
return(NULL)
})
}
return(NULL)
})
# Combine harmonized results
harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
if (length(harmonized_results) > 0) {
final_results <- rbindlist(harmonized_results)
output_file <- file.path(output_dir, paste0("harmonized_results_", protein, ".txt"))
fwrite(final_results, file = output_file, sep = "\t")
custom_log("INFO", paste("Saved harmonized results for", protein, "to", output_file))
} else {
custom_log("WARNING", paste("No harmonized results generated for", protein))
}
}
cat("Script completed\n")
We conducted a colocalization analysis using HyPrColoc to explore shared genetic loci influencing selected proteins and METSIM metabolites filtered on at least one SNP in the METSIM files have a P<5E-8.
# Run interactively in Rstudio
# Load required libraries
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)
# Define input and output directories
input_dir <- "/Users/charleenadams/harmonized_results_fixed/"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results"
# Ensure output directory exists
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
# Get list of all harmonized result files
harmonized_files <- list.files(input_dir, pattern = "^harmonized_results_.*\\.txt$", full.names = TRUE)
# Function to process each file
process_file <- function(file_path) {
protein_name <- gsub("^harmonized_results_|\\.txt$", "", basename(file_path)) # Extract protein name
cat("Processing file:", file_path, "\n")
# Read the harmonized data
data <- fread(file_path)
# Remove all characters after and including the first `_`
data$id.outcome <- sub("_.*", "", data$id.outcome)
data$id.exposure <- sub("_.*", "", data$id.exposure)
# Merge phenotype names and ensure correct naming format
setDT(data)
setDT(phenotype_map)
data <- merge(data, phenotype_map[, .(phenocode, phenostring)],
by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)
setnames(data, "phenostring", "METSIM_name")
data$METSIM_name <- gsub("\\*", "", data$METSIM_name)
# Append the phenocode to the METSIM_name (e.g., "butyrylcarnitine (C4)_C100001054")
data$METSIM_name <- paste0(data$METSIM_name, "_", data$id.outcome)
# Ensure required columns exist
required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
if (!all(required_columns %in% colnames(data))) {
cat("Skipping", protein_name, "- missing required columns!\n")
return(NULL)
}
# Reshape data to create matrices
long_data <- rbindlist(list(
data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
), use.names = TRUE)
long_data <- unique(long_data, by = c("SNP", "trait"))
# Create beta and SE matrices
betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
snps <- betas_mat$SNP
rownames(betas_mat) <- betas_mat$SNP
rownames(ses_mat) <- ses_mat$SNP
betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])
rownames(betas_mat) <- snps
rownames(ses_mat) <- snps
# Filter matrices to remove missing data rows
rows_with_na <- apply(betas_mat, 1, function(row) any(is.na(row))) | apply(ses_mat, 1, function(row) any(is.na(row)))
betas_mat <- betas_mat[!rows_with_na, , drop = FALSE]
ses_mat <- ses_mat[!rows_with_na, , drop = FALSE]
# Run HyprColoc
cat("Running HyprColoc for", protein_name, "...\n")
betas <- betas_mat
ses <- ses_mat
traits_names <- colnames(betas_mat)
snp_ids <- rownames(betas_mat)
results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE)
results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = TRUE)
results_thresh_95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE, reg.thresh = 0.95, align.thresh = 0.95)
# Define output file path
output_file <- file.path(output_dir, paste0(protein_name, "_hyprcoloc_results.xlsx"))
# Initialize a new workbook
wb <- createWorkbook()
# Style for headers
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", fgFill = "#4F81BD", textDecoration = "Bold", wrapText = TRUE)
# Style for left-aligned titles
title_style <- createStyle(fontSize = 14, halign = "LEFT", textDecoration = "Bold")
# Style for wrapping text in traits column
traits_style <- createStyle(wrapText = TRUE)
# Function to add results with styled header and left-aligned title
add_results_to_workbook <- function(workbook, result_data, sheet_name, title) {
if (!is.null(result_data$results)) {
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
# Count the number of traits in each row
result_data$results$counts <- sapply(result_data$results$traits, function(x) {
if (x == "None") {
return(0) # If no traits are listed
} else {
return(length(strsplit(x, ", ")[[1]]))
}
})
writeData(workbook, sheet_name, result_data$results, startRow = 3)
addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(result_data$results), gridExpand = TRUE)
# Apply wrap text style to 'traits' column
traits_col_index <- which(names(result_data$results) == "traits")
if (length(traits_col_index) > 0) {
addStyle(workbook, sheet_name, traits_style, rows = 4:(nrow(result_data$results) + 3), cols = traits_col_index, gridExpand = TRUE, stack = TRUE)
}
# Auto adjust column widths based on content
for (i in 1:ncol(result_data$results)) {
setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
}
}
}
# Function to add SNP scores with sorting and renaming
add_snpscores_to_workbook <- function(workbook, snp_scores, sheet_name, title) {
if (!is.null(snp_scores)) {
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
# Convert to data.table, remove 'V2' column, add 'rsID', rename 'V1', and sort
snp_scores_df <- as.data.table(snp_scores)
snp_scores_df <- snp_scores_df[, .(V1)] # Keep only V1
snp_scores_df$rsID <- snp_ids
setnames(snp_scores_df, "V1", "SNP_scores")
snp_scores_df <- snp_scores_df[order(-SNP_scores)] # Sort descending
writeData(workbook, sheet_name, snp_scores_df, startRow = 3)
addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
# Auto adjust column widths based on content
for (i in 1:ncol(snp_scores_df)) {
setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
}
}
}
# Save results
add_results_to_workbook(wb, results_unipriorsFALSE, paste0(protein_name, "_Res_Unip_F"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE)"))
add_results_to_workbook(wb, results_unipriorsTRUE, paste0(protein_name, "_Res_Unip_T"), paste("HyprColoc of", protein_name, "(Uniform Priors = TRUE)"))
add_results_to_workbook(wb, results_thresh_95, paste0(protein_name, "_Res_Unip_F_95"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))
add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, paste0(protein_name, "_Scores_Unip_F"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE)"))
add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, paste0(protein_name, "_Scores_Unip_T"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = TRUE)"))
add_snpscores_to_workbook(wb, results_thresh_95$snpscores, paste0(protein_name, "_Scores_Unip_F_95"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))
saveWorkbook(wb, output_file, overwrite = TRUE)
cat("Results saved to:", output_file, "\n")
}
# Process all files
for (file_path in harmonized_files) {
process_file(file_path)
}
cat("All HyprColoc analyses completed!\n")
# ran interactively in Rstudio
# Load necessary libraries
library(data.table)
# Parent directory where the protein-specific folders are located
parent_directory <- "/Users/charleenadams/metsim_files"
# Explicit list of valid subdirectories to process
valid_subdirectories <- c("LEP_fixed", "FABP4_fixed", "MAP4K5_fixed", "IL1RL1_fixed",
"REN_fixed", "IL6_fixed", "LRP11_fixed", "ASAH1_fixed", "FAP_fixed",
"APOE_fixed", "ACY1_fixed", "AGXT_fixed", "KLKB1_fixed", "APOL1_fixed",
"NPPB_fixed", "SIGLEC9_fixed", "LEPR_fixed", "UMOD_fixed", "CD36_fixed",
"ANGPTL3_fixed", "WFIKKN2_fixed", "IL6R_fixed", "LPL_fixed", "PDGFRA_fixed")
# Get all valid subdirectories
subdirectories <- file.path(parent_directory, valid_subdirectories)
subdirectories <- subdirectories[dir.exists(subdirectories)]
# Process each subdirectory
for (subdir in subdirectories) {
# Extract the protein name from the directory name
protein_name <- basename(subdir)
# Define the subdirectory for significant p-values
significant_subdir <- file.path(subdir, paste0("p_val_5E6_metsim_cis_", protein_name))
# Ensure the subdirectory exists
if (!dir.exists(significant_subdir)) {
dir.create(significant_subdir)
cat("Created directory:", significant_subdir, "\n")
}
# Get all .txt files in the subdirectory
files <- list.files(path = subdir, pattern = "\\.txt$", full.names = TRUE)
# Process each file
for (file in files) {
cat("Processing file:", basename(file), "in", protein_name, "\n")
# Read the file
data <- fread(file)
# Check if any p-value is less than 5e-8
if (any(data$pval < 5e-6, na.rm = TRUE)) {
cat("File", basename(file), "has pval < 5e-6\n")
file.copy(from = file, to = file.path(significant_subdir, basename(file)))
cat("Copied", basename(file), "to", significant_subdir, "\n")
} else {
cat("No match found in", basename(file), "\n")
}
}
}
cat("Script completed\n")
# Ran interactively in rstudio
# Could do but DIDN'T:
# nohup Rscript ./harmonize_loop24_metsim_pval_5E6_subsets_parallel.R > harmonize_loop24_metsim_pval_5E6_subsets_parallel.log 2>&1 &
# ps -p 17652 -o stat,etime
# Do this: make directory
# mkdir pv5E6
# Load required libraries
library(data.table)
library(TwoSampleMR)
# Custom logging function
custom_log <- function(level, message) {
msg <- paste(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), level, message)
cat(msg, "\n", file = "benchmark_harm_twosampleMR.log", append = TRUE)
cat(msg, "\n") # Also print to console
}
# Function to read and format data for TwoSampleMR, including chromosome and position
read_and_format_data <- function(file_path, is_cis) {
tryCatch({
# Read file and convert to data.frame
df <- fread(file_path, sep = "\t", header = TRUE)
df <- as.data.frame(df) # Ensure it is a data.frame
# Format data for TwoSampleMR
formatted_df <- format_data(df,
type = ifelse(is_cis, "exposure", "outcome"),
snp_col = ifelse(is_cis, "rsid", "rsids"),
beta_col = ifelse(is_cis, "BETA", "beta"),
se_col = ifelse(is_cis, "SE", "sebeta"),
effect_allele_col = ifelse(is_cis, "ALLELE1", "alt"),
other_allele_col = ifelse(is_cis, "ALLELE0", "ref"),
eaf_col = ifelse(is_cis, "A1FREQ", "maf"),
pval_col = ifelse(is_cis, "LOG10P", "pval"),
chr_col = ifelse(is_cis, "CHROM", "chrom"),
pos_col = ifelse(is_cis, "POS38", "pos"),
phenotype_col = basename(file_path))
# Set exposure or outcome names
if (is_cis) {
formatted_df$id.exposure <- basename(file_path)
} else {
formatted_df$id.outcome <- basename(file_path)
}
return(formatted_df)
}, error = function(e) {
custom_log("ERROR", paste("Error reading file", file_path, ":", e$message))
return(NULL)
})
}
# Define directories
cis_region_dir <- "/Users/charleenadams/cis_ukbppp/cleaned/"
metsim_dir <- "/Users/charleenadams/metsim_files"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/pv5E6"
# List of valid protein subdirectories (ensures only the right ones are processed)
valid_proteins <- c("LEP", "FABP4", "MAP4K5", "IL1RL1", "REN", "IL6", "LRP11", "ASAH1", "FAP",
"APOE", "ACY1", "AGXT", "KLKB1", "APOL1", "NPPB", "SIGLEC9", "LEPR",
"UMOD", "CD36", "ANGPTL3", "WFIKKN2", "IL6R", "LPL", "PDGFRA")
# Iterate through each protein's directory
for (protein in valid_proteins) {
# Define paths
protein_fixed_dir <- file.path(metsim_dir, paste0(protein, "_fixed"))
pval_subdir <- file.path(protein_fixed_dir, paste0("p_val_5E6_metsim_cis_", protein, "_fixed"))
cis_file <- file.path(cis_region_dir, paste0(protein, "_cis_region_cleaned.gz"))
# Check if pval_subdir exists before proceeding
if (!dir.exists(pval_subdir)) {
custom_log("WARNING", paste("Skipping", pval_subdir, "- Directory does not exist."))
next
}
# Get list of METSIM subset files
metsim_files <- list.files(pval_subdir, pattern = "\\.txt$", full.names = TRUE)
if (length(metsim_files) == 0) {
custom_log("WARNING", paste("No files found in", pval_subdir, "- Skipping."))
next
}
# Read and format the cis-region file
if (!file.exists(cis_file)) {
custom_log("ERROR", paste("Missing cis-region file:", cis_file, "- Skipping."))
next
}
custom_log("INFO", paste("Processing", protein, "with cis-region file:", cis_file))
cis_df <- read_and_format_data(cis_file, is_cis = TRUE)
if (is.null(cis_df)) {
custom_log("ERROR", paste("Failed to read cis-region file for", protein, "- Skipping."))
next
}
# Harmonize each METSIM file with the cis-region file
harmonized_results <- lapply(metsim_files, function(metsim_file) {
metsim_df <- read_and_format_data(metsim_file, is_cis = FALSE)
if (!is.null(metsim_df)) {
tryCatch({
harmonized <- harmonise_data(exposure_dat = cis_df, outcome_dat = metsim_df, action = 2)
custom_log("INFO", paste("Harmonized", metsim_file, "with", nrow(harmonized), "rows."))
return(harmonized)
}, error = function(e) {
custom_log("ERROR", paste("Error harmonizing data for", metsim_file, ":", e$message))
return(NULL)
})
}
return(NULL)
})
# Combine harmonized results
harmonized_results <- harmonized_results[!sapply(harmonized_results, is.null)]
if (length(harmonized_results) > 0) {
final_results <- rbindlist(harmonized_results)
output_file <- file.path(output_dir, paste0("harmonized_results_5E6_", protein, ".txt"))
fwrite(final_results, file = output_file, sep = "\t")
custom_log("INFO", paste("Saved harmonized results for", protein, "to", output_file))
} else {
custom_log("WARNING", paste("No harmonized results generated for", protein))
}
}
cat("Script completed\n")
# Run interactively in Rstudio
# Load required libraries
# create new directory for the harmzonized pv_5E6 files
# mv *_pv_5E6* /Users/charleenadams/harmonized_results_fixed/pv5E6
library(data.table)
library(hyprcoloc)
library(openxlsx)
library(DT)
library(knitr)
# Define input and output directories
input_dir <- "/Users/charleenadams/harmonized_results_fixed/pv5E6"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results_pval_5E6"
# Ensure output directory exists
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# Load phenotype mapping data
phenotype_map <- fread("/Users/charleenadams/metsim_files/phenotypes.tsv")
# Get list of all harmonized result files
harmonized_files <- list.files(input_dir, pattern = "^harmonized_results_.*\\.txt$", full.names = TRUE)
# Function to process each file
process_file <- function(file_path) {
protein_name <- gsub("^harmonized_results_|\\.txt$", "", basename(file_path)) # Extract protein name
cat("Processing file:", file_path, "\n")
# Read the harmonized data
data <- fread(file_path)
# Remove all characters after and including the first `_`
data$id.outcome <- sub("_.*", "", data$id.outcome)
data$id.exposure <- sub("_.*", "", data$id.exposure)
# Merge phenotype names and ensure correct naming format
setDT(data)
setDT(phenotype_map)
data <- merge(data, phenotype_map[, .(phenocode, phenostring)],
by.x = "id.outcome", by.y = "phenocode", all.x = TRUE)
setnames(data, "phenostring", "METSIM_name")
data$METSIM_name <- gsub("\\*", "", data$METSIM_name)
# Append the phenocode to the METSIM_name (e.g., "butyrylcarnitine (C4)_C100001054")
data$METSIM_name <- paste0(data$METSIM_name, "_", data$id.outcome)
# Ensure required columns exist
required_columns <- c("SNP", "id.exposure", "METSIM_name", "beta.exposure", "beta.outcome", "se.exposure", "se.outcome")
if (!all(required_columns %in% colnames(data))) {
cat("Skipping", protein_name, "- missing required columns!\n")
return(NULL)
}
# Reshape data to create matrices
long_data <- rbindlist(list(
data[, .(SNP, trait = id.exposure, beta = beta.exposure, se = se.exposure)],
data[, .(SNP, trait = METSIM_name, beta = beta.outcome, se = se.outcome)]
), use.names = TRUE)
long_data <- unique(long_data, by = c("SNP", "trait"))
# Create beta and SE matrices
betas_mat <- dcast(long_data, SNP ~ trait, value.var = "beta", fill = NA)
ses_mat <- dcast(long_data, SNP ~ trait, value.var = "se", fill = NA)
snps <- betas_mat$SNP
rownames(betas_mat) <- betas_mat$SNP
rownames(ses_mat) <- ses_mat$SNP
betas_mat <- as.matrix(betas_mat[, -1, with = FALSE])
ses_mat <- as.matrix(ses_mat[, -1, with = FALSE])
rownames(betas_mat) <- snps
rownames(ses_mat) <- snps
# Filter matrices to remove missing data rows
rows_with_na <- apply(betas_mat, 1, function(row) any(is.na(row))) | apply(ses_mat, 1, function(row) any(is.na(row)))
betas_mat <- betas_mat[!rows_with_na, , drop = FALSE]
ses_mat <- ses_mat[!rows_with_na, , drop = FALSE]
# Run HyprColoc
cat("Running HyprColoc for", protein_name, "...\n")
betas <- betas_mat
ses <- ses_mat
traits_names <- colnames(betas_mat)
snp_ids <- rownames(betas_mat)
results_unipriorsFALSE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE)
results_unipriorsTRUE <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = TRUE)
results_thresh_95 <- hyprcoloc(betas, ses, trait.names = traits_names, snp.id = snp_ids, snpscores = TRUE, uniform.priors = FALSE, reg.thresh = 0.95, align.thresh = 0.95)
# Define output file path
output_file <- file.path(output_dir, paste0(protein_name, "_hyprcoloc_results.xlsx"))
# Initialize a new workbook
wb <- createWorkbook()
# Style for headers
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER", fgFill = "#4F81BD", textDecoration = "Bold", wrapText = TRUE)
# Style for left-aligned titles
title_style <- createStyle(fontSize = 14, halign = "LEFT", textDecoration = "Bold")
# Style for wrapping text in traits column
traits_style <- createStyle(wrapText = TRUE)
# Function to add results with styled header and left-aligned title
add_results_to_workbook <- function(workbook, result_data, sheet_name, title) {
if (!is.null(result_data$results)) {
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
# Count the number of traits in each row
result_data$results$counts <- sapply(result_data$results$traits, function(x) {
if (x == "None") {
return(0) # If no traits are listed
} else {
return(length(strsplit(x, ", ")[[1]]))
}
})
writeData(workbook, sheet_name, result_data$results, startRow = 3)
addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(result_data$results), gridExpand = TRUE)
# Apply wrap text style to 'traits' column
traits_col_index <- which(names(result_data$results) == "traits")
if (length(traits_col_index) > 0) {
addStyle(workbook, sheet_name, traits_style, rows = 4:(nrow(result_data$results) + 3), cols = traits_col_index, gridExpand = TRUE, stack = TRUE)
}
# Auto adjust column widths based on content
for (i in 1:ncol(result_data$results)) {
setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
}
}
}
# Function to add SNP scores with sorting and renaming
add_snpscores_to_workbook <- function(workbook, snp_scores, sheet_name, title) {
if (!is.null(snp_scores)) {
addWorksheet(workbook, sheet_name)
writeData(workbook, sheet_name, data.table(Title = title), startRow = 1, colNames = FALSE)
addStyle(workbook, sheet_name, title_style, rows = 1, cols = 1, gridExpand = TRUE)
# Convert to data.table, remove 'V2' column, add 'rsID', rename 'V1', and sort
snp_scores_df <- as.data.table(snp_scores)
snp_scores_df <- snp_scores_df[, .(V1)] # Keep only V1
snp_scores_df$rsID <- snp_ids
setnames(snp_scores_df, "V1", "SNP_scores")
snp_scores_df <- snp_scores_df[order(-SNP_scores)] # Sort descending
writeData(workbook, sheet_name, snp_scores_df, startRow = 3)
addStyle(workbook, sheet_name, header_style, rows = 3, cols = 1:ncol(snp_scores_df), gridExpand = TRUE)
# Auto adjust column widths based on content
for (i in 1:ncol(snp_scores_df)) {
setColWidths(workbook, sheet = sheet_name, cols = i, widths = "auto")
}
}
}
# Save results
add_results_to_workbook(wb, results_unipriorsFALSE, paste0(protein_name, "_Res_Unip_F"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE)"))
add_results_to_workbook(wb, results_unipriorsTRUE, paste0(protein_name, "_Res_Unip_T"), paste("HyprColoc of", protein_name, "(Uniform Priors = TRUE)"))
add_results_to_workbook(wb, results_thresh_95, paste0(protein_name, "_Res_Unip_F_95"), paste("HyprColoc of", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))
add_snpscores_to_workbook(wb, results_unipriorsFALSE$snpscores, paste0(protein_name, "_Scores_Unip_F"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE)"))
add_snpscores_to_workbook(wb, results_unipriorsTRUE$snpscores, paste0(protein_name, "_Scores_Unip_T"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = TRUE)"))
add_snpscores_to_workbook(wb, results_thresh_95$snpscores, paste0(protein_name, "_Scores_Unip_F_95"), paste("HyprColoc SNP Scores for", protein_name, "(Uniform Priors = FALSE, Thresholds = 0.95)"))
saveWorkbook(wb, output_file, overwrite = TRUE)
cat("Results saved to:", output_file, "\n")
}
# Process all files
for (file_path in harmonized_files) {
process_file(file_path)
}
cat("All HyprColoc analyses completed!\n")
Produces individual Excel files for the proteins with three models (uniform priors = FALSE, uniform priors = TRUE, and uniform priors = FALSE and regional and alignment thresholds = 0.95)
# Ran interactively in Rstudio
# Load required libraries
library(openxlsx)
library(data.table)
# Define directories
dir_p5e8 <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results"
dir_p5e6 <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_results_pval_5E6"
output_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_comparisons_pval_5E8_5E6"
# Ensure output directory exists
if (!dir.exists(output_dir)) {
dir.create(output_dir, recursive = TRUE)
}
# List files in each directory
files_p5e8 <- list.files(dir_p5e8, pattern = "_hyprcoloc_results.xlsx$", full.names = TRUE)
files_p5e6 <- list.files(dir_p5e6, pattern = "_hyprcoloc_results.xlsx$", full.names = TRUE)
# Extract protein names
protein_names_p5e8 <- sub("_hyprcoloc_results.xlsx", "", basename(files_p5e8))
protein_names_p5e6 <- sub("5E6_", "", sub("_hyprcoloc_results.xlsx", "", basename(files_p5e6)))
# Identify common proteins
common_proteins <- intersect(protein_names_p5e8, protein_names_p5e6)
# Function to process HyprColoc results
process_hyprcoloc_results <- function(file_path) {
wb <- loadWorkbook(file_path)
sheet_names <- names(wb)
# Identify sheets that contain results
result_sheets <- sheet_names[grep("_Res_Unip_", sheet_names)]
if (length(result_sheets) == 0) {
return(NULL)
}
results_list <- list()
for (sheet in result_sheets) {
data <- read.xlsx(file_path, sheet = sheet, startRow = 3) # Skip first 2 rows
if (!all(c("iteration", "traits", "posterior_prob", "posterior_explained_by_snp", "candidate_snp", "regional_prob") %in% colnames(data))) {
next
}
# Clean and process data
setDT(data)
data <- data[, .(iteration, traits, candidate_snp, posterior_prob, posterior_explained_by_snp, regional_prob)]
# Remove square brackets from traits
data[, traits := gsub("\\[.*?\\]", "", traits)]
# Count traits
data[, num_traits := sapply(strsplit(traits, ", "), length)]
# Wrap traits every 2 per line for readability
data[, traits := sapply(strsplit(traits, ", "), function(x) paste(x, collapse = ifelse(length(x) > 2, "\n", ", ")))]
results_list[[sheet]] <- data
}
return(results_list)
}
# Function to compare two datasets
compare_hyprcoloc_results <- function(protein, file1, file2) {
data_p5e8 <- process_hyprcoloc_results(file1)
data_p5e6 <- process_hyprcoloc_results(file2)
if (is.null(data_p5e8) | is.null(data_p5e6)) {
cat("Skipping", protein, "- No comparable HyprColoc results found.\n")
return(NULL)
}
comparison_results <- list()
for (sheet in names(data_p5e8)) {
sheet_p5e6 <- paste0("5E6_", sheet)
if (!(sheet_p5e6 %in% names(data_p5e6))) {
next
}
df1 <- data_p5e8[[sheet]]
df2 <- data_p5e6[[sheet_p5e6]]
# Merge results by iteration
merged_df <- merge(df1, df2, by = "iteration", suffixes = c("_p5e8", "_p5e6"), all = TRUE)
# Compare traits
merged_df[, match_traits := traits_p5e8 == traits_p5e6]
# Compare SNP and posterior probability
merged_df[, match_snp := candidate_snp_p5e8 == candidate_snp_p5e6]
merged_df[, posterior_prob_diff := posterior_explained_by_snp_p5e8 - posterior_explained_by_snp_p5e6]
# Add regional_prob columns
merged_df[, regional_prob_diff := regional_prob_p5e8 - regional_prob_p5e6]
comparison_results[[sheet]] <- merged_df
}
return(comparison_results)
}
# Formatting: Dark lavender header, bold white font, center-aligned columns
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
fgFill = "#69306D", textDecoration = "Bold", wrapText = TRUE)
centered_style <- createStyle(halign = "CENTER")
wrapped_text_style <- createStyle(wrapText = TRUE)
# Loop through all common proteins and compare results
for (protein in common_proteins) {
file1 <- file.path(dir_p5e8, paste0(protein, "_hyprcoloc_results.xlsx"))
file2 <- file.path(dir_p5e6, paste0("5E6_", protein, "_hyprcoloc_results.xlsx"))
comparison <- compare_hyprcoloc_results(protein, file1, file2)
if (!is.null(comparison)) {
output_file <- file.path(output_dir, paste0(protein, "_hyprcoloc_comparison.xlsx"))
wb <- createWorkbook()
for (sheet in names(comparison)) {
addWorksheet(wb, sheet)
writeData(wb, sheet, comparison[[sheet]], startRow = 2, colNames = TRUE)
# Apply header formatting
addStyle(wb, sheet, header_style, rows = 2, cols = 1:ncol(comparison[[sheet]]), gridExpand = TRUE)
# Apply center alignment
addStyle(wb, sheet, centered_style, rows = 3:(nrow(comparison[[sheet]]) + 2), cols = 1:ncol(comparison[[sheet]]), gridExpand = TRUE)
# Set column width
setColWidths(wb, sheet, cols = 1:ncol(comparison[[sheet]]), widths = 25)
# Apply wrapped text for trait columns
trait_cols <- grep("traits", names(comparison[[sheet]]))
if (length(trait_cols) > 0) {
addStyle(wb, sheet, wrapped_text_style, rows = 3:(nrow(comparison[[sheet]]) + 2), cols = trait_cols, gridExpand = TRUE)
}
}
saveWorkbook(wb, output_file, overwrite = TRUE)
cat("Saved comparison for", protein, "to", output_file, "\n")
}
}
cat("All comparisons completed!\n")
This script collects the results for the proteins from the results just above and puts them in tabs by name of protein.
# Load required libraries
library(openxlsx)
library(data.table)
# Define directories
comparison_dir <- "/Users/charleenadams/harmonized_results_fixed/hyprcoloc_comparisons_pval_5E8_5E6"
# Define output file names for each tab
output_top_file_F <- "/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver/www/hyprcoloc_comparisons_unipF_top.xlsx"
output_top_file_T <- "/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver/www/hyprcoloc_comparisons_unipT_top.xlsx"
output_top_file_F_95 <- "/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver/www/hyprcoloc_comparisons_unipF_95_top.xlsx"
# List all comparison files, excluding temporary files
comparison_files <- list.files(comparison_dir, pattern = "_hyprcoloc_comparison.xlsx$", full.names = TRUE)
comparison_files <- comparison_files[!grepl("~$", comparison_files)]
# Function to process each tab and create separate results files
process_comparison_files <- function(comparison_files, tab_index, output_file) {
wb <- createWorkbook()
# Header Formatting: Dark Lavender, Bold White Font
header_style <- createStyle(fontSize = 12, fontColour = "#FFFFFF", halign = "CENTER",
fgFill = "#69306D", textDecoration = "Bold", wrapText = TRUE)
centered_style <- createStyle(halign = "CENTER")
wrapped_text_style <- createStyle(wrapText = TRUE)
summary_data <- list()
### 🔹 First, Create the "Top_Hits" Worksheet
addWorksheet(wb, "Top_Hits")
for (file in comparison_files) {
protein_name <- sub("_hyprcoloc_comparison.xlsx", "", basename(file))
# Try loading the workbook, catch errors
wb_comp <- tryCatch({
loadWorkbook(file)
}, error = function(e) {
message(paste("Skipping", file, "- Could not load workbook:", e$message))
return(NULL)
})
if (is.null(wb_comp)) {
next
}
sheet_names <- names(wb_comp)
# Check if the tab index is valid
if (length(sheet_names) >= tab_index) {
sheet <- sheet_names[tab_index]
# Read data directly from the specified tab of the Excel file
data <- tryCatch({
read.xlsx(file, sheet = sheet, colNames = TRUE, detectDates = FALSE)
}, error = function(e) {
message(paste("Skipping sheet:", sheet, "in", file, "- Error reading sheet:", e$message))
return(NULL)
})
if (is.null(data)) {
message(paste("Data is NULL after reading the sheet", sheet, "from", file))
next
}
# Convert to data.table
setDT(data)
# Ensure required columns exist before filtering
required_cols <- c("traits_p5e8", "posterior_prob_p5e8", "posterior_explained_by_snp_p5e8",
"regional_prob_p5e8", "candidate_snp_p5e8",
"traits_p5e6", "posterior_prob_p5e6", "posterior_explained_by_snp_p5e6",
"regional_prob_p5e6", "candidate_snp_p5e6")
missing_cols <- setdiff(required_cols, names(data))
if (length(missing_cols) > 0) {
message(paste("Skipping summary extraction for", protein_name, "due to missing columns:", paste(missing_cols, collapse=", ")))
next
}
# Filter traits where posterior_prob, regional_prob, and posterior_explained_by_snp are all >= 0.8
top_results_p5e8 <- data[posterior_prob_p5e8 >= 0.8 &
posterior_explained_by_snp_p5e8 >= 0.8 &
regional_prob_p5e8 >= 0.8,
.(protein = protein_name,
traits = traits_p5e8,
pval_threshold = "5E-8",
candidate_snp = candidate_snp_p5e8,
posterior_probability = posterior_prob_p5e8,
posterior_prob_explained_by_snp = posterior_explained_by_snp_p5e8,
regional_prob = regional_prob_p5e8,
includes_protein = ifelse(grepl(protein_name, traits_p5e8, ignore.case = TRUE), "Yes", "No"))]
top_results_p5e6 <- data[posterior_prob_p5e6 >= 0.8 &
posterior_explained_by_snp_p5e6 >= 0.8 &
regional_prob_p5e6 >= 0.8,
.(protein = protein_name,
traits = traits_p5e6,
pval_threshold = "5E-6",
candidate_snp = candidate_snp_p5e6,
posterior_probability = posterior_prob_p5e6,
posterior_prob_explained_by_snp = posterior_explained_by_snp_p5e6,
regional_prob = regional_prob_p5e6,
includes_protein = ifelse(grepl(protein_name, traits_p5e6, ignore.case = TRUE), "Yes", "No"))]
# Combine results
top_results <- rbind(top_results_p5e8, top_results_p5e6)
if (nrow(top_results) > 0) {
summary_data[[protein_name]] <- top_results
}
# Add worksheet for this protein
addWorksheet(wb, protein_name)
writeData(wb, protein_name, data, startRow = 1, colNames = TRUE)
# Apply formatting
addStyle(wb, protein_name, header_style, rows = 1, cols = 1:ncol(data), gridExpand = TRUE)
addStyle(wb, protein_name, centered_style, rows = 2:(nrow(data) + 1), cols = 1:ncol(data), gridExpand = TRUE)
# Wrap text for trait columns
trait_cols <- grep("traits", names(data))
if (length(trait_cols) > 0) {
addStyle(wb, protein_name, wrapped_text_style, rows = 2:(nrow(data) + 1), cols = trait_cols, gridExpand = TRUE)
}
# Set column width
setColWidths(wb, protein_name, cols = 1:ncol(data), widths = 25)
} else {
message(paste("Tab index", tab_index, "does not exist for", file))
}
}
# Populate "Top_Hits" tab
if (length(summary_data) > 0) {
summary_df <- rbindlist(summary_data, use.names = TRUE, fill = TRUE)
writeData(wb, "Top_Hits", summary_df, startRow = 1, colNames = TRUE)
# Apply formatting to summary tab
addStyle(wb, "Top_Hits", header_style, rows = 1, cols = 1:ncol(summary_df), gridExpand = TRUE)
addStyle(wb, "Top_Hits", centered_style, rows = 2:(nrow(summary_df) + 1), cols = 1:ncol(summary_df), gridExpand = TRUE)
# Wrap text for trait column
addStyle(wb, "Top_Hits", wrapped_text_style, rows = 2:(nrow(summary_df) + 1), cols = 2, gridExpand = TRUE)
# Set column width
setColWidths(wb, "Top_Hits", cols = 1:ncol(summary_df), widths = 25)
}
# Save final summary file
saveWorkbook(wb, output_file, overwrite = TRUE)
cat("Top hits Excel file created at:", output_file, "\n")
}
# Process for each tab
process_comparison_files(comparison_files, 1, output_top_file_F)
process_comparison_files(comparison_files, 2, output_top_file_T)
process_comparison_files(comparison_files, 3, output_top_file_F_95)
Rpubs doesn’t let me share files, but this can be done with rsconnect and a shiny app.
Go here for the results: https://yodamendel.shinyapps.io/hypr_results_deliver/
# Run with:
# rsconnect::deployApp('/Users/charleenadams/coloc_clin_mol/scripts/hypr_results_deliver')
library(shiny)
library(DT)
library(openxlsx)
library(data.table)
# Define file paths relative to www/
file1 <- "www/hyprcoloc_comparisons_unipT_top.xlsx"
file2 <- "www/hyprcoloc_comparisons_unipF_95_top.xlsx"
file3 <- "www/hyprcoloc_comparisons_unipF_top.xlsx"
# Function to read "Top_Hits" sheet from Excel
read_top_hits <- function(file_path) {
if (!file.exists(file_path)) {
return(NULL)
}
wb <- loadWorkbook(file_path)
sheet_names <- names(wb)
if (!"Top_Hits" %in% sheet_names) {
return(NULL)
}
data <- read.xlsx(file_path, sheet = "Top_Hits", colNames = TRUE)
if (nrow(data) == 0) {
return(NULL)
}
setDT(data)
# Remove unwanted columns
data <- data[, !c("regional_prob_p5e8", "regional_prob_p5e6", "Uniform_Prior"), with = FALSE]
return(data)
}
# Read data
data_F <- read_top_hits(file3)
data_T <- read_top_hits(file1)
data_F_95 <- read_top_hits(file2)
# UI
ui <- fluidPage(
# Professional Title Section
div(style = "text-align: center; margin-bottom: 20px;",
h1("HyPrColoc Results"),
h4("Colocalization Analysis of UK Biobank Proteomics & METSIM Metabolomics"),
tags$a(href = "https://www.bidmc.org/research/research-by-department/medicine/cardiovascular-medicine/personal-genomics-and-cardiometabolic-disease",
target = "_blank",
"Our Lab",
style = "font-size: 18px; color: #0033cc; text-decoration: none; font-weight: bold;")
),
sidebarLayout(
# Reduce sidebar width to make tables larger
sidebarPanel(
width = 3, # Reduced from default (4) to 3
h3("Instructions"), # Match font size with Interactive Tables
p("Click the buttons below to download full results.", br(),
"Each file will have today's date appended."),
hr(),
h3("Download Results"),
downloadButton("download_file1", "Download hyprcoloc_comparisons_unipT"),
br(), br(),
downloadButton("download_file2", "Download hyprcoloc_comparisons_unipF_95"),
br(), br(),
downloadButton("download_file3", "Download hyprcoloc_comparisons_unipF"),
hr(),
# OBJECTIVE BOX
div(style = "border: 1px solid #ccc; padding: 10px; background-color: #f9f9f9;",
strong("Objective"),
p("The primary objective of this statistical analysis is to develop and implement a pipeline for performing
colocalization analysis using HyPrColoc on molecular and clinical phenotypes. This includes proteomics
(UK Biobank Pharma Proteomics Project - UKB-PPP), metabolomics (METSIM and Nightingale), and clinical disease
traits such as coronary heart disease (CHD). The results presented here are for selected proteins and METSIM
metabolites within a 1MB region around the transcription start site (TSS) of the respective proteins.")
)
),
# Increase the main panel width
mainPanel(
width = 9, # Increased from default (8) to 9
h3("Interactive Tables"),
h5("Selected results (posterior probability & posterior probability explained by SNP ≥ 0.8)"),
tabsetPanel(
tabPanel("Uniform Priors = FALSE",
DTOutput("table_F")),
tabPanel("Uniform Priors = TRUE",
DTOutput("table_T")),
tabPanel("Uniform Priors = FALSE; thresholds = 0.95",
DTOutput("table_F_95"))
)
)
),
# BIDMC Logo - Bottom Left of Page
div(style = "position: fixed; bottom: 10px; left: 10px;",
img(src = "BIDMC_HMS_Stacked-LockUp.png", width = "125px") # Adjust width as needed
)
)
# Server
server <- function(input, output) {
# Download Handlers
output$download_file1 <- downloadHandler(
filename = function() { paste0("hyprcoloc_comparisons_unipT_", Sys.Date(), ".xlsx") },
content = function(file) { file.copy(file1, file) }
)
output$download_file2 <- downloadHandler(
filename = function() { paste0("hyprcoloc_comparisons_unipF_95_", Sys.Date(), ".xlsx") },
content = function(file) { file.copy(file2, file) }
)
output$download_file3 <- downloadHandler(
filename = function() { paste0("hyprcoloc_comparisons_unipF_", Sys.Date(), ".xlsx") },
content = function(file) { file.copy(file3, file) }
)
# Define table styling with alternating light lavender rows
datatable_options <- list(
dom = 'Blfrtip', # Adds a search button
pageLength = 25, # Default to 25 rows per page
scrollX = TRUE, # Enable horizontal scrolling
autoWidth = TRUE, # Auto adjust column width
rowCallback = JS(
"function(row, data, index) {
if (index % 2 === 0) {
$(row).css('background-color', '#E6E6FA'); // Light lavender
}
}"
)
)
# Render DataTables
output$table_F <- renderDT({
if (!is.null(data_F)) {
datatable(data_F, options = datatable_options, rownames = FALSE, filter = "top")
} else {
datatable(data.frame(Message = "No data available"), options = list(dom = 't'))
}
})
output$table_T <- renderDT({
if (!is.null(data_T)) {
datatable(data_T, options = datatable_options, rownames = FALSE, filter = "top")
} else {
datatable(data.frame(Message = "No data available"), options = list(dom = 't'))
}
})
output$table_F_95 <- renderDT({
if (!is.null(data_F_95)) {
datatable(data_F_95, options = datatable_options, rownames = FALSE, filter = "top")
} else {
datatable(data.frame(Message = "No data available"), options = list(dom = 't'))
}
})
}
# Run app
shinyApp(ui = ui, server = server)