Genetic Correlations Between Olink Chemokines

1 Introduction
2 UKB-PPP
3 Getting the Data
4 Other ones? 🎣
5 Getting the Chemokine Summary Statistics from Synapse
6 Data Preparation
7 Count Chemokine Files and Untar
8 Merge in the RSIDs
9 Check Progress of RSIDs Being Added
10 Loop Through Each (RSID-Annotated) Chemokine Sudirectory and Merge the Chromosome Files
11 LD Scores and the HapMap3 SNPlist
12 Add a P (p-value) Column
13 Try Munging One Chemokine Summary Statistics File
14 Munge Loop for the 33 Chemokine Summary Statistics
15 Run LDSC for Pairwise Combinations of Chemokines
16 Process the Results Files (correlation.logs)
17 Clean Names of P1 and P2 for Chemokines
18 Results: Genetic Correlations
19 Visualize the Chemokine Genetic Correlations
20 Summary
21 Next Steps?
22 Exploring Mutually Significant Variants for Chemokine Pairs
23 Obtain the Commonly Significant SNPs Across the Chemokines
24 Results: Commonly Significant SNPs for Chemokines
25 Pairwise Heatmap for Number of Shared SNPs (P<5e11) Between Chemokines
26 Are the Chemokines on the Same Chromosome?
27 Signficant SNP-TSS Chromosome Alignment
28 Sun’s Significant Chemokines by RSID (P<1.7xe-11)
29 Comparison of Sun’s Significant SNPs Across Chemokines
30 Updated Comparison of Sun Top Hits

1 Introduction

This pipeline uses linkage disequilibrium score regression (LDSC) to calculate genetic correlations (rg) between 33 Olink chemokines.

1.1 Note 1

LDSC requires linkage disequilibrium (LD) scores based on r² values, unlike SuSiE and coloc, which rely on the signed correlation coefficient (r) in matrix form. These LD scores are also aggregated and weighted. We are starting with the UK Biobank Pharma Proteomics Project (UKB-PPP) summary statistics for Europeans because precomputed LD scores for Europeans are available from the Alkes Lab. For our MESA and JHS populations, we will need to calculate these scores ourselves, which is more complex than generating LD matrices and could require substantial storage space. Additionally, LDSC generally works best with sample sizes of > 5,000, so MESA and JHS may be slightly underpowered for this approach. To test this pipeline with chemokine data, we are using the UKB-PPP dataset.

1.2 Note 2

This setup is designed for flexibility and scalability; for example, it could be adapted to process additional UKB-PPP proteins in about 3 hours (meaning the scripts can be revised in about 3 hours), significantly reducing the manual effort needed. Run time for generating genetic correlations varies depending on the number of protein-protein pairs analyzed. This loop for LDSC, which generated 528 genetic correlations from 33 chemokines, ran for about 5 hours. (Note: This pipeline contains many scripts. The genetic correlation script was not parallelized for this initial run, as the focus was on obtaining results; future runs would benefit from parallelization to improve efficiency.)

2 UKB-PPP

Direct quote from the Synapse UKB-PPP website:

The UK Biobank Pharma Proteomics Project (UKB-PPP) is a collaboration between the UK Biobank (UKB) and thirteen biopharmaceutical companies characterising the plasma proteomic profiles of 54,219 UKB participants. As part of a collaborative analysis across the thirteen UKB-PPP partners, we conducted comprehensive protein quantitative trait loci (pQTL) mapping of 2,923 proteins that identifies 14,287 primary genetic associations, of which 85% are newly discovered, in addition to ancestry-specific pQTL mapping in non-Europeans. We identify independent secondary associations in 87% of cis and 30% of trans loci, expanding the catalogue of genetic instruments for downstream analyses.

3 Getting the Data

The UKB-PPP pQTL summary statistics are available via synapse for the “discovery” UKB EUR population. To download, it requires creating a synapse account, obtaining a personal access token, and adding files to a download list.

In a terminal window, programmatically acquire the download list of selected chemokine summary statistics.

I used the protein mapping file, which I’d downloaded previously to select the chemokines (i.e., I looked at them in Excel). I placed a copy of the rsid and protein mapping files at:

/Users/charleenadams/temp_BI/chemokine_rgs_olink/maps
(synapse_env) charleenadams@Charleens-MacBook-Pro maps % ls
olink_protein_map_1.5k_v1.tsv               olink_rsid_map_mac5_info03_b0_7_chr20_patched_v2.tsv.gz
olink_protein_map_3k_v1.tsv             olink_rsid_map_mac5_info03_b0_7_chr21_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr10_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr22_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr11_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr2_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr12_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr3_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr13_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr4_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr14_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr5_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr15_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr6_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr16_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr7_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr17_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr8_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr18_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chr9_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr19_patched_v2.tsv.gz olink_rsid_map_mac5_info03_b0_7_chrXY_patched_v2.tsv.gz
olink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv olink_rsid_map_mac5_info03_b0_7_chrX_patched_v2.tsv.gz

Here are the 33 chemokines I selected. Would you like other ones? I’m not sure if we want Chemokine-like protein TAFA-5. I kept it for now.

UKBPPP_ProteinID	Olink Target Fullname	OlinkID	UniProt	Assay
CCL13:Q99616:OID20655:v1	C-C motif chemokine 13	OID20655	Q99616	CCL13
CCL14:Q16627:OID20401:v1	C-C motif chemokine 14	OID20401	Q16627	CCL14
CCL15:Q16663:OID20328:v1	C-C motif chemokine 15	OID20328	Q16663	CCL15
CCL16:O15467:OID20334:v1	C-C motif chemokine 16	OID20334	O15467	CCL16
CCL17:Q92583:OID20745:v1	C-C motif chemokine 17	OID20745	Q92583	CCL17
CCL18:P55774:OID20395:v1	C-C motif chemokine 18	OID20395	P55774	CCL18
CCL19:Q99731:OID21030:v1	C-C motif chemokine 19	OID21030	Q99731	CCL19
CCL2:P13500:OID21004:v1	C-C motif chemokine 2	OID21004	P13500	CCL2
CCL20:P78556:OID20671:v1	C-C motif chemokine 20	OID20671	P78556	CCL20
CCL21:O00585:OID20686:v1	C-C motif chemokine 21	OID20686	O00585	CCL21
CCL22:O00626:OID20765:v1	C-C motif chemokine 22	OID20765	O00626	CCL22
CCL23:P55773:OID20693:v1	C-C motif chemokine 23	OID20693	P55773	CCL23
CCL24:O00175:OID20772:v1	C-C motif chemokine 24	OID20772	O00175	CCL24
CCL25:O15444:OID20674:v1	C-C motif chemokine 25	OID20674	O15444	CCL25
CCL26:Q9Y258:OID20546:v1	C-C motif chemokine 26	OID20546	Q9Y258	CCL26
CCL27:Q9Y4X3:OID20121:v1	C-C motif chemokine 27	OID20121	Q9Y4X3	CCL27
CCL28:Q9NRJ3:OID20569:v1	C-C motif chemokine 28	OID20569	Q9NRJ3	CCL28
CCL3:P10147:OID20610:v1	C-C motif chemokine 3	OID20610	P10147	CCL3
CCL4:P13236:OID20695:v1	C-C motif chemokine 4	OID20695	P13236	CCL4
CCL5:P13501:OID20412:v1	C-C motif chemokine 5	OID20412	P13501	CCL5
CCL7:P80098:OID20523:v1	C-C motif chemokine 7	OID20523	P80098	CCL7
CCL8:P80075:OID21466:v1	C-C motif chemokine 8	OID21466	P80075	CCL8
CXCL10:P02778:OID20697:v1	C-X-C motif chemokine 10	OID20697	P02778	CXCL10
CXCL11:O14625:OID21042:v1	C-X-C motif chemokine 11	OID21042	O14625	CXCL11
CXCL13:O43927:OID21024:v1	C-X-C motif chemokine 13	OID21024	O43927	CXCL13
CXCL14:O95715:OID20458:v1	C-X-C motif chemokine 14	OID20458	O95715	CXCL14
CXCL16:Q9H2A7:OID20282:v1	C-X-C motif chemokine 16	OID20282	Q9H2A7	CXCL16
CXCL17:Q6UXB2:OID20622:v1	C-X-C motif chemokine 17	OID20622	Q6UXB2	CXCL17
CXCL3:P19876:OID20788:v1	C-X-C motif chemokine 3	OID20788	P19876	CXCL3
CXCL5:P42830:OID20207:v1	C-X-C motif chemokine 5	OID20207	P42830	CXCL5
CXCL6:P80162:OID20613:v1	C-X-C motif chemokine 6	OID20613	P80162	CXCL6
CXCL9:Q07325:OID20675:v1	C-X-C motif chemokine 9	OID20675	Q07325	CXCL9
TAFA5:Q7Z5A7:OID21321:v1	Chemokine-like protein TAFA-5	OID21321	Q7Z5A7	TAFA5

4 Other ones? 🎣

library(DT)
table <- fread("/Users/charleenadams/temp_BI/chemokine_rgs_olink/protein_map_interactive.csv")

datatable(table,
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = FALSE,
          caption = 'Olink Protein Map')

5 Getting the Chemokine Summary Statistics from Synapse

After logging into Synapse and adding desired files to the download list, these are the steps.

cd /Users/charleenadams/temp_BI/chemokine_rgs_olink
conda create -n synapse_env -c conda-forge mamba
conda activate synapse_env
pip install synapseclient
synapse -h
synapse get-download-list
synapse login "cdadams" eyJ0eXAiOiJKV1QiLCJraWQiOiJXN05OOldMSlQ6SjVSSzpMN1RMOlQ3TDc6M1ZYNjpKRU9VOjY0NFI6VTNJWDo1S1oyOjdaQ0s6RlBUSCIsImFsZyI6IlJTMjU2In0.eyJhY2Nlc3MiOnsic2NvcGUiOlsidmlldyIsImRvd25sb2FkIiwibW9kaWZ5Il0sIm9pZGNfY2xhaW1zIjp7fX0sInRva2VuX3R5cGUiOiJQRVJTT05BTF9BQ0NFU1NfVE9LRU4iLCJpc3MiOiJodHRwczovL3JlcG8tcHJvZC5wcm9kLnNhZ2ViYXNlLm9yZy9hdXRoL3YxIiwiYXVkIjoiMCIsIm5iZiI6MTcyOTcwNjgzOSwiaWF0IjoxNzI5NzA2ODM5LCJqdGkiOiIxMzA3MyIsInN1YiI6IjM0NzI3ODMifQ.B65yULl25FZ1vmFOXEE6RaNYOhXTXZ5ov5JLVP47nJpejRA6oMfcwOcY1z4RH7bLFQ48xzRhc1I8kSbexid8jpoKnQJlBI8mVNItvv92dZNDrNN_8cvXtHNLVb6O12jwIZ0XDuQxcLuxm0kF52IZSXWMtwBzBeyKckkNYL9LTbQKDmFRmOpwZGVjkZyT7wsiuLLjj9bTcMfFcY-IstD7xnK4O51D_UejK6vg-Eh0Uw23i4NHhMAA02-D-tZq1yV0X8yI0EIOFoLQ3AKtXgrr4sE6zDGz4KTPaqVM557Vv9Aqk2KtnfiZRSWWxWYPbJm1Ty8AYLBDv6Iq-tdAval8kw

6 Data Preparation

6.1 Look at RSID

RSID maps are provided per chromosome.

In terminal window, look at the rsid map for chromosome 1. It’s nice that they give us positions for both POS19 and POS38.

zless - olink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv.gz

ID      REF     ALT     rsid    POS19   POS38
1:10177:A:AC:imp:v1     A       AC      rs367896724     10177   10177
1:10352:T:TA:imp:v1     T       TA      rs201106462     10352   10352
1:10511:G:A:imp:v1      G       A       rs534229142     10511   10511

7 Count Chemokine Files and Untar

We have to manually select the Olink files for the download list, which means it can be easy to miss them. I’d identified 33, but when adding them to the download list, it’s difficult to eyeball. Below I’m counting that we got 33 and then unpacking them.

ls *.tar | wc -l 
for file in *.tar; do tar -xf "$file"; done

8 Merge in the RSIDs

The rsid numbers need to be merged into the chemokine summary statistics files for LDSC. When we unpack the files, they produce 33 subdirectories. So, the code below goes into the subdirectories and annotates all the files in all of them with their corresponding rsid mapping files by chromosome.

#!/bin/bash

# Base directory where your data is located
base_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink"
# Directory where RSID map files are located
map_dir="$base_dir/maps"

# Log file to track progress
log_file="add_rsid_log.txt"
echo "Log started on $(date)" > $log_file

# Loop through each subdirectory in the base directory
for dir in "$base_dir"/*; do
    if [[ -d "$dir" ]]; then
        echo "Processing directory: $dir" | tee -a $log_file

        # Loop through each .gz file in the subdirectory
        for file in "$dir"/*.gz; do
            # Extract chromosome number from file name without using -P option
            chrom=$(basename "$file" | sed -E 's/.*chr([0-9XY]+).*/\1/')

            # Define the RSID map file for this chromosome
            map_file="$map_dir/olink_rsid_map_mac5_info03_b0_7_chr${chrom}_patched_v2.tsv.gz"
            output_file="${file%.gz}_rsid_added.gz"

            echo "  Processing file: $file (Chromosome $chrom)" | tee -a $log_file

            # Check if the chromosome number was extracted and the RSID map file exists
            if [[ -n "$chrom" && -f "$map_file" ]]; then
                echo "    Using RSID map: $map_file" | tee -a $log_file

                # Remove existing rsid_added file to avoid overwrite prompts
                rm -f "$output_file"

                # Join the original file with the RSID map based on the ID column
                gunzip -c "$file" | \
                awk 'NR==FNR{a[$1]=$4; next} $3 in a {$3=a[$3]}1' <(gunzip -c "$map_file") - | \
                gzip -c > "$output_file"

                # Confirm successful processing
                if [[ -s "$output_file" ]]; then
                    echo "    RSID added and output saved to $output_file" | tee -a $log_file
                else
                    echo "    Warning: $output_file is empty!" | tee -a $log_file
                fi
            else
                echo "    Error: RSID map file for chromosome $chrom not found or chromosome not identified." | tee -a $log_file
            fi
        done
    fi
done

echo "Log ended on $(date)" >> $log_file
echo "Processing complete. Check $log_file for details."

#Run with: 
chmod +x add_rsid.sh
nohup ./add_rsid.sh > add_rsid_log.txt 2>&1 &
ps -p 27745 -o stat,etime

9 Check Progress of RSIDs Being Added

Lots of things can go wrong in loops. Good to watch their progress.

List all files ending in _rsid_added.gz in all subdirectories
Remove all files ending in _rsid_added.gz in all subdirectories (in case they are made incorrectly)

find . -type f \( -name '*_rsid_added.gz' -o -name '*_rsid_added_rsid_added.gz' \) -exec ls {} +
find . -type f \( -name '*_rsid_added.gz' -o -name '*_rsid_added_rsid_added.gz' \) -exec rm {} +

10 Loop Through Each (RSID-Annotated) Chemokine Sudirectory and Merge the Chromosome Files

Each chemokine subdirectory contains summary statistics by chromosome. They must be merged for LDSC so that there is one summary statistics file per chemokine. (Well, I bet LDSC would let us run things one chromosome at a time, but setting that up sounds more hellish than merging, and merging is cleaner for this purpose.)

#!/bin/bash

# Define base directory
base_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink"

# Log file for tracking
log_file="$base_dir/merge_rsid_files_log.txt"
echo "Processing started on $(date)" > $log_file

# Find directories to process, excluding specific ones
for dir in "$base_dir"/*/; do
    # Skip directories you don’t want to process
    if [[ "$dir" == *"maps"* || "$dir" == *"1000G_Phase3_weights_hm3_no_MHC"* || "$dir" == *"ldsc"* || "$dir" == *"rsconnect"* ]]; then
        continue
    fi
    
    echo "Processing directory: $dir" | tee -a $log_file

    # Collect all '_rsid_added.gz' files in this directory
    rsid_files=("$dir"*_rsid_added.gz)

    # If there are files to merge
    if [[ ${#rsid_files[@]} -gt 0 ]]; then
        merged_file="${dir%/}_merged_rsid_added.gz"
        
        # Extract header from the first chromosome file
        gunzip -c "${rsid_files[0]}" | head -n 1 > temp_header.txt

        # Use GNU parallel to process each file without header in parallel
        export -f gunzip
        export -f tail
        export log_file
        export merged_file
        export temp_header.txt

        parallel -j4 'gunzip -c {} | tail -n +2 >> temp_content.txt' ::: "${rsid_files[@]}"

        # Concatenate header and content into final merged file
        cat temp_header.txt temp_content.txt | gzip -c > "$merged_file"
        rm temp_header.txt temp_content.txt

        # Confirm if the merged file was created successfully and isn’t empty
        if [[ -s "$merged_file" ]]; then
            echo "  Merged file created: $merged_file" | tee -a $log_file
        else
            echo "  Warning: Merged file is empty for directory $dir" | tee -a $log_file
        fi
    else
        echo "  No rsid_added.gz files found in directory $dir" | tee -a $log_file
    fi
done

echo "Processing completed on $(date)" >> $log_file
echo "Check $log_file for details."

# Run with:
chmod +x chemokine_chrom_merge.sh
nohup ./chemokine_chrom_merge.sh > chemokine_chrom_merge_log.txt 2>&1 &

11 LD Scores and the HapMap3 SNPlist

I obtained the precalculated LD scores for Europeans within 1000 Genomes Project Phase 3, as provided by the Alkes lab. These files were specifically generated using HapMap3 SNPs within the European subset of 1000 Genomes samples, excluding the MHC region, which is particularly complex in terms of LD structure due to high genetic diversity and extended linkage disequilibrium in that region.

Using these files is standard practice for LDSC analysis focused on European ancestry because they help ensure comparability across studies, minimize confounding by population structure, and leverage the better-characterized LD patterns in this population.

I’m also using a HapMap3 snplist: w_hm3.snplist.

12 Add a P (p-value) Column

The Olink files have LOG10P. LDSC needs non-transformed p-values.

#!/bin/bash

# Loop through each *merged_rsid_added.gz file
for file in *_merged_rsid_added.gz; do
    # Define the output filename
    output_file="${file%_merged_rsid_added.gz}_with_pval.gz"
    
    # Use awk to calculate P-value from LOG10P and write to a new gzipped file
    gunzip -c "$file" | \
    awk 'BEGIN {OFS="\t"} NR==1 {print $0, "P"} NR>1 {pval = (10 ^ -$13); print $0, pval}' | \
    gzip > "$output_file"
    
    echo "Processed file saved: $output_file"
done

#Run with: 
chmod +x pval.sh
nohup ./pval.sh> pval_log.txt 2>&1 &

13 Try Munging One Chemokine Summary Statistics File

LDSC requires that summary statistics be in a certain form. Getting them into that form is something they refer to as “munging.” I tried munging just one file. It worked.

python /Users/charleenadams/temp_BI/chemokine_rgs_olink/ldsc/munge_sumstats.py \
        --sumstats CCL13_Q99616_OID20655_v1_Inflammation_with_pval.gz \
        --snp rsid \
        --a1 ALLELE1 \
        --a2 ALLELE0 \
        --frq A1FREQ \
        --N-col N \
        --p P \
        --signed-sumstats BETA,0 \
        --out CCL13_Q99616_OID20655_v1_Inflammation_with_pval_TEST \
        --merge-alleles /Users/charleenadams/temp_BI/chemokine_rgs_olink/w_hm3.snplist

14 Munge Loop for the 33 Chemokine Summary Statistics

#!/bin/bash

# Define base directory
base_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink"
output_dir="$base_dir/munged"

# Create the output directory if it doesn't exist
mkdir -p "$output_dir"

# Loop through each file ending with "_with_pval.gz" in the base directory
for file in "$base_dir"/*_with_pval.gz; do
    # Extract the base name without the extension
    base_name=$(basename "$file" "_with_pval.gz")

    # Define the output filename for munging
    output_file="$output_dir/${base_name}_munged"

    # Perform the munging, specifying the column names explicitly
    munge_sumstats.py \
        --sumstats "$file" \
        --snp rsid \
        --a1 ALLELE1 \
        --a2 ALLELE0 \
        --frq A1FREQ \
        --N-col N \
        --p P \
        --signed-sumstats BETA,0 \
        --out "$output_file" \
        --merge-alleles /Users/charleenadams/temp_BI/chemokine_rgs_olink/w_hm3.snplist

    echo "Munged file created: $output_file"
done
# Run with: 
chmod +x munge_loop.sh
nohup ./munge_loop.sh > munge_loop_log.txt 2>&1 &

15 Run LDSC for Pairwise Combinations of Chemokines

NB: This works but is slow! Next time we grab more sets of UKB-PPP summary statistics, it will be worth running this step in parallel.

#!/bin/bash

# Define base directory and reference LD scores
base_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink"
ldscores="$base_dir/1000G_Phase3_weights_hm3_no_MHC"  # Adjust path if necessary

# Directory for the munged files
munged_dir="$base_dir/munged"

# Loop over pairs of munged files
for file1 in "$munged_dir"/*.sumstats.gz; do
    for file2 in "$munged_dir"/*.sumstats.gz; do
        # Avoid redundant calculations
        if [[ "$file1" < "$file2" ]]; then
            # Extract the filename without path and suffix for clarity in output
            base_file1=$(basename "${file1%_munged.sumstats.gz}")
            base_file2=$(basename "${file2%_munged.sumstats.gz}")
            output_prefix="$munged_dir/${base_file1}_${base_file2}_correlation"

            # Count unique SNPs for each file pair
            M_count=$(gunzip -c "$file1" "$file2" | awk 'NR > 1 {print $1}' | sort | uniq | wc -l)
            
            # Run LDSC for genetic correlation
            ./ldsc/ldsc.py \
                --rg "$file1","$file2" \
                --ref-ld-chr "$ldscores"/weights.hm3_noMHC. \
                --w-ld-chr "$ldscores"/weights.hm3_noMHC. \
                --out "$output_prefix" \
                --not-M-5-50 \
                --M "$M_count"
            
            echo "Genetic correlation output created: ${output_prefix}.log"
        fi
    done
done
# Run with: 
chmod +x ldsc_chemokines.sh
nohup ./ldsc_chemokines.sh > ldsc_chemokines_log.txt 2>&1 &

16 Process the Results Files (correlation.logs)

The results are generated as log files and don’t provide the genetic correlations until the end. Create a loop to extract the results from each file and merge them together so that we can look at them.

#!/bin/bash

# Define the output file and write the header only once
output_file="genetic_correlation_master_results.txt"
echo -e "p1\tp2\trg\tse\tz\tp\th2_obs\th2_obs_se\th2_int\th2_int_se\tgcov_int\tgcov_int_se" > "$output_file"

# Loop over each .log file to extract the result line
for log_file in *.log; do
    # Extract lines that have .sumstats.gz followed by numeric values
    awk '/\.sumstats\.gz/ && /[0-9]+\.[0-9]+/' "$log_file" >> "$output_file"
done

#Run with: 

chmod +x extract_rgs_from_correlation_logs.sh
nohup ./extract_rgs_from_correlation_logs.sh > extract_rgs_from_correlation_logs_log.txt 2>&1 &

17 Clean Names of P1 and P2 for Chemokines

“P1” and “P2” refer to “trait 1” and “trait 2” in LDSC. The file names for each chemokine were originally used to index the traits. But this is messy to look at. I want to display the results by the chemokine symbol nicely in a heatmap.

#!/bin/bash

# Define input and output files
input_file="genetic_correlation_master_results.txt"
output_file="formatted_genetic_correlation_results.txt"

# Write the header to the output file
head -n 1 "$input_file" > "$output_file"

# Process each line, extracting and formatting the p1 and p2 fields
tail -n +2 "$input_file" | while read -r line; do
    # Extract the full paths for p1 and p2
    p1=$(echo "$line" | awk '{print $1}')
    p2=$(echo "$line" | awk '{print $2}')

    # Extract only the chemokine label (e.g., CCL13, CCL14) from each path
    p1_label=$(echo "$p1" | sed -E 's|.*/([A-Z]+[0-9]+)_.*|\1|')
    p2_label=$(echo "$p2" | sed -E 's|.*/([A-Z]+[0-9]+)_.*|\1|')

    # Replace the original paths in the line with the formatted labels
    formatted_line=$(echo "$line" | sed -E "s|$p1|$p1_label|; s|$p2|$p2_label|")

    # Append the formatted line to the output file
    echo "$formatted_line" >> "$output_file"
done

echo "Formatting complete. Output saved to $output_file."

#Run with: 

chmod +x format_p1_p2.sh
./format_p1_p2.sh

18 Results: Genetic Correlations

18.0.1 Column Descriptions for Genetic Correlation Results

These results provide estimates of genetic correlation between pairs of chemokine traits (p1 and p2), along with additional information regarding the heritability of p2 and genetic covariance between the traits. Below is a description of each column.

p1: The first trait in the pair being analyzed for genetic correlation.
p2: The second trait in the pair being analyzed for genetic correlation.
rg: The estimated genetic correlation (r_g) between trait p1 and trait p2. This value represents the proportion of genetic variance shared between the two traits.
se: The standard error of the genetic correlation estimate. It provides a measure of the uncertainty around the rg value.
z: The z-score for the genetic correlation estimate, calculated as rg/se. A higher z-score indicates a stronger, more statistically significant correlation.
p: The p-value for the genetic correlation estimate, indicating the probability that the observed correlation is due to chance. Lower values suggest a significant genetic correlation between traits p1 and p2.
h2_obs: The observed heritability for trait p2. This value represents the proportion of phenotypic variance in p2 that is attributable to genetic variance.
h2_obs_se: The standard error of the observed heritability for p2.
h2_int: The intercept for the heritability estimate of p2. This intercept helps check for potential confounding effects, such as population stratification, which could artificially inflate heritability estimates.
h2_int_se: The standard error of the heritability intercept for p2.
gcov_int: The intercept for the genetic covariance between traits p1 and p2. This intercept can indicate potential confounding influences on the genetic correlation.
gcov_int_se: The standard error of the genetic covariance intercept.

The rg and p columns are key for understanding the degree and significance of genetic overlap between the two traits, while the heritability and intercept values provide additional context about the genetic architecture of p2 and any potential confounding influences affecting the estimates.

library(DT)
datatable(data, #datatable(head(data, n = 10), 
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = FALSE,
          caption = 'Genetic Correlation (from LDSC) Results')

19 Visualize the Chemokine Genetic Correlations

19.1 Heatmap with Hierarchical Clustering

# Load necessary libraries
library(ggplot2)
library(reshape2)
library(dplyr)
library(tidyr)

setwd("/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged")

# Load the data
data <- read.table("formatted_genetic_correlation_results.txt", header = TRUE, stringsAsFactors = FALSE)

# Ensure symmetry by adding reversed rows
data_symmetric <- bind_rows(data, data %>% rename(p1 = p2, p2 = p1))

# Create the matrix for the rg values
rg_matrix <- data_symmetric %>%
  select(p1, p2, rg) %>%
  spread(key = p2, value = rg, fill = NA)

# Set row names as p1 and drop the p1 column
rownames(rg_matrix) <- rg_matrix$p1
rg_matrix <- rg_matrix %>% select(-p1)

# Replace NA values with 0 for clustering purposes (optional, adjust as needed)
rg_matrix[is.na(rg_matrix)] <- 0

# Perform hierarchical clustering
dist_matrix <- dist(rg_matrix)                # Calculate distance
cluster_rows <- hclust(dist_matrix)           # Row clustering
cluster_cols <- hclust(dist(t(rg_matrix)))    # Column clustering

# Reorder the matrix based on clustering
rg_matrix_ordered <- rg_matrix[cluster_rows$order, cluster_cols$order]

# Melt the reordered matrix for ggplot
rg_long <- melt(as.matrix(rg_matrix_ordered), na.rm = TRUE)
colnames(rg_long) <- c("p1", "p2", "rg")

# Plot the heatmap with hierarchical clustering
ggplot(rg_long, aes(x = p1, y = p2, fill = rg)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Genetic Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1),
        axis.text.y = element_text(size = 10),
        axis.title = element_blank(),
        legend.position = "bottom",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8)) +
  coord_fixed()

Hierarchical Clustering Heatmap for Genetic Correlation of 33 Olink Chemokines

# Define the number of clusters you'd like to identify (adjust as needed)
num_clusters <- 5  # or any number of clusters you want to cut at

# Cut the dendrograms to get clusters for rows and columns
row_clusters <- cutree(cluster_rows, k = num_clusters)
col_clusters <- cutree(cluster_cols, k = num_clusters)

# Convert clusters to a data frame for rows (chemokines)
row_cluster_df <- data.frame(Chemokine = names(row_clusters), Cluster = row_clusters)
row_cluster_df <- row_cluster_df[order(row_cluster_df$Cluster),]

# Print row clusters in a table format
print("Row Clusters (Chemokines grouped by row):")

## [1] "Row Clusters (Chemokines grouped by row):"

print(row_cluster_df)

##        Chemokine Cluster
## CCL13      CCL13       1
## CCL17      CCL17       1
## CCL2        CCL2       1
## CCL24      CCL24       1
## CCL26      CCL26       1
## CCL27      CCL27       1
## CCL4        CCL4       1
## CCL5        CCL5       1
## CCL8        CCL8       1
## CXCL11    CXCL11       1
## CXCL3      CXCL3       1
## CXCL5      CXCL5       1
## CXCL6      CXCL6       1
## CCL14      CCL14       2
## CCL7        CCL7       2
## CCL15      CCL15       3
## CCL19      CCL19       3
## CCL20      CCL20       3
## CCL21      CCL21       3
## CCL23      CCL23       3
## CXCL10    CXCL10       3
## CXCL9      CXCL9       3
## CCL16      CCL16       4
## CCL18      CCL18       4
## CCL22      CCL22       4
## CCL3        CCL3       4
## CXCL13    CXCL13       4
## CCL25      CCL25       5
## CCL28      CCL28       5
## CXCL14    CXCL14       5
## CXCL16    CXCL16       5
## CXCL17    CXCL17       5
## TAFA5      TAFA5       5

# If you want to save it as a CSV file, you can use:
# write.csv(row_cluster_df, "row_clusters.csv", row.names = FALSE)

19.2 Heatmap with K-means Clustering

# Load necessary libraries
library(ggplot2)
library(reshape2)
library(dplyr)
library(tidyr)

setwd("/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged")

# Load the data
data <- read.table("formatted_genetic_correlation_results.txt", header = TRUE, stringsAsFactors = FALSE)

# Ensure symmetry by adding reversed rows
data_symmetric <- bind_rows(data, data %>% rename(p1 = p2, p2 = p1))

# Create the matrix for the rg values
rg_matrix <- data_symmetric %>%
  select(p1, p2, rg) %>%
  spread(key = p2, value = rg, fill = 0) # Replace NAs with 0 for k-means

# Set row names as p1 and drop the p1 column
rownames(rg_matrix) <- rg_matrix$p1
rg_matrix <- rg_matrix %>% select(-p1)

# Define the number of clusters
set.seed(42) # For reproducibility
k <- 5  # Change this based on desired clusters

# Apply k-means clustering
kmeans_result <- kmeans(rg_matrix, centers = k)
cluster_assignment <- kmeans_result$cluster

# Order rows and columns by cluster
rg_matrix_ordered <- rg_matrix[order(cluster_assignment), order(cluster_assignment)]

# Melt the ordered matrix for ggplot
rg_long <- melt(as.matrix(rg_matrix_ordered), na.rm = TRUE)
colnames(rg_long) <- c("p1", "p2", "rg")

# Plot the heatmap with k-means clustering
ggplot(rg_long, aes(x = p1, y = p2, fill = rg)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limit = c(-1, 1), space = "Lab", name = "Genetic Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, size = 10, hjust = 1),
        axis.text.y = element_text(size = 10),
        axis.title = element_blank(),
        legend.position = "bottom",
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8)) +
  coord_fixed() +
  ggtitle(paste("Genetic Correlation Heatmap with K-means Clustering (k =", k, ")"))

K-means Clustering Heatmap for Genetic Correlation of 33 Olink Chemokines

19.3 Compare K-means and Hierarchical Clustering Assignments

# Load necessary libraries
library(dplyr)
library(tidyr)

# Define number of clusters
k <- 5

# K-means clustering on the correlation matrix
set.seed(42)
kmeans_result <- kmeans(rg_matrix, centers = k)
kmeans_clusters <- as.data.frame(kmeans_result$cluster)
colnames(kmeans_clusters) <- "Kmeans_Cluster"
kmeans_clusters$Chemokine <- rownames(kmeans_clusters)

# Hierarchical clustering on the distance matrix
dist_matrix <- dist(rg_matrix)
hclust_result <- hclust(dist_matrix)
hierarchical_clusters <- cutree(hclust_result, k = k) # Assign to 'k' clusters
hierarchical_clusters <- as.data.frame(hierarchical_clusters)
colnames(hierarchical_clusters) <- "Hierarchical_Cluster"
hierarchical_clusters$Chemokine <- rownames(hierarchical_clusters)

# Combine results into one table for comparison
cluster_comparison <- left_join(kmeans_clusters, hierarchical_clusters, by = "Chemokine")
cluster_comparison <- cluster_comparison %>% arrange(Kmeans_Cluster, Hierarchical_Cluster)

# Display the table
knitr::kable(cluster_comparison, caption = "Comparison of K-means and Hierarchical Clustering Assignments")

Comparison of K-means and Hierarchical Clustering Assignments
Kmeans_Cluster	Chemokine	Hierarchical_Cluster
1	CCL13	1
1	CCL17	1
1	CCL2	1
1	CCL26	1
1	CCL5	1
1	CCL8	1
1	CXCL11	1
1	CXCL3	1
1	CXCL5	1
1	CXCL6	1
2	CCL4	1
2	CCL15	3
2	CCL19	3
2	CCL20	3
2	CCL21	3
2	CCL23	3
2	CXCL10	3
2	CXCL9	3
2	CCL18	4
2	CXCL13	4
3	CCL24	1
3	CCL27	1
3	CCL25	5
3	CCL28	5
3	CXCL14	5
3	CXCL16	5
3	CXCL17	5
3	TAFA5	5
4	CCL16	4
4	CCL3	4
5	CCL14	2
5	CCL7	2
5	CCL22	4

20 Summary

We analyzed genetic correlations among 33 Olink chemokines using LDSC with UKB-PPP data for European ancestry. LDSC required linkage disequilibrium (LD) scores based on squared correlation coefficients (r²), unlike SuSiE and coloc, which rely on signed correlations (r). Chemokine summary statistics were obtained from Synapse, with SNPs mapped via rsid maps, and were processed to meet LDSC input standards by converting log-transformed p-values to standard p-values. Genetic correlations were estimated for each chemokine pair using precomputed LD scores from the 1000 Genomes European dataset, focusing on HapMap3 SNPs outside the MHC region to avoid genetic diversity confounding. For clustering, we applied both hierarchical and k-means clustering to visualize chemokine relationships, using a five-cluster solution for comparison. Results indicated consistent groupings across methods, highlighting chemokine relationships (mostly positively genetically correlated) and cluster stability across clustering techniques.

21 Next Steps?

A different set of proteins?
Exploring Mutually Significant Variants for Chemokine Pairs? (We took this forward.)
SuSiE / Coloc?
Genomic Structural Equation Modeling (GenomicSEM)? If so, for what purpose?
1. Latent genetic factor underlying the chemokine correlations?
2. Genetic clusters or subgroups among chemokines?
3. Multivariate analyses?
4. Creating new GWAS summary statistics via GWAS-by-Subtraction?
Something else?

22 Exploring Mutually Significant Variants for Chemokine Pairs

In the conversation between us (me, Rob, and Usman) on 10/30/2024, we noted that I had not included some chemokines. I identified some that were missed.

UKBPPP Protein ID	Olink Target Full Name	Olink ID	UniProt	Assay
CXCL8:P10145:OID20153:v1	Interleukin-8	OID20153	P10145	CXCL8
CXCL8:P10145:OID20631:v1	Interleukin-8	OID20631	P10145	CXCL8
CXCL8:P10145:OID20997:v1	Interleukin-8	OID20997	P10145	CXCL8
CXCL8:P10145:OID21430:v1	Interleukin-8	OID21430	P10145	CXCL8
CXCL1:P09341:OID20762:v1	Growth-regulated alpha protein	OID20762	P09341	CXCL1
CCL11:P51671:OID20668:v1	Eotaxin	OID20668	P51671	CCL11

I could fetch these from Synapse and re-run the pipeline to munge these and obtain the pairwise genetic correlations. This would give me the chance to parallelize the genetic correlations script. But before doing this, we might want to think about it more (see NOTE 3).

22.1 NOTE 1

Note that there are four versions of CXCL8. I’m not sure why Olink has done that, but I’m guessing it has to do with QC. When the genetic correlations are finished, I’ll need to re-name these so that they don’t look like duplicates in the correlation plot.

22.2 NOTE 2

Some things done with GenomicSEM require a genetic correlation matrix. So, obtaining these now kills two birds with one stone, though I don’t know if LDSC can be done within GenomicSEM or if so whether it would be more efficient to do so. Regardless, some hypotheses need a genetic correlation matrix to answer with GenomicSEM. Usman is particularly interested in GenomicSEM for multivariate analyses.

22.3 NOTE 3

The new laptop is here! Our Gerzsten drive isn’t mapped to it yet, and I need to get VPN installed, which could take several days. But I could fetch all the UKB-PPP files for EUR from Synapse, if we want. I suspect the compressed EUR files will be about 2-3 TB. After adding the rsids and pvals, which creates new files, and after munging, which also creates new files, we’ll be up to ~6-7TB. (I could perhaps add the rsids and convert from LOG10P to p-values in one script to reduce the times I create new files.) I think with the right parallelizing, it should take 5-7 days to run genetic correlations with the UKB-PPP EUR set of ~2900 protein summary statistics. This is worth it if we want to do lots of different GenomicSEM analyses (e.g., latent analyses, multivariable GWASes, etc). We would not want to have any other data on the new laptop until this is done. If it turns out to be too computationally burdensome on its CPU, then we will have tested the scripts and could move it to AWS.

22.4 NOTE 4

In addition to noting some chemokines that I hadn’t included, we decided in the conversation on Wednesday that I’d do some data mining to explore the landscape of jointly significant SNPs for the chemokines and that I’d do so at two p-value thresholds: P<5e8 and P<5e11. Below I have extracted the P<5e11 SNPs (and P<5e8 SNPs) from the EUR Olink summary statistics for the 33 chemokines. So far, the document below display the P<5e11 results.

We also spoke about the use of the Sun (UKB-PPP) paper’s supplement to find shared significant SNPs across the chemokines. Looks like they used P<1.7xe-11 (vs what I chose above). Also of note is that the table I used from their supplement (ST9) chose the most significant SNP for a region (based on their dynamic region defintion, I presume).

22.5 NOTE 5

Rob observed that the strong genetic correlations between some chemokine traits might point to issues with Olink’s assay. In particular, see these for CCL7 and CCL8 and CCL19 and CCL21 pairs.

p1	p2	rg	se	z	p	h2_obs	h2_obs_se	h2_int	h2_int_se	gcov_int	gcov_int_se
CCL7	CCL8	0.7954	0.0582	13.6576	1.8184e-42	0.1771	0.0244	1.0511	0.0436	0.4577	0.0087
CCL19	CCL21	0.7743	0.0654	11.8390	2.4542e-32	0.2744	0.0384	1.0036	0.0098	0.3442	0

22.6 NOTE 6

It’s worth just stating it explicitly that “genetic correlations” are generated from a process that looks across the genome, accounting for LD, in order to compare the genetic architectures of two traits. Genetic correlations aren’t about finding the most significant hits.

In contrast, the comparative analysis I’m doing below is a manual data dive focused on significant hits and seeing how they compare across the chemokines. This approach isn’t focused solely on cis hits for the chemokines, which will become clearer as you see the results.

23 Obtain the Commonly Significant SNPs Across the Chemokines

use P<5e8 and P<5e11

This loop below iterates through the original (but rsid annotated and p-value added) EUR Olink chemokine summary statistics by creating two subdirectories for the files with only the SNPs with the desired p-value thresholds: subset_pval_filtered_5e11 and subset_pval_filtered_5e8. Then, within each of the two directories, it creates a file that compares chemokine-by-chemokine for SNPs that are significant at the threshold.

#!/bin/bash

# Define base directory and threshold directories
base_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink"
threshold_dirs=("subset_pval_filtered_5e8" "subset_pval_filtered_5e11")

# Original file path for extracting the header (just for reference)
original_file="${base_dir}/CCL2_P13500_OID21004_v1_Neurology_with_pval.gz"

# Function to process a directory
process_directory() {
    local dir_path="$1"
    local output_file="${dir_path}/common_snps_comparison_p_too.txt"
    
    # Assuming columns are always in this order: CHROM, GENPOS, rsid, ..., P (last column)
    local CHROM_COL=1
    local GENPOS_COL=2
    local RSID_COL=3
    local P_COL=$(awk 'NR==1{print NF; exit}' "$original_file")  # Assuming P is the last column

    # Write header to the single output file, including P value columns
    echo -e "File1\tFile2\tSNP\tCHROM\tGENPOS\tFile1_P\tFile2_P" > "$output_file"

    # Collect all files in the directory
    local files=("$dir_path"/*_pval_filtered_*.txt)
    
    # Iterate through each file combination
    for ((i=0; i<${#files[@]}; i++)); do
        for ((j=i+1; j<${#files[@]}; j++)); do
            local ref_file="${files[i]}"
            local comp_file="${files[j]}"
            ref_name=$(basename "$ref_file" .txt)
            comp_name=$(basename "$comp_file" .txt)

            # Use awk with here-document to find common SNPs
            awk -v p_col=$P_COL '
            BEGIN {
                OFS="\t"
            }
            NR==FNR {
                snp_list[$3] = $1 "\t" $2 "\t" $(NF)  # Store CHROM, GENPOS, and P value (NF is the last field)
                next
            } 
            ($3 in snp_list) {
                split(snp_list[$3], ref, "\t")
                print "'"$ref_name"'", "'"$comp_name"'", $3, ref[1], ref[2], ref[3], $(NF)
            }' "$ref_file" "$comp_file" >> "$output_file"
        done
    done
}

# Loop over each threshold directory
for threshold_dir in "${threshold_dirs[@]}"; do
    dir_path="${base_dir}/${threshold_dir}"
    process_directory "$dir_path"
done

echo "Processing complete for both directories."

# Run with: 
chmod +x find_common_sig_chemokine_snps_p_too.sh
nohup ./find_common_sig_chemokine_snps_p_too.sh > find_common_sig_chemokine_snps_p_too_log.txt 2>&1 &

24 Results: Commonly Significant SNPs for Chemokines

I read the (P<5e11) results into R to look at them. I realized it would be helpful to know the transcription start site (TSS) for each chemokine to ascertain whether the shared significant SNPs are cis or trans in relation to the chemokines. And it occurred to me that I might be able to tackle my pet obessesion about whether the trans SNPs are in enhancers.

24.1 Fetch the TSS Annotations and Enhancer Info from Ensembl

This step is so that we can visualize where the TSSes are for our chemokines and think about them in relation to significant SNPs that are trans.

library(data.table)
library(biomaRt)

filtered_5e11 <- fread("/Users/charleenadams/temp_BI/chemokine_rgs_olink/subset_pval_filtered_5e11/common_snps_comparison_p_too.txt")

# Extract characters before the first underscore in File1 and File2
filtered_5e11[, chemokine1 := sub("_.*", "", File1)]
filtered_5e11[, chemokine2 := sub("_.*", "", File2)]

# Reorder columns as specified
filtered_5e11 <- filtered_5e11[, .(chemokine1, chemokine2, SNP, CHROM, GENPOS, File1_P, File2_P, File1, File2)]

# Define the list of unique chemokines from your data
chemokines <- unique(c(filtered_5e11$chemokine1, filtered_5e11$chemokine2))

# Connect to Ensembl's human gene dataset
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Query TSS information for only the main chromosomes (1-22, X, Y, MT)
tss_data <- getBM(
  attributes = c("hgnc_symbol", "chromosome_name", "transcription_start_site"),
  filters = c("hgnc_symbol", "chromosome_name"),
  values = list(chemokines, c(1:22, "X", "Y", "MT")),
  mart = ensembl
)

# Convert to data.table if it's not already
if (!is.data.table(tss_data)) {
  tss_data <- as.data.table(tss_data)
}

# Rename columns for clarity
setnames(tss_data, c("hgnc_symbol", "chromosome_name", "transcription_start_site"),
         c("chemokine", "TSS_chrom", "TSS_pos"))

# Remove duplicate entries in tss_data based on chemokine name
tss_data <- tss_data[!duplicated(tss_data$chemokine)]

# Now proceed with the merge
annotated_data <- merge(filtered_5e11, tss_data, by.x = "chemokine1", by.y = "chemokine", all.x = TRUE)
setnames(annotated_data, "TSS_chrom", "TSS_chrom1")
setnames(annotated_data, "TSS_pos", "TSS_pos1")

# Merge TSS info for chemokine2
annotated_data <- merge(annotated_data, tss_data, by.x = "chemokine2", by.y = "chemokine", all.x = TRUE)
setnames(annotated_data, "TSS_chrom", "TSS_chrom2")
setnames(annotated_data, "TSS_pos", "TSS_pos2")

# Reorder columns if needed
setcolorder(annotated_data, c("chemokine1", "chemokine2", "SNP", "CHROM", "GENPOS", "TSS_chrom1", "TSS_pos1", "TSS_chrom2", "TSS_pos2", "File1_P", "File2_P", "File1", "File2"))

# Extract only the chromosome number from TSS_chrom1 and TSS_chrom2, ensuring only primary chromosomes are kept
annotated_data[, TSS_chrom1 := sub("^.*?(\\d+|X|Y|MT).*", "\\1", TSS_chrom1)]
annotated_data[, TSS_chrom2 := sub("^.*?(\\d+|X|Y|MT).*", "\\1", TSS_chrom2)]

filtered_5e11 <- annotated_data

# Assume your data is in a data.table called `filtered_5e11`
snp_data <- copy(filtered_5e11)

# Connect to the Ensembl regulatory feature dataset
ensembl_reg <- useMart("ENSEMBL_MART_FUNCGEN", dataset = "hsapiens_regulatory_feature", host = "https://www.ensembl.org")

# Retrieve enhancer regions using the correct filter and attributes
enhancer_data <- getBM(
    attributes = c("chromosome_name", "chromosome_start", "chromosome_end"),
    filters = "regulatory_feature_type_name",
    values = "enhancer",
    mart = ensembl_reg
)

# Convert enhancer data to a data.table
enhancer_data <- as.data.table(enhancer_data)
setnames(enhancer_data, c("chromosome_name", "chromosome_start", "chromosome_end"), c("CHROM", "start_position", "end_position"))

# Ensure the chromosome columns in both tables are numeric
snp_data[, CHROM := as.integer(CHROM)]
enhancer_data[, CHROM := as.integer(CHROM)]

# Prepare enhancer data for overlap join
setkey(enhancer_data, CHROM, start_position, end_position)
snp_data[, enhancer := "Not in Enhancer"]  # Default value

# Create temporary range columns for the SNP position
snp_data[, `:=`(GENPOS_start = GENPOS, GENPOS_end = GENPOS)]

# Use foverlaps to find matches between SNPs and enhancer regions
setkey(snp_data, CHROM, GENPOS_start, GENPOS_end)
overlap_result <- foverlaps(snp_data, enhancer_data, by.x = c("CHROM", "GENPOS_start", "GENPOS_end"), by.y = c("CHROM", "start_position", "end_position"), nomatch = 0)

# Annotate SNPs that fall in enhancer regions
snp_data[overlap_result, enhancer := "In Enhancer", on = .(CHROM, GENPOS_start = GENPOS)]

# Remove temporary columns
snp_data[, `:=`(GENPOS_start = NULL, GENPOS_end = NULL)]

filtered_5e11 <- snp_data

# Reorder columns if needed
setcolorder(filtered_5e11, c("chemokine1", "chemokine2", "SNP", "CHROM", "GENPOS", "TSS_chrom1", "TSS_pos1", "TSS_chrom2", "TSS_pos2", "enhancer", "File1_P", "File2_P", "File1", "File2"))

write.table(filtered_5e11, "/Users/charleenadams/temp_BI/chemokine_rgs_olink/filtered_5e11.txt",
            row.names = FALSE, col.names = TRUE, quote = FALSE, sep = '\t')

24.2 Results Column Descriptions

chemokine1: Identifies the first chemokine in each analyzed pair, represented by its standard abbreviation (e.g., “CCL19” for C-C motif chemokine ligand 19).
chemokine2: Specifies the second chemokine in each pair, represented by its standard abbreviation. Pairing allows for comparative analysis of genomic regions between the two chemokines.
SNP: Refers to the rsID identifier of shared, significant SNPs between chemokine1 and chemokine2 (e.g., “rs6679677”). Each SNP was filtered for statistical significance, representing variants of interest that are common between the chemokine pairs.
CHROM: The chromosome number on which the shared SNP is located.
GENPOS: The genomic position (GRCh38) of the SNP in base pairs on the chromosome specified in CHROM. This precise location allows alignment with transcription start sites (TSS) or enhancer regions for further analysis.
TSS_chrom1: The chromosome where the transcription start site (TSS) of chemokine1 is located, as identified by Ensembl’s human gene dataset (hsapiens_gene_ensembl). This information helps assess whether the SNP is on the same chromosome (cis) or a different chromosome (trans) in relation to chemokine1.
TSS_pos1: The exact position (in base pairs) of the TSS for chemokine1 on TSS_chrom1, retrieved from Ensembl. This positional data aids in proximity analysis between the SNP and chemokine1.
TSS_chrom2: The chromosome where the TSS for chemokine2 is located, retrieved from Ensembl. This value allows determination of whether chemokine2 is in a cis or trans relationship to the SNP.
TSS_pos2: The specific genomic position (in base pairs) of the TSS for chemokine2 on TSS_chrom2, aiding in proximity evaluation relative to the SNP for chemokine2.
enhancer: Indicates whether each SNP falls within an enhancer region (“In Enhancer”) or outside of one (“Not in Enhancer”). Enhancer regions were identified through Ensembl’s regulatory feature dataset (hsapiens_regulatory_feature), focusing specifically on entries classified as “enhancer.”
File1_P and File2_P: P-values associated with each SNP for chemokine1 and chemokine2, respectively. These values quantify the strength of association of the SNP with each chemokine trait.
File1 and File2: Metadata from Olink describing the specific assay and condition information for each chemokine. This includes assay details, context of measurement, and versioning, providing additional context for each chemokine’s data.

24.3 CCL2 and 32 Other Chemokines (P<5e11)

I’m going to provide all the results in a table after giving just the results for CCL2 and CXCL10. I just wanted to give you the data for each of them in case you wanted to explore them individually. In the full data I’ll post below these, you can also key into them, but it’s easier separately. So, that’s why I’m being redundant.

# Filter for rows where chemokine1 is "CCL2"
ccl2_shared <- filtered_5e11[chemokine1 == "CCL2"]

library(DT)
datatable(ccl2_shared, #datatable(head(data, n = 10), 
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = FALSE,
          caption = 'Commonly Significant SNPs Between CCL2 and 32 Other Chemokines (P<5e11)')

24.4 CXCL10 and 32 Other Chemokines (P<5e11)

# Filter for rows where chemokine1 is "CXCL10"
CXCL10_shared <- filtered_5e11[chemokine1 == "CXCL10"]

library(DT)
datatable(CXCL10_shared, #datatable(head(data, n = 10), 
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = FALSE,
          caption = 'Commonly Significant SNPs Between CXCL10 and 32 Other Chemokines (P<5e11)')

24.5 Results: All Commonly Significant SNPs (P<5e11) for Chemokines

library(DT)
datatable(filtered_5e11, #datatable(head(data, n = 10), 
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = FALSE,
          caption = 'Commonly Significant SNPs Between Chemokines (P<5e11)')

25 Pairwise Heatmap for Number of Shared SNPs (P<5e11) Between Chemokines

I originally generated this plot prior to fetching the TSSes and enhancer information. Seeing the dark blue box below made me look more deeply at what was going on. The shared SNPs for the chemokines with that dark-blue box are on chromosome 6 in the MHC. (When I ran the formal genetic correlation analysis, we excluded SNPs in that region due to complex LD patterns…)

# Load necessary libraries
library(data.table)
library(ggplot2)
library(reshape2)

filtered_5e11 <- fread("/Users/charleenadams/temp_BI/chemokine_rgs_olink/filtered_5e11.txt")
head(filtered_5e11)

##    chemokine1 chemokine2       SNP CHROM    GENPOS TSS_chrom1 TSS_pos1
##        <char>     <char>    <char> <int>     <int>      <int>    <int>
## 1:      CCL19      CXCL9 rs6679677     1 113761186          9 34691276
## 2:      CCL19      CXCL9 rs2476601     1 113834946          9 34691276
## 3:      CCL13       CCL2 rs3026943     1 159162136         17 34356480
## 4:      CCL26       CCL2 rs3026943     1 159162136          7 75789896
## 5:      CCL13      CCL26 rs3026943     1 159162136         17 34356480
## 6:      CCL13       CCL7 rs3026943     1 159162136         17 34356480
##    TSS_chrom2 TSS_pos2        enhancer     File1_P     File2_P
##         <int>    <int>          <char>       <num>       <num>
## 1:          4 76007509 Not in Enhancer 2.50784e-20 4.34510e-15
## 2:          4 76007509 Not in Enhancer 1.42364e-20 1.82684e-15
## 3:         17 34255274 Not in Enhancer 3.17468e-16 3.53834e-26
## 4:         17 34255274 Not in Enhancer 2.01141e-11 3.53834e-26
## 5:          7 75789896 Not in Enhancer 3.17468e-16 2.01141e-11
## 6:         17 34270221 Not in Enhancer 3.17468e-16 7.67185e-40
##                                                                 File1
##                                                                <char>
## 1:    CCL19_Q99731_OID21030_v1_Neurology_with_pval_pval_filtered_5e11
## 2:    CCL19_Q99731_OID21030_v1_Neurology_with_pval_pval_filtered_5e11
## 3: CCL13_Q99616_OID20655_v1_Inflammation_with_pval_pval_filtered_5e11
## 4: CCL26_Q9Y258_OID20546_v1_Inflammation_with_pval_pval_filtered_5e11
## 5: CCL13_Q99616_OID20655_v1_Inflammation_with_pval_pval_filtered_5e11
## 6: CCL13_Q99616_OID20655_v1_Inflammation_with_pval_pval_filtered_5e11
##                                                                 File2
##                                                                <char>
## 1: CXCL9_Q07325_OID20675_v1_Inflammation_with_pval_pval_filtered_5e11
## 2: CXCL9_Q07325_OID20675_v1_Inflammation_with_pval_pval_filtered_5e11
## 3:     CCL2_P13500_OID21004_v1_Neurology_with_pval_pval_filtered_5e11
## 4:     CCL2_P13500_OID21004_v1_Neurology_with_pval_pval_filtered_5e11
## 5: CCL26_Q9Y258_OID20546_v1_Inflammation_with_pval_pval_filtered_5e11
## 6:  CCL7_P80098_OID20523_v1_Inflammation_with_pval_pval_filtered_5e11

# Convert data table to binary presence/absence for heatmap
chemokine_snp <- filtered_5e11[, .N, by = .(chemokine1, chemokine2)]
chemokine_snp_wide <- dcast(chemokine_snp, chemokine1 ~ chemokine2, value.var = "N", fill = 0)
# write.csv(chemokine_snp_wide,"/Users/charleenadams/temp_BI/chemokine_rgs_olink/chemokine_snp_wide.csv")

# Set chemokine names in a specific order to ensure a triangular structure
chemokines <- unique(chemokine_snp_wide$chemokine1)

# Melt data to long format
chemokine_snp_melt <- melt(chemokine_snp_wide, id.vars = "chemokine1", variable.name = "chemokine2", value.name = "value")

# Filter to retain only the upper triangle
chemokine_snp_melt <- chemokine_snp_melt[match(chemokine_snp_melt$chemokine1, chemokines) < match(chemokine_snp_melt$chemokine2, chemokines), ]

# Plot with ggplot2
ggplot(chemokine_snp_melt, aes(x = chemokine2, y = chemokine1, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "blue") +
  theme_minimal() +
  labs(title = "Shared SNPs Across Chemokine Pairs",
       x = "Chemokine 2",
       y = "Chemokine 1",
       fill = "Number of Shared SNPs") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Pairwise Heatmap for Number of Shared SNPs (P<5e11) Between 33 Chemokines

26 Are the Chemokines on the Same Chromosome?

Are the genes that encode the chemokines on the same chromosomes?

# Create a column indicating whether TSS_chrom1 and TSS_chrom2 are the same
filtered_5e11[, TSS_match := ifelse(TSS_chrom1 == TSS_chrom2, "Same", "Different")]

# Calculate counts of "Same" and "Different" for each chemokine pair
tss_match_counts_by_pair <- filtered_5e11[, .N, by = .(chemokine1, chemokine2, TSS_match)]

# Load ggplot2 for visualization
library(ggplot2)

# Plot the counts by chemokine pairs with "Same" and "Different" TSS chromosomes
ggplot(tss_match_counts_by_pair, aes(x = chemokine1, y = N, fill = TSS_match)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ chemokine2, scales = "free_x") +
  scale_fill_manual(values = c("Same" = "skyblue", "Different" = "salmon"), name = "TSS Chromosome Comparison") +
  labs(
    title = "TSS Chromosome Comparison for Each Chemokine Pair",
    x = "Chemokine 1",
    y = "Count of SNPs"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom"
  )

27 Signficant SNP-TSS Chromosome Alignment

Purpose: to investigate whether significant SNPs are on the same or different chromosomes from the TSS of the genes encoding the chemokines.

Match:
A “Match” occurs when the chromosome of the SNP (CHROM) is the same as either of the TSS chromosomes for the chemokine pair (TSS_chrom1 or TSS_chrom2).
- We check if the SNP’s CHROM value matches either TSS_chrom1 or TSS_chrom2 for each row in the data.
- If CHROM equals either TSS_chrom1 or TSS_chrom2, we label it as “Match” because this indicates that the SNP is on the same chromosome as one of the transcription start sites (TSS) for the chemokine pair.
No Match:
A “No Match” occurs when the SNP’s CHROM does not match either TSS_chrom1 or TSS_chrom2.
- In other words, if CHROM is different from both TSS_chrom1 and TSS_chrom2, the SNP is on a different chromosome than both TSS sites, hence it is labeled as “No Match.”

This distinction helps us understand whether the shared significant SNP is located on the same chromosome as the TSS of either chemokine in the pair.

27.1 All Chemokines

# Try making x axis labels smaller

# Add a column to check if SNP chromosome matches either TSS chromosome
filtered_5e11[, SNP_TSS_match := ifelse(CHROM == as.integer(TSS_chrom1) | CHROM == as.integer(TSS_chrom2), "Match", "No Match")]

# Calculate counts of matches for each chemokine pair
match_counts <- filtered_5e11[, .N, by = .(chemokine1, chemokine2, SNP_TSS_match)]

# Plot the match counts
ggplot(match_counts, aes(x = chemokine1, y = N, fill = SNP_TSS_match)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ chemokine2) +
  scale_fill_manual(values = c("Match" = "skyblue", "No Match" = "salmon"), name = "SNP to TSS Chromosome Match") +
  labs(
    title = "SNP Chromosome Alignment with TSS Chromosome for Chemokine Pairs",
    x = "Chemokine 1",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 3),  # Reduced x-axis label size
    legend.position = "bottom"
  )

27.2 Selected Chemokines

Since the chemokines on the x-axis label are difficult to read, I’ve zoomed into selected chemokines to look at them.

# Filter for selected chemokine2 values but keep all chemokine1 values
filtered_data <- match_counts[chemokine2 %in% c("CXCL11", "CCL2", "CXCL10")]

# Plot the match counts with full chemokine1 values and limited chemokine2 facets
ggplot(filtered_data, aes(x = chemokine1, y = N, fill = SNP_TSS_match)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ chemokine2) +
  scale_fill_manual(values = c("Match" = "skyblue", "No Match" = "salmon"), name = "SNP to TSS Chromosome Match") +
  labs(
    title = "SNP Chromosome Alignment with TSS Chromosome for Selected Chemokine2 Values",
    x = "Chemokine 1",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    legend.position = "bottom"
  )

28 Sun’s Significant Chemokines by RSID (P<1.7xe-11)

Here, we are looking at Sun’s Supplementary Table 9 (ST9), which provides a list of their top SNPs (per dynamic region). In order to attempt to replicate what they found, I’d need to adopt a region-based approach like theirs. Their dynamic regions aren’t straightforward; they aren’t just 1MB around anything.

# Load the necessary library for better readability (optional)
library(tibble)
library(DT)
library(data.table)
sun_sigs_ST9 <- read.csv("/Users/charleenadams/temp_BI/chemokine_rgs_olink/sun_sigs.csv")

datatable(sun_sigs_ST9,
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = TRUE,
          caption = 'ST9 from Sun with the top SNP per region (P<1.7e-11)')

29 Comparison of Sun’s Significant SNPs Across Chemokines

RSID (P<1.7xe-11)

# Create the DataFrame using tribble for clarity of the chemokine list
chemokine_data <- tribble(
  ~UKBPPP_ProteinID, ~`Olink Target Fullname`,      ~OlinkID,   ~UniProt, ~Assay,
  "CCL13:Q99616:OID20655:v1", "C-C motif chemokine 13", "OID20655", "Q99616", "CCL13",
  "CCL14:Q16627:OID20401:v1", "C-C motif chemokine 14", "OID20401", "Q16627", "CCL14",
  "CCL15:Q16663:OID20328:v1", "C-C motif chemokine 15", "OID20328", "Q16663", "CCL15",
  "CCL16:O15467:OID20334:v1", "C-C motif chemokine 16", "OID20334", "O15467", "CCL16",
  "CCL17:Q92583:OID20745:v1", "C-C motif chemokine 17", "OID20745", "Q92583", "CCL17",
  "CCL18:P55774:OID20395:v1", "C-C motif chemokine 18", "OID20395", "P55774", "CCL18",
  "CCL19:Q99731:OID21030:v1", "C-C motif chemokine 19", "OID21030", "Q99731", "CCL19",
  "CCL2:P13500:OID21004:v1",  "C-C motif chemokine 2",  "OID21004", "P13500", "CCL2",
  "CCL20:P78556:OID20671:v1", "C-C motif chemokine 20", "OID20671", "P78556", "CCL20",
  "CCL21:O00585:OID20686:v1", "C-C motif chemokine 21", "OID20686", "O00585", "CCL21",
  "CCL22:O00626:OID20765:v1", "C-C motif chemokine 22", "OID20765", "O00626", "CCL22",
  "CCL23:P55773:OID20693:v1", "C-C motif chemokine 23", "OID20693", "P55773", "CCL23",
  "CCL24:O00175:OID20772:v1", "C-C motif chemokine 24", "OID20772", "O00175", "CCL24",
  "CCL25:O15444:OID20674:v1", "C-C motif chemokine 25", "OID20674", "O15444", "CCL25",
  "CCL26:Q9Y258:OID20546:v1", "C-C motif chemokine 26", "OID20546", "Q9Y258", "CCL26",
  "CCL27:Q9Y4X3:OID20121:v1", "C-C motif chemokine 27", "OID20121", "Q9Y4X3", "CCL27",
  "CCL28:Q9NRJ3:OID20569:v1", "C-C motif chemokine 28", "OID20569", "Q9NRJ3", "CCL28",
  "CCL3:P10147:OID20610:v1",  "C-C motif chemokine 3",  "OID20610", "P10147", "CCL3",
  "CCL4:P13236:OID20695:v1",  "C-C motif chemokine 4",  "OID20695", "P13236", "CCL4",
  "CCL5:P13501:OID20412:v1",  "C-C motif chemokine 5",  "OID20412", "P13501", "CCL5",
  "CCL7:P80098:OID20523:v1",  "C-C motif chemokine 7",  "OID20523", "P80098", "CCL7",
  "CCL8:P80075:OID21466:v1",  "C-C motif chemokine 8",  "OID21466", "P80075", "CCL8",
  "CXCL10:P02778:OID20697:v1","C-X-C motif chemokine 10", "OID20697", "P02778", "CXCL10",
  "CXCL11:O14625:OID21042:v1","C-X-C motif chemokine 11", "OID21042", "O14625", "CXCL11",
  "CXCL13:O43927:OID21024:v1","C-X-C motif chemokine 13", "OID21024", "O43927", "CXCL13",
  "CXCL14:O95715:OID20458:v1","C-X-C motif chemokine 14", "OID20458", "O95715", "CXCL14",
  "CXCL16:Q9H2A7:OID20282:v1","C-X-C motif chemokine 16", "OID20282", "Q9H2A7", "CXCL16",
  "CXCL17:Q6UXB2:OID20622:v1","C-X-C motif chemokine 17", "OID20622", "Q6UXB2", "CXCL17",
  "CXCL3:P19876:OID20788:v1", "C-X-C motif chemokine 3",  "OID20788", "P19876", "CXCL3",
  "CXCL5:P42830:OID20207:v1", "C-X-C motif chemokine 5",  "OID20207", "P42830", "CXCL5",
  "CXCL6:P80162:OID20613:v1", "C-X-C motif chemokine 6",  "OID20613", "P80162", "CXCL6",
  "CXCL9:Q07325:OID20675:v1", "C-X-C motif chemokine 9",  "OID20675", "Q07325", "CXCL9",
  "TAFA5:Q7Z5A7:OID21321:v1", "Chemokine-like protein TAFA-5", "OID21321", "Q7Z5A7", "TAFA5")

list=sun_sigs_ST9[which(sun_sigs_ST9$Assay.Target%in%chemokine_data$Assay),]
list$ID <- paste0(list$Variant.ID..CHROM.GENPOS..hg37..A0.A1.imp.v1., "_", list$rsID)

# Find common rsIDs across different Assay.Targets
common_rsids <- list %>%
  group_by(ID) %>%
  filter(n_distinct(Assay.Target) > 1) %>%
  distinct(ID, Assay.Target) %>%
  arrange(ID)
write.csv(common_rsids,"/Users/charleenadams/temp_BI/chemokine_rgs_olink/common_rsids.csv")

# Step 1: Convert the data to a binary matrix format
binary_matrix <- common_rsids %>%
  mutate(value = 1) %>% # Binary indicator
  pivot_wider(names_from = Assay.Target, values_from = value, values_fill = list(value = 0)) %>%
  column_to_rownames("ID") # Set rsID as row names

# Step 2: Add row and column sums
binary_matrix$Row_Sum <- rowSums(binary_matrix) # Row sums
binary_matrix <- rbind(binary_matrix, Col_Sum = colSums(binary_matrix)) # Column sums

# Step 3: Display as an interactive table in R Markdown
library(DT)
datatable(binary_matrix,
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = TRUE,
          caption = 'Commonly Significant Chemokine SNPs Identified by Sun et al. (P<1.7e-11)')

30 Updated Comparison of Sun Top Hits

Based on the conversation between Rob, Usman, and me and a discussion with me and Usman, I have updated our chemokine table. Here are the changes:

I used pathway enrichment analysis to select chemokine activity proteins rather than selecting by the naming convention of the genes. I believe this approach captures all the proteins we wanted from before and adds in a few new ones. We could include even more if we chose to include other pathways involving chemokines (e.g., “chemokine receptor binding”).
Big change: I’ve added a row sum column for the top hits between any pair of proteins in the data. This exists to the left of the row sum column for the chemokine activity proteins. (Presorted, col sums are still at the bottom.)
To the SNP ID, I’ve added Sun’s bioinformatically identified gene, which I presume is the nearest gene.
The complete matrix is too big to render, so I’m only showing the chemokine activity proteins.

# Suppress output while loading libraries
suppressPackageStartupMessages({
  library(stringr)
  library(purrr)
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(dplyr)
  library(tidyr)
  library(ggplot2)
  library(DT)
  library(dplyr)
  library(data.table)
  library(tibble)
})

# Load the necessary library for better readability (optional)
setwd("/Users/charleenadams/temp_BI/chemokine_rgs_olink")

sun_sigs_ST9 <- read.csv("/Users/charleenadams/temp_BI/chemokine_rgs_olink/sun_sigs.csv")
sun_sigs_ST9$ID <- paste0(sun_sigs_ST9$Variant.ID..CHROM.GENPOS..hg37..A0.A1.imp.v1., "_", sun_sigs_ST9$Bioinfomatic.annotated.gene, "_", sun_sigs_ST9$rsID)
unique_sig_protein <- unique(sun_sigs_ST9$Assay.Target)

# Find common rsIDs across different Assay.Targets
common_rsids_all <- sun_sigs_ST9 %>%
  group_by(ID) %>%
  filter(n_distinct(Assay.Target) > 1) %>%
  distinct(ID, Assay.Target) %>%
  arrange(ID)
#write.csv(common_rsids_all ,"/Users/charleenadams/temp_BI/chemokine_rgs_olink/common_rsids_all _all_SUN.csv")

# testid <- common_rsids_all [which(common_rsids_all $ID=="3:56849749:T:C:imp:v1_ARHGEF3_rs1354034"),]

# # Clean up the IDs in the common_rsids_all  data frame
# common_rsids_all $original_Id <- common_rsids_all $ID
# common_rsids_all  <- common_rsids_all  %>%
#   mutate(ID = str_replace(ID, "(:[ATCG]+:[ATCG]+:imp:v1)", "")) # Remove only the undesired pattern

# Alternate way to clean those ids: 
# clean_ids <- function(ids) {
#   sub(":[^:]+:[^:]+:imp:v1", "_", ids)  # Remove `:T:C:imp:v1` or similar patterns
# }

# Count unique Assay.Targets by ID and sort in descending order
assay_targets_by_id <- common_rsids_all  %>%
  dplyr::group_by(ID) %>%  # Group by ID
  dplyr::summarize(unique_assay_targets = n_distinct(Assay.Target), .groups = "drop") %>%  # Count unique Assay.Targets
  dplyr::arrange(desc(unique_assay_targets))  # Arrange in descending order

# Filter by 3rd quartile: shared SNP >=4
summary(assay_targets_by_id$unique_assay_targets)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   5.941   4.000 444.000

# Generate pathway enrichment results
genes <- unique(common_rsids_all $Assay.Target)
pathway_results <- enrichGO(gene = genes, OrgDb = org.Hs.eg.db, keyType = "SYMBOL")

# Create the dotplot
enrichment_plot_20 <- dotplot(
  pathway_results,
  showCategory = 20,                                  # Number of categories to display
  title = "Pathways Enriched in Sun ST9"     # Plot title
)
enrichment_plot_20

# Save the plot as a high-resolution TIFF for Nature
ggsave(
  filename = "/Users/charleenadams/temp_BI/chemokine_rgs_olink/20_pathway_enrichment_Sun9_everything_high_res.tiff", 
  plot = enrichment_plot_20, 
  width =10,                       # Width in inches
  height = 10,                     # Height in inches
  dpi = 600,                       # High DPI for publication
  units = "in",                    # Specify units (inches)
  device = "tiff",                 # Save as TIFF
  compression = "lzw"              # Use LZW compression for smaller file size
)

# Step 1: Extract the total number of proteins/genes for each pathway

pathway_data <- pathway_results@result %>%
  as_tibble() %>%  # Ensure tibble for compatibility
  dplyr::select(Pathway_ID = ID, Description, geneID) %>%
  mutate(Genes = strsplit(as.character(geneID), "/")) %>%  # Ensure geneID is character
  unnest(cols = c(Genes))  # Expand the Genes list column into rows

pathway_totals <- pathway_data %>%
  group_by(Pathway_ID, Description) %>%
  summarise(
    Total_Proteins = n_distinct(Genes), # Count unique genes for each pathway
    .groups = "drop"
  )

datatable(pathway_totals,
          options = list(pageLength = 10, autoWidth = TRUE, scrollX = TRUE),
          rownames = FALSE,
          caption = 'Number of Proteins in Sun ST9 Pathways')

genes_in_pathway <- pathway_data %>%
  filter(Pathway_ID == "GO:0008009") %>%
  pull(Genes)

unique_genes <- unique(genes_in_pathway)

# Process all SNPs and proportions
all_proteins_snp <- common_rsids_all %>%
  mutate(rsID = stringr::str_extract(ID, "rs\\d+")) %>%
  left_join(pathway_data, by = c("Assay.Target" = "Genes")) %>%
  na.omit()

all_snps_with_proportions <- all_proteins_snp %>%
  group_by(ID, rsID, Pathway_ID, Description) %>%
  summarize(
    GeneCount = n(),
    Proteins = paste(unique(Assay.Target), collapse = ", "),
    .groups = "drop"
  ) %>%
  left_join(pathway_totals, by = c("Pathway_ID", "Description")) %>%
  mutate(Proportion = GeneCount / Total_Proteins)

# Clean proportions table
all_snps_with_proportions_clean <- all_snps_with_proportions %>%
  dplyr::select(ID, rsID, Proteins) %>%
  group_by(ID, rsID) %>%
  summarize(Proteins = paste(unique(Proteins), collapse = ", "), .groups = "drop")

# Binary matrix processing
binary_matrix_all <- common_rsids_all %>%
  mutate(value = 1) %>%
  pivot_wider(names_from = Assay.Target, values_from = value, values_fill = list(value = 0)) %>%
  as.data.frame()

# Add row sums
binary_matrix_all$Row_Sum_Genome <- rowSums(binary_matrix_all[, -1])

# Chemokine activity sums
columns_to_sum <- intersect(unique_genes, colnames(binary_matrix_all))
binary_matrix_all$Row_Sum_Chemokine_Activity <- rowSums(binary_matrix_all[, columns_to_sum, drop = FALSE])

# Writing results with IDs as rownames
write.csv(binary_matrix_all, "/Users/charleenadams/temp_BI/chemokine_rgs_olink/binary_matrix_all_and_chemokine_rowsums4.csv", row.names = TRUE)


# Step 1: Ensure row names are properly assigned
# Assuming `binary_matrix_all` is your data frame
rownames(binary_matrix_all) <- binary_matrix_all$ID  # Set the SNP IDs as row names
binary_matrix_all$ID <- NULL  # Remove the ID column if it's now in row names

# Step 2: Subset the data to include only the relevant columns
columns_to_display <- c("Row_Sum_Genome", "Row_Sum_Chemokine_Activity", unique_genes)  # Include specified columns
columns_to_display <- intersect(columns_to_display, colnames(binary_matrix_all))  # Ensure columns exist
binary_matrix_subset <- binary_matrix_all[, columns_to_display, drop = FALSE]  # Subset the matrix

# Step 3: Check if "11" exists in the row names and rename it to "colSum"
rownames(binary_matrix_subset)[rownames(binary_matrix_subset) == "11"] <- "colSum"

# Step 4: Display as an interactive table
datatable(binary_matrix_subset,
          options = list(
              pageLength = 10,    # Number of rows per page
              autoWidth = TRUE,   # Adjust column widths
              scrollX = TRUE      # Enable horizontal scrolling
          ),
          rownames = TRUE,       # Display row names
          caption = 'Pleiotropic SNP Pairs')  # Add caption

# Subset the data to keep only the relevant columns
rownames(binary_matrix_all)[rownames(binary_matrix_all) == "11"] <- "colSum"
binary_matrix_all_df <- as.data.frame(binary_matrix_all)
subset_df <- binary_matrix_all_df[, c("Row_Sum_Chemokine_Activity", "Row_Sum_Genome")]

# Calculate the proportion of chemokine activity
subset_df$Proportion_Chemokine_Activity <- with(subset_df, 
    ifelse(Row_Sum_Genome == 0, 0, Row_Sum_Chemokine_Activity / Row_Sum_Genome))

# View the resulting data frame
subset_df <- subset_df[order(subset_df$Proportion_Chemokine_Activity, decreasing = TRUE),]

# Step 4: Display as an interactive table
datatable(subset_df,
          options = list(
              pageLength = 10,    # Number of rows per page
              autoWidth = TRUE,   # Adjust column widths
              scrollX = TRUE      # Enable horizontal scrolling
          ),
          rownames = TRUE,       # Display row names
          caption = 'Proportion Chemokine Activity')  # Add caption

# Optionally, save the result
write.csv(subset_df, "subset_with_proportions.csv", row.names = FALSE)