1 Introduction

This analysis fine-maps a 1MB region on chr1 for proprotein convertase subtilisin/kexin type 9 (PCSK9) using SuSiE. We focus on a 500KB upstream and downstream of the sentinel SNP (rs11591147) and include only variants with a MAF >0.01. The purpose of the analysis is to obtain a positive control to sanity check if the methods I’m developing reproduce a known signal.

2 UKB-PPP

Direct quote from the synapse the UK Biobank Pharma Proteomics Project (UKB-PPP) website:

The UK Biobank Pharma Proteomics Project (UKB-PPP) is a collaboration between the UK Biobank (UKB) and thirteen biopharmaceutical companies characterising the plasma proteomic profiles of 54,219 UKB participants. As part of a collaborative analysis across the thirteen UKB-PPP partners, we conducted comprehensive protein quantitative trait loci (pQTL) mapping of 2,923 proteins that identifies 14,287 primary genetic associations, of which 85% are newly discovered, in addition to ancestry-specific pQTL mapping in non-Europeans. We identify independent secondary associations in 87% of cis and 30% of trans loci, expanding the catalogue of genetic instruments for downstream analyses.

NB: My understanding is that a larger list of UKB-PPP pQTL summary statistics are available somewhere. (Certainly a larger list of individual-level UKB protein measurements are available via UKB with application access.)

3 Getting the Data

The UKB-PPP pQTL summary statistics are available via synapse for the “discovery” UKB EUR population. To download, it requires creating a synapse account, obtaining a personal access token, and adding files to a download list.

cd /Users/charleenadams/temp_BI/susie_pcsk9
conda create -n synapse_env -c conda-forge mamba
conda activate synapse_env
pip install synapseclient
synapse -h
synapse get-download-list
synapse login "cda...." eyJ0eXAiOiJKV1QiLCJraWQiOiJXN05OOldMSlQ6SjVSSzpMN1RMOlQ3TDc6M1ZYNjpKRU9VOjY0NFI6VTNJWDo1S1oyOjdaQ0s6RlBUSCIsImFsZyI6IlJTMjU2In0.eyJhY2Nlc3MiOnsic2NvcGUiOlsidmlldyIsImRvd25sb2FkIiwibW9kaWZ5Il0sIm9pZGNfY2xhaW1zIjp7fX0sInRva2VuX3R5cGUiOiJQRVJTT05BTF9BQ0NFU1NfVE9LRU4iLCJpc3MiOiJodHRwczovL3JlcG8tcHJvZC5wcm9kLnNhZ2ViYXNlLm9yZy9hdXRoL3YxIiwiYXVkIjoiMCIsIm5iZiI6MTcyOTcwNjgzOSwiaWF0IjoxNzI5NzA2ODM5LCJqdGkiOiIxMzA3MyIsInN1YiI6IjM0NzI3ODMifQ.B65yULl25FZ1vmFOXEE6RaNYOhXTXZ5ov5JLVP47nJpejRA6oMfcwOcY1z4RH7bLFQ48xzRhc1I8kSbexid8jpoKnQJlBI8mVNItvv92dZNDrNN_8cvXtHNLVb6O12jwIZ0XDuQxcLuxm0kF52IZSXWMtwBzBeyKckkNYL9LTbQKDmFRmOpwZGVjkZyT7wsiuLLjj9bTcMfFcY-IstD7xnK4O51D_UejK6vg-Eh0Uw23i4NHhMAA02-D-tZq1yV0X8yI0EIOFoLQ3AKtXgrr4sE6zDGz4KTPaqVM557Vv9Aqk2KtnfiZRSWWxWYPbJm1Ty8AYLBDv6Iq-tdAval8kw

4 Data Preparation

4.1 Look at RSID

Maps are provided per chromosome.

In terminal window, look at the rsid map for chromosome 1

zless - olink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv.gz

ID      REF     ALT     rsid    POS19   POS38
1:10177:A:AC:imp:v1     A       AC      rs367896724     10177   10177
1:10352:T:TA:imp:v1     T       TA      rs201106462     10352   10352
1:10511:G:A:imp:v1      G       A       rs534229142     10511   10511

4.2 Look at the protein mapping files for PCSK9

grep "PCSK9" olink_protein_map_3k_v1.tsv
PCSK9:Q8NBP7:OID20235:v1    Proprotein convertase subtilisin/kexin type 9   OID20235    Q8NBP7  PCSK9   Cardiometabolic B04413  Q8NBP7  PCSK9   ENSG0000016917455039447 55064852    1   1:100   C   no

grep "PCSK9" olink_protein_map_1.5k_v1.tsv
PCSK9:Q8NBP7:OID20235:v1    Proprotein convertase subtilisin/kexin type 9   OID20235    Q8NBP7  PCSK9   Cardiometabolic B04413  Q8NBP7  PCSK9   ENSG0000016917455039447 55064852    1   1:100   C

4.3 Extract the PCSK9 data

tar -xf PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic.tar

4.4 Look at PCSK9 chr1

cd PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic
zless discovery_chr1_PCSK9:Q8NBP7:OID20235:v1:Cardiometabolic.gz

CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N TEST BETA SE CHISQ LOG10P EXTRA
1 17641 1:17641:G:A:imp:v1 G A 0.000857406 0.868653 34090 ADD 0.0543698 0.134404 0.16364 0.163785 NA
1 55057 1:55057:A:G:imp:v1 A G 0.000905533 0.722233 34090 ADD -0.177095 0.143424 1.52463 0.6637 NA
1 101551 1:101551:T:C:imp:v1 T C 0.000760446 0.771951 34090 ADD -0.185624 0.151463 1.50195 0.656845 NA

4.5 Merge mapping data and PCSK9 summary stats for chr 1

mapch1=fread("/Users/charleenadams/temp_BI/susie_pcsk9/olink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv")
psck9ch1=fread("/Users/charleenadams/temp_BI/susie_pcsk9/PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic/discovery_chr1_PCSK9:Q8NBP7:OID20235:v1:Cardiometabolic")
merged_data <- merge(mapch1, psck9ch1, by = "ID")
write.table(merged_data, "/Users/charleenadams/temp_BI/susie_pcsk9/PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic/merged_chr1_map.txt",
            row.names = FALSE, col.names = TRUE, quote = FALSE, sep = '\t')

5 Find sentinel SNP

merged_data <- fread("/Users/charleenadams/temp_BI/susie_pcsk9/PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic/merged_chr1_map.txt")

# Convert LOG10P to regular p-values
merged_data$Pval <- 10^(-merged_data$LOG10P)

# Sort by the newly created Pval column to find the SNP with the smallest p-value
sorted_data <- merged_data[order(merged_data$Pval), ]

# Display the SNP with the smallest p-value (which will be the first row after sorting)
head(sorted_data, 1)
##                       ID    REF    ALT       rsid    POS19    POS38 CHROM
##                   <char> <char> <char>     <char>    <int>    <int> <int>
## 1: 1:55505647:G:T:imp:v1      G      T rs11591147 55505647 55039974     1
##      GENPOS ALLELE0 ALLELE1    A1FREQ    INFO     N   TEST     BETA        SE
##       <int>  <char>  <char>     <num>   <num> <int> <char>    <num>     <num>
## 1: 55039974       G       T 0.0176885 1.00225 34090    ADD -1.15328 0.0277856
##     CHISQ  LOG10P  EXTRA  Pval
##     <num>   <num> <lgcl> <num>
## 1: 1722.8 375.816     NA     0
# Identify the SNP of interest
snp_of_interest <- sorted_data %>% filter(rsid == "rs11591147")

# Extract position and chromosome
snp_position <- snp_of_interest$POS38[1]
chromosome <- snp_of_interest$CHROM[1]

# Define 500KB upstream and downstream boundaries
upstream_limit <- snp_position - 500000
downstream_limit <- snp_position + 500000

# Subset data within the 500KB window
subset_data <- sorted_data %>%
  filter(CHROM == chromosome &
         POS38 >= upstream_limit &
         POS38 <= downstream_limit)

# Save the subset data
write.table(subset_data, "/Users/charleenadams/temp_BI/susie_pcsk9/500kb_up_and_downstream_sentinel_PCSK9.txt",
            row.names = FALSE, col.names = TRUE, quote = FALSE, sep = '\t')
write.csv(subset_data, "/Users/charleenadams/temp_BI/susie_pcsk9/500kb_up_and_downstream_sentinel_PCSK9.csv")

7 Data Cleaning and Preparation

# Location for square LD matrix and SNP list 
ld_matrix_file1 <- "/Users/charleenadams/temp_BI/LD_matrices/1000Genomes_scratch/annotated/eur/eur_bfiles_filtered/eur_bfiles_MAF01/chr1_ld_matrixMAF01/LD_matrix_chr1_54539974_55539974_maf01_filtered_EUR2.phased.vcor1"
snp_list_file1 <- "/Users/charleenadams/temp_BI/LD_matrices/1000Genomes_scratch/annotated/eur/eur_bfiles_filtered/eur_bfiles_MAF01/chr1_ld_matrixMAF01/LD_matrix_chr1_54539974_55539974_maf01_filtered_EUR2.phased.vcor1.vars"

# Load square LD matrix and SNP list
ld_matrix1 <- as.matrix(read.table(ld_matrix_file1, header = FALSE, stringsAsFactors = FALSE))
snp_list1 <- read.table(snp_list_file1, header = FALSE, stringsAsFactors = FALSE)$V1
rownames(ld_matrix1) <- colnames(ld_matrix1) <- snp_list1

# Ensure LD matrix is symmetric
if (!isSymmetric(ld_matrix1)) {
  ld_matrix1 <- (ld_matrix1 + t(ld_matrix1)) / 2
}

# Load LD dataframe for MAJ_A annotation
ld_matrix_file_df <- fread("/Users/charleenadams/temp_BI/LD_matrices/1000Genomes_scratch/annotated/eur/eur_bfiles_filtered/eur_bfiles_MAF01/chr1_ld_matrixMAF01/LD_matrix_chr1_54539974_55539974_maf01_filtered_EUR2_df.vcor")

# Read-in the chr1 regional subset for the *PCSK9*

# Convert both datasets to data.table if not already
pcsk9_ch1=read.csv("/Users/charleenadams/temp_BI/susie_pcsk9/500kb_up_and_downstream_sentinel_PCSK9.csv")
dim(pcsk9_ch1)
## [1] 6665   21
setDT(pcsk9_ch1)
setDT(ld_matrix_file_df)

# Remove duplicates based on ID_A, keeping the first occurrence
ld_matrix_file_df2_unique <- ld_matrix_file_df %>%
  distinct(ID_A, .keep_all = TRUE)

# Merge with subset_data on rsid and ID_A
plink_merged_data <- pcsk9_ch1 %>%
  inner_join(ld_matrix_file_df2_unique, by = c("rsid" = "ID_A"))
# colnames(plink_merged_data)
# dim(plink_merged_data)

# Create a column to indicate mismatched alleles
plink_merged_data <- plink_merged_data %>%
  mutate(
    mismatch = MAJ_A != ALLELE0
  )
table(plink_merged_data$mismatch)
## 
## FALSE  TRUE 
##  2547   861
# Add variable for mismatched MAJ_A and ALLELE0
mismatches=plink_merged_data[which(plink_merged_data$mismatch=="TRUE"),]
#head(mismatches)

# Look at the A1FREQ for the mismatches: most are the reference allele...
summary(mismatches$A1FREQ)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002226 0.625754 0.739411 0.723622 0.809163 0.993406
plink_merged_data_copy <- copy(plink_merged_data) # retain original for sanity checking

# Harmonize alleles where there is a mismatch in the copied data
plink_merged_data[mismatch == TRUE, `:=`(
  # Swap ALLELE0 and ALLELE1 if there is a mismatch
  ALLELE0 = ALLELE1,
  ALLELE1 = ALLELE0,
  
  # Flip the BETA sign if there is a mismatch
  BETA = -BETA
)]

# Check the results to verify that mismatches are resolved
print(plink_merged_data[mismatch == TRUE])
##          X                    ID    REF    ALT       rsid    POS19    POS38
##      <int>                <char> <char> <char>     <char>    <int>    <int>
##   1:     3 1:55521313:G:T:imp:v1      G      T   rs472495 55521313 55055640
##   2:     4 1:55521109:G:A:imp:v1      G      A   rs693668 55521109 55055436
##   3:     5 1:55519174:G:A:imp:v1      G      A   rs499883 55519174 55053501
##   4:     8 1:55521242:T:G:imp:v1      T      G   rs471705 55521242 55055569
##   5:    10 1:55520212:G:A:imp:v1      G      A  rs1165287 55520212 55054539
##  ---                                                                       
## 857:  6614 1:55125417:T:C:imp:v1      T      C   rs652245 55125417 54659744
## 858:  6622 1:55135161:A:G:imp:v1      A      G   rs484251 55135161 54669488
## 859:  6624 1:55271495:G:T:imp:v1      G      T  rs2077061 55271495 54805822
## 860:  6656 1:55921073:A:G:imp:v1      A      G rs12743014 55921073 55455400
## 861:  6663 1:55354302:C:A:imp:v1      C      A  rs3170766 55354302 54888629
##      CHROM   GENPOS ALLELE0 ALLELE1   A1FREQ     INFO     N   TEST         BETA
##      <int>    <int>  <char>  <char>    <num>    <num> <int> <char>        <num>
##   1:     1 55055640       T       G 0.650489 0.996690 34090    ADD -1.57226e-01
##   2:     1 55055436       A       G 0.649923 0.995948 34090    ADD -1.56640e-01
##   3:     1 55053501       A       G 0.613184 0.960763 34090    ADD -1.54644e-01
##   4:     1 55055569       G       T 0.640686 0.998569 34090    ADD -1.48991e-01
##   5:     1 55054539       A       G 0.640543 0.989352 34090    ADD -1.49169e-01
##  ---                                                                           
## 857:     1 54659744       C       T 0.755154 0.983153 34090    ADD -1.49105e-04
## 858:     1 54669488       G       A 0.719152 0.975224 34090    ADD  1.23720e-04
## 859:     1 54805822       T       G 0.609815 0.990566 34090    ADD -1.07789e-04
## 860:     1 55455400       G       A 0.545352 0.993168 34090    ADD  2.55647e-05
## 861:     1 54888629       A       C 0.583485 0.995522 34090    ADD  6.02401e-06
##              SE       CHISQ      LOG10P  EXTRA         Pval #CHROM_A    POS_A
##           <num>       <num>       <num> <lgcl>        <num>    <int>    <int>
##   1: 0.00770493 4.16401e+02 9.18292e+01     NA 1.481836e-92        1 55055640
##   2: 0.00770475 4.13320e+02 9.11585e+01     NA 6.942246e-92        1 55055436
##   3: 0.00768205 4.05241e+02 8.93999e+01     NA 3.981988e-90        1 55053501
##   4: 0.00764990 3.79323e+02 8.37576e+01     NA 1.747431e-84        1 55055569
##   5: 0.00768484 3.76779e+02 8.32037e+01     NA 6.256047e-84        1 55054539
##  ---                                                                         
## 857: 0.00860106 3.00526e-04 6.04874e-03     NA 9.861688e-01        1 54659744
## 858: 0.00826730 2.23949e-04 5.21661e-03     NA 9.880602e-01        1 54669488
## 859: 0.00755540 2.03532e-04 4.97176e-03     NA 9.886174e-01        1 54805822
## 860: 0.00740328 1.19243e-05 1.19823e-03     NA 9.972448e-01        1 55455400
## 861: 0.00745639 6.52702e-07 2.80041e-04     NA 9.993554e-01        1 54888629
##       MAJ_A CHROM_B    POS_B       ID_B  MAJ_B PHASED_R mismatch
##      <char>   <int>    <int>     <char> <char>    <num>   <lgcl>
##   1:      T       1 55056468   rs479910      G 0.930836     TRUE
##   2:      A       1 55055569   rs471705      G 0.981177     TRUE
##   3:      A       1 55053565   rs521662      T 0.843058     TRUE
##   4:      G       1 55055640   rs472495      T 0.981177     TRUE
##   5:      A       1 55054592   rs634272      C 0.820034     TRUE
##  ---                                                            
## 857:      C       1 54664069   rs689258      T 0.761521     TRUE
## 858:      G       1 54670394  rs2251719      G 0.897987     TRUE
## 859:      T       1 54806101   rs757581      G 0.985492     TRUE
## 860:      G       1 55455817 rs12124470      T 0.600031     TRUE
## 861:      A       1 54888662  rs4927176      T 1.000000     TRUE
# print(head(plink_merged_data))
# print(head(plink_merged_data_copy))

# Create a column to indicate mismatched alleles have been resolved
plink_merged_data <- plink_merged_data %>%
  mutate(
    mismatch_resolved = MAJ_A != ALLELE0
  )
table(plink_merged_data$mismatch_resolved)
## 
## FALSE  TRUE 
##  3407     1
unresolved=plink_merged_data[which(plink_merged_data$mismatch_resolved=="TRUE"),]
unresolved
##        X                    ID    REF    ALT      rsid    POS19    POS38 CHROM
##    <int>                <char> <char> <char>    <char>    <int>    <int> <int>
## 1:  5810 1:55723635:A:T:imp:v1      A      T rs2802860 55723635 55257962     1
##      GENPOS ALLELE0 ALLELE1     A1FREQ     INFO     N   TEST       BETA
##       <int>  <char>  <char>      <num>    <num> <int> <char>      <num>
## 1: 55257962       T       A 0.00222606 0.681933 34090    ADD -0.0287456
##           SE     CHISQ  LOG10P  EXTRA      Pval #CHROM_A    POS_A  MAJ_A
##        <num>     <num>   <num> <lgcl>     <num>    <int>    <int> <char>
## 1: 0.0942173 0.0930852 0.11902     NA 0.7602913        1 55257962      C
##    CHROM_B    POS_B       ID_B  MAJ_B PHASED_R mismatch mismatch_resolved
##      <int>    <int>     <char> <char>    <num>   <lgcl>            <lgcl>
## 1:       1 55258144 rs10888907      A  0.59065     TRUE              TRUE
# Hot damn, those multi-allelic variants!!! 
# Look up the unresolved snp in dbsnp: 

# rs2802860 [Homo sapiens]
# Variant type:SNVAlleles:A>C,T [Show Flanks]Chromosome:1:55257962 (GRCh38)
# 1:55723635 (GRCh37)

plink_merged_data_no_mulitalleles <- plink_merged_data[which(plink_merged_data$mismatch_resolved=="FALSE"),]
dim(plink_merged_data_no_mulitalleles)
## [1] 3407   31
# Identify palindromes but keep for now...
plink_merged_data_no_mulitalleles_palindromesID <- plink_merged_data_no_mulitalleles %>%
  mutate(
    palindromic = (ALLELE1 %in% c("A", "T") & ALLELE0 %in% c("A", "T")) | 
                  (ALLELE1 %in% c("C", "G") & ALLELE0 %in% c("C", "G"))
  )
table(plink_merged_data_no_mulitalleles_palindromesID$palindromic)
## 
## FALSE  TRUE 
##  2939   468
# Extract the SNPs from plink_merged_data_no_mulitalleles_palindromesID
snps_to_include <- plink_merged_data_no_mulitalleles_palindromesID$rsid

# Subset ld_matrix1 to include only the SNPs in snps_to_include
# Also ensure the order of the rows and columns follows the order in snps_to_include
ld_matrix1_filtered <- ld_matrix1[snps_to_include, snps_to_include]
dim(ld_matrix1_filtered)
## [1] 3407 3407
str(ld_matrix1_filtered)
##  num [1:3407, 1:3407] 1 0.00386 0.00386 0.01769 0.01415 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3407] "rs11591147" "rs472495" "rs693668" "rs499883" ...
##   ..$ : chr [1:3407] "rs11591147" "rs472495" "rs693668" "rs499883" ...
write.table(plink_merged_data_no_mulitalleles_palindromesID,
            "/Users/charleenadams/temp_BI/susie_pcsk9/PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic/plink_merged_data_no_mulitalleles_palindromesID.txt", row.names = FALSE, col.names = TRUE, quote = FALSE, sep = '\t')

8 SuSiE Results

# SuSiE needs a single value for the sample size: 
summary(plink_merged_data_no_mulitalleles_palindromesID$N)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34090   34090   34090   34090   34090   34090
dataset1QC <- list(
  beta = plink_merged_data_no_mulitalleles_palindromesID$BETA,
  varbeta = (plink_merged_data_no_mulitalleles_palindromesID$SE)^2,
  type = "quant",
  snp = plink_merged_data_no_mulitalleles_palindromesID$rsid,
  position = plink_merged_data_no_mulitalleles_palindromesID$POS38,
  N = plink_merged_data_no_mulitalleles_palindromesID$N[1],
  MAF = plink_merged_data_no_mulitalleles_palindromesID$A1FREQ,
  LD = ld_matrix1_filtered,
  sdY = 1
)

cat("Running SuSiE for PCSK9 ...\n")
## Running SuSiE for PCSK9 ...
susie_trait1QC <- susie_rss(
  z = dataset1QC$beta / sqrt(dataset1QC$varbeta),
  R = dataset1QC$LD,
  n = dataset1QC$N,
  var_y = dataset1QC$sdY^2,
  L = 10,
  verbose = FALSE
)
cat("SuSiE run for Trait 1 completed.\n")
## SuSiE run for Trait 1 completed.
summary(susie_trait1QC)
## 
## Variables in credible sets:
## 
##  variable variable_prob cs
##         1    1.00000000  1
##         4    1.00000000  2
##         6    1.00000000  6
##       135    1.00000000  9
##        68    1.00000000  8
##      3058    1.00000000  3
##      1039    0.99994243  5
##       344    0.99384049  7
##       749    0.54530243 10
##        45    0.40976783  4
##        48    0.33576116  4
##        47    0.25002138  4
##       751    0.22997779 10
##       778    0.16324414 10
##       814    0.02165504 10
## 
## Credible sets summary:
## 
##  cs cs_log10bf cs_avg_r2 cs_min_r2        variable
##   1        Inf 1.0000000 1.0000000               1
##   2 116.331764 1.0000000 1.0000000               4
##   3  25.734170 1.0000000 1.0000000            3058
##   5  20.982245 1.0000000 1.0000000            1039
##   6 139.228980 1.0000000 1.0000000               6
##   7  42.333242 1.0000000 1.0000000             344
##   8  22.979664 1.0000000 1.0000000              68
##   9 109.676113 1.0000000 1.0000000             135
##   4   9.511335 0.9964352 0.9946552        45,47,48
##  10  23.412318 0.9843960 0.9686733 749,751,778,814
# Run also with Chris Wallace's wrapper: `runsusie`
susie_trait1_runsusieQC <- runsusie(dataset1QC, verbose = FALSE)
summary(susie_trait1_runsusieQC)
## 
## Variables in credible sets:
## 
##  variable variable_prob cs
##         1    1.00000000  1
##         4    1.00000000  2
##         6    1.00000000  6
##       135    1.00000000  9
##        68    1.00000000  8
##      3058    1.00000000  3
##      1039    0.99994243  5
##       344    0.99384049  7
##       749    0.54530243 10
##        45    0.40976783  4
##        48    0.33576116  4
##        47    0.25002138  4
##       751    0.22997779 10
##       778    0.16324414 10
##       814    0.02165504 10
## 
## Credible sets summary:
## 
##  cs cs_log10bf cs_avg_r2 cs_min_r2        variable
##   1        Inf 1.0000000 1.0000000               1
##   2 116.331764 1.0000000 1.0000000               4
##   3  25.734170 1.0000000 1.0000000            3058
##   5  20.982245 1.0000000 1.0000000            1039
##   6 139.228980 1.0000000 1.0000000               6
##   7  42.333242 1.0000000 1.0000000             344
##   8  22.979664 1.0000000 1.0000000              68
##   9 109.676113 1.0000000 1.0000000             135
##   4   9.511335 0.9964352 0.9946552        45,47,48
##  10  23.412318 0.9843960 0.9686733 749,751,778,814
# Extract the variables data frame from the SuSiE object
variables_df <- susie_trait1_runsusieQC$variables

if(is.null(variables_df)) {
  # This path assumes 'variables' is within the 'sets' list, adjust if necessary
  variables_df <- data.frame(
    variable = unlist(lapply(susie_trait1_runsusieQC$sets$cs, function(x) x)),
    cs = rep(names(susie_trait1_runsusieQC$sets$cs), times = lengths(susie_trait1_runsusieQC$sets$cs)),
    variable_prob = susie_trait1_runsusieQC$pip[unlist(susie_trait1_runsusieQC$sets$cs)]
  )
}

# Sum 'variable_prob' by 'cs'
sum_by_cs <- aggregate(variable_prob ~ cs, data = variables_df, FUN = sum)

# Print or further use 'sum_by_cs'
print(sum_by_cs)
##     cs variable_prob
## 1   L1     1.0000000
## 2  L10     0.9601794
## 3   L2     1.0000000
## 4   L3     1.0000000
## 5   L4     0.9955504
## 6   L5     0.9999424
## 7   L6     1.0000000
## 8   L7     0.9938405
## 9   L8     1.0000000
## 10  L9     1.0000000
# Extract the variable strings for each credible set
susie_variable_strings_pcsk9 <- summary(susie_trait1QC)$cs$variable
susie_variable_strings_pcsk9
##  [1] "1"               "4"               "3058"            "1039"           
##  [5] "6"               "344"             "68"              "135"            
##  [9] "45,47,48"        "749,751,778,814"
# Split the strings by commas and convert them into a numeric vector
susie_variable_indices_pcsk9  <- unlist(strsplit(susie_variable_strings_pcsk9 , ",")) %>% as.numeric()
susie_variable_indices_pcsk9 
##  [1]    1    4 3058 1039    6  344   68  135   45   47   48  749  751  778  814
# Extract the corresponding SNPs from the dataset using the indices
selected_snps_pcsk9 <- dataset1QC[["snp"]][susie_variable_indices_pcsk9]

# Print the selected SNPs
print(selected_snps_pcsk9)
##  [1] "rs11591147"  "rs499883"    "rs78080497"  "rs41294810"  "rs12750160" 
##  [6] "rs112339942" "rs12732125"  "rs34324306"  "rs2479404"   "rs2495491"  
## [11] "rs2495489"   "rs17109407"  "rs4262587"   "rs199775702" "rs7538000"
see_snps_in_data_pcsk9=plink_merged_data_no_mulitalleles_palindromesID[which(plink_merged_data_no_mulitalleles_palindromesID$rsid%in%selected_snps_pcsk9),]

# Subset the LD matrix to include only SNPs in all credible sets
ld_matrix_selected_snps_pcsk9 <- ld_matrix1_filtered[selected_snps_pcsk9, selected_snps_pcsk9]
str(ld_matrix_selected_snps_pcsk9)
##  num [1:15, 1:15] 1 0.0177 0.2689 0.0786 0.3806 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:15] "rs11591147" "rs499883" "rs78080497" "rs41294810" ...
##   ..$ : chr [1:15] "rs11591147" "rs499883" "rs78080497" "rs41294810" ...

8.1 Compare with Benjamin Sun’s Credible Sets

From Supplementary Table 16 of Plasma proteomic associations with genetics and health in the UK Biobank

NB: We centered around rs11591147. We found 5 SNPs that match those that Sun found (see below the table).

# Input as a single character string (copy-paste it directly)
sun_cs <- "rs11591147
rs472495, rs693668
rs760981293, rs199714093, rs749008649
rs2479404, rs2495491, rs2495489
rs7552350, rs613855
rs111928762, 1:55701838_ATGAG_A, rs11206517, rs113793186, rs3737810, rs2902875, rs75691641, rs113911544, rs505151, rs66679331, rs67722862, rs72662528, rs67540256, rs66499115, rs58472471, rs67293523, rs72662534, rs12067569, rs17131756, rs58447843, rs67175330, rs7525503, rs17111688, rs56833376, rs61163343, rs67276129, rs1475701, 1:55619453_ATAC_A, rs57357765, rs72911441, rs61176209, rs56235208, rs2270690, rs112339942, rs754615958, rs6697048, rs67171713
1:55515668_CACG_C, rs479832, rs531701, rs557211, rs535471, rs534473, rs41294821"

# Step 1: Replace newlines with commas to standardize the string
sun_cs_cleaned <- gsub("\n", ", ", sun_cs)

# Step 2: Split by comma and trim any extra spaces around the entries
sun_cs_cleaned_final <- trimws(unlist(strsplit(sun_cs_cleaned, ",")))

# View the cleaned result
print(sun_cs_cleaned_final)
##  [1] "rs11591147"         "rs472495"           "rs693668"          
##  [4] "rs760981293"        "rs199714093"        "rs749008649"       
##  [7] "rs2479404"          "rs2495491"          "rs2495489"         
## [10] "rs7552350"          "rs613855"           "rs111928762"       
## [13] "1:55701838_ATGAG_A" "rs11206517"         "rs113793186"       
## [16] "rs3737810"          "rs2902875"          "rs75691641"        
## [19] "rs113911544"        "rs505151"           "rs66679331"        
## [22] "rs67722862"         "rs72662528"         "rs67540256"        
## [25] "rs66499115"         "rs58472471"         "rs67293523"        
## [28] "rs72662534"         "rs12067569"         "rs17131756"        
## [31] "rs58447843"         "rs67175330"         "rs7525503"         
## [34] "rs17111688"         "rs56833376"         "rs61163343"        
## [37] "rs67276129"         "rs1475701"          "1:55619453_ATAC_A" 
## [40] "rs57357765"         "rs72911441"         "rs61176209"        
## [43] "rs56235208"         "rs2270690"          "rs112339942"       
## [46] "rs754615958"        "rs6697048"          "rs67171713"        
## [49] "1:55515668_CACG_C"  "rs479832"           "rs531701"          
## [52] "rs557211"           "rs535471"           "rs534473"          
## [55] "rs41294821"
# Find SNPs that are in both selected_snps_pcsk9 and sun_cs_cleaned_final
common_snps_adams_sun <- intersect(selected_snps_pcsk9, sun_cs_cleaned_final)

# Print the common SNPs
print(common_snps_adams_sun)
## [1] "rs11591147"  "rs112339942" "rs2479404"   "rs2495491"   "rs2495489"

8.2 Correlation Matrix between SNPs in Credible Sets

# Visualize the correlation matrix using corrplot for selected SNPs
corrplot(ld_matrix_selected_snps_pcsk9, method = "color", tl.cex = 0.6, tl.col = "black", is.corr = TRUE)
mtext("LD between SNPs in all credible sets for PCSK9", side = 3, line = 5, cex = 1.2)

8.3 Manhattn Plot

# Prepare Manhattan plot

plink_merged_data_no_mulitalleles_palindromesID <- fread("/Users/charleenadams/temp_BI/susie_pcsk9/PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic/plink_merged_data_no_mulitalleles_palindromesID.txt")
# colnames(plink_merged_data_no_mulitalleles_palindromesID)

# Define genome-wide significance threshold
genomewide_sig_threshold <- -log10(5e-8)

# Prepare data for Manhattan plot
# plink_merged_data_no_mulitalleles_palindromesID$LOG10P

# Convert to data frame if necessary
if (is.data.table(plink_merged_data_no_mulitalleles_palindromesID)) {
  plink_merged_data_no_mulitalleles_palindromesID_df <- as.data.frame(plink_merged_data_no_mulitalleles_palindromesID)
} else {
  plink_merged_data_no_mulitalleles_palindromesID_df <- plink_merged_data_no_mulitalleles_palindromesID
}

# Prepare data for labeling
snps_to_label <- unique(c(selected_snps_pcsk9)) #,
snps_label_data <- plink_merged_data_no_mulitalleles_palindromesID_df %>%
  dplyr::filter(rsid %in% snps_to_label) %>%
  dplyr::select(rsid, POS38, LOG10P) # , Gene

# Create Manhattan Plot
manhattan_plot <- ggplot(plink_merged_data_no_mulitalleles_palindromesID_df, aes(x = POS38, y = LOG10P)) +
  geom_point(aes(color = as.factor(CHROM)), alpha = 0.75, size = 1.3) +
  scale_color_manual(values = rep(c("blue", "orange"), length(unique(plink_merged_data_no_mulitalleles_palindromesID_df$CHROM)))) +
  labs(
    x = "Genomic Position (Chromosome 1)",
    y = expression(-log[10](italic(p)-value)),
    title = "Manhattan Plot with SuSiE Credible Set SNPs Labeled"
  ) +
  geom_hline(
    yintercept = genomewide_sig_threshold,
    linetype = "dotted",
    color = "red",
    linewidth = 1
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8)
  ) +
  geom_text_repel(
    data = snps_label_data,
    aes(label = paste(rsid,  sep = " - ")),#Gene,
    nudge_y = 1,
    size = 3,
    color = "black",
    max.overlaps = Inf,
    box.padding = 0.5,
    point.padding = 0.5,
    segment.color = "grey50"
  )

# Display the plot
print(manhattan_plot)

# Save the plot
output_dir <- "/Users/charleenadams/temp_BI/susie_pcsk9/PCSK9_Q8NBP7_OID20235_v1_Cardiometabolic/figures"
if (!dir.exists(output_dir)) {
  dir.create(output_dir, recursive = TRUE)
}
ggsave(
  filename = file.path(output_dir, "pcsk9_susie_manhattan.pdf"),
  plot = manhattan_plot,
  width = 10,
  height = 6,
  dpi = 300
)