1 Multivariate GWAS for Chemokines

In this analysis, we employ a genomic structural equation model (Genomic SEM) to define a new phenotype: a latent factor that influences the chemokines.

In essence, while the chemokines themselves are directly measurable phenotypes, the latent factor we infer captures the shared genetic influence among these chemokines. The GWAS results, therefore, represent the genetic associations with this inferred, composite trait, providing insights into the genetics of an underlying immune phenomenon rather than focusing on any single chemokine alone.
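Concretely, the common factor model fit in Section 4.2.5 below has the following shape. A minimal sketch of the specification (lavaan-style syntax, as consumed by GenomicSEM's userGWAS; the three trait names are placeholders for the full set of chemokine labels):

# One latent factor (F1) loads on every chemokine; the SNP effect is
# estimated on F1 itself rather than on any single chemokine
model_string <- "
F1 =~ CXCL8 + CXCL1 + CCL11   # factor loadings (placeholders for all traits)
F1 ~ SNP                      # SNP -> latent factor path: the mvGWAS estimate
"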

2 Data Preparation

2.1 Bioinformatics

I had previously obtained, untarred, formatted, and munged 33 UKB-PPP chemokine summary statistics. I transferred these files, the precomputed European LD scores, and w_hm3.snplist with rsync, using the IP address of my BIDMC laptop (redacted as IP below) and its password (not included).

(I had obtained the precalculated LD scores for Europeans in 1000 Genomes Project Phase 3, as provided by the Alkes lab. These files were generated using HapMap3 SNPs within the European subset of the 1000 Genomes samples, excluding the MHC region, whose LD structure is particularly complex owing to high genetic diversity and extended linkage disequilibrium.)

2.1.1 Rsync Transfer

rsync -av --progress --partial /Users/charleenadams/temp_BI/chemokine_rgs_olink/ charleenadams@IP:/Users/charleenadams/temp_BI/chemokine_rgs_olink/

2.1.2 Untar the Remaining Chemokine Summary Statistics

I obtained all the UKB-PPP summary statistics for Europeans from Synapse. I’m only going to untar the chemokine files I did not process previously.

Scripts for this analysis are stored at:

/Users/charleenadams/temp_BI/mvGWAS_chemokines

Files to untar:

  • CXCL8_P10145_OID20153_v1_Cardiometabolic.tar

  • CXCL8_P10145_OID20631_v1_Inflammation.tar

  • CXCL8_P10145_OID20997_v1_Neurology.tar

  • CXCL8_P10145_OID21430_v1_Oncology.tar

  • CXCL1_P09341_OID20762_v1_Inflammation.tar

  • CCL11_P51671_OID20668_v1_Inflammation.tar

#!/bin/bash

# Define the directory containing your tar files
tar_directory="/Users/charleenadams/ukbppp"

# Define the output directory for extracted files
output_directory="/Users/charleenadams/ukbppp/extracted_proteins"

# Create the output directory if it doesn't exist
mkdir -p "$output_directory"

# List of tar files to extract
tar_files=(
    "CXCL8_P10145_OID20153_v1_Cardiometabolic.tar"
    "CXCL8_P10145_OID20631_v1_Inflammation.tar"
    "CXCL8_P10145_OID20997_v1_Neurology.tar"
    "CXCL8_P10145_OID21430_v1_Oncology.tar"
    "CXCL1_P09341_OID20762_v1_Inflammation.tar"
    "CCL11_P51671_OID20668_v1_Inflammation.tar"
)

# Iterate over each tar file in the list
for tar_file in "${tar_files[@]}"; do
    # Construct the full path to the tar file
    full_path="$tar_directory/$tar_file"

    # Check if the tar file exists
    if [[ -f "$full_path" ]]; then
        echo "Extracting $tar_file..."
        # Extract the tar file into the output directory
        tar -xf "$full_path" -C "$output_directory"
    else
        echo "File $tar_file does not exist in $tar_directory."
    fi
done

echo "Extraction complete."

# Run with:
chmod +x untar_remaining_chemokines.sh
nohup ./untar_remaining_chemokines.sh > untar_remaining_chemokines_log.txt 2>&1 &
ps -p 97839 -o stat,etime  # 97839 is the PID that nohup/& reported at launch

2.1.3 Format the Remaining Chemokine Files

2.1.4 Add rsIDs

This script keeps POS38 and POS19. Position isn't needed for ldsc because the rsID is used, but for all downstream analyses it's obviously better to keep both than to merge one or the other back in later. (When I ran this for the original 33 chemokines, POS19 was dropped.) In addition, I sped up the process with GNU parallel.

#!/bin/bash

# Base directory where your data is located
base_dir="/Users/charleenadams/ukbppp/extracted_proteins"

# Directory where RSID map files are located
map_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/maps"

# Log file to track progress
log_file="$base_dir/add_rsid_log.txt"
echo "Log started on $(date)" > "$log_file"

# Function to process each file
process_file() {
    local file="$1"
    local dir
    dir=$(dirname "$file")
    local chrom
    chrom=$(basename "$file" | sed -E 's/.*chr([0-9XY]+).*/\1/')
    local map_file="$map_dir/olink_rsid_map_mac5_info03_b0_7_chr${chrom}_patched_v2.tsv.gz"
    local output_file="${file%.gz}_rsid_added.gz"

    {
        echo "Processing file: $file (Chromosome $chrom)"

        if [[ -n "$chrom" && -f "$map_file" ]]; then
            echo "  Using RSID map: $map_file"

            # Remove existing rsid_added file to avoid overwrite prompts
            rm -f "$output_file"

            # Join the original file with the RSID map based on the ID column
            gunzip -c "$file" | \
            awk 'BEGIN {OFS="\t"}
                 NR==FNR {map[$1]=$4; genpos37[$1]=$5; genpos38[$1]=$6; next}
                 ($3 in map) {print $0, map[$3], genpos37[$3], genpos38[$3]}
                 !($3 in map) {print $0, ".", ".", "."}' <(gunzip -c "$map_file") - | \
            gzip -c > "$output_file"

            if [[ -s "$output_file" ]]; then
                echo "  RSID added with GENPOS37 and GENPOS38. Output saved to $output_file"
            else
                echo "  Warning: $output_file is empty!"
            fi
        else
            echo "  Error: RSID map file for chromosome $chrom not found or chromosome not identified."
        fi
    } >> "$log_file" 2>&1
}

export -f process_file
export map_dir
export log_file

# Find all .gz files and process them in parallel
find "$base_dir" -type f -name "*.gz" | parallel -j4 process_file

echo "Log ended on $(date)" >> "$log_file"
echo "Processing complete. Check $log_file for details."
# Run with:
nano add_rsid.sh
chmod +x add_rsid.sh
nohup ./add_rsid.sh > add_rsid_log.txt 2>&1 &
ps -p 18092 -o stat,etime

2.1.5 Merge the Chromosomes

#!/bin/bash

# Define base directory
base_dir="/Users/charleenadams/ukbppp/extracted_proteins"

# Log file for tracking
log_file="$base_dir/merge_rsid_files_log.txt"
echo "Processing started on $(date)" > "$log_file"

# Iterate over each subdirectory in the base directory
for dir in "$base_dir"/*/; do
    # Skip directories you don’t want to process
    if [[ "$dir" == *"maps"* || "$dir" == *"1000G_Phase3_weights_hm3_no_MHC"* || "$dir" == *"ldsc"* || "$dir" == *"rsconnect"* ]]; then
        continue
    fi

    echo "Processing directory: $dir" | tee -a "$log_file"

    # Collect all '_rsid_added.gz' files in this directory
    rsid_files=("$dir"*_rsid_added.gz)

    # Check if there are files to merge (an unmatched glob leaves the literal
    # pattern in the array, so also verify that the first entry exists)
    if [[ ${#rsid_files[@]} -gt 0 && -e "${rsid_files[0]}" ]]; then
        merged_file="${dir%/}_merged_rsid_added.gz"

        # Extract header from the first file
        gunzip -c "${rsid_files[0]}" | head -n 1 > temp_header.txt

        # Process each file, excluding the header, and append to a temporary
        # content file (truncate it first so stale data from an interrupted
        # run cannot leak in)
        : > temp_content.txt
        for file in "${rsid_files[@]}"; do
            gunzip -c "$file" | tail -n +2 >> temp_content.txt
        done

        # Concatenate header and content into the final merged file
        cat temp_header.txt temp_content.txt | gzip -c > "$merged_file"

        # Clean up temporary files
        rm temp_header.txt temp_content.txt

        # Confirm if the merged file was created successfully and isn’t empty
        if [[ -s "$merged_file" ]]; then
            echo "  Merged file created: $merged_file" | tee -a "$log_file"
        else
            echo "  Warning: Merged file is empty for directory $dir" | tee -a "$log_file"
        fi
    else
        echo "  No '_rsid_added.gz' files found in directory $dir" | tee -a "$log_file"
    fi
done

echo "Processing completed on $(date)" >> "$log_file"
echo "Check $log_file for details."

# Run with:
nano chemokine_chrom_merge.sh
chmod +x chemokine_chrom_merge.sh
nohup ./chemokine_chrom_merge.sh > chemokine_chrom_merge_log.txt 2>&1 &
ps -p 20853 -o stat,etime

2.1.6 Add a P (p-value) Column

The Olink files report p-values as −log10(P) in the LOG10P column, but LDSC needs untransformed p-values. NOTE: check the LOG10P column number ($13 below) before running!
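A quick way to verify this (an R/data.table sketch; the filename is a placeholder for any of the merged files, and fread needs the R.utils package to read .gz directly):

library(data.table)

# Read only the header and locate the LOG10P column before trusting awk's $13
hdr <- names(fread("CCL11_P51671_OID20668_v1_Inflammation_merged_rsid_added.gz", nrows = 0))
which(hdr == "LOG10P")  # should print 13 to match the script below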

#!/bin/bash

cd /Users/charleenadams/ukbppp/extracted_proteins
# Loop through each *merged_rsid_added.gz file
for file in *_merged_rsid_added.gz; do
    # Define the output filename
    output_file="${file%_merged_rsid_added.gz}_with_pval.gz"
    
    # Use awk to calculate P-value from LOG10P and write to a new gzipped file
    gunzip -c "$file" | \
    awk 'BEGIN {OFS="\t"} NR==1 {print $0, "P"} NR>1 {pval = (10 ^ -$13); print $0, pval}' | \
    gzip > "$output_file"
    
    echo "Processed file saved: $output_file"
done

# Run with:
nano pval.sh
chmod +x pval.sh
nohup ./pval.sh > pval_log.txt 2>&1 &
ps -p 22727 -o stat,etime

3 LDSC Munge Loop for the 6 Chemokine Summary Statistics

Store the output in the folder with the 33 previously munged chemokines:

NOTE: Getting ldsc installed and running is nightmarish. It must be configured with an old version of Python (2.7) and a tangle of pinned legacy dependencies.

#!/bin/bash

# Enable debugging and error handling
set -euo pipefail
IFS=$'\n\t'

# Define base and output directories
base_dir="/Users/charleenadams/ukbppp/extracted_proteins"
output_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"

# Create the output directory if it doesn't exist
mkdir -p "$output_dir"

# Log progress
log_file="munge_simple_progress.log"
echo "Starting munging process at $(date)" > "$log_file"

# Loop through each file ending with "_with_pval.gz" in the base directory
for file in "$base_dir"/*_with_pval.gz; do
    if [[ -f "$file" ]]; then
        # Extract the base name without the extension
        base_name=$(basename "$file" "_with_pval.gz")

        # Define the output filename for munging
        output_file="$output_dir/${base_name}_munged"

        # Log the current file being processed
        echo "Processing file: $file" >> "$log_file"

        # Perform the munging, specifying the column names explicitly
        /Users/charleenadams/temp_BI/chemokine_rgs_olink/ldsc/munge_sumstats.py \
            --sumstats "$file" \
            --snp rsid \
            --a1 ALLELE1 \
            --a2 ALLELE0 \
            --frq A1FREQ \
            --N-col N \
            --p P \
            --signed-sumstats BETA,0 \
            --chunksize 500000 \
            --out "$output_file" \
            --merge-alleles /Users/charleenadams/temp_BI/chemokine_rgs_olink/w_hm3.snplist

        # Log success
        echo "Munged file created: $output_file" >> "$log_file"
    else
        echo "Warning: No matching files found in $base_dir" >> "$log_file"
    fi
done

# Completion message
echo "Munging process completed at $(date)" >> "$log_file"

# Run with: 
nano munge_simple.sh
chmod +x munge_simple.sh
nohup ./munge_simple.sh > munge_simple_log.txt 2>&1 &

4 Multivariate GWAS

4.1 Remove Incomplete Observations

#!/bin/bash

# Define the directory containing the files
input_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"
output_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"

# Loop through all files ending in _munged.sumstats.gz
for file in "$input_dir"/*_munged.sumstats.gz; do
    # Extract the base filename without extension
    base_name=$(basename "$file" .sumstats.gz)

    # Create output file name by appending "noNA" before .gz
    output_file="${output_dir}/${base_name}_noNA.sumstats.gz"

    # Unzip, keep the header plus complete rows (the munged files have 5
    # columns; missing values are written as blank fields, so NF drops below
    # 5), and gzip the result
    gunzip -c "$file" | awk 'NR == 1 || NF == 5' | gzip > "$output_file"

    echo "Processed $file -> $output_file"
done

# Run with: 
nano filter_noNA.sh
chmod +x filter_noNA.sh
nohup ./filter_noNA.sh > filter_noNA_log.txt 2>&1 & 

4.2 GenomicSEM

4.2.1 Harmonize Alleles

Run interactively.

# Load libraries
library(data.table)

# Define paths
current_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"
hapmap3_snplist_path <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/w_hm3.snplist"
output_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized"

# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Dynamically list files to process
summary_stats <- list.files(
  path = current_dir, 
  pattern = ".*noNA\\.sumstats\\.gz$", 
  full.names = TRUE, 
  all.files = TRUE
)

# Check if any files are found
if (length(summary_stats) == 0) {
  stop("No matching summary statistics files found in the specified directory.")
}
cat("Files to process:\n", paste(summary_stats, collapse = "\n"), "\n")

# Load HapMap3 SNP list
hm3_snps <- fread(hapmap3_snplist_path, header = TRUE)

# Function to harmonize SNPs
harmonize_snps <- function(file, hm3_snps) {
  # Read summary statistics
  data <- fread(file, header = TRUE)[, .(SNP, A1, A2, N, Z)]
  
  # Filter to SNPs in HapMap3
  data <- data[SNP %in% hm3_snps$SNP]
  
  # Merge with HapMap3 to align alleles
  aligned_data <- merge(data, hm3_snps, by = "SNP", suffixes = c(".data", ".hm3"))
  
  # Flip alleles and Z-scores where necessary
  mismatched <- which(aligned_data$A1.data == aligned_data$A2.hm3 & 
                        aligned_data$A2.data == aligned_data$A1.hm3)
  if (length(mismatched) > 0) {
    aligned_data$Z[mismatched] <- -aligned_data$Z[mismatched]
    temp <- aligned_data$A1.data[mismatched]
    aligned_data$A1.data[mismatched] <- aligned_data$A2.data[mismatched]
    aligned_data$A2.data[mismatched] <- temp
  }
  
  # Return cleaned data
  aligned_data[, .(SNP, A1 = A1.data, A2 = A2.data, N, Z)]
}

# Process and save harmonized files
for (file in summary_stats) {
  cat("Processing:", file, "\n")
  
  # Harmonize SNPs
  harmonized_data <- harmonize_snps(file, hm3_snps)
  
  # Extract file name and create correct output name
  base_name <- gsub("\\.sumstats\\.gz$", "_harmonized.sumstats.gz", basename(file))
  temp_file <- file.path(output_dir, gsub("\\.gz$", "", base_name))  # Temporary uncompressed file
  output_file <- file.path(output_dir, base_name)  # Final compressed file
  
  # Save to temporary uncompressed file
  fwrite(harmonized_data, file = temp_file, sep = "\t")
  
  # Compress the file
  system(paste("gzip -f", shQuote(temp_file)))
  
  cat("Saved harmonized file:", output_file, "\n")
}

cat("All matching files harmonized and saved in:", output_dir, "\n")

4.2.2 Add Pvalue Back

library(data.table)

# Define the function to add a P column
add_p_column <- function(file, output_dir) {
  data <- fread(file)
  if (!"P" %in% colnames(data)) {
    if ("Z" %in% colnames(data)) {
      data[, P := 2 * pnorm(-abs(Z))]
      
      # Create output file path
      output_file <- file.path(output_dir, basename(file))
      
      # Save the updated data to the new file
      fwrite(data, output_file)
      cat("Added P column to:", output_file, "\n")
    } else {
      stop("Z column missing. Cannot calculate P-values in file:", file)
    }
  } else {
    cat("P column already exists in:", file, "\n")
  }
}

# Set input and output directories
input_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized"
output_dir <- file.path(input_dir, "pvals.fix")

# Create the output directory if it doesn't exist
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# List all files matching the pattern in the input directory
files <- list.files(input_dir, pattern = "\\.sumstats\\.gz$", full.names = TRUE)

# Apply the function to all files
lapply(files, add_p_column, output_dir = output_dir)

4.2.3 Multivariate LDSC

This differs from the pairwise models we ran previously. The goal is to obtain the MVLDSC covariance matrix (not the pairwise genetic correlations).

# Load required library
library(GenomicSEM)

# Define paths and variables
traits <- list.files("/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/", 
                     pattern = "\\.sumstats\\.gz$", full.names = TRUE)

# Set sample and population prevalences to NA to bypass liability scaling
sample.prev <- rep(NA, length(traits))  # Treat traits as continuous
population.prev <- rep(NA, length(traits))

ld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"
wld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"

# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\.sumstats\\.gz$", "", basename(traits))

# Run multivariable LDSC
LDSCoutput <- ldsc(traits = traits,
                   sample.prev = sample.prev,
                   population.prev = population.prev,
                   ld = ld,
                   wld = wld,
                   trait.names = trait.names,
                   stand = TRUE)

# Save the results
output_path <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized"
saveRDS(LDSCoutput, file = file.path(output_path, "LDSCoutput.rds"))
write.csv(LDSCoutput$S, file = file.path(output_path, "LDSCoutput.csv"))  # S = genetic covariance matrix; write.csv cannot serialize the full list

# Print completion message
cat("Multivariable LDSC analysis complete. Results saved as LDSCoutput.rds and LDSCoutput.csv in:\n", output_path, "\n")

4.2.4 Sumstats Preparation

# Load required libraries
library(GenomicSEM)
library(data.table)  # For handling summary statistics

# Step 1: Define paths and variables
traits <- list.files(
  "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/pvals.fix",
  pattern = "\\.sumstats$",  # Match only unzipped files
  full.names = TRUE
)

# Optional: Name your traits based on filenames without extensions
# (the pattern must match the unzipped .sumstats suffix used above)
trait.names <- gsub("\\.sumstats$", "", basename(traits))

# Step 2: Extract sample sizes (N) from each trait file
# (assumes the sample-size column is named "N" and that N is constant across
# SNPs, so reading the first row suffices)
get_sample_size <- function(file) {
  data <- fread(file, select = "N", nrows = 1)  # read only the first row
  return(data$N[1])
}

N <- sapply(traits, get_sample_size)

# Print extracted sample sizes for debugging
cat("Extracted sample sizes:\n")
print(N)

# Step 3: Prepare the summary statistics with sumstats()
# Define the reference file for SNP variance
#ref <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/w_hm3.snplist"
ref <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/reference.1000G.maf.0.005.txt"

# Define arguments for sumstats()
se.logit <- rep(FALSE, length(traits))  # continuous traits; SEs are not on the logit scale
OLS <- rep(TRUE, length(traits))        # continuous traits from linear (OLS) regression
linprob <- rep(FALSE, length(traits))   # traits are not dichotomous; the linear probability model does not apply

# Run sumstats() to align and scale summary statistics
summary_stats <- sumstats(
  files = traits,
  ref = ref,
  trait.names = trait.names,
  se.logit = se.logit,
  OLS = OLS,
  linprob = linprob,
  N = N,
  betas = NULL,
#  info.filter = 0.6,
  maf.filter = 0.01,
  keep.indel = FALSE,
  parallel = TRUE,
  cores = 4  # Adjust for available cores
)

output_path <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/summary_stats.txt"
fwrite(summary_stats, file = output_path, sep = "\t")

summary_stats <- fread("/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/summary_stats.txt")

4.2.5 Multivariate 39-Chemokine GWAS Test on 100 SNPs

This test uses userGWAS rather than commonfactorGWAS: userGWAS accepts an explicit model string, and its sub argument restricts the output to the F1 ~ SNP estimates.

# Load required libraries
library(GenomicSEM)
library(data.table)
library(parallel)
library(qqman)

# Start timer to measure runtime
start_time <- Sys.time()

# Step 1: Stop and clean up any parallel clusters
if (!is.null(cl <- getOption("cl"))) {
  stopCluster(cl)
  options(cl = NULL)
}

# Step 2: Define paths and variables
traits <- list.files(
  "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/pvals.fix",
  pattern = "\\.sumstats$",  # Match only unzipped files
  full.names = TRUE
)

# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\.sumstats$", "", basename(traits))

# Extract sample sizes (N) from each trait file
get_sample_size <- function(file) {
  data <- fread(file, select = "N", nrows = 1)  # Only read the first row to extract N
  return(data$N[1])
}
N <- sapply(traits, get_sample_size)
cat("Extracted sample sizes:\n")
## Extracted sample sizes:
#print(N)

# Load summary statistics (use a subset of SNPs for testing)
summary_stats <- fread("/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/summary_stats.txt")
summary_stats <- summary_stats[1:100, ]  # Use only the first 100 SNPs for testing
#print(colnames(summary_stats))

# Step 3: Load LDSC results
ldsc_results_path <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/LDSCoutput.rds"
ldsc_results <- readRDS(ldsc_results_path)
#str(ldsc_results)

# Step 4: Specify the model for userGWAS
ldsc_trait_names <- attr(ldsc_results$S, "dimnames")[[2]]
model_string <- paste0("F1 =~ ", paste(ldsc_trait_names, collapse = " + "), "\nF1 ~ SNP")
#print(model_string)

# Step 5: Run the Multivariate GWAS
max_cores <- 12
cat("Using", max_cores, "cores for parallel processing.\n")
## Using 12 cores for parallel processing.
CorrelatedFactors <- userGWAS(
  covstruc = ldsc_results, 
  SNPs = summary_stats, 
  model = model_string, 
  estimation = "DWLS", 
  printwarn = TRUE, 
  sub = c("F1~SNP"),
  cores = max_cores,
  toler = FALSE,
  SNPSE = FALSE,
  parallel = TRUE,
  GC = "standard",
  MPI = FALSE,
  smooth_check = TRUE,
  fix_measurement = TRUE,
  Q_SNP = TRUE
)
## [1] "Please note that an update was made to userGWAS on Sept 1 2023  so that the default behavior is to fix the measurement model using the fix_measurement argument."
## [1] "Starting GWAS Estimation"
## elapsed 
##  75.404
# Output results
str(CorrelatedFactors)
## List of 1
##  $ :'data.frame':    100 obs. of  25 variables:
##   ..$ SNP          : chr [1:100] "rs3094315" "rs3131972" "rs3131969" "rs1048488" ...
##   ..$ CHR          : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ BP           : int [1:100] 752566 752721 754182 760912 761147 761732 768448 779322 785989 838555 ...
##   ..$ MAF          : num [1:100] 0.16 0.161 0.128 0.16 0.161 ...
##   ..$ A1           : chr [1:100] "G" "A" "A" "C" ...
##   ..$ A2           : chr [1:100] "A" "G" "G" "T" ...
##   ..$ lhs          : chr [1:100] "F1" "F1" "F1" "F1" ...
##   ..$ op           : chr [1:100] "~" "~" "~" "~" ...
##   ..$ rhs          : chr [1:100] "SNP" "SNP" "SNP" "SNP" ...
##   ..$ free         : int [1:100] 41 41 41 41 41 41 41 41 41 41 ...
##   ..$ label        : chr [1:100] "" "" "" "" ...
##   ..$ est          : num [1:100] -0.00217 -0.00233 -0.00323 -0.00158 -0.00151 ...
##   ..$ SE           : num [1:100] 0.00546 0.00545 0.00599 0.00546 0.00545 ...
##   ..$ Z_Estimate   : num [1:100] -0.397 -0.428 -0.54 -0.29 -0.278 ...
##   ..$ Pval_Estimate: num [1:100] 0.691 0.669 0.589 0.772 0.781 ...
##   ..$ chisq        : num [1:100] 341331 341333 341362 341332 341330 ...
##   ..$ chisq_df     : num [1:100] 778 778 778 778 778 778 778 778 778 778 ...
##   ..$ chisq_pval   : num [1:100] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ AIC          : num [1:100] 341415 341417 341446 341416 341414 ...
##   ..$ Q_SNP        : num [1:100] 35.1 34.6 46.4 34.6 34.8 ...
##   ..$ Q_SNP_df     : num [1:100] 38 38 38 38 38 38 38 38 38 38 ...
##   ..$ Q_SNP_pval   : num [1:100] 0.602 0.626 0.164 0.629 0.617 ...
##   ..$ error        : num [1:100] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ warning      : chr [1:100] "0" "0" "0" "0" ...
##   ..$ Z_smooth     : num [1:100] 0.000329 0.00035 0.00062 0.000334 0.000331 ...
significant_snps <- CorrelatedFactors[[1]][CorrelatedFactors[[1]]$Pval_Estimate < 0.05, ]
#print(significant_snps)

low_heterogeneity <- CorrelatedFactors[[1]][CorrelatedFactors[[1]]$Q_SNP_pval > 0.05, ]
#print(low_heterogeneity)

# Prepare data for plots
manhattan_data <- data.frame(
  SNP = CorrelatedFactors[[1]]$SNP,
  CHR = CorrelatedFactors[[1]]$CHR,
  BP = CorrelatedFactors[[1]]$BP,
  P = CorrelatedFactors[[1]]$Pval_Estimate
)

# Step 6: Display Manhattan plot
cat("Displaying Manhattan plot.\n")
## Displaying Manhattan plot.
manhattan(
  manhattan_data,
  chr = "CHR",
  bp = "BP",
  p = "P",
  snp = "SNP",
  genomewideline = -log10(5e-8),
  suggestiveline = -log10(1e-5),
  col = c("blue4", "skyblue"),
  main = "Manhattan Plot for Latent Chemokine Genetic Factor",
  ylim = c(0, 10)
)

# Step 7: Display QQ plot
cat("Displaying QQ plot.\n")
## Displaying QQ plot.
qq(manhattan_data$P, main = "QQ Plot for Latent Chemokine Genetic Factor")

# End timer and display runtime
end_time <- Sys.time()
cat("Total runtime:", end_time - start_time, "\n")
## Total runtime: 1.380361

5 Results: Pending

The last step with the full set of SNPs has been running on 14 cores for >18 hours.

SLURM (via Harvard’s HPC) would be vastly more efficient for this sort of intensive computation. We are maxed out at my laptop’s capacity, and I can’t unplug the laptop because the task is draining the battery. On Harvard’s HPC, we could easily run this on many more cores, which is how the authors intended it to be run.

6 Estimation of Total Runtime for GWAS Job

This analysis estimates the total runtime and remaining time for the GWAS job processing 1,131,359 SNPs, utilizing 14 cores with parallel processing.


6.1 Given Information

  1. Runtime for 100 SNPs: ~80 seconds.
  2. Total SNPs to process: 1,131,359.
  3. System Setup:
    • 14 cores are utilized.
    • The system is running efficiently at ~85–90% CPU utilization per core (per top -o cpu).

6.2 Estimation Calculations

6.2.1 1. Time Per SNP

\[ \text{Time per SNP} = \frac{\text{Runtime for 100 SNPs}}{100} \] \[ \text{Time per SNP} = \frac{80}{100} = 0.8 \, \text{seconds/SNP} \]

6.2.2 2. Total Sequential Runtime

\[ \text{Sequential Runtime} = \text{Total SNPs} \times \text{Time per SNP} \] \[ \text{Sequential Runtime} = 1,131,359 \times 0.8 = 905,087.2 \, \text{seconds} \] \[ \text{Sequential Runtime} \approx 251.4 \, \text{hours} \]

6.2.3 3. Parallel Adjustment

Using 14 cores with an assumed 80% parallel efficiency: \[ \text{Parallel Runtime} = \frac{\text{Sequential Runtime}}{\text{Cores} \times \text{Efficiency}} \] \[ \text{Parallel Runtime} = \frac{905,087.2}{14 \times 0.8} = 80,811.4 \, \text{seconds} \] \[ \text{Parallel Runtime} \approx 22.4 \, \text{hours} \]

6.2.4 4. Elapsed Time

The job has been running for approximately 18 hours (17:55:46).

6.2.5 5. Remaining Time

\[ \text{Remaining Time} = \text{Parallel Runtime} - \text{Elapsed Time} \] \[ \text{Remaining Time} = 22.4 - 18 = 4.4 \, \text{hours} \]
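The same arithmetic as a quick check in plain R:

# Sanity check of the runtime estimate (inputs from the 100-SNP test above)
n_snps <- 1131359                # total SNPs
t_per  <- 80 / 100               # seconds per SNP
seq_h  <- n_snps * t_per / 3600  # sequential runtime: ~251.4 hours
par_h  <- seq_h / (14 * 0.8)     # 14 cores at 80% efficiency: ~22.4 hours
par_h - 18                       # remaining after ~18 h elapsed: ~4.4 hours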


6.3 Estimated Completion Window

  • Optimistic Estimate: Total runtime ~22.4 hours; remaining ~4.4 hours.
  • Pessimistic Estimate: If efficiency is slightly lower (70–80%), the total runtime could extend to ~25–26 hours, leaving ~7–8 hours remaining.

6.4 Conclusion

The GWAS job is expected to complete within the next 4–8 hours, depending on system efficiency and runtime variability.