In this analysis, we employ a genomic structural equation model (Genomic SEM) to define a new phenotype: a latent factor that influences the chemokines.
In essence, while the chemokines themselves are directly measurable phenotypes, the latent factor we infer captures the shared genetic influence among these chemokines. The GWAS results, therefore, represent the genetic associations with this inferred, composite trait, providing insights into the genetics of an underlying immune phenomenon rather than focusing on any single chemokine alone.
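Schematically, with chemokines indexed by j, the model combines a measurement part (each chemokine loading on the latent factor F) with a structural part (the factor regressed on each SNP). The notation below is illustrative rather than the exact Genomic SEM parameterization:
\[ \text{Chemokine}_j = \lambda_j F + u_j, \quad j = 1, \dots, k \] \[ F = \beta_{\text{SNP}} \, \text{SNP} + \epsilon \]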
I had previously obtained, untarred, formatted, and munged 33 UKB-PPP chemokine summary statistics. I transferred these files, the precomputed European LD scores, and w_hm3.snplist with rsync, using the IP address of my BIDMC laptop (redacted below) and its password (not included).
(The precalculated LD scores are for Europeans in 1000 Genomes Project Phase 3, as provided by the Alkes lab. They were generated from HapMap3 SNPs in the European subset of 1000 Genomes samples, excluding the MHC region, whose high genetic diversity and extended linkage disequilibrium make its LD structure particularly complex.)
rsync -av --progress --partial /Users/charleenadams/temp_BI/chemokine_rgs_olink/ charleenadams@IP:/Users/charleenadams/temp_BI/chemokine_rgs_olink/
I obtained all the UKB-PPP summary statistics for Europeans from Synapse. I’m only going to untar the chemokine files I did not process previously.
Scripts for this analysis are stored at:
/Users/charleenadams/temp_BI/mvGWAS_chemokines
Files to untar:
CXCL8_P10145_OID20153_v1_Cardiometabolic.tar
CXCL8_P10145_OID20631_v1_Inflammation.tar
CXCL8_P10145_OID20997_v1_Neurology.tar
CXCL8_P10145_OID21430_v1_Oncology.tar
CXCL1_P09341_OID20762_v1_Inflammation.tar
CCL11_P51671_OID20668_v1_Inflammation.tar
#!/bin/bash
# Define the directory containing your tar files
tar_directory="/Users/charleenadams/ukbppp"
# Define the output directory for extracted files
output_directory="/Users/charleenadams/ukbppp/extracted_proteins"
# Create the output directory if it doesn't exist
mkdir -p "$output_directory"
# List of tar files to extract
tar_files=(
"CXCL8_P10145_OID20153_v1_Cardiometabolic.tar"
"CXCL8_P10145_OID20631_v1_Inflammation.tar"
"CXCL8_P10145_OID20997_v1_Neurology.tar"
"CXCL8_P10145_OID21430_v1_Oncology.tar"
"CXCL1_P09341_OID20762_v1_Inflammation.tar"
"CCL11_P51671_OID20668_v1_Inflammation.tar"
)
# Iterate over each tar file in the list
for tar_file in "${tar_files[@]}"; do
# Construct the full path to the tar file
full_path="$tar_directory/$tar_file"
# Check if the tar file exists
if [[ -f "$full_path" ]]; then
echo "Extracting $tar_file..."
# Extract the tar file into the output directory
tar -xf "$full_path" -C "$output_directory"
else
echo "File $tar_file does not exist in $tar_directory."
fi
done
echo "Extraction complete."
# Run with:
chmod +x untar_remaining_chemokines.sh
nohup ./untar_remaining_chemokines.sh >untar_remaining_chemokines_log.txt 2>&1 &
ps -p 97839 -o stat,etime
This script keeps both POS38 and POS19 (the GRCh38 and hg19 positions). Position isn't needed for LDSC, which matches on rsID, but for downstream analyses it's obviously better to keep both than to merge one back in later. (When I ran this for the original 33 chemokines, POS19 was dropped.) In addition, I sped up the process with GNU parallel.
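Before launching the join, it is worth confirming that the map files actually hold the variant ID, rsID, and the two build positions in the columns the awk command below assumes (1, 4, 5, and 6). A minimal check in R (the chr1 filename follows the pattern used in the script; adjust as needed):
library(data.table)
# Peek at one RSID map file; the path pattern is taken from the script below.
map_file <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/maps/olink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv.gz"
map_head <- fread(cmd = paste("gunzip -c", shQuote(map_file), "| head -5"))
# The join assumes: column 1 = variant ID, column 4 = rsID, columns 5-6 = the two build positions.
print(names(map_head))
print(map_head)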
#!/bin/bash
# Base directory where your data is located
base_dir="/Users/charleenadams/ukbppp/extracted_proteins"
# Directory where RSID map files are located
map_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/maps"
# Log file to track progress
log_file="$base_dir/add_rsid_log.txt"
echo "Log started on $(date)" > "$log_file"
# Function to process each file
process_file() {
local file="$1"
local dir
dir=$(dirname "$file")
local chrom
chrom=$(basename "$file" | sed -E 's/.*chr([0-9XY]+).*/\1/')
local map_file="$map_dir/olink_rsid_map_mac5_info03_b0_7_chr${chrom}_patched_v2.tsv.gz"
local output_file="${file%.gz}_rsid_added.gz"
{
echo "Processing file: $file (Chromosome $chrom)"
if [[ -n "$chrom" && -f "$map_file" ]]; then
echo " Using RSID map: $map_file"
# Remove existing rsid_added file to avoid overwrite prompts
rm -f "$output_file"
# Join the original file with the RSID map based on the ID column
gunzip -c "$file" | \
awk 'BEGIN {OFS="\t"}
NR==FNR {map[$1]=$4; genpos37[$1]=$5; genpos38[$1]=$6; next}
($3 in map) {print $0, map[$3], genpos37[$3], genpos38[$3]}
!($3 in map) {print $0, ".", ".", "."}' <(gunzip -c "$map_file") - | \
gzip -c > "$output_file"
if [[ -s "$output_file" ]]; then
echo " RSID added with GENPOS37 and GENPOS38. Output saved to $output_file"
else
echo " Warning: $output_file is empty!"
fi
else
echo " Error: RSID map file for chromosome $chrom not found or chromosome not identified."
fi
} >> "$log_file" 2>&1
}
export -f process_file
export map_dir
export log_file
# Find all .gz files and process them in parallel
find "$base_dir" -type f -name "*.gz" | parallel -j4 process_file
echo "Log ended on $(date)" >> "$log_file"
echo "Processing complete. Check $log_file for details."
#Run with:
nano add_rsid.sh
chmod +x add_rsid.sh
nohup ./add_rsid.sh > add_rsid_log.txt 2>&1 &
ps -p 18092 -o stat,etime
#!/bin/bash
# Define base directory
base_dir="/Users/charleenadams/ukbppp/extracted_proteins"
# Log file for tracking
log_file="$base_dir/merge_rsid_files_log.txt"
echo "Processing started on $(date)" > "$log_file"
# Iterate over each subdirectory in the base directory
for dir in "$base_dir"/*/; do
# Skip directories you don’t want to process
if [[ "$dir" == *"maps"* || "$dir" == *"1000G_Phase3_weights_hm3_no_MHC"* || "$dir" == *"ldsc"* || "$dir" == *"rsconnect"* ]]; then
continue
fi
echo "Processing directory: $dir" | tee -a "$log_file"
# Collect all '_rsid_added.gz' files in this directory
rsid_files=("$dir"*_rsid_added.gz)
# Check if there are files to merge (guard against an unmatched glob, which leaves the literal pattern in the array)
if [[ -e "${rsid_files[0]}" ]]; then
merged_file="${dir%/}_merged_rsid_added.gz"
# Start with fresh temp files, then extract the header from the first file
rm -f temp_header.txt temp_content.txt
gunzip -c "${rsid_files[0]}" | head -n 1 > temp_header.txt
# Process each file, excluding the header, and append to a temporary content file
for file in "${rsid_files[@]}"; do
gunzip -c "$file" | tail -n +2 >> temp_content.txt
done
# Concatenate header and content into the final merged file
cat temp_header.txt temp_content.txt | gzip -c > "$merged_file"
# Clean up temporary files
rm temp_header.txt temp_content.txt
# Confirm if the merged file was created successfully and isn’t empty
if [[ -s "$merged_file" ]]; then
echo " Merged file created: $merged_file" | tee -a "$log_file"
else
echo " Warning: Merged file is empty for directory $dir" | tee -a "$log_file"
fi
else
echo " No '_rsid_added.gz' files found in directory $dir" | tee -a "$log_file"
fi
done
echo "Processing completed on $(date)" >> "$log_file"
echo "Check $log_file for details."
# Run with:
nano chemokine_chrom_merge.sh
chmod +x chemokine_chrom_merge.sh
nohup ./chemokine_chrom_merge.sh > chemokine_chrom_merge_log.txt 2>&1 &
ps -p 20853 -o stat,etime
The Olink files report LOG10P, but LDSC needs untransformed p-values, so we compute P = 10^(-LOG10P). NOTE: Check the LOG10P column number before running the script below!
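To confirm the column index before running the conversion (the awk command below assumes LOG10P is column 13), a quick check in R; the filename is illustrative, substitute any *_merged_rsid_added.gz file:
library(data.table)
merged_file <- "/Users/charleenadams/ukbppp/extracted_proteins/CXCL1_P09341_OID20762_v1_Inflammation_merged_rsid_added.gz"
# Read the header line (plus one data row so fread detects the header) and locate LOG10P.
hdr <- names(fread(cmd = paste("gunzip -c", shQuote(merged_file), "| head -2")))
which(hdr == "LOG10P") # should return 13 to match the awk command below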
#!/bin/bash
cd /Users/charleenadams/ukbppp/extracted_proteins
# Loop through each *merged_rsid_added.gz file
for file in *_merged_rsid_added.gz; do
# Define the output filename
output_file="${file%_merged_rsid_added.gz}_with_pval.gz"
# Use awk to calculate P-value from LOG10P and write to a new gzipped file
gunzip -c "$file" | \
awk 'BEGIN {OFS="\t"} NR==1 {print $0, "P"} NR>1 {pval = (10 ^ -$13); print $0, pval}' | \
gzip > "$output_file"
echo "Processed file saved: $output_file"
done
#Run with:
nano pval.sh
chmod +x pval.sh
nohup ./pval.sh > pval_log.txt 2>&1 &
ps -p 22727 -o stat,etime
Store the munged files in the folder with the original 33:
/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged
(To list that folder's contents by modification time: stat -f "%m %N" * | sort -n | awk '{print $2}'.)
NOTE: It is nightmarish to get ldsc installed and running. It must be configured with old versions of Python and dependencies from hell.
#!/bin/bash
# Enable debugging and error handling
set -euo pipefail
IFS=$'\n\t'
# Define base and output directories
base_dir="/Users/charleenadams/ukbppp/extracted_proteins"
output_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"
# Create the output directory if it doesn't exist
mkdir -p "$output_dir"
# Log progress
log_file="munge_simple_progress.log"
echo "Starting munging process at $(date)" > "$log_file"
# Loop through each file ending with "_with_pval.gz" in the base directory
for file in "$base_dir"/*_with_pval.gz; do
if [[ -f "$file" ]]; then
# Extract the base name without the extension
base_name=$(basename "$file" "_with_pval.gz")
# Define the output filename for munging
output_file="$output_dir/${base_name}_munged"
# Log the current file being processed
echo "Processing file: $file" >> "$log_file"
# Perform the munging, specifying the column names explicitly
/Users/charleenadams/temp_BI/chemokine_rgs_olink/ldsc/munge_sumstats.py \
--sumstats "$file" \
--snp rsid \
--a1 ALLELE1 \
--a2 ALLELE0 \
--frq A1FREQ \
--N-col N \
--p P \
--signed-sumstats BETA,0 \
--chunksize 500000 \
--out "$output_file" \
--merge-alleles /Users/charleenadams/temp_BI/chemokine_rgs_olink/w_hm3.snplist
# Log success
echo "Munged file created: $output_file" >> "$log_file"
else
echo "Warning: No matching files found in $base_dir" >> "$log_file"
fi
done
# Completion message
echo "Munging process completed at $(date)" >> "$log_file"
# Run with:
nano munge_simple.sh
chmod +x munge_simple.sh
nohup ./munge_simple.sh > munge_simple_log.txt 2>&1 &
#!/bin/bash
# Define the directory containing the files
input_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"
output_dir="/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"
# Loop through all files ending in _munged.sumstats.gz
for file in "$input_dir"/*_munged.sumstats.gz; do
# Extract the base filename without extension
base_name=$(basename "$file" .sumstats.gz)
# Create output file name by appending "noNA" before .gz
output_file="${output_dir}/${base_name}_noNA.sumstats.gz"
# Unzip, filter rows with complete observations, and gzip the result
gunzip -c "$file" | awk 'NR == 1 || NF == 5' | gzip > "$output_file"
echo "Processed $file -> $output_file"
done
# Run with:
nano filter_noNA.sh
chmod +x filter_noNA.sh
nohup ./filter_noNA.sh > filter_noNA_log.txt 2>&1 &
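The NF == 5 filter above keeps only rows in which all five munged columns are present. A quick spot-check of one filtered file in R (filename illustrative):
library(data.table)
f <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/CXCL1_P09341_OID20762_v1_Inflammation_munged_noNA.sumstats.gz"
x <- fread(cmd = paste("gunzip -c", shQuote(f)))
print(names(x)) # expect the five LDSC columns (SNP, alleles, Z, N)
stopifnot(ncol(x) == 5) # every retained row has all five fields
cat("Rows with any missing value:", sum(!complete.cases(x)), "\n")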
Run the following R code interactively.
# Load libraries
library(data.table)
# Define paths
current_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged"
hapmap3_snplist_path <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/w_hm3.snplist"
output_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized"
# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
dir.create(output_dir)
}
# Dynamically list files to process
summary_stats <- list.files(
path = current_dir,
pattern = ".*noNA\\.sumstats\\.gz$",
full.names = TRUE,
all.files = TRUE
)
# Check if any files are found
if (length(summary_stats) == 0) {
stop("No matching summary statistics files found in the specified directory.")
}
cat("Files to process:\n", paste(summary_stats, collapse = "\n"), "\n")
# Load HapMap3 SNP list
hm3_snps <- fread(hapmap3_snplist_path, header = TRUE)
# Function to harmonize SNPs
harmonize_snps <- function(file, hm3_snps) {
# Read summary statistics
data <- fread(file, header = TRUE)[, .(SNP, A1, A2, N, Z)]
# Filter to SNPs in HapMap3
data <- data[SNP %in% hm3_snps$SNP]
# Merge with HapMap3 to align alleles
aligned_data <- merge(data, hm3_snps, by = "SNP", suffixes = c(".data", ".hm3"))
# Flip alleles and Z-scores where necessary
mismatched <- which(aligned_data$A1.data == aligned_data$A2.hm3 &
aligned_data$A2.data == aligned_data$A1.hm3)
if (length(mismatched) > 0) {
aligned_data$Z[mismatched] <- -aligned_data$Z[mismatched]
temp <- aligned_data$A1.data[mismatched]
aligned_data$A1.data[mismatched] <- aligned_data$A2.data[mismatched]
aligned_data$A2.data[mismatched] <- temp
}
# Return cleaned data
aligned_data[, .(SNP, A1 = A1.data, A2 = A2.data, N, Z)]
}
# Process and save harmonized files
for (file in summary_stats) {
cat("Processing:", file, "\n")
# Harmonize SNPs
harmonized_data <- harmonize_snps(file, hm3_snps)
# Extract file name and create correct output name
base_name <- gsub("\\.sumstats\\.gz$", "_harmonized.sumstats.gz", basename(file))
temp_file <- file.path(output_dir, gsub("\\.gz$", "", base_name)) # Temporary uncompressed file
output_file <- file.path(output_dir, base_name) # Final compressed file
# Save to temporary uncompressed file
fwrite(harmonized_data, file = temp_file, sep = "\t")
# Compress the file
system(paste("gzip -f", shQuote(temp_file)))
cat("Saved harmonized file:", output_file, "\n")
}
cat("All matching files harmonized and saved in:", output_dir, "\n")
library(data.table)
# Define the function to add a P column
add_p_column <- function(file, output_dir) {
data <- fread(file)
if (!"P" %in% colnames(data)) {
if ("Z" %in% colnames(data)) {
data[, P := 2 * pnorm(-abs(Z))]
# Create output file path
output_file <- file.path(output_dir, basename(file))
# Save the updated data to the new file
fwrite(data, output_file)
cat("Added P column to:", output_file, "\n")
} else {
stop("Z column missing. Cannot calculate P-values in file:", file)
}
} else {
cat("P column already exists in:", file, "\n")
}
}
# Set input and output directories
input_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized"
output_dir <- file.path(input_dir, "pvals.fix")
# Create the output directory if it doesn't exist
if (!dir.exists(output_dir)) {
dir.create(output_dir)
}
# List all files matching the pattern in the input directory
files <- list.files(input_dir, pattern = "\\.sumstats\\.gz$", full.names = TRUE)
# Apply the function to all files
lapply(files, add_p_column, output_dir = output_dir)
This differs from the pairwise models we ran previously: the goal here is the multivariable LDSC (MVLDSC) genetic covariance matrix, not the individual pairwise genetic correlations.
# Load required library
library(GenomicSEM)
# Define paths and variables
traits <- list.files("/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/",
pattern = "\\.sumstats\\.gz$", full.names = TRUE)
# Set sample and population prevalences to NA to bypass liability scaling
sample.prev <- rep(NA, length(traits)) # Treat traits as continuous
population.prev <- rep(NA, length(traits))
ld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"
wld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"
# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\.sumstats\\.gz$", "", basename(traits))
# Run multivariable LDSC
LDSCoutput <- ldsc(traits = traits,
sample.prev = sample.prev,
population.prev = population.prev,
ld = ld,
wld = wld,
trait.names = trait.names,
stand = TRUE)
# Save the results (LDSCoutput is a list of matrices, so save the full object as .rds
# and write the genetic covariance matrix S as the CSV)
output_path <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized"
saveRDS(LDSCoutput, file = file.path(output_path, "LDSCoutput.rds"))
write.csv(LDSCoutput$S, file = file.path(output_path, "LDSCoutput_S.csv"))
# Print completion message
cat("Multivariable LDSC analysis complete. Results saved as LDSCoutput.rds and LDSCoutput_S.csv in:\n", output_path, "\n")
# Load required libraries
library(GenomicSEM)
library(data.table) # For handling summary statistics
# Step 1: Define paths and variables
traits <- list.files(
"/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/pvals.fix",
pattern = "\\.sumstats$", # Match only unzipped files
full.names = TRUE
)
# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\.sumstats(\\.gz)?$", "", basename(traits))
# Step 2: Extract sample sizes (N) from each trait file
# Assuming the column name for sample size is "N"
get_sample_size <- function(file) {
data <- fread(file, select = "N", nrows = 1) # Only read the first row to extract N
return(data$N[1])
}
N <- sapply(traits, get_sample_size)
# Print extracted sample sizes for debugging
cat("Extracted sample sizes:\n")
print(N)
# Step 3: Prepare the summary statistics with sumstats()
# Define the reference file for SNP variance
#ref <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/w_hm3.snplist"
ref <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink//reference.1000G.maf.0.005.txt"
# Define arguments for sumstats()
se.logit <- rep(FALSE, length(traits)) # Continuous traits; SEs are not on the logit scale
OLS <- rep(TRUE, length(traits)) # Continuous traits analyzed with linear (OLS) regression
linprob <- rep(FALSE, length(traits)) # Traits are not dichotomous, so the linear probability model does not apply
# Run sumstats() to align and scale summary statistics
summary_stats <- sumstats(
files = traits,
ref = ref,
trait.names = trait.names,
se.logit = se.logit,
OLS = OLS,
linprob = linprob,
N = N,
betas = NULL,
# info.filter = 0.6,
maf.filter = 0.01,
keep.indel = FALSE,
parallel = TRUE,
cores = 4 # Adjust for available cores
)
output_path <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/summary_stats.txt"
fwrite(summary_stats, file = output_path, sep = "\t")
summary_stats <- fread("/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/summary_stats.txt")
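Before launching the multivariate GWAS, a quick look at what sumstats() produced (column names depend on the trait names passed above):
# Sanity check of the aligned summary statistics.
dim(summary_stats) # rows = SNPs retained after MAF and allele filtering
head(names(summary_stats))
summary_stats[1:3, 1:6] # first few SNPs and leading columns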
The multivariate GWAS is run with userGWAS rather than commonfactorGWAS: userGWAS accepts an arbitrary lavaan-style model string, which lets us specify the common factor explicitly and regress it on each SNP, whereas commonfactorGWAS fits a fixed one-common-factor model.
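For reference, the one-common-factor analogue with commonfactorGWAS would look roughly like the sketch below (not run here; the argument values mirror the userGWAS call that follows):
# Hedged sketch: default common factor model, no user-specified lavaan string.
CommonFactor <- commonfactorGWAS(
covstruc = ldsc_results,
SNPs = summary_stats,
estimation = "DWLS",
cores = 12,
parallel = TRUE
)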
# Load required libraries
library(GenomicSEM)
library(data.table)
library(parallel)
library(qqman)
# Start timer to measure runtime
start_time <- Sys.time()
# Step 1: Stop and clean up any parallel clusters
if (!is.null(cl <- getOption("cl"))) {
stopCluster(cl)
options(cl = NULL)
}
# Step 2: Define paths and variables
traits <- list.files(
"/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/pvals.fix",
pattern = "\\.sumstats$", # Match only unzipped files
full.names = TRUE
)
# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\.sumstats(\\.gz)?$", "", basename(traits))
# Extract sample sizes (N) from each trait file
get_sample_size <- function(file) {
data <- fread(file, select = "N", nrows = 1) # Only read the first row to extract N
return(data$N[1])
}
N <- sapply(traits, get_sample_size)
cat("Extracted sample sizes:\n")
## Extracted sample sizes:
#print(N)
# Load summary statistics (use a subset of SNPs for testing)
summary_stats <- fread("/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/summary_stats.txt")
summary_stats <- summary_stats[1:100, ] # Use only the first 100 SNPs for testing
#print(colnames(summary_stats))
# Step 3: Load LDSC results
ldsc_results_path <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/harmonized/LDSCoutput.rds"
ldsc_results <- readRDS(ldsc_results_path)
#str(ldsc_results)
# Step 4: Specify the model for userGWAS
ldsc_trait_names <- attr(ldsc_results$S, "dimnames")[[2]]
model_string <- paste0("F1 =~ ", paste(ldsc_trait_names, collapse = " + "), "\nF1 ~ SNP")
#print(model_string)
# Step 5: Run the Multivariate GWAS
max_cores <- 12
cat("Using", max_cores, "cores for parallel processing.\n")
## Using 12 cores for parallel processing.
CorrelatedFactors <- userGWAS(
covstruc = ldsc_results,
SNPs = summary_stats,
model = model_string,
estimation = "DWLS",
printwarn = TRUE,
sub = c("F1~SNP"),
cores = max_cores,
toler = FALSE,
SNPSE = FALSE,
parallel = TRUE,
GC = "standard",
MPI = FALSE,
smooth_check = TRUE,
fix_measurement = TRUE,
Q_SNP = TRUE
)
## [1] "Please note that an update was made to userGWAS on Sept 1 2023 so that the default behavior is to fix the measurement model using the fix_measurement argument."
## [1] "Starting GWAS Estimation"
## elapsed
## 75.404
# Output results
str(CorrelatedFactors)
## List of 1
## $ :'data.frame': 100 obs. of 25 variables:
## ..$ SNP : chr [1:100] "rs3094315" "rs3131972" "rs3131969" "rs1048488" ...
## ..$ CHR : int [1:100] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ BP : int [1:100] 752566 752721 754182 760912 761147 761732 768448 779322 785989 838555 ...
## ..$ MAF : num [1:100] 0.16 0.161 0.128 0.16 0.161 ...
## ..$ A1 : chr [1:100] "G" "A" "A" "C" ...
## ..$ A2 : chr [1:100] "A" "G" "G" "T" ...
## ..$ lhs : chr [1:100] "F1" "F1" "F1" "F1" ...
## ..$ op : chr [1:100] "~" "~" "~" "~" ...
## ..$ rhs : chr [1:100] "SNP" "SNP" "SNP" "SNP" ...
## ..$ free : int [1:100] 41 41 41 41 41 41 41 41 41 41 ...
## ..$ label : chr [1:100] "" "" "" "" ...
## ..$ est : num [1:100] -0.00217 -0.00233 -0.00323 -0.00158 -0.00151 ...
## ..$ SE : num [1:100] 0.00546 0.00545 0.00599 0.00546 0.00545 ...
## ..$ Z_Estimate : num [1:100] -0.397 -0.428 -0.54 -0.29 -0.278 ...
## ..$ Pval_Estimate: num [1:100] 0.691 0.669 0.589 0.772 0.781 ...
## ..$ chisq : num [1:100] 341331 341333 341362 341332 341330 ...
## ..$ chisq_df : num [1:100] 778 778 778 778 778 778 778 778 778 778 ...
## ..$ chisq_pval : num [1:100] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ AIC : num [1:100] 341415 341417 341446 341416 341414 ...
## ..$ Q_SNP : num [1:100] 35.1 34.6 46.4 34.6 34.8 ...
## ..$ Q_SNP_df : num [1:100] 38 38 38 38 38 38 38 38 38 38 ...
## ..$ Q_SNP_pval : num [1:100] 0.602 0.626 0.164 0.629 0.617 ...
## ..$ error : num [1:100] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ warning : chr [1:100] "0" "0" "0" "0" ...
## ..$ Z_smooth : num [1:100] 0.000329 0.00035 0.00062 0.000334 0.000331 ...
significant_snps <- CorrelatedFactors[[1]][CorrelatedFactors[[1]]$Pval_Estimate < 0.05, ]
#print(significant_snps)
low_heterogeneity <- CorrelatedFactors[[1]][CorrelatedFactors[[1]]$Q_SNP_pval > 0.05, ]
#print(low_heterogeneity)
# Prepare data for plots
manhattan_data <- data.frame(
SNP = CorrelatedFactors[[1]]$SNP,
CHR = CorrelatedFactors[[1]]$CHR,
BP = CorrelatedFactors[[1]]$BP,
P = CorrelatedFactors[[1]]$Pval_Estimate
)
# Step 6: Display Manhattan plot
cat("Displaying Manhattan plot.\n")
## Displaying Manhattan plot.
manhattan(
manhattan_data,
chr = "CHR",
bp = "BP",
p = "P",
snp = "SNP",
genomewideline = -log10(5e-8),
suggestiveline = -log10(1e-5),
col = c("blue4", "skyblue"),
main = "Manhattan Plot for Latent Chemokine Genetic Factor",
ylim = c(0, 10)
)
# Step 7: Display QQ plot
cat("Displaying QQ plot.\n")
## Displaying QQ plot.
qq(manhattan_data$P, main = "QQ Plot for Latent Chemokine Genetic Factor")
# End timer and display runtime
end_time <- Sys.time()
cat("Total runtime:", end_time - start_time, "\n")
## Total runtime: 1.380361
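Once the full-SNP run completes, the F1 ~ SNP results can be written out for downstream use; a minimal sketch (output path and filename are illustrative):
latent_gwas <- CorrelatedFactors[[1]]
fwrite(latent_gwas, "/Users/charleenadams/temp_BI/mvGWAS_chemokines/latent_chemokine_sumstats/latent_factor_GWAS.txt", sep = "\t")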
The last step, with the full set of SNPs, has been running on 14 cores for >18 hours.
SLURM (via Harvard's HPC) would be vastly more efficient for this sort of intensive computation. We are maxed out at my laptop's capacity; I can't even unplug the laptop because the task is draining the battery. On Harvard's HPC we could spread the job across many more cores, which is how the authors intend it to be run.
This analysis estimates the total runtime and remaining time for the GWAS job processing 1,131,359 SNPs, utilizing 14 cores with parallel processing.
(CPU usage was monitored with top -o cpu.)
\[ \text{Time per SNP} = \frac{\text{Runtime for 100 SNPs}}{100} \] \[ \text{Time per SNP} = \frac{80}{100} = 0.8 \, \text{seconds/SNP} \]
\[ \text{Sequential Runtime} = \text{Total SNPs} \times \text{Time per SNP} \] \[ \text{Sequential Runtime} = 1,131,359 \times 0.8 = 905,087.2 \, \text{seconds} \] \[ \text{Sequential Runtime} \approx 251.4 \, \text{hours} \]
Using 14 cores with an assumed 80% parallel efficiency: \[ \text{Parallel Runtime} = \frac{\text{Sequential Runtime}}{\text{Cores} \times \text{Efficiency}} \] \[ \text{Parallel Runtime} = \frac{905,087.2}{14 \times 0.8} = 80,948.3 \, \text{seconds} \] \[ \text{Parallel Runtime} \approx 22.5 \, \text{hours} \]
The job has been running for approximately 18 hours (17:55:46).
\[ \text{Remaining Time} = \text{Parallel Runtime} - \text{Elapsed Time} \] \[ \text{Remaining Time} = 22.5 - 18 = 4.5 \, \text{hours} \]
The GWAS job is expected to complete within the next 4–8 hours, depending on system efficiency and runtime variability.
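The same back-of-the-envelope estimate can be reproduced in R (inputs taken from the figures above; the 80% parallel efficiency is an assumption):
total_snps <- 1131359
time_per_snp <- 80 / 100 # seconds per SNP, from the 100-SNP test run
cores <- 14
efficiency <- 0.80 # assumed parallel efficiency
sequential_hours <- total_snps * time_per_snp / 3600
parallel_hours <- sequential_hours / (cores * efficiency)
remaining_hours <- parallel_hours - 18 # ~18 hours already elapsed
cat(sprintf("Sequential: %.1f h, parallel: %.1f h, remaining: %.1f h\n", sequential_hours, parallel_hours, remaining_hours))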