1 Multivariate GWAS

In this analysis, we employ a genomic structural equation model (GenomicSEM) to define a new phenotype: a latent factor that influences 12 proteins we know are influenced by a common SNP.

3 Bash Tips

tmux

Keeps scripts or tasks running even if terminal session disconnects or crashes.
Ideal for long-running processes in remote servers or during unstable connections. Using tmux and nohup is mostly redundant.

pstree

View the hierarchy of a process to see its parent and child processes. Useful for seeing if parallelization is working or if the child processes are sleeping.

watch

Monitor real-time changes in a directory, such as file creation or updates.

find

Find and count file names with certain pattern and find and count directories.

gzcat

Like zcat but works on macOS. gzcat decompresses .gz files and sends the decompressed output directly to standard output (stdout). It behaves like a combination of zcat and cat, allowing us to view or pipe the contents of .gz files without fully extracting them. Works like zless with gzcat file.gz | head.

pigz

pigz (Parallel Implementation of gzip) is a multithreaded alternative to gzip that significantly speeds up compression and decompression by utilizing multiple CPU cores. While maintaining full compatibility with gzip, pigz divides data into chunks and processes them in parallel, making it ideal for large files and systems with multiple cores. It reduces wait times in workflows involving compression or decompression, such as bioinformatics or data science tasks. With faster performance, efficient resource utilization, and scalability, pigz is a powerful tool for modern high-performance computing environments.

nohup ./process_all_ukbpp.sh > process_all_ukbpp.log 2>&1 &
pstree 63844
ps -p 63844 -o stat,etime
watch -n 1 ls -lt /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set
find /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set -type f -name "*with_pval.gz" | wc -l
find /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set -type d | wc -l
for file in *.gz; do
    echo "$file: $(gzcat "$file" | wc -l)"
done
gzcat file.gz | head

4 Merge the CHRs in Parallel (12 Cores)

I used pigz and added a very nice log file.


#!/bin/bash

# Enable debugging for detailed logs
set -x

# Directories and variables
BASE_DIR="/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set"
MERGED_DIR="$BASE_DIR/merged_files"
LOG_FILE="$BASE_DIR/merge_log.txt"

# Create the merged_files directory and log file if they don't exist
mkdir -p "$MERGED_DIR"
touch "$LOG_FILE"

# Log start of the merging process
echo "Merging process started on $(date)" | stdbuf -oL tee -a "$LOG_FILE"

# Function to merge files in a single subdirectory
merge_subdirectory() {
    local sub_dir="$1"
    local merged_dir="$2"

    # Get the base name of the subdirectory
    local protein_name=$(basename "$sub_dir")
    echo "Processing $protein_name" | stdbuf -oL tee -a "$LOG_FILE"

    # Find all the *_with_pval.gz files in the subdirectory
    local files_to_merge=($(find "$sub_dir" -type f -name "*_with_pval.gz"))

    # Check if there are files to merge
    if [[ ${#files_to_merge[@]} -eq 0 ]]; then
        echo "No files to merge in $protein_name" | stdbuf -oL tee -a "$LOG_FILE"
        return
    fi

    # Define the output file path for the merged file
    local merged_file="$merged_dir/${protein_name}_merged_with_pval.gz"

    # Stream header and content directly into the merged file
    {
        gunzip -c "${files_to_merge[0]}" | head -n 1  # Extract header from the first file
        for file in "${files_to_merge[@]}"; do
            gunzip -c "$file" | tail -n +2             # Append content without header
        done
    } | pigz > "$merged_file"

    # Verify if the merged file was created successfully
    if [[ -s "$merged_file" ]]; then
        echo "Merged file created: $merged_file" | stdbuf -oL tee -a "$LOG_FILE"
    else
        echo "Warning: Merged file is empty for $protein_name" | stdbuf -oL tee -a "$LOG_FILE"
    fi
}

export -f merge_subdirectory
export BASE_DIR MERGED_DIR LOG_FILE

# Run the merging process in parallel for each subdirectory
find "$BASE_DIR" -mindepth 1 -maxdepth 1 -type d ! -name "merged_files" | parallel -j12 merge_subdirectory {} "$MERGED_DIR"

# Log end of the merging process
echo "Merging process completed on $(date)" | stdbuf -oL tee -a "$LOG_FILE"

# chmod +x merge_files_by_chr_selected_set.sh
# nohup ./merge_files_by_chr_selected_set.sh > merge_files_by_chr_selected_set.log 2>&1 &
# ps -p 8173 -o stat,etime

5 munge with GenomicSEM by CHR for Each Protein

This script ran and mostly did what it was supposed to: it munged the files, but it saved them to the directory with the script, not where I wanted them. Due to that it didn’t merge the munged files.

library(GenomicSEM)
library(parallel)  # Ensure parallel is loaded for detectCores()

# Set directories
base_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set"
hm3_snplist <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/w_hm3.snplist"
output_directory <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/munged_selected_set"

# Ensure the output directory exists
if (!dir.exists(output_directory)) {
  dir.create(output_directory, recursive = TRUE)
}

# Define column names as a named list
column_names <- list(
  SNP = "rsid",        # The column for SNP identifiers
  A1 = "ALLELE1",      # The column for the first allele
  A2 = "ALLELE0",      # The column for the second allele
  effect = "BETA",     # The column for the effect size
  INFO = "INFO",       # The column for the INFO score
  P = "P",             # The column for the P-value
  N = "N",             # The column for the sample size
  MAF = "A1FREQ",      # The column for the minor allele frequency
  Z = "CHISQ"          # The column for the Z-score or chi-squared value
)

# Function to extract `N` from the first row of a file
extract_N <- function(file, column) {
  header <- read.table(gzfile(file), nrows = 1, header = TRUE)
  if (column %in% colnames(header)) {
    N <- unique(header[[column]])
    if (length(N) == 1) {
      return(as.numeric(N))
    } else {
      stop(paste("N column has multiple values in file:", file))
    }
  } else {
    stop(paste("Column", column, "not found in file:", file))
  }
}

# Function to process a single subdirectory
process_subdirectory <- function(subdir) {
  tryCatch({
    # Get subdirectory name
    subdir_name <- basename(subdir)
    
    # Create output subdirectory
    subdir_output <- file.path(output_directory, subdir_name)
    if (!dir.exists(subdir_output)) {
      dir.create(subdir_output, recursive = TRUE)
    }
    
    # Get the list of chromosome-specific files for chromosomes 1–22
    chr_files <- list.files(subdir, full.names = TRUE, pattern = "chr[1-9]|chr1[0-9]|chr2[0-2]_.*rsid_added_with_pval\\.gz$")
    
    if (length(chr_files) == 0) {
      cat(paste("No files found in subdirectory:", subdir, "\n"))
      return(NULL)
    }
    
    # Extract `N` for each file
    N_values <- sapply(chr_files, extract_N, column = column_names$N)
    
    # Define trait names based on file names
    trait_names <- gsub("_rsid_added_with_pval\\.gz$", "", basename(chr_files))
    
    # Run the munge function with internal parallelization
    munge(
      files = chr_files,
      hm3 = hm3_snplist,
      trait.names = trait_names,
      N = N_values,
      info.filter = 0.9,       # Filter by INFO score
      maf.filter = 0.01,       # Filter by MAF
      column.names = column_names,
      parallel = TRUE,         # Enable internal parallelization
      cores = detectCores() - 1,  # Use all available cores minus one
      overwrite = TRUE         # Allow overwriting of existing files
    )
    
    # Merge all chromosome files into one for the subdirectory
    merged_file <- file.path(subdir_output, paste0(subdir_name, "_merged.sumstats"))
    chromosome_files <- list.files(subdir_output, full.names = TRUE, pattern = "\\.sumstats$")
    
    if (length(chromosome_files) > 0) {
      # Read and merge chromosome files
      all_data <- do.call(rbind, lapply(chromosome_files, function(f) {
        read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
      }))
      
      # Write merged data
      write.table(
        all_data,
        file = merged_file,
        quote = FALSE,
        sep = "\t",
        row.names = FALSE,
        col.names = TRUE
      )
      
      cat(paste("Processing completed for subdirectory:", subdir, "\n"))
    } else {
      cat(paste("No chromosome output files to merge for subdirectory:", subdir, "\n"))
    }
  }, error = function(e) {
    cat(paste("Error processing subdirectory:", subdir, "\nError message:", e$message, "\n"))
  })
}

# Get all subdirectories
subdirs <- list.dirs(base_dir, full.names = TRUE, recursive = FALSE)

# Sequentially process subdirectories
for (subdir in subdirs) {
  process_subdirectory(subdir)
}

# Run with: 
# nohup Rscript functional_set_userGWAS.R > functional_set_userGWAS.log 2>&1 &

6 Merge the Munged for Each Protein

library(parallel)

# Define directories
base_dir <- "/Users/charleenadams/ukbppp"
results_dir <- file.path(base_dir, "merged_munged_functional_protein_results")

# Ensure the results directory exists
if (!dir.exists(results_dir)) {
  dir.create(results_dir, recursive = TRUE)
}

# Function to merge chromosome files for a protein
merge_chromosome_files <- function(protein_name, chr_files, output_dir) {
  tryCatch({
    # Define output file path
    merged_file <- file.path(output_dir, paste0(protein_name, "_merged.sumstats.gz"))
    
    # Merge chromosome files
    all_data <- do.call(rbind, lapply(chr_files, function(f) {
      read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
    }))
    
    # Save the merged file
    write.table(
      all_data,
      file = gzfile(merged_file),
      quote = FALSE,
      sep = "\t",
      row.names = FALSE,
      col.names = TRUE
    )
    cat(paste("Merged protein:", protein_name, "to", merged_file, "\n"))
    return(merged_file)
  }, error = function(e) {
    cat(paste("Error merging protein:", protein_name, "\nError message:", e$message, "\n"))
    return(NULL)
  })
}

# Get list of all chromosome files
chr_files <- list.files(base_dir, pattern = "discovery_chr.*\\.sumstats\\.gz$", full.names = TRUE)

# Extract unique protein identifiers
proteins <- unique(gsub("discovery_chr[0-9]+_", "", gsub("\\.sumstats\\.gz$", "", basename(chr_files))))

# Function to process a single protein
process_protein <- function(protein) {
  protein_files <- grep(protein, chr_files, value = TRUE)
  merge_chromosome_files(protein, protein_files, output_dir = results_dir)
}

# Run merging in parallel for 12 proteins
num_cores <- detectCores() - 1  # Use all but one core
cl <- makeCluster(num_cores)
clusterExport(cl, c("chr_files", "proteins", "merge_chromosome_files", "results_dir"))
clusterEvalQ(cl, library(parallel))
merged_files <- parLapply(cl, proteins, process_protein)
stopCluster(cl)

# Print summary
cat("Merging completed for all proteins.\n")

# nohup Rscript merged_munged_functional_protein_results.R > merged_munged_functional_protein_results.log 2>&1 &

7 Harmonize

# Load libraries
library(data.table)

# Define paths
current_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/"
hapmap3_snplist_path <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/w_hm3.snplist"
output_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"

# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Dynamically list files to process
summary_stats <- list.files(
  path = current_dir, 
  pattern = "*.sumstats\\.gz$", 
  full.names = TRUE, 
  all.files = TRUE
)

# Check if any files are found
if (length(summary_stats) == 0) {
  stop("No matching summary statistics files found in the specified directory.")
}
cat("Files to process:\n", paste(summary_stats, collapse = "\n"), "\n")

# Load HapMap3 SNP list
hm3_snps <- fread(hapmap3_snplist_path, header = TRUE)

# Function to harmonize SNPs
harmonize_snps <- function(file, hm3_snps) {
  # Read summary statistics
  data <- fread(file, header = TRUE)[, .(SNP, A1, A2, N, Z)]
  
  # Filter to SNPs in HapMap3
  data <- data[SNP %in% hm3_snps$SNP]
  
  # Merge with HapMap3 to align alleles
  aligned_data <- merge(data, hm3_snps, by = "SNP", suffixes = c(".data", ".hm3"))
  
  # Flip alleles and Z-scores where necessary
  mismatched <- which(aligned_data$A1.data == aligned_data$A2.hm3 & 
                        aligned_data$A2.data == aligned_data$A1.hm3)
  if (length(mismatched) > 0) {
    aligned_data$Z[mismatched] <- -aligned_data$Z[mismatched]
    temp <- aligned_data$A1.data[mismatched]
    aligned_data$A1.data[mismatched] <- aligned_data$A2.data[mismatched]
    aligned_data$A2.data[mismatched] <- temp
  }
  
  # Return cleaned data
  aligned_data[, .(SNP, A1 = A1.data, A2 = A2.data, N, Z)]
}

# Process and save harmonized files
for (file in summary_stats) {
  cat("Processing:", file, "\n")
  
  # Harmonize SNPs
  harmonized_data <- harmonize_snps(file, hm3_snps)
  
  # Extract file name and create correct output name
  base_name <- gsub("\\.sumstats\\.gz$", "_harmonized.sumstats.gz", basename(file))
  temp_file <- file.path(output_dir, gsub("\\.gz$", "", base_name))  # Temporary uncompressed file
  output_file <- file.path(output_dir, base_name)  # Final compressed file
  
  # Save to temporary uncompressed file
  fwrite(harmonized_data, file = temp_file, sep = "\t")
  
  # Compress the file
  system(paste("gzip -f", shQuote(temp_file)))
  
  cat("Saved harmonized file:", output_file, "\n")
}

cat("All matching files harmonized and saved in:", output_dir, "\n")

8 Remove Incomplete Data

#!/bin/bash

# Define the directory containing the files
input_dir="/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"
output_dir="/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"

# Loop through all files ending in _munged.sumstats.gz
for file in "$input_dir"/*.sumstats.gz; do
    # Extract the base filename without extension
    base_name=$(basename "$file" .sumstats.gz)

    # Create output file name by appending "noNA" before .gz
    output_file="${output_dir}/${base_name}_noNA.sumstats.gz"

    # Unzip, filter rows with complete observations, and gzip the result
    gunzip -c "$file" | awk 'NR == 1 || NF == 5' | gzip > "$output_file"

    echo "Processed $file -> $output_file"
done

9 Add P Back!

library(data.table)

# Define the function to add a P column
add_p_column <- function(file, output_dir) {
  tryCatch({
    data <- fread(file)
    if (!"P" %in% colnames(data)) {
      if ("Z" %in% colnames(data)) {
        # Calculate P-values from Z-scores
        data[, P := 2 * pnorm(-abs(Z))]
        
        # Create output file path
        output_file <- file.path(output_dir, basename(file))
        
        # Save the updated data to the new file
        fwrite(data, output_file, sep = "\t")
        cat("Added P column to:", output_file, "\n")
      } else {
        stop("Z column missing. Cannot calculate P-values in file:", file)
      }
    } else {
      cat("P column already exists in:", file, "\n")
    }
  }, error = function(e) {
    cat("Error processing file:", file, "\nError message:", e$message, "\n")
  })
}

# Set input and output directories
input_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"
output_dir <- file.path(input_dir, "pvals.fix")

# Create the output directory if it doesn't exist
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# List all files matching the pattern in the input directory
files <- list.files(input_dir, pattern = "\\harmonized_noNA.sumstats\\.gz$", full.names = TRUE)

# Apply the function to all files
lapply(files, add_p_column, output_dir = output_dir)

10 Process the Munged, Merged Sumstat with sumstats()

library(data.table)
library(GenomicSEM)

# Step 1: Define paths and variables
input_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix"  # Input directory containing sumstats files
output_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/summary_stats.txt"      # Path to save the final summary stats
ref <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/reference.1000G.maf.0.005.txt"      # Reference file for SNP variance

# List all sumstats files
traits <- list.files(
  path = input_dir,
  pattern = "\\harmonized_noNA.sumstats\\.gz$",  # Match gzipped sumstats files
  full.names = TRUE
)

# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\_merged_harmonized_noNA.sumstats\\.gz$", "", basename(traits))

# Step 2: Extract sample sizes (N) from each trait file
# Assuming the column name for sample size is "N"
get_sample_size <- function(file) {
  data <- fread(file, select = "N", nrows = 1)  # Only read the first row to extract N
  return(data$N[1])
}

N <- sapply(traits, get_sample_size)
cat("Extracted sample sizes:\n")

# Step 3: Prepare the summary statistics with sumstats()
# Define arguments for sumstats()
se.logit <- rep(FALSE, length(traits))  # Continuous traits; SEs are not logistic
OLS <- rep(TRUE, length(traits))       # Ordinary least squares for continuous traits
linprob <- rep(FALSE, length(traits))  # Traits are not dichotomous; linear probability does not apply

# Run sumstats() to align and scale summary statistics
aligned_sumstats <- sumstats(
  files = traits,
  ref = ref,
  trait.names = trait.names,
  se.logit = se.logit,
  OLS = OLS,
  linprob = linprob,
  N = N,
  betas = NULL,
  #maf.filter = 0.01,
  keep.indel = FALSE,
  parallel = TRUE,
  cores = 12  # Adjust for available cores
)

# Save the aligned summary statistics
fwrite(aligned_sumstats, file = output_path, sep = "\t")
cat("Aligned summary statistics saved to:", output_path, "\n")

# Optional: Verify the saved file by re-loading it
summary_stats_read <- fread(output_path)

11 Multivariate ldsc in GenomicSEM

# Load required library
library(GenomicSEM)
library(data.table)
library(parallel)
library(qqman)

# Define paths and variables
traits <- list.files("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix", 
                     pattern = "\\.sumstats\\.gz$", full.names = TRUE)

# Set sample and population prevalences to NA to bypass liability scaling
sample.prev <- rep(NA, length(traits))  # Treat traits as continuous
population.prev <- rep(NA, length(traits))

ld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"
wld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"

# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\_merged_harmonized_noNA.sumstats\\.gz$", "", basename(traits))
head(trait.names)

# Redirect console output to a log file
sink("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/ldsc_rgs_log.txt")

# Run multivariable LDSC
LDSCoutput <- ldsc(traits = traits,
                   sample.prev = sample.prev,
                   population.prev = population.prev,
                   ld = ld,
                   wld = wld,
                   trait.names = trait.names,
                   stand = TRUE)

# Stop capturing the console output
sink()

# Save the results
output_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs"
saveRDS(LDSCoutput, file = file.path(output_path, "LDSCoutput.rds"))

# Print completion message
cat("Multivariable LDSC analysis complete and save in:\n", output_path, "\n")

12 Factor Analysis

Here, we perform a factor analysis to ascertain from the data how many factors to use for our latent model. I explored 1-5 factors.

Metric What It Measures What to Look For Interpretation for Your Results
RMSR Fit between observed and modeled data Lower is better (< 0.05 = excellent) 5-factor model = slightly better fit (RMSR = 0.04), but improvement over 4-factor model (RMSR = 0.05) is marginal.
Degrees of Freedom Parsimony (simplicity of model) Higher df = simpler model 5-factor model = more complex (df = 16), with diminishing returns compared to the 4-factor model (df = 24).
Objective Function Fit of model (ML estimation) Lower is better The 5-factor model (Objective = 15.72) has a slightly better fit than the 4-factor model (Objective = 17.25), but the improvement is small.

12.1 What Each Metric Means

  1. RMSR (Root Mean Square of Residuals):
    • Measures the discrepancy between the observed correlations in your data and the correlations predicted by your factor model.
    • Smaller RMSR values indicate a better fit. For your models:
      • 4-factor model: RMSR = 0.05 (good fit).
      • 5-factor model: RMSR = 0.04 (excellent fit but marginal improvement over 4 factors).
  2. Degrees of Freedom (df):
    • Reflects how parsimonious (simple) the model is. Models with fewer factors have higher df, as they estimate fewer parameters.
    • Higher df indicates a simpler model with less risk of overfitting. For your results:
      • The 4-factor model strikes a good balance with df = 24, while the 5-factor model (df = 16) adds complexity with minimal improvement.
  3. Objective Function:
    • Quantifies how well the model fits your data using maximum likelihood estimation (MLE). Lower values indicate better fit.
    • The 5-factor model has a smaller objective function (15.72) compared to the 4-factor model (17.25), but the improvement is minor.

12.2 Conclusion

  • While the 5-factor model slightly improves fit, the gains are minimal compared to the 4-factor model.
  • The 4-factor model is recommended as it balances simplicity (higher df) and fit (RMSR = 0.05) without overcomplicating the analysis.
# Load necessary libraries
library(psych)  # For EFA
library(GPArotation)  # For rotations

# Define the correlation matrix
# Extract the LDSC genetic correlation matrix
ldsc_results <- readRDS("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/LDSCoutput.rds")
cor_matrix <- ldsc_results$S
write.csv(cor_matrix, "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/cor_matrix.csv")

cor_matrix <- cor.smooth(cor_matrix)
eigenvalues <- eigen(cor_matrix)$values
if (any(eigenvalues <= 0)) {
    cat("The smoothed matrix is still not positive definite. Smallest eigenvalue:", min(eigenvalues), "\n")
} else {
    cat("Matrix is now positive definite.\n")
}
## Matrix is now positive definite.
# Run EFA to determine the number of latent factors
efa_results1 <- fa(
  r = cor_matrix,       # Genetic correlation matrix
  nfactors = 1,         # Test up to 1 factors 
  rotate = "oblimin",   # Allow factors to correlate
  fm = "ml"             # Use maximum likelihood estimation
)

# Run EFA to determine the number of latent factors
efa_results2 <- fa(
  r = cor_matrix,       # Genetic correlation matrix
  nfactors = 2,         # Test up to 1 factors 
  rotate = "oblimin",   # Allow factors to correlate
  fm = "ml"             # Use maximum likelihood estimation
)

# Run EFA to determine the number of latent factors
efa_results3 <- fa(
  r = cor_matrix,       # Genetic correlation matrix
  nfactors = 3,         # Test up to 3 factors 
  rotate = "oblimin",   # Allow factors to correlate
  fm = "ml"             # Use maximum likelihood estimation
)

# Run EFA to determine the number of latent factors
efa_results4 <- fa(
  r = cor_matrix,       # Genetic correlation matrix
  nfactors = 4,         # Test up to 3 factors 
  rotate = "oblimin",   # Allow factors to correlate
  fm = "ml"             # Use maximum likelihood estimation
)

13 Defining the Latent userGWAS Model with Four Factors

Protein ML1 ML2 ML3 ML4 Assigned Factor
CXCL13 0.83 -0.06 -0.06 0.19 F1
FCRL2 0.39 0.20 0.06 0.16 F1
NOS3 0.93 0.05 0.17 -0.04 F1
FCRL5 0.12 0.58 0.17 0.10 F2
LY9 0.30 0.61 -0.23 0.03 F2
LY96 -0.16 0.81 0.25 0.17 F2
SLAMF7 0.01 1.04 -0.07 -0.06 F2
TNFRSF13B 0.26 0.37 -0.20 0.25 F2
LIF 0.32 0.12 0.73 0.09 F3
TXNDC15 0.07 0.55 0.34 0.03 F3
CCL19 0.05 -0.04 0.08 0.83 F4
CCL21 -0.01 0.03 -0.06 0.89 F4

14 userGWAS with Four Latent Factors

# Load necessary libraries
library(psych)  # For EFA
library(GPArotation)  # For rotations
library(GenomicSEM)
library(semPlot)
library(GenomicSEM)
library(parallel)  # Ensure parallel is loaded for detectCores()
library(data.table)
library(qqman)
library(dplyr)

# Start timer to measure runtime
start_time <- Sys.time()

# Load summary statistics (use a subset of SNPs for testing)
summary_stats <- fread("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/summary_stats.txt")

# Replace colons with underscores in column names
colnames(summary_stats) <- gsub(":", "_", colnames(summary_stats))

# Step 3: Load LDSC results
ldsc_results_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/LDSCoutput.rds"
ldsc_results <- readRDS(ldsc_results_path)

# Replace colons with underscores in the names of ldsc_results
if (!is.null(attr(ldsc_results$S, "dimnames"))) {
  attr(ldsc_results$S, "dimnames")[[2]] <- gsub(":", "_", attr(ldsc_results$S, "dimnames")[[2]])
}

if (!is.null(attr(ldsc_results$S_Stand, "dimnames"))) {
  attr(ldsc_results$S_Stand, "dimnames")[[2]] <- gsub(":", "_", attr(ldsc_results$S_Stand, "dimnames")[[2]])
}

# Check if names have been updated
print(attr(ldsc_results$S, "dimnames")[[2]])
print(attr(ldsc_results$S_Stand, "dimnames")[[2]])

# Step 4: Specify the model for userGWAS
ldsc_trait_names <- attr(ldsc_results$S, "dimnames")[[2]]

# Map trait names to simpler versions for compatibility with userGWAS
trait_mapping <- c(
  "CCL19_Q99731_OID21030_v1_Neurology" = "CCL19",
  "CCL21_O00585_OID20686_v1_Inflammation" = "CCL21",
  "CXCL13_O43927_OID21024_v1_Neurology" = "CXCL13",
  "FCRL2_Q96LA5_OID20639_v1_Inflammation" = "FCRL2",
  "FCRL5_Q96RD9_OID21111_v1_Neurology" = "FCRL5",
  "LIF_P15018_OID20824_v1_Neurology" = "LIF",
  "LY9_Q9HBG7_OID20670_v1_Inflammation" = "LY9",
  "LY96_Q9Y6Y9_OID20945_v1_Neurology" = "LY96",
  "NOS3_P29474_OID20834_v1_Neurology" = "NOS3",
  "SLAMF7_Q9NQ25_OID20602_v1_Inflammation" = "SLAMF7",
  "TNFRSF13B_O14836_OID20702_v1_Inflammation" = "TNFRSF13B",
  "TXNDC15_Q96J42_OID21495_v1_Oncology" = "TXNDC15"
)

# Step 5: Specify the model for userGWAS with the simplified trait names
model_string <- paste0(
  "F1 =~ CXCL13_O43927_OID21024_v1_Neurology + FCRL2_Q96LA5_OID20639_v1_Inflammation + NOS3_P29474_OID20834_v1_Neurology\n",
  "F2 =~ FCRL5_Q96RD9_OID21111_v1_Neurology + LY9_Q9HBG7_OID20670_v1_Inflammation + LY96_Q9Y6Y9_OID20945_v1_Neurology + SLAMF7_Q9NQ25_OID20602_v1_Inflammation + TNFRSF13B_O14836_OID20702_v1_Inflammation\n",
  "F3 =~ LIF_P15018_OID20824_v1_Neurology + TXNDC15_Q96J42_OID21495_v1_Oncology\n",
  "F4 =~ CCL19_Q99731_OID21030_v1_Neurology + CCL21_O00585_OID20686_v1_Inflammation\n",
  "F1 ~~ F2\nF1 ~~ F3\nF1 ~~ F4\n",
  "F2 ~~ F3\nF2 ~~ F4\n",
  "F3 ~~ F4\n",
  "F1 ~ SNP\nF2 ~ SNP\nF3 ~ SNP\nF4 ~ SNP"
)
cat("Model String:\n", model_string, "\n")

# Step 6: Redefine the SNPs after the mapping 
SNPs2 <- summary_stats[, c("SNP", "CHR", "BP", "MAF", "A1", "A2", 
                           paste0("beta.", names(trait_mapping)), 
                           paste0("se.", names(trait_mapping)))]

# Step 7: Set up userGWAS parameters
max_cores <- 15
cat("Using", max_cores, "cores for parallel processing.\n")
CorrelatedFactors <- userGWAS(
  covstruc = ldsc_results, 
  SNPs = summary_stats, 
  model = model_string, 
  estimation = "DWLS", 
  printwarn = TRUE, 
  sub = c("F1~SNP", "F2~SNP", "F3~SNP", "F4~SNP"),
  toler = FALSE,
  SNPSE = FALSE,
  parallel = TRUE,
  GC = "standard",
  MPI = FALSE,
  smooth_check = TRUE,
  fix_measurement = TRUE,
  Q_SNP = TRUE
)

# Step 8: Output results
result_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/CorrelatedFactors12.rds"
saveRDS(CorrelatedFactors, file = result_path)
cat("GWAS completed. Results saved to:", result_path, "\n")

# Step 9: Measure runtime
end_time <- Sys.time()
cat("Total runtime:", round(difftime(end_time, start_time, units = "mins"), 2), "minutes.\n")

# Filter the F1 results for quality
CorrelatedFactors <- readRDS("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/CorrelatedFactors12.rds")

CF1_bad <- CorrelatedFactors[[1]][which(CorrelatedFactors[[1]]$Q_SNP_pval < 5e-8),]
write.table(CF1_bad$SNP, "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/poor_quality_snps.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)

# Look at and filter F1 results for quality
# Filter by proximity as an LD proxy

# Define low-quality SNPs and the full dataset
low_quality_snps <- CF1_bad  # Subset of SNPs with Q_SNP_pval < 5e-8
all_snps <- CorrelatedFactors[[1]]  # The full SNP dataset

# Parameters
ld_window_size <- 500000  # Proximity threshold in base pairs

# Filter low-quality SNPs for required columns
low_quality_snps <- low_quality_snps %>% select(SNP, CHR, BP)

# Exclude SNPs in LD with low-quality SNPs based on proximity
filtered_snps <- all_snps %>%
  filter(!sapply(1:n(), function(i) {
    any(abs(all_snps$BP[i] - low_quality_snps$BP[low_quality_snps$CHR == all_snps$CHR[i]]) <= ld_window_size)
  }))

# Save the filtered SNPs
write.table(filtered_snps, "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/F1_filtered_snps.tsv", row.names = FALSE, quote = FALSE, sep = "\t")

rs2412973 <- filtered_snps[which(filtered_snps$SNP=="rs2412973"),]
rs2412973

# Output results before filtering
significant_snps1 <- filtered_snps[filtered_snps$Pval_Estimate < 5e-08, ]
dim(significant_snps1)
low_heterogeneity1 <- filtered_snps[filtered_snps$Q_SNP_pval > 0.05, ]
dim(low_heterogeneity1)

# Proteins by Assigned Factor
F1 <- c("CXCL13", "FCRL2", "NOS3")
F2 <- c("FCRL5", "LY9", "LY96", "SLAMF7", "TNFRSF13B")
F3 <- c("LIF", "TXNDC15")
F4 <- c("CCL19", "CCL21")

# F1: CXCL13, FCRL2, NOS3

# Prepare data for plots
manhattan_data <- data.frame(
  SNP = CorrelatedFactors[[1]]$SNP,
  CHR = CorrelatedFactors[[1]]$CHR,
  BP = CorrelatedFactors[[1]]$BP,
  P = CorrelatedFactors[[1]]$Pval_Estimate
)

# Prepare data for plots
manhattan_data <- data.frame(
  SNP = filtered_snps$SNP,
  CHR = filtered_snps$CHR,
  BP = filtered_snps$BP,
  P = filtered_snps$Pval_Estimate
)

# Save Manhattan plot to file
png(
  filename = "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/Manhattan_CorrelatedFactors_filtered_snps_F1.png", 
  width = 8, 
  height = 6, 
  units = "in", 
  res = 300
)

# Step 6: Display Manhattan plot
cat("Displaying Manhattan plot.\n")
 manhattan(
  manhattan_data,
  chr = "CHR",
  bp = "BP",
  p = "P",
  snp = "SNP",
  genomewideline = -log10(5e-8),
  suggestiveline = -log10(1e-5),
  col = c("blue4", "skyblue"),
  main = "Manhattan Plot for F1 Latent Genetic Factor (CXCL13, FCRL2, NOS3)",
  ylim = c(0, 10)
)

# Close the graphics device
dev.off()

# Save QQ plot to file
png(
  filename = "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/QQ_Latent_CorrelatedFactors_F1.png", 
  width = 8, 
  height = 6, 
  units = "in", 
  res = 300
)

# Step 7: Display QQ plot
cat("Displaying QQ plot.\n")
qq(manhattan_data$P, main = "QQ Plot for F1 Latent Genetic Factor (CXCL13, FCRL2, NOS3)")

# Close the graphics device
dev.off()

# End timer and display runtime
end_time <- Sys.time()
cat("Total runtime:", end_time - start_time, "\n")

# Run with:
# tmux new -s my_session (tmux my_session to end)
# caffeinate -i nohup Rscript userGWAS12.R > userGWAS12_log.txt 2>&1 &
# pstree -p 21557
# ps -p 21557 -o stat,etime

14.1 Manhattan Plot

# Path to the saved Manhattan plot
manhattan_plot_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/Manhattan_CorrelatedFactors_filtered_snps_F1.png"

# Display the image
knitr::include_graphics(manhattan_plot_path)
Manhattan Plot for F1 Latent Genetic Factor

Manhattan Plot for F1 Latent Genetic Factor

14.2 QQ Plot

# Path to the saved QQ plot
qq_plot_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/QQ_Latent_CorrelatedFactors_F1.png"

# Display the image
knitr::include_graphics(qq_plot_path)
QQ Plot for F1 Latent Genetic Factor

QQ Plot for F1 Latent Genetic Factor

15 Project Directory Summary for Latent Factor Multivariate userGWAS

16 Table of Contents


17 Base Directories

17.1 /Users/charleenadams/temp_BI/chemokine_rgs_olink/munged

  • Purpose: Stores intermediate munged files generated during GWAS preparation.

17.2 /Users/charleenadams/ukbppp/selected_set

  • Purpose: Contains .tar files for selected proteins to be processed.
  • Contents:
    • SLAMF7_Q9NQ25_OID20602_v1_Inflammation.tar
    • CCL19_Q99731_OID21030_v1_Neurology.tar
    • CCL21_O00585_OID20686_v1_Inflammation.tar
    • TXNDC15_Q96J42_OID21495_v1_Oncology.tar
    • ... (additional .tar files for all 12 proteins).

18 Key Subdirectories and Their Contents

18.1 /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set

  • Purpose: Stores processed results for selected protein .tar files.
  • Subdirectories:
    • /merged_files:
      • Purpose: Contains merged .gz files by protein and chromosome.
      • Example File: protein_name_merged_rsid_added.gz.
    • Individual subdirectories for each protein, containing:
      • _rsid_added.gz files: Processed files with RSID mappings added.
      • _with_pval.gz files: Processed files with p-values added.
  • Log File: processing_log.txt.

18.2 /Users/charleenadams/temp_BI/chemokine_rgs_olink/maps

  • Purpose: Contains RSID map files used to add RSIDs to chromosome-specific files.
  • Contents:
    • olink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv.gz
    • olink_rsid_map_mac5_info03_b0_7_chr2_patched_v2.tsv.gz
    • ... (map files for other chromosomes).

18.3 /Users/charleenadams/ukbppp/merged_munged_functional_protein_results

  • Purpose: Stores final merged and munged summary statistics and downstream results.

18.3.1 Subdirectories:

  1. /pvals.fix:
    • Contains munged files with p-values and harmonized data.
    • Example Files:
      • protein1_harmonized_noNA.sumstats.gz: Harmonized file with no missing data.
      • protein2_harmonized_noNA.sumstats.gz.
    • Purpose: These files serve as inputs for LDSC and multivariate GWAS analysis.
  2. /ldsc_results_rgs:
    • Purpose: Contains results from LDSC analysis.
    • Contents:
      • LDSCoutput.rds: Contains genetic correlations for traits.
      • ldsc_rgs_log.txt: Log file documenting LDSC analysis.
      • cor_matrix.csv: Genetic correlation matrix extracted from LDSC results.

18.4 /Users/charleenadams/temp_BI/mvGWAS_chemokines

  • Purpose: Stores LDSC reference files required for analysis.
  • Contents:
    • /eur_w_ld_chr: Directory containing LD scores for European ancestry.
    • w_hm3.snplist: HapMap 3 SNP list for filtering summary statistics.

19 Intermediate Files

19.1 .tar Files (Protein Datasets)

  • Purpose: Compressed files for each selected protein.
  • Location: /Users/charleenadams/ukbppp/selected_set/.
  • Example Contents:
    • SLAMF7_Q9NQ25_OID20602_v1_Inflammation.tar.
    • LIF_P15018_OID20824_v1_Neurology.tar.

19.2 Processed .gz Files

  • Purpose: Files processed for RSID mapping and p-value addition.
  • Location: /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/.
  • Example Contents:
    • chr1_rsid_added.gz: File with RSIDs mapped.
    • chr1_rsid_added_with_pval.gz: File with RSIDs and p-values.

19.3 Merged .gz Files

  • Purpose: Protein-level merged data combining chromosome-specific results.
  • Location: /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/merged_files/.
  • Example File: protein1_merged_with_pval.gz.

19.4 Munged Files

  • Purpose: Harmonized files prepared for LDSC and downstream GWAS.
  • Location: /Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/.
  • Example File: protein1_harmonized_noNA.sumstats.gz.

19.5 LDSC Outputs

  • Purpose: Results from LDSC analysis.
  • Location: /Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/.
  • Key Files:
    • LDSCoutput.rds: Genetic correlation results.
    • cor_matrix.csv: Correlation matrix for exploratory factor analysis.

20 Results and Output

20.1 Plots

  1. Manhattan Plot:
    • Path: /Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/Manhattan_CorrelatedFactors.png.
    • Purpose: Displays significant SNPs influencing latent genetic factors.
  2. QQ Plot:
    • Path: /Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/QQ_Latent_CorrelatedFactors.png.
    • Purpose: Examines p-value distribution for latent genetic factors.

21 Detailed Workflows

21.1 1. Copy and Extract .tar Files

  • Source: /Users/charleenadams/ukbppp/.
  • Destination: /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/.

21.2 2. Add RSIDs and P-Values

  • Input: Chromosome-specific files in /processed_ukbppp_selected_set/.
  • RSID Mapping: Uses map files from /maps/.
  • Output:
    • _rsid_added.gz: Files with RSIDs added.
    • _with_pval.gz: Files with p-values added.

21.3 3. Merge Files by Protein

  • Input: Chromosome-specific files.
  • Output Directory: /merged_files/.
  • Output Files: protein1_merged_with_pval.gz.

21.4 4. Munging Summary Statistics

  • Input: Merged files.
  • Filter Criteria:
    • INFO Score: INFO > 0.9.
    • Minor Allele Frequency (MAF): MAF > 0.01.
  • Output Directory: /pvals.fix/.
  • Output Files: protein1_harmonized_noNA.sumstats.gz.

21.5 5. LDSC Analysis

  • Input: Munged .sumstats.gz files.
  • Reference: /mvGWAS_chemokines/eur_w_ld_chr/.
  • Output Directory: /ldsc_results_rgs/.
  • Output Files:
    • LDSCoutput.rds.
    • ldsc_rgs_log.txt.
    • cor_matrix.csv.

21.6 6. Exploratory Factor Analysis (EFA)

  • Input: cor_matrix.csv.
  • Output: Optimal number of latent factors for GWAS.

21.7 7. userGWAS Analysis

  • Input: LDSC results, munged SNPs, and latent model specification.
  • Output Directory: /pvals.fix/harmonized/pvals.fix/.
  • Output Files:
    • Manhattan and QQ plots.
    • Significant SNPs influencing latent factors.

21.8 Notes

  • Log Files:
    • processing_log.txt: Logs processing steps.
    • ldsc_rgs_log.txt: Logs LDSC analysis.
  • Plots:
    • Saved in /pvals.fix/harmonized/pvals.fix/.