userGWAS of Selected Proteins (Controlled by
rs2412973)munge with GenomicSEM
by CHR for Each Proteinsumstats()ldsc in
GenomicSEMuserGWASIn this analysis, we employ a genomic structural equation model (GenomicSEM) to define a new phenotype: a latent factor that influences 12 proteins we know are influenced by a common SNP.
tmux
Keeps scripts or tasks running even if terminal session disconnects
or crashes.
Ideal for long-running processes in remote servers or during unstable
connections. Using tmux and nohup is mostly
redundant.
pstree
View the hierarchy of a process to see its parent and child processes. Useful for seeing if parallelization is working or if the child processes are sleeping.
watch
Monitor real-time changes in a directory, such as file creation or updates.
find
Find and count file names with certain pattern and find and count directories.
gzcat
Like zcat but works on macOS. gzcat decompresses .gz files and sends the decompressed output directly to standard output (stdout). It behaves like a combination of zcat and cat, allowing us to view or pipe the contents of .gz files without fully extracting them. Works like zless with gzcat file.gz | head.
pigz
pigz (Parallel Implementation of gzip) is a
multithreaded alternative to gzip that significantly speeds
up compression and decompression by utilizing multiple CPU cores. While
maintaining full compatibility with gzip, pigz
divides data into chunks and processes them in parallel, making it ideal
for large files and systems with multiple cores. It reduces wait times
in workflows involving compression or decompression, such as
bioinformatics or data science tasks. With faster performance, efficient
resource utilization, and scalability, pigz is a powerful
tool for modern high-performance computing environments.
nohup ./process_all_ukbpp.sh > process_all_ukbpp.log 2>&1 &
pstree 63844
ps -p 63844 -o stat,etime
watch -n 1 ls -lt /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set
find /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set -type f -name "*with_pval.gz" | wc -l
find /Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set -type d | wc -l
for file in *.gz; do
echo "$file: $(gzcat "$file" | wc -l)"
done
gzcat file.gz | head
I used pigz and added a very nice log file.
#!/bin/bash
# Enable debugging for detailed logs
set -x
# Directories and variables
BASE_DIR="/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set"
MERGED_DIR="$BASE_DIR/merged_files"
LOG_FILE="$BASE_DIR/merge_log.txt"
# Create the merged_files directory and log file if they don't exist
mkdir -p "$MERGED_DIR"
touch "$LOG_FILE"
# Log start of the merging process
echo "Merging process started on $(date)" | stdbuf -oL tee -a "$LOG_FILE"
# Function to merge files in a single subdirectory
merge_subdirectory() {
local sub_dir="$1"
local merged_dir="$2"
# Get the base name of the subdirectory
local protein_name=$(basename "$sub_dir")
echo "Processing $protein_name" | stdbuf -oL tee -a "$LOG_FILE"
# Find all the *_with_pval.gz files in the subdirectory
local files_to_merge=($(find "$sub_dir" -type f -name "*_with_pval.gz"))
# Check if there are files to merge
if [[ ${#files_to_merge[@]} -eq 0 ]]; then
echo "No files to merge in $protein_name" | stdbuf -oL tee -a "$LOG_FILE"
return
fi
# Define the output file path for the merged file
local merged_file="$merged_dir/${protein_name}_merged_with_pval.gz"
# Stream header and content directly into the merged file
{
gunzip -c "${files_to_merge[0]}" | head -n 1 # Extract header from the first file
for file in "${files_to_merge[@]}"; do
gunzip -c "$file" | tail -n +2 # Append content without header
done
} | pigz > "$merged_file"
# Verify if the merged file was created successfully
if [[ -s "$merged_file" ]]; then
echo "Merged file created: $merged_file" | stdbuf -oL tee -a "$LOG_FILE"
else
echo "Warning: Merged file is empty for $protein_name" | stdbuf -oL tee -a "$LOG_FILE"
fi
}
export -f merge_subdirectory
export BASE_DIR MERGED_DIR LOG_FILE
# Run the merging process in parallel for each subdirectory
find "$BASE_DIR" -mindepth 1 -maxdepth 1 -type d ! -name "merged_files" | parallel -j12 merge_subdirectory {} "$MERGED_DIR"
# Log end of the merging process
echo "Merging process completed on $(date)" | stdbuf -oL tee -a "$LOG_FILE"
# chmod +x merge_files_by_chr_selected_set.sh
# nohup ./merge_files_by_chr_selected_set.sh > merge_files_by_chr_selected_set.log 2>&1 &
# ps -p 8173 -o stat,etime
munge with
GenomicSEM by CHR for Each ProteinThis script ran and mostly did what it was supposed to: it munged the files, but it saved them to the directory with the script, not where I wanted them. Due to that it didn’t merge the munged files.
library(GenomicSEM)
library(parallel) # Ensure parallel is loaded for detectCores()
# Set directories
base_dir <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set"
hm3_snplist <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/w_hm3.snplist"
output_directory <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/munged_selected_set"
# Ensure the output directory exists
if (!dir.exists(output_directory)) {
dir.create(output_directory, recursive = TRUE)
}
# Define column names as a named list
column_names <- list(
SNP = "rsid", # The column for SNP identifiers
A1 = "ALLELE1", # The column for the first allele
A2 = "ALLELE0", # The column for the second allele
effect = "BETA", # The column for the effect size
INFO = "INFO", # The column for the INFO score
P = "P", # The column for the P-value
N = "N", # The column for the sample size
MAF = "A1FREQ", # The column for the minor allele frequency
Z = "CHISQ" # The column for the Z-score or chi-squared value
)
# Function to extract `N` from the first row of a file
extract_N <- function(file, column) {
header <- read.table(gzfile(file), nrows = 1, header = TRUE)
if (column %in% colnames(header)) {
N <- unique(header[[column]])
if (length(N) == 1) {
return(as.numeric(N))
} else {
stop(paste("N column has multiple values in file:", file))
}
} else {
stop(paste("Column", column, "not found in file:", file))
}
}
# Function to process a single subdirectory
process_subdirectory <- function(subdir) {
tryCatch({
# Get subdirectory name
subdir_name <- basename(subdir)
# Create output subdirectory
subdir_output <- file.path(output_directory, subdir_name)
if (!dir.exists(subdir_output)) {
dir.create(subdir_output, recursive = TRUE)
}
# Get the list of chromosome-specific files for chromosomes 1–22
chr_files <- list.files(subdir, full.names = TRUE, pattern = "chr[1-9]|chr1[0-9]|chr2[0-2]_.*rsid_added_with_pval\\.gz$")
if (length(chr_files) == 0) {
cat(paste("No files found in subdirectory:", subdir, "\n"))
return(NULL)
}
# Extract `N` for each file
N_values <- sapply(chr_files, extract_N, column = column_names$N)
# Define trait names based on file names
trait_names <- gsub("_rsid_added_with_pval\\.gz$", "", basename(chr_files))
# Run the munge function with internal parallelization
munge(
files = chr_files,
hm3 = hm3_snplist,
trait.names = trait_names,
N = N_values,
info.filter = 0.9, # Filter by INFO score
maf.filter = 0.01, # Filter by MAF
column.names = column_names,
parallel = TRUE, # Enable internal parallelization
cores = detectCores() - 1, # Use all available cores minus one
overwrite = TRUE # Allow overwriting of existing files
)
# Merge all chromosome files into one for the subdirectory
merged_file <- file.path(subdir_output, paste0(subdir_name, "_merged.sumstats"))
chromosome_files <- list.files(subdir_output, full.names = TRUE, pattern = "\\.sumstats$")
if (length(chromosome_files) > 0) {
# Read and merge chromosome files
all_data <- do.call(rbind, lapply(chromosome_files, function(f) {
read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}))
# Write merged data
write.table(
all_data,
file = merged_file,
quote = FALSE,
sep = "\t",
row.names = FALSE,
col.names = TRUE
)
cat(paste("Processing completed for subdirectory:", subdir, "\n"))
} else {
cat(paste("No chromosome output files to merge for subdirectory:", subdir, "\n"))
}
}, error = function(e) {
cat(paste("Error processing subdirectory:", subdir, "\nError message:", e$message, "\n"))
})
}
# Get all subdirectories
subdirs <- list.dirs(base_dir, full.names = TRUE, recursive = FALSE)
# Sequentially process subdirectories
for (subdir in subdirs) {
process_subdirectory(subdir)
}
# Run with:
# nohup Rscript functional_set_userGWAS.R > functional_set_userGWAS.log 2>&1 &
library(parallel)
# Define directories
base_dir <- "/Users/charleenadams/ukbppp"
results_dir <- file.path(base_dir, "merged_munged_functional_protein_results")
# Ensure the results directory exists
if (!dir.exists(results_dir)) {
dir.create(results_dir, recursive = TRUE)
}
# Function to merge chromosome files for a protein
merge_chromosome_files <- function(protein_name, chr_files, output_dir) {
tryCatch({
# Define output file path
merged_file <- file.path(output_dir, paste0(protein_name, "_merged.sumstats.gz"))
# Merge chromosome files
all_data <- do.call(rbind, lapply(chr_files, function(f) {
read.table(f, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}))
# Save the merged file
write.table(
all_data,
file = gzfile(merged_file),
quote = FALSE,
sep = "\t",
row.names = FALSE,
col.names = TRUE
)
cat(paste("Merged protein:", protein_name, "to", merged_file, "\n"))
return(merged_file)
}, error = function(e) {
cat(paste("Error merging protein:", protein_name, "\nError message:", e$message, "\n"))
return(NULL)
})
}
# Get list of all chromosome files
chr_files <- list.files(base_dir, pattern = "discovery_chr.*\\.sumstats\\.gz$", full.names = TRUE)
# Extract unique protein identifiers
proteins <- unique(gsub("discovery_chr[0-9]+_", "", gsub("\\.sumstats\\.gz$", "", basename(chr_files))))
# Function to process a single protein
process_protein <- function(protein) {
protein_files <- grep(protein, chr_files, value = TRUE)
merge_chromosome_files(protein, protein_files, output_dir = results_dir)
}
# Run merging in parallel for 12 proteins
num_cores <- detectCores() - 1 # Use all but one core
cl <- makeCluster(num_cores)
clusterExport(cl, c("chr_files", "proteins", "merge_chromosome_files", "results_dir"))
clusterEvalQ(cl, library(parallel))
merged_files <- parLapply(cl, proteins, process_protein)
stopCluster(cl)
# Print summary
cat("Merging completed for all proteins.\n")
# nohup Rscript merged_munged_functional_protein_results.R > merged_munged_functional_protein_results.log 2>&1 &
# Load libraries
library(data.table)
# Define paths
current_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/"
hapmap3_snplist_path <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/w_hm3.snplist"
output_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"
# Create output directory if it doesn't exist
if (!dir.exists(output_dir)) {
dir.create(output_dir)
}
# Dynamically list files to process
summary_stats <- list.files(
path = current_dir,
pattern = "*.sumstats\\.gz$",
full.names = TRUE,
all.files = TRUE
)
# Check if any files are found
if (length(summary_stats) == 0) {
stop("No matching summary statistics files found in the specified directory.")
}
cat("Files to process:\n", paste(summary_stats, collapse = "\n"), "\n")
# Load HapMap3 SNP list
hm3_snps <- fread(hapmap3_snplist_path, header = TRUE)
# Function to harmonize SNPs
harmonize_snps <- function(file, hm3_snps) {
# Read summary statistics
data <- fread(file, header = TRUE)[, .(SNP, A1, A2, N, Z)]
# Filter to SNPs in HapMap3
data <- data[SNP %in% hm3_snps$SNP]
# Merge with HapMap3 to align alleles
aligned_data <- merge(data, hm3_snps, by = "SNP", suffixes = c(".data", ".hm3"))
# Flip alleles and Z-scores where necessary
mismatched <- which(aligned_data$A1.data == aligned_data$A2.hm3 &
aligned_data$A2.data == aligned_data$A1.hm3)
if (length(mismatched) > 0) {
aligned_data$Z[mismatched] <- -aligned_data$Z[mismatched]
temp <- aligned_data$A1.data[mismatched]
aligned_data$A1.data[mismatched] <- aligned_data$A2.data[mismatched]
aligned_data$A2.data[mismatched] <- temp
}
# Return cleaned data
aligned_data[, .(SNP, A1 = A1.data, A2 = A2.data, N, Z)]
}
# Process and save harmonized files
for (file in summary_stats) {
cat("Processing:", file, "\n")
# Harmonize SNPs
harmonized_data <- harmonize_snps(file, hm3_snps)
# Extract file name and create correct output name
base_name <- gsub("\\.sumstats\\.gz$", "_harmonized.sumstats.gz", basename(file))
temp_file <- file.path(output_dir, gsub("\\.gz$", "", base_name)) # Temporary uncompressed file
output_file <- file.path(output_dir, base_name) # Final compressed file
# Save to temporary uncompressed file
fwrite(harmonized_data, file = temp_file, sep = "\t")
# Compress the file
system(paste("gzip -f", shQuote(temp_file)))
cat("Saved harmonized file:", output_file, "\n")
}
cat("All matching files harmonized and saved in:", output_dir, "\n")
#!/bin/bash
# Define the directory containing the files
input_dir="/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"
output_dir="/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"
# Loop through all files ending in _munged.sumstats.gz
for file in "$input_dir"/*.sumstats.gz; do
# Extract the base filename without extension
base_name=$(basename "$file" .sumstats.gz)
# Create output file name by appending "noNA" before .gz
output_file="${output_dir}/${base_name}_noNA.sumstats.gz"
# Unzip, filter rows with complete observations, and gzip the result
gunzip -c "$file" | awk 'NR == 1 || NF == 5' | gzip > "$output_file"
echo "Processed $file -> $output_file"
done
library(data.table)
# Define the function to add a P column
add_p_column <- function(file, output_dir) {
tryCatch({
data <- fread(file)
if (!"P" %in% colnames(data)) {
if ("Z" %in% colnames(data)) {
# Calculate P-values from Z-scores
data[, P := 2 * pnorm(-abs(Z))]
# Create output file path
output_file <- file.path(output_dir, basename(file))
# Save the updated data to the new file
fwrite(data, output_file, sep = "\t")
cat("Added P column to:", output_file, "\n")
} else {
stop("Z column missing. Cannot calculate P-values in file:", file)
}
} else {
cat("P column already exists in:", file, "\n")
}
}, error = function(e) {
cat("Error processing file:", file, "\nError message:", e$message, "\n")
})
}
# Set input and output directories
input_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized"
output_dir <- file.path(input_dir, "pvals.fix")
# Create the output directory if it doesn't exist
if (!dir.exists(output_dir)) {
dir.create(output_dir)
}
# List all files matching the pattern in the input directory
files <- list.files(input_dir, pattern = "\\harmonized_noNA.sumstats\\.gz$", full.names = TRUE)
# Apply the function to all files
lapply(files, add_p_column, output_dir = output_dir)
sumstats()library(data.table)
library(GenomicSEM)
# Step 1: Define paths and variables
input_dir <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix" # Input directory containing sumstats files
output_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/summary_stats.txt" # Path to save the final summary stats
ref <- "/Users/charleenadams/temp_BI/chemokine_rgs_olink/reference.1000G.maf.0.005.txt" # Reference file for SNP variance
# List all sumstats files
traits <- list.files(
path = input_dir,
pattern = "\\harmonized_noNA.sumstats\\.gz$", # Match gzipped sumstats files
full.names = TRUE
)
# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\_merged_harmonized_noNA.sumstats\\.gz$", "", basename(traits))
# Step 2: Extract sample sizes (N) from each trait file
# Assuming the column name for sample size is "N"
get_sample_size <- function(file) {
data <- fread(file, select = "N", nrows = 1) # Only read the first row to extract N
return(data$N[1])
}
N <- sapply(traits, get_sample_size)
cat("Extracted sample sizes:\n")
# Step 3: Prepare the summary statistics with sumstats()
# Define arguments for sumstats()
se.logit <- rep(FALSE, length(traits)) # Continuous traits; SEs are not logistic
OLS <- rep(TRUE, length(traits)) # Ordinary least squares for continuous traits
linprob <- rep(FALSE, length(traits)) # Traits are not dichotomous; linear probability does not apply
# Run sumstats() to align and scale summary statistics
aligned_sumstats <- sumstats(
files = traits,
ref = ref,
trait.names = trait.names,
se.logit = se.logit,
OLS = OLS,
linprob = linprob,
N = N,
betas = NULL,
#maf.filter = 0.01,
keep.indel = FALSE,
parallel = TRUE,
cores = 12 # Adjust for available cores
)
# Save the aligned summary statistics
fwrite(aligned_sumstats, file = output_path, sep = "\t")
cat("Aligned summary statistics saved to:", output_path, "\n")
# Optional: Verify the saved file by re-loading it
summary_stats_read <- fread(output_path)
ldsc in GenomicSEM# Load required library
library(GenomicSEM)
library(data.table)
library(parallel)
library(qqman)
# Define paths and variables
traits <- list.files("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix",
pattern = "\\.sumstats\\.gz$", full.names = TRUE)
# Set sample and population prevalences to NA to bypass liability scaling
sample.prev <- rep(NA, length(traits)) # Treat traits as continuous
population.prev <- rep(NA, length(traits))
ld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"
wld <- "/Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr/"
# Optional: Name your traits based on filenames without extensions
trait.names <- gsub("\\_merged_harmonized_noNA.sumstats\\.gz$", "", basename(traits))
head(trait.names)
# Redirect console output to a log file
sink("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/ldsc_rgs_log.txt")
# Run multivariable LDSC
LDSCoutput <- ldsc(traits = traits,
sample.prev = sample.prev,
population.prev = population.prev,
ld = ld,
wld = wld,
trait.names = trait.names,
stand = TRUE)
# Stop capturing the console output
sink()
# Save the results
output_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs"
saveRDS(LDSCoutput, file = file.path(output_path, "LDSCoutput.rds"))
# Print completion message
cat("Multivariable LDSC analysis complete and save in:\n", output_path, "\n")
Here, we perform a factor analysis to ascertain from the data how many factors to use for our latent model. I explored 1-5 factors.
| Metric | What It Measures | What to Look For | Interpretation for Your Results |
|---|---|---|---|
| RMSR | Fit between observed and modeled data | Lower is better (< 0.05 = excellent) | 5-factor model = slightly better fit (RMSR = 0.04), but improvement over 4-factor model (RMSR = 0.05) is marginal. |
| Degrees of Freedom | Parsimony (simplicity of model) | Higher df = simpler model | 5-factor model = more complex (df = 16), with diminishing returns compared to the 4-factor model (df = 24). |
| Objective Function | Fit of model (ML estimation) | Lower is better | The 5-factor model (Objective = 15.72) has a slightly better fit than the 4-factor model (Objective = 17.25), but the improvement is small. |
# Load necessary libraries
library(psych) # For EFA
library(GPArotation) # For rotations
# Define the correlation matrix
# Extract the LDSC genetic correlation matrix
ldsc_results <- readRDS("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/LDSCoutput.rds")
cor_matrix <- ldsc_results$S
write.csv(cor_matrix, "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/cor_matrix.csv")
cor_matrix <- cor.smooth(cor_matrix)
eigenvalues <- eigen(cor_matrix)$values
if (any(eigenvalues <= 0)) {
cat("The smoothed matrix is still not positive definite. Smallest eigenvalue:", min(eigenvalues), "\n")
} else {
cat("Matrix is now positive definite.\n")
}
## Matrix is now positive definite.
# Run EFA to determine the number of latent factors
efa_results1 <- fa(
r = cor_matrix, # Genetic correlation matrix
nfactors = 1, # Test up to 1 factors
rotate = "oblimin", # Allow factors to correlate
fm = "ml" # Use maximum likelihood estimation
)
# Run EFA to determine the number of latent factors
efa_results2 <- fa(
r = cor_matrix, # Genetic correlation matrix
nfactors = 2, # Test up to 1 factors
rotate = "oblimin", # Allow factors to correlate
fm = "ml" # Use maximum likelihood estimation
)
# Run EFA to determine the number of latent factors
efa_results3 <- fa(
r = cor_matrix, # Genetic correlation matrix
nfactors = 3, # Test up to 3 factors
rotate = "oblimin", # Allow factors to correlate
fm = "ml" # Use maximum likelihood estimation
)
# Run EFA to determine the number of latent factors
efa_results4 <- fa(
r = cor_matrix, # Genetic correlation matrix
nfactors = 4, # Test up to 3 factors
rotate = "oblimin", # Allow factors to correlate
fm = "ml" # Use maximum likelihood estimation
)
| Protein | ML1 | ML2 | ML3 | ML4 | Assigned Factor |
|---|---|---|---|---|---|
| CXCL13 | 0.83 | -0.06 | -0.06 | 0.19 | F1 |
| FCRL2 | 0.39 | 0.20 | 0.06 | 0.16 | F1 |
| NOS3 | 0.93 | 0.05 | 0.17 | -0.04 | F1 |
| FCRL5 | 0.12 | 0.58 | 0.17 | 0.10 | F2 |
| LY9 | 0.30 | 0.61 | -0.23 | 0.03 | F2 |
| LY96 | -0.16 | 0.81 | 0.25 | 0.17 | F2 |
| SLAMF7 | 0.01 | 1.04 | -0.07 | -0.06 | F2 |
| TNFRSF13B | 0.26 | 0.37 | -0.20 | 0.25 | F2 |
| LIF | 0.32 | 0.12 | 0.73 | 0.09 | F3 |
| TXNDC15 | 0.07 | 0.55 | 0.34 | 0.03 | F3 |
| CCL19 | 0.05 | -0.04 | 0.08 | 0.83 | F4 |
| CCL21 | -0.01 | 0.03 | -0.06 | 0.89 | F4 |
# Load necessary libraries
library(psych) # For EFA
library(GPArotation) # For rotations
library(GenomicSEM)
library(semPlot)
library(GenomicSEM)
library(parallel) # Ensure parallel is loaded for detectCores()
library(data.table)
library(qqman)
library(dplyr)
# Start timer to measure runtime
start_time <- Sys.time()
# Load summary statistics (use a subset of SNPs for testing)
summary_stats <- fread("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/summary_stats.txt")
# Replace colons with underscores in column names
colnames(summary_stats) <- gsub(":", "_", colnames(summary_stats))
# Step 3: Load LDSC results
ldsc_results_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/LDSCoutput.rds"
ldsc_results <- readRDS(ldsc_results_path)
# Replace colons with underscores in the names of ldsc_results
if (!is.null(attr(ldsc_results$S, "dimnames"))) {
attr(ldsc_results$S, "dimnames")[[2]] <- gsub(":", "_", attr(ldsc_results$S, "dimnames")[[2]])
}
if (!is.null(attr(ldsc_results$S_Stand, "dimnames"))) {
attr(ldsc_results$S_Stand, "dimnames")[[2]] <- gsub(":", "_", attr(ldsc_results$S_Stand, "dimnames")[[2]])
}
# Check if names have been updated
print(attr(ldsc_results$S, "dimnames")[[2]])
print(attr(ldsc_results$S_Stand, "dimnames")[[2]])
# Step 4: Specify the model for userGWAS
ldsc_trait_names <- attr(ldsc_results$S, "dimnames")[[2]]
# Map trait names to simpler versions for compatibility with userGWAS
trait_mapping <- c(
"CCL19_Q99731_OID21030_v1_Neurology" = "CCL19",
"CCL21_O00585_OID20686_v1_Inflammation" = "CCL21",
"CXCL13_O43927_OID21024_v1_Neurology" = "CXCL13",
"FCRL2_Q96LA5_OID20639_v1_Inflammation" = "FCRL2",
"FCRL5_Q96RD9_OID21111_v1_Neurology" = "FCRL5",
"LIF_P15018_OID20824_v1_Neurology" = "LIF",
"LY9_Q9HBG7_OID20670_v1_Inflammation" = "LY9",
"LY96_Q9Y6Y9_OID20945_v1_Neurology" = "LY96",
"NOS3_P29474_OID20834_v1_Neurology" = "NOS3",
"SLAMF7_Q9NQ25_OID20602_v1_Inflammation" = "SLAMF7",
"TNFRSF13B_O14836_OID20702_v1_Inflammation" = "TNFRSF13B",
"TXNDC15_Q96J42_OID21495_v1_Oncology" = "TXNDC15"
)
# Step 5: Specify the model for userGWAS with the simplified trait names
model_string <- paste0(
"F1 =~ CXCL13_O43927_OID21024_v1_Neurology + FCRL2_Q96LA5_OID20639_v1_Inflammation + NOS3_P29474_OID20834_v1_Neurology\n",
"F2 =~ FCRL5_Q96RD9_OID21111_v1_Neurology + LY9_Q9HBG7_OID20670_v1_Inflammation + LY96_Q9Y6Y9_OID20945_v1_Neurology + SLAMF7_Q9NQ25_OID20602_v1_Inflammation + TNFRSF13B_O14836_OID20702_v1_Inflammation\n",
"F3 =~ LIF_P15018_OID20824_v1_Neurology + TXNDC15_Q96J42_OID21495_v1_Oncology\n",
"F4 =~ CCL19_Q99731_OID21030_v1_Neurology + CCL21_O00585_OID20686_v1_Inflammation\n",
"F1 ~~ F2\nF1 ~~ F3\nF1 ~~ F4\n",
"F2 ~~ F3\nF2 ~~ F4\n",
"F3 ~~ F4\n",
"F1 ~ SNP\nF2 ~ SNP\nF3 ~ SNP\nF4 ~ SNP"
)
cat("Model String:\n", model_string, "\n")
# Step 6: Redefine the SNPs after the mapping
SNPs2 <- summary_stats[, c("SNP", "CHR", "BP", "MAF", "A1", "A2",
paste0("beta.", names(trait_mapping)),
paste0("se.", names(trait_mapping)))]
# Step 7: Set up userGWAS parameters
max_cores <- 15
cat("Using", max_cores, "cores for parallel processing.\n")
CorrelatedFactors <- userGWAS(
covstruc = ldsc_results,
SNPs = summary_stats,
model = model_string,
estimation = "DWLS",
printwarn = TRUE,
sub = c("F1~SNP", "F2~SNP", "F3~SNP", "F4~SNP"),
toler = FALSE,
SNPSE = FALSE,
parallel = TRUE,
GC = "standard",
MPI = FALSE,
smooth_check = TRUE,
fix_measurement = TRUE,
Q_SNP = TRUE
)
# Step 8: Output results
result_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/CorrelatedFactors12.rds"
saveRDS(CorrelatedFactors, file = result_path)
cat("GWAS completed. Results saved to:", result_path, "\n")
# Step 9: Measure runtime
end_time <- Sys.time()
cat("Total runtime:", round(difftime(end_time, start_time, units = "mins"), 2), "minutes.\n")
# Filter the F1 results for quality
CorrelatedFactors <- readRDS("/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/CorrelatedFactors12.rds")
CF1_bad <- CorrelatedFactors[[1]][which(CorrelatedFactors[[1]]$Q_SNP_pval < 5e-8),]
write.table(CF1_bad$SNP, "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/poor_quality_snps.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
# Look at and filter F1 results for quality
# Filter by proximity as an LD proxy
# Define low-quality SNPs and the full dataset
low_quality_snps <- CF1_bad # Subset of SNPs with Q_SNP_pval < 5e-8
all_snps <- CorrelatedFactors[[1]] # The full SNP dataset
# Parameters
ld_window_size <- 500000 # Proximity threshold in base pairs
# Filter low-quality SNPs for required columns
low_quality_snps <- low_quality_snps %>% select(SNP, CHR, BP)
# Exclude SNPs in LD with low-quality SNPs based on proximity
filtered_snps <- all_snps %>%
filter(!sapply(1:n(), function(i) {
any(abs(all_snps$BP[i] - low_quality_snps$BP[low_quality_snps$CHR == all_snps$CHR[i]]) <= ld_window_size)
}))
# Save the filtered SNPs
write.table(filtered_snps, "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/F1_filtered_snps.tsv", row.names = FALSE, quote = FALSE, sep = "\t")
rs2412973 <- filtered_snps[which(filtered_snps$SNP=="rs2412973"),]
rs2412973
# Output results before filtering
significant_snps1 <- filtered_snps[filtered_snps$Pval_Estimate < 5e-08, ]
dim(significant_snps1)
low_heterogeneity1 <- filtered_snps[filtered_snps$Q_SNP_pval > 0.05, ]
dim(low_heterogeneity1)
# Proteins by Assigned Factor
F1 <- c("CXCL13", "FCRL2", "NOS3")
F2 <- c("FCRL5", "LY9", "LY96", "SLAMF7", "TNFRSF13B")
F3 <- c("LIF", "TXNDC15")
F4 <- c("CCL19", "CCL21")
# F1: CXCL13, FCRL2, NOS3
# Prepare data for plots
manhattan_data <- data.frame(
SNP = CorrelatedFactors[[1]]$SNP,
CHR = CorrelatedFactors[[1]]$CHR,
BP = CorrelatedFactors[[1]]$BP,
P = CorrelatedFactors[[1]]$Pval_Estimate
)
# Prepare data for plots
manhattan_data <- data.frame(
SNP = filtered_snps$SNP,
CHR = filtered_snps$CHR,
BP = filtered_snps$BP,
P = filtered_snps$Pval_Estimate
)
# Save Manhattan plot to file
png(
filename = "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/Manhattan_CorrelatedFactors_filtered_snps_F1.png",
width = 8,
height = 6,
units = "in",
res = 300
)
# Step 6: Display Manhattan plot
cat("Displaying Manhattan plot.\n")
manhattan(
manhattan_data,
chr = "CHR",
bp = "BP",
p = "P",
snp = "SNP",
genomewideline = -log10(5e-8),
suggestiveline = -log10(1e-5),
col = c("blue4", "skyblue"),
main = "Manhattan Plot for F1 Latent Genetic Factor (CXCL13, FCRL2, NOS3)",
ylim = c(0, 10)
)
# Close the graphics device
dev.off()
# Save QQ plot to file
png(
filename = "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/QQ_Latent_CorrelatedFactors_F1.png",
width = 8,
height = 6,
units = "in",
res = 300
)
# Step 7: Display QQ plot
cat("Displaying QQ plot.\n")
qq(manhattan_data$P, main = "QQ Plot for F1 Latent Genetic Factor (CXCL13, FCRL2, NOS3)")
# Close the graphics device
dev.off()
# End timer and display runtime
end_time <- Sys.time()
cat("Total runtime:", end_time - start_time, "\n")
# Run with:
# tmux new -s my_session (tmux my_session to end)
# caffeinate -i nohup Rscript userGWAS12.R > userGWAS12_log.txt 2>&1 &
# pstree -p 21557
# ps -p 21557 -o stat,etime
# Path to the saved Manhattan plot
manhattan_plot_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/Manhattan_CorrelatedFactors_filtered_snps_F1.png"
# Display the image
knitr::include_graphics(manhattan_plot_path)
Manhattan Plot for F1 Latent Genetic Factor
# Path to the saved QQ plot
qq_plot_path <- "/Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/QQ_Latent_CorrelatedFactors_F1.png"
# Display the image
knitr::include_graphics(qq_plot_path)
QQ Plot for F1 Latent Genetic Factor
userGWAS/Users/charleenadams/temp_BI/chemokine_rgs_olink/munged/Users/charleenadams/ukbppp/selected_set.tar files for
selected proteins to be processed.SLAMF7_Q9NQ25_OID20602_v1_Inflammation.tarCCL19_Q99731_OID21030_v1_Neurology.tarCCL21_O00585_OID20686_v1_Inflammation.tarTXNDC15_Q96J42_OID21495_v1_Oncology.tar... (additional .tar files for all 12
proteins)./Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set.tar files./merged_files:
.gz files by
protein and chromosome.protein_name_merged_rsid_added.gz._rsid_added.gz files: Processed files with RSID
mappings added._with_pval.gz files: Processed files with p-values
added.processing_log.txt./Users/charleenadams/temp_BI/chemokine_rgs_olink/mapsolink_rsid_map_mac5_info03_b0_7_chr1_patched_v2.tsv.gzolink_rsid_map_mac5_info03_b0_7_chr2_patched_v2.tsv.gz... (map files for other chromosomes)./Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix:
protein1_harmonized_noNA.sumstats.gz: Harmonized file
with no missing data.protein2_harmonized_noNA.sumstats.gz./ldsc_results_rgs:
LDSCoutput.rds: Contains genetic correlations for
traits.ldsc_rgs_log.txt: Log file documenting LDSC
analysis.cor_matrix.csv: Genetic correlation matrix extracted
from LDSC results./Users/charleenadams/temp_BI/mvGWAS_chemokines/eur_w_ld_chr: Directory containing LD scores for
European ancestry.w_hm3.snplist: HapMap 3 SNP list for filtering summary
statistics..tar
Files (Protein Datasets)/Users/charleenadams/ukbppp/selected_set/.SLAMF7_Q9NQ25_OID20602_v1_Inflammation.tar.LIF_P15018_OID20824_v1_Neurology.tar..gz Files/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/.chr1_rsid_added.gz: File with RSIDs mapped.chr1_rsid_added_with_pval.gz: File with RSIDs and
p-values..gz Files/Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/merged_files/.protein1_merged_with_pval.gz./Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/.protein1_harmonized_noNA.sumstats.gz./Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/ldsc_results_rgs/.LDSCoutput.rds: Genetic correlation results.cor_matrix.csv: Correlation matrix for exploratory
factor analysis./Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/Manhattan_CorrelatedFactors.png./Users/charleenadams/ukbppp/merged_munged_functional_protein_results/pvals.fix/harmonized/pvals.fix/QQ_Latent_CorrelatedFactors.png..tar Files/Users/charleenadams/ukbppp/./Users/charleenadams/temp_BI/chemokine_rgs_olink/processed_ukbppp_selected_set/./processed_ukbppp_selected_set/./maps/._rsid_added.gz: Files with RSIDs added._with_pval.gz: Files with p-values added./merged_files/.protein1_merged_with_pval.gz.INFO > 0.9.MAF > 0.01./pvals.fix/.protein1_harmonized_noNA.sumstats.gz..sumstats.gz files./mvGWAS_chemokines/eur_w_ld_chr/./ldsc_results_rgs/.LDSCoutput.rds.ldsc_rgs_log.txt.cor_matrix.csv.cor_matrix.csv.userGWAS Analysis/pvals.fix/harmonized/pvals.fix/.processing_log.txt: Logs processing steps.ldsc_rgs_log.txt: Logs LDSC analysis./pvals.fix/harmonized/pvals.fix/.