library(readr)
library(knitr)
library(DT)
library(dplyr)

Overview

Lipid mediators are signaling molecules derived from fatty acids that regulate inflammation and immune responses. By binding to cell-surface receptors, lipid mediators can regulate transcription and alter methylation. The goal of this study is to perform a cross-sectional analysis on patients who tested positive for the cyclic citrullinated peptide antibody (CCP) and determine whether specific CpG sites are associated with individual lipid mediators.

Methods

47 participants have both DNAm and lipid mediator information for at least 1 visit and up to 3 visits across 3 cell types. Initial quality control included adjusting for batch and implementing a range filter on our CpGs from Illumina’s EPICv1 BeadChip array.

For each of the 26 lipid mediators (LMs), linear models were fit with LM as the predictor of interest and m-value of CpG as the outcome, while adjusting for age, sex, and batch. Batch (binary: in batch 1 or other) was a covariate in the model based on PCA plots that suggested a cell-specific batch effect (2.1). Using the p-values from this probe-level analysis, probes were ranked from smallest p-value to largest. A set-enrichment analysis was completed using the methylRRA() function from the methylGSA R package. All defaults were used except for the minimum and maximum number of genes for a gene set. The minimum number of genes that a gene set must contain is 10 and the maximum number is 1000. These parameters were the same for both KEGG and GO (3.1,3.2). To correct for multiple testing the Benjamini and Hochberg False Discovery Rate (FDR) was used, and the significance cut-off was FDR < 0.05.

Results

Lipid Mediator Biological Group Table

Click Here to Download the lipid mediator pathways File

Summary Statistics

CpG and Sample Size Summary

Below is a file that contains information about sample size and number of CpGs for the dataset used in this analysis. We started with 865,859 probes and between 21 and 23 percent of probes were filtered out due to low quality. We also excluded sex chromosomes for this analysis. We used 47 samples in our model, which was the same across Bcells, Tmem, and Tnai cells. These 47 samples are in the TIPRA cohort and are all CCP positive.

Click Here to Download the Summary Statistics File

Patient Demographics

Below is a summary of several clinical variables for visit 1, which is the group we are using. The dataset is mostly non-Hispanic white females with an average age of 58.47.

Click Here to Download the Summary Statistics File

DMP Results

Number of Significant Probes from Probe Level Analysis

The tables below display the CpGs with a FDR below 0.05 for LM23, LM26, LM27, LM32, LM38, LM41, and LM47. These lipid mediators had the most probes below the FDR threshold of 0.05. Other LMs had fewer or no significant results.

LM23

LM26

LM27

LM32

LM38

LM41

LM47

DMR Results

Number of Significant DMRs from DMR analysis

The table below shows the number of significant DMR regions and the number of significant probes for each region for each lipid mediator. DMRs with a SIDAK p-value above 0.05 were removed. Some of the lipid mediators of interest from the DMP remain as top candidates, specifically LM32 and LM47, with over 10 significant DMRs. LM48, LM49, and LM50 also had 10 or more DMRs, while all other lipid mediators had 5 or less significant DMRs.

BCell

Associated Genes

The table below shows the genes associated with each DMR as well as the direction of methylation. All significant DMRs have a SIDAK p-value below 0.05. Many of the significant lipid mediators have less than 10 DMRs, but the top 5 DMRs have 10 or more.

Bcell_estimate <- read_csv("/Users/eckco/Desktop/Norris_May_Update/DMR/All_Bcell_DMR_dataset_with_estimate.csv", show_col_types = FALSE)
# Exclude the first column
Bcell_estimate <- Bcell_estimate[, -1]

# Remove extra rows and columns
Bcell_estimate <- Bcell_estimate %>%
  select(-c("CpG","fdr"))

# Keep only one row per DMR
# Keep only one row per DMR using base R unique
#Bcell_estimate <- unique(Bcell_estimate, by = "DMR_Names")
# Base R approach to keep only one row per unique DMR_Names
Bcell_estimate <- Bcell_estimate[!duplicated(Bcell_estimate$DMR_Names), ]

names(Bcell_estimate)[names(Bcell_estimate) =="sidak"] <- "SIDAK"
Bcell_estimate$SIDAK <- formatC(Bcell_estimate$SIDAK, format = "e", digits = 4)

# Rename Total column
Bcell_estimate <- Bcell_estimate %>%
  rename(numProbes = total)


full_names <- read.csv("/Users/eckco/Desktop/Norris_May_Update/finalLMlist_wParentFAInfo_updatedEnzymeInfo_Jan2024.csv")
# Prepare the full_names dataset by selecting Analyte and choosing between Pathway and Pathway2
full_names_pathway <- full_names %>%
  mutate(pathway = ifelse(is.na(Pathway) | Pathway == "", Pathway2, Pathway)) %>%
  select(Analyte, pathway)

# Merge in the pathway data using Analyte = LM
Bcell_estimate <- Bcell_estimate %>%
  left_join(full_names_pathway, by = c("LM" = "Analyte"))


names(Bcell_estimate)[names(Bcell_estimate) == "pathway"] <- "Pathway"

Bcell_estimate <- Bcell_estimate[, c(5,11,1,2,3,4,6,8,9,10)]
datatable(Bcell_estimate,rownames=FALSE, options = list(pageLength = 10), 
          caption = "Click to explore the full table")

Number of unique genes for Bcells:

Set-enrichment Analysis

KEGG Database

KEGG Summary Table

The summary table below shows the number of significant pathways for each lipid mediator when using the KEGG database. LM 31, 36, and 48 had the most significant pathways.

KEGG Upset Plot

Below is an upset plot comparing the intersecting significant pathways with the KEGG results. 5 pathways are shared between LM31, LM36, and LM48.

Click Here to Download the Upset Plot File

Top Pathways for each Lipid Mediator using KEGG

Below are bar charts colored by adjusted p-value, plotting the top significant pathways for each lipid mediator. Not all lipid mediators had significant pathways and a maximum of 10 pathways are plotted on these charts.

View KEGG Bar Plots

Click here to download the Bar Plots file

Table of top Significant Pathways for KEGG

The table below displays the top 20 pathways with the lowest padj values using KEGG for each LM.

# Load the data
Bcell_SIDAK <- read_csv("/Users/eckco/Desktop/Norris_May_Update/Bcell_Merged_KEGG_Results.csv", show_col_types = FALSE)

# Add gene count column
Bcell_SIDAK <- Bcell_SIDAK %>%
  mutate(core_enrichment_genes = ifelse(!is.na(core_enrichment),
                                        lengths(strsplit(core_enrichment, "/")),
                                        0))

# Reorder and rename
Bcell_SIDAK <- Bcell_SIDAK[, c(10,2,3,4,5,6,7,11)]
names(Bcell_SIDAK)[names(Bcell_SIDAK) == "core_enrichment_genes"] <- "Number_Of_Core_Enrichment_Genes"

# Convert padj to numeric
Bcell_SIDAK$padj_numeric <- as.numeric(Bcell_SIDAK$padj)

# Top 20 per LM based on lowest padj
Bcell_SIDAK_top20 <- Bcell_SIDAK %>%
  group_by(LM) %>%
  arrange(padj_numeric, .by_group = TRUE) %>%
  slice_head(n = 20) %>%
  ungroup()

# Format numbers for display
Bcell_SIDAK_top20$enrichmentScore <- formatC(Bcell_SIDAK_top20$enrichmentScore, format = "e", digits = 4)
Bcell_SIDAK_top20$NES <- formatC(Bcell_SIDAK_top20$NES, format = "e", digits = 4)
Bcell_SIDAK_top20$pvalue <- formatC(Bcell_SIDAK_top20$pvalue, format = "e", digits = 4)
Bcell_SIDAK_top20$padj <- formatC(Bcell_SIDAK_top20$padj_numeric, format = "e", digits = 4)

# Drop helper column
Bcell_SIDAK_top20 <- Bcell_SIDAK_top20 %>% select(-padj_numeric)


full_names <- read.csv("/Users/eckco/Desktop/Norris_May_Update/finalLMlist_wParentFAInfo_updatedEnzymeInfo_Jan2024.csv")
# Prepare the full_names dataset by selecting Analyte and choosing between Pathway and Pathway2
full_names_pathway <- full_names %>%
  mutate(pathway = ifelse(is.na(Pathway) | Pathway == "", Pathway2, Pathway)) %>%
  select(Analyte, pathway)

# Merge in the pathway data using Analyte = LM
Bcell_SIDAK_top20 <-Bcell_SIDAK_top20  %>%
  left_join(full_names_pathway, by = c("LM" = "Analyte"))


names(Bcell_SIDAK_top20)[names(Bcell_SIDAK_top20) == "pathway"] <- "Pathway"


Bcell_SIDAK_top20 <- Bcell_SIDAK_top20[, c(1,9,2,3,4,5,6,7,8)]

# Display interactive table
datatable(Bcell_SIDAK_top20, rownames = FALSE,
          options = list(pageLength = 20))

GO Onlogy Profiles

GO Summary Table

The summary table below shows the number of significant pathways for each lipid mediator when using GO ontology profiles.

GO Upset Plot

The summary table below shows the number of intersecting significant pathways for each lipid mediator when using GO ontology terms.

Click Here to Download the Bar Plots File

Top Pathways for each Lipid Mediator using GO

Below are the significant pathways using GO ontology profiles. 20 out of the 26 lipid mediators had at least 1 significant pathway using GO.

Click Here to Download the Bar Plots File

Table of top Significant Pathways for GO

The table below displays the top 20 pathways with the lowest padj values using GO for each LM.

# Load the data
Bcell_SIDAK <- read_csv("/Users/eckco/Desktop/Norris_May_Update/Merged_GO_Results.csv", show_col_types = FALSE)

# Add gene count column
Bcell_SIDAK <- Bcell_SIDAK %>%
  mutate(core_enrichment_genes = ifelse(!is.na(core_enrichment),
                                        lengths(strsplit(core_enrichment, "/")),
                                        0))

# Reorder and rename
Bcell_SIDAK <- Bcell_SIDAK[, c(10,2,3,4,5,6,7,11)]
names(Bcell_SIDAK)[names(Bcell_SIDAK) == "core_enrichment_genes"] <- "Number_Of_Core_Enrichment_Genes"

# Convert padj to numeric
Bcell_SIDAK$padj_numeric <- as.numeric(Bcell_SIDAK$padj)

# Top 20 per LM based on lowest padj
Bcell_SIDAK_top20 <- Bcell_SIDAK %>%
  group_by(LM) %>%
  arrange(padj_numeric, .by_group = TRUE) %>%
  slice_head(n = 20) %>%
  ungroup()

# Format numbers for display
Bcell_SIDAK_top20$enrichmentScore <- formatC(Bcell_SIDAK_top20$enrichmentScore, format = "e", digits = 4)
Bcell_SIDAK_top20$NES <- formatC(Bcell_SIDAK_top20$NES, format = "e", digits = 4)
Bcell_SIDAK_top20$pvalue <- formatC(Bcell_SIDAK_top20$pvalue, format = "e", digits = 4)
Bcell_SIDAK_top20$padj <- formatC(Bcell_SIDAK_top20$padj_numeric, format = "e", digits = 4)

# Drop helper column
Bcell_SIDAK_top20 <- Bcell_SIDAK_top20 %>% select(-padj_numeric)

full_names <- read.csv("/Users/eckco/Desktop/Norris_May_Update/finalLMlist_wParentFAInfo_updatedEnzymeInfo_Jan2024.csv")
# Prepare the full_names dataset by selecting Analyte and choosing between Pathway and Pathway2
full_names_pathway <- full_names %>%
  mutate(pathway = ifelse(is.na(Pathway) | Pathway == "", Pathway2, Pathway)) %>%
  select(Analyte, pathway)

# Merge in the pathway data using Analyte = LM
Bcell_SIDAK_top20 <-Bcell_SIDAK_top20  %>%
  left_join(full_names_pathway, by = c("LM" = "Analyte"))


names(Bcell_SIDAK_top20)[names(Bcell_SIDAK_top20) == "pathway"] <- "Pathway"


Bcell_SIDAK_top20 <- Bcell_SIDAK_top20[, c(1,9,2,3,4,5,6,7,8)]

# Display interactive table
datatable(Bcell_SIDAK_top20, rownames = FALSE,
          options = list(pageLength = 20))

Appendix

Code

1.1 Code for Linear Model and Range Filter

library(pbapply)
library(dplyr)
library(parallelly)
library(progressr)
library(parallel)
library(readxl)
#install.packages("BiocManager")
#BiocManager::install("missMethyl")
#library(RUVSeq)
#library(missMethyl)
#library(ruv)
library(tibble)

# Load in Data
DNAm_matrix <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_matrix.rds")
LM_matrix_wPhenotype <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/LM_matrix_wPhenotype.rds")
LM_matrix_wPhenotype_df <- as.data.frame(LM_matrix_wPhenotype)
LM_matrix_wPhenotype_ADD_RA <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/LM_matrix_wPhenotype_ADD_RA.rds")
LM_matrix_wPhenotype_ADD_RA  <- as.data.frame(LM_matrix_wPhenotype_ADD_RA)
print("RA_Added")
adsf <- as.data.frame(LM_matrix_wPhenotype_df$Labid)
#write.csv(adsf, "Labids.csv")
lipid_dat_T1 <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/lipid_dat_T1.rds")
DNAm_Matrix_M <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_M.rds")
DNAm_Matrix_Beta <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_Beta.rds")
DNAm_Matrix_B <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_B.rds")
DNAm_Matrix_Tmem <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_Tmem.rds")
DNAm_Matrix_Tnai <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_Tnai.rds")
DNAm_Matrix_B_Betavalues <-  readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_B_Beta.rds")
DNAm_Matrix_TMem_Betavalues <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_Tmem_Beta.rds")
DNAm_Matrix_Tnai_Betavalues <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_Matrix_Tnai_Beta.rds")
cross_Reactive <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/probesCrossReactive.RDS")
failed_QC <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/probesFailedQC.RDS")
OverSNP <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/probesOverSNPs.RDS")
SexChrom <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/probesSexChrom.RDS")

print("Data loaded")

head(LM_matrix_wPhenotype)
# List of desired CpG values 
CpG <- colnames(DNAm_matrix)
head(CpG)

# List of desired Lipid Mediators
LM <- colnames(lipid_dat_T1)
head(LM)
class(LM)
dim(LM_matrix_wPhenotype)
head(LM_matrix_wPhenotype)


# Create the input data to test
LMnames <- colnames(lipid_dat_T1)
CpGnames <- colnames(DNAm_matrix)
toTest <- expand.grid(LM = LMnames, CpG = CpGnames)


# Function to get probe info for beta values (CpG sites)
getProbeInfo <- function(beta_values) {
  cpg.min <- min(beta_values, na.rm = TRUE)  # Use na.rm = TRUE to avoid NA issues
  cpg.max <- max(beta_values, na.rm = TRUE)
  cpg.range <- cpg.max - cpg.min
  cpg.var <- var(beta_values, na.rm = TRUE)
  info <- c(min = cpg.min, max = cpg.max, range = cpg.range, var = cpg.var)
  return(info)
}

### Assuming your matrix is CpG as rows and samples as columns ###
# Transform Data by transposing
DNAm_Matrix_B_Betavalues <- t(DNAm_Matrix_B_Betavalues)
DNAm_Matrix_Tnai_Betavalues <- t(DNAm_Matrix_Tnai_Betavalues)
DNAm_Matrix_TMem_Betavalues  <- t(DNAm_Matrix_TMem_Betavalues)

dim(DNAm_Matrix_B_Betavalues)
dim(DNAm_Matrix_Tnai_Betavalues)
dim(DNAm_Matrix_TMem_Betavalues)

# Return a matrix with proper rownames
CpG.B.info <- t(sapply(1:nrow(DNAm_Matrix_B_Betavalues), function(i) getProbeInfo(DNAm_Matrix_B_Betavalues[i, ])))
rownames(CpG.B.info) <- rownames(DNAm_Matrix_B_Betavalues)
CpG.B.info <- as.data.frame(CpG.B.info)

CpG.Tnai.info <- t(sapply(1:nrow(DNAm_Matrix_Tnai_Betavalues), function(i) getProbeInfo(DNAm_Matrix_Tnai_Betavalues[i, ])))
rownames(CpG.Tnai.info) <- rownames(DNAm_Matrix_Tnai_Betavalues)
CpG.Tnai.info <- as.data.frame(CpG.Tnai.info)

CpG.Tmem.info <- t(sapply(1:nrow(DNAm_Matrix_TMem_Betavalues), function(i) getProbeInfo(DNAm_Matrix_TMem_Betavalues[i, ])))
rownames(CpG.Tmem.info) <- rownames(DNAm_Matrix_TMem_Betavalues)
CpG.Tmem.info <- as.data.frame(CpG.Tmem.info)

# Filter the data based on range
Bcell.toTest <- rownames(CpG.B.info[-which(CpG.B.info$range <0.05),])
Tnai.toTest <- rownames(CpG.Tnai.info[-which(CpG.Tnai.info$range <0.05),])
Tmem.toTest <- rownames(CpG.Tmem.info[-which(CpG.Tmem.info$range <0.05),])

Bcell.toTest.filterApplied <- toTest[which(toTest$CpG %in% Bcell.toTest),]
Tnai.toTest.filterApplied <- toTest[which(toTest$CpG %in% Tnai.toTest),]
Tmem.toTest.filterApplied <- toTest[which(toTest$CpG %in% Tmem.toTest),]

dim(Bcell.toTest.filterApplied)
dim(Tnai.toTest.filterApplied)
dim(Tmem.toTest.filterApplied)

# Ensure Quality control was carried out

# Convert Applied filter to dataframe
Bcell.toTest.filterApplied_df <- as.data.frame(Bcell.toTest.filterApplied)
Tnai.toTest.filterApplied_df <- as.data.frame(Tnai.toTest.filterApplied)
Tmem.toTest.filterApplied_df <- as.data.frame(Tmem.toTest.filterApplied)

# Convert lists to vectors
SexChrom<- as.vector(unlist(SexChrom))
cross_Reactive<- as.vector(unlist(cross_Reactive))
failed_QC<- as.vector(unlist(failed_QC))
OverSNP<- as.vector(unlist(OverSNP))

#Test for SexChromosome
Bcell.toTest.filterApplied_QC_Sex <- Bcell.toTest.filterApplied_df[Bcell.toTest.filterApplied_df$CpG %in% SexChrom,]
Tnai.toTest.filterApplied_QC_Sex <- Tnai.toTest.filterApplied_df[Tnai.toTest.filterApplied_df$CpG %in% SexChrom,]
Tmem.toTest.filterApplied_QC_Sex <- Tmem.toTest.filterApplied_df[Tmem.toTest.filterApplied_df$CpG %in% SexChrom,]

dim(Bcell.toTest.filterApplied_QC_Sex)
dim(Tnai.toTest.filterApplied_QC_Sex)
dim(Tmem.toTest.filterApplied_QC_Sex)

# Test for cross_reactive probes
Bcell.toTest.filterApplied_QC_Reactive <- Bcell.toTest.filterApplied_df[Bcell.toTest.filterApplied_df$CpG %in% cross_Reactive,]
Tnai.toTest.filterApplied_QC_Reactive <- Tnai.toTest.filterApplied_df[Tnai.toTest.filterApplied_df$CpG %in% cross_Reactive,]
Tmem.toTest.filterApplied_QC_Reactive <- Tmem.toTest.filterApplied_df[Tmem.toTest.filterApplied_df$CpG %in% cross_Reactive,]

dim(Bcell.toTest.filterApplied_QC_Reactive)
dim(Tnai.toTest.filterApplied_QC_Reactive)
dim(Tmem.toTest.filterApplied_QC_Reactive)

#Test for probes that failed QC
Bcell.toTest.filterApplied_QC_failed <- Bcell.toTest.filterApplied_df[Bcell.toTest.filterApplied_df$CpG %in% failed_QC,]
Tnai.toTest.filterApplied_QC_failed<- Tnai.toTest.filterApplied_df[Tnai.toTest.filterApplied_df$CpG %in% failed_QC,]
Tmem.toTest.filterApplied_QC_failed <- Tmem.toTest.filterApplied_df[Tmem.toTest.filterApplied_df$CpG %in% failed_QC,]

dim(Bcell.toTest.filterApplied_QC_failed)
dim(Tnai.toTest.filterApplied_QC_failed)
dim(Tmem.toTest.filterApplied_QC_failed)

#Test for probes with OverSNP
Bcell.toTest.filterApplied_QC_OverSNP <- Bcell.toTest.filterApplied_df[Bcell.toTest.filterApplied_df$CpG %in% OverSNP,]
Tnai.toTest.filterApplied_QC_OverSNP <- Tnai.toTest.filterApplied_df[Tnai.toTest.filterApplied_df$CpG %in% OverSNP,]
Tmem.toTest.filterApplied_QC_OverSNP <- Tmem.toTest.filterApplied_df[Tmem.toTest.filterApplied_df$CpG %in% OverSNP,]

dim(Bcell.toTest.filterApplied_QC_OverSNP)

dim(Tnai.toTest.filterApplied_QC_OverSNP)
dim(Tmem.toTest.filterApplied_QC_OverSNP)

#Remove CpG sites that are suppsosed to be removed
Bcell.toTest.filterApplied_post_QC <- rbind(Bcell.toTest.filterApplied_QC_OverSNP,
                                            Bcell.toTest.filterApplied_QC_failed,
                                            Bcell.toTest.filterApplied_QC_Reactive,
                                            Bcell.toTest.filterApplied_QC_Sex)
Bcell.toTest.filterApplied_post_QC <- unique(Bcell.toTest.filterApplied_post_QC)
head(Bcell.toTest.filterApplied_post_QC)
dim(Bcell.toTest.filterApplied_post_QC)
Tmem.toTest.filterApplied_post_QC <- rbind(Tmem.toTest.filterApplied_QC_OverSNP,
                                           Tmem.toTest.filterApplied_QC_failed,
                                           Tmem.toTest.filterApplied_QC_Reactive,
                                           Tmem.toTest.filterApplied_QC_Sex) 
Tmem.toTest.filterApplied_post_QC <- unique(Tmem.toTest.filterApplied_post_QC)
head(Tmem.toTest.filterApplied_post_QC)
dim(Tmem.toTest.filterApplied_post_QC)

Tnai.toTest.filterApplied_post_QC <- rbind(Tnai.toTest.filterApplied_QC_OverSNP,
                                           Tnai.toTest.filterApplied_QC_failed,
                                           Tnai.toTest.filterApplied_QC_Reactive,
                                           Tnai.toTest.filterApplied_QC_Sex)
Tnai.toTest.filterApplied_post_QC <- unique(Tnai.toTest.filterApplied_post_QC)
head(Tnai.toTest.filterApplied_post_QC)
dim(Tnai.toTest.filterApplied_post_QC)

#Check to make sure probes are removed
dim(Bcell.toTest.filterApplied_post_QC)
dim(Tmem.toTest.filterApplied_post_QC)
dim(Tnai.toTest.filterApplied_post_QC)
#Ensure that the beta filter and quality control filter are being used

# Remove CpG sites for Bcell
Bcell.toTest.filterApplied <- Bcell.toTest.filterApplied %>%
  anti_join(Bcell.toTest.filterApplied_post_QC, by = "CpG")
#write.csv(Bcell_filtered_final, "Bcell_filtered_final.csv")

# Remove CpG sites for Tmem
Tmem.toTest.filterApplied <- Tmem.toTest.filterApplied %>%
  anti_join(Tmem.toTest.filterApplied_post_QC, by = "CpG")
#write.csv(Tmem_filtered_final, "Tmem_filtered_final.csv")

# Remove CpG sites for Tnai
Tnai.toTest.filterApplied <- Tnai.toTest.filterApplied %>%
  anti_join(Tnai.toTest.filterApplied_post_QC, by = "CpG")
#write.csv(Tnai_filtered_final, "Tnai_filtered_final.csv")

BCell_ProbesRemaining <- length(rownames(Bcell.toTest.filterApplied ))/26
Tmem_ProbesRemaining <- length(rownames(Tmem.toTest.filterApplied))/26
Tnai_ProbesRemaining <- length(rownames(Tnai.toTest.filterApplied))/26
BCell_ProbesRemaining
Tmem_ProbesRemaining
Tnai_ProbesRemaining

print("Range Filter Complete")
########################################################
# Define the crossSection_mod function with checks for variable length and handling of missing data
#CpG <- toTest[1,2]
#LM <- toTest[1,1]
#head(toTest)
#CpG %in% colnames(DNAm_matrix)
#dim(DNAm_matrix)
#colnames(DNAm_matrix)

# Set up parallel processing to use 8 cores or the number available
numCores <- min(8, parallelly::availableCores())
dat.DNAm <- t(DNAm_matrix)


cellType <- "BCell"
LM <- as.vector(LM)

#Define your covariates for the model
#covariates <- c("Age", "Gender", "RAConverter", "Batch")
covariates <- c("Age", "Gender", "Batch")
#covariates <- c("Age", "Gender")
dat.LM <- as.data.frame(LM_matrix_wPhenotype)
dat.meta <- read_excel("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/TIP-RAFinalFullDatasetJune2020_Revised_March15_2024.xlsx")
#Extract Labids from metadata
labid_vect <- dat.LM$Labid
dat.meta <- dat.meta[dat.meta$Labid %in% labid_vect,]
dim(dat.meta)

# Convert Gender to Binary
dat.meta$Gender
dat.meta$Gender <- factor(dat.meta$Gender, levels = c("Male", "Female"))
dat.meta$Gender <- as.numeric(dat.meta$Gender == "Male")
head(dat.LM)
head(dat.DNAm)
class(dat.LM)
class(dat.DNAm)


#Add RAConverter and Batch column

table(dat.meta$RAConverter)
dat.meta$RAConverter <- factor(dat.meta$RAConverter, levels = c("Yes", "No"))
dat.meta$RAConverter <- as.numeric(dat.meta$RAConverter == "Yes")
head(dat.meta$RAConverter)


LM_matrix_wPhenotype_ADD_RA <- LM_matrix_wPhenotype_ADD_RA[, c("Labid","Batch")]
head(LM_matrix_wPhenotype_ADD_RA)
head(dat.meta)

dat.meta$Batch <- LM_matrix_wPhenotype_ADD_RA$Batch

#colnames(dat.meta)
dim(dat.LM)
table(dat.meta$Batch)
dim(LM_matrix_wPhenotype)


print("Function Starting")
colnames(dat.meta)
dat.meta$Batch
crossSection_mod <- function(CpG, dat.DNAm, LM, dat.LM, covariates, dat.meta, cellType){
  #LM <- "LM31"
  #CpG <- "cg14361672"
  ### Create a dataset for Analysis ###
  LM <- as.vector(LM)
  dat.LM <- as.data.frame(LM_matrix_wPhenotype)
  dat.LM <- as.data.frame(dat.LM)
  
  #dat.LM <- dat.LM[dat.LM$Labid != 116769, ]
  dat.DNAm <- dat.DNAm
  
  # Remove column named "116769"
  #dat.DNAm <- dat.DNAm[, colnames(dat.DNAm) != "116769"]
  dat.meta <- dat.meta
  #dat.meta <- dat.meta[dat.meta$Labid != 116769, ]
  
  # Grab lipid mediator (Labids are in the Labid columns and colnames are lipid mediators)
  tmp.LM <- data.frame(Labid = dat.LM$Labid, LM = dat.LM[, which(colnames(dat.LM) == LM)])
  
  #Correct 
  tmp.DNAm <- data.frame(Labid = rownames(dat.DNAm), CpG = dat.DNAm[, CpG])
  
  # Test 1 CpG
  #tmp.DNAm <- data.frame(Labid = colnames(dat.DNAm), CpG = as.numeric(dat.DNAm[CpG, ]))
  head(tmp.DNAm)
  
  
  #Grab your metadata (Labids are in labid column and colnames are variables)
  tmp.meta <- data.frame(Labid = dat.meta$Labid, ID = dat.meta$ID, dat.meta[, which(colnames(dat.meta) %in% covariates)])
  #tmp.meta <- dat.meta[, c("Labid", "ID", "Batch", covariates)]
 
  colnames(tmp.meta)[-c(1:2)] <- colnames(dat.meta)[which(colnames(dat.meta) %in% covariates)]
  head(tmp.meta)
  #tmp.meta <- dat.meta[, c("Labid", "ID", "Batch", covariates)]
  
  
  # Merge metadata and LM data for analysis by Labid
  tmp <- merge(tmp.meta, tmp.LM, by = "Labid")
  
  # Add in DNAm Data
  tmp2 <- merge(tmp, tmp.DNAm, by = "Labid")
  
  # Remove any duplicated individuals
  tmp3 <- tmp2[!duplicated(tmp2$ID), ]
  
  #Remove Outliers
  #tmp3 <- tmp2[tmp2$Labid != 116769, ]
  dat.meta$Gender <- as.numeric(dat.meta$Gender)
  dat.meta$Age <- as.numeric(dat.meta$Age)
  
  # Check for NA values
  if (any(is.na(tmp3$LM)) | any(is.na(tmp3$Age)) | any(is.na(tmp3$Gender)) | any(is.na(tmp3$CpG))) {
    stop("Missing data found in LM, Age, Gender, or CpG after merging.")
  }
  
  ### Statistical modeling ###
  # Define model and covariates
  if (is.null(covariates)) {
    formula <- "CpG ~ as.numeric(LM)"
  } else {
    formula <- paste0("CpG ~ as.numeric(LM) +", paste(covariates, collapse = " + "))
  }  
  # Run Linear Model
  mod <- lm(as.formula(formula), data = tmp3)
  colnames(mod)
  head(mod)
  summary(mod)
  ### Output creation ###
  if ("try-error" %in% class(mod)) {
    NAvector <- rep(NA, (length(covariates) + 2) * 4) # Updated for T-stat column
    names(NAvector) <- c(
      paste0(c("Intercept", "LM", covariates), ".betaEstimate"),
      paste0(c("Intercept", "LM", covariates), ".pval"),
      paste0(c("Intercept", "LM", covariates), ".stdError"),
      paste0(c("Intercept", "LM", covariates), ".tStat")
    )
    want <- data.frame(cellType = cellType, CpG = CpG, LM = LM, formula = formula, nObs = NA, AIC = NA, 
                       meanLeverage = NA, maxLeverage = NA, Outliers = NA, t(as.data.frame(NAvector)))
  } else {
    nObs <- nobs(mod)
    AIC <- extractAIC(mod)[2]
    Leverage <- hatvalues(mod)
    meanLeverage <- mean(Leverage, na.rm = TRUE)
    maxLeverage <- max(Leverage, na.rm = TRUE)
    
    # Identify outliers above leverage threshold
    outlierIndices <- which(Leverage > 0.25)
    OutlierLabIDs <- paste0(tmp3$Labid[outlierIndices], collapse = ",")
    OutlierIDs <- paste0(tmp3$ID[outlierIndices], collapse = ",")
    
    coef <- summary(mod)$coefficients[, 1]
    names(coef) <- paste0(names(coef), ".betaEstimate")
    pvals <- summary(mod)$coefficients[, 4]
    names(pvals) <- paste0(names(pvals), ".pval")
    std.errors <- summary(mod)$coefficients[, 2]
    names(std.errors) <- paste0(names(std.errors), ".stdError")
    tStats <- summary(mod)$coefficients[, 3]
    names(tStats) <- paste0(names(tStats), ".tStat")
    
    want <- data.frame(cellType, CpG, LM, formula, nObs, AIC, meanLeverage, maxLeverage, OutlierLabIDs, OutlierIDs,
                       t(as.data.frame(coef)), t(as.data.frame(pvals)), 
                       t(as.data.frame(std.errors)), t(as.data.frame(tStats)))
    colnames(want) <- gsub("X.Intercept..", "Intercept.", fixed = TRUE, colnames(want))
  }
  
  return(want)
  #return(mod)
}


##########################

# Initialize the CSV file by writing the column names
initial_columns <- c(
  "cellType", "CpG", "LM", "formula", "nObs", "AIC", "meanLeverage","maxLeverage","OutlierLabIDs", "OutlierIDs",
  paste0(c("Intercept", "LM", covariates), ".betaEstimate"),
  paste0(c("Intercept", "LM", covariates), ".pval"),
  paste0(c("Intercept", "LM", covariates), ".stdError"),
  paste0(c("Intercept", "LM", covariates), ".tStat")
)

write.csv(data.frame(matrix(ncol = length(initial_columns), nrow = 0, dimnames = list(NULL, initial_columns))), 
          "/scratch/alpine/ceck@xsede.org/Norris_Model_Data_Storage/Batch_Variable_Models/Bcell/new_model_results/BCell_After_Batch/Bcell_Batch_Variable.csv", row.names = FALSE)

#Bcell_crossSectional_results <- mclapply(1:100000, function(i) {
Bcell_crossSectional_results <- mclapply(1:nrow(Bcell.toTest.filterApplied), function(i) {
  # Print the iteration count to indicate progress
  #print(paste("Starting iteration:", i, "of", nrow(Bcell.toTest.filterApplied)))
  result <- crossSection_mod(
    CpG = as.character(Bcell.toTest.filterApplied$CpG[i]),
    dat.DNAm = DNAm_Matrix_B, 
    LM = Bcell.toTest.filterApplied$LM[i],
    dat.LM = LM_matrix_wPhenotype, 
    covariates = covariates,      # Include the covariates argument
    dat.meta = dat.meta,          # Include metadata
    cellType = "Bcell"            # Specify cell type
  )
  
  # Append the result to the CSV after each iteration
  if (!is.null(result)) {
    write.table(result, file = "/scratch/alpine/ceck@xsede.org/Norris_Model_Data_Storage/Batch_Variable_Models/Bcell/new_model_results/BCell_After_Batch/Bcell_Batch_Variable.csv", 
                append = TRUE, sep = ",", col.names = FALSE, row.names = FALSE)
  }
  
  return(result)
}, mc.cores = numCores)

#print(mod)
#summary(mod) 
#par(mfrow = c(2,2))

print(Bcell_crossSectional_results)
print("Script Complete")

#test <- read.csv("/scratch/alpine/ceck@xsede.org/Norris_Model_Data_Storage/Batch_Variable_Models/Bcell/new_model_results/BCell_After_Batch/Bcell_Batch_Variable.csv")
#colnames(test)
#length(colnames(test))
#head(test)
#summary(test$LM.pval)
#sig <- test[test$LM.pval < 0.05,]
#dim(sig)
#sig$LM.pval

1.2 Code for P-value Histograms

# Load data
dat <- read.csv("../Bcell_Batch_Variable.csv")

# Create directory if it doesn't exist
if (!dir.exists("histograms")) {
  dir.create("histograms")
}

# Replace 'LM' with the correct column name if needed
unique_LMs <- unique(dat$LM)

for (lm in unique_LMs) {
  subset_data <- subset(dat, LM == lm)
  safe_lm <- gsub("[^A-Za-z0-9_]", "_", lm)
  pdf_filename <- paste0("histograms/", safe_lm, "_pval_hist.pdf")
  
  pdf(pdf_filename)
  hist(subset_data$LM.pval,
       main = paste("P-Value Histogram for", lm),
       xlab = "P-Value",
       col = "steelblue", border = "white")
  dev.off()
}

1.3 Code for generating PCA Plots

library(dplyr)
library(stringr)

dat <- readRDS("/scratch/alpine/ceck@xsede.org/Norris_Model_Data_Storage/harman_illumina_M.RDS")
link <- read.table("/scratch/alpine/ceck@xsede.org/Norris_Model_Data_Storage/07.id_conv")
head(link)
tail(link)
tail(link)

# Assuming `link` is your original data frame
link_IDs <- link %>%
  # Extract the part between the first and second "/"
  mutate(V2 = sub("^([^/]*/)([^/]*)/.*$", "\\2", V1)) %>%
  # Extract patterns (_Tmem_, _Tnai_, _B_) and remove underscores
  mutate(
    V3 = str_extract(V1, "_Tmem_|Tnai_|_B_") %>%
      gsub("_", "", .)
  )

# View the result
head(link_IDs)
head(colnames(dat))


# 1. Check the data
class(dat)  # Should be a matrix or data.frame
head(dat)   # Inspect the first few rows
colnames(dat)  # Check the column names (sample labels)
rownames(dat)  # Check the row names (CpG sites)
dim(dat)


# Update column names
updated_colnames <- sapply(colnames(dat), function(col) {
  match_index <- match(col, link_IDs$V2)  # Find index of matching pattern
  if (!is.na(match_index)) {
    paste(col, link_IDs$V3[match_index], sep = "_")  # Append V3 value
  } else {
    col  # Keep original name if no match
  }
})

# Assign updated column names back to the data
colnames(dat) <- updated_colnames

# Check the result
head(colnames(dat))

head(dat)

# 2. Ensure data is numeric and in matrix form
dat_matrix <- as.matrix(dat)
dat_matrix <- t(dat)

# 3. Optional: Standardize the data
dat_standardized <- scale(dat_matrix)

# 4. Perform PCA
pca_result <- prcomp(dat_standardized, center = TRUE, scale. = TRUE)


# Extract cell type information from column names
cell_types <- ifelse(grepl("_B", rownames(dat_standardized)), "B",
                     ifelse(grepl("_Tmem", rownames(dat_standardized)), "Tmem",
                            ifelse(grepl("_Tnai", rownames(dat_standardized)), "Tnai", "Other")))

# Assign colors for each cell type
cell_type_colors <- c("B" = "blue", "Tmem" = "green", "Tnai" = "red", "Other" = "gray")

point_colors <- cell_type_colors[cell_types]
png("AllCellsColored_pca_plot_Dim3.png", width = 800, height = 600)
# Plot the first two principal components, colored by cell type
plot(pca_result$x[, 1], pca_result$x[, 3],
     xlab = "PC1", ylab = "PC2",
     main = "PCA of DNAm M-values by Cell Type",
     col = point_colors, pch = 19)
# Add a legend
legend("topright", legend = names(cell_type_colors), col = cell_type_colors, pch = 19, title = "Cell Type")
dev.off()
library(factoextra)
#Visualize PCAs
fviz_eig(pca_result, ncp=10)+
  geom_hline(color="red", yintercept = 5, linetype = "longdash")+
  labs(title = "Baseline Scree Plot")



# Calculate variance explained
explained_variance <- pca_result$sdev^2
explained_variance_ratio <- explained_variance / sum(explained_variance)

# Cumulative variance explained
cumulative_variance <- cumsum(explained_variance_ratio)

# Load ggplot2
library(ggplot2)

# Create a data frame for plotting
scree_data <- data.frame(
  PC = seq_along(explained_variance_ratio),
  Variance_Explained = explained_variance_ratio,
  Cumulative_Variance = cumulative_variance
)

# Plot the scree plot
ggplot(scree_data, aes(x = PC)) +
  geom_line(aes(y = Variance_Explained), color = "blue") +
  geom_point(aes(y = Variance_Explained), color = "blue") +
  geom_line(aes(y = Cumulative_Variance), color = "red") +
  geom_point(aes(y = Cumulative_Variance), color = "red") +
  labs(title = "Scree Plot", x = "Principal Component", y = "Proportion of Variance Explained") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(breaks = 1:length(explained_variance_ratio)) +
  annotate("text", x = 3, y = explained_variance_ratio[3], label = "Elbow?", hjust = -0.2)

# 6. Visualize PCA
# A simple 2D plot of the first two principal components
#plot <- plot(pca_result$x[, 1], pca_result$x[, 2],
#            xlab = "PC1", ylab = "PC2",
#            main = "PCA of DNAm M-values")
#print(plot)

########################################

# Separate the data by cell type
pca_b <- pca_result$x[, cell_types == "B"]
pca_tmem <- pca_result$x[, cell_types == "Tmem"]
pca_tnai <- pca_result$x[, cell_types == "Tnai"]

# Plot for B cells
png("BCellsColored_pca_plot_Dim3.png", width = 800, height = 600)
plot(pca_b[, 1], pca_b[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "PCA of DNAm M-values for B cells",
     col = "blue", pch = 19)
dev.off()

# Plot for Tmem cells
png("TmemColored_pca_plot_Dim3.png", width = 800, height = 600)
plot(pca_tmem[, 1], pca_tmem[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "PCA of DNAm M-values for Tmem cells",
     col = "green", pch = 19)
dev.off()

# Plot for Tnai cells
png("TnaiColored_pca_plot_Dim3.png", width = 800, height = 600)
plot(pca_tnai[, 1], pca_tnai[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "PCA of DNAm M-values for Tnai cells",
     col = "red", pch = 19)
dev.off()

1.4 Code for generating Manhattan and Q-Q Plots

# Required libraries
library(dplyr)
library(ggplot2)

# Load data
dat <- read.csv("../Bcell_Batch_Variable.csv")

# Create output directories
if (!dir.exists("manhattan")) dir.create("manhattan")

# Apply FDR correction *within each LM*
dat <- dat %>%
  group_by(LM) %>%
  mutate(FDR = p.adjust(LM.pval, method = "BH")) %>%
  ungroup()

# Save significant CpGs (FDR < 0.05)
signif_cpgs <- dat %>% filter(FDR < 0.05)
write.csv(signif_cpgs, "Significant_CpGs_FDR_lt_0.05.csv", row.names = FALSE)

# Generate Manhattan plots
unique_LMs <- unique(dat$LM)

for (lm in unique_LMs) {
  df <- dat %>% filter(LM == lm)
  
  # Ensure proper chromosome sorting
  df$CHR <- as.factor(df$CHR)
  df <- df[order(as.numeric(as.character(df$CHR)), df$BP), ]

  # Calculate cumulative position
  chr_lengths <- df %>% group_by(CHR) %>% summarise(chr_len = max(BP))
  chr_offsets <- cumsum(as.numeric(chr_lengths$chr_len))
  names(chr_offsets) <- chr_lengths$CHR
  chr_offsets <- c(0, chr_offsets[-length(chr_offsets)])
  df$pos_cum <- df$BP + chr_offsets[as.character(df$CHR)]

  # Axis labels (midpoint of each chromosome for labeling)
  axis_df <- df %>% group_by(CHR) %>% summarize(center = mean(pos_cum))

  # Manhattan plot
  p <- ggplot(df, aes(x = pos_cum, y = -log10(LM.pval), color = as.factor(CHR))) +
    geom_point(alpha = 0.75, size = 0.8) +
    scale_color_manual(values = rep(c("gray20", "gray60"), length(unique(df$CHR)))) +
    scale_x_continuous(label = axis_df$CHR, breaks = axis_df$center) +
    labs(title = paste("Manhattan Plot -", lm),
         x = "Chromosome", y = "-log10(p-value)") +
    theme_minimal() +
    theme(legend.position = "none",
          panel.grid.major.x = element_blank(),
          panel.grid.minor.x = element_blank())
  
  ggsave(paste0("manhattan/", gsub("[^A-Za-z0-9_]", "_", lm), "_manhattan.pdf"), plot = p, width = 10, height = 5)
}

print("Finished generating Manhattan plots and filtering significant CpGs.")

1.5 Code for running DMR

# code to keep handy
# PATH="$HOME/miniconda/bin:$PATH"

# set up environment
# rm(list = ls())

library(ENmix)
library(minfi)
library(data.table)
# library(tidyverse)
library(dplyr)
library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
data("IlluminaHumanMethylation450kanno.ilmn12.hg19")
library(tibble)
dir <- "/scratch/alpine/ceck@xsede.org/DMR/Run_Enmix"
#dir <- "/Users/eckco/Desktop/Norris_Lab/Modeling_Results/DMR/Run_Enmix"
if (file.exists(file.path(dir, "dmr_output/"))) {
  setwd(file.path(dir, "dmr_output/"))
} else {
  dir.create(file.path(dir, "dmr_output/"))
  setwd(file.path(dir, "dmr_output/"))
}
print("Output directory created")

print("Loading CpG Data...")
# load in the data
rslts <- read.csv("/scratch/alpine/ceck@xsede.org/Norris_Model_Data_Storage/Batch_Variable_Models/Bcell/Test_modelFit.csv")
#rslts <- read.csv("../Top_500_Bcell_batch_adjusted.csv")
head(rslts)
dim(rslts)
colnames(rslts)
print("Loading Mval data...")
mvals <- readRDS("/projects/ceck@xsede.org/Norris/DynamicModel/Model_Files/DNAm_matrix.rds")
dim(mvals)
mvals <- t(mvals)
#head(mvals)

print("Loading annotation data...")
anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
# Convert annotation file to dataframe for easy merging
anno <- as.data.frame(anno)
class(anno)
sum(is.na(rownames(anno)))
print("Data Loaded")

print("Filtering CpG information for LM7")
# filter to just the CpG related information
rslts <- rslts[rslts$LM == "LM36", ]
rslts <- rslts[, c("CpG", "LM.pval")]
head(rslts)
unique_cpg <- unique(rslts$CpG)
rslts <- rslts[rslts$CpG %in% unique_cpg, ]
dim(rslts)
rownames(rslts)
# rename the rows of the results file
rslts <- rslts %>% 
  filter(CpG %in% unique_cpg) %>% 
  column_to_rownames(var = "CpG")
rownames(rslts)

print("Filtering and formatting annotation file...")
# Filter annotation file
anno_f <- anno[rownames(rslts), c("chr", "pos")]
#anno_f <- as.data.frame(anno_f)
#anno_f <- anno_f[!is.na(rownames(anno_f)), ]
class(anno_f)

# combine the annotation information and results
z <- merge(rslts, anno_f, by = 0)


# add in an end point for the CpG to match the formatting expected by combp()
z$end <- z$pos + 1
# format the chromosome column correctly
z$chr <- gsub("chr", "", z$chr)
# name the columns
colnames(z)[c(1, 2, 3, 4, 5)] <- c("probe", "p", "chr", "start", "end")
# order and select the columns
w <- z[, c("chr", "start", "end", "p", "probe")]

# create the ordered factor for chr for later ordering
w$chr <- factor(w$chr, levels = c(1:22, "X", "Y"), ordered = TRUE)
# order the datafile by position
w <- w %>% as.data.frame() %>% arrange(chr, start)

print("combp input file created:")
head(w)

print("combp running...")
# run with defaults
#dmr <- combp(w, nCores = 8, dist.cutoff = 2000)
dmr <- combp(w, nCores = 8, dist.cutoff = 1000)
print(dmr)

#pdf("Hist.pdf")
#hist(rslts$w, breaks = 50, main = "P-value Histogram", xlab = "P-value")
#dev.off()


#print("Saving DMR results as CSV...")
#saveRDS(dmr, file = "dmr_results.rds")

#write.csv(dmr, file = "dmr_results.csv", row.names = FALSE)
#print("DMR results saved to dmr_results.csv")

print("Script Complete")

1.6 Code for running pathway analysis

library(methylGSA)
library(org.Hs.eg.db)
library(IlluminaHumanMethylation450kanno.ilmn12.hg19)
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
library(tibble)
library(UpSetR)
library(tidyverse)


# Define output directory
output_dir <- "/Users/eckco/Desktop/Norris_Lab/Modeling_Results/DMR/Pathway_Results/Bcell/GO_Results"
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Get list of all RDS files in current working directory
pval_files <- list.files(pattern = "_pval_vector\\.rds$")

if (length(pval_files) == 0) stop("No *_pval_vector.rds files found in the current directory.")

# Lists to collect pathway sets for UpSet plots
all_pathways_list <- list()
sig_pathways_list <- list()

# Loop over each file
for (file in pval_files) {
  cat("\nProcessing file:", file, "\n")
  
  # Extract LM name
  lm_name <- gsub(".*(LM\\d+).*", "\\1", file)
  output_prefix <- file.path(output_dir, paste0("methylRRA_", lm_name))
  
  # Load CpG p-values
  cpg.pval <- readRDS(file)
  cat("Loaded", length(cpg.pval), "CpG p-values.\n")
  
  # Run methylRRA
  res3 <- methylRRA(
    cpg.pval = cpg.pval,
    method = "GSEA",
    #GS.type = "KEGG",
    GS.type = "GO",
    minsize = 10,
    maxsize = 1000,
    array.type = "450K"
  )
  cat("methylRRA (GSEA) complete. Pathways tested:", nrow(res3), "\n")
  
  # Save full results
  write_csv(res3, paste0(output_prefix, "_results.csv"))
  
  # Filter significant pathways
  sig_pathways <- res3 %>% filter(padj < 0.05)
  write_csv(sig_pathways, paste0(output_prefix, "_significant.csv"))
  
  # Add to pathway lists for UpSet plots
  all_pathways_list[[lm_name]] <- unique(res3$ID)
  sig_pathways_list[[lm_name]] <- if (nrow(sig_pathways) > 0) unique(sig_pathways$ID) else character(0)
  
  # Summary info
  if (nrow(sig_pathways) > 0) {
    gene_counts <- sapply(sig_pathways$core_enrichment, function(x) length(strsplit(x, "/")[[1]]))
    total_genes <- sum(gene_counts)
    total_pathways <- n_distinct(res3$ID)
    sig_pathway_count <- nrow(sig_pathways)
  } else {
    total_genes <- 0
    total_pathways <- n_distinct(res3$ID)
    sig_pathway_count <- 0
  }
  
  summary_df <- data.frame(
    LM = lm_name,
    Total_Genes_in_core_enrichment = total_genes,
    Total_Pathways_Tested = total_pathways,
    Significant_Pathways_padj_lt_0_05 = sig_pathway_count
  )
  
  write_csv(summary_df, paste0(output_prefix, "_summary.csv"))
  
  # Plot if significant
  if (sig_pathway_count > 0) {
    top_pathways <- sig_pathways %>%
      arrange(padj) %>%
      slice(1:10) %>%
      mutate(Description_wrapped = str_wrap(Description, width = 40)) %>%
      mutate(Description_wrapped = factor(Description_wrapped, levels = rev(Description_wrapped)))
    
    p <- ggplot(top_pathways, aes(x = Description_wrapped, y = -log10(padj), fill = padj)) +
      geom_bar(stat = "identity") +
      geom_text(aes(label = sprintf("p=%.3g", padj)), hjust = -0.1, size = 3.5) +
      scale_fill_gradient(low = "red", high = "blue", name = "Adj. P-value") +
      coord_flip() +
      labs(
        title = paste("Top 10 KEGG Pathways for", lm_name),
        x = "Pathway",
        y = expression(-log[10]("Adjusted P-value"))
      ) +
      theme_minimal() +
      theme(
        axis.text.y = element_text(size = 10),
        plot.title = element_text(size = 14, face = "bold"),
        plot.margin = margin(10, 10, 10, 10)
      ) +
      scale_y_continuous(expand = expansion(mult = c(0, 0.5)))
    
    ggsave(paste0(output_prefix, "_top10_padj_barplot.png"), p, width = 8, height = 5)
  } else {
    cat("No significant pathways. Skipping plot.\n")
  }
}



# Merge Summary Files #

# Set your directory containing the CSV files
input_dir <- "/Users/eckco/Desktop/Norris_Lab/Modeling_Results/DMR/Pathway_Results/Bcell/GO_Results/Summary"

# List all CSV files
csv_files <- list.files(path = input_dir, pattern = "\\.csv$", full.names = TRUE)

# Load and bind all files
merged_df <- do.call(rbind, lapply(csv_files, function(file) {
  df <- read.csv(file, row.names = 1)  
  df$SourceFile <- basename(file)      # Create source file column
  return(df)
}))

# View the result
head(merged_df)

# Save merged data (optional)
write.csv(merged_df, file = file.path(input_dir, "Bcell_Pathway_Summary.csv"))



# Upset plots to look at overlap between pathways #

# Define input directory
#input_dir <- "/Users/eckco/Desktop/Norris_Lab/Modeling_Results/DMR/Pathway_Results/Bcell/Gene_Tables/NES_Bar_Plots/significnat_pathways"
input_dir <- "/Users/eckco/Desktop/Norris_Lab/Modeling_Results/DMR/Pathway_Results/Bcell/GO_Results/Sig_Results"

# Define output PDF file
output_pdf <- file.path(input_dir, "upset_plot.pdf")

# List all CSV files
files <- list.files(input_dir, full.names = TRUE, pattern = "\\.csv$")

# Initialize list to store 'ID' vectors
id_lists <- list()

# Read each CSV file and extract 'ID' column
for (file in files) {
  df <- tryCatch(read.csv(file, header = TRUE), error = function(e) NULL)
  
  if (!is.null(df) && "ID" %in% colnames(df)) {
    ids <- unique(as.character(df$ID))
    label <- tools::file_path_sans_ext(basename(file))
    id_lists[[label]] <- ids
  } else {
    message(sprintf("Skipping %s: no 'ID' column found.", basename(file)))
  }
}

# Ensure at least two valid sets
if (length(id_lists) < 2) {
  stop("Need at least two files with valid 'ID' columns to generate an UpSet plot.")
}

# Create binary matrix
binary_matrix <- fromList(id_lists)

# Save UpSet plot as PDF
pdf(output_pdf, width = 12, height = 8)
upset(binary_matrix,
      nsets = length(id_lists),
      order.by = "freq",
      mainbar.y.label = "Common Pathways",
      sets.x.label = "Pathways per LM")
dev.off()

cat("UpSet plot saved to:", output_pdf, "\n")

1.7 Code for merging pathway results

library(dplyr)
library(readr)
library(stringr)

# Define input directory
input_dir <- "/Users/eckco/Desktop/Norris_Lab/Modeling_Results/DMR/Pathway_Results/Bcell/Bcell_Kegg_Results/More_Pathways_Sig_Data/Sig_Results_More_Pathways"

# List all CSV files in the directory
csv_files <- list.files(path = input_dir, pattern = "\\.csv$", full.names = TRUE)

# Initialize an empty list to store non-empty data frames
df_list <- list()

# Loop through each file
for (file in csv_files) {
  df <- tryCatch(read_csv(file, col_types = cols(.default = "c")), error = function(e) NULL)  # Read all columns as character
  
  if (!is.null(df) && nrow(df) > 0) {
    lm_id <- str_extract(basename(file), "LM\\d+")
    df$LM <- lm_id
    df_list[[length(df_list) + 1]] <- df
  }
}

# Combine all valid data frames
if (length(df_list) > 0) {
  merged_data <- bind_rows(df_list)
  write_csv(merged_data, file.path(input_dir, "Bcell_Merged_KEGG_Results.csv"))
  print("Merged file created successfully.")
} else {
  warning("No valid (non-empty) CSV files to merge.")
}

Supplementary Figures

2.1 PCA Plots with Batch Effect:

2.1.a Before Combat

View PCA Before Combat

Download PCA Before Combat

2.1.b After Combat

View PCA After Combat

Download PCA After Combat

2.2 P-value histograms:

Click Here to Download the Bcell Histograms

2.3 Lipid mediator histograms:

Click Here to Download the lipid mediator histograms

2.4 Normal Q-Q Plots:

View Bcell Q-Q Plots

Download BCell Q-Q Plots

2.5 Manhattan Plots:

View Bcell Manhattan Plots

Download BCell Manhattan Plots

3.1 Table with significant pathway results including list of core enrichment genes for KEGG including column with core enriched genes and leading edge.

library(readr)
library(DT)
library(dplyr)

# Load the data
Bcell_SIDAK <- read_csv("/Users/eckco/Desktop/Norris_May_Update/Bcell_Merged_KEGG_Results.csv", show_col_types = FALSE)

# Add gene count column (assuming '/' as separator; change to ',' if needed)
Bcell_SIDAK <- Bcell_SIDAK %>%
  mutate(core_enrichment_genes = ifelse(!is.na(core_enrichment),
                                        lengths(strsplit(core_enrichment, "/")),
                                        0))
  #select(-core_enrichment)  # Remove the core_enrichment column
Bcell_SIDAK <- Bcell_SIDAK[, c(10,1,2,3,4,5,6,7,8,11,9)]
Bcell_SIDAK$enrichmentScore <- formatC(Bcell_SIDAK$enrichmentScore, format = "e", digits = 4)
Bcell_SIDAK$NES <- formatC(Bcell_SIDAK$NES, format = "e", digits = 4)
Bcell_SIDAK$pvalue <- formatC(Bcell_SIDAK$pvalue, format = "e", digits = 4)
Bcell_SIDAK$padj <- formatC(Bcell_SIDAK$padj, format = "e", digits = 4)

names(Bcell_SIDAK)[names(Bcell_SIDAK) == "core_enrichment_genes"] <- "Number_Of_Core_Enrichment_Genes"


full_names <- read.csv("/Users/eckco/Desktop/Norris_May_Update/finalLMlist_wParentFAInfo_updatedEnzymeInfo_Jan2024.csv")
# Prepare the full_names dataset by selecting Analyte and choosing between Pathway and Pathway2
full_names_pathway <- full_names %>%
  mutate(pathway = ifelse(is.na(Pathway) | Pathway == "", Pathway2, Pathway)) %>%
  select(Analyte, pathway)

# Merge in the pathway data using Analyte = LM
Bcell_SIDAK <-Bcell_SIDAK  %>%
  left_join(full_names_pathway, by = c("LM" = "Analyte"))


names(Bcell_SIDAK)[names(Bcell_SIDAK) == "pathway"] <- "Pathway"


Bcell_SIDAK <- Bcell_SIDAK[, c(1,12,2,3,4,5,6,7,8,9,10,11)]

# Display the interactive datatable
datatable(Bcell_SIDAK, rownames=FALSE, options = list(pageLength = 10),
          caption = "Click to explore the full table")

3.2 Table with significant pathway results including list of core enrichment genes for GO including column with core enriched genes and leading edge.

library(readr)
library(DT)
library(dplyr)

# Load the data
Bcell_SIDAK <- read_csv("/Users/eckco/Desktop/Norris_May_Update/Merged_GO_Results.csv", show_col_types = FALSE)

# Add gene count column (assuming '/' as separator; change to ',' if needed)
Bcell_SIDAK <- Bcell_SIDAK %>%
  mutate(core_enrichment_genes = ifelse(!is.na(core_enrichment),
                                        lengths(strsplit(core_enrichment, "/")),
                                        0))
  #select(-core_enrichment)  # Remove the core_enrichment column
Bcell_SIDAK <- Bcell_SIDAK[, c(10,1,2,3,4,5,6,7,8,11,9)]
Bcell_SIDAK$enrichmentScore <- formatC(Bcell_SIDAK$enrichmentScore, format = "e", digits = 4)
Bcell_SIDAK$NES <- formatC(Bcell_SIDAK$NES, format = "e", digits = 4)
Bcell_SIDAK$pvalue <- formatC(Bcell_SIDAK$pvalue, format = "e", digits = 4)
Bcell_SIDAK$padj <- formatC(Bcell_SIDAK$padj, format = "e", digits = 4)
names(Bcell_SIDAK)[names(Bcell_SIDAK) == "core_enrichment_genes"] <- "Number_Of_Core_Enrichment_Genes"

full_names <- read.csv("/Users/eckco/Desktop/Norris_May_Update/finalLMlist_wParentFAInfo_updatedEnzymeInfo_Jan2024.csv")
# Prepare the full_names dataset by selecting Analyte and choosing between Pathway and Pathway2
full_names_pathway <- full_names %>%
  mutate(pathway = ifelse(is.na(Pathway) | Pathway == "", Pathway2, Pathway)) %>%
  select(Analyte, pathway)

# Merge in the pathway data using Analyte = LM
Bcell_SIDAK <-Bcell_SIDAK  %>%
  left_join(full_names_pathway, by = c("LM" = "Analyte"))


names(Bcell_SIDAK)[names(Bcell_SIDAK) == "pathway"] <- "Pathway"


Bcell_SIDAK <- Bcell_SIDAK[, c(1,12,2,3,4,5,6,7,8,9,10,11)]

# Display the interactive datatable
datatable(Bcell_SIDAK, rownames=FALSE, options = list(pageLength = 10),
          caption = "Click to explore the full table")

Bcell Report for TIPRA Cohort: EWAS, DMR, and GSEA Results

Connor Eck

2025-06-17