library(DESeq2)
#library(tximport)
library(tidyverse)
library(plotly)
library(PCAtools)
#library(patchwork)
#library(wbData)
#library(clusterProfiler)
#library(org.Ce.eg.db)
#library(DOSE)

Common settings

# enrichment color
options(enrichplot.colours = c("red","blue"))
# plotting theme
old <- theme_set(theme_bw(base_size = 16))

Goal & Background

update 2025-04-03

Veena has reorganized the gene sets. Here are the five groups:

Group 1: All pathways combined

Group 2: pathways involved in DNA metabolism and gene expression

Group 3: pathways involved in energy metabolism/mitochondria

Group 4: signal transduction pathways

Group 5: morphological features

Group 6: metabolic pathways.

Original goal and background

The goal is to calculate a number of derived values to characterize the level of conservation or divergence among the Dauers / GIDs and normalize the value by both the number of genes in the set and relative to the distance from the average of Dauers to the WT L2/L3.

The transcript counts are first library size normalized and transformed by the regularized log transformation into counts_rlog. The resulting log transformed counts are used as the input for downstream analyses. The rlog transformation moderates the log2FC - log transformation is known to make lowly expressed genes appear more variable (noise amplified). What this measure doesn’t account for is gene length. Notably, the rlog counts are still read counts, rather than transcript counts.

How can gene length affect the downstream analyses? I initially thought it would give more weight in the distance calculation to long genes with higher read counts when expressed at the same level as a shorter genes. But I then realized that on a log scale, correcting for gene length is equivalent to subtracting a constant from each row (gene), where the constant is log2(gene_length/1000) (per kilo base feature length). Location shifts such as this doesn’t affect distance measures.

In a different analysis (../../90-archived-analyses/output/not_in_use_within.sum.squares.zscore), I transformed the rlog counts to a z-score on a per gene basis. The rationale is that this give every gene equal weight and thus avoid having variable genes contributing too much to the distance measure. The potential caveat of the z-score is that lowly expressed genes that do not actually have meaningful changes in expression can contribute to distance measures if they have a high variance. This is a trade-off we are making to have a more balanced measure of distance.

Import data

The preprocessing of the quant data were performed in an early analysis (../../90-archived-analyses/Analysis/20250226-HBZ-within.sumsquares.Rmd) and saved to an RData file. We will load that file here.

load("../input/20250310-Dauer-project-DESeq2-objects.rdata")

Overview by PCA

We use PCA to examine the overall structure of the data and relationship between the dauers and the WT L2/L3 and adults.

I don’t expect the lowly expressed genes to affect the PCA. Nonetheless, we will try both and see how their results differ. We will plot all replicates for additional information about sample variability vs true separation between samples.

All genes

pca_all <- pca(counts_rlog, metadata = sample_df)
biplot(pca_all, colby = "condition", title = "PCA for all genes")

Filtered expression matrix

pca_filtered <- pca(counts_rlog[keep, , drop = FALSE], metadata = sample_df)
biplot(pca_filtered, colby = "condition", title = "PCA for all genes")

DESeq2 provides a plotPCA() function, which by default uses the top 500 most variable genes

plotPCA(rlog, intgroup = "condition", ntop = 10000)
using ntop=10000 top features by variance

#ggsave("../output/PCA/20250319-transcriptome-top10k-variable-genes-PCA.png")

Analyze gene sets

function to calculate a number of derived values

source("../script/20250403-within-dauer-compactness-score-functions.R")

load the new six group gene sets

gene_set_files <- list.files("../input/gene-lists",
                             pattern = "G*.tsv", full.names = TRUE)
# group 1 is just the union of the remaining five groups
all_pathways <- read_tsv("../input/gene-lists/Group.1.geneset.tsv")
gene.lists <- map(gene_set_files[-1], read_tsv)
names(gene.lists) <- c(
  "2:DNA_metabolism_txn",
  "3:Energy_metabolism_mito",
  "4:Signal_transduction",
  "5:Morphological_features",
  "6:Metabolic_features"
)
gene_list_0 <- list_rbind(gene.lists, names_to = "group")

There are still duplicate rows in the gene list. Let’s examine them and then remove the duplicated rows.

# check for duplicates
duplicates <- gene_list_0 |> 
  group_by(across(everything())) |> 
  filter(n() > 1) |> 
  arrange(group, pathway, WBGeneID)
duplicates

There are 10 duplicate rows

Remove the duplicates

gene_list <- distinct(gene_list_0) # remove duplicate rows

Are there pathways that belong to multiple groups?

# check for pathways that belong to multiple groups
gene_list |> 
  group_by(pathway) |> 
  filter(n_distinct(group) > 1) |> 
  arrange(group, pathway, WBGeneID)

Good, the five groups have no overlaps. Thus, group ID is just another label on the pathways and won’t cause ambiguity.

What’s the distribution of the number of genes per pathway?

# check the distribution of the number of genes per pathway
gene_list |> 
  group_by(pathway) |> 
  summarise(n = n_distinct(WBGeneID)) |> 
  pull(n) |>
  table()

   6    7    8    9   10   11   12   13   14   15   17   18   19   20   21   22 
   4    3    4    3    2    3    5    7    3    4    2    3    2    4    2    1 
  23   24   25   26   27   28   29   30   33   34   35   37   38   39   40   42 
   2    2    3    3    1    1    1    2    1    3    1    2    2    5    1    1 
  44   45   51   54   55   60   67   68   79   82   88   96   98   99  104  107 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
 109  116  128  160  167  302  332  338  352  440  446  732 1319 1431 
   1    1    1    1    1    1    1    1    1    1    1    1    1    1 

The smallest pathways contain at least 6 genes.

Calculate the within cluster distance and other metrics

Now that we have all the pathways in a single file, we can just iterate through all of them.

Split the gene sets and apply the calculation to each subset, combine the results

# a list object to store the results of individual gene sets
gene_set_results <- with(gene_list, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))

# combine the results into a single data frame
gene_set_table <- bind_rows(gene_set_results, .id = "pathway") |>
  left_join(distinct(gene_list, group, pathway), by = "pathway") |>
  relocate(group, .before = pathway) |>
  arrange(group, R2_dauer)

Export the list to a file

# write output to file, preserves more digits for later calculations
write_tsv(gene_set_table,
          file.path("../output/within.sum.squares",
                    "20250403-geneset-stats-all.tsv"))

# also write a version with fewer digits
gene_set_table %>% 
  mutate(across(-c(pathway, group, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(file.path("../output/within.sum.squares",
                       "20250403-geneset-stats-all-rounded.tsv"))

RMSD vs d_L2L3 correlation

We observed a general linear relationship between the two measures across groups (see below). Here, we would like to quantitative assess the strength of correlation and test for significance.

gene_set_table %>% 
  #group_by(group) %>% 
  nest(.by = group) %>% 
  mutate(
    model = map(data, \(df) 
                cor.test(~ RMSD_dauer + d_L2L3_to_dauer, 
                         data = df, method = "pearson")),
    tidied = map(model, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(group, Pearson = estimate, p.value, conf.low, conf.high)

Same analysis for the GIDs

gene_set_table %>% 
  #group_by(group) %>% 
  nest(.by = group) %>% 
  mutate(
    model = map(data, \(df) 
                cor.test(~ RMSD_gid + d_L2L3_to_gid, 
                         data = df, method = "pearson")),
    tidied = map(model, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(group, Pearson = estimate, p.value, conf.low, conf.high)

Plots

RMSD vs d_L2L3 for groups 2-5

First, we will plot RMSD_dauer vs d_L2L3_to_dauer for groups 2-4 combined.

set_colors <- RColorBrewer::brewer.pal(5, "Set2")
names(set_colors) <- c(
  "DNA_metabolism_txn", "Energy_metabolism_mito", "Signal_transduction",
  "Morphological_features", "Metabolic_features"
)
p <- gene_set_table %>% 
  filter(group != "6:Metabolic_features") %>%
  mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
             label = label)) + 
  #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
  #            color = "skyblue2", linetype = 2) +
  geom_point() +
  #scale_color_manual("Groups", values = set_colors, 
  #                   labels = paste(1:5, names(set_colors), sep = ":")) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
  scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
  #guides(color = guide_legend(nrow = 2)) +
  facet_wrap(~ group, nrow = 2) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
  #theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
                        paste0("20250404-RMSD-vs-dL2L3-groups2-5.png")),
       plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
       width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
NULL

Next, we will plot RMSD_gid vs d_L2L3_to_gid for groups 2-4 combined.

p <- gene_set_table %>% 
  filter(group != "6:Metabolic_features") %>%
  mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
             label = label)) + 
  #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
  #            color = "skyblue2", linetype = 2) +
  geom_point() +
  #scale_color_manual("Groups", values = set_colors, 
  #                   labels = paste(1:5, names(set_colors), sep = ":")) +
  xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
  scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
  scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
  #guides(color = guide_legend(nrow = 2)) +
  facet_wrap(~group, nrow = 2) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
  #theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
                        paste0("20250404-RMSD-vs-dL2L3-groups2-5-GIDs.png")),
       plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
       width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
NULL

RMSD vs d_L2L3 for each group separately

We will generate one RMSD_dauer vs d_L2L3_to_dauer plot for each group

#set_colors <- RColorBrewer::brewer.pal(5, "Accent")
#names(set_colors) <- names(gene.lists)
for (set in names(gene.lists)){
  tmp <- gene_set_table %>% 
    filter(group == set) %>%
    mutate(label = paste0(pathway, " (", gene_set_size, ")"))
  # if there are more than 15 pathways in a group, use 2.5 for label size
  # otherwise, use 3
  label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
  # make plot
  p <- tmp %>%
    ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
               label = label)) + 
    #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
    #            color = "skyblue2", linetype = 2) +
    geom_point(aes()) +
    #scale_color_manual(NULL, values = set_colors) +
    xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
    scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
    scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
    ggtitle(set) + theme(legend.position = "bottom")
  ggsave(file = file.path("../output/within.sum.squares/img",
                          paste0("20250403-RMSD-vs-dL2L3-",
                                 gsub(":", "_", set), ".png")),
         plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
         width = 6, height = 6)
  print(ggplotly(p, tooltip = c("label")))
}

Do the same for RMSD_gid vs d_L2L3_to_gid

for (set in names(gene.lists)){
  tmp <- gene_set_table %>% 
    filter(group == set) %>%
    mutate(label = paste0(pathway, " (", gene_set_size, ")"))
  # if there are more than 15 pathways in a group, use 2.5 for label size
  # otherwise, use 3
  label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
  # make plot
  p <- tmp %>%
    ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
               label = label)) +
    #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
    #            color = "skyblue2", linetype = 2) +
    geom_point(aes()) +
    #scale_color_manual(NULL, values = set_colors) +
    xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
    scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
    scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
    ggtitle(set) + theme(legend.position = "bottom")
  ggsave(file = file.path("../output/within.sum.squares/img",
                          paste0("20250403-RMSD-vs-dL2L3-",
                                 gsub(":", "_", set), "-GIDs.png")),
         plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
         width = 6, height = 6)
  print(ggplotly(p, tooltip = c("label")))
}

Compare all five groups

Let’s compare the distribution of the metrics between the five groups. We will focus on RMSD_dauer, d_L2L3_to_dauer, R2_dauer, and 1-avg_Pearson_dauer

List the five groups

2:DNA_metabolism_txn , 3:Energy_metabolism_mito , 4:Signal_transduction , 5:Morphological_features , 6:Metabolic_features

tmp <- gene_set_table %>% 
  mutate(d_PCC_dauer = 1 - avg_Pearson_dauer,
         r_toL2L3_vs_toAdult = d_L2L3_to_dauer / d_adult_to_dauer) %>%
  select(group, pathway, gene_set_size, RMSD_dauer, d_L2L3_to_dauer,
         d_adult_to_dauer, R2_dauer, d_PCC_dauer, r_toL2L3_vs_toAdult) %>%
  pivot_longer(cols = -c(group, pathway, gene_set_size),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("d_L2L3_to_dauer", "d_adult_to_dauer",
                                    "r_toL2L3_vs_toAdult","RMSD_dauer", 
                                    "R2_dauer", "d_PCC_dauer"),
                         labels = c("Distance to L2/L3",
                                    "Distance to adult",
                                    "d_L2L3 / d_adult",
                                    "RMSD", "(R2) RMSD / d_L2L3",
                                    "1 - avg. Pearson")))

p <- tmp %>%
  separate(group, into = c("group", "group_name"), sep = ":") %>%
  #mutate(label = paste(pathway, round(value, 3), sep = ":")) %>% 
  ggplot(aes(x = group, y = value)) +
  #geom_boxplot(outlier.shape = NA, outlier.color = NA) +
  geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
              width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
  stat_summary(fun.data = "mean_cl_boot", geom = "crossbar", 
               linewidth = 0.2, width = 0.4, color = "red") +
  scale_color_manual(NULL, values = set_colors, guide = "none") +
  facet_wrap(~metric, scales = "free_y") +
  ggtitle("Distribution of metrics, all dauers") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
Warning: Ignoring unknown aesthetics: text
print(ggplotly(p, tooltip = "text",
               width = 800, height = 500))
NULL
ggsave(file.path("../output/within.sum.squares/img",
                 "20250405-compare-metric-distribution-groups2-6-dauers.png"),
       width = 8, height = 5)

Repeat for RMSD_gid, d_L2L3_to_gid, R2_gid, and 1-avg_Pearson_gid

tmp <- gene_set_table %>% 
  mutate(d_PCC_gid = 1 - avg_Pearson_gid,
         r_toL2L3_vs_toAdult = d_L2L3_to_gid / d_adult_to_gid) %>%
  select(group, pathway, gene_set_size, RMSD_gid, d_L2L3_to_gid,
         d_adult_to_gid, R2_gid, d_PCC_gid, r_toL2L3_vs_toAdult) %>%
  pivot_longer(cols = -c(group, pathway, gene_set_size),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("d_L2L3_to_gid", "d_adult_to_gid",
                                    "r_toL2L3_vs_toAdult","RMSD_gid", 
                                    "R2_gid", "d_PCC_gid"),
                         labels = c("Distance to L2/L3",
                                    "Distance to adult",
                                    "d_L2L3 / d_adult",
                                    "RMSD", "(R2) RMSD / d_L2L3",
                                    "1 - avg. Pearson")))

p <- tmp %>%
  separate(group, into = c("group", "group_name"), sep = ":") %>%
  #mutate(label = paste(pathway, round(value, 3), sep = ":")) %>% 
  ggplot(aes(x = group, y = value)) +
  #geom_boxplot(outlier.shape = NA, outlier.color = NA) +
  geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
              width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
  stat_summary(fun.data = "mean_cl_boot", geom = "crossbar", 
               linewidth = 0.2, width = 0.4, color = "red") +
  scale_color_manual(NULL, values = set_colors, guide = "none") +
  facet_wrap(~metric, scales = "free_y") +
  ggtitle("Distribution of metrics, GIDs") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
Warning: Ignoring unknown aesthetics: text
print(ggplotly(p, tooltip = "text",
               width = 800, height = 500))
NULL
ggsave(file.path("../output/within.sum.squares/img",
                 "20250405-compare-metric-distribution-groups2-6-GIDs.png"),
       width = 8, height = 5)

—– Don’t go below this line !—–

I haven’t updated the code and results below this yet.

plot_individual_pathway <- function(gene_set, pathway_name){
  # this function takes a gene set and the name of the pathway within that set
  # then plots pairwise scatter plots along with the corresponding PCC or SCC.
  # it also plots the PCA plot.
  
  # Gather the data
  genes <- dplyr::filter(gene_set, pathway == pathway_name) |> pull(WBGeneID)
  # all replicates, for PCA
  rep_matrix <- counts_rlog[genes, , drop = FALSE]
  # mean expression only, for pairwise correlations and heatmap
  mean_matrix <- sampleMean_rlog[genes, , drop = FALSE]
  
  # Examine the pairwise relationships.
  png(file.path("../output/within.sum.squares/img",
                paste0(pathway_name, "-pairs-PCC-plot.png")),
      width = 1000, height = 1000)
  # Plotting the correlation matrix
  pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
        upper.panel = panel.cor,    # Correlation panel
        lower.panel = panel.smooth, # Smoothed regression lines
        main = paste0("Pairwise PCCs for ", pathway_name))
  dev.off()
  
  # Spearman's correlation, rank only.
  png(file.path("../output/within.sum.squares/img",
                paste0(pathway_name, "-pairs-SCC-plot.png")),
      width = 1000, height = 1000)
  # Plotting the correlation matrix
  pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
        upper.panel = function(x, y, ...) panel.cor(x, y, method = "spearman"),
        lower.panel = panel.smooth, # Smoothed regression lines
        main = paste0("Pairwise SCCs for ", pathway_name))
  dev.off()
  
  # PCA plot
  PCA <- pca(rep_matrix, metadata = sample_df)
  biplot(PCA, colby = "condition", title = paste0("PCA for ", pathway_name))
  ggsave(file.path("../output/within.sum.squares/img",
                   paste0(pathway_name, "-pairs-PCA-plot.png")),
         width = 6, height = 6)
  
  # return the expression matrix
  return(mean_matrix)
}

Iterate through all Set 1 pathways

core_dauer_1_expressions <- 
  lapply(unique(core_dauer_1$pathway),
         function(pathway){plot_individual_pathway(core_dauer_1, pathway)}
  )
names(core_dauer_1_expressions) <- unique(core_dauer_1$pathway)

Iterate through all Set 2 pathways

core_dauer_2_expressions <- 
  lapply(unique(core_dauer_2$pathway),
         function(pathway){plot_individual_pathway(core_dauer_2, pathway)}
  )
names(core_dauer_2_expressions) <- unique(core_dauer_2$pathway)

Set 3: all Kim mountain genes

Read gene list and add gene name column

# load Wormbase gene table for ID conversion
#wormbase.genes <- read_tsv(here("input", "wormbase.genes.names.txt"))
# load Wormbase gene IDs
#gene_IDs <- wb_load_gene_ids("WS269")

# load gene list and add gene name column
kim_mountains <- read_tsv("../input/updated.names.kim.mountains.tsv") |>
  # rename and order columns to be consistent with the other sets
  select(pathway = Kim_group, WBGeneID, geneName) |>
  distinct() # remove duplicate rows

Apply the calculation

# a list object to store the results of individual gene sets
Kim_result <- list(
  "all_Kim" = within_dauer_compactness_score(sampleMean_rlog, kim_mountains$WBGeneID)
)

Do the same for each subgroup

Save the output

# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Kim_core_dauer_result_table <- bind_rows(
  "Kim mountain" = bind_rows(Kim_result, .id = "subset"),
  "Core dauer set 1" = bind_rows(core_dauer_1_result, .id = "subset"),
  "Core dauer set 2" = bind_rows(core_dauer_2_result, .id = "subset"),
  .id = "Gene_set"
) |> arrange(R1_dauer)

# write output to file, preserves more digits for later calculations
write_tsv(Kim_core_dauer_result_table,
          file.path("../output/within.sum.squares",
                    "20250307-geneset-stats-Kim-mountain-and-core-dauer.tsv"))

# save a rounded form to make it easier to view
Kim_core_dauer_result_table %>% 
  mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(file.path("../output/within.sum.squares",
                       "20250307-geneset-stats-Kim-mountain-and-core-dauer-rounded.tsv"))

Plots for Set 1-3

Instead of ranking pathways by R1 or R2, we can plot the pathways by their RMSD and distance between mean dauer to L2/L3

p1 <- Kim_core_dauer_result_table %>% 
  #filter(Gene_set == "Kim mountain") %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = Gene_set)) +
  scale_color_manual(NULL, values = set_colors) +
  #scale_color_brewer(NULL, palette = "Dark2") +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 3, colour = "gray20") +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
                 "Kim-and-Core-Dauer-RMSD-vs-dL2L3-all-dauers.png"),
       width = 6, height = 6)

And for the GIDs

p2 <- Kim_core_dauer_result_table %>% 
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = Gene_set)) +
  scale_color_manual(NULL, values = set_colors) +
  #scale_color_brewer(NULL, palette = "Dark2") +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 3, colour = "gray20") +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
                 "Kim-and-Core-Dauer-RMSD-vs-dL2L3-GIDs.png"),
       width = 6, height = 6)

PCA for Set 1-3

Here, we generate PCA plots for individual pathways as a diagnostic tool. We will use the counts_rlog object, which means we will 1. Use all replicates instead of sample means; 2. Include WT adult 3. Unfiltered, include all genes

1 and 2 are to provide additional information for the diagnosis (with little cost). (1) gives a sense of how variable are the biological replicates within each dauer / control type; Inclusion of WT adult provides an additional reference.

3 because lowly expressed genes will make very little contribution to PCA (since we are going to center but NOT scale the data).

plotPCA(rlog[unique(c(core_dauer_1$WBGeneID, core_dauer_2$WBGeneID,
                             kim_mountains$WBGeneID)), ,drop=FALSE], 
        intgroup = "condition", ntop = 10000)

Case study

Retinoblastoma complex

Gather the data

RbC_genes <- filter(kim_mountains, Kim_group == "Retinoblastoma complex") |> 
  pull(WBGeneID)
# all replicates, for PCA
RbC_rep_matrix <- counts_rlog[RbC_genes, , drop = FALSE]
# mean expression only, for pairwise correlations and heatmap
RbC_mean_matrix <- sampleMean_rlog[RbC_genes, , drop = FALSE]
# display the matrix
round(RbC_mean_matrix, digits = 2)

Set 4: Metabolic Pathways

The pathway gene lists are from https://wormflux.umassmed.edu/

# This code is only run to get geneset names in the same format as the Kim mountains and insulin.
wormflux = xlsx::read.xlsx("../../../Deseq_data/Immune_genes_tables_Outside_tables/wormflux_4.xlsx",sheetIndex = 2,header = T)
list.wormflux = unclass(wormflux)
list.wormflux.not.na = map(list.wormflux, function(a)na.omit(a))
gids = wb_load_gene_ids("WS295")
list.wormflux.df = map(list.wormflux.not.na,.f = function(b)data.frame(genes = b))
bind.wormflux.df = bind_rows(list.wormflux.df,.id = "id") |> dplyr::rename(class=id)
bind.wormflux.df.table = bind.wormflux.df |> mutate(geneName = wbData::i2s(genes,gids) ) |> dplyr::rename(pathway  = class,WBGeneID  =genes)
all(bind.wormflux.df.table$genes %in% rownames(sampleMean_rlog))
write_tsv(bind.wormflux.df.table, file = here("input","updated.names.wormflux.tsv"))

Read in the gene list and apply the calculation

# How the data will be load
Metabolic_pathways <- read_tsv(file = "../input/updated.names.wormflux.tsv")
# a list object to store the results of individual gene sets
Metabolic_result <- list(
  "all_Metabolic" = within_dauer_compactness_score(sampleMean_rlog,
                                                   Metabolic_pathways$WBGeneID)
)
subset_Metabolic_result <- with(Metabolic_pathways, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
Metabolic_result <- append(Metabolic_result, subset_Metabolic_result)

Save the output

# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Metabolic_result_table <- bind_rows(Metabolic_result, .id = "subset") |>
  mutate(Gene_set = "Metabolic pathways") |>
  relocate(Gene_set, .before = subset) |>
  arrange(R1_dauer)

# write output to file, preserves more digits for later calculations
write_tsv(Metabolic_result_table,
          here::here("output/within.sum.squares",
                     "20250307-geneset-stats-Metabolic-pathways.tsv"))

# save a rounded form to make it easier to view
Metabolic_result_table %>% 
  mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(here::here("output/within.sum.squares",
                       "20250307-geneset-stats-Metabolic-pathways-rounded.tsv"))

Plots for Set 4

Instead of ranking pathways by R1 or R2, we can plot the pathways by their RMSD and distance between mean dauer to L2/L3

p1 <- Metabolic_result_table %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = gene_set_size < 10)) +
  scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 2, colour = "gray20", min.segment.length = 0) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
                 "Metabolic-pathways-RMSD-vs-dL2L3-all-dauers.png"),
       width = 6, height = 6)

And for the GIDs

p2 <- Metabolic_result_table %>%
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = gene_set_size < 10)) +
  scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 2, colour = "gray20", min.segment.length = 0) +
  #xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within GIDs") +
  ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
                 "Metabolic-pathways-RMSD-vs-dL2L3-GIDs.png"),
       width = 6, height = 6)

–>

---
title: "Within measurement Dauers"
author: "Bin Z. He"
date: "2025-04-03, updated on `r Sys.Date()`"
output:
  html_notebook:
    toc: TRUE
    code_folding: hide
    theme: flatly
---


```{r message=FALSE}
library(DESeq2)
#library(tximport)
library(tidyverse)
library(plotly)
library(PCAtools)
#library(patchwork)
#library(wbData)
#library(clusterProfiler)
#library(org.Ce.eg.db)
#library(DOSE)
```

Common settings

```{r}
# enrichment color
options(enrichplot.colours = c("red","blue"))
# plotting theme
old <- theme_set(theme_bw(base_size = 16))
```

# Goal & Background
<h2 style="color:red">update 2025-04-03</h2>
Veena has reorganized the gene sets. Here are the five groups:

Group 1: All pathways combined

Group 2: pathways involved in DNA metabolism and gene expression

Group 3: pathways involved in energy metabolism/mitochondria

Group 4: signal transduction pathways

Group 5: morphological features

Group 6: metabolic pathways.

## Original goal and background
The goal is to calculate a number of derived values to characterize the level of conservation or divergence among the Dauers / GIDs and normalize the value by both the number of genes in the set and relative to the distance from the average of Dauers to the WT L2/L3.

The transcript counts are first library size normalized and transformed by the regularized log transformation into `counts_rlog`. The resulting log transformed counts are used as the input for downstream analyses. The rlog transformation moderates the log2FC - log transformation is known to make lowly expressed genes appear more variable (noise amplified). What this measure doesn't account for is gene length. Notably, the rlog counts are still read counts, rather than transcript counts.

How can gene length affect the downstream analyses? I initially thought it would give more weight in the distance calculation to long genes with higher read counts when expressed at the same level as a shorter genes. But I then realized that on a log scale, correcting for gene length is equivalent to subtracting a constant from each row (gene), where the constant is `log2(gene_length/1000)` (per kilo base feature length). Location shifts such as this doesn't affect distance measures.

In a different analysis (`../../90-archived-analyses/output/not_in_use_within.sum.squares.zscore`), I transformed the rlog counts to a z-score on a per gene basis. The rationale is that this give every gene equal weight and thus avoid having variable genes contributing too much to the distance measure. The potential caveat of the z-score is that lowly expressed genes that do not actually have meaningful changes in expression can contribute to distance measures if they have a high variance. This is a trade-off we are making to have a more balanced measure of distance.

# Import data
The preprocessing of the quant data were performed in an early analysis (`../../90-archived-analyses/Analysis/20250226-HBZ-within.sumsquares.Rmd`) and saved to an RData file. We will load that file here.

```{r}
load("../input/20250310-Dauer-project-DESeq2-objects.rdata")
```

# Overview by PCA
We use PCA to examine the overall structure of the data and relationship between the dauers and the WT L2/L3 and adults.

I don't expect the lowly expressed genes to affect the PCA. Nonetheless, we will try both and see how their results differ. We will plot all replicates for additional information about sample variability vs true separation between samples.

All genes
```{r}
pca_all <- pca(counts_rlog, metadata = sample_df)
biplot(pca_all, colby = "condition", title = "PCA for all genes")
```

Filtered expression matrix
```{r}
pca_filtered <- pca(counts_rlog[keep, , drop = FALSE], metadata = sample_df)
biplot(pca_filtered, colby = "condition", title = "PCA for all genes")
```

DESeq2 provides a `plotPCA()` function, which by default uses the top 500 most variable genes
```{r}
plotPCA(rlog, intgroup = "condition", ntop = 10000)
#ggsave("../output/PCA/20250319-transcriptome-top10k-variable-genes-PCA.png")
```

# Analyze gene sets
## function to calculate a number of derived values
```{r}
source("../script/20250403-within-dauer-compactness-score-functions.R")
```

## load the new six group gene sets
```{r message=FALSE}
gene_set_files <- list.files("../input/gene-lists",
                             pattern = "G*.tsv", full.names = TRUE)
# group 1 is just the union of the remaining five groups
all_pathways <- read_tsv("../input/gene-lists/Group.1.geneset.tsv")
gene.lists <- map(gene_set_files[-1], read_tsv)
names(gene.lists) <- c(
  "2:DNA_metabolism_txn",
  "3:Energy_metabolism_mito",
  "4:Signal_transduction",
  "5:Morphological_features",
  "6:Metabolic_features"
)
gene_list_0 <- list_rbind(gene.lists, names_to = "group")
```

There are still duplicate rows in the gene list. Let's examine them and then remove the duplicated rows.
```{r}
# check for duplicates
duplicates <- gene_list_0 |> 
  group_by(across(everything())) |> 
  filter(n() > 1) |> 
  arrange(group, pathway, WBGeneID)
duplicates
```
> There are 10 duplicate rows

Remove the duplicates
```{r}
gene_list <- distinct(gene_list_0) # remove duplicate rows
```

Are there pathways that belong to multiple groups?
```{r}
# check for pathways that belong to multiple groups
gene_list |> 
  group_by(pathway) |> 
  filter(n_distinct(group) > 1) |> 
  arrange(group, pathway, WBGeneID)
```
> Good, the five groups have no overlaps. Thus, group ID is just another label on the pathways and won't cause ambiguity.

What's the distribution of the number of genes per pathway?
```{r}
# check the distribution of the number of genes per pathway
gene_list |> 
  group_by(pathway) |> 
  summarise(n = n_distinct(WBGeneID)) |> 
  pull(n) |>
  table()
```
> The smallest pathways contain at least 6 genes.

## Calculate the within cluster distance and other metrics
Now that we have all the pathways in a single file, we can just iterate through all of them.

Split the gene sets and apply the calculation to each subset, combine the results
```{r}
# a list object to store the results of individual gene sets
gene_set_results <- with(gene_list, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))

# combine the results into a single data frame
gene_set_table <- bind_rows(gene_set_results, .id = "pathway") |>
  left_join(distinct(gene_list, group, pathway), by = "pathway") |>
  relocate(group, .before = pathway) |>
  arrange(group, R2_dauer)
```

Export the list to a file
```{r}
# write output to file, preserves more digits for later calculations
write_tsv(gene_set_table,
          file.path("../output/within.sum.squares",
                    "20250403-geneset-stats-all.tsv"))

# also write a version with fewer digits
gene_set_table %>% 
  mutate(across(-c(pathway, group, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(file.path("../output/within.sum.squares",
                       "20250403-geneset-stats-all-rounded.tsv"))
```

## RMSD vs d_L2L3 correlation
We observed a general linear relationship between the two measures across groups (see below). Here, we would like to quantitative assess the strength of correlation and test for significance.
```{r}
gene_set_table %>% 
  #group_by(group) %>% 
  nest(.by = group) %>% 
  mutate(
    model = map(data, \(df) 
                cor.test(~ RMSD_dauer + d_L2L3_to_dauer, 
                         data = df, method = "pearson")),
    tidied = map(model, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(group, Pearson = estimate, p.value, conf.low, conf.high)
```

Same analysis for the GIDs
```{r}
gene_set_table %>% 
  #group_by(group) %>% 
  nest(.by = group) %>% 
  mutate(
    model = map(data, \(df) 
                cor.test(~ RMSD_gid + d_L2L3_to_gid, 
                         data = df, method = "pearson")),
    tidied = map(model, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(group, Pearson = estimate, p.value, conf.low, conf.high)
```

# Plots{.tabset}
## RMSD vs d_L2L3 for groups 2-5
First, we will plot RMSD_dauer vs d_L2L3_to_dauer for groups 2-4 combined.
```{r}
set_colors <- RColorBrewer::brewer.pal(5, "Set2")
names(set_colors) <- c(
  "DNA_metabolism_txn", "Energy_metabolism_mito", "Signal_transduction",
  "Morphological_features", "Metabolic_features"
)
p <- gene_set_table %>% 
  filter(group != "6:Metabolic_features") %>%
  mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
             label = label)) + 
  #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
  #            color = "skyblue2", linetype = 2) +
  geom_point() +
  #scale_color_manual("Groups", values = set_colors, 
  #                   labels = paste(1:5, names(set_colors), sep = ":")) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
  scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
  #guides(color = guide_legend(nrow = 2)) +
  facet_wrap(~ group, nrow = 2) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
  #theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
                        paste0("20250404-RMSD-vs-dL2L3-groups2-5.png")),
       plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
       width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
```
Next, we will plot RMSD_gid vs d_L2L3_to_gid for groups 2-4 combined.
```{r}
p <- gene_set_table %>% 
  filter(group != "6:Metabolic_features") %>%
  mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
             label = label)) + 
  #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
  #            color = "skyblue2", linetype = 2) +
  geom_point() +
  #scale_color_manual("Groups", values = set_colors, 
  #                   labels = paste(1:5, names(set_colors), sep = ":")) +
  xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
  scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
  scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
  #guides(color = guide_legend(nrow = 2)) +
  facet_wrap(~group, nrow = 2) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
  #theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
                        paste0("20250404-RMSD-vs-dL2L3-groups2-5-GIDs.png")),
       plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
       width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
```

## RMSD vs d_L2L3 for each group separately
We will generate one RMSD_dauer vs d_L2L3_to_dauer plot for each group

```{r}
#set_colors <- RColorBrewer::brewer.pal(5, "Accent")
#names(set_colors) <- names(gene.lists)
for (set in names(gene.lists)){
  tmp <- gene_set_table %>% 
    filter(group == set) %>%
    mutate(label = paste0(pathway, " (", gene_set_size, ")"))
  # if there are more than 15 pathways in a group, use 2.5 for label size
  # otherwise, use 3
  label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
  # make plot
  p <- tmp %>%
    ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
               label = label)) + 
    #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
    #            color = "skyblue2", linetype = 2) +
    geom_point(aes()) +
    #scale_color_manual(NULL, values = set_colors) +
    xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
    scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
    scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
    ggtitle(set) + theme(legend.position = "bottom")
  ggsave(file = file.path("../output/within.sum.squares/img",
                          paste0("20250403-RMSD-vs-dL2L3-",
                                 gsub(":", "_", set), ".png")),
         plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
         width = 6, height = 6)
  print(ggplotly(p, tooltip = c("label")))
}
```

Do the same for RMSD_gid vs d_L2L3_to_gid
```{r}
for (set in names(gene.lists)){
  tmp <- gene_set_table %>% 
    filter(group == set) %>%
    mutate(label = paste0(pathway, " (", gene_set_size, ")"))
  # if there are more than 15 pathways in a group, use 2.5 for label size
  # otherwise, use 3
  label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
  # make plot
  p <- tmp %>%
    ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
               label = label)) +
    #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
    #            color = "skyblue2", linetype = 2) +
    geom_point(aes()) +
    #scale_color_manual(NULL, values = set_colors) +
    xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
    scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
    scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
    ggtitle(set) + theme(legend.position = "bottom")
  ggsave(file = file.path("../output/within.sum.squares/img",
                          paste0("20250403-RMSD-vs-dL2L3-",
                                 gsub(":", "_", set), "-GIDs.png")),
         plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
         width = 6, height = 6)
  print(ggplotly(p, tooltip = c("label")))
}
```

## Compare all five groups
Let's compare the distribution of the metrics between the five groups. We will focus on RMSD_dauer, d_L2L3_to_dauer, R2_dauer, and 1-avg_Pearson_dauer

List the five groups

`r paste(unique(gene_set_table$group), "\n")`

```{r}
tmp <- gene_set_table %>% 
  mutate(d_PCC_dauer = 1 - avg_Pearson_dauer,
         r_toL2L3_vs_toAdult = d_L2L3_to_dauer / d_adult_to_dauer) %>%
  select(group, pathway, gene_set_size, RMSD_dauer, d_L2L3_to_dauer,
         d_adult_to_dauer, R2_dauer, d_PCC_dauer, r_toL2L3_vs_toAdult) %>%
  pivot_longer(cols = -c(group, pathway, gene_set_size),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("d_L2L3_to_dauer", "d_adult_to_dauer",
                                    "r_toL2L3_vs_toAdult","RMSD_dauer", 
                                    "R2_dauer", "d_PCC_dauer"),
                         labels = c("Distance to L2/L3",
                                    "Distance to adult",
                                    "d_L2L3 / d_adult",
                                    "RMSD", "(R2) RMSD / d_L2L3",
                                    "1 - avg. Pearson")))

p <- tmp %>%
  separate(group, into = c("group", "group_name"), sep = ":") %>%
  #mutate(label = paste(pathway, round(value, 3), sep = ":")) %>% 
  ggplot(aes(x = group, y = value)) +
  #geom_boxplot(outlier.shape = NA, outlier.color = NA) +
  geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
              width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
  stat_summary(fun.data = "mean_cl_boot", geom = "crossbar", 
               linewidth = 0.2, width = 0.4, color = "red") +
  scale_color_manual(NULL, values = set_colors, guide = "none") +
  facet_wrap(~metric, scales = "free_y") +
  ggtitle("Distribution of metrics, all dauers") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
print(ggplotly(p, tooltip = "text",
               width = 800, height = 500))
ggsave(file.path("../output/within.sum.squares/img",
                 "20250405-compare-metric-distribution-groups2-6-dauers.png"),
       width = 8, height = 5)
```

Repeat for RMSD_gid, d_L2L3_to_gid, R2_gid, and 1-avg_Pearson_gid
```{r}
tmp <- gene_set_table %>% 
  mutate(d_PCC_gid = 1 - avg_Pearson_gid,
         r_toL2L3_vs_toAdult = d_L2L3_to_gid / d_adult_to_gid) %>%
  select(group, pathway, gene_set_size, RMSD_gid, d_L2L3_to_gid,
         d_adult_to_gid, R2_gid, d_PCC_gid, r_toL2L3_vs_toAdult) %>%
  pivot_longer(cols = -c(group, pathway, gene_set_size),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("d_L2L3_to_gid", "d_adult_to_gid",
                                    "r_toL2L3_vs_toAdult","RMSD_gid", 
                                    "R2_gid", "d_PCC_gid"),
                         labels = c("Distance to L2/L3",
                                    "Distance to adult",
                                    "d_L2L3 / d_adult",
                                    "RMSD", "(R2) RMSD / d_L2L3",
                                    "1 - avg. Pearson")))

p <- tmp %>%
  separate(group, into = c("group", "group_name"), sep = ":") %>%
  #mutate(label = paste(pathway, round(value, 3), sep = ":")) %>% 
  ggplot(aes(x = group, y = value)) +
  #geom_boxplot(outlier.shape = NA, outlier.color = NA) +
  geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
              width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
  stat_summary(fun.data = "mean_cl_boot", geom = "crossbar", 
               linewidth = 0.2, width = 0.4, color = "red") +
  scale_color_manual(NULL, values = set_colors, guide = "none") +
  facet_wrap(~metric, scales = "free_y") +
  ggtitle("Distribution of metrics, GIDs") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
print(ggplotly(p, tooltip = "text",
               width = 800, height = 500))
ggsave(file.path("../output/within.sum.squares/img",
                 "20250405-compare-metric-distribution-groups2-6-GIDs.png"),
       width = 8, height = 5)
```
# ----- Don't go below this line !-----
## I haven't updated the code and results below this yet.

<!--

### Case study
The two measures of similarity, i.e., Euclidean distance-based RMSD and Pearson's correlation, show opposite results for TGFB vs daf.16. Here, we take a closer look at both.

#### function to plot individual pathway
Try to create a function to automate the case study analysis.
```{r}
plot_individual_pathway <- function(gene_set, pathway_name){
  # this function takes a gene set and the name of the pathway within that set
  # then plots pairwise scatter plots along with the corresponding PCC or SCC.
  # it also plots the PCA plot.
  
  # Gather the data
  genes <- dplyr::filter(gene_set, pathway == pathway_name) |> pull(WBGeneID)
  # all replicates, for PCA
  rep_matrix <- counts_rlog[genes, , drop = FALSE]
  # mean expression only, for pairwise correlations and heatmap
  mean_matrix <- sampleMean_rlog[genes, , drop = FALSE]
  
  # Examine the pairwise relationships.
  png(file.path("../output/within.sum.squares/img",
                paste0(pathway_name, "-pairs-PCC-plot.png")),
      width = 1000, height = 1000)
  # Plotting the correlation matrix
  pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
        upper.panel = panel.cor,    # Correlation panel
        lower.panel = panel.smooth, # Smoothed regression lines
        main = paste0("Pairwise PCCs for ", pathway_name))
  dev.off()
  
  # Spearman's correlation, rank only.
  png(file.path("../output/within.sum.squares/img",
                paste0(pathway_name, "-pairs-SCC-plot.png")),
      width = 1000, height = 1000)
  # Plotting the correlation matrix
  pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
        upper.panel = function(x, y, ...) panel.cor(x, y, method = "spearman"),
        lower.panel = panel.smooth, # Smoothed regression lines
        main = paste0("Pairwise SCCs for ", pathway_name))
  dev.off()
  
  # PCA plot
  PCA <- pca(rep_matrix, metadata = sample_df)
  biplot(PCA, colby = "condition", title = paste0("PCA for ", pathway_name))
  ggsave(file.path("../output/within.sum.squares/img",
                   paste0(pathway_name, "-pairs-PCA-plot.png")),
         width = 6, height = 6)
  
  # return the expression matrix
  return(mean_matrix)
}
```

Iterate through all Set 1 pathways
```{r}
core_dauer_1_expressions <- 
  lapply(unique(core_dauer_1$pathway),
         function(pathway){plot_individual_pathway(core_dauer_1, pathway)}
  )
names(core_dauer_1_expressions) <- unique(core_dauer_1$pathway)
```

Iterate through all Set 2 pathways
```{r}
core_dauer_2_expressions <- 
  lapply(unique(core_dauer_2$pathway),
         function(pathway){plot_individual_pathway(core_dauer_2, pathway)}
  )
names(core_dauer_2_expressions) <- unique(core_dauer_2$pathway)
```

## Set 3: all Kim mountain genes
Read gene list and add gene name column
```{r,fig.format='svg'}
# load Wormbase gene table for ID conversion
#wormbase.genes <- read_tsv(here("input", "wormbase.genes.names.txt"))
# load Wormbase gene IDs
#gene_IDs <- wb_load_gene_ids("WS269")

# load gene list and add gene name column
kim_mountains <- read_tsv("../input/updated.names.kim.mountains.tsv") |>
  # rename and order columns to be consistent with the other sets
  select(pathway = Kim_group, WBGeneID, geneName) |>
  distinct() # remove duplicate rows
```

Apply the calculation
```{r,fig.format='svg'}
# a list object to store the results of individual gene sets
Kim_result <- list(
  "all_Kim" = within_dauer_compactness_score(sampleMean_rlog, kim_mountains$WBGeneID)
)
```

Do the same for each subgroup
```{r results='hide'}
subset_kim_result <- with(kim_mountains, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
Kim_result <- append(Kim_result, subset_kim_result)
```

Save the output
```{r}
# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Kim_core_dauer_result_table <- bind_rows(
  "Kim mountain" = bind_rows(Kim_result, .id = "subset"),
  "Core dauer set 1" = bind_rows(core_dauer_1_result, .id = "subset"),
  "Core dauer set 2" = bind_rows(core_dauer_2_result, .id = "subset"),
  .id = "Gene_set"
) |> arrange(R1_dauer)

# write output to file, preserves more digits for later calculations
write_tsv(Kim_core_dauer_result_table,
          file.path("../output/within.sum.squares",
                    "20250307-geneset-stats-Kim-mountain-and-core-dauer.tsv"))

# save a rounded form to make it easier to view
Kim_core_dauer_result_table %>% 
  mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(file.path("../output/within.sum.squares",
                       "20250307-geneset-stats-Kim-mountain-and-core-dauer-rounded.tsv"))
```

### Plots for Set 1-3
Instead of ranking pathways by R1 or R2, we can plot the pathways by their RMSD and distance between mean dauer to L2/L3
```{r}
p1 <- Kim_core_dauer_result_table %>% 
  #filter(Gene_set == "Kim mountain") %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = Gene_set)) +
  scale_color_manual(NULL, values = set_colors) +
  #scale_color_brewer(NULL, palette = "Dark2") +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 3, colour = "gray20") +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
                 "Kim-and-Core-Dauer-RMSD-vs-dL2L3-all-dauers.png"),
       width = 6, height = 6)
```

And for the GIDs
```{r}
p2 <- Kim_core_dauer_result_table %>% 
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = Gene_set)) +
  scale_color_manual(NULL, values = set_colors) +
  #scale_color_brewer(NULL, palette = "Dark2") +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 3, colour = "gray20") +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
                 "Kim-and-Core-Dauer-RMSD-vs-dL2L3-GIDs.png"),
       width = 6, height = 6)
```

### PCA for Set 1-3
Here, we generate PCA plots for individual pathways as a diagnostic tool. We will use the `counts_rlog` object, which means we will
1. Use all replicates instead of sample means;
2. Include WT adult
3. Unfiltered, include all genes

1 and 2 are to provide additional information for the diagnosis (with little cost). (1) gives a sense of how variable are the biological replicates within each dauer / control type; Inclusion of WT adult provides an additional reference.

3 because lowly expressed genes will make very little contribution to PCA (since we are going to center but NOT scale the data).
```{r}
plotPCA(rlog[unique(c(core_dauer_1$WBGeneID, core_dauer_2$WBGeneID,
                             kim_mountains$WBGeneID)), ,drop=FALSE], 
        intgroup = "condition", ntop = 10000)
```

### Case study
#### Retinoblastoma complex
Gather the data
```{r}
RbC_genes <- filter(kim_mountains, Kim_group == "Retinoblastoma complex") |> 
  pull(WBGeneID)
# all replicates, for PCA
RbC_rep_matrix <- counts_rlog[RbC_genes, , drop = FALSE]
# mean expression only, for pairwise correlations and heatmap
RbC_mean_matrix <- sampleMean_rlog[RbC_genes, , drop = FALSE]
# display the matrix
round(RbC_mean_matrix, digits = 2)
```


## Set 4: Metabolic Pathways
The pathway gene lists are from <https://wormflux.umassmed.edu/>
```
# This code is only run to get geneset names in the same format as the Kim mountains and insulin.
wormflux = xlsx::read.xlsx("../../../Deseq_data/Immune_genes_tables_Outside_tables/wormflux_4.xlsx",sheetIndex = 2,header = T)
list.wormflux = unclass(wormflux)
list.wormflux.not.na = map(list.wormflux, function(a)na.omit(a))
gids = wb_load_gene_ids("WS295")
list.wormflux.df = map(list.wormflux.not.na,.f = function(b)data.frame(genes = b))
bind.wormflux.df = bind_rows(list.wormflux.df,.id = "id") |> dplyr::rename(class=id)
bind.wormflux.df.table = bind.wormflux.df |> mutate(geneName = wbData::i2s(genes,gids) ) |> dplyr::rename(pathway  = class,WBGeneID  =genes)
all(bind.wormflux.df.table$genes %in% rownames(sampleMean_rlog))
write_tsv(bind.wormflux.df.table, file = here("input","updated.names.wormflux.tsv"))
```

Read in the gene list and apply the calculation
```{r}
# How the data will be load
Metabolic_pathways <- read_tsv(file = "../input/updated.names.wormflux.tsv")
```

```{r}
# a list object to store the results of individual gene sets
Metabolic_result <- list(
  "all_Metabolic" = within_dauer_compactness_score(sampleMean_rlog,
                                                   Metabolic_pathways$WBGeneID)
)
```

```{r}
subset_Metabolic_result <- with(Metabolic_pathways, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
Metabolic_result <- append(Metabolic_result, subset_Metabolic_result)
```

Save the output
```{r}
# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Metabolic_result_table <- bind_rows(Metabolic_result, .id = "subset") |>
  mutate(Gene_set = "Metabolic pathways") |>
  relocate(Gene_set, .before = subset) |>
  arrange(R1_dauer)

# write output to file, preserves more digits for later calculations
write_tsv(Metabolic_result_table,
          here::here("output/within.sum.squares",
                     "20250307-geneset-stats-Metabolic-pathways.tsv"))

# save a rounded form to make it easier to view
Metabolic_result_table %>% 
  mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(here::here("output/within.sum.squares",
                       "20250307-geneset-stats-Metabolic-pathways-rounded.tsv"))
```

### Plots for Set 4
Instead of ranking pathways by R1 or R2, we can plot the pathways by their RMSD and distance between mean dauer to L2/L3
```{r}
p1 <- Metabolic_result_table %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = gene_set_size < 10)) +
  scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 2, colour = "gray20", min.segment.length = 0) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
                 "Metabolic-pathways-RMSD-vs-dL2L3-all-dauers.png"),
       width = 6, height = 6)
```

And for the GIDs
```{r}
p2 <- Metabolic_result_table %>%
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = gene_set_size < 10)) +
  scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 2, colour = "gray20", min.segment.length = 0) +
  #xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within GIDs") +
  ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
                 "Metabolic-pathways-RMSD-vs-dL2L3-GIDs.png"),
       width = 6, height = 6)
```
-->