Goal & Background
update 2025-04-03
Veena has reorganized the gene sets. Here are the five groups:
Group 1: All pathways combined
Group 2: pathways involved in DNA metabolism and gene expression
Group 3: pathways involved in energy metabolism/mitochondria
Group 4: signal transduction pathways
Group 5: morphological features
Group 6: metabolic pathways.
Original goal and background
The goal is to calculate a number of derived values to characterize
the level of conservation or divergence among the Dauers / GIDs and
normalize the value by both the number of genes in the set and relative
to the distance from the average of Dauers to the WT L2/L3.
The transcript counts are first library size normalized and
transformed by the regularized log transformation into
counts_rlog. The resulting log transformed counts are used
as the input for downstream analyses. The rlog transformation moderates
the log2FC - log transformation is known to make lowly expressed genes
appear more variable (noise amplified). What this measure doesn’t
account for is gene length. Notably, the rlog counts are still read
counts, rather than transcript counts.
How can gene length affect the downstream analyses? I initially
thought it would give more weight in the distance calculation to long
genes with higher read counts when expressed at the same level as a
shorter genes. But I then realized that on a log scale, correcting for
gene length is equivalent to subtracting a constant from each row
(gene), where the constant is log2(gene_length/1000) (per
kilo base feature length). Location shifts such as this doesn’t affect
distance measures.
In a different analysis
(../../90-archived-analyses/output/not_in_use_within.sum.squares.zscore),
I transformed the rlog counts to a z-score on a per gene basis. The
rationale is that this give every gene equal weight and thus avoid
having variable genes contributing too much to the distance measure. The
potential caveat of the z-score is that lowly expressed genes that do
not actually have meaningful changes in expression can contribute to
distance measures if they have a high variance. This is a trade-off we
are making to have a more balanced measure of distance.
Overview by PCA
We use PCA to examine the overall structure of the data and
relationship between the dauers and the WT L2/L3 and adults.
I don’t expect the lowly expressed genes to affect the PCA.
Nonetheless, we will try both and see how their results differ. We will
plot all replicates for additional information about sample variability
vs true separation between samples.
All genes
pca_all <- pca(counts_rlog, metadata = sample_df)
biplot(pca_all, colby = "condition", title = "PCA for all genes")

Filtered expression matrix
pca_filtered <- pca(counts_rlog[keep, , drop = FALSE], metadata = sample_df)
biplot(pca_filtered, colby = "condition", title = "PCA for all genes")

DESeq2 provides a plotPCA() function, which by default
uses the top 500 most variable genes
plotPCA(rlog, intgroup = "condition", ntop = 10000)
using ntop=10000 top features by variance

#ggsave("../output/PCA/20250319-transcriptome-top10k-variable-genes-PCA.png")
Analyze gene sets
function to calculate a number of derived values
source("../script/20250403-within-dauer-compactness-score-functions.R")
load the new six group gene sets
gene_set_files <- list.files("../input/gene-lists",
pattern = "G*.tsv", full.names = TRUE)
# group 1 is just the union of the remaining five groups
all_pathways <- read_tsv("../input/gene-lists/Group.1.geneset.tsv")
gene.lists <- map(gene_set_files[-1], read_tsv)
names(gene.lists) <- c(
"2:DNA_metabolism_txn",
"3:Energy_metabolism_mito",
"4:Signal_transduction",
"5:Morphological_features",
"6:Metabolic_features"
)
gene_list_0 <- list_rbind(gene.lists, names_to = "group")
There are still duplicate rows in the gene list. Let’s examine them
and then remove the duplicated rows.
# check for duplicates
duplicates <- gene_list_0 |>
group_by(across(everything())) |>
filter(n() > 1) |>
arrange(group, pathway, WBGeneID)
duplicates
There are 10 duplicate rows
Remove the duplicates
gene_list <- distinct(gene_list_0) # remove duplicate rows
Are there pathways that belong to multiple groups?
# check for pathways that belong to multiple groups
gene_list |>
group_by(pathway) |>
filter(n_distinct(group) > 1) |>
arrange(group, pathway, WBGeneID)
Good, the five groups have no overlaps. Thus, group ID is just
another label on the pathways and won’t cause ambiguity.
What’s the distribution of the number of genes per pathway?
# check the distribution of the number of genes per pathway
gene_list |>
group_by(pathway) |>
summarise(n = n_distinct(WBGeneID)) |>
pull(n) |>
table()
6 7 8 9 10 11 12 13 14 15 17 18 19 20 21 22
4 3 4 3 2 3 5 7 3 4 2 3 2 4 2 1
23 24 25 26 27 28 29 30 33 34 35 37 38 39 40 42
2 2 3 3 1 1 1 2 1 3 1 2 2 5 1 1
44 45 51 54 55 60 67 68 79 82 88 96 98 99 104 107
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
109 116 128 160 167 302 332 338 352 440 446 732 1319 1431
1 1 1 1 1 1 1 1 1 1 1 1 1 1
The smallest pathways contain at least 6 genes.
Calculate the within cluster distance and other metrics
Now that we have all the pathways in a single file, we can just
iterate through all of them.
Split the gene sets and apply the calculation to each subset, combine
the results
# a list object to store the results of individual gene sets
gene_set_results <- with(gene_list, split(WBGeneID, pathway)) |>
keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
# combine the results into a single data frame
gene_set_table <- bind_rows(gene_set_results, .id = "pathway") |>
left_join(distinct(gene_list, group, pathway), by = "pathway") |>
relocate(group, .before = pathway) |>
arrange(group, R2_dauer)
Export the list to a file
# write output to file, preserves more digits for later calculations
write_tsv(gene_set_table,
file.path("../output/within.sum.squares",
"20250403-geneset-stats-all.tsv"))
# also write a version with fewer digits
gene_set_table %>%
mutate(across(-c(pathway, group, gene_set_size), \(x) round(x, digits = 3))) %>%
write_tsv(file.path("../output/within.sum.squares",
"20250403-geneset-stats-all-rounded.tsv"))
RMSD vs d_L2L3 correlation
We observed a general linear relationship between the two measures
across groups (see below). Here, we would like to quantitative assess
the strength of correlation and test for significance.
gene_set_table %>%
#group_by(group) %>%
nest(.by = group) %>%
mutate(
model = map(data, \(df)
cor.test(~ RMSD_dauer + d_L2L3_to_dauer,
data = df, method = "pearson")),
tidied = map(model, broom::tidy)
) %>%
unnest(tidied) %>%
select(group, Pearson = estimate, p.value, conf.low, conf.high)
Same analysis for the GIDs
gene_set_table %>%
#group_by(group) %>%
nest(.by = group) %>%
mutate(
model = map(data, \(df)
cor.test(~ RMSD_gid + d_L2L3_to_gid,
data = df, method = "pearson")),
tidied = map(model, broom::tidy)
) %>%
unnest(tidied) %>%
select(group, Pearson = estimate, p.value, conf.low, conf.high)
Plots
RMSD vs d_L2L3 for groups 2-5
First, we will plot RMSD_dauer vs d_L2L3_to_dauer for groups 2-4
combined.
set_colors <- RColorBrewer::brewer.pal(5, "Set2")
names(set_colors) <- c(
"DNA_metabolism_txn", "Energy_metabolism_mito", "Signal_transduction",
"Morphological_features", "Metabolic_features"
)
p <- gene_set_table %>%
filter(group != "6:Metabolic_features") %>%
mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
label = label)) +
#stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
# color = "skyblue2", linetype = 2) +
geom_point() +
#scale_color_manual("Groups", values = set_colors,
# labels = paste(1:5, names(set_colors), sep = ":")) +
xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
#guides(color = guide_legend(nrow = 2)) +
facet_wrap(~ group, nrow = 2) +
theme(strip.background = element_blank(),
strip.text = element_text(size = rel(0.8)),
panel.spacing = unit(0.5, "lines"))
#theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
paste0("20250404-RMSD-vs-dL2L3-groups2-5.png")),
plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
NULL
Next, we will plot RMSD_gid vs d_L2L3_to_gid for groups 2-4
combined.
p <- gene_set_table %>%
filter(group != "6:Metabolic_features") %>%
mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
label = label)) +
#stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
# color = "skyblue2", linetype = 2) +
geom_point() +
#scale_color_manual("Groups", values = set_colors,
# labels = paste(1:5, names(set_colors), sep = ":")) +
xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
#guides(color = guide_legend(nrow = 2)) +
facet_wrap(~group, nrow = 2) +
theme(strip.background = element_blank(),
strip.text = element_text(size = rel(0.8)),
panel.spacing = unit(0.5, "lines"))
#theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
paste0("20250404-RMSD-vs-dL2L3-groups2-5-GIDs.png")),
plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
NULL
RMSD vs d_L2L3 for each group separately
We will generate one RMSD_dauer vs d_L2L3_to_dauer plot for each
group
#set_colors <- RColorBrewer::brewer.pal(5, "Accent")
#names(set_colors) <- names(gene.lists)
for (set in names(gene.lists)){
tmp <- gene_set_table %>%
filter(group == set) %>%
mutate(label = paste0(pathway, " (", gene_set_size, ")"))
# if there are more than 15 pathways in a group, use 2.5 for label size
# otherwise, use 3
label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
# make plot
p <- tmp %>%
ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
label = label)) +
#stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
# color = "skyblue2", linetype = 2) +
geom_point(aes()) +
#scale_color_manual(NULL, values = set_colors) +
xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
ggtitle(set) + theme(legend.position = "bottom")
ggsave(file = file.path("../output/within.sum.squares/img",
paste0("20250403-RMSD-vs-dL2L3-",
gsub(":", "_", set), ".png")),
plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
width = 6, height = 6)
print(ggplotly(p, tooltip = c("label")))
}
Do the same for RMSD_gid vs d_L2L3_to_gid
for (set in names(gene.lists)){
tmp <- gene_set_table %>%
filter(group == set) %>%
mutate(label = paste0(pathway, " (", gene_set_size, ")"))
# if there are more than 15 pathways in a group, use 2.5 for label size
# otherwise, use 3
label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
# make plot
p <- tmp %>%
ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
label = label)) +
#stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
# color = "skyblue2", linetype = 2) +
geom_point(aes()) +
#scale_color_manual(NULL, values = set_colors) +
xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
ggtitle(set) + theme(legend.position = "bottom")
ggsave(file = file.path("../output/within.sum.squares/img",
paste0("20250403-RMSD-vs-dL2L3-",
gsub(":", "_", set), "-GIDs.png")),
plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
width = 6, height = 6)
print(ggplotly(p, tooltip = c("label")))
}
Compare all five groups
Let’s compare the distribution of the metrics between the five
groups. We will focus on RMSD_dauer, d_L2L3_to_dauer, R2_dauer, and
1-avg_Pearson_dauer
List the five groups
2:DNA_metabolism_txn , 3:Energy_metabolism_mito ,
4:Signal_transduction , 5:Morphological_features ,
6:Metabolic_features
tmp <- gene_set_table %>%
mutate(d_PCC_dauer = 1 - avg_Pearson_dauer,
r_toL2L3_vs_toAdult = d_L2L3_to_dauer / d_adult_to_dauer) %>%
select(group, pathway, gene_set_size, RMSD_dauer, d_L2L3_to_dauer,
d_adult_to_dauer, R2_dauer, d_PCC_dauer, r_toL2L3_vs_toAdult) %>%
pivot_longer(cols = -c(group, pathway, gene_set_size),
names_to = "metric", values_to = "value") %>%
mutate(metric = factor(metric,
levels = c("d_L2L3_to_dauer", "d_adult_to_dauer",
"r_toL2L3_vs_toAdult","RMSD_dauer",
"R2_dauer", "d_PCC_dauer"),
labels = c("Distance to L2/L3",
"Distance to adult",
"d_L2L3 / d_adult",
"RMSD", "(R2) RMSD / d_L2L3",
"1 - avg. Pearson")))
p <- tmp %>%
separate(group, into = c("group", "group_name"), sep = ":") %>%
#mutate(label = paste(pathway, round(value, 3), sep = ":")) %>%
ggplot(aes(x = group, y = value)) +
#geom_boxplot(outlier.shape = NA, outlier.color = NA) +
geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
stat_summary(fun.data = "mean_cl_boot", geom = "crossbar",
linewidth = 0.2, width = 0.4, color = "red") +
scale_color_manual(NULL, values = set_colors, guide = "none") +
facet_wrap(~metric, scales = "free_y") +
ggtitle("Distribution of metrics, all dauers") +
theme(strip.background = element_blank(),
strip.text = element_text(size = rel(0.8)),
panel.spacing = unit(0.5, "lines"))
Warning: Ignoring unknown aesthetics: text
print(ggplotly(p, tooltip = "text",
width = 800, height = 500))
NULL
ggsave(file.path("../output/within.sum.squares/img",
"20250405-compare-metric-distribution-groups2-6-dauers.png"),
width = 8, height = 5)
Repeat for RMSD_gid, d_L2L3_to_gid, R2_gid, and 1-avg_Pearson_gid
tmp <- gene_set_table %>%
mutate(d_PCC_gid = 1 - avg_Pearson_gid,
r_toL2L3_vs_toAdult = d_L2L3_to_gid / d_adult_to_gid) %>%
select(group, pathway, gene_set_size, RMSD_gid, d_L2L3_to_gid,
d_adult_to_gid, R2_gid, d_PCC_gid, r_toL2L3_vs_toAdult) %>%
pivot_longer(cols = -c(group, pathway, gene_set_size),
names_to = "metric", values_to = "value") %>%
mutate(metric = factor(metric,
levels = c("d_L2L3_to_gid", "d_adult_to_gid",
"r_toL2L3_vs_toAdult","RMSD_gid",
"R2_gid", "d_PCC_gid"),
labels = c("Distance to L2/L3",
"Distance to adult",
"d_L2L3 / d_adult",
"RMSD", "(R2) RMSD / d_L2L3",
"1 - avg. Pearson")))
p <- tmp %>%
separate(group, into = c("group", "group_name"), sep = ":") %>%
#mutate(label = paste(pathway, round(value, 3), sep = ":")) %>%
ggplot(aes(x = group, y = value)) +
#geom_boxplot(outlier.shape = NA, outlier.color = NA) +
geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
stat_summary(fun.data = "mean_cl_boot", geom = "crossbar",
linewidth = 0.2, width = 0.4, color = "red") +
scale_color_manual(NULL, values = set_colors, guide = "none") +
facet_wrap(~metric, scales = "free_y") +
ggtitle("Distribution of metrics, GIDs") +
theme(strip.background = element_blank(),
strip.text = element_text(size = rel(0.8)),
panel.spacing = unit(0.5, "lines"))
Warning: Ignoring unknown aesthetics: text
print(ggplotly(p, tooltip = "text",
width = 800, height = 500))
NULL
ggsave(file.path("../output/within.sum.squares/img",
"20250405-compare-metric-distribution-groups2-6-GIDs.png"),
width = 8, height = 5)
—– Don’t go below this line !—–
I haven’t updated the code and results below this yet.
plot_individual_pathway <- function(gene_set, pathway_name){
# this function takes a gene set and the name of the pathway within that set
# then plots pairwise scatter plots along with the corresponding PCC or SCC.
# it also plots the PCA plot.
# Gather the data
genes <- dplyr::filter(gene_set, pathway == pathway_name) |> pull(WBGeneID)
# all replicates, for PCA
rep_matrix <- counts_rlog[genes, , drop = FALSE]
# mean expression only, for pairwise correlations and heatmap
mean_matrix <- sampleMean_rlog[genes, , drop = FALSE]
# Examine the pairwise relationships.
png(file.path("../output/within.sum.squares/img",
paste0(pathway_name, "-pairs-PCC-plot.png")),
width = 1000, height = 1000)
# Plotting the correlation matrix
pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
upper.panel = panel.cor, # Correlation panel
lower.panel = panel.smooth, # Smoothed regression lines
main = paste0("Pairwise PCCs for ", pathway_name))
dev.off()
# Spearman's correlation, rank only.
png(file.path("../output/within.sum.squares/img",
paste0(pathway_name, "-pairs-SCC-plot.png")),
width = 1000, height = 1000)
# Plotting the correlation matrix
pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
upper.panel = function(x, y, ...) panel.cor(x, y, method = "spearman"),
lower.panel = panel.smooth, # Smoothed regression lines
main = paste0("Pairwise SCCs for ", pathway_name))
dev.off()
# PCA plot
PCA <- pca(rep_matrix, metadata = sample_df)
biplot(PCA, colby = "condition", title = paste0("PCA for ", pathway_name))
ggsave(file.path("../output/within.sum.squares/img",
paste0(pathway_name, "-pairs-PCA-plot.png")),
width = 6, height = 6)
# return the expression matrix
return(mean_matrix)
}
Iterate through all Set 1 pathways
core_dauer_1_expressions <-
lapply(unique(core_dauer_1$pathway),
function(pathway){plot_individual_pathway(core_dauer_1, pathway)}
)
names(core_dauer_1_expressions) <- unique(core_dauer_1$pathway)
Iterate through all Set 2 pathways
core_dauer_2_expressions <-
lapply(unique(core_dauer_2$pathway),
function(pathway){plot_individual_pathway(core_dauer_2, pathway)}
)
names(core_dauer_2_expressions) <- unique(core_dauer_2$pathway)
Set 3: all Kim mountain genes
Read gene list and add gene name column
# load Wormbase gene table for ID conversion
#wormbase.genes <- read_tsv(here("input", "wormbase.genes.names.txt"))
# load Wormbase gene IDs
#gene_IDs <- wb_load_gene_ids("WS269")
# load gene list and add gene name column
kim_mountains <- read_tsv("../input/updated.names.kim.mountains.tsv") |>
# rename and order columns to be consistent with the other sets
select(pathway = Kim_group, WBGeneID, geneName) |>
distinct() # remove duplicate rows
Apply the calculation
# a list object to store the results of individual gene sets
Kim_result <- list(
"all_Kim" = within_dauer_compactness_score(sampleMean_rlog, kim_mountains$WBGeneID)
)
Do the same for each subgroup
Save the output
# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Kim_core_dauer_result_table <- bind_rows(
"Kim mountain" = bind_rows(Kim_result, .id = "subset"),
"Core dauer set 1" = bind_rows(core_dauer_1_result, .id = "subset"),
"Core dauer set 2" = bind_rows(core_dauer_2_result, .id = "subset"),
.id = "Gene_set"
) |> arrange(R1_dauer)
# write output to file, preserves more digits for later calculations
write_tsv(Kim_core_dauer_result_table,
file.path("../output/within.sum.squares",
"20250307-geneset-stats-Kim-mountain-and-core-dauer.tsv"))
# save a rounded form to make it easier to view
Kim_core_dauer_result_table %>%
mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>%
write_tsv(file.path("../output/within.sum.squares",
"20250307-geneset-stats-Kim-mountain-and-core-dauer-rounded.tsv"))
Plots for Set 1-3
Instead of ranking pathways by R1 or R2, we can plot the pathways by
their RMSD and distance between mean dauer to L2/L3
p1 <- Kim_core_dauer_result_table %>%
#filter(Gene_set == "Kim mountain") %>%
ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) +
stat_smooth(method = "lm", formula = y~x+0,
color = "skyblue2", linetype = 2) +
geom_point(aes(color = Gene_set)) +
scale_color_manual(NULL, values = set_colors) +
#scale_color_brewer(NULL, palette = "Dark2") +
ggrepel::geom_text_repel(
aes(label = paste0(subset, " (", gene_set_size, ")")),
size = 3, colour = "gray20") +
xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
scale_x_continuous(expand = c(0.1, 0.05)) +
scale_y_continuous(expand = c(0.1, 0)) +
ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
"Kim-and-Core-Dauer-RMSD-vs-dL2L3-all-dauers.png"),
width = 6, height = 6)
And for the GIDs
p2 <- Kim_core_dauer_result_table %>%
ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) +
stat_smooth(method = "lm", formula = y~x+0,
color = "skyblue2", linetype = 2) +
geom_point(aes(color = Gene_set)) +
scale_color_manual(NULL, values = set_colors) +
#scale_color_brewer(NULL, palette = "Dark2") +
ggrepel::geom_text_repel(
aes(label = paste0(subset, " (", gene_set_size, ")")),
size = 3, colour = "gray20") +
xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
scale_x_continuous(expand = c(0.1, 0.05)) +
scale_y_continuous(expand = c(0.1, 0)) +
ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
"Kim-and-Core-Dauer-RMSD-vs-dL2L3-GIDs.png"),
width = 6, height = 6)
PCA for Set 1-3
Here, we generate PCA plots for individual pathways as a diagnostic
tool. We will use the counts_rlog object, which means we
will 1. Use all replicates instead of sample means; 2. Include WT adult
3. Unfiltered, include all genes
1 and 2 are to provide additional information for the diagnosis (with
little cost). (1) gives a sense of how variable are the biological
replicates within each dauer / control type; Inclusion of WT adult
provides an additional reference.
3 because lowly expressed genes will make very little contribution to
PCA (since we are going to center but NOT scale the data).
plotPCA(rlog[unique(c(core_dauer_1$WBGeneID, core_dauer_2$WBGeneID,
kim_mountains$WBGeneID)), ,drop=FALSE],
intgroup = "condition", ntop = 10000)
Case study
Retinoblastoma complex
Gather the data
RbC_genes <- filter(kim_mountains, Kim_group == "Retinoblastoma complex") |>
pull(WBGeneID)
# all replicates, for PCA
RbC_rep_matrix <- counts_rlog[RbC_genes, , drop = FALSE]
# mean expression only, for pairwise correlations and heatmap
RbC_mean_matrix <- sampleMean_rlog[RbC_genes, , drop = FALSE]
# display the matrix
round(RbC_mean_matrix, digits = 2)
Set 4: Metabolic Pathways
The pathway gene lists are from https://wormflux.umassmed.edu/
# This code is only run to get geneset names in the same format as the Kim mountains and insulin.
wormflux = xlsx::read.xlsx("../../../Deseq_data/Immune_genes_tables_Outside_tables/wormflux_4.xlsx",sheetIndex = 2,header = T)
list.wormflux = unclass(wormflux)
list.wormflux.not.na = map(list.wormflux, function(a)na.omit(a))
gids = wb_load_gene_ids("WS295")
list.wormflux.df = map(list.wormflux.not.na,.f = function(b)data.frame(genes = b))
bind.wormflux.df = bind_rows(list.wormflux.df,.id = "id") |> dplyr::rename(class=id)
bind.wormflux.df.table = bind.wormflux.df |> mutate(geneName = wbData::i2s(genes,gids) ) |> dplyr::rename(pathway = class,WBGeneID =genes)
all(bind.wormflux.df.table$genes %in% rownames(sampleMean_rlog))
write_tsv(bind.wormflux.df.table, file = here("input","updated.names.wormflux.tsv"))
Read in the gene list and apply the calculation
# How the data will be load
Metabolic_pathways <- read_tsv(file = "../input/updated.names.wormflux.tsv")
# a list object to store the results of individual gene sets
Metabolic_result <- list(
"all_Metabolic" = within_dauer_compactness_score(sampleMean_rlog,
Metabolic_pathways$WBGeneID)
)
subset_Metabolic_result <- with(Metabolic_pathways, split(WBGeneID, pathway)) |>
keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
Metabolic_result <- append(Metabolic_result, subset_Metabolic_result)
Save the output
# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Metabolic_result_table <- bind_rows(Metabolic_result, .id = "subset") |>
mutate(Gene_set = "Metabolic pathways") |>
relocate(Gene_set, .before = subset) |>
arrange(R1_dauer)
# write output to file, preserves more digits for later calculations
write_tsv(Metabolic_result_table,
here::here("output/within.sum.squares",
"20250307-geneset-stats-Metabolic-pathways.tsv"))
# save a rounded form to make it easier to view
Metabolic_result_table %>%
mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>%
write_tsv(here::here("output/within.sum.squares",
"20250307-geneset-stats-Metabolic-pathways-rounded.tsv"))
Plots for Set 4
Instead of ranking pathways by R1 or R2, we can plot the pathways by
their RMSD and distance between mean dauer to L2/L3
p1 <- Metabolic_result_table %>%
ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) +
stat_smooth(method = "lm", formula = y~x+0,
color = "skyblue2", linetype = 2) +
geom_point(aes(color = gene_set_size < 10)) +
scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
ggrepel::geom_text_repel(
aes(label = paste0(subset, " (", gene_set_size, ")")),
size = 2, colour = "gray20", min.segment.length = 0) +
xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
scale_x_continuous(expand = c(0.1, 0.05)) +
scale_y_continuous(expand = c(0.1, 0)) +
ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
"Metabolic-pathways-RMSD-vs-dL2L3-all-dauers.png"),
width = 6, height = 6)
And for the GIDs
p2 <- Metabolic_result_table %>%
ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) +
stat_smooth(method = "lm", formula = y~x+0,
color = "skyblue2", linetype = 2) +
geom_point(aes(color = gene_set_size < 10)) +
scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
ggrepel::geom_text_repel(
aes(label = paste0(subset, " (", gene_set_size, ")")),
size = 2, colour = "gray20", min.segment.length = 0) +
#xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
scale_x_continuous(expand = c(0.1, 0.05)) +
scale_y_continuous(expand = c(0.1, 0)) +
xlab("Distance Dauer to L2/L3") + ylab("RMSD within GIDs") +
ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
"Metabolic-pathways-RMSD-vs-dL2L3-GIDs.png"),
width = 6, height = 6)
–>
---
title: "Within measurement Dauers"
author: "Bin Z. He"
date: "2025-04-03, updated on `r Sys.Date()`"
output:
  html_notebook:
    toc: TRUE
    code_folding: hide
    theme: flatly
---


```{r message=FALSE}
library(DESeq2)
#library(tximport)
library(tidyverse)
library(plotly)
library(PCAtools)
#library(patchwork)
#library(wbData)
#library(clusterProfiler)
#library(org.Ce.eg.db)
#library(DOSE)
```

Common settings

```{r}
# enrichment color
options(enrichplot.colours = c("red","blue"))
# plotting theme
old <- theme_set(theme_bw(base_size = 16))
```

# Goal & Background
<h2 style="color:red">update 2025-04-03</h2>
Veena has reorganized the gene sets. Here are the five groups:

Group 1: All pathways combined

Group 2: pathways involved in DNA metabolism and gene expression

Group 3: pathways involved in energy metabolism/mitochondria

Group 4: signal transduction pathways

Group 5: morphological features

Group 6: metabolic pathways.

## Original goal and background
The goal is to calculate a number of derived values to characterize the level of conservation or divergence among the Dauers / GIDs and normalize the value by both the number of genes in the set and relative to the distance from the average of Dauers to the WT L2/L3.

The transcript counts are first library size normalized and transformed by the regularized log transformation into `counts_rlog`. The resulting log transformed counts are used as the input for downstream analyses. The rlog transformation moderates the log2FC - log transformation is known to make lowly expressed genes appear more variable (noise amplified). What this measure doesn't account for is gene length. Notably, the rlog counts are still read counts, rather than transcript counts.

How can gene length affect the downstream analyses? I initially thought it would give more weight in the distance calculation to long genes with higher read counts when expressed at the same level as a shorter genes. But I then realized that on a log scale, correcting for gene length is equivalent to subtracting a constant from each row (gene), where the constant is `log2(gene_length/1000)` (per kilo base feature length). Location shifts such as this doesn't affect distance measures.

In a different analysis (`../../90-archived-analyses/output/not_in_use_within.sum.squares.zscore`), I transformed the rlog counts to a z-score on a per gene basis. The rationale is that this give every gene equal weight and thus avoid having variable genes contributing too much to the distance measure. The potential caveat of the z-score is that lowly expressed genes that do not actually have meaningful changes in expression can contribute to distance measures if they have a high variance. This is a trade-off we are making to have a more balanced measure of distance.

# Import data
The preprocessing of the quant data were performed in an early analysis (`../../90-archived-analyses/Analysis/20250226-HBZ-within.sumsquares.Rmd`) and saved to an RData file. We will load that file here.

```{r}
load("../input/20250310-Dauer-project-DESeq2-objects.rdata")
```

# Overview by PCA
We use PCA to examine the overall structure of the data and relationship between the dauers and the WT L2/L3 and adults.

I don't expect the lowly expressed genes to affect the PCA. Nonetheless, we will try both and see how their results differ. We will plot all replicates for additional information about sample variability vs true separation between samples.

All genes
```{r}
pca_all <- pca(counts_rlog, metadata = sample_df)
biplot(pca_all, colby = "condition", title = "PCA for all genes")
```

Filtered expression matrix
```{r}
pca_filtered <- pca(counts_rlog[keep, , drop = FALSE], metadata = sample_df)
biplot(pca_filtered, colby = "condition", title = "PCA for all genes")
```

DESeq2 provides a `plotPCA()` function, which by default uses the top 500 most variable genes
```{r}
plotPCA(rlog, intgroup = "condition", ntop = 10000)
#ggsave("../output/PCA/20250319-transcriptome-top10k-variable-genes-PCA.png")
```

# Analyze gene sets
## function to calculate a number of derived values
```{r}
source("../script/20250403-within-dauer-compactness-score-functions.R")
```

## load the new six group gene sets
```{r message=FALSE}
gene_set_files <- list.files("../input/gene-lists",
                             pattern = "G*.tsv", full.names = TRUE)
# group 1 is just the union of the remaining five groups
all_pathways <- read_tsv("../input/gene-lists/Group.1.geneset.tsv")
gene.lists <- map(gene_set_files[-1], read_tsv)
names(gene.lists) <- c(
  "2:DNA_metabolism_txn",
  "3:Energy_metabolism_mito",
  "4:Signal_transduction",
  "5:Morphological_features",
  "6:Metabolic_features"
)
gene_list_0 <- list_rbind(gene.lists, names_to = "group")
```

There are still duplicate rows in the gene list. Let's examine them and then remove the duplicated rows.
```{r}
# check for duplicates
duplicates <- gene_list_0 |> 
  group_by(across(everything())) |> 
  filter(n() > 1) |> 
  arrange(group, pathway, WBGeneID)
duplicates
```
> There are 10 duplicate rows

Remove the duplicates
```{r}
gene_list <- distinct(gene_list_0) # remove duplicate rows
```

Are there pathways that belong to multiple groups?
```{r}
# check for pathways that belong to multiple groups
gene_list |> 
  group_by(pathway) |> 
  filter(n_distinct(group) > 1) |> 
  arrange(group, pathway, WBGeneID)
```
> Good, the five groups have no overlaps. Thus, group ID is just another label on the pathways and won't cause ambiguity.

What's the distribution of the number of genes per pathway?
```{r}
# check the distribution of the number of genes per pathway
gene_list |> 
  group_by(pathway) |> 
  summarise(n = n_distinct(WBGeneID)) |> 
  pull(n) |>
  table()
```
> The smallest pathways contain at least 6 genes.

## Calculate the within cluster distance and other metrics
Now that we have all the pathways in a single file, we can just iterate through all of them.

Split the gene sets and apply the calculation to each subset, combine the results
```{r}
# a list object to store the results of individual gene sets
gene_set_results <- with(gene_list, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))

# combine the results into a single data frame
gene_set_table <- bind_rows(gene_set_results, .id = "pathway") |>
  left_join(distinct(gene_list, group, pathway), by = "pathway") |>
  relocate(group, .before = pathway) |>
  arrange(group, R2_dauer)
```

Export the list to a file
```{r}
# write output to file, preserves more digits for later calculations
write_tsv(gene_set_table,
          file.path("../output/within.sum.squares",
                    "20250403-geneset-stats-all.tsv"))

# also write a version with fewer digits
gene_set_table %>% 
  mutate(across(-c(pathway, group, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(file.path("../output/within.sum.squares",
                       "20250403-geneset-stats-all-rounded.tsv"))
```

## RMSD vs d_L2L3 correlation
We observed a general linear relationship between the two measures across groups (see below). Here, we would like to quantitative assess the strength of correlation and test for significance.
```{r}
gene_set_table %>% 
  #group_by(group) %>% 
  nest(.by = group) %>% 
  mutate(
    model = map(data, \(df) 
                cor.test(~ RMSD_dauer + d_L2L3_to_dauer, 
                         data = df, method = "pearson")),
    tidied = map(model, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(group, Pearson = estimate, p.value, conf.low, conf.high)
```

Same analysis for the GIDs
```{r}
gene_set_table %>% 
  #group_by(group) %>% 
  nest(.by = group) %>% 
  mutate(
    model = map(data, \(df) 
                cor.test(~ RMSD_gid + d_L2L3_to_gid, 
                         data = df, method = "pearson")),
    tidied = map(model, broom::tidy)
  ) %>% 
  unnest(tidied) %>% 
  select(group, Pearson = estimate, p.value, conf.low, conf.high)
```

# Plots{.tabset}
## RMSD vs d_L2L3 for groups 2-5
First, we will plot RMSD_dauer vs d_L2L3_to_dauer for groups 2-4 combined.
```{r}
set_colors <- RColorBrewer::brewer.pal(5, "Set2")
names(set_colors) <- c(
  "DNA_metabolism_txn", "Energy_metabolism_mito", "Signal_transduction",
  "Morphological_features", "Metabolic_features"
)
p <- gene_set_table %>% 
  filter(group != "6:Metabolic_features") %>%
  mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
             label = label)) + 
  #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
  #            color = "skyblue2", linetype = 2) +
  geom_point() +
  #scale_color_manual("Groups", values = set_colors, 
  #                   labels = paste(1:5, names(set_colors), sep = ":")) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
  scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
  #guides(color = guide_legend(nrow = 2)) +
  facet_wrap(~ group, nrow = 2) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
  #theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
                        paste0("20250404-RMSD-vs-dL2L3-groups2-5.png")),
       plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
       width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
```
Next, we will plot RMSD_gid vs d_L2L3_to_gid for groups 2-4 combined.
```{r}
p <- gene_set_table %>% 
  filter(group != "6:Metabolic_features") %>%
  mutate(label = paste0(pathway, " (", gene_set_size, ")")) %>%
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
             label = label)) + 
  #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
  #            color = "skyblue2", linetype = 2) +
  geom_point() +
  #scale_color_manual("Groups", values = set_colors, 
  #                   labels = paste(1:5, names(set_colors), sep = ":")) +
  xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
  scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
  scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
  #guides(color = guide_legend(nrow = 2)) +
  facet_wrap(~group, nrow = 2) +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
  #theme(legend.position = "top", legend.text = element_text(size = rel(0.6)))
ggsave(file = file.path("../output/within.sum.squares/img",
                        paste0("20250404-RMSD-vs-dL2L3-groups2-5-GIDs.png")),
       plot = p + ggrepel::geom_text_repel(size = 2.5, colour = "gray20"),
       width = 7, height = 7)
print(ggplotly(p, tooltip = c("label"), width = 600, height = 600))
```

## RMSD vs d_L2L3 for each group separately
We will generate one RMSD_dauer vs d_L2L3_to_dauer plot for each group

```{r}
#set_colors <- RColorBrewer::brewer.pal(5, "Accent")
#names(set_colors) <- names(gene.lists)
for (set in names(gene.lists)){
  tmp <- gene_set_table %>% 
    filter(group == set) %>%
    mutate(label = paste0(pathway, " (", gene_set_size, ")"))
  # if there are more than 15 pathways in a group, use 2.5 for label size
  # otherwise, use 3
  label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
  # make plot
  p <- tmp %>%
    ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer,
               label = label)) + 
    #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
    #            color = "skyblue2", linetype = 2) +
    geom_point(aes()) +
    #scale_color_manual(NULL, values = set_colors) +
    xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
    scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
    scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
    ggtitle(set) + theme(legend.position = "bottom")
  ggsave(file = file.path("../output/within.sum.squares/img",
                          paste0("20250403-RMSD-vs-dL2L3-",
                                 gsub(":", "_", set), ".png")),
         plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
         width = 6, height = 6)
  print(ggplotly(p, tooltip = c("label")))
}
```

Do the same for RMSD_gid vs d_L2L3_to_gid
```{r}
for (set in names(gene.lists)){
  tmp <- gene_set_table %>% 
    filter(group == set) %>%
    mutate(label = paste0(pathway, " (", gene_set_size, ")"))
  # if there are more than 15 pathways in a group, use 2.5 for label size
  # otherwise, use 3
  label_size <- ifelse(nrow(tmp) > 15, 2.5, 3)
  # make plot
  p <- tmp %>%
    ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid,
               label = label)) +
    #stat_smooth(method = "lm", formula = y~x+0, se = FALSE,
    #            color = "skyblue2", linetype = 2) +
    geom_point(aes()) +
    #scale_color_manual(NULL, values = set_colors) +
    xlab("Distance GID to L2/L3") + ylab("RMSD within GIDs") +
    scale_x_continuous(limits = c(0.5, 2.62), expand = c(0.1, 0.05)) +
    scale_y_continuous(limits = c(0.2, 0.92), expand = c(0.1, 0)) +
    ggtitle(set) + theme(legend.position = "bottom")
  ggsave(file = file.path("../output/within.sum.squares/img",
                          paste0("20250403-RMSD-vs-dL2L3-",
                                 gsub(":", "_", set), "-GIDs.png")),
         plot = p + ggrepel::geom_text_repel(size = label_size, colour = "gray20"),
         width = 6, height = 6)
  print(ggplotly(p, tooltip = c("label")))
}
```

## Compare all five groups
Let's compare the distribution of the metrics between the five groups. We will focus on RMSD_dauer, d_L2L3_to_dauer, R2_dauer, and 1-avg_Pearson_dauer

List the five groups

`r paste(unique(gene_set_table$group), "\n")`

```{r}
tmp <- gene_set_table %>% 
  mutate(d_PCC_dauer = 1 - avg_Pearson_dauer,
         r_toL2L3_vs_toAdult = d_L2L3_to_dauer / d_adult_to_dauer) %>%
  select(group, pathway, gene_set_size, RMSD_dauer, d_L2L3_to_dauer,
         d_adult_to_dauer, R2_dauer, d_PCC_dauer, r_toL2L3_vs_toAdult) %>%
  pivot_longer(cols = -c(group, pathway, gene_set_size),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("d_L2L3_to_dauer", "d_adult_to_dauer",
                                    "r_toL2L3_vs_toAdult","RMSD_dauer", 
                                    "R2_dauer", "d_PCC_dauer"),
                         labels = c("Distance to L2/L3",
                                    "Distance to adult",
                                    "d_L2L3 / d_adult",
                                    "RMSD", "(R2) RMSD / d_L2L3",
                                    "1 - avg. Pearson")))

p <- tmp %>%
  separate(group, into = c("group", "group_name"), sep = ":") %>%
  #mutate(label = paste(pathway, round(value, 3), sep = ":")) %>% 
  ggplot(aes(x = group, y = value)) +
  #geom_boxplot(outlier.shape = NA, outlier.color = NA) +
  geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
              width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
  stat_summary(fun.data = "mean_cl_boot", geom = "crossbar", 
               linewidth = 0.2, width = 0.4, color = "red") +
  scale_color_manual(NULL, values = set_colors, guide = "none") +
  facet_wrap(~metric, scales = "free_y") +
  ggtitle("Distribution of metrics, all dauers") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
print(ggplotly(p, tooltip = "text",
               width = 800, height = 500))
ggsave(file.path("../output/within.sum.squares/img",
                 "20250405-compare-metric-distribution-groups2-6-dauers.png"),
       width = 8, height = 5)
```

Repeat for RMSD_gid, d_L2L3_to_gid, R2_gid, and 1-avg_Pearson_gid
```{r}
tmp <- gene_set_table %>% 
  mutate(d_PCC_gid = 1 - avg_Pearson_gid,
         r_toL2L3_vs_toAdult = d_L2L3_to_gid / d_adult_to_gid) %>%
  select(group, pathway, gene_set_size, RMSD_gid, d_L2L3_to_gid,
         d_adult_to_gid, R2_gid, d_PCC_gid, r_toL2L3_vs_toAdult) %>%
  pivot_longer(cols = -c(group, pathway, gene_set_size),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("d_L2L3_to_gid", "d_adult_to_gid",
                                    "r_toL2L3_vs_toAdult","RMSD_gid", 
                                    "R2_gid", "d_PCC_gid"),
                         labels = c("Distance to L2/L3",
                                    "Distance to adult",
                                    "d_L2L3 / d_adult",
                                    "RMSD", "(R2) RMSD / d_L2L3",
                                    "1 - avg. Pearson")))

p <- tmp %>%
  separate(group, into = c("group", "group_name"), sep = ":") %>%
  #mutate(label = paste(pathway, round(value, 3), sep = ":")) %>% 
  ggplot(aes(x = group, y = value)) +
  #geom_boxplot(outlier.shape = NA, outlier.color = NA) +
  geom_jitter(aes(text = paste(pathway, round(value, 3), sep = ":")),
              width = 0.2, size = 1.2, color = "gray40", alpha = 0.8) +
  stat_summary(fun.data = "mean_cl_boot", geom = "crossbar", 
               linewidth = 0.2, width = 0.4, color = "red") +
  scale_color_manual(NULL, values = set_colors, guide = "none") +
  facet_wrap(~metric, scales = "free_y") +
  ggtitle("Distribution of metrics, GIDs") +
  theme(strip.background = element_blank(),
        strip.text = element_text(size = rel(0.8)),
        panel.spacing = unit(0.5, "lines"))
print(ggplotly(p, tooltip = "text",
               width = 800, height = 500))
ggsave(file.path("../output/within.sum.squares/img",
                 "20250405-compare-metric-distribution-groups2-6-GIDs.png"),
       width = 8, height = 5)
```
# ----- Don't go below this line !-----
## I haven't updated the code and results below this yet.

<!--

### Case study
The two measures of similarity, i.e., Euclidean distance-based RMSD and Pearson's correlation, show opposite results for TGFB vs daf.16. Here, we take a closer look at both.

#### function to plot individual pathway
Try to create a function to automate the case study analysis.
```{r}
plot_individual_pathway <- function(gene_set, pathway_name){
  # this function takes a gene set and the name of the pathway within that set
  # then plots pairwise scatter plots along with the corresponding PCC or SCC.
  # it also plots the PCA plot.
  
  # Gather the data
  genes <- dplyr::filter(gene_set, pathway == pathway_name) |> pull(WBGeneID)
  # all replicates, for PCA
  rep_matrix <- counts_rlog[genes, , drop = FALSE]
  # mean expression only, for pairwise correlations and heatmap
  mean_matrix <- sampleMean_rlog[genes, , drop = FALSE]
  
  # Examine the pairwise relationships.
  png(file.path("../output/within.sum.squares/img",
                paste0(pathway_name, "-pairs-PCC-plot.png")),
      width = 1000, height = 1000)
  # Plotting the correlation matrix
  pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
        upper.panel = panel.cor,    # Correlation panel
        lower.panel = panel.smooth, # Smoothed regression lines
        main = paste0("Pairwise PCCs for ", pathway_name))
  dev.off()
  
  # Spearman's correlation, rank only.
  png(file.path("../output/within.sum.squares/img",
                paste0(pathway_name, "-pairs-SCC-plot.png")),
      width = 1000, height = 1000)
  # Plotting the correlation matrix
  pairs(mean_matrix[, c("WT_CTRL", "WT_L2_L3", all_dauer)],
        upper.panel = function(x, y, ...) panel.cor(x, y, method = "spearman"),
        lower.panel = panel.smooth, # Smoothed regression lines
        main = paste0("Pairwise SCCs for ", pathway_name))
  dev.off()
  
  # PCA plot
  PCA <- pca(rep_matrix, metadata = sample_df)
  biplot(PCA, colby = "condition", title = paste0("PCA for ", pathway_name))
  ggsave(file.path("../output/within.sum.squares/img",
                   paste0(pathway_name, "-pairs-PCA-plot.png")),
         width = 6, height = 6)
  
  # return the expression matrix
  return(mean_matrix)
}
```

Iterate through all Set 1 pathways
```{r}
core_dauer_1_expressions <- 
  lapply(unique(core_dauer_1$pathway),
         function(pathway){plot_individual_pathway(core_dauer_1, pathway)}
  )
names(core_dauer_1_expressions) <- unique(core_dauer_1$pathway)
```

Iterate through all Set 2 pathways
```{r}
core_dauer_2_expressions <- 
  lapply(unique(core_dauer_2$pathway),
         function(pathway){plot_individual_pathway(core_dauer_2, pathway)}
  )
names(core_dauer_2_expressions) <- unique(core_dauer_2$pathway)
```

## Set 3: all Kim mountain genes
Read gene list and add gene name column
```{r,fig.format='svg'}
# load Wormbase gene table for ID conversion
#wormbase.genes <- read_tsv(here("input", "wormbase.genes.names.txt"))
# load Wormbase gene IDs
#gene_IDs <- wb_load_gene_ids("WS269")

# load gene list and add gene name column
kim_mountains <- read_tsv("../input/updated.names.kim.mountains.tsv") |>
  # rename and order columns to be consistent with the other sets
  select(pathway = Kim_group, WBGeneID, geneName) |>
  distinct() # remove duplicate rows
```

Apply the calculation
```{r,fig.format='svg'}
# a list object to store the results of individual gene sets
Kim_result <- list(
  "all_Kim" = within_dauer_compactness_score(sampleMean_rlog, kim_mountains$WBGeneID)
)
```

Do the same for each subgroup
```{r results='hide'}
subset_kim_result <- with(kim_mountains, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
Kim_result <- append(Kim_result, subset_kim_result)
```

Save the output
```{r}
# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Kim_core_dauer_result_table <- bind_rows(
  "Kim mountain" = bind_rows(Kim_result, .id = "subset"),
  "Core dauer set 1" = bind_rows(core_dauer_1_result, .id = "subset"),
  "Core dauer set 2" = bind_rows(core_dauer_2_result, .id = "subset"),
  .id = "Gene_set"
) |> arrange(R1_dauer)

# write output to file, preserves more digits for later calculations
write_tsv(Kim_core_dauer_result_table,
          file.path("../output/within.sum.squares",
                    "20250307-geneset-stats-Kim-mountain-and-core-dauer.tsv"))

# save a rounded form to make it easier to view
Kim_core_dauer_result_table %>% 
  mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(file.path("../output/within.sum.squares",
                       "20250307-geneset-stats-Kim-mountain-and-core-dauer-rounded.tsv"))
```

### Plots for Set 1-3
Instead of ranking pathways by R1 or R2, we can plot the pathways by their RMSD and distance between mean dauer to L2/L3
```{r}
p1 <- Kim_core_dauer_result_table %>% 
  #filter(Gene_set == "Kim mountain") %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = Gene_set)) +
  scale_color_manual(NULL, values = set_colors) +
  #scale_color_brewer(NULL, palette = "Dark2") +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 3, colour = "gray20") +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
                 "Kim-and-Core-Dauer-RMSD-vs-dL2L3-all-dauers.png"),
       width = 6, height = 6)
```

And for the GIDs
```{r}
p2 <- Kim_core_dauer_result_table %>% 
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = Gene_set)) +
  scale_color_manual(NULL, values = set_colors) +
  #scale_color_brewer(NULL, palette = "Dark2") +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 3, colour = "gray20") +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
                 "Kim-and-Core-Dauer-RMSD-vs-dL2L3-GIDs.png"),
       width = 6, height = 6)
```

### PCA for Set 1-3
Here, we generate PCA plots for individual pathways as a diagnostic tool. We will use the `counts_rlog` object, which means we will
1. Use all replicates instead of sample means;
2. Include WT adult
3. Unfiltered, include all genes

1 and 2 are to provide additional information for the diagnosis (with little cost). (1) gives a sense of how variable are the biological replicates within each dauer / control type; Inclusion of WT adult provides an additional reference.

3 because lowly expressed genes will make very little contribution to PCA (since we are going to center but NOT scale the data).
```{r}
plotPCA(rlog[unique(c(core_dauer_1$WBGeneID, core_dauer_2$WBGeneID,
                             kim_mountains$WBGeneID)), ,drop=FALSE], 
        intgroup = "condition", ntop = 10000)
```

### Case study
#### Retinoblastoma complex
Gather the data
```{r}
RbC_genes <- filter(kim_mountains, Kim_group == "Retinoblastoma complex") |> 
  pull(WBGeneID)
# all replicates, for PCA
RbC_rep_matrix <- counts_rlog[RbC_genes, , drop = FALSE]
# mean expression only, for pairwise correlations and heatmap
RbC_mean_matrix <- sampleMean_rlog[RbC_genes, , drop = FALSE]
# display the matrix
round(RbC_mean_matrix, digits = 2)
```


## Set 4: Metabolic Pathways
The pathway gene lists are from <https://wormflux.umassmed.edu/>
```
# This code is only run to get geneset names in the same format as the Kim mountains and insulin.
wormflux = xlsx::read.xlsx("../../../Deseq_data/Immune_genes_tables_Outside_tables/wormflux_4.xlsx",sheetIndex = 2,header = T)
list.wormflux = unclass(wormflux)
list.wormflux.not.na = map(list.wormflux, function(a)na.omit(a))
gids = wb_load_gene_ids("WS295")
list.wormflux.df = map(list.wormflux.not.na,.f = function(b)data.frame(genes = b))
bind.wormflux.df = bind_rows(list.wormflux.df,.id = "id") |> dplyr::rename(class=id)
bind.wormflux.df.table = bind.wormflux.df |> mutate(geneName = wbData::i2s(genes,gids) ) |> dplyr::rename(pathway  = class,WBGeneID  =genes)
all(bind.wormflux.df.table$genes %in% rownames(sampleMean_rlog))
write_tsv(bind.wormflux.df.table, file = here("input","updated.names.wormflux.tsv"))
```

Read in the gene list and apply the calculation
```{r}
# How the data will be load
Metabolic_pathways <- read_tsv(file = "../input/updated.names.wormflux.tsv")
```

```{r}
# a list object to store the results of individual gene sets
Metabolic_result <- list(
  "all_Metabolic" = within_dauer_compactness_score(sampleMean_rlog,
                                                   Metabolic_pathways$WBGeneID)
)
```

```{r}
subset_Metabolic_result <- with(Metabolic_pathways, split(WBGeneID, pathway)) |>
  keep(\(x) length(unique(x)) > 1) |> # keep only the groups with more than one gene
  map(\(gene_set) within_dauer_compactness_score(sampleMean_rlog, gene_set))
Metabolic_result <- append(Metabolic_result, subset_Metabolic_result)
```

Save the output
```{r}
# make a table and sort by the ratio of average distance to centroid for dauers
# over distance from dauer centroid to L2/L3
# combine with the core dauer pathway results
Metabolic_result_table <- bind_rows(Metabolic_result, .id = "subset") |>
  mutate(Gene_set = "Metabolic pathways") |>
  relocate(Gene_set, .before = subset) |>
  arrange(R1_dauer)

# write output to file, preserves more digits for later calculations
write_tsv(Metabolic_result_table,
          here::here("output/within.sum.squares",
                     "20250307-geneset-stats-Metabolic-pathways.tsv"))

# save a rounded form to make it easier to view
Metabolic_result_table %>% 
  mutate(across(-c(Gene_set, subset, gene_set_size), \(x) round(x, digits = 3))) %>% 
  write_tsv(here::here("output/within.sum.squares",
                       "20250307-geneset-stats-Metabolic-pathways-rounded.tsv"))
```

### Plots for Set 4
Instead of ranking pathways by R1 or R2, we can plot the pathways by their RMSD and distance between mean dauer to L2/L3
```{r}
p1 <- Metabolic_result_table %>%
  ggplot(aes(x = d_L2L3_to_dauer, y = RMSD_dauer)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = gene_set_size < 10)) +
  scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 2, colour = "gray20", min.segment.length = 0) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  ggtitle("All Dauers") + theme(legend.position = "bottom")
p1
ggsave(file.path("../output/within.sum.squares/img",
                 "Metabolic-pathways-RMSD-vs-dL2L3-all-dauers.png"),
       width = 6, height = 6)
```

And for the GIDs
```{r}
p2 <- Metabolic_result_table %>%
  ggplot(aes(x = d_L2L3_to_gid, y = RMSD_gid)) + 
  stat_smooth(method = "lm", formula = y~x+0, 
              color = "skyblue2", linetype = 2) +
  geom_point(aes(color = gene_set_size < 10)) +
  scale_color_manual("Fewer than 10 genes", values = c("black", "pink")) +
  ggrepel::geom_text_repel(
    aes(label = paste0(subset, " (", gene_set_size, ")")),
    size = 2, colour = "gray20", min.segment.length = 0) +
  #xlab("Distance Dauer to L2/L3") + ylab("RMSD within Dauer") +
  scale_x_continuous(expand = c(0.1, 0.05)) +
  scale_y_continuous(expand = c(0.1, 0)) +
  xlab("Distance Dauer to L2/L3") + ylab("RMSD within GIDs") +
  ggtitle("GIDs") + theme(legend.position = "bottom")
p2
ggsave(file.path("../output/within.sum.squares/img",
                 "Metabolic-pathways-RMSD-vs-dL2L3-GIDs.png"),
       width = 6, height = 6)
```
-->