This analysis addresses the methodological question raised in ANCOM-BC Issue #196: Can normalized abundance data be effectively used with differential abundance tools designed for raw counts?
Many differential abundance (DA) tools were designed for raw read counts, where library-size differences are informative and are corrected internally (e.g. via size factors, library-size scaling, or gene-length adjustment). For genome-level metagenomic data, however, raw counts are dominated by genome length; common practice is therefore to normalise reads to TPM, RPKM, or coverage. This raises a methodological question about how such normalisation affects DA results. The metrics considered here are defined in the table below, with a brief computational sketch afterwards.
This study evaluates 48 metric × method combinations (6 metrics × 8 DA tools) using CAMISIM synthetic communities with known ground truth.
Metric | Length normalisation | Depth normalisation | What the number means |
---|---|---|---|
counts | ✗ (raw reads) | ✗ | Number of read pairs whose best hit is this genome |
relative % | ✗ | ✓ (values sum to 100 %) | Share of all mapped reads that belong to the genome |
breadth | ✓ (divide by length) | ✗ | Fraction of genome bases with ≥ 1× read coverage |
coverage (mean depth) | ✓ (divide by length) | ✗ * | Mean per-base depth across the genome |
RPKM | ✓ (/ kb) | ✓ (/ 10⁶ reads) | Reads-per-kilobase-per-million mapped reads |
TPM | ✓ (/ kb) | ✓ (rescaled to 10⁶) | Transcripts-per-million; similar to RPKM but sample-wise rescaled |
* Coverage still increases with sequencing depth, but it no longer carries the multiplicative genome-length bias.
Metric categories used throughout this report:
• Non-normalized: counts
• Length-normalized: breadth, coverage
• Depth-normalized: relative (%)
• Length + Depth-normalized: RPKM, TPM
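For orientation, the minimal sketch below shows how these metrics relate to one another computationally. It is not the pipeline code: `counts_mat` (a genomes × samples matrix of mapped read-pair counts), `genome_length_bp` (a named vector of genome lengths in bp), and the 150 bp read length are illustrative assumptions.
# Illustrative sketch only (not the pipeline code): deriving the abundance
# metrics from a genomes x samples matrix of mapped read-pair counts
# (`counts_mat`) and a vector of genome lengths in bp (`genome_length_bp`).
length_kb    <- genome_length_bp / 1e3
lib_size     <- colSums(counts_mat)                        # mapped read pairs per sample

relative_pct <- sweep(counts_mat, 2, lib_size, "/") * 100  # depth-normalised share, sums to 100 %

rpk  <- sweep(counts_mat, 1, length_kb, "/")               # reads per kilobase (length-normalised)
rpkm <- sweep(rpk, 2, lib_size / 1e6, "/")                 # ... per million mapped reads
tpm  <- sweep(rpk, 2, colSums(rpk) / 1e6, "/")             # rescaled so every sample sums to 1e6

read_len <- 150                                            # assumed read length (2 x 150 bp pairs)
coverage <- sweep(counts_mat * 2 * read_len, 1, genome_length_bp, "/")  # approx. mean per-base depth

# Breadth cannot be derived from the count matrix alone: it is the fraction of
# bases of each genome with depth >= 1, taken from per-base coverage profiles.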
Details data import, cleaning, and integration steps used to construct the analysis dataset.
The genomes and their distribution were as follows:
# Load ground truth
truth_table <- read_csv('/home/mrahman/da_analysis/data_HOMD/design/truth_table.csv') %>%
mutate(genome = str_replace(genome, "\\.fna$", "")) %>%
mutate(genome = str_replace_all(genome, "[^A-Za-z0-9]", "")) %>%
distinct(genome, .keep_all = TRUE)
cat("DA genomes:", sum(truth_table$is_DA), "\n")
## DA genomes: 80
## Non-DA genomes: 20
truth_table %>%
group_by(is_DA) %>%
summarise(
n_genomes = n(),
mean_log2FC = mean(true_log2FC, na.rm = TRUE),
min_log2FC = min(true_log2FC, na.rm = TRUE),
max_log2FC = max(true_log2FC, na.rm = TRUE),
.groups = 'drop'
) %>%
kable(caption = "Ground Truth Summary", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
is_DA | n_genomes | mean_log2FC | min_log2FC | max_log2FC |
---|---|---|---|---|
FALSE | 20 | 0.000 | 0 | 0 |
TRUE | 80 | 2.087 | 1 | 3 |
The CAMISIM simulation generated samples with varying sequencing depths to mimic realistic variation in sequencing effort:
## Sequencing depth summary:
## - Total samples: 40
## - Range: 18,802,195 - 58,149,611 read pairs
## - Mean: 38,841,542 read pairs
## - Median: 41,634,552 read pairs
## - Coefficient of variation: 0.331
Sample | Read Pairs |
---|---|
C01_rep01 | 51,197,632 |
C01_rep02 | 38,972,502 |
C02_rep01 | 43,481,293 |
C02_rep02 | 27,967,003 |
C03_rep01 | 48,070,192 |
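The depth summary above can be recomputed from a simple per-sample read-pair table; a minimal sketch follows, in which the file path and column names (`sample`, `read_pairs`) are assumptions for illustration rather than the pipeline's actual layout.
# Sketch only: recomputing the sequencing-depth summary from a per-sample
# read-pair table (assumed path and column names).
depth_df <- read_csv('path/to/read_pairs_per_sample.csv', show_col_types = FALSE)
cat("Sequencing depth summary:\n")
cat("- Total samples:", nrow(depth_df), "\n")
cat("- Range:", format(min(depth_df$read_pairs), big.mark = ","), "-",
    format(max(depth_df$read_pairs), big.mark = ","), "read pairs\n")
cat("- Mean:", format(round(mean(depth_df$read_pairs)), big.mark = ","), "read pairs\n")
cat("- Median:", format(round(median(depth_df$read_pairs)), big.mark = ","), "read pairs\n")
cat("- Coefficient of variation:",
    round(sd(depth_df$read_pairs) / mean(depth_df$read_pairs), 3), "\n")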
# Define available metrics
metrics <- c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")
cat("Available metrics:", paste(metrics, collapse = ", "), "\n")
## Available metrics: counts, tpm, rpkm, relative, breadth, coverage
# load DA results for a specific metric
load_metric_results <- function(metric) {
da_dir <- paste0('/home/mrahman/da_analysis/data_HOMD/da/', metric)
da_files <- list.files(da_dir, pattern = '^DA_.*\\.csv$', full.names = TRUE) %>%
.[!grepl("DA_basic\\.csv$", .)] # basic results are invalid 0s
read_da_file <- function(path) {
tool <- basename(path) %>% str_replace_all('^DA_|\\.csv$', '')
df <- read_csv(path, col_names = FALSE, show_col_types = FALSE) %>%
rename(genome = X1) %>%
filter(genome != "genome") %>%
mutate(tool = tool,
metric = metric,
genome = str_replace(genome, "\\.fna$", ""),
genome = str_replace_all(genome, "[^A-Za-z0-9]", ""))
# Handle different column structures: check for the raw/adjusted p-value columns
# explicitly (tool and metric columns were added above, so ncol() alone is unreliable)
if (all(c("X2", "X3") %in% names(df))) {
df <- df %>%
rename(rawP = X2, adjP = X3) %>%
mutate(rawP = as.numeric(rawP), adjP = as.numeric(adjP))
} else {
df <- df %>% mutate(rawP = NA_real_, adjP = NA_real_)
}
return(df)
}
result <- map_dfr(da_files, read_da_file) %>%
distinct(tool, genome, .keep_all = TRUE) %>%
group_by(tool, metric) %>%
mutate(adjP = p.adjust(rawP, method = "BH")) %>% # recompute BH-adjusted p-values from raw p-values
ungroup() %>%
mutate(adjP = replace_na(adjP, 1))
return(result)
}
all_metrics_results <- map_dfr(metrics, load_metric_results)
master_df <- all_metrics_results %>%
left_join(truth_table, by = 'genome') %>%
complete(genome, tool, metric, fill = list(adjP = 1)) %>%
mutate(score = -log10(adjP),
is_DA = replace_na(is_DA, FALSE))
cat("Total combinations:", nrow(master_df), "\n")
## Total combinations: 4800
# Show available combinations
combinations <- master_df %>%
count(metric, tool) %>%
arrange(metric, tool)
DT::datatable(combinations,
caption = "Available Tool-Metric Combinations",
options = list(pageLength = 5, scrollX = TRUE))
Assesses tool–metric combinations using classification metrics such as F1-score and AUC-PR.
performance_metrics <- master_df %>%
group_by(tool, metric) %>%
summarise(
n_genomes = n(),
n_da_genomes = sum(is_DA, na.rm = TRUE),
TP = sum(adjP < 0.05 & is_DA, na.rm = TRUE),
FP = sum(adjP < 0.05 & !is_DA, na.rm = TRUE),
TN = sum(adjP >= 0.05 & !is_DA, na.rm = TRUE),
FN = sum(adjP >= 0.05 & is_DA, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
Precision = ifelse(TP + FP > 0, TP / (TP + FP), 0),
Recall = ifelse(TP + FN > 0, TP / (TP + FN), 0),
F1_Score = ifelse(Precision + Recall > 0, 2 * Precision * Recall / (Precision + Recall), 0),
Specificity = ifelse(TN + FP > 0, TN / (TN + FP), 0),
NPV = ifelse(TN + FN > 0, TN / (TN + FN), 0)
) %>%
mutate(across(where(is.numeric), ~round(., 3)))
calculate_auc_metrics <- function(df, show_progress = interactive()) {
unique_combinations <- df %>% distinct(tool, metric)
auc_results <- data.frame(
tool = character(),
metric = character(),
auc_roc = numeric(),
auc_pr = numeric(),
stringsAsFactors = FALSE
)
if (show_progress) {
pb <- txtProgressBar(min = 0, max = nrow(unique_combinations), style = 3)
}
for(i in 1:nrow(unique_combinations)) {
current_tool <- unique_combinations$tool[i]
current_metric <- unique_combinations$metric[i]
tool_data <- df %>%
filter(tool == current_tool, metric == current_metric, !is.na(score), !is.na(is_DA))
auc_roc_val <- NA_real_
auc_pr_val <- NA_real_
if (nrow(tool_data) > 5 && sum(tool_data$is_DA) > 0 && sum(!tool_data$is_DA) > 0) {
# Calculate AUC-ROC
tryCatch({
roc_obj <- pROC::roc(tool_data$is_DA, tool_data$score, quiet = TRUE)
auc_roc_val <- as.numeric(pROC::auc(roc_obj))
}, error = function(e) {
})
# Calculate AUC-PR
da_scores <- tool_data$score[tool_data$is_DA == TRUE]
non_da_scores <- tool_data$score[tool_data$is_DA == FALSE]
da_scores <- da_scores[!is.na(da_scores)]
non_da_scores <- non_da_scores[!is.na(non_da_scores)]
if (length(da_scores) > 0 && length(non_da_scores) > 0) {
tryCatch({
pr_obj <- PRROC::pr.curve(scores.class0 = da_scores,
scores.class1 = non_da_scores,
curve = FALSE)
auc_pr_val <- pr_obj$auc.integral
}, error = function(e) {
# Silent error handling
})
}
}
auc_results <- rbind(auc_results, data.frame(
tool = current_tool,
metric = current_metric,
auc_roc = auc_roc_val,
auc_pr = auc_pr_val,
stringsAsFactors = FALSE
))
if (show_progress) setTxtProgressBar(pb, i)
}
if (show_progress) close(pb)
return(auc_results)
}
auc_metrics <- calculate_auc_metrics(master_df, show_progress = FALSE) %>%
mutate(across(where(is.numeric), ~round(., 3)))
# Combine all performance metrics
complete_performance <- performance_metrics %>%
left_join(auc_metrics, by = c("tool", "metric"))
# Display comprehensive performance table
DT::datatable(complete_performance %>%
select(tool, metric, F1_Score, auc_pr, Precision, Recall, auc_roc) %>%
arrange(metric, desc(F1_Score)),
caption = "Complete Performance Matrix",
options = list(pageLength = 15, scrollX = TRUE)) %>%
formatRound(columns = c("F1_Score", "auc_pr", "Precision", "Recall", "auc_roc"), digits = 3)
Summarises aggregate performance statistics to rank abundance metrics across all tools.
# Metric performance summary
metric_summary <- complete_performance %>%
group_by(metric) %>%
summarise(
n_tools = n(),
mean_TP = mean(TP, na.rm = TRUE),
mean_FP = mean(FP, na.rm = TRUE),
mean_TN = mean(TN, na.rm = TRUE),
mean_FN = mean(FN, na.rm = TRUE),
median_f1 = median(F1_Score, na.rm = TRUE),
max_f1 = max(F1_Score, na.rm = TRUE),
median_auc_pr = median(auc_pr, na.rm = TRUE),
max_auc_pr = max(auc_pr, na.rm = TRUE),
median_precision = median(Precision, na.rm = TRUE),
median_recall = median(Recall, na.rm = TRUE),
.groups = 'drop'
) %>%
arrange(desc(median_f1)) %>%
mutate(across(where(is.numeric), ~round(., 3)))
# Create interactive table with additional confusion-matrix elements
metric_summary %>%
select(metric, n_tools, mean_TP, mean_FP, mean_TN, mean_FN,
median_precision, median_recall, median_f1, max_f1, median_auc_pr, max_auc_pr) %>%
arrange(desc(median_f1)) %>%
kable(caption = "Metric Performance Summary (TP/FP = mean; others = median)",
col.names = c("Metric", "N Tools", "Mean TP", "Mean FP", "Mean TN", "Mean FN",
"Median Precision", "Median Recall", "Median F1", "Max F1", "Median AUC-PR",
"Max AUC-PR")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(1:3, bold = TRUE, color = "white", background = "#2E86AB")
Metric | N Tools | Mean TP | Mean FP | Mean TN | Mean FN | Median Precision | Median Recall | Median F1 | Max F1 | Median AUC-PR | Max AUC-PR |
---|---|---|---|---|---|---|---|---|---|---|---|
breadth | 8 | 79.625 | 19.750 | 0.250 | 0.375 | 0.800 | 1.000 | 0.889 | 0.894 | 0.770 | 0.908 |
tpm | 8 | 78.625 | 19.875 | 0.125 | 1.375 | 0.800 | 1.000 | 0.889 | 0.889 | 0.751 | 0.806 |
counts | 8 | 76.500 | 19.375 | 0.625 | 3.500 | 0.798 | 0.988 | 0.883 | 0.889 | 0.736 | 0.798 |
rpkm | 8 | 51.375 | 16.750 | 6.000 | 25.875 | 0.764 | 0.719 | 0.709 | 0.883 | 0.713 | 0.775 |
coverage | 8 | 39.375 | 13.375 | 19.875 | 27.375 | 0.699 | 0.450 | 0.539 | 0.870 | 0.686 | 0.832 |
relative | 8 | 21.000 | 9.250 | 25.625 | 44.125 | 0.715 | 0.246 | 0.341 | 0.887 | 0.760 | 0.823 |
# Metric comparison box plots
# Define consistent metric order
metric_order <- c("counts", "tpm", "breadth", "coverage", "rpkm", "relative")
p1 <- ggplot(complete_performance, aes(x = factor(metric, levels = metric_order),
y = F1_Score, fill = metric)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
scale_fill_manual(values = metric_colors) +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "A) F1-Score Distribution by Metric",
x = "Abundance Metric", y = "F1-Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
coord_flip()
p2 <- ggplot(complete_performance %>% filter(!is.na(auc_pr)),
aes(x = factor(metric, levels = metric_order), y = auc_pr, fill = metric)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
scale_fill_manual(values = metric_colors) +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "B) AUC-PR Distribution by Metric",
x = "Abundance Metric", y = "AUC-PR") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
coord_flip()
p3 <- ggplot(complete_performance, aes(x = factor(metric, levels = metric_order),
y = Precision, fill = metric)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
scale_fill_manual(values = metric_colors) +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "C) Precision Distribution by Metric",
x = "Abundance Metric", y = "Precision") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
coord_flip()
p4 <- ggplot(complete_performance, aes(x = factor(metric, levels = metric_order),
y = Recall, fill = metric)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
scale_fill_manual(values = metric_colors) +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "D) Recall Distribution by Metric",
x = "Abundance Metric", y = "Recall") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
coord_flip()
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2,
top = "Performance Metrics Distribution Across Abundance Metrics")
Observation: The normalised metrics breadth and TPM sit at the top: both have the highest median F1 (0.889), and breadth also has the highest median AUC-PR (0.770), slightly ahead of raw counts.
# Pairwise significance of F1-Score between abundance metrics
library(tidyr)
metric_levels <- unique(complete_performance$metric)
pairwise_results <- combn(metric_levels, 2, function(m) {
df_wide <- complete_performance %>%
filter(metric %in% m) %>%
select(tool, metric, F1_Score) %>%
pivot_wider(names_from = metric, values_from = F1_Score)
# Ensure both columns exist and have no NA rows
if(nrow(df_wide) == 0 || anyNA(df_wide[, m])) return(NULL)
wt <- wilcox.test(df_wide[[m[1]]], df_wide[[m[2]]], paired = TRUE, exact = FALSE)
data.frame(metric1 = m[1], metric2 = m[2],
median_diff = median(df_wide[[m[1]]] - df_wide[[m[2]]], na.rm = TRUE),
p_value = wt$p.value)
}, simplify = FALSE) %>%
bind_rows() %>%
mutate(p_adj = p.adjust(p_value, method = "BH")) %>%
arrange(p_adj)
DT::datatable(pairwise_results,
caption = "Pairwise Wilcoxon Tests: F1-Score Differences Between Metrics (paired by tool)",
options = list(pageLength = 15, scrollX = TRUE)) %>%
formatRound(columns = c("median_diff", "p_value", "p_adj"), digits = 3)
library(ggplot2)
heat <- pairwise_results %>%
mutate(sig = ifelse(p_adj <= 0.05, "*", "")) %>% # significance flag
complete(metric1 = metric_levels,
metric2 = metric_levels,
fill = list(median_diff = NA, p_adj = NA, sig = "")) %>%
mutate(metric1 = factor(metric1, levels = metric_levels),
metric2 = factor(metric2, levels = metric_levels))
ggplot(heat, aes(metric1, metric2, fill = median_diff)) +
geom_tile(colour = "white") +
geom_text(aes(label = sig), colour = "black", size = 6, vjust = 0.5) +
scale_fill_gradient2(low = "#d73027", mid = "white", high = "#1a9850",
midpoint = 0, na.value = "grey90",
name = "Median F1\ndifference") +
labs(x = NULL, y = NULL,
title = "Pairwise F1-Score Differences Between Abundance Metrics",
subtitle = "* indicates BH-adjusted p ≤ 0.05") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
panel.grid = element_blank())
Observation: Green, starred tiles mark where the row metric’s median F1 is significantly higher than the column metric’s after BH correction. Breadth, counts, and TPM each outperform coverage, relative, and (for breadth and TPM) RPKM, while no significant difference is detected among the top three metrics themselves.
# Filter performance for ANCOM only
ancom_performance <- complete_performance %>%
filter(tool == "ANCOM") %>%
select(metric, F1_Score, auc_pr, Precision, Recall) %>%
arrange(desc(F1_Score))
# Display table with existing aesthetics
ancom_performance %>%
kable(caption = "ANCOM Performance Across Abundance Metrics", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
metric | F1_Score | auc_pr | Precision | Recall |
---|---|---|---|---|
breadth | 0.883 | 0.728 | 0.798 | 0.988 |
counts | 0.869 | 0.659 | 0.800 | 0.950 |
coverage | 0.869 | 0.729 | 0.800 | 0.950 |
rpkm | 0.869 | 0.715 | 0.800 | 0.950 |
tpm | 0.869 | 0.655 | 0.800 | 0.950 |
relative | 0.081 | 0.823 | 1.000 | 0.042 |
Observation: For ANCOM, breadth ranks highest (F1 = 0.883), and every other metric except relative is tied at F1 = 0.869; the differences are not significant.
# Filter for counts and coverage metrics
counts_cov_perf <- complete_performance %>%
filter(metric %in% c("counts", "coverage")) %>%
select(tool, metric, F1_Score, auc_pr, Precision, Recall) %>%
arrange(tool, metric)
# Display in wide format for easier comparison
counts_cov_wide <- counts_cov_perf %>%
pivot_wider(names_from = metric, values_from = c(F1_Score, auc_pr, Precision, Recall)) %>%
arrange(desc(F1_Score_coverage))
counts_cov_wide %>%
kable(caption = "Counts vs Coverage Performance (All Tools)", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
tool | F1_Score_counts | F1_Score_coverage | auc_pr_counts | auc_pr_coverage | Precision_counts | Precision_coverage | Recall_counts | Recall_coverage |
---|---|---|---|---|---|---|---|---|
limma | 0.883 | 0.870 | 0.627 | 0.631 | 0.798 | 0.794 | 0.988 | 0.963 |
ANCOM | 0.869 | 0.869 | 0.659 | 0.729 | 0.800 | 0.800 | 0.950 | 0.950 |
dearseq | 0.889 | 0.816 | 0.775 | 0.671 | 0.800 | 0.718 | 1.000 | 0.944 |
edgeR | 0.883 | 0.559 | 0.652 | 0.665 | 0.798 | 0.679 | 0.988 | 0.475 |
corncob | 0.777 | 0.519 | 0.750 | 0.686 | 0.792 | 0.667 | 0.762 | 0.425 |
metagenomeSeq | 0.889 | 0.500 | 0.798 | 0.832 | 0.800 | 0.806 | 1.000 | 0.362 |
ALDEx2 | 0.876 | 0.208 | 0.721 | 0.727 | 0.796 | 0.625 | 0.975 | 0.125 |
NOISeq | 0.883 | 0.000 | 0.786 | NA | 0.798 | 0.000 | 0.988 | 0.000 |
# Long format for plotting
counts_cov_perf %>%
ggplot(aes(x = tool, y = F1_Score, fill = metric)) +
geom_col(position = position_dodge(width = 0.8), width = 0.7) +
scale_fill_manual(values = metric_colors[c("counts", "coverage")]) +
labs(title = "F1-Score: Counts vs Coverage Across Tools",
x = "DA Tool", y = "F1-Score", fill = "Metric") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Determine best tool for counts and coverage based on F1_Score
best_counts_cov <- complete_performance %>%
filter(metric %in% c("counts", "coverage")) %>%
group_by(metric) %>%
slice_max(order_by = F1_Score, n = 1, with_ties = FALSE) %>%
ungroup() %>%
select(metric, tool, F1_Score, auc_pr, Precision, Recall) %>%
arrange(desc(F1_Score))
best_counts_cov %>%
kable(caption = "Best Performing Method for Counts and Coverage", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
metric | tool | F1_Score | auc_pr | Precision | Recall |
---|---|---|---|---|---|
counts | dearseq | 0.889 | 0.775 | 0.800 | 1.000 |
coverage | limma | 0.870 | 0.631 | 0.794 | 0.963 |
Quantifies how consistently each tool performs across metrics and ranks their stability.
# Tool consistency analysis
tool_rankings <- complete_performance %>%
group_by(metric) %>%
arrange(desc(F1_Score)) %>%
mutate(f1_rank = row_number()) %>%
arrange(desc(auc_pr)) %>%
mutate(auc_pr_rank = row_number()) %>%
ungroup() %>%
mutate(combined_rank = (f1_rank + auc_pr_rank) / 2) %>%
arrange(metric, combined_rank)
tool_consistency <- tool_rankings %>%
group_by(tool) %>%
summarise(
metrics_tested = n(),
mean_f1 = mean(F1_Score, na.rm = TRUE),
sd_f1 = sd(F1_Score, na.rm = TRUE),
mean_auc_pr = mean(auc_pr, na.rm = TRUE),
sd_auc_pr = sd(auc_pr, na.rm = TRUE),
mean_rank = mean(combined_rank, na.rm = TRUE),
rank_consistency = sd(combined_rank, na.rm = TRUE),
.groups = 'drop'
) %>%
arrange(mean_rank) %>%
mutate(across(where(is.numeric), ~round(., 3)))
# Display tool consistency table
tool_consistency %>%
kable(caption = "Tool Consistency Analysis (Ranked by Mean Performance Rank)",
col.names = c("Tool", "Metrics Tested", "Mean F1", "SD F1", "Mean AUC-PR",
"SD AUC-PR", "Mean Rank", "Rank Consistency")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(1:3, bold = TRUE, color = "white", background = "#2E86AB")
Tool | Metrics Tested | Mean F1 | SD F1 | Mean AUC-PR | SD AUC-PR | Mean Rank | Rank Consistency |
---|---|---|---|---|---|---|---|
dearseq | 6 | 0.850 | 0.048 | 0.757 | 0.054 | 3.083 | 1.068 |
metagenomeSeq | 6 | 0.751 | 0.211 | 0.802 | 0.023 | 3.333 | 1.506 |
NOISeq | 6 | 0.650 | 0.380 | 0.777 | 0.054 | 3.750 | 2.622 |
ALDEx2 | 6 | 0.572 | 0.346 | 0.768 | 0.071 | 4.583 | 1.772 |
ANCOM | 6 | 0.740 | 0.323 | 0.718 | 0.061 | 4.750 | 1.782 |
edgeR | 6 | 0.701 | 0.212 | 0.653 | 0.023 | 5.333 | 0.258 |
corncob | 6 | 0.615 | 0.288 | 0.719 | 0.041 | 5.583 | 1.158 |
limma | 6 | 0.788 | 0.231 | 0.648 | 0.027 | 5.583 | 1.068 |
best_tools_per_metric <- tool_rankings %>%
  group_by(metric) %>%
  slice_min(combined_rank, n = 3) %>%
  select(metric, tool, F1_Score, auc_pr, combined_rank) %>%
  arrange(metric, combined_rank)
DT::datatable(best_tools_per_metric,
              caption = "Top 3 Tools per Metric (by Combined F1 + AUC-PR Ranking)",
              options = list(pageLength = 20, scrollX = TRUE)) %>%
  formatRound(columns = c("F1_Score", "auc_pr", "combined_rank"), digits = 3)
Presents cross-metric heatmaps of F1-score and AUC-PR for every tool–metric combination.
# F1-Score heatmap
f1_heatmap_data <- complete_performance %>%
select(tool, metric, F1_Score) %>%
pivot_wider(names_from = metric, values_from = F1_Score) %>%
column_to_rownames("tool") %>%
as.matrix()
# Replace NAs with 0 for visualization
f1_heatmap_data[is.na(f1_heatmap_data)] <- 0
# Create F1-Score heatmap
par(mfrow = c(1, 1))
corrplot(f1_heatmap_data,
method = "color",
is.corr = FALSE,
col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
title = "F1-Score Heatmap: Tools vs Metrics",
mar = c(0,0,2,0),
cl.cex = 0.8)
# AUC-PR heatmap
auc_pr_heatmap_data <- complete_performance %>%
select(tool, metric, auc_pr) %>%
filter(!is.na(auc_pr)) %>%
pivot_wider(names_from = metric, values_from = auc_pr) %>%
column_to_rownames("tool") %>%
as.matrix()
# Fill missing values with 0
auc_pr_complete <- matrix(0, nrow = nrow(f1_heatmap_data), ncol = ncol(f1_heatmap_data))
rownames(auc_pr_complete) <- rownames(f1_heatmap_data)
colnames(auc_pr_complete) <- colnames(f1_heatmap_data)
# Fill in available values
for(i in rownames(auc_pr_heatmap_data)) {
for(j in colnames(auc_pr_heatmap_data)) {
if(i %in% rownames(auc_pr_complete) & j %in% colnames(auc_pr_complete)) {
auc_pr_complete[i, j] <- auc_pr_heatmap_data[i, j]
}
}
}
corrplot(auc_pr_complete,
method = "color",
is.corr = FALSE,
col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
title = "AUC-PR Heatmap: Tools vs Metrics",
mar = c(0,0,2,0),
cl.cex = 0.8)
This section focuses exclusively on ANCOM-BC run on two abundance metrics: raw counts (which carry the genome-length bias) and length-normalised coverage. All other metrics (TPM, RPKM, breadth, relative) are summarised later for completeness.
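The chunks in this section use `ancom_df` and `primary_metrics`, which are created earlier in the full Rmd; a minimal reconstruction from `master_df` might look like the following, assuming the ANCOM-BC results are stored under the tool label "ANCOM" as in the filtering above.
# Assumed setup for this section (defined earlier in the full analysis):
# restrict master_df to the ANCOM(-BC) results and name the two focal metrics.
primary_metrics <- c("counts", "coverage")
ancom_df <- master_df %>%
  filter(tool == "ANCOM") %>%
  select(genome, metric, adjP, score, is_DA)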
# Classification performance for counts & coverage
ancom_perf <- ancom_df %>%
filter(metric %in% primary_metrics) %>%
group_by(metric) %>%
summarise(
TP = sum(adjP < 0.05 & is_DA),
FP = sum(adjP < 0.05 & !is_DA),
TN = sum(adjP >= 0.05 & !is_DA),
FN = sum(adjP >= 0.05 & is_DA),
.groups = 'drop'
) %>%
mutate(
Precision = ifelse(TP + FP > 0, TP / (TP + FP), 0),
Recall = ifelse(TP + FN > 0, TP / (TP + FN), 0),
F1_Score = ifelse(Precision + Recall > 0, 2 * Precision * Recall / (Precision + Recall), 0)
)
# Calculate AUC-PR for each metric
auc_tbl <- ancom_df %>%
filter(metric %in% primary_metrics) %>%
group_by(metric) %>%
summarise(auc_pr = {
da_scores <- score[is_DA]
non_da_scores <- score[!is_DA]
PRROC::pr.curve(scores.class0 = da_scores,
scores.class1 = non_da_scores,
curve = FALSE)$auc.integral
}, .groups = 'drop')
ancom_perf <- left_join(ancom_perf, auc_tbl, by = 'metric') %>%
mutate(across(where(is.numeric), ~round(., 3)))
ancom_perf %>%
kable(caption = "ANCOM-BC: Counts vs Coverage Performance", digits = 3) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
metric | TP | FP | TN | FN | Precision | Recall | F1_Score | auc_pr |
---|---|---|---|---|---|---|---|---|
counts | 76 | 19 | 1 | 4 | 0.8 | 0.95 | 0.869 | 0.659 |
coverage | 76 | 19 | 1 | 4 | 0.8 | 0.95 | 0.869 | 0.729 |
# Side-by-side bar plots: F1 and AUC-PR
library(gridExtra)
p_f1 <- ggplot(ancom_perf, aes(x = metric, y = F1_Score, fill = metric)) +
geom_col(width = 0.6) +
scale_fill_manual(values = metric_colors[c("counts", "coverage")]) +
labs(title = "F1-Score", x = NULL, y = NULL) +
theme_minimal() +
theme(legend.position = "none")
p_auc <- ggplot(ancom_perf, aes(x = metric, y = auc_pr, fill = metric)) +
geom_col(width = 0.6) +
scale_fill_manual(values = metric_colors[c("counts", "coverage")]) +
labs(title = "AUC-PR", x = NULL, y = NULL) +
theme_minimal() +
theme(legend.position = "none")
gridExtra::grid.arrange(p_f1, p_auc, nrow = 1)
For completeness, ANCOM-BC performance on the remaining abundance metrics (breadth, relative, RPKM, TPM):
metric | TP | FP | TN | FN | Precision | Recall | F1_Score | auc_pr |
---|---|---|---|---|---|---|---|---|
breadth | 79 | 20 | 0 | 1 | 0.798 | 0.988 | 0.883 | 0.728 |
relative | 3 | 0 | 29 | 68 | 1.000 | 0.042 | 0.081 | 0.823 |
rpkm | 76 | 19 | 1 | 4 | 0.800 | 0.950 | 0.869 | 0.715 |
tpm | 76 | 19 | 1 | 4 | 0.800 | 0.950 | 0.869 | 0.655 |
# Detailed normalization impact analysis
normalization_analysis <- metric_summary %>%
mutate(
category = case_when(
metric == "counts" ~ "Non-normalized",
metric %in% c("breadth", "coverage") ~ "Length-normalized",
metric == "relative" ~ "Depth-normalized",
metric %in% c("tpm", "rpkm") ~ "Length + Depth-normalized",
TRUE ~ "Other"
)
) %>%
group_by(category) %>%
summarise(
n_metrics = n(),
mean_f1 = mean(median_f1),
mean_auc_pr = mean(median_auc_pr, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(across(where(is.numeric), ~round(., 3)))
normalization_analysis %>%
kable(caption = "Normalization Impact Analysis",
col.names = c("Category", "N Metrics", "Mean F1", "Mean AUC-PR")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
row_spec(which(normalization_analysis$category == "Length + Depth-normalized"),
bold = TRUE, color = "white", background = "#2E86AB")
Category | N Metrics | Mean F1 | Mean AUC-PR |
---|---|---|---|
Depth-normalized | 1 | 0.341 | 0.760 |
Length + Depth-normalized | 2 | 0.799 | 0.732 |
Length-normalized | 2 | 0.714 | 0.728 |
Non-normalized | 1 | 0.883 | 0.736 |
# Calculate improvement of length + depth normalisation relative to raw counts
counts_f1 <- normalization_analysis$mean_f1[normalization_analysis$category == "Non-normalized"]
normalized_f1 <- normalization_analysis$mean_f1[normalization_analysis$category == "Length + Depth-normalized"]
improvement <- ((normalized_f1 - counts_f1) / counts_f1) * 100
cat("\n**Key Finding**: Length/depth normalization do not provide any significant improvement in F1-Score over raw counts\n")
##
## **Key Finding**: Length/depth normalization do not provide any significant improvement in F1-Score over raw counts
Question: Can normalized abundance data (TPM, RPKM) be effectively used with differential abundance tools designed for raw counts?
Answer: Yes, breadth and TPM can be used as alternatives. They are on a par with raw counts in both classification performance (F1) and ranking power (AUC-PR), but provide no significant improvement over raw counts.
In this benchmark, breadth and TPM are better metrics than coverage and relative abundance for DA analysis.
Question: Is there a recommended tool based on the analysis?
Answer: Although some methods rank higher in this benchmark, no single DA algorithm is reliably superior across all datasets and assumptions. Simulated samples may not be representative of real-world complex biological data. Current best practice is to run several complementary tools and focus on features that are supported by a consensus of methods. This multi-tool approach is shown by Willis et al. 2022 and is already common in recent studies. For example, Stewart et al. 2025 analysed differential taxa with six independent packages (ANCOM-BC2, MaAsLin2, ALDEx2, DESeq2, LinDA and ZicoSeq) to ensure robust biological interpretation.
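As an illustration of this consensus strategy (a sketch, not part of the benchmark), the snippet below counts, per genome, how many of the eight tools call it significant at a BH-adjusted p < 0.05 for a chosen metric and keeps genomes supported by a majority of tools; the metric choice and the majority threshold are arbitrary.
# Illustrative consensus call across tools, using master_df from this report.
consensus_calls <- master_df %>%
  filter(metric == "breadth") %>%                 # pick any abundance metric
  group_by(genome) %>%
  summarise(n_tools = n_distinct(tool),
            n_sig   = sum(adjP < 0.05, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(consensus_DA = n_sig >= ceiling(n_tools / 2))  # majority-vote threshold
consensus_calls %>% count(consensus_DA)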
# Identify the top 3 performing tool-metric combinations by F1-Score
best_combos <- complete_performance %>%
filter(!is.na(F1_Score), !is.na(auc_pr)) %>%
arrange(desc(F1_Score), desc(auc_pr)) %>%
slice_head(n = 3) %>%
mutate(label = sprintf("- **%s + %s**: F1 = %.3f, AUC-PR = %.3f", tool, metric, F1_Score, auc_pr))
cat(paste(best_combos$label, collapse = "\n"))
## - **ALDEx2 + breadth**: F1 = 0.894, AUC-PR = 0.908
## - **corncob + breadth**: F1 = 0.894, AUC-PR = 0.722
## - **dearseq + breadth**: F1 = 0.889, AUC-PR = 0.813
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/mrahman/miniconda3/envs/r_env/lib/libopenblasp-r0.3.27.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Berlin
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] viridis_0.6.5 viridisLite_0.4.2 RColorBrewer_1.1-3 plotly_4.11.0
## [5] DT_0.33 kableExtra_1.4.0 knitr_1.50 corrplot_0.95
## [9] PRROC_1.4 rlang_1.1.6 pROC_1.18.5 gridExtra_2.3
## [13] lubridate_1.9.4 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [17] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.3.0
## [21] ggplot2_3.5.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0 htmlwidgets_1.6.4
## [5] tzdb_0.5.0 vctrs_0.6.5 tools_4.3.3 crosstalk_1.2.1
## [9] generics_0.1.4 parallel_4.3.3 pkgconfig_2.0.3 data.table_1.17.6
## [13] lifecycle_1.0.4 compiler_4.3.3 farver_2.1.2 textshaping_1.0.1
## [17] htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10 lazyeval_0.2.2
## [21] pillar_1.10.2 crayon_1.5.3 jquerylib_0.1.4 cachem_1.1.0
## [25] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 labeling_0.4.3
## [29] fastmap_1.2.0 grid_4.3.3 cli_3.6.5 magrittr_2.0.3
## [33] dichromat_2.0-0.1 withr_3.0.2 scales_1.4.0 bit64_4.6.0-1
## [37] timechange_0.3.0 rmarkdown_2.29 httr_1.4.7 bit_4.6.0
## [41] hms_1.1.3 evaluate_1.0.4 Rcpp_1.0.14 glue_1.8.0
## [45] xml2_1.3.8 svglite_2.2.1 rstudioapi_0.17.1 vroom_1.6.5
## [49] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9 systemfonts_1.2.3
Lists locations of data, scripts, and pipeline resources for transparent reuse.
core5:/home/mrahman/magician
core5:/local/mrahman/magician/result/
core5:/home/mrahman/magician/make_sample_distribution.py
core5:/home/mrahman/magician/workflow/scripts/run_benchdamic_ALL.R
core5:/home/mrahman/da_analysis/Rmds/complete_multi_metric_analysis_HOMD.Rmd