This analysis addresses the methodological question raised in ANCOM-BC Issue #196: Can normalized abundance data be effectively used with differential abundance tools designed for raw counts?


1. Introduction and Motivation

Background

Many differential abundance (DA) tools were originally designed for raw read counts, where library-size differences are informative and are corrected internally (e.g., via size-factor estimation or offsets for library size and gene length). For genome-level metagenomic data, however, raw counts are dominated by genome length; common practice is therefore to normalise reads to TPM/RPKM or coverage before analysis. This raises a methodological question: does feeding pre-normalised values into count-based DA tools change the results?
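For intuition, the sketch below shows the kind of internal correction count-based tools apply, here a DESeq-style median-of-ratios size factor; `counts_mat` is a hypothetical genomes × samples matrix of raw counts, not an object from this analysis.

# Illustrative sketch of median-of-ratios size factors (DESeq-style).
# `counts_mat` is a hypothetical genomes x samples matrix of raw read counts.
estimate_size_factors <- function(counts_mat) {
  log_geomeans <- rowMeans(log(counts_mat))           # per-genome log geometric mean
  keep <- is.finite(log_geomeans)                     # ignore genomes with any zero count
  apply(counts_mat, 2, function(col) {
    exp(median(log(col[keep]) - log_geomeans[keep]))  # per-sample median count/geomean ratio
  })
}

# Example: divide each sample's counts by its size factor
# counts_mat  <- matrix(rpois(100 * 40, lambda = 50), nrow = 100)
# norm_counts <- sweep(counts_mat, 2, estimate_size_factors(counts_mat), "/")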

This study evaluates 48 metric × method combinations (6 metrics × 8 DA tools) using CAMISIM synthetic communities with known ground truth.

Research Questions

  1. Cross-metric performance: How do DA tools perform across different abundance metrics?
  2. Normalization impact: Does length/depth normalization improve or reduce DA detection power?
  3. Method preferences: Do some DA tools work better with specific abundance metrics?
  4. Consistency evaluation: Are DA results consistent across metrics for individual tools?

Study Design

  • Simulation: 100 genomes selected randomly from the Human Oral Microbiome Database (HOMD), 80 of which are truly differentially abundant.
  • Synthetic sample generation: CAMISIM created 40 metagenomic samples (20 control, 20 treatment)
  • Upstream processing: Each sample was processed with the standard MAGician workflow (repository).
  • Abundance Metrics: 6 metrics (counts, TPM, RPKM, relative abundance, breadth, coverage)
  • Coverage quantification: Genome-level abundance metrics were computed using CoverM (repository).
  • DA Methods: 8 tools (ALDEx2, ANCOM, corncob, dearseq, edgeR, limma, metagenomeSeq, NOISeq) run via benchdamic (repository).
  • Evaluation: Binary classification and threshold-free metrics (F1-Score, AUC-PR, Precision, Recall, Specificity, NPV)
  • Statistical Tests: paired Wilcoxon signed-rank tests for F1-Score differences between metrics

Definitions

Abundance-metric normalisation cheat-sheet

| Metric | Length normalisation | Depth normalisation | What the number means |
|---|---|---|---|
| counts | ✗ | ✗ | Number of read pairs whose best hit is this genome (raw reads) |
| relative (%) | ✗ | ✓ (values sum to 100 %) | Share of all mapped reads that belong to the genome |
| breadth | ✓ (divide by length) | ✗ | Fraction of genome bases with ≥ 1× read coverage |
| coverage (mean depth) | ✓ (divide by length) | ✗ * | Mean per-base depth across the genome |
| RPKM | ✓ (/ kb) | ✓ (/ 10⁶ reads) | Reads-per-kilobase-per-million mapped reads |
| TPM | ✓ (/ kb) | ✓ (rescaled to 10⁶) | Transcripts-per-million; similar to RPKM but sample-wise rescaled |

Metric categories used throughout this report:

Non-normalized: counts
Length-normalized: breadth, coverage
Depth-normalized: relative (%)
Length + Depth-normalized: RPKM, TPM

* Coverage still increases with sequencing depth, but it no longer carries the multiplicative genome-length bias.
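To make the cheat-sheet concrete, the sketch below shows how the depth- and length-normalised metrics relate to raw counts. `counts_mat` (genomes × samples) and `genome_len_bp` (per-genome lengths in bp) are hypothetical inputs; in this analysis breadth and mean coverage are taken directly from CoverM rather than recomputed from counts.

# Illustrative only: `counts_mat` (genomes x samples, read-pair counts) and
# `genome_len_bp` (named vector of genome lengths in bp) are hypothetical inputs.
relative_pct <- sweep(counts_mat, 2, colSums(counts_mat), "/") * 100  # depth-normalised share

rpk  <- counts_mat / (genome_len_bp / 1e3)             # reads per kilobase (length-normalised)
rpkm <- sweep(rpk, 2, colSums(counts_mat) / 1e6, "/")  # ... per million mapped reads
tpm  <- sweep(rpk, 2, colSums(rpk) / 1e6, "/")         # rescaled so each sample sums to 1e6

# Mean coverage can be approximated from counts (CoverM computes it from alignments);
# breadth needs per-base depth and cannot be derived from a counts matrix alone.
bases_per_pair  <- 2 * 150                             # assumed read-pair length, illustration only
coverage_approx <- counts_mat * bases_per_pair / genome_len_bp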

Data Loading and Preprocessing

Details data import, cleaning, and integration steps used to construct the analysis dataset.

Genomic Sample Distribution

The genomes and their distribution across samples were defined by the CAMISIM design; the ground-truth labels used for evaluation are loaded below.

Ground Truth

# Load ground truth
truth_table <- read_csv('/home/mrahman/da_analysis/data_HOMD/design/truth_table.csv') %>%
  mutate(genome = str_replace(genome, "\\.fna$", "")) %>%
  mutate(genome = str_replace_all(genome, "[^A-Za-z0-9]", "")) %>%
  distinct(genome, .keep_all = TRUE)

cat("DA genomes:", sum(truth_table$is_DA), "\n")
## DA genomes: 80
cat("Non-DA genomes:", sum(!truth_table$is_DA), "\n")
## Non-DA genomes: 20
truth_table %>%
  group_by(is_DA) %>%
  summarise(
    n_genomes = n(),
    mean_log2FC = mean(true_log2FC, na.rm = TRUE),
    min_log2FC = min(true_log2FC, na.rm = TRUE),
    max_log2FC = max(true_log2FC, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  kable(caption = "Ground Truth Summary", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Ground Truth Summary
is_DA n_genomes mean_log2FC min_log2FC max_log2FC
FALSE 20 0.000 0 0
TRUE 80 2.087 1 3
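For reference, a minimal sketch of how a truth table with this structure could be constructed (80 DA genomes with log2 fold changes between 1 and 3, 20 null genomes). The actual design was produced by the MAGician sample-distribution script, so the sampling below is illustrative only.

# Illustrative sketch only - not the script used to generate the actual design
set.seed(1)
truth_sketch <- tibble(
  genome      = sprintf("genome%03d", 1:100),
  is_DA       = c(rep(TRUE, 80), rep(FALSE, 20)),
  # DA genomes get a log2 fold change between 1 and 3; non-DA genomes get 0
  true_log2FC = ifelse(is_DA, round(runif(100, min = 1, max = 3), 2), 0)
)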

Simulated read-depth per sample

The CAMISIM simulation generated samples with varying sequencing depths to mimic realistic variation in sequencing effort:

## Sequencing depth summary:
## - Total samples: 40
## - Range: 18,802,195 - 58,149,611 read pairs
## - Mean: 38,841,542 read pairs
## - Median: 41,634,552 read pairs
## - Coefficient of variation: 0.331
Sequencing depth (read pairs) per simulated sample (first 5 entries)
Sample Read Pairs
C01_rep01 51,197,632
C01_rep02 38,972,502
C02_rep01 43,481,293
C02_rep02 27,967,003
C03_rep01 48,070,192
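The chunk that produced this summary is not echoed; a minimal sketch of the computation, using a toy stand-in for the per-sample read-pair counts (the real values are parsed from the CAMISIM output):

# Sketch only: toy stand-in for the per-sample read-pair counts produced by CAMISIM
set.seed(42)
depth_tbl <- tibble(sample     = sprintf("S%02d", 1:40),
                    read_pairs = round(runif(40, 1.9e7, 5.8e7)))

depth_tbl %>%
  summarise(
    n_samples    = n(),
    min_pairs    = min(read_pairs),
    max_pairs    = max(read_pairs),
    mean_pairs   = mean(read_pairs),
    median_pairs = median(read_pairs),
    cv           = sd(read_pairs) / mean(read_pairs)  # coefficient of variation
  )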

Loading DA Results

# Define available metrics
metrics <- c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")
cat("Available metrics:", paste(metrics, collapse = ", "), "\n")
## Available metrics: counts, tpm, rpkm, relative, breadth, coverage
# load DA results for a specific metric
load_metric_results <- function(metric) {
  da_dir <- paste0('/home/mrahman/da_analysis/data_HOMD/da/', metric)
  da_files <- list.files(da_dir, pattern = '^DA_.*\\.csv$', full.names = TRUE) %>%
            .[!grepl("DA_basic\\.csv$", .)] # basic results are invalid 0s

  # Read a single DA result file and standardise its columns
  read_da_file <- function(path) {
    tool <- basename(path) %>% str_replace_all('^DA_|\\.csv$', '')
    
    df <- read_csv(path, col_names = FALSE, show_col_types = FALSE) %>%
      rename(genome = X1) %>%
      filter(genome != "genome") %>% # drop header rows read in as data
      mutate(tool = tool,
             metric = metric,
             genome = str_replace(genome, "\\.fna$", ""),
             genome = str_replace_all(genome, "[^A-Za-z0-9]", ""))
    
    # Handle different column structures: X2 = raw p-value, X3 = adjusted p-value
    # (check for the columns themselves, since mutate() above always adds two columns)
    if (all(c("X2", "X3") %in% names(df))) {
      df <- df %>% 
        rename(rawP = X2, adjP = X3) %>%
        mutate(rawP = as.numeric(rawP), adjP = as.numeric(adjP))
    } else {
      df <- df %>% mutate(rawP = NA_real_, adjP = NA_real_)
    }
    
    return(df)
  }
  
  result <- map_dfr(da_files, read_da_file) %>%
    distinct(tool, genome, .keep_all = TRUE) %>%
    group_by(tool, metric) %>%
    mutate(adjP = p.adjust(rawP, method = "BH")) %>% # recompute BH-adjusted p-values consistently across tools
    ungroup() %>%
    mutate(adjP = replace_na(adjP, 1))
  
  return(result)
}

all_metrics_results <- map_dfr(metrics, load_metric_results)


master_df <- all_metrics_results %>%
  left_join(truth_table, by = 'genome') %>%
  complete(genome, tool, metric, fill = list(adjP = 1)) %>%
  mutate(score = -log10(adjP),
         is_DA = replace_na(is_DA, FALSE))

cat("Total combinations:", nrow(master_df), "\n")
## Total combinations: 4800
# Show available combinations
combinations <- master_df %>% 
  count(metric, tool) %>% 
  arrange(metric, tool)

DT::datatable(combinations, 
              caption = "Available Tool-Metric Combinations",
              options = list(pageLength = 5, scrollX = TRUE))

3. Performance Analysis Across Metrics

Assesses tool–metric combinations using classification metrics such as F1-score and AUC-PR.

performance_metrics <- master_df %>%
  group_by(tool, metric) %>%
  summarise(
    n_genomes = n(),
    n_da_genomes = sum(is_DA, na.rm = TRUE),
    TP = sum(adjP < 0.05 & is_DA, na.rm = TRUE),
    FP = sum(adjP < 0.05 & !is_DA, na.rm = TRUE),
    TN = sum(adjP >= 0.05 & !is_DA, na.rm = TRUE),
    FN = sum(adjP >= 0.05 & is_DA, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    Precision = ifelse(TP + FP > 0, TP / (TP + FP), 0),
    Recall = ifelse(TP + FN > 0, TP / (TP + FN), 0),
    F1_Score = ifelse(Precision + Recall > 0, 2 * Precision * Recall / (Precision + Recall), 0),
    Specificity = ifelse(TN + FP > 0, TN / (TN + FP), 0),
    NPV = ifelse(TN + FN > 0, TN / (TN + FN), 0)
  ) %>%
  mutate(across(where(is.numeric), ~round(., 3)))

# Threshold-free metrics: compute AUC-ROC and AUC-PR for each tool-metric combination
calculate_auc_metrics <- function(df, show_progress = interactive()) {
  unique_combinations <- df %>% distinct(tool, metric)
  
  auc_results <- data.frame(
    tool = character(),
    metric = character(),
    auc_roc = numeric(),
    auc_pr = numeric(),
    stringsAsFactors = FALSE
  )
  
  if (show_progress) {
    pb <- txtProgressBar(min = 0, max = nrow(unique_combinations), style = 3)
  }
  
  for(i in 1:nrow(unique_combinations)) {
    current_tool <- unique_combinations$tool[i]
    current_metric <- unique_combinations$metric[i]
    
    tool_data <- df %>% 
      filter(tool == current_tool, metric == current_metric, !is.na(score), !is.na(is_DA))
    
    auc_roc_val <- NA_real_
    auc_pr_val <- NA_real_
    
    if (nrow(tool_data) > 5 && sum(tool_data$is_DA) > 0 && sum(!tool_data$is_DA) > 0) {
      # Calculate AUC-ROC
      tryCatch({
        roc_obj <- pROC::roc(tool_data$is_DA, tool_data$score, quiet = TRUE)
        auc_roc_val <- as.numeric(pROC::auc(roc_obj))
      }, error = function(e) {
      })
      
      # Calculate AUC-PR
      da_scores <- tool_data$score[tool_data$is_DA == TRUE]
      non_da_scores <- tool_data$score[tool_data$is_DA == FALSE]
      da_scores <- da_scores[!is.na(da_scores)]
      non_da_scores <- non_da_scores[!is.na(non_da_scores)]
      
      if (length(da_scores) > 0 && length(non_da_scores) > 0) {
        tryCatch({
          pr_obj <- PRROC::pr.curve(scores.class0 = da_scores, 
                                   scores.class1 = non_da_scores, 
                                   curve = FALSE)
          auc_pr_val <- pr_obj$auc.integral
        }, error = function(e) {
          # Silent error handling
        })
      }
    }
    
    auc_results <- rbind(auc_results, data.frame(
      tool = current_tool,
      metric = current_metric,
      auc_roc = auc_roc_val,
      auc_pr = auc_pr_val,
      stringsAsFactors = FALSE
    ))
    
    if (show_progress) setTxtProgressBar(pb, i)
  }
  
  if (show_progress) close(pb)
  return(auc_results)
}

auc_metrics <- calculate_auc_metrics(master_df, show_progress = FALSE) %>%
  mutate(across(where(is.numeric), ~round(., 3)))

# Combine all performance metrics
complete_performance <- performance_metrics %>%
  left_join(auc_metrics, by = c("tool", "metric"))


# Display comprehensive performance table
DT::datatable(complete_performance %>% 
                select(tool, metric, F1_Score, auc_pr, Precision, Recall, auc_roc) %>%
                arrange(metric, desc(F1_Score)), 
              caption = "Complete Performance Matrix",
              options = list(pageLength = 15, scrollX = TRUE)) %>%
  formatRound(columns = c("F1_Score", "auc_pr", "Precision", "Recall", "auc_roc"), digits = 3)

4. Metric Performance Summary

Summarises aggregate performance statistics to rank abundance metrics across all tools.

# Metric performance summary
metric_summary <- complete_performance %>%
  group_by(metric) %>%
  summarise(
    n_tools = n(),
    mean_TP = mean(TP, na.rm = TRUE),
    mean_FP = mean(FP, na.rm = TRUE),
    mean_TN = mean(TN, na.rm = TRUE),
    mean_FN = mean(FN, na.rm = TRUE),
    median_f1 = median(F1_Score, na.rm = TRUE),
    max_f1 = max(F1_Score, na.rm = TRUE),
    median_auc_pr = median(auc_pr, na.rm = TRUE),
    max_auc_pr = max(auc_pr, na.rm = TRUE),
    median_precision = median(Precision, na.rm = TRUE),
    median_recall = median(Recall, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(median_f1)) %>%
  mutate(across(where(is.numeric), ~round(., 3)))

# Create interactive table with additional confusion-matrix elements
metric_summary %>%
  select(metric, n_tools, mean_TP, mean_FP, mean_TN, mean_FN,
         median_precision, median_recall, median_f1, max_f1, median_auc_pr, max_auc_pr) %>%
  arrange(desc(median_f1)) %>%
  kable(caption = "Metric Performance Summary (TP/FP = mean; others = median)", 
        col.names = c("Metric", "N Tools", "Mean TP", "Mean FP", "Mean TN", "Mean FN",
                     "Median Precision", "Median Recall", "Median F1", "Max F1", "Median AUC-PR", 
                     "Max AUC-PR")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(1:3, bold = TRUE, color = "white", background = "#2E86AB")
Metric Performance Summary (confusion-matrix counts = mean across tools; performance scores = median or max across tools)
Metric N Tools Mean TP Mean FP Mean TN Mean FN Median Precision Median Recall Median F1 Max F1 Median AUC-PR Max AUC-PR
breadth 8 79.625 19.750 0.250 0.375 0.800 1.000 0.889 0.894 0.770 0.908
tpm 8 78.625 19.875 0.125 1.375 0.800 1.000 0.889 0.889 0.751 0.806
counts 8 76.500 19.375 0.625 3.500 0.798 0.988 0.883 0.889 0.736 0.798
rpkm 8 51.375 16.750 6.000 25.875 0.764 0.719 0.709 0.883 0.713 0.775
coverage 8 39.375 13.375 19.875 27.375 0.699 0.450 0.539 0.870 0.686 0.832
relative 8 21.000 9.250 25.625 44.125 0.715 0.246 0.341 0.887 0.760 0.823
# Metric comparison box plots
# Define consistent metric order
metric_order <- c("counts", "tpm", "breadth", "coverage", "rpkm", "relative")

p1 <- ggplot(complete_performance, aes(x = factor(metric, levels = metric_order), 
                                       y = F1_Score, fill = metric)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
  scale_fill_manual(values = metric_colors) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "A) F1-Score Distribution by Metric",
       x = "Abundance Metric", y = "F1-Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  coord_flip()

p2 <- ggplot(complete_performance %>% filter(!is.na(auc_pr)), 
             aes(x = factor(metric, levels = metric_order), y = auc_pr, fill = metric)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
  scale_fill_manual(values = metric_colors) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "B) AUC-PR Distribution by Metric",
       x = "Abundance Metric", y = "AUC-PR") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  coord_flip()

p3 <- ggplot(complete_performance, aes(x = factor(metric, levels = metric_order), 
                                       y = Precision, fill = metric)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
  scale_fill_manual(values = metric_colors) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "C) Precision Distribution by Metric",
       x = "Abundance Metric", y = "Precision") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  coord_flip()

p4 <- ggplot(complete_performance, aes(x = factor(metric, levels = metric_order), 
                                       y = Recall, fill = metric)) +
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.6, size = 2) +
  scale_fill_manual(values = metric_colors) +
  scale_y_continuous(limits = c(0, 1)) +
  labs(title = "D) Recall Distribution by Metric",
       x = "Abundance Metric", y = "Recall") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  coord_flip()

gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, 
                       top = "Performance Metrics Distribution Across Abundance Metrics")

Observation: The normalised metrics breadth and TPM sit at the top: both reach the highest median F1 (0.889, vs 0.883 for counts), and breadth also has the highest median AUC-PR, slightly outperforming raw counts.

# Pairwise significance of F1-Score between abundance metrics
library(tidyr)

metric_levels <- unique(complete_performance$metric)

pairwise_results <- combn(metric_levels, 2, function(m) {
  df_wide <- complete_performance %>%
    filter(metric %in% m) %>%
    select(tool, metric, F1_Score) %>%
    pivot_wider(names_from = metric, values_from = F1_Score)
  # Ensure both columns exist and have no NA rows
  if(nrow(df_wide) == 0 || anyNA(df_wide[, m])) return(NULL)
  wt <- wilcox.test(df_wide[[m[1]]], df_wide[[m[2]]], paired = TRUE, exact = FALSE)
  data.frame(metric1 = m[1], metric2 = m[2],
             median_diff = median(df_wide[[m[1]]] - df_wide[[m[2]]], na.rm = TRUE),
             p_value = wt$p.value)
}, simplify = FALSE) %>%
  bind_rows() %>%
  mutate(p_adj = p.adjust(p_value, method = "BH")) %>%
  arrange(p_adj)

DT::datatable(pairwise_results,
              caption = "Pairwise Wilcoxon Tests: F1-Score Differences Between Metrics (paired by tool)",
              options = list(pageLength = 15, scrollX = TRUE)) %>%
  formatRound(columns = c("median_diff", "p_value", "p_adj"), digits = 3)
library(ggplot2)

heat <- pairwise_results %>%
  mutate(sig = ifelse(p_adj <= 0.05, "*", "")) %>%          # significance flag
  complete(metric1 = metric_levels,
           metric2 = metric_levels,
           fill = list(median_diff = NA, p_adj = NA, sig = "")) %>% 
  mutate(metric1 = factor(metric1, levels = metric_levels),
         metric2 = factor(metric2, levels = metric_levels))

ggplot(heat, aes(metric1, metric2, fill = median_diff)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = sig), colour = "black", size = 6, vjust = 0.5) +
  scale_fill_gradient2(low = "#d73027", mid = "white", high = "#1a9850",
                       midpoint = 0, na.value = "grey90",
                       name = "Median F1\ndifference") +
  labs(x = NULL, y = NULL,
       title = "Pairwise F1-Score Differences Between Abundance Metrics",
       subtitle = "* indicates BH-adjusted p ≤ 0.05") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid = element_blank())

Observation: Starred tiles mark comparisons with BH-adjusted p ≤ 0.05; green shading indicates that the row metric’s median F1 exceeds the column metric’s. Breadth, counts and TPM each outperform coverage and relative abundance (and, for breadth and TPM, also RPKM), while no significant difference is detected among the top three metrics themselves.

ANCOM Performance Across Metrics

# Filter performance for ANCOM only
ancom_performance <- complete_performance %>%
  filter(tool == "ANCOM") %>%
  select(metric, F1_Score, auc_pr, Precision, Recall) %>%
  arrange(desc(F1_Score))

# Display table with existing aesthetics
ancom_performance %>%
  kable(caption = "ANCOM Performance Across Abundance Metrics", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
ANCOM Performance Across Abundance Metrics
metric F1_Score auc_pr Precision Recall
breadth 0.883 0.728 0.798 0.988
counts 0.869 0.659 0.800 0.950
coverage 0.869 0.729 0.800 0.950
rpkm 0.869 0.715 0.800 0.950
tpm 0.869 0.655 0.800 0.950
relative 0.081 0.823 1.000 0.042

Observation: For ANCOM-BC, breadth ranks first (F1 = 0.883), while every other metric except relative is tied at F1 = 0.869; the differences among them are not significant. The relative metric reaches perfect precision but near-zero recall (0.042), collapsing its F1 to 0.081.

Counts vs Coverage: Performance Across All Methods

# Filter for counts and coverage metrics
counts_cov_perf <- complete_performance %>%
  filter(metric %in% c("counts", "coverage")) %>%
  select(tool, metric, F1_Score, auc_pr, Precision, Recall) %>%
  arrange(tool, metric)

# Display in wide format for easier comparison
counts_cov_wide <- counts_cov_perf %>%
  pivot_wider(names_from = metric, values_from = c(F1_Score, auc_pr, Precision, Recall)) %>%
  arrange(desc(F1_Score_coverage))

counts_cov_wide %>%
  kable(caption = "Counts vs Coverage Performance (All Tools)", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Counts vs Coverage Performance (All Tools)
tool F1_Score_counts F1_Score_coverage auc_pr_counts auc_pr_coverage Precision_counts Precision_coverage Recall_counts Recall_coverage
limma 0.883 0.870 0.627 0.631 0.798 0.794 0.988 0.963
ANCOM 0.869 0.869 0.659 0.729 0.800 0.800 0.950 0.950
dearseq 0.889 0.816 0.775 0.671 0.800 0.718 1.000 0.944
edgeR 0.883 0.559 0.652 0.665 0.798 0.679 0.988 0.475
corncob 0.777 0.519 0.750 0.686 0.792 0.667 0.762 0.425
metagenomeSeq 0.889 0.500 0.798 0.832 0.800 0.806 1.000 0.362
ALDEx2 0.876 0.208 0.721 0.727 0.796 0.625 0.975 0.125
NOISeq 0.883 0.000 0.786 NA 0.798 0.000 0.988 0.000
# Long format for plotting
counts_cov_perf %>%
  ggplot(aes(x = tool, y = F1_Score, fill = metric)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7) +
  scale_fill_manual(values = metric_colors[c("counts", "coverage")]) +
  labs(title = "F1-Score: Counts vs Coverage Across Tools",
       x = "DA Tool", y = "F1-Score", fill = "Metric") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Best Performing Methods for Counts and Coverage

# Determine best tool for counts and coverage based on F1_Score
best_counts_cov <- complete_performance %>%
  filter(metric %in% c("counts", "coverage")) %>%
  group_by(metric) %>%
  slice_max(order_by = F1_Score, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(metric, tool, F1_Score, auc_pr, Precision, Recall) %>%
  arrange(desc(F1_Score))

best_counts_cov %>%
  kable(caption = "Best Performing Method for Counts and Coverage", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Best Performing Method for Counts and Coverage
metric tool F1_Score auc_pr Precision Recall
counts dearseq 0.889 0.775 0.800 1.000
coverage limma 0.870 0.631 0.794 0.963

5. Tool Consistency Analysis

Quantifies how consistently each tool performs across metrics and ranks their stability.

# Tool consistency analysis
tool_rankings <- complete_performance %>%
  group_by(metric) %>%
  arrange(desc(F1_Score)) %>%
  mutate(f1_rank = row_number()) %>%
  arrange(desc(auc_pr)) %>%
  mutate(auc_pr_rank = row_number()) %>%
  ungroup() %>%
  mutate(combined_rank = (f1_rank + auc_pr_rank) / 2) %>%
  arrange(metric, combined_rank)

tool_consistency <- tool_rankings %>%
  group_by(tool) %>%
  summarise(
    metrics_tested = n(),
    mean_f1 = mean(F1_Score, na.rm = TRUE),
    sd_f1 = sd(F1_Score, na.rm = TRUE),
    mean_auc_pr = mean(auc_pr, na.rm = TRUE),
    sd_auc_pr = sd(auc_pr, na.rm = TRUE),
    mean_rank = mean(combined_rank, na.rm = TRUE),
    rank_consistency = sd(combined_rank, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(mean_rank) %>%
  mutate(across(where(is.numeric), ~round(., 3)))

# Display tool consistency table
tool_consistency %>%
  kable(caption = "Tool Consistency Analysis (Ranked by Mean Performance Rank)", 
        col.names = c("Tool", "Metrics Tested", "Mean F1", "SD F1", "Mean AUC-PR", 
                     "SD AUC-PR", "Mean Rank", "Rank Consistency")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(1:3, bold = TRUE, color = "white", background = "#2E86AB")
Tool Consistency Analysis (Ranked by Mean Performance Rank)
Tool Metrics Tested Mean F1 SD F1 Mean AUC-PR SD AUC-PR Mean Rank Rank Consistency
dearseq 6 0.850 0.048 0.757 0.054 3.083 1.068
metagenomeSeq 6 0.751 0.211 0.802 0.023 3.333 1.506
NOISeq 6 0.650 0.380 0.777 0.054 3.750 2.622
ALDEx2 6 0.572 0.346 0.768 0.071 4.583 1.772
ANCOM 6 0.740 0.323 0.718 0.061 4.750 1.782
edgeR 6 0.701 0.212 0.653 0.023 5.333 0.258
corncob 6 0.615 0.288 0.719 0.041 5.583 1.158
limma 6 0.788 0.231 0.648 0.027 5.583 1.068

Best performing tools per metric

best_tools_per_metric <- tool_rankings %>%
  group_by(metric) %>%
  slice_min(combined_rank, n = 3) %>%
  select(metric, tool, F1_Score, auc_pr, combined_rank) %>%
  arrange(metric, combined_rank)

# Create interactive table
DT::datatable(best_tools_per_metric,
              caption = "Top 3 Tools per Metric (by Combined F1 + AUC-PR Ranking)",
              options = list(pageLength = 20, scrollX = TRUE)) %>%
  formatRound(columns = c("F1_Score", "auc_pr", "combined_rank"), digits = 3)


6. Cross-Metric Performance Heatmaps

Visualises F1-Score and AUC-PR for every tool–metric combination as colour-coded heatmaps.

# F1-Score heatmap
f1_heatmap_data <- complete_performance %>%
  select(tool, metric, F1_Score) %>%
  pivot_wider(names_from = metric, values_from = F1_Score) %>%
  column_to_rownames("tool") %>%
  as.matrix()

# Replace NAs with 0 for visualization
f1_heatmap_data[is.na(f1_heatmap_data)] <- 0

# Create F1-Score heatmap
par(mfrow = c(1, 1))
corrplot(f1_heatmap_data, 
         method = "color", 
         is.corr = FALSE,
         col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         title = "F1-Score Heatmap: Tools vs Metrics",
         mar = c(0,0,2,0),
         cl.cex = 0.8)

# AUC-PR heatmap
auc_pr_heatmap_data <- complete_performance %>%
  select(tool, metric, auc_pr) %>%
  filter(!is.na(auc_pr)) %>%
  pivot_wider(names_from = metric, values_from = auc_pr) %>%
  column_to_rownames("tool") %>%
  as.matrix()

# Fill missing values with 0
auc_pr_complete <- matrix(0, nrow = nrow(f1_heatmap_data), ncol = ncol(f1_heatmap_data))
rownames(auc_pr_complete) <- rownames(f1_heatmap_data)
colnames(auc_pr_complete) <- colnames(f1_heatmap_data)

# Fill in available values
for(i in rownames(auc_pr_heatmap_data)) {
  for(j in colnames(auc_pr_heatmap_data)) {
    if(i %in% rownames(auc_pr_complete) & j %in% colnames(auc_pr_complete)) {
      auc_pr_complete[i, j] <- auc_pr_heatmap_data[i, j]
    }
  }
}

corrplot(auc_pr_complete, 
         method = "color", 
         is.corr = FALSE,
         col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
         tl.col = "black", 
         tl.srt = 45,
         addCoef.col = "black",
         number.cex = 0.7,
         title = "AUC-PR Heatmap: Tools vs Metrics",
         mar = c(0,0,2,0),
         cl.cex = 0.8)


8. ANCOM-BC Normalization Impact

8.1 Methods

This section focuses exclusively on ANCOM-BC run on two abundance metrics: raw counts (carrying both genome-length and library-size bias) and length-normalised coverage. All other metrics (TPM, RPKM, breadth, relative) are summarised later for completeness.

# Subset to ANCOM-BC results only (stored under the tool label "ANCOM" in the benchdamic output)
ancom_df <- master_df %>%
  filter(tool == "ANCOM")

# Primary comparison: counts vs coverage
primary_metrics <- c("counts", "coverage")

8.2 Performance Across Metrics

# Classification performance for counts & coverage
ancom_perf <- ancom_df %>%
  filter(metric %in% primary_metrics) %>%
  group_by(metric) %>%
  summarise(
    TP = sum(adjP < 0.05 & is_DA),
    FP = sum(adjP < 0.05 & !is_DA),
    TN = sum(adjP >= 0.05 & !is_DA),
    FN = sum(adjP >= 0.05 & is_DA),
    .groups = 'drop'
  ) %>%
  mutate(
    Precision = ifelse(TP + FP > 0, TP / (TP + FP), 0),
    Recall    = ifelse(TP + FN > 0, TP / (TP + FN), 0),
    F1_Score  = ifelse(Precision + Recall > 0, 2 * Precision * Recall / (Precision + Recall), 0)
  )

# Calculate AUC-PR for each metric
auc_tbl <- ancom_df %>%
  filter(metric %in% primary_metrics) %>%
  group_by(metric) %>%
  summarise(auc_pr = {
    da_scores     <- score[is_DA]
    non_da_scores <- score[!is_DA]
    PRROC::pr.curve(scores.class0 = da_scores,
                    scores.class1 = non_da_scores,
                    curve = FALSE)$auc.integral
  }, .groups = 'drop')

ancom_perf <- left_join(ancom_perf, auc_tbl, by = 'metric') %>%
  mutate(across(where(is.numeric), ~round(., 3)))

ancom_perf %>%
  kable(caption = "ANCOM-BC: Counts vs Coverage Performance", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
ANCOM-BC: Counts vs Coverage Performance
metric TP FP TN FN Precision Recall F1_Score auc_pr
counts 76 19 1 4 0.8 0.95 0.869 0.659
coverage 76 19 1 4 0.8 0.95 0.869 0.729
# Side-by-side bar plots: F1 and AUC-PR
library(gridExtra)

p_f1 <- ggplot(ancom_perf, aes(x = metric, y = F1_Score, fill = metric)) +
  geom_col(width = 0.6) +
  scale_fill_manual(values = metric_colors[c("counts", "coverage")]) +
  labs(title = "F1-Score", x = NULL, y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

p_auc <- ggplot(ancom_perf, aes(x = metric, y = auc_pr, fill = metric)) +
  geom_col(width = 0.6) +
  scale_fill_manual(values = metric_colors[c("counts", "coverage")]) +
  labs(title = "AUC-PR", x = NULL, y = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

gridExtra::grid.arrange(p_f1, p_auc, nrow = 1)

8.4 Supplementary Metrics Table

Supplementary Metrics: ANCOM-BC Performance
metric TP FP TN FN Precision Recall F1_Score auc_pr
breadth 79 20 0 1 0.798 0.988 0.883 0.728
relative 3 0 29 68 1.000 0.042 0.081 0.823
rpkm 76 19 1 4 0.800 0.950 0.869 0.715
tpm 76 19 1 4 0.800 0.950 0.869 0.655
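The chunk that builds this supplementary table is not shown above; it can be reproduced with the same confusion-matrix summary used in 8.2, applied to the remaining metrics (sketch below, reusing `ancom_df` and `primary_metrics` defined earlier; AUC-PR is computed exactly as in 8.2).

# Same confusion-matrix summary as in 8.2, applied to the non-primary metrics
ancom_supp <- ancom_df %>%
  filter(!metric %in% primary_metrics) %>%
  group_by(metric) %>%
  summarise(
    TP = sum(adjP < 0.05 & is_DA),
    FP = sum(adjP < 0.05 & !is_DA),
    TN = sum(adjP >= 0.05 & !is_DA),
    FN = sum(adjP >= 0.05 & is_DA),
    .groups = 'drop'
  ) %>%
  mutate(
    Precision = ifelse(TP + FP > 0, TP / (TP + FP), 0),
    Recall    = ifelse(TP + FN > 0, TP / (TP + FN), 0),
    F1_Score  = ifelse(Precision + Recall > 0,
                       2 * Precision * Recall / (Precision + Recall), 0)
  )

ancom_supp %>%
  kable(caption = "Supplementary Metrics: ANCOM-BC Performance", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)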

9. Key Findings and Statistical Analysis

9.1 Normalization Impact

# Detailed normalization impact analysis
normalization_analysis <- metric_summary %>%
  mutate(
    category = case_when(
      metric == "counts" ~ "Non-normalized",
      metric %in% c("breadth", "coverage") ~ "Length-normalized",
      metric == "relative" ~ "Depth-normalized",
      metric %in% c("tpm", "rpkm") ~ "Length + Depth-normalized",
      TRUE ~ "Other"
    )
  ) %>%
  group_by(category) %>%
  summarise(
    n_metrics = n(),
    mean_f1 = mean(median_f1),
    mean_auc_pr = mean(median_auc_pr, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(across(where(is.numeric), ~round(., 3)))

normalization_analysis %>%
  kable(caption = "Normalization Impact Analysis", 
        col.names = c("Category", "N Metrics", "Mean F1", "Mean AUC-PR")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(which(normalization_analysis$category == "Length + Depth-normalized"), 
           bold = TRUE, color = "white", background = "#2E86AB")
Normalization Impact Analysis
Category N Metrics Mean F1 Mean AUC-PR
Depth-normalized 1 0.341 0.760
Length + Depth-normalized 2 0.799 0.732
Length-normalized 2 0.714 0.728
Non-normalized 1 0.883 0.736
# Calculate improvement of length + depth-normalized metrics over raw counts
counts_f1 <- normalization_analysis$mean_f1[normalization_analysis$category == "Non-normalized"]
normalized_f1 <- normalization_analysis$mean_f1[normalization_analysis$category == "Length + Depth-normalized"]
improvement <- ((normalized_f1 - counts_f1) / counts_f1) * 100

cat("\n**Key Finding**: Length/depth normalization does not provide a significant improvement in F1-Score over raw counts\n")
## 
## **Key Finding**: Length/depth normalization does not provide a significant improvement in F1-Score over raw counts

10. Discussion and Implications

10.1 Answer to ANCOM-BC Issue #196

Question: Can normalized abundance data (TPM, RPKM) be effectively used with differential abundance tools designed for raw counts?

Answer: Yes. Breadth and TPM can be used as alternatives: they are on par with raw counts in overall ranking power (F1 and AUC-PR), but they provide no significant improvement over raw counts.

Among the normalised metrics, breadth and TPM are better suited to DA analysis than coverage and relative abundance.

Question: Is there a recommended tool based on the analysis?

Answer: Although some methods rank higher in this benchmark, no single DA algorithm is reliably superior across all datasets and assumptions, and the simulated samples here may not be representative of complex real-world biological data. Current best practice is therefore to run several complementary tools and focus on features supported by a consensus of methods. This multi-tool approach is described by Willis et al. 2022 and is already common in recent studies; for example, Stewart et al. 2025 analysed differential taxa with six independent packages (ANCOM-BC2, MaAsLin2, ALDEx2, DESeq2, LinDA and ZicoSeq) to ensure robust biological interpretation.
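As a concrete illustration of the consensus strategy, using objects from this analysis (the ≥ 4-of-8 majority threshold is an arbitrary choice for the sketch, not a recommendation from the cited studies):

# Sketch: count, per genome, how many of the 8 tools call it DA (adjP < 0.05)
# on a single metric, then keep genomes supported by a majority of tools.
consensus_calls <- master_df %>%
  filter(metric == "counts") %>%
  group_by(genome) %>%
  summarise(n_tools_signif = sum(adjP < 0.05, na.rm = TRUE), .groups = "drop") %>%
  mutate(consensus_DA = n_tools_signif >= 4)  # arbitrary majority threshold

table(consensus_calls$consensus_DA)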

Best Performing Combinations

# Identify the top 3 performing tool-metric combinations by F1-Score
best_combos <- complete_performance %>%
  filter(!is.na(F1_Score), !is.na(auc_pr)) %>%
  arrange(desc(F1_Score), desc(auc_pr)) %>%
  slice_head(n = 3) %>%
  mutate(label = sprintf("- **%s + %s**: F1 = %.3f, AUC-PR = %.3f", tool, metric, F1_Score, auc_pr))

cat(paste(best_combos$label, collapse = "\n"))
## - **ALDEx2 + breadth**: F1 = 0.894, AUC-PR = 0.908
## - **corncob + breadth**: F1 = 0.894, AUC-PR = 0.722
## - **dearseq + breadth**: F1 = 0.889, AUC-PR = 0.813

Session Information

sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/mrahman/miniconda3/envs/r_env/lib/libopenblasp-r0.3.27.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Berlin
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] viridis_0.6.5      viridisLite_0.4.2  RColorBrewer_1.1-3 plotly_4.11.0     
##  [5] DT_0.33            kableExtra_1.4.0   knitr_1.50         corrplot_0.95     
##  [9] PRROC_1.4          rlang_1.1.6        pROC_1.18.5        gridExtra_2.3     
## [13] lubridate_1.9.4    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4       
## [17] purrr_1.0.4        readr_2.1.5        tidyr_1.3.1        tibble_3.3.0      
## [21] ggplot2_3.5.2      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6      xfun_0.52         bslib_0.9.0       htmlwidgets_1.6.4
##  [5] tzdb_0.5.0        vctrs_0.6.5       tools_4.3.3       crosstalk_1.2.1  
##  [9] generics_0.1.4    parallel_4.3.3    pkgconfig_2.0.3   data.table_1.17.6
## [13] lifecycle_1.0.4   compiler_4.3.3    farver_2.1.2      textshaping_1.0.1
## [17] htmltools_0.5.8.1 sass_0.4.10       yaml_2.3.10       lazyeval_0.2.2   
## [21] pillar_1.10.2     crayon_1.5.3      jquerylib_0.1.4   cachem_1.1.0     
## [25] tidyselect_1.2.1  digest_0.6.37     stringi_1.8.7     labeling_0.4.3   
## [29] fastmap_1.2.0     grid_4.3.3        cli_3.6.5         magrittr_2.0.3   
## [33] dichromat_2.0-0.1 withr_3.0.2       scales_1.4.0      bit64_4.6.0-1    
## [37] timechange_0.3.0  rmarkdown_2.29    httr_1.4.7        bit_4.6.0        
## [41] hms_1.1.3         evaluate_1.0.4    Rcpp_1.0.14       glue_1.8.0       
## [45] xml2_1.3.8        svglite_2.2.1     rstudioapi_0.17.1 vroom_1.6.5      
## [49] jsonlite_2.0.0    R6_2.6.1          plyr_1.8.9        systemfonts_1.2.3

13. Data and Code

Lists locations of data, scripts, and pipeline resources for transparent reuse.

  • MAGician Snakemake pipeline: core5:/home/mrahman/magician
  • MAGician outputs: core5:/local/mrahman/magician/result/
  • Sample-distribution design script: core5:/home/mrahman/magician/make_sample_distribution.py
  • Benchdamic execution script: core5:/home/mrahman/magician/workflow/scripts/run_benchdamic_ALL.R
  • This R Markdown file: core5:/home/mrahman/da_analysis/Rmds/complete_multi_metric_analysis_HOMD.Rmd