This analysis addresses the methodological question raised in ANCOM-BC Issue #196: Can normalized abundance data be effectively used with differential abundance tools designed for raw counts?


1. Introduction and Motivation

Background

Many differential abundance (DA) tools were originally designed for raw read counts where library size differences are informative and corrected internally (e.g. via size-factor estimation, library size, gene length). For genome-level metagenomic data, however, raw counts are dominated by genome length; common practice is therefore to normalise reads to TPM/RPKM or coverage. This creates a methodological question about the potential effect on the results.

This study evaluates 48 metric × method combinations (6 metrics × 8 DA tools) using CAMISIM synthetic communities with known ground truth.

Research Questions

  1. Cross-metric performance: How do DA tools perform across different abundance metrics?
  2. Normalization impact: Does length/depth normalization improve or reduce DA detection power?
  3. Method preferences: Do some DA tools work better with specific abundance metrics?
  4. Consistency evaluation: Are DA results consistent across metrics for individual tools?

Study Design

  • Simulation: 3 sets of data were used: HOMD_Final with 200 genomes (80 truly differentially abundant. 120 truly non-differentially abundant). Genomes were selected randomly from the HOMD dataset. HOMD2 with 200 genomes (80 truly differentially abundant. 120 truly non-differentially abundant). Genomes were selected randomly from the HOMD dataset. mooh_representative with 200 genomes (80 truly differentially abundant. 120 truly non-differentially abundant). Genomes were selected based on Metaphlan outputs.

  • Synthetic sample generation: CAMISIM created 40/60 metagenomic samples (50% control, 50% treatment)

  • Up-stream processing: Each sample was processed with the standard MAGician workflow (repository).

  • Abundance Metrics: 6 metrics (counts, TPM, RPKM, relative abundance, breadth, coverage)

  • Coverage quantification: Genome-level abundance metrics were computed using CoverM ((repository).

  • DA Methods: 6 tools (ALDEx2, ANCOM-BC2, metagenomeSeq, wilcoxon, maaslin2) ran.

  • Evaluation: Binary classification and threshold-free metrics (F1-Score, AUC-PR, Specificity, Sensitivity etc)

Definitions

Abundance-metric normalisation cheat-sheet

Note: These metric names correspond to the final output files generated by the MAGICIAN pipeline. The mapping from CoverM column names to output files is: Read Count → counts, RPKM → rpkm, TPM → tpm, Relative Abundance (%) → relative, Covered Bases → breadth, Mean → coverage.

Metric Length normalisation Depth normalisation What the number means
counts ✗ (raw reads) Number of read pairs whose best hit is this genome
relative % ✓ (values sum to 100 %) Share of all mapped reads that belong to the genome
breadth ✓ (divide by length) Fraction of genome bases with ≥ 1× read coverage
coverage (mean depth) ✓ (divide by length) ✗ * Mean per-base depth across the genome
RPKM ✓ (/ kb) ✓ (/ 10⁶ reads) Reads-per-kilobase-per-million mapped reads
TPM ✓ (/ kb) ✓ (rescaled to 10⁶) Transcripts-per-million; similar to RPKM but sample-wise rescaled

Metric categories used throughout this report:

Non-normalized: counts
Length-normalized: breadth, coverage
Depth-normalized: relative (%)
Length + Depth-normalized: RPKM, TPM

CoverM to Output File Mapping

To avoid confusion, here’s the explicit mapping from CoverM’s internal column names to our final analysis files:

CoverM Column Name Final Output File Metric Name Used in Analysis Biological Interpretation
Read Count counts.tsv counts Raw read pairs mapped to genome
RPKM rpkm.tsv RPKM Reads per kilobase per million
TPM tpm.tsv TPM Transcripts per million
Relative Abundance (%) relative.tsv relative Percentage of total mapped reads
Covered Bases breadth.tsv breadth Fraction of genome with ≥1× coverage
Mean coverage.tsv coverage Mean read depth across genome

Important: Throughout this analysis, when we refer to “breadth” we mean the fraction of genome covered (from CoverM’s Covered Bases), and “coverage” refers to mean depth (from CoverM’s Mean). This follows genomics terminology where breadth = presence/absence and coverage = abundance intensity. (See how they use the term “coverage” here: https://pmc.ncbi.nlm.nih.gov/articles/PMC4992084/)

* Coverage still increases with sequencing depth, but it no longer carries the multiplicative genome-length bias.

Data Source

HOMD Data

The GTDB_taxonomy20250406.csv file was downloaded from the https://homd.org/download/download section (Genomic Meta Information::Batch Download GTDB (V226 taxonomy)). This CSV file contains metadata for 8,489 oral human microbial genomes from the Human Oral Microbiome Database (HOMD), including genome accession IDs, HOMD taxonomy IDs, and full GTDB taxonomic lineages.

From this dataset, 200 random genomes were selected and downloaded using the /home/mrahman/magician/download_genome.py script. The script randomly samples genome accession IDs from the CSV, then downloads the corresponding genomic FASTA files from NCBI’s Assembly database. A random seed ensures reproducibility of the genome selection. The downloaded genome sequences are stored in the data/HOMD/genomes directory as compressed FASTA files (*.fna.gz). The following command was used to download the genomes:

mamba activate magician_base && python /home/mrahman/magician/download_genome.py \
    --csv /home/mrahman/da_analysis/data_HOMD/gtdb_taxonomy/GTDB_taxonomy20250406.csv \
    --n 200 \
    --out /home/mrahman/da_analysis/data_HOMD/genomes \
    --email mobashirrahman@gmail.com

CAMISIM Strain-Madness

Data were downloaded from ENA Project accession number PRJEB50270 (study title: Critical Assessment of Metagenome Interpretation strain madness raw sequencing data, study name: CAMI 2 strain)). The following script were used to download 40 samples out of 100 to keep computational cost reasonable.

mamba activate sra && get_strain_madness_reads.sh /local/mrahman/magician/data/strain_madness_reads 0 39

I realized the dataset is not particularly suitable for our analysis because it is not geared toward DA analysis. For example, it do not have any associated metadata (e.g. clinical metadata, case-control status). I therefore decided to use the HOMD/mooh dataset for our analysis. A possible solution would have been, for example, using SIMBA to add in synthetic spike-ins for selected genomes and simulate case-control status to get a suitable dataset for DA analysis. As a workaround, we used Zeevi et al. dataset from the publication (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03390-9) included in the analysis as it had well-described metadata and already used for DA benchmarkiing by the Wirvel et al study.

CAMISIM Simulation and Metagenomics Downstream Processing

1. Community Design

  • Baseline weights: sampled per genome from a Gamma(shape=2, scale=5) to emulate long‑tailed abundance distributions as implemented in CAMISIM documentation.

  • Differential abundance: randomly selects a fixed number of genomes ( e.g., 80) and splits them evenly into up/down sets (Upregulation and downregulation). Applies multiplicative fold changes to reflect effect sizes.

  • Zero‑sum constraint: rescales only the DA subset so the sum of their abundances matches between groups. This preserves the relative mass of the non‑DA background and avoids any unwanted compositional shifts which may distort the DA methods.

  • Replication and depth: generates 10 controls and 10 treatments with r technical/biological replicates (e.g., r=2 → 40 samples). Library sizes are drawn uniformly from a range (e.g., 15–60M reads) to capture depth variability without confounding group effects.

  • CAMISIM inputs and truth: writes a sample_distributions.tsv (per‑sample weights),, read_depth.txt for sequencing depth, and an “initial” truth_table.csv recording genome‑level DA status and true log2FC for downstream evaluation.

2. Sample Distribution

  • Structure: rows are genomes; columns are per‑replicate samples; entries are integer weights interpreted by CAMISIM as “abundance weights”.

  • Group separation: control replicates use baseline weights; treatment replicates use the enriched weights (baseline modified by assigned folds), after zero‑sum balancing within the DA set.

3. Magician Downstream Processing

Reads are simulated (ART Illumina) and pooled into gzipped R1/R2 per sample. FastQC assesses raw quality; BBDuk trims adapters/low‑quality bases. metaSPAdes is used for metagenome assembly.BBMap aligns trimmed reads back to assemblies; jgi_summarize_bam_contig_depths computes depth tables for binning.

MetaBAT2 clusters contigs into MAGs using depth and composition; dRep dereplicates and compares genomes (Bins/MAGs and references). CheckM quantifies completeness/contamination for bins and references.

CoverM maps paired reads against all genomes (MAGs + references) and exports six metrics (count, RPKM, TPM, relative, covered_bases (“breadth”), mean depth (“coverage”).

4. DA Benchmarking

  1. A script generates truth_table_with_bins.csv by integrating MAG bin annotations into the ground-truth.

  2. This analysis benchmarks differential abundance (DA) methods across six abundance metrics using the HOMD_Final dataset with ground-truth labels. Features are harmonized and filtered; samples are aligned to metadata (control vs case). For each method–metric, per-feature q-values (BH, α=0.05) define DA calls. Results are evaluated against truth, separated by source (reference genomes vs recovered bins), reporting TP, FP, FN, TN, sensitivity, specificity, precision, FDR, accuracy, AUROC, AUPRC, and totals for later analysis.

Data Loading and Preprocessing

Details data import, cleaning, and integration steps used to construct the analysis dataset.

Genomic Sample Distribution

The genomes and their distribution was:

Ground Truth

# Initial ground truth (reference genomes only) for display
truth_table_initial <- read_csv(file.path(data_dirs$HOMD_Final, 'design/truth_table.csv')) %>%
  mutate(genome_accession = str_replace(genome, "\\.fna$", "")) %>%
  mutate(genome = str_replace_all(genome_accession, "[^A-Za-z0-9]", "")) %>%
  mutate(is_DA = case_when(
    is.logical(is_DA) ~ is_DA,
    str_to_lower(as.character(is_DA)) %in% c("true", "t", "1") ~ TRUE,
    str_to_lower(as.character(is_DA)) %in% c("false", "f", "0") ~ FALSE,
    TRUE ~ NA
  )) %>%
  distinct(genome, .keep_all = TRUE)

# Evaluation ground truth (includes bins) used to score methods
truth_table <- read_csv(file.path(data_dirs$HOMD_Final, 'design/truth_table_with_bins.csv')) %>%
  mutate(genome_accession = str_replace(genome, "\\.fna$", "")) %>%
  mutate(genome = str_replace_all(genome_accession, "[^A-Za-z0-9]", "")) %>%
  mutate(is_DA = case_when(
    is.logical(is_DA) ~ is_DA,
    str_to_lower(as.character(is_DA)) %in% c("true", "t", "1") ~ TRUE,
    str_to_lower(as.character(is_DA)) %in% c("false", "f", "0") ~ FALSE,
    TRUE ~ NA
  )) %>%
  distinct(genome, .keep_all = TRUE)

cat("Initial reference-only DA genomes:", sum(truth_table_initial$is_DA), "\n")
## Initial reference-only DA genomes: 80
cat("Initial reference-only non-DA genomes:", sum(!truth_table_initial$is_DA), "\n")
## Initial reference-only non-DA genomes: 120
cat("Evaluation entries (with bins):", nrow(truth_table), "\n")
## Evaluation entries (with bins): 238
truth_table_initial %>%
  group_by(is_DA) %>%
  summarise(
    n_genomes = n(),
    mean_log2FC = mean(true_log2FC, na.rm = TRUE),
    min_log2FC = min(true_log2FC, na.rm = TRUE),
    max_log2FC = max(true_log2FC, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  kable(caption = "Initial Ground Truth Summary (reference genomes)", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Initial Ground Truth Summary (reference genomes)
is_DA n_genomes mean_log2FC min_log2FC max_log2FC
FALSE 120 0.000 0.000 0.000
TRUE 80 -1.263 -5.209 1.559

Simulated read-depth per sample

The CAMISIM simulation generated samples with varying sequencing depths to mimic limited realistic sequencing conditions:

## Sequencing depth summary:
## - Total samples: 40
## - Range: 17,009,209 - 59,564,475 read pairs
## - Mean: 36,954,723 read pairs
## - Median: 36,887,326 read pairs
## - Coefficient of variation: 0.347
Sequencing depth (read pairs) per simulated sample (first 5 entries)
Sample Read Pairs
C01_rep01 24,407,595
C01_rep02 38,628,974
C02_rep01 22,368,809
C02_rep02 22,465,809
C03_rep01 52,633,693

Genomes Recovered from bins

In ideal scenario, we would have recovered all the original genomes as the bins.

Bin recovery:

suppressPackageStartupMessages({
  library(readxl)
})

summaries_dir <- file.path(data_dirs$HOMD_Final, "summaries")

# Helper: normalize IDs to match truth_table/truth_table_initial normalization
normalize_genome_id <- function(x) {
  x %>%
    str_replace("\\.fna$", "") %>%
    str_replace_all("[^A-Za-z0-9]", "")
}

# List bin_summary files
bin_files <- list.files(summaries_dir, pattern = '^bin_summary_.*\\.xlsx$', full.names = TRUE)

# Read a single bin_summary and extract sample name
read_bin_summary <- function(path) {
  sample_id <- basename(path) %>% str_replace("^bin_summary_", "") %>% str_replace("\\.xlsx$", "")
  df <- suppressMessages(readxl::read_xlsx(path))
  # Expect columns: bin_name, closest_genome
  df %>%
    janitor::clean_names() %>%
    transmute(
      sample = sample_id,
      bin_name = as.character(bin_name),
      closest_genome_raw = as.character(closest_genome),
      closest_genome = normalize_genome_id(as.character(closest_genome))
    ) %>%
    distinct(sample, closest_genome, .keep_all = TRUE)
}

if (length(bin_files) == 0) {
  warning("No bin_summary_*.xlsx files found in ", summaries_dir)
  bin_summaries <- tibble(sample = character(), bin_name = character(), closest_genome_raw = character(), closest_genome = character())
} else {
  bin_summaries <- map_dfr(bin_files, read_bin_summary)
}

# Build presence/absence matrix genome × sample
all_samples <- sort(unique(bin_summaries$sample))
all_genomes <- sort(unique(c(truth_table_initial$genome, truth_table$genome, bin_summaries$closest_genome)))

recovery_long <- bin_summaries %>%
  transmute(genome = closest_genome, sample, recovered = 1L) %>%
  complete(genome = all_genomes, sample = all_samples, fill = list(recovered = 0L))

# Join truth covariates
truth_covars <- truth_table_initial %>%
  select(genome, genome_accession, is_DA, true_log2FC, baseline_weight, enriched_weight, direction)

# Ensure plain tibbles to avoid SummarizedExperiment join methods
recovery_long <- tibble::as_tibble(recovery_long)
truth_covars  <- tibble::as_tibble(truth_covars)

recovery_long <- recovery_long %>%
  dplyr::left_join(truth_covars, by = "genome")

# Per-sample summaries
sample_coverage <- recovery_long %>%
  group_by(sample) %>%
  summarise(
    n_recovered = sum(recovered, na.rm = TRUE),
    n_total_truth = sum(!is.na(is_DA)),
    prop_recovered = ifelse(n_total_truth > 0, n_recovered / n_total_truth, NA_real_),
    .groups = 'drop'
  )

# Split by is_DA/direction for stacked bars
sample_coverage_split <- recovery_long %>%
  # Vectorized classification (avoid non-vectorized isTRUE/isFALSE)
  mutate(
    direction = tolower(as.character(direction)),
    class = case_when(
      is_DA == TRUE & direction == "up"   ~ "DA_up",
      is_DA == TRUE & direction == "down" ~ "DA_down",
      is_DA == FALSE                       ~ "non_DA",
      TRUE                                 ~ "unknown"
    )
  ) %>%
  group_by(sample, class) %>%
  summarise(n = sum(recovered, na.rm = TRUE), .groups = 'drop')

# Prepare stacked-bar data with total counts and readable count labels
sample_coverage_counts <- sample_coverage_split %>%
  group_by(sample) %>%
  mutate(total = sum(n),
         frac = ifelse(total > 0, n / total, NA_real_),
         count_label = ifelse(is.na(frac) | frac < 0.08, "", as.character(n))) %>%
  ungroup()

# Quick sanity table
head(sample_coverage_split) %>% kable(caption = "Example of recovered counts per sample by DA class") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Example of recovered counts per sample by DA class
sample class n
C01_rep01 DA_down 0
C01_rep01 DA_up 3
C01_rep01 non_DA 12
C01_rep01 unknown 1
C01_rep02 DA_down 0
C01_rep02 DA_up 3
# Diagnostics: ensure is_DA is logical and mapping rate
diagnostic_map <- recovery_long %>%
  summarise(
    n_recovery_rows = n(),
    n_with_truth = sum(!is.na(is_DA)),
    prop_with_truth = n_with_truth / n_recovery_rows,
    n_unknown_class = sum(is.na(is_DA) | is.na(direction)),
    .groups = 'drop'
  )

print(diagnostic_map)
## # A tibble: 1 × 4
##   n_recovery_rows n_with_truth prop_with_truth n_unknown_class
##             <int>        <int>           <dbl>           <int>
## 1            9560         8000           0.837            1560

Recovered Genome Distribution

# Stacked by DA class
class_colors <- c(DA_up = "#2ca02c", DA_down = "#d62728", non_DA = "#7f7f7f", unknown = "#bcbd22")

ggplot(sample_coverage_counts, aes(x = reorder(sample, total), y = n, fill = class)) +
  geom_col() +
  geom_text(aes(label = count_label), position = position_stack(vjust = 0.5), size = 3, color = "black") +
  # Total recovered genomes per sample printed just beyond the bar
  geom_text(data = distinct(sample_coverage_counts, sample, total),
            aes(x = reorder(sample, total), y = total, label = total),
            inherit.aes = FALSE, hjust = -0.1, size = 3.2) +
  scale_y_continuous(expand = expansion(mult = c(0.01, 0.10))) +
  scale_fill_manual(values = class_colors) +
  coord_flip(clip = "off") +
  labs(title = "Recovered genomes per sample by DA class",
       x = "Sample", y = "Recovered genomes", fill = "Class") +
  theme_minimal() +
  theme(plot.margin = margin(5.5, 30, 5.5, 5.5))

# Per-genome recovery rate across samples
genome_recovery <- recovery_long %>%
  group_by(genome) %>%
  summarise(
    recovery_rate = mean(recovered, na.rm = TRUE),
    is_DA = first(is_DA),
    direction = first(direction),
    true_log2FC = first(true_log2FC),
    baseline_weight = first(baseline_weight),
    .groups = 'drop'
  )

Observation: Treatment samples consistently recover a higher total number of genomes (~21–24). Most of them have DA_up genomes whose abundance were increased during CAMISIM sample distribution. Only a few downregulated genomes are recovered.

On the other hand,controls recover fewer genomes overall (~15–19) compared to treatment.

# Scatter + smooth: baseline_weight vs recovery_rate by sample group (Control/Treatment)
rec_by_group <- recovery_long %>%
  mutate(group = case_when(
    stringr::str_starts(sample, "C") ~ "Control",
    stringr::str_starts(sample, "T") ~ "Treatment",
    TRUE ~ "Unknown"
  )) %>%
  group_by(genome, group) %>%
  summarise(recovery_rate = mean(recovered, na.rm = TRUE), .groups = 'drop') %>%
  dplyr::left_join(truth_covars %>% tibble::as_tibble() %>% dplyr::select(genome, is_DA, direction, baseline_weight), by = "genome")

ggplot(rec_by_group %>% filter(!is.na(baseline_weight)),
       aes(x = baseline_weight, y = recovery_rate, color = group)) +
  geom_point(alpha = 0.6, position = position_jitter(width = 0, height = 0.01), size = 1.8) +
  geom_smooth(method = "loess", se = TRUE) +
  facet_wrap(~ direction, nrow = 1, scales = "free_x") +
  scale_color_manual(values = c(Control = "#1f77b4", Treatment = "#ff7f0e", Unknown = "#7f7f7f")) +
  labs(title = "Recovery rate vs baseline_weight by group",
       x = "Baseline weight", y = "Recovery rate", color = "Group") +
  theme_minimal()

  • Observation: Recovery rate increases with higher baseline weight, specially for upregulated genomes.

Presence–absence heatmap (genomes × samples)

library(ComplexHeatmap)
library(circlize)

# Prepare matrix (drop NAs to avoid invalid rownames)
rec_df <- recovery_long %>%
  select(genome, sample, recovered) %>%
  filter(!is.na(genome), !is.na(sample))

rec_mat <- rec_df %>%
  pivot_wider(names_from = sample, values_from = recovered, values_fill = 0) %>%
  filter(!is.na(genome)) %>%
  column_to_rownames("genome") %>%
  as.matrix()

# Row metadata (aligned to matrix order)
row_meta <- truth_covars %>%
  filter(!is.na(genome), genome %in% rownames(rec_mat)) %>%
  distinct(genome, .keep_all = TRUE) %>%
  column_to_rownames("genome")
row_meta_aligned <- row_meta[rownames(rec_mat), , drop = FALSE]
row_meta_aligned$direction <- replace_na(tolower(row_meta_aligned$direction), "none")

# Row class used for splitting
row_class <- case_when(
  row_meta_aligned$is_DA %in% TRUE  & row_meta_aligned$direction == "up"   ~ "DA_up",
  row_meta_aligned$is_DA %in% TRUE  & row_meta_aligned$direction == "down" ~ "DA_down",
  row_meta_aligned$is_DA %in% FALSE                                       ~ "non_DA",
  TRUE                                                                    ~ "unknown"
)
row_class <- factor(row_class, levels = c("DA_up", "DA_down", "non_DA", "unknown"))

# Row prevalence to order genomes within each split
row_prev <- rowSums(rec_mat, na.rm = TRUE) / ncol(rec_mat)
row_order_vec <- order(as.integer(row_class), -row_prev)

# Top (column) annotations: Control/Treatment/Unknown by sample name
sample_groups <- case_when(
  stringr::str_starts(colnames(rec_mat), "C") ~ "Control",
  stringr::str_starts(colnames(rec_mat), "T") ~ "Treatment",
  TRUE ~ "Unknown"
)
top_anno <- HeatmapAnnotation(
  group = factor(sample_groups, levels = c("Control", "Treatment", "Unknown")),
  col = list(group = c(Control = "#1f77b4", Treatment = "#ff7f0e", Unknown = "#7f7f7f")),
  annotation_name_side = "left"
)

# Right (row) annotations: selected covariates
ha_row <- rowAnnotation(
  log2FC = row_meta_aligned$true_log2FC,
  baseline = row_meta_aligned$baseline_weight,
  col = list(
    log2FC = circlize::colorRamp2(c(min(row_meta_aligned$true_log2FC, na.rm=TRUE), 0, 
                                     max(row_meta_aligned$true_log2FC, na.rm=TRUE)),
                                   c("#2166AC", "#F7F7F7", "#B2182B")),
    baseline = circlize::colorRamp2(range(row_meta_aligned$baseline_weight, na.rm=TRUE),
                                    c("#f7f7f7", "#636363"))
  )
)

# Binary color map for presence/absence using requested palette
# Absent = Modern Red (#E63946), Present = Modern Green (#2A9D8F)
hcols <- circlize::colorRamp2(c(0, 1), c("#E63946", "#2A9D8F"))

Heatmap(
  rec_mat,
  name = "presence",
  col = hcols,
  heatmap_legend_param = list(at = c(0, 1), labels = c("absent", "present"), color_bar = "discrete"),
  use_raster = TRUE,
  raster_quality = 2,
  show_row_names = FALSE,
  show_column_names = TRUE,
  column_title = "Samples",
  row_title = "Genomes",
  right_annotation = ha_row,
  top_annotation = top_anno,
  column_split = factor(sample_groups, levels = c("Control", "Treatment", "Unknown")),
  cluster_columns = TRUE,
  row_split = row_class,
  # Order genomes: first by class, then by prevalence within class
  row_order = row_order_vec
)

  • Observation: DA patterns dominate presence/absence determination of the genome recovery. Increased log2FC (DA_up genomes in treatment) leads to more genome recovery in Treatment than Control samples. High baseline also leads to more genome (in all samples) recovery. (as expected)

Recovered Genomes: Taxonomy Table and Plot

We would like to see if there is any genus that contributes abnormally to the recovered genomes. So first we identify taxon names of the recovered genomes (bins).

suppressPackageStartupMessages({
  library(rentrez)
  library(xml2)
})

# Genomes recovered at least once and their counts across samples
recovered_counts <- recovery_long %>%
  group_by(genome) %>%
  summarise(count = sum(recovered, na.rm = TRUE), .groups = 'drop') %>%
  filter(count > 0) %>%
  dplyr::left_join(truth_covars %>% tibble::as_tibble() %>% dplyr::select(genome, genome_accession), by = 'genome')

# Helper to get family/genus/species via NCBI (assembly -> taxid -> taxonomy)
get_taxonomy_for_accession <- function(acc) {
  if (is.na(acc) || acc == "") return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
  out <- tryCatch({
    es <- rentrez::entrez_search(db = 'assembly', term = acc, retmax = 1)
    if (length(es$ids) == 0) return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
    sm <- rentrez::entrez_summary(db = 'assembly', id = es$ids[[1]])
    txid <- sm$taxid
    if (is.null(txid) || is.na(txid)) return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
    xml_txt <- rentrez::entrez_fetch(db = 'taxonomy', id = txid, rettype = 'xml')
    doc <- xml2::read_xml(xml_txt)
    family <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='family']/ScientificName"))
    genus  <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
    species <- xml2::xml_text(xml2::xml_find_first(doc, "//Taxon/ScientificName"))
    tibble(family = ifelse(length(family)==0, NA, family),
           genus = ifelse(length(genus)==0, NA, genus),
           species = ifelse(length(species)==0, NA, species))
  }, error = function(e) tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
  out
}

unique_acc <- recovered_counts$genome_accession %>% unique()
tax_lookup <- map_dfr(unique_acc, ~{
  res <- get_taxonomy_for_accession(.x)
  mutate(res, genome_accession = .x)
})

recovered_tax <- recovered_counts %>%
  dplyr::left_join(tax_lookup, by = 'genome_accession') %>%
  arrange(desc(count))

# Display table: row names = genome accession
recovered_tax_table <- recovered_tax %>%
  transmute(`genome` = coalesce(genome_accession, genome), count, family, genus, species)

recovered_tax_table %>%
  dplyr::rename(`Count (samples)` = count) %>%
  kable(caption = 'Recovered genomes with NCBI taxonomy', align = 'lrrrr') %>%
  kable_styling(bootstrap_options = c('striped','hover','condensed'), full_width = FALSE)
Recovered genomes with NCBI taxonomy
genome Count (samples) family genus species
GCA_002763575.1 40 Prevotellaceae Prevotella Prevotella intermedia
GCA_009377295.1 40 Paenibacillaceae Paenibacillus Paenibacillus glucanolyticus
GCA_016775375.1 40 Peptostreptococcaceae Peptostreptococcus Peptostreptococcus stomatis
GCA_016861285.1 40 Pasteurellaceae Haemophilus Haemophilus influenzae
GCA_017598955.1 40 Micrococcaceae Kocuria Kocuria rhizophila
GCA_938023715.1 40 Prevotellaceae Prevotella uncultured Prevotella sp.
NA 40 NA NA NA
GCA_002091815.1 39 Pseudomonadaceae Ectopseudomonas Pseudomonas oleovorans subsp. oleovorans NBRC 13583 = DSM 1045
GCA_000195125.1 37 Streptococcaceae Streptococcus Streptococcus sanguinis SK1058
GCA_020529565.1 35 Streptococcaceae Streptococcus Streptococcus mutans
GCA_001597385.1 23 Fusobacteriaceae Fusobacterium Fusobacterium necrophorum subsp. funduliforme
GCA_016127175.1 22 Veillonellaceae Veillonella Veillonella parvula
GCA_937889145.1 22 Prevotellaceae Prevotella Prevotella pallens
GCA_000160335.2 20 Staphylococcaceae Staphylococcus Staphylococcus aureus subsp. aureus 55/2053
GCA_000180075.1 20 Streptococcaceae Streptococcus Streptococcus vestibularis F0396
GCA_004136195.1 20 Propionibacteriaceae Cutibacterium Cutibacterium acnes DSM 1897
GCA_021309225.1 20 Pasteurellaceae Aggregatibacter Aggregatibacter actinomycetemcomitans
GCA_023110145.1 20 Streptococcaceae Streptococcus Streptococcus vestibularis
GCA_023314415.1 20 Enterobacteriaceae Kluyvera Kluyvera ascorbata
GCA_927911685.1 20 Actinomycetaceae Actinomyces uncultured Actinomyces sp.
GCA_958455085.1 20 Veillonellaceae Dialister Dialister invisus
GCA_000016825.1 18 Lactobacillaceae Limosilactobacillus Limosilactobacillus reuteri subsp. reuteri
GCA_003433955.1 18 Peptoniphilaceae Anaerococcus Anaerococcus nagyae
GCA_030180395.1 17 Fusobacteriaceae Fusobacterium Fusobacterium necrophorum
GCA_001866575.1 16 Propionibacteriaceae Cutibacterium Cutibacterium avidum
GCA_021309315.1 13 Pasteurellaceae Aggregatibacter Aggregatibacter actinomycetemcomitans
GCA_008726565.1 12 Streptococcaceae Streptococcus Streptococcus anginosus
GCA_022569235.1 12 Staphylococcaceae Staphylococcus Staphylococcus lugdunensis
GCA_014962425.1 8 Staphylococcaceae Staphylococcus Staphylococcus lugdunensis
GCA_030229925.1 8 Corynebacteriaceae Corynebacterium Corynebacterium propinquum
GCA_001053555.1 7 Corynebacteriaceae Corynebacterium Corynebacterium propinquum
GCA_008085135.1 7 Pasteurellaceae Aggregatibacter Aggregatibacter actinomycetemcomitans
GCA_031158635.1 7 Propionibacteriaceae Cutibacterium Cutibacterium avidum
GCA_028325975.1 6 Streptococcaceae Streptococcus Streptococcus parasanguinis
GCA_002212995.1 5 Streptococcaceae Streptococcus Streptococcus mutans
GCA_003856455.1 4 Staphylococcaceae Staphylococcus Staphylococcus epidermidis
GCA_900636255.1 3 Staphylococcaceae Staphylococcus Staphylococcus warneri
GCA_003596365.3 2 Staphylococcaceae Staphylococcus Staphylococcus haemolyticus
GCA_011038575.1 1 Staphylococcaceae Staphylococcus Staphylococcus epidermidis

Then we barplot them by aggregating at genus level.

# Plot: top 15 genera by number of recovered genomes
top_genera <- recovered_tax %>%
  dplyr::count(genus, sort = TRUE, name = 'n_genomes') %>%
  filter(!is.na(genus)) %>%
  slice_head(n = 15)

ggplot(top_genera, aes(x = reorder(genus, n_genomes), y = n_genomes)) +
  geom_col(fill = '#2A9D8F') +
  coord_flip() +
  labs(title = 'Top genera among recovered genomes', x = 'Genus', y = 'Recovered genomes (unique)') +
  theme_minimal()

  • Observation: 7 species are recovered from both Streptococcus and Staphylococcus.

Community Composition: Initial vs Recovered at Genus Level

suppressPackageStartupMessages({
  library(tidyverse)
  library(mia)
  library(miaViz)
  library(S4Vectors)
  library(SummarizedExperiment)
  library(TreeSummarizedExperiment)
  library(rentrez)
  library(xml2)
})

pal_amr <- c(
  "#9C81ab", "#81ba9c", "gold1", "skyblue2", "#e65d5f", "#FDBF6F",
  "#CAD2D6", "#FB9A99", "maroon", "steelblue4", "dodgerblue2", "#D48B28",
  "#6A9DA9", "khaki2", "#B5D39D", "#42B93E", "#B3E3AA", "#A9FDAC",
  "#32A27f", "#38BEA5", "#A3AB28", "#28C99f", "#FED1BC", "#FE654F",
  "#49C8E1", "#BB39A3", "grey", "pink",
  "#177B94", "#FF707E", "#2AC02C", "#D62728", "#9478DD", "#BC54AB",
  "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF",
  "#6B4226", "#FFA70A", "#FF4540", "#CD5C5C", "#25EB57", "#66CDAA",
  "#4682B4", "#1E90FF", "#C71585", "#7AD7D6",
  "#FFD49F", "#8BD870", "#B0B83F", "#A52A2A", "#6495ED", "#9932CC",
  "#FF1493", "#ADFF2F", "#40E0D0", "#C0C0C0",
  "#FFA8B9", "#F4CA60", "#D8251E", "#0B53FF", "#FFA500", "#FF6370",
  "#DC143C", "#8A2BE2", "#5F9EA0", "#00FA9A",
  "#FFD700", "#4DBDFF", "#708090", "#00FFFF", "#FF4500", "#7FFFD4",
  "#AD0F5E", "#20B2AA", "#708090", "#8FBC8F",
  "#0000FF", "#808000", "#228B22", "#C0C0C0", "#808080", "#008000",
  "#FF8C00", "#7C1353", "#000000", "#A9A9A9",
  "#FF4500", "#822222", "#FF2222", "#00FF00", "#800080", "#FF00FF",
  "#E9967A", "#FF69B4", "#FF69B4", "#EE82EE",
  "#FFB6C1", "#FAFEEE", "#F0E68C", "#9FB98F", "#ADD8E6", "#87CEEB",
  "#FFDEAD", "#DDA0DD", "#B0E0E6"
)

# Input: remapped relative abundances (rows = genomes or bins, cols = samples)
abund_file <- file.path(data_dirs$HOMD_Final, "aggregated_metrics_remapped", "remapped_relative.tsv")
stopifnot(file.exists(abund_file))

abund <- readr::read_tsv(abund_file, show_col_types = FALSE)
stopifnot("Genome" %in% colnames(abund))

rownms <- abund$Genome
mat <- as.matrix(abund %>% select(-Genome))
rownames(mat) <- rownms

# Identify source type and normalize accession IDs
source_type <- ifelse(grepl("^bin_", rownms), "recovered", "initial")
acc_undotted <- sub("^bin_", "", rownms)
acc <- ifelse(grepl("\\.\\d+$", acc_undotted), acc_undotted, sub("([0-9])$", ".\\1", acc_undotted))

row_meta <- S4Vectors::DataFrame(
  Genome = rownms,
  Source = source_type,
  Accession_raw = acc_undotted,
  Accession = acc,
  row.names = rownms
)

# Build TSE and compute per-sample relative abundance (sums to 1)
tse_full <- TreeSummarizedExperiment::TreeSummarizedExperiment(
  assays = list(counts = mat),
  rowData = row_meta
)
tse_full <- mia::transformAssay(tse_full, assay.type = "counts", method = "relabundance")

# Fetch Genus via NCBI (rentrez) with caching
cache_dir_cc <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir_cc)) dir.create(cache_dir_cc, recursive = TRUE)
cache_map <- file.path(cache_dir_cc, "accession_to_genus.csv")

fetch_genus <- function(acc_vec) {
  acc_vec <- unique(acc_vec[!is.na(acc_vec) & acc_vec != ""])
  if (length(acc_vec) == 0) return(tibble(Accession = character(), Genus = character()))
  purrr::map_dfr(acc_vec, function(acc_one) {
    # Try exact assembly accession first
    uid <- tryCatch(rentrez::entrez_search(db = "assembly",
                                          term = paste0(acc_one, "[Assembly Accession]"),
                                          retmax = 1)$ids,
                    error = function(e) NULL)
    if (is.null(uid) || length(uid) == 0) {
      uid <- tryCatch(rentrez::entrez_search(db = "assembly", term = acc_one, retmax = 1)$ids,
                      error = function(e) NULL)
    }
    if (!is.null(uid) && length(uid) > 0) {
      sm <- tryCatch(rentrez::entrez_summary(db = "assembly", id = uid[1]), error = function(e) NULL)
      txid <- tryCatch({
        if (is.null(sm)) NA_character_ else as.character(if (!is.null(sm$taxid)) sm$taxid else sm$tax_id)
      }, error = function(e) NA_character_)
      if (!is.na(txid)) {
        tx_xml <- tryCatch(rentrez::entrez_fetch(db = "taxonomy", id = txid, rettype = "xml"), error = function(e) NULL)
        if (!is.null(tx_xml)) {
          doc <- tryCatch(xml2::read_xml(tx_xml), error = function(e) NULL)
          if (!is.null(doc)) {
            genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
            if (length(genus) == 0 || is.na(genus) || genus == "") genus <- NA_character_
            return(tibble(Accession = acc_one, Genus = genus))
          }
        }
      }
    }
    tibble(Accession = acc_one, Genus = NA_character_)
  })
}

# Load cache and lookup what’s missing
cache_df <- if (file.exists(cache_map)) suppressMessages(readr::read_csv(cache_map, show_col_types = FALSE)) else tibble(Accession = character(), Genus = character())
to_lookup <- setdiff(unique(row_meta$Accession), cache_df$Accession)

if (length(to_lookup) > 0) {
  # Optional: supply ENTREZ_KEY for higher rate limit
  if (nzchar(Sys.getenv("ENTREZ_KEY"))) rentrez::set_entrez_key(Sys.getenv("ENTREZ_KEY"))
  new_map <- fetch_genus(to_lookup)
  cache_df <- dplyr::bind_rows(cache_df, new_map) %>% dplyr::distinct(Accession, .keep_all = TRUE)
  readr::write_csv(cache_df, cache_map)
}

genus_map <- cache_df
row_genus <- genus_map$Genus[match(row_meta$Accession, genus_map$Accession)]
row_genus[is.na(row_genus) | row_genus == ""] <- "Unknown"
rowData(tse_full)$Genus <- row_genus

# Split by source
tse_initial <- tse_full[rowData(tse_full)$Source == "initial", ]
tse_recovered <- tse_full[rowData(tse_full)$Source == "recovered", ]

# Aggregate by Genus within each subset
tse_initial_g <- mia::agglomerateByVariable(tse_initial, by = "rows", f = "Genus", fun = sum, assay.type = "relabundance")
tse_recovered_g <- mia::agglomerateByVariable(tse_recovered, by = "rows", f = "Genus", fun = sum, assay.type = "relabundance")

# Determine shared top genera (union of top 10 from both)
top_init <- mia::getTop(tse_initial_g, top = 10, assay.type = "relabundance")
top_recv <- mia::getTop(tse_recovered_g, top = 10, assay.type = "relabundance")
top_union <- unique(c(top_init, top_recv))

# Collapse others into "Other"
rowData(tse_initial_g)$Genus_sub <- ifelse(rownames(tse_initial_g) %in% top_union, rownames(tse_initial_g), "Other")
rowData(tse_recovered_g)$Genus_sub <- ifelse(rownames(tse_recovered_g) %in% top_union, rownames(tse_recovered_g), "Other")
tse_initial_sub <- mia::agglomerateByVariable(tse_initial_g, by = "rows", f = "Genus_sub", fun = sum, assay.type = "relabundance")
tse_recovered_sub <- mia::agglomerateByVariable(tse_recovered_g, by = "rows", f = "Genus_sub", fun = sum, assay.type = "relabundance")

# Visualize side-by-side (stacked bar via ggplot, with ordering)
create_comp_plot <- function(tse_sub, title) {
  M <- SummarizedExperiment::assay(tse_sub, "relabundance")
  if (is.null(M)) M <- SummarizedExperiment::assay(tse_sub, 1)
  cs <- colSums(M, na.rm = TRUE); cs[cs == 0] <- 1
  M <- sweep(M, 2, cs, "/")

  df <- as.data.frame(M) %>%
    tibble::rownames_to_column("Genus") %>%
    tidyr::pivot_longer(-Genus, names_to = "Sample", values_to = "Abundance")

  totals <- df %>% group_by(Genus) %>% summarise(Total = sum(Abundance, na.rm = TRUE), .groups = 'drop')
  levels_g <- totals %>% filter(Genus != "Other") %>% arrange(desc(Total)) %>% pull(Genus)
  if ("Other" %in% df$Genus) levels_g <- c(levels_g, "Other")
  df$Genus <- factor(df$Genus, levels = levels_g)

  top_g <- dplyr::first(levels_g)
  if (!is.na(top_g)) {
    sample_order <- df %>% filter(Genus == top_g) %>% arrange(desc(Abundance)) %>% pull(Sample)
    df$Sample <- factor(df$Sample, levels = unique(sample_order))
  }

  ggplot(df, aes(x = Sample, y = Abundance, fill = Genus)) +
    geom_col(width = 0.9) +
    scale_fill_manual(values = pal_amr) +   # << apply palette
    labs(title = title, x = "Samples", y = "Relative abundance") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          legend.position = "right")
}

p_init <- create_comp_plot(tse_initial_sub, "Initial (Reference) - Genus Composition")
p_recv <- create_comp_plot(tse_recovered_sub, "Recovered (Bins) - Genus Composition")

if (requireNamespace("patchwork", quietly = TRUE)) {
  (p_init | p_recv)
} else {
  gridExtra::grid.arrange(p_init, p_recv, ncol = 2)
}

  • Observation: The reference (left) shows smooth, consistent genus profiles across all samples, with stable relative abundances. In contrast, the recovered bins display noisy, fragmented patterns with inflated or diminished representation of several genera.

There is a clear compositional shift, with recovered bins showing markedly altered and noisier genus abundances compared to the stable reference profiles.

Differential Abundance Analysis

Loading and Processing DA Results

# Load comprehensive summary results
da_results <- read_csv(file.path(results_input_dir, "comprehensive_summary_with_bins.csv"), 
                      show_col_types = FALSE) %>%
  mutate(
    method = str_to_lower(method),
    metric_type = factor(metric_type, levels = c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")),
    source_type = factor(source_type, levels = c("all", "reference", "bin"))
  )

# Display structure
cat("DA Results Summary:\n")
## DA Results Summary:
cat("- Total rows:", nrow(da_results), "\n")
## - Total rows: 78
cat("- Methods:", paste(unique(da_results$method), collapse = ", "), "\n")
## - Methods: ancombc2, aldex2, metagenomeseq, maaslin2, wilcoxon
cat("- Metrics:", paste(unique(da_results$metric_type), collapse = ", "), "\n")
## - Metrics: counts, coverage, breadth, rpkm, tpm, relative
cat("- Source types:", paste(unique(da_results$source_type), collapse = ", "), "\n")
## - Source types: all, reference, bin
# Quick overview
da_results %>%
  dplyr::count(method, source_type, metric_type) %>%
  kable(caption = "Results availability by method, source type, and metric") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Results availability by method, source type, and metric
method source_type metric_type n
aldex2 all counts 1
aldex2 all breadth 1
aldex2 reference counts 1
aldex2 reference breadth 1
aldex2 bin counts 1
aldex2 bin breadth 1
ancombc2 all counts 1
ancombc2 all tpm 1
ancombc2 all rpkm 1
ancombc2 all relative 1
ancombc2 all breadth 1
ancombc2 all coverage 1
ancombc2 reference counts 1
ancombc2 reference tpm 1
ancombc2 reference rpkm 1
ancombc2 reference relative 1
ancombc2 reference breadth 1
ancombc2 reference coverage 1
ancombc2 bin counts 1
ancombc2 bin tpm 1
ancombc2 bin rpkm 1
ancombc2 bin relative 1
ancombc2 bin breadth 1
ancombc2 bin coverage 1
maaslin2 all counts 1
maaslin2 all tpm 1
maaslin2 all rpkm 1
maaslin2 all relative 1
maaslin2 all breadth 1
maaslin2 all coverage 1
maaslin2 reference counts 1
maaslin2 reference tpm 1
maaslin2 reference rpkm 1
maaslin2 reference relative 1
maaslin2 reference breadth 1
maaslin2 reference coverage 1
maaslin2 bin counts 1
maaslin2 bin tpm 1
maaslin2 bin rpkm 1
maaslin2 bin relative 1
maaslin2 bin breadth 1
maaslin2 bin coverage 1
metagenomeseq all counts 1
metagenomeseq all tpm 1
metagenomeseq all rpkm 1
metagenomeseq all relative 1
metagenomeseq all breadth 1
metagenomeseq all coverage 1
metagenomeseq reference counts 1
metagenomeseq reference tpm 1
metagenomeseq reference rpkm 1
metagenomeseq reference relative 1
metagenomeseq reference breadth 1
metagenomeseq reference coverage 1
metagenomeseq bin counts 1
metagenomeseq bin tpm 1
metagenomeseq bin rpkm 1
metagenomeseq bin relative 1
metagenomeseq bin breadth 1
metagenomeseq bin coverage 1
wilcoxon all counts 1
wilcoxon all tpm 1
wilcoxon all rpkm 1
wilcoxon all relative 1
wilcoxon all breadth 1
wilcoxon all coverage 1
wilcoxon reference counts 1
wilcoxon reference tpm 1
wilcoxon reference rpkm 1
wilcoxon reference relative 1
wilcoxon reference breadth 1
wilcoxon reference coverage 1
wilcoxon bin counts 1
wilcoxon bin tpm 1
wilcoxon bin rpkm 1
wilcoxon bin relative 1
wilcoxon bin breadth 1
wilcoxon bin coverage 1

Reference Genomes Results

# Filter for reference genomes only
reference_results <- da_results %>%
  filter(source_type == "reference") %>%
  select(method, metric_type, tp, fp, fn, tn, sensitivity, specificity, precision, fdr, accuracy, npv, auroc, auprc, called, total_features) %>%
  arrange(method, metric_type)

# Create formatted table for reference results
reference_table <- reference_results %>%
  mutate(
    across(c(sensitivity, specificity, precision, accuracy, npv, auroc, auprc), ~ round(.x, 3)),
    across(c(fdr), ~ round(.x, 3)),
    tp = as.integer(tp),
    fp = as.integer(fp),
    fn = as.integer(fn),
    tn = as.integer(tn),
    called = as.integer(called),
    total_features = as.integer(total_features)
  ) %>%
  dplyr::rename(
    Method = method,
    Metric = metric_type,
    TP = tp,
    FP = fp,
    FN = fn,
    TN = tn,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Precision = precision,
    FDR = fdr,
    Accuracy = accuracy,
    NPV = npv,
    AUROC = auroc,
    AUPRC = auprc,
    Called = called,
    Total = total_features
  )

# Display interactive table with pagination
reference_dt <- reference_table %>%
  DT::datatable(
    caption = "Differential Abundance Results: Reference Genomes Only",
    options = list(
      pageLength = 10,
      lengthMenu = c(5, 10, 15, 20, -1),
      scrollX = TRUE,
      scrollY = "400px",
      dom = 'Bfrtip',
      buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
    ),
    extensions = c('Buttons', 'Scroller'),
    filter = 'top',
    rownames = FALSE
  ) %>%
  DT::formatRound(columns = c('Sensitivity', 'Specificity', 'Precision', 'FDR', 'Accuracy', 'NPV', 'AUROC', 'AUPRC'), digits = 3)

reference_dt
# Save interactive table
save_datatable(reference_dt, "reference_genomes_results.html")

Bins Results

# Filter for bins only
bins_results <- da_results %>%
  filter(source_type == "bin") %>%
  select(method, metric_type, tp, fp, fn, tn, sensitivity, specificity, precision, fdr, accuracy, npv, auroc, auprc, called, total_features) %>%
  arrange(method, metric_type)

# Create formatted table for bins results
bins_table <- bins_results %>%
  mutate(
    across(c(sensitivity, specificity, precision, accuracy, npv, auroc, auprc), ~ round(.x, 3)),
    across(c(fdr), ~ round(.x, 3)),
    tp = as.integer(tp),
    fp = as.integer(fp),
    fn = as.integer(fn),
    tn = as.integer(tn),
    called = as.integer(called),
    total_features = as.integer(total_features)
  ) %>%
  dplyr::rename(
    Method = method,
    Metric = metric_type,
    TP = tp,
    FP = fp,
    FN = fn,
    TN = tn,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Precision = precision,
    FDR = fdr,
    Accuracy = accuracy,
    NPV = npv,
    AUROC = auroc,
    AUPRC = auprc,
    Called = called,
    Total = total_features
  )

# Display interactive table with pagination
bins_dt <- bins_table %>%
  DT::datatable(
    caption = "Differential Abundance Results: Bins Only",
    options = list(
      pageLength = 10,
      lengthMenu = c(5, 10, 15, 20, -1),
      scrollX = TRUE,
      scrollY = "400px",
      dom = 'Bfrtip',
      buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
    ),
    extensions = c('Buttons', 'Scroller'),
    filter = 'top',
    rownames = FALSE
  ) %>%
  DT::formatRound(columns = c('Sensitivity', 'Specificity', 'Precision', 'FDR', 'Accuracy', 'NPV', 'AUROC', 'AUPRC'), digits = 3)

bins_dt
# Save interactive table
save_datatable(bins_dt, "bins_results.html")

Summary Statistics

# Summary statistics for reference vs bins
summary_stats <- da_results %>%
  filter(source_type %in% c("reference", "bin")) %>%
  group_by(source_type, metric_type) %>%
  summarise(
    n_methods = n(),
    mean_sensitivity = mean(sensitivity, na.rm = TRUE),
    mean_specificity = mean(specificity, na.rm = TRUE),
    mean_precision = mean(precision, na.rm = TRUE),
    mean_auroc = mean(auroc, na.rm = TRUE),
    mean_auprc = mean(auprc, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    across(c(mean_sensitivity, mean_specificity, mean_precision, mean_auroc, mean_auprc), ~ round(.x, 3))
  )

# Display interactive summary table
summary_dt <- summary_stats %>%
  DT::datatable(
    caption = "Summary Statistics: Reference vs Bins Performance",
    options = list(
      pageLength = 10,
      lengthMenu = c(5, 10, 15, 20, -1),
      scrollX = TRUE,
      dom = 'Bfrtip',
      buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
    ),
    extensions = c('Buttons'),
    filter = 'top',
    rownames = FALSE
  ) %>%
  DT::formatRound(columns = c('mean_sensitivity', 'mean_specificity', 'mean_precision', 'mean_auroc', 'mean_auprc'), digits = 3)

summary_dt

Performance Comparison Plots

# 1) Long data
plot_data <- da_results %>%
  filter(source_type %in% c("reference", "bin")) %>%
  mutate(
    f1_score = 2 * (precision * sensitivity) / (precision + sensitivity),
    auc_pr   = auprc
  ) %>%
  select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
  pivot_longer(c(f1_score, auc_pr, sensitivity, specificity),
               names_to = "metric", values_to = "value") %>%
  filter(!is.na(value))

metric_order  <- c("counts","rpkm","relative","breadth","coverage","tpm")
tool_colors   <- c(ancombc2="#E63946", aldex2="#F77F00", metagenomeseq="#2A9D8F",
                   wilcoxon="#264653", maaslin2="#7209B7")

# 2) Compute common y-limits per performance metric across BOTH sources
common_limits <- plot_data %>%
  group_by(metric) %>%
  summarise(lo = min(value, na.rm=TRUE), hi = max(value, na.rm=TRUE), .groups="drop") %>%
  mutate(
    # pad a touch so whiskers/points aren’t clipped
    lo = pmax(0, lo - 0.02),
    hi = pmin(1, hi + 0.02)
  )

lims_for <- function(metric_name) {
  with(common_limits[common_limits$metric==metric_name, ], c(lo, hi))
}

# 3) Subplot
create_subplot <- function(data, metric_name, title, y_label) {
  ggplot(data %>% filter(metric == metric_name),
         aes(x = factor(metric_type, levels = metric_order), y = value)) +
    geom_boxplot(alpha = 0.7, outlier.shape = NA, fill="white", color="black") +
    geom_jitter(aes(color = method), width = 0.2, alpha = 0.8, size = 2.5) +
    scale_color_manual(values = tool_colors) +
    scale_y_continuous(limits = lims_for(metric_name)) +
    scale_x_discrete(drop = FALSE) +       # <- keep all metrics even if absent in a subset
    labs(title = title, x = "Abundance Metric", y = y_label, color = "Method") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "bottom") +
    coord_flip()
}

# 4) Make panels 
ref_data  <- plot_data %>% filter(source_type == "reference")
bins_data <- plot_data %>% filter(source_type == "bin")

p1_ref  <- create_subplot(ref_data,  "f1_score",   "A) F1-Score Distribution by Metric (Reference)", "F1-Score")
p2_ref  <- create_subplot(ref_data,  "auc_pr",     "B) AUC-PR Distribution by Metric (Reference)",   "AUC-PR")
p3_ref  <- create_subplot(ref_data,  "sensitivity","C) Sensitivity Distribution by Metric (Reference)","Sensitivity")
p4_ref  <- create_subplot(ref_data,  "specificity","D) Specificity Distribution by Metric (Reference)","Specificity")

p1_bins <- create_subplot(bins_data, "f1_score",   "A) F1-Score Distribution by Metric (Bins)", "F1-Score")
p2_bins <- create_subplot(bins_data, "auc_pr",     "B) AUC-PR Distribution by Metric (Bins)",   "AUC-PR")
p3_bins <- create_subplot(bins_data, "sensitivity","C) Sensitivity Distribution by Metric (Bins)","Sensitivity")
p4_bins <- create_subplot(bins_data, "specificity","D) Specificity Distribution by Metric (Bins)","Specificity")

gridExtra::grid.arrange(p1_ref, p2_ref, p3_ref, p4_ref, ncol = 2,
                        top = "Performance Metrics Distribution Across Abundance Metrics - Reference Genomes")

gridExtra::grid.arrange(p1_bins, p2_bins, p3_bins, p4_bins, ncol = 2,
                        top = "Performance Metrics Distribution Across Abundance Metrics - Bins")

- Observation: Reference genomes show high sensitivity but low and variable specificity, with F1 and AUC-PR clustering around 0.6–0.8. Bin-level results are weaker, with reduced sensitivity and broader variability across abundance metrics.

Performance Heatmaps

# Prepare data for heatmaps
heatmap_data <- da_results %>%
  filter(source_type %in% c("reference", "bin")) %>%
  mutate(
    # Calculate F1-Score from precision and sensitivity
    f1_score = 2 * (precision * sensitivity) / (precision + sensitivity),
    # Use existing metrics
    auc_pr = auprc
  ) %>%
  select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
  # Ensure consistent metric order
  mutate(metric_type = factor(metric_type, levels = c("counts", "rpkm", "relative", "breadth", "coverage", "tpm")))

# Function to create heatmap
create_heatmap <- function(data, metric_col, title, source_label) {
  # Prepare matrix
  heatmap_matrix <- data %>%
    select(method, metric_type, !!sym(metric_col)) %>%
    pivot_wider(names_from = metric_type, values_from = !!sym(metric_col)) %>%
    column_to_rownames("method") %>%
    as.matrix()
  
  # Replace NAs with 0 for visualization
  heatmap_matrix[is.na(heatmap_matrix)] <- 0
  
  # Create heatmap
  corrplot(heatmap_matrix, 
           method = "color", 
           is.corr = FALSE,
           col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
           tl.col = "black", 
           tl.srt = 45,
           addCoef.col = "black",
           number.cex = 0.7,
           title = paste(title, "-", source_label),
           mar = c(0,0,2,0),
           cl.cex = 0.8)
}

# Reference genomes heatmaps
ref_heatmap_data <- heatmap_data %>% filter(source_type == "reference")

par(mfrow = c(2, 2), mar = c(2, 2, 3, 2))
create_heatmap(ref_heatmap_data, "f1_score", "F1-Score Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "auc_pr", "AUC-PR Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "sensitivity", "Sensitivity Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "specificity", "Specificity Heatmap", "Reference Genomes")

# Bins heatmaps
bins_heatmap_data <- heatmap_data %>% filter(source_type == "bin")

par(mfrow = c(2, 2), mar = c(2, 2, 3, 2))
create_heatmap(bins_heatmap_data, "f1_score", "F1-Score Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "auc_pr", "AUC-PR Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "sensitivity", "Sensitivity Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "specificity", "Specificity Heatmap", "Bins")

# Reset par
par(mfrow = c(1, 1))
  • Observation: Reference genomes show generally high sensitivity (ANCOMBC-2: 1.00, for all metrics) across methods but poor specificity, leading to only moderate F1 and AUC-PR scores. Bins show large performance drops. ANCOM-BC2 and metagenomeSeq achieve the good balance — both show high sensitivity (≈1.0) and comparatively better specificity than others. MetagenomeSeq has the highest overall F1 and AUC-PR.

There is no significant changes in performance for the metrics except for a few metrics where aldex2 produced NA result. We should focus more on the Specificity heatmap as FDR control (reduction of False Positives is the bigger problem). The methods can clearly can identify True Positives (sensitivity 1.0) but struggles with FDR control. In that regard, MetagenomeSeq (at reference) and ANCOMBC-2 (at bins) performs better than other methods.

Effect of Alpha on False Positive Rate

# Define alpha values and corresponding directories
alpha_values <- c(0.05, 0.01, 0.005, 0.001, 0.00001)
alpha_dirs <- c("with_bins", "alpha_0.01", "alpha_0.005", "alpha_0.001", "alpha_0.00001")

# Function to load data for a specific alpha
load_alpha_data <- function(alpha_dir) {
  file_path <- file.path("/home/mrahman/SIMBA/results/homd_final", alpha_dir, "comprehensive_summary_with_bins.csv")
  if (file.exists(file_path)) {
    data <- read_csv(file_path, show_col_types = FALSE) %>%
      mutate(alpha_dir = alpha_dir)
    return(data)
  } else {
    warning(paste("File not found:", file_path))
    return(NULL)
  }
}

# Load data for all alpha values
alpha_data_list <- map(alpha_dirs, load_alpha_data)
alpha_data_list <- alpha_data_list[!map_lgl(alpha_data_list, is.null)]

# Combine all data
combined_alpha_data <- bind_rows(alpha_data_list) %>%
  mutate(
    alpha_value = case_when(
      alpha_dir == "with_bins" ~ 0.05,
      alpha_dir == "alpha_0.01" ~ 0.01,
      alpha_dir == "alpha_0.005" ~ 0.005,
      alpha_dir == "alpha_0.001" ~ 0.001,
      alpha_dir == "alpha_0.00001" ~ 0.00001,
      TRUE ~ NA_real_
    ),
    alpha_label = paste("α =", alpha_value)
  ) %>%
  filter(!is.na(alpha_value))

# Prepare data for plotting
alpha_plot_data <- combined_alpha_data %>%
  filter(source_type %in% c("reference", "bin")) %>%
  select(method, metric_type, source_type, alpha_value, alpha_label, fdr, fp, tn) %>%
  mutate(
    method = str_to_title(method),
    metric_type = factor(metric_type, levels = c("counts", "tpm", "rpkm", "relative", "breadth", "coverage"))
  )

# Create summary statistics by alpha
alpha_summary <- alpha_plot_data %>%
  group_by(alpha_value, alpha_label, source_type) %>%
  summarise(
    mean_fdr = mean(fdr, na.rm = TRUE),
    median_fdr = median(fdr, na.rm = TRUE),
    sd_fdr = sd(fdr, na.rm = TRUE),
    mean_fp = mean(fp, na.rm = TRUE),
    median_fp = median(fp, na.rm = TRUE),
    n_methods = n(),
    .groups = 'drop'
  )

# Plot 1: FDR vs Alpha (line plot)
p1 <- ggplot(alpha_summary, aes(x = alpha_value, y = mean_fdr, color = source_type)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_fdr - sd_fdr/sqrt(n_methods), 
                    ymax = mean_fdr + sd_fdr/sqrt(n_methods)), 
                width = 0.001, alpha = 0.7) +
  scale_x_log10(breaks = alpha_values, labels = alpha_values) +
  scale_color_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
  labs(
    title = "A) False Discovery Rate vs Alpha Value",
    x = "Alpha Value (log scale)",
    y = "Mean False Discovery Rate",
    color = "Source Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Plot 2: FDR vs Alpha (bar plot by method)
p2 <- ggplot(alpha_plot_data, aes(x = alpha_label, y = fdr, fill = source_type)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ method, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
  labs(
    title = "B) FDR Distribution by Alpha Value and Method",
    x = "Alpha Value",
    y = "False Discovery Rate",
    fill = "Source Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Plot 3: False Positives vs Alpha
p3 <- ggplot(alpha_summary, aes(x = alpha_value, y = mean_fp, color = source_type)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_x_log10(breaks = alpha_values, labels = alpha_values) +
  scale_color_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
  labs(
    title = "C) False Positives vs Alpha Value",
    x = "Alpha Value (log scale)",
    y = "Mean False Positives",
    color = "Source Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Plot 4: FDR by Metric Type
p4 <- ggplot(alpha_plot_data, aes(x = alpha_label, y = fdr, fill = metric_type)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ source_type, ncol = 2) +
  scale_fill_manual(values = metric_colors) +
  labs(
    title = "D) FDR by Alpha Value and Metric Type",
    x = "Alpha Value",
    y = "False Discovery Rate",
    fill = "Metric Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Arrange plots
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, 
                       top = "Effect of Alpha Value on False Discovery Rate and False Positives")

# Summary table
alpha_summary_table <- alpha_summary %>%
  select(alpha_value, alpha_label, source_type, mean_fdr, mean_fp) %>%
  pivot_wider(names_from = source_type, values_from = c(mean_fdr, mean_fp)) %>%
  arrange(desc(alpha_value)) %>%
  select(-alpha_value) %>%
  dplyr::rename(
    Alpha = alpha_label,
    `Mean FDR (Reference)` = mean_fdr_reference,
    `Mean FDR (Bins)` = mean_fdr_bin,
    `Mean FP (Reference)` = mean_fp_reference,
    `Mean FP (Bins)` = mean_fp_bin
  )

alpha_summary_table %>%
  kable(caption = "Summary of FDR and False Positives by Alpha Value", 
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Summary of FDR and False Positives by Alpha Value
Alpha Mean FDR (Bins) Mean FDR (Reference) Mean FP (Bins) Mean FP (Reference)
α = 0.01 0.4168 0.4800 7.5385 83.5385
α = 0.005 0.3823 0.4749 6.6538 82.0769
α = 0.001 0.3265 0.4539 5.1538 75.9231
α = 1e-05 0.2128 0.3966 3.0769 60.1923
# Statistical analysis: Correlation between alpha and FDR
correlation_results <- alpha_plot_data %>%
  group_by(source_type, method) %>%
  summarise(
    correlation = cor(alpha_value, fdr, use = "complete.obs"),
    p_value = cor.test(alpha_value, fdr)$p.value,
    .groups = 'drop'
  ) %>%
  mutate(
    significance = case_when(
      p_value < 0.001 ~ "***",
      p_value < 0.01 ~ "**",
      p_value < 0.05 ~ "*",
      TRUE ~ "ns"
    )
  )

cat("\nCorrelation Analysis (Alpha vs FDR):\n")
## 
## Correlation Analysis (Alpha vs FDR):
cat("=====================================\n")
## =====================================
correlation_results %>%
  kable(caption = "Correlation between Alpha and FDR by Method and Source Type", 
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Correlation between Alpha and FDR by Method and Source Type
source_type method correlation p_value significance
bin Aldex2 0.7090 0.0489
bin Ancombc2 0.6301 0.0010 ***
bin Maaslin2 0.7335 0.0000 ***
bin Metagenomeseq 0.6126 0.0041 **
bin Wilcoxon 0.7606 0.0000 ***
reference Aldex2 0.4564 0.2557 ns
reference Ancombc2 0.2836 0.1793 ns
reference Maaslin2 0.3429 0.1009 ns
reference Metagenomeseq 0.3037 0.1491 ns
reference Wilcoxon 0.4909 0.0148
  • Observation: Reducing alpha decreases both FDR and false positives, the effect is more striking for reference genomes than bins.
  • Statistical Comparison of Metrics Within Tools

    # ---- Compact summary: A) Leaderboard + B) Sens–Spec ----
    
    suppressPackageStartupMessages({
      library(tidyverse)
      library(ggrepel)
      library(viridis)
      library(scales)
      library(patchwork)
    })
    
    required_cols <- c("method","metric_type","source_type","precision","sensitivity","specificity","auprc")
    metric_levels <- c("counts","tpm","rpkm","relative","breadth","coverage")
    
    # Prepare data
    stats_data <- da_results %>%
      filter(source_type %in% c("reference","bin")) %>%
      mutate(
        f1_score = 2 * (precision * sensitivity) / pmax(precision + sensitivity, .Machine$double.eps),
        auc_pr   = auprc
      ) %>%
      select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
      mutate(
        metric_type = factor(metric_type, levels = metric_levels),
        f1_score = pmin(pmax(f1_score, 0), 1),
        auc_pr   = pmin(pmax(auc_pr,   0), 1),
        sensitivity = pmin(pmax(sensitivity, 0), 1),
        specificity = pmin(pmax(specificity, 0), 1)
      ) %>%
      filter(!is.na(method), !is.na(metric_type), !is.na(source_type))
    
    # Helpers
    safe_z <- function(x){
      m <- mean(x, na.rm = TRUE); s <- sd(x, na.rm = TRUE)
      if (!is.finite(s) || s == 0) return(rep(0, length(x)))
      (x - m) / s
    }
    placeholder_plot <- function(title_txt){
      ggplot() +
        annotate("text", x = 0, y = 0, label = paste0(title_txt, "\n(no valid data)"), size = 4) +
        xlim(-1,1) + ylim(-1,1) + theme_void() +
        theme(plot.title = element_text(face = "bold")) + labs(title = title_txt)
    }
    
    # A) Overall Leaderboard (composite of z(F1), z(AUCPR))
    leader_input <- stats_data %>%
      pivot_longer(c(f1_score, auc_pr), names_to = "measure", values_to = "val") %>%
      group_by(metric_type, source_type, measure) %>%
      mutate(z = safe_z(val)) %>%
      ungroup() %>%
      group_by(method) %>%
      summarise(composite = mean(z, na.rm = TRUE), .groups = "drop") %>%
      filter(is.finite(composite))
    
    pA <- if (nrow(leader_input) > 0) {
      ggplot(leader_input, aes(reorder(method, composite), composite)) +
        geom_col(width = 0.7) +
        coord_flip() +
        labs(title = "A) Overall Leaderboard",
             x = NULL, y = "Composite score (z-avg of F1 & AUCPR)") +
        theme_minimal(base_size = 11) +
        theme(plot.title = element_text(face = "bold"))
    } else {
      placeholder_plot("A) Overall Leaderboard")
    }
    
    # B) Sensitivity–Specificity Trade-off (avg across metrics), faceted by Source
    trade <- stats_data %>%
      group_by(method, source_type) %>%
      summarise(
        sens = mean(sensitivity, na.rm = TRUE),
        spec = mean(specificity, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      filter(is.finite(sens), is.finite(spec))
    
    pB <- if (nrow(trade) > 0) {
      ggplot(trade, aes(spec, sens, label = method)) +
        geom_point(size = 2) +
        ggrepel::geom_text_repel(size = 3, max.overlaps = 20, seed = 1) +
        facet_wrap(~ source_type, drop = FALSE) +
        labs(title = "B) Sensitivity–Specificity Trade-off",
             x = "Specificity", y = "Sensitivity") +
        coord_cartesian(xlim = c(0,1), ylim = c(0,1), expand = TRUE) +
        theme_minimal(base_size = 11) +
        theme(plot.title = element_text(face = "bold"))
    } else {
      placeholder_plot("B) Sensitivity–Specificity Trade-off")
    }
    
    # Assemble (heatmap removed)
    summary_plot <- (pA / pB) +
      plot_layout(heights = c(1, 1)) &
      theme(plot.margin = margin(6, 6, 6, 6))
    
    print(summary_plot)

    Consensus Analysis

    Feature-Level Consensus

    For this analysis, we implement four vote-based consensus strategies over feature-level calls made by individual tools (q ≤ 0.05):

    • ≥2 tools: Genome is DA if at least 2 tools call DA
    • Majority vote: Genome is DA if >50% of tools call DA
    • ≥75% tools: Genome is DA if ≥75% of tools call DA
    • All tools: Genome is DA only if all tools call DA
    suppressPackageStartupMessages({ library(readr); library(purrr) })
    
    # Define metrics and tools present in this run
    all_metrics <- c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")
    tools_raw <- c("aldex2", "ancombc2", "metagenomeseq", "maaslin2", "wilcoxon")
    tool_pretty_map <- c(
      aldex2 = "ALDEx2",
      ancombc2 = "ANCOM-BC2",
      metagenomeseq = "metagenomeSeq",
      maaslin2 = "MaAsLin2",
      wilcoxon = "Wilcoxon"
    )
    
    # Normalization helper to match truth_table 'genome'
    normalize_id <- function(x) stringr::str_replace_all(x, "[^A-Za-z0-9]", "")
    
    # Build master_df of per-feature results across tools and metrics
    read_tool_results <- function(metric, tool) {
      path <- file.path(results_input_dir, metric, paste0(tool, "_results.csv"))
      if (!file.exists(path)) return(NULL)
      df <- suppressMessages(readr::read_csv(path, show_col_types = FALSE))
      if (!all(c("feature", "qval") %in% names(df))) return(NULL)
      df %>%
        transmute(
          genome_raw = as.character(feature),
          genome = normalize_id(as.character(feature)),
          adjP = as.numeric(qval),
          direction = as.character(dplyr::coalesce(direction, NA_character_)),
          tool = tool_pretty_map[[tool]],
          metric = metric,
          source_type = ifelse(stringr::str_starts(genome_raw, "bin_"), "bin", "reference")
        )
    }
    
    master_df <- map_dfr(all_metrics, function(m) {
      map_dfr(tools_raw, function(t) read_tool_results(m, t))
    }) %>%
      distinct()
    
    # Join ground truth (truth_table already normalized earlier in this Rmd)
    truth_lookup <- truth_table %>% dplyr::select(genome, is_DA) %>% tibble::as_tibble()
    master_df <- tibble::as_tibble(master_df) %>% dplyr::left_join(truth_lookup, by = "genome")
    
    # Derive per-tool DA calls at q ≤ 0.05 (user-confirmed)
    alpha_q <- 0.05
    master_df <- master_df %>% mutate(da_call = !is.na(adjP) & adjP <= alpha_q)
    
    # Build consensus summaries per metric (over all source types combined)
    all_consensus_summaries <- list()
    
    for (current_metric in all_metrics) {
      metric_calls <- master_df %>% filter(metric == current_metric)
    
      # n_tools_total per genome is the number of tools that produced a result for that genome
      consensus_summary <- metric_calls %>%
        group_by(genome) %>%
        summarise(
          n_tools_total = dplyr::n_distinct(tool),
          n_tools_da = sum(da_call, na.rm = TRUE),
          prop_tools_da = ifelse(n_tools_total > 0, n_tools_da / n_tools_total, NA_real_),
          is_DA = dplyr::first(is_DA),
          .groups = 'drop'
        ) %>%
        mutate(
          consensus_2plus = n_tools_da >= 2,
          consensus_majority = n_tools_da > (n_tools_total / 2),
          consensus_75plus = n_tools_total > 0 & (n_tools_da / n_tools_total) >= 0.75,
          consensus_all = n_tools_total > 0 & n_tools_da == n_tools_total
        )
    
      all_consensus_summaries[[current_metric]] <- consensus_summary
    }
    
    # Quick sanity: show coverage of tools per metric
    tool_coverage <- master_df %>%
      dplyr::count(metric, tool) %>% dplyr::arrange(metric, tool)
    
      tool_coverage %>%
      kable(caption = "Per-metric tool result counts (rows)") %>%
      kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
    Per-metric tool result counts (rows)
    metric tool n
    breadth ALDEx2 239
    breadth ANCOM-BC2 239
    breadth MaAsLin2 235
    breadth Wilcoxon 239
    breadth metagenomeSeq 239
    counts ALDEx2 239
    counts ANCOM-BC2 239
    counts MaAsLin2 235
    counts Wilcoxon 239
    counts metagenomeSeq 239
    coverage ANCOM-BC2 235
    coverage MaAsLin2 235
    coverage Wilcoxon 235
    coverage metagenomeSeq 235
    relative ANCOM-BC2 235
    relative MaAsLin2 235
    relative Wilcoxon 235
    relative metagenomeSeq 235
    rpkm ANCOM-BC2 235
    rpkm MaAsLin2 235
    rpkm Wilcoxon 235
    rpkm metagenomeSeq 235
    tpm ANCOM-BC2 239
    tpm MaAsLin2 235
    tpm Wilcoxon 239
    tpm metagenomeSeq 239

    Consensus Heatmaps

    suppressPackageStartupMessages({
      library(tidyr); library(scales)
      if (!requireNamespace("ggnewscale", quietly = TRUE)) {
        # Optional install if not present; comment out for offline runs
        # install.packages("ggnewscale")
      }
      library(ggnewscale)
    })
    
    # Get log2FC values for ordering
    log2fc_data <- truth_table %>%
      select(genome, true_log2FC) %>%
      mutate(true_log2FC = replace_na(true_log2FC, 0))
    
    for (current_metric in all_metrics) {
      cat("\n\n#### Consensus Heatmap:", toupper(current_metric), "Metric\n\n")
    
      # Prepare per-tool calls for current metric
      current_metric_data <- master_df %>%
        filter(metric == current_metric, !is.na(adjP)) %>%
        select(genome, tool, adjP, is_DA)
    
      individual_tools <- current_metric_data %>%
        mutate(da_call = adjP <= alpha_q) %>%
        select(genome, tool, da_call, is_DA) %>%
        mutate(correct_call = case_when(
          da_call & is_DA ~ "TP",
          !da_call & !is_DA ~ "TN",
          da_call & !is_DA ~ "FP",
          !da_call & is_DA ~ "FN",
          TRUE ~ NA_character_
        )) %>%
        select(genome, tool, da_call, correct_call, is_DA)
    
      consensus_summary_current <- all_consensus_summaries[[current_metric]]
    
      heatmap_data <- consensus_summary_current %>%
        dplyr::left_join(log2fc_data, by = "genome") %>%
        mutate(
          consensus_2plus_correct = case_when(
            consensus_2plus & is_DA ~ "TP",
            !consensus_2plus & !is_DA ~ "TN",
            consensus_2plus & !is_DA ~ "FP",
            !consensus_2plus & is_DA ~ "FN"
          ),
          consensus_majority_correct = case_when(
            consensus_majority & is_DA ~ "TP",
            !consensus_majority & !is_DA ~ "TN",
            consensus_majority & !is_DA ~ "FP",
            !consensus_majority & is_DA ~ "FN"
          ),
          consensus_75plus_correct = case_when(
            consensus_75plus & is_DA ~ "TP",
            !consensus_75plus & !is_DA ~ "TN",
            consensus_75plus & !is_DA ~ "FP",
            !consensus_75plus & is_DA ~ "FN"
          ),
          consensus_all_correct = case_when(
            consensus_all & is_DA ~ "TP",
            !consensus_all & !is_DA ~ "TN",
            consensus_all & !is_DA ~ "FP",
            !consensus_all & is_DA ~ "FN"
          )
        ) %>%
        arrange(desc(abs(true_log2FC)), desc(prop_tools_da), genome) %>%
        mutate(genome_order = dplyr::row_number())
    
      cat("**Data Summary:**\n")
      cat("- Genomes analyzed:", nrow(heatmap_data), "\n")
      cat("- Available tools:", paste(unique(individual_tools$tool), collapse = ", "), "\n\n")
    
      # 1) Metadata block
      metadata_long <- heatmap_data %>%
        select(genome, genome_order, is_DA, n_tools_da, n_tools_total, prop_tools_da, true_log2FC) %>%
        mutate(is_DA_numeric = as.numeric(is_DA), true_log2FC_abs = abs(true_log2FC)) %>%
        select(-is_DA) %>%
        pivot_longer(cols = -c(genome, genome_order), names_to = "column", values_to = "value") %>%
        mutate(column_type = "metadata")
    
      # 2) Individual tools block
      tools_long <- individual_tools %>%
        select(genome, tool, correct_call) %>%
        dplyr::left_join(heatmap_data %>% dplyr::select(genome, genome_order), by = "genome") %>%
        dplyr::rename(column = tool, value = correct_call) %>%
        mutate(column_type = "individual_tool")
    
      # 3) Consensus block
      consensus_long <- heatmap_data %>%
        select(genome, genome_order,
               consensus_2plus_correct, consensus_majority_correct,
               consensus_75plus_correct, consensus_all_correct) %>%
        pivot_longer(cols = -c(genome, genome_order), names_to = "column", values_to = "value") %>%
        mutate(
          column = case_when(
            column == "consensus_2plus_correct" ~ "≥2 Tools",
            column == "consensus_majority_correct" ~ "Majority",
            column == "consensus_75plus_correct" ~ "≥75%",
            column == "consensus_all_correct" ~ "All Tools"
          ),
          column_type = "consensus"
        )
    
      # Harmonize numeric encodings for fills
      metadata_long <- metadata_long %>% mutate(value_numeric = as.numeric(value), value = as.character(value))
      tools_long <- tools_long %>% mutate(
        value_numeric = dplyr::case_when(
          value == "TP" ~ 1, value == "TN" ~ 0.5, value == "FP" ~ -0.5, value == "FN" ~ -1, TRUE ~ 0
        ), value = as.character(value))
      consensus_long <- consensus_long %>% mutate(
        value_numeric = dplyr::case_when(
          value == "TP" ~ 1, value == "TN" ~ 0.5, value == "FP" ~ -0.5, value == "FN" ~ -1, TRUE ~ 0
        ), value = as.character(value))
    
      # Column order reflecting actual tools
      plot_data <- dplyr::bind_rows(metadata_long, tools_long, consensus_long) %>%
        mutate(
          column_order = dplyr::case_when(
            column == "is_DA_numeric" ~ 1,
            column == "n_tools_da" ~ 2,
            column == "n_tools_total" ~ 3,
            column == "prop_tools_da" ~ 4,
            column == "ALDEx2" ~ 5,
            column == "ANCOM-BC2" ~ 6,
            column == "metagenomeSeq" ~ 7,
            column == "MaAsLin2" ~ 8,
            column == "Wilcoxon" ~ 9,
            column == "≥2 Tools" ~ 10,
            column == "Majority" ~ 11,
            column == "≥75%" ~ 12,
            column == "All Tools" ~ 13,
            column == "true_log2FC" ~ 14,
            column == "true_log2FC_abs" ~ 15,
            TRUE ~ 99
          )
        ) %>%
        arrange(column_order)
    
      p_single <- plot_data %>%
        ggplot(aes(x = reorder(column, column_order), y = reorder(genome, -genome_order))) +
        geom_tile(aes(fill = ifelse(column_type == "metadata", value_numeric, NA)), color = "white", size = 0.1) +
        scale_fill_gradient2(low = "white", mid = "lightblue", high = "darkblue", midpoint = 0, na.value = "transparent", name = "Numeric Value") +
        { if (requireNamespace("ggnewscale", quietly = TRUE)) ggnewscale::new_scale_fill() else NULL } +
        geom_tile(data = dplyr::filter(plot_data, column_type %in% c("individual_tool", "consensus")),
                  aes(fill = value), color = "white", size = 0.1) +
        scale_fill_manual(values = c("TP" = "#2d8f2d", "TN" = "#90EE90", "FP" = "#dc143c", "FN" = "#ffb6c1"), name = "Classification", na.value = "transparent") +
        labs(title = paste("Comprehensive Consensus Analysis Heatmap:", stringr::str_to_title(current_metric), "Metric"),
             subtitle = "Genome-level results across individual tools and consensus strategies",
             x = "Analysis Components", y = "Genomes (ordered by |log2FC| and DA proportion)") +
        theme_minimal() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
              axis.text.y = element_text(size = 6),
              plot.title = element_text(face = "bold", size = 14),
              plot.subtitle = element_text(size = 11),
              legend.position = "right", legend.box = "vertical", panel.grid = element_blank(), axis.ticks = element_blank()) +
        geom_vline(xintercept = c(4.5, 9.5, 13.5), color = "black", size = 1, alpha = 0.8)
    
      print(p_single)
      cat("\n\n")
    }
    ## 
    ## 
    ## #### Consensus Heatmap: COUNTS Metric
    ## 
    ## **Data Summary:**
    ## - Genomes analyzed: 239 
    ## - Available tools: ALDEx2, ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

    ## 
    ## 
    ## 
    ## 
    ## #### Consensus Heatmap: TPM Metric
    ## 
    ## **Data Summary:**
    ## - Genomes analyzed: 239 
    ## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

    ## 
    ## 
    ## 
    ## 
    ## #### Consensus Heatmap: RPKM Metric
    ## 
    ## **Data Summary:**
    ## - Genomes analyzed: 235 
    ## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

    ## 
    ## 
    ## 
    ## 
    ## #### Consensus Heatmap: RELATIVE Metric
    ## 
    ## **Data Summary:**
    ## - Genomes analyzed: 235 
    ## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

    ## 
    ## 
    ## 
    ## 
    ## #### Consensus Heatmap: BREADTH Metric
    ## 
    ## **Data Summary:**
    ## - Genomes analyzed: 239 
    ## - Available tools: ALDEx2, ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

    ## 
    ## 
    ## 
    ## 
    ## #### Consensus Heatmap: COVERAGE Metric
    ## 
    ## **Data Summary:**
    ## - Genomes analyzed: 235 
    ## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

    Compute Consensus Performance (for comparison plots)

    # ===========================
    # Full replacement: ALL plots as one A4-sized figure
    # ===========================
    suppressPackageStartupMessages({
      library(tidyverse)
      library(patchwork)
      library(viridis)
      library(scales)
    })
    
    # ---- Helper: safe reorder that tolerates all-NA gracefully ----
    safe_reorder <- function(x, by) {
      if (all(is.na(by))) return(factor(x, levels = unique(x)))
      reorder(x, by)
    }
    
    # ---- Helper to compute confusion metrics (from your snippet) ----
    compute_metrics <- function(pred, truth) {
      tp <- sum(pred & truth, na.rm = TRUE)
      fp <- sum(pred & !truth, na.rm = TRUE)
      fn <- sum(!pred & truth, na.rm = TRUE)
      tn <- sum(!pred & !truth, na.rm = TRUE)
      precision <- ifelse((tp + fp) > 0, tp / (tp + fp), NA_real_)
      recall <- ifelse((tp + fn) > 0, tp / (tp + fn), NA_real_)
      specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA_real_)
      f1 <- ifelse((precision + recall) > 0, 2 * precision * recall / (precision + recall), NA_real_)
      tibble(tp = tp, fp = fp, fn = fn, tn = tn,
             precision = precision, sensitivity = recall, specificity = specificity, f1_score = f1)
    }
    
    # ---- Consensus strategies (from your snippet) ----
    consensus_strategies <- c("consensus_2plus", "consensus_majority", "consensus_75plus", "consensus_all")
    strategy_names       <- c("≥2 tools", "Majority (>50%)", "≥75% tools", "All tools")
    
    # ==== 1) Build consensus performance across metrics ====
    # expects: `all_metrics` (character vec) and `all_consensus_summaries` (named list) to exist
    all_consensus_performance <- purrr::map_dfr(all_metrics, function(m) {
      cs <- all_consensus_summaries[[m]]
      purrr::map_dfr(seq_along(consensus_strategies), function(i) {
        flag <- consensus_strategies[i]
        name <- strategy_names[i]
        pred <- cs[[flag]]
        truth <- cs$is_DA
        mets <- compute_metrics(pred, truth)
        mets %>%
          mutate(metric = m, strategy = name, auprc = NA_real_) %>%
          select(metric, strategy, f1_score, precision, sensitivity, specificity, auprc)
      })
    })
    
    # ==== 2) Build individual tool performance per metric ====
    # expects: `da_results` tibble with method, metric_type, source_type, precision, sensitivity, specificity, auprc
    recode_methods <- function(x) dplyr::recode(x,
      ancombc2 = "ANCOM-BC2",
      aldex2   = "ALDEx2",
      metagenomeseq = "metagenomeSeq",
      maaslin2 = "MaAsLin2",
      wilcoxon = "Wilcoxon",
      .default = x
    )
    
    get_individual_performance <- function(metric_name) {
      da_results %>%
        filter(metric_type == metric_name, source_type %in% c("reference", "bin")) %>%
        mutate(f1_score = 2 * (precision * sensitivity) / pmax(precision + sensitivity, .Machine$double.eps)) %>%
        group_by(method) %>%
        summarise(
          f1_score   = mean(f1_score,   na.rm = TRUE),
          precision  = mean(precision,  na.rm = TRUE),
          sensitivity= mean(sensitivity,na.rm = TRUE),
          specificity= mean(specificity,na.rm = TRUE),
          auprc      = mean(auprc,      na.rm = TRUE),
          .groups = 'drop'
        ) %>%
        mutate(type = "Individual tool",
               strategy = recode_methods(method)) %>%
        select(strategy, f1_score, precision, sensitivity, specificity, auprc, type)
    }
    
    # ==== 3) One function that makes a 2-panel block (Sensitivity | Specificity) for a metric ====
    make_metric_block <- function(metric_name) {
      ind <- get_individual_performance(metric_name)
      con <- all_consensus_performance %>%
        filter(metric == metric_name) %>%
        mutate(type = "Consensus") %>%
        select(strategy, f1_score, precision, sensitivity, specificity, auprc, type)
    
      combined <- bind_rows(con, ind) %>% mutate(metric = metric_name)
    
      # Sensitivity
      p_sens <- ggplot(
        combined,
        aes(x = safe_reorder(strategy, sensitivity), y = sensitivity, fill = type)
      ) +
        geom_col(width = 0.7) +
        scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
        scale_fill_manual(values = c("Consensus" = "#e31a1c", "Individual tool" = "#1f78b4")) +
        coord_flip() +
        labs(title = paste("Sensitivity —", stringr::str_to_title(metric_name)),
             x = NULL, y = "Sensitivity", fill = NULL) +
        theme_minimal(base_size = 11) +
        theme(legend.position = "bottom",
              plot.title = element_text(face = "bold"),
              axis.text.x = element_text())
    
      # Specificity
      p_spec <- ggplot(
        combined,
        aes(x = safe_reorder(strategy, specificity), y = specificity, fill = type)
      ) +
        geom_col(width = 0.7) +
        scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
        scale_fill_manual(values = c("Consensus" = "#e31a1c", "Individual tool" = "#1f78b4")) +
        coord_flip() +
        labs(title = paste("Specificity —", stringr::str_to_title(metric_name)),
             x = NULL, y = "Specificity", fill = NULL) +
        theme_minimal(base_size = 11) +
        theme(legend.position = "bottom",
              plot.title = element_text(face = "bold"),
              axis.text.x = element_text())
    
      # Return a 2-panel block for this metric
      (p_sens | p_spec) + plot_layout(guides = "collect") & theme(legend.position = "bottom")
    }
    
    # ==== 4) Build all metric blocks and a cross-metric summary heatmap ====
    metric_blocks <- purrr::map(all_metrics, make_metric_block)
    
    # Cross-metric summary for consensus strategies only (visual, no table)
    cross_metric_heat <- all_consensus_performance %>%
      group_by(strategy) %>%
      summarise(
        F1          = mean(f1_score,    na.rm = TRUE),
        Precision   = mean(precision,   na.rm = TRUE),
        Sensitivity = mean(sensitivity, na.rm = TRUE),
        Specificity = mean(specificity, na.rm = TRUE),
        .groups = "drop"
      ) %>%
      pivot_longer(-strategy, names_to = "Measure", values_to = "Value")
    
    p_heat <- ggplot(cross_metric_heat, aes(Measure, strategy, fill = Value)) +
      geom_tile(color = "white", linewidth = 0.25) +
      scale_fill_viridis(limits = c(0, 1), option = "C", guide = guide_colorbar(title.position = "top")) +
      labs(title = "Cross-metric Consensus Summary (mean across metrics)",
           x = "Measure", y = "Consensus strategy", fill = "Score") +
      theme_minimal(base_size = 11) +
      theme(plot.title = element_text(face = "bold"),
            axis.text.x = element_text(angle = 45, hjust = 1))
    
    # ==== 5) Assemble EVERYTHING into ONE big figure with tags ====
    # Arrange metric blocks vertically; add heatmap at the bottom
    big_plot <- wrap_plots(metric_blocks, ncol = 1) /
                p_heat +
      plot_layout(heights = c(rep(1, length(metric_blocks)), 0.9)) +
      plot_annotation(tag_levels = "A")
    
    # Print (may require a larger pane); best to use ggsave below
    print(big_plot)

    • Observation: Consensus strategies (red) almost always push sensitivity upward compared to any single method (blue).

    The “All tools” consensus achieves the most balanced average performance but other strategies could produce better sensitivity at the cost of worse specificity (Panel M). Adding more methods could further increase performance.

    Always-Wrong Genomes (all tools misclassified)

    There are some genomes that are always misclassifed by all tools. We would like to see if these genomes have any particular properties (genus, genome_size, gc_content) that can influence their misclassification. Lets first identify the misclassified ref and bins genome from different metrics and gather their properties.

    library(rentrez)
    library(xml2)
    library(Biostrings)
    
    # Identify genomes where all available tools were wrong (per metric)
    always_wrong_list <- list()
    
    for (current_metric in all_metrics) {
      calls <- master_df %>%
        filter(metric == current_metric) %>%
        mutate(is_wrong = case_when(
          da_call & !is_DA ~ TRUE,   # FP
          !da_call & is_DA ~ TRUE,   # FN
          TRUE ~ FALSE
        ))
      per_genome <- calls %>%
        group_by(genome) %>%
        summarise(n_tools_total = dplyr::n_distinct(tool),
                  n_wrong = sum(is_wrong, na.rm = TRUE),
                  is_DA = dplyr::first(is_DA),
                  source_genome = dplyr::first(source_type),
                  .groups = 'drop') %>%
        mutate(always_wrong = n_tools_total >= 2 & n_wrong == n_tools_total,
               wrong_type = dplyr::case_when(
                 always_wrong & is_DA ~ "FN_all",
                 always_wrong & !is_DA ~ "FP_all",
                 TRUE ~ NA_character_
               )) %>%
        filter(always_wrong)
    
      # Join truth_table for accessions and compute query_accession per source type
      per_genome <- per_genome %>%
        dplyr::left_join(truth_table %>% dplyr::select(genome, genome_accession, closest_genome), by = "genome") %>%
        mutate(
          closest_genome_dotted = ifelse(!is.na(closest_genome),
                                         sub("([0-9])$", ".\\1", closest_genome), NA_character_),
          query_accession = dplyr::case_when(
            source_genome == "reference" & !is.na(genome_accession) ~ genome_accession,
            source_genome == "bin" & !is.na(closest_genome_dotted) ~ closest_genome_dotted,
            TRUE ~ genome_accession
          ),
          metric = current_metric
        )
      always_wrong_list[[current_metric]] <- per_genome
    }
    
    always_wrong_all <- dplyr::bind_rows(always_wrong_list)
    
    cat("Genomes with unanimous misclassification (by metric):", nrow(always_wrong_all), "\n")
    ## Genomes with unanimous misclassification (by metric): 162
    # Helper: extract first non-null field from esummary
    get_first <- function(x, fields) {
      for (f in fields) {
        if (!is.null(x[[f]])) return(x[[f]])
      }
      return(NA)
    }
    
    # Fetch NCBI metadata for a vector of accessions
    fetch_ncbi_meta <- function(acc_vec) {
      acc_vec <- unique(acc_vec[!is.na(acc_vec)])
      if (length(acc_vec) == 0) return(tibble())
      purrr::map_dfr(acc_vec, function(acc) {
        # Normalize accession to dotted form if missing (e.g., add .1)
        dotted <- if (grepl("\\.\\d+$", acc)) acc else paste0(acc, ".1")
        # Try exact assembly accession match first
        uid <- tryCatch({ rentrez::entrez_search(db = "assembly", term = paste0(dotted, "[Assembly Accession]"))$ids }, error = function(e) NULL)
        if (is.null(uid) || length(uid) == 0) {
          # fallback: try the original string
          uid <- tryCatch({ rentrez::entrez_search(db = "assembly", term = acc)$ids }, error = function(e) NULL)
        }
        if (!is.null(uid) && length(uid) > 0) {
          sm <- tryCatch({ rentrez::entrez_summary(db = "assembly", id = uid[1]) }, error = function(e) NULL)
          if (!is.null(sm)) {
            # Extract basic fields
            asm_acc <- get_first(sm, c("assemblyaccession", "assembly_accession"))
            org <- get_first(sm, c("organism", "organism_name"))
            gc <- suppressWarnings(as.numeric(get_first(sm, c("gcpercent", "gc_percent", "gc"))))
            size <- suppressWarnings(as.numeric(get_first(sm, c("seq_length_sum", "totallength", "total_length"))))
            taxid <- suppressWarnings(as.character(get_first(sm, c("taxid", "tax_id"))))
    
            # Taxonomy via taxonomy DB if taxid available (prefer rank-aware XML)
            fam <- genus <- species <- NA_character_
            if (!is.na(taxid)) {
              tx_xml <- tryCatch({ rentrez::entrez_fetch(db = "taxonomy", id = taxid, rettype = "xml") }, error = function(e) NULL)
              if (!is.null(tx_xml)) {
                doc <- tryCatch(xml2::read_xml(tx_xml), error = function(e) NULL)
                if (!is.null(doc)) {
                  fam <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='family']/ScientificName"))
                  genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
                  species <- xml2::xml_text(xml2::xml_find_first(doc, "//Taxon/ScientificName"))
                  if (identical(fam, character(0))) fam <- NA_character_
                  if (identical(genus, character(0))) genus <- NA_character_
                  if (identical(species, character(0))) species <- NA_character_
                }
              }
            }
    
            return(tibble(accession = acc, assembly_accession = asm_acc, organism = org,
                          genome_size = size, gc_percent = gc, family = fam,
                          genus = genus, species = species))
          }
        }
        # Fallback: no metadata
        tibble(accession = acc, assembly_accession = NA_character_, organism = NA_character_,
               genome_size = NA_real_, gc_percent = NA_real_, family = NA_character_,
               genus = NA_character_, species = NA_character_)
      })
    }
    
    # Pull NCBI metadata with a small cache (assembly meta + taxonomy)
    cache_dir <- file.path(results_dir, "cache")
    if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
    cache_file <- file.path(cache_dir, "assembly_meta.csv")
    ncbi_meta <- tibble()
    accs <- unique(always_wrong_all$query_accession[!is.na(always_wrong_all$query_accession)])
    cached <- if (file.exists(cache_file)) suppressMessages(readr::read_csv(cache_file, show_col_types = FALSE)) else tibble()
    to_fetch <- setdiff(accs, cached$accession %||% character())
    fetched <- if (length(to_fetch) > 0) fetch_ncbi_meta(to_fetch) else tibble()
    if (nrow(fetched) > 0) {
      cached <- dplyr::bind_rows(cached, fetched) %>% dplyr::distinct(accession, .keep_all = TRUE)
      readr::write_csv(cached, cache_file)
    }
    ncbi_meta <- cached %>% dplyr::filter(accession %in% accs)
    
    # Compute GC% and genome size from FASTA (online) using Biostrings
    compute_gc_size_online <- function(assembly_acc) {
      if (is.na(assembly_acc) || assembly_acc == "")
        return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
      # find assembly
      dotted <- if (grepl("\\.\\d+$", assembly_acc)) assembly_acc else paste0(assembly_acc, ".1")
      es <- tryCatch(entrez_search(db = "assembly", term = paste0(dotted, "[Assembly Accession]"), retmax = 1), error = function(e) NULL)
      if (is.null(es) || length(es$ids) == 0) {
        es <- tryCatch(entrez_search(db = "assembly", term = assembly_acc, retmax = 1), error = function(e) NULL)
      }
      if (is.null(es) || length(es$ids) == 0) {
        return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
      }
      sm <- tryCatch(entrez_summary(db = "assembly", id = es$ids[[1]]), error = function(e) NULL)
      if (is.null(sm)) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
      ftp <- sm$ftppath_genbank
      if (is.null(ftp) || ftp == "") ftp <- sm$ftppath_refseq
      if (is.null(ftp) || ftp == "") return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
      base <- basename(ftp)
      fna_url <- paste0(ftp, "/", base, "_genomic.fna.gz")
      # cache downloads
      fna_cache_dir <- file.path(cache_dir, "fna")
      if (!dir.exists(fna_cache_dir)) dir.create(fna_cache_dir, recursive = TRUE)
      dest <- file.path(fna_cache_dir, paste0(gsub("/", "_", base), "_genomic.fna.gz"))
      if (!file.exists(dest)) {
        ok <- tryCatch({ utils::download.file(fna_url, dest, mode = "wb", quiet = TRUE); TRUE }, error = function(e) FALSE)
        if (!ok) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = ftp))
      }
      # read and compute stats
      dna <- tryCatch(Biostrings::readDNAStringSet(dest), error = function(e) NULL)
      if (is.null(dna)) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = ftp))
      total_bp <- sum(Biostrings::width(dna))
      # letterFrequency can compute for all sequences; sum over them
      lf <- Biostrings::letterFrequency(dna, c("A","C","G","T"))
      atgc <- sum(lf)
      gc <- sum(lf[,"C"] + lf[,"G"])
      gc_pct <- ifelse(atgc > 0, 100 * gc / atgc, NA_real_)
      tibble(accession = assembly_acc, n_sequences = length(dna), genome_size_bp = as.numeric(total_bp), gc_percent_fasta = round(gc_pct, 3), ftp = ftp)
    }
    
    # Run online GC/size computation (cached FASTA files)
    gcsize_cache_file <- file.path(cache_dir, "gc_size_meta.csv")
    gcsize <- tibble()
    accs <- unique(always_wrong_all$query_accession[!is.na(always_wrong_all$query_accession)])
    cached2 <- if (file.exists(gcsize_cache_file)) suppressMessages(readr::read_csv(gcsize_cache_file, show_col_types = FALSE)) else tibble()
    to_compute <- setdiff(accs, cached2$accession %>% unique() %||% character())
    computed <- if (length(to_compute) > 0) purrr::map_dfr(to_compute, compute_gc_size_online) else tibble()
    if (nrow(computed) > 0) {
      cached2 <- dplyr::bind_rows(cached2, computed) %>% dplyr::distinct(accession, .keep_all = TRUE)
      readr::write_csv(cached2, gcsize_cache_file)
    }
    gcsize <- cached2 %>% dplyr::filter(accession %in% accs)
    
    # Combine
    always_wrong_annot <- always_wrong_all %>%
      mutate(accession = query_accession) %>%
      dplyr::left_join(ncbi_meta, by = "accession") %>%
      dplyr::left_join(gcsize, by = "accession") %>%
      mutate(
        genome_size_bp = dplyr::coalesce(as.numeric(genome_size_bp), as.numeric(genome_size)),
        gc_percent_final = dplyr::coalesce(as.numeric(gc_percent_fasta), as.numeric(gc_percent)),
        meta_source = dplyr::case_when(
          !is.na(gc_percent_fasta) | !is.na(genome_size_bp) ~ "FASTA",
          (!is.na(genome_size) | !is.na(gc_percent)) ~ "NCBI_Summary",
          TRUE ~ NA_character_
        )
      ) %>%
      select(metric, source_genome, genome, wrong_type, n_tools_total,
             genome_size_bp, gc_percent_final, family, genus, species, accession, meta_source)
    
    # Summary counts
    summary_wrong <- always_wrong_annot %>%
      group_by(metric, source_genome, wrong_type) %>%
      summarise(n_genomes = dplyr::n_distinct(genome), .groups = 'drop') %>%
      arrange(metric, source_genome, wrong_type)
    
    kable(summary_wrong, caption = "Counts of genomes unanimously misclassified (by metric, source, and type)") %>%
      kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
    Counts of genomes unanimously misclassified (by metric, source, and type)
    metric source_genome wrong_type n_genomes
    breadth reference FP_all 21
    counts reference FP_all 23
    coverage bin FN_all 1
    coverage reference FP_all 21
    relative reference FP_all 42
    rpkm bin FN_all 1
    rpkm reference FP_all 27
    tpm bin FN_all 2
    tpm reference FP_all 24
    # Split tables by source
    ref_wrong <- always_wrong_annot %>% filter(source_genome == "reference")
    bin_wrong <- always_wrong_annot %>% filter(source_genome == "bin")
    
    ref_dt <- DT::datatable(ref_wrong,
                            caption = "Reference genomes misclassified by all tools (per metric)",
                            options = list(pageLength = 10, scrollX = TRUE),
                            rownames = FALSE)
    
    bin_dt <- DT::datatable(bin_wrong,
                            caption = "Bin genomes misclassified by all tools (per metric)",
                            options = list(pageLength = 10, scrollX = TRUE),
                            rownames = FALSE)
    
    # Ensure both widgets are rendered in HTML (avoid printing NULLs)
    htmltools::tagList(ref_dt, bin_dt)

    Correctness vs Genome Features

    We would like to see if genomes that are always correctly classified by DA methods against always misclassified genomes.

    # Build per-genome correctness summary per metric
    correctness_long <- master_df %>%
      filter(!is.na(adjP)) %>%
      mutate(is_correct = case_when(
        da_call & is_DA ~ TRUE,   # TP
        !da_call & !is_DA ~ TRUE, # TN
        da_call & !is_DA ~ FALSE, # FP
        !da_call & is_DA ~ FALSE, # FN
        TRUE ~ NA
      )) %>%
      select(metric, genome, tool, is_correct, is_DA, source_genome = source_type)
    
    correctness_by_genome <- correctness_long %>%
      group_by(metric, genome) %>%
      summarise(
        n_tools_total = n_distinct(tool),
        n_correct = sum(is_correct, na.rm = TRUE),
        n_wrong = sum(!is_correct, na.rm = TRUE),
        is_DA = dplyr::first(is_DA),
        source_genome = dplyr::first(source_genome),
        .groups = 'drop'
      ) %>%
      mutate(
        always_wrong = n_tools_total >= 2 & n_correct == 0,
        always_right = n_tools_total >= 2 & n_correct == n_tools_total,
        group = case_when(
          always_wrong ~ "always_wrong",
          always_right ~ "always_right",
          TRUE ~ "other"
        )
      )
    
    # Keep only always_wrong vs always_right
    comp_set <- correctness_by_genome %>% filter(group %in% c("always_wrong", "always_right"))
    
    # Map genomes to accessions (for metadata fetch)
    acc_map <- truth_table %>% select(genome, genome_accession, closest_genome) %>%
      mutate(closest_genome_dotted = ifelse(!is.na(closest_genome), sub("([0-9])$", ".\\1", closest_genome), NA_character_))
    
    comp_set <- comp_set %>%
      dplyr::left_join(acc_map, by = "genome") %>%
      mutate(query_accession = dplyr::case_when(
        source_genome == "reference" & !is.na(genome_accession) ~ genome_accession,
        source_genome == "bin" & !is.na(closest_genome_dotted) ~ closest_genome_dotted,
        TRUE ~ genome_accession
      ))
    
    # Fetch metadata for both groups using cached functions/utilities from previous chunk
    cache_dir <- file.path(results_dir, "cache")
    if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
    cache_file <- file.path(cache_dir, "assembly_meta.csv")
    gcsize_cache_file <- file.path(cache_dir, "gc_size_meta.csv")
    
    accs_all <- unique(comp_set$query_accession[!is.na(comp_set$query_accession)])
    
    # Assembly/taxonomy meta
    meta_cached <- if (file.exists(cache_file)) suppressMessages(readr::read_csv(cache_file, show_col_types = FALSE)) else tibble()
    to_fetch <- setdiff(accs_all, meta_cached$accession %||% character())
    meta_new <- if (length(to_fetch) > 0) fetch_ncbi_meta(to_fetch) else tibble()
    if (nrow(meta_new) > 0) {
      meta_cached <- dplyr::bind_rows(meta_cached, meta_new) %>% dplyr::distinct(accession, .keep_all = TRUE)
      readr::write_csv(meta_cached, cache_file)
    }
    ncbi_meta_all <- meta_cached %>% dplyr::filter(accession %in% accs_all)
    
    # FASTA-derived GC/size
    gc_cached <- if (file.exists(gcsize_cache_file)) suppressMessages(readr::read_csv(gcsize_cache_file, show_col_types = FALSE)) else tibble()
    to_compute <- setdiff(accs_all, gc_cached$accession %||% character())
    gc_new <- if (length(to_compute) > 0) purrr::map_dfr(to_compute, compute_gc_size_online) else tibble()
    if (nrow(gc_new) > 0) {
      gc_cached <- dplyr::bind_rows(gc_cached, gc_new) %>% dplyr::distinct(accession, .keep_all = TRUE)
      readr::write_csv(gc_cached, gcsize_cache_file)
    }
    gcsize_all <- gc_cached %>% dplyr::filter(accession %in% accs_all)
    
    # Attach metadata
    comp_meta <- comp_set %>%
      mutate(accession = query_accession) %>%
      dplyr::left_join(ncbi_meta_all, by = c("accession")) %>%
      dplyr::left_join(gcsize_all, by = c("accession")) %>%
      mutate(
        genome_size_bp = dplyr::coalesce(as.numeric(genome_size_bp), as.numeric(genome_size)),
        gc_percent_final = dplyr::coalesce(as.numeric(gc_percent_fasta), as.numeric(gc_percent))
      )
    
    # QC: counts per stratum
    qc_groups <- comp_meta %>% dplyr::count(metric, source_genome, group, name = "n_genomes") %>% dplyr::arrange(metric, source_genome, group)
    qc_table <- qc_groups %>%
      tidyr::pivot_wider(names_from = group, values_from = n_genomes, values_fill = 0)
    qc_table %>% kable(caption = "Counts per metric/source for always_wrong vs always_right") %>% kable_styling(bootstrap_options = c("striped","hover","condensed"))
    Counts per metric/source for always_wrong vs always_right
    metric source_genome always_right always_wrong
    breadth bin 4 1
    breadth reference 59 21
    counts bin 2 1
    counts reference 74 23
    coverage bin 2 2
    coverage reference 76 21
    relative bin 4 1
    relative reference 75 42
    rpkm bin 2 2
    rpkm reference 76 27
    tpm bin 2 3
    tpm reference 73 24
    # PLOTS
    library(ggplot2)
    
    # 1) Genome size plot (log10 bp)
    size_plot_data <- comp_meta %>% filter(!is.na(genome_size_bp)) %>% mutate(log10_size = log10(genome_size_bp))
    
    # Wilcoxon rank-sum per facet (only where both groups exist)
    suppressPackageStartupMessages({ library(rstatix); library(ggpubr) })
    size_facets_ok <- size_plot_data %>%
      dplyr::count(metric, source_genome, group) %>%
      tidyr::pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
      filter(always_right > 0, always_wrong > 0) %>%
      select(metric, source_genome)
    size_pvals <- size_plot_data %>%
      dplyr::inner_join(size_facets_ok, by = c("metric","source_genome")) %>%
      group_by(metric, source_genome) %>%
      rstatix::wilcox_test(log10_size ~ group, ref.group = "always_right") %>%
      ungroup() %>%
      mutate(p.format = rstatix::p_format(p), group1 = "always_right", group2 = "always_wrong")
    size_y <- size_plot_data %>% group_by(metric, source_genome) %>% summarise(y.position = max(log10_size, na.rm = TRUE) * 1.07, .groups = 'drop')
    size_pvals <- size_pvals %>% dplyr::left_join(size_y, by = c("metric","source_genome"))
    
    p_size <- ggplot(size_plot_data, aes(x = group, y = log10_size, fill = group)) +
      geom_violin(trim = TRUE, alpha = 0.7, color = NA) +
      geom_boxplot(width = 0.18, outlier.shape = NA, alpha = 0.9) +
      scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
      labs(title = "Genome size (log10 bp): always_wrong vs always_right", x = NULL, y = "log10(genome size bp)") +
      facet_grid(source_genome ~ metric, scales = "free_y") +
      theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))
    p_size <- p_size + ggpubr::stat_pvalue_manual(size_pvals, label = "p = {p.format}",
                                     xmin = "group1", xmax = "group2",
                                     y.position = "y.position", tip.length = 0.01, size = 3,
                                     inherit.aes = FALSE)
    print(p_size)

    # 2) GC% plot
    gc_plot_data <- comp_meta %>% filter(!is.na(gc_percent_final))
    
    # Wilcoxon rank-sum per facet (only where both groups exist)
    gc_facets_ok <- gc_plot_data %>%
      dplyr::count(metric, source_genome, group) %>%
      tidyr::pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
      filter(always_right > 0, always_wrong > 0) %>%
      select(metric, source_genome)
    gc_pvals <- gc_plot_data %>%
      dplyr::inner_join(gc_facets_ok, by = c("metric","source_genome")) %>%
      group_by(metric, source_genome) %>%
      rstatix::wilcox_test(gc_percent_final ~ group, ref.group = "always_right") %>%
      ungroup() %>%
      mutate(p.format = rstatix::p_format(p), group1 = "always_right", group2 = "always_wrong")
    gc_y <- gc_plot_data %>% group_by(metric, source_genome) %>% summarise(y.position = max(gc_percent_final, na.rm = TRUE) * 1.07, .groups = 'drop')
    gc_pvals <- gc_pvals %>% dplyr::left_join(gc_y, by = c("metric","source_genome"))
    
    p_gc <- ggplot(gc_plot_data, aes(x = group, y = gc_percent_final, fill = group)) +
      geom_violin(trim = TRUE, alpha = 0.7, color = NA) +
      geom_boxplot(width = 0.18, outlier.shape = NA, alpha = 0.9) +
      scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
      labs(title = "GC%: always_wrong vs always_right", x = NULL, y = "GC percent") +
      facet_grid(source_genome ~ metric) +
      theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))
    p_gc <- p_gc + ggpubr::stat_pvalue_manual(gc_pvals, label = "p = {p.format}",
                                  xmin = "group1", xmax = "group2",
                                  y.position = "y.position", tip.length = 0.01, size = 3,
                                  inherit.aes = FALSE)
    print(p_gc)

    # 3) Genus bar plots (percentage within group)
    genus_data <- comp_meta %>% filter(!is.na(genus) & genus != "") %>%
      group_by(metric, source_genome, group, genus) %>%
      summarise(n = n(), .groups = 'drop') %>%
      group_by(metric, source_genome, group) %>%
      mutate(pct = 100 * n / sum(n)) %>% ungroup()
    
    # Pick top 15 genera overall (within each metric+source across groups)
    top_genera <- genus_data %>%
      group_by(metric, source_genome, genus) %>% summarise(total = sum(n), .groups = 'drop') %>%
      group_by(metric, source_genome) %>% slice_max(order_by = total, n = 15, with_ties = FALSE) %>% ungroup()
    
    genus_top <- genus_data %>% dplyr::inner_join(top_genera %>% dplyr::select(metric, source_genome, genus), by = c("metric","source_genome","genus"))
    
    p_genus <- ggplot(genus_top, aes(x = reorder(genus, pct), y = pct, fill = group)) +
      geom_col(position = position_dodge(width = 0.75)) +
      coord_flip() +
      scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
      labs(title = "Genus composition: always_wrong vs always_right", x = "Genus (top 15)", y = "Percent within group", fill = "Group") +
      facet_grid(source_genome ~ metric, scales = "free_y") +
      theme_minimal()
    print(p_genus)

    • Observation: A couple of isolated significant tests appear (e.g. lower genome size in reference genomes under the TPM metric, GC% differences under coverage for bin genomes), but these do not persist across metrics. So no overall systematic trends. In the genus composition plots, the overall pattern is similar with the genome size and GC% results: most taxa appear in both “always right” and “always wrong” groups with similar prevalence, but there are few isolated enrichment.

    Final Recommendation

    • Single Methods: metagenomeSeq, MaAsLin2, and ANCOM-BC2 consistently emerged as reliable performers.
    • Metrics: Counts remain the default across most methods, but alternative metrics (TPM, RPKM, relative, coverage, breadth) can generally be interchanged without introducing major bias. The key exception is ALDEx2, which fails on continuous metrics; otherwise, metric choice is safe.
    • Consensus Approach: All_tools approach provides the best overall performance. Using >50% or >75% consensus thresholds may provide better performance with more tools included in the analysis
    • Alpha: Lowering the significance cutoff reduces false discovery rates. For scenarios where false positives carry high cost, stricter thresholds (e.g., α = 0.001 or lower) can help reducing false positive counts (and thus controlling FDR).