This analysis addresses the methodological question raised in ANCOM-BC Issue #196: Can normalized abundance data be effectively used with differential abundance tools designed for raw counts?

1. Introduction and Motivation

Background

Many differential abundance (DA) tools were originally designed for raw read counts where library size differences are informative and corrected internally (e.g. via size-factor estimation, library size, gene length). For genome-level metagenomic data, however, raw counts are dominated by genome length; common practice is therefore to normalise reads to TPM/RPKM or coverage. This creates a methodological question about the potential effect on the results.

This study evaluates 48 metric × method combinations (6 metrics × 8 DA tools) using CAMISIM synthetic communities with known ground truth.

Research Questions

Cross-metric performance: How do DA tools perform across different abundance metrics?
Normalization impact: Does length/depth normalization improve or reduce DA detection power?
Method preferences: Do some DA tools work better with specific abundance metrics?
Consistency evaluation: Are DA results consistent across metrics for individual tools?

Study Design

Simulation: 3 sets of data were used: HOMD_Final with 200 genomes (80 truly differentially abundant. 120 truly non-differentially abundant). Genomes were selected randomly from the HOMD dataset. HOMD2 with 200 genomes (80 truly differentially abundant. 120 truly non-differentially abundant). Genomes were selected randomly from the HOMD dataset. mooh_representative with 200 genomes (80 truly differentially abundant. 120 truly non-differentially abundant). Genomes were selected based on Metaphlan outputs.
Synthetic sample generation: CAMISIM created 40/60 metagenomic samples (50% control, 50% treatment)
Up-stream processing: Each sample was processed with the standard MAGician workflow (repository).
Abundance Metrics: 6 metrics (counts, TPM, RPKM, relative abundance, breadth, coverage)
Coverage quantification: Genome-level abundance metrics were computed using CoverM ((repository).
DA Methods: 6 tools (ALDEx2, ANCOM-BC2, metagenomeSeq, wilcoxon, maaslin2) ran.
Evaluation: Binary classification and threshold-free metrics (F1-Score, AUC-PR, Specificity, Sensitivity etc)

Definitions

Abundance-metric normalisation cheat-sheet

Note: These metric names correspond to the final output files generated by the MAGICIAN pipeline. The mapping from CoverM column names to output files is: Read Count → counts, RPKM → rpkm, TPM → tpm, Relative Abundance (%) → relative, Covered Bases → breadth, Mean → coverage.

Metric	Length normalisation	Depth normalisation	What the number means
counts	✗ (raw reads)	✗	Number of read pairs whose best hit is this genome
relative %	✗	✓ (values sum to 100 %)	Share of all mapped reads that belong to the genome
breadth	✓ (divide by length)	✗	Fraction of genome bases with ≥ 1× read coverage
coverage (mean depth)	✓ (divide by length)	✗ *	Mean per-base depth across the genome
RPKM	✓ (/ kb)	✓ (/ 10⁶ reads)	Reads-per-kilobase-per-million mapped reads
TPM	✓ (/ kb)	✓ (rescaled to 10⁶)	Transcripts-per-million; similar to RPKM but sample-wise rescaled

Metric categories used throughout this report:

• Non-normalized: counts
• Length-normalized: breadth, coverage
• Depth-normalized: relative (%)
• Length + Depth-normalized: RPKM, TPM

CoverM to Output File Mapping

To avoid confusion, here’s the explicit mapping from CoverM’s internal column names to our final analysis files:

CoverM Column Name	Final Output File	Metric Name Used in Analysis	Biological Interpretation
`Read Count`	`counts.tsv`	counts	Raw read pairs mapped to genome
`RPKM`	`rpkm.tsv`	RPKM	Reads per kilobase per million
`TPM`	`tpm.tsv`	TPM	Transcripts per million
`Relative Abundance (%)`	`relative.tsv`	relative	Percentage of total mapped reads
`Covered Bases`	`breadth.tsv`	breadth	Fraction of genome with ≥1× coverage
`Mean`	`coverage.tsv`	coverage	Mean read depth across genome

Important: Throughout this analysis, when we refer to “breadth” we mean the fraction of genome covered (from CoverM’s Covered Bases), and “coverage” refers to mean depth (from CoverM’s Mean). This follows genomics terminology where breadth = presence/absence and coverage = abundance intensity. (See how they use the term “coverage” here: https://pmc.ncbi.nlm.nih.gov/articles/PMC4992084/)

^{* Coverage still increases with sequencing depth, but it no
longer carries the multiplicative genome-length bias.}

Data Source

HOMD Data

The GTDB_taxonomy20250406.csv file was downloaded from the https://homd.org/download/download section (Genomic Meta Information::Batch Download GTDB (V226 taxonomy)). This CSV file contains metadata for 8,489 oral human microbial genomes from the Human Oral Microbiome Database (HOMD), including genome accession IDs, HOMD taxonomy IDs, and full GTDB taxonomic lineages.

From this dataset, 200 random genomes were selected and downloaded using the /home/mrahman/magician/download_genome.py script. The script randomly samples genome accession IDs from the CSV, then downloads the corresponding genomic FASTA files from NCBI’s Assembly database. A random seed ensures reproducibility of the genome selection. The downloaded genome sequences are stored in the data/HOMD/genomes directory as compressed FASTA files (*.fna.gz). The following command was used to download the genomes:

mamba activate magician_base && python /home/mrahman/magician/download_genome.py \
    --csv /home/mrahman/da_analysis/data_HOMD/gtdb_taxonomy/GTDB_taxonomy20250406.csv \
    --n 200 \
    --out /home/mrahman/da_analysis/data_HOMD/genomes \
    --email mobashirrahman@gmail.com

CAMISIM Strain-Madness

Data were downloaded from ENA Project accession number PRJEB50270 (study title: Critical Assessment of Metagenome Interpretation strain madness raw sequencing data, study name: CAMI 2 strain)). The following script were used to download 40 samples out of 100 to keep computational cost reasonable.

mamba activate sra && get_strain_madness_reads.sh /local/mrahman/magician/data/strain_madness_reads 0 39

I realized the dataset is not particularly suitable for our analysis because it is not geared toward DA analysis. For example, it do not have any associated metadata (e.g. clinical metadata, case-control status). I therefore decided to use the HOMD/mooh dataset for our analysis. A possible solution would have been, for example, using SIMBA to add in synthetic spike-ins for selected genomes and simulate case-control status to get a suitable dataset for DA analysis. As a workaround, we used Zeevi et al. dataset from the publication (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03390-9) included in the analysis as it had well-described metadata and already used for DA benchmarkiing by the Wirvel et al study.

CAMISIM Simulation and Metagenomics Downstream Processing

1. Community Design

Baseline weights: sampled per genome from a Gamma(shape=2, scale=5) to emulate long‑tailed abundance distributions as implemented in CAMISIM documentation.
Differential abundance: randomly selects a fixed number of genomes ( e.g., 80) and splits them evenly into up/down sets (Upregulation and downregulation). Applies multiplicative fold changes to reflect effect sizes.
Zero‑sum constraint: rescales only the DA subset so the sum of their abundances matches between groups. This preserves the relative mass of the non‑DA background and avoids any unwanted compositional shifts which may distort the DA methods.
Replication and depth: generates 10 controls and 10 treatments with r technical/biological replicates (e.g., r=2 → 40 samples). Library sizes are drawn uniformly from a range (e.g., 15–60M reads) to capture depth variability without confounding group effects.
CAMISIM inputs and truth: writes a sample_distributions.tsv (per‑sample weights),, read_depth.txt for sequencing depth, and an “initial” truth_table.csv recording genome‑level DA status and true log2FC for downstream evaluation.

2. Sample Distribution

Structure: rows are genomes; columns are per‑replicate samples; entries are integer weights interpreted by CAMISIM as “abundance weights”.
Group separation: control replicates use baseline weights; treatment replicates use the enriched weights (baseline modified by assigned folds), after zero‑sum balancing within the DA set.

3. Magician Downstream Processing

Reads are simulated (ART Illumina) and pooled into gzipped R1/R2 per sample. FastQC assesses raw quality; BBDuk trims adapters/low‑quality bases. metaSPAdes is used for metagenome assembly.BBMap aligns trimmed reads back to assemblies; jgi_summarize_bam_contig_depths computes depth tables for binning.

MetaBAT2 clusters contigs into MAGs using depth and composition; dRep dereplicates and compares genomes (Bins/MAGs and references). CheckM quantifies completeness/contamination for bins and references.

CoverM maps paired reads against all genomes (MAGs + references) and exports six metrics (count, RPKM, TPM, relative, covered_bases (“breadth”), mean depth (“coverage”).

4. DA Benchmarking

A script generates truth_table_with_bins.csv by integrating MAG bin annotations into the ground-truth.
This analysis benchmarks differential abundance (DA) methods across six abundance metrics using the HOMD_Final dataset with ground-truth labels. Features are harmonized and filtered; samples are aligned to metadata (control vs case). For each method–metric, per-feature q-values (BH, α=0.05) define DA calls. Results are evaluated against truth, separated by source (reference genomes vs recovered bins), reporting TP, FP, FN, TN, sensitivity, specificity, precision, FDR, accuracy, AUROC, AUPRC, and totals for later analysis.

Data Loading and Preprocessing

Details data import, cleaning, and integration steps used to construct the analysis dataset.

Genomic Sample Distribution

The genomes and their distribution was:

Ground Truth

# Initial ground truth (reference genomes only) for display
truth_table_initial <- read_csv(file.path(data_dirs$HOMD_Final, 'design/truth_table.csv')) %>%
  mutate(genome_accession = str_replace(genome, "\\.fna$", "")) %>%
  mutate(genome = str_replace_all(genome_accession, "[^A-Za-z0-9]", "")) %>%
  mutate(is_DA = case_when(
    is.logical(is_DA) ~ is_DA,
    str_to_lower(as.character(is_DA)) %in% c("true", "t", "1") ~ TRUE,
    str_to_lower(as.character(is_DA)) %in% c("false", "f", "0") ~ FALSE,
    TRUE ~ NA
  )) %>%
  distinct(genome, .keep_all = TRUE)

# Evaluation ground truth (includes bins) used to score methods
truth_table <- read_csv(file.path(data_dirs$HOMD_Final, 'design/truth_table_with_bins.csv')) %>%
  mutate(genome_accession = str_replace(genome, "\\.fna$", "")) %>%
  mutate(genome = str_replace_all(genome_accession, "[^A-Za-z0-9]", "")) %>%
  mutate(is_DA = case_when(
    is.logical(is_DA) ~ is_DA,
    str_to_lower(as.character(is_DA)) %in% c("true", "t", "1") ~ TRUE,
    str_to_lower(as.character(is_DA)) %in% c("false", "f", "0") ~ FALSE,
    TRUE ~ NA
  )) %>%
  distinct(genome, .keep_all = TRUE)

cat("Initial reference-only DA genomes:", sum(truth_table_initial$is_DA), "\n")

## Initial reference-only DA genomes: 80

cat("Initial reference-only non-DA genomes:", sum(!truth_table_initial$is_DA), "\n")

## Initial reference-only non-DA genomes: 120

cat("Evaluation entries (with bins):", nrow(truth_table), "\n")

## Evaluation entries (with bins): 238

truth_table_initial %>%
  group_by(is_DA) %>%
  summarise(
    n_genomes = n(),
    mean_log2FC = mean(true_log2FC, na.rm = TRUE),
    min_log2FC = min(true_log2FC, na.rm = TRUE),
    max_log2FC = max(true_log2FC, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  kable(caption = "Initial Ground Truth Summary (reference genomes)", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Initial Ground Truth Summary (reference genomes)
is_DA	n_genomes	mean_log2FC	min_log2FC	max_log2FC
FALSE	120	0.000	0.000	0.000
TRUE	80	-1.263	-5.209	1.559

Simulated read-depth per sample

The CAMISIM simulation generated samples with varying sequencing depths to mimic limited realistic sequencing conditions:

## Sequencing depth summary:

## - Total samples: 40

## - Range: 17,009,209 - 59,564,475 read pairs

## - Mean: 36,954,723 read pairs

## - Median: 36,887,326 read pairs

## - Coefficient of variation: 0.347

Sequencing depth (read pairs) per simulated sample (first 5 entries)
Sample	Read Pairs
C01_rep01	24,407,595
C01_rep02	38,628,974
C02_rep01	22,368,809
C02_rep02	22,465,809
C03_rep01	52,633,693

Genomes Recovered from bins

In ideal scenario, we would have recovered all the original genomes as the bins.

Bin recovery:

suppressPackageStartupMessages({
  library(readxl)
})

summaries_dir <- file.path(data_dirs$HOMD_Final, "summaries")

# Helper: normalize IDs to match truth_table/truth_table_initial normalization
normalize_genome_id <- function(x) {
  x %>%
    str_replace("\\.fna$", "") %>%
    str_replace_all("[^A-Za-z0-9]", "")
}

# List bin_summary files
bin_files <- list.files(summaries_dir, pattern = '^bin_summary_.*\\.xlsx$', full.names = TRUE)

# Read a single bin_summary and extract sample name
read_bin_summary <- function(path) {
  sample_id <- basename(path) %>% str_replace("^bin_summary_", "") %>% str_replace("\\.xlsx$", "")
  df <- suppressMessages(readxl::read_xlsx(path))
  # Expect columns: bin_name, closest_genome
  df %>%
    janitor::clean_names() %>%
    transmute(
      sample = sample_id,
      bin_name = as.character(bin_name),
      closest_genome_raw = as.character(closest_genome),
      closest_genome = normalize_genome_id(as.character(closest_genome))
    ) %>%
    distinct(sample, closest_genome, .keep_all = TRUE)
}

if (length(bin_files) == 0) {
  warning("No bin_summary_*.xlsx files found in ", summaries_dir)
  bin_summaries <- tibble(sample = character(), bin_name = character(), closest_genome_raw = character(), closest_genome = character())
} else {
  bin_summaries <- map_dfr(bin_files, read_bin_summary)
}

# Build presence/absence matrix genome × sample
all_samples <- sort(unique(bin_summaries$sample))
all_genomes <- sort(unique(c(truth_table_initial$genome, truth_table$genome, bin_summaries$closest_genome)))

recovery_long <- bin_summaries %>%
  transmute(genome = closest_genome, sample, recovered = 1L) %>%
  complete(genome = all_genomes, sample = all_samples, fill = list(recovered = 0L))

# Join truth covariates
truth_covars <- truth_table_initial %>%
  select(genome, genome_accession, is_DA, true_log2FC, baseline_weight, enriched_weight, direction)

# Ensure plain tibbles to avoid SummarizedExperiment join methods
recovery_long <- tibble::as_tibble(recovery_long)
truth_covars  <- tibble::as_tibble(truth_covars)

recovery_long <- recovery_long %>%
  dplyr::left_join(truth_covars, by = "genome")

# Per-sample summaries
sample_coverage <- recovery_long %>%
  group_by(sample) %>%
  summarise(
    n_recovered = sum(recovered, na.rm = TRUE),
    n_total_truth = sum(!is.na(is_DA)),
    prop_recovered = ifelse(n_total_truth > 0, n_recovered / n_total_truth, NA_real_),
    .groups = 'drop'
  )

# Split by is_DA/direction for stacked bars
sample_coverage_split <- recovery_long %>%
  # Vectorized classification (avoid non-vectorized isTRUE/isFALSE)
  mutate(
    direction = tolower(as.character(direction)),
    class = case_when(
      is_DA == TRUE & direction == "up"   ~ "DA_up",
      is_DA == TRUE & direction == "down" ~ "DA_down",
      is_DA == FALSE                       ~ "non_DA",
      TRUE                                 ~ "unknown"
    )
  ) %>%
  group_by(sample, class) %>%
  summarise(n = sum(recovered, na.rm = TRUE), .groups = 'drop')

# Prepare stacked-bar data with total counts and readable count labels
sample_coverage_counts <- sample_coverage_split %>%
  group_by(sample) %>%
  mutate(total = sum(n),
         frac = ifelse(total > 0, n / total, NA_real_),
         count_label = ifelse(is.na(frac) | frac < 0.08, "", as.character(n))) %>%
  ungroup()

# Quick sanity table
head(sample_coverage_split) %>% kable(caption = "Example of recovered counts per sample by DA class") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Example of recovered counts per sample by DA class
sample	class	n
C01_rep01	DA_down	0
C01_rep01	DA_up	3
C01_rep01	non_DA	12
C01_rep01	unknown	1
C01_rep02	DA_down	0
C01_rep02	DA_up	3

# Diagnostics: ensure is_DA is logical and mapping rate
diagnostic_map <- recovery_long %>%
  summarise(
    n_recovery_rows = n(),
    n_with_truth = sum(!is.na(is_DA)),
    prop_with_truth = n_with_truth / n_recovery_rows,
    n_unknown_class = sum(is.na(is_DA) | is.na(direction)),
    .groups = 'drop'
  )

print(diagnostic_map)

## # A tibble: 1 × 4
##   n_recovery_rows n_with_truth prop_with_truth n_unknown_class
##             <int>        <int>           <dbl>           <int>
## 1            9560         8000           0.837            1560

Recovered Genome Distribution

# Stacked by DA class
class_colors <- c(DA_up = "#2ca02c", DA_down = "#d62728", non_DA = "#7f7f7f", unknown = "#bcbd22")

ggplot(sample_coverage_counts, aes(x = reorder(sample, total), y = n, fill = class)) +
  geom_col() +
  geom_text(aes(label = count_label), position = position_stack(vjust = 0.5), size = 3, color = "black") +
  # Total recovered genomes per sample printed just beyond the bar
  geom_text(data = distinct(sample_coverage_counts, sample, total),
            aes(x = reorder(sample, total), y = total, label = total),
            inherit.aes = FALSE, hjust = -0.1, size = 3.2) +
  scale_y_continuous(expand = expansion(mult = c(0.01, 0.10))) +
  scale_fill_manual(values = class_colors) +
  coord_flip(clip = "off") +
  labs(title = "Recovered genomes per sample by DA class",
       x = "Sample", y = "Recovered genomes", fill = "Class") +
  theme_minimal() +
  theme(plot.margin = margin(5.5, 30, 5.5, 5.5))

# Per-genome recovery rate across samples
genome_recovery <- recovery_long %>%
  group_by(genome) %>%
  summarise(
    recovery_rate = mean(recovered, na.rm = TRUE),
    is_DA = first(is_DA),
    direction = first(direction),
    true_log2FC = first(true_log2FC),
    baseline_weight = first(baseline_weight),
    .groups = 'drop'
  )

Observation: Treatment samples consistently recover a higher total number of genomes (~21–24). Most of them have DA_up genomes whose abundance were increased during CAMISIM sample distribution. Only a few downregulated genomes are recovered.

On the other hand,controls recover fewer genomes overall (~15–19) compared to treatment.

# Scatter + smooth: baseline_weight vs recovery_rate by sample group (Control/Treatment)
rec_by_group <- recovery_long %>%
  mutate(group = case_when(
    stringr::str_starts(sample, "C") ~ "Control",
    stringr::str_starts(sample, "T") ~ "Treatment",
    TRUE ~ "Unknown"
  )) %>%
  group_by(genome, group) %>%
  summarise(recovery_rate = mean(recovered, na.rm = TRUE), .groups = 'drop') %>%
  dplyr::left_join(truth_covars %>% tibble::as_tibble() %>% dplyr::select(genome, is_DA, direction, baseline_weight), by = "genome")

ggplot(rec_by_group %>% filter(!is.na(baseline_weight)),
       aes(x = baseline_weight, y = recovery_rate, color = group)) +
  geom_point(alpha = 0.6, position = position_jitter(width = 0, height = 0.01), size = 1.8) +
  geom_smooth(method = "loess", se = TRUE) +
  facet_wrap(~ direction, nrow = 1, scales = "free_x") +
  scale_color_manual(values = c(Control = "#1f77b4", Treatment = "#ff7f0e", Unknown = "#7f7f7f")) +
  labs(title = "Recovery rate vs baseline_weight by group",
       x = "Baseline weight", y = "Recovery rate", color = "Group") +
  theme_minimal()

Observation: Recovery rate increases with higher baseline weight, specially for upregulated genomes.

Presence–absence heatmap (genomes × samples)

library(ComplexHeatmap)
library(circlize)

# Prepare matrix (drop NAs to avoid invalid rownames)
rec_df <- recovery_long %>%
  select(genome, sample, recovered) %>%
  filter(!is.na(genome), !is.na(sample))

rec_mat <- rec_df %>%
  pivot_wider(names_from = sample, values_from = recovered, values_fill = 0) %>%
  filter(!is.na(genome)) %>%
  column_to_rownames("genome") %>%
  as.matrix()

# Row metadata (aligned to matrix order)
row_meta <- truth_covars %>%
  filter(!is.na(genome), genome %in% rownames(rec_mat)) %>%
  distinct(genome, .keep_all = TRUE) %>%
  column_to_rownames("genome")
row_meta_aligned <- row_meta[rownames(rec_mat), , drop = FALSE]
row_meta_aligned$direction <- replace_na(tolower(row_meta_aligned$direction), "none")

# Row class used for splitting
row_class <- case_when(
  row_meta_aligned$is_DA %in% TRUE  & row_meta_aligned$direction == "up"   ~ "DA_up",
  row_meta_aligned$is_DA %in% TRUE  & row_meta_aligned$direction == "down" ~ "DA_down",
  row_meta_aligned$is_DA %in% FALSE                                       ~ "non_DA",
  TRUE                                                                    ~ "unknown"
)
row_class <- factor(row_class, levels = c("DA_up", "DA_down", "non_DA", "unknown"))

# Row prevalence to order genomes within each split
row_prev <- rowSums(rec_mat, na.rm = TRUE) / ncol(rec_mat)
row_order_vec <- order(as.integer(row_class), -row_prev)

# Top (column) annotations: Control/Treatment/Unknown by sample name
sample_groups <- case_when(
  stringr::str_starts(colnames(rec_mat), "C") ~ "Control",
  stringr::str_starts(colnames(rec_mat), "T") ~ "Treatment",
  TRUE ~ "Unknown"
)
top_anno <- HeatmapAnnotation(
  group = factor(sample_groups, levels = c("Control", "Treatment", "Unknown")),
  col = list(group = c(Control = "#1f77b4", Treatment = "#ff7f0e", Unknown = "#7f7f7f")),
  annotation_name_side = "left"
)

# Right (row) annotations: selected covariates
ha_row <- rowAnnotation(
  log2FC = row_meta_aligned$true_log2FC,
  baseline = row_meta_aligned$baseline_weight,
  col = list(
    log2FC = circlize::colorRamp2(c(min(row_meta_aligned$true_log2FC, na.rm=TRUE), 0, 
                                     max(row_meta_aligned$true_log2FC, na.rm=TRUE)),
                                   c("#2166AC", "#F7F7F7", "#B2182B")),
    baseline = circlize::colorRamp2(range(row_meta_aligned$baseline_weight, na.rm=TRUE),
                                    c("#f7f7f7", "#636363"))
  )
)

# Binary color map for presence/absence using requested palette
# Absent = Modern Red (#E63946), Present = Modern Green (#2A9D8F)
hcols <- circlize::colorRamp2(c(0, 1), c("#E63946", "#2A9D8F"))

Heatmap(
  rec_mat,
  name = "presence",
  col = hcols,
  heatmap_legend_param = list(at = c(0, 1), labels = c("absent", "present"), color_bar = "discrete"),
  use_raster = TRUE,
  raster_quality = 2,
  show_row_names = FALSE,
  show_column_names = TRUE,
  column_title = "Samples",
  row_title = "Genomes",
  right_annotation = ha_row,
  top_annotation = top_anno,
  column_split = factor(sample_groups, levels = c("Control", "Treatment", "Unknown")),
  cluster_columns = TRUE,
  row_split = row_class,
  # Order genomes: first by class, then by prevalence within class
  row_order = row_order_vec
)

Observation: DA patterns dominate presence/absence determination of the genome recovery. Increased log2FC (DA_up genomes in treatment) leads to more genome recovery in Treatment than Control samples. High baseline also leads to more genome (in all samples) recovery. (as expected)

Recovered Genomes: Taxonomy Table and Plot

We would like to see if there is any genus that contributes abnormally to the recovered genomes. So first we identify taxon names of the recovered genomes (bins).

suppressPackageStartupMessages({
  library(rentrez)
  library(xml2)
})

# Genomes recovered at least once and their counts across samples
recovered_counts <- recovery_long %>%
  group_by(genome) %>%
  summarise(count = sum(recovered, na.rm = TRUE), .groups = 'drop') %>%
  filter(count > 0) %>%
  dplyr::left_join(truth_covars %>% tibble::as_tibble() %>% dplyr::select(genome, genome_accession), by = 'genome')

# Helper to get family/genus/species via NCBI (assembly -> taxid -> taxonomy)
get_taxonomy_for_accession <- function(acc) {
  if (is.na(acc) || acc == "") return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
  out <- tryCatch({
    es <- rentrez::entrez_search(db = 'assembly', term = acc, retmax = 1)
    if (length(es$ids) == 0) return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
    sm <- rentrez::entrez_summary(db = 'assembly', id = es$ids[[1]])
    txid <- sm$taxid
    if (is.null(txid) || is.na(txid)) return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
    xml_txt <- rentrez::entrez_fetch(db = 'taxonomy', id = txid, rettype = 'xml')
    doc <- xml2::read_xml(xml_txt)
    family <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='family']/ScientificName"))
    genus  <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
    species <- xml2::xml_text(xml2::xml_find_first(doc, "//Taxon/ScientificName"))
    tibble(family = ifelse(length(family)==0, NA, family),
           genus = ifelse(length(genus)==0, NA, genus),
           species = ifelse(length(species)==0, NA, species))
  }, error = function(e) tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
  out
}

unique_acc <- recovered_counts$genome_accession %>% unique()
tax_lookup <- map_dfr(unique_acc, ~{
  res <- get_taxonomy_for_accession(.x)
  mutate(res, genome_accession = .x)
})

recovered_tax <- recovered_counts %>%
  dplyr::left_join(tax_lookup, by = 'genome_accession') %>%
  arrange(desc(count))

# Display table: row names = genome accession
recovered_tax_table <- recovered_tax %>%
  transmute(`genome` = coalesce(genome_accession, genome), count, family, genus, species)

recovered_tax_table %>%
  dplyr::rename(`Count (samples)` = count) %>%
  kable(caption = 'Recovered genomes with NCBI taxonomy', align = 'lrrrr') %>%
  kable_styling(bootstrap_options = c('striped','hover','condensed'), full_width = FALSE)

Recovered genomes with NCBI taxonomy
genome	Count (samples)	family	genus	species
GCA_002763575.1	40	Prevotellaceae	Prevotella	Prevotella intermedia
GCA_009377295.1	40	Paenibacillaceae	Paenibacillus	Paenibacillus glucanolyticus
GCA_016775375.1	40	Peptostreptococcaceae	Peptostreptococcus	Peptostreptococcus stomatis
GCA_016861285.1	40	Pasteurellaceae	Haemophilus	Haemophilus influenzae
GCA_017598955.1	40	Micrococcaceae	Kocuria	Kocuria rhizophila
GCA_938023715.1	40	Prevotellaceae	Prevotella	uncultured Prevotella sp.
NA	40	NA	NA	NA
GCA_002091815.1	39	Pseudomonadaceae	Ectopseudomonas	Pseudomonas oleovorans subsp. oleovorans NBRC 13583 = DSM 1045
GCA_000195125.1	37	Streptococcaceae	Streptococcus	Streptococcus sanguinis SK1058
GCA_020529565.1	35	Streptococcaceae	Streptococcus	Streptococcus mutans
GCA_001597385.1	23	Fusobacteriaceae	Fusobacterium	Fusobacterium necrophorum subsp. funduliforme
GCA_016127175.1	22	Veillonellaceae	Veillonella	Veillonella parvula
GCA_937889145.1	22	Prevotellaceae	Prevotella	Prevotella pallens
GCA_000160335.2	20	Staphylococcaceae	Staphylococcus	Staphylococcus aureus subsp. aureus 55/2053
GCA_000180075.1	20	Streptococcaceae	Streptococcus	Streptococcus vestibularis F0396
GCA_004136195.1	20	Propionibacteriaceae	Cutibacterium	Cutibacterium acnes DSM 1897
GCA_021309225.1	20	Pasteurellaceae	Aggregatibacter	Aggregatibacter actinomycetemcomitans
GCA_023110145.1	20	Streptococcaceae	Streptococcus	Streptococcus vestibularis
GCA_023314415.1	20	Enterobacteriaceae	Kluyvera	Kluyvera ascorbata
GCA_927911685.1	20	Actinomycetaceae	Actinomyces	uncultured Actinomyces sp.
GCA_958455085.1	20	Veillonellaceae	Dialister	Dialister invisus
GCA_000016825.1	18	Lactobacillaceae	Limosilactobacillus	Limosilactobacillus reuteri subsp. reuteri
GCA_003433955.1	18	Peptoniphilaceae	Anaerococcus	Anaerococcus nagyae
GCA_030180395.1	17	Fusobacteriaceae	Fusobacterium	Fusobacterium necrophorum
GCA_001866575.1	16	Propionibacteriaceae	Cutibacterium	Cutibacterium avidum
GCA_021309315.1	13	Pasteurellaceae	Aggregatibacter	Aggregatibacter actinomycetemcomitans
GCA_008726565.1	12	Streptococcaceae	Streptococcus	Streptococcus anginosus
GCA_022569235.1	12	Staphylococcaceae	Staphylococcus	Staphylococcus lugdunensis
GCA_014962425.1	8	Staphylococcaceae	Staphylococcus	Staphylococcus lugdunensis
GCA_030229925.1	8	Corynebacteriaceae	Corynebacterium	Corynebacterium propinquum
GCA_001053555.1	7	Corynebacteriaceae	Corynebacterium	Corynebacterium propinquum
GCA_008085135.1	7	Pasteurellaceae	Aggregatibacter	Aggregatibacter actinomycetemcomitans
GCA_031158635.1	7	Propionibacteriaceae	Cutibacterium	Cutibacterium avidum
GCA_028325975.1	6	Streptococcaceae	Streptococcus	Streptococcus parasanguinis
GCA_002212995.1	5	Streptococcaceae	Streptococcus	Streptococcus mutans
GCA_003856455.1	4	Staphylococcaceae	Staphylococcus	Staphylococcus epidermidis
GCA_900636255.1	3	Staphylococcaceae	Staphylococcus	Staphylococcus warneri
GCA_003596365.3	2	Staphylococcaceae	Staphylococcus	Staphylococcus haemolyticus
GCA_011038575.1	1	Staphylococcaceae	Staphylococcus	Staphylococcus epidermidis

Then we barplot them by aggregating at genus level.

# Plot: top 15 genera by number of recovered genomes
top_genera <- recovered_tax %>%
  dplyr::count(genus, sort = TRUE, name = 'n_genomes') %>%
  filter(!is.na(genus)) %>%
  slice_head(n = 15)

ggplot(top_genera, aes(x = reorder(genus, n_genomes), y = n_genomes)) +
  geom_col(fill = '#2A9D8F') +
  coord_flip() +
  labs(title = 'Top genera among recovered genomes', x = 'Genus', y = 'Recovered genomes (unique)') +
  theme_minimal()

Observation: 7 species are recovered from both Streptococcus and Staphylococcus.

Community Composition: Initial vs Recovered at Genus Level

suppressPackageStartupMessages({
  library(tidyverse)
  library(mia)
  library(miaViz)
  library(S4Vectors)
  library(SummarizedExperiment)
  library(TreeSummarizedExperiment)
  library(rentrez)
  library(xml2)
})

pal_amr <- c(
  "#9C81ab", "#81ba9c", "gold1", "skyblue2", "#e65d5f", "#FDBF6F",
  "#CAD2D6", "#FB9A99", "maroon", "steelblue4", "dodgerblue2", "#D48B28",
  "#6A9DA9", "khaki2", "#B5D39D", "#42B93E", "#B3E3AA", "#A9FDAC",
  "#32A27f", "#38BEA5", "#A3AB28", "#28C99f", "#FED1BC", "#FE654F",
  "#49C8E1", "#BB39A3", "grey", "pink",
  "#177B94", "#FF707E", "#2AC02C", "#D62728", "#9478DD", "#BC54AB",
  "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF",
  "#6B4226", "#FFA70A", "#FF4540", "#CD5C5C", "#25EB57", "#66CDAA",
  "#4682B4", "#1E90FF", "#C71585", "#7AD7D6",
  "#FFD49F", "#8BD870", "#B0B83F", "#A52A2A", "#6495ED", "#9932CC",
  "#FF1493", "#ADFF2F", "#40E0D0", "#C0C0C0",
  "#FFA8B9", "#F4CA60", "#D8251E", "#0B53FF", "#FFA500", "#FF6370",
  "#DC143C", "#8A2BE2", "#5F9EA0", "#00FA9A",
  "#FFD700", "#4DBDFF", "#708090", "#00FFFF", "#FF4500", "#7FFFD4",
  "#AD0F5E", "#20B2AA", "#708090", "#8FBC8F",
  "#0000FF", "#808000", "#228B22", "#C0C0C0", "#808080", "#008000",
  "#FF8C00", "#7C1353", "#000000", "#A9A9A9",
  "#FF4500", "#822222", "#FF2222", "#00FF00", "#800080", "#FF00FF",
  "#E9967A", "#FF69B4", "#FF69B4", "#EE82EE",
  "#FFB6C1", "#FAFEEE", "#F0E68C", "#9FB98F", "#ADD8E6", "#87CEEB",
  "#FFDEAD", "#DDA0DD", "#B0E0E6"
)

# Input: remapped relative abundances (rows = genomes or bins, cols = samples)
abund_file <- file.path(data_dirs$HOMD_Final, "aggregated_metrics_remapped", "remapped_relative.tsv")
stopifnot(file.exists(abund_file))

abund <- readr::read_tsv(abund_file, show_col_types = FALSE)
stopifnot("Genome" %in% colnames(abund))

rownms <- abund$Genome
mat <- as.matrix(abund %>% select(-Genome))
rownames(mat) <- rownms

# Identify source type and normalize accession IDs
source_type <- ifelse(grepl("^bin_", rownms), "recovered", "initial")
acc_undotted <- sub("^bin_", "", rownms)
acc <- ifelse(grepl("\\.\\d+$", acc_undotted), acc_undotted, sub("([0-9])$", ".\\1", acc_undotted))

row_meta <- S4Vectors::DataFrame(
  Genome = rownms,
  Source = source_type,
  Accession_raw = acc_undotted,
  Accession = acc,
  row.names = rownms
)

# Build TSE and compute per-sample relative abundance (sums to 1)
tse_full <- TreeSummarizedExperiment::TreeSummarizedExperiment(
  assays = list(counts = mat),
  rowData = row_meta
)
tse_full <- mia::transformAssay(tse_full, assay.type = "counts", method = "relabundance")

# Fetch Genus via NCBI (rentrez) with caching
cache_dir_cc <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir_cc)) dir.create(cache_dir_cc, recursive = TRUE)
cache_map <- file.path(cache_dir_cc, "accession_to_genus.csv")

fetch_genus <- function(acc_vec) {
  acc_vec <- unique(acc_vec[!is.na(acc_vec) & acc_vec != ""])
  if (length(acc_vec) == 0) return(tibble(Accession = character(), Genus = character()))
  purrr::map_dfr(acc_vec, function(acc_one) {
    # Try exact assembly accession first
    uid <- tryCatch(rentrez::entrez_search(db = "assembly",
                                          term = paste0(acc_one, "[Assembly Accession]"),
                                          retmax = 1)$ids,
                    error = function(e) NULL)
    if (is.null(uid) || length(uid) == 0) {
      uid <- tryCatch(rentrez::entrez_search(db = "assembly", term = acc_one, retmax = 1)$ids,
                      error = function(e) NULL)
    }
    if (!is.null(uid) && length(uid) > 0) {
      sm <- tryCatch(rentrez::entrez_summary(db = "assembly", id = uid[1]), error = function(e) NULL)
      txid <- tryCatch({
        if (is.null(sm)) NA_character_ else as.character(if (!is.null(sm$taxid)) sm$taxid else sm$tax_id)
      }, error = function(e) NA_character_)
      if (!is.na(txid)) {
        tx_xml <- tryCatch(rentrez::entrez_fetch(db = "taxonomy", id = txid, rettype = "xml"), error = function(e) NULL)
        if (!is.null(tx_xml)) {
          doc <- tryCatch(xml2::read_xml(tx_xml), error = function(e) NULL)
          if (!is.null(doc)) {
            genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
            if (length(genus) == 0 || is.na(genus) || genus == "") genus <- NA_character_
            return(tibble(Accession = acc_one, Genus = genus))
          }
        }
      }
    }
    tibble(Accession = acc_one, Genus = NA_character_)
  })
}

# Load cache and lookup what’s missing
cache_df <- if (file.exists(cache_map)) suppressMessages(readr::read_csv(cache_map, show_col_types = FALSE)) else tibble(Accession = character(), Genus = character())
to_lookup <- setdiff(unique(row_meta$Accession), cache_df$Accession)

if (length(to_lookup) > 0) {
  # Optional: supply ENTREZ_KEY for higher rate limit
  if (nzchar(Sys.getenv("ENTREZ_KEY"))) rentrez::set_entrez_key(Sys.getenv("ENTREZ_KEY"))
  new_map <- fetch_genus(to_lookup)
  cache_df <- dplyr::bind_rows(cache_df, new_map) %>% dplyr::distinct(Accession, .keep_all = TRUE)
  readr::write_csv(cache_df, cache_map)
}

genus_map <- cache_df
row_genus <- genus_map$Genus[match(row_meta$Accession, genus_map$Accession)]
row_genus[is.na(row_genus) | row_genus == ""] <- "Unknown"
rowData(tse_full)$Genus <- row_genus

# Split by source
tse_initial <- tse_full[rowData(tse_full)$Source == "initial", ]
tse_recovered <- tse_full[rowData(tse_full)$Source == "recovered", ]

# Aggregate by Genus within each subset
tse_initial_g <- mia::agglomerateByVariable(tse_initial, by = "rows", f = "Genus", fun = sum, assay.type = "relabundance")
tse_recovered_g <- mia::agglomerateByVariable(tse_recovered, by = "rows", f = "Genus", fun = sum, assay.type = "relabundance")

# Determine shared top genera (union of top 10 from both)
top_init <- mia::getTop(tse_initial_g, top = 10, assay.type = "relabundance")
top_recv <- mia::getTop(tse_recovered_g, top = 10, assay.type = "relabundance")
top_union <- unique(c(top_init, top_recv))

# Collapse others into "Other"
rowData(tse_initial_g)$Genus_sub <- ifelse(rownames(tse_initial_g) %in% top_union, rownames(tse_initial_g), "Other")
rowData(tse_recovered_g)$Genus_sub <- ifelse(rownames(tse_recovered_g) %in% top_union, rownames(tse_recovered_g), "Other")
tse_initial_sub <- mia::agglomerateByVariable(tse_initial_g, by = "rows", f = "Genus_sub", fun = sum, assay.type = "relabundance")
tse_recovered_sub <- mia::agglomerateByVariable(tse_recovered_g, by = "rows", f = "Genus_sub", fun = sum, assay.type = "relabundance")

# Visualize side-by-side (stacked bar via ggplot, with ordering)
create_comp_plot <- function(tse_sub, title) {
  M <- SummarizedExperiment::assay(tse_sub, "relabundance")
  if (is.null(M)) M <- SummarizedExperiment::assay(tse_sub, 1)
  cs <- colSums(M, na.rm = TRUE); cs[cs == 0] <- 1
  M <- sweep(M, 2, cs, "/")

  df <- as.data.frame(M) %>%
    tibble::rownames_to_column("Genus") %>%
    tidyr::pivot_longer(-Genus, names_to = "Sample", values_to = "Abundance")

  totals <- df %>% group_by(Genus) %>% summarise(Total = sum(Abundance, na.rm = TRUE), .groups = 'drop')
  levels_g <- totals %>% filter(Genus != "Other") %>% arrange(desc(Total)) %>% pull(Genus)
  if ("Other" %in% df$Genus) levels_g <- c(levels_g, "Other")
  df$Genus <- factor(df$Genus, levels = levels_g)

  top_g <- dplyr::first(levels_g)
  if (!is.na(top_g)) {
    sample_order <- df %>% filter(Genus == top_g) %>% arrange(desc(Abundance)) %>% pull(Sample)
    df$Sample <- factor(df$Sample, levels = unique(sample_order))
  }

  ggplot(df, aes(x = Sample, y = Abundance, fill = Genus)) +
    geom_col(width = 0.9) +
    scale_fill_manual(values = pal_amr) +   # << apply palette
    labs(title = title, x = "Samples", y = "Relative abundance") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
          legend.position = "right")
}

p_init <- create_comp_plot(tse_initial_sub, "Initial (Reference) - Genus Composition")
p_recv <- create_comp_plot(tse_recovered_sub, "Recovered (Bins) - Genus Composition")

if (requireNamespace("patchwork", quietly = TRUE)) {
  (p_init | p_recv)
} else {
  gridExtra::grid.arrange(p_init, p_recv, ncol = 2)
}

Observation: The reference (left) shows smooth, consistent genus profiles across all samples, with stable relative abundances. In contrast, the recovered bins display noisy, fragmented patterns with inflated or diminished representation of several genera.

There is a clear compositional shift, with recovered bins showing markedly altered and noisier genus abundances compared to the stable reference profiles.

Differential Abundance Analysis

Loading and Processing DA Results

# Load comprehensive summary results
da_results <- read_csv(file.path(results_input_dir, "comprehensive_summary_with_bins.csv"), 
                      show_col_types = FALSE) %>%
  mutate(
    method = str_to_lower(method),
    metric_type = factor(metric_type, levels = c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")),
    source_type = factor(source_type, levels = c("all", "reference", "bin"))
  )

# Display structure
cat("DA Results Summary:\n")

## DA Results Summary:

cat("- Total rows:", nrow(da_results), "\n")

## - Total rows: 78

cat("- Methods:", paste(unique(da_results$method), collapse = ", "), "\n")

## - Methods: ancombc2, aldex2, metagenomeseq, maaslin2, wilcoxon

cat("- Metrics:", paste(unique(da_results$metric_type), collapse = ", "), "\n")

## - Metrics: counts, coverage, breadth, rpkm, tpm, relative

cat("- Source types:", paste(unique(da_results$source_type), collapse = ", "), "\n")

## - Source types: all, reference, bin

# Quick overview
da_results %>%
  dplyr::count(method, source_type, metric_type) %>%
  kable(caption = "Results availability by method, source type, and metric") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Results availability by method, source type, and metric
method	source_type	metric_type	n
aldex2	all	counts	1
aldex2	all	breadth	1
aldex2	reference	counts	1
aldex2	reference	breadth	1
aldex2	bin	counts	1
aldex2	bin	breadth	1
ancombc2	all	counts	1
ancombc2	all	tpm	1
ancombc2	all	rpkm	1
ancombc2	all	relative	1
ancombc2	all	breadth	1
ancombc2	all	coverage	1
ancombc2	reference	counts	1
ancombc2	reference	tpm	1
ancombc2	reference	rpkm	1
ancombc2	reference	relative	1
ancombc2	reference	breadth	1
ancombc2	reference	coverage	1
ancombc2	bin	counts	1
ancombc2	bin	tpm	1
ancombc2	bin	rpkm	1
ancombc2	bin	relative	1
ancombc2	bin	breadth	1
ancombc2	bin	coverage	1
maaslin2	all	counts	1
maaslin2	all	tpm	1
maaslin2	all	rpkm	1
maaslin2	all	relative	1
maaslin2	all	breadth	1
maaslin2	all	coverage	1
maaslin2	reference	counts	1
maaslin2	reference	tpm	1
maaslin2	reference	rpkm	1
maaslin2	reference	relative	1
maaslin2	reference	breadth	1
maaslin2	reference	coverage	1
maaslin2	bin	counts	1
maaslin2	bin	tpm	1
maaslin2	bin	rpkm	1
maaslin2	bin	relative	1
maaslin2	bin	breadth	1
maaslin2	bin	coverage	1
metagenomeseq	all	counts	1
metagenomeseq	all	tpm	1
metagenomeseq	all	rpkm	1
metagenomeseq	all	relative	1
metagenomeseq	all	breadth	1
metagenomeseq	all	coverage	1
metagenomeseq	reference	counts	1
metagenomeseq	reference	tpm	1
metagenomeseq	reference	rpkm	1
metagenomeseq	reference	relative	1
metagenomeseq	reference	breadth	1
metagenomeseq	reference	coverage	1
metagenomeseq	bin	counts	1
metagenomeseq	bin	tpm	1
metagenomeseq	bin	rpkm	1
metagenomeseq	bin	relative	1
metagenomeseq	bin	breadth	1
metagenomeseq	bin	coverage	1
wilcoxon	all	counts	1
wilcoxon	all	tpm	1
wilcoxon	all	rpkm	1
wilcoxon	all	relative	1
wilcoxon	all	breadth	1
wilcoxon	all	coverage	1
wilcoxon	reference	counts	1
wilcoxon	reference	tpm	1
wilcoxon	reference	rpkm	1
wilcoxon	reference	relative	1
wilcoxon	reference	breadth	1
wilcoxon	reference	coverage	1
wilcoxon	bin	counts	1
wilcoxon	bin	tpm	1
wilcoxon	bin	rpkm	1
wilcoxon	bin	relative	1
wilcoxon	bin	breadth	1
wilcoxon	bin	coverage	1

Reference Genomes Results

# Filter for reference genomes only
reference_results <- da_results %>%
  filter(source_type == "reference") %>%
  select(method, metric_type, tp, fp, fn, tn, sensitivity, specificity, precision, fdr, accuracy, npv, auroc, auprc, called, total_features) %>%
  arrange(method, metric_type)

# Create formatted table for reference results
reference_table <- reference_results %>%
  mutate(
    across(c(sensitivity, specificity, precision, accuracy, npv, auroc, auprc), ~ round(.x, 3)),
    across(c(fdr), ~ round(.x, 3)),
    tp = as.integer(tp),
    fp = as.integer(fp),
    fn = as.integer(fn),
    tn = as.integer(tn),
    called = as.integer(called),
    total_features = as.integer(total_features)
  ) %>%
  dplyr::rename(
    Method = method,
    Metric = metric_type,
    TP = tp,
    FP = fp,
    FN = fn,
    TN = tn,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Precision = precision,
    FDR = fdr,
    Accuracy = accuracy,
    NPV = npv,
    AUROC = auroc,
    AUPRC = auprc,
    Called = called,
    Total = total_features
  )

# Display interactive table with pagination
reference_dt <- reference_table %>%
  DT::datatable(
    caption = "Differential Abundance Results: Reference Genomes Only",
    options = list(
      pageLength = 10,
      lengthMenu = c(5, 10, 15, 20, -1),
      scrollX = TRUE,
      scrollY = "400px",
      dom = 'Bfrtip',
      buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
    ),
    extensions = c('Buttons', 'Scroller'),
    filter = 'top',
    rownames = FALSE
  ) %>%
  DT::formatRound(columns = c('Sensitivity', 'Specificity', 'Precision', 'FDR', 'Accuracy', 'NPV', 'AUROC', 'AUPRC'), digits = 3)

reference_dt

# Save interactive table
save_datatable(reference_dt, "reference_genomes_results.html")

Bins Results

# Filter for bins only
bins_results <- da_results %>%
  filter(source_type == "bin") %>%
  select(method, metric_type, tp, fp, fn, tn, sensitivity, specificity, precision, fdr, accuracy, npv, auroc, auprc, called, total_features) %>%
  arrange(method, metric_type)

# Create formatted table for bins results
bins_table <- bins_results %>%
  mutate(
    across(c(sensitivity, specificity, precision, accuracy, npv, auroc, auprc), ~ round(.x, 3)),
    across(c(fdr), ~ round(.x, 3)),
    tp = as.integer(tp),
    fp = as.integer(fp),
    fn = as.integer(fn),
    tn = as.integer(tn),
    called = as.integer(called),
    total_features = as.integer(total_features)
  ) %>%
  dplyr::rename(
    Method = method,
    Metric = metric_type,
    TP = tp,
    FP = fp,
    FN = fn,
    TN = tn,
    Sensitivity = sensitivity,
    Specificity = specificity,
    Precision = precision,
    FDR = fdr,
    Accuracy = accuracy,
    NPV = npv,
    AUROC = auroc,
    AUPRC = auprc,
    Called = called,
    Total = total_features
  )

# Display interactive table with pagination
bins_dt <- bins_table %>%
  DT::datatable(
    caption = "Differential Abundance Results: Bins Only",
    options = list(
      pageLength = 10,
      lengthMenu = c(5, 10, 15, 20, -1),
      scrollX = TRUE,
      scrollY = "400px",
      dom = 'Bfrtip',
      buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
    ),
    extensions = c('Buttons', 'Scroller'),
    filter = 'top',
    rownames = FALSE
  ) %>%
  DT::formatRound(columns = c('Sensitivity', 'Specificity', 'Precision', 'FDR', 'Accuracy', 'NPV', 'AUROC', 'AUPRC'), digits = 3)

bins_dt

# Save interactive table
save_datatable(bins_dt, "bins_results.html")

Summary Statistics

# Summary statistics for reference vs bins
summary_stats <- da_results %>%
  filter(source_type %in% c("reference", "bin")) %>%
  group_by(source_type, metric_type) %>%
  summarise(
    n_methods = n(),
    mean_sensitivity = mean(sensitivity, na.rm = TRUE),
    mean_specificity = mean(specificity, na.rm = TRUE),
    mean_precision = mean(precision, na.rm = TRUE),
    mean_auroc = mean(auroc, na.rm = TRUE),
    mean_auprc = mean(auprc, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(
    across(c(mean_sensitivity, mean_specificity, mean_precision, mean_auroc, mean_auprc), ~ round(.x, 3))
  )

# Display interactive summary table
summary_dt <- summary_stats %>%
  DT::datatable(
    caption = "Summary Statistics: Reference vs Bins Performance",
    options = list(
      pageLength = 10,
      lengthMenu = c(5, 10, 15, 20, -1),
      scrollX = TRUE,
      dom = 'Bfrtip',
      buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
    ),
    extensions = c('Buttons'),
    filter = 'top',
    rownames = FALSE
  ) %>%
  DT::formatRound(columns = c('mean_sensitivity', 'mean_specificity', 'mean_precision', 'mean_auroc', 'mean_auprc'), digits = 3)

summary_dt

Performance Comparison Plots

# 1) Long data
plot_data <- da_results %>%
  filter(source_type %in% c("reference", "bin")) %>%
  mutate(
    f1_score = 2 * (precision * sensitivity) / (precision + sensitivity),
    auc_pr   = auprc
  ) %>%
  select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
  pivot_longer(c(f1_score, auc_pr, sensitivity, specificity),
               names_to = "metric", values_to = "value") %>%
  filter(!is.na(value))

metric_order  <- c("counts","rpkm","relative","breadth","coverage","tpm")
tool_colors   <- c(ancombc2="#E63946", aldex2="#F77F00", metagenomeseq="#2A9D8F",
                   wilcoxon="#264653", maaslin2="#7209B7")

# 2) Compute common y-limits per performance metric across BOTH sources
common_limits <- plot_data %>%
  group_by(metric) %>%
  summarise(lo = min(value, na.rm=TRUE), hi = max(value, na.rm=TRUE), .groups="drop") %>%
  mutate(
    # pad a touch so whiskers/points aren’t clipped
    lo = pmax(0, lo - 0.02),
    hi = pmin(1, hi + 0.02)
  )

lims_for <- function(metric_name) {
  with(common_limits[common_limits$metric==metric_name, ], c(lo, hi))
}

# 3) Subplot
create_subplot <- function(data, metric_name, title, y_label) {
  ggplot(data %>% filter(metric == metric_name),
         aes(x = factor(metric_type, levels = metric_order), y = value)) +
    geom_boxplot(alpha = 0.7, outlier.shape = NA, fill="white", color="black") +
    geom_jitter(aes(color = method), width = 0.2, alpha = 0.8, size = 2.5) +
    scale_color_manual(values = tool_colors) +
    scale_y_continuous(limits = lims_for(metric_name)) +
    scale_x_discrete(drop = FALSE) +       # <- keep all metrics even if absent in a subset
    labs(title = title, x = "Abundance Metric", y = y_label, color = "Method") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "bottom") +
    coord_flip()
}

# 4) Make panels 
ref_data  <- plot_data %>% filter(source_type == "reference")
bins_data <- plot_data %>% filter(source_type == "bin")

p1_ref  <- create_subplot(ref_data,  "f1_score",   "A) F1-Score Distribution by Metric (Reference)", "F1-Score")
p2_ref  <- create_subplot(ref_data,  "auc_pr",     "B) AUC-PR Distribution by Metric (Reference)",   "AUC-PR")
p3_ref  <- create_subplot(ref_data,  "sensitivity","C) Sensitivity Distribution by Metric (Reference)","Sensitivity")
p4_ref  <- create_subplot(ref_data,  "specificity","D) Specificity Distribution by Metric (Reference)","Specificity")

p1_bins <- create_subplot(bins_data, "f1_score",   "A) F1-Score Distribution by Metric (Bins)", "F1-Score")
p2_bins <- create_subplot(bins_data, "auc_pr",     "B) AUC-PR Distribution by Metric (Bins)",   "AUC-PR")
p3_bins <- create_subplot(bins_data, "sensitivity","C) Sensitivity Distribution by Metric (Bins)","Sensitivity")
p4_bins <- create_subplot(bins_data, "specificity","D) Specificity Distribution by Metric (Bins)","Specificity")

gridExtra::grid.arrange(p1_ref, p2_ref, p3_ref, p4_ref, ncol = 2,
                        top = "Performance Metrics Distribution Across Abundance Metrics - Reference Genomes")

gridExtra::grid.arrange(p1_bins, p2_bins, p3_bins, p4_bins, ncol = 2,
                        top = "Performance Metrics Distribution Across Abundance Metrics - Bins")

- Observation: Reference genomes show high sensitivity but low and variable specificity, with F1 and AUC-PR clustering around 0.6–0.8. Bin-level results are weaker, with reduced sensitivity and broader variability across abundance metrics.

Performance Heatmaps

# Prepare data for heatmaps
heatmap_data <- da_results %>%
  filter(source_type %in% c("reference", "bin")) %>%
  mutate(
    # Calculate F1-Score from precision and sensitivity
    f1_score = 2 * (precision * sensitivity) / (precision + sensitivity),
    # Use existing metrics
    auc_pr = auprc
  ) %>%
  select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
  # Ensure consistent metric order
  mutate(metric_type = factor(metric_type, levels = c("counts", "rpkm", "relative", "breadth", "coverage", "tpm")))

# Function to create heatmap
create_heatmap <- function(data, metric_col, title, source_label) {
  # Prepare matrix
  heatmap_matrix <- data %>%
    select(method, metric_type, !!sym(metric_col)) %>%
    pivot_wider(names_from = metric_type, values_from = !!sym(metric_col)) %>%
    column_to_rownames("method") %>%
    as.matrix()
  
  # Replace NAs with 0 for visualization
  heatmap_matrix[is.na(heatmap_matrix)] <- 0
  
  # Create heatmap
  corrplot(heatmap_matrix, 
           method = "color", 
           is.corr = FALSE,
           col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
           tl.col = "black", 
           tl.srt = 45,
           addCoef.col = "black",
           number.cex = 0.7,
           title = paste(title, "-", source_label),
           mar = c(0,0,2,0),
           cl.cex = 0.8)
}

# Reference genomes heatmaps
ref_heatmap_data <- heatmap_data %>% filter(source_type == "reference")

par(mfrow = c(2, 2), mar = c(2, 2, 3, 2))
create_heatmap(ref_heatmap_data, "f1_score", "F1-Score Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "auc_pr", "AUC-PR Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "sensitivity", "Sensitivity Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "specificity", "Specificity Heatmap", "Reference Genomes")

# Bins heatmaps
bins_heatmap_data <- heatmap_data %>% filter(source_type == "bin")

par(mfrow = c(2, 2), mar = c(2, 2, 3, 2))
create_heatmap(bins_heatmap_data, "f1_score", "F1-Score Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "auc_pr", "AUC-PR Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "sensitivity", "Sensitivity Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "specificity", "Specificity Heatmap", "Bins")

# Reset par
par(mfrow = c(1, 1))

Observation: Reference genomes show generally high sensitivity (ANCOMBC-2: 1.00, for all metrics) across methods but poor specificity, leading to only moderate F1 and AUC-PR scores. Bins show large performance drops. ANCOM-BC2 and metagenomeSeq achieve the good balance — both show high sensitivity (≈1.0) and comparatively better specificity than others. MetagenomeSeq has the highest overall F1 and AUC-PR.

There is no significant changes in performance for the metrics except for a few metrics where aldex2 produced NA result. We should focus more on the Specificity heatmap as FDR control (reduction of False Positives is the bigger problem). The methods can clearly can identify True Positives (sensitivity 1.0) but struggles with FDR control. In that regard, MetagenomeSeq (at reference) and ANCOMBC-2 (at bins) performs better than other methods.

Effect of Alpha on False Positive Rate

# Define alpha values and corresponding directories
alpha_values <- c(0.05, 0.01, 0.005, 0.001, 0.00001)
alpha_dirs <- c("with_bins", "alpha_0.01", "alpha_0.005", "alpha_0.001", "alpha_0.00001")

# Function to load data for a specific alpha
load_alpha_data <- function(alpha_dir) {
  file_path <- file.path("/home/mrahman/SIMBA/results/homd_final", alpha_dir, "comprehensive_summary_with_bins.csv")
  if (file.exists(file_path)) {
    data <- read_csv(file_path, show_col_types = FALSE) %>%
      mutate(alpha_dir = alpha_dir)
    return(data)
  } else {
    warning(paste("File not found:", file_path))
    return(NULL)
  }
}

# Load data for all alpha values
alpha_data_list <- map(alpha_dirs, load_alpha_data)
alpha_data_list <- alpha_data_list[!map_lgl(alpha_data_list, is.null)]

# Combine all data
combined_alpha_data <- bind_rows(alpha_data_list) %>%
  mutate(
    alpha_value = case_when(
      alpha_dir == "with_bins" ~ 0.05,
      alpha_dir == "alpha_0.01" ~ 0.01,
      alpha_dir == "alpha_0.005" ~ 0.005,
      alpha_dir == "alpha_0.001" ~ 0.001,
      alpha_dir == "alpha_0.00001" ~ 0.00001,
      TRUE ~ NA_real_
    ),
    alpha_label = paste("α =", alpha_value)
  ) %>%
  filter(!is.na(alpha_value))

# Prepare data for plotting
alpha_plot_data <- combined_alpha_data %>%
  filter(source_type %in% c("reference", "bin")) %>%
  select(method, metric_type, source_type, alpha_value, alpha_label, fdr, fp, tn) %>%
  mutate(
    method = str_to_title(method),
    metric_type = factor(metric_type, levels = c("counts", "tpm", "rpkm", "relative", "breadth", "coverage"))
  )

# Create summary statistics by alpha
alpha_summary <- alpha_plot_data %>%
  group_by(alpha_value, alpha_label, source_type) %>%
  summarise(
    mean_fdr = mean(fdr, na.rm = TRUE),
    median_fdr = median(fdr, na.rm = TRUE),
    sd_fdr = sd(fdr, na.rm = TRUE),
    mean_fp = mean(fp, na.rm = TRUE),
    median_fp = median(fp, na.rm = TRUE),
    n_methods = n(),
    .groups = 'drop'
  )

# Plot 1: FDR vs Alpha (line plot)
p1 <- ggplot(alpha_summary, aes(x = alpha_value, y = mean_fdr, color = source_type)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = mean_fdr - sd_fdr/sqrt(n_methods), 
                    ymax = mean_fdr + sd_fdr/sqrt(n_methods)), 
                width = 0.001, alpha = 0.7) +
  scale_x_log10(breaks = alpha_values, labels = alpha_values) +
  scale_color_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
  labs(
    title = "A) False Discovery Rate vs Alpha Value",
    x = "Alpha Value (log scale)",
    y = "Mean False Discovery Rate",
    color = "Source Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Plot 2: FDR vs Alpha (bar plot by method)
p2 <- ggplot(alpha_plot_data, aes(x = alpha_label, y = fdr, fill = source_type)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ method, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
  labs(
    title = "B) FDR Distribution by Alpha Value and Method",
    x = "Alpha Value",
    y = "False Discovery Rate",
    fill = "Source Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Plot 3: False Positives vs Alpha
p3 <- ggplot(alpha_summary, aes(x = alpha_value, y = mean_fp, color = source_type)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_x_log10(breaks = alpha_values, labels = alpha_values) +
  scale_color_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
  labs(
    title = "C) False Positives vs Alpha Value",
    x = "Alpha Value (log scale)",
    y = "Mean False Positives",
    color = "Source Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Plot 4: FDR by Metric Type
p4 <- ggplot(alpha_plot_data, aes(x = alpha_label, y = fdr, fill = metric_type)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~ source_type, ncol = 2) +
  scale_fill_manual(values = metric_colors) +
  labs(
    title = "D) FDR by Alpha Value and Metric Type",
    x = "Alpha Value",
    y = "False Discovery Rate",
    fill = "Metric Type"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

# Arrange plots
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2, 
                       top = "Effect of Alpha Value on False Discovery Rate and False Positives")

# Summary table
alpha_summary_table <- alpha_summary %>%
  select(alpha_value, alpha_label, source_type, mean_fdr, mean_fp) %>%
  pivot_wider(names_from = source_type, values_from = c(mean_fdr, mean_fp)) %>%
  arrange(desc(alpha_value)) %>%
  select(-alpha_value) %>%
  dplyr::rename(
    Alpha = alpha_label,
    `Mean FDR (Reference)` = mean_fdr_reference,
    `Mean FDR (Bins)` = mean_fdr_bin,
    `Mean FP (Reference)` = mean_fp_reference,
    `Mean FP (Bins)` = mean_fp_bin
  )

alpha_summary_table %>%
  kable(caption = "Summary of FDR and False Positives by Alpha Value", 
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Summary of FDR and False Positives by Alpha Value
Alpha	Mean FDR (Bins)	Mean FDR (Reference)	Mean FP (Bins)	Mean FP (Reference)
α = 0.01	0.4168	0.4800	7.5385	83.5385
α = 0.005	0.3823	0.4749	6.6538	82.0769
α = 0.001	0.3265	0.4539	5.1538	75.9231
α = 1e-05	0.2128	0.3966	3.0769	60.1923

# Statistical analysis: Correlation between alpha and FDR
correlation_results <- alpha_plot_data %>%
  group_by(source_type, method) %>%
  summarise(
    correlation = cor(alpha_value, fdr, use = "complete.obs"),
    p_value = cor.test(alpha_value, fdr)$p.value,
    .groups = 'drop'
  ) %>%
  mutate(
    significance = case_when(
      p_value < 0.001 ~ "***",
      p_value < 0.01 ~ "**",
      p_value < 0.05 ~ "*",
      TRUE ~ "ns"
    )
  )

cat("\nCorrelation Analysis (Alpha vs FDR):\n")

## 
## Correlation Analysis (Alpha vs FDR):

cat("=====================================\n")

## =====================================

correlation_results %>%
  kable(caption = "Correlation between Alpha and FDR by Method and Source Type", 
        digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Correlation between Alpha and FDR by Method and Source Type
source_type	method	correlation	p_value	significance
bin	Aldex2	0.7090	0.0489
bin	Ancombc2	0.6301	0.0010	***
bin	Maaslin2	0.7335	0.0000	***
bin	Metagenomeseq	0.6126	0.0041	**
bin	Wilcoxon	0.7606	0.0000	***
reference	Aldex2	0.4564	0.2557	ns
reference	Ancombc2	0.2836	0.1793	ns
reference	Maaslin2	0.3429	0.1009	ns
reference	Metagenomeseq	0.3037	0.1491	ns
reference	Wilcoxon	0.4909	0.0148

Observation: Reducing alpha decreases both FDR and false positives, the effect is more striking for reference genomes than bins.

Statistical Comparison of Metrics Within Tools

# ---- Compact summary: A) Leaderboard + B) Sens–Spec ----

suppressPackageStartupMessages({
  library(tidyverse)
  library(ggrepel)
  library(viridis)
  library(scales)
  library(patchwork)
})

required_cols <- c("method","metric_type","source_type","precision","sensitivity","specificity","auprc")
metric_levels <- c("counts","tpm","rpkm","relative","breadth","coverage")

# Prepare data
stats_data <- da_results %>%
  filter(source_type %in% c("reference","bin")) %>%
  mutate(
    f1_score = 2 * (precision * sensitivity) / pmax(precision + sensitivity, .Machine$double.eps),
    auc_pr   = auprc
  ) %>%
  select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
  mutate(
    metric_type = factor(metric_type, levels = metric_levels),
    f1_score = pmin(pmax(f1_score, 0), 1),
    auc_pr   = pmin(pmax(auc_pr,   0), 1),
    sensitivity = pmin(pmax(sensitivity, 0), 1),
    specificity = pmin(pmax(specificity, 0), 1)
  ) %>%
  filter(!is.na(method), !is.na(metric_type), !is.na(source_type))

# Helpers
safe_z <- function(x){
  m <- mean(x, na.rm = TRUE); s <- sd(x, na.rm = TRUE)
  if (!is.finite(s) || s == 0) return(rep(0, length(x)))
  (x - m) / s
}
placeholder_plot <- function(title_txt){
  ggplot() +
    annotate("text", x = 0, y = 0, label = paste0(title_txt, "\n(no valid data)"), size = 4) +
    xlim(-1,1) + ylim(-1,1) + theme_void() +
    theme(plot.title = element_text(face = "bold")) + labs(title = title_txt)
}

# A) Overall Leaderboard (composite of z(F1), z(AUCPR))
leader_input <- stats_data %>%
  pivot_longer(c(f1_score, auc_pr), names_to = "measure", values_to = "val") %>%
  group_by(metric_type, source_type, measure) %>%
  mutate(z = safe_z(val)) %>%
  ungroup() %>%
  group_by(method) %>%
  summarise(composite = mean(z, na.rm = TRUE), .groups = "drop") %>%
  filter(is.finite(composite))

pA <- if (nrow(leader_input) > 0) {
  ggplot(leader_input, aes(reorder(method, composite), composite)) +
    geom_col(width = 0.7) +
    coord_flip() +
    labs(title = "A) Overall Leaderboard",
         x = NULL, y = "Composite score (z-avg of F1 & AUCPR)") +
    theme_minimal(base_size = 11) +
    theme(plot.title = element_text(face = "bold"))
} else {
  placeholder_plot("A) Overall Leaderboard")
}

# B) Sensitivity–Specificity Trade-off (avg across metrics), faceted by Source
trade <- stats_data %>%
  group_by(method, source_type) %>%
  summarise(
    sens = mean(sensitivity, na.rm = TRUE),
    spec = mean(specificity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(is.finite(sens), is.finite(spec))

pB <- if (nrow(trade) > 0) {
  ggplot(trade, aes(spec, sens, label = method)) +
    geom_point(size = 2) +
    ggrepel::geom_text_repel(size = 3, max.overlaps = 20, seed = 1) +
    facet_wrap(~ source_type, drop = FALSE) +
    labs(title = "B) Sensitivity–Specificity Trade-off",
         x = "Specificity", y = "Sensitivity") +
    coord_cartesian(xlim = c(0,1), ylim = c(0,1), expand = TRUE) +
    theme_minimal(base_size = 11) +
    theme(plot.title = element_text(face = "bold"))
} else {
  placeholder_plot("B) Sensitivity–Specificity Trade-off")
}

# Assemble (heatmap removed)
summary_plot <- (pA / pB) +
  plot_layout(heights = c(1, 1)) &
  theme(plot.margin = margin(6, 6, 6, 6))

print(summary_plot)

Consensus Analysis

Feature-Level Consensus

For this analysis, we implement four vote-based consensus strategies over feature-level calls made by individual tools (q ≤ 0.05):

≥2 tools: Genome is DA if at least 2 tools call DA
Majority vote: Genome is DA if >50% of tools call DA
≥75% tools: Genome is DA if ≥75% of tools call DA
All tools: Genome is DA only if all tools call DA

suppressPackageStartupMessages({ library(readr); library(purrr) })

# Define metrics and tools present in this run
all_metrics <- c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")
tools_raw <- c("aldex2", "ancombc2", "metagenomeseq", "maaslin2", "wilcoxon")
tool_pretty_map <- c(
  aldex2 = "ALDEx2",
  ancombc2 = "ANCOM-BC2",
  metagenomeseq = "metagenomeSeq",
  maaslin2 = "MaAsLin2",
  wilcoxon = "Wilcoxon"
)

# Normalization helper to match truth_table 'genome'
normalize_id <- function(x) stringr::str_replace_all(x, "[^A-Za-z0-9]", "")

# Build master_df of per-feature results across tools and metrics
read_tool_results <- function(metric, tool) {
  path <- file.path(results_input_dir, metric, paste0(tool, "_results.csv"))
  if (!file.exists(path)) return(NULL)
  df <- suppressMessages(readr::read_csv(path, show_col_types = FALSE))
  if (!all(c("feature", "qval") %in% names(df))) return(NULL)
  df %>%
    transmute(
      genome_raw = as.character(feature),
      genome = normalize_id(as.character(feature)),
      adjP = as.numeric(qval),
      direction = as.character(dplyr::coalesce(direction, NA_character_)),
      tool = tool_pretty_map[[tool]],
      metric = metric,
      source_type = ifelse(stringr::str_starts(genome_raw, "bin_"), "bin", "reference")
    )
}

master_df <- map_dfr(all_metrics, function(m) {
  map_dfr(tools_raw, function(t) read_tool_results(m, t))
}) %>%
  distinct()

# Join ground truth (truth_table already normalized earlier in this Rmd)
truth_lookup <- truth_table %>% dplyr::select(genome, is_DA) %>% tibble::as_tibble()
master_df <- tibble::as_tibble(master_df) %>% dplyr::left_join(truth_lookup, by = "genome")

# Derive per-tool DA calls at q ≤ 0.05 (user-confirmed)
alpha_q <- 0.05
master_df <- master_df %>% mutate(da_call = !is.na(adjP) & adjP <= alpha_q)

# Build consensus summaries per metric (over all source types combined)
all_consensus_summaries <- list()

for (current_metric in all_metrics) {
  metric_calls <- master_df %>% filter(metric == current_metric)

  # n_tools_total per genome is the number of tools that produced a result for that genome
  consensus_summary <- metric_calls %>%
    group_by(genome) %>%
    summarise(
      n_tools_total = dplyr::n_distinct(tool),
      n_tools_da = sum(da_call, na.rm = TRUE),
      prop_tools_da = ifelse(n_tools_total > 0, n_tools_da / n_tools_total, NA_real_),
      is_DA = dplyr::first(is_DA),
      .groups = 'drop'
    ) %>%
    mutate(
      consensus_2plus = n_tools_da >= 2,
      consensus_majority = n_tools_da > (n_tools_total / 2),
      consensus_75plus = n_tools_total > 0 & (n_tools_da / n_tools_total) >= 0.75,
      consensus_all = n_tools_total > 0 & n_tools_da == n_tools_total
    )

  all_consensus_summaries[[current_metric]] <- consensus_summary
}

# Quick sanity: show coverage of tools per metric
tool_coverage <- master_df %>%
  dplyr::count(metric, tool) %>% dplyr::arrange(metric, tool)

  tool_coverage %>%
  kable(caption = "Per-metric tool result counts (rows)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Per-metric tool result counts (rows)
metric	tool	n
breadth	ALDEx2	239
breadth	ANCOM-BC2	239
breadth	MaAsLin2	235
breadth	Wilcoxon	239
breadth	metagenomeSeq	239
counts	ALDEx2	239
counts	ANCOM-BC2	239
counts	MaAsLin2	235
counts	Wilcoxon	239
counts	metagenomeSeq	239
coverage	ANCOM-BC2	235
coverage	MaAsLin2	235
coverage	Wilcoxon	235
coverage	metagenomeSeq	235
relative	ANCOM-BC2	235
relative	MaAsLin2	235
relative	Wilcoxon	235
relative	metagenomeSeq	235
rpkm	ANCOM-BC2	235
rpkm	MaAsLin2	235
rpkm	Wilcoxon	235
rpkm	metagenomeSeq	235
tpm	ANCOM-BC2	239
tpm	MaAsLin2	235
tpm	Wilcoxon	239
tpm	metagenomeSeq	239

Consensus Heatmaps

suppressPackageStartupMessages({
  library(tidyr); library(scales)
  if (!requireNamespace("ggnewscale", quietly = TRUE)) {
    # Optional install if not present; comment out for offline runs
    # install.packages("ggnewscale")
  }
  library(ggnewscale)
})

# Get log2FC values for ordering
log2fc_data <- truth_table %>%
  select(genome, true_log2FC) %>%
  mutate(true_log2FC = replace_na(true_log2FC, 0))

for (current_metric in all_metrics) {
  cat("\n\n#### Consensus Heatmap:", toupper(current_metric), "Metric\n\n")

  # Prepare per-tool calls for current metric
  current_metric_data <- master_df %>%
    filter(metric == current_metric, !is.na(adjP)) %>%
    select(genome, tool, adjP, is_DA)

  individual_tools <- current_metric_data %>%
    mutate(da_call = adjP <= alpha_q) %>%
    select(genome, tool, da_call, is_DA) %>%
    mutate(correct_call = case_when(
      da_call & is_DA ~ "TP",
      !da_call & !is_DA ~ "TN",
      da_call & !is_DA ~ "FP",
      !da_call & is_DA ~ "FN",
      TRUE ~ NA_character_
    )) %>%
    select(genome, tool, da_call, correct_call, is_DA)

  consensus_summary_current <- all_consensus_summaries[[current_metric]]

  heatmap_data <- consensus_summary_current %>%
    dplyr::left_join(log2fc_data, by = "genome") %>%
    mutate(
      consensus_2plus_correct = case_when(
        consensus_2plus & is_DA ~ "TP",
        !consensus_2plus & !is_DA ~ "TN",
        consensus_2plus & !is_DA ~ "FP",
        !consensus_2plus & is_DA ~ "FN"
      ),
      consensus_majority_correct = case_when(
        consensus_majority & is_DA ~ "TP",
        !consensus_majority & !is_DA ~ "TN",
        consensus_majority & !is_DA ~ "FP",
        !consensus_majority & is_DA ~ "FN"
      ),
      consensus_75plus_correct = case_when(
        consensus_75plus & is_DA ~ "TP",
        !consensus_75plus & !is_DA ~ "TN",
        consensus_75plus & !is_DA ~ "FP",
        !consensus_75plus & is_DA ~ "FN"
      ),
      consensus_all_correct = case_when(
        consensus_all & is_DA ~ "TP",
        !consensus_all & !is_DA ~ "TN",
        consensus_all & !is_DA ~ "FP",
        !consensus_all & is_DA ~ "FN"
      )
    ) %>%
    arrange(desc(abs(true_log2FC)), desc(prop_tools_da), genome) %>%
    mutate(genome_order = dplyr::row_number())

  cat("**Data Summary:**\n")
  cat("- Genomes analyzed:", nrow(heatmap_data), "\n")
  cat("- Available tools:", paste(unique(individual_tools$tool), collapse = ", "), "\n\n")

  # 1) Metadata block
  metadata_long <- heatmap_data %>%
    select(genome, genome_order, is_DA, n_tools_da, n_tools_total, prop_tools_da, true_log2FC) %>%
    mutate(is_DA_numeric = as.numeric(is_DA), true_log2FC_abs = abs(true_log2FC)) %>%
    select(-is_DA) %>%
    pivot_longer(cols = -c(genome, genome_order), names_to = "column", values_to = "value") %>%
    mutate(column_type = "metadata")

  # 2) Individual tools block
  tools_long <- individual_tools %>%
    select(genome, tool, correct_call) %>%
    dplyr::left_join(heatmap_data %>% dplyr::select(genome, genome_order), by = "genome") %>%
    dplyr::rename(column = tool, value = correct_call) %>%
    mutate(column_type = "individual_tool")

  # 3) Consensus block
  consensus_long <- heatmap_data %>%
    select(genome, genome_order,
           consensus_2plus_correct, consensus_majority_correct,
           consensus_75plus_correct, consensus_all_correct) %>%
    pivot_longer(cols = -c(genome, genome_order), names_to = "column", values_to = "value") %>%
    mutate(
      column = case_when(
        column == "consensus_2plus_correct" ~ "≥2 Tools",
        column == "consensus_majority_correct" ~ "Majority",
        column == "consensus_75plus_correct" ~ "≥75%",
        column == "consensus_all_correct" ~ "All Tools"
      ),
      column_type = "consensus"
    )

  # Harmonize numeric encodings for fills
  metadata_long <- metadata_long %>% mutate(value_numeric = as.numeric(value), value = as.character(value))
  tools_long <- tools_long %>% mutate(
    value_numeric = dplyr::case_when(
      value == "TP" ~ 1, value == "TN" ~ 0.5, value == "FP" ~ -0.5, value == "FN" ~ -1, TRUE ~ 0
    ), value = as.character(value))
  consensus_long <- consensus_long %>% mutate(
    value_numeric = dplyr::case_when(
      value == "TP" ~ 1, value == "TN" ~ 0.5, value == "FP" ~ -0.5, value == "FN" ~ -1, TRUE ~ 0
    ), value = as.character(value))

  # Column order reflecting actual tools
  plot_data <- dplyr::bind_rows(metadata_long, tools_long, consensus_long) %>%
    mutate(
      column_order = dplyr::case_when(
        column == "is_DA_numeric" ~ 1,
        column == "n_tools_da" ~ 2,
        column == "n_tools_total" ~ 3,
        column == "prop_tools_da" ~ 4,
        column == "ALDEx2" ~ 5,
        column == "ANCOM-BC2" ~ 6,
        column == "metagenomeSeq" ~ 7,
        column == "MaAsLin2" ~ 8,
        column == "Wilcoxon" ~ 9,
        column == "≥2 Tools" ~ 10,
        column == "Majority" ~ 11,
        column == "≥75%" ~ 12,
        column == "All Tools" ~ 13,
        column == "true_log2FC" ~ 14,
        column == "true_log2FC_abs" ~ 15,
        TRUE ~ 99
      )
    ) %>%
    arrange(column_order)

  p_single <- plot_data %>%
    ggplot(aes(x = reorder(column, column_order), y = reorder(genome, -genome_order))) +
    geom_tile(aes(fill = ifelse(column_type == "metadata", value_numeric, NA)), color = "white", size = 0.1) +
    scale_fill_gradient2(low = "white", mid = "lightblue", high = "darkblue", midpoint = 0, na.value = "transparent", name = "Numeric Value") +
    { if (requireNamespace("ggnewscale", quietly = TRUE)) ggnewscale::new_scale_fill() else NULL } +
    geom_tile(data = dplyr::filter(plot_data, column_type %in% c("individual_tool", "consensus")),
              aes(fill = value), color = "white", size = 0.1) +
    scale_fill_manual(values = c("TP" = "#2d8f2d", "TN" = "#90EE90", "FP" = "#dc143c", "FN" = "#ffb6c1"), name = "Classification", na.value = "transparent") +
    labs(title = paste("Comprehensive Consensus Analysis Heatmap:", stringr::str_to_title(current_metric), "Metric"),
         subtitle = "Genome-level results across individual tools and consensus strategies",
         x = "Analysis Components", y = "Genomes (ordered by |log2FC| and DA proportion)") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
          axis.text.y = element_text(size = 6),
          plot.title = element_text(face = "bold", size = 14),
          plot.subtitle = element_text(size = 11),
          legend.position = "right", legend.box = "vertical", panel.grid = element_blank(), axis.ticks = element_blank()) +
    geom_vline(xintercept = c(4.5, 9.5, 13.5), color = "black", size = 1, alpha = 0.8)

  print(p_single)
  cat("\n\n")
}

## 
## 
## #### Consensus Heatmap: COUNTS Metric
## 
## **Data Summary:**
## - Genomes analyzed: 239 
## - Available tools: ALDEx2, ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

## 
## 
## 
## 
## #### Consensus Heatmap: TPM Metric
## 
## **Data Summary:**
## - Genomes analyzed: 239 
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

## 
## 
## 
## 
## #### Consensus Heatmap: RPKM Metric
## 
## **Data Summary:**
## - Genomes analyzed: 235 
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

## 
## 
## 
## 
## #### Consensus Heatmap: RELATIVE Metric
## 
## **Data Summary:**
## - Genomes analyzed: 235 
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

## 
## 
## 
## 
## #### Consensus Heatmap: BREADTH Metric
## 
## **Data Summary:**
## - Genomes analyzed: 239 
## - Available tools: ALDEx2, ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

## 
## 
## 
## 
## #### Consensus Heatmap: COVERAGE Metric
## 
## **Data Summary:**
## - Genomes analyzed: 235 
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon

Compute Consensus Performance (for comparison plots)

# ===========================
# Full replacement: ALL plots as one A4-sized figure
# ===========================
suppressPackageStartupMessages({
  library(tidyverse)
  library(patchwork)
  library(viridis)
  library(scales)
})

# ---- Helper: safe reorder that tolerates all-NA gracefully ----
safe_reorder <- function(x, by) {
  if (all(is.na(by))) return(factor(x, levels = unique(x)))
  reorder(x, by)
}

# ---- Helper to compute confusion metrics (from your snippet) ----
compute_metrics <- function(pred, truth) {
  tp <- sum(pred & truth, na.rm = TRUE)
  fp <- sum(pred & !truth, na.rm = TRUE)
  fn <- sum(!pred & truth, na.rm = TRUE)
  tn <- sum(!pred & !truth, na.rm = TRUE)
  precision <- ifelse((tp + fp) > 0, tp / (tp + fp), NA_real_)
  recall <- ifelse((tp + fn) > 0, tp / (tp + fn), NA_real_)
  specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA_real_)
  f1 <- ifelse((precision + recall) > 0, 2 * precision * recall / (precision + recall), NA_real_)
  tibble(tp = tp, fp = fp, fn = fn, tn = tn,
         precision = precision, sensitivity = recall, specificity = specificity, f1_score = f1)
}

# ---- Consensus strategies (from your snippet) ----
consensus_strategies <- c("consensus_2plus", "consensus_majority", "consensus_75plus", "consensus_all")
strategy_names       <- c("≥2 tools", "Majority (>50%)", "≥75% tools", "All tools")

# ==== 1) Build consensus performance across metrics ====
# expects: `all_metrics` (character vec) and `all_consensus_summaries` (named list) to exist
all_consensus_performance <- purrr::map_dfr(all_metrics, function(m) {
  cs <- all_consensus_summaries[[m]]
  purrr::map_dfr(seq_along(consensus_strategies), function(i) {
    flag <- consensus_strategies[i]
    name <- strategy_names[i]
    pred <- cs[[flag]]
    truth <- cs$is_DA
    mets <- compute_metrics(pred, truth)
    mets %>%
      mutate(metric = m, strategy = name, auprc = NA_real_) %>%
      select(metric, strategy, f1_score, precision, sensitivity, specificity, auprc)
  })
})

# ==== 2) Build individual tool performance per metric ====
# expects: `da_results` tibble with method, metric_type, source_type, precision, sensitivity, specificity, auprc
recode_methods <- function(x) dplyr::recode(x,
  ancombc2 = "ANCOM-BC2",
  aldex2   = "ALDEx2",
  metagenomeseq = "metagenomeSeq",
  maaslin2 = "MaAsLin2",
  wilcoxon = "Wilcoxon",
  .default = x
)

get_individual_performance <- function(metric_name) {
  da_results %>%
    filter(metric_type == metric_name, source_type %in% c("reference", "bin")) %>%
    mutate(f1_score = 2 * (precision * sensitivity) / pmax(precision + sensitivity, .Machine$double.eps)) %>%
    group_by(method) %>%
    summarise(
      f1_score   = mean(f1_score,   na.rm = TRUE),
      precision  = mean(precision,  na.rm = TRUE),
      sensitivity= mean(sensitivity,na.rm = TRUE),
      specificity= mean(specificity,na.rm = TRUE),
      auprc      = mean(auprc,      na.rm = TRUE),
      .groups = 'drop'
    ) %>%
    mutate(type = "Individual tool",
           strategy = recode_methods(method)) %>%
    select(strategy, f1_score, precision, sensitivity, specificity, auprc, type)
}

# ==== 3) One function that makes a 2-panel block (Sensitivity | Specificity) for a metric ====
make_metric_block <- function(metric_name) {
  ind <- get_individual_performance(metric_name)
  con <- all_consensus_performance %>%
    filter(metric == metric_name) %>%
    mutate(type = "Consensus") %>%
    select(strategy, f1_score, precision, sensitivity, specificity, auprc, type)

  combined <- bind_rows(con, ind) %>% mutate(metric = metric_name)

  # Sensitivity
  p_sens <- ggplot(
    combined,
    aes(x = safe_reorder(strategy, sensitivity), y = sensitivity, fill = type)
  ) +
    geom_col(width = 0.7) +
    scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
    scale_fill_manual(values = c("Consensus" = "#e31a1c", "Individual tool" = "#1f78b4")) +
    coord_flip() +
    labs(title = paste("Sensitivity —", stringr::str_to_title(metric_name)),
         x = NULL, y = "Sensitivity", fill = NULL) +
    theme_minimal(base_size = 11) +
    theme(legend.position = "bottom",
          plot.title = element_text(face = "bold"),
          axis.text.x = element_text())

  # Specificity
  p_spec <- ggplot(
    combined,
    aes(x = safe_reorder(strategy, specificity), y = specificity, fill = type)
  ) +
    geom_col(width = 0.7) +
    scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
    scale_fill_manual(values = c("Consensus" = "#e31a1c", "Individual tool" = "#1f78b4")) +
    coord_flip() +
    labs(title = paste("Specificity —", stringr::str_to_title(metric_name)),
         x = NULL, y = "Specificity", fill = NULL) +
    theme_minimal(base_size = 11) +
    theme(legend.position = "bottom",
          plot.title = element_text(face = "bold"),
          axis.text.x = element_text())

  # Return a 2-panel block for this metric
  (p_sens | p_spec) + plot_layout(guides = "collect") & theme(legend.position = "bottom")
}

# ==== 4) Build all metric blocks and a cross-metric summary heatmap ====
metric_blocks <- purrr::map(all_metrics, make_metric_block)

# Cross-metric summary for consensus strategies only (visual, no table)
cross_metric_heat <- all_consensus_performance %>%
  group_by(strategy) %>%
  summarise(
    F1          = mean(f1_score,    na.rm = TRUE),
    Precision   = mean(precision,   na.rm = TRUE),
    Sensitivity = mean(sensitivity, na.rm = TRUE),
    Specificity = mean(specificity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(-strategy, names_to = "Measure", values_to = "Value")

p_heat <- ggplot(cross_metric_heat, aes(Measure, strategy, fill = Value)) +
  geom_tile(color = "white", linewidth = 0.25) +
  scale_fill_viridis(limits = c(0, 1), option = "C", guide = guide_colorbar(title.position = "top")) +
  labs(title = "Cross-metric Consensus Summary (mean across metrics)",
       x = "Measure", y = "Consensus strategy", fill = "Score") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))

# ==== 5) Assemble EVERYTHING into ONE big figure with tags ====
# Arrange metric blocks vertically; add heatmap at the bottom
big_plot <- wrap_plots(metric_blocks, ncol = 1) /
            p_heat +
  plot_layout(heights = c(rep(1, length(metric_blocks)), 0.9)) +
  plot_annotation(tag_levels = "A")

# Print (may require a larger pane); best to use ggsave below
print(big_plot)

Observation: Consensus strategies (red) almost always push sensitivity upward compared to any single method (blue).

The “All tools” consensus achieves the most balanced average performance but other strategies could produce better sensitivity at the cost of worse specificity (Panel M). Adding more methods could further increase performance.

Always-Wrong Genomes (all tools misclassified)

There are some genomes that are always misclassifed by all tools. We would like to see if these genomes have any particular properties (genus, genome_size, gc_content) that can influence their misclassification. Lets first identify the misclassified ref and bins genome from different metrics and gather their properties.

library(rentrez)
library(xml2)
library(Biostrings)

# Identify genomes where all available tools were wrong (per metric)
always_wrong_list <- list()

for (current_metric in all_metrics) {
  calls <- master_df %>%
    filter(metric == current_metric) %>%
    mutate(is_wrong = case_when(
      da_call & !is_DA ~ TRUE,   # FP
      !da_call & is_DA ~ TRUE,   # FN
      TRUE ~ FALSE
    ))
  per_genome <- calls %>%
    group_by(genome) %>%
    summarise(n_tools_total = dplyr::n_distinct(tool),
              n_wrong = sum(is_wrong, na.rm = TRUE),
              is_DA = dplyr::first(is_DA),
              source_genome = dplyr::first(source_type),
              .groups = 'drop') %>%
    mutate(always_wrong = n_tools_total >= 2 & n_wrong == n_tools_total,
           wrong_type = dplyr::case_when(
             always_wrong & is_DA ~ "FN_all",
             always_wrong & !is_DA ~ "FP_all",
             TRUE ~ NA_character_
           )) %>%
    filter(always_wrong)

  # Join truth_table for accessions and compute query_accession per source type
  per_genome <- per_genome %>%
    dplyr::left_join(truth_table %>% dplyr::select(genome, genome_accession, closest_genome), by = "genome") %>%
    mutate(
      closest_genome_dotted = ifelse(!is.na(closest_genome),
                                     sub("([0-9])$", ".\\1", closest_genome), NA_character_),
      query_accession = dplyr::case_when(
        source_genome == "reference" & !is.na(genome_accession) ~ genome_accession,
        source_genome == "bin" & !is.na(closest_genome_dotted) ~ closest_genome_dotted,
        TRUE ~ genome_accession
      ),
      metric = current_metric
    )
  always_wrong_list[[current_metric]] <- per_genome
}

always_wrong_all <- dplyr::bind_rows(always_wrong_list)

cat("Genomes with unanimous misclassification (by metric):", nrow(always_wrong_all), "\n")

## Genomes with unanimous misclassification (by metric): 162

# Helper: extract first non-null field from esummary
get_first <- function(x, fields) {
  for (f in fields) {
    if (!is.null(x[[f]])) return(x[[f]])
  }
  return(NA)
}

# Fetch NCBI metadata for a vector of accessions
fetch_ncbi_meta <- function(acc_vec) {
  acc_vec <- unique(acc_vec[!is.na(acc_vec)])
  if (length(acc_vec) == 0) return(tibble())
  purrr::map_dfr(acc_vec, function(acc) {
    # Normalize accession to dotted form if missing (e.g., add .1)
    dotted <- if (grepl("\\.\\d+$", acc)) acc else paste0(acc, ".1")
    # Try exact assembly accession match first
    uid <- tryCatch({ rentrez::entrez_search(db = "assembly", term = paste0(dotted, "[Assembly Accession]"))$ids }, error = function(e) NULL)
    if (is.null(uid) || length(uid) == 0) {
      # fallback: try the original string
      uid <- tryCatch({ rentrez::entrez_search(db = "assembly", term = acc)$ids }, error = function(e) NULL)
    }
    if (!is.null(uid) && length(uid) > 0) {
      sm <- tryCatch({ rentrez::entrez_summary(db = "assembly", id = uid[1]) }, error = function(e) NULL)
      if (!is.null(sm)) {
        # Extract basic fields
        asm_acc <- get_first(sm, c("assemblyaccession", "assembly_accession"))
        org <- get_first(sm, c("organism", "organism_name"))
        gc <- suppressWarnings(as.numeric(get_first(sm, c("gcpercent", "gc_percent", "gc"))))
        size <- suppressWarnings(as.numeric(get_first(sm, c("seq_length_sum", "totallength", "total_length"))))
        taxid <- suppressWarnings(as.character(get_first(sm, c("taxid", "tax_id"))))

        # Taxonomy via taxonomy DB if taxid available (prefer rank-aware XML)
        fam <- genus <- species <- NA_character_
        if (!is.na(taxid)) {
          tx_xml <- tryCatch({ rentrez::entrez_fetch(db = "taxonomy", id = taxid, rettype = "xml") }, error = function(e) NULL)
          if (!is.null(tx_xml)) {
            doc <- tryCatch(xml2::read_xml(tx_xml), error = function(e) NULL)
            if (!is.null(doc)) {
              fam <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='family']/ScientificName"))
              genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
              species <- xml2::xml_text(xml2::xml_find_first(doc, "//Taxon/ScientificName"))
              if (identical(fam, character(0))) fam <- NA_character_
              if (identical(genus, character(0))) genus <- NA_character_
              if (identical(species, character(0))) species <- NA_character_
            }
          }
        }

        return(tibble(accession = acc, assembly_accession = asm_acc, organism = org,
                      genome_size = size, gc_percent = gc, family = fam,
                      genus = genus, species = species))
      }
    }
    # Fallback: no metadata
    tibble(accession = acc, assembly_accession = NA_character_, organism = NA_character_,
           genome_size = NA_real_, gc_percent = NA_real_, family = NA_character_,
           genus = NA_character_, species = NA_character_)
  })
}

# Pull NCBI metadata with a small cache (assembly meta + taxonomy)
cache_dir <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
cache_file <- file.path(cache_dir, "assembly_meta.csv")
ncbi_meta <- tibble()
accs <- unique(always_wrong_all$query_accession[!is.na(always_wrong_all$query_accession)])
cached <- if (file.exists(cache_file)) suppressMessages(readr::read_csv(cache_file, show_col_types = FALSE)) else tibble()
to_fetch <- setdiff(accs, cached$accession %||% character())
fetched <- if (length(to_fetch) > 0) fetch_ncbi_meta(to_fetch) else tibble()
if (nrow(fetched) > 0) {
  cached <- dplyr::bind_rows(cached, fetched) %>% dplyr::distinct(accession, .keep_all = TRUE)
  readr::write_csv(cached, cache_file)
}
ncbi_meta <- cached %>% dplyr::filter(accession %in% accs)

# Compute GC% and genome size from FASTA (online) using Biostrings
compute_gc_size_online <- function(assembly_acc) {
  if (is.na(assembly_acc) || assembly_acc == "")
    return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
  # find assembly
  dotted <- if (grepl("\\.\\d+$", assembly_acc)) assembly_acc else paste0(assembly_acc, ".1")
  es <- tryCatch(entrez_search(db = "assembly", term = paste0(dotted, "[Assembly Accession]"), retmax = 1), error = function(e) NULL)
  if (is.null(es) || length(es$ids) == 0) {
    es <- tryCatch(entrez_search(db = "assembly", term = assembly_acc, retmax = 1), error = function(e) NULL)
  }
  if (is.null(es) || length(es$ids) == 0) {
    return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
  }
  sm <- tryCatch(entrez_summary(db = "assembly", id = es$ids[[1]]), error = function(e) NULL)
  if (is.null(sm)) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
  ftp <- sm$ftppath_genbank
  if (is.null(ftp) || ftp == "") ftp <- sm$ftppath_refseq
  if (is.null(ftp) || ftp == "") return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
  base <- basename(ftp)
  fna_url <- paste0(ftp, "/", base, "_genomic.fna.gz")
  # cache downloads
  fna_cache_dir <- file.path(cache_dir, "fna")
  if (!dir.exists(fna_cache_dir)) dir.create(fna_cache_dir, recursive = TRUE)
  dest <- file.path(fna_cache_dir, paste0(gsub("/", "_", base), "_genomic.fna.gz"))
  if (!file.exists(dest)) {
    ok <- tryCatch({ utils::download.file(fna_url, dest, mode = "wb", quiet = TRUE); TRUE }, error = function(e) FALSE)
    if (!ok) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = ftp))
  }
  # read and compute stats
  dna <- tryCatch(Biostrings::readDNAStringSet(dest), error = function(e) NULL)
  if (is.null(dna)) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = ftp))
  total_bp <- sum(Biostrings::width(dna))
  # letterFrequency can compute for all sequences; sum over them
  lf <- Biostrings::letterFrequency(dna, c("A","C","G","T"))
  atgc <- sum(lf)
  gc <- sum(lf[,"C"] + lf[,"G"])
  gc_pct <- ifelse(atgc > 0, 100 * gc / atgc, NA_real_)
  tibble(accession = assembly_acc, n_sequences = length(dna), genome_size_bp = as.numeric(total_bp), gc_percent_fasta = round(gc_pct, 3), ftp = ftp)
}

# Run online GC/size computation (cached FASTA files)
gcsize_cache_file <- file.path(cache_dir, "gc_size_meta.csv")
gcsize <- tibble()
accs <- unique(always_wrong_all$query_accession[!is.na(always_wrong_all$query_accession)])
cached2 <- if (file.exists(gcsize_cache_file)) suppressMessages(readr::read_csv(gcsize_cache_file, show_col_types = FALSE)) else tibble()
to_compute <- setdiff(accs, cached2$accession %>% unique() %||% character())
computed <- if (length(to_compute) > 0) purrr::map_dfr(to_compute, compute_gc_size_online) else tibble()
if (nrow(computed) > 0) {
  cached2 <- dplyr::bind_rows(cached2, computed) %>% dplyr::distinct(accession, .keep_all = TRUE)
  readr::write_csv(cached2, gcsize_cache_file)
}
gcsize <- cached2 %>% dplyr::filter(accession %in% accs)

# Combine
always_wrong_annot <- always_wrong_all %>%
  mutate(accession = query_accession) %>%
  dplyr::left_join(ncbi_meta, by = "accession") %>%
  dplyr::left_join(gcsize, by = "accession") %>%
  mutate(
    genome_size_bp = dplyr::coalesce(as.numeric(genome_size_bp), as.numeric(genome_size)),
    gc_percent_final = dplyr::coalesce(as.numeric(gc_percent_fasta), as.numeric(gc_percent)),
    meta_source = dplyr::case_when(
      !is.na(gc_percent_fasta) | !is.na(genome_size_bp) ~ "FASTA",
      (!is.na(genome_size) | !is.na(gc_percent)) ~ "NCBI_Summary",
      TRUE ~ NA_character_
    )
  ) %>%
  select(metric, source_genome, genome, wrong_type, n_tools_total,
         genome_size_bp, gc_percent_final, family, genus, species, accession, meta_source)

# Summary counts
summary_wrong <- always_wrong_annot %>%
  group_by(metric, source_genome, wrong_type) %>%
  summarise(n_genomes = dplyr::n_distinct(genome), .groups = 'drop') %>%
  arrange(metric, source_genome, wrong_type)

kable(summary_wrong, caption = "Counts of genomes unanimously misclassified (by metric, source, and type)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

Counts of genomes unanimously misclassified (by metric, source, and type)
metric	source_genome	wrong_type	n_genomes
breadth	reference	FP_all	21
counts	reference	FP_all	23
coverage	bin	FN_all	1
coverage	reference	FP_all	21
relative	reference	FP_all	42
rpkm	bin	FN_all	1
rpkm	reference	FP_all	27
tpm	bin	FN_all	2
tpm	reference	FP_all	24

# Split tables by source
ref_wrong <- always_wrong_annot %>% filter(source_genome == "reference")
bin_wrong <- always_wrong_annot %>% filter(source_genome == "bin")

ref_dt <- DT::datatable(ref_wrong,
                        caption = "Reference genomes misclassified by all tools (per metric)",
                        options = list(pageLength = 10, scrollX = TRUE),
                        rownames = FALSE)

bin_dt <- DT::datatable(bin_wrong,
                        caption = "Bin genomes misclassified by all tools (per metric)",
                        options = list(pageLength = 10, scrollX = TRUE),
                        rownames = FALSE)

# Ensure both widgets are rendered in HTML (avoid printing NULLs)
htmltools::tagList(ref_dt, bin_dt)

Correctness vs Genome Features

We would like to see if genomes that are always correctly classified by DA methods against always misclassified genomes.

# Build per-genome correctness summary per metric
correctness_long <- master_df %>%
  filter(!is.na(adjP)) %>%
  mutate(is_correct = case_when(
    da_call & is_DA ~ TRUE,   # TP
    !da_call & !is_DA ~ TRUE, # TN
    da_call & !is_DA ~ FALSE, # FP
    !da_call & is_DA ~ FALSE, # FN
    TRUE ~ NA
  )) %>%
  select(metric, genome, tool, is_correct, is_DA, source_genome = source_type)

correctness_by_genome <- correctness_long %>%
  group_by(metric, genome) %>%
  summarise(
    n_tools_total = n_distinct(tool),
    n_correct = sum(is_correct, na.rm = TRUE),
    n_wrong = sum(!is_correct, na.rm = TRUE),
    is_DA = dplyr::first(is_DA),
    source_genome = dplyr::first(source_genome),
    .groups = 'drop'
  ) %>%
  mutate(
    always_wrong = n_tools_total >= 2 & n_correct == 0,
    always_right = n_tools_total >= 2 & n_correct == n_tools_total,
    group = case_when(
      always_wrong ~ "always_wrong",
      always_right ~ "always_right",
      TRUE ~ "other"
    )
  )

# Keep only always_wrong vs always_right
comp_set <- correctness_by_genome %>% filter(group %in% c("always_wrong", "always_right"))

# Map genomes to accessions (for metadata fetch)
acc_map <- truth_table %>% select(genome, genome_accession, closest_genome) %>%
  mutate(closest_genome_dotted = ifelse(!is.na(closest_genome), sub("([0-9])$", ".\\1", closest_genome), NA_character_))

comp_set <- comp_set %>%
  dplyr::left_join(acc_map, by = "genome") %>%
  mutate(query_accession = dplyr::case_when(
    source_genome == "reference" & !is.na(genome_accession) ~ genome_accession,
    source_genome == "bin" & !is.na(closest_genome_dotted) ~ closest_genome_dotted,
    TRUE ~ genome_accession
  ))

# Fetch metadata for both groups using cached functions/utilities from previous chunk
cache_dir <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
cache_file <- file.path(cache_dir, "assembly_meta.csv")
gcsize_cache_file <- file.path(cache_dir, "gc_size_meta.csv")

accs_all <- unique(comp_set$query_accession[!is.na(comp_set$query_accession)])

# Assembly/taxonomy meta
meta_cached <- if (file.exists(cache_file)) suppressMessages(readr::read_csv(cache_file, show_col_types = FALSE)) else tibble()
to_fetch <- setdiff(accs_all, meta_cached$accession %||% character())
meta_new <- if (length(to_fetch) > 0) fetch_ncbi_meta(to_fetch) else tibble()
if (nrow(meta_new) > 0) {
  meta_cached <- dplyr::bind_rows(meta_cached, meta_new) %>% dplyr::distinct(accession, .keep_all = TRUE)
  readr::write_csv(meta_cached, cache_file)
}
ncbi_meta_all <- meta_cached %>% dplyr::filter(accession %in% accs_all)

# FASTA-derived GC/size
gc_cached <- if (file.exists(gcsize_cache_file)) suppressMessages(readr::read_csv(gcsize_cache_file, show_col_types = FALSE)) else tibble()
to_compute <- setdiff(accs_all, gc_cached$accession %||% character())
gc_new <- if (length(to_compute) > 0) purrr::map_dfr(to_compute, compute_gc_size_online) else tibble()
if (nrow(gc_new) > 0) {
  gc_cached <- dplyr::bind_rows(gc_cached, gc_new) %>% dplyr::distinct(accession, .keep_all = TRUE)
  readr::write_csv(gc_cached, gcsize_cache_file)
}
gcsize_all <- gc_cached %>% dplyr::filter(accession %in% accs_all)

# Attach metadata
comp_meta <- comp_set %>%
  mutate(accession = query_accession) %>%
  dplyr::left_join(ncbi_meta_all, by = c("accession")) %>%
  dplyr::left_join(gcsize_all, by = c("accession")) %>%
  mutate(
    genome_size_bp = dplyr::coalesce(as.numeric(genome_size_bp), as.numeric(genome_size)),
    gc_percent_final = dplyr::coalesce(as.numeric(gc_percent_fasta), as.numeric(gc_percent))
  )

# QC: counts per stratum
qc_groups <- comp_meta %>% dplyr::count(metric, source_genome, group, name = "n_genomes") %>% dplyr::arrange(metric, source_genome, group)
qc_table <- qc_groups %>%
  tidyr::pivot_wider(names_from = group, values_from = n_genomes, values_fill = 0)
qc_table %>% kable(caption = "Counts per metric/source for always_wrong vs always_right") %>% kable_styling(bootstrap_options = c("striped","hover","condensed"))

Counts per metric/source for always_wrong vs always_right
metric	source_genome	always_right	always_wrong
breadth	bin	4	1
breadth	reference	59	21
counts	bin	2	1
counts	reference	74	23
coverage	bin	2	2
coverage	reference	76	21
relative	bin	4	1
relative	reference	75	42
rpkm	bin	2	2
rpkm	reference	76	27
tpm	bin	2	3
tpm	reference	73	24

# PLOTS
library(ggplot2)

# 1) Genome size plot (log10 bp)
size_plot_data <- comp_meta %>% filter(!is.na(genome_size_bp)) %>% mutate(log10_size = log10(genome_size_bp))

# Wilcoxon rank-sum per facet (only where both groups exist)
suppressPackageStartupMessages({ library(rstatix); library(ggpubr) })
size_facets_ok <- size_plot_data %>%
  dplyr::count(metric, source_genome, group) %>%
  tidyr::pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
  filter(always_right > 0, always_wrong > 0) %>%
  select(metric, source_genome)
size_pvals <- size_plot_data %>%
  dplyr::inner_join(size_facets_ok, by = c("metric","source_genome")) %>%
  group_by(metric, source_genome) %>%
  rstatix::wilcox_test(log10_size ~ group, ref.group = "always_right") %>%
  ungroup() %>%
  mutate(p.format = rstatix::p_format(p), group1 = "always_right", group2 = "always_wrong")
size_y <- size_plot_data %>% group_by(metric, source_genome) %>% summarise(y.position = max(log10_size, na.rm = TRUE) * 1.07, .groups = 'drop')
size_pvals <- size_pvals %>% dplyr::left_join(size_y, by = c("metric","source_genome"))

p_size <- ggplot(size_plot_data, aes(x = group, y = log10_size, fill = group)) +
  geom_violin(trim = TRUE, alpha = 0.7, color = NA) +
  geom_boxplot(width = 0.18, outlier.shape = NA, alpha = 0.9) +
  scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
  labs(title = "Genome size (log10 bp): always_wrong vs always_right", x = NULL, y = "log10(genome size bp)") +
  facet_grid(source_genome ~ metric, scales = "free_y") +
  theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))
p_size <- p_size + ggpubr::stat_pvalue_manual(size_pvals, label = "p = {p.format}",
                                 xmin = "group1", xmax = "group2",
                                 y.position = "y.position", tip.length = 0.01, size = 3,
                                 inherit.aes = FALSE)
print(p_size)

# 2) GC% plot
gc_plot_data <- comp_meta %>% filter(!is.na(gc_percent_final))

# Wilcoxon rank-sum per facet (only where both groups exist)
gc_facets_ok <- gc_plot_data %>%
  dplyr::count(metric, source_genome, group) %>%
  tidyr::pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
  filter(always_right > 0, always_wrong > 0) %>%
  select(metric, source_genome)
gc_pvals <- gc_plot_data %>%
  dplyr::inner_join(gc_facets_ok, by = c("metric","source_genome")) %>%
  group_by(metric, source_genome) %>%
  rstatix::wilcox_test(gc_percent_final ~ group, ref.group = "always_right") %>%
  ungroup() %>%
  mutate(p.format = rstatix::p_format(p), group1 = "always_right", group2 = "always_wrong")
gc_y <- gc_plot_data %>% group_by(metric, source_genome) %>% summarise(y.position = max(gc_percent_final, na.rm = TRUE) * 1.07, .groups = 'drop')
gc_pvals <- gc_pvals %>% dplyr::left_join(gc_y, by = c("metric","source_genome"))

p_gc <- ggplot(gc_plot_data, aes(x = group, y = gc_percent_final, fill = group)) +
  geom_violin(trim = TRUE, alpha = 0.7, color = NA) +
  geom_boxplot(width = 0.18, outlier.shape = NA, alpha = 0.9) +
  scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
  labs(title = "GC%: always_wrong vs always_right", x = NULL, y = "GC percent") +
  facet_grid(source_genome ~ metric) +
  theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))
p_gc <- p_gc + ggpubr::stat_pvalue_manual(gc_pvals, label = "p = {p.format}",
                              xmin = "group1", xmax = "group2",
                              y.position = "y.position", tip.length = 0.01, size = 3,
                              inherit.aes = FALSE)
print(p_gc)

# 3) Genus bar plots (percentage within group)
genus_data <- comp_meta %>% filter(!is.na(genus) & genus != "") %>%
  group_by(metric, source_genome, group, genus) %>%
  summarise(n = n(), .groups = 'drop') %>%
  group_by(metric, source_genome, group) %>%
  mutate(pct = 100 * n / sum(n)) %>% ungroup()

# Pick top 15 genera overall (within each metric+source across groups)
top_genera <- genus_data %>%
  group_by(metric, source_genome, genus) %>% summarise(total = sum(n), .groups = 'drop') %>%
  group_by(metric, source_genome) %>% slice_max(order_by = total, n = 15, with_ties = FALSE) %>% ungroup()

genus_top <- genus_data %>% dplyr::inner_join(top_genera %>% dplyr::select(metric, source_genome, genus), by = c("metric","source_genome","genus"))

p_genus <- ggplot(genus_top, aes(x = reorder(genus, pct), y = pct, fill = group)) +
  geom_col(position = position_dodge(width = 0.75)) +
  coord_flip() +
  scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
  labs(title = "Genus composition: always_wrong vs always_right", x = "Genus (top 15)", y = "Percent within group", fill = "Group") +
  facet_grid(source_genome ~ metric, scales = "free_y") +
  theme_minimal()
print(p_genus)

Observation: A couple of isolated significant tests appear (e.g. lower genome size in reference genomes under the TPM metric, GC% differences under coverage for bin genomes), but these do not persist across metrics. So no overall systematic trends. In the genus composition plots, the overall pattern is similar with the genome size and GC% results: most taxa appear in both “always right” and “always wrong” groups with similar prevalence, but there are few isolated enrichment.

Final Recommendation

Single Methods: metagenomeSeq, MaAsLin2, and ANCOM-BC2 consistently emerged as reliable performers.
Metrics: Counts remain the default across most methods, but alternative metrics (TPM, RPKM, relative, coverage, breadth) can generally be interchanged without introducing major bias. The key exception is ALDEx2, which fails on continuous metrics; otherwise, metric choice is safe.
Consensus Approach: All_tools approach provides the best overall performance. Using >50% or >75% consensus thresholds may provide better performance with more tools included in the analysis
Alpha: Lowering the significance cutoff reduces false discovery rates. For scenarios where false positives carry high cost, stricter thresholds (e.g., α = 0.001 or lower) can help reducing false positive counts (and thus controlling FDR).

Multi-Metric Differential Abundance Benchmarking (HOMD_Final)

Evaluation of DA Tools Across Six Abundance Metrics

Md Mobashir Rahman

2025-09-02