This analysis addresses the methodological question raised in ANCOM-BC Issue #196: Can normalized abundance data be effectively used with differential abundance tools designed for raw counts?
Many differential abundance (DA) tools were originally designed for raw read counts, where library-size differences are informative and are corrected internally (e.g. via size-factor estimation, library-size offsets, or gene-length terms). For genome-level metagenomic data, however, raw counts are dominated by genome length, so common practice is to normalise reads to TPM/RPKM or coverage before analysis. This raises a methodological question: does feeding such pre-normalised values into count-based DA tools change their behaviour and results?
This study evaluates 30 metric × method combinations (6 abundance metrics × 5 DA tools) using CAMISIM synthetic communities with known ground truth.
Simulation: three datasets were used. HOMD_Final: 200 genomes (80 truly differentially abundant, 120 truly non-differentially abundant), selected randomly from the HOMD dataset. HOMD2: 200 genomes (80 truly differentially abundant, 120 truly non-differentially abundant), also selected randomly from the HOMD dataset. mooh_representative: 200 genomes (80 truly differentially abundant, 120 truly non-differentially abundant), selected based on MetaPhlAn outputs.
Synthetic sample generation: CAMISIM generated 40 or 60 metagenomic samples per dataset (50% control, 50% treatment), depending on the replication setting.
Upstream processing: each sample was processed with the standard MAGICIAN workflow (repository).
Abundance Metrics: 6 metrics (counts, TPM, RPKM, relative abundance, breadth, coverage)
Coverage quantification: Genome-level abundance metrics were computed using CoverM (repository).
DA Methods: 5 tools (ALDEx2, ANCOM-BC2, metagenomeSeq, Wilcoxon, MaAsLin2) were run.
Evaluation: Binary classification and threshold-free metrics (F1-score, AUC-PR, specificity, sensitivity, etc.)
Note: These metric names correspond to the final output files generated by the MAGICIAN pipeline. The mapping from CoverM column names to output files is:
Read Count → counts, RPKM → rpkm, TPM → tpm, Relative Abundance (%) → relative, Covered Bases → breadth, Mean → coverage.
| Metric | Length normalisation | Depth normalisation | What the number means |
|---|---|---|---|
| counts | ✗ (raw reads) | ✗ | Number of read pairs whose best hit is this genome |
| relative % | ✗ | ✓ (values sum to 100 %) | Share of all mapped reads that belong to the genome |
| breadth | ✓ (divide by length) | ✗ | Fraction of genome bases with ≥ 1× read coverage |
| coverage (mean depth) | ✓ (divide by length) | ✗ * | Mean per-base depth across the genome |
| RPKM | ✓ (/ kb) | ✓ (/ 10⁶ reads) | Reads-per-kilobase-per-million mapped reads |
| TPM | ✓ (/ kb) | ✓ (rescaled to 10⁶) | Transcripts-per-million; similar to RPKM but sample-wise rescaled |
Metric categories used throughout this report:
• Non-normalized: counts
• Length-normalized: breadth, coverage
• Depth-normalized: relative (%)
• Length + Depth-normalized: RPKM, TPM
To avoid confusion, here’s the explicit mapping from CoverM’s internal column names to our final analysis files:
| CoverM Column Name | Final Output File | Metric Name Used in Analysis | Biological Interpretation |
|---|---|---|---|
| Read Count | counts.tsv | counts | Raw read pairs mapped to genome |
| RPKM | rpkm.tsv | RPKM | Reads per kilobase per million |
| TPM | tpm.tsv | TPM | Transcripts per million |
| Relative Abundance (%) | relative.tsv | relative | Percentage of total mapped reads |
| Covered Bases | breadth.tsv | breadth | Fraction of genome with ≥1× coverage |
| Mean | coverage.tsv | coverage | Mean read depth across genome |
Important: Throughout this analysis, “breadth” means the fraction of the genome covered (from CoverM’s Covered Bases), and “coverage” refers to mean depth (from CoverM’s Mean). This follows genomics terminology where breadth = presence/absence and coverage = abundance intensity. (See how the term “coverage” is used here: https://pmc.ncbi.nlm.nih.gov/articles/PMC4992084/)
* Coverage still increases with sequencing depth, but it no longer carries the multiplicative genome-length bias.
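To make the relationships in the table above concrete, here is a toy R sketch of how the length- and depth-normalised metrics derive from raw counts (illustrative numbers only; in the pipeline these values come directly from CoverM):
# Toy example: three genomes in one sample
counts    <- c(gA = 1200, gB = 300, gC = 4500)   # read pairs per genome
length_bp <- c(gA = 2e6,  gB = 5e6, gC = 3e6)    # genome lengths (bp)
relative <- 100 * counts / sum(counts)           # depth-normalised; sums to 100 %
rpk      <- counts / (length_bp / 1e3)           # reads per kilobase (length-normalised)
rpkm     <- rpk / (sum(counts) / 1e6)            # ... per million mapped reads
tpm      <- 1e6 * rpk / sum(rpk)                 # sample-wise rescaled so it sums to 1e6
round(rbind(relative, rpkm, tpm), 2)
# breadth (fraction of bases at >= 1x) and mean depth ("coverage") require the
# per-base pileup, so they cannot be derived from the count vector alone.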
The GTDB_taxonomy20250406.csv file was downloaded from
the https://homd.org/download/download section (Genomic Meta
Information::Batch Download GTDB (V226 taxonomy)). This CSV file
contains metadata for 8,489 oral human microbial genomes from the Human
Oral Microbiome Database (HOMD), including genome accession IDs, HOMD
taxonomy IDs, and full GTDB taxonomic lineages.
From this dataset, 200 random genomes were selected and downloaded
using the /home/mrahman/magician/download_genome.py script.
The script randomly samples genome accession IDs from the CSV, then
downloads the corresponding genomic FASTA files from NCBI’s Assembly
database. A random seed ensures reproducibility of the genome selection.
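As a hedged illustration of that selection logic (a hypothetical R equivalent; the actual selection and download are performed by download_genome.py):
set.seed(1234)                                          # illustrative seed; the script's actual seed may differ
homd_meta <- readr::read_csv("GTDB_taxonomy20250406.csv", show_col_types = FALSE)
selected  <- dplyr::slice_sample(homd_meta, n = 200)    # reproducible random draw of 200 genomes
# The matching genomic FASTA files are then fetched from NCBI Assembly and
# stored as compressed *.fna.gz files.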
The downloaded genome sequences are stored in the
data/HOMD/genomes directory as compressed FASTA files
(*.fna.gz). The following command was used to download the
genomes:
Data were downloaded from ENA project accession PRJEB50270 (study title: Critical Assessment of Metagenome Interpretation strain madness raw sequencing data; study name: CAMI 2 strain). The following command was used to download 40 of the 100 samples, to keep the computational cost reasonable:
mamba activate sra && get_strain_madness_reads.sh /local/mrahman/magician/data/strain_madness_reads 0 39
I then realized this dataset is not particularly suitable for our analysis because it is not geared toward DA analysis; for example, it has no associated metadata (e.g. clinical metadata, case-control status). I therefore decided to use the HOMD/mooh datasets for our analysis. A possible solution would have been, for example, to use SIMBA to add synthetic spike-ins for selected genomes and simulate case-control status, producing a dataset suitable for DA analysis. As a workaround, the Zeevi et al. dataset from the publication (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-024-03390-9) was included in the analysis, as it has well-described metadata and was already used for DA benchmarking in the Wirbel et al. study.
Baseline weights: sampled per genome from a Gamma(shape = 2, scale = 5) distribution to emulate long-tailed abundance distributions, as described in the CAMISIM documentation (a minimal sketch of the full design follows this list).
Differential abundance: a fixed number of genomes (e.g., 80) is selected at random and split evenly into up- and down-regulated sets; multiplicative fold changes are applied to reflect effect sizes.
Zero-sum constraint: only the DA subset is rescaled so that the sum of its abundances matches between groups. This preserves the relative mass of the non-DA background and avoids unwanted compositional shifts that could distort the DA methods.
Replication and depth: generates 10 controls and 10 treatments with r technical/biological replicates (e.g., r=2 → 40 samples). Library sizes are drawn uniformly from a range (e.g., 15–60M reads) to capture depth variability without confounding group effects.
CAMISIM inputs and truth: writes a sample_distributions.tsv (per-sample weights), a read_depth.txt for sequencing depth, and an “initial” truth_table.csv recording genome-level DA status and true log2FC for downstream evaluation.
Structure: rows are genomes; columns are per‑replicate samples; entries are integer weights interpreted by CAMISIM as “abundance weights”.
Group separation: control replicates use baseline weights; treatment replicates use the enriched weights (baseline modified by assigned folds), after zero‑sum balancing within the DA set.
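A minimal R sketch of this abundance design (illustrative parameter values; not the exact MAGICIAN implementation):
set.seed(42)
n_genomes <- 200; n_da <- 80
# Baseline weights: long-tailed Gamma(shape = 2, scale = 5)
baseline <- rgamma(n_genomes, shape = 2, scale = 5)
# Randomly choose DA genomes and split them evenly into up/down sets
da_idx   <- sample(n_genomes, n_da)
up_idx   <- da_idx[seq_len(n_da / 2)]
down_idx <- setdiff(da_idx, up_idx)
# Multiplicative fold changes (effect sizes here are illustrative)
fold <- rep(1, n_genomes)
fold[up_idx]   <- 2 ^ runif(length(up_idx),  0.5, 3)
fold[down_idx] <- 2 ^ -runif(length(down_idx), 0.5, 3)
enriched <- baseline * fold
# Zero-sum constraint: rescale only the DA subset so its total mass matches the baseline DA subset
enriched[da_idx] <- enriched[da_idx] * sum(baseline[da_idx]) / sum(enriched[da_idx])
# Realised per-genome log2 fold change, as recorded in the truth table
true_log2FC <- log2(enriched / baseline)
# Per-sample library sizes drawn uniformly (e.g., 15–60M read pairs)
lib_sizes <- round(runif(40, 15e6, 60e6))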
Reads are simulated (ART Illumina) and pooled into gzipped R1/R2 files per sample. FastQC assesses raw quality; BBDuk trims adapters and low-quality bases. metaSPAdes is used for metagenome assembly. BBMap aligns trimmed reads back to the assemblies; jgi_summarize_bam_contig_depths computes depth tables for binning.
MetaBAT2 clusters contigs into MAGs using depth and composition; dRep dereplicates and compares genomes (Bins/MAGs and references). CheckM quantifies completeness/contamination for bins and references.
CoverM maps paired reads against all genomes (MAGs + references) and exports six metrics: count, RPKM, TPM, relative abundance, covered_bases (“breadth”), and mean depth (“coverage”).
A script generates truth_table_with_bins.csv by integrating MAG bin annotations into the ground-truth.
This analysis benchmarks differential abundance (DA) methods across six abundance metrics using the HOMD_Final dataset with ground-truth labels. Features are harmonized and filtered; samples are aligned to metadata (control vs case). For each method–metric combination, per-feature BH-adjusted q-values (α = 0.05) define DA calls. Results are evaluated against the truth, separated by source (reference genomes vs recovered bins), reporting TP, FP, FN, TN, sensitivity, specificity, precision, FDR, accuracy, AUROC, AUPRC, and totals for later analysis.
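As a sketch of the scoring step (function and column names are illustrative, not the pipeline's exact code), the confusion-matrix metrics reported below reduce to:
score_calls <- function(da_call, is_DA) {
  # da_call: logical vector of DA calls (q <= 0.05); is_DA: ground-truth labels
  tp <- sum(da_call & is_DA);   fp <- sum(da_call & !is_DA)
  fn <- sum(!da_call & is_DA);  tn <- sum(!da_call & !is_DA)
  tibble::tibble(
    tp = tp, fp = fp, fn = fn, tn = tn,
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp),
    fdr         = fp / (tp + fp),
    accuracy    = (tp + tn) / length(da_call)
  )
}
# Example usage: score_calls(da_call = results$qval <= 0.05, is_DA = truth_table$is_DA)
# AUROC/AUPRC are threshold-free and are computed from the q-value ranking instead.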
This section details the data import, cleaning, and integration steps used to construct the analysis dataset.
The genomes and their distribution were as follows:
# Initial ground truth (reference genomes only) for display
truth_table_initial <- read_csv(file.path(data_dirs$HOMD_Final, 'design/truth_table.csv')) %>%
mutate(genome_accession = str_replace(genome, "\\.fna$", "")) %>%
mutate(genome = str_replace_all(genome_accession, "[^A-Za-z0-9]", "")) %>%
mutate(is_DA = case_when(
is.logical(is_DA) ~ is_DA,
str_to_lower(as.character(is_DA)) %in% c("true", "t", "1") ~ TRUE,
str_to_lower(as.character(is_DA)) %in% c("false", "f", "0") ~ FALSE,
TRUE ~ NA
)) %>%
distinct(genome, .keep_all = TRUE)
# Evaluation ground truth (includes bins) used to score methods
truth_table <- read_csv(file.path(data_dirs$HOMD_Final, 'design/truth_table_with_bins.csv')) %>%
mutate(genome_accession = str_replace(genome, "\\.fna$", "")) %>%
mutate(genome = str_replace_all(genome_accession, "[^A-Za-z0-9]", "")) %>%
mutate(is_DA = case_when(
is.logical(is_DA) ~ is_DA,
str_to_lower(as.character(is_DA)) %in% c("true", "t", "1") ~ TRUE,
str_to_lower(as.character(is_DA)) %in% c("false", "f", "0") ~ FALSE,
TRUE ~ NA
)) %>%
distinct(genome, .keep_all = TRUE)
cat("Initial reference-only DA genomes:", sum(truth_table_initial$is_DA), "\n")## Initial reference-only DA genomes: 80
## Initial reference-only non-DA genomes: 120
## Evaluation entries (with bins): 238
truth_table_initial %>%
group_by(is_DA) %>%
summarise(
n_genomes = n(),
mean_log2FC = mean(true_log2FC, na.rm = TRUE),
min_log2FC = min(true_log2FC, na.rm = TRUE),
max_log2FC = max(true_log2FC, na.rm = TRUE),
.groups = 'drop'
) %>%
kable(caption = "Initial Ground Truth Summary (reference genomes)", digits = 3) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| is_DA | n_genomes | mean_log2FC | min_log2FC | max_log2FC |
|---|---|---|---|---|
| FALSE | 120 | 0.000 | 0.000 | 0.000 |
| TRUE | 80 | -1.263 | -5.209 | 1.559 |
The CAMISIM simulation generated samples with varying sequencing depths to mimic realistic variability in sequencing conditions:
## Sequencing depth summary:
## - Total samples: 40
## - Range: 17,009,209 - 59,564,475 read pairs
## - Mean: 36,954,723 read pairs
## - Median: 36,887,326 read pairs
## - Coefficient of variation: 0.347
| Sample | Read Pairs |
|---|---|
| C01_rep01 | 24,407,595 |
| C01_rep02 | 38,628,974 |
| C02_rep01 | 22,368,809 |
| C02_rep02 | 22,465,809 |
| C03_rep01 | 52,633,693 |
In an ideal scenario, all of the original genomes would have been recovered as bins.
suppressPackageStartupMessages({
library(readxl)
})
summaries_dir <- file.path(data_dirs$HOMD_Final, "summaries")
# Helper: normalize IDs to match truth_table/truth_table_initial normalization
normalize_genome_id <- function(x) {
x %>%
str_replace("\\.fna$", "") %>%
str_replace_all("[^A-Za-z0-9]", "")
}
# List bin_summary files
bin_files <- list.files(summaries_dir, pattern = '^bin_summary_.*\\.xlsx$', full.names = TRUE)
# Read a single bin_summary and extract sample name
read_bin_summary <- function(path) {
sample_id <- basename(path) %>% str_replace("^bin_summary_", "") %>% str_replace("\\.xlsx$", "")
df <- suppressMessages(readxl::read_xlsx(path))
# Expect columns: bin_name, closest_genome
df %>%
janitor::clean_names() %>%
transmute(
sample = sample_id,
bin_name = as.character(bin_name),
closest_genome_raw = as.character(closest_genome),
closest_genome = normalize_genome_id(as.character(closest_genome))
) %>%
distinct(sample, closest_genome, .keep_all = TRUE)
}
if (length(bin_files) == 0) {
warning("No bin_summary_*.xlsx files found in ", summaries_dir)
bin_summaries <- tibble(sample = character(), bin_name = character(), closest_genome_raw = character(), closest_genome = character())
} else {
bin_summaries <- map_dfr(bin_files, read_bin_summary)
}
# Build presence/absence matrix genome × sample
all_samples <- sort(unique(bin_summaries$sample))
all_genomes <- sort(unique(c(truth_table_initial$genome, truth_table$genome, bin_summaries$closest_genome)))
recovery_long <- bin_summaries %>%
transmute(genome = closest_genome, sample, recovered = 1L) %>%
complete(genome = all_genomes, sample = all_samples, fill = list(recovered = 0L))
# Join truth covariates
truth_covars <- truth_table_initial %>%
select(genome, genome_accession, is_DA, true_log2FC, baseline_weight, enriched_weight, direction)
# Ensure plain tibbles to avoid SummarizedExperiment join methods
recovery_long <- tibble::as_tibble(recovery_long)
truth_covars <- tibble::as_tibble(truth_covars)
recovery_long <- recovery_long %>%
dplyr::left_join(truth_covars, by = "genome")
# Per-sample summaries
sample_coverage <- recovery_long %>%
group_by(sample) %>%
summarise(
n_recovered = sum(recovered, na.rm = TRUE),
n_total_truth = sum(!is.na(is_DA)),
prop_recovered = ifelse(n_total_truth > 0, n_recovered / n_total_truth, NA_real_),
.groups = 'drop'
)
# Split by is_DA/direction for stacked bars
sample_coverage_split <- recovery_long %>%
# Vectorized classification (avoid non-vectorized isTRUE/isFALSE)
mutate(
direction = tolower(as.character(direction)),
class = case_when(
is_DA == TRUE & direction == "up" ~ "DA_up",
is_DA == TRUE & direction == "down" ~ "DA_down",
is_DA == FALSE ~ "non_DA",
TRUE ~ "unknown"
)
) %>%
group_by(sample, class) %>%
summarise(n = sum(recovered, na.rm = TRUE), .groups = 'drop')
# Prepare stacked-bar data with total counts and readable count labels
sample_coverage_counts <- sample_coverage_split %>%
group_by(sample) %>%
mutate(total = sum(n),
frac = ifelse(total > 0, n / total, NA_real_),
count_label = ifelse(is.na(frac) | frac < 0.08, "", as.character(n))) %>%
ungroup()
# Quick sanity table
head(sample_coverage_split) %>% kable(caption = "Example of recovered counts per sample by DA class") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| sample | class | n |
|---|---|---|
| C01_rep01 | DA_down | 0 |
| C01_rep01 | DA_up | 3 |
| C01_rep01 | non_DA | 12 |
| C01_rep01 | unknown | 1 |
| C01_rep02 | DA_down | 0 |
| C01_rep02 | DA_up | 3 |
# Diagnostics: ensure is_DA is logical and mapping rate
diagnostic_map <- recovery_long %>%
summarise(
n_recovery_rows = n(),
n_with_truth = sum(!is.na(is_DA)),
prop_with_truth = n_with_truth / n_recovery_rows,
n_unknown_class = sum(is.na(is_DA) | is.na(direction)),
.groups = 'drop'
)
print(diagnostic_map)
## # A tibble: 1 × 4
## n_recovery_rows n_with_truth prop_with_truth n_unknown_class
## <int> <int> <dbl> <int>
## 1 9560 8000 0.837 1560
# Stacked by DA class
class_colors <- c(DA_up = "#2ca02c", DA_down = "#d62728", non_DA = "#7f7f7f", unknown = "#bcbd22")
ggplot(sample_coverage_counts, aes(x = reorder(sample, total), y = n, fill = class)) +
geom_col() +
geom_text(aes(label = count_label), position = position_stack(vjust = 0.5), size = 3, color = "black") +
# Total recovered genomes per sample printed just beyond the bar
geom_text(data = distinct(sample_coverage_counts, sample, total),
aes(x = reorder(sample, total), y = total, label = total),
inherit.aes = FALSE, hjust = -0.1, size = 3.2) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.10))) +
scale_fill_manual(values = class_colors) +
coord_flip(clip = "off") +
labs(title = "Recovered genomes per sample by DA class",
x = "Sample", y = "Recovered genomes", fill = "Class") +
theme_minimal() +
  theme(plot.margin = margin(5.5, 30, 5.5, 5.5))
# Per-genome recovery rate across samples
genome_recovery <- recovery_long %>%
group_by(genome) %>%
summarise(
recovery_rate = mean(recovered, na.rm = TRUE),
is_DA = first(is_DA),
direction = first(direction),
true_log2FC = first(true_log2FC),
baseline_weight = first(baseline_weight),
.groups = 'drop'
  )
Observation: Treatment samples consistently recover a higher total number of genomes (~21–24). Most of the additional recoveries are DA_up genomes whose abundance was increased in the CAMISIM sample distributions; only a few down-regulated genomes are recovered.
Controls, on the other hand, recover fewer genomes overall (~15–19) than treatment samples.
# Scatter + smooth: baseline_weight vs recovery_rate by sample group (Control/Treatment)
rec_by_group <- recovery_long %>%
mutate(group = case_when(
stringr::str_starts(sample, "C") ~ "Control",
stringr::str_starts(sample, "T") ~ "Treatment",
TRUE ~ "Unknown"
)) %>%
group_by(genome, group) %>%
summarise(recovery_rate = mean(recovered, na.rm = TRUE), .groups = 'drop') %>%
dplyr::left_join(truth_covars %>% tibble::as_tibble() %>% dplyr::select(genome, is_DA, direction, baseline_weight), by = "genome")
ggplot(rec_by_group %>% filter(!is.na(baseline_weight)),
aes(x = baseline_weight, y = recovery_rate, color = group)) +
geom_point(alpha = 0.6, position = position_jitter(width = 0, height = 0.01), size = 1.8) +
geom_smooth(method = "loess", se = TRUE) +
facet_wrap(~ direction, nrow = 1, scales = "free_x") +
scale_color_manual(values = c(Control = "#1f77b4", Treatment = "#ff7f0e", Unknown = "#7f7f7f")) +
labs(title = "Recovery rate vs baseline_weight by group",
x = "Baseline weight", y = "Recovery rate", color = "Group") +
  theme_minimal()
library(ComplexHeatmap)
library(circlize)
# Prepare matrix (drop NAs to avoid invalid rownames)
rec_df <- recovery_long %>%
select(genome, sample, recovered) %>%
filter(!is.na(genome), !is.na(sample))
rec_mat <- rec_df %>%
pivot_wider(names_from = sample, values_from = recovered, values_fill = 0) %>%
filter(!is.na(genome)) %>%
column_to_rownames("genome") %>%
as.matrix()
# Row metadata (aligned to matrix order)
row_meta <- truth_covars %>%
filter(!is.na(genome), genome %in% rownames(rec_mat)) %>%
distinct(genome, .keep_all = TRUE) %>%
column_to_rownames("genome")
row_meta_aligned <- row_meta[rownames(rec_mat), , drop = FALSE]
row_meta_aligned$direction <- replace_na(tolower(row_meta_aligned$direction), "none")
# Row class used for splitting
row_class <- case_when(
row_meta_aligned$is_DA %in% TRUE & row_meta_aligned$direction == "up" ~ "DA_up",
row_meta_aligned$is_DA %in% TRUE & row_meta_aligned$direction == "down" ~ "DA_down",
row_meta_aligned$is_DA %in% FALSE ~ "non_DA",
TRUE ~ "unknown"
)
row_class <- factor(row_class, levels = c("DA_up", "DA_down", "non_DA", "unknown"))
# Row prevalence to order genomes within each split
row_prev <- rowSums(rec_mat, na.rm = TRUE) / ncol(rec_mat)
row_order_vec <- order(as.integer(row_class), -row_prev)
# Top (column) annotations: Control/Treatment/Unknown by sample name
sample_groups <- case_when(
stringr::str_starts(colnames(rec_mat), "C") ~ "Control",
stringr::str_starts(colnames(rec_mat), "T") ~ "Treatment",
TRUE ~ "Unknown"
)
top_anno <- HeatmapAnnotation(
group = factor(sample_groups, levels = c("Control", "Treatment", "Unknown")),
col = list(group = c(Control = "#1f77b4", Treatment = "#ff7f0e", Unknown = "#7f7f7f")),
annotation_name_side = "left"
)
# Right (row) annotations: selected covariates
ha_row <- rowAnnotation(
log2FC = row_meta_aligned$true_log2FC,
baseline = row_meta_aligned$baseline_weight,
col = list(
log2FC = circlize::colorRamp2(c(min(row_meta_aligned$true_log2FC, na.rm=TRUE), 0,
max(row_meta_aligned$true_log2FC, na.rm=TRUE)),
c("#2166AC", "#F7F7F7", "#B2182B")),
baseline = circlize::colorRamp2(range(row_meta_aligned$baseline_weight, na.rm=TRUE),
c("#f7f7f7", "#636363"))
)
)
# Binary color map for presence/absence using requested palette
# Absent = Modern Red (#E63946), Present = Modern Green (#2A9D8F)
hcols <- circlize::colorRamp2(c(0, 1), c("#E63946", "#2A9D8F"))
Heatmap(
rec_mat,
name = "presence",
col = hcols,
heatmap_legend_param = list(at = c(0, 1), labels = c("absent", "present"), color_bar = "discrete"),
use_raster = TRUE,
raster_quality = 2,
show_row_names = FALSE,
show_column_names = TRUE,
column_title = "Samples",
row_title = "Genomes",
right_annotation = ha_row,
top_annotation = top_anno,
column_split = factor(sample_groups, levels = c("Control", "Treatment", "Unknown")),
cluster_columns = TRUE,
row_split = row_class,
# Order genomes: first by class, then by prevalence within class
row_order = row_order_vec
)
We would like to see whether any genus contributes disproportionately to the recovered genomes, so we first identify taxon names for the recovered genomes (bins).
suppressPackageStartupMessages({
library(rentrez)
library(xml2)
})
# Genomes recovered at least once and their counts across samples
recovered_counts <- recovery_long %>%
group_by(genome) %>%
summarise(count = sum(recovered, na.rm = TRUE), .groups = 'drop') %>%
filter(count > 0) %>%
dplyr::left_join(truth_covars %>% tibble::as_tibble() %>% dplyr::select(genome, genome_accession), by = 'genome')
# Helper to get family/genus/species via NCBI (assembly -> taxid -> taxonomy)
get_taxonomy_for_accession <- function(acc) {
if (is.na(acc) || acc == "") return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
out <- tryCatch({
es <- rentrez::entrez_search(db = 'assembly', term = acc, retmax = 1)
if (length(es$ids) == 0) return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
sm <- rentrez::entrez_summary(db = 'assembly', id = es$ids[[1]])
txid <- sm$taxid
if (is.null(txid) || is.na(txid)) return(tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
xml_txt <- rentrez::entrez_fetch(db = 'taxonomy', id = txid, rettype = 'xml')
doc <- xml2::read_xml(xml_txt)
family <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='family']/ScientificName"))
genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
species <- xml2::xml_text(xml2::xml_find_first(doc, "//Taxon/ScientificName"))
tibble(family = ifelse(length(family)==0, NA, family),
genus = ifelse(length(genus)==0, NA, genus),
species = ifelse(length(species)==0, NA, species))
}, error = function(e) tibble(family = NA_character_, genus = NA_character_, species = NA_character_))
out
}
unique_acc <- recovered_counts$genome_accession %>% unique()
tax_lookup <- map_dfr(unique_acc, ~{
res <- get_taxonomy_for_accession(.x)
mutate(res, genome_accession = .x)
})
recovered_tax <- recovered_counts %>%
dplyr::left_join(tax_lookup, by = 'genome_accession') %>%
arrange(desc(count))
# Display table: row names = genome accession
recovered_tax_table <- recovered_tax %>%
transmute(`genome` = coalesce(genome_accession, genome), count, family, genus, species)
recovered_tax_table %>%
dplyr::rename(`Count (samples)` = count) %>%
kable(caption = 'Recovered genomes with NCBI taxonomy', align = 'lrrrr') %>%
  kable_styling(bootstrap_options = c('striped','hover','condensed'), full_width = FALSE)
| genome | Count (samples) | family | genus | species |
|---|---|---|---|---|
| GCA_002763575.1 | 40 | Prevotellaceae | Prevotella | Prevotella intermedia |
| GCA_009377295.1 | 40 | Paenibacillaceae | Paenibacillus | Paenibacillus glucanolyticus |
| GCA_016775375.1 | 40 | Peptostreptococcaceae | Peptostreptococcus | Peptostreptococcus stomatis |
| GCA_016861285.1 | 40 | Pasteurellaceae | Haemophilus | Haemophilus influenzae |
| GCA_017598955.1 | 40 | Micrococcaceae | Kocuria | Kocuria rhizophila |
| GCA_938023715.1 | 40 | Prevotellaceae | Prevotella | uncultured Prevotella sp. |
| NA | 40 | NA | NA | NA |
| GCA_002091815.1 | 39 | Pseudomonadaceae | Ectopseudomonas | Pseudomonas oleovorans subsp. oleovorans NBRC 13583 = DSM 1045 |
| GCA_000195125.1 | 37 | Streptococcaceae | Streptococcus | Streptococcus sanguinis SK1058 |
| GCA_020529565.1 | 35 | Streptococcaceae | Streptococcus | Streptococcus mutans |
| GCA_001597385.1 | 23 | Fusobacteriaceae | Fusobacterium | Fusobacterium necrophorum subsp. funduliforme |
| GCA_016127175.1 | 22 | Veillonellaceae | Veillonella | Veillonella parvula |
| GCA_937889145.1 | 22 | Prevotellaceae | Prevotella | Prevotella pallens |
| GCA_000160335.2 | 20 | Staphylococcaceae | Staphylococcus | Staphylococcus aureus subsp. aureus 55/2053 |
| GCA_000180075.1 | 20 | Streptococcaceae | Streptococcus | Streptococcus vestibularis F0396 |
| GCA_004136195.1 | 20 | Propionibacteriaceae | Cutibacterium | Cutibacterium acnes DSM 1897 |
| GCA_021309225.1 | 20 | Pasteurellaceae | Aggregatibacter | Aggregatibacter actinomycetemcomitans |
| GCA_023110145.1 | 20 | Streptococcaceae | Streptococcus | Streptococcus vestibularis |
| GCA_023314415.1 | 20 | Enterobacteriaceae | Kluyvera | Kluyvera ascorbata |
| GCA_927911685.1 | 20 | Actinomycetaceae | Actinomyces | uncultured Actinomyces sp. |
| GCA_958455085.1 | 20 | Veillonellaceae | Dialister | Dialister invisus |
| GCA_000016825.1 | 18 | Lactobacillaceae | Limosilactobacillus | Limosilactobacillus reuteri subsp. reuteri |
| GCA_003433955.1 | 18 | Peptoniphilaceae | Anaerococcus | Anaerococcus nagyae |
| GCA_030180395.1 | 17 | Fusobacteriaceae | Fusobacterium | Fusobacterium necrophorum |
| GCA_001866575.1 | 16 | Propionibacteriaceae | Cutibacterium | Cutibacterium avidum |
| GCA_021309315.1 | 13 | Pasteurellaceae | Aggregatibacter | Aggregatibacter actinomycetemcomitans |
| GCA_008726565.1 | 12 | Streptococcaceae | Streptococcus | Streptococcus anginosus |
| GCA_022569235.1 | 12 | Staphylococcaceae | Staphylococcus | Staphylococcus lugdunensis |
| GCA_014962425.1 | 8 | Staphylococcaceae | Staphylococcus | Staphylococcus lugdunensis |
| GCA_030229925.1 | 8 | Corynebacteriaceae | Corynebacterium | Corynebacterium propinquum |
| GCA_001053555.1 | 7 | Corynebacteriaceae | Corynebacterium | Corynebacterium propinquum |
| GCA_008085135.1 | 7 | Pasteurellaceae | Aggregatibacter | Aggregatibacter actinomycetemcomitans |
| GCA_031158635.1 | 7 | Propionibacteriaceae | Cutibacterium | Cutibacterium avidum |
| GCA_028325975.1 | 6 | Streptococcaceae | Streptococcus | Streptococcus parasanguinis |
| GCA_002212995.1 | 5 | Streptococcaceae | Streptococcus | Streptococcus mutans |
| GCA_003856455.1 | 4 | Staphylococcaceae | Staphylococcus | Staphylococcus epidermidis |
| GCA_900636255.1 | 3 | Staphylococcaceae | Staphylococcus | Staphylococcus warneri |
| GCA_003596365.3 | 2 | Staphylococcaceae | Staphylococcus | Staphylococcus haemolyticus |
| GCA_011038575.1 | 1 | Staphylococcaceae | Staphylococcus | Staphylococcus epidermidis |
We then aggregate the recovered genomes at the genus level and plot the counts as a bar chart.
# Plot: top 15 genera by number of recovered genomes
top_genera <- recovered_tax %>%
dplyr::count(genus, sort = TRUE, name = 'n_genomes') %>%
filter(!is.na(genus)) %>%
slice_head(n = 15)
ggplot(top_genera, aes(x = reorder(genus, n_genomes), y = n_genomes)) +
geom_col(fill = '#2A9D8F') +
coord_flip() +
labs(title = 'Top genera among recovered genomes', x = 'Genus', y = 'Recovered genomes (unique)') +
  theme_minimal()
suppressPackageStartupMessages({
library(tidyverse)
library(mia)
library(miaViz)
library(S4Vectors)
library(SummarizedExperiment)
library(TreeSummarizedExperiment)
library(rentrez)
library(xml2)
})
pal_amr <- c(
"#9C81ab", "#81ba9c", "gold1", "skyblue2", "#e65d5f", "#FDBF6F",
"#CAD2D6", "#FB9A99", "maroon", "steelblue4", "dodgerblue2", "#D48B28",
"#6A9DA9", "khaki2", "#B5D39D", "#42B93E", "#B3E3AA", "#A9FDAC",
"#32A27f", "#38BEA5", "#A3AB28", "#28C99f", "#FED1BC", "#FE654F",
"#49C8E1", "#BB39A3", "grey", "pink",
"#177B94", "#FF707E", "#2AC02C", "#D62728", "#9478DD", "#BC54AB",
"#E377C2", "#7F7F7F", "#BCBD22", "#17BECF",
"#6B4226", "#FFA70A", "#FF4540", "#CD5C5C", "#25EB57", "#66CDAA",
"#4682B4", "#1E90FF", "#C71585", "#7AD7D6",
"#FFD49F", "#8BD870", "#B0B83F", "#A52A2A", "#6495ED", "#9932CC",
"#FF1493", "#ADFF2F", "#40E0D0", "#C0C0C0",
"#FFA8B9", "#F4CA60", "#D8251E", "#0B53FF", "#FFA500", "#FF6370",
"#DC143C", "#8A2BE2", "#5F9EA0", "#00FA9A",
"#FFD700", "#4DBDFF", "#708090", "#00FFFF", "#FF4500", "#7FFFD4",
"#AD0F5E", "#20B2AA", "#708090", "#8FBC8F",
"#0000FF", "#808000", "#228B22", "#C0C0C0", "#808080", "#008000",
"#FF8C00", "#7C1353", "#000000", "#A9A9A9",
"#FF4500", "#822222", "#FF2222", "#00FF00", "#800080", "#FF00FF",
"#E9967A", "#FF69B4", "#FF69B4", "#EE82EE",
"#FFB6C1", "#FAFEEE", "#F0E68C", "#9FB98F", "#ADD8E6", "#87CEEB",
"#FFDEAD", "#DDA0DD", "#B0E0E6"
)
# Input: remapped relative abundances (rows = genomes or bins, cols = samples)
abund_file <- file.path(data_dirs$HOMD_Final, "aggregated_metrics_remapped", "remapped_relative.tsv")
stopifnot(file.exists(abund_file))
abund <- readr::read_tsv(abund_file, show_col_types = FALSE)
stopifnot("Genome" %in% colnames(abund))
rownms <- abund$Genome
mat <- as.matrix(abund %>% select(-Genome))
rownames(mat) <- rownms
# Identify source type and normalize accession IDs
source_type <- ifelse(grepl("^bin_", rownms), "recovered", "initial")
acc_undotted <- sub("^bin_", "", rownms)
acc <- ifelse(grepl("\\.\\d+$", acc_undotted), acc_undotted, sub("([0-9])$", ".\\1", acc_undotted))
row_meta <- S4Vectors::DataFrame(
Genome = rownms,
Source = source_type,
Accession_raw = acc_undotted,
Accession = acc,
row.names = rownms
)
# Build TSE and compute per-sample relative abundance (sums to 1)
tse_full <- TreeSummarizedExperiment::TreeSummarizedExperiment(
assays = list(counts = mat),
rowData = row_meta
)
tse_full <- mia::transformAssay(tse_full, assay.type = "counts", method = "relabundance")
# Fetch Genus via NCBI (rentrez) with caching
cache_dir_cc <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir_cc)) dir.create(cache_dir_cc, recursive = TRUE)
cache_map <- file.path(cache_dir_cc, "accession_to_genus.csv")
fetch_genus <- function(acc_vec) {
acc_vec <- unique(acc_vec[!is.na(acc_vec) & acc_vec != ""])
if (length(acc_vec) == 0) return(tibble(Accession = character(), Genus = character()))
purrr::map_dfr(acc_vec, function(acc_one) {
# Try exact assembly accession first
uid <- tryCatch(rentrez::entrez_search(db = "assembly",
term = paste0(acc_one, "[Assembly Accession]"),
retmax = 1)$ids,
error = function(e) NULL)
if (is.null(uid) || length(uid) == 0) {
uid <- tryCatch(rentrez::entrez_search(db = "assembly", term = acc_one, retmax = 1)$ids,
error = function(e) NULL)
}
if (!is.null(uid) && length(uid) > 0) {
sm <- tryCatch(rentrez::entrez_summary(db = "assembly", id = uid[1]), error = function(e) NULL)
txid <- tryCatch({
if (is.null(sm)) NA_character_ else as.character(if (!is.null(sm$taxid)) sm$taxid else sm$tax_id)
}, error = function(e) NA_character_)
if (!is.na(txid)) {
tx_xml <- tryCatch(rentrez::entrez_fetch(db = "taxonomy", id = txid, rettype = "xml"), error = function(e) NULL)
if (!is.null(tx_xml)) {
doc <- tryCatch(xml2::read_xml(tx_xml), error = function(e) NULL)
if (!is.null(doc)) {
genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
if (length(genus) == 0 || is.na(genus) || genus == "") genus <- NA_character_
return(tibble(Accession = acc_one, Genus = genus))
}
}
}
}
tibble(Accession = acc_one, Genus = NA_character_)
})
}
# Load cache and lookup what’s missing
cache_df <- if (file.exists(cache_map)) suppressMessages(readr::read_csv(cache_map, show_col_types = FALSE)) else tibble(Accession = character(), Genus = character())
to_lookup <- setdiff(unique(row_meta$Accession), cache_df$Accession)
if (length(to_lookup) > 0) {
# Optional: supply ENTREZ_KEY for higher rate limit
if (nzchar(Sys.getenv("ENTREZ_KEY"))) rentrez::set_entrez_key(Sys.getenv("ENTREZ_KEY"))
new_map <- fetch_genus(to_lookup)
cache_df <- dplyr::bind_rows(cache_df, new_map) %>% dplyr::distinct(Accession, .keep_all = TRUE)
readr::write_csv(cache_df, cache_map)
}
genus_map <- cache_df
row_genus <- genus_map$Genus[match(row_meta$Accession, genus_map$Accession)]
row_genus[is.na(row_genus) | row_genus == ""] <- "Unknown"
rowData(tse_full)$Genus <- row_genus
# Split by source
tse_initial <- tse_full[rowData(tse_full)$Source == "initial", ]
tse_recovered <- tse_full[rowData(tse_full)$Source == "recovered", ]
# Aggregate by Genus within each subset
tse_initial_g <- mia::agglomerateByVariable(tse_initial, by = "rows", f = "Genus", fun = sum, assay.type = "relabundance")
tse_recovered_g <- mia::agglomerateByVariable(tse_recovered, by = "rows", f = "Genus", fun = sum, assay.type = "relabundance")
# Determine shared top genera (union of top 10 from both)
top_init <- mia::getTop(tse_initial_g, top = 10, assay.type = "relabundance")
top_recv <- mia::getTop(tse_recovered_g, top = 10, assay.type = "relabundance")
top_union <- unique(c(top_init, top_recv))
# Collapse others into "Other"
rowData(tse_initial_g)$Genus_sub <- ifelse(rownames(tse_initial_g) %in% top_union, rownames(tse_initial_g), "Other")
rowData(tse_recovered_g)$Genus_sub <- ifelse(rownames(tse_recovered_g) %in% top_union, rownames(tse_recovered_g), "Other")
tse_initial_sub <- mia::agglomerateByVariable(tse_initial_g, by = "rows", f = "Genus_sub", fun = sum, assay.type = "relabundance")
tse_recovered_sub <- mia::agglomerateByVariable(tse_recovered_g, by = "rows", f = "Genus_sub", fun = sum, assay.type = "relabundance")
# Visualize side-by-side (stacked bar via ggplot, with ordering)
create_comp_plot <- function(tse_sub, title) {
M <- SummarizedExperiment::assay(tse_sub, "relabundance")
if (is.null(M)) M <- SummarizedExperiment::assay(tse_sub, 1)
cs <- colSums(M, na.rm = TRUE); cs[cs == 0] <- 1
M <- sweep(M, 2, cs, "/")
df <- as.data.frame(M) %>%
tibble::rownames_to_column("Genus") %>%
tidyr::pivot_longer(-Genus, names_to = "Sample", values_to = "Abundance")
totals <- df %>% group_by(Genus) %>% summarise(Total = sum(Abundance, na.rm = TRUE), .groups = 'drop')
levels_g <- totals %>% filter(Genus != "Other") %>% arrange(desc(Total)) %>% pull(Genus)
if ("Other" %in% df$Genus) levels_g <- c(levels_g, "Other")
df$Genus <- factor(df$Genus, levels = levels_g)
top_g <- dplyr::first(levels_g)
if (!is.na(top_g)) {
sample_order <- df %>% filter(Genus == top_g) %>% arrange(desc(Abundance)) %>% pull(Sample)
df$Sample <- factor(df$Sample, levels = unique(sample_order))
}
ggplot(df, aes(x = Sample, y = Abundance, fill = Genus)) +
geom_col(width = 0.9) +
scale_fill_manual(values = pal_amr) + # << apply palette
labs(title = title, x = "Samples", y = "Relative abundance") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
legend.position = "right")
}
p_init <- create_comp_plot(tse_initial_sub, "Initial (Reference) - Genus Composition")
p_recv <- create_comp_plot(tse_recovered_sub, "Recovered (Bins) - Genus Composition")
if (requireNamespace("patchwork", quietly = TRUE)) {
(p_init | p_recv)
} else {
gridExtra::grid.arrange(p_init, p_recv, ncol = 2)
}
There is a clear compositional shift, with recovered bins showing markedly altered and noisier genus abundances compared to the stable reference profiles.
# Load comprehensive summary results
da_results <- read_csv(file.path(results_input_dir, "comprehensive_summary_with_bins.csv"),
show_col_types = FALSE) %>%
mutate(
method = str_to_lower(method),
metric_type = factor(metric_type, levels = c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")),
source_type = factor(source_type, levels = c("all", "reference", "bin"))
)
# Display structure
cat("DA Results Summary:\n")## DA Results Summary:
## - Total rows: 78
## - Methods: ancombc2, aldex2, metagenomeseq, maaslin2, wilcoxon
## - Metrics: counts, coverage, breadth, rpkm, tpm, relative
## - Source types: all, reference, bin
# Quick overview
da_results %>%
dplyr::count(method, source_type, metric_type) %>%
kable(caption = "Results availability by method, source type, and metric") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| method | source_type | metric_type | n |
|---|---|---|---|
| aldex2 | all | counts | 1 |
| aldex2 | all | breadth | 1 |
| aldex2 | reference | counts | 1 |
| aldex2 | reference | breadth | 1 |
| aldex2 | bin | counts | 1 |
| aldex2 | bin | breadth | 1 |
| ancombc2 | all | counts | 1 |
| ancombc2 | all | tpm | 1 |
| ancombc2 | all | rpkm | 1 |
| ancombc2 | all | relative | 1 |
| ancombc2 | all | breadth | 1 |
| ancombc2 | all | coverage | 1 |
| ancombc2 | reference | counts | 1 |
| ancombc2 | reference | tpm | 1 |
| ancombc2 | reference | rpkm | 1 |
| ancombc2 | reference | relative | 1 |
| ancombc2 | reference | breadth | 1 |
| ancombc2 | reference | coverage | 1 |
| ancombc2 | bin | counts | 1 |
| ancombc2 | bin | tpm | 1 |
| ancombc2 | bin | rpkm | 1 |
| ancombc2 | bin | relative | 1 |
| ancombc2 | bin | breadth | 1 |
| ancombc2 | bin | coverage | 1 |
| maaslin2 | all | counts | 1 |
| maaslin2 | all | tpm | 1 |
| maaslin2 | all | rpkm | 1 |
| maaslin2 | all | relative | 1 |
| maaslin2 | all | breadth | 1 |
| maaslin2 | all | coverage | 1 |
| maaslin2 | reference | counts | 1 |
| maaslin2 | reference | tpm | 1 |
| maaslin2 | reference | rpkm | 1 |
| maaslin2 | reference | relative | 1 |
| maaslin2 | reference | breadth | 1 |
| maaslin2 | reference | coverage | 1 |
| maaslin2 | bin | counts | 1 |
| maaslin2 | bin | tpm | 1 |
| maaslin2 | bin | rpkm | 1 |
| maaslin2 | bin | relative | 1 |
| maaslin2 | bin | breadth | 1 |
| maaslin2 | bin | coverage | 1 |
| metagenomeseq | all | counts | 1 |
| metagenomeseq | all | tpm | 1 |
| metagenomeseq | all | rpkm | 1 |
| metagenomeseq | all | relative | 1 |
| metagenomeseq | all | breadth | 1 |
| metagenomeseq | all | coverage | 1 |
| metagenomeseq | reference | counts | 1 |
| metagenomeseq | reference | tpm | 1 |
| metagenomeseq | reference | rpkm | 1 |
| metagenomeseq | reference | relative | 1 |
| metagenomeseq | reference | breadth | 1 |
| metagenomeseq | reference | coverage | 1 |
| metagenomeseq | bin | counts | 1 |
| metagenomeseq | bin | tpm | 1 |
| metagenomeseq | bin | rpkm | 1 |
| metagenomeseq | bin | relative | 1 |
| metagenomeseq | bin | breadth | 1 |
| metagenomeseq | bin | coverage | 1 |
| wilcoxon | all | counts | 1 |
| wilcoxon | all | tpm | 1 |
| wilcoxon | all | rpkm | 1 |
| wilcoxon | all | relative | 1 |
| wilcoxon | all | breadth | 1 |
| wilcoxon | all | coverage | 1 |
| wilcoxon | reference | counts | 1 |
| wilcoxon | reference | tpm | 1 |
| wilcoxon | reference | rpkm | 1 |
| wilcoxon | reference | relative | 1 |
| wilcoxon | reference | breadth | 1 |
| wilcoxon | reference | coverage | 1 |
| wilcoxon | bin | counts | 1 |
| wilcoxon | bin | tpm | 1 |
| wilcoxon | bin | rpkm | 1 |
| wilcoxon | bin | relative | 1 |
| wilcoxon | bin | breadth | 1 |
| wilcoxon | bin | coverage | 1 |
# Filter for reference genomes only
reference_results <- da_results %>%
filter(source_type == "reference") %>%
select(method, metric_type, tp, fp, fn, tn, sensitivity, specificity, precision, fdr, accuracy, npv, auroc, auprc, called, total_features) %>%
arrange(method, metric_type)
# Create formatted table for reference results
reference_table <- reference_results %>%
mutate(
across(c(sensitivity, specificity, precision, accuracy, npv, auroc, auprc), ~ round(.x, 3)),
across(c(fdr), ~ round(.x, 3)),
tp = as.integer(tp),
fp = as.integer(fp),
fn = as.integer(fn),
tn = as.integer(tn),
called = as.integer(called),
total_features = as.integer(total_features)
) %>%
dplyr::rename(
Method = method,
Metric = metric_type,
TP = tp,
FP = fp,
FN = fn,
TN = tn,
Sensitivity = sensitivity,
Specificity = specificity,
Precision = precision,
FDR = fdr,
Accuracy = accuracy,
NPV = npv,
AUROC = auroc,
AUPRC = auprc,
Called = called,
Total = total_features
)
# Display interactive table with pagination
reference_dt <- reference_table %>%
DT::datatable(
caption = "Differential Abundance Results: Reference Genomes Only",
options = list(
pageLength = 10,
lengthMenu = c(5, 10, 15, 20, -1),
scrollX = TRUE,
scrollY = "400px",
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
),
extensions = c('Buttons', 'Scroller'),
filter = 'top',
rownames = FALSE
) %>%
DT::formatRound(columns = c('Sensitivity', 'Specificity', 'Precision', 'FDR', 'Accuracy', 'NPV', 'AUROC', 'AUPRC'), digits = 3)
reference_dt
# Filter for bins only
bins_results <- da_results %>%
filter(source_type == "bin") %>%
select(method, metric_type, tp, fp, fn, tn, sensitivity, specificity, precision, fdr, accuracy, npv, auroc, auprc, called, total_features) %>%
arrange(method, metric_type)
# Create formatted table for bins results
bins_table <- bins_results %>%
mutate(
across(c(sensitivity, specificity, precision, accuracy, npv, auroc, auprc), ~ round(.x, 3)),
across(c(fdr), ~ round(.x, 3)),
tp = as.integer(tp),
fp = as.integer(fp),
fn = as.integer(fn),
tn = as.integer(tn),
called = as.integer(called),
total_features = as.integer(total_features)
) %>%
dplyr::rename(
Method = method,
Metric = metric_type,
TP = tp,
FP = fp,
FN = fn,
TN = tn,
Sensitivity = sensitivity,
Specificity = specificity,
Precision = precision,
FDR = fdr,
Accuracy = accuracy,
NPV = npv,
AUROC = auroc,
AUPRC = auprc,
Called = called,
Total = total_features
)
# Display interactive table with pagination
bins_dt <- bins_table %>%
DT::datatable(
caption = "Differential Abundance Results: Bins Only",
options = list(
pageLength = 10,
lengthMenu = c(5, 10, 15, 20, -1),
scrollX = TRUE,
scrollY = "400px",
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
),
extensions = c('Buttons', 'Scroller'),
filter = 'top',
rownames = FALSE
) %>%
DT::formatRound(columns = c('Sensitivity', 'Specificity', 'Precision', 'FDR', 'Accuracy', 'NPV', 'AUROC', 'AUPRC'), digits = 3)
bins_dt
# Summary statistics for reference vs bins
summary_stats <- da_results %>%
filter(source_type %in% c("reference", "bin")) %>%
group_by(source_type, metric_type) %>%
summarise(
n_methods = n(),
mean_sensitivity = mean(sensitivity, na.rm = TRUE),
mean_specificity = mean(specificity, na.rm = TRUE),
mean_precision = mean(precision, na.rm = TRUE),
mean_auroc = mean(auroc, na.rm = TRUE),
mean_auprc = mean(auprc, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(
across(c(mean_sensitivity, mean_specificity, mean_precision, mean_auroc, mean_auprc), ~ round(.x, 3))
)
# Display interactive summary table
summary_dt <- summary_stats %>%
DT::datatable(
caption = "Summary Statistics: Reference vs Bins Performance",
options = list(
pageLength = 10,
lengthMenu = c(5, 10, 15, 20, -1),
scrollX = TRUE,
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print')
),
extensions = c('Buttons'),
filter = 'top',
rownames = FALSE
) %>%
DT::formatRound(columns = c('mean_sensitivity', 'mean_specificity', 'mean_precision', 'mean_auroc', 'mean_auprc'), digits = 3)
summary_dt
# 1) Long data
plot_data <- da_results %>%
filter(source_type %in% c("reference", "bin")) %>%
mutate(
f1_score = 2 * (precision * sensitivity) / (precision + sensitivity),
auc_pr = auprc
) %>%
select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
pivot_longer(c(f1_score, auc_pr, sensitivity, specificity),
names_to = "metric", values_to = "value") %>%
filter(!is.na(value))
metric_order <- c("counts","rpkm","relative","breadth","coverage","tpm")
tool_colors <- c(ancombc2="#E63946", aldex2="#F77F00", metagenomeseq="#2A9D8F",
wilcoxon="#264653", maaslin2="#7209B7")
# 2) Compute common y-limits per performance metric across BOTH sources
common_limits <- plot_data %>%
group_by(metric) %>%
summarise(lo = min(value, na.rm=TRUE), hi = max(value, na.rm=TRUE), .groups="drop") %>%
mutate(
# pad a touch so whiskers/points aren’t clipped
lo = pmax(0, lo - 0.02),
hi = pmin(1, hi + 0.02)
)
lims_for <- function(metric_name) {
with(common_limits[common_limits$metric==metric_name, ], c(lo, hi))
}
# 3) Subplot
create_subplot <- function(data, metric_name, title, y_label) {
ggplot(data %>% filter(metric == metric_name),
aes(x = factor(metric_type, levels = metric_order), y = value)) +
geom_boxplot(alpha = 0.7, outlier.shape = NA, fill="white", color="black") +
geom_jitter(aes(color = method), width = 0.2, alpha = 0.8, size = 2.5) +
scale_color_manual(values = tool_colors) +
scale_y_continuous(limits = lims_for(metric_name)) +
scale_x_discrete(drop = FALSE) + # <- keep all metrics even if absent in a subset
labs(title = title, x = "Abundance Metric", y = y_label, color = "Method") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom") +
coord_flip()
}
# 4) Make panels
ref_data <- plot_data %>% filter(source_type == "reference")
bins_data <- plot_data %>% filter(source_type == "bin")
p1_ref <- create_subplot(ref_data, "f1_score", "A) F1-Score Distribution by Metric (Reference)", "F1-Score")
p2_ref <- create_subplot(ref_data, "auc_pr", "B) AUC-PR Distribution by Metric (Reference)", "AUC-PR")
p3_ref <- create_subplot(ref_data, "sensitivity","C) Sensitivity Distribution by Metric (Reference)","Sensitivity")
p4_ref <- create_subplot(ref_data, "specificity","D) Specificity Distribution by Metric (Reference)","Specificity")
p1_bins <- create_subplot(bins_data, "f1_score", "A) F1-Score Distribution by Metric (Bins)", "F1-Score")
p2_bins <- create_subplot(bins_data, "auc_pr", "B) AUC-PR Distribution by Metric (Bins)", "AUC-PR")
p3_bins <- create_subplot(bins_data, "sensitivity","C) Sensitivity Distribution by Metric (Bins)","Sensitivity")
p4_bins <- create_subplot(bins_data, "specificity","D) Specificity Distribution by Metric (Bins)","Specificity")
gridExtra::grid.arrange(p1_ref, p2_ref, p3_ref, p4_ref, ncol = 2,
top = "Performance Metrics Distribution Across Abundance Metrics - Reference Genomes")gridExtra::grid.arrange(p1_bins, p2_bins, p3_bins, p4_bins, ncol = 2,
top = "Performance Metrics Distribution Across Abundance Metrics - Bins")
- Observation: Reference genomes show high sensitivity but low and variable specificity, with F1 and AUC-PR clustering around 0.6–0.8. Bin-level results are weaker, with reduced sensitivity and broader variability across abundance metrics.
# Prepare data for heatmaps
heatmap_data <- da_results %>%
filter(source_type %in% c("reference", "bin")) %>%
mutate(
# Calculate F1-Score from precision and sensitivity
f1_score = 2 * (precision * sensitivity) / (precision + sensitivity),
# Use existing metrics
auc_pr = auprc
) %>%
select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
# Ensure consistent metric order
mutate(metric_type = factor(metric_type, levels = c("counts", "rpkm", "relative", "breadth", "coverage", "tpm")))
# Function to create heatmap
create_heatmap <- function(data, metric_col, title, source_label) {
# Prepare matrix
heatmap_matrix <- data %>%
select(method, metric_type, !!sym(metric_col)) %>%
pivot_wider(names_from = metric_type, values_from = !!sym(metric_col)) %>%
column_to_rownames("method") %>%
as.matrix()
# Replace NAs with 0 for visualization
heatmap_matrix[is.na(heatmap_matrix)] <- 0
# Create heatmap
corrplot(heatmap_matrix,
method = "color",
is.corr = FALSE,
col = colorRampPalette(c("#053061", "#FFFFFF", "#67001F"))(200),
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.7,
title = paste(title, "-", source_label),
mar = c(0,0,2,0),
cl.cex = 0.8)
}
# Reference genomes heatmaps
ref_heatmap_data <- heatmap_data %>% filter(source_type == "reference")
par(mfrow = c(2, 2), mar = c(2, 2, 3, 2))
create_heatmap(ref_heatmap_data, "f1_score", "F1-Score Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "auc_pr", "AUC-PR Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "sensitivity", "Sensitivity Heatmap", "Reference Genomes")
create_heatmap(ref_heatmap_data, "specificity", "Specificity Heatmap", "Reference Genomes")# Bins heatmaps
bins_heatmap_data <- heatmap_data %>% filter(source_type == "bin")
par(mfrow = c(2, 2), mar = c(2, 2, 3, 2))
create_heatmap(bins_heatmap_data, "f1_score", "F1-Score Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "auc_pr", "AUC-PR Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "sensitivity", "Sensitivity Heatmap", "Bins")
create_heatmap(bins_heatmap_data, "specificity", "Specificity Heatmap", "Bins")There is no significant changes in performance for the metrics except for a few metrics where aldex2 produced NA result. We should focus more on the Specificity heatmap as FDR control (reduction of False Positives is the bigger problem). The methods can clearly can identify True Positives (sensitivity 1.0) but struggles with FDR control. In that regard, MetagenomeSeq (at reference) and ANCOMBC-2 (at bins) performs better than other methods.
# Define alpha values and corresponding directories
alpha_values <- c(0.05, 0.01, 0.005, 0.001, 0.00001)
alpha_dirs <- c("with_bins", "alpha_0.01", "alpha_0.005", "alpha_0.001", "alpha_0.00001")
# Function to load data for a specific alpha
load_alpha_data <- function(alpha_dir) {
file_path <- file.path("/home/mrahman/SIMBA/results/homd_final", alpha_dir, "comprehensive_summary_with_bins.csv")
if (file.exists(file_path)) {
data <- read_csv(file_path, show_col_types = FALSE) %>%
mutate(alpha_dir = alpha_dir)
return(data)
} else {
warning(paste("File not found:", file_path))
return(NULL)
}
}
# Load data for all alpha values
alpha_data_list <- map(alpha_dirs, load_alpha_data)
alpha_data_list <- alpha_data_list[!map_lgl(alpha_data_list, is.null)]
# Combine all data
combined_alpha_data <- bind_rows(alpha_data_list) %>%
mutate(
alpha_value = case_when(
alpha_dir == "with_bins" ~ 0.05,
alpha_dir == "alpha_0.01" ~ 0.01,
alpha_dir == "alpha_0.005" ~ 0.005,
alpha_dir == "alpha_0.001" ~ 0.001,
alpha_dir == "alpha_0.00001" ~ 0.00001,
TRUE ~ NA_real_
),
alpha_label = paste("α =", alpha_value)
) %>%
filter(!is.na(alpha_value))
# Prepare data for plotting
alpha_plot_data <- combined_alpha_data %>%
filter(source_type %in% c("reference", "bin")) %>%
select(method, metric_type, source_type, alpha_value, alpha_label, fdr, fp, tn) %>%
mutate(
method = str_to_title(method),
metric_type = factor(metric_type, levels = c("counts", "tpm", "rpkm", "relative", "breadth", "coverage"))
)
# Create summary statistics by alpha
alpha_summary <- alpha_plot_data %>%
group_by(alpha_value, alpha_label, source_type) %>%
summarise(
mean_fdr = mean(fdr, na.rm = TRUE),
median_fdr = median(fdr, na.rm = TRUE),
sd_fdr = sd(fdr, na.rm = TRUE),
mean_fp = mean(fp, na.rm = TRUE),
median_fp = median(fp, na.rm = TRUE),
n_methods = n(),
.groups = 'drop'
)
# Plot 1: FDR vs Alpha (line plot)
p1 <- ggplot(alpha_summary, aes(x = alpha_value, y = mean_fdr, color = source_type)) +
geom_line(size = 1.2) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = mean_fdr - sd_fdr/sqrt(n_methods),
ymax = mean_fdr + sd_fdr/sqrt(n_methods)),
width = 0.001, alpha = 0.7) +
scale_x_log10(breaks = alpha_values, labels = alpha_values) +
scale_color_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
labs(
title = "A) False Discovery Rate vs Alpha Value",
x = "Alpha Value (log scale)",
y = "Mean False Discovery Rate",
color = "Source Type"
) +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold")
)
# Plot 2: FDR vs Alpha (bar plot by method)
p2 <- ggplot(alpha_plot_data, aes(x = alpha_label, y = fdr, fill = source_type)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ method, scales = "free_y", ncol = 3) +
scale_fill_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
labs(
title = "B) FDR Distribution by Alpha Value and Method",
x = "Alpha Value",
y = "False Discovery Rate",
fill = "Source Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
plot.title = element_text(face = "bold")
)
# Plot 3: False Positives vs Alpha
p3 <- ggplot(alpha_summary, aes(x = alpha_value, y = mean_fp, color = source_type)) +
geom_line(size = 1.2) +
geom_point(size = 3) +
scale_x_log10(breaks = alpha_values, labels = alpha_values) +
scale_color_manual(values = c("reference" = "#E63946", "bin" = "#2A9D8F")) +
labs(
title = "C) False Positives vs Alpha Value",
x = "Alpha Value (log scale)",
y = "Mean False Positives",
color = "Source Type"
) +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold")
)
# Plot 4: FDR by Metric Type
p4 <- ggplot(alpha_plot_data, aes(x = alpha_label, y = fdr, fill = metric_type)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ source_type, ncol = 2) +
scale_fill_manual(values = metric_colors) +
labs(
title = "D) FDR by Alpha Value and Metric Type",
x = "Alpha Value",
y = "False Discovery Rate",
fill = "Metric Type"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "bottom",
plot.title = element_text(face = "bold")
)
# Arrange plots
gridExtra::grid.arrange(p1, p2, p3, p4, ncol = 2,
top = "Effect of Alpha Value on False Discovery Rate and False Positives")# Summary table
alpha_summary_table <- alpha_summary %>%
select(alpha_value, alpha_label, source_type, mean_fdr, mean_fp) %>%
pivot_wider(names_from = source_type, values_from = c(mean_fdr, mean_fp)) %>%
arrange(desc(alpha_value)) %>%
select(-alpha_value) %>%
dplyr::rename(
Alpha = alpha_label,
`Mean FDR (Reference)` = mean_fdr_reference,
`Mean FDR (Bins)` = mean_fdr_bin,
`Mean FP (Reference)` = mean_fp_reference,
`Mean FP (Bins)` = mean_fp_bin
)
alpha_summary_table %>%
kable(caption = "Summary of FDR and False Positives by Alpha Value",
digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Alpha | Mean FDR (Bins) | Mean FDR (Reference) | Mean FP (Bins) | Mean FP (Reference) |
|---|---|---|---|---|
| α = 0.01 | 0.4168 | 0.4800 | 7.5385 | 83.5385 |
| α = 0.005 | 0.3823 | 0.4749 | 6.6538 | 82.0769 |
| α = 0.001 | 0.3265 | 0.4539 | 5.1538 | 75.9231 |
| α = 1e-05 | 0.2128 | 0.3966 | 3.0769 | 60.1923 |
# Statistical analysis: Correlation between alpha and FDR
correlation_results <- alpha_plot_data %>%
group_by(source_type, method) %>%
summarise(
correlation = cor(alpha_value, fdr, use = "complete.obs"),
p_value = cor.test(alpha_value, fdr)$p.value,
.groups = 'drop'
) %>%
mutate(
significance = case_when(
p_value < 0.001 ~ "***",
p_value < 0.01 ~ "**",
p_value < 0.05 ~ "*",
TRUE ~ "ns"
)
)
cat("\nCorrelation Analysis (Alpha vs FDR):\n")##
## Correlation Analysis (Alpha vs FDR):
## =====================================
correlation_results %>%
kable(caption = "Correlation between Alpha and FDR by Method and Source Type",
digits = 4) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| source_type | method | correlation | p_value | significance |
|---|---|---|---|---|
| bin | Aldex2 | 0.7090 | 0.0489 | * |
| bin | Ancombc2 | 0.6301 | 0.0010 | *** |
| bin | Maaslin2 | 0.7335 | 0.0000 | *** |
| bin | Metagenomeseq | 0.6126 | 0.0041 | ** |
| bin | Wilcoxon | 0.7606 | 0.0000 | *** |
| reference | Aldex2 | 0.4564 | 0.2557 | ns |
| reference | Ancombc2 | 0.2836 | 0.1793 | ns |
| reference | Maaslin2 | 0.3429 | 0.1009 | ns |
| reference | Metagenomeseq | 0.3037 | 0.1491 | ns |
| reference | Wilcoxon | 0.4909 | 0.0148 | * |
# ---- Compact summary: A) Leaderboard + B) Sens–Spec ----
suppressPackageStartupMessages({
library(tidyverse)
library(ggrepel)
library(viridis)
library(scales)
library(patchwork)
})
required_cols <- c("method","metric_type","source_type","precision","sensitivity","specificity","auprc")
metric_levels <- c("counts","tpm","rpkm","relative","breadth","coverage")
# Prepare data
stats_data <- da_results %>%
filter(source_type %in% c("reference","bin")) %>%
mutate(
f1_score = 2 * (precision * sensitivity) / pmax(precision + sensitivity, .Machine$double.eps),
auc_pr = auprc
) %>%
select(method, metric_type, source_type, f1_score, auc_pr, sensitivity, specificity) %>%
mutate(
metric_type = factor(metric_type, levels = metric_levels),
f1_score = pmin(pmax(f1_score, 0), 1),
auc_pr = pmin(pmax(auc_pr, 0), 1),
sensitivity = pmin(pmax(sensitivity, 0), 1),
specificity = pmin(pmax(specificity, 0), 1)
) %>%
filter(!is.na(method), !is.na(metric_type), !is.na(source_type))
# Helpers
safe_z <- function(x){
m <- mean(x, na.rm = TRUE); s <- sd(x, na.rm = TRUE)
if (!is.finite(s) || s == 0) return(rep(0, length(x)))
(x - m) / s
}
placeholder_plot <- function(title_txt){
ggplot() +
annotate("text", x = 0, y = 0, label = paste0(title_txt, "\n(no valid data)"), size = 4) +
xlim(-1,1) + ylim(-1,1) + theme_void() +
theme(plot.title = element_text(face = "bold")) + labs(title = title_txt)
}
# A) Overall Leaderboard (composite of z(F1), z(AUCPR))
leader_input <- stats_data %>%
pivot_longer(c(f1_score, auc_pr), names_to = "measure", values_to = "val") %>%
group_by(metric_type, source_type, measure) %>%
mutate(z = safe_z(val)) %>%
ungroup() %>%
group_by(method) %>%
summarise(composite = mean(z, na.rm = TRUE), .groups = "drop") %>%
filter(is.finite(composite))
pA <- if (nrow(leader_input) > 0) {
ggplot(leader_input, aes(reorder(method, composite), composite)) +
geom_col(width = 0.7) +
coord_flip() +
labs(title = "A) Overall Leaderboard",
x = NULL, y = "Composite score (z-avg of F1 & AUCPR)") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
} else {
placeholder_plot("A) Overall Leaderboard")
}
# B) Sensitivity–Specificity Trade-off (avg across metrics), faceted by Source
trade <- stats_data %>%
group_by(method, source_type) %>%
summarise(
sens = mean(sensitivity, na.rm = TRUE),
spec = mean(specificity, na.rm = TRUE),
.groups = "drop"
) %>%
filter(is.finite(sens), is.finite(spec))
pB <- if (nrow(trade) > 0) {
ggplot(trade, aes(spec, sens, label = method)) +
geom_point(size = 2) +
ggrepel::geom_text_repel(size = 3, max.overlaps = 20, seed = 1) +
facet_wrap(~ source_type, drop = FALSE) +
labs(title = "B) Sensitivity–Specificity Trade-off",
x = "Specificity", y = "Sensitivity") +
coord_cartesian(xlim = c(0,1), ylim = c(0,1), expand = TRUE) +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"))
} else {
placeholder_plot("B) Sensitivity–Specificity Trade-off")
}
# Assemble (heatmap removed)
summary_plot <- (pA / pB) +
plot_layout(heights = c(1, 1)) &
theme(plot.margin = margin(6, 6, 6, 6))
print(summary_plot)

For this analysis, we implement four vote-based consensus strategies over the feature-level calls made by the individual tools (q ≤ 0.05): at least 2 tools, a majority (>50 %) of tools, at least 75 % of tools, and all tools.
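As a minimal sketch (toy votes, not pipeline objects), this is how the four rules translate tool votes into a consensus call for a genome flagged by three of five tools:

# Toy example of the four vote-based consensus rules implemented below
toy_calls <- c(ALDEx2 = TRUE, `ANCOM-BC2` = TRUE, metagenomeSeq = TRUE,
               MaAsLin2 = FALSE, Wilcoxon = FALSE)
n_total <- length(toy_calls)
n_da    <- sum(toy_calls)
c(consensus_2plus    = n_da >= 2,              # TRUE  (3 tools call DA)
  consensus_majority = n_da > n_total / 2,     # TRUE  (3 > 2.5)
  consensus_75plus   = n_da / n_total >= 0.75, # FALSE (0.6 < 0.75)
  consensus_all      = n_da == n_total)        # FALSE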
suppressPackageStartupMessages({ library(readr); library(purrr) })
# Define metrics and tools present in this run
all_metrics <- c("counts", "tpm", "rpkm", "relative", "breadth", "coverage")
tools_raw <- c("aldex2", "ancombc2", "metagenomeseq", "maaslin2", "wilcoxon")
tool_pretty_map <- c(
aldex2 = "ALDEx2",
ancombc2 = "ANCOM-BC2",
metagenomeseq = "metagenomeSeq",
maaslin2 = "MaAsLin2",
wilcoxon = "Wilcoxon"
)
# Normalization helper to match truth_table 'genome'
normalize_id <- function(x) stringr::str_replace_all(x, "[^A-Za-z0-9]", "")
# Build master_df of per-feature results across tools and metrics
read_tool_results <- function(metric, tool) {
path <- file.path(results_input_dir, metric, paste0(tool, "_results.csv"))
if (!file.exists(path)) return(NULL)
df <- suppressMessages(readr::read_csv(path, show_col_types = FALSE))
if (!all(c("feature", "qval") %in% names(df))) return(NULL)
df %>%
transmute(
genome_raw = as.character(feature),
genome = normalize_id(as.character(feature)),
adjP = as.numeric(qval),
direction = as.character(dplyr::coalesce(direction, NA_character_)),
tool = tool_pretty_map[[tool]],
metric = metric,
source_type = ifelse(stringr::str_starts(genome_raw, "bin_"), "bin", "reference")
)
}
master_df <- map_dfr(all_metrics, function(m) {
map_dfr(tools_raw, function(t) read_tool_results(m, t))
}) %>%
distinct()
# Join ground truth (truth_table already normalized earlier in this Rmd)
truth_lookup <- truth_table %>% dplyr::select(genome, is_DA) %>% tibble::as_tibble()
master_df <- tibble::as_tibble(master_df) %>% dplyr::left_join(truth_lookup, by = "genome")
# Derive per-tool DA calls at q ≤ 0.05 (user-confirmed)
alpha_q <- 0.05
master_df <- master_df %>% mutate(da_call = !is.na(adjP) & adjP <= alpha_q)
# Build consensus summaries per metric (over all source types combined)
all_consensus_summaries <- list()
for (current_metric in all_metrics) {
metric_calls <- master_df %>% filter(metric == current_metric)
# n_tools_total per genome is the number of tools that produced a result for that genome
consensus_summary <- metric_calls %>%
group_by(genome) %>%
summarise(
n_tools_total = dplyr::n_distinct(tool),
n_tools_da = sum(da_call, na.rm = TRUE),
prop_tools_da = ifelse(n_tools_total > 0, n_tools_da / n_tools_total, NA_real_),
is_DA = dplyr::first(is_DA),
.groups = 'drop'
) %>%
mutate(
consensus_2plus = n_tools_da >= 2,
consensus_majority = n_tools_da > (n_tools_total / 2),
consensus_75plus = n_tools_total > 0 & (n_tools_da / n_tools_total) >= 0.75,
consensus_all = n_tools_total > 0 & n_tools_da == n_tools_total
)
all_consensus_summaries[[current_metric]] <- consensus_summary
}
# Quick sanity: show coverage of tools per metric
tool_coverage <- master_df %>%
dplyr::count(metric, tool) %>% dplyr::arrange(metric, tool)
tool_coverage %>%
kable(caption = "Per-metric tool result counts (rows)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| metric | tool | n |
|---|---|---|
| breadth | ALDEx2 | 239 |
| breadth | ANCOM-BC2 | 239 |
| breadth | MaAsLin2 | 235 |
| breadth | Wilcoxon | 239 |
| breadth | metagenomeSeq | 239 |
| counts | ALDEx2 | 239 |
| counts | ANCOM-BC2 | 239 |
| counts | MaAsLin2 | 235 |
| counts | Wilcoxon | 239 |
| counts | metagenomeSeq | 239 |
| coverage | ANCOM-BC2 | 235 |
| coverage | MaAsLin2 | 235 |
| coverage | Wilcoxon | 235 |
| coverage | metagenomeSeq | 235 |
| relative | ANCOM-BC2 | 235 |
| relative | MaAsLin2 | 235 |
| relative | Wilcoxon | 235 |
| relative | metagenomeSeq | 235 |
| rpkm | ANCOM-BC2 | 235 |
| rpkm | MaAsLin2 | 235 |
| rpkm | Wilcoxon | 235 |
| rpkm | metagenomeSeq | 235 |
| tpm | ANCOM-BC2 | 239 |
| tpm | MaAsLin2 | 235 |
| tpm | Wilcoxon | 239 |
| tpm | metagenomeSeq | 239 |
suppressPackageStartupMessages({
library(tidyr); library(scales)
if (!requireNamespace("ggnewscale", quietly = TRUE)) {
# Optional install if not present; comment out for offline runs
# install.packages("ggnewscale")
}
library(ggnewscale)
})
# Get log2FC values for ordering
log2fc_data <- truth_table %>%
select(genome, true_log2FC) %>%
mutate(true_log2FC = replace_na(true_log2FC, 0))
for (current_metric in all_metrics) {
cat("\n\n#### Consensus Heatmap:", toupper(current_metric), "Metric\n\n")
# Prepare per-tool calls for current metric
current_metric_data <- master_df %>%
filter(metric == current_metric, !is.na(adjP)) %>%
select(genome, tool, adjP, is_DA)
individual_tools <- current_metric_data %>%
mutate(da_call = adjP <= alpha_q) %>%
select(genome, tool, da_call, is_DA) %>%
mutate(correct_call = case_when(
da_call & is_DA ~ "TP",
!da_call & !is_DA ~ "TN",
da_call & !is_DA ~ "FP",
!da_call & is_DA ~ "FN",
TRUE ~ NA_character_
)) %>%
select(genome, tool, da_call, correct_call, is_DA)
consensus_summary_current <- all_consensus_summaries[[current_metric]]
heatmap_data <- consensus_summary_current %>%
dplyr::left_join(log2fc_data, by = "genome") %>%
mutate(
consensus_2plus_correct = case_when(
consensus_2plus & is_DA ~ "TP",
!consensus_2plus & !is_DA ~ "TN",
consensus_2plus & !is_DA ~ "FP",
!consensus_2plus & is_DA ~ "FN"
),
consensus_majority_correct = case_when(
consensus_majority & is_DA ~ "TP",
!consensus_majority & !is_DA ~ "TN",
consensus_majority & !is_DA ~ "FP",
!consensus_majority & is_DA ~ "FN"
),
consensus_75plus_correct = case_when(
consensus_75plus & is_DA ~ "TP",
!consensus_75plus & !is_DA ~ "TN",
consensus_75plus & !is_DA ~ "FP",
!consensus_75plus & is_DA ~ "FN"
),
consensus_all_correct = case_when(
consensus_all & is_DA ~ "TP",
!consensus_all & !is_DA ~ "TN",
consensus_all & !is_DA ~ "FP",
!consensus_all & is_DA ~ "FN"
)
) %>%
arrange(desc(abs(true_log2FC)), desc(prop_tools_da), genome) %>%
mutate(genome_order = dplyr::row_number())
cat("**Data Summary:**\n")
cat("- Genomes analyzed:", nrow(heatmap_data), "\n")
cat("- Available tools:", paste(unique(individual_tools$tool), collapse = ", "), "\n\n")
# 1) Metadata block
metadata_long <- heatmap_data %>%
select(genome, genome_order, is_DA, n_tools_da, n_tools_total, prop_tools_da, true_log2FC) %>%
mutate(is_DA_numeric = as.numeric(is_DA), true_log2FC_abs = abs(true_log2FC)) %>%
select(-is_DA) %>%
pivot_longer(cols = -c(genome, genome_order), names_to = "column", values_to = "value") %>%
mutate(column_type = "metadata")
# 2) Individual tools block
tools_long <- individual_tools %>%
select(genome, tool, correct_call) %>%
dplyr::left_join(heatmap_data %>% dplyr::select(genome, genome_order), by = "genome") %>%
dplyr::rename(column = tool, value = correct_call) %>%
mutate(column_type = "individual_tool")
# 3) Consensus block
consensus_long <- heatmap_data %>%
select(genome, genome_order,
consensus_2plus_correct, consensus_majority_correct,
consensus_75plus_correct, consensus_all_correct) %>%
pivot_longer(cols = -c(genome, genome_order), names_to = "column", values_to = "value") %>%
mutate(
column = case_when(
column == "consensus_2plus_correct" ~ "≥2 Tools",
column == "consensus_majority_correct" ~ "Majority",
column == "consensus_75plus_correct" ~ "≥75%",
column == "consensus_all_correct" ~ "All Tools"
),
column_type = "consensus"
)
# Harmonize numeric encodings for fills
metadata_long <- metadata_long %>% mutate(value_numeric = as.numeric(value), value = as.character(value))
tools_long <- tools_long %>% mutate(
value_numeric = dplyr::case_when(
value == "TP" ~ 1, value == "TN" ~ 0.5, value == "FP" ~ -0.5, value == "FN" ~ -1, TRUE ~ 0
), value = as.character(value))
consensus_long <- consensus_long %>% mutate(
value_numeric = dplyr::case_when(
value == "TP" ~ 1, value == "TN" ~ 0.5, value == "FP" ~ -0.5, value == "FN" ~ -1, TRUE ~ 0
), value = as.character(value))
# Column order reflecting actual tools
plot_data <- dplyr::bind_rows(metadata_long, tools_long, consensus_long) %>%
mutate(
column_order = dplyr::case_when(
column == "is_DA_numeric" ~ 1,
column == "n_tools_da" ~ 2,
column == "n_tools_total" ~ 3,
column == "prop_tools_da" ~ 4,
column == "ALDEx2" ~ 5,
column == "ANCOM-BC2" ~ 6,
column == "metagenomeSeq" ~ 7,
column == "MaAsLin2" ~ 8,
column == "Wilcoxon" ~ 9,
column == "≥2 Tools" ~ 10,
column == "Majority" ~ 11,
column == "≥75%" ~ 12,
column == "All Tools" ~ 13,
column == "true_log2FC" ~ 14,
column == "true_log2FC_abs" ~ 15,
TRUE ~ 99
)
) %>%
arrange(column_order)
p_single <- plot_data %>%
ggplot(aes(x = reorder(column, column_order), y = reorder(genome, -genome_order))) +
geom_tile(aes(fill = ifelse(column_type == "metadata", value_numeric, NA)), color = "white", size = 0.1) +
scale_fill_gradient2(low = "white", mid = "lightblue", high = "darkblue", midpoint = 0, na.value = "transparent", name = "Numeric Value") +
{ if (requireNamespace("ggnewscale", quietly = TRUE)) ggnewscale::new_scale_fill() else NULL } +
geom_tile(data = dplyr::filter(plot_data, column_type %in% c("individual_tool", "consensus")),
aes(fill = value), color = "white", size = 0.1) +
scale_fill_manual(values = c("TP" = "#2d8f2d", "TN" = "#90EE90", "FP" = "#dc143c", "FN" = "#ffb6c1"), name = "Classification", na.value = "transparent") +
labs(title = paste("Comprehensive Consensus Analysis Heatmap:", stringr::str_to_title(current_metric), "Metric"),
subtitle = "Genome-level results across individual tools and consensus strategies",
x = "Analysis Components", y = "Genomes (ordered by |log2FC| and DA proportion)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
axis.text.y = element_text(size = 6),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11),
legend.position = "right", legend.box = "vertical", panel.grid = element_blank(), axis.ticks = element_blank()) +
geom_vline(xintercept = c(4.5, 9.5, 13.5), color = "black", size = 1, alpha = 0.8)
print(p_single)
cat("\n\n")
}

## #### Consensus Heatmap: COUNTS Metric
## **Data Summary:**
## - Genomes analyzed: 239
## - Available tools: ALDEx2, ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon
##
## #### Consensus Heatmap: TPM Metric
## **Data Summary:**
## - Genomes analyzed: 239
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon
##
## #### Consensus Heatmap: RPKM Metric
## **Data Summary:**
## - Genomes analyzed: 235
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon
##
## #### Consensus Heatmap: RELATIVE Metric
## **Data Summary:**
## - Genomes analyzed: 235
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon
##
## #### Consensus Heatmap: BREADTH Metric
## **Data Summary:**
## - Genomes analyzed: 239
## - Available tools: ALDEx2, ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon
##
## #### Consensus Heatmap: COVERAGE Metric
## **Data Summary:**
## - Genomes analyzed: 235
## - Available tools: ANCOM-BC2, metagenomeSeq, MaAsLin2, Wilcoxon
# ===========================
# All plots assembled as one A4-sized figure
# ===========================
suppressPackageStartupMessages({
library(tidyverse)
library(patchwork)
library(viridis)
library(scales)
})
# ---- Helper: safe reorder that tolerates all-NA gracefully ----
safe_reorder <- function(x, by) {
if (all(is.na(by))) return(factor(x, levels = unique(x)))
reorder(x, by)
}
# ---- Helper to compute confusion-matrix metrics from prediction/truth vectors ----
compute_metrics <- function(pred, truth) {
tp <- sum(pred & truth, na.rm = TRUE)
fp <- sum(pred & !truth, na.rm = TRUE)
fn <- sum(!pred & truth, na.rm = TRUE)
tn <- sum(!pred & !truth, na.rm = TRUE)
precision <- ifelse((tp + fp) > 0, tp / (tp + fp), NA_real_)
recall <- ifelse((tp + fn) > 0, tp / (tp + fn), NA_real_)
specificity <- ifelse((tn + fp) > 0, tn / (tn + fp), NA_real_)
f1 <- ifelse((precision + recall) > 0, 2 * precision * recall / (precision + recall), NA_real_)
tibble(tp = tp, fp = fp, fn = fn, tn = tn,
precision = precision, sensitivity = recall, specificity = specificity, f1_score = f1)
}
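# Illustration (toy vectors, not pipeline objects): with one TP, FP, FN and TN
# each, compute_metrics() returns precision = sensitivity = specificity = 0.5:
#   compute_metrics(pred  = c(TRUE,  TRUE, FALSE, FALSE),
#                   truth = c(TRUE, FALSE, TRUE,  FALSE))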
# ---- Consensus strategies and their display names ----
consensus_strategies <- c("consensus_2plus", "consensus_majority", "consensus_75plus", "consensus_all")
strategy_names <- c("≥2 tools", "Majority (>50%)", "≥75% tools", "All tools")
# ==== 1) Build consensus performance across metrics ====
# expects: `all_metrics` (character vec) and `all_consensus_summaries` (named list) to exist
all_consensus_performance <- purrr::map_dfr(all_metrics, function(m) {
cs <- all_consensus_summaries[[m]]
purrr::map_dfr(seq_along(consensus_strategies), function(i) {
flag <- consensus_strategies[i]
name <- strategy_names[i]
pred <- cs[[flag]]
truth <- cs$is_DA
mets <- compute_metrics(pred, truth)
mets %>%
mutate(metric = m, strategy = name, auprc = NA_real_) %>%
select(metric, strategy, f1_score, precision, sensitivity, specificity, auprc)
})
})
# ==== 2) Build individual tool performance per metric ====
# expects: `da_results` tibble with method, metric_type, source_type, precision, sensitivity, specificity, auprc
recode_methods <- function(x) dplyr::recode(x,
ancombc2 = "ANCOM-BC2",
aldex2 = "ALDEx2",
metagenomeseq = "metagenomeSeq",
maaslin2 = "MaAsLin2",
wilcoxon = "Wilcoxon",
.default = x
)
get_individual_performance <- function(metric_name) {
da_results %>%
filter(metric_type == metric_name, source_type %in% c("reference", "bin")) %>%
mutate(f1_score = 2 * (precision * sensitivity) / pmax(precision + sensitivity, .Machine$double.eps)) %>%
group_by(method) %>%
summarise(
f1_score = mean(f1_score, na.rm = TRUE),
precision = mean(precision, na.rm = TRUE),
sensitivity= mean(sensitivity,na.rm = TRUE),
specificity= mean(specificity,na.rm = TRUE),
auprc = mean(auprc, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(type = "Individual tool",
strategy = recode_methods(method)) %>%
select(strategy, f1_score, precision, sensitivity, specificity, auprc, type)
}
# ==== 3) One function that makes a 2-panel block (Sensitivity | Specificity) for a metric ====
make_metric_block <- function(metric_name) {
ind <- get_individual_performance(metric_name)
con <- all_consensus_performance %>%
filter(metric == metric_name) %>%
mutate(type = "Consensus") %>%
select(strategy, f1_score, precision, sensitivity, specificity, auprc, type)
combined <- bind_rows(con, ind) %>% mutate(metric = metric_name)
# Sensitivity
p_sens <- ggplot(
combined,
aes(x = safe_reorder(strategy, sensitivity), y = sensitivity, fill = type)
) +
geom_col(width = 0.7) +
scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
scale_fill_manual(values = c("Consensus" = "#e31a1c", "Individual tool" = "#1f78b4")) +
coord_flip() +
labs(title = paste("Sensitivity —", stringr::str_to_title(metric_name)),
x = NULL, y = "Sensitivity", fill = NULL) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold"),
axis.text.x = element_text())
# Specificity
p_spec <- ggplot(
combined,
aes(x = safe_reorder(strategy, specificity), y = specificity, fill = type)
) +
geom_col(width = 0.7) +
scale_y_continuous(labels = percent_format(accuracy = 1), limits = c(0, 1)) +
scale_fill_manual(values = c("Consensus" = "#e31a1c", "Individual tool" = "#1f78b4")) +
coord_flip() +
labs(title = paste("Specificity —", stringr::str_to_title(metric_name)),
x = NULL, y = "Specificity", fill = NULL) +
theme_minimal(base_size = 11) +
theme(legend.position = "bottom",
plot.title = element_text(face = "bold"),
axis.text.x = element_text())
# Return a 2-panel block for this metric
(p_sens | p_spec) + plot_layout(guides = "collect") & theme(legend.position = "bottom")
}
# ==== 4) Build all metric blocks and a cross-metric summary heatmap ====
metric_blocks <- purrr::map(all_metrics, make_metric_block)
# Cross-metric summary for consensus strategies only (visual, no table)
cross_metric_heat <- all_consensus_performance %>%
group_by(strategy) %>%
summarise(
F1 = mean(f1_score, na.rm = TRUE),
Precision = mean(precision, na.rm = TRUE),
Sensitivity = mean(sensitivity, na.rm = TRUE),
Specificity = mean(specificity, na.rm = TRUE),
.groups = "drop"
) %>%
pivot_longer(-strategy, names_to = "Measure", values_to = "Value")
p_heat <- ggplot(cross_metric_heat, aes(Measure, strategy, fill = Value)) +
geom_tile(color = "white", linewidth = 0.25) +
scale_fill_viridis(limits = c(0, 1), option = "C", guide = guide_colorbar(title.position = "top")) +
labs(title = "Cross-metric Consensus Summary (mean across metrics)",
x = "Measure", y = "Consensus strategy", fill = "Score") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1))
# ==== 5) Assemble EVERYTHING into ONE big figure with tags ====
# Arrange metric blocks vertically; add heatmap at the bottom
big_plot <- wrap_plots(metric_blocks, ncol = 1) /
p_heat +
plot_layout(heights = c(rep(1, length(metric_blocks)), 0.9)) +
plot_annotation(tag_levels = "A")
# Print (may require a larger pane); best to use ggsave below
print(big_plot)

The "All tools" consensus achieves the most balanced average performance, whereas the less stringent strategies can reach higher sensitivity at the cost of lower specificity (Panel M). Adding more DA methods to the vote could further increase performance.
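As the code comment notes, the assembled figure is best written to disk with `ggsave`; a minimal sketch (the file name and A4 dimensions are illustrative choices, not pipeline outputs):

# Save the assembled figure at A4 size; the file name here is illustrative
ggsave("consensus_summary_A4.pdf", big_plot,
       width = 210, height = 297, units = "mm", limitsize = FALSE)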
Some genomes are misclassified by every tool. We would like to see whether these genomes share properties (genus, genome size, GC content) that might drive their misclassification. We first identify the unanimously misclassified reference and bin genomes for each metric and then gather their properties.
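One of these properties, GC content, is computed further below from the downloaded FASTA as (G + C) / (A + C + G + T) over all contigs; a minimal sketch with toy sequences (not pipeline data):

# GC% of a toy two-contig assembly, mirroring the FASTA-based computation below
toy <- Biostrings::DNAStringSet(c(contig1 = "ATGCGC", contig2 = "GGTTAA"))
lf  <- Biostrings::letterFrequency(toy, c("A", "C", "G", "T"))
100 * sum(lf[, "C"] + lf[, "G"]) / sum(lf)   # 50 for this toy example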
library(rentrez)
library(xml2)
library(Biostrings)
# Identify genomes where all available tools were wrong (per metric)
always_wrong_list <- list()
for (current_metric in all_metrics) {
calls <- master_df %>%
filter(metric == current_metric) %>%
mutate(is_wrong = case_when(
da_call & !is_DA ~ TRUE, # FP
!da_call & is_DA ~ TRUE, # FN
TRUE ~ FALSE
))
per_genome <- calls %>%
group_by(genome) %>%
summarise(n_tools_total = dplyr::n_distinct(tool),
n_wrong = sum(is_wrong, na.rm = TRUE),
is_DA = dplyr::first(is_DA),
source_genome = dplyr::first(source_type),
.groups = 'drop') %>%
mutate(always_wrong = n_tools_total >= 2 & n_wrong == n_tools_total,
wrong_type = dplyr::case_when(
always_wrong & is_DA ~ "FN_all",
always_wrong & !is_DA ~ "FP_all",
TRUE ~ NA_character_
)) %>%
filter(always_wrong)
# Join truth_table for accessions and compute query_accession per source type
per_genome <- per_genome %>%
dplyr::left_join(truth_table %>% dplyr::select(genome, genome_accession, closest_genome), by = "genome") %>%
mutate(
closest_genome_dotted = ifelse(!is.na(closest_genome),
sub("([0-9])$", ".\\1", closest_genome), NA_character_),
query_accession = dplyr::case_when(
source_genome == "reference" & !is.na(genome_accession) ~ genome_accession,
source_genome == "bin" & !is.na(closest_genome_dotted) ~ closest_genome_dotted,
TRUE ~ genome_accession
),
metric = current_metric
)
always_wrong_list[[current_metric]] <- per_genome
}
always_wrong_all <- dplyr::bind_rows(always_wrong_list)
cat("Genomes with unanimous misclassification (by metric):", nrow(always_wrong_all), "\n")## Genomes with unanimous misclassification (by metric): 162
# Helper: extract first non-null field from esummary
get_first <- function(x, fields) {
for (f in fields) {
if (!is.null(x[[f]])) return(x[[f]])
}
return(NA)
}
# Fetch NCBI metadata for a vector of accessions
fetch_ncbi_meta <- function(acc_vec) {
acc_vec <- unique(acc_vec[!is.na(acc_vec)])
if (length(acc_vec) == 0) return(tibble())
purrr::map_dfr(acc_vec, function(acc) {
# Normalize accession to dotted form if missing (e.g., add .1)
dotted <- if (grepl("\\.\\d+$", acc)) acc else paste0(acc, ".1")
# Try exact assembly accession match first
uid <- tryCatch({ rentrez::entrez_search(db = "assembly", term = paste0(dotted, "[Assembly Accession]"))$ids }, error = function(e) NULL)
if (is.null(uid) || length(uid) == 0) {
# fallback: try the original string
uid <- tryCatch({ rentrez::entrez_search(db = "assembly", term = acc)$ids }, error = function(e) NULL)
}
if (!is.null(uid) && length(uid) > 0) {
sm <- tryCatch({ rentrez::entrez_summary(db = "assembly", id = uid[1]) }, error = function(e) NULL)
if (!is.null(sm)) {
# Extract basic fields
asm_acc <- get_first(sm, c("assemblyaccession", "assembly_accession"))
org <- get_first(sm, c("organism", "organism_name"))
gc <- suppressWarnings(as.numeric(get_first(sm, c("gcpercent", "gc_percent", "gc"))))
size <- suppressWarnings(as.numeric(get_first(sm, c("seq_length_sum", "totallength", "total_length"))))
taxid <- suppressWarnings(as.character(get_first(sm, c("taxid", "tax_id"))))
# Taxonomy via taxonomy DB if taxid available (prefer rank-aware XML)
fam <- genus <- species <- NA_character_
if (!is.na(taxid)) {
tx_xml <- tryCatch({ rentrez::entrez_fetch(db = "taxonomy", id = taxid, rettype = "xml") }, error = function(e) NULL)
if (!is.null(tx_xml)) {
doc <- tryCatch(xml2::read_xml(tx_xml), error = function(e) NULL)
if (!is.null(doc)) {
fam <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='family']/ScientificName"))
genus <- xml2::xml_text(xml2::xml_find_first(doc, "//LineageEx/Taxon[Rank='genus']/ScientificName"))
species <- xml2::xml_text(xml2::xml_find_first(doc, "//Taxon/ScientificName"))
if (identical(fam, character(0))) fam <- NA_character_
if (identical(genus, character(0))) genus <- NA_character_
if (identical(species, character(0))) species <- NA_character_
}
}
}
return(tibble(accession = acc, assembly_accession = asm_acc, organism = org,
genome_size = size, gc_percent = gc, family = fam,
genus = genus, species = species))
}
}
# Fallback: no metadata
tibble(accession = acc, assembly_accession = NA_character_, organism = NA_character_,
genome_size = NA_real_, gc_percent = NA_real_, family = NA_character_,
genus = NA_character_, species = NA_character_)
})
}
# Pull NCBI metadata with a small cache (assembly meta + taxonomy)
cache_dir <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
cache_file <- file.path(cache_dir, "assembly_meta.csv")
ncbi_meta <- tibble()
accs <- unique(always_wrong_all$query_accession[!is.na(always_wrong_all$query_accession)])
cached <- if (file.exists(cache_file)) suppressMessages(readr::read_csv(cache_file, show_col_types = FALSE)) else tibble()
to_fetch <- setdiff(accs, cached$accession %||% character())
fetched <- if (length(to_fetch) > 0) fetch_ncbi_meta(to_fetch) else tibble()
if (nrow(fetched) > 0) {
cached <- dplyr::bind_rows(cached, fetched) %>% dplyr::distinct(accession, .keep_all = TRUE)
readr::write_csv(cached, cache_file)
}
ncbi_meta <- cached %>% dplyr::filter(accession %in% accs)
# Compute GC% and genome size from FASTA (online) using Biostrings
compute_gc_size_online <- function(assembly_acc) {
if (is.na(assembly_acc) || assembly_acc == "")
return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
# find assembly
dotted <- if (grepl("\\.\\d+$", assembly_acc)) assembly_acc else paste0(assembly_acc, ".1")
es <- tryCatch(entrez_search(db = "assembly", term = paste0(dotted, "[Assembly Accession]"), retmax = 1), error = function(e) NULL)
if (is.null(es) || length(es$ids) == 0) {
es <- tryCatch(entrez_search(db = "assembly", term = assembly_acc, retmax = 1), error = function(e) NULL)
}
if (is.null(es) || length(es$ids) == 0) {
return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
}
sm <- tryCatch(entrez_summary(db = "assembly", id = es$ids[[1]]), error = function(e) NULL)
if (is.null(sm)) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
ftp <- sm$ftppath_genbank
if (is.null(ftp) || ftp == "") ftp <- sm$ftppath_refseq
if (is.null(ftp) || ftp == "") return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = NA_character_))
base <- basename(ftp)
fna_url <- paste0(ftp, "/", base, "_genomic.fna.gz")
# cache downloads
fna_cache_dir <- file.path(cache_dir, "fna")
if (!dir.exists(fna_cache_dir)) dir.create(fna_cache_dir, recursive = TRUE)
dest <- file.path(fna_cache_dir, paste0(gsub("/", "_", base), "_genomic.fna.gz"))
if (!file.exists(dest)) {
ok <- tryCatch({ utils::download.file(fna_url, dest, mode = "wb", quiet = TRUE); TRUE }, error = function(e) FALSE)
if (!ok) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = ftp))
}
# read and compute stats
dna <- tryCatch(Biostrings::readDNAStringSet(dest), error = function(e) NULL)
if (is.null(dna)) return(tibble(accession = assembly_acc, n_sequences = NA_integer_, genome_size_bp = NA_real_, gc_percent_fasta = NA_real_, ftp = ftp))
total_bp <- sum(Biostrings::width(dna))
# letterFrequency can compute for all sequences; sum over them
lf <- Biostrings::letterFrequency(dna, c("A","C","G","T"))
atgc <- sum(lf)
gc <- sum(lf[,"C"] + lf[,"G"])
gc_pct <- ifelse(atgc > 0, 100 * gc / atgc, NA_real_)
tibble(accession = assembly_acc, n_sequences = length(dna), genome_size_bp = as.numeric(total_bp), gc_percent_fasta = round(gc_pct, 3), ftp = ftp)
}
# Run online GC/size computation (cached FASTA files)
gcsize_cache_file <- file.path(cache_dir, "gc_size_meta.csv")
gcsize <- tibble()
accs <- unique(always_wrong_all$query_accession[!is.na(always_wrong_all$query_accession)])
cached2 <- if (file.exists(gcsize_cache_file)) suppressMessages(readr::read_csv(gcsize_cache_file, show_col_types = FALSE)) else tibble()
to_compute <- setdiff(accs, cached2$accession %>% unique() %||% character())
computed <- if (length(to_compute) > 0) purrr::map_dfr(to_compute, compute_gc_size_online) else tibble()
if (nrow(computed) > 0) {
cached2 <- dplyr::bind_rows(cached2, computed) %>% dplyr::distinct(accession, .keep_all = TRUE)
readr::write_csv(cached2, gcsize_cache_file)
}
gcsize <- cached2 %>% dplyr::filter(accession %in% accs)
# Combine
always_wrong_annot <- always_wrong_all %>%
mutate(accession = query_accession) %>%
dplyr::left_join(ncbi_meta, by = "accession") %>%
dplyr::left_join(gcsize, by = "accession") %>%
mutate(
genome_size_bp = dplyr::coalesce(as.numeric(genome_size_bp), as.numeric(genome_size)),
gc_percent_final = dplyr::coalesce(as.numeric(gc_percent_fasta), as.numeric(gc_percent)),
meta_source = dplyr::case_when(
!is.na(gc_percent_fasta) | !is.na(genome_size_bp) ~ "FASTA",
(!is.na(genome_size) | !is.na(gc_percent)) ~ "NCBI_Summary",
TRUE ~ NA_character_
)
) %>%
select(metric, source_genome, genome, wrong_type, n_tools_total,
genome_size_bp, gc_percent_final, family, genus, species, accession, meta_source)
# Summary counts
summary_wrong <- always_wrong_annot %>%
group_by(metric, source_genome, wrong_type) %>%
summarise(n_genomes = dplyr::n_distinct(genome), .groups = 'drop') %>%
arrange(metric, source_genome, wrong_type)
kable(summary_wrong, caption = "Counts of genomes unanimously misclassified (by metric, source, and type)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

| metric | source_genome | wrong_type | n_genomes |
|---|---|---|---|
| breadth | reference | FP_all | 21 |
| counts | reference | FP_all | 23 |
| coverage | bin | FN_all | 1 |
| coverage | reference | FP_all | 21 |
| relative | reference | FP_all | 42 |
| rpkm | bin | FN_all | 1 |
| rpkm | reference | FP_all | 27 |
| tpm | bin | FN_all | 2 |
| tpm | reference | FP_all | 24 |
# Split tables by source
ref_wrong <- always_wrong_annot %>% filter(source_genome == "reference")
bin_wrong <- always_wrong_annot %>% filter(source_genome == "bin")
ref_dt <- DT::datatable(ref_wrong,
caption = "Reference genomes misclassified by all tools (per metric)",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
bin_dt <- DT::datatable(bin_wrong,
caption = "Bin genomes misclassified by all tools (per metric)",
options = list(pageLength = 10, scrollX = TRUE),
rownames = FALSE)
# Ensure both widgets are rendered in HTML (avoid printing NULLs)
htmltools::tagList(ref_dt, bin_dt)

Next, we compare the genomes that are always correctly classified by the DA methods against those that are always misclassified, to see whether genome properties separate the two groups.
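The comparison plots further below contrast genome size and GC% between the two groups using per-facet Wilcoxon rank-sum tests; a minimal base-R sketch of that test on toy GC% values (illustrative numbers only):

# Two-sample Wilcoxon rank-sum test on illustrative GC% values for the two groups
always_right_gc <- c(42.1, 45.3, 44.0, 46.2)
always_wrong_gc <- c(55.8, 57.1, 53.9)
wilcox.test(always_right_gc, always_wrong_gc)$p.value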
# Build per-genome correctness summary per metric
correctness_long <- master_df %>%
filter(!is.na(adjP)) %>%
mutate(is_correct = case_when(
da_call & is_DA ~ TRUE, # TP
!da_call & !is_DA ~ TRUE, # TN
da_call & !is_DA ~ FALSE, # FP
!da_call & is_DA ~ FALSE, # FN
TRUE ~ NA
)) %>%
select(metric, genome, tool, is_correct, is_DA, source_genome = source_type)
correctness_by_genome <- correctness_long %>%
group_by(metric, genome) %>%
summarise(
n_tools_total = n_distinct(tool),
n_correct = sum(is_correct, na.rm = TRUE),
n_wrong = sum(!is_correct, na.rm = TRUE),
is_DA = dplyr::first(is_DA),
source_genome = dplyr::first(source_genome),
.groups = 'drop'
) %>%
mutate(
always_wrong = n_tools_total >= 2 & n_correct == 0,
always_right = n_tools_total >= 2 & n_correct == n_tools_total,
group = case_when(
always_wrong ~ "always_wrong",
always_right ~ "always_right",
TRUE ~ "other"
)
)
# Keep only always_wrong vs always_right
comp_set <- correctness_by_genome %>% filter(group %in% c("always_wrong", "always_right"))
# Map genomes to accessions (for metadata fetch)
acc_map <- truth_table %>% select(genome, genome_accession, closest_genome) %>%
mutate(closest_genome_dotted = ifelse(!is.na(closest_genome), sub("([0-9])$", ".\\1", closest_genome), NA_character_))
comp_set <- comp_set %>%
dplyr::left_join(acc_map, by = "genome") %>%
mutate(query_accession = dplyr::case_when(
source_genome == "reference" & !is.na(genome_accession) ~ genome_accession,
source_genome == "bin" & !is.na(closest_genome_dotted) ~ closest_genome_dotted,
TRUE ~ genome_accession
))
# Fetch metadata for both groups using cached functions/utilities from previous chunk
cache_dir <- file.path(results_dir, "cache")
if (!dir.exists(cache_dir)) dir.create(cache_dir, recursive = TRUE)
cache_file <- file.path(cache_dir, "assembly_meta.csv")
gcsize_cache_file <- file.path(cache_dir, "gc_size_meta.csv")
accs_all <- unique(comp_set$query_accession[!is.na(comp_set$query_accession)])
# Assembly/taxonomy meta
meta_cached <- if (file.exists(cache_file)) suppressMessages(readr::read_csv(cache_file, show_col_types = FALSE)) else tibble()
to_fetch <- setdiff(accs_all, meta_cached$accession %||% character())
meta_new <- if (length(to_fetch) > 0) fetch_ncbi_meta(to_fetch) else tibble()
if (nrow(meta_new) > 0) {
meta_cached <- dplyr::bind_rows(meta_cached, meta_new) %>% dplyr::distinct(accession, .keep_all = TRUE)
readr::write_csv(meta_cached, cache_file)
}
ncbi_meta_all <- meta_cached %>% dplyr::filter(accession %in% accs_all)
# FASTA-derived GC/size
gc_cached <- if (file.exists(gcsize_cache_file)) suppressMessages(readr::read_csv(gcsize_cache_file, show_col_types = FALSE)) else tibble()
to_compute <- setdiff(accs_all, gc_cached$accession %||% character())
gc_new <- if (length(to_compute) > 0) purrr::map_dfr(to_compute, compute_gc_size_online) else tibble()
if (nrow(gc_new) > 0) {
gc_cached <- dplyr::bind_rows(gc_cached, gc_new) %>% dplyr::distinct(accession, .keep_all = TRUE)
readr::write_csv(gc_cached, gcsize_cache_file)
}
gcsize_all <- gc_cached %>% dplyr::filter(accession %in% accs_all)
# Attach metadata
comp_meta <- comp_set %>%
mutate(accession = query_accession) %>%
dplyr::left_join(ncbi_meta_all, by = c("accession")) %>%
dplyr::left_join(gcsize_all, by = c("accession")) %>%
mutate(
genome_size_bp = dplyr::coalesce(as.numeric(genome_size_bp), as.numeric(genome_size)),
gc_percent_final = dplyr::coalesce(as.numeric(gc_percent_fasta), as.numeric(gc_percent))
)
# QC: counts per stratum
qc_groups <- comp_meta %>% dplyr::count(metric, source_genome, group, name = "n_genomes") %>% dplyr::arrange(metric, source_genome, group)
qc_table <- qc_groups %>%
tidyr::pivot_wider(names_from = group, values_from = n_genomes, values_fill = 0)
qc_table %>% kable(caption = "Counts per metric/source for always_wrong vs always_right") %>% kable_styling(bootstrap_options = c("striped","hover","condensed"))

| metric | source_genome | always_right | always_wrong |
|---|---|---|---|
| breadth | bin | 4 | 1 |
| breadth | reference | 59 | 21 |
| counts | bin | 2 | 1 |
| counts | reference | 74 | 23 |
| coverage | bin | 2 | 2 |
| coverage | reference | 76 | 21 |
| relative | bin | 4 | 1 |
| relative | reference | 75 | 42 |
| rpkm | bin | 2 | 2 |
| rpkm | reference | 76 | 27 |
| tpm | bin | 2 | 3 |
| tpm | reference | 73 | 24 |
# PLOTS
library(ggplot2)
# 1) Genome size plot (log10 bp)
size_plot_data <- comp_meta %>% filter(!is.na(genome_size_bp)) %>% mutate(log10_size = log10(genome_size_bp))
# Wilcoxon rank-sum per facet (only where both groups exist)
suppressPackageStartupMessages({ library(rstatix); library(ggpubr) })
size_facets_ok <- size_plot_data %>%
dplyr::count(metric, source_genome, group) %>%
tidyr::pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
filter(always_right > 0, always_wrong > 0) %>%
select(metric, source_genome)
size_pvals <- size_plot_data %>%
dplyr::inner_join(size_facets_ok, by = c("metric","source_genome")) %>%
group_by(metric, source_genome) %>%
rstatix::wilcox_test(log10_size ~ group, ref.group = "always_right") %>%
ungroup() %>%
mutate(p.format = rstatix::p_format(p), group1 = "always_right", group2 = "always_wrong")
size_y <- size_plot_data %>% group_by(metric, source_genome) %>% summarise(y.position = max(log10_size, na.rm = TRUE) * 1.07, .groups = 'drop')
size_pvals <- size_pvals %>% dplyr::left_join(size_y, by = c("metric","source_genome"))
p_size <- ggplot(size_plot_data, aes(x = group, y = log10_size, fill = group)) +
geom_violin(trim = TRUE, alpha = 0.7, color = NA) +
geom_boxplot(width = 0.18, outlier.shape = NA, alpha = 0.9) +
scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
labs(title = "Genome size (log10 bp): always_wrong vs always_right", x = NULL, y = "log10(genome size bp)") +
facet_grid(source_genome ~ metric, scales = "free_y") +
theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))
p_size <- p_size + ggpubr::stat_pvalue_manual(size_pvals, label = "p = {p.format}",
xmin = "group1", xmax = "group2",
y.position = "y.position", tip.length = 0.01, size = 3,
inherit.aes = FALSE)
print(p_size)

# 2) GC% plot
gc_plot_data <- comp_meta %>% filter(!is.na(gc_percent_final))
# Wilcoxon rank-sum per facet (only where both groups exist)
gc_facets_ok <- gc_plot_data %>%
dplyr::count(metric, source_genome, group) %>%
tidyr::pivot_wider(names_from = group, values_from = n, values_fill = 0) %>%
filter(always_right > 0, always_wrong > 0) %>%
select(metric, source_genome)
gc_pvals <- gc_plot_data %>%
dplyr::inner_join(gc_facets_ok, by = c("metric","source_genome")) %>%
group_by(metric, source_genome) %>%
rstatix::wilcox_test(gc_percent_final ~ group, ref.group = "always_right") %>%
ungroup() %>%
mutate(p.format = rstatix::p_format(p), group1 = "always_right", group2 = "always_wrong")
gc_y <- gc_plot_data %>% group_by(metric, source_genome) %>% summarise(y.position = max(gc_percent_final, na.rm = TRUE) * 1.07, .groups = 'drop')
gc_pvals <- gc_pvals %>% dplyr::left_join(gc_y, by = c("metric","source_genome"))
p_gc <- ggplot(gc_plot_data, aes(x = group, y = gc_percent_final, fill = group)) +
geom_violin(trim = TRUE, alpha = 0.7, color = NA) +
geom_boxplot(width = 0.18, outlier.shape = NA, alpha = 0.9) +
scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
labs(title = "GC%: always_wrong vs always_right", x = NULL, y = "GC percent") +
facet_grid(source_genome ~ metric) +
theme_minimal() + theme(legend.position = "none", axis.text.x = element_text(angle = 30, hjust = 1))
p_gc <- p_gc + ggpubr::stat_pvalue_manual(gc_pvals, label = "p = {p.format}",
xmin = "group1", xmax = "group2",
y.position = "y.position", tip.length = 0.01, size = 3,
inherit.aes = FALSE)
print(p_gc)

# 3) Genus bar plots (percentage within group)
genus_data <- comp_meta %>% filter(!is.na(genus) & genus != "") %>%
group_by(metric, source_genome, group, genus) %>%
summarise(n = n(), .groups = 'drop') %>%
group_by(metric, source_genome, group) %>%
mutate(pct = 100 * n / sum(n)) %>% ungroup()
# Pick top 15 genera overall (within each metric+source across groups)
top_genera <- genus_data %>%
group_by(metric, source_genome, genus) %>% summarise(total = sum(n), .groups = 'drop') %>%
group_by(metric, source_genome) %>% slice_max(order_by = total, n = 15, with_ties = FALSE) %>% ungroup()
genus_top <- genus_data %>% dplyr::inner_join(top_genera %>% dplyr::select(metric, source_genome, genus), by = c("metric","source_genome","genus"))
p_genus <- ggplot(genus_top, aes(x = reorder(genus, pct), y = pct, fill = group)) +
geom_col(position = position_dodge(width = 0.75)) +
coord_flip() +
scale_fill_manual(values = c(always_wrong = "#E63946", always_right = "#2A9D8F")) +
labs(title = "Genus composition: always_wrong vs always_right", x = "Genus (top 15)", y = "Percent within group", fill = "Group") +
facet_grid(source_genome ~ metric, scales = "free_y") +
theme_minimal()
print(p_genus)