EzAAI (High Throughput Prokaryotic Average Amino acid Identity Calculator) is a workflow suite designed to improve performance in Amino Acid Identity (AAI) calculations. It includes a novel module for hierarchical clustering analysis.
This section details the steps to generate the AAI matrix from the command line.
1. Profile Extraction
First, we must convert our genome files (.fna) into EzAAI-compatible databases. To automate this, we use the following shell script.
Create the script create_AAI_db.sh:
#!/bin/bash# Create output folder if it doesn't existmkdir-p db # Iterate over .fna files (Adjust path as needed)# Replace 'raw_genomes' with your actual pathfor file in raw_genomes/*.fna;dobase=$(basename"$file" .fna)echo"Processing $base..."ezaai extract -i"$file"-o db/"$base".db -l"$base"done
Execution:
chmod +x create_AAI_db.sh./create_AAI_db.sh
2. AAI Calculation
Once the profile database is generated in the db/ folder, we execute the “all-vs-all” (pairwise) calculation.
This pipeline module evaluates the taxonomic quality of the input dataset and determines empirical biological boundaries based on genomic similarity.
1. Setup and Data Loading
Note
Loading aai.tsv and removing self-comparisons (100% matches).
Code
# --- 1. LIBRARIES ---library(ggplot2)# svglib is usually for reading, standard ggsave handles svg writing automatically# library(svglib) # --- 2. LOAD DATA ---# Ensure "aai.tsv" is in your working directory# Using read.table as in the original scriptdatos <-read.table("/Users/abraham/Documents/Articulos_ab/Chromobacterium/Cap_2/ezAAI/out/aai.tsv", header =TRUE, sep ="\t", quote ="")# Filter out self-comparisons (100% identity) to clean up the plotdatos_plot <-subset(datos, Label.1!= Label.2)# Check datahead(datos_plot)
We load the real EzAAI output (aai.tsv) using standard R functions and filter out self-comparisons.
2. AAI Distribution and Boundary Detection
To identify natural taxonomic boundaries, we analyze the distribution of AAI values. We utilize a density analysis approach to detect “valleys” (local minima) in the distribution curve. These valleys often correspond to the biological gap between intra-genus and inter-genus comparisons.
2.1. Mathematical Valley Detection
We employ the second derivative logic to identify inflection points where the curve hits a local minimum.
Code
# Calculate density of the AAI valuesd <-density(datos_plot$AAI)# Mathematical logic to find local minima (valleys)# We look for points where the sign of the difference of the slope changesvalles_indices <-which(diff(sign(diff(d$y))) ==2) +1valles_x <- d$x[valles_indices]# Filter for biologically relevant valleys (typically between 45% and 90%)valles_reales <- valles_x[valles_x >45& valles_x <90]cat("Detected Potential Cutoffs (Valleys):", round(valles_reales, 2), "%\n")
p1 <-ggplot(datos_plot, aes(x = AAI)) +# Histogram in backgroundgeom_histogram(aes(y =after_stat(density)), binwidth =0.5, fill ="#BDC3C7", color =NA, alpha=0.6) +# Density curvegeom_density(color ="#2E86C1", linewidth =1) +# Vertical lines at detected valleysgeom_vline(xintercept = valles_reales, linetype ="dashed", color ="#C0392B", alpha=0.8, linewidth=0.7) +# Labels for the valleysannotate("text", x = valles_reales, y =0, label =paste0(round(valles_reales, 1), "%"), vjust =-1, color ="#C0392B", fontface ="bold", angle =90, size =4) +# Labels and Themelabs(title ="AAI Value Distribution and Genus/Family Boundaries",subtitle ="Red dashed lines indicate natural cutoffs (valleys) suggested by the data",x ="Amino Acid Identity (AAI %)", y ="Density") +theme_minimal(base_size =14) +xlim(40, 100)# Save the plot as SVGggsave("Figure1_AAI_Distribution.svg", plot = p1, width =10, height =6)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Code
# Display in reportprint(p1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
3. Taxonomic Validation
In this section, we parse the filenames to extract Genus information. This allows us to ground-truth the AAI values against current taxonomy.
Assumption: Filenames follow the format Genus_species... (e.g., Chromobacterium_violaceum.fasta).
Code
# Helper function to extract Genus (string split by first underscore)get_genus <-function(x) { sapply(strsplit(as.character(x), "_"), `[`, 1) }# Apply extraction to both labelsdatos_plot$Gen1 <-get_genus(datos_plot$Label.1)datos_plot$Gen2 <-get_genus(datos_plot$Label.2)# Classify comparisons: Intra-Genus vs Inter-Genusdatos_plot$Categoria <-ifelse(datos_plot$Gen1 == datos_plot$Gen2, "Intra-Genus (Same)", "Inter-Genus (Different)")# Show classification summarytable(datos_plot$Categoria)
This plot validates if the genomic data supports the assigned taxonomy. We expect a clear separation between “Same Genus” and “Different Genus” groups.
Code
p2 <-ggplot(datos_plot, aes(x = Categoria, y = AAI, fill = Categoria)) +geom_boxplot(outlier.alpha =0.2, outlier.shape =16, width =0.5) +# Reference line (Standard Genus Cutoff ~65%)geom_hline(yintercept =65, linetype ="dashed", color ="black", alpha=0.7) +scale_fill_manual(values =c("#E74C3C", "#2ECC71")) +# Red/Green scheme# Labelslabs(title ="Taxonomic Resolution (Intra vs Inter Genus)",subtitle ="Validation of genomic coherence in the dataset",y ="Amino Acid Identity (AAI %)", x ="") +theme_minimal(base_size =14) +theme(legend.position ="none", axis.text.x =element_text(size=12, face="bold"))# Save the plot as SVGggsave("Figure2_Taxonomic_Validation_Boxplot.svg", plot = p2, width =8, height =6)# Display in reportprint(p2)
4. Threshold Optimization (Graph Theory)
Instead of arbitrarily choosing a cutoff (e.g., 65%), we use an iterative graph-based approach to find the mathematically optimal AAI threshold. We test thresholds from 60% to 85% and score them based on cluster stability (penalizing overlaps and singletons).
Code
# Load necessary libraries for graph analysisif(!require("igraph")) install.packages("igraph")if(!require("dplyr")) install.packages("dplyr")library(igraph)library(dplyr)
Code
# 1. Prepare Data (Reuse loaded data)# Create edge list from non-identical pairsdf_edges <- datos_plot[, c("Label.1", "Label.2", "AAI")]colnames(df_edges) <-c("from", "to", "weight")# Get all unique genomestodos_genomas <-unique(c(df_edges$from, df_edges$to))total_N <-length(todos_genomas)# 2. Define Parametersumbrales <-seq(60, 85, by =0.5) # Steps of 0.5 for speed (use 0.1 for higher precision)# Penalties (Based on Park et al 2022 / Barco et al 2020 logic)PENALTY_OVERLAP <- total_N +100PENALTY_SINGLETON <-1resultados <-data.frame(Threshold =numeric(), Score =numeric(), Overlaps =numeric(), Singletons =numeric(), Clusters =numeric())# 3. Optimization Loop# We suppress the loop output to keep the report cleanfor (t in umbrales) {# A. Filter network by threshold 't' edges_t <- df_edges[df_edges$weight >= t, ]# B. Build Undirected Graph g <-graph_from_data_frame(edges_t, directed =FALSE, vertices =NULL)# C. Find "Maximal Cliques" cliques_list <-max_cliques(g, min =1)# D. Scoring Logic pertenencia <-unlist(cliques_list) conteo_pertenencia <-table(pertenencia)# Identify Overlaps nodos_overlap <-names(conteo_pertenencia[conteo_pertenencia >1]) num_overlaps <-length(nodos_overlap)# Identify Singletonsif (vcount(g) >0) { degrees <-degree(g) singletons_degree_0 <-sum(degrees ==0) nodos_en_grafo <-names(V(g)) } else { singletons_degree_0 <-0 nodos_en_grafo <-c() } nodos_aislados_total <-setdiff(todos_genomas, nodos_en_grafo) total_singletons <- singletons_degree_0 +length(nodos_aislados_total)# E. Final Score Calculation score <-- (num_overlaps * PENALTY_OVERLAP) - (total_singletons * PENALTY_SINGLETON) resultados <-rbind(resultados, data.frame(Threshold = t, Score = score, Overlaps = num_overlaps, Singletons = total_singletons,Clusters =length(cliques_list)))}# 4. Select Best Resultmejor_res <- resultados[which.max(resultados$Score), ]cat("Optimization Complete.\n")
The plot below shows the stability score across different AAI thresholds. The peak represents the point where genera are best defined (minimized overlaps and appropriate grouping).
Code
cat("Generating optimization plot...\n")
Generating optimization plot...
Code
p_opt <-ggplot(resultados, aes(x = Threshold, y = Score)) +# Trend linegeom_line(color ="#b4b4b4b8", linewidth =0.7) +# Pointsgeom_point(color ="#444444b8", size =0.3, alpha =0.5) +# Vertical guide to optimumgeom_segment(data = mejor_res, aes(x = Threshold, xend = Threshold, y =min(resultados$Score), yend = Score),linetype ="dashed", color ="#f79c9cff", linewidth =0.5) +# Optimal pointgeom_point(data = mejor_res, aes(x = Threshold, y = Score), color ="#f15959ff", size =3.5) +# Labelannotate("text", x = mejor_res$Threshold, y = mejor_res$Score, label =paste0(mejor_res$Threshold, "%"), vjust =-1.5, color ="#f15959ff", fontface ="bold", size =4) +# Aesthetics (Translated labels to English)labs(title ="AAI Threshold Optimization (Genus Level)",subtitle =paste("Optimal Cutoff Detected at:", mejor_res$Threshold, "% AAI"),y ="Coherence Score", x ="Identity Threshold (%)") +theme_classic(base_size =14) +theme(plot.title =element_text(face ="bold", size =16),axis.text =element_text(color ="black"),axis.line =element_line(color ="black"),panel.grid.major.y =element_line(color ="grey90", linetype ="dotted") )# Save and Printggsave("Figure3_AAI_Optimization.png", plot = p_opt, width =8, height =5, dpi =300)print(p_opt)
5. Interactive 3D Visualization (Manual Threshold)
Finally, we visualize the genomic distances in a 3-Dimensional space using Principal Coordinates Analysis (PCoA/MDS). The coloring corresponds to the clusters defined by the manual threshold.
Code
# --- 1. LIBRARIES ---library(plotly)library(reshape2)library(dplyr)library(htmlwidgets)# ==========================================# PARAMETER (MANUAL INPUT)# ==========================================# Aquí defines el valor manualmente, ignorando la optimización anteriorUMBRAL_CORTE <-81.5# <--- ¡CAMBIA ESTE NÚMERO AQUÍ!cat(paste0("Generating 3D Map using Manual Threshold: ", UMBRAL_CORTE, "%\n"))
Generating 3D Map using Manual Threshold: 81.5%
Code
# ==========================================# --- 2. PREPARE DATA ---# We use 'datos_plot' which is already loaded and cleaned in Section 2# Convert to Distance Matrixmatriz_dist <-dcast(datos_plot, Label.1~ Label.2, value.var ="AAI", fill =0)rownames(matriz_dist) <- matriz_dist$Label.1matriz_dist <- matriz_dist[,-1]mat_sim <-as.matrix(matriz_dist)mat_sim <-pmax(mat_sim, t(mat_sim), na.rm=TRUE) dist_final <-100- mat_sim # Distance = 100 - AAI# --- 3. DYNAMIC CLUSTERING ---# Hierarchical Clustering (Average Linkage / UPGMA)hc <-hclust(as.dist(dist_final), method ="average")# Cut tree at defined heightaltura_corte <-100- UMBRAL_CORTEgrupos <-cutree(hc, h = altura_corte)# --- 4. 3D COORDINATES CALCULATION (MDS/PCoA) ---mds <-cmdscale(dist_final, k =3, eig =TRUE)# Create 3D Data Framedf_plot <-data.frame(Dim1 = mds$points[, 1], Dim2 = mds$points[, 2],Dim3 = mds$points[, 3])df_plot$Genoma <-rownames(df_plot)# Add Cluster Info and Original Genusdf_plot$Cluster <-factor(grupos) df_plot$Cluster_ID <-paste("Cluster", df_plot$Cluster)df_plot$Genero_Original <-get_genus(df_plot$Genoma) # Reuse function from Sec 4# --- 5. GENERATE INTERACTIVE 3D PLOT ---# Define dynamic palettenum_colores <-length(unique(df_plot$Cluster))colores_dinamicos <-colorRampPalette(c("#E74C3C", "#3498DB", "#2ECC71", "#9B59B6", "#F1C40F", "#1ABC9C", "#34495E"))(num_colores)# Create Plotly Objectfig <-plot_ly(data = df_plot, x =~Dim1, y =~Dim2, z =~Dim3, color =~Cluster_ID, colors = colores_dinamicos,type ='scatter3d', mode ='markers',marker =list(size =5, opacity =0.8,line =list(width =0.5, color ='white')),text =~paste("<b>Genome:</b>", Genoma,"<br><b>Cluster:</b>", Cluster_ID,"<br><b>NCBI Genus:</b>", Genero_Original),hoverinfo ="text")# Customize Layoutfig <- fig %>%layout(title =paste0("Genomic 3D Map (Manual Threshold: ", UMBRAL_CORTE, "%)"),scene =list( xaxis =list(title ="Dim 1", showgrid =TRUE),yaxis =list(title ="Dim 2", showgrid =TRUE),zaxis =list(title ="Dim 3 (Depth)", showgrid =TRUE),camera =list(eye =list(x =1.5, y =1.5, z =1.5)) ),legend =list(title =list(text ="<b>Clusters</b>")))# Save as external HTML filenombre_archivo <-paste0("Map_3D_Threshold_", UMBRAL_CORTE, ".html")saveWidget(as_widget(fig), nombre_archivo)cat(paste0("Success! 3D file saved as: '", nombre_archivo, "'\n"))
Success! 3D file saved as: 'Map_3D_Threshold_81.5.html'
Code
# Print inside the Quarto reportfig
6. Cluster Summary Table
Below is the detailed list of genomes and their assigned clusters based on the manual threshold defined above. You can filter and search specific genomes in this table.
Source Code
---title: "Genomic Analysis Pipeline: EzAAI & Taxonomic Validation"author: "Abraham Guerra"date: "today"date-format: "MMMM DD, YYYY"lang: enformat: html: theme: cosmo toc: true toc-depth: 3 toc-title: "Contents" smooth-scroll: true code-fold: show code-tools: true highlight-style: github include-in-header: text: | <style> :root { --brand-acc: #007acc; } h1, h2, h3 { scroll-margin-top: 80px; } .navbar, .quarto-title-banner { backdrop-filter: blur(3px); } div.sourceCode { display:block; width:100%; overflow-x:auto; } div.sourceCode > pre { background: transparent !important; margin:0; } pre.sourceCode, pre:not(.sourceCode) { background-color: #f5f5ff !important; border-left: 4px solid var(--brand-acc) !important; border-radius: 8px; padding: .75rem 1rem; margin: 1rem 0; } .cell-output pre { background-color: #f5f5f5 !important; border-radius: 6px; padding: .5rem .75rem; } </style>editor: visual---# 🧬 Introduction to EzAAI**EzAAI** (High Throughput Prokaryotic Average Amino acid Identity Calculator) is a workflow suite designed to improve performance in Amino Acid Identity (AAI) calculations. It includes a novel module for hierarchical clustering analysis.Official documentation: [https://endixk.github.io/ezaai/](https://endixk.github.io/ezaai/)## Available ModulesThe suite consists of four main modules:* **Extract:** Extracts profile DBs from genomes using Prodigal.* **Convert:** Converts CDS FASTA files into profile DBs.* **Calculate:** Calculates AAI values from profile databases using MMseqs2.* **Cluster:** Performs hierarchical clustering of taxa based on AAI values.---# ⚙️ Installation and SetupTo use EzAAI, it is recommended to create a dedicated environment using Conda.```bashconda create -n EzAAIconda activate EzAAI conda install -y-c bioconda ezaaiezaai-h```# 🖥️ Pipeline Execution (Bash)This section details the steps to generate the AAI matrix from the command line.## 1. Profile ExtractionFirst, we must convert our genome files (`.fna`) into EzAAI-compatible databases. To automate this, we use the following shell script.**Create the script `create_AAI_db.sh`:**```bash#!/bin/bash# Create output folder if it doesn't existmkdir-p db # Iterate over .fna files (Adjust path as needed)# Replace 'raw_genomes' with your actual pathfor file in raw_genomes/*.fna;dobase=$(basename"$file" .fna)echo"Processing $base..."ezaai extract -i"$file"-o db/"$base".db -l"$base"done```**Execution:**```bashchmod +x create_AAI_db.sh./create_AAI_db.sh```## 2. AAI CalculationOnce the profile database is generated in the `db/` folder, we execute the "all-vs-all" (pairwise) calculation.```bash# Create output directorymkdir-p out# Run calculationezaai calculate -i db/ -j db/ -o out/aai.tsv```## 3. Result VerificationWe can inspect the first few lines of the resulting file to ensure the format is correct.```bashhead-5 aai.tsv```*Expected Output:*```ID1 ID2 Label1 Label2 AAI786958951 786958951 Leucobacter muris Leucobacter muris 100.000000786958951 199206886 Leucobacter muris Clavibacter nebraskensis 61.609688```# 📊 Analysis and Validation (R)This pipeline module evaluates the taxonomic quality of the input dataset and determines empirical biological boundaries based on genomic similarity.## 1. Setup and Data Loading::: {.callout-note}Loading `aai.tsv` and removing self-comparisons (100% matches).:::```{r}#| label: setup-data#| message: false#| warning: false# --- 1. LIBRARIES ---library(ggplot2)# svglib is usually for reading, standard ggsave handles svg writing automatically# library(svglib) # --- 2. LOAD DATA ---# Ensure "aai.tsv" is in your working directory# Using read.table as in the original scriptdatos <-read.table("/Users/abraham/Documents/Articulos_ab/Chromobacterium/Cap_2/ezAAI/out/aai.tsv", header =TRUE, sep ="\t", quote ="")# Filter out self-comparisons (100% identity) to clean up the plotdatos_plot <-subset(datos, Label.1!= Label.2)# Check datahead(datos_plot)```::: {.callout-note}We load the real EzAAI output (`aai.tsv`) using standard R functions and filter out self-comparisons.:::## 2. AAI Distribution and Boundary DetectionTo identify natural taxonomic boundaries, we analyze the distribution of AAI values. We utilize a density analysis approach to detect "valleys" (local minima) in the distribution curve. These valleys often correspond to the biological gap between intra-genus and inter-genus comparisons.### 2.1. Mathematical Valley DetectionWe employ the second derivative logic to identify inflection points where the curve hits a local minimum.```{r}#| label: density-calculation# Calculate density of the AAI valuesd <-density(datos_plot$AAI)# Mathematical logic to find local minima (valleys)# We look for points where the sign of the difference of the slope changesvalles_indices <-which(diff(sign(diff(d$y))) ==2) +1valles_x <- d$x[valles_indices]# Filter for biologically relevant valleys (typically between 45% and 90%)valles_reales <- valles_x[valles_x >45& valles_x <90]cat("Detected Potential Cutoffs (Valleys):", round(valles_reales, 2), "%\n")```### 2.1.1. Visualization: AAI Histogram```{r}#| label: plot-distribution#| message: falsep1 <-ggplot(datos_plot, aes(x = AAI)) +# Histogram in backgroundgeom_histogram(aes(y =after_stat(density)), binwidth =0.5, fill ="#BDC3C7", color =NA, alpha=0.6) +# Density curvegeom_density(color ="#2E86C1", linewidth =1) +# Vertical lines at detected valleysgeom_vline(xintercept = valles_reales, linetype ="dashed", color ="#C0392B", alpha=0.8, linewidth=0.7) +# Labels for the valleysannotate("text", x = valles_reales, y =0, label =paste0(round(valles_reales, 1), "%"), vjust =-1, color ="#C0392B", fontface ="bold", angle =90, size =4) +# Labels and Themelabs(title ="AAI Value Distribution and Genus/Family Boundaries",subtitle ="Red dashed lines indicate natural cutoffs (valleys) suggested by the data",x ="Amino Acid Identity (AAI %)", y ="Density") +theme_minimal(base_size =14) +xlim(40, 100)# Save the plot as SVGggsave("Figure1_AAI_Distribution.svg", plot = p1, width =10, height =6)# Display in reportprint(p1)```## 3. Taxonomic ValidationIn this section, we parse the filenames to extract Genus information. This allows us to ground-truth the AAI values against current taxonomy.**Assumption:** Filenames follow the format `Genus_species...` (e.g., `Chromobacterium_violaceum.fasta`).```{r}#| label: taxonomy-classification# Helper function to extract Genus (string split by first underscore)get_genus <-function(x) { sapply(strsplit(as.character(x), "_"), `[`, 1) }# Apply extraction to both labelsdatos_plot$Gen1 <-get_genus(datos_plot$Label.1)datos_plot$Gen2 <-get_genus(datos_plot$Label.2)# Classify comparisons: Intra-Genus vs Inter-Genusdatos_plot$Categoria <-ifelse(datos_plot$Gen1 == datos_plot$Gen2, "Intra-Genus (Same)", "Inter-Genus (Different)")# Show classification summarytable(datos_plot$Categoria)```### 3.1. Visualization: Genomic Coherence BoxplotThis plot validates if the genomic data supports the assigned taxonomy. We expect a clear separation between "Same Genus" and "Different Genus" groups.```{r}#| label: plot-taxonomyp2 <-ggplot(datos_plot, aes(x = Categoria, y = AAI, fill = Categoria)) +geom_boxplot(outlier.alpha =0.2, outlier.shape =16, width =0.5) +# Reference line (Standard Genus Cutoff ~65%)geom_hline(yintercept =65, linetype ="dashed", color ="black", alpha=0.7) +scale_fill_manual(values =c("#E74C3C", "#2ECC71")) +# Red/Green scheme# Labelslabs(title ="Taxonomic Resolution (Intra vs Inter Genus)",subtitle ="Validation of genomic coherence in the dataset",y ="Amino Acid Identity (AAI %)", x ="") +theme_minimal(base_size =14) +theme(legend.position ="none", axis.text.x =element_text(size=12, face="bold"))# Save the plot as SVGggsave("Figure2_Taxonomic_Validation_Boxplot.svg", plot = p2, width =8, height =6)# Display in reportprint(p2)```---## 4. Threshold Optimization (Graph Theory)Instead of arbitrarily choosing a cutoff (e.g., 65%), we use an iterative graph-based approach to find the **mathematically optimal AAI threshold**. We test thresholds from 60% to 85% and score them based on cluster stability (penalizing overlaps and singletons).```{r}#| label: load-igraph-libraries#| message: false#| warning: false# Load necessary libraries for graph analysisif(!require("igraph")) install.packages("igraph")if(!require("dplyr")) install.packages("dplyr")library(igraph)library(dplyr)``````{r}#| label: run-optimization#| cache: true#| message: false#| warning: false# 1. Prepare Data (Reuse loaded data)# Create edge list from non-identical pairsdf_edges <- datos_plot[, c("Label.1", "Label.2", "AAI")]colnames(df_edges) <-c("from", "to", "weight")# Get all unique genomestodos_genomas <-unique(c(df_edges$from, df_edges$to))total_N <-length(todos_genomas)# 2. Define Parametersumbrales <-seq(60, 85, by =0.5) # Steps of 0.5 for speed (use 0.1 for higher precision)# Penalties (Based on Park et al 2022 / Barco et al 2020 logic)PENALTY_OVERLAP <- total_N +100PENALTY_SINGLETON <-1resultados <-data.frame(Threshold =numeric(), Score =numeric(), Overlaps =numeric(), Singletons =numeric(), Clusters =numeric())# 3. Optimization Loop# We suppress the loop output to keep the report cleanfor (t in umbrales) {# A. Filter network by threshold 't' edges_t <- df_edges[df_edges$weight >= t, ]# B. Build Undirected Graph g <-graph_from_data_frame(edges_t, directed =FALSE, vertices =NULL)# C. Find "Maximal Cliques" cliques_list <-max_cliques(g, min =1)# D. Scoring Logic pertenencia <-unlist(cliques_list) conteo_pertenencia <-table(pertenencia)# Identify Overlaps nodos_overlap <-names(conteo_pertenencia[conteo_pertenencia >1]) num_overlaps <-length(nodos_overlap)# Identify Singletonsif (vcount(g) >0) { degrees <-degree(g) singletons_degree_0 <-sum(degrees ==0) nodos_en_grafo <-names(V(g)) } else { singletons_degree_0 <-0 nodos_en_grafo <-c() } nodos_aislados_total <-setdiff(todos_genomas, nodos_en_grafo) total_singletons <- singletons_degree_0 +length(nodos_aislados_total)# E. Final Score Calculation score <-- (num_overlaps * PENALTY_OVERLAP) - (total_singletons * PENALTY_SINGLETON) resultados <-rbind(resultados, data.frame(Threshold = t, Score = score, Overlaps = num_overlaps, Singletons = total_singletons,Clusters =length(cliques_list)))}# 4. Select Best Resultmejor_res <- resultados[which.max(resultados$Score), ]cat("Optimization Complete.\n")cat("Optimal Threshold Found:", mejor_res$Threshold, "%\n")cat("Max Score:", mejor_res$Score, "\n")```### Optimization CurveThe plot below shows the stability score across different AAI thresholds. The peak represents the point where genera are best defined (minimized overlaps and appropriate grouping).```{r}#| label: plot-optimization#| message: falsecat("Generating optimization plot...\n")p_opt <-ggplot(resultados, aes(x = Threshold, y = Score)) +# Trend linegeom_line(color ="#b4b4b4b8", linewidth =0.7) +# Pointsgeom_point(color ="#444444b8", size =0.3, alpha =0.5) +# Vertical guide to optimumgeom_segment(data = mejor_res, aes(x = Threshold, xend = Threshold, y =min(resultados$Score), yend = Score),linetype ="dashed", color ="#f79c9cff", linewidth =0.5) +# Optimal pointgeom_point(data = mejor_res, aes(x = Threshold, y = Score), color ="#f15959ff", size =3.5) +# Labelannotate("text", x = mejor_res$Threshold, y = mejor_res$Score, label =paste0(mejor_res$Threshold, "%"), vjust =-1.5, color ="#f15959ff", fontface ="bold", size =4) +# Aesthetics (Translated labels to English)labs(title ="AAI Threshold Optimization (Genus Level)",subtitle =paste("Optimal Cutoff Detected at:", mejor_res$Threshold, "% AAI"),y ="Coherence Score", x ="Identity Threshold (%)") +theme_classic(base_size =14) +theme(plot.title =element_text(face ="bold", size =16),axis.text =element_text(color ="black"),axis.line =element_line(color ="black"),panel.grid.major.y =element_line(color ="grey90", linetype ="dotted") )# Save and Printggsave("Figure3_AAI_Optimization.png", plot = p_opt, width =8, height =5, dpi =300)print(p_opt)```## 5. Interactive 3D Visualization (Manual Threshold)Finally, we visualize the genomic distances in a 3-Dimensional space using Principal Coordinates Analysis (PCoA/MDS). The coloring corresponds to the clusters defined by the manual threshold.```{r}#| label: plot-3d-interactive#| message: false#| warning: false# --- 1. LIBRARIES ---library(plotly)library(reshape2)library(dplyr)library(htmlwidgets)# ==========================================# PARAMETER (MANUAL INPUT)# ==========================================# Aquí defines el valor manualmente, ignorando la optimización anteriorUMBRAL_CORTE <-81.5# <--- ¡CAMBIA ESTE NÚMERO AQUÍ!cat(paste0("Generating 3D Map using Manual Threshold: ", UMBRAL_CORTE, "%\n"))# ==========================================# --- 2. PREPARE DATA ---# We use 'datos_plot' which is already loaded and cleaned in Section 2# Convert to Distance Matrixmatriz_dist <-dcast(datos_plot, Label.1~ Label.2, value.var ="AAI", fill =0)rownames(matriz_dist) <- matriz_dist$Label.1matriz_dist <- matriz_dist[,-1]mat_sim <-as.matrix(matriz_dist)mat_sim <-pmax(mat_sim, t(mat_sim), na.rm=TRUE) dist_final <-100- mat_sim # Distance = 100 - AAI# --- 3. DYNAMIC CLUSTERING ---# Hierarchical Clustering (Average Linkage / UPGMA)hc <-hclust(as.dist(dist_final), method ="average")# Cut tree at defined heightaltura_corte <-100- UMBRAL_CORTEgrupos <-cutree(hc, h = altura_corte)# --- 4. 3D COORDINATES CALCULATION (MDS/PCoA) ---mds <-cmdscale(dist_final, k =3, eig =TRUE)# Create 3D Data Framedf_plot <-data.frame(Dim1 = mds$points[, 1], Dim2 = mds$points[, 2],Dim3 = mds$points[, 3])df_plot$Genoma <-rownames(df_plot)# Add Cluster Info and Original Genusdf_plot$Cluster <-factor(grupos) df_plot$Cluster_ID <-paste("Cluster", df_plot$Cluster)df_plot$Genero_Original <-get_genus(df_plot$Genoma) # Reuse function from Sec 4# --- 5. GENERATE INTERACTIVE 3D PLOT ---# Define dynamic palettenum_colores <-length(unique(df_plot$Cluster))colores_dinamicos <-colorRampPalette(c("#E74C3C", "#3498DB", "#2ECC71", "#9B59B6", "#F1C40F", "#1ABC9C", "#34495E"))(num_colores)# Create Plotly Objectfig <-plot_ly(data = df_plot, x =~Dim1, y =~Dim2, z =~Dim3, color =~Cluster_ID, colors = colores_dinamicos,type ='scatter3d', mode ='markers',marker =list(size =5, opacity =0.8,line =list(width =0.5, color ='white')),text =~paste("<b>Genome:</b>", Genoma,"<br><b>Cluster:</b>", Cluster_ID,"<br><b>NCBI Genus:</b>", Genero_Original),hoverinfo ="text")# Customize Layoutfig <- fig %>%layout(title =paste0("Genomic 3D Map (Manual Threshold: ", UMBRAL_CORTE, "%)"),scene =list( xaxis =list(title ="Dim 1", showgrid =TRUE),yaxis =list(title ="Dim 2", showgrid =TRUE),zaxis =list(title ="Dim 3 (Depth)", showgrid =TRUE),camera =list(eye =list(x =1.5, y =1.5, z =1.5)) ),legend =list(title =list(text ="<b>Clusters</b>")))# Save as external HTML filenombre_archivo <-paste0("Map_3D_Threshold_", UMBRAL_CORTE, ".html")saveWidget(as_widget(fig), nombre_archivo)cat(paste0("Success! 3D file saved as: '", nombre_archivo, "'\n"))# Print inside the Quarto reportfig```## 6. Cluster Summary TableBelow is the detailed list of genomes and their assigned clusters based on the manual threshold defined above. You can filter and search specific genomes in this table.```{r}#| label: cluster-table#| echo: false#| message: false# Load DT library for interactive tablesif (!require("DT")) install.packages("DT")library(DT)# Create a clean subset of the data generated in the previous section# (We reuse 'df_plot' which is available from the previous chunk)tabla_clusters <- df_plot[, c("Genoma", "Cluster_ID", "Genero_Original")]# 1. Save this specific table as TSVwrite.table(tabla_clusters, "Manual_Cluster_List.tsv", sep="\t", row.names=FALSE, quote=FALSE)# 2. Display interactive tabledatatable(tabla_clusters, filter ='top',caption ="Manual Cluster Assignments",options =list(pageLength =10, autoWidth =TRUE))```