Genomic Analysis Pipeline: EzAAI & Taxonomic Validation

Author

Abraham Guerra

Published

January 13, 2026

🧬 Introduction to EzAAI

EzAAI (High Throughput Prokaryotic Average Amino acid Identity Calculator) is a workflow suite designed to improve performance in Amino Acid Identity (AAI) calculations. It includes a novel module for hierarchical clustering analysis.

Official documentation: https://endixk.github.io/ezaai/

Available Modules

The suite consists of four main modules:

  • Extract: Extracts profile DBs from genomes using Prodigal.
  • Convert: Converts CDS FASTA files into profile DBs.
  • Calculate: Calculates AAI values from profile databases using MMseqs2.
  • Cluster: Performs hierarchical clustering of taxa based on AAI values.

⚙️ Installation and Setup

To use EzAAI, it is recommended to create a dedicated environment using Conda.

conda create -n EzAAI
conda activate EzAAI 
conda install -y -c bioconda ezaai
ezaai -h

🖥️ Pipeline Execution (Bash)

This section details the steps to generate the AAI matrix from the command line.

1. Profile Extraction

First, we must convert our genome files (.fna) into EzAAI-compatible databases. To automate this, we use the following shell script.

Create the script create_AAI_db.sh:

#!/bin/bash

# Create output folder if it doesn't exist
mkdir -p db 

# Iterate over .fna files (Adjust path as needed)
# Replace 'raw_genomes' with your actual path
for file in raw_genomes/*.fna; do
    base=$(basename "$file" .fna)
    echo "Processing $base..."
    ezaai extract -i "$file" -o db/"$base".db -l "$base"
done

Execution:

chmod +x create_AAI_db.sh
./create_AAI_db.sh

2. AAI Calculation

Once the profile database is generated in the db/ folder, we execute the “all-vs-all” (pairwise) calculation.

# Create output directory
mkdir -p out

# Run calculation
ezaai calculate -i db/ -j db/ -o out/aai.tsv

3. Result Verification

We can inspect the first few lines of the resulting file to ensure the format is correct.

head -5 aai.tsv

Expected Output:

ID1        ID2        Label1              Label2                   AAI
786958951  786958951  Leucobacter muris   Leucobacter muris        100.000000
786958951  199206886  Leucobacter muris   Clavibacter nebraskensis 61.609688

📊 Analysis and Validation (R)

This pipeline module evaluates the taxonomic quality of the input dataset and determines empirical biological boundaries based on genomic similarity.

1. Setup and Data Loading

Note

Loading aai.tsv and removing self-comparisons (100% matches).

Code
# --- 1. LIBRARIES ---
library(ggplot2)
# svglib is usually for reading, standard ggsave handles svg writing automatically
# library(svglib) 

# --- 2. LOAD DATA ---
# Ensure "aai.tsv" is in your working directory
# Using read.table as in the original script
datos <- read.table("/Users/abraham/Documents/Articulos_ab/Chromobacterium/Cap_2/ezAAI/out/aai.tsv", header = TRUE, sep = "\t", quote = "")

# Filter out self-comparisons (100% identity) to clean up the plot
datos_plot <- subset(datos, Label.1 != Label.2)

# Check data
head(datos_plot)
       ID.1      ID.2                           Label.1
2 808333033 177779930 Chromobacterium_vaccinii_LABIM181
3 808333033 805955504 Chromobacterium_vaccinii_LABIM181
4 808333033 344075726 Chromobacterium_vaccinii_LABIM181
5 808333033 988690950 Chromobacterium_vaccinii_LABIM181
6 808333033  30592115 Chromobacterium_vaccinii_LABIM181
7 808333033 216535554 Chromobacterium_vaccinii_LABIM181
                                   Label.2      AAI CDS.count.1 CDS.count.2
2 Chromobacteriaceae_bacterium_SMAG_U85202 73.79317        4471        2363
3   Chromobacterium_phragmitis_IIBBL_274_1 89.28943        4471        4477
4     Paludibacterium_purpuratum_CECT_8976 68.92051        4471        3762
5                        Vogesella_sp_XCS3 69.36104        4471        3618
6        Chromobacterium_piscinae_LABIM198 96.05034        4471        4687
7        Chromobacterium_subtsugae_MWU2387 84.85802        4471        5383
  Matched.count Proteome.cov. ID.param. Cov..param.
2          1771      0.518291       0.4         0.5
3          3642      0.814037       0.4         0.5
4          2167      0.526418       0.4         0.5
5          2164      0.535048       0.4         0.5
6          3877      0.846691       0.4         0.5
7          3430      0.696164       0.4         0.5
Note

We load the real EzAAI output (aai.tsv) using standard R functions and filter out self-comparisons.

2. AAI Distribution and Boundary Detection

To identify natural taxonomic boundaries, we analyze the distribution of AAI values. We utilize a density analysis approach to detect “valleys” (local minima) in the distribution curve. These valleys often correspond to the biological gap between intra-genus and inter-genus comparisons.

2.1. Mathematical Valley Detection

We employ the second derivative logic to identify inflection points where the curve hits a local minimum.

Code
# Calculate density of the AAI values
d <- density(datos_plot$AAI)

# Mathematical logic to find local minima (valleys)
# We look for points where the sign of the difference of the slope changes
valles_indices <- which(diff(sign(diff(d$y))) == 2) + 1
valles_x <- d$x[valles_indices]

# Filter for biologically relevant valleys (typically between 45% and 90%)
valles_reales <- valles_x[valles_x > 45 & valles_x < 90]

cat("Detected Potential Cutoffs (Valleys):", round(valles_reales, 2), "%\n")
Detected Potential Cutoffs (Valleys): 72.46 81.46 87.74 89.89 %

2.1.1. Visualization: AAI Histogram

Code
p1 <- ggplot(datos_plot, aes(x = AAI)) +
  # Histogram in background
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, fill = "#BDC3C7", color = NA, alpha=0.6) +
  # Density curve
  geom_density(color = "#2E86C1", linewidth = 1) +
  # Vertical lines at detected valleys
  geom_vline(xintercept = valles_reales, linetype = "dashed", color = "#C0392B", alpha=0.8, linewidth=0.7) +
  # Labels for the valleys
  annotate("text", x = valles_reales, y = 0, label = paste0(round(valles_reales, 1), "%"), 
           vjust = -1, color = "#C0392B", fontface = "bold", angle = 90, size = 4) +
  # Labels and Theme
  labs(title = "AAI Value Distribution and Genus/Family Boundaries",
       subtitle = "Red dashed lines indicate natural cutoffs (valleys) suggested by the data",
       x = "Amino Acid Identity (AAI %)", 
       y = "Density") +
  theme_minimal(base_size =14) +
  xlim(40, 100)

# Save the plot as SVG
ggsave("Figure1_AAI_Distribution.svg", plot = p1, width = 10, height = 6)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Code
# Display in report
print(p1)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).

3. Taxonomic Validation

In this section, we parse the filenames to extract Genus information. This allows us to ground-truth the AAI values against current taxonomy.

Assumption: Filenames follow the format Genus_species... (e.g., Chromobacterium_violaceum.fasta).

Code
# Helper function to extract Genus (string split by first underscore)
get_genus <- function(x) { 
  sapply(strsplit(as.character(x), "_"), `[`, 1) 
}

# Apply extraction to both labels
datos_plot$Gen1 <- get_genus(datos_plot$Label.1)
datos_plot$Gen2 <- get_genus(datos_plot$Label.2)

# Classify comparisons: Intra-Genus vs Inter-Genus
datos_plot$Categoria <- ifelse(datos_plot$Gen1 == datos_plot$Gen2, 
                               "Intra-Genus (Same)", 
                               "Inter-Genus (Different)")

# Show classification summary
table(datos_plot$Categoria)

Inter-Genus (Different)      Intra-Genus (Same) 
                  28224                   23078 

3.1. Visualization: Genomic Coherence Boxplot

This plot validates if the genomic data supports the assigned taxonomy. We expect a clear separation between “Same Genus” and “Different Genus” groups.

Code
p2 <- ggplot(datos_plot, aes(x = Categoria, y = AAI, fill = Categoria)) +
  geom_boxplot(outlier.alpha = 0.2, outlier.shape = 16, width = 0.5) +
  
  # Reference line (Standard Genus Cutoff ~65%)
  geom_hline(yintercept = 65, linetype = "dashed", color = "black", alpha=0.7) + 
  
  scale_fill_manual(values = c("#E74C3C", "#2ECC71")) + # Red/Green scheme
  
  # Labels
  labs(title = "Taxonomic Resolution (Intra vs Inter Genus)",
       subtitle = "Validation of genomic coherence in the dataset",
       y = "Amino Acid Identity (AAI %)", 
       x = "") +
  
  theme_minimal(base_size = 14) +
  theme(legend.position = "none", 
        axis.text.x = element_text(size=12, face="bold"))

# Save the plot as SVG
ggsave("Figure2_Taxonomic_Validation_Boxplot.svg", plot = p2, width = 8, height = 6)

# Display in report
print(p2)


4. Threshold Optimization (Graph Theory)

Instead of arbitrarily choosing a cutoff (e.g., 65%), we use an iterative graph-based approach to find the mathematically optimal AAI threshold. We test thresholds from 60% to 85% and score them based on cluster stability (penalizing overlaps and singletons).

Code
# Load necessary libraries for graph analysis
if(!require("igraph")) install.packages("igraph")
if(!require("dplyr")) install.packages("dplyr")
library(igraph)
library(dplyr)
Code
# 1. Prepare Data (Reuse loaded data)
# Create edge list from non-identical pairs
df_edges <- datos_plot[, c("Label.1", "Label.2", "AAI")]
colnames(df_edges) <- c("from", "to", "weight")

# Get all unique genomes
todos_genomas <- unique(c(df_edges$from, df_edges$to))
total_N <- length(todos_genomas)

# 2. Define Parameters
umbrales <- seq(60, 85, by = 0.5) # Steps of 0.5 for speed (use 0.1 for higher precision)

# Penalties (Based on Park et al 2022 / Barco et al 2020 logic)
PENALTY_OVERLAP <- total_N + 100  
PENALTY_SINGLETON <- 1            

resultados <- data.frame(Threshold = numeric(), Score = numeric(), 
                         Overlaps = numeric(), Singletons = numeric(), 
                         Clusters = numeric())

# 3. Optimization Loop
# We suppress the loop output to keep the report clean
for (t in umbrales) {
  
  # A. Filter network by threshold 't'
  edges_t <- df_edges[df_edges$weight >= t, ]
  
  # B. Build Undirected Graph
  g <- graph_from_data_frame(edges_t, directed = FALSE, vertices = NULL)
  
  # C. Find "Maximal Cliques"
  cliques_list <- max_cliques(g, min = 1)
  
  # D. Scoring Logic
  pertenencia <- unlist(cliques_list)
  conteo_pertenencia <- table(pertenencia)
  
  # Identify Overlaps
  nodos_overlap <- names(conteo_pertenencia[conteo_pertenencia > 1])
  num_overlaps <- length(nodos_overlap)
  
  # Identify Singletons
  if (vcount(g) > 0) {
    degrees <- degree(g)
    singletons_degree_0 <- sum(degrees == 0)
    nodos_en_grafo <- names(V(g))
  } else {
    singletons_degree_0 <- 0
    nodos_en_grafo <- c()
  }
  
  nodos_aislados_total <- setdiff(todos_genomas, nodos_en_grafo)
  total_singletons <- singletons_degree_0 + length(nodos_aislados_total)
  
  # E. Final Score Calculation
  score <- - (num_overlaps * PENALTY_OVERLAP) - (total_singletons * PENALTY_SINGLETON)
  
  resultados <- rbind(resultados, data.frame(Threshold = t, Score = score, 
                                             Overlaps = num_overlaps, 
                                             Singletons = total_singletons,
                                             Clusters = length(cliques_list)))
}

# 4. Select Best Result
mejor_res <- resultados[which.max(resultados$Score), ]

cat("Optimization Complete.\n")
Optimization Complete.
Code
cat("Optimal Threshold Found:", mejor_res$Threshold, "%\n")
Optimal Threshold Found: 60 %
Code
cat("Max Score:", mejor_res$Score, "\n")
Max Score: 0 

Optimization Curve

The plot below shows the stability score across different AAI thresholds. The peak represents the point where genera are best defined (minimized overlaps and appropriate grouping).

Code
cat("Generating optimization plot...\n")
Generating optimization plot...
Code
p_opt <- ggplot(resultados, aes(x = Threshold, y = Score)) +
  # Trend line
  geom_line(color = "#b4b4b4b8", linewidth = 0.7) +
  # Points
  geom_point(color = "#444444b8", size = 0.3, alpha = 0.5) +
  # Vertical guide to optimum
  geom_segment(data = mejor_res, 
               aes(x = Threshold, xend = Threshold, 
                   y = min(resultados$Score), yend = Score),
               linetype = "dashed", color = "#f79c9cff", linewidth = 0.5) +
  # Optimal point
  geom_point(data = mejor_res, aes(x = Threshold, y = Score), 
             color = "#f15959ff", size = 3.5) +
  # Label
  annotate("text", x = mejor_res$Threshold, y = mejor_res$Score, 
           label = paste0(mejor_res$Threshold, "%"), 
           vjust = -1.5, color = "#f15959ff", fontface = "bold", size = 4) +
  # Aesthetics (Translated labels to English)
  labs(title = "AAI Threshold Optimization (Genus Level)",
       subtitle = paste("Optimal Cutoff Detected at:", mejor_res$Threshold, "% AAI"),
       y = "Coherence Score", 
       x = "Identity Threshold (%)") +
  theme_classic(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    axis.text = element_text(color = "black"),
    axis.line = element_line(color = "black"),
    panel.grid.major.y = element_line(color = "grey90", linetype = "dotted")
  )

# Save and Print
ggsave("Figure3_AAI_Optimization.png", plot = p_opt, width = 8, height = 5, dpi = 300)
print(p_opt)

5. Interactive 3D Visualization (Manual Threshold)

Finally, we visualize the genomic distances in a 3-Dimensional space using Principal Coordinates Analysis (PCoA/MDS). The coloring corresponds to the clusters defined by the manual threshold.

Code
# --- 1. LIBRARIES ---
library(plotly)
library(reshape2)
library(dplyr)
library(htmlwidgets)

# ==========================================
#      PARAMETER (MANUAL INPUT)
# ==========================================
# Aquí defines el valor manualmente, ignorando la optimización anterior
UMBRAL_CORTE <- 81.5   # <--- ¡CAMBIA ESTE NÚMERO AQUÍ!
cat(paste0("Generating 3D Map using Manual Threshold: ", UMBRAL_CORTE, "%\n"))
Generating 3D Map using Manual Threshold: 81.5%
Code
# ==========================================

# --- 2. PREPARE DATA ---
# We use 'datos_plot' which is already loaded and cleaned in Section 2

# Convert to Distance Matrix
matriz_dist <- dcast(datos_plot, Label.1 ~ Label.2, value.var = "AAI", fill = 0)
rownames(matriz_dist) <- matriz_dist$Label.1
matriz_dist <- matriz_dist[,-1]
mat_sim <- as.matrix(matriz_dist)
mat_sim <- pmax(mat_sim, t(mat_sim), na.rm=TRUE) 
dist_final <- 100 - mat_sim # Distance = 100 - AAI

# --- 3. DYNAMIC CLUSTERING ---
# Hierarchical Clustering (Average Linkage / UPGMA)
hc <- hclust(as.dist(dist_final), method = "average")

# Cut tree at defined height
altura_corte <- 100 - UMBRAL_CORTE
grupos <- cutree(hc, h = altura_corte)

# --- 4. 3D COORDINATES CALCULATION (MDS/PCoA) ---
mds <- cmdscale(dist_final, k = 3, eig = TRUE)

# Create 3D Data Frame
df_plot <- data.frame(Dim1 = mds$points[, 1], 
                      Dim2 = mds$points[, 2],
                      Dim3 = mds$points[, 3])

df_plot$Genoma <- rownames(df_plot)

# Add Cluster Info and Original Genus
df_plot$Cluster <- factor(grupos) 
df_plot$Cluster_ID <- paste("Cluster", df_plot$Cluster)
df_plot$Genero_Original <- get_genus(df_plot$Genoma) # Reuse function from Sec 4

# --- 5. GENERATE INTERACTIVE 3D PLOT ---

# Define dynamic palette
num_colores <- length(unique(df_plot$Cluster))
colores_dinamicos <- colorRampPalette(c("#E74C3C", "#3498DB", "#2ECC71", "#9B59B6", "#F1C40F", "#1ABC9C", "#34495E"))(num_colores)

# Create Plotly Object
fig <- plot_ly(data = df_plot, 
               x = ~Dim1, 
               y = ~Dim2, 
               z = ~Dim3, 
               color = ~Cluster_ID, 
               colors = colores_dinamicos,
               type = 'scatter3d', 
               mode = 'markers',
               marker = list(size = 5, 
                             opacity = 0.8,
                             line = list(width = 0.5, color = 'white')),
               text = ~paste("<b>Genome:</b>", Genoma,
                             "<br><b>Cluster:</b>", Cluster_ID,
                             "<br><b>NCBI Genus:</b>", Genero_Original),
               hoverinfo = "text")

# Customize Layout
fig <- fig %>% layout(
  title = paste0("Genomic 3D Map (Manual Threshold: ", UMBRAL_CORTE, "%)"),
  scene = list( 
    xaxis = list(title = "Dim 1", showgrid = TRUE),
    yaxis = list(title = "Dim 2", showgrid = TRUE),
    zaxis = list(title = "Dim 3 (Depth)", showgrid = TRUE),
    camera = list(eye = list(x = 1.5, y = 1.5, z = 1.5)) 
  ),
  legend = list(title = list(text = "<b>Clusters</b>"))
)

# Save as external HTML file
nombre_archivo <- paste0("Map_3D_Threshold_", UMBRAL_CORTE, ".html")
saveWidget(as_widget(fig), nombre_archivo)

cat(paste0("Success! 3D file saved as: '", nombre_archivo, "'\n"))
Success! 3D file saved as: 'Map_3D_Threshold_81.5.html'
Code
# Print inside the Quarto report
fig

6. Cluster Summary Table

Below is the detailed list of genomes and their assigned clusters based on the manual threshold defined above. You can filter and search specific genomes in this table.