Species Demarcation Analysis: Average Nucleotide Identity (ANI)
Author
Tu Nombre
Published
January 14, 2026
1. Bioinformatics Pipeline: PyANI
To assess genomic relatedness at the species level, we calculated the Average Nucleotide Identity (ANI) using PyANI (v0.2.10). Specifically, we employed the ANIb module, which uses BLASTN+ to align 1,020 bp genomic fragments.
1.1. Environment Setup & Installation
The analysis was performed in a dedicated Conda environment to ensure dependency compatibility.
The Average Nucleotide Identity (ANI) was calculated using the ANIb method (BLASTN+), which is considered the standard for species delineation. The analysis was executed from the command line using the average_nucleotide_identity.py script provided by the PyANI package.
Primary Execution Command:
The following command was used to initiate the all-vs-all comparison, generating Excel tables and SVG graphics:
-i .: Specifies the current directory (.) as the input source containing the genome FASTA files.
-o ANIb_output: Designates the directory where all result files and figures will be saved.
-m ANIb: Selects the ANIb method, which uses BLASTN+ to align 1,020 bp genomic fragments (Gold Standard).
-g: Enables the generation of visualization graphics (heatmaps and dendrograms).
--gformat svg: Sets the output format for figures to SVG (Scalable Vector Graphics), ensuring high resolution for publication.
--write_excel: Generates additional result matrices in Excel format (.xlsx) for easy manual inspection.
-v: Activates verbose mode to provide detailed logging of the process in the terminal.
Resuming Interrupted Runs:
Given the high computational cost of all-vs-all BLASTN alignments, creating a checkpoint strategy is essential. In instances where the process was interrupted (e.g., server timeout) after the generation of .blast files but before matrix calculation, the analysis was resumed without re-aligning sequences using the following command:
--skip_blastn: This is the critical flag for resumption. It instructs the pipeline to bypass the computationally intensive BLASTN alignment step, utilizing the pre-existing .blast output files found in the directory.
--noclobber: Prevents the overwriting of existing alignment files, ensuring that the pipeline only processes missing data or generates the final tables without altering raw results.
--force: Forces the script to execute even if the output directory (ANIb_output) already exists. By default, PyANI halts to prevent accidental data loss; this flag overrides that safety check for resumption.
--nocompress: Disables the compression of intermediate files. This optimization reduces the I/O overhead during the re-parsing of thousands of BLAST files, speeding up the matrix generation.
Execution Verification:
Successful completion of the analysis was verified by the following log output. Note the “Skipping BLASTN runs” warning, which confirms that the resumption flags correctly utilized the pre-calculated alignment files, focusing the computational effort solely on matrix generation:
Once the ANIb_percentage_identity.tab matrix was generated, we processed it in R to define species clusters based on the 95-96% threshold and visualize the data in 3D.
2.1. Load Libraries and Data
We use igraph for network-based clustering and plotly for 3D visualization. The script automatically detects if the input data is in a 0-1 scale (default PyANI output) or 0-100% and normalizes it.
Code
# Load required librariesif (!require("igraph")) install.packages("igraph")if (!require("plotly")) install.packages("plotly")if (!require("reshape2")) install.packages("reshape2")if (!require("dplyr")) install.packages("dplyr")library(igraph)library(plotly)library(reshape2)library(dplyr)# File configurationINPUT_FILE <-"/Users/abraham/Documents/Articulos_ab/Chromobacterium/Cap_2/ANIb_output/ANIb_percentage_identity.tab"THRESHOLD <-95# The gold standard for species demarcation# Load Dataif(file.exists(INPUT_FILE)){# check.names=FALSE is crucial to maintain original filenames (avoids R replacing hyphens with dots) matriz_ani <-read.table(INPUT_FILE, header =TRUE, sep ="\t", row.names =1, check.names =FALSE) matriz_ani <-as.matrix(matriz_ani)# AUTO-CORRECTION: Check if scale is 0-1 (PyANI default) and convert to 0-100%if(max(matriz_ani, na.rm=TRUE) <=1.0){ matriz_ani <- matriz_ani *100 }} else {stop("Input file not found. Please ensure 'ANIb_percentage_identity.tab' is in the working directory.")}
2.2. Species Clustering (Graph-Based)
To define species boundaries, we employed a network-based clustering approach. A graph \(G(V, E)\) was constructed where \(V\) represents the set of genomes and \(E\) represents the connections between them. An edge is drawn between two genomes if and only if their ANI value meets the species demarcation threshold (ANI ≥ 95%).
Consequently, each connected component in the graph represents a putative species (Operational Taxonomic Unit). Genomes that do not establish any connection above the threshold are considered singletons, representing unique or potentially novel species within the dataset.
Code
# 1. Transform Matrix to Edge List# We convert the wide matrix into a long format (Source, Target, Weight) to facilitate filtering.df_edges <-melt(matriz_ani)colnames(df_edges) <-c("Genoma1", "Genoma2", "ANI")# 2. Apply Species Threshold# We filter out self-loops (Genome A vs Genome A) and any connection below 95%.# Only strong links remain.edges_fuertes <-subset(df_edges, ANI >= THRESHOLD &as.character(Genoma1) !=as.character(Genoma2))# 3. Construct the Networkg <-graph_from_data_frame(edges_fuertes, directed =FALSE)# 4. Handle Singletons (Crucial Step)# Identify genomes that were filtered out because they had no close relatives (ANI < 95%)# and add them back to the graph as isolated nodes.todos_genomas <-rownames(matriz_ani)nodos_en_grafo <-V(g)$namenodos_aislados <-setdiff(todos_genomas, nodos_en_grafo)if(length(nodos_aislados) >0){ g <-add_vertices(g, length(nodos_aislados), name = nodos_aislados)}# 5. Extract Clusters# Each connected component corresponds to a species.comp <-components(g)clusters <- comp$membership# 6. Create Results Tabledf_res <-data.frame(Genome =names(clusters), Cluster_Num =as.numeric(clusters))df_res$Species_ID <-paste0("Species_", df_res$Cluster_Num)# Helper: Extract original Genus from filename for validationget_genus <-function(x) { parts <-strsplit(as.character(x), "_")[[1]]if(length(parts) >0) return(parts[1]) elsereturn("Unknown")}df_res$Original_Genus <-sapply(df_res$Genome, get_genus)# Output Summarycat(paste0("Analysis Summary:\n"))
2.3. 3D Ordination (Metric Multidimensional Scaling)
To provide a spatial verification of the defined boundaries, we projected the high-dimensional genomic relationships into a 3D Euclidean space using Metric Multidimensional Scaling (MDS), also known as Principal Coordinates Analysis (PCoA).
This ordination technique requires a dissimilarity matrix (distance) rather than a similarity matrix. Therefore, ANI values were transformed using the formula \(D = 100 - ANI\). In this projected space, genomes belonging to the same species are expected to form tight, cohesive clusters, separated by empty space (the “species gap”) from other groups.
Code
# 1. Transform Similarity to Distance# MDS requires a distance matrix. Since ANI is similarity (0-100%),# we calculate Dissimilarity = 100 - ANI.dist_matrix <-100- matriz_ani# Ensure mathematical consistency: The distance from a genome to itself must be exactly 0.diag(dist_matrix) <-0# 2. Perform MDS (k=3 dimensions)# We use 'cmdscale' (Classical Multidimensional Scaling) to reduce dimensions to 3 (X, Y, Z).mds_fit <-cmdscale(dist_matrix, k =3)# 3. Create Coordinate Dataframedf_mds <-data.frame(Dim1 = mds_fit[,1], Dim2 = mds_fit[,2], Dim3 = mds_fit[,3])df_mds$Genome <-rownames(df_mds)# 4. Integrate with Clustering Results# We merge the 3D coordinates with the Species ID assigned in the previous step.df_final <-merge(df_mds, df_res, by ="Genome")# Preview the combined datasethead(df_final)
To facilitate the visual inspection of the taxonomic structure, we generated an interactive 3D scatter plot. In this visualization:
Points represent individual genomes.
Color indicates the Species Cluster assigned by our graph-based algorithm (95% ANI).
Shape represents the Original Genus label derived from the input filename.
This visualization allows for a rapid assessment of taxonomic consistency. Ideally, points of the same color should form a single, isolated cloud. If a genome with a specific shape (e.g., Chromobacterium) falls into a color cluster dominated by another shape (e.g., Vogesella), it suggests a potential misclassification in public databases or a synonymy.
Code
# 1. Define Color Palette# We create a dynamic palette to ensure we have enough distinct colors for all detected species.n_especies <-length(unique(df_final$Species_ID))mis_colores <-colorRampPalette(c("#E74C3C", "#3498DB", "#2ECC71", "#9B59B6", "#F1C40F", "#1ABC9C", "#34495E"))(n_especies)# 2. Generate Plotly Objectfig <-plot_ly(data = df_final, x =~Dim1, y =~Dim2, z =~Dim3,color =~Species_ID,colors = mis_colores,symbol =~Original_Genus, symbols =c('circle', 'square', 'diamond', 'cross', 'x', 'triangle-up'),type ='scatter3d', mode ='markers',marker =list(size =5, opacity =0.8, line =list(width =0.5, color ='white')),text =~paste("<b>Genome:</b>", Genome,"<br><b>Assigned Species:</b>", Species_ID,"<br><b>Ref Genus:</b>", Original_Genus),hoverinfo ="text")# 3. Customize Layoutfig <- fig %>%layout(title =paste0("Genomic Space 3D (Species Cutoff: ", THRESHOLD, "%)"),scene =list(xaxis =list(title ="Dim 1 (MDS)"),yaxis =list(title ="Dim 2 (MDS)"),zaxis =list(title ="Dim 3 (MDS)") ),legend =list(title =list(text ="<b>Species Clusters</b>")),margin =list(l=0, r=0, b=0, t=50))# Render the plotfig
2.5. Export Results
Finally, we export the taxonomic classification results to a tab-separated value (TSV) file. This table serves as the definitive guide for species assignment in this study, linking each genome to its computed Species Cluster and its original label.
Code
# Define output filenameOUTPUT_TABLE <-"Tabla_Final_Especies_ANI.tsv"# Select key columns for the final report# We sort by Cluster Number to group species together in the output listdf_export <- df_final[, c("Genome", "Species_ID", "Original_Genus", "Cluster_Num")]df_export <- df_export[order(df_export$Cluster_Num), ]# Write to filewrite.table(df_export, file = OUTPUT_TABLE, sep ="\t", row.names =FALSE, quote =FALSE)# Confirmation messagecat(paste0("✅ Results successfully exported to: ", OUTPUT_TABLE, "\n"))
✅ Results successfully exported to: Tabla_Final_Especies_ANI.tsv
Code
cat("You can now open this file in Excel to inspect the specific species assignments.")
You can now open this file in Excel to inspect the specific species assignments.