Species Demarcation Analysis: Average Nucleotide Identity (ANI)

Author

Tu Nombre

Published

January 14, 2026

1. Bioinformatics Pipeline: PyANI

To assess genomic relatedness at the species level, we calculated the Average Nucleotide Identity (ANI) using PyANI (v0.2.10). Specifically, we employed the ANIb module, which uses BLASTN+ to align 1,020 bp genomic fragments.

1.1. Environment Setup & Installation

The analysis was performed in a dedicated Conda environment to ensure dependency compatibility.

# Create and activate environment
conda create –name pyani_env python=3.8 -y
conda activate pyani_env

# Install dependencies
conda install mummer blast legacy-blast -y
conda install pyani

1.2. Execution of the ANI Analysis

The Average Nucleotide Identity (ANI) was calculated using the ANIb method (BLASTN+), which is considered the standard for species delineation. The analysis was executed from the command line using the average_nucleotide_identity.py script provided by the PyANI package.

Primary Execution Command:

The following command was used to initiate the all-vs-all comparison, generating Excel tables and SVG graphics:

average_nucleotide_identity.py -i . -o ANIb_output -m ANIb -g -v --write_excel --gformat svg

Command Parameters:

  • -i .: Specifies the current directory (.) as the input source containing the genome FASTA files.
  • -o ANIb_output: Designates the directory where all result files and figures will be saved.
  • -m ANIb: Selects the ANIb method, which uses BLASTN+ to align 1,020 bp genomic fragments (Gold Standard).
  • -g: Enables the generation of visualization graphics (heatmaps and dendrograms).
  • --gformat svg: Sets the output format for figures to SVG (Scalable Vector Graphics), ensuring high resolution for publication.
  • --write_excel: Generates additional result matrices in Excel format (.xlsx) for easy manual inspection.
  • -v: Activates verbose mode to provide detailed logging of the process in the terminal.

Resuming Interrupted Runs:

Given the high computational cost of all-vs-all BLASTN alignments, creating a checkpoint strategy is essential. In instances where the process was interrupted (e.g., server timeout) after the generation of .blast files but before matrix calculation, the analysis was resumed without re-aligning sequences using the following command:

average_nucleotide_identity.py -i . -o ANIb_output -m ANIb -v --skip_blastn --noclobber --nocompress --force --write_excel

Recovery Parameters:

  • --skip_blastn: This is the critical flag for resumption. It instructs the pipeline to bypass the computationally intensive BLASTN alignment step, utilizing the pre-existing .blast output files found in the directory.
  • --noclobber: Prevents the overwriting of existing alignment files, ensuring that the pipeline only processes missing data or generates the final tables without altering raw results.
  • --force: Forces the script to execute even if the output directory (ANIb_output) already exists. By default, PyANI halts to prevent accidental data loss; this flag overrides that safety check for resumption.
  • --nocompress: Disables the compression of intermediate files. This optimization reduces the I/O overhead during the re-parsing of thousands of BLAST files, speeding up the matrix generation.

Execution Verification:

Successful completion of the analysis was verified by the following log output. Note the “Skipping BLASTN runs” warning, which confirms that the resumption flags correctly utilized the pre-calculated alignment files, focusing the computational effort solely on matrix generation:

INFO: Carrying out ANIb analysis
INFO: Running ANIb
INFO: Writing BLAST output to ANIb_output/blastn_output
WARNING: Skipping BLASTN runs (as instructed)!
INFO: Processing pairwise ANIb BLAST output.
INFO: Writing ANIb results to ANIb_output
INFO:   ANIb_alignment_lengths
INFO:   ANIb_percentage_identity
INFO:   ANIb_alignment_coverage
INFO:   ANIb_similarity_errors
INFO:   ANIb_hadamard
INFO: Done: Tue May 14 11:42:36 2024.
INFO: Time taken: 890.59s

2. Statistical Analysis & Clustering in R

Once the ANIb_percentage_identity.tab matrix was generated, we processed it in R to define species clusters based on the 95-96% threshold and visualize the data in 3D.

2.1. Load Libraries and Data

We use igraph for network-based clustering and plotly for 3D visualization. The script automatically detects if the input data is in a 0-1 scale (default PyANI output) or 0-100% and normalizes it.

Code
# Load required libraries
if (!require("igraph")) install.packages("igraph")
if (!require("plotly")) install.packages("plotly")
if (!require("reshape2")) install.packages("reshape2")
if (!require("dplyr")) install.packages("dplyr")

library(igraph)
library(plotly)
library(reshape2)
library(dplyr)

# File configuration
INPUT_FILE <- "/Users/abraham/Documents/Articulos_ab/Chromobacterium/Cap_2/ANIb_output/ANIb_percentage_identity.tab" 
THRESHOLD <- 95  # The gold standard for species demarcation

# Load Data
if(file.exists(INPUT_FILE)){
  # check.names=FALSE is crucial to maintain original filenames (avoids R replacing hyphens with dots)
  matriz_ani <- read.table(INPUT_FILE, header = TRUE, sep = "\t", row.names = 1, check.names = FALSE)
  matriz_ani <- as.matrix(matriz_ani)
  
  # AUTO-CORRECTION: Check if scale is 0-1 (PyANI default) and convert to 0-100%
  if(max(matriz_ani, na.rm=TRUE) <= 1.0){
    matriz_ani <- matriz_ani * 100
  }
} else {
  stop("Input file not found. Please ensure 'ANIb_percentage_identity.tab' is in the working directory.")
}

2.2. Species Clustering (Graph-Based)

To define species boundaries, we employed a network-based clustering approach. A graph \(G(V, E)\) was constructed where \(V\) represents the set of genomes and \(E\) represents the connections between them. An edge is drawn between two genomes if and only if their ANI value meets the species demarcation threshold (ANI ≥ 95%).

Consequently, each connected component in the graph represents a putative species (Operational Taxonomic Unit). Genomes that do not establish any connection above the threshold are considered singletons, representing unique or potentially novel species within the dataset.

Code
# 1. Transform Matrix to Edge List
# We convert the wide matrix into a long format (Source, Target, Weight) to facilitate filtering.
df_edges <- melt(matriz_ani)
colnames(df_edges) <- c("Genoma1", "Genoma2", "ANI")

# 2. Apply Species Threshold
# We filter out self-loops (Genome A vs Genome A) and any connection below 95%.
# Only strong links remain.
edges_fuertes <- subset(df_edges, ANI >= THRESHOLD & as.character(Genoma1) != as.character(Genoma2))

# 3. Construct the Network
g <- graph_from_data_frame(edges_fuertes, directed = FALSE)

# 4. Handle Singletons (Crucial Step)
# Identify genomes that were filtered out because they had no close relatives (ANI < 95%)
# and add them back to the graph as isolated nodes.
todos_genomas <- rownames(matriz_ani)
nodos_en_grafo <- V(g)$name
nodos_aislados <- setdiff(todos_genomas, nodos_en_grafo)

if(length(nodos_aislados) > 0){
  g <- add_vertices(g, length(nodos_aislados), name = nodos_aislados)
}

# 5. Extract Clusters
# Each connected component corresponds to a species.
comp <- components(g)
clusters <- comp$membership

# 6. Create Results Table
df_res <- data.frame(Genome = names(clusters), Cluster_Num = as.numeric(clusters))
df_res$Species_ID <- paste0("Species_", df_res$Cluster_Num)

# Helper: Extract original Genus from filename for validation
get_genus <- function(x) { 
  parts <- strsplit(as.character(x), "_")[[1]]
  if(length(parts) > 0) return(parts[1]) else return("Unknown")
}
df_res$Original_Genus <- sapply(df_res$Genome, get_genus)

# Output Summary
cat(paste0("Analysis Summary:\n"))
Analysis Summary:
Code
cat(paste0("-----------------\n"))
-----------------
Code
cat(paste0("Total Genomes Analyzed: ", nrow(df_res), "\n"))
Total Genomes Analyzed: 227
Code
cat(paste0("Total Species Detected: ", max(df_res$Cluster_Num), "\n"))
Total Species Detected: 61
Code
cat(paste0("Singletons (Unique Species): ", length(nodos_aislados), "\n"))
Singletons (Unique Species): 29
Code
# Show first rows
head(df_res)
                          Genome Cluster_Num Species_ID Original_Genus
1             Aquitalea_sp_FJL05           1  Species_1      Aquitalea
2         Aquitalea_sp_SHSU_b289           2  Species_2      Aquitalea
3 Aquitalea_magnusonii_DSM_25134           3  Species_3      Aquitalea
4             Aquitalea_sp_ASV15           3  Species_3      Aquitalea
5  Aquitalea_magnusonii_CCM_7607           3  Species_3      Aquitalea
6   Aquitalea_pelogenes_CCM_7557           4  Species_4      Aquitalea

2.3. 3D Ordination (Metric Multidimensional Scaling)

To provide a spatial verification of the defined boundaries, we projected the high-dimensional genomic relationships into a 3D Euclidean space using Metric Multidimensional Scaling (MDS), also known as Principal Coordinates Analysis (PCoA).

This ordination technique requires a dissimilarity matrix (distance) rather than a similarity matrix. Therefore, ANI values were transformed using the formula \(D = 100 - ANI\). In this projected space, genomes belonging to the same species are expected to form tight, cohesive clusters, separated by empty space (the “species gap”) from other groups.

Code
# 1. Transform Similarity to Distance
# MDS requires a distance matrix. Since ANI is similarity (0-100%),
# we calculate Dissimilarity = 100 - ANI.
dist_matrix <- 100 - matriz_ani

# Ensure mathematical consistency: The distance from a genome to itself must be exactly 0.
diag(dist_matrix) <- 0 

# 2. Perform MDS (k=3 dimensions)
# We use 'cmdscale' (Classical Multidimensional Scaling) to reduce dimensions to 3 (X, Y, Z).
mds_fit <- cmdscale(dist_matrix, k = 3)

# 3. Create Coordinate Dataframe
df_mds <- data.frame(Dim1 = mds_fit[,1], Dim2 = mds_fit[,2], Dim3 = mds_fit[,3])
df_mds$Genome <- rownames(df_mds)

# 4. Integrate with Clustering Results
# We merge the 3D coordinates with the Species ID assigned in the previous step.
df_final <- merge(df_mds, df_res, by = "Genome")

# Preview the combined dataset
head(df_final)
                          Genome      Dim1     Dim2      Dim3 Cluster_Num
1  Aquitalea_aquatica_HSC_21Su07 -5.710992 5.259397 -11.71246          33
2 Aquitalea_aquatilis_THG_DN7_12 -5.583183 5.048127 -11.46175           1
3 Aquitalea_denitrificans_5YN1_3 -5.646146 5.129755 -10.73394           2
4  Aquitalea_magnusonii_CCM_7607 -5.422862 4.964157 -11.42690           3
5 Aquitalea_magnusonii_DSM_25134 -5.486581 5.054358 -11.63526           3
6        Aquitalea_magnusonii_H3 -5.648896 5.018936 -10.77088          34
  Species_ID Original_Genus
1 Species_33      Aquitalea
2  Species_1      Aquitalea
3  Species_2      Aquitalea
4  Species_3      Aquitalea
5  Species_3      Aquitalea
6 Species_34      Aquitalea

2.4. Interactive 3D Visualization

To facilitate the visual inspection of the taxonomic structure, we generated an interactive 3D scatter plot. In this visualization:

  • Points represent individual genomes.
  • Color indicates the Species Cluster assigned by our graph-based algorithm (95% ANI).
  • Shape represents the Original Genus label derived from the input filename.

This visualization allows for a rapid assessment of taxonomic consistency. Ideally, points of the same color should form a single, isolated cloud. If a genome with a specific shape (e.g., Chromobacterium) falls into a color cluster dominated by another shape (e.g., Vogesella), it suggests a potential misclassification in public databases or a synonymy.

Code
# 1. Define Color Palette
# We create a dynamic palette to ensure we have enough distinct colors for all detected species.
n_especies <- length(unique(df_final$Species_ID))
mis_colores <- colorRampPalette(c("#E74C3C", "#3498DB", "#2ECC71", "#9B59B6", "#F1C40F", "#1ABC9C", "#34495E"))(n_especies)

# 2. Generate Plotly Object
fig <- plot_ly(data = df_final, 
               x = ~Dim1, y = ~Dim2, z = ~Dim3,
               color = ~Species_ID,
               colors = mis_colores,
               symbol = ~Original_Genus, 
               symbols = c('circle', 'square', 'diamond', 'cross', 'x', 'triangle-up'),
               type = 'scatter3d', 
               mode = 'markers',
               marker = list(size = 5, opacity = 0.8, line = list(width = 0.5, color = 'white')),
               text = ~paste("<b>Genome:</b>", Genome,
                             "<br><b>Assigned Species:</b>", Species_ID,
                             "<br><b>Ref Genus:</b>", Original_Genus),
               hoverinfo = "text")

# 3. Customize Layout
fig <- fig %>% layout(
  title = paste0("Genomic Space 3D (Species Cutoff: ", THRESHOLD, "%)"),
  scene = list(
    xaxis = list(title = "Dim 1 (MDS)"),
    yaxis = list(title = "Dim 2 (MDS)"),
    zaxis = list(title = "Dim 3 (MDS)")
  ),
  legend = list(title = list(text = "<b>Species Clusters</b>")),
  margin = list(l=0, r=0, b=0, t=50)
)

# Render the plot
fig

2.5. Export Results

Finally, we export the taxonomic classification results to a tab-separated value (TSV) file. This table serves as the definitive guide for species assignment in this study, linking each genome to its computed Species Cluster and its original label.

Code
# Define output filename
OUTPUT_TABLE <- "Tabla_Final_Especies_ANI.tsv"

# Select key columns for the final report
# We sort by Cluster Number to group species together in the output list
df_export <- df_final[, c("Genome", "Species_ID", "Original_Genus", "Cluster_Num")]
df_export <- df_export[order(df_export$Cluster_Num), ]

# Write to file
write.table(df_export, 
            file = OUTPUT_TABLE, 
            sep = "\t", 
            row.names = FALSE, 
            quote = FALSE)

# Confirmation message
cat(paste0("✅ Results successfully exported to: ", OUTPUT_TABLE, "\n"))
✅ Results successfully exported to: Tabla_Final_Especies_ANI.tsv
Code
cat("You can now open this file in Excel to inspect the specific species assignments.")
You can now open this file in Excel to inspect the specific species assignments.