Treatment of Polytomies in Morphological Phylogenetic Analysis Using Clustering Algorithms - A Study of Burkesuchus mallingrandensis and Arackar licanantay

1. Introduction

Palaeontology has increasingly incorporated technological tools to manage, analyse, and experiment with fossil data, a practice formalised as Palaeoinformatics (Dolven and Skjerpen 2011). This field integrates technology to organise palaeontological information, enabling researchers to process large datasets and conduct more precise analyses of evolutionary history (Lawing and Matzke 2014) (Marshall et al. 2018). One of the main applications of palaeoinformatics is reconstructing the evolutionary history (Cunningham et al. 2014) (López-Antoñanzas et al. 2022).

Phylogenetic inference aims to reconstruct evolutionary history by determining relationships among organisms, typically as phylogenetic trees (Kubatko 2008). These trees are created after analyzing morphological or molecular characteristics and are evaluated using various optimization criteria such as parsimony, likelihood, minimum evolution, or least squares (Kubatko 2008) (Bryant and Waddell 1998) (WILLIAMS and EBACH 2010) (Villalobos-Cid et al. 2019). One of the most commonly used methods in paleontology is maximum parsimony (Goloboff, Torres, and Arias 2017) (Schrago, Aguiar, and Mello 2018) (Sansom et al. 2018). However, the application of various optimization methods can result in different topologies that explain the evolutionary history of species with the same number of changes (Wilkinson and Benton 1996) (Heled and Bouckaert 2013) (Gori et al. 2015) (Villalobos-Cid et al. 2022). To obtain a single topology, various consensus techniques (e.g., strict consensus, semi-strict consensus, majority rule) are applied, combining all the trees generated after the application of an optimization method (Müller and Dias-da-Silva 2017) (Villalobos-Cid, Salinas, and Inostroza-Ponta 2020) (Huson and Cetinkaya 2023) (Lovegrove, Upchurch, and Barrett 2024). The problem is that this consensus topology will not necessarily preserve the score obtained by the parsimony method, it will lose information about the divergent evolutionary traits present in the original topologies, and it may also produce polytomies that is, nodes with more than two descendant lineages (Kornilios 2017). These polytomies arise due to ambiguous or incomplete data, convergent evolution, or rapid evolutionary events. These issues complicate the study of morphological rates of change and the integration of temporal information into phylogenetic trees (Dolven and Skjerpen 2011) (Larson et al. 2020) (Flores, Lehtonen, and Hyvönen 2021).

Therefore, a pipeline is proposed to prevent the occurrence of polytomies and the issues arising from them. This proposed pipeline combines medoid and clustering techniques to select a single representative topology from the phylogenetic landscape of trees obtained under the máximum parsimony criterion, preserving the optimisation score while resolving polytomies. Additionally, clustering methods allow a comprehensive exploration of the phylogenetic landscape, providing deeper insights into the diversity of plausible evolutionary scenarios. By clarifying evolutionary relationships, the pipeline facilitates the integration of temporal data and the analysis of morphological changes within the study group.

To compare and analyze results, two case studies of fossils found in Chile were used: the reptile Burkesuchus mallingrandensis (Novas et al. 2021) and the dinosaur Arackar licanantay (Rubilar et al. 2021). This repository presents the code and the results obtained for both case studies. In Section Two, there is an analysis of the data. First, the necessary libraries for running the project are listed. Then, the functions created for formatting tables, calculating hypervolume, and implementing the seven clustering algorithms used (K-Means, PAM, FANNY, CLARA, DBSCAN, SOM, MST-knn) are defined. Then, the results obtained for each case study are presented.

2. Data analysis

The data analysis section is divided into four parts. First, the necessary libraries for the code to run are listed. Second, all the functions required for formatting tables, performing calculations, and implementing the seven clustering methods applied in the data analysis are presented. Third, the section displays the results obtained for the prehistoric reptile Burkesuchus mallingrandensis. Finally, the last section presents the results obtained for Arackar licanantay.

2.1 Required Libraries

Below, all the libraries that must be installed to run the complete code are listed, along with a brief description of each, indicating why they were used or if there are any special installation requirements.

library("TreeTools")        # Create, Modify and Analyse Phylogenetic Trees - permite trabajar con TNT
library("phangorn")         # Phylogenetic Reconstruction and Analysis
library("phytools")         # Phylogenetic Tools for Comparative Biology (and Other Things)
library("mstknnclust")      # MST-kNN Clustering Algorithm
library("ape")              # Analyses of Phylogenetics and Evolution 
library("tidyverse")        # It contains various packages designed for data manipulation and analysis.
library("cluster")          # "Finding Groups in Data": Cluster Analysis Extended
library("factoextra")       # Extract and Visualize the Results of Multivariate Data Analyses
library("NbClust")          # Determining the Best Number of Clusters in a Data Set
library("ggplot2")          # Create Elegant Data Visualisations Using the Grammar of Graphics
library("plotly")           # Package to create a variety of interactive graphics
library("reshape2")         # Flexibly restructure and aggregate data
library("knitr")            # A General-Purpose Package for Dynamic Report Generation
library("kableExtra")       # Help to build common complex tables -> devtools::install_github("kupietz/kableExtra")
library("clValid")          # Validation of Clustering Results
library("fpc")              # Flexible Procedures for Clustering -> uso para DBSCAN
library("dbscan")           # Density-Based Spatial Clustering of Applications with Noise and Related Algorithms
library("kohonen")          #  Supervised and Unsupervised Self-Organising Maps -> se usa para utilizar SOM
library("mstknnclust")      # MST-kNN Clustering Algorithm 
library('ggfortify')        # Data Visualization Tools for Statistical Analysis Results -> Para bosque filogenético
library("ggforce")          # Extension to "ggplot2" with a collection of mainly new stats and geoms 
library("ggtree")           # install.packages("BiocManager") BiocManager::install("ggtree")
library("treeio")           # Package to make it easier to import and store phylogenetic tree with associated data
library("strap")            # Stratigraphic Tree Analysis for Palaeontology -> Inclusión de tiempo
library("ecr")              # Evolutionary Computation in R

2.2 Functions

Below are all the functions created for the implementation of this project in the field of paleoinformatics.

2.2.1 Function to format tables

The first function created is used to format all the tables that will be presented throughout this documentation.

formato_tablas  = function(tabla){
  kable(tabla, table.attr = "style='width:60%;'", booktabs = TRUE, lines = "full") %>%
       kable_styling(font_size = 14)
}

2.2.2 Function to calculate hypervolume

A function is created to calculate the hypervolume, which takes as input a table with the indices obtained (Dunn, Connectivity, Silhouette) from the various clustering methods. In the case of Dunn and Connectivity, the values are converted to negative numbers since these indices maximize values, and what we aim for in this project is minimization. The retrieved values are normalized between 0 and 1, then converted into a matrix, and the hypervolume is calculated relative to the point (2,2,2,2).

# FUNCION PARA CALCULAR EL HIPERVOLUMEN ----------------------------------------------------------------------
# El objetivo de la funcion es minimizar la cantidad de clusters, minimizar -Dunn, Connectivity
# y -Silhouette
hipervolumen = function(tabla_indices){
  # Primero se normalizan los datos de la tabla
  tabla_tmp_norm = tabla_indices
  # Convertir a negativos los valores de Dunn y Silhouette, ya que estos índices maximizan valores
  # y lo que se busca es minimizar para calcular el hipervolumen
  tabla_tmp_norm[,2]= as.numeric(tabla_tmp_norm[,2])
  tabla_tmp_norm[,3] = -as.numeric(tabla_tmp_norm[,3])
  tabla_tmp_norm[,4]= as.numeric(tabla_tmp_norm[,4])
  tabla_tmp_norm[,5] = -as.numeric(tabla_tmp_norm[,5])
  
  for (a in 2:ncol(tabla_tmp_norm)){
    tabla_tmp_norm[,a] = round((tabla_tmp_norm[,a]-min(tabla_tmp_norm[,a]))/(max(tabla_tmp_norm[,a])-min(tabla_tmp_norm[,a])),4)
  }
  
  # Convertir los datos normalizados en una matriz
  tmp = t(as.matrix(tabla_tmp_norm[,2:5]))
  
  resultado=NULL
  for (a in 1:ncol(tmp))
  {
    resultado = c(resultado,computeHV(cbind(tmp[,a],tmp[,a]),c(2,2,2,2)))
  }
  return(resultado)
}

2.2.3 Function to apply the K-MEANS clustering method

K-means is one of the most well-known and widely used clustering algorithms. It is an unsupervised machine learning method based on partitions, used for pattern recognition and grouping data according to their common characteristics. The quality of clustering is evaluated using the within-cluster sum of squared errors criterion.

This algorithm requires the number of clusters to be predefined by the user. Its objective function aims to minimize the sum of squared Euclidean distances between each data point and the centroid of its assigned cluster. The process begins with centroid assignment based on the Euclidean distance. Then, in each iteration, the centroids are recalculated by averaging the data points within each cluster until convergence is reached or a stopping criterion is met (Oti et al. 2021) (Sinaga and Yang 2020).

# -------------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO KMEANS -------------------------------------------------------------------------
# -------------------------------------------------------------------------------------------------------
metodo_KMEANS = function(matriz_distancia_arboles,res_calidad_clusters){
  # Se realizará un análisis para un rango de 2 a 15 clusters
  posibles_k = seq(2,15)
  # Se establece la semilla 2 para realizar los cálculos 
  set.seed(2)
  # Se ejecuta un ciclo for que permite calcular los índices de Dunn, Connectivity y Silhouette
  # para diferente cantidades de clusters
  for(i in posibles_k){
    # Se realizan los agrupamientos de los árboles con el método KMEANS para los diferentes valores de i
    clusters = kmeans(matriz_distancia_arboles, i)
    
    # Se calculan los índices de calidad con "i" cantidad de clusters
    puntaje_Dunn = dunn(distance = matriz_distancia_arboles, clusters$cluster)
    puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$cluster)
    puntaje_Silhouette = mean(silhouette(clusters$cluster, matriz_distancia_arboles)[,3])
    
    # Se almacenan los resultados en una variable temporal
    tmp = c("Kmeans", i, round(puntaje_Dunn, 3), round(puntaje_Connectivity, 3), round(puntaje_Silhouette, 3))
    # Se agregan los resultados a la variable global res_calidad_clusters
    res_calidad_clusters = rbind(res_calidad_clusters, tmp)
  }
  
  # Establecer los nombres de las columnas del data.frame
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  #La funcion retorna el dataframe res_calidad de clusters
  return(res_calidad_clusters)
}

2.2.4 Function to apply the Partitioning Around Medoids (PAM) method

PAM or k-medoids is a clustering algorithm that uses the medoid as the center of the cluster instead of the mean, allowing the use of different similarity metrics. PAM is an iterative optimization that combines the relocation between the different clusters with the nomination of possible medoids. The initialization of the medoids is random; on the other hand, the assignment of points to the nearest medoid and the iterative optimization is done through swaps by reviewing all the neighbors of the current node, which makes this algorithm computationally expensive compared to other algorithms like CLARA (Schubert and Rousseeuw 2021) (Al Abid 2014).

# -----------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO PAM --------------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------
metodo_PAM = function(matriz_distancia_arboles,res_calidad_clusters){
  
  # Se realizará un análisis para un rango de 2 a 15 clusters
  posibles_k = seq(2,15)
  
  # Se establece la semilla 2 para realizar los cálculos 
  set.seed(2)
  
  # Se ejecuta un ciclo for que permite calcular los índices de Dunn, Connectivity y Silhouette
  # para diferente cantidade de clusters
  for(i in posibles_k){
    # Se realizan los agrupamientos de los árboles con el método KMEANS para los diferentes valores de i
    clusters = pam(matriz_distancia_arboles, i)
    
    # Se calculan los índices de calidad con "i" cantidad de clusters
    puntaje_Dunn = dunn(distance = matriz_distancia_arboles, clusters$clustering)
    puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$clustering)
    puntaje_Silhouette = mean(silhouette(clusters$clustering, matriz_distancia_arboles)[,3])
    
    # Se almacenan los resultados en una variable temporal
    tmp = c("PAM", i, round(puntaje_Dunn, 3), round(puntaje_Connectivity, 3), round(puntaje_Silhouette, 3))
    
    # Se agregan los resultados a la variable global res_calidad_clusters
    res_calidad_clusters = rbind(res_calidad_clusters, tmp)
  }
  
  # Establecer los nombres de las columnas del data.frame
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  #La funcion retorna el dataframe res_calidad_clusters
  return(res_calidad_clusters)
}

2.2.5 Function to apply the Fuzzy Analysis Clustering (FANNY) method

Fanny is a clustering technique that minimizes a specific objective function, allows detecting patterns in the data, and is based on fuzzy logic, where data can belong to more than one group in a partial manner. This clustering method handles uncertainties in classification by assigning coefficients that indicate the degree of membership to the various clusters (Liu 2021) (Rajkumar, Adimulam, and Kodukula 2019).

# -----------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO FANNY ------------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------
metodo_FANNY = function(matriz_distancia_arboles,res_calidad_clusters,k_invalido){
  
  # Se realizará un análisis para un rango de 2 a 15 clusters
  posibles_k = seq(2,15)
  
  # Si hay un valor que invalida el metodo, se debe eliminar
  if (k_invalido >1 & k_invalido < 16){
    posibles_k = posibles_k[posibles_k != k_invalido]
  }

  
  # Se establece la semilla 2 para realizar los cálculos 
  set.seed(2)
  
  # Se ejecuta un ciclo for que permite calcular los índices de Dunn, Connectivity y Silhouette
  # para diferente cantidade de clusters
  for(i in posibles_k){
    # Se realizan los agrupamientos de los árboles con el método KMEANS para los diferentes valores de i
    clusters = fanny(matriz_distancia_arboles, i,memb.exp = 1.1)
    
    # Se calculan los índices de calidad con "i" cantidad de clusters
    puntaje_Dunn = dunn(distance = matriz_distancia_arboles, clusters$clustering)
    puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$clustering)
    puntaje_Silhouette = mean(silhouette(clusters$clustering, matriz_distancia_arboles)[,3])
    
    # Se almacenan los resultados en una variable temporal
    tmp = c("FANNY", i, round(puntaje_Dunn, 3), round(puntaje_Connectivity, 3), round(puntaje_Silhouette, 3))
    
    # Se agregan los resultados a la variable global res_calidad_clusters
    res_calidad_clusters = rbind(res_calidad_clusters, tmp)
  }

  # Establecer los nombres de columnas del data.frame que almacena los resultados de calidad de clusters
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  #La funcion retorna el dataframe res_calidad de clusters
  return(res_calidad_clusters)
}

2.2.6 Function to apply the Clustering Large Applications (CLARA) method

CLARA is a clustering method and an extension of the k-medoids algorithm. It uses a sampling approach to handle large datasets. This algorithm has a great capacity to detect outliers and handle noise more effectively compared to other algorithms such us k-means, using clustering to group nodes with low availability into sets with similar availability patterns. It also implements a reassignment algorithm that minimizes the need for additional replicas (Gupta and Panda 2019) (Gonzalo et al. 2022).

# -----------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO CLARA --------------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------
metodo_CLARA = function(matriz_distancia_arboles, res_calidad_clusters){
  
  # Se realizará un análisis para un rango de 2 a 15 clusters
  posibles_k = seq(2,15)
  # Se establece la semilla 2 para realizar los cálculos 
  set.seed(2)
  
  # Se ejecuta un ciclo for que permite calcular los índices de Dunn, Connectivity y Silhouette
  # para diferente cantidade de clusters
  for(i in posibles_k){
    # Se realizan los agrupamientos de los árboles con el método KMEANS para los diferentes valores de i
    clusters = clara(matriz_distancia_arboles, i)
    
    # Se calculan los índices de calidad con "i" cantidad de clusters
    puntaje_Dunn = dunn(distance = matriz_distancia_arboles, clusters$clustering)
    puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$clustering)
    puntaje_Silhouette = mean(silhouette(clusters$clustering, matriz_distancia_arboles)[,3])
    
    # Se almacenan los resultados en una variable temporal
    tmp = c("CLARA", i, round(puntaje_Dunn, 3), round(puntaje_Connectivity, 3), round(puntaje_Silhouette, 3))
    
    # Se agregan los resultados a la variable global res_calidad_clusters
    res_calidad_clusters = rbind(res_calidad_clusters, tmp)
  }
  
  # Establecer los nombres de columnas del data.frame que contendrá la calidad de cluster para cada método
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  #La funcion retorna el dataframe res_calidad de clusters
  return(res_calidad_clusters)
}

2.2.7 Función para aplicar el método Density-based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that identifies densely connected regions as clusters. One of its main advantages is that it does not require specifying the number of clusters in advance. Its process begins with analyzing the proximity between points in space, using two key parameters: ε (epsilon), which defines the neighborhood radius around each point, and MinPts, which determines the minimum number of neighbors within that radius for a point to be considered a “core” point of a cluster. However, this algorithm has a high computational cost and, since it uses global values for its parameters, it struggles with datasets that have varying densities. Additionally, points that are too dispersed compared to the others will not belong to any cluster; these are known as noise (Cretulescu et al. 2019) (Cheng et al. 2024) (Fahim 2023).

# -----------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO DBSCAN -----------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------
metodo_DBSCAN = function(matriz_distancia_arboles, res_calidad_clusters){
  # Semilla para reproducción de resultados
  set.seed(2)                 
  # Definimos minPts = 5. Valor por defecto recomendado.
  # k = minPts - 1
  k = 4
  
  #Calcula Calculate and Plot k-Nearest Neighbor Distances (https://rpubs.com/oduor/dbscanwithnoise)
  distancia_KNN = sort(kNNdist(matriz_distancia_arboles,k=k))
  distancia_KNN = data.frame(puntos = 1:length(distancia_KNN),puntos_originales= names(distancia_KNN),
                             distancia_KNN=round(distancia_KNN,3))
  
  #Calcular valor máximo de la derivada para hallar el valor de eps (buscar en ChatGPT)
  posicion_eps = which.max(diff(distancia_KNN$distancia_KNN))
  valor_eps = distancia_KNN$distancia_KNN[posicion_eps]
  
  # Se arman los grupos
  clusters = dbscan(matriz_distancia_arboles,eps = valor_eps, minPts = (k+1))
  
  # Se calculan los indices con el metodo DBSCAN
  puntaje_Dunn = 0 #dunn(distance = matriz_distancia_arboles, clusters$cluster) #Se hace 0 para evitar el infinito
  puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$cluster)
  puntaje_Silhouette = mean(silhouette(clusters$cluster, matriz_distancia_arboles)[,3])
  
  n_grupos_dbscan = length(unique(clusters$cluster))
  tmp = c("DBSCAN",n_grupos_dbscan,round(puntaje_Dunn,3),round(puntaje_Connectivity,3),round(puntaje_Silhouette,3))
  res_calidad_clusters=rbind(res_calidad_clusters,tmp)
  
  # Establecer los nombres de columnas del data.frame que contendrá la calidad de cluster para cada método
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  return(list(res_calidad_clusters,distancia_KNN,posicion_eps,valor_eps))
}

2.2.8 Función para aplicar el método Self-Organising Maps (SOM)

The Self-Organizing Map (SOM) is one of the most popular artificial neural network methods for cluster analysis, dimensionality reduction, and data visualization. This method is trained through unsupervised competitive learning and generally follows two stages. First, the data is projected onto prototypes organized in a grid (commonly rectangular or hexagonal in shape), and then these prototypes are grouped to obtain the final clusters in the so-called ‘feature map.’ The goal of SOM is to represent high-dimensional data in a lower-dimensional space while preserving distances and topologies. Each neuron in the grid calculates a discrimination function, and the best match, known as the ‘Best Matching Unit’ (BMU), is selected. Additionally, the neurons neighboring the BMU also participate in adjusting the weights, so they progressively adjust to improve the representation of the data (Yang, Ouyang, and Shi 2012) (Miljkovic 2017).

# -----------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO SOM --------------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------

metodo_SOM = function(matriz_distancia_arboles,res_calidad_clusters,num_min_K,num_max_k){
  posibles_k = seq(num_min_K, num_max_k)
  set.seed(2)
  for(i in posibles_k){
    
    # Crear una cuadrícula SOM
    som_grid = somgrid(i,i, "hexagonal")
    
    # Entrenar el SOM
    distancia_matriz = as.matrix(matriz_distancia_arboles)
    som_model = som(distancia_matriz, grid = som_grid, rlen = 1000, alpha = c(0.05, 0.01))
    
    # Asignar datos a los clusters (neuronas ganadoras)
    clusters = predict(som_model, newdata = distancia_matriz)
    
    puntaje_Dunn = dunn(distance = matriz_distancia_arboles, clusters$unit.classif)
    puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$unit.classif)
    puntaje_Silhouette = mean(silhouette(clusters$unit.classif, matriz_distancia_arboles)[,3])
    k_grupos_SOM = length(unique(clusters$unit.classif))
    tmp = c("SOM",k_grupos_SOM,round(puntaje_Dunn,3),round(puntaje_Connectivity,3),round(puntaje_Silhouette,3))
    res_calidad_clusters=rbind(res_calidad_clusters,tmp)
  }
  
  # Establecer los nombres de columnas del data.frame que contendrá la calidad de cluster para cada método
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  #La funcion retorna el dataframe res_calidad de clusters
  return(res_calidad_clusters)
}

2.2.9 Function to apply the Minimum-Spanning Tree with K-Nearest Neighbor (MST-KNN) method

MST-KNN is a clustering method that is based on graphs to find significant relationships among the data being analyzed. On one hand, the Minimal Spanning Tree (MST) corresponds to a connected and acyclic subgraph that connects all the nodes of the complete graph, while minimizing its total cost. On the other hand, k-Nearest Neighbors (kNN) refers to a graph that connects each node to its K nearest neighbors. MST-kNN integrates both types of graphs (MST and KNN) to construct a proximity graph that captures the similarities among the data, with the edges representing similar objects, ensuring that the most meaningful relationships are present in each cluster without the need to predefine the number of clusters.

To begin, the algorithm is given a distance matrix between the data to be analyzed and optionally a value for K. Then, a complete graph is generated from the distance matrix, followed by the creation of an MST graph and a kNN graph, which are intersected to form the proximity graph. The output is the total number of clusters formed and a partition indicating which cluster each object of study belongs to (Inostroza-Ponta et al. 2007) (Parraga-Alava, Moscato, and Inostroza-Ponta 2023).

# -----------------------------------------------------------------------------------------------------
# FUNCION PARA EL METODO MST-KNN ----------------------------------------------------------------------
# -----------------------------------------------------------------------------------------------------
metodo_MSTKNN = function(matriz_distancia_arboles,res_calidad_clusters,arboles_filogeneticos){
  set.seed(2)
  # Se agrupan los arboles por grupo a través de el algoritmo mstknn
  clusters = mst.knn(matriz_distancia_arboles)
  # Se crea una secuencia de 1 en 1 con la cantidad total de arboles del grupo en estudio
  arboles = seq(1:length(arboles_filogeneticos))
  # Se ingresan a una variable los arboles que se logran clasificar con mstknn
  arboles_clasificados_en_cluster = as.numeric(names(clusters$cluster))
  # Se revisa si se le asigno un cluster a cada arbol
  cantidad_arboles_faltantes = length(arboles) - length(arboles_clasificados_en_cluster)
  # Se obtiene la ultima posicion del ultimo arbol clasificado
  ultima_posicion = length(arboles_clasificados_en_cluster)
  # Se obtiene la cantidad de clusters formados
  n_grupos_mstknn = clusters$cnumber
  
  # Si existen arboles que quedaron sin cluster, entonces estos deben ser agregados 
  if(cantidad_arboles_faltantes != 0){
    # Se obtienen los arboles que quedaron sin grupos
    arboles_ausentes = setdiff(arboles,arboles_clasificados_en_cluster)
    # Se calcula cuantos clusters faltan por ingresar al listado
    clusters_por_ingresar = (n_grupos_mstknn + 1):(n_grupos_mstknn + cantidad_arboles_faltantes)
    
    for(j in 1:cantidad_arboles_faltantes){
      # Se agrega la cantidad faltante de clusters al listado
      clusters$cluster=c(clusters$cluster,clusters_por_ingresar[j])
      # Se obtiene el arbol que quedo sin cluster
      ingresar_arbol = setdiff(arboles,arboles_clasificados_en_cluster)[j]
      # Se agrega el arbol al listado de arboles y su nuevo cluster
      names(clusters$cluster)[ultima_posicion + j] = c(as.character(ultima_posicion + j),ingresar_arbol)
    }
  }

  # Una vez agregados todos los arboles con sus respectivos clusters al conjunto
  # Se procede a calcular los indices de Dunn, Connectivity y Silhouette
  puntaje_Dunn = dunn(distance = matriz_distancia_arboles, clusters$cluster)
  puntaje_Connectivity = connectivity(distance = matriz_distancia_arboles, clusters$cluster)
  puntaje_Silhouette = mean(silhouette(clusters$cluster, matriz_distancia_arboles)[,3])
  
  # Se almacenan en tmp los valores redondeados a 3 decimales
  tmp = c("MST-kNN",n_grupos_mstknn,round(puntaje_Dunn,3),round(puntaje_Connectivity,3),round(puntaje_Silhouette,3))
  res_calidad_clusters=rbind(res_calidad_clusters,tmp)
  
  # Establecer los nombres de columnas del data.frame que contendrá la calidad de cluster para cada método
  colnames(res_calidad_clusters) = c("Method", "k", "Dunn", "Connectivity", "Silhouette")
  
  return(res_calidad_clusters)
}

2.3 Burkesuchus mallingrandensis

Burkesuchus mallingrandensis is a new genus and species of basal crocodyliform discovered in the Toqui Formation of southern Chile, specifically in the Aysen region. Dated to the Upper Jurassic (Tithonian). This discovery constitutes one of the few records of non-pelagic Jurassic crocodyliforms for the entire South American continent. Burkesuchus was a small crocodyliform, approximately 70 cm in length similar to current crocodiles, with a skull that was depressed dorsoventrally and wide transversely in the posterior region. Its cranial morphology includes distinctive features such as bones ornamented by grooves and fossae, fused and subtriangular frontals, and a posteroventrally flexed squamosal process that forms a broad bony wing. In addition, it presents a series of autapomorphies in its cranial and vertebral anatomy that differentiate it from other known crocodyliforms. The discovery of Burkesuchus helps to close the morphological gap between “protosuchian” grade crocodyliforms and the first branched mesoeucrocodylians, providing valuable information on the evolution of crocodyliforms in South America during the Jurassic (Novas et al. 2021).

2.3.1 Reading files

To analyze the evolutionary relationships of this study group, the authors used the software TNT - Tree analysis using New Technology, a tool that allows analysis under the parsimony criterion. In this case, the data matrix (hyperlink) in tnt format containing 111 species and 448 traits for each of them is read. The TNT character matrix is then converted into a phyDat object, which is the format used by the phangorn package in R for phylogenetic analysis. The researchers applied various configurations to this matrix in the TNT software, such as, the algorithms Sectorial Searches, Ratchet with 20 substitutions and tree fusing with 5 rounds with a total of 1000 replicates (Novas et al. 2021). In our case, after applying the step by step, we obtained 100 more parsimonious trees with a score of 1669 each (hyperlink). Then, the 100 trees obtained from TNT are read and finally, the database containing the geological eras and other relevant data about the study group (hyperlink).

#Lectura de secuencias
secuencias_bm = ReadTntCharacters("Datos/Burkesuchus/secuencias_burkesuchus.tnt")
secuencias_bm = MatrixToPhyDat(secuencias_bm)

#Lectura de árboles - verificar que archivo contenga en su encabezado ruta asociada a secuencias
arboles_bm = ReadTntTree("Datos/Burkesuchus/arboles_burkesuchus_100_1008.tree")

#Lectura de anotaciones - tiempo
anotaciones_bm = read.csv("Datos/Burkesuchus/anotaciones_burkesuchus_age.csv",sep=";",header=T)

# Semilla para reproducción de resultados
set.seed(2)                 

# Variables para Burkesuchus
# Tabla donde se almacenaran los datos calculados por cada metodo
resultados_calidad_clusters = NULL
# Convertir la matriz resultante en un data.frame
resultados_calidad_clusters = data.frame(resultados_calidad_clusters)
# Eliminar los nombres de fila para asegurar que los índices sean numéricos y secuenciales
rownames(resultados_calidad_clusters) = NULL

2.3.2 Consensus trees

After applying the maximum parsimony method, multiple trees with the same score may arise. In this case, 100 trees were obtained, each with 1669 steps. These 100 topologies will be subjected to various clustering algorithms to determine which one best represents the set under analysis. The medoid will then be selected, meaning the tree that minimizes the distances to all the trees within the cluster.

2.3.3 RF Distance

With the group of trees already read, the Robinson-Foulds distance is obtained to calculate the similarity between the different topologies. The results are converted into a matrix to analyze them using a heatmap.

# Cálculo de distancia RF
distancia_arboles_bm = RF.dist(arboles_bm,normalize = T, check.labels = F)
distancia_arboles_bm = round(distancia_arboles_bm,2)
distancia_arboles_bm = as.matrix(distancia_arboles_bm)
# Graficar matriz de distancia

  # Convertimos la matriz de distancia en un formato adecuado para el heatmap
dist_melted = melt(as.matrix(distancia_arboles_bm))

  # Creamos el mapa de calor con plot_ly
heatmap_plot = plot_ly(
  x = dist_melted$Var1,
  y = dist_melted$Var2,
  z = dist_melted$value,
  type = "heatmap",
  colorscale = list(
    c(0, "white"),
    c(1, "#00A499")
  )  # Cambiamos la escala de colores de blanco a #00A499
)

  # Agregamos un título al gráfico
heatmap_plot <- layout(
  heatmap_plot,
  title = list(text = "RF distance - B. mallingrandensis (100 trees)", font = list(size = 17))
)

# Cambiamos la fuente de los ejes
heatmap_plot$x$axis$title = list(font = list(size = 19))
heatmap_plot$y$axis$title = list(font = list(size = 19))

  # Mostramos el gráfico
heatmap_plot
show(heatmap_plot)

2.3.4 Clustering techniques

The set of trees will be subjected to seven clustering algorithms, analyzing between 2 and 15 clusters for each method. Each grouping will be evaluated using three indices that measure cluster quality: Dunn, Connectivity, and Silhouette. In this case, the goal is to maximize the data, followed by the calculation of hypervolume, which will determine the optimal number of clusters for each method, as well as for the dataset as a whole and the most suitable method in particular.

K-Means Clustering (KMEANS)

The function created for k-Means requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion metodo_kmeans
resultados_KMEANS = metodo_KMEANS(distancia_arboles_bm,resultados_calidad_clusters)
# Paso a dataframe la tabla obtenida
tabla_HP_KMEANS = data.frame(resultados_KMEANS)
# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_KMEANS)=NULL
# Se calcula el hipervolumen con los resultados obtenidos por KMEANS
resultados_hp_KMEANS = hipervolumen(tabla_HP_KMEANS)
# Valor de Hipervolumen para cada cluster definido
tabla_HP_KMEANS$hypervolume = resultados_hp_KMEANS
indice_KMEANS = which.max(tabla_HP_KMEANS$hypervolume)
formato_tablas(tabla_HP_KMEANS)
Method k Dunn Connectivity Silhouette hypervolume
Kmeans 2 0.143 8.358 0.521 8.000000
Kmeans 3 0.167 15.514 0.401 9.433825
Kmeans 4 0.167 44.079 0.289 7.207897
Kmeans 5 0.167 118.771 0.053 3.569711
Kmeans 6 0.167 69.164 0.193 5.291622
Kmeans 7 0.167 135.015 0.009 2.767629
Kmeans 8 0.167 80.908 0.277 5.145938
Kmeans 9 0.2 145.804 0.188 4.488880
Kmeans 10 0.167 154.054 0.095 2.496328
Kmeans 11 0.167 130.949 0.133 2.841958
Kmeans 12 0.2 112.012 0.241 4.828526
Kmeans 13 0.167 167.797 0.073 1.844623
Kmeans 14 0.2 114.183 0.245 4.204650
Kmeans 15 0.167 166.117 0.138 1.797899

Using the KMEANS algorithm, outstanding performance is observed when using a k value of 3. This is reflected in a Dunn Index of 0.167, a Connectivity Index of 15.514, and a Silhouette Index reaching 0.401.

Partitioning Around Medoids (PAM)

The function created for PAM requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo PAM
resultados_PAM = metodo_PAM(distancia_arboles_bm,resultados_calidad_clusters)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_PAM = data.frame(resultados_PAM)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_PAM)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por PAM
resultados_hp_PAM = hipervolumen(tabla_HP_PAM)

# Se añaden a la tabla los valores del hipervolumen
tabla_HP_PAM$Hypervolume = resultados_hp_PAM
# Se obtiene el indice donde se encuentra el mayor volumen
indice_PAM = which.max(tabla_HP_PAM$Hypervolume)
# Se le da formato para visualizacion a la tabla 
formato_tablas(tabla_HP_PAM)
Method k Dunn Connectivity Silhouette Hypervolume
PAM 2 0.143 8.358 0.521 8.000000
PAM 3 0.167 14.527 0.403 7.028344
PAM 4 0.167 32.496 0.294 4.471896
PAM 5 0.167 68.895 0.28 3.386735
PAM 6 0.2 104.188 0.276 3.213737
PAM 7 0.2 112.763 0.269 2.808688
PAM 8 0.2 113.504 0.285 2.829546
PAM 9 0.25 114.245 0.299 3.670206
PAM 10 0.2 118.937 0.286 2.453906
PAM 11 0.2 123.487 0.272 2.120445
PAM 12 0.2 125.481 0.276 1.995118
PAM 13 0.25 127.082 0.279 2.436879
PAM 14 0.25 127.948 0.289 2.344573
PAM 15 0.25 128.975 0.298 2.230200

Using the PAM algorithm, outstanding performance is observed when using a k value equal to 2. This is reflected in a Dunn Index of 0.143, a Connectivity Index of 8.358, and a Silhouette Index reaching 0.521.

Partitioning Around Medoids (FANNY)

The function created for FANNY requires three input parameters: the distance matrix of the obtained trees and the table where the results will be stored and a value that could invalidate the method; therefore, it will not be considered. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo FANNY
resultados_FANNY = metodo_FANNY(distancia_arboles_bm,resultados_calidad_clusters,0)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_FANNY = data.frame(resultados_FANNY)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_FANNY)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por FANNY
resultados_hp_FANNY = hipervolumen(tabla_HP_FANNY)
# Se añaden a la tabla los valores del hipervolumen
tabla_HP_FANNY$Hypervolume = resultados_hp_FANNY
# Se obtiene el indice donde se encuentra el mayor volumen
indice_FANNY = which.max(tabla_HP_FANNY$Hypervolume)
# Se le da formato para visualizacion a la tabla 
formato_tablas(tabla_HP_FANNY)
Method k Dunn Connectivity Silhouette Hypervolume
FANNY 2 0.143 8.358 0.521 8.000000
FANNY 3 0.167 14.313 0.403 9.638694
FANNY 4 0.167 42.854 0.293 7.606916
FANNY 5 0.167 88.815 0.158 5.356497
FANNY 6 0.167 83.294 0.172 5.315967
FANNY 7 0.167 88.2 0.176 5.015640
FANNY 8 0.167 112.012 0.162 4.307856
FANNY 9 0.167 131.412 0.157 3.770081
FANNY 10 0.167 137.989 0.102 3.228902
FANNY 11 0.167 150.928 0.148 3.070762
FANNY 12 0.167 154.352 0.142 2.824254
FANNY 13 0.167 161.679 0.124 2.500521
FANNY 14 0.2 161.679 0.15 3.399955
FANNY 15 0.167 193.544 -0.047 1.421100

Using the FANNY algorithm, outstanding performance is observed when using a k value equal to 3. This is reflected in a Dunn Index of 0.167, a Connectivity Index of 14.313, and a Silhouette Index reaching 0.403.

Clustering Large Applications (CLARA)

The function created for CLARA requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo CLARA
resultados_CLARA = metodo_CLARA(distancia_arboles_bm,resultados_calidad_clusters)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_CLARA = data.frame(resultados_CLARA)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_CLARA)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por CLARA
resultados_hp_CLARA = hipervolumen(tabla_HP_CLARA)
# Se añaden a la tabla los valores del hipervolumen
tabla_HP_CLARA$Hypervolume = resultados_hp_CLARA
# Se obtiene el indice donde se encuentra el mayor volumen
indice_CLARA = which.max(tabla_HP_CLARA$Hypervolume)
# Se le da formato para visualizacion a la tabla 
formato_tablas(tabla_HP_CLARA)
Method k Dunn Connectivity Silhouette Hypervolume
CLARA 2 0.143 8.358 0.521 8.000000
CLARA 3 0.167 15.626 0.403 7.436098
CLARA 4 0.167 37.75 0.268 4.788351
CLARA 5 0.167 58.997 0.206 3.474098
CLARA 6 0.167 103.668 0.257 3.019743
CLARA 7 0.2 105.397 0.279 3.784242
CLARA 8 0.2 104.2 0.295 3.780868
CLARA 9 0.167 113.328 0.244 2.363619
CLARA 10 0.167 123.525 0.253 2.141140
CLARA 11 0.25 120.403 0.292 3.740586
CLARA 12 0.2 126.737 0.275 2.469518
CLARA 13 0.2 135.599 0.237 1.951378
CLARA 14 0.25 128.306 0.29 2.896549
CLARA 15 0.25 136.187 0.263 2.362000

Using the CLARA algorithm, outstanding performance is observed when using a k value equal to 2. This is reflected in a Dunn Index of 0.143, a Connectivity Index of 8.358, and a Silhouette Index reaching 0.521.

Density-based spatial clustering of applications with noise (DBSCAN)

The function created for DBSCAN requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. This method determines the appropriate number of clusters on its own, and based on that value, the quality indices of Dunn, Connectivity, and Silhouette are calculated. The partial results obtained for this particular method are then presented and stored in a table. The function returns various results that will be stored in variables, including the value of the epsilon parameter and its position. With this, a graphical representation of the k-Nearest Neighbor distances will be created.

# Se llama a la funcion que aplica el metodo DBSCAN
resultados_DBSCAN = metodo_DBSCAN(distancia_arboles_bm,resultados_calidad_clusters)
# Se almacenan el variables los resultados que retorno la funcion
tabla_DBSCAN = resultados_DBSCAN[[1]]
dataframe_DBSCAN = resultados_DBSCAN[[2]]
posicion_eps = resultados_DBSCAN[[3]]
valor_eps = resultados_DBSCAN[[4]]

# Crear el gráfico de líneas
g = ggplot(dataframe_DBSCAN, aes(x = puntos, y = distancia_KNN))
g = g + geom_line(size = 1, color = "#00A499")
g = g + labs(title = "Distance Graph k-Nearest Neighbor (Eps parameter)",
       x = "Points (example) ordered by KNN distance",
       y = "4-NN distance")
g = g + theme_classic(base_size = 12) + theme(axis.text = element_text(size = 12))
g = g + scale_y_continuous(limits = c(0.05, 0.2),
                           breaks = seq(0.05, 0.2, by = 0.02))

g = g + geom_vline(xintercept = posicion_eps, colour = "#063330", linetype="dashed")
g = g + annotate("text", x = valor_eps, y = 0.19, label = paste("EPS:", posicion_eps), color = "black", hjust = 1.2)

plot(g)

# Se le da formato para visualizacion a la tabla
formato_tablas(tabla_DBSCAN)
Method k Dunn Connectivity Silhouette
DBSCAN 2 0 7.716 0.288

Self-Organising Maps (SOM)

The function created for SOM requires four input parameters: the distance matrix of the obtained trees and the table where the results will be stored and two numerical values, a minimum and a maximum, which will be used within the function to create a sequence of K values corresponding to the number of clusters. In the case of Burkesuchus mallingrandensis, a minimum of 2 and a maximum of 10 were used. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo SOM
resultados_SOM = metodo_SOM(distancia_arboles_bm,resultados_calidad_clusters,2,10)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_S0M = data.frame(resultados_SOM)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_S0M)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por CLARA
resultados_hp_SOM = hipervolumen(tabla_HP_S0M)
# Se añaden a la tabla los valores del hipervolumen
tabla_HP_S0M$Hypervolume = resultados_hp_SOM
# Se obtiene el indice donde se encuentra el mayor volumen
indice_SOM = which.max(tabla_HP_S0M$Hypervolume)
formato_tablas(tabla_HP_S0M)
Method k Dunn Connectivity Silhouette Hypervolume
SOM 4 0.167 18.283 0.403 8.000000
SOM 8 0.167 147.827 -0.09 3.226117
SOM 13 0.167 140.913 0.087 3.914528
SOM 18 0.167 208.688 -0.07 2.429159
SOM 27 0.2 221.867 -0.134 2.231870
SOM 36 0.2 230.56 -0.141 1.858259
SOM 42 0.333 241.871 -0.01 3.230058
SOM 48 0.2 252.818 -0.182 1.272714
SOM 50 0.2 256.963 -0.143 1.278760

Using the SOM algorithm, outstanding performance is observed when using a k value equal to 4. This is reflected in a Dunn Index of 0.167, a Connectivity Index of 18.283, and a Silhouette Index reaching 0.403.

Minimum Spanning Tree with K Nearest Neighbor (MST-kNN)

The function created for MST-KNN requires three input parameters: the distance matrix of the obtained trees and the table where the results will be stored and the trees corresponding to the case study. This method determines the appropriate number of clusters on its own, and based on that value, the quality indices of Dunn, Connectivity, and Silhouette are calculated. The partial results obtained for this particular method are then presented and stored in a table.

# Se llama a la funcion que aplica el metodo MSTKNN
resultados_MSTKNN = metodo_MSTKNN(distancia_arboles_bm,resultados_calidad_clusters,arboles_bm)
## 
##  Only there is  97 nodes in solutions. Clustering solution only will have these nodes.
# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_MSTKNN = data.frame(resultados_MSTKNN)
# Se le da formato para visualizacion a la tabla
formato_tablas(tabla_HP_MSTKNN)
Method k Dunn Connectivity Silhouette
MST-kNN 13 0.111 139.9 -0.222

2.3.5 Choosing the best method

Now, all the results from each method, along with their respective indices per cluster, are stored in a single table. The hypervolume is recalculated, allowing the determination of which method and the optimal number of clusters should be applied to the dataset corresponding to the reptile Burkesuchus mallingrandensis.

# Se anidan todos los resultados obtenidos por cada metodo en una sola tabla
resultados_calidad_clusters = rbind(resultados_calidad_clusters,resultados_KMEANS)
resultados_calidad_clusters = rbind(resultados_calidad_clusters,resultados_PAM)
resultados_calidad_clusters = rbind(resultados_calidad_clusters,resultados_FANNY)
resultados_calidad_clusters = rbind(resultados_calidad_clusters,resultados_CLARA)
resultados_calidad_clusters = rbind(resultados_calidad_clusters,tabla_DBSCAN)
resultados_calidad_clusters = rbind(resultados_calidad_clusters,resultados_SOM)
resultados_calidad_clusters = rbind(resultados_calidad_clusters,resultados_MSTKNN)

# Se convierte a dataframe la tabla que almacena todos los resultados 
tabla_HP_FINAL = data.frame(resultados_calidad_clusters)
rownames(tabla_HP_FINAL)=NULL
# Se almacenan los resultados en una variable temporal
resultados_finales = hipervolumen(tabla_HP_FINAL)
# Se añaden a la tabla final los valores retornados por la funcion hipervolumen
tabla_HP_FINAL$Hypervolume = resultados_finales
# Se obtiene el indice que representa un mayor hipervolumen
indice_final = which.max(tabla_HP_FINAL$Hypervolume)
# Se le da formato de visualizacion a la tabla
formato_tablas(tabla_HP_FINAL)
Method k Dunn Connectivity Silhouette Hypervolume
Kmeans 2 0.143 8.358 0.521 11.420334
Kmeans 3 0.167 15.514 0.401 10.756183
Kmeans 4 0.167 44.079 0.289 9.201501
Kmeans 5 0.167 118.771 0.053 6.195582
Kmeans 6 0.167 69.164 0.193 7.864879
Kmeans 7 0.167 135.015 0.009 5.557374
Kmeans 8 0.167 80.908 0.277 8.029978
Kmeans 9 0.2 145.804 0.188 6.659528
Kmeans 10 0.167 154.054 0.095 5.548461
Kmeans 11 0.167 130.949 0.133 6.055202
Kmeans 12 0.2 112.012 0.241 7.361902
Kmeans 13 0.167 167.797 0.073 5.043071
Kmeans 14 0.2 114.183 0.245 7.174343
Kmeans 15 0.167 166.117 0.138 5.259256
PAM 2 0.143 8.358 0.521 11.420334
PAM 3 0.167 14.527 0.403 10.793866
PAM 4 0.167 32.496 0.294 9.469714
PAM 5 0.167 68.895 0.28 8.552454
PAM 6 0.2 104.188 0.276 8.264924
PAM 7 0.2 112.763 0.269 7.954946
PAM 8 0.2 113.504 0.285 7.955350
PAM 9 0.25 114.245 0.299 8.684939
PAM 10 0.2 118.937 0.286 7.676729
PAM 11 0.2 123.487 0.272 7.416497
PAM 12 0.2 125.481 0.276 7.316844
PAM 13 0.25 127.082 0.279 7.895817
PAM 14 0.25 127.948 0.289 7.847890
PAM 15 0.25 128.975 0.298 7.789105
FANNY 2 0.143 8.358 0.521 11.420334
FANNY 3 0.167 14.313 0.403 10.798243
FANNY 4 0.167 42.854 0.293 9.254790
FANNY 5 0.167 88.815 0.158 7.363047
FANNY 6 0.167 83.294 0.172 7.472858
FANNY 7 0.167 88.2 0.176 7.331337
FANNY 8 0.167 112.012 0.162 6.753853
FANNY 9 0.167 131.412 0.157 6.321917
FANNY 10 0.167 137.989 0.102 5.839992
FANNY 11 0.167 150.928 0.148 5.811014
FANNY 12 0.167 154.352 0.142 5.658355
FANNY 13 0.167 161.679 0.124 5.386941
FANNY 14 0.2 161.679 0.15 5.810547
FANNY 15 0.167 193.544 -0.047 4.023920
CLARA 2 0.143 8.358 0.521 11.420334
CLARA 3 0.167 15.626 0.403 10.769791
CLARA 4 0.167 37.75 0.268 9.171157
CLARA 5 0.167 58.997 0.206 8.226561
CLARA 6 0.167 103.668 0.257 7.644317
CLARA 7 0.2 105.397 0.279 8.169993
CLARA 8 0.2 104.2 0.295 8.208545
CLARA 9 0.167 113.328 0.244 7.141044
CLARA 10 0.167 123.525 0.253 6.928494
CLARA 11 0.25 120.403 0.292 8.310104
CLARA 12 0.2 126.737 0.275 7.286781
CLARA 13 0.2 135.599 0.237 6.818030
CLARA 14 0.25 128.306 0.29 7.846689
CLARA 15 0.25 136.187 0.263 7.428678
DBSCAN 2 0 7.716 0.288 6.745600
SOM 4 0.167 18.283 0.403 10.598136
SOM 8 0.167 147.827 -0.09 4.767492
SOM 13 0.167 140.913 0.087 5.517507
SOM 18 0.167 208.688 -0.07 3.598494
SOM 27 0.2 221.867 -0.134 3.020761
SOM 36 0.2 230.56 -0.141 2.535665
SOM 42 0.333 241.871 -0.01 3.180866
SOM 48 0.2 252.818 -0.182 1.786215
SOM 50 0.2 256.963 -0.143 1.770744
MST-kNN 13 0.111 139.9 -0.222 3.469973

A better performance is observed when using the Kmeans algorithm, with a K value equal to 2. This is reflected in a Dunn Index of 0.143, a Connectivity Index of 8.358, and a Silhouette Index of 0.521.

2.3.5 Identification of medoids

For this study group, the k-means method will be applied with two clusters. The consensus medoid tree for the entire study group is identified, as well as the medoid trees for each of the two clusters.

# Aplicación de KMeans con k = 2
set.seed(2)
clusters = kmeans(distancia_arboles_bm, 2)

# Identificación de medoide de todo el conjunto
arbol_medoide_bm = names(which.min(rowMeans(distancia_arboles_bm)))

# Identificación de medoide de grupo 1
ID_g1 = as.numeric(names(which(clusters$cluster==1)))
arbol_medoide_bm_g1 =names(which.min(rowMeans(distancia_arboles_bm[ID_g1,ID_g1])))


# Identificación de medoide de grupo 2
ID_g2 = as.numeric(names(which(clusters$cluster==2)))
arbol_medoide_bm_g2 =names(which.min(rowMeans(distancia_arboles_bm[ID_g2,ID_g2])))

The representative tree of the entire dataset is the 6, while the representative trees of the two clusters are 4 and 1. Cluster 1 contains 46 trees, while Cluster 2 contains 54 trees, so the overall medoid is closer to Cluster 2.

2.3.6 Phylogenetic forest

Now, the phylogenetic forest is plotted for the entire study group, allowing the approximate distribution of the previously obtained topologies for each of the two clusters to be visualized in two dimensions.

set.seed(2)
resultados = kmeans(distancia_arboles_bm,2)

# Visualización
coordenadas=fviz_cluster(resultados, data = distancia_arboles_bm)$data
g = fviz_cluster(resultados, data = distancia_arboles_bm, ellipse.type = "norm",
                 labelsize = 10, repel = TRUE, show.clust.cent = FALSE, ggtheme = theme_classic(),
                 pointsize = 3, shape = "circle", palette = c("#00A499","#695c59"),)
g = g + labs(title = "Phylogenetic forest - B. Mallingrandensis")
g = g + theme(axis.text = element_text(size = 12), legend.position = "bottom")

#Medoide del árbol general (árbol 6)
g = g + geom_segment(x=coordenadas[6,]$x-1,xend=coordenadas[6,]$x+1,y=coordenadas[6,]$y,yend=coordenadas[6,]$y,
                     linetype = "dashed",color = "black")
g = g + geom_segment(x=coordenadas[6,]$x,xend=coordenadas[6,]$x,y=coordenadas[6,]$y-1,yend=coordenadas[6,]$y+1,
                     linetype = "dashed", color = "black")
g = g + geom_text(aes(x = coordenadas[6,]$x, y = coordenadas[6,]$y, label = "Medoid tree"),
                  vjust = 2, hjust = -0.2, size = 4, color = "black")

# Medoide cluster 1 (Arbol 1)
g = g + geom_segment(x=coordenadas[1,]$x-1,xend=coordenadas[1,]$x+1,y=coordenadas[1,]$y,yend=coordenadas[1,]$y,
                     linetype = "dashed", color = "#00635d")
g = g + geom_segment(x=coordenadas[1,]$x,xend=coordenadas[1,]$x,y=coordenadas[1,]$y-1,yend=coordenadas[1,]$y+1,
                     linetype = "dashed", color = "#00635d")
g = g + geom_text(aes(x = coordenadas[1,]$x, y = coordenadas[1,]$y, label = "Medoid cluster 1"),
                  vjust = 2, hjust = 1, size = 4, color = "#00635d")

# #Medoide cluster 2 (Árbol 4)
g = g + geom_segment(x=coordenadas[4,]$x-1,xend=coordenadas[4,]$x+1,y=coordenadas[4,]$y,yend=coordenadas[4,]$y,
                     linetype = "dashed", color = "#3d1308")
g = g + geom_segment(x=coordenadas[4,]$x,xend=coordenadas[4,]$x,y=coordenadas[4,]$y-1,yend=coordenadas[4,]$y+1,
                     linetype = "dashed", color = "#3d1308")
g = g + geom_text(aes(x = coordenadas[4,]$x, y = coordenadas[4,]$y, label = "Medoid cluster 2"),
                  vjust = 2.5, hjust = -0.1, size = 4, color = "#3d1308")

plot(g)

2.3.7 Representative trees

Once the medoids for each cluster and the medoid for the entire dataset have been identified, they are plotted to visualize the distribution of each species in the topology. This allows for analysis and comparison among the three medoid trees identified in the pipeline and also enables a comparison of our consensus medoid tree with the one from the literature. Additionally, each tree is assigned a color-coded graph indicating the support of each node: red represents low support, while green indicates high node support.

Medoid tree of the entire study group

# Agregar el soporte por nodo
arbol_medoide_bm_conf= addConfidences(arboles_bm[6],arboles_bm)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_bm_conf[[1]]$node.label))
arbol_medoide_bm_conf[[1]]$node.label[idx_NA]=1

# Graficar árbol medoide

# Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(100)
posiciones_color = round(arbol_medoide_bm_conf[[1]]$node.label*100,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_bm_conf[[1]]$tip.label))
colores_especies[111] = "#c7254e"

# Gráfico
g = ggtree(arbol_medoide_bm_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Representative consensus tree (medoid) with node support (100 trees)") +
    theme(plot.title = element_text(size = 20)) 

plot(g)

Medoid tree of group 1

# Árbol medoide - Grupo 1 (Árbol 1)
# Agregar el soporte por nodo
arbol_medoide_bm_conf= addConfidences(arboles_bm[1],arboles_bm)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_bm_conf[[1]]$node.label))
arbol_medoide_bm_conf[[1]]$node.label[idx_NA]=1

#Para tesis-: cambiar tamaño de letra a 10 y tamaño de nodo a 10 #Cambiar a fan

# Graficar árbol medoide

  # Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(100)
posiciones_color = round(arbol_medoide_bm_conf[[1]]$node.label*100,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_bm_conf[[1]]$tip.label))
colores_especies[111] = "#c7254e"

  # Gráfico
g = ggtree(arbol_medoide_bm_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Representative tree of Cluster (medoid) 1 with node support (100 trees)")

plot(g)

Medoid tree of group 2

# Árbol medoide - Grupo 2 (Árbol 4)

# Agregar el soporte por nodo
arbol_medoide_bm_conf= addConfidences(arboles_bm[4],arboles_bm)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_bm_conf[[1]]$node.label))
arbol_medoide_bm_conf[[1]]$node.label[idx_NA]=1

#Para tesis-: cambiar tamaño de letra a 10 y tamaño de nodo a 10 #Cambiar a fan

# Graficar árbol medoide

  # Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(100)
posiciones_color = round(arbol_medoide_bm_conf[[1]]$node.label*100,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_bm_conf[[1]]$tip.label))
colores_especies[111] = "#c7254e"

  # Gráfico
g = ggtree(arbol_medoide_bm_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Representative tree of Cluster 2 (medoid) with node support (100 trees)")

plot(g)

It was observed that the 3 topologies are similar and that Burkesuchus mallingrandensis is found in a stable clade, therefore, the evolutionary relationships of that group presented in the topology have sufficient support and possibly come close to representing a part of the history. On the other hand, towards the right of the topologies, it can be observed that some species present are found in nodes with less support, such as Coringasuchus, Mariliasuchus, labidiosuchus, Caryionosuchus, among others. This indicates that the evolutionary relationships among them are not fully resolved and are not as clear as with other species, which may be a consequence of convergent evolution, rapid evolution, or lack of information about these species. On the other hand, when comparing the medoid tree of the whole group (topology 6) with the strict consensus tree obtained from the literature, some differences can be observed, starting with the polytomies generated in the published tree and the absence of them in the topology declared by the pipeline.

Finally, according to the medoid tree obtained, the reptile Burkesuchus mallingrandensis presents a good support of nodes, being in a stable group. When compared with the published strict consensus tree, the distribution of the branches changes a little because the published topology presents polytomies, however, the evolutionary relationships do not vary too much, being this prehistoric reptile the ancestor of the current crocodiles. On the other hand, the medoid tree shows a region with lower node support, delineated between the species Caryonosuchus and Coringasuchus, and somewhat more subtly includes Llanosuchus and Notosuchus. These species are precisely those found in the largest polytomy within the published strict consensus tree due to unresolved relationships.

2.3.8 Temporality to the representative tree

Now that the consensus medoid tree for the study group (topology 6) has been identified, we proceed to calculate the Hamming distance for the sequences obtained using the TNT tool. Additionally, by integrating them with the created database, we incorporate temporality into the consensus medoid tree using the “Strap” library in R, allowing us to visualize the historical period in which each species lived. Thanks to this visualization it can be inferred that the case study Burkesuchus mallingrandensis lived in the Tithonian age, at the end of the Late Jurassic epoch, and that there are 3 species still present today: Crocodiles, Gaviales and Caimans.

# Cálculo de distancia Hamming entre secuencias
distancia_secuencias = dist.hamming(secuencias_bm)

#Agrega la distancia
arbol_medoide_bm_tiempo= nnls.tree(as.matrix(distancia_secuencias), arbol_medoide_bm_conf[[1]],rooted = F)

#Convierto distancia negativas a positivas
idx = which(arbol_medoide_bm_tiempo$edge.length<0)
arbol_medoide_bm_tiempo$edge.length[idx] = 0

#Crear tiempo
ages=anotaciones_bm[,c(7,13)]
rownames(ages)=anotaciones_bm[,1]
names(ages)=c("FAD","LAD")

#Agrego el tiempo
arbol_medoide_bm_tiempo = DatePhylo(arbol_medoide_bm_tiempo, ages, method="equal", rlen=0.1)

#Grafico
geoscalePhylo(tree=arbol_medoide_bm_tiempo, ages=ages, boxes="Age", cex.age = 0.89,cex.tip=1,cex.ts = 0.9,
              tick.scale="Age",units = c("Eon","Era","Period","Epoch","Age"),erotate = 90,urotate = 0,
              direction = "rightwards", quat.rm = T, vers = "ICS2013")

2.4 Arackar Licanantay

Arackar licanantay is a new lithostrotian sauropod discovered in the Atacama Region, Chile, during the Late Cretaceous period. This dinosaur is represented by a subadult specimen of around 6.3 meters in length, which makes it a relatively small specimen compared to other sauropods. The fossil found includes an association of bones that belong to a single individual, which provides valuable information about the anatomy and morphology of this dinosaur (Rubilar et al. 2021).

From a phylogenetic perspective, Arackar licanantay is placed in a lithostrotian clade that includes Rapetosaurus, Arackar and Isisaurus. This discovery is relevant not only for being the third species of dinosaur discovered in Chile, but also for being the most complete sauropod recorded in the country and in the southern Pacific region of South America. This discovery contributes significantly to the knowledge of the diversity and evolution of sauropods in this geographical area during the Upper Cretaceous (Rubilar et al. 2021).

2.4.1 Reading files

For this case study, the authors analyzed a matrix (hyperlink) of 88 species with 405 characteristics each using the maximum parsimony method in the TNT software tool, applying various configurations such as sectorial searches, drift, tree fusing, driven search, stabilize consensus, among others.

In our case, before submitting the data to the software, the respective modifications were made and a new matrix was obtained (hyperlink). After applying the step-by-step procedure in TNT, we obtained 97 more parsimonious trees with a score of 1322 each (hyperlink). Then, the 97 trees obtained from TNT are read and finally, the database (hyperlink) containing the geological eras and other relevant data on the study group is obtained.

# Semilla para reproducción de resultados
set.seed(2)                 

#Lectura de secuencias
secuencias_al = ReadTntCharacters("Datos/Arackar/arackarcorregido.tnt")
secuencias_al = MatrixToPhyDat(secuencias_al)

#Lectura de árboles - verificar que archivo contenga en su encabezado ruta asociada a secuencias
arboles_al = ReadTntTree("Datos/Arackar/arboles_TNT_arackar_1322.tre")

# Variables para Arackar Licanantay
# Tabla donde se almacenaran los datos calculados por cada metodo
resultados_calidad_clusters_al = NULL
# Convertir la matriz resultante en un data.frame
resultados_calidad_clusters_al = data.frame(resultados_calidad_clusters_al)
# Eliminar los nombres de fila para asegurar que los índices sean numéricos y secuenciales
rownames(resultados_calidad_clusters_al) = NULL

2.4.2 Consensus trees

After applying the maximum parsimony method, multiple trees with the same score may arise. In this case, 97 trees were obtained, each with 1322 steps. These 97 topologies will be subjected to various clustering algorithms to determine which one best represents the set under analysis. The medoid will then be selected, meaning the tree that minimizes the distances to all the trees within the cluster.

2.4.3 RF Distance

With the group of trees already read, the Robinson-Foulds distance is obtained to calculate the similarity between the different topologies. The results are converted into a matrix to analyze them using a heatmap.

# Cálculo de distancia RF
distancia_arboles_al = RF.dist(arboles_al,normalize = T, check.labels = F)
distancia_arboles_al = round(distancia_arboles_al,2)
distancia_arboles_al = as.matrix(distancia_arboles_al)
# Graficar matriz de distancia

  # Convertimos la matriz de distancia en un formato adecuado para el heatmap
dist_melted = melt(as.matrix(distancia_arboles_al))

  # Creamos el mapa de calor con plot_ly
heatmap_plot = plot_ly(
  x = dist_melted$Var1,
  y = dist_melted$Var2,
  z = dist_melted$value,
  type = "heatmap",
  colorscale = list(
    c(0, "white"),
    c(1, "#00A499")
  )  # Cambiamos la escala de colores de blanco a #00A499
)

  # Agregamos un título al gráfico
heatmap_plot <- layout(
  heatmap_plot,
  title = list(text = "RF distance - Arackar licanantay (97 trees)", font = list(size = 12))
)

# Cambiamos la fuente de los ejes
heatmap_plot$x$axis$title = list(font = list(size = 30))
heatmap_plot$y$axis$title = list(font = list(size = 30))

  # Mostramos el gráfico
heatmap_plot

2.4.4 Clustering techniques

The set of trees will be subjected to seven clustering algorithms, analyzing between 2 and 15 clusters for each method. Each grouping will be evaluated using three indices that measure cluster quality: Dunn, Connectivity, and Silhouette. In this case, the goal is to maximize the data, followed by the calculation of hypervolume, which will determine the optimal number of clusters for each method, as well as for the dataset as a whole and the most suitable method in particular.

K-Means Clustering (KMEANS)

The function created for k-Means requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion metodo_kmeans
resultados_KMEANS_AL = metodo_KMEANS(distancia_arboles_al,resultados_calidad_clusters_al)
# Paso a dataframe la tabla obtenida
tabla_HP_KMEANS_AL = data.frame(resultados_KMEANS_AL)
# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_KMEANS_AL)=NULL
# Se calcula el hipervolumen con los resultados obtenidos por KMEANS
resultados_hp_KMEANS_AL = hipervolumen(tabla_HP_KMEANS_AL)
# Valor de Hipervolumen para cada cluster definido
tabla_HP_KMEANS_AL$Hypervolume = resultados_hp_KMEANS_AL
indice_KMEANS_AL = which.max(tabla_HP_KMEANS_AL$Hypervolume)
formato_tablas(tabla_HP_KMEANS_AL)
Method k Dunn Connectivity Silhouette Hypervolume
Kmeans 2 0.16 29.356 0.186 6.909200
Kmeans 3 0.167 49.032 0.114 3.603053
Kmeans 4 0.182 61.634 0.179 5.801803
Kmeans 5 0.182 70.358 0.16 4.640483
Kmeans 6 0.182 82.049 0.178 4.593923
Kmeans 7 0.273 84.617 0.147 5.305755
Kmeans 8 0.182 93.747 0.161 3.418076
Kmeans 9 0.278 94.077 0.204 6.505433
Kmeans 10 0.278 97.056 0.197 5.791564
Kmeans 11 0.263 107.178 0.184 4.394420
Kmeans 12 0.312 105.871 0.204 5.579646
Kmeans 13 0.312 120.874 0.163 3.545876
Kmeans 14 0.312 121.513 0.196 4.020880
Kmeans 15 0.312 123.504 0.213 4.000000

Using the KMEANS algorithm, outstanding performance is observed when using a k value equal to 2. This is reflected in a Dunn Index of 0.16, a Connectivity Index of 29.356, and a Silhouette Index reaching 0.186.

Partitioning Around Medoids (PAM)

The function created for PAM requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo PAM
resultados_PAM_AL = metodo_PAM(distancia_arboles_al,resultados_calidad_clusters_al)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_PAM_AL = data.frame(resultados_PAM_AL)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_PAM_AL)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por PAM
resultados_hp_PAM_AL = hipervolumen(tabla_HP_PAM_AL)

# Se añaden a la tabla los valores del hipervolumen
tabla_HP_PAM_AL$Hypervolume = resultados_hp_PAM_AL
# Se obtiene el indice donde se encuentra el mayor volumen
indice_PAM_AL = which.max(tabla_HP_PAM_AL$Hypervolume)
# Se le da formato para visualizacion a la tabla 
formato_tablas(tabla_HP_PAM_AL)
Method k Dunn Connectivity Silhouette Hypervolume
PAM 2 0.037 40.366 0.147 5.525600
PAM 3 0.037 75.19 0.125 3.504253
PAM 4 0.238 75.699 0.159 8.011703
PAM 5 0.037 102.203 0.11 2.213092
PAM 6 0.091 78.958 0.176 5.334668
PAM 7 0.091 92.685 0.17 4.372711
PAM 8 0.095 93.705 0.176 4.342737
PAM 9 0.105 92.714 0.198 4.881685
PAM 10 0.111 104.319 0.177 3.749230
PAM 11 0.111 106.477 0.199 3.930137
PAM 12 0.111 110.227 0.204 3.654744
PAM 13 0.111 114.653 0.206 3.301103
PAM 14 0.111 118.947 0.204 2.905057
PAM 15 0.278 122.917 0.207 4.000000

Using the PAM algorithm, outstanding performance is observed when using a k value equal to 4. This is reflected in a Dunn Index of 0.238, a Connectivity Index of 75.699, and a Silhouette Index reaching 0.159.

Partitioning Around Medoids (FANNY)

The function created for FANNY requires three input parameters: the distance matrix of the obtained trees and the table where the results will be stored and a value that could invalidate the method; therefore, it will not be considered. In this second case study, the value that invalidates this method is 8. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo FANNY
resultados_FANNY_AL = metodo_FANNY(distancia_arboles_al,resultados_calidad_clusters_al,8)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_FANNY_AL = data.frame(resultados_FANNY_AL)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_FANNY_AL)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por FANNY
resultados_hp_FANNY_AL = hipervolumen(tabla_HP_FANNY_AL)
# Se añaden a la tabla los valores del hipervolumen
tabla_HP_FANNY_AL$Hypervolume = resultados_hp_FANNY_AL
# Se obtiene el indice donde se encuentra el mayor volumen
indice_FANNY_AL = which.max(tabla_HP_FANNY_AL$Hypervolume)
# Se le da formato para visualizacion a la tabla 
formato_tablas(tabla_HP_FANNY_AL)
Method k Dunn Connectivity Silhouette Hypervolume
FANNY 2 0.16 29.356 0.186 5.624800
FANNY 3 0.273 40.519 0.199 10.036917
FANNY 4 0.273 56.528 0.16 5.442024
FANNY 5 0.286 69.8 0.171 5.874538
FANNY 6 0.316 76.038 0.178 6.507560
FANNY 7 0.316 80.912 0.185 6.508198
FANNY 9 0.25 95.945 0.183 4.034583
FANNY 10 0.238 99.666 0.165 2.794404
FANNY 11 0.238 108.712 0.198 3.599330
FANNY 12 0.238 121.033 0.192 2.823301
FANNY 13 0.278 110.548 0.204 3.869819
FANNY 14 0.263 118.256 0.211 3.370887
FANNY 15 0.263 122.857 0.224 3.320600

Using the FANNY algorithm, outstanding performance is observed when using a k value equal to 3. This is reflected in a Dunn Index of 0.273, a Connectivity Index of 40.519, and a Silhouette Index reaching 0.199.

Clustering Large Applications (CLARA)

The function created for CLARA requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. The quality indices Dunn, Connectivity, and Silhouette are calculated for groups ranging from 2 to 15 clusters. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo CLARA
resultados_CLARA_AL = metodo_CLARA(distancia_arboles_al,resultados_calidad_clusters_al)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_CLARA_AL = data.frame(resultados_CLARA_AL)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_CLARA_AL)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por CLARA
resultados_hp_CLARA_AL = hipervolumen(tabla_HP_CLARA_AL)
# Se añaden a la tabla los valores del hipervolumen
tabla_HP_CLARA_AL$Hypervolume = resultados_hp_CLARA_AL
# Se obtiene el indice donde se encuentra el mayor volumen
indice_CLARA_AL = which.max(tabla_HP_CLARA_AL$Hypervolume)
# Se le da formato para visualizacion a la tabla 
formato_tablas(tabla_HP_CLARA_AL)
Method k Dunn Connectivity Silhouette Hypervolume
CLARA 2 0.037 40.366 0.147 5.982800
CLARA 3 0.037 75.19 0.125 3.970136
CLARA 4 0.037 95.7 0.09 2.468554
CLARA 5 0.037 91.068 0.11 2.892241
CLARA 6 0.091 84.964 0.147 5.986360
CLARA 7 0.091 108.002 0.153 4.799857
CLARA 8 0.091 105.431 0.168 5.085115
CLARA 9 0.095 114.312 0.168 4.533829
CLARA 10 0.111 104.752 0.178 5.527884
CLARA 11 0.056 110.538 0.183 3.334021
CLARA 12 0.056 114.519 0.181 2.979904
CLARA 13 0.095 116.588 0.202 4.106557
CLARA 14 0.111 116.45 0.195 4.128088
CLARA 15 0.125 123.833 0.205 4.000000

Using the CLARA algorithm, outstanding performance is observed when using a k value equal to 6. This is reflected in a Dunn Index of 0.091, a Connectivity Index of 84.964, and a Silhouette Index reaching 0.147.

Density-based spatial clustering of applications with noise (DBSCAN)

The function created for DBSCAN requires two input parameters: the distance matrix of the obtained trees and the table where the results will be stored. This method determines the appropriate number of clusters on its own, and based on that value, the quality indices of Dunn, Connectivity, and Silhouette are calculated. The partial results obtained for this particular method are then presented and stored in a table. The function returns various results that will be stored in variables, including the value of the epsilon parameter and its position. With this, a graphical representation of the k-Nearest Neighbor distances will be created.

# Se llama a la funcion que aplica el metodo DBSCAN
resultados_DBSCAN_AL = metodo_DBSCAN(distancia_arboles_al,resultados_calidad_clusters_al)
# Se almacenan el variables los resultados que retorno la funcion
tabla_DBSCAN_AL = resultados_DBSCAN_AL[[1]]
dataframe_DBSCAN_AL = resultados_DBSCAN_AL[[2]]
posicion_eps_AL = resultados_DBSCAN_AL[[3]]
valor_eps_AL = resultados_DBSCAN_AL[[4]]

# Crear el gráfico de líneas
g = ggplot(dataframe_DBSCAN_AL, aes(x = puntos, y = distancia_KNN))
g = g + geom_line(size = 1, color = "#00A499")
g = g + labs(title = "k-Nearest Neighbor distance plot (Parameter eps)",
       x = "Points (example) ordered by KNN distance",
       y = "4-NN distance")
g = g + theme_classic(base_size = 12) + theme(axis.text = element_text(size = 12))
g = g + scale_y_continuous(limits = c(0.2, 0.5),
                           breaks = seq(0.2, 0.5, by = 0.05))

g = g + geom_vline(xintercept = posicion_eps_AL, colour = "#063330", linetype="dashed")
g = g + annotate("text", x = valor_eps_AL, y = 0.19, label = paste("EPS:", posicion_eps_AL), color = "black", hjust = 1.2)

plot(g)

# Se le da formato para visualizacion a la tabla
formato_tablas(tabla_DBSCAN_AL)
Method k Dunn Connectivity Silhouette
DBSCAN 2 0 4.04 0.113

Self-Organising Maps (SOM)

The function created for SOM requires four input parameters: the distance matrix of the obtained trees and the table where the results will be stored and two numerical values, a minimum and a maximum, which will be used within the function to create a sequence of K values corresponding to the number of clusters. In the case of Arackar licanantay, a minimum of 2 and a maximum of 9 were used. The hypervolume is then computed, and the partial results obtained for this particular method are presented.

# Se llama a la funcion que aplica el metodo SOM
resultados_SOM_AL = metodo_SOM(distancia_arboles_al,resultados_calidad_clusters_al,2,9)

# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_S0M_AL = data.frame(resultados_SOM_AL)

# Se eliminan los nombres de las filas de la tabla
rownames(tabla_HP_S0M_AL)=NULL

# Se calcula el hipervolumen con los resultados obtenidos por CLARA
resultados_hp_SOM_AL = hipervolumen(tabla_HP_S0M_AL)
# Se añaden a la tabla los valores del hipervolumen
tabla_HP_S0M_AL$Hypervolume = resultados_hp_SOM_AL
# Se obtiene el indice donde se encuentra el mayor volumen
indice_SOM_AL = which.max(tabla_HP_S0M_AL$Hypervolume)
formato_tablas(tabla_HP_S0M_AL)
Method k Dunn Connectivity Silhouette Hypervolume
SOM 4 0.273 57.107 0.163 8.874470
SOM 9 0.238 93.783 0.188 8.344209
SOM 16 0.238 117.007 0.22 8.851579
SOM 24 0.312 152.331 0.172 6.032925
SOM 31 0.333 161.904 0.202 7.016394
SOM 40 0.286 194.579 0.158 3.222544
SOM 48 0.167 202.608 0.139 1.480472
SOM 54 0.125 218.736 0.175 1.444400

Using the SOM algorithm, outstanding performance is observed when using a k value equal to 4. This is reflected in a Dunn Index of 0.273, a Connectivity Index of 57.107, and a Silhouette Index reaching 0.163.

Minimum Spanning Tree with K Nearest Neighbor (MST-kNN)

The function created for MST-KNN requires three input parameters: the distance matrix of the obtained trees and the table where the results will be stored and the trees corresponding to the case study. This method determines the appropriate number of clusters on its own, and based on that value, the quality indices of Dunn, Connectivity, and Silhouette are calculated. The partial results obtained for this particular method are then presented and stored in a table.

# Se llama a la funcion que aplica el metodo MSTKNN
resultados_MSTKNN_AL = metodo_MSTKNN(distancia_arboles_al,resultados_calidad_clusters_al,arboles_al)
## 
##  Only there is  95 nodes in solutions. Clustering solution only will have these nodes.
# Se convierte a dataframe la tabla obtenida en la funcion anterior
tabla_HP_MSTKNN_AL = data.frame(resultados_MSTKNN_AL)
# Se le da formato para visualizacion a la tabla
formato_tablas(tabla_HP_MSTKNN_AL)
Method k Dunn Connectivity Silhouette
MST-kNN 10 0.24 92.864 0.009

2.4.5 Choosing the best method

Now, all the results from each method, along with their respective indices per cluster, are stored in a single table. The hypervolume is recalculated, allowing the determination of which method and the optimal number of clusters should be applied to the dataset corresponding to the dinosaur Arackar licanantay.

# Se anidan todos los resultados obtenidos por cada metodo en una sola tabla
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,resultados_KMEANS_AL)
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,resultados_PAM_AL)
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,resultados_FANNY_AL)
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,resultados_CLARA_AL)
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,tabla_DBSCAN_AL)
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,resultados_SOM_AL)
resultados_calidad_clusters_al = rbind(resultados_calidad_clusters_al,resultados_MSTKNN_AL)

# Se convierte a dataframe la tabla que almacena todos los resultados 
tabla_HP_FINAL_AL = data.frame(resultados_calidad_clusters_al)
rownames(tabla_HP_FINAL_AL)=NULL
# Se almacenan los resultados en una variable temporal
resultados_finales_AL = hipervolumen(tabla_HP_FINAL_AL)
# Se añaden a la tabla final los valores retornados por la funcion hipervolumen
tabla_HP_FINAL_AL$Hypervolume = resultados_finales_AL
# Se obtiene el indice que representa un mayor hipervolumen
indice_final_AL = which.max(tabla_HP_FINAL_AL$Hypervolume)
# Se le da formato de visualizacion a la tabla
formato_tablas(tabla_HP_FINAL_AL)
Method k Dunn Connectivity Silhouette Hypervolume
Kmeans 2 0.16 29.356 0.186 10.161065
Kmeans 3 0.167 49.032 0.114 7.925665
Kmeans 4 0.182 61.634 0.179 9.406623
Kmeans 5 0.182 70.358 0.16 8.647123
Kmeans 6 0.182 82.049 0.178 8.693654
Kmeans 7 0.273 84.617 0.147 9.241977
Kmeans 8 0.182 93.747 0.161 7.871618
Kmeans 9 0.278 94.077 0.204 10.316524
Kmeans 10 0.278 97.056 0.197 9.948173
Kmeans 11 0.263 107.178 0.184 9.013342
Kmeans 12 0.312 105.871 0.204 10.187166
Kmeans 13 0.312 120.874 0.163 8.655478
Kmeans 14 0.312 121.513 0.196 9.308616
Kmeans 15 0.312 123.504 0.213 9.535850
PAM 2 0.037 40.366 0.147 6.679912
PAM 3 0.037 75.19 0.125 5.653608
PAM 4 0.238 75.699 0.159 9.514031
PAM 5 0.037 102.203 0.11 4.893700
PAM 6 0.091 78.958 0.176 7.183235
PAM 7 0.091 92.685 0.17 6.728163
PAM 8 0.095 93.705 0.176 6.810109
PAM 9 0.105 92.714 0.198 7.316841
PAM 10 0.111 104.319 0.177 6.721743
PAM 11 0.111 106.477 0.199 6.987564
PAM 12 0.111 110.227 0.204 6.919215
PAM 13 0.111 114.653 0.206 6.784975
PAM 14 0.111 118.947 0.204 6.589217
PAM 15 0.278 122.917 0.207 8.920515
FANNY 2 0.16 29.356 0.186 10.161065
FANNY 3 0.273 40.519 0.199 12.426558
FANNY 4 0.273 56.528 0.16 10.667162
FANNY 5 0.286 69.8 0.171 10.722958
FANNY 6 0.316 76.038 0.178 11.143173
FANNY 7 0.316 80.912 0.185 11.078852
FANNY 9 0.25 95.945 0.183 9.288466
FANNY 10 0.238 99.666 0.165 8.492307
FANNY 11 0.238 108.712 0.198 8.903242
FANNY 12 0.238 121.033 0.192 8.349504
FANNY 13 0.278 110.548 0.204 9.411250
FANNY 14 0.263 118.256 0.211 9.015655
FANNY 15 0.263 122.857 0.224 9.061936
CLARA 2 0.037 40.366 0.147 6.679912
CLARA 3 0.037 75.19 0.125 5.653608
CLARA 4 0.037 95.7 0.09 4.719947
CLARA 5 0.037 91.068 0.11 5.058007
CLARA 6 0.091 84.964 0.147 6.525662
CLARA 7 0.091 108.002 0.153 6.135619
CLARA 8 0.091 105.431 0.168 6.376942
CLARA 9 0.095 114.312 0.168 6.199214
CLARA 10 0.111 104.752 0.178 6.730307
CLARA 11 0.056 110.538 0.183 5.807516
CLARA 12 0.056 114.519 0.181 5.646242
CLARA 13 0.095 116.588 0.202 6.437964
CLARA 14 0.111 116.45 0.195 6.495476
CLARA 15 0.125 123.833 0.205 6.634823
DBSCAN 2 0 4.04 0.113 5.934800
SOM 4 0.273 57.107 0.163 10.738350
SOM 9 0.238 93.783 0.188 9.273299
SOM 16 0.238 117.007 0.22 8.666540
SOM 24 0.312 152.331 0.172 7.030628
SOM 31 0.333 161.904 0.202 6.923101
SOM 40 0.286 194.579 0.158 4.443683
SOM 48 0.167 202.608 0.139 2.889340
SOM 54 0.125 218.736 0.175 2.437346
MST-kNN 10 0.24 92.864 0.009 5.039289

A better performance is observed when using the FANNY algorithm, with a K value equal to 3. This is reflected in a Dunn Index of 0.273, a Connectivity Index of 29.356, and a Silhouette Index of 0.199.

2.4.6 Identification of medoids

For this study group, the FANNY method will be applied with three clusters. The consensus medoid tree for the entire study group is identified, as well as the medoid trees for each of the three clusters.

# Aplicación de FANNY con k = 3
set.seed(2)
clusters = fanny(distancia_arboles_al, k = 3, memb.exp = 1.1)

# Identificación de medoide de todo el conjunto
arbol_medoide_al = names(which.min(rowMeans(distancia_arboles_al)))

# Identificación de medoide de grupo 1
ID_g1 = as.numeric(names(which(clusters$cluster==1)))
arbol_medoide_al_g1 =names(which.min(rowMeans(distancia_arboles_al[ID_g1,ID_g1])))

# Identificación de medoide de grupo 2
ID_g2 = as.numeric(names(which(clusters$cluster==2)))
arbol_medoide_al_g2 =names(which.min(rowMeans(distancia_arboles_al[ID_g2,ID_g2])))

# Identificación de medoide de grupo 3
ID_g3 = as.numeric(names(which(clusters$cluster==3)))
arbol_medoide_al_g3 =names(which.min(rowMeans(distancia_arboles_al[ID_g3,ID_g3])))

The representative tree of the entire dataset is the 92, while the representative trees of the 3 clusters are:

  • Group 1: 90 
  • Group 2: 15 
  • Group 3: 72

2.4.6 Phylogenetic forest

Now, the phylogenetic forest is plotted for the entire study group, allowing the approximate distribution of the previously obtained topologies for each of the three clusters to be visualized in two dimensions.

# Clustering
set.seed(2)
resultados = fanny(distancia_arboles_al, k = 3, memb.exp = 1.1)

# Visualización
coordenadas=fviz_cluster(resultados, data = distancia_arboles_al)$data
g = fviz_cluster(resultados, data = distancia_arboles_al, ellipse.type = "norm",
                 labelsize = 10, repel = TRUE, show.clust.cent = FALSE, ggtheme = theme_classic(),
                 pointsize = 3, shape = "circle", palette = c("#00A499","#695c59","#c7254e"))
g = g + labs(title = "Phylogenetic tree - Arackar licanantay")
g = g + theme(axis.text = element_text(size = 12), legend.position = "bottom")


#Medoide del árbol general (árbol 92)
g = g + geom_segment(x=coordenadas[92,]$x-1,xend=coordenadas[92,]$x+1,y=coordenadas[92,]$y,yend=coordenadas[92,]$y,
                     linetype = "dashed",color = "black")
g = g + geom_segment(x=coordenadas[92,]$x,xend=coordenadas[92,]$x,y=coordenadas[92,]$y-1,yend=coordenadas[92,]$y+1,
                     linetype = "dashed", color = "black")
g = g + geom_text(aes(x = coordenadas[92,]$x, y = coordenadas[92,]$y, label = "Medoid tree"),
                  vjust = -0.5, hjust = -0.2, size = 4, color = "black")

# Medoide cluster 1 (Arbol 90)
g = g + geom_segment(x=coordenadas[90,]$x-1,xend=coordenadas[90,]$x+1,y=coordenadas[90,]$y,yend=coordenadas[90,]$y,
                     linetype = "dashed", color = "#00635d")
g = g + geom_segment(x=coordenadas[90,]$x,xend=coordenadas[90,]$x,y=coordenadas[90,]$y-1,yend=coordenadas[90,]$y+1,
                     linetype = "dashed", color = "#00635d")
g = g + geom_text(aes(x = coordenadas[90,]$x, y = coordenadas[90,]$y, label = "Medoid cluster 1"),
                  vjust = -1, hjust = -0.1, size = 4, color = "#00635d")

# #Medoide cluster 2 (Árbol 15)
g = g + geom_segment(x=coordenadas[15,]$x-1,xend=coordenadas[15,]$x+1,y=coordenadas[15,]$y,yend=coordenadas[15,]$y,
                     linetype = "dashed", color = "#3d1308")
g = g + geom_segment(x=coordenadas[15,]$x,xend=coordenadas[15,]$x,y=coordenadas[15,]$y-1,yend=coordenadas[15,]$y+1,
                     linetype = "dashed", color = "#3d1308")
g = g + geom_text(aes(x = coordenadas[15,]$x, y = coordenadas[15,]$y, label = "Medoid cluster 2"),
                  vjust = -1, hjust = -0.1, size = 4, color = "#3d1308")

# #Medoide cluster 3 (Árbol 72)
g = g + geom_segment(x=coordenadas[72,]$x-1,xend=coordenadas[72,]$x+1,y=coordenadas[72,]$y,yend=coordenadas[72,]$y,
                     linetype = "dashed", color = "#c7254e")
g = g + geom_segment(x=coordenadas[72,]$x,xend=coordenadas[72,]$x,y=coordenadas[72,]$y-1,yend=coordenadas[72,]$y+1,
                     linetype = "dashed", color = "#c7254e")
g = g + geom_text(aes(x = coordenadas[72,]$x, y = coordenadas[72,]$y, label = "Medoid cluster 3"),
                  vjust = 0, hjust = -0.2, size = 4, color = "#c7254e")
plot(g)

2.4.7 Representative trees

Once the medoids for each cluster and the medoid for the entire dataset have been identified, they are plotted to visualize the distribution of each species in the topology. This allows for analysis and comparison among the four medoid trees identified in the pipeline and also enables a comparison of our consensus medoid tree with the one from the literature. Additionally, each tree is assigned a color-coded graph indicating the support of each node: red represents low support, while green indicates high node support.

Medoid tree of the entire study group

# Agregar el soporte por nodo
arbol_medoide_al_conf= addConfidences(arboles_al[92],arboles_al)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_al_conf[[1]]$node.label))
arbol_medoide_al_conf[[1]]$node.label[idx_NA]=1

# Graficar árbol medoide

# Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(97)
posiciones_color = round(arbol_medoide_al_conf[[1]]$node.label*97,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_al_conf[[1]]$tip.label))
colores_especies[88] = "#c7254e"

# Gráfico
g = ggtree(arbol_medoide_al_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Representative consensus tree (medoid) with node support (97 trees)")
g = g + theme(legend.position = "top", aspect.ratio = 0.3) 

plot(g)

Medoid tree of group 1

# Agregar el soporte por nodo
arbol_medoide_al_conf= addConfidences(arboles_al[90],arboles_al)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_al_conf[[1]]$node.label))
arbol_medoide_al_conf[[1]]$node.label[idx_NA]=1

# Graficar árbol medoide

# Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(97)
posiciones_color = round(arbol_medoide_al_conf[[1]]$node.label*97,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_al_conf[[1]]$tip.label))
colores_especies[88] = "#c7254e"

# Gráfico
g = ggtree(arbol_medoide_al_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Medoid tree of cluster 1 with node support (97 trees)")
g = g + theme(legend.position = "top", aspect.ratio = 0.3) 

plot(g)

Medoid tree of group 2

# Agregar el soporte por nodo
arbol_medoide_al_conf= addConfidences(arboles_al[15],arboles_al)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_al_conf[[1]]$node.label))
arbol_medoide_al_conf[[1]]$node.label[idx_NA]=1

# Graficar árbol medoide

# Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(97)
posiciones_color = round(arbol_medoide_al_conf[[1]]$node.label*97,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_al_conf[[1]]$tip.label))
colores_especies[88] = "#c7254e"

# Gráfico
g = ggtree(arbol_medoide_al_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Medoid tree of cluster 2 with node support (97 trees)")
g = g + theme(legend.position = "top", aspect.ratio = 0.3) 

plot(g)

Medoid tree of group 3

# Agregar el soporte por nodo
arbol_medoide_al_conf= addConfidences(arboles_al[72],arboles_al)

# Convertir el soporte del nodo de raiz que es NA a 1
idx_NA = which(is.na(arbol_medoide_al_conf[[1]]$node.label))
arbol_medoide_al_conf[[1]]$node.label[idx_NA]=1

# Graficar árbol medoide

# Ajuste de color
paleta_de_colores = colorRampPalette(c("red3", "mediumseagreen"))(97)
posiciones_color = round(arbol_medoide_al_conf[[1]]$node.label*97,9)
colores = paleta_de_colores[posiciones_color]
colores_especies = rep("black",length(arbol_medoide_al_conf[[1]]$tip.label))
colores_especies[88] = "#c7254e"

# Gráfico
g = ggtree(arbol_medoide_al_conf, layout = "dendrogram",ladderize=T)
g = g + geom_tiplab(size = 4, color = colores_especies)  # Etiquetas en los nodos terminales 
g = g + theme(plot.margin = margin(t = 0, r = 0, b = 150, l = 0, unit = "pt"), legend.position = "top")
g = g + geom_nodepoint(size = 4, color = colores) + theme(legend.position='top')
g = g + labs(title = "Medoid tree of cluster 3 with node support (97 trees)")
g = g + theme(legend.position = "top", aspect.ratio = 0.3) 

plot(g)

As a result, compared to the previous case (Burkesuchus mallingrandensis), several nodes exhibit lower support, making evolutionary relationships more challenging and ambiguous to explain. The first notable aspect across these four topologies is that the case study, Arackar licanantay, forms a dichotomy with Isisaurus colberti. However, in medoid tree 15, cluster 2, its position shifts to the end of the topology. The clearest evolutionary relationship involving Arackar licanantay is its close affinity with Isisaurus colberti and, to a lesser extent, with Rapetosaurus krausei. When comparing the medoid tree obtained through the proposed pipeline with the one partially published in the literature, notable differences emerge. For instance, medoid 92 lacks polytomies, and certain evolutionary relationships differ, such as those involving Quetecsaurus rusconii and Frutalongkosaurus dukei, among others. However, the topology reported in the literature shares similarities with the final section of the medoid tree obtained by the pipeline for Arackar licanantay.

For Arackar licanantay, the species closest to the dinosaur were taken into consideration by the researchers when publishing a part of the consensus tree. This section matches the species found between Chubutisaurus insignis and Neuquensaurus australis in the consensus topology 92 selected by the pipeline for the second case study, with the exception of Nemegtosaurus mongoliensis and Trigonosaurus pricei, two species that are absent in the published topology section, which could be found elsewhere in the tree, because they were nodes with low support.

2.4.8 Temporality to the representative tree

Now that the consensus medoid tree for the study group (topology 92) has been identified, we proceed to calculate the Hamming distance for the sequences obtained using the TNT tool. Additionally, by integrating them with the created database, we incorporate temporality into the consensus medoid tree using the “Strap” library in R, allowing us to visualize the historical period in which each species lived. Thanks to this visualization it can be inferred that the case study Arackar licanantay lived in between the Campanian and Maastrichtian geological ages, corresponding to the last ages of the Cretaceous period.

anotaciones_al=read.csv("Datos/Arackar/anotaciones_arackar_age.csv",sep=";",header=T)

# Cálculo de distancia Hamming entre secuencias
distancia_secuencias = dist.hamming(secuencias_al)

#Agrega la distancia
arbol_medoide_al_tiempo= nnls.tree(as.matrix(distancia_secuencias), arbol_medoide_al_conf[[1]],rooted = F)

#Convierto distancia negativas a positivas
idx = which(arbol_medoide_al_tiempo$edge.length<0)
arbol_medoide_al_tiempo$edge.length[idx] = 0

#Crear tiempo
ages=anotaciones_al[,c(7,13)]
rownames(ages)=anotaciones_al[,1]
names(ages)=c("FAD","LAD")

#Agrego el tiempo
arbol_medoide_al_tiempo = DatePhylo(arbol_medoide_al_tiempo, ages, method="equal", rlen=0.1)

#Grafico
geoscalePhylo(tree=arbol_medoide_al_tiempo, ages=ages, boxes="Age", cex.age = 0.8,cex.tip=0.85, cex.ts = 0.9,
              tick.scale="Age",units = c("Eon","Era","Period","Epoch","Age"),erotate = 90,urotate = 0,
              direction = "rightwards", quat.rm = T, vers = "ICS2013", label.offset = 0.001)

3 References

Al Abid, Faisal Bin. 2014. “A Novel Approach for PAM Clustering Method.” International Journal of Computer Applications 86 (17).
Bryant, D., and P. Waddell. 1998. “Rapid Evaluation of Least-Squares and Minimum-Evolution Criteria on Phylogenetic Trees.” Molecular Biology and Evolution 15 (10): 1346–59. https://doi.org/10.1093/oxfordjournals.molbev.a025863.
Cheng, Dongdong, Cheng Zhang, Ya Li, Shuyin Xia, Guoyin Wang, Jinlong Huang, Sulan Zhang, and Jiang Xie. 2024. “GB-DBSCAN: A Fast Granular-Ball Based DBSCAN Clustering Algorithm.” Information Sciences 674: 120731. https://doi.org/https://doi.org/10.1016/j.ins.2024.120731.
Cretulescu, Radu, Daniel Morariu, Macarie Breazu, and Daniel Volovici. 2019. “DBSCAN Algorithm for Document Clustering.” International Journal of Advanced Statistics and IT&C for Economics and Life Sciences 9 (June): 58–66. https://doi.org/10.2478/ijasitels-2019-0007.
Cunningham, John A., Imran A. Rahman, Stephan Lautenschlager, Emily J. Rayfield, and Philip C. J. Donoghue. 2014. “A Virtual World of Paleontology.” Trends in Ecology & Evolution 29 (6): 347–57. https://doi.org/10.1016/j.tree.2014.04.004.
Dolven, Jane, and Hans Skjerpen. 2011. “Paleoinformatics: Past, Present and Future Perspectives.” In, 45–52. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-16271-8_3.
Fahim, Ahmed. 2023. “A Varied Density-Based Clustering Algorithm.” Journal of Computational Science 66: 101925. https://doi.org/https://doi.org/10.1016/j.jocs.2022.101925.
Flores, Jorge R., Samuli Lehtonen, and Jaakko Hyvönen. 2021. “Exploring the Impact of Unstable Terminals on Branch Support Values in Paleontological Data.” Paleobiology 47 (3): 432–45. https://doi.org/10.1017/pab.2020.64.
Goloboff, Pablo A., Ambrosio Torres, and J. Salvador Arias. 2017. “Weighted Parsimony Outperforms Other Methods of Phylogenetic Inference Under Models Appropriate for Morphology.” Cladistics 34 (4): 407–37. https://doi.org/10.1111/cla.12205.
Gonzalo, Sergio, Joan Manuel Marquès, Alberto García-Villoria, Javier Panadero, and Laura Calvet. 2022. “CLARA: A Novel Clustering-Based Resource-Allocation Mechanism for Exploiting Low-Availability Complementarities of Voluntarily Contributed Nodes.” Future Generation Computer Systems 128: 248–64. https://doi.org/https://doi.org/10.1016/j.future.2021.10.002.
Gori, Kevin, Tomasz Suchan, Nadir Alvarez, Nick Goldman, and Christophe Dessimoz. 2015. “Clustering Genes of Common Evolutionary History.” Molecular Biology and Evolution 33: 1590–1605. https://api.semanticscholar.org/CorpusID:14386985.
Gupta, Tanvi, and Supriya Panda. 2019. “A Comparison of k-Means Clustering Algorithm and CLARA Clustering Algorithm on Iris Dataset” 7 (February): 4766–68. https://doi.org/10.14419/ijet.v7i4.21472.
Heled, Joseph, and Remco R. Bouckaert. 2013. “Looking for Trees in the Forest: Summary Tree from Posterior Samples.” BMC Evolutionary Biology 13: 221–21. https://api.semanticscholar.org/CorpusID:7975266.
Huson, Daniel H., and Banu Cetinkaya. 2023. “Visualizing Incompatibilities in Phylogenetic Trees Using Consensus Outlines.” Frontiers in Bioinformatics 3. https://doi.org/10.3389/fbinf.2023.1155286.
Inostroza-Ponta, Mario, Alexandre Mendes, Regina Berretta, and Pablo Moscato. 2007. “An Integrated QAP-Based Approach to Visualize Patterns of Gene Expression Similarity.” In, 4828:156–67. https://doi.org/10.1007/978-3-540-76931-6_14.
Kornilios, Panagiotis. 2017. “Polytomies, Signal and Noise: Revisiting the Mitochondrial Phylogeny and Phylogeography of the Eurasian Blindsnake Species Complex (Typhlopidae, Squamata).” Zoologica Scripta 46 (6): 665–74. https://doi.org/10.1111/zsc.12243.
Kubatko, L. S. 2008. “Inference of Phylogenetic Trees.” In Tutorials in Mathematical Biosciences IV: Evolution and Ecology, edited by Avner Friedman, 1–38. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-74331-6_1.
Larson, Drew A., Joseph F. Walker, Oscar M. Vargas, and Stephen A. Smith. 2020. “A Consensus Phylogenomic Approach Highlights Paleopolyploid and Rapid Radiation in the History of Ericales.” American Journal of Botany 107 (5): 773–89. https://doi.org/10.1002/ajb2.1469.
Lawing, A. Michelle, and Nicholas J. Matzke. 2014. “Conservation Paleobiology Needs Phylogenetic Methods.” Ecography 37 (11): 1109–22. https://doi.org/10.1111/ecog.00783.
Liu, Fang AND Chen, Wenquan AND Lu. 2021. Site-specific management zones based on geostatistical and fuzzy clustering approach in a coastal reclaimed area of abandoned salt pan.” Chilean Journal of Agricultural Research 81 (September): 420–33. http://www.scielo.cl/scielo.php?script=sci_arttext&pid=S0718-58392021000300420&nrm=iso.
López-Antoñanzas, Raquel, Jonathan Mitchell, Tiago R. Simões, Fabien L. Condamine, Robin Aguilée, Pablo Peláez-Campomanes, Sabrina Renaud, Jonathan Rolland, and Philip C. J. Donoghue. 2022. “Integrative Phylogenetics: Tools for Palaeontologists to Explore the Tree of Life.” Biology 11 (8): 1185. https://doi.org/10.3390/biology11081185.
Lovegrove, Jack, Paul Upchurch, and Paul M. Barrett. 2024. “Untangling the Tree or Unravelling the Consensus? Recent Developments in the Quest to Resolve the Broad-Scale Relationships Within Dinosauria.” Journal of Systematic Palaeontology 22 (1). https://doi.org/10.1080/14772019.2024.2345333.
Marshall, C. R., S. Finnegan, E. C. Clites, P. A. Holroyd, N. Bonuso, C. Cortez, E. Davis, et al. 2018. “Quantifying the Dark Data in Museum Fossil Collections as Palaeontology Undergoes a Second Digital Revolution.” Biology Letters 14 (9): 20180431.
Miljkovic, Dubravko. 2017. “Brief Review of Self-Organizing Maps.” In. https://doi.org/10.23919/MIPRO.2017.7973581.
Müller, Rodrigo Temp, and Sérgio Dias-da-Silva. 2017. “Taxon Sample and Character Coding Deeply Impact Unstable Branches in Phylogenetic Trees of Dinosaurs.” Historical Biology 31 (8): 1089–92. https://doi.org/10.1080/08912963.2017.1418341.
Novas, Fernando E., Federico L. Agnolin, Gabriel L. Lio, Sebastián Rozadilla, Manuel Suárez, Rita de la Cruz, Ismar de Souza Carvalho, David Rubilar-Rogers, and Marcelo P. Isasi. 2021. “New Transitional Fossil from Late Jurassic of Chile Sheds Light on the Origin of Modern Crocodiles.” Scientific Reports 11 (1): 1–13.
Oti, Eric, Michael Olusola, Francis Eze, and Samuel Enogwe. 2021. “Comprehensive Review of k-Means Clustering Algorithms.” International Journal of Advances in Scientific Research and Engineering 07 (January): 64–69. https://doi.org/10.31695/IJASRE.2021.34050.
Parraga-Alava, Jorge, Pablo Moscato, and Mario Inostroza-Ponta. 2023. Mstknnclust: MST-kNN Clustering Algorithm (version 0.3.2). https://trove.nla.gov.au/work/28729389?selectedversion=NBD44634158.
Rajkumar, K. Varada, Yesu Adimulam, and Subrahmanyam Kodukula. 2019. “Fuzzy Clustering and Fuzzy c-Means Partition Cluster Analysis and Validation Studies on a Subset of CiteScore Dataset.” International Journal of Electrical and Computer Engineering (IJECE) 9 (August): 2760. https://doi.org/10.11591/ijece.v9i4.pp2760-2770.
Rubilar, David, Alexander O. Vargas, Bernardo González Riga, Sergio Soto-Acuña, Jhonatan Alarcón-Muñoz, José Iriarte-Dı́az, Carlos Arévalo, and Carolina S. Gutstein. 2021. “Arackar Licanantay Gen. Et Sp. Nov. A New Lithostrotian (Dinosauria, Sauropoda) from the Upper Cretaceous of the Atacama Region, Northern Chile.” Cretaceous Research 124: 104802.
Sansom, Robert S., Peter G. Choate, Joseph N. Keating, and Emma Randle. 2018. “Parsimony, Not Bayesian Analysis, Recovers More Stratigraphically Congruent Phylogenetic Trees.” Biology Letters 14 (6): 20180263. https://doi.org/10.1098/rsbl.2018.0263.
Schrago, Carlos G., Barbara O. Aguiar, and Beatriz Mello. 2018. “Comparative Evaluation of Maximum Parsimony and Bayesian Phylogenetic Reconstruction Using Empirical Morphological Data.” Journal of Evolutionary Biology 31 (10): 1477–84. https://doi.org/10.1111/jeb.13344.
Schubert, Erich, and Peter J Rousseeuw. 2021. “Fast and Eager k-Medoids Clustering: O (k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms.” Information Systems 101: 101804.
Sinaga, Kristina, and Miin-Shen Yang. 2020. “Unsupervised k-Means Clustering Algorithm.” IEEE Access PP (April): 1–1. https://doi.org/10.1109/ACCESS.2020.2988796.
Villalobos-Cid, Manuel, Marcio Dorn, Rodrigo Ligabue-Braun, and Mario Inostroza-Ponta. 2019. “A Memetic Algorithm Based on an NSGA-II Scheme for Phylogenetic Tree Inference.” IEEE Transactions on Evolutionary Computation 23 (5): 776–87. https://doi.org/10.1109/tevc.2018.2883888.
Villalobos-Cid, Manuel, César Rivera, Eduardo I. Kessi-Pérez, and Mario Inostroza-Ponta. 2022. “A Multi-Modal Algorithm Based on an NSGA-II Scheme for Phylogenetic Tree Inference.” Biosystems 213: 104606. http://dx.doi.org/10.1016/j.biosystems.2022.104606.
Villalobos-Cid, Manuel, Francisco Salinas, and Mario Inostroza-Ponta. 2020. “Total Evidence or Taxonomic Congruence? A Comparison of Methods for Combining Biological Evidence.” Journal of Bioinformatics and Computational Biology 18 (06): 2050040. https://doi.org/10.1142/s0219720020500407.
Wilkinson, M., and Michael J. Benton. 1996. “Sphenodontid Phylogeny and the Problems of Multiple Trees.” Philosophical Transactions of the Royal Society B 351: 1–16. https://api.semanticscholar.org/CorpusID:84900142.
WILLIAMS, DAVID M., and MALTE C. EBACH. 2010. “Molecular Systematics and the ‘Blender of Optimization’: Is There a Crisis in Systematics?” Systematics and Biodiversity 8 (4): 481–84. http://dx.doi.org/10.1080/14772000.2010.530303.
Yang, Le, Zhongbin Ouyang, and Yong Shi. 2012. “A Modified Clustering Method Based on Self-Organizing Maps and Its Applications.” Procedia Computer Science 9: 1371–79. https://doi.org/https://doi.org/10.1016/j.procs.2012.04.151.