Este pipeline es para hacer una reconstrucción ancestral de secuencias y crear un grafo con toda la información de las transiciones.

Obtención de datos

En este pipe voy a utilizar un clado de 11 especies:

File	Accesion
Acaryochloris_marina_MBIC11017.gbff	GCF_000018105.1
Acaryochloris_marina_S15.gbff	GCF_018336915.1
Acaryochloris_sp__Moss_Beach.gbff	GCF_021497025.1
Gloeomargarita_lithophora_Alchichica-_D10.gbff	GCF_001870225.1
Synechococcus_sp_PCC_6312.gbff	GCF_000316685.1
Thermostichus_lividus_PCC_6715.gbff	GCF_002754935.1
Thermosynechococcus_elongatus_PKUAC-SCTE542.gbff	GCF_003555505.1
Thermosynechococcus_sp_a_NK55.gbff	GCF_000505665.1
Thermosynechococcus_sp_CL-1.gbff	GCF_008386235.1
Thermosynechococcus_sp_HN-54.gbff	GCF_023650955.1
Thermosynechococcus_sp_TA-1.gbff	GCF_017086385.1
Thermosynechococcus_vestitus_BP-1.gbff	GCF_000011345.1

Ortólogos

Primero obtengo los ortólogos usando get_homologues:

## "/home/lalibelulalo/PIPELINES_2023/ASR_ORTHOLOGUES/Thermosynechococcus_clade/"
get_homologues.pl -d gbff -t 0 -M -n PPN

Obtengo el pangenoma:

compare_clusters.pl -o pangenoma -m -d gbff_homologues/ThermosynechococcusspaNK55GCF000505665_f0_0taxa_algOMCL_e0_

Obtengo una lista de ortólogos (orthologues.list) la cual contenga solo aquellos en los que dicho ortólogo esta en las 11 especies del clado.

awk -F "," '{if($15==1 && $16==1 && $17==1 && $18==1 && $19==1 && $20==1 && $21==1 && $22==1 && $23==1 && $24==1 && $25==1 && $26==1) {print $1}}' pangenoma/pangenome_matrix_t0.tr.csv | sed 's/.faa/.fna/g' >orthologues.list

Conteo de palindromos

Hago un conteo de palindromos para saber cuales ortólogos contienen los palindromos de interés. Para eso uso orthologues.list y una lista con los palindromos de interes: pals.list.

Para esto uso parte del código del conteo de palindromos.

Pals.list contiene los palindromos a contar. Estos van en un a sola linea y separados por comas. El numero final corresponde al orden del modelo de markov (0-3) sin embargo es irrelevante para este paso ya que solo queremos saber el numero observado de palindromos.

python3 CountPalsInOrthologues.py gbff_homologues/ThermosynechococcusspaNK55GCF000505665_f0_0taxa_algOMCL_e0_ pals.list orthologues.list 3

Filtrado del conteo

Para empezar escojo la especie con mayor conteo de palindromos segun la filogenia anotada (Synechococcus_sp_PCC_6312). Sin embargo tambien me aseguro de que dicha especie sea la que tenga el número de sitios:

Counts <- read.table("Markov_count_CAGGCCTG_gbff_homologuesThermosynechococcusspaNK55GCF000505665_f0_0taxa_algOMCL_e0__2023-4-20_13hrs47mins_Octanuc_M3_.txt", sep = "\t", header = TRUE)

Spp <- Counts$spp
Spp <- unique(Spp)

i=0
for (n in 1:length(Spp)){ 
  for (m in 1:length(Counts$spp)) {
    if (Spp[n] == Counts[m,2]){
      i=i+Counts[m,5]
    }
  }
  print(paste0(Spp[n]," ",i))
  i=0
}

## [1] "Acaryochloris_marina_MBIC11017 6"
## [1] "Acaryochloris_marina_S15 14"
## [1] "Acaryochloris_sp_Moss_Beach 17"
## [1] "Gloeomargarita_lithophora_Alchichica-D10 0"
## [1] "Synechococcus_sp_PCC_6312 377"
## [1] "Thermostichus_lividus_PCC_6715 8"
## [1] "Thermosynechococcus_elongatus_PKUAC-SCTE542 12"
## [1] "Thermosynechococcus_sp_CL-1 10"
## [1] "Thermosynechococcus_sp_HN-54 14"
## [1] "Thermosynechococcus_sp_TA-1 13"
## [1] "Thermosynechococcus_sp_NK55a 11"
## [1] "Thermosynechococcus_vestitus_BP-1 18"

Aqui podemos ver que Synechococcus_sp_PCC_6312 es la que tiene mayores sitios (377). Por lo tanto uso: Synechococcus_sp_PCC_6312.

Solo quiero aquellos donde Synechococcus_sp_PCC_6312 tenga un conteo mayor o igual a 1 ya que en ese sitio es donde voy a reconstruir la secuencia ancestral

awk '{if ($5!=0 && $2=="Synechococcus_sp_PCC_6312"){print $1}}' Markov_count_CAGGCCTG_gbff_homologuesThermosynechococcusspaNK55GCF000505665_f0_0taxa_algOMCL_e0__2023-4-20_13hrs47mins_Octanuc_M3_.txt >Ortologos_CAGGCCTG_Synechococcus_sp_PCC_6312.txt

Ortologos_CAGGCCTG_Synechococcus_sp_PCC_6312.txt contiene 306 ortólogos los cuales estan presentes en las 11 especies y tienen por lo menos una ocurrencia del palindromo CAGGCCTG

###
39183_lipA.fna
39193_GNAT_family_N-acetyl...fna
39229_hypothetical_protein.fna
39242_hypothetical_protein.fna
39246_zds.fna
39271_ATP-dependent_Clp_pr...fna
39288_glycosyl_hydrolase-r...fna
39297_histidinol-phosphate...fna
39309_ndk.fna
39310_PhzF_family_phenazin...fna
etc ...
###

Preparación de ortólogos para la alineción

Primero creo una carpeta que contendrá unicamente los ortologos que vamos a usar. Es decir aquellos que estan presentes en todas las especies a analizar y que contienen por lo menos un sitio con palíndromo.

# /home/biocomp/HIP_2023/PIPELINES_2023/ASR_ORTHOLOGUES/Geminocystis_clade
mkdir Orthologues_CAGGCCTG_Synechococcus_sp_PCC_6312  ## Creo Carpeta
cd gbff_homologues/ThermosynechococcusspaNK55GCF000505665_f0_0taxa_algOMCL_e0_ ## Entro a la carpeta de homólogos
for word in $(cat ../../Ortologos_CAGGCCTG_Synechococcus_sp_PCC_6312.txt); do cp $word ../../Orthologues_CAGGCCTG_Synechococcus_sp_PCC_6312; done ## copio los ortólogos de la lista a la carpeta

Filtrado de paralogos

voy a crear dos carpetas una que contenga archivos con paralogos y otra que solo tenga ortólogos

mkdir Only_ORTHOLOGUES
mkdir PARALOGUES

Para filtrar aquellos con paralogos uso el siguiente scripty de python:

from Bio import SeqIO
import re
import os
import sys
import shutil

SEQUENCES_PATH = 'Orthologues_CAGGCCTG_Synechococcus_sp_PCC_6312' ## Path de la carpeta con los ortólogos.
#output_file = 'Orthologues_Palindrome_sites.txt' ## Nombre del archivo de salida
#output = open (output_file, 'w') ## Abrimos el archivo de salida

if SEQUENCES_PATH.endswith("/"): ## Esta parte la pongo por si corro este codigo en la terminal
    SequencesPath = re.sub('/', '', SEQUENCES_PATH) ## De este modo evito errores en el argumento path 
    SequencesDir = str(SEQUENCES_PATH)
else:
    SequencesPath = SEQUENCES_PATH
    SequencesDir = str("".join ([SEQUENCES_PATH,'/']))
Orthologues = [x for x in os.listdir(SequencesDir) if x.endswith(".fna")] ## creo un arreglo con todos los ortólogis de la carpeta

for Orthologue in Orthologues:
    i = 0
    FNA = str("".join ([SequencesDir,Orthologue])) ## creo el path completo con el path de la carpeta de ortólogos y el path del ortólogo
    for record in SeqIO.parse(open(FNA),'fasta'):
        i += 1
    #print ('{}\t{}'.format(Orthologue,i))
    if i == 12:
        shutil.copyfile(FNA,str("".join (['gbff/Only_ORTHOLOGUES/',Orthologue]))) ## si hasy unicamente 6 entradas entonces no hay paralogos
    else:
        shutil.copyfile(FNA,str("".join (['gbff/PARALOGUES/',Orthologue]))) ## Si hay mas de 6 entradas hay parálogos

Esto me dio como resultado 282 archivos sin paralogos y 24 archivos con paralogos los cuals omitiré por ahora.

Ahora edito el archivo fasta para que el encabezado incluya unicamente la especie

# /home/biocomp/HIP_2023/PIPELINES_2023/ASR_ORTHOLOGUES/Cyanobacterium_clade/Only_ORTHOLOGUES
for f in *.fna; do awk -F "|" '{if (NR%2){print ">"$2}else{print $1}}' $f | sed 's/ /_/g' | sed 's/\.//g' | sed 's/\[//g' | sed 's/\]//g' | sed 's/)//g' | sed 's/(//g'| sed "s/'//g" | sed 's/\//-/g' | sed 's/Acaryochloris_marina_//g' | sed 's/Acaryochloris_sp_//g' | sed 's/Gloeomargarita_lithophora_Alchichica-//g' | sed 's/Synechococcus_sp_//g' | sed 's/Thermostichus_lividus_//g' | sed 's/Thermosynechococcus_sp_//g' | sed 's/Thermosynechococcus_vestitus_//g' | sed 's/Thermosynechococcus_elongatus_PKUAC-//g' >$f.awk1;done

Alineación Múltiple

Hago la alineación multiple con MAFFT

for f in *.sed; do mafft $f >$f.mafft;done

Cambio de fasta a PHYLIP

python3 Fasta2Phylip.py Only_ORTHOLOGUES/

Obtengo las coordenadas del palindromo en la alineación usando el siguiente codigo de python:

from Bio import SeqIO
import re
import os
import sys
SEQUENCES_PATH = 'Only_ORTHOLOGUES/' ## Path de la carpeta con los ortólogos.
output_file = 'Orthologues_Palindrome_sites.txt' ## Nombre del archivo de salida
output = open (output_file, 'w') ## Abrimos el archivo de salida
output.write('FILE\tPAL\tSTART\tEND\n')

if SEQUENCES_PATH.endswith("/"): ## Esta parte la pongo por si corro este codigo en la terminal
    SequencesPath = re.sub('/', '', SEQUENCES_PATH) ## De este modo evito errores en el argumento path 
    SequencesDir = str(SEQUENCES_PATH)
else:
    SequencesPath = SEQUENCES_PATH
    SequencesDir = str("".join ([SEQUENCES_PATH,'/']))
Orthologues = [x for x in os.listdir(SequencesDir) if x.endswith(".phy")] ## creo un arreglo con todos los ortólogos de la carpeta
pattern = '[cC][-]*[aA][-]*[gG][-]*[gG][-]*[cC][-]*[cC][-]*[tT][-]*[gG]'
j=0
k=0
for Orthologue in Orthologues:
    FNA = str("".join ([SequencesDir,Orthologue])) ## creo el path completo con el path de la carpeta de ortólogos y el path del ortólogo
    for record in SeqIO.parse(open(FNA),'phylip'):
        Seq = str(record.seq)
        Spp = record.description
        i = 0
        if Spp == 'PCC_6312':#PCC_6303_PCC_6303 336-3_336-3
            for match in re.finditer(pattern, Seq):
                i += 1
                j += 1
                site = match.group()
                start = match.span()[0]
                end = match.span()[1]
                if len(site)==8:
                    k += 1
                    print ('{}\t{}\t{}\t{}:{}'.format(i,Orthologue,site,start+1,end))
                    phyl = re.sub('.awk2', '.phy', Orthologue)
                    output.write('{}\t{}\t{}\t{}\n'.format(phyl,site,start+1,end))
    print ('_________________________________________________________________________\n')
output.close()
print ("TOTAL: {} sitios".format(k))

python3 AlignmentPalindromeCoords.py Only_ORTHOLOGUES/

El ultimo script tiene como salida el archivo Orthologues_Palindrome_sites.txt el cual contiene 4 columnas. Estas contienen el nombre del archivo del ortólogo, el palindromo en cuestion y el intervalo de sitios en donde se encuentra el palindromo:

###
FILE    PAL START   END
39384_bifunctional_folylpo...fna.awk1.mafft.phy caggcctg    753 760
39612_bchB.fna.awk1.mafft.phy   caggcctg    640 647
40294_lpxC.fna.awk1.mafft.phy   caggcctg    401 408
40004_transglycosylase_dom...fna.awk1.mafft.phy caggcctg    1820    1827
39889_glycosyltransferase.fna.awk1.mafft.phy    caggcctg    142 149
39889_glycosyltransferase.fna.awk1.mafft.phy    caggcctg    1034    1041
39985_Rad3-related_DNA_hel...fna.awk1.mafft.phy caggcctg    793 800
40123_gcvP.fna.awk1.mafft.phy   caggcctg    1359    1366
40123_gcvP.fna.awk1.mafft.phy   caggcctg    1492    1499
... etc
###

GRAFO

primero cargo las librerias necesarias

library(ggplot2)
library(ggtree)
library(ape)
library(tidyverse)
library(tidytree)
library(phangorn)
library(dplyr)

Preparacion de datos para el grafo

Primero cargo Orthologues_Palindrome_sites.txt del paso anterior.

## /home/lalibelulalo/PIPELINES_2023/ASR_ORTHOLOGUES/Thermosynechococcus_clade/
Sites <- read.table("Orthologues_Palindrome_sites.txt", sep = "\t", header = TRUE)

Creo la matriz que va a contener unicamente las transiciones

DF <- matrix(0,
             nrow = 0,
             ncol = 2)
colnames(DF) <- c("from","to")

Creo una matriz adicional que tendra las transiciones y una columna extra con la direccion de la transición

{LINKS <- matrix(0,
             nrow = 0,
             ncol = 3)
colnames(LINKS) <- c("from","to","direction")}

Cargo el modelo de evolucion y el metodo de reconstrucción (bayesiano).

EvolModel = "F81"
Method = "bayes"
Tree = read.tree("SpeciesTree_rooted.txt")#"SpeciesTree_rooted.txt"
#Root = '336-3'
#Tree <-root(Tree, outgroup = Root, edgelabel = TRUE)

En el archivo Orthologues_Palindrome_sites.txt hay 902 sitios en total. Cargamos el primer sitio el cual corresponde a:

###
FILE    PAL START   END
19781_GIY-YIG_nuclease_fam...fna.awk1.r.sed.mafft.phy   tcgatcga    230 237

###

El numero ORTH correponde al numero de sitio. Por lo tanto cargamos el PATH del alineamiento del primer sitio (Alignment), así como su inicio (SI) y su termino (EI)

ORTH = 1
Alignment = paste0("Only_ORTHOLOGUES/",Sites[ORTH,1])
SI = Sites[ORTH,3]
EI = Sites[ORTH,4]

Alineación multiple

Cargo la alineación multiple

cyanobacterias <- read.phyDat(Alignment, format = "interleaved")

## Cambio el attributo site.patter a FALSE para poder acceder a todos los indices
cyanobacterias <- subset(cyanobacterias, site.pattern = FALSE) 
#Tree = pratchet(cyanobacterias, trace = 0)|>acctran(cyanobacterias)

Esta filogenia contiene 23 nodos

ggtree(Tree,branch.length="none",) +
  geom_tiplab(color='firebrick', offset = .14)+
  geom_label(aes(label = node),show.legend = FALSE)

Filogenia y parametros del modelo

Estimo los parametros para el modelo. en esta parte uso el modelo de evolucion elegido anteriormente (“F81”). Además en esta parte el arbol se desenraiza. Por lo tanto, debo cambiar el arbol al nuevo sin raiz para evitar errores.

fit <- pml(Tree, cyanobacterias)
fit <- optim.pml(fit, model=EvolModel, control = pml.control(trace=0))

Tree2 <- fit[["tree"]]

Reconstrucción de Estados Ancestrales

Hago la reconstruccion de estados ancestrales para cada sitio. Para esto hay dos métodos: bayesiano o máxima verosimilitud. Anteriormente habia declarado la variable Method como bayes.

  ## Calculo la probabilidad del árbol filogenético dado el alineamiento y el modelo
  if(Method == "bayes"){
    anc.bayes <- ancestral.pml(fit, type="bayes", return = "prob")
    Reconstruction <- anc.bayes
  }
  if(Method == "ml"){
    anc.ml <- ancestral.pml(fit, "ml")
    Reconstruction <- anc.ml
  }

Armado de palindromos ancestrales

Armo los palindromos ancestrales de acuerdo a los estados ancestrales usando los sitios de inicio y fin de cada sitio palindromico (SI y EI). Posteriormente uso dicho intervalo para buscar lel estado de cada nucleotido del intervalo y guardarlo en el vector (PalindromeNucleotidePositions). Tambien extraigo los nombres de cada nodo de la reconstrucción:

PalInterval = SI:EI ## ESTE ES EL INTERVALO DEL SITIO PALINDROMICO
  
## Obtengo las posiciones para la posicion PalInterval
PalindromeNucleotidePositions <- attributes(cyanobacterias)$index[PalInterval]

## Nombre de los nodos
Nodos <- attributes(Reconstruction)$names

Ahora creo una matriz que contendra 9 columnas, la primera contendrá los Nodos y las 8 restantes contendran los nucleotidos correspondientes a su palindromo y prosigo a rellenar la matriz. Sin embargo, en este paso hay un detalle.

La reconstrucción de estado ancestral para cada sitio arroja una probabilidad para cada uno de los 4 sitios posibles (A, T ,G o C). Si hay equivalencia en la probabilidad en los estados, es decir que el estado ancestral de un nucleotido tenga la misma probabilidad de haber estado en cualesquiera de los 4 posibles estados, entonces dicho estado se omite, de lo contrario se toma aquel estado con la mayor probabilidad. Otro punto importante es que hay sitios en los que una especie puede contener deleciones. Para estos casos, si gfalta algún nucleotido sew rellena con un “-”.

  ## Transformo el arbol en data frame para extraer las etiquetas de los nodos.
  a.1 <- as_tibble(Tree2)

  ## NA.NODE corresponde al nodo a partir del cual ya no estan las puntas del arbol
  NA.NODE = length(cyanobacterias)+1
  
  ## Esto es para agregar la etiqueta de los nodos a los que faltan, es decir a los nodos que no son puntas del arbol.
  for (i in NA.NODE:length(Nodos)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
    a.1$label[i] <- a.1$node[i] # x$node[i-1]
  }
  
  ## Agrego las etiquetas de nodo al arbol
  b.1 <- tibble::tibble(label = Nodos)
  c.1 <- full_join(a.1, b.1, by = 'label')
  NodeLabels <- c.1$label
  
  TipLabels <- Tree2$tip.label
  NodeLabels2 <- NodeLabels[-c(1:length(TipLabels))]

  ## Creo una matriz de 8 x length(Nodos) (8 nucleotidos x numero de nodos) para guardar los resultados
  ScoresMtx <- matrix(0,
                      nrow = length(Nodos),
                      ncol = 8)
  colnames(ScoresMtx) <- c("1","2","3","4","5","6","7","8")
  rownames(ScoresMtx) <- Nodos
  
  
  ## Relleno la matriz con los scores.
  ## ESTE CICLO ES UNICAMENTE PARA LAS PUNTAS DE LOS NODOS
  j = 1
  i = 1
  k = 0
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in TipLabels){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "-"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }
  
  ## Relleno la matriz con los scores
  ## ESTA PARTE ES PARA TODOS LOS NODOS RESTANTES QUE NO SON PUNTAS
  j = 1
  i = length(TipLabels)+1
  k = 0
  #(length(TipLabels)+1):length(NodeLabels)
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in NodeLabels2){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "X"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=length(TipLabels)+1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }

Ahora decodificamos el palindromo a nucleotidos para esto creo una nueva matriz que contendra dos columnas: nodos y sus correspondientes palíndromos.

## Creo un matriz de 1x14 para guardar la etiqueta (palindromo) de cada nodo
PalsMtx <- matrix(0,
                  nrow = length(Nodos),
                  ncol = 1)
colnames(PalsMtx) <- c("Palindrome")
rownames(PalsMtx) <- Nodos

## Extraigo cada palindromo, cambio el codigo de numero por el codigo de letras y lo guardo en la matriz 1x14
i=0
j=0
for (Nodo in Nodos){
  j=j+1
  Palindrome <- as.character(ScoresMtx[j,]) # Este es el palindromo
  pal =""
  for (nuc in Palindrome){
    if(nuc== 1){nuc = 'A'}
    if(nuc== 2){nuc = 'C'}
    if(nuc== 3){nuc = 'G'}
    if(nuc== 4){nuc = 'T'}
    pal = paste(pal,nuc,sep = "")
  }
  PalsMtx[j,1] <- pal
}

Guardo los nodos y los palindromos de cada nodo en 2 vectores. Los cuales voy a usar para anotar el arbol filogenetico.

l <- as.list(PalsMtx[,1]) 
Nodes = names(l)
NodePals = as.character(PalsMtx[,1])

Anotacion del arbol filogenético

Para poder hacer esto, primero transformo el arbol filogenetico en una tabla y agrego 2 columnas: (label) que contendra el nombre de cada nodo y (Palindromo) que contendra el palindromo de dicho nodo.

## Transformo el arbol en data frame para poder agregar los palindromos de cada nodo
x <- as_tibble(Tree2)

## Extraigo el Boostrap por si se ocupa
NA.NODE = length(cyanobacterias)+1
BootStrap <- x$label[NA.NODE:length(Nodes)]
for (i in NA.NODE:length(Nodes)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
  x$label[i] <- x$node[i] # x$node[i-1]
}
## Agrego los datos de los palindromos al arbol
d <- tibble::tibble(label = Nodes,
                    Palindromo = NodePals)
y <- full_join(x, d, by = 'label')

Hasta aqui tenemos una filogenoa anotada con su palindromo ancestral en cada nodo:

ggtree(as.treedata(y),branch.length="none") +
    geom_tiplab(color='firebrick', offset = .14)+
    geom_label(aes(label = Palindromo,color=Palindromo),show.legend = FALSE)+
    aes(color=Palindromo)+
    guides(fill = guide_legend(override.aes = list(color = NA)), 
           color = FALSE, 
           shape = 1)

Matriz de transiciones para el grafo

Ahora necesito crear una matriz que tenga las tranmsiciones entre cada nodo. Para esto extraigo las direcciones de los nodos desde el arbol filogenetico previamente anotado es decir las columnas from y to y las guardo en una nueva matriz:

## Esta es la configuracion del arbol
from<-y$label#c("336-3","7","7","8","8","9","9","10","10")
to<-y$parent#c("7","NIES-3974","8","PCC_6303","9","PCC_7716","10","NIES-4071","NIES-4105")

LinksMtx <- matrix(0,
                   nrow = length(from),
                   ncol = 2)
colnames(LinksMtx) <- c("from","to")

for (i in 1:length(from)){
  LinksMtx[i,1] = from[i]
  LinksMtx[i,2] = to[i]
}

Luego extraigo las columnas label y Palindromo y las guardo en otra matriz. Dicha matriz me servira como diccionario para obtener las transiciones en cada nodo. Posteriormente la convierto a data frame.

## AQUI VOY A CREAR LA MATRIZ QUE ME SERVIRA COMO DICCIONARIO PARA RELLENAR LA MATRIZ DE TRANSICIONES
Mat <- y[,4:5]

Mat2 <- as.data.frame(Mat) ## CONVIERTO A DF
Mat2

##         label Palindromo
## 1         D10   --------
## 2    PCC_6312   CAGGCCTG
## 3    PCC_6715   CCCGGGTG
## 4       HN-54   TCTGGGTG
## 5     SCTE542   GCTGGCTG
## 6        CL-1   CCGGGATG
## 7        TA-1   CCGGGATG
## 8       NK55a   GCTGGATG
## 9        BP-1   GCTGGTTG
## 10  MBIC11017   TTGCCTTG
## 11 Moss_Beach   TTGCCTTG
## 12        S15   TTGCCTTG
## 13         13   CTGGCTTG
## 14         14   CCGGCTTG
## 15         15   CCTGGGTG
## 16         16   GCTGGATG
## 17         17   GCTGGATG
## 18         18   GCTGGATG
## 19         19   CCGGGATG
## 20         20   GCTGGATG
## 21         21   TTGCCTTG
## 22         22   TTGCCTTG

Ahora creo una nueva matriz que va a tener todas las transiciones. Para esto uso las ultimas dos matrices. La matriz LinksMtx que tiene las direcciones entre nodos la paso a palindromos con ayuda de Mat2 la cual me sirve como diccionario y convierto esta ultima matriz (LinksMtx2) a dataframe.

## AQUI CREO LA MATRIZ DE TRANSICIONES.
LinksMtx2 <- matrix(0,
                    nrow = length(from),
                    ncol = 2)
colnames(LinksMtx2) <- c("from","to")

## AQUI RELLENO LA MATRIZ DE TRANSICIONES USANDO COMO DICCIONARIO A LA MATRIZ ANTERIOR (Mat2)
for (i in 1:length(from)) {
  for (j in 1:length(from)){
    if (LinksMtx[i,1]==Mat2[j,1]){
      LinksMtx2[i,1] = Mat2[j,2]
    }
    if (LinksMtx[i,2]==Mat2[j,1]){
      LinksMtx2[i,2] = Mat2[j,2]
    }
  }
}
df <- as.data.frame(LinksMtx2) ## CONVIERTO A DF LA MATRIZ DE TRANSICIONES

Creo una matriz adicional la cual es identica a la anterior salvo que le agrego las direcciones de cada transicion, las cuales podrian ser de utilidad.

directions<-paste(from,to,sep="--") ## ESTE VECTOR CONTIENE LAS DIRECCIONES
Links0 = as.data.frame(LinksMtx2) ## HAGO UNA COPIA DE LA MATRIZ DE TRANSICIONES
Links0$direction<-directions ## AGREGO COMO COLUMNA NUEVA A "directions"

Al final de este paso tendre dos matrices DF y LINKS

DF contiene solo las transiciones

DF <- rbind(DF,df) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES CON LA ANTERIOR PARA CREAR UNA SOLA
DF

##        from       to
## 1  -------- CTGGCTTG
## 2  CAGGCCTG CCGGCTTG
## 3  CCCGGGTG CCTGGGTG
## 4  TCTGGGTG GCTGGATG
## 5  GCTGGCTG GCTGGATG
## 6  CCGGGATG CCGGGATG
## 7  CCGGGATG CCGGGATG
## 8  GCTGGATG GCTGGATG
## 9  GCTGGTTG GCTGGATG
## 10 TTGCCTTG TTGCCTTG
## 11 TTGCCTTG TTGCCTTG
## 12 TTGCCTTG TTGCCTTG
## 13 CTGGCTTG CTGGCTTG
## 14 CCGGCTTG CTGGCTTG
## 15 CCTGGGTG CCGGCTTG
## 16 GCTGGATG CCTGGGTG
## 17 GCTGGATG GCTGGATG
## 18 GCTGGATG GCTGGATG
## 19 CCGGGATG GCTGGATG
## 20 GCTGGATG GCTGGATG
## 21 TTGCCTTG CTGGCTTG
## 22 TTGCCTTG TTGCCTTG

y LINKS las transiciones con sus respectivas direcciones.

LINKS<-rbind(LINKS,Links0) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES(CON DIRECCIONES) CON LA ANTERIOR PARA CREAR UNA SOLA
LINKS

##        from       to      direction
## 1  -------- CTGGCTTG        D10--13
## 2  CAGGCCTG CCGGCTTG   PCC_6312--14
## 3  CCCGGGTG CCTGGGTG   PCC_6715--15
## 4  TCTGGGTG GCTGGATG      HN-54--17
## 5  GCTGGCTG GCTGGATG    SCTE542--18
## 6  CCGGGATG CCGGGATG       CL-1--19
## 7  CCGGGATG CCGGGATG       TA-1--19
## 8  GCTGGATG GCTGGATG      NK55a--20
## 9  GCTGGTTG GCTGGATG       BP-1--20
## 10 TTGCCTTG TTGCCTTG  MBIC11017--21
## 11 TTGCCTTG TTGCCTTG Moss_Beach--22
## 12 TTGCCTTG TTGCCTTG        S15--22
## 13 CTGGCTTG CTGGCTTG         13--13
## 14 CCGGCTTG CTGGCTTG         14--13
## 15 CCTGGGTG CCGGCTTG         15--14
## 16 GCTGGATG CCTGGGTG         16--15
## 17 GCTGGATG GCTGGATG         17--16
## 18 GCTGGATG GCTGGATG         18--17
## 19 CCGGGATG GCTGGATG         19--18
## 20 GCTGGATG GCTGGATG         20--16
## 21 TTGCCTTG CTGGCTTG         21--13
## 22 TTGCCTTG TTGCCTTG         22--21

Finalmente repito esto para los sitios palindromicos restantes:

EvolModel = "F81"
Method = "bayes"
Tree = read.tree("SpeciesTree_rooted.txt")#"SpeciesTree_rooted.txt"
for (ORTH in 2:1:length(Sites$FILE)){ ## HAY 2593 ortólogos
  Alignment = paste0("Only_ORTHOLOGUES/",Sites[ORTH,1])
  SI = Sites[ORTH,3]
  EI = Sites[ORTH,4]
  
  ## Cargo la alineación multiple
  cyanobacterias <- read.phyDat(Alignment, format = "interleaved")
  
  ## Cambio el attributo site.patter a FALSE para poder acceder a todos los indices
  cyanobacterias <- subset(cyanobacterias, site.pattern = FALSE) 
  #Tree = pratchet(cyanobacterias, trace = 0)|>acctran(cyanobacterias)
  parsimony(Tree, cyanobacterias)
  
  ## Estimo los parametros para el modelo
  fit <- pml(Tree, cyanobacterias)
  fit <- optim.pml(fit, model=EvolModel, control = pml.control(trace=0))
  
  Tree2 <- fit[["tree"]]
  
  ## Calculo la probabilidad del árbol filogenético dado el alineamiento y el modelo
  if(Method == "bayes"){
    anc.bayes <- ancestral.pml(fit, type="bayes", return = "prob")
    Reconstruction <- anc.bayes
  }
  if(Method == "ml"){
    anc.ml <- ancestral.pml(fit, "ml")
    Reconstruction <- anc.ml
  }
  
  #plotAnc(Tree2, anc.bayes, 1)
  PalInterval = SI:EI ## ESTE ES EL INTERVALO DEL SITIO PALINDROMICO
  
  ## Obtengo las posiciones para la posicion PalInterval
  PalindromeNucleotidePositions <- attributes(cyanobacterias)$index[PalInterval]
  
  ## Nombre de los nodos
  Nodos <- attributes(Reconstruction)$names
  
    ########################################################################################################
  ## Transformo el arbol en data frame para poder agregar los palindromos de cada nodo
  a.1 <- as_tibble(Tree2)
  
  NA.NODE = length(cyanobacterias)+1
  for (i in NA.NODE:length(Nodos)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
    a.1$label[i] <- a.1$node[i] # x$node[i-1]
  }
  ## Agrego los datos de los palindromos al arbol
  b.1 <- tibble::tibble(label = Nodos)
  c.1 <- full_join(a.1, b.1, by = 'label')
  NodeLabels <- c.1$label
  
  TipLabels <- Tree2$tip.label
  NodeLabels2 <- NodeLabels[-c(1:length(TipLabels))]

  ## Creo una matriz de 8 x length(Nodos) (8 nucleotidos, numero de nodos) para guardar los resultados
  ScoresMtx <- matrix(0,
                      nrow = length(Nodos),
                      ncol = 8)
  colnames(ScoresMtx) <- c("1","2","3","4","5","6","7","8")
  rownames(ScoresMtx) <- Nodos
  
  
  ## Relleno la matriz con los scores
  j = 1
  i = 1
  k = 0
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in TipLabels){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "-"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }
  
  ## Relleno la matriz con los scores
  j = 1
  i = length(TipLabels)+1
  k = 0
  #(length(TipLabels)+1):length(NodeLabels)
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in NodeLabels2){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "X"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=length(TipLabels)+1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }
  ########################################################################################################
  
  ## EN ESTA PARTE ARMAREMOS EL PALINDROMO PARA CADA NODO
  
  ## Creo un matriz de 1x14 para guardar la etiqueta (palindromo) de cada nodo
  PalsMtx <- matrix(0,
                    nrow = length(Nodos),
                    ncol = 1)
  colnames(PalsMtx) <- c("Palindrome")
  rownames(PalsMtx) <- Nodos
  
  ## Extraigo cada palindromo, cambio el codigo de numero por el codigo de letras y lo guardo en la matriz 1x14
  i=0
  j=0
  for (Nodo in Nodos){
    j=j+1
    Palindrome <- as.character(ScoresMtx[j,]) # Este es el palindromo
    pal =""
    for (nuc in Palindrome){
      if(nuc== 1){nuc = 'A'}
      if(nuc== 2){nuc = 'C'}
      if(nuc== 3){nuc = 'G'}
      if(nuc== 4){nuc = 'T'}
      pal = paste(pal,nuc,sep = "")
    }
    PalsMtx[j,1] <- pal
  }
  
  ## ESTA PATRTE ES PARA GUARDAR LOS PALINDROMOS DE CADA NODO
  l <- as.list(PalsMtx[,1]) 
  Nodes = names(l)
  NodePals = as.character(PalsMtx[,1])
  
  ## Transformo el arbol en data frame para poder agregar los palindromos de cada nodo
  x <- as_tibble(Tree2)
  
  ## Extraigo el Boostrap por si se ocupa
  NA.NODE = length(cyanobacterias)+1
  BootStrap <- x$label[NA.NODE:length(Nodes)]
  for (i in NA.NODE:length(Nodes)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
    x$label[i] <- x$node[i] # x$node[i-1]
  }
  ## Agrego los datos de los palindromos al arbol
  d <- tibble::tibble(label = Nodes,
                      Palindromo = NodePals)
  y <- full_join(x, d, by = 'label')
  
  #------------------------
  
  ## Esta es la configuracion del arbol
  from<-y$label
  to<-y$parent
  
  LinksMtx <- matrix(0,
                     nrow = length(from),
                     ncol = 2)
  colnames(LinksMtx) <- c("from","to")
  
  for (i in 1:length(from)){
    LinksMtx[i,1] = from[i]
    LinksMtx[i,2] = to[i]
  }
  
  ## AQUI VOY A CREAR LA MATRIZ QUE ME SERVIRA COMO DICCIONARIO PARA RELLENAR LA MATRIZ DE TRANSICIONES
  Mat <- y[,4:5]
  #Mat <- Mat %>% ## AQJUI QUITO EL RENGLON 7 QUE NO CONTIENE NADA. ESTE CORRESPONDE A LA RAIZ DEL ARBOL
  #  filter(!is.na(Palindromo))
  Mat
  
  Mat2 <- as.data.frame(Mat) ## CONVIERTO A DF
  Mat2
  
  ## AQUI CREO LA MATRIZ DE TRANSICIONES.
  LinksMtx2 <- matrix(0,
                      nrow = length(from),
                      ncol = 2)
  colnames(LinksMtx2) <- c("from","to")
  
  ## AQUI RELLENO LA MATRIZ DE TRANSICIONES USANDO COMO DICCIONARIO A LA MATRIZ ANTERIOR (Mat2)
  for (i in 1:length(from)) {
    for (j in 1:length(from)){
      if (LinksMtx[i,1]==Mat2[j,1]){
        LinksMtx2[i,1] = Mat2[j,2]
      }
      if (LinksMtx[i,2]==Mat2[j,1]){
        LinksMtx2[i,2] = Mat2[j,2]
      }
    }
  }
  
  directions<-paste(from,to,sep="--") ## ESTE VECTOR CONTIENE LAS DIRECCIONES
  Links0 = as.data.frame(LinksMtx2) ## HAGO UNA COPIA DE LA MATRIZ DE TRANSICIONES
  Links0$direction<-directions ## AGREGO COMO COLUMNA NUEVA A "directions"
  Links0
  LINKS<-rbind(LINKS,Links0) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES(CON DIRECCIONES) CON LA ANTERIOR PARA CREAR UNA SOLA
  
  df <- as.data.frame(LinksMtx2) ## CONVIERTO A DF LA MATRIZ DE TRANSICIONES
  DF <- rbind(DF,df) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES CON LA ANTERIOR PARA CREAR UNA SOLA
  #print(paste0("Archivo ",ORTH, " de ",length(Sites$FILE),"."))
}

Ahora las matrices DF y LINKS contienen todas las transiciones de todos los sitios. En total son 20008 transiciones.

length(DF$from)

## [1] 7282

length(LINKS$from)

## [1] 7282

Filtrado de datos para el GRAFO

Primero cuento el numero de veces que se repite cada linea de la matriz de transiciones. El conteo para cada transicion sera el peso de cada vertice. Luego cambio el nombre de la columna counts a weight, paso su contenido a numeros enteros y la convierto a tibble. En total tengo 7282 transiciones ya que son 22 transiciones por arbol y hay 331 sitios:

library(data.table)
RowCts <- setDT(DF)[,list(Count=as.numeric(.N)),names(DF)] ## CUENTO EL NUMERO DE OCURRENCIAS DE CADA TRANSICIÓN #2536
colnames(RowCts) <- c("from","to","weight") 
RowCts = transform(RowCts, weight = as.integer(weight)) ## CONVIERTO A ENTEROS LA COLUMNA "weight"
RowCts <-as_tibble(RowCts) ## CONVIERTO A TIBBLE
RowCts #8070

## # A tibble: 3,093 × 3
##    from     to       weight
##    <chr>    <chr>     <int>
##  1 -------- CTGGCTTG      1
##  2 CAGGCCTG CCGGCTTG      1
##  3 CCCGGGTG CCTGGGTG      1
##  4 TCTGGGTG GCTGGATG      1
##  5 GCTGGCTG GCTGGATG      1
##  6 CCGGGATG CCGGGATG      2
##  7 GCTGGATG GCTGGATG      5
##  8 GCTGGTTG GCTGGATG      1
##  9 TTGCCTTG TTGCCTTG      6
## 10 CTGGCTTG CTGGCTTG     18
## # … with 3,083 more rows

Filtro conteos bajos (<=10) y loops. Al final me quedo con 4 transiciones.

RowCts<-RowCts%>% ##QUITO LOS LOOPS. ES DECIR QUITO AQUELLAS TRANSICIONES QUE VAYAN ASI MISMA
  filter(from!=to)
length(RowCts$from) #2297

## [1] 2297

RowCts<-RowCts%>% ## QUIRO AQUELLAS TRANSICIONES CON UN PESO MENOR A 7
  filter(weight>=7)
length(RowCts$from) #53

## [1] 13

Extraigo todos los nodos posibles y los agrego a una matriz que va a contener los nodos del grafo. Adiconalmente le agrego una columna score que me indicara que tan parecido es el aplindromo a HIP1 (despues la quito porque no creo que sea un buen índice)

GraphNodes <- c(RowCts$from, RowCts$to) ## OBTENGO LOS PALINDROMOS DE CADA NODO
length(GraphNodes) #106

## [1] 26

GNMtx <- matrix(0, ## CREO LA MATRIZ QUE CONTENDRÁ LAS NODOS PARA EL GRAFO
                nrow = length(GraphNodes),
                ncol = 2)
colnames(GNMtx) <- c("pal","score") ## ETIQUETO LAS COLUMNAS DE LA MATRIZ

for (i in 1:length(GraphNodes)){ ## RELLENO LA MATRIZ
  GNMtx[i,1] = GraphNodes[i] ## AGREGO LOS NODOS
  GNMtx[i,2] = RecordLinkage::levenshteinSim(GraphNodes[i], 'GCGATCGC')*1000 ## AGREGO UN SCORE QUE SERÁ EL PORCENTAJE DE IDENTIDAD CON HIP1 MULTIPLICADO POR 1000
}

GNMtx <- as.data.frame(GNMtx) ## CONVIERTO LA MATRIZ DE NODOS A DF
GNMtx <- GNMtx[!duplicated(GNMtx), ] ## ELIMINO LOS NODOS DUPLICADOS

Creo un vector que enumerará los palindromos. Dicha enumeracion correspondera a su ID

palindromes<-GNMtx[,1] #34  ## EXTRAIGO LOS NODOS DE LA MATRIZ DEL GRAFO. ESTO LO HAGO PARA POSTERIAMENTE ENUMERARLAS Y ASOCIAR CADA NUMERO COMO ID

ids<-c() ## CREO UN VECTOR VACIO QUE CONTENDRÁ LOS ID'S
for (k in 1:length(palindromes)){ ## ENUMERO LOS PALINDROMOS. ESTOS NUMEROS SERAN LOS IDS
  ids <-append(ids, k)
}
ids

##  [1]  1  2  3  4  5  6  7  8  9 10 11

length(ids) #28

## [1] 11

length(GNMtx[,1]) #28

## [1] 11

Creo una matriz nueva que contendrá los nodos del grafo pero con su ID para cada palindromo:

Nodes2 <-cbind(GNMtx,ids) ## CREO UNA MATRIZ NUEVA CON LA MATRIZ DE NODOS Y LOS IDS. ESTA LA VOY A USAR COMO UN DICCIONARIO ID-NODO
length(Nodes2$pal)

## [1] 11

Nodes2<-Nodes2[,c(3,1,2)] ## REORDENO LAS COLUMNAS DE LA MATRIZ
colnames(Nodes2)<-c("id","label","score") ## ETIQUETO LAS COLUMNAS

Creo la matriz de vertices del grafo y tambien sustituyo el nombre de los nodos por sus respectivos ID’S. Adicionalmente cambio los scores que tengan 0 por 1’s ya que hay errores. Sin embargo, la columna score no la uso mas adelante.

GEMtx <- RowCts ## ESTA VA A SER LA MATRIZ DE VERTICES
Edges2 <- as.matrix(RowCts)#RowCts ESTA ES OTRA MATRIZ DE VERTICES DE LA CUAL VOY A SUSTITUIR LOS NODOS POR SUS IDS

for (n in 1:length(Nodes2[,2])){ ##AQUI CAMBIARE LOS PALINDROMOS DE LOS NODOS POR IDS
  Edges2[Edges2==Nodes2[n,2]] <- as.numeric(Nodes2[n,1])
  #print(n)
}

for (n in 1:length(Nodes2[,2])){ ## HANDLE 0's
  Nodes2[Nodes2==0] <- as.numeric(1)
  #print(n)
}

VISUALIZACIÓN DEL GRAFO

Cargo las librerias

library(visNetwork)
library(networkD3)
library(tidygraph)
library(ggraph)

Para empezar cargo los nodos y los vertices. En esta parte quito la columna score de los nodos.

nodes = as_tibble(Nodes2)
nodes = select(nodes, -score)

edges = as_tibble(Edges2)
edges = transform(edges, from = as.integer(from))
edges = transform(edges, to = as.integer(to))
edges = transform(edges, weight = as.integer(weight))
edges = as_tibble(edges)

Creo un grafo dirigido agregando un peso de acuerdo a la columna weight.

routes_tidy <- tbl_graph(nodes = nodes,
                         edges = edges,
                         directed = TRUE)

routes_tidy

## # A tbl_graph: 11 nodes and 13 edges
## #
## # A directed simple graph with 1 component
## #
## # Node Data: 11 × 2 (active)
##      id label   
##   <int> <chr>   
## 1     1 CAGGCCTG
## 2     2 --------
## 3     3 CAGGCATG
## 4     4 CAAGCCTG
## 5     5 CAGGCTTG
## 6     6 CAAGCATG
## # … with 5 more rows
## #
## # Edge Data: 13 × 3
##    from    to weight
##   <int> <int>  <int>
## 1     1     4     24
## 2     1     5     12
## 3     1     7      7
## # … with 10 more rows

routes_tidy %>% 
  activate(edges) %>% 
  arrange(desc(weight))

## # A tbl_graph: 11 nodes and 13 edges
## #
## # A directed simple graph with 1 component
## #
## # Edge Data: 13 × 3 (active)
##    from    to weight
##   <int> <int>  <int>
## 1     2     1     44
## 2     1     4     24
## 3     4     1     13
## 4     1     5     12
## 5     3     1      9
## 6     6     3      8
## # … with 7 more rows
## #
## # Node Data: 11 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAGGCCTG
## 2     2 --------
## 3     3 CAGGCATG
## # … with 8 more rows

Visualizo el grafo:

ggraph(routes_tidy, layout = "graphopt") + 
  geom_node_point() +
  geom_edge_link(aes(width = weight), alpha = 0.8) + 
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label), repel = TRUE) +
  labs(edge_width = "Times") +
  theme_graph()

GRAFOS INTERACTIVOS

Este es un grafo sin peso en los vertices.

visNetwork(nodes, edges)

Este es el mismo grafo pero con peso en sus vertices

edges <- mutate(edges, width = weight/5 + 1)
visNetwork(nodes, edges) %>% 
  visIgraphLayout(layout = "layout_with_fr") %>% 
  visEdges(arrows = "middle")

Este es otro grafo de fuerzas con peso en sus vertices y te muestra las conceciones de cada nodo de manera mas visual

nodes_d3 <- mutate(nodes, id = id - 1)
edges_d3 <- mutate(edges, from = from - 1, to = to - 1)

forceNetwork(Links = edges_d3, Nodes = nodes_d3,
             Source = "from", Target = "to", 
             NodeID = "label", Group = "id", Value = "weight", 
             opacity = 1, fontSize = 16, zoom = TRUE)

Este ultimo grafo muestra el grafo anterior pero de una forma mas analizable.

sankeyNetwork(Links = edges_d3, Nodes = nodes_d3, Source = "from", Target = "to", 
              NodeID = "label", Value = "weight", fontSize = 16, unit = "TIME(s)")

GRAFOS POR DIRECCION

WGHT = 2

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=2.5)+
  geom_label(aes(label = node),show.legend = TRUE)

D10–13

## [1] 6219

## # A tbl_graph: 9 nodes and 7 edges
## #
## # A rooted forest with 2 trees
## #
## # Edge Data: 7 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     4     8 D10--13        4
## 2     2     8 D10--13        3
## 3     3     8 D10--13        3
## 4     5     8 D10--13        3
## 5     1     8 D10--13        2
## 6     6     9 D10--13        2
## # … with 1 more row
## #
## # Node Data: 9 × 2
##      id label   
##   <int> <chr>   
## 1     1 CCGGTTTG
## 2     2 CAGGGCTG
## 3     3 --------
## # … with 6 more rows

PCC_6312–14

## [1] 6219

## # A tbl_graph: 43 nodes and 42 edges
## #
## # A rooted tree
## #
## # Edge Data: 42 × 4 (active)
##    from    to direction    weight
##   <int> <int> <chr>         <int>
## 1     1     2 PCC_6312--14     16
## 2     1     5 PCC_6312--14      9
## 3     1    17 PCC_6312--14      7
## 4     1    11 PCC_6312--14      6
## 5     1    16 PCC_6312--14      6
## 6     1    30 PCC_6312--14      6
## # … with 36 more rows
## #
## # Node Data: 43 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAGGCCTG
## 2     2 CAAGCCTG
## 3     3 CAGCCTTG
## # … with 40 more rows

}

PCC_6715–15

## [1] 6219

## # A tbl_graph: 6 nodes and 3 edges
## #
## # A rooted forest with 3 trees
## #
## # Edge Data: 3 × 4 (active)
##    from    to direction    weight
##   <int> <int> <chr>         <int>
## 1     1     4 PCC_6715--15      5
## 2     3     6 PCC_6715--15      3
## 3     2     5 PCC_6715--15      2
## #
## # Node Data: 6 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAAGACTG
## 3     3 CAAGCATG
## # … with 3 more rows

HN-54–17

## [1] 6219

## # A tbl_graph: 4 nodes and 2 edges
## #
## # A rooted forest with 2 trees
## #
## # Edge Data: 2 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     2     4 HN-54--17      4
## 2     1     3 HN-54--17      2
## #
## # Node Data: 4 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAAACGTG
## 2     2 --------
## 3     3 CAAGCGTG
## # … with 1 more row

SCTE542–18

## [1] 6219

## # A tbl_graph: 2 nodes and 1 edges
## #
## # A rooted tree
## #
## # Edge Data: 1 × 4 (active)
##    from    to direction   weight
##   <int> <int> <chr>        <int>
## 1     1     2 SCTE542--18      4
## #
## # Node Data: 2 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCCTG

}

CL-1–19

## [1] 6219

## # A tbl_graph: 2 nodes and 1 edges
## #
## # A rooted tree
## #
## # Edge Data: 1 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     2 CL-1--19       4
## #
## # Node Data: 2 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCCTG

TA-1–19

## [1] 6219

## # A tbl_graph: 2 nodes and 1 edges
## #
## # A rooted tree
## #
## # Edge Data: 1 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     2 TA-1--19       4
## #
## # Node Data: 2 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCCTG

NK55a–20

## [1] 6219

## # A tbl_graph: 2 nodes and 1 edges
## #
## # A rooted tree
## #
## # Edge Data: 1 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     2 NK55a--20      4
## #
## # Node Data: 2 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCCTG

}

BP-1–20

## [1] 6219

## # A tbl_graph: 3 nodes and 2 edges
## #
## # A rooted tree
## #
## # Edge Data: 2 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     2 BP-1--20       4
## 2     2     3 BP-1--20       2
## #
## # Node Data: 3 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCCTG
## 3     3 CAAGCCTG

MBIC11017–21

## [1] 6219

## # A tbl_graph: 4 nodes and 2 edges
## #
## # A rooted forest with 2 trees
## #
## # Edge Data: 2 × 4 (active)
##    from    to direction     weight
##   <int> <int> <chr>          <int>
## 1     2     4 MBIC11017--21      4
## 2     1     3 MBIC11017--21      2
## #
## # Node Data: 4 × 2
##      id label   
##   <int> <chr>   
## 1     1 CCGATTTA
## 2     2 --------
## 3     3 CCGATCTA
## # … with 1 more row

Moss_Beach–22

## [1] 6219

## # A tbl_graph: 2 nodes and 1 edges
## #
## # A rooted tree
## #
## # Edge Data: 1 × 4 (active)
##    from    to direction      weight
##   <int> <int> <chr>           <int>
## 1     1     2 Moss_Beach--22      4
## #
## # Node Data: 2 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCCTG

}

S15–22

## [1] 6219

## # A tbl_graph: 4 nodes and 2 edges
## #
## # A rooted forest with 2 trees
## #
## # Edge Data: 2 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     3 S15--22        4
## 2     2     4 S15--22        2
## #
## # Node Data: 4 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 CAGGCTTG
## 3     3 CAGGCCTG
## # … with 1 more row

14–13

## [1] 6219

## # A tbl_graph: 12 nodes and 7 edges
## #
## # A rooted forest with 5 trees
## #
## # Edge Data: 7 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     5     1 14--13         3
## 2     1     8 14--13         2
## 3     2     9 14--13         2
## 4     3    10 14--13         2
## 5     4    11 14--13         2
## 6     6     1 14--13         2
## # … with 1 more row
## #
## # Node Data: 12 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAGGCTTG
## 2     2 GAGGCCTG
## 3     3 CGGGCTTG
## # … with 9 more rows

15–14

## [1] 6219

## # A tbl_graph: 13 nodes and 9 edges
## #
## # A rooted forest with 4 trees
## #
## # Edge Data: 9 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     6    12 15--14         8
## 2     5    12 15--14         4
## 3     1    10 15--14         2
## 4     2    11 15--14         2
## 5     3    12 15--14         2
## 6     4    12 15--14         2
## # … with 3 more rows
## #
## # Node Data: 13 × 2
##      id label   
##   <int> <chr>   
## 1     1 CCGGCCTT
## 2     2 ACGGCCTA
## 3     3 CAGGCGTG
## # … with 10 more rows

}

16–15

## [1] 6219

## # A tbl_graph: 6 nodes and 3 edges
## #
## # A rooted forest with 3 trees
## #
## # Edge Data: 3 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     3     6 16--15         3
## 2     1     4 16--15         2
## 3     2     5 16--15         2
## #
## # Node Data: 6 × 2
##      id label   
##   <int> <chr>   
## 1     1 GAAACGTG
## 2     2 CGGGCTTA
## 3     3 CAAGCATG
## # … with 3 more rows

17–16 WEIGHT = 1

## [1] 6219

## # A tbl_graph: 76 nodes and 40 edges
## #
## # A directed simple graph with 37 components
## #
## # Edge Data: 40 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1    41 17--16         1
## 2     2    42 17--16         1
## 3     3    43 17--16         1
## 4     4    44 17--16         1
## 5     5     7 17--16         1
## 6     6    45 17--16         1
## # … with 34 more rows
## #
## # Node Data: 76 × 2
##      id label   
##   <int> <chr>   
## 1     1 ACCGCCTA
## 2     2 GTCTCCTT
## 3     3 CCCCCTTG
## # … with 73 more rows

18–17 WEIGHT = 1

## [1] 6219

## # A tbl_graph: 45 nodes and 23 edges
## #
## # A rooted forest with 22 trees
## #
## # Edge Data: 23 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1    24 18--17         1
## 2     2    24 18--17         1
## 3     3    25 18--17         1
## 4     4    26 18--17         1
## 5     5    27 18--17         1
## 6     6    28 18--17         1
## # … with 17 more rows
## #
## # Node Data: 45 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAAGCATG
## 2     2 CAAGTGTG
## 3     3 CAGGCTGC
## # … with 42 more rows

}

19–18 WEIGHT = 1

## [1] 6219

## # A tbl_graph: 123 nodes and 65 edges
## #
## # A rooted forest with 58 trees
## #
## # Edge Data: 65 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1    63 19--18         1
## 2     2    64 19--18         1
## 3     3    65 19--18         1
## 4     4    66 19--18         1
## 5     5    67 19--18         1
## 6     6    68 19--18         1
## # … with 59 more rows
## #
## # Node Data: 123 × 2
##      id label   
##   <int> <chr>   
## 1     1 CCGGGATG
## 2     2 CGCCAGTG
## 3     3 CTGGAGTG
## # … with 120 more rows

20–16 WEIGHT = 1

## [1] 6219

## # A tbl_graph: 130 nodes and 68 edges
## #
## # A rooted forest with 62 trees
## #
## # Edge Data: 68 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1    68 20--16         1
## 2     2    69 20--16         1
## 3     3    70 20--16         1
## 4     4    71 20--16         1
## 5     5    72 20--16         1
## 6     6    73 20--16         1
## # … with 62 more rows
## #
## # Node Data: 130 × 2
##      id label   
##   <int> <chr>   
## 1     1 GAAGCCCA
## 2     2 CTTGAATG
## 3     3 CGCCTCTG
## # … with 127 more rows

21–13

## [1] 6219

## # A tbl_graph: 10 nodes and 7 edges
## #
## # A rooted forest with 3 trees
## #
## # Edge Data: 7 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     8 21--13         2
## 2     2     9 21--13         2
## 3     3     7 21--13         2
## 4     4    10 21--13         2
## 5     5     7 21--13         2
## 6     6     8 21--13         2
## # … with 1 more row
## #
## # Node Data: 10 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAGTCTTG
## 2     2 CGGGGTTA
## 3     3 CAGGCTTG
## # … with 7 more rows

22–21 WEIGHT = 1

## [1] 6219

## # A tbl_graph: 214 nodes and 119 edges
## #
## # A directed simple graph with 98 components
## #
## # Edge Data: 119 × 4 (active)
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1    97 22--21         1
## 2     2   116 22--21         1
## 3     3   117 22--21         1
## 4     4   118 22--21         1
## 5     5    66 22--21         1
## 6     6   119 22--21         1
## # … with 113 more rows
## #
## # Node Data: 214 × 2
##      id label   
##   <int> <chr>   
## 1     1 CAGGCCTG
## 2     2 CCCGAAAC
## 3     3 AAGCCCTC
## # … with 211 more rows

PIPELINE Reconstrucción Ancestral de Secuencias - CAGGCCTG

Eduardo PM

2023-03-30