Obtencion de datos para la alineación

Antes que nada. Cada genoma puede contener ID’s correspondientes a plasmidos. En este tutorial solo usare los cromosomas de cada especie. Es decir, omitiremos los plásmidos.

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna
for f in *.fna; do seqkit split --by-id $f;done

Luego creo una carpeta (only_chromosomes) y paso ahi los Fasta correspondientes a los cromosomas.

CACTUS

Vamos a usar otro programa para la alineación ya que el archivo maf de salida es mas informativo y facil de reformatear.

Datos de entrada

Primero necestiamos un archivo con los genomas de entrada: Este debe traer el arbol filogenetico en una linea y la etiqueta de los genomas con su respectiva ubicación.

Es muy importante que los nodos del el arbol no estén etiquetados o si lo están, que esta etoqueta sea distinta para cada nodo. De lo contrario habrá errores a la hora de correr cactus

Con el nombre de los nodos me refiero a la etiqueta que esta entre ) y :

Por ejemplo:

En este arbol la primera y la ultima etiqueta de nodo son identicas, esto ocasionó problemas a la hora de ejecutar cactus:

###

(336-3:0.101119,(NIES-3974:0.218161,(PCC_6303:0.172103,(PCC_7716:0.058496,(NIES-4071:8.62091e-07,NIES-4105:8.62091e-07)0.986266:0.0633482)0.960653:0.109454)0.507795:0.0389476)1:0.101119);

Para que el programa funcionara unicamente cambié el 8.62091e-07 por 8.620912e-07:

(336-3:0.101119,(NIES-3974:0.218161,(PCC_6303:0.172103,(PCC_7716:0.058496,(NIES-4071:8.62091e-07,NIES-4105:8.620912e-07)0.986266:0.0633482)0.960653:0.109454)0.507795:0.0389476)1:0.101119);

(336-3:0.101119,(NIES-3974:0.218161,(PCC_6303:0.172103,(PCC_7716:0.058496,(NIES-4071:8.62091e-07,NIES-4105:8.620912e-07)0.986266:0.0633482)0.960653:0.109454)0.507795:0.0389476)1:0.1011191);

336-3 /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/336-3.fna
NIES-4071 /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/NIES-4071.fna
PCC_6303 /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/PCC_6303.fna
NIES-3974 /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/NIES-3974.fna
NIES-4105 /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/NIES-4105.fna
PCC_7716 /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/PCC_7716.fna

Creo un ambiente para correr cactus en la carpeta de cactus Esto que sigue lo hago en la carpeta de cactus

## /home/lalibelulalo/software/cactus
cd cactus
virtualenv -p python3 cactus_env
echo "export PATH=$(pwd)/bin:\$PATH" >> cactus_env/bin/activate
echo "export PYTHONPATH=$(pwd)/lib:\$PYTHONPATH" >> cactus_env/bin/activate
source cactus_env/bin/activate
python3 -m pip install -U setuptools pip
python3 -m pip install -U .
python3 -m pip install -U -r ./toil-requirement.txt

Corro cactus:

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade
cd ../../PIPELINES_2023/ASR_GENOMES/Callothrix_clade/
cactus ./calothrix ./evolverCalothrix.txt ./evolverCalothrix.hal

como resultado tenemos el archivo evolverPico.hal

Convierto a MAF

Para este paso necesitamos una referencia. Esta referencia será el genoma con mayor conteo de palindromos. Para esto corremos el siguiente script:

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade
python3 CountPalsInOrthologues.py fna/only_chromosomes/ pals.list genomes.list 3

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade
File = "../../Markov_count_GCGATCGC_2023-5-9_14hrs14mins.txt"
table = read.table(File,header = TRUE,sep = "\t")
table = table[order(table$obs,decreasing = TRUE),]
#knitr::kable(table, align = "lccrr")
rmarkdown::paged_table(table,options = list(rows.print = 10, cols.print = 4))

Usamos a 336-3 como referencia:

Pasamos dicho archivo a formato MAF

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade
hal2maf evolverCalothrix.hal evolverCalothrix.336-3.hal.maf --refGenome 336-3 --noAncestors --inMemory

como resultado tenemos el archivo evolverCallothrix.336-3.hal.maf

###
##maf version=1 scoring=N/A
# hal (336-3:0.101119,(NIES-3974:0.218161,(PCC_6303:0.172103,(PCC_7716:0.058496,(NIES-4071:8.62091e-07,NIES-4105:8.62091e-07)0.986266:0.0633482)0.960653:0.109454)0.507795:0.0389476)1:0.101119)Anc0;

a
s   336-3.336-3 0   1000    +   6283267 GCACTGCCCTACTGGAAGCACTTCGTGGCACTTACTGCCAACATGTGGTTCAAGTCACAGAAGAACCCTTCATTTCTTTACAAACTATTCCCCTTCACCGGGGACGGAAACTTGTCAATTATTGCGCCTCCAAAACCACAGATGGAAACCACTTTACAAACTATTCCCCTTCACCGGGGACGAAAACGCTTGCATTAGCCCACAGGTCAGTGGACTGACTAATTATTTCCACAACTATGGGCAAACTAGCTTACAGCCGCTACCATCTGAGTTGACAGGGAATCAAAAATCATAGTATTATAATCTGGACAAGGAAGAACTATCTTTCAATAATAAACCTGTGAATATTTCTCCATCTCCAAAGCAGCCGCGTGTCTCATGGCAGGCGGCTTTCTCCATGTTGATGTACATTTTCATGTATCTACCAATATTGGTGCTTGCCTTCTACAGCTTCAACAATTCAGCTTACAGTGCTGGTTGGCAAGGATTCACCCTCGATTGGTATCGCCAGCTATTTGCGGATGAACGCATCTTATTAGCGACCAAAAACAGTCTCCTAGTAGCCAGTTGTGCGGTGGGGATTGCGGCAGTATTTGGAACCCTGATGGCAGTGGGATTAGCTCGATATCAGTTTCCCGGCAAAGGTTTATATAAAGGAGTCAGTTACCTACCCTTAATTATTCCCGACATTGCGATCGCCGTTGCCACCCTAGTTTTTCTGGCTGCTTTTGCCATACCTCTAAGTCTATGGACAATCATCGCCGCCCATGTGGTTTTCTGTCTTGCCTACATTGCCCTAGTTGTCTCCTCCCGACTAACAAACCTGGATTCCCATTTAGAAGAAGCCGCCCTAGATTTAGGTGCTACACCTACCCAAGCATTTATTAAAGTTTTAATACCCCAGTTAATGCCGGGTATAATTGCAGGCTGTCTGTTAGCTTTTATCCTTAGCTTAGATGACTTTCTGATTGCCAGTTTTACCTCCGGTAGCGGCTCTAGT

a
s   336-3.336-3 1000    1000    +   6283267 ACCCTGCCCATGGAAATTTTTAGTCGCCTGCGAACAGGTGTTAAACCAGACATAAATGCCCTCAGTGTTATCCTAATTCTCGTATCAGCAATTGTAGCTTTTATTGCCGAATCAATTAGAATTGCTGGAGACAAAAAGCACAGATAAAAAAGTATTTATACCACAATCTCTATCATCGTAATTCCTCTGAGCCACCAGGCAAACATTCTGCAATGCCATAGACAATAAATACTTATTTCATATAAGACTTAATACTTATAATAAAAATACCCCTCTCCTGAGTCAGGCACAGTGCCCCTGGGCAGTTGGGGGAAATTAATAATTACAACTGACGAGATAAATCACAATCCTATTGACAAAATCACTCCCCATCAACTATTATTCTTTCTGTAATGGAAATCCATGCGCTCATATATATTCATATTTAGAAAATTACCAGTACAGATAAGCAAATAAATAGCATTTCCCATCTGCATTTTTCTCAACACAGAACCTCAGTAGATGAAGTAAATCTATAGATTTGTACAATTCTGTTTTCAAATAGATAATATTGAGCCTCACTTAGGCAAGATGCGAAGCTCACTATGCAAATCTGTCAAAATCCCCATTGCTCCAATCCCTTCAATGCTGACGGCAACAGATTTTGCATCAGTTGCGGGCAAAGCAACTTTGGTGAATTACTGCGGAACCGTTATCGTGTTCTGAAATTATTAGGAGAAGGTGGTTTTAGTAGAACCTACGCTACAGAAGATGCTGATAGACTCAATGCACCCTGTGTAATTAAACAATTTTTCCCCCAGGTGCATGGAACATCAGAACGTACTAAAGCCGCAGAATTATTTAAAGAAGAAGCCAAGCGTCTATATGAATTGGGAGAAAATCATATTCAAATCCCTCGACTATTGGCATTTTTTGAACAAGGTTCCAGTTTATATTTAGTTCAAGAATTTATTCAAGGAAAAACCCTCTTAGAAGAGGTACAAAAGCAGCATTTTACAGA

a
s   336-3.336-3 2000    1000    +   6283267 AGATAAAGTTAGGGAAATTTTAACTGAGTTATTACCTGTTCTCGATTTTATTCACACCAATAACGTGATTCATCGAGATATTAAACCAGAAAATATTATTCGCCGAGATACAGACGGTAAGCTCGTATTAATCGATTTTGGTGGAGCCAAACAAGTTACCCAAACAAGTTTAGCCAGACAAGCGACAGTTCTCTATACAATTGGTTATGCTCCCAGTGAGCAAATGGCAGGATTTGCTTGCCATGCCAGTGATATCTATGCTTTAGGTGTGACTTGTGTACGCTTACTTACCAGATGTTTGCCACTCCATGATTCATCAGGGCAAGTTTACGATCGCCTTTACGATCCCATGAGTGCTAAATGGCTATGGCGACAACGCATCCAGGAACAAAACATCACAATCAGTGAAGAATTAGGGCATATCCTGGATAAGATGCTGAAAAATTTAGCCTCAGAAAGATATCAATCTGCTGCGGTAATTATGAGAGAGCTACAAAATAGCAAAACCCTATCCCCTGGAGCAAATTCTTCAGAAACACAACCGATAATTGCAATTCCTCGAAATCCATCGAATTTTACCAACTTAGACACCTTTGAATTTGACGTAGTGACAATTGATAGTCATAGTCAGATAGTCAATCGGAACCGTACTGTTGCCCAATATTATGCAGAAGAATTGGGACATGACGTAATTATGGATATGGTAGCAGTACCTGGTGGCTCTTTTATGATGGGTTCCAAGGAAGATGAAGGAGATGCAGACGAACGTCCCCAACACTTAGTAGCAATTGAGCCTTTCTTCATTGGTAGATATCCTGTTACCCAAGCCCAATGGCGAGCAGTGGCATCTTTGCCCAAAGTCAATCAACCCCTCAACCCCTATCCGTCGAAATTCAAAGGTAGCAATCGCCCAGTAGAAAATGTATCTTGGTACGAAGCAGTGGAATTTTGCGCTCGACTAGAGCAAAAAACCGGGCGCAACTATCGCTTACCCAGTGAG

... etc
###

Reformateo las etiquetas de los genomas:

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade
sed 's/PCC_7716.PCC_7716/PCC_7716/g' evolverCalothrix.336-3.hal.maf | sed 's/336-3.336-3/336-3/g' | sed 's/NIES-3974.NIES-3974/NIES-3974/g' | sed 's/NIES-4071.NIES-4071/NIES-4071/g' | sed 's/NIES-4105.NIES-4105/NIES-4105/g' | sed 's/PCC_6303.PCC_6303/PCC_6303/g' >evolverCalothrix.336-3.hal.maf.sed

Copio evolverCallothrix.hal.maf.sed a la carpeta de chromosomes_only y convierto maf a fasta:

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/fna/only_chromosomes/
cd fna/only_chromosomes/
mafToFastaStitcher --maf ../../evolverCalothrix.336-3.hal.maf.sed --seqs 336-3.fna,NIES-3974.fna,NIES-4071.fna,NIES-4105.fna,PCC_6303.fna,PCC_7716.fna --breakpointPenalty 5 --interstitialSequence 20 --outMfa evolverCalothrix.336-3.hal.maf.sed.fasta

Convierto a philyp

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/PALINDROMES/GCGATCGC/
python3 ../../Fasta2Phylip.py ../../fna/only_chromosomes/evolverCalothrix.336-3.hal.maf.sed.fasta

Obtengo las coordenadas del palindromo

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/PALINDROMES/GCGATCGC/
python3 ../../AlignmentPalindromeCoords.py evolverCalothrix.336-3.hal.maf.sed.fasta.phy GCGATCGC 336-3

El ultimo script tiene como salida el archivo Orthologues_Palindrome_sites.txt el cual contiene 2 columnas, una con el nombre del archivo del ortólogo y otra con el intervalo de sitios en donde se encuentra el palindromo:

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade
File = "Orthologues_Palindrome_sites.txt"
table = read.table(File,header = TRUE,sep = "\t")
rmarkdown::paged_table(table,options = list(rows.print = 10, cols.print = 4))

GRAFO

primero cargo las librerias necesarias

library(ggplot2)
library(ggtree)
library(ape)
library(tidyverse)
library(tidytree)
library(phangorn)
library(dplyr)

Preparacion de datos para el grafo

Primero cargo Orthologues_Palindrome_sites.txt del paso anterior.

## /home/lalibelulalo/PIPELINES_2023/ASR_GENOMES/Callothrix_clade/PALINDROMES/GCGATCGC
Sites <- read.table("Orthologues_Palindrome_sites.txt", sep = "\t", header = TRUE)

Creo la matriz que va a contener unicamente las transiciones

DF <- matrix(0,
             nrow = 0,
             ncol = 2)
colnames(DF) <- c("from","to")

Creo una matriz adicional que tendra las transiciones y una columna extra con la direccion de la transición

{LINKS <- matrix(0,
             nrow = 0,
             ncol = 3)
colnames(LINKS) <- c("from","to","direction")}

Creo una matriz adicional que tendra las deleciones:

{DELECIONES <- matrix(0,
                     nrow = 0,
                     ncol = 4)
colnames(DELECIONES) <- c("ortólogo","sitio","deleciones","species")}

Cargo el modelo de evolucion, el metodo de reconstrucción (bayesiano), la filogenia (previamente calculada con orthofinder) la cual enraizo en 336-3.

EvolModel = "F81"
Method = "bayes"
Tree = read.tree("../../SpeciesTree_rooted.txt")
Alignment = "evolverCalothrix.336-3.hal.maf.sed.fasta.phy"

Esta filogenia contiene 11 nodos, sin embargo posteriormente para el analisis se quita la raiz y quedan 10.

ggtree(Tree,branch.length="none",) +
  geom_tiplab(color='firebrick', offset = .14)+
  geom_label(aes(label = node),show.legend = FALSE)

En el archivo Orthologues_Palindrome_sites.txt hay 6065 sitios en total. Cargamos el primer sitio el cual corresponde a:

table[1,]

##                                           FILE      PAL START END
## 1 evolverCalothrix.336-3.hal.maf.sed.fasta.phy GCGATCGC   692 699

El numero ORTH correponde al numero de sitio. Por lo tanto cargamos el PATH del alineamiento del primer sitio (Alignment), así como su inicio (SI) y su termino (EI)

ORTH = 1
SI = Sites[ORTH,3]
EI = Sites[ORTH,4]

Alineación multiple

Cargo la alineación multiple

cyanobacterias <- read.phyDat(Alignment, format = "interleaved")

## Cambio el attributo site.patter a FALSE para poder acceder a todos los indices
cyanobacterias <- subset(cyanobacterias, site.pattern = FALSE) 
parsimony(Tree, cyanobacterias)

## [1] 1580809

Filogenia y parametros del modelo

Estimo los parametros para el modelo. en esta parte uso el modelo de evolucion elegido anteriormente (“F81”). Además en esta parte el arbol se desenraiza y ahora tenemos 10 nodos. Por lo tanto, debo cambiar el arbol al nuevo sin raiz para evitar errores.

fit <- pml(Tree, cyanobacterias)
fit <- optim.pml(fit, model=EvolModel, control = pml.control(trace=0))

Tree2 <- fit[["tree"]]
ggtree(Tree2,branch.length="none",) +
  geom_tiplab(color='firebrick', offset = .14)+
  geom_label(aes(label = node),show.legend = FALSE)

Reconstrucción de Estados Ancestrales

Hago la reconstruccion de estados ancestrales para cada sitio. Para esto hay dos métodos: bayesiano o máxima verosimilitud. Anteriormente habia declarado la variable Method como bayes.

  ## Calculo la probabilidad del árbol filogenético dado el alineamiento y el modelo
  if(Method == "bayes"){
    anc.bayes <- ancestral.pml(fit, type="bayes", return = "prob")
    Reconstruction <- anc.bayes
  }
  if(Method == "ml"){
    anc.ml <- ancestral.pml(fit, "ml")
    Reconstruction <- anc.ml
  }

Armado de palindromos ancestrales

Armo los palindromos ancestrales de acuerdo a los estados ancestrales usando los sitios de inicio y fin de cada sitio palindromico (SI y EI). Posteriormente uso dicho intervalo para buscar lel estado de cada nucleotido del intervalo y guardarlo en el vector (PalindromeNucleotidePositions). Tambien extraigo los nombres de cada nodo de la reconstrucción:

PalInterval = SI:EI ## ESTE ES EL INTERVALO DEL SITIO PALINDROMICO
  
## Obtengo las posiciones para la posicion PalInterval
PalindromeNucleotidePositions <- attributes(cyanobacterias)$index[PalInterval]

## Nombre de los nodos
Nodos <- attributes(Reconstruction)$names

Ahora creo una matriz que contendra 9 columnas, la primera contendrá los Nodos y las 8 restantes contendran los nucleotidos correspondientes a su palindromo y prosigo a rellenar la matriz. Sin embargo, en este paso hay un detalle.

La reconstrucción de estado ancestral para cada sitio arroja una probabilidad para cada uno de los 4 sitios posibles (A, T ,G o C). Si hay equivalencia en la probabilidad en los estados, es decir que el estado ancestral de un nucleotido tenga la misma probabilidad de haber estado en cualesquiera de los 4 posibles estados, entonces dicho estado se omite, de lo contrario se toma aquel estado con la mayor probabilidad. Otro punto importante es que hay sitios en los que una especie puede contener deleciones. Para estos casos, si gfalta algún nucleotido sew rellena con un “-”.

  ## Transformo el arbol en data frame para extraer las etiquetas de los nodos.
  a.1 <- as_tibble(Tree2)

  ## NA.NODE corresponde al nodo a partir del cual ya no estan las puntas del arbol
  NA.NODE = length(cyanobacterias)+1
  
  ## Esto es para agregar la etiqueta de los nodos a los que faltan, es decir a los nodos que no son puntas del arbol.
  for (i in NA.NODE:length(Nodos)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
    a.1$label[i] <- a.1$node[i]
  }
  
  ## Agrego las etiquetas de nodo al arbol
  b.1 <- tibble::tibble(label = Nodos)
  c.1 <- full_join(a.1, b.1, by = 'label')
  NodeLabels <- c.1$label
  
  TipLabels <- Tree2$tip.label
  NodeLabels2 <- NodeLabels[-c(1:length(TipLabels))]

  ## Creo una matriz de 8 x length(Nodos) (8 nucleotidos x numero de nodos) para guardar los resultados
  ScoresMtx <- matrix(0,
                      nrow = length(Nodos),
                      ncol = 8)
  colnames(ScoresMtx) <- c("1","2","3","4","5","6","7","8")
  rownames(ScoresMtx) <- Nodos
  
  
  ## Relleno la matriz con los scores.
  ## ESTE CICLO ES UNICAMENTE PARA LAS PUNTAS DE LOS NODOS
  j = 1
  i = 1
  k = 0
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in TipLabels){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "-"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }
  
  ## Relleno la matriz con los scores
  ## ESTA PARTE ES PARA TODOS LOS NODOS RESTANTES QUE NO SON PUNTAS
  j = 1
  i = length(TipLabels)+1
  k = 0

  for (Position in PalindromeNucleotidePositions){
    for (Nodo in NodeLabels2){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "X"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=length(TipLabels)+1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }

Ahora decodificamos el palindromo a nucleotidos para esto creo una nueva matriz que contendra dos columnas: nodos y sus correspondientes palíndromos.

## Creo un matriz de 1x14 para guardar la etiqueta (palindromo) de cada nodo
PalsMtx <- matrix(0,
                  nrow = length(Nodos),
                  ncol = 1)
colnames(PalsMtx) <- c("Palindrome")
rownames(PalsMtx) <- Nodos

## Extraigo cada palindromo, cambio el codigo de numero por el codigo de letras y lo guardo en la matriz 1x14
i=0
j=0
for (Nodo in Nodos){
  j=j+1
  Palindrome <- as.character(ScoresMtx[j,]) # Este es el palindromo
  pal =""
  for (nuc in Palindrome){
    if(nuc== 1){nuc = 'A'}
    if(nuc== 2){nuc = 'C'}
    if(nuc== 3){nuc = 'G'}
    if(nuc== 4){nuc = 'T'}
    pal = paste(pal,nuc,sep = "")
  }
  PalsMtx[j,1] <- pal
}

Guardo los nodos y los palindromos de cada nodo en 2 vectores. Los cuales voy a usar para anotar el arbol filogenetico.

l <- as.list(PalsMtx[,1]) 
Nodes = names(l)
NodePals = as.character(PalsMtx[,1])

Anotacion del arbol filogenético

Para poder hacer esto, primero transformo el arbol filogenetico en una tabla y agrego 2 columnas: (label) que contendra el nombre de cada nodo y (Palindromo) que contendra el palindromo de dicho nodo.

## Transformo el arbol en data frame para poder agregar los palindromos de cada nodo
x <- as_tibble(Tree2)

## Extraigo el Boostrap por si se ocupa
NA.NODE = length(cyanobacterias)+1
BootStrap <- x$label[NA.NODE:length(Nodes)]
for (i in NA.NODE:length(Nodes)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
  x$label[i] <- x$node[i] # x$node[i-1]
}
## Agrego los datos de los palindromos al arbol
d <- tibble::tibble(label = Nodes,
                    Palindromo = NodePals)
y <- full_join(x, d, by = 'label')

Hasta aqui tenemos una filogenoa anotada con su palindromo ancestral en cada nodo:

ggtree(as.treedata(y),branch.length="none") +
    geom_tiplab(color='firebrick', offset = .14)+
    geom_label(aes(label = Palindromo,color=Palindromo),show.legend = FALSE)+
    aes(color=Palindromo)+
    guides(fill = guide_legend(override.aes = list(color = NA)), 
           color = FALSE, 
           shape = 1)

Matriz de transiciones para el grafo

Ahora necesito crear una matriz que tenga las tranmsiciones entre cada nodo. Para esto extraigo las direcciones de los nodos desde el arbol filogenetico previamente anotado es decir las columnas from y to y las guardo en una nueva matriz:

## Esta es la configuracion del arbol
from<-y$label#c("336-3","7","7","8","8","9","9","10","10")
to<-y$parent#c("7","NIES-3974","8","PCC_6303","9","PCC_7716","10","NIES-4071","NIES-4105")

LinksMtx <- matrix(0,
                   nrow = length(from),
                   ncol = 2)
colnames(LinksMtx) <- c("from","to")

for (i in 1:length(from)){
  LinksMtx[i,1] = from[i]
  LinksMtx[i,2] = to[i]
}

Luego extraigo las columnas label y Palindromo y las guardo en otra matriz. Dicha matriz me servira como diccionario para obtener las transiciones en cada nodo. Posteriormente la convierto a data frame.

## AQUI VOY A CREAR LA MATRIZ QUE ME SERVIRA COMO DICCIONARIO PARA RELLENAR LA MATRIZ DE TRANSICIONES
Mat <- y[,4:5]

Mat2 <- as.data.frame(Mat) ## CONVIERTO A DF
Mat2

##        label Palindromo
## 1      336-3   GCGATCGC
## 2  NIES-3974   --------
## 3   PCC_6303   --------
## 4   PCC_7716   --------
## 5  NIES-4071   --------
## 6  NIES-4105   --------
## 7          7   GCGATCGC
## 8          8   GCGATCGC
## 9          9   GCGATCGC
## 10        10   GCGATCGC

Hago un conteo de las deleciones y las guardo en un vector que contendrá el ortologo, sitio y numero de deleciones:

d = 0
SPP = c()
for (i in 1:length(Mat2[,1])) {
  if(Mat2[i,2] == '--------'){
    d = d+1
    spp = Mat2[i,1]
    SPP = c(SPP,spp)
  }
}
Spp <- paste(SPP, collapse = ",")
deleciones <-c()
if(d != 0){
  SITE = paste0(Sites[ORTH,3],":",Sites[ORTH,4])
  deleciones <- c(Sites[ORTH,1],SITE,d,Spp)
}

Ahora creo una nueva matriz que va a tener todas las transiciones. Para esto uso las ultimas dos matrices. La matriz LinksMtx que tiene las direcciones entre nodos la paso a palindromos con ayuda de Mat2 la cual me sirve como diccionario y convierto esta ultima matriz (LinksMtx2) a dataframe.

## AQUI CREO LA MATRIZ DE TRANSICIONES.
LinksMtx2 <- matrix(0,
                    nrow = length(from),
                    ncol = 2)
colnames(LinksMtx2) <- c("from","to")

## AQUI RELLENO LA MATRIZ DE TRANSICIONES USANDO COMO DICCIONARIO A LA MATRIZ ANTERIOR (Mat2)
for (i in 1:length(from)) {
  for (j in 1:length(from)){
    if (LinksMtx[i,1]==Mat2[j,1]){
      LinksMtx2[i,1] = Mat2[j,2]
    }
    if (LinksMtx[i,2]==Mat2[j,1]){
      LinksMtx2[i,2] = Mat2[j,2]
    }
  }
}
df <- as.data.frame(LinksMtx2) ## CONVIERTO A DF LA MATRIZ DE TRANSICIONES

Creo una matriz adicional la cual es identica a la anterior salvo que le agrego las direcciones de cada transicion, las cuales podrian ser de utilidad.

directions<-paste(from,to,sep="--") ## ESTE VECTOR CONTIENE LAS DIRECCIONES
Links0 = as.data.frame(LinksMtx2) ## HAGO UNA COPIA DE LA MATRIZ DE TRANSICIONES
Links0$direction<-directions ## AGREGO COMO COLUMNA NUEVA A "directions"

Al final de este paso tendre dos matrices DF y LINKS

DF contiene solo las transiciones

DF <- rbind(DF,df) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES CON LA ANTERIOR PARA CREAR UNA SOLA
DF

##        from       to
## 1  GCGATCGC GCGATCGC
## 2  -------- GCGATCGC
## 3  -------- GCGATCGC
## 4  -------- GCGATCGC
## 5  -------- GCGATCGC
## 6  -------- GCGATCGC
## 7  GCGATCGC GCGATCGC
## 8  GCGATCGC GCGATCGC
## 9  GCGATCGC GCGATCGC
## 10 GCGATCGC GCGATCGC

y LINKS las transiciones con sus respectivas direcciones.

LINKS<-rbind(LINKS,Links0) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES(CON DIRECCIONES) CON LA ANTERIOR PARA CREAR UNA SOLA
LINKS

##        from       to     direction
## 1  GCGATCGC GCGATCGC      336-3--7
## 2  -------- GCGATCGC  NIES-3974--7
## 3  -------- GCGATCGC   PCC_6303--8
## 4  -------- GCGATCGC   PCC_7716--9
## 5  -------- GCGATCGC NIES-4071--10
## 6  -------- GCGATCGC NIES-4105--10
## 7  GCGATCGC GCGATCGC          7--7
## 8  GCGATCGC GCGATCGC          8--7
## 9  GCGATCGC GCGATCGC          9--8
## 10 GCGATCGC GCGATCGC         10--9

Finalmente repito esto para los 6065 sitios palindromicos restantes:

for (ORTH in 2:length(Sites[,1])){ 
  SI = Sites[ORTH,3]
  EI = Sites[ORTH,4]
  PalInterval = SI:EI ## ESTE ES EL INTERVALO DEL SITIO PALINDROMICO
  
  ## Obtengo las posiciones para la posicion PalInterval
  PalindromeNucleotidePositions <- attributes(cyanobacterias)$index[PalInterval]
  
  ## Nombre de los nodos
  Nodos <- attributes(Reconstruction)$names
  
  ################## EN ESTA PARTE ARMAREMOS LOS PALINDROMOS DE LAS PUNTAS ###################
  a.1 <- as_tibble(Tree2)
  ## Extraigo el Boostrap por si se ocupa
  NA.NODE = length(cyanobacterias)+1
  for (i in NA.NODE:length(Nodos)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
    a.1$label[i] <- a.1$node[i] 
  }
  ## Agrego los datos de los palindromos al arbol
  b.1 <- tibble::tibble(label = Nodos)
  c.1 <- full_join(a.1, b.1, by = 'label')
  NodeLabels <- c.1$label
  
  TipLabels <- Tree2$tip.label
  NodeLabels2 <- NodeLabels[-c(1:length(TipLabels))]

  ## Creo una matriz de 8 x length(Nodos) (8 nucleotidos, numero de nodos) para guardar los resultados
  ScoresMtx <- matrix(0,
                      nrow = length(Nodos),
                      ncol = 8)
  colnames(ScoresMtx) <- c("1","2","3","4","5","6","7","8")
  rownames(ScoresMtx) <- Nodos
  
  
  ## Relleno la matriz con los scores
  j = 1
  i = 1
  k = 0
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in TipLabels){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "-"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }
  
  ## Relleno la matriz con los scores
  j = 1
  i = length(TipLabels)+1
  k = 0
  #(length(TipLabels)+1):length(NodeLabels)
  for (Position in PalindromeNucleotidePositions){
    for (Nodo in NodeLabels2){
      PositionScores <- Reconstruction[[Nodo]][Position,] ## Extraigo los scores para el nodo "NODO" en la posicion "POSITION" de la secuencia
      if (length(unique(PositionScores)) == 1){
        k=k+1
        WinnerNucleotide = "X"
        ScoresMtx[i,j] <- WinnerNucleotide
        i=i+1 ## contador para cada Nodo
      }else{
        PositionWinnerScore <- max(PositionScores) ## Extraigo el valor mas grande. Este es el valor de la probabilidad de que el nucleotido ancestral sea ese
        WinnerNucleotide <- which(PositionScores == PositionWinnerScore, arr.ind = T) ## pregunto a que numero (Nucleótido) corresponde ese valor
        ScoresMtx[i,j] <- WinnerNucleotide ## Guardo dicho valor en la matriz de resultados
        i=i+1 ## contador para cada Nodo
      }
    }
    j=j+1 ## contador para cada posición
    i=length(TipLabels)+1 ## Reiniciamoa el contador de Nodo para la siguiente posicion
  }
  ################## EN ESTA PARTE ARMAREMOS LOS PALINDROMOS DE LOS NODOS ###################
  
  ## EN ESTA PARTE ARMAREMOS EL PALINDROMO PARA CADA NODO
  
  ## Creo un matriz de 1x14 para guardar la etiqueta (palindromo) de cada nodo
  PalsMtx <- matrix(0,
                    nrow = length(Nodos),
                    ncol = 1)
  colnames(PalsMtx) <- c("Palindrome")
  rownames(PalsMtx) <- Nodos
  
  ## Extraigo cada palindromo, cambio el codigo de numero por el codigo de letras y lo guardo en la matriz 1x14
  i=0
  j=0
  for (Nodo in Nodos){
    j=j+1
    Palindrome <- as.character(ScoresMtx[j,]) # Este es el palindromo
    pal =""
    for (nuc in Palindrome){
      if(nuc== 1){nuc = 'A'}
      if(nuc== 2){nuc = 'C'}
      if(nuc== 3){nuc = 'G'}
      if(nuc== 4){nuc = 'T'}
      pal = paste(pal,nuc,sep = "")
    }
    PalsMtx[j,1] <- pal
  }
  
  ## ESTA PATRTE ES PARA GUARDAR LOS PALINDROMOS DE CADA NODO
  l <- as.list(PalsMtx[,1]) 
  Nodes = names(l)
  NodePals = as.character(PalsMtx[,1])
  
  ## Transformo el arbol en data frame para poder agregar los palindromos de cada nodo
  x <- as_tibble(Tree2)
  
  ## Extraigo el Boostrap por si se ocupa
  NA.NODE = length(cyanobacterias)+1
  BootStrap <- x$label[NA.NODE:length(Nodes)]
  for (i in NA.NODE:length(Nodes)){ ## ESTO DEPENDE DE LA ESTRUCTURA DEL ARBOL
    x$label[i] <- x$node[i] # x$node[i-1]
  }
  ## Agrego los datos de los palindromos al arbol
  d <- tibble::tibble(label = Nodes,
                      Palindromo = NodePals)
  y <- full_join(x, d, by = 'label')
  
  #------------------------
  
  ## Esta es la configuracion del arbol
  from<-y$label
  to<-y$parent
  
  LinksMtx <- matrix(0,
                     nrow = length(from),
                     ncol = 2)
  colnames(LinksMtx) <- c("from","to")
  
  for (i in 1:length(from)){
    LinksMtx[i,1] = from[i]
    LinksMtx[i,2] = to[i]
  }
  
  ## AQUI VOY A CREAR LA MATRIZ QUE ME SERVIRA COMO DICCIONARIO PARA RELLENAR LA MATRIZ DE TRANSICIONES
  Mat <- y[,4:5]
  Mat
  Mat2 <- as.data.frame(Mat) ## CONVIERTO A DF
  Mat2
  
  ## AQUI HAGO EL RECUENTO DE LAS DELECIONES DE 8 nucleotidos
  d = 0
  SPP = c()
  for (i in 1:length(Mat2[,1])) {
    if(Mat2[i,2] == '--------'){
      d = d+1
      spp = Mat2[i,1]
      SPP = c(SPP,spp)
    }
  }
  Spp <- paste(SPP, collapse = ",")
  deleciones <-c()
  if(d != 0){
    #print(paste0("Hay ",d," delecion(es) en el archivo ",Sites[ORTH,1],". Intervarlo ",Sites[ORTH,3],":",Sites[ORTH,4],"."))
    SITE = paste0(Sites[ORTH,3],":",Sites[ORTH,4])
    deleciones <- c(Sites[ORTH,1],SITE,d,Spp)
  }
  
  ## AQUI CREO LA MATRIZ DE TRANSICIONES.
  LinksMtx2 <- matrix(0,
                      nrow = length(from),
                      ncol = 2)
  colnames(LinksMtx2) <- c("from","to")
  
  ## AQUI RELLENO LA MATRIZ DE TRANSICIONES USANDO COMO DICCIONARIO A LA MATRIZ ANTERIOR (Mat2)
  for (i in 1:length(from)) {
    for (j in 1:length(from)){
      if (LinksMtx[i,1]==Mat2[j,1]){
        LinksMtx2[i,1] = Mat2[j,2]
      }
      if (LinksMtx[i,2]==Mat2[j,1]){
        LinksMtx2[i,2] = Mat2[j,2]
      }
    }
  }
  
  directions<-paste(from,to,sep="--") ## ESTE VECTOR CONTIENE LAS DIRECCIONES
  Links0 = as.data.frame(LinksMtx2) ## HAGO UNA COPIA DE LA MATRIZ DE TRANSICIONES
  Links0$direction<-directions ## AGREGO COMO COLUMNA NUEVA A "directions"
  Links0
  LINKS<-rbind(LINKS,Links0) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES(CON DIRECCIONES) CON LA ANTERIOR PARA CREAR UNA SOLA
  
  df <- as.data.frame(LinksMtx2) ## CONVIERTO A DF LA MATRIZ DE TRANSICIONES
  DF <- rbind(DF,df) ## UNO LA MATRIZ ACTUAL DE TRANSICIONES CON LA ANTERIOR PARA CREAR UNA SOLA
  DELECIONES <- rbind(DELECIONES,deleciones)
}
DELECIONES = as.data.frame(DELECIONES)
DELECIONES = DELECIONES[order(DELECIONES$deleciones,decreasing = TRUE),]
write.table(DELECIONES,file="DELECIONES.txt",sep="\t",row.names = FALSE,col.names = TRUE)

LINKS = as.data.frame(LINKS)
LINKS = LINKS[order(LINKS$direction),]
write.table(LINKS,file="edges.directions",sep="\t",row.names = FALSE,col.names = TRUE)

File= "DELECIONES.txt"
table = read.table(File,header = TRUE,sep = "\t")
rmarkdown::paged_table(table,options = list(rows.print = 10, cols.print = 4))

Ahora las matrices DF y LINKS contienen todas las transiciones de todos los sitios. En total son 60650 transiciones ya que son 10 transiciones por arbol y tenemos 6065 arboles.

length(DF$from)

## [1] 60650

length(LINKS$from)

## [1] 60650

length(DELECIONES[,1])

## [1] 4978

Filtrado de datos para el GRAFO

Primero cuento el numero de veces que se repite cada linea de la matriz de transiciones. El conteo para cada transicion sera el peso de cada vertice. Luego cambio el nombre de la columna counts a weight, paso su contenido a numeros enteros y la convierto a tibble.

library(data.table)
RowCts <- setDT(DF)[,list(Count=as.numeric(.N)),names(DF)] ## CUENTO EL NUMERO DE OCURRENCIAS DE CADA TRANSICIÓN #6867
colnames(RowCts) <- c("from","to","weight") 
RowCts = transform(RowCts, weight = as.integer(weight)) ## CONVIERTO A ENTEROS LA COLUMNA "weight"
RowCts <-as_tibble(RowCts) ## CONVIERTO A TIBBLE
RowCts #6867

## # A tibble: 6,867 × 3
##    from     to       weight
##    <chr>    <chr>     <int>
##  1 GCGATCGC GCGATCGC  24322
##  2 -------- GCGATCGC  21109
##  3 GCGATCGC GAAATCGC      4
##  4 GAAATCGC GAAATCGC     20
##  5 GAAATCAC GAAATCGC      1
##  6 GCCATCAC GCGATCAC      1
##  7 GCGATTAA GCGATTAA     11
##  8 GCAATTAA GCAATTAA      4
##  9 GCGATCAC GCGATCGC      5
## 10 GCGATTAA GCGATCAC      1
## # ℹ 6,857 more rows

Filtro conteos bajos (<=7) y loops. Al final me quedo con 63 transiciones.

RowCts<-RowCts%>% ##QUITO LOS LOOPS. ES DECIR QUITO AQUELLAS TRANSICIONES QUE VAYAN ASI MISMA
  filter(from!=to)
length(RowCts$from) #5116

## [1] 5116

RowCts<-RowCts%>% ## QUITO AQUELLAS TRANSICIONES CON UN PESO MENOR A 7
  filter(weight>=7)
length(RowCts$from) #72

## [1] 72

Extraigo todos los nodos posibles y los agrego a una matriz que va a contener los nodos del grafo. Adiconalmente le agrego una columna score que me indicara que tan parecido es el aplindromo a HIP1 (despues la quito porque no creo que sea un buen índice)

GraphNodes <- c(RowCts$from, RowCts$to) ## OBTENGO LOS PALINDROMOS DE CADA NODO
length(GraphNodes) #144

## [1] 144

GNMtx <- matrix(0, ## CREO LA MATRIZ QUE CONTENDRÁ LAS NODOS PARA EL GRAFO
                nrow = length(GraphNodes),
                ncol = 2)
colnames(GNMtx) <- c("pal","score") ## ETIQUETO LAS COLUMNAS DE LA MATRIZ

for (i in 1:length(GraphNodes)){ ## RELLENO LA MATRIZ
  GNMtx[i,1] = GraphNodes[i] ## AGREGO LOS NODOS
  GNMtx[i,2] = RecordLinkage::levenshteinSim(GraphNodes[i], 'GCGATCGC')*1000 ## AGREGO UN SCORE QUE SERÁ EL PORCENTAJE DE IDENTIDAD CON HIP1 MULTIPLICADO POR 1000
}

GNMtx <- as.data.frame(GNMtx) ## CONVIERTO LA MATRIZ DE NODOS A DF
GNMtx <- GNMtx[!duplicated(GNMtx), ] ## ELIMINO LOS NODOS DUPLICADOS

Creo un vector que enumerará los palindromos. Dicha enumeracion correspondera a su ID

palindromes<-GNMtx[,1] #34  ## EXTRAIGO LOS NODOS DE LA MATRIZ DEL GRAFO. ESTO LO HAGO PARA POSTERIAMENTE ENUMERARLAS Y ASOCIAR CADA NUMERO COMO ID

ids<-c() ## CREO UN VECTOR VACIO QUE CONTENDRÁ LOS ID'S
for (k in 1:length(palindromes)){ ## ENUMERO LOS PALINDROMOS. ESTOS NUMEROS SERAN LOS IDS
  ids <-append(ids, k)
}
ids

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37

length(ids) #37

## [1] 37

length(GNMtx[,1]) #37

## [1] 37

Creo una matriz nueva que contendrá los nodos del grafo pero con su ID para cada palindromo:

Nodes2 <-cbind(GNMtx,ids) ## CREO UNA MATRIZ NUEVA CON LA MATRIZ DE NODOS Y LOS IDS. ESTA LA VOY A USAR COMO UN DICCIONARIO ID-NODO
length(Nodes2$pal)

## [1] 37

Nodes2<-Nodes2[,c(3,1,2)] ## REORDENO LAS COLUMNAS DE LA MATRIZ
colnames(Nodes2)<-c("id","label","score") ## ETIQUETO LAS COLUMNAS

Creo la matriz de vertices del grafo y tambien sustituyo el nombre de los nodos por sus respectivos ID’S. Adicionalmente cambio los scores que tengan 0 por 1’s ya que hay errores. Sin embargo, la columna score no la uso mas adelante.

GEMtx <- RowCts ## ESTA VA A SER LA MATRIZ DE VERTICES
Edges2 <- as.matrix(RowCts)#RowCts ESTA ES OTRA MATRIZ DE VERTICES DE LA CUAL VOY A SUSTITUIR LOS NODOS POR SUS IDS

for (n in 1:length(Nodes2[,2])){ ##AQUI CAMBIARE LOS PALINDROMOS DE LOS NODOS POR IDS
  Edges2[Edges2==Nodes2[n,2]] <- as.numeric(Nodes2[n,1])
  #print(n)
}

for (n in 1:length(Nodes2[,2])){ ## HANDLE 0's
  Nodes2[Nodes2==0] <- as.numeric(1)
  #print(n)
}

VISUALIZACIÓN DEL GRAFO

Cargo las librerias

library(visNetwork)
library(networkD3)
library(tidygraph)
library(ggraph)

Para empezar cargo los nodos y los vertices. En esta parte quito la columna score de los nodos.

nodes = as_tibble(Nodes2)
nodes = select(nodes, -score)

edges = as_tibble(Edges2)
edges = transform(edges, from = as.integer(from))
edges = transform(edges, to = as.integer(to))
edges = transform(edges, weight = as.integer(weight))
edges = as_tibble(edges)

Creo un grafo dirigido agregando un peso de acuerdo a la columna weight.

routes_tidy <- tbl_graph(nodes = nodes,
                         edges = edges,
                         directed = TRUE)

routes_tidy

## # A tbl_graph: 37 nodes and 72 edges
## #
## # A directed simple graph with 1 component
## #
## # A tibble: 37 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 GCGATAGC
## 3     3 GCGATCGC
## 4     4 GCAATCGC
## 5     5 ACGATCGC
## 6     6 GCGATTGC
## # ℹ 31 more rows
## #
## # A tibble: 72 × 3
##    from    to weight
##   <int> <int>  <int>
## 1     1     3  21109
## 2     2     3     25
## 3     3    24      8
## # ℹ 69 more rows

routes_tidy %>% 
  activate(edges) %>% 
  arrange(desc(weight))

## # A tbl_graph: 37 nodes and 72 edges
## #
## # A directed simple graph with 1 component
## #
## # A tibble: 72 × 3
##    from    to weight
##   <int> <int>  <int>
## 1     1     3  21109
## 2     3     6     51
## 3     4     3     42
## 4     6     3     32
## 5     7     3     31
## 6     3     4     27
## # ℹ 66 more rows
## #
## # A tibble: 37 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 GCGATAGC
## 3     3 GCGATCGC
## # ℹ 34 more rows

Visualizo el grafo:

ggraph(routes_tidy, layout = "graphopt") + 
  geom_node_point() +
  geom_edge_link(aes(width = weight), alpha = 0.8) + 
  scale_edge_width(range = c(0.2, 2)) +
  geom_node_text(aes(label = label), repel = TRUE) +
  labs(edge_width = "Times") +
  theme_graph()

GRAFOS INTERACTIVOS

Este es un grafo sin peso en los vertices.

visNetwork(nodes, edges)

Este es el mismo grafo pero con peso en sus vertices

edges <- mutate(edges, width = weight/5 + 1)
visNetwork(nodes, edges) %>% 
  visIgraphLayout(layout = "layout_with_fr") %>% 
  visEdges(arrows = "middle")

Este es otro grafo de fuerzas con peso en sus vertices y te muestra las conceciones de cada nodo de manera mas visual

nodes_d3 <- mutate(nodes, id = id - 1)
edges_d3 <- mutate(edges, from = from - 1, to = to - 1)

forceNetwork(Links = edges_d3, Nodes = nodes_d3,
             Source = "from", Target = "to", 
             NodeID = "label", Group = "id", Value = "weight", 
             opacity = 1, fontSize = 16, zoom = TRUE)

Este ultimo grafo muestra el grafo anterior pero de una forma mas analizable.

sankeyNetwork(Links = edges_d3, Nodes = nodes_d3, Source = "from", Target = "to", 
              NodeID = "label", Value = "weight", fontSize = 16, unit = "TIME(s)")

GRAFOS POR DIRECCION

Las siguientes figuras corresponden a cada una de las direcciones.

WGHT = 4

336-3–7

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 38 nodes and 38 edges
## #
## # A directed multigraph with 1 component
## #
## # A tibble: 38 × 4
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     1 336-3--7    4988
## 2     1     9 336-3--7      40
## 3     1    19 336-3--7      24
## 4     1     3 336-3--7      17
## 5     1     4 336-3--7      17
## 6     1    11 336-3--7      14
## # ℹ 32 more rows
## #
## # A tibble: 38 × 2
##      id label   
##   <int> <chr>   
## 1     1 GCGATCGC
## 2     2 GAGATCGC
## 3     3 GCAATTGC
## # ℹ 35 more rows

NIES-3974–7

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 12 nodes and 13 edges
## #
## # A directed multigraph with 1 component
## #
## # A tibble: 12 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 GCGATCGC
## 3     3 GTGATCGC
## 4     4 GCAATTGC
## 5     5 GCGATTGC
## 6     6 GCGATCGT
## # ℹ 6 more rows
## #
## # A tibble: 13 × 4
##    from    to direction    weight
##   <int> <int> <chr>         <int>
## 1     1     2 NIES-3974--7   4296
## 2     2     2 NIES-3974--7    600
## 3     1     4 NIES-3974--7      7
## # ℹ 10 more rows

## # A tbl_graph: 12 nodes and 13 edges
## #
## # A directed multigraph with 1 component
## #
## # A tibble: 13 × 4
##    from    to direction    weight
##   <int> <int> <chr>         <int>
## 1     1     2 NIES-3974--7   4296
## 2     2     2 NIES-3974--7    600
## 3     1     5 NIES-3974--7     15
## 4     1     8 NIES-3974--7      9
## 5     1     4 NIES-3974--7      7
## 6     1     7 NIES-3974--7      6
## # ℹ 7 more rows
## #
## # A tibble: 12 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 GCGATCGC
## 3     3 GTGATCGC
## # ℹ 9 more rows

NIES-4071–10

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 49 nodes and 48 edges
## #
## # A directed multigraph with 48 components
## #
## # A tibble: 48 × 4
##    from    to direction     weight
##   <int> <int> <chr>          <int>
## 1     1    49 NIES-4071--10   4212
## 2     8     8 NIES-4071--10     49
## 3    11    11 NIES-4071--10     34
## 4    25    25 NIES-4071--10     28
## 5    24    24 NIES-4071--10     22
## 6    33    33 NIES-4071--10     18
## # ℹ 42 more rows
## #
## # A tibble: 49 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 ACGGTCGG
## 3     3 GCGATAGC
## # ℹ 46 more rows

NIES-4105–10

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 49 nodes and 48 edges
## #
## # A directed multigraph with 48 components
## #
## # A tibble: 48 × 4
##    from    to direction     weight
##   <int> <int> <chr>          <int>
## 1     1    49 NIES-4105--10   4212
## 2     8     8 NIES-4105--10     49
## 3    11    11 NIES-4105--10     34
## 4    25    25 NIES-4105--10     28
## 5    24    24 NIES-4105--10     22
## 6    33    33 NIES-4105--10     18
## # ℹ 42 more rows
## #
## # A tibble: 49 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 ACGGTCGG
## 3     3 GCGATAGC
## # ℹ 46 more rows

PCC_6303–8

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 9 nodes and 12 edges
## #
## # A directed multigraph with 1 component
## #
## # A tibble: 12 × 4
##    from    to direction   weight
##   <int> <int> <chr>        <int>
## 1     1     2 PCC_6303--8   4175
## 2     2     2 PCC_6303--8    527
## 3     2     8 PCC_6303--8     11
## 4     3     3 PCC_6303--8      5
## 5     2     6 PCC_6303--8      5
## 6     2     7 PCC_6303--8      5
## # ℹ 6 more rows
## #
## # A tibble: 9 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 GCGATCGC
## 3     3 GCAATTGC
## # ℹ 6 more rows

}

PCC_7716–9

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 39 nodes and 42 edges
## #
## # A directed multigraph with 32 components
## #
## # A tibble: 42 × 4
##    from    to direction   weight
##   <int> <int> <chr>        <int>
## 1     1    37 PCC_7716--9   4214
## 2     7     7 PCC_7716--9     30
## 3     6     6 PCC_7716--9     29
## 4    13    13 PCC_7716--9     26
## 5     9     9 PCC_7716--9     15
## 6    18    18 PCC_7716--9     15
## # ℹ 36 more rows
## #
## # A tibble: 39 × 2
##      id label   
##   <int> <chr>   
## 1     1 --------
## 2     2 GAAATCGC
## 3     3 GCGATAGC
## # ℹ 36 more rows

10–9

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 39 nodes and 44 edges
## #
## # A directed multigraph with 33 components
## #
## # A tibble: 44 × 4
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     1 10--9       4220
## 2     5     5 10--9         25
## 3    14    14 10--9         20
## 4    11    11 10--9         19
## 5    25    25 10--9         19
## 6    20    20 10--9         17
## # ℹ 38 more rows
## #
## # A tibble: 39 × 2
##      id label   
##   <int> <chr>   
## 1     1 GCGATCGC
## 2     2 GCGATAGC
## 3     3 GCGACCGC
## # ℹ 36 more rows

9–8

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 34 nodes and 47 edges
## #
## # A directed multigraph with 4 components
## #
## # A tibble: 47 × 4
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     1 9--8        4228
## 2     3     1 9--8          25
## 3     4     1 9--8          22
## 4     2     1 9--8          18
## 5    14     1 9--8          18
## 6    17     1 9--8          16
## # ℹ 41 more rows
## #
## # A tibble: 34 × 2
##      id label   
##   <int> <chr>   
## 1     1 GCGATCGC
## 2     2 GCGATAGC
## 3     3 GCAATTGC
## # ℹ 31 more rows

8–7

ggtree(Tree2) +
  geom_tiplab(color='firebrick', offset = .10, hjust=3.0)+
  geom_label(aes(label = node),show.legend = TRUE)

## [1] 11612

## # A tbl_graph: 21 nodes and 29 edges
## #
## # A directed multigraph with 13 components
## #
## # A tibble: 21 × 2
##      id label   
##   <int> <chr>   
## 1     1 GCGATCGC
## 2     2 GCAATCGC
## 3     3 GCAATTGC
## 4     4 GCTATCGC
## 5     5 ACGATCGC
## 6     6 GCGATTGC
## # ℹ 15 more rows
## #
## # A tibble: 29 × 4
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     1 8--7        4771
## 2     2     1 8--7          26
## 3     3     3 8--7          11
## # ℹ 26 more rows

## # A tbl_graph: 21 nodes and 29 edges
## #
## # A directed multigraph with 13 components
## #
## # A tibble: 29 × 4
##    from    to direction weight
##   <int> <int> <chr>      <int>
## 1     1     1 8--7        4771
## 2     2     1 8--7          26
## 3     6     6 8--7          21
## 4     2     2 8--7          14
## 5     6     1 8--7          13
## 6     4     4 8--7          12
## # ℹ 23 more rows
## #
## # A tibble: 21 × 2
##      id label   
##   <int> <chr>   
## 1     1 GCGATCGC
## 2     2 GCAATCGC
## 3     3 GCAATTGC
## # ℹ 18 more rows

Reconstrucción Ancestral de Secuencias - GENOMAS - GCGATCGC - 336-3

Eduardo PM

2023-03-13