Packages in R are hosted in three large repositories (a repository is a site where packages are stored):
A package, in the R environment, is a set of routines designed to handle or analyze one specific type of information from start to finish. This involves both the:
Most of these packages are developed by independent research groups, and support for them is usually provided by the scientific community, which is one of the greatest strengths of the R environment.
For this class we will use the following packages:
To install a package, there are different functions depending on the repository.
install.packages("seqinr")
Although this function works on its own, it is advisable to set the dependencies parameter. The reason is simple: R code is built from reusable objects and functions, so a package usually relies on objects and functions defined in other packages, and those packages must also be installed for it to run properly. By default R already installs the packages that are strictly required; setting dependencies = TRUE also installs the suggested ones, so that everything the package may need is installed automatically.
install.packages("seqinr", dependencies = T)
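A common pattern is to install a package only when it is missing. The sketch below is one way to check for that; the helper name missing_packages() is made up for this example and is not part of R or seqinr.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# "stats" and "utils" ship with R, so nothing is reported as missing here.
missing_packages(c("stats", "utils"))

# To install whatever is missing, uncomment the next two lines:
# to_install <- missing_packages(c("seqinr"))
# if (length(to_install) > 0) install.packages(to_install, dependencies = TRUE)
```

Keeping the check separate from the installation makes it possible to see what would be installed before anything is downloaded.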
To install a package from the Bioconductor repository, we can use the following commands:
## Bioconductor version 3.16 (BiocManager 1.30.20), R 4.2.3 (2023-03-15)
## Warning: package(s) not installed when version(s) same as or greater than current; use
## `force = TRUE` to re-install: 'msa'
## Old packages: 'fontawesome', 'processx', 'ps', 'segmented', 'TH.data',
## 'tinytex', 'zip', 'zoo'
As this case shows, we first have to install the Bioconductor “package
installer” using the standard R function
install.packages. Once it is installed, we can use
BiocManager::install to install the package we want.
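The commands themselves are not shown in this text, only their output. Based on the description above, they would look roughly like the following sketch. Wrapping them in a function (install_bioc_package() is a name chosen for this example) is an addition and not part of the original.

```r
# Sketch: install a Bioconductor package, installing BiocManager first if needed.
install_bioc_package <- function(pkg) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")   # the installer itself comes from CRAN
  }
  BiocManager::install(pkg)
}
```

Calling install_bioc_package("msa") would then install the msa package mentioned in the output above.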
The functions above only download the package from the cloud (the repository); it is not yet loaded into our working environment. To load it, we have to call the package with the library() function.
library(seqinr)
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
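To confirm that a package really is attached to the session after calling library(), one option is to look at the search path. The sketch below uses stats4, which ships with R, so that it runs without installing anything.

```r
library(stats4)                                 # attach a package that ships with R
is_attached <- "package:stats4" %in% search()   # search() lists the attached packages
is_attached
```

The same check with "package:seqinr" or "package:msa" should work for the packages loaded above.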
To load the sequences we can use two methods:
For this we can use the read.fasta() function from the seqinr package.
NOTE: Functions from different packages often share the same name. This is a problem because R uses the version from the most recently loaded package, which may not be the one you intended. To avoid this, write the package name followed by the function name, separated by “::”.
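The "masked from" messages printed when msa was loaded are an instance of this. The same effect can be reproduced with base R alone; in this sketch a user-defined function named sd (made up for the example) masks stats::sd.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Define our own function with the same name as stats::sd; it now masks it.
sd <- function(x) "my own sd"

masked_result    <- sd(x)          # R finds our function first
qualified_result <- stats::sd(x)   # "::" reaches the version in the stats package

masked_result
qualified_result

rm(sd)   # remove our function so that sd() refers to stats::sd again
```

Writing the package name explicitly costs a few characters but removes any doubt about which function is being called.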
We are going to work with the actin sequence of Trypanosoma cruzi, which has some interesting molecular features. Download the sequence to your local disk in FASTA format.
actin <- seqinr::read.fasta(file = "/Users/alfredocardenasrivera/Downloads/actin_Tc.fasta") # change the path to match the folder on your local disk
actin
## $XP_809496.1
## [1] "m" "e" "a" "t" "l" "w" "d" "e" "e" "p" "a" "v" "v" "l" "d" "n" "g" "s"
## [19] "g" "n" "i" "k" "c" "g" "f" "a" "g" "e" "e" "i" "p" "r" "c" "v" "f" "p"
## [37] "s" "v" "t" "g" "v" "s" "m" "n" "a" "r" "s" "s" "g" "s" "s" "s" "s" "q"
## [55] "r" "v" "y" "v" "g" "d" "e" "a" "l" "q" "e" "k" "g" "l" "r" "y" "f" "y"
## [73] "p" "m" "e" "h" "g" "i" "v" "f" "d" "w" "d" "q" "m" "e" "r" "v" "w" "r"
## [91] "h" "a" "y" "e" "q" "l" "r" "v" "p" "p" "e" "r" "q" "a" "v" "l" "l" "t"
## [109] "e" "a" "p" "l" "n" "p" "i" "s" "n" "r" "e" "k" "m" "a" "e" "t" "l" "f"
## [127] "e" "s" "f" "g" "v" "p" "a" "l" "h" "v" "q" "i" "q" "a" "v" "l" "t" "l"
## [145] "y" "s" "s" "g" "r" "t" "d" "g" "l" "v" "l" "d" "s" "g" "d" "g" "v" "t"
## [163] "h" "l" "v" "p" "v" "f" "e" "g" "q" "t" "m" "p" "q" "s" "v" "r" "r" "l"
## [181] "e" "l" "a" "g" "r" "d" "l" "t" "e" "w" "m" "m" "e" "l" "l" "s" "d" "e"
## [199] "l" "d" "r" "p" "f" "t" "t" "s" "a" "d" "r" "e" "i" "a" "r" "r" "v" "k"
## [217] "e" "s" "l" "c" "y" "i" "p" "l" "f" "f" "e" "e" "e" "l" "q" "a" "a" "e"
## [235] "e" "d" "g" "i" "n" "e" "d" "v" "k" "g" "k" "e" "p" "f" "t" "l" "p" "d"
## [253] "g" "e" "v" "i" "h" "v" "g" "r" "a" "r" "f" "c" "c" "p" "e" "i" "l" "f"
## [271] "n" "p" "a" "l" "a" "e" "k" "p" "y" "d" "g" "i" "q" "h" "a" "v" "i" "n"
## [289] "c" "v" "n" "s" "c" "p" "i" "d" "l" "r" "r" "q" "l" "l" "g" "s" "i" "v"
## [307] "l" "s" "g" "g" "n" "t" "m" "f" "k" "g" "m" "q" "q" "r" "l" "q" "s" "e"
## [325] "l" "a" "a" "l" "a" "n" "k" "r" "a" "a" "e" "d" "v" "r" "v" "v" "a" "a"
## [343] "s" "e" "r" "k" "f" "s" "v" "w" "i" "g" "a" "a" "i" "l" "a" "s" "l" "t"
## [361] "s" "f" "a" "s" "e" "w" "i" "t" "r" "t" "e" "y" "a" "e" "q" "g" "a" "a"
## [379] "v" "l" "h" "k" "r" "c" "d" "s" "l" "s" "f" "v" "s" "k"
## attr(,"name")
## [1] "XP_809496.1"
## attr(,"Annot")
## [1] ">XP_809496.1 actin 2, putative [Trypanosoma cruzi]"
## attr(,"class")
## [1] "SeqFastadna"
You should see the following printed in the console: the sequence as a vector of characters, the sequence name, the accession code, the organism and the “class” set by this function.
NOTE: If you check the documentation of this function, you will see that the package defines two classes for it:
- DNA: the default, used for nucleotide sequences
- AA: used for amino acid sequences
NOTE: to open the documentation of a function you can use the following line of code:
?read.fasta
As you will have noticed, the sequence was assigned the DNA type or class. This is wrong, because it is an amino acid sequence. To correct this we have to set the seqtype parameter to AA.
actin <- seqinr::read.fasta(file = "/Users/alfredocardenasrivera/Downloads/actin_Tc.fasta",
seqtype = "AA")
actin
## $XP_809496.1
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "A" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "F" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "R"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "L" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "S" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "I" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "V" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "Q" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
## attr(,"name")
## [1] "XP_809496.1"
## attr(,"Annot")
## [1] ">XP_809496.1 actin 2, putative [Trypanosoma cruzi]"
## attr(,"class")
## [1] "SeqFastaAA"
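The effect of seqtype can be checked on a small file without downloading anything. The sketch below writes a made-up one-record FASTA file (the record name toy_protein is only for illustration) to a temporary location and reads it both ways. It assumes seqinr is installed and skips the reading step otherwise.

```r
# Write a tiny FASTA file to a temporary location.
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">toy_protein made-up example record",
             "MEATLWDEEPAVVLDNGS"),
           fasta_file)

if (requireNamespace("seqinr", quietly = TRUE)) {
  as_dna <- seqinr::read.fasta(file = fasta_file)                  # default seqtype = "DNA"
  as_aa  <- seqinr::read.fasta(file = fasta_file, seqtype = "AA")  # amino acid sequence

  print(class(as_dna[[1]]))        # class assigned with the default
  print(class(as_aa[[1]]))         # class assigned with seqtype = "AA"
  print(seqinr::c2s(as_aa[[1]]))   # collapse the character vector into one string
}
```

Comparing the two class values on a toy file is a quick way to confirm that the right seqtype is being used before loading real data.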
We can also load multiple sequences stored in a single file.
multi.actin <- seqinr::read.fasta(file = "/Users/alfredocardenasrivera/Downloads/multi_actin.txt",
seqtype = "AA")
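read.fasta() returns a list with one element per sequence, so the usual list tools apply. The sketch below uses a small made-up list (toy_seqs) as a stand-in for multi.actin so that it runs without the file.

```r
# Stand-in for the result of read.fasta(): a named list of character vectors.
toy_seqs <- list(
  seq_a = base::strsplit("MEATLW", "")[[1]],
  seq_b = base::strsplit("MDDEVAALV", "")[[1]]
)

length(toy_seqs)    # number of sequences
names(toy_seqs)     # sequence identifiers
lengths(toy_seqs)   # number of residues in each sequence
```

base::strsplit is written with its package name because, as the loading messages show, Biostrings masks strsplit once msa is attached.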
Several packages allow us to import sequences directly from remote
repositories. To find out which databases are accessible through
seqinr we can use the seqinr::choosebank() function.
(I recommend setting the parameter infobank = T to see the
status of each database; some of them may be temporarily
inactive.)
seqinr::choosebank(infobank = T)
## bank status
## 1 genbank on
## 2 embl on
## 3 emblwgs on
## 4 genbankseqinr on
## 5 swissprot on
## 6 ensembl on
## 7 hogenom7dna on
## 8 hogenom7 on
## 9 hogenom on
## 10 hogenomdna on
## 11 hovergendna on
## 12 hovergen on
## 13 hogenom5 on
## 14 hogenom5dna on
## 15 hogenom4 on
## 16 hogenom4dna on
## 17 homolens on
## 18 homolensdna on
## 19 hobacnucl on
## 20 hobacprot on
## 21 phever2 on
## 22 phever2dna on
## 23 refseq on
## 24 refseq16s on
## 25 greviews on
## 26 bacterial off
## 27 archaeal on
## 28 protozoan on
## 29 ensprotists on
## 30 ensfungi on
## 31 ensmetazoa on
## 32 ensplants on
## 33 ensemblbacteria on
## 34 mito on
## 35 polymorphix on
## 36 emglib on
## 37 refseqViruses on
## 38 ribodb on
## 39 taxodb on
## info
## 1 GenBank Release 246 (15 October 2021) Last Updated: Nov 19, 2021
## 2 EMBL Nucleotide Archive Release 143 (March 2020) Last Updated: Nov 21, 2021
## 3 EMBL Whole Genome Shotgun sequences (July 2018)
## 4 GenBank Release 231 (15 April 2019) Last Updated: Jun 8, 2019
## 5 UniProt Knowledgebase Release 2021_03 of 09-Jun-2021, Last Updated: Aug 6, 2021
## 6 Ensembl 85 - (10/03/16) Last Updated: Oct 3, 2016
## 7 HOGENOM - genomic data - Release 07 (Nov 3,2015) Last Updated: Apr 19, 2017
## 8 HOGENOM - protein data - Release 07 (Nov 3,2015) Last Updated: Jun 3, 2019
## 9 HOGENOM - protein data - Release 06 (Oct 30,2011) Last Updated: May 10, 2012
## 10 HOGENOM - genomic data - Release 06 (Oct 30,2011) Last Updated: Nov 14, 2011
## 11 HOVERGEN - genomic data - Release 49 (Dec 22 2009) Last Updated: Dec 22, 2009
## 12 HOVERGEN - protein data - Release 49 (Dec 22 2009) Last Updated: Dec 22, 2009
## 13 HOGENOM5 (AA)
## 14 HOGENOM5 (DNA)
## 15 HOGENOM4 (AA)
## 16 HOGENOM4 (DNA)
## 17 HOMOLENS 5 - Homologous genes from Ensembl(60)\t Last Updated: Feb 17, 2011
## 18 HOMOLENS 5 - Homologous genes from Ensembl(60)\t Last Updated: Feb 17, 2011
## 19 HOBACGEN - genomic data - Release 10 (February 12 2002)
## 20 HOBACGEN - protein data - Release 10 (February 12 2002)
## 21 PhEVER - protein data - Release 2 (June 1 2010) Last Updated: Jul 22, 2010
## 22 PhEVER - genomic data - Release 2 (June 1 2010) Last Updated: Jul 22, 2010
## 23 Refseq RNA Sequences. Last Updated: Jan 27, 2018
## 24 Refseq RNA 16S 23S Sequences. Last Updated: Jan 29, 2018
## 25 Genome Review from EBI Last Updated: Jan 24, 2013
## 26 NCBI Bacterial Genomes
## 27 Archaeal Genomes from NCBI Last Updated: Aug 25, 2015
## 28 Protozoan Genomes from NCBI Last Updated: Feb 23, 2011
## 29 Ensembl protists 86 - (10/05/16) Last Updated: Oct 5, 2016
## 30 Ensembl fungi 86 - (10/09/16) Last Updated: Oct 9, 2016
## 31 Ensembl protists 86 - (10/05/16) Last Updated: Oct 5, 2016
## 32 Ensembl protists 86 - (10/16/16) Last Updated: Oct 16, 2016
## 33 Ensembl Bacterial Genomes 21 - (02/23/14) Last Updated: Feb 27, 2014
## 34 Mitochondrial sequences - Release 41 (May 19, 2010) Last Updated: Jul 9, 2010
## 35 POLYBASE - Release 1 (June 20, 2003)
## 36 EMGLib Release 5 (December 9, 2003)
## 37 RNA sequences - numrel1 (daterel1) Last Updated: May 10, 2012
## 38 RiboDB
## 39 taxonomic database
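The printed table suggests that the result has bank, status and info columns. Assuming it behaves like a regular data frame, the active banks can be picked out with ordinary subsetting. The sketch below uses a three-row stand-in (banks_demo), copied from the output above, instead of a live connection.

```r
# Stand-in for the table returned with infobank = TRUE (three rows from the output).
banks_demo <- data.frame(
  bank   = c("genbank", "swissprot", "bacterial"),
  status = c("on", "on", "off"),
  stringsAsFactors = FALSE
)

# Keep only the banks whose status is "on".
active_banks <- banks_demo$bank[banks_demo$status == "on"]
active_banks
```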
When we run the seqinr::choosebank() function
we must name the database in the bank= parameter.
For this example we will use the
UniProt database ("swissprot"). Once we
run that command, a temporary connection to the server is
established.
seqinr::choosebank(bank = "swissprot")
WARNING: if the server detects no activity on the connection for some time, the connection is dropped and you will have to run
seqinr::choosebank(bank = "swissprot") again to restore it.
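Because the connection can drop, it can help to wrap the call so that a failure produces a clear message instead of stopping a script. The following is a minimal sketch; open_bank() is a hypothetical helper, and it assumes seqinr is installed and that choosebank() signals an error when the server cannot be reached.

```r
# Hypothetical helper: try to open a bank and report whether it worked.
open_bank <- function(bank = "swissprot") {
  tryCatch({
    seqinr::choosebank(bank = bank)
    TRUE
  }, error = function(e) {
    message("Could not connect to '", bank, "': ", conditionMessage(e))
    FALSE
  })
}
```

Calling open_bank() again after a timeout would re-establish the connection, and seqinr::closebank() closes it once the queries are done.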
Now we can use the seqinr::query() function to
retrieve sequences from that database. I recommend reading
the documentation of this function first (you can open it from the
console with ?query), so that
you know which parameters can be used to retrieve or filter your sequences.
This function can retrieve a single sequence or a group of sequences.
First we will retrieve the Spike protein of
SARS-CoV-2. This protein is identified by the accession code P0DTC2,
which we will set in the AC= parameter.
NOTE: when you set a parameter in this function, the WHOLE expression must be in quotes, and there must be NO SPACE between the parameter name, the “=” sign and the value.
spike <- seqinr::query("spike", "AC=P0DTC2")
spike
## 1 SQ for AC=P0DTC2
NOTE: As you may have noticed, this function has an unusual structure: the name of the object we are creating is also the first argument we pass to the function.
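Since the query must be a single quoted string with no spaces around “=”, one way to avoid typing mistakes is to build it with paste0(). This is only a convenience sketch, and the variable names are made up.

```r
accession    <- "P0DTC2"
query_string <- paste0("AC=", accession)   # paste0() inserts no separator
query_string

# With an open connection, the string would be passed as the second argument:
# spike <- seqinr::query("spike", query_string)
```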
The resulting object is a “qaw”, a class defined in this package which, like a list, has several slots such as:
We can explore this object with basic R functions such as
str(), summary(), length(),
structure() and View(), as well as with functions
specific to the seqinr package, such as:
str(spike)
## List of 6
## $ call : language seqinr::query(listname = "spike", query = "AC=P0DTC2")
## $ name : chr "spike"
## $ nelem : int 1
## $ typelist: chr "SQ"
## $ req :List of 1
## ..$ : 'SeqAcnucWeb' chr "SPIKE_SARS2"
## .. ..- attr(*, "length")= num 1273
## .. ..- attr(*, "frame")= num 0
## .. ..- attr(*, "ncbigc")= num 1
## $ socket : 'sockconn' int 4
## ..- attr(*, "conn_id")=<externalptr>
## - attr(*, "class")= chr "qaw"
structure(spike)
## 1 SQ for AC=P0DTC2
summary(spike)
## Length Class Mode
## call 3 -none- call
## name 1 -none- character
## nelem 1 -none- numeric
## typelist 1 -none- character
## req 1 -none- list
## socket 1 sockconn numeric
table(spike$req) # NOTE: here we must point explicitly at the list that holds the sequences
##
## SPIKE_SARS2
## 1
length(spike$req) # NOTE: here we must point explicitly at the list that holds the sequences
## [1] 1
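Since the object behaves like a list, its components can be reached by name with $ or [[ ]]. The sketch below uses a plain list (qaw_demo) that copies a few fields from the str() output above; it is a stand-in, not a real query result.

```r
# Plain list mimicking some fields of the query result shown by str().
qaw_demo <- list(
  name     = "spike",
  nelem    = 1L,
  typelist = "SQ",
  req      = list("SPIKE_SARS2")
)

names(qaw_demo)          # available components
qaw_demo$nelem           # number of elements returned by the query
qaw_demo[["req"]][[1]]   # first entry of the list that holds the sequences
length(qaw_demo$req)     # how many sequences the list holds
```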
seqinr::getSequence(spike,as.string = T)
## [[1]]
## [1] "M" "F" "V" "F" "L" "V" "L" "L" "P" "L" "V" "S" "S" "Q" "C" "V" "N" "L"
## [19] "T" "T" "R" "T" "Q" "L" "P" "P" "A" "Y" "T" "N" "S" "F" "T" "R" "G" "V"
## [37] "Y" "Y" "P" "D" "K" "V" "F" "R" "S" "S" "V" "L" "H" "S" "T" "Q" "D" "L"
## [55] "F" "L" "P" "F" "F" "S" "N" "V" "T" "W" "F" "H" "A" "I" "H" "V" "S" "G"
## [73] "T" "N" "G" "T" "K" "R" "F" "D" "N" "P" "V" "L" "P" "F" "N" "D" "G" "V"
## [91] "Y" "F" "A" "S" "T" "E" "K" "S" "N" "I" "I" "R" "G" "W" "I" "F" "G" "T"
## [109] "T" "L" "D" "S" "K" "T" "Q" "S" "L" "L" "I" "V" "N" "N" "A" "T" "N" "V"
## [127] "V" "I" "K" "V" "C" "E" "F" "Q" "F" "C" "N" "D" "P" "F" "L" "G" "V" "Y"
## [145] "Y" "H" "K" "N" "N" "K" "S" "W" "M" "E" "S" "E" "F" "R" "V" "Y" "S" "S"
## [163] "A" "N" "N" "C" "T" "F" "E" "Y" "V" "S" "Q" "P" "F" "L" "M" "D" "L" "E"
## [181] "G" "K" "Q" "G" "N" "F" "K" "N" "L" "R" "E" "F" "V" "F" "K" "N" "I" "D"
## [199] "G" "Y" "F" "K" "I" "Y" "S" "K" "H" "T" "P" "I" "N" "L" "V" "R" "D" "L"
## [217] "P" "Q" "G" "F" "S" "A" "L" "E" "P" "L" "V" "D" "L" "P" "I" "G" "I" "N"
## [235] "I" "T" "R" "F" "Q" "T" "L" "L" "A" "L" "H" "R" "S" "Y" "L" "T" "P" "G"
## [253] "D" "S" "S" "S" "G" "W" "T" "A" "G" "A" "A" "A" "Y" "Y" "V" "G" "Y" "L"
## [271] "Q" "P" "R" "T" "F" "L" "L" "K" "Y" "N" "E" "N" "G" "T" "I" "T" "D" "A"
## [289] "V" "D" "C" "A" "L" "D" "P" "L" "S" "E" "T" "K" "C" "T" "L" "K" "S" "F"
## [307] "T" "V" "E" "K" "G" "I" "Y" "Q" "T" "S" "N" "F" "R" "V" "Q" "P" "T" "E"
## [325] "S" "I" "V" "R" "F" "P" "N" "I" "T" "N" "L" "C" "P" "F" "G" "E" "V" "F"
## [343] "N" "A" "T" "R" "F" "A" "S" "V" "Y" "A" "W" "N" "R" "K" "R" "I" "S" "N"
## [361] "C" "V" "A" "D" "Y" "S" "V" "L" "Y" "N" "S" "A" "S" "F" "S" "T" "F" "K"
## [379] "C" "Y" "G" "V" "S" "P" "T" "K" "L" "N" "D" "L" "C" "F" "T" "N" "V" "Y"
## [397] "A" "D" "S" "F" "V" "I" "R" "G" "D" "E" "V" "R" "Q" "I" "A" "P" "G" "Q"
## [415] "T" "G" "K" "I" "A" "D" "Y" "N" "Y" "K" "L" "P" "D" "D" "F" "T" "G" "C"
## [433] "V" "I" "A" "W" "N" "S" "N" "N" "L" "D" "S" "K" "V" "G" "G" "N" "Y" "N"
## [451] "Y" "L" "Y" "R" "L" "F" "R" "K" "S" "N" "L" "K" "P" "F" "E" "R" "D" "I"
## [469] "S" "T" "E" "I" "Y" "Q" "A" "G" "S" "T" "P" "C" "N" "G" "V" "E" "G" "F"
## [487] "N" "C" "Y" "F" "P" "L" "Q" "S" "Y" "G" "F" "Q" "P" "T" "N" "G" "V" "G"
## [505] "Y" "Q" "P" "Y" "R" "V" "V" "V" "L" "S" "F" "E" "L" "L" "H" "A" "P" "A"
## [523] "T" "V" "C" "G" "P" "K" "K" "S" "T" "N" "L" "V" "K" "N" "K" "C" "V" "N"
## [541] "F" "N" "F" "N" "G" "L" "T" "G" "T" "G" "V" "L" "T" "E" "S" "N" "K" "K"
## [559] "F" "L" "P" "F" "Q" "Q" "F" "G" "R" "D" "I" "A" "D" "T" "T" "D" "A" "V"
## [577] "R" "D" "P" "Q" "T" "L" "E" "I" "L" "D" "I" "T" "P" "C" "S" "F" "G" "G"
## [595] "V" "S" "V" "I" "T" "P" "G" "T" "N" "T" "S" "N" "Q" "V" "A" "V" "L" "Y"
## [613] "Q" "D" "V" "N" "C" "T" "E" "V" "P" "V" "A" "I" "H" "A" "D" "Q" "L" "T"
## [631] "P" "T" "W" "R" "V" "Y" "S" "T" "G" "S" "N" "V" "F" "Q" "T" "R" "A" "G"
## [649] "C" "L" "I" "G" "A" "E" "H" "V" "N" "N" "S" "Y" "E" "C" "D" "I" "P" "I"
## [667] "G" "A" "G" "I" "C" "A" "S" "Y" "Q" "T" "Q" "T" "N" "S" "P" "R" "R" "A"
## [685] "R" "S" "V" "A" "S" "Q" "S" "I" "I" "A" "Y" "T" "M" "S" "L" "G" "A" "E"
## [703] "N" "S" "V" "A" "Y" "S" "N" "N" "S" "I" "A" "I" "P" "T" "N" "F" "T" "I"
## [721] "S" "V" "T" "T" "E" "I" "L" "P" "V" "S" "M" "T" "K" "T" "S" "V" "D" "C"
## [739] "T" "M" "Y" "I" "C" "G" "D" "S" "T" "E" "C" "S" "N" "L" "L" "L" "Q" "Y"
## [757] "G" "S" "F" "C" "T" "Q" "L" "N" "R" "A" "L" "T" "G" "I" "A" "V" "E" "Q"
## [775] "D" "K" "N" "T" "Q" "E" "V" "F" "A" "Q" "V" "K" "Q" "I" "Y" "K" "T" "P"
## [793] "P" "I" "K" "D" "F" "G" "G" "F" "N" "F" "S" "Q" "I" "L" "P" "D" "P" "S"
## [811] "K" "P" "S" "K" "R" "S" "F" "I" "E" "D" "L" "L" "F" "N" "K" "V" "T" "L"
## [829] "A" "D" "A" "G" "F" "I" "K" "Q" "Y" "G" "D" "C" "L" "G" "D" "I" "A" "A"
## [847] "R" "D" "L" "I" "C" "A" "Q" "K" "F" "N" "G" "L" "T" "V" "L" "P" "P" "L"
## [865] "L" "T" "D" "E" "M" "I" "A" "Q" "Y" "T" "S" "A" "L" "L" "A" "G" "T" "I"
## [883] "T" "S" "G" "W" "T" "F" "G" "A" "G" "A" "A" "L" "Q" "I" "P" "F" "A" "M"
## [901] "Q" "M" "A" "Y" "R" "F" "N" "G" "I" "G" "V" "T" "Q" "N" "V" "L" "Y" "E"
## [919] "N" "Q" "K" "L" "I" "A" "N" "Q" "F" "N" "S" "A" "I" "G" "K" "I" "Q" "D"
## [937] "S" "L" "S" "S" "T" "A" "S" "A" "L" "G" "K" "L" "Q" "D" "V" "V" "N" "Q"
## [955] "N" "A" "Q" "A" "L" "N" "T" "L" "V" "K" "Q" "L" "S" "S" "N" "F" "G" "A"
## [973] "I" "S" "S" "V" "L" "N" "D" "I" "L" "S" "R" "L" "D" "K" "V" "E" "A" "E"
## [991] "V" "Q" "I" "D" "R" "L" "I" "T" "G" "R" "L" "Q" "S" "L" "Q" "T" "Y" "V"
## [1009] "T" "Q" "Q" "L" "I" "R" "A" "A" "E" "I" "R" "A" "S" "A" "N" "L" "A" "A"
## [1027] "T" "K" "M" "S" "E" "C" "V" "L" "G" "Q" "S" "K" "R" "V" "D" "F" "C" "G"
## [1045] "K" "G" "Y" "H" "L" "M" "S" "F" "P" "Q" "S" "A" "P" "H" "G" "V" "V" "F"
## [1063] "L" "H" "V" "T" "Y" "V" "P" "A" "Q" "E" "K" "N" "F" "T" "T" "A" "P" "A"
## [1081] "I" "C" "H" "D" "G" "K" "A" "H" "F" "P" "R" "E" "G" "V" "F" "V" "S" "N"
## [1099] "G" "T" "H" "W" "F" "V" "T" "Q" "R" "N" "F" "Y" "E" "P" "Q" "I" "I" "T"
## [1117] "T" "D" "N" "T" "F" "V" "S" "G" "N" "C" "D" "V" "V" "I" "G" "I" "V" "N"
## [1135] "N" "T" "V" "Y" "D" "P" "L" "Q" "P" "E" "L" "D" "S" "F" "K" "E" "E" "L"
## [1153] "D" "K" "Y" "F" "K" "N" "H" "T" "S" "P" "D" "V" "D" "L" "G" "D" "I" "S"
## [1171] "G" "I" "N" "A" "S" "V" "V" "N" "I" "Q" "K" "E" "I" "D" "R" "L" "N" "E"
## [1189] "V" "A" "K" "N" "L" "N" "E" "S" "L" "I" "D" "L" "Q" "E" "L" "G" "K" "Y"
## [1207] "E" "Q" "Y" "I" "K" "W" "P" "W" "Y" "I" "W" "L" "G" "F" "I" "A" "G" "L"
## [1225] "I" "A" "I" "V" "M" "V" "T" "I" "M" "L" "C" "C" "M" "T" "S" "C" "C" "S"
## [1243] "C" "L" "K" "G" "C" "C" "S" "C" "G" "S" "C" "C" "K" "F" "D" "E" "D" "D"
## [1261] "S" "E" "P" "V" "L" "K" "G" "V" "K" "L" "H" "Y" "T"
seqinr::getName(spike)
## [1] "SPIKE_SARS2"
seqinr::getLength(spike)
## [1] 1273
seqinr::getKeyword(spike)
## [[1]]
## [1] "SPIKE GLYCOPROTEIN"
## [2] "S GLYCOPROTEIN"
## [3] "E2"
## [4] "PEPLOMER PROTEIN"
## [5] "FULL"
## [6] "SPIKE PROTEIN S1"
## [7] "SPIKE PROTEIN S2"
## [8] "SPIKE PROTEIN S2'"
## [9] "PRECURSOR"
## [10] "S"
## [11] "2"
## [12] "VIRION MEMBRANE"
## [13] "SINGLE-PASS TYPE I MEMBRANE PROTEIN"
## [14] "HOST ENDOPLASMIC RETICULUM-GOLGI INTERME"
## [15] "3D-STRUCTURE"
## [16] "COILED COIL"
## [17] "DISULFIDE BOND"
## [18] "FUSION OF VIRUS MEMBRANE WITH HOST ENDOS"
## [19] "FUSION OF VIRUS MEMBRANE WITH HOST MEMBR"
## [20] "GLYCOPROTEIN"
## [21] "HOST CELL MEMBRANE"
## [22] "HOST MEMBRANE"
## [23] "HOST-VIRUS INTERACTION"
## [24] "INHIBITION OF HOST INNATE IMMUNE RESPONS"
## [25] "INHIBITION OF HOST INTERFERON SIGNALING"
## [26] "INHIBITION OF HOST TETHERIN BY VIRUS"
## [27] "LIPOPROTEIN"
## [28] "MEMBRANE"
## [29] "PALMITATE"
## [30] "REFERENCE PROTEOME"
## [31] "SIGNAL"
## [32] "TRANSMEMBRANE"
## [33] "TRANSMEMBRANE HELIX"
## [34] "VIRAL ATTACHMENT TO HOST CELL"
## [35] "VIRAL ENVELOPE PROTEIN"
## [36] "VIRAL IMMUNOEVASION"
## [37] "VIRAL PENETRATION INTO HOST CYTOPLASM"
## [38] "VIRION"
## [39] "VIRULENCE"
## [40] "VIRUS ENTRY INTO HOST CELL"
## [41] "CHAIN"
## [42] "TOPO_DOM"
## [43] "TRANSMEM"
## [44] "DOMAIN"
## [45] "REGION"
## [46] "COILED"
## [47] "MOTIF"
## [48] "SITE"
## [49] "CARBOHYD"
## [50] "DISULFID"
We can also retrieve multiple sequences with the same
seqinr::query() function. To do so we will use the parameter that
specifies the organism and retrieve all the sequences listed
for SARS-CoV-2 with the following command:
seqinr::choosebank(bank = "swissprot")
cov2 <- seqinr::query("cov2", "SP=Severe acute respiratory syndrome coronavirus 2")
length(cov2$req)
## [1] 75730
The result is more than 75 thousand sequences, which is a lot
for an average personal computer to process. Moreover, not
all of the sequences are curated (some may be incomplete,
duplicated, in silico predictions, and so on). To narrow the
set down to the curated (“reviewed”) sequences, we can add
the ST parameter and set it to “reviewed”. We will use the
logical operator “AND” to join both conditions.
seqinr::choosebank(bank = "swissprot")
cov2 <- seqinr::query("cov2", "SP=Severe acute respiratory syndrome coronavirus 2 AND ST=reviewed")
length(cov2$req)
## [1] 16
We have now reduced the number of sequences to just 16, all of them curated, which greatly improves both the processing speed and the quality of the results.
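When several conditions are combined, the query string can be assembled from its parts. A small sketch, with made-up variable names:

```r
criteria <- c("SP=Severe acute respiratory syndrome coronavirus 2",
              "ST=reviewed")

# Join the conditions with the AND operator expected by the query.
combined_query <- paste(criteria, collapse = " AND ")
combined_query
```

Keeping the conditions in a vector makes it easy to add or drop a filter without retyping the whole string.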
To extract all the sequences into a single object we can
use sapply(), a function from the
apply family.
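How sapply() behaves here can be seen on a toy list before running it on the real query result. In this sketch toy_seqs is made up, and paste() stands in for seqinr::getSequence().

```r
# Two short made-up sequences stored as character vectors.
toy_seqs <- list(c("M", "D", "L"), c("M", "S", "D", "N"))

# Apply paste() to every element; collapse = "" is forwarded to each call.
collapsed <- sapply(toy_seqs, paste, collapse = "")
collapsed          # one string per sequence
nchar(collapsed)   # length of each sequence
```

Extra arguments placed after the function name are passed on to every call, which is how as.string = TRUE reaches getSequence() in the command for the real data: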
sapply(cov2$req,seqinr::getSequence,as.string = TRUE)
## [1] "MDLFMRIFTIGTVTLKQGEIKDATPSDFVRATATIPIQASLPFGWLIVGVALLAVFQSASKIITLKKRWQLALSKGVHFVCNLLLLFVTVYSHLLLVAAGLEAPFLYLYALVYFLQSINFVRIIMRLWLCWKCRSKNPLLYDANYFLCWHTNCYDYCIPYNSVTSSIVITSGDGTTSPISEHDYQIGGYTEKWESGVKDCVVLHSYFTSDYYQLYSTQLSTDTGVEHVTFFIYNKIVDEPEEHVQIHTIDGSSGVVNPVMEPIYDEPTTTTSVPL"
## [2] "MSDNGPQNQRNAPRITFGGPSDSTGSNQNGERSGARSKQRRPQGLPNNTASWFTALTQHGKEDLKFPRGQGVPINTNSSPDDQIGYYRRATRRIRGGDGKMKDLSPRWYFYYLGTGPEAGLPYGANKDGIIWVATEGALNTPKDHIGTRNPANNAAIVLQLPQGTTLPKGFYAEGSRGGSQASSRSSSRSRNSSRNSTPGSSRGTSPARMAGNGGDAALALLLLDRLNQLESKMSGKGQQQQGQTVTKKSAAEASKKPRQKRTATKAYNVTQAFGRRGPEQTQGNFGDQELIRQGTDYKHWPQIAQFAPSASAFFGMSRIGMEVTPSGTWLTYTGAIKLDDKDPNFKDQVILLNKHIDAYKTFPPTEPKKDKKKKADETQALPQRQKKQQTVTLLPAADLDDFSKQLQQSMSSADSTQA"
## [3] "MFHLVDFQVTIAEILLIIMRTFKVSIWNLDYIINLIIKNLSKSLTENKYSQLDEEQPMEID"
## [4] "MKIILFLALITLATCELYHYQECVRGTTVLLKEPCSSGTYEGNSPFHPLADNKFALTCFSTQFAFACPDGVKHVYQLRARSVSPKLFIRQEEVQELYSPIFLIVAAIVFITLCFTLKRKTE"
## [5] "MIELSLIDFYLCFLAFLLFLVLIMLIIFWFSLELQDHNETCHA"
## [6] "MKFLVFLGIITTVAAFHQECSLQSCTQHQPYVVDDPCPIHFYSKWYIRVGARKSAPLIELCVDEAGSKSPIQYIDIGNYTVSCLPFTINCQEPKLGSLVVRCSFYEDFLEYHDVRVVLDFI"
## [7] "MMPTIFFAGILIVTTIVYLTIV"
## [8] "MLLLQILFALLQRYRYKPHSLSDGLLLALHFLLFFRALPKS"
## [9] "MAYCWRCTSCCFSERFQNHNPQKEMATSTLQGCSLCLQLAVVVCNSLLTPFARCCWP"
## [10] "MDPKISEMHPALRLVDPQIQLAVTRMENAVGRDQNNVGPKVYPIILRLGSPLSLNMARKTLNSLEDKAFQLTPIAVQMTKLATTEELPDEFVVVTVK"
## [11] "MLQSCYNFLKEQHCQKASTQKGAEAAVKPLLVPHHVVATVQEIQLQAAVGELLLLEWLAMAVMLLLLCCCLTD"
## [12] "MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEIIFLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEKYCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEFKLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATSAALQPEEEQEEDWLDDDSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIEVNSFSGYLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFLEMKSEKQVEQKIAEIPKEEVKPFITESKPSVEQRKQDDKKIKACVEEVTTTLEETKFLTENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKALRKVPTDNYITTYPGQGLNGYTVEEAKTVLKKCKSAFYILPSIISNEKQEILGTVSWNLREMLAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGARFYFYTSKTTVASLINTLNDLNETLVTMPLGYVTHGLNLEEAARYMRSLKVPATVSVSSPDAVTAYNGYLTSSSKTPEEHFIETISLAGSYKDWSYSGQSTQLGIEFLKRGDKSVYYTSNPTTFHLDGEVITFDNLKTLLSLREVRTIKVFTTVDNINLHTQVVDMSMTYGQQFGPTYLDGADVTKIKPHNSHEGKTFYVLPNDDTLRVEAFEYYHTTDPSFLGRYMSALNHTKKWKYPQVNGLTSIKWADNNCYLATALLTLQQIELKFNPPALQDAYYRARAGEAANFCALILAYCNKTVGELGDVRETMSYLFQHANLDSCKRVLNVVCKTCGQQQTTLKGVEAVMYMGTLSYEQFKKGVQIPCTCGKQATKYLVQQESPFVMMSAPPAQYELKHGTFTCASEYTGNYQCGHYKHITSKETLYCIDGALLTKSSEYKGPITDVFYKENSYTTTIKPVTYKLDGVVCTEIDPKLDNYYKKDNSYFTEQPIDLVPNQPYPNASFDNFKFVCDNIKFADDLNQLTGYKKPASRELKVTFFPDLNGDVVAIDYKHYTPSFKKGAKLLHKPIV
WHVNNATNKATYKPNTWCIRCLWSTKPVETSNSFDVLKSEDAQGMDNLACEDLKPVSEEVVENPTIQKDVLECNVKTTEVVGDIILKPANNSLKITEEVGHTDLMAAYVDNSSLTIKKPNELSRVLGLKTLATHGLAAVNSVPWDTIANYAKPFLNKVVSTTTNIVTRCLNRVCTNYMPYFFTLLLQLCTFTRSTNSRIKASMPTTIAKNTVKSVGKFCLEASFNYLKSPNFSKLINIIIWFLLLSVCLGSLIYSTAALGVLMSNLGMPSYCTGYREGYLNSTNVTIATYCTGSIPCSVCLSGLDSLDTYPSLETIQITISSFKWDLTAFGLVAEWFLAYILFTRFFYVLGLAAIMQLFFSYFAVHFISNSWLMWLIINLVQMAPISAMVRMYIFFASFYYVWKSYVHVVDGCNSSTCMMCYKRNRATRVECTTIVNGVRRSFYVYANGGKGFCKLHNWNCVNCDTFCAGSTFISDEVARDLSLQFKRPINPTDQSSYIVDSVTVKNGSIHLYFDKAGQKTYERHSLSHFVNLDNLRANNTKGSLPINVIVFDGKSKCEESSAKSASVYYSQLMCQPILLLDQALVSDVGDSAEVAVKMFDAYVNTFSSTFNVPMEKLKTLVATAEAELAKNVSLDNVLSTFISAARQGFVDSDVETKDVVECLKLSHQSDIEVTGDSCNNYMLTYNKVENMTPRDLGACIDCSARHINAQVAKSHNIALIWNVKDFMSLSEQLRKQIRSAAKKNNLPFKLTCATTRQVVNVVTTKIALKGGKIVNNWLKQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGYKAIDGGVTRDIASTDTCFANKHADFDTWFSQRGGSYTNDKACPLIAAVITREVGFVVPGLPGTILRTTNGDFLHFLPRVFSAVGNICYTPSKLIEYTDFATSACVLAAECTIFKDASGKPVPYCYDTNVLEGSVAYESLRPDTRYVLMDGSIIQFPNTYLEGSVRVVTTFDSEYCRHGTCERSEAGVCVSTSGRWVLNNDYYRSLPGVFCGVDAVNLLTNMFTPLIQPIGALDISASIVAGGIVAIVVTCLAYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLTNDVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRVVFNGVSFSTFEEAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHLAKALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQSAVKRTIKGTHHWLLLTILTSLLVLVQSTQWSLFFFLYENAFLPFAMGIIAMSAFAMMFVKHKHAFLCLFLLPSLATVAYFNMVYMPASWVMRIMTWLDMVDTSLSGFKLKDCVMYASAVVLLILMTARTVYDDGARRVWTLMNVLTLVYKVYYGNALDQAISMWALIISVTSNYSGVVTTVMFLARGIVFMCVEYCPIFFITGNTLQCIMLVYCFLGYFCTCYFGLFCLLNRYFRLTLGVYDYLVSTQEFRYMNSQGLLPPKNSIDAFKLNIKLLGVGGKPCIKVATVQSKMSDVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQGAVDINKLCEEMLDNRATLQAIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAKSEF
DRDAAMQRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDNDALNNIINNARDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWEIQQVVDADSKIVQLSEISMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTACTDDNALAYYNTTKGGRFVLALLSDLQDLKWARFPKSDGTGTIYTELEPPCRFVTDTPKGPKVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDAAKAYKDYLASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCGMWKGYGCSCDQLREPMLQSADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKDEDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQAVGACVLCNSQTSLRCGACIRRPFLCCKCCYDHVISTSHKLVLSVNPYVCNAPGCDVTDVTQLYLGGMSYYCKSHKPPISFPLCANGQVFGLYKNTCVGSDNVTDFNAIATCDWTNAGDYILANTCTERLKLFAAETLKATEETFKLSYGIATVREVLSDRELHLSWEVGKPRPPLNRNYVFTGYRVTKNSKVQIGEYTFEKGDYGDAVVYRGTTTYKLNVGDYFVLTSHTVMPLSAPTLVPQEHYVRITGLYPTLNISDEFSSNVANYQKVGMQKYSTLQGPPGTGKSHFAIGLALYYPSARIVYTACSHAAVDALCEKALKYLPIDKCSRIIPARARVECFDKFKVNSTLEQYVFCTVNALPETTADIVVFDEISMATNYDLSVVNARLRAKHYVYIGDPAQLPAPRTLLTKGTLEPEYFNSVCRLMKTIGPDMFLGTCRRCPAEIVDTVSALVYDNKLKAHKDKSAQCFKMFYKGVITHDVSSAINRPQIGVVREFLTRNPAWRKAVFISPYNSQNAVASKILGLPTQTVDSSQGSEYDYVIFTQTTETAHSCNVNRFNVAITRAKVGILCIMSDRDLYDKLQFTSLEIPRRNVATLQAENVTGLFKDCSKVITGLHPTQAPTHLSVDTKFKTEGLCVDIPGIPKDMTYRRLISMMGFKMNYQV
NGYPNMFITREEAIRHVRAWIGFDVEGCHATREAVGTNLPLQLGFSTGVNLVAVPTGYVDTPNNTDFSRVSAKPPPGDQFKHLIPLMYKGLPWNVVRIKIVQMLSDTLKNLSDRVVFVLWAHGFELTSMKYFVKIGPERTCCLCDRRATCFSTASDTYACWHHSIGFDYVYNPFMIDVQQWGFTGNLQSNHDLYCQVHGNAHVASCDAIMTRCLAVHECFVKRVDWTIEYPIIGDELKINAACRKVQHMVVKAALLADKFPVLHDIGNPKAIKCVPQADVEWKFYDAQPCSDKAYKIEELFYSYATHSDKFTDGVCLFWNCNVDRYPANSIVCRFDTRVLSNLNLPGCDGGSLYVNKHAFHTPAFDKSAFVNLKQLPFFYYSDSPCESHGKQVVSDIDYVPLKSATCITRCNLGGAVCRHHANEYRLYLDAYNMMISAGFSLWVYKQFDTYNLWNTFTRLQSLENVAFNVVNKGHFDGQQGEVPVSIINNTVYTKVDGVDVELFENKTTLPVNVAFELWAKRNIKPVPEVKILNNLGVDIAANTVIWDYKRDAPAHISTIGVCSMTDIAKKPTETICAPLTVFFDGRVDGQVDLFRNARNGVLITEGSVKGLQPSVGPKQASLNGVTLIGEAVKTQFNYYKKVDGVVQQLPETYFTQSRNLQEFKPRSQMEIDFLELAMDEFIERYKLEGYAFEHIVYGDFSHSQLGGLHLLIGLAKRFKESPFELEDFIPMDSTVKNYFITDAQTGSSKCVCSVIDLLLDDFVEIIKSQDLSVVSKVVKVTIDYTEISFMLWCKDGHVETFYPKLQSSQAWQPGVAMPNLYKMQRMLLEKCDLQNYGDSATLPKGIMMNVAKYTQLCQYLNTLTLAVPYNMRVIHFGAGSDKGVAPGTAVLRQWLPTGTLLVDSDLNDFVSDADSTLIGDCATVHTANKWDLIISDMYDPKTKNVTKENDSKEGFFTYICGFIQQKLALGGSVAIKITEHSWNADLYKLMGHFAWWTAFVTNVNASSSEAFLIGCNYLGKPREQIDGYVMHANYIFWRNTNPIQLSSYSLFDMSKFPLKLRGTAVMSLKEGQINDMILSLLSKGRLIIRENNRVVISSDVLVNN"
## [13] "MESLVPGFNEKTHVQLSLPVLQVRDVLVRGFGDSVEEVLSEARQHLKDGTCGLVEVEKGVLPQLEQPYVFIKRSDARTAPHGHVMVELVAELEGIQYGRSGETLGVLVPHVGEIPVAYRKVLLRKNGNKGAGGHSYGADLKSFDLGDELGTDPYEDFQENWNTKHSSGVTRELMRELNGGAYTRYVDNNFCGPDGYPLECIKDLLARAGKASCTLSEQLDFIDTKRGVYCCREHEHEIAWYTERSEKSYELQTPFEIKLAKKFDTFNGECPNFVFPLNSIIKTIQPRVEKKKLDGFMGRIRSVYPVASPNECNQMCLSTLMKCDHCGETSWQTGDFVKATCEFCGTENLTKEGATTCGYLPQNAVVKIYCPACHNSEVGPEHSLAEYHNESGLKTILRKGGRTIAFGGCVFSYVGCHNKCAYWVPRASANIGCNHTGVVGEGSEGLNDNLLEILQKEKVNINIVGDFKLNEEIAIILASFSASTSAFVETVKGLDYKAFKQIVESCGNFKVTKGKAKKGAWNIGEQKSILSPLYAFASEAARVVRSIFSRTLETAQNSVRVLQKAAITILDGISQYSLRLIDAMMFTSDLATNNLVVMAYITGGVVQLTSQWLTNIFGTVYEKLKPVLDWLEEKFKEGVEFLRDGWEIVKFISTCACEIVGGQIVTCAKEIKESVQTFFKLVNKFLALCADSIIIGGAKLKALNLGETFVTHSKGLYRKCVKSREETGLLMPLKAPKEIIFLEGETLPTEVLTEEVVLKTGDLQPLEQPTSEAVEAPLVGTPVCINGLMLLEIKDTEKYCALAPNMMVTNNTFTLKGGAPTKVTFGDDTVIEVQGYKSVNITFELDERIDKVLNEKCSAYTVELGTEVNEFACVVADAVIKTLQPVSELLTPLGIDLDEWSMATYYLFDESGEFKLASHMYCSFYPPDEDEEEGDCEEEEFEPSTQYEYGTEDDYQGKPLEFGATSAALQPEEEQEEDWLDDDSQQTVGQQDGSEDNQTTTIQTIVEVQPQLEMELTPVVQTIEVNSFSGYLKLTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKLVSSFLEMKSEKQVEQKIAEIPKEEVKPFITESKPSVEQRKQDDKKIKACVEEVTTTLEETKFLTENLLLYIDINGNLHPDSATLVSDIDITFLKKDAPYIVGDVVQEGVLTAVVIPTKKAGGTTEMLAKALRKVPTDNYITTYPGQGLNGYTVEEAKTVLKKCKSAFYILPSIISNEKQEILGTVSWNLREMLAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGARFYFYTSKTTVASLINTLNDLNETLVTMPLGYVTHGLNLEEAARYMRSLKVPATVSVSSPDAVTAYNGYLTSSSKTPEEHFIETISLAGSYKDWSYSGQSTQLGIEFLKRGDKSVYYTSNPTTFHLDGEVITFDNLKTLLSLREVRTIKVFTTVDNINLHTQVVDMSMTYGQQFGPTYLDGADVTKIKPHNSHEGKTFYVLPNDDTLRVEAFEYYHTTDPSFLGRYMSALNHTKKWKYPQVNGLTSIKWADNNCYLATALLTLQQIELKFNPPALQDAYYRARAGEAANFCALILAYCNKTVGELGDVRETMSYLFQHANLDSCKRVLNVVCKTCGQQQTTLKGVEAVMYMGTLSYEQFKKGVQIPCTCGKQATKYLVQQESPFVMMSAPPAQYELKHGTFTCASEYTGNYQCGHYKHITSKETLYCIDGALLTKSSEYKGPITDVFYKENSYTTTIKPVTYKLDGVVCTEIDPKLDNYYKKDNSYFTEQPIDLVPNQPYPNASFDNFKFVCDNIKFADDLNQLTGYKKPASRELKVTFFPDLNGDVVAIDYKHYTPSFKKGAKLLHKPIV
WHVNNATNKATYKPNTWCIRCLWSTKPVETSNSFDVLKSEDAQGMDNLACEDLKPVSEEVVENPTIQKDVLECNVKTTEVVGDIILKPANNSLKITEEVGHTDLMAAYVDNSSLTIKKPNELSRVLGLKTLATHGLAAVNSVPWDTIANYAKPFLNKVVSTTTNIVTRCLNRVCTNYMPYFFTLLLQLCTFTRSTNSRIKASMPTTIAKNTVKSVGKFCLEASFNYLKSPNFSKLINIIIWFLLLSVCLGSLIYSTAALGVLMSNLGMPSYCTGYREGYLNSTNVTIATYCTGSIPCSVCLSGLDSLDTYPSLETIQITISSFKWDLTAFGLVAEWFLAYILFTRFFYVLGLAAIMQLFFSYFAVHFISNSWLMWLIINLVQMAPISAMVRMYIFFASFYYVWKSYVHVVDGCNSSTCMMCYKRNRATRVECTTIVNGVRRSFYVYANGGKGFCKLHNWNCVNCDTFCAGSTFISDEVARDLSLQFKRPINPTDQSSYIVDSVTVKNGSIHLYFDKAGQKTYERHSLSHFVNLDNLRANNTKGSLPINVIVFDGKSKCEESSAKSASVYYSQLMCQPILLLDQALVSDVGDSAEVAVKMFDAYVNTFSSTFNVPMEKLKTLVATAEAELAKNVSLDNVLSTFISAARQGFVDSDVETKDVVECLKLSHQSDIEVTGDSCNNYMLTYNKVENMTPRDLGACIDCSARHINAQVAKSHNIALIWNVKDFMSLSEQLRKQIRSAAKKNNLPFKLTCATTRQVVNVVTTKIALKGGKIVNNWLKQLIKVTLVFLFVAAIFYLITPVHVMSKHTDFSSEIIGYKAIDGGVTRDIASTDTCFANKHADFDTWFSQRGGSYTNDKACPLIAAVITREVGFVVPGLPGTILRTTNGDFLHFLPRVFSAVGNICYTPSKLIEYTDFATSACVLAAECTIFKDASGKPVPYCYDTNVLEGSVAYESLRPDTRYVLMDGSIIQFPNTYLEGSVRVVTTFDSEYCRHGTCERSEAGVCVSTSGRWVLNNDYYRSLPGVFCGVDAVNLLTNMFTPLIQPIGALDISASIVAGGIVAIVVTCLAYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLTNDVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRVVFNGVSFSTFEEAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHLAKALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQSAVKRTIKGTHHWLLLTILTSLLVLVQSTQWSLFFFLYENAFLPFAMGIIAMSAFAMMFVKHKHAFLCLFLLPSLATVAYFNMVYMPASWVMRIMTWLDMVDTSLSGFKLKDCVMYASAVVLLILMTARTVYDDGARRVWTLMNVLTLVYKVYYGNALDQAISMWALIISVTSNYSGVVTTVMFLARGIVFMCVEYCPIFFITGNTLQCIMLVYCFLGYFCTCYFGLFCLLNRYFRLTLGVYDYLVSTQEFRYMNSQGLLPPKNSIDAFKLNIKLLGVGGKPCIKVATVQSKMSDVKCTSVVLLSVLQQLRVESSSKLWAQCVQLHNDILLAKDTTEAFEKMVSLLSVLLSMQGAVDINKLCEEMLDNRATLQAIASEFSSLPSYAAFATAQEAYEQAVANGDSEVVLKKLKKSLNVAKSEF
DRDAAMQRKLEKMADQAMTQMYKQARSEDKRAKVTSAMQTMLFTMLRKLDNDALNNIINNARDGCVPLNIIPLTTAAKLMVVIPDYNTYKNTCDGTTFTYASALWEIQQVVDADSKIVQLSEISMDNSPNLAWPLIVTALRANSAVKLQNNELSPVALRQMSCAAGTTQTACTDDNALAYYNTTKGGRFVLALLSDLQDLKWARFPKSDGTGTIYTELEPPCRFVTDTPKGPKVKYLYFIKGLNNLNRGMVLGSLAATVRLQAGNATEVPANSTVLSFCAFAVDAAKAYKDYLASGGQPITNCVKMLCTHTGTGQAITVTPEANMDQESFGGASCCLYCRCHIDHPNPKGFCDLKGKYVQIPTTCANDPVGFTLKNTVCTVCGMWKGYGCSCDQLREPMLQSADAQSFLNGFAV"
## [14] "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"
## [15] "MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKPSFYVYSRVKNLNSSRVPDLLV"
## [16] "MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLWLLWPVTLACFVLAAVYRINWITGGIAIAMACLVGLMWLSYFIASFRLFARTRSMWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKDLPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSDNIALLVQ"
Sometimes it is necessary to import a specific group…
Let's run a “descriptive” analysis of the sequences using the base R functions str(), table() and summary():
str(actin)
## List of 1
## $ XP_809496.1: 'SeqFastaAA' chr [1:392] "M" "E" "A" "T" ...
## ..- attr(*, "name")= chr "XP_809496.1"
## ..- attr(*, "Annot")= chr ">XP_809496.1 actin 2, putative [Trypanosoma cruzi]"
table(actin)
## XP_809496.1
## A C D E F G H I K L M N P Q R S T V W Y
## 34 8 19 38 17 27 7 17 12 38 10 11 19 16 27 30 16 32 6 8
summary(actin)
## Length Class Mode
## XP_809496.1 392 SeqFastaAA character
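The same three functions can be tried without any FASTA file. The sketch below uses a short, made-up peptide (toy_seq is a hypothetical object, not part of the course data) stored as a character vector with one residue per element:

```r
# Hypothetical toy peptide, one amino acid per element (not taken from the actin file)
toy_seq <- c("M", "E", "A", "T", "L", "W", "D", "E", "E", "P")

str(toy_seq)                  # compact view of the object's structure
toy_counts <- table(toy_seq)  # number of occurrences of each residue
toy_counts
summary(toy_seq)              # length, class and mode of the vector
```

On a plain character vector the output is simpler than for the SeqFastaAA object above, but the reading is the same: str() describes the structure, table() counts residues and summary() reports length, class and mode.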
We can also use some functions specific to the seqinr package to analyze the sequence, such as:
seqinr::getLength() to obtain the length of the sequence
seqinr::getAnnot() to obtain the annotations of the sequence (that is, the sequence metadata)
seqinr::getSequence() to obtain the sequence itself
seqinr::getName() to obtain the unique code/identifier of the sequence
seqinr::getLength(actin)
## [1] 392
seqinr::getAnnot(actin)
## [[1]]
## [1] ">XP_809496.1 actin 2, putative [Trypanosoma cruzi]"
seqinr::getSequence(actin)
## [[1]]
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "A" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "F" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "R"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "L" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "S" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "I" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "V" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "Q" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
seqinr::getName(actin)
## [1] "XP_809496.1"
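The str() output shown earlier indicates that the identifier and the annotation are stored as the attributes "name" and "Annot" of the sequence object. The base R sketch below mimics that layout with a made-up object (toy_prot and its contents are hypothetical); it assumes, without claiming to reproduce seqinr's internals, that the accessor functions essentially read these pieces:

```r
# Hypothetical stand-in for a sequence read by seqinr (made-up residues and name)
toy_prot <- structure(
  c("M", "E", "A", "T"),
  name  = "toy_1",
  Annot = ">toy_1 made-up peptide",
  class = "SeqFastaAA"
)

length(toy_prot)                  # number of residues
attr(toy_prot, "name")            # identifier stored in the "name" attribute
attr(toy_prot, "Annot")           # annotation (header line) stored in "Annot"
as.character(unclass(toy_prot))   # the residues themselves, attributes dropped
```

In practice the seqinr accessors are preferable, since they also work on lists of sequences, as the multi-sequence example shows.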
Now let's try the same functions on the multiple-sequence file multi.actin.txt (for the longer outputs, only the first 5 entries are shown as an example)
seqinr::getLength(multi.actin)
## [1] 392 392 392 392 392 392 391 384 388 388 384 388 376 376 376 376 376 376
## [19] 376 383 376 404 376 376 376 376 376 376 376 376 376 376 376 376 415 376
## [37] 376 376 376 367 376 358 341 321 294 294 285 288 393 416 390 416 390 390
## [55] 416 470 390 390 416 394 416 192 416 417 416 416 416 416 416 416 393 393
## [73] 416 403 399 403 396 401 166 494 401 403 398 401 330 403 403 401 400 400
## [91] 305 398 400 394 394 394 392 394 383 163
seqinr::getAnnot(multi.actin)[1:5]
## [[1]]
## [1] ">XP_809496.1 actin 2, putative [Trypanosoma cruzi]"
##
## [[2]]
## [1] ">EKG04871.1 actin 2, putative [Trypanosoma cruzi]"
##
## [[3]]
## [1] ">ESS65030.1 actin 2 [Trypanosoma cruzi Dm28c]"
##
## [[4]]
## [1] ">XP_806044.1 actin 2, putative [Trypanosoma cruzi]"
##
## [[5]]
## [1] ">RNF12397.1 putative actin 2 [Trypanosoma cruzi]"
seqinr::getSequence(multi.actin)[1:5]
## [[1]]
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "A" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "F" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "R"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "L" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "S" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "I" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "V" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "Q" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
##
## [[2]]
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "A" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "Y" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "R"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "M" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "T" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "V" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "A" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "Q" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
##
## [[3]]
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "A" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "Y" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "R"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "M" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "T" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "V" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "A" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "K" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
##
## [[4]]
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "T" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "S" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "R"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "L" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "S" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "V" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "A" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "Q" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
##
## [[5]]
## [1] "M" "E" "A" "T" "L" "W" "D" "E" "E" "P" "A" "V" "V" "L" "D" "N" "G" "S"
## [19] "G" "N" "I" "K" "C" "G" "F" "A" "G" "E" "E" "I" "P" "R" "C" "V" "F" "P"
## [37] "S" "V" "T" "G" "V" "S" "M" "N" "A" "R" "S" "S" "G" "S" "S" "S" "S" "Q"
## [55] "R" "V" "Y" "V" "G" "D" "E" "A" "L" "Q" "E" "K" "G" "L" "R" "Y" "F" "Y"
## [73] "P" "M" "E" "H" "G" "I" "V" "Y" "D" "W" "D" "Q" "M" "E" "R" "V" "W" "Q"
## [91] "H" "A" "Y" "E" "Q" "L" "R" "V" "P" "P" "E" "R" "Q" "A" "V" "L" "L" "T"
## [109] "E" "A" "P" "M" "N" "P" "I" "S" "N" "R" "E" "K" "M" "A" "E" "T" "L" "F"
## [127] "E" "S" "F" "G" "V" "P" "A" "L" "H" "V" "Q" "I" "Q" "A" "V" "L" "T" "L"
## [145] "Y" "S" "S" "G" "R" "T" "D" "G" "L" "V" "L" "D" "S" "G" "D" "G" "V" "T"
## [163] "H" "L" "V" "P" "V" "F" "E" "G" "Q" "T" "M" "P" "Q" "T" "V" "R" "R" "L"
## [181] "E" "L" "A" "G" "R" "D" "L" "T" "E" "W" "M" "M" "E" "L" "L" "S" "D" "E"
## [199] "L" "D" "R" "P" "F" "T" "T" "S" "A" "D" "R" "E" "V" "A" "R" "R" "V" "K"
## [217] "E" "S" "L" "C" "Y" "I" "P" "L" "F" "F" "E" "E" "E" "L" "Q" "A" "A" "E"
## [235] "E" "D" "G" "I" "N" "E" "D" "A" "K" "G" "K" "E" "P" "F" "T" "L" "P" "D"
## [253] "G" "E" "V" "I" "H" "V" "G" "R" "A" "R" "F" "C" "C" "P" "E" "I" "L" "F"
## [271] "N" "P" "A" "L" "A" "E" "K" "P" "Y" "D" "G" "I" "Q" "H" "A" "V" "I" "N"
## [289] "C" "V" "N" "S" "C" "P" "I" "D" "L" "R" "R" "Q" "L" "L" "G" "S" "I" "V"
## [307] "L" "S" "G" "G" "N" "T" "M" "F" "K" "G" "M" "Q" "K" "R" "L" "Q" "S" "E"
## [325] "L" "A" "A" "L" "A" "N" "K" "R" "A" "A" "E" "D" "V" "R" "V" "V" "A" "A"
## [343] "S" "E" "R" "K" "F" "S" "V" "W" "I" "G" "A" "A" "I" "L" "A" "S" "L" "T"
## [361] "S" "F" "A" "S" "E" "W" "I" "T" "R" "T" "E" "Y" "A" "E" "Q" "G" "A" "A"
## [379] "V" "L" "H" "K" "R" "C" "D" "S" "L" "S" "F" "V" "S" "K"
seqinr::getName(multi.actin)[1:5]
## [1] "XP_809496.1" "EKG04871.1" "ESS65030.1" "XP_806044.1" "RNF12397.1"
We can also analyze each sequence inside the multi.actin object individually, using the $ operator
table(multi.actin$XP_809496.1)
##
## A C D E F G H I K L M N P Q R S T V W Y
## 34 8 19 38 17 27 7 17 12 38 10 11 19 16 27 30 16 32 6 8
round(100*table(multi.actin$XP_809496.1)/seqinr::getLength(multi.actin$XP_809496.1),2)
##
## A C D E F G H I K L M N P Q R S
## 8.67 2.04 4.85 9.69 4.34 6.89 1.79 4.34 3.06 9.69 2.55 2.81 4.85 4.08 6.89 7.65
## T V W Y
## 4.08 8.16 1.53 2.04
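The percentage calculation above divides the counts by the sequence length. As a minimal sketch on a made-up peptide (toy_seq is hypothetical), the same result can be obtained with base R's prop.table(), which may be easier to read:

```r
# Hypothetical toy peptide (10 residues)
toy_seq <- c("M", "E", "A", "T", "L", "W", "D", "E", "E", "P")

counts <- table(toy_seq)

# Manual version: counts divided by the sequence length, as a percentage
pct_manual <- round(100 * counts / length(toy_seq), 2)

# Equivalent version using prop.table()
pct_prop <- round(100 * prop.table(counts), 2)

pct_manual
pct_prop
```

Both objects hold the same percentages, and the values of each add up to 100.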
What if we wanted the relative frequency of every amino acid in each of the sequences, with the results conveniently organized in a table? For that we can use one of the most powerful tools in R (for its simplicity and speed of execution): the apply family of functions. Since we want a table here, we will use sapply(). One of the biggest advantages of the apply family is that it lets us run a custom function on each element of a list, much like a for loop but more compact to write (and usually at least as fast).
freRel <- sapply(X = multi.actin, FUN = function(x) round(100*table(x)/seqinr::getLength(x),2))
freRel[1:5,1:10]
## XP_809496.1 EKG04871.1 ESS65030.1 XP_806044.1 RNF12397.1 KAF5219922.1
## A 8.67 8.93 8.93 8.67 8.93 8.67
## C 2.04 2.04 2.04 2.04 2.04 2.04
## D 4.85 4.85 4.85 4.85 4.85 4.85
## E 9.69 9.69 9.69 9.69 9.69 9.69
## F 4.34 4.08 4.08 4.08 4.08 4.08
## EKF33322.1 XP_029232524.1 XP_029234507.1 ESL09674.1
## A 8.44 8.33 9.79 8.76
## C 2.05 2.34 2.84 2.32
## D 4.60 4.69 4.12 4.64
## E 9.97 10.16 10.05 9.54
## F 3.84 4.17 3.87 4.12
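sapply() only simplifies its result to a matrix when the function returns a vector of the same length for every element. A minimal sketch with made-up sequences (toy_list, aa_levels and rel_freq are hypothetical names) shows one way to guarantee this, by fixing the set of residues as factor levels so that absent residues are counted as 0. This precaution is an addition to the approach above, not part of it:

```r
# Hypothetical reduced alphabet and made-up sequences
aa_levels <- c("A", "C", "D", "E")
toy_list <- list(
  seq_a = c("A", "C", "D", "E", "A", "A", "E", "D"),
  seq_b = c("A", "A", "E", "E"),        # contains no C or D
  seq_c = c("C", "D", "D", "E", "A")
)

# Relative frequency (%) of each residue, always returning all levels
rel_freq <- function(x) {
  counts <- table(factor(x, levels = aa_levels))
  round(100 * counts / length(x), 2)
}

toy_freq <- sapply(X = toy_list, FUN = rel_freq)
toy_freq   # rows = residues, columns = sequences
```

Without the fixed levels, seq_b would return only two values and sapply() would fall back to returning a list instead of a matrix.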
If we want to draw a boxplot() to explore the spread of the data for each amino acid, we need to transpose the data matrix (that is, swap rows and columns). We will use the function t() for that
freRel <- t(freRel)
NOTE: Notice that we are overwriting the object. What R does is:
1. It reserves memory to process the data.
2. It runs the function and checks whether it produces a valid result.
3. If the result is valid, only then does it discard the original object and replace it with the new one. If the call fails, the original object is kept.
CAUTION: this means that when we run a script line by line, the following lines can still run even though the function was not successful, so we may reach wrong results while believing that the whole script ran correctly.
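The behaviour described in the note can be checked directly. In this minimal sketch (toy_obj is a hypothetical object, and the error is provoked on purpose), a failed call leaves the original object untouched, while a successful call replaces it:

```r
toy_obj <- matrix(1:6, nrow = 2)   # 2 x 3 matrix

# A call that fails; tryCatch() captures the error so the example keeps running
failed <- tryCatch(
  {
    toy_obj <- t(toy_obj) + "a"    # error: non-numeric argument, nothing is assigned
    FALSE
  },
  error = function(e) TRUE
)

failed                             # TRUE: the call raised an error
dim_after_error <- dim(toy_obj)    # still 2 x 3, the original object was kept
dim_after_error

toy_obj <- t(toy_obj)              # a call that succeeds replaces the object
dim(toy_obj)                       # now 3 x 2
```

Checking the dimensions (or class) of an object right after overwriting it is a cheap way to confirm that the step really took place.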
Now we can draw the box-and-whisker plot
boxplot(freRel)
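The transposition matters because boxplot() draws one box per column when it receives a matrix. A minimal sketch with made-up percentages (toy_freq is hypothetical, with residues in rows and sequences in columns):

```r
# Hypothetical frequency matrix: rows = residues, columns = sequences
toy_freq <- matrix(
  c(37.5, 12.5, 25, 25,
    50,    0,    0, 50,
    20,   20,   40, 20),
  nrow = 4,
  dimnames = list(c("A", "C", "D", "E"), c("seq_a", "seq_b", "seq_c"))
)

toy_t <- t(toy_freq)   # now rows = sequences, columns = residues
dim(toy_t)
colnames(toy_t)

# One box per residue, each summarising its values across the sequences
boxplot(toy_t, xlab = "Amino acid", ylab = "Relative frequency (%)")
```

Calling boxplot(toy_freq) instead would draw one box per sequence, which answers a different question.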
Although this is a purely demonstrative exercise and does not attempt to
uncover any evolutionary background, at least not at this point, we can
still reach some interesting conclusions: the distribution of
frequencies differs for each amino acid, and the amino acid most
frequently present in the sequences is leucine (L), followed by
glutamic acid (E), glycine (G), serine (S) and valine (V).
QUESTION TO REFLECT ON: We now know the “expected” frequency of each amino acid. What if, in a sequence of 25 aa, the “observed frequency” of some amino acids turns out to be increased? What could we suspect? What biological consequences would this have?
If we want to use the ggplot2 package to draw the boxplot, we first need to make some changes. The data matrix has to be converted into a long-format data frame in which one column holds the amino acid name (as a factor) and another holds the frequency (the numeric value). To do this we first use the stack() function, which here returns an object with three columns (row, col and value), and then convert the result into a data.frame with as.data.frame(). Finally, we use geom_boxplot() from ggplot2 to generate the plot.
dat <- stack(freRel) # convert the matrix into a three-column table (row, col, value)
dat <- as.data.frame(dat) # convert the S4 object returned by stack() into a data.frame
colnames(dat) # check the column names
## [1] "row" "col" "value"
library(ggplot2) # load the ggplot2 package
ggplot(data = dat, aes(x = col, y = value, fill = col)) +
  geom_boxplot() +
  labs(title = "RELATIVE FREQUENCY OF AA",
       subtitle = "Trypanosomatid actin family",
       caption = Sys.Date()) +
  ylab("Relative frequency") +
  xlab("Amino acid")
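If stack() does not accept a matrix in a given session, the same long format can be built with base R alone. This is an alternative sketch, not the route used above; toy_t and toy_long are hypothetical names, and the columns are renamed to match row, col and value:

```r
# Hypothetical transposed frequency matrix: rows = sequences, columns = residues
toy_t <- matrix(
  c(37.5, 50, 20,
    12.5,  0, 20),
  nrow = 3,
  dimnames = list(c("seq_a", "seq_b", "seq_c"), c("A", "C"))
)

# as.table() + as.data.frame() gives one row per (sequence, residue) pair
toy_long <- as.data.frame(as.table(toy_t), stringsAsFactors = TRUE)
names(toy_long) <- c("row", "col", "value")   # same column names as in the text

toy_long
```

The resulting data frame has one factor column for the sequence, one factor column for the amino acid and one numeric column for the frequency, so it can be passed to ggplot() with the same aes() mapping.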
# SEQUENCE ALIGNMENT
align <- msa(unlist(seqinr::getSequence(multi.actin, as.string = TRUE)), type = "protein")
## use default substitution matrix
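In the call above, msa() receives a character vector with one string per sequence, which is what unlist() on the as.string output produces. The base R sketch below illustrates that flattening step on made-up data (toy_list and toy_strings are hypothetical, and this is only an approximation of what the seqinr call returns):

```r
# Hypothetical sequences stored one residue per element
toy_list <- list(
  seq_a = c("M", "E", "A", "T"),
  seq_b = c("M", "E", "V", "T", "L")
)

# Collapse each sequence into a single string; the result is a named character vector
toy_strings <- vapply(toy_list, paste, character(1), collapse = "")

toy_strings
nchar(toy_strings)   # sequence lengths are preserved
```

Keeping the names on the vector is useful, since they help trace each row of an alignment back to its sequence.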
print(align, show = "complete")
##
## MsaAAMultipleAlignment with 100 rows and 557 columns
## aln (1..73)
## [1] -------------------------------------------------------------------------
## [2] -------------------------------------------------------------------------
## [3] -------------------------------------------------------------------------
## [4] ---------------------------------------MACFISPSSCDDWSLVFFCFFWSCRVPPVKTKWP
## [5] -------------------------------------------------------------------------
## [6] -------------------------------------------------------------------------
## [7] -------------------------------------------------------------------------
## [8] -------------------------------------------------------------------------
## [9] -------------------------------------------------------------------------
## ... ...
## [93] -------------------------------------------------------------------------
## [94] -------------------------------------------------------------------------
## [95] -------------------------------------------------------------------------
## [96] -------------------------------------------------------------------------
## [97] -------------------------------------------------------------------------
## [98] -------------------------------------------------------------------------
## [99] -------------------------------------------------------------------------
## [100] -------------------------------------------------------------------------
## Con -------------------------------------------------------------------------
##
## aln (74..146)
## [1] --------------------MAYPVVVIDNGTGYTKMGYAGNEEPTYIIPTAYADNEASRRR-----------
## [2] --------------------MAYPVVVIDNGTGYTKMGYAGNEEPTYIIPTAYADNEASRRR-----------
## [3] --------------------MTHPVVVIDNGTGYTKMGYAGNEEPTFTIPTVYADNEVARRR-----------
## [4] SQHRPHARTPTLRQRRRPGPMSYPVVVIDNGTGYTKLGYAGNEEPTYVIPSLYADNAAAWRR-----------
## [5] --------------------MSYPVVVIDNGTGYTKMGYAGNEEPTYIIPSLYADNATARRR-----------
## [6] --------------------MSYPVVVIDNGTGYTKMGYAGNEEPTYIFPSLYADGAAGRRR-----------
## [7] --------------------MSYPVVVIDNGTGYTKMGYAGNEEPTYIIPSLYADNETVRRR-----------
## [8] --------------------MSYPVVVIDNGTGYTKMGYAGNEEPTYIIPSLYADNETVRRR-----------
## [9] --------------------MSYPVVVIDNGTGYTKMGYAGNEEPTYIIPSLYADNETVRRR-----------
## ... ...
## [93] -------------------MSEKLPVVLDNGSGFLKCGFAGSNFPEVFFRTAVGRPVLRQTKMSEGRSSKRK-
## [94] --------------------MVSSPIVLDNGSGFLKCGYAGANFPEVCFQTAVGRPVLRSTKSGSN-SGKGV-
## [95] ------------------MVQQQPVVVFDMGSNKTRVGFAGEEAPRVISSTVVGVPRQRGLVG----------
## [96] ------------------MVQQQPVVVFDMGSNKTRVGFAGEEAPRVISSTVVGVPRQRGLVG----------
## [97] ------------------MVQQQPVVVFDMGSNKTRVGFAGEEAPRVISSTVVGVPRQRGLVG----------
## [98] ------------------MVQQQPVVVFDMGSNKTRVGFAGEEAPRVISSTVVGVPRQRGLVG----------
## [99] ---------------MQQQQQRQSVAVFDVGSCSTRIGFAGEEAPRVVSPTVVGVPRHRGVLG----------
## [100] -----------------MLHQHHPIAVVDVGSGTTRLGFGGEEAPRVVQPTVVGTPQCQGMLG----------
## Con ------------------???E????V?DNGSG??K?GFAG???PR?VFPS?VG?P?????------------
##
## aln (147..219)
## [1] --SHDVFSDLDFYVGDEALAH---SSSCNLYHPIKHGIVEDWDKMERIWQHCVYKYLRVDPEEHGFILTEPPA
## [2] --SHDVFSDLDFYVGDEALAH---SSSCNLYHPIKHGIVEDWDKMERIWQHCVYKYLRVDPEEHGFILTEPPA
## [3] --SNDIFSDLDFNIGDEAIAR---AGPCNLSHPIRHGIVEDWDKMERMWLHCIYKYLRVDPGEHGFILTEPLA
## [4] --SNDVFEDLDFYIGEEAAAR---AGSCTVSYPIQHGIVKDWDKMERIWQHCIYKYLHVEPEEHGFILTEPPA
## [5] --SNDVFEDLDFYIGEEAAAR---AGSCTVSYPIKHGIVEDWDKMERIWQHCIYKYLHVEPEEHGFILTEPPA
## [6] --SSDVFEDLDFCIGDEAAAC---AGSCNLSYPIKHGIVEDWDKMERIWQHCIYKYLRVEPEEHGFILTEPPA
## [7] --SNDVFDDLDFYIGEEAAAR---ASSCTLSYPIKHGIVEDWDKMERIWQHCIYKYLRVEPEEHGFILTEPPA
## [8] --SNDVFDDLDFYIGEEAAAR---ASSCTLSYPIKHGIVEDWDKMERIWQHCIYKYLRVEPEEHGFILTEPPA
## [9] --SNDVFDDLDFYIGEEAAAR---ASSCTLSYPIKHGIVEDWDKMERIWQHCIYKYLRVEPEEHGFILTEPPA
## ... ...
## [93] -DTQVDPLTKDLVLGDECNGA---HHLLDMTFPIHNGVIQNMDDMRYLWKHAFHNLLSVEPEDHSLLISEAPL
## [94] -STHSDPLLKDLVLGDECTSI---RHLLDMSFPINNGIIKNMDDMCHLWNYTFNDLLHIKPEEHSLLLSEAPL
## [95] ---SLLQHYSDDYAGDAACAQ---EGMLNLSYPVRNRCITSMPEVEHFLQDVFYSRLPLVPSNTMMLWVESVR
## [96] ---SLLQHYSDDYAGDAACAQ---EGMLNLSYPVRNRCITSMPEVEHFLQDVFYSRLPLVPSNTMMLWVESVR
## [97] ---SLMQHYSDDYAGDAACAQ---EGMLTLSYPVRNRRITSMPEVEHFLQDVFYSRLPLVPSNTMMLWVESVR
## [98] ---SLMQHYSDDYAGDAACAQ---EGMLNLSYPVRNRRITSMPEVEHFLQDVFYSRLPLVPSNTMMLWVESVR
## [99] ---SLLQHHSDDYAGDDALER---EGILKLSRPVQDRRVVSFEGLEHILHDALYTWLPVIPSETPLMWVEATG
## [100] ---SLLQHHGDTFAGDAAWER---RGLLTLSYPVQGRRVVSYKGLEHILHDALYAWLPFVPDETPLLWVEPAC
## Con ---??????????VGDEA?A?---???L?L?YPI?HGIV??WD?ME??W?HTFY??LRVNPE?H?VLLTEAP?
##
## aln (220..292)
## [1] NPPENREHTAEVMFETFGVKQLHIAVQGALALRASWTSGKAQQLGLVGENTGVVVDSGDGVTHIVPIVDGFVM
## [2] NPPENREHTAEVMFETFGVKQLHIAVQGALALRASWTSGKAQQLGLVGENTGVVVDSGDGVTHIVPIVDGFVM
## [3] NPPENREHTAEVMFETFGVKQLHIAVQGVLALRASWTSGMAQQLGLAGENTGVVVDSGDGVTHVVPIVDGFVM
## [4] NPPENREYTAEVMFETFGVKQLHIAVQGALALRASWTSGKAQELGVAGKDTGLVIDSGAGVTHIMPIVDGFVL
## [5] NPPENREYAAEVMFETFGVKQLHIAVQGALALRASWTSGKAQELGVSGKDTGLVIDSGAGVTHIMPIVDGFVL
## [6] NPPENREYTAEVMFETFGVKQLHIAVQGTLALRASWTSGKAQELGVAGKDTGLVIDSGAGVTHVIPIVDGFVL
## [7] NPPENREYTAEVMFETFGVKQLHIAVQGALALRASWTSGKAKELGVAGKDTGLVIDSGAGVTHVIPIVDGFVL
## [8] NPPENREYTAEVMFETFGVKQLHIAVQGALALRASWTSGKAKELGVAGKDTGLVIDSGAGVTHVIPIVDGFVL
## [9] NPPENREYTAEVMFETFGVKQLHIAVQGALALRASWTSGKAKELGVAGKDTGLVIDSGAGVTHVIPIVDGFVL
## ... ...
## [93] FSHKDRVKLYEVMFEEFKFPFVQSTPQGVLSLFS------------NGLQTGVAVECGECVSHCTPIFEGYTI
## [94] FSHNDRVKLYEVMFEEYKFPFIQSVPQGVLSLFS------------NGLQTGVALECGECMSHCTPIFEGYAI
## [95] TSREDRERLCEMMFESFGLPQLGLVAASATTVFS------------TGRTTGLVVDSGEGCTNFNAVWEGYNL
## [96] TSREDRERLCEMMFESFGLPQLGLVAASATTVFS------------TGRTTGLVVDSGEGCTNFNAVWEGYNL
## [97] TSREDRERLCEMMFESFGLPQLGLVAASATTVFS------------TGRTTGLVVDSGEGCTNFNAVWEGYNL
## [98] TSREDRERLCEMMFESFGLPQLGLVAASATTVFS------------TGRTTGLVVDSGEGCTNFNAVWEGYNL
## [99] TPRKDRERLCEVLFESFDIPLLGITSAAAATVYS------------TGRTTGLVLDSGEGCTTINAVWEGYIL
## [100] APREDRERMCELLFESFDLPLLAMTSAAAATVYS------------TGRTSGLVLDSGEDCTTVNAVWEGYNL
## Con NP??NREKM?E?MFETFGVPA??V?IQAVLSLYS------------SGRTTG?VLDSGDGVTH?VPI?EGY?L
##
## aln (293..365)
## [1] HNAIQHIPLAGRDITNFVLEWLRERGEPVPA-DDALYLAQHIKEKYCYIARNIAREFETYDSDLPNHITKHHA
## [2] HNAIQHIPLAGRDITNFVLEWLRERGEPVPA-DDALYLAQHIKEKYCYIARNIAREFETYDSDLPNHITKHHA
## [3] HNAMCHIPLAGRDITNFVLEWLRERGEAVPA-DDALYLAQRIKEQHCYIARDIAHEFDKYDNNLPANITKHHD
## [4] NQAIHHIPLAGRDITNFVLERLRERGEPVPP-DDALLLAQRIKEEYCYIARDIASEFDTYDRDLPKYVTKHRD
## [5] NQAIHHIPLAGRDITNFVLERLRERGEPVPP-DDALLLAQRIKEQYCYIARDIAREFDTYDRDLPKYVTKHRD
## [6] NQAIHHIPLAGRDITNFVLERLRERGEPVPP-DDALLLAQRIKEAHCYIARDIAREFDTYDRDLPKYITTHRD
## [7] NQAIQHIPLAGRDITNFILERLRERGEPVPP-DDALLLAQRIKEQYCYIARDIAREFDTYDRNLPDHITKHCD
## [8] NQAIQHIPLAGRDITNFILERLRERGEPVPP-DDALLLAQRIKEQYCYIARDIAREFDTYDRNLPDHITKHCD
## [9] NQAIQHIPLAGRDITNFVLERLRERGEPVPP-DDALLLAQRIKEQYCYIARDIAREFDTYDRNLPDHITKHCD
## ... ...
## [93] PKANRRVDLGGRNITEFLVRLMQRRG-YSFNQSSDFETVRCIKERFCYAAVDPKLEQRLALET------TVLE
## [94] PKANKRVDLGGRHITEFLIRLMQRRG-YNFNQSSDFETVRRIKERFCYAAVDSKLEQRLAFET------TVLE
## [95] QYATHTSDVAGRVLTDRLLAFLRAKG-YPLSTPNDRRIVEDVKHTLCYVAADVQEEVKKMHKK-------LQK
## [96] QYATHTSDVAGRVLTDRLLAFLRAKG-YPLSTPNDRRIVEDVKHTLCYVAADVQEEVKKMHKK-------LQK
## [97] QYATHTSDVAGRVLTDRLLAFLRAKG-YPLSTPNDRRIVEDVKHTLCYVAADVQEEVKKMHKK-------LQK
## [98] QYATHTSDVAGRVLTDRLLAFLRAKG-YPLSTPNDRRVVEDVKHTLCYVAADVQEEVKKMHKK-------LQK
## [99] QHALHVSHGAGRVLTDRLLAFLRGKG-YALSTPRDRDIVESMKRSLCYVAADAAQEVVKLQKK-------KEL
## [100] QHAFHSSPIAGRTLTDRLLEYLRGKG-YTLSTAEDRCLVEKIKRSRCYVAVDAEAEMVDMGRK-------AHL
## Con P?AIRR?DLAGRDLTE?L??LL?E?G-??FTTS???E?VR??KE?LCYVA?D??EE???????------???E
##
## aln (366..438)
## [1] VNRKTGESYTVDVGYEKFLGPEMFFSPDIFSREWTL-----------------PLPDVIDKAIWSCPIDCRRP
## [2] VNRKTGESYTVDVGYEKFLGPEMFFSPDIFSREWTL-----------------PLPDVIDKAIWSCPIDCRRP
## [3] VNRKTGNPYTVDVGYEKFLGPELFFHPEIFSSEWSL-----------------PLPDVIDKAVWSCPIDCRRP
## [4] VNSKTGQPYTVDVGYEKFLGPEVFFHPEIFSNEWTT-----------------PLPEVVDKAVWSCPIDCRRP
## [5] VNSKTGQPYTVDVGYEKFLGPEMFFHPEIFSSEWTT-----------------PLPEVVDKAVWSCPIDCRRP
## [6] VNSKTGQPYTVDVGYEKFLGPELFFHPEIFSSEWTT-----------------PLPEVVDKAVWSCPIDCRRP
## [7] VNSKTGQPYTVDVGYEKFLGPEVFFHPEIFSGEWTM-----------------PLPEVVDRAVWSCPIDCRRP
## [8] VNSKTGQPYAVDVGYEKFLGPEVFFHPEIFSGEWTM-----------------PLPEVVDRAVWSCPIDCRRP
## [9] VNSKTGQPYTVDVGYEKFLGPEVFFHPEIFSGEWTM-----------------PLPEVVDRAVWSCPIDCRRP
## ... ...
## [93] KTFLLPDGSSCSIGQERFEATEALFQPRLIDVECE------------------GISSQLWNCIQATDIDVRSA
## [94] KNFLLPDGSSCSIGQERFEAPEALFQPRLIDMECE------------------GISVQLWSCIQAADIDVRAS
## [95] EYYGLPDEQRIYVEESQFMVPELLFNPSAEGDIGCCGRNNAEVNVDASGVGAGGWTDAIAKVVESAPHFTRPH
## [96] EYYGLPDEQRIYVEESQFMVPELLFNPSAEGDIGCCGRNNAEVNVDASGVGAGGWTDAIAKVVESAPHFTRPH
## [97] EYYGLPDEQRIYVEESQFMVPELLFNPSAEGDIGCCGRNNAEVNVDASGVGAGGWTDAIAKVVESAPHFTRPH
## [98] EYYGLPDEQRIYVEESQFMVPELLFNPSAEGDIGCCGRNNAEVNVDASGVGAGGWTDAIAKVVESAPHFTRPH
## [99] VCYVLPDEQRIYLHESQFMIPELLFTP--SGDETDNDYNNSNINSSRC---GGGWAEAVTQVVESAPAFTQSH
## [100] DSYELPDEQHIYLHESQFMVPEALFAP--PRDEGSDGGASGEV----------GWAEAVTHVVRKAPPFTQSH
## Con E?F?LPDG????VG??RF??PEALF?P?LI??????-----------------G??E?????I??C?IDVRR?
##
## aln (439..511)
## [1] LYRNVVLSGGTTMFPKFDKRLQKDLRALVSRRAKKFTKALGDPSKQITYDVNVVAHERQRYAVWYGGSMLGMS
## [2] LYRNVVLSGGTTMFPKFDKRLQKDLRALVSRRGKKFTKALGDPSKQITYDVNVVAHERQRYAVWYGGSMLGMS
## [3] LYRNVVLSGGTTMFPKFDKRLQKDLRELVHRRAEKFTKAFADPKRQITYDVNVVAHERQRYAVWYGGSMLGIS
## [4] LYRNIVLSGGNTMFPKFDKRLQKDLRAIVDRRAKKNMAAFKDPTRHITYDVNVVSHERQRYAVWYGGSMLGSS
## [5] LYRNIVLSGGNTMFPKFDKRLQKDLRVIVDRRAKKNMAAFRDPTRHITYDVNVVSHERQRYAVWYGGSMLGSS
## [6] LYRNIVLSGGNTMFPKFDKRLQKDLRVIVDRRARKNMAASRDPNCHITYDVNVVSHERQRYAVWYGGSMLGSS
## [7] LYRNIVLSGGTTMFPKFDKRLQKDLRVIVDRRAKKNMEASKDPNRQITYDVNVVSHDRQRYAVWYGGSMLGSS
## [8] LYRNIVLSGGTTMFPKFDKRLQKDLRVIVDRRAKKNMEASKDPNRQITYDVNVVSHDRQRYAVWYGGSMLGSS
## [9] LYRNIVLSGGTTMFPKFDKRLQKDLRVIVDRRAKKNMEASKDPNRQITYDVNVVSHDRQRYAVWYGGSMLGSS
## ... ...
## [93] LYSHVVLSGGSTMFPGFPSRIERDMRAAYSERI-----VKGDPERLSRFPLCVEDPPRRRWMSFLGGAALAAV
## [94] LYAHVVLSGGSTMFPGFPSRIERDMRAFYSERV-----VRGDPERLARFPLCVEDPPRRKWMSFLGGAALASA
## [95] LLKSIVLGGGNTMFPGIEQRLRREVSALPAS---------------AECEANCVAFRDRDLAAWIGGSVVASM
## [96] LLKSIVLGGGNTMFPGIEQRLRREVSALPAS---------------VECEANCVAFRDRDLAAWIGGSVVASM
## [97] LLKSIVLGGGNTMFPGIEQRLRREVSALPAS---------------AECEANCVAFRDRDLAAWIGGSVVASM
## [98] LLKSIVLGGGNTMFPGIEQRLRREVSALPAS---------------AECEANCVAFRDRDLAAWIGGSVVASM
## [99] LYANIVLGGGNTMFPGIEERLQHDVAALNAG---------------SRRSVNCIAFPDRDTAAWIGGSVAASM
## [100] LLENIVLGGGNTLFPGLEQRLQHDVSALNTS---------------GEQEVNCVAFPDREMAAWIGASVVASM
## Con LY?NIVLSGG?TMF??LP?RL?KE???L???--------------???????VVAPP?RKYSVWIGGS?L?SL
##
## aln (512..557)
## [1] PDFAA-VAKTKQEYDEHGPYVCRRNNMFHSVFE-------------
## [2] PDFAA-VAKTKQEYDEHGPYVCRRNNMFHSVFE-------------
## [3] PEFAS-VAKTKQEYEEYGPYICRRNSMYHCVFE-------------
## [4] PEFAT-LAKTRKEYEEYGPYICRQNNMFHSVFE-------------
## [5] PEFAT-LAKTKEEYEEYGPYICRQNNMFHSVFE-------------
## [6] PEFAA-LAKTKAQYEEYGPYICRQNNMFHSVFD-------------
## [7] PEFST-LAKTKEQYEEYGPYICRQNNMFHSVFD-------------
## [8] PEFST-LAKTKEQYEEYGPYICRQNNMFHSVFD-------------
## [9] PEFST-LAKTKEQYEEYGPYICRQNNMFHSVFD-------------
## ... ...
## [93] TAGNHDMWLSKKEWDEGGASAIQARFGV------------------
## [94] TADSTEMWFSKEEWLEGGPSALRARFGA------------------
## [95] PTFPH-MCLSRKDYLEKGATVVHERI--------------------
## [96] PTFPH-MCLSRKDYLEKGATVVHERI--------------------
## [97] PTFPH-MCLSRKDYLEKGATVVHERI--------------------
## [98] PTFPH-MCLSRKDYLEKGATVVHERI--------------------
## [99] PTFLS-TCLARKDYLEKGAALMHAKV--------------------
## [100] PTFSQ-LCLARKDYYEKGVAAMHLRV--------------------
## Con T?F??-MW?TK?EY?E?GPSIVH?????------------------
msaPrettyPrint(align, y = c(164, 213), output = "asis",
               showNames = "none", showLogo = "none", askForOverwrite = FALSE)
## \begin{texshade}{/var/folders/70/4klcrc8j0nn3br1qq16nc_8w0000gn/T//RtmpmdKuXX/seqffc50318c97.fasta}
## \seqtype{P}
## \setends{consensus}{164..213}
## \shadingmode{identical}
## \threshold{50}
## \showconsensus[ColdHot]{bottom}
## \shadingcolors{blues}
## \hidelogoscale
## \hidenames
## \shownumbering{right}
## \showlegend
## \end{texshade}
In this section we will use the SARS-CoV-2 genome.
We will use the query() function from the
seqinr package to retrieve the complete genome sequence
directly from the GenBank server (remember that you first
have to select the database with the
choosebank() function).
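The call that creates the sars2 object is not shown in this section. A minimal sketch of what it might look like follows, assuming the accession number MN908947 (taken from the output printed further down) and the "genbank" database name. The run_query flag is a hypothetical switch added here so that the block does not contact the server unless you turn it on.

```r
# Sketch only: needs the seqinr package and access to the remote server.
accession  <- "MN908947"                 # accession shown in the printed output
query_text <- paste0("AC=", accession)   # select an entry by accession number

run_query <- FALSE  # hypothetical flag; set to TRUE to contact the server

if (run_query && requireNamespace("seqinr", quietly = TRUE)) {
  seqinr::choosebank("genbank")                # open the temporary connection
  sars2 <- seqinr::query("sars2", query_text)  # store the query result
  # Leave the connection open so getSequence() can use it later;
  # call seqinr::closebank() once you are done.
}
```

Keeping the accession in its own variable makes it easy to point the same query at a different entry.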
We can quickly explore the object using some of the basic functions seen earlier
sars2
## 1 SQ for AC=MN908947
summary(sars2)
## Length Class Mode
## call 3 -none- call
## name 1 -none- character
## nelem 1 -none- numeric
## typelist 1 -none- character
## req 1 -none- list
## socket 1 sockconn numeric
seqinr::getName(sars2)
## [1] "MN908947"
seqinr::getKeyword(sars2)
## [[1]]
## [1] "DIVISION VRL" "SOURCE" "5'UTR" "GENE" "CDS"
## [6] "3'UTR" "RELEASE 237"
Since the choosebank() function only gives us a
temporary connection that expires after a period of inactivity,
we will use the getSequence() function to download the
sequence into our workspace so that we can work with it.
covid19 <- seqinr::getSequence(sars2)
Because the sequence is stored inside a list, we must first take it
out of the list with the unlist() function before running the
analyses
covid19 <- unlist(covid19)
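To see what unlist() does here, the following sketch uses a made-up eight-base sequence. The toy_list object is invented for illustration and is not part of the SARS-CoV-2 data. It only mimics the shape of the result: a list whose single element is a character vector with one base per entry.

```r
# Toy stand-in for the downloaded result: a list holding one
# character vector, one base per element.
toy_list <- list(c("a", "t", "g", "c", "g", "a", "t", "a"))

length(toy_list)             # the list itself has a single element
toy_seq <- unlist(toy_list)  # flatten it into a plain vector

is.character(toy_seq)        # a character vector
length(toy_seq)              # one entry per base
table(toy_seq)               # base counts
```

Functions that count bases or words expect this flat vector, not the enclosing list.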
This statistic helps us find di-, tri-, or longer nucleotide words whose frequencies differ from what is expected. But how do we know what is expected? To answer that, recall how to calculate the probability of an event. Suppose your class has 4 boys and 6 girls: what is the probability that a student picked at random is a boy? The calculation is simple:
\[ \begin{aligned} P(x) &= \frac{4}{4+6} \\ &= \frac{4}{10} \\ &= 0.4 \end{aligned} \] This means that the probability of picking a boy at random is 0.4. Now, what if we want the probability of picking a boy first and then a girl? In this case we apply the “multiplication rule”, which holds when the two picks are independent (for example, when the first student returns to the group before the second pick):
\[ \begin{aligned} P(A \cap B) &= P(A) \cdot P(B) \\ &= (0.4)(0.6) \\ &= 0.24 \end{aligned} \]
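The same arithmetic can be checked in R. This is a small sketch using the classroom numbers from the example, with variable names invented for illustration. It also shows the value obtained when the first student is not returned to the group, a case in which the two picks are no longer independent.

```r
n_boys  <- 4
n_girls <- 6
n_total <- n_boys + n_girls

p_boy  <- n_boys / n_total    # probability of picking a boy
p_girl <- n_girls / n_total   # probability of picking a girl

# Independent picks (the first student goes back to the group):
p_boy_then_girl <- p_boy * p_girl
p_boy_then_girl

# If the first student is NOT returned, the second pick depends on the first:
p_without_replacement <- p_boy * (n_girls / (n_total - 1))
p_without_replacement
```

For a long genome the independent version is the one used as the baseline: the expected frequency of a dinucleotide is the product of the frequencies of its two bases.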
This is the frequency with which we expect the event to occur, so we can call it the “expected frequency”. In real data the observed frequency rarely matches it exactly and may be somewhat higher or lower. But how do we decide when a frequency is too low or too high? We will deal with that in the next chapter, when we discuss Fisher's exact test. For now we will use the rho statistic, which is the observed frequency of the two events together divided by the product of their individual frequencies. A value close to 1 therefore means that the “observed frequency” matches the “expected frequency”, while values well below or above 1 point to under- or over-representation.
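As a sketch of how this ratio is formed, the block below computes it by hand for a short made-up sequence using only base R. The function name rho_by_hand and the toy sequence are invented for illustration. This is not the implementation used by seqinr::rho(), which may differ in details such as how words are counted.

```r
# Observed dinucleotide frequency divided by the product of the
# frequencies of its two bases (the expectation under independence).
rho_by_hand <- function(bases) {
  n <- length(bases)
  base_freq  <- table(bases) / n                 # f(x) for each base
  dinuc      <- paste0(bases[-n], bases[-1])     # overlapping pairs
  dinuc_freq <- table(dinuc) / length(dinuc)     # f(xy) for each pair
  first  <- substr(names(dinuc_freq), 1, 1)
  second <- substr(names(dinuc_freq), 2, 2)
  expected <- as.numeric(base_freq[first]) * as.numeric(base_freq[second])
  setNames(as.numeric(dinuc_freq) / expected, names(dinuc_freq))
}

toy <- c("a", "c", "g", "t", "a", "c", "g", "t")  # made-up sequence
rho_by_hand(toy)
```

Only the dinucleotides that actually occur in the toy sequence appear in the result. On a sequence this short the ratios are far from 1 simply because there are so few pairs to count.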
seqinr::rho(covid19)
##
## aa ac ag at ca cc cg ct
## 1.0742060 1.2302052 0.9922942 0.8034304 1.2672998 0.8804024 0.4077025 1.1810577
## ga gc gg gt ta tc tg tt
## 0.9182424 1.0847301 0.9508448 1.0579442 0.8274498 0.8019388 1.3763907 1.0445057
These raw values are a little hard to interpret. But if we subtract 1 from every value and draw them in a bar plot, it becomes much easier to see whether any dinucleotide falls outside what is expected.
library(ggplot2) # load the ggplot2 package to draw the plot
rho1 <- seqinr::rho(covid19) # store the rho result in an object
rho1 <- rho1 - 1 # subtract 1 so that 0 means "as expected"
rho1 <- as.data.frame(rho1) # convert to a data frame with two columns (Var1, Freq)
ggplot(data = rho1, aes(fill = Var1, y = Freq, x = Var1)) +
  geom_bar(position = "stack", stat = "identity") +
  labs(title = "RELATIVE FREQUENCY OF DINUCLEOTIDES",
       subtitle = "SARS-CoV-2 genome",
       x = "Dinucleotide",
       y = "Relative frequency (rho - 1)")
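Alongside the plot, the deviations can also be ranked numerically. The sketch below re-enters by hand the rho values printed earlier in this section, so it runs without the sequence or the server, and sorts them by their distance from 1. The object names are invented for illustration.

```r
# rho values copied from the seqinr::rho(covid19) output shown earlier
rho_values <- c(
  aa = 1.0742060, ac = 1.2302052, ag = 0.9922942, at = 0.8034304,
  ca = 1.2672998, cc = 0.8804024, cg = 0.4077025, ct = 1.1810577,
  ga = 0.9182424, gc = 1.0847301, gg = 0.9508448, gt = 1.0579442,
  ta = 0.8274498, tc = 0.8019388, tg = 1.3763907, tt = 1.0445057
)

deviation <- rho_values - 1   # 0 means "as expected"
ranked <- deviation[order(abs(deviation), decreasing = TRUE)]

head(ranked, 3)               # dinucleotides furthest from expectation
names(which.min(deviation))   # most under-represented
names(which.max(deviation))   # most over-represented
```

With these values, cg is the dinucleotide furthest below expectation and tg the one furthest above it.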