This code creates alignments and a phylogenetic tree to show the evolutionary relationships among homologs of the human OAS3 gene. The OAS3 gene produces an enzyme that plays an important role in inhibiting cellular protein synthesis and in resistance to viral infection.
Key information used to make this script can be found here:
Other resources consulted include:
Other interesting resources and online tools include:
Load necessary packages: Download and load drawProteins from Bioconductor
library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
## is available with R version '4.1'; see https://bioconductor.org/install
library(drawProteins)
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
##
## version
Load other packages:
# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
##
## Attaching package: 'seqinr'
## The following object is masked from 'package:Biostrings':
##
## translate
library(ape)
##
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
## The following object is masked from 'package:Biostrings':
##
## complement
library(pander)
library(ggplot2)
## Biostrings
library(Biostrings)
library(HGNChelper)
TODO: Brief summary of where information was obtained, and if certain kinds of information was not available.
Accession numbers were obtained from RefSeq, RefSeq HomoloGene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The Neanderthal genome database was searched but did not yield sequence information on OAS3.
A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out excluding vertebrates to determine whether the gene occurs outside of vertebrates. The gene does not appear in non-vertebrates, so a second search was conducted that excluded mammals.
Not available:
Does not occur:
oas3_table<-c("NP_006178" ,"Q9Y6K5","4S3N","Homo sapiens" ,"Human" ,"OAS3",
"XP_509393" ,"NA" ,"NA" ,"Pan troglodytes" ,"Chimpanzee","OAS3",
"NP_660261" ,"Q8VI93","NA" ,"Mus musculus" ,"Mouse" ,"OAS3",
"NP_001009493","Q5MYT7","NA" ,"Rattus norvegicus" ,"Rat" ,"OAS3",
"NP_001041556","Q2KKD1","NA" ,"Canis lupus" ,"Dog" ,"OAS3",
"XP_015008356","NA" ,"NA" ,"Macaca mulatta" ,"Monkey" ,"OAS3",
"XP_015008356","Q5J0M4","NA" ,"Equus caballus" ,"Horse" ,"OAS3",
"NP_001075226","NA" ,"NA" ,"Mesocricetus auratus","Hamster" ,"OAS3",
"XP_031506643","NA" ,"NA" ,"Papio anubis" ,"Baboon" ,"OAS3",
"XP_004053976","NA" ,"NA" ,"Gorilla gorilla" ,"Gorilla" ,"OAS3")
Convert vector information into a table
oas3_table_matrix <- matrix(oas3_table,
byrow = T,
nrow = 10)
oas3_table <- data.frame(oas3_table_matrix,
stringsAsFactors = F)
names(oas3_table) <- c("NCBI Protein Accession","Uniprot ID","PDB","Species","Common name" ,"Gene Name")
The finished table
pander::pander(oas3_table)
| NCBI Protein Accession | Uniprot ID | PDB | Species |
|---|---|---|---|
| NP_006178 | Q9Y6K5 | 4S3N | Homo sapiens |
| XP_509393 | NA | NA | Pan troglodytes |
| NP_660261 | Q8VI93 | NA | Mus musculus |
| NP_001009493 | Q5MYT7 | NA | Rattus norvegicus |
| NP_001041556 | Q2KKD1 | NA | Canis lupus |
| XP_015008356 | NA | NA | Macaca mulatta |
| XP_015008356 | Q5J0M4 | NA | Equus caballus |
| NP_001075226 | NA | NA | Mesocricetus auratus |
| XP_031506643 | NA | NA | Papio anubis |
| XP_004053976 | NA | NA | Gorilla gorilla |
| Common name | Gene Name |
|---|---|
| Human | OAS3 |
| Chimpanzee | OAS3 |
| Mouse | OAS3 |
| Rat | OAS3 |
| Dog | OAS3 |
| Monkey | OAS3 |
| Horse | OAS3 |
| Hamster | OAS3 |
| Baboon | OAS3 |
| Gorilla | OAS3 |
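Because the accessions were typed in by hand, it can be worth checking the column for repeats before downloading anything. In the table above, XP_015008356 is listed for both Macaca mulatta and Equus caballus, so one of those two rows probably needs to be re-checked against RefSeq. The sketch below uses base R only; the standalone vector `accessions` is a name introduced here for illustration and simply retypes the first column of the table.

```r
# accessions copied from the first column of the table above
accessions <- c("NP_006178", "XP_509393", "NP_660261", "NP_001009493",
                "NP_001041556", "XP_015008356", "XP_015008356",
                "NP_001075226", "XP_031506643", "XP_004053976")

# every accession that appears more than once
dup_accessions <- unique(accessions[duplicated(accessions)])
dup_accessions

# row numbers of the table that share a repeated accession
dup_rows <- which(accessions %in% dup_accessions)
dup_rows
```

With the real table, the same check could be run on oas3_table$`NCBI Protein Accession`.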
All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.
# download FASTA sequences
oas3s_list <- entrez_fetch_list(db = "protein",
id = oas3_table$`NCBI Protein Accession`,
rettype = "fasta")
Number of FASTA files obtained
length(oas3s_list)
## [1] 10
The first entry
oas3s_list[[1]]
## [1] ">NP_006178.2 2'-5'-oligoadenylate synthase 3 [Homo sapiens]\nMDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALK\nGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLED\nWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWY\nHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAV\nGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGC\nSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFI\nQDHLKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQG\nPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTK\nPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKGPAPASLPPAY\nALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLD\nPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFL\nAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEII\nSEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVY\nVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAW\nEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGH\nNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV\n\n"
Initial data cleaning: remove the FASTA header
for(i in 1:length(oas3s_list)){
oas3s_list[[i]] <- compbio4all::fasta_cleaner(oas3s_list[[i]], parse = F)
}
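For readers who do not have compbio4all installed, the cleaning step can be approximated in base R. The function `strip_fasta_header()` below is a hypothetical stand-in written for this note, not the package's implementation. It assumes a single FASTA record held in one string, in the form returned by the download step above.

```r
# hypothetical helper (not from compbio4all):
# drop the ">" header line, then remove all newline characters
strip_fasta_header <- function(fasta_string) {
  no_header <- sub("^>[^\n]*\n", "", fasta_string)  # remove the first line
  gsub("\n", "", no_header, fixed = TRUE)           # join the sequence lines
}

# toy record: a header followed by two short sequence lines
toy_fasta <- ">toy_record example protein\nMDLYSTPAAA\nLDRFVARRLQ\n\n"
toy_seq <- strip_fasta_header(toy_fasta)
toy_seq
nchar(toy_seq)
```

The result is one unbroken string of residues, which is the form the later alignment steps expect.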
Specific additional cleaning steps will be applied as needed for particular analyses.
First, we use a UniProt accession to download data from UniProt. This produces a list.
Q9Y6K5_json <- drawProteins::get_features("Q9Y6K5")
## [1] "Download has worked"
is(Q9Y6K5_json)
## [1] "list" "vector" "list_OR_List" "vector_OR_Vector"
## [5] "vector_OR_factor"
Then the raw data from the webpage is converted to a dataframe
my_prot_df <- drawProteins::feature_to_dataframe(Q9Y6K5_json)
is(my_prot_df)
## [1] "data.frame" "list" "oldClass" "vector"
## [5] "list_OR_List" "vector_OR_Vector" "vector_OR_factor"
The information available on a protein on UniProt varies a lot depending on how much it has been studied. drawProteins can extract information about the following things:
and others.
If available, it can plot the information. You can get a sense for what’s available by looking at the dataframe produced by drawProteins::feature_to_dataframe()
my_prot_df[,-2]
## type begin end length accession entryName taxid order
## featuresTemp CHAIN 1 1087 1086 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.1 REGION 6 343 337 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.2 REGION 12 57 45 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.3 REGION 186 200 14 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.4 REGION 344 410 66 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.5 REGION 411 742 331 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.6 REGION 750 1084 334 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.7 METAL 816 816 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.8 METAL 818 818 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.9 METAL 888 888 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.10 BINDING 804 804 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.11 BINDING 947 947 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.12 BINDING 950 950 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.13 BINDING 969 969 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.14 SITE 155 155 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.15 SITE 244 244 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.16 MOD_RES 1 1 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.17 MOD_RES 365 365 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.18 VARIANT 18 18 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.19 VARIANT 18 18 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.20 VARIANT 18 18 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.21 VARIANT 65 65 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.22 VARIANT 378 378 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.23 VARIANT 381 381 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.24 VARIANT 869 869 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.25 MUTAGEN 30 30 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.26 MUTAGEN 41 41 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.27 MUTAGEN 76 76 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.28 MUTAGEN 145 145 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.29 MUTAGEN 816 818 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.30 CONFLICT 159 159 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.31 CONFLICT 249 249 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.32 CONFLICT 287 288 1 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.33 CONFLICT 316 316 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.34 CONFLICT 393 393 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.35 CONFLICT 503 504 1 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.36 CONFLICT 984 984 0 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.37 HELIX 2 5 3 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.38 HELIX 8 10 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.39 HELIX 11 18 7 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.40 HELIX 23 41 18 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.41 STRAND 54 60 6 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.42 HELIX 61 65 4 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.43 STRAND 73 81 8 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.44 HELIX 89 92 3 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.45 HELIX 95 108 13 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.46 STRAND 116 119 3 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.47 STRAND 129 139 10 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.48 STRAND 141 149 8 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.49 HELIX 165 172 7 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.50 TURN 177 180 3 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.51 HELIX 181 184 3 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.52 HELIX 185 193 8 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.53 HELIX 197 215 18 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.54 HELIX 226 240 14 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.55 HELIX 248 260 12 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.56 HELIX 262 264 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.57 STRAND 275 277 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.58 HELIX 278 288 10 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.59 STRAND 290 292 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.60 STRAND 294 296 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.61 STRAND 304 306 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.62 HELIX 313 322 9 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.63 HELIX 327 329 2 Q9Y6K5 OAS3_HUMAN 9606 1
## featuresTemp.64 TURN 332 334 2 Q9Y6K5 OAS3_HUMAN 9606 1
From the dataframe it can plot the available information. It uses ggplot2 and so follows some ggplot coding conventions, which can look unfamiliar if you're new to it. Also, it's a little tricky to understand how information in the dataframe gets turned into things on the plot by the different functions.
my_prot_df <- drawProteins::feature_to_dataframe(Q9Y6K5_json)
my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)
my_canvas <- draw_regions(my_canvas, my_prot_df)
#my_canvas <- draw_motif(my_canvas, my_prot_df)
#my_canvas <- draw_phospho(my_canvas, my_prot_df)
#my_canvas <- draw_repeat(my_canvas, my_prot_df)
#my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
#my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas
Prepare data
oas3s_human_FASTA <- rentrez::entrez_fetch(id = "Q9Y6K5",
db = "protein",
rettype="fasta")
oas3s_human_vector <- fasta_cleaner(oas3s_human_FASTA)
# set up 2 x 2 grid, make margins
par(mfrow = c(2,2),
mar = c(0,0,2,1))
# plot 1: Defaults
dotPlot(oas3s_human_vector, oas3s_human_vector,
wsize = 1,
nmatch = 1,
main = "")
# plot 2 size = 10, nmatch = 1
dotPlot(oas3s_human_vector, oas3s_human_vector,
wsize = 10,
nmatch = 1,
main = "")
# plot 3: size = 10, nmatch = 5
dotPlot(oas3s_human_vector, oas3s_human_vector,
wsize = 10,
nmatch = 5,
main = "")
# plot 4: size = 20, nmatch = 5
dotPlot(oas3s_human_vector, oas3s_human_vector,
wsize = 20,
nmatch = 5,
main = "")
# reset par() - run this or other plots will be small!
par(mfrow = c(1,1),
mar = c(4,4,4,4))
Best plot:
# best plot: wsize = 20, nmatch = 5
dotPlot(oas3s_human_vector, oas3s_human_vector,
wsize = 20,
nmatch = 5,
main = "")
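As a rough illustration of what wsize and nmatch control in the plots above: a dot is drawn for a pair of positions when enough residues match inside a window of the chosen size. The function `window_has_dot()` below is a simplified sketch written for this note under that assumption. It is not seqinr's code, and it checks a single pair of window start positions rather than drawing anything.

```r
# hypothetical helper: does the window starting at position i of seq1
# and position j of seq2 contain at least nmatch matching residues?
window_has_dot <- function(seq1, seq2, i, j, wsize, nmatch) {
  w1 <- seq1[i:(i + wsize - 1)]
  w2 <- seq2[j:(j + wsize - 1)]
  sum(w1 == w2) >= nmatch
}

# toy sequence: the first 20 residues of human OAS3, one letter per element
s <- strsplit("MDLYSTPAAALDRFVARRLQ", "")[[1]]

# on the diagonal a sequence always matches itself
on_diagonal <- window_has_dot(s, s, i = 1, j = 1, wsize = 5, nmatch = 5)

# off the diagonal, windows "MDLYS" and "TPAAA" share no positions
off_diagonal <- window_has_dot(s, s, i = 1, j = 6, wsize = 5, nmatch = 5)

c(on_diagonal, off_diagonal)
```

Larger windows with a stricter nmatch suppress isolated chance matches, which is why the wsize = 20, nmatch = 5 plot looks cleaner than the default.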
TODO: Create table
Below are links to relevant information.
The gene is listed in Alphafold (https://alphafold.ebi.ac.uk/entry/Q9Y6K5). The predicted structure contains alpha helices, beta sheets, and disordered regions.
Because this protein is poorly characterized I used IUPred2A to determine if there were any disordered regions (https://iupred2a.elte.hu/). Two peaks exceeded the threshold of 0.5.
Multivariate statistical techniques were used to confirm the information about protein structure and location in the online databases.
UniProt (which uses http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc-2/ I believe) indicates that the protein is a nuclear and cytoplasmic protein.
AlphaFold indicates that there is a mix of alpha helices and beta sheets. I therefore predict that machine-learning methods will indicate an a+b or a/b structure.
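NOTE: If a protein contains a “U” for an unknown amino acid, it should be removed from the sequence because it is otherwise undefined in the table used here. The check further down shows that the human OAS3 sequence contains no “U”.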
First, I need the data from Chou and Zhang (1994) Table 5. Code to build this table is available at https://rpubs.com/lowbrowR/843543
#a vector of amino acid names
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
The table looks like this:
aa.table <- data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b)
pander(aa.table)
| aa.1.1 | alpha | beta | a.plus.b | a.div.b |
|---|---|---|---|---|
| A | 285 | 203 | 175 | 361 |
| R | 53 | 67 | 78 | 146 |
| N | 97 | 139 | 120 | 183 |
| D | 163 | 121 | 111 | 244 |
| C | 22 | 75 | 74 | 63 |
| Q | 67 | 122 | 74 | 114 |
| E | 134 | 86 | 86 | 257 |
| G | 197 | 297 | 171 | 377 |
| H | 111 | 49 | 33 | 107 |
| I | 91 | 120 | 93 | 239 |
| L | 221 | 177 | 110 | 339 |
| K | 249 | 115 | 112 | 321 |
| M | 48 | 16 | 25 | 91 |
| F | 123 | 85 | 52 | 158 |
| P | 82 | 127 | 71 | 188 |
| S | 122 | 341 | 126 | 327 |
| T | 119 | 253 | 117 | 238 |
| W | 33 | 44 | 30 | 72 |
| Y | 63 | 110 | 108 | 130 |
| V | 167 | 229 | 123 | 378 |
Convert to frequencies
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
# make a dataframe
aa.prop <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
#row labels
row.names(aa.prop) <- aa.1.1
Table 5 therefore becomes this
aa.prop
## alpha.prop beta.prop a.plus.b.prop a.div.b
## A 0.116469146 0.073126801 0.09264161 0.08331410
## R 0.021659174 0.024135447 0.04129169 0.03369490
## N 0.039640376 0.050072046 0.06352567 0.04223402
## D 0.066612178 0.043587896 0.05876125 0.05631202
## C 0.008990601 0.027017291 0.03917417 0.01453958
## Q 0.027380466 0.043948127 0.03917417 0.02630972
## E 0.054760932 0.030979827 0.04552673 0.05931225
## G 0.080506743 0.106988473 0.09052409 0.08700669
## H 0.045361667 0.017651297 0.01746956 0.02469421
## I 0.037188394 0.043227666 0.04923240 0.05515809
## L 0.090314671 0.063760807 0.05823187 0.07823679
## K 0.101757254 0.041426513 0.05929063 0.07408262
## M 0.019615856 0.005763689 0.01323452 0.02100162
## F 0.050265631 0.030619597 0.02752779 0.03646434
## P 0.033510421 0.045749280 0.03758602 0.04338795
## S 0.049856968 0.122838617 0.06670196 0.07546734
## T 0.048630977 0.091138329 0.06193753 0.05492730
## W 0.013485901 0.015850144 0.01588142 0.01661666
## Y 0.025745811 0.039625360 0.05717311 0.03000231
## V 0.068246833 0.082492795 0.06511382 0.08723748
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
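Determine the number of each amino acid. Note that the code below tabulates the a.plus.b column of Table 5 rather than counts from my protein; the OAS3 frequencies are calculated further down.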
aa.total <- data.frame(a.plus.b)
row.names(aa.total) <- aa.1.1
colnames(aa.total) <- ("Total of amino acid")
aa.total
## Total of amino acid
## A 175
## R 78
## N 120
## D 111
## C 74
## Q 74
## E 86
## G 171
## H 33
## I 93
## L 110
## K 112
## M 25
## F 52
## P 71
## S 126
## T 117
## W 30
## Y 108
## V 123
pander::pander(aa.total)
| Total of amino acid | |
|---|---|
| A | 175 |
| R | 78 |
| N | 120 |
| D | 111 |
| C | 74 |
| Q | 74 |
| E | 86 |
| G | 171 |
| H | 33 |
| I | 93 |
| L | 110 |
| K | 112 |
| M | 25 |
| F | 52 |
| P | 71 |
| S | 126 |
| T | 117 |
| W | 30 |
| Y | 108 |
| V | 123 |
A function to convert a table into a vector is helpful here because R is goofy about tables not being the same as vectors.
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
oas3s_human_table <- table(oas3s_human_vector)/length(oas3s_human_vector)
OAS3.human.aa.freq <- table_to_vector(oas3s_human_table)
OAS3.human.aa.freq
## A C D E F G H
## 0.08371665 0.02851886 0.04875805 0.04783809 0.04323827 0.07083717 0.01839926
## I K L M N P Q
## 0.03495860 0.05243790 0.11591536 0.01379945 0.03219871 0.06531739 0.06991720
## R S T V W Y
## 0.05887764 0.06531739 0.04047838 0.06439742 0.02391904 0.02115915
Check for the presence of “U” (unknown aa.)
aa.names <- names(OAS3.human.aa.freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## character(0)
OAS3.human.aa.freq[i.U]
## named numeric(0)
Remove the U (it would be better to remove it from the original sequence, but this will work)
# no U's are present
Add data on my focal protein to the amino acid frequency table.
aa.prop$OAS3.human.aa.freq <- OAS3.human.aa.freq
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | OAS3.human.aa.freq | |
|---|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 | 0.08372 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 | 0.02852 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 | 0.04876 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 | 0.04784 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 | 0.04324 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 | 0.07084 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 | 0.0184 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 | 0.03496 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 | 0.05244 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 | 0.1159 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 | 0.0138 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 | 0.0322 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 | 0.06532 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 | 0.06992 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 | 0.05888 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 | 0.06532 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 | 0.04048 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 | 0.0644 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 | 0.02392 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 | 0.02116 |
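One thing worth checking at this step is row order. table() returns its entries sorted alphabetically by one-letter code (A, C, D, ...), as the frequency output above shows, whereas the rows of aa.prop follow the order of aa.1.1 (A, R, N, ...). Assigning the frequencies by position can therefore place a value on a row belonging to a different residue. Indexing by name avoids depending on the order. The toy sketch below shows the idea; `row_order`, `toy_seq` and the other `toy_` objects are hypothetical and exist only for this illustration.

```r
# target row order (first five residues of aa.1.1)
row_order <- c("A", "R", "N", "D", "C")

# toy sequence containing only those five residues
toy_seq <- c("A", "C", "C", "D", "N", "R", "R", "R", "A", "A")

# frequencies from table(): names come back in alphabetical order
toy_freq_table <- table(toy_seq) / length(toy_seq)
toy_freq <- setNames(as.vector(toy_freq_table), names(toy_freq_table))
names(toy_freq)

# reorder by name so each value lines up with its own residue
toy_freq_aligned <- toy_freq[row_order]
toy_freq_aligned
```

Applied to the objects in this script, the equivalent would be OAS3.human.aa.freq[row.names(aa.prop)].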
Two custom functions are needed: one to calculate the correlation between two columns of our table, and one to calculate the cosine similarity.
# Correlation used in Chou and Zhang (1992).
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
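As written, the two functions compute the same quantity: both divide the sum of products by the product of the two vector lengths, so they should return identical values for any pair of columns. A quick check on toy vectors (the values of `x` and `y` are made up for illustration) also shows that this differs from the ordinary Pearson correlation, which centres each vector on its mean first.

```r
# the two functions, restated so this block runs on its own
chou_cor <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}
chou_cosine <- function(z.1, z.2) {
  sum(z.1 * z.2) / (sqrt(sum(z.1^2)) * sqrt(sum(z.2^2)))
}

# toy frequency vectors (made-up values)
x <- c(0.30, 0.30, 0.10, 0.10, 0.20)
y <- c(0.25, 0.15, 0.20, 0.10, 0.30)

# same value from both functions
same_value <- all.equal(chou_cor(x, y), chou_cosine(x, y))
same_value

# Pearson correlation subtracts the means, so it is generally different
pearson <- cor(x, y)
c(chou = chou_cor(x, y), pearson = pearson)
```

This explains why the corr.sim and cosine.sim columns in the summary table are identical.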
Calculate correlation between each column
corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_cor(aa.prop[,5], aa.prop[,4])
Calculate cosine similarity
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])
Calculate distance. Note: we need to flip the dataframe on its side using a command called t()
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
## A R N D C Q E G H I L K
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## OAS3.human.aa.freq 0.08 0.03 0.05 0.05 0.04 0.07 0.02 0.03 0.05 0.12 0.01 0.03
## M F P S T W Y V
## alpha.prop 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## OAS3.human.aa.freq 0.07 0.07 0.06 0.07 0.04 0.06 0.02 0.02
We can get the distance matrix like this:
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## OAS3.human.aa.freq 0.18218375 0.18183104 0.15689863 0.16738924
Individual distances using dist()
dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
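For reference, the Euclidean distance that dist() reports for a pair of rows is the square root of the summed squared differences between them. The sketch below computes it by hand on two made-up vectors and compares the result with dist(); `euclid()` is a hypothetical helper name introduced here.

```r
# hypothetical helper: Euclidean distance between two numeric vectors
euclid <- function(x, y) {
  sqrt(sum((x - y)^2))
}

# toy frequency vectors (made-up values)
x <- c(0.30, 0.30, 0.10, 0.10, 0.20)
y <- c(0.25, 0.15, 0.20, 0.10, 0.30)

# dist() works on the rows of a matrix
by_hand <- euclid(x, y)
by_dist <- as.numeric(dist(rbind(x, y), method = "euclidean"))

c(by_hand = by_hand, by_dist = by_dist)
```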
Compile the information. Rounding makes it easier to read
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
Display output
pander::pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.7427 | 0.7427 | 0.1822 | ||
| beta | 0.7475 | 0.7475 | 0.1818 | ||
| alpha plus beta | 0.7971 | 0.7971 | 0.1569 | most.sim | min.dist |
| alpha/beta | 0.7731 | 0.7731 | 0.1674 |
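The "most.sim" and "min.dist" labels above were typed in by hand. As a sketch of an alternative, they could be derived from the numbers with which.max() and which.min(), so the labels stay correct if the inputs change. The values below are copied from the displayed table; cosine.sim is omitted because it matches corr.sim.

```r
fold.type <- c("alpha", "beta", "alpha plus beta", "alpha/beta")

# values copied from the summary table above
corr.sim       <- c(0.7427, 0.7475, 0.7971, 0.7731)
Euclidean.dist <- c(0.1822, 0.1818, 0.1569, 0.1674)

# label the row with the highest similarity and the row with the smallest distance
sim.sum  <- ifelse(seq_along(corr.sim) == which.max(corr.sim), "most.sim", "")
dist.sum <- ifelse(seq_along(Euclidean.dist) == which.min(Euclidean.dist),
                   "min.dist", "")

data.frame(fold.type, corr.sim, Euclidean.dist, sim.sum, dist.sum)
```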
TBD
# ec <- c(8.6, 2.9, 4.9, 5.1, 3.7, 7.8, 2.1, 4.6, 6.3, 8.8, 2.5, 4.6, 4.9,
# 4, 4.2, 7.3, 6, 6.7, 1.4, 3.6)/100
#
# an <- c(7.6, 2.2, 5.2, 6.2, 4.0, 6.9, 2.1, 5.1, 5.8, 9.4, 2.1, 4.4, 5.4, 4.1,
# 5.0, 7.2, 6.1, 6.7, 1.4, 3.2)/100
#
# df <- data.frame(ec,an)
# ave.vect <- apply(df,1,mean)
#
#
#
# cor.mat <- matrix(NA, 20, nrow = 20, ncol = 20)
#
# for(i in 1:20){
# for(j in 1:20){
# cor.mat[i,j] <- (ec[j]-ave.vect[i])*(ec[i]-ave.vect[j])
# }
# }
#
# t(ec-ave.vect)%*%ginv(cor.mat)%*%(ec-ave.vect)
Convert all FASTA records into entries in a single vector. FASTA entries are contained in a list produced at the beginning of the script. They were cleaned to remove the header and newline characters.
names(oas3s_list)
## [1] "NP_006178" "XP_509393" "NP_660261" "NP_001009493" "NP_001041556"
## [6] "XP_015008356" "XP_015008356" "NP_001075226" "XP_031506643" "XP_004053976"
length(oas3s_list)
## [1] 10
Each entry is a full entry with no spaces or parsing, and no header
oas3s_list[1]
## $NP_006178
## [1] "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKGPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV"
Make each entry of the list into a vector. There are several ways to do this.
oas3s_vector <- rep(NA, length(oas3s_list))
for (i in 1:length(oas3s_list)){
oas3s_vector[i] <- oas3s_list[[i]]
}
Name the vector
names(oas3s_vector) <- names(oas3s_list)
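As one of the other ways mentioned above, a named list in which every element is a single string can be flattened without a loop. This is a sketch on a toy list (`toy_list` is hypothetical); both calls keep the list names as vector names.

```r
# toy list standing in for the cleaned list of sequences
toy_list <- list(seq_a = "MDLYSTPAAA", seq_b = "MDLFHTPAGA")

# option 1: unlist() flattens the list and keeps the names
toy_vector <- unlist(toy_list)

# option 2: vapply() does the same but errors if an element
# is not exactly one character string
toy_vector2 <- vapply(toy_list, function(s) s, character(1))

toy_vector
identical(toy_vector, toy_vector2)
```

The vapply() version is a little stricter, which can help catch an entry that was accidentally parsed into individual letters.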
Do pairwise alignments for humans, chimps and two other species.
align01.02 <- Biostrings::pairwiseAlignment(
oas3s_list[[1]],
oas3s_list[[2]])
align01.05 <- Biostrings::pairwiseAlignment(
oas3s_list[[1]],
oas3s_list[[5]])
align01.06 <- Biostrings::pairwiseAlignment(
oas3s_list[[1]],
oas3s_list[[6]])
align02.05 <- Biostrings::pairwiseAlignment(
oas3s_list[[2]],
oas3s_list[[5]])
align02.06 <- Biostrings::pairwiseAlignment(
oas3s_list[[2]],
oas3s_list[[6]])
align05.06 <- Biostrings::pairwiseAlignment(
oas3s_list[[5]],
oas3s_list[[6]])
Biostrings::pid(align01.02)
## [1] 99.26403
Biostrings::pid(align01.05)
## [1] 78.47286
Biostrings::pid(align01.06)
## [1] 86.03239
Biostrings::pid(align02.05)
## [1] 78.47286
Biostrings::pid(align02.06)
## [1] 85.93117
Biostrings::pid(align05.06)
## [1] 69.39182
Build Matrix
# diagonal is 100 because pid() reports percentages
pids <- c(100, NA, NA, NA,
pid(align01.02), 100, NA, NA,
pid(align01.05), pid(align02.05), 100, NA,
pid(align01.06), pid(align02.06), pid(align05.06), 100)
mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Pan","Canis","Macaca")
colnames(mat) <- c("Homo","Pan","Canis","Macaca")
pander::pander(mat)
| Homo | Pan | Canis | Macaca | |
|---|---|---|---|---|
| Homo | 100 | NA | NA | NA |
| Pan | 99.26 | 100 | NA | NA |
| Canis | 78.47 | 78.47 | 100 | NA |
| Macaca | 86.03 | 85.93 | 69.39 | 100 |
Compare different PID methods. This is shown here for humans vs. chimps.
diff.pids <- c('PID1', round(pid(align01.02, type = "PID1"),2), '(aligned positions + internal gap positions)',
'PID2', round(pid(align01.02, type = "PID2"),2), '(aligned positions)',
'PID3', round(pid(align01.02, type = "PID3"),2), '(length shorter sequence)',
'PID4', round(pid(align01.02, type = "PID4"),2), '(average length of the two sequences)')
mat2 <- matrix(diff.pids, nrow = 4, byrow = T)
colnames(mat2) <- c("Method", "PID", "Denominator")
pander::pander(mat2)
| Method | PID | Denominator |
|---|---|---|
| PID1 | 99.26 | (aligned positions + internal gap positions) |
| PID2 | 99.26 | (aligned positions) |
| PID3 | 99.26 | (length shorter sequence) |
| PID4 | 99.26 | (average length of the two sequences) |
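The four methods give the same value for human vs. chimp because that alignment has essentially no gaps and the two sequences are the same length. They separate once gaps appear. The sketch below applies the four denominators from the table to a toy alignment with one gap. `pid_sketch()` is a hypothetical helper written for this note, not the Biostrings implementation, and it assumes the two strings are already aligned and that every gap is internal.

```r
# hypothetical helper: percent identity under the four denominators,
# for two already-aligned strings that use "-" for gaps
pid_sketch <- function(aln1, aln2) {
  a <- strsplit(aln1, "")[[1]]
  b <- strsplit(aln2, "")[[1]]
  stopifnot(length(a) == length(b))

  gap           <- a == "-" | b == "-"
  identical_pos <- sum(a == b & !gap)   # matching residues
  aligned_pos   <- sum(!gap)            # columns with a residue in both
  internal_gaps <- sum(gap)             # assumption: all gaps are internal
  len1 <- sum(a != "-")                 # ungapped sequence lengths
  len2 <- sum(b != "-")

  100 * identical_pos / c(PID1 = aligned_pos + internal_gaps,
                          PID2 = aligned_pos,
                          PID3 = min(len1, len2),
                          PID4 = mean(c(len1, len2)))
}

toy_pids <- round(pid_sketch("ACDE-GH", "ACDEFGH"), 2)
toy_pids
```

Here six residues match across seven alignment columns, so PID1 is lower than PID2 and PID3, and PID4 falls in between.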
For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.
## Repeating this chunk to make sure the vector is named properly
# clean each FASTA entry (remove header and newlines)
for(i in 1:length(oas3s_list)){
oas3s_list[[i]] <- fasta_cleaner(oas3s_list[[i]], parse = F)
}
# make a vector to hold each sequence
oas3s_vector <- rep(NA, length(oas3s_list))
# name the vector (this makes ggmsa happy)
names(oas3s_vector) <- names(oas3s_list)
# extract the sequences from list and put into vector
for(i in 1:length(oas3s_vector)){
oas3s_vector[i] <- oas3s_list[[i]]
}
oas3s_vector
## NP_006178
## "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKGPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV"
## XP_509393
## "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPCAGCSGLGHPIQLDPNQKTPENSKSLSAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPGLSLQFPEQNVPEALQFQLVSTALKSWMDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYHQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLGKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV"
## NP_660261
## "MDLFHTPAGALDKLVAHNLHPAPEFTAAVRGALGSLNITLQQHRARGSQRPRVIRIAKGGAYARGTALRGGTDVELVIFLDCFQSFGDQKTCHSETLGAMRMLLESWGGHPGPGLTFEFSQSKASRILQFRLASADGEHWIDVSLVPAFDVLGQPRSGVKPTPNVYSSLLSSHCQAGEYSACFTEPRKNFVNTRPAKLKNLILLVKHWYHQVQTRAVRATLPPSYALELLTIFAWEQGCGKDSFSLAQGLRTVLALIQHSKYLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDPADPTWDVGNGTAWRWDVLAQEAESSFSQQCFKQASGVLVQPWEGPGLPRAGILDLGHPIYQGPNQALEDNKGHLAVQSKERSQKPSNSAPGFPEAATKIPAMPNPSANKTRKIRKKAAHPKTVQEAALDSISSHVRITQSTASSHMPPDRSSISTAGSRMSPDLSQIPSKDLDCFIQDHLRPSPQFQQQVKQAIDAILCCLREKSVYKVLRVSKGGSFGRGTDLRGSCDVELVIFYKTLGDFKGQKPHQAEILRDMQAQLRHWCQNPVPGLSLQFIEQKPNALQLQLASTDLSNRVDLSVLPAFDAVGPLKSGTKPQPQVYSSLLSSGCQAGEHAACFAELRRNFINTCPPKLKSLMLLVKHWYRQVVTRYKGGEAAGDAPPPAYALELLTIFAWEQGCGEQKFSLAEGLRTILRLIQQHQSLCIYWTVNYSVQDPAIRAHLLCQLRKARPLVLDPADPTWNVGQGDWKLLAQEAAALGSQVCLQSGDGTLVPPWDVTPALLHQTLAEDLDKFISEFLQPNRHFLTQVKRAVDTICSFLKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIISEIQAHLEACQQMHSFDVKFEVSKRKNPRVLSFTLTSQTLLDQSVDFDVLPAFDALGQLRSGSRPDPRVYTDLIHSCSNAGEFSTCFTELQRDFITSRPTKLKSLIRLVKYWYQQCNKTIKGKGSLPPQHGLELLTVYAWEQGGQNPQFNMAEGFRTVLELIVQYRQLCVYWTINYSAEDKTIGDFLKMQLRKPRPVILDPADPTGNLGHNARWDLLAKEATVYASALCCVDRDGNPIKPWPVKAAV"
## NP_001009493
## "MDLYHTPAGALDKLVAHSLHPAPEFTAAVRRALGSLDNVLRKNGAGGLQRPRVIRIIKGGAHARGTALRGGTDVELVIFLDCLRSFGDQKTCHTEILGAIQALLESWGCNPGPGLTFEFSGPKASGILQFRLASVDQENWIDVSLVPAFDALGQLHSEVKPTPNVYSSLLSSHCQAGEHSACFTELRKNFVNIRPVKLKNLILLVKHWYRQVQTQVVRATLPPSYALELLTIFAWEQGCRKDAFSLAQGLRTVLALIQRNKHLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDPADPTWDLGNGTAWCWDVLAKEAEYSFNQQCFKEASGALVQPWEGPGLPCAGILDLGHPIQQGAKHALEDNNGHLAVQPMKESLQPSNPARGLPETATKISAMPDPTVTETHKSLKKSVHPKTVSETVVNPSSHVWITQSTASSNTPPGHSSMSTAGSQMGPDLSQIPSKELDSFIQDHLRPSSQFQQQVRQAIDTILCCLREKCVDKVLRVSKGGSFGRGTDLRGKCDVELVIFYKTLGDFKGQNSHQTEILCDMQAQLQRWCQNPAPGLSLQFIEQKSNALHLQLVPTNLSNRVDLSVLPAFDAVGPLKSGAKPLPETYSSLLSSGCQAGEHAACFAELRRNFINTRPAKLRSLMLLVKHWYRQVAARFEGGETAGAALPPAYALELLTVFAWEQGCGEQKFSMAEGLRTVLRLVQQHQSLCIYWTVNYSVQDPAIRAHLLRQLRKARPLILDPADPTWNMDQGNWKLLAQEAAALESQVCLQSRDGNLVPPWDVMPALLHQTPAQNLDKFICEFLQPDRHFLTQVKRAVDTICSFLKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIIAEIQAQLEACQQKQRFDVKFEISKRKNPRVLSFTLTSKTLLGQSVDFDVLPAFDALGQLKSGSRPDPRVYTDLIQSYSNAGEFSTCFTELQRDFISSRPTKLKSLIRLVKHWYQQCNKTVKGKGSLPPQHGLELLTVYAWERGSQNPQFNMAEGFRTVLELIGQYRQLCVYWTINYGAEDETIGDFLKMQLQKPRPVILDPADPTGNLGHNARWDLLAKEAAAYTSALCCMDKDGNPIKPWPVKAAV"
## NP_001041556
## "MDVYRTPAAALASLVARRLQPSAEFQRAAWRALGALATTLRERGDRAAAQPWRVLKTAKGGSAGRGTALRGGCDSEIVIFLDCFKSYKDHSVDRAEILKDLWDLLQSWWQKPIPGLNFETLWQDRPGVLQFRLASTDLENWMDVSLVPAFDALGQLCAGAKPAPQVYSTLLHSGCQGGEHAACFAELRRNFVNVRPAKLKSLILLVKHWYRQVCQEEAKREMLPPAYALELLTIFAWEQGCGKDAFSLAQGLRTVLGLIQEYRQLCVFWTLNYGFENPTVRSFLSSQLKKPRPVILDPADPTWDVGNGATWHWDILAREAESCYEHPCFLQTAGDTVQPWEGTGLPRAGCSGLDHPIQRDDAQRTPGNSSSLNAVPPRAGSRQPSWPAPRPPGPDSITPSTLGRAVDLSQIATKDLDRFIQDHLKPNPQFQKQVGKAINVILGCLREKCVYKASRVSKGGSFGRGTDLRGGCDAELVIFLNCFEDYRDQRARRPEILQEMQAQLESWWQDPVPGLSLEFPEQTVPEALQFRLVSTALESWMDVCLVPAFDAVGQLCAGAKPAPQVYSTLLQSGCQGGEHAACFAELRRNFVNVRPAKLKSLILLVKHWYRQVAAQNKGQQPACASLPPVYALELLTIFAWEQGCGEDSFKMAQGLKTVLELVQQHQQLCVYWTVNYSFEDPAIRTHLLGQLQKPRPLILDPGDPTWNVGQGSWELLAQEAAVLETQACLRSTEGTSVQPWDVMPALLYQTPAGDLDKFISDFLQPNRQFLAQVNKAVDTICSFLKENCFQNSAIKVLKVVKGGSLAKGTALRGRSDADLVVFLSCFSQFAEQGNRRAEIISEIRAQLEACQQKMQLEVKFEIPKRENSRVLSFSLKSQTMLDQSVDFDVLPAFNALGQVVSSYRPPSQVYVDLIYSYNNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYRQCNKMPRGRGSLPPQHGLELLTVYAWEQGGQSAQFNMAQGFRTVLELVSQYRQLRVYWTVNYDNEDQTVRDFLSRQLRQPRPIILDPADPTGNLGHNARWDLLATEATACMSALCCTDRDGTPIQPWPVKAAV"
## XP_015008356
## "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAPRVLKIVKGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSLKFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGGEHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQLNPNQKTPENSKSLDAVSPRAGSKAPSCPAPGPAGAASVAPSVPGMALDLSQIPTKELDRFIQDHLKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDVSLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLILLVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVLGMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQASTPPASQSYSGTSSSLALPS"
## XP_015008356
## "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAPRVLKIVKGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSLKFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGGEHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQLNPNQKTPENSKSLDAVSPRAGSKAPSCPAPGPAGAASVAPSVPGMALDLSQIPTKELDRFIQDHLKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDVSLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLILLVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVLGMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQASTPPASQSYSGTSSSLALPS"
## NP_001075226
## "MDVYRTPAAELDGLVARSLQPPAEFVGAARRALGNLSAALRERGGRPGAAAQPWRVLKIGGSSGRGTALRGGCDSELVIFLDCFKSYEDQGAHRAEILNEMRALLESSWQDTVLGLSLEFPEQNTPGVLQLRLASTDLENWMDVSLVPAFDALGQLRTGAKPEPRVYSSLLDSGSRGGEHAACFAELRRNFVNARPTKLKNLILLVKHWYRQVCPQEASRELLPPAYALELLTIFAWERGCGKDAFSLAQGLRTVLGLVQDYRHLCVFWTLNYSFEDPALRQFLRRQLERPRPVILDPADPTWDVGNGAAWRWDLLAKEAESCCDHPCFLQAARGPVQPWEGPDLPRAGCPGLDHRIQQDPAQRTPEDSGVLTGVHPSTRKRQPWSPAPGPSSAASIAPRPPQEVSDLSRIPAPELDRFIQDHLMPSSQFQKQVSKAIDVILRGLRENCVHKPSRASKGGSFGRGTDLRGGCDAELVIFLNCFKDYKDQGARRGQILEEIRAQLESWWQDRVPSLSLKFPEQSAPGALQLQLASAALESRVDVSLLPAFDAIGQLRAGAKPEPGVYSSLLDSGSRGGEHAACFAELRRNFVNTRPTKLKNLILLVKHWYRQVAAQNKGAQRAGASLPPAYALELLTIFAWEQGCGEDRFSMAQGLRTVLGLVQQHRQLCVYWTVNYSFEDPALRTHLLGQLRNPRPLVLDPADPTWNVGQGSWELLAQEAAALGTQPCLMSREGTPVQPWDVMPALLCQTPASDLDKFITEFLQPNRHFLEQVNKAVDTICSFLRDNCFRNSPIKVLKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNRRAEIISEIRAQLEACQQEREFEVKFEISKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVPDSRPRPQVYVDLIHSYSNAGEYSPCFTELQRNFISSRPTKLKSLIRLVKHWYQQCNKMPKGRGSLPPQHGLELLTVYAWEQGGCDCQFSMAEGFRTVLELVRQYRQLCVYWTVNYDNENETVRDFLKLQLQKPRPIILDPADPTGNLGPNARWDLLAKEAVACMSAPCCMGRDGSPIQPWPVKAAV"
## XP_031506643
## "MDLYRTPASELDRFVATRLQPRKEFTETTRRALGALAAALRERRGRPGAAAPRVLKIVKGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSEMRALLESWWQNPVPGLSLKFPQQSVPGALQFRLTSIDLEDWMDVSLVPAFDVLGQAGSRIKPKPQVYSTLLNSGCQGGEHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDPADPTWDVGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQGPSLPRARCSGLGHPIQLNPNQKTPENSKSLDAVSPRAGSKAPSCPAPGPAGAASVAPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPEQNVPEALQFQLVSTAPKRWTDVSLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLILLVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVLGMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVNLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCHKISRGRGSLPPKHGLELLTVYAWEQGGKDPQFNMAEGFRTVLELVTQYRQLCIYWTINYNTEDKTVGDFLKQQLQKPRPIILDPADPTGNLGHSARWDLLAKEAAACMSALCCVGRNGIPIQPWPVKAAV"
## XP_004053976
## "MDLYSTPAAALDRFVARSLQPRTEFVEKARRALGALAAALRERAGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV"
oas3s_vector_ss <- Biostrings::AAStringSet(oas3s_vector)
oas3s_align <- msa(oas3s_vector_ss,
method = "ClustalW")
## use default substitution matrix
msa produces a special MSA object
class(oas3s_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(oas3s_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment" "MsaMetaData"
## [4] "MultipleAlignment"
Default output of MSA
oas3s_align
## CLUSTAL 2.1
##
## Call:
## msa(oas3s_vector_ss, method = "ClustalW")
##
## MsaAAMultipleAlignment with 10 rows and 1144 columns
## aln names
## [1] MDLYSTPAAALDRFVARRLQPRKEF...ACTSALCCMGRNGIPIQPWPVKAAV NP_006178
## [2] MDLYSTPAAALDRFVARSLQPRTEF...ACTSALCCMGRNGIPIQPWPVKAAV XP_004053976
## [3] MDLYSTPAAALDRFVARRLQPRKEF...ACTSALCCMGRNGIPIQPWPVKAAV XP_509393
## [4] MDLYRTPASALDRFVATRLQPRKEF...------------------------- XP_015008356
## [5] MDLYRTPASALDRFVATRLQPRKEF...------------------------- XP_015008356
## [6] MDLYRTPASELDRFVATRLQPRKEF...ACMSALCCVGRNGIPIQPWPVKAAV XP_031506643
## [7] MDVYRTPAAALASLVARRLQPSAEF...ACMSALCCTDRDGTPIQPWPVKAAV NP_001041556
## [8] MDVYRTPAAELDGLVARSLQPPAEF...ACMSAPCCMGRDGSPIQPWPVKAAV NP_001075226
## [9] MDLFHTPAGALDKLVAHNLHPAPEF...VYASALCCVDRDGNPIKPWPVKAAV NP_660261
## [10] MDLYHTPAGALDKLVAHSLHPAPEF...AYTSALCCMDKDGNPIKPWPVKAAV NP_001009493
## Con MDLYRTPAAALDRFVARRLQPRKEF...AC?SALCCMGR?G?PIQPWPVKAAV Consensus
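Rows 4 and 5 of the alignment carry the same accession (XP_015008356) and the same visible sequence, which suggests that one entry was included twice in the input vector. The sketch below shows one way such duplicates could be detected and dropped before aligning. It uses a hypothetical three-element `toy_vector` holding only the first 25 residues shown in the output above, not the full `oas3s_vector`.

```r
# Toy stand-in for the named sequence vector (hypothetical object; the
# sequences are truncated to the 25-residue prefixes shown in the output).
toy_vector <- c(
  NP_006178    = "MDLYSTPAAALDRFVARRLQPRKEF",
  XP_015008356 = "MDLYRTPASALDRFVATRLQPRKEF",
  XP_015008356 = "MDLYRTPASALDRFVATRLQPRKEF"
)

# Flag entries whose name AND sequence both repeat an earlier entry.
is_dup <- duplicated(names(toy_vector)) & duplicated(unname(toy_vector))

# Keep only the first copy of each entry.
toy_unique <- toy_vector[!is_dup]

names(toy_unique)
```

Removing the repeat before calling `msa()` would avoid a zero-distance pair of identical tips later in the tree.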
Change class of alignment
class(oas3s_align) <- "AAMultipleAlignment"
Convert to seqinr format
oas3s_align_seqinr <- msaConvert(oas3s_align, type = "seqinr::alignment")
OPTIONAL: show output with print_msa
compbio4all::print_msa(oas3s_align_seqinr)
## [1] "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAP--RVLKTV 0"
## [1] "MDLYSTPAAALDRFVARSLQPRTEFVEKARRALGALAAALRERAGRLGAAAP--RVLKTV 0"
## [1] "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAP--RVLKTV 0"
## [1] "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAP--RVLKIV 0"
## [1] "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAP--RVLKIV 0"
## [1] "MDLYRTPASELDRFVATRLQPRKEFTETTRRALGALAAALRERRGRPGAAAP--RVLKIV 0"
## [1] "MDVYRTPAAALASLVARRLQPSAEFQRAAWRALGALATTLRERGDR--AAAQPWRVLKTA 0"
## [1] "MDVYRTPAAELDGLVARSLQPPAEFVGAARRALGNLSAALRERGGRPGAAAQPWRVLKIG 0"
## [1] "MDLFHTPAGALDKLVAHNLHPAPEFTAAVRGALGSLNITLQQHRAR-GSQRP--RVIRIA 0"
## [1] "MDLYHTPAGALDKLVAHSLHPAPEFTAAVRRALGSLDNVLRKNGAG-GLQRP--RVIRII 0"
## [1] " "
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSEMRALLESWWQNPVPGLSL 0"
## [1] "KGGSAGRGTALRGGCDSEIVIFLDCFKSYKDHSVDRAEILKDLWDLLQSWWQKPIPGLNF 0"
## [1] "--GSSGRGTALRGGCDSELVIFLDCFKSYEDQGAHRAEILNEMRALLESSWQDTVLGLSL 0"
## [1] "KGGAYARGTALRGGTDVELVIFLDCFQSFGDQKTCHSETLGAMRMLLESWGGHPGPGLTF 0"
## [1] "KGGAHARGTALRGGTDVELVIFLDCLRSFGDQKTCHTEILGAIQALLESWGCNPGPGLTF 0"
## [1] " "
## [1] "TFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGG 0"
## [1] "TFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGG 0"
## [1] "TFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGG 0"
## [1] "KFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGG 0"
## [1] "KFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGG 0"
## [1] "KFPQQSVPGALQFRLTSIDLEDWMDVSLVPAFDVLGQAGSRIKPKPQVYSTLLNSGCQGG 0"
## [1] "ETLWQDRPGVLQFRLASTDLENWMDVSLVPAFDALGQLCAGAKPAPQVYSTLLHSGCQGG 0"
## [1] "EFPEQNTPGVLQLRLASTDLENWMDVSLVPAFDALGQLRTGAKPEPRVYSSLLDSGSRGG 0"
## [1] "EFSQSKASRILQFRLASADGEHWIDVSLVPAFDVLGQPRSGVKPTPNVYSSLLSSHCQAG 0"
## [1] "EFSGPKASGILQFRLASVDQENWIDVSLVPAFDALGQLHSEVKPTPNVYSSLLSSHCQAG 0"
## [1] " "
## [1] "EHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWE 0"
## [1] "EHAACFAELRRNFVNVRPAKLKSLILLVKHWYRQVCQEEAKREMLPPAYALELLTIFAWE 0"
## [1] "EHAACFAELRRNFVNARPTKLKNLILLVKHWYRQVCPQEASRELLPPAYALELLTIFAWE 0"
## [1] "EYSACFTEPRKNFVNTRPAKLKNLILLVKHWYHQVQTR-AVRATLPPSYALELLTIFAWE 0"
## [1] "EHSACFTELRKNFVNIRPVKLKNLILLVKHWYRQVQTQ-VVRATLPPSYALELLTIFAWE 0"
## [1] " "
## [1] "QGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDP 0"
## [1] "QGCGKDAFSLAQGLRTVLGLIQEYRQLCVFWTLNYGFENPTVRSFLSSQLKKPRPVILDP 0"
## [1] "RGCGKDAFSLAQGLRTVLGLVQDYRHLCVFWTLNYSFEDPALRQFLRRQLERPRPVILDP 0"
## [1] "QGCGKDSFSLAQGLRTVLALIQHSKYLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDP 0"
## [1] "QGCRKDAFSLAQGLRTVLALIQRNKHLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDP 0"
## [1] " "
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPCAGCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQ 0"
## [1] "ADPTWDVGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQGPSLPRARCSGLGHPIQ 0"
## [1] "ADPTWDVGNGATWHWDILAREAESCYEHPCFLQTAGDTVQPWEGTGLPRAGCSGLDHPIQ 0"
## [1] "ADPTWDVGNGAAWRWDLLAKEAESCCDHPCFLQAARGPVQPWEGPDLPRAGCPGLDHRIQ 0"
## [1] "ADPTWDVGNGTAWRWDVLAQEAESSFSQQCFKQASGVLVQPWEGPGLPRAGILDLGHPIY 0"
## [1] "ADPTWDLGNGTAWCWDVLAKEAEYSFNQQCFKEASGALVQPWEGPGLPCAGILDLGHPIQ 0"
## [1] " "
## [1] "LDPNQKTPENSKSLNAVYPRAGSKPPSCPAP----------------------------- 0"
## [1] "LDPNQKTPENSKSLNAVYPRAGSKPPSCPAP----------------------------- 0"
## [1] "LDPNQKTPENSKSLSAVYPRAGSKPPSCPAP----------------------------- 0"
## [1] "LNPNQKTPENSKSLDAVSPRAGSKAPSCPAP----------------------------- 0"
## [1] "LNPNQKTPENSKSLDAVSPRAGSKAPSCPAP----------------------------- 0"
## [1] "LNPNQKTPENSKSLDAVSPRAGSKAPSCPAP----------------------------- 0"
## [1] "RDDAQRTPGNSSSLNAVPPRAGSRQPSWPAP----------------------------- 0"
## [1] "QDPAQRTPEDSGVLTGVHPSTRKRQPWSPAP----------------------------- 0"
## [1] "QGPNQALEDNKGHL-AVQSKERSQKPSNSAPGFPEAATKIPAMPNPSANKTRKIRKKAAH 0"
## [1] "QGAKHALEDNNGHL-AVQPMKESLQPSNPARGLPETATKISAMPDPTVTETHKSLKKSVH 0"
## [1] " "
## [1] "--------------------------GPTGAASIVPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPTGAASIVPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPTGAASIVPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPAGAASVAPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPAGAASVAPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPAGAASVAPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------RPPGPDSITPSTLGRAVDLSQIATKDLDRFIQDH 0"
## [1] "--------------------------GPSSAASIAPRPPQEVSDLSRIPAPELDRFIQDH 0"
## [1] "PKTVQEAALDSISSHVRITQSTASSHMPPDRSSISTAGSRMSPDLSQIPSKDLDCFIQDH 0"
## [1] "PKTVSETVVN-PSSHVWITQSTASSNTPPGHSSMSTAGSQMGPDLSQIPSKELDSFIQDH 0"
## [1] " "
## [1] "LKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCF 0"
## [1] "LKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCF 0"
## [1] "LKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCF 0"
## [1] "LKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCF 0"
## [1] "LKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCF 0"
## [1] "LKPSPQFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCF 0"
## [1] "LKPNPQFQKQVGKAINVILGCLREKCVYKASRVSKGGSFGRGTDLRGGCDAELVIFLNCF 0"
## [1] "LMPSSQFQKQVSKAIDVILRGLRENCVHKPSRASKGGSFGRGTDLRGGCDAELVIFLNCF 0"
## [1] "LRPSPQFQQQVKQAIDAILCCLREKSVYKVLRVSKGGSFGRGTDLRGSCDVELVIFYKTL 0"
## [1] "LRPSSQFQQQVRQAIDTILCCLREKCVDKVLRVSKGGSFGRGTDLRGKCDVELVIFYKTL 0"
## [1] " "
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQDQVPGLSLQFPEQNVPEALQFQLVSTALKSWMDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPEQNVPEALQFQLVSTAPKRWTDV 0"
## [1] "EDYRDQRARRPEILQEMQAQLESWWQDPVPGLSLEFPEQTVPEALQFRLVSTALESWMDV 0"
## [1] "KDYKDQGARRGQILEEIRAQLESWWQDRVPSLSLKFPEQSAPGALQLQLASAALESRVDV 0"
## [1] "GDFKGQKPHQAEILRDMQAQLRHWCQNPVPGLSLQFIEQ-KPNALQLQLASTDLSNRVDL 0"
## [1] "GDFKGQNSHQTEILCDMQAQLQRWCQNPAPGLSLQFIEQ-KSNALHLQLVPTNLSNRVDL 0"
## [1] " "
## [1] "SLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLIL 0"
## [1] "SLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLIL 0"
## [1] "SLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLIL 0"
## [1] "SLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLIL 0"
## [1] "SLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLIL 0"
## [1] "SLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLIL 0"
## [1] "CLVPAFDAVGQLCAGAKPAPQVYSTLLQSGCQGGEHAACFAELRRNFVNVRPAKLKSLIL 0"
## [1] "SLLPAFDAIGQLRAGAKPEPGVYSSLLDSGSRGGEHAACFAELRRNFVNTRPTKLKNLIL 0"
## [1] "SVLPAFDAVGPLKSGTKPQPQVYSSLLSSGCQAGEHAACFAELRRNFINTCPPKLKSLML 0"
## [1] "SVLPAFDAVGPLKSGAKPLPETYSSLLSSGCQAGEHAACFAELRRNFINTRPAKLRSLML 0"
## [1] " "
## [1] "LVKHWYRQVAAQNKGKGPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQ 0"
## [1] "LVKHWYHQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKGQQPACASLPPVYALELLTIFAWEQGCGEDSFKMAQGLKTVLELVQ 0"
## [1] "LVKHWYRQVAAQNKGAQRAGASLPPAYALELLTIFAWEQGCGEDRFSMAQGLRTVLGLVQ 0"
## [1] "LVKHWYRQVVTRYKGGEAAGDAPPPAYALELLTIFAWEQGCGEQKFSLAEGLRTILRLIQ 0"
## [1] "LVKHWYRQVAARFEGGETAGAALPPAYALELLTVFAWEQGCGEQKFSMAEGLRTVLRLVQ 0"
## [1] " "
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAAL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAAL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLGKPRPLVLDPADPTWNVGHGSWELLAQEAAAL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHQQLCVYWTVNYSFEDPAIRTHLLGQLQKPRPLILDPGDPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHRQLCVYWTVNYSFEDPALRTHLLGQLRNPRPLVLDPADPTWNVGQGSWELLAQEAAAL 0"
## [1] "QHQSLCIYWTVNYSVQDPAIRAHLLCQLRKARPLVLDPADPTWNVGQGDWKLLAQEAAAL 0"
## [1] "QHQSLCIYWTVNYSVQDPAIRAHLLRQLRKARPLILDPADPTWNMDQGNWKLLAQEAAAL 0"
## [1] " "
## [1] "GMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "ETQACLRSTEGTSVQPWDVMPALLYQTPAGDLDKFISDFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GTQPCLMSREGTPVQPWDVMPALLCQTPASDLDKFITEFLQPNRHFLEQVNKAVDTICSF 0"
## [1] "GSQVCLQSGDGTLVPPWDVTPALLHQTLAEDLDKFISEFLQPNRHFLTQVKRAVDTICSF 0"
## [1] "ESQVCLQSRDGNLVPPWDVMPALLHQTPAQNLDKFICEFLQPDRHFLTQVKRAVDTICSF 0"
## [1] " "
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFQNSAIKVLKVVKGGSLAKGTALRGRSDADLVVFLSCFSQFAEQGNRRAEIISEI 0"
## [1] "LRDNCFRNSPIKVLK---GGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNRRAEIISEI 0"
## [1] "LKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIISEI 0"
## [1] "LKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIIAEI 0"
## [1] " "
## [1] "RAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQAS 0"
## [1] "RAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQAS 0"
## [1] "RAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQQKMQLEVKFEIPKRENSRVLSFSLKSQTMLDQSVDFDVLPAFNALGQVVSSY 0"
## [1] "RAQLEACQQEREFEVKFEISKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVPDS 0"
## [1] "QAHLEACQQMHSFDVKFEVSKRKNPRVLSFTLTSQTLLDQSVDFDVLPAFDALGQLRSGS 0"
## [1] "QAQLEACQQKQRFDVKFEISKRKNPRVLSFTLTSKTLLGQSVDFDVLPAFDALGQLKSGS 0"
## [1] " "
## [1] "RPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGR 0"
## [1] "RPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGR 0"
## [1] "RPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGR 0"
## [1] "TPP------ASQSYSGT----------------SSSLALPS------------------- 0"
## [1] "TPP------ASQSYSGT----------------SSSLALPS------------------- 0"
## [1] "RPSSQVYVNLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCHKISRGR 0"
## [1] "RPPSQVYVDLIYSYNNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYRQCNKMPRGR 0"
## [1] "RPRPQVYVDLIHSYSNAGEYSPCFTELQRNFISSRPTKLKSLIRLVKHWYQQCNKMPKGR 0"
## [1] "RPDPRVYTDLIHSCSNAGEFSTCFTELQRDFITSRPTKLKSLIRLVKYWYQQCNKTIKGK 0"
## [1] "RPDPRVYTDLIQSYSNAGEFSTCFTELQRDFISSRPTKLKSLIRLVKHWYQQCNKTVKGK 0"
## [1] " "
## [1] "GSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTV 0"
## [1] "------------------------------------------------------------ 0"
## [1] "------------------------------------------------------------ 0"
## [1] "GSLPPKHGLELLTVYAWEQGGKDPQFNMAEGFRTVLELVTQYRQLCIYWTINYNTEDKTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGQSAQFNMAQGFRTVLELVSQYRQLRVYWTVNYDNEDQTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGCDCQFSMAEGFRTVLELVRQYRQLCVYWTVNYDNENETV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGQNPQFNMAEGFRTVLELIVQYRQLCVYWTINYSAEDKTI 0"
## [1] "GSLPPQHGLELLTVYAWERGSQNPQFNMAEGFRTVLELIGQYRQLCVYWTINYGAEDETI 0"
## [1] " "
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPV 0"
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPV 0"
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPV 0"
## [1] "------------------------------------------------------------ 0"
## [1] "------------------------------------------------------------ 0"
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHSARWDLLAKEAAACMSALCCVGRNGIPIQPWPV 0"
## [1] "RDFLSRQLRQPRPIILDPADPTGNLGHNARWDLLATEATACMSALCCTDRDGTPIQPWPV 0"
## [1] "RDFLKLQLQKPRPIILDPADPTGNLGPNARWDLLAKEAVACMSAPCCMGRDGSPIQPWPV 0"
## [1] "GDFLKMQLRKPRPVILDPADPTGNLGHNARWDLLAKEATVYASALCCVDRDGNPIKPWPV 0"
## [1] "GDFLKMQLQKPRPVILDPADPTGNLGHNARWDLLAKEAAAYTSALCCMDKDGNPIKPWPV 0"
## [1] " "
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "---- 56"
## [1] "---- 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] " "
Based on the output from drawProteins, the first 50 amino acids appear to contain an interesting helical section.
NOTE: Key step - must have class set properly for ggmsa to work!
# Does not work, despite the class change made in the chunk above
# ggmsa::ggmsa(oas3s_align,
# start = 1,
# end = 50)
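Since the ggmsa call is commented out, the first 50 alignment columns can still be inspected with base R. This is a minimal sketch, not a replacement for ggmsa: it copies three rows from the first block of the print_msa output above into a hypothetical `toy_rows` vector, trims them to columns 1 to 50, and marks the columns where all three rows agree.

```r
# Three aligned rows copied from the first block of the print_msa output
# (rows 1, 7 and 9 of the alignment); toy_rows is a hypothetical name.
toy_rows <- c(
  NP_006178    = "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAP--RVLKTV",
  NP_001041556 = "MDVYRTPAAALASLVARRLQPSAEFQRAAWRALGALATTLRERGDR--AAAQPWRVLKTA",
  NP_660261    = "MDLFHTPAGALDKLVAHNLHPAPEFTAAVRGALGSLNITLQQHRAR-GSQRP--RVIRIA"
)

# Keep alignment columns 1..50 only.
window <- substr(toy_rows, 1, 50)

# One row per sequence, one column per alignment position.
window_mat <- do.call(rbind, strsplit(window, ""))

# TRUE where every sequence has the same character in that column.
conserved <- apply(window_mat, 2, function(col) length(unique(col)) == 1)

# Print the window with a marker line underneath ("*" = fully conserved).
print(unname(window))
print(paste(ifelse(conserved, "*", " "), collapse = ""))
```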
Make a distance matrix
This produces a “dist” class object.
oas3s_subset_dist <- seqinr::dist.alignment(oas3s_align_seqinr,
matrix = "identity")
is(oas3s_subset_dist)
## [1] "dist" "oldClass"
class(oas3s_subset_dist)
## [1] "dist"
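To make the numbers in the distance matrix easier to interpret: the seqinr documentation describes the `"identity"` distance as the square root of one minus the fraction of identical positions. The sketch below reimplements that idea in base R on three short, made-up, gap-free sequences. `identity_distance()` and `toy_aln` are hypothetical names, and how seqinr treats gap columns is not reproduced here.

```r
# Distance between two aligned, equal-length strings:
# sqrt(1 - fraction of identical positions). Gaps are not treated specially.
identity_distance <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  sqrt(1 - mean(a_chars == b_chars))
}

# Made-up toy alignment (not the OAS3 data).
toy_aln <- c(seq1 = "MDLYSTPAAA",
             seq2 = "MDLYRTPASA",
             seq3 = "MDVYRTPAAE")

# Fill a square matrix, then convert it to a "dist" object.
n <- length(toy_aln)
d <- matrix(0, n, n, dimnames = list(names(toy_aln), names(toy_aln)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- identity_distance(toy_aln[[i]], toy_aln[[j]])
  }
}
toy_dist <- as.dist(d)
toy_dist
```

Here seq1 and seq2 differ at 2 of 10 positions, so their distance is sqrt(0.2). Because of the square root, a distance of about 0.29 in the real matrix corresponds to roughly 8% differing positions rather than 29%.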
Round for display
oas3s_align_seqinr_rnd <- round(oas3s_subset_dist, 3)
oas3s_align_seqinr_rnd
## NP_006178 XP_004053976 XP_509393 XP_015008356 XP_015008356
## XP_004053976 0.068
## XP_509393 0.086 0.091
## XP_015008356 0.291 0.293 0.293
## XP_015008356 0.291 0.293 0.293 0.000
## XP_031506643 0.254 0.254 0.256 0.180 0.180
## NP_001041556 0.460 0.461 0.460 0.499 0.499
## NP_001075226 0.452 0.451 0.455 0.491 0.491
## NP_660261 0.541 0.540 0.542 0.575 0.575
## NP_001009493 0.545 0.544 0.544 0.577 0.577
## XP_031506643 NP_001041556 NP_001075226 NP_660261
## XP_004053976
## XP_509393
## XP_015008356
## XP_015008356
## XP_031506643
## NP_001041556 0.473
## NP_001075226 0.460 0.465
## NP_660261 0.542 0.565 0.561
## NP_001009493 0.550 0.566 0.556 0.387
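The printed lower triangle wraps across two blocks, which makes individual pairs awkward to look up. One option is to convert the "dist" object to a full symmetric matrix with `as.matrix()` and index it by name. The sketch below rebuilds a three-sequence corner of the rounded matrix above by hand (the object names are hypothetical) and then looks up one pair and finds the closest pair.

```r
# Three accessions and their rounded distances, copied from the output above.
ids <- c("NP_006178", "XP_004053976", "XP_015008356")
d <- matrix(0, 3, 3, dimnames = list(ids, ids))
d["XP_004053976", "NP_006178"]    <- 0.068
d["XP_015008356", "NP_006178"]    <- 0.291
d["XP_015008356", "XP_004053976"] <- 0.293

toy_dist <- as.dist(d)        # as.dist() reads the lower triangle
full <- as.matrix(toy_dist)   # full symmetric matrix with dimnames

# Look up any pair by name, in either order.
full["NP_006178", "XP_015008356"]

# Find the closest pair, ignoring the diagonal and the mirrored half.
lower <- full
lower[upper.tri(lower, diag = TRUE)] <- NA
hit <- which(lower == min(lower, na.rm = TRUE), arr.ind = TRUE)
closest_pair <- c(rownames(full)[hit[1, "row"]], colnames(full)[hit[1, "col"]])
closest_pair
```

The same pattern should work on `oas3s_align_seqinr_rnd`, assuming the sequence names are unique.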
Build a phylogenetic tree from distance matrix
tree <- nj(oas3s_subset_dist)
Plot the tree
plot.phylo(tree, main = "Phylogenetic Tree\n",
use.edge.length = FALSE)
mtext(text = "OAS3 family gene tree - rooted, no branch lengths")
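A note on the "rooted" label: neighbor joining in ape returns an unrooted tree, and the default phylogram layout only draws it as if it were rooted. The sketch below shows how this can be checked with `ape::is.rooted()` and how a tree could be rooted on an outgroup with `ape::root()`. It uses a made-up four-taxon distance matrix, and the choice of `taxonD` as outgroup is purely illustrative. Which OAS3 sequence would be a sensible outgroup is not decided here.

```r
# Hypothetical toy distances between four made-up taxa (not the OAS3 data).
taxa <- c("taxonA", "taxonB", "taxonC", "taxonD")
toy_mat <- matrix(c(0.00, 0.07, 0.29, 0.54,
                    0.07, 0.00, 0.29, 0.54,
                    0.29, 0.29, 0.00, 0.58,
                    0.54, 0.54, 0.58, 0.00),
                  nrow = 4, dimnames = list(taxa, taxa))
toy_dist <- as.dist(toy_mat)

# Only run the tree steps if the ape package is installed.
if (requireNamespace("ape", quietly = TRUE)) {
  toy_tree <- ape::nj(toy_dist)

  # The NJ result is unrooted.
  unrooted_before <- !ape::is.rooted(toy_tree)

  # Root on an assumed outgroup; resolve.root = TRUE makes the root bifurcating.
  toy_rooted <- ape::root(toy_tree, outgroup = "taxonD", resolve.root = TRUE)
  rooted_after <- ape::is.rooted(toy_rooted)

  ape::plot.phylo(toy_rooted, main = "Toy NJ tree rooted on taxonD",
                  use.edge.length = FALSE)
}
```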