Introduction

This code creates alignments and a phylogenetic tree to show the evolutionary relationships among homologs of the human OAS3 gene. OAS3 encodes an enzyme (2'-5'-oligoadenylate synthase 3) that plays an important role in inhibiting cellular protein synthesis and in resistance to viral infection.

Resources / References

Key information used to make this script can be found here:

Other resources consulted include:

Other interesting resources and online tools include:

Preparation

Load necessary packages: Download and load drawProteins from Bioconductor

library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
##   is available with R version '4.1'; see https://bioconductor.org/install
library(drawProteins)
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
## 
##     version

Load other packages:

# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
## 
## Attaching package: 'seqinr'
## The following object is masked from 'package:Biostrings':
## 
##     translate
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
## The following object is masked from 'package:Biostrings':
## 
##     complement
library(pander)
library(ggplot2)

## Biostrings
library(Biostrings)
library(HGNChelper)

Accession Numbers

TODO: Brief summary of where information was obtained, and if certain kinds of information was not available.

Accession numbers were obtained from RefSeq, HomoloGene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The Neanderthal genome database was searched but did not yield sequence information on OAS3.

A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out with vertebrates excluded to determine whether the gene occurs outside of vertebrates. The gene does not appear in non-vertebrates, so a second search was conducted with mammals excluded.

Accession Number Table

Not available:

  • Neanderthal

Does not occur:

  • Outside of vertebrates
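# NOTE: XP_015008356 is listed for both Macaca mulatta and Equus caballus;
# one of the two is probably a copy-paste error and should be re-checked.
oas3_table<-c("NP_006178"   ,"Q9Y6K5","4S3N","Homo sapiens"        ,"Human"     ,"OAS3",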
              "XP_509393"   ,"NA"    ,"NA"  ,"Pan troglodytes"     ,"Chimpanzee","OAS3",
              "NP_660261"   ,"Q8VI93","NA"  ,"Mus musculus"        ,"Mouse"     ,"OAS3",
              "NP_001009493","Q5MYT7","NA"  ,"Rattus norvegicus"   ,"Rat"       ,"OAS3",
              "NP_001041556","Q2KKD1","NA"  ,"Canis lupus"         ,"Dog"       ,"OAS3",
              "XP_015008356","NA"    ,"NA"  ,"Macaca mulatta"      ,"Monkey"    ,"OAS3",
              "XP_015008356","Q5J0M4","NA"  ,"Equus caballus"      ,"Horse"     ,"OAS3",
              "NP_001075226","NA"    ,"NA"  ,"Mesocricetus auratus","Hamster"   ,"OAS3",
              "XP_031506643","NA"    ,"NA"  ,"Papio anubis"        ,"Baboon"    ,"OAS3",
              "XP_004053976","NA"    ,"NA"  ,"Gorilla gorilla"     ,"Gorilla"   ,"OAS3")

Convert vector information into a table

oas3_table_matrix <- matrix(oas3_table,
                            byrow = TRUE,
                            nrow = 10)
oas3_table <- data.frame(oas3_table_matrix,
                         stringsAsFactors = FALSE)
names(oas3_table) <- c("NCBI Protein Accession","Uniprot ID","PDB","Species","Common name" ,"Gene Name")
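
Each species should have its own accession, so a quick duplicate check on the accession column can catch copy-paste slips. The following is a minimal base R sketch that uses made-up accessions rather than the real table; with the real data the same call could be applied to oas3_table$`NCBI Protein Accession`.

```r
# Toy accession vector (hypothetical values, not the real table)
accessions <- c("NP_000001", "XP_000002", "XP_000002", "NP_000003")

# TRUE for every entry that repeats an earlier one
is_dup <- duplicated(accessions)

# Report the repeated accessions, if any
dup_values <- unique(accessions[is_dup])
dup_values
```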

The finished table

pander::pander(oas3_table)
NCBI Protein Accession  Uniprot ID  PDB   Species               Common name  Gene Name
NP_006178               Q9Y6K5      4S3N  Homo sapiens          Human        OAS3
XP_509393               NA          NA    Pan troglodytes       Chimpanzee   OAS3
NP_660261               Q8VI93      NA    Mus musculus          Mouse        OAS3
NP_001009493            Q5MYT7      NA    Rattus norvegicus     Rat          OAS3
NP_001041556            Q2KKD1      NA    Canis lupus           Dog          OAS3
XP_015008356            NA          NA    Macaca mulatta        Monkey       OAS3
XP_015008356            Q5J0M4      NA    Equus caballus        Horse        OAS3
NP_001075226            NA          NA    Mesocricetus auratus  Hamster      OAS3
XP_031506643            NA          NA    Papio anubis          Baboon       OAS3
XP_004053976            NA          NA    Gorilla gorilla       Gorilla      OAS3

Data Preparation

Download Sequences

All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.

# download FASTA sequences
oas3s_list <-  entrez_fetch_list(db = "protein", 
                          id = oas3_table$`NCBI Protein Accession`, 
                          rettype = "fasta")

Number of FASTA files obtained

length(oas3s_list)
## [1] 10

The first entry

oas3s_list[[1]]
## [1] ">NP_006178.2 2'-5'-oligoadenylate synthase 3 [Homo sapiens]\nMDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALK\nGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLED\nWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWY\nHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAV\nGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGC\nSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFI\nQDHLKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQG\nPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTK\nPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKGPAPASLPPAY\nALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLD\nPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFL\nAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEII\nSEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVY\nVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAW\nEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGH\nNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV\n\n"

Initial data cleaning

Remove FASTA header

for(i in seq_along(oas3s_list)){
  oas3s_list[[i]] <- compbio4all::fasta_cleaner(oas3s_list[[i]], parse = FALSE)
}

Specific additional cleaning steps will be applied as needed for particular analyses.
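
To make the cleaning step concrete, here is a rough base R sketch of what removing a FASTA header and the newline characters involves. The helper strip_fasta() is hypothetical and is not the compbio4all implementation; it only assumes that the record starts with a single ">" header line.

```r
# Hypothetical helper, not the compbio4all implementation
strip_fasta <- function(fasta_string) {
  no_header <- sub("^>[^\n]*\n", "", fasta_string)  # drop the ">" header line
  gsub("\\s", "", no_header)                         # drop newlines and spaces
}

# Toy record (made-up header and sequence)
toy_fasta <- ">toy_id example protein [Toy species]\nMDLYST\nPAAALD\n\n"
toy_clean <- strip_fasta(toy_fasta)
toy_clean
```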

General Protein Information

Protein Diagram

First, we use a UniProt accession to download data from UniProt. This produces a list.

Q9Y6K5_json  <- drawProteins::get_features("Q9Y6K5")
## [1] "Download has worked"
is(Q9Y6K5_json)
## [1] "list"             "vector"           "list_OR_List"     "vector_OR_Vector"
## [5] "vector_OR_factor"

Then the raw data from the webpage is converted to a dataframe

my_prot_df <- drawProteins::feature_to_dataframe(Q9Y6K5_json)
is(my_prot_df)
## [1] "data.frame"       "list"             "oldClass"         "vector"          
## [5] "list_OR_List"     "vector_OR_Vector" "vector_OR_factor"

The information available on a protein on UniProt varies a lot depending on how much it has been studied. drawProteins can extract information about the following things:

  1. Domains
  2. Chains
  3. Regions
  4. Motifs
  5. Phosphorylated sites
  6. Repeats

and others.

If available, it can plot the information. You can get a sense for what’s available by looking at the dataframe produced by drawProteins::feature_to_dataframe():

my_prot_df[,-2]
##                     type begin  end length accession  entryName taxid order
## featuresTemp       CHAIN     1 1087   1086    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.1    REGION     6  343    337    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.2    REGION    12   57     45    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.3    REGION   186  200     14    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.4    REGION   344  410     66    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.5    REGION   411  742    331    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.6    REGION   750 1084    334    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.7     METAL   816  816      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.8     METAL   818  818      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.9     METAL   888  888      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.10  BINDING   804  804      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.11  BINDING   947  947      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.12  BINDING   950  950      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.13  BINDING   969  969      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.14     SITE   155  155      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.15     SITE   244  244      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.16  MOD_RES     1    1      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.17  MOD_RES   365  365      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.18  VARIANT    18   18      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.19  VARIANT    18   18      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.20  VARIANT    18   18      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.21  VARIANT    65   65      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.22  VARIANT   378  378      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.23  VARIANT   381  381      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.24  VARIANT   869  869      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.25  MUTAGEN    30   30      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.26  MUTAGEN    41   41      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.27  MUTAGEN    76   76      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.28  MUTAGEN   145  145      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.29  MUTAGEN   816  818      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.30 CONFLICT   159  159      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.31 CONFLICT   249  249      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.32 CONFLICT   287  288      1    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.33 CONFLICT   316  316      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.34 CONFLICT   393  393      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.35 CONFLICT   503  504      1    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.36 CONFLICT   984  984      0    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.37    HELIX     2    5      3    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.38    HELIX     8   10      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.39    HELIX    11   18      7    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.40    HELIX    23   41     18    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.41   STRAND    54   60      6    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.42    HELIX    61   65      4    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.43   STRAND    73   81      8    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.44    HELIX    89   92      3    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.45    HELIX    95  108     13    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.46   STRAND   116  119      3    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.47   STRAND   129  139     10    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.48   STRAND   141  149      8    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.49    HELIX   165  172      7    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.50     TURN   177  180      3    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.51    HELIX   181  184      3    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.52    HELIX   185  193      8    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.53    HELIX   197  215     18    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.54    HELIX   226  240     14    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.55    HELIX   248  260     12    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.56    HELIX   262  264      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.57   STRAND   275  277      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.58    HELIX   278  288     10    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.59   STRAND   290  292      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.60   STRAND   294  296      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.61   STRAND   304  306      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.62    HELIX   313  322      9    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.63    HELIX   327  329      2    Q9Y6K5 OAS3_HUMAN  9606     1
## featuresTemp.64     TURN   332  334      2    Q9Y6K5 OAS3_HUMAN  9606     1
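
A long feature listing like this is easier to scan once it is summarized by type. The sketch below uses a small stand-in data frame with made-up rows; with the real object, the same idea would be table(my_prot_df$type). The assumption that the chain and region drawing functions rely on the CHAIN and REGION rows is mine and worth checking against the drawProteins documentation.

```r
# Toy stand-in for the data frame from feature_to_dataframe()
toy_features <- data.frame(
  type  = c("CHAIN", "REGION", "REGION", "HELIX", "STRAND", "HELIX"),
  begin = c(1, 6, 12, 2, 54, 61),
  end   = c(1087, 343, 57, 5, 60, 65)
)

# Count how many features of each type are present
type_counts <- table(toy_features$type)
type_counts

# Keep only the rows for the feature types plotted in this script
plotted <- subset(toy_features, type %in% c("CHAIN", "REGION"))
plotted
```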

From the dataframe, drawProteins can plot the available information. It uses ggplot2 and so follows some ggplot coding conventions, which can look unfamiliar if you’re new to it. It is also a little tricky to understand how information in the dataframe gets turned into elements of the plot by the different functions.

my_prot_df <- drawProteins::feature_to_dataframe(Q9Y6K5_json)

my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)

my_canvas <- draw_regions(my_canvas, my_prot_df)
#my_canvas <- draw_motif(my_canvas, my_prot_df)
#my_canvas <- draw_phospho(my_canvas, my_prot_df)
#my_canvas <- draw_repeat(my_canvas, my_prot_df)
#my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
#my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas

Draw dotplot

Prepare data

oas3s_human_FASTA <- rentrez::entrez_fetch(id = "Q9Y6K5",
                                      db = "protein", 
                                      rettype="fasta")
oas3s_human_vector <- fasta_cleaner(oas3s_human_FASTA)


# set up 2 x 2 grid, make margins
par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# plot 1: Defaults
dotPlot(oas3s_human_vector, oas3s_human_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "")

# plot 2 size = 10, nmatch = 1
dotPlot(oas3s_human_vector, oas3s_human_vector, 
        wsize = 10, 
        nmatch = 1, 
        main = "")

# plot 3: size = 10, nmatch = 5
dotPlot(oas3s_human_vector, oas3s_human_vector, 
        wsize = 10, 
        nmatch = 5,  
        main = "")

# plot 4: size = 20, nmatch = 5
dotPlot(oas3s_human_vector, oas3s_human_vector, 
        wsize = 20,
        nmatch = 5,
        main = "")

# reset par() - run this or other plots will be small!
par(mfrow = c(1,1),
    mar = c(4,4,4,4))

Best plot:

# best plot: wsize = 20, nmatch = 5
dotPlot(oas3s_human_vector, oas3s_human_vector, 
        wsize = 20,
        nmatch = 5,
        main = "")
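
The effect of wsize and nmatch is easier to see with a stripped-down version of the windowed comparison. The helper dot_matrix() below is a hypothetical base R sketch, not the seqinr implementation: it marks a dot wherever two windows of length wsize share at least nmatch identical positions.

```r
# Hypothetical helper: TRUE where the windows starting at i (seq1) and
# j (seq2) share at least nmatch identical positions
dot_matrix <- function(seq1, seq2, wsize = 1, nmatch = 1) {
  n1 <- length(seq1) - wsize + 1
  n2 <- length(seq2) - wsize + 1
  dots <- matrix(FALSE, nrow = n1, ncol = n2)
  for (i in seq_len(n1)) {
    for (j in seq_len(n2)) {
      win1 <- seq1[i:(i + wsize - 1)]
      win2 <- seq2[j:(j + wsize - 1)]
      dots[i, j] <- sum(win1 == win2) >= nmatch
    }
  }
  dots
}

# Toy sequence with one internal repeat (made up)
toy_seq <- strsplit("MDLYMDLY", "")[[1]]
loose  <- dot_matrix(toy_seq, toy_seq, wsize = 1, nmatch = 1)
strict <- dot_matrix(toy_seq, toy_seq, wsize = 4, nmatch = 4)

# Number of dots under each setting
sum(loose)
sum(strict)
```

A larger window with a demanding nmatch removes isolated chance matches, which is why the noisier panels clear up as the settings increase.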

Protein properties compiled from databases

TODO: Create table

Below are links to relevant information.

  1. Pfam: http://pfam.xfam.org/protein/Q9Y6K5
  2. DisProt: no information available for human protein
  3. RepeatDB: no information available
  4. UniProt sub-cellular locations: Nucleus and cytoplasm
  5. PDB secondary structural location: not compiled (the accession table above lists PDB entry 4S3N for the human protein)

The protein is listed in AlphaFold (https://alphafold.ebi.ac.uk/entry/Q9Y6K5). The predicted structure contains alpha helices, beta sheets, and disordered regions.

Because this protein is poorly characterized I used IUPred2A to determine if there were any disordered regions (https://iupred2a.elte.hu/). Two peaks exceeded the threshold of 0.5.

Protein feature prediction

Multivariate statistical techniques were used to confirm the information about protein structure and location in the online databases.

UniProt (which, I believe, uses http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc-2/) indicates that the protein is located in the nucleus and cytoplasm.

Predict protein fold

AlphaFold indicates that there is a mix of alpha helices and beta sheets. I therefore predict that the methods below will indicate an a+b or a/b structure.

NOTE: Some proteins contain a “U” (selenocysteine), which is not among the 20 amino acids in the reference table and has to be removed before comparison. The check further down shows that OAS3 contains none.

First, I need the data from Chou and Zhang (1994) Table 5. Code to build this table is available at https://rpubs.com/lowbrowR/843543

#a vector of amino acid names
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)

The table looks like this:

aa.table <- data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b)
pander(aa.table)
aa.1.1 alpha beta a.plus.b a.div.b
A 285 203 175 361
R 53 67 78 146
N 97 139 120 183
D 163 121 111 244
C 22 75 74 63
Q 67 122 74 114
E 134 86 86 257
G 197 297 171 377
H 111 49 33 107
I 91 120 93 239
L 221 177 110 339
K 249 115 112 321
M 48 16 25 91
F 123 85 52 158
P 82 127 71 188
S 122 341 126 327
T 119 253 117 238
W 33 44 30 72
Y 63 110 108 130
V 167 229 123 378

Convert to frequencies

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
# make a dataframe
aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
#row labels
row.names(aa.prop) <- aa.1.1

Table 5 therefore becomes this

aa.prop
##    alpha.prop   beta.prop a.plus.b.prop    a.div.b
## A 0.116469146 0.073126801    0.09264161 0.08331410
## R 0.021659174 0.024135447    0.04129169 0.03369490
## N 0.039640376 0.050072046    0.06352567 0.04223402
## D 0.066612178 0.043587896    0.05876125 0.05631202
## C 0.008990601 0.027017291    0.03917417 0.01453958
## Q 0.027380466 0.043948127    0.03917417 0.02630972
## E 0.054760932 0.030979827    0.04552673 0.05931225
## G 0.080506743 0.106988473    0.09052409 0.08700669
## H 0.045361667 0.017651297    0.01746956 0.02469421
## I 0.037188394 0.043227666    0.04923240 0.05515809
## L 0.090314671 0.063760807    0.05823187 0.07823679
## K 0.101757254 0.041426513    0.05929063 0.07408262
## M 0.019615856 0.005763689    0.01323452 0.02100162
## F 0.050265631 0.030619597    0.02752779 0.03646434
## P 0.033510421 0.045749280    0.03758602 0.04338795
## S 0.049856968 0.122838617    0.06670196 0.07546734
## T 0.048630977 0.091138329    0.06193753 0.05492730
## W 0.013485901 0.015850144    0.01588142 0.01661666
## Y 0.025745811 0.039625360    0.05717311 0.03000231
## V 0.068246833 0.082492795    0.06511382 0.08723748
pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724

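Tabulate amino acid totals. The code below displays the a.plus.b reference counts from Table 5; the composition of OAS3 itself is computed further down with table().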

aa.total <- data.frame(a.plus.b)
row.names(aa.total) <- aa.1.1
colnames(aa.total) <- ("Total of amino acid")
aa.total
##   Total of amino acid
## A                 175
## R                  78
## N                 120
## D                 111
## C                  74
## Q                  74
## E                  86
## G                 171
## H                  33
## I                  93
## L                 110
## K                 112
## M                  25
## F                  52
## P                  71
## S                 126
## T                 117
## W                  30
## Y                 108
## V                 123
pander::pander(aa.total)
  Total of amino acid
A 175
R 78
N 120
D 111
C 74
Q 74
E 86
G 171
H 33
I 93
L 110
K 112
M 25
F 52
P 71
S 126
T 117
W 30
Y 108
V 123

A function to convert a table into a vector is helpful here, because table() returns an object of class table rather than a plain named vector.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}
oas3s_human_table <- table(oas3s_human_vector)/length(oas3s_human_vector)
OAS3.human.aa.freq <- table_to_vector(oas3s_human_table)
OAS3.human.aa.freq
##          A          C          D          E          F          G          H 
## 0.08371665 0.02851886 0.04875805 0.04783809 0.04323827 0.07083717 0.01839926 
##          I          K          L          M          N          P          Q 
## 0.03495860 0.05243790 0.11591536 0.01379945 0.03219871 0.06531739 0.06991720 
##          R          S          T          V          W          Y 
## 0.05887764 0.06531739 0.04047838 0.06439742 0.02391904 0.02115915
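
Note that table() reports residues in alphabetical order (A, C, D, ...), which differs from the A, R, N, D, ... order of the reference table. One way to keep a fixed order, sketched here on a short made-up sequence, is to count through a factor whose levels are the reference order; residues absent from the sequence then show up as zeros instead of disappearing.

```r
# Reference order used for the rows of the frequency table
aa_order <- c("A","R","N","D","C","Q","E","G","H","I",
              "L","K","M","F","P","S","T","W","Y","V")

# Toy sequence (hypothetical); it lacks most residues on purpose
toy_seq <- c("M", "R", "A", "A", "C", "R", "A", "K")

# A factor with fixed levels keeps the order and reports zeros
toy_freq <- table(factor(toy_seq, levels = aa_order)) / length(toy_seq)
toy_freq <- setNames(as.vector(toy_freq), aa_order)
toy_freq[c("A", "R", "C", "W")]
```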

Check for the presence of “U” (selenocysteine, which is not one of the 20 amino acids in the reference table)

aa.names <- names(OAS3.human.aa.freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## character(0)
OAS3.human.aa.freq[i.U]
## named numeric(0)

No U is present in OAS3, so nothing needs to be removed. (If one were present, it would be better to remove it from the original sequence than from the frequency vector.)

# no U's are present
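
For a protein that does contain non-standard letters, the removal could look roughly like the sketch below. It works on a made-up sequence, filters the sequence itself rather than the frequency vector, and then recomputes the frequencies so that they still sum to 1.

```r
standard_aa <- c("A","R","N","D","C","Q","E","G","H","I",
                 "L","K","M","F","P","S","T","W","Y","V")

# Toy sequence containing one selenocysteine (U); not the OAS3 sequence
toy_seq <- c("M", "U", "A", "L", "A")

# Letters that fall outside the 20 standard residues
odd_letters <- setdiff(unique(toy_seq), standard_aa)
odd_letters

# Drop them from the sequence itself, then compute frequencies
toy_clean <- toy_seq[toy_seq %in% standard_aa]
toy_freq  <- table(toy_clean) / length(toy_clean)
toy_freq
```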

Add data on my focal protein to the amino acid frequency table. table() sorts its output alphabetically (A, C, D, ...), while the rows of aa.prop follow the order of aa.1.1 (A, R, N, ...), so the frequencies are matched by name rather than by position.

aa.prop$OAS3.human.aa.freq <- unname(OAS3.human.aa.freq[row.names(aa.prop)])
pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b OAS3.human.aa.freq
A 0.1165 0.07313 0.09264 0.08331 0.08372
R 0.02166 0.02414 0.04129 0.03369 0.05888
N 0.03964 0.05007 0.06353 0.04223 0.0322
D 0.06661 0.04359 0.05876 0.05631 0.04876
C 0.008991 0.02702 0.03917 0.01454 0.02852
Q 0.02738 0.04395 0.03917 0.02631 0.06992
E 0.05476 0.03098 0.04553 0.05931 0.04784
G 0.08051 0.107 0.09052 0.08701 0.07084
H 0.04536 0.01765 0.01747 0.02469 0.0184
I 0.03719 0.04323 0.04923 0.05516 0.03496
L 0.09031 0.06376 0.05823 0.07824 0.1159
K 0.1018 0.04143 0.05929 0.07408 0.05244
M 0.01962 0.005764 0.01323 0.021 0.0138
F 0.05027 0.03062 0.02753 0.03646 0.04324
P 0.03351 0.04575 0.03759 0.04339 0.06532
S 0.04986 0.1228 0.0667 0.07547 0.06532
T 0.04863 0.09114 0.06194 0.05493 0.04048
W 0.01349 0.01585 0.01588 0.01662 0.02392
Y 0.02575 0.03963 0.05717 0.03 0.02116
V 0.06825 0.08249 0.06511 0.08724 0.0644

Functions to calculate similarities

Two custom functions are needed: one to calculate the correlation between two columns of our table, and one to calculate cosine similarity. As written, both functions evaluate the same formula, so they return identical values.

# Correlation used in Chou and Zhang (1992).
chou_cor <- function(x,y){
  numerator <- sum(x*y)
  denominator <- sqrt((sum(x^2))*(sum(y^2)))
  result <- numerator/denominator
  return(result)
}

# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}
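
Both functions divide the sum of products by the product of the vector lengths, with no centering. The sketch below, on two made-up composition vectors, contrasts that uncentered measure with the Pearson correlation from cor(), which subtracts the means first and therefore gives a different number.

```r
# Uncentered correlation: the same formula as the two functions above
uncentered_cor <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

# Toy composition vectors (hypothetical values)
x <- c(0.5, 0.3, 0.2)
y <- c(0.4, 0.4, 0.2)

uncentered_cor(x, y)  # cosine of the angle between x and y
cor(x, y)             # Pearson correlation: means are subtracted first
```

Because composition vectors have only non-negative entries, the uncentered measure always falls between 0 and 1, so the differences between fold classes tend to be small.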

Calculate correlation between each column

corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta  <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb   <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb   <- chou_cor(aa.prop[,5], aa.prop[,4])

Calculate cosine similarity

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta  <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb   <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb   <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate distance. Note: we need to flip the dataframe on its side using a command called t()

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
##                       A    R    N    D    C    Q    E    G    H    I    L    K
## alpha.prop         0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop          0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop      0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b            0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## OAS3.human.aa.freq 0.08 0.03 0.05 0.05 0.04 0.07 0.02 0.03 0.05 0.12 0.01 0.03
##                       M    F    P    S    T    W    Y    V
## alpha.prop         0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop          0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop      0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b            0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## OAS3.human.aa.freq 0.07 0.07 0.06 0.07 0.04 0.06 0.02 0.02

We can get a distance matrix like this:

dist(aa.prop.flipped, method = "euclidean")
##                    alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop          0.13342098                                    
## a.plus.b.prop      0.09281824 0.08289406                         
## a.div.b            0.06699039 0.08659174    0.06175113           
## OAS3.human.aa.freq 0.18218375 0.18183104    0.15689863 0.16738924

Individual distances using dist()

dist.alpha <- dist((aa.prop.flipped[c(1,5),]),  method = "euclidean")
dist.beta  <- dist((aa.prop.flipped[c(2,5),]),  method = "euclidean")
dist.apb   <- dist((aa.prop.flipped[c(3,5),]),  method = "euclidean")
dist.adb  <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
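
For two rows, dist() with method = "euclidean" is simply the square root of the summed squared differences. A short check on made-up vectors:

```r
# Toy composition vectors (hypothetical values)
x <- c(0.5, 0.3, 0.2)
y <- c(0.4, 0.4, 0.2)

# Euclidean distance written out by hand
by_hand <- sqrt(sum((x - y)^2))

# The same distance from dist(), which compares the rows of a matrix
via_dist <- as.numeric(dist(rbind(x, y), method = "euclidean"))

c(by_hand = by_hand, via_dist = via_dist)
```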

Compile the information. Rounding makes it easier to read

# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")

# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")

df <- data.frame(fold.type,
           corr.sim ,
           cosine.sim ,
           Euclidean.dist ,
           sim.sum ,
           dist.sum )
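
The summary labels are typed in by hand here, so they have to be updated whenever the numbers change. A possible alternative, sketched with placeholder values rather than the results of this analysis, is to let which.max() and which.min() place the labels.

```r
fold_type   <- c("alpha", "beta", "alpha plus beta", "alpha/beta")
cosine_sim  <- c(0.61, 0.58, 0.70, 0.66)  # placeholder values
euclid_dist <- c(0.21, 0.22, 0.17, 0.19)  # placeholder values

# Label the row with the highest similarity and the row with the smallest distance
sim_flag  <- ifelse(seq_along(fold_type) == which.max(cosine_sim), "most.sim", "")
dist_flag <- ifelse(seq_along(fold_type) == which.min(euclid_dist), "min.dist", "")

data.frame(fold_type, cosine_sim, euclid_dist, sim_flag, dist_flag)
```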

Display output

pander::pander(df)
fold.type corr.sim cosine.sim Euclidean.dist sim.sum dist.sum
alpha 0.7427 0.7427 0.1822
beta 0.7475 0.7475 0.1818
alpha plus beta 0.7971 0.7971 0.1569 most.sim min.dist
alpha/beta 0.7731 0.7731 0.1674

Subcellular location prediction

TBD

# ec <- c(8.6, 2.9, 4.9, 5.1, 3.7, 7.8, 2.1, 4.6, 6.3, 8.8, 2.5, 4.6, 4.9,
#         4, 4.2, 7.3, 6, 6.7, 1.4, 3.6)/100
#
# an <- c(7.6, 2.2, 5.2, 6.2, 4.0, 6.9, 2.1, 5.1, 5.8, 9.4, 2.1, 4.4, 5.4, 4.1,
#         5.0, 7.2, 6.1, 6.7, 1.4, 3.2)/100
#
# df <- data.frame(ec,an)
# ave.vect <- apply(df,1,mean)
#
#
#
# cor.mat <- matrix(NA,  20, nrow = 20, ncol = 20)
#
# for(i in 1:20){
#   for(j in 1:20){
#     cor.mat[i,j] <- (ec[j]-ave.vect[i])*(ec[i]-ave.vect[j])
#   }
# }
#
# t(ec-ave.vect)%*%ginv(cor.mat)%*%(ec-ave.vect)

Percent Identity Comparisons (PID)

Data preparation

Convert all FASTA records into entries in a single vector. FASTA entries are contained in a list produced at the beginning of the script. They were cleaned to remove the header and newline characters.

names(oas3s_list)
##  [1] "NP_006178"    "XP_509393"    "NP_660261"    "NP_001009493" "NP_001041556"
##  [6] "XP_015008356" "XP_015008356" "NP_001075226" "XP_031506643" "XP_004053976"
length(oas3s_list)
## [1] 10

Each entry is a full entry with no spaces or parsing, and no header

oas3s_list[1]
## $NP_006178
## [1] "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKGPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV"

Make each entry of the list into a vector. There are several ways to do this.

oas3s_vector <- rep(NA, length(oas3s_list))
for (i in 1:length(oas3s_list)){
  oas3s_vector[i] <- oas3s_list[[i]]
}

Name the vector

names(oas3s_vector) <- names(oas3s_list)
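
Two of the other ways to turn a named list of strings into a named vector are shown below on a made-up list. Both keep the names, so the separate naming step is not needed.

```r
# Toy list standing in for the cleaned sequences (hypothetical content)
toy_list <- list(seq_a = "MDLYST", seq_b = "MDLFST", seq_c = "MELYST")

# unlist() keeps the names and returns a character vector
toy_vector <- unlist(toy_list)

# vapply() does the same but fails loudly if an entry is not a single string
toy_vector2 <- vapply(toy_list, function(x) x, character(1))

toy_vector
```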

PID Table

Do pairwise alignments for humans, chimps and two other species.

align01.02 <- Biostrings::pairwiseAlignment(
                  oas3s_list[[1]],
                  oas3s_list[[2]])
align01.05 <- Biostrings::pairwiseAlignment(
                  oas3s_list[[1]],
                  oas3s_list[[5]])
align01.06 <- Biostrings::pairwiseAlignment(
                  oas3s_list[[1]],
                  oas3s_list[[6]])
align02.05 <- Biostrings::pairwiseAlignment(
                  oas3s_list[[2]],
                  oas3s_list[[5]])
align02.06 <- Biostrings::pairwiseAlignment(
                  oas3s_list[[2]],
                  oas3s_list[[6]])
align05.06 <- Biostrings::pairwiseAlignment(
                  oas3s_list[[5]],
                  oas3s_list[[6]])


Biostrings::pid(align01.02)
## [1] 99.26403
Biostrings::pid(align01.05)
## [1] 78.47286
Biostrings::pid(align01.06)
## [1] 86.03239
Biostrings::pid(align02.05)
## [1] 78.47286
Biostrings::pid(align02.06)
## [1] 85.93117
Biostrings::pid(align05.06)
## [1] 69.39182

Build Matrix

# the diagonal is 100 because pid() reports percentages
pids <- c(100,             NA,              NA,              NA,
          pid(align01.02), 100,             NA,              NA,
          pid(align01.05), pid(align02.05), 100,             NA,
          pid(align01.06), pid(align02.06), pid(align05.06), 100)

mat <- matrix(pids, nrow = 4, byrow = TRUE)
row.names(mat) <- c("Homo","Pan","Canis","Macaca")
colnames(mat) <- c("Homo","Pan","Canis","Macaca")
pander::pander(mat)
  Homo Pan Canis Macaca
Homo 100 NA NA NA
Pan 99.26 100 NA NA
Canis 78.47 78.47 100 NA
Macaca 86.03 85.93 69.39 100
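
With more species, typing out every pair becomes tedious, and a double loop over the lower triangle does the same job. The sketch below uses made-up equal-length sequences and a hypothetical toy_identity() that compares positions directly with no alignment step; with real data that function would be replaced by the pairwise alignment and pid() calls used above.

```r
# Toy equal-length sequences (hypothetical)
toy_seqs <- c(Homo = "MDLYST", Pan = "MDLYSA", Canis = "MELFST")

# Percent of identical positions; assumes equal lengths and no alignment
toy_identity <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  100 * mean(a == b)
}

n <- length(toy_seqs)
pid_mat <- matrix(NA_real_, n, n,
                  dimnames = list(names(toy_seqs), names(toy_seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(i)) {  # lower triangle plus diagonal
    pid_mat[i, j] <- toy_identity(toy_seqs[i], toy_seqs[j])
  }
}
round(pid_mat, 2)
```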

PID methods comparison

Compare different PID methods. I did this for humans vs. chimps and also for another comparison out of curiosity. You only have to do chimps.

diff.pids <- c('PID1', round(pid(align01.02, type = "PID1"),2), '(aligned positions + internal gap positions)',
               'PID2', round(pid(align01.02, type = "PID2"),2), '(aligned positions)',
               'PID3', round(pid(align01.02, type = "PID3"),2), '(length shorter sequence)',
               'PID4', round(pid(align01.02, type = "PID4"),2), '(average length of the two sequences)')
mat2 <- matrix(diff.pids, nrow = 4, byrow = TRUE)
colnames(mat2) <- c("Method", "PID", "Denominator")
pander::pander(mat2)
Method   PID     Denominator
PID1     99.26   (aligned positions + internal gap positions)
PID2     99.26   (aligned positions)
PID3     99.26   (length shorter sequence)
PID4     99.26   (average length of the two sequences)
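
The four methods share a numerator (the number of identical positions) and differ only in the denominator. The toy example below is a sketch in base R and is not how Biostrings computes PID internally: `toy_pid()` is a hypothetical helper, the two gapped strings are invented, and the function assumes every gap is internal rather than at the ends of the alignment.

```r
# Hypothetical helper: percent identity for two already-aligned strings,
# using the four denominators listed in the table above.
# Assumption: all gaps ("-") are internal gaps.
toy_pid <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))

  gap         <- x == "-" | y == "-"   # columns containing a gap
  identical_n <- sum(x == y & !gap)    # identical residues
  aligned_n   <- sum(!gap)             # residue-vs-residue columns
  len_a       <- sum(x != "-")         # ungapped length of sequence a
  len_b       <- sum(y != "-")         # ungapped length of sequence b

  c(PID1 = 100 * identical_n / (aligned_n + sum(gap)),
    PID2 = 100 * identical_n / aligned_n,
    PID3 = 100 * identical_n / min(len_a, len_b),
    PID4 = 100 * identical_n / mean(c(len_a, len_b)))
}

# Invented 9-column alignment: 5 identical residues, 1 mismatch, 3 gap columns
toy_result <- round(toy_pid("ACDE--GHK", "ACDFWYG-K"), 2)
print(toy_result)
```

For this toy alignment the four values are 55.56, 83.33, 71.43 and 66.67. When two sequences have the same length and align without gaps, all four denominators are equal, which would be consistent with the identical human vs. chimp values in the table above.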

Multiple Sequence Alignment

MSA data preparation

For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.

## putting this chunk here again to make sure the vectors are named properly

# clean each FASTA entry with fasta_cleaner()
for(i in seq_along(oas3s_list)){
  oas3s_list[[i]] <- fasta_cleaner(oas3s_list[[i]], parse = FALSE)
}

# make a vector to hold each sequence
oas3s_vector <- rep(NA, length(oas3s_list))

# name the vector (this makes ggmsa happy)
names(oas3s_vector) <- names(oas3s_list)

# extract the sequences from list and put into vector
for(i in seq_along(oas3s_vector)){
  oas3s_vector[i] <- oas3s_list[[i]]
}

oas3s_vector
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NP_006178 
##                                                    "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKGPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            XP_509393 
##                                                    "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPCAGCSGLGHPIQLDPNQKTPENSKSLSAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPGLSLQFPEQNVPEALQFQLVSTALKSWMDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYHQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLGKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            NP_660261 
## "MDLFHTPAGALDKLVAHNLHPAPEFTAAVRGALGSLNITLQQHRARGSQRPRVIRIAKGGAYARGTALRGGTDVELVIFLDCFQSFGDQKTCHSETLGAMRMLLESWGGHPGPGLTFEFSQSKASRILQFRLASADGEHWIDVSLVPAFDVLGQPRSGVKPTPNVYSSLLSSHCQAGEYSACFTEPRKNFVNTRPAKLKNLILLVKHWYHQVQTRAVRATLPPSYALELLTIFAWEQGCGKDSFSLAQGLRTVLALIQHSKYLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDPADPTWDVGNGTAWRWDVLAQEAESSFSQQCFKQASGVLVQPWEGPGLPRAGILDLGHPIYQGPNQALEDNKGHLAVQSKERSQKPSNSAPGFPEAATKIPAMPNPSANKTRKIRKKAAHPKTVQEAALDSISSHVRITQSTASSHMPPDRSSISTAGSRMSPDLSQIPSKDLDCFIQDHLRPSPQFQQQVKQAIDAILCCLREKSVYKVLRVSKGGSFGRGTDLRGSCDVELVIFYKTLGDFKGQKPHQAEILRDMQAQLRHWCQNPVPGLSLQFIEQKPNALQLQLASTDLSNRVDLSVLPAFDAVGPLKSGTKPQPQVYSSLLSSGCQAGEHAACFAELRRNFINTCPPKLKSLMLLVKHWYRQVVTRYKGGEAAGDAPPPAYALELLTIFAWEQGCGEQKFSLAEGLRTILRLIQQHQSLCIYWTVNYSVQDPAIRAHLLCQLRKARPLVLDPADPTWNVGQGDWKLLAQEAAALGSQVCLQSGDGTLVPPWDVTPALLHQTLAEDLDKFISEFLQPNRHFLTQVKRAVDTICSFLKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIISEIQAHLEACQQMHSFDVKFEVSKRKNPRVLSFTLTSQTLLDQSVDFDVLPAFDALGQLRSGSRPDPRVYTDLIHSCSNAGEFSTCFTELQRDFITSRPTKLKSLIRLVKYWYQQCNKTIKGKGSLPPQHGLELLTVYAWEQGGQNPQFNMAEGFRTVLELIVQYRQLCVYWTINYSAEDKTIGDFLKMQLRKPRPVILDPADPTGNLGHNARWDLLAKEATVYASALCCVDRDGNPIKPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         NP_001009493 
##  "MDLYHTPAGALDKLVAHSLHPAPEFTAAVRRALGSLDNVLRKNGAGGLQRPRVIRIIKGGAHARGTALRGGTDVELVIFLDCLRSFGDQKTCHTEILGAIQALLESWGCNPGPGLTFEFSGPKASGILQFRLASVDQENWIDVSLVPAFDALGQLHSEVKPTPNVYSSLLSSHCQAGEHSACFTELRKNFVNIRPVKLKNLILLVKHWYRQVQTQVVRATLPPSYALELLTIFAWEQGCRKDAFSLAQGLRTVLALIQRNKHLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDPADPTWDLGNGTAWCWDVLAKEAEYSFNQQCFKEASGALVQPWEGPGLPCAGILDLGHPIQQGAKHALEDNNGHLAVQPMKESLQPSNPARGLPETATKISAMPDPTVTETHKSLKKSVHPKTVSETVVNPSSHVWITQSTASSNTPPGHSSMSTAGSQMGPDLSQIPSKELDSFIQDHLRPSSQFQQQVRQAIDTILCCLREKCVDKVLRVSKGGSFGRGTDLRGKCDVELVIFYKTLGDFKGQNSHQTEILCDMQAQLQRWCQNPAPGLSLQFIEQKSNALHLQLVPTNLSNRVDLSVLPAFDAVGPLKSGAKPLPETYSSLLSSGCQAGEHAACFAELRRNFINTRPAKLRSLMLLVKHWYRQVAARFEGGETAGAALPPAYALELLTVFAWEQGCGEQKFSMAEGLRTVLRLVQQHQSLCIYWTVNYSVQDPAIRAHLLRQLRKARPLILDPADPTWNMDQGNWKLLAQEAAALESQVCLQSRDGNLVPPWDVMPALLHQTPAQNLDKFICEFLQPDRHFLTQVKRAVDTICSFLKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIIAEIQAQLEACQQKQRFDVKFEISKRKNPRVLSFTLTSKTLLGQSVDFDVLPAFDALGQLKSGSRPDPRVYTDLIQSYSNAGEFSTCFTELQRDFISSRPTKLKSLIRLVKHWYQQCNKTVKGKGSLPPQHGLELLTVYAWERGSQNPQFNMAEGFRTVLELIGQYRQLCVYWTINYGAEDETIGDFLKMQLQKPRPVILDPADPTGNLGHNARWDLLAKEAAAYTSALCCMDKDGNPIKPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         NP_001041556 
##                                                    "MDVYRTPAAALASLVARRLQPSAEFQRAAWRALGALATTLRERGDRAAAQPWRVLKTAKGGSAGRGTALRGGCDSEIVIFLDCFKSYKDHSVDRAEILKDLWDLLQSWWQKPIPGLNFETLWQDRPGVLQFRLASTDLENWMDVSLVPAFDALGQLCAGAKPAPQVYSTLLHSGCQGGEHAACFAELRRNFVNVRPAKLKSLILLVKHWYRQVCQEEAKREMLPPAYALELLTIFAWEQGCGKDAFSLAQGLRTVLGLIQEYRQLCVFWTLNYGFENPTVRSFLSSQLKKPRPVILDPADPTWDVGNGATWHWDILAREAESCYEHPCFLQTAGDTVQPWEGTGLPRAGCSGLDHPIQRDDAQRTPGNSSSLNAVPPRAGSRQPSWPAPRPPGPDSITPSTLGRAVDLSQIATKDLDRFIQDHLKPNPQFQKQVGKAINVILGCLREKCVYKASRVSKGGSFGRGTDLRGGCDAELVIFLNCFEDYRDQRARRPEILQEMQAQLESWWQDPVPGLSLEFPEQTVPEALQFRLVSTALESWMDVCLVPAFDAVGQLCAGAKPAPQVYSTLLQSGCQGGEHAACFAELRRNFVNVRPAKLKSLILLVKHWYRQVAAQNKGQQPACASLPPVYALELLTIFAWEQGCGEDSFKMAQGLKTVLELVQQHQQLCVYWTVNYSFEDPAIRTHLLGQLQKPRPLILDPGDPTWNVGQGSWELLAQEAAVLETQACLRSTEGTSVQPWDVMPALLYQTPAGDLDKFISDFLQPNRQFLAQVNKAVDTICSFLKENCFQNSAIKVLKVVKGGSLAKGTALRGRSDADLVVFLSCFSQFAEQGNRRAEIISEIRAQLEACQQKMQLEVKFEIPKRENSRVLSFSLKSQTMLDQSVDFDVLPAFNALGQVVSSYRPPSQVYVDLIYSYNNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYRQCNKMPRGRGSLPPQHGLELLTVYAWEQGGQSAQFNMAQGFRTVLELVSQYRQLRVYWTVNYDNEDQTVRDFLSRQLRQPRPIILDPADPTGNLGHNARWDLLATEATACMSALCCTDRDGTPIQPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         XP_015008356 
##                                                                                                                                                                                                                         "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAPRVLKIVKGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSLKFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGGEHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQLNPNQKTPENSKSLDAVSPRAGSKAPSCPAPGPAGAASVAPSVPGMALDLSQIPTKELDRFIQDHLKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDVSLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLILLVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVLGMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQASTPPASQSYSGTSSSLALPS" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         XP_015008356 
##                                                                                                                                                                                                                         "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAPRVLKIVKGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSLKFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGGEHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQLNPNQKTPENSKSLDAVSPRAGSKAPSCPAPGPAGAASVAPSVPGMALDLSQIPTKELDRFIQDHLKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDVSLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLILLVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVLGMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQASTPPASQSYSGTSSSLALPS" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         NP_001075226 
##                                                       "MDVYRTPAAELDGLVARSLQPPAEFVGAARRALGNLSAALRERGGRPGAAAQPWRVLKIGGSSGRGTALRGGCDSELVIFLDCFKSYEDQGAHRAEILNEMRALLESSWQDTVLGLSLEFPEQNTPGVLQLRLASTDLENWMDVSLVPAFDALGQLRTGAKPEPRVYSSLLDSGSRGGEHAACFAELRRNFVNARPTKLKNLILLVKHWYRQVCPQEASRELLPPAYALELLTIFAWERGCGKDAFSLAQGLRTVLGLVQDYRHLCVFWTLNYSFEDPALRQFLRRQLERPRPVILDPADPTWDVGNGAAWRWDLLAKEAESCCDHPCFLQAARGPVQPWEGPDLPRAGCPGLDHRIQQDPAQRTPEDSGVLTGVHPSTRKRQPWSPAPGPSSAASIAPRPPQEVSDLSRIPAPELDRFIQDHLMPSSQFQKQVSKAIDVILRGLRENCVHKPSRASKGGSFGRGTDLRGGCDAELVIFLNCFKDYKDQGARRGQILEEIRAQLESWWQDRVPSLSLKFPEQSAPGALQLQLASAALESRVDVSLLPAFDAIGQLRAGAKPEPGVYSSLLDSGSRGGEHAACFAELRRNFVNTRPTKLKNLILLVKHWYRQVAAQNKGAQRAGASLPPAYALELLTIFAWEQGCGEDRFSMAQGLRTVLGLVQQHRQLCVYWTVNYSFEDPALRTHLLGQLRNPRPLVLDPADPTWNVGQGSWELLAQEAAALGTQPCLMSREGTPVQPWDVMPALLCQTPASDLDKFITEFLQPNRHFLEQVNKAVDTICSFLRDNCFRNSPIKVLKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNRRAEIISEIRAQLEACQQEREFEVKFEISKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVPDSRPRPQVYVDLIHSYSNAGEYSPCFTELQRNFISSRPTKLKSLIRLVKHWYQQCNKMPKGRGSLPPQHGLELLTVYAWEQGGCDCQFSMAEGFRTVLELVRQYRQLCVYWTVNYDNENETVRDFLKLQLQKPRPIILDPADPTGNLGPNARWDLLAKEAVACMSAPCCMGRDGSPIQPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         XP_031506643 
##                                                    "MDLYRTPASELDRFVATRLQPRKEFTETTRRALGALAAALRERRGRPGAAAPRVLKIVKGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSEMRALLESWWQNPVPGLSLKFPQQSVPGALQFRLTSIDLEDWMDVSLVPAFDVLGQAGSRIKPKPQVYSTLLNSGCQGGEHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDPADPTWDVGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQGPSLPRARCSGLGHPIQLNPNQKTPENSKSLDAVSPRAGSKAPSCPAPGPAGAASVAPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPEQNVPEALQFQLVSTAPKRWTDVSLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLILLVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVLGMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVNLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCHKISRGRGSLPPKHGLELLTVYAWEQGGKDPQFNMAEGFRTVLELVTQYRQLCIYWTINYNTEDKTVGDFLKQQLQKPRPIILDPADPTGNLGHSARWDLLAKEAAACMSALCCVGRNGIPIQPWPVKAAV" 
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         XP_004053976 
##                                                    "MDLYSTPAAALDRFVARSLQPRTEFVEKARRALGALAAALRERAGRLGAAAPRVLKTVKGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRLTFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGGEHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWEQGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDPADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQLDPNQKTPENSKSLNAVYPRAGSKPPSCPAPGPTGAASIVPSVPGMALDLSQIPTKELDRFIQDHLKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCFTDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDVSLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLILLVKHWYRQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQQHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAALGMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSFLKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEIRAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGSRPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGRGSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTVGDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPVKAAV"
oas3s_vector_ss <- Biostrings::AAStringSet(oas3s_vector)
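
The printed vector above lists the accession XP_015008356 twice, so it can be worth checking a named vector for repeated names before building the string set. This is a minimal base R sketch on a toy vector: the short sequences are placeholders and the names `toy_vector`, `dup_names` and `toy_unique` are invented for illustration. Whether a repeated entry should be dropped is a judgment call that this script does not make.

```r
# Toy named vector with one repeated accession (sequences are placeholders)
toy_vector <- c(NP_006178    = "MDLYS",
                XP_015008356 = "MDLYR",
                XP_015008356 = "MDLYR")

# Which names occur more than once?
dup_names <- unique(names(toy_vector)[duplicated(names(toy_vector))])
print(dup_names)

# One option: keep only the first occurrence of each name
toy_unique <- toy_vector[!duplicated(names(toy_vector))]
print(names(toy_unique))
```

The same two lines could be run on `oas3s_vector` before calling `Biostrings::AAStringSet()`.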

Building Multiple Sequence Alignment (MSA)

oas3s_align <- msa(oas3s_vector_ss,
                   method = "ClustalW")
## use default substitution matrix

Cleaning / setting up an MSA

msa() returns its own special class of MSA object:

class(oas3s_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(oas3s_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"    "MsaMetaData"           
## [4] "MultipleAlignment"

Default output of MSA

oas3s_align
## CLUSTAL 2.1  
## 
## Call:
##    msa(oas3s_vector_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 10 rows and 1144 columns
##      aln                                                   names
##  [1] MDLYSTPAAALDRFVARRLQPRKEF...ACTSALCCMGRNGIPIQPWPVKAAV NP_006178
##  [2] MDLYSTPAAALDRFVARSLQPRTEF...ACTSALCCMGRNGIPIQPWPVKAAV XP_004053976
##  [3] MDLYSTPAAALDRFVARRLQPRKEF...ACTSALCCMGRNGIPIQPWPVKAAV XP_509393
##  [4] MDLYRTPASALDRFVATRLQPRKEF...------------------------- XP_015008356
##  [5] MDLYRTPASALDRFVATRLQPRKEF...------------------------- XP_015008356
##  [6] MDLYRTPASELDRFVATRLQPRKEF...ACMSALCCVGRNGIPIQPWPVKAAV XP_031506643
##  [7] MDVYRTPAAALASLVARRLQPSAEF...ACMSALCCTDRDGTPIQPWPVKAAV NP_001041556
##  [8] MDVYRTPAAELDGLVARSLQPPAEF...ACMSAPCCMGRDGSPIQPWPVKAAV NP_001075226
##  [9] MDLFHTPAGALDKLVAHNLHPAPEF...VYASALCCVDRDGNPIKPWPVKAAV NP_660261
## [10] MDLYHTPAGALDKLVAHSLHPAPEF...AYTSALCCMDKDGNPIKPWPVKAAV NP_001009493
##  Con MDLYRTPAAALDRFVARRLQPRKEF...AC?SALCCMGR?G?PIQPWPVKAAV Consensus

Change class of alignment

class(oas3s_align) <- "AAMultipleAlignment"

Convert to seqinr format

oas3s_align_seqinr <- msaConvert(oas3s_align, type = "seqinr::alignment")

OPTIONAL: show output with print_msa

compbio4all::print_msa(oas3s_align_seqinr)
## [1] "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAP--RVLKTV 0"
## [1] "MDLYSTPAAALDRFVARSLQPRTEFVEKARRALGALAAALRERAGRLGAAAP--RVLKTV 0"
## [1] "MDLYSTPAAALDRFVARRLQPRKEFVEKARRALGALAAALRERGGRLGAAAP--RVLKTV 0"
## [1] "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAP--RVLKIV 0"
## [1] "MDLYRTPASALDRFVATRLQPRKEFTETARRALGALAAALRERGGRPGALAP--RVLKIV 0"
## [1] "MDLYRTPASELDRFVATRLQPRKEFTETTRRALGALAAALRERRGRPGAAAP--RVLKIV 0"
## [1] "MDVYRTPAAALASLVARRLQPSAEFQRAAWRALGALATTLRERGDR--AAAQPWRVLKTA 0"
## [1] "MDVYRTPAAELDGLVARSLQPPAEFVGAARRALGNLSAALRERGGRPGAAAQPWRVLKIG 0"
## [1] "MDLFHTPAGALDKLVAHNLHPAPEFTAAVRGALGSLNITLQQHRAR-GSQRP--RVIRIA 0"
## [1] "MDLYHTPAGALDKLVAHSLHPAPEFTAAVRRALGSLDNVLRKNGAG-GLQRP--RVIRII 0"
## [1] " "
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYVDQRARRAEILSEMRASLESWWQNPVPGLRL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSKMRALLESWWQNPVPGLSL 0"
## [1] "KGGSSGRGTALKGGCDSELVIFLDCFKSYMDQRARRAEILSEMRALLESWWQNPVPGLSL 0"
## [1] "KGGSAGRGTALRGGCDSEIVIFLDCFKSYKDHSVDRAEILKDLWDLLQSWWQKPIPGLNF 0"
## [1] "--GSSGRGTALRGGCDSELVIFLDCFKSYEDQGAHRAEILNEMRALLESSWQDTVLGLSL 0"
## [1] "KGGAYARGTALRGGTDVELVIFLDCFQSFGDQKTCHSETLGAMRMLLESWGGHPGPGLTF 0"
## [1] "KGGAHARGTALRGGTDVELVIFLDCLRSFGDQKTCHTEILGAIQALLESWGCNPGPGLTF 0"
## [1] " "
## [1] "TFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGG 0"
## [1] "TFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGG 0"
## [1] "TFPEQSVPGALQFRLTSVDLEDWMDVSLVPAFNVLGQAGSGVKPKPQVYSTLLNSGCQGG 0"
## [1] "KFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGG 0"
## [1] "KFPQQSVPGALQFRLTSIDLEDWTDVSLVPAFDVLGQAGSRVKPKPQVYSTLLNSGCQGG 0"
## [1] "KFPQQSVPGALQFRLTSIDLEDWMDVSLVPAFDVLGQAGSRIKPKPQVYSTLLNSGCQGG 0"
## [1] "ETLWQDRPGVLQFRLASTDLENWMDVSLVPAFDALGQLCAGAKPAPQVYSTLLHSGCQGG 0"
## [1] "EFPEQNTPGVLQLRLASTDLENWMDVSLVPAFDALGQLRTGAKPEPRVYSSLLDSGSRGG 0"
## [1] "EFSQSKASRILQFRLASADGEHWIDVSLVPAFDVLGQPRSGVKPTPNVYSSLLSSHCQAG 0"
## [1] "EFSGPKASGILQFRLASVDQENWIDVSLVPAFDALGQLHSEVKPTPNVYSSLLSSHCQAG 0"
## [1] " "
## [1] "EHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRNFVNIRPAKLKNLILLVKHWYHQVCLQGLWKETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWE 0"
## [1] "EHAACFTELRRDFVNIRPAKLKNLILLVKHWYHQVCLQGLWEETLPPVYALELLTIFAWE 0"
## [1] "EHAACFAELRRNFVNVRPAKLKSLILLVKHWYRQVCQEEAKREMLPPAYALELLTIFAWE 0"
## [1] "EHAACFAELRRNFVNARPTKLKNLILLVKHWYRQVCPQEASRELLPPAYALELLTIFAWE 0"
## [1] "EYSACFTEPRKNFVNTRPAKLKNLILLVKHWYHQVQTR-AVRATLPPSYALELLTIFAWE 0"
## [1] "EHSACFTELRKNFVNIRPVKLKNLILLVKHWYRQVQTQ-VVRATLPPSYALELLTIFAWE 0"
## [1] " "
## [1] "QGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLGLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLKRPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDP 0"
## [1] "QGCKKDAFSLAEGLRTVLDLIQQHQHLCVFWTVNYGFEDPAVGQFLQRQLERPRPVILDP 0"
## [1] "QGCGKDAFSLAQGLRTVLGLIQEYRQLCVFWTLNYGFENPTVRSFLSSQLKKPRPVILDP 0"
## [1] "RGCGKDAFSLAQGLRTVLGLVQDYRHLCVFWTLNYSFEDPALRQFLRRQLERPRPVILDP 0"
## [1] "QGCGKDSFSLAQGLRTVLALIQHSKYLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDP 0"
## [1] "QGCRKDAFSLAQGLRTVLALIQRNKHLCIFWTENYGFEDPAVGEFLRRQLKRPRPVILDP 0"
## [1] " "
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPRAGCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCYDHPCFLRGMGDPVQSWKGPGLPCAGCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQ 0"
## [1] "ADPTWDLGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQVPGLPRARCSGLGHPIQ 0"
## [1] "ADPTWDVGNGAAWHWDLLAQEAASCCDHPCFLNGMGDPVQPWQGPSLPRARCSGLGHPIQ 0"
## [1] "ADPTWDVGNGATWHWDILAREAESCYEHPCFLQTAGDTVQPWEGTGLPRAGCSGLDHPIQ 0"
## [1] "ADPTWDVGNGAAWRWDLLAKEAESCCDHPCFLQAARGPVQPWEGPDLPRAGCPGLDHRIQ 0"
## [1] "ADPTWDVGNGTAWRWDVLAQEAESSFSQQCFKQASGVLVQPWEGPGLPRAGILDLGHPIY 0"
## [1] "ADPTWDLGNGTAWCWDVLAKEAEYSFNQQCFKEASGALVQPWEGPGLPCAGILDLGHPIQ 0"
## [1] " "
## [1] "LDPNQKTPENSKSLNAVYPRAGSKPPSCPAP----------------------------- 0"
## [1] "LDPNQKTPENSKSLNAVYPRAGSKPPSCPAP----------------------------- 0"
## [1] "LDPNQKTPENSKSLSAVYPRAGSKPPSCPAP----------------------------- 0"
## [1] "LNPNQKTPENSKSLDAVSPRAGSKAPSCPAP----------------------------- 0"
## [1] "LNPNQKTPENSKSLDAVSPRAGSKAPSCPAP----------------------------- 0"
## [1] "LNPNQKTPENSKSLDAVSPRAGSKAPSCPAP----------------------------- 0"
## [1] "RDDAQRTPGNSSSLNAVPPRAGSRQPSWPAP----------------------------- 0"
## [1] "QDPAQRTPEDSGVLTGVHPSTRKRQPWSPAP----------------------------- 0"
## [1] "QGPNQALEDNKGHL-AVQSKERSQKPSNSAPGFPEAATKIPAMPNPSANKTRKIRKKAAH 0"
## [1] "QGAKHALEDNNGHL-AVQPMKESLQPSNPARGLPETATKISAMPDPTVTETHKSLKKSVH 0"
## [1] " "
## [1] "--------------------------GPTGAASIVPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPTGAASIVPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPTGAASIVPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPAGAASVAPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPAGAASVAPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------GPAGAASVAPSVPGMALDLSQIPTKELDRFIQDH 0"
## [1] "--------------------------RPPGPDSITPSTLGRAVDLSQIATKDLDRFIQDH 0"
## [1] "--------------------------GPSSAASIAPRPPQEVSDLSRIPAPELDRFIQDH 0"
## [1] "PKTVQEAALDSISSHVRITQSTASSHMPPDRSSISTAGSRMSPDLSQIPSKDLDCFIQDH 0"
## [1] "PKTVSETVVN-PSSHVWITQSTASSNTPPGHSSMSTAGSQMGPDLSQIPSKELDSFIQDH 0"
## [1] " "
## [1] "LKPSPQFQEQVKKAIDIILRCLHENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCF 0"
## [1] "LKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCF 0"
## [1] "LKPSPQFQEQVKKAIDIILRCLRENCVHKASRVSKGGSFGRGTDLRDGCDVELIIFLNCF 0"
## [1] "LKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCF 0"
## [1] "LKPSPRFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCF 0"
## [1] "LKPSPQFQEQVKKAIDIILRRLRENCVHKVSRVSKGGSFGRGTDLRDGCDVELVIFLNCF 0"
## [1] "LKPNPQFQKQVGKAINVILGCLREKCVYKASRVSKGGSFGRGTDLRGGCDAELVIFLNCF 0"
## [1] "LMPSSQFQKQVSKAIDVILRGLRENCVHKPSRASKGGSFGRGTDLRGGCDAELVIFLNCF 0"
## [1] "LRPSPQFQQQVKQAIDAILCCLREKSVYKVLRVSKGGSFGRGTDLRGSCDVELVIFYKTL 0"
## [1] "LRPSSQFQQQVRQAIDTILCCLREKCVDKVLRVSKGGSFGRGTDLRGKCDVELVIFYKTL 0"
## [1] " "
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQDQVPSLSLQFPEQNVPEALQFQLVSTALKSWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQDQVPGLSLQFPEQNVPEALQFQLVSTALKSWMDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPQQNVPEALQFQLVSTAPKRWTDV 0"
## [1] "TDYKDQGPRRAEILDEMRAQLESWWQGQVPGLSLQFPEQNVPEALQFQLVSTAPKRWTDV 0"
## [1] "EDYRDQRARRPEILQEMQAQLESWWQDPVPGLSLEFPEQTVPEALQFRLVSTALESWMDV 0"
## [1] "KDYKDQGARRGQILEEIRAQLESWWQDRVPSLSLKFPEQSAPGALQLQLASAALESRVDV 0"
## [1] "GDFKGQKPHQAEILRDMQAQLRHWCQNPVPGLSLQFIEQ-KPNALQLQLASTDLSNRVDL 0"
## [1] "GDFKGQNSHQTEILCDMQAQLQRWCQNPAPGLSLQFIEQ-KSNALHLQLVPTNLSNRVDL 0"
## [1] " "
## [1] "SLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLIL 0"
## [1] "SLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLIL 0"
## [1] "SLLPAFDAVGQLSSGTKPNPQVYSRLLTSGCQEGEHKACFAELRRNFMNIRPVKLKNLIL 0"
## [1] "SLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLIL 0"
## [1] "SLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLIL 0"
## [1] "SLLPAFDALGQLSSGTKPNPQVYSRLLSSGCQEGEHKACFAELRRNFVNIRPAKLKNLIL 0"
## [1] "CLVPAFDAVGQLCAGAKPAPQVYSTLLQSGCQGGEHAACFAELRRNFVNVRPAKLKSLIL 0"
## [1] "SLLPAFDAIGQLRAGAKPEPGVYSSLLDSGSRGGEHAACFAELRRNFVNTRPTKLKNLIL 0"
## [1] "SVLPAFDAVGPLKSGTKPQPQVYSSLLSSGCQAGEHAACFAELRRNFINTCPPKLKSLML 0"
## [1] "SVLPAFDAVGPLKSGAKPLPETYSSLLSSGCQAGEHAACFAELRRNFINTRPAKLRSLML 0"
## [1] " "
## [1] "LVKHWYRQVAAQNKGKGPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQ 0"
## [1] "LVKHWYHQVAAQNKGKRPAPASLPPAYALELLTIFAWEQGCRQDCFNMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKRKRPAPASLPPAYALELLTIFAWEQGCGKDCFDMAQGFRTVLGLVQ 0"
## [1] "LVKHWYRQVAAQNKGQQPACASLPPVYALELLTIFAWEQGCGEDSFKMAQGLKTVLELVQ 0"
## [1] "LVKHWYRQVAAQNKGAQRAGASLPPAYALELLTIFAWEQGCGEDRFSMAQGLRTVLGLVQ 0"
## [1] "LVKHWYRQVVTRYKGGEAAGDAPPPAYALELLTIFAWEQGCGEQKFSLAEGLRTILRLIQ 0"
## [1] "LVKHWYRQVAARFEGGETAGAALPPAYALELLTVFAWEQGCGEQKFSMAEGLRTVLRLVQ 0"
## [1] " "
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAAL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGHGSWELLAQEAAAL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLGKPRPLVLDPADPTWNVGHGSWELLAQEAAAL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHQQLCVYWTVNYSTEDPAMRMHLLGQLRKPRPLVLDPADPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHQQLCVYWTVNYSFEDPAIRTHLLGQLQKPRPLILDPGDPTWNVGQGSWELLAQEAAVL 0"
## [1] "QHRQLCVYWTVNYSFEDPALRTHLLGQLRNPRPLVLDPADPTWNVGQGSWELLAQEAAAL 0"
## [1] "QHQSLCIYWTVNYSVQDPAIRAHLLCQLRKARPLVLDPADPTWNVGQGDWKLLAQEAAAL 0"
## [1] "QHQSLCIYWTVNYSVQDPAIRAHLLRQLRKARPLILDPADPTWNMDQGNWKLLAQEAAAL 0"
## [1] " "
## [1] "GMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSVQPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GMQACFLSRDGTSMPPWDVMPALLYQTPAGDLDKFISEFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "ETQACLRSTEGTSVQPWDVMPALLYQTPAGDLDKFISDFLQPNRQFLAQVNKAVDTICSF 0"
## [1] "GTQPCLMSREGTPVQPWDVMPALLCQTPASDLDKFITEFLQPNRHFLEQVNKAVDTICSF 0"
## [1] "GSQVCLQSGDGTLVPPWDVTPALLHQTLAEDLDKFISEFLQPNRHFLTQVKRAVDTICSF 0"
## [1] "ESQVCLQSRDGNLVPPWDVMPALLHQTPAQNLDKFICEFLQPDRHFLTQVKRAVDTICSF 0"
## [1] " "
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFRNSPIKVIKVVKGGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNKRAEIISEI 0"
## [1] "LKENCFQNSAIKVLKVVKGGSLAKGTALRGRSDADLVVFLSCFSQFAEQGNRRAEIISEI 0"
## [1] "LRDNCFRNSPIKVLK---GGSSAKGTALRGRSDADLVVFLSCFSQFTEQGNRRAEIISEI 0"
## [1] "LKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIISEI 0"
## [1] "LKENCFRNSTIKVLKVVKGGSSAKGTALQGRSDADLVVFLSCFRQFSEQGSHRAEIIAEI 0"
## [1] " "
## [1] "RAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQQERQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQAS 0"
## [1] "RAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALATAMQAS 0"
## [1] "RAQLEACQREQQFEVKFEVSKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVSGS 0"
## [1] "RAQLEACQQKMQLEVKFEIPKRENSRVLSFSLKSQTMLDQSVDFDVLPAFNALGQVVSSY 0"
## [1] "RAQLEACQQEREFEVKFEISKWENPRVLSFSLTSQTMLDQSVDFDVLPAFDALGQLVPDS 0"
## [1] "QAHLEACQQMHSFDVKFEVSKRKNPRVLSFTLTSQTLLDQSVDFDVLPAFDALGQLRSGS 0"
## [1] "QAQLEACQQKQRFDVKFEISKRKNPRVLSFTLTSKTLLGQSVDFDVLPAFDALGQLKSGS 0"
## [1] " "
## [1] "RPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGR 0"
## [1] "RPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGR 0"
## [1] "RPSSQVYVDLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCTKISKGR 0"
## [1] "TPP------ASQSYSGT----------------SSSLALPS------------------- 0"
## [1] "TPP------ASQSYSGT----------------SSSLALPS------------------- 0"
## [1] "RPSSQVYVNLIHSYSNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYQQCHKISRGR 0"
## [1] "RPPSQVYVDLIYSYNNAGEYSTCFTELQRDFIISRPTKLKSLIRLVKHWYRQCNKMPRGR 0"
## [1] "RPRPQVYVDLIHSYSNAGEYSPCFTELQRNFISSRPTKLKSLIRLVKHWYQQCNKMPKGR 0"
## [1] "RPDPRVYTDLIHSCSNAGEFSTCFTELQRDFITSRPTKLKSLIRLVKYWYQQCNKTIKGK 0"
## [1] "RPDPRVYTDLIQSYSNAGEFSTCFTELQRDFISSRPTKLKSLIRLVKHWYQQCNKTVKGK 0"
## [1] " "
## [1] "GSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGKDSQFNMAEGFRTVLELVTQYRQLCIYWTINYNAKDKTV 0"
## [1] "------------------------------------------------------------ 0"
## [1] "------------------------------------------------------------ 0"
## [1] "GSLPPKHGLELLTVYAWEQGGKDPQFNMAEGFRTVLELVTQYRQLCIYWTINYNTEDKTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGQSAQFNMAQGFRTVLELVSQYRQLRVYWTVNYDNEDQTV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGCDCQFSMAEGFRTVLELVRQYRQLCVYWTVNYDNENETV 0"
## [1] "GSLPPQHGLELLTVYAWEQGGQNPQFNMAEGFRTVLELIVQYRQLCVYWTINYSAEDKTI 0"
## [1] "GSLPPQHGLELLTVYAWERGSQNPQFNMAEGFRTVLELIGQYRQLCVYWTINYGAEDETI 0"
## [1] " "
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPV 0"
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPV 0"
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHNARWDLLAKEAAACTSALCCMGRNGIPIQPWPV 0"
## [1] "------------------------------------------------------------ 0"
## [1] "------------------------------------------------------------ 0"
## [1] "GDFLKQQLQKPRPIILDPADPTGNLGHSARWDLLAKEAAACMSALCCVGRNGIPIQPWPV 0"
## [1] "RDFLSRQLRQPRPIILDPADPTGNLGHNARWDLLATEATACMSALCCTDRDGTPIQPWPV 0"
## [1] "RDFLKLQLQKPRPIILDPADPTGNLGPNARWDLLAKEAVACMSAPCCMGRDGSPIQPWPV 0"
## [1] "GDFLKMQLRKPRPVILDPADPTGNLGHNARWDLLAKEATVYASALCCVDRDGNPIKPWPV 0"
## [1] "GDFLKMQLQKPRPVILDPADPTGNLGHNARWDLLAKEAAAYTSALCCMDKDGNPIKPWPV 0"
## [1] " "
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "---- 56"
## [1] "---- 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] "KAAV 56"
## [1] " "

Finished MSA

Based on the output from drawProteins, the first 50 amino acids appear to contain an interesting helical section.

NOTE: Key step - must have class set properly for ggmsa to work!

# Does not work, even with the class-setting chunk above
# ggmsa::ggmsa(oas3s_align,
#              start = 1,
#              end = 50)

Distance Matrix

Make a distance matrix. This produces a “dist” class object.

oas3s_subset_dist <- seqinr::dist.alignment(oas3s_align_seqinr, 
                                            matrix = "identity")
is(oas3s_subset_dist)
## [1] "dist"     "oldClass"
class(oas3s_subset_dist)
## [1] "dist"
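
To make the numbers in the distance matrix easier to interpret, here is a minimal base-R sketch of what an "identity" distance means. As documented for seqinr::dist.alignment, the reported value is the square root of (1 - fraction of identical positions). The two short sequences below are hypothetical toy strings, not taken from the OAS3 alignment, and the sketch ignores gap handling.

```r
# Two toy aligned sequences of equal length (hypothetical, for illustration only)
seq_a <- "LDPNQKTPEN"
seq_b <- "LNPNQKTPGN"

# Split each string into single residues
res_a <- strsplit(seq_a, "")[[1]]
res_b <- strsplit(seq_b, "")[[1]]

# Fraction of aligned positions carrying the same residue
identity <- mean(res_a == res_b)

# Identity distance as defined in the seqinr documentation
identity_dist <- sqrt(1 - identity)

identity
identity_dist
```

Read the other way round, a distance of 0.5 corresponds to 1 - 0.5^2 = 75% identical positions, so distances are not simply "percent different".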

Round for display

oas3s_align_seqinr_rnd <- round(oas3s_subset_dist, 3)
oas3s_align_seqinr_rnd
##              NP_006178 XP_004053976 XP_509393 XP_015008356 XP_015008356
## XP_004053976     0.068                                                 
## XP_509393        0.086        0.091                                    
## XP_015008356     0.291        0.293     0.293                          
## XP_015008356     0.291        0.293     0.293        0.000             
## XP_031506643     0.254        0.254     0.256        0.180        0.180
## NP_001041556     0.460        0.461     0.460        0.499        0.499
## NP_001075226     0.452        0.451     0.455        0.491        0.491
## NP_660261        0.541        0.540     0.542        0.575        0.575
## NP_001009493     0.545        0.544     0.544        0.577        0.577
##              XP_031506643 NP_001041556 NP_001075226 NP_660261
## XP_004053976                                                 
## XP_509393                                                    
## XP_015008356                                                 
## XP_015008356                                                 
## XP_031506643                                                 
## NP_001041556        0.473                                    
## NP_001075226        0.460        0.465                       
## NP_660261           0.542        0.565        0.561          
## NP_001009493        0.550        0.566        0.556     0.387
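
The printed lower triangle wraps across two blocks and is hard to scan. One possible alternative, sketched here on a small hypothetical matrix rather than the real OAS3 distances, is to convert the "dist" object to a full matrix and list the pairs from most to least similar. The names seqA to seqD and all values are made up for illustration.

```r
# Toy symmetric distance matrix (hypothetical values and labels)
labels <- c("seqA", "seqB", "seqC", "seqD")
toy <- matrix(c(0.00, 0.07, 0.29, 0.46,
                0.07, 0.00, 0.30, 0.45,
                0.29, 0.30, 0.00, 0.50,
                0.46, 0.45, 0.50, 0.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(labels, labels))
toy_dist <- as.dist(toy)

# Expand the "dist" object to a full matrix and keep each pair once
full <- as.matrix(toy_dist)
idx <- which(upper.tri(full), arr.ind = TRUE)
pair_table <- data.frame(seq1 = rownames(full)[idx[, "row"]],
                         seq2 = colnames(full)[idx[, "col"]],
                         dist = full[idx])

# Sort so the most similar pairs come first
pair_table <- pair_table[order(pair_table$dist), ]
pair_table
```

The same steps should work on oas3s_align_seqinr_rnd, since as.matrix() accepts any "dist" object.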

Phylogenetic trees of sequences

Build a neighbor-joining phylogenetic tree from the distance matrix

tree <- nj(oas3s_subset_dist)
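
For readers curious what neighbor joining does with the distances, the following base-R sketch works through only the first step of the algorithm: computing the Q matrix and picking the pair of taxa to join first. It is an illustration of the general method, not a description of how nj() is implemented internally, and the five-taxon distance matrix is a made-up toy example.

```r
# Toy distance matrix for five hypothetical taxa
taxa <- c("a", "b", "c", "d", "e")
d <- matrix(c(0,  5,  9,  9, 8,
              5,  0, 10, 10, 9,
              9, 10,  0,  8, 7,
              9, 10,  8,  0, 3,
              8,  9,  7,  3, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(taxa, taxa))

n <- nrow(d)
row_totals <- rowSums(d)

# Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k)
q <- (n - 2) * d - outer(row_totals, row_totals, "+")
diag(q) <- NA  # a taxon cannot be joined with itself

# The pair with the smallest Q value is joined first
best <- which(q == min(q, na.rm = TRUE), arr.ind = TRUE)[1, ]
first_join <- taxa[best]
first_join
```

Note that the pair joined first is not necessarily the pair with the smallest raw distance: here d and e are closest (distance 3), but a and b have the lowest Q value because Q also accounts for how far each taxon is from all the others.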

Plotting phylogenetic trees

Plot the tree

# use.edge.length = FALSE draws the topology only, ignoring branch lengths
plot.phylo(tree, main = "Phylogenetic Tree\n",
           use.edge.length = FALSE)
# nj() returns an unrooted tree; the default layout only draws it as if rooted
mtext(text = "OAS3 family gene tree - unrooted, no branch lengths")
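
A neighbor-joining tree from ape is unrooted, even though the default layout draws it with a root-like base. The sketch below, which assumes the ape package is installed, shows one way to check this and to root a tree on an outgroup. It uses a hypothetical five-taxon toy matrix, and the choice of "c" as outgroup is arbitrary and for illustration only; for the OAS3 tree the outgroup would need to be chosen on biological grounds.

```r
library(ape)

# Toy distance matrix for five hypothetical taxa
taxa <- c("a", "b", "c", "d", "e")
toy <- matrix(c(0,  5,  9,  9, 8,
                5,  0, 10, 10, 9,
                9, 10,  0,  8, 7,
                9, 10,  8,  0, 3,
                8,  9,  7,  3, 0),
              nrow = 5, byrow = TRUE,
              dimnames = list(taxa, taxa))

toy_tree <- nj(as.dist(toy))
unrooted_flag <- is.rooted(toy_tree)   # check whether nj() output is rooted

# Root on an arbitrarily chosen outgroup (assumption for this sketch)
rooted_tree <- root(toy_tree, outgroup = "c", resolve.root = TRUE)
rooted_flag <- is.rooted(rooted_tree)

# Draw the unrooted tree with branch lengths, then the rooted version
plot.phylo(toy_tree, type = "unrooted", use.edge.length = TRUE)
plot.phylo(rooted_tree, use.edge.length = TRUE)
```

Setting use.edge.length = TRUE scales branches by the estimated distances, which makes closely related sequences visibly cluster together.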