This code compiles summary information about the gene NADSYN1 (NAD synthetase 1). The encoded enzyme catalyzes the final step of NAD biosynthesis; NAD itself is a coenzyme in metabolic redox reactions, a precursor for several cell signaling molecules, and a substrate for protein posttranslational modifications.
Key information used to make this script can be found here:
- RefSeq Gene: https://www.ncbi.nlm.nih.gov/gene/55191
- RefSeq Homologene: https://www.ncbi.nlm.nih.gov/homologene?LinkName=gene_homologene&from_uid=55191
Other resources consulted include:
- Neanderthal genome: http://neandertal.ensemblgenomes.org/index.html
Other interesting resources and online tools include:
- REPPER: https://toolkit.tuebingen.mpg.de/jobs/4621683
- Sub-cellular location prediction: https://wolfpsort.hgc.jp/
Load necessary packages:
Download and load drawProteins from Bioconductor
library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
## is available with R version '4.1'; see https://bioconductor.org/install
#install("drawProteins")
library(drawProteins)
Load other packages
# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
##
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
library(pander)
library(ggplot2)
# Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
##
## version
## Biostrings
library(Biostrings)
library(drawProteins)
library(HGNChelper)
Accession numbers were obtained from RefSeq, Refseq Homologene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB.
A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) excluding vertebrates was carried out to determine whether the gene occurs outside of vertebrates. It does: homologs appear in non-vertebrates.
OPTIONAL: Use HGNChelper::checkGeneSymbols() to confirm the validity of your gene name and any aliases
# this is optional
HGNChelper::checkGeneSymbols(x = c("NADSYN1"))
## Maps last updated on: Thu Oct 24 12:31:05 2019
## x Approved Suggested.Symbol
## 1 NADSYN1 TRUE NADSYN1
Not available: - Drosophila
# RefSeq Uniprot PDB sci name common name gene name
NADSYN1_table_vector<-c("NP_060631.2", "Q6IA69", "6OFB", "Homo sapiens" , "Human", "NADSYN1",
"XP_001174076.2", "K7BU87", "NA", "Pan troglodytes" , "Chimpanzee", "NADSYN1",
"XP_001098992.2", "NA", "NA", "Macaca mulatta", "Rhesus monkey", "NADSYN1",
"XP_540795.4", "NA", "NA", "Canis lupus", "Dog", "NADSYN1",
"NP_001029615.1", "Q3ZBF0.1", "NA", "Bos taurus", "Cattle", "NADSYN1",
"NP_084497.1", "Q711T7", "NA", "Mus musculus", "House mouse", "NADSYN1",
"NP_852145.1", "Q812E8.1", "NA", "Rattus norvegicus", "Norway rat", "NADSYN1",
"NP_001006465.1", "Q5ZMA6.1", "NA", "Gallus gallus", "Chicken", "NADSYN1",
"NP_001120406.1", "NA", "NA", "Xenopus tropicalis", "Tropical clawed frog", "NADSYN1",
"NP_001092723.1", "NA", "NA", "Danio rerio", "Zebrafish", "NADSYN1")
NADSYN1_matrix <- matrix( NADSYN1_table_vector, ncol = 6, byrow = TRUE)
NADSYN1_df <- data.frame( NADSYN1_matrix )
colnames( NADSYN1_df ) <- c("ncbi.protein.accession", "UniProt.id", "PDB", "species", "common.name",
"gene.name")
The finished table
pander::pander( NADSYN1_df )
| ncbi.protein.accession | UniProt.id | PDB | species |
|---|---|---|---|
| NP_060631.2 | Q6IA69 | 6OFB | Homo sapiens |
| XP_001174076.2 | K7BU87 | NA | Pan troglodytes |
| XP_001098992.2 | NA | NA | Macaca mulatta |
| XP_540795.4 | NA | NA | Canis lupus |
| NP_001029615.1 | Q3ZBF0.1 | NA | Bos taurus |
| NP_084497.1 | Q711T7 | NA | Mus musculus |
| NP_852145.1 | Q812E8.1 | NA | Rattus norvegicus |
| NP_001006465.1 | Q5ZMA6.1 | NA | Gallus gallus |
| NP_001120406.1 | NA | NA | Xenopus tropicalis |
| NP_001092723.1 | NA | NA | Danio rerio |
| common.name | gene.name |
|---|---|
| Human | NADSYN1 |
| Chimpanzee | NADSYN1 |
| Rhesus monkey | NADSYN1 |
| Dog | NADSYN1 |
| Cattle | NADSYN1 |
| House mouse | NADSYN1 |
| Norway rat | NADSYN1 |
| Chicken | NADSYN1 |
| Tropical clawed frog | NADSYN1 |
| Zebrafish | NADSYN1 |
All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.
# download FASTA sequences
NADSYN1_list <- compbio4all::entrez_fetch_list( db = "protein",
id = NADSYN1_df$ncbi.protein.accession,
rettype = "fasta"
)
Number of FASTA files obtained
length( NADSYN1_list )
## [1] 10
The first entry
NADSYN1_list[[1]]
## [1] ">NP_060631.2 glutamine-dependent NAD(+) synthetase [Homo sapiens]\nMGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYESDTLLHSFQVL\nAALVESPVTQDIICDVGMPVMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWSRSRHTEE\nYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELWTPHSPHIDMGLDGVEIITNASGSHQVLRKAN\nTRVDLVTMVTSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRS\nYRAEISSRNLAASRASPYPRVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQA\nGFLLPLSGGVDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMAS\nKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMGIFSLVTGKSPLFAAHGGSSRENLALQNVQARIRM\nVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQFCIQRFQL\nPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAELSVYGKLRKVAKMGPYSMFCKLLGMWRHICTPR\nQVADKVKRFFSKYSMNRHKMTTLTPAYHAENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQ\nSLDGVD\n\n"
# output should be the FASTA sequence with header information and newlines still included
Remove FASTA header
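# strip the header and newlines from each record; parse = FALSE keeps each sequence as one string
for(i in seq_along(NADSYN1_list)){
NADSYN1_list[[i]] <- compbio4all::fasta_cleaner(NADSYN1_list[[i]], parse = FALSE)
}
Specific additional cleaning steps will be done as needed for particular analyses.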
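To make the cleaning step concrete, here is a minimal base-R sketch of what removing a FASTA header and newlines involves. The function name `clean_fasta_sketch` and the toy record are hypothetical and for illustration only; this is not the compbio4all implementation, which may handle more cases.

```r
# Hypothetical helper: NOT compbio4all::fasta_cleaner(), just a sketch of the idea
clean_fasta_sketch <- function(fasta_string) {
  # split the raw record into lines
  lines <- base::strsplit(fasta_string, "\n", fixed = TRUE)[[1]]
  # drop empty lines left by trailing newlines
  lines <- lines[nzchar(lines)]
  # drop the header line, which starts with ">"
  seq_lines <- lines[!startsWith(lines, ">")]
  # join the remaining lines into one continuous sequence string
  paste(seq_lines, collapse = "")
}

# Toy record (made up), formatted like the raw entrez output shown above
toy_fasta <- ">toy_1 example protein [Toy species]\nMGRKV\nTVATC\n\n"
toy_clean <- clean_fasta_sketch(toy_fasta)
toy_clean
```

The result is a single string with no header and no newline characters, which is the form the later steps expect.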
For code see https://rpubs.com/lowbrowR/drawProtein
Q6IA69_json <- drawProteins::get_features("Q6IA69")
## [1] "Download has worked"
my_prot_df <- drawProteins::feature_to_dataframe(Q6IA69_json)
is(my_prot_df)
## [1] "data.frame" "list" "oldClass" "vector"
## [5] "list_OR_List" "vector_OR_Vector" "vector_OR_factor"
my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df,
label_size = 2.5)
my_canvas <- draw_domains(my_canvas, my_prot_df)
my_canvas
Prepare Data
NADSYN1_list[[1]]
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYESDTLLHSFQVLAALVESPVTQDIICDVGMPVMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWSRSRHTEEYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELWTPHSPHIDMGLDGVEIITNASGSHQVLRKANTRVDLVTMVTSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYPRVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQAGFLLPLSGGVDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMASKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMGIFSLVTGKSPLFAAHGGSSRENLALQNVQARIRMVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQFCIQRFQLPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAELSVYGKLRKVAKMGPYSMFCKLLGMWRHICTPRQVADKVKRFFSKYSMNRHKMTTLTPAYHAENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQSLDGVD"
NADSYN1_human_vector <- unlist(strsplit( NADSYN1_list[[1]], "" ))
seqinr::dotPlot( NADSYN1_human_vector, NADSYN1_human_vector )
TODO:
par(mfrow = c(2,2),
mar = c(0,0,2,1))
# plot 1: Defaults
seqinr::dotPlot(NADSYN1_human_vector, NADSYN1_human_vector,
wsize = 1,
nmatch = 1,
main = "size=1, num match=1")
# plot 2: size = 10, nmatch = 1
seqinr::dotPlot(NADSYN1_human_vector, NADSYN1_human_vector,
wsize = 10,
nmatch = 1,
main = "size = 10, nmatch = 1")
# plot 3: size = 10, nmatch = 5
seqinr::dotPlot(NADSYN1_human_vector, NADSYN1_human_vector,
wsize = 10,
nmatch = 5,
main = "size = 10, nmatch = 5")
# plot 4: size = 20, nmatch = 5
seqinr::dotPlot(NADSYN1_human_vector, NADSYN1_human_vector,
wsize = 20,
nmatch = 5,
main = "size = 20, nmatch = 5")
par(mfrow = c(1,1),
mar = c(4,4,4,4))
seqinr::dotPlot(NADSYN1_human_vector, NADSYN1_human_vector,
wsize = 20,
nmatch = 5,
main = "NADSYN1 human dot plot")
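The effect of the window size and match threshold can be sketched in base R. The function below is a hypothetical illustration (`dot_matrix_sketch` is not part of seqinr): it places a dot at position (i, j) when at least `nmatch` of the `wsize` residues in the two windows agree. seqinr::dotPlot() may differ in details such as window stepping, so treat this as a sketch of the idea only.

```r
# Hypothetical illustration of windowed dot-plot matching (not seqinr's code)
dot_matrix_sketch <- function(s1, s2, wsize = 1, nmatch = 1) {
  n1 <- length(s1) - wsize + 1   # number of window start positions in s1
  n2 <- length(s2) - wsize + 1   # number of window start positions in s2
  hits <- matrix(FALSE, nrow = n1, ncol = n2)
  for (i in seq_len(n1)) {
    for (j in seq_len(n2)) {
      w1 <- s1[i:(i + wsize - 1)]
      w2 <- s2[j:(j + wsize - 1)]
      # a dot is drawn when enough positions in the two windows match
      hits[i, j] <- sum(w1 == w2) >= nmatch
    }
  }
  hits
}

# Toy sequence (made up) with the motif "MGRKV" repeated at positions 1 and 8
toy_vector <- base::strsplit("MGRKVTVMGRKV", "")[[1]]
toy_hits <- dot_matrix_sketch(toy_vector, toy_vector, wsize = 3, nmatch = 3)

# the main diagonal is always filled when a sequence is compared with itself
all(diag(toy_hits))
# the repeat shows up as an off-diagonal run of dots
which(toy_hits[1, ])
```

Larger windows with a stricter threshold suppress isolated chance matches, which is why the noisy default plot cleans up as wsize and nmatch increase.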
TODO: Create table
Below are links to relevant information.
1. Pfam: http://pfam.xfam.org/protein/Q6IA69; “CN hydrolase” spans residues 6-283 and “NAD synthetase” spans residues 337-651
2. DisProt: NA
3. RepeatDB: NA
4. PDB secondary structural location: NA
The Homo sapiens homolog is listed in AlphaFold (https://alphafold.ebi.ac.uk/entry/Q6IA69). The predicted structure contains alpha helices, beta sheets, and disordered regions.
UniProt (which uses http://www.csbio.sjtu.edu) indicates that this protein is an NAD(+) synthetase that catalyzes the final step of the nicotinamide adenine dinucleotide (NAD) de novo synthesis pathway, the ATP-dependent amidation of deamido-NAD using L-glutamine as a nitrogen source.
AlphaFold indicates that there is a mix of alpha helices and beta sheets. I therefore predict that machine-learning methods will indicate an a+b or a/b structure.
NOTE: My protein does NOT contain “U”.
First, I need the data from Chou and Zhang (1994) Table 5. Code to build this table is available at https://rpubs.com/lowbrowR/843543
The table looks like this:
# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
| aa.1.1 | alpha | beta | a.plus.b | a.div.b |
|---|---|---|---|---|
| A | 285 | 203 | 175 | 361 |
| R | 53 | 67 | 78 | 146 |
| N | 97 | 139 | 120 | 183 |
| D | 163 | 121 | 111 | 244 |
| C | 22 | 75 | 74 | 63 |
| Q | 67 | 122 | 74 | 114 |
| E | 134 | 86 | 86 | 257 |
| G | 197 | 297 | 171 | 377 |
| H | 111 | 49 | 33 | 107 |
| I | 91 | 120 | 93 | 239 |
| L | 221 | 177 | 110 | 339 |
| K | 249 | 115 | 112 | 321 |
| M | 48 | 16 | 25 | 91 |
| F | 123 | 85 | 52 | 158 |
| P | 82 | 127 | 71 | 188 |
| S | 122 | 341 | 126 | 327 |
| T | 119 | 253 | 117 | 238 |
| W | 33 | 44 | 30 | 72 |
| Y | 63 | 110 | 108 | 130 |
| V | 167 | 229 | 123 | 378 |
Convert to frequencies. Table 5 therefore becomes:
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
aa.prop <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
row.names(aa.prop) <- aa.1.1
pander::pander(aa.prop)
| | alpha.prop | beta.prop | a.plus.b.prop | a.div.b |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
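As a quick check on the count-to-frequency conversion, the same operation can be sketched with base R's prop.table(). The matrix below uses only the first three rows of the table above (A, R, N) for the alpha and beta classes, so the resulting proportions are illustrative and will not match the full 20-row table.

```r
# First three rows of the counts table, for illustration only
toy_counts <- cbind(alpha = c(A = 285, R = 53, N = 97),
                    beta  = c(A = 203, R = 67, N = 139))

# divide every column by its own total (margin = 2 means "by column")
toy_props <- prop.table(toy_counts, margin = 2)
toy_props

# each column of proportions should sum to 1
colSums(toy_props)
```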
Determine the number of each amino acid in my protein.
A function to convert a table into a vector is helpful here because R is goofy about tables not being the same as vectors.
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
NADSYN1_human_table <- table(NADSYN1_human_vector)/length(NADSYN1_human_vector)
NADSYN1.human.aa.freq <- table_to_vector(NADSYN1_human_table)
NADSYN1.human.aa.freq
## A C D E F G H
## 0.07790368 0.03257790 0.05382436 0.05807365 0.03257790 0.06232295 0.02266289
## I K L M N P Q
## 0.05240793 0.03399433 0.10764873 0.02832861 0.03824363 0.04390935 0.04532578
## R S T V W Y
## 0.06515581 0.07648725 0.04957507 0.06373938 0.01841360 0.03682720
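Note that table() orders its output alphabetically by one-letter code, which differs from the amino acid order used in the Chou and Zhang table. One way to get frequencies in a chosen order, with zeros for any absent residues, is to count over a factor with explicit levels. The sketch below uses a made-up eight-residue sequence; `aa_order` and `toy_seq` are illustrative names, not objects from the analysis.

```r
# Amino acid order used in the reference table
aa_order <- c("A","R","N","D","C","Q","E","G","H","I",
              "L","K","M","F","P","S","T","W","Y","V")

# Toy sequence (made up) already split into single residues
toy_seq <- c("M","G","R","K","V","T","V","A")

# counting over a factor with fixed levels keeps the order and fills in zeros
toy_counts <- table(factor(toy_seq, levels = aa_order))
toy_freq <- as.vector(toy_counts) / length(toy_seq)
names(toy_freq) <- aa_order
toy_freq
```

Because the names now follow `aa_order`, the vector lines up row for row with a table built in that order.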
Check for the presence of “U” (selenocysteine), which is not one of the 20 amino acids in the reference table.
aa.names <- names(NADSYN1.human.aa.freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## character(0)
NADSYN1.human.aa.freq[i.U]
## named numeric(0)
Add data on my focal protein to the amino acid frequency table. table() sorts amino acids alphabetically by one-letter code, while the rows of aa.prop follow the order of the Chou and Zhang table, so the frequencies are matched to rows by name rather than by position.
# index by row name so each frequency lands on the correct amino acid
aa.prop$NADSYN1.human.aa.freq <- NADSYN1.human.aa.freq[row.names(aa.prop)]
pander::pander(aa.prop)
| | alpha.prop | beta.prop | a.plus.b.prop | a.div.b | NADSYN1.human.aa.freq |
|---|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 | 0.0779 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 | 0.06516 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 | 0.03824 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 | 0.05382 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 | 0.03258 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 | 0.04533 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 | 0.05807 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 | 0.06232 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 | 0.02266 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 | 0.05241 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 | 0.1076 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 | 0.03399 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 | 0.02833 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 | 0.03258 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 | 0.04391 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 | 0.07649 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 | 0.04958 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 | 0.01841 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 | 0.03683 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 | 0.06374 |
Two custom functions are needed: one to calculate the correlation coefficient between two columns of our table, and one to calculate cosine similarity. As written, the two formulas are algebraically identical, so they return the same values.
# Correlation used in Chou and Zhang 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
Calculate correlation between each column
corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_cor(aa.prop[,5], aa.prop[,4])
Calculate cosine similarity
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])
Calculate distance. Note: we need to flip the dataframe on its side using a command called t()
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
## A R N D C Q E G H I L
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08
## NADSYN1.human.aa.freq 0.08 0.07 0.04 0.05 0.03 0.05 0.06 0.06 0.02 0.05 0.11
## K M F P S T W Y V
## alpha.prop 0.10 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.04 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.06 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.07 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## NADSYN1.human.aa.freq 0.03 0.03 0.03 0.04 0.08 0.05 0.02 0.04 0.06
We can get a distance matrix like this
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## NADSYN1.human.aa.freq 0.10852773 0.10752353 0.08104000 0.07432042
Individual distances using dist()
dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
Compile the information. Rounding makes it easier to read
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary: flag the fold type with the highest similarity and the smallest distance
sim.sum <- ifelse(cosine.sim == max(cosine.sim), "most.sim", "")
dist.sum <- ifelse(Euclidean.dist == min(Euclidean.dist), "min.dist", "")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
Display output
pander::pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.908 | 0.908 | 0.1085 | | |
| beta | 0.9118 | 0.9118 | 0.1075 | | |
| alpha plus beta | 0.9444 | 0.9444 | 0.08104 | | |
| alpha/beta | 0.9543 | 0.9543 | 0.07432 | most.sim | min.dist |
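For readers who want to see the two measures on a small scale, here is a minimal sketch on two made-up three-residue compositions. The vectors `toy_x` and `toy_y` and the helper names are hypothetical; the point is only to show what cosine similarity and Euclidean distance compute, and that the hand-written distance agrees with base R's dist().

```r
# Toy composition vectors (made-up values, each sums to 1)
toy_x <- c(A = 0.5, G = 0.3, L = 0.2)
toy_y <- c(A = 0.4, G = 0.4, L = 0.2)

# cosine similarity: dot product divided by the product of the vector lengths
cosine_sketch <- function(u, v) sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))

# Euclidean distance: square root of the summed squared differences
euclid_sketch <- function(u, v) sqrt(sum((u - v)^2))

toy_cos  <- cosine_sketch(toy_x, toy_y)
toy_dist <- euclid_sketch(toy_x, toy_y)
toy_cos
toy_dist

# base R's dist() works on rows, hence rbind()
as.numeric(dist(rbind(toy_x, toy_y), method = "euclidean"))
```

A cosine similarity closer to 1 and a distance closer to 0 both indicate more similar compositions, so the two summaries usually, though not necessarily, point to the same fold type.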
Convert all FASTA records into entries in a single vector. FASTA entries are contained in a list produced at the beginning of the script. They were cleaned to remove the header and newline characters.
NADSYN1_list
## $NP_060631.2
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYESDTLLHSFQVLAALVESPVTQDIICDVGMPVMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWSRSRHTEEYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELWTPHSPHIDMGLDGVEIITNASGSHQVLRKANTRVDLVTMVTSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYPRVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQAGFLLPLSGGVDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMASKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMGIFSLVTGKSPLFAAHGGSSRENLALQNVQARIRMVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQFCIQRFQLPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAELSVYGKLRKVAKMGPYSMFCKLLGMWRHICTPRQVADKVKRFFSKYSMNRHKMTTLTPAYHAENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQSLDGVD"
##
## $XP_001174076.2
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYESDTLLHSFEVLAALLESPVTQDIICDVGMPVMHRNVRYNCRVIFLTGRKILLIRPKMALANEGNYRELRWFTPWSRSRHTEEYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELWTPHSPHIDMGLDGVEIITNASGSHHVLRKANTRVDLVTMVTSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYPRVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQAGFLLPLSGGVDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMASKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMGIFSLVTGKSPLFAAHGGSSRENLALQNVQARIRMVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRVFVQFCIQRFQLPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAELSVYGKLRKVAKMGPYSMFCKLLGMWRHICTPRQVADKVKRFFSKYSMNRHKMTTLTPAYHAENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQSLDGVD"
##
## $XP_001098992.2
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYESDTLLHSFQVLAALLESPVTQDIICDVGMPVMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWSRSRHTEEYLLPRMIQDLTKQETAPFGDAVLATWDTCIGSEICEELWTPHSPHIDMGLDGVEIITNASGSHHVLRKANTRVDLVTMATSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYPRVKVDFALSCHEDLLAPVSEPIEWKYHSPEEEISLGPACWLWDFLRRSQQGGFLLPLSGGVDSAATACLVYSMCCQVCKSVRSGNQEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMASKNSSQETCTRARELAQQIGRWILYVRTVEGEHLSREERLGSIWNVPSGALGQSLQNVQARIRMVLAYLFAQLSLWSRGIRGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQFCIERFQLTALQSIVSAPATAELEPLADGQVSQTDEEDMGMTYAELSVYGKLRKVAKMGPYSMFCKLLGMWRHVCTPRQVADKVKWFFTKHSMNRHKMTTLTPAYHAENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAAPQSLDGVD"
##
## $XP_540795.4
## [1] "MGRKVTVAACALNQWALDFQGNLQRILKSIEIAKRKGARYRLGPELEICGYGCWDHYYESDTLLHSLQVLAALLESPVTQDIICDVGMPVLHRNVRYNCRVIFLNRRILLIRPKMALANEGNYRELRWFTPWSRSRQTEEYFLPRMIQDVTKQETVPFGDAVLATRDTCIGSEICEELWTPHSPHVDMGLDGVEIFTNASGSHHVLRKAHARVDLVTMATTKNGGIYLLANQKGCDGDRLYYDGCALIAMNGHIFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRVSPYPRVKVDFALSCREDLLEPPSEPVEWMYHSPAEEISLGPACWLWDFLRRSRQAGFFLPLSGGVDSAATACLVYSMCRQVCEAVRNGNQEVLADVRAIVDQLSYTPQDPRDLCGRLLTTCYMASENSSQETCDRAKELARQIGSHHIGLNIDPAVTAVVGIFSLVTGKRPLFAAHGGSSRENLALQNVQARLRMVLAYLFAQLSLWARGARGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLKAFVHFCMEHFQLPALQRILAAPATAELEPLTDGQVSQTDEEDMGMTYAELSVYGRLRKIAKAGPYSMFCKLVNMWKDACSPRQVADKVRQFFSKYAMNRHKMTTLTPAYHAESYSPDDNRFDLRPFLYNSSWPWQFRCIEDQVHQLESRGPQDLDGVD"
##
## $NP_001029615.1
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKHRGARYRLGPELEICGYGCWDHYYESDTLLHSLQVLAALLESPVTQDIICDVGMPVMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWSRSRQTEEYFLPRMLQDLTKQETVPFGDAVLSTWDTCIGSEVCEELWTPHSPHVDMGLDGVEIFTNASGSHHVLRKAHARVDLVTMATTKNGGIYLLANQKGCDGDRLYYDGCALIAMNGSIFAQGSQFSLDDVEVLTATLDLEDIRSYRAEISSRNLAASRVSPYPRVKVDFALSCHEDLLEPVSEPIEWKYHSPAEEISLGPACWLWDFLRRSRQAGFFLPLSGGVDSAATACLVYSMCHQVCEAVKRGNLEVLADVRTIVNQLSYTPQDPRELCGRVLTTCYMASENSSQETCDRARELAQQIGSHHIGLHIDPVVKALVGLFSLVTGASPRFAVHGGSDRENLALQNVQARVRMVIAYLFAQLSLWSRGAPGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQLCVERFQLPALQSILAAPATAELEPLAHGRVSQTDEEDMGMTYAELSVYGRLRKVAKTGPYSMFCKLLDMWRDTCSPRQVADKVKCFFSKYSMNRHKMTTLTPAYHAESYSPDDNRFDLRPFLYNTRWPWQFRCIENQVLQLEGRQRQELDGVD"
##
## $NP_084497.1
## [1] "MGRKVTVATCALNQWALDFEGNFQRILKSIQIAKGKGARYRLGPELEICGYGCWDHYHESDTLLHSLQVLAALLDSPVTQDIICDVGMPIMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWTRSRQTEEYVLPRMLQDLTKQKTVPFGDVVLATQDTCVGSEICEELWTPRSPHIDMGLDGVEIITNASGSHHVLRKAHTRVDLVTMATSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSIFAQGTQFSLDDVEVLTATLDLEDVRSYKAEISSRNLEATRVSPYPRVTVDFALSVSEDLLEPVSEPMEWTYHRPEEEISLGPACWLWDFLRRSKQAGFFLPLSGGVDSAASACIVYSMCCLVCDAVKSGNQQVLTDVQNLVDESSYTPQDPRELCGRLLTTCYMASENSSQETHSRATKLAQLIGSYHINLSIDTAVKAVLGIFSLMTGKLPRFSAHGGSSRENLALQNVQARIRMVLAYLFAQLSLWSRGARGSLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQFCAERFQLPVLQTILSAPATAELEPLADGQVSQMDEEDMGMTYAELSIFGRLRKVAKAGPYSMFCKLLNMWRDSYTPTQVAEKVKLFFSKYSMNRHKMTTLTPAYHAENYSPDDNRFDLRPFLYNTRWPWQFLCIDNQVLQLERKASQTREEQVLEHFKEPSPIWKQLLPKDP"
##
## $NP_852145.1
## [1] "MGRKVTVATCALNQWALDFEGNFQRILKSIQIAKGKGARYRLGPELEICGYGCWDHYHESDTLLHSLQVLAALLDAPATQDIICDVGMPIMHRNVRYNCLVIFLNRKILLIRPKMALANEGNYRELRWFTPWARSRQTEEYVLPRMLQDLTKQETVPFGDVVLATQDTCIGSEICEELWTPCSPHVNMGLDGVEIITNASGSHHVLRKAHTRVDLVTMATSKNGGIYLLANQKGCDGHLLYYDGCAMIAMNGSIFAQGTQFSLDDVEVLTATLDLEDVRSYRAKISSRNLEATRVNPYPRVTVDFALSVSEDLLEPVSEPVEWTYHRPEEEISLGPACWLWDFLRRNNQAGFFLPLSGGVDSAASACVVYSMCCLVCEAVKSGNQQVLTDVQNLVDESSYTPQDPRELCGRLLTTCYMASENSSQETHNRATELAQQIGSYHISLNIDPAVKAILGIFSLVTGKFPRFSAHGGSSRENLALQNVQARIRMVLAYLFAQLSLWSRGARGSLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQLCAERFQLPVLQAILSAPATAELEPLADGQVSQMDEEDMGMTYTELSIFGRLRKVAKAGPYSMFCKLLNMWKDSCTPRQVAEKVKRFFSKYSINRHKMTTLTPAYHAENYSPDDNRFDLRPFLYNTRWPWQFLCIDNQVVQLERKTSQTLEEQIQEHFKEPSPIWKQLLPKDP"
##
## $NP_001006465.1
## [1] "MGRAVSVAACALNQWALDFEGNAERILRSISIAKSKGARYRLGPELEICGYGCADHYYESDTLLHSFQVLAKLLESPATQDIICDVGMPLMHRNVRYNCRVIFLNKKILLIRPKISLANAGNYRELRWFTPWNKARHVEEYLLPRIIQEVTGQDTVPFGDAVLATKDTCLGTEICEELWAPNSPHIEMGLDGVEIFTNSSGSHHVLRKAHTRVDLVNSATAKNGGIYILSNQKGCDGDRLYYDGCAMISMNGETVAQGSQFSLDDVEVLVATLDLEDVRSYRAEISSRNLAASKVNPFPRIKVNFALSCSDDLSVPICVPIQWRHHSPEEEICLGPACWLWDYLRRSKQAGFLLPLSGGIDSSATACIVYSMCRQVCLAVKNGNSEVLADARKIVHDETYIPEDPQEFCKRVFTTCYMASENSSQDTRNRAKLLAEQIGSYHINLNIDAAVKAIVGIFSMVTGRTPRFSVYGGSRRENLALQNVQARVRMVPAYLFAQLTLWTRGMPGGLLVLGSANVDESLRGYLTKYDCSSADINPIGGISKTDLKNFIQYCIENFQLTALRSIMAAPPTAELEPLMDGQVSQTDEADMGMTYAELSIYGKLRKIAKAGPYSMFCKLINLWKEICTPREVASKVKHFFRMYSVNRHKMTTLTPSYHAENYSPDDNRFDLRPFLYNTTWSWQFRCIDNQVSHLEKKEGISVAEDTD"
##
## $NP_001120406.1
## [1] "MGRKVTVATCALNQWALDFEGNLNRILRSISIAKEKKARYRLGPELEICGYGCSDHFYESDTIFHSFQVLAKLLESPETTDIICDVGMPVMHKNVRYNCRVIFLNRKILLIRPKMVMANEGNYRELRWFTPWSRIREVEDFFLPRTIQCITGQITVPFGDAVIATKDTCVGTEICEELWAPNSPHIDMGLDGVEIITNGSASHHELRKAYLRVDLIKSTTAKNGGIYLLSNMKGCDSDRLYFDGCAMVSLNGDIVAQGSQFSLTDVEVLTATLDLEDVRSYRAQISSRCISASRVRPFHRVHVDFSLSSFDDLDLPTNDLIQWKYHTPEEEISLGPACWLWDYLRRSKQSGFLLPLSGGVDSSAVACIVYSMCTLVCEAVATGNGDVLTEVQGIVQDDTYLPTSPQDLCKRILTTCYMASENSSQDTHDRAKHLAEQIGSYHLTPKIDGAVKAIMNIFQVVTGKVPKFRAHGGSGRENLALQNVQARIRMVIAYLFAQLSLWARGLEGGLLVLGSANVDESLRGYLTKYDCSSADLNPIGGISKTDLRGFIQYSIDRFQLHALKGIMSAPPTAELEPLTDGKVSQTDEDDMGMTYAELSVYGKLRKVLKAGPYSMFCKLLLMWKNICTPKQVADKVKHFFRTYSINRHKMTTLTPAYHAESYSPDDNRFDLRPFLYNTAWNWQFRCIDNEVSHLERNRDANISEEID"
##
## $NP_001092723.1
## [1] "MGRKVTLATCSLNQWALDFDGNLGRILKSIEIAKQKGAKYRLGPELEICGYGCADHFYESDTLLHCFQVLKSLLESPLTQDIICDVGMPVMHHNVRYNCRVIFLNKKILFIRPKMLLANYGNNREFRWFSPWSRPRYVEEYFLPRMIQDVTEQSTVPFGDVVLSTIDTCIGSEICAELWNPRSPHVDMGLDGIEIFTNSSASYHELRKADHRVNLVKSATTKSGGIYMFANQRGCDGDRLYYDGCAMIAINGDIVARGAQFSLEDVEVVTATLDLEDVRSYRGERCHPHMEYEHKPYQRIKTDFSLSDCDDRCLPTHQPVEWIFHTPEEEISLGPACWLWDYLRRSGQAGFLLPLSGGVDSSSSACIVYSMCVQICQAVEHGNCQVLEDVQRVVGDSSYRPQDPRELCGRLFTTCYMASENSSEDTRNRAKDLAAQIGSNHLNINIDMAVKAMLGIFSMVTGKWPQFRANGGSARENLALQNVQARIRMVLAYLFAQLCLWAQGKTGGLLVLGSANVDESLTGYFTKYDCSSADINPIGGVSKTDLKGFLEYCVKRLQLTSLIGILEAPPTAELEPLTDGKVVQTDEADMGMTYSELSVIGRLRKISKCGPYSMFCKLISSWKDTFSPSQVATKVKHFFRMYSINRHKMTTVTPSYHADSYGPDDNRFDLRPFLYNTRWSWQFRCIDNEVAKME"
names(NADSYN1_list)
## [1] "NP_060631.2" "XP_001174076.2" "XP_001098992.2" "XP_540795.4"
## [5] "NP_001029615.1" "NP_084497.1" "NP_852145.1" "NP_001006465.1"
## [9] "NP_001120406.1" "NP_001092723.1"
length( NADSYN1_list )
## [1] 10
Each entry is a full entry with no spaces or parsing, and no header
NADSYN1_list[1]
## $NP_060631.2
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYESDTLLHSFQVLAALVESPVTQDIICDVGMPVMHRNVRYNCRVIFLNRKILLIRPKMALANEGNYRELRWFTPWSRSRHTEEYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELWTPHSPHIDMGLDGVEIITNASGSHQVLRKANTRVDLVTMVTSKNGGIYLLANQKGCDGDRLYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYPRVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQAGFLLPLSGGVDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMASKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMGIFSLVTGKSPLFAAHGGSSRENLALQNVQARIRMVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINPIGGISKTDLRAFVQFCIQRFQLPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAELSVYGKLRKVAKMGPYSMFCKLLGMWRHICTPRQVADKVKRFFSKYSMNRHKMTTLTPAYHAENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQSLDGVD"
Make each entry of the list into a vector. There are several ways to do this.
NADSYN1_vector <- unlist( NADSYN1_list )
Name the vector
names( NADSYN1_list )
## [1] "NP_060631.2" "XP_001174076.2" "XP_001098992.2" "XP_540795.4"
## [5] "NP_001029615.1" "NP_084497.1" "NP_852145.1" "NP_001006465.1"
## [9] "NP_001120406.1" "NP_001092723.1"
names( NADSYN1_vector ) <- names( NADSYN1_list )
Do pairwise alignments for humans, chimps and two other species.
NADSYN1_human <- NADSYN1_vector["NP_060631.2"]
NADSYN1_chimp <- NADSYN1_vector["XP_001174076.2"]
NADSYN1_rhesusmonkey <- NADSYN1_vector["XP_001098992.2"]
NADSYN1_cattle <- NADSYN1_vector["NP_001029615.1"]
align.human.chimp <- Biostrings::pairwiseAlignment(NADSYN1_human, NADSYN1_chimp)
align.human.rhesusmonkey <- Biostrings::pairwiseAlignment(NADSYN1_human, NADSYN1_rhesusmonkey)
align.human.cattle <- Biostrings::pairwiseAlignment(NADSYN1_human, NADSYN1_cattle)
align.chimp.rhesusmonkey <- Biostrings::pairwiseAlignment(NADSYN1_chimp, NADSYN1_rhesusmonkey)
align.chimp.cattle <- Biostrings::pairwiseAlignment(NADSYN1_chimp, NADSYN1_cattle)
align.rhesusmonkey.cattle <- Biostrings::pairwiseAlignment(NADSYN1_rhesusmonkey, NADSYN1_cattle)
Build matrix. Biostrings::pid() returns a percentage, so the diagonal is 100. Each row holds that species' PID against the species named in the columns.
pids <- c(100, NA, NA, NA,
Biostrings::pid(align.human.chimp), 100, NA, NA,
Biostrings::pid(align.human.rhesusmonkey), Biostrings::pid(align.chimp.rhesusmonkey), 100, NA,
Biostrings::pid(align.human.cattle), Biostrings::pid(align.chimp.cattle), Biostrings::pid(align.rhesusmonkey.cattle), 100)
mat <- matrix(pids, nrow = 4, byrow = TRUE)
row.names(mat) <- c("Homo","Chimp","Rhesus","Cattle")
colnames(mat) <- c("Homo","Chimp","Rhesus","Cattle")
pander::pander(mat)
| | Homo | Chimp | Rhesus | Cattle |
|---|---|---|---|---|
| Homo | 100 | NA | NA | NA |
| Chimp | 99.15 | 100 | NA | NA |
| Rhesus | 91.7 | 91.43 | 100 | NA |
| Cattle | 90.37 | 90.1 | 86.26 | 100 |
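Typing a triangular matrix by hand makes it easy to put a value in the wrong cell. A sketch of an alternative is to fill the matrix by row and column name in a loop. The example below is self-contained and uses made-up ten-residue sequences with a simple ungapped identity function (`toy_pid`, hypothetical); in the real analysis the value would come from Biostrings::pid() on a pairwise alignment.

```r
# Toy sequences (made up, equal length so no alignment is needed)
toy_seqs <- c(Homo   = "MGRKVTVATC",
              Chimp  = "MGRKVTVATC",
              Rhesus = "MGRKVTVAAC",
              Cattle = "MGRAVTVAAC")

# Hypothetical stand-in for Biostrings::pid(): percent of identical positions
toy_pid <- function(a, b) {
  x <- base::strsplit(a, "")[[1]]
  y <- base::strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  100 * mean(x == y)
}

# empty matrix with species names on both dimensions; diagonal is 100 percent
sp <- names(toy_seqs)
pid_mat <- matrix(NA_real_, nrow = length(sp), ncol = length(sp),
                  dimnames = list(sp, sp))
diag(pid_mat) <- 100

# every unordered pair of species, one pair per column
pairs <- combn(sp, 2)
for (k in seq_len(ncol(pairs))) {
  first  <- pairs[1, k]
  second <- pairs[2, k]
  # fill the lower triangle: row = later species, column = earlier species
  pid_mat[second, first] <- toy_pid(toy_seqs[[first]], toy_seqs[[second]])
}
pid_mat
```

Because each cell is addressed by name, the value for a given pair cannot end up under the wrong species.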
Compare different PID methods. I did this for humans vs. chimps.
PID1 <- Biostrings::pid(align.human.chimp, type="PID1")
PID2 <- Biostrings::pid(align.human.chimp, type="PID2")
PID3 <- Biostrings::pid(align.human.chimp, type="PID3")
PID4 <- Biostrings::pid(align.human.chimp, type="PID4")
method <- c("PID1", "PID2", "PID3", "PID4")
PID <- c( PID1, PID2, PID3, PID4 )
pid.comparison <- data.frame( method, PID )
pander::pander(pid.comparison)
| method | PID |
|---|---|
| PID1 | 99.15 |
| PID2 | 99.29 |
| PID3 | 99.29 |
| PID4 | 99.22 |
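The four values differ only in their denominators. The sketch below reproduces them from counts, using the PID definitions as I understand them from the Biostrings pid() documentation (treat the definitions as an assumption and check the help page). The counts are also assumptions, inferred by comparing the human and chimp sequences printed in this document: 706 and 707 residues, 706 aligned positions, one internal gap position, and 701 identical positions.

```r
# Assumed counts, inferred from the sequences shown in this document
n_identical    <- 701   # positions where both residues are the same
n_aligned      <- 706   # positions where both sequences have a residue
n_internal_gap <- 1     # gap positions inside the alignment
len_human      <- 706
len_chimp      <- 707

# Same numerator each time; only the denominator changes
pid_sketch <- c(
  PID1 = 100 * n_identical / (n_aligned + n_internal_gap),
  PID2 = 100 * n_identical / n_aligned,
  PID3 = 100 * n_identical / min(len_human, len_chimp),
  PID4 = 100 * n_identical / mean(c(len_human, len_chimp))
)
round(pid_sketch, 2)
```

PID1 is the lowest because it counts the gap position in its denominator, which is why the choice of method matters more for alignments with many gaps.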
For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.
NADSYN1_vector_ss <- Biostrings::AAStringSet( NADSYN1_vector )
NADSYN1_align <- msa(NADSYN1_vector_ss, method = "ClustalW")
## use default substitution matrix
msa produces a special MSA object
class( NADSYN1_align )
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is( NADSYN1_align )
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment" "MsaMetaData"
## [4] "MultipleAlignment"
Default output of MSA
NADSYN1_align
## CLUSTAL 2.1
##
## Call:
## msa(NADSYN1_vector_ss, method = "ClustalW")
##
## MsaAAMultipleAlignment with 10 rows and 727 columns
## aln names
## [1] MGRKVTVATCALNQWALDFEGNFQR...TREEQVLEHFKEPSPIWKQLLPKDP NP_084497.1
## [2] MGRKVTVATCALNQWALDFEGNFQR...TLEEQIQEHFKEPSPIWKQLLPKDP NP_852145.1
## [3] MGRKVTVATCALNQWALDFEGNLQR...SLDGVD------------------- NP_060631.2
## [4] MGRKVTVATCALNQWALDFEGNLQR...SLDGVD------------------- XP_001174076.2
## [5] MGRKVTVATCALNQWALDFEGNLQR...SLDGVD------------------- XP_001098992.2
## [6] MGRKVTVAACALNQWALDFQGNLQR...DLDGVD------------------- XP_540795.4
## [7] MGRKVTVATCALNQWALDFEGNLQR...ELDGVD------------------- NP_001029615.1
## [8] MGRAVSVAACALNQWALDFEGNAER...SVAEDTD------------------ NP_001006465.1
## [9] MGRKVTVATCALNQWALDFEGNLNR...NISEEID------------------ NP_001120406.1
## [10] MGRKVTLATCSLNQWALDFDGNLGR...------------------------- NP_001092723.1
## Con MGRKVTVATCALNQWALDFEGNLQR...?LDGVD------------------- Consensus
Change class of alignment
class(NADSYN1_align) <- "AAMultipleAlignment"
Convert to seqinr format
NADSYN1_align_seqinr <- msaConvert(NADSYN1_align, type = "seqinr::alignment")
OPTIONAL: show output with print_msa
compbio4all::print_msa(NADSYN1_align_seqinr)
## [1] "MGRKVTVATCALNQWALDFEGNFQRILKSIQIAKGKGARYRLGPELEICGYGCWDHYHES 0"
## [1] "MGRKVTVATCALNQWALDFEGNFQRILKSIQIAKGKGARYRLGPELEICGYGCWDHYHES 0"
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYES 0"
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYES 0"
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKNRGARYRLGPELEICGYGCWDHYYES 0"
## [1] "MGRKVTVAACALNQWALDFQGNLQRILKSIEIAKRKGARYRLGPELEICGYGCWDHYYES 0"
## [1] "MGRKVTVATCALNQWALDFEGNLQRILKSIEIAKHRGARYRLGPELEICGYGCWDHYYES 0"
## [1] "MGRAVSVAACALNQWALDFEGNAERILRSISIAKSKGARYRLGPELEICGYGCADHYYES 0"
## [1] "MGRKVTVATCALNQWALDFEGNLNRILRSISIAKEKKARYRLGPELEICGYGCSDHFYES 0"
## [1] "MGRKVTLATCSLNQWALDFDGNLGRILKSIEIAKQKGAKYRLGPELEICGYGCADHFYES 0"
## [1] " "
## [1] "DTLLHSLQVLAALLDSPVTQDIICDVGMPIMHRNVRYNCRVIFLN-RKILLIRPKMALAN 0"
## [1] "DTLLHSLQVLAALLDAPATQDIICDVGMPIMHRNVRYNCLVIFLN-RKILLIRPKMALAN 0"
## [1] "DTLLHSFQVLAALVESPVTQDIICDVGMPVMHRNVRYNCRVIFLN-RKILLIRPKMALAN 0"
## [1] "DTLLHSFEVLAALLESPVTQDIICDVGMPVMHRNVRYNCRVIFLTGRKILLIRPKMALAN 0"
## [1] "DTLLHSFQVLAALLESPVTQDIICDVGMPVMHRNVRYNCRVIFLN-RKILLIRPKMALAN 0"
## [1] "DTLLHSLQVLAALLESPVTQDIICDVGMPVLHRNVRYNCRVIFLN-RRILLIRPKMALAN 0"
## [1] "DTLLHSLQVLAALLESPVTQDIICDVGMPVMHRNVRYNCRVIFLN-RKILLIRPKMALAN 0"
## [1] "DTLLHSFQVLAKLLESPATQDIICDVGMPLMHRNVRYNCRVIFLN-KKILLIRPKISLAN 0"
## [1] "DTIFHSFQVLAKLLESPETTDIICDVGMPVMHKNVRYNCRVIFLN-RKILLIRPKMVMAN 0"
## [1] "DTLLHCFQVLKSLLESPLTQDIICDVGMPVMHHNVRYNCRVIFLN-KKILFIRPKMLLAN 0"
## [1] " "
## [1] "EGNYRELRWFTPWTRSRQTEEYVLPRMLQDLTKQKTVPFGDVVLATQDTCVGSEICEELW 0"
## [1] "EGNYRELRWFTPWARSRQTEEYVLPRMLQDLTKQETVPFGDVVLATQDTCIGSEICEELW 0"
## [1] "EGNYRELRWFTPWSRSRHTEEYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELW 0"
## [1] "EGNYRELRWFTPWSRSRHTEEYFLPRMIQDLTKQETVPFGDAVLVTWDTCIGSEICEELW 0"
## [1] "EGNYRELRWFTPWSRSRHTEEYLLPRMIQDLTKQETAPFGDAVLATWDTCIGSEICEELW 0"
## [1] "EGNYRELRWFTPWSRSRQTEEYFLPRMIQDVTKQETVPFGDAVLATRDTCIGSEICEELW 0"
## [1] "EGNYRELRWFTPWSRSRQTEEYFLPRMLQDLTKQETVPFGDAVLSTWDTCIGSEVCEELW 0"
## [1] "AGNYRELRWFTPWNKARHVEEYLLPRIIQEVTGQDTVPFGDAVLATKDTCLGTEICEELW 0"
## [1] "EGNYRELRWFTPWSRIREVEDFFLPRTIQCITGQITVPFGDAVIATKDTCVGTEICEELW 0"
## [1] "YGNNREFRWFSPWSRPRYVEEYFLPRMIQDVTEQSTVPFGDVVLSTIDTCIGSEICAELW 0"
## [1] " "
## [1] "TPRSPHIDMGLDGVEIITNASGSHHVLRKAHTRVDLVTMATSKNGGIYLLANQKGCDGDR 0"
## [1] "TPCSPHVNMGLDGVEIITNASGSHHVLRKAHTRVDLVTMATSKNGGIYLLANQKGCDGHL 0"
## [1] "TPHSPHIDMGLDGVEIITNASGSHQVLRKANTRVDLVTMVTSKNGGIYLLANQKGCDGDR 0"
## [1] "TPHSPHIDMGLDGVEIITNASGSHHVLRKANTRVDLVTMVTSKNGGIYLLANQKGCDGDR 0"
## [1] "TPHSPHIDMGLDGVEIITNASGSHHVLRKANTRVDLVTMATSKNGGIYLLANQKGCDGDR 0"
## [1] "TPHSPHVDMGLDGVEIFTNASGSHHVLRKAHARVDLVTMATTKNGGIYLLANQKGCDGDR 0"
## [1] "TPHSPHVDMGLDGVEIFTNASGSHHVLRKAHARVDLVTMATTKNGGIYLLANQKGCDGDR 0"
## [1] "APNSPHIEMGLDGVEIFTNSSGSHHVLRKAHTRVDLVNSATAKNGGIYILSNQKGCDGDR 0"
## [1] "APNSPHIDMGLDGVEIITNGSASHHELRKAYLRVDLIKSTTAKNGGIYLLSNMKGCDSDR 0"
## [1] "NPRSPHVDMGLDGIEIFTNSSASYHELRKADHRVNLVKSATTKSGGIYMFANQRGCDGDR 0"
## [1] " "
## [1] "LYYDGCAMIAMNGSIFAQGTQFSLDDVEVLTATLDLEDVRSYKAEISSRNLEATRVSPYP 0"
## [1] "LYYDGCAMIAMNGSIFAQGTQFSLDDVEVLTATLDLEDVRSYRAKISSRNLEATRVNPYP 0"
## [1] "LYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYP 0"
## [1] "LYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYP 0"
## [1] "LYYDGCAMIAMNGSVFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRASPYP 0"
## [1] "LYYDGCALIAMNGHIFAQGSQFSLDDVEVLTATLDLEDVRSYRAEISSRNLAASRVSPYP 0"
## [1] "LYYDGCALIAMNGSIFAQGSQFSLDDVEVLTATLDLEDIRSYRAEISSRNLAASRVSPYP 0"
## [1] "LYYDGCAMISMNGETVAQGSQFSLDDVEVLVATLDLEDVRSYRAEISSRNLAASKVNPFP 0"
## [1] "LYFDGCAMVSLNGDIVAQGSQFSLTDVEVLTATLDLEDVRSYRAQISSRCISASRVRPFH 0"
## [1] "LYYDGCAMIAINGDIVARGAQFSLEDVEVVTATLDLEDVRSYRGERCHPHMEYEHK-PYQ 0"
## [1] " "
## [1] "RVTVDFALSVSEDLLEPVSEPMEWTYHRPEEEISLGPACWLWDFLRRSKQAGFFLPLSGG 0"
## [1] "RVTVDFALSVSEDLLEPVSEPVEWTYHRPEEEISLGPACWLWDFLRRNNQAGFFLPLSGG 0"
## [1] "RVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQAGFLLPLSGG 0"
## [1] "RVKVDFALSCHEDLLAPISEPIEWKYHSPEEEISLGPACWLWDFLRRSQQAGFLLPLSGG 0"
## [1] "RVKVDFALSCHEDLLAPVSEPIEWKYHSPEEEISLGPACWLWDFLRRSQQGGFLLPLSGG 0"
## [1] "RVKVDFALSCREDLLEPPSEPVEWMYHSPAEEISLGPACWLWDFLRRSRQAGFFLPLSGG 0"
## [1] "RVKVDFALSCHEDLLEPVSEPIEWKYHSPAEEISLGPACWLWDFLRRSRQAGFFLPLSGG 0"
## [1] "RIKVNFALSCSDDLSVPICVPIQWRHHSPEEEICLGPACWLWDYLRRSKQAGFLLPLSGG 0"
## [1] "RVHVDFSLSSFDDLDLPTNDLIQWKYHTPEEEISLGPACWLWDYLRRSKQSGFLLPLSGG 0"
## [1] "RIKTDFSLSDCDDRCLPTHQPVEWIFHTPEEEISLGPACWLWDYLRRSGQAGFLLPLSGG 0"
## [1] " "
## [1] "VDSAASACIVYSMCCLVCDAVKSGNQQVLTDVQNLVDESSYTPQDPRELCGRLLTTCYMA 0"
## [1] "VDSAASACVVYSMCCLVCEAVKSGNQQVLTDVQNLVDESSYTPQDPRELCGRLLTTCYMA 0"
## [1] "VDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMA 0"
## [1] "VDSAATACLIYSMCCQVCEAVRSGNEEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMA 0"
## [1] "VDSAATACLVYSMCCQVCKSVRSGNQEVLADVRTIVNQISYTPQDPRDLCGRILTTCYMA 0"
## [1] "VDSAATACLVYSMCRQVCEAVRNGNQEVLADVRAIVDQLSYTPQDPRDLCGRLLTTCYMA 0"
## [1] "VDSAATACLVYSMCHQVCEAVKRGNLEVLADVRTIVNQLSYTPQDPRELCGRVLTTCYMA 0"
## [1] "IDSSATACIVYSMCRQVCLAVKNGNSEVLADARKIVHDETYIPEDPQEFCKRVFTTCYMA 0"
## [1] "VDSSAVACIVYSMCTLVCEAVATGNGDVLTEVQGIVQDDTYLPTSPQDLCKRILTTCYMA 0"
## [1] "VDSSSSACIVYSMCVQICQAVEHGNCQVLEDVQRVVGDSSYRPQDPRELCGRLFTTCYMA 0"
## [1] " "
## [1] "SENSSQETHSRATKLAQLIGSYHINLSIDTAVKAVLG-IFSLMTGKLPRFSAHGGSSREN 0"
## [1] "SENSSQETHNRATELAQQIGSYHISLNIDPAVKAILG-IFSLVTGKFPRFSAHGGSSREN 0"
## [1] "SKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMG-IFSLVTGKSPLFAAHGGSSREN 0"
## [1] "SKNSSQETCTRARELAQQIGSHHISLNIDPAVKAVMG-IFSLVTGKSPLFAAHGGSSREN 0"
## [1] "SKNSSQETCTRARELAQQIGRWIL------YVRTVEGEHLSREERLGSIWNVPSGALG-- 0"
## [1] "SENSSQETCDRAKELARQIGSHHIGLNIDPAVTAVVG-IFSLVTGKRPLFAAHGGSSREN 0"
## [1] "SENSSQETCDRARELAQQIGSHHIGLHIDPVVKALVG-LFSLVTGASPRFAVHGGSDREN 0"
## [1] "SENSSQDTRNRAKLLAEQIGSYHINLNIDAAVKAIVG-IFSMVTGRTPRFSVYGGSRREN 0"
## [1] "SENSSQDTHDRAKHLAEQIGSYHLTPKIDGAVKAIMN-IFQVVTGKVPKFRAHGGSGREN 0"
## [1] "SENSSEDTRNRAKDLAAQIGSNHLNINIDMAVKAMLG-IFSMVTGKWPQFRANGGSAREN 0"
## [1] " "
## [1] "LALQNVQARIRMVLAYLFAQLSLWSRGARGSLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "LALQNVQARIRMVLAYLFAQLSLWSRGARGSLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "LALQNVQARIRMVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "LALQNVQARIRMVLAYLFAQLSLWSRGVHGGLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "QSLQNVQARIRMVLAYLFAQLSLWSRGIRGGLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "LALQNVQARLRMVLAYLFAQLSLWARGARGGLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "LALQNVQARVRMVIAYLFAQLSLWSRGAPGGLLVLGSANVDESLLGYLTKYDCSSADINP 0"
## [1] "LALQNVQARVRMVPAYLFAQLTLWTRGMPGGLLVLGSANVDESLRGYLTKYDCSSADINP 0"
## [1] "LALQNVQARIRMVIAYLFAQLSLWARGLEGGLLVLGSANVDESLRGYLTKYDCSSADLNP 0"
## [1] "LALQNVQARIRMVLAYLFAQLCLWAQGKTGGLLVLGSANVDESLTGYFTKYDCSSADINP 0"
## [1] " "
## [1] "IGGISKTDLRAFVQFCAERFQLPVLQTILSAPATAELEPLADGQVSQMDEEDMGMTYAEL 0"
## [1] "IGGISKTDLRAFVQLCAERFQLPVLQAILSAPATAELEPLADGQVSQMDEEDMGMTYTEL 0"
## [1] "IGGISKTDLRAFVQFCIQRFQLPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAEL 0"
## [1] "IGGISKTDLRVFVQFCIQRFQLPALQSILLAPATAELEPLADGQVSQTDEEDMGMTYAEL 0"
## [1] "IGGISKTDLRAFVQFCIERFQLTALQSIVSAPATAELEPLADGQVSQTDEEDMGMTYAEL 0"
## [1] "IGGISKTDLKAFVHFCMEHFQLPALQRILAAPATAELEPLTDGQVSQTDEEDMGMTYAEL 0"
## [1] "IGGISKTDLRAFVQLCVERFQLPALQSILAAPATAELEPLAHGRVSQTDEEDMGMTYAEL 0"
## [1] "IGGISKTDLKNFIQYCIENFQLTALRSIMAAPPTAELEPLMDGQVSQTDEADMGMTYAEL 0"
## [1] "IGGISKTDLRGFIQYSIDRFQLHALKGIMSAPPTAELEPLTDGKVSQTDEDDMGMTYAEL 0"
## [1] "IGGVSKTDLKGFLEYCVKRLQLTSLIGILEAPPTAELEPLTDGKVVQTDEADMGMTYSEL 0"
## [1] " "
## [1] "SIFGRLRKVAKAGPYSMFCKLLNMWRDSYTPTQVAEKVKLFFSKYSMNRHKMTTLTPAYH 0"
## [1] "SIFGRLRKVAKAGPYSMFCKLLNMWKDSCTPRQVAEKVKRFFSKYSINRHKMTTLTPAYH 0"
## [1] "SVYGKLRKVAKMGPYSMFCKLLGMWRHICTPRQVADKVKRFFSKYSMNRHKMTTLTPAYH 0"
## [1] "SVYGKLRKVAKMGPYSMFCKLLGMWRHICTPRQVADKVKRFFSKYSMNRHKMTTLTPAYH 0"
## [1] "SVYGKLRKVAKMGPYSMFCKLLGMWRHVCTPRQVADKVKWFFTKHSMNRHKMTTLTPAYH 0"
## [1] "SVYGRLRKIAKAGPYSMFCKLVNMWKDACSPRQVADKVRQFFSKYAMNRHKMTTLTPAYH 0"
## [1] "SVYGRLRKVAKTGPYSMFCKLLDMWRDTCSPRQVADKVKCFFSKYSMNRHKMTTLTPAYH 0"
## [1] "SIYGKLRKIAKAGPYSMFCKLINLWKEICTPREVASKVKHFFRMYSVNRHKMTTLTPSYH 0"
## [1] "SVYGKLRKVLKAGPYSMFCKLLLMWKNICTPKQVADKVKHFFRTYSINRHKMTTLTPAYH 0"
## [1] "SVIGRLRKISKCGPYSMFCKLISSWKDTFSPSQVATKVKHFFRMYSINRHKMTTVTPSYH 0"
## [1] " "
## [1] "AENYSPDDNRFDLRPFLYNTRWPWQFLCIDNQVLQLERKASQTREEQVLEHFKEPSPIWK 0"
## [1] "AENYSPDDNRFDLRPFLYNTRWPWQFLCIDNQVVQLERKTSQTLEEQIQEHFKEPSPIWK 0"
## [1] "AENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQSLDGVD------------ 0"
## [1] "AENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAEPQSLDGVD------------ 0"
## [1] "AENYSPEDNRFDLRPFLYNTSWPWQFRCIENQVLQLERAAPQSLDGVD------------ 0"
## [1] "AESYSPDDNRFDLRPFLYNSSWPWQFRCIEDQVHQLESRGPQDLDGVD------------ 0"
## [1] "AESYSPDDNRFDLRPFLYNTRWPWQFRCIENQVLQLEGRQRQELDGVD------------ 0"
## [1] "AENYSPDDNRFDLRPFLYNTTWSWQFRCIDNQVSHLEKKEGISVAEDTD----------- 0"
## [1] "AESYSPDDNRFDLRPFLYNTAWNWQFRCIDNEVSHLERNRDANISEEID----------- 0"
## [1] "ADSYGPDDNRFDLRPFLYNTRWSWQFRCIDNEVAKME----------------------- 0"
## [1] " "
## [1] "QLLPKDP 53"
## [1] "QLLPKDP 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] "------- 53"
## [1] " "
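Most sections of the alignment appear fairly conserved across the 10 sequences. The clearest difference is at the C-terminus, where only NP_084497.1 and NP_852145.1 extend to the final alignment columns. Plot a window of the alignment (columns 50 to 100) with ggmsa: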
class(NADSYN1_align) <- "AAMultipleAlignment"
ggmsa::ggmsa(NADSYN1_align, start = 50, end = 100)
Make a distance matrix
NADSYN1_dist <- seqinr::dist.alignment(NADSYN1_align_seqinr,
matrix = "identity")
This produces a “dist” class object
is( NADSYN1_dist )
## [1] "dist" "oldClass"
class( NADSYN1_dist )
## [1] "dist"
Round for display
NADSYN1_align_seqinr_rnd <- round(NADSYN1_dist, 3)
NADSYN1_align_seqinr_rnd
## NP_084497.1 NP_852145.1 NP_060631.2 XP_001174076.2
## NP_852145.1 0.238
## NP_060631.2 0.382 0.386
## XP_001174076.2 0.384 0.387 0.084
## XP_001098992.2 0.411 0.427 0.270 0.273
## XP_540795.4 0.397 0.393 0.337 0.339
## NP_001029615.1 0.384 0.384 0.310 0.313
## NP_001006465.1 0.496 0.496 0.478 0.478
## NP_001120406.1 0.512 0.514 0.494 0.494
## NP_001092723.1 0.526 0.526 0.530 0.530
## XP_001098992.2 XP_540795.4 NP_001029615.1 NP_001006465.1
## NP_852145.1
## NP_060631.2
## XP_001174076.2
## XP_001098992.2
## XP_540795.4 0.388
## NP_001029615.1 0.357 0.308
## NP_001006465.1 0.501 0.480 0.483
## NP_001120406.1 0.520 0.513 0.512 0.483
## NP_001092723.1 0.555 0.516 0.525 0.523
## NP_001120406.1
## NP_852145.1
## NP_060631.2
## XP_001174076.2
## XP_001098992.2
## XP_540795.4
## NP_001029615.1
## NP_001006465.1
## NP_001120406.1
## NP_001092723.1 0.527
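The values in this matrix are not percent differences. According to the seqinr documentation for dist.alignment(), each entry is the square root of (1 - identity), so a distance d corresponds to an identity of 1 - d^2. The short sketch below applies that conversion to three values copied from the rounded matrix above; the labels on the vector are made up for readability.

```r
# Distances copied from the rounded matrix above
d <- c(closest   = 0.084,   # NP_060631.2 vs XP_001174076.2
       first_two = 0.238,   # NP_084497.1 vs NP_852145.1
       farthest  = 0.555)   # XP_001098992.2 vs NP_001092723.1

# dist.alignment() reports sqrt(1 - identity), so invert that here
percent_identity <- round(100 * (1 - d^2), 1)
percent_identity
# closest: 99.3, first_two: 94.3, farthest: 69.2
```

Because the input distances were already rounded to three decimals, these percentages are approximate.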
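Build a phylogenetic tree from the distance matrix using neighbor joining (ape::nj()). The unrounded distances are used here; the rounded copy is only for display.
tree <- ape::nj(NADSYN1_dist)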
Plot the tree
plot.phylo(tree, main = "NADSYN1 Phylogenetic Tree",
use.edge.length = FALSE)
mtext(text = "NADSYN1 Phylogenetic Tree - unrooted, no branch lengths")
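Neighbor joining as implemented in ape::nj() returns an unrooted tree, so a rooted display needs an explicit rooting step. The sketch below shows one way to do that on a toy distance matrix. The four taxa (A to D), their distances, and the choice of "D" as outgroup are all hypothetical and chosen only for illustration; which NADSYN1 sequence would make a sensible outgroup is a biological judgment this sketch does not make.

```r
library(ape)

# Hypothetical toy distances among four made-up taxa
taxa <- c("A", "B", "C", "D")
toy_mat <- matrix(c(0.0, 0.2, 0.5, 0.6,
                    0.2, 0.0, 0.5, 0.6,
                    0.5, 0.5, 0.0, 0.4,
                    0.6, 0.6, 0.4, 0.0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(taxa, taxa))
toy_dist <- as.dist(toy_mat)

# Neighbor joining gives an unrooted tree
toy_tree <- ape::nj(toy_dist)
ape::is.rooted(toy_tree)

# Root on a chosen outgroup ("D" here, purely for illustration)
toy_rooted <- ape::root(toy_tree, outgroup = "D", resolve.root = TRUE)
ape::is.rooted(toy_rooted)

# Plot the rooted toy tree, this time keeping branch lengths
plot(toy_rooted, main = "Toy NJ tree rooted on D")
```

The same two calls, is.rooted() and root(), could be applied to the NADSYN1 tree once an outgroup has been chosen.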