This code compiles summary information about the gene BDKRB2 (Bradykinin Receptor B2) having an adaptation for breath-hold diving.
It also generates alignments and a phylogeneitc tree to indicating the evolutionary relationship betweeen the human version of the gene and its homologs in other species.
Key information use to make this script can be found here:
Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/1733 Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/620 Other resources consulted includes
Neanderthal genome: http://neandertal.ensemblgenomes.org/index.html Other interesting resources and online tools include:
REPPER: https://toolkit.tuebingen.mpg.de/jobs/7803820 Sub-cellular locations prediction: https://wolfpsort.hgc.jp/
library(BiocManager)
library(drawProteins)
#github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
## ggmsa v1.0.0 Document: http://yulab-smu.top/ggmsa/
##
## If you use ggmsa in published research, please cite: DOI: 10.18129/B9.bioc.ggmsa
#CRAN packages
library(rentrez)
library(seqinr)
library(ape)
##
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
library(pander)
library(ggplot2)
#Biostrings
library(Biostrings)
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.1.2
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
library(HGNChelper)
#Bioconductor packages
library(msa)
##
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
##
## version
library(drawProteins)
Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The the Neanderthal genome database was searched but did not yield sequence information on.
A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out excluding vertebrates to determine if it occurred outside of vertebreates. The gene does not appear in non-vertebrates and so a second search was conducted to exclude mammals.
# Use the function to confirm the validity of BDKRB2 and any aliases
HGNChelper::checkGeneSymbols(x = c("BDKRB2"))
## Maps last updated on: Thu Oct 24 12:31:05 2019
## x Approved Suggested.Symbol
## 1 BDKRB2 TRUE BDKRB2
colNames <- c("ncbi.protein.accession", "UniProt.id", "PDB", "species", "common.name", "gene.name")
refseq <- c("NP_000614.1", "XP_522944.1", "XP_001101523.1", "NP_001003095.1", "NP_001179055.1", "NP_033877.3", "NP_775123.2", "XP_005343576.1", "XP_001489322.1", "XP_008247599.2")
uniprot <- c("P30411", "H2Q8W1", "A0A5F7ZK71", "Q9BDQ4", "F1MWK0", "P32299", "P25023", "NA", "F6QJ63", "G1T869")
pdb <- c("7F2O", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")
sciName <- c("Homo sapiens", "Pan troglodytes", "Macaca mulatta", "Canis lupus familiaris", "Bos taurus", "Mus musculus", "Rattus norvegicus" , "Microtus ochrogaster", "Equus caballus", "Oryctolagus cuniculus")
comName <- c("human", "chimpanzee", "rhesus monkey", "dog", "cattle", "house mouse", "norway rat", "prairie vole", "horse", "rabbit")
geneName <- c("BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2", "BDKRB2")
acc_num_table <- data.frame( ncbi.protein.accession = refseq, Uniprot.id = uniprot, PDB = pdb, species = sciName, common.name = comName, gene.name = geneName)
pander::pander(acc_num_table)
| ncbi.protein.accession | Uniprot.id | PDB | species |
|---|---|---|---|
| NP_000614.1 | P30411 | 7F2O | Homo sapiens |
| XP_522944.1 | H2Q8W1 | NA | Pan troglodytes |
| XP_001101523.1 | A0A5F7ZK71 | NA | Macaca mulatta |
| NP_001003095.1 | Q9BDQ4 | NA | Canis lupus familiaris |
| NP_001179055.1 | F1MWK0 | NA | Bos taurus |
| NP_033877.3 | P32299 | NA | Mus musculus |
| NP_775123.2 | P25023 | NA | Rattus norvegicus |
| XP_005343576.1 | NA | NA | Microtus ochrogaster |
| XP_001489322.1 | F6QJ63 | NA | Equus caballus |
| XP_008247599.2 | G1T869 | NA | Oryctolagus cuniculus |
| common.name | gene.name |
|---|---|
| human | BDKRB2 |
| chimpanzee | BDKRB2 |
| rhesus monkey | BDKRB2 |
| dog | BDKRB2 |
| cattle | BDKRB2 |
| house mouse | BDKRB2 |
| norway rat | BDKRB2 |
| prairie vole | BDKRB2 |
| horse | BDKRB2 |
| rabbit | BDKRB2 |
#downloading each fasta sequence
human_fasta <- rentrez::entrez_fetch(db = "protein",
id = "NP_000614.1",
rettype = "fasta")
chimpanzee_fasta <- rentrez::entrez_fetch(db = "protein",
id = "XP_522944.1",
rettype = "fasta")
monkey_fasta <- rentrez::entrez_fetch(db = "protein",
id = "XP_001101523.1",
rettype = "fasta")
dog_fasta <- rentrez::entrez_fetch(db = "protein",
id = "NP_001003095.1",
rettype = "fasta")
cattle_fasta <- rentrez::entrez_fetch(db = "protein",
id = "NP_001179055.1",
rettype = "fasta")
mouse_fasta <- rentrez::entrez_fetch(db = "protein",
id = "NP_033877.3",
rettype = "fasta")
rat_fasta <- rentrez::entrez_fetch(db = "protein",
id = "NP_775123.2",
rettype = "fasta")
vole_fasta <- rentrez::entrez_fetch(db = "protein",
id = "XP_005343576.1",
rettype = "fasta")
horse_fasta <- rentrez::entrez_fetch(db = "protein",
id = "XP_001489322.1",
rettype = "fasta")
rabbit_fasta <- rentrez::entrez_fetch(db = "protein",
id = "XP_008247599.2",
rettype = "fasta")
id <- c("NP_000614.1", "XP_522944.1", "XP_001101523.1", "NP_001003095.1", "NP_001179055.1", "NP_033877.3", "NP_775123.2", "XP_005343576.1", "XP_001489322.1", "XP_008247599.2" )
#list of fasta files
fasta_list <- list(human_fasta, chimpanzee_fasta, monkey_fasta, dog_fasta, cattle_fasta, mouse_fasta, rat_fasta, vole_fasta, horse_fasta, rabbit_fasta)
names(fasta_list) <- id
#number of fasta files obtained
length(fasta_list)
## [1] 10
#the first entry
fasta_list[[1]]
## [1] ">NP_000614.1 B2 bradykinin receptor [Homo sapiens]\nMFSPWKISMFLSVREDSVPTTASFSADMLNVTLQGPTLNGTFAQSKCPQVEWLGWLNTIQPPFLWVLFVL\nATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLFGETLCRVVNAIISMN\nLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKEYSDEGHNVTA\nCVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFI\nICWLPFQISTFLDTLHRLGILSSCQDERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGV\nCQKGGCRSEPIQMENSMGTLRTSISVERQIHKLQDWAGSRQ\n\n"
fasta_list
## $NP_000614.1
## [1] ">NP_000614.1 B2 bradykinin receptor [Homo sapiens]\nMFSPWKISMFLSVREDSVPTTASFSADMLNVTLQGPTLNGTFAQSKCPQVEWLGWLNTIQPPFLWVLFVL\nATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLFGETLCRVVNAIISMN\nLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKEYSDEGHNVTA\nCVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFI\nICWLPFQISTFLDTLHRLGILSSCQDERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGV\nCQKGGCRSEPIQMENSMGTLRTSISVERQIHKLQDWAGSRQ\n\n"
##
## $XP_522944.1
## [1] ">XP_522944.1 B2 bradykinin receptor isoform X1 [Pan troglodytes]\nMFSPWKIPMFLSVREDSVPTTASFSTDMLNVTLQGPTLNGTFAQSKCPQVEWLGWLNTIQPPFLWVLFVL\nATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLFGETLCRVVNAIISMN\nLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKEYSDEGHNVTA\nCVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFI\nICWLPFQISTFLDTLHRLGILSSCQDERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGV\nCQKGGCRSEPIQMENSMGTLRTSISVERQIHKLQDWAGSRQ\n\n"
##
## $XP_001101523.1
## [1] ">XP_001101523.1 PREDICTED: b2 bradykinin receptor [Macaca mulatta]\nMFSPWKTLMFLSVREDSVPTTASFSADMLNVTLQLPALNGTFAQSKCPPVEWLGWLNTIQAPFLWVLFVL\nATLENIFVLSVFCLHRSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFNWLFGETLCRVVNAIISMN\nLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWACTLLLSSPMLVFRTMEEYSDEGHNVTA\nCVISYPSLVWEVFTNMLLNIVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFV\nICWLPFQVSTFLDTLHRLGILASCWDEHIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGV\nCQKGGCRSESIQMENSMGTLRTSISVERQIHKLQDWAGSRQ\n\n"
##
## $NP_001003095.1
## [1] ">NP_001003095.1 B2 bradykinin receptor [Canis lupus familiaris]\nMFSAWKRPMFLSFHEDPVPTTASLSTEMFNSTSQDLMPTLNGTLPSPCVYPEWWNWLNTIQAPFLWVLFI\nLAALENLFVLSIFCLHKSSCTVAEIYLGNLALADLILASGLPFWAITIANNFDWLFGEVLCRVVNTMLYM\nNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLAFRTMKEYRDEGYNVT\nACVIIYPSRTWEVFTNVLLNFVGFLLPLTVITFCTVQIMQVLRNNEMQKFKEIQTERKATVLVLAVLLLF\nVICWLPFQISTFLDTLLRLDILSGCRHEHLVDVFTQIASYVAYSNSCLNPLVYVIVGKRFRKKSREVFRG\nLCQKGGCVLESNKMDNSMGTLRTSVSVERQIHKLPEWAENSQ\n\n"
##
## $NP_001179055.1
## [1] ">NP_001179055.1 B2 bradykinin receptor [Bos taurus]\nMFSAWRRPMFLSIHEDTVPTTASFGAEMFNLTSQVLEPALNGTLPEGSSCFQSDLWNWLNTIQPPFLWIL\nFLLAALENTFVLSVFCLHKSSCTVAEIYLGNLAVADLILACGLPFWAITIANNFDWLFGEALCRVVNTIL\nYMNLYSSIYFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCALLLSSPMLAFRTMQEYNAEGHN\nVTACVINYPSHSWEVFTNILLNSVGFLLPLSVITFCTVQIMQVLRNNEMQKFKEIQTERKATLLVLAVLL\nLFVVCWLPFQISTFLDTLLRLHVLSGCWDEYVIDIFTQIASFVAYSNSCLNPLVYVIVGKRFRKKSQEVY\nARLCRPGGCGSAEPSQTENSMGTLRTSISVERNIHKLQ\n\n"
##
## $NP_033877.3
## [1] ">NP_033877.3 B2 bradykinin receptor [Mus musculus]\nMPCSWKLLGFLSVHEPMPTAASFGIEMFNVTTQVLGSALNGTLSKDNCPDTEWWSWLNAIQAPFLWVLFL\nLAALENLFVLSVFFLHKNSCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWVFGEVLCRVVNTMIYM\nNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMREYSEEGHNVT\nACVIVYPSRSWEVFTNVLLNLVGFLLPLSVITFCTVRILQVLRNNEMKKFKEVQTERKATVLVLAVLGLF\nVLCWVPFQISTFLDTLLRLGVLSGCWDEHAVDVITQISSYVAYSNSGLNPLVYVIVGKRFRKKSREVYRV\nLCQKGGCMGEPVQMENSMGTLRTSISVERQIHKLQDWAGKKQ\n\n"
##
## $NP_775123.2
## [1] ">NP_775123.2 B2 bradykinin receptor isoform 2 [Rattus norvegicus]\nMHCSWKRPVLLSVHEPMPTTASLGIEMFNITTQALGSAHNGTFSEVNCPDTEWWSWLNAIQAPFLWVLFL\nLAALENIFVLSVFCLHKTNCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWLFGEVLCRVVNTMIYM\nNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWSCTLLLSSPMLVFRTMKDYREEGHNVT\nACVIVYPSRSWEVFTNMLLNLVGFLLPLSIITFCTVRIMQVLRNNEMKKFKEVQTEKKATVLVLAVLGLF\nVLCWFPFQISTFLDTLLRLGVLSGCWNERAVDIVTQISSYVAYSNSCLNPLVYVIVGKRFRKKSREVYQA\nICRKGGCMGESVQMENSMGTLRTSISVDRQIHKLQDWAGNKQ\n\n"
##
## $XP_005343576.1
## [1] ">XP_005343576.1 B2 bradykinin receptor [Microtus ochrogaster]\nMYPPWKRPMPLSMHEDPMYTTASLGIDMFNITTHVLESAVNKSLPENSDCPDTEWLSWLNAIQAPFLWVL\nFLLAALENIFVLSVFCLHKNNCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWLFGEVLCRVVNTLI\nYMNLYSSICFLMLVSVDRYLALVKTMSMGRMRGVRWAKIYSLVIWGCTLLLSSPMLMFRTLMKYNEEGFN\nVTACAIIYPSPSWEVFTNVLLNLVGFLLPLTIITFCTVKIMQVLRNNEMKKFKEIQTERKATVLVLAVLG\nLFVICWVPFQISTFLDTLIRLNVLSGCWVRHAVDVITQISSYMAYSNSCLNPLVYVIVGKRFRKKSREVY\nQAMCRKGGCMGESIQMENSLGTLKTSISVERQIHKLQDWSGNRQ\n\n"
##
## $XP_001489322.1
## [1] ">XP_001489322.1 B2 bradykinin receptor [Equus caballus]\nMFSAWRRPMFLSVHEDSVPVMASSSAEMLNLTSQVLVPALNGTLSQSSPCSPSEWVSWLNAIQAPFLWVL\nFVLAALENLFVLSVFCLHKSSCTVAEVYLGNLAAADLILAFGLPFWAVTIANNFDWLFGEVLCRVVNAAV\nYTNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSAPMLAFRTMREYADEGYN\nITACIIVYPSRTWEVFTNILLNFVGFLLPLCVITFCTVQIMRVLRNNEMQKFKEIQTERKATVLVLAVLL\nLFVICWLPFQISTFLDTLLRLKILSSCWDEHVVDVLTQIASFVAYSNSCLNPLVYVIVGKRFRKKSREVY\nGRLCQKGGCLSEPSQTESSMGTLRTSISVERQIHKLPEWVGSGPPSGSSSGSGRPQGYADSRED\n\n"
##
## $XP_008247599.2
## [1] ">XP_008247599.2 PREDICTED: B2 bradykinin receptor isoform X1 [Oryctolagus cuniculus]\nMSCSPASRLPRVSEGWGGPAALLAGPREPREPPSARPLRTRRSLAYIPIGVQMFSSLKTSMFLTVYDDPT\nPATASLGAEMLNITSQVLAPALNGSVSQSSGCPNTEWSGWLNVIQAPFLWVLFVLATLENLFVLSVFCLH\nKSSCTVAEVYLGNLAAADLILACGLPFWAVTIANHFDWLFGEALCRVVNTMIYMNLYSSICFLMLVSIDR\nYLALVKTMSIGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKDYRDEGYNVTACIIDYPSRSWEVFTN\nVLLNLVGFLLPLSVITFCTVQILQVLRNNEMQKFKEIQTERRATVLVLAVLLLFVVCWLPFQVSTFLDTL\nLKLGVLSSCWDEHVIDVITQVGSFMGYSNSCLNPLVYVIVGKRFRKKSREVYRAACPKAGCVLEPVQAES\nSMGTLRTSISVERQIHKLPEWTRSSQ\n\n"
#remove FASTA header
for(i in 1:length(fasta_list)){
fasta_list[[i]] <- compbio4all::fasta_cleaner(fasta_list[[i]], parse = F)
}
fasta_list[1]
## $NP_000614.1
## [1] "MFSPWKISMFLSVREDSVPTTASFSADMLNVTLQGPTLNGTFAQSKCPQVEWLGWLNTIQPPFLWVLFVLATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLFGETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKEYSDEGHNVTACVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFIICWLPFQISTFLDTLHRLGILSSCQDERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGCRSEPIQMENSMGTLRTSISVERQIHKLQDWAGSRQ"
# building a protein diagram
P30411_json <- drawProteins::get_features("P30411")
## [1] "Download has worked"
is(P30411_json)
## [1] "list" "vector" "list_OR_List" "vector_OR_Vector"
## [5] "vector_OR_factor"
# raw data from the webpage is converted to a dataframe
my_prot_df <- drawProteins::feature_to_dataframe(P30411_json)
is(my_prot_df)
## [1] "data.frame" "list" "oldClass" "vector"
## [5] "list_OR_List" "vector_OR_Vector" "vector_OR_factor"
my_prot_df
## type description begin end length
## featuresTemp CHAIN B2 bradykinin receptor 1 391 390
## featuresTemp.1 TOPO_DOM Extracellular 1 60 59
## featuresTemp.2 TRANSMEM Helical; Name=1 61 84 23
## featuresTemp.3 TOPO_DOM Cytoplasmic 85 93 8
## featuresTemp.4 TRANSMEM Helical; Name=2 94 118 24
## featuresTemp.5 TOPO_DOM Extracellular 119 131 12
## featuresTemp.6 TRANSMEM Helical; Name=3 132 153 21
## featuresTemp.7 TOPO_DOM Cytoplasmic 154 175 21
## featuresTemp.8 TRANSMEM Helical; Name=4 176 198 22
## featuresTemp.9 TOPO_DOM Extracellular 199 221 22
## featuresTemp.10 TRANSMEM Helical; Name=5 222 248 26
## featuresTemp.11 TOPO_DOM Cytoplasmic 249 267 18
## featuresTemp.12 TRANSMEM Helical; Name=6 268 292 24
## featuresTemp.13 TOPO_DOM Extracellular 293 311 18
## featuresTemp.14 TRANSMEM Helical; Name=7 312 335 23
## featuresTemp.15 TOPO_DOM Cytoplasmic 336 391 55
## featuresTemp.16 MOD_RES Phosphotyrosine 156 156 0
## featuresTemp.17 MOD_RES Phosphotyrosine 347 347 0
## featuresTemp.18 MOD_RES Phosphoserine 366 366 0
## featuresTemp.19 MOD_RES Phosphothreonine 369 369 0
## featuresTemp.20 MOD_RES Phosphoserine; by GRK6 373 373 0
## featuresTemp.21 MOD_RES Phosphoserine; by GRK6 375 375 0
## featuresTemp.22 LIPID S-palmitoyl cysteine 351 351 0
## featuresTemp.23 CARBOHYD N-linked (GlcNAc...) asparagine 30 30 0
## featuresTemp.24 CARBOHYD N-linked (GlcNAc...) asparagine 39 39 0
## featuresTemp.25 CARBOHYD N-linked (GlcNAc...) asparagine 207 207 0
## featuresTemp.26 DISULFID 130 211 81
## featuresTemp.27 VAR_SEQ NONE 1 27 26
## featuresTemp.28 VARIANT in dbSNP:rs1046248 14 14 0
## featuresTemp.29 VARIANT in dbSNP:rs2227279 354 354 0
## accession entryName taxid order
## featuresTemp P30411 BKRB2_HUMAN 9606 1
## featuresTemp.1 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.2 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.3 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.4 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.5 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.6 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.7 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.8 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.9 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.10 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.11 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.12 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.13 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.14 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.15 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.16 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.17 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.18 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.19 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.20 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.21 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.22 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.23 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.24 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.25 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.26 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.27 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.28 P30411 BKRB2_HUMAN 9606 1
## featuresTemp.29 P30411 BKRB2_HUMAN 9606 1
my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)
my_canvas <- draw_domains(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas
#download sequence P30411
P30411_FASTA <- rentrez::entrez_fetch(id = "P30411" ,
db = "protein",
rettype="fasta")
#clean up sequence
P30411_vector <- fasta_cleaner(P30411_FASTA)
par(mfrow = c(2,2),
mar = c(4,6,3,5))
# plot 1: Defaults
dotPlot(P30411_vector, P30411_vector,
wsize = 1,
nmatch = 1,
main = "Defaults",
xlab = "BDKRB2_human_vector",
ylab = "BDKRB3_human_vactor")
# plot 2 size = 10, nmatch = 1
dotPlot(P30411_vector, P30411_vector,
wsize = 10,
nmatch = 1,
main = " size = 10, nmatch = 1",
xlab = "BDKRB2_human_vector",
ylab = "BDKRB3_human_vactor")
# plot 3: size = 10, nmatch = 5
dotPlot(P30411_vector, P30411_vector,
wsize = 10,
nmatch = 5,
main = "size = 10, nmatch = 5",
xlab = "BDKRB2_human_vector",
ylab = "BDKRB3_human_vactor")
# plot 4: size = 20, nmatch = 5
dotPlot(P30411_vector, P30411_vector,
wsize = 20,
nmatch = 5 ,
main = "size = 20, nmatch = 5",
xlab = "BDKRB2_human_vector",
ylab = "BDKRB3_human_vactor")
# reset par() - run this or other plots will be small!
par(mfrow = c(1,1),
mar = c(4,4,4,4))
dotPlot(P30411_vector, P30411_vector,
wsize = 1,
nmatch = 1,
main = "BDKRB2 Sequence Repeats Dotplot",
xlab = "BDKRB2_human_vector",
ylab = "BDKRB3_human_vactor")
databases <- c("Pfam", "DisProt", "RepeatsDB", "UniProt sub-cellular locations", "PDB secondary structure location")
protein_properties <- c("Protein has a named domain 7tm_1. It also has one low complexity region and 7 transmembrane regions.", "NA", "NA", "Cell membrane and Multi-pass membrane protein", "membrane protein" )
sources <- c("http://pfam.xfam.org/protein/P30411", "NA", "NA", "https://www.uniprot.org/uniprot/P30411", "https://www.rcsb.org/structure/7F2O")
prot_prop_table <- data.frame(databases, protein_properties, sources)
pander::pander(prot_prop_table)
| databases | protein_properties |
|---|---|
| Pfam | Protein has a named domain 7tm_1. It also has one low complexity region and 7 transmembrane regions. |
| DisProt | NA |
| RepeatsDB | NA |
| UniProt sub-cellular locations | Cell membrane and Multi-pass membrane protein |
| PDB secondary structure location | membrane protein |
| sources |
|---|
| http://pfam.xfam.org/protein/P30411 |
| NA |
| NA |
| https://www.uniprot.org/uniprot/P30411 |
| https://www.rcsb.org/structure/7F2O |
# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
## [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
any(c(aa.1.1 == aa.1.2) == FALSE)
## [1] FALSE
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# check against chou's total
sum(beta) == 2776
## [1] TRUE
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
# Chou and Chang’s table 5
data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b)
## aa.1.1 alpha beta a.plus.b a.div.b
## 1 A 285 203 175 361
## 2 R 53 67 78 146
## 3 N 97 139 120 183
## 4 D 163 121 111 244
## 5 C 22 75 74 63
## 6 Q 67 122 74 114
## 7 E 134 86 86 257
## 8 G 197 297 171 377
## 9 H 111 49 33 107
## 10 I 91 120 93 239
## 11 L 221 177 110 339
## 12 K 249 115 112 321
## 13 M 48 16 25 91
## 14 F 123 85 52 158
## 15 P 82 127 71 188
## 16 S 122 341 126 327
## 17 T 119 253 117 238
## 18 W 33 44 30 72
## 19 Y 63 110 108 130
## 20 V 167 229 123 378
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
#dataframe
aa.prop <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
#row labels
row.names(aa.prop) <- aa.1.1
aa.prop
## alpha.prop beta.prop a.plus.b.prop a.div.b
## A 0.116469146 0.073126801 0.09264161 0.08331410
## R 0.021659174 0.024135447 0.04129169 0.03369490
## N 0.039640376 0.050072046 0.06352567 0.04223402
## D 0.066612178 0.043587896 0.05876125 0.05631202
## C 0.008990601 0.027017291 0.03917417 0.01453958
## Q 0.027380466 0.043948127 0.03917417 0.02630972
## E 0.054760932 0.030979827 0.04552673 0.05931225
## G 0.080506743 0.106988473 0.09052409 0.08700669
## H 0.045361667 0.017651297 0.01746956 0.02469421
## I 0.037188394 0.043227666 0.04923240 0.05515809
## L 0.090314671 0.063760807 0.05823187 0.07823679
## K 0.101757254 0.041426513 0.05929063 0.07408262
## M 0.019615856 0.005763689 0.01323452 0.02100162
## F 0.050265631 0.030619597 0.02752779 0.03646434
## P 0.033510421 0.045749280 0.03758602 0.04338795
## S 0.049856968 0.122838617 0.06670196 0.07546734
## T 0.048630977 0.091138329 0.06193753 0.05492730
## W 0.013485901 0.015850144 0.01588142 0.01661666
## Y 0.025745811 0.039625360 0.05717311 0.03000231
## V 0.068246833 0.082492795 0.06511382 0.08723748
# download
NP_000614.1 <- rentrez::entrez_fetch(id = "NP_000614.1",
db = "protein",
rettype = "fasta")
# clean and turn into vector
NP_000614.1 <- compbio4all::fasta_cleaner(NP_000614.1, parse = TRUE)
# Determine amino acid frequencies for my focal gene (counts of amino acid divided by length of sequence)
NP_000614.1.freq.table <- table(NP_000614.1)/length(NP_000614.1)
# Function to convert a table into a vector
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
BDKRB2.human.aa.freq <- table_to_vector(NP_000614.1.freq.table)
#Check for presence of U, an unknown amino acid
# none
aa.names <- names(BDKRB2.human.aa.freq)
any(aa.names == "U")
## [1] FALSE
#i.U <- which(aa.names == "U")
#aa.names[i.U]
#BDKRB2.human.aa.freq[i.U]
#which(aa.names %in% c("U"))
#BDKRB2.human.aa.freq <- BDKRB2.human.aa.freq [-i.U]
aa.prop$BDKRB2.human.aa.freq <- BDKRB2.human.aa.freq
aa.prop
## alpha.prop beta.prop a.plus.b.prop a.div.b BDKRB2.human.aa.freq
## A 0.116469146 0.073126801 0.09264161 0.08331410 0.04603581
## R 0.021659174 0.024135447 0.04129169 0.03369490 0.03580563
## N 0.039640376 0.050072046 0.06352567 0.04223402 0.02557545
## D 0.066612178 0.043587896 0.05876125 0.05631202 0.04092072
## C 0.008990601 0.027017291 0.03917417 0.01453958 0.05626598
## Q 0.027380466 0.043948127 0.03917417 0.02630972 0.04603581
## E 0.054760932 0.030979827 0.04552673 0.05931225 0.01023018
## G 0.080506743 0.106988473 0.09052409 0.08700669 0.07416880
## H 0.045361667 0.017651297 0.01746956 0.02469421 0.03324808
## I 0.037188394 0.043227666 0.04923240 0.05515809 0.13043478
## L 0.090314671 0.063760807 0.05823187 0.07823679 0.04347826
## K 0.101757254 0.041426513 0.05929063 0.07408262 0.04347826
## M 0.019615856 0.005763689 0.01323452 0.02100162 0.03324808
## F 0.050265631 0.030619597 0.02752779 0.03646434 0.04347826
## P 0.033510421 0.045749280 0.03758602 0.04338795 0.04603581
## S 0.049856968 0.122838617 0.06670196 0.07546734 0.08951407
## T 0.048630977 0.091138329 0.06193753 0.05492730 0.06138107
## W 0.013485901 0.015850144 0.01588142 0.01661666 0.08695652
## Y 0.025745811 0.039625360 0.05717311 0.03000231 0.03069054
## V 0.068246833 0.082492795 0.06511382 0.08723748 0.02301790
#Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
#Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
# Data exploration. None of the correlations are great…
par(mfrow = c(2,2), mar = c(5,4,1,1))
plot(alpha.prop ~ BDKRB2.human.aa.freq, data = aa.prop)
plot(beta.prop ~ BDKRB2.human.aa.freq, data = aa.prop)
plot(a.plus.b.prop ~ BDKRB2.human.aa.freq, data = aa.prop)
plot(a.div.b ~ BDKRB2.human.aa.freq, data = aa.prop)
par(mfrow = c(1,1), mar = c(1,5,1,1))
#Calculate correlation between each column
corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_cor(aa.prop[,5], aa.prop[,4])
# Calculate cosine similarity
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])
# Calculate distance. Note: we need to flip the dataframe on its side using a command called t()
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
## A R N D C Q E G H I L
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08
## BDKRB2.human.aa.freq 0.05 0.04 0.03 0.04 0.06 0.05 0.01 0.07 0.03 0.13 0.04
## K M F P S T W Y V
## alpha.prop 0.10 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.04 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.06 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.07 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## BDKRB2.human.aa.freq 0.04 0.03 0.04 0.05 0.09 0.06 0.09 0.03 0.02
# distance matrix
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## BDKRB2.human.aa.freq 0.18614186 0.15394420 0.14797253 0.15496644
# Individual distances
dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
# Display output
pander::pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.7359 | 0.7359 | 0.1861 | ||
| beta | 0.822 | 0.822 | 0.1539 | ||
| alpha plus beta | 0.8236 | 0.8236 | 0.148 | most.sim | min.dist |
| alpha/beta | 0.8095 | 0.8095 | 0.155 |
fasta_vector <- unlist(fasta_list)
names(fasta_vector) <- names(fasta_list)
fasta_vector
## NP_000614.1
## "MFSPWKISMFLSVREDSVPTTASFSADMLNVTLQGPTLNGTFAQSKCPQVEWLGWLNTIQPPFLWVLFVLATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLFGETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKEYSDEGHNVTACVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFIICWLPFQISTFLDTLHRLGILSSCQDERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGCRSEPIQMENSMGTLRTSISVERQIHKLQDWAGSRQ"
## XP_522944.1
## "MFSPWKIPMFLSVREDSVPTTASFSTDMLNVTLQGPTLNGTFAQSKCPQVEWLGWLNTIQPPFLWVLFVLATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLFGETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKEYSDEGHNVTACVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFIICWLPFQISTFLDTLHRLGILSSCQDERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGCRSEPIQMENSMGTLRTSISVERQIHKLQDWAGSRQ"
## XP_001101523.1
## "MFSPWKTLMFLSVREDSVPTTASFSADMLNVTLQLPALNGTFAQSKCPPVEWLGWLNTIQAPFLWVLFVLATLENIFVLSVFCLHRSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFNWLFGETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWACTLLLSSPMLVFRTMEEYSDEGHNVTACVISYPSLVWEVFTNMLLNIVGFLLPLSVITFCTMQIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFVICWLPFQVSTFLDTLHRLGILASCWDEHIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGCRSESIQMENSMGTLRTSISVERQIHKLQDWAGSRQ"
## NP_001003095.1
## "MFSAWKRPMFLSFHEDPVPTTASLSTEMFNSTSQDLMPTLNGTLPSPCVYPEWWNWLNTIQAPFLWVLFILAALENLFVLSIFCLHKSSCTVAEIYLGNLALADLILASGLPFWAITIANNFDWLFGEVLCRVVNTMLYMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLAFRTMKEYRDEGYNVTACVIIYPSRTWEVFTNVLLNFVGFLLPLTVITFCTVQIMQVLRNNEMQKFKEIQTERKATVLVLAVLLLFVICWLPFQISTFLDTLLRLDILSGCRHEHLVDVFTQIASYVAYSNSCLNPLVYVIVGKRFRKKSREVFRGLCQKGGCVLESNKMDNSMGTLRTSVSVERQIHKLPEWAENSQ"
## NP_001179055.1
## "MFSAWRRPMFLSIHEDTVPTTASFGAEMFNLTSQVLEPALNGTLPEGSSCFQSDLWNWLNTIQPPFLWILFLLAALENTFVLSVFCLHKSSCTVAEIYLGNLAVADLILACGLPFWAITIANNFDWLFGEALCRVVNTILYMNLYSSIYFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCALLLSSPMLAFRTMQEYNAEGHNVTACVINYPSHSWEVFTNILLNSVGFLLPLSVITFCTVQIMQVLRNNEMQKFKEIQTERKATLLVLAVLLLFVVCWLPFQISTFLDTLLRLHVLSGCWDEYVIDIFTQIASFVAYSNSCLNPLVYVIVGKRFRKKSQEVYARLCRPGGCGSAEPSQTENSMGTLRTSISVERNIHKLQ"
## NP_033877.3
## "MPCSWKLLGFLSVHEPMPTAASFGIEMFNVTTQVLGSALNGTLSKDNCPDTEWWSWLNAIQAPFLWVLFLLAALENLFVLSVFFLHKNSCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWVFGEVLCRVVNTMIYMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMREYSEEGHNVTACVIVYPSRSWEVFTNVLLNLVGFLLPLSVITFCTVRILQVLRNNEMKKFKEVQTERKATVLVLAVLGLFVLCWVPFQISTFLDTLLRLGVLSGCWDEHAVDVITQISSYVAYSNSGLNPLVYVIVGKRFRKKSREVYRVLCQKGGCMGEPVQMENSMGTLRTSISVERQIHKLQDWAGKKQ"
## NP_775123.2
## "MHCSWKRPVLLSVHEPMPTTASLGIEMFNITTQALGSAHNGTFSEVNCPDTEWWSWLNAIQAPFLWVLFLLAALENIFVLSVFCLHKTNCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWLFGEVLCRVVNTMIYMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWSCTLLLSSPMLVFRTMKDYREEGHNVTACVIVYPSRSWEVFTNMLLNLVGFLLPLSIITFCTVRIMQVLRNNEMKKFKEVQTEKKATVLVLAVLGLFVLCWFPFQISTFLDTLLRLGVLSGCWNERAVDIVTQISSYVAYSNSCLNPLVYVIVGKRFRKKSREVYQAICRKGGCMGESVQMENSMGTLRTSISVDRQIHKLQDWAGNKQ"
## XP_005343576.1
## "MYPPWKRPMPLSMHEDPMYTTASLGIDMFNITTHVLESAVNKSLPENSDCPDTEWLSWLNAIQAPFLWVLFLLAALENIFVLSVFCLHKNNCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWLFGEVLCRVVNTLIYMNLYSSICFLMLVSVDRYLALVKTMSMGRMRGVRWAKIYSLVIWGCTLLLSSPMLMFRTLMKYNEEGFNVTACAIIYPSPSWEVFTNVLLNLVGFLLPLTIITFCTVKIMQVLRNNEMKKFKEIQTERKATVLVLAVLGLFVICWVPFQISTFLDTLIRLNVLSGCWVRHAVDVITQISSYMAYSNSCLNPLVYVIVGKRFRKKSREVYQAMCRKGGCMGESIQMENSLGTLKTSISVERQIHKLQDWSGNRQ"
## XP_001489322.1
## "MFSAWRRPMFLSVHEDSVPVMASSSAEMLNLTSQVLVPALNGTLSQSSPCSPSEWVSWLNAIQAPFLWVLFVLAALENLFVLSVFCLHKSSCTVAEVYLGNLAAADLILAFGLPFWAVTIANNFDWLFGEVLCRVVNAAVYTNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCTLLLSAPMLAFRTMREYADEGYNITACIIVYPSRTWEVFTNILLNFVGFLLPLCVITFCTVQIMRVLRNNEMQKFKEIQTERKATVLVLAVLLLFVICWLPFQISTFLDTLLRLKILSSCWDEHVVDVLTQIASFVAYSNSCLNPLVYVIVGKRFRKKSREVYGRLCQKGGCLSEPSQTESSMGTLRTSISVERQIHKLPEWVGSGPPSGSSSGSGRPQGYADSRED"
## XP_008247599.2
## "MSCSPASRLPRVSEGWGGPAALLAGPREPREPPSARPLRTRRSLAYIPIGVQMFSSLKTSMFLTVYDDPTPATASLGAEMLNITSQVLAPALNGSVSQSSGCPNTEWSGWLNVIQAPFLWVLFVLATLENLFVLSVFCLHKSSCTVAEVYLGNLAAADLILACGLPFWAVTIANHFDWLFGEALCRVVNTMIYMNLYSSICFLMLVSIDRYLALVKTMSIGRMRGVRWAKLYSLVIWGCTLLLSSPMLVFRTMKDYRDEGYNVTACIIDYPSRSWEVFTNVLLNLVGFLLPLSVITFCTVQILQVLRNNEMQKFKEIQTERRATVLVLAVLLLFVVCWLPFQVSTFLDTLLKLGVLSSCWDEHVIDVITQVGSFMGYSNSCLNPLVYVIVGKRFRKKSREVYRAACPKAGCVLEPVQAESSMGTLRTSISVERQIHKLPEWTRSSQ"
# human v chimp
align01.02 <- pairwiseAlignment(fasta_vector[1], fasta_vector[2])
# human v cattle
align01.05 <- pairwiseAlignment(fasta_vector[1], fasta_vector[5])
# human v house mouse
align01.06 <- pairwiseAlignment(fasta_vector[1], fasta_vector[6])
# chimp v cattle
align02.05 <- pairwiseAlignment(fasta_vector[2], fasta_vector[5])
# chimp v house mouse
align02.06 <- pairwiseAlignment(fasta_vector[2], fasta_vector[6])
# cattle v house mouse
align05.06 <- pairwiseAlignment(fasta_vector[5], fasta_vector[6])
Biostrings::pid(align01.02)
## [1] 99.48849
Biostrings::pid(align01.05)
## [1] 80.6701
Biostrings::pid(align01.06)
## [1] 79.64377
Biostrings::pid(align02.05)
## [1] 80.6701
Biostrings::pid(align02.06)
## [1] 79.64377
Biostrings::pid(align05.06)
## [1] 80.15464
# build matrix
pids <- c(1, NA, NA, NA,
pid(align01.02), 1, NA, NA,
pid(align01.05), pid(align02.05), 1, NA,
pid(align01.06), pid(align02.06), pid(align05.06), 1)
mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Pan","Rat","Fish")
colnames(mat) <- c("Homo","Pan","Rat","Fish")
pander::pander(mat)
| Â | Homo | Pan | Rat | Fish |
|---|---|---|---|---|
| Homo | 1 | NA | NA | NA |
| Pan | 99.49 | 1 | NA | NA |
| Rat | 80.67 | 80.67 | 1 | NA |
| Fish | 79.64 | 79.64 | 80.15 | 1 |
pidHumanChimp <- Biostrings::pid(align01.02)
method <- c("PID1", "PID2", "PID3", "PID4")
PID <- c(pid(align01.02 , type = "PID1"),pid(align01.02 , type = "PID2"),
pid(align01.02 , type = "PID3") , pid(align01.02 , type = "PID4") )
denominator <- c("(aligned positions + internal gap positions)", "(aligned positions)", "(length shorter sequence)", "(average length of the two sequences)")
pidDF <- data.frame(method, PID, denominator)
pander::pander(pidDF)
| method | PID | denominator |
|---|---|---|
| PID1 | 99.49 | (aligned positions + internal gap positions) |
| PID2 | 99.49 | (aligned positions) |
| PID3 | 99.49 | (length shorter sequence) |
| PID4 | 99.49 | (average length of the two sequences) |
# For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet()
fasta_vector_ss <- Biostrings::AAStringSet(fasta_vector)
BDKRB2_align <-msa(fasta_vector_ss,
method = "ClustalW")
## use default substitution matrix
class(BDKRB2_align) <- "AAMultipleAlignment"
BDKRB2_align_seqinr <- msaConvert(BDKRB2_align, type = "seqinr::alignment")
print_msa(alignment = BDKRB2_align_seqinr,
chunksize = 60)
## [1] "----------------------------------------------------MFSPWKIS 0"
## [1] "----------------------------------------------------MFSPWKIP 0"
## [1] "----------------------------------------------------MFSPWKTL 0"
## [1] "MSCSPASRLPRVSEGWGGPAALLAGPREPREPPSARPLRTRRSLAYIPIGVQMFSSLKTS 0"
## [1] "----------------------------------------------------MFSAWKRP 0"
## [1] "----------------------------------------------------MFSAWRRP 0"
## [1] "----------------------------------------------------MFSAWRRP 0"
## [1] "----------------------------------------------------MPCSWKLL 0"
## [1] "----------------------------------------------------MHCSWKRP 0"
## [1] "----------------------------------------------------MYPPWKRP 0"
## [1] " "
## [1] "MFLSVREDSVPTTASFSADMLNVT--LQGPTLNGTFAQS-KCPQVEWLGWLNTIQPPFLW 0"
## [1] "MFLSVREDSVPTTASFSTDMLNVT--LQGPTLNGTFAQS-KCPQVEWLGWLNTIQPPFLW 0"
## [1] "MFLSVREDSVPTTASFSADMLNVT--LQLPALNGTFAQS-KCPPVEWLGWLNTIQAPFLW 0"
## [1] "MFLTVYDDPTPATASLGAEMLNITSQVLAPALNGSVSQSSGCPNTEWSGWLNVIQAPFLW 0"
## [1] "MFLSFHEDPVPTTASLSTEMFNSTSQDLMPTLNGTLP--SPCVYPEWWNWLNTIQAPFLW 0"
## [1] "MFLSVHEDSVPVMASSSAEMLNLTSQVLVPALNGTLSQSSPCSPSEWVSWLNAIQAPFLW 0"
## [1] "MFLSIHEDTVPTTASFGAEMFNLTSQVLEPALNGTLPEGSSCFQSDLWNWLNTIQPPFLW 0"
## [1] "GFLSVHE-PMPTAASFGIEMFNVTTQVLGSALNGTLSKD-NCPDTEWWSWLNAIQAPFLW 0"
## [1] "VLLSVHE-PMPTTASLGIEMFNITTQALGSAHNGTFSEV-NCPDTEWWSWLNAIQAPFLW 0"
## [1] "MPLSMHEDPMYTTASLGIDMFNITTHVLESAVNKSLPENSDCPDTEWLSWLNAIQAPFLW 0"
## [1] " "
## [1] "VLFVLATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLF 0"
## [1] "VLFVLATLENIFVLSVFCLHKSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFDWLF 0"
## [1] "VLFVLATLENIFVLSVFCLHRSSCTVAEIYLGNLAAADLILACGLPFWAITISNNFNWLF 0"
## [1] "VLFVLATLENLFVLSVFCLHKSSCTVAEVYLGNLAAADLILACGLPFWAVTIANHFDWLF 0"
## [1] "VLFILAALENLFVLSIFCLHKSSCTVAEIYLGNLALADLILASGLPFWAITIANNFDWLF 0"
## [1] "VLFVLAALENLFVLSVFCLHKSSCTVAEVYLGNLAAADLILAFGLPFWAVTIANNFDWLF 0"
## [1] "ILFLLAALENTFVLSVFCLHKSSCTVAEIYLGNLAVADLILACGLPFWAITIANNFDWLF 0"
## [1] "VLFLLAALENLFVLSVFFLHKNSCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWVF 0"
## [1] "VLFLLAALENIFVLSVFCLHKTNCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWLF 0"
## [1] "VLFLLAALENIFVLSVFCLHKNNCTVAEIYLGNLAAADLILACGLPFWAITIANNFDWLF 0"
## [1] " "
## [1] "GETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCT 0"
## [1] "GETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCT 0"
## [1] "GETLCRVVNAIISMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWACT 0"
## [1] "GEALCRVVNTMIYMNLYSSICFLMLVSIDRYLALVKTMSIGRMRGVRWAKLYSLVIWGCT 0"
## [1] "GEVLCRVVNTMLYMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCT 0"
## [1] "GEVLCRVVNAAVYTNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCT 0"
## [1] "GEALCRVVNTILYMNLYSSIYFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCA 0"
## [1] "GEVLCRVVNTMIYMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWGCT 0"
## [1] "GEVLCRVVNTMIYMNLYSSICFLMLVSIDRYLALVKTMSMGRMRGVRWAKLYSLVIWSCT 0"
## [1] "GEVLCRVVNTLIYMNLYSSICFLMLVSVDRYLALVKTMSMGRMRGVRWAKIYSLVIWGCT 0"
## [1] " "
## [1] "LLLSSPMLVFRTMKEYSDEGHNVTACVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTM 0"
## [1] "LLLSSPMLVFRTMKEYSDEGHNVTACVISYPSLIWEVFTNMLLNVVGFLLPLSVITFCTM 0"
## [1] "LLLSSPMLVFRTMEEYSDEGHNVTACVISYPSLVWEVFTNMLLNIVGFLLPLSVITFCTM 0"
## [1] "LLLSSPMLVFRTMKDYRDEGYNVTACIIDYPSRSWEVFTNVLLNLVGFLLPLSVITFCTV 0"
## [1] "LLLSSPMLAFRTMKEYRDEGYNVTACVIIYPSRTWEVFTNVLLNFVGFLLPLTVITFCTV 0"
## [1] "LLLSAPMLAFRTMREYADEGYNITACIIVYPSRTWEVFTNILLNFVGFLLPLCVITFCTV 0"
## [1] "LLLSSPMLAFRTMQEYNAEGHNVTACVINYPSHSWEVFTNILLNSVGFLLPLSVITFCTV 0"
## [1] "LLLSSPMLVFRTMREYSEEGHNVTACVIVYPSRSWEVFTNVLLNLVGFLLPLSVITFCTV 0"
## [1] "LLLSSPMLVFRTMKDYREEGHNVTACVIVYPSRSWEVFTNMLLNLVGFLLPLSIITFCTV 0"
## [1] "LLLSSPMLMFRTLMKYNEEGFNVTACAIIYPSPSWEVFTNVLLNLVGFLLPLTIITFCTV 0"
## [1] " "
## [1] "QIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFIICWLPFQISTFLDTLHRLGILSSCQ 0"
## [1] "QIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFIICWLPFQISTFLDTLHRLGILSSCQ 0"
## [1] "QIMQVLRNNEMQKFKEIQTERRATVLVLVVLLLFVICWLPFQVSTFLDTLHRLGILASCW 0"
## [1] "QILQVLRNNEMQKFKEIQTERRATVLVLAVLLLFVVCWLPFQVSTFLDTLLKLGVLSSCW 0"
## [1] "QIMQVLRNNEMQKFKEIQTERKATVLVLAVLLLFVICWLPFQISTFLDTLLRLDILSGCR 0"
## [1] "QIMRVLRNNEMQKFKEIQTERKATVLVLAVLLLFVICWLPFQISTFLDTLLRLKILSSCW 0"
## [1] "QIMQVLRNNEMQKFKEIQTERKATLLVLAVLLLFVVCWLPFQISTFLDTLLRLHVLSGCW 0"
## [1] "RILQVLRNNEMKKFKEVQTERKATVLVLAVLGLFVLCWVPFQISTFLDTLLRLGVLSGCW 0"
## [1] "RIMQVLRNNEMKKFKEVQTEKKATVLVLAVLGLFVLCWFPFQISTFLDTLLRLGVLSGCW 0"
## [1] "KIMQVLRNNEMKKFKEIQTERKATVLVLAVLGLFVICWVPFQISTFLDTLIRLNVLSGCW 0"
## [1] " "
## [1] "DERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGC-RSEPIQME 0"
## [1] "DERIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGC-RSEPIQME 0"
## [1] "DEHIIDVITQIASFMAYSNSCLNPLVYVIVGKRFRKKSWEVYQGVCQKGGC-RSESIQME 0"
## [1] "DEHVIDVITQVGSFMGYSNSCLNPLVYVIVGKRFRKKSREVYRAACPKAGC-VLEPVQAE 0"
## [1] "HEHLVDVFTQIASYVAYSNSCLNPLVYVIVGKRFRKKSREVFRGLCQKGGC-VLESNKMD 0"
## [1] "DEHVVDVLTQIASFVAYSNSCLNPLVYVIVGKRFRKKSREVYGRLCQKGGC-LSEPSQTE 0"
## [1] "DEYVIDIFTQIASFVAYSNSCLNPLVYVIVGKRFRKKSQEVYARLCRPGGCGSAEPSQTE 0"
## [1] "DEHAVDVITQISSYVAYSNSGLNPLVYVIVGKRFRKKSREVYRVLCQKGGC-MGEPVQME 0"
## [1] "NERAVDIVTQISSYVAYSNSCLNPLVYVIVGKRFRKKSREVYQAICRKGGC-MGESVQME 0"
## [1] "VRHAVDVITQISSYMAYSNSCLNPLVYVIVGKRFRKKSREVYQAMCRKGGC-MGESIQME 0"
## [1] " "
## [1] "NSMGTLRTSISVERQIHKLQDWAGSRQ-------------------- 13"
## [1] "NSMGTLRTSISVERQIHKLQDWAGSRQ-------------------- 13"
## [1] "NSMGTLRTSISVERQIHKLQDWAGSRQ-------------------- 13"
## [1] "SSMGTLRTSISVERQIHKLPEWTRSSQ-------------------- 13"
## [1] "NSMGTLRTSVSVERQIHKLPEWAENSQ-------------------- 13"
## [1] "SSMGTLRTSISVERQIHKLPEWVGSGPPSGSSSGSGRPQGYADSRED 13"
## [1] "NSMGTLRTSISVERNIHKLQ--------------------------- 13"
## [1] "NSMGTLRTSISVERQIHKLQDWAGKKQ-------------------- 13"
## [1] "NSMGTLRTSISVDRQIHKLQDWAGNKQ-------------------- 13"
## [1] "NSLGTLKTSISVERQIHKLQDWSGNRQ-------------------- 13"
## [1] " "
## some conserved regions present
ggmsa::ggmsa(BDKRB2_align, start = 300, end = 400)
# Calculating genetic distance from an MSA is done using the `seqinr::dist.alignment()` function.
BDKRB2_dist <- seqinr::dist.alignment(BDKRB2_align_seqinr,
matrix = "identity")
# round to 3 digits for easier view
BDKRB2_dist_rounded <- round(BDKRB2_dist,
digits = 3)
# distance matrix for all sequences
BDKRB2_dist_rounded
## NP_000614.1 XP_522944.1 XP_001101523.1 XP_008247599.2
## XP_522944.1 0.072
## XP_001101523.1 0.215 0.220
## XP_008247599.2 0.441 0.447 0.441
## NP_001003095.1 0.439 0.433 0.448 0.446
## XP_001489322.1 0.435 0.435 0.435 0.427
## NP_001179055.1 0.439 0.439 0.454 0.463
## NP_033877.3 0.447 0.447 0.450 0.443
## NP_775123.2 0.461 0.459 0.467 0.463
## XP_005343576.1 0.480 0.477 0.477 0.491
## NP_001003095.1 XP_001489322.1 NP_001179055.1 NP_033877.3
## XP_522944.1
## XP_001101523.1
## XP_008247599.2
## NP_001003095.1
## XP_001489322.1 0.398
## NP_001179055.1 0.408 0.410
## NP_033877.3 0.427 0.443 0.438
## NP_775123.2 0.439 0.476 0.450 0.315
## XP_005343576.1 0.457 0.494 0.471 0.398
## NP_775123.2
## XP_522944.1
## XP_001101523.1
## XP_008247599.2
## NP_001003095.1
## XP_001489322.1
## NP_001179055.1
## NP_033877.3
## NP_775123.2
## XP_005343576.1 0.375
There are MANY ways to build phylogenetic trees. Here the neighbor joining algorithm via the function nj() is used. Neighbor joining uses genetic distances to cluster sequences into clades.
nj() is simple function that takes only a single argument, a distance matrix.
#
# Note - not using rounded values
tree_BDKRB2 <- nj(BDKRB2_dist)
# plot tree
plot.phylo(tree_BDKRB2, main="Phylogenetic Tree\n",
type = "phylogram",
use.edge.length = F)
# add label
mtext(text = "BDKRB2 family gene tree - Rooted, no branch lengths")