Introduction

Transient receptor potential cation channel subfamily M (melastatin) member 8 (TRPM8) is a protein encoded by the TRPM8 gene. The gene encodes a cold-sensing TRP channel that is also activated by chemical ligands such as menthol and icilin; it is the primary molecular transducer of cold somatosensation in humans.

Resources & References

Some important resources used to compile this information:
RefSeq Page: https://www.ncbi.nlm.nih.gov/nuccore/NM_001397607.1
HomoloGene Page: https://www.ncbi.nlm.nih.gov/gene/79054
UniProt Page: https://www.uniprot.org/uniprot/Q7Z2W7
PDB Page: https://www.rcsb.org/structure/6BPQ

Other resources consulted include: Neanderthal Genome: http://neandertal.ensemblgenomes.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000144481

Preparation

# Github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
library(pander)
library(ggplot2)

# Bioconductor packages
library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
##   is available with R version '4.1'; see https://bioconductor.org/install
library(drawProteins) # not working
# library(msa) # does not work, will use manual functions

library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:seqinr':
## 
##     translate
## The following object is masked from 'package:base':
## 
##     strsplit
library(HGNChelper)

data(BLOSUM50)

Accession Numbers

Accession numbers were obtained from RefSeq, HomoloGene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name; many proteins are not in PDB, but TRPM8 is. The Neanderthal genome database was searched as well.

A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was first run with vertebrates excluded to determine whether TRPM8 occurs outside of vertebrates. The gene does not appear in non-vertebrates, so a second search was run with mammals excluded.

Accession Number Table

TRPM8 does not occur outside of vertebrates.

ncbi.protein.accession <- c("NP_076985", "NP_599013.1", "NP_599198.2",
                            "NP_001007083.1", "XP_005049003.1", "NP_001104239.1",
                            "NP_001166561.1", "XP_028687091.1", "NP_001192995.1",
                            "XP_024210920.1")
UniProt.id <- c("Q7Z2W7", "Q8R4D5", "Q8R455", "NA", "NA", "NA", "NA", "NA", "NA",
                "NA")
PDB <- c("6BPQ", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")

# Scientific and common names are stored in separate vectors so that the
# second assignment does not overwrite the first
species.scientific <- c("Homo sapiens", "Mus musculus", "Rattus norvegicus",
                        "Gallus gallus", "Ficedula albicollis",
                        "Canis lupus familiaris", "Cavia porcellus",
                        "Macaca mulatta", "Bos taurus", "Pan troglodytes")
species.common <- c("Human", "Mouse", "Rat", "Chicken", "Flycatcher", "Dog",
                    "Guinea pig", "Monkey", "Cattle", "Chimpanzee")
gene.name <- c("TRPM8", "Trpm8", "Trpm8", "TRPM8", "TRPM8", "TRPM8", "Trpm8",
               "TRPM8", "TRPM8", "TRPM8")

# Combine the vectors into a single data frame
trpm8.df <- data.frame(ncbi.protein.accession = ncbi.protein.accession,
                       UniProt.id = UniProt.id,
                       PDB = PDB,
                       species.scientific = species.scientific,
                       species.common = species.common,
                       gene.name = gene.name)

# Display the table
pander::pander(trpm8.df)

ncbi.protein.accession  UniProt.id  PDB   species.scientific      species.common  gene.name
NP_076985               Q7Z2W7      6BPQ  Homo sapiens            Human           TRPM8
NP_599013.1             Q8R4D5      NA    Mus musculus            Mouse           Trpm8
NP_599198.2             Q8R455      NA    Rattus norvegicus       Rat             Trpm8
NP_001007083.1          NA          NA    Gallus gallus           Chicken         TRPM8
XP_005049003.1          NA          NA    Ficedula albicollis     Flycatcher      TRPM8
NP_001104239.1          NA          NA    Canis lupus familiaris  Dog             TRPM8
NP_001166561.1          NA          NA    Cavia porcellus         Guinea pig      Trpm8
XP_028687091.1          NA          NA    Macaca mulatta          Monkey          TRPM8
NP_001192995.1          NA          NA    Bos taurus              Cattle          TRPM8
XP_024210920.1          NA          NA    Pan troglodytes         Chimpanzee      TRPM8

Data Preparation

Download Sequences

All sequences were downloaded using the wrapper compbio4all::entrez_fetch_list(), which calls rentrez::entrez_fetch() to access NCBI databases.

## [1] 10
## [1] ">NP_076985.4 transient receptor potential cation channel subfamily M member 8 isoform 1 [Homo sapiens]\nMSFRAARLSMRNRRNDTLDSTRTLYSSASRSTDLSYSESDLVNFIQANFKKRECVFFTKDSKATENVCKC\nGYAQSQHMEGTQINQSEKWNYKKHTKEFPTDAFGDIQFETLGKKGKYIRLSCDTDAEILYELLTQHWHLK\nTPNLVISVTGGAKNFALKPRMRKIFSRLIYIAQSKGAWILTGGTHYGLMKYIGEVVRDNTISRSSEENIV\nAIGIAAWGMVSNRDTLIRNCDAEGYFLAQYLMDDFTRDPLYILDNNHTHLLLVDNGCHGHPTVEAKLRNQ\nLEKYISERTIQDSNYGGKIPIVCFAQGGGKETLKAINTSIKNKIPCVVVEGSGQIADVIASLVEVEDALT\nSSAVKEKLVRFLPRTVSRLPEEETESWIKWLKEILECSHLLTVIKMEEAGDEIVSNAISYALYKAFSTSE\nQDKDNWNGQLKLLLEWNQLDLANDEIFTNDRRWESADLQEVMFTALIKDRPKFVRLFLENGLNLRKFLTH\nDVLTELFSNHFSTLVYRNLQIAKNSYNDALLTFVWKLVANFRRGFRKEDRNGRDEMDIELHDVSPITRHP\nLQALFIWAILQNKKELSKVIWEQTRGCTLAALGASKLLKTLAKVKNDINAAGESEELANEYETRAVELFT\nECYSSDEDLAEQLLVYSCEAWGGSNCLELAVEATDQHFIAQPGVQNFLSKQWYGEISRDTKNWKIILCLF\nIIPLVGCGFVSFRKKPVDKHKKLLWYYVAFFTSPFVVFSWNVVFYIAFLLLFAYVLLMDFHSVPHPPELV\nLYSLVFVLFCDEVRQWYVNGVNYFTDLWNVMDTLGLFYFIAGIVFRLHSSNKSSLYSGRVIFCLDYIIFT\nLRLIHIFTVSRNLGPKIIMLQRMLIDVFFFLFLFAVWMVAFGVARQGILRQNEQRWRWIFRSVIYEPYLA\nMFGQVPSDVDGTTYDFAHCTFTGNESKPLCVELDEHNLPRFPEWITIPLVCIYMLSTNILLVNLLVAMFG\nYTVGTVQENNDQVWKFQRYFLVQEYCSRLNIPFPFIVFAYFYMVVKKCFKCCCKEKNMESSVCCFKNEDN\nETLAWEGVMKENYLVKINTKANDTSEEMRHRFRQLDTKLNDLKGLLKEIANKIK\n\n"
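The download step can also be sketched directly with rentrez. This is a minimal sketch under stated assumptions, not the code behind compbio4all::entrez_fetch_list(): the helper name fetch_protein_fasta() is hypothetical, and calling it needs the rentrez package and network access to NCBI.

```r
# Hypothetical helper: fetch one FASTA record per accession from the
# NCBI protein database and return the records as a named list.
fetch_protein_fasta <- function(accessions) {
  records <- lapply(accessions, function(acc) {
    rentrez::entrez_fetch(db = "protein", id = acc, rettype = "fasta")
  })
  names(records) <- accessions
  records
}

# Usage (requires internet access, so it is left commented out):
# trpm8.list <- fetch_protein_fasta(c("NP_076985", "NP_599013.1"))
# length(trpm8.list)
```

Returning a named list keeps each record tied to its accession, which makes later steps easier to index by sequence.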

Initial Data Cleaning

The FASTA header and newline characters are removed from each record.
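The cleaning step can be sketched in base R. The helper clean_fasta() and the toy record below are hypothetical and only illustrate the idea; they are not the function or data used in this report.

```r
# Hypothetical helper: drop the ">" header line and all newline
# characters from a single FASTA record held as one string.
clean_fasta <- function(fasta, as.chars = FALSE) {
  no.header <- sub("^>[^\n]*\n", "", fasta)  # remove the header line
  sequence  <- gsub("\n", "", no.header)     # remove remaining line breaks
  if (as.chars) strsplit(sequence, "")[[1]] else sequence
}

# Toy record for illustration only (not a real accession)
example.record <- ">toy_record example protein\nMSFRA\nARLSM\n\n"
clean_fasta(example.record)
clean_fasta(example.record, as.chars = TRUE)
```

With as.chars = TRUE the sequence comes back as a character vector with one residue per element, the form needed for dotplots and residue counts.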

General Protein Information

Protein Diagram

First, we use a UniProt accession to download data from UniProt. This produces a list.

## [1] "Download has worked"
## [1] "list"             "vector"           "list_OR_List"     "vector_OR_Vector"
## [5] "vector_OR_factor"

Then the raw data from the webpage is converted to a dataframe.

## [1] "data.frame"       "list"             "oldClass"         "vector"          
## [5] "list_OR_List"     "vector_OR_Vector" "vector_OR_factor"

The information available for a protein on UniProt varies a lot depending on how much it has been studied. drawProteins can extract information about the following:

domains, chains, regions, motifs, phosphorylated sites, repeats, and others

If available, it can plot the information. You can get a sense of what is available by looking at the dataframe produced by drawProteins::feature_to_dataframe():

##                     type begin  end length accession   entryName taxid order
## featuresTemp       CHAIN     1 1104   1103    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.1  TOPO_DOM     1  691    690    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.2  TRANSMEM   692  712     20    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.3  TOPO_DOM   713  734     21    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.4  TRANSMEM   735  755     20    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.5  TOPO_DOM   756  759      3    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.6  TRANSMEM   760  780     20    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.7  TOPO_DOM   781  794     13    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.8  TRANSMEM   795  815     20    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.9  TOPO_DOM   816  829     13    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.10 TRANSMEM   830  850     20    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.11 TOPO_DOM   851  958    107    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.12 TRANSMEM   959  979     20    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.13 TOPO_DOM   980 1104    124    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.14   REGION   187  195      8    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.15   COILED  1071 1104     33    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.16 CARBOHYD   934  934      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.17  VAR_SEQ     1  188    187    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.18  VAR_SEQ     1   77     76    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.19  VAR_SEQ     1    2      1    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.20  VAR_SEQ     3  314    311    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.21  VAR_SEQ   234  242      8    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.22  VAR_SEQ   243 1104    861    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.23  VAR_SEQ   675  784    109    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.24  VARIANT   247  247      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.25  VARIANT   251  251      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.26  VARIANT   419  419      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.27  VARIANT   462  462      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.28  VARIANT   732  732      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.29  VARIANT   821  821      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.30  MUTAGEN   821  821      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.31  MUTAGEN   934  934      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.32  MUTAGEN   946  946      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.33  MUTAGEN  1089 1089      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.34 CONFLICT    58   58      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.35 CONFLICT   693  693      0    Q7Z2W7 TRPM8_HUMAN  9606     1
## featuresTemp.36 CONFLICT   795  795      0    Q7Z2W7 TRPM8_HUMAN  9606     1
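The download, conversion and plotting steps described above can be sketched as one function. This is a sketch under stated assumptions: the wrapper name plot_uniprot_features() is hypothetical, the choice of layers is illustrative, and running it needs the drawProteins and ggplot2 packages plus network access to UniProt.

```r
# Hypothetical wrapper around the drawProteins workflow.
plot_uniprot_features <- function(accession) {
  features    <- drawProteins::get_features(accession)         # raw list from UniProt
  features.df <- drawProteins::feature_to_dataframe(features)  # one row per feature

  # Build the diagram layer by layer; layers with no matching
  # feature type in the data frame simply add nothing.
  canvas <- drawProteins::draw_canvas(features.df)
  canvas <- drawProteins::draw_chains(canvas, features.df)
  canvas <- drawProteins::draw_regions(canvas, features.df)
  canvas <- drawProteins::draw_motif(canvas, features.df)
  canvas + ggplot2::theme_bw()
}

# Usage (requires internet access, so it is left commented out):
# plot_uniprot_features("Q7Z2W7")
```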

Domains present

Draw Dotplot

Only the human TRPM8 sequence is used for the dotplot.

##  chr [1:1104] "M" "S" "F" "R" "A" "A" "R" "L" "S" "M" "R" "N" "R" "R" "N" ...
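A self-comparison dotplot can be sketched in base R as a logical matrix of residue matches. This is an illustration only: it uses exact single-residue matches on the first 15 residues printed above, whereas a windowed dotplot (for example with seqinr::dotPlot()) would filter out more of the background noise.

```r
# First 15 residues of the human TRPM8 sequence shown above
trpm8.start <- c("M", "S", "F", "R", "A", "A", "R", "L", "S", "M",
                 "R", "N", "R", "R", "N")

# Cell [i, j] is TRUE when residue i is the same as residue j
dot.matrix <- outer(trpm8.start, trpm8.start, FUN = "==")

# Plot the matches; the main diagonal is the sequence matched to itself
image(x = seq_along(trpm8.start), y = seq_along(trpm8.start),
      z = dot.matrix * 1, col = c("white", "black"),
      xlab = "Residue position", ylab = "Residue position")
```

Off-diagonal runs of matches parallel to the main diagonal would indicate internal repeats.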

Protein properties compiled from databases

Below are links to relevant information. This particular protein is not in Pfam, DisProt, or RepeatDB. In UniProt, the sub-cellular location is listed as: Endoplasmic reticulum membrane. In PDB, the secondary structure is shown as containing alpha helices and beta sheets.

Protein feature prediction

Multivariate statistical techniques were used to check the information about protein structure and location reported in the online databases.

UniProt indicates that the protein is a membrane-bound protein in the ER.

Predict Protein Fold

AlphaFold indicates a mix of alpha helices and beta sheets. I therefore predict that the machine-learning methods will indicate an a+b or a/b structure.

##    aa.1.1 alpha beta a.plus.b a.div.b
## 1       A   285  203      175     361
## 2       R    53   67       78     146
## 3       N    97  139      120     183
## 4       D   163  121      111     244
## 5       C    22   75       74      63
## 6       Q    67  122       74     114
## 7       E   134   86       86     257
## 8       G   197  297      171     377
## 9       H   111   49       33     107
## 10      I    91  120       93     239
## 11      L   221  177      110     339
## 12      K   249  115      112     321
## 13      M    48   16       25      91
## 14      F   123   85       52     158
## 15      P    82  127       71     188
## 16      S   122  341      126     327
## 17      T   119  253      117     238
## 18      W    33   44       30      72
## 19      Y    63  110      108     130
## 20      V   167  229      123     378
##    alpha.prop   beta.prop a.plus.b.prop    a.div.b
## A 0.116469146 0.073126801    0.09264161 0.08331410
## R 0.021659174 0.024135447    0.04129169 0.03369490
## N 0.039640376 0.050072046    0.06352567 0.04223402
## D 0.066612178 0.043587896    0.05876125 0.05631202
## C 0.008990601 0.027017291    0.03917417 0.01453958
## Q 0.027380466 0.043948127    0.03917417 0.02630972
## E 0.054760932 0.030979827    0.04552673 0.05931225
## G 0.080506743 0.106988473    0.09052409 0.08700669
## H 0.045361667 0.017651297    0.01746956 0.02469421
## I 0.037188394 0.043227666    0.04923240 0.05515809
## L 0.090314671 0.063760807    0.05823187 0.07823679
## K 0.101757254 0.041426513    0.05929063 0.07408262
## M 0.019615856 0.005763689    0.01323452 0.02100162
## F 0.050265631 0.030619597    0.02752779 0.03646434
## P 0.033510421 0.045749280    0.03758602 0.04338795
## S 0.049856968 0.122838617    0.06670196 0.07546734
## T 0.048630977 0.091138329    0.06193753 0.05492730
## W 0.013485901 0.015850144    0.01588142 0.01661666
## Y 0.025745811 0.039625360    0.05717311 0.03000231
## V 0.068246833 0.082492795    0.06511382 0.08723748
##          A          C          D          E          F          G          H 
## 0.05706522 0.02445652 0.04981884 0.06702899 0.06340580 0.04710145 0.02083333 
##          I          K          L          M          N          P          Q 
## 0.06250000 0.06521739 0.11050725 0.01992754 0.05706522 0.02536232 0.03351449 
##          R          S          T          V          W          Y 
## 0.04981884 0.06159420 0.05434783 0.06974638 0.02264493 0.03804348
## character(0)
## named numeric(0)
   alpha.prop  beta.prop  a.plus.b.prop  a.div.b  TRPM8.human.aa.freq
A  0.1165      0.07313    0.09264        0.08331  0.05707
R  0.02166     0.02414    0.04129        0.03369  0.02446
N  0.03964     0.05007    0.06353        0.04223  0.04982
D  0.06661     0.04359    0.05876        0.05631  0.06703
C  0.008991    0.02702    0.03917        0.01454  0.06341
Q  0.02738     0.04395    0.03917        0.02631  0.0471
E  0.05476     0.03098    0.04553        0.05931  0.02083
G  0.08051     0.107      0.09052        0.08701  0.0625
H  0.04536     0.01765    0.01747        0.02469  0.06522
I  0.03719     0.04323    0.04923        0.05516  0.1105
L  0.09031     0.06376    0.05823        0.07824  0.01993
K  0.1018      0.04143    0.05929        0.07408  0.05707
M  0.01962     0.005764   0.01323        0.021    0.02536
F  0.05027     0.03062    0.02753        0.03646  0.03351
P  0.03351     0.04575    0.03759        0.04339  0.04982
S  0.04986     0.1228     0.0667         0.07547  0.06159
T  0.04863     0.09114    0.06194        0.05493  0.05435
W  0.01349     0.01585    0.01588        0.01662  0.06975
Y  0.02575     0.03963    0.05717        0.03     0.02264
V  0.06825     0.08249    0.06511        0.08724  0.03804
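When a frequency vector is added to the reference table, the row order matters: the frequency vector printed above is sorted alphabetically by one-letter code (A, C, D, ...), while the reference rows follow the order A, R, N, D, .... Binding the two by position would pair frequencies with the wrong residues, so the sketch below indexes by name. The helper aa_frequency() and the toy sequence are hypothetical.

```r
# Row order used by the reference fold-composition table
aa.order <- c("A", "R", "N", "D", "C", "Q", "E", "G", "H", "I",
              "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

# Hypothetical helper: amino acid proportions returned in a fixed order.
# Using a factor with explicit levels keeps residues with zero counts.
aa_frequency <- function(seq.vector, order = aa.order) {
  counts <- table(factor(seq.vector, levels = order))
  freqs  <- as.numeric(counts) / length(seq.vector)
  names(freqs) <- order
  freqs
}

# Toy sequence: the first 10 residues of human TRPM8
toy.seq  <- c("M", "S", "F", "R", "A", "A", "R", "L", "S", "M")
toy.freq <- aa_frequency(toy.seq)
toy.freq[c("A", "R", "M")]
```

Because the result is named and ordered like the reference table, it can be attached as a new column without shifting any values.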

Functions to Calculate Similarities

Two custom functions are needed: one to calculate the correlation between two columns of our table, and one to calculate their cosine similarity.

##                        A    R    N    D    C    Q    E    G    H    I    L    K
## alpha.prop          0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop           0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop       0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b             0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## TRPM8.human.aa.freq 0.06 0.02 0.05 0.07 0.06 0.05 0.02 0.06 0.07 0.11 0.02 0.06
##                        M    F    P    S    T    W    Y    V
## alpha.prop          0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop           0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop       0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b             0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## TRPM8.human.aa.freq 0.03 0.03 0.05 0.06 0.05 0.07 0.02 0.04
##                     alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop           0.13342098                                    
## a.plus.b.prop       0.09281824 0.08289406                         
## a.div.b             0.06699039 0.08659174    0.06175113           
## TRPM8.human.aa.freq 0.16132094 0.15447076    0.12884017 0.14072023
fold.type        corr.sim  cosine.sim  Euclidean.dist  sim.sum   dist.sum
alpha            0.7949    0.7949      0.1613
beta             0.8154    0.8154      0.1545
alpha plus beta  0.8599    0.8599      0.1288          most.sim  min.dist
alpha/beta       0.8362    0.8362      0.1407
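The three measures in the table can be sketched as small base R functions. The function names and the toy vectors are hypothetical and are not the custom functions or data used in this report; they only show how each measure is defined.

```r
# Pearson correlation between two composition vectors
cor_similarity <- function(x, y) {
  cor(x, y, method = "pearson")
}

# Cosine similarity: dot product divided by the product of the norms
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Euclidean distance between the same two vectors
euclidean_distance <- function(x, y) {
  sqrt(sum((x - y)^2))
}

# Toy vectors for illustration only (not values from the tables above)
x <- c(0.1, 0.2, 0.3, 0.4)
y <- c(0.4, 0.3, 0.2, 0.1)
c(corr   = cor_similarity(x, y),
  cosine = cosine_similarity(x, y),
  euclid = euclidean_distance(x, y))
```

Correlation centers each vector on its mean before comparing them and cosine similarity does not, so the two generally give different values for the same pair of vectors, as the toy example shows.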

Percent Identity Comparisons (PID)

Data preparation

Convert all FASTA records into entries in a single vector. FASTA entries are contained in a list produced at the beginning of the script. They were cleaned to remove the header and newline characters.

##  [1] "NP_076985"      "NP_599013.1"    "NP_599198.2"    "NP_001007083.1"
##  [5] "XP_005049003.1" "NP_001104239.1" "NP_001166561.1" "XP_028687091.1"
##  [9] "NP_001192995.1" "XP_024210920.1"
## [1] 10
## $NP_076985
## [1] "MSFRAARLSMRNRRNDTLDSTRTLYSSASRSTDLSYSESDLVNFIQANFKKRECVFFTKDSKATENVCKCGYAQSQHMEGTQINQSEKWNYKKHTKEFPTDAFGDIQFETLGKKGKYIRLSCDTDAEILYELLTQHWHLKTPNLVISVTGGAKNFALKPRMRKIFSRLIYIAQSKGAWILTGGTHYGLMKYIGEVVRDNTISRSSEENIVAIGIAAWGMVSNRDTLIRNCDAEGYFLAQYLMDDFTRDPLYILDNNHTHLLLVDNGCHGHPTVEAKLRNQLEKYISERTIQDSNYGGKIPIVCFAQGGGKETLKAINTSIKNKIPCVVVEGSGQIADVIASLVEVEDALTSSAVKEKLVRFLPRTVSRLPEEETESWIKWLKEILECSHLLTVIKMEEAGDEIVSNAISYALYKAFSTSEQDKDNWNGQLKLLLEWNQLDLANDEIFTNDRRWESADLQEVMFTALIKDRPKFVRLFLENGLNLRKFLTHDVLTELFSNHFSTLVYRNLQIAKNSYNDALLTFVWKLVANFRRGFRKEDRNGRDEMDIELHDVSPITRHPLQALFIWAILQNKKELSKVIWEQTRGCTLAALGASKLLKTLAKVKNDINAAGESEELANEYETRAVELFTECYSSDEDLAEQLLVYSCEAWGGSNCLELAVEATDQHFIAQPGVQNFLSKQWYGEISRDTKNWKIILCLFIIPLVGCGFVSFRKKPVDKHKKLLWYYVAFFTSPFVVFSWNVVFYIAFLLLFAYVLLMDFHSVPHPPELVLYSLVFVLFCDEVRQWYVNGVNYFTDLWNVMDTLGLFYFIAGIVFRLHSSNKSSLYSGRVIFCLDYIIFTLRLIHIFTVSRNLGPKIIMLQRMLIDVFFFLFLFAVWMVAFGVARQGILRQNEQRWRWIFRSVIYEPYLAMFGQVPSDVDGTTYDFAHCTFTGNESKPLCVELDEHNLPRFPEWITIPLVCIYMLSTNILLVNLLVAMFGYTVGTVQENNDQVWKFQRYFLVQEYCSRLNIPFPFIVFAYFYMVVKKCFKCCCKEKNMESSVCCFKNEDNETLAWEGVMKENYLVKINTKANDTSEEMRHRFRQLDTKLNDLKGLLKEIANKIK"

PID Table

## [1] 93.75
## [1] 93.75
## [1] 80.70652
## [1] 98.55072
## [1] 79.98188
## [1] 79.71014
         Human  Mouse  Rat    Chicken
Human    1      NA     NA     NA
Mouse    93.75  1      NA     NA
Rat      93.75  98.55  1      NA
Chicken  80.71  79.98  79.71  1

method  PID               Denominator
PID1    80.7065217391304  (aligned positions + internal gap positions)
PID2    81.3698630136986  (aligned positions)
PID3    81.3698630136986  (length shorter sequence)
PID4    81.0368349249659  (average length of the two sequences)
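The four denominators in the table can be made concrete with a small base R function that takes two gapped, equal-length aligned strings. The helper pid_from_alignment() and the toy alignment are hypothetical; the denominators follow the descriptions in the table, and this sketch is not the function that produced the values above.

```r
# Hypothetical helper: percent identity under four denominators.
pid_from_alignment <- function(aligned.a, aligned.b) {
  a <- strsplit(aligned.a, "")[[1]]
  b <- strsplit(aligned.b, "")[[1]]
  stopifnot(length(a) == length(b))

  both.residues <- a != "-" & b != "-"          # aligned (ungapped) columns
  identical.n   <- sum(a == b & both.residues)  # identical aligned residues
  aligned.n     <- sum(both.residues)

  # Internal gaps: gap columns between the first and last aligned column
  span   <- range(which(both.residues))
  inside <- seq_along(a) >= span[1] & seq_along(a) <= span[2]
  internal.gap.n <- sum(!both.residues & inside)

  len.a <- sum(a != "-")  # ungapped length of each sequence
  len.b <- sum(b != "-")

  100 * identical.n / c(PID1 = aligned.n + internal.gap.n,
                        PID2 = aligned.n,
                        PID3 = min(len.a, len.b),
                        PID4 = mean(c(len.a, len.b)))
}

# Toy alignment: 4 identical residues, 5 aligned columns, 1 internal gap
pid_from_alignment("ACDEFG", "AC-EYG")
```

For the toy alignment the four values differ only through the denominator (6, 5, 5 and 5.5), which is why PID1 is the smallest and PID2 and PID3 coincide.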

Multiple sequence alignment

MSA Data Preparation

Distance Matrix

I am skipping the ggpubr step due to problems with MSA. Dr. Brouwer said this was okay!

##                NP_599198.2 NP_001166561.1 XP_028687091.1 XP_024210920.1
## NP_001166561.1   0.2388833                                             
## XP_028687091.1   0.2623749      0.2658048                              
## XP_024210920.1   0.2675032      0.2708682      0.2658048               
## NP_076985        0.2500000      0.2500000      0.2481818      0.1563858
## NP_001104239.1   0.2292078      0.2252213      0.2481818      0.2211629
## NP_001192995.1   0.2725351      0.2691910      0.2855201      0.2571443
## NP_001007083.1   0.4400083      0.4431106      0.4420789      0.4410448
## XP_005049003.1   0.4189305      0.4211068      0.4200200      0.4178381
##                NP_076985 NP_001104239.1 NP_001192995.1 NP_001007083.1
## NP_001166561.1                                                       
## XP_028687091.1                                                       
## XP_024210920.1                                                       
## NP_076985                                                            
## NP_001104239.1 0.2272233                                             
## NP_001192995.1 0.2606430      0.2388833                              
## NP_001007083.1 0.4400083      0.4316264      0.4368839               
## XP_005049003.1 0.4200200      0.4112228      0.4167428      0.4211068
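The distances above are consistent with an identity-based distance of the form sqrt(1 - identity), which is how seqinr::dist.alignment() defines its output. That this function produced the matrix is an assumption. A worked example with the human vs. rat identity from the PID table:

```r
# Convert a percent identity to the distance sqrt(1 - identity)
pid_to_distance <- function(pid) {
  sqrt(1 - pid / 100)
}

# Human vs. rat pairwise identity reported in the PID table
pid_to_distance(93.75)
# 0.25
```

The result matches the 0.2500000 entry for NP_076985 vs. NP_599198.2 in the matrix. Entries based on the multiple sequence alignment need not match the pairwise PID values exactly, because the two alignments can place gaps differently.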

Phylogenetic tree for all sequences

Plotting phylogenetic trees for all sequences
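A tree can be built from a distance matrix with neighbor joining in ape. The sketch below is an illustration under stated assumptions: it uses only four of the sequences, with distances copied from the matrix above, and neighbor joining is one common choice rather than necessarily the method used for this report.

```r
library(ape)

# Distances copied from the matrix above for four of the sequences
ids <- c("NP_076985", "NP_599198.2", "NP_001007083.1", "XP_005049003.1")
d <- matrix(0, nrow = 4, ncol = 4, dimnames = list(ids, ids))
d["NP_076985", "NP_599198.2"]         <- 0.2500000
d["NP_076985", "NP_001007083.1"]      <- 0.4400083
d["NP_076985", "XP_005049003.1"]      <- 0.4200200
d["NP_599198.2", "NP_001007083.1"]    <- 0.4400083
d["NP_599198.2", "XP_005049003.1"]    <- 0.4189305
d["NP_001007083.1", "XP_005049003.1"] <- 0.4211068
d <- d + t(d)  # mirror the upper triangle to make the matrix symmetric

# Build an unrooted neighbor-joining tree and plot it
tree <- ape::nj(as.dist(d))
plot(tree, type = "unrooted", main = "TRPM8 neighbor-joining tree (subset)")
```

Passing the full distance object for all sequences to ape::nj() in place of the four-sequence subset would give the tree for the whole data set.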