NOTE to STUDENTS
TODO: Brief statement in of the gene and its full name and in your own words what is done in your script
EXAMPLE:
This code compiles summary information about the gene DIO1 (iodothyronine deiodinase 1).
It also generates alignments and a phylogeneitc tree to indicating the evolutionary relationship betweeen the human version of the gene and its homologs in other species.
Key information use to make this script can be found here:
Other resources consulted includes
Other interesting resources and online tools include:
Load necessary packages:
Download and load drawProteins from Bioconductor
#
library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
## is available with R version '4.1'; see https://bioconductor.org/install
#install("drawProteins")
library(drawProteins)
Load other packages
# github packages
# CRAN packages
# Bioconductor packages
# github packages
library(compbio4all)
library(ggmsa)
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
library(pander)
library(ggplot2)
# Bioconductor packages
## msa
### The msa package is having problems on some platforms
### You can skip the msa steps if necessary. The msa output
### is used to make a distance matrix and then phylogenetics trees,
### but I provide code to build the matrix by hand so
### you can proceed even if msa doesn't work for you.
library(msa)
library(drawProteins)
## Biostrings
library(Biostrings)
library(HGNChelper)
TODO: Brief summary of where information was obtained, and if certain kinds of information was not available.
Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB. The the Neanderthal genome database was searched but did not yield sequence information on.
A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out excluding vertebrates to determine if it occurred outside of vertebreates. The gene does not appear in non-vertebrates and so a second search was conducted to exclude mammals.
OPTIONAL: Use the function to confirm the validity of your gene name and any aliases
# this is optional
HGNChelper::checkGeneSymbols(x = c("DIO1"))
## Maps last updated on: Thu Oct 24 12:31:05 2019
## x Approved Suggested.Symbol
## 1 DIO1 TRUE DIO1
Not available:
Does not occur:
NOTE TO STUDENTS There are many ways to make this table. I prefer making it first using matrix() because it allows me to lay out all the information lined up in a table and make sure all the element line up correctly.
**TODO*: add additional sequences to get a total of 10
# RefSeq Uniprot PDB sci name common name gene name
dio1_table<-c("NP_000783", "P49895","NA","Homo sapiens" , "Human", "DIO1",
"NP_001116123","NA", "NA", "Pan troglodytes" , "Chimpanzee","DIO1",
"NP_031886", "Q61153","NA","Mus musculus", "Mouse" ,"DIO1",
"NP_001091083","P24389","NA","Rattus norvegicus", "Rat", "Dio1",
"NP_001243226","Q2QEI3","NA","Xenopus tropicalis","Frog", "dio1",
"NP_001007284","F1R7E6","NA","Danio rerio", "Fish", "dio1")
## **TODO*: add additional sequences to get a total of 10
Convert vector information into a table
#
## [1] 6 6
## [1] "data.frame" "list" "oldClass" "vector"
## [5] "list_OR_List" "vector_OR_Vector" "vector_OR_factor"
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
The finished table
pander::pander(dio1_table)
| ncbi.protein.accession | UniProt.id | PDB | species | common.name |
|---|---|---|---|---|
| NP_000783 | P49895 | NA | Homo sapiens | Human |
| NP_001116123 | NA | NA | Pan troglodytes | Chimpanzee |
| NP_031886 | Q61153 | NA | Mus musculus | Mouse |
| NP_001091083 | P24389 | NA | Rattus norvegicus | Rat |
| NP_001243226 | Q2QEI3 | NA | Xenopus tropicalis | Frog |
| NP_001007284 | F1R7E6 | NA | Danio rerio | Fish |
| gene.name |
|---|
| DIO1 |
| DIO1 |
| DIO1 |
| Dio1 |
| dio1 |
| dio1 |
All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.
# download FASTA sequences
Number of FASTA files obtained
length(dio1s_list)
## [1] 6
The first entry
dio1s_list[[1]]
## [1] ">NP_000783.2 type I iodothyronine deiodinase isoform a [Homo sapiens]\nMGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNPHFSHDNWIPTFFSTQY\nFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGNRPLVLNFGSCTUPSFMFKFDQFKRLI\nEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNHQNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYA\nALPERLYIIQEGRILYKGKSGPWNYNPEEVRAVLEKLHS\n\n"
Remove FASTA header
for(i in 1:length(dio1s_list)){
dio1s_list[[i]] <- compbio4all::fasta_cleaner(dio1s_list[[i]], parse = F)
}
Specific additional cleaning steps will be as needed for particular analyses
Prepare data
#
dotPlot(dio1s_human_vector,dio1s_human_vector)
TODO:
TODO Create table
Below are links to relevant information. This particular protein is not in
A the Danio homolog is listed in Alphafold (https://alphafold.ebi.ac.uk/entry/F1R7E6). The predicted structure contains alpha helices, beta sheets, and disordered regions.
Because this protein is poorly characterized I used IUPred2A to determine if there were any disordered regions (https://iupred2a.elte.hu/). No peaks exceeded the threshold of 0.5.
Multivariate statistcal techniques were used to confirm the information about protein structure and location in the line database.
Uniprot (which uses http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc-2/ I believe) indicates that the protein is a membrane-bound protein, particularlly in the ER.
Alphafold indicates that there are a mix of alpha helices and beta sheets. I therefore predict that machine-learning methods will indicate an a+b and a/b structure.
NOTE: My protein contains a “U” for an unknown amino acid. I removed this from the sequence because it is otherwise undefined.
First, I need the data from Chou and Zhang (1994) Table 5. Code to build this table is available at https://rpubs.com/lowbrowR/843543
The table looks like this:
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
| aa.1.1 | alpha | beta | a.plus.b | a.div.b |
|---|---|---|---|---|
| A | 285 | 203 | 175 | 361 |
| R | 53 | 67 | 78 | 146 |
| N | 97 | 139 | 120 | 183 |
| D | 163 | 121 | 111 | 244 |
| C | 22 | 75 | 74 | 63 |
| Q | 67 | 122 | 74 | 114 |
| E | 134 | 86 | 86 | 257 |
| G | 197 | 297 | 171 | 377 |
| H | 111 | 49 | 33 | 107 |
| I | 91 | 120 | 93 | 239 |
| L | 221 | 177 | 110 | 339 |
| K | 249 | 115 | 112 | 321 |
| M | 48 | 16 | 25 | 91 |
| F | 123 | 85 | 52 | 158 |
| P | 82 | 127 | 71 | 188 |
| S | 122 | 341 | 126 | 327 |
| T | 119 | 253 | 117 | 238 |
| W | 33 | 44 | 30 | 72 |
| Y | 63 | 110 | 108 | 130 |
| V | 167 | 229 | 123 | 378 |
Convert to frequencies
#
Table 5 therefore becomes this
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
Determine the number of each amino acid in my protein.
#
A Function to convert a table into a vector is helpful here because R is goofy about tables not being the same as vectors.
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
dio1s_human_table <- table(dio1s_human_vector)/length(dio1s_human_vector)
DIO1.human.aa.freq <- table_to_vector(dio1s_human_table)
DIO1.human.aa.freq
## A C D E F G
## 0.082677165 0.015748031 0.039370079 0.055118110 0.066929134 0.059055118
## H I K L M N
## 0.019685039 0.047244094 0.078740157 0.110236220 0.035433071 0.035433071
## P Q R S T U
## 0.051181102 0.019685039 0.051181102 0.066929134 0.039370079 0.003937008
## V W Y
## 0.074803150 0.015748031 0.031496063
Check for the presence of “U” (unknown aa.)
aa.names <- names(DIO1.human.aa.freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## [1] "U"
DIO1.human.aa.freq[i.U]
## U
## 0.003937008
Remove the U (would be better to remove form the original sequence, but this will work)
#
## [1] 0.996063
Add data on my focal protein to the amino acid frequency table.
aa.prop$DIO1.human.aa.freq <- DIO1.human.aa.freq
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | DIO1.human.aa.freq | |
|---|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 | 0.08268 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 | 0.01575 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 | 0.03937 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 | 0.05512 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 | 0.06693 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 | 0.05906 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 | 0.01969 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 | 0.04724 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 | 0.07874 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 | 0.1102 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 | 0.03543 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 | 0.03543 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 | 0.05118 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 | 0.01969 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 | 0.05118 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 | 0.06693 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 | 0.03937 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 | 0.0748 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 | 0.01575 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 | 0.0315 |
Two custom functions are needed: one to calculate correlates between two columns of our table, and one to calculate correlation similarities.
# Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
Calculate correlation between each column
#
Calculate cosine similarity
#
Calculate distance.
Note: we need to flip the dataframe on its side using a command called t()
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
## A R N D C Q E G H I L K
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## DIO1.human.aa.freq 0.08 0.02 0.04 0.06 0.07 0.06 0.02 0.05 0.08 0.11 0.04 0.04
## M F P S T W Y V
## alpha.prop 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## DIO1.human.aa.freq 0.05 0.02 0.05 0.07 0.04 0.07 0.02 0.03
We can get distance matrix like this
#
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## DIO1.human.aa.freq 0.17233193 0.17219592 0.14877223 0.15762267
Individual distances using dist()
#
Compile the information. Rounding makes it easier to read
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
Display output
pander::pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.7691 | 0.7691 | 0.1723 | ||
| beta | 0.7731 | 0.7731 | 0.1722 | ||
| alpha plus beta | 0.8169 | 0.8169 | 0.1488 | most.sim | min.dist |
| alpha/beta | 0.7981 | 0.7981 | 0.1576 |
TBD
# ec <- c(8.6, 2.9, 4.9, 5.1, 3.7, 7.8, 2.1, 4.6, 6.3, 8.8, 2.5, 4.6, 4.9,
# 4, 4.2, 7.3, 6, 6.7, 1.4, 3.6)/100
#
# an <- c(7.6, 2.2, 5.2, 6.2, 4.0, 6.9, 2.1, 5.1, 5.8, 9.4, 2.1, 4.4, 5.4, 4.1,
# 5.0, 7.2, 6.1, 6.7, 1.4, 3.2)/100
#
# df <- data.frame(ec,an)
# ave.vect <- apply(df,1,mean)
#
#
#
# cor.mat <- matrix(NA, 20, nrow = 20, ncol = 20)
#
# for(i in 1:20){
# for(j in 1:20){
# cor.mat[i,j] <- (ec[j]-ave.vect[i])*(ec[i]-ave.vect[j])
# }
# }
#
# t(ec-ave.vect)%*%ginv(cor.mat)%*%(ec-ave.vect)
Convert all FASTA records intro entries in a single vector. FASTA entries are contained in a list produced at the beginning of the script. They were cleaned to remove the header and newline characters.
names(dio1s_list)
## [1] "NP_000783" "NP_001116123" "NP_031886" "NP_001091083" "NP_001243226"
## [6] "NP_001007284"
length(dio1s_list)
## [1] 6
Each entry is a full entry with no spaces or parsing, and no header
dio1s_list[1]
## $NP_000783
## [1] "MGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNPHFSHDNWIPTFFSTQYFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGNRPLVLNFGSCTUPSFMFKFDQFKRLIEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNHQNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYAALPERLYIIQEGRILYKGKSGPWNYNPEEVRAVLEKLHS"
Make each entry of the list into a vector. There are several ways to do this.
#
Name the vector
# name the vector
names(dio1s_vector) <- names(dio1s_list)
Do pairwise alignments for humans, chimps and 2-other species.
Biostrings::pid(align01.02)
## [1] 99.19679
Biostrings::pid(align01.05)
## [1] 52.17391
Biostrings::pid(align01.06)
## [1] 48.04688
Build matrix
#
pids <- c(1, NA, NA, NA,
pid(align01.02), 1, NA, NA,
pid(align01.05), pid(align02.05), 1, NA,
pid(align01.06), pid(align02.06), pid(align05.06), 1)
mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Pan","Rat","Fish")
colnames(mat) <- c("Homo","Pan","Rat","Fish")
pander::pander(mat)
| Homo | Pan | Rat | Fish | |
|---|---|---|---|---|
| Homo | 1 | NA | NA | NA |
| Pan | 99.2 | 1 | NA | NA |
| Rat | 52.17 | 48.44 | 1 | NA |
| Fish | 48.05 | 48.44 | 50.2 | 1 |
Compare different PID methods. I did this for Humans vs. chimps and also for another comparison out of curiousity. You only have to do chimps.
#
A comparison of chimps and human PID with different methods.
| method | PID | denominator |
|---|---|---|
| PID1 | 99.2 | (aligned positions + internal gap positions) |
| PID2 | 99.2 | (aligned positions) |
| PID3 | 99.2 | (length shorter sequence) |
| PID4 | 99.2 | (average length of the two sequences) |
A comparison of fish and human PID with different methods.
| method | PID | denominator |
|---|---|---|
| PID1 | 48.05 | (aligned positions + internal gap positions) |
| PID2 | 50 | (aligned positions) |
| PID3 | 49.4 | (length shorter sequence) |
| PID4 | 48.91 | (average length of the two sequences) |
For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.
dio1s_vector_ss <- Biostrings::AAStringSet(dio1s_vector)
#
## use default substitution matrix
msa produces a species MSA objects
class(dio1s_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(dio1s_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment" "MsaMetaData"
## [4] "MultipleAlignment"
Default output of MSA
## CLUSTAL 2.1
##
## Call:
## msa::msa(dio1s_vector_ss, method = "ClustalW")
##
## MsaAAMultipleAlignment with 6 rows and 263 columns
## aln names
## [1] ------MGLPQPGLWLKRLWVLLEVA...WNYNPEEVRAVLEKLHS-------- NP_000783
## [2] ------MGLPQPGLWLKRLWVLLEVA...WNYNPEEVRAVLEKLHS-------- NP_001116123
## [3] ------MGLPQLWLWLKRLVIFLQVA...WNYNPEEVRAVLEKLCTPPRHVPQL NP_031886
## [4] --------MLSIGVLLHKLLILLQVT...WNYHPQEIRAVLEKLK--------- NP_001091083
## [5] -MESLLQTIKLMVRFIQKTMIFFFLF...WGYKPEEVHSVLEKKK--------- NP_001243226
## [6] MGSAVGFALRKLFVYISAVLMVCAAI...WGYKPEEVRKVLEKSK--------- NP_001007284
## Con ------MGLPQ?GLWLKRL?ILL?VA...WNYNPEEVRAVLEKLK--------- Consensus
Change class of alignment
#
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
## [1] "AAMultipleAlignment"
Convert to seqinr format
#
## [1] "AAMultipleAlignment" "MultipleAlignment"
## [1] "AAMultipleAlignment"
## [1] "alignment"
## [1] "alignment"
OPTIONAL: show output with print_msa
compbio4all::print_msa(dio1s_align_seqinr)
## [1] "------MGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNP 0"
## [1] "------MGLPQPGLWLKRLWVLLEVAVHVVVGKVLLILFPDRVKRNILAMGEKTGMTRNP 0"
## [1] "------MGLPQLWLWLKRLVIFLQVALEVAVGKVLMTLFPGRVKQSILAMGQKTGMARNP 0"
## [1] "--------MLSIGVLLHKLLILLQVTLSVVVGKTMMILFPDATKRYILKLGEKSRMNQNP 0"
## [1] "-MESLLQTIKLMVRFIQKTMIFFFLFIYVVVGKVLMFFFP-QTMASVLKSRFETTGVHDP 0"
## [1] "MGSAVGFALRKLFVYISAVLMVCAAILRMSMLKLLSFISPGRMRKIHMKMGERTTMTQNP 0"
## [1] " "
## [1] "HFSHDNWIPTFFSTQYFWFVLKVRWQRLEDTTELGGLAPNCPVVRLSGQRCNIWEFMQGN 0"
## [1] "HFSHDNWIPTFFSTQYFWFVLKVRWQRLEDTTELGGLAPNCPVVHLSGQRCNIWEFMQGN 0"
## [1] "RFAPDNWVPTFFSIQYFWFVLKVRWQRLEDRAEFGGLAPNCTVVCLSGQKCNIWDFIQGS 0"
## [1] "KFSYENWGPTFFSFQYLLFVLKVKWRRLEDEAHEGRPAPNTPVVALNGEMQHLFSFMRDN 0"
## [1] "KFQYEDWGPTFFTYKFLRSVLEIMWLRLEDEAFVGHSAPNTPVIDLNGELHHIWDYLQGT 0"
## [1] "KFRYEDWGPAFFSLAFIKTLFFVNWCSLGLEAFEGHSAPDSALITLDRQKTSVHRFLKGN 0"
## [1] " "
## [1] "RPLVLNFGSCTUPSFMFKFDQFKRLIEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNH 0"
## [1] "RPLVLNFGSCTUPSFMFKFDQFKRLIEDFSSIADFLVIYIEEAHASDGWAFKNNMDIRNH 0"
## [1] "RPLVLNFGSCTUPSFLLKFDQFKRLVDDFASTADFLIIYIEEAHATDGWAFKNNVDIRQH 0"
## [1] "RPLILNFGSCTUPSFMLKFDEFNKLVKDFSSIADFLIIYIEEAHAVDGWAFRNNVVIKNH 0"
## [1] "RPLVLNFGSCTUPPFLFRLGEFNKLVNDFSSIADFLIIYIDEAHAADEWALKNNLHIKKH 0"
## [1] "RPLVLSFGSCTUPPFLYKLDEFKQLVKDFSNVADFLIVYLAEAHATDAWAFKNNVDISVH 0"
## [1] " "
## [1] "QNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYAALPERLYIIQEGRILYKGKSGPWN 0"
## [1] "QNLQDRLQAAHLLLARSPQCPVVVDTMQNQSSQLYAALPERLYVIQEGRILYKGKSGPWN 0"
## [1] "RSLQERVRAARMLLARSPQCPVVVDTMQNQSSQLYAALPERLYVIQEGRICYKGKAGPWN 0"
## [1] "RSLEDRKTAAQFLQQKNPLCPVVLDTMENLSSSKYAALPERLYILQAGNVIYKGGVGPWN 0"
## [1] "RCLQDRLAAAKRLLEELPSCPVVLDTMSNLCSAKYAALPERLYILQEGKIIYKGKMGPWG 0"
## [1] "KNLEERLAAARTLLKEDPPCPVVVDEMNNITASKYGALPERLYVIQSGKVIYQGGIGPWG 0"
## [1] " "
## [1] "YNPEEVRAVLEKLHS-------- 37"
## [1] "YNPEEVRAVLEKLHS-------- 37"
## [1] "YNPEEVRAVLEKLCTPPRHVPQL 37"
## [1] "YHPQEIRAVLEKLK--------- 37"
## [1] "YKPEEVHSVLEKKK--------- 37"
## [1] "YKPEEVRKVLEKSK--------- 37"
## [1] " "
Based on the output from drawProtiens, the first 50 amino acids appears to contain an interesting helical section.
NOTE: Key step - must have class set properly for ggmsa to work!
Make a distance matrix
This produces a “dist” class object.
is(dio1s_subset_dist)
## [1] "dist" "oldClass"
class(dio1s_subset_dist)
## [1] "dist"
Round for display
dio1s_align_seqinr_rnd <- round(dio1s_subset_dist, 3)
dio1s_align_seqinr_rnd
## NP_000783 NP_001116123 NP_031886 NP_001091083 NP_001243226
## NP_001116123 0.090
## NP_031886 0.458 0.453
## NP_001091083 0.619 0.623 0.639
## NP_001243226 0.698 0.701 0.693 0.662
## NP_001007284 0.739 0.737 0.720 0.723 0.720
Build a phylogenetic tree from distance matrix
#
Plot the tree.
#