Introduction

The EGLN1 gene encodes Hypoxia-inducible factor prolyl hydroxylase 2 (HIF-PH2), or prolyl hydroxylase domain-containing protein 2 (PHD2). Upregulation of this gene is caused by the cllular response to hypoxia, or when the body is deprived of oxygen.

In this portfolio, we will be looking at the sequences of the EGLN1 gene of various species by grabbing their various accession numbers and fasta files. We will then prepare the data for various purposes such as diagramming key domains in the proteins encoded by the gene, using dot plots to find sequence repeats, finding various protein properties described by the human gene, predicting the secondary structures, creating percent identity comparisons for different species, performing a multiple sequence alignment, and finally plotting a phylogenetic tree for all sequences.

Resources/References

Here are the links to resources used to help make this script:

Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/54583

Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/?term=EGLN1

Uniprot: https://www.uniprot.org/uniprot/?query=EGLN1&sort=score

Protein Data Bank: https://www.rcsb.org/

Blast: https://blast.ncbi.nlm.nih.gov/Blast.cgi

Pfam: http://pfam.xfam.org/protein/Q9GZT9

Links to resources consulted:

Neanderthal genome: http://neandertal.ensemblgenomes.org/index.html

Disprot: https://www.disprot.org/

RepeatsDP: https://repeatsdb.bio.unipd.it/

Preparation

Here is the necessary packages needed throughout this portfolio.

# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
library(pander)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
# Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:seqinr':
## 
##     translate
## The following object is masked from 'package:base':
## 
##     strsplit
library(Biostrings)

library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
##   is available with R version '4.1'; see https://bioconductor.org/install
## 
## Attaching package: 'BiocManager'
## The following object is masked from 'package:msa':
## 
##     version
#install("drawProteins")
library(drawProteins)

Accession Numbers

Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB and Blast searches (last three). UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession number. The Neanderthal Genome database was searched with the gene name however I was unable to download the fasta files with the given ID so I decided not to use it . Blast searches were conducted with the Human accession number and the three most genetically distant were chosen.

ncbi_protein_accession <- c("NP_071334", "XP_525092.2", "NP_444437.2", "XP_001104870.1", "XP_002728657.3", "NP_001002595.2", "NP_001015960.1", "XP_032955515.1", "XP_026268254.1", "XP_020929239.1")

UniProt_id <- c("Q9GZT9", "K7C5B9", "Q91YE3", "A0A5F8AKN1", "P59722", "NA", "NA", "NA", "NA", "NA")

PDB <- c("4KBZ", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")

species <- c("Homo sapiens", "Pan troglodytes", "Mus musculus", "Macaca mulatta", "Rattus norvegicus", "Danio rerio", "Xenopus tropicalis", "   Rhinolophus ferrumequinum", "Urocitellus parryii", "Sus scrofa")

common_name <- c("Humans", "Chimpanzees", "Mice", "Monkeys", "Rats", "Fish", 
                 "Frogs", "Bat", "Squirrel", "Pig")

gene_name <- c("EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1")

Accession Numbers Table

Here is a table showing all the accession numbers with their respective species. Anything designated with NA was unable to be found.

EGLN1.df <- data.frame(ncbi_protein_accession, UniProt_id, PDB, species, common_name, gene_name)

pander::pander(EGLN1.df)
Table continues below
ncbi_protein_accession UniProt_id PDB species
NP_071334 Q9GZT9 4KBZ Homo sapiens
XP_525092.2 K7C5B9 NA Pan troglodytes
NP_444437.2 Q91YE3 NA Mus musculus
XP_001104870.1 A0A5F8AKN1 NA Macaca mulatta
XP_002728657.3 P59722 NA Rattus norvegicus
NP_001002595.2 NA NA Danio rerio
NP_001015960.1 NA NA Xenopus tropicalis
XP_032955515.1 NA NA Rhinolophus ferrumequinum
XP_026268254.1 NA NA Urocitellus parryii
XP_020929239.1 NA NA Sus scrofa
common_name gene_name
Humans EGLN1
Chimpanzees EGLN1
Mice EGLN1
Monkeys EGLN1
Rats EGLN1
Fish EGLN1
Frogs EGLN1
Bat EGLN1
Squirrel EGLN1
Pig EGLN1

Data Preparation

In the code below, we will simply download the fasta files for each sequence.

human_fasta <- entrez_fetch(db = "protein", 
                          id = "NP_071334", 
                          rettype = "fasta")
chimp_fasta <- entrez_fetch(db = "protein", 
                          id = "XP_525092.2", 
                          rettype = "fasta")
mouse_fasta <- entrez_fetch(db = "protein", 
                          id = "NP_444437.2", 
                          rettype = "fasta")
monkey_fasta <- entrez_fetch(db = "protein", 
                          id = "XP_001104870.1", 
                          rettype = "fasta")
rat_fasta <- entrez_fetch(db = "protein", 
                          id = "XP_002728657.3", 
                          rettype = "fasta")
fish_fasta <- entrez_fetch(db = "protein", 
                          id = "NP_001002595.2", 
                          rettype = "fasta")
frog_fasta <- entrez_fetch(db = "protein", 
                          id = "NP_001015960.1", 
                          rettype = "fasta")
bat_fasta <- entrez_fetch(db = "protein", 
                          id = "XP_032955515.1", 
                          rettype = "fasta")
squirrel_fasta <- entrez_fetch(db = "protein", 
                          id = "XP_026268254.1", 
                          rettype = "fasta")
pig_fasta <- entrez_fetch(db = "protein", 
                          id = "XP_020929239.1", 
                          rettype = "fasta")

Here we will put all of our fasta files into one list and then clean up each fasta file so we can use it. We will use the the cat() function to make the output similar to that of a web browser/word processor and making it easier to read with new lines. We will then loop through the list to remove the headings of each fasta file. The last line shows what the new fasta files look like.

EGLN1_fasta_list <- c(human_fasta, chimp_fasta, mouse_fasta, monkey_fasta, rat_fasta, fish_fasta, frog_fasta, bat_fasta, squirrel_fasta, pig_fasta)

cat(EGLN1_fasta_list)
## >NP_071334.1 egl nine homolog 1 isoform 1 [Homo sapiens]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSEGALGHGVG
## PHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKPPADPAAAASPCRAAAGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
## 
##  >XP_525092.2 PREDICTED: egl nine homolog 1 [Pan troglodytes]
## MYKEKHIAHVCIWQNPQYRDKQDRNSHLATLTGASLAHPAPAWRKSSHDYAAFAPLPACGWRPPPAGVGA
## AALPPRLFRRFAGPCALGVPRRRPAAGPKAPGKRPGKSTVKPPADPAAAASPSRAAAGGQGSAVAAEAEP
## GKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFL
## GKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCN
## GKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFA
## DIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDSVSKD
## VF
## 
##  >NP_444437.2 egl nine homolog 1 isoform 1 [Mus musculus]
## MASDSGGPGVLSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGEAPRAQPAP
## AQPRVAPPPGGAPGAARAGGAARRGDSAAASRVPGPEDAAQARSGPGPAEPGSEDPPLSRSPGPERASLC
## PAGGGPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTG
## KFTDGQLVSQKSDSSKDIRGDQITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACY
## PGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNP
## HEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
## 
##  >XP_001104870.1 egl nine homolog 1 isoform X1 [Macaca mulatta]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSESALGPGVG
## PHQHSGPAPPVAAPPPRAGAREPRKAAARRDNASGDAAKAKAKGKPPADPAAAASPSRAAPGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
## 
##  >XP_002728657.3 PREDICTED: egl nine homolog 1 [Rattus norvegicus]
## MNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGWRPSAP
## GAARAGGAARRGDSSTAASRVPGPEDATQAGSGPGPAEPSSEDPPPSRSPGPERASLCPAGGGPGEALSP
## SGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKS
## DSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACYPGNGTGYVRHVD
## NPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYA
## ITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
## 
##  >NP_001002595.2 egl nine homolog 1b [Danio rerio]
## MEGNSRESERLERERQFCELCGKMENLMKCGRCRSSFYCSKEHQRQDWKKHKRVCKEADKQQQPVGEEEE
## EQPSDTSKQSSTSQSNSTLTSQDEKMPDFIKSATGSDTKPTPDSSKPNGQTRSPPQKLATDYIVPCMNKH
## GICVVDNFLGNEVGRSILEDVRALYLTGGFTDGQLVSQRSDSSKDIRGDKITWVEGKEPGCERIAFLMSR
## MDDLIRHCNGKLGNYRINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDATEHGGLLRI
## FPEGTAQFADIEPKFDRLLLFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKEKYSTSSGERGVKVE
## LNKPSEPS
## 
##  >NP_001015960.1 egl nine homolog 1 [Xenopus tropicalis]
## MAGGGSEGSNQSERDRQYCELCGKMEDLLRCGRCRSSFYCSKEHQRQDWKKHKLFCKSDSTITPASQNTV
## KNSKKEQFSSSAVSISHSDTSQFKSHVEPASGVSDEQEEVDGGAALKAINEEDDTQVSGEAMGSSTSRKV
## TTSGGRPNGQTKPPLHRIALEYTIPCMNKHGICVLDDFLGQETGDRIVCEVKALHNTGRFTDGQLVSQKS
## DSTRDIRGDQITWVEGKESSCKAIGKLMNKMDDLIRHCSGKLGNFRINGRTKAMVACYPGNGTGYVRHVD
## NPNADGRCVTCIYYLNKQWDAKTHGGLLRIFPEGKSQFADIEPKFDRLLLFWSDRRNPHEVQPAFATRYA
## ITVWYFDADERARAKEKYLNTGERGVRIELNKPSEQVVKEVQNP
## 
##  >XP_032955515.1 egl nine homolog 1 isoform X1 [Rhinolophus ferrumequinum]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCSKEHQRQDWKKHKLVCQRGDGAPGPGAS
## QHPHPGPAPPAAAPPPRATGAKEARTAAPRRDSAAAGDAADAKAKPAAAASPPRAAPGGQGPAAAAATAT
## AEAEPAKEEPLGRSSLFQDKANLYPPGNTPGEALSHGGGPRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGEEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SISKDVL
## 
##  >XP_026268254.1 egl nine homolog 1 [Urocitellus parryii]
## MASDSGGPGGPSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGDPAGPAAAP
## RAQPGPAPPAAAPPTRAGAAAGNAGARAGKAAGQRQGSAEAARAPAPGDAAQARGPRGAAECGQEEPPAR
## RSPCPERASLCPAGSSPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQI
## GDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYKI
## NGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDR
## LLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDPVGKDV
## 
##  >XP_020929239.1 egl nine homolog 1 isoform X1 [Sus scrofa]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCGRCRSSFYCSKEHQRQDWKKHKLVCQGGEGAPAPGAG
## QQPQPGPAPLLRKAVPRRDRVAAVEAAGPRAAGDAAKAKAKAKPAADPAAAASPPRAAPGGAGPAAAAAA
## AEAEPAKEEPPARSSLYQEKANLYPPSNTPGEGLSHGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SVSKDVL
for(i in 1:length(EGLN1_fasta_list)){
  EGLN1_fasta_list[[i]] <- compbio4all::fasta_cleaner(EGLN1_fasta_list[[i]], parse = F)
}

EGLN1_fasta_list[1]
## [1] "MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSEGALGHGVGPHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKPPADPAAAASPCRAAAGGQGSAVAAEAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDSVGKDVF"

General Protein Information

We can use the UniProt accession numbers to download our data (note we only have 5 proteins with UniProt accession numbers). We then can make a dataframe that contains information on domains, chains, regions, motifs, phosphorylated sites and repeats.

human_uniprot <- drawProteins::get_features("Q9GZT9")
## [1] "Download has worked"
chimp_uniprot <- drawProteins::get_features("K7C5B9")
## [1] "Download has worked"
mouse_uniprot <- drawProteins::get_features("Q91YE3")
## [1] "Download has worked"
monkey_uniprot <- drawProteins::get_features("A0A5F8AKN1")
## [1] "Download has worked"
rat_uniprot <- drawProteins::get_features("P59722")
## [1] "Download has worked"
human_uniprot_df <- drawProteins::feature_to_dataframe(human_uniprot)
chimp_uniprot_df <- drawProteins::feature_to_dataframe(chimp_uniprot)
mouse_uniprot_df <- drawProteins::feature_to_dataframe(mouse_uniprot)
monkey_uniprot_df <- drawProteins::feature_to_dataframe(monkey_uniprot)
rat_uniprot_df <- drawProteins::feature_to_dataframe(rat_uniprot)
## Warning in drawProteins::extract_feat_acc(features_in_lists_of_six[[i]]): NAs
## introduced by coercion

Protein Diagram

We can now plot the data. Below are the domains and regions found in the human protein.

human_canvas <- draw_canvas(human_uniprot_df)  
human_canvas <- draw_chains(human_canvas, human_uniprot_df, 
                         label_size = 2.5)
human_canvas <- draw_domains(human_canvas, human_uniprot_df)
human_canvas <- draw_regions(human_canvas, human_uniprot_df)
human_canvas

Here is the data for the protein in chimps.

chimp_canvas <- draw_canvas(chimp_uniprot_df)  
chimp_canvas <- draw_chains(chimp_canvas, chimp_uniprot_df, 
                         label_size = 2.5)
chimp_canvas <- draw_domains(chimp_canvas, chimp_uniprot_df)
chimp_canvas <- draw_regions(chimp_canvas, chimp_uniprot_df)
chimp_canvas

Now for the protein in mice.

mouse_canvas <- draw_canvas(mouse_uniprot_df)  
mouse_canvas <- draw_chains(mouse_canvas, mouse_uniprot_df, 
                         label_size = 2.5)
mouse_canvas <- draw_domains(mouse_canvas, mouse_uniprot_df)
mouse_canvas <- draw_regions(mouse_canvas, mouse_uniprot_df)
mouse_canvas

Here is the data for the protein in monkeys.

monkey_canvas <- draw_canvas(monkey_uniprot_df)  
monkey_canvas <- draw_chains(monkey_canvas, monkey_uniprot_df, 
                         label_size = 2.5)
monkey_canvas <- draw_domains(monkey_canvas, monkey_uniprot_df)
monkey_canvas <- draw_folding(monkey_canvas, monkey_uniprot_df)
monkey_canvas <- draw_regions(monkey_canvas, monkey_uniprot_df)
monkey_canvas

And finally, here is the data for the protein in rats.

rat_canvas <- draw_canvas(rat_uniprot_df)  
rat_canvas <- draw_chains(rat_canvas, rat_uniprot_df, 
                         label_size = 2.5)
rat_canvas <- draw_domains(rat_canvas, rat_uniprot_df)
rat_canvas <- draw_regions(rat_canvas, rat_uniprot_df)
rat_canvas
## Warning: Removed 1 rows containing missing values (geom_rect).

Draw Dotplot

We can make dotplots of each fasta file sequence plotted against itself to discover areas of sequence repeats. We can create vectors containing single amino acid characters using our list of all the fasta files. Then we can make our dotplot with the dotPlot() function. Below will show four different dotplots all with different settings used for the human version.

human_vector <- fasta_cleaner(EGLN1_fasta_list[1])

par(mfrow = c(2,2), 
    mar = c(0,0,1,1))

dotPlot(human_vector, human_vector)

dotPlot(human_vector, human_vector, 
        wsize = 10, 
        nmatch = 1, 
        )

dotPlot(human_vector, human_vector, 
        wsize = 10, 
        nmatch = 5, 
        )

dotPlot(human_vector, human_vector, 
        wsize = 20, 
        nmatch = 1, 
        )

Now we can plot the best version of the plot.

dotPlot(human_vector, human_vector)

Now we will repeat for the remaining sequences, showing the final version.

chimp_vector <- fasta_cleaner(EGLN1_fasta_list[2])

dotPlot(chimp_vector, chimp_vector)

mouse_vector <- fasta_cleaner(EGLN1_fasta_list[3])

dotPlot(mouse_vector, mouse_vector)

monkey_vector <- fasta_cleaner(EGLN1_fasta_list[4])

dotPlot(monkey_vector, monkey_vector)

rat_vector <- fasta_cleaner(EGLN1_fasta_list[5])

dotPlot(rat_vector, rat_vector)

fish_vector <- fasta_cleaner(EGLN1_fasta_list[6])

dotPlot(fish_vector, fish_vector)

frog_vector <- fasta_cleaner(EGLN1_fasta_list[7])

dotPlot(frog_vector, frog_vector)

bat_vector <- fasta_cleaner(EGLN1_fasta_list[8])

dotPlot(bat_vector, bat_vector)

squirrel_vector <- fasta_cleaner(EGLN1_fasta_list[9])

dotPlot(squirrel_vector, squirrel_vector)

pig_vector <- fasta_cleaner(EGLN1_fasta_list[10])

dotPlot(pig_vector, pig_vector)

Protein Properties Compiled from Databases

We can now look for protein properties at various websites to learn more about the human protein. Below will show domains located with Pfam, the subcellular location from Uniprot, and secondary structure classification from the PDB. No data was found in DisProt or RepeatDB.

protein_properties_vector <- c("PFAM", "There are two noteworthy domains. The first is the zf-MYND domain from amino acids 21 to 58. The second is the 2OG-FeII_Oxy_3 domain from amino acids 298 to 391.", "http://pfam.xfam.org/protein/Q9GZT9", "Uniprot", "The subcellular location appears to be in the nucleus and cytoplasm.", "https://www.uniprot.org/uniprot/Q9GZT9#subcellular_location", "PDB", "By looking at the sequence, it appears that the protein is composed of alpha helices and beta sheets.", "https://www.rcsb.org/sequence/4kbz")
                                    
protein_properties_table <- matrix(protein_properties_vector, nrow = 3, ncol = 3, byrow = T)

protein_properties_df <- data.frame(protein_properties_table, stringsAsFactors = F)

names(protein_properties_df) <- c("Source", "Protein Property", "Link")
 
pander::pander(protein_properties_df)
Table continues below
Source Protein Property
PFAM There are two noteworthy domains. The first is the zf-MYND domain from amino acids 21 to 58. The second is the 2OG-FeII_Oxy_3 domain from amino acids 298 to 391.
Uniprot The subcellular location appears to be in the nucleus and cytoplasm.
PDB By looking at the sequence, it appears that the protein is composed of alpha helices and beta sheets.
Link
http://pfam.xfam.org/protein/Q9GZT9
https://www.uniprot.org/uniprot/Q9GZT9#subcellular_location
https://www.rcsb.org/sequence/4kbz

Protein Feature Prediction

Protein Folding Prediction

First, we will make a vector of the amino acid one letter abbreviations. Note that we will use double entry to confirm there were no mistakes.

aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
##  [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"

We will now make vectors of each amino acid by type based off of Chou’s table 5. We will do logical checks with sum to determine if the vectors are accurate before we show the table.

alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
sum(alpha) == 2447
## [1] TRUE
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
sum(beta) == 2776
## [1] TRUE
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
aa.1.1 alpha beta a.plus.b a.div.b
A 285 203 175 361
R 53 67 78 146
N 97 139 120 183
D 163 121 111 244
C 22 75 74 63
Q 67 122 74 114
E 134 86 86 257
G 197 297 171 377
H 111 49 33 107
I 91 120 93 239
L 221 177 110 339
K 249 115 112 321
M 48 16 25 91
F 123 85 52 158
P 82 127 71 188
S 122 341 126 327
T 119 253 117 238
W 33 44 30 72
Y 63 110 108 130
V 167 229 123 378

We must now calculate the proportions of the amino acids in each fold class and convert to a data frame where we can then show frequencies.

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)

aa.prop_df <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)

row.names(aa.prop_df) <- aa.1.1

pander::pander(aa.prop_df)
  alpha.prop beta.prop a.plus.b.prop a.div.b
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724

Now, we will test for the number of each amino acid in our human gene with the table() function.

EGLN1_human_aa <- compbio4all::fasta_cleaner(human_fasta, parse = TRUE)

table(EGLN1_human_aa)
## EGLN1_human_aa
##  A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y 
## 43 15 28 23 11 47 10 14 33 26  6 15 34 14 28 26 13 22  5 13

The below function allows conversion of a table into a vector.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}

NP_071334_freq_table <- table(EGLN1_human_aa)/length(EGLN1_human_aa)

EGLN1_human_aa_freq <- table_to_vector(NP_071334_freq_table)
pander::pander(EGLN1_human_aa_freq)
Table continues below
A C D E F G H I
0.1009 0.03521 0.06573 0.05399 0.02582 0.1103 0.02347 0.03286
Table continues below
K L M N P Q R S
0.07746 0.06103 0.01408 0.03521 0.07981 0.03286 0.06573 0.06103
T V W Y
0.03052 0.05164 0.01174 0.03052

We then can test for the presence of unknown amino acids labeled as U. We do not have any unknowns in our gene.

aa.names <- names(EGLN1_human_aa_freq)
any(aa.names == "U")
## [1] FALSE

We can then finally add the data on our human protein to the amino acid frequency table made previously.

aa.prop_df$EGLN1_human_aa_freq <- EGLN1_human_aa_freq
pander::pander(aa.prop_df)
  alpha.prop beta.prop a.plus.b.prop a.div.b EGLN1_human_aa_freq
A 0.1165 0.07313 0.09264 0.08331 0.1009
R 0.02166 0.02414 0.04129 0.03369 0.03521
N 0.03964 0.05007 0.06353 0.04223 0.06573
D 0.06661 0.04359 0.05876 0.05631 0.05399
C 0.008991 0.02702 0.03917 0.01454 0.02582
Q 0.02738 0.04395 0.03917 0.02631 0.1103
E 0.05476 0.03098 0.04553 0.05931 0.02347
G 0.08051 0.107 0.09052 0.08701 0.03286
H 0.04536 0.01765 0.01747 0.02469 0.07746
I 0.03719 0.04323 0.04923 0.05516 0.06103
L 0.09031 0.06376 0.05823 0.07824 0.01408
K 0.1018 0.04143 0.05929 0.07408 0.03521
M 0.01962 0.005764 0.01323 0.021 0.07981
F 0.05027 0.03062 0.02753 0.03646 0.03286
P 0.03351 0.04575 0.03759 0.04339 0.06573
S 0.04986 0.1228 0.0667 0.07547 0.06103
T 0.04863 0.09114 0.06194 0.05493 0.03052
W 0.01349 0.01585 0.01588 0.01662 0.05164
Y 0.02575 0.03963 0.05717 0.03 0.01174
V 0.06825 0.08249 0.06511 0.08724 0.03052

Functions to Calculate Similarities

We need two functions to calculate correlation between two columns and another to calculate correlation similarities.

# Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

We can calculate correlation between each column.

corr.alpha <- chou_cor(aa.prop_df[,5], aa.prop_df[,1])
corr.beta <- chou_cor(aa.prop_df[,5], aa.prop_df[,2])
corr.apb <- chou_cor(aa.prop_df[,5], aa.prop_df[,3])
corr.adb <- chou_cor(aa.prop_df[,5], aa.prop_df[,4])

And now calculate cosine similarity.

cos.alpha <- chou_cosine(aa.prop_df[,5], aa.prop_df[,1])
cos.beta  <- chou_cosine(aa.prop_df[,5], aa.prop_df[,2])
cos.apb   <- chou_cosine(aa.prop_df[,5], aa.prop_df[,3])
cos.adb   <- chou_cosine(aa.prop_df[,5], aa.prop_df[,4])

We need to flip the data frame with the t() function.

aa.prop.flipped <- t(aa.prop_df)
round(aa.prop.flipped,2)
##                        A    R    N    D    C    Q    E    G    H    I    L    K
## alpha.prop          0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop           0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop       0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b             0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## EGLN1_human_aa_freq 0.10 0.04 0.07 0.05 0.03 0.11 0.02 0.03 0.08 0.06 0.01 0.04
##                        M    F    P    S    T    W    Y    V
## alpha.prop          0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop           0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop       0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b             0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## EGLN1_human_aa_freq 0.08 0.03 0.07 0.06 0.03 0.05 0.01 0.03

We can get the euclidean distance.

dist(aa.prop.flipped, method = "euclidean")
##                     alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop           0.13342098                                    
## a.plus.b.prop       0.09281824 0.08289406                         
## a.div.b             0.06699039 0.08659174    0.06175113           
## EGLN1_human_aa_freq 0.17891026 0.18908175    0.16206768 0.17398567

It’s possible to get the individual distances with the dist() function.

dist.alpha <- dist((aa.prop.flipped[c(1,4),]),  method = "euclidean")
dist.beta  <- dist((aa.prop.flipped[c(2,4),]),  method = "euclidean")
dist.apb   <- dist((aa.prop.flipped[c(3,4),]),  method = "euclidean")
dist.adb  <- dist((aa.prop.flipped[c(4,4),]), method = "euclidean")

We can then compile the information and display the output.

# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")

# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")

df <- data.frame(fold.type,
           corr.sim ,
           cosine.sim ,
           Euclidean.dist ,
           sim.sum ,
           dist.sum )

pander::pander(df)
fold.type corr.sim cosine.sim Euclidean.dist sim.sum dist.sum
alpha 0.756 0.756 0.06699
beta 0.7313 0.7313 0.08659
alpha plus beta 0.7882 0.7882 0.06175 most.sim min.dist
alpha/beta 0.7598 0.7598 0

Percent Identity Comparisons

To begin our pid comparisons, we need to align each sequence to each other. We will be using the human, chimp, mouse, and squirrel sequences.

align_human_chimp <- Biostrings:: pairwiseAlignment(
    human_fasta,
    chimp_fasta)
align_human_mouse <- Biostrings:: pairwiseAlignment(
    human_fasta,
    mouse_fasta)
align_human_squirrel <- Biostrings:: pairwiseAlignment(
    human_fasta,
    squirrel_fasta)

align_human_mouse
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: >NP_071334.1 egl nine homolog 1 is...RARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
## 
## 
## subject: >NP_444437.2 egl nine homolog 1 is...RARAKVKYLTGEKGVRVEL-KPN-S-VSKDV-
## 
## 
## score: 1313.664
#other comparisons needed for table below

align_chimp_mouse <- Biostrings:: pairwiseAlignment(
    chimp_fasta,
    mouse_fasta)

align_chimp_squirrel <- Biostrings:: pairwiseAlignment(
    chimp_fasta,
    squirrel_fasta)

align_mouse_squirrel <- Biostrings:: pairwiseAlignment(
    mouse_fasta,
    squirrel_fasta)

PID Table

Here we can see the pid of the human sequence to chimp, mouse and squirrel.

Biostrings::pid(align_human_chimp)
## [1] 71.2381
Biostrings::pid(align_human_mouse)
## [1] 77.09163
Biostrings::pid(align_human_squirrel)
## [1] 77.95591

We will now build a matrix to easily show pid comparisons. NA refers to redundant information.

pids_vector <- c("Human", 1, NA, NA, NA,
          "Chimp", pid(align_human_chimp), 1, NA, NA,
          "Mouse", pid(align_human_mouse), pid(align_chimp_mouse), 1, NA,
          "Squirrel", pid(align_human_squirrel), pid(align_chimp_squirrel), pid(align_mouse_squirrel), 1)

pid_table <- matrix(pids_vector, nrow = 4, ncol = 5, byrow = T)

pid_table.df <- data.frame(pid_table, 
                     stringsAsFactors = F)

names(pid_table.df) <- c("Species", "Human", "Chimp","Mouse", "Squirrel")
 
pander::pander(pid_table.df)
Species Human Chimp Mouse Squirrel
Human 1 NA NA NA
Chimp 71.2380952380952 1 NA NA
Mouse 77.0916334661355 63.5658914728682 1 NA
Squirrel 77.9559118236473 66.2835249042146 81.7622950819672 1

PID Methods Comparison

There are different methods used to calculate pid. Below uses the human and chimp sequences to show differences.

pid_methods_compare_vector <- c("PID1", pid(align_human_chimp, type = "PID1"), 
                                "aligned positions + internal gap positions",
                                "PID2", pid(align_human_chimp, type = "PID2"),
                                "aligned positions", "PID3", 
                                pid(align_human_chimp, type = "PID3"),
                                "length shorter sequence", "PID4", 
                                pid(align_human_chimp, type = "PID4"),
                                "average length of the two sequences")

pid_methods_compare_table <- matrix(pid_methods_compare_vector, nrow = 4, byrow = T)

pid_methods_compare_table.df <- data.frame(pid_methods_compare_table, 
                     stringsAsFactors = F)

names(pid_methods_compare_table.df) <- c("Method", "PID", "denominator")
 
pander::pander(pid_methods_compare_table.df)
Method PID denominator
PID1 71.2380952380952 aligned positions + internal gap positions
PID2 81.8380743982494 aligned positions
PID3 76.1710794297352 length shorter sequence
PID4 76.1710794297352 average length of the two sequences

Multiple Sequence Allignment

MSA Data Preparation

We need to prepare our data for creating an msa.The code below is taking the NCBI accession numbers, and using the NCBI database to add every sequence in variable EGLN1. We then clean the files with cat() and create our list.

EGLN1 <-entrez_fetch (db = "protein", 
                          id = ncbi_protein_accession, 
                          rettype = "fasta")

cat(EGLN1)
## >NP_071334.1 egl nine homolog 1 isoform 1 [Homo sapiens]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSEGALGHGVG
## PHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKPPADPAAAASPCRAAAGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
## 
## >XP_525092.2 PREDICTED: egl nine homolog 1 [Pan troglodytes]
## MYKEKHIAHVCIWQNPQYRDKQDRNSHLATLTGASLAHPAPAWRKSSHDYAAFAPLPACGWRPPPAGVGA
## AALPPRLFRRFAGPCALGVPRRRPAAGPKAPGKRPGKSTVKPPADPAAAASPSRAAAGGQGSAVAAEAEP
## GKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFL
## GKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCN
## GKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFA
## DIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDSVSKD
## VF
## 
## >NP_444437.2 egl nine homolog 1 isoform 1 [Mus musculus]
## MASDSGGPGVLSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGEAPRAQPAP
## AQPRVAPPPGGAPGAARAGGAARRGDSAAASRVPGPEDAAQARSGPGPAEPGSEDPPLSRSPGPERASLC
## PAGGGPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTG
## KFTDGQLVSQKSDSSKDIRGDQITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACY
## PGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNP
## HEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
## 
## >XP_001104870.1 egl nine homolog 1 isoform X1 [Macaca mulatta]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSESALGPGVG
## PHQHSGPAPPVAAPPPRAGAREPRKAAARRDNASGDAAKAKAKGKPPADPAAAASPSRAAPGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
## 
## >XP_002728657.3 PREDICTED: egl nine homolog 1 [Rattus norvegicus]
## MNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGWRPSAP
## GAARAGGAARRGDSSTAASRVPGPEDATQAGSGPGPAEPSSEDPPPSRSPGPERASLCPAGGGPGEALSP
## SGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKS
## DSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACYPGNGTGYVRHVD
## NPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYA
## ITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
## 
## >NP_001002595.2 egl nine homolog 1b [Danio rerio]
## MEGNSRESERLERERQFCELCGKMENLMKCGRCRSSFYCSKEHQRQDWKKHKRVCKEADKQQQPVGEEEE
## EQPSDTSKQSSTSQSNSTLTSQDEKMPDFIKSATGSDTKPTPDSSKPNGQTRSPPQKLATDYIVPCMNKH
## GICVVDNFLGNEVGRSILEDVRALYLTGGFTDGQLVSQRSDSSKDIRGDKITWVEGKEPGCERIAFLMSR
## MDDLIRHCNGKLGNYRINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDATEHGGLLRI
## FPEGTAQFADIEPKFDRLLLFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKEKYSTSSGERGVKVE
## LNKPSEPS
## 
## >NP_001015960.1 egl nine homolog 1 [Xenopus tropicalis]
## MAGGGSEGSNQSERDRQYCELCGKMEDLLRCGRCRSSFYCSKEHQRQDWKKHKLFCKSDSTITPASQNTV
## KNSKKEQFSSSAVSISHSDTSQFKSHVEPASGVSDEQEEVDGGAALKAINEEDDTQVSGEAMGSSTSRKV
## TTSGGRPNGQTKPPLHRIALEYTIPCMNKHGICVLDDFLGQETGDRIVCEVKALHNTGRFTDGQLVSQKS
## DSTRDIRGDQITWVEGKESSCKAIGKLMNKMDDLIRHCSGKLGNFRINGRTKAMVACYPGNGTGYVRHVD
## NPNADGRCVTCIYYLNKQWDAKTHGGLLRIFPEGKSQFADIEPKFDRLLLFWSDRRNPHEVQPAFATRYA
## ITVWYFDADERARAKEKYLNTGERGVRIELNKPSEQVVKEVQNP
## 
## >XP_032955515.1 egl nine homolog 1 isoform X1 [Rhinolophus ferrumequinum]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCSKEHQRQDWKKHKLVCQRGDGAPGPGAS
## QHPHPGPAPPAAAPPPRATGAKEARTAAPRRDSAAAGDAADAKAKPAAAASPPRAAPGGQGPAAAAATAT
## AEAEPAKEEPLGRSSLFQDKANLYPPGNTPGEALSHGGGPRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGEEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SISKDVL
## 
## >XP_026268254.1 egl nine homolog 1 [Urocitellus parryii]
## MASDSGGPGGPSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGDPAGPAAAP
## RAQPGPAPPAAAPPTRAGAAAGNAGARAGKAAGQRQGSAEAARAPAPGDAAQARGPRGAAECGQEEPPAR
## RSPCPERASLCPAGSSPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQI
## GDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYKI
## NGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDR
## LLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDPVGKDV
## 
## >XP_020929239.1 egl nine homolog 1 isoform X1 [Sus scrofa]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCGRCRSSFYCSKEHQRQDWKKHKLVCQGGEGAPAPGAG
## QQPQPGPAPLLRKAVPRRDRVAAVEAAGPRAAGDAAKAKAKAKPAADPAAAASPPRAAPGGAGPAAAAAA
## AEAEPAKEEPPARSSLYQEKANLYPPSNTPGEGLSHGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SVSKDVL
EGLN1_list <- entrez_fetch_list(db = "protein", 
                          id = ncbi_protein_accession, 
                          rettype = "fasta")

We need to make a new vector with the cleaned fasta files from each sequence starting from a vector of NAs. We need to add names to our vector and add the respective amino acids from each file.

for(i in 1:length(EGLN1_list)){
  EGLN1_list[[i]] <- fasta_cleaner(EGLN1_list[[i]], parse = F)
}

EGLN1_vector <- rep(NA, length(EGLN1_list))

for(i in 1:length(EGLN1_vector)){
  EGLN1_vector[i] <- EGLN1_list[[i]]
}

names(EGLN1_vector) <- names(EGLN1_list)

EGLN1_list_ss <- Biostrings:: AAStringSet(EGLN1_vector)

Building Multiple Sequence Alignment

The code below creates an interface, preparing for the making of an msa.

EGLN1_align_msa <- msa(EGLN1_list_ss,
                     method = "ClustalW")
## use default substitution matrix
EGLN1_align_msa
## CLUSTAL 2.1  
## 
## Call:
##    msa(EGLN1_list_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 10 rows and 439 columns
##      aln                                                   names
##  [1] MASDSGGPGV---LSASERDRQYCE...--GEKGVRVELKPNS--VSKDV--- NP_444437.2
##  [2] --MNKHGICV---VDDFLGRETGQQ...--GEKGVRVELKPNS--VSKDV--- XP_002728657.3
##  [3] MASDSGGPGG---PSASERDRQYCE...--GEKGVRVELNKPSDPVGKDV--- XP_026268254.1
##  [4] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVGKDVF-- NP_071334
##  [5] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVGKDVF-- XP_001104870.1
##  [6] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSISKDVL-- XP_032955515.1
##  [7] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVSKDVL-- XP_020929239.1
##  [8] MYKEKHIAHVCIWQNPQYRDKQ--D...--GEKGVRVELNKPSDSVSKDVF-- XP_525092.2
##  [9] --MEGNSRE----SERLERERQFCE...SSGERGVKVELNKPSEPS------- NP_001002595.2
## [10] --MAGGGSEG---SNQSERDRQYCE...-TGERGVRIELNKPSEQVVKEVQNP NP_001015960.1
##  Con MA?DSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVSKDV?-- Consensus

Cleaning/Setting Up an MSA

We need to clean the code a bit before displaying our msa. We have to change the class of alignment as well as convert to the sequinr form.

class(EGLN1_align_msa) <- "AAMultipleAlignment"

EGLN1_align_msa_seqinr <- msaConvert(EGLN1_align_msa, type = "seqinr::alignment")

compbio4all::print_msa(EGLN1_align_msa_seqinr)
## [1] "MASDSGGPGV---LSASERDRQYCELCGKMENLLRCGRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "--MNKHGICV---VDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKI 0"
## [1] "MASDSGGPGG---PSASERDRQYCELCGKMENLLRCGRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCSRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCSRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCSRCRS-SFYCSKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCGRCRS-SFYCSKEHQRQDWKKHKL 0"
## [1] "MYKEKHIAHVCIWQNPQYRDKQ--DRNSHLATLTGASLAHP-APAWRKSS--HDYAAFAP 0"
## [1] "--MEGNSRE----SERLERERQFCELCGKMENLMKCGRCRS-SFYCSKEHQRQDWKKHKR 0"
## [1] "--MAGGGSEG---SNQSERDRQYCELCGKMEDLLRCGRCRS-SFYCSKEHQRQDWKKHKL 0"
## [1] " "
## [1] "VCQGGE---------------------APRAQPAPAQPRVAPPPGGAPGAARAGGAARRG 0"
## [1] "TWIEGK---------------------EPGWRPS------------APGAARAGGAARRG 0"
## [1] "VCQGGDPA-----------GPAAAPRAQPGPAPPAAAPPTRAGAAAGNAGARAGKAAGQR 0"
## [1] "VCQGSEGALGHGVGPHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKP 0"
## [1] "VCQGSESALGPGVGPHQHSGPAPPVAAPPPRAGAREPRKAAARRDNASGDAAKAKAKGKP 0"
## [1] "VCQRGDGAPGPGASQHPHPGPAPPAAAPPPRATGAKEARTAAPR--RDSAAAGDAADAK- 0"
## [1] "VCQGGEGAPAPGAGQQPQPGPAPLLRKAVPRRDRVAAVEAAGPR--AAGDAAKAKAKAKP 0"
## [1] "LPACGWRPPPAGVGAAALP-PRLFRRFAGPCALGVPRRRPAAGP--KAPGKRPGKSTVKP 0"
## [1] "VCKEAD-----------------------KQQQPVG-----------EEEEEQPS----- 0"
## [1] "FCKSDSTI-------------------TPASQNTVK-----------NSKKEQFSSSAVS 0"
## [1] " "
## [1] "DS-AAASRVPGPEDAAQARSGPG------PAEPGSEDPPLSRSPGPERASLCPAGGGPGE 0"
## [1] "DSSTAASRVPGPEDATQAGSGPG------PAEPSSEDPPPSRSPGPERASLCPAGGGPGE 0"
## [1] "QGSAEAARAPAPGDAAQARGPRG------AAECGQEEPPARRSPCPERASLCPAGSSPGE 0"
## [1] "PADPAAAASPCRAAAGGQGSAVA-----AEAEPGKEEPPARSSLFQEKANLYPPSNTPGD 0"
## [1] "PADPAAAASPSRAAPGGQGSAVA-----AEAEPGKEEPPARSSLFQEKANLYPPSNTPGD 0"
## [1] "-AKPAAAASPPRAAPGGQGPAAAAATATAEAEPAKEEPLGRSSLFQDKANLYPPGNTPGE 0"
## [1] "AADPAAAASPPRAAPGGAGPAAAAAAA--EAEPAKEEPPARSSLYQEKANLYPPSNTPGE 0"
## [1] "PADPAAAASPSRAAAGGQGSAVA-----AEAEPGKEEPPARSSLFQEKANLYPPSNTPGD 0"
## [1] "----DTSKQSSTSQSNSTLTSQD------EKMP------------DFIKS--ATGSDT-- 0"
## [1] "ISHSDTSQFKSHVEPASGVSDEQ------EEVDGGAALKAINEEDDTQVSGEAMGSSTSR 0"
## [1] " "
## [1] "ALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHD 0"
## [1] "ALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHD 0"
## [1] "ALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHD 0"
## [1] "ALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "ALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "ALSHGGGPRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGEEVRALHD 0"
## [1] "GLSHGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "ALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "KPTPDSS-KPNGQTRS-PPQKLATDYIVPCMNKHGICVVDNFLGNEVGRSILEDVRALYL 0"
## [1] "KVTTSGG-RPNGQTKP-PLHRIALEYTIPCMNKHGICVLDDFLGQETGDRIVCEVKALHN 0"
## [1] " "
## [1] "TGKFTDGQLVSQKSDSSKDIRGDQITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYR 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYR 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYK 0"
## [1] "TGGFTDGQLVSQRSDSSKDIRGDKITWVEGKEPGCERIAFLMSRMDDLIRHCNGKLGNYR 0"
## [1] "TGRFTDGQLVSQKSDSTRDIRGDQITWVEGKESSCKAIGKLMNKMDDLIRHCSGKLGNFR 0"
## [1] " "
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDATEHGGLLRIFPEGTA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNADGRCVTCIYYLNKQWDAKTHGGLLRIFPEGKS 0"
## [1] " "
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLLFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKEKYSTSSGERG 0"
## [1] "QFADIEPKFDRLLLFWSDRRNPHEVQPAFATRYAITVWYFDADERARAKEKYLN-TGERG 0"
## [1] " "
## [1] "VRVELKPNS--VSKDV--- 41"
## [1] "VRVELKPNS--VSKDV--- 41"
## [1] "VRVELNKPSDPVGKDV--- 41"
## [1] "VRVELNKPSDSVGKDVF-- 41"
## [1] "VRVELNKPSDSVGKDVF-- 41"
## [1] "VRVELNKPSDSISKDVL-- 41"
## [1] "VRVELNKPSDSVSKDVL-- 41"
## [1] "VRVELNKPSDSVSKDVF-- 41"
## [1] "VKVELNKPSEPS------- 41"
## [1] "VRIELNKPSEQVVKEVQNP 41"
## [1] " "

Finished MSA

We can now display our msa.

ggmsa:: ggmsa(EGLN1_align_msa,  
      start = 0, 
      end = 50) 

Distance Matrix

Too make things easier to calculate a distance matrix, we are only going to use a subset of the sequences. We will use sequences for humans, mice, fish, bats and squirrels. We will make a new vector with the accession numbers of the sequences we want and we will use bracket notation to grab what we want from EGLN1_list_ss. We will then add this to a new variable.

assessions_subsets <- c("NP_071334", "NP_444437.2", "NP_001002595.2", "XP_032955515.1", "XP_026268254.1")

EGLN1_list_ss[assessions_subsets]
## AAStringSet object of length 5:
##     width seq                                               names               
## [1]   426 MANDSGGPGGPSPSERDRQYCEL...LTGEKGVRVELNKPSDSVGKDVF NP_071334
## [2]   400 MASDSGGPGVLSASERDRQYCEL...VKYLTGEKGVRVELKPNSVSKDV NP_444437.2
## [3]   358 MEGNSRESERLERERQFCELCGK...EKYSTSSGERGVKVELNKPSEPS NP_001002595.2
## [4]   427 MANDSGGPGGPSPSERDRQYCEL...LTGEKGVRVELNKPSDSISKDVL XP_032955515.1
## [5]   413 MASDSGGPGGPSASERDRQYCEL...YLTGEKGVRVELNKPSDPVGKDV XP_026268254.1
EGLN1_subset_list_ss <- EGLN1_list_ss[assessions_subsets]

We will have to do the same conversions again as we did to set up our msa.

EGLN1_align_subset <- msa(EGLN1_subset_list_ss,
                     method = "ClustalW")
## use default substitution matrix
class(EGLN1_align_subset) <- "AAMultipleAlignment"
EGLN1_align_subset_seqinr <- msaConvert(EGLN1_align_subset, type = "seqinr::alignment")

Now we will calculating genetic distance from our msa using the seqinr::dist.alignment() function.

EGLN1_subset_align_dist <- seqinr::dist.alignment(EGLN1_align_subset_seqinr, 
                                       matrix = "identity")

EGLN1_subset_align_dist_rounded <- round(EGLN1_subset_align_dist,
                              digits = 3)

EGLN1_subset_align_dist_rounded
##                NP_444437.2 XP_026268254.1 NP_071334 XP_032955515.1
## XP_026268254.1       0.312                                        
## NP_071334            0.410          0.395                         
## XP_032955515.1       0.430          0.412     0.309               
## NP_001002595.2       0.555          0.543     0.548          0.546

Now we will repeat the same steps but with all the sequences. You will see that the matrix is a bit more confusing to understand as it is much bigger.

EGLN1_full_align_dist <- seqinr::dist.alignment(EGLN1_align_msa_seqinr, 
                                       matrix = "identity")

EGLN1_full_align_dist_rounded <- round(EGLN1_full_align_dist,
                              digits = 3)

EGLN1_full_align_dist_rounded
##                NP_444437.2 XP_002728657.3 XP_026268254.1 NP_071334
## XP_002728657.3       0.407                                        
## XP_026268254.1       0.361          0.495                         
## NP_071334            0.433          0.545          0.426          
## XP_001104870.1       0.439          0.545          0.426     0.153
## XP_032955515.1       0.444          0.551          0.431     0.337
## XP_020929239.1       0.425          0.540          0.424     0.326
## XP_525092.2          0.545          0.533          0.544     0.463
## NP_001002595.2       0.560          0.631          0.551     0.566
## NP_001015960.1       0.581          0.663          0.586     0.598
##                XP_001104870.1 XP_032955515.1 XP_020929239.1 XP_525092.2
## XP_002728657.3                                                         
## XP_026268254.1                                                         
## NP_071334                                                              
## XP_001104870.1                                                         
## XP_032955515.1          0.337                                          
## XP_020929239.1          0.323          0.318                           
## XP_525092.2             0.463          0.502          0.489            
## NP_001002595.2          0.563          0.556          0.558       0.628
## NP_001015960.1          0.594          0.591          0.594       0.663
##                NP_001002595.2
## XP_002728657.3               
## XP_026268254.1               
## NP_071334                    
## XP_001104870.1               
## XP_032955515.1               
## XP_020929239.1               
## XP_525092.2                  
## NP_001002595.2               
## NP_001015960.1          0.545

Phylogenetic Trees of Sequences

To plot a phylogenetic tree, we can use the neighbor joining algorithm with the nj() function. The nj() function takes in a distance matrix as an argument and uses genetic distances to cluster sequences into clades.

tree_set <- nj(EGLN1_full_align_dist)

Plotting Phylogenetic Trees

Now we can plot the tree.

plot.phylo(tree_set, main="Phylogenetic Tree", 
            use.edge.length = F)

# add label
mtext(text = "ELGN1 family gene tree - rooted, no branch lenths")