The EGLN1 gene encodes Hypoxia-inducible factor prolyl hydroxylase 2 (HIF-PH2), or prolyl hydroxylase domain-containing protein 2 (PHD2). Upregulation of this gene is caused by the cllular response to hypoxia, or when the body is deprived of oxygen.
In this portfolio, we will be looking at the sequences of the EGLN1 gene of various species by grabbing their various accession numbers and fasta files. We will then prepare the data for various purposes such as diagramming key domains in the proteins encoded by the gene, using dot plots to find sequence repeats, finding various protein properties described by the human gene, predicting the secondary structures, creating percent identity comparisons for different species, performing a multiple sequence alignment, and finally plotting a phylogenetic tree for all sequences.
Here are the links to resources used to help make this script:
Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/54583
Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/?term=EGLN1
Uniprot: https://www.uniprot.org/uniprot/?query=EGLN1&sort=score
Protein Data Bank: https://www.rcsb.org/
Blast: https://blast.ncbi.nlm.nih.gov/Blast.cgi
Pfam: http://pfam.xfam.org/protein/Q9GZT9
Links to resources consulted:
Neanderthal genome: http://neandertal.ensemblgenomes.org/index.html
Disprot: https://www.disprot.org/
RepeatsDP: https://repeatsdb.bio.unipd.it/
Here is the necessary packages needed throughout this portfolio.
# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
##
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
library(pander)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.2
# Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
library(Biostrings)
library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
## is available with R version '4.1'; see https://bioconductor.org/install
##
## Attaching package: 'BiocManager'
## The following object is masked from 'package:msa':
##
## version
#install("drawProteins")
library(drawProteins)
Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB and Blast searches (last three). UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession number. The Neanderthal Genome database was searched with the gene name however I was unable to download the fasta files with the given ID so I decided not to use it . Blast searches were conducted with the Human accession number and the three most genetically distant were chosen.
ncbi_protein_accession <- c("NP_071334", "XP_525092.2", "NP_444437.2", "XP_001104870.1", "XP_002728657.3", "NP_001002595.2", "NP_001015960.1", "XP_032955515.1", "XP_026268254.1", "XP_020929239.1")
UniProt_id <- c("Q9GZT9", "K7C5B9", "Q91YE3", "A0A5F8AKN1", "P59722", "NA", "NA", "NA", "NA", "NA")
PDB <- c("4KBZ", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")
species <- c("Homo sapiens", "Pan troglodytes", "Mus musculus", "Macaca mulatta", "Rattus norvegicus", "Danio rerio", "Xenopus tropicalis", " Rhinolophus ferrumequinum", "Urocitellus parryii", "Sus scrofa")
common_name <- c("Humans", "Chimpanzees", "Mice", "Monkeys", "Rats", "Fish",
"Frogs", "Bat", "Squirrel", "Pig")
gene_name <- c("EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1", "EGLN1")
Here is a table showing all the accession numbers with their respective species. Anything designated with NA was unable to be found.
EGLN1.df <- data.frame(ncbi_protein_accession, UniProt_id, PDB, species, common_name, gene_name)
pander::pander(EGLN1.df)
| ncbi_protein_accession | UniProt_id | PDB | species |
|---|---|---|---|
| NP_071334 | Q9GZT9 | 4KBZ | Homo sapiens |
| XP_525092.2 | K7C5B9 | NA | Pan troglodytes |
| NP_444437.2 | Q91YE3 | NA | Mus musculus |
| XP_001104870.1 | A0A5F8AKN1 | NA | Macaca mulatta |
| XP_002728657.3 | P59722 | NA | Rattus norvegicus |
| NP_001002595.2 | NA | NA | Danio rerio |
| NP_001015960.1 | NA | NA | Xenopus tropicalis |
| XP_032955515.1 | NA | NA | Rhinolophus ferrumequinum |
| XP_026268254.1 | NA | NA | Urocitellus parryii |
| XP_020929239.1 | NA | NA | Sus scrofa |
| common_name | gene_name |
|---|---|
| Humans | EGLN1 |
| Chimpanzees | EGLN1 |
| Mice | EGLN1 |
| Monkeys | EGLN1 |
| Rats | EGLN1 |
| Fish | EGLN1 |
| Frogs | EGLN1 |
| Bat | EGLN1 |
| Squirrel | EGLN1 |
| Pig | EGLN1 |
In the code below, we will simply download the fasta files for each sequence.
human_fasta <- entrez_fetch(db = "protein",
id = "NP_071334",
rettype = "fasta")
chimp_fasta <- entrez_fetch(db = "protein",
id = "XP_525092.2",
rettype = "fasta")
mouse_fasta <- entrez_fetch(db = "protein",
id = "NP_444437.2",
rettype = "fasta")
monkey_fasta <- entrez_fetch(db = "protein",
id = "XP_001104870.1",
rettype = "fasta")
rat_fasta <- entrez_fetch(db = "protein",
id = "XP_002728657.3",
rettype = "fasta")
fish_fasta <- entrez_fetch(db = "protein",
id = "NP_001002595.2",
rettype = "fasta")
frog_fasta <- entrez_fetch(db = "protein",
id = "NP_001015960.1",
rettype = "fasta")
bat_fasta <- entrez_fetch(db = "protein",
id = "XP_032955515.1",
rettype = "fasta")
squirrel_fasta <- entrez_fetch(db = "protein",
id = "XP_026268254.1",
rettype = "fasta")
pig_fasta <- entrez_fetch(db = "protein",
id = "XP_020929239.1",
rettype = "fasta")
Here we will put all of our fasta files into one list and then clean up each fasta file so we can use it. We will use the the cat() function to make the output similar to that of a web browser/word processor and making it easier to read with new lines. We will then loop through the list to remove the headings of each fasta file. The last line shows what the new fasta files look like.
EGLN1_fasta_list <- c(human_fasta, chimp_fasta, mouse_fasta, monkey_fasta, rat_fasta, fish_fasta, frog_fasta, bat_fasta, squirrel_fasta, pig_fasta)
cat(EGLN1_fasta_list)
## >NP_071334.1 egl nine homolog 1 isoform 1 [Homo sapiens]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSEGALGHGVG
## PHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKPPADPAAAASPCRAAAGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
##
## >XP_525092.2 PREDICTED: egl nine homolog 1 [Pan troglodytes]
## MYKEKHIAHVCIWQNPQYRDKQDRNSHLATLTGASLAHPAPAWRKSSHDYAAFAPLPACGWRPPPAGVGA
## AALPPRLFRRFAGPCALGVPRRRPAAGPKAPGKRPGKSTVKPPADPAAAASPSRAAAGGQGSAVAAEAEP
## GKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFL
## GKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCN
## GKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFA
## DIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDSVSKD
## VF
##
## >NP_444437.2 egl nine homolog 1 isoform 1 [Mus musculus]
## MASDSGGPGVLSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGEAPRAQPAP
## AQPRVAPPPGGAPGAARAGGAARRGDSAAASRVPGPEDAAQARSGPGPAEPGSEDPPLSRSPGPERASLC
## PAGGGPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTG
## KFTDGQLVSQKSDSSKDIRGDQITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACY
## PGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNP
## HEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
##
## >XP_001104870.1 egl nine homolog 1 isoform X1 [Macaca mulatta]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSESALGPGVG
## PHQHSGPAPPVAAPPPRAGAREPRKAAARRDNASGDAAKAKAKGKPPADPAAAASPSRAAPGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
##
## >XP_002728657.3 PREDICTED: egl nine homolog 1 [Rattus norvegicus]
## MNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGWRPSAP
## GAARAGGAARRGDSSTAASRVPGPEDATQAGSGPGPAEPSSEDPPPSRSPGPERASLCPAGGGPGEALSP
## SGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKS
## DSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACYPGNGTGYVRHVD
## NPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYA
## ITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
##
## >NP_001002595.2 egl nine homolog 1b [Danio rerio]
## MEGNSRESERLERERQFCELCGKMENLMKCGRCRSSFYCSKEHQRQDWKKHKRVCKEADKQQQPVGEEEE
## EQPSDTSKQSSTSQSNSTLTSQDEKMPDFIKSATGSDTKPTPDSSKPNGQTRSPPQKLATDYIVPCMNKH
## GICVVDNFLGNEVGRSILEDVRALYLTGGFTDGQLVSQRSDSSKDIRGDKITWVEGKEPGCERIAFLMSR
## MDDLIRHCNGKLGNYRINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDATEHGGLLRI
## FPEGTAQFADIEPKFDRLLLFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKEKYSTSSGERGVKVE
## LNKPSEPS
##
## >NP_001015960.1 egl nine homolog 1 [Xenopus tropicalis]
## MAGGGSEGSNQSERDRQYCELCGKMEDLLRCGRCRSSFYCSKEHQRQDWKKHKLFCKSDSTITPASQNTV
## KNSKKEQFSSSAVSISHSDTSQFKSHVEPASGVSDEQEEVDGGAALKAINEEDDTQVSGEAMGSSTSRKV
## TTSGGRPNGQTKPPLHRIALEYTIPCMNKHGICVLDDFLGQETGDRIVCEVKALHNTGRFTDGQLVSQKS
## DSTRDIRGDQITWVEGKESSCKAIGKLMNKMDDLIRHCSGKLGNFRINGRTKAMVACYPGNGTGYVRHVD
## NPNADGRCVTCIYYLNKQWDAKTHGGLLRIFPEGKSQFADIEPKFDRLLLFWSDRRNPHEVQPAFATRYA
## ITVWYFDADERARAKEKYLNTGERGVRIELNKPSEQVVKEVQNP
##
## >XP_032955515.1 egl nine homolog 1 isoform X1 [Rhinolophus ferrumequinum]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCSKEHQRQDWKKHKLVCQRGDGAPGPGAS
## QHPHPGPAPPAAAPPPRATGAKEARTAAPRRDSAAAGDAADAKAKPAAAASPPRAAPGGQGPAAAAATAT
## AEAEPAKEEPLGRSSLFQDKANLYPPGNTPGEALSHGGGPRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGEEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SISKDVL
##
## >XP_026268254.1 egl nine homolog 1 [Urocitellus parryii]
## MASDSGGPGGPSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGDPAGPAAAP
## RAQPGPAPPAAAPPTRAGAAAGNAGARAGKAAGQRQGSAEAARAPAPGDAAQARGPRGAAECGQEEPPAR
## RSPCPERASLCPAGSSPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQI
## GDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYKI
## NGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDR
## LLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDPVGKDV
##
## >XP_020929239.1 egl nine homolog 1 isoform X1 [Sus scrofa]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCGRCRSSFYCSKEHQRQDWKKHKLVCQGGEGAPAPGAG
## QQPQPGPAPLLRKAVPRRDRVAAVEAAGPRAAGDAAKAKAKAKPAADPAAAASPPRAAPGGAGPAAAAAA
## AEAEPAKEEPPARSSLYQEKANLYPPSNTPGEGLSHGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SVSKDVL
for(i in 1:length(EGLN1_fasta_list)){
EGLN1_fasta_list[[i]] <- compbio4all::fasta_cleaner(EGLN1_fasta_list[[i]], parse = F)
}
EGLN1_fasta_list[1]
## [1] "MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSEGALGHGVGPHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKPPADPAAAASPCRAAAGGQGSAVAAEAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDSVGKDVF"
We can use the UniProt accession numbers to download our data (note we only have 5 proteins with UniProt accession numbers). We then can make a dataframe that contains information on domains, chains, regions, motifs, phosphorylated sites and repeats.
human_uniprot <- drawProteins::get_features("Q9GZT9")
## [1] "Download has worked"
chimp_uniprot <- drawProteins::get_features("K7C5B9")
## [1] "Download has worked"
mouse_uniprot <- drawProteins::get_features("Q91YE3")
## [1] "Download has worked"
monkey_uniprot <- drawProteins::get_features("A0A5F8AKN1")
## [1] "Download has worked"
rat_uniprot <- drawProteins::get_features("P59722")
## [1] "Download has worked"
human_uniprot_df <- drawProteins::feature_to_dataframe(human_uniprot)
chimp_uniprot_df <- drawProteins::feature_to_dataframe(chimp_uniprot)
mouse_uniprot_df <- drawProteins::feature_to_dataframe(mouse_uniprot)
monkey_uniprot_df <- drawProteins::feature_to_dataframe(monkey_uniprot)
rat_uniprot_df <- drawProteins::feature_to_dataframe(rat_uniprot)
## Warning in drawProteins::extract_feat_acc(features_in_lists_of_six[[i]]): NAs
## introduced by coercion
We can now plot the data. Below are the domains and regions found in the human protein.
human_canvas <- draw_canvas(human_uniprot_df)
human_canvas <- draw_chains(human_canvas, human_uniprot_df,
label_size = 2.5)
human_canvas <- draw_domains(human_canvas, human_uniprot_df)
human_canvas <- draw_regions(human_canvas, human_uniprot_df)
human_canvas
Here is the data for the protein in chimps.
chimp_canvas <- draw_canvas(chimp_uniprot_df)
chimp_canvas <- draw_chains(chimp_canvas, chimp_uniprot_df,
label_size = 2.5)
chimp_canvas <- draw_domains(chimp_canvas, chimp_uniprot_df)
chimp_canvas <- draw_regions(chimp_canvas, chimp_uniprot_df)
chimp_canvas
Now for the protein in mice.
mouse_canvas <- draw_canvas(mouse_uniprot_df)
mouse_canvas <- draw_chains(mouse_canvas, mouse_uniprot_df,
label_size = 2.5)
mouse_canvas <- draw_domains(mouse_canvas, mouse_uniprot_df)
mouse_canvas <- draw_regions(mouse_canvas, mouse_uniprot_df)
mouse_canvas
Here is the data for the protein in monkeys.
monkey_canvas <- draw_canvas(monkey_uniprot_df)
monkey_canvas <- draw_chains(monkey_canvas, monkey_uniprot_df,
label_size = 2.5)
monkey_canvas <- draw_domains(monkey_canvas, monkey_uniprot_df)
monkey_canvas <- draw_folding(monkey_canvas, monkey_uniprot_df)
monkey_canvas <- draw_regions(monkey_canvas, monkey_uniprot_df)
monkey_canvas
And finally, here is the data for the protein in rats.
rat_canvas <- draw_canvas(rat_uniprot_df)
rat_canvas <- draw_chains(rat_canvas, rat_uniprot_df,
label_size = 2.5)
rat_canvas <- draw_domains(rat_canvas, rat_uniprot_df)
rat_canvas <- draw_regions(rat_canvas, rat_uniprot_df)
rat_canvas
## Warning: Removed 1 rows containing missing values (geom_rect).
We can make dotplots of each fasta file sequence plotted against itself to discover areas of sequence repeats. We can create vectors containing single amino acid characters using our list of all the fasta files. Then we can make our dotplot with the dotPlot() function. Below will show four different dotplots all with different settings used for the human version.
human_vector <- fasta_cleaner(EGLN1_fasta_list[1])
par(mfrow = c(2,2),
mar = c(0,0,1,1))
dotPlot(human_vector, human_vector)
dotPlot(human_vector, human_vector,
wsize = 10,
nmatch = 1,
)
dotPlot(human_vector, human_vector,
wsize = 10,
nmatch = 5,
)
dotPlot(human_vector, human_vector,
wsize = 20,
nmatch = 1,
)
Now we can plot the best version of the plot.
dotPlot(human_vector, human_vector)
Now we will repeat for the remaining sequences, showing the final version.
chimp_vector <- fasta_cleaner(EGLN1_fasta_list[2])
dotPlot(chimp_vector, chimp_vector)
mouse_vector <- fasta_cleaner(EGLN1_fasta_list[3])
dotPlot(mouse_vector, mouse_vector)
monkey_vector <- fasta_cleaner(EGLN1_fasta_list[4])
dotPlot(monkey_vector, monkey_vector)
rat_vector <- fasta_cleaner(EGLN1_fasta_list[5])
dotPlot(rat_vector, rat_vector)
fish_vector <- fasta_cleaner(EGLN1_fasta_list[6])
dotPlot(fish_vector, fish_vector)
frog_vector <- fasta_cleaner(EGLN1_fasta_list[7])
dotPlot(frog_vector, frog_vector)
bat_vector <- fasta_cleaner(EGLN1_fasta_list[8])
dotPlot(bat_vector, bat_vector)
squirrel_vector <- fasta_cleaner(EGLN1_fasta_list[9])
dotPlot(squirrel_vector, squirrel_vector)
pig_vector <- fasta_cleaner(EGLN1_fasta_list[10])
dotPlot(pig_vector, pig_vector)
We can now look for protein properties at various websites to learn more about the human protein. Below will show domains located with Pfam, the subcellular location from Uniprot, and secondary structure classification from the PDB. No data was found in DisProt or RepeatDB.
protein_properties_vector <- c("PFAM", "There are two noteworthy domains. The first is the zf-MYND domain from amino acids 21 to 58. The second is the 2OG-FeII_Oxy_3 domain from amino acids 298 to 391.", "http://pfam.xfam.org/protein/Q9GZT9", "Uniprot", "The subcellular location appears to be in the nucleus and cytoplasm.", "https://www.uniprot.org/uniprot/Q9GZT9#subcellular_location", "PDB", "By looking at the sequence, it appears that the protein is composed of alpha helices and beta sheets.", "https://www.rcsb.org/sequence/4kbz")
protein_properties_table <- matrix(protein_properties_vector, nrow = 3, ncol = 3, byrow = T)
protein_properties_df <- data.frame(protein_properties_table, stringsAsFactors = F)
names(protein_properties_df) <- c("Source", "Protein Property", "Link")
pander::pander(protein_properties_df)
| Source | Protein Property |
|---|---|
| PFAM | There are two noteworthy domains. The first is the zf-MYND domain from amino acids 21 to 58. The second is the 2OG-FeII_Oxy_3 domain from amino acids 298 to 391. |
| Uniprot | The subcellular location appears to be in the nucleus and cytoplasm. |
| PDB | By looking at the sequence, it appears that the protein is composed of alpha helices and beta sheets. |
| Link |
|---|
| http://pfam.xfam.org/protein/Q9GZT9 |
| https://www.uniprot.org/uniprot/Q9GZT9#subcellular_location |
| https://www.rcsb.org/sequence/4kbz |
First, we will make a vector of the amino acid one letter abbreviations. Note that we will use double entry to confirm there were no mistakes.
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
## [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T" "W" "Y"
## [20] "V"
We will now make vectors of each amino acid by type based off of Chou’s table 5. We will do logical checks with sum to determine if the vectors are accurate before we show the table.
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
sum(alpha) == 2447
## [1] TRUE
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
sum(beta) == 2776
## [1] TRUE
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
| aa.1.1 | alpha | beta | a.plus.b | a.div.b |
|---|---|---|---|---|
| A | 285 | 203 | 175 | 361 |
| R | 53 | 67 | 78 | 146 |
| N | 97 | 139 | 120 | 183 |
| D | 163 | 121 | 111 | 244 |
| C | 22 | 75 | 74 | 63 |
| Q | 67 | 122 | 74 | 114 |
| E | 134 | 86 | 86 | 257 |
| G | 197 | 297 | 171 | 377 |
| H | 111 | 49 | 33 | 107 |
| I | 91 | 120 | 93 | 239 |
| L | 221 | 177 | 110 | 339 |
| K | 249 | 115 | 112 | 321 |
| M | 48 | 16 | 25 | 91 |
| F | 123 | 85 | 52 | 158 |
| P | 82 | 127 | 71 | 188 |
| S | 122 | 341 | 126 | 327 |
| T | 119 | 253 | 117 | 238 |
| W | 33 | 44 | 30 | 72 |
| Y | 63 | 110 | 108 | 130 |
| V | 167 | 229 | 123 | 378 |
We must now calculate the proportions of the amino acids in each fold class and convert to a data frame where we can then show frequencies.
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
aa.prop_df <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
row.names(aa.prop_df) <- aa.1.1
pander::pander(aa.prop_df)
| Â | alpha.prop | beta.prop | a.plus.b.prop | a.div.b |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
Now, we will test for the number of each amino acid in our human gene with the table() function.
EGLN1_human_aa <- compbio4all::fasta_cleaner(human_fasta, parse = TRUE)
table(EGLN1_human_aa)
## EGLN1_human_aa
## A C D E F G H I K L M N P Q R S T V W Y
## 43 15 28 23 11 47 10 14 33 26 6 15 34 14 28 26 13 22 5 13
The below function allows conversion of a table into a vector.
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
NP_071334_freq_table <- table(EGLN1_human_aa)/length(EGLN1_human_aa)
EGLN1_human_aa_freq <- table_to_vector(NP_071334_freq_table)
pander::pander(EGLN1_human_aa_freq)
| A | C | D | E | F | G | H | I |
|---|---|---|---|---|---|---|---|
| 0.1009 | 0.03521 | 0.06573 | 0.05399 | 0.02582 | 0.1103 | 0.02347 | 0.03286 |
| K | L | M | N | P | Q | R | S |
|---|---|---|---|---|---|---|---|
| 0.07746 | 0.06103 | 0.01408 | 0.03521 | 0.07981 | 0.03286 | 0.06573 | 0.06103 |
| T | V | W | Y |
|---|---|---|---|
| 0.03052 | 0.05164 | 0.01174 | 0.03052 |
We then can test for the presence of unknown amino acids labeled as U. We do not have any unknowns in our gene.
aa.names <- names(EGLN1_human_aa_freq)
any(aa.names == "U")
## [1] FALSE
We can then finally add the data on our human protein to the amino acid frequency table made previously.
aa.prop_df$EGLN1_human_aa_freq <- EGLN1_human_aa_freq
pander::pander(aa.prop_df)
| Â | alpha.prop | beta.prop | a.plus.b.prop | a.div.b | EGLN1_human_aa_freq |
|---|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 | 0.1009 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 | 0.03521 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 | 0.06573 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 | 0.05399 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 | 0.02582 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 | 0.1103 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 | 0.02347 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 | 0.03286 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 | 0.07746 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 | 0.06103 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 | 0.01408 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 | 0.03521 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 | 0.07981 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 | 0.03286 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 | 0.06573 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 | 0.06103 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 | 0.03052 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 | 0.05164 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 | 0.01174 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 | 0.03052 |
We need two functions to calculate correlation between two columns and another to calculate correlation similarities.
# Corrleation used in Chou adn Zhange 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
We can calculate correlation between each column.
corr.alpha <- chou_cor(aa.prop_df[,5], aa.prop_df[,1])
corr.beta <- chou_cor(aa.prop_df[,5], aa.prop_df[,2])
corr.apb <- chou_cor(aa.prop_df[,5], aa.prop_df[,3])
corr.adb <- chou_cor(aa.prop_df[,5], aa.prop_df[,4])
And now calculate cosine similarity.
cos.alpha <- chou_cosine(aa.prop_df[,5], aa.prop_df[,1])
cos.beta <- chou_cosine(aa.prop_df[,5], aa.prop_df[,2])
cos.apb <- chou_cosine(aa.prop_df[,5], aa.prop_df[,3])
cos.adb <- chou_cosine(aa.prop_df[,5], aa.prop_df[,4])
We need to flip the data frame with the t() function.
aa.prop.flipped <- t(aa.prop_df)
round(aa.prop.flipped,2)
## A R N D C Q E G H I L K
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## EGLN1_human_aa_freq 0.10 0.04 0.07 0.05 0.03 0.11 0.02 0.03 0.08 0.06 0.01 0.04
## M F P S T W Y V
## alpha.prop 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## EGLN1_human_aa_freq 0.08 0.03 0.07 0.06 0.03 0.05 0.01 0.03
We can get the euclidean distance.
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## EGLN1_human_aa_freq 0.17891026 0.18908175 0.16206768 0.17398567
It’s possible to get the individual distances with the dist() function.
dist.alpha <- dist((aa.prop.flipped[c(1,4),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,4),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,4),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,4),]), method = "euclidean")
We can then compile the information and display the output.
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
pander::pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.756 | 0.756 | 0.06699 | ||
| beta | 0.7313 | 0.7313 | 0.08659 | ||
| alpha plus beta | 0.7882 | 0.7882 | 0.06175 | most.sim | min.dist |
| alpha/beta | 0.7598 | 0.7598 | 0 |
To begin our pid comparisons, we need to align each sequence to each other. We will be using the human, chimp, mouse, and squirrel sequences.
align_human_chimp <- Biostrings:: pairwiseAlignment(
human_fasta,
chimp_fasta)
align_human_mouse <- Biostrings:: pairwiseAlignment(
human_fasta,
mouse_fasta)
align_human_squirrel <- Biostrings:: pairwiseAlignment(
human_fasta,
squirrel_fasta)
align_human_mouse
## Global PairwiseAlignmentsSingleSubject (1 of 1)
## pattern: >NP_071334.1 egl nine homolog 1 is...RARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
##
##
## subject: >NP_444437.2 egl nine homolog 1 is...RARAKVKYLTGEKGVRVEL-KPN-S-VSKDV-
##
##
## score: 1313.664
#other comparisons needed for table below
align_chimp_mouse <- Biostrings:: pairwiseAlignment(
chimp_fasta,
mouse_fasta)
align_chimp_squirrel <- Biostrings:: pairwiseAlignment(
chimp_fasta,
squirrel_fasta)
align_mouse_squirrel <- Biostrings:: pairwiseAlignment(
mouse_fasta,
squirrel_fasta)
Here we can see the pid of the human sequence to chimp, mouse and squirrel.
Biostrings::pid(align_human_chimp)
## [1] 71.2381
Biostrings::pid(align_human_mouse)
## [1] 77.09163
Biostrings::pid(align_human_squirrel)
## [1] 77.95591
We will now build a matrix to easily show pid comparisons. NA refers to redundant information.
pids_vector <- c("Human", 1, NA, NA, NA,
"Chimp", pid(align_human_chimp), 1, NA, NA,
"Mouse", pid(align_human_mouse), pid(align_chimp_mouse), 1, NA,
"Squirrel", pid(align_human_squirrel), pid(align_chimp_squirrel), pid(align_mouse_squirrel), 1)
pid_table <- matrix(pids_vector, nrow = 4, ncol = 5, byrow = T)
pid_table.df <- data.frame(pid_table,
stringsAsFactors = F)
names(pid_table.df) <- c("Species", "Human", "Chimp","Mouse", "Squirrel")
pander::pander(pid_table.df)
| Species | Human | Chimp | Mouse | Squirrel |
|---|---|---|---|---|
| Human | 1 | NA | NA | NA |
| Chimp | 71.2380952380952 | 1 | NA | NA |
| Mouse | 77.0916334661355 | 63.5658914728682 | 1 | NA |
| Squirrel | 77.9559118236473 | 66.2835249042146 | 81.7622950819672 | 1 |
There are different methods used to calculate pid. Below uses the human and chimp sequences to show differences.
pid_methods_compare_vector <- c("PID1", pid(align_human_chimp, type = "PID1"),
"aligned positions + internal gap positions",
"PID2", pid(align_human_chimp, type = "PID2"),
"aligned positions", "PID3",
pid(align_human_chimp, type = "PID3"),
"length shorter sequence", "PID4",
pid(align_human_chimp, type = "PID4"),
"average length of the two sequences")
pid_methods_compare_table <- matrix(pid_methods_compare_vector, nrow = 4, byrow = T)
pid_methods_compare_table.df <- data.frame(pid_methods_compare_table,
stringsAsFactors = F)
names(pid_methods_compare_table.df) <- c("Method", "PID", "denominator")
pander::pander(pid_methods_compare_table.df)
| Method | PID | denominator |
|---|---|---|
| PID1 | 71.2380952380952 | aligned positions + internal gap positions |
| PID2 | 81.8380743982494 | aligned positions |
| PID3 | 76.1710794297352 | length shorter sequence |
| PID4 | 76.1710794297352 | average length of the two sequences |
We need to prepare our data for creating an msa.The code below is taking the NCBI accession numbers, and using the NCBI database to add every sequence in variable EGLN1. We then clean the files with cat() and create our list.
EGLN1 <-entrez_fetch (db = "protein",
id = ncbi_protein_accession,
rettype = "fasta")
cat(EGLN1)
## >NP_071334.1 egl nine homolog 1 isoform 1 [Homo sapiens]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSEGALGHGVG
## PHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKPPADPAAAASPCRAAAGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
##
## >XP_525092.2 PREDICTED: egl nine homolog 1 [Pan troglodytes]
## MYKEKHIAHVCIWQNPQYRDKQDRNSHLATLTGASLAHPAPAWRKSSHDYAAFAPLPACGWRPPPAGVGA
## AALPPRLFRRFAGPCALGVPRRRPAAGPKAPGKRPGKSTVKPPADPAAAASPSRAAAGGQGSAVAAEAEP
## GKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFL
## GKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCN
## GKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFA
## DIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDSVSKD
## VF
##
## >NP_444437.2 egl nine homolog 1 isoform 1 [Mus musculus]
## MASDSGGPGVLSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGEAPRAQPAP
## AQPRVAPPPGGAPGAARAGGAARRGDSAAASRVPGPEDAAQARSGPGPAEPGSEDPPLSRSPGPERASLC
## PAGGGPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTG
## KFTDGQLVSQKSDSSKDIRGDQITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACY
## PGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNP
## HEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
##
## >XP_001104870.1 egl nine homolog 1 isoform X1 [Macaca mulatta]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCCKEHQRQDWKKHKLVCQGSESALGPGVG
## PHQHSGPAPPVAAPPPRAGAREPRKAAARRDNASGDAAKAKAKGKPPADPAAAASPSRAAPGGQGSAVAA
## EAEPGKEEPPARSSLFQEKANLYPPSNTPGDALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVV
## DDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLI
## RHCNGKLGSYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGK
## AQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDS
## VGKDVF
##
## >XP_002728657.3 PREDICTED: egl nine homolog 1 [Rattus norvegicus]
## MNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGWRPSAP
## GAARAGGAARRGDSSTAASRVPGPEDATQAGSGPGPAEPSSEDPPPSRSPGPERASLCPAGGGPGEALSP
## SGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKS
## DSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYRINGRTKAMVACYPGNGTGYVRHVD
## NPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYA
## ITVWYFDADERARAKVKYLTGEKGVRVELKPNSVSKDV
##
## >NP_001002595.2 egl nine homolog 1b [Danio rerio]
## MEGNSRESERLERERQFCELCGKMENLMKCGRCRSSFYCSKEHQRQDWKKHKRVCKEADKQQQPVGEEEE
## EQPSDTSKQSSTSQSNSTLTSQDEKMPDFIKSATGSDTKPTPDSSKPNGQTRSPPQKLATDYIVPCMNKH
## GICVVDNFLGNEVGRSILEDVRALYLTGGFTDGQLVSQRSDSSKDIRGDKITWVEGKEPGCERIAFLMSR
## MDDLIRHCNGKLGNYRINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDATEHGGLLRI
## FPEGTAQFADIEPKFDRLLLFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKEKYSTSSGERGVKVE
## LNKPSEPS
##
## >NP_001015960.1 egl nine homolog 1 [Xenopus tropicalis]
## MAGGGSEGSNQSERDRQYCELCGKMEDLLRCGRCRSSFYCSKEHQRQDWKKHKLFCKSDSTITPASQNTV
## KNSKKEQFSSSAVSISHSDTSQFKSHVEPASGVSDEQEEVDGGAALKAINEEDDTQVSGEAMGSSTSRKV
## TTSGGRPNGQTKPPLHRIALEYTIPCMNKHGICVLDDFLGQETGDRIVCEVKALHNTGRFTDGQLVSQKS
## DSTRDIRGDQITWVEGKESSCKAIGKLMNKMDDLIRHCSGKLGNFRINGRTKAMVACYPGNGTGYVRHVD
## NPNADGRCVTCIYYLNKQWDAKTHGGLLRIFPEGKSQFADIEPKFDRLLLFWSDRRNPHEVQPAFATRYA
## ITVWYFDADERARAKEKYLNTGERGVRIELNKPSEQVVKEVQNP
##
## >XP_032955515.1 egl nine homolog 1 isoform X1 [Rhinolophus ferrumequinum]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCSRCRSSFYCSKEHQRQDWKKHKLVCQRGDGAPGPGAS
## QHPHPGPAPPAAAPPPRATGAKEARTAAPRRDSAAAGDAADAKAKPAAAASPPRAAPGGQGPAAAAATAT
## AEAEPAKEEPLGRSSLFQDKANLYPPGNTPGEALSHGGGPRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGEEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SISKDVL
##
## >XP_026268254.1 egl nine homolog 1 [Urocitellus parryii]
## MASDSGGPGGPSASERDRQYCELCGKMENLLRCGRCRSSFYCCKEHQRQDWKKHKLVCQGGDPAGPAAAP
## RAQPGPAPPAAAPPTRAGAAAGNAGARAGKAAGQRQGSAEAARAPAPGDAAQARGPRGAAECGQEEPPAR
## RSPCPERASLCPAGSSPGEALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQI
## GDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYKI
## NGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKAQFADIEPKFDR
## LLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSDPVGKDV
##
## >XP_020929239.1 egl nine homolog 1 isoform X1 [Sus scrofa]
## MANDSGGPGGPSPSERDRQYCELCGKMENLLRCGRCRSSFYCSKEHQRQDWKKHKLVCQGGEGAPAPGAG
## QQPQPGPAPLLRKAVPRRDRVAAVEAAGPRAAGDAAKAKAKAKPAADPAAAASPPRAAPGGAGPAAAAAA
## AEAEPAKEEPPARSSLYQEKANLYPPSNTPGEGLSHGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICV
## VDDFLGKETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDL
## IRHCNGKLGNYKINGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEG
## KAQFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLTGEKGVRVELNKPSD
## SVSKDVL
EGLN1_list <- entrez_fetch_list(db = "protein",
id = ncbi_protein_accession,
rettype = "fasta")
We need to make a new vector with the cleaned fasta files from each sequence starting from a vector of NAs. We need to add names to our vector and add the respective amino acids from each file.
for(i in 1:length(EGLN1_list)){
EGLN1_list[[i]] <- fasta_cleaner(EGLN1_list[[i]], parse = F)
}
EGLN1_vector <- rep(NA, length(EGLN1_list))
for(i in 1:length(EGLN1_vector)){
EGLN1_vector[i] <- EGLN1_list[[i]]
}
names(EGLN1_vector) <- names(EGLN1_list)
EGLN1_list_ss <- Biostrings:: AAStringSet(EGLN1_vector)
The code below creates an interface, preparing for the making of an msa.
EGLN1_align_msa <- msa(EGLN1_list_ss,
method = "ClustalW")
## use default substitution matrix
EGLN1_align_msa
## CLUSTAL 2.1
##
## Call:
## msa(EGLN1_list_ss, method = "ClustalW")
##
## MsaAAMultipleAlignment with 10 rows and 439 columns
## aln names
## [1] MASDSGGPGV---LSASERDRQYCE...--GEKGVRVELKPNS--VSKDV--- NP_444437.2
## [2] --MNKHGICV---VDDFLGRETGQQ...--GEKGVRVELKPNS--VSKDV--- XP_002728657.3
## [3] MASDSGGPGG---PSASERDRQYCE...--GEKGVRVELNKPSDPVGKDV--- XP_026268254.1
## [4] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVGKDVF-- NP_071334
## [5] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVGKDVF-- XP_001104870.1
## [6] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSISKDVL-- XP_032955515.1
## [7] MANDSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVSKDVL-- XP_020929239.1
## [8] MYKEKHIAHVCIWQNPQYRDKQ--D...--GEKGVRVELNKPSDSVSKDVF-- XP_525092.2
## [9] --MEGNSRE----SERLERERQFCE...SSGERGVKVELNKPSEPS------- NP_001002595.2
## [10] --MAGGGSEG---SNQSERDRQYCE...-TGERGVRIELNKPSEQVVKEVQNP NP_001015960.1
## Con MA?DSGGPGG---PSPSERDRQYCE...--GEKGVRVELNKPSDSVSKDV?-- Consensus
We need to clean the code a bit before displaying our msa. We have to change the class of alignment as well as convert to the sequinr form.
class(EGLN1_align_msa) <- "AAMultipleAlignment"
EGLN1_align_msa_seqinr <- msaConvert(EGLN1_align_msa, type = "seqinr::alignment")
compbio4all::print_msa(EGLN1_align_msa_seqinr)
## [1] "MASDSGGPGV---LSASERDRQYCELCGKMENLLRCGRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "--MNKHGICV---VDDFLGRETGQQIGDEVRALHDTGKFTDGQLVSQKSDSSKDIRGDKI 0"
## [1] "MASDSGGPGG---PSASERDRQYCELCGKMENLLRCGRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCSRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCSRCRS-SFYCCKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCSRCRS-SFYCSKEHQRQDWKKHKL 0"
## [1] "MANDSGGPGG---PSPSERDRQYCELCGKMENLLRCGRCRS-SFYCSKEHQRQDWKKHKL 0"
## [1] "MYKEKHIAHVCIWQNPQYRDKQ--DRNSHLATLTGASLAHP-APAWRKSS--HDYAAFAP 0"
## [1] "--MEGNSRE----SERLERERQFCELCGKMENLMKCGRCRS-SFYCSKEHQRQDWKKHKR 0"
## [1] "--MAGGGSEG---SNQSERDRQYCELCGKMEDLLRCGRCRS-SFYCSKEHQRQDWKKHKL 0"
## [1] " "
## [1] "VCQGGE---------------------APRAQPAPAQPRVAPPPGGAPGAARAGGAARRG 0"
## [1] "TWIEGK---------------------EPGWRPS------------APGAARAGGAARRG 0"
## [1] "VCQGGDPA-----------GPAAAPRAQPGPAPPAAAPPTRAGAAAGNAGARAGKAAGQR 0"
## [1] "VCQGSEGALGHGVGPHQHSGPAPPAAVPPPRAGAREPRKAAARRDNASGDAAKGKVKAKP 0"
## [1] "VCQGSESALGPGVGPHQHSGPAPPVAAPPPRAGAREPRKAAARRDNASGDAAKAKAKGKP 0"
## [1] "VCQRGDGAPGPGASQHPHPGPAPPAAAPPPRATGAKEARTAAPR--RDSAAAGDAADAK- 0"
## [1] "VCQGGEGAPAPGAGQQPQPGPAPLLRKAVPRRDRVAAVEAAGPR--AAGDAAKAKAKAKP 0"
## [1] "LPACGWRPPPAGVGAAALP-PRLFRRFAGPCALGVPRRRPAAGP--KAPGKRPGKSTVKP 0"
## [1] "VCKEAD-----------------------KQQQPVG-----------EEEEEQPS----- 0"
## [1] "FCKSDSTI-------------------TPASQNTVK-----------NSKKEQFSSSAVS 0"
## [1] " "
## [1] "DS-AAASRVPGPEDAAQARSGPG------PAEPGSEDPPLSRSPGPERASLCPAGGGPGE 0"
## [1] "DSSTAASRVPGPEDATQAGSGPG------PAEPSSEDPPPSRSPGPERASLCPAGGGPGE 0"
## [1] "QGSAEAARAPAPGDAAQARGPRG------AAECGQEEPPARRSPCPERASLCPAGSSPGE 0"
## [1] "PADPAAAASPCRAAAGGQGSAVA-----AEAEPGKEEPPARSSLFQEKANLYPPSNTPGD 0"
## [1] "PADPAAAASPSRAAPGGQGSAVA-----AEAEPGKEEPPARSSLFQEKANLYPPSNTPGD 0"
## [1] "-AKPAAAASPPRAAPGGQGPAAAAATATAEAEPAKEEPLGRSSLFQDKANLYPPGNTPGE 0"
## [1] "AADPAAAASPPRAAPGGAGPAAAAAAA--EAEPAKEEPPARSSLYQEKANLYPPSNTPGE 0"
## [1] "PADPAAAASPSRAAAGGQGSAVA-----AEAEPGKEEPPARSSLFQEKANLYPPSNTPGD 0"
## [1] "----DTSKQSSTSQSNSTLTSQD------EKMP------------DFIKS--ATGSDT-- 0"
## [1] "ISHSDTSQFKSHVEPASGVSDEQ------EEVDGGAALKAINEEDDTQVSGEAMGSSTSR 0"
## [1] " "
## [1] "ALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHD 0"
## [1] "ALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHD 0"
## [1] "ALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGRETGQQIGDEVRALHD 0"
## [1] "ALSPGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "ALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "ALSHGGGPRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGEEVRALHD 0"
## [1] "GLSHGGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "ALSPSGGLRPNGQTKPLPALKLALEYIVPCMNKHGICVVDDFLGKETGQQIGDEVRALHD 0"
## [1] "KPTPDSS-KPNGQTRS-PPQKLATDYIVPCMNKHGICVVDNFLGNEVGRSILEDVRALYL 0"
## [1] "KVTTSGG-RPNGQTKP-PLHRIALEYTIPCMNKHGICVLDDFLGQETGDRIVCEVKALHN 0"
## [1] " "
## [1] "TGKFTDGQLVSQKSDSSKDIRGDQITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYR 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCSGKLGNYR 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGNYK 0"
## [1] "TGKFTDGQLVSQKSDSSKDIRGDKITWIEGKEPGCETIGLLMSSMDDLIRHCNGKLGSYK 0"
## [1] "TGGFTDGQLVSQRSDSSKDIRGDKITWVEGKEPGCERIAFLMSRMDDLIRHCNGKLGNYR 0"
## [1] "TGRFTDGQLVSQKSDSTRDIRGDQITWVEGKESSCKAIGKLMNKMDDLIRHCSGKLGNFR 0"
## [1] " "
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDAKVSGGILRIFPEGKA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNGDGRCVTCIYYLNKDWDATEHGGLLRIFPEGTA 0"
## [1] "INGRTKAMVACYPGNGTGYVRHVDNPNADGRCVTCIYYLNKQWDAKTHGGLLRIFPEGKS 0"
## [1] " "
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLFFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKVKYLT--GEKG 0"
## [1] "QFADIEPKFDRLLLFWSDRRNPHEVQPAYATRYAITVWYFDADERARAKEKYSTSSGERG 0"
## [1] "QFADIEPKFDRLLLFWSDRRNPHEVQPAFATRYAITVWYFDADERARAKEKYLN-TGERG 0"
## [1] " "
## [1] "VRVELKPNS--VSKDV--- 41"
## [1] "VRVELKPNS--VSKDV--- 41"
## [1] "VRVELNKPSDPVGKDV--- 41"
## [1] "VRVELNKPSDSVGKDVF-- 41"
## [1] "VRVELNKPSDSVGKDVF-- 41"
## [1] "VRVELNKPSDSISKDVL-- 41"
## [1] "VRVELNKPSDSVSKDVL-- 41"
## [1] "VRVELNKPSDSVSKDVF-- 41"
## [1] "VKVELNKPSEPS------- 41"
## [1] "VRIELNKPSEQVVKEVQNP 41"
## [1] " "
We can now display our msa.
ggmsa:: ggmsa(EGLN1_align_msa,
start = 0,
end = 50)
Too make things easier to calculate a distance matrix, we are only going to use a subset of the sequences. We will use sequences for humans, mice, fish, bats and squirrels. We will make a new vector with the accession numbers of the sequences we want and we will use bracket notation to grab what we want from EGLN1_list_ss. We will then add this to a new variable.
assessions_subsets <- c("NP_071334", "NP_444437.2", "NP_001002595.2", "XP_032955515.1", "XP_026268254.1")
EGLN1_list_ss[assessions_subsets]
## AAStringSet object of length 5:
## width seq names
## [1] 426 MANDSGGPGGPSPSERDRQYCEL...LTGEKGVRVELNKPSDSVGKDVF NP_071334
## [2] 400 MASDSGGPGVLSASERDRQYCEL...VKYLTGEKGVRVELKPNSVSKDV NP_444437.2
## [3] 358 MEGNSRESERLERERQFCELCGK...EKYSTSSGERGVKVELNKPSEPS NP_001002595.2
## [4] 427 MANDSGGPGGPSPSERDRQYCEL...LTGEKGVRVELNKPSDSISKDVL XP_032955515.1
## [5] 413 MASDSGGPGGPSASERDRQYCEL...YLTGEKGVRVELNKPSDPVGKDV XP_026268254.1
EGLN1_subset_list_ss <- EGLN1_list_ss[assessions_subsets]
We will have to do the same conversions again as we did to set up our msa.
EGLN1_align_subset <- msa(EGLN1_subset_list_ss,
method = "ClustalW")
## use default substitution matrix
class(EGLN1_align_subset) <- "AAMultipleAlignment"
EGLN1_align_subset_seqinr <- msaConvert(EGLN1_align_subset, type = "seqinr::alignment")
Now we will calculating genetic distance from our msa using the seqinr::dist.alignment() function.
EGLN1_subset_align_dist <- seqinr::dist.alignment(EGLN1_align_subset_seqinr,
matrix = "identity")
EGLN1_subset_align_dist_rounded <- round(EGLN1_subset_align_dist,
digits = 3)
EGLN1_subset_align_dist_rounded
## NP_444437.2 XP_026268254.1 NP_071334 XP_032955515.1
## XP_026268254.1 0.312
## NP_071334 0.410 0.395
## XP_032955515.1 0.430 0.412 0.309
## NP_001002595.2 0.555 0.543 0.548 0.546
Now we will repeat the same steps but with all the sequences. You will see that the matrix is a bit more confusing to understand as it is much bigger.
EGLN1_full_align_dist <- seqinr::dist.alignment(EGLN1_align_msa_seqinr,
matrix = "identity")
EGLN1_full_align_dist_rounded <- round(EGLN1_full_align_dist,
digits = 3)
EGLN1_full_align_dist_rounded
## NP_444437.2 XP_002728657.3 XP_026268254.1 NP_071334
## XP_002728657.3 0.407
## XP_026268254.1 0.361 0.495
## NP_071334 0.433 0.545 0.426
## XP_001104870.1 0.439 0.545 0.426 0.153
## XP_032955515.1 0.444 0.551 0.431 0.337
## XP_020929239.1 0.425 0.540 0.424 0.326
## XP_525092.2 0.545 0.533 0.544 0.463
## NP_001002595.2 0.560 0.631 0.551 0.566
## NP_001015960.1 0.581 0.663 0.586 0.598
## XP_001104870.1 XP_032955515.1 XP_020929239.1 XP_525092.2
## XP_002728657.3
## XP_026268254.1
## NP_071334
## XP_001104870.1
## XP_032955515.1 0.337
## XP_020929239.1 0.323 0.318
## XP_525092.2 0.463 0.502 0.489
## NP_001002595.2 0.563 0.556 0.558 0.628
## NP_001015960.1 0.594 0.591 0.594 0.663
## NP_001002595.2
## XP_002728657.3
## XP_026268254.1
## NP_071334
## XP_001104870.1
## XP_032955515.1
## XP_020929239.1
## XP_525092.2
## NP_001002595.2
## NP_001015960.1 0.545
To plot a phylogenetic tree, we can use the neighbor joining algorithm with the nj() function. The nj() function takes in a distance matrix as an argument and uses genetic distances to cluster sequences into clades.
tree_set <- nj(EGLN1_full_align_dist)
Now we can plot the tree.
plot.phylo(tree_set, main="Phylogenetic Tree",
use.edge.length = F)
# add label
mtext(text = "ELGN1 family gene tree - rooted, no branch lenths")