The gene and the protein it encodes which are being analyzed in this script is called GABARAP, which stands for gamma-aminobutyric acid receptor-associated protein. This protein is found in Homo Sapiens and many other species. In this script, I will be fetching information from various databases of the GABARA1 gene and associated protein, completing pairwise and multiple alignments with other species, and producing phylogenetic trees for visualization of the relatedness between the sequences.
Much of the important information used to analyze this protein within this script was obtained from NCBI databases. This includes:
- RefSeq (protein page): https://www.ncbi.nlm.nih.gov/protein/NP_009209.1
- RefSeq (Homologene): https://www.ncbi.nlm.nih.gov/homologene/134119
References:
- https://rpubs.com/lowbrowR/drawProtein
- https://rpubs.com/lowbrowR/843543
- https://rpubs.com/lowbrowR/844554
# GitHub packages
library(devtools)
# devtools::install_github("brouwern/compbio4all")
library(compbio4all)
# devtools::install_github("YuLab-SMU/ggmsa")
# library(ggmsa)
# CRAN packages
library(rentrez)
library(ape)
library(pander)
library(seqinr)
##
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
##
## as.alignment, consensus
library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:ape':
##
## rotate
library(ggplot2)
# Bioconductor packages
library(BiocManager)
##
## Attaching package: 'BiocManager'
## The following object is masked from 'package:devtools':
##
## install
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
##
## windows
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:base':
##
## strsplit
library(drawProteins)
library(msa)
##
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
##
## version
These accession numbers were found using RefSeq, RefSeq Homologene, UniProt, and PDB. Neither Homo neanderthalensis nor Drosophila entries were found after thorough searching of these databases. The final three entries in the table were obtained via a protein BLAST search. First, vertebrates were excluded, and there were 2 results and the gene existed in invertebrates. One of these with sufficient data was included. Mammals were then excluded from the BLAST search which yielded many results.
# RefSeq, UniProt, PDB, scientific name, common name, gene name
ref_seq <- c('NP_009209.1', 'XP_003315392.1', 'XP_536616.2', 'NP_001029220.1', 'NP_062723.1', 'NP_742033.1', 'NP_001013278.1', 'XP_031568314.1', 'NP_001084530.1', 'XP_004079842.1')
uniprot <- c('O95166', 'H2QC26', 'F1Q2L2', 'Q9GJW7', 'Q9DCD6', 'P60517', 'Q6PSS4', 'A0A6P8IN80', 'Q6NUG7', 'A0A3P9HG37')
pdb <- c('1GNU', 'NA', 'NA', 'NA', '5YIR', '1KJT', 'NA', 'NA', 'NA', 'NA')
sci_name <- c('Homo sapiens', 'Pan troglodytes', 'Canis lupus', 'Bos taurus', 'Mus musculus', 'Rattus norvegicus', 'Danio rerio', 'Actinia tenebrosa', 'Xenopus laevis', 'Oryzias latipes')
common_name <- c('Human', 'Chimpanzee', 'Dog', 'Bovine', 'Mouse', 'Rat', 'Zebrafish', 'Australian red waratah sea anemone', 'African clawed frog', 'Japanese rice rish')
gene_name <- c('GABARAP', 'GABARAP', 'GABARAP', 'GABARAP', 'Gabarap', 'Gabarap', 'gabarap', 'LOC116303016', 'gabarap.L', 'NA')
Convert this vector information into a data frame to produce a table.
gabarap_table <- data.frame(NCBI.Accession = ref_seq,
UniProt.ID = uniprot,
PDB.ID = pdb,
ScientificName = sci_name,
CommonName = common_name,
GeneName = gene_name)
pander::pander(gabarap_table)
| NCBI.Accession | UniProt.ID | PDB.ID | ScientificName |
|---|---|---|---|
| NP_009209.1 | O95166 | 1GNU | Homo sapiens |
| XP_003315392.1 | H2QC26 | NA | Pan troglodytes |
| XP_536616.2 | F1Q2L2 | NA | Canis lupus |
| NP_001029220.1 | Q9GJW7 | NA | Bos taurus |
| NP_062723.1 | Q9DCD6 | 5YIR | Mus musculus |
| NP_742033.1 | P60517 | 1KJT | Rattus norvegicus |
| NP_001013278.1 | Q6PSS4 | NA | Danio rerio |
| XP_031568314.1 | A0A6P8IN80 | NA | Actinia tenebrosa |
| NP_001084530.1 | Q6NUG7 | NA | Xenopus laevis |
| XP_004079842.1 | A0A3P9HG37 | NA | Oryzias latipes |
| CommonName | GeneName |
|---|---|
| Human | GABARAP |
| Chimpanzee | GABARAP |
| Dog | GABARAP |
| Bovine | GABARAP |
| Mouse | Gabarap |
| Rat | Gabarap |
| Zebrafish | gabarap |
| Australian red waratah sea anemone | LOC116303016 |
| African clawed frog | gabarap.L |
| Japanese rice rish | NA |
Protein sequences were retrieved from NCBI databases using the compbio4all::entrez_fetch_list() function, which uses the function rentrez::entrez_fetch().
gabarap_list <- entrez_fetch_list(db = "protein",
id = gabarap_table$NCBI.Accession,
rettype = "fasta")
Check the length of the list and remove the FASTA header. Check the first entry in the list to make sure it contains only the protein sequence.
length(gabarap_list)
## [1] 10
for(i in 1:length(gabarap_list)){
gabarap_list[[i]] <- compbio4all::fasta_cleaner(gabarap_list[[i]], parse = F)
}
gabarap_list[[1]]
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
Use the UniProt accession number for the human sequence to generate a protein diagram indicating interesting regions of the protein.
O95166_id <- drawProteins::get_features(uniprot[1])
## [1] "Download has worked"
Convert the vector to a dataframe.
is(O95166_id)
## [1] "list" "vector" "list_OR_List"
## [4] "vector_OR_factor"
my_prot_df <- drawProteins::feature_to_dataframe(O95166_id)
This creates a visualization which shows the regions where secondary structures can be identified, and shows that there are three different regions defined by the interaction GABARAP has with proteins. I included all of these aspects because they are interesting and the visual is informative. It shows what the structure is in specific regions where the protein is known to interact with other proteins, suggesting information about its function.
my_prot_df <- drawProteins::feature_to_dataframe(O95166_id)
my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)
my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_motif(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas <- draw_repeat(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas
my_prot_df
## type
## featuresTemp CHAIN
## featuresTemp.1 PROPEP
## featuresTemp.2 REGION
## featuresTemp.3 REGION
## featuresTemp.4 REGION
## featuresTemp.5 SITE
## featuresTemp.6 LIPID
## featuresTemp.7 MUTAGEN
## featuresTemp.8 MUTAGEN
## featuresTemp.9 MUTAGEN
## featuresTemp.10 MUTAGEN
## featuresTemp.11 MUTAGEN
## featuresTemp.12 MUTAGEN
## featuresTemp.13 HELIX
## featuresTemp.14 HELIX
## featuresTemp.15 STRAND
## featuresTemp.16 STRAND
## featuresTemp.17 HELIX
## featuresTemp.18 STRAND
## featuresTemp.19 STRAND
## featuresTemp.20 HELIX
## featuresTemp.21 STRAND
## featuresTemp.22 STRAND
## description
## featuresTemp Gamma-aminobutyric acid receptor-associated protein
## featuresTemp.1 Removed in mature form
## featuresTemp.2 Interaction with beta-tubulin
## featuresTemp.3 Interaction with GPHN
## featuresTemp.4 Interaction with GABRG2
## featuresTemp.5 Cleavage; by ATG4B
## featuresTemp.6 Phosphatidylethanolamine amidated glycine
## featuresTemp.7 No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with H-25 and H-54.
## featuresTemp.8 No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with Q-24 and H-54.
## featuresTemp.9 Inhibits interaction with TECPR2.
## featuresTemp.10 No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with Q-24 and H-25.
## featuresTemp.11 No effect on interaction with TECPR2.
## featuresTemp.12 Impairs localization at the autophagosomal membrane.
## featuresTemp.13 NONE
## featuresTemp.14 NONE
## featuresTemp.15 NONE
## featuresTemp.16 NONE
## featuresTemp.17 NONE
## featuresTemp.18 NONE
## featuresTemp.19 NONE
## featuresTemp.20 NONE
## featuresTemp.21 NONE
## featuresTemp.22 NONE
## begin end length accession entryName taxid order
## featuresTemp 1 116 115 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.1 117 117 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.2 1 22 21 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.3 36 117 81 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.4 36 68 32 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.5 116 117 1 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.6 116 116 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.7 24 24 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.8 25 25 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.9 49 50 1 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.10 54 54 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.11 67 67 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.12 116 116 0 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.13 4 8 4 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.14 11 24 13 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.15 28 35 7 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.16 48 52 4 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.17 57 67 10 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.18 77 80 3 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.19 87 90 3 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.20 91 98 7 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.21 101 103 2 O95166 GBRAP_HUMAN 9606 1
## featuresTemp.22 105 114 9 O95166 GBRAP_HUMAN 9606 1
Access the human sequence from the list and make sure it is in the proper format with each character separated in the vector.
gabarap_human_vector <- fasta_cleaner(gabarap_list[1])
gabarap_human_vector
## [1] "M" "K" "F" "V" "Y" "K" "E" "E" "H" "P" "F" "E" "K" "R" "R" "S" "E"
## [18] "G" "E" "K" "I" "R" "K" "K" "Y" "P" "D" "R" "V" "P" "V" "I" "V" "E"
## [35] "K" "A" "P" "K" "A" "R" "I" "G" "D" "L" "D" "K" "K" "K" "Y" "L" "V"
## [52] "P" "S" "D" "L" "T" "V" "G" "Q" "F" "Y" "F" "L" "I" "R" "K" "R" "I"
## [69] "H" "L" "R" "A" "E" "D" "A" "L" "F" "F" "F" "V" "N" "N" "V" "I" "P"
## [86] "P" "T" "S" "A" "T" "M" "G" "Q" "L" "Y" "Q" "E" "H" "H" "E" "E" "D"
## [103] "F" "F" "L" "Y" "I" "A" "Y" "S" "D" "E" "S" "V" "Y" "G" "L"
Use the human protein sequences to create a dot plot.
dotPlot(gabarap_human_vector, gabarap_human_vector,
main = "Human GABARAP Dot Plot")
Create a 2x2 panel which shows the different values for dot plot settings.
par(mfrow = c(2,2),
mar = c(0,0,2,1))
dotPlot(gabarap_human_vector, gabarap_human_vector,
wsize = 10,
nmatch = 5,
main = "wsize = 10, nmatch = 5")
dotPlot(gabarap_human_vector, gabarap_human_vector,
wsize = 15,
nmatch = 3,
main = "wsize = 15, nmatch = 3")
dotPlot(gabarap_human_vector, gabarap_human_vector,
wsize = 6,
nmatch = 2,
main = "wsize = 5, nmatch = 1")
dotPlot(gabarap_human_vector, gabarap_human_vector,
wsize = 20,
nmatch = 5,
main = "wsize = 20, nmatch = 5")
par(mfrow = c(1,1),
mar = c(4,4,4,4))
We can include arguments to adjust the settings of a dot plot to present the data in an interesting way.
dotPlot(gabarap_human_vector,
gabarap_human_vector,
wsize = 7,
nmatch = 2,
main = "Human GABARAP Dot Plot: wsize = 7, nmatch = 2")
No information could be found about this protein in DisProt or RepeatsDB.
source <- c("PFAM", "UniProt", "PDB", "AlphaFold")
prot_property <- c("There is one domain, labeled ATG8, spanning from AA position 13-116, which is the majority of the 117-amino acid length protein. It is an autophagy protein domain.",
"GABARAP is associated with intracellular membrane structures, and is primarily found in the cytoskeleton and Golgi apparatus membrane.",
"The PDB structure shows that the protein contains both alpha helices and beta sheets in its secondary structure. Beta turns and omega loops can also be observed in the structure.",
"AlphaFold shows a similar structural summary as PDB, with alpha helices, beta sheets, beta turns, and omega loops present. The structure is predicted with high confidence.")
link <- c("http://pfam.xfam.org/protein/O95166",
"https://www.uniprot.org/uniprot/O95166",
"https://www.rcsb.org/structure/1gnu",
"https://alphafold.ebi.ac.uk/entry/O95166")
properties_table <- data.frame(Source = source,
ProteinProperty = prot_property,
Link = link)
pander::pander(properties_table)
| Source | ProteinProperty |
|---|---|
| PFAM | There is one domain, labeled ATG8, spanning from AA position 13-116, which is the majority of the 117-amino acid length protein. It is an autophagy protein domain. |
| UniProt | GABARAP is associated with intracellular membrane structures, and is primarily found in the cytoskeleton and Golgi apparatus membrane. |
| PDB | The PDB structure shows that the protein contains both alpha helices and beta sheets in its secondary structure. Beta turns and omega loops can also be observed in the structure. |
| AlphaFold | AlphaFold shows a similar structural summary as PDB, with alpha helices, beta sheets, beta turns, and omega loops present. The structure is predicted with high confidence. |
| Link |
|---|
| http://pfam.xfam.org/protein/O95166 |
| https://www.uniprot.org/uniprot/O95166 |
| https://www.rcsb.org/structure/1gnu |
| https://alphafold.ebi.ac.uk/entry/O95166 |
UniProt shows that the GABARAP protein is associated with intracellular membrane structures, particularly the Golgi Apparatus membrane, as well as the cytoskeleton. Link: https://www.uniprot.org/uniprot/O95166\n
The code used to set up this prediction was obtained here: https://rpubs.com/lowbrowR/843543. The information used was adapted from Chou and Zhang, 1994.
Make amino acid name vectors.
# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
## [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T"
## [18] "W" "Y" "V"
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
any(c(aa.1.1 == aa.1.2) == FALSE)
## [1] FALSE
This is an input of data from the table from Chou and Zhang 1994.
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# check against chou's total
sum(beta) == 2776
## [1] TRUE
Logical checks are done to make sure that data entry was successful.
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE
Calculate the proportion of each of the four protein fold types.
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)
Create a dataframe with the proportion information.
aa.prop <- data.frame(alpha.prop,
beta.prop,
a.plus.b.prop,
a.div.b)
row.names(aa.prop) <- aa.1.1
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
Now we can use this information to determine the frequency of amino acids encoded by the GABARAP gene.
gabarap_freq_table <- table(gabarap_human_vector)/length(gabarap_human_vector)
gabarap_freq_table
## gabarap_human_vector
## A D E F G H
## 0.05128205 0.05982906 0.09401709 0.07692308 0.04273504 0.03418803
## I K L M N P
## 0.05982906 0.10256410 0.07692308 0.01709402 0.01709402 0.05982906
## Q R S T V Y
## 0.02564103 0.06837607 0.04273504 0.02564103 0.07692308 0.06837607
Convert this table into a vector.
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
gabarap_human_aa_freq <- table_to_vector(gabarap_freq_table)
gabarap_human_aa_freq
## A D E F G H
## 0.05128205 0.05982906 0.09401709 0.07692308 0.04273504 0.03418803
## I K L M N P
## 0.05982906 0.10256410 0.07692308 0.01709402 0.01709402 0.05982906
## Q R S T V Y
## 0.02564103 0.06837607 0.04273504 0.02564103 0.07692308 0.06837607
gabarap_human_aa_freq <- c(A=0.05128205, C = 0, D = 0.05982906, E = 0.09401709, F = 0.07692308, G = 0.04173504, H = 0.03418803, I = 0.05982906, K = 0.10256410, L = 0.07692308, M = 0.01709402, N = 0.01709402, P = 0.05982906, Q = 0.02564103, R = 0.06837607, S = 0.04273504, T = 0.02654103, V = 0.07692308, W = 0, Y = 0.06837607)
Check if “U” is present in the protein.
aa.names <- names(gabarap_human_aa_freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## character(0)
gabarap_human_aa_freq[i.U]
## named numeric(0)
Use the GABARAP frequency information and add it to the amino acid frequency table. NOTE: GABARAP does not have two amino acids that are in the table and the column could not be appended. I created the table manually so that I would be able to run the code.
aa.prop$gabarap_human_aa_freq <- gabarap_human_aa_freq
pander::pander(aa.prop)
| alpha.prop | beta.prop | a.plus.b.prop | a.div.b | gabarap_human_aa_freq | |
|---|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 | 0.05128 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 | 0 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 | 0.05983 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 | 0.09402 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 | 0.07692 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 | 0.04174 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 | 0.03419 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 | 0.05983 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 | 0.1026 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 | 0.07692 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 | 0.01709 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 | 0.01709 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 | 0.05983 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 | 0.02564 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 | 0.06838 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 | 0.04274 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 | 0.02654 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 | 0.07692 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 | 0 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 | 0.06838 |
Use these two custom functions to calculate correlation.
# Corrleation used in Chou and Zhang 1992.
chou_corr <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
Calculate the correlation in each column.
corr.alpha <- chou_corr(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_corr(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_corr(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_corr(aa.prop[,5], aa.prop[,4])
Calculate the cosine similarity.
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])
Calculate the distance.
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
## A R N D C Q E G H I
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05
## a.div.b 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06
## gabarap_human_aa_freq 0.05 0.00 0.06 0.09 0.08 0.04 0.03 0.06 0.10 0.08
## L K M F P S T W Y V
## alpha.prop 0.09 0.10 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.06 0.04 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.06 0.06 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b 0.08 0.07 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## gabarap_human_aa_freq 0.02 0.02 0.06 0.03 0.07 0.04 0.03 0.08 0.00 0.07
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b 0.06699039 0.08659174 0.06175113
## gabarap_human_aa_freq 0.19389115 0.19771646 0.17495923 0.17870305
Calculate the individual distances.
dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
fold_name <- c("alpha", "beta", "alpha + beta", "alpha / beta")
correlation_sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine_sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
euc_dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
sim_summary <- c("","","max sim", "")
dist_summary <- c("","","min sum","")
fold_df <- data.frame(fold_name,
correlation_sim,
cosine_sim,
euc_dist,
sim_summary,
dist_summary)
pander::pander(fold_df)
| fold_name | correlation_sim | cosine_sim | euc_dist | sim_summary |
|---|---|---|---|---|
| alpha | 0.7178 | 0.7178 | 0.1939 | |
| beta | 0.7104 | 0.7104 | 0.1977 | |
| alpha + beta | 0.7577 | 0.7577 | 0.175 | max sim |
| alpha / beta | 0.751 | 0.751 | 0.1787 |
| dist_summary |
min sum
List was FASTA cleaned in an earlier code chunk
names(gabarap_list)
## [1] "NP_009209.1" "XP_003315392.1" "XP_536616.2" "NP_001029220.1"
## [5] "NP_062723.1" "NP_742033.1" "NP_001013278.1" "XP_031568314.1"
## [9] "NP_001084530.1" "XP_004079842.1"
length(gabarap_list)
## [1] 10
for(i in 1:length(gabarap_list)){
gabarap_list[[i]] <- compbio4all::fasta_cleaner(gabarap_list[[i]], parse = F)
}
gabarap_list[[1]]
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
Convert each entry of the list into a vector.
gabarap_vector <- rep(NA, length(gabarap_list))
for(i in 1:length(gabarap_vector)){
gabarap_vector[i] <- gabarap_list[[i]]
}
names(gabarap_vector) <- names(gabarap_list)
gabarap_vector
## NP_009209.1
## "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
## XP_003315392.1
## "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
## XP_536616.2
## "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
## NP_001029220.1
## "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
## NP_062723.1
## "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
## NP_742033.1
## "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"
## NP_001013278.1
## "MKFMYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGDVLKDV"
## XP_031568314.1
## "MKWEYKEEHPFEKRRAEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGTD"
## NP_001084530.1
## "MKFVYKEEHAFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGM"
## XP_004079842.1
## "MKFQYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGNSQREM"
Perform alignments between selected species: human, chimp, sea anemone, and fish.
align.01.02 <- Biostrings::pairwiseAlignment(
gabarap_vector[1],
gabarap_vector[2])
align.01.08 <- Biostrings::pairwiseAlignment(
gabarap_vector[1],
gabarap_vector[8])
align.01.10 <- Biostrings::pairwiseAlignment(
gabarap_vector[1],
gabarap_vector[10])
align.02.08 <- Biostrings::pairwiseAlignment(
gabarap_vector[2],
gabarap_vector[8]
)
align.02.10 <- Biostrings::pairwiseAlignment(
gabarap_vector[2],
gabarap_vector[10]
)
align.08.10 <- Biostrings::pairwiseAlignment(
gabarap_vector[8],
gabarap_vector[10]
)
Calculate the percent identities between human with chimp, sea anemone, and fish.
Biostrings::pid(align.01.02)
## [1] 100
Biostrings::pid(align.01.08)
## [1] 96.5812
Biostrings::pid(align.01.10)
## [1] 97.4359
Build a percent identity matrix with these 4 species.
pids <- c(1, NA, NA, NA,
pid(align.01.02), 1, NA, NA,
pid(align.01.08), pid(align.02.08), 1, NA,
pid(align.01.10), pid(align.02.10), pid(align.08.10), 1)
pid_mat <- matrix(pids, nrow = 4, byrow = T)
row.names(pid_mat) <- c("Homo","Pan","Sea anemone","Fish")
colnames(pid_mat) <- c("Homo","Pan","Sea anemone","Fish")
pander::pander(pid_mat)
| Homo | Pan | Sea anemone | Fish | |
|---|---|---|---|---|
| Homo | 1 | NA | NA | NA |
| Pan | 100 | 1 | NA | NA |
| Sea anemone | 96.58 | 96.58 | 1 | NA |
| Fish | 97.44 | 97.44 | 94.92 | 1 |
Different methods can be specified to calculate the percent identity between two species. We will use this method for the human and sea anemone alignment. I chose to use this alignment rather than that with the chimp because there was a 100% PID between human and chimp, thus the data might not be as interesting.
Biostrings::pid(align.01.08, type = "PID1")
## [1] 96.5812
Biostrings::pid(align.01.08, type = "PID2")
## [1] 96.5812
Biostrings::pid(align.01.08, type = "PID3")
## [1] 96.5812
Biostrings::pid(align.01.08, type = "PID4")
## [1] 96.17021
Create a table to compare these different methods of calculating percent identity.
method <- c("PID1", "PID2", "PID3", "PID4")
pid.01.08 <- c(Biostrings::pid(align.01.08, type = "PID1"),
Biostrings::pid(align.01.08, type = "PID2"),
Biostrings::pid(align.01.08, type = "PID3"),
Biostrings::pid(align.01.08, type = "PID4")
)
denominator <- c("(aligned positions + internal gap positions)",
"(aligned positions)",
"(length shorter sequence)",
"(average length of two sequences)")
pid_table <- data.frame(Method = method,
PID = pid.01.08,
Denominator = denominator)
pander::pander(pid_table)
| Method | PID | Denominator |
|---|---|---|
| PID1 | 96.58 | (aligned positions + internal gap positions) |
| PID2 | 96.58 | (aligned positions) |
| PID3 | 96.58 | (length shorter sequence) |
| PID4 | 96.17 | (average length of two sequences) |
gabarap_vector_ss <- Biostrings::AAStringSet(gabarap_vector)
gabarap_align <- msa(gabarap_vector_ss,
method = "ClustalW")
## use default substitution matrix
Clean the data and set up the MSA.
class(gabarap_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(gabarap_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"
## [3] "MsaMetaData" "MultipleAlignment"
Change the class of the alignment and convert it to seqinr format.
class(gabarap_align) <- "AAMultipleAlignment"
gabarap_align_seqinr <- msaConvert(gabarap_align,
type = "seqinr::alignment")
compbio4all::print_msa(gabarap_align_seqinr)
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHAFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFQYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFMYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKWEYKEEHPFEKRRAEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] " "
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGNSQR 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGDVLK 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGTD-- 0"
## [1] " "
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-M 58"
## [1] "EM 58"
## [1] "DV 58"
## [1] "-- 58"
## [1] " "
The full alignment was completed above and includes the whole sequence. To present this MSA, we will include the first 60 amino acids of the protein so that we are able to read it. GABARAP, while a relatively short protein, contains several different regions, which were conserved across many of the sequences. The first 60 amino acids contains a piece of each of these regions and therefore the display of this MSA includes the first 60 amino acids.
NOTE: The ggmsa function was causing major problems (crashing R) while trying to run the entire code so it has been commented out, however the code should be correct and it should be able to run if ggmsa() is available.
# ggmsa::ggmsa(gabarap_align,
# start = 0,
# end = 60)
Calculate the genetic distance in the alignment.
gabarap_dist <- seqinr::dist.alignment(gabarap_align_seqinr,
matrix = "identity")
Check the data types that were produced.
is(gabarap_dist)
## [1] "dist" "oldClass"
class(gabarap_dist)
## [1] "dist"
Round the distance matrix values to clean up the display.
gabarap_dist_rnd <- round(gabarap_dist, 3)
gabarap_dist_rnd
## NP_009209.1 XP_003315392.1 XP_536616.2 NP_001029220.1
## XP_003315392.1 0.000
## XP_536616.2 0.000 0.000
## NP_001029220.1 0.000 0.000 0.000
## NP_062723.1 0.000 0.000 0.000 0.000
## NP_742033.1 0.000 0.000 0.000 0.000
## NP_001084530.1 0.131 0.131 0.131 0.131
## XP_004079842.1 0.160 0.160 0.160 0.160
## NP_001013278.1 0.160 0.160 0.160 0.160
## XP_031568314.1 0.161 0.161 0.161 0.161
## NP_062723.1 NP_742033.1 NP_001084530.1 XP_004079842.1
## XP_003315392.1
## XP_536616.2
## NP_001029220.1
## NP_062723.1
## NP_742033.1 0.000
## NP_001084530.1 0.131 0.131
## XP_004079842.1 0.160 0.160 0.160
## NP_001013278.1 0.160 0.160 0.185 0.240
## XP_031568314.1 0.161 0.161 0.186 0.225
## NP_001013278.1
## XP_003315392.1
## XP_536616.2
## NP_001029220.1
## NP_062723.1
## NP_742033.1
## NP_001084530.1
## XP_004079842.1
## NP_001013278.1
## XP_031568314.1 0.225
Use the distance matrix to build a phylogenetic tree for these 10 sequences using the neighbor-joining nj() algorithm.
tree_seqs <- nj(gabarap_dist)
There are a few different ways to plot a phylogenetic tree. Upon creating the phylogenetics trees with and varying the root and branch lengths, the clearest visual was given by a rooted phylogenetic tree without branch lengths. these settings can be easily manipulated to produce trees by including additional arguments, thus the unrooted trees / trees with branch lengths are not included in this code.
plot.phylo(tree_seqs, main="Phylogenetic Tree\n",
use.edge.length = F)
mtext(text = "GABARAP family gene tree - rooted, no branch lengths")