Introduction

The gene and the protein it encodes which are being analyzed in this script is called GABARAP, which stands for gamma-aminobutyric acid receptor-associated protein. This protein is found in Homo Sapiens and many other species. In this script, I will be fetching information from various databases of the GABARA1 gene and associated protein, completing pairwise and multiple alignments with other species, and producing phylogenetic trees for visualization of the relatedness between the sequences.

Resources and References

Much of the important information used to analyze this protein within this script was obtained from NCBI databases. This includes:
- RefSeq (protein page): https://www.ncbi.nlm.nih.gov/protein/NP_009209.1
- RefSeq (Homologene): https://www.ncbi.nlm.nih.gov/homologene/134119
References:
- https://rpubs.com/lowbrowR/drawProtein
- https://rpubs.com/lowbrowR/843543
- https://rpubs.com/lowbrowR/844554

Preparation and Packages

# GitHub packages
library(devtools)
# devtools::install_github("brouwern/compbio4all")
library(compbio4all)
# devtools::install_github("YuLab-SMU/ggmsa")
# library(ggmsa)

# CRAN packages

library(rentrez)
library(ape)
library(pander)
library(seqinr)
## 
## Attaching package: 'seqinr'
## The following objects are masked from 'package:ape':
## 
##     as.alignment, consensus
library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:ape':
## 
##     rotate
library(ggplot2)


# Bioconductor packages
library(BiocManager)
## 
## Attaching package: 'BiocManager'
## The following object is masked from 'package:devtools':
## 
##     install
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
## 
##     expand.grid
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:seqinr':
## 
##     translate
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:base':
## 
##     strsplit
library(drawProteins)
library(msa)
## 
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
## 
##     version

Accession Numbers

These accession numbers were found using RefSeq, RefSeq Homologene, UniProt, and PDB. Neither Homo neanderthalensis nor Drosophila entries were found after thorough searching of these databases. The final three entries in the table were obtained via a protein BLAST search. First, vertebrates were excluded, and there were 2 results and the gene existed in invertebrates. One of these with sufficient data was included. Mammals were then excluded from the BLAST search which yielded many results.

# RefSeq, UniProt, PDB, scientific name, common name, gene name
ref_seq <- c('NP_009209.1', 'XP_003315392.1', 'XP_536616.2', 'NP_001029220.1', 'NP_062723.1', 'NP_742033.1', 'NP_001013278.1', 'XP_031568314.1', 'NP_001084530.1', 'XP_004079842.1')
uniprot <- c('O95166', 'H2QC26', 'F1Q2L2', 'Q9GJW7', 'Q9DCD6', 'P60517', 'Q6PSS4', 'A0A6P8IN80', 'Q6NUG7', 'A0A3P9HG37')
pdb <- c('1GNU', 'NA', 'NA', 'NA', '5YIR', '1KJT', 'NA', 'NA', 'NA', 'NA')
sci_name <- c('Homo sapiens', 'Pan troglodytes', 'Canis lupus', 'Bos taurus', 'Mus musculus', 'Rattus norvegicus', 'Danio rerio', 'Actinia tenebrosa', 'Xenopus laevis', 'Oryzias latipes')
common_name <- c('Human', 'Chimpanzee', 'Dog', 'Bovine', 'Mouse', 'Rat', 'Zebrafish', 'Australian red waratah sea anemone', 'African clawed frog', 'Japanese rice rish')
gene_name <- c('GABARAP', 'GABARAP', 'GABARAP', 'GABARAP', 'Gabarap', 'Gabarap', 'gabarap', 'LOC116303016', 'gabarap.L', 'NA')

Convert this vector information into a data frame to produce a table.

gabarap_table <- data.frame(NCBI.Accession = ref_seq,
                            UniProt.ID = uniprot,
                            PDB.ID = pdb,
                            ScientificName = sci_name,
                            CommonName = common_name,
                            GeneName = gene_name)
pander::pander(gabarap_table)
Table continues below
NCBI.Accession UniProt.ID PDB.ID ScientificName
NP_009209.1 O95166 1GNU Homo sapiens
XP_003315392.1 H2QC26 NA Pan troglodytes
XP_536616.2 F1Q2L2 NA Canis lupus
NP_001029220.1 Q9GJW7 NA Bos taurus
NP_062723.1 Q9DCD6 5YIR Mus musculus
NP_742033.1 P60517 1KJT Rattus norvegicus
NP_001013278.1 Q6PSS4 NA Danio rerio
XP_031568314.1 A0A6P8IN80 NA Actinia tenebrosa
NP_001084530.1 Q6NUG7 NA Xenopus laevis
XP_004079842.1 A0A3P9HG37 NA Oryzias latipes
CommonName GeneName
Human GABARAP
Chimpanzee GABARAP
Dog GABARAP
Bovine GABARAP
Mouse Gabarap
Rat Gabarap
Zebrafish gabarap
Australian red waratah sea anemone LOC116303016
African clawed frog gabarap.L
Japanese rice rish NA

Data Preparation

Protein sequences were retrieved from NCBI databases using the compbio4all::entrez_fetch_list() function, which uses the function rentrez::entrez_fetch().

gabarap_list <- entrez_fetch_list(db = "protein",
                          id = gabarap_table$NCBI.Accession,
                          rettype = "fasta")

Initial Data Cleaning

Check the length of the list and remove the FASTA header. Check the first entry in the list to make sure it contains only the protein sequence.

length(gabarap_list)
## [1] 10
for(i in 1:length(gabarap_list)){
  gabarap_list[[i]] <- compbio4all::fasta_cleaner(gabarap_list[[i]], parse = F)
}

gabarap_list[[1]]
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"

General Protein Information

Protein Diagram

Use the UniProt accession number for the human sequence to generate a protein diagram indicating interesting regions of the protein.

O95166_id <- drawProteins::get_features(uniprot[1])
## [1] "Download has worked"

Convert the vector to a dataframe.

is(O95166_id)
## [1] "list"             "vector"           "list_OR_List"    
## [4] "vector_OR_factor"
my_prot_df <- drawProteins::feature_to_dataframe(O95166_id)

This creates a visualization which shows the regions where secondary structures can be identified, and shows that there are three different regions defined by the interaction GABARAP has with proteins. I included all of these aspects because they are interesting and the visual is informative. It shows what the structure is in specific regions where the protein is known to interact with other proteins, suggesting information about its function.

my_prot_df <- drawProteins::feature_to_dataframe(O95166_id)

my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)

my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_motif(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas <- draw_repeat(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas

my_prot_df
##                    type
## featuresTemp      CHAIN
## featuresTemp.1   PROPEP
## featuresTemp.2   REGION
## featuresTemp.3   REGION
## featuresTemp.4   REGION
## featuresTemp.5     SITE
## featuresTemp.6    LIPID
## featuresTemp.7  MUTAGEN
## featuresTemp.8  MUTAGEN
## featuresTemp.9  MUTAGEN
## featuresTemp.10 MUTAGEN
## featuresTemp.11 MUTAGEN
## featuresTemp.12 MUTAGEN
## featuresTemp.13   HELIX
## featuresTemp.14   HELIX
## featuresTemp.15  STRAND
## featuresTemp.16  STRAND
## featuresTemp.17   HELIX
## featuresTemp.18  STRAND
## featuresTemp.19  STRAND
## featuresTemp.20   HELIX
## featuresTemp.21  STRAND
## featuresTemp.22  STRAND
##                                                                                                                              description
## featuresTemp                                                                         Gamma-aminobutyric acid receptor-associated protein
## featuresTemp.1                                                                                                    Removed in mature form
## featuresTemp.2                                                                                             Interaction with beta-tubulin
## featuresTemp.3                                                                                                     Interaction with GPHN
## featuresTemp.4                                                                                                   Interaction with GABRG2
## featuresTemp.5                                                                                                        Cleavage; by ATG4B
## featuresTemp.6                                                                                 Phosphatidylethanolamine amidated glycine
## featuresTemp.7  No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with H-25 and H-54.
## featuresTemp.8  No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with Q-24 and H-54.
## featuresTemp.9                                                                                         Inhibits interaction with TECPR2.
## featuresTemp.10 No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with Q-24 and H-25.
## featuresTemp.11                                                                                    No effect on interaction with TECPR2.
## featuresTemp.12                                                                     Impairs localization at the autophagosomal membrane.
## featuresTemp.13                                                                                                                     NONE
## featuresTemp.14                                                                                                                     NONE
## featuresTemp.15                                                                                                                     NONE
## featuresTemp.16                                                                                                                     NONE
## featuresTemp.17                                                                                                                     NONE
## featuresTemp.18                                                                                                                     NONE
## featuresTemp.19                                                                                                                     NONE
## featuresTemp.20                                                                                                                     NONE
## featuresTemp.21                                                                                                                     NONE
## featuresTemp.22                                                                                                                     NONE
##                 begin end length accession   entryName taxid order
## featuresTemp        1 116    115    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.1    117 117      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.2      1  22     21    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.3     36 117     81    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.4     36  68     32    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.5    116 117      1    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.6    116 116      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.7     24  24      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.8     25  25      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.9     49  50      1    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.10    54  54      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.11    67  67      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.12   116 116      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.13     4   8      4    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.14    11  24     13    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.15    28  35      7    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.16    48  52      4    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.17    57  67     10    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.18    77  80      3    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.19    87  90      3    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.20    91  98      7    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.21   101 103      2    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.22   105 114      9    O95166 GBRAP_HUMAN  9606     1

Draw Dot Plot

Data preparation

Access the human sequence from the list and make sure it is in the proper format with each character separated in the vector.

gabarap_human_vector <- fasta_cleaner(gabarap_list[1])
gabarap_human_vector
##   [1] "M" "K" "F" "V" "Y" "K" "E" "E" "H" "P" "F" "E" "K" "R" "R" "S" "E"
##  [18] "G" "E" "K" "I" "R" "K" "K" "Y" "P" "D" "R" "V" "P" "V" "I" "V" "E"
##  [35] "K" "A" "P" "K" "A" "R" "I" "G" "D" "L" "D" "K" "K" "K" "Y" "L" "V"
##  [52] "P" "S" "D" "L" "T" "V" "G" "Q" "F" "Y" "F" "L" "I" "R" "K" "R" "I"
##  [69] "H" "L" "R" "A" "E" "D" "A" "L" "F" "F" "F" "V" "N" "N" "V" "I" "P"
##  [86] "P" "T" "S" "A" "T" "M" "G" "Q" "L" "Y" "Q" "E" "H" "H" "E" "E" "D"
## [103] "F" "F" "L" "Y" "I" "A" "Y" "S" "D" "E" "S" "V" "Y" "G" "L"

Use the human protein sequences to create a dot plot.

dotPlot(gabarap_human_vector, gabarap_human_vector,
        main = "Human GABARAP Dot Plot")

Create a 2x2 panel which shows the different values for dot plot settings.

par(mfrow = c(2,2),
    mar = c(0,0,2,1))

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 10,
        nmatch = 5,
        main = "wsize = 10, nmatch = 5")

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 15,
        nmatch = 3, 
        main = "wsize = 15, nmatch = 3")

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 6,
        nmatch = 2, 
        main = "wsize = 5, nmatch = 1")

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 20,
        nmatch = 5, 
        main = "wsize = 20, nmatch = 5")

par(mfrow = c(1,1),
    mar = c(4,4,4,4))

We can include arguments to adjust the settings of a dot plot to present the data in an interesting way.

dotPlot(gabarap_human_vector,
        gabarap_human_vector,
        wsize = 7,
        nmatch = 2,
        main = "Human GABARAP Dot Plot: wsize = 7, nmatch = 2")

Protein Properties Compiled from Databases

No information could be found about this protein in DisProt or RepeatsDB.

source <- c("PFAM", "UniProt", "PDB", "AlphaFold")

prot_property <- c("There is one domain, labeled ATG8, spanning from AA position 13-116, which is the majority of the 117-amino acid length protein. It is an autophagy protein domain.",
                   "GABARAP is associated with intracellular membrane structures, and is primarily found in the cytoskeleton and Golgi apparatus membrane.",
                   "The PDB structure shows that the protein contains both alpha helices and beta sheets in its secondary structure. Beta turns and omega loops can also be observed in the structure.",
                   "AlphaFold shows a similar structural summary as PDB, with alpha helices, beta sheets, beta turns, and omega loops present. The structure is predicted with high confidence.")

link <- c("http://pfam.xfam.org/protein/O95166",
          "https://www.uniprot.org/uniprot/O95166",
          "https://www.rcsb.org/structure/1gnu",
          "https://alphafold.ebi.ac.uk/entry/O95166")

properties_table <- data.frame(Source = source,
                               ProteinProperty = prot_property,
                               Link = link)
pander::pander(properties_table)
Table continues below
Source ProteinProperty
PFAM There is one domain, labeled ATG8, spanning from AA position 13-116, which is the majority of the 117-amino acid length protein. It is an autophagy protein domain.
UniProt GABARAP is associated with intracellular membrane structures, and is primarily found in the cytoskeleton and Golgi apparatus membrane.
PDB The PDB structure shows that the protein contains both alpha helices and beta sheets in its secondary structure. Beta turns and omega loops can also be observed in the structure.
AlphaFold AlphaFold shows a similar structural summary as PDB, with alpha helices, beta sheets, beta turns, and omega loops present. The structure is predicted with high confidence.
Link
http://pfam.xfam.org/protein/O95166
https://www.uniprot.org/uniprot/O95166
https://www.rcsb.org/structure/1gnu
https://alphafold.ebi.ac.uk/entry/O95166

Protein Feature Prediction

UniProt shows that the GABARAP protein is associated with intracellular membrane structures, particularly the Golgi Apparatus membrane, as well as the cytoskeleton. Link: https://www.uniprot.org/uniprot/O95166\n

The code used to set up this prediction was obtained here: https://rpubs.com/lowbrowR/843543. The information used was adapted from Chou and Zhang, 1994.

Make amino acid name vectors.

# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# are they the same length?
length(aa.1.1) == length(aa.1.2)
## [1] TRUE
aa.1.1 == aa.1.2
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE
unique(aa.1.1)
##  [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T"
## [18] "W" "Y" "V"
length(unique(aa.1.1)) == length(unique(aa.1.2))
## [1] TRUE
any(c(aa.1.1 == aa.1.2) == FALSE)
## [1] FALSE

This is an input of data from the table from Chou and Zhang 1994.

# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)

# check against chou's total
sum(alpha) == 2447
## [1] TRUE
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)

# check against chou's total
sum(beta) == 2776
## [1] TRUE

Logical checks are done to make sure that data entry was successful.

a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889
## [1] TRUE
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333
## [1] TRUE

Calculate the proportion of each of the four protein fold types.

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)

Create a dataframe with the proportion information.

aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
row.names(aa.prop) <- aa.1.1
pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724

Now we can use this information to determine the frequency of amino acids encoded by the GABARAP gene.

gabarap_freq_table <- table(gabarap_human_vector)/length(gabarap_human_vector)
gabarap_freq_table
## gabarap_human_vector
##          A          D          E          F          G          H 
## 0.05128205 0.05982906 0.09401709 0.07692308 0.04273504 0.03418803 
##          I          K          L          M          N          P 
## 0.05982906 0.10256410 0.07692308 0.01709402 0.01709402 0.05982906 
##          Q          R          S          T          V          Y 
## 0.02564103 0.06837607 0.04273504 0.02564103 0.07692308 0.06837607

Convert this table into a vector.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}
gabarap_human_aa_freq <- table_to_vector(gabarap_freq_table)
gabarap_human_aa_freq
##          A          D          E          F          G          H 
## 0.05128205 0.05982906 0.09401709 0.07692308 0.04273504 0.03418803 
##          I          K          L          M          N          P 
## 0.05982906 0.10256410 0.07692308 0.01709402 0.01709402 0.05982906 
##          Q          R          S          T          V          Y 
## 0.02564103 0.06837607 0.04273504 0.02564103 0.07692308 0.06837607
gabarap_human_aa_freq <- c(A=0.05128205, C = 0, D = 0.05982906, E = 0.09401709, F = 0.07692308, G = 0.04173504, H = 0.03418803, I = 0.05982906, K = 0.10256410, L = 0.07692308, M = 0.01709402, N = 0.01709402, P = 0.05982906, Q = 0.02564103, R = 0.06837607, S = 0.04273504, T = 0.02654103, V = 0.07692308, W = 0, Y = 0.06837607)

Check if “U” is present in the protein.

aa.names <- names(gabarap_human_aa_freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## character(0)
gabarap_human_aa_freq[i.U]
## named numeric(0)

Use the GABARAP frequency information and add it to the amino acid frequency table. NOTE: GABARAP does not have two amino acids that are in the table and the column could not be appended. I created the table manually so that I would be able to run the code.

aa.prop$gabarap_human_aa_freq <- gabarap_human_aa_freq
pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b gabarap_human_aa_freq
A 0.1165 0.07313 0.09264 0.08331 0.05128
R 0.02166 0.02414 0.04129 0.03369 0
N 0.03964 0.05007 0.06353 0.04223 0.05983
D 0.06661 0.04359 0.05876 0.05631 0.09402
C 0.008991 0.02702 0.03917 0.01454 0.07692
Q 0.02738 0.04395 0.03917 0.02631 0.04174
E 0.05476 0.03098 0.04553 0.05931 0.03419
G 0.08051 0.107 0.09052 0.08701 0.05983
H 0.04536 0.01765 0.01747 0.02469 0.1026
I 0.03719 0.04323 0.04923 0.05516 0.07692
L 0.09031 0.06376 0.05823 0.07824 0.01709
K 0.1018 0.04143 0.05929 0.07408 0.01709
M 0.01962 0.005764 0.01323 0.021 0.05983
F 0.05027 0.03062 0.02753 0.03646 0.02564
P 0.03351 0.04575 0.03759 0.04339 0.06838
S 0.04986 0.1228 0.0667 0.07547 0.04274
T 0.04863 0.09114 0.06194 0.05493 0.02654
W 0.01349 0.01585 0.01588 0.01662 0.07692
Y 0.02575 0.03963 0.05717 0.03 0
V 0.06825 0.08249 0.06511 0.08724 0.06838

Calculating Similarities

Use these two custom functions to calculate correlation.

# Corrleation used in Chou and Zhang 1992.
chou_corr <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

Calculate the correlation in each column.

corr.alpha <- chou_corr(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_corr(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_corr(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_corr(aa.prop[,5], aa.prop[,4])

Calculate the cosine similarity.

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate the distance.

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
##                          A    R    N    D    C    Q    E    G    H    I
## alpha.prop            0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04
## beta.prop             0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04
## a.plus.b.prop         0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05
## a.div.b               0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06
## gabarap_human_aa_freq 0.05 0.00 0.06 0.09 0.08 0.04 0.03 0.06 0.10 0.08
##                          L    K    M    F    P    S    T    W    Y    V
## alpha.prop            0.09 0.10 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop             0.06 0.04 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop         0.06 0.06 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b               0.08 0.07 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## gabarap_human_aa_freq 0.02 0.02 0.06 0.03 0.07 0.04 0.03 0.08 0.00 0.07
dist(aa.prop.flipped, method = "euclidean")
##                       alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop             0.13342098                                    
## a.plus.b.prop         0.09281824 0.08289406                         
## a.div.b               0.06699039 0.08659174    0.06175113           
## gabarap_human_aa_freq 0.19389115 0.19771646    0.17495923 0.17870305

Calculate the individual distances.

dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
fold_name <- c("alpha", "beta", "alpha + beta", "alpha / beta")

correlation_sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine_sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
euc_dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

sim_summary <- c("","","max sim", "")
dist_summary <- c("","","min sum","")

fold_df <- data.frame(fold_name,
                      correlation_sim,
                      cosine_sim,
                      euc_dist,
                      sim_summary,
                      dist_summary)

pander::pander(fold_df)
Table continues below
fold_name correlation_sim cosine_sim euc_dist sim_summary
alpha 0.7178 0.7178 0.1939
beta 0.7104 0.7104 0.1977
alpha + beta 0.7577 0.7577 0.175 max sim
alpha / beta 0.751 0.751 0.1787
dist_summary

min sum


Percent Identity Comparisons (PID)

Data Preparation

List was FASTA cleaned in an earlier code chunk

names(gabarap_list)
##  [1] "NP_009209.1"    "XP_003315392.1" "XP_536616.2"    "NP_001029220.1"
##  [5] "NP_062723.1"    "NP_742033.1"    "NP_001013278.1" "XP_031568314.1"
##  [9] "NP_001084530.1" "XP_004079842.1"
length(gabarap_list)
## [1] 10
for(i in 1:length(gabarap_list)){
  gabarap_list[[i]] <- compbio4all::fasta_cleaner(gabarap_list[[i]], parse = F)
}

gabarap_list[[1]]
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"

Convert each entry of the list into a vector.

gabarap_vector <- rep(NA, length(gabarap_list))


for(i in 1:length(gabarap_vector)){
  gabarap_vector[i] <- gabarap_list[[i]]
}

names(gabarap_vector) <- names(gabarap_list)
gabarap_vector
##                                                                                                                  NP_009209.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                               XP_003315392.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                                  XP_536616.2 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                               NP_001029220.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                                  NP_062723.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                                  NP_742033.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                               NP_001013278.1 
## "MKFMYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGDVLKDV" 
##                                                                                                               XP_031568314.1 
##     "MKWEYKEEHPFEKRRAEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGTD" 
##                                                                                                               NP_001084530.1 
##      "MKFVYKEEHAFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGM" 
##                                                                                                               XP_004079842.1 
## "MKFQYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGNSQREM"

Perform alignments between selected species: human, chimp, sea anemone, and fish.

align.01.02 <- Biostrings::pairwiseAlignment(
                  gabarap_vector[1],
                  gabarap_vector[2])

align.01.08 <- Biostrings::pairwiseAlignment(
                  gabarap_vector[1],
                  gabarap_vector[8])

align.01.10 <- Biostrings::pairwiseAlignment(
                  gabarap_vector[1],
                  gabarap_vector[10])

align.02.08 <- Biostrings::pairwiseAlignment(
  gabarap_vector[2],
  gabarap_vector[8]
)

align.02.10 <- Biostrings::pairwiseAlignment(
  gabarap_vector[2],
  gabarap_vector[10]
)

align.08.10 <- Biostrings::pairwiseAlignment(
  gabarap_vector[8],
  gabarap_vector[10]
)

Calculate the percent identities between human with chimp, sea anemone, and fish.

Biostrings::pid(align.01.02)
## [1] 100
Biostrings::pid(align.01.08)
## [1] 96.5812
Biostrings::pid(align.01.10)
## [1] 97.4359

Build a percent identity matrix with these 4 species.

pids <- c(1,                  NA,     NA,     NA,
          pid(align.01.02),          1,     NA,     NA,
          pid(align.01.08), pid(align.02.08),      1,     NA,
          pid(align.01.10), pid(align.02.10), pid(align.08.10), 1)

pid_mat <- matrix(pids, nrow = 4, byrow = T)
row.names(pid_mat) <- c("Homo","Pan","Sea anemone","Fish")  
colnames(pid_mat) <- c("Homo","Pan","Sea anemone","Fish")  
pander::pander(pid_mat)
  Homo Pan Sea anemone Fish
Homo 1 NA NA NA
Pan 100 1 NA NA
Sea anemone 96.58 96.58 1 NA
Fish 97.44 97.44 94.92 1

Different methods can be specified to calculate the percent identity between two species. We will use this method for the human and sea anemone alignment. I chose to use this alignment rather than that with the chimp because there was a 100% PID between human and chimp, thus the data might not be as interesting.

Biostrings::pid(align.01.08, type = "PID1")
## [1] 96.5812
Biostrings::pid(align.01.08, type = "PID2")
## [1] 96.5812
Biostrings::pid(align.01.08, type = "PID3")
## [1] 96.5812
Biostrings::pid(align.01.08, type = "PID4")
## [1] 96.17021

Create a table to compare these different methods of calculating percent identity.

method <- c("PID1", "PID2", "PID3", "PID4")
pid.01.08 <- c(Biostrings::pid(align.01.08, type = "PID1"),
Biostrings::pid(align.01.08, type = "PID2"),
Biostrings::pid(align.01.08, type = "PID3"),
Biostrings::pid(align.01.08, type = "PID4")
)

denominator <- c("(aligned positions + internal gap positions)",
                 "(aligned positions)",
                 "(length shorter sequence)",
                 "(average length of two sequences)")

pid_table <- data.frame(Method = method,
                        PID = pid.01.08,
                        Denominator = denominator)
pander::pander(pid_table)
Method PID Denominator
PID1 96.58 (aligned positions + internal gap positions)
PID2 96.58 (aligned positions)
PID3 96.58 (length shorter sequence)
PID4 96.17 (average length of two sequences)

Multiple Sequence Alignment

MSA Data Preparation

gabarap_vector_ss <- Biostrings::AAStringSet(gabarap_vector)

Build the Multiple Sequence Alignment (MSA)

gabarap_align <- msa(gabarap_vector_ss,
                     method = "ClustalW")
## use default substitution matrix

Clean the data and set up the MSA.

class(gabarap_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(gabarap_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"   
## [3] "MsaMetaData"            "MultipleAlignment"

Change the class of the alignment and convert it to seqinr format.

class(gabarap_align) <- "AAMultipleAlignment"
gabarap_align_seqinr <- msaConvert(gabarap_align,
                                   type = "seqinr::alignment")
compbio4all::print_msa(gabarap_align_seqinr)
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHAFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFQYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFMYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKWEYKEEHPFEKRRAEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] " "
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGNSQR 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGDVLK 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGTD-- 0"
## [1] " "
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-M 58"
## [1] "EM 58"
## [1] "DV 58"
## [1] "-- 58"
## [1] " "

Completed MSA

The full alignment was completed above and includes the whole sequence. To present this MSA, we will include the first 60 amino acids of the protein so that we are able to read it. GABARAP, while a relatively short protein, contains several different regions, which were conserved across many of the sequences. The first 60 amino acids contains a piece of each of these regions and therefore the display of this MSA includes the first 60 amino acids.

NOTE: The ggmsa function was causing major problems (crashing R) while trying to run the entire code so it has been commented out, however the code should be correct and it should be able to run if ggmsa() is available.

# ggmsa::ggmsa(gabarap_align,  
#      start = 0,
#      end = 60)

Distance Matrix

Calculate the genetic distance in the alignment.

gabarap_dist <- seqinr::dist.alignment(gabarap_align_seqinr,
                                       matrix = "identity")

Check the data types that were produced.

is(gabarap_dist)
## [1] "dist"     "oldClass"
class(gabarap_dist)
## [1] "dist"

Round the distance matrix values to clean up the display.

gabarap_dist_rnd <- round(gabarap_dist, 3)
gabarap_dist_rnd
##                NP_009209.1 XP_003315392.1 XP_536616.2 NP_001029220.1
## XP_003315392.1       0.000                                          
## XP_536616.2          0.000          0.000                           
## NP_001029220.1       0.000          0.000       0.000               
## NP_062723.1          0.000          0.000       0.000          0.000
## NP_742033.1          0.000          0.000       0.000          0.000
## NP_001084530.1       0.131          0.131       0.131          0.131
## XP_004079842.1       0.160          0.160       0.160          0.160
## NP_001013278.1       0.160          0.160       0.160          0.160
## XP_031568314.1       0.161          0.161       0.161          0.161
##                NP_062723.1 NP_742033.1 NP_001084530.1 XP_004079842.1
## XP_003315392.1                                                      
## XP_536616.2                                                         
## NP_001029220.1                                                      
## NP_062723.1                                                         
## NP_742033.1          0.000                                          
## NP_001084530.1       0.131       0.131                              
## XP_004079842.1       0.160       0.160          0.160               
## NP_001013278.1       0.160       0.160          0.185          0.240
## XP_031568314.1       0.161       0.161          0.186          0.225
##                NP_001013278.1
## XP_003315392.1               
## XP_536616.2                  
## NP_001029220.1               
## NP_062723.1                  
## NP_742033.1                  
## NP_001084530.1               
## XP_004079842.1               
## NP_001013278.1               
## XP_031568314.1          0.225

Phylogenetic Trees

Use the distance matrix to build a phylogenetic tree for these 10 sequences using the neighbor-joining nj() algorithm.

tree_seqs <- nj(gabarap_dist)

There are a few different ways to plot a phylogenetic tree. Upon creating the phylogenetics trees with and varying the root and branch lengths, the clearest visual was given by a rooted phylogenetic tree without branch lengths. these settings can be easily manipulated to produce trees by including additional arguments, thus the unrooted trees / trees with branch lengths are not included in this code.

plot.phylo(tree_seqs, main="Phylogenetic Tree\n",
            use.edge.length = F)

mtext(text = "GABARAP family gene tree - rooted, no branch lengths")