Preparation and Packages

# GitHub packages
library(devtools)
# devtools::install_github("brouwern/compbio4all")
library(compbio4all)
# devtools::install_github("YuLab-SMU/ggmsa")
# library(ggmsa)

# CRAN packages

library(rentrez)
library(ape)
library(pander)
library(seqinr)

## 
## Attaching package: 'seqinr'

## The following objects are masked from 'package:ape':
## 
##     as.alignment, consensus

library(ggpubr)

## Loading required package: ggplot2

## Loading required package: magrittr

## 
## Attaching package: 'ggpubr'

## The following object is masked from 'package:ape':
## 
##     rotate

library(ggplot2)


# Bioconductor packages
library(BiocManager)

## 
## Attaching package: 'BiocManager'

## The following object is masked from 'package:devtools':
## 
##     install

library(Biostrings)

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:base':
## 
##     expand.grid

## Loading required package: IRanges

## 
## Attaching package: 'IRanges'

## The following object is masked from 'package:grDevices':
## 
##     windows

## Loading required package: XVector

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:seqinr':
## 
##     translate

## The following object is masked from 'package:ape':
## 
##     complement

## The following object is masked from 'package:base':
## 
##     strsplit

library(drawProteins)
library(msa)

## 
## Attaching package: 'msa'

## The following object is masked from 'package:BiocManager':
## 
##     version

Accession Numbers

These accession numbers were found using RefSeq, RefSeq Homologene, UniProt, and PDB. Neither Homo neanderthalensis nor Drosophila entries were found after thorough searching of these databases. The final three entries in the table were obtained via a protein BLAST search. First, vertebrates were excluded, and there were 2 results and the gene existed in invertebrates. One of these with sufficient data was included. Mammals were then excluded from the BLAST search which yielded many results.

# RefSeq, UniProt, PDB, scientific name, common name, gene name
ref_seq <- c('NP_009209.1', 'XP_003315392.1', 'XP_536616.2', 'NP_001029220.1', 'NP_062723.1', 'NP_742033.1', 'NP_001013278.1', 'XP_031568314.1', 'NP_001084530.1', 'XP_004079842.1')
uniprot <- c('O95166', 'H2QC26', 'F1Q2L2', 'Q9GJW7', 'Q9DCD6', 'P60517', 'Q6PSS4', 'A0A6P8IN80', 'Q6NUG7', 'A0A3P9HG37')
pdb <- c('1GNU', 'NA', 'NA', 'NA', '5YIR', '1KJT', 'NA', 'NA', 'NA', 'NA')
sci_name <- c('Homo sapiens', 'Pan troglodytes', 'Canis lupus', 'Bos taurus', 'Mus musculus', 'Rattus norvegicus', 'Danio rerio', 'Actinia tenebrosa', 'Xenopus laevis', 'Oryzias latipes')
common_name <- c('Human', 'Chimpanzee', 'Dog', 'Bovine', 'Mouse', 'Rat', 'Zebrafish', 'Australian red waratah sea anemone', 'African clawed frog', 'Japanese rice rish')
gene_name <- c('GABARAP', 'GABARAP', 'GABARAP', 'GABARAP', 'Gabarap', 'Gabarap', 'gabarap', 'LOC116303016', 'gabarap.L', 'NA')

Convert this vector information into a data frame to produce a table.

gabarap_table <- data.frame(NCBI.Accession = ref_seq,
                            UniProt.ID = uniprot,
                            PDB.ID = pdb,
                            ScientificName = sci_name,
                            CommonName = common_name,
                            GeneName = gene_name)
pander::pander(gabarap_table)

Table continues below
NCBI.Accession	UniProt.ID	PDB.ID	ScientificName
NP_009209.1	O95166	1GNU	Homo sapiens
XP_003315392.1	H2QC26	NA	Pan troglodytes
XP_536616.2	F1Q2L2	NA	Canis lupus
NP_001029220.1	Q9GJW7	NA	Bos taurus
NP_062723.1	Q9DCD6	5YIR	Mus musculus
NP_742033.1	P60517	1KJT	Rattus norvegicus
NP_001013278.1	Q6PSS4	NA	Danio rerio
XP_031568314.1	A0A6P8IN80	NA	Actinia tenebrosa
NP_001084530.1	Q6NUG7	NA	Xenopus laevis
XP_004079842.1	A0A3P9HG37	NA	Oryzias latipes

CommonName	GeneName
Human	GABARAP
Chimpanzee	GABARAP
Dog	GABARAP
Bovine	GABARAP
Mouse	Gabarap
Rat	Gabarap
Zebrafish	gabarap
Australian red waratah sea anemone	LOC116303016
African clawed frog	gabarap.L
Japanese rice rish	NA

Data Preparation

Protein sequences were retrieved from NCBI databases using the compbio4all::entrez_fetch_list() function, which uses the function rentrez::entrez_fetch().

gabarap_list <- entrez_fetch_list(db = "protein",
                          id = gabarap_table$NCBI.Accession,
                          rettype = "fasta")

Initial Data Cleaning

Check the length of the list and remove the FASTA header. Check the first entry in the list to make sure it contains only the protein sequence.

length(gabarap_list)

## [1] 10

for(i in 1:length(gabarap_list)){
  gabarap_list[[i]] <- compbio4all::fasta_cleaner(gabarap_list[[i]], parse = F)
}

gabarap_list[[1]]

## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"

General Protein Information

Protein Diagram

Use the UniProt accession number for the human sequence to generate a protein diagram indicating interesting regions of the protein.

O95166_id <- drawProteins::get_features(uniprot[1])

## [1] "Download has worked"

Convert the vector to a dataframe.

is(O95166_id)

## [1] "list"             "vector"           "list_OR_List"    
## [4] "vector_OR_factor"

my_prot_df <- drawProteins::feature_to_dataframe(O95166_id)

This creates a visualization which shows the regions where secondary structures can be identified, and shows that there are three different regions defined by the interaction GABARAP has with proteins. I included all of these aspects because they are interesting and the visual is informative. It shows what the structure is in specific regions where the protein is known to interact with other proteins, suggesting information about its function.

my_prot_df <- drawProteins::feature_to_dataframe(O95166_id)

my_canvas <- draw_canvas(my_prot_df)
my_canvas <- draw_chains(my_canvas, my_prot_df, label_size = 2.5)

my_canvas <- draw_regions(my_canvas, my_prot_df)
my_canvas <- draw_motif(my_canvas, my_prot_df)
my_canvas <- draw_phospho(my_canvas, my_prot_df)
my_canvas <- draw_repeat(my_canvas, my_prot_df)
my_canvas <- draw_recept_dom(my_canvas, my_prot_df)
my_canvas <- draw_folding(my_canvas, my_prot_df)
my_canvas

my_prot_df

##                    type
## featuresTemp      CHAIN
## featuresTemp.1   PROPEP
## featuresTemp.2   REGION
## featuresTemp.3   REGION
## featuresTemp.4   REGION
## featuresTemp.5     SITE
## featuresTemp.6    LIPID
## featuresTemp.7  MUTAGEN
## featuresTemp.8  MUTAGEN
## featuresTemp.9  MUTAGEN
## featuresTemp.10 MUTAGEN
## featuresTemp.11 MUTAGEN
## featuresTemp.12 MUTAGEN
## featuresTemp.13   HELIX
## featuresTemp.14   HELIX
## featuresTemp.15  STRAND
## featuresTemp.16  STRAND
## featuresTemp.17   HELIX
## featuresTemp.18  STRAND
## featuresTemp.19  STRAND
## featuresTemp.20   HELIX
## featuresTemp.21  STRAND
## featuresTemp.22  STRAND
##                                                                                                                              description
## featuresTemp                                                                         Gamma-aminobutyric acid receptor-associated protein
## featuresTemp.1                                                                                                    Removed in mature form
## featuresTemp.2                                                                                             Interaction with beta-tubulin
## featuresTemp.3                                                                                                     Interaction with GPHN
## featuresTemp.4                                                                                                   Interaction with GABRG2
## featuresTemp.5                                                                                                        Cleavage; by ATG4B
## featuresTemp.6                                                                                 Phosphatidylethanolamine amidated glycine
## featuresTemp.7  No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with H-25 and H-54.
## featuresTemp.8  No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with Q-24 and H-54.
## featuresTemp.9                                                                                         Inhibits interaction with TECPR2.
## featuresTemp.10 No effect on WDFY3-binding. Impaired WDFY3-binding, but no effect on SQSTM1-binding; when associated with Q-24 and H-25.
## featuresTemp.11                                                                                    No effect on interaction with TECPR2.
## featuresTemp.12                                                                     Impairs localization at the autophagosomal membrane.
## featuresTemp.13                                                                                                                     NONE
## featuresTemp.14                                                                                                                     NONE
## featuresTemp.15                                                                                                                     NONE
## featuresTemp.16                                                                                                                     NONE
## featuresTemp.17                                                                                                                     NONE
## featuresTemp.18                                                                                                                     NONE
## featuresTemp.19                                                                                                                     NONE
## featuresTemp.20                                                                                                                     NONE
## featuresTemp.21                                                                                                                     NONE
## featuresTemp.22                                                                                                                     NONE
##                 begin end length accession   entryName taxid order
## featuresTemp        1 116    115    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.1    117 117      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.2      1  22     21    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.3     36 117     81    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.4     36  68     32    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.5    116 117      1    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.6    116 116      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.7     24  24      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.8     25  25      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.9     49  50      1    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.10    54  54      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.11    67  67      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.12   116 116      0    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.13     4   8      4    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.14    11  24     13    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.15    28  35      7    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.16    48  52      4    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.17    57  67     10    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.18    77  80      3    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.19    87  90      3    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.20    91  98      7    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.21   101 103      2    O95166 GBRAP_HUMAN  9606     1
## featuresTemp.22   105 114      9    O95166 GBRAP_HUMAN  9606     1

Draw Dot Plot

Data preparation

Access the human sequence from the list and make sure it is in the proper format with each character separated in the vector.

gabarap_human_vector <- fasta_cleaner(gabarap_list[1])
gabarap_human_vector

##   [1] "M" "K" "F" "V" "Y" "K" "E" "E" "H" "P" "F" "E" "K" "R" "R" "S" "E"
##  [18] "G" "E" "K" "I" "R" "K" "K" "Y" "P" "D" "R" "V" "P" "V" "I" "V" "E"
##  [35] "K" "A" "P" "K" "A" "R" "I" "G" "D" "L" "D" "K" "K" "K" "Y" "L" "V"
##  [52] "P" "S" "D" "L" "T" "V" "G" "Q" "F" "Y" "F" "L" "I" "R" "K" "R" "I"
##  [69] "H" "L" "R" "A" "E" "D" "A" "L" "F" "F" "F" "V" "N" "N" "V" "I" "P"
##  [86] "P" "T" "S" "A" "T" "M" "G" "Q" "L" "Y" "Q" "E" "H" "H" "E" "E" "D"
## [103] "F" "F" "L" "Y" "I" "A" "Y" "S" "D" "E" "S" "V" "Y" "G" "L"

Use the human protein sequences to create a dot plot.

dotPlot(gabarap_human_vector, gabarap_human_vector,
        main = "Human GABARAP Dot Plot")

Create a 2x2 panel which shows the different values for dot plot settings.

par(mfrow = c(2,2),
    mar = c(0,0,2,1))

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 10,
        nmatch = 5,
        main = "wsize = 10, nmatch = 5")

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 15,
        nmatch = 3, 
        main = "wsize = 15, nmatch = 3")

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 6,
        nmatch = 2, 
        main = "wsize = 5, nmatch = 1")

dotPlot(gabarap_human_vector, gabarap_human_vector,
        wsize = 20,
        nmatch = 5, 
        main = "wsize = 20, nmatch = 5")

par(mfrow = c(1,1),
    mar = c(4,4,4,4))

We can include arguments to adjust the settings of a dot plot to present the data in an interesting way.

dotPlot(gabarap_human_vector,
        gabarap_human_vector,
        wsize = 7,
        nmatch = 2,
        main = "Human GABARAP Dot Plot: wsize = 7, nmatch = 2")

Protein Properties Compiled from Databases

No information could be found about this protein in DisProt or RepeatsDB.

source <- c("PFAM", "UniProt", "PDB", "AlphaFold")

prot_property <- c("There is one domain, labeled ATG8, spanning from AA position 13-116, which is the majority of the 117-amino acid length protein. It is an autophagy protein domain.",
                   "GABARAP is associated with intracellular membrane structures, and is primarily found in the cytoskeleton and Golgi apparatus membrane.",
                   "The PDB structure shows that the protein contains both alpha helices and beta sheets in its secondary structure. Beta turns and omega loops can also be observed in the structure.",
                   "AlphaFold shows a similar structural summary as PDB, with alpha helices, beta sheets, beta turns, and omega loops present. The structure is predicted with high confidence.")

link <- c("http://pfam.xfam.org/protein/O95166",
          "https://www.uniprot.org/uniprot/O95166",
          "https://www.rcsb.org/structure/1gnu",
          "https://alphafold.ebi.ac.uk/entry/O95166")

properties_table <- data.frame(Source = source,
                               ProteinProperty = prot_property,
                               Link = link)
pander::pander(properties_table)

Table continues below
Source	ProteinProperty
PFAM	There is one domain, labeled ATG8, spanning from AA position 13-116, which is the majority of the 117-amino acid length protein. It is an autophagy protein domain.
UniProt	GABARAP is associated with intracellular membrane structures, and is primarily found in the cytoskeleton and Golgi apparatus membrane.
PDB	The PDB structure shows that the protein contains both alpha helices and beta sheets in its secondary structure. Beta turns and omega loops can also be observed in the structure.
AlphaFold	AlphaFold shows a similar structural summary as PDB, with alpha helices, beta sheets, beta turns, and omega loops present. The structure is predicted with high confidence.

Link
http://pfam.xfam.org/protein/O95166
https://www.uniprot.org/uniprot/O95166
https://www.rcsb.org/structure/1gnu
https://alphafold.ebi.ac.uk/entry/O95166

Protein Feature Prediction

UniProt shows that the GABARAP protein is associated with intracellular membrane structures, particularly the Golgi Apparatus membrane, as well as the cytoskeleton. Link: https://www.uniprot.org/uniprot/O95166\n

The code used to set up this prediction was obtained here: https://rpubs.com/lowbrowR/843543. The information used was adapted from Chou and Zhang, 1994.

Make amino acid name vectors.

# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# enter again
aa.1.2 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")

# are they the same length?
length(aa.1.1) == length(aa.1.2)

## [1] TRUE

aa.1.1 == aa.1.2

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE

unique(aa.1.1)

##  [1] "A" "R" "N" "D" "C" "Q" "E" "G" "H" "I" "L" "K" "M" "F" "P" "S" "T"
## [18] "W" "Y" "V"

length(unique(aa.1.1)) == length(unique(aa.1.2))

## [1] TRUE

any(c(aa.1.1 == aa.1.2) == FALSE)

## [1] FALSE

This is an input of data from the table from Chou and Zhang 1994.

# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)

# check against chou's total
sum(alpha) == 2447

## [1] TRUE

# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)

# check against chou's total
sum(beta) == 2776

## [1] TRUE

Logical checks are done to make sure that data entry was successful.

a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
sum(a.plus.b) == 1889

## [1] TRUE

# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
sum(a.div.b) == 4333

## [1] TRUE

Calculate the proportion of each of the four protein fold types.

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)

Create a dataframe with the proportion information.

aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
row.names(aa.prop) <- aa.1.1
pander::pander(aa.prop)

	alpha.prop	beta.prop	a.plus.b.prop	a.div.b
A	0.1165	0.07313	0.09264	0.08331
R	0.02166	0.02414	0.04129	0.03369
N	0.03964	0.05007	0.06353	0.04223
D	0.06661	0.04359	0.05876	0.05631
C	0.008991	0.02702	0.03917	0.01454
Q	0.02738	0.04395	0.03917	0.02631
E	0.05476	0.03098	0.04553	0.05931
G	0.08051	0.107	0.09052	0.08701
H	0.04536	0.01765	0.01747	0.02469
I	0.03719	0.04323	0.04923	0.05516
L	0.09031	0.06376	0.05823	0.07824
K	0.1018	0.04143	0.05929	0.07408
M	0.01962	0.005764	0.01323	0.021
F	0.05027	0.03062	0.02753	0.03646
P	0.03351	0.04575	0.03759	0.04339
S	0.04986	0.1228	0.0667	0.07547
T	0.04863	0.09114	0.06194	0.05493
W	0.01349	0.01585	0.01588	0.01662
Y	0.02575	0.03963	0.05717	0.03
V	0.06825	0.08249	0.06511	0.08724

Now we can use this information to determine the frequency of amino acids encoded by the GABARAP gene.

gabarap_freq_table <- table(gabarap_human_vector)/length(gabarap_human_vector)
gabarap_freq_table

## gabarap_human_vector
##          A          D          E          F          G          H 
## 0.05128205 0.05982906 0.09401709 0.07692308 0.04273504 0.03418803 
##          I          K          L          M          N          P 
## 0.05982906 0.10256410 0.07692308 0.01709402 0.01709402 0.05982906 
##          Q          R          S          T          V          Y 
## 0.02564103 0.06837607 0.04273504 0.02564103 0.07692308 0.06837607

Convert this table into a vector.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}
gabarap_human_aa_freq <- table_to_vector(gabarap_freq_table)
gabarap_human_aa_freq

##          A          D          E          F          G          H 
## 0.05128205 0.05982906 0.09401709 0.07692308 0.04273504 0.03418803 
##          I          K          L          M          N          P 
## 0.05982906 0.10256410 0.07692308 0.01709402 0.01709402 0.05982906 
##          Q          R          S          T          V          Y 
## 0.02564103 0.06837607 0.04273504 0.02564103 0.07692308 0.06837607

gabarap_human_aa_freq <- c(A=0.05128205, C = 0, D = 0.05982906, E = 0.09401709, F = 0.07692308, G = 0.04173504, H = 0.03418803, I = 0.05982906, K = 0.10256410, L = 0.07692308, M = 0.01709402, N = 0.01709402, P = 0.05982906, Q = 0.02564103, R = 0.06837607, S = 0.04273504, T = 0.02654103, V = 0.07692308, W = 0, Y = 0.06837607)

Check if “U” is present in the protein.

aa.names <- names(gabarap_human_aa_freq)
i.U <- which(aa.names == "U")
aa.names[i.U]

## character(0)

gabarap_human_aa_freq[i.U]

## named numeric(0)

Use the GABARAP frequency information and add it to the amino acid frequency table. NOTE: GABARAP does not have two amino acids that are in the table and the column could not be appended. I created the table manually so that I would be able to run the code.

aa.prop$gabarap_human_aa_freq <- gabarap_human_aa_freq
pander::pander(aa.prop)

	alpha.prop	beta.prop	a.plus.b.prop	a.div.b	gabarap_human_aa_freq
A	0.1165	0.07313	0.09264	0.08331	0.05128
R	0.02166	0.02414	0.04129	0.03369	0
N	0.03964	0.05007	0.06353	0.04223	0.05983
D	0.06661	0.04359	0.05876	0.05631	0.09402
C	0.008991	0.02702	0.03917	0.01454	0.07692
Q	0.02738	0.04395	0.03917	0.02631	0.04174
E	0.05476	0.03098	0.04553	0.05931	0.03419
G	0.08051	0.107	0.09052	0.08701	0.05983
H	0.04536	0.01765	0.01747	0.02469	0.1026
I	0.03719	0.04323	0.04923	0.05516	0.07692
L	0.09031	0.06376	0.05823	0.07824	0.01709
K	0.1018	0.04143	0.05929	0.07408	0.01709
M	0.01962	0.005764	0.01323	0.021	0.05983
F	0.05027	0.03062	0.02753	0.03646	0.02564
P	0.03351	0.04575	0.03759	0.04339	0.06838
S	0.04986	0.1228	0.0667	0.07547	0.04274
T	0.04863	0.09114	0.06194	0.05493	0.02654
W	0.01349	0.01585	0.01588	0.01662	0.07692
Y	0.02575	0.03963	0.05717	0.03	0
V	0.06825	0.08249	0.06511	0.08724	0.06838

Calculating Similarities

Use these two custom functions to calculate correlation.

# Corrleation used in Chou and Zhang 1992.
chou_corr <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

Calculate the correlation in each column.

corr.alpha <- chou_corr(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_corr(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_corr(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_corr(aa.prop[,5], aa.prop[,4])

Calculate the cosine similarity.

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate the distance.

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)

##                          A    R    N    D    C    Q    E    G    H    I
## alpha.prop            0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04
## beta.prop             0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04
## a.plus.b.prop         0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05
## a.div.b               0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06
## gabarap_human_aa_freq 0.05 0.00 0.06 0.09 0.08 0.04 0.03 0.06 0.10 0.08
##                          L    K    M    F    P    S    T    W    Y    V
## alpha.prop            0.09 0.10 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop             0.06 0.04 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop         0.06 0.06 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b               0.08 0.07 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## gabarap_human_aa_freq 0.02 0.02 0.06 0.03 0.07 0.04 0.03 0.08 0.00 0.07

dist(aa.prop.flipped, method = "euclidean")

##                       alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop             0.13342098                                    
## a.plus.b.prop         0.09281824 0.08289406                         
## a.div.b               0.06699039 0.08659174    0.06175113           
## gabarap_human_aa_freq 0.19389115 0.19771646    0.17495923 0.17870305

Calculate the individual distances.

dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")

fold_name <- c("alpha", "beta", "alpha + beta", "alpha / beta")

correlation_sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine_sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
euc_dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

sim_summary <- c("","","max sim", "")
dist_summary <- c("","","min sum","")

fold_df <- data.frame(fold_name,
                      correlation_sim,
                      cosine_sim,
                      euc_dist,
                      sim_summary,
                      dist_summary)

pander::pander(fold_df)

Table continues below
fold_name	correlation_sim	cosine_sim	euc_dist	sim_summary
alpha	0.7178	0.7178	0.1939
beta	0.7104	0.7104	0.1977
alpha + beta	0.7577	0.7577	0.175	max sim
alpha / beta	0.751	0.751	0.1787

dist_summary

min sum

Percent Identity Comparisons (PID)

Data Preparation

List was FASTA cleaned in an earlier code chunk

names(gabarap_list)

##  [1] "NP_009209.1"    "XP_003315392.1" "XP_536616.2"    "NP_001029220.1"
##  [5] "NP_062723.1"    "NP_742033.1"    "NP_001013278.1" "XP_031568314.1"
##  [9] "NP_001084530.1" "XP_004079842.1"

length(gabarap_list)

## [1] 10

for(i in 1:length(gabarap_list)){
  gabarap_list[[i]] <- compbio4all::fasta_cleaner(gabarap_list[[i]], parse = F)
}

gabarap_list[[1]]

## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL"

Convert each entry of the list into a vector.

gabarap_vector <- rep(NA, length(gabarap_list))


for(i in 1:length(gabarap_vector)){
  gabarap_vector[i] <- gabarap_list[[i]]
}

names(gabarap_vector) <- names(gabarap_list)
gabarap_vector

##                                                                                                                  NP_009209.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                               XP_003315392.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                                  XP_536616.2 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                               NP_001029220.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                                  NP_062723.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                                  NP_742033.1 
##      "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGL" 
##                                                                                                               NP_001013278.1 
## "MKFMYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGDVLKDV" 
##                                                                                                               XP_031568314.1 
##     "MKWEYKEEHPFEKRRAEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGTD" 
##                                                                                                               NP_001084530.1 
##      "MKFVYKEEHAFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGM" 
##                                                                                                               XP_004079842.1 
## "MKFQYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQFYFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGNSQREM"

Perform alignments between selected species: human, chimp, sea anemone, and fish.

align.01.02 <- Biostrings::pairwiseAlignment(
                  gabarap_vector[1],
                  gabarap_vector[2])

align.01.08 <- Biostrings::pairwiseAlignment(
                  gabarap_vector[1],
                  gabarap_vector[8])

align.01.10 <- Biostrings::pairwiseAlignment(
                  gabarap_vector[1],
                  gabarap_vector[10])

align.02.08 <- Biostrings::pairwiseAlignment(
  gabarap_vector[2],
  gabarap_vector[8]
)

align.02.10 <- Biostrings::pairwiseAlignment(
  gabarap_vector[2],
  gabarap_vector[10]
)

align.08.10 <- Biostrings::pairwiseAlignment(
  gabarap_vector[8],
  gabarap_vector[10]
)

Calculate the percent identities between human with chimp, sea anemone, and fish.

Biostrings::pid(align.01.02)

## [1] 100

Biostrings::pid(align.01.08)

## [1] 96.5812

Biostrings::pid(align.01.10)

## [1] 97.4359

Build a percent identity matrix with these 4 species.

pids <- c(1,                  NA,     NA,     NA,
          pid(align.01.02),          1,     NA,     NA,
          pid(align.01.08), pid(align.02.08),      1,     NA,
          pid(align.01.10), pid(align.02.10), pid(align.08.10), 1)

pid_mat <- matrix(pids, nrow = 4, byrow = T)
row.names(pid_mat) <- c("Homo","Pan","Sea anemone","Fish")  
colnames(pid_mat) <- c("Homo","Pan","Sea anemone","Fish")  
pander::pander(pid_mat)

	Homo	Pan	Sea anemone	Fish
Homo	1	NA	NA	NA
Pan	100	1	NA	NA
Sea anemone	96.58	96.58	1	NA
Fish	97.44	97.44	94.92	1

Different methods can be specified to calculate the percent identity between two species. We will use this method for the human and sea anemone alignment. I chose to use this alignment rather than that with the chimp because there was a 100% PID between human and chimp, thus the data might not be as interesting.

Biostrings::pid(align.01.08, type = "PID1")

## [1] 96.5812

Biostrings::pid(align.01.08, type = "PID2")

## [1] 96.5812

Biostrings::pid(align.01.08, type = "PID3")

## [1] 96.5812

Biostrings::pid(align.01.08, type = "PID4")

## [1] 96.17021

Create a table to compare these different methods of calculating percent identity.

method <- c("PID1", "PID2", "PID3", "PID4")
pid.01.08 <- c(Biostrings::pid(align.01.08, type = "PID1"),
Biostrings::pid(align.01.08, type = "PID2"),
Biostrings::pid(align.01.08, type = "PID3"),
Biostrings::pid(align.01.08, type = "PID4")
)

denominator <- c("(aligned positions + internal gap positions)",
                 "(aligned positions)",
                 "(length shorter sequence)",
                 "(average length of two sequences)")

pid_table <- data.frame(Method = method,
                        PID = pid.01.08,
                        Denominator = denominator)
pander::pander(pid_table)

Method	PID	Denominator
PID1	96.58	(aligned positions + internal gap positions)
PID2	96.58	(aligned positions)
PID3	96.58	(length shorter sequence)
PID4	96.17	(average length of two sequences)

Multiple Sequence Alignment

MSA Data Preparation

gabarap_vector_ss <- Biostrings::AAStringSet(gabarap_vector)

Build the Multiple Sequence Alignment (MSA)

gabarap_align <- msa(gabarap_vector_ss,
                     method = "ClustalW")

## use default substitution matrix

Clean the data and set up the MSA.

class(gabarap_align)

## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"

is(gabarap_align)

## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"   
## [3] "MsaMetaData"            "MultipleAlignment"

Change the class of the alignment and convert it to seqinr format.

class(gabarap_align) <- "AAMultipleAlignment"
gabarap_align_seqinr <- msaConvert(gabarap_align,
                                   type = "seqinr::alignment")
compbio4all::print_msa(gabarap_align_seqinr)

## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFVYKEEHAFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFQYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKFMYKEEHPFEKRRSEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] "MKWEYKEEHPFEKRRAEGEKIRKKYPDRVPVIVEKAPKARIGDLDKKKYLVPSDLTVGQF 0"
## [1] " "
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYG---- 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGNSQR 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGLLYQEHHEEDFFLYIAYSDESVYGDVLK 0"
## [1] "YFLIRKRIHLRAEDALFFFVNNVIPPTSATMGQLYQEHHEEDFFLYIAYSDESVYGTD-- 0"
## [1] " "
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-L 58"
## [1] "-M 58"
## [1] "EM 58"
## [1] "DV 58"
## [1] "-- 58"
## [1] " "

Completed MSA

The full alignment was completed above and includes the whole sequence. To present this MSA, we will include the first 60 amino acids of the protein so that we are able to read it. GABARAP, while a relatively short protein, contains several different regions, which were conserved across many of the sequences. The first 60 amino acids contains a piece of each of these regions and therefore the display of this MSA includes the first 60 amino acids.

NOTE: The ggmsa function was causing major problems (crashing R) while trying to run the entire code so it has been commented out, however the code should be correct and it should be able to run if ggmsa() is available.

# ggmsa::ggmsa(gabarap_align,  
#      start = 0,
#      end = 60)

Distance Matrix

Calculate the genetic distance in the alignment.

gabarap_dist <- seqinr::dist.alignment(gabarap_align_seqinr,
                                       matrix = "identity")

Check the data types that were produced.

is(gabarap_dist)

## [1] "dist"     "oldClass"

class(gabarap_dist)

## [1] "dist"

Round the distance matrix values to clean up the display.

gabarap_dist_rnd <- round(gabarap_dist, 3)
gabarap_dist_rnd

##                NP_009209.1 XP_003315392.1 XP_536616.2 NP_001029220.1
## XP_003315392.1       0.000                                          
## XP_536616.2          0.000          0.000                           
## NP_001029220.1       0.000          0.000       0.000               
## NP_062723.1          0.000          0.000       0.000          0.000
## NP_742033.1          0.000          0.000       0.000          0.000
## NP_001084530.1       0.131          0.131       0.131          0.131
## XP_004079842.1       0.160          0.160       0.160          0.160
## NP_001013278.1       0.160          0.160       0.160          0.160
## XP_031568314.1       0.161          0.161       0.161          0.161
##                NP_062723.1 NP_742033.1 NP_001084530.1 XP_004079842.1
## XP_003315392.1                                                      
## XP_536616.2                                                         
## NP_001029220.1                                                      
## NP_062723.1                                                         
## NP_742033.1          0.000                                          
## NP_001084530.1       0.131       0.131                              
## XP_004079842.1       0.160       0.160          0.160               
## NP_001013278.1       0.160       0.160          0.185          0.240
## XP_031568314.1       0.161       0.161          0.186          0.225
##                NP_001013278.1
## XP_003315392.1               
## XP_536616.2                  
## NP_001029220.1               
## NP_062723.1                  
## NP_742033.1                  
## NP_001084530.1               
## XP_004079842.1               
## NP_001013278.1               
## XP_031568314.1          0.225

Phylogenetic Trees

Use the distance matrix to build a phylogenetic tree for these 10 sequences using the neighbor-joining nj() algorithm.

tree_seqs <- nj(gabarap_dist)

There are a few different ways to plot a phylogenetic tree. Upon creating the phylogenetics trees with and varying the root and branch lengths, the clearest visual was given by a rooted phylogenetic tree without branch lengths. these settings can be easily manipulated to produce trees by including additional arguments, thus the unrooted trees / trees with branch lengths are not included in this code.

plot.phylo(tree_seqs, main="Phylogenetic Tree\n",
            use.edge.length = F)

mtext(text = "GABARAP family gene tree - rooted, no branch lengths")

Final Portfolio: GABARAP, Gamma-Aminobutyric Acid Receptor-Associated Protein

Colleen Petersen

12/12/2021

Introduction

Resources and References

Preparation and Packages

Accession Numbers

Data Preparation

Initial Data Cleaning

General Protein Information

Protein Diagram

Draw Dot Plot

Data preparation

Protein Properties Compiled from Databases

Protein Feature Prediction

Calculating Similarities

Percent Identity Comparisons (PID)

Data Preparation

Multiple Sequence Alignment

MSA Data Preparation

Build the Multiple Sequence Alignment (MSA)

Completed MSA

Distance Matrix

Phylogenetic Trees