final portfolio project

INTRO

The PRKAA1 gene encodes the protein that is the catalytic sub-unit of a protein kinase. Protein kinases regulate cellular AMP/ATP ratios.

The PRKAA1 gene also goes by: AMPK, AMPKa1, and AMPK alpha 1.

REFERENCES

LOAD PACKAGES

TO DOWNLOAD PACKAGES…

from github: devtools::install_github(“devloper_name/package”) requires devtools to be installed via install.packages()
from CRAN: install.packages(“package_name”)
from BioConductor: BioManager::install(“package_name”) requires BioManager to be installed via install.packages()

# GitHub Packages
library(compbio4all)

# CRAN Packages
library(ape)
library(rentrez)
library(seqinr)
library(msa)
library(ggmsa)
library(ggplot2)
library(pander)

# BioConductor Packages
library(Biostrings)
library(drawProteins)

PRKAA1 ACCESSION NUMBERS

The PRKAA1 gene is present in multiple species. Information from the mRNA transcript, gene, protein, etc can be found from multiple databases.

RefSeq - databank contains transcript, protein, and chromosome data. entries contain sequence information such as species name, structure, sequences, etc. the entries are manually curated and can be experimentally derived or predicted by computational means.

Uniprot - databank for well annotated, high quality proteins. entries contain the protein’s definitive state, interactions, and 3D structures if they exist. the focus of uniprot is on protein structure and function.

PBD - a database that only contains known 3D structures of proteins.

# accession table
common_names <- c("Humans", "Cows", "Mouse", "Rat",
                  "Chicken", "Tropical Clawed Frog", "Zebrafish",
                  "Thale Cress", "Horse", "Pig", "Orangutan")

science_names <- c("Homo spapians", "Bos taurus", "Mus musculus", "Rattus norvegicus",
                   "Gallus gallus", "Xenopus tropicalis", "Danio rerio",
                   "Arabidopsis thaliana", "Equus caballus", "Sus scrofa", "Pongo abelii")

rs <- c("NP_006242.5", "NP_001103272.1", "NP_001013385.3",  "NP_062015.2",
        "NP_001034692.1", "NP_001120434.1", "NP_001103756.1",
        "NP_566130.1", "NP_001075272.2", "NP_001161105.1", "NP_001127249.1")

up <- c("Q13131", "A8E649", "Q5EG47", "P54645",
        "Q2PUH1", "B1H2Z8", "A5WUM0",
        NA, "Q2LGG0", "D0G7E1", "Q5RDH5")

pdb <- c("4RED", NA, "5UFU", "4EAK",
         NA, NA, NA,
         NA, NA, NA, NA)

# create dataframe to view
acessions.df <- data.frame(name = common_names, 
                           scientific_name = science_names, 
                           refseq = rs, 
                           uniprot = up, 
                           protein_db = pdb)
pander(acessions.df)

Table continues below
name	scientific_name	refseq	uniprot
Humans	Homo spapians	NP_006242.5	Q13131
Cows	Bos taurus	NP_001103272.1	A8E649
Mouse	Mus musculus	NP_001013385.3	Q5EG47
Rat	Rattus norvegicus	NP_062015.2	P54645
Chicken	Gallus gallus	NP_001034692.1	Q2PUH1
Tropical Clawed Frog	Xenopus tropicalis	NP_001120434.1	B1H2Z8
Zebrafish	Danio rerio	NP_001103756.1	A5WUM0
Thale Cress	Arabidopsis thaliana	NP_566130.1	NA
Horse	Equus caballus	NP_001075272.2	Q2LGG0
Pig	Sus scrofa	NP_001161105.1	D0G7E1
Orangutan	Pongo abelii	NP_001127249.1	Q5RDH5

protein_db
4RED
NA
5UFU
4EAK
NA
NA
NA
NA
NA
NA
NA

DATA PREPARATION

The function rentrez::entrez_fetch() downloads a single, specified FASTA file. compbio4all::entrez_fetch_list() is a wrapper function for entrez_fetch() that returns multiple FASTA files stored in a list object.

Directly after downloading, FASTA files still contain the header, line breaks (\n), and other unnecessary formatting. In order to use just the sequence, the files have to be cleaned. To clean fasta files, use compbio4all::fasta_cleaner()

# download and clean FASTA files

prkaa_list <- compbio4all::entrez_fetch_list(id = acessions.df$refseq,
                                                 db = "protein",
                                                 rettype = "FASTA")


for (i in 1:length(prkaa_list))
{
  prkaa_list[[i]] <- compbio4all::fasta_cleaner(prkaa_list[[i]], parse = F)
}

# label with the name of the organism instead of the defaul (accession number)
names(prkaa_list) <- common_names

PROTEIN EXPLORATION

KEY DOMAINS // PROTEIN DIAGRAM

UniProt entries contain “Family & Domain” subsections that show the structure of the protein, what domains/repeats/regions are present, what the secondary structure contains, etc. The data recorded on UniProt varies based on how studied the gene/protein is.

By using a UniProt acession number, the BioConductor package drawProteins can extract information about the protein’s domains, chains, regions, motifs, phosphorylation sites, and repeats IF that information is available.

# download data from UniProt
Q13131.feat <- drawProteins::get_features("Q13131")

## [1] "Download has worked"

# convert data to dataframe and view
Q13131.df <- drawProteins::feature_to_dataframe(Q13131.feat)

The plot created by drawProteins can vary based on the available data. There are different functions in the package to plot specific parts of the data.

## Domains Present
domain.plt <- draw_canvas(Q13131.df) # create basic plot w/ longest protein and number of proteins
domain.plt <- draw_chains(domain.plt, Q13131.df, label_size = 2.5) # plot the chains
domain.plt <- draw_domains(domain.plt, Q13131.df) # add domains created by draw_chains

# show domain plot:
domain.plt

## Only the Receptor Domain
recdom.plt <- draw_canvas(Q13131.df) 
recdom.plt <- draw_chains(recdom.plt, Q13131.df, label_size = 2.5)
recdom.plt <- draw_recept_dom(recdom.plt, Q13131.df) # plot receptor domains

# Show plot:
recdom.plt

## Plot Regions
region.plt <- draw_canvas(Q13131.df)
region.plt <- draw_chains(region.plt, Q13131.df, label_size = 2.5)
region.plt <- draw_regions(region.plt, Q13131.df)

# Show Plot:
region.plt

## Region AND Folding
regfold.plt <- region.plt
regfold.plt <- draw_folding(regfold.plt, Q13131.df)

# Show Plot:
regfold.plt

# Plot Repeats
rep.plt <- draw_canvas(Q13131.df)
rep.plt <- draw_chains(rep.plt, Q13131.df, label_size = 2.5)
rep.plt <- draw_repeat(rep.plt, Q13131.df)

# Show Plot:
rep.plt

## Plot Phosphorylation
phos.plt <- draw_canvas(Q13131.df)
phos.plt <- draw_chains(phos.plt, Q13131.df, label_size = 2.5)
phos.plt <- draw_phospho(phos.plt, Q13131.df)

phos.plt

Based on the discovery plots, the repeats and receptor domain information is not available for the PRKAA1 gene entry in UniProt.

prot.plt <- draw_canvas(Q13131.df)
prot.plt <- draw_chains(prot.plt, Q13131.df, label_size = 2.5)

# prot.plt <- draw_domains(prot.plt, Q13131.df) # domains
prot.plt <- draw_folding(prot.plt, Q13131.df) # folding
prot.plt <- draw_regions(prot.plt, Q13131.df) # regions
prot.plt <- draw_phospho(prot.plt, Q13131.df) # phosphorylation

prot.plt

DOTPLOT

Doing a self to self dot plot can reveal repeats. The seqinr package in R has a dotPlot() function that works for this task, all that’s required is two sequence vectors. Altering the wsize wstep and nmatch arguments focuses the data and reduces the graph equivalent to background noise.

Using the human version of PRKAA1:

# get the human PRKAA1 sequence from the FASTA list:
humanSeq <- fasta_cleaner(prkaa_list[["Humans"]])
  
# confirm vector with data exploration
is(humanSeq)

##  [1] "character"               "vector"                 
##  [3] "data.frameRowLabels"     "SuperClassMethod"       
##  [5] "character_OR_connection" "character_OR_NULL"      
##  [7] "atomic"                  "EnumerationValue"       
##  [9] "vector_OR_Vector"        "vector_OR_factor"

# set up 2 x 2 grid
par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# Default Plot:
dotPlot(humanSeq,
        humanSeq, 
        wsize = 1, 
        nmatch = 1, 
        main = "PRKAA Defaults")

# plot 2:
dotPlot(humanSeq,
        humanSeq,  
        wsize = 20,
        nmatch = 5,
        main = "PRKAA - size = 20, nmatch = 5")

# plot 3: 
dotPlot(humanSeq[0:200],
        humanSeq[0:200],  
        wsize = 30, 
        nmatch = 5, 
        main = "PRKAA region [0:200] w=30")

# plot 4:
dotPlot(humanSeq[100:200],
        humanSeq[100:200], 
        wsize = 40,
        nmatch = 5,
        main = "PRKAA region [100:200] w=40")

Plot 4 shows the clearest evidence of a repeated pattern:

par(mfrow = c(1,1), 
    mar = c(4,4,4,4))

dotPlot(humanSeq[100:200],
        humanSeq[100:200], 
        wsize = 40, 
        nmatch = 5,
        main = "PRKAA[100:200] w=40, m=5")

focal dot plot

PROTEIN PROPERTIES

Pfam - A protein family database, works with UniProt.

Disprot - Curated database for intrinsincally disordered proteins.

RepeatsDB - Database of annotated tandem repeats.

CATH - Database with a focus on classifications of protein structure

Information for PRKAA1 was only found in the Pfam database.

Pfam determines two large domains in the protein: pkinase from base 27 to 279 and adenylate sensor from base 406 to 503. The pkinase determination is consistent with the protein diagram created above. Pfam also makes note of 3 low complexity regions and 7 disorder regions, however they cover very small areas (as little as 3bp!) and could be part of a larger region. In fact, UniProt has only 2 regions (AIS and Disordered) and 1 domain (pkinase), as is seen in the protein diagram.

UniProt has the subcelluar location of the protein as the nucleus and the cytoplasm/cytosol.

The protein databank entry 4RED is frequently used in these analyses. This entry shows a crystallized version of the protein collected from humans and rats. The structure appears to be mainly alpha helices but also contains a few beta pleated sheets

PROTEIN PREDICTION

In a 1994 paper KC Chou uses multivariable statistical methods to predict the folding pattern in proteins. Using the sequence of the protein, the frequency of each amino acid is found and used to predict the fold type of the protein.

## Use Table 5 from the linked paper to get the training data used to build the Chou Method
# aa names:
aa.1 <- c("A", "R", "N", "D", "C",
          "Q", "E", "G", "H", "I",
          "L", "K", "M", "F", "P", 
          "S","T", "W", "Y", "V")

# frequency:
## alpha:
alpha.class <- c(285, 53, 97, 163, 22, 67, 134,
                 197, 111, 91, 221, 249, 48, 123,
                 82, 122, 119, 33, 63, 167)
## beta:
beta.class <- c(203, 67, 139, 121, 75, 122, 86,
                297, 49, 120, 177, 115, 16, 85,
                127, 341, 253, 44, 110, 229)
## alpha + beta
apb.class <- c(175, 78, 120, 111, 74, 74, 86,
               171, 33, 93, 110, 112, 25, 52,
               71, 126, 117, 30, 108, 123)
## alpha/beta
adb.class <- c(361, 146, 183, 244, 63, 114,
               257, 377, 107, 239, 339, 321,
               91, 158, 188, 327, 238, 72, 130, 378)

## calculate proportions:
alpha.prop <- alpha.class/sum(alpha.class)
beta.prop <- beta.class/sum(beta.class)
apb.prop <- apb.class/sum(apb.class)
adb.prop <- adb.class/sum(adb.class)

aa.prop <- data.frame(aplha = alpha.prop,
                      beta = beta.prop,
                      alpha.plus.beta = apb.prop,
                      alpha.div.beta = adb.prop)
row.names(aa.prop) <- aa.1
pander(aa.prop)

	aplha	beta	alpha.plus.beta	alpha.div.beta
A	0.1165	0.07313	0.09264	0.08331
R	0.02166	0.02414	0.04129	0.03369
N	0.03964	0.05007	0.06353	0.04223
D	0.06661	0.04359	0.05876	0.05631
C	0.008991	0.02702	0.03917	0.01454
Q	0.02738	0.04395	0.03917	0.02631
E	0.05476	0.03098	0.04553	0.05931
G	0.08051	0.107	0.09052	0.08701
H	0.04536	0.01765	0.01747	0.02469
I	0.03719	0.04323	0.04923	0.05516
L	0.09031	0.06376	0.05823	0.07824
K	0.1018	0.04143	0.05929	0.07408
M	0.01962	0.005764	0.01323	0.021
F	0.05027	0.03062	0.02753	0.03646
P	0.03351	0.04575	0.03759	0.04339
S	0.04986	0.1228	0.0667	0.07547
T	0.04863	0.09114	0.06194	0.05493
W	0.01349	0.01585	0.01588	0.01662
Y	0.02575	0.03963	0.05717	0.03
V	0.06825	0.08249	0.06511	0.08724

Apply the data to predict the fold classificatin of PRKAA1:

# vector of the protein sequence
prkaa.classify <- fasta_cleaner(prkaa_list[["Humans"]], parse = TRUE)

prkaa.freq.table <- table(prkaa.classify)/length(prkaa.classify)

Tables are not considered vectors in R:

table_to_vector <- function(table_x){
  table.names <- attr(table_x, "dimnames")[[1]]
  table.vec <- as.vector(table_x)
  names(table.vec) <- table.names
  return (table.vec)
}

Use function to continue with fold prediction:

prkaa.aa.freq <- table_to_vector(prkaa.freq.table)

## if 'U' (unknown) is present in sequence, remove it
aa.names <- names(prkaa.aa.freq)

if (any(aa.names == "U")){
  iU <- which(aa.names == "U")
  prkaa.aa.freq <- prkaa.aa.freq[-iU]
}

## add to data table
aa.prop$prkaa.freq <- prkaa.aa.freq

Cusom functions required to calculate similarity with this strategy:

## Cosine Similarity:
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cos <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cos)
}

## Correlation:
chou_cor <- function(x,y){
  numerator <- sum(x*y)
  denominator <- sqrt((sum(x^2))*(sum(y^2)))
  result <- numerator/denominator
  return(result)
}

Use these functions to calculate the correlation and cosine similarity in each column of the dataframe:

## correlation calculation
corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_cor(aa.prop[,5], aa.prop[,4])

## cosine sim calculation
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])

## calculate distance (requires flipped table)
faa.prop <- t(aa.prop)
round(faa.prop, 2)

##                    A    R    N    D    C    Q    E    G    H    I    L    K
## aplha           0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta            0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## alpha.plus.beta 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## alpha.div.beta  0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## prkaa.freq      0.05 0.02 0.07 0.06 0.03 0.05 0.04 0.06 0.07 0.09 0.03 0.03
##                    M    F    P    S    T    W    Y    V
## aplha           0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta            0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## alpha.plus.beta 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## alpha.div.beta  0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## prkaa.freq      0.05 0.04 0.07 0.09 0.04 0.07 0.01 0.04

dist(faa.prop, method = "euclidean")

##                      aplha       beta alpha.plus.beta alpha.div.beta
## beta            0.13342098                                          
## alpha.plus.beta 0.09281824 0.08289406                               
## alpha.div.beta  0.06699039 0.08659174      0.06175113               
## prkaa.freq      0.16650059 0.14780836      0.13649090     0.13797574

### calculate individual differences
dist.alph <- dist(faa.prop[c(1,5),], method = "euclidean")
dist.bet <- dist(faa.prop[c(2,5),], method = "euclidean")
dist.apb <- dist(faa.prop[c(3,5),], method = "euclidean")
dist.adb <- dist(faa.prop[c(4,5),], method = "euclidean")

Compile all the calculations to view the potential fold prediction:

fold.type <- c("alpha", "beta", "alpha + beta", "alpha / beta")

corr.sim <- round(c(corr.alpha, corr.beta, corr.apb, corr.adb), 5)
cos.sim <- round(c(cos.alpha, cos.beta, cos.apb, cos.adb), 5)
euc.dist <- round(c(dist.alph, dist.bet, dist.apb, dist.adb), 5)

sim.sum <- c("", "", "tied similarity", "tied similarity")
dist.sum <- c("", "", "min distance", "")

pred.df <- data.frame(fold.type,
                      corr.sim,
                      cos.sim,
                      euc.dist,
                      sim.sum,
                      dist.sum)

pander(pred.df)

fold.type	corr.sim	cos.sim	euc.dist	sim.sum	dist.sum
alpha	0.7808	0.7808	0.1665
beta	0.8308	0.8308	0.1478
alpha + beta	0.8422	0.8422	0.1365	tied similarity	min distance
alpha / beta	0.842	0.842	0.138	tied similarity

PAIRWISE ALIGNMENTS AND PID ANALYSIS

Pairwise alignments compare two sequences. There are two measurements made during pairwise alignments: score() and pid().

Scores of alignments are computed based on amino acid matches. For example, the Needleman algorithm for calculating scores divides points based on: exact amino acid match, amino acid mismatch, gap creations, and gap expansions. BLOSUM is another popular scoring matrix, which is used below. R also supports creating unique scoring matrices by using the nucleotideSubstitutionMatrix() function.

PID is percent (or proportion) identity between sequences. It can be found easily in R using the pid() function. There are a couple variations, but mainly PID is calculated by amino acid matches/length of alignment.

# choose 4 species for pairwise alignment
up_species <- c("Human", "Orangutan", "Horse", "Chicken")
up_accessions <- c("Q13131", "Q5RDH5", "Q2LGG0", "Q2PUH1")

# download and clean sequences
upH <- rentrez::entrez_fetch(id = "Q13131",
                             db = "protein",
                             rettype = "FASTA")
upH <- compbio4all::fasta_cleaner(upH)

upO <- rentrez::entrez_fetch(id = "Q5RDH5",
                             db = "protein",
                             rettype = "FASTA")
upO <- compbio4all::fasta_cleaner(upO)

upHorse <- rentrez::entrez_fetch(id = "Q2LGG0",
                             db = "protein",
                             rettype = "FASTA")
upHorse <- compbio4all::fasta_cleaner(upHorse)

upC <- rentrez::entrez_fetch(id = "Q2PUH1",
                             db = "protein",
                             rettype = "FASTA")
upC <- compbio4all::fasta_cleaner(upC)

# Convert vector to string
upHS <- paste(upH, collapse = "")
upOS <- paste(upO, collapse = "")
upHoS <- paste(upHorse, collapse = "")
upCS <- paste(upC, collapse = "")

# Make sure everything is capitalized
upHS <- toupper(upHS)
upOS <- toupper(upOS)
upHoS <- toupper(upHoS)
upCS <- toupper(upCS)

# Load Substitution Matrix
data(BLOSUM50)

# Human vs Orangutan
ho_align <- Biostrings::pairwiseAlignment(upHS, upOS,
                                              substitutionMatrix = BLOSUM50,
                                              gapOpening = -2,
                                              gapExtension = -8,
                                              scoreOnly = FALSE)
# Human vs Horse
hho_align <- Biostrings::pairwiseAlignment(upHS, upHoS,
                                              substitutionMatrix = BLOSUM50,
                                              gapOpening = -2,
                                              gapExtension = -8,
                                              scoreOnly = FALSE)
# Human vs Chicken
hc_align <- Biostrings::pairwiseAlignment(upHS, upCS,
                                              substitutionMatrix = BLOSUM50,
                                              gapOpening = -2,
                                              gapExtension = -8,
                                              scoreOnly = FALSE)
# Orangutan vs Horse
oho_align <- Biostrings::pairwiseAlignment(upOS, upHoS,
                                              substitutionMatrix = BLOSUM50,
                                              gapOpening = -2,
                                              gapExtension = -8,
                                              scoreOnly = FALSE)
# Orangutan vs Chicken
oc_align <- Biostrings::pairwiseAlignment(upOS, upCS,
                                              substitutionMatrix = BLOSUM50,
                                              gapOpening = -2,
                                              gapExtension = -8,
                                              scoreOnly = FALSE)
# Horse vs Chicken
hoc_align <- Biostrings::pairwiseAlignment(upHoS, upCS,
                                              substitutionMatrix = BLOSUM50,
                                              gapOpening = -2,
                                              gapExtension = -8,
                                              scoreOnly = FALSE)

Seeing the full alignments and the individual scores is nice, but it’s a lot easier to understand and interpret in a table. The data format of the table is: pid1/pid2/pid3/pid4.

# set up the data:
ho.pid1 <- c(round(pid(ho_align, type = "PID1"), digits = 1))
ho.pid2 <- c(round(pid(ho_align, type = "PID2"), digits = 1))
ho.pid3 <- c(round(pid(ho_align, type = "PID3"), digits = 1))
ho.pid4 <- c(round(pid(ho_align, type = "PID4"), digits = 1))

hho.pid1 <- c(round(pid(hho_align, type = "PID1"), digits = 1))
hho.pid2 <- c(round(pid(hho_align, type = "PID2"), digits = 1))
hho.pid3 <- c(round(pid(hho_align, type = "PID3"), digits = 1))
hho.pid4 <- c(round(pid(hho_align, type = "PID4"), digits = 1))

hc.pid1 <- c(round(pid(hc_align, type = "PID1"), digits = 1))
hc.pid2 <- c(round(pid(hc_align, type = "PID2"), digits = 1))
hc.pid3 <- c(round(pid(hc_align, type = "PID3"), digits = 1))
hc.pid4 <- c(round(pid(hc_align, type = "PID4"), digits = 1))

oho.pid1 <- c(round(pid(oho_align, type = "PID1"), digits = 1))
oho.pid2 <- c(round(pid(oho_align, type = "PID2"), digits = 1))
oho.pid3 <- c(round(pid(oho_align, type = "PID3"), digits = 1))
oho.pid4 <- c(round(pid(oho_align, type = "PID4"), digits = 1))

oc.pid1 <- c(round(pid(oc_align, type = "PID1"), digits = 1))
oc.pid2 <- c(round(pid(oc_align, type = "PID2"), digits = 1))
oc.pid3 <- c(round(pid(oc_align, type = "PID3"), digits = 1))
oc.pid4 <- c(round(pid(oc_align, type = "PID4"), digits = 1))

hoc.pid1 <- c(round(pid(hoc_align, type = "PID1"), digits = 1))
hoc.pid2 <- c(round(pid(hoc_align, type = "PID2"), digits = 1))
hoc.pid3 <- c(round(pid(hoc_align, type = "PID3"), digits = 1))
hoc.pid4 <- c(round(pid(hoc_align, type = "PID4"), digits = 1))

ho.pid <- paste(ho.pid1, ho.pid2, ho.pid3, ho.pid4, sep = "/")
hho.pid <- paste(hho.pid1, hho.pid2, hho.pid3, hho.pid4, sep = "/")
hc.pid <- paste(hc.pid1, hc.pid2, hc.pid3, hc.pid4, sep = "/")
oho.pid <- paste(oho.pid1, oho.pid2, oho.pid3, oho.pid4, sep = "/")
oc.pid <- paste(oc.pid1, oc.pid2, oc.pid3, oc.pid4, sep = "/")
hoc.pid <- paste(hoc.pid1, hoc.pid2, hoc.pid3, hoc.pid4, sep = "/")

# make the table:
col.names <- up_species
row1 <- c("--", ho.pid, hho.pid, hc.pid)
row2 <- c("--", "--", oho.pid, oc.pid)
row3 <- c("--", "--", "--", hoc.pid)
row4 <- c("--", "--", "--", "--")

pid.tab <- data.frame(
                      human = row1,
                      orang = row2,
                      horse = row3,
                      chick = row4,
                      row.names = col.names
                      )

pander(pid.tab)

Table continues below
	human	orang
Human	–	–
Orangutan	99.8/99.8/99.8/99.4	–
Horse	98.2/99.8/99.8/99	99.6/99.6/99.6/99.3
Chicken	95/95.5/95.3/95.3	94.8/95.5/95.3/94.8

	horse	chick
Human	–	–
Orangutan	–	–
Horse	–	–
Chicken	93.8/95.8/95.6/94.8	–

Chimps and Humans are very closely related. There is not a PRKAA1 gene (experimentally derived!) in NCBI yet, so comparing Orangutans and Humans will suffice.

# same pid calculations as above, just rounded out a bit further
ho.pid1 <- c(round(pid(ho_align, type = "PID1"), digits = 2))
ho.pid2 <- c(round(pid(ho_align, type = "PID2"), digits = 2))
ho.pid3 <- c(round(pid(ho_align, type = "PID3"), digits = 2))
ho.pid4 <- c(round(pid(ho_align, type = "PID4"), digits = 2))

# build table
ho.tab <- data.frame(calculation.method = c("PID1", "PID2", "PID3", "PID4"),
                     pid = c(ho.pid1, ho.pid2, ho.pid3, ho.pid4))

pander(ho.tab)

calculation.method	pid
PID1	99.82
PID2	99.82
PID3	99.82
PID4	99.37

MULTIPLE SEQUENCE ALIGNMENT

A Multiple Sequence Alignment (MSA) creates an alignment summary of the protein across species.

In order to use ggmsa(), the FATA files need to be in a named vector, currently they are stored as a list.

prkaa_vector <- rep(NA, length(prkaa_list))

for (i in 1:length(prkaa_vector))
{
  prkaa_vector[i] <- prkaa_list[[i]]
}

names(prkaa_vector) <- names(prkaa_list)

prkaa_vss <- Biostrings::AAStringSet(prkaa_vector)
prkaa_alignment <- msa(prkaa_vss,
                       method = "ClustalW")

## use default substitution matrix

print(paste("original class of alignment:", class(prkaa_alignment)))

## [1] "original class of alignment: MsaAAMultipleAlignment"

class(prkaa_alignment) <- "AAMultipleAlignment"
print(paste("change class to:", class(prkaa_alignment)))

## [1] "change class to: AAMultipleAlignment"

# without specifying start and end, the msa is too small to see
ggmsa::ggmsa(prkaa_alignment,
             start = 100,
             end = 200)

msa plot

Viewing the amino acid sequence between bases 100 and 100 shows a highly conserved region in the protein and is consistent with the dotplot above. This suggests this part of the sequence is evolutionary important.

DISTANCE MATRIX

While an MSA is a good way to examine a sequence its hard to assess all of the information visually. A phylogenetic tree allows you to summarize patterns in an MSA. The fastest way to make phylogenetic trees to is first summarize an MSA using a genetic distance matrix. The more amino acids that are identical to each other, the smaller the genetic distance is between them and the less evolution has occurred.

Calculating genetic distance from an MSA is done using the seqinr::dist.alignment() function.

# change the data class to work with seqinr
prkaa.align.seq <- msaConvert(prkaa_alignment, type = "seqinr::alignment")

# make a distance matrix
prkaa.distance <- seqinr::dist.alignment(prkaa.align.seq, 
                                       matrix = "identity")
# round the data to make easier to look at
prkaa.distance.rnd <- round(prkaa.distance, digits = 3)

prkaa.distance.rnd

##                       Cows   Pig Horse Orangutan Mouse   Rat Humans Chicken
## Pig                  0.047                                                 
## Horse                0.081 0.060                                           
## Orangutan            0.093 0.085 0.060                                     
## Mouse                0.155 0.135 0.120     0.121                           
## Rat                  0.140 0.121 0.104     0.104 0.060                     
## Humans               0.093 0.074 0.042     0.043 0.112 0.095               
## Chicken              0.214 0.209 0.212     0.209 0.232 0.232  0.212        
## Tropical Clawed Frog 0.324 0.308 0.311     0.314 0.317 0.317  0.311   0.302
## Zebrafish            0.359 0.355 0.357     0.360 0.365 0.360  0.357   0.333
## Thale Cress          0.731 0.723 0.723     0.723 0.727 0.726  0.723   0.722
##                      Tropical Clawed Frog Zebrafish
## Pig                                                
## Horse                                              
## Orangutan                                          
## Mouse                                              
## Rat                                                
## Humans                                             
## Chicken                                            
## Tropical Clawed Frog                               
## Zebrafish                           0.349          
## Thale Cress                         0.725     0.730

PHYLOGENY

One way to build a phylognetic tree is to use a neighbor joining algorithm, which uses genetic distance to put sequences into clades. The ape function nj() uses this method by taking in a distance matrix as an argument.

prk.tree <-  ape::nj(prkaa.distance)

ape::plot.phylo(prk.tree,
                use.edge.length = TRUE,
                adj = 0,
                main = "PRKAA1 Phylogeny")

phylogenetic tree