Introduction

This code will analyze and compile data information on the gene FADS2 (Fatty Acid Desaturase 2). Many different components of analysis will be used including multiple sequence alignments, percent identity, and phylogenetic trees. This data will show how close of a relationship this gene has in humans compared to the other animals that have homologous genes.

Resources / References

Key information used to make this script can be found here: * Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/9415 *Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/3149

Other resources consulted includes: * Neanderthal genome: http://neandertal.ensemblgenomes.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000134824

Other interesting resources and online tools include: * BLAST alignment: https://blast.ncbi.nlm.nih.gov/Blast.cgi * Sub-cellular locations prediction in Uniprot: https://www.uniprot.org/uniprot/O95864

Preparation

Load necessary packages:

library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
##   is available with R version '4.1'; see https://bioconductor.org/install
#install("drawProteins")
library(drawProteins)

Load other packages:

#github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
#CRAN packages
library(rentrez)
library(seqinr)
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
library(pander)

library(ggplot2)

#Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:seqinr':
## 
##     translate
## The following object is masked from 'package:base':
## 
##     strsplit
## 
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
## 
##     version
#Biostrings
library(Biostrings)
# install("HGNChelper")
library(HGNChelper)

Accession Numbers

The homologous genes in other organisms were found both by using NCBI’s homologous gene page as well as using NCBI’s BLAST function. Both of these methods produced lists of homologous genes in other species with their accession numbers listed alongside the scientific names.

Two things that were left out of the table were drosophila and Neanderthals. This gene does not occur outside of vertebrates and so it is not found in drosophila. This gene was found in Neanderthals, but their was only a uniprot accession number given and it turned out to be the exact same as in the human gene.

The following function confirms the validity of the FADS2 gene name.

checkGeneSymbols(x = c("FADS2"))
## Maps last updated on: Thu Oct 24 12:31:05 2019
##       x Approved Suggested.Symbol
## 1 FADS2     TRUE            FADS2

Accession Number Table

As talked about above, this gene does not occur outside of vertebrates and was found in neanderthal’s, but does not have a RefSeq Accession Number.

We will now make a vector including the names and accession numbers of all of our species which will be turned into a table.

#                   RefSeq         Uniprot       PDB      Sci Name         Common Name   Gene name
fads2.table <- c("NP_004256.1",    "O95864",     "NA", "Homo sapiens",      "Human",       "Fads2",
                 "NP_062673.1",    "Q9Z0R9",     "NA", "Mus musculus",      "Mouse",       "Fads2",
                 "NP_001351410.1", "A0A6D2Y484", "NA", "Pan troglodytes",   "Chimpanzee",  "Fads2",
                 "XP_013976379.2", "NA",         "NA", "Canis lupus",       "Wolf",        "Fads2",
                 "NP_571720.2",    "Q9DEX7",     "NA", "Danio rerio",       "Zebrafish",   "Fads2",
                 "XP_027624528.1", "NA",         "NA", "Tupaia chinensis",  "Tree Shrew",  "Fads2",
                 "XP_012816067.1", "F6YR25",     "NA", "Xenopus tropicalis","Frog",        "Fads2",
                 "XP_006911280.1", "L5KQ77",     "NA", "Pteropus Alecto",   "Flying Fox",  "Fads2",
                 "NP_001076913.1", "A4FV48",     "NA", "Bovine taurus",     "Cow",         "Fads2",
                 "XP_032735091.1", "NA",         "NA", "Lontra Canadensis", "River Otter", "Fads2")

This chunk will convert our vector into a matrix and then create a dataframe to display a table of the information. Below is printed the dimensions of the table, the column names prior to being changed, and the final table.

# turn the vector into a matrix
fads2.matrix <- matrix(fads2.table, nrow = 10, ncol = 6, byrow = TRUE)

# the matrix is 10 by 6
dim(fads2.matrix)
## [1] 10  6
# make a datafram from the matrix
fads2.df <- as.data.frame(fads2.matrix)

# print current column names
colnames(fads2.df)
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
# rename the columns
colnames(fads2.df) <- c(c("ncbi.protein.accession", "UniProt.id", "PDB", 
                          "species", "common.name", "gene.name"))

# print out the data frame
pander(fads2.df)
Table continues below
ncbi.protein.accession UniProt.id PDB species common.name
NP_004256.1 O95864 NA Homo sapiens Human
NP_062673.1 Q9Z0R9 NA Mus musculus Mouse
NP_001351410.1 A0A6D2Y484 NA Pan troglodytes Chimpanzee
XP_013976379.2 NA NA Canis lupus Wolf
NP_571720.2 Q9DEX7 NA Danio rerio Zebrafish
XP_027624528.1 NA NA Tupaia chinensis Tree Shrew
XP_012816067.1 F6YR25 NA Xenopus tropicalis Frog
XP_006911280.1 L5KQ77 NA Pteropus Alecto Flying Fox
NP_001076913.1 A4FV48 NA Bovine taurus Cow
XP_032735091.1 NA NA Lontra Canadensis River Otter
gene.name
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2

Data Preparation

Download Sequences

We will use entrez_fetch_list to get a list of all of the FASTA files.

fads2.accesion <- fads2.df$ncbi.protein.accession
fads2.accesion
##  [1] "NP_004256.1"    "NP_062673.1"    "NP_001351410.1" "XP_013976379.2"
##  [5] "NP_571720.2"    "XP_027624528.1" "XP_012816067.1" "XP_006911280.1"
##  [9] "NP_001076913.1" "XP_032735091.1"
fads2.accesion.list <- entrez_fetch_list(db = "protein", id = fads2.accesion, rettype = "fasta")

Number of FASTA files obtained:

length(fads2.accesion.list)
## [1] 10

The first entry:

fads2.accesion.list[[1]]
## [1] ">NP_004256.1 acyl-CoA 6-desaturase isoform 1 [Homo sapiens]\nMGKGGNQGEGAAEREVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQRVIGHYAGEDATD\nAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRKTAEDMNLFKTNHVFFLLLLAHI\nIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWLQHDYGHLSVYRKPKWNHLVHKFVIGHLKGASA\nNWWNHRHFQHHAKPNIFHKDPDVNMLHVFVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQ\nYQIIMTMIVHKNWVDLAWAVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQE\nAYRDWFSSQLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAKHGIEYQEKPLL\nRALLDIIRSLKKSGKLWLDAYLHK\n\n"

Initial data cleaning

The first step will be to remove the FASTA header.

for(i in 1:length(fads2.accesion.list)){
  fads2.accesion.list[[i]] <- fasta_cleaner(fads2.accesion.list[[i]], parse = F)
}

General Protein Information

Protein Information

The following code creates an image that shows descriptions of the different parts of the gene.

fads2.uniprot <- get_features(fads2.df$UniProt.id[1])
## [1] "Download has worked"
fads2.prot <- feature_to_dataframe(fads2.uniprot)
fads2.canvas <- draw_canvas(fads2.prot)
fads2.canvas <- draw_chains(fads2.canvas, fads2.prot, label_size = 2.5)
fads2.canvas <- draw_recept_dom(fads2.canvas, fads2.prot)
fads2.canvas

Draw dotplot

We will prepare the data so that each letter can be compared.

human_fasta <- fads2.accesion.list[[1]]
fads2_vector <- fasta_cleaner(human_fasta)
dotPlot(fads2_vector, fads2_vector)

Let’s examine different settings to get a clearer picture

par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# plot 1:  - Defaults
dotPlot(fads2_vector,
        fads2_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "\nfads2 Defaults \n")

# plot - size = 10, nmatch = 1
dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 10, 
        nmatch = 1, 
        main = "\nfads2 - size = 10, nmatch = 1 \n")

# plot 3:  - size = 10, nmatch = 5
dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 10, 
        nmatch = 5, 
        main = "\nfads2 - size = 10, nmatch = 5 \n")

# plot 4: size = 20, nmatch = 5
dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 20,
        nmatch = 5,
        main = "\nfads2 - size = 20, nmatch = 5 \n")

# reset par() - run this or other plots will be small!
par(mfrow = c(1,1), 
    mar = c(4,4,4,4))

The bottom right dotplot seems to be clearest. Let’s print it larger so that we can see the data.

dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 20,
        nmatch = 5,
        main = "\nfads2 - size = 20, nmatch = 5 \n")

Protein properties compiled from databases

Many websites can give information on our gene. Here is a table of any additional information given from the sites listed below.

Pfam <- c("Cyt-b5 domain spanning 22-95aa and FA-desaturase domain 156-418aa")
Disprot <- c("No information available")
RepeatsDB <- c("No information available")
Uniprot.subcellular.location <- c("Endoplasmic reticulum membrane")
PDB <- c("No PDB entry available")
Alpha.fold <- c("The predicted protein is Acyl-CoA 6-desaturase and is completely made from alpha helixes")
IUpred2A <- c("There were two peaks that rose above 0.5")

protein.properties.df <- data.frame(Pfam = Pfam, Disprot = Disprot, RepeatsDB = RepeatsDB, 
                                    Uniprot.subcellular.location = Uniprot.subcellular.location,
                                    PDB = PDB, Alpha.fold = Alpha.fold, IUpred2A = IUpred2A)
pander(protein.properties.df)
Table continues below
Pfam Disprot
Cyt-b5 domain spanning 22-95aa and FA-desaturase domain 156-418aa No information available
Table continues below
RepeatsDB Uniprot.subcellular.location
No information available Endoplasmic reticulum membrane
Table continues below
PDB Alpha.fold
No PDB entry available The predicted protein is Acyl-CoA 6-desaturase and is completely made from alpha helixes
IUpred2A
There were two peaks that rose above 0.5

Protein feature prediction

Uniprot indicates that this protein is a Multi-pass membrane protein. They define this as a protein that spans the membrane more than once.

Predict protein fold

Alphafold indicates that this protein is solely consisted of alpha helixes. Therefore, I predict that the machine-learning methods will categorize this protein as α-all.

We will now use the method from Chou and Zhang to see what this analysis tells us about the protein fold.

aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
aa.1.1 alpha beta a.plus.b a.div.b
A 285 203 175 361
R 53 67 78 146
N 97 139 120 183
D 163 121 111 244
C 22 75 74 63
Q 67 122 74 114
E 134 86 86 257
G 197 297 171 377
H 111 49 33 107
I 91 120 93 239
L 221 177 110 339
K 249 115 112 321
M 48 16 25 91
F 123 85 52 158
P 82 127 71 188
S 122 341 126 327
T 119 253 117 238
W 33 44 30 72
Y 63 110 108 130
V 167 229 123 378

Calculate Proportions

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b.prop <- a.div.b/sum(a.div.b)

Create Dataframe

aa.prop <- data.frame(alpha.prop, beta.prop, a.plus.b.prop, a.div.b.prop)
rownames(aa.prop) <- aa.1.1
pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b.prop
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724

This function will convert our table into a vector.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}
fads2_human_table <- table(fads2_vector)/length(fads2_vector)
fads2.human.aa.freq <- table_to_vector(fads2_human_table)
fads2.human.aa.freq
##           A           C           D           E           F           G 
## 0.063063063 0.004504505 0.038288288 0.051801802 0.074324324 0.058558559 
##           H           I           K           L           M           N 
## 0.067567568 0.072072072 0.065315315 0.105855856 0.020270270 0.045045045 
##           P           Q           R           S           T           V 
## 0.045045045 0.042792793 0.036036036 0.040540541 0.038288288 0.051801802 
##           W           Y 
## 0.038288288 0.040540541

Add data on my focal protein to the amino acid frequency.

aa.prop$fads2.human.aa.freq <- fads2.human.aa.freq
pander(aa.prop)
Table continues below
  alpha.prop beta.prop a.plus.b.prop a.div.b.prop
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724
  fads2.human.aa.freq
A 0.06306
R 0.004505
N 0.03829
D 0.0518
C 0.07432
Q 0.05856
E 0.06757
G 0.07207
H 0.06532
I 0.1059
L 0.02027
K 0.04505
M 0.04505
F 0.04279
P 0.03604
S 0.04054
T 0.03829
W 0.0518
Y 0.03829
V 0.04054

Functions to calculate similarities

Two custom functions are needed: one to calculate the correlation between two columns of our table, and one to calculate correlation similarities.

# Correlation used in Chou and Zhang 1992.
chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

Calculate Correlation between each column

corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta  <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb   <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb   <- chou_cor(aa.prop[,5], aa.prop[,4])

Calculate cosine similarity

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta  <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb   <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb   <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate distance.

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped, 2)
##                        A    R    N    D    C    Q    E    G    H    I    L    K
## alpha.prop          0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop           0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop       0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b.prop        0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## fads2.human.aa.freq 0.06 0.00 0.04 0.05 0.07 0.06 0.07 0.07 0.07 0.11 0.02 0.05
##                        M    F    P    S    T    W    Y    V
## alpha.prop          0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop           0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop       0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b.prop        0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## fads2.human.aa.freq 0.05 0.04 0.04 0.04 0.04 0.05 0.04 0.04

We can get a distance matrix like this

dist(aa.prop.flipped, method = "euclidean")
##                     alpha.prop  beta.prop a.plus.b.prop a.div.b.prop
## beta.prop           0.13342098                                      
## a.plus.b.prop       0.09281824 0.08289406                           
## a.div.b.prop        0.06699039 0.08659174    0.06175113             
## fads2.human.aa.freq 0.15929515 0.16794894    0.13127207   0.14241948

We can also calculate the individual distances

dist.alpha <- dist((aa.prop.flipped[c(1,5),]),  method = "euclidean")
dist.beta  <- dist((aa.prop.flipped[c(2,5),]),  method = "euclidean")
dist.apb   <- dist((aa.prop.flipped[c(3,5),]),  method = "euclidean")
dist.adb  <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")

Compile the information. Rounding makes it easier to read.

# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")

# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")

df <- data.frame(fold.type,
           corr.sim ,
           cosine.sim ,
           Euclidean.dist ,
           sim.sum ,
           dist.sum )

Display Output

pander(df)
fold.type corr.sim cosine.sim Euclidean.dist sim.sum dist.sum
alpha 0.7994 0.7994 0.1593
beta 0.7806 0.7806 0.1679
alpha plus beta 0.8539 0.8539 0.1313 most.sim min.dist
alpha/beta 0.8316 0.8316 0.1424

Alpha plus Beta was predicted which does not match up with the prediction from Alphafold.

Subcellular location prediction

TBD

Percent Identity Comparisons (PID)

Data Preparation

Convert all FASTA records into entries in a single vector. The following for loop will take the accession numbers and associate them with the cleaned vector for the FASTA code.

fads2_list <- rep(NA, 0)
for(i in 1:length(fads2.accesion.list)){
    fads2_list[[fads2.df$ncbi.protein.accession[i]]] <- fads2.accesion.list[[i]]
}
is(fads2_list)
##  [1] "character"               "vector"                 
##  [3] "data.frameRowLabels"     "SuperClassMethod"       
##  [5] "EnumerationValue"        "character_OR_connection"
##  [7] "character_OR_NULL"       "atomic"                 
##  [9] "vector_OR_Vector"        "vector_OR_factor"
length(fads2_list)
## [1] 10
names(fads2_list)
##  [1] "NP_004256.1"    "NP_062673.1"    "NP_001351410.1" "XP_013976379.2"
##  [5] "NP_571720.2"    "XP_027624528.1" "XP_012816067.1" "XP_006911280.1"
##  [9] "NP_001076913.1" "XP_032735091.1"

PID table

data("BLOSUM50")
align01.03 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[3]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)

 align01.05 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[5]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 
 align01.08 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[8]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 align03.05 <- pairwiseAlignment(fads2.accesion.list[[3]], fads2_list[[5]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 align03.08 <- pairwiseAlignment(fads2.accesion.list[[3]], fads2_list[[8]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 align05.08 <- pairwiseAlignment(fads2.accesion.list[[5]], fads2_list[[8]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
pid(align01.03)
## [1] 100
pid(align01.05)
## [1] 64.86486
pid(align01.08)
## [1] 83.22296

This matrix will show the percent identity between the human, chimpanzee, zebrafish, and flying fox.

pids <- c(100,             NA,              NA,              NA,
          pid(align01.03), 100,             NA,              NA,
          pid(align01.05), pid(align03.05), 100,             NA,
          pid(align01.08), pid(align03.08), pid(align05.08), 100)

mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Pan","Fish","Fox")   
colnames(mat) <-  c("Homo","Pan","Fish","Fox")   
pander(mat)  
  Homo Pan Fish Fox
Homo 100 NA NA NA
Pan 100 100 NA NA
Fish 64.86 64.86 100 NA
Fox 83.22 83.22 63.13 100

PID methods comparison

Compare different PID methods. First we will compare humans and chimps

method <- c("PID1", "PID2", "PID3", "PID4")
PID.chimp <- c(pid(align01.03, type = "PID1"),
                pid(align01.03, type = "PID2"),
                pid(align01.03, type = "PID3"),
                pid(align01.03, type = "PID4"))
denominator <- c("(aligned positions + internal gap positions)",
                 "(aligned positions)",
                 "(length shorter sequence)",
                 "(average length of the two sequences)")

pid.chimp.comparison <- data.frame(method, PID = PID.chimp, denominator)
pander(pid.chimp.comparison)
method PID denominator
PID1 100 (aligned positions + internal gap positions)
PID2 100 (aligned positions)
PID3 100 (length shorter sequence)
PID4 100 (average length of the two sequences)

We will now compare humans and zebrafish

PID.fish <- c(pid(align01.05, type = "PID1"),
                pid(align01.05, type = "PID2"),
                pid(align01.05, type = "PID3"),
                pid(align01.05, type = "PID4"))

pid.fish.comparison <- data.frame(method, PID = PID.fish, denominator)
pander(pid.fish.comparison)
method PID denominator
PID1 64.86 (aligned positions + internal gap positions)
PID2 64.86 (aligned positions)
PID3 64.86 (length shorter sequence)
PID4 64.86 (average length of the two sequences)

Multiple Sequence Alignment

MSA data preparation

For R bioinformatics tools we are going to convert the vector into a string set

fads2_vector_ss <- AAStringSet(fads2_list)

Building Multiple Sequence Alignment (MSA)

fads2_align <-msa(fads2_vector_ss,
                     method = "ClustalW")
## use default substitution matrix

Cleaning / setting up an MSA

MSA produces a species MSA object

class(fads2_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(fads2_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"    "MsaMetaData"           
## [4] "MultipleAlignment"

Default output of MSA

fads2_align
## CLUSTAL 2.1  
## 
## Call:
##    msa(fads2_vector_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 10 rows and 455 columns
##      aln                                                   names
##  [1] MGKGGNQGEGAAER--EVSVPTFSW...LRALLDIIRSLKKSGKLWLDAYLHK NP_004256.1
##  [2] MGKGGNQGEGAAER--EVSVPTFSW...LRALLDIIRSLKKSGKLWLDAYLHK NP_001351410.1
##  [3] MGKGGNQGEGAAER--EAPLQTFCW...LRALQDIIRSLKKSGELWLDAYLHK XP_013976379.2
##  [4] MGKGGNQGEGATER--EAPMPTFSW...LRALQDIIRSLKKSGELWLDAYLHK XP_032735091.1
##  [5] MGKGGNQDEGATEL--EAPMPTFRW...LRALQDIIGSLRKSGQLWLDAYLHK NP_001076913.1
##  [6] MGKGGNQGEGATER--EVPMPTFSW...LRALRDIIRSLKKSGDLWLDAYLHK XP_027624528.1
##  [7] MGKGGNQGEGSTER--QAPMPTFRW...LRALIDIVSSLKKSGELWLDAYLHK NP_062673.1
##  [8] MGKGGNQGEGATER--EAPIPTFRW...LRALLDVIRSLKKSGDLWLDAYLHK XP_006911280.1
##  [9] MGMGGQSGEGCSSGNCVKPEARYSW...YHGFRDVFRSLKKSGQLWLDAYLHK XP_012816067.1
## [10] MGGGGQQTDRITDT--NGRFSSYTW...YGAFADIIRSLEKSGELWLDAYLNK NP_571720.2
##  Con MGKGGNQGEGATER--EAP?PTFSW...LRAL?DIIRSLKKSG?LWLDAYLHK Consensus

Change class of alignment

class(fads2_align) <- "AAMultipleAlignment"
class(fads2_align)
## [1] "AAMultipleAlignment"
fads2_align_seqinr <- msaConvert(fads2_align, type = "seqinr::alignment")

Show output of the MSA

print_msa(fads2_align_seqinr)
## [1] "MGKGGNQGEGAAER--EVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQR 0"
## [1] "MGKGGNQGEGAAER--EVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQR 0"
## [1] "MGKGGNQGEGAAER--EAPLQTFCWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGHR 0"
## [1] "MGKGGNQGEGATER--EAPMPTFSWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGQR 0"
## [1] "MGKGGNQDEGATEL--EAPMPTFRWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGQR 0"
## [1] "MGKGGNQGEGATER--EVPMPTFSWEEIQKHNLRTDKWLVIDRKVYNITKWSNRHPGGHR 0"
## [1] "MGKGGNQGEGSTER--QAPMPTFRWEEIQKHNLRTDRWLVIDRKVYNVTKWSQRHPGGHR 0"
## [1] "MGKGGNQGEGATER--EAPIPTFRWEEIQKHNVRTDQWLVVDRKVYNVTKWASRHPGGHR 0"
## [1] "MGMGGQSGEGCSSGNCVKPEARYSWEEIQKHNLKTDKWLVIERKVYNITQWVKCHPGGMR 0"
## [1] "MGGGGQQTDRITDT--NGRFSSYTWEEMQKHTKHGDQWVVVERKVYNVSQWVKRHPGGLR 0"
## [1] " "
## [1] "VIGHYAGEDATDAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFHAFHRDLDFVRKFMKPLLIGELAPEEPSQDHGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFLAFHRDLDFVRKFMKPLLIGELAPEEPSQDHGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFLAFHRNLDFVRKFMKPLLIGELAPEEPSQDRGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFRAFHPNLDFVRKFMKPLLIGELAPEEPSQDRGKNSQITEDFRALKK 0"
## [1] "VIGHYSGEDATDAFRAFHLDLDFVGKFLKPLLIGELAPEEPSLDRGKSSQITEDFRALKK 0"
## [1] "VIGHYAGEDATDAFHAFHIDLDFVGKFLKPLLVGELAPEEPSQDRGKNSQIIEDFRALKK 0"
## [1] "VIGHYAGEDATDAFHAFHPDKNFVRKFLKPLYVGELAENEPSQDRDKNAQQVEDFRALRK 0"
## [1] "ILGHYAGEDATEAFTAFHPNLQLMRKYLKPLLIGELEASEPSQDRQKNAALVEDFRALRE 0"
## [1] " "
## [1] "TAEDMNLFKTNHVFFLLLLAHIIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKTNHVFFLLLLAHIIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKSDHLFFLLLLAHIIVMESIAWFIIFYFGNGWISTVITAFVLATSQAQAGWL 0"
## [1] "AAEDMNLFKSNHLFFLVLLAHIIVMESIAWFIIFYFGNGWIPTVITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKSNQLFFLLHLAHIIAMESIAWFTLFYFGNGWIPTIITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKANHLFFLLLLAHIIVMESIAWFTIFYFGNGWIPTVITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKTNHLFFFLLLSHIIVMESLAWFILSYFGTGWIPTLVTAFVLATSQAQAGWL 0"
## [1] "TAQDMNLFKSNHLFFLLLLAHILVMEAIAWLIVFYFGNGWITTIIASFILATSQVQTGWL 0"
## [1] "TAEDMGLFKSNPAFFIVYLFHILLIEFLAWCTLHYFGTGWIPTILTVLLLTISQAQAGWL 0"
## [1] "RLEAEGCFKTQPLFFALHLGHILLLEAIAFVMVWYFGTGWINTLIVAVILATAQSQAGWL 0"
## [1] " "
## [1] "QHDYGHLSVYRKPKWNHLVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYRKPKWNHLVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSMWNHVVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSSWNHVVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSMWNHIVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSTWNHLVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSIWNHVVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDIKSLHV 0"
## [1] "QHDLGHLSVYKKSKWNHLAHKFVIGHLKGASSNWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDFGHLSVFKKSKWNHLIHKFIIGHLKGASANWWNHRHFQHHAKPNIFRKDPDVNMVNV 0"
## [1] "QHDFGHLSVFKTSGMNHLVHKFVIGHLKGASAGWWNHRHFQHHAKPNIFKKDPDVNMLNA 0"
## [1] " "
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPVYFQYQIIMTMIVRRDWVDLAW 0"
## [1] "FVLGESQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPVYFQYQIIMTMISRHDWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPLYFQYQIIMTMIVRKYWADLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVRKDWVDLAW 0"
## [1] "FVLGEWQPLEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMISRRDWVDLAW 0"
## [1] "LVLGEWQPIEYGKKKLKYLPYNHQHQYFFLIGPPLLIPIYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGDTQPVEFGKKRIKYLPYNHQHLYFFLIGPPLLIPIYFTIQIMKTMITRKDWVDLAW 0"
## [1] "FVVGNVQPVEYGVKKIKHLPYNHQHKYFFFIGPPLLIPVYFQFQIFHNMISHGMWVDLLW 0"
## [1] " "
## [1] "AVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQEAYRDWFSS 0"
## [1] "AVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQEAYRDWFSS 0"
## [1] "AMSYYVRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFSS 0"
## [1] "AMSYYARFFITYIPFYGILGAVIFLNFIRFLESHWFVWVTQMNHIVMEIDHEPYRDWFSN 0"
## [1] "AISYYTRFFITYIPFYGVLGSILFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFSS 0"
## [1] "AISYYARFFITYIPFYGVLGALLFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFRT 0"
## [1] "AISYYMRFFYTYIPFYGILGALVFLNFIRFLESHWFVWVTQMNHLVMEIDLDHYRDWFSS 0"
## [1] "SISYYVRFFISYVPFYGFLGSILFFNFIRFLESHWFVWVTQMNHIVMEINEEPYRPLFLL 0"
## [1] "SVSYYARFFITFVPFYGVLGSLALLNAVRFIESHWFVWVTQMNHLPMVIDHEKYKDWLET 0"
## [1] "CISYYVRYFLCYTQFYGVFWAIILFNFVRFMESHWFVWVTQMSRIPMNIDYEQNQDWLSM 0"
## [1] " "
## [1] "---------QLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKVAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNFHKVAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVRSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNFHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "PGPHPPSPEQLAATCNVEQSLFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNIEPSFFNDWFSGHLNFQIEHHLFPTMPRHNYWKIAPLVRSLCSK 0"
## [1] "---------QLVATCNIEQSAFNDWFSGHLNFQIEHHLFPTMPRHNYWRAAPRVRSLCEK 0"
## [1] " "
## [1] "HGIEYQEKPLLRALLDIIRSLKKSGKLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALLDIIRSLKKSGKLWLDAYLHK 25"
## [1] "HGIKYQEKPLLRALQDIIRSLKKSGELWLDAYLHK 25"
## [1] "HGIKYQEKPLLRALQDIIRSLKKSGELWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALQDIIGSLRKSGQLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALRDIIRSLKKSGDLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALIDIVSSLKKSGELWLDAYLHK 25"
## [1] "HGIQYQEKRLLRALLDVIRSLKKSGDLWLDAYLHK 25"
## [1] "YNVTYEEKCLYHGFRDVFRSLKKSGQLWLDAYLHK 25"
## [1] "YGVKYQEKTLYGAFADIIRSLEKSGELWLDAYLNK 25"
## [1] " "

Finished MSA

ggmsa(fads2_align, start = 350, end = 400)

Distance Matrix

Make a distance matrix

fads2_subset_dist <- dist.alignment(fads2_align_seqinr, 
                                       matrix = "identity")

is(fads2_subset_dist)
## [1] "dist"     "oldClass"
class(fads2_subset_dist)
## [1] "dist"

Round for display

fads2_subset_dist_rnd <- round(fads2_subset_dist, 3)
fads2_subset_dist_rnd
##                NP_004256.1 NP_001351410.1 XP_013976379.2 XP_032735091.1
## NP_001351410.1       0.000                                             
## XP_013976379.2       0.308          0.308                              
## XP_032735091.1       0.322          0.322          0.212               
## NP_001076913.1       0.322          0.322          0.281          0.289
## XP_027624528.1       0.289          0.289          0.256          0.256
## NP_062673.1          0.352          0.352          0.336          0.342
## XP_006911280.1       0.388          0.388          0.374          0.391
## XP_012816067.1       0.531          0.531          0.528          0.522
## NP_571720.2          0.593          0.593          0.593          0.591
##                NP_001076913.1 XP_027624528.1 NP_062673.1 XP_006911280.1
## NP_001351410.1                                                         
## XP_013976379.2                                                         
## XP_032735091.1                                                         
## NP_001076913.1                                                         
## XP_027624528.1          0.268                                          
## NP_062673.1             0.355          0.329                           
## XP_006911280.1          0.380          0.368       0.416               
## XP_012816067.1          0.522          0.509       0.543          0.543
## NP_571720.2             0.589          0.591       0.606          0.597
##                XP_012816067.1
## NP_001351410.1               
## XP_013976379.2               
## XP_032735091.1               
## NP_001076913.1               
## XP_027624528.1               
## NP_062673.1                  
## XP_006911280.1               
## XP_012816067.1               
## NP_571720.2             0.587

Phylogenetic trees of sequences

Build a phylogenetic tree from matrix

tree_subset <- nj(fads2_subset_dist)

Plotting phylogenetic trees

Plot the tree.

plot.phylo(tree_subset, main="Phylogenetic Tree", 
            use.edge.length = F)

mtext(text = "\nFads2 family gene tree - rooted, no branch lenths")