Introduction

This code will analyze and compile data information on the gene FADS2 (Fatty Acid Desaturase 2). Many different components of analysis will be used including multiple sequence alignments, percent identity, and phylogenetic trees. This data will show how close of a relationship this gene has in humans compared to the other animals that have homologous genes.

Resources / References

Key information used to make this script can be found here: * Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/9415 *Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/3149

Other resources consulted includes: * Neanderthal genome: http://neandertal.ensemblgenomes.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000134824

Other interesting resources and online tools include: * BLAST alignment: https://blast.ncbi.nlm.nih.gov/Blast.cgi * Sub-cellular locations prediction in Uniprot: https://www.uniprot.org/uniprot/O95864

Preparation

Load necessary packages:

library(BiocManager)

## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
##   is available with R version '4.1'; see https://bioconductor.org/install

#install("drawProteins")
library(drawProteins)

Load other packages:

#github packages
library(compbio4all)
library(ggmsa)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

#CRAN packages
library(rentrez)
library(seqinr)
library(ape)

## 
## Attaching package: 'ape'

## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus

library(pander)

library(ggplot2)

#Bioconductor packages
library(msa)

## Loading required package: Biostrings

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## Loading required package: stats4

## 
## Attaching package: 'S4Vectors'

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## Loading required package: XVector

## Loading required package: GenomeInfoDb

## 
## Attaching package: 'Biostrings'

## The following object is masked from 'package:ape':
## 
##     complement

## The following object is masked from 'package:seqinr':
## 
##     translate

## The following object is masked from 'package:base':
## 
##     strsplit

## 
## Attaching package: 'msa'

## The following object is masked from 'package:BiocManager':
## 
##     version

#Biostrings
library(Biostrings)

# install("HGNChelper")
library(HGNChelper)

Accession Numbers

The homologous genes in other organisms were found both by using NCBI’s homologous gene page as well as using NCBI’s BLAST function. Both of these methods produced lists of homologous genes in other species with their accession numbers listed alongside the scientific names.

Two things that were left out of the table were drosophila and Neanderthals. This gene does not occur outside of vertebrates and so it is not found in drosophila. This gene was found in Neanderthals, but their was only a uniprot accession number given and it turned out to be the exact same as in the human gene.

The following function confirms the validity of the FADS2 gene name.

checkGeneSymbols(x = c("FADS2"))

## Maps last updated on: Thu Oct 24 12:31:05 2019

##       x Approved Suggested.Symbol
## 1 FADS2     TRUE            FADS2

Accession Number Table

As talked about above, this gene does not occur outside of vertebrates and was found in neanderthal’s, but does not have a RefSeq Accession Number.

We will now make a vector including the names and accession numbers of all of our species which will be turned into a table.

#                   RefSeq         Uniprot       PDB      Sci Name         Common Name   Gene name
fads2.table <- c("NP_004256.1",    "O95864",     "NA", "Homo sapiens",      "Human",       "Fads2",
                 "NP_062673.1",    "Q9Z0R9",     "NA", "Mus musculus",      "Mouse",       "Fads2",
                 "NP_001351410.1", "A0A6D2Y484", "NA", "Pan troglodytes",   "Chimpanzee",  "Fads2",
                 "XP_013976379.2", "NA",         "NA", "Canis lupus",       "Wolf",        "Fads2",
                 "NP_571720.2",    "Q9DEX7",     "NA", "Danio rerio",       "Zebrafish",   "Fads2",
                 "XP_027624528.1", "NA",         "NA", "Tupaia chinensis",  "Tree Shrew",  "Fads2",
                 "XP_012816067.1", "F6YR25",     "NA", "Xenopus tropicalis","Frog",        "Fads2",
                 "XP_006911280.1", "L5KQ77",     "NA", "Pteropus Alecto",   "Flying Fox",  "Fads2",
                 "NP_001076913.1", "A4FV48",     "NA", "Bovine taurus",     "Cow",         "Fads2",
                 "XP_032735091.1", "NA",         "NA", "Lontra Canadensis", "River Otter", "Fads2")

This chunk will convert our vector into a matrix and then create a dataframe to display a table of the information. Below is printed the dimensions of the table, the column names prior to being changed, and the final table.

# turn the vector into a matrix
fads2.matrix <- matrix(fads2.table, nrow = 10, ncol = 6, byrow = TRUE)

# the matrix is 10 by 6
dim(fads2.matrix)

## [1] 10  6

# make a datafram from the matrix
fads2.df <- as.data.frame(fads2.matrix)

# print current column names
colnames(fads2.df)

## [1] "V1" "V2" "V3" "V4" "V5" "V6"

# rename the columns
colnames(fads2.df) <- c(c("ncbi.protein.accession", "UniProt.id", "PDB", 
                          "species", "common.name", "gene.name"))

# print out the data frame
pander(fads2.df)

Table continues below
ncbi.protein.accession	UniProt.id	PDB	species	common.name
NP_004256.1	O95864	NA	Homo sapiens	Human
NP_062673.1	Q9Z0R9	NA	Mus musculus	Mouse
NP_001351410.1	A0A6D2Y484	NA	Pan troglodytes	Chimpanzee
XP_013976379.2	NA	NA	Canis lupus	Wolf
NP_571720.2	Q9DEX7	NA	Danio rerio	Zebrafish
XP_027624528.1	NA	NA	Tupaia chinensis	Tree Shrew
XP_012816067.1	F6YR25	NA	Xenopus tropicalis	Frog
XP_006911280.1	L5KQ77	NA	Pteropus Alecto	Flying Fox
NP_001076913.1	A4FV48	NA	Bovine taurus	Cow
XP_032735091.1	NA	NA	Lontra Canadensis	River Otter

gene.name
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2
Fads2

Data Preparation

Download Sequences

We will use entrez_fetch_list to get a list of all of the FASTA files.

fads2.accesion <- fads2.df$ncbi.protein.accession
fads2.accesion

##  [1] "NP_004256.1"    "NP_062673.1"    "NP_001351410.1" "XP_013976379.2"
##  [5] "NP_571720.2"    "XP_027624528.1" "XP_012816067.1" "XP_006911280.1"
##  [9] "NP_001076913.1" "XP_032735091.1"

fads2.accesion.list <- entrez_fetch_list(db = "protein", id = fads2.accesion, rettype = "fasta")

Number of FASTA files obtained:

length(fads2.accesion.list)

## [1] 10

The first entry:

fads2.accesion.list[[1]]

## [1] ">NP_004256.1 acyl-CoA 6-desaturase isoform 1 [Homo sapiens]\nMGKGGNQGEGAAEREVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQRVIGHYAGEDATD\nAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRKTAEDMNLFKTNHVFFLLLLAHI\nIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWLQHDYGHLSVYRKPKWNHLVHKFVIGHLKGASA\nNWWNHRHFQHHAKPNIFHKDPDVNMLHVFVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQ\nYQIIMTMIVHKNWVDLAWAVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQE\nAYRDWFSSQLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAKHGIEYQEKPLL\nRALLDIIRSLKKSGKLWLDAYLHK\n\n"

Initial data cleaning

The first step will be to remove the FASTA header.

for(i in 1:length(fads2.accesion.list)){
  fads2.accesion.list[[i]] <- fasta_cleaner(fads2.accesion.list[[i]], parse = F)
}

General Protein Information

Protein Information

The following code creates an image that shows descriptions of the different parts of the gene.

fads2.uniprot <- get_features(fads2.df$UniProt.id[1])

## [1] "Download has worked"

fads2.prot <- feature_to_dataframe(fads2.uniprot)
fads2.canvas <- draw_canvas(fads2.prot)
fads2.canvas <- draw_chains(fads2.canvas, fads2.prot, label_size = 2.5)
fads2.canvas <- draw_recept_dom(fads2.canvas, fads2.prot)
fads2.canvas

Draw dotplot

We will prepare the data so that each letter can be compared.

human_fasta <- fads2.accesion.list[[1]]
fads2_vector <- fasta_cleaner(human_fasta)
dotPlot(fads2_vector, fads2_vector)

Let’s examine different settings to get a clearer picture

par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# plot 1:  - Defaults
dotPlot(fads2_vector,
        fads2_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "\nfads2 Defaults \n")

# plot - size = 10, nmatch = 1
dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 10, 
        nmatch = 1, 
        main = "\nfads2 - size = 10, nmatch = 1 \n")

# plot 3:  - size = 10, nmatch = 5
dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 10, 
        nmatch = 5, 
        main = "\nfads2 - size = 10, nmatch = 5 \n")

# plot 4: size = 20, nmatch = 5
dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 20,
        nmatch = 5,
        main = "\nfads2 - size = 20, nmatch = 5 \n")

# reset par() - run this or other plots will be small!
par(mfrow = c(1,1), 
    mar = c(4,4,4,4))

The bottom right dotplot seems to be clearest. Let’s print it larger so that we can see the data.

dotPlot(fads2_vector, 
        fads2_vector, 
        wsize = 20,
        nmatch = 5,
        main = "\nfads2 - size = 20, nmatch = 5 \n")

Protein properties compiled from databases

Many websites can give information on our gene. Here is a table of any additional information given from the sites listed below.

Pfam <- c("Cyt-b5 domain spanning 22-95aa and FA-desaturase domain 156-418aa")
Disprot <- c("No information available")
RepeatsDB <- c("No information available")
Uniprot.subcellular.location <- c("Endoplasmic reticulum membrane")
PDB <- c("No PDB entry available")
Alpha.fold <- c("The predicted protein is Acyl-CoA 6-desaturase and is completely made from alpha helixes")
IUpred2A <- c("There were two peaks that rose above 0.5")

protein.properties.df <- data.frame(Pfam = Pfam, Disprot = Disprot, RepeatsDB = RepeatsDB, 
                                    Uniprot.subcellular.location = Uniprot.subcellular.location,
                                    PDB = PDB, Alpha.fold = Alpha.fold, IUpred2A = IUpred2A)
pander(protein.properties.df)

Table continues below
Pfam	Disprot
Cyt-b5 domain spanning 22-95aa and FA-desaturase domain 156-418aa	No information available

Table continues below
RepeatsDB	Uniprot.subcellular.location
No information available	Endoplasmic reticulum membrane

Table continues below
PDB	Alpha.fold
No PDB entry available	The predicted protein is Acyl-CoA 6-desaturase and is completely made from alpha helixes

IUpred2A
There were two peaks that rose above 0.5

Protein feature prediction

Uniprot indicates that this protein is a Multi-pass membrane protein. They define this as a protein that spans the membrane more than once.

Predict protein fold

Alphafold indicates that this protein is solely consisted of alpha helixes. Therefore, I predict that the machine-learning methods will categorize this protein as α-all.

We will now use the method from Chou and Zhang to see what this analysis tells us about the protein fold.

aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))

aa.1.1	alpha	beta	a.plus.b	a.div.b
A	285	203	175	361
R	53	67	78	146
N	97	139	120	183
D	163	121	111	244
C	22	75	74	63
Q	67	122	74	114
E	134	86	86	257
G	197	297	171	377
H	111	49	33	107
I	91	120	93	239
L	221	177	110	339
K	249	115	112	321
M	48	16	25	91
F	123	85	52	158
P	82	127	71	188
S	122	341	126	327
T	119	253	117	238
W	33	44	30	72
Y	63	110	108	130
V	167	229	123	378

Calculate Proportions

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b.prop <- a.div.b/sum(a.div.b)

Create Dataframe

aa.prop <- data.frame(alpha.prop, beta.prop, a.plus.b.prop, a.div.b.prop)
rownames(aa.prop) <- aa.1.1
pander(aa.prop)

	alpha.prop	beta.prop	a.plus.b.prop	a.div.b.prop
A	0.1165	0.07313	0.09264	0.08331
R	0.02166	0.02414	0.04129	0.03369
N	0.03964	0.05007	0.06353	0.04223
D	0.06661	0.04359	0.05876	0.05631
C	0.008991	0.02702	0.03917	0.01454
Q	0.02738	0.04395	0.03917	0.02631
E	0.05476	0.03098	0.04553	0.05931
G	0.08051	0.107	0.09052	0.08701
H	0.04536	0.01765	0.01747	0.02469
I	0.03719	0.04323	0.04923	0.05516
L	0.09031	0.06376	0.05823	0.07824
K	0.1018	0.04143	0.05929	0.07408
M	0.01962	0.005764	0.01323	0.021
F	0.05027	0.03062	0.02753	0.03646
P	0.03351	0.04575	0.03759	0.04339
S	0.04986	0.1228	0.0667	0.07547
T	0.04863	0.09114	0.06194	0.05493
W	0.01349	0.01585	0.01588	0.01662
Y	0.02575	0.03963	0.05717	0.03
V	0.06825	0.08249	0.06511	0.08724

This function will convert our table into a vector.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}

fads2_human_table <- table(fads2_vector)/length(fads2_vector)
fads2.human.aa.freq <- table_to_vector(fads2_human_table)
fads2.human.aa.freq

##           A           C           D           E           F           G 
## 0.063063063 0.004504505 0.038288288 0.051801802 0.074324324 0.058558559 
##           H           I           K           L           M           N 
## 0.067567568 0.072072072 0.065315315 0.105855856 0.020270270 0.045045045 
##           P           Q           R           S           T           V 
## 0.045045045 0.042792793 0.036036036 0.040540541 0.038288288 0.051801802 
##           W           Y 
## 0.038288288 0.040540541

Add data on my focal protein to the amino acid frequency.

aa.prop$fads2.human.aa.freq <- fads2.human.aa.freq
pander(aa.prop)

Table continues below
	alpha.prop	beta.prop	a.plus.b.prop	a.div.b.prop
A	0.1165	0.07313	0.09264	0.08331
R	0.02166	0.02414	0.04129	0.03369
N	0.03964	0.05007	0.06353	0.04223
D	0.06661	0.04359	0.05876	0.05631
C	0.008991	0.02702	0.03917	0.01454
Q	0.02738	0.04395	0.03917	0.02631
E	0.05476	0.03098	0.04553	0.05931
G	0.08051	0.107	0.09052	0.08701
H	0.04536	0.01765	0.01747	0.02469
I	0.03719	0.04323	0.04923	0.05516
L	0.09031	0.06376	0.05823	0.07824
K	0.1018	0.04143	0.05929	0.07408
M	0.01962	0.005764	0.01323	0.021
F	0.05027	0.03062	0.02753	0.03646
P	0.03351	0.04575	0.03759	0.04339
S	0.04986	0.1228	0.0667	0.07547
T	0.04863	0.09114	0.06194	0.05493
W	0.01349	0.01585	0.01588	0.01662
Y	0.02575	0.03963	0.05717	0.03
V	0.06825	0.08249	0.06511	0.08724

	fads2.human.aa.freq
A	0.06306
R	0.004505
N	0.03829
D	0.0518
C	0.07432
Q	0.05856
E	0.06757
G	0.07207
H	0.06532
I	0.1059
L	0.02027
K	0.04505
M	0.04505
F	0.04279
P	0.03604
S	0.04054
T	0.03829
W	0.0518
Y	0.03829
V	0.04054

Functions to calculate similarities

Two custom functions are needed: one to calculate the correlation between two columns of our table, and one to calculate correlation similarities.

# Correlation used in Chou and Zhang 1992.
chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

Calculate Correlation between each column

corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta  <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb   <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb   <- chou_cor(aa.prop[,5], aa.prop[,4])

Calculate cosine similarity

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta  <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb   <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb   <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate distance.

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped, 2)

##                        A    R    N    D    C    Q    E    G    H    I    L    K
## alpha.prop          0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop           0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop       0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b.prop        0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## fads2.human.aa.freq 0.06 0.00 0.04 0.05 0.07 0.06 0.07 0.07 0.07 0.11 0.02 0.05
##                        M    F    P    S    T    W    Y    V
## alpha.prop          0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop           0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop       0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b.prop        0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## fads2.human.aa.freq 0.05 0.04 0.04 0.04 0.04 0.05 0.04 0.04

We can get a distance matrix like this

dist(aa.prop.flipped, method = "euclidean")

##                     alpha.prop  beta.prop a.plus.b.prop a.div.b.prop
## beta.prop           0.13342098                                      
## a.plus.b.prop       0.09281824 0.08289406                           
## a.div.b.prop        0.06699039 0.08659174    0.06175113             
## fads2.human.aa.freq 0.15929515 0.16794894    0.13127207   0.14241948

We can also calculate the individual distances

dist.alpha <- dist((aa.prop.flipped[c(1,5),]),  method = "euclidean")
dist.beta  <- dist((aa.prop.flipped[c(2,5),]),  method = "euclidean")
dist.apb   <- dist((aa.prop.flipped[c(3,5),]),  method = "euclidean")
dist.adb  <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")

Compile the information. Rounding makes it easier to read.

# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")

# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")

df <- data.frame(fold.type,
           corr.sim ,
           cosine.sim ,
           Euclidean.dist ,
           sim.sum ,
           dist.sum )

Display Output

pander(df)

fold.type	corr.sim	cosine.sim	Euclidean.dist	sim.sum	dist.sum
alpha	0.7994	0.7994	0.1593
beta	0.7806	0.7806	0.1679
alpha plus beta	0.8539	0.8539	0.1313	most.sim	min.dist
alpha/beta	0.8316	0.8316	0.1424

Alpha plus Beta was predicted which does not match up with the prediction from Alphafold.

Subcellular location prediction

TBD

Percent Identity Comparisons (PID)

Data Preparation

Convert all FASTA records into entries in a single vector. The following for loop will take the accession numbers and associate them with the cleaned vector for the FASTA code.

fads2_list <- rep(NA, 0)
for(i in 1:length(fads2.accesion.list)){
    fads2_list[[fads2.df$ncbi.protein.accession[i]]] <- fads2.accesion.list[[i]]
}
is(fads2_list)

##  [1] "character"               "vector"                 
##  [3] "data.frameRowLabels"     "SuperClassMethod"       
##  [5] "EnumerationValue"        "character_OR_connection"
##  [7] "character_OR_NULL"       "atomic"                 
##  [9] "vector_OR_Vector"        "vector_OR_factor"

length(fads2_list)

## [1] 10

names(fads2_list)

##  [1] "NP_004256.1"    "NP_062673.1"    "NP_001351410.1" "XP_013976379.2"
##  [5] "NP_571720.2"    "XP_027624528.1" "XP_012816067.1" "XP_006911280.1"
##  [9] "NP_001076913.1" "XP_032735091.1"

PID table

data("BLOSUM50")
align01.03 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[3]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)

 align01.05 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[5]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 
 align01.08 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[8]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 align03.05 <- pairwiseAlignment(fads2.accesion.list[[3]], fads2_list[[5]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 align03.08 <- pairwiseAlignment(fads2.accesion.list[[3]], fads2_list[[8]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)
 align05.08 <- pairwiseAlignment(fads2.accesion.list[[5]], fads2_list[[8]],
                                substitutionMatrix = "BLOSUM50", 
                                              scoreOnly = FALSE)

pid(align01.03)

## [1] 100

pid(align01.05)

## [1] 64.86486

pid(align01.08)

## [1] 83.22296

This matrix will show the percent identity between the human, chimpanzee, zebrafish, and flying fox.

pids <- c(100,             NA,              NA,              NA,
          pid(align01.03), 100,             NA,              NA,
          pid(align01.05), pid(align03.05), 100,             NA,
          pid(align01.08), pid(align03.08), pid(align05.08), 100)

mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Pan","Fish","Fox")   
colnames(mat) <-  c("Homo","Pan","Fish","Fox")   
pander(mat)

	Homo	Pan	Fish	Fox
Homo	100	NA	NA	NA
Pan	100	100	NA	NA
Fish	64.86	64.86	100	NA
Fox	83.22	83.22	63.13	100

PID methods comparison

Compare different PID methods. First we will compare humans and chimps

method <- c("PID1", "PID2", "PID3", "PID4")
PID.chimp <- c(pid(align01.03, type = "PID1"),
                pid(align01.03, type = "PID2"),
                pid(align01.03, type = "PID3"),
                pid(align01.03, type = "PID4"))
denominator <- c("(aligned positions + internal gap positions)",
                 "(aligned positions)",
                 "(length shorter sequence)",
                 "(average length of the two sequences)")

pid.chimp.comparison <- data.frame(method, PID = PID.chimp, denominator)
pander(pid.chimp.comparison)

method	PID	denominator
PID1	100	(aligned positions + internal gap positions)
PID2	100	(aligned positions)
PID3	100	(length shorter sequence)
PID4	100	(average length of the two sequences)

We will now compare humans and zebrafish

PID.fish <- c(pid(align01.05, type = "PID1"),
                pid(align01.05, type = "PID2"),
                pid(align01.05, type = "PID3"),
                pid(align01.05, type = "PID4"))

pid.fish.comparison <- data.frame(method, PID = PID.fish, denominator)
pander(pid.fish.comparison)

method	PID	denominator
PID1	64.86	(aligned positions + internal gap positions)
PID2	64.86	(aligned positions)
PID3	64.86	(length shorter sequence)
PID4	64.86	(average length of the two sequences)

Multiple Sequence Alignment

MSA data preparation

For R bioinformatics tools we are going to convert the vector into a string set

fads2_vector_ss <- AAStringSet(fads2_list)

Building Multiple Sequence Alignment (MSA)

fads2_align <-msa(fads2_vector_ss,
                     method = "ClustalW")

## use default substitution matrix

Cleaning / setting up an MSA

MSA produces a species MSA object

class(fads2_align)

## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"

is(fads2_align)

## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"    "MsaMetaData"           
## [4] "MultipleAlignment"

Default output of MSA

fads2_align

## CLUSTAL 2.1  
## 
## Call:
##    msa(fads2_vector_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 10 rows and 455 columns
##      aln                                                   names
##  [1] MGKGGNQGEGAAER--EVSVPTFSW...LRALLDIIRSLKKSGKLWLDAYLHK NP_004256.1
##  [2] MGKGGNQGEGAAER--EVSVPTFSW...LRALLDIIRSLKKSGKLWLDAYLHK NP_001351410.1
##  [3] MGKGGNQGEGAAER--EAPLQTFCW...LRALQDIIRSLKKSGELWLDAYLHK XP_013976379.2
##  [4] MGKGGNQGEGATER--EAPMPTFSW...LRALQDIIRSLKKSGELWLDAYLHK XP_032735091.1
##  [5] MGKGGNQDEGATEL--EAPMPTFRW...LRALQDIIGSLRKSGQLWLDAYLHK NP_001076913.1
##  [6] MGKGGNQGEGATER--EVPMPTFSW...LRALRDIIRSLKKSGDLWLDAYLHK XP_027624528.1
##  [7] MGKGGNQGEGSTER--QAPMPTFRW...LRALIDIVSSLKKSGELWLDAYLHK NP_062673.1
##  [8] MGKGGNQGEGATER--EAPIPTFRW...LRALLDVIRSLKKSGDLWLDAYLHK XP_006911280.1
##  [9] MGMGGQSGEGCSSGNCVKPEARYSW...YHGFRDVFRSLKKSGQLWLDAYLHK XP_012816067.1
## [10] MGGGGQQTDRITDT--NGRFSSYTW...YGAFADIIRSLEKSGELWLDAYLNK NP_571720.2
##  Con MGKGGNQGEGATER--EAP?PTFSW...LRAL?DIIRSLKKSG?LWLDAYLHK Consensus

Change class of alignment

class(fads2_align) <- "AAMultipleAlignment"
class(fads2_align)

## [1] "AAMultipleAlignment"

fads2_align_seqinr <- msaConvert(fads2_align, type = "seqinr::alignment")

Show output of the MSA

print_msa(fads2_align_seqinr)

## [1] "MGKGGNQGEGAAER--EVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQR 0"
## [1] "MGKGGNQGEGAAER--EVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQR 0"
## [1] "MGKGGNQGEGAAER--EAPLQTFCWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGHR 0"
## [1] "MGKGGNQGEGATER--EAPMPTFSWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGQR 0"
## [1] "MGKGGNQDEGATEL--EAPMPTFRWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGQR 0"
## [1] "MGKGGNQGEGATER--EVPMPTFSWEEIQKHNLRTDKWLVIDRKVYNITKWSNRHPGGHR 0"
## [1] "MGKGGNQGEGSTER--QAPMPTFRWEEIQKHNLRTDRWLVIDRKVYNVTKWSQRHPGGHR 0"
## [1] "MGKGGNQGEGATER--EAPIPTFRWEEIQKHNVRTDQWLVVDRKVYNVTKWASRHPGGHR 0"
## [1] "MGMGGQSGEGCSSGNCVKPEARYSWEEIQKHNLKTDKWLVIERKVYNITQWVKCHPGGMR 0"
## [1] "MGGGGQQTDRITDT--NGRFSSYTWEEMQKHTKHGDQWVVVERKVYNVSQWVKRHPGGLR 0"
## [1] " "
## [1] "VIGHYAGEDATDAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFHAFHRDLDFVRKFMKPLLIGELAPEEPSQDHGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFLAFHRDLDFVRKFMKPLLIGELAPEEPSQDHGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFLAFHRNLDFVRKFMKPLLIGELAPEEPSQDRGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFRAFHPNLDFVRKFMKPLLIGELAPEEPSQDRGKNSQITEDFRALKK 0"
## [1] "VIGHYSGEDATDAFRAFHLDLDFVGKFLKPLLIGELAPEEPSLDRGKSSQITEDFRALKK 0"
## [1] "VIGHYAGEDATDAFHAFHIDLDFVGKFLKPLLVGELAPEEPSQDRGKNSQIIEDFRALKK 0"
## [1] "VIGHYAGEDATDAFHAFHPDKNFVRKFLKPLYVGELAENEPSQDRDKNAQQVEDFRALRK 0"
## [1] "ILGHYAGEDATEAFTAFHPNLQLMRKYLKPLLIGELEASEPSQDRQKNAALVEDFRALRE 0"
## [1] " "
## [1] "TAEDMNLFKTNHVFFLLLLAHIIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKTNHVFFLLLLAHIIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKSDHLFFLLLLAHIIVMESIAWFIIFYFGNGWISTVITAFVLATSQAQAGWL 0"
## [1] "AAEDMNLFKSNHLFFLVLLAHIIVMESIAWFIIFYFGNGWIPTVITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKSNQLFFLLHLAHIIAMESIAWFTLFYFGNGWIPTIITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKANHLFFLLLLAHIIVMESIAWFTIFYFGNGWIPTVITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKTNHLFFFLLLSHIIVMESLAWFILSYFGTGWIPTLVTAFVLATSQAQAGWL 0"
## [1] "TAQDMNLFKSNHLFFLLLLAHILVMEAIAWLIVFYFGNGWITTIIASFILATSQVQTGWL 0"
## [1] "TAEDMGLFKSNPAFFIVYLFHILLIEFLAWCTLHYFGTGWIPTILTVLLLTISQAQAGWL 0"
## [1] "RLEAEGCFKTQPLFFALHLGHILLLEAIAFVMVWYFGTGWINTLIVAVILATAQSQAGWL 0"
## [1] " "
## [1] "QHDYGHLSVYRKPKWNHLVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYRKPKWNHLVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSMWNHVVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSSWNHVVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSMWNHIVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSTWNHLVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSIWNHVVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDIKSLHV 0"
## [1] "QHDLGHLSVYKKSKWNHLAHKFVIGHLKGASSNWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDFGHLSVFKKSKWNHLIHKFIIGHLKGASANWWNHRHFQHHAKPNIFRKDPDVNMVNV 0"
## [1] "QHDFGHLSVFKTSGMNHLVHKFVIGHLKGASAGWWNHRHFQHHAKPNIFKKDPDVNMLNA 0"
## [1] " "
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPVYFQYQIIMTMIVRRDWVDLAW 0"
## [1] "FVLGESQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPVYFQYQIIMTMISRHDWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPLYFQYQIIMTMIVRKYWADLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVRKDWVDLAW 0"
## [1] "FVLGEWQPLEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMISRRDWVDLAW 0"
## [1] "LVLGEWQPIEYGKKKLKYLPYNHQHQYFFLIGPPLLIPIYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGDTQPVEFGKKRIKYLPYNHQHLYFFLIGPPLLIPIYFTIQIMKTMITRKDWVDLAW 0"
## [1] "FVVGNVQPVEYGVKKIKHLPYNHQHKYFFFIGPPLLIPVYFQFQIFHNMISHGMWVDLLW 0"
## [1] " "
## [1] "AVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQEAYRDWFSS 0"
## [1] "AVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQEAYRDWFSS 0"
## [1] "AMSYYVRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFSS 0"
## [1] "AMSYYARFFITYIPFYGILGAVIFLNFIRFLESHWFVWVTQMNHIVMEIDHEPYRDWFSN 0"
## [1] "AISYYTRFFITYIPFYGVLGSILFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFSS 0"
## [1] "AISYYARFFITYIPFYGVLGALLFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFRT 0"
## [1] "AISYYMRFFYTYIPFYGILGALVFLNFIRFLESHWFVWVTQMNHLVMEIDLDHYRDWFSS 0"
## [1] "SISYYVRFFISYVPFYGFLGSILFFNFIRFLESHWFVWVTQMNHIVMEINEEPYRPLFLL 0"
## [1] "SVSYYARFFITFVPFYGVLGSLALLNAVRFIESHWFVWVTQMNHLPMVIDHEKYKDWLET 0"
## [1] "CISYYVRYFLCYTQFYGVFWAIILFNFVRFMESHWFVWVTQMSRIPMNIDYEQNQDWLSM 0"
## [1] " "
## [1] "---------QLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKVAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNFHKVAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVRSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNFHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "PGPHPPSPEQLAATCNVEQSLFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNIEPSFFNDWFSGHLNFQIEHHLFPTMPRHNYWKIAPLVRSLCSK 0"
## [1] "---------QLVATCNIEQSAFNDWFSGHLNFQIEHHLFPTMPRHNYWRAAPRVRSLCEK 0"
## [1] " "
## [1] "HGIEYQEKPLLRALLDIIRSLKKSGKLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALLDIIRSLKKSGKLWLDAYLHK 25"
## [1] "HGIKYQEKPLLRALQDIIRSLKKSGELWLDAYLHK 25"
## [1] "HGIKYQEKPLLRALQDIIRSLKKSGELWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALQDIIGSLRKSGQLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALRDIIRSLKKSGDLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALIDIVSSLKKSGELWLDAYLHK 25"
## [1] "HGIQYQEKRLLRALLDVIRSLKKSGDLWLDAYLHK 25"
## [1] "YNVTYEEKCLYHGFRDVFRSLKKSGQLWLDAYLHK 25"
## [1] "YGVKYQEKTLYGAFADIIRSLEKSGELWLDAYLNK 25"
## [1] " "

Finished MSA

ggmsa(fads2_align, start = 350, end = 400)

Distance Matrix

Make a distance matrix

fads2_subset_dist <- dist.alignment(fads2_align_seqinr, 
                                       matrix = "identity")

is(fads2_subset_dist)

## [1] "dist"     "oldClass"

class(fads2_subset_dist)

## [1] "dist"

Round for display

fads2_subset_dist_rnd <- round(fads2_subset_dist, 3)
fads2_subset_dist_rnd

##                NP_004256.1 NP_001351410.1 XP_013976379.2 XP_032735091.1
## NP_001351410.1       0.000                                             
## XP_013976379.2       0.308          0.308                              
## XP_032735091.1       0.322          0.322          0.212               
## NP_001076913.1       0.322          0.322          0.281          0.289
## XP_027624528.1       0.289          0.289          0.256          0.256
## NP_062673.1          0.352          0.352          0.336          0.342
## XP_006911280.1       0.388          0.388          0.374          0.391
## XP_012816067.1       0.531          0.531          0.528          0.522
## NP_571720.2          0.593          0.593          0.593          0.591
##                NP_001076913.1 XP_027624528.1 NP_062673.1 XP_006911280.1
## NP_001351410.1                                                         
## XP_013976379.2                                                         
## XP_032735091.1                                                         
## NP_001076913.1                                                         
## XP_027624528.1          0.268                                          
## NP_062673.1             0.355          0.329                           
## XP_006911280.1          0.380          0.368       0.416               
## XP_012816067.1          0.522          0.509       0.543          0.543
## NP_571720.2             0.589          0.591       0.606          0.597
##                XP_012816067.1
## NP_001351410.1               
## XP_013976379.2               
## XP_032735091.1               
## NP_001076913.1               
## XP_027624528.1               
## NP_062673.1                  
## XP_006911280.1               
## XP_012816067.1               
## NP_571720.2             0.587

Phylogenetic trees of sequences

Build a phylogenetic tree from matrix

tree_subset <- nj(fads2_subset_dist)

Plotting phylogenetic trees

Plot the tree.

plot.phylo(tree_subset, main="Phylogenetic Tree", 
            use.edge.length = F)

mtext(text = "\nFads2 family gene tree - rooted, no branch lenths")

Final Portfolio Assignment: FADS2

Hayden Bash

12/13/2021

Introduction

Resources / References

Preparation

Accession Numbers

Accession Number Table

Data Preparation

Download Sequences

Initial data cleaning

General Protein Information

Protein Information

Draw dotplot

Protein properties compiled from databases

Protein feature prediction

Predict protein fold

Functions to calculate similarities

Subcellular location prediction

Percent Identity Comparisons (PID)

Data Preparation

PID table

PID methods comparison

Multiple Sequence Alignment

MSA data preparation

Building Multiple Sequence Alignment (MSA)

Cleaning / setting up an MSA

Finished MSA

Distance Matrix

Phylogenetic trees of sequences

Plotting phylogenetic trees

aa.1.1	alpha	beta	a.plus.b	a.div.b
A	285	203	175	361
R	53	67	78	146
N	97	139	120	183
D	163	121	111	244
C	22	75	74	63
Q	67	122	74	114
E	134	86	86	257
G	197	297	171	377
H	111	49	33	107
I	91	120	93	239
L	221	177	110	339
K	249	115	112	321
M	48	16	25	91
F	123	85	52	158
P	82	127	71	188
S	122	341	126	327
T	119	253	117	238
W	33	44	30	72
Y	63	110	108	130
V	167	229	123	378

aa.1.1	alpha	beta	a.plus.b	a.div.b
A	285	203	175	361
R	53	67	78	146
N	97	139	120	183
D	163	121	111	244
C	22	75	74	63
Q	67	122	74	114
E	134	86	86	257
G	197	297	171	377
H	111	49	33	107
I	91	120	93	239
L	221	177	110	339
K	249	115	112	321
M	48	16	25	91
F	123	85	52	158
P	82	127	71	188
S	122	341	126	327
T	119	253	117	238
W	33	44	30	72
Y	63	110	108	130
V	167	229	123	378

aa.1.1	alpha	beta	a.plus.b	a.div.b
A	285	203	175	361
R	53	67	78	146
N	97	139	120	183
D	163	121	111	244
C	22	75	74	63
Q	67	122	74	114
E	134	86	86	257
G	197	297	171	377
H	111	49	33	107
I	91	120	93	239
L	221	177	110	339
K	249	115	112	321
M	48	16	25	91
F	123	85	52	158
P	82	127	71	188
S	122	341	126	327
T	119	253	117	238
W	33	44	30	72
Y	63	110	108	130
V	167	229	123	378