This code will analyze and compile data information on the gene FADS2 (Fatty Acid Desaturase 2). Many different components of analysis will be used including multiple sequence alignments, percent identity, and phylogenetic trees. This data will show how close of a relationship this gene has in humans compared to the other animals that have homologous genes.
Key information used to make this script can be found here: * Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/9415 *Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/3149
Other resources consulted includes: * Neanderthal genome: http://neandertal.ensemblgenomes.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000134824
Other interesting resources and online tools include: * BLAST alignment: https://blast.ncbi.nlm.nih.gov/Blast.cgi * Sub-cellular locations prediction in Uniprot: https://www.uniprot.org/uniprot/O95864
Load necessary packages:
library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
## is available with R version '4.1'; see https://bioconductor.org/install
#install("drawProteins")
library(drawProteins)
Load other packages:
#github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
#CRAN packages
library(rentrez)
library(seqinr)
library(ape)
##
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
##
## as.alignment, consensus
library(pander)
library(ggplot2)
#Bioconductor packages
library(msa)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
##
## complement
## The following object is masked from 'package:seqinr':
##
## translate
## The following object is masked from 'package:base':
##
## strsplit
##
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
##
## version
#Biostrings
library(Biostrings)
# install("HGNChelper")
library(HGNChelper)
The homologous genes in other organisms were found both by using NCBI’s homologous gene page as well as using NCBI’s BLAST function. Both of these methods produced lists of homologous genes in other species with their accession numbers listed alongside the scientific names.
Two things that were left out of the table were drosophila and Neanderthals. This gene does not occur outside of vertebrates and so it is not found in drosophila. This gene was found in Neanderthals, but their was only a uniprot accession number given and it turned out to be the exact same as in the human gene.
The following function confirms the validity of the FADS2 gene name.
checkGeneSymbols(x = c("FADS2"))
## Maps last updated on: Thu Oct 24 12:31:05 2019
## x Approved Suggested.Symbol
## 1 FADS2 TRUE FADS2
As talked about above, this gene does not occur outside of vertebrates and was found in neanderthal’s, but does not have a RefSeq Accession Number.
We will now make a vector including the names and accession numbers of all of our species which will be turned into a table.
# RefSeq Uniprot PDB Sci Name Common Name Gene name
fads2.table <- c("NP_004256.1", "O95864", "NA", "Homo sapiens", "Human", "Fads2",
"NP_062673.1", "Q9Z0R9", "NA", "Mus musculus", "Mouse", "Fads2",
"NP_001351410.1", "A0A6D2Y484", "NA", "Pan troglodytes", "Chimpanzee", "Fads2",
"XP_013976379.2", "NA", "NA", "Canis lupus", "Wolf", "Fads2",
"NP_571720.2", "Q9DEX7", "NA", "Danio rerio", "Zebrafish", "Fads2",
"XP_027624528.1", "NA", "NA", "Tupaia chinensis", "Tree Shrew", "Fads2",
"XP_012816067.1", "F6YR25", "NA", "Xenopus tropicalis","Frog", "Fads2",
"XP_006911280.1", "L5KQ77", "NA", "Pteropus Alecto", "Flying Fox", "Fads2",
"NP_001076913.1", "A4FV48", "NA", "Bovine taurus", "Cow", "Fads2",
"XP_032735091.1", "NA", "NA", "Lontra Canadensis", "River Otter", "Fads2")
This chunk will convert our vector into a matrix and then create a dataframe to display a table of the information. Below is printed the dimensions of the table, the column names prior to being changed, and the final table.
# turn the vector into a matrix
fads2.matrix <- matrix(fads2.table, nrow = 10, ncol = 6, byrow = TRUE)
# the matrix is 10 by 6
dim(fads2.matrix)
## [1] 10 6
# make a datafram from the matrix
fads2.df <- as.data.frame(fads2.matrix)
# print current column names
colnames(fads2.df)
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
# rename the columns
colnames(fads2.df) <- c(c("ncbi.protein.accession", "UniProt.id", "PDB",
"species", "common.name", "gene.name"))
# print out the data frame
pander(fads2.df)
| ncbi.protein.accession | UniProt.id | PDB | species | common.name |
|---|---|---|---|---|
| NP_004256.1 | O95864 | NA | Homo sapiens | Human |
| NP_062673.1 | Q9Z0R9 | NA | Mus musculus | Mouse |
| NP_001351410.1 | A0A6D2Y484 | NA | Pan troglodytes | Chimpanzee |
| XP_013976379.2 | NA | NA | Canis lupus | Wolf |
| NP_571720.2 | Q9DEX7 | NA | Danio rerio | Zebrafish |
| XP_027624528.1 | NA | NA | Tupaia chinensis | Tree Shrew |
| XP_012816067.1 | F6YR25 | NA | Xenopus tropicalis | Frog |
| XP_006911280.1 | L5KQ77 | NA | Pteropus Alecto | Flying Fox |
| NP_001076913.1 | A4FV48 | NA | Bovine taurus | Cow |
| XP_032735091.1 | NA | NA | Lontra Canadensis | River Otter |
| gene.name |
|---|
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
| Fads2 |
We will use entrez_fetch_list to get a list of all of the FASTA files.
fads2.accesion <- fads2.df$ncbi.protein.accession
fads2.accesion
## [1] "NP_004256.1" "NP_062673.1" "NP_001351410.1" "XP_013976379.2"
## [5] "NP_571720.2" "XP_027624528.1" "XP_012816067.1" "XP_006911280.1"
## [9] "NP_001076913.1" "XP_032735091.1"
fads2.accesion.list <- entrez_fetch_list(db = "protein", id = fads2.accesion, rettype = "fasta")
Number of FASTA files obtained:
length(fads2.accesion.list)
## [1] 10
The first entry:
fads2.accesion.list[[1]]
## [1] ">NP_004256.1 acyl-CoA 6-desaturase isoform 1 [Homo sapiens]\nMGKGGNQGEGAAEREVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQRVIGHYAGEDATD\nAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRKTAEDMNLFKTNHVFFLLLLAHI\nIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWLQHDYGHLSVYRKPKWNHLVHKFVIGHLKGASA\nNWWNHRHFQHHAKPNIFHKDPDVNMLHVFVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQ\nYQIIMTMIVHKNWVDLAWAVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQE\nAYRDWFSSQLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAKHGIEYQEKPLL\nRALLDIIRSLKKSGKLWLDAYLHK\n\n"
The first step will be to remove the FASTA header.
for(i in 1:length(fads2.accesion.list)){
fads2.accesion.list[[i]] <- fasta_cleaner(fads2.accesion.list[[i]], parse = F)
}
The following code creates an image that shows descriptions of the different parts of the gene.
fads2.uniprot <- get_features(fads2.df$UniProt.id[1])
## [1] "Download has worked"
fads2.prot <- feature_to_dataframe(fads2.uniprot)
fads2.canvas <- draw_canvas(fads2.prot)
fads2.canvas <- draw_chains(fads2.canvas, fads2.prot, label_size = 2.5)
fads2.canvas <- draw_recept_dom(fads2.canvas, fads2.prot)
fads2.canvas
We will prepare the data so that each letter can be compared.
human_fasta <- fads2.accesion.list[[1]]
fads2_vector <- fasta_cleaner(human_fasta)
dotPlot(fads2_vector, fads2_vector)
Let’s examine different settings to get a clearer picture
par(mfrow = c(2,2),
mar = c(0,0,2,1))
# plot 1: - Defaults
dotPlot(fads2_vector,
fads2_vector,
wsize = 1,
nmatch = 1,
main = "\nfads2 Defaults \n")
# plot - size = 10, nmatch = 1
dotPlot(fads2_vector,
fads2_vector,
wsize = 10,
nmatch = 1,
main = "\nfads2 - size = 10, nmatch = 1 \n")
# plot 3: - size = 10, nmatch = 5
dotPlot(fads2_vector,
fads2_vector,
wsize = 10,
nmatch = 5,
main = "\nfads2 - size = 10, nmatch = 5 \n")
# plot 4: size = 20, nmatch = 5
dotPlot(fads2_vector,
fads2_vector,
wsize = 20,
nmatch = 5,
main = "\nfads2 - size = 20, nmatch = 5 \n")
# reset par() - run this or other plots will be small!
par(mfrow = c(1,1),
mar = c(4,4,4,4))
The bottom right dotplot seems to be clearest. Let’s print it larger so that we can see the data.
dotPlot(fads2_vector,
fads2_vector,
wsize = 20,
nmatch = 5,
main = "\nfads2 - size = 20, nmatch = 5 \n")
Many websites can give information on our gene. Here is a table of any additional information given from the sites listed below.
Pfam <- c("Cyt-b5 domain spanning 22-95aa and FA-desaturase domain 156-418aa")
Disprot <- c("No information available")
RepeatsDB <- c("No information available")
Uniprot.subcellular.location <- c("Endoplasmic reticulum membrane")
PDB <- c("No PDB entry available")
Alpha.fold <- c("The predicted protein is Acyl-CoA 6-desaturase and is completely made from alpha helixes")
IUpred2A <- c("There were two peaks that rose above 0.5")
protein.properties.df <- data.frame(Pfam = Pfam, Disprot = Disprot, RepeatsDB = RepeatsDB,
Uniprot.subcellular.location = Uniprot.subcellular.location,
PDB = PDB, Alpha.fold = Alpha.fold, IUpred2A = IUpred2A)
pander(protein.properties.df)
| Pfam | Disprot |
|---|---|
| Cyt-b5 domain spanning 22-95aa and FA-desaturase domain 156-418aa | No information available |
| RepeatsDB | Uniprot.subcellular.location |
|---|---|
| No information available | Endoplasmic reticulum membrane |
| PDB | Alpha.fold |
|---|---|
| No PDB entry available | The predicted protein is Acyl-CoA 6-desaturase and is completely made from alpha helixes |
| IUpred2A |
|---|
| There were two peaks that rose above 0.5 |
Uniprot indicates that this protein is a Multi-pass membrane protein. They define this as a protein that spans the membrane more than once.
Alphafold indicates that this protein is solely consisted of alpha helixes. Therefore, I predict that the machine-learning methods will categorize this protein as α-all.
We will now use the method from Chou and Zhang to see what this analysis tells us about the protein fold.
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
"L","K","M","F","P","S","T","W","Y","V")
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91,
221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120,
177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239,
339, 321, 91, 158, 188, 327, 238, 72, 130, 378)
pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
| aa.1.1 | alpha | beta | a.plus.b | a.div.b |
|---|---|---|---|---|
| A | 285 | 203 | 175 | 361 |
| R | 53 | 67 | 78 | 146 |
| N | 97 | 139 | 120 | 183 |
| D | 163 | 121 | 111 | 244 |
| C | 22 | 75 | 74 | 63 |
| Q | 67 | 122 | 74 | 114 |
| E | 134 | 86 | 86 | 257 |
| G | 197 | 297 | 171 | 377 |
| H | 111 | 49 | 33 | 107 |
| I | 91 | 120 | 93 | 239 |
| L | 221 | 177 | 110 | 339 |
| K | 249 | 115 | 112 | 321 |
| M | 48 | 16 | 25 | 91 |
| F | 123 | 85 | 52 | 158 |
| P | 82 | 127 | 71 | 188 |
| S | 122 | 341 | 126 | 327 |
| T | 119 | 253 | 117 | 238 |
| W | 33 | 44 | 30 | 72 |
| Y | 63 | 110 | 108 | 130 |
| V | 167 | 229 | 123 | 378 |
Calculate Proportions
alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b.prop <- a.div.b/sum(a.div.b)
Create Dataframe
aa.prop <- data.frame(alpha.prop, beta.prop, a.plus.b.prop, a.div.b.prop)
rownames(aa.prop) <- aa.1.1
pander(aa.prop)
| Â | alpha.prop | beta.prop | a.plus.b.prop | a.div.b.prop |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
This function will convert our table into a vector.
table_to_vector <- function(table_x){
table_names <- attr(table_x, "dimnames")[[1]]
table_vect <- as.vector(table_x)
names(table_vect) <- table_names
return(table_vect)
}
fads2_human_table <- table(fads2_vector)/length(fads2_vector)
fads2.human.aa.freq <- table_to_vector(fads2_human_table)
fads2.human.aa.freq
## A C D E F G
## 0.063063063 0.004504505 0.038288288 0.051801802 0.074324324 0.058558559
## H I K L M N
## 0.067567568 0.072072072 0.065315315 0.105855856 0.020270270 0.045045045
## P Q R S T V
## 0.045045045 0.042792793 0.036036036 0.040540541 0.038288288 0.051801802
## W Y
## 0.038288288 0.040540541
Add data on my focal protein to the amino acid frequency.
aa.prop$fads2.human.aa.freq <- fads2.human.aa.freq
pander(aa.prop)
| Â | alpha.prop | beta.prop | a.plus.b.prop | a.div.b.prop |
|---|---|---|---|---|
| A | 0.1165 | 0.07313 | 0.09264 | 0.08331 |
| R | 0.02166 | 0.02414 | 0.04129 | 0.03369 |
| N | 0.03964 | 0.05007 | 0.06353 | 0.04223 |
| D | 0.06661 | 0.04359 | 0.05876 | 0.05631 |
| C | 0.008991 | 0.02702 | 0.03917 | 0.01454 |
| Q | 0.02738 | 0.04395 | 0.03917 | 0.02631 |
| E | 0.05476 | 0.03098 | 0.04553 | 0.05931 |
| G | 0.08051 | 0.107 | 0.09052 | 0.08701 |
| H | 0.04536 | 0.01765 | 0.01747 | 0.02469 |
| I | 0.03719 | 0.04323 | 0.04923 | 0.05516 |
| L | 0.09031 | 0.06376 | 0.05823 | 0.07824 |
| K | 0.1018 | 0.04143 | 0.05929 | 0.07408 |
| M | 0.01962 | 0.005764 | 0.01323 | 0.021 |
| F | 0.05027 | 0.03062 | 0.02753 | 0.03646 |
| P | 0.03351 | 0.04575 | 0.03759 | 0.04339 |
| S | 0.04986 | 0.1228 | 0.0667 | 0.07547 |
| T | 0.04863 | 0.09114 | 0.06194 | 0.05493 |
| W | 0.01349 | 0.01585 | 0.01588 | 0.01662 |
| Y | 0.02575 | 0.03963 | 0.05717 | 0.03 |
| V | 0.06825 | 0.08249 | 0.06511 | 0.08724 |
| Â | fads2.human.aa.freq |
|---|---|
| A | 0.06306 |
| R | 0.004505 |
| N | 0.03829 |
| D | 0.0518 |
| C | 0.07432 |
| Q | 0.05856 |
| E | 0.06757 |
| G | 0.07207 |
| H | 0.06532 |
| I | 0.1059 |
| L | 0.02027 |
| K | 0.04505 |
| M | 0.04505 |
| F | 0.04279 |
| P | 0.03604 |
| S | 0.04054 |
| T | 0.03829 |
| W | 0.0518 |
| Y | 0.03829 |
| V | 0.04054 |
Two custom functions are needed: one to calculate the correlation between two columns of our table, and one to calculate correlation similarities.
# Correlation used in Chou and Zhang 1992.
chou_cor <- function(x,y){
numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}
# Cosine similarity used in Higgs and Attwood (2005).
chou_cosine <- function(z.1, z.2){
z.1.abs <- sqrt(sum(z.1^2))
z.2.abs <- sqrt(sum(z.2^2))
my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
return(my.cosine)
}
Calculate Correlation between each column
corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb <- chou_cor(aa.prop[,5], aa.prop[,4])
Calculate cosine similarity
cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb <- chou_cosine(aa.prop[,5], aa.prop[,4])
Calculate distance.
aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped, 2)
## A R N D C Q E G H I L K
## alpha.prop 0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09 0.10
## beta.prop 0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06 0.04
## a.plus.b.prop 0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06 0.06
## a.div.b.prop 0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08 0.07
## fads2.human.aa.freq 0.06 0.00 0.04 0.05 0.07 0.06 0.07 0.07 0.07 0.11 0.02 0.05
## M F P S T W Y V
## alpha.prop 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b.prop 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## fads2.human.aa.freq 0.05 0.04 0.04 0.04 0.04 0.05 0.04 0.04
We can get a distance matrix like this
dist(aa.prop.flipped, method = "euclidean")
## alpha.prop beta.prop a.plus.b.prop a.div.b.prop
## beta.prop 0.13342098
## a.plus.b.prop 0.09281824 0.08289406
## a.div.b.prop 0.06699039 0.08659174 0.06175113
## fads2.human.aa.freq 0.15929515 0.16794894 0.13127207 0.14241948
We can also calculate the individual distances
dist.alpha <- dist((aa.prop.flipped[c(1,5),]), method = "euclidean")
dist.beta <- dist((aa.prop.flipped[c(2,5),]), method = "euclidean")
dist.apb <- dist((aa.prop.flipped[c(3,5),]), method = "euclidean")
dist.adb <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")
Compile the information. Rounding makes it easier to read.
# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")
# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)
# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")
df <- data.frame(fold.type,
corr.sim ,
cosine.sim ,
Euclidean.dist ,
sim.sum ,
dist.sum )
Display Output
pander(df)
| fold.type | corr.sim | cosine.sim | Euclidean.dist | sim.sum | dist.sum |
|---|---|---|---|---|---|
| alpha | 0.7994 | 0.7994 | 0.1593 | ||
| beta | 0.7806 | 0.7806 | 0.1679 | ||
| alpha plus beta | 0.8539 | 0.8539 | 0.1313 | most.sim | min.dist |
| alpha/beta | 0.8316 | 0.8316 | 0.1424 |
Alpha plus Beta was predicted which does not match up with the prediction from Alphafold.
TBD
Convert all FASTA records into entries in a single vector. The following for loop will take the accession numbers and associate them with the cleaned vector for the FASTA code.
fads2_list <- rep(NA, 0)
for(i in 1:length(fads2.accesion.list)){
fads2_list[[fads2.df$ncbi.protein.accession[i]]] <- fads2.accesion.list[[i]]
}
is(fads2_list)
## [1] "character" "vector"
## [3] "data.frameRowLabels" "SuperClassMethod"
## [5] "EnumerationValue" "character_OR_connection"
## [7] "character_OR_NULL" "atomic"
## [9] "vector_OR_Vector" "vector_OR_factor"
length(fads2_list)
## [1] 10
names(fads2_list)
## [1] "NP_004256.1" "NP_062673.1" "NP_001351410.1" "XP_013976379.2"
## [5] "NP_571720.2" "XP_027624528.1" "XP_012816067.1" "XP_006911280.1"
## [9] "NP_001076913.1" "XP_032735091.1"
data("BLOSUM50")
align01.03 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[3]],
substitutionMatrix = "BLOSUM50",
scoreOnly = FALSE)
align01.05 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[5]],
substitutionMatrix = "BLOSUM50",
scoreOnly = FALSE)
align01.08 <- pairwiseAlignment(fads2.accesion.list[[1]], fads2_list[[8]],
substitutionMatrix = "BLOSUM50",
scoreOnly = FALSE)
align03.05 <- pairwiseAlignment(fads2.accesion.list[[3]], fads2_list[[5]],
substitutionMatrix = "BLOSUM50",
scoreOnly = FALSE)
align03.08 <- pairwiseAlignment(fads2.accesion.list[[3]], fads2_list[[8]],
substitutionMatrix = "BLOSUM50",
scoreOnly = FALSE)
align05.08 <- pairwiseAlignment(fads2.accesion.list[[5]], fads2_list[[8]],
substitutionMatrix = "BLOSUM50",
scoreOnly = FALSE)
pid(align01.03)
## [1] 100
pid(align01.05)
## [1] 64.86486
pid(align01.08)
## [1] 83.22296
This matrix will show the percent identity between the human, chimpanzee, zebrafish, and flying fox.
pids <- c(100, NA, NA, NA,
pid(align01.03), 100, NA, NA,
pid(align01.05), pid(align03.05), 100, NA,
pid(align01.08), pid(align03.08), pid(align05.08), 100)
mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Pan","Fish","Fox")
colnames(mat) <- c("Homo","Pan","Fish","Fox")
pander(mat)
| Â | Homo | Pan | Fish | Fox |
|---|---|---|---|---|
| Homo | 100 | NA | NA | NA |
| Pan | 100 | 100 | NA | NA |
| Fish | 64.86 | 64.86 | 100 | NA |
| Fox | 83.22 | 83.22 | 63.13 | 100 |
Compare different PID methods. First we will compare humans and chimps
method <- c("PID1", "PID2", "PID3", "PID4")
PID.chimp <- c(pid(align01.03, type = "PID1"),
pid(align01.03, type = "PID2"),
pid(align01.03, type = "PID3"),
pid(align01.03, type = "PID4"))
denominator <- c("(aligned positions + internal gap positions)",
"(aligned positions)",
"(length shorter sequence)",
"(average length of the two sequences)")
pid.chimp.comparison <- data.frame(method, PID = PID.chimp, denominator)
pander(pid.chimp.comparison)
| method | PID | denominator |
|---|---|---|
| PID1 | 100 | (aligned positions + internal gap positions) |
| PID2 | 100 | (aligned positions) |
| PID3 | 100 | (length shorter sequence) |
| PID4 | 100 | (average length of the two sequences) |
We will now compare humans and zebrafish
PID.fish <- c(pid(align01.05, type = "PID1"),
pid(align01.05, type = "PID2"),
pid(align01.05, type = "PID3"),
pid(align01.05, type = "PID4"))
pid.fish.comparison <- data.frame(method, PID = PID.fish, denominator)
pander(pid.fish.comparison)
| method | PID | denominator |
|---|---|---|
| PID1 | 64.86 | (aligned positions + internal gap positions) |
| PID2 | 64.86 | (aligned positions) |
| PID3 | 64.86 | (length shorter sequence) |
| PID4 | 64.86 | (average length of the two sequences) |
For R bioinformatics tools we are going to convert the vector into a string set
fads2_vector_ss <- AAStringSet(fads2_list)
fads2_align <-msa(fads2_vector_ss,
method = "ClustalW")
## use default substitution matrix
MSA produces a species MSA object
class(fads2_align)
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is(fads2_align)
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment" "MsaMetaData"
## [4] "MultipleAlignment"
Default output of MSA
fads2_align
## CLUSTAL 2.1
##
## Call:
## msa(fads2_vector_ss, method = "ClustalW")
##
## MsaAAMultipleAlignment with 10 rows and 455 columns
## aln names
## [1] MGKGGNQGEGAAER--EVSVPTFSW...LRALLDIIRSLKKSGKLWLDAYLHK NP_004256.1
## [2] MGKGGNQGEGAAER--EVSVPTFSW...LRALLDIIRSLKKSGKLWLDAYLHK NP_001351410.1
## [3] MGKGGNQGEGAAER--EAPLQTFCW...LRALQDIIRSLKKSGELWLDAYLHK XP_013976379.2
## [4] MGKGGNQGEGATER--EAPMPTFSW...LRALQDIIRSLKKSGELWLDAYLHK XP_032735091.1
## [5] MGKGGNQDEGATEL--EAPMPTFRW...LRALQDIIGSLRKSGQLWLDAYLHK NP_001076913.1
## [6] MGKGGNQGEGATER--EVPMPTFSW...LRALRDIIRSLKKSGDLWLDAYLHK XP_027624528.1
## [7] MGKGGNQGEGSTER--QAPMPTFRW...LRALIDIVSSLKKSGELWLDAYLHK NP_062673.1
## [8] MGKGGNQGEGATER--EAPIPTFRW...LRALLDVIRSLKKSGDLWLDAYLHK XP_006911280.1
## [9] MGMGGQSGEGCSSGNCVKPEARYSW...YHGFRDVFRSLKKSGQLWLDAYLHK XP_012816067.1
## [10] MGGGGQQTDRITDT--NGRFSSYTW...YGAFADIIRSLEKSGELWLDAYLNK NP_571720.2
## Con MGKGGNQGEGATER--EAP?PTFSW...LRAL?DIIRSLKKSG?LWLDAYLHK Consensus
Change class of alignment
class(fads2_align) <- "AAMultipleAlignment"
class(fads2_align)
## [1] "AAMultipleAlignment"
fads2_align_seqinr <- msaConvert(fads2_align, type = "seqinr::alignment")
Show output of the MSA
print_msa(fads2_align_seqinr)
## [1] "MGKGGNQGEGAAER--EVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQR 0"
## [1] "MGKGGNQGEGAAER--EVSVPTFSWEEIQKHNLRTDRWLVIDRKVYNITKWSIQHPGGQR 0"
## [1] "MGKGGNQGEGAAER--EAPLQTFCWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGHR 0"
## [1] "MGKGGNQGEGATER--EAPMPTFSWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGQR 0"
## [1] "MGKGGNQDEGATEL--EAPMPTFRWEEIQKHNLRTDKWLVIDRKVYNITKWSSRHPGGQR 0"
## [1] "MGKGGNQGEGATER--EVPMPTFSWEEIQKHNLRTDKWLVIDRKVYNITKWSNRHPGGHR 0"
## [1] "MGKGGNQGEGSTER--QAPMPTFRWEEIQKHNLRTDRWLVIDRKVYNVTKWSQRHPGGHR 0"
## [1] "MGKGGNQGEGATER--EAPIPTFRWEEIQKHNVRTDQWLVVDRKVYNVTKWASRHPGGHR 0"
## [1] "MGMGGQSGEGCSSGNCVKPEARYSWEEIQKHNLKTDKWLVIERKVYNITQWVKCHPGGMR 0"
## [1] "MGGGGQQTDRITDT--NGRFSSYTWEEMQKHTKHGDQWVVVERKVYNVSQWVKRHPGGLR 0"
## [1] " "
## [1] "VIGHYAGEDATDAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFRAFHPDLEFVGKFLKPLLIGELAPEEPSQDHGKNSKITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFHAFHRDLDFVRKFMKPLLIGELAPEEPSQDHGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFLAFHRDLDFVRKFMKPLLIGELAPEEPSQDHGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFLAFHRNLDFVRKFMKPLLIGELAPEEPSQDRGKNSQITEDFRALRK 0"
## [1] "VIGHYAGEDATDAFRAFHPNLDFVRKFMKPLLIGELAPEEPSQDRGKNSQITEDFRALKK 0"
## [1] "VIGHYSGEDATDAFRAFHLDLDFVGKFLKPLLIGELAPEEPSLDRGKSSQITEDFRALKK 0"
## [1] "VIGHYAGEDATDAFHAFHIDLDFVGKFLKPLLVGELAPEEPSQDRGKNSQIIEDFRALKK 0"
## [1] "VIGHYAGEDATDAFHAFHPDKNFVRKFLKPLYVGELAENEPSQDRDKNAQQVEDFRALRK 0"
## [1] "ILGHYAGEDATEAFTAFHPNLQLMRKYLKPLLIGELEASEPSQDRQKNAALVEDFRALRE 0"
## [1] " "
## [1] "TAEDMNLFKTNHVFFLLLLAHIIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKTNHVFFLLLLAHIIALESIAWFTVFYFGNGWIPTLITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKSDHLFFLLLLAHIIVMESIAWFIIFYFGNGWISTVITAFVLATSQAQAGWL 0"
## [1] "AAEDMNLFKSNHLFFLVLLAHIIVMESIAWFIIFYFGNGWIPTVITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKSNQLFFLLHLAHIIAMESIAWFTLFYFGNGWIPTIITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKANHLFFLLLLAHIIVMESIAWFTIFYFGNGWIPTVITAFVLATSQAQAGWL 0"
## [1] "TAEDMNLFKTNHLFFFLLLSHIIVMESLAWFILSYFGTGWIPTLVTAFVLATSQAQAGWL 0"
## [1] "TAQDMNLFKSNHLFFLLLLAHILVMEAIAWLIVFYFGNGWITTIIASFILATSQVQTGWL 0"
## [1] "TAEDMGLFKSNPAFFIVYLFHILLIEFLAWCTLHYFGTGWIPTILTVLLLTISQAQAGWL 0"
## [1] "RLEAEGCFKTQPLFFALHLGHILLLEAIAFVMVWYFGTGWINTLIVAVILATAQSQAGWL 0"
## [1] " "
## [1] "QHDYGHLSVYRKPKWNHLVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYRKPKWNHLVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSMWNHVVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSSWNHVVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSMWNHIVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSTWNHLVHKFIIGHLKGASANWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDYGHLSVYKKSIWNHVVHKFVIGHLKGASANWWNHRHFQHHAKPNIFHKDPDIKSLHV 0"
## [1] "QHDLGHLSVYKKSKWNHLAHKFVIGHLKGASSNWWNHRHFQHHAKPNIFHKDPDVNMLHV 0"
## [1] "QHDFGHLSVFKKSKWNHLIHKFIIGHLKGASANWWNHRHFQHHAKPNIFRKDPDVNMVNV 0"
## [1] "QHDFGHLSVFKTSGMNHLVHKFVIGHLKGASAGWWNHRHFQHHAKPNIFKKDPDVNMLNA 0"
## [1] " "
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPVYFQYQIIMTMIVRRDWVDLAW 0"
## [1] "FVLGESQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPVYFQYQIIMTMISRHDWVDLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPLYFQYQIIMTMIVRKYWADLAW 0"
## [1] "FVLGEWQPIEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMIVRKDWVDLAW 0"
## [1] "FVLGEWQPLEYGKKKLKYLPYNHQHEYFFLIGPPLLIPMYFQYQIIMTMISRRDWVDLAW 0"
## [1] "LVLGEWQPIEYGKKKLKYLPYNHQHQYFFLIGPPLLIPIYFQYQIIMTMIVHKNWVDLAW 0"
## [1] "FVLGDTQPVEFGKKRIKYLPYNHQHLYFFLIGPPLLIPIYFTIQIMKTMITRKDWVDLAW 0"
## [1] "FVVGNVQPVEYGVKKIKHLPYNHQHKYFFFIGPPLLIPVYFQFQIFHNMISHGMWVDLLW 0"
## [1] " "
## [1] "AVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQEAYRDWFSS 0"
## [1] "AVSYYIRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDQEAYRDWFSS 0"
## [1] "AMSYYVRFFITYIPFYGILGALLFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFSS 0"
## [1] "AMSYYARFFITYIPFYGILGAVIFLNFIRFLESHWFVWVTQMNHIVMEIDHEPYRDWFSN 0"
## [1] "AISYYTRFFITYIPFYGVLGSILFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFSS 0"
## [1] "AISYYARFFITYIPFYGVLGALLFLNFIRFLESHWFVWVTQMNHIVMEIDREPYRDWFRT 0"
## [1] "AISYYMRFFYTYIPFYGILGALVFLNFIRFLESHWFVWVTQMNHLVMEIDLDHYRDWFSS 0"
## [1] "SISYYVRFFISYVPFYGFLGSILFFNFIRFLESHWFVWVTQMNHIVMEINEEPYRPLFLL 0"
## [1] "SVSYYARFFITFVPFYGVLGSLALLNAVRFIESHWFVWVTQMNHLPMVIDHEKYKDWLET 0"
## [1] "CISYYVRYFLCYTQFYGVFWAIILFNFVRFMESHWFVWVTQMSRIPMNIDYEQNQDWLSM 0"
## [1] " "
## [1] "---------QLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLTATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKVAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNFHKVAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVRSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNFHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNVEQSFFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "PGPHPPSPEQLAATCNVEQSLFNDWFSGHLNFQIEHHLFPTMPRHNLHKIAPLVKSLCAK 0"
## [1] "---------QLAATCNIEPSFFNDWFSGHLNFQIEHHLFPTMPRHNYWKIAPLVRSLCSK 0"
## [1] "---------QLVATCNIEQSAFNDWFSGHLNFQIEHHLFPTMPRHNYWRAAPRVRSLCEK 0"
## [1] " "
## [1] "HGIEYQEKPLLRALLDIIRSLKKSGKLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALLDIIRSLKKSGKLWLDAYLHK 25"
## [1] "HGIKYQEKPLLRALQDIIRSLKKSGELWLDAYLHK 25"
## [1] "HGIKYQEKPLLRALQDIIRSLKKSGELWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALQDIIGSLRKSGQLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALRDIIRSLKKSGDLWLDAYLHK 25"
## [1] "HGIEYQEKPLLRALIDIVSSLKKSGELWLDAYLHK 25"
## [1] "HGIQYQEKRLLRALLDVIRSLKKSGDLWLDAYLHK 25"
## [1] "YNVTYEEKCLYHGFRDVFRSLKKSGQLWLDAYLHK 25"
## [1] "YGVKYQEKTLYGAFADIIRSLEKSGELWLDAYLNK 25"
## [1] " "
ggmsa(fads2_align, start = 350, end = 400)
Make a distance matrix
fads2_subset_dist <- dist.alignment(fads2_align_seqinr,
matrix = "identity")
is(fads2_subset_dist)
## [1] "dist" "oldClass"
class(fads2_subset_dist)
## [1] "dist"
Round for display
fads2_subset_dist_rnd <- round(fads2_subset_dist, 3)
fads2_subset_dist_rnd
## NP_004256.1 NP_001351410.1 XP_013976379.2 XP_032735091.1
## NP_001351410.1 0.000
## XP_013976379.2 0.308 0.308
## XP_032735091.1 0.322 0.322 0.212
## NP_001076913.1 0.322 0.322 0.281 0.289
## XP_027624528.1 0.289 0.289 0.256 0.256
## NP_062673.1 0.352 0.352 0.336 0.342
## XP_006911280.1 0.388 0.388 0.374 0.391
## XP_012816067.1 0.531 0.531 0.528 0.522
## NP_571720.2 0.593 0.593 0.593 0.591
## NP_001076913.1 XP_027624528.1 NP_062673.1 XP_006911280.1
## NP_001351410.1
## XP_013976379.2
## XP_032735091.1
## NP_001076913.1
## XP_027624528.1 0.268
## NP_062673.1 0.355 0.329
## XP_006911280.1 0.380 0.368 0.416
## XP_012816067.1 0.522 0.509 0.543 0.543
## NP_571720.2 0.589 0.591 0.606 0.597
## XP_012816067.1
## NP_001351410.1
## XP_013976379.2
## XP_032735091.1
## NP_001076913.1
## XP_027624528.1
## NP_062673.1
## XP_006911280.1
## XP_012816067.1
## NP_571720.2 0.587
Build a phylogenetic tree from matrix
tree_subset <- nj(fads2_subset_dist)
Plot the tree.
plot.phylo(tree_subset, main="Phylogenetic Tree",
use.edge.length = F)
mtext(text = "\nFads2 family gene tree - rooted, no branch lenths")