Introduction

This code compiles summary information about the gene POU1f1. This gene encodes a member of the POU family of transcription factors that regulate mammalian development. The protein regulates expression of several genes involved in pituitary development and hormone expression. Mutations in this genes result in combined pituitary hormone deficiency. Multiple transcript variants encoding different isoforms have been found for this gene.

Resources / References

Key information use to make this script can be found here: - Refseq Gene: https://www.ncbi.nlm.nih.gov/gene/2993 - Refseq Homologene: https://www.ncbi.nlm.nih.gov/homologene/48076

Other resources consulted includes - Neanderthal genome: http://neandertal.ensemblgenomes.org/index.html

Other interesting resources and online tools include: - REPPER: https://toolkit.tuebingen.mpg.de/jobs/7803820 - Sub-cellular locations prediction: https://wolfpsort.hgc.jp/

Preparation

Load necessary packages:

Download and load drawProteins from Bioconductor

library(BiocManager)
## Bioconductor version '3.13' is out-of-date; the current release version '3.14'
##   is available with R version '4.1'; see https://bioconductor.org/install
#install("drawProteins")

library(drawProteins)

Load other packages

# github packages
library(compbio4all)
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2
# CRAN packages
library(rentrez)
library(seqinr)
library(ape)
## 
## Attaching package: 'ape'
## The following objects are masked from 'package:seqinr':
## 
##     as.alignment, consensus
library(pander)


library(ggplot2)

# Bioconductor packages
## msa
### The msa package is having problems on some platforms
### You can skip the msa steps if necessary.  The msa output
### is used to make a distance matrix and then phylogenetics trees,
### but I provide code to build the matrix by hand so
### you can proceed even if msa doesn't work for you.

## Biostrings
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
##     union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:ape':
## 
##     complement
## The following object is masked from 'package:seqinr':
## 
##     translate
## The following object is masked from 'package:base':
## 
##     strsplit
library(msa)
## 
## Attaching package: 'msa'
## The following object is masked from 'package:BiocManager':
## 
##     version
library(drawProteins)


library(HGNChelper)

Accession numbers

Accession numbers were obtained from RefSeq, Refseq Homlogene, UniProt and PDB. UniProt accession numbers can be found by searching for the gene name. PDB accessions can be found by searching with a UniProt accession or a gene name, though many proteins are not in PDB.

A protein BLAST search (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was carried out to find more species, since the HomoloGene search only yielded 2 species. The gene not appears to be primate-only.

OPTIONAL: Use the function to confirm the validity of your gene name and any aliases

# this is optional
HGNChelper::checkGeneSymbols(x = c("POU1F1"))
## Maps last updated on: Thu Oct 24 12:31:05 2019
##        x Approved Suggested.Symbol
## 1 POU1F1     TRUE           POU1F1

Accession number table

Not available: - Drosophila

#                       RefSeq           Uniprot     PDB    sci name                common name          gene name
POU1F1_table_vector<-c("NP_001116229.1", "P28069", "5WC9",  "Homo sapiens" ,           "Human",             "POU1F1",
                     "XP_001145917.1",  "NA",     "NA",    "Pan troglodytes" ,        "Chimpanzee",         "POU1F1",
                     "NP_001036325.1",  "NA",     "NA",    "Macaca mulatta",          "Rhesus monkey",      "POU1F1",
                     "NP_001006950.1",  "NA",     "NA",    "Canis lupus familiaris",      "Dog",            "POU1F1",
                     "NP_777004.1",     "NA",     "NA",        "Bos taurus",             "cattle",          "POU1F1",
                     "NP_032875.1",    "NA",     "NA",    "Mus musculus",              "house mouse",       "POU1F1",
                     "NP_037140.2",    "NA",     "NA",    "Rattus norvegicus",          "Norway rat",      "POU1F1",
                     "XP_003831560.1",  "NA",     "NA",    "Pan paniscus",     "pygmy chimpanzee",        "POU1F1",
                     "XP_003893855.1",  "NA",     "NA",    "Papio anubis",  "olive baboon",      "POU1F1",
                     " XP_011782975.1 ", "NA",  "NA", "Colobus angolensis palliatus", "Angolan black-and-white colobus",    "POU1F1")

POU1F1_matrix <- matrix( POU1F1_table_vector, ncol = 6, byrow = TRUE) 
POU1F1_df <- data.frame( POU1F1_matrix )

colnames( POU1F1_df ) <- c("ncbi.protein.accession", "UniProt.id", "PDB", "species", "common.name",
                          "gene.name")

The finished table

pander::pander( POU1F1_df )
Table continues below
ncbi.protein.accession UniProt.id PDB species
NP_001116229.1 P28069 5WC9 Homo sapiens
XP_001145917.1 NA NA Pan troglodytes
NP_001036325.1 NA NA Macaca mulatta
NP_001006950.1 NA NA Canis lupus familiaris
NP_777004.1 NA NA Bos taurus
NP_032875.1 NA NA Mus musculus
NP_037140.2 NA NA Rattus norvegicus
XP_003831560.1 NA NA Pan paniscus
XP_003893855.1 NA NA Papio anubis
XP_011782975.1 NA NA Colobus angolensis palliatus
common.name gene.name
Human POU1F1
Chimpanzee POU1F1
Rhesus monkey POU1F1
Dog POU1F1
cattle POU1F1
house mouse POU1F1
Norway rat POU1F1
pygmy chimpanzee POU1F1
olive baboon POU1F1
Angolan black-and-white colobus POU1F1

Data Preparation

Download Sequences

All sequences were downloaded using a wrapper compbio4all::entrez_fetch_list() which uses rentrez::entrez_fetch() to access NCBI databases.

# download FASTA sequences
POU1F1_list <- compbio4all::entrez_fetch_list( db = "protein",
                                              id = POU1F1_df$ncbi.protein.accession,
                                              rettype = "fasta"
                                              ) 

Number of FASTA files obtained

length( POU1F1_list )
## [1] 10

The first entry

POU1F1_list[[1]]
## [1] ">NP_001116229.1 pituitary-specific positive transcription factor 1 isoform beta [Homo sapiens]\nMSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKCLCTHFSVTTL\nGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRK\nSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNAC\nKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLE\nKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR\n\n"
# output should be the FASTA sequence with header information and newlines still included

Initial Data Cleaning

Remove FASTA header

for(i in 1:length(POU1F1_list)){
  POU1F1_list[[i]] <- compbio4all::fasta_cleaner(POU1F1_list[[i]], parse = F)
}

Specific additional cleaning steps will be as needed for particular analyses

General Protein Information

Protein Diagram

For code see https://rpubs.com/lowbrowR/drawProtein

P28069_json <- drawProteins::get_features("P28069")
## [1] "Download has worked"
my_prot_df <- drawProteins::feature_to_dataframe(P28069_json)
is(my_prot_df)
## [1] "data.frame"       "list"             "oldClass"         "vector"          
## [5] "list_OR_List"     "vector_OR_Vector" "vector_OR_factor"
my_canvas <- draw_canvas(my_prot_df)  
my_canvas <- draw_chains(my_canvas, my_prot_df, 
                         label_size = 2.5)
my_canvas <- draw_domains(my_canvas, my_prot_df)
my_canvas

Draw dotplot

Prepare Data

POU1F1_list[[1]]
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKCLCTHFSVTTLGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
POU1F1_human_vector <- unlist(strsplit( POU1F1_list[[1]], "" ))
seqinr::dotPlot( POU1F1_human_vector, POU1F1_human_vector )

TODO:

par(mfrow = c(2,2), 
    mar = c(0,0,2,1))

# plot 1: Defaults
seqinr::dotPlot(POU1F1_human_vector, POU1F1_human_vector, 
        wsize = 1, 
        nmatch = 1, 
        main = "size=1, num match=1")

# plot 2 size = 10, nmatch = 10
seqinr::dotPlot(POU1F1_human_vector, POU1F1_human_vector, 
        wsize = 10, 
        nmatch = 1, 
        main = "size = 10, nmatch = 10")

# plot 3: size = 10, nmatch = 5
seqinr::dotPlot(POU1F1_human_vector, POU1F1_human_vector, 
        wsize = 10, 
        nmatch = 5, 
        main = "size = 10, nmatch = 5")

# plot 4: size = 20, nmatch = 5
seqinr::dotPlot(POU1F1_human_vector, POU1F1_human_vector, 
        wsize = 20,
        nmatch = 5,
        main = "size = 20, nmatch = 5")

par(mfrow = c(1,1), 
    mar = c(4,4,4,4))
seqinr::dotPlot(POU1F1_human_vector, POU1F1_human_vector, 
        wsize = 20,
        nmatch = 5,
        main = "POU1F1 human dot plot")

Protein properties compiled from databases

TODO: Create table

Below are links to relevant information. 1. Pfam: http://pfam.xfam.org/protein/P28069; “Pou” region from 127 to 198. “Homeodomain” region from 215 to 271 2. DisProt: no info available 3. RepeatDB: no info available 4. PDB secondary structural location: no info available

Protein feature prediction

Uniprot (which uses http://www.csbio.sjtu.edu) indicates that this protein is a protein kinase that acts as key mediator of the nitric oxide (NO)/cGMP signaling pathway.

Predict protein fold

First, I need the data from Chou and Zhang (1994) Table 5. Code to build this table is available at https://rpubs.com/lowbrowR/843543

The table looks like this:

# enter once
aa.1.1 <- c("A","R","N","D","C","Q","E","G","H","I",
            "L","K","M","F","P","S","T","W","Y","V")
# alpha proteins
alpha <- c(285, 53, 97, 163, 22, 67, 134, 197, 111, 91, 
           221, 249, 48, 123, 82, 122, 119, 33, 63, 167)
# beta proteins
beta <- c(203, 67, 139, 121, 75, 122, 86, 297, 49, 120, 
          177, 115, 16, 85, 127, 341, 253, 44, 110, 229)
# alpha + beta
a.plus.b <- c(175, 78, 120, 111, 74, 74, 86, 171, 33, 93,
              110, 112, 25, 52, 71, 126, 117, 30, 108, 123)
# alpha/beta
a.div.b <- c(361, 146, 183, 244, 63, 114, 257, 377, 107, 239, 
             339, 321, 91, 158, 188, 327, 238, 72, 130, 378)

pander(data.frame(aa.1.1, alpha, beta, a.plus.b, a.div.b))
aa.1.1 alpha beta a.plus.b a.div.b
A 285 203 175 361
R 53 67 78 146
N 97 139 120 183
D 163 121 111 244
C 22 75 74 63
Q 67 122 74 114
E 134 86 86 257
G 197 297 171 377
H 111 49 33 107
I 91 120 93 239
L 221 177 110 339
K 249 115 112 321
M 48 16 25 91
F 123 85 52 158
P 82 127 71 188
S 122 341 126 327
T 119 253 117 238
W 33 44 30 72
Y 63 110 108 130
V 167 229 123 378

Convert to frequencies Table 5 therefore becomes this

alpha.prop <- alpha/sum(alpha)
beta.prop <- beta/sum(beta)
a.plus.b.prop <- a.plus.b/sum(a.plus.b)
a.div.b <- a.div.b/sum(a.div.b)

aa.prop <- data.frame(alpha.prop,
                      beta.prop,
                      a.plus.b.prop,
                      a.div.b)
row.names(aa.prop) <- aa.1.1

pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b
A 0.1165 0.07313 0.09264 0.08331
R 0.02166 0.02414 0.04129 0.03369
N 0.03964 0.05007 0.06353 0.04223
D 0.06661 0.04359 0.05876 0.05631
C 0.008991 0.02702 0.03917 0.01454
Q 0.02738 0.04395 0.03917 0.02631
E 0.05476 0.03098 0.04553 0.05931
G 0.08051 0.107 0.09052 0.08701
H 0.04536 0.01765 0.01747 0.02469
I 0.03719 0.04323 0.04923 0.05516
L 0.09031 0.06376 0.05823 0.07824
K 0.1018 0.04143 0.05929 0.07408
M 0.01962 0.005764 0.01323 0.021
F 0.05027 0.03062 0.02753 0.03646
P 0.03351 0.04575 0.03759 0.04339
S 0.04986 0.1228 0.0667 0.07547
T 0.04863 0.09114 0.06194 0.05493
W 0.01349 0.01585 0.01588 0.01662
Y 0.02575 0.03963 0.05717 0.03
V 0.06825 0.08249 0.06511 0.08724

Determine the number of each amino acid in my protein.

A Function to convert a table into a vector is helpful here because R is goofy about tables not being the same as vectors.

table_to_vector <- function(table_x){
  table_names <- attr(table_x, "dimnames")[[1]]
  table_vect <- as.vector(table_x)
  names(table_vect) <- table_names
  return(table_vect)
}
POU1F1_human_table <- table(POU1F1_human_vector)/length(POU1F1_human_vector)
POU1F1.human.aa.freq <- table_to_vector(POU1F1_human_table)
POU1F1.human.aa.freq
##           A           C           D           E           F           G 
## 0.078864353 0.031545741 0.025236593 0.085173502 0.044164038 0.037854890 
##           H           I           K           L           M           N 
## 0.037854890 0.044164038 0.066246057 0.097791798 0.022082019 0.047318612 
##           P           Q           R           S           T           V 
## 0.050473186 0.041009464 0.056782334 0.088328076 0.069400631 0.050473186 
##           W           Y 
## 0.006309148 0.018927445

Check for the presence of “U” (unknown aa.)

aa.names <- names(POU1F1.human.aa.freq)
i.U <- which(aa.names == "U")
aa.names[i.U]
## character(0)
POU1F1.human.aa.freq[i.U]
## named numeric(0)

Add data on my focal protein to the amino acid frequency table.

POU1F1.human.aa.freq
##           A           C           D           E           F           G 
## 0.078864353 0.031545741 0.025236593 0.085173502 0.044164038 0.037854890 
##           H           I           K           L           M           N 
## 0.037854890 0.044164038 0.066246057 0.097791798 0.022082019 0.047318612 
##           P           Q           R           S           T           V 
## 0.050473186 0.041009464 0.056782334 0.088328076 0.069400631 0.050473186 
##           W           Y 
## 0.006309148 0.018927445
aa.prop$POU1F1.human.aa.freq <- POU1F1.human.aa.freq
pander::pander(aa.prop)
  alpha.prop beta.prop a.plus.b.prop a.div.b POU1F1.human.aa.freq
A 0.1165 0.07313 0.09264 0.08331 0.07886
R 0.02166 0.02414 0.04129 0.03369 0.03155
N 0.03964 0.05007 0.06353 0.04223 0.02524
D 0.06661 0.04359 0.05876 0.05631 0.08517
C 0.008991 0.02702 0.03917 0.01454 0.04416
Q 0.02738 0.04395 0.03917 0.02631 0.03785
E 0.05476 0.03098 0.04553 0.05931 0.03785
G 0.08051 0.107 0.09052 0.08701 0.04416
H 0.04536 0.01765 0.01747 0.02469 0.06625
I 0.03719 0.04323 0.04923 0.05516 0.09779
L 0.09031 0.06376 0.05823 0.07824 0.02208
K 0.1018 0.04143 0.05929 0.07408 0.04732
M 0.01962 0.005764 0.01323 0.021 0.05047
F 0.05027 0.03062 0.02753 0.03646 0.04101
P 0.03351 0.04575 0.03759 0.04339 0.05678
S 0.04986 0.1228 0.0667 0.07547 0.08833
T 0.04863 0.09114 0.06194 0.05493 0.0694
W 0.01349 0.01585 0.01588 0.01662 0.05047
Y 0.02575 0.03963 0.05717 0.03 0.006309
V 0.06825 0.08249 0.06511 0.08724 0.01893

Functions to calculate similarities

Two custom functions are needed: one to calculate correlates between two columns of our table, and one to calculate correlation similarities.

# Correlation used in Chou and Zhange 1992.
chou_cor <- function(x,y){
  numerator <- sum(x*y)
denominator <- sqrt((sum(x^2))*(sum(y^2)))
result <- numerator/denominator
return(result)
}

# Cosine similarity used in Higgs and Attwood (2005). 
chou_cosine <- function(z.1, z.2){
  z.1.abs <- sqrt(sum(z.1^2))
  z.2.abs <- sqrt(sum(z.2^2))
  my.cosine <- sum(z.1*z.2)/(z.1.abs*z.2.abs)
  return(my.cosine)
}

Calculate correlation between each column

corr.alpha <- chou_cor(aa.prop[,5], aa.prop[,1])
corr.beta  <- chou_cor(aa.prop[,5], aa.prop[,2])
corr.apb   <- chou_cor(aa.prop[,5], aa.prop[,3])
corr.adb   <- chou_cor(aa.prop[,5], aa.prop[,4])

Calculate cosine similarity

cos.alpha <- chou_cosine(aa.prop[,5], aa.prop[,1])
cos.beta  <- chou_cosine(aa.prop[,5], aa.prop[,2])
cos.apb   <- chou_cosine(aa.prop[,5], aa.prop[,3])
cos.adb   <- chou_cosine(aa.prop[,5], aa.prop[,4])

Calculate distance. Note: we need to flip the dataframe on its side using a command called t()

aa.prop.flipped <- t(aa.prop)
round(aa.prop.flipped,2)
##                         A    R    N    D    C    Q    E    G    H    I    L
## alpha.prop           0.12 0.02 0.04 0.07 0.01 0.03 0.05 0.08 0.05 0.04 0.09
## beta.prop            0.07 0.02 0.05 0.04 0.03 0.04 0.03 0.11 0.02 0.04 0.06
## a.plus.b.prop        0.09 0.04 0.06 0.06 0.04 0.04 0.05 0.09 0.02 0.05 0.06
## a.div.b              0.08 0.03 0.04 0.06 0.01 0.03 0.06 0.09 0.02 0.06 0.08
## POU1F1.human.aa.freq 0.08 0.03 0.03 0.09 0.04 0.04 0.04 0.04 0.07 0.10 0.02
##                         K    M    F    P    S    T    W    Y    V
## alpha.prop           0.10 0.02 0.05 0.03 0.05 0.05 0.01 0.03 0.07
## beta.prop            0.04 0.01 0.03 0.05 0.12 0.09 0.02 0.04 0.08
## a.plus.b.prop        0.06 0.01 0.03 0.04 0.07 0.06 0.02 0.06 0.07
## a.div.b              0.07 0.02 0.04 0.04 0.08 0.05 0.02 0.03 0.09
## POU1F1.human.aa.freq 0.05 0.05 0.04 0.06 0.09 0.07 0.05 0.01 0.02

We can get distance matrix like this

dist(aa.prop.flipped, method = "euclidean")
##                      alpha.prop  beta.prop a.plus.b.prop    a.div.b
## beta.prop            0.13342098                                    
## a.plus.b.prop        0.09281824 0.08289406                         
## a.div.b              0.06699039 0.08659174    0.06175113           
## POU1F1.human.aa.freq 0.15625179 0.15527795    0.13873638 0.14030109

Individual distances using dist()

dist.alpha <- dist((aa.prop.flipped[c(1,5),]),  method = "euclidean")
dist.beta  <- dist((aa.prop.flipped[c(2,5),]),  method = "euclidean")
dist.apb   <- dist((aa.prop.flipped[c(3,5),]),  method = "euclidean")
dist.adb  <- dist((aa.prop.flipped[c(4,5),]), method = "euclidean")

Compile the information. Rounding makes it easier to read

# fold types
fold.type <- c("alpha","beta","alpha plus beta", "alpha/beta")

# data
corr.sim <- round(c(corr.alpha,corr.beta,corr.apb,corr.adb),5)
cosine.sim <- round(c(cos.alpha,cos.beta,cos.apb,cos.adb),5)
Euclidean.dist <- round(c(dist.alpha,dist.beta,dist.apb,dist.adb),5)

# summary
sim.sum <- c("","","most.sim","")
dist.sum <- c("","","min.dist","")

df <- data.frame(fold.type,
           corr.sim ,
           cosine.sim ,
           Euclidean.dist ,
           sim.sum ,
           dist.sum )

Display output

pander::pander(df)
fold.type corr.sim cosine.sim Euclidean.dist sim.sum dist.sum
alpha 0.8103 0.8103 0.1562
beta 0.8157 0.8157 0.1553
alpha plus beta 0.8407 0.8407 0.1387 most.sim min.dist
alpha/beta 0.84 0.84 0.1403

Percent Identity Comparisons (PID)

Data Preparation

Convert all FASTA records intro entries in a single vector. FASTA entries are contained in a list produced at the beginning of the script. They were cleaned to remove the header and newline characters.

POU1F1_list 
## $NP_001116229.1
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKCLCTHFSVTTLGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
## 
## $XP_001145917.1
## [1] "MYGKIIFVLLLSAIVSISASSTTEVAMHTSTSSVTKSYISSETSDKHKWDIYPATPRAHEVSEIYVTTVYPPEEENGEGVQLVHRFSEPEITLIIFGVMAGVIGTILLIYYSICRLIKKSPSDVKPLPSPDTDVPLSSVEIENPETSDQ"
## 
## $NP_001036325.1
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKCLCTHFLVTTLGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTLVFYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
## 
## $NP_001006950.1
## [1] "MSCQPFTSADTFLPLNSEASAALPLIMHPGAAECLPGSNHATNVVSTATGLHYSVPSCHYGNQPSTYGVMAGGLTPCLYKFPEHGLGPGFPAAHQPLLAEGPAVADFKQELRRRSKLAEEPVDTESPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFTISKEHLECR"
## 
## $NP_777004.1
## [1] "MSCQPFTSTDTFIPLNSESSATLPLIMHPSAAECLPVSNHATNVMSTATGLHYSVPFCHYGNQSSTYGVMAGSLTPCLYKFPDHTLSHGFPPMHQPLLSEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEILRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFTISKEHLECR"
## 
## $NP_032875.1
## [1] "MSCQSFTSADTFITLNSDASAALPLRMHHSAAECLPASNHATNVMSTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPLHQPLLAEDPAASEFKQELRRKSKLVEEPIDMDSPEIRELEQFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISVAAKDALERHFGEHSKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
## 
## $NP_037140.2
## [1] "MSCQPFTSADTFIPLNSDASAALPLRMHHNAAEGLPASNHATNVMSTATGLHYSVPSCHYGNQPSTYGVMAGTLTPCLYKFPDHTLSHGFPPLHQPLLAEDPTASEFKQELRRKSKLVEEPIDMDSPEIRELEQFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEHSKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
## 
## $XP_003831560.1
## [1] "MSCQAFTSADNFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
## 
## $XP_003893855.1
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
## 
## $` XP_011782975.1 `
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPLHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"
names( POU1F1_list )
##  [1] "NP_001116229.1"   "XP_001145917.1"   "NP_001036325.1"   "NP_001006950.1"  
##  [5] "NP_777004.1"      "NP_032875.1"      "NP_037140.2"      "XP_003831560.1"  
##  [9] "XP_003893855.1"   " XP_011782975.1 "
length( POU1F1_list )
## [1] 10

Each entry is a full entry with no spaces or parsing, and no header

POU1F1_list[1]
## $NP_001116229.1
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKCLCTHFSVTTLGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIHQPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVGEALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERKRKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKTSLNQSLFSISKEHLECR"

Make each entry of the list into a vector. There are several ways to do this.

POU1F1_vector <- unlist( POU1F1_list )

Name the vector

names( POU1F1_list )
##  [1] "NP_001116229.1"   "XP_001145917.1"   "NP_001036325.1"   "NP_001006950.1"  
##  [5] "NP_777004.1"      "NP_032875.1"      "NP_037140.2"      "XP_003831560.1"  
##  [9] "XP_003893855.1"   " XP_011782975.1 "
names( POU1F1_vector ) <- names( POU1F1_list )

PID table

Do pairwise alignments for humans, chimps and 2-other species.

POU1F1_human      <- POU1F1_vector["NP_001116229.1"]
POU1F1_chimp      <- POU1F1_vector["XP_001145917.1"]
POU1F1_dog     <- POU1F1_vector["NP_001006950.1"]
POU1F1_cattle  <- POU1F1_vector["NP_777004.1"]

align.human.chimp      <- Biostrings::pairwiseAlignment(POU1F1_human, POU1F1_chimp)
align.human.dog     <- Biostrings::pairwiseAlignment(POU1F1_human, POU1F1_dog)
align.human.cattle  <- Biostrings::pairwiseAlignment(POU1F1_human, POU1F1_cattle)
align.chimp.dog     <- Biostrings::pairwiseAlignment(POU1F1_chimp, POU1F1_dog)
align.chimp.cattle  <- Biostrings::pairwiseAlignment(POU1F1_chimp, POU1F1_cattle)
align.dog.cattle <- Biostrings::pairwiseAlignment(POU1F1_dog, POU1F1_cattle)

Build matrix

pids <- c(1,                  NA,     NA,     NA,
          Biostrings::pid(align.human.chimp),          1,     NA,     NA,
          Biostrings::pid(align.human.dog), Biostrings::pid(align.chimp.dog),      1,     NA,
          Biostrings::pid(align.human.cattle), Biostrings::pid(align.chimp.cattle), Biostrings::pid(align.dog.cattle), 1)

mat <- matrix(pids, nrow = 4, byrow = T)
row.names(mat) <- c("Homo","Chimp","dog","cattle")   
colnames(mat) <- c("Homo","Chimp","dog","cattle")   

pander::pander(mat)  
  Homo Chimp dog cattle
Homo 1 NA NA NA
Chimp 20.33 1 NA NA
dog 84.23 20.07 1 NA
cattle 88.33 20.43 91.07 1

PID methods comparison

Compare different PID methods. I did this for Humans vs. chimps

PID1 <- Biostrings::pid(align.human.chimp, type="PID1")
PID2 <- Biostrings::pid(align.human.chimp, type="PID2")
PID3 <- Biostrings::pid(align.human.chimp, type="PID3")
PID4 <- Biostrings::pid(align.human.chimp, type="PID4")

method <- c("PID1", "PID2", "PID3", "PID4")

PID <- c( PID1, PID2, PID3, PID4 )

pid.comparison <- data.frame( method, PID )

pander::pander(pid.comparison)
method PID
PID1 20.33
PID2 41.89
PID3 41.61
PID4 26.61

Multiple Sequence Alignment

MSA data preparation *my MSA package was not working so I just commented this out

For use with R bioinformatics tools we need to convert our named vector to a string set using Biostrings::AAStringSet(). Note the _ss tag at the end of the object we’re assigning the output to, which designates this as a string set.

POU1F1_vector_ss <- Biostrings::AAStringSet( POU1F1_vector )

Building Multiple Sequence Alignment (MSA)

POU1F1_align <- msa(POU1F1_vector_ss, method = "ClustalW")
## use default substitution matrix

Cleaning / Setting up an MSA

msa produces a species MSA object

class( POU1F1_align )
## [1] "MsaAAMultipleAlignment"
## attr(,"package")
## [1] "msa"
is( POU1F1_align )
## [1] "MsaAAMultipleAlignment" "AAMultipleAlignment"    "MsaMetaData"           
## [4] "MultipleAlignment"

Default output of MSA

POU1F1_align
## CLUSTAL 2.1  
## 
## Call:
##    msa(POU1F1_vector_ss, method = "ClustalW")
## 
## MsaAAMultipleAlignment with 10 rows and 317 columns
##      aln                                                   names
##  [1] MSCQAFTSADTFIPLNSDASATLPL...QREKRVKTSLNQSLFSISKEHLECR NP_001116229.1
##  [2] MSCQAFTSADTFIPLNSDASATLPL...QREKRVKTSLNQSLFSISKEHLECR XP_003893855.1
##  [3] MSCQAFTSADNFIPLNSDASATLPL...QREKRVKTSLNQSLFSISKEHLECR XP_003831560.1
##  [4] MSCQAFTSADTFIPLNSDASATLPL...QREKRVKTSLNQSLFSISKEHLECR NP_001036325.1
##  [5] MSCQAFTSADTFIPLNSDASATLPL...QREKRVKTSLNQSLFSISKEHLECR  XP_011782975.1 
##  [6] MSCQSFTSADTFITLNSDASAALPL...QREKRVKTSLNQSLFSISKEHLECR NP_032875.1
##  [7] MSCQPFTSADTFIPLNSDASAALPL...QREKRVKTSLNQSLFSISKEHLECR NP_037140.2
##  [8] MSCQPFTSTDTFIPLNSESSATLPL...QREKRVKTSLNQSLFTISKEHLECR NP_777004.1
##  [9] MSCQPFTSADTFLPLNSEASAALPL...QREKRVKTSLNQSLFTISKEHLECR NP_001006950.1
## [10] ------------------MYGKIIF...------------------------- XP_001145917.1
##  Con MSCQAFTSADTFIPLNSDASATLPL...QREKRVKTSLNQSLFSISKEHLECR Consensus

Change class of alignment

class(POU1F1_align) <- "AAMultipleAlignment"

Convert to seqinr format

POU1F1_align_seqinr <- msaConvert(POU1F1_align, type = "seqinr::alignment")

OPTIONAL: show output with print_msa

compbio4all::print_msa(POU1F1_align_seqinr)
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKC 0"
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMS-------------- 0"
## [1] "MSCQAFTSADNFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMS-------------- 0"
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMSTVPSILSLIQTPKC 0"
## [1] "MSCQAFTSADTFIPLNSDASATLPLIMHHSAAECLPVSNHATNVMS-------------- 0"
## [1] "MSCQSFTSADTFITLNSDASAALPLRMHHSAAECLPASNHATNVMS-------------- 0"
## [1] "MSCQPFTSADTFIPLNSDASAALPLRMHHNAAEGLPASNHATNVMS-------------- 0"
## [1] "MSCQPFTSTDTFIPLNSESSATLPLIMHPSAAECLPVSNHATNVMS-------------- 0"
## [1] "MSCQPFTSADTFLPLNSEASAALPLIMHPGAAECLPGSNHATNVVS-------------- 0"
## [1] "------------------MYGKIIFVLLLSAIVSISASSTTEVAMH-------------- 0"
## [1] " "
## [1] "LCTHFSVTTLGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIH 0"
## [1] "------------TATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIH 0"
## [1] "------------TATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPIH 0"
## [1] "LCTHFLVTTLGNTATGLHYSVPSCHYGNQPSTYGVMAGSLTLVFYKFPDHTLSHGFPPIH 0"
## [1] "------------TATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPLH 0"
## [1] "------------TATGLHYSVPSCHYGNQPSTYGVMAGSLTPCLYKFPDHTLSHGFPPLH 0"
## [1] "------------TATGLHYSVPSCHYGNQPSTYGVMAGTLTPCLYKFPDHTLSHGFPPLH 0"
## [1] "------------TATGLHYSVPFCHYGNQSSTYGVMAGSLTPCLYKFPDHTLSHGFPPMH 0"
## [1] "------------TATGLHYSVPSCHYGNQPSTYGVMAGGLTPCLYKFPEHGLGPGFPAAH 0"
## [1] "------------TSTSSVTKSYISSETSDKHKWDIYP--ATPRAHEVSEIYVTTVYPPEE 0"
## [1] " "
## [1] "QPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEDPAASEFKQELRRKSKLVEEPIDMDSPEIRELEQFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEDPTASEFKQELRRKSKLVEEPIDMDSPEIRELEQFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLSEDPTAADFKQELRRKSKLVEEPIDMDSPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "QPLLAEGPAVADFKQELRRRSKLAEEPVDTESPEIRELEKFANEFKVRRIKLGYTQTNVG 0"
## [1] "E-----------------------------NGEGVQLVHRFS-EPEITLIIFGVMAGVIG 0"
## [1] " "
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "EALAAVHGSEFSQTTICRFENLQLSFKNACKLKAILSKWLEEAEQVGALYNEKVGANERK 0"
## [1] "TILLIYY-------SICRLIK-----KSPSDVKPLPS----------------------- 0"
## [1] " "
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISVAAKDALERHFGEHSKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEHSKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEILRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "RKRRTTISIAAKDALERHFGEQNKPSSQEIMRMAEELNLEKEVVRVWFCNRRQREKRVKT 0"
## [1] "--PDTDVPLSSVEIENPETSDQ-------------------------------------- 0"
## [1] " "
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFSISKEHLECR 43"
## [1] "SLNQSLFTISKEHLECR 43"
## [1] "SLNQSLFTISKEHLECR 43"
## [1] "----------------- 43"
## [1] " "

Finished MSA

class(POU1F1_align) <- "AAMultipleAlignment"
ggmsa::ggmsa(POU1F1_align, start = 69, end = 150) 

Distance Matrix

Make a distance matrix

POU1F1_dist <- seqinr::dist.alignment(POU1F1_align_seqinr, 
                                       matrix = "identity")

This produces a “dist” class object

is( POU1F1_dist )
## [1] "dist"     "oldClass"
class( POU1F1_dist )
## [1] "dist"

Round for display

POU1F1_align_seqinr_rnd <- round(POU1F1_dist, 3)
POU1F1_align_seqinr_rnd
##                  NP_001116229.1 XP_003893855.1 XP_003831560.1 NP_001036325.1
## XP_003893855.1            0.000                                             
## XP_003831560.1            0.059          0.059                              
## NP_001036325.1            0.112          0.102          0.117               
##  XP_011782975.1           0.059          0.059          0.083          0.117
## NP_032875.1               0.211          0.211          0.219          0.234
## NP_037140.2               0.211          0.211          0.219          0.234
## NP_777004.1               0.194          0.194          0.203          0.219
## NP_001006950.1            0.287          0.287          0.293          0.305
## XP_001145917.1            0.916          0.916          0.916          0.920
##                   XP_011782975.1  NP_032875.1 NP_037140.2 NP_777004.1
## XP_003893855.1                                                       
## XP_003831560.1                                                       
## NP_001036325.1                                                       
##  XP_011782975.1                                                      
## NP_032875.1                 0.203                                    
## NP_037140.2                 0.203       0.155                        
## NP_777004.1                 0.194       0.275       0.269            
## NP_001006950.1              0.287       0.321       0.316       0.299
## XP_001145917.1              0.916       0.916       0.920       0.916
##                  NP_001006950.1
## XP_003893855.1                 
## XP_003831560.1                 
## NP_001036325.1                 
##  XP_011782975.1                
## NP_032875.1                    
## NP_037140.2                    
## NP_777004.1                    
## NP_001006950.1                 
## XP_001145917.1            0.923

Phylognetic trees of sequences

Build a phylogenetic tree from distance matrix

tree <- nj(POU1F1_align_seqinr_rnd)

Plotting phylogenetic trees

Plot the tree

plot.phylo(tree, main="POU1F1 Phylogenetic Tree", 
            use.edge.length = F)

mtext(text = "POU1F1 Phylogenetic Tree - rooted, no branch lengths")