Annotation overview

Pseudomonas sp. 273 was sequenced on the PacBio platform, and an external company provided a complete, single-contig genome assembly.

You will receive:

The general strategy is to:

  1. Annotate the genome with a standard Prokka run (v. 1.13.3). Prokka calls genes using Prodigal and annotates them against a database based on UniProt.
  2. Present summary statistics of the detected genes.
  3. Take the list of proteins (*.faa file) and annotate it via EggNOG-mapper (http://eggnogdb.embl.de/app/emapper#/app/home), which provides KO and COG annotations.

Annotation summary (gene count and stats)

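library(Biostrings)
library(reshape2)
# Read the Prokka protein FASTA into an AAStringSet
P273.prokka.biostring <- readAAStringSet("PROKKA_12032018/PROKKA_12032018.faa")
# FASTA headers have the form "<locus tag> <Prokka function>"
name <- as.character(names(P273.prokka.biostring))
sequence <- paste(P273.prokka.biostring)
length <- width(P273.prokka.biostring)  # protein length in amino acids
P273.prokka <- data.frame(name, length, sequence, stringsAsFactors = FALSE)
# Split each header at the first space into locus tag and function
name.split <- colsplit(P273.prokka$name, " ", c("locus","prokka_function"))
# Replace the raw header column with the two split columns
P273.prokka <- cbind(name.split, P273.prokka[-1])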
cat("### Subset of the annotations")

Subset of the annotations

knitr::kable(head(P273.prokka))
locus prokka_function length sequence
P273_00001 DNase CdiA 1899 MNNNGTLGALGALDLTSQGIANGADSLLFSGGDMTLRAASLSNRYGDLYSQGNLSFAGQDGGRAQSLSNRSGTIEAQGDIHLDVANLENTRDVFAFEQTTTFGEFDVQCGQHCGGHDSFKRGKIDVSETIEERITQNSPAAWLTAGGNLSIDADAVENRNSTIAANGDLTINANSLLNQGNTSRTGNKVVVINVVPGDYGKIPTGQWDAMENLARAFNNKMAAGTFDQDLYDQLWAIYNGDRWAIGTPVVSWSEDGAQSAPATLQAGNRVTLNVAHNLQNGTVSEYSQAQLTGQLAGSLLGGQLGTVNLTLNKQSSDAQARGPQTVQSVTHTAADGSQQVSFIPVDYTGVPFAAVDPTAADTFRLPQGQYGMFIRSPDPQSHYLIETNPALTDLGRFLNSDYLLGKLGFDPDQAWKRLGDGAYETRLIREAIQAQTGQRFLDGLTSDYDQFQYLMDNALAAKDALQLSVGVGLSAEQVAALTHDIVWMETRVVDGQQVLVPVVYLAQTDARNLRGGSLIQGRDLNLMAGGDLTNVGTLRASEDLTATAGGSILQGGLVDAGQRVSLLAGDSIRNALAGQIRGDQVDLTALKGDIVNDRTAVTAGIGGDEYRSFLDAGASISARSELSLDAGRDITNRGSLASGGDSYLGAGRDINLQAVTDASRLRDIQQGGHHVTTTTVAQNHGSSLTAGGDLVLDAGRDLNVVGSQASAKGDLTAAAGRDINLRAVEDAASVEVRSKTSSTRTVEQTGQTRQLGAQLTAGGDLVASAGQDLNLTASTISASNEAYLYATRDVNLQAAAETDSHALSKTKRSHGLLSSSEKKTEDTSLYTTQQGSLVSADKVAIRAGQDIGVSGSDVASTNGTSLLAGRNVLIDGATETSETSHAESKKKSGVMSSGGLGFTLGSASTQATQTNHNEQTRGSTIGSVLGNVDIQAGKDLTIRGSDVVAGKDINLIGQNVDILAAQNENRSEQTYKSKTSGLTLALSGSVGSAMDSGYQTAKQAKHEDDSRLSALQGIKAGLTGVQAWQAAQQGTEGGGVSQFFGISASLGSQKSSSKQTQEQSVSQGSSLTAGNNLNILATGAGKVGQDGDIRIQGSQLKAGNDVLLAANRDITLEAAANTQKLDGKNKSSGGAVGVSVGYSADNGVGLSIFANANQGSGKEVGTGTTWTETTLDAGNQVKLVSGRDTTLKGAQVNGEQIIANVGRDLTLQSLQDSDYYDSKQKNVSAGASVAIIGTGGSASVSASQSKIDSNYKSVQEQTGLYAGKGGFQIDVGNHTQLDGSVIASTAEAEKNRLSTGTLGWSSIDNKADYKSQQQSVSLSSGSDGSGKFISNMPSGMLVAYNHGDSASGTTGSAISSGTLEIRDPASQQQDVASLSRDVEHANGSISPIFDKEKEQNRLKQVQLIAEIGTQAMDIVRTQGEIEAAEEGRKELKTQGKDNPTQKELEATVAYQNVMREYGTGSDYQRAAQAVTAALQYLAGGDIGGAIAGASAPYIAHLIKQQTGDNDTARIMAQALLGAVVAGVQGNSSVAGGIGAATGELIAANLYPGKKPEDLTENERQIVSALSSLAAGMAGGLASGDTAGAVAAAGAGKTAVDNNFLSGDQAKAFDHEMQQCTKEGDCTKVIKKYVALNDENRELLKATCSEKPWVCYGNSRDFVLTGLNSADPSRPVSSGGIENDNVRLFVQYENSLDLQYINKNTDTLYKALVFASEPENFMLMFGGLANLTNASGTSIATGAGLSMAANGGVQLATGSTGDKFDWIGFMTSGVTGGMSAGQTLTPTLQTNIGGAYISSQLNGQSSLDAMMGAMIGASLGYGAGATITSQMEKNYVNKIFGLSRNSVNALKYSEAANFPGSYLLKETPMSPIPGILGGATGSVVSESSNNAVLNGANNGK
P273_00002 RNA 2’-phosphotransferase 182 MDTKLLNETSKFLSYVLRHEPQAIGLQLDSEGWANINALIAGAAKKGKNLDSEIIQKVVASSDKKRFSISSDGQRIRAVQGHSTPTVTLQHTEKEPPELLYHGTASRFLDSIKTQGLIPGARHYVHLSQDEQTAVEVGKRYGKPVILKIEALRMHRQGFKFFQAENGVWLADKIPANFILTE
P273_00003 hypothetical protein 70 MMRPDAKVKAVYLYPKPVDFRKSIDGLAALVELDIKVAVFDPVLFVFLERGLLYFSYSKADHSGRIRVLK
P273_00004 hypothetical protein 93 MATHNVVLPQPMEKSIDDLVSEGRYQNFSEVVRAGLRLLLEREAEESAKLVALRNATSSGIMQLETGRFVEIASEAQLEKYLGELGQLASSRQ
P273_00005 hypothetical protein 121 MSDPQFRLSLDAQTDLIDILRFTQVKFGEDVRRRYQGLLRAAFVSLSAESERAGSIAREQLETGLRSLHLLYCRSEAPNGRVDRPRHVVFYRLGHDQVIEIVRILHDAMEVERHLQKVPAG
P273_00006 Putative lipoprotein/NMB1164 225 MKMSQGLLLGVASAALLMVGGCATESSRALPVQQVESVGKPYSGVRSPIAVGKFDNRSSYMRGIFSDGVDRLGGQAKTILITHLQQTNRFNVLDRDNMSEIQQEAAIKGQAQRLKGADYVVTGDVTEFGRKEVGDRQLFGILGRGKTQVAYAKVALNIVNISTSEVVYSTQGAGEYELSNREIIGFGGTASYDSTLNGKVLDLAMREAVNKLVNAVDSGSWNPAR

6754 protein-encoding genes were detected.

Median gene length is 275 amino acids.
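
The code behind these two figures is not shown in the report. A minimal sketch of how they could be computed, assuming they come from the `P273.prokka` table built above (one row per protein, with `length` in amino acids). The three-row data frame is a made-up stand-in so the snippet runs on its own:

```r
# Hypothetical stand-in for P273.prokka (the real table has one row per Prokka protein).
toy.prokka <- data.frame(
  locus = c("X_00001", "X_00002", "X_00003"),
  length = c(182L, 70L, 93L),
  stringsAsFactors = FALSE
)

# Number of protein-encoding genes = number of rows in the table.
n.genes <- nrow(toy.prokka)

# Median protein length in amino acids.
median.length <- median(toy.prokka$length)

cat("There are", n.genes, "protein-encoding genes detected\n")
cat("Median gene length is", median.length, "amino acids.\n")
```

Replacing `toy.prokka` with `P273.prokka` should give the corresponding values for the real annotation.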

Adding annotations from EggNOG

The protein sequences were annotated against the EggNOG database, which provides cross-references to several other databases (GO, KEGG, BiGG, COG). The annotations are imported and joined to the annotation table by locus tag:

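library(dplyr)
# The EggNOG-mapper output is tab-separated and is read here without a header row
eggnog.full <- read.csv("PROKKA_12032018.faa.emapper.annotations", sep = "\t", header = FALSE, stringsAsFactors = FALSE)
colnames(eggnog.full) <- c("locus","Seed.Ortholog","evalue","score","Predicted.name","GO.terms","KEGG.KO","BiGG.reactions","tax.scope","eggNOG.OGs","best.OG","COG.Cat.","eggNOG.HMM.Desc.")
# Keep locus, GO.terms, KEGG.KO, eggNOG.OGs and eggNOG.HMM.Desc.
eggnog <- eggnog.full[c(1,6,7,10,13)]
# Left join keeps every Prokka protein, including those without an EggNOG hit
P273.prokka.eggnog <- dplyr::left_join(P273.prokka, eggnog, by = "locus")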
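
Because `left_join` keeps every Prokka locus, proteins that EggNOG-mapper could not assign simply carry `NA` in the new columns. A small sketch of how that coverage could be checked, using invented toy tables that share the key column `locus` (all values are hypothetical):

```r
library(dplyr)

# Toy stand-ins (hypothetical values) for P273.prokka and eggnog.
toy.prokka <- data.frame(locus = c("X_00001", "X_00002", "X_00003"),
                         length = c(182L, 70L, 93L),
                         stringsAsFactors = FALSE)
toy.eggnog <- data.frame(locus = c("X_00001", "X_00003"),
                         KEGG.KO = c("K00001", ""),
                         stringsAsFactors = FALSE)

joined <- dplyr::left_join(toy.prokka, toy.eggnog, by = "locus")

# Row count is unchanged as long as each locus appears at most once in the eggNOG table.
stopifnot(nrow(joined) == nrow(toy.prokka))

# Loci with no eggNOG hit have NA in the joined columns.
n.unannotated <- sum(is.na(joined$KEGG.KO))

# Loci with a hit but no KO assignment carry an empty string here.
n.no.ko <- sum(!is.na(joined$KEGG.KO) & joined$KEGG.KO == "")
```

The same two counts on the real `P273.prokka.eggnog` table would show how many proteins lack an EggNOG hit or a KO assignment.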

We now have, for each gene, annotations from:

  • Prokka (UniProt)
  • KO
  • GO
  • COG/NOG

The last step is to translate the KO terms into protein function descriptions:

library(dplyr)
# Lookup table mapping each KO identifier to its name
KO.to.name <- read.csv("../KO.to.name.map.csv", header = TRUE, stringsAsFactors = FALSE)
# Rename the key column so it matches the annotation table
colnames(KO.to.name)[1] <- "KEGG.KO"
P273.prokka.eggnog.KO <- dplyr::left_join(P273.prokka.eggnog, KO.to.name, by = "KEGG.KO")
# Reorder the columns into the order listed in the annotation table summary
P273.prokka.eggnog.KO <- P273.prokka.eggnog.KO[c(1,3,2,6,9,8,7,5,4)]
write.csv(P273.prokka.eggnog.KO, "P273.full.annotations.csv")
cat("# Full annotation is provided in csv and xlsx formats under the name P273.full.annotations")

Full annotation is provided in csv and xlsx formats under the name P273.full.annotations

This table now combines, for each protein, annotations from several databases with the protein sequence itself.
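
One caveat worth checking: the join on `KEGG.KO` is an exact string match, so a protein whose `KEGG.KO` field lists more than one KO would not find a name in the lookup table. Assuming such fields are comma-separated (this should be verified against the actual EggNOG-mapper output), one possible approach is to look up each KO separately and paste the names back together. The helper `ko.to.names`, the toy tables and the `KEGG.name` values below are illustrative only:

```r
# Toy lookup table (hypothetical names) and toy KEGG.KO column.
toy.map <- data.frame(KEGG.KO = c("K00001", "K00002"),
                      KEGG.name = c("name A", "name B"),
                      stringsAsFactors = FALSE)
toy.ko <- c("K00001", "K00001,K00002", "", NA)

# Translate one KEGG.KO field, which may hold several comma-separated KOs.
ko.to.names <- function(field, map) {
  if (is.na(field) || field == "") return(NA_character_)
  kos <- strsplit(field, ",", fixed = TRUE)[[1]]
  hits <- map$KEGG.name[match(kos, map$KEGG.KO)]
  hits <- hits[!is.na(hits)]
  if (length(hits) == 0) return(NA_character_)
  paste(hits, collapse = "; ")
}

# Apply the helper to every field; the result is a character vector.
toy.names <- vapply(toy.ko, ko.to.names, character(1), map = toy.map,
                    USE.NAMES = FALSE)
```

If no field in the real table contains a comma, the exact-match join above is already sufficient.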

Annotation table summary

Column names and descriptions:

  1. locus; gene locus tag, found at the beginning of gene and protein names
  2. length; protein length in amino acids (not nucleotides)
  3. prokka_function; name assigned by Prokka, broadly from the UniProt database. These names are usually less informative than those from the other systems.
  4. KEGG.KO; KEGG Orthology (KO) identifier. One or many can be uploaded at https://www.genome.jp/kegg/tool/map_pathway1.html to map to pathways
  5. KEGG.name; name/function associated with the KO
  6. eggNOG.HMM.Desc; description assigned by the eggNOG system
  7. eggNOG.OGs; orthologous groups in the eggNOG system, i.e. groups of related proteins assigned a function
  8. GO.terms; Gene Ontology (GO) terms, a controlled vocabulary from another database describing molecular function, biological process and cellular component
  9. sequence; the protein sequence
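
As an example of using the table, the KOs for the pathway-mapping tool mentioned under `KEGG.KO` could be collected roughly as follows. This is a sketch on a toy vector. In practice the input would be the `KEGG.KO` column of `P273.full.annotations.csv`. The comma separator and the output file name are assumptions, and the tool's expected input format should be checked before uploading:

```r
# Toy KEGG.KO column (hypothetical values), including empty and missing entries.
toy.ko <- c("K00001", "K00001,K00002", "", NA, "K00003")

# Split multi-KO fields, drop blanks and NA, and keep each KO once.
all.ko <- unlist(strsplit(toy.ko[!is.na(toy.ko)], ",", fixed = TRUE))
all.ko <- sort(unique(all.ko[all.ko != ""]))

# Write one KO per line to a text file.
out.file <- file.path(tempdir(), "P273.KO.list.txt")  # hypothetical file name
writeLines(all.ko, out.file)
```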