Genbank.Gmorbida.denovo.rmd
Genbank.Gmorbida.denovo.rmd
1 Overarching Goal
Build a system that will allow GenBank-acceptable annotated genomes using the currently used tools for the Geosmitha project. This pipeline is broadly applicable for using several annotation systems to create a hierarchical annotated genome for submission to NCBI.
2 The challenge
Submitting a Eukaryotic genome is not easy. there is no magic button for this. For prokaryotic genomes, prokka offers a fairly streamlined system. For Eukaryotes, there is no analagous NCBI preparation tool available. What this means is that you have to finalize your own gff or 5 column table (*.tbl) file.
3 Current pipeline
I am basing this on the available G. morbida annotation, for which I have
Genome fasta
gff files, containing only gene -> CDS, for each of the genes, produced by Augustus/Maker and by Coding Quarry Pathogen Mode
gene to annotation mapping files for KEGG, COG, and EggNOG orthogroups
descriptions files for the orthogroups; for COG and EggNOG, some of the descriptions are not acceptable, too long, overly descriptive, or contain qualitative statements
From these available tools, I have to construct an annotation file (gff or tbl) which has
gene -> mRNA -> CDS and also includes
orthology-based funtional prediction
inference for each annotation
locus tags
4 Goal
Using the available tools, create a pipeline for appropriate gff creation. This should be broadly applicable to new annotations that we generate, thus I will avoid using any previous gff versions that I generated for G. morbida.
5 custom text manipulation funtions
a few functions for rapid column splitting and element harvesting
scSplit <- function(x){
strsplit(x,";")[[1]][1]
}6 Adding mRNA and corrrecting all lineages
There are two separate gff files created by CodingQuarry PM:
original genes
new PM genes
First, read them in and correct the algorithm column (V2) to correctly state that the main genes were generated using Maker_v2.31.8
gff.maker <- read.csv("gff/PredictedPass.gff3",sep = "\t",header = F,stringsAsFactors = F)
gff.maker$V2 <- "Maker_v2.31.8"Next, read in the gff for the pathogen mode (PM) genes, merge them into the original, and sort them by scaffold (alphanumeric A-Z) -> start location (small to large) -> element (alphanumeric Z-A)
gff.pm <- read.csv("gff/PGN_predictedPass.gff3",sep = "\t",header=F,stringsAsFactors = F )
gff.base <- rbind(gff.maker,gff.pm)
library(dplyr)
#note that "desc" is used to designate reverse alphabetical in dplyr, in this case to make sure that "gene" is always before "CD"
gff.base <- gff.base %>% arrange(V1,V4,desc(V3))The full gff is now properly sorted. Now we start to construct the attributes.
Isolate ID
Trim the ID name (remove “-mRNA-1” from the end of each)
Create an mRNA for each gene. These are identical to genes except that the name should start with “mRNA:”
6.1 fix the gene names
The gene names are too long for NCBI to accept. It is time to swap them out for the locus tags. Pull out the gene ID from each element, make a sub-table of genes only, and number them sequentially.
gff.base$ID <- unlist(lapply(gff.base$V9, scSplit))At this stage, I generate locus tags and use them also as gene-ID
7 Adding annotations to the CDS elements
Only CDS should have annotations. These take the form
product=metaxin;inference=protein motif:KEGG:K17776
Both of these are derived from KEGG, COG, and EggNOG mapping files.
Both KEGG and COG have separate naming files (which makes 5 tables in total).
First, import the three mapping files and two name files
map.kegg <- read.csv("annotations/kegg.map.csv", header = F)
map.kegg <- map.kegg[1:2]
#remove empty annotations
map.kegg <- map.kegg[!map.kegg$V2=="",]
colnames(map.kegg) <- c("gene", "KO")
map.cog <- read.csv("annotations/COG.map.csv",header = F)
#retain only the strongest COG hit
map.cog <- map.cog[!duplicated(map.cog$V1),]
map.cog <- map.cog[c(1,3)]
colnames(map.cog) <- c("gene", "COG")
map.en <- read.csv("annotations/eggnog.map.csv", header = F)
colnames(map.en) <- c("gene","NOG","description")
spSplit <- function(x){
strsplit(x," ")[[1]][1]
}
#importing and parsing the kegg description map
name.kegg.t <- read.csv("annotations/KO_Orthology_ko00001.txt",sep = "\t",header = F, stringsAsFactors = F)
name.kegg <- unlist(lapply(name.kegg.t$V4, spSplit))
spSplit <- function(x){
strsplit(x," ")[[1]][2]
}
df <- unlist(lapply(name.kegg.t$V4, spSplit))
name.kegg.t <- data.frame(name.kegg, df, stringsAsFactors = F)
cSplit <- function(x){
strsplit(x,"; ")[[1]][2]
}
name.kegg.t$V1 <- lapply(name.kegg.t$df, cSplit)
spSplit <- function(x){
strsplit(x," \\[")[[1]][1]
}
name.kegg.t$V2 <- lapply(name.kegg.t$V1, spSplit)
name.kegg <- name.kegg.t[c(1,4)]
colnames(name.kegg) <- c("KO", "description")
name.kegg <- name.kegg[!duplicated(name.kegg$KO),]
#import and parse the COG description file
name.cog.t <- read.csv("annotations/cognames2003-2014.tab.txt", sep = "\t")
name.cog <- name.cog.t[c(1,3)]
colnames(name.cog) <- c("COG", "description")7.1 cleaning up illegal characters
described in https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md, several characters must be escaped when they are used in descriptions
* ; semicolon (%3B)
* = equals (%3D)
* & ampersand (%26)
* , comma (%2C)
#many odd things here were detected in the GenBank discrepancy report upon initial submission
cleanfun <- function(x) {
x <- str_replace_all(x, ",", "%2C")
x <- str_replace_all(x, ";", "%3B")
x <- str_replace_all(x, "=", "%3D")
x <- str_replace_all(x, "_", " ")
x <- str_replace_all(x, "Inherit from COG: ", "")
x <- str_replace_all(x, "FOG: ", "")
x <- str_replace_all(x, "\\. ", "\\, ")
x <- str_replace_all(x, "a.k.a. ", "")
x <- str_replace_all(x, "the structure of ", "")
x <- str_replace_all(x, " homolog", "-like protein")
x <- str_replace_all(x, " homologue", "-like protein")
x <- str_replace_all(x, " activity", "")
x <- str_replace_all(x, " gene", "")
x <- str_replace_all(x, "Partial ", "")
str_replace_all(x, "&", "%26")
}
name.kegg$description <- cleanfun(name.kegg$description)
name.cog$description <- cleanfun(name.cog$description)
map.en$description <- cleanfun(map.en$description)All three files are now in place. Now I will make a map with a gff-friendly orthology for each, a single frame for each system.
Again, it should look like this:
product=metaxin;inference=protein motif:KEGG:K17776
*note that there is JUNK in the eggnog map, should be manually curated, look for non-canonical EggNOG orthology numbers and remove those rows
#merge the ortholgoy and name mapping files together and make O+description gff format
#kegg
map.kegg <- left_join(map.kegg, name.kegg)
df <- map.kegg
df$dummy <- "product"
df <- df %>% unite(product, dummy, description, sep = "=")
map.kegg <- df
df <- map.kegg
df$dummy <- "inference=protein motif:KEGG:"
df <- df %>% unite(inference, dummy, KO, sep="")
map.kegg <- df
df <- map.kegg
df <- df %>% unite(orthology, product, inference, sep=";")
map.kegg <- df
#COG
df <- left_join(map.cog, name.cog)
df$dummy <- "product"
df <- df %>% unite(product, dummy, description, sep = "=")
df$dummy <- "inference=protein motif:COG:"
df <- df %>% unite(inference, dummy, COG, sep="")
df <- df %>% unite(orthology, product, inference, sep=";")
map.cog <- df
#EggNOG
df <- map.en
df$dummy <- "product"
df <- df %>% unite(product, dummy, description, sep = "=")
df$dummy <- "inference=protein motif:EggNOG:"
df <- df %>% unite(inference, dummy, NOG, sep="")
df <- df %>% unite(orthology, product, inference, sep=";")
map.en <- dfwe now have our clean maps with
one row for each map
genes with no map excluded
mapped-to values are gff compatible
7.2 hierarchical annotation
Now is the tricky part. I want to annotate in preference of KEGG > COG > EggNOG.
I can handle this with an “is.in” type of list check
t <- gff.cds
#create a dummy column to receive annotation information
t$annot <- ""
spSplit <- function(x){
strsplit(x,";")[[1]][1]
}
#add kegg, this is the easy part
for (n in (1:nrow(t))) {
if (t[n,10] %in% map.kegg$gene) {
t[n,11] <- map.kegg[map.kegg$gene == t[n,10],2]
}
}
t2<-t
#add cog, requires two tests
for (n in (1:nrow(t2))) {
if (t2[n,10] %in% map.cog$gene & t2[n,11] == "") {
t2[n,11] <- map.cog[map.cog$gene == t2[n,10],2]
}
}
t3 <- t2
#add eggnog using same strategy
for (n in (1:nrow(t3))) {
if (t3[n,10] %in% map.en$gene & t3[n,11] == "") {
t3[n,11] <- map.en[map.en$gene == t3[n,10],2]
}
}
gff.cds <- t3
t <- gff.cds %>% unite(V9, V9, annot, sep=";")
t <- t[-10]
gff.cds <- t8 merge into full gff and sort properly
gff <- rbind(gff.gene, gff.mrna, gff.cds)
gff$V3 <- factor(gff$V3, levels=c("gene","mRNA","CDS"))
gff <- gff %>% arrange(V1,V4,V3)That’s mostly it. The next step is to write it and run through gff2asn. There will be many errors to manage after this point no doubt, but there is no way to other way handle.
write.table(gff,"gmorb2.gff", sep = "\t", col.names = F, row.names = F, quote = F)colnames have to be removed manually…
9 gff2asn
downloaded from: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF
Terminal command:
/home/robert/Tools/table2asn_GFF.Linux-3.10.0-1062.9.1.el7.x86_64-x86_64 \
-M n -J -c efswx -euk -a r10k -X E -augustus-fix \
-t template.sbt \
-gaps-min 10 -l paired-ends -j "[organism=Geosmithia morbida] [isolate=1262]" -i \
refgenome/G.morbida.refgenome.fa -f gmorb2.gff -o output_file.sqn -Z
9.1 initial revisions
“output_file.stats” provides an overview of suspicious issues with the validation. I manually reformatted the file output_file.stats so that I could read into here easily:
stats <- read.csv("output_file.stats",sep = "\t",header = F)
datatable(stats)Most of these are not something I am going to worry about. However there are 2 types of problems I should investigate (they are pseudogenes. First, import the full *.val file and filter down to the important problems. This is mostly just a list of entries, but I can parse out important information
val <- read.csv("output_file.val", sep = "\t",header = F, stringsAsFactors = F)
spSplit <- function(x){
strsplit(x,"\\[")[[1]][2]
}
val$code <- unlist(lapply(val$V1, spSplit))
spSplit <- function(x){
strsplit(x,"\\]")[[1]][1]
}
val$code <- unlist(lapply(val$code, spSplit))
val$dummy <- ":"
library(tidyr)
val <- val %>% unite(code, code, dummy, sep = "")
#make sub-stats table only the guys I will investigate
stats <- subset(stats, V3 == "investigate")
#then use the substats table to filter the val table
val <- subset(val, code %in% stats$V1)
write.csv(val, "val.curated.csv")The validation items now drop from 7800 to 28. Look into parsing it a bit, writing, and consider what to do next.
the only thing I altered was conversion of 14 genes into pseudogenes (manually)
9.2 run curated gff trhough tbl2asn
Run again through tbl2asn_gff and check the validation;
/home/robert/Tools/table2asn_GFF.Linux-3.10.0-1062.9.1.el7.x86_64-x86_64 \
-M n -J -c efswx -euk -a r10k -X E -augustus-fix \
-t template.sbt \
-gaps-min 10 -l paired-ends -j "[organism=Geosmithia morbida] [isolate=1262]" -i \
refgenome/G.morbida.refgenome.fa -f gmorb2_curate.gff -o output_file.sqn -Z
Validation looks fine.
9.3 Consider previous submission errors
The discrepancy report from NCBI on the previous submission was domianted by issues that this pipeline fixes (bad kegg names, data structure problems). The only thing I address now is the overlapping genes;
Remove
Scaffold 37 was flagged as pure contaminat. Remove completely prior to final submission.
There were also errors listed in the email they sent, some overlapping genes that I should remove.
FIND_OVERLAPPED_GENES: 4 genes completely overlapped by other genes
ORIG/output_file.sqn:Gene GMORB2_6716 lcl|scaffold_6:816662-817629 GMORB2_6716
ORIG/output_file.sqn:Gene GMORB2_6718 lcl|scaffold_6:c820276-817760 GMORB2_6718
ORIG/output_file.sqn:Gene GMORB2_2885 lcl|scaffold_15:58872-60434 GMORB2_2885
ORIG/output_file.sqn:Gene GMORB2_2887 lcl|scaffold_15:c61373-60757 GMORB2_2887
These four genes were removed manually from the gff
/home/robert/Tools/table2asn_GFF.Linux-3.10.0-1062.9.1.el7.x86_64-x86_64 \
-M n -J -c efswx -euk -a r10k -X E -augustus-fix \
-t template.sbt \
-gaps-min 10 -l paired-ends -j "[organism=Geosmithia morbida] [isolate=1262]" -i \
refgenome/G.morbida.refgenome.curated.fa -f gmorb2_curate.gff -o output_file.sqn -Z
Submitted this again on 2/19/2020, treating as a resubmission of previous.
9.4 1st submission results
Following submission (~20 minutes later) The Discrepancy.txt report is generated and must be manually interpreted, line by line!
Many inappropriate names, have to be dealt with sequentially
Parent for CDS uses the wrong separator! _ not ;. I fixed this in the pipeline.
SUSPECT_PRODUCT_NAMES: 124 features contain underscore… these I can fix in the pipeline also during description cleanup.
many other name errors like above; extensively revised the cleanup script (line 269) to deal with these.
Example messages I cannot interpret:
I cannot see anything wrong here:
CDS_HAS_NEW_EXCEPTION: 39 coding regions have new exceptions
ORIG/output_file.sqn:CDS trans-aconitate 3-methyltransferase [lcl|scaffold_0:c801355-801338, c801333-800371] GMORB2_0260
I cannot see anything wrong here:
These inconsistencies do not hold up upon inspection. Perhaps the actual stop is before the end of the gene? I could manually repair; this is gene-calling inconsistency
FEATURE_LOCATION_CONFLICT: 35 features have inconsistent gene locations.
FEATURE_LOCATION_CONFLICT: RNA feature location does not match gene location
ORIG/output_file.sqn:mRNA mRNA lcl|scaffold_0:c944231-943218 GMORB2_0310
ORIG/output_file.sqn:Gene GMORB2_0310 lcl|scaffold_0:c>944231-943215 GMORB2_0310
these make no sense at all:
GENE_PARTIAL_CONFLICT: mRNA feature partialness conflicts with gene on both ends
ORIG/output_file.sqn:mRNA mRNA lcl|scaffold_0:c250922-249185 GMORB2_0074
ORIG/output_file.sqn:Gene GMORB2_0074 lcl|scaffold_0:c>250922-<249185 GMORB2_0074
GENE_PARTIAL_CONFLICT: mRNA feature partialness conflicts with gene on 5' end
ORIG/output_file.sqn:mRNA mRNA lcl|scaffold_0:c52647-51893 GMORB2_0018
ORIG/output_file.sqn:Gene GMORB2_0018 lcl|scaffold_0:c>52647-51893 GMORB2_0018
GENE_PARTIAL_CONFLICT: mRNA feature partialness conflicts with gene on 3' end
ORIG/output_file.sqn:mRNA mRNA lcl|scaffold_0:c30780-28384 GMORB2_0010
ORIG/output_file.sqn:Gene GMORB2_0010 lcl|scaffold_0:c30780-<28384 GMORB2_0010
10 Addressing discrepancies
I revised the description cleanup script to remove many of these errors. After regenerating, I will have to return and remove the redundant genes and revise the pseudogenes again.
11 resubmission
on 2/19/20 (5pm). By 5:30pm it seems it made it past the automated screening process. That means that a person will tell me what finally needs repair.