Genbank.Gmorbida.denovo.rmd

1 Overarching Goal
2 The challenge
3 Current pipeline
4 Goal
5 custom text manipulation funtions
6 Adding mRNA and corrrecting all lineages
- 6.1 fix the gene names
- 6.2 adding custom locus tags
7 Adding annotations to the CDS elements
- 7.1 cleaning up illegal characters
- 7.2 hierarchical annotation
8 merge into full gff and sort properly
9 gff2asn
10 Addressing discrepancies
11 resubmission

1 Overarching Goal

Build a system that will allow GenBank-acceptable annotated genomes using the currently used tools for the Geosmitha project. This pipeline is broadly applicable for using several annotation systems to create a hierarchical annotated genome for submission to NCBI.

2 The challenge

Submitting a Eukaryotic genome is not easy. there is no magic button for this. For prokaryotic genomes, prokka offers a fairly streamlined system. For Eukaryotes, there is no analagous NCBI preparation tool available. What this means is that you have to finalize your own gff or 5 column table (*.tbl) file.

3 Current pipeline

I am basing this on the available G. morbida annotation, for which I have

Genome fasta
gff files, containing only gene -> CDS, for each of the genes, produced by Augustus/Maker and by Coding Quarry Pathogen Mode
gene to annotation mapping files for KEGG, COG, and EggNOG orthogroups
descriptions files for the orthogroups; for COG and EggNOG, some of the descriptions are not acceptable, too long, overly descriptive, or contain qualitative statements

From these available tools, I have to construct an annotation file (gff or tbl) which has

gene -> mRNA -> CDS and also includes
orthology-based funtional prediction
inference for each annotation
locus tags

4 Goal

Using the available tools, create a pipeline for appropriate gff creation. This should be broadly applicable to new annotations that we generate, thus I will avoid using any previous gff versions that I generated for G. morbida.

5 custom text manipulation funtions

a few functions for rapid column splitting and element harvesting

scSplit <- function(x){
  strsplit(x,";")[[1]][1]
}

6 Adding mRNA and corrrecting all lineages

There are two separate gff files created by CodingQuarry PM:

original genes
new PM genes

First, read them in and correct the algorithm column (V2) to correctly state that the main genes were generated using Maker_v2.31.8

gff.maker <- read.csv("gff/PredictedPass.gff3",sep = "\t",header = F,stringsAsFactors = F)
gff.maker$V2 <- "Maker_v2.31.8"

Next, read in the gff for the pathogen mode (PM) genes, merge them into the original, and sort them by scaffold (alphanumeric A-Z) -> start location (small to large) -> element (alphanumeric Z-A)

gff.pm <- read.csv("gff/PGN_predictedPass.gff3",sep = "\t",header=F,stringsAsFactors = F )
gff.base <- rbind(gff.maker,gff.pm)

library(dplyr)
#note that "desc" is used to designate reverse alphabetical in dplyr, in this case to make sure that "gene" is always before "CD"
gff.base <- gff.base %>% arrange(V1,V4,desc(V3))

The full gff is now properly sorted. Now we start to construct the attributes.

Isolate ID
Trim the ID name (remove “-mRNA-1” from the end of each)
Create an mRNA for each gene. These are identical to genes except that the name should start with “mRNA:”

6.1 fix the gene names

The gene names are too long for NCBI to accept. It is time to swap them out for the locus tags. Pull out the gene ID from each element, make a sub-table of genes only, and number them sequentially.

gff.base$ID <- unlist(lapply(gff.base$V9, scSplit))

At this stage, I generate locus tags and use them also as gene-ID

6.2 adding custom locus tags

locus tags are added to the gene elements:

ID=scaffold_0-gene-0.2274;locus_tag=GMORB2_0001;

I already built a locus tag map, but the goal is to make an easily reproducible method here.

library(stringr)
library(tidyr)
gff.gene <- subset(gff.base, V3 == "gene")
t <- gff.gene
t$short <- "locus_tag=GMORB2_"
t$number <- 1
t$shorter <- "ID=GMORB2"
for (n in (1:nrow(t))) {
  t[n,12] <- str_pad(n, 4, pad = "0")
}
t <- t %>% unite(locus, short, number, sep = "")
t$number <- 1

for (n in (1:nrow(t))) {
  t[n,13] <- str_pad(n, 4, pad = "0")
}
t <- t %>% unite (shorter, shorter, number, sep = "_")
locus.map <- data.frame(t$ID, t$shorter)
colnames(locus.map) <- c("gene","new.name")
t <- t %>% unite (V9, shorter, locus, sep = ";")

t$ID <- str_replace(t$ID, "ID", "note")
t <- t %>% unite(V9, V9, ID, sep=";")
gff.gene <- t

the object “locus.map” can be used to fix CDS and mRNA frames

we make partial gff with only mRNA lines. It will be easier moving forward if we have three separate gffs, one each for gene, CDS, and mRNA. We can amend them as needed and then merge and order after the fact.

gff.mrna <- gff.gene
gff.mrna$V3 <- "mRNA"
scSplit <- function(x){
  strsplit(x,";")[[1]][1]
}
gff.mrna$ID <- unlist(lapply(gff.mrna$V9, scSplit))
gff.mrna$ID <- str_replace(gff.mrna$ID, "ID=", "ID=mRNA:")
scSplit <- function(x){
  strsplit(x,";")[[1]][2]
}
gff.mrna$parent <- unlist(lapply(gff.mrna$V9, scSplit))
gff.mrna$parent <- str_replace(gff.mrna$parent, "locus_tag", "Parent")
gff.mrna <- gff.mrna %>% unite(V9, ID, parent, sep = ";")

next we make sure the lineage of the CDS are correct

gff.cds <- subset(gff.base, V3 == "CDS")
colnames(gff.cds)[10] <- "gene"
gff.cds$gene <- str_replace(gff.cds$gene, "CDS:", "")
library(dplyr)
gff.cds <- left_join(gff.cds, locus.map)
gff.cds$new.name <- str_replace(gff.cds$new.name, "ID=", "ID=CDS:")
gff.cds$parent <- str_replace(gff.cds$new.name, "ID=CDS:", "Parent=mRNA:")
gff.cds <- gff.cds %>% unite(V9, new.name, parent, sep = ";")
gff.cds$gene <- str_replace(gff.cds$gene, "ID=","")
gff.cds <- gff.cds[c(1:8,10,9)]

All of the lineages are now corrected and names have been revised. I have three separate data frames, gene, mRNA, CDS.

that’s it.

7 Adding annotations to the CDS elements

Only CDS should have annotations. These take the form

product=metaxin;inference=protein motif:KEGG:K17776

Both of these are derived from KEGG, COG, and EggNOG mapping files.

Both KEGG and COG have separate naming files (which makes 5 tables in total).

First, import the three mapping files and two name files

map.kegg <- read.csv("annotations/kegg.map.csv", header = F)
map.kegg <- map.kegg[1:2]
#remove empty annotations
map.kegg <- map.kegg[!map.kegg$V2=="",]
colnames(map.kegg) <- c("gene", "KO")

map.cog <- read.csv("annotations/COG.map.csv",header = F)
#retain only the strongest COG hit
map.cog <- map.cog[!duplicated(map.cog$V1),]
map.cog <- map.cog[c(1,3)]
colnames(map.cog) <- c("gene", "COG")

map.en <- read.csv("annotations/eggnog.map.csv", header = F)
colnames(map.en) <- c("gene","NOG","description")

spSplit <- function(x){
  strsplit(x,"  ")[[1]][1]
}

#importing and parsing the kegg description map

name.kegg.t <- read.csv("annotations/KO_Orthology_ko00001.txt",sep = "\t",header = F, stringsAsFactors = F)
name.kegg <- unlist(lapply(name.kegg.t$V4, spSplit))

spSplit <- function(x){
  strsplit(x,"  ")[[1]][2]
}
df <- unlist(lapply(name.kegg.t$V4, spSplit))

name.kegg.t <- data.frame(name.kegg, df, stringsAsFactors = F)
cSplit <- function(x){
  strsplit(x,"; ")[[1]][2]
}

name.kegg.t$V1 <- lapply(name.kegg.t$df, cSplit)

spSplit <- function(x){
  strsplit(x," \\[")[[1]][1]
}

name.kegg.t$V2 <- lapply(name.kegg.t$V1, spSplit)

name.kegg <- name.kegg.t[c(1,4)]
colnames(name.kegg) <- c("KO", "description")
name.kegg <- name.kegg[!duplicated(name.kegg$KO),]

#import and parse the COG description file
name.cog.t <- read.csv("annotations/cognames2003-2014.tab.txt", sep = "\t")
name.cog <- name.cog.t[c(1,3)]
colnames(name.cog) <- c("COG", "description")

7.1 cleaning up illegal characters

described in https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md, several characters must be escaped when they are used in descriptions

* ; semicolon (%3B)

* = equals (%3D)

* & ampersand (%26)

* , comma (%2C)

#many odd things here were detected in the GenBank discrepancy report upon initial submission
cleanfun <- function(x) {
  x <- str_replace_all(x, ",", "%2C")
  x <- str_replace_all(x, ";", "%3B")
  x <- str_replace_all(x, "=", "%3D")
  x <- str_replace_all(x, "_", " ")
  x <- str_replace_all(x, "Inherit from COG: ", "")
  x <- str_replace_all(x, "FOG: ", "")
  x <- str_replace_all(x, "\\. ", "\\, ")
  x <- str_replace_all(x, "a.k.a. ", "")
  x <- str_replace_all(x, "the structure of ", "")
  x <- str_replace_all(x, " homolog", "-like protein")
  x <- str_replace_all(x, " homologue", "-like protein")
  x <- str_replace_all(x, " activity", "")
  x <- str_replace_all(x, " gene", "")
  x <- str_replace_all(x, "Partial ", "")
  str_replace_all(x, "&", "%26")
}
name.kegg$description <- cleanfun(name.kegg$description)
name.cog$description <- cleanfun(name.cog$description)
map.en$description <- cleanfun(map.en$description)

All three files are now in place. Now I will make a map with a gff-friendly orthology for each, a single frame for each system.

Again, it should look like this:

product=metaxin;inference=protein motif:KEGG:K17776

*note that there is JUNK in the eggnog map, should be manually curated, look for non-canonical EggNOG orthology numbers and remove those rows

#merge the ortholgoy and name mapping files together and make O+description gff format
#kegg
map.kegg <- left_join(map.kegg, name.kegg)
df <- map.kegg
df$dummy <- "product"
df <- df %>% unite(product, dummy, description, sep = "=")
map.kegg <- df
df <- map.kegg
df$dummy <- "inference=protein motif:KEGG:"
df <- df %>% unite(inference, dummy, KO, sep="")
map.kegg <- df
df <- map.kegg
df <- df %>% unite(orthology, product, inference, sep=";")
map.kegg <- df

#COG
df <- left_join(map.cog, name.cog)
df$dummy <- "product"
df <- df %>% unite(product, dummy, description, sep = "=")
df$dummy <- "inference=protein motif:COG:"
df <- df %>% unite(inference, dummy, COG, sep="")
df <- df %>% unite(orthology, product, inference, sep=";")
map.cog <- df

#EggNOG
df <- map.en
df$dummy <- "product"
df <- df %>% unite(product, dummy, description, sep = "=")
df$dummy <- "inference=protein motif:EggNOG:"
df <- df %>% unite(inference, dummy, NOG, sep="")
df <- df %>% unite(orthology, product, inference, sep=";")
map.en <- df

we now have our clean maps with

one row for each map
genes with no map excluded
mapped-to values are gff compatible

7.2 hierarchical annotation

Now is the tricky part. I want to annotate in preference of KEGG > COG > EggNOG.

I can handle this with an “is.in” type of list check

t <- gff.cds
#create a dummy column to receive annotation information
t$annot <- ""

spSplit <- function(x){
  strsplit(x,";")[[1]][1]
}

#add kegg, this is the easy part
for (n in (1:nrow(t))) {
  if (t[n,10] %in% map.kegg$gene) {
    t[n,11] <- map.kegg[map.kegg$gene == t[n,10],2]
  } 
}
t2<-t
#add cog, requires two tests
for (n in (1:nrow(t2))) {
  if (t2[n,10] %in% map.cog$gene & t2[n,11] == "") {
    t2[n,11] <- map.cog[map.cog$gene == t2[n,10],2]
  }
}
t3 <- t2
#add eggnog using same strategy
for (n in (1:nrow(t3))) {
  if (t3[n,10] %in% map.en$gene & t3[n,11] == "") {
    t3[n,11] <- map.en[map.en$gene == t3[n,10],2]
  }
}

gff.cds <- t3

t <- gff.cds %>% unite(V9, V9, annot, sep=";")
t <- t[-10]
gff.cds <- t

8 merge into full gff and sort properly

gff <- rbind(gff.gene, gff.mrna, gff.cds)
gff$V3 <- factor(gff$V3, levels=c("gene","mRNA","CDS"))
gff <- gff %>% arrange(V1,V4,V3)

That’s mostly it. The next step is to write it and run through gff2asn. There will be many errors to manage after this point no doubt, but there is no way to other way handle.

write.table(gff,"gmorb2.gff", sep = "\t", col.names = F, row.names = F, quote = F)

colnames have to be removed manually…

9 gff2asn

downloaded from: ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF

Terminal command:

/home/robert/Tools/table2asn_GFF.Linux-3.10.0-1062.9.1.el7.x86_64-x86_64 \
-M n -J -c efswx -euk -a r10k -X E -augustus-fix \
-t template.sbt \
-gaps-min 10 -l paired-ends -j "[organism=Geosmithia morbida] [isolate=1262]" -i \
refgenome/G.morbida.refgenome.fa -f gmorb2.gff -o output_file.sqn -Z

9.1 initial revisions

“output_file.stats” provides an overview of suspicious issues with the validation. I manually reformatted the file output_file.stats so that I could read into here easily:

stats <- read.csv("output_file.stats",sep = "\t",header = F)
datatable(stats)

Most of these are not something I am going to worry about. However there are 2 types of problems I should investigate (they are pseudogenes. First, import the full *.val file and filter down to the important problems. This is mostly just a list of entries, but I can parse out important information

val <- read.csv("output_file.val", sep = "\t",header = F, stringsAsFactors = F)
spSplit <- function(x){
  strsplit(x,"\\[")[[1]][2]
}
val$code <- unlist(lapply(val$V1, spSplit))
spSplit <- function(x){
  strsplit(x,"\\]")[[1]][1]
}
val$code <- unlist(lapply(val$code, spSplit))
val$dummy <- ":"
library(tidyr)
val <- val %>% unite(code, code, dummy, sep = "")

#make sub-stats table only the guys I will investigate
stats <- subset(stats, V3 == "investigate")

#then use the substats table to filter the val table
val <- subset(val, code %in% stats$V1)
write.csv(val, "val.curated.csv")

The validation items now drop from 7800 to 28. Look into parsing it a bit, writing, and consider what to do next.

the only thing I altered was conversion of 14 genes into pseudogenes (manually)

9.2 run curated gff trhough tbl2asn

Run again through tbl2asn_gff and check the validation;

/home/robert/Tools/table2asn_GFF.Linux-3.10.0-1062.9.1.el7.x86_64-x86_64 \
-M n -J -c efswx -euk -a r10k -X E -augustus-fix \
-t template.sbt \
-gaps-min 10 -l paired-ends -j "[organism=Geosmithia morbida] [isolate=1262]" -i \
refgenome/G.morbida.refgenome.fa -f gmorb2_curate.gff -o output_file.sqn -Z

Validation looks fine.

9.3 Consider previous submission errors

The discrepancy report from NCBI on the previous submission was domianted by issues that this pipeline fixes (bad kegg names, data structure problems). The only thing I address now is the overlapping genes;

Remove

Scaffold 37 was flagged as pure contaminat. Remove completely prior to final submission.

There were also errors listed in the email they sent, some overlapping genes that I should remove.

FIND_OVERLAPPED_GENES: 4 genes completely overlapped by other genes
ORIG/output_file.sqn:Gene   GMORB2_6716 lcl|scaffold_6:816662-817629    GMORB2_6716
ORIG/output_file.sqn:Gene   GMORB2_6718 lcl|scaffold_6:c820276-817760   GMORB2_6718
ORIG/output_file.sqn:Gene   GMORB2_2885 lcl|scaffold_15:58872-60434 GMORB2_2885
ORIG/output_file.sqn:Gene   GMORB2_2887 lcl|scaffold_15:c61373-60757    GMORB2_2887

These four genes were removed manually from the gff

/home/robert/Tools/table2asn_GFF.Linux-3.10.0-1062.9.1.el7.x86_64-x86_64 \
-M n -J -c efswx -euk -a r10k -X E -augustus-fix \
-t template.sbt \
-gaps-min 10 -l paired-ends -j "[organism=Geosmithia morbida] [isolate=1262]" -i \
refgenome/G.morbida.refgenome.curated.fa -f gmorb2_curate.gff -o output_file.sqn -Z

Submitted this again on 2/19/2020, treating as a resubmission of previous.

9.4 1st submission results

Following submission (~20 minutes later) The Discrepancy.txt report is generated and must be manually interpreted, line by line!

Many inappropriate names, have to be dealt with sequentially
Parent for CDS uses the wrong separator! _ not ;. I fixed this in the pipeline.
SUSPECT_PRODUCT_NAMES: 124 features contain underscore… these I can fix in the pipeline also during description cleanup.
many other name errors like above; extensively revised the cleanup script (line 269) to deal with these.

Example messages I cannot interpret:

I cannot see anything wrong here:
CDS_HAS_NEW_EXCEPTION: 39 coding regions have new exceptions
ORIG/output_file.sqn:CDS    trans-aconitate 3-methyltransferase [lcl|scaffold_0:c801355-801338, c801333-800371] GMORB2_0260

I cannot see anything wrong here:
These inconsistencies do not hold up upon inspection.  Perhaps the actual stop is before the end of the gene? I could manually repair; this is gene-calling inconsistency
FEATURE_LOCATION_CONFLICT: 35 features have inconsistent gene locations.
FEATURE_LOCATION_CONFLICT: RNA feature location does not match gene location
ORIG/output_file.sqn:mRNA   mRNA    lcl|scaffold_0:c944231-943218   GMORB2_0310
ORIG/output_file.sqn:Gene   GMORB2_0310 lcl|scaffold_0:c>944231-943215  GMORB2_0310

these make no sense at all:
GENE_PARTIAL_CONFLICT: mRNA feature partialness conflicts with gene on both ends
ORIG/output_file.sqn:mRNA   mRNA    lcl|scaffold_0:c250922-249185   GMORB2_0074
ORIG/output_file.sqn:Gene   GMORB2_0074 lcl|scaffold_0:c>250922-<249185 GMORB2_0074

GENE_PARTIAL_CONFLICT: mRNA feature partialness conflicts with gene on 5' end
ORIG/output_file.sqn:mRNA   mRNA    lcl|scaffold_0:c52647-51893 GMORB2_0018
ORIG/output_file.sqn:Gene   GMORB2_0018 lcl|scaffold_0:c>52647-51893    GMORB2_0018

GENE_PARTIAL_CONFLICT: mRNA feature partialness conflicts with gene on 3' end
ORIG/output_file.sqn:mRNA   mRNA    lcl|scaffold_0:c30780-28384 GMORB2_0010
ORIG/output_file.sqn:Gene   GMORB2_0010 lcl|scaffold_0:c30780-<28384    GMORB2_0010

10 Addressing discrepancies

I revised the description cleanup script to remove many of these errors. After regenerating, I will have to return and remove the redundant genes and revise the pseudogenes again.

11 resubmission

on 2/19/20 (5pm). By 5:30pm it seems it made it past the automated screening process. That means that a person will tell me what finally needs repair.