New Geosmithia assemblies

Currently there are 12 Geosmithia genomes that we would like to send through a comparative genomics pipeline, with the goal of identifying pathogenicity factors in the pathogenic strains.

strain.details <- readxl::read_xlsx("../report/strain.table.200522.xlsx")
strain.details %>%
  kable() %>%
  kable_styling()
species/strain designation raw reads assembled code
Geosmithia sp. 2 yes yes KW1
Geosmithia sp. 10 yes yes 5BS4W
Geosmithia sp. 21 yes yes 3BHS13
Geosmithia sp. 23 yes yes 4LW11
Geosmithia sp. 41 yes yes 4BE20
Geosmithia obscura yes no 6BE2
Geosmithia lavendula yes no na
Geosmithia omnicola strain 1 yes yes na
Geosmithia omnicola strain 2a yes yes na
Geosmithia pallida strain 3 yes yes na
Geosmithia pallida strain 4 yes yes na
Geosmithia pallida strain 6 yes yes na
Geosmithia pallida strain 20a yes yes na
Geosmithia pallida strain 153 yes yes na
Geosmithia morbida yes yes na
Geosmithia putterilli yes yes na
Geosmithia flava yes yes na

I have to assemble two of these WGS projects (G. obscura and G. lavendula) using the same pipeline that was already used to assemble sp. 2 through sp. 41.

The previously applied pipeline (by Meg Staton et al.) consisted of:

library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Raw WGS Reads'
      [2]: 'Apply Illumina error correction tool: BFC correction '
      [3]: 'Remove adapter contamination: Skewer'
      [4]: 'Assemble de novo: ABySS'
      [5]: 'Draft Genome'
      ")

1 Installing the programs

You might want to create a separate conda environment first.

apt-get is not available on macOS, but other installation routes exist; see the ABySS GitHub page.

conda install -c bioconda skewer bfc
sudo apt-get install abyss
sudo apt-get install openmpi-bin
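
If you prefer to avoid apt-get entirely, everything used below is also available from bioconda, so a single dedicated environment works as well (a sketch; the environment name is arbitrary):

# one-stop conda setup: skewer, bfc, and abyss are all bioconda packages
conda create -n geosmithia-assembly -c bioconda -c conda-forge skewer bfc abyss
conda activate geosmithia-assembly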

1.1 BFC error correction

I don't have the original script for this, but there are a few options available.

General instructions are here: https://github.com/lh3/bfc

#!/bin/bash

mkdir ../corrected.reads

# bfc flags: -s = approximate genome size (here ~28 Mb), -t = number of threads
bfc -s 28m -t14 ../raw.data/G_obscura/6BE2_S14_L003_R1_001.fastq.gz \
> ../corrected.reads/G_obscura_corr_R1.fastq

bfc -s 28m -t14 ../raw.data/G_obscura/6BE2_S14_L003_R2_001.fastq.gz \
> ../corrected.reads/G_obscura_corr_R2.fastq

bfc -s 28m -t14 ../raw.data/G_lavendula/G_lavendula_S25_L003_R1_001.fastq.gz \
> ../corrected.reads/G_lavendula_corr_R1.fastq

bfc -s 28m -t14 ../raw.data/G_lavendula/G_lavendula_S25_L003_R2_001.fastq.gz \
> ../corrected.reads/G_lavendula_corr_R2.fastq

1.2 Skewer

Skewer only scans for and removes adapter contamination. You must provide the adapter sequences. I have prepared a combined adapter sequences file, which can be downloaded here: https://github.com/rwmurdoch/project.scripts/blob/master/IlluminaClip.fa

#!/bin/bash

mkdir ../trimmed.reads

# skewer flags: -x = adapter file, -l = minimum read length kept after trimming,
# -t = threads, -o = output prefix (produces <prefix>-trimmed-pair1/2.fastq)
skewer \
-x ../../../resources/Trimmomatic-0.36/adapters/IlluminaClip.fa \
-l 30 \
-t 14 \
../corrected.reads/G_obscura_corr_R1.fastq \
../corrected.reads/G_obscura_corr_R2.fastq \
-o ../trimmed.reads/G_obscura

skewer \
-x ../../../resources/Trimmomatic-0.36/adapters/IlluminaClip.fa \
-l 30 \
-t 14 \
../corrected.reads/G_lavendula_corr_R1.fastq \
../corrected.reads/G_lavendula_corr_R2.fastq \
-o ../trimmed.reads/G_lavendula

1.3 ABySS

Assembly was done with ABySS, a resource-efficient multithreaded assembler. I'm not sure why it was chosen, but I have to move forward with the same pipeline.

#!/bin/bash

cd ../new.assemblies

# abyss-pe parameters: name = output prefix, k = k-mer size;
# stdout (the abyss-pe log) is captured to a file
abyss-pe \
name=G_obscura k=81 \
in='../trimmed.reads/G_obscura-trimmed-pair1.fastq ../trimmed.reads/G_obscura-trimmed-pair2.fastq' \
> G_obscura

abyss-pe \
name=G_lavendula k=81 \
in='../trimmed.reads/G_lavendula-trimmed-pair1.fastq ../trimmed.reads/G_lavendula-trimmed-pair2.fastq' \
> G_lavendula

I ran into a problem where I could not get the program to multithread; I'm sure this could be worked around or solved with local compilation. Nevertheless, each assembly completed within an hour or two, so it was not an issue.
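
For reference, ABySS parallelism is normally controlled through parameters passed to abyss-pe rather than through separate flags; the sketch below assumes an MPI-enabled ABySS build and otherwise mirrors the G. obscura command above:

# j = number of parallel make jobs (threads); np = number of MPI ranks
# for the distributed de Bruijn graph stage (needs an MPI-enabled build)
abyss-pe \
name=G_obscura k=81 j=14 np=14 \
in='../trimmed.reads/G_obscura-trimmed-pair1.fastq ../trimmed.reads/G_obscura-trimmed-pair2.fastq' \
> G_obscura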

2 Quality assessment

Next, all genomes are subjected to QUAST, which calculates basic assembly statistics.

BUSCO, which estimates genome completeness and can flag contamination, should ideally come after the gene-calling pipeline is complete (it has a built-in Augustus step, but that is not adequate here).
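
For reference, once a predicted proteome exists, a protein-mode BUSCO run looks roughly like this; the lineage dataset name (hypocreales_odb10) is an assumption and should be checked against the lineages BUSCO actually offers:

# score a predicted proteome against a lineage's single-copy orthologs
busco -i genome.all.maker.non_overlapping_ab_initio.proteins.fasta \
      -l hypocreales_odb10 -m proteins -o busco_G_obscura -c 14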

G. morbida is included in these analyses.

It is expected that the G. morbida genome will be higher quality since it was also informed by a mate-pair library and more careful assembly procedures.

2.1 QUAST

These packages seem incompatible with something else in the base conda environment, so I create a fresh env for them.

To install:

conda create --name genomeqc
conda activate genomeqc
conda install -c bioconda quast
conda install -c bioconda -c conda-forge busco=4.0.6 #note that this is not used now!

The easiest way to run QUAST is with a simple, brute-force script:

#!/bin/bash

mkdir ../genome.qc

quast \
../G_assemblies/G.morbida.refgenome.fa \
../G_assemblies/G_lavendula.fa \
../G_assemblies/G_obscura.fa \
../G_assemblies/G_omnicola_1.fa \
../G_assemblies/G_omnicola_2a.fa \
../G_assemblies/G_pallida_3.fa \
../G_assemblies/G_pallida_4.fa \
../G_assemblies/G_pallida_6.fa \
../G_assemblies/G_pallida_18a.fa \
../G_assemblies/G_pallida_19a.fa \
../G_assemblies/G_pallida_20a.fa \
../G_assemblies/G_pallida_153.fa \
../G_assemblies/G_sp2.fa \
../G_assemblies/G_sp10.fa \
../G_assemblies/G_sp21.fa \
../G_assemblies/G_sp23.fa \
../G_assemblies/G_sp41.fa \
-t 14 \
-o ../genome.qc

This produces several report files. We are most concerned with spotting any particularly questionable assemblies at this stage, as judged by deviations in total size and GC content, for example.

We will only consider contigs of 1000 bp and larger; smaller contigs will be thrown out at this stage.
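
If you want QUAST's reported statistics to respect the same cutoff directly, the threshold can be passed on the command line (a sketch; --min-contig defaults to 500 bp in recent QUAST versions):

# only contigs >= 1000 bp contribute to the reported statistics
quast --min-contig 1000 -t 14 -o ../genome.qc ../G_assemblies/*.fa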

quast <- read.csv("../genome.qc/transposed_report.tsv", sep = "\t", header = T)
quast <- quast[c(1,3,9,15,17,18,20)]
colnames(quast) <- c("genome","contigs","total.length","largest.contig","GC","N50","L50")
quast %>%
  kable() %>%
  kable_styling()
genome contigs total.length largest.contig GC N50 L50
G.morbida.refgenome 73 26549069 2597956 54.31 1305468 7
G_lavendula 98 24695666 1441847 54.56 560687 16
G_obscura 256 28388349 980739 52.07 333790 27
G_omnicola_1 2406 28053598 91915 53.47 18218 459
G_omnicola_2a 3221 27939781 66216 53.50 13457 656
G_pallida_3 116 4995752 269129 65.88 84092 20
G_pallida_4 877 27324476 507614 58.39 71952 109
G_pallida_6 853 27195544 396931 58.93 71839 104
G_pallida_18a 2344 27016781 159161 58.49 18616 430
G_pallida_19a 1931 26762913 125295 58.84 24553 321
G_pallida_20a 813 26979842 350886 58.60 62040 135
G_pallida_153 5014 26942791 51726 53.33 7693 1049
G_sp2 90 27258288 3834359 59.04 1527339 6
G_sp10 130 26648617 2762688 58.74 746394 10
G_sp21 77 27425179 2322932 58.96 1074268 9
G_sp23 101 25783886 1305060 58.69 567610 15
G_sp41 57 26100254 2159669 59.31 810755 10
library(ggplot2)

df <- quast
plot <- ggplot(df, aes(x = "", y = total.length)) +
  geom_jitter(size = 2,
              position = position_jitter(width = 0.1, seed = 23),
              alpha = 1) +
  # no colour aesthetic is mapped, so this manual scale is currently inert
  scale_color_manual(values = c("#c7c7c7", "#ff0000")) +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  geom_boxplot(outlier.alpha = 0, alpha = .1, width = 0.1) +
  # white point marks the mean assembly length
  stat_summary(fun = mean, colour = "black", geom = "point",
               shape = 21, fill = "white", size = 2, show.legend = FALSE) +
  labs(y = "total assembly length (bp)",
       x = "") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  labs(color = "remove") +
  ylim(0, 3e7)

#ggsave(file="", width=10, height=8)

plot(plot)

The single outlier in the length plot is G_pallida_3, which also shows an aberrant GC%.

2.2 Interpretation

First, contigs smaller than 1000 bp will be thrown out (data not shown, but some assemblies contain thousands of such contigs).

Second, G_pallida_3 clearly must be thrown out, given its small size and aberrant GC%. However, I will let it run through annotation up to the BUSCO analysis in order to confirm this.

Contig count and N50 values vary greatly, but there is no avoiding this short of using a different assembly program. Given the lack of mate pairs, I am actually surprised by how contiguous some of these assemblies are, appearing roughly equivalent to the G. morbida reference genome. Assemblies of thousands of contigs are not surprising.

I will move forward on the reasonable assumption that assembly problems are confined to repetitive regions rather than protein-coding regions, so the comparative genomics should not be adversely affected. However, I will examine the relationship between contig count, genome length, and predicted gene count to assess this.

3 Contig renaming

The contigs at this stage are poorly named and carry useless extraneous information. I will now simply rename every contig to "scaffold_x", where "x" is an arbitrary number within the assembly, and remove all contigs smaller than 1000 bp.

library(Biostrings)
glist <- list.files("../G_assemblies/")

for (x in seq_along(glist)) {
  genome <- readDNAStringSet(paste0("../G_assemblies/", glist[x]))
  # drop contigs shorter than 1000 bp and sort the remainder by descending length
  genome <- genome[width(genome) >= 1000]
  genome <- genome[order(-width(genome))]
  # rename contigs to scaffold_1 ... scaffold_n
  names(genome) <- paste0("scaffold_", seq_along(genome))
  writeXStringSet(genome, paste0("../revised_G_assemblies/", glist[x]))
}
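
As a quick sanity check that the filtering and renaming behaved, contig counts can be compared before and after (paths follow the QUAST listing above):

# count fasta headers in the original vs. revised assembly
grep -c ">" ../G_assemblies/G_obscura.fa ../revised_G_assemblies/G_obscura.fa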

4 Gene prediction

This is done in as similar a manner as possible to what was done for G. morbida by Shuelke in 2016:

  • MAKER is the main program

  • The SNAP and Augustus gene callers are used within MAKER

  • SNAP is trained using CEGMA v2.5

  • Augustus is trained using Fusarium solani proteins downloaded from Ensembl Fungi

4.1 Conceptual overview

library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Assembled genome'
      [2]: 'detect single copy conserved genes (SCCG) with CEGMA 2.5'
      [3]: 'train SNAP with SCCG'
      [4]: 'train Augustus with Fusarium solani'
      [5]: 'use MAKER to predict genes using SNAP model and trained Augustus'
      ")

4.3 Installing

Skip this section unless you are struggling to install Maker yourself

4.3.1 The conda version of Maker doesn't work!

Maker must run in a python 2.7 environment:

conda create -n maker python=2.7
conda activate maker
conda install -c bioconda maker

None of this ran for me; it gave a nonsensical error message claiming that my CUDA version should be 10.1 when it already was 10.1.

Instead, I downloaded and installed MAKER locally. Start with the master README file and go from there. Installation requires many dependencies of all sorts, so it is unfortunate that the conda version doesn't work!

Tutorial is here: http://gmod.org/wiki/MAKER_Tutorial#Getting_Started_with_MAKER

Currently, I have:

  • MAKER installed manually (still needs to be added to the .bashrc PATH)

  • SNAP installed in the base conda environment

  • Augustus installed by MAKER, somewhere; it also needs to be on the PATH

  • CEGMA installed via the VM available at http://korflab.ucdavis.edu/datasets/cegma/#SCT4 (described later)

Overall I’m shocked by how obtuse this program (MAKER) is.

4.4 basic usage

Working from this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Data_for_Tutorial_Examples, I ran through tutorial 1.

First, you run the maker executable in your project directory via

maker -CTL

which creates three "control" (.ctl) files: maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl.

The maker_exe.ctl file holds the locations of all executables. In my case, many of these had to be entered manually. For example, my Augustus executable is lost somewhere on my system; it was supposedly installed during the MAKER installation, but I cannot find it. Just install it again via:

sudo apt-get install -y augustus
sudo apt-get install ncbi-blast+

The MAKER installation of RepeatMasker is also bugged. Download and install it using these instructions: http://www.repeatmasker.org/RMDownload.html

Upon re-creating the CTL files, MAKER now knows where Augustus is. Other executables may be required depending on which algorithms you wish to run.

The maker_opts.ctl file is where the action is: it is where you define what you want to do and how (a sketch of the key entries is shown below).
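
For orientation, these are the maker_opts.ctl entries that matter most for a run like tutorial 1. The file names follow the tutorial's dpp example and should be treated as placeholders; everything not listed stays at its default:

genome=dpp_contig.fasta   # assembly to annotate
est=dpp_est.fasta         # EST evidence (tutorial 1 is EST-guided)
est2genome=1              # build gene models directly from the EST alignments
protein=                  # protein evidence (not used in tutorial 1)
model_org=all             # RepeatMasker model organism
cpus=14                   # CPUs handed to BLAST and RepeatMasker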

RepeatMasker required an additional perl module in my case; check the log files carefully!

These scripts: https://www.dropbox.com/sh/07w4oxx0zs0v92c/AABye81r5qNJXl5ap_UUZaD6a?dl=1 worked for tutorial 1, which is an EST-guided gene prediction.

Some additional commands can be used to merge all of the gff and fasta files together:

fasta_merge -d dpp_contig_master_datastore_index.log
gff3_merge -d dpp_contig_master_datastore_index.log

These are MAKER executables, FYI; the merging can also be done in other ways.

4.4.1 Warning about Maker

MAKER is awkward. The easiest way to get running seems to be to install the conda package, which, as of 5/26, still had broken RepeatMasker libraries. This can be solved by downloading the latest RepeatMasker release and copying all files from its Library directory into the RepeatMasker library of the conda version, which lives somewhere along the lines of "~/miniconda3/envs/maker/share/repeatmasker/library" (I'm not sure about the last few directories, but that should get you close). From that point, everything should work okay.

4.4.2 Maker computational dimensions

The first round mainly runs BLAST and RepeatMasker. Ultimately it's not so clear where the time goes, but blastx is computationally demanding and does a lot of reading and writing, while nothing is terribly RAM-intensive. The demands of RepeatMasker aren't clear at all, even though it clearly takes up ~10-20% of the runtime. So the bottlenecks are CPU and read/write; if you wanted to get tricky, you could re-route the BLAST executables to DIAMOND alternatives and save a great deal of time.

I have chosen to use all of SwissProt to guide gene prediction, which leads to heavy blastx use and in turn surely dominates the runtime.

5 Reconstructing the gene prediction pipeline

Initial runs using current gene prediction methods gave G. morbida ~2k more genes than our current annotation. That would lead to big problems down the line.

MAKER suggests using all of SwissProt for an initial ab initio gene prediction. We do not start with this; rather, we skip ahead to SNAP and Augustus.

BUSCO is now the standard for detecting conserved genes in eukaryotes; CEGMA was the previous standard tool for this, but it is no longer supported.

The aim is to reconstruct their original gene prediction pipeline more rigidly (https://peerj.com/articles/1952/?utm_source=TrendMD&utm_campaign=PeerJ_TrendMD_1&utm_medium=TrendMD):

  • Maker v 2.31.8 (I used 2.31.10, not likely a problem)

  • SNAP was trained using gff files generated by CEGMA v2.5. My SNAP had been trained using an initial SwissProt-informed ab initio prediction. This is likely a key difference, as CEGMA training should produce a much narrower prediction.

  • Augustus was trained with Fusarium solani. I had used Augustus pre-trained on Fusarium graminearum.

  • They made no initial swissprot prediction. I feel that this was a mistake, but the goal here is to replicate, not improve.

The expected outcome is that:

  • By using CEGMA to train SNAP (rather than SwissProt, as described in the MAKER tutorials), SNAP will be much narrower, since it is looking only for single-copy conserved genes (SCCG).

  • With a different species used to train Augustus, the outcome is not clear; there is no particular reason to think this will change anything notably.

5.1 Augustus training

This looks like it takes time to set up locally, so I’m using a web service: http://bioinf.uni-greifswald.de/webaugustus/

Genome files downloaded from: ftp://ftp.ensemblgenomes.org/pub/fungi/release-47/fasta/fusarium_solani/dna/

protein file: ftp://ftp.ensemblgenomes.org/pub/release-47/fungi/fasta/fusarium_solani/pep/

5.2 running Maker

Copy ONLY the genome and the SNAP model into a new folder structure.

Now we use a SNAP hmm (which is essentially a gene prediction model) and the trained Augustus to run Maker (no pre-run with SwissProt).

script = maker.v2.sh
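
The script maker.v2.sh is not reproduced here. Conceptually, the maker_opts.ctl entries it depends on would look roughly like the sketch below; the paths and the Augustus species label are assumptions (the label is whatever name the WebAUGUSTUS training assigned):

genome=../revised_G_assemblies/G_obscura.fa  # renamed assembly (hypothetical path)
snaphmm=../snapmodel.hmm                     # SNAP model trained on the CEGMA output
augustus_species=fusarium_solani_geosmithia  # hypothetical WebAUGUSTUS-trained model name
protein=                                     # no SwissProt pre-run this time
keep_preds=1                                 # keep ab initio predictions lacking evidence support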

5.3 CEGMA Virtual Machine installation and operation

Manual installation of CEGMA is NOT ADVISED (don't waste your time and sanity).

  1. Install Oracle VM VirtualBox

  2. Download the CEGMA 2.5 VM from http://korflab.ucdavis.edu/Datasets/cegma/cegma_vm.html

  3. Configure the VM's RAM, CPUs, and video RAM, and designate the project folder as a shared folder with the mount point /home/ubuntu/Work

  4. Deal with permission issues in the VM via 'sudo adduser $USER vboxsf'

  5. Activate CEGMA in the VM via 'source .bash_profile'

  6. Download the "core eukaryotic proteins" set from http://korflab.ucdavis.edu/datasets/cegma/#SCT5 and place it in a reasonable location in the shared directory

  7. Run CEGMA with: cegma -g genome.fa -p ../core.faa -T 4

One run finished within ~2-3 hours.
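
Since every genome needs its own CEGMA run, a simple loop over the revised assemblies keeps things tidy. This is only a sketch reusing the flags above; the shared-folder layout inside the VM is an assumption:

# run CEGMA on each revised assembly in its own output directory
cd /home/ubuntu/Work
for g in revised_G_assemblies/*.fa; do
    d=cegma_$(basename "$g" .fa)
    mkdir -p "$d" && cd "$d"
    cegma -g "../$g" -p ../core.faa -T 4
    cd ..
done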

5.4 train SNAP with CEGMA proteins

script= maker.cegma.to.snap.sh

for f in ../revised_G_assemblies/*
    do
    d=${f##*/}
    echo $d
    cd /home/robert/Dropbox/Projects/Geosmithia.comparative/revised_G_assemblies/$d
    # drop the CEGMA-informed maker control files into this genome's directory
    cp /home/robert/Dropbox/maker.scripts/cegma.snap/* \
        /home/robert/Dropbox/Projects/Geosmithia.comparative/revised_G_assemblies/$d
    maker -fix_nucleotides
    mkdir snap
    cd snap
    # merge the per-contig gffs, convert them to SNAP's ZFF format,
    # and build the snapmodel.hmm gene prediction model
    gff3_merge -d ../genome.maker.output/genome_master_datastore_index.log
    maker2zff -n genome.all.gff
    fathom -categorize 1000 genome.ann genome.dna
    fathom -export 1000 -plus uni.ann uni.dna
    forge export.ann export.dna
    hmm-assembler.pl $d . > ../snapmodel.hmm
done

The MAKER control files were altered to point to the CEGMA protein output, output.cegma.fa.

6 Summary: Final gene prediction pipeline

  1. Run CEGMA on each genome using the core eukaryotic protein set

  2. Train SNAP on the CEGMA predictions

  3. Run the final ab initio predictions using the trained SNAP model and the F. solani Augustus model

6.1 Initial overview

Grab gene counts from each annotation

library(Biostrings)
files <- list.files("../r2_complete/")
initial.prot.count <- data.frame(matrix(nrow=length(files),ncol=2))
colnames(initial.prot.count) <- c("genome","genes")
for (x in c(1:length(files))) {
  prot <- readAAStringSet(paste0("../r2_complete/",files[x],"/genome.all.maker.non_overlapping_ab_initio.proteins.fasta"))
  prot <- names(prot)
  count <- length(prot)
  initial.prot.count[x,1] <- files[x]
  initial.prot.count[x,2] <- count
}
write.csv(initial.prot.count, "gene.counts.csv")
initial.prot.count
##           genome genes
## 1    G.lavendula  7852
## 2      G.morbida  7470
## 3      G.obscura  8468
## 4   G.omnicola.1  9367
## 5  G.omnicola.2a  9386
## 6  G.pallida.153 10550
## 7  G.pallida.18a 10296
## 8  G.pallida.19a 10026
## 9  G.pallida.20a  9500
## 10   G.pallida.4  9423
## 11   G.pallida.6  9673
## 12       G.sp.21  9265
## 13       G.sp.23  8710
## 14        G.sp10  9026
## 15         G.sp2  9288
## 16        G.sp41  8847

I will not include CodingQuarry. Despite my best efforts to "turn back the clock", I was not able to fully reconstruct the original annotation system; i.e., I still ended up with 1200 more genes in the G. morbida prediction than were originally predicted in 2016. Plus, CodingQuarry has not been updated or maintained in four years. It's time to move on.

6.2 Assessing the gene counts

  • G.pallida tends to have higher gene counts, at ~10,000

  • G.pallida.3 is defective and should be discarded; it only has 3,000 genes

  • G.morbida and G.lavendula have lower gene counts. G.lavendula is considered a generalist, while G.morbida is considered a pathogen with a narrow lifestyle, so this doesn't make much sense, especially considering that G.pallida is also a pathogen

Overall I’m very happy with the consistency of the results.

7 Annotation

The next obvious stage is to produce an annotation database. That necessitates renaming the genes to concise locus tags rather than the long messes they are now. The complication is that the gff references the current gene names; thus, annotation must precede any effort to rename anything.

7.1 make a genome to gene name lookup table and concatenated protein fasta

library(Biostrings)
files <- list.files("../r2_complete/")
full.gene.map <- data.frame(matrix(ncol = 3))
colnames(full.gene.map) <- c("genome", "name", "seq")

# keep only the ID portion of each header (everything after "mRNA-1",
# i.e. after the first space, is stripped)
spcSplit <- function(x) {
  strsplit(x, " ")[[1]][1]
}

for (x in seq_along(files)) {
  prot <- readAAStringSet(paste0("../r2_complete/", files[x],
                                 "/genome.all.maker.non_overlapping_ab_initio.proteins.fasta"))
  name <- unlist(lapply(names(prot), spcSplit))
  # a temporary genome tag is appended so gene names can be mapped back to genomes later
  names(prot) <- paste(name, files[x], sep = ":")
  #full.gene.map <- rbind(full.gene.map,
  #                       data.frame(genome = files[x], name = names(prot), seq = paste(prot)))
  writeXStringSet(prot, paste0("allprots/", files[x], ".faa"))
}

The individual .faa multifastas have to be combined in the shell (writeFasta was failing for this many proteins; no idea why, as it worked before for a large metagenome).
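
A minimal way to do that concatenation (assuming all per-genome files sit in allprots/ and the combined file name is arbitrary):

# concatenate the per-genome protein fastas into one file for the annotation servers
cat allprots/*.faa > allprots/all.Geosmithia.proteins.faa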

All 147,000 proteins now have unique names and are uploaded to

  • KEGG GhostKoala

  • EggNOG

  • COG, pfam, and TIGRFAM via webMGA server