New Geosmithia assemblies

Currently there are 12 Geosmithia genomes that we would like to send through a comparative genomics pipeline, with the goal of identifying pathogenicity factors in the pathogenic strains.

strain.details <- readxl::read_xlsx("../report/strain.table.200522.xlsx")
strain.details %>%
  kable() %>%
  kable_styling()
species/strain designation raw reads assembled code
Geosmithia sp. 2 yes yes KW1
Geosmithia sp. 10 yes yes 5BS4W
Geosmithia sp. 21 yes yes 3BHS13
Geosmithia sp. 23 yes yes 4LW11
Geosmithia sp. 41 yes yes 4BE20
Geosmithia obscura yes no 6BE2
Geosmithia lavendula yes no na
Geosmithia omnicola strain 1 yes yes na
Geosmithia omnicola strain 2a yes yes na
Geosmithia pallida strain 3 yes yes na
Geosmithia pallida strain 4 yes yes na
Geosmithia pallida strain 6 yes yes na
Geosmithia pallida strain 20a yes yes na
Geosmithia pallida strain 153 yes yes na
Geosmithia morbida yes yes na
Geosmithia putterilli yes yes na
Geosmithia flava yes yes na

I have to assemble two of these WGS projects (G. obscura and G. lavendula) using the same pipeline that was already used to assemble sp. 2 through sp. 41.

The previously applied pipeline (by Meg Staton et al.) consisted of:

library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Raw WGS Reads'
      [2]: 'Apply Illumina error correction tool: BFC correction '
      [3]: 'Remove adapter contamination: Skewer'
      [4]: 'Assemble de novo: ABySS'
      [5]: 'Draft Genome'
      ")

1 Installing the programs

You might want to create a separate conda environment first.

apt-get is not available on macOS, but other installation routes exist; see the ABySS GitHub page.

conda install -c bioconda skewer bfc
sudo apt-get install abyss
sudo apt-get install openmpi-bin
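
If you prefer to avoid apt-get entirely, everything used below is also available from bioconda, so a single dedicated environment works as well (a sketch; the environment name is arbitrary):

# one-stop conda setup: skewer, bfc, and abyss are all bioconda packages
conda create -n geosmithia-assembly -c bioconda -c conda-forge skewer bfc abyss
conda activate geosmithia-assembly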

1.1 BFC error correction

I don't have the original script for this, but there are a few options available.

General instructions are here: https://github.com/lh3/bfc

#!/bin/bash

mkdir ../corrected.reads

# bfc flags: -s = approximate genome size (here ~28 Mb), -t = number of threads
bfc -s 28m -t14 ../raw.data/G_obscura/6BE2_S14_L003_R1_001.fastq.gz \
> ../corrected.reads/G_obscura_corr_R1.fastq

bfc -s 28m -t14 ../raw.data/G_obscura/6BE2_S14_L003_R2_001.fastq.gz \
> ../corrected.reads/G_obscura_corr_R2.fastq

bfc -s 28m -t14 ../raw.data/G_lavendula/G_lavendula_S25_L003_R1_001.fastq.gz \
> ../corrected.reads/G_lavendula_corr_R1.fastq

bfc -s 28m -t14 ../raw.data/G_lavendula/G_lavendula_S25_L003_R2_001.fastq.gz \
> ../corrected.reads/G_lavendula_corr_R2.fastq

1.2 Skewer

Skewer only scans for and removes adapter contamination. You must provide the adapter sequences. I have prepared a combined adapter sequences file, which can be downloaded here: https://github.com/rwmurdoch/project.scripts/blob/master/IlluminaClip.fa

#!/bin/bash

mkdir ../trimmed.reads

# skewer flags: -x = adapter file, -l = minimum read length kept after trimming,
# -t = threads, -o = output prefix (produces <prefix>-trimmed-pair1/2.fastq)
skewer \
-x ../../../resources/Trimmomatic-0.36/adapters/IlluminaClip.fa \
-l 30 \
-t 14 \
../corrected.reads/G_obscura_corr_R1.fastq \
../corrected.reads/G_obscura_corr_R2.fastq \
-o ../trimmed.reads/G_obscura

skewer \
-x ../../../resources/Trimmomatic-0.36/adapters/IlluminaClip.fa \
-l 30 \
-t 14 \
../corrected.reads/G_lavendula_corr_R1.fastq \
../corrected.reads/G_lavendula_corr_R2.fastq \
-o ../trimmed.reads/G_lavendula

1.3 ABySS

Assembly was done with ABySS, a resource-efficient multithreaded assembler. I'm not sure why it was chosen, but I have to move forward with the same pipeline.

#!/bin/bash

cd ../new.assemblies

# abyss-pe parameters: name = output prefix, k = k-mer size;
# stdout (the abyss-pe log) is captured to a file
abyss-pe \
name=G_obscura k=81 \
in='../trimmed.reads/G_obscura-trimmed-pair1.fastq ../trimmed.reads/G_obscura-trimmed-pair2.fastq' \
> G_obscura

abyss-pe \
name=G_lavendula k=81 \
in='../trimmed.reads/G_lavendula-trimmed-pair1.fastq ../trimmed.reads/G_lavendula-trimmed-pair2.fastq' \
> G_lavendula

I ran into a problem where I could not get the program to multithread; I'm sure this could be worked around or solved with local compilation. Nevertheless, each assembly completed within an hour or two, so it was not an issue.
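
For reference, ABySS parallelism is normally controlled through parameters passed to abyss-pe rather than through separate flags; the sketch below assumes an MPI-enabled ABySS build and otherwise mirrors the G. obscura command above:

# j = number of parallel make jobs (threads); np = number of MPI ranks
# for the distributed de Bruijn graph stage (needs an MPI-enabled build)
abyss-pe \
name=G_obscura k=81 j=14 np=14 \
in='../trimmed.reads/G_obscura-trimmed-pair1.fastq ../trimmed.reads/G_obscura-trimmed-pair2.fastq' \
> G_obscura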

2 Quality assessment

Next, all genomes are subjected to QUAST, which calculates basic assembly statistics.

BUSCO, which estimates genome completeness and can flag contamination, should ideally come after the gene-calling pipeline is complete (it has a built-in Augustus step, but that is not adequate here).
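
For reference, once a predicted proteome exists, a protein-mode BUSCO run looks roughly like this; the lineage dataset name (hypocreales_odb10) is an assumption and should be checked against the lineages BUSCO actually offers:

# score a predicted proteome against a lineage's single-copy orthologs
busco -i genome.all.maker.non_overlapping_ab_initio.proteins.fasta \
      -l hypocreales_odb10 -m proteins -o busco_G_obscura -c 14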

G. morbida is included in these analyses.

It is expected that the G. morbida genome will be higher quality since it was also informed by a mate-pair library and more careful assembly procedures.

2.1 QUAST

These packages seem incompatible with something else in the base conda environment, so I create a fresh env for them.

To install:

conda create --name genomeqc
conda activate genomeqc
conda install -c bioconda quast
conda install -c bioconda -c conda-forge busco=4.0.6 #note that this is not used now!

The easiest way to run QUAST is with a simple, brute-force script:

#!/bin/bash

mkdir ../genome.qc

quast \
../G_assemblies/G.morbida.refgenome.fa \
../G_assemblies/G_lavendula.fa \
../G_assemblies/G_obscura.fa \
../G_assemblies/G_omnicola_1.fa \
../G_assemblies/G_omnicola_2a.fa \
../G_assemblies/G_pallida_3.fa \
../G_assemblies/G_pallida_4.fa \
../G_assemblies/G_pallida_6.fa \
../G_assemblies/G_pallida_18a.fa \
../G_assemblies/G_pallida_19a.fa \
../G_assemblies/G_pallida_20a.fa \
../G_assemblies/G_pallida_153.fa \
../G_assemblies/G_sp2.fa \
../G_assemblies/G_sp10.fa \
../G_assemblies/G_sp21.fa \
../G_assemblies/G_sp23.fa \
../G_assemblies/G_sp41.fa \
-t 14 \
-o ../genome.qc

This produces several report files. We are most concerned with spotting any particularly questionable assemblies at this stage, as judged by deviations in total size and GC content, for example.

We will only consider contigs of 1000 bp and larger; smaller contigs will be thrown out at this stage.
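
If you want QUAST's reported statistics to respect the same cutoff directly, the threshold can be passed on the command line (a sketch; --min-contig defaults to 500 bp in recent QUAST versions):

# only contigs >= 1000 bp contribute to the reported statistics
quast --min-contig 1000 -t 14 -o ../genome.qc ../G_assemblies/*.fa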

quast <- read.csv("../genome.qc/transposed_report.tsv", sep = "\t", header = T)
quast <- quast[c(1,3,9,15,17,18,20)]
colnames(quast) <- c("genome","contigs","total.length","largest.contig","GC","N50","L50")
quast %>%
  kable() %>%
  kable_styling()
genome contigs total.length largest.contig GC N50 L50
G.morbida.refgenome 73 26549069 2597956 54.31 1305468 7
G_lavendula 98 24695666 1441847 54.56 560687 16
G_obscura 256 28388349 980739 52.07 333790 27
G_omnicola_1 2406 28053598 91915 53.47 18218 459
G_omnicola_2a 3221 27939781 66216 53.50 13457 656
G_pallida_3 116 4995752 269129 65.88 84092 20
G_pallida_4 877 27324476 507614 58.39 71952 109
G_pallida_6 853 27195544 396931 58.93 71839 104
G_pallida_18a 2344 27016781 159161 58.49 18616 430
G_pallida_19a 1931 26762913 125295 58.84 24553 321
G_pallida_20a 813 26979842 350886 58.60 62040 135
G_pallida_153 5014 26942791 51726 53.33 7693 1049
G_sp2 90 27258288 3834359 59.04 1527339 6
G_sp10 130 26648617 2762688 58.74 746394 10
G_sp21 77 27425179 2322932 58.96 1074268 9
G_sp23 101 25783886 1305060 58.69 567610 15
G_sp41 57 26100254 2159669 59.31 810755 10
library(ggplot2)

df <- quast
plot <- ggplot(df, aes(x = "", y = total.length)) +
  geom_jitter(size = 2,
              position = position_jitter(width = 0.1, seed = 23),
              alpha = 1) +
  # no colour aesthetic is mapped, so this manual scale is currently inert
  scale_color_manual(values = c("#c7c7c7", "#ff0000")) +
  stat_boxplot(geom = "errorbar", width = 0.1) +
  geom_boxplot(outlier.alpha = 0, alpha = .1, width = 0.1) +
  # white point marks the mean assembly length
  stat_summary(fun = mean, colour = "black", geom = "point",
               shape = 21, fill = "white", size = 2, show.legend = FALSE) +
  labs(y = "total assembly length (bp)",
       x = "") +
  guides(color = guide_legend(override.aes = list(size = 5))) +
  labs(color = "remove") +
  ylim(0, 3e7)

#ggsave(file="", width=10, height=8)

plot(plot)

The single outlier in the length plot is G_pallida_3, which also shows an aberrant GC%.

2.2 Interpretation

First, contigs smaller than 1000 bp will be thrown out (data not shown, but some assemblies contain thousands of such contigs).

Second, G_pallida_3 clearly must be thrown out, given its small size and aberrant GC%. However, I will let it run through annotation up to the BUSCO analysis in order to confirm this.

Contig count and N50 values vary greatly, but there is no avoiding this short of using a different assembly program. Given the lack of mate pairs, I am actually surprised by how contiguous some of these assemblies are, appearing roughly equivalent to the G. morbida reference genome. Assemblies of thousands of contigs are not surprising.

I will move forward on the reasonable assumption that assembly problems are confined to repetitive regions rather than protein-coding regions, so the comparative genomics should not be adversely affected. However, I will examine the relationship between contig count, genome length, and predicted gene count to assess this.

3 Contig renaming

The contigs at this stage are poorly named and carry useless extraneous information. I will now simply rename every contig to "scaffold_x", where "x" is an arbitrary number within the assembly, and remove all contigs smaller than 1000 bp.

library(Biostrings)
glist <- list.files("../G_assemblies/")

for (x in seq_along(glist)) {
  genome <- readDNAStringSet(paste0("../G_assemblies/", glist[x]))
  # drop contigs shorter than 1000 bp and sort the remainder by descending length
  genome <- genome[width(genome) >= 1000]
  genome <- genome[order(-width(genome))]
  # rename contigs to scaffold_1 ... scaffold_n
  names(genome) <- paste0("scaffold_", seq_along(genome))
  writeXStringSet(genome, paste0("../revised_G_assemblies/", glist[x]))
}
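
As a quick sanity check that the filtering and renaming behaved, contig counts can be compared before and after (paths follow the QUAST listing above):

# count fasta headers in the original vs. revised assembly
grep -c ">" ../G_assemblies/G_obscura.fa ../revised_G_assemblies/G_obscura.fa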

4 Gene prediction

This is done in as similar a manner as possible to what was done for G. morbida by Shuelke in 2016:

  • MAKER is the main program

  • The SNAP and Augustus gene callers are used within MAKER

  • SNAP is trained using CEGMA v2.5

  • Augustus is trained using Fusarium solani proteins downloaded from Ensembl Fungi

4.1 Conceptual overview

library(DiagrammeR)
grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']

      # edge definitions with the node IDs
      tab1 -> tab2 -> tab3 -> tab4 -> tab5;
      }

      [1]: 'Assembled genome'
      [2]: 'detect single copy conserved genes (SCCG) with CEGMA 2.5'
      [3]: 'train SNAP with SCCG'
      [4]: 'train Augustus with Fusarium solani'
      [5]: 'use MAKER to predict genes using SNAP model and trained Augustus'
      ")

4.3 Installing

Skip this section unless you are struggling to install Maker yourself

4.3.1 The conda version of Maker doesn't work!

Maker must run in a python 2.7 environment:

conda create -n maker python=2.7
conda activate maker
conda install -c bioconda maker

None of this ran for me; it gave a nonsensical error message claiming that my CUDA version should be 10.1 when it already was 10.1.

Instead, I downloaded and installed MAKER locally. Start with the master README file and go from there. Installation requires many dependencies of all sorts, so it is unfortunate that the conda version doesn't work!

Tutorial is here: http://gmod.org/wiki/MAKER_Tutorial#Getting_Started_with_MAKER

Currently, I have:

  • MAKER installed manually (still needs to be added to the .bashrc PATH)

  • SNAP installed in the base conda environment

  • Augustus installed by MAKER, somewhere; it also needs to be on the PATH

  • CEGMA installed via the VM available at http://korflab.ucdavis.edu/datasets/cegma/#SCT4 (described later)

Overall I’m shocked by how obtuse this program (MAKER) is.

4.4 basic usage

Working from this tutorial: http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018#Data_for_Tutorial_Examples, I ran through tutorial 1.

First, you run the maker executable in your project directory via

maker -CTL

which creates three "control" (.ctl) files: maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl.

The maker_exe.ctl file holds the locations of all executables. In my case, many of these had to be entered manually. For example, my Augustus executable is lost somewhere on my system; it was supposedly installed during the MAKER installation, but I cannot find it. Just install it again via:

sudo apt-get install -y augustus
sudo apt-get install ncbi-blast+

The MAKER installation of RepeatMasker is also bugged. Download and install it using these instructions: http://www.repeatmasker.org/RMDownload.html

Upon re-creating the CTL files, MAKER now knows where Augustus is. Other executables may be required depending on which algorithms you wish to run.

The maker_opts.ctl file is where the action is: it is where you define what you want to do and how (a sketch of the key entries is shown below).
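
For orientation, these are the maker_opts.ctl entries that matter most for a run like tutorial 1. The file names follow the tutorial's dpp example and should be treated as placeholders; everything not listed stays at its default:

genome=dpp_contig.fasta   # assembly to annotate
est=dpp_est.fasta         # EST evidence (tutorial 1 is EST-guided)
est2genome=1              # build gene models directly from the EST alignments
protein=                  # protein evidence (not used in tutorial 1)
model_org=all             # RepeatMasker model organism
cpus=14                   # CPUs handed to BLAST and RepeatMasker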

RepeatMasker required an additional perl module in my case; check the log files carefully!

These scripts: https://www.dropbox.com/sh/07w4oxx0zs0v92c/AABye81r5qNJXl5ap_UUZaD6a?dl=1 worked for tutorial 1, which is an EST-guided gene prediction.

Some additional commands can be used to merge all of the gff and fasta files together:

fasta_merge -d dpp_contig_master_datastore_index.log
gff3_merge -d dpp_contig_master_datastore_index.log

These are MAKER executables, FYI; the merging can also be done in other ways.

4.4.1 Warning about Maker

MAKER is awkward. The easiest way to get running seems to be to install the conda package, which, as of 5/26, still had broken RepeatMasker libraries. This can be solved by downloading the latest RepeatMasker release and copying all files from its Library directory into the RepeatMasker library of the conda version, which lives somewhere along the lines of "~/miniconda3/envs/maker/share/repeatmasker/library" (I'm not sure about the last few directories, but that should get you close). From that point, everything should work okay.

4.4.2 Maker computational dimensions

The first round mainly runs BLAST and RepeatMasker. Ultimately it's not so clear where the time goes, but blastx is computationally demanding and does a lot of reading and writing, while nothing is terribly RAM-intensive. The demands of RepeatMasker aren't clear at all, even though it clearly takes up ~10-20% of the runtime. So the bottlenecks are CPU and read/write; if you wanted to get tricky, you could re-route the BLAST executables to DIAMOND alternatives and save a great deal of time.

I have chosen to use all of SwissProt to guide gene prediction, which leads to heavy blastx use and in turn surely dominates the runtime.

5 Reconstructing the gene prediction pipeline

Initial runs using current gene prediction methods gave G. morbida ~2k more genes than our current annotation. That would lead to big problems down the line.

MAKER suggests using all of SwissProt for an initial ab initio gene prediction. We do not start with this; rather, we skip ahead to SNAP and Augustus.

BUSCO is now the standard for detecting conserved genes in eukaryotes; CEGMA was the previous standard tool for this, but it is no longer supported.

The aim is to reconstruct their original gene prediction pipeline more rigidly (https://peerj.com/articles/1952/?utm_source=TrendMD&utm_campaign=PeerJ_TrendMD_1&utm_medium=TrendMD):

  • Maker v 2.31.8 (I used 2.31.10, not likely a problem)

  • SNAP was trained using gff files generated by CEGMA v2.5. My SNAP had been trained using an initial SwissProt-informed ab initio prediction. This is likely a key difference, as CEGMA training should produce a much narrower prediction.

  • Augustus was trained with Fusarium solani. I had used Augustus pre-trained on Fusarium graminearum.

  • They made no initial swissprot prediction. I feel that this was a mistake, but the goal here is to replicate, not improve.

The expected outcome is that:

  • By using CEGMA to train SNAP (rather than SwissProt, as described in the MAKER tutorials), SNAP will be much narrower, since it is looking only for single-copy conserved genes (SCCG).

  • With a different species used to train Augustus, the outcome is not clear; there is no particular reason to think this will change anything notably.

5.1 Augustus training

This looks like it takes time to set up locally, so I’m using a web service: http://bioinf.uni-greifswald.de/webaugustus/

Genome files downloaded from: ftp://ftp.ensemblgenomes.org/pub/fungi/release-47/fasta/fusarium_solani/dna/

protein file: ftp://ftp.ensemblgenomes.org/pub/release-47/fungi/fasta/fusarium_solani/pep/

5.2 running Maker

Copy ONLY the genome and the SNAP model into a new folder structure.

Now we use a SNAP hmm (which is essentially a gene prediction model) and the trained Augustus to run Maker (no pre-run with SwissProt).

script = maker.v2.sh
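
The script maker.v2.sh is not reproduced here. Conceptually, the maker_opts.ctl entries it depends on would look roughly like the sketch below; the paths and the Augustus species label are assumptions (the label is whatever name the WebAUGUSTUS training assigned):

genome=../revised_G_assemblies/G_obscura.fa  # renamed assembly (hypothetical path)
snaphmm=../snapmodel.hmm                     # SNAP model trained on the CEGMA output
augustus_species=fusarium_solani_geosmithia  # hypothetical WebAUGUSTUS-trained model name
protein=                                     # no SwissProt pre-run this time
keep_preds=1                                 # keep ab initio predictions lacking evidence support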

5.3 CEGMA Virtual Machine installation and operation

Manual installation of CEGMA is NOT ADVISED (don't waste your time and sanity).

  1. Install Oracle VM VirtualBox

  2. Download the CEGMA 2.5 VM from http://korflab.ucdavis.edu/Datasets/cegma/cegma_vm.html

  3. Configure the VM's RAM, CPUs, and video RAM, and designate the project folder as a shared folder with the mount point /home/ubuntu/Work

  4. Deal with permission issues in the VM via 'sudo adduser $USER vboxsf'

  5. Activate CEGMA in the VM via 'source .bash_profile'

  6. Download the "core eukaryotic proteins" set from http://korflab.ucdavis.edu/datasets/cegma/#SCT5 and place it in a reasonable location in the shared directory

  7. Run CEGMA with: cegma -g genome.fa -p ../core.faa -T 4

One run finished within ~2-3 hours.
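
Since every genome needs its own CEGMA run, a simple loop over the revised assemblies keeps things tidy. This is only a sketch reusing the flags above; the shared-folder layout inside the VM is an assumption:

# run CEGMA on each revised assembly in its own output directory
cd /home/ubuntu/Work
for g in revised_G_assemblies/*.fa; do
    d=cegma_$(basename "$g" .fa)
    mkdir -p "$d" && cd "$d"
    cegma -g "../$g" -p ../core.faa -T 4
    cd ..
done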

5.4 train SNAP with CEGMA proteins

script= maker.cegma.to.snap.sh

for f in ../revised_G_assemblies/*
    do
    d=${f##*/}
    echo $d
    cd /home/robert/Dropbox/Projects/Geosmithia.comparative/revised_G_assemblies/$d
    # drop the CEGMA-informed maker control files into this genome's directory
    cp /home/robert/Dropbox/maker.scripts/cegma.snap/* \
        /home/robert/Dropbox/Projects/Geosmithia.comparative/revised_G_assemblies/$d
    maker -fix_nucleotides
    mkdir snap
    cd snap
    # merge the per-contig gffs, convert them to SNAP's ZFF format,
    # and build the snapmodel.hmm gene prediction model
    gff3_merge -d ../genome.maker.output/genome_master_datastore_index.log
    maker2zff -n genome.all.gff
    fathom -categorize 1000 genome.ann genome.dna
    fathom -export 1000 -plus uni.ann uni.dna
    forge export.ann export.dna
    hmm-assembler.pl $d . > ../snapmodel.hmm
done

The MAKER control files were altered to point to the CEGMA protein output, output.cegma.fa.

6 Summary: Final gene prediction pipeline

  1. Run CEGMA on each genome using the core eukaryotic protein set

  2. Train SNAP on the CEGMA predictions

  3. Run the final ab initio predictions using the trained SNAP model and the F. solani Augustus model

6.1 Initial overview

Grab gene counts from each annotation

library(Biostrings)
files <- list.files("../r2_complete/")
initial.prot.count <- data.frame(matrix(nrow=length(files),ncol=2))
colnames(initial.prot.count) <- c("genome","genes")
for (x in c(1:length(files))) {
  prot <- readAAStringSet(paste0("../r2_complete/",files[x],"/genome.all.maker.non_overlapping_ab_initio.proteins.fasta"))
  prot <- names(prot)
  count <- length(prot)
  initial.prot.count[x,1] <- files[x]
  initial.prot.count[x,2] <- count
}
write.csv(initial.prot.count, "gene.counts.csv")
initial.prot.count
##           genome genes
## 1    G.lavendula  7852
## 2      G.morbida  7470
## 3      G.obscura  8468
## 4   G.omnicola.1  9367
## 5  G.omnicola.2a  9386
## 6  G.pallida.153 10550
## 7  G.pallida.18a 10296
## 8  G.pallida.19a 10026
## 9  G.pallida.20a  9500
## 10   G.pallida.4  9423
## 11   G.pallida.6  9673
## 12       G.sp.21  9265
## 13       G.sp.23  8710
## 14        G.sp10  9026
## 15         G.sp2  9288
## 16        G.sp41  8847

I will not include CodingQuarry. Despite my best efforts to "turn back the clock", I was not able to fully reconstruct the original annotation system; i.e., I still ended up with 1200 more genes in the G. morbida prediction than were originally predicted in 2016. Plus, CodingQuarry has not been updated or maintained in four years. It's time to move on.

6.2 Assessing the gene counts

  • G.pallida tends to have higher gene counts, at ~10,000

  • G.pallida.3 is defective and should be discarded; it only has 3,000 genes

  • G.morbida and G.lavendula have lower gene counts. G.lavendula is considered a generalist, while G.morbida is considered a pathogen with a narrow lifestyle, so this doesn't make much sense, especially considering that G.pallida is also a pathogen

Overall I’m very happy with the consistency of the results.

7 Annotation

The next obvious stage is to produce an annotation database. That necessitates renaming the genes to concise locus tags rather than the long messes they are now. The complication is that the gff references the current gene names; thus, annotation must precede any effort to rename anything.

7.1 make a genome to gene name lookup table and concatenated protein fasta

library(Biostrings)
files <- list.files("../r2_complete/")
full.gene.map <- data.frame(matrix(ncol = 3))
colnames(full.gene.map) <- c("genome", "name", "seq")

# keep only the ID portion of each header (everything after "mRNA-1",
# i.e. after the first space, is stripped)
spcSplit <- function(x) {
  strsplit(x, " ")[[1]][1]
}

for (x in seq_along(files)) {
  prot <- readAAStringSet(paste0("../r2_complete/", files[x],
                                 "/genome.all.maker.non_overlapping_ab_initio.proteins.fasta"))
  name <- unlist(lapply(names(prot), spcSplit))
  # a temporary genome tag is appended so gene names can be mapped back to genomes later
  names(prot) <- paste(name, files[x], sep = ":")
  #full.gene.map <- rbind(full.gene.map,
  #                       data.frame(genome = files[x], name = names(prot), seq = paste(prot)))
  writeXStringSet(prot, paste0("allprots/", files[x], ".faa"))
}

The individual .faa multifastas have to be combined in the shell (writeFasta was failing for this many proteins; no idea why, as it worked before for a large metagenome).
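
A minimal way to do that concatenation (assuming all per-genome files sit in allprots/ and the combined file name is arbitrary):

# concatenate the per-genome protein fastas into one file for the annotation servers
cat allprots/*.faa > allprots/all.Geosmithia.proteins.faa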

All 147,000 proteins now have unique names and are uploaded to

  • KEGG GhostKoala

  • EggNOG

  • COG, pfam, and TIGRFAM via webMGA server