Workshop 1

looking at basic commands

ls
pwd
cd
cd ~/techinuqes_workshop/

Workshop 2

For this exercise will use data from a previous MinION run on a culture dominated by a bacterium in the genus Vibrio. The dominant organism has two large chromosomes with sizes around 3.4Mb and 1.8Mb respectively. We also expect some smaller plasmids. At very high coverage it was also revealed that this data contained another bacterium (a Gammaproteobacterium) which was around 100 times less abundant than the Vibrio.

Redbean

Collect a small subset of data for the initial assembly. We will use just the first 3 files which equates to 12000 reads and around 6x coverage of the target genome. This will almost certainly not be enough to generate a good assembly. As a learning exercise it will be useful to compare the results with this small dataset to results we will obtain later using more reads.

pull data and run assembly

cat /pvol/data/amtp4/*_[012].fastq > amtp4.fastq

wtdbg2 -x ont -g 5m -t 4 -i amtp4.fastq -fo amtp4
wtpoa-cns -t 4 -i amtp4.ctg.lay.gz -fo amtp4.ctg.fa
  • took ~1 min ish
  • created 120 contigs
assembly-stats amtp4.ctg.fa

stats for amtp4.ctg.fa

sum = 3639659, n = 120, ave = 30330.49, largest = 143809

N50 = 48273, n = 23

N60 = 36829, n = 32

N70 = 28879, n = 43

N80 = 22438, n = 57

N90 = 13359, n = 79

N100 = 2697, n = 120

N_count = 0

Gaps = 0

We will now run redbean with higher coverage

cat /pvol/data/amtp4/*.fastq > amtp4_20x.fastq
wtdbg2 -x ont -g 5m -t 4 -i amtp4_20x.fastq -fo amtp4_20x
wtpoa-cns -t 4 -i amtp4_20x.ctg.lay.gz -fo amtp4_20x.ctg.fa
  • took longer
  • created 89 contigs
assembly-stats amtp4_20x.ctg.fa

stats for amtp4_20x.ctg.fa

sum = 6878943, n = 89, ave = 77291.49, largest = 3344234

N50 = 1764939, n = 2

N60 = 1764939, n = 2

N70 = 1764939, n = 2

N80 = 54425, n = 8

N90 = 23747, n = 28

N100 = 4477, n = 89

N_count = 0

Gaps = 0

Flye

Now we will try an alternative assembler called Flye. This assembler has been around a bit longer than redbean and is a more mature software project. As far as long read assemblers go it is fairly quick but not as fast as redbean. Again, lucky you, it has already been installed on your system but if you want to try it out on your own machine later, you can go with the conda version for flye.

Create symbolic links to same raw data from redbean

ln -s ../redbean/amtp4.fastq .
ln -s ../redbean/amtp4_20x.fastq .
  • files now exist in ~/techniques_workshop/flye

next run flye as script using

#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=4 --mem=4gb

echo "Starting flye in $(pwd) at $(date)"

flye --nano-raw amtp4.fastq --out-dir amtp4 --genome-size 5m --threads 4

echo "Finished flye in $(pwd) at $(date)"

run using sbatch run_flye_small.sh

  • actually run using sbatch run_fly_small.sh because I can’t spell
  • can check on it using squeue, also tail -f slurm-310.out
assembly-stats amtp4/assembly.fasta

stats for amtp4/assembly.fasta

sum = 5170103, n = 77, ave = 67144.19, largest = 368927

N50 = 119681, n = 16

N60 = 99731, n = 20

N70 = 73460, n = 26

N80 = 48300, n = 35

N90 = 31251, n = 48

N100 = 2777, n = 77

N_count = 0

Gaps = 0

now we run on large dataset using script

#!/bin/bash
#SBATCH --time=360
#SBATCH --ntasks=4 --mem=4gb

echo "Starting flye in $(pwd) at $(date)"

flye --nano-raw amtp4_20x.fastq --out-dir amtp4_20x --genome-size 5m --threads 4

echo "Finished flye in $(pwd) at $(date)"

stats for amtp4_20x/assembly.fasta

sum = 7478720, n = 53, ave = 141107.92, largest = 3378345

N50 = 1775300, n = 2

N60 = 1775300, n = 2

N70 = 113574, n = 3

N80 = 76177, n = 11

N90 = 44111, n = 25

N100 = 5575, n = 53

N_count = 0

Gaps = 0

Bandage

In this visualisation contiguous segments are shown as solid coloured lines. Some of these are separated into independent graphs whereas others (eg the mess in the top left) are connected to each other via black lines. There are few things to keep in mind here;

  • Biologically distinct entities such as chromosomes or plasmids will form separate graphs
  • Separated graphs can also occur if coverage is very low and no reads exist to join parts of the assembly together
  • Highly joined regions (eg top left) tend to occur because repeats create ambiguities. There are too many connections and the assembler has trouble choosing the correct path between them.
repeats info
repeats info
low coverage bandage graph
low coverage bandage graph
low coverage bandage graph looking at cluster
low coverage bandage graph looking at cluster

what makes a repeat? - the node has multiple connections (branching), - its coverage is higher than surrounding unique sequence, - its sequence aligns to multiple locations in the final assembly.

In this low coverage graph if you add a depth label, you dan see that brown middle one is 42x compared to the others allbeing 5-7x

low coverage bandage graph plasmids the little red bit has 25x coverage but doesn’t map to anywhere

high coverage bandage graph
high coverage bandage graph

In this high coverage graph, the little blue bit in the pink circle is 208x compared to 24x, loopy blue bit that maps to itself also decently high 83x, chunky bit 370x.

Which contigs look like fully assembled chromosomes? Which ones look like plasmids?

QUAST

Up to this point we have performed four separate assemblies with 2 different assemblers and 2 different levels of coverage. Although we could clearly see a huge improvement in contiguity with the addition of more reads we could not easily check for the accuracy of assemblies.

For this particular example we have a reference assembly which we previously created using very deep coverage and best-practice genome polishing tools. By comparing our assemblies to the reference we can assess them in terms of all the important metrics, contiguity, completeness and correctness.

first look at contig size in flye assembly

cat ../flye/amtp4_20x/assembly.fasta | bioawk -c fastx '{print $name, length($seq)}' | sort -n -k2

we want to extract the two largest contigs which is contig 59 (1775300) and 10 (3378345) but may differ by assembly

samtools faidx ../flye/amtp4_20x/assembly.fasta contig_59 contig_10 > flye_asm/vibrio.fasta

repeat for redbean

cat ../redbean/amtp4_20x.ctg.fa | bioawk -c fastx '{print $name, length($seq)}' | sort -n -k2

samtools faidx ../redbean/amtp4_20x.ctg.fa ctg1 ctg2 > redbean_asm/vibrio.fasta

then we need to create a symbolic link to original reads and the reference assembly

ln ../flye/amtp4_20x.fastq .

ln -s /pvol/data/amtp4/reference/vibrio .

run QUAST

quast -o quast_vibrio redbean_asm/vibrio.fasta flye_asm/vibrio.fasta -L -t 4 --circos --glimmer -r vibrio/vibrio.fna --features  vibrio/vibrio.gff --nanopore amtp4_20x.fastq

Quast outputs should be saved to a folder called quast_vibrio. Download this folder to your laptop and unzip it. Some things to look at include

The main report report.html A circos plot in circos/circos.png Both assemblers seem to have done a very good job of assembling the genome but both have produced regions where there are a large number of small differences (snps and indels) between the assembly and the reference. The redbean assembly also seems to have a couple of gaps which most likely correspond with the highly repetitive region we observed in the Bandage visualisation.

circos plot
circos plot

Porechop - optional exercise

Notice that we didn’t actually do any adapter trimming prior to running flye or redbean. In a real project we would probably run porechop on all the reads to trim adapters. This would most likely improve the assembly and reduce the possibility of incorporating adapter sequences into our final product.

As an exercise use porechop to check for adapters in the various assembly files you created.

porechop -i ../flye/amtp4_20x/assembly.fasta -o flye_asm/flye_adapter.fasta

porechop -i ../redbean/amtp4_20x.ctg.fa -o redbean_asm/redbean_adapter.fasta

flye

13 / 53 reads had adapters trimmed from their start (1,093 bp removed)

0 / 53 reads had adapters trimmed from their end (0 bp removed)

redbean

No adapters found - output reads are unchanged from input reads

Polishing - optional

Although long reads provide very high contiguity they generally have low accuracy. To a large extent this low accuracy can be overcome obtaining a high read depth and taking the consensus of many reads at each position. Some genome assemblers incorporate such error correction (eg Canu) but even in such cases it is always a good idea to use a dedicated assembly polishing tool to correct small errors in the assembly.

A good polishing tool for Oxford Nanopore data is medaka. Ideally we would probably want some more accurate Illumina data for final polishing, or PacBio data which has less biased errors than Nanopore. For each of these tasks there are a growing number of programs that can be used.

Mystery Repeat Region

We have just shown with ‘bandage’ that the ‘flye’ assembly contains a repeat region that could not be completely resolved. Let’s have a look and see what might be going on here.

For this exercise we will run commands and store results in a directory called repeat. Assuming that you are in your quast directory, you would run the following commands to set this up and cd into it.

  • use bandage to get sequence, saved as flye_repeat

prokka

prokka --outdir prokka --prefix repeat flye_repeat

from repeat.tsv

locus_tag ftype length_bp gene EC_number COG product

ONKKGFDH_00001 tRNA 76 tRNA-Asp(gtc)

ONKKGFDH_00002 rRNA 111 5S ribosomal RNA

ONKKGFDH_00003 rRNA 2878 23S ribosomal RNA

ONKKGFDH_00004 tRNA 76 tRNA-Ala(tgc)

ONKKGFDH_00005 tRNA 77 tRNA-Ile(gat)

ONKKGFDH_00006 rRNA 1549 16S ribosomal RNA

  • sequence dominated by rRNA, mainly 23S

Genomic arrangement of repeats

Now we know the genes that are present in our repeat but we don’t know how the repeats are arranged in the genome. Specifically we are interested in whether they are arranged in tandem or dispersed throughout the genome.

To investigate this we can map the repeat sequence back to the genome itself. The tool of choice for mapping long reads and/or long DNA sequences is minimap2

minimap2 -x map-ont vibrio.fna flye_repeat > repeat_vs_genome.paf
  • examine output using head and cat repeat_vs_genome.paf | grep '^repe' | awk -F '\t' '{print $6,$8,$9}' > repeat_locations.bed

repeat_locations.bed

tig00000584 1365920 1371399

tig00000584 839427 844903

tig00000584 1267990 1273367

tig00000584 1643296 1649096

tig00000584 1167711 1173336

tig00000584 1273520 1279148