looking at basic commands
ls
pwd
cd
cd ~/techinuqes_workshop/
For this exercise will use data from a previous MinION run on a culture dominated by a bacterium in the genus Vibrio. The dominant organism has two large chromosomes with sizes around 3.4Mb and 1.8Mb respectively. We also expect some smaller plasmids. At very high coverage it was also revealed that this data contained another bacterium (a Gammaproteobacterium) which was around 100 times less abundant than the Vibrio.
Collect a small subset of data for the initial assembly. We will use just the first 3 files which equates to 12000 reads and around 6x coverage of the target genome. This will almost certainly not be enough to generate a good assembly. As a learning exercise it will be useful to compare the results with this small dataset to results we will obtain later using more reads.
pull data and run assembly
cat /pvol/data/amtp4/*_[012].fastq > amtp4.fastq
wtdbg2 -x ont -g 5m -t 4 -i amtp4.fastq -fo amtp4
wtpoa-cns -t 4 -i amtp4.ctg.lay.gz -fo amtp4.ctg.fa
assembly-stats amtp4.ctg.fa
stats for amtp4.ctg.fa
sum = 3639659, n = 120, ave = 30330.49, largest = 143809
N50 = 48273, n = 23
N60 = 36829, n = 32
N70 = 28879, n = 43
N80 = 22438, n = 57
N90 = 13359, n = 79
N100 = 2697, n = 120
N_count = 0
Gaps = 0
We will now run redbean with higher coverage
cat /pvol/data/amtp4/*.fastq > amtp4_20x.fastq
wtdbg2 -x ont -g 5m -t 4 -i amtp4_20x.fastq -fo amtp4_20x
wtpoa-cns -t 4 -i amtp4_20x.ctg.lay.gz -fo amtp4_20x.ctg.fa
assembly-stats amtp4_20x.ctg.fa
stats for amtp4_20x.ctg.fa
sum = 6878943, n = 89, ave = 77291.49, largest = 3344234
N50 = 1764939, n = 2
N60 = 1764939, n = 2
N70 = 1764939, n = 2
N80 = 54425, n = 8
N90 = 23747, n = 28
N100 = 4477, n = 89
N_count = 0
Gaps = 0
Now we will try an alternative assembler called Flye. This assembler has been around a bit longer than redbean and is a more mature software project. As far as long read assemblers go it is fairly quick but not as fast as redbean. Again, lucky you, it has already been installed on your system but if you want to try it out on your own machine later, you can go with the conda version for flye.
Create symbolic links to same raw data from redbean
ln -s ../redbean/amtp4.fastq .
ln -s ../redbean/amtp4_20x.fastq .
~/techniques_workshop/flyenext run flye as script using
#!/bin/bash
#SBATCH --time=60
#SBATCH --ntasks=4 --mem=4gb
echo "Starting flye in $(pwd) at $(date)"
flye --nano-raw amtp4.fastq --out-dir amtp4 --genome-size 5m --threads 4
echo "Finished flye in $(pwd) at $(date)"
run using sbatch run_flye_small.sh
sbatch run_fly_small.sh because I
can’t spellsqueue, also
tail -f slurm-310.outassembly-stats amtp4/assembly.fasta
stats for amtp4/assembly.fasta
sum = 5170103, n = 77, ave = 67144.19, largest = 368927
N50 = 119681, n = 16
N60 = 99731, n = 20
N70 = 73460, n = 26
N80 = 48300, n = 35
N90 = 31251, n = 48
N100 = 2777, n = 77
N_count = 0
Gaps = 0
now we run on large dataset using script
#!/bin/bash
#SBATCH --time=360
#SBATCH --ntasks=4 --mem=4gb
echo "Starting flye in $(pwd) at $(date)"
flye --nano-raw amtp4_20x.fastq --out-dir amtp4_20x --genome-size 5m --threads 4
echo "Finished flye in $(pwd) at $(date)"
stats for amtp4_20x/assembly.fasta
sum = 7478720, n = 53, ave = 141107.92, largest = 3378345
N50 = 1775300, n = 2
N60 = 1775300, n = 2
N70 = 113574, n = 3
N80 = 76177, n = 11
N90 = 44111, n = 25
N100 = 5575, n = 53
N_count = 0
Gaps = 0
In this visualisation contiguous segments are shown as solid coloured lines. Some of these are separated into independent graphs whereas others (eg the mess in the top left) are connected to each other via black lines. There are few things to keep in mind here;
what makes a repeat? - the node has multiple connections (branching), - its coverage is higher than surrounding unique sequence, - its sequence aligns to multiple locations in the final assembly.
In this low coverage graph if you add a depth label, you dan see that brown middle one is 42x compared to the others allbeing 5-7x
the little red bit has 25x
coverage but doesn’t map to anywhere
In this high coverage graph, the little blue bit in the pink circle is 208x compared to 24x, loopy blue bit that maps to itself also decently high 83x, chunky bit 370x.
Which contigs look like fully assembled chromosomes? Which ones look like plasmids?
Up to this point we have performed four separate assemblies with 2 different assemblers and 2 different levels of coverage. Although we could clearly see a huge improvement in contiguity with the addition of more reads we could not easily check for the accuracy of assemblies.
For this particular example we have a reference assembly which we previously created using very deep coverage and best-practice genome polishing tools. By comparing our assemblies to the reference we can assess them in terms of all the important metrics, contiguity, completeness and correctness.
first look at contig size in flye assembly
cat ../flye/amtp4_20x/assembly.fasta | bioawk -c fastx '{print $name, length($seq)}' | sort -n -k2
we want to extract the two largest contigs which is contig 59 (1775300) and 10 (3378345) but may differ by assembly
samtools faidx ../flye/amtp4_20x/assembly.fasta contig_59 contig_10 > flye_asm/vibrio.fasta
repeat for redbean
cat ../redbean/amtp4_20x.ctg.fa | bioawk -c fastx '{print $name, length($seq)}' | sort -n -k2
samtools faidx ../redbean/amtp4_20x.ctg.fa ctg1 ctg2 > redbean_asm/vibrio.fasta
then we need to create a symbolic link to original reads and the reference assembly
ln ../flye/amtp4_20x.fastq .
ln -s /pvol/data/amtp4/reference/vibrio .
run QUAST
quast -o quast_vibrio redbean_asm/vibrio.fasta flye_asm/vibrio.fasta -L -t 4 --circos --glimmer -r vibrio/vibrio.fna --features vibrio/vibrio.gff --nanopore amtp4_20x.fastq
Quast outputs should be saved to a folder called quast_vibrio. Download this folder to your laptop and unzip it. Some things to look at include
The main report report.html A circos plot in circos/circos.png Both assemblers seem to have done a very good job of assembling the genome but both have produced regions where there are a large number of small differences (snps and indels) between the assembly and the reference. The redbean assembly also seems to have a couple of gaps which most likely correspond with the highly repetitive region we observed in the Bandage visualisation.
Notice that we didn’t actually do any adapter trimming prior to running flye or redbean. In a real project we would probably run porechop on all the reads to trim adapters. This would most likely improve the assembly and reduce the possibility of incorporating adapter sequences into our final product.
As an exercise use porechop to check for adapters in the various assembly files you created.
porechop -i ../flye/amtp4_20x/assembly.fasta -o flye_asm/flye_adapter.fasta
porechop -i ../redbean/amtp4_20x.ctg.fa -o redbean_asm/redbean_adapter.fasta
flye
13 / 53 reads had adapters trimmed from their start (1,093 bp removed)
0 / 53 reads had adapters trimmed from their end (0 bp removed)
redbean
No adapters found - output reads are unchanged from input reads
Although long reads provide very high contiguity they generally have low accuracy. To a large extent this low accuracy can be overcome obtaining a high read depth and taking the consensus of many reads at each position. Some genome assemblers incorporate such error correction (eg Canu) but even in such cases it is always a good idea to use a dedicated assembly polishing tool to correct small errors in the assembly.
A good polishing tool for Oxford Nanopore data is medaka. Ideally we would probably want some more accurate Illumina data for final polishing, or PacBio data which has less biased errors than Nanopore. For each of these tasks there are a growing number of programs that can be used.
We have just shown with ‘bandage’ that the ‘flye’ assembly contains a repeat region that could not be completely resolved. Let’s have a look and see what might be going on here.
For this exercise we will run commands and store results in a directory called repeat. Assuming that you are in your quast directory, you would run the following commands to set this up and cd into it.
flye_repeatprokka --outdir prokka --prefix repeat flye_repeat
from repeat.tsv
locus_tag ftype length_bp gene EC_number COG product
ONKKGFDH_00001 tRNA 76 tRNA-Asp(gtc)
ONKKGFDH_00002 rRNA 111 5S ribosomal RNA
ONKKGFDH_00003 rRNA 2878 23S ribosomal RNA
ONKKGFDH_00004 tRNA 76 tRNA-Ala(tgc)
ONKKGFDH_00005 tRNA 77 tRNA-Ile(gat)
ONKKGFDH_00006 rRNA 1549 16S ribosomal RNA
Now we know the genes that are present in our repeat but we don’t know how the repeats are arranged in the genome. Specifically we are interested in whether they are arranged in tandem or dispersed throughout the genome.
To investigate this we can map the repeat sequence back to the genome itself. The tool of choice for mapping long reads and/or long DNA sequences is minimap2
minimap2 -x map-ont vibrio.fna flye_repeat > repeat_vs_genome.paf
head and
cat repeat_vs_genome.paf | grep '^repe' | awk -F '\t' '{print $6,$8,$9}' > repeat_locations.bedrepeat_locations.bed
tig00000584 1365920 1371399
tig00000584 839427 844903
tig00000584 1267990 1273367
tig00000584 1643296 1649096
tig00000584 1167711 1173336
tig00000584 1273520 1279148