Plasmid Assembly at the AGRF

Alexis Lucattini
July, 2017

What can you hope to get out of this talk

  • Nanopore expectations with low input.
  • What plasmids are and why they are useful.
  • Applied examples of bioinformatic tools for Nanopore sequencing.
  • Potential AGRF partnership with Nanopore sequencing.

List of tools used during this talk.

Name Author Focus
Poreduck A. Lucattini Data Handling
Albacore ONT Basecalling
Porechop R. Wick Trimming
Pauvre D. Schultz QC-plotting
Canu S. Koren Assembly
Circlator M. Hunt Assembly

Plasmid Background

  • Plasmids are small circular dsDNA sequences, found mainly in bacteria.
  • Bacteria can use plasmids to share genetic information, even between different species.
  • Genes on plasmids can encode for virulence factors or anti-microbial resistance.
  • But scientists can also use plasmids to clone, transfer or manipulate genes

plasmid_1 plasmid_2

Traditional Plasmid Sequencing at the AGRF

  • Samples sequenced on MiSeq or Sanger.
  • Sanger with multiple primers to 'hop' through genome
  • MiSeq, huge coverage yet poor assembly.
  • Potential market for plasmid length MinION reads.

Nanopore Yields:

You only get out, what you put in (Nutri-grain et al.)

  • High molecular weight DNA ( > 1µg) ==> High yields 5-10+ Gbs.
  • Low yields in plasmid extraction, sequencing used as confirmation.
  • But how low can one go and still achieve their desired outcome?
    • 500 ng?
    • … 50 ng?

8 plasmids: 4 bigs, 4 smalls

plot of chunk unnamed-chunk-2

  • Small plasmids retained through clean-ups better than large plasmids.
  • Reasons unknown.
    • Higher quality DNA?
    • Smaller DNA less likely to break?
  • Protocol instructs normalise to 700 ng?
  • We have a small dilemma…

A handy workaround.

The problem

  • Highly unbalanced library.
  • Can obtain equal sequencing by lowering all inputs
    • Only as good as your lowest yield.

The solution

  • Add the good samples once the bad samples have had a go…
  • Can use real-time analysis to know when to add.

The real-time analysis pipeline. Part 1/6

poreduck Find me on Github. www.github.com/alexiswl/poreduck

Two main scripts:

  • Transfer data to server and remove from laptop.
  • Run albacore through SGE on server.

Both scripts run in real-time.

Basic steps of Poreduck pipeline.

  • transfer script zips up full read folders and uses Rsync to move folder to server.
  • transfer script continues to run until MinKNOW has finished.
  • albacore script will unzip and run through basecaller. Outputs a fastq file for each folder. Barcoding compatible.
  • This run shows need for low-yield parameter. Zip every 100 fast5 files.

The real-time analysis pipeline. Part 2/6

porechop

  • Removes both ligation and barcoding adapters from nanopore reads.
  • Can also quality filter reads with 'unusual results'.
  • Useful output summary.

plot of chunk unnamed-chunk-3

The real-time analysis pipeline. Part 2/6

Road block:

  • Large amount of unclassified reads.
  • Potential library step issue.
    • DNA extraction?
    • End-prep issue?
  • Could still use for consensus?

plot of chunk unnamed-chunk-4

The real-time analysis pipeline. Part 3/6

Sample Yield
JB1 8073511
JB2 1475418
PTS1 549940
PTS2 324648
unclassified 57426941
YHC17 871441
YHC6 363154
YHC7 468044
YHC8 903629

Lowest sample yield is 324648 bp for sample PTS2

  • Is this data still useful?
  • We've got 200x coverage.
  • What's the read length like?

The real-time analysis pipeline. Part 3/6

pauvre

Introducing pauvre.

  • Margin plot for sample YHC6.
  • We can use these plots to estimate the actual length of the plasmid.
    • Required parameter for Canu
  • Documentation can be found on GitHub.

The real-time analysis pipeline. Part 4/6

Assembly with Canu

  • Now we have our genome length parameter let the assembly begin.

Example Canu assembly command

canu -d assemblies/barcode06/ -p YHC6 genomeSize=18k \
-nanopore-raw 

canu can also invoke your SGE.
See the docs http://canu.readthedocs.io/en/latest/

The real-time analysis pipeline. Part 4/6

When canu fails

  • Canu gives quite verbose output and logs.
  • In some cases, it may not produce any contigs at all.
  • In our case Canu failed on two of the large plasmids.
  • There can still be useful partial assemblies that we can examine.
  • The *.contigs.layout.tigInfo file can give us a clue.

The real-time analysis pipeline. Part 5/6

Use circulator to cut edges that may have exceeded.

Using circulator minimus2 we can circularise the assembly

circilator minimus2 <input.fasta> <output.prefix>

The real-time analysis pipeline. Part 6/6

Realign to our circuarlised draft genome.

  • Using samtools we can realign our trimmed reads to our draft genome.
  • View alignment with IGV.
  • Does everything look right?

The validation process. Circularisation

  • Circulariser logs. circle_dir/sample.log
    • Did the genome circularise?
    • Good indication of assembly process.

plot of chunk unnamed-chunk-8

The validation process. Circularised Genome

good_assembly

The validation process. Non-circularised Genome

bad_assembly

What about all the unclassified data?

  • In this particular run there was a large amount of unclassified data.
  • Can we map it to our draft genomes for a greater consensus sequence?

With the help of hybrid assemblies:

  • Unicycler.