MinIon.Assembly.DCA.rmd
- 1 DCA enriched sludge from Lenoir City; MinION metagenomics
- 2 Summary and Conclusions
- 3 concatenating the fastQ
- 4 QC
- 5 Subsampling
- 6 Available assemblers
- 7 Assembly Results
- 8 rerun the pipeline on the full dataset
- 9 Working with the full flye assembly
- 10 Conclusions and next step
- 11 nanopolish error
- 12 Conclusion: sans-polish
- 13 next steps
- 14 Centrifuge
- 15 Canu/medaka
- 16 Archive
- 17 Subsampling
1 DCA enriched sludge from Lenoir City; MinION metagenomics
2 Summary and Conclusions
This was a very exploratory process. Canu -> Medaka is the assembly pipeline I landed on, and any additional results will be based on it.
Right at the start, I'm providing some key downloads for use in downstream analyses:
2.1 Useful downloads
2.1.1 full-metagenome files
metagenome assembly: https://www.dropbox.com/s/umudgfv4pq7kyse/consensus.fasta?dl=1
gene nucleotide fastas: https://www.dropbox.com/s/cdhubbz5g0x7810/LHAD.medaka.fna?dl=1
gene protein fastas: https://www.dropbox.com/s/cdhubbz5g0x7810/LHAD.medaka.fna?dl=1
16S rRNA genes: https://www.dropbox.com/s/u92yq8x88f9222e/LHAD.medaka.RNA.fa?dl=1
library(dplyr)
library(Biostrings)

# helper to trim fasta headers at the first space (same as splitfun, defined in the Canu section below)
spcSplit <- function(x){
  strsplit(x, " ")[[1]][1]
}

# writeFasta() is defined in the Assembly Results section below
medaka.diel.search <- read.csv("canu.full/medaka/blast/LHAD_medaka_search/blast.out", sep = "\t", header = F)
medaka.diel.search <- subset(medaka.diel.search, V1 == "HAD_DIEL")
medaka.diel.search <- medaka.diel.search[2]
colnames(medaka.diel.search) <- "name"

# protein sequences
medaka.prot <- readAAStringSet("canu.full/medaka/LHAD.medaka.faa")
name <- names(medaka.prot)
name <- unlist(lapply(name, spcSplit))
seq <- paste(medaka.prot)
medaka.prot <- data.frame(name, seq)
medaka.had <- left_join(medaka.diel.search, medaka.prot)
writeFasta(medaka.had, "LC.medaka.diel.hadases.faa")

# nucleotide sequences
medaka.nuc <- readDNAStringSet("canu.full/medaka/LHAD.medaka.fna")
name <- names(medaka.nuc)
name <- unlist(lapply(name, spcSplit))
seq <- paste(medaka.nuc)
medaka.nuc <- data.frame(name, seq)
medaka.nuc <- left_join(medaka.diel.search, medaka.nuc)
writeFasta(medaka.nuc, "LC.medaka.diel.hadases.fna")
2.1.2 Dichloromethanomonas-specific files
protein similarity to Diel HADases, mec cassette genes, glyoxylate metabolic genes, etc.: https://www.dropbox.com/s/wc9zy8unwpriz45/blast.out?dl=1
Diel-like HADases (according to BLASTP searches), genes: https://www.dropbox.com/s/i1bv7walapvvdpz/LC.medaka.diel.hadases.fna?dl=1
Diel-like HADases (according to BLASTP searches), proteins: https://www.dropbox.com/s/zdrmo9j113wo362/LC.medaka.diel.hadases.faa?dl=1
Diel-like 16S rRNA genes: https://www.dropbox.com/s/1akoapifebe9ebf/diel.like.16S.fna?dl=1
From here on out, methods and data used to derive this information
Evaluating MinION de novo assembly pipelines with NO available Illumina data.
Excellent overview here: https://www.sciencedirect.com/science/article/pii/S1672022916301309
3 concatenating the fastQ
The fastq files come in ~40 MB chunks, a peculiarity of the MinION system. It is easiest to concatenate these into a single file.
cat Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass/*.fastq > lenoir.DCA.fastq
4 QC
A simple program is available, MinionQC
https://github.com/roblanf/minion_qc https://academic.oup.com/bioinformatics/article/35/3/523/5057155
However, it relies upon a report file called "sequencing_summary.txt" that should be provided by the sequencer.
This program is actually just an R script, run as such:
Rscript MinIONQC.R -i Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/sequencing_summary/DESKTOP_MJPJRJJ_20190607_FAK53375_MN30014_sequencing_run_Gao_LC_DCA_90913_sequencing_summary.txt
It generates a wide variety of metrics. These will be used to determine the average read length and thus how many reads are required to reach a desired sampling depth (a quick way to pull those numbers directly is sketched below).
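For reference, here is a minimal sketch of pulling those numbers straight from the sequencing summary instead; the sequence_length_template column name is the standard Albacore/Guppy layout and is assumed here, and the path is shortened.
library(data.table)
# read the tab-delimited summary produced by the sequencer (path shortened here)
ss <- fread("sequencing_summary.txt")
# average read length and total bases, assuming the standard column name
mean.read.length <- mean(ss$sequence_length_template)
total.bases <- sum(ss$sequence_length_template)
c(mean.read.length = mean.read.length, total.bases = total.bases)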
4.1 application note
I had trouble getting Rscript to recognize the appropriate packages… Instead of trying to really solve the problem (Rscript doesn't seem to use the same version of R), I just amended the main script with a brute-force installation command as the first step:
install.packages(c("data.table",
"futile.logger",
"ggplot2",
"optparse",
"plyr",
"readr",
"reshape2",
"scales",
"viridis",
"yaml"))
4.2 QC Results
5 Subsampling
Traditional random subsampling is not possible here since I need both the fast5 and fastq files for some assembly pipelines.
For starters, I will simply take 1/4 of the files (50 of the 200) and place them into separate folders (see the sketch below).
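A minimal sketch of that crude split, assuming the chunk files sit in fastq_pass/ and fast5_pass/ (full paths shortened) and reusing the subsampled.fastq / subsampled.fast5 folder names from the poreTally commands below:
# copy the first 50 of the ~200 chunk files into separate folders
fq <- list.files("fastq_pass", pattern = "\\.fastq$", full.names = TRUE)
f5 <- list.files("fast5_pass", pattern = "\\.fast5$", full.names = TRUE)
dir.create("subsampled.fastq")
dir.create("subsampled.fast5")
file.copy(fq[1:50], "subsampled.fastq")
file.copy(f5[1:50], "subsampled.fast5")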
6 Available assemblers
I might have trouble with data volume. There are ~8 GB of raw fastq files; I will reduce this down to ~2 GB. This might overwhelm my system; we will see. There is no suitable assembler available via KBase, so I might be up a creek with this.
6.1 poreTally
https://github.com/cvdelannoy/poreTally https://www.biorxiv.org/content/biorxiv/early/2018/09/23/424184.full.pdf (reference)
A pipeline that runs many (if not all) of the trustworthy MinION assembly programs in parallel and compares the results.
Note that this is intended for benchmarking assemblers. There is some weird stuff in here that I hope will not get in the way of just getting assemblies finished.
I will stick with this for now; it is a great place to start.
6.2 running default poreTally assemble
This runs five assembly pipelines by default:
- Canu
- Flye
- Minimap2 + miniasm
- Minimap2 + miniasm + racon x 2
- SMARTdenovo
poreTally requires that a "reference genome" be provided. This simply does not matter for my purposes, so I will provide some nonsense.
Create a yaml file (also not important, but required for the program)
authors: B. Honeydew
species: Escherichia coli
basecaller: Albacore 2.2.7
flowcell: FLO-MIN106
kit: SQK-LSK108
poreTally run_assemblies \
-w poretally.work/ \
-t 8 \
-f subsampled.fast5 \
-r regenome.fa \
--reads-fastq subsampled.fastq
6.3 Monitoring RAM usage
- minimap2 got up to 12GB in the first few minutes
- flye-assemble 2.5GB
- flye-repeat 1.2GB
- minimap2 12GB second iteration (with racon X 2)
- racon just a few GB
- SMARTdenovo completed quickly; didn't see usage
7 Assembly Results
7.1 Minimap2_miniasm
# write a data.frame with "name" and "seq" columns out to a fasta file
writeFasta <- function(data, filename){
fastaLines = c()
for (rowNum in 1:nrow(data)){
fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
}
fileConn<-file(filename)
writeLines(fastaLines, fileConn)
close(fileConn)
}
library(Biostrings)
minimini <- readDNAStringSet("poretally.work/assembler_results/assemblies/minimap2_miniasm.fasta")
len <- width(minimini)
name <- names(minimini)
seq <- paste(minimini)
minimini <- data.frame(name,seq,len)
minimini <- minimini[order(-minimini$len),]
minimini.largest <- minimini[1,]
writeFasta(minimini.largest,"poretally.work/MAGs/minimini.largest.fa")
181 sequences; largest contig = 3,076,114; total length = 12,909,294
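The numbers quoted for each assembly below (contig count, largest contig, total length) can be pulled out with a small helper along these lines; this is a sketch rather than part of the original pipeline, and the N50 value is an extra that was not reported originally.
# quick assembly summary from a DNAStringSet: contig count, largest contig,
# total length, and N50 (smallest contig that brings the cumulative length
# past half of the total assembly)
assemblyStats <- function(dss){
  len <- sort(width(dss), decreasing = TRUE)
  n50 <- len[which(cumsum(len) >= sum(len) / 2)[1]]
  c(contigs = length(len), largest = len[1], total.length = sum(len), N50 = n50)
}
assemblyStats(readDNAStringSet("poretally.work/assembler_results/assemblies/minimap2_miniasm.fasta"))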
7.1.1 Taxonomy of largest contig
Performed via MiGA.
7.2 Minimap2_miniasm with racon X2
library(Biostrings)
racon <- readDNAStringSet("poretally.work/assembler_results/assemblies/minimap2_miniasm_raconX2.fasta")
len <- width(racon)
name <- names(racon)
seq <- paste(racon)
racon <- data.frame(name,seq,len)
racon <- racon[order(-racon$len),]
racon.largest <- racon[1,]
writeFasta(racon.largest,"poretally.work/MAGs/racon.largest.fa")
181 sequences; largest contig = 3,130,719; total length = 13,120,286
7.3 Flye
flye <- readDNAStringSet("poretally.work/assembler_results/assemblies/flye.fasta")
len <- width(flye)
name <- names(flye)
seq <- paste(flye)
flye <- data.frame(name,seq,len)
flye <- flye[order(-flye$len),]
flye.largest <- flye[1,]
writeFasta(flye.largest,"poretally.work/MAGs/flye.largest.fa")
30 sequences; largest contig = 3,208,805; total length = 11,027,625
Note that a coverage file is also generated, which suggests roughly equivalent coverage for everything (same order of magnitude). Some smaller contigs are deeper: the largest contig is at 40X while one at 800 kb is at 76X, for example. The full assembly (using all the data) will hopefully resolve this.
7.4 SmartDenovo
smart <- readDNAStringSet("poretally.work/assembler_results/assemblies/smartdenovo.fasta")
len <- width(smart)
name <- names(smart)
seq <- paste(smart)
smart <- data.frame(name,seq,len)
smart <- smart[order(-smart$len),]
smart.largest <- smart[1,]
writeFasta(smart.largest,"poretally.work/MAGs/smart.largest.fa")
127 sequences; largest contig = 3,983,612; total length = 13,265,513
7.5 Conclusions
Four distinct pipelines completed; Canu failed entirely (this seems to happen for many people). Results were largely consistent. SMARTdenovo produced the largest contig, but it doesn't seem to run any sort of polishing, so that result is questionable. The headline numbers are collected below.
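For reference, the headline numbers from the four completed pipelines (copied from the subsections above), gathered into one table:
# summary of the subsampled-data assemblies; values copied from the results above
subsample.summary <- data.frame(
  assembler  = c("minimap2+miniasm", "minimap2+miniasm+raconX2", "flye", "smartdenovo"),
  contigs    = c(181, 181, 30, 127),
  largest.bp = c(3076114, 3130719, 3208805, 3983612),
  total.bp   = c(12909294, 13120286, 11027625, 13265513))
subsample.summary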
8 rerun the pipeline on the full dataset
poreTally run_assemblies \
-w full.poretally.work/ \
-t 8 \
-f Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fast5_pass \
-r regenome.fa \
--reads-fastq Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass
Curiously, Canu is now running (it failed previously). Via Java, it takes up only 3.4 GB.
8.1 Canu
library(Biostrings)
canu <- readDNAStringSet("full.poretally.work/assembler_results/assemblies/canu.fasta")
len <- width(canu)
name <- names(canu)
seq <- paste(canu)
canu <- data.frame(name,seq,len)
canu <- canu[order(-canu$len),]
canu.largest <- canu[1,]
#writeFasta(canu.largest, ...)
8.2 Flye
library(Biostrings)
flye2 <- readDNAStringSet("full.poretally.work/assembler_results/assemblies/flye.fasta")
len <- width(flye2)
name <- names(flye2)
seq <- paste(flye2)
flye2 <- data.frame(name,seq,len)
flye2 <- flye2[order(-flye2$len),]
flye2.largest <- flye2[1,]
writeFasta(flye2.largest,"full.poretally.work/MAGs/flye.largest.fasta")
9 Working with the full flye assembly
The largest contig, 3.2 Mb, was annotated, and its 16S rRNA gene identified it as a Methanobacterium.
The full Prokka annotation is here: https://www.dropbox.com/sh/hpwpft94g8052ig/AACvdMfJdmwf6NfxDjIysBjqa?dl=1
The rRNA (both 16S and 23S) sequences and Silva classification results are here: https://www.dropbox.com/sh/j9fmacm107yqwb1/AADNkBvag9BdSV2-W4M246RHa?dl=1
This is almost certainly not what we are looking for (although it is certainly possible)
9.1 contig depth
The largest contig was not the most abundant according to the flye depth results.
library(DT)
depth <- read.csv("full.poretally.work/assembler_results/flye/assembly_info.txt", sep="\t", header=T)
depth <- depth[order(-depth$cov.),]
datatable(depth)
There is almost an inverse correlation between length and coverage:
plot(depth$length,depth$cov.,xlab="contig.length",ylab="contig.read.depth")
pdf("contig.length.relationship.pdf")
plot(depth$length,depth$cov.,xlab="contig.length",ylab="contig.read.depth")
dev.off()
The plot makes it clear that there are several contigs with a depth of ~500. Perhaps these represent a single major genome? There is also another large contig at 2.8 Mb.
Are the smaller contigs artifacts? BLASTing the highest-coverage contig, contig_22, against NR shows it is almost certainly a cloning artifact.
The fact is, all larger contigs need to be considered carefully. Read depth can PERHAPS be used to make rough bins, but this will require careful curation (see the sketch below).
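As a sketch of what such rough bins might look like, using the depth table loaded above; the 100 kb size cutoff and the depth break points are arbitrary illustration values, not validated bins:
# very rough depth-based grouping of the larger contigs; cutoffs are illustrative only
big <- subset(depth, length >= 100000)
big$depth.group <- cut(big$cov., breaks = c(0, 100, 300, Inf),
                       labels = c("low", "mid", "high"))
# contig names (first column of assembly_info.txt) grouped by depth class
split(as.character(big[[1]]), big$depth.group)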
9.2 Checking second largest contig
flye.second <- flye2[2,]
writeFasta(flye.second, "full.poretally.work/MAGs/flye2.fasta")
The annotated genome is here: https://www.dropbox.com/sh/b99nditxqk9tmuj/AABGVtgGKRYrNDQZPX9T62Gta?dl=1
The rRNA (16S and 23S) and Silva classification are here: https://www.dropbox.com/sh/im5grzp7ma8ylim/AABHUnrUP7hmb5bCvRh-_Y78a?dl=1
This appears to be related to Dehalobacter.
9.2.1 Phylogeny
Computed a RAxML tree (GTRGAMMA) with the 23 closest neighbor sequences:
Newick file is here: https://www.dropbox.com/s/8o2oqx3zcolhzb0/LHAD2.16S.nwk?dl=1
9.3 Functional annotation
Key questions: 1. does it have HADases 2. does it have the DCM cassette
The Prokka annotation does not really help us here. However, we can get information from better systems: quickly get KEGG via EggNOG, plus COG/Pfam/TIGRFAM.
9.3.1 HADase via K01560
LHAD2_05234: K01560, This fragment (just the N-terminal 41 AA) is 93% identical to Dichloromethanomonas
9.3.2 DCM cassette
I don't have time to set up a local BLAST search, so quickly scan the annotations for diagnostic hits:
Key results to look for:
- mtaA/cmuA TIGR01463 result: LHAD2_00039
- DUF1638 (pfam07796)
- mtrH (K00584, COG1962) result: no COG
The functional annotations are TERRIBLE because the gene-calling is terrible. Most ORFs are tiny (see the quick check below). The error-prone nature of MinION reads makes the assembly apparently useless for gene-calling.
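A quick way to see the tiny-ORF problem directly is the predicted protein length distribution; a minimal sketch, assuming the Prokka protein fasta for this contig is saved as LHAD2.faa (the actual file name may differ):
library(Biostrings)
# predicted protein lengths from the Prokka annotation (file name assumed)
prots <- readAAStringSet("LHAD2.faa")
summary(width(prots))   # five-number summary of protein lengths, in amino acids
hist(width(prots), breaks = 50,
     xlab = "predicted protein length (aa)", main = "LHAD2 ORF lengths")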
9.4 minimap2 + miniasm + racon X2
library(Biostrings)
racon2 <- readDNAStringSet("full.poretally.work/assembler_results/assemblies/minimap2_miniasm_raconX2.fasta")
len <- width(racon2)
name <- names(racon2)
seq <- paste(racon2)
racon2 <- data.frame(name,seq,len)
racon2 <- racon2[order(-racon2$len),]
racon2.largest <- racon2[1,]
writeFasta(racon2.largest,"full.poretally.work/MAGs/racon1.fasta")
second <- racon2[2,]
writeFasta(second, "full.poretally.work/MAGs/racon2.fasta")
Prokka-generated genes are also pretty terrible.
10 Conclusions and next step
The assembly systems I used all led to unacceptably high error rates as judged by the protein size and annotation.
I should get a canu+nanopolish system working. For now, canu remains problematic. Thus, I will try the Minimap2 + miniasm + nanopolish system to see if it improves the error rate.
This time around, specify the assembly pipeline explicitly rather than running the default five:
poreTally run_assemblies \
-w full.poretally.nanopolish/ \
-p minimap2_miniasm_nanopolish \
-t 8 \
-r regenome.fa \
-f Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fast5_pass \
--reads-fastq Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass
Minimap2 RAM requirements go WAY up with this full-data assembly, up to 24 GB.
11 nanopolish error
based on error logs, nanopolish_merge.py failed:
The command from the logs:
python "${np_path}"/nanopolish_merge.py polished.*.fa > /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/assemblies/minimap2_miniasm_nanopolish.fasta 2>> /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/log_files/minimap2_miniasm_nanopolish.log
I installed manually and will attempt to use my own installation:
python /home/robert/Tools/nanopolish/scripts/nanopolish_merge.py polished.*.fa > /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/assemblies/minimap2_miniasm_nanopolish.fasta 2>> /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/log_files/manual.minimap2_miniasm_nanopolish.log
Actually no, I have no "polished.*.fa" file.
The previous step must have failed?
python "${{np_path}}"/nanopolish_makerange.py minimap2_miniasm.fasta | parallel --no-notice --results nanopolish.results -P 4 nanopolish variants --consensus polished.{{1}}.fa -w {{1}} -r input.fastq -b reads_sorted.bam -g minimap2_miniasm.fasta -t ${{threads_per_process}}
The folders which should have the ranges are totally empty instead.
Can I bypass the makerange step? I have the assembly, the BAM, the fasta… what's the problem?
Try manual nanopolish variants:
python /home/robert/Tools/nanopolish/scripts/nanopolish_makerange.py minimap2_miniasm.fasta | parallel --results nanopolish.results -P 8 \
/home/robert/Tools/nanopolish/nanopolish variants --consensus -w {1} --reads input.fastq --bam reads_sorted.bam \
--genome minimap2_miniasm.fasta -t 8 -p 1 -o polished.{1}.vcf --min-candidate-frequency 0.1
After a lot of bug hunting, this helped me finish: https://github.com/jts/nanopolish#computing-a-new-consensus-sequence-for-a-draft-assembly. Yes (holy crap), poreTally was feeding the wrong command to nanopolish.
This process is clearly very time-consuming: one thread per 50 kb chunk of assembly, over 3 GB of RAM per thread; it looks like it will take a long, long time… There will be over 1000 chunks!
11.0.1 Update, two days later
The process is still chugging away; only a handful of chunks have completed. RAM is totally maxed out across 8 threads.
Maybe 8 chunks have completed in two days. That would mean something like a year for the full thing.
This is clearly not going to work. Two problems: 1. there are many, many reads per contig (read depth in the 1000s); 2. RAM might be an issue; try running with fewer threads?
Issue 1 I cannot deal with unless I start the entire pipeline again.
Do this in a new folder, try on subsampled data:
poreTally run_assemblies \
-w poretally.nano.subsampled/ \
-p minimap2_miniasm_nanopolish \
-t 8 \
-r regenome.fa \
-f subsampled.fast5 \
--reads-fastq subsampled.fastq
Then when nanopolish fails, use manual version from within the created directory:
python /home/robert/Tools/nanopolish/scripts/nanopolish_makerange.py minimap2_miniasm.fasta | parallel --results nanopolish.results -P 8 \
/home/robert/Tools/nanopolish/nanopolish variants --consensus -w {1} --reads input.fastq --bam reads_sorted.bam \
--genome minimap2_miniasm.fasta -t 8 -p 1 -o polished.{1}.vcf --min-candidate-frequency 0.1
Still eats up 100% CPU across the 12 threads.
Still on the first 8 chunks after 12 hours…
the final command, which should merge the files:
/home/robert/Tools/nanopolish/nanopolish vcf2fasta -g minimap2_miniasm.fasta polished.*.vcf > polished_genome.fa
12 Conclusion: sans-polish
Without polishing, we can primarily derive rRNA sequences. Using the flye assembler, we also get circularity and coverage estimates. This can allow subsequent focused polishing of target contigs.
12.1 barrnap
barrnap -t 8 -o LHAD.RNA.fa flye.prokka/LHAD.fna > LHAD.RNA.gff
12.2 Silva
Ten 16S rRNA sequences were detected in the full 13 Mb assembly. Sending these to Silva we get:
class <- read.csv("draft.assembly.flye/silva.results/arb-silva.de_align_resultlist_688671.csv", header = T, sep = ";")
datatable(class)
13 next steps
- It is clear that the assembly MUST be polished to get anything useful other than RNA
- build a new pipeline, assemble with flye and follow the recommended nanopolish steps
13.1 standalone assemble + nanopolish
14 Centrifuge
This is a metagenome classifier http://www.ccb.jhu.edu/software/centrifuge/manual.shtml#obtaining-centrifuge
This appears to be a short-read classifier. Caleb is working on applying it to MinION data, so I will try it out.
Version 1.03 failed to make; 1.04 seems to be working (successful make and install).
14.1 installing default database
Downloading database:
cd indices
make p_compressed # bacterial genomes compressed at the species level [~4.2G]
This downloads (I assume) a reference genome database and builds the indices required for the program to run.
The make command failed due to the absence of "dustmasker", which is supposed to be part of NCBI BLAST… which is already installed. Update the conda version (conda install -c bioconda blast) and try again…
When doing "make" I get a simple "recipe for target 'p_compressed' failed". Strangely, it seems to be downloading the wrong database, the Archaea database rather than bacterial. The logfile says "Could not find Jellyfish."
conda install -c conda-forge jellyfish
Failed again. Perhaps it is a path problem; that seems likely.
Add the installation directory to the global path:
sudo nano /etc/environment
PATH=/home/robert/Tools/centrifuge-1.0.4-beta:$PATH
Attempting the make one more time with the base command… it seems to have failed again… I noticed that the genome files (the reference sequences) don't actually seem to be downloading? Not sure about this… I checked permissions and everything appears highly restricted. Change everything to 775 and try again.
Items are downloading with highly-restricted permissions.
14.2 building the database manually
First make a "customdb" directory within the centrifuge installation:
../centrifuge-download -o taxonomy taxonomy
../centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
fail, forget this program
15 Canu/medaka
conda install -c bioconda canu
canu \
-p LHAD -d LHAD.canu \
genomeSize=13.2m \
-nanopore-raw ../Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass
started @ 12:15pm Friday
This destroyed my computer; it turns out the conda version is old (v1.4). Upgrade to v1.8.
The pre-compiled binary fails at the first step
Download source and compile locally
/home/robert/Tools/canu/Linux-amd64/bin/canu \
-p LHAD -d LHAD.canu \
genomeSize=13.2m \
-nanopore-raw all.pass.fastq
Running fine: 80% CPU and up to 10 GB RAM.
Once it gets to the "overlaps" stage (not sure if this is part of correction or trimming), it starts to go very, very slowly. Each chunk appears to take ~30 minutes to complete… it isn't clear yet how many of these batches there are; is it 100 or 7?
The process completed within ~10 hours overall. The assembly is ~24 Mb, surprisingly large. There are no "mega" contigs like before. How to really evaluate it? How to bin?
15.1 Extract rRNA
This time I pulled 24 16S sequences (only 10 using Flye). Classifying them at Silva:
canu.16S.silva <- read.csv("canu.full/LHAD.canu/silva/LHAD.canu.16S.silva.results.csv", header = T, sep = ";")
canu.16S.silva <- canu.16S.silva[c(2,9,14,21)]
datatable(canu.16S.silva)
15.2 Medaka
https://github.com/nanoporetech/medaka
This help page, https://nanoporetech.github.io/medaka/benchmarks.html, says that they run canu, then racon, then medaka… wow, what a PITA. However, in the walkthrough they say to assemble (i.e. with canu) and then run medaka. You also have to provide a trained model, so that it can decide what is "true"… they provide a base model trained on human, yeast, and E. coli… this really sounds very wishy-washy.
15.3 Gene call the canu assembly and search for HADase and DCM
Made a small custom search set consisting of Dehalobacterium DCM genes and HADases from DIEL and DEFO.
Gene-called the assembly using:
prodigal -i LHAD.contigs.fasta -o LHAD.gbk -a LHAD.faa -p meta
and blastp’d using
#!/bin/bash
makeblastdb -in ../LHAD.faa -dbtype prot -out blastdb/LHAD_db
mkdir LHAD_DCM_HAD_search
blastp -db blastdb/LHAD_db -query DHB.DCM.2HADases.faa \
-out LHAD_DCM_HAD_search/blast.out \
-outfmt 6 \
-evalue 0.001
canu.blastp <- read.csv("canu.full/LHAD.canu/blast/LHAD_DCM_HAD_search/LHAD.canu.blast.out", sep="\t", header=F)
datatable(canu.blastp)
No notable hits to the DCM genes, but several strong hits to Diel-HAD2.
Pull those genes out:
library(Biostrings)
canu.diel.had <- canu.blastp[canu.blastp$V1 == "HAD_DIEL",]
write.csv(canu.diel.had, "canu.full/LHAD.canu/blast/LHAD_DCM_HAD_search/HAD_DIEL.blastp.out")
colnames(canu.diel.had)[2] <- "name"
canu.diel.had <- canu.diel.had[c(1,2)]
canu.prot <- readAAStringSet("canu.full/LHAD.canu/LHAD.faa")
name <- names(canu.prot)
#clean up the names
splitfun <- function(x){
strsplit(x," ")[[1]][1]
}
##apply the split across the gene list
name <- unlist(lapply(name, splitfun))
seq <- paste(canu.prot)
canu.prot <- data.frame(name,seq)
library(dplyr)
canu.diel.had.seq <- left_join(canu.diel.had, canu.prot)
writeFasta(canu.diel.had.seq, "canu.full/LHAD.canu/blast/LHAD_DCM_HAD_search/LHAD.DIEL.HAD.hits.faa")
15.4 DCA pathway
Made a small set of DIEL proteins: glycerate kinase, TSAR, and hydroxypyruvate isomerase.
Blasting these against the canu proteins:
tsar.blastp <- read.csv("canu.full/LHAD.canu/blast/LHAD_TSAR_search/LHAD.TSAR.blast.out", header=F, sep = "\t")
datatable(tsar.blastp)
15.5 Medaka
This can be installed via conda… the latest versions of miniconda are broken, so I had to downgrade, but then the conda install worked. It even installs tensorflow, so it might be compatible with GPU use.
medaka_consensus \
-i all.pass.fastq \
-d LHAD.canu/LHAD.contigs.fasta \
-o medaka \
-t 8
A customized local compile is required in order to make use of the GPU.
Order of operations:
1. minimap2 + samtools: CPU saturated, ~3 GB RAM (5 min)
2. samtools: all CPU, 6 GB RAM (5 min)
3. medaka: heavy CPU, 11.2 GB RAM
Finished within an hour!
Again run barrnap (the command is in section 15.8 below).
15.5.1 Models
This is supposed to be a very important choice, but it is a bit opaque. The default model, r941_min_high, should be broadly applicable and is aimed at high precision rather than speed.
15.6 maxbin2
Run a quick evaluation: will this work on the present data?
run_MaxBin.pl -contig consensus.fasta -thread 8 -out maxbin2/LHAD.medaka -reads ../all.pass.fastq
I did find somebody who did this, not published though: https://www.biorxiv.org/content/biorxiv/suppl/2018/10/30/456905.DC1/456905-1.pdf
Uses up to 36.6 GB of RAM and 100% CPU. Very RAM-intensive!
The process completed overnight, producing four bins.
maxbin2 <- read.csv("canu.full/medaka/maxbin2/LHAD.medaka.marker", sep="\t",header = T)
maxbin2 <- t(maxbin2)
datatable(maxbin2)
15.6.1 checkm
sudo checkm data setRoot /home/robert/Tools/checkmdata
checkm lineage_wf -f CheckM.txt -t 8 -x fasta bins/ bins/SCG
note that checkm has to be run within a python 2.x environment
library(tidyr)
bin.max.bact <- read.csv("canu.full/medaka/maxbin2/CheckM.csv")
datatable(bin.max.bact)
The bins are just garbage. I'm guessing that bowtie2 was the problem; try minimap2 instead.
15.7 minimap2
This will require a basic minimap2 run to produce a SAM file, then generation of a coverage file for maxbin2 to use.
minimap2 -a -x map-ont -t 8 consensus.fasta ../all.pass.fastq > minimapped.sam
Bedtools genomecov seems to be the best option here
First convert to bam and sort
samtools view -S -b minimapped.sam > minimapped.bam
samtools sort minimapped.bam -o minimap.sorted.bam
then bedtools genomecov
this first requires a “genome file”
library(Biostrings)
m.con <- readDNAStringSet("canu.full/medaka/consensus.fasta")
name <- names(m.con)
len <- width(m.con)
m.con <- data.frame(name,len)
write.table(m.con, "canu.full/medaka/genome.txt", sep = "\t", quote = F, row.names = F)
In the end, per-contig depth was taken with samtools depth:
samtools depth -a minimap.sorted.bam > sam.depth.txt
import and create a summary
sam <- read.csv("canu.full/medaka/sam.depth.txt",header = F, sep = "\t")
library(dplyr)
cov <- sam %>%
group_by(V1) %>%
summarize(cov = mean(V3))
write.table(cov,"canu.full/medaka/sam.avg.dep.txt",sep = "\t",
col.names=F,row.names = F,quote = F)
run_MaxBin.pl -contig consensus.fasta -abund sam.avg.dep.txt -thread 8 -out maxbin2.2/LHAD.medaka
checkm lineage_wf -f CheckM.2.txt -t 8 -x fasta maxbin2.2/ maxbin2.2/SCG
The resulting bins appear similar to before, which implies that bowtie2 was not the problem.
bin2.max.bact <- read.csv("canu.full/medaka/CheckM.2.csv")
bin2.max.bact <- bin2.max.bact[c(1,2,12,13)]
library(knitr)
kable(bin2.max.bact)
| Bin.Id | Marker.lineage | Completeness | Contamination |
|---|---|---|---|
| LHAD.medaka.002 | k__Archaea (UID2) | 91.74 | 73.08 |
| LHAD.medaka.001 | c__Clostridia (UID1085) | 90.31 | 8.80 |
| LHAD.medaka.003 | k__Bacteria (UID203) | 70.02 | 31.48 |
| LHAD.medaka.004 | k__Bacteria (UID203) | 60.74 | 62.01 |
Bin 1 ended up quite clean… however, this relies upon marker genes, and I am certain that fragmentation of genes due to errors will lead to missed information, which makes this overall still garbage.
15.8 barrnap
barrnap -t 8 -o LHAD.medaka.RNA.fa consensus.fasta > LHAD.medaka.RNA.gff
16 Archive
17 Subsampling
17.1 Target depth
For starters, we will subsample down to the depth required to confidently assemble a genome that makes up 10% of the community. To get 40x coverage of 10 such genomes, we need roughly:
4 Mb genome x 40x coverage x 10 genomes = 1,600 Mbases, i.e. 1.6 Gbases
17.2 seqTK subsampling
As long as the reads are provided in fastq format, this can be executed with the seqtk package as follows (the read count can be derived as sketched after the command):
/path/to/seqtk sample -s100 \
lenoir.DCA.fastq \
(insert.read.number.here) > subsampled.fastq/lenoir.DCA.READCOUNT.fastq
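To fill in the read count above, a minimal sketch dividing the target base count by the mean read length from MinIONQC; the mean length used here is a placeholder assumption, not the measured value:
# number of reads needed to hit the target sampling depth
target.bases <- 1.6e9        # 4 Mb x 40x coverage x 10 genomes, as computed above
mean.read.length <- 8000     # placeholder; take the real value from the MinIONQC output
reads.needed <- ceiling(target.bases / mean.read.length)
reads.needed                 # substitute for (insert.read.number.here)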