MinIon.Assembly.DCA.rmd
- 1 DCA enriched sludge from Lenoir City; MinION metagenomics
- 2 Summary and Conclusions
- 3 concatenating the fastQ
- 4 QC
- 5 Subsampling
- 6 Available assemblers
- 7 Assembly Results
- 8 rerun the pipeline on the full dataset
- 9 Working with the full flye assembly
- 10 Conclusions and next step
- 11 nanopolish error
- 12 Conclusion: sans-polish
- 13 next steps
- 14 Centrifuge
- 15 Canu/medaka
- 16 Archive
- 17 Subsampling
1 DCA enriched sludge from Lenoir City; MinION metagenomics
2 Summary and Conclusions
This was a very exploratory process. Canu -> Medaka is the assembly pipeline I landed on, and any additional results will be based on it.
Right at the start, I'm providing some key downloads for use in downstream analyses:
2.1 Useful downloads
2.1.1 full-metagenome files
metagenome assembly: https://www.dropbox.com/s/umudgfv4pq7kyse/consensus.fasta?dl=1
gene nucleotide fastas: https://www.dropbox.com/s/cdhubbz5g0x7810/LHAD.medaka.fna?dl=1
gene protein fastas: https://www.dropbox.com/s/cdhubbz5g0x7810/LHAD.medaka.fna?dl=1
16S rRNA genes: https://www.dropbox.com/s/u92yq8x88f9222e/LHAD.medaka.RNA.fa?dl=1
library(dplyr)
library(Biostrings)

# helper to trim fasta headers at the first space (same as splitfun, defined in the Canu section below)
spcSplit <- function(x){
  strsplit(x, " ")[[1]][1]
}

# writeFasta() is defined in the Assembly Results section below
medaka.diel.search <- read.csv("canu.full/medaka/blast/LHAD_medaka_search/blast.out", sep = "\t", header = F)
medaka.diel.search <- subset(medaka.diel.search, V1 == "HAD_DIEL")
medaka.diel.search <- medaka.diel.search[2]
colnames(medaka.diel.search) <- "name"

# protein sequences
medaka.prot <- readAAStringSet("canu.full/medaka/LHAD.medaka.faa")
name <- names(medaka.prot)
name <- unlist(lapply(name, spcSplit))
seq <- paste(medaka.prot)
medaka.prot <- data.frame(name, seq)
medaka.had <- left_join(medaka.diel.search, medaka.prot)
writeFasta(medaka.had, "LC.medaka.diel.hadases.faa")

# nucleotide sequences
medaka.nuc <- readDNAStringSet("canu.full/medaka/LHAD.medaka.fna")
name <- names(medaka.nuc)
name <- unlist(lapply(name, spcSplit))
seq <- paste(medaka.nuc)
medaka.nuc <- data.frame(name, seq)
medaka.nuc <- left_join(medaka.diel.search, medaka.nuc)
writeFasta(medaka.nuc, "LC.medaka.diel.hadases.fna")
2.1.2 Dichloromethanomonas-specific files
protein similarity to Diel HADases, mec cassette genes, glyoxylate metabolic genes, etc.: https://www.dropbox.com/s/wc9zy8unwpriz45/blast.out?dl=1
Diel-like HADases (according to BLASTP searches), genes: https://www.dropbox.com/s/i1bv7walapvvdpz/LC.medaka.diel.hadases.fna?dl=1
Diel-like HADases (according to BLASTP searches), proteins: https://www.dropbox.com/s/zdrmo9j113wo362/LC.medaka.diel.hadases.faa?dl=1
Diel-like 16S rRNA genes: https://www.dropbox.com/s/1akoapifebe9ebf/diel.like.16S.fna?dl=1
From here on out, methods and data used to derive this information
Evaluating MinION de novo assembly pipelines with NO available Illumina data.
Excellent overview here: https://www.sciencedirect.com/science/article/pii/S1672022916301309
3 concatenating the fastQ
The fastq files come in ~40 MB chunks, a peculiarity of the MinION system. It is easiest to concatenate these into a single file.
cat Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass/*.fastq > lenoir.DCA.fastq
4 QC
A simple program is available, MinionQC
https://github.com/roblanf/minion_qc https://academic.oup.com/bioinformatics/article/35/3/523/5057155
However, it relies upon a report file called "sequencing_summary.txt" that should be provided by the sequencer.
This program is actually just an R script, run as such:
Rscript MinIONQC.R -i Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/sequencing_summary/DESKTOP_MJPJRJJ_20190607_FAK53375_MN30014_sequencing_run_Gao_LC_DCA_90913_sequencing_summary.txt
It generates a wide variety of metrics. These will be used to determine the average read length and thus how many reads are required to reach a desired sampling depth (a quick way to pull those numbers directly is sketched below).
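For reference, here is a minimal sketch of pulling those numbers straight from the sequencing summary instead; the sequence_length_template column name is the standard Albacore/Guppy layout and is assumed here, and the path is shortened.
library(data.table)
# read the tab-delimited summary produced by the sequencer (path shortened here)
ss <- fread("sequencing_summary.txt")
# average read length and total bases, assuming the standard column name
mean.read.length <- mean(ss$sequence_length_template)
total.bases <- sum(ss$sequence_length_template)
c(mean.read.length = mean.read.length, total.bases = total.bases)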
4.1 application note
I had trouble getting Rscript to recognize the appropriate packages… Instead of trying to really solve the problem (Rscript doesn't seem to use the same version of R), I just amended the main script with a brute-force installation command as the first step:
install.packages(c("data.table",
"futile.logger",
"ggplot2",
"optparse",
"plyr",
"readr",
"reshape2",
"scales",
"viridis",
"yaml"))
4.2 QC Results
5 Subsampling
Traditional random subsampling is not possible here since I need both the fast5 and fastq files for some assembly pipelines.
For starters, I will simply take 1/4 of the files (50 of the 200) and place them into separate folders (see the sketch below).
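A minimal sketch of that crude split, assuming the chunk files sit in fastq_pass/ and fast5_pass/ (full paths shortened) and reusing the subsampled.fastq / subsampled.fast5 folder names from the poreTally commands below:
# copy the first 50 of the ~200 chunk files into separate folders
fq <- list.files("fastq_pass", pattern = "\\.fastq$", full.names = TRUE)
f5 <- list.files("fast5_pass", pattern = "\\.fast5$", full.names = TRUE)
dir.create("subsampled.fastq")
dir.create("subsampled.fast5")
file.copy(fq[1:50], "subsampled.fastq")
file.copy(f5[1:50], "subsampled.fast5")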
6 Available assemblers
I might have trouble with data volume. There are ~8 GB of raw fastq files; I will reduce this down to ~2 GB. This might overwhelm my system; we will see. There is no suitable assembler available via KBase, so I might be up a creek with this.
6.1 poreTally
https://github.com/cvdelannoy/poreTally https://www.biorxiv.org/content/biorxiv/early/2018/09/23/424184.full.pdf (reference)
A pipeline that runs many (if not all) of the trustworthy MinION assembly programs in parallel and compares the results.
Note that this is intended for benchmarking assemblers. There is some weird stuff in here that I hope will not get in the way of just getting assemblies finished.
I will stick with this for now; it is a great place to start.
6.2 running default poreTally assemble
This runs five assembly pipelines by default:
- Canu
- Flye
- Minimap2 + miniasm
- Minimap2 + miniasm + racon x 2
- SMARTdenovo
poreTally requires that a "reference genome" be provided. This simply does not matter for my purposes, so I will provide some nonsense.
Create a yaml file (also not important, but required for the program)
authors: B. Honeydew
species: Escherichia coli
basecaller: Albacore 2.2.7
flowcell: FLO-MIN106
kit: SQK-LSK108
poreTally run_assemblies \
-w poretally.work/ \
-t 8 \
-f subsampled.fast5 \
-r regenome.fa \
--reads-fastq subsampled.fastq
6.3 Monitoring RAM usage
- minimap2 got up to 12GB in the first few minutes
- flye-assemble 2.5GB
- flye-repeat 1.2GB
- minimap2 12GB second iteration (with racon X 2)
- racon just a few GB
- SMARTdenovo completed quickly; didn't see usage
7 Assembly Results
7.1 Minimap2_miniasm
# write a data.frame with "name" and "seq" columns out to a fasta file
writeFasta <- function(data, filename){
fastaLines = c()
for (rowNum in 1:nrow(data)){
fastaLines = c(fastaLines, as.character(paste(">", data[rowNum,"name"], sep = "")))
fastaLines = c(fastaLines,as.character(data[rowNum,"seq"]))
}
fileConn<-file(filename)
writeLines(fastaLines, fileConn)
close(fileConn)
}
library(Biostrings)
minimini <- readDNAStringSet("poretally.work/assembler_results/assemblies/minimap2_miniasm.fasta")
len <- width(minimini)
name <- names(minimini)
seq <- paste(minimini)
minimini <- data.frame(name,seq,len)
minimini <- minimini[order(-minimini$len),]
minimini.largest <- minimini[1,]
writeFasta(minimini.largest,"poretally.work/MAGs/minimini.largest.fa")
181 sequences; largest contig = 3,076,114; total length = 12,909,294
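The numbers quoted for each assembly below (contig count, largest contig, total length) can be pulled out with a small helper along these lines; this is a sketch rather than part of the original pipeline, and the N50 value is an extra that was not reported originally.
# quick assembly summary from a DNAStringSet: contig count, largest contig,
# total length, and N50 (smallest contig that brings the cumulative length
# past half of the total assembly)
assemblyStats <- function(dss){
  len <- sort(width(dss), decreasing = TRUE)
  n50 <- len[which(cumsum(len) >= sum(len) / 2)[1]]
  c(contigs = length(len), largest = len[1], total.length = sum(len), N50 = n50)
}
assemblyStats(readDNAStringSet("poretally.work/assembler_results/assemblies/minimap2_miniasm.fasta"))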
7.1.1 Taxonomy of largest contig
Performed via MiGA.
7.2 Minimap2_miniasm with racon X2
library(Biostrings)
racon <- readDNAStringSet("poretally.work/assembler_results/assemblies/minimap2_miniasm_raconX2.fasta")
len <- width(racon)
name <- names(racon)
seq <- paste(racon)
racon <- data.frame(name,seq,len)
racon <- racon[order(-racon$len),]
racon.largest <- racon[1,]
writeFasta(racon.largest,"poretally.work/MAGs/racon.largest.fa")
181 sequences; largest contig = 3,130,719; total length = 13,120,286
7.3 Flye
flye <- readDNAStringSet("poretally.work/assembler_results/assemblies/flye.fasta")
len <- width(flye)
name <- names(flye)
seq <- paste(flye)
flye <- data.frame(name,seq,len)
flye <- flye[order(-flye$len),]
flye.largest <- flye[1,]
writeFasta(flye.largest,"poretally.work/MAGs/flye.largest.fa")
30 sequences; largest contig = 3,208,805; total length = 11,027,625
Note that a coverage file is also generated, which suggests roughly equivalent coverage for everything (same order of magnitude). Some smaller contigs are deeper: the largest contig is at 40X while one at 800 kb is at 76X, for example. The full assembly (using all the data) will hopefully resolve this.
7.4 SmartDenovo
smart <- readDNAStringSet("poretally.work/assembler_results/assemblies/smartdenovo.fasta")
len <- width(smart)
name <- names(smart)
seq <- paste(smart)
smart <- data.frame(name,seq,len)
smart <- smart[order(-smart$len),]
smart.largest <- smart[1,]
writeFasta(smart.largest,"poretally.work/MAGs/smart.largest.fa")
127 sequences; largest contig = 3,983,612; total length = 13,265,513
7.5 Conclusions
Four distinct pipelines completed; Canu failed entirely (this seems to happen for many people). Results were largely consistent. SMARTdenovo produced the largest contig, but it doesn't seem to run any sort of polishing, so that result is questionable. The headline numbers are collected below.
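For reference, the headline numbers from the four completed pipelines (copied from the subsections above), gathered into one table:
# summary of the subsampled-data assemblies; values copied from the results above
subsample.summary <- data.frame(
  assembler  = c("minimap2+miniasm", "minimap2+miniasm+raconX2", "flye", "smartdenovo"),
  contigs    = c(181, 181, 30, 127),
  largest.bp = c(3076114, 3130719, 3208805, 3983612),
  total.bp   = c(12909294, 13120286, 11027625, 13265513))
subsample.summary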
8 rerun the pipeline on the full dataset
poreTally run_assemblies \
-w full.poretally.work/ \
-t 8 \
-f Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fast5_pass \
-r regenome.fa \
--reads-fastq Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass
Curiously, Canu is now running (it failed previously). Via Java, it takes up only 3.4 GB.
8.1 Canu
library(Biostrings)
canu <- readDNAStringSet("full.poretally.work/assembler_results/assemblies/canu.fasta")
len <- width(canu)
name <- names(canu)
seq <- paste(canu)
canu <- data.frame(name,seq,len)
canu <- canu[order(-canu$len),]
canu.largest <- canu[1,]
#writeFasta(canu.largest, ...)
8.2 Flye
library(Biostrings)
flye2 <- readDNAStringSet("full.poretally.work/assembler_results/assemblies/flye.fasta")
len <- width(flye2)
name <- names(flye2)
seq <- paste(flye2)
flye2 <- data.frame(name,seq,len)
flye2 <- flye2[order(-flye2$len),]
flye2.largest <- flye2[1,]
writeFasta(flye2.largest,"full.poretally.work/MAGs/flye.largest.fasta")
9 Working with the full flye assembly
The largest contig, 3.2 Mb, was annotated, and its 16S rRNA gene identified it as a Methanobacterium.
The full Prokka annotation is here: https://www.dropbox.com/sh/hpwpft94g8052ig/AACvdMfJdmwf6NfxDjIysBjqa?dl=1
The rRNA (both 16S and 23S) sequences and Silva classification results are here: https://www.dropbox.com/sh/j9fmacm107yqwb1/AADNkBvag9BdSV2-W4M246RHa?dl=1
This is almost certainly not what we are looking for (although it is certainly possible)
9.1 contig depth
The largest contig was not the most abundant according to the flye depth results.
library(DT)
depth <- read.csv("full.poretally.work/assembler_results/flye/assembly_info.txt", sep="\t", header=T)
depth <- depth[order(-depth$cov.),]
datatable(depth)
There is almost an inverse correlation between length and coverage:
plot(depth$length,depth$cov.,xlab="contig.length",ylab="contig.read.depth")
pdf("contig.length.relationship.pdf")
plot(depth$length,depth$cov.,xlab="contig.length",ylab="contig.read.depth")
dev.off()
The plot makes it clear that there are several contigs with a depth of ~500. Perhaps these represent a single major genome? There is also another large contig at 2.8 Mb.
Are the smaller contigs artifacts? BLASTing the highest-coverage contig, contig_22, against NR shows it is almost certainly a cloning artifact.
The fact is, all larger contigs need to be considered carefully. Read depth can PERHAPS be used to make rough bins, but this will require careful curation (see the sketch below).
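As a sketch of what such rough bins might look like, using the depth table loaded above; the 100 kb size cutoff and the depth break points are arbitrary illustration values, not validated bins:
# very rough depth-based grouping of the larger contigs; cutoffs are illustrative only
big <- subset(depth, length >= 100000)
big$depth.group <- cut(big$cov., breaks = c(0, 100, 300, Inf),
                       labels = c("low", "mid", "high"))
# contig names (first column of assembly_info.txt) grouped by depth class
split(as.character(big[[1]]), big$depth.group)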
9.2 Checking second largest contig
flye.second <- flye2[2,]
writeFasta(flye.second, "full.poretally.work/MAGs/flye2.fasta")
The annotated genome is here: https://www.dropbox.com/sh/b99nditxqk9tmuj/AABGVtgGKRYrNDQZPX9T62Gta?dl=1
The rRNA (16S and 23S) and Silva classification are here: https://www.dropbox.com/sh/im5grzp7ma8ylim/AABHUnrUP7hmb5bCvRh-_Y78a?dl=1
This appears to be related to Dehalobacter.
9.2.1 Phylogeny
Computed a RAxML tree (GTRGAMMA) with the 23 closest neighbor sequences:
Newick file is here: https://www.dropbox.com/s/8o2oqx3zcolhzb0/LHAD2.16S.nwk?dl=1
9.3 Functional annotation
Key questions: 1. does it have HADases 2. does it have the DCM cassette
The Prokka annotation does not really help us here. However, we can get information from better systems: quickly get KEGG via EggNOG, plus COG/Pfam/TIGRFAM.
9.3.1 HADase via K01560
LHAD2_05234: K01560, This fragment (just the N-terminal 41 AA) is 93% identical to Dichloromethanomonas
9.3.2 DCM cassette
I don't have time to set up a local BLAST search, so quickly scan the annotations for diagnostic hits:
Key results to look for:
- mtaA/cmuA TIGR01463 result: LHAD2_00039
- DUF1638 (pfam07796)
- mtrH (K00584, COG1962) result: no COG
The functional annotations are TERRIBLE because the gene-calling is terrible. Most ORFs are tiny (see the quick check below). The error-prone nature of MinION reads makes the assembly apparently useless for gene-calling.
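A quick way to see the tiny-ORF problem directly is the predicted protein length distribution; a minimal sketch, assuming the Prokka protein fasta for this contig is saved as LHAD2.faa (the actual file name may differ):
library(Biostrings)
# predicted protein lengths from the Prokka annotation (file name assumed)
prots <- readAAStringSet("LHAD2.faa")
summary(width(prots))   # five-number summary of protein lengths, in amino acids
hist(width(prots), breaks = 50,
     xlab = "predicted protein length (aa)", main = "LHAD2 ORF lengths")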
9.4 minimap2 + miniasm + racon X2
library(Biostrings)
racon2 <- readDNAStringSet("full.poretally.work/assembler_results/assemblies/minimap2_miniasm_raconX2.fasta")
len <- width(racon2)
name <- names(racon2)
seq <- paste(racon2)
racon2 <- data.frame(name,seq,len)
racon2 <- racon2[order(-racon2$len),]
racon2.largest <- racon2[1,]
writeFasta(racon2.largest,"full.poretally.work/MAGs/racon1.fasta")
second <- racon2[2,]
writeFasta(second, "full.poretally.work/MAGs/racon2.fasta")
Prokka-generated genes are also pretty terrible.
10 Conclusions and next step
The assembly systems I used all led to unacceptably high error rates as judged by the protein size and annotation.
I should get a canu+nanopolish system working. For now, canu remains problematic. Thus, I will try the Minimap2 + miniasm + nanopolish system to see if it improves the error rate.
This time around, specify the assembly pipeline explicitly rather than running the default five:
poreTally run_assemblies \
-w full.poretally.nanopolish/ \
-p minimap2_miniasm_nanopolish \
-t 8 \
-r regenome.fa \
-f Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fast5_pass \
--reads-fastq Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass
Minimap2 RAM requirements go WAY up with this full-data assembly, up to 24 GB.
11 nanopolish error
based on error logs, nanopolish_merge.py failed:
The command from the logs:
python "${np_path}"/nanopolish_merge.py polished.*.fa > /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/assemblies/minimap2_miniasm_nanopolish.fasta 2>> /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/log_files/minimap2_miniasm_nanopolish.log
I installed manually and will attempt to use my own installation:
python /home/robert/Tools/nanopolish/scripts/nanopolish_merge.py polished.*.fa > /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/assemblies/minimap2_miniasm_nanopolish.fasta 2>> /home/robert/Dropbox/Projects/Minion.Assembly/full.poretally.nanopolish/assembler_results/log_files/manual.minimap2_miniasm_nanopolish.log
Actually no, I have no "polished.*.fa" file.
The previous step must have failed?
python "${{np_path}}"/nanopolish_makerange.py minimap2_miniasm.fasta | parallel --no-notice --results nanopolish.results -P 4 nanopolish variants --consensus polished.{{1}}.fa -w {{1}} -r input.fastq -b reads_sorted.bam -g minimap2_miniasm.fasta -t ${{threads_per_process}}
The folders which should have the ranges are totally empty instead.
Can I bypass the makerange step? I have the assembly, the BAM, the fasta… what's the problem?
Try manual nanopolish variants:
python /home/robert/Tools/nanopolish/scripts/nanopolish_makerange.py minimap2_miniasm.fasta | parallel --results nanopolish.results -P 8 \
/home/robert/Tools/nanopolish/nanopolish variants --consensus -w {1} --reads input.fastq --bam reads_sorted.bam \
--genome minimap2_miniasm.fasta -t 8 -p 1 -o polished.{1}.vcf --min-candidate-frequency 0.1
After a lot of bug hunting, this helped me finish: https://github.com/jts/nanopolish#computing-a-new-consensus-sequence-for-a-draft-assembly. Yes (holy crap), poreTally was feeding the wrong command to nanopolish.
This process is clearly very time-consuming: one thread per 50 kb chunk of assembly, over 3 GB of RAM per thread; it looks like it will take a long, long time… There will be over 1000 chunks!
11.0.1 Update, two days later
The process is still chugging away; only a handful of chunks have completed. RAM is totally maxed out across 8 threads.
Maybe 8 chunks have completed in two days. That would mean something like a year for the full thing.
This is clearly not going to work. Two problems: 1. there are many, many reads per contig (read depth in the 1000s); 2. RAM might be an issue; try running with fewer threads?
Issue 1 I cannot deal with unless I start the entire pipeline again.
Do this in a new folder, try on subsampled data:
poreTally run_assemblies \
-w poretally.nano.subsampled/ \
-p minimap2_miniasm_nanopolish \
-t 8 \
-r regenome.fa \
-f subsampled.fast5 \
--reads-fastq subsampled.fastq
Then when nanopolish fails, use manual version from within the created directory:
python /home/robert/Tools/nanopolish/scripts/nanopolish_makerange.py minimap2_miniasm.fasta | parallel --results nanopolish.results -P 8 \
/home/robert/Tools/nanopolish/nanopolish variants --consensus -w {1} --reads input.fastq --bam reads_sorted.bam \
--genome minimap2_miniasm.fasta -t 8 -p 1 -o polished.{1}.vcf --min-candidate-frequency 0.1
Still eats up 100% CPU across the 12 threads.
Still on the first 8 chunks after 12 hours…
the final command, which should merge the files:
/home/robert/Tools/nanopolish/nanopolish vcf2fasta -g minimap2_miniasm.fasta polished.*.vcf > polished_genome.fa
12 Conclusion: sans-polish
Without polishing, we can primarily derive rRNA sequences. Using the flye assembler, we also get circularity and coverage estimates. This can allow subsequent focused polishing of target contigs.
12.1 barrnap
barrnap -t 8 -o LHAD.RNA.fa flye.prokka/LHAD.fna > LHAD.RNA.gff
12.2 Silva
Ten 16S rRNA sequences were detected in the full 13 Mb assembly. Sending these to Silva we get:
class <- read.csv("draft.assembly.flye/silva.results/arb-silva.de_align_resultlist_688671.csv", header = T, sep = ";")
datatable(class)
13 next steps
- It is clear that the assembly MUST be polished to get anything useful other than RNA
- build a new pipeline, assemble with flye and follow the recommended nanopolish steps
13.1 standalone assemble + nanopolish
14 Centrifuge
This is a metagenome classifier http://www.ccb.jhu.edu/software/centrifuge/manual.shtml#obtaining-centrifuge
This appears to be a short-read classifier. Caleb is working on applying it to MinION data, so I will try it out.
Version 1.03 failed to make; 1.04 seems to be working (successful make and install).
14.1 installing default database
Downloading database:
cd indices
make p_compressed # bacterial genomes compressed at the species level [~4.2G]
This downloads (I assume) a reference genome database and builds the indices required for the program to run.
The make command failed due to the absence of "dustmasker", which is supposed to be part of NCBI BLAST… which is already installed. Update the conda version (conda install -c bioconda blast) and try again…
When doing "make" I get a simple "recipe for target 'p_compressed' failed". Strangely, it seems to be downloading the wrong database, the Archaea database rather than bacterial. The logfile says "Could not find Jellyfish."
conda install -c conda-forge jellyfish
Failed again. Perhaps it is a path problem; that seems likely.
Add the installation directory to the global path:
sudo nano /etc/environment
PATH=/home/robert/Tools/centrifuge-1.0.4-beta:$PATH
Attempting the make one more time with the base command… it seems to have failed again… I noticed that the genome files (the reference sequences) don't actually seem to be downloading? Not sure about this… I checked permissions and everything appears highly restricted. Change everything to 775 and try again.
Items are downloading with highly-restricted permissions.
14.2 building the database manually
First make a "customdb" directory within the centrifuge installation:
../centrifuge-download -o taxonomy taxonomy
../centrifuge-download -o library -m -d "archaea,bacteria,viral" refseq > seqid2taxid.map
fail, forget this program
15 Canu/medaka
conda install -c bioconda canu
canu \
-p LHAD -d LHAD.canu \
genomeSize=13.2m \
-nanopore-raw ../Gao_LC_DCA/Gao_LC_DCA/20190607_1432_MN30014_FAK53375_08507b6b/fastq_pass
started @ 12:15pm Friday
This destroyed my computer; it turns out the conda version is old (v1.4). Upgrade to v1.8.
The pre-compiled binary fails at the first step
Download source and compile locally
/home/robert/Tools/canu/Linux-amd64/bin/canu \
-p LHAD -d LHAD.canu \
genomeSize=13.2m \
-nanopore-raw all.pass.fastq
Running fine: 80% CPU and up to 10 GB RAM.
Once it gets to the "overlaps" stage (not sure if this is part of correction or trimming), it starts to go very, very slowly. Each chunk appears to take ~30 minutes to complete… it isn't clear yet how many of these batches there are; is it 100 or 7?
The process completed within ~10 hours overall. The assembly is ~24 Mb, surprisingly large. There are no "mega" contigs like before. How to really evaluate it? How to bin?
15.1 Extract rRNA
This time I pulled 24 16S sequences (only 10 using Flye). Classifying them at Silva:
canu.16S.silva <- read.csv("canu.full/LHAD.canu/silva/LHAD.canu.16S.silva.results.csv", header = T, sep = ";")
canu.16S.silva <- canu.16S.silva[c(2,9,14,21)]
datatable(canu.16S.silva)
15.2 Medaka
https://github.com/nanoporetech/medaka
This help page, https://nanoporetech.github.io/medaka/benchmarks.html, says that they run canu, then racon, then medaka… wow, what a PITA. However, in the walkthrough they say to assemble (i.e. with canu) and then run medaka. You also have to provide a trained model, so that it can decide what is "true"… they provide a base model trained on human, yeast, and E. coli… this really sounds very wishy-washy.
15.3 Gene call the canu assembly and search for HADase and DCM
Made a small custom search set consisting of Dehalobacterium DCM genes and HADases from DIEL and DEFO.
Gene-called the assembly using:
prodigal -i LHAD.contigs.fasta -o LHAD.gbk -a LHAD.faa -p meta
and blastp’d using
#!/bin/bash
makeblastdb -in ../LHAD.faa -dbtype prot -out blastdb/LHAD_db
mkdir LHAD_DCM_HAD_search
blastp -db blastdb/LHAD_db -query DHB.DCM.2HADases.faa \
-out LHAD_DCM_HAD_search/blast.out \
-outfmt 6 \
-evalue 0.001
canu.blastp <- read.csv("canu.full/LHAD.canu/blast/LHAD_DCM_HAD_search/LHAD.canu.blast.out", sep="\t", header=F)
datatable(canu.blastp)
No notable hits to the DCM genes, but several strong hits to Diel-HAD2.
Pull those genes out:
library(Biostrings)
canu.diel.had <- canu.blastp[canu.blastp$V1 == "HAD_DIEL",]
write.csv(canu.diel.had, "canu.full/LHAD.canu/blast/LHAD_DCM_HAD_search/HAD_DIEL.blastp.out")
colnames(canu.diel.had)[2] <- "name"
canu.diel.had <- canu.diel.had[c(1,2)]
canu.prot <- readAAStringSet("canu.full/LHAD.canu/LHAD.faa")
name <- names(canu.prot)
#clean up the names
splitfun <- function(x){
strsplit(x," ")[[1]][1]
}
##apply the split across the gene list
name <- unlist(lapply(name, splitfun))
seq <- paste(canu.prot)
canu.prot <- data.frame(name,seq)
library(dplyr)
canu.diel.had.seq <- left_join(canu.diel.had, canu.prot)
writeFasta(canu.diel.had.seq, "canu.full/LHAD.canu/blast/LHAD_DCM_HAD_search/LHAD.DIEL.HAD.hits.faa")
15.4 DCA pathway
Made a small set of DIEL proteins: glycerate kinase, TSAR, and hydroxypyruvate isomerase.
Blasting these against the canu proteins:
tsar.blastp <- read.csv("canu.full/LHAD.canu/blast/LHAD_TSAR_search/LHAD.TSAR.blast.out", header=F, sep = "\t")
datatable(tsar.blastp)
15.5 Medaka
This can be installed via conda… the latest versions of miniconda are broken, so I had to downgrade, but then the conda install worked. It even installs tensorflow, so it might be compatible with GPU use.
medaka_consensus \
-i all.pass.fastq \
-d LHAD.canu/LHAD.contigs.fasta \
-o medaka \
-t 8
A customized local compile is required in order to make use of the GPU.
Order of operations:
1. minimap2 + samtools: CPU saturated, ~3 GB RAM (5 min)
2. samtools: all CPU, 6 GB RAM (5 min)
3. medaka: heavy CPU, 11.2 GB RAM
Finished within an hour!
Again run barrnap (the command is in section 15.8 below).
15.5.1 Models
This is supposed to be a very important choice, but it is a bit opaque. The default model, r941_min_high, should be broadly applicable and is aimed at high precision rather than speed.
15.6 maxbin2
Run a quick evaluation: will this work on the present data?
run_MaxBin.pl -contig consensus.fasta -thread 8 -out maxbin2/LHAD.medaka -reads ../all.pass.fastq
I did find somebody who did this, not published though: https://www.biorxiv.org/content/biorxiv/suppl/2018/10/30/456905.DC1/456905-1.pdf
Uses up to 36.6 GB of RAM and 100% CPU. Very RAM-intensive!
The process completed overnight, producing four bins.
maxbin2 <- read.csv("canu.full/medaka/maxbin2/LHAD.medaka.marker", sep="\t",header = T)
maxbin2 <- t(maxbin2)
datatable(maxbin2)
15.6.1 checkm
sudo checkm data setRoot /home/robert/Tools/checkmdata
checkm lineage_wf -f CheckM.txt -t 8 -x fasta bins/ bins/SCG
note that checkm has to be run within a python 2.x environment
library(tidyr)
bin.max.bact <- read.csv("canu.full/medaka/maxbin2/CheckM.csv")
datatable(bin.max.bact)
The bins are just garbage. I'm guessing that bowtie2 was the problem; try minimap2 instead.
15.7 minimap2
This will require a basic minimap2 run to produce a SAM file, then generation of a coverage file for maxbin2 to use.
minimap2 -a -x map-ont -t 8 consensus.fasta ../all.pass.fastq > minimapped.sam
Bedtools genomecov seems to be the best option here
First convert to bam and sort
samtools view -S -b minimapped.sam > minimapped.bam
samtools sort minimapped.bam -o minimap.sorted.bam
then bedtools genomecov
this first requires a “genome file”
library(Biostrings)
m.con <- readDNAStringSet("canu.full/medaka/consensus.fasta")
name <- names(m.con)
len <- width(m.con)
m.con <- data.frame(name,len)
write.table(m.con, "canu.full/medaka/genome.txt", sep = "\t", quote = F, row.names = F)
In the end, per-contig depth was taken with samtools depth:
samtools depth -a minimap.sorted.bam > sam.depth.txt
import and create a summary
sam <- read.csv("canu.full/medaka/sam.depth.txt",header = F, sep = "\t")
library(dplyr)
cov <- sam %>%
group_by(V1) %>%
summarize(cov = mean(V3))
write.table(cov,"canu.full/medaka/sam.avg.dep.txt",sep = "\t",
col.names=F,row.names = F,quote = F)
run_MaxBin.pl -contig consensus.fasta -abund sam.avg.dep.txt -thread 8 -out maxbin2.2/LHAD.medaka
checkm lineage_wf -f CheckM.2.txt -t 8 -x fasta maxbin2.2/ maxbin2.2/SCG
The resulting bins appear similar to before, which implies that bowtie2 was not the problem.
bin2.max.bact <- read.csv("canu.full/medaka/CheckM.2.csv")
bin2.max.bact <- bin2.max.bact[c(1,2,12,13)]
library(knitr)
kable(bin2.max.bact)
| Bin.Id | Marker.lineage | Completeness | Contamination |
|---|---|---|---|
| LHAD.medaka.002 | k__Archaea (UID2) | 91.74 | 73.08 |
| LHAD.medaka.001 | c__Clostridia (UID1085) | 90.31 | 8.80 |
| LHAD.medaka.003 | k__Bacteria (UID203) | 70.02 | 31.48 |
| LHAD.medaka.004 | k__Bacteria (UID203) | 60.74 | 62.01 |
Bin 1 ended up quite clean… however, this relies upon marker genes, and I am certain that fragmentation of genes due to errors will lead to missed information, which makes this overall still garbage.
15.8 barrnap
barrnap -t 8 -o LHAD.medaka.RNA.fa consensus.fasta > LHAD.medaka.RNA.gff
16 Archive
17 Subsampling
17.1 Target depth
For starters, we will subsample down to the depth required to confidently assemble a genome that makes up 10% of the community. To get 40x coverage of 10 such genomes, we need roughly:
4 Mb genome x 40x coverage x 10 genomes = 1,600 Mbases, i.e. 1.6 Gbases
17.2 seqTK subsampling
As long as the reads are provided in fastq format, this can be executed with the seqtk package as follows (the read count can be derived as sketched after the command):
/path/to/seqtk sample -s100 \
lenoir.DCA.fastq \
(insert.read.number.here) > subsampled.fastq/lenoir.DCA.READCOUNT.fastq
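To fill in the read count above, a minimal sketch dividing the target base count by the mean read length from MinIONQC; the mean length used here is a placeholder assumption, not the measured value:
# number of reads needed to hit the target sampling depth
target.bases <- 1.6e9        # 4 Mb x 40x coverage x 10 genomes, as computed above
mean.read.length <- 8000     # placeholder; take the real value from the MinIONQC output
reads.needed <- ceiling(target.bases / mean.read.length)
reads.needed                 # substitute for (insert.read.number.here)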