Overview

This document describes the metagenomic analysis pipeline used in the paper PAPER_NAME_LINK. Steps included range start with quality control and end with predicting open read frames are included.

Metagenome fastq files can be found under NCBI BioProject PRJNA693524.

Evaluate metagenome quality

Use FastQC to evaluate the quality of the raw metagenome fastq files.

# FastQC v0.11.8
/path/to/fastqc /path/to/R1.fastq.gz /path/to/R2.fastq.gz -o /path/to/fastqc_output_raw

Remove sequencing adapters and poor quality reads

Use Trimmomatic.

For adapters, I used the Trimmomatic TruSeq3 sequences and added one additional sequence that was overrepresented based on the FastQC outputs. Reverse read files for all samples (except W56) had the overrpresented sequence (string of G’s).

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PrefixPE/2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PE1_rc
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
>PE2
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>PE2_rc
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
>U_R2/2
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I ran Trimmomatic using the above adapter files and used Trimmomatic’s recommended parameters.

# Trimmomatic v0.38
java -jar /path/to/trimmomatic-0.38.jar PE -threads 4 -trim.log /path/to/trimmomatic.log.txt /path/to/R1.fastq.gz /path/to/R2/fastqc.gz -baseout /path/to/trimmomatic/output/trimmed.fastq.gz ILLUMINACLIP:/path/to/adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Evaluate quality-trimmed metagenome

Evaluate Trimmomatic output with FastQC.

# FastQC v0.11.8
/path/to/fastqc /path/to/trimmed_1P.fastq.gz /path/to/trimmed_2P.fastq.gz /path/to/trimmed_1U.fastq.gz /path/to/trimmed_2U.fastq.gz -o /path/to/fastqc_output_trimmomatic
/path/to/R1.fastq.gz /path/to/R2.fastq.gz -o /path/to/fastqc_output_trimmomatic

Remove host DNA contamination

I followed this tutorial by Matthias Scholz to remove host sequences using bowtie2 and samtools.

### Software Info ###
# Bowtie2 v2.2.2
# Samtools v0.1.19-44428cd


### Prepare host genome ###

# Download host genome
curl -O ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/236/235/GCA_000236235.1_SpeTri2.0/GCA_000236235.1_SpeTri2.0_assembly_structure/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fna.gz

# Open downloaded file
gunzip unplaced.scaf.fna.gz

# Rename file
mv unplaced.scaf.fna 13liner.fna

# Build Host Database
bowtie2-build --seed 1 -p 10 13liner.fna 13liner.db


### Work with trimmomatic outputs ###

# Convert Trimmomatic output to SAM
bowtie2 --very-sensitive-local -p 4 --seed 1 -x /path/to/13liner.db -1 /path/to/trimmed_1P.fastq.gz -2 /path/to/trimmed_2P.fastq.gz -S /path/to/nohost.sam

# Convert SAM to BAM
samtools view -@ 4 -bS /path/to/nohost.sam > /path/to/nohost.bam

# Extract unmapped sequences (non-host sequences)
/path/to/samtools view -@ 4 -b -f 12 -F 256 /path/to/nohost.bam > /path/to/nohost.unmap.bam

# Sort BAM file to organized paired reads
/path/to/samtools sort -n /path/to/nohost.unmap.bam nohost.unmap.sorted.bam

# Convert BAM to fastq
/path/to/samtools bam2fq /path/to/nohost.unmap.sorted.bam > /path/to/nohost.fastq

Evaluate metagenome quality after host removal

Use FastQC to evaluate the quality of the metagenome after host DNA removal.

# FastQC v0.11.8
/path/to/fastqc /path/to/nohost.fastq -o /path/to/fastqc_output_nohost

Assemble metagenome

Use metaSPAdes to assemble metagenome.

# SPAdes version 3.13.0
/path/to/SPAdes-3.13.0-Linux/bin/metaspades.py --pe1-12 /path/to/nohost.fastq -k 55,75,95 -t 20 -m 160 -o /path/to/metaSPAdes_output

Evaluate assembly quality with metaQUAST

Use MetaQUAST to evaluate the quality of the assembled metagenome.

# Quast v5.0.2
/home/GLBRCORG/echiang3/metag/quast-5.0.2/metaquast.py -m 1 -t 20 /path/to/metaSPAdes_output/contigs.fasta -o /path/to/fastqc_metaSPAdes --max-ref-number 0

Evaluate assemblies with bowtie2

Use bowtie2 and samtools.

### Software Info ###
# Bowtie2 v2.2.2
# Samtools v0.1.19-44428cd


### Prepare files ###

# Deinterleave metaSPAdes input file into forward and reverse read
  # deinterleave_fastq.sh is by Nathan Watson-Haigh: https://gist.github.com/nathanhaigh/3521724
/path/to/deinterleave_fastq.sh < /path/to/nohost.fastq /path/to/nohost.R1.fastq /path/to/nohost.R2.fastq

# Convert metaSPAdes assembly into bowtie2 database
bowtie2-build --seed 1 /path/to/metaSPAdes_output/contigs.fasta /path/to/metaSPAdes.bowtie.database


### Map input reads to assembly ###
bowtie2 --sensitive-local -p 10 --seed 1 -x /path/to/metaSPAdes.bowtie.database -1 /path/to/nohost.R1.fastq -2 /path/to/nohost.R2.fastq -S /path/to/metaSPAdes.sam

Predict open reading frames

Use prodigal to predict open reading frames.

# Prodigal v2.6.3
/path/to/prodigal -a /path/to/output.faa -d /path/to/output.ffn -i /path/to/metaSPAdes_output/contigs.fasta -o /path/to/prodigal.txt -p meta -s /path/to/genes.txt