Quality Assessment and Quality Control

Summary

Quality control of shotgun metagenomic sequencing data can be challenging. Here, we provide a short guide on the basic steps towards achieving high-quality reads, starting from raw sequenced reads, for reliable downstream metagenomic analyses.

The HPC environment and availability of tools to do the job

Analysis of genomic sequencing data is a computing-intensive undertaking which must be performed on High Performance Computing (HPC) clusters. All students and staff must take the necessary steps to arrange a personal account on the deep-thought HPC through the Flinders University IDS department. Various WGS pipelines are installed in the form of modules on deep-thought. To check whether a module is installed, use the command module avail keyword or module spider keyword. It is advisable to run all your jobs by submitting them to the computing nodes using the slurm workload management scripts provided by your administrator. A sample slurm.sub script should be structured like this:
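
A minimal sketch is shown below; the job name, resource requests, and module are placeholders to be adapted from the IDS template:

#!/bin/bash
#SBATCH --job-name=qc_job           ## placeholder job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24          ## match the thread count you pass to the tool
#SBATCH --mem=48G                   ## placeholder; size to your data
#SBATCH --time=24:00:00             ## placeholder wall-time

## add all modules required for the job
module add fastQC/0.11.5

## executable command at the bottom, followed by an empty line
fastqc -o fastqc_out -f fastq -t 24 /paths_to/*.fastq.gz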

Template slurm scripts are provided by Flinders IDS. Edit all fields following the equals (“=”) sign and add all modules that are required for the job using module add TOOLXX. The executable command is then added to the bottom of the script with an extra empty line. Sometimes, as is the case for trimmomatic, a FULL PATH to the executable file is required. These files are located at /cm/shared/apps/TOOL/... where TOOL is the module in question; e.g. /cm/shared/apps/trimmomatic/0.39/trimmomatic-0.39.jar. For other tools, it may be helpful to explore the /cm/shared/apps/TOOL/bin folder. Once ready, slurm scripts should be submitted to the computing cluster’s queue using the command sbatch slurm.sub. The progress of each task on the workload management system is monitored using squeue -j jobID; a unique jobID is generated for each job submitted. For batch submissions, explore slurm arrays at https://slurm.schedmd.com/job_array.html or our custom script multisub_task.pl, which creates multiple slurm tasks for all input files in a specified directory. The multisub scripts are located in the commands directory within the shared sahmri_mbiome r_drive.
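
For illustration, a hedged sketch of a slurm-array submission (array size, paths, and module are placeholders; see the slurm documentation linked above):

#!/bin/bash
#SBATCH --job-name=qc_array
#SBATCH --array=0-9                 ## one task per input file; size to your file count
#SBATCH --cpus-per-task=24

module add fastQC/0.11.5

## each array task picks one file from the input directory
FILES=(/paths_to/*.fastq.gz)
fastqc -o fastqc_out -f fastq -t 24 "${FILES[$SLURM_ARRAY_TASK_ID]}"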

QC on raw reads from Illumina Sequencing

The reads we receive from the sequencing core are often not quality-controlled. QC of whole-genome sequencing data from microbial communities is especially challenging. The process involves identifying and filtering sequencing artifacts, such as adapters and low-quality reads, that could affect the downstream analysis and lead to erroneous or misleading results. Different samples require different processing approaches and may pose varying complexities depending on the origin/sampled site, quality of samples taken, and sequencing depth. A plethora of tools have been developed to perform the various steps of QC.

The basic pipeline involves the following steps:

  • Assessing the quality of raw reads (FASTQ)
  • Trimming low-quality bases and adapter sequences (FASTQ)
  • Assessing the quality of trimmed reads (FASTQ)
  • Removing any host (human) sequences; not necessary for stool samples (FASTA)
  • Assessing the quality of cleaned reads
  • Interleaving the trimmed reads into a single file (FASTQ to FASTA)

Assessing the quality of raw reads

This process requires FASTQC https://www.bioinformatics.babraham.ac.uk/projects/fastqc/, which can be loaded on the deep-thought server as module load fastQC/0.11.5 and doesn’t require any other modules to run. After loading the module, the “-h” option is the best way to get started with the basic input and output options; try fastqc -h. However, for simplicity, it is always recommended to use the default options. Therefore, in your sample script, the command will be:

fastqc -o output_dir -f fastq -c contaminant_file -t 24 /paths_to/*.fastq.gz

The contaminants file (a FASTA file of adapters or other index sequences used prior to and during sequencing) is not compulsory, but is important to include if available.

It is recommended to run this module in a slurm sub-script when using N computing threads (-t N; in our case, -t 24).

Once completed, FASTQC will generate one HTML report for each sequenced-reads file supplied as input. For paired-end sequenced data, there will be two reports for each sample, matching the two input files.

Now let’s interrogate the FASTQC output files and see how they can steer subsequent QC of the sequencing reads.

For the initial metagenomic sequencing data quality check, pay much attention to Per base sequence quality, Per tile sequence quality, Per base sequence content, Adapter Content, Per sequence GC content, and Sequence Length Distribution.

Please refer to: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/ for details on the FASTQC output graphs and their interpretation. NB: I have copied the most relevant information here for the purpose of continuity and readability of this tutorial :)

Per Base Sequence Quality

This plot shows an overview of the range of quality values across all bases at each position in the FastQ file.

For each position a Box-whisker type plot is drawn. The elements of the plot are as follows:

  • The central red line is the median value
  • The yellow box represents the inter-quartile range (25-75%)
  • The upper and lower whiskers represent the 10% and 90% points
  • The blue line represents the mean quality

The y-axis on the graph shows the phred quality scores. The higher the score the better the base call. The background of the graph divides the y axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read.

For the purpose of QC filtering of low-quality reads, this plot gives a glimpse of how much data one can expect to lose at each phred cut-off. Typically, quality scores above the red zone are acceptable; of course, the green zone gives the best results for downstream processing when the data allows.

Per tile sequence quality

This graph will only appear in your analysis results if you’re using an Illumina library which retains its original sequence identifiers. Encoded in these is the flowcell tile from which each read came. The graph allows you to look at the quality scores from each tile across all of your bases to see if there was a loss in quality associated with only one part of the flowcell.

The plot shows the deviation from the average quality for each tile. The colors are on a cold to hot scale, with cold colors marking positions where the quality was at or above the average for that base in the run, and hotter colors indicating that a tile had worse quality than other tiles for that base. A plot in which certain tiles show consistently poor quality points to a localized problem on the flowcell; a good plot should be blue all over.

Per Base Sequence Content

Per Base Sequence Content plots out the proportion of each base position in a file for which each of the four normal DNA bases has been called.

In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.

It’s worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5’ end of the reads. Whilst this is a true technical bias, it isn’t something which can be corrected by trimming and in most cases doesn’t seem to adversely affect the downstream analysis.

Per sequence GC content

The GC content across the whole length of each sequence in a file is measured and compared to a modeled normal distribution of GC content.

This check is usually not very strict for metagenomic sequencing, but it is useful as an indication of the GC variation within your sampled niche. In a normal random library you would expect to see a roughly normal distribution of GC content, where the central peak corresponds to the overall GC content of the underlying genome. Since we don’t know the GC content of the genome, the modal GC content is calculated from the observed data and used to build a reference distribution.

An unusually shaped distribution could indicate a contaminated library or some other kinds of biased subset. A normal distribution which is shifted indicates some systematic bias which is independent of base position. If there is a systematic bias which creates a shifted normal distribution then this won’t be flagged as an error by the module since it doesn’t know what your genome’s GC content should be.

Sequence Length Distribution

In many cases this will produce a simple graph showing a peak only at one size, but for variable length FastQ files this will show the relative amounts of each different size of sequence fragment.

This module will raise an error if any of the sequences have zero length. Ideally, all raw unprocessed reads should have equal length, as you requested from the sequencing core. Deviations are, however, expected after QC trimming.

Adapter content

One obvious class of sequences which you might want to remove are adapter sequences. It is useful to know if your library contains a significant amount of adapter in order to assess whether you need to adapter-trim or not. This graph lists the adapter sequences searched for and shows the cumulative proportion of your reads in which each adapter is seen at each base position.

FASTQC reports across many samples can be aggregated into a single report using MultiQC, loaded as module load MultiQC/1.4.

Run MultiQC as follows:

multiqc -o multiqc_outdir dir_with_fastqc_reports

Trimming of low quality and adapter reads

Having identified the artifacts that should be filtered from the raw reads, the next step is to get rid of them.

For this we will need TRIMMOMATIC (Bolger, Lohse, and Usadel 2014) which can be loaded in deep-thought as:

module load trimmomatic/0.39

Supply the following command for Paired End reads:

java -jar /cm/shared/apps/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
  input_forward.fq.gz input_reverse.fq.gz \
  output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
  output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
  ILLUMINACLIP:/cm/shared/apps/trimmomatic/0.39/adapters/TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:25 MINLEN:36

This will perform the following:

  • Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
  • Remove leading low quality or N bases (below quality 3) (LEADING:3)
  • Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
  • Scan the reads with a 4-base-wide sliding window, trimming whenever the average quality per base drops below 25 (SLIDINGWINDOW:4:25)
  • Drop reads shorter than 36 bases from the final output of cleaned reads (MINLEN:36)

Supply the following command for Single End reads:

java -jar trimmomatic-0.39.jar SE -phred33 input.fq.gz output.fq.gz ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:25 MINLEN:36

This will perform the same steps as for PE, using the single-end adapter file. Remember to provide the adapter file for your run.

Trimmomatic works with compressed FASTQ files with the extension “.gz”.

Four files are produced for each forward-reverse (F-R) PE read pair you analyse. These are the paired forward, paired reverse, unpaired forward, and unpaired reverse reads. In case you don’t require any unpaired reads, e.g. for 16S reads to be used in Qiime or for single-isolate WGS data, then include /dev/null in place of output_forward_unpaired.fq.gz and output_reverse_unpaired.fq.gz

In this case, your command should be:

java -jar trimmomatic-0.39.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz /dev/null output_reverse_paired.fq.gz /dev/null ILLUMINACLIP:/r_drive/genomicsdb/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:25 MINLEN:36

Remember, most of the trimming along the entire read is accomplished with the option SLIDINGWINDOW:4:25. Without it, chances are your reads will come out the same way they went in, with just the adapters and the 5' and 3' ends of your reads cleaned.

At this point, there are a few tricks one could employ to ensure efficient trimming.

  • Refer to the “Per Base Sequence Quality” section of your MultiQC report. At what quality/phred score (y-axis) will you maximize accuracy while minimizing loss of reads? There is no gold-standard rule on this one; let your scientific judgement be guided by the overall quality of your sequenced data. Use this value (N) in the SLIDINGWINDOW:window-size:N parameter. Normally, I prefer a small window size of about 4 bases.
  • Refer to the “Per Base Sequence Content” section of your MultiQC report. This will give you an indication of how many bases to trim at the beginning and the end of your reads. It is general practice to remove all discordant base pairs (positions where the %T/A/G/C trend-lines don’t run in parallel; that value is 19 bp for the example image used in this tutorial). Use the value at the start (X) for HEADCROP:X, and set the minimum quality thresholds required to keep a base with the LEADING:3 TRAILING:3 parameters. See the example command after this list.
  • Provide a FASTA file of adapter and other Illumina-specific sequences to ILLUMINACLIP:your-contaminants.fa:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>. All universal Illumina-specific sequences (TruSeq and Nextera) are in a file located at /r_drive/genomicsdb/TruSeq3-PE.fa. You can supply your own file of run-specific adapters in addition to the universal ones.
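
Putting these tips together, an illustrative command might look as follows; HEADCROP:19 and the SLIDINGWINDOW threshold are example values to be replaced with those read off your own MultiQC report:

java -jar /cm/shared/apps/trimmomatic/0.39/trimmomatic-0.39.jar PE -phred33 \
  input_forward.fq.gz input_reverse.fq.gz \
  output_forward_paired.fq.gz output_forward_unpaired.fq.gz \
  output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz \
  ILLUMINACLIP:/r_drive/genomicsdb/TruSeq3-PE.fa:2:30:10 \
  HEADCROP:19 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:25 MINLEN:36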

Assessing the quality of trimmed reads (FASTQ)

It is advisable to perform another run of FASTQC on the quality-trimmed reads to assess how successful the trimming was.

If done correctly, most of the artifacts will now be gone: repetitive sequences and k-mers will be minimized, and the GC content should follow a roughly normal distribution.
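
For example (output directory names are illustrative):

mkdir -p trimmed_fastqc
fastqc -o trimmed_fastqc -f fastq -t 24 /paths_to/*_paired.fq.gz
multiqc -o trimmed_multiqc trimmed_fastqc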

Removing any host (human) sequences

Metagenomes from oropharyngeal, vaginal, or rectal swabs may contain large amounts of host DNA. This DNA is eliminated by various depletion methods during DNA extraction and prior to sequencing. Any remnant host DNA will end up as part of the sequenced reads, and this could skew the analysis, so these sequences should be removed before further quantification of the metagenomes. When dealing with stool samples from healthy adults, this step can be skipped, since the amount of residual host DNA is normally negligible (mostly <2%); some researchers have, however, experienced higher amounts, and caution should be applied. Removal of host DNA is highly recommended for baby poo and stool from unhealthy participants, where higher amounts of host DNA are expected. For unified protocols, BBMap can be used as a standard step for all types of sequenced metagenomic material.

The BBMAP (Bushnell 2014) module for removing reads mapped to a reference can be loaded in deep-thought as

module load bbmap/38.46

bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 threads=4 \
  path=/r_drive/genomicsdb/human_reference_genome_grch38/ qtrim=lr trimq=10 untrim \
  -Xmx48g in=output.fasta outu=output_hclean.fa outm=output_human.fa

The GRCh38 release of the reference human genome is located at /r_drive/genomicsdb/human_reference_genome_grch38/. Please feel free to download and use any other versions you may require.

NB: BBMap is nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping, and which reads are assigned to which thread is nondeterministic. The only ways to avoid this are to restrict it to a single thread (threads=1), or to map the reads as single-ended and then fix pairing afterward (http://seqanswers.com/forums/showthread.php?t=58221). In our example, we’ve interleaved the FASTQ pairs into a single FASTA file before running BBMap.
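
If fully reproducible output is required, a minimal variant of the command above restricted to a single thread (slower, but deterministic):

bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 threads=1 \
  path=/r_drive/genomicsdb/human_reference_genome_grch38/ qtrim=lr trimq=10 untrim \
  -Xmx48g in=output.fasta outu=output_hclean.fa outm=output_human.fa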

Assessing the quality of host-DNA cleaned reads

It is advisable to perform another run of FASTQC on the reads deprived of host DNA to assess how successful the cleaning step was.

After QC, the cleaned reads can be used in various ways to quantify the community microbial composition or functional capabilities.

Downstream analysis

Downstream analysis of metagenomics data is largely dependent on the study objective. In most cases (Sharpton 2014), the following two broad pathways are applicable:

Who is there?

  • Marker Gene analysis (MetaPhlAn2 (Truong et al. 2015) and HUMAnN2 (Franzosa et al. 2018)): i. Taxonomic profiling, ii. Phylogenetic profiling, iii. Functional profiling
  • Binning (Wu, Simmons, and Singer 2016; Albertsen et al. 2013; Alneberg et al. 2014; Lu et al. 2017; Strous et al. 2012) (NOT covered in this tutorial): i. Taxonomic profiling, ii. Phylogenetic profiling, iii. Novel Taxa
  • De novo assembly (Peng et al. 2012): i. Novel Taxa, ii. Genome diversity

What are they doing?

  • Gene Prediction: i. Gene diversity, ii. Novel genes
  • Functional Annotation (Kanehisa et al. 2012; Franzosa et al. 2018): i. Protein family diversity, ii. Functional diversity

Comprehensive workflows for reference-based taxonomic and functional profiling of quality-controlled sequencing reads can be found at:

FUNCTIONAL: https://bitbucket.org/biobakery/humann2/wiki/Home#markdown-header-standard-workflow

TAXONOMIC: https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2

Briefly:

## HUMAnN2 module for functional characterization of metagenomics data
module load python/6.3.0/3.6.1 
pip install humann2 --user --force-reinstall --no-binary :all: --install-option='--replace-dependencies-install'

## Copy the Chocophlan and Uniref databases required from `/r_drive/genomicsdb/` before running the analysis.

humann2 -i input1 --input-format fasta --search-mode uniref90 --evalue 0.0001 --threads 24 -o outdir/${SAMPLE}_hmnN2 --nucleotide-database chocophlan_DB --protein-database uniref90_DB

# Normalize the abundance output files
humann2_renorm_table --input ${SAMPLE}_genefamilies.tsv --output ${SAMPLE}_genefamilies_relab.tsv --units relab

## Join the per-sample tables; the input is a directory
humann2_join_tables --input . --output STUDY_genefamilies.tsv --file_name genefamilies_relab
humann2_join_tables --input . --output STUDY_pathabundance.tsv --file_name pathabundance_relab
humann2_join_tables --input . --output STUDY_pathcoverage.tsv --file_name pathcoverage

### Regrouping genes to other functional categories like KO, OG etc
humann2_regroup_table -i ${SAMPLE}_genefamilies-cpm.tsv -o ${SAMPLE}_uniref90_ec.tsv --groups uniref90_ec ## EC
humann2_regroup_table -i ${SAMPLE}_genefamilies-cpm.tsv -o ${SAMPLE}_uniref90_ko.tsv --groups uniref90_ko ## KO
 
### Associate pathway abundance with metadata
humann2_associate --input try_assoc.pcl --last-metadatum Treatment --focal-metadatum Treatment --focal-type categorical --output stats.txt

## plot
humann2_barplot --sort sum metadata --input hmp_pathabund.pcl --focal-feature METSYN-PWY --focal-metadatum STSite --last-metadatum STSite --output plot2.png

## METAPHLAN2 module for taxonomic profiling of metagenomics data

NB: By default, only relative abundances are reported. Try running the analysis with -t rel_ab_w_read_stats to get raw counts and all possible stats
metaphlan2.py input1 --input_type fasta --nproc 16 -o outdir/${SAMPLE}.out --bowtie2out outdir2/${SAMPLE}.bowtieout --biom outdir3/${SAMPLE}.biom

## merging abundance tables
merge_metaphlan_tables.py *.out > ${STUDY}_merged_metaphlan.tsv

## MetaPhlAn graphics
## Get taxonomic levels from the metaphlan2 output
## To make a list with only species
grep -E "(s__)|(^ID)" ${STUDY}_merged_metaphlan.tsv | grep -v "t__" | sed 's/^.*s__//g' > ${STUDY}_metaphlan_relabund_species.tsv
## To make a list with only genera
grep -E "(g__)|(^ID)" ${STUDY}_merged_metaphlan.tsv | grep -v "t__" | sed 's/^.*g__//g' | grep -v "s__" > ${STUDY}_metaphlan_relabund_genus.tsv

The resulting taxa and function relative abundance tables can be exported and used for downstream statistical analysis.

De novo assembly

Interleaving the trimmed reads (FASTQ to FASTA)

The trimmed FASTQ files are converted into FASTA for further processing. This can be achieved using the fq2fa option of IDBA (Peng et al. 2012), which is loaded in deep-thought as module load idba-long/1.1.3, or, for longer reads (>200 bp), any of the (to-be-)installed modified variants of IDBA:

idba/1.1.3-512kmax, idba/1.1.3-1024kmax, or idba/1.1.3-1600kmax

At this point, all unpaired reads can be concatenated into a single file for later assembly as follows:

cat *_unpaired.fq.gz > study_unpaired.fq.gz

fq2fa always runs in a bash environment on deep-thought, hence the bash -c wrapper in the commands below.

For paired-end reads, run:

bash -c 'fq2fa --merge --filter <(gunzip -c output_forward_paired.fq.gz) <(gunzip -c output_reverse_paired.fq.gz) output.fasta'

and for single-end reads:

bash -c 'fq2fa --filter <(gunzip -c 20_1_1P.fastq.gz) 20_1.fasta' ## SE or unpaired reads

The interleaved fasta reads can be assembled into contiguous scaffolds using IDBA_UD loaded as before.

The basic IDBA (Peng et al. 2012) command can be run as follows:

idba_ud -r sample.fasta -o sample_outdir --num_threads 24 --pre_correction

In a nutshell, each sample sample.fasta is assembled into a folder sample_outdir using 24 threads, and reads are pre-corrected before assembly.

Check for the files begin, end, kmer, log, align-xx, graph-xx.fa, local-contig-xx.fa, contig.fa, and scaffold.fa in the output folder. Of interest are the contig.fa and scaffold.fa files, which contain your final assemblies. In case the program fails to complete the run, use contig-xx.fa, where xx is the maximum number in the directory, for downstream analyses.
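
A quick way to locate that file is the hedged one-liner below, which assumes the default contig-xx.fa naming:

ls sample_outdir/contig-*.fa | sort -V | tail -n 1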

Outside the IDBA output directories, run the following command to concatenate all contigs from all samples into a single file:

cat */contig.fa > all_contigs.fna or awk 1 */contig.fa > all_contigs.fna

Identify genes/CDS

MetaGeneMark (Zhu, Lomsadze, and Borodovsky 2010) is used to predict the genes/coding sequences in the contig files. This module is loaded in deep-thought as module load metagenemark/1.0, and the following command is supplied to generate CDS:

gmhmmp -r -p 0 -m /opt/shared/metagenemark/1.0/bin/MetaGeneMark_v1.mod -o outfile.gff -d -D outfile.fna -f G all_contigs.fna

If you get errors related to the program licence, run cp /opt/shared/metagenemark/3.26/bin/gm_key_64 ~/.gm_key and retry the gmhmmp command.

Remove redundant genes/CDS

To generate the non-redundant data-set of genes for your study, use CDHIT (Li and Godzik 2006) which can be loaded as follows:

module load cdhit/4.8.1
module load gnubison/1.26

And the executable command is:

cd-hit-est -i outfile.fna -o outfile_nr.fna -c 0.95 -G 0 -aS 0.9 -g 1 -r 1 -T 16 -M 16000

NB: CDHIT-EST is usually used for nucleic acid (DNA) sequences whereas CDHIT is used for amino acid (protein) sequences.
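
For reference, a hedged protein-level counterpart (file names illustrative; thresholds mirrored from the nucleotide run):

cd-hit -i outfile.faa -o outfile_nr.faa -c 0.95 -aS 0.9 -g 1 -T 16 -M 16000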

Remove short genes

Short genes that cannot form functional peptides, normally <100 bp or <30 aa, are usually removed from the assemblies as follows.

perl /r_drive/sahmri_mbiome/commands/removeShort.pl 100 outfile_nr.fna > outfile_nr_n100.fna

Change 100 to your preferred minimum length (N) as required.

Here is the removeShort.pl code:

## removeShort.pl
#!/usr/bin/perl
## Remove FASTA records shorter than a given minimum length.
## Usage: perl removeShort.pl MINLEN input.fna > output.fna
use strict;
use warnings;

my $minlen = shift or die "Error: `minlen` parameter not provided\n";
{
  # Read one FASTA record at a time by using ">" as the record separator
  local $/ = ">";
  while (<>) {
    chomp;                               # strip the trailing ">" separator
    next unless /\w/;                    # skip the empty record before the first ">"
    s/>$//gs;
    my @chunk  = split /\n/;             # first line is the header, the rest is sequence
    my $header = shift @chunk;
    my $seqlen = length join "", @chunk;
    print ">$_" if ($seqlen >= $minlen); # keep the record only if long enough
  }
}

After this process, a number of genes will have been removed. The FASTA headers in the resulting file need to be renamed and numbered sequentially for later steps:

awk '/^>/{print ">gene" ++i; next}{print}' < outfile_nr_n100.fna > outfile_nr_n100_awk.fna

Sometimes, this renaming step can be applied before the prediction of CDS/genes (using MetaGeneMark) to rename contigs, e.g. when using contigs for BWA alignment/assembly. This requires skipping the MetaGeneMark step and performing CDHIT-EST and removeShort.pl on the contigs before the AWK command is applied.

awk '/^>/{print ">contig" ++i; next}{print}' < all_contigs.fna > all_contigs_awk.fna

Annotate the non-redundant genes set against a database using BLAST

In this example, we’ll BLAST against an antibiotic resistance genes database: translate the nucleic acid sequences to amino acids and blast against the CARD database to screen for antibiotic resistance genes. These should be run in a sub-script, especially the blast command, which can take several days to run.

module load gnubison/1.26
module load emboss/6.6.0
transeq -sequence outfile_nr_n100_awk.fna -outseq outfile_nr_n100_awk.faa

The blast command is:

module load blast/2.9.0

blastp -task blastp-fast -query outfile_nr_n100_awk.faa -db /home/user/card_prot_DB/card_prot_DB.fa -out /home/user/file.blastp.out -outfmt "6 sseqid length qseqid pident mismatch gapopen qstart qend sstart send evalue bitscore" -evalue 1e-10 -num_threads 20 -qcov_hsp_perc 99 -max_hsps 1 -max_target_seqs 1

Calculating abundance of your non-redundant genes set in each sample

This task could be achieved in the following processing order:

  • Building a soap index file of the non-redundant genes set
  • Aligning interleaved sample fasta files (BBMap or fq2fa) to SoapIndex
  • Calculating coverage of each sample aligned
  • Joining coverage files into a matrix

SoapAligner Index (SOAP NOT Installed)

The conda version of SoapAligner can be installed in the $HOMEDIR as follows:

1. module add miniconda/3.0
2. conda create -y -n soap_env
3. source activate soap_env
4. conda init bash
5. ctrl + d
6. log back in
7. conda activate soap_env
8. conda install -c bioconda soapaligner
9. y
## your install is complete. Test the programs
10. 2bwt-builder --help  ## build the index of your non-redundant genes set that you processed through genemarks and cdhit..
11. soap --help  ## to run the actual alignment of your cleaned FASTA reads (paired bbmap or paired fq2fa.. right before IDBA) onto the indexed NR genes set.
12. conda deactivate
13. conda deactivate ## only if you are still inside conda env '(base)'

2bwt-builder outfile_nr_n100_awk.fna ## NB: use nucleic acid sequences, not amino acids

SoapAligner: align the interleaved FASTA files to the non-redundant reference produced by MetaGeneMark and CD-HIT:

soap -D outfile_nr_n100_awk.fna.index -a input.fna -o sample.out -2 unpaired-hits.out -p 30 -u unmapped.fa -M 4 -r 1 -l 30 -v 5

Calculating coverage of each sample aligned

NB: copy the folder /r_drive/sahmri_mbiome/soap_ref to the $HOMEDIR: cp -r /r_drive/sahmri_mbiome/soap_ref ~/

~/soap_ref/2.7.7/soap.coverage -cvg -i sample.out -refsingle outfile_nr_n100_awk.fna -p 6 -o sample.out.cov -plot plot.txt 0 2000

Join all coverage output files

Clean coverage files

Select only the rows starting with "gene":

for i in `ls *.out.cov|cut -f 1 -d \.`; do grep "^gene" "$i".out.cov > "$i".grep.out; done

Create a tab delimited file (File.info) containing sample name and corresponding cleaned coverage file-name

for i in `ls *.grep.out|cut -f 1 -d \.`; do echo -e "$i.grep.out \t $i"; done > File.info

Merge cleaned coverage files

perl /r_drive/sahmri_mbiome/commands/merge_soapcov.pl File.info

here is the merge_soapcov.pl code:

#merge_soapcov.pl
#!/usr/bin/perl -w
## Merge per-sample soap.coverage outputs (cleaned with grep) into a single
## gene-by-sample depth matrix written to Merged_Data.txt.

my %hash=();      # depth per sample per gene
my %id=();
my @id_array=();  # ordered sample IDs (column order in the output)
my $option=0;
my %Tag=();
my %Tagc=();
my %g_id=();      # gene lengths

## Expected coverage lines look like:
#gene55970: 880/1071    Percentage:82.1662  Depth:4
#gene1537538: 0/1011    Percentage:0    Depth:nan
#gene55971: 0/708   Percentage:0    Depth:nan

my $List = $ARGV[0];  # File.info: coverage file name and sample ID, tab separated
open (my $fh,'<', $List) or die "Could not open file '$List' $!";
open(my $out, '>', 'Merged_Data.txt') or die "Could not open file 'Merged_Data.txt' $!";
print $out "gene\tlength\t";

while(my $line = <$fh>){
  chomp ($line);
  my @DATA=split(/\s+/,$line);
  print "Check the File format: Two columns separated by Tab\n\n" if scalar(@DATA)<2;
  exit if scalar(@DATA)<2;
  if ($DATA[1] =~ /^ID$/){
    unshift @id_array,$DATA[1];
  }
  else{
    push @id_array,$DATA[1];
  }
  #print $sample_id,"\t";
  open (FILE,$DATA[0]) or die $!;
  while(<FILE>){
    chomp;
    my @data = split(/\t/,$_);
    if ($data[0] =~ /\.hmm$|\_/){
      $data[2] =~ s/\_\d+//;
      $Tagc{$data[2]} = $data[0];# remove "$data[0]" and replace with "$_" if all description columns are wanted
      $option =1;
      next;
    }
  my ($id,$length) = split(/\:\s+/,$data[0]);
  $length =~ s/\d+\///;
  $data[2] =~ s/Depth://;
  $data[2] =~ s/nan/0/;
  $Tag{$id}=1;
  $g_id{$id}=$length;
  $hash{$DATA[1]}{$id}=$data[2];
  }
}

%Tag = %Tagc if $option==1;
print $out "$_\t" foreach @id_array;
print $out "\n";
shift @id_array if $option==1;

foreach $gene_id (sort keys%Tag){
  if ($Tag{$gene_id} =~ /^1$/){
    print $out $gene_id,"\t",$g_id{$gene_id},"\t";
  }
  else{
    print $out $gene_id,"\t",$g_id{$gene_id},"\t",$Tag{$gene_id},"\t";
  }
  foreach my $line (@id_array){
    if (exists $hash{$line}{$gene_id}){
      print $out $hash{$line}{$gene_id},"\t";
    }
    else{
      print $out "0\t";
    }
  }
  print $out "\n";
}

NB: Despite being a very reliable global aligner for this task, the research community acknowledges that SOAPaligner is becoming outdated and less supported. There are plans to move towards modern aligners such as BWA and STAR to accomplish this step. The challenge at the moment is that most reliable aligners, like BWA, are local aligners and may require advanced knowledge to troubleshoot as well as to utilize the outputs of their analysis. Please check with the group bioinformatician whether these changes have been implemented and how they affect your subsequent analysis.

Happy QC and quantification everyone :)

References


Albertsen, M., P. Hugenholtz, A. Skarshewski, K. L. Nielsen, G. W. Tyson, and P. H. Nielsen. 2013. “Genome Sequences of Rare, Uncultured Bacteria Obtained by Differential Coverage Binning of Multiple Metagenomes.” Journal Article. Nat Biotechnol 31 (6): 533–8. https://doi.org/10.1038/nbt.2579.

Alneberg, J., B. S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U. Z. Ijaz, L. Lahti, N. J. Loman, A. F. Andersson, and C. Quince. 2014. “Binning Metagenomic Contigs by Coverage and Composition.” Journal Article. Nat Methods 11 (11): 1144–6. https://doi.org/10.1038/nmeth.3103.

Bolger, A. M., M. Lohse, and B. Usadel. 2014. “Trimmomatic: A Flexible Trimmer for Illumina Sequence Data.” Journal Article. Bioinformatics 30 (15): 2114–20. https://doi.org/10.1093/bioinformatics/btu170.

Bushnell, Brian. 2014. BBMap: A Fast, Accurate, Splice-Aware Aligner. Book. Conference: 9th Annual Genomics of Energy & Environment Meeting, Walnut Creek, CA, March 17-20, 2014; Lawrence Berkeley National Lab. (LBNL), Berkeley, CA (United States). https://www.osti.gov/servlets/purl/1241166.

Franzosa, E. A., L. J. McIver, G. Rahnavard, L. R. Thompson, M. Schirmer, G. Weingart, K. S. Lipson, et al. 2018. “Species-Level Functional Profiling of Metagenomes and Metatranscriptomes.” Journal Article. Nat Methods 15 (11): 962–68. https://doi.org/10.1038/s41592-018-0176-y.

Kanehisa, M., S. Goto, Y. Sato, M. Furumichi, and M. Tanabe. 2012. “KEGG for Integration and Interpretation of Large-Scale Molecular Data Sets.” Journal Article. Nucleic Acids Res 40 (Database issue): D109–14. https://doi.org/10.1093/nar/gkr988.

Li, W., and A. Godzik. 2006. “Cd-Hit: A Fast Program for Clustering and Comparing Large Sets of Protein or Nucleotide Sequences.” Journal Article. Bioinformatics 22 (13): 1658–9. https://doi.org/10.1093/bioinformatics/btl158.

Lu, Y. Y., T. Chen, J. A. Fuhrman, and F. Sun. 2017. “COCACOLA: Binning Metagenomic Contigs Using Sequence Composition, Read Coverage, Co-Alignment and Paired-End Read Linkage.” Journal Article. Bioinformatics 33 (6): 791–98. https://doi.org/10.1093/bioinformatics/btw290.

Peng, Y., H. C. Leung, S. M. Yiu, and F. Y. Chin. 2012. “IDBA-UD: A de Novo Assembler for Single-Cell and Metagenomic Sequencing Data with Highly Uneven Depth.” Journal Article. Bioinformatics 28 (11): 1420–8. https://doi.org/10.1093/bioinformatics/bts174.

Sharpton, Thomas J. 2014. “An Introduction to the Analysis of Shotgun Metagenomic Data.” Frontiers in Plant Science 5: 209. https://doi.org/10.3389/fpls.2014.00209.

Strous, M., B. Kraft, R. Bisdorf, and H. E. Tegetmeyer. 2012. “The Binning of Metagenomic Contigs for Microbial Physiology of Mixed Cultures.” Journal Article. Front Microbiol 3: 410. https://doi.org/10.3389/fmicb.2012.00410.

Truong, D. T., E. A. Franzosa, T. L. Tickle, M. Scholz, G. Weingart, E. Pasolli, A. Tett, C. Huttenhower, and N. Segata. 2015. “MetaPhlAn2 for Enhanced Metagenomic Taxonomic Profiling.” Journal Article. Nat Methods 12 (10): 902–3. https://doi.org/10.1038/nmeth.3589.

Wu, Y. W., B. A. Simmons, and S. W. Singer. 2016. “MaxBin 2.0: An Automated Binning Algorithm to Recover Genomes from Multiple Metagenomic Datasets.” Journal Article. Bioinformatics 32 (4): 605–7. https://doi.org/10.1093/bioinformatics/btv638.

Zhu, W., A. Lomsadze, and M. Borodovsky. 2010. “Ab Initio Gene Identification in Metagenomic Sequences.” Journal Article. Nucleic Acids Res 38 (12): e132. https://doi.org/10.1093/nar/gkq275.