Report - TB in Southern Africa
Author: Dr Eduan Wilkinson (Ph.D.)
Date: 21st of February 2019
This document describes the basic workflow for processing next-generation Illumina sequence data from M. tuberculosis (TB) samples collected in southern Africa. Samples came from three main sources: (i) the Africa Health Research Institute (AHRI), (ii) the Centre for the AIDS Programme of Research in South Africa (CAPRISA), and (iii) samples from Thato Iketleng of Botswana.
| Sample source | Number |
|---|---|
| AHRI | 9 |
| CAPRISA | 42 |
| Thato | 25 |
The following commands were used to process the raw reads into assembled genomes, to index BAM files, to call variants and to hard-filter these variants. For the purpose of this study the H37Rv reference was used for read mapping. The processing was performed on the KRISP2 server under /data/eduan/TB/. PLEASE NOTE: the $ sign SHOULD NOT be entered with the rest of the command, nor should the backslashes (\), which only indicate that one long command has been wrapped over several lines (see the short example below). All file paths in the commands below are absolute and should work for anyone working on the KRISP2 server.
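For example, a command printed in this document as

$ some_command --input reads.fastq \
--output results.txt

should be typed on a single line as: some_command --input reads.fastq --output results.txt (some_command is a made-up name used purely for illustration).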
To run these commands locally or on another server, the following software needs to be installed (a quick way to check that the tools are available is sketched after this list):
parallel
FastQC
MultiQC
Trimmomatic
bowtie2
samtools
bcftools
GATK v3
GATK v4
picard
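A minimal sanity check for the command-line tools, assuming they have been installed on your PATH (adjust the names to match your installation):

$ for tool in parallel fastqc multiqc bowtie2 samtools bcftools; do command -v $tool || echo "$tool missing"; done

Trimmomatic, picard and the two GATK versions are invoked as jar files by absolute path in the commands below, so for those simply check that the jar files exist.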
STEP1 - Unzip fastq files
In this step we are decompressing the sequence reads prior to processing.
$ gunzip *.gz
NOTE - if fastq files are already uncompressed (files ending in *.fastq) then you can skip this step.
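If you would rather keep the compressed originals as a backup, recent versions of gzip support a -k (keep) flag:

$ gunzip -k *.gz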
STEP2 - Run FastQC on unprocessed sequence reads
In this step we first create a folder called Pre.FastQC. In the second part we execute FastQC and ask the program to write output (-o) to the folder we created in the previous step.
$ mkdir Pre.FastQC
$ /data/biotools/linux/FastQC/fastqc -o Pre.FastQC/ -f fastq -c *.fastq
NOTES - if you get a warning message saying that option -c is ambiguous, don't worry, it will still run.
STEP3 - Remove reads with low base quality scores and trim residual adaptor sequences
In this step we scan all the reads and remove potential residual adaptor sequences. In short, adaptor sequences are small barcode sequences that are attached to sequence fragments during library prep. These adaptors allow binding to the flow cell during the sequencing run. Most sequencers will remove adaptor sequences, but in some cases residual adaptor sequences remain. As these are artificially added and have no biological meaning for the organism being studied, they should be removed. We also remove reads with low-quality base scores and very short reads. The tool can also be used to remove the first and last n bases of reads (a sketch of this follows the notes below). This is a helpful extension if the FastQC analysis shows a big drop in quality scores at the ends of reads.
$ parallel 'java -jar /data/biotools/linux/Trimmomatic-0.33/trimmomatic-0.33.jar PE \
{}1_001.fastq {}2_001.fastq {}_pair_F.fastq {}_unpair_F.fastq {}_pair_R.fastq {}_unpair_R.fastq \
ILLUMINACLIP:/data/biotools/linux/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 \
SLIDINGWINDOW:4:25 MINLEN:20' ::: $(ls *_*.fastq | rev | cut -c 12- | rev | uniq)
NOTES - Nextera adaptors were used in the sequencing run. We are using a sliding window of 4:25, which means the tool scans each read with a window size of 4 bp and cuts the read where the average quality within the window drops below 25. The MINLEN setting restricts reads to a minimum length of 20; reads shorter than 20 bp after trimming are dropped. The reason for this is that you don't want very short reads, as these can potentially align to multiple regions. There is no hard-and-fast rule for these specifications. If sequence quality is poor it might be necessary to change these parameters.
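A minimal single-sample sketch of the head/tail trimming mentioned above, for a hypothetical sample named sample01 and with purely illustrative values (drop the first 10 bases with HEADCROP, cap read length at 140 with CROP):

$ java -jar /data/biotools/linux/Trimmomatic-0.33/trimmomatic-0.33.jar PE \
sample01_R1_001.fastq sample01_R2_001.fastq \
sample01_R_pair_F.fastq sample01_R_unpair_F.fastq sample01_R_pair_R.fastq sample01_R_unpair_R.fastq \
ILLUMINACLIP:/data/biotools/linux/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 \
HEADCROP:10 CROP:140 SLIDINGWINDOW:4:25 MINLEN:20

HEADCROP removes the first n bases of each read, and CROP trims each read down to a maximum length by cutting bases off the end.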
STEP4 - Now we run FastQC on the trimmed reads
$ mkdir raw.reads
$ mv *_001.fastq raw.reads/
$ zip -r raw.reads.zip raw.reads -x "*DS_Store"
$ rm -r raw.reads/
$ mkdir Post.FastQC
$ /data/biotools/linux/FastQC/fastqc -o Post.FastQC/ -f fastq -c *.fastq
NOTES - In this step the commands first create a directory called raw.reads and move all of the raw (unprocessed) sequence reads into it. We then compress this directory with the zip command and remove the uncompressed folder. Finally we make a folder called Post.FastQC and run FastQC on the trimmed reads from step #3, writing the results to that folder.
STEP5 - Now we run MultiQC on the two FastQC folders
$ multiqc Pre.FastQC/
$ multiqc Post.FastQC/
NOTES - In this step we run MultiQC on the output from FastQC. In steps #2 and #4 we wrote the output of FastQC to two directories called Pre.FastQC and Post.FastQC. Running MultiQC is as simple as typing multiqc (the command is in the PATH) and specifying the folder containing the files that need to be aggregated. A combined report is sketched below.
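If you prefer a single combined report, MultiQC can also be pointed at both folders at once; the -n option simply names the output report (a sketch, assuming a reasonably recent MultiQC):

$ multiqc Pre.FastQC/ Post.FastQC/ -n pre_vs_post_trimming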
STEP6 - Bin the unpaired reads
$ mkdir unpaired.reads
$ mv *unpair* unpaired.reads
$ zip -r unpaired.reads.zip unpaired.reads -x "*DS_Store"
$ rm -r unpaired.reads/
NOTES - In this step we first create a directory called unpaired.reads and then move all unpaired reads into the new directory. Next we archive this folder with the zip command and remove the original folder.
STEP7 - Now let's do our alignment, which will map the reads to our reference
This is the reference (H37Rv) against which we will align our reads. First we need to build the reference index with bowtie2-build. Once we have a built index we can do the alignment.
$ bowtie2-build TB_H37Rv.fasta TB_H37Rv.fasta
$ parallel 'bowtie2 -x TB_H37Rv.fasta -1 {}_pair_F.fastq -2 {}_pair_R.fastq \
-I 0 -X 1200 --un-conc {}_unaln_fastq --rg-id "ID_{}" --rg "LB:LB_{}" --rg "PL:ILLUMINA" \
--rg "SM:SM_{}" -S {}.sam' ::: $(ls *_pair_*.fastq | rev | cut -c 14- | rev | uniq)
NOTES - Parameters: -x specifies the reference index built in the previous command; -1 and -2 specify the forward and reverse read files; -I and -X set the minimum (0) and maximum (1200) fragment length for a valid paired alignment; --un-conc writes read pairs that fail to align concordantly to separate files; --rg-id and --rg add read-group information (ID, library, platform and sample) to the SAM header, which GATK requires downstream; -S names the output SAM file. A single-sample example is sketched below.
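To make the parallel substitution concrete, for a hypothetical sample named sample01_R the command above expands to:

$ bowtie2 -x TB_H37Rv.fasta -1 sample01_R_pair_F.fastq -2 sample01_R_pair_R.fastq \
-I 0 -X 1200 --un-conc sample01_R_unaln_fastq --rg-id "ID_sample01_R" --rg "LB:LB_sample01_R" \
--rg "PL:ILLUMINA" --rg "SM:SM_sample01_R" -S sample01_R.sam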
STEP8 - Now let's bin our paired reads
$ mkdir paired.reads
$ mv *pair* paired.reads
$ zip -r paired.reads.zip paired.reads -x "*DS_Store"
$ rm -r paired.reads/
NOTES - In this step we make a new folder called paired.reads and then move all paired reads into the new folder. We then compress this folder with the zip command and then remove the original folder leaving the archived/compressed copy.
STEP9 - Convert the output SAM file from the alignment step to a BAM file
$ parallel 'samtools view -bS {}.sam | samtools sort -o {}_sorted.bam -O BAM' \
::: $(ls *.sam | rev | cut -c 5- | rev | uniq)
NOTES - In this step we convert each SAM file to a sorted BAM file. BAM files are binary and compressed, so they are much more manageable to work with than SAM files, which can be really BIG. A quick check on the result is sketched below.
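A quick, hedged sanity check on the conversion, for a hypothetical sample named sample01_R:

$ samtools flagstat sample01_R_sorted.bam

flagstat prints counts of total, mapped and properly paired reads, which is a useful first check on the alignment.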
STEP10 - Bin the SAM files and compress
$ mkdir sam.files
$ mv *.sam sam.files/
$ zip -r sam.files.zip sam.files -x "*DS_Store"
$ rm -r sam.files/
NOTES - In this step we make a new folder called sam.files and move all the SAM files into the new directory. Next we archive with zip and remove the original folder.
STEP11 - Now let's mark the duplicate reads in our BAM files
$ parallel 'java -jar /data/biotools/linux/picard.jar MarkDuplicates \
I={}.bam O={}_nodups.bam M={}_dup_metrics.txt' \
::: $(ls *.bam | rev | cut -c 5- | rev | uniq)
$ rm *_sorted.bam
NOTES - Picard MarkDuplicates flags reads that are likely PCR or optical duplicates so that the variant caller does not count the same DNA fragment twice; the output files carry the _nodups suffix used in all later steps. The read-group information that GATK needs was already added with the --rg options in step #7. Once the marked copies exist, the original sorted BAMs are removed. A quick way to inspect the flagged duplicates is sketched below.
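A hedged way to count how many reads were flagged as duplicates, for a hypothetical sample named sample01_R (1024, i.e. 0x400, is the SAM duplicate flag):

$ samtools view -c -f 1024 sample01_R_sorted_nodups.bam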
STEP12 - Now let's index our BAM files
$ parallel 'samtools index {}' ::: *_nodups.bam
STEP13 - Now let's create our GATK reference index for downstream variant calling
$ java -jar /data/biotools/linux/picard.jar CreateSequenceDictionary R=TB_H37Rv.fasta O=TB_H37Rv.dict
$ samtools faidx TB_H37Rv.fasta
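At this point the working directory should contain the reference and all of its index files. A quick way to check (ls only lists files; nothing is modified):

$ ls TB_H37Rv*

You should see the original fasta, the .dict file created by picard, the .fai file created by samtools and the .bt2 files created by bowtie2-build in step #7. GATK expects the .dict and .fai files to sit alongside the fasta.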
STEP14 - Create summary statistics on the BAM files
$ parallel 'java -jar /data/biotools/linux/picard.jar CollectAlignmentSummaryMetrics R=TB_H37Rv.fasta \
I={}_nodups.bam O=AlignmentSummaryMetrics_{}.txt ASSUME_SORTED=true MAX_INSERT_SIZE=100000' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
$ parallel 'java -jar /data/biotools/linux/picard.jar BamIndexStats I={}_nodups.bam' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
STEP15 - Calculate depth of coverage stats from BAM file
$ parallel 'java -jar /data/biotools/linux/GenomeAnalysisTK.jar -T DepthOfCoverage -R TB_H37Rv.fasta \
-I {}_nodups.bam -pt sample --outputFormat table' ::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
NOTES - This is done in GATKv3, as the DepthOfCoverage tool is not supported in GATKv4. At the end of this step you should have a good, "clean" BAM file for each sample, from which variants can be called. A quick alternative coverage check is sketched below.
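If GATKv3 is not available, a rough mean depth can be computed with samtools instead (a sketch for a hypothetical sample named sample01_R; the -a flag includes zero-coverage positions):

$ samtools depth -a sample01_R_sorted_nodups.bam | awk '{sum+=$3} END {print sum/NR}'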
STEP16 - Call variants from BAM file
$ parallel 'gatk HaplotypeCaller -R TB_H37Rv.fasta -I {}_nodups.bam -O {}.raw.snp.indel.vcf' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
$ mkdir vcfs
$ mv *.vcf vcfs
$ cd vcfs
NOTES - This step uses the HaplotypeCaller tool of GATK to call variants. If you are using GATKv3 you need to specify the tool with -T (e.g. -T HaplotypeCaller); a sketch of the GATKv3 form follows below. -R specifies the reference genome, -I the input BAM file and -O the output file.
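For completeness, a hedged sketch of the equivalent GATKv3 command for a hypothetical sample named sample01_R (note that GATKv3 uses a lowercase -o for the output file):

$ java -jar /data/biotools/linux/GenomeAnalysisTK.jar -T HaplotypeCaller -R TB_H37Rv.fasta \
-I sample01_R_sorted_nodups.bam -o sample01_R_sorted.raw.snp.indel.vcf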
STEP17 - Now use GATK to hard filter the variant calls
The GATK VariantFiltration tool is designed for hard-filtering of variant calls based on user-defined criteria. For every record that matches the filter expression, the tool changes the value in the VCF FILTER field from PASS to the label given with --filter-name. Records that fail the filter are therefore marked but retained; to remove them from the call set we mask them out in the following step.
$ parallel 'gatk VariantFiltration -R ../TB_H37Rv.fasta -V {}.raw.snp.indel.vcf --filter-expression \
"QUAL < 30.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "FAILED" -O {}_filtered_snps.vcf' \
::: $(ls *R_sorted.raw.snp.indel.vcf | rev | cut -c 19- | rev | uniq)
NOTES - If you look at the content of a VCF file (e.g. with the less command) you will see the format contains an overwhelming amount of information: each variant call is annotated with a wide range of quality metrics (QD, FS, MQ and so on), and it is these annotations that the filter expression above tests. A quick way to summarize the result is sketched below.
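A quick one-liner to count how many records passed and failed, for a hypothetical sample named sample01_R (column 7 of a VCF is the FILTER field):

$ grep -v "^#" sample01_R_sorted_filtered_snps.vcf | cut -f 7 | sort | uniq -c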
STEP18 - Now we remove the variant calls that did not pass the hard-filter
We use bcftools to remove variants that did not pass the hard filter in the previous step; -f .,PASS keeps only records whose FILTER field is PASS or "." (unfiltered).
$ parallel 'bcftools view -Ov -f .,PASS {}_snps.vcf > {}_pass.vcf' ::: $(ls *_snps.vcf | rev | cut -c 10- | rev | uniq)
NOTES - If you look at the VCF files with the _pass.vcf ending you will see that the FILTER field of every record is now PASS (or "."). A hedged way to confirm this is sketched below.
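To confirm without paging through the file, for a hypothetical sample named sample01_R:

$ bcftools query -f '%FILTER\n' sample01_R_sorted_filtered_pass.vcf | sort | uniq -c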
Now we have good-quality VCF files!