Report - TB in Southern Africa
Author: Dr Eduan Wilkinson (Ph.D.)
Date: 21st of February 2019
This document describes the basic workflow for processing next-generation Illumina sequence data from M. tuberculosis (TB) samples collected in southern Africa. Samples came from three main sources: (i) the Africa Health Research Institute (AHRI), (ii) the Centre for the AIDS Programme of Research in South Africa (CAPRISA), and (iii) samples from Thato Iketleng of Botswana.
| Sample source | Number |
|---|---|
| AHRI | 9 |
| CAPRISA | 42 |
| Thato | 25 |
The following commands were used to process the raw reads into assembled genomes, to index BAM files, to call variants and to hard-filter these variants. For the purpose of this study the H37Rv reference was used for read mapping. The processing was performed on the KRISP2 server under /data/eduan/TB/. PLEASE NOTE: the $ sign SHOULD NOT be entered with the rest of the command, nor should the backslashes (\), which only indicate that one long command has been wrapped over several lines (see the short example below). All file paths in the commands below are absolute and should work for anyone working on the KRISP2 server.
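For example, a command printed in this document as

$ some_command --input reads.fastq \
--output results.txt

should be typed on a single line as: some_command --input reads.fastq --output results.txt (some_command is a made-up name used purely for illustration).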
To run these commands locally or on another server, the following software needs to be installed (a quick way to check that the tools are available is sketched after this list):
parallel
FastQC
MultiQC
Trimmomatic
bowtie2
samtools
bcftools
GATK v3
GATK v4
picard
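A minimal sanity check for the command-line tools, assuming they have been installed on your PATH (adjust the names to match your installation):

$ for tool in parallel fastqc multiqc bowtie2 samtools bcftools; do command -v $tool || echo "$tool missing"; done

Trimmomatic, picard and the two GATK versions are invoked as jar files by absolute path in the commands below, so for those simply check that the jar files exist.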
STEP1 - Unzip fastq files
In this step we are decompressing the sequence reads prior to processing.
$ gunzip *.gz
NOTE - if fastq files are already uncompressed (files ending in *.fastq) then you can skip this step.
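If you would rather keep the compressed originals as a backup, recent versions of gzip support a -k (keep) flag:

$ gunzip -k *.gz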
STEP2 - Run FastQC on unprocessed sequence reads
In this step we first create a folder called Pre.FastQC. In the second part we execute FastQC and ask the program to write output (-o) to the folder we created in the previous step.
$ mkdir Pre.FastQC
$ /data/biotools/linux/FastQC/fastqc -o Pre.FastQC/ -f fastq -c *.fastq
NOTES - if you get a warning message saying that option -c is ambiguous, don't worry, it will still run.
STEP3 - Remove reads with low base quality scores and trim residual adaptor sequences
In this step we scan all the reads and remove potential residual adaptor sequences. In short, adaptor sequences are small barcode sequences that are attached to sequence fragments during library prep. These adaptors allow binding to the flow cell during the sequencing run. Most sequencers will remove adaptor sequences, but in some cases residual adaptor sequences remain. As these are artificially added and have no biological meaning for the organism being studied, they should be removed. We also remove reads with low-quality base scores and very short reads. The tool can also be used to remove the first and last n bases of reads (a sketch of this follows the notes below). This is a helpful extension if the FastQC analysis shows a big drop in quality scores at the ends of reads.
$ parallel 'java -jar /data/biotools/linux/Trimmomatic-0.33/trimmomatic-0.33.jar PE \
{}1_001.fastq {}2_001.fastq {}_pair_F.fastq {}_unpair_F.fastq {}_pair_R.fastq {}_unpair_R.fastq \
ILLUMINACLIP:/data/biotools/linux/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 \
SLIDINGWINDOW:4:25 MINLEN:20' ::: $(ls *_*.fastq | rev | cut -c 12- | rev | uniq)
NOTES - Nextera adaptors were used in the sequencing run. We are using a sliding window of 4:25, which means the tool scans each read with a window size of 4 bp and cuts the read where the average quality within the window drops below 25. The MINLEN setting restricts reads to a minimum length of 20; reads shorter than 20 bp after trimming are dropped. The reason for this is that you don't want very short reads, as these can potentially align to multiple regions. There is no hard-and-fast rule for these specifications. If sequence quality is poor it might be necessary to change these parameters.
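A minimal single-sample sketch of the head/tail trimming mentioned above, for a hypothetical sample named sample01 and with purely illustrative values (drop the first 10 bases with HEADCROP, cap read length at 140 with CROP):

$ java -jar /data/biotools/linux/Trimmomatic-0.33/trimmomatic-0.33.jar PE \
sample01_R1_001.fastq sample01_R2_001.fastq \
sample01_R_pair_F.fastq sample01_R_unpair_F.fastq sample01_R_pair_R.fastq sample01_R_unpair_R.fastq \
ILLUMINACLIP:/data/biotools/linux/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 \
HEADCROP:10 CROP:140 SLIDINGWINDOW:4:25 MINLEN:20

HEADCROP removes the first n bases of each read, and CROP trims each read down to a maximum length by cutting bases off the end.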
STEP4 - Now we run FastQC on the trimmed reads
$ mkdir raw.reads
$ mv *_001.fastq raw.reads/
$ zip -r raw.reads.zip raw.reads -x "*DS_Store"
$ rm -r raw.reads/
$ mkdir Post.FastQC
$ /data/biotools/linux/FastQC/fastqc -o Post.FastQC/ -f fastq -c *.fastq
NOTES - In this step the commands first create a directory called raw.reads and move all of the raw (unprocessed) sequence reads into it. We then compress this directory with the zip command and remove the uncompressed folder. Finally we make a folder called Post.FastQC and run FastQC on the trimmed reads from step #3, writing the results to that folder.
STEP5 - Now we run MultiQC on the two FastQC folders
$ multiqc Pre.FastQC/
$ multiqc Post.FastQC/
NOTES - In this step we run MultiQC on the output from FastQC. In steps #2 and #4 we wrote the output of FastQC to two directories called Pre.FastQC and Post.FastQC. Running MultiQC is as simple as typing multiqc (the command is in the PATH) and specifying the folder containing the files that need to be aggregated. A combined report is sketched below.
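If you prefer a single combined report, MultiQC can also be pointed at both folders at once; the -n option simply names the output report (a sketch, assuming a reasonably recent MultiQC):

$ multiqc Pre.FastQC/ Post.FastQC/ -n pre_vs_post_trimming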
STEP6 - Bin the unpaired reads
$ mkdir unpaired.reads
$ mv *unpair* unpaired.reads
$ zip -r unpaired.reads.zip unpaired.reads -x "*DS_Store"
$ rm -r unpaired.reads/
NOTES - In this step we first create a directory called unpaired.reads and then move all unpaired reads into the new directory. Next we archive this folder with the zip command and remove the original folder.
STEP7 - Now let's do our alignment, which will map the reads to our reference
This is the reference (H37Rv) against which we will align our reads. First we need to build the reference index with bowtie2-build. Once we have a built index we can do the alignment.
$ bowtie2-build TB_H37Rv.fasta TB_H37Rv.fasta
$ parallel 'bowtie2 -x TB_H37Rv.fasta -1 {}_pair_F.fastq -2 {}_pair_R.fastq \
-I 0 -X 1200 --un-conc {}_unaln_fastq --rg-id "ID_{}" --rg "LB:LB_{}" --rg "PL:ILLUMINA" \
--rg "SM:SM_{}" -S {}.sam' ::: $(ls *_pair_*.fastq | rev | cut -c 14- | rev | uniq)
NOTES - Parameters: -x specifies the reference index built in the previous command; -1 and -2 specify the forward and reverse read files; -I and -X set the minimum (0) and maximum (1200) fragment length for a valid paired alignment; --un-conc writes read pairs that fail to align concordantly to separate files; --rg-id and --rg add read-group information (ID, library, platform and sample) to the SAM header, which GATK requires downstream; -S names the output SAM file. A single-sample example is sketched below.
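To make the parallel substitution concrete, for a hypothetical sample named sample01_R the command above expands to:

$ bowtie2 -x TB_H37Rv.fasta -1 sample01_R_pair_F.fastq -2 sample01_R_pair_R.fastq \
-I 0 -X 1200 --un-conc sample01_R_unaln_fastq --rg-id "ID_sample01_R" --rg "LB:LB_sample01_R" \
--rg "PL:ILLUMINA" --rg "SM:SM_sample01_R" -S sample01_R.sam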
STEP8 - Now let's bin our paired reads
$ mkdir paired.reads
$ mv *pair* paired.reads
$ zip -r paired.reads.zip paired.reads -x "*DS_Store"
$ rm -r paired.reads/
NOTES - In this step we make a new folder called paired.reads and then move all paired reads into the new folder. We then compress this folder with the zip command and then remove the original folder leaving the archived/compressed copy.
STEP9 - Convert the output SAM file from the alignment step to a BAM file
$ parallel 'samtools view -bS {}.sam | samtools sort -o {}_sorted.bam -O BAM' \
::: $(ls *.sam | rev | cut -c 5- | rev | uniq)
NOTES - In this step we convert each SAM file to a sorted BAM file. BAM files are binary and compressed, so they are much more manageable to work with than SAM files, which can be really BIG. A quick check on the result is sketched below.
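A quick, hedged sanity check on the conversion, for a hypothetical sample named sample01_R:

$ samtools flagstat sample01_R_sorted.bam

flagstat prints counts of total, mapped and properly paired reads, which is a useful first check on the alignment.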
STEP10 - Bin the SAM files and compress
$ mkdir sam.files
$ mv *.sam sam.files/
$ zip -r sam.files.zip sam.files -x "*DS_Store"
$ rm -r sam.files/
NOTES - In this step we make a new folder called sam.files and move all the SAM files into the new directory. Next we archive with zip and remove the original folder.
STEP11 - Now let's mark the duplicate reads in our BAM files
$ parallel 'java -jar /data/biotools/linux/picard.jar MarkDuplicates \
I={}.bam O={}_nodups.bam M={}_dup_metrics.txt' \
::: $(ls *.bam | rev | cut -c 5- | rev | uniq)
$ rm *_sorted.bam
NOTES - Picard MarkDuplicates flags reads that are likely PCR or optical duplicates so that the variant caller does not count the same DNA fragment twice; the output files carry the _nodups suffix used in all later steps. The read-group information that GATK needs was already added with the --rg options in step #7. Once the marked copies exist, the original sorted BAMs are removed. A quick way to inspect the flagged duplicates is sketched below.
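A hedged way to count how many reads were flagged as duplicates, for a hypothetical sample named sample01_R (1024, i.e. 0x400, is the SAM duplicate flag):

$ samtools view -c -f 1024 sample01_R_sorted_nodups.bam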
STEP12 - Now let's index our BAM files
$ parallel 'samtools index {}' ::: *_nodups.bam
STEP13 - Now let's create our GATK reference index for downstream variant calling
$ java -jar /data/biotools/linux/picard.jar CreateSequenceDictionary R=TB_H37Rv.fasta O=TB_H37Rv.dict
$ samtools faidx TB_H37Rv.fasta
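At this point the working directory should contain the reference and all of its index files. A quick way to check (ls only lists files; nothing is modified):

$ ls TB_H37Rv*

You should see the original fasta, the .dict file created by picard, the .fai file created by samtools and the .bt2 files created by bowtie2-build in step #7. GATK expects the .dict and .fai files to sit alongside the fasta.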
STEP14 - Create summary statistics on the BAM files
$ parallel 'java -jar /data/biotools/linux/picard.jar CollectAlignmentSummaryMetrics R=TB_H37Rv.fasta \
I={}_nodups.bam O=AlignmentSummaryMetrics_{}.txt ASSUME_SORTED=true MAX_INSERT_SIZE=100000' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
$ parallel 'java -jar /data/biotools/linux/picard.jar BamIndexStats I={}_nodups.bam' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
STEP15 - Calculate depth of coverage stats from BAM file
$ parallel 'java -jar /data/biotools/linux/GenomeAnalysisTK.jar -T DepthOfCoverage -R TB_H37Rv.fasta \
-I {}_nodups.bam -pt sample --outputFormat table' ::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
NOTES - This is done in GATKv3, as the DepthOfCoverage tool is not supported in GATKv4. At the end of this step you should have a good, "clean" BAM file for each sample, from which variants can be called. A quick alternative coverage check is sketched below.
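If GATKv3 is not available, a rough mean depth can be computed with samtools instead (a sketch for a hypothetical sample named sample01_R; the -a flag includes zero-coverage positions):

$ samtools depth -a sample01_R_sorted_nodups.bam | awk '{sum+=$3} END {print sum/NR}'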
STEP16 - Call variants from BAM file
$ parallel 'gatk HaplotypeCaller -R TB_H37Rv.fasta -I {}_nodups.bam -O {}.raw.snp.indel.vcf' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
$ mkdir vcfs
$ mv *.vcf vcfs
$ cd vcfs
NOTES - This step uses the HaplotypeCaller tool of GATK to call variants. If you are using GATKv3 you need to specify the tool with -T (e.g. -T HaplotypeCaller); a sketch of the GATKv3 form follows below. -R specifies the reference genome, -I the input BAM file and -O the output file.
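For completeness, a hedged sketch of the equivalent GATKv3 command for a hypothetical sample named sample01_R (note that GATKv3 uses a lowercase -o for the output file):

$ java -jar /data/biotools/linux/GenomeAnalysisTK.jar -T HaplotypeCaller -R TB_H37Rv.fasta \
-I sample01_R_sorted_nodups.bam -o sample01_R_sorted.raw.snp.indel.vcf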
STEP17 - Now use GATK to hard filter the variant calls
The GATK VariantFiltration tool is designed for hard-filtering of variant calls based on user-defined criteria. For every record that matches the filter expression, the tool changes the value in the VCF FILTER field from PASS to the label given with --filter-name. Records that fail the filter are therefore marked but retained; to remove them from the call set we mask them out in the following step.
$ parallel 'gatk VariantFiltration -R ../TB_H37Rv.fasta -V {}.raw.snp.indel.vcf --filter-expression \
"QUAL < 30.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "FAILED" -O {}_filtered_snps.vcf' \
::: $(ls *R_sorted.raw.snp.indel.vcf | rev | cut -c 19- | rev | uniq)
NOTES - If you look at the content of a VCF file (e.g. with the less command) you will see the format contains an overwhelming amount of information: each variant call is annotated with a wide range of quality metrics (QD, FS, MQ and so on), and it is these annotations that the filter expression above tests. A quick way to summarize the result is sketched below.
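A quick one-liner to count how many records passed and failed, for a hypothetical sample named sample01_R (column 7 of a VCF is the FILTER field):

$ grep -v "^#" sample01_R_sorted_filtered_snps.vcf | cut -f 7 | sort | uniq -c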
STEP18 - Now we remove the variant calls that did not pass the hard-filter
We use bcftools to remove variants that did not pass the hard filter in the previous step; -f .,PASS keeps only records whose FILTER field is PASS or "." (unfiltered).
$ parallel 'bcftools view -Ov -f .,PASS {}_snps.vcf > {}_pass.vcf' ::: $(ls *_snps.vcf | rev | cut -c 10- | rev | uniq)
NOTES - If you look at the VCF files with the _pass.vcf ending you will see that the FILTER field of every record is now PASS (or "."). A hedged way to confirm this is sketched below.
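To confirm without paging through the file, for a hypothetical sample named sample01_R:

$ bcftools query -f '%FILTER\n' sample01_R_sorted_filtered_pass.vcf | sort | uniq -c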
Now we have good-quality VCF files!