Human Variant Calling
Author: Dr Eduan Wilkinson (Ph.D.)
Date: 25th of February 2019
This document explains the main steps involved in human variant calling, specifically germline short variant calling (SNPs and indels). For this exercise we focus on whole exome sequencing of human chr1 as per the H3ABioNet worked example (https://h3abionet.github.io/H3ABionet-SOPs/Variant-Calling-1-0).
The bioinformatic workflow can broadly be divided into two main steps: (i) pre-processing, and (ii) variant calling and prioritization.
To run these commands locally or on a server the following software is required:
parallel
FastQC
Trimmomatic
bowtie2
GATK v3
GATK v4
picard
snpEff
As mentioned above, the pipeline can be divided into two main steps: (i) pre-processing, and (ii) variant calling and prioritization. Pre-processing involves the QC and filtering of raw sequence reads, the alignment of reads to the latest human reference, and the filtering and base quality recalibration needed to produce a BAM file that is ready for variant calling. The second step involves the calling of variants from the output BAM file(s), the consolidation of variants into a GVCF, joint genotyping and variant quality score recalibration. NOTE - FOR STUDIES WITH FEWER THAN ONE WHOLE GENOME OR FEWER THAN 30 WHOLE EXOMES, THE VQSR TOOL IN GATK4 CANNOT BE RUN.
PLEASE NOTE - THE EXACT SAME REFERENCE BUILD NEEDS TO BE USED THROUGHOUT!!! If you pick hg19 then stick with hg19. If you go with b37 or b38 then stick with b37 or b38 throughout the entire exercise. Different builds can have very small differences (some chromosomes differ by only 1bp, for example), but these can have a big effect downstream.
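A quick way to sanity-check which build you are working with is to inspect the contig names and lengths in the reference index (a minimal check; the reference filename matches the one used later in this exercise):
$ samtools faidx HG19_GATKbundle2.8_noDecoys.noChrInName.fa          # creates the .fai index
$ cut -f1,2 HG19_GATKbundle2.8_noDecoys.noChrInName.fa.fai | head    # contig names and lengths
Compare the contig names (e.g. '1' vs 'chr1') and lengths against those in your known-sites VCFs before running anything downstream.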
Figure 1: Basic workflow involved in calling short germline variants (SNPs and indels) based on the GATK v4 “Best Practices” guidelines. The pre-processing step runs until the Analysis-Ready Reads step, which produces a BAM file ready to call variants from. In the subsequent steps we call variants with GATK HaplotypeCaller, consolidate variants into a GVCF file, and perform joint genotyping and variant quality score recalibration.
STEP1 - Unzip fastq files
In this step we are decompressing the sequence reads prior to processing.
$ gunzip *.gz
NOTE - if fastq files are already uncompressed (files ending in *.fastq) then you can skip this step.
STEP2 - Run FastQC on unprocessed sequence reads
In this step we first create a folder called Pre.FastQC. In the second command we execute FastQC and ask the program to write its output (-o) to the folder we just created.
$ mkdir Pre.FastQC
$ /Applications/biotools/FastQC/fastqc -o Pre.FastQC/ -f fastq *.fastq
STEP3 - Remove reads with low base quality scores and trim residual adaptor sequences
In this step we scan all the reads and remove potential residual adaptor sequences. In short, adaptor sequences are small barcode sequences that are attached to sequence fragments during library prep. These adaptors allow binding to the flow cell during the sequencing run. Most sequencers will remove adaptor sequences, but in some cases some adaptor sequences might remain. As these are artificially added and have no biological meaning for the organism being studied, they should be removed. We also remove reads with low base quality scores and very short reads. The tool can also be used to remove the first and last n bases of reads. This would be a helpful extension if the FastQC analyses suggest a big drop in the quality scores at the end of reads.
$ parallel 'java -jar /Applications/biotools/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
{}1.fastq {}2.fastq {}_pair_F.fastq {}_unpair_F.fastq {}_pair_R.fastq {}_unpair_R.fastq \
ILLUMINACLIP:/Applications/biotools/Trimmomatic-0.36/adapters/NexteraPE-PE.fa:2:30:10 \
SLIDINGWINDOW:4:25 MINLEN:20' ::: $(ls *_*.fastq | rev | cut -c 8- | rev | uniq)
NOTES - We don't know what adaptors were used in the sequencing run, but we need to specify some otherwise Trimmomatic will not run, so we use Nextera adaptors in this example. We are using a sliding window of 4:25, which means Trimmomatic scans all reads with a window size of 4bp and trims reads where the average quality within the window drops below 25. The MINLEN parameter restricts reads to a minimum length of 20; reads shorter than 20 will be dropped. The reason for this is that you don't want short reads, as these can potentially align to multiple regions. There is no hard-and-fast rule for these settings. If sequence quality is poor it might be necessary to change these parameters.
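As mentioned in the step description, Trimmomatic can also trim a fixed number of bases from the ends of reads. A minimal sketch for a single read pair (the file names and the HEADCROP/CROP values are illustrative, not recommendations):
$ java -jar /Applications/biotools/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
sample_read1.fastq sample_read2.fastq sample_pair_F.fastq sample_unpair_F.fastq \
sample_pair_R.fastq sample_unpair_R.fastq \
HEADCROP:5 CROP:140 SLIDINGWINDOW:4:25 MINLEN:20
HEADCROP:5 removes the first 5 bases of each read, and CROP:140 cuts every read down to at most 140 bases, which removes low-quality tails flagged by FastQC.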
STEP4 - Now we run FastQC on the trimmed reads
$ mkdir raw.reads
$ mv *read1.fastq raw.reads
$ mv *read2.fastq raw.reads
$ mkdir Post.FastQC
$ /Applications/biotools/FastQC/fastqc -o Post.FastQC/ -f fastq *.fastq
NOTES - In this step the commands first create a directory called raw.reads. Next we move all of the raw (unprocessed) sequence reads into the newly created directory so that they are not analyzed again. We then make a folder called Post.FastQC and run FastQC on the trimmed output from step #3.
STEP5 - Now we run MultiQC on the two FastQC folders
$ multiqc Pre.FastQC/
$ mv multiqc_* Pre.FastQC
$ multiqc Post.FastQC/
$ mv multiqc_* Post.FastQC/
NOTES - In this step we run MultiQC on the output from FastQC. In steps #2 and #4 we wrote the output of FastQC to two directories called Pre.FastQC and Post.FastQC. Running MultiQC is as simple as typing multiqc (the command is in the PATH) and specifying the folder containing the files that need to be summarized.
STEP6 - Now let's do our alignment, which will map the reads to our reference
This is the reference (HG19_GATKbundle2.8_noDecoys.noChrInName.fa) against which we will align our reads. First we need to build the reference index with bowtie2-build. Once we have a built index we can do the alignment.
$ bowtie2-build HG19_GATKbundle2.8_noDecoys.noChrInName.fa HG19_GATKbundle2.8_noDecoys.noChrInName.fa
$ parallel 'bowtie2 -x HG19_GATKbundle2.8_noDecoys.noChrInName.fa -1 {}_pair_F.fastq \
-2 {}_pair_R.fastq -I 0 -X 1200 --un-conc {}_unaln_fastq --rg-id "ID_{}" --rg "LB:LB_{}" \
--rg "PL:ILLUMINA" --rg "SM:SM_{}" -S {}.sam' ::: $(ls *_pair_*.fastq | rev | cut -c 14- | rev | uniq)
NOTES - Parameters:
-x - the basename of the reference index built with bowtie2-build.
-1 / -2 - the forward and reverse reads of each pair.
-I 0 / -X 1200 - the minimum and maximum fragment length for valid paired-end alignments.
--un-conc - write read pairs that fail to align concordantly to the given file.
--rg-id / --rg - set the read group ID and add read group fields (library, platform, sample) to the SAM header.
-S - write the output in SAM format to the given file.
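To confirm that the read group fields were written as expected, the SAM header can be inspected (a minimal check; 'sample.sam' is a placeholder for one of your output files):
$ samtools view -H sample.sam | grep '^@RG'   # should list the ID, LB, PL and SM fields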
STEP7 - Bin the unpaired reads
$ mkdir unpaired.reads
$ mv *unpair* unpaired.reads
$ zip -r unpaired.reads.zip unpaired.reads -x "*DS_Store"
$ rm -r unpaired.reads/
NOTES - In this step we first create a directory called unpaired.reads and then move all unpaired reads into the new directory. Next we archive this folder with the zip command and remove the original folder.
STEP8 - Now let's bin our paired reads
$ mkdir paired.reads
$ mv *pair* paired.reads
$ zip -r paired.reads.zip paired.reads -x "*DS_Store"
$ rm -r paired.reads/
NOTES - In this step we make a new folder called paired.reads and then move all paired reads into the new folder. We then compress this folder with the zip command and then remove the original folder leaving the archived/compressed copy.
STEP9 - Convert the output SAM file from the alignment step to a BAM file
$ parallel 'samtools view -bS {}.sam | samtools sort -o {}_sorted.bam -O BAM' \
::: $(ls *.sam | rev | cut -c 5- | rev | uniq)
NOTES - In this step we convert each SAM file to a sorted BAM file. BAM files are a bit more manageable to work with compared to SAM files, which can be really big.
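A quick way to check how well the alignment went is to run samtools flagstat on one of the sorted BAMs (a minimal check; 'sample_sorted.bam' is a placeholder name):
$ samtools flagstat sample_sorted.bam   # read counts, % mapped, % properly paired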
STEP10 - Bin the SAM files and compress
$ mkdir sam.files
$ mv *.sam sam.files/
$ zip -r sam.files.zip sam.files -x "*DS_Store"
$ rm -r sam.files/
NOTES - In this step we make a new folder called sam.files and move all the SAM files into the new directory. Next we archive with zip and remove the original folder. It's always good to keep the SAM files. If something goes wrong downstream you can always go back to your SAM file without having to redo the alignment step, which for human data can be quite time consuming.
STEP11 - Now let's add the read groups and mark the duplicate reads in our BAM file
$ parallel 'java -jar /data/biotools/linux/picard.jar AddOrReplaceReadGroups \
I={}.bam O={}_re.bam RGID=ID_{} RGLB=LB_{} RGPL=ILLUMINA RGPU=ILLUMINA RGSM=SM_{}' \
::: $(ls *.bam | rev | cut -c 5- | rev | uniq)
NOTES - Many tools (Picard and GATK for example) require or assume the presence of at least one RG tag, defining a “read-group” to which each read can be assigned (as specified in the RG tag in the SAM record). This tool enables the user to assign all the reads in the INPUT file to a single new read-group. It can accept either SAM or BAM files as input.
$ rm *_sorted.bam
$ parallel 'java -jar /data/biotools/linux/picard.jar MarkDuplicates INPUT={}_re.bam \
OUTPUT={}_nodups.bam METRICS_FILE=marked_dup_metrics_{}.txt ASSUME_SORTED=true \
REMOVE_DUPLICATES=true READ_NAME_REGEX="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=100' ::: $(ls *_re.bam | rev | cut -c 8- | rev | uniq)
NOTES - This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation, e.g. library construction using PCR. Duplicate reads can also result from a single amplification cluster incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates. If desired, duplicates can be removed using the REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES options in Picard. Additional parameters:
ASSUME_SORTED - If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}. Cannot be used in conjunction with option(s) ASSUME_SORT_ORDER (ASO).
REMOVE_DUPLICATES - If true, do not write duplicates to the output file; otherwise they are written with the appropriate flags set. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}.
READ_NAME_REGEX - Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values. Default value: . This option can be set to ‘null’ to clear the default value.
OPTICAL_DUPLICATE_PIXEL_DISTANCE - The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to ‘null’ to clear the default value.
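Once MarkDuplicates has run, the per-library duplication rate can be read from the metrics file it writes (a minimal check; the sample name in the filename is a placeholder):
$ grep -A 2 '^## METRICS CLASS' marked_dup_metrics_sample.txt   # metrics header plus the row containing PERCENT_DUPLICATION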
STEP12 - Now let's index our BAM file
$ parallel 'samtools index {}' ::: *_nodups.bam
STEP13 - Now let's create our GATK reference index for downstream variant calling
$ java -jar /data/biotools/linux/picard.jar CreateSequenceDictionary \
R=HG19_GATKbundle2.8_noDecoys.noChrInName.fa O=HG19_GATKbundle2.8_noDecoys.noChrInName.dict
$ samtools faidx HG19_GATKbundle2.8_noDecoys.noChrInName.fa
STEP14 - Create Indel realignment targets
The local realignment process is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. In general, a large percentage of regions requiring local realignment are due to the presence of an insertion or deletion (indel) in the individual’s genome with respect to the reference genome. Such alignment artifacts result in many bases mismatching the reference near the misalignment, which are easily mistaken as SNPs. Moreover, since read mapping algorithms operate on each read independently, it is impossible to place reads on the reference genome such that mismatches are minimized across all reads. Consequently, even when some reads are correctly mapped with indels, reads covering the indel near just the start or end of the read are often incorrectly mapped with respect to the true indel, also requiring realignment. Local realignment serves to transform regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches.
$ parallel 'java -jar /Applications/biotools/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar \
-T RealignerTargetCreator -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_nodups.bam \
--known 1.1KGIndels.chr1.vcf -o forIndelRealigner_{}.intervals' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
NOTES - In this exercise we are only working with chr1, hence we only need to specify the known indel sites for chr1. If you are working with whole genome or whole exome sequencing data you can specify a folder with all chromosomes and the mitochondrion (e.g. /known_indels/*.vcf). This step and the next can be skipped since we are using HaplotypeCaller, but I have included them here for the sake of completeness.
STEP15 - Do realignment around indel sites
$ parallel 'java -jar /Applications/biotools/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar \
-T IndelRealigner -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_nodups.bam -known \
1.1KGIndels.chr1.vcf -targetIntervals forIndelRealigner_{}.intervals -o {}_realn.bam' \
::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)
NOTES - Please note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However, it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect. Also note that reads produced by the 454 technology inherently contain false indels; the realigner will not work with them (or with reads from similar technologies).
STEP16 - Fix mate information
$ parallel 'java -jar /data/biotools/linux/picard.jar FixMateInformation I={}_realn.bam O={}_fix.bam SORT_ORDER=coordinate' \
::: $(ls *_realn.bam | rev | cut -c 11- | rev | uniq)
NOTES - This tool ensures that all mate-pair information is in sync between each read and its mate pair. If no OUTPUT file is supplied then the output is written to a temporary file and then copied over the INPUT file (with the original placed in a .old file.) Reads marked with the secondary alignment flag are written to the output file unchanged. However supplementary reads are corrected so that they point to the primary, non-supplemental mate record.
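Before recalibration it can be worth validating the BAM with Picard's ValidateSamFile (a minimal check; 'sample_fix.bam' is a placeholder name):
$ java -jar /data/biotools/linux/picard.jar ValidateSamFile I=sample_fix.bam MODE=SUMMARY   # summarizes any errors/warnings in the file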
STEP17 - Base Quality Score Recalibration (BQSR)
In short, BQSR adjusts the quality value of each base within the reads, which was initially set by the sequencing machine. The goal is for the new values to be closer to the true error probabilities.
Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if there is none available for your organism), then it adjusts the base quality scores in the data based on the model. There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.

The BaseRecalibrator tool performs the first step described above: it builds the model of covariation and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see are therefore errors and indicative of poor base quality. The tool generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at this site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical quality score).
$ parallel 'java -jar /Applications/biotools/GenomeAnalysisTK-3.8-0/GenomeAnalysisTK.jar \
-T BaseRecalibrator -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_fix.bam -knownSites \
dbsnp_138.hg19.NoChrInNames.vcf -o recal_data_{}.table' ::: $(ls *_fix.bam | rev | cut -c 9- | rev | uniq)
$ parallel '/Applications/biotools/gatk-4.0.3.0/gatk ApplyBQSR -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-I {}_fix.bam --bqsr-recal-file recal_data_{}.table -O {}_bqsr.bam' ::: $(ls *_fix.bam | rev | cut -c 9- | rev | uniq)
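The optional before/after comparison mentioned above can be produced by running BaseRecalibrator a second time, on the recalibrated BAM, and plotting the two tables with AnalyzeCovariates (a sketch for a single sample; 'sample' is a placeholder, both tables are assumed to come from the same GATK version, and R must be installed for the plots):
$ /Applications/biotools/gatk-4.0.3.0/gatk BaseRecalibrator -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-I sample_bqsr.bam --known-sites dbsnp_138.hg19.NoChrInNames.vcf -O recal_data_sample_after.table
$ /Applications/biotools/gatk-4.0.3.0/gatk AnalyzeCovariates -before recal_data_sample.table \
-after recal_data_sample_after.table -plots recal_plots_sample.pdf   # before/after QC plots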
STEP1 - Now let's call our variants with HaplotypeCaller
Tool requirements: HaplotypeCaller GATKv4
Input: Input BAM files from which to make calls.
Output: Either a VCF or a gVCF file with raw, unfiltered SNPs and indel calls. Regular VCFs must be filtered either by variant recalibration (best) or hard-filtering before use in downstream analyses. If using the reference-confidence model workflow for cohort analysis, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.
$ parallel '/Applications/biotools/gatk-4.0.3.0/gatk HaplotypeCaller -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-I {}_bqsr.bam -O {}.raw.snp.indel.vcf' ::: $(ls *_bqsr.bam | rev | cut -c 10- | rev | uniq)
NOTES - How HaplotypeCaller works:
Define active regions: The program determines which regions of the genome it needs to operate on, based on the presence of significant evidence for variation.
Determine haplotypes by assembly of the active region: For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion and identifies the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.
Determine likelihoods of the haplotypes given the read data: For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data.
Assign sample genotypes: For each potentially variant site, the program applies Bayes’ rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.
PS - when running multiple samples at the same time for the same study (cohort study) it's best to run in gVCF mode, as sketched below.
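A minimal sketch of gVCF mode and the subsequent joint-genotyping steps (the per-sample gVCF names in the CombineGVCFs call are placeholders):
$ parallel '/Applications/biotools/gatk-4.0.3.0/gatk HaplotypeCaller -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-I {}_bqsr.bam -O {}.g.vcf.gz -ERC GVCF' ::: $(ls *_bqsr.bam | rev | cut -c 10- | rev | uniq)
$ /Applications/biotools/gatk-4.0.3.0/gatk CombineGVCFs -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz   # consolidate the per-sample gVCFs
$ /Applications/biotools/gatk-4.0.3.0/gatk GenotypeGVCFs -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-V cohort.g.vcf.gz -O cohort.raw.snp.indel.vcf   # joint genotyping across the cohort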
STEP2 - Now let's hard filter the variants
Please note that for studies with larger sample numbers GATK has the VQSR protocol to mark variants as highly probable or not. However, for small sample numbers hard filtering of variants is sufficient.
$ parallel '/Applications/biotools/gatk-4.0.3.0/gatk VariantFiltration -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa \
-V {}.raw.snp.indel.vcf --filter-expression \
"QUAL < 30.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
--filter-name "FAILED" -O {}_filtered_snps.vcf' \
::: $(ls *.raw.snp.indel.vcf | rev | cut -c 19- | rev | uniq)
Now let's filter out the variants that did not pass the FILTER.
$ parallel 'bcftools view -Ov -f .,PASS {}_snps.vcf > {}_pass.vcf' \
::: $(ls *_snps.vcf | rev | cut -c 10- | rev | uniq)
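To get a quick summary of what survived filtering, bcftools stats can be run on one of the PASS VCFs (a minimal check; the filename is a placeholder):
$ bcftools stats sample_filtered_pass.vcf | grep '^SN'   # numbers of records, SNPs, indels, etc.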
STEP3 - Now let's pass it through SnpEff
$ java -Xmx4g -jar /Applications/biotools/snpEff/snpEff.jar hg19 \
WES_chr1_50X_E0.005_filtered_pass.vcf > WES.snpEff.ann.vcf
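For prioritization, the annotated VCF can be narrowed down to high-impact variants with SnpSift, which ships alongside SnpEff (a sketch; the SnpSift.jar path is assumed to sit next to the snpEff install above):
$ java -jar /Applications/biotools/snpEff/SnpSift.jar filter "ANN[0].IMPACT = 'HIGH'" \
WES.snpEff.ann.vcf > WES.snpEff.high_impact.vcf   # keep variants whose first annotation is HIGH impact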