Basic Information

This document explains the main steps involved in human variant calling, specifically germline short variant calling (SNPs and indels). For this exercise we focus on whole exome sequencing of human chr1, as per the H3ABioNet worked example.

Software requirements

To run these commands locally or on a server, the following software is required:

parallel
FastQC
Trimmomatic
bowtie2
GATK v3
GATK v4
picard
snpEff

Pipeline steps

The pipeline can be divided into two main steps: (i) pre-processing, and (ii) variant calling and prioritization. Pre-processing involves the QC and filtering of raw sequence reads, the alignment of reads to the latest human reference, and the filtering and base quality score recalibration needed to produce a BAM file that is ready for variant calling. The second step involves calling variants from the output BAM file(s), consolidating the variants into a GVCF, joint genotyping and variant quality score recalibration. NOTE - For studies with less than one whole genome sequence or fewer than 30 whole exome sequences, the GATK4 VQSR tool can NOT be used.

Figure 1: Basic workflow involved in calling short germline variants (SNPs and indels) based on the GATK v4 “Best Practices” guidelines. The pre-processing step runs until the Analysis-Ready Reads step, which produces a BAM file ready to call variants from. In the subsequent steps we call variants with GATK HaplotypeCaller, consolidate variants into a GVCF file, and perform joint genotyping and variant quality score recalibration. Picture adapted from the GATK website (https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145).

PLEASE NOTE - The exercise only contains exome sequence reads for chr1 of a single patient. The steps below use parallel so that the pipeline can be run on multiple samples at the same time. Also note that the dollar sign should not be included in commands!

1. Pre-Processing

1.1 Trimming

The first step would be to run FastQC on the raw sequence reads. However, I prefer to do the trimming step first and then run FastQC on all fastq files (raw reads as well as paired and unpaired reads) simultaneously.

We process the raw reads with Trimmomatic. This will remove any residual adaptor sequences in reads, as well as short reads (<20bp) and reads with low quality scores. This can be done with the following command:

$ parallel 'java -jar /data/biotools/linux/Trimmomatic-0.33/trimmomatic-0.33.jar PE {}1.fastq {}2.fastq {}_pair_F.fastq {}_unpair_F.fastq {}_pair_R.fastq {}_unpair_R.fastq ILLUMINACLIP:/data/biotools/linux/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 SLIDINGWINDOW:4:25 MINLEN:20' ::: $(ls *.fastq | rev | cut -c 8- | rev | uniq)

NOTES - We assume that Nextera adaptors were used in the sequencing run. If we don't specify adaptors then Trimmomatic will not run. We are using a sliding window of 4:25, which means it scans each read with a window size of 4bp and cuts the read once the average quality score (Phred) within the window drops below 25. The MINLEN option restricts the length of reads to a minimum size of 20; reads shorter than 20bp will be dropped. The reason for this is that you don't want short reads, as these can potentially align to multiple regions. There are no hard-and-fast rules for these specifications and you might have to play around with these parameters from time to time, for example when sequence quality is poor.
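
If the QC report (section 1.2) shows, for example, low quality towards the read ends, the trimming can be made stricter. The command below is only a sketch for a single hypothetical sample; the sample_ file prefix and the LEADING/TRAILING/SLIDINGWINDOW/MINLEN values are illustrative, not recommendations:

$ java -jar /data/biotools/linux/Trimmomatic-0.33/trimmomatic-0.33.jar PE sample_1.fastq sample_2.fastq sample_pair_F.fastq sample_unpair_F.fastq sample_pair_R.fastq sample_unpair_R.fastq ILLUMINACLIP:/data/biotools/linux/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36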

1.2 QC of reads

As mentioned in section 1.1, I prefer to run FastQC after the trimming step. This way the FastQC/MultiQC report contains info on the raw reads as well as stats on the paired and unpaired sequence reads, and it saves a bit of time since you don't have to run FastQC twice. First, we create a new folder to which we will write the FastQC output. Next, we run FastQC on all the fastq files.

$ mkdir fastqc

$ /data/biotools/linux/FastQC/fastqc -o fastqc/ -f fastq *.fastq

Now run MultiQC on the fastqc folder.

$ multiqc fastqc/

$ mv multiqc_data/ fastqc/

$ mv *.html fastqc/

1.3 Read mapping to reference

First we need to build our reference index. Assuming that the reference genome is in your home directory, you can run the command below. Alternatively, HG19, the reference we are using, can be found at /data/eduan/References/ on KRISP2. Please note!!! Once you start with a particular reference you need to stick to that reference. For example, if you are using hg38 then stick with it throughout the entire pipeline. If you use hg38 for mapping and hg19 for variant calling you WILL run into trouble. These builds can have small differences (some only 1bp) but this can have a BIG influence.

$ bowtie2-build HG19_GATKbundle2.8_noDecoys.noChrInName.fa HG19_GATKbundle2.8_noDecoys.noChrInName.fa

Next we can align the reads to the reference.

$ parallel 'bowtie2 -x HG19_GATKbundle2.8_noDecoys.noChrInName.fa -1 {}_pair_F.fastq -2 {}_pair_R.fastq -I 0 -X 1200 --un-conc {}_unaln_fastq --rg-id "ID_{}" --rg "LB:LB_{}" --rg "PL:ILLUMINA" --rg "SM:SM_{}" -S {}.sam' ::: $(ls *_pair_*.fastq | rev | cut -c 14- | rev | uniq)

NOTES - Parameters:

-I = minimum fragment length
-X = maximum fragment length
--un-conc = write pairs that didn't align concordantly to file
--rg-id = set the read group ID, reflected in the @RG line and the RG:Z: optional field
--rg = add ("label:value") to the @RG line of the SAM header. Note: the @RG line is only printed when --rg-id is set
-S = SAM output file

1.4 SAM to BAM conversion

The output from the mapping step is a SAM file, short for Sequence Alignment Map. These files are extremely large. We can convert this flat text file to a binary format called BAM (Binary Alignment Map), which is much easier to work with. This can be done with samtools using the following command:

$ parallel 'samtools view -bS {}.sam | samtools sort -o {}_sorted.bam -O BAM' ::: $(ls *.sam | rev | cut -c 5- | rev | uniq)
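
As an optional sanity check (not part of the original exercise), samtools flagstat can summarise how many reads in each sorted BAM are mapped and properly paired before moving on; the output file name is just a suggestion:

$ parallel 'samtools flagstat {} > {.}_flagstat.txt' ::: *_sorted.bam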

1.5 Add read groups and mark duplicates

There is no formal definition of a read group, but in practice the term refers to a set of reads that were generated from a single run of a sequencing instrument.

In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flowcell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group.

Read groups are identified in the SAM/BAM/CRAM file by a number of tags that are defined in the official SAM specification. These tags, when assigned appropriately, allow us to differentiate not only samples, but also various technical features that are associated with artifacts. With this information in hand, we can mitigate the effects of those artifacts during the duplicate marking and base recalibration steps. The GATK requires several read group fields to be present in input files and will fail with errors if this requirement is not satisfied.

So let's add the read groups:

$ parallel 'java -jar /data/biotools/linux/picard.jar AddOrReplaceReadGroups I={}.bam O={}_re.bam RGID=ID_{} RGLB=LB_{} RGPL=ILLUMINA RGPU=ILLUMINA RGSM=SM_{}' ::: $(ls *.bam | rev | cut -c 5- | rev | uniq)

NOTES - Many tools (Picard and GATK, for example) require or assume the presence of at least one RG tag, defining a “read group” to which each read can be assigned (as specified in the RG tag in the SAM record). This tool enables the user to assign all the reads in the INPUT to a single new read group. It accepts either SAM or BAM files as input.
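
To confirm the read groups were written as intended, the new BAM header can be inspected with samtools; each of the fields set above (ID, LB, PL, PU, SM) should appear in a single @RG line. Here sample1_sorted is a hypothetical file prefix:

$ samtools view -H sample1_sorted_re.bam | grep '^@RG'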

1.6 Mark duplicates

This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library construction using PCR. Duplicate reads can also result from a single amplification cluster, incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are referred to as optical duplicates.

Now let's mark the duplicate reads:

$ parallel 'java -jar /data/biotools/linux/picard.jar MarkDuplicates INPUT={}_re.bam OUTPUT={}_nodups.bam METRICS_FILE=marked_dup_metrics_{}.txt ASSUME_SORTED=true REMOVE_DUPLICATES=true READ_NAME_REGEX="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" OPTICAL_DUPLICATE_PIXEL_DISTANCE=100' ::: $(ls *_re.bam | rev | cut -c 8- | rev | uniq)

NOTES - If desired, duplicates can be removed using the REMOVE_DUPLICATES and REMOVE_SEQUENCING_DUPLICATES options in Picard. Additional parameters:

  • ASSUME_SORTED - If true, assume that the input file is coordinate sorted even if the header says otherwise. Deprecated, use ASSUME_SORT_ORDER=coordinate instead. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}. Cannot be used in conjunction with option(s) ASSUME_SORT_ORDER (ASO).

  • REMOVE_DUPLICATES - If true, do not write duplicates to the output file; instead of being written with the appropriate flags set, they are dropped entirely. Default value: false. This option can be set to ‘null’ to clear the default value. Possible values: {true, false}.

  • READ_NAME_REGEX - Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. Set this option to null to disable optical duplicate detection, e.g. for RNA-seq or other data where duplicate sets are extremely large and estimating library complexity is not an aim. Note that without optical duplicate counts, library size estimation will be inaccurate. The regular expression should contain three capture groups for the three variables, in order. It must match the entire read name. Note that if the default regex is specified, a regex match is not actually done, but instead the read name is split on the colon character. For 5 element names, the 3rd, 4th and 5th elements are assumed to be tile, x and y values. For 7 element names (CASAVA 1.8), the 5th, 6th, and 7th elements are assumed to be tile, x and y values (see the example after this list). Default value: . This option can be set to ‘null’ to clear the default value.

  • OPTICAL_DUPLICATE_PIXEL_DISTANCE - The maximum offset between two duplicate clusters in order to consider them optical duplicates. The default is appropriate for unpatterned versions of the Illumina platform. For the patterned flowcell models, 2500 is more appropriate. For other platforms and models, users should experiment to find what works best. Default value: 100. This option can be set to ‘null’ to clear the default value.
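
To make the read-name parsing described above concrete, here is a hypothetical CASAVA 1.8-style (7-element) read name split on colons, with the 5th, 6th and 7th elements taken as the tile, x and y values:

$ echo "EAS139:136:FC706VJ:2:2104:15343:197393" | awk -F: '{print "tile="$5, "x="$6, "y="$7}'

This prints tile=2104, x=15343 and y=197393.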

1.7 Now index the BAM file

$ parallel 'samtools index {}' ::: *_nodups.bam

1.8 Create GATK reference

$ java -jar /data/biotools/linux/picard.jar CreateSequenceDictionary R=HG19_GATKbundle2.8_noDecoys.noChrInName.fa O=HG19_GATKbundle2.8_noDecoys.noChrInName.dict

$ samtools faidx HG19_GATKbundle2.8_noDecoys.noChrInName.fa

1.9 Fix mate information

$ parallel 'java -jar /data/biotools/linux/picard.jar FixMateInformation I={}_nodups.bam O={}_fix.bam SORT_ORDER=coordinate' ::: $(ls *_nodups.bam | rev | cut -c 12- | rev | uniq)

NOTES - This tool ensures that all mate-pair information is in sync between each read and its mate pair. If no OUTPUT file is supplied then the output is written to a temporary file and then copied over the INPUT file (with the original placed in a .old file). Reads marked with the secondary alignment flag are written to the output file unchanged. However, supplementary reads are corrected so that they point to the primary, non-supplemental mate record.

1.10 BQSR

In short, base quality score recalibration (BQSR) adjusts the quality value of each base within the reads, which was initially assigned by the sequencing machine. The goal is for the new values to be closer to the truth.

Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.

The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if none are available for your organism), then it adjusts the base quality scores in the data based on the model. There is an optional but highly recommended step that involves building a second model and generating before/after plots to visualize the effects of the recalibration process. This is useful for quality control purposes.

The BaseRecalibrator tool performs the first step described above: it builds the model of covariation and produces the recalibration table. It operates only at sites that are not in dbSNP; we assume that all reference mismatches we see at other sites are therefore errors and indicative of poor base quality. The tool generates tables based on various user-specified covariates (such as read group, reported quality score, cycle, and context). Assuming we are working with a large amount of data, we can then calculate an empirical probability of error given the particular covariates seen at a site, where p(error) = num mismatches / num observations. The output file is a table (of the several covariate values, number of observations, number of mismatches, empirical quality score).

First we need to re-index our BAM file:

$ parallel 'samtools index {}' ::: *_fix.bam

The first step is BaseRecalibrator:

$ parallel 'java -jar /data/biotools/linux/GenomeAnalysisTK.jar -T BaseRecalibrator -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_fix.bam -knownSites dbsnp_138.hg19.NoChrInNames.vcf -o {}_recal_data.table' ::: $(ls *_fix.bam | rev | cut -c 9- | rev | uniq)

NOTE - Since we used hg19 for our alignment we need to use the known sites of variation for this build of the human reference genome. The recalibration table is written per sample ({}_recal_data.table) so that parallel jobs do not overwrite each other.

The second step is ApplyBQSR:

$ parallel 'gatk ApplyBQSR -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_fix.bam --bqsr-recal-file {}_recal_data.table -O {}_bqsr.bam' ::: $(ls *_fix.bam | rev | cut -c 9- | rev | uniq)
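
The optional before/after plots mentioned at the start of this section can be produced, as a rough sketch, by running BaseRecalibrator on the BAM before and after recalibration and handing both tables to AnalyzeCovariates. Here sample1 is a hypothetical sample prefix, the GATK4 version of the tools is used for both passes so that the two tables are comparable, and R must be available to draw the plots; exact argument names may differ between GATK versions:

$ gatk BaseRecalibrator -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I sample1_fix.bam --known-sites dbsnp_138.hg19.NoChrInNames.vcf -O sample1_before_recal.table

$ gatk BaseRecalibrator -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I sample1_bqsr.bam --known-sites dbsnp_138.hg19.NoChrInNames.vcf -O sample1_after_recal.table

$ gatk AnalyzeCovariates -before sample1_before_recal.table -after sample1_after_recal.table -plots sample1_recal_plots.pdf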

Now you have a good quality BAM file that's ready for variant calling!!!

2. Variant Calling

2.1 HaplotypeCaller

$ parallel 'gatk HaplotypeCaller -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_bqsr.bam -O {}.raw.snp.indel.vcf' ::: $(ls *_bqsr.bam | rev | cut -c 10- | rev | uniq)

NOTES - How HaplotypeCaller works:

Define active regions: The program determines which regions of the genome it needs to operate on, based on the presence of significant evidence for variation.

Determine haplotypes by assembly of the active region: For each ActiveRegion, the program builds a De Bruijn-like graph to reassemble the ActiveRegion and identifies the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.

Determine likelihoods of the haplotypes given the read data: For each ActiveRegion, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data.

Assign sample genotypes: For each potentially variant site, the program applies Bayes’ rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.

PS - When running multiple samples at the same time for the same study (a cohort study), it's best to run HaplotypeCaller in GVCF mode, as sketched below.
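
A hedged sketch of what GVCF mode and the subsequent consolidation and joint genotyping could look like is given below. The per-sample step mirrors the HaplotypeCaller command above with -ERC GVCF added; sample1 and sample2 are hypothetical sample prefixes, and GenomicsDBImport can be used instead of CombineGVCFs for larger cohorts:

$ parallel 'gatk HaplotypeCaller -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -I {}_bqsr.bam -O {}.g.vcf.gz -ERC GVCF' ::: $(ls *_bqsr.bam | rev | cut -c 10- | rev | uniq)

$ gatk CombineGVCFs -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -V sample1.g.vcf.gz -V sample2.g.vcf.gz -O cohort.g.vcf.gz

$ gatk GenotypeGVCFs -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -V cohort.g.vcf.gz -O cohort.raw.snp.indel.vcf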

2.2 Filter Variants

Please note that for studies with larger sample numbers GATK provides the VQSR protocol to mark variants as highly probable or not. However, for small sample numbers hard filtering of variants is sufficient.

$ parallel 'gatk VariantFiltration -R HG19_GATKbundle2.8_noDecoys.noChrInName.fa -V {}.raw.snp.indel.vcf --filter-expression "QUAL < 30.0 || QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "FAILED" -O {}_filtered_snps.vcf' ::: $(ls *.raw.snp.indel.vcf | rev | cut -c 19- | rev | uniq)

Now let's filter out the variants that did not pass the filter.

$ parallel 'bcftools view -Oz -f .,PASS {}_filtered_snps.vcf > {}_pass.vcf.gz' ::: $(ls *_filtered_snps.vcf | rev | cut -c 19- | rev | uniq)
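
As a final optional check (not part of the original exercise), bcftools stats can be used to see how many SNPs and indels survive the PASS filter in each sample; the summary numbers section of its output starts with SN:

$ parallel 'bcftools stats {} | grep "^SN"' ::: *_pass.vcf.gz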