We have finally arrived at a point where we can use our data to identify DNA polymorphisms! DNA polymorphisms are commonly referred to as variants. Most of the time when people are talking about variants they are talking about SNPS; however, DNA polymorphisms come in many additional forms including, indels, rearrangements, and epimutations created by methylation. Studying the frequency, location and distribution of variants in the genome, types of variants, and inheritance of these variants can provide biological insight and the raw material needed for marker-trait associations. Discovering variants in a population is often only the first step in analysis and is typically followed by inferring genotypes. In other words, once a SNP is known to exist in a population we usually want to know what the genotype of each individual is at that SNP locus. We will discuss both variant discovery and genotype inference in this module.
Calling variants can be achieved using many different technologies. The most high throughput method for discovering variants is through sequencing. Variants are called from read data by piling up reads that are derived from the same genomic region and comparing sequence for variants at each position. Discovering variants in this way can be achieved with or without a reference genome.
In order to identify variants, it is necessary to know which parts of reads correspond to other parts of reads. In other words, we need to be able to contrast the same base across reads to identify SNPs. This is easy when there is a lot of overlap between reads. For example, if two 75 bp reads overlap by 72 bp it is easy to match reads to each other and call variants by comparing stacks of reads that are likely from the same genomic region. This approach is referred to as reference free variant calling because reads are not aligned to a reference genome first, rather they are matched against each other to find variants. Popular reference free variant pipelines include Stacks and Uneak. When working with a non-model organism that has no reference genome the reference free approach is the only approach. The reference free approach is illustrated below.
Although we can call variants without a genome, we cannot ascertain the physical location of the variants. If genotyping is done on a standard mapping population then it is possible to orient the SNPS based on genetic distance inferred from recombination.
When using a reference genome to discover variants, reads are aligned to the genome and bases with the same genomic coordinates are scanned to find variants. This is illustrated below where discovery of a C/A SNP is shown.
It is possible to call variants in a single sample. However, with one individual only variants within the individual and between the individual and the reference genome can be detected. More variants can usually be detected by sequencing a cohort, also referred to as a population, of individuals. Single sample and cohort variant calling are illustrated below. In the cohort example, an additional G/T variant is detected. In addition to being able to detect more variants in a cohort, sequencing a population of individuals has the added advantage of allowing for the use of population genetics to filter variants, which will be discussed in further detail later in this module.
It is common practice to pool all available read data to discover variants. Once variants have been discovered we are often interested in knowing the genotype of every individual, which requires demultiplexing the data and looking at the read data within each individual at each variant position identified in the discovery step. Recall the cohort variant discovery pictured above. In this example all reads colored green belong to Sample 1 and the black reads are derived from Sample 2. Based on this data we would assign Sample 1 the genotype C/C at this SNP locus and Sample 2 A/A. However, what if we didn’t sequence our samples at high enough coverage? We only have three reads per sample at this SNP locus. What if we sequenced more? Would we have detected a read containing the “A” allele in Sample 1? Does one read with an “A” and three reads with a “C” at this position in Sample 1 provide enough evidence to give Sample 1 the genotype C/A? This brings us to the subject of variant/genotype filtering.
Filtering variants and genotypes is a crucial step for most genetic analyses. In most variant discovery pipelines, it is standard to throw away all variants that are supported by only a few reads or contain only a few alternative reads. What does this mean? Consider the two SNP loci pictured below.
After pooling your sample read data, SNP 1 is only covered by two reads. It is likely that this SNP would be discarded, as there is simply not enough evidence that this region contains a SNP. It is more likely that a SNP detected in a region covered by two reads is a false positive created by alignment errors in the mapping step than a true SNP. The second SNP detected, SNP 2, has four times the coverage of SNP 1, but only one of the eight reads supports the presence of a SNP at this location. It could easily be that this SNP was created by a sequencing error in one of the reads and thus this SNP would also be discarded. Had more reads with the alternative “T” allele been detected the SNP would be more trustworthy.
Okay, you have thrown away SNPs that are covered by few reads or have only a few reads supporting an alternative allele. Now genotypes must be inferred for all samples with the set of SNPs that have survived filtering. This brings us to the questions brought up in the inferring genotypes section. As alluded to, coverage is key to genotype inference as well. For example, we could say that a genotype can only be inferred for a sample if it has ten reads at the SNP locus. If the sample doesn’t have a coverage of ten at a particular SNP locus we don’t infer a genotype for that sample and put an “NA” for missing data in place of a genotype for that SNP in that sample. For SNP loci that have ten or more reads, we then have to determine whether the genotype is homozygous or heterozyous. If a sample is heterozygous for a SNP then, in a perfect world, there should be an equal proportion of reads with each SNP. For example, if a sample has a coverage of ten at a locus then five reads should have one allele and five should have the other. Unfortunately, this perfect ratio is rarely seen, so in practice a more forgiving ratio is often chosen for deciding whether a sample is heterozygous or not.
Once SNPs have been filtered and genotypes have been inferred it is possible to apply one more filter, the filter of genetics. How can genetics be used to further filter variants? Imagine you have a population of RILs and you found a bunch of SNPs that are heterozygous in you samples or you are working in an F2 population and some SNP loci are segregating at a ratio of 10:1:1 instead of 1:2:1. Obviously, there is something strange going on with these variants, as they do not seem to be following the rules of genetics, so in the examples provided the trouble SNPs would be removed from analysis. There are many more genetic principles that can be used to filter variants( inbreeding coefficient, HWE, LD etc.); however, you must understand the nature of the population you are studying to understand how genetics can be used to filter your data set.
As always, RNA-Seq data requires some special consideration. When comparing RNA-Seq data with DNA-Seq data it is not uncommon to see that RNA-Seq data will call homozygous genotypes at SNPs that are called as heterozygous in DNA-Seq data from the same sample. In DNA-Seq, two alleles at a heterozygous locus are present in equal proportions; however, in RNA-seq data reads from one allele can be much more abundant than reads from the other allele. This is often due to allele specific expression, which is when one allele is preferentially expressed. Allele specific expression makes it more likely that only one of the alleles will be sampled when sequencing and often leads to spurious homozygous genotype inferences. Allele specific expression is a biological phenomena worthy of study in its own right, but is a nuisance when trying to use RNA-seq for genotyping. Consider the image below:
Keep in mind that we don’t know that the sample is heterozygous, we only have the read data to try and infer the genotype. The “C” allele is preferentially expressed making it seem that the sample is homozygous. As seen from this example, additional precaution is needed when interpreting variants discovered from RNA-Seq data.
The unfortunate truth is that many variant filtering practices involve hard filtering, which is a term used to describe filtering methods that arbitrarily set thresholds (like coverage or ratio to call a genotype heterozygous) to filter variants. The use of genetics as a filter is more principled, but still requires making assumptions and also requires an intimate knowledge of the population the data was generated from. In practice, many people make use of more sophisticated methods that rely on statistical modeling and machine learning, such as the GATK Variant Score Recalibrator. Most of these methods involve assigning a probability to genotypes and filtering is applied to remove low probability genotypes.
Variant and genotype data are usually stored in a Variant Call Format file commonly abbreviated as VCF file. There are eight mandatory fields in a VCF file and you can read a description of the various fields here. VCF files are very flexible as the can store any type of polymophism (SNP, indel, microsatillite etc..) and additional metadata, such as some annotation data affiliated with the polymophism. Additionally, the type of information contained in the INFO and FORMAT field can change depending on the pipeline used to discover variants. However, the flexibility of VCF files also makes them very difficult to work with! An example of a VCF file is shown below.
Rows starting with “##” are meta-information lines. Additional rows in the VCF file correspond to different variants and contain information about the variant in the first eight columns followed by the genotype for that variant for every sample used in variant discovery.
In the tomato QTL-Seq study, there are no genotypes for individual plants as plants from both extremes have been bulked together so that we have two DNA samples derived from two pools of many plants. Therefore, we will not be able to infer genotypes. Instead we will discover variants in each pool and then estimate allele frequency using the read data rather than infer genotypes. The reference allele frequency for each SNP in each pool is estimated from read coverage data. For example if we have 100 reads for a SNP in one of the pools and twenty reads support the reference allele and 80 support the other allele, then the reference allele frequency is 0.2 for this SNP. The reference allele in the tomato QTL-Seq study is from S. lycopersicum and the alternative allele is assumed to be from S. galapagense.
In order to call variants in our data set both bulk pools were trimmed and aligned to the tomato genome with HISAT. SAMtools was used to create a pileup file, a concise summary of reads matched to the same genomic coordinates, from the BAM file produced by HISAT. The tool Varscan was then used to generate a VCF file from the pileup file. Below is a screenshot of the Varscan menu in Galaxy.
Please go to this tool and examine the options that Varscan uses for filtering variants and inferring genotypes. Please answer questions 1-7 and submit via email to ch728.
1. Why would you filter variants based on minimium depth (coverage)?
2. The minimium supporting read option is used to filter SNPs that only have a few reads supporting one of the alleles. Why would there be only a few reads supporting a SNP alelle?
3. Would you expect to detect fewer or more SNPs if you allowed lower quality bases to be used for variant discovery? What is the danger in allowing lower quality bases to be included in variant discovery?
4. The minimium variant allele frequency threshold refers to the minimium allowable MAF for a variant to be retained. How is MAF calculated for a SNP? Why would a SNP have a low MAF? Give a technical and biological explanation for low MAF SNPS.
5. Why would it likely be a bad idea to set the minimum frequency to call a homozygote to one?
Search “Plant Genomic Approaches: Module 3” in the shared histories and import the variants.vcf file into your history.
After you load the variants.vcf run the “VCFtoTab-delimited” tool under “NGS: VCF Manipulation” on the vcf. This will transform your vcf file into a tab-delimited file that will be easier to work with for further steps.
Next choose the “Select” tool under “Filter and sort” and run it on the file created in the previous step (the file created by VCFtoTab-delimited). Your input should look something like this:
Now you should have a file containing only information from one of the two pools (Pool A). Use the “Scatterplot” tool under “Graph/Display Data” to generate a graph of the reference allele average base quality vs. the alternative allele average base quality. The input should look like this:
6. Looking at the scatter plot produced, are there any cases where the average quality at the alternative allele base and the reference allele base are very different from each other? What might this indicate? Do you trust these SNPs? Please download the scatter plot you produced and paste it into your answer document.
7. Pick a column in your tab-delimited vcf file that contains information that you could use to filter your SNPS (other than the RBQ and ABQ column) and describe how you would use it to filter your SNPs.