Now that we have a VCF file full of variants, it’s time to finally do some genetics! Our data is from a study recently published in Nature Genetics and is available here. One of the goals of this study was to determine the genetic basis for day-length-sensitive flowering in tomato, which is a super important trait! One of the approaches used in the study was QTL-Seq, the NGS version of bulk segregant analysis. In this module we will carry out a typical QTL-Seq analysis.
Search for Plant Genomic Approaches Module 4 under shared histories and import the data. Two files have been provided: VCF-tab-delimited.txt and Window_data.
The VCF-tab-delimited file has the alternative allele frequency (S.gal) for every SNP and each pool (early and late). Let’s start by splitting the file by pool using the Select tool under Filter and Sort. In the example below, we select the early pool and output a separate file. You will need to do the same for the late pool by matching “Late”.
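For readers who want to see what the Select step does outside of Galaxy, here is a minimal Python sketch. The rows and the pool labels "Early" and "Late" are made up for illustration; the real file has more columns, and the function name `select_lines` is hypothetical.

```python
import re

def select_lines(lines, pattern):
    """Keep only the lines that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

# Made-up rows: position, pool name, alternative allele frequency.
rows = [
    "1000\tEarly\t0.25",
    "1000\tLate\t0.75",
    "2000\tEarly\t0.50",
    "2000\tLate\t0.50",
]

early = select_lines(rows, "Early")  # one "file" per pool
late = select_lines(rows, "Late")
```

As with the Galaxy tool, the pattern is matched anywhere in the line, so it should be specific enough that it only hits the pool label.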
Next, use the second Cut function under Text Manipulation to get the third, fourteenth, and twentieth columns from the two files produced in the last step. This will give you two files with the SNP position in the first column, the sample name in the second column, and the alternative allele frequency in the last column.
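The Cut step amounts to keeping a few tab-separated fields from every line. A rough Python equivalent is sketched below; the 20-column row is a placeholder rather than real data, and `cut_columns` is a hypothetical helper name.

```python
def cut_columns(lines, columns, delimiter="\t"):
    """Return the requested 1-based columns of each line, in the given order."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        out.append(delimiter.join(fields[c - 1] for c in columns))
    return out

# Placeholder row with 20 fields named f1..f20.
row = "\t".join(f"f{i}" for i in range(1, 21))

# Keep columns 3, 14, and 20, as in the Galaxy step.
result = cut_columns([row], [3, 14, 20])
```

Column numbers are 1-based here to match how Galaxy labels them (c3, c14, c20).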
After this, join both files on the first column (chromosome position) using the Join two Datasets tool under Join, Subtract and Group.
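To make the join concrete, the sketch below pairs lines from two small made-up tables that share the same first field. It assumes the joined output keeps all columns from both inputs, which gives six columns per line; `join_on_first_column` is a hypothetical name, not a Galaxy function.

```python
def join_on_first_column(left, right, delimiter="\t"):
    """Inner join: pair lines from `left` and `right` that share a first field."""
    right_by_key = {}
    for line in right:
        key = line.split(delimiter)[0]
        right_by_key.setdefault(key, []).append(line)

    joined = []
    for line in left:
        key = line.split(delimiter)[0]
        for match in right_by_key.get(key, []):
            joined.append(line + delimiter + match)
    return joined

# Made-up three-column tables: position, pool, allele frequency.
early = ["1000\tEarly\t0.25", "2000\tEarly\t0.50", "3000\tEarly\t0.10"]
late = ["1000\tLate\t0.75", "2000\tLate\t0.50"]

joined = join_on_first_column(early, late)
```

Position 3000 is dropped because it has no partner in the second table, so only SNPs present in both pools are carried forward.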
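We now need to subtract the early pool allele frequency from the late pool allele frequency to get the SNP-index. We can do this by using the Text reformatting tool under Text Manipulation with the command {print $1"\t"$6-$3} in the AWK program box, run on our joined file, as shown below. Note that the quotes around \t must be plain straight quotes, and that $3 holds the early pool frequency and $6 the late pool frequency only if the early file was the first input to the join.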
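The same subtraction can be written in Python, which may help when checking a few values by hand. This is only a sketch on made-up joined lines; it assumes the six-column layout described above (early pool in column 3, late pool in column 6), and `snp_index_difference` is a hypothetical name.

```python
def snp_index_difference(joined_lines, delimiter="\t"):
    """Return (position, late - early) for each six-column joined line."""
    out = []
    for line in joined_lines:
        fields = line.rstrip("\n").split(delimiter)
        early_freq = float(fields[2])  # column 3, awk's $3
        late_freq = float(fields[5])   # column 6, awk's $6
        out.append((fields[0], late_freq - early_freq))
    return out

# Made-up joined lines: early columns first, then late columns.
joined = [
    "1000\tEarly\t0.25\t1000\tLate\t0.75",
    "2000\tEarly\t0.50\t2000\tLate\t0.50",
]

differences = snp_index_difference(joined)
```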
Lastly, let’s plot the raw data using the Scatterplot function under Graph/Display Data.
Next, we will plot the window average of the SNP-index at the end of chromosome 5 (to compare with the publication). To do this, run the Scatterplot tool on the Window_data file. Windows were chosen using GenWin.
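To illustrate what a window average is, the sketch below averages values in fixed-width, non-overlapping windows. This is a deliberately simpler substitute and not the method behind Window_data: GenWin chooses window boundaries from the data, whereas the window size of 1000 and the points here are made up.

```python
from collections import defaultdict

def window_means(points, window_size):
    """Average values in fixed-width, non-overlapping windows.

    `points` is an iterable of (position, value) pairs. Returns a list of
    (window_start, mean_value) tuples sorted by window start.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, value in points:
        start = (pos // window_size) * window_size
        sums[start] += value
        counts[start] += 1
    return [(start, sums[start] / counts[start]) for start in sorted(sums)]

# Made-up (position, SNP-index) pairs.
points = [(100, 0.5), (900, 0.25), (1500, -0.5), (1700, 0.0)]

averages = window_means(points, window_size=1000)
```

Each window is reported by its start coordinate; windows that contain no SNPs are simply absent from the output.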
Using the two graphs you produced, the paper, and your mental prowess, please answer questions 1-6 and email the document containing your answers to ch728@cornell.edu
1. Describe the first graph you produced. What is being plotted? What regions seem to be associated with the trait? Please copy and paste this graph into your answer document.
2. What are possible sources of noise that could influence SNP-index estimates?
3. Why are windows often used in QTL-Seq analysis?
4. Describe the second graph you produced. What is being plotted? Do these results seem to match the results from the study? Please copy and paste this graph into your answer document.
5. Pretend you have detected a genic region that has an average SNP-index of 1. What can you say about this region? What would you say if it had an average SNP-index of -1? What if it had an average SNP-index of 2?
6. Over the course of these modules you have learned about the main steps used to discover variants and carry out QTL-Seq analysis. Using the information provided in the paper, describe the methodology the authors used to prepare and sequence their NGS libraries. Additionally, describe the bioinformatic pipeline the authors used. In other words, what tools did they use for QC, mapping, and variant calling? What version of the tomato genome did they use? How did they filter variants? Did they calculate window averages, and if so, how did they choose window size? Are there details the authors didn’t provide that you feel they should have? Please do not copy and paste anything directly from the paper; use your own words.