NGS Week 5 Homework Assignment

Task 1:Genotype qualities and Phred-scaled genotype likelihoods (PLs) in VCF

Q1.1 Copy the variant record output from the grep command above and answer the following [ 3 points ]:

1   284891  .   C   T   38.31   .   AC=2;AF=0.043;AN=46;BaseQRankSum=0.967;ClippingRankSum=0.00;DP=82;ExcessHet=3.1175;FS=0.000;InbreedingCoeff=-0.1403;MLEAC=2;MLEAF=0.043;MQ=25.60;MQRankSum=-1.800e-01;QD=3.48;ReadPosRankSum=1.24;SOR=1.112 GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:2,0:2:6:0,6,82  0/0:2,0:2:6:0,6,70  0/0:2,0:2:6:0,6,76  ./.:0,0:0:.:0,0,0   0/0:1,0:1:3:0,3,33  0/0:6,0:6:18:0,18,237   0/0:2,0:2:6:0,6,77  0/0:6,0:6:18:0,18,254   ./.:0,0:0:.:0,0,0   0/0:6,0:6:18:0,18,270   ./.:2,0:2:.:0,0,1   0/0:3,0:3:9:0,9,112 0/0:4,0:4:12:0,12,159   0/1:6,2:8:40:40,0,169   0/0:4,0:4:12:0,12,133   0/0:5,0:5:0:0,0,133 0/0:4,0:4:12:0,12,109   0/0:1,0:1:3:0,3,39  0/1:1,2:3:18:42,0,18    0/0:2,0:2:6:0,6,63  0/0:2,0:2:6:0,6,47  ./.:0,0:0:.:0,0,0   ./.:0,0:0:.:0,0,0   0/0:5,0:5:0:0,0,129 0/0:2,0:2:6:0,6,56  0/0:3,0:3:9:0,9,89  0/0:6,0:6:18:0,18,214   0/0:1,0:1:3:0,3,30

Q1.1a What is the variant quality for this SNP and what is the probability that it it is a false positive (i.e., called in error)?

The variant quality for this SNP is 38.31. The probability that it is a false positive or called in error is 0.000148% which is the probability of error according to the quality score.

Q1.1b How many heterozygous genotypes were called at this site? Answer the question and copy the sample field data (beginning with the GT field and ending with the PL field) for these genotypes.

There are two heterozygous genotypes that were called at this site which is represented as 0/1 in the GT field.

GT:AD:DP:GQ:PL
0/1:6,2:8:40:40,0,169 #first sample
0/1:1,2:3:18:42,0,18 #second sample

Q1.1c How many reads supported the reference and alternate alleles for each heterozygous sample from Q1.1b?

For the first sample, there were a total of 8 reads, with 6 supporting the reference allele and 2 supporting the alternate allele. For the second sample, there were 3 total reads, with 1 supporting the reference allele and 2 supporting the alternate allele.

Q1.1d What are the nucleotide genotypes (in terms of A, T, G, C) of the 0/0 samples? The 0/1 genotypes? The 1/1 genotypes?

The nucleotide genotypes of the 0/0 sample is homozygous alleles C/C. For 0/1 sample, the nucleotide genotype is heterozygous alleles C/T. For the 1/1 sample, the nucleotide genotype is homozygous alleles T/T.

Q1.1e What is the GQ of the heterozygote genotypes you reported above with the lowest probability that the genotype is called an error?

The GQ of the heterozygote genotypes reported above with the lowest probability that the genoype is called an error is 40 (first sample) which is the higher GQ as higher GQ represents high confidence genotype call and thus representing a lower probability of an error.

Q1.1f. A common observation is that the GQs of heterozygote genotypes are higher than GQs of homozygotes (you can convince yourself of this by looking at the above VCF record or other records in the VCF file). Provide an intuitive explanation why this might be.

GQ score aka genotype quality represents the probability that the genotype assignment was called due to an error. The higher the GQ score, the higher the confidence that the genotype assignment was not called due to an error (lower probability of an error). Heterozygote genotypes are often time assigned due to the observation of the two different alleles (reference allele and alternate allele) and this often times will mean that the individual sample has an heterozygote genotype and probably not due to an error. However, for homozygote calls, which is often times the observation of only one allele or base (either homozygous reference or homozygous alternate) from the sample. However, this might be due to sampling of only one parent allele which often times is the problem for low coverage sequencing. There is still that possibility that this individual is heteozygote but this particular sequencing was not able to sequence both alleles/bases due to low coverage. For homozygote genotype calls, there are two possibilities: the individual is truly homozygous for that position or the individual might actually be heterozygous at that position if we were to sequence more deeply. Therefore, heterozyote genotype calls have higher GQs as there is higher confidence that the call is not due to an error as there are two different alleles being sequenced, and homozygote genotypes having a lower GQs as there is a higher chance that this call can be an error since we don’t know if the individual is truly homozygote or they may be heterozygote if we were to sequence more deeply especially during low coverage sequencing.

Task 2:Estimation of genotype likelihoods from BAM

Q2.1 What genotype likelihood model does the above script use to infer GLs? Your answer should be a model not a number. See above ANGSD genotype likelihoods website. [ 1 point ]

The genotype likelihood model the script used to infer GLs is GATK as indicated by -GL 2.

Q2.2. How many genotype likelihoods are there per sample per site in the output file GLs.gz? [ 1 point ]?

In the output file GLs.gz, there are 3 genotype likelihoods per sample per site, one for the major/major genotype, one for the major/minor genotype, and one for the minor/minor genotype which were represented in the output file for each sample per site.

Task 3: Analysis of population structure with genotype likelihoods I

Q3.1 Embed the diagram in your RMarkdown or attach the .pdf to your assignment and answer the question what is plotted on the x-axis? The y-axis? [ 1 point ]

The x-axis represents different individual samples while the y-axis represents the proportion of ancestry assigned to each of the K=3 clusters.

library(tidyverse) 

pop.tbl_df <- read_delim(file = "Demo1pop.info",delim = " ", col_names = FALSE)

## Rows: 30 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- read_delim(file = "Demo1NGSadmix_nowhite.qopt",delim = " ", col_names = F)

## Rows: 30 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)

## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4

names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3")

# Pipe qopt tibble to pivot_longer to "pivot" the data to "long" format
# If you are unfamiliar with pipe syntax in R see here: https://magrittr.tidyverse.org/reference/pipe.html
qopt.tbl_df.long <- qopt.tbl_df %>% 
                      pivot_longer(cols = g1:g3, names_to = 'group', values_to = 'fraction')

# Plot with ggplot
# Here we pipe pivoted ("long") output to ggplot
qopt.tbl_df.long %>%
  ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
    facet_grid(~ pop, scales = "free", space = "free") +   
    theme_bw() +
    theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x = element_text(angle = 90))

Q3.2a Which individuals in the CEU (“Utah residents (CEPH) with Northern and Western European ancestry”) appear to have a second source of ancestry at K=3? Wher [ 1 point ]

The individuals in the CEU that appear to have a second source of ancestry at K=3 are smallNA11831, smallNA11994, and smallNA12763

Q3.2b Based on the populations in this analysis, which population is this additional source of ancestry shared with?

Based on the populations in this analysis, this additional source of ancestry is shared with JPT.

Task 4: Analysis of population structure with genotype likelihoods II

Q4.1 Copy the contents of your script into your assignment document [ 1 point ].

#!/bin/bash
# 
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=NGS_admix
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu


module load ngsadmix/intel/20210224


NGSadmix -likes Demo2input.gz -K 3 -minMaf 0.05 -seed 1 -o Demo2NGSK3admix
NGSadmix -likes Demo2input.gz -K 4 -minMaf 0.05 -seed 1 -o Demo2NGSK4admix
NGSadmix -likes Demo2input.gz -K 5 -minMaf 0.05 -seed 1 -o Demo2NGSK5admix

Q4.2 Plot the results separately for K=3, K=4 and K=5 with ggplot using the example from Task 3 to assist you. Include your R code and plot in your answer. [ 1 point ]

Results for K=3

library(tidyverse) 

pop.tbl_df <- read_delim(file = "Demo2pop.info",delim = " ", col_names = FALSE)

## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- read_delim(file = "Demo2NGSK3admix_nowhite.qopt",delim = " ", col_names = F)

## Rows: 100 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (3): X1, X2, X3
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)

## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4

names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3")


qopt.tbl_df.long <- qopt.tbl_df %>% 
                      pivot_longer(cols = g1:g3, names_to = 'group', values_to = 'fraction')


qopt.tbl_df.long %>%
  ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
    facet_grid(~ pop, scales = "free", space = "free") +   
    theme_bw() +
    theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x = element_text(angle = 90))

Results for K=4

library(tidyverse) 

pop.tbl_df <- read_delim(file = "Demo2pop.info",delim = " ", col_names = FALSE)

## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- read_delim(file = "Demo2NGSK4admix_nowhite.qopt",delim = " ", col_names = F)

## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (4): X1, X2, X3, X4
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)

## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4

names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3","g4")


qopt.tbl_df.long <- qopt.tbl_df %>% 
                      pivot_longer(cols = g1:g4, names_to = 'group', values_to = 'fraction')


qopt.tbl_df.long %>%
  ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
    facet_grid(~ pop, scales = "free", space = "free") +   
    theme_bw() +
    theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x = element_text(angle = 90))

Results for K=5

library(tidyverse) 

pop.tbl_df <- read_delim(file = "Demo2pop.info",delim = " ", col_names = FALSE)

## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- read_delim(file = "Demo2NGSK5admix_nowhite.qopt",delim = " ", col_names = F)

## Rows: 100 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (5): X1, X2, X3, X4, X5
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)

## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4

names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3","g4","g5")


qopt.tbl_df.long <- qopt.tbl_df %>% 
                      pivot_longer(cols = g1:g5, names_to = 'group', values_to = 'fraction')


qopt.tbl_df.long %>%
  ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
    facet_grid(~ pop, scales = "free", space = "free") +   
    theme_bw() +
    theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text.x = element_text(angle = 90))

Q4.3 Review your outputs from Q4.2 and answer the following [ 1 point ]:

Q4.3a Which populations appear to be “pure/unadmixed” at all K values? Select all that apply:

The populations that appear to be “pure/unadmixed” at all K values are CHB and YRI.

Q4.3b Which populations appear as admixed (i.e. has mixed ancestry) at all values of K? Select all that apply.

The populations that appear as admixed (i.e. has mixed ancestry) at all values of K are ASW, CEU, and MXL.

Q4.3C Strict interpretation of ancestry diagrams often depends on identifying an appropriate K (similar to K-means clustering). This is a controversial topic. For assistance with this question and how generally to avoid mistakes when interpreting this type of analysis, see:Lawson et al. 2018. A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots. Nat. Comm. 9: 3258. However, we can often gain insight into ancestry by also considering the totality of the analysis across different K values. Which population do you think is most likely to have internal population structure (e.g., have unrelated individuals with unique sources of ancestry?). Select the single best answer.

MXL population most likely have internal population structure.