1 284891 . C T 38.31 . AC=2;AF=0.043;AN=46;BaseQRankSum=0.967;ClippingRankSum=0.00;DP=82;ExcessHet=3.1175;FS=0.000;InbreedingCoeff=-0.1403;MLEAC=2;MLEAF=0.043;MQ=25.60;MQRankSum=-1.800e-01;QD=3.48;ReadPosRankSum=1.24;SOR=1.112 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,82 0/0:2,0:2:6:0,6,70 0/0:2,0:2:6:0,6,76 ./.:0,0:0:.:0,0,0 0/0:1,0:1:3:0,3,33 0/0:6,0:6:18:0,18,237 0/0:2,0:2:6:0,6,77 0/0:6,0:6:18:0,18,254 ./.:0,0:0:.:0,0,0 0/0:6,0:6:18:0,18,270 ./.:2,0:2:.:0,0,1 0/0:3,0:3:9:0,9,112 0/0:4,0:4:12:0,12,159 0/1:6,2:8:40:40,0,169 0/0:4,0:4:12:0,12,133 0/0:5,0:5:0:0,0,133 0/0:4,0:4:12:0,12,109 0/0:1,0:1:3:0,3,39 0/1:1,2:3:18:42,0,18 0/0:2,0:2:6:0,6,63 0/0:2,0:2:6:0,6,47 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:5,0:5:0:0,0,129 0/0:2,0:2:6:0,6,56 0/0:3,0:3:9:0,9,89 0/0:6,0:6:18:0,18,214 0/0:1,0:1:3:0,3,30
The variant quality for this SNP is 38.31. The probability that it is a false positive or called in error is 0.000148% which is the probability of error according to the quality score.
There are two heterozygous genotypes that were called at this site which is represented as 0/1 in the GT field.
GT:AD:DP:GQ:PL
0/1:6,2:8:40:40,0,169 #first sample
0/1:1,2:3:18:42,0,18 #second sample
For the first sample, there were a total of 8 reads, with 6 supporting the reference allele and 2 supporting the alternate allele. For the second sample, there were 3 total reads, with 1 supporting the reference allele and 2 supporting the alternate allele.
The nucleotide genotypes of the 0/0 sample is homozygous alleles C/C. For 0/1 sample, the nucleotide genotype is heterozygous alleles C/T. For the 1/1 sample, the nucleotide genotype is homozygous alleles T/T.
The GQ of the heterozygote genotypes reported above with the lowest probability that the genoype is called an error is 40 (first sample) which is the higher GQ as higher GQ represents high confidence genotype call and thus representing a lower probability of an error.
GQ score aka genotype quality represents the probability that the genotype assignment was called due to an error. The higher the GQ score, the higher the confidence that the genotype assignment was not called due to an error (lower probability of an error). Heterozygote genotypes are often time assigned due to the observation of the two different alleles (reference allele and alternate allele) and this often times will mean that the individual sample has an heterozygote genotype and probably not due to an error. However, for homozygote calls, which is often times the observation of only one allele or base (either homozygous reference or homozygous alternate) from the sample. However, this might be due to sampling of only one parent allele which often times is the problem for low coverage sequencing. There is still that possibility that this individual is heteozygote but this particular sequencing was not able to sequence both alleles/bases due to low coverage. For homozygote genotype calls, there are two possibilities: the individual is truly homozygous for that position or the individual might actually be heterozygous at that position if we were to sequence more deeply. Therefore, heterozyote genotype calls have higher GQs as there is higher confidence that the call is not due to an error as there are two different alleles being sequenced, and homozygote genotypes having a lower GQs as there is a higher chance that this call can be an error since we don’t know if the individual is truly homozygote or they may be heterozygote if we were to sequence more deeply especially during low coverage sequencing.
The genotype likelihood model the script used to infer GLs is GATK as indicated by -GL 2.
In the output file GLs.gz, there are 3 genotype likelihoods per sample per site, one for the major/major genotype, one for the major/minor genotype, and one for the minor/minor genotype which were represented in the output file for each sample per site.
The x-axis represents different individual samples while the y-axis represents the proportion of ancestry assigned to each of the K=3 clusters.
library(tidyverse)
pop.tbl_df <- read_delim(file = "Demo1pop.info",delim = " ", col_names = FALSE)
## Rows: 30 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- read_delim(file = "Demo1NGSadmix_nowhite.qopt",delim = " ", col_names = F)
## Rows: 30 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (3): X1, X2, X3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)
## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4
names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3")
# Pipe qopt tibble to pivot_longer to "pivot" the data to "long" format
# If you are unfamiliar with pipe syntax in R see here: https://magrittr.tidyverse.org/reference/pipe.html
qopt.tbl_df.long <- qopt.tbl_df %>%
pivot_longer(cols = g1:g3, names_to = 'group', values_to = 'fraction')
# Plot with ggplot
# Here we pipe pivoted ("long") output to ggplot
qopt.tbl_df.long %>%
ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
facet_grid(~ pop, scales = "free", space = "free") +
theme_bw() +
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.text.x = element_text(angle = 90))
The individuals in the CEU that appear to have a second source of ancestry at K=3 are smallNA11831, smallNA11994, and smallNA12763
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=2:00:00
#SBATCH --mem=8GB
#SBATCH --job-name=NGS_admix
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=bl2477@nyu.edu
module load ngsadmix/intel/20210224
NGSadmix -likes Demo2input.gz -K 3 -minMaf 0.05 -seed 1 -o Demo2NGSK3admix
NGSadmix -likes Demo2input.gz -K 4 -minMaf 0.05 -seed 1 -o Demo2NGSK4admix
NGSadmix -likes Demo2input.gz -K 5 -minMaf 0.05 -seed 1 -o Demo2NGSK5admix
Results for K=3
library(tidyverse)
pop.tbl_df <- read_delim(file = "Demo2pop.info",delim = " ", col_names = FALSE)
## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- read_delim(file = "Demo2NGSK3admix_nowhite.qopt",delim = " ", col_names = F)
## Rows: 100 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (3): X1, X2, X3
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)
## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4
names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3")
qopt.tbl_df.long <- qopt.tbl_df %>%
pivot_longer(cols = g1:g3, names_to = 'group', values_to = 'fraction')
qopt.tbl_df.long %>%
ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
facet_grid(~ pop, scales = "free", space = "free") +
theme_bw() +
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.text.x = element_text(angle = 90))
Results for K=4
library(tidyverse)
pop.tbl_df <- read_delim(file = "Demo2pop.info",delim = " ", col_names = FALSE)
## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- read_delim(file = "Demo2NGSK4admix_nowhite.qopt",delim = " ", col_names = F)
## Rows: 100 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (4): X1, X2, X3, X4
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)
## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4
names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3","g4")
qopt.tbl_df.long <- qopt.tbl_df %>%
pivot_longer(cols = g1:g4, names_to = 'group', values_to = 'fraction')
qopt.tbl_df.long %>%
ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
facet_grid(~ pop, scales = "free", space = "free") +
theme_bw() +
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.text.x = element_text(angle = 90))
Results for K=5
library(tidyverse)
pop.tbl_df <- read_delim(file = "Demo2pop.info",delim = " ", col_names = FALSE)
## Rows: 100 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## chr (2): X1, X2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- read_delim(file = "Demo2NGSK5admix_nowhite.qopt",delim = " ", col_names = F)
## Rows: 100 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: " "
## dbl (5): X1, X2, X3, X4, X5
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
qopt.tbl_df <- bind_cols(pop.tbl_df,qopt.tbl_df)
## New names:
## * X1 -> X1...1
## * X2 -> X2...2
## * X1 -> X1...3
## * X2 -> X2...4
names(qopt.tbl_df) <- c("pop","sample","g1","g2","g3","g4","g5")
qopt.tbl_df.long <- qopt.tbl_df %>%
pivot_longer(cols = g1:g5, names_to = 'group', values_to = 'fraction')
qopt.tbl_df.long %>%
ggplot(aes(x=sample,y=fraction,fill=group)) + geom_col(color = "gray", size = 0.1) +
facet_grid(~ pop, scales = "free", space = "free") +
theme_bw() +
theme(panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
axis.text.x = element_text(angle = 90))
The populations that appear to be “pure/unadmixed” at all K values are CHB and YRI.
The populations that appear as admixed (i.e. has mixed ancestry) at all values of K are ASW, CEU, and MXL.