Intro

# Load in the vcf data
fn1 <- 'NA12878.bwa.sort.rmdup.realign.bam.vcf'
vcf.na12878 <- read.vcfR(paste(as.dir,fn1,sep='/'),verbose = FALSE )

The .vcf file format is written with meta information on top which defines the abbreviations used in the rest of the file.

For example:

vcf.na12878@meta %>% strwrap %>% head(8)
## [1] "##fileformat=VCFv4.2"                                              
## [2] "##FILTER=<ID=LowQual,Description=\"Low quality\">"                 
## [3] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths"
## [4] "for the ref and alt alleles in the order listed\">"                
## [5] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate"   
## [6] "read depth (reads with MQ=255 or with bad mates are filtered)\">"  
## [7] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype"      
## [8] "Quality\">"

We see that GT is the code associated with the Genotype. Below the top section the data are tabular, with the first eight columns in the fix section containing common information for each possible variant.

getFIX(vcf.na12878) %>% head
##      CHROM  POS        ID REF ALT QUAL      FILTER
## [1,] "chr1" "17707739" NA "T" "C" "1002.77" NA    
## [2,] "chr1" "17707798" NA "C" "T" "994.77"  NA    
## [3,] "chr1" "17708349" NA "G" "A" "1458.77" NA    
## [4,] "chr1" "17708928" NA "G" "A" "1684.77" NA    
## [5,] "chr1" "17709002" NA "C" "T" "1157.77" NA    
## [6,] "chr1" "17710631" NA "C" "G" "1009.77" NA

Question 1

The columns in the .vcf file relate to the genotype information. We see that we have information for allelic depths (AD), genotype (GT), and so on.

gt.order <- unlist(strsplit(vcf.na12878@gt[1,1],':'))
gt.vcf <- vcf.na12878@gt %>% data.frame %>% separate(NA12878,into=gt.order,sep=':')
head(gt.vcf)
##           FORMAT  GT   AD DP GQ         PL
## 1 GT:AD:DP:GQ:PL 1/1 0,30 30 84  1031,84,0
## 2 GT:AD:DP:GQ:PL 1/1 0,30 31 81  1023,81,0
## 3 GT:AD:DP:GQ:PL 1/1 0,39 40 99 1487,114,0
## 4 GT:AD:DP:GQ:PL 1/1 0,46 46 99 1713,132,0
## 5 GT:AD:DP:GQ:PL 1/1 0,37 38 96  1186,96,0
## 6 GT:AD:DP:GQ:PL 1/1 0,29 29 84  1038,84,0

According to our reference document:

GT genotype, encoded as alleles values separated by either of “/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc.

As our data shows below, our potential variants have alleles on both (1/1) or one (0/1) of their nucleotides.

gt.vcf$GT %>% table
## .
## 0/1 1/1 
##  99 301

Question 2

Additional annotation information can be found in the INFO column of the FIX section. For example, here is the additional information seen for the second potential variant, and we can see information such as the depth across (DP) or the haplotype score (HaplotypeScore).

vcf.na12878@fix[,'INFO'][2] %>% str_split_fixed(';',14) %>%
  str_split('=') %>% do.call('rbind',.) %>% data.frame %>%
  set_colnames(c('keys','values'))
##              keys values
## 1              AC      2
## 2              AF   1.00
## 3              AN      2
## 4              DP     31
## 5            Dels   0.00
## 6       ExcessHet 3.0103
## 7              FS  0.000
## 8  HaplotypeScore 1.9312
## 9           MLEAC      2
## 10          MLEAF   1.00
## 11             MQ  60.00
## 12            MQ0      0
## 13             QD  33.16
## 14            SOR  0.693

Question 3

vcf.na12878@meta %>% str_subset(str_c('ID=','QD|FS|MQ',',')) %>%
  str_split('\\"') %>% lapply(.,function(q) q[2]) %>% unlist %>%
  set_names(c('QD','FS','MQ'),.)
## Phred-scaled p-value using Fisher's exact test to detect strand bias 
##                                                                 "QD" 
##                                                  RMS Mapping Quality 
##                                                                 "FS" 
##                                  Variant Confidence/Quality by Depth 
##                                                                 "MQ"

We see that QD stands for Variant Confidence Quality by depth, FS stands for Fisher’s exact test to detect strand bias, and MQ stands for RMS Mapping Quality.

Questions 4 and 5

# Load in the snpEff .vcf file
snpEff <- read.vcfR('NA12878.bwa.sort.rmdup.realign.bam.filter.snpeff.vcf',verbose = F)
snpInfo <- snpEff@fix[,'INFO']
sapply(c('HIGH','MODERATE','LOW','MODIFIER'),function(ss) str_detect(snpInfo,ss) %>% sum)
##     HIGH MODERATE      LOW MODIFIER 
##        1        1        1      246

We see that there was one variant which had a high impact and one variant which had a low impact. To determine the effect categories, we query these observations, as the effect categories are found in the entry right after the allele.

high.var <- str_subset(snpInfo,'HIGH') %>% str_split('ANN=A\\||\\|HIGH') %>% unlist %>% extract2(2)
mod.var <- str_subset(snpInfo,'MODERATE') %>% str_split('ANN=C\\||\\|MODERATE') %>% unlist %>% extract2(2)
high.var;mod.var
## [1] "splice_acceptor_variant&splice_donor_variant&intron_variant"
## [1] "missense_variant"

We see that the high-impact variant is associated with: (1) a splice variant that changes the 2 base region at the 3’ end of an intron (splice_acceptor_variant), (2) a splice variant that changes the 2 base pair region at the 5’ end of an intron (splice_donor_variant), and (3) a transcript variant occurring within an intron (intron_variant). Whereas the moderate-impact variant is associated with a missense variant: a sequence variant, that changes one or more bases, resulting in a different amino acid sequence but where the length is preserved.1

Question 6

The third column of our NA12878.bwa.sort.rmdup.realign.bam.filter.snpeff.vcf is supposed to contain the dbSNP rsIDS, but as we see there are no values populating it.

snpEff@fix[,3] %>% head
## [1] NA NA NA NA NA NA

Question 7

dbSNP <- read.vcfR('NA12878.bwa.sort.rmdup.realign.bam.filter.snpeff.dbsnp.vcf',verbose = F)
dbFix <- dbSNP@fix %>% data.frame
dbFix[,-8] %>% filter(FILTER=='PASS') %>% 
  mutate(ID2=ifelse(is.na(ID),'noDB','YesDB')) %>% 
  use_series(ID2) %>% table %>% prop.table %>% round(3)
## .
##  noDB YesDB 
## 0.046 0.954

We see that 95.4% of the variants that passed all the filters were also in dbSNP.

Question 8

Below we show three of the eighteen variants that are likely variants but that are not in dbSNP. For example, there is one in position 17753255 (on Chromosome 1).

dbFix[,-8] %>% filter(FILTER=='PASS' & is.na(ID)) %>% head(3)
##   CHROM      POS   ID            REF ALT    QUAL FILTER
## 1  chr1 17753255 <NA> CTTTTTTTTTTTTT   C 1242.73   PASS
## 2  chr1 17756984 <NA>              C  CA  156.74   PASS
## 3  chr1 17758018 <NA>              G   A   43.77   PASS


  1. Definitions retrieved from the sequence ontology terms here.