Code Check Point: Testing vcfR

This code checkpoint will make sure that you can load .vcf files on your computer. You will also review vocab and concepts related to .vcf files and SNPs.

You will need to load an analyze a .vcf file on the final exam. Make notes on all of this material and include it on your notes sheet.

Learning objectives

This material will appear on the final exam

Know what a .gz file is
know the difference between .vcf and .vcf.gz
Know how to save and load .vcf and .vcf.gz files
Know where meta-data is located in a .vcf file
Know how samples (e.g. people) and SNPs are organized in a file.
Know what what SNP raw genotype data looks in R like when it is extracted from a .vcf file
Know what what SNP numeric data looks in R like when it is extracted from a .vcf file
review working directories and getwd()

Install and load the vcfR

You only need to do this once. Comment it out after it installs.

##install.packages("vcfR", 
                 ##repos = "https://cloud.r-project.org")

Load the vcfR library

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

Work with test data

vcfR comes with a data object that allows you to make sure it can save and reload .vcf and .vcf.gz files.

First, you’ll load a regular R data object with data().

Next, you’ll save this R data object as as .vcf and .vcf.gz files to your working directory.

Finally, your re-load these files to make sure that all the functionality works.

Load and examine test data

Load the test data, called vcfR_test, with data().

# add data() around the object name
data(vcfR_test)

Examine the object:

What type of object is it? Use the function is()

# wrap the object name in is()
is(vcfR_test)

## [1] "vcfR"

Run the name of the object to see what R shows us:

vcfR_test

## ***** Object of Class vcfR *****
## 3 samples
## 1 CHROMs
## 5 variants
## Object size: 0 Mb
## 0 percent missing data
## *****        *****         *****

Now call head() on the object. Note the format.

# add head() and put the object name into it

head(vcfR_test)

## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.3"
## [1] "##fileDate=20090805"
## [1] "##source=myImputationProgramV3.1"
## [1] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"
## [1] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d [Truncated]"
## [1] "##phasing=partial"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM POS       ID          REF   ALT      QUAL FILTER
## [1,] "20"  "14370"   "rs6054257" "G"   "A"      "29" "PASS"
## [2,] "20"  "17330"   NA          "T"   "A"      "3"  "q10" 
## [3,] "20"  "1110696" "rs6040355" "A"   "G,T"    "67" "PASS"
## [4,] "20"  "1230237" NA          "T"   NA       "47" "PASS"
## [5,] "20"  "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT        NA00001          NA00002          NA00003       
## [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
## [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
## [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
## [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
## [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"    
## [1] 
## [1] "Unique GT formats:"
## [1] "GT:GQ:DP:HQ" "GT:GQ:DP"   
## [1]

The top of the .vcf file is *metadata It is hard to read, but what’s it telling you is “data about the data” - background information on the file. In advanced work with .vcf files this information will become very important. We’ll ignore it for now.

The Fixed section of the data tells you about the SNPs that were genotyped in each person or sample. Each SNP is a row. The fixed section tells you the chromosome (CHROM), ID of the SNP, the reference allele, and other information.

The Genotype section gives you the raw .vcf version of the SNP genotypes. Each column is a person or sample.

For more information on .vcf files see https://en.wikipedia.org/wiki/Variant_Call_Format

Save and load .vcf file

To make sure everything works we’ll save then reload a .vcf and .vcf.gz. There’s no reason to expect it not to work, but its always good to check!

.vcf files are plain text files that could be opened in a text editor if you wanted. .vcf.gz files are compressed to save space and make things easier to move around on the internet.

First, check where the file will be saved with getwd(). R calls this your working directory.

# add getwd(); nothing goes inside the parentheses

getwd()

## [1] "/Users/james/Downloads"

Run the code below to save the .vcf file. Note that file = … ends with .vcf. This means were saving an un-compressed version of the file.

write.vcf(vcfR_test, 
          file = "vcfR_test.vcf" )

Run the code below to save the compressed .vcf.gz file. Note that the file name ends with .vcf.gz; this means that vcfR will be doing extra work to format the file in .gz. format. This format cannot be opened without being uncompressed.

write.vcf(vcfR_test, 
          file = "vcfR_test.vcf.gz" )

Load files

vcfR has the ability to both

load .vcf files, and also
uncompress and load .vcf.gz file.

We want to make sure that both functions work on your compute.

Load .vcf file

.vcf files are just text files so there should be no problem with this.

vcfR will output info about the file when it loads it. The second to last line is important. It should say “Processed variant: …” and have a number.

Its important to check this because its possible to download .vcf files that have no variants and not realize it!

vcf01 <- read.vcfR(file = "vcfR_test.vcf.gz" )

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 18
##   header_line: 19
##   variant count: 5
##   column count: 12
## 
Meta line 18 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 5
##   Character matrix gt cols: 12
##   skip: 0
##   nrows: 5
##   row_num: 0
## 
Processed variant: 5
## All variants processed

.vcf.gz files need to be uncompressed. This is a slightly more complex process for vcfR to do but should work unless there’s something weird going on. Email Dr. BRouwer immediately if this doesn’t work.

vcf02 <- read.vcfR(file = "vcfR_test.vcf.gz" )

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 18
##   header_line: 19
##   variant count: 5
##   column count: 12
## 
Meta line 18 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 5
##   Character matrix gt cols: 12
##   skip: 0
##   nrows: 5
##   row_num: 0
## 
Processed variant: 5
## All variants processed

Extract data from .vcf files

.vcf files have LOTS of information in the metadata and “fixed” section that isn’t needed for basic data visualization and analysis. The raw data is also formatted in a complex way. We therefore need to extract and format the data in a away that we and/or R can read and use.

Extact genotypes

The extract.gt() function extracts the original genotype in character format and tells you the diploid genotype of each SNP for each sample, e.g. AA, AT, AG, GG, etc. It separates alleles with a vertical bar like “A|T”.

Run extract.gt() on the object vcf01 and assign it to an object called gt_df

# assigning output to gt_df
gt_df <- extract.gt(vcfR_test,
           element = "GT",
           return.alleles = T)

Look at the output by calling head() on gt_df

head(gt_df)

##            NA00001 NA00002    NA00003
## rs6054257  "G|G"   "A|G"      "A/A"  
## 20_17330   "T|T"   "T|A"      "T/T"  
## rs6040355  "G|T"   "T|G"      "T/T"  
## 20_1230237 "T|T"   "T|T"      "T/T"  
## microsat1  "GTC/G" "GTC/GTCT" "G/G"

Extract numeric genotype scores

Diploid genotypes can be converted to a numeric score that can be used in analyses such as PCA that work with numeric data. These scores are calculated as

0 = homozygous for reference allele
1 = heterozygous (1+0)
2 = homozygous for *minor allele** (1+1)

So if A is the reference allele and G is the minor allele, AA = 0 + 0 = 0, AG = 1, and GG = 2. (Recall that these scores will be scaled during most analyses)

# assign output to gt_score_df
 gt_score_df <- extract.gt(vcfR_test,
           element = "GT",
           as.numeric = T)

Examine the output

gt_score_df

##            NA00001 NA00002 NA00003
## rs6054257        0       1       1
## 20_17330         0       0       0
## rs6040355        1       2       2
## 20_1230237       0       0       0
## microsat1        0       0       1

For more information about converting .vcf file to matrices and dataframes see https://knausb.github.io/vcfR_documentation/matrices.html.

Review questions

what is special about a .gz file?
What is metadata?
Are SNPs organized in rows or columns in a .vcf file?
What line of output when opening a .vcf file in * .vcfR is important to check to make sure you have a .vcf file that actually has data?
How do .vcf files represent genotypes?
How do .vcf files represent numeric genotype scores?
What value does the reference allele take when calculating a genotype score?