Preparing VCF Data for Analysis: Transposition

NOTE - before you begin, make sure your WORKING DIRECTORY is set to the location of the .vcf file being used.

Learning objectives

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Set a working directory and confirm a file is present with getwd() and list.files(pattern = ...)
Know what it means to reshape data
Learn what transposition of data means
Know what SNP data in VCF form needs to be transposed
Know which function in R does transposition

Review

What does “VCF” stand for?
What is the general structure of a VCF file?
What is the file extension of a VCF file?
Where is the metadata of VCF file
What the extract.gt() command does
Why we don’t use na.omit() on SNPs

Preliminaries

We need just one package for this lesson, vcfR.

library(vcfR)

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

In this lesson you will be using a .vcf file called “all_loci.vcf”.

You must make sure that R has its working directory set to where this file is located.

The easiest way to do this is

Save “all_loci.vcf” to whichever folder this file .Rmd file is in
On the top RStudio menu, click on “Session”, then “Set Working Directory”, then “To Source File Location”.

Check the location of your working directory with getwd()

# run getwd()
        getwd(        )

## [1] "C:/Users/Grief Mage/Documents"

Check for the presence of the “all_loci.vcf” file in the working directory with list.files()

list.files()

##  [1] "all_loci.vcf"                    "cover letter.pdf"               
##  [3] "desktop.ini"                     "Dolphin Emulator"               
##  [5] "FPSMonitor.txt"                  "GitHub"                         
##  [7] "IEF essay 2.txt"                 "League of Legends"              
##  [9] "My Games"                        "My Music"                       
## [11] "My Pictures"                     "My Videos"                      
## [13] "my_snps"                         "R files"                        
## [15] "rsconnect"                       "transpose_VCF_data.Rmd"         
## [17] "vcfR_test.vcf"                   "vcfR_test.vcf.gz"               
## [19] "walsh2017morphology.csv"         "working_directory_practice.html"
## [21] "working_directory_practice.Rmd"  "Zoom"

                  # todo

If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = “vcf”)

# Run list.files() with pattern = "vcf
                   # todo
list.files(pattern = "vcf")

## [1] "all_loci.vcf"     "vcfR_test.vcf"    "vcfR_test.vcf.gz"

Introduction

SNP data is often stored in Variant Call Format (VCF) files that are organized differently than normal R data. When SNP data is analyzed in R, such as by doing PCA or cluster analysis, the SNPs are usually the features (variables) and each row is a sample (such as a person, or other organism). VCF files, however, are organized with SNPs in rows and samples in columns.

Changing how data is arranged is called reshaping. Reshaping can take different forms. In this case we need to flip the orientation of the data so that the rows, SNPS, become the columns.

In R we can flip the orientation of matrix or dataframe using t() operation, which stands for transpose. The transpose of a dataset takes each row and makes it into a column. For vcf data, this means that the SNPs in rows can be made into columns.

Example

The example below will illustrate the general principle of transposition of data. First, let’s make two vectors of data in vcf-like format, with genotypes expressed as “A|A”.

# Add the c() function to make 
## vectors containing SNP genotypes

SNP01 <- c("A|A","A|A","A|A","A|A") # TODO
SNP02 <- c("G|G","G|G","G|G","G|G")

Now let’s make a matrix with rbind()

# add the rbind() function
## to combine the vectors into a dataframe

my_matrix <- rbind(SNP01, SNP02) # TODO

Since this is representing a vcf file, the rows will be SNPS and the columns will be samples. I’ll name the columns with colnames().

# add the colnames() function around
## the my_matrix object to 
## rename it

colnames(my_matrix) <- c("sample1",  # TODO
                         "sample2",
                         "sample3", 
                         "sample4")

We now have a matrix in the general form of a vcf file

# add the my_matrix object to the chunk so
## its contents are shown
my_matrix

##       sample1 sample2 sample3 sample4
## SNP01 "A|A"   "A|A"   "A|A"   "A|A"  
## SNP02 "G|G"   "G|G"   "G|G"   "G|G"

 # TODO

The t() function will re-orient our matrix for us so that SNPs are in columns and samples are in rows.

# add the t() function to transpose
## the matrix

t(my_matrix) # TODO

##         SNP01 SNP02
## sample1 "A|A" "G|G"
## sample2 "A|A" "G|G"
## sample3 "A|A" "G|G"
## sample4 "A|A" "G|G"

We can save the transposed output to a new matrix.

# add the t() function and assign
## the output to an object called
## my_matrix_t

my_matrix_t <- t(my_matrix) # TODO

We can do this in R with dataframes also:

# add the data.frame() function to 
## turn my_matrix into a dataframe
## and assign it to an object called my_df

my_df <- data.frame(my_matrix) # TODO

Again, t() flips the dataframe for us:

# add the t() function to show the flipped
## dataframe
t(my_df) # TODO

##         SNP01 SNP02
## sample1 "A|A" "G|G"
## sample2 "A|A" "G|G"
## sample3 "A|A" "G|G"
## sample4 "A|A" "G|G"

Worked example

A typical VCF file can be found in the file all_loci.vcf. (Note that this file is NOT compressed and so has an extension of .vcf, not .vcf.gz). Load the data into an object called bird_snps using the function vcfR::read.vcfR().

NOTE - before you begin, make sure your WORKING DIRECTORY is set to the location of the .vcf file being used.

# into an object called bird_snps
bird_snps <- vcfR::read.vcfR("all_loci.vcf", convertNA = T)

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 8
##   header_line: 9
##   variant count: 1929
##   column count: 81
## 
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 1929
##   Character matrix gt cols: 81
##   skip: 0
##   nrows: 1929
##   row_num: 0
## 
Processed variant 1000
Processed variant: 1929
## All variants processed

warning("If this didn't work, you may not have set your working directory to the location of the vcf file")

## Warning: If this didn't work, you may not have set your working directory to the
## location of the vcf file

Make sure that at the bottom of the output in the console it says “Processed variant: 1929”.

Examine the VCF file

Call head on the bird_snps object. You should be able to identify the metadata (at the top) and be able to see how samples are organized in rows.

# run head() on bird_snps
head(bird_snps)           # TODO

## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.0"
## [1] "##fileDate=20160506"
## [1] "##source=\"Stacks v1.20\""
## [1] "##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"
## [1] "##INFO=<ID=AF,Number=.,Type=Float,Description=\"Allele Frequency\">"
## [1] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM POS    ID   REF ALT QUAL FILTER
## [1,] "un"  "1478" "17" "C" "A" NA   "PASS"
## [2,] "un"  "1479" "17" "A" "G" NA   "PASS"
## [3,] "un"  "1723" "20" "C" "T" NA   "PASS"
## [4,] "un"  "1948" "22" "C" "A" NA   "PASS"
## [5,] "un"  "1972" "22" "C" "T" NA   "PASS"
## [6,] "un"  "2299" "26" "C" "T" NA   "PASS"
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT     sample_ACAAA_Nel3     sample_ACAGTG_Nel5   
## [1,] "GT:DP:GL" "1/0:344:.,-359.6,."  "1/0:265:.,-273.8,." 
## [2,] "GT:DP:GL" "0/1:344:.,-359.6,."  "0/1:265:.,-273.8,." 
## [3,] "GT:DP:GL" "0/0:254:.,352.119,." "0/0:267:.,370.141,."
## [4,] "GT:DP:GL" "0/0:332:.,460.25,."  "0/0:339:.,469.954,."
## [5,] "GT:DP:GL" "0/0:332:.,431.419,." "0/0:339:.,454.108,."
## [6,] "GT:DP:GL" "0/0:421:.,553.847,." "0/0:412:.,540.081,."
##      sample_AGCAT_Nel8     sample_ATGAAAC_Nel10  sample_ATGAAAC_Nel15 
## [1,] "0/0:252:.,349.346,." "0/0:253:.,350.732,." "0/0:260:.,360.437,."
## [2,] "0/0:252:.,349.346,." "0/0:253:.,350.732,." "0/0:260:.,360.437,."
## [3,] "0/0:373:.,517.088,." "0/0:222:.,307.757,." "0/0:298:.,413.116,."
## [4,] "0/0:288:.,399.253,." "0/0:232:.,321.62,."  "0/0:342:.,474.113,."
## [5,] "0/0:288:.,399.253,." "0/0:232:.,321.62,."  "0/0:342:.,474.113,."
## [6,] "0/0:444:.,615.515,." "0/0:377:.,522.633,." "0/0:379:.,525.406,."
## [1] "First 6 columns only."
## [1] 
## [1] "Unique GT formats:"
## [1] "GT:DP:GL"
## [1]

We can view just the meta data like this:

bird_snps@meta

## [1] "##fileformat=VCFv4.0"                                                            
## [2] "##fileDate=20160506"                                                             
## [3] "##source=\"Stacks v1.20\""                                                       
## [4] "##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"
## [5] "##INFO=<ID=AF,Number=.,Type=Float,Description=\"Allele Frequency\">"             
## [6] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"                  
## [7] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">"               
## [8] "##FORMAT=<ID=GL,Number=.,Type=Float,Description=\"Genotype Likelihood\">"

(Don’t worry about what the “at” symbol is using - this is a somewhat less common R syntax).

We can get a snapshot of the samples like this:

bird_snps@gt[1:10, 1:3]

##       FORMAT     sample_ACAAA_Nel3     sample_ACAGTG_Nel5   
##  [1,] "GT:DP:GL" "1/0:344:.,-359.6,."  "1/0:265:.,-273.8,." 
##  [2,] "GT:DP:GL" "0/1:344:.,-359.6,."  "0/1:265:.,-273.8,." 
##  [3,] "GT:DP:GL" "0/0:254:.,352.119,." "0/0:267:.,370.141,."
##  [4,] "GT:DP:GL" "0/0:332:.,460.25,."  "0/0:339:.,469.954,."
##  [5,] "GT:DP:GL" "0/0:332:.,431.419,." "0/0:339:.,454.108,."
##  [6,] "GT:DP:GL" "0/0:421:.,553.847,." "0/0:412:.,540.081,."
##  [7,] "GT:DP:GL" "0/0:421:.,583.63,."  "0/0:412:.,571.153,."
##  [8,] "GT:DP:GL" "0/0:421:.,583.63,."  "0/0:412:.,571.153,."
##  [9,] "GT:DP:GL" "0/0:528:.,731.963,." "0/0:301:.,417.275,."
## [10,] "GT:DP:GL" "0/0:253:.,350.732,." "0/0:248:.,343.801,."

Each row is a SNP and each column is a sample (a bird in this case.)

Extract numeric genotype scores

We extract genotype scores using vcfR::extract.gt(). You need to be familiar with this command, but I’ll give you the code to do this on an exam.

# Add vcfR::extract.gt() to extract the
## numeric scores 
bird_snps_num <- vcfR::extract.gt(bird_snps, # TODO
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T)

We now have just the numeric data that would go into an analysis such as PCA.

The sample names are REALLY long so its hard to display in a compact way. The code below will make this a a little easier to see using the regular expression gsub().

colnames(bird_snps_num) <-  gsub("sample_", "",
                                 colnames(bird_snps_num))

colnames(bird_snps_num) <-  gsub("_", "",
                                 colnames(bird_snps_num))

We can see that the matrix just contains numeric genotype scores of 0, 1 or 2.

Here’s a small view of data:

bird_snps_num[1:10, 1:4]

##       ACAAANel3 ACAGTGNel5 AGCATNel8 ATGAAACNel10
##  [1,]         1          1         0            0
##  [2,]         0          0         0            0
##  [3,]         0          0         0            0
##  [4,]         0          0         0            0
##  [5,]         0          0         0            0
##  [6,]         0          0         0            0
##  [7,]         0          0         0            0
##  [8,]         0          0         0            0
##  [9,]         0          0         0            0
## [10,]         0          0         0            0

We can call summary of a bit of the data like this:

summary(bird_snps_num[, 1:5])

##    ACAAANel3         ACAGTGNel5        AGCATNel8        ATGAAACNel10    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.08574   Mean   :0.07962   Mean   :0.08259   Mean   :0.07192  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##  NA's   :28        NA's   :20        NA's   :28        NA's   :24       
##   ATGAAACNel15   
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.0766  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
##  NA's   :23

Transpose numeric genotype scores.

The data are formatted as genotype scores but SNPs are in columns and samples in row. We can reformat them using the transpose function t().

# run t() on bird_snps_num and 
## save the output to bird_snps_num_t

bird_snps_num_t <- t(bird_snps_num) # TODO

We can look at the output like this:

bird_snps_num_t[1:10, 1:4]

##              [,1] [,2] [,3] [,4]
## ACAAANel3       1    0    0    0
## ACAGTGNel5      1    0    0    0
## AGCATNel8       0    0    0    0
## ATGAAACNel10    0    0    0    0
## ATGAAACNel15    0    0    0    0
## CGATGTNel4      0    0    0    0
## CTAGCNel7       0    0    0    0
## CTGTANel12      0    0    0    0
## CTGTANel9       0    0    0    0
## CTTGANel2       0    0    0    0

Preview - dealing with NAs

These data have a lot of NAs. If we just call na.omit() on them, what happens? Call na.omit() on the bird_snps_num_t object, then check the dimensions.

# Call na.omit() on bird_snps_num_t
## and assign the output to no_NAs
no_NAs <- na.omit(bird_snps_num_t) # TODO

# what is the remaining size of the data?
# why
dim(no_NAs) # TODO

## [1]    0 1929