Preparing VCF Data for Analysis: Transposition

NOTE - before you begin, make sure your WORKING DIRECTORY is set to the location of the .vcf file being used.

Learning objectives

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Set a working directory and confirm a file is present with getwd() and list.files(pattern = ...)
Know what it means to reshape data
Learn what transposition of data means
Know what SNP data in VCF form needs to be transposed
Know which function in R does transposition

Review

What does “VCF” stand for?
What is the general structure of a VCF file?
What is the file extension of a VCF file?
Where is the metadata of VCF file
What the extract.gt() command does
Why we don’t use na.omit() on SNPs

Preliminaries

We need just one package for this lesson, vcfR.

In this lesson you will be using a .vcf file called “all_loci.vcf”.

You must make sure that R has its working directory set to where this file is located.

The easiest way to do this is

Save “all_loci.vcf” to whichever folder this file .Rmd file is in
On the top RStudio menu, click on “Session”, then “Set Working Directory”, then “To Source File Location”.

Check the location of your working directory with getwd()

Check for the presence of the “all_loci.vcf” file in the working directory with list.files()

If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = “vcf”)

Introduction

SNP data is often stored in Variant Call Format (VCF) files that are organized differently than normal R data. When SNP data is analyzed in R, such as by doing PCA or cluster analysis, the SNPs are usually the features (variables) and each row is a sample (such as a person, or other organism). VCF files, however, are organized with SNPs in rows and samples in columns.

Changing how data is arranged is called reshaping. Reshaping can take different forms. In this case we need to flip the orientation of the data so that the rows, SNPS, become the columns.

In R we can flip the orientation of matrix or dataframe using t() operation, which stands for transpose. The transpose of a dataset takes each row and makes it into a column. For vcf data, this means that the SNPs in rows can be made into columns.

Example

The example below will illustrate the general principle of transposition of data. First, let’s make two vectors of data in vcf-like format, with genotypes expressed as “A|A”.

Now let’s make a matrix with rbind()

Since this is representing a vcf file, the rows will be SNPS and the columns will be samples. I’ll name the columns with colnames().

We now have a matrix in the general form of a vcf file

The t() function will re-orient our matrix for us so that SNPs are in columns and samples are in rows.

We can save the transposed output to a new matrix.

We can do this in R with dataframes also:

Again, t() flips the dataframe for us:

Worked example

A typical VCF file can be found in the file all_loci.vcf. (Note that this file is NOT compressed and so has an extension of .vcf, not .vcf.gz). Load the data into an object called bird_snps using the function vcfR::read.vcfR().

NOTE - before you begin, make sure your WORKING DIRECTORY is set to the location of the .vcf file being used.

Make sure that at the bottom of the output in the console it says “Processed variant: 1929”.

Examine the VCF file

Call head on the bird_snps object. You should be able to identify the metadata (at the top) and be able to see how samples are organized in rows.

We can view just the meta data like this:

(Don’t worry about what the “at” symbol is using - this is a somewhat less common R syntax).

We can get a snapshot of the samples like this:

Each row is a SNP and each column is a sample (a bird in this case.)

Extract numeric genotype scores

We extract genotype scores using vcfR::extract.gt(). You need to be familiar with this command, but I’ll give you the code to do this on an exam.

We now have just the numeric data that would go into an analysis such as PCA.

The sample names are REALLY long so its hard to display in a compact way. The code below will make this a a little easier to see using the regular expression gsub().

We can see that the matrix just contains numeric genotype scores of 0, 1 or 2.

Here’s a small view of data:

We can call summary of a bit of the data like this:

Transpose numeric genotype scores.

The data are formatted as genotype scores but SNPs are in columns and samples in row. We can reformat them using the transpose function t().

We can look at the output like this:

Preview - dealing iwth NAs

These data have a lot of NAs. If we just call na.omit() on them, what happens? Call na.omit() on the bird_snps_num_t object, then check the dimensions.