NOTE - before you begin, make sure your WORKING DIRECTORY is set to the location of the .vcf file being used.
All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.
getwd()
and
list.files(pattern = ...)
extract.gt()
command doesna.omit()
on SNPsWe need just one package for this lesson, vcfR
.
library(vcfR)
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
In this lesson you will be using a .vcf
file called
“all_loci.vcf”.
You must make sure that R has its working directory set to where this file is located.
The easiest way to do this is
Check the location of your working directory with
getwd()
# run getwd()
getwd( )
## [1] "C:/Users/Grief Mage/Documents"
Check for the presence of the “all_loci.vcf” file in the working
directory with list.files()
list.files()
## [1] "all_loci.vcf" "cover letter.pdf"
## [3] "desktop.ini" "Dolphin Emulator"
## [5] "FPSMonitor.txt" "GitHub"
## [7] "IEF essay 2.txt" "League of Legends"
## [9] "My Games" "My Music"
## [11] "My Pictures" "My Videos"
## [13] "my_snps" "R files"
## [15] "rsconnect" "transpose_VCF_data.Rmd"
## [17] "vcfR_test.vcf" "vcfR_test.vcf.gz"
## [19] "walsh2017morphology.csv" "working_directory_practice.html"
## [21] "working_directory_practice.Rmd" "Zoom"
# todo
If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = “vcf”)
# Run list.files() with pattern = "vcf
# todo
list.files(pattern = "vcf")
## [1] "all_loci.vcf" "vcfR_test.vcf" "vcfR_test.vcf.gz"
SNP data is often stored in Variant Call Format (VCF) files that are organized differently than normal R data. When SNP data is analyzed in R, such as by doing PCA or cluster analysis, the SNPs are usually the features (variables) and each row is a sample (such as a person, or other organism). VCF files, however, are organized with SNPs in rows and samples in columns.
Changing how data is arranged is called reshaping. Reshaping can take different forms. In this case we need to flip the orientation of the data so that the rows, SNPS, become the columns.
In R we can flip the orientation of matrix or dataframe using
t()
operation, which stands for transpose.
The transpose of a dataset takes each row and makes it into a column.
For vcf data, this means that the SNPs in rows can be made into
columns.
The example below will illustrate the general principle of transposition of data. First, let’s make two vectors of data in vcf-like format, with genotypes expressed as “A|A”.
# Add the c() function to make
## vectors containing SNP genotypes
SNP01 <- c("A|A","A|A","A|A","A|A") # TODO
SNP02 <- c("G|G","G|G","G|G","G|G")
Now let’s make a matrix with rbind()
# add the rbind() function
## to combine the vectors into a dataframe
my_matrix <- rbind(SNP01, SNP02) # TODO
Since this is representing a vcf file, the rows will be SNPS and the
columns will be samples. I’ll name the columns with
colnames()
.
# add the colnames() function around
## the my_matrix object to
## rename it
colnames(my_matrix) <- c("sample1", # TODO
"sample2",
"sample3",
"sample4")
We now have a matrix in the general form of a vcf file
# add the my_matrix object to the chunk so
## its contents are shown
my_matrix
## sample1 sample2 sample3 sample4
## SNP01 "A|A" "A|A" "A|A" "A|A"
## SNP02 "G|G" "G|G" "G|G" "G|G"
# TODO
The t()
function will re-orient our matrix for us so
that SNPs are in columns and samples are in rows.
# add the t() function to transpose
## the matrix
t(my_matrix) # TODO
## SNP01 SNP02
## sample1 "A|A" "G|G"
## sample2 "A|A" "G|G"
## sample3 "A|A" "G|G"
## sample4 "A|A" "G|G"
We can save the transposed output to a new matrix.
# add the t() function and assign
## the output to an object called
## my_matrix_t
my_matrix_t <- t(my_matrix) # TODO
We can do this in R with dataframes also:
# add the data.frame() function to
## turn my_matrix into a dataframe
## and assign it to an object called my_df
my_df <- data.frame(my_matrix) # TODO
Again, t()
flips the dataframe for us:
# add the t() function to show the flipped
## dataframe
t(my_df) # TODO
## SNP01 SNP02
## sample1 "A|A" "G|G"
## sample2 "A|A" "G|G"
## sample3 "A|A" "G|G"
## sample4 "A|A" "G|G"
A typical VCF file can be found in the file
all_loci.vcf
. (Note that this file is NOT compressed and so
has an extension of .vcf
, not .vcf.gz
). Load
the data into an object called bird_snps
using the function
vcfR::read.vcfR()
.
NOTE - before you begin, make sure your WORKING DIRECTORY is set to the location of the .vcf file being used.
# into an object called bird_snps
bird_snps <- vcfR::read.vcfR("all_loci.vcf", convertNA = T)
## Scanning file to determine attributes.
## File attributes:
## meta lines: 8
## header_line: 9
## variant count: 1929
## column count: 81
##
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 1929
## Character matrix gt cols: 81
## skip: 0
## nrows: 1929
## row_num: 0
##
Processed variant 1000
Processed variant: 1929
## All variants processed
warning("If this didn't work, you may not have set your working directory to the location of the vcf file")
## Warning: If this didn't work, you may not have set your working directory to the
## location of the vcf file
Make sure that at the bottom of the output in the console it says “Processed variant: 1929”.
Call head on the bird_snps object. You should be able to identify the metadata (at the top) and be able to see how samples are organized in rows.
# run head() on bird_snps
head(bird_snps) # TODO
## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.0"
## [1] "##fileDate=20160506"
## [1] "##source=\"Stacks v1.20\""
## [1] "##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"
## [1] "##INFO=<ID=AF,Number=.,Type=Float,Description=\"Allele Frequency\">"
## [1] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
## [1] "First 6 rows."
## [1]
## [1] "***** Fixed section *****"
## CHROM POS ID REF ALT QUAL FILTER
## [1,] "un" "1478" "17" "C" "A" NA "PASS"
## [2,] "un" "1479" "17" "A" "G" NA "PASS"
## [3,] "un" "1723" "20" "C" "T" NA "PASS"
## [4,] "un" "1948" "22" "C" "A" NA "PASS"
## [5,] "un" "1972" "22" "C" "T" NA "PASS"
## [6,] "un" "2299" "26" "C" "T" NA "PASS"
## [1]
## [1] "***** Genotype section *****"
## FORMAT sample_ACAAA_Nel3 sample_ACAGTG_Nel5
## [1,] "GT:DP:GL" "1/0:344:.,-359.6,." "1/0:265:.,-273.8,."
## [2,] "GT:DP:GL" "0/1:344:.,-359.6,." "0/1:265:.,-273.8,."
## [3,] "GT:DP:GL" "0/0:254:.,352.119,." "0/0:267:.,370.141,."
## [4,] "GT:DP:GL" "0/0:332:.,460.25,." "0/0:339:.,469.954,."
## [5,] "GT:DP:GL" "0/0:332:.,431.419,." "0/0:339:.,454.108,."
## [6,] "GT:DP:GL" "0/0:421:.,553.847,." "0/0:412:.,540.081,."
## sample_AGCAT_Nel8 sample_ATGAAAC_Nel10 sample_ATGAAAC_Nel15
## [1,] "0/0:252:.,349.346,." "0/0:253:.,350.732,." "0/0:260:.,360.437,."
## [2,] "0/0:252:.,349.346,." "0/0:253:.,350.732,." "0/0:260:.,360.437,."
## [3,] "0/0:373:.,517.088,." "0/0:222:.,307.757,." "0/0:298:.,413.116,."
## [4,] "0/0:288:.,399.253,." "0/0:232:.,321.62,." "0/0:342:.,474.113,."
## [5,] "0/0:288:.,399.253,." "0/0:232:.,321.62,." "0/0:342:.,474.113,."
## [6,] "0/0:444:.,615.515,." "0/0:377:.,522.633,." "0/0:379:.,525.406,."
## [1] "First 6 columns only."
## [1]
## [1] "Unique GT formats:"
## [1] "GT:DP:GL"
## [1]
We can view just the meta data like this:
bird_snps@meta
## [1] "##fileformat=VCFv4.0"
## [2] "##fileDate=20160506"
## [3] "##source=\"Stacks v1.20\""
## [4] "##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"
## [5] "##INFO=<ID=AF,Number=.,Type=Float,Description=\"Allele Frequency\">"
## [6] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
## [7] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">"
## [8] "##FORMAT=<ID=GL,Number=.,Type=Float,Description=\"Genotype Likelihood\">"
(Don’t worry about what the “at” symbol is using - this is a somewhat less common R syntax).
We can get a snapshot of the samples like this:
bird_snps@gt[1:10, 1:3]
## FORMAT sample_ACAAA_Nel3 sample_ACAGTG_Nel5
## [1,] "GT:DP:GL" "1/0:344:.,-359.6,." "1/0:265:.,-273.8,."
## [2,] "GT:DP:GL" "0/1:344:.,-359.6,." "0/1:265:.,-273.8,."
## [3,] "GT:DP:GL" "0/0:254:.,352.119,." "0/0:267:.,370.141,."
## [4,] "GT:DP:GL" "0/0:332:.,460.25,." "0/0:339:.,469.954,."
## [5,] "GT:DP:GL" "0/0:332:.,431.419,." "0/0:339:.,454.108,."
## [6,] "GT:DP:GL" "0/0:421:.,553.847,." "0/0:412:.,540.081,."
## [7,] "GT:DP:GL" "0/0:421:.,583.63,." "0/0:412:.,571.153,."
## [8,] "GT:DP:GL" "0/0:421:.,583.63,." "0/0:412:.,571.153,."
## [9,] "GT:DP:GL" "0/0:528:.,731.963,." "0/0:301:.,417.275,."
## [10,] "GT:DP:GL" "0/0:253:.,350.732,." "0/0:248:.,343.801,."
Each row is a SNP and each column is a sample (a bird in this case.)
We extract genotype scores using vcfR::extract.gt()
. You
need to be familiar with this command, but I’ll give you the code to do
this on an exam.
# Add vcfR::extract.gt() to extract the
## numeric scores
bird_snps_num <- vcfR::extract.gt(bird_snps, # TODO
element = "GT",
IDtoRowNames = F,
as.numeric = T,
convertNA = T)
We now have just the numeric data that would go into an analysis such as PCA.
The sample names are REALLY long so its hard to display in a compact
way. The code below will make this a a little easier to see using the
regular expression gsub()
.
colnames(bird_snps_num) <- gsub("sample_", "",
colnames(bird_snps_num))
colnames(bird_snps_num) <- gsub("_", "",
colnames(bird_snps_num))
We can see that the matrix just contains numeric genotype scores of 0, 1 or 2.
Here’s a small view of data:
bird_snps_num[1:10, 1:4]
## ACAAANel3 ACAGTGNel5 AGCATNel8 ATGAAACNel10
## [1,] 1 1 0 0
## [2,] 0 0 0 0
## [3,] 0 0 0 0
## [4,] 0 0 0 0
## [5,] 0 0 0 0
## [6,] 0 0 0 0
## [7,] 0 0 0 0
## [8,] 0 0 0 0
## [9,] 0 0 0 0
## [10,] 0 0 0 0
We can call summary of a bit of the data like this:
summary(bird_snps_num[, 1:5])
## ACAAANel3 ACAGTGNel5 AGCATNel8 ATGAAACNel10
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.08574 Mean :0.07962 Mean :0.08259 Mean :0.07192
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## NA's :28 NA's :20 NA's :28 NA's :24
## ATGAAACNel15
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.0766
## 3rd Qu.:0.0000
## Max. :1.0000
## NA's :23
The data are formatted as genotype scores but SNPs are in columns and
samples in row. We can reformat them using the transpose function
t()
.
# run t() on bird_snps_num and
## save the output to bird_snps_num_t
bird_snps_num_t <- t(bird_snps_num) # TODO
We can look at the output like this:
bird_snps_num_t[1:10, 1:4]
## [,1] [,2] [,3] [,4]
## ACAAANel3 1 0 0 0
## ACAGTGNel5 1 0 0 0
## AGCATNel8 0 0 0 0
## ATGAAACNel10 0 0 0 0
## ATGAAACNel15 0 0 0 0
## CGATGTNel4 0 0 0 0
## CTAGCNel7 0 0 0 0
## CTGTANel12 0 0 0 0
## CTGTANel9 0 0 0 0
## CTTGANel2 0 0 0 0
These data have a lot of NAs. If we just call na.omit() on them, what
happens? Call na.omit()
on the bird_snps_num_t
object, then check the dimensions.
# Call na.omit() on bird_snps_num_t
## and assign the output to no_NAs
no_NAs <- na.omit(bird_snps_num_t) # TODO
# what is the remaining size of the data?
# why
dim(no_NAs) # TODO
## [1] 0 1929