In this worked example you will replicate a PCA on a published dataset.
The example is split into 2 Parts:
In this Data Preparation phase, you will do the following things:
vcfR::read.vcfR())vcfR::extract.gt())t())for()
loop).csv file
for the next step (write.csv())This worked example is based on a paper in the journal Molecular Ecology from 2017 by Jennifer Walsh titled Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.
The study investigated variation between two bird species in the genus Ammodramus: A. nenlsoni and A. caudacutus.
The species A. nenlsoni has been divided into 3 sub-species: A. n. nenlsoni, A.n. alterus, and A n. subvirgatus. The other species, A. caudacutus, has been divided into two subspecies, A.c. caudacutus and A.c. diversus.
The purpose of this study was to investigate to what extent these five subspecies recognized by taxonomists are supported by genetic data. The author’s collected DNA from 75 birds (15 per subspecies) and genotyped 1929 SNPs. They then analyzed the data with Principal Components Analysis (PCA), among other genetic analyzes.
This tutorial will work through all of the steps necessary to re-analyze Walsh et al.s data
In the code below all code is provided. Your tasks will be to do 2 things:
Load the vcfR and other packages with library().
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Make sure that your working directory is set to the location of the
file all_loci.vcf.
getwd()
## [1] "/Users/madisondougherty/Downloads/COMPBIO"
list.files()
## [1] "07-mean_imputation.Rmd" "all_loci.vcf"
## [3] "SNPs_cleaned.csv" "walsh2017morphology.csv"
list.files(pattern = "vcf")
## [1] "all_loci.vcf"
We need to read in the .vcf file here in order to be
able to work with the data. Here, we assign the data to an object called
snps, which is a vcfR object. This step is also important
because after we run this line of code, the output that is printed tells
us whether the data properly uploaded. The last line should say
"All variants processed" and the second to last line should
say "Processed Variant: …", which tells us the amount of
SNPs in our data.
snps <- vcfR::read.vcfR("all_loci.vcf", convertNA = TRUE)
## Scanning file to determine attributes.
## File attributes:
## meta lines: 8
## header_line: 9
## variant count: 1929
## column count: 81
##
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 1929
## Character matrix gt cols: 81
## skip: 0
## nrows: 1929
## row_num: 0
##
Processed variant 1000
Processed variant: 1929
## All variants processed
.vcf FileSince .vcf files have a lot information included that
isn’t necessary for our analysis, we need to extract the genotype data
that we plan to use and convert it into a form that we can easily use in
R. So, here, we extract the character data containing the genotypes, and
convert it into numeric scores that we can later use in our analysis
using the function vcfR::extract.gt() and the arguments as
follows.
snps_num <- vcfR::extract.gt(snps,
element = "GT",
IDtoRowNames = F,
as.numeric = T,
convertNA = T,
return.alleles = F)
.vcf data is automatically formatted with SNPs in
columns and samples in rows. Our matrix containing the the genotype
scores called snps_num is formatted this way, so we need to
transpose the data in order to perform our analysis. We do this here by
using the function t() and then assigning the output to a
new matrix called snps_num_t.
snps_num_t <- t(snps_num)
Here we convert the matrix to a data frame so that we can perform the appropriate functions on the data and make the plots necessary for analysis.
snps_num_df <- data.frame(snps_num_t)
In this step we create a function called find_NAs() that
will tell us how many NAs are in our data frame of SNPs and print the
index value of where an NA is located. When using the function, you
input the row number you want to search for NAs in using bracket
notation, and the function will print the indices where NAs were found
within that row.
find_NAs <- function(x){
NAs_TF <- is.na(x)
i_NA <- which(NAs_TF == TRUE)
N_NA <- length(i_NA)
cat("Results:",N_NA, "NAs present\n.")
return(i_NA)
}
In this process, we use a for() loop to go through every
row of the data frame and search each row individually for NAs using the
find_NAs function. The output will contain each row’s
number of NAs.
# N_rows
# number of rows (individuals)
N_rows <- nrow(snps_num_t)
# N_NA
# vector to hold output (number of NAs)
N_NA <- rep(x = 0, times = N_rows)
# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(snps_num_t)
# the for() loop
for(i in 1:N_rows){
# for each row, find the location of
## NAs with snps_num_t()
i_NA <- find_NAs(snps_num_t[i,])
# then determine how many NAs
## with length()
N_NA_i <- length(i_NA)
# then save the output to
## our storage vector
N_NA[i] <- N_NA_i
}
## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .
Here, we first created an object containing the number equal to 50%
of the total number of columns. Then, we create a histogram
demonstrating the variation in the amount of NAs found in each row of
the data frame. From the graph, we see that the majority of rows had
less than 500 NAs, but there were still some rows that had as much as
2000 NAs. We then added a line for the value stored in the object
cutoff50, which shows us how many of the columns were not
included in the final analysis of the data.
# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5
hist(N_NA)
abline(v = cutoff50,
col = 2,
lwd = 2,
lty = 2)
Here, we determined how many NAs are in each row and we created an object that only contains the rows that had less than 50% of the data as NAs.
percent_NA <- N_NA/N_SNPs*100
# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50)
snps_num_t02 <- snps_num_t[-i_NA_50percent, ]
Here, we reduce the sample names down to just the population code, which tells us where the data was sampled from. Lastly, we created a frequency table that tells us how many samples were taken from each of the population codes.
row_names <- row.names(snps_num_t02) # Key
row_names02 <- gsub("sample_","",row_names)
sample_id <- gsub("^([ATCG]*)(_)(.*)",
"\\3",
row_names02)
pop_id <- gsub("[01-9]*",
"",
sample_id)
table(pop_id)
## pop_id
## Alt Cau Div Nel Sub
## 15 12 15 15 11
Here, we created the function invar_omit() , which
removes the columns that are considered invariant from the data frame
that we want to analyze. Invariant columns contain data that does not
have any variance, and thus, is unnecessary to include in our analysis
because it does not provide any useful information. The function we
created here removes the invariant columns and then prints an output
showing how many columns were removed. After creating the function, we
used it on our data frame and created the new data frame, that now
contains no invariant columns, snps_no_invar.
invar_omit <- function(x){
cat("Dataframe of dim",dim(x), "processed...\n")
sds <- apply(x, 2, sd, na.rm = TRUE)
i_var0 <- which(sds == 0)
cat(length(i_var0),"columns removed\n")
if(length(i_var0) > 0){
x <- x[, -i_var0]
}
## add return() with x in it
return(x)
}
snps_no_invar <- invar_omit(snps_num_t02)
## Dataframe of dim 68 1929 processed...
## 591 columns removed
First, we replaced our data frame snps_noNAs with the
data frame that doesn’t contain invariant columns that we created in the
last step. Then we used a for() loop to complete mean
imputation on our data frame to replace all of the remaining NAs in each
column with the mean of each column, which we calculated using
mean(). In that function, we used the argument
na.rm = TRUE to ensure that the mean could be calculated
while ignoring the NAs in the column.
snps_noNAs <- snps_no_invar
N_col <- ncol(snps_no_invar)
for(i in 1:N_col){
# get the current column
column_i <- snps_noNAs[, i]
# get the mean of the current column
mean_i <- mean(column_i, na.rm = TRUE)
# get the NAs in the current column
NAs_i <- which(is.na(column_i))
# record the number of NAs
N_NAs <- length(NAs_i)
# replace the NAs in the current column
column_i[NAs_i] <- mean_i
# replace the original column with the
## updated columns
snps_noNAs[, i] <- column_i
}
Save the data as a .csv file which can be loaded again later.
write.csv(snps_noNAs, file = "SNPs_cleaned.csv",
row.names = F)
Check for the presence of the file with list.files()
list.files(pattern = ".csv")
## [1] "SNPs_cleaned.csv" "walsh2017morphology.csv"
In Part 2, we will re-load the SNPs_cleaned.csv file and
carry an an analysis with PCA.