The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/Documents/BIOSC 1540 Code/Final_Project")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Center the data around the mean by using the
scale() on the SNPs_cleaned dataframe. This will make the
mean of the data set 0 and the standard deviation 1.
SNPs_scaled <- scale(SNPs_cleaned)
TODO: Run a PCA analysis on the scaled data frame using
prcomp(). Store this in a variable pca_scaled.
pca_scaled <- prcomp(SNPs_scaled)
TODO: Create a screeplot with screeplot() in order to
analyze the how many of the new features (PCs) should be considered,
plot, and/or used in further analyses.
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Scree plot for Walsh bird data ")
TODO: From the screeplot, it appears that only PC1 is important to use in further analyses. This is because there is a steep drop between PC1 and PC2, and all of the subsequent PCs are even smaller. This indicates that PC1 captures the largest amount of data variation in the data set, making it the most important.
TODO: Get the information on the percentage of variation in the data
using the summary() command. Store the summary information
in an object summary_out_scaled.
summary_out_scaled <- summary(pca_scaled)
Use a function to extract the explained variation (importance) we want.
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
Use function and store in object var_out. This will extract the first 10 PCs (shown on scree plot as well).
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
var_out
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 19.9 2.3 2.2 2.1 2.0 2.0 2.0 1.9 1.8 1.8
Make a scree plot where the y-axis is the percent of variation
captured by each PC using barplot()
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: The horizontal line in the barplot is calculated as 100/(# of dimensions in the data).If all the PCs were equally important, then the amount of variation they would explain would be (100)/(# of columns). These data have 50 columns, so if all the PCs were equal, they would each explain 2% of the variation.
TODO: Create a biplot of the SNP PCA data. This will show a vector for each opf the 50 SNPs on the plot.
biplot(pca_scaled)
TODO: This a bad idea to do because there are simply too many SNPs that if we look at a biplot we are unable to make any sense of the data. As shown, the vectors completely cover up the points and we are unable to come to any conclusions about the SNPs.
TODO: Get the PCA scores using vegan::scores() and store
the information into an object pca_scores
pca_scores <- vegan::scores(pca_scaled)
TODO: Create a vector pop_id that has a population code for each of the 68 samples. This is to replace the sample id label in order to make more sense of the data and color code the plot later on.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Combine the scores with the species information into a dataframe pca_scores2.
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Plot the scores, which are color coded by population id. The amount of variation explained by each PC is shown in the axis labels.
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA Scatterplot of Walsh Bird Species")
TODO: Looking at this scatterplot, it looks like birds with the population id Sub are most separate from the other populations. as it is in its own separate clump. On the other hand, species Alt and NeI, and Cau and Div seem highly similar across both PC1 and PC2. Cau and Div also seem to be separate from the other species across PC1, indicating that there is most likely a lot of variation between them.