The example is split into two parts: Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been fully prepared for analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale()
prcomp()
screeplot()
vegan::scores()
In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory and confirm it with getwd()
getwd()
## [1] "/Users/neha/Desktop/Computational Biology/code"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Scale the data using the scale() function before performing PCA.
SNPs_scaled <- scale(SNPs_cleaned)
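As an optional sanity check (a minimal sketch, assuming every column of SNPs_cleaned is numeric), the scaled columns should have means near 0 and standard deviations of 1:
round(colMeans(SNPs_scaled)[1:5], 3) # means of the first 5 columns, roughly 0
round(apply(SNPs_scaled[, 1:5], 2, sd), 3) # standard deviations of the first 5 columns, roughly 1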
TODO: Use the prcomp() function to perform PCA on the scaled data and assign to object pca_scaled.
pca_scaled <- prcomp(SNPs_scaled)
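If you want to see what prcomp() returned before plotting, an optional look at its components:
names(pca_scaled) # "sdev", "rotation", "center", "scale", "x"
dim(pca_scaled$x) # scores: one row per sample, one column per PC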
TODO: Use the screeplot() function to create the plot. This plot helps us determine how many of the new dimensions (PCs) should be retained.
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
          ylab = "Relative importance",
          main = "Screeplot of Scaled SNP Data")
TODO: INTERPRET SCREEPLOT The plot shows a sharp drop in “relative importance” after PC10 (no further bars are drawn). This indicates that the first 10 PCs are the most worthwhile to analyze further, as they come before the steep drop in the plot.
TODO: The relative importance of a PC on the screeplot is directly related to how much of the variation in the data that PC explains. Use summary() to get the percentage of variation explained, write a function with function() to extract the variation/importance information, and then call PCA_variation() with the number of PCs you want.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
  # Row 2 of the importance table is "Proportion of Variance"; convert to percent
  var_explained <- pca_summary$importance[2, 1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
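As an optional follow-up, summing the returned percentages shows how much of the total variation these 10 PCs capture together:
sum(var_out) # total percent of variation explained by the first 10 PCs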
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: This creates a bar plot of the percentage of variation explained by each of the 10 PCs using the barplot() function. The horizontal line is drawn at 100/(number of columns in the data), which is the percentage each PC would explain if every PC contributed an equal share of the variation; bars above this line explain more than that equal share.
TODO: Create a biplot for the scaled data and try to understand the relationships among the features.
biplot(pca_scaled)
TODO: Because there are so many features and data points, the arrows (which help show the relationships among the features) and many of the data points are clumped together, which makes the plot difficult to interpret.
TODO: Isolate the scores for the scaled PCA data into one object using the scores() function from the vegan package.
pca_scores <- vegan::scores(pca_scaled)
length(pca_scores)
## [1] 4624
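Note that scores() returns a matrix of coordinates rather than a single vector (length() counts every cell of the matrix), so an optional check of its shape is more informative:
dim(pca_scores) # rows = individual samples, columns = PC axes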
TODO: Create another vector pop_id isolating the population id using the c() function.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
length(pop_id)
## [1] 68
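An optional check: tabulating pop_id confirms how many samples were assigned to each population label.
table(pop_id) # number of samples per population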
TODO: Combine the two objects - pca_scores and pop_id - using the data.frame() function and assign the result to pca_scores2.
pca_scores2 <- data.frame(pop_id,
                          pca_scores)
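An optional peek at the first few columns of the combined data frame confirms the population labels now sit alongside the PC scores:
head(pca_scores2[, 1:3]) # pop_id plus the first two PC score columns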
TODO: Use the ggscatter() function to plot the pca_scores2 data, with color and shape coding based on pop_id.
TODO: UPDATE PLOT WITH TITLE
TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.32% variation)",
                  main = "PCA Scatter Plot")
TODO: The scatter plot shows the relationship between PC1 and PC2, with the data points grouped by population id. It focuses only on the population groupings across these two PCs, in contrast to the biplot, which showed all of the features and their relationships for PC1 and PC2. This plot is easier to interpret because the groupings are clear, whereas in the biplot the arrows showing the feature relationships were clumped together.
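If you want to emphasize the population groupings further, ggscatter() can also draw a confidence ellipse around each group. This is a minimal sketch using standard ggpubr arguments, with the same axis labels assumed as above:
ggpubr::ggscatter(data = pca_scores2,
                  x = "PC1",
                  y = "PC2",
                  color = "pop_id",
                  shape = "pop_id",
                  ellipse = TRUE, # draw an ellipse around each population
                  ellipse.level = 0.95, # 95% confidence ellipse
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.32% variation)",
                  main = "PCA Scatter Plot with Group Ellipses")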