The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Use scale() to scale the columns of the matrix, SNPs_cleaned, into an object called SNPs_scaled
SNPs_scaled <- scale(SNPs_cleaned)
Us prcomp() to perform a PCA on the scaled SNP data
pca_scaled <- prcomp(SNPs_scaled)
Use screeplot() to create a scree plot with the scaled pca data
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Eigenvalues after PCA")
This scree plot gives the relative importance of all the principal components. Every single PC except for PC1 has the relatively same value, which is why they can be ignored and we will focus on using just PC1
First, get a summary of the scaled pca data, and then find the percentages of importance. Create a barplot that graphs the PCs on the x axis and the percent variation explained on the Y axis.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
PC1 has the highest percent variation explained as seen in the barplot. Because the curve dips after PC1, we can ignore the otheer PCs to make less cluttered, hence dimension reduction. There is a low percent variation for all the PCs other than PC1, so the data is more consistent and does not need a PCA analysis. Because PC1 has a high variability, the points would be spread apart and can be compared to PC2.
Use biplot() to create a biplot of the scaled pca data
biplot(pca_scaled)
Biplots show both PC scores of samples and loading of variables, which can make the plot quite cluttered
TODO: EXPLAIN
pca_scores <- vegan::scores(pca_scaled)
This is a vector of all the different identifications of the population data. This will be used to create a data frame against the pca scores
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Create a data frame using the pop_id and pca scores vectors
pca_scores2 <- data.frame(pop_id,
pca_scores)
Use ggscatter from ggpubr library to screate a scatterplot with the pca scores data.
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "SNPs PCA Scatterplot")
TODO: INTERPRET PLOT