The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Warning: package 'vegan' was built under R version 4.2.2
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Using the function ‘scale()’, the data is centered around the mean and divided by the standard deviation. This makes the data easier to work with and this data is then saved to an object.
SNPs_scaled <- scale(SNPs_cleaned)
Using the function ‘prcomp()’ PCA analysis is applied to the data. The PCA of the data is then saved to an object.
pca_scaled <- prcomp(SNPs_scaled)
Use the function ‘screeplot()’ to get a screeplot of the data. A screeplot plots the principle components, showing us which components have the most variability from one another which are the ones to focus on.
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Screeplot of PCA on SNPs")
The principle component 1 has the most relative importance as it is the most unique of all the principle components. All of the other principle components are very similar too each other, making their relative importance the same and much lower compared to principle component 1. When comparing the principle components to each other we should compare PC1 to PC2 as there is a great difference in variability and since PC2 shares the same variability as PC3-PC10.
Use the function ‘summary()’ to get the summary of the PCA scaled data and save that information to an object to be used in the function that will explain the variation of the principle components. Create a bar plot to show the percent variation of the screeplot and add a line to that barplot to have a threshold of which principle components to focus on.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The percent variation within the screeplot is the same for the principle components 2-10. Principle component 1 has the highest percent variation which demonstrates that we should focus on that component and compare it to principle component 2. This is because the line on the barplot shows us which principle components to focus on. According to the line, we should focus on anything above that line. However, PC2 has the same percent variation as PC3-10. This means we can just compare PC1 to PC2 to get the greatest variation difference from the data.
Using the function ‘biplot()’ create a biplot of the PCA scaled.
biplot(pca_scaled)
Creating a biplot of the scaled PCA data will be a bad idea because each of the vectors of the biplot will correspond to a SNP and since there are so many SNPs, it will be hard to interpret and read the data.
Calculating the scores of the scaled PCA data using the function ‘vegan::scores()’.
pca_scores <- vegan::scores(pca_scaled)
Creating a new vector that contains the population ID.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Creating a data frame with the pop_id and the pca_scores using the ‘data.frame()’ function and saving that data frame into an object.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Using the function ‘ggpubr:: ggscatter()’ we can create a scatter plot of the scores of the PCA data where the axes are PC2 and PC1. The color and shapes on the scatter plot correspond to the population ID.
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA scatterplot")
TODO: INTERPRET PLOT
The scatter plot shows us that most populations are clustered close to each other. The populations Cau,Sub and Div are strongly correlated with PC1. The other two populations which are Alt and Nel are strongly correlated with PC2.