The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Warning: package 'vegan' was built under R version 4.2.2
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/biosc1540/FinalProject")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Adjusting the data so that the mean = 0 and the stdev = 1. This makes the data more standard and readable.
SNPs_scaled <- scale(SNPs_cleaned)
running prcomp() to obtain a PCR object for the data.
pca_scaled <- prcomp(SNPs_scaled)
Using screeplot to determine which principal components (PCs) are different enough form each other to be worth analyzing. This is especially important when working with many features/dimensions.
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Scaled SNPs Screeplot")
We can see that there is a large drop after PC1. Therefore, we might reason that the other PCs contain less variation between features and are less important to assess.
Extracting information on how much variation is present in each PC using summary(), then defining a function to assess importance of each PC (explained variation).
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
Normally, all bars above the red line should be considered as important. However, there seems to be some sort of issue in the provided code. A rule of thumb that can be used, is that any PC with a % explained variation higher than 100/(# of features) should be analyzed, as they have more than their proportional amount of variation.
Features on the biplot that are parallel or oppositely aligned exhibit strong positive/negative correlation respectively. Meanwhile, features which are orthogonal to each other will be unrelated.
biplot(pca_scaled)
This biplot is basically impossible to read- there are too many labels and numbers overlapping each other, and it should be cleaned up a bit more.
using vegan to make a plot that is more readable and customizable.
pca_scores <- vegan::scores(pca_scaled)
Adding categories to the data so that we can later color code this information
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
adding the population id codes to the pca_scores data frame
pca_scores2 <- data.frame(pop_id,
pca_scores)
plotting the data from the PCA scores, with the population IDs as a color and shape marker
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA Scatterplot SNPs")
This plot shows 2 major groups, that seem to align closely by population. Cau and Div are on one end of PC1, while Alt, Nel, and Sub lie on the other end. Among those 3 groups, Sub seems to be a bit further from Nel and Alt along PC2, the Y-axis. We can safely differentiate 2-3 clusters from this PCA biplot.