The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "/Users/eeshamukherjee/Downloads/Computaional Bio 1/Final Project"
setwd("/Users/eeshamukherjee/Downloads/Computaional Bio 1/Final Project")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Prep the data for PCA by scaling it. Scaling means that we will be setting the SD to 0. (scale(snps))
SNPs_scaled <- scale(SNPs_cleaned)
Run prcomp() on the scaled snps to run a PCA. Assign it to pca_scaled.
pca_scaled <- prcomp(SNPs_scaled)
Plot the pca_scaled, that we created above, as a screeplot using screeplot().
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "SNP Relative Importance")
The screeplot tells us which SNPs have the highest importance compared to the other SNPs, as we can see, PC1 seems to have the most realtive importance.
Extract the amount of variation explained from the 10 PCs using PCA_variation funtion. First, we do it for 2 PCs, then for 10. Add a horozontial line using abline().
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The variance is set at the x axis since there is not alarge varienve seen between the PCs overall.
Using biplot(), create a biplot for our scaled pca.
biplot(pca_scaled)
Our data has multiple dimensions, therefore the biplot will be too jumbled to properly depict our scaled pca.
Get PCA scores using vegan::scores on the scaled pca.
pca_scores <- vegan::scores(pca_scaled)
Create a vector of the population IDs for each pca score.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Create a df using data.frame() of the population ids and pca scores, and assign it to pca_scores2.
pca_scores2 <- data.frame(pop_id,
pca_scores)
We can plot our data and compare the variance between PC1 and PC2 using a scatterplot.
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
var_out <- PCA_variation(summary_out_scaled,PCs = 1)
print(var_out)
## [1] 19.9
var_out <- PCA_variation(summary_out_scaled,PCs = 2)
print(var_out)
## PC1 PC2
## 19.9 2.3
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PC1 vs PC2 Variance")
We can see that the data plotted has 3 groups. The “Sub” group forms near -10 on the X-axis and goes up to about -3 on the Y-axis (purple). The “Alt” and “Nel” groups are a second group that groups together still on -10 by the X-axis, but higher up on the Y-axis ( about 0 to 10 for PC2). “Cau” and “Div” are on the opposite side of the scatterplot.