The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv
, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale()
)prcomp()
)screeplot()
)vegan::scores()
)In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/Desktop/R/BIOSC1540")
list.files(pattern=".csv")
## [1] "SNPs_cleaned.csv" "walsh2017morphology.csv"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Scale the data using the scale() function to change the data to mean = 0 and sd = 1. Scale also removes any inherent magnitude differences that could affect the PCA.
SNPs_scaled <- scale(SNPs_cleaned)
Use the prcomp function to perform PCA analysis.
pca_scaled <- prcomp(SNPs_scaled)
Use screeplot() to view the importance of each principal component
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Principal Component Importance")
There is a massive drop off in importance after principal component 1. PC1 can explain a lot of the data and is the one most worth investigating.
Identify the percent of variation in the data explained by each principal component
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
PC1 explains the greatest amount of variation in the data at >15%. The following PC describe a very low amount of the variation at <5%.
Use biplot() to generate the biplot of the data.
#biplot(pca_scaled)
THIS IS NOT A GOOD IDEA. Since we have way over 1000 SNPs, we are going to end up with one vector for each SNP, resulting in an illegible plot.
use scores() to get the % of variation explained by the PCAs. Save to pca_scores
pca_scores <- vegan::scores(pca_scaled)
Create a vector with the sample names.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Create a dataframe with pop_id an pca_scores
pca_scores2 <- data.frame(pop_id,
pca_scores)
Use a scatterplot to view how the variation of PC1 compares to PC2 for each sample with color code based on the population ID
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.2% variation)",
ylab = "PC2 (2.3% variation)",
main = "PC1 vs PC2 variation")
There are about 3 distinct clusters. ‘Alt’ and ‘Nel’ are clustered in the top left. ‘Sub’ is clusterd by itself in the lower left. ‘Div’ and ‘Cau’ are clustered in the middle on the right. The middle right cluster is particularly well explained by PC1.