The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "/Users/angelazhu/Downloads"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Center the data using the scale()
SNPs_scaled <- scale(SNPs_cleaned)
TODO: Run a PCA analysis using prcomp()
pca_scaled <- prcomp(SNPs_scaled)
TODO: Evaluate the scree plot from the PCA using screeplot()
TODO: PCA on scaled data
screeplot(pca_scaled,
ylab = "Relative importance",
main = "PCA on scaled data")
TODO: INTERPRET SCREEPLOT The scree plot shows the eigenvalues on the Relative importance and the number of factors on the PC1~10. It gives a downward curve. Between point PC1 and PC2 is where the slope of the curve is clearly leveling off.
TODO: Get the summary information and store it to an object called summary_out_scaled; Then, use a function to extract the information on explained variation we want.; Call PCA_variation(), with PCs = 10; Number of dimensions in the data and make barplot with LOBF.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
var_out
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 19.9 2.3 2.2 2.1 2.0 2.0 2.0 1.9 1.8 1.8
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: INTERPRET VARIANCE EXPLAINED The horizontal line in the barplot is calculated as 100/(# of dimensions in the data). These data have 10 columns, so if all the PCs were equal, they would each explain 10% of the variation.
TODO: Use biplot() to make the biplot
biplot(pca_scaled)
TODO: EXPLAIN WHY THIS IS A BAD IDEA The PCA biplot shows both PC scores of samples and loading of variables. This is a bad idea because we can’t analysis the data clearly from the plot.
TODO: Call vegan::scores() to get the scores
pca_scores <- vegan::scores(pca_scaled)
TODO: Use c() to combine the arguments and save it in an object called pop_id.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Call data.frame() to combine the scores with the species in formation into a dataframe.
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Call ggpubr::ggscatter() to plot the scores with species color-coded. Make color and shape = “pop_id”.
TODO: “PCA Scatterplot” TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA Scatterplot")
TODO: INTERPRET PLOT Based on the PCA Scatterplot, the Cau & Div have the higher correlation which is the stronger relationship. The Alt and Nel have the higher correlation which is the stronger relationship. The overall data point is not in a linear relationship, all the datas are on either left or right of the plot.