The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Center the data by using scale() which sets mean to 1 and standard deviation to 0.
SNPs_scaled <- scale(SNPs_cleaned)
TODO: Runs a PCA analysis with prcomp() on the scaled data and save it to an object called pca_scaled.
pca_scaled <- prcomp(SNPs_scaled)
TODO: Creates a scree plot
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "SNPs Scree Plot")
TODO: The scree plot shows that PC1 and PC2 will be used for the remaining portion of the analysis and all the other PCs do not have useful information.
TODO: EXPLAIN
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: The percent variation shows that PC1 is significantly more important than the rest of the PCs.
TODO: Creates a PCA biplot to visualize relationship between PC1 and PC2
biplot(pca_scaled)
TODO: This is a bad because there are too many data points so it is hard to interpret and visualize what is happening in this biplot.
TODO: Extract PCA scores from scaled PCA data with vegan::scores() and store into an object called pca_scores
pca_scores <- vegan::scores(pca_scaled)
TODO: Create vector with all the population codes of the SNP data and store in an object called pop_id
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Convert PCA scores and population IDs to data frame and save in object called pca_scores2
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Create scatterplot of PCA scores with PC2 on the Y-axis and PC1 on teh X-axis.
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (20% variation)",
ylab = "PC2 (2.5% variation)",
main = "PCA Scatterplot")
TODO: Among the bird types, Cau and Div are the most correlated with one another and positively correlated with PC1. Alt and Nel are most correlated with each other and sightly correlated with PC2. Sub is negatively correlated with PC1 and PC2.