The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("/Users/mansiavunoori/Downloads")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Scale data for PCA (this can be done before or after dealing with NAs and other data preparation issues)
SNPs_scaled <- scale(SNPs_cleaned)
TODO: Run a PCA with prcomp()
pca_scaled <- prcomp(SNPs_scaled)
TODO: A scree plot is the typical tool for this because the base for a scree plot is basic R.
screeplot(pca_scaled,
ylab = "Relative importance",
main = "PCA on SNPs")
This scree plot shows how much variation each PC captures from the data (in this case PC1 is the highest). The bars on the default R scree plot represent the relative importance of the PCs in representing the data. More specifically, they are proportional to the amount of variation in the data captured by each PC. The more variation represented by a PC, the more important it is.
TODO: Get the information on the percentage of variation in the data
using the summary() command and store it to an object.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: We use a function to extract the information on explained variation. Make sure to always return. Next, the var_out function allows us to specify how many PCs we want. We are getting the first 10 from the morphological data. Theb these percentages are reported as labels on biplots and scatterplots of PCA results. Instead of the default screeplot, we’ll use the barplot() function, to make the y-axis the percent of variation captured by each PC.
TODO: Now we’ll make the biplot and look at the biplot and interpret the relationship between the features.
biplot(pca_scaled)
TODO: The arrows may not get plotted.
TODO: Get the scores by calling them
pca_scores <- vegan::scores(pca_scaled)
TODO: Entered each one into vectors for the plot
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Combine the scores with the species information into a dataframe
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Plot the scores, with species color-coded
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "Custom PCA scores and variations")
The plot displays the amount of variation explained by each PC1 and PC2. The raw data of these features are going to be highly correlated with each other. The data is very polar to either the left or right side of the scatterplot.