The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/BIOSC_1540/BIOSC1540Project")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
PCA data is always scaled. This is because there are many features in the dataset, and each have different ranges of values.
SNPs_scaled <- scale(SNPs_cleaned)
Run PCA on the scaled data using prcomp. Other packages can be used for this but we will be using prcomp for this example.
pca_scaled <- prcomp(SNPs_scaled)
Make a screeplot of the scaled PCA data. This will show the amount of variance explained for each PC and will tell which PCs are worth looking at.
screeplot(pca_scaled,
ylab = "Relative importance",
main = "PCA Screeplot")
As we can see from the screeplot, PC1 explains the most amount of variance.
Using summary on the scaled pca data shows information on the percentage of variation in the data. We then can plot this data to show which PCs we want to test.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The barplot shows we want to look at all PCs from 1-10 since they explain more than .075% of variance.
Run a biplot on the PCA data in order to gain insight into the relationships between the points.
biplot(pca_scaled)
Since the dataset has a ton of features, each one is plotted, making the graph hard to read with this mess.
The scores function displays the scores between the sites and PCs.
pca_scores <- vegan::scores(pca_scaled)
Create a vector of population ids so we can later use them to label the sites in scores.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Convert the scores into a dataframe using pop_id to show population id so we can easily plot the data in r.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Create a scatter plot to show the relationship between the different populations. This helps visualize the similarites between certain groups
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA")
PC1 is plotted on the x-axis and PC2 is on the y-axis. Div and Cau are highly correlated as they are clustered around each other. The other three populations are less similar between each other are they are clustered on the left side, with Alt and Nel closer related than they are to Sub.