The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/Desktop/Computational Biology Code:Screenshots")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Essentially we are preparing the data for PCA by scaling it. This means we will set the standard deviation of the data equal to 0.
SNPs_scaled <- scale(SNPs_cleaned)
TODO: We use prcomp() on SNPs_scaled to run the PCA for us.
pca_scaled <- prcomp(SNPs_scaled)
TODO: We used the command screeplot() to generate a screeplot of the PCA, this will tell us out of the generated principal components which one is the most important to analyze.
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Importance of each PCA")
TODO: When interpreting thre plot we see that the only important PC is PC1 since its relative importance is so much higher than PC2 and so on.
TODO: From our data we are able to extract the amount of variation explained from the 10 PCs
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
percent_variance <- var_out[c(1,2,3)]
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: From our barplot we see that the highest percent variance in data came from PC1 which is to no suprise since we know that PC1 has the most relative importance so it would have the highest percent variance.
TODO: We are making a biplot from our scaled pca data
biplot(pca_scaled)
TODO: This is a bad idea since our data is highly dimensional and it will looked very mushed up. So we won’t be able to make accurate conclusions from this biplot.
TODO: We are getting the PCA scores from our PCA, by using vegan::scores
pca_scores <- vegan::scores(pca_scaled)
TODO: We are reformatting and creating a vector full of the population IDs for each pca score.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: We are assinging population IDs to the pca scores that we extracted. This will create a whole dataframe.
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: We are using ggscatter to make a scatter plot of the population ID scores with PC1 and PC2
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
percent_variance[1]
## PC1
## 19.9
percent_variance[2]
## PC2
## 2.3
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PC1 vs PC2")
TODO: When we see the data plotted with PC1 and PC2 we see that there are formations of 3 distinct groups. The Sub group forms as 1 that is near the -10 area of both PC1 and PC2. While the Alt and Nel group are the second group that stay consistent on the -10 side of PC1 and will spread from 0 to 10 for PC2. The third group is Cau and Div on the opposite side of the scatterplot. We see that is on the 20 side of PC1 and stays relatively centered at 0 when looking at the PC2.