The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)
Set the working directory
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Using the scale function to scale the cleaned data for PCA
SNPs_scaled <- scale(SNPs_cleaned)
TODO: Here we are using the prcomp function to run pca on the scaled data
pca_scaled <- prcomp(SNPs_scaled)
TODO: In this chunk we are creating a screeplot to see the importance of the different PCs
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "SNP Screeplot")
TODO: INTERPRET SCREEPLOT *Looking at the screeplot we can conclude that PC1 has the most importance relative to the other PCs, and we can also conclude that the rest of the PCs are relatively similar in importance
TODO: In this chunk of code we are creating a barplot to determine the varation that we see among the PCs
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: INTERPRET VARIANCE EXPLAINED *By observing the barplot, we can conclude that PC1 has the greatest variance among the data, meaning that PC1 is the category in which the data is the most different from one another. We can also conclude that the rest of the PCs are similar in variance
TODO: We are using the biplot function to plot our scaled pca data
biplot(pca_scaled)
TODO: EXPLAIN WHY THIS IS A BAD IDEA *This is a bad idea because of how unreadable the graph is. We are not able to interpret on the graph at all.
TODO: We are saving the PCA scores in a object pca_scores to set it up graphing it on a scatterplot
pca_scores <- vegan::scores(pca_scaled)
TODO: Creating a vector to save the population id
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Create a data frame with the pc scores and the population id
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Using the scatterplot function to plot the PCA scores for PC1 and PC2
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA Scores Scatterplot")
TODO: INTERPRET PLOT *Observing the plot, we can conclude that the population groups “Div” and “Cau” have similarities compared to the other groups. The rest of the population groups are different compared to “Div” and “Cau”. In the rest of the population groups we see some variance among the y-axis, however, the variance is not as significant in respect to the Y-axis.