The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("C:/Users/Owner/Downloads")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Centers and scales the columns, and in this case, the SNPs
SNPs_scaled <- scale(SNPs_cleaned)
TODO: prcomp performs a principal components analysison the SNPs_scaled object, assigning it to a new object
pca_scaled <- prcomp(SNPs_scaled)
TODO: Takes our scaled, principal component analyzed data and graphs it
##TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Relative importance of PC columns")
TODO: INTERPRET SCREEPLOT
TODO: Get a summary of our scaled analyzed data and submit the data to an object. Get our variation data and plot the percentage based on PC column
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: INTERPRET VARIANCE EXPLAINED
TODO: We are making a biplot of our scaled data
biplot(pca_scaled)
##TODO: EXPLAIN WHY THIS IS A BAD IDEA THIS IS A BAD IDEA BECAUSE ALL OF THAT DATA IS SMUSHED TOGETHER YOU CAN’T TELL WHATS THERE!
TODO: We are specifically asking for the vegan package to be able to access the scores/site scores of the pca_scaled and setting it as the basic pca_scores object
pca_scores <- vegan::scores(pca_scaled)
TODO: We are creating a new vector
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Organize our pop_id and scores to a data frame, assign it to a specific object
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Make a scatterplot of our IDs and compare them from PC1 and PC2 in variation
##TODO: UPDATE PLOT WITH TITLE ##TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (20% variation)",
ylab = "PC2 (10% variation)",
main = "PC1 vs PC2 in Variation")
##TODO: INTERPRET PLOT
We can see that there is no point when both PC1 and PC2 have a variation percentage of 0. The more negative variation in PC2 equates to lower PC1 variation. However, when you compare true variation.. When PC1 increases in variation, PC2 remains around the point of 0-5% variation. While when PC2 increases in variation, PC1 remains at around -15% variation. PC1 has a more neutral result against PC2 when it increases but PC2 has a more negative result against PC1 when it increases.