The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Warning: package 'vegan' was built under R version 4.2.2
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.2
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.2.2
Set the working directory
setwd("C://Users//Benj//Desktop//rrrr")
Load the data
SNPs_cleaned <- read.csv(file = "data//SNPs_cleaned.csv")
#warning("If this didn't work, its may be because you didn't set your working directory.")
TODO: PCA is almost always done with scaled data for consistency
SNPs_scaled <- scale(SNPs_cleaned)
TODO: Base R implementation of PCA.
pca_scaled <- prcomp(SNPs_scaled)
TODO: Visually represent the influence the components have on the data
TODO: Sources of variance in SNPs
screeplot(pca_scaled,
ylab = "Relative importance",
main = "screeplot SNP Variation")
TODO: This shows the variance in a data set that can be attributed to a single factor
TODO: Get the data for the PCA values from a summary of the analysis
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained, 1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled, PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation bar plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: The bar plot shows that there is a single component that explains ~20% of the variance
TODO: show how features relate to the PC1/2s
biplot(pca_scaled)
TODO: EXPLAIN WHY THIS IS A BAD IDEA Including the data labels and/or vectors make the plot unreadable
TODO: Extract PCA scores with the scores() function
pca_scores <- vegan::scores(pca_scaled)
TODO: Categorical values can be applied to the data set now PCA is done.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: Takes the PCA and categorical data are recombined into a data frame
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Allows visualization of the results with their categorical labels to identify clusters
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
x_axis <- paste("PC1 (",var_out[1],"% variation)", sep = "")
y_axis <- paste("PC2 (",var_out[2],"% variation)", sep = "")
title <- "Genetic Diversity in Birds by SNP Analysis"
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = x_axis,
ylab = y_axis,
main = title)
TODO: There is a clear division between the 2 species by PC1. While A. nenlsoni appears to have several unique genetic clusters, A. caudacutus has no clear genetic clustering of its subspecies.