The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/Desktop/Pitt/2nd/comp bio/Final Project/practice portfolios")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Prepare data for analysis by centering it on the mean + scaling it on the standard deviation.
SNPs_scaled <- scale(SNPs_cleaned)
Run PCA on the scaled data + save output to an object.
pca_scaled <- prcomp(SNPs_scaled)
Make a screeplot of the PCA object to determine the best # of PCs needed to evalute data.
screeplot(pca_scaled,
ylab = "Relative importance",
main = "PCA Screeplot")
The screeplot exhibits a huge dropoff in “relative importance” after PC1, indicating that PC1 accounts for almost all of the variability in the data, and is likely the only PC needed for analysis.
Create + use a function to produce a barplot of the PCs vs. the % variation they account for. Use to determine the usefulness of each PC / how many are necessary for analysis.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The barplot once again displays a huge dropoff in % variation explained after the first PC, supporting the previous screeplot’s findings that only PC1 is useful/necessary for analyzing this dataset.
Assess the relationships between each feature and PC1 + PC2 by creating a biplot.
biplot(pca_scaled)
Biplots display each feature/column as vectors on a graph, and thus they should only be used for datasets with limited numbers of features, or else the graph will become far too crowded and unreadable. Such is the case for this dataset, where there are so many vectors that the graph becomes clustered and useless to the reader.
Use the vegan package to extract the PCA scores in a vector.
pca_scores <- vegan::scores(pca_scaled)
Create a vector containing each of the sample’s population IDs to use as labels.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Combine the population IDs and extracted PCA scores into a dataframe.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Use ggpubr package to create a scatterplot of the dataset graphed on the new axes generated by PCA. Use the color + shape arguments to view the data categorized by a third feature/column.
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA Scatterplot")
Graphing the PC scores in a scatterplot allows us to see that, once again, the majority of variation in the data can be accounted for by PC1, as the datapoints predominantly vary along the X-axis (Y-axis = PC2). We can further view groupings in the data attributed to a third variable, which is included through color/shape.