The example is split into two parts. Part 1 must be completed first; it creates a file, SNPs_cleaned.csv, that has been fully prepared for analysis.
Now, in Part 2, you will analyze the data with PCA. The steps here will be:

1. scale()
2. prcomp()
3. screeplot()
4. vegan::scores()

All of the code is provided below. Your tasks will be to do 4 things:
Load the required packages (vcfR, vegan, ggplot2, and ggpubr) with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("/Users/adampowley/Desktop/School Stuff/Comp Bio/Port") # change this path to the folder that contains SNPs_cleaned.csv
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
We scale each variable to a standard range (mean 0, standard deviation 1) without altering the relationships within the data.
SNPs_scaled <- scale(SNPs_cleaned)
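As a quick sanity check (not part of the original workflow), each column of the scaled matrix should now have a mean of roughly 0 and a standard deviation of 1:

# Every column should be centered (mean ~0) and standardized (sd = 1)
summary(colMeans(SNPs_scaled))
summary(apply(SNPs_scaled, 2, sd))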
We find the principal components of the data, which attempt to represent the data along optimal axes.
pca_scaled <- prcomp(SNPs_scaled)
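Note that prcomp() can also center and scale the data itself; a minimal equivalent sketch (the pca_scaled_alt name is just for illustration and is not used below) would be:

# Equivalent: let prcomp() standardize the columns internally
pca_scaled_alt <- prcomp(SNPs_cleaned, center = TRUE, scale. = TRUE)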
To determine how many principal components are necessary to understand the general structure of the data, we look at the scree plot of the principal components.
screeplot(pca_scaled,
          ylab = "Relative importance",
          main = "Scree plot of PC for SNP data")
The first principal component summarizes most of the multivariate data.
Here we see which principal components summarize the data best, via their ability to explain a large percentage of the variation.
summary_out_scaled <- summary(pca_scaled)
# Percent of variation explained by the first `PCs` principal components
# (row 2 of summary()$importance is "Proportion of Variance")
PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2, 1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)  # number of SNP columns
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
# Reference line: the percent each PC would explain if all columns contributed equally
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The first principal component explains by far the most variation, sitting well above the red reference line that marks an equal share per column.
A biplot will help visualize the results of our PCA, including the original features' weights (loadings) on the PCs.
biplot(pca_scaled)
There are so many SNP variables being considered that visualizing how each one affects the PCA in the biplot is impractical.
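If you do want to see which SNP columns drive the first principal component, one option (a sketch, not part of the original workflow) is to inspect the loadings numerically rather than graphically:

# Loadings (rotation matrix): one row per SNP column, one column per PC
loadings_PC1 <- pca_scaled$rotation[, 1]
# The ten columns with the largest absolute contribution to PC1
head(sort(abs(loadings_PC1), decreasing = TRUE), 10)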
Here we extract the scores of our PCA.
pca_scores <- vegan::scores(pca_scaled)
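Optionally, peek at the first few rows of the scores (one row per sample; the PC1 and PC2 columns are used below):

head(pca_scores)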
Here we create a vector of the real population IDs.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Here we create a data frame containing the population IDs with scores to visualize the effect of our PCA.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Here we finally visualize how effective our PCA was at associating like populations based on SNP data.
TODO: UPDATE PLOT WITH TITLE
TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (xx% variation)",
                  ylab = "PC2 (xx% variation)",
                  main = "PCA results from SNPs with real population identity")
The populations Div and Cau are fairly homogeneous in our PCA, but both are separated from the rest of the data by a large margin. Alt and Nel are also fairly homogeneous, although Nel is more widely spread. The PCA does a good job of segregating the Sub population from the others.