The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "C:/Users/Ian/OneDrive - University of Pittsburgh/Academic/Fall 2022/Comp bio/Final Project/VCF practice"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Scale SNPs
SNPs_scaled <- scale(SNPs_cleaned)
TODO: perform PCA on scaled data
pca_scaled <- prcomp(SNPs_scaled)
TODO: Evaluate Significance of each PC component via a screeplot
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Screeplot of PCA")
TODO: PC1 has SIGNIFICANTLY more variation than any of the other. Every other PC about equal and very insignificant
TODO: Analyze the variation the can be explained by each PC
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
var_out[1]
## PC1
## 19.9
var_out[2]
## PC2
## 2.3
TODO: Only PC 1 even gets above 5% of variation explained meaning most of theese PCs are not significant for analysis
TODO: EXPLAIN
biplot(pca_scaled)
TODO: EXPLAIN WHY THIS IS A BAD IDEA
TODO: use the vegan::scored method instead
pca_scores <- vegan::scores(pca_scaled)
TODO: EXPLAIN
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: EXPLAIN
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: Scatter plot of PCA scores. Colored to show clusters of populations.
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.2% variation)",
main = "PCA scores labled based on population")
TODO: High correlation between Div and Cau populaton groups, as well as Alt and Nel. Sub population is also fairly correlated to the latter group considering the little significance of PC2 (vertical)