The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "C:/Users/afhar/Desktop/Comp Bio"
list.files(pattern="cleaned")
## [1] "SNPs_cleaned.csv"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Data needs to be scaled before further analysis
SNPs_scaled <- scale(SNPs_cleaned)
Use prcomp on scaled data to run PCA
pca_scaled <- prcomp(SNPs_scaled)
Scree plots are used to see how much variation is represented by each PC. Based on the scree plot, it can be decided how many PCs to keep in ongoing analysis, with the more variation (higher bars) being kept.
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Scrreplot of pca_scaled")
This screeplot shows a steep drop off of iportance after PC1, meaning the majority of variation is found in PC1.
###Analyze PC data
Summary function can also be used to extract then analyze the variation of each PC, and which to keep in further analysis.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
var_out
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 19.9 2.3 2.2 2.1 2.0 2.0 2.0 1.9 1.8 1.8
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
This shows how each PC’s variation compares against if each PC represented the same amount of variation. PC1 is still shown to contain the most amount of variation.
Biplots are often used to compare vectors to PCs to show how data is correlated, or not correlated to each other
biplot(pca_scaled)
SNP data does not use biplot as a visual interpretation of the data because of the sheer amount of data being analyzed. It ends up becoming a huge undecipherable mess of numbers and arrows that do not help with analysis.
Character data was previously removed to complete PCA analysis, but it can now be put back.
pca_scores <- vegan::scores(pca_scaled)
Creating a vector with character data allows for it to be re-attached to data
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
A new dataframe is created holding the names and PCA scores of the data together.
pca_scores2 <- data.frame(pop_id,
pca_scores)
A final scatterplot of data is created to allow interpretation of data collected and processed.
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA Analysis of Data")
This final scatterplot shows how each type of bird data was collected from compares to the other types. Ellipses can be added to show groupings within the data.