The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "/Users/jooppereira/Downloads"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
First, scaling the data must be completed before continuing with PCA. Use scale()
SNPs_scaled <- scale(SNPs_cleaned)
Start to run the PCA with prcomp() and the scaled data
pca_scaled <- prcomp(SNPs_scaled)
Create a screeplot using screeplot() to observe how many variables should be plotted
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Screeplot of Principal Components")
##Screeplot Interpretation: PC1 seems to be the only component that is worth studying since PC2-10 all are significantly lower on the scale of “relative importance”. For the biplot, PC1 and PC2 will be studied.
Create a function names PCA_Variation which will work on the summary data of the scaled PCA. Make a bar plot which measures the percent variation in the Screeplot which was made before.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The red line represents the percent variation, which remains flat, and very close to zero.
Create a biplot using biplot() to compare PC1 and PC2
biplot(pca_scaled)
This is a BAD idea because ther are so many points on the biplot, that you can not even distinguish what is happening, or any trends! Very clustered and can not make and conclusions about the data.
Using vegan::scores(), get the PCA scores on the scaled data.
pca_scores <- vegan::scores(pca_scaled)
Create a vector using c() which relate to the PCA scores
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Put the pop_id and pca_scores in a data frame together.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Create a scatterplot using ggpubr and ggscatter with the data frame just stored in the object pca_scores2. Color code the points by pop_id.
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (20.2% variation)",
ylab = "PC2 (2.3% variation)",
main = "Correlations Among Bird Populations")
Plot Interpretaion: Populations identified as “Alt” and “Nel” were very similar among principal components 1 and 2, the same can be said for “Cau” and “Div”. “Sub” on the otherhand, is more unique, its data shows that it is similar to “alt” and “nel” on the PC1 axis, yet differ on the PC2 axis.