The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "/Users/ethanfrank/Desktop/R"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Centers and scales the values in SNPs_cleaned and stores into SNPs_scaled object.
SNPs_scaled <- scale(SNPs_cleaned)
Run PCA on SNPs_scaled with prcomp() and store into pca_scaled object.
pca_scaled <- prcomp(SNPs_scaled)
Creates scree plot with scaled PCA data from pca_scaled.
screeplot(pca_scaled,
ylab = "Relative importance",
main = "pca_scaled Screeplot")
We can see that PC1 is of the most importance in the screeplot.
Take summary of pca_scaled to produce result summaries of the results of various model fitting functions and store them into summary_out_scaled. Create function called PCA_variation which takes a PCA summary argument that returns var_explained. Run PCA_variation on summary_out_scaled with 10 PCs and store in var_out object. Store number of columns from SNPS_scaled into N_columns object. Create barplot with var_out. Add abline at 1/N_columns*100.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
There is a lot of percent variation on PC1 compared to other PCs
Creates a biplot using data from pca_scaled.
biplot(pca_scaled)
There is little to no variation in every other PC other than PC1 so biplot will be very clumped together.
Extracts scores from pca_scaled using vegan package and stores them in a matrix object called pca_scores.
pca_scores <- vegan::scores(pca_scaled)
Create a vector with all population ids and store in pop_id object.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Create dataframe object using pop_id names and pca_scores matrix.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Creates scatter plot using pca_scores2 with PC1 and PC2.
TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (10% variation)",
ylab = "PC2 (10% variation)",
main = "pca_scores2 scatterplot variation")
Cau and Div have much more variation in PC1 compared to every other population id. There is little to no variation with PC2 between Cau and Div while other population ids span across PC2.