The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set and examine the working directory
setwd("~/Desktop/Computational Biology/Final Project/Software Checkpoints and Portfolios")
getwd()
## [1] "/Users/madelinefontana/Desktop/Computational Biology/Final Project/Software Checkpoints and Portfolios"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
Scale the cleaned SNPs data with scale() to center the
columns in the matrix
SNPs_scaled <- scale(SNPs_cleaned)
Perform a principal components analysis on the given scaled SNPs data
and return the results and assign it to pca_scaled
pca_scaled <- prcomp(SNPs_scaled)
Create a scree plot using screeplot() to plot variances
against the principle components
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Principle Component Analysis of SNPs Data")
TODO: INTERPRET SCREEPLOT
The most important component of analysis of the SNPs data is PC1, the first principle component because this had the highest reltive importance from the scree plot.
Call summary() on the scaled principle component
analysis results to evaluate the variation
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: INTERPRET VARIANCE EXPLAINED
The barplot shows us the variance in the scree plot and explains this by showing that PC1 has the highest percent variation explained.
Create a biplot from the scaled PCA data where vectors will be created to show how components are coorelated
biplot(pca_scaled)
TODO: EXPLAIN WHY THIS IS A BAD IDEA
This is a bad idea because it does not fully show how the components are related and is hard to read and analyze being that there is so much data that appears to overlap.
Use the vegan package to score the scaled PCA data, this
is used to analyze community ecology and uses multivaruate tools
pca_scores <- vegan::scores(pca_scaled)
Provide Ids for the populations after getting this output
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Create a data frame with these ids and PCA scores
pca_scores2 <- data.frame(pop_id,
pca_scores)
Use ggpubr::ggscatter() on the data frame containing the
scores and ids to visualize the PCA results and analyze percent
variation and create a scatterplot showing this
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "Principle Component Analysis Variation by Population")
TODO: INTERPRET PLOT
Most of the scored data populations lie toward the left of the plot around -100 PC1 and from -10 to 10 PC2, showing the variation the PCA results. The principle component analysis 1 shows 19.9% variation where the PC2 only show 2.3% variation.