The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("~/Library/Mobile Documents/com~apple~CloudDocs/Pitt/BIOSC 1540/final stuff")
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: center the data around the mean with the scale() function and assign to the SNPs_scaled data frame.
SNPs_scaled <- scale(SNPs_cleaned)
TODO: run a pca analysis on the data that we scaled in the previous step with prcomp(). assign this value to a variable named ‘pca_scaled’.
pca_scaled <- prcomp(SNPs_scaled)
TODO: create a screeplot off the pca_scaled using screeplot().
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "Scree plot for bird data")
TODO: by looking at the plot we can see that only PC1 is important for further use; this is because there is a large drop off from the value of PC1 to PC2. PC1 is the most important.
TODO: get the information on variation using summary(). store this in an object named ‘summary_out_scaled’
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: the red horizontal line in the scree plot is calculated as 100 / (the number of dimensions in the data). there are 50 SNPs so that is why the line is at 2, or 2%.
TODO: Makes a biplot() of the SNPs on the PCA data. this also creates a vector for each SNP (so 50 vectors).
biplot(pca_scaled)
TODO: This is a bad idea because it makes the data very hard to interpret because it is just a large red cluster. You can’t even see the original points.
TODO: Get the PCA scores using vegan::scores. store in an object named ‘pca_scores’.
pca_scores <- vegan::scores(pca_scaled)
TODO: this creates a vector named ‘pop_id’ that has a code for all of the samples. this helps to make analysis of data easier.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: combine the scores into a data frame named ‘pca_scores2.’
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: the scores are color coded by pop id. the amount of variation (of PCs) is displayed in the axis label.
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.9% variation)",
ylab = "PC2 (2.3% variation)",
main = "PCA scatterplot of birds")
TODO: the birds with ‘sub’ are the most isolated specices/ in its own seperate cluster. Alt and NeI are related. Cau and Div are related. this is across both PC1 AND PC2. Cau and Dive are also further isolated (amongst them selves) compared to the other 3 pops.