The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "/Users/adetayoadenekan/Downloads"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: the data is scaled using the scale() function which makes the values numerical, which can then be used by R
SNPs_scaled <- scale(SNPs_cleaned)
TODO: prcomp is used to jumpstart the pca analysis by calculating the principle componanrts of the analysis
pca_scaled <- prcomp(SNPs_scaled)
TODO: the screeplot is able to plot multiple pcs at a time to compare the data and see where its redundant so it can be removed
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "bird genes")
TODO: the screeplot has 9 pcs plotted, but only the first one is really important because they all have the same height after the first one. The jump also lets us know that the other bars in the screeplot are redundant
TODO: summary is called on pca scalled to see the components of the scaled data. function is then used to create a return statemnt for the code to run indefinetly
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: there is little varaince because there is a huge jump in the graph from pc1 to pc2
TODO: the biplot plots pc1 against pc2 to see wheree most of the data lies when comapred
biplot(pca_scaled)
TODO: there are too many points and its very hard to read the graph
TODO: this is done to make it easier to convert the data into a dataframe
pca_scores <- vegan::scores(pca_scaled)
TODO: EXPLAIN
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
TODO: pca scores2 saves the data frame from pca scores to make into a scatterplot
pca_scores2 <- data.frame(pop_id,
pca_scores)
TODO: population density
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (xx% variation)",
ylab = "PC2 (xx% variation)",
main = "Population Density")
TODO: the populations with simialr genes are grouped together meanwhile those that are not simialr are farther away