Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "/Users/adetayoadenekan/Downloads"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scale and center the data to make it R compatible

TODO: the data is scaled using the scale() function which makes the values numerical, which can then be used by R

SNPs_scaled <- scale(SNPs_cleaned)

TODO: Start PCA using prcomp

TODO: prcomp is used to jumpstart the pca analysis by calculating the principle componanrts of the analysis

pca_scaled <- prcomp(SNPs_scaled)

TODO: Screeplot compares scaled data

TODO: the screeplot is able to plot multiple pcs at a time to compare the data and see where its redundant so it can be removed

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "bird genes")

TODO: the screeplot has 9 pcs plotted, but only the first one is really important because they all have the same height after the first one. The jump also lets us know that the other bars in the screeplot are redundant

TODO: use summary to look at the components of the data set

TODO: summary is called on pca scalled to see the components of the scaled data. function is then used to create a return statemnt for the code to run indefinetly

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: there is little varaince because there is a huge jump in the graph from pc1 to pc2

TODO: biplot visualizes variance data

TODO: the biplot plots pc1 against pc2 to see wheree most of the data lies when comapred

biplot(pca_scaled)

TODO: there are too many points and its very hard to read the graph

TODO: make scaled data into scored data

TODO: this is done to make it easier to convert the data into a dataframe

pca_scores <- vegan::scores(pca_scaled)

TODO: EXPLAIN

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: pca scores2 saves the data frame from pca scores to make into a scatterplot

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: make scatter plot with pca score2 data

TODO: population density

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (xx% variation)",
                  ylab = "PC2 (xx% variation)",
                  main = "Population Density")

TODO: the populations with simialr genes are grouped together meanwhile those that are not simialr are farther away

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23