The example is split into two parts: Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been fully prepared for analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale()
prcomp()
screeplot()
vegan::scores()
In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory and confirm it with getwd()
getwd()
## [1] "/Users/neha/Desktop/Computational Biology/code"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
TODO: Scale the data using the scale() function before performing PCA.
SNPs_scaled <- scale(SNPs_cleaned)
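As an optional sanity check (a minimal sketch, assuming every column of SNPs_cleaned is numeric), the scaled columns should have means near 0 and standard deviations of 1:
round(colMeans(SNPs_scaled)[1:5], 3) # means of the first 5 columns, roughly 0
round(apply(SNPs_scaled[, 1:5], 2, sd), 3) # standard deviations of the first 5 columns, roughly 1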
TODO: Use the prcomp() function to perform PCA on the scaled data and assign to object pca_scaled.
pca_scaled <- prcomp(SNPs_scaled)
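If you want to see what prcomp() returned before plotting, an optional look at its components:
names(pca_scaled) # "sdev", "rotation", "center", "scale", "x"
dim(pca_scaled$x) # scores: one row per sample, one column per PC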
TODO: Use the screeplot() function to create the plot. This plot helps us determine how many of the new dimensions (PCs) should be retained.
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
          ylab = "Relative importance",
          main = "Screeplot of Scaled SNP Data")
TODO: INTERPRET SCREEPLOT The plot shows a sharp drop in “relative importance” after PC10 (no further bars are drawn). This indicates that the first 10 PCs are the most worthwhile to analyze further, as they come before the steep drop in the plot.
TODO: The relative importance of a PC on the screeplot is directly related to how much of the variation in the data that PC explains. Use summary() to get the percentage of variation explained, write a function with function() to extract the variation/importance information, and then call PCA_variation() with the number of PCs you want.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
  # Row 2 of the importance table is "Proportion of Variance"; convert to percent
  var_explained <- pca_summary$importance[2, 1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
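As an optional follow-up, summing the returned percentages shows how much of the total variation these 10 PCs capture together:
sum(var_out) # total percent of variation explained by the first 10 PCs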
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: This creates a bar plot of the percentage of variation explained by each of the 10 PCs using the barplot() function. The horizontal line is drawn at 100/(number of columns in the data), which is the percentage each PC would explain if every PC contributed an equal share of the variation; bars above this line explain more than that equal share.
TODO: Create a biplot for the scaled data and try to understand the relationships among the features.
biplot(pca_scaled)
TODO: Because there are so many features and data points, the arrows (which help show the relationships among the features) and many of the data points are clumped together, which makes the plot difficult to interpret.
TODO: Isolate the scores for the scaled PCA data into one object using the scores() function from the vegan package.
pca_scores <- vegan::scores(pca_scaled)
length(pca_scores)
## [1] 4624
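Note that scores() returns a matrix of coordinates rather than a single vector (length() counts every cell of the matrix), so an optional check of its shape is more informative:
dim(pca_scores) # rows = individual samples, columns = PC axes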
TODO: Create another vector pop_id isolating the population id using the c() function.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
length(pop_id)
## [1] 68
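An optional check: tabulating pop_id confirms how many samples were assigned to each population label.
table(pop_id) # number of samples per population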
TODO: Combine the two objects - pca_scores and pop_id - using the data.frame() function and assign the result to pca_scores2.
pca_scores2 <- data.frame(pop_id,
                          pca_scores)
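An optional peek at the first few columns of the combined data frame confirms the population labels now sit alongside the PC scores:
head(pca_scores2[, 1:3]) # pop_id plus the first two PC score columns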
TODO: Use the ggscatter() function to plot the pca_scores2 data, with color and shape coding based on pop_id.
TODO: UPDATE PLOT WITH TITLE
TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.32% variation)",
                  main = "PCA Scatter Plot")
TODO: The scatter plot shows the relationship between PC1 and PC2, with the data points grouped by population id. It focuses only on the population groupings across these two PCs, in contrast to the biplot, which showed all of the features and their relationships for PC1 and PC2. This plot is easier to interpret because the groupings are clear, whereas in the biplot the arrows showing the feature relationships were clumped together.
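If you want to emphasize the population groupings further, ggscatter() can also draw a confidence ellipse around each group. This is a minimal sketch using standard ggpubr arguments, with the same axis labels assumed as above:
ggpubr::ggscatter(data = pca_scores2,
                  x = "PC1",
                  y = "PC2",
                  color = "pop_id",
                  shape = "pop_id",
                  ellipse = TRUE, # draw an ellipse around each population
                  ellipse.level = 0.95, # 95% confidence ellipse
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.32% variation)",
                  main = "PCA Scatter Plot with Group Ellipses")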