Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Warning: package 'vegan' was built under R version 4.2.2

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/biosc1540/FinalProject")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Data scaling

Adjusting the data so that the mean = 0 and the stdev = 1. This makes the data more standard and readable.

SNPs_scaled <- scale(SNPs_cleaned)

PCR

running prcomp() to obtain a PCR object for the data.

pca_scaled <- prcomp(SNPs_scaled)

Screeplot

Using screeplot to determine which principal components (PCs) are different enough form each other to be worth analyzing. This is especially important when working with many features/dimensions.

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Scaled SNPs Screeplot")

We can see that there is a large drop after PC1. Therefore, we might reason that the other PCs contain less variation between features and are less important to assess.

Explained variation

Extracting information on how much variation is present in each PC using summary(), then defining a function to assess importance of each PC (explained variation).

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

Normally, all bars above the red line should be considered as important. However, there seems to be some sort of issue in the provided code. A rule of thumb that can be used, is that any PC with a % explained variation higher than 100/(# of features) should be analyzed, as they have more than their proportional amount of variation.

PCA biplot

Features on the biplot that are parallel or oppositely aligned exhibit strong positive/negative correlation respectively. Meanwhile, features which are orthogonal to each other will be unrelated.

biplot(pca_scaled)

This biplot is basically impossible to read- there are too many labels and numbers overlapping each other, and it should be cleaned up a bit more.

Custom PCA plot

using vegan to make a plot that is more readable and customizable.

pca_scores <- vegan::scores(pca_scaled)

Adding categories to the data so that we can later color code this information

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

adding the population id codes to the pca_scores data frame

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

PCA Scatterplot

plotting the data from the PCA scores, with the population IDs as a color and shape marker

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Scatterplot SNPs")

This plot shows 2 major groups, that seem to align closely by population. Cau and Div are on one end of PC1, while Alt, Nel, and Sub lie on the other end. Among those 3 groups, Sub seems to be a bit further from Nel and Alt along PC2, the Y-axis. We can safely differentiate 2-3 clusters from this PCA biplot.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23