Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "/Users/eeshamukherjee/Downloads/Computaional Bio 1/Final Project"

setwd("/Users/eeshamukherjee/Downloads/Computaional Bio 1/Final Project")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scaling Our Data

Prep the data for PCA by scaling it. Scaling means that we will be setting the SD to 0. (scale(snps))

SNPs_scaled <- scale(SNPs_cleaned)

PCA Analysis

Run prcomp() on the scaled snps to run a PCA. Assign it to pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

Making a Screeplot

Plot the pca_scaled, that we created above, as a screeplot using screeplot().

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "SNP Relative Importance")

The screeplot tells us which SNPs have the highest importance compared to the other SNPs, as we can see, PC1 seems to have the most realtive importance.

Evaluation of the Variation

Extract the amount of variation explained from the 10 PCs using PCA_variation funtion. First, we do it for 2 PCs, then for 10. Add a horozontial line using abline().

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The variance is set at the x axis since there is not alarge varienve seen between the PCs overall.

Making a Biplot

Using biplot(), create a biplot for our scaled pca.

biplot(pca_scaled)

Our data has multiple dimensions, therefore the biplot will be too jumbled to properly depict our scaled pca.

Extract PCA Scores

Get PCA scores using vegan::scores on the scaled pca.

pca_scores <- vegan::scores(pca_scaled)

Create a vector of the population IDs for each pca score.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create a df using data.frame() of the population ids and pca scores, and assign it to pca_scores2.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Create a Scatterplot

We can plot our data and compare the variance between PC1 and PC2 using a scatterplot.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

var_out <- PCA_variation(summary_out_scaled,PCs = 1)
print(var_out)

## [1] 19.9

var_out <- PCA_variation(summary_out_scaled,PCs = 2)
print(var_out)

##  PC1  PC2 
## 19.9  2.3

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PC1 vs PC2 Variance")

We can see that the data plotted has 3 groups. The “Sub” group forms near -10 on the X-axis and goes up to about -3 on the Y-axis (purple). The “Alt” and “Nel” groups are a second group that groups together still on -10 by the X-axis, but higher up on the Y-axis ( about 0 to 10 for PC2). “Cau” and “Div” are on the opposite side of the scatterplot.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23