Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale data

Use scale() to scale the columns of the matrix, SNPs_cleaned, into an object called SNPs_scaled

SNPs_scaled <- scale(SNPs_cleaned)

PCA Analysis

Us prcomp() to perform a PCA on the scaled SNP data

pca_scaled <- prcomp(SNPs_scaled)

Make Screeplot

Use screeplot() to create a scree plot with the scaled pca data

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Eigenvalues after PCA")

This scree plot gives the relative importance of all the principal components. Every single PC except for PC1 has the relatively same value, which is why they can be ignored and we will focus on using just PC1

Graph the percent variation

First, get a summary of the scaled pca data, and then find the percentages of importance. Create a barplot that graphs the PCs on the x axis and the percent variation explained on the Y axis.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

PC1 has the highest percent variation explained as seen in the barplot. Because the curve dips after PC1, we can ignore the otheer PCs to make less cluttered, hence dimension reduction. There is a low percent variation for all the PCs other than PC1, so the data is more consistent and does not need a PCA analysis. Because PC1 has a high variability, the points would be spread apart and can be compared to PC2.

Create biplot

Use biplot() to create a biplot of the scaled pca data

biplot(pca_scaled)

Biplots show both PC scores of samples and loading of variables, which can make the plot quite cluttered

Create data frame between pca scores and population id

TODO: EXPLAIN

pca_scores <- vegan::scores(pca_scaled)

This is a vector of all the different identifications of the population data. This will be used to create a data frame against the pca scores

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create a data frame using the pop_id and pca scores vectors

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Create Scatterplot

Use ggscatter from ggpubr library to screate a scatterplot with the pca scores data.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "SNPs PCA Scatterplot")

TODO: INTERPRET PLOT

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23