Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Desktop/R/BIOSC1540")
list.files(pattern=".csv")

## [1] "SNPs_cleaned.csv"        "walsh2017morphology.csv"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Data scaling

Scale the data using the scale() function to change the data to mean = 0 and sd = 1. Scale also removes any inherent magnitude differences that could affect the PCA.

SNPs_scaled <- scale(SNPs_cleaned)

PCA

Use the prcomp function to perform PCA analysis.

pca_scaled <- prcomp(SNPs_scaled)

PCA Analysis

Use screeplot() to view the importance of each principal component

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Principal Component Importance")

There is a massive drop off in importance after principal component 1. PC1 can explain a lot of the data and is the one most worth investigating.

PCA Variation

Identify the percent of variation in the data explained by each principal component

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

PC1 explains the greatest amount of variation in the data at >15%. The following PC describe a very low amount of the variation at <5%.

Biplot

Use biplot() to generate the biplot of the data.

#biplot(pca_scaled)

THIS IS NOT A GOOD IDEA. Since we have way over 1000 SNPs, we are going to end up with one vector for each SNP, resulting in an illegible plot.

PCA scores

use scores() to get the % of variation explained by the PCAs. Save to pca_scores

pca_scores <- vegan::scores(pca_scaled)

Create a vector with the sample names.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create a dataframe with pop_id an pca_scores

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Visualization of PCA variation

Use a scatterplot to view how the variation of PC1 compares to PC2 for each sample with color code based on the population ID

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.2% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PC1 vs PC2 variation")

There are about 3 distinct clusters. ‘Alt’ and ‘Nel’ are clustered in the top left. ‘Sub’ is clusterd by itself in the lower left. ‘Div’ and ‘Cau’ are clustered in the middle on the right. The middle right cluster is particularly well explained by PC1.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23