Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scaling Data

TODO: Center the data using the scale() function so that the mean becomes 0 and standard deviation becomes 1.

SNPs_scaled <- scale(SNPs_cleaned)

TODO: Run PCA

TODO: Run a Principle Component Analysis (PCA) on this scaled data using prcomp() and store the output in an object called pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

TODO: Creating a Screeplot

TODO: Make a screeplot to visualize the principle components and determine how many of the principle components should be looked at/explored in analysis.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Screeplot for Walsh Bird Data")

TODO: There is a big decrease in relative importance between PC1 and PC2, but not a huge difference between PC2 and PC3 and all subsequent PCs. For this PCA analysis, it would be appropriate to consider just PC1 and PC2, since it appears to capture most of the variation in the data.

TODO: Variance Explained

TODO: Get information on the percentage of variation in the data by calling summary() on the output of PCA analysis.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: Creating a Biplot

TODO: Make a biplot (visual representation) of the data using the output of PCA analysis and the biplot() function.

biplot(pca_scaled)

TODO: This is not a good idea because there are too many SNPs. The biplot would be nearly impossible to interpret and would not give us useful information for analysis and interpretation.

TODO: Set Up for Custom PCA Plot

TODO: Isolate the scores from the scaled PCA output into a vector called pca_scores.

pca_scores <- vegan::scores(pca_scaled)

TODO: Create a new vector, pop_id, that contains the population ID information for the data set.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Create a new data frame called pca_scores_2 that combines both the pop_id and pca_scores vectors.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Custom PCA Plot

TODO: Using the ggscatter() function from package ggpubr, make a plot to create a visual representation of the PCA data in a more readable format. The plot is color coded based on the population id.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Scatterplot")

TODO: Populations with ID Cau and Div seem to be correlated with each other along both PC1 and PC2; same with populations with ID Alt and Nel. Population Sub seems to be in its own “cluster” without overlap with any other population IDs. Cau and Div may be correlated with the other populations along PC2, but not PC1.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23