Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/BIOSC_1540/BIOSC1540Project")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scaling Data

PCA data is always scaled. This is because there are many features in the dataset, and each have different ranges of values.

SNPs_scaled <- scale(SNPs_cleaned)

PCA

Run PCA on the scaled data using prcomp. Other packages can be used for this but we will be using prcomp for this example.

pca_scaled <- prcomp(SNPs_scaled)

Screeplot

Make a screeplot of the scaled PCA data. This will show the amount of variance explained for each PC and will tell which PCs are worth looking at.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "PCA Screeplot")

As we can see from the screeplot, PC1 explains the most amount of variance.

Summarize

Using summary on the scaled pca data shows information on the percentage of variation in the data. We then can plot this data to show which PCs we want to test.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The barplot shows we want to look at all PCs from 1-10 since they explain more than .075% of variance.

Biplot

Run a biplot on the PCA data in order to gain insight into the relationships between the points.

biplot(pca_scaled)

Since the dataset has a ton of features, each one is plotted, making the graph hard to read with this mess.

Scores

The scores function displays the scores between the sites and PCs.

pca_scores <- vegan::scores(pca_scaled)

Create a vector of population ids so we can later use them to label the sites in scores.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Convert the scores into a dataframe using pop_id to show population id so we can easily plot the data in r.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

PCA Plot using IDs

Create a scatter plot to show the relationship between the different populations. This helps visualize the similarites between certain groups

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA")

PC1 is plotted on the x-axis and PC2 is on the y-axis. Div and Cau are highly correlated as they are clustered around each other. The other three populations are less similar between each other are they are clustered on the left side, with Alt and Nel closer related than they are to Sub.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23