Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Warning: package 'vegan' was built under R version 4.2.2

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scaling the SNPS

Using the function ‘scale()’, the data is centered around the mean and divided by the standard deviation. This makes the data easier to work with and this data is then saved to an object.

SNPs_scaled <- scale(SNPs_cleaned)

Applying PCA

Using the function ‘prcomp()’ PCA analysis is applied to the data. The PCA of the data is then saved to an object.

pca_scaled <- prcomp(SNPs_scaled)

Creating a screeplot

Use the function ‘screeplot()’ to get a screeplot of the data. A screeplot plots the principle components, showing us which components have the most variability from one another which are the ones to focus on.

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Screeplot of PCA on SNPs")

The principle component 1 has the most relative importance as it is the most unique of all the principle components. All of the other principle components are very similar too each other, making their relative importance the same and much lower compared to principle component 1. When comparing the principle components to each other we should compare PC1 to PC2 as there is a great difference in variability and since PC2 shares the same variability as PC3-PC10.

Explaining the percent variation

Use the function ‘summary()’ to get the summary of the PCA scaled data and save that information to an object to be used in the function that will explain the variation of the principle components. Create a bar plot to show the percent variation of the screeplot and add a line to that barplot to have a threshold of which principle components to focus on.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The percent variation within the screeplot is the same for the principle components 2-10. Principle component 1 has the highest percent variation which demonstrates that we should focus on that component and compare it to principle component 2. This is because the line on the barplot shows us which principle components to focus on. According to the line, we should focus on anything above that line. However, PC2 has the same percent variation as PC3-10. This means we can just compare PC1 to PC2 to get the greatest variation difference from the data.

Creating a biplot

Using the function ‘biplot()’ create a biplot of the PCA scaled.

biplot(pca_scaled)

Creating a biplot of the scaled PCA data will be a bad idea because each of the vectors of the biplot will correspond to a SNP and since there are so many SNPs, it will be hard to interpret and read the data.

Calculating the scores of the scaled PCA data using the function ‘vegan::scores()’.

pca_scores <- vegan::scores(pca_scaled)

Creating a new vector that contains the population ID.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Creating a data frame with the pop_id and the pca_scores using the ‘data.frame()’ function and saving that data frame into an object.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Creating a scatter plot

Using the function ‘ggpubr:: ggscatter()’ we can create a scatter plot of the scores of the PCA data where the axes are PC2 and PC1. The color and shapes on the scatter plot correspond to the population ID.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA scatterplot")

TODO: INTERPRET PLOT

The scatter plot shows us that most populations are clustered close to each other. The populations Cau,Sub and Div are strongly correlated with PC1. The other two populations which are Alt and Nel are strongly correlated with PC2.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23