Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Downloads/COMPBIO")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Center the Data

Here, we need to center the data before we perform PCA. To center the data we loaded, SNPs_cleaned.csv, we use the scale() function, which makes the mean equal to 1 and the standard deviation equal to 0.

SNPs_scaled <- scale(SNPs_cleaned)

Run PCA Analysis

Here, we use the base R function prcomp() on our scaled data stored in SNPs_scaled to run the PCA analysis. We then save the results to the object pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

Create a Scree Plot

Here, we use the function screeplot() on our PCA data stored in pca_scaled to make a scree plot, which shows us the variance between the multiple PCs. We labelled the y-axis “Relative importance” with the argument ylab and gave the plot a main title called “SNP Scree Plot” with the argument main.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "SNP Scree Plot")

The scree plot that prints as a result shows us that PC1 is the most important for analysis. This is determined due to the large drop between PC1 and PC2, as the PCs before the drop are the most important. As the researchers, we can now decide to perform the rest of our analysis using PC1 and PC2, since the other PCs will not give much useful information.

Evaluate the Variation

In the series of codes, we explain the amount of variation demonstrated by the different PCs shown in the scree plot. First, we save the summary statistics of the data to the object summary_out_scaled. Then, we create a function that will extract the information about the explained variation that we want to analyze. The function contains an argument that allows the researcher to chose how many PCs to plot. The default is 2, but in the next chunk of code, we run the function with 10 PCs, since that is how many are included in the scree plot. Lastly, we created a new scree plot that uses the y-axis as the percent variation captured by each PC using the function barplot().

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 100/N_columns*100, col = 2, lwd = 2)

The line that is added to the plot shows where the height of all the PCs should be if they were all equally important. As we can see, PC1 is significantly above this line, and the rest of the PCs are significantly below this line. This further shows that PC1 is the most important PC for analysis.

Create the PCA Biplot

Using the function biplot(), we create the biplot that demonstrates the relationship between the features of the SNP data against the two PC axes.

biplot(pca_scaled)

As we can see from the biplot, the information is hard to interpret and decipher. This is due to there being lots of different data points that cloud the plot. From this biplot, we cannot determine any noticeable relationships.

Extract PCA Scores

Here, we use the function vegan::scores() to extract the PCA scores from the PCA data we have stored in the object pca_scaled. We then saved this data to the object pca_scores. These scores can later be used to run PCA using the vegan package.

pca_scores <- vegan::scores(pca_scaled)

Here, we create a vector containing all of the population codes of the the corresponding samples of SNP data using the function c() and save the vector to an object called pop_id.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Here, we combine the population codes with the PCA scores in a data frame called pca_scores2 using the function data.frame(). We need the data in this structure to run PCA on it.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Plot the PCA

Using the ggpubr package, we use the function ggscatter() to create a scatterplot of the PCA scores along the PC1 and PC2 axes. We labelled the X and Y axes appropriately using the xlab and ylab arguments and made the shape and color of the data points vary based on their respective population codes using the arguments color and shape.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "SNP PCA Scatterplot")

The scatterplot shows the Cau and Div population codes most correlated with PC1, as they are farther along the x-axis. The Alt and Nel population codes seem to have slight correlations with PC2, and the Sub population seems to have negative correlations with both PC1 and PC2.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23