Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Centering the Data

Cleaned data is centered by subtracting the mean and dividing by the standard deviation for each feature (SNP). Without scaling, one variable may dominate the PCA procedure and the plot will really just be visualizing that one dimension.

SNPs_scaled <- scale(SNPs_cleaned)

Run a PCA Analysis

Using the clean, scaled data, we run a PCA calculations. The original features are modified (re-engineered) to a number or orthogonal (uncorrelated) features.

pca_scaled <- prcomp(SNPs_scaled)

PC % Variation Screeplot Analysis

We need to assess the relative importance of each principal component in representing the variation in the data. By creating a scree plot, we can determine which PCs we should include in our analysis.

screeplot(pca_scaled,
          main = "Screeplot of Bird SNPs",
          ylab  = "Relative importance")

PC1 is responsible for the bulk of the variation in the data. PC2 through PC10 are all similar in importance, but are a steep drop off from PC1. Therefore, we are alright only looking at PC1 & PC2.

Evaluating the Amount of Variation

We are using the summary() command a the function PCA_variation to extract infrormation on explained variation. By using the general rule of thumb, 100/(# of dimensions), we can determine which PCs we should consider.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

Since there are 1338 columns, any PC that accounts for more than 0.07% of the variation is considered important based on the general rule. PC1 through PC10 are all above this threshold. Of not, PC1 explains roughly 10x more of the variation than PC2.

Creating a Biplot

We are plotting all 1338 features (SNPs).

biplot(pca_scaled)

This biplot is a horrendous way to visualize the data. Since there are 1338 features, the graph is too busy to gain any insight into what is going on.

Extracting PCA Scores

We are running PCA calculations to get the PCA scores on the original features. This reengineers the data so that all the redundancies are gone.

pca_scores <- vegan::scores(pca_scaled)

Here, we are creating a vector of character data containing the species name for each bird in the sample. PCA calculations only operate on numeric data, but we can add a categorical label to the plot once scores are gathered.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

We are creating a data frame that has the PCA scores and the categorical label, species. We need them in one place so that our plot can be color-coded based on the species.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Plotting the Data

We are plotting the PCA scores. The species of each bird is indicated by shape & color.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "Variation in Songbird Subspecies by SNP Data")

PC1 accounts for 19.9% of genetic variation, and along the x-axis there are two groups sticking out to me. Sub, Alt, & Nel are towards the left whereas Cau & Div are way over to the right. W are investigating 2 bird species, A.nenlsoni & A.caudactus. Therefore, it is not surprising that Sub, Alt, & Nel (subspecies of A.nenlsoni) & Cau & Dev (subspecies of A.caudacutus) are two separate “clusters.” PC2 only makes up 2.3% of the explained variation so separation along the y-axis is not as meaningful. The 3 sub-species withing A.nenlsoni are spread out from each other along the y-axis. Cau & Dev are really similar along the y-axis. This suggests that there is more variation within A.nenlsoni compared to A.caudacutus, but not as much as there is between the 2 species.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23