Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.1.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.5-7

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.2

library(ggpubr)

Set the working directory

setwd("~/Downloads/uta_compbio")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale Data

Using the scale function, we center/scale our data for each SNP.

SNPs_scaled <- scale(SNPs_cleaned)

Run PCA

Using prcomp function to run PCA on our scaled data. This allows us to examine the data in few dimensions instead of comparing all the features.

pca_scaled <- prcomp(SNPs_scaled)

Create Screeplot

Using the screeplot function, we see the importance of each PC

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Screeplot for PCs of Bird SNPs")

TODO: PC1 is the most important because it explains most of the variation of data, and the other PCs are relatively unimportant.

Find Percent Variation Explained of each PC

Creating a function PCA_variation and using it along with the summary function to get information about percent variation captured by each PC.We then plot this using the barplot function.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)
var_out

##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 
## 19.9  2.3  2.2  2.1  2.0  2.0  2.0  1.9  1.8  1.8

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

PC1 explains the most variation but PCs 1-10 all are above the threshold and considered important based on the rule of thumb, 100/number of dimensions.

Create Biplot

Using biplot function to plot all SNPs on a plot with PC1 and PC2 on each axis.

biplot(pca_scaled)

Bad idea because there are so many SNPs

Extract PCA scores

Using the scores function of the vegan package, we can get the PCA scores. This will allow us to remove repeititon in data

pca_scores <- vegan::scores(pca_scaled)

Creating a vector containing the species of each bird, allowing us to later add categorical, character labels.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Creating a dataframe with PCA scores and species labels.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Plotting Data

Plotting PCA scores, with different species categorized by shape/color

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "Variation in Bird Species")

We can see two clusters along PC1, which accounts for 19.9% variaiton, 1 with Div and Cau species, and the other with Sub, Nel, and Alt species.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23