Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Desktop/Pitt/2nd/comp bio/Final Project/practice portfolios")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Center data

Prepare data for analysis by centering it on the mean + scaling it on the standard deviation.

SNPs_scaled <- scale(SNPs_cleaned)

Run PCA

Run PCA on the scaled data + save output to an object.

pca_scaled <- prcomp(SNPs_scaled)

Evaluate screeplot

Make a screeplot of the PCA object to determine the best # of PCs needed to evalute data.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
            main = "PCA Screeplot")

The screeplot exhibits a huge dropoff in “relative importance” after PC1, indicating that PC1 accounts for almost all of the variability in the data, and is likely the only PC needed for analysis.

Evaluate variation accounted for by PC1, PC2

Create + use a function to produce a barplot of the PCs vs. the % variation they account for. Use to determine the usefulness of each PC / how many are necessary for analysis.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The barplot once again displays a huge dropoff in % variation explained after the first PC, supporting the previous screeplot’s findings that only PC1 is useful/necessary for analyzing this dataset.

Evaluate biplot

Assess the relationships between each feature and PC1 + PC2 by creating a biplot.

biplot(pca_scaled)

Biplots display each feature/column as vectors on a graph, and thus they should only be used for datasets with limited numbers of features, or else the graph will become far too crowded and unreadable. Such is the case for this dataset, where there are so many vectors that the graph becomes clustered and useless to the reader.

Extract PCA scores

Use the vegan package to extract the PCA scores in a vector.

pca_scores <- vegan::scores(pca_scaled)

Create a vector containing each of the sample’s population IDs to use as labels.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Combine the population IDs and extracted PCA scores into a dataframe.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Evaluate PCA Scatterplot

Use ggpubr package to create a scatterplot of the dataset graphed on the new axes generated by PCA. Use the color + shape arguments to view the data categorized by a third feature/column.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Scatterplot")

Graphing the PC scores in a scatterplot allows us to see that, once again, the majority of variation in the data can be accounted for by PC1, as the datapoints predominantly vary along the X-axis (Y-axis = PC2). We can further view groupings in the data attributed to a third variable, which is included through color/shape.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23