Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-2

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "C:/Users/afhar/Desktop/Comp Bio"

list.files(pattern="cleaned")

## [1] "SNPs_cleaned.csv"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale Data

Data needs to be scaled before further analysis

SNPs_scaled <- scale(SNPs_cleaned)

Run PCA

Use prcomp on scaled data to run PCA

pca_scaled <- prcomp(SNPs_scaled)

Create screeplot of PCA data

Scree plots are used to see how much variation is represented by each PC. Based on the scree plot, it can be decided how many PCs to keep in ongoing analysis, with the more variation (higher bars) being kept.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Scrreplot of pca_scaled")

This screeplot shows a steep drop off of iportance after PC1, meaning the majority of variation is found in PC1.

###Analyze PC data

Summary function can also be used to extract then analyze the variation of each PC, and which to keep in further analysis.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)
var_out

##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 
## 19.9  2.3  2.2  2.1  2.0  2.0  2.0  1.9  1.8  1.8

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

This shows how each PC’s variation compares against if each PC represented the same amount of variation. PC1 is still shown to contain the most amount of variation.

Biplot of Data

Biplots are often used to compare vectors to PCs to show how data is correlated, or not correlated to each other

biplot(pca_scaled)

SNP data does not use biplot as a visual interpretation of the data because of the sheer amount of data being analyzed. It ends up becoming a huge undecipherable mess of numbers and arrows that do not help with analysis.

Character Data

Character data was previously removed to complete PCA analysis, but it can now be put back.

pca_scores <- vegan::scores(pca_scaled)

Creating a vector with character data allows for it to be re-attached to data

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

A new dataframe is created holding the names and PCA scores of the data together.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Scatterplot

A final scatterplot of data is created to allow interpretation of data collected and processed.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Analysis of Data")

This final scatterplot shows how each type of bird data was collected from compares to the other types. Ellipses can be added to show groupings within the data.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23