Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Warning: package 'vegan' was built under R version 4.2.2

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.2.2

Set the working directory

setwd("C://Users//Benj//Desktop//rrrr")

Load the data

SNPs_cleaned <- read.csv(file = "data//SNPs_cleaned.csv")

#warning("If this didn't work, its may be because you didn't set your working directory.")

Data analysis

TODO: Scale Data

TODO: PCA is almost always done with scaled data for consistency

SNPs_scaled <- scale(SNPs_cleaned)

TODO: PCA Analysis

TODO: Base R implementation of PCA.

pca_scaled <- prcomp(SNPs_scaled)

TODO: Create scree plot

TODO: Visually represent the influence the components have on the data

TODO: Sources of variance in SNPs

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "screeplot SNP Variation")

TODO: This shows the variance in a data set that can be attributed to a single factor

TODO: Return PCA values

TODO: Get the data for the PCA values from a summary of the analysis

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled, PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation bar plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: The bar plot shows that there is a single component that explains ~20% of the variance

TODO: Create biplot of the first 2 pricipal components

TODO: show how features relate to the PC1/2s

biplot(pca_scaled)

TODO: EXPLAIN WHY THIS IS A BAD IDEA Including the data labels and/or vectors make the plot unreadable

TODO: PCA with the vegan package

TODO: Extract PCA scores with the scores() function

pca_scores <- vegan::scores(pca_scaled)

TODO: Categorical values can be applied to the data set now PCA is done.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Takes the PCA and categorical data are recombined into a data frame

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Make scatterplot of results

TODO: Allows visualization of the results with their categorical labels to identify clusters

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

x_axis <- paste("PC1 (",var_out[1],"% variation)", sep = "")
y_axis <- paste("PC2 (",var_out[2],"% variation)", sep = "")

title <- "Genetic Diversity in Birds by SNP Analysis"

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = x_axis,
                  ylab = y_axis,
                  main = title)

TODO: There is a clear division between the 2 species by PC1. While A. nenlsoni appears to have several unique genetic clusters, A. caudacutus has no clear genetic clustering of its subspecies.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23