Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: CENTER DATA

TODO: Center the data by using scale() which sets mean to 1 and standard deviation to 0.

SNPs_scaled <- scale(SNPs_cleaned)

TODO: RUN PCA ANALYSIS

TODO: Runs a PCA analysis with prcomp() on the scaled data and save it to an object called pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

TODO: EVALUATE SCREE PLOT

TODO: Creates a scree plot

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "SNPs Scree Plot")

TODO: The scree plot shows that PC1 and PC2 will be used for the remaining portion of the analysis and all the other PCs do not have useful information.

TODO: EVALUATE VARIATION

TODO: EXPLAIN

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: The percent variation shows that PC1 is significantly more important than the rest of the PCs.

TODO: CREATE BIPLOT

TODO: Creates a PCA biplot to visualize relationship between PC1 and PC2

biplot(pca_scaled)

TODO: This is a bad because there are too many data points so it is hard to interpret and visualize what is happening in this biplot.

TODO: EXTRACT PCA SCORES

TODO: Extract PCA scores from scaled PCA data with vegan::scores() and store into an object called pca_scores

pca_scores <- vegan::scores(pca_scaled)

TODO: Create vector with all the population codes of the SNP data and store in an object called pop_id

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Convert PCA scores and population IDs to data frame and save in object called pca_scores2

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: PLOT DATA

TODO: Create scatterplot of PCA scores with PC2 on the Y-axis and PC1 on teh X-axis.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (20% variation)",
                  ylab = "PC2 (2.5% variation)",
                  main = "PCA Scatterplot")

TODO: Among the bird types, Cau and Div are the most correlated with one another and positively correlated with PC1. Alt and Nel are most correlated with each other and sightly correlated with PC2. Sub is negatively correlated with PC1 and PC2.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23