Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("/Users/mansiavunoori/Downloads")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Data scaling

TODO: Scale data for PCA (this can be done before or after dealing with NAs and other data preparation issues)

SNPs_scaled <- scale(SNPs_cleaned)

TODO: PCA

TODO: Run a PCA with prcomp()

pca_scaled <- prcomp(SNPs_scaled)

TODO: Screeplot

TODO: A scree plot is the typical tool for this because the base for a scree plot is basic R.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "PCA on SNPs")

This scree plot shows how much variation each PC captures from the data (in this case PC1 is the highest). The bars on the default R scree plot represent the relative importance of the PCs in representing the data. More specifically, they are proportional to the amount of variation in the data captured by each PC. The more variation represented by a PC, the more important it is.

TODO: Summary Variation

TODO: Get the information on the percentage of variation in the data using the summary() command and store it to an object.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: We use a function to extract the information on explained variation. Make sure to always return. Next, the var_out function allows us to specify how many PCs we want. We are getting the first 10 from the morphological data. Theb these percentages are reported as labels on biplots and scatterplots of PCA results. Instead of the default screeplot, we’ll use the barplot() function, to make the y-axis the percent of variation captured by each PC.

TODO: PCA Biplot

TODO: Now we’ll make the biplot and look at the biplot and interpret the relationship between the features.

biplot(pca_scaled)

TODO: The arrows may not get plotted.

TODO: Custom PCA plot

TODO: Get the scores by calling them

pca_scores <- vegan::scores(pca_scaled)

TODO: Entered each one into vectors for the plot

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Combine the scores with the species information into a dataframe

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Scatter Plot

TODO: Plot the scores, with species color-coded

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "Custom PCA scores and variations")

The plot displays the amount of variation explained by each PC1 and PC2. The raw data of these features are going to be highly correlated with each other. The data is very polar to either the left or right side of the scatterplot.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23