Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "/Users/jooppereira/Downloads"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scaling

First, scaling the data must be completed before continuing with PCA. Use scale()

SNPs_scaled <- scale(SNPs_cleaned)

PCA

Start to run the PCA with prcomp() and the scaled data

pca_scaled <- prcomp(SNPs_scaled)

Screeplot

Create a screeplot using screeplot() to observe how many variables should be plotted

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Screeplot of Principal Components")

##Screeplot Interpretation: PC1 seems to be the only component that is worth studying since PC2-10 all are significantly lower on the scale of “relative importance”. For the biplot, PC1 and PC2 will be studied.

Determining PCA variation

Create a function names PCA_Variation which will work on the summary data of the scaled PCA. Make a bar plot which measures the percent variation in the Screeplot which was made before.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The red line represents the percent variation, which remains flat, and very close to zero.

Biplot

Create a biplot using biplot() to compare PC1 and PC2

biplot(pca_scaled)

This is a BAD idea because ther are so many points on the biplot, that you can not even distinguish what is happening, or any trends! Very clustered and can not make and conclusions about the data.

Find PCA scores

Using vegan::scores(), get the PCA scores on the scaled data.

pca_scores <- vegan::scores(pca_scaled)

Create a vector using c() which relate to the PCA scores

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Put the pop_id and pca_scores in a data frame together.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Scatterplot PCA

Create a scatterplot using ggpubr and ggscatter with the data frame just stored in the object pca_scores2. Color code the points by pop_id.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (20.2% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "Correlations Among Bird Populations")

Plot Interpretaion: Populations identified as “Alt” and “Nel” were very similar among principal components 1 and 2, the same can be said for “Cau” and “Div”. “Sub” on the otherhand, is more unique, its data shows that it is similar to “alt” and “nel” on the PC1 axis, yet differ on the PC2 axis.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23