Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Library/Mobile Documents/com~apple~CloudDocs/Pitt/BIOSC 1540/final stuff")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scale data

TODO: center the data around the mean with the scale() function and assign to the SNPs_scaled data frame.

SNPs_scaled <- scale(SNPs_cleaned)

TODO: PCA Analysis

TODO: run a pca analysis on the data that we scaled in the previous step with prcomp(). assign this value to a variable named ‘pca_scaled’.

pca_scaled <- prcomp(SNPs_scaled)

TODO: PCA screeplot

TODO: create a screeplot off the pca_scaled using screeplot().

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Scree plot for bird data")

TODO: by looking at the plot we can see that only PC1 is important for further use; this is because there is a large drop off from the value of PC1 to PC2. PC1 is the most important.

TODO: Scaled summary

TODO: get the information on variation using summary(). store this in an object named ‘summary_out_scaled’

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: the red horizontal line in the scree plot is calculated as 100 / (the number of dimensions in the data). there are 50 SNPs so that is why the line is at 2, or 2%.

TODO: PCA Biplot

TODO: Makes a biplot() of the SNPs on the PCA data. this also creates a vector for each SNP (so 50 vectors).

biplot(pca_scaled)

TODO: This is a bad idea because it makes the data very hard to interpret because it is just a large red cluster. You can’t even see the original points.

TODO: PCA scores plot

TODO: Get the PCA scores using vegan::scores. store in an object named ‘pca_scores’.

pca_scores <- vegan::scores(pca_scaled)

TODO: this creates a vector named ‘pop_id’ that has a code for all of the samples. this helps to make analysis of data easier.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: combine the scores into a data frame named ‘pca_scores2.’

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: PCA plot–interpret

TODO: the scores are color coded by pop id. the amount of variation (of PCs) is displayed in the axis label.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA scatterplot of birds")

TODO: the birds with ‘sub’ are the most isolated specices/ in its own seperate cluster. Alt and NeI are related. Cau and Div are related. this is across both PC1 AND PC2. Cau and Dive are also further isolated (amongst them selves) compared to the other 3 pops.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23