Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Documents/BIOSC 1540 Code/Final_Project")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Data scaling

TODO: Center the data around the mean by using the scale() on the SNPs_cleaned dataframe. This will make the mean of the data set 0 and the standard deviation 1.

SNPs_scaled <- scale(SNPs_cleaned)

TODO: PCA analysis

TODO: Run a PCA analysis on the scaled data frame using prcomp(). Store this in a variable pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

TODO: PCA diagnostics

TODO: Create a screeplot with screeplot() in order to analyze the how many of the new features (PCs) should be considered, plot, and/or used in further analyses.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Scree plot for Walsh bird data ")

TODO: From the screeplot, it appears that only PC1 is important to use in further analyses. This is because there is a steep drop between PC1 and PC2, and all of the subsequent PCs are even smaller. This indicates that PC1 captures the largest amount of data variation in the data set, making it the most important.

TODO: Explained variation

TODO: Get the information on the percentage of variation in the data using the summary() command. Store the summary information in an object summary_out_scaled.

summary_out_scaled <- summary(pca_scaled)

Use a function to extract the explained variation (importance) we want.

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

Use function and store in object var_out. This will extract the first 10 PCs (shown on scree plot as well).

var_out <- PCA_variation(summary_out_scaled,PCs = 10)
var_out

##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9 PC10 
## 19.9  2.3  2.2  2.1  2.0  2.0  2.0  1.9  1.8  1.8

Make a scree plot where the y-axis is the percent of variation captured by each PC using barplot()

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: The horizontal line in the barplot is calculated as 100/(# of dimensions in the data).If all the PCs were equally important, then the amount of variation they would explain would be (100)/(# of columns). These data have 50 columns, so if all the PCs were equal, they would each explain 2% of the variation.

TODO: PCA Biplot

TODO: Create a biplot of the SNP PCA data. This will show a vector for each opf the 50 SNPs on the plot.

biplot(pca_scaled)

TODO: This a bad idea to do because there are simply too many SNPs that if we look at a biplot we are unable to make any sense of the data. As shown, the vectors completely cover up the points and we are unable to come to any conclusions about the SNPs.

TODO: Custom PCA Plot

TODO: Get the PCA scores using vegan::scores() and store the information into an object pca_scores

pca_scores <- vegan::scores(pca_scaled)

TODO: Create a vector pop_id that has a population code for each of the 68 samples. This is to replace the sample id label in order to make more sense of the data and color code the plot later on.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Combine the scores with the species information into a dataframe pca_scores2.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Custom PCA PLot - Plot and Interpretation

TODO: Plot the scores, which are color coded by population id. The amount of variation explained by each PC is shown in the axis labels.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Scatterplot of Walsh Bird Species")

TODO: Looking at this scatterplot, it looks like birds with the population id Sub are most separate from the other populations. as it is in its own separate clump. On the other hand, species Alt and NeI, and Cau and Div seem highly similar across both PC1 and PC2. Cau and Div also seem to be separate from the other species across PC1, indicating that there is most likely a lot of variation between them.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-12-05