Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Desktop/Computational Biology Code:Screenshots")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scale

TODO: Essentially we are preparing the data for PCA by scaling it. This means we will set the standard deviation of the data equal to 0.

SNPs_scaled <- scale(SNPs_cleaned)

TODO: Run PCA

TODO: We use prcomp() on SNPs_scaled to run the PCA for us.

pca_scaled <- prcomp(SNPs_scaled)

TODO: Screeplot

TODO: We used the command screeplot() to generate a screeplot of the PCA, this will tell us out of the generated principal components which one is the most important to analyze.

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Importance of each PCA")

TODO: When interpreting thre plot we see that the only important PC is PC1 since its relative importance is so much higher than PC2 and so on.

TODO: Evaluate Variation

TODO: From our data we are able to extract the amount of variation explained from the 10 PCs

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)
percent_variance <- var_out[c(1,2,3)]

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: From our barplot we see that the highest percent variance in data came from PC1 which is to no suprise since we know that PC1 has the most relative importance so it would have the highest percent variance.

TODO: Biplot

TODO: We are making a biplot from our scaled pca data

biplot(pca_scaled)

TODO: This is a bad idea since our data is highly dimensional and it will looked very mushed up. So we won’t be able to make accurate conclusions from this biplot.

TODO: Extract

TODO: We are getting the PCA scores from our PCA, by using vegan::scores

pca_scores <- vegan::scores(pca_scaled)

TODO: We are reformatting and creating a vector full of the population IDs for each pca score.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: We are assinging population IDs to the pca scores that we extracted. This will create a whole dataframe.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: PCA Plot

TODO: We are using ggscatter to make a scatter plot of the population ID scores with PC1 and PC2

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

percent_variance[1]

##  PC1 
## 19.9

percent_variance[2]

## PC2 
## 2.3

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PC1 vs PC2")

TODO: When we see the data plotted with PC1 and PC2 we see that there are formations of 3 distinct groups. The Sub group forms as 1 that is near the -10 area of both PC1 and PC2. While the Alt and Nel group are the second group that stay consistent on the -10 side of PC1 and will spread from 0 to 10 for PC2. The third group is Cau and Div on the opposite side of the scatterplot. We see that is on the 20 side of PC1 and stays relatively centered at 0 when looking at the PC2.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23