Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "/Users/krishna/Downloads"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Data Scaling

TODO: Scale the data before analzying, scaling the data is setting the standard deviation to zero and assign it to the object SNPs_scaled

SNPs_scaled <- scale(SNPs_cleaned)

TODO: Run PCA

TODO: We will use the function prcomp to run a PCA, principle component anaysis and we will assign the PCA to the pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

TODO: Make a Screeplot

TODO: Use pca_scaled data frame to make a screeplot of the pca that we ran in the code chunk above.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Relative Importnace of SNPs")

TODO: PC1 has the highest relative importacne by a very large margin. A screeplot is used to tell us the varaince between the 2 variables, in this case it tells us which snp has the highest importance when compared to other SNPs.

TODO: Evaluate and Graph the Variation

TODO: Summarize the scaled data using the summary function. Then create a function to get the variation from the 10 PCs. Use the fucntion created, PCA_variation, on the summary_out_scaled to get the variation. Then graph the results on a barplot with a vertical line at the variation percentage.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: The variance is taken from teh x axis as a large varaince is not presetn between the PCs.

TODO: Make a biplot

TODO: We used the featiure biplot to make a biplot for the scaled pca data

biplot(pca_scaled)

TODO: This is a bad idea because we have more than 2 dimensions so are not able to proplery represent the data and it will result in a plot that is too chatoic to interpert.

TODO: PCA Scores

TODO: We used the fucntion vegan::scores to get the scores of the PCA.

pca_scores <- vegan::scores(pca_scaled)

TODO: We created a vector called pop_id tthat is a ppopualtion id for each pca score.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: We used the data.frame fucntion to create a data frame of the population ids and the pca score and named the object pca_scores2.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Make a Scatterplot

TODO: We created a scatterplot using ggpubr::ggscatter function that compares PC1 and PC2 varaince.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

var_out<- PCA_variation(summary_out_scaled, PCs =1)
print(var_out)

## [1] 19.9

var_out <- PCA_variation(summary_out_scaled,PCs = 2)
print(var_out)

##  PC1  PC2 
## 19.9  2.3

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PC1 vs. PC2 Variance ")

TODO: The graph has 3 distinct groups that form. One group, “sub” is located at the (-10,-3). The 2 other groups are “Alt” and “Nel” which form around the (-10, 5). The other 2 groups plotted are the CAu and Div whihc are located on the postive side of the axis. This shows us that the most similar groups are Alt and Nel.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23