Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-2

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("C:/Users/Owner/Downloads")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scale our SNPs

TODO: Centers and scales the columns, and in this case, the SNPs

SNPs_scaled <- scale(SNPs_cleaned)

TODO: SNPs Analysis

TODO: prcomp performs a principal components analysison the SNPs_scaled object, assigning it to a new object

pca_scaled <- prcomp(SNPs_scaled)

TODO: Make a Screeplot

TODO: Takes our scaled, principal component analyzed data and graphs it

##TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Relative importance of PC columns")

TODO: INTERPRET SCREEPLOT

TODO: Summarize and Plot

TODO: Get a summary of our scaled analyzed data and submit the data to an object. Get our variation data and plot the percentage based on PC column

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: INTERPRET VARIANCE EXPLAINED

TODO: Biplot our data

TODO: We are making a biplot of our scaled data

biplot(pca_scaled)

##TODO: EXPLAIN WHY THIS IS A BAD IDEA THIS IS A BAD IDEA BECAUSE ALL OF THAT DATA IS SMUSHED TOGETHER YOU CAN’T TELL WHATS THERE!

TODO: Access the Scores

TODO: We are specifically asking for the vegan package to be able to access the scores/site scores of the pca_scaled and setting it as the basic pca_scores object

pca_scores <- vegan::scores(pca_scaled)

TODO: We are creating a new vector

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Organize our pop_id and scores to a data frame, assign it to a specific object

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Scatter and Compare

TODO: Make a scatterplot of our IDs and compare them from PC1 and PC2 in variation

##TODO: UPDATE PLOT WITH TITLE ##TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (20% variation)",
                  ylab = "PC2 (10% variation)",
                  main = "PC1 vs PC2 in Variation")

##TODO: INTERPRET PLOT

We can see that there is no point when both PC1 and PC2 have a variation percentage of 0. The more negative variation in PC2 equates to lower PC1 variation. However, when you compare true variation.. When PC1 increases in variation, PC2 remains around the point of 0-5% variation. While when PC2 increases in variation, PC1 remains at around -15% variation. PC1 has a more neutral result against PC2 when it increases but PC2 has a more negative result against PC1 when it increases.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23