Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "/Users/ethanfrank/Desktop/R"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale data

Centers and scales the values in SNPs_cleaned and stores into SNPs_scaled object.

SNPs_scaled <- scale(SNPs_cleaned)

Run prcomp()

Run PCA on SNPs_scaled with prcomp() and store into pca_scaled object.

pca_scaled <- prcomp(SNPs_scaled)

Create screeplot

Creates scree plot with scaled PCA data from pca_scaled.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "pca_scaled Screeplot")

We can see that PC1 is of the most importance in the screeplot.

Summary of pca_scaled

Take summary of pca_scaled to produce result summaries of the results of various model fitting functions and store them into summary_out_scaled. Create function called PCA_variation which takes a PCA summary argument that returns var_explained. Run PCA_variation on summary_out_scaled with 10 PCs and store in var_out object. Store number of columns from SNPS_scaled into N_columns object. Create barplot with var_out. Add abline at 1/N_columns*100.

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

There is a lot of percent variation on PC1 compared to other PCs

Biplot of pca_scaled

Creates a biplot using data from pca_scaled.

biplot(pca_scaled)

There is little to no variation in every other PC other than PC1 so biplot will be very clumped together.

Extract scores from pca_scaled

Extracts scores from pca_scaled using vegan package and stores them in a matrix object called pca_scores.

pca_scores <- vegan::scores(pca_scaled)

Create a vector with all population ids and store in pop_id object.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create dataframe object using pop_id names and pca_scores matrix.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Create scatter plot

Creates scatter plot using pca_scores2 with PC1 and PC2.

TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (10% variation)",
                  ylab = "PC2 (10% variation)",
                  main = "pca_scores2 scatterplot variation")

Cau and Div have much more variation in PC1 compared to every other population id. There is little to no variation with PC2 between Cau and Div while other population ids span across PC2.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23