The example is split into two parts. Part 1 must be completed first; it creates a file, SNPs_cleaned.csv, that has been fully prepared for analysis.
Now, in Part 2, you will analyze the data with PCA. The steps here will be:

1. scale()
2. prcomp()
3. screeplot()
4. vegan::scores()

All of the code is provided below. Your tasks will be to do 4 things:
Load the required packages (vcfR, vegan, ggplot2, and ggpubr) with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("/Users/adampowley/Desktop/School Stuff/Comp Bio/Port") # change this path to the folder that contains SNPs_cleaned.csv
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
We scale each variable to a standard range (mean 0, standard deviation 1) without altering the relationships within the data.
SNPs_scaled <- scale(SNPs_cleaned)
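As a quick sanity check (not part of the original workflow), each column of the scaled matrix should now have a mean of roughly 0 and a standard deviation of 1:

# Every column should be centered (mean ~0) and standardized (sd = 1)
summary(colMeans(SNPs_scaled))
summary(apply(SNPs_scaled, 2, sd))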
We find the principal components of the data, which attempt to represent the data along optimal axes.
pca_scaled <- prcomp(SNPs_scaled)
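Note that prcomp() can also center and scale the data itself; a minimal equivalent sketch (the pca_scaled_alt name is just for illustration and is not used below) would be:

# Equivalent: let prcomp() standardize the columns internally
pca_scaled_alt <- prcomp(SNPs_cleaned, center = TRUE, scale. = TRUE)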
To determine how many principal components are necessary to understand the general structure of the data, we look at the scree plot of the principal components.
screeplot(pca_scaled,
          ylab = "Relative importance",
          main = "Scree plot of PC for SNP data")
The first principal component summarizes most of the multivariate data.
Here we see which principal components summarize the data best, via their ability to explain a large percentage of the variation.
summary_out_scaled <- summary(pca_scaled)
# Percent of variation explained by the first `PCs` principal components
# (row 2 of summary()$importance is "Proportion of Variance")
PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2, 1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)  # number of SNP columns
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
# Reference line: the percent each PC would explain if all columns contributed equally
abline(h = 1/N_columns*100, col = 2, lwd = 2)
The first principal component explains by far the most variation, sitting well above the red reference line that marks an equal share per column.
A biplot will help visualize the results of our PCA, including the original features' weights (loadings) on the PCs.
biplot(pca_scaled)
There are so many SNP variables being considered that visualizing how each one affects the PCA in the biplot is impractical.
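If you do want to see which SNP columns drive the first principal component, one option (a sketch, not part of the original workflow) is to inspect the loadings numerically rather than graphically:

# Loadings (rotation matrix): one row per SNP column, one column per PC
loadings_PC1 <- pca_scaled$rotation[, 1]
# The ten columns with the largest absolute contribution to PC1
head(sort(abs(loadings_PC1), decreasing = TRUE), 10)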
Here we extract the scores of our PCA.
pca_scores <- vegan::scores(pca_scaled)
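Optionally, peek at the first few rows of the scores (one row per sample; the PC1 and PC2 columns are used below):

head(pca_scores)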
Here we create a vector of the real population IDs.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Here we create a data frame containing the population IDs with scores to visualize the effect of our PCA.
pca_scores2 <- data.frame(pop_id,
pca_scores)
Here we finally visualize how effective our PCA was at associating like populations based on SNP data.
TODO: UPDATE PLOT WITH TITLE
TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (xx% variation)",
                  ylab = "PC2 (xx% variation)",
                  main = "PCA results from SNPs with real population identity")
The populations Div and Cau are fairly homogeneous in our PCA, but both are separated from the rest of the data by a large margin. Alt and Nel are also fairly homogeneous, although Nel is more widely spread. The PCA does a good job of segregating the Sub population from the others.