Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set and examine the working directory

setwd("~/Desktop/Computational Biology/Final Project/Software Checkpoints and Portfolios")
getwd()

## [1] "/Users/madelinefontana/Desktop/Computational Biology/Final Project/Software Checkpoints and Portfolios"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale the Cleaned SNPs Data

Scale the cleaned SNPs data with scale() to center the columns in the matrix

SNPs_scaled <- scale(SNPs_cleaned)

Principle Components Analysis

Perform a principal components analysis on the given scaled SNPs data and return the results and assign it to pca_scaled

pca_scaled <- prcomp(SNPs_scaled)

Variance and PCA

Create a scree plot using screeplot() to plot variances against the principle components

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Principle Component Analysis of SNPs Data")

TODO: INTERPRET SCREEPLOT

The most important component of analysis of the SNPs data is PC1, the first principle component because this had the highest reltive importance from the scree plot.

Summarize the scaled results

Call summary() on the scaled principle component analysis results to evaluate the variation

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: INTERPRET VARIANCE EXPLAINED

The barplot shows us the variance in the scree plot and explains this by showing that PC1 has the highest percent variation explained.

Create a biplot

Create a biplot from the scaled PCA data where vectors will be created to show how components are coorelated

biplot(pca_scaled)

TODO: EXPLAIN WHY THIS IS A BAD IDEA

This is a bad idea because it does not fully show how the components are related and is hard to read and analyze being that there is so much data that appears to overlap.

Run scoring on scaled PCA data

Use the vegan package to score the scaled PCA data, this is used to analyze community ecology and uses multivaruate tools

pca_scores <- vegan::scores(pca_scaled)

Provide Ids for the populations after getting this output

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create a data frame with these ids and PCA scores

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Visualize the PCA variation

Use ggpubr::ggscatter() on the data frame containing the scores and ids to visualize the PCA results and analyze percent variation and create a scatterplot showing this

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "Principle Component Analysis Variation by Population")

TODO: INTERPRET PLOT

Most of the scored data populations lie toward the left of the plot around -100 PC1 and from -10 to 10 PC2, showing the variation the PCA results. The principle component analysis 1 shows 19.9% variation where the PC2 only show 2.3% variation.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23