Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-2

library(ggplot2)
library(ggpubr)

Set the working directory

getwd()

## [1] "C:/Users/Ian/OneDrive - University of Pittsburgh/Academic/Fall 2022/Comp bio/Final Project/VCF practice"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scale

TODO: Scale SNPs

SNPs_scaled <- scale(SNPs_cleaned)

TODO: PCA

TODO: perform PCA on scaled data

pca_scaled <- prcomp(SNPs_scaled)

TODO: Screeplot

TODO: Evaluate Significance of each PC component via a screeplot

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Screeplot of PCA")

TODO: PC1 has SIGNIFICANTLY more variation than any of the other. Every other PC about equal and very insignificant

TODO: analyze PCA

TODO: Analyze the variation the can be explained by each PC

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

var_out[1]

##  PC1 
## 19.9

var_out[2]

## PC2 
## 2.3

TODO: Only PC 1 even gets above 5% of variation explained meaning most of theese PCs are not significant for analysis

TODO: TITLE

TODO: EXPLAIN

biplot(pca_scaled)

TODO: EXPLAIN WHY THIS IS A BAD IDEA

TODO: alternate PCA

TODO: use the vegan::scored method instead

pca_scores <- vegan::scores(pca_scaled)

TODO: EXPLAIN

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: EXPLAIN

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Analyze PCA scores

TODO: Scatter plot of PCA scores. Colored to show clusters of populations.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.2% variation)",
                  main = "PCA scores labled based on population")

TODO: High correlation between Div and Cau populaton groups, as well as Alt and Nel. Sub population is also fairly correlated to the latter group considering the little significance of PC2 (vertical)

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23

Introduction

Tasks

Preliminaries

Data analysis

TODO: Scale

TODO: PCA

TODO: Screeplot

TODO: analyze PCA

TODO: TITLE

TODO: alternate PCA

TODO: Analyze PCA scores