Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-2

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Scaling the cleaned data for PCA

TODO: Using the scale function to scale the cleaned data for PCA

SNPs_scaled <- scale(SNPs_cleaned)

TODO: PCA

TODO: Here we are using the prcomp function to run pca on the scaled data

pca_scaled <- prcomp(SNPs_scaled)

TODO: PCs analysis/ScreePlot

TODO: In this chunk we are creating a screeplot to see the importance of the different PCs

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "SNP Screeplot")

TODO: INTERPRET SCREEPLOT *Looking at the screeplot we can conclude that PC1 has the most importance relative to the other PCs, and we can also conclude that the rest of the PCs are relatively similar in importance

TODO: PCs variation

TODO: In this chunk of code we are creating a barplot to determine the varation that we see among the PCs

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: INTERPRET VARIANCE EXPLAINED *By observing the barplot, we can conclude that PC1 has the greatest variance among the data, meaning that PC1 is the category in which the data is the most different from one another. We can also conclude that the rest of the PCs are similar in variance

TODO: Creating a Biplot

TODO: We are using the biplot function to plot our scaled pca data

biplot(pca_scaled)

TODO: EXPLAIN WHY THIS IS A BAD IDEA *This is a bad idea because of how unreadable the graph is. We are not able to interpret on the graph at all.

TODO: Saving PCA scores

TODO: We are saving the PCA scores in a object pca_scores to set it up graphing it on a scatterplot

pca_scores <- vegan::scores(pca_scaled)

TODO: Creating a vector to save the population id

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: Create a data frame with the pc scores and the population id

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Creating a PCA scores scatterplot

TODO: Using the scatterplot function to plot the PCA scores for PC1 and PC2

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Scores Scatterplot")

TODO: INTERPRET PLOT *Observing the plot, we can conclude that the population groups “Div” and “Cau” have similarities compared to the other groups. The rest of the population groups are different compared to “Div” and “Cau”. In the rest of the population groups we see some variance among the y-axis, however, the variance is not as significant in respect to the Y-axis.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23