Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-2

library(ggplot2)
library(ggpubr)

Set the working directory

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")

## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Center the data

By centering the data, it makes sure that variance of each variable is comparable and not influenced by the scale of the original values.

SNPs_scaled <- scale(SNPs_cleaned)

Run PCA Analysis

Now we can run a PCA analysis on the data. The prcomp takes the SNPs we provided in a data frame and turns it into a list of objects with the PCA results.

pca_scaled <- prcomp(SNPs_scaled)

Screeplot of prcomp results

A screeplot can help visualize the amount of variance explained by each component. This will help decide the number of PCs to retain for further analysis

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "PCA variance")

TODO: INTERPRET SCREEPLOT

Given the elbow at PC2, we should only retain the first PC. However, given the fact that 1 dimension is not useful for analysis, we will take 2.

Evaluating the first 2 PCs

The summary function in R provide useful information about our PCA results. The item we are interested in is the proportion of variance explained by the PCs. We can then visualize the proportion of variance in our data explained with a bar chart.

summary_out_scaled <- summary(pca_scaled)
summary(pca_scaled)

## Importance of components:
##                            PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     16.2981 5.5721 5.41909 5.26487 5.19656 5.17558 5.11325
## Proportion of Variance  0.1985 0.0232 0.02195 0.02072 0.02018 0.02002 0.01954
## Cumulative Proportion   0.1985 0.2217 0.24368 0.26439 0.28458 0.30460 0.32414
##                            PC8     PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     5.01144 4.94492 4.86900 4.8524 4.79022 4.69654 4.67504
## Proportion of Variance 0.01877 0.01828 0.01772 0.0176 0.01715 0.01649 0.01633
## Cumulative Proportion  0.34291 0.36118 0.37890 0.3965 0.41365 0.43013 0.44647
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     4.64717 4.62267 4.59217 4.55181 4.51365 4.48088 4.44317
## Proportion of Variance 0.01614 0.01597 0.01576 0.01549 0.01523 0.01501 0.01475
## Cumulative Proportion  0.46261 0.47858 0.49434 0.50983 0.52505 0.54006 0.55481
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     4.39156 4.35229 4.32199 4.27752 4.25784 4.22305 4.19721
## Proportion of Variance 0.01441 0.01416 0.01396 0.01368 0.01355 0.01333 0.01317
## Cumulative Proportion  0.56923 0.58339 0.59735 0.61102 0.62457 0.63790 0.65107
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     4.17971 4.11221 4.05455 4.03023 4.01265 3.98857 3.96761
## Proportion of Variance 0.01306 0.01264 0.01229 0.01214 0.01203 0.01189 0.01177
## Cumulative Proportion  0.66412 0.67676 0.68905 0.70119 0.71322 0.72511 0.73688
##                           PC36    PC37    PC38    PC39    PC40    PC41    PC42
## Standard deviation     3.93394 3.90733 3.89238 3.82540 3.79295 3.75287 3.73895
## Proportion of Variance 0.01157 0.01141 0.01132 0.01094 0.01075 0.01053 0.01045
## Cumulative Proportion  0.74844 0.75985 0.77118 0.78211 0.79287 0.80339 0.81384
##                           PC43    PC44    PC45   PC46    PC47    PC48   PC49
## Standard deviation     3.70877 3.68193 3.64483 3.6210 3.57055 3.52563 3.5082
## Proportion of Variance 0.01028 0.01013 0.00993 0.0098 0.00953 0.00929 0.0092
## Cumulative Proportion  0.82412 0.83425 0.84418 0.8540 0.86351 0.87280 0.8820
##                           PC50    PC51    PC52    PC53   PC54    PC55    PC56
## Standard deviation     3.43804 3.37418 3.35409 3.31966 3.3121 3.25229 3.17153
## Proportion of Variance 0.00883 0.00851 0.00841 0.00824 0.0082 0.00791 0.00752
## Cumulative Proportion  0.89083 0.89934 0.90775 0.91598 0.9242 0.93209 0.93961
##                          PC57    PC58    PC59    PC60    PC61   PC62    PC63
## Standard deviation     3.1679 3.07460 3.03509 2.91990 2.88622 2.7606 2.58794
## Proportion of Variance 0.0075 0.00707 0.00688 0.00637 0.00623 0.0057 0.00501
## Cumulative Proportion  0.9471 0.95417 0.96106 0.96743 0.97365 0.9794 0.98436
##                          PC64    PC65    PC66    PC67    PC68
## Standard deviation     2.5064 2.39421 2.16921 2.05232 1.1e-14
## Proportion of Variance 0.0047 0.00428 0.00352 0.00315 0.0e+00
## Cumulative Proportion  0.9890 0.99334 0.99685 1.00000 1.0e+00

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: INTERPRET VARIANCE EXPLAINED

Variance of the data explained by the first PC is far higher than the rest. This means the data can be represented efficiently with just 1 dimension.

Trying to interpret data with a biplot

Biplots are very helpful in interpreting the results of a PCA and can help us reach the conclusions we reached in our previous section with just one function. So why do we not use this?

biplot(pca_scaled)

TODO: EXPLAIN WHY THIS IS A BAD IDEA

Due to the large number of observations or variables, the biplot becomes very cluttered. This makes it far more difficult to interpret.

Extract PCA scores

The score function from the vegan library returns the coordinates of the observations, which will help use visualize this data.

pca_scores <- vegan::scores(pca_scaled)

Creating a vector which stores the ids of the population

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

This is used to categorize our observations into individual groups.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Plotting the PCA scores

After obtaining our pca scores, we can visualize them in a scatter plot. The positions of each observation reflects the the relationship between each observation in the original data set. This will allow us to identify clusters/patterns within the data

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.85% variation)",
                  ylab = "PC2 (2.32% variation)",
                  main = "PC1 vs PC2")

TODO: INTERPRET PLOT

Looks like Cau, Div seem to be in one cluster, which indicates they are similar with each other in terms of genetic variation they capture. This means both SNPs are common in the population and they have similar frequencies of different nucleotide variations.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23