Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center and scale the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("C:/Users/Casth/Desktop/R")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, it may be because you didn't set your working directory.")

## Warning: If this didn't work, it may be because you didn't set your working
## directory.
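The warning above is printed whether or not the file was found. As a minimal sketch of an alternative, the hypothetical helper read_snps() below (not part of the original exercise) checks for the file first and stops with a message only when it is missing. The temporary demo file and its contents are made up for illustration.

```r
# Hypothetical helper: stop early with a clear message if the file is missing
read_snps <- function(path = "SNPs_cleaned.csv") {
  if (!file.exists(path)) {
    stop("Cannot find '", path, "' (working directory: ", getwd(),
         "). Set the working directory or run Part 1 first.")
  }
  read.csv(file = path)
}

# Demonstration with a small made-up file in a temporary location
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(snp1 = c(0, 1, 2), snp2 = c(2, 1, 0)),
          demo_file, row.names = FALSE)
demo <- read_snps(demo_file)
dim(demo)
```

In the real analysis the call would simply be read_snps(), which uses the SNPs_cleaned.csv file name by default.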

Data analysis

Scaling

Use scale() to center SNPs_cleaned around the mean of each column. With its default settings, scale() also divides each column by its standard deviation.

SNPs_scaled <- scale(SNPs_cleaned)
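To see what scale() does with its defaults, here is a minimal sketch on a toy matrix. The values are made up and are not the real SNP data; the names toy and toy_scaled are illustrative only.

```r
# Toy genotype matrix (made-up values): 4 samples x 3 SNPs, coded 0/1/2
toy <- matrix(c(0, 1, 2, 1,
                2, 2, 0, 1,
                1, 0, 1, 2),
              nrow = 4,
              dimnames = list(NULL, c("snp1", "snp2", "snp3")))

# Defaults are center = TRUE and scale = TRUE
toy_scaled <- scale(toy)

# After scaling, each column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```

A column with no variation has a standard deviation of 0, so scale() would return NaN for it. Such columns need to be removed before this step.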

Running PCA

Run prcomp() on SNPs_scaled

pca_scaled <- prcomp(SNPs_scaled)
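It can help to look at what prcomp() returns before plotting anything. The sketch below uses simulated random data as a stand-in for the scaled SNP matrix, so the numbers themselves mean nothing; only the structure of the result matters.

```r
set.seed(1)

# Simulated stand-in for the scaled SNP matrix: 10 samples x 4 variables
toy <- matrix(rnorm(40), nrow = 10,
              dimnames = list(NULL, paste0("snp", 1:4)))
toy_pca <- prcomp(scale(toy))

names(toy_pca)          # includes sdev, rotation and x
length(toy_pca$sdev)    # one standard deviation per PC
dim(toy_pca$rotation)   # variables x PCs (the loadings)
dim(toy_pca$x)          # samples x PCs (the scores)
```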

Scree Plot

Create a scree plot using the PCA result from above, and add a y-axis label (ylab) and a title (main)

TODO: UPDATE PLOT WITH TITLE

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "The importance of PC")

TODO: PC1 is by far the most important PC, while the remaining PCs are nearly identical to one another and each is of much lower importance.

Output Summary

Find the information on variation explained using summary() and store it in summary_out_scaled

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)
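PCA_variation() reads row 2 of the importance table, which is the proportion of variance for each PC. As a quick check on simulated data (not the real SNPs), the sketch below shows that the same numbers can be computed directly from the PC standard deviations. The names from_summary and by_hand are illustrative.

```r
set.seed(2)
toy_pca <- prcomp(scale(matrix(rnorm(60), nrow = 12)))

# Row 2 of $importance is "Proportion of Variance"
from_summary <- summary(toy_pca)$importance[2, ]

# The same quantity computed directly: variance of each PC / total variance
by_hand <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

round(from_summary * 100, 1)
round(by_hand * 100, 1)
```

The two results agree up to the rounding that summary() applies to the importance table.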

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: The red line is calculated as 100 divided by the total number of SNPs, which is 50, giving a threshold of 2% of the variance. This is the amount each PC would explain if all PCs were equally important.
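The threshold logic can be sketched without a plot. The example below uses simulated data with 10 variables, so the equal-share threshold is 100/10 = 10%. The names pct and threshold are illustrative and not part of the original code.

```r
set.seed(3)
toy_scaled <- scale(matrix(rnorm(200), nrow = 20))   # 20 samples x 10 variables
toy_pca <- prcomp(toy_scaled)

# Percent of variance explained by each PC
pct <- toy_pca$sdev^2 / sum(toy_pca$sdev^2) * 100
names(pct) <- paste0("PC", seq_along(pct))

# Share each PC would explain if all were equally important
threshold <- 100 / ncol(toy_scaled)
threshold

# PCs that explain more than their equal share
pct[pct > threshold]
```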

TODO: Biplot

Create a biplot of the PCA results with PC2 on the left (y) axis and PC1 on the bottom (x) axis

biplot(pca_scaled)

TODO: EXPLAIN WHY THIS IS A BAD IDEA

This is a bad idea because the data are all over the place, with no clear trends or clusters, just a blob. Also, PC2 through PC10 are nearly identical, so they would show similar variation.

TODO: Get PCA_score

TODO: use vegan::scores to get the PCA scores

pca_scores <- vegan::scores(pca_scaled)
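For a prcomp object, the sample scores are stored in the x element, and vegan::scores() is expected to return these by default. That is stated here as an assumption about vegan's default method. The sketch below uses only base R and simulated data to show what the scores are: the scaled data projected onto the loadings.

```r
set.seed(4)
toy_scaled <- scale(matrix(rnorm(50), nrow = 10))   # 10 samples x 5 variables
toy_pca <- prcomp(toy_scaled)

# The scores stored by prcomp(): one row per sample, one column per PC
toy_scores <- toy_pca$x

# The same values obtained by projecting the scaled data onto the loadings
projected <- toy_scaled %*% toy_pca$rotation

all.equal(toy_scores, projected, check.attributes = FALSE)
dim(toy_scores)
```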

Create a vector of population IDs for the samples

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Combine pop_id with the PCA scores in a data frame

pca_scores2 <- data.frame(pop_id,
                              pca_scores)
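The hand-typed vector of IDs can also be built with rep(). The sketch below is one possible compact form, using counts tallied from the vector above (15 Nel, 15 Alt, 11 Sub, 12 Cau, 15 Div, 68 samples in total). It also checks that the number of IDs matches the number of rows before combining. The score matrix is a stand-in filled with zeros, and all object names are illustrative.

```r
# Population labels and sample counts, tallied from the hand-typed vector
pops   <- c("Nel", "Alt", "Sub", "Cau", "Div")
counts <- c(15, 15, 11, 12, 15)
pop_id_compact <- rep(pops, times = counts)

# Stand-in score matrix (all zeros) with one row per sample
toy_scores <- matrix(0, nrow = sum(counts), ncol = 2,
                     dimnames = list(NULL, c("PC1", "PC2")))

# data.frame() silently recycles a vector whose length divides the row count,
# so check the lengths explicitly before combining
stopifnot(length(pop_id_compact) == nrow(toy_scores))
toy_df <- data.frame(pop_id = pop_id_compact, toy_scores)

table(toy_df$pop_id)
```

With pop_id defined as in the original code, identical(pop_id, pop_id_compact) would confirm that the two vectors match.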

PCA Scatterplot

TODO: The points are color coded by pop_id, and the axis labels show the percent of variation explained by each PC.

TODO: UPDATE PLOT WITH TITLE
TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (20.2% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "Scatterplot of birds")
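The percentages in the axis labels above are typed in by hand. One way to keep them in sync with the PCA is to build the labels from the summary output. This is a sketch on simulated data, so the percentages it produces are not the ones from the bird data; x_label and y_label are illustrative names.

```r
set.seed(5)
toy_pca <- prcomp(scale(matrix(rnorm(80), nrow = 16)))

# Percent variation for the first two PCs, rounded to one decimal place
pct <- round(summary(toy_pca)$importance[2, 1:2] * 100, 1)

# %% prints a literal percent sign in sprintf()
x_label <- sprintf("PC1 (%.1f%% variation)", pct["PC1"])
y_label <- sprintf("PC2 (%.1f%% variation)", pct["PC2"])

x_label
y_label
```

The resulting strings could then be passed to the xlab and ylab arguments of the plotting call.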

TODO: The Sub birds were the most distinct. The Alt and Nel birds were similar to one another, as were the Cau and Div birds.

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

Harish Kodavali

2022-12-6