Worked Example: PCA on SNPs data from a vcf file Part 2

Introduction

The example is split into 2 Parts:

Part 1: Data Preparation
Part 2: Data analysis with PCA (this file)

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

Center the data (scale())
Run a PCA analysis (prcomp())
Evaluate the scree plot from the PCA (screeplot())
Evaluate the amount of variation explained by the first 2 PCs.
Extract the PCA scores for plotting (vegan::scores())
Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY

## Warning: package 'vcfR' was built under R version 4.2.2

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Set the working directory

setwd("C:/Users/mikaw/OneDrive/Pitt/Computational Biology/my_snps")
list.files()

##  [1] "1000Genome.gz"                                              
##  [2] "1000genomes_people_info2-1.csv"                             
##  [3] "6.1803432-2043432.ALL.chr6_GRCh38.genotypes.20170504.vcf.gz"
##  [4] "all_loci-1.vcf"                                             
##  [5] "all_loci.vcf"                                               
##  [6] "project snps.Rmd"                                           
##  [7] "project_snps_num_df"                                        
##  [8] "Screenshot 2022-12-03 105306.png"                           
##  [9] "vcf_num.csv"                                                
## [10] "walsh2017morphology.csv"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

Data analysis

Scale Data

Center and scale data by subtracting the mean (center around 0) and dividing by the standard deviation

SNPs_scaled <- scale(SNPs_cleaned)

Perform PCS on Scaled Data

Used prcomp() to perform PCA

pca_scaled <- prcomp(SNPs_scaled)

Evaluate the scree plot

Create screeplot from PCA data

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "SNPs Screeplot")

PC1 is the only one with relative high importance. All the others have lower and about equal importance

Evaluate the amount of variation explained by the first 2 PCs

Look at the summary of the data Load PCA variation adn extract that to calculate the percentage variation Calculate the cutoff for therule of thumb based on the number of dimensions in the data Plot the percentage variation

summary_out_scaled <- summary(pca_scaled)

PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}

var_out <- PCA_variation(summary_out_scaled,PCs = 10)

N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

Of the first 10 PCs, none of them are cut off

Make a biplot

This gives the data based on all the SNPs

biplot(pca_scaled)

There are toomany dimensions to analyze this data

Extract the PCA scores for plotting

Extract using vegan::score()

pca_scores <- vegan::scores(pca_scaled)

Create vector with the population IDS

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create a dataframe combining the population IDs and the PCS scores

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Plot the data

Create a scatter plot of the PCA scores

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 amount of variation explained",
                  ylab = "PC2 amount of variation explained",
                  main = "PCA scores")

You can see groups and trends forming based on the population ID

Worked Example: PCA on SNPs data from a vcf file Part 2 - Data Analysis

2022-11-23