Introduction

The example is split into 2 Parts:

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

  1. Center the data (scale())
  2. Run a PCA analysis (prcomp())
  3. Evaluate the scree plot from the PCA (screeplot())
  4. Evaluate the amount of variation explained by the first 2 PCs.
  5. Extract the PCA scores for plotting (vegan::scores())
  6. Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

  1. Give a meaningful title to all sections marked “TODO: TITLE”
  2. Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
  3. Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
  4. Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)

Set the working directory

setwd("~/Pitt/Fall_2022/comp bio/R files")

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale Data

Use the scale function to scale and center the data.

SNPs_scaled <- scale(SNPs_cleaned)

Run PCA

Use prcomp to run PCA on the snp data.

pca_scaled <- prcomp(SNPs_scaled)

Create Screeplot

Make a screeplot to determine which PCs make sense to analyze based on their variances.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "SNP Screeplot")

PC1 contains far more variance than any of the other PCs, and would therefore be the most informative to analyze.

PC Variance Percentages

Determine the percentage of variance that the first 10 PCs describe and plot them with a barchart. Add a horizontal line which is 100 divided by the number of columns, which is what the percent variation of all the PCs would be if they were equal.

summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

PC1 contains most of the variance.

Create Biplot

Make a biplot of the PCA using the biplot function.

biplot(pca_scaled)

It is hard to read and interpret because there are so many snps.

Get Scores

Get the scores of each PC.

pca_scores <- vegan::scores(pca_scaled)

Make a vector of the population ids. I’m guessing that this information is contained somewhere in the vcf file?

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Create a dataframe that combines the scores and the population ids.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Create PCA Scatterplot

Make a scatterplot of the PCA with different populations as different colors and shapes.

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PCA Scatterplot")

The Alt and Nel populations seem to have a lot of overlap in genenotype. The Sub population is close to these but clearly distinct, and the Cau and Div populations overlap with each other but are far from the others.