Introduction

The example is split into 2 Parts:

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

  1. Center the data (scale())
  2. Run a PCA analysis (prcomp())
  3. Evaluate the scree plot from the PCA (screeplot())
  4. Evaluate the amount of variation explained by the first 2 PCs.
  5. Extract the PCA scores for plotting (vegan::scores())
  6. Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

  1. Give a meaningful title to all sections marked “TODO: TITLE”
  2. Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
  3. Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
  4. Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)

Set the working directory

getwd()
## [1] "/Users/krishna/Downloads"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

TODO: Data Scaling

TODO: Scale the data before analzying, scaling the data is setting the standard deviation to zero and assign it to the object SNPs_scaled

SNPs_scaled <- scale(SNPs_cleaned)

TODO: Run PCA

TODO: We will use the function prcomp to run a PCA, principle component anaysis and we will assign the PCA to the pca_scaled.

pca_scaled <- prcomp(SNPs_scaled)

TODO: Make a Screeplot

TODO: Use pca_scaled data frame to make a screeplot of the pca that we ran in the code chunk above.

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "Relative Importnace of SNPs")

TODO: PC1 has the highest relative importacne by a very large margin. A screeplot is used to tell us the varaince between the 2 variables, in this case it tells us which snp has the highest importance when compared to other SNPs.

TODO: Evaluate and Graph the Variation

TODO: Summarize the scaled data using the summary function. Then create a function to get the variation from the 10 PCs. Use the fucntion created, PCA_variation, on the summary_out_scaled to get the variation. Then graph the results on a barplot with a vertical line at the variation percentage.

summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

TODO: The variance is taken from teh x axis as a large varaince is not presetn between the PCs.

TODO: Make a biplot

TODO: We used the featiure biplot to make a biplot for the scaled pca data

biplot(pca_scaled)

TODO: This is a bad idea because we have more than 2 dimensions so are not able to proplery represent the data and it will result in a plot that is too chatoic to interpert.

TODO: PCA Scores

TODO: We used the fucntion vegan::scores to get the scores of the PCA.

pca_scores <- vegan::scores(pca_scaled)

TODO: We created a vector called pop_id tthat is a ppopualtion id for each pca score.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

TODO: We used the data.frame fucntion to create a data frame of the population ids and the pca score and named the object pca_scores2.

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

TODO: Make a Scatterplot

TODO: We created a scatterplot using ggpubr::ggscatter function that compares PC1 and PC2 varaince.

TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED

var_out<- PCA_variation(summary_out_scaled, PCs =1)
print(var_out)
## [1] 19.9
var_out <- PCA_variation(summary_out_scaled,PCs = 2)
print(var_out)
##  PC1  PC2 
## 19.9  2.3
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (19.9% variation)",
                  ylab = "PC2 (2.3% variation)",
                  main = "PC1 vs. PC2 Variance ")

TODO: The graph has 3 distinct groups that form. One group, “sub” is located at the (-10,-3). The 2 other groups are “Alt” and “Nel” which form around the (-10, 5). The other 2 groups plotted are the CAu and Div whihc are located on the postive side of the axis. This shows us that the most similar groups are Alt and Nel.