The key things to update are 1) file names and column names, 2) the number of columns with sample information to drop from your PCA, and 3) metadata on your data / analysis such as the number of NAs removed or the percentage of variation captured by the PCs.

Introduction

This report summarizes the analysis workflow and results of an analysis of SNPs from the 1000 Genomes Project.

Data preparation

Obtaining and loading data

SNPs were downloaded using the Ensembl Data Slicer from chromosome 14 between genomic coordinates 24,353,369 and 24,593,369. This represents 0.224% of the chromosome. A total of 7716 variants genotyped in 2504 individuals were downloaded.
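
The size of the region and its share of the chromosome can be checked with a few lines of arithmetic. This is a minimal sketch: the chromosome 14 length used here (107,043,718 bp, the GRCh38 value) is an assumption, since the report does not state which assembly or length was used.

```r
# Region boundaries as reported in the text
region_start <- 24353369
region_end   <- 24593369

# ASSUMPTION: chromosome 14 length in bp (GRCh38); not stated in the report
chr14_length <- 107043718

# Size of the downloaded region and its share of the chromosome
region_size    <- region_end - region_start
region_percent <- round(region_size / chr14_length * 100, 3)

region_size
region_percent
```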

The VCF file was loaded into R using the vcfR package (function read.vcfR) and converted to counts of the minor allele using the function vcfR::extract.gt().
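
To make the conversion step concrete, the sketch below turns VCF genotype strings into allele counts using base R only. It illustrates the idea and is not the code used in this workflow. The helper name gt_to_count and the toy genotypes are hypothetical, the sketch assumes biallelic SNPs, and it counts the alternate allele, which is the minor allele only when the alternate allele is the rarer one.

```r
# Hypothetical helper: count alternate alleles in genotype strings such as
# "0|1" (phased) or "0/1" (unphased). Missing calls (".") become NA.
# ASSUMPTION: biallelic SNPs, so alleles are coded only as 0 or 1.
gt_to_count <- function(gt) {
  alleles <- strsplit(gt, "[|/]")
  vapply(alleles, function(a) {
    if (any(is.na(a)) || any(a == ".")) return(NA_real_)
    sum(as.numeric(a))
  }, numeric(1))
}

# Toy genotypes (made up for illustration)
toy_gt <- c("0|0", "0|1", "1|0", "1|1", ".|.")
gt_to_count(toy_gt)
```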

Meta-data and sample information

These data were collected from volunteers with informed consent based on their ethnic heritage. Inconsistencies in population selection/representation resulted in a largely skewed and incomplete model of the world’s genetic heritage.

Data cleaning

These SNPs were then screened for any SNPs that were invariant (fixed), resulting in removal of 1920 SNPs (features), which is the difference between the 7716 SNPs downloaded and the 5796 SNPs retained. This was done using the invar_omit() function by Nathan Brouwer.

NOTE: The original workflow code for removing invariant SNPs contained an error that resulted in no columns actually being removed (Brouwer, personal communication). The code was updated, and the reduction in the size of the dataframe after omitting invariant columns was confirmed by checking the dimensions of the dataframes before and after this step using dim().
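
The check described in the note can be sketched on a toy matrix. The source of invar_omit() is not shown in this report, so drop_invariant_cols() below is a hypothetical stand-in that only illustrates the idea of dropping columns with a single distinct value and confirming the result with dim().

```r
# Hypothetical stand-in for invar_omit(): keep only columns that
# contain more than one distinct (non-NA) value
drop_invariant_cols <- function(x) {
  is_variable <- apply(x, 2, function(col) length(unique(col[!is.na(col)])) > 1)
  x[, is_variable, drop = FALSE]
}

# Toy genotype matrix (made up): 4 people x 4 SNPs, two of them invariant
toy <- cbind(snp1 = c(0, 1, 2, 0),
             snp2 = c(0, 0, 0, 0),   # invariant
             snp3 = c(1, 1, 2, 1),
             snp4 = c(2, 2, 2, 2))   # invariant

dim(toy)                      # dimensions before
toy_var <- drop_invariant_cols(toy)
dim(toy_var)                  # dimensions after: fewer columns
```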

The data were then screened for rows (people) with >50% NAs. There were no NAs in the data, so no rows were removed. Similarly, because no NAs were present, no imputation was required.
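
A minimal sketch of this kind of row screen in base R is shown below. The toy matrix is made up, and the exact code used in the workflow is not shown in this report, so this is only one way the >50% rule could be applied.

```r
# Toy genotype matrix (made up): 3 people x 4 SNPs, with some missing calls
toy_na <- rbind(c(0, 1, NA, 2),
                c(NA, NA, NA, 1),
                c(0, 0, 1, 1))

# Fraction of missing genotypes in each row (person)
na_fraction <- rowMeans(is.na(toy_na))
na_fraction

# Keep only rows with at most 50% NAs
toy_clean <- toy_na[na_fraction <= 0.5, , drop = FALSE]
dim(toy_clean)
```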

The data were then centered and scaled using R’s scale() function. (Alternatively, a SNP-specific centering technique common in other studies could have been applied.)
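
The effect of scale() can be seen on a small made-up matrix: each column ends up with mean 0 and standard deviation 1. The sketch also shows why invariant SNPs have to be removed first, since a column with zero variance turns into NaN when divided by its standard deviation.

```r
# Toy allele counts (made up): 4 people x 2 SNPs
toy_snps <- cbind(snp1 = c(0, 1, 2, 1),
                  snp2 = c(0, 0, 1, 2))

# scale() centers and scales each column by default
toy_scaled <- scale(toy_snps)

colMeans(toy_scaled)          # approximately 0 for every column
apply(toy_scaled, 2, sd)      # 1 for every column

# An invariant column cannot be scaled: 0 / 0 gives NaN
invariant_scaled <- scale(matrix(c(2, 2, 2, 2), ncol = 1))
invariant_scaled
```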

The data were then saved in .csv format using write.csv() for PCA.

After final processing, the data contained 5796 SNPs and 2504 samples (people).

Data Analysis

The code below carries out a PCA on the data and presents the results. The key steps are:

  1. Load the data with read.csv().
  2. Process the data with prcomp().
  3. Extract PCA scores.
  4. Carry out PCA diagnostics, including construction of a scree plot.
  5. Plot PCs 1 through 3 as pairwise scatter plots.
  6. Plot PCs 1 through 3 as a 3D scatter plot.

Packages

The following packages were used in this analysis:

# plotting:
library(ggplot2)
library(ggpubr)

# scores() function
library(vegan)
## Warning: package 'vegan' was built under R version 4.2.2
## Loading required package: permute
## Warning: package 'permute' was built under R version 4.2.2
## Loading required package: lattice
## This is vegan 2.6-4
# 3D scatter plot
library(scatterplot3d)

Loading data

Load the fully processed data:

NOTE: with 5796 SNPs, this CSV is ~248 megabytes. There are more specialized packages for doing PCA with data sets this big. I do not recommend working with more than 10,000 SNPs with basic R functions as we have done in class.

vcf_scaled <- read.csv(file = "vcf_scaled.csv")

Check the dimensions of the data to confirm this is the correct data:

dim(vcf_scaled)
## [1] 2504 5802

Principal Components Analysis

The data are scaled and ready for analysis. The output below shows that the first 6 columns (sample, pop, super_pop, sex, lat, lng) contain sample metadata rather than SNP data and need to be omitted from the PCA.

head(vcf_scaled[,1:10])
##    sample pop super_pop    sex      lat       lng        X1          X3
## 1 HG00096 GBR       EUR   male 52.48624 -1.890401 -0.267265 -0.07222675
## 2 HG00097 GBR       EUR female 52.48624 -1.890401 -0.267265 -0.07222675
## 3 HG00099 GBR       EUR female 52.48624 -1.890401 -0.267265 -0.07222675
## 4 HG00100 GBR       EUR female 52.48624 -1.890401 -0.267265 -0.07222675
## 5 HG00101 GBR       EUR   male 52.48624 -1.890401 -0.267265 -0.07222675
## 6 HG00102 GBR       EUR female 52.48624 -1.890401 -0.267265 -0.07222675
##            X4         X5
## 1 -0.04472137 -0.4724028
## 2 -0.04472137  2.1159923
## 3 -0.04472137 -0.4724028
## 4 -0.04472137 -0.4724028
## 5 -0.04472137 -0.4724028
## 6 -0.04472137 -0.4724028

PCA

Principal Components Analysis was run using prcomp().

vcf_pca <- prcomp(vcf_scaled[,7:5802])
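
The hard-coded range 7:5802 has to be edited whenever the number of SNPs or metadata columns changes. As a hedged alternative, the SNP columns can be selected by name. The sketch below uses a made-up toy data frame with the same metadata column names as the real data; the names toy_df, meta_cols, and snp_cols are hypothetical.

```r
# Toy data frame (values made up) with the same metadata columns as vcf_scaled
toy_df <- data.frame(sample    = c("s1", "s2", "s3", "s4"),
                     pop       = c("A", "A", "B", "B"),
                     super_pop = c("X", "X", "Y", "Y"),
                     sex       = c("male", "female", "female", "male"),
                     lat       = c(1.0, 1.0, 2.0, 2.0),
                     lng       = c(5.0, 5.0, 6.0, 6.0),
                     X1        = c(-1.2, 0.3, 0.5, 0.4),
                     X3        = c(0.7, -0.9, 1.1, -0.9),
                     X4        = c(0.1, 0.2, -1.3, 1.0))

# Everything that is not a metadata column is treated as a SNP column
meta_cols <- c("sample", "pop", "super_pop", "sex", "lat", "lng")
snp_cols  <- setdiff(names(toy_df), meta_cols)
snp_cols

# Run the PCA on the SNP columns only
toy_pca <- prcomp(toy_df[, snp_cols])
dim(toy_pca$x)
```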

Get the PCA scores, which will be plotted:

vcf_pca_scores  <- vegan::scores(vcf_pca) 

Combine the scores with the sample information into a dataframe.

vcf_pca_scores2 <- data.frame(population = vcf_scaled$super_pop,
                              vcf_pca_scores)
vcf_pca_scores2$population <- factor(vcf_pca_scores2$population)
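
For readers who want to see what the scores are, the sketch below reproduces them by hand on random toy data: the scores stored by prcomp() in its x element are the centered data multiplied by the rotation (loadings) matrix. The toy matrix is made up and the example uses only base R.

```r
set.seed(1)

# Toy data (random numbers): 10 samples x 4 variables
toy <- matrix(rnorm(40), nrow = 10, ncol = 4)
toy_pca <- prcomp(toy)

# Scores by hand: center each column, then project onto the loadings
toy_centered  <- scale(toy, center = TRUE, scale = FALSE)
manual_scores <- toy_centered %*% toy_pca$rotation

# Should match the scores stored in the prcomp object
all.equal(unname(manual_scores), unname(toy_pca$x))
```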

PCA diagnostics

The following steps help us understand the PCA output and determine how many PCs should be plotted and/or used in further analyses such as scans for natural selection, cluster analysis, and GWAS.

Default scree plot

A default R scree plot was created with screeplot(). This plot does not provide extra information for assessing the importance of the PCs.

screeplot(vcf_pca, 
          xlab = "Principal Components")

Advanced scree plot

NEW Functions
PCA_variation() function

This function extracts information needed to make a more advanced, annotated scree plot.

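PCA_variation <- function(pca){
  
  pca_summary <- summary(pca)
  
  # NOTE: row 1 of $importance is the standard deviation of each PC,
  # not its variance, so var_raw is on the standard-deviation scale
  variance <- pca_summary$importance[1,]
  
  # row 2: proportion of variance explained, converted to percent
  var_explained <- pca_summary$importance[2,]*100
  var_explained <- round(var_explained,3)
  
  # row 3: cumulative proportion of variance, converted to percent
  var_cumulative <- pca_summary$importance[3,]*100
  var_cumulative <- round(var_cumulative,3)
  
  # one row per PC
  N.PCs <- length(var_explained)
  var_df <- data.frame(PC = 1:N.PCs,
                       var_raw  = variance,
                       var_percent = var_explained, 
                       cumulative_percent = var_cumulative)
  
  return(var_df)   
}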
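
The rows of summary(pca)$importance can be checked on toy data. In the sketch below (random made-up data, base R only), the first row is labeled "Standard deviation", and the proportion of variance in the second row is recovered by squaring the standard deviations and dividing by their total.

```r
set.seed(1)

# Toy data (random numbers): 20 samples x 10 variables
toy <- matrix(rnorm(200), nrow = 20, ncol = 10)
toy_pca <- prcomp(toy)

imp <- summary(toy_pca)$importance
rownames(imp)

# Proportion of variance computed by hand from the standard deviations
prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

# Largest difference from the (rounded) values reported by summary()
max_diff <- max(abs(unname(imp[2, ]) - prop_var))
max_diff
```
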
screeplot_snps() function

This function makes a more advanced scree plot better suited for PCA on SNPs.

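screeplot_snps <- function(var_df){
  # cutoff: the average share of variation per PC, as a percentage
  # (this simplifies to 100/N)
  total_var <- sum(var_df$var_raw)
  N <- length(var_df$var_raw)
  var_cutoff <- total_var/N
  var_cut_percent <- var_cutoff/total_var*100
  var_cut_percent_rnd <- round(var_cut_percent,2)

  # index of the last PC that explains more than the cutoff
  i_above_cut <- which(var_df$var_percent > var_cut_percent)
  i_cut <- max(i_above_cut)

  # plot title reporting the cutoff and the number of useful PCs
  ti <- paste0("Cutoff = ",
               var_cut_percent_rnd,
               "%\n","Useful PCs = ",i_cut)

  # set up the axes; col = 0 keeps the connecting line itself invisible
  plot(var_df$var_percent,
       main = ti, type = "l",
       xlab = "PC",
       ylab = "Percent variation",
       col = 0)

  # one vertical bar per PC
  segments(x0 = var_df$PC,
           x1 = var_df$PC,
           y0 = 0, 
           y1 = var_df$var_percent,
           col = 1)

  # horizontal red line at the cutoff
  segments(x0 = 0,
           x1 = N,
           y0 = var_cut_percent, 
           y1 = var_cut_percent,
           col = 2)
}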

#var_df <- PCA_variation(vcf_pca)
#screeplot_snps(var_df)
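
Because the cutoff in screeplot_snps() is the average of var_raw divided by its total, it always simplifies to 100/N percent, where N is the number of PCs. The sketch below checks this on a made-up vector. The value N = 2504 in the last step is an assumption (one PC per sample, since there are fewer samples than SNPs) and is not stated explicitly in the report.

```r
# Any positive vector works: the average share is always 100 / N
x <- c(16.5, 13.0, 12.1, 10.3, 9.3, 8.9)
n_pcs <- length(x)

cutoff_percent <- (sum(x) / n_pcs) / sum(x) * 100
all.equal(cutoff_percent, 100 / n_pcs)

# ASSUMPTION: the PCA of 2504 samples x 5796 SNPs returns 2504 PCs
cutoff_snps <- round(100 / 2504, 2)
cutoff_snps
```
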
PCA_cumulative_var_plot() function

This function makes a plot complementary to a scree plot. A scree plot shows the amount of variation explained by each PC, while this plot shows a curve of the cumulative variation explained as PCs are added.

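PCA_cumulative_var_plot <- function(var_df){
  # curve of cumulative percent variation against PC number
  plot(cumulative_percent ~ PC, 
       data = var_df,
       main = "Cumulative percent variation\n explained by PCs",
       xlab = "PC",
       ylab = "Cumulative %",
       type = "l")
  
  # same cutoff as in screeplot_snps(): average share of variation per PC
  total_var <- sum(var_df$var_raw)
  N <- length(var_df$var_raw)
  var_cutoff <- total_var/N
  var_cut_percent <- var_cutoff/total_var*100
  var_cut_percent_rnd <- round(var_cut_percent,2)

  # index of the last PC that explains more than the cutoff
  i_above_cut <- which(var_df$var_percent > var_cut_percent)
  i_cut <- max(i_above_cut) 

  # cumulative percent variation explained up to that PC
  percent_cut_i <- which(var_df$PC == i_cut )
  percent_cut <- var_df$cumulative_percent[percent_cut_i]

  # vertical red line at the last useful PC
  segments(x0 = i_cut,
           x1 = i_cut,
           y0 = 0, 
           y1 = 100,
           col = 2)

  # horizontal red line at the cumulative percent explained by the useful PCs
  segments(x0 = -10,
           x1 = N,
           y0 = percent_cut, 
           y1 = percent_cut,
           col = 2)
}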

#PCA_cumulative_var_plot(PCA_variation(vcf_pca))
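
The quantities that the two red lines mark can be worked through by hand on a tiny made-up example. The sketch below assumes four PCs with invented standard deviations and finds the last PC above the average-share cutoff and the cumulative percentage explained up to that PC.

```r
# Made-up standard deviations for 4 PCs
sdev <- c(3, 2, 1, 1)

# Percent variation per PC and its running total
var_percent        <- sdev^2 / sum(sdev^2) * 100
cumulative_percent <- cumsum(var_percent)

# Cutoff: average share per PC (100 / N), and the last PC above it
cutoff <- 100 / length(sdev)
i_cut  <- max(which(var_percent > cutoff))

i_cut                         # position of the vertical line
cumulative_percent[i_cut]     # height of the horizontal line
```
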
Advanced screeplot analysis
Extract information

Extract information on the variance explained by each PC.

var_out <- PCA_variation(vcf_pca)

Look at the output of PCA_variation():

head(var_out)
##     PC   var_raw var_percent cumulative_percent
## PC1  1 16.476299       4.684              4.684
## PC2  2 12.982275       2.908              7.592
## PC3  3 12.141488       2.543             10.135
## PC4  4 10.341987       1.845             11.980
## PC5  5  9.265079       1.481             13.461
## PC6  6  8.949608       1.382             14.843
Advanced screeplot

This advanced scree plot shows the amount of variation explained by each PC. A horizontal red line marks the cutoff: the percentage of variation a PC must exceed to be considered useful, set here to the average percentage explained per PC. The title reports the cutoff value and the number of the last PC above it. Though only the first few PCs can be plotted, all PCs above the cutoff (“useful PCs”) should probably be used in further machine learning algorithms.

Make the scree plot with screeplot_snps()

screeplot_snps(var_out)

Cumulative variation plot

The cumulative variation plot shows how much of the variation in the data is explained in total as more and more PCs are considered. The vertical red line marks the last useful PC identified by the scree plot cutoff (above). The horizontal red line indicates the total percentage of variation explained by these useful PCs.

Make cumulative variation plot with PCA_cumulative_var_plot()

PCA_cumulative_var_plot(var_out)

PCA Scatterplots

The object var_out created above indicates how much variation is explained by each of the principal components. This information is often added to the axis labels of scatter plots of PCA output.

head(var_out)
##     PC   var_raw var_percent cumulative_percent
## PC1  1 16.476299       4.684              4.684
## PC2  2 12.982275       2.908              7.592
## PC3  3 12.141488       2.543             10.135
## PC4  4 10.341987       1.845             11.980
## PC5  5  9.265079       1.481             13.461
## PC6  6  8.949608       1.382             14.843

PC1 explains 4.68% of the variation, PC2 explains 2.91%, and PC3 explains 2.54%. In total, the first 3 PCs explain only about 10.1% of the variability in the data. The scree plot indicates that the first 631 PCs are above the cutoff and together explain ~76% of the variation in the data. In further analyses such as GWAS, the first 631 PCs should therefore be used.

Plot PC1 versus PC2

Plot the scores, with super-population color-coded.

ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "population",
                  shape = "population",
                  main = "PCA Scatterplot",
                  ylab = "PC2 (2.91% of variation)",
                  xlab = "PC1 (4.68% of variation)")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.
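
The percentages in the axis labels are typed by hand, so they have to be edited whenever the data change. As a hedged alternative, the labels can be built from the var_percent column. In the sketch below, pc_label() is a hypothetical helper and var_toy is a stand-in for var_out that holds the first three values from the output above.

```r
# Stand-in for the first rows of var_out (values copied from the output above)
var_toy <- data.frame(PC = 1:3,
                      var_percent = c(4.684, 2.908, 2.543))

# Hypothetical helper: build an axis label for a given PC number
pc_label <- function(var_df, pc) {
  sprintf("PC%d (%.2f%% of variation)",
          pc, var_df$var_percent[var_df$PC == pc])
}

label_pc1 <- pc_label(var_toy, 1)
label_pc2 <- pc_label(var_toy, 2)
label_pc1
label_pc2
```

The resulting strings could then be passed to the xlab and ylab arguments in place of the hand-typed labels.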

Plot PC2 versus PC3

Plot the scores, with super-population color-coded.

ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC3",
                  x = "PC2",
                  color = "population",
                  shape = "population",
                  main = "PCA Scatterplot",
                  ylab = "PC3 (2.54% of variation)",
                  xlab = "PC2 (2.91% of variation)")

Plot PC1 versus PC3

Plot the scores, with super-population color-coded.

ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC3",
                  x = "PC1",
                  ellipse = TRUE,
                  color = "population",
                  shape = "population",
                  main = "PCA Scatterplot",
                  ylab = "PC3 (2.54% of variation)",
                  xlab = "PC1 (4.68% of variation)")

3D scatterplot

The first 3 principal components can be presented as a 3D scatterplot.

colors_use <- as.numeric(vcf_pca_scores2$population)
scatterplot3d(x = vcf_pca_scores2$PC1,
              y = vcf_pca_scores2$PC2,
              z = vcf_pca_scores2$PC3,
              color = colors_use,
              xlab = "PC1 (4.68%)",
              ylab = "PC2 (2.91%)",
              zlab = "PC3 (2.54%)")
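
The 3D plot colors points by the integer codes of the population factor, so it helps to know which code belongs to which super-population. The sketch below shows the mapping on a made-up vector of super-population labels; legend_key is a hypothetical helper table that could be used to build a legend.

```r
# Made-up super-population labels for six samples
pops <- factor(c("AFR", "EUR", "EAS", "AFR", "SAS", "AMR"))

# Factor levels are sorted alphabetically; as.numeric() returns their codes
colors_use_toy <- as.numeric(pops)
colors_use_toy

# Lookup table from population to color code
legend_key <- data.frame(population = levels(pops),
                         color = seq_along(levels(pops)))
legend_key
```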