Principal Component Analysis (PCA) Pipeline for Analyzing Genomic Data

Introduction to PCA

Principal Component Analysis (PCA) is a statistical technique used in the field of genetics to analyze and visualize genetic variation across individuals and populations. It reduces the dimensionality of genetic data while preserving most of the variation, making it easier to identify patterns of diversity.

PCA works by identifying the directions (principal components) that maximize the variance in the data. Each individual’s genetic data can then be projected onto these principal components, allowing researchers to visualize and interpret the genetic structure of the population.

Importance of the PCA Pipeline Package

This package implements a comprehensive PCA pipeline that utilizes PLINK, a widely used tool in genomic data analysis. The pipeline is designed based on the parameter settings discussed in the Genome Asia 100K project, ensuring robustness and relevance to contemporary genetic research.

The key features of this package include:

Simplified workflow: Users only need to specify the data path, metadata path, and a few other parameters.
Customizability: Although the default settings are derived from the Genome Asia 100K project, users can adjust parameters to suit their specific research needs.
Comprehensive analysis: The package includes functions for PCA computation, result visualization, and outlier detection.

Prerequisites

Before utilizing the PCA pipeline package, certain prerequisites must be met to ensure the smooth execution of the pipeline.

R and RStudio

The pipeline requires R to be installed on your system. RStudio, while optional, is recommended for an enhanced coding environment.

R: Download and install R from The Comprehensive R Archive Network (CRAN).
RStudio: Download and install RStudio from the RStudio website.

Installing PLINK2

PLINK2 is a crucial component of the PCA pipeline. It can be installed using Conda or directly from the PLINK website.

Conda Installation

If you have Conda installed, you can install PLINK2 using the following commands in your terminal:

conda install bioconda::plink2
conda install bioconda/label/cf201901::plink2

Direct Download

Alternatively, PLINK2 can be downloaded directly from its official website. Click on the link below and follow the instructions to download and install PLINK2 for your operating system:

Download PLINK2

Setting the PLINK2 Path

After installing PLINK2, ensure that its path is added to your system’s environment variables, allowing it to be accessed from any directory in the terminal.

In Terminal

Typically, adding PLINK2 to the path can be done by modifying your shell’s profile file (e.g., ~/.bash_profile, ~/.bashrc, or ~/.zshrc), adding the following line:

export PATH="/path/to/plink2:$PATH"

Please replace /path/to/plink2 with the actual installation directory of PLINK2.

In RStudio

If you’re running your pipeline through RStudio, make sure RStudio recognizes the PLINK2 path. You might need to set the path within your R session using:

Sys.setenv(PATH = paste("/path/to/plink2", Sys.getenv("PATH"), sep=":"))

HGDP chromsome 21 data

We will use Human Genome Diversity Project (HGDP) chromosome 21 dataset. The data can be downloaded from here

Install PopulationStructure

devtools::install_github("Devashish13/PopulationStructure")

## Skipping install of 'PopulationStructure' from a github remote, the SHA1 (ffda5baa) has not changed since last install.
##   Use `force = TRUE` to force installation

PCA_GD: Running PCA on genomic data

The PCA_GD function is the first step in the PCA pipeline, responsible for performing PCA on genotype data using PLINK. Users can specify parameters like the data path, metadata path, minor allele frequency threshold, and more.

library(PopulationStructure)
data_path <- "hgdp_chr21/hgdp_chr21"
metadata_path <- "hgdp_chr21/hgdp_chr21.psam"
PCA_GD(data_path = data_path, metadata_path = metadata_path, geno_format = "pfile")

After running the specified R code, it generates two files: one with eigenvectors (.eigenvec) and another with eigenvalues (.eigenval). These files help us see how genetic variation is spread across different populations. We can visualize this information by plotting individuals based on their principal component loadings, usually marking different populations with colors (color_var) and optionally using shapes (shape_var) to represent other groupings, like geographical regions. This way, we can easily observe and understand genetic diversity and relationships within and between populations.

PCA_plot: generate plots of PC loadings

This function generates the PCA plots representing individuals and another plot contains the centroid of population specific loadings.

library(gridExtra)
pcadata_path <- "geno_data_QC_PCA_RESULT"
plots_set <- PCA_plot(pcadata_path = pcadata_path,metadata_path = metadata_path,
         sampleid_var = "#IID",color_var = "population",shape_var = "region",size = 4)

## Warning: package 'dplyr' was built under R version 4.2.3

do.call("grid.arrange", c(plots_set, ncol=2))

## PCA_outliers: Identifying population-specific outlier individuals

This function detects outliers within specific populations by applying the Mahalanobis distance to the first n principal component loadings. Identifying outliers is crucial as it can reveal potential mislabeling of individuals. Moreover, extreme outliers may impact subsequent genetic analyses. The function saves the population wise plots in a pdf(‘pop_specific_plots_D2_pvalue.pdf’) and saves the mahalanobis distance and corresponding pvalues in a text file (‘PCA_data_outlier_pvalue.txt’)

PCA_outliers(pcadata_path,metadata_path,color_var = "population",
             sampleid_var = "#IID", p_threshold = 1e-4,ncomp = 3,x = 1,y=2)

##         #IID population         EV1         EV2          EV3        D2
## 15 HGDP00029     Brahui -0.02720970 -0.00924397 -1.72961e-03 17.283007
## 23 HGDP00045     Brahui  0.01302050 -0.02100000 -5.55604e-03  6.332287
## 4  HGDP00007     Brahui  0.01187310 -0.02116080  4.55430e-03  5.685456
## 6  HGDP00011     Brahui  0.01065720 -0.02976340  1.61853e-03  4.643165
## 24 HGDP00047     Brahui  0.00236403 -0.02259380 -6.77685e-03  4.292484
## 11 HGDP00021     Brahui -0.00218998 -0.02338390  9.81247e-05  3.467931
##          pvalue
## 15 0.0006180635
## 23 0.0965165972
## 4  0.1279576241
## 6  0.1998702745
## 24 0.2315640956
## 11 0.3249451814

## quartz_off_screen 
##                 2

References

“The GenomeAsia 100K Project enables genetic discoveries across Asia.” Nature 576, no. 7785 (2019): 106-111.
Bergström, Anders, et al. “Insights into human genetic variation and population history from 929 diverse genomes.” Science 367.6484 (2020): eaay5012.
https://www.cog-genomics.org/plink/2.0/
https://github.com/koundy/ggplot_theme_Publication

Contact

For more information or to reach out, find me on GitHub and LinkedIn.