Principal Component Analysis (PCA) is a statistical technique used in the field of genetics to analyze and visualize genetic variation across individuals and populations. It reduces the dimensionality of genetic data while preserving most of the variation, making it easier to identify patterns of diversity.
PCA works by identifying the directions (principal components) that maximize the variance in the data. Each individual’s genetic data can then be projected onto these principal components, allowing researchers to visualize and interpret the genetic structure of the population.
This package implements a comprehensive PCA pipeline that utilizes PLINK, a widely used tool in genomic data analysis. The pipeline is designed based on the parameter settings discussed in the Genome Asia 100K project, ensuring robustness and relevance to contemporary genetic research.
The key features of this package include:
Before utilizing the PCA pipeline package, certain prerequisites must be met to ensure the smooth execution of the pipeline.
The pipeline requires R to be installed on your system. RStudio, while optional, is recommended for an enhanced coding environment.
PLINK2 is a crucial component of the PCA pipeline. It can be installed using Conda or directly from the PLINK website.
If you have Conda installed, you can install PLINK2 using the following commands in your terminal:
conda install bioconda::plink2
conda install bioconda/label/cf201901::plink2
Alternatively, PLINK2 can be downloaded directly from its official website. Click on the link below and follow the instructions to download and install PLINK2 for your operating system:
After installing PLINK2, ensure that its path is added to your system’s environment variables, allowing it to be accessed from any directory in the terminal.
Typically, adding PLINK2 to the path can be done by modifying your
shell’s profile file (e.g., ~/.bash_profile,
~/.bashrc, or ~/.zshrc), adding the following
line:
export PATH="/path/to/plink2:$PATH"
Please replace /path/to/plink2 with the actual installation directory of PLINK2.
If you’re running your pipeline through RStudio, make sure RStudio recognizes the PLINK2 path. You might need to set the path within your R session using:
Sys.setenv(PATH = paste("/path/to/plink2", Sys.getenv("PATH"), sep=":"))
We will use Human Genome Diversity Project (HGDP) chromosome 21 dataset. The data can be downloaded from here
devtools::install_github("Devashish13/PopulationStructure")
## Skipping install of 'PopulationStructure' from a github remote, the SHA1 (ffda5baa) has not changed since last install.
## Use `force = TRUE` to force installation
The PCA_GD function is the first step in the PCA
pipeline, responsible for performing PCA on genotype data using PLINK.
Users can specify parameters like the data path, metadata path, minor
allele frequency threshold, and more.
library(PopulationStructure)
data_path <- "hgdp_chr21/hgdp_chr21"
metadata_path <- "hgdp_chr21/hgdp_chr21.psam"
PCA_GD(data_path = data_path, metadata_path = metadata_path, geno_format = "pfile")
After running the specified R code, it generates two files: one with
eigenvectors (.eigenvec) and another with eigenvalues
(.eigenval). These files help us see how genetic variation
is spread across different populations. We can visualize this
information by plotting individuals based on their principal component
loadings, usually marking different populations with colors
(color_var) and optionally using shapes
(shape_var) to represent other groupings, like geographical
regions. This way, we can easily observe and understand genetic
diversity and relationships within and between populations.
This function generates the PCA plots representing individuals and another plot contains the centroid of population specific loadings.
library(gridExtra)
pcadata_path <- "geno_data_QC_PCA_RESULT"
plots_set <- PCA_plot(pcadata_path = pcadata_path,metadata_path = metadata_path,
sampleid_var = "#IID",color_var = "population",shape_var = "region",size = 4)
## Warning: package 'dplyr' was built under R version 4.2.3
do.call("grid.arrange", c(plots_set, ncol=2))
## PCA_outliers: Identifying population-specific outlier individuals
This function detects outliers within specific populations by applying the Mahalanobis distance to the first n principal component loadings. Identifying outliers is crucial as it can reveal potential mislabeling of individuals. Moreover, extreme outliers may impact subsequent genetic analyses. The function saves the population wise plots in a pdf(‘pop_specific_plots_D2_pvalue.pdf’) and saves the mahalanobis distance and corresponding pvalues in a text file (‘PCA_data_outlier_pvalue.txt’)
PCA_outliers(pcadata_path,metadata_path,color_var = "population",
sampleid_var = "#IID", p_threshold = 1e-4,ncomp = 3,x = 1,y=2)
## #IID population EV1 EV2 EV3 D2
## 15 HGDP00029 Brahui -0.02720970 -0.00924397 -1.72961e-03 17.283007
## 23 HGDP00045 Brahui 0.01302050 -0.02100000 -5.55604e-03 6.332287
## 4 HGDP00007 Brahui 0.01187310 -0.02116080 4.55430e-03 5.685456
## 6 HGDP00011 Brahui 0.01065720 -0.02976340 1.61853e-03 4.643165
## 24 HGDP00047 Brahui 0.00236403 -0.02259380 -6.77685e-03 4.292484
## 11 HGDP00021 Brahui -0.00218998 -0.02338390 9.81247e-05 3.467931
## pvalue
## 15 0.0006180635
## 23 0.0965165972
## 4 0.1279576241
## 6 0.1998702745
## 24 0.2315640956
## 11 0.3249451814
## quartz_off_screen
## 2