PlinkQC

plinkQC is an R package for genotype quality control in genetic association studies. See https://meyer-lab-cshl.github.io/plinkQC/index.html

The protocol is implemented in three main functions:
1) The per-individual quality control (perIndividualQC)
2) The per-marker quality control (perMarkerQC)
3) The generation of the new, quality-controlled dataset (cleanData)

Individuals and markers that fail the quality control can subsequently be removed with plinkQC to generate a new, clean dataset.

Input CWOW_flipped 712,571

Per-individual quality control

perIndividualQC writes a list of the IDs of all individuals that fail quality control to the qcdir. These individuals are excluded when perMarkerQC is computed.

The HapMapIII and CWOW data were merged prior to running plinkQC. See the README for more information on how the HapMapIII reference data was obtained.

Sex check

check_sex identifies individuals with discordant sex information.

Heterozygosity check

The identification of individuals with outlying missing genotype and/or heterozygosity rates helps to detect samples with poor DNA quality and/or concentration that should be excluded from the study. Typically, individuals with more than 3-7% of their genotype calls missing are removed. Outlying heterozygosity rates are judged relative to the overall heterozygosity rates in the study, and individuals whose rates are more than a few standard deviations (sd) from the mean heterozygosity rate are removed. A typical quality control for outlying heterozygosity rates would remove individuals who are three sd away from the mean rate.
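The two criteria described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the plinkQC implementation: the function name, the input layout, and the default thresholds (3% missingness, three sd) are assumptions chosen from the ranges mentioned in the text.

```python
from statistics import mean, stdev


def flag_het_miss_outliers(samples, max_missing=0.03, n_sd=3.0):
    """Return the IDs of samples failing either criterion.

    samples: dict mapping sample ID -> (missing_rate, het_rate).
    max_missing and n_sd are illustrative defaults (assumptions).
    """
    het_rates = [het for _, het in samples.values()]
    het_mean = mean(het_rates)
    het_sd = stdev(het_rates)  # sample standard deviation

    failed = []
    for sample_id, (missing_rate, het_rate) in samples.items():
        too_missing = missing_rate > max_missing
        het_outlier = abs(het_rate - het_mean) > n_sd * het_sd
        if too_missing or het_outlier:
            failed.append(sample_id)
    return failed
```

Because the mean and sd are computed from the study itself, the heterozygosity cut-off moves with the dataset, whereas the missingness cut-off is absolute.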

Relatedness check

Related individuals can be identified by their proportion of shared alleles at the genotyped markers (identity by descent, IBD). Typically, individuals related at the second degree or closer are excluded.
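As a rough sketch of the filtering step, the snippet below flags pairs whose estimated IBD proportion exceeds a cut-off. The function name and input layout are hypothetical, and the cut-off of 0.1875 is an assumption: it is the midpoint between the expected sharing for second-degree (0.25) and third-degree (0.125) relatives. Estimating IBD itself is not shown.

```python
def related_pairs(ibd_estimates, threshold=0.1875):
    """Return the pairs whose estimated IBD proportion exceeds threshold.

    ibd_estimates: iterable of (id1, id2, ibd_proportion) tuples.
    threshold: assumed cut-off, midway between second-degree (0.25)
               and third-degree (0.125) expected sharing.
    """
    return [
        (id1, id2)
        for id1, id2, ibd_proportion in ibd_estimates
        if ibd_proportion > threshold
    ]
```

Which member of a flagged pair is then removed is a separate decision that this sketch leaves open.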

Ancestry check

Individuals of divergent ancestry can be identified by combining the genotypes of the study population with genotypes of a reference dataset consisting of individuals of known ethnicity (for instance, individuals from the HapMap or 1000 Genomes studies).

Principal component analysis (PCA) on the combined genotype panel is employed to detect population structure, matching the granularity of the reference dataset. The tool check_ancestry is used to identify individuals with divergent ancestry. check_ancestry utilizes information from principal components 1 and 2 to determine the center of the European reference samples. Study samples with a Euclidean distance from this center that exceeds a specified radius are classified as non-European.

The default radius is 1.5. For this dataset, to retain more individuals, a radius of 5 was selected.
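The distance-based classification can be illustrated as follows. This is a sketch under stated assumptions rather than the check_ancestry code: the center is taken to be the mean of the European reference samples on PC1 and PC2, and the threshold is assumed to scale the largest reference-to-center distance (so 1.5 means 1.5 times that distance). The function and argument names are hypothetical.

```python
from math import hypot


def flag_non_european(ref_pcs, study_pcs, threshold=1.5):
    """Return the IDs of study samples falling outside the radius.

    ref_pcs:   list of (pc1, pc2) for the European reference samples.
    study_pcs: dict mapping sample ID -> (pc1, pc2).
    threshold: ASSUMED to scale the maximum distance of any reference
               sample from the reference center.
    """
    # Center of the reference samples (assumed here to be the mean).
    center_x = sum(pc1 for pc1, _ in ref_pcs) / len(ref_pcs)
    center_y = sum(pc2 for _, pc2 in ref_pcs) / len(ref_pcs)

    max_ref_dist = max(hypot(x - center_x, y - center_y) for x, y in ref_pcs)
    radius = threshold * max_ref_dist

    return [
        sample_id
        for sample_id, (x, y) in study_pcs.items()
        if hypot(x - center_x, y - center_y) > radius
    ]
```

Under these assumptions, raising the threshold from 1.5 to 5 enlarges the circle around the reference center, so fewer study samples are flagged.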

Missingness check

Markers with an excessive missingness rate are removed because they are considered unreliable. Typically, thresholds for marker exclusion based on missingness range from 1% to 5%. Identifying markers with high missingness rates is implemented in snp_missingness. It calculates the rates of missing genotype calls and frequency for all variants in the individuals that passed the perIndividualQC.
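A minimal sketch of the per-marker missingness calculation is shown below. The data layout (a dict of call lists with None for a missing call), the function names, and the 1% default are assumptions for illustration; snp_missingness itself operates on PLINK files.

```python
def snp_missing_rates(genotypes):
    """Return the fraction of missing calls for each marker.

    genotypes: dict mapping marker ID -> list of genotype calls for the
               individuals that passed per-individual QC; None = missing.
    """
    return {
        marker: sum(call is None for call in calls) / len(calls)
        for marker, calls in genotypes.items()
    }


def markers_failing_missingness(genotypes, max_missing=0.01):
    """Return markers whose missing rate exceeds max_missing (assumed 1%)."""
    rates = snp_missing_rates(genotypes)
    return [marker for marker, rate in rates.items() if rate > max_missing]
```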

HWE check

Markers with strong deviation from Hardy-Weinberg equilibrium (HWE) might be indicative of genotyping or genotype-calling errors. As serious genotyping errors often yield very low p-values, it is recommended to choose a reasonably low threshold to avoid filtering out too many variants that have only slight, non-critical deviations. Identifying markers with deviation from HWE is implemented in check_hwe. It calculates the observed and expected heterozygote frequencies per SNP in the individuals that passed the perIndividualQC and tests the deviation of these frequencies from HWE with an exact test.
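To make the observed and expected heterozygote frequencies concrete, the sketch below computes both from genotype counts. Note that it substitutes a one-degree-of-freedom chi-square test for the exact test named in the text, purely to keep the example short; the function name and inputs are hypothetical.

```python
from math import erfc, sqrt


def hwe_summary(n_aa, n_ab, n_bb):
    """Return (observed_het, expected_het, p_value) for one marker.

    n_aa, n_ab, n_bb: genotype counts (at least one must be non-zero).
    The p-value comes from a chi-square test with 1 degree of freedom,
    a simpler stand-in for the exact test (assumption of this sketch).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p

    observed_het = n_ab / n
    expected_het = 2 * p * q

    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return observed_het, expected_het, p_value
```

A marker with counts exactly at Hardy-Weinberg proportions yields a chi-square statistic of zero, while a marker with no heterozygotes at all yields a very small p-value.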

MAF check

Markers with a low minor allele frequency (MAF) or minor allele count are often removed because genotype calling (via the calling algorithm) is very difficult when the heterozygote and rare-homozygote clusters are small. check_maf calculates the minor allele frequencies for all variants in the individuals that passed the perIndividualQC.
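The MAF calculation itself is straightforward, as the sketch below shows. The genotype encoding (0, 1, or 2 copies of one allele, None for a missing call), the function names, and the 1% cut-off are assumptions for illustration and are not taken from check_maf.

```python
def minor_allele_frequency(calls):
    """Return the MAF for one marker, or None if every call is missing.

    calls: list of allele counts per individual (0, 1, or 2 copies of
           one allele), with None marking a missing genotype.
    """
    observed = [call for call in calls if call is not None]
    if not observed:
        return None
    allele_freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is rarer.
    return min(allele_freq, 1.0 - allele_freq)


def low_maf_markers(genotypes, min_maf=0.01):
    """Return markers whose MAF is below min_maf (assumed cut-off)."""
    failing = []
    for marker, calls in genotypes.items():
        maf = minor_allele_frequency(calls)
        if maf is not None and maf < min_maf:
            failing.append(marker)
    return failing
```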

PlinkQC with LBD CWOW SNP array data

Kimberly Olney, Ph.D.

07/25/2024