plinkQC is an R package for genotype quality control in genetic association studies. See https://meyer-lab-cshl.github.io/plinkQC/index.html
The protocol is implemented in three main functions:
1) The per-individual quality control (perIndividualQC)
2) The per-marker quality control (perMarkerQC)
3) The generation of the new, quality-controlled dataset (cleanData)
Individuals and markers that fail the quality control can subsequently be removed with plinkQC to generate a new, clean dataset.
Input CWOW_flipped 712,571
perIndividualQC writes a list of the IDs of all failing individuals to the qcdir. These IDs are excluded when perMarkerQC is computed.
The HapMapIII and CWOW data were merged prior to running plinkQC. See
the README for more information on how the HapMapIII reference data was
obtained.
check_sex: for the identification of individuals with discordant sex
information
The identification of individuals with outlying missing genotype
and/or heterozygosity rates helps to detect samples with poor DNA
quality and/or concentration that should be excluded from the study.
Typically, individuals with more than 3-7% of their genotype calls
missing are removed. Outlying heterozygosity rates are judged relative
to the overall heterozygosity rates in the study, and individuals whose
rates are more than a few standard deviations (sd) from the mean
heterozygosity rate are removed. A typical quality control for outlying
heterozygosity rates would remove individuals who are three sd away from
the mean rate.
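
The two filters described above can be sketched in a few lines of Python. This is only an illustration of the thresholds, not plinkQC's implementation: the function name, the input layout (a dict of per-individual rates) and the default cut-offs of 3% missingness and three sd are assumptions made for the example.

```python
from statistics import mean, stdev

def flag_individuals(samples, max_missing=0.03, het_sd=3.0):
    """Return the sorted IDs of individuals failing either filter.

    samples: dict mapping individual ID -> (missing_rate, het_rate).
    max_missing and het_sd are hypothetical defaults for this sketch.
    """
    het_rates = [het for _, het in samples.values()]
    mu, sd = mean(het_rates), stdev(het_rates)
    failed = []
    for sample_id, (missing_rate, het_rate) in samples.items():
        too_many_missing = missing_rate > max_missing
        het_outlier = abs(het_rate - mu) > het_sd * sd
        if too_many_missing or het_outlier:
            failed.append(sample_id)
    return sorted(failed)
```

Note that with very few individuals a single outlier inflates the sd enough that it can never be three sd from the mean, so a rule like this is only meaningful for reasonably large cohorts.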
The identification of individuals of divergent ancestry can be achieved by combining the genotypes of the study population with genotypes of a reference dataset consisting of individuals from known ethnicities (for instance, individuals from the HapMap or 1000 Genomes studies).
Principal component analysis (PCA) on the combined genotype panel is employed to detect population structure, matching the granularity of the reference dataset. The tool check_ancestry is used to identify individuals with divergent ancestry. check_ancestry utilizes information from principal components 1 and 2 to determine the center of the European reference samples. Study samples with a Euclidean distance from this center that exceeds a specified radius are classified as non-European.
The default radius is 1.5. For this dataset, to retain more
individuals, a radius of 5 was selected.
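
A minimal sketch of this distance rule is shown below. It is written under two assumptions that are not stated in the text above: that the center is the mean of the European reference samples on PC1 and PC2, and that the "radius" value (1.5 or 5) acts as a scaling factor applied to the largest distance between that center and any European reference sample. The function and argument names are hypothetical.

```python
import math

def classify_ancestry(ref_pcs, study_pcs, scale=1.5):
    """Return dict of study ID -> True if the sample lies inside the radius.

    ref_pcs: list of (pc1, pc2) for European reference samples.
    study_pcs: dict mapping study ID -> (pc1, pc2).
    scale: assumed multiplier on the largest reference distance.
    """
    center = (
        sum(pc1 for pc1, _ in ref_pcs) / len(ref_pcs),
        sum(pc2 for _, pc2 in ref_pcs) / len(ref_pcs),
    )
    # Assumption: the radius is scaled by the most distant reference sample.
    radius = scale * max(math.dist(center, pc) for pc in ref_pcs)
    return {
        sample_id: math.dist(center, pc) <= radius
        for sample_id, pc in study_pcs.items()
    }
```

Under these assumptions, raising the value from 1.5 to 5 widens the circle, so study samples that sit further from the reference cluster are still retained.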
Markers with an excessive missingness rate are removed as they are
considered unreliable. Typically, thresholds for marker exclusion based
on missingness range from 1%-5%. Identifying markers with high
missingness rates is implemented in snp_missingness. It calculates the
rates of missing genotype calls and frequency for all variants in the
individuals that passed the perIndividualQC.
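
As a rough illustration of the per-marker missingness calculation (not the code that snp_missingness runs), the sketch below assumes genotypes are held in a dict of call lists in which None marks a missing call. The names and the 1% default threshold are assumptions for the example.

```python
def snp_missing_rates(genotypes):
    """Return dict of marker ID -> fraction of missing calls.

    genotypes: dict mapping marker ID -> list of calls, None = missing.
    """
    return {
        snp: sum(call is None for call in calls) / len(calls)
        for snp, calls in genotypes.items()
    }

def failing_snps(genotypes, max_missing=0.01):
    """Return sorted marker IDs whose missingness exceeds max_missing."""
    rates = snp_missing_rates(genotypes)
    return sorted(snp for snp, rate in rates.items() if rate > max_missing)
```

The calls passed in should already be restricted to individuals that passed the per-individual checks, so that poor-quality samples do not inflate the per-marker rates.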
Markers with strong deviation from Hardy-Weinberg equilibrium (HWE)
might be indicative of genotyping or genotype-calling errors. As serious
genotyping errors often yield very low p-values, it is recommended to
choose a reasonably low threshold to avoid filtering too many variants
(that might have slight, non-critical deviations). Identifying markers
with deviation from HWE is implemented in check_hwe. It calculates the
observed and expected heterozygote frequencies per SNP in the individuals
that passed the perIndividualQC and computes the deviation of the
frequencies from HWE using an exact test.
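
To make the quantities concrete, here is a small sketch of the observed and expected heterozygote frequencies and of a textbook two-sided HWE exact test computed from genotype counts. It is an illustration only: the function names are hypothetical, and the p-value produced by the tool behind check_hwe may differ in detail (for example if a mid-p correction is applied).

```python
from math import exp, lgamma, log

def het_frequencies(n_aa, n_ab, n_bb):
    """Return (observed, expected) heterozygote frequency for one marker."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return n_ab / n, 2 * p * (1 - p)

def hwe_exact_p(n_aa, n_ab, n_bb):
    """Two-sided exact test p-value from genotype counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # Log-probability of `het` heterozygotes given the allele counts.
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2) + lgamma(n + 1)
                - lgamma(hom_a + 1) - lgamma(het + 1) - lgamma(hom_b + 1)
                + lgamma(n_a + 1) + lgamma(n_b + 1) - lgamma(2 * n + 1))

    p_obs = exp(log_prob(n_ab))
    # Heterozygote counts must share the parity of the allele counts.
    possible_hets = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = (exp(log_prob(het)) for het in possible_hets)
    # Sum every outcome that is no more likely than the observed one.
    total = sum(p for p in probs if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)
```

A marker with genotype counts in the expected proportions yields a p-value close to 1, while a marker with no heterozygotes at intermediate allele frequency yields an extremely small one, which is the pattern a low exclusion threshold is meant to catch.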
Minor allele frequency (MAF): markers with a low minor allele count are
often removed because genotype calling (via the calling algorithm) is
very difficult when the heterozygote and rare-homozygote clusters are
small. check_maf calculates the minor allele frequencies for all
variants in the individuals that passed the perIndividualQC.
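
The frequency calculation itself can be sketched as follows. This is an illustration rather than the code check_maf runs, and it assumes genotypes are coded as the number of copies of one allele (0, 1 or 2), with None for a missing call. The function name is hypothetical.

```python
def minor_allele_frequency(calls):
    """Return the MAF for one marker, or None if every call is missing.

    calls: list of allele counts (0, 1 or 2), None = missing call.
    """
    called = [c for c in calls if c is not None]
    if not called:
        return None
    freq = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(freq, 1 - freq)              # the rarer allele defines the MAF
```

Missing calls are dropped from both the numerator and the denominator, so the frequency reflects only the genotypes that were actually called.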