Created by Marzia Antonella Scelsi on December 18th, 2018 - Updated March 7th, 2021
With contributions from Dr Andre Altmann
This tutorial will guide you through all the steps required to impute, clean and process your GWAS data set (as described in Scelsi et al., Brain, 2018).
Following the guidelines in this workbook, you will learn how to:
1. prepare your raw genotype files for imputation with the Sanger Imputation Server;
2. perform post-imputation quality controls;
3. determine ancestry and relatedness levels of the samples in your data set.
At the end of this pipeline, you will have all the genetic information and data necessary to run a genome-wide association study (GWAS).
Excited? Let’s get started! :-)
NOTE ON VERSIONS: wherever possible, I will try to indicate what software versions I am referring to, and also update the commands if new versions are released. All PLINK calls refer to PLINK 1.9. If you notice steps that need updating due to release of newer versions, please let me know (my email is at the bottom of the page). Thanks!
In a realistic scenario, your raw genotypes will be distributed among multiple filesets according to the genotyping platform (hereafter also referred to as chip), one fileset for each chip. We will assume your genotype files are in the PLINK binary format (.bed, .bim, .fam), with all genomic coordinates in the same human genome build, namely hg19. All the steps described in this section need to be performed separately for each genotyping platform.
First, it is good practice to check for consistency of sex assignments (in case sex chromosome SNPs are available), as well as to inspect the subject-level missingness rates. Do this in PLINK:
# Fileset prefix, passed as the first argument to the script
inputfile=${1}
plink --bfile ${inputfile} --check-sex
plink --bfile ${inputfile} --missing
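Since the commands above (and throughout this section) refer to the fileset prefix as ${1}, they are meant to live inside a small shell script that takes that prefix as its first argument. A purely illustrative invocation (script and fileset names below are hypothetical):
# Run the preparation script once per genotyping platform fileset
sh prepare_chip.sh chip1_genotypes
sh prepare_chip.sh chip2_genotypes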
Scan the plink.sexcheck and plink.imiss files for subjects reported as a sex mismatch or as missing more than 10% of genotypes. There should not be many issues, but please keep track of any findings from this step.
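If you would rather pull out the problematic subjects programmatically, here is a minimal sketch relying on the PLINK 1.9 output layout, where STATUS is the fifth column of plink.sexcheck and F_MISS is the last column of plink.imiss:
# Subjects whose reported sex disagrees with the genetically inferred one
awk '$5 == "PROBLEM"' plink.sexcheck
# Subjects missing more than 10% of their genotypes (NR > 1 skips the header)
awk 'NR > 1 && $NF > 0.1' plink.imiss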
For imputation we will use the Haplotype Reference Consortium v1.1 panel. Therefore, your .bim file needs to be consistent with the HRC site list. Please download the reference site list from ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1/, as well as Will Rayner's HRC preparation checking tool (the HRC-1000G-check-bim.pl script). This tool will check for consistency of strand, alleles, positions, Ref/Alt assignments and frequencies between your SNPs and the HRC panel, and produce a set of PLINK commands to update or remove SNPs based on the results of these checks. Running this tool also requires a file containing the allele frequencies in your genotype data, which you can generate with PLINK:
plink --bfile ${inputfile} --freq
perl HRC-1000G-check-bim.pl -b ${inputfile}.bim -f plink.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
WARNING: in my experience, older versions of the perl executable for some reason get stuck - although the desired output is correctly produced. If you’re using one of those older versions, once it is done, please kill the process manually via Ctrl+C before moving onto the next step. If you’re using the latest version (v4.3.0), this should terminate quickly and properly, and you should not need to intervene manually.
IMPORTANT: the perl executable generates a shell script called Run-plink.sh. This script needs to be slightly modified manually for our purposes. So please make the following edits:
* comment out or delete all the plink calls that split the final fileset by chromosome (the Sanger server requires a single fileset containing all autosomal SNPs to be submitted) - see the optional sed sketch just below - and
* replace the last remaining PLINK call with the following:
plink --bfile TEMP4 --a2-allele Force-Allele1-${inputfile}-HRC.txt --autosome --recode vcf-iid bgz --out ${inputfile}-updated
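If you would rather script the first of these edits than do it by hand, a sed one-liner along the following lines should work; it assumes (as in the versions of the tool I have come across) that the chromosome-splitting calls are exactly the ones carrying a --chr flag, so please eyeball Run-plink.sh before and after:
# Comment out every plink call that splits the output by chromosome
sed -i '/--chr/ s/^/#/' Run-plink.sh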
Now run the Run-plink.sh script. This will clean and fix your platform-specific genotypes and convert them to a compressed VCF file.
sh Run-plink.sh
The resulting VCF file also needs to be checked for compatibility with the input required by the Sanger server. To do this, download the checkVCF tool from http://qbrc.swmed.edu/zhanxw/software/checkVCF/checkVCF-20140116.tar.gz. This tarball includes the checkVCF.py script, the reference genome (hs37d5.fa) in FASTA format and its index file (hs37d5.fa.fai).
python checkVCF.py -r hs37d5.fa -o test ${inputfile}-updated.vcf.gz
If the tool does not detect any issues, congrats! Skip the next two steps and upload to the imputation server.
If the only problems detected are mismatching REF alleles with the reference genome, you are on the right path! However, these mismatches need to be fixed before imputation. Do this by calling bcftools:
bcftools +fixref ${inputfile}-updated.vcf.gz -Oz -o ${inputfile}-updated-REFfixed.vcf.gz -- -f hs37d5.fa
and then running the checkVCF.py script again for extra safety:
python checkVCF.py -r hs37d5.fa -o test ${inputfile}-updated-REFfixed.vcf.gz
If no problems are detected, you can finally upload the final compressed VCF file to the imputation server (https://imputation.sanger.ac.uk/)! We recommend choosing the HRC reference panel, EAGLE2 for phasing, and PBWT for imputation.
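Before you actually hit upload, a last quick look at the file you are about to submit never hurts; for example, counting the variant records with bcftools (purely an optional sanity check):
# Count the variant records in the final bgzipped VCF
bcftools view -H ${inputfile}-updated-REFfixed.vcf.gz | wc -l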
IMPORTANT: please remember to perform the steps described in this section for each genotyping platform fileset.
Good luck! :-)
With the help of the previous section, hopefully you will be sailing smoothly through the imputation step with the Sanger server.
On completion of a chip-wise imputation job, you will download a folder containing imputed dosage data in VCF format, split by chromosome. Let us assume the name you gave to your imputation job is stored in the job_name variable:
cd ${job_name}.vcfs
Now we are going to loop over the chromosome-wise VCF files and perform some of the post-imputation quality control steps; see the comments in the following code block for details.
for file in $(ls *.vcf.gz)
do
input_filename=$file
input_basename=$(basename ${input_filename} .vcf.gz)
# Extract the INFO metric from VCF file for ALL variants
vcftools --gzvcf ${input_filename} --get-INFO INFO --out ${input_basename}
# Write a list of variants with INFO metric < 0.3
awk '$NF < 0.3' ${input_basename}.INFO | cut -f 1-2 > ${input_basename}_ExcludePositions.txt
# Call VCFtools to exclude those variants by position
vcftools --gzvcf ${input_filename} --exclude-positions ${input_basename}_ExcludePositions.txt --recode --out ${input_basename}_clean
# Filter out multiallelic variants, set to missing calls with <90% posterior probability, then make plink binary file
plink --vcf ${input_basename}_clean.recode.vcf --double-id --biallelic-only strict list --vcf-min-gp 0.9 --make-bed --out ${input_basename}
rm ${input_basename}.INFO ${input_basename}_ExcludePositions.txt ${input_basename}_clean.recode.vcf
done
IMPORTANT: Once again, the steps described above must be performed separately for each imputed genotyping platform folder ${job_name}.vcfs.
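If you have several imputed chips, you can of course wrap the per-chromosome loop above in an outer loop over the job folders instead of repeating it by hand. Here is a minimal sketch, assuming all the imputed folders end in .vcfs and sit in the current directory (note that this already brings you back to the parent directory, so you can then skip the cd .. at the top of the next code block):
# Run the post-imputation QC loop inside each imputed platform folder
for job_dir in *.vcfs
do
cd ${job_dir}
# ... per-chromosome QC loop from the code block above ...
cd ..
done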
Next, we will merge the imputed platforms chromosome-wise, and perform additional quality controls. Specifically, we will apply filters based on minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE) and SNP missingness rate. Let us assume we have imputed data from two genotyping platforms, and the resulting VCF files are in two folders: ${job1_name}.vcfs and ${job2_name}.vcfs.
cd ..
for chrom in $(seq 1 22)
do
# This will generate the *merge.missnp lists
plink --bfile ${job1_name}.vcfs/${chrom} --bmerge ${job2_name}.vcfs/${chrom} --make-bed --out ${chrom}.merged
# Need to exclude the merge failures...
plink --bfile ${job1_name}.vcfs/${chrom} --exclude ${chrom}.merged-merge.missnp --make-bed --out ${job1_name}.vcfs/${chrom}.tmp
plink --bfile ${job2_name}.vcfs/${chrom} --exclude ${chrom}.merged-merge.missnp --make-bed --out ${job2_name}.vcfs/${chrom}.tmp
# ...and then merge again
plink --bfile ${job1_name}.vcfs/${chrom}.tmp --bmerge ${job2_name}.vcfs/${chrom}.tmp --make-bed --out ${chrom}.merged
rm ${chrom}.merged-merge.missnp ${chrom}.merged-merge.fam
# Now filter on MAF, HWE and SNP missingness rate
plink --bfile ${chrom}.merged --maf 0.05 --geno 0.1 --hwe 5e-7 --make-bed --out ${chrom}.merged-QC
done
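Before moving on, it can be reassuring to check how many variants survived the filters on each chromosome. Since a .bim file holds one line per variant, a simple line count does the job:
# Count QC-passing variants on each autosome
wc -l *.merged-QC.bim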
This loop leaves you with 22 quality-controlled PLINK binary filesets - one for each autosome, each now combining both chips. The complete data set still needs to be quality-controlled for subjects missing more than 10% of total imputed genotypes. If all the steps above have been successful, there should not be any such subjects. However, it is always good practice to check:
for chrom in $(seq 1 22)
do
# Write a list of the 22 autosomal filesets to be merged...
echo ${chrom}.merged-QC >> fileset.txt
done
# ...then merge them
plink --merge-list fileset.txt --make-bed --out 1-22.merged-QC
# Compute missingness statistics on this genome-wide data set...
plink --bfile 1-22.merged-QC --missing
# ...and check if there are any subjects missing more than 10% of total imputed genotypes
awk '$NF >= 0.1' plink.imiss
# If no problems are detected, delete the huge file
rm 1-22.merged-QC.* plink.* fileset.txt
This is the end of the genotype data processing. If you made it safely up to here, I congratulate you on your grit and patience! And I promise the next (and last) section will be child's play compared to what you have just been through! :-)