The hitchhiker’s guide to GWAS

Chapter 1: Step-by-step GWAS data processing pipeline

Created by Marzia Antonella Scelsi on December 18th, 2018 - Updated March 7th, 2021
With contributions from Dr Andre Altmann

This tutorial will guide you through all the steps required to impute, clean and process your GWAS data set (as described in Scelsi et al., Brain, 2018).
Following the guidelines in this workbook, you will learn how to:
1. prepare your raw genotype files for imputation with the Sanger Imputation Server;
2. perform post-imputation quality controls;
3. determine ancestry and relatedness levels of the samples in your data set.
At the end of this pipeline, you will have all the genetic information and data necessary to run a genome-wide association study (GWAS).

Excited? Let’s get started! :-)

NOTE ON VERSIONS: wherever possible, I will try to indicate what software versions I am referring to, and also update the commands if new versions are released. All PLINK calls refer to PLINK 1.9. If you notice steps that need updating due to release of newer versions, please let me know (my email is at the bottom of the page). Thanks!

1. Prepare the raw genotype files for imputation

In a realistic scenario, your raw genotypes will be distributed among multiple filesets according to the genotyping platform (hereby also referred to as chip), one fileset for each chip. We will assume your genotype files are in the PLINK binary format (.bed, .bim, .fam), and the genomic coordinates all in the same human genome build, that is, hg19. All the steps described in this section need to be performed separately for each genotyping platform.

First, it is good practice to check for consistency of sex assignments (in case sex chromosomes SNPs are available), as well as to inspect the subject-level missingness rates. Do this in PLINK:

inputfile=${1}
plink --bfile ${inputfile} --check-sex 
plink --bfile ${inputfile} --missing

Scan the plink.sexcheck and plink.imiss files for subjects reported as a sex mismatch or as missing more than 10% of genotypes. There should not be many issues, but please keep track of any potential finding from this step.

For imputation we will use the Haplotype Reference Consortium v1.1 panel. Therefore, your .bim file needs to be consistent with the HRC site list. Please download the reference site list from ftp://ngs.sanger.ac.uk/production/hrc/HRC.r1/, as well as the HRC preparation checking tool by Will Rayner from here. This tool will check for consistency of strand, alleles, positions, Ref/Alt assignments and frequencies between your SNPs and the HRC panel, and produce a set of PLINK commands to update or remove SNPs based on the results of these checks. Running this tool will also require a file containing the allele frequencies in your genotype file.

plink --bfile ${inputfile} --freq
perl HRC-1000G-check-bim.pl -b ${inputfile}.bim -f plink.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

WARNING: in my experience, older versions of the perl executable for some reason get stuck - although the desired output is correctly produced. If you’re using one of those older versions, once it is done, please kill the process manually via Ctrl+C before moving onto the next step. If you’re using the latest version (v4.3.0), this should terminate quickly and properly, and you should not need to intervene manually.

IMPORTANT: the perl executable generates a shell script called Run-plink.sh. This script needs to be slightly modified manually for our purposes. So please make the following edits:
* comment or delete all the plink calls to split the final file by chromosome (the Sanger server requires only one fileset containing all autosomal SNPs to be submitted), and
* replace the last remaining PLINK call with the following:

plink --bfile TEMP4 --a2-allele Force-Allele1-${inputfile}-HRC.txt --autosome --recode vcf-iid bgz --out ${inputfile}-updated

Now run the Run-plink.sh script. This will clean, fix and convert to compressed VCF format your platform-specific genotypes.

sh Run-plink.sh

The resulting VCF file also needs to be checked for compatibility with the input required by the Sanger server. To do this, download the checkVCF tool from http://qbrc.swmed.edu/zhanxw/software/checkVCF/checkVCF-20140116.tar.gz. This tarball includes the checkVCF.py script, the reference genome (hs37d5.fa) in FASTA format and its index file (hs37d5.fa.fai).

python checkVCF.py -r hs37d5.fa -o test ${inputfile}-updated.vcf.gz

If the tool does not detect any issues, congrats! Skip the next two steps and upload to the imputation server.

If the only problems detected are mismatching REF alleles with the reference genome, you are on the right path! However, these mismatches need to be fixed before imputation. Do this by calling bcftools:

bcftools +fixref ${inputfile}-updated.vcf.gz -Ob -o ${inputfile}-updated-REFfixed.vcf.gz -- -f hs37d5.fa

and then running the checkVCF.py script again for extra safety:

python checkVCF.py -r hs37d5.fa -o test ${inputfile}-updated-REFfixed.vcf.gz

If no problems are detected, you can finally upload the final compressed VCF file to the imputation server (https://imputation.sanger.ac.uk/)! We recommend choosing the HRC reference panel, EAGLE2 for phasing, and PBWT for imputation.

IMPORTANT: please remember to perform the steps described in this section for each genotyping platform fileset.

Good luck! :-)

2. Perform post-imputation quality controls

With the help of the previous section, hopefully you will be sailing smoothly through the imputation step with the Sanger server.
On completion of a chip-wise imputation job, you will download a folder containing imputed dosage data in VCF format and split by chromosome. Let us assume the name you gave to your imputation job is stored in the job_name variable:

cd ${job_name}.vcfs

Now we are going to loop over the chromosome-wise VCF files and perform some of the post-imputation quality control steps. Refer to the comments in the following code block for the details about them.

for file in $(ls *.vcf.gz)
do
    input_filename=$file
    input_basename=$(basename ${input_filename} .vcf.gz)

    # Extract the INFO metric from VCF file for ALL variants
    vcftools --gzvcf ${input_filename} --get-INFO INFO --out ${input_basename}

    # Write a list of variants with INFO metric < 0.3
    awk '$NF < 0.3' ${input_basename}.INFO | cut -f 1-2 > ${input_basename}_ExcludePositions.txt

    # Call VCFtools to exclude those variants by position
    vcftools --gzvcf ${input_filename} --exclude-positions ${input_basename}_ExcludePositions.txt --recode --out ${input_basename}_clean

    # Filter out multiallelic variants, set to missing calls with <90% posterior probability, then make plink binary file
    plink  --vcf ${input_basename}_clean.recode.vcf --double-id --biallelic-only strict list --vcf-min-gp 0.9 --make-bed --out ${input_basename}

    rm ${input_basename}.INFO ${input_basename}_ExcludePositions.txt ${input_basename}_clean.recode.vcf
done

IMPORTANT: Once again, the steps described above must be performed separately for each imputed genotyping platform folder ${job_name}.vcfs.

Next, we will merge the imputed platforms chromosome-wise, and perform additional quality controls. Specifically, we will apply filters based on minor allele frequency (MAF), Hardy-Weinberg equilibrium (HWE) and SNP missingness rate. Let us assume we have imputed data from two genotyping platforms, and the resulting VCF files are in two folders: ${job1_name}.vcfs and ${job2_name}.vcfs.

cd ..
for chrom in $(seq 1 22)
do
    # This will generate the *merge.missnp lists
    plink --bfile ${job1_name}.vcfs/${chrom} --bmerge ${job2_name}.vcfs/${chrom} --make-bed --out ${chrom}.merged
    # Need to exclude the merge failures...
    plink --bfile ${job1_name}.vcfs/${chrom} --exclude ${chrom}.merged-merge.missnp --make-bed --out ${job1_name}.vcfs/${chrom}.tmp
    plink --bfile ${job2_name}.vcfs/${chrom} --exclude ${chrom}.merged-merge.missnp --make-bed --out ${job2_name}.vcfs/${chrom}.tmp
    # ...and then merge again
    plink --bfile ${job1_name}.vcfs/${chrom}.tmp --bmerge ${job2_name}.vcfs/${chrom}.tmp --make-bed --out ${chrom}.merged   
    rm ${chrom}.merged-merge.missnp ${chrom}.merged-merge.fam
    # Now filter on MAF, HWE and SNP missingness rate 
    plink --bfile ${chrom}.merged --maf 0.05 --geno 0.1 --hwe 5e-7 --make-bed --out ${chrom}.merged-QC
done

This will generate a list of 22 quality-controlled PLINK binary filesets - one for each autosome, but including both chips. The complete data set still needs to be quality-controlled for subjects missing more than 10% of total imputed genotypes. If all the steps above have been successful, there should not be any such subjects. However, it is always good practice to check:

for chrom in $(seq 1 22)
do
  # Write a list of the 22 autosomal filesets to be merged...
    echo ${chrom}.merged-QC >> fileset.txt
done
# ...then merge them
plink --merge-list fileset.txt --make-bed --out 1-22.merged-QC
# Compute missingness statistics on this genome-wide data set...
plink --bfile 1-22.merged-QC --missing
# ...and check if there are any subjects missing more than 10% of total imputed genotypes
awk '$NF >= 0.1' plink.imiss
# If no problems are detected, delete the huge file
rm 1-22.merged-QC.* plink.* fileset.txt

This is the end of the genotype files processing. If you made it safely up to here, I congratulate you on your grit and patience so far! And I promise the next (and last) section will be child’s play compared to what you have just been through! :-)

3. Relatedness and ancestry determination

When conducting a genome-wide association study (GWAS), it is of considerable importance to ensure that all study participants are unrelated. Specific techniques are required to account for family structure in GWAS, but these are outside the scope of this tutorial. Moreover, allele frequencies differ across populations, therefore it is also essential to determine the ancestral background of the participants involved in an unbiased way (i.e., ignoring self-reported races and ethnicities that can be considerably biased). Typically, the outcome of an ancestry determination procedure is a set of two or more principal components (PCs) describing the sample structure in terms of genetically-defined populations; these PCs are then included as covariates when fitting the genome-wide regression models.
Both relatedness and ancestry determination are performed using only genotyped SNPs (not imputed ones).

3a. Relatedness

Let us go back to our previous example and assume we have raw genotypes from two chips, ${chip1}.bed/.bim/.fam and ${chip2}.bed/.bim/.fam. These two need to be merged in the ${chip1+2}.bed/.bim/.fam fileset:

plink --bfile ${chip1} --bmerge ${chip2} --make-bed --out tmp
plink --bfile ${chip1} --exclude tmp-merge.missnp --make-bed --out source1_tmp
plink --bfile ${chip2} --exclude tmp-merge.missnp --make-bed --out source2_tmp
plink --bfile source1_tmp --bmerge source2_tmp --make-bed --out ${chip1+2}
rm tmp.* source1_tmp.* source2_tmp.* tmp-merge.*

and SNPs missing in more than 10% participants need to be removed:

plink --bfile ${chip1+2} --geno 0.1 --make-bed --out ${chip1+2}_genofilter

Finally, a list of unrelated subjects will be produced by calling PLINK with the following options:

plink --bfile ${chip1+2}_genofilter --rel-cutoff 0.1 --out ${chip1+2}_unrelated

The --rel-cutoff 0.1 option will prune the genetic relatedness matrix (GRM) at a value of 0.1. The choice of this threshold value is quite arbitrary, however consider as a reference that the relatedness coefficient for first cousins is 0.125. This command will therefore exclude all subjects in your sample related at a level of first cousins or closer.

IMPORTANT: Use the --keep ${chip1+2}_unrelated.rel.id option when invoking PLINK for association testing to include only unrelated individuals.

3b. Ancestry

We will now perform a two-stage ancestry and subpopulation analysis using the SNPweights v2.1 software (https://cdn1.sph.harvard.edu/wp-content/uploads/sites/181/2014/05/SNPweights2.1.tar.gz). For the first stage, SNP weights for European, West African, East Asian and Native American ancestral populations (snpwt.NA file) can be downloaded from https://cdn1.sph.harvard.edu/wp-content/uploads/sites/181/2014/03/snpwt.NA_.zip. For the second stage, SNP weights for North-Western, South-Eastern and Ashkenazi Jewish ancestral populations of European Americans (snpwt.EA file) can be downloaded from https://cdn1.sph.harvard.edu/wp-content/uploads/sites/181/2014/03/snpwt.EA_.zip.

Before inferring ancestry, complete the quality control of the merged, genotyped SNPs:

plink --bfile ${chip1+2}_genofilter --maf 0.05 --hwe 5e-7 --make-bed ${chip1+2}_QC

WARNING: the following steps MUST be executed on a UNIX system. For one, because the --complement option to the cut shell command (which is needed by some scripts below) is not supported on Mac systems. Furthermore, in my experience, even after recompiling the SNPweights source code on Mac, ancestry results are still funny (specifically, NaN are generated for no apparent reason for every other sample in the results file).

SNPweights accepts as input files in the EIGENSTRAT format. Therefore, since we are working with PLINK files, a format conversion is necessary as a preliminar step. We will use three shell scripts - courtesy of Dr A. Altmann - to perform this conversion.

fam2ind.sh:

#!/bin/bash
inp=$1
cat $inp | awk '{ print $2" "$5" "$1}' | sed -e 's% 2 % F %' -e 's% 1 % M %' | tr ' ' '\t' > $(basename $inp .fam).ind

bim2snp.sh:

#!/bin/bash
inp=$1
cat $inp | awk '{ print $2"\t"$1"\t"$3"\t"$4"\t"$5"\t"$6}'| awk -F'\t' -vOFS='\t' '{ gsub("0", "X", $5) ; gsub("0", "X", $6) ; gsub("I", "X", $6);  print }' > $(basename $inp .bim).snp

bed2geno.sh:

#!/bin/bash
inp=${1%%.bed}
plink --bfile $inp --recode A-transpose --out tmp
cat tmp.traw | cut -f 1-6 --complement | tail -n +2 | sed 's%\t%%g'| sed 's/NA/9/g' > `basename $inp.geno`

The format conversion is then performed by running:

sh fam2ind.sh ${chip1+2}_QC.fam
sh bim2snp.sh ${chip1+2}_QC.bim
sh bed2geno.sh ${chip1+2}_QC.bed

Now create a new text file and call it inferancestry.par. This file will contain the values for the parameters to be passed to SNPweights in the first stage of the analysis. The content of inferancestry.par will look like this:

geno:  ${chip1+2}_QC.geno
snp:   ${chip1+2}_QC.snp
ind:   ${chip1+2}_QC.ind
snpwt: snpwt.NA
predpcoutput: ${chip1+2}_QC.NA.predpc

The first stage of the ancestry analysis will compare the samples in your dataset against a reference panel of ancestry-informative markers for four broad ethnic groups: Yoruba Africans (YRI), Central Europeans (CEU), East Asians (ASI), and Native Americans (NAT).

Run the ancestry inference by calling:

cd SNPweights2.1
./bin/inferanc -p ./inferancestry.par

The 10 columns in the ${chip1+2}_QC.NA.predpc output file are: sample ID, population label, number of SNPs used for inference, predicted PC1, predicted PC2, predicted PC3, % YRI ancestry, % CEU ancestry, % East Asian ancestry, % Native American ancestry.

In the second stage of this analysis, we will focus on subjects with >80% predicted CEU ancestry, and determine the extents to which these subjects fall into the three following European subpopulations: North-Western Europeans (NWE), South-Eastern Europeans (SEE) and Ashkenazi Jewish (AJW). This step will also produce the principal components that will be used as covariates in the GWAS to correct for population stratification.
So first let us write a list of CEU subjects to be retained, subset our genotype data, and repeat the conversion to EIGENSTRAT format:

awk '$8 >= 0.8' ${chip1+2}_QC.NA.predpc | awk '{printf $1"\t"$2"\n"}' > ${chip1+2}.CEU80
plink --bfile ${chip1+2}_QC --keep ${chip1+2}.CEU80 --make-bed --out ${chip1+2}.CEU80

sh fam2ind.sh ${chip1+2}.CEU80.fam
sh bim2snp.sh ${chip1+2}.CEU80.bim
sh bed2geno.sh ${chip1+2}.CEU80.bed

Next, the parameter values in inferancestry.par need to be updated. It will now read:

geno:  ${chip1+2}.CEU80.geno
snp:   ${chip1+2}.CEU80.snp
ind:   ${chip1+2}.CEU80.ind
snpwt: snpwt.EA
predpcoutput: ${chip1+2}.CEU80.EA.predpc

Finally, run the subpopulation inference by calling:

./bin/inferanc -p ./inferancestry.par
rm tmp.*

The 8 columns in the ${chip1+2}.CEU80.EA.predpc output file are: sample ID, population label, number of SNPs used for inference, predicted PC1, predicted PC2, % Northwest European ancestry, % Southeast European ancestry, % Ashkenazi Jewish ancestry.

This tutorial ends here. I hope you found it helpful and, for any comments or questions, please get in touch by sending an email to marzia.scelsi.15@ucl.ac.uk.