Final Project

QC 1000 Genomes Data

Convert VCF to Plink Format

plink --vcf ALL.2of4intersection.20100804.genotypes.vcf --make-bed --out ALL.2of4intersection.20100804.genotypes - --vcf loads the 1000 Genomes project VCF into plink - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) from the 1000 Genomes VCF - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Assign unique identifiers to the SNPs with a missing rs-identifier (i.e., the SNPs with “.”).

plink --bfile ALL.2of4intersection.20100804.genotypes --set-missing-var-ids "@:#[b37]\$1,\$2" --make-bed --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - For variants that do not standard rsIDs, assign a placeholder identifier with the format [chromosome number]:[base-pair position][b37][allele 1][allele 2] - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after filling missing variant IDs - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Remove individuals based on missing genotype data.

plink --bfile 1kG_MDS --mind 0.2 --allow-no-sex --make-bed --out 1kG_MDS2 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --mind 0.2 directs plink to filter out all individuals with missing call rates exceeding the provided value (here a value of 0.2) - --allow-no-sex allows samples with ambiguous sex to remain ambiguous, instead of forcing a “missing” sex call - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after filtering individuals with high missing call rates - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Remove variants based on missing genotype data.

plink --bfile 1kG_MDS2 --geno 0.02 --allow-no-sex --make-bed --out 1kG_MDS3 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --geno 0.02 directs plink to filter out all variants with missing call rates exceeding the provided value (here a value of 0.02) - --allow-no-sex allows samples with ambiguous sex to remain ambiguous, instead of forcing a “missing” sex call - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after filtering variants with high missing call rates - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Remove individuals based on missing genotype data.

plink --bfile 1kG_MDS3 --mind 0.02 --allow-no-sex --make-bed --out 1kG_MDS4 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --mind 0.02 directs plink to filter out all individuals with missing call rates exceeding the provided value (here a value of 0.2) - --allow-no-sex allows samples with ambiguous sex to remain ambiguous, instead of forcing a “missing” sex call - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after filtering individuals with high missing call rates - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Remove variants based on MAF.

plink --bfile 1kG_MDS4 --maf 0.05 --allow-no-sex --make-bed --out 1kG_MDS5 - --maf 0.05 filters out all variants that have a minor allele frequency below 0.05 - --allow-no-sex allows samples with ambiguous sex to remain ambiguous, instead of forcing a “missing” sex call - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after filtering SNPs with low minor allele frequency - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Prep 1000 Genomes Data for Merging with HapMap Data

Extract the variants present in HapMap dataset from the 1000 genomes dataset.

awk '{print$2}' HapMap_3_r3_12.bim > HapMap_SNPs.txt - Get the 2nd field (the variant IDs) from the HapMap data set that passed previous QC checks

plink --bfile 1kG_MDS5 --extract HapMap_SNPs.txt --make-bed --out 1kG_MDS6 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --extract to keep only the variant IDs present in the HapMap data set - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after keeping only HapMap variants - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Extract the variants present in 1000 Genomes dataset from the HapMap dataset.

awk '{print$2}' 1kG_MDS6.bim > 1kG_MDS6_SNPs.txt - Get the 2nd field (the variant IDs) from the 1000 Genomes data set that passed QC checks

plink --bfile HapMap_3_r3_12 --extract 1kG_MDS6_SNPs.txt --recode --make-bed --out HapMap_MDS - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --extract to keep only the variant IDs in the HapMap data set that are present in the 1000 Genomes project data set - --recode and --make-bed to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) for the intersection of variants between HapMap data and 1000 Genomes data - --out specifies the prefix that will be used when creating the PLINK1 binary file set

The data sets 1kG_MDS6 and HapMap_MDS now contain the exact same variants, but with different samples

The datasets must have the same build. Change the build 1000 Genomes data build.

awk '{print$2,$4}' HapMap_MDS.map > buildhapmap.txt - Get the 2nd and 4th fields (variant ID and base-pair coordinate respectively) of each variant in the HapMap_MDS data set - Write to buildhapmap.txt

plink --bfile 1kG_MDS6 --update-map buildhapmap.txt --make-bed --out 1kG_MDS7 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --update-map makes plink change the base-pair coordinate of variants in the 1kG_MDS6 data set to the base-pair coordinate for those variants in the buidldhapmap.txt file - Variant base-pair coordinates now match between the 1000 Genomes data set and the HapMap data set - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after updating variant base-pair coordinates - --out specifies the prefix that will be used when creating the PLINK1 binary file set

1kG_MDS7 and HapMap_MDS now have the same base-pair coordinates for each variant

Merge the HapMap and 1000 Genomes Data Sets

Set the reference genome

awk '{print$2,$5}' 1kG_MDS7.bim > 1kg_ref-list.txt - Get the variant ID and allele 1 (usually the major allele) for each row in the 1000 Genomes data set - Write that information to a 1kg_ref-list.txt file

plink --bfile HapMap_MDS --reference-allele 1kg_ref-list.txt --make-bed --out HapMap-adj - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --reference-allele uses the 1kg_ref-list.txt file to specify the major/reference allele (allele 1) for each variant in the HapMap_MDS data set - Reference allele information for HapMap data now comes from the 1000 Genomes project - Now the HapMap and 1000 Genomes data set have the same reference alleles for all SNPs - This command will generate some warnings for impossible A1 allele assignment - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Resolve strand issues.

Check for potential strand issues.

awk '{print$2,$5,$6}' 1kG_MDS7.bim > 1kGMDS7_tmp - Get the variant ID, allele 1, and allele 2 for each row in the 1000 Genomes project data set (curated for overlap with HapMap data) - Write that information to 1kGMDS7_tmp

awk '{print$2,$5,$6}' HapMap-adj.bim > HapMap-adj_tmp - Get the variant ID, allele 1, and allele 2 for each row in the HapMap data set (curated for overlap with HapMap data) - Write that information to HapMap-adj_tmp

sort 1kGMDS7_tmp HapMap-adj_tmp | uniq -u > all_differences.txt - sort 1kGMDS7_tmp HapMap-adj_tmp to merge the 1kGMDS7_tmp and HapMap-adj_tmp files and then sort the resultant merge - uniq -u to get only lines that are unique in the merged output. Unique lines will be variants with the same variant ID, but different allele 1 and allele 2 values - wc -l all_differences.txt - 1624 differences between the files, some of these might be due to strand issues

Flip SNPs for resolving strand issues.

Print SNP-identifier and remove duplicates.

awk '{print$1}' all_differences.txt | sort -u > flip_list.txt - Get the variant IDs for the variants that have different alleles between the HapMap and 1000 Genomes data sets - sort -u to get a sorted list of the unique variant IDs and write to flip_list.txt - Generates a file of 812 SNPs. These are the non-corresponding SNPs between the two files

Flip the 812 non-corresponding SNPs.

plink --bfile HapMap-adj --flip flip_list.txt --reference-allele 1kg_ref-list.txt --make-bed --out corrected_hapmap - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --flip to change the alleles for the variants in flip_list to be the complementary alleles - --reference-allele uses the 1kg_ref-list.txt file to specify the major/reference allele (allele 1) for each variant in the HapMap_MDS data set - Reference allele information for HapMap data now comes from the 1000 Genomes project - Now the HapMap and 1000 Genomes data set have the same reference alleles for all SNPs - This command will generate some warnings for impossible A1 allele assignment - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after changing alleles to the complementary base - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Check for SNPs which are still problematic after they have been flipped.

awk '{print$2,$5,$6}' corrected_hapmap.bim > corrected_hapmap_tmp - Get the variant ID, allele 1, and allele 2 values for all rows in the HapMap data that has been corrected for potential strand-reversal issues

sort 1kGMDS7_tmp corrected_hapmap_tmp | uniq -u > uncorresponding_SNPs.txt -sort 1kGMDS7_tmp corrected_hapmap_tmp to merge and sort the 1000 Genomes variants and corrected HapMap variants - uniq -u to get the variants that have discordant information between the 1000 Genomes variants and HapMap variants after correcting for potential strand-reversal issues - wc -l uncorresponding_SNPs.txt - This file demonstrates that there are 84 differences between the files even after correcting for strand issues

Remove problematic SNPs from HapMap and 1000 Genomes.

awk '{print$1}' uncorresponding_SNPs.txt | sort -u > SNPs_for_exclusion.txt - awk '{print$1}' uncorresponding_SNPs.txt to get the variant IDs that were not corrected after correcting for strand issues - sort -u to de-duplicate the list of variant IDs and write to SNPs_for_exclusion.txt - Generate a list of the 42 SNPs which caused the 84 differences between the HapMap and the 1000 Genomes data sets after flipping and setting of the reference genome

Remove the 42 problematic SNPs from the HapMap data

plink --bfile corrected_hapmap --exclude SNPs_for_exclusion.txt --make-bed --out HapMap_MDS2 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --exclude excludes variants listed in SNPs_for_exclusion.txt - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after excluding problematic SNPs - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Remove the 42 problematic SNPs from the 1000 Genomes data

plink --bfile 1kG_MDS7 --exclude SNPs_for_exlusion.txt --make-bed --out 1kG_MDS8 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --exclude excludes variants listed in SNPs_for_exclusion.txt - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after excluding problematic SNPs - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Merge HapMap with 1000 Genomes Data.

plink --bfile HapMap_MDS2 --bmerge 1kG_MDS8.bed 1kG_MDS8.bim 1kG_MDS8.fam --allow-no-sex --make-bed --out MDS_merge2 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --bmerge merges the provided file set (1000 Genomes project PLINK1 binary files here) with the input HapMap_MDS2 data - Note, there is sample overlap between the HapMap data set and the 1000 Genomes data set - --allow-no-sex allows samples with ambiguous sex to remain ambiguous, instead of forcing a “missing” sex call - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after merging the data sets - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Perform MDS on HapMap-CEU data anchored by 1000 Genomes data.

Using a set of pruned SNPs

plink --bfile MDS_merge2 --extract indepSNP.prune.in --genome --out MDS_merge2 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --extract indepSNP.prune.in to select only SNPs in approximate linkage equilibrium with one another to do the MDS analysis with - --genome to perform an identity-by-descent computation - --out specifies the prefix that will be used when creating the IBD report

plink --bfile MDS_merge2 --read-genome MDS_merge2.genome --cluster --mds-plot 10 --out MDS_merge2 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --read-genome uses the identity-by-descent results from the previous computation as the basis for clustering - --cluster performs linkage clustering on the MDS_merge2 data set - --mds-plot to do a multidimensional scaling analysis with 10 components - --out specifies the prefix that will be used when creating the multidimensional scaling report

MDS-plot

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel - Download the file with population information of the 1000 Genomes dataset - The file 20100804.ALL.panel contains population codes of the individuals of 1000 genomes.

Convert population codes into superpopulation codes (i.e., AFR,AMR,ASN, and EUR).

awk '{print$1,$1,$2}' 20100804.ALL.panel > race_1kG.txt
sed 's/JPT/ASN/g' race_1kG.txt > race_1kG2.txt
sed 's/ASW/AFR/g' race_1kG2.txt > race_1kG3.txt
sed 's/CEU/EUR/g' race_1kG3.txt > race_1kG4.txt
sed 's/CHB/ASN/g' race_1kG4.txt > race_1kG5.txt
sed 's/CHD/ASN/g' race_1kG5.txt > race_1kG6.txt
sed 's/YRI/AFR/g' race_1kG6.txt > race_1kG7.txt
sed 's/LWK/AFR/g' race_1kG7.txt > race_1kG8.txt
sed 's/TSI/EUR/g' race_1kG8.txt > race_1kG9.txt
sed 's/MXL/AMR/g' race_1kG9.txt > race_1kG10.txt
sed 's/GBR/EUR/g' race_1kG10.txt > race_1kG11.txt
sed 's/FIN/EUR/g' race_1kG11.txt > race_1kG12.txt
sed 's/CHS/ASN/g' race_1kG12.txt > race_1kG13.txt
sed 's/PUR/AMR/g' race_1kG13.txt > race_1kG14.txt

The final superpopulation file with sample IDs and population origin information should be written to the race_1kG14.txt file

Create a racefile of your own data.

awk '{print$1,$2,"OWN"}' HapMap_MDS.fam > racefile_own.txt - Get the FID and variant ID for each row in the HapMap data and label it as “OWN” - Write the output to racefile_own.txt

Concatenate racefiles.

cat race_1kG14.txt racefile_own.txt | gsed -e '1 i\FID IID race' > racefile.txt - Concatenate the 1000 Genomes population file and the HapMap population file - Add a FID IID race header to the concatenated output - Write the concatenated output to racefile.txt

Generate population stratification plot.

Rscript --no-save MDS_merged.R - The output file MDS.pdf demonstrates that our own data falls within the European group of the 1000 genomes data. Therefore, we do not have to remove subjects.

Exclude ethnic outliers.

The cut-off levels are not fixed thresholds but have to be determined based on the visualization of the first two dimensions. To exclude ethnic outliers, the thresholds need to be set around the cluster of population of interest.

awk '{ if ($4 <- 0.04 && $5 > 0.03) print $1,$2 }' MDS_merge2.mds > EUR_MDS_merge2 - Only include individuals in HapMap data within cut-off thresholds. Other individuals will be excluded from further analysis - Select based on the first two dimensions - Get the FID and IID of the individuals to exclude - Print the individuals to keep to the file EUR_MDS_merge2

Extract these individuals in HapMap data.

plink --bfile HapMap_3_r3_12 --keep EUR_MDS_merge2 --make-bed --out HapMap_3_r3_13 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --keep to only keep the individuals listed in EUR_MDS_merge2 in the current analysis - As the HapMap data did not include any ethnic outliers, no individuals were removed at this step - --make-bed directs plink to create a new PLINK1 binary file set (3 files consisting of a .bed, a .bim, and a .fam file) after keeping only samples of EUR origin - --out specifies the prefix that will be used when creating the PLINK1 binary file set

Create co-variates for the HapMap data set based on MDS.

plink --bfile HapMap_3_r3_13 --extract indepSNP.prune.in --genome --out HapMap_3_r3_13 - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --extract to only keep the variants listed in indepSNP.prune.in. These will be variants in approximate linkage equilibrium - --genome performs an identity-by-descent computation - --out specifies the prefix that will be used when writing the output of this computation

plink --bfile HapMap_3_r3_13 --read-genome HapMap_3_r3_13.genome --cluster --mds-plot 10 --out HapMap_3_r3_13_mds - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --read-genome to use the .genome file created by the identity-by-descent computation as the basis for clustering - --cluster performs linkage clustering on the HapMap_3_r3_13 data set - --mds-plot to do a multidimensional scaling analysis with 10 components - --out specifies the prefix that will be used when creating the multidimensional scaling report

awk '{print$1, $2, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}' HapMap_3_r3_13_mds.mds > covar_mds.txt - Change the format of the .mds file into a plink co-variate file - Take the fields of the HapMap .mds file that are shared between the .mds format and the plink co-variate file format

The values in covar_mds.txt will be used as co-variates, to adjust for remaining population stratification in the following GWAS

Association analyses

Since the phenotype is binary encoded (1 = control, 2 = case), association tests should account for a binary outcome.

Using Chi-Squared Tests and Logistic Regressions across the genome is appropriate here.

plink --bfile HapMap_3_r3_13 --assoc --adjust --out assoc_results - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --assoc performs standard case/control association analysis using a basic Chi-Squared Test of Association across the genome - This option does not allow adjustments/corrections for co-variates such as principal components or MDS components - `-–out` specifies the prefix that will be used when creating the output of the association analysis

plink --bfile HapMap_3_r3_13 --covar covar_mds.txt --logistic --hide-covar --out logistic_results - --bfile option tells plink what the names of the .bed, .bim, and .fam files to take as input are - --covar to allow for a co-variate file to adjust/correct the effects of in the logistic regression - By passing the co-variate file produced as a result of the earlier MDS workflow, any effect due to population stratification that was not eliminated through (not) eliminating ethnic outliers can be controlled for - --hide-covar to only show the additive results of the SNPs in the output file - --out specifies the prefix that will be used when creating the output of the regression analysis

awk '!/'NA'/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic - Remove NA values since they could create errors in the visualization scripts

Final Project

QC 1000 Genomes Data

Convert VCF to Plink Format

Assign unique identifiers to the SNPs with a missing rs-identifier (i.e., the SNPs with “.”).

Remove individuals based on missing genotype data.

Remove variants based on missing genotype data.

Remove individuals based on missing genotype data.

Remove variants based on MAF.

Prep 1000 Genomes Data for Merging with HapMap Data

Extract the variants present in HapMap dataset from the 1000 genomes dataset.

Extract the variants present in 1000 Genomes dataset from the HapMap dataset.

The datasets must have the same build. Change the build 1000 Genomes data build.

Merge the HapMap and 1000 Genomes Data Sets

Set the reference genome

Resolve strand issues.

Check for potential strand issues.

Flip SNPs for resolving strand issues.

Print SNP-identifier and remove duplicates.

Flip the 812 non-corresponding SNPs.

Check for SNPs which are still problematic after they have been flipped.

Remove problematic SNPs from HapMap and 1000 Genomes.

Remove the 42 problematic SNPs from the HapMap data

Remove the 42 problematic SNPs from the 1000 Genomes data

Merge HapMap with 1000 Genomes Data.

Perform MDS on HapMap-CEU data anchored by 1000 Genomes data.

Using a set of pruned SNPs

MDS-plot

Convert population codes into superpopulation codes (i.e., AFR,AMR,ASN, and EUR).

Create a racefile of your own data.

Concatenate racefiles.

Generate population stratification plot.

Exclude ethnic outliers.

Extract these individuals in HapMap data.

Create co-variates for the HapMap data set based on MDS.

Association analyses

Multiple testing

There are various ways to deal with multiple testing outside the conventional genome-wide significance threshold of 5.0E-8, below we present a couple.

Generate Manhattan and QQ plots.

These scripts assume R >= 3.0.0.

If you changed the name of the .assoc file or to the assoc.logistic file, please assign those names also to the Rscripts for the Manhattan and QQ plot, otherwise the scripts will not run.

This location can be changed to your desired directory