plink --vcf ALL.2of4intersection.20100804.genotypes.vcf --make-bed --out ALL.2of4intersection.20100804.genotypes
- --vcf loads the 1000 Genomes project VCF into
plink - --make-bed directs
plink to create a new PLINK1 binary file set (3 files
consisting of a .bed, a .bim, and a
.fam file) from the 1000 Genomes VCF - --out
specifies the prefix that will be used when creating the PLINK1 binary
file set
plink --bfile ALL.2of4intersection.20100804.genotypes --set-missing-var-ids "@:#[b37]\$1,\$2" --make-bed --out ALL.2of4intersection.20100804.genotypes_no_missing_IDs
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - For variants that do not standard rsIDs,
assign a placeholder identifier with the format
[chromosome number]:[base-pair position][b37][allele 1][allele 2]
- --make-bed directs plink to create a new
PLINK1 binary file set (3 files consisting of a .bed, a
.bim, and a .fam file) after filling missing
variant IDs - --out specifies the prefix that will be used
when creating the PLINK1 binary file set
plink --bfile 1kG_MDS --mind 0.2 --allow-no-sex --make-bed --out 1kG_MDS2
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --mind 0.2 directs
plink to filter out all individuals
with missing call rates exceeding the provided value (here a value of
0.2) - --allow-no-sex allows samples with ambiguous sex to
remain ambiguous, instead of forcing a “missing” sex call -
--make-bed directs plink to create a new
PLINK1 binary file set (3 files consisting of a .bed, a
.bim, and a .fam file) after filtering
individuals with high missing call rates - --out specifies
the prefix that will be used when creating the PLINK1 binary file
set
plink --bfile 1kG_MDS2 --geno 0.02 --allow-no-sex --make-bed --out 1kG_MDS3
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --geno 0.02 directs
plink to filter out all variants with
missing call rates exceeding the provided value (here a value of 0.02) -
--allow-no-sex allows samples with ambiguous sex to remain
ambiguous, instead of forcing a “missing” sex call -
--make-bed directs plink to create a new
PLINK1 binary file set (3 files consisting of a .bed, a
.bim, and a .fam file) after filtering
variants with high missing call rates - --out specifies the
prefix that will be used when creating the PLINK1 binary file set
plink --bfile 1kG_MDS3 --mind 0.02 --allow-no-sex --make-bed --out 1kG_MDS4
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --mind 0.02 directs
plink to filter out all individuals
with missing call rates exceeding the provided value (here a value of
0.2) - --allow-no-sex allows samples with ambiguous sex to
remain ambiguous, instead of forcing a “missing” sex call -
--make-bed directs plink to create a new
PLINK1 binary file set (3 files consisting of a .bed, a
.bim, and a .fam file) after filtering
individuals with high missing call rates - --out specifies
the prefix that will be used when creating the PLINK1 binary file
set
plink --bfile 1kG_MDS4 --maf 0.05 --allow-no-sex --make-bed --out 1kG_MDS5
- --maf 0.05 filters out all variants that have a minor
allele frequency below 0.05 - --allow-no-sex allows samples
with ambiguous sex to remain ambiguous, instead of forcing a “missing”
sex call - --make-bed directs plink to
create a new PLINK1 binary file set (3 files consisting of a
.bed, a .bim, and a .fam file)
after filtering SNPs with low minor allele frequency -
--out specifies the prefix that will be used when creating
the PLINK1 binary file set
awk '{print$2}' HapMap_3_r3_12.bim > HapMap_SNPs.txt
- Get the 2nd field (the variant IDs) from the HapMap data set that
passed previous QC checks
plink --bfile 1kG_MDS5 --extract HapMap_SNPs.txt --make-bed --out 1kG_MDS6
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --extract to keep only the
variant IDs present in the HapMap data set - --make-bed
directs plink to create a new PLINK1 binary file set (3
files consisting of a .bed, a .bim, and a
.fam file) after keeping only HapMap variants -
--out specifies the prefix that will be used when creating
the PLINK1 binary file set
awk '{print$2}' 1kG_MDS6.bim > 1kG_MDS6_SNPs.txt -
Get the 2nd field (the variant IDs) from the 1000 Genomes data set that
passed QC checks
plink --bfile HapMap_3_r3_12 --extract 1kG_MDS6_SNPs.txt --recode --make-bed --out HapMap_MDS
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --extract to keep only the
variant IDs in the HapMap data set that are present in the 1000 Genomes
project data set - --recode and --make-bed to
create a new PLINK1 binary file set (3 files consisting of a
.bed, a .bim, and a .fam file)
for the intersection of variants between HapMap data and 1000 Genomes
data - --out specifies the prefix that will be used when
creating the PLINK1 binary file set
The data sets 1kG_MDS6 and
HapMap_MDS now contain the exact same variants, but with
different samples
awk '{print$2,$4}' HapMap_MDS.map > buildhapmap.txt -
Get the 2nd and 4th fields (variant ID and base-pair coordinate
respectively) of each variant in the HapMap_MDS data set -
Write to buildhapmap.txt
plink --bfile 1kG_MDS6 --update-map buildhapmap.txt --make-bed --out 1kG_MDS7
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --update-map makes
plink change the base-pair coordinate of variants in
the 1kG_MDS6 data set to the base-pair coordinate for those
variants in the buidldhapmap.txt file - Variant base-pair
coordinates now match between the 1000 Genomes data set and the HapMap
data set - --make-bed directs plink to
create a new PLINK1 binary file set (3 files consisting of a
.bed, a .bim, and a .fam file)
after updating variant base-pair coordinates - --out
specifies the prefix that will be used when creating the PLINK1 binary
file set
1kG_MDS7 and HapMap_MDS now have the same base-pair coordinates for each variant
awk '{print$2,$5}' 1kG_MDS7.bim > 1kg_ref-list.txt -
Get the variant ID and allele 1 (usually the major allele) for each row
in the 1000 Genomes data set - Write that information to a
1kg_ref-list.txt file
plink --bfile HapMap_MDS --reference-allele 1kg_ref-list.txt --make-bed --out HapMap-adj
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --reference-allele uses the
1kg_ref-list.txt file to specify the major/reference allele
(allele 1) for each variant in the HapMap_MDS data set
- Reference allele information for HapMap data now comes from the 1000
Genomes project - Now the HapMap and 1000 Genomes data set have the same
reference alleles for all SNPs - This command will generate some
warnings for impossible A1 allele assignment - --out
specifies the prefix that will be used when creating the PLINK1 binary
file set
awk '{print$2,$5,$6}' 1kG_MDS7.bim > 1kGMDS7_tmp -
Get the variant ID, allele 1, and allele 2 for each row in the 1000
Genomes project data set (curated for overlap with HapMap data) - Write
that information to 1kGMDS7_tmp
awk '{print$2,$5,$6}' HapMap-adj.bim > HapMap-adj_tmp
- Get the variant ID, allele 1, and allele 2 for each row in the HapMap
data set (curated for overlap with HapMap data) - Write that information
to HapMap-adj_tmp
sort 1kGMDS7_tmp HapMap-adj_tmp | uniq -u > all_differences.txt
- sort 1kGMDS7_tmp HapMap-adj_tmp to merge the
1kGMDS7_tmp and HapMap-adj_tmp files and then
sort the resultant merge - uniq -u to get only lines that
are unique in the merged output. Unique lines will be variants with the
same variant ID, but different allele 1 and allele 2 values -
wc -l all_differences.txt - 1624 differences between the
files, some of these might be due to strand issues
awk '{print$1}' all_differences.txt | sort -u > flip_list.txt
- Get the variant IDs for the variants that have different alleles
between the HapMap and 1000 Genomes data sets - sort -u to
get a sorted list of the unique variant IDs and write to
flip_list.txt - Generates a file of 812 SNPs. These are the
non-corresponding SNPs between the two files
plink --bfile HapMap-adj --flip flip_list.txt --reference-allele 1kg_ref-list.txt --make-bed --out corrected_hapmap
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --flip to change the alleles
for the variants in flip_list to be the complementary
alleles - --reference-allele uses the
1kg_ref-list.txt file to specify the major/reference allele
(allele 1) for each variant in the HapMap_MDS data set
- Reference allele information for HapMap data now comes from the 1000
Genomes project - Now the HapMap and 1000 Genomes data set have the same
reference alleles for all SNPs - This command will generate some
warnings for impossible A1 allele assignment - --make-bed
directs plink to create a new PLINK1 binary file set (3
files consisting of a .bed, a .bim, and a
.fam file) after changing alleles to the complementary base
- --out specifies the prefix that will be used when
creating the PLINK1 binary file set
awk '{print$2,$5,$6}' corrected_hapmap.bim > corrected_hapmap_tmp
- Get the variant ID, allele 1, and allele 2 values for all rows in the
HapMap data that has been corrected for potential strand-reversal
issues
sort 1kGMDS7_tmp corrected_hapmap_tmp | uniq -u > uncorresponding_SNPs.txt
-sort 1kGMDS7_tmp corrected_hapmap_tmp to merge and sort
the 1000 Genomes variants and corrected HapMap variants -
uniq -u to get the variants that have discordant
information between the 1000 Genomes variants and HapMap variants
after correcting for potential strand-reversal issues -
wc -l uncorresponding_SNPs.txt - This file demonstrates
that there are 84 differences between the files even after correcting
for strand issues
awk '{print$1}' uncorresponding_SNPs.txt | sort -u > SNPs_for_exclusion.txt
- awk '{print$1}' uncorresponding_SNPs.txt to get the
variant IDs that were not corrected after correcting for strand issues -
sort -u to de-duplicate the list of variant IDs and write
to SNPs_for_exclusion.txt - Generate a list of the 42 SNPs
which caused the 84 differences between the HapMap and the 1000 Genomes
data sets after flipping and setting of the reference genome
plink --bfile corrected_hapmap --exclude SNPs_for_exclusion.txt --make-bed --out HapMap_MDS2
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --exclude excludes variants
listed in SNPs_for_exclusion.txt - --make-bed
directs plink to create a new PLINK1 binary file set (3
files consisting of a .bed, a .bim, and a
.fam file) after excluding problematic SNPs -
--out specifies the prefix that will be used when creating
the PLINK1 binary file set
plink --bfile 1kG_MDS7 --exclude SNPs_for_exlusion.txt --make-bed --out 1kG_MDS8
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --exclude excludes variants
listed in SNPs_for_exclusion.txt - --make-bed
directs plink to create a new PLINK1 binary file set (3
files consisting of a .bed, a .bim, and a
.fam file) after excluding problematic SNPs -
--out specifies the prefix that will be used when creating
the PLINK1 binary file set
plink --bfile HapMap_MDS2 --bmerge 1kG_MDS8.bed 1kG_MDS8.bim 1kG_MDS8.fam --allow-no-sex --make-bed --out MDS_merge2
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --bmerge merges the provided
file set (1000 Genomes project PLINK1 binary files here) with the input
HapMap_MDS2 data - Note, there is sample overlap between the HapMap data
set and the 1000 Genomes data set - --allow-no-sex allows
samples with ambiguous sex to remain ambiguous, instead of forcing a
“missing” sex call - --make-bed directs
plink to create a new PLINK1 binary file set (3 files
consisting of a .bed, a .bim, and a
.fam file) after merging the data sets - --out
specifies the prefix that will be used when creating the PLINK1 binary
file set
plink --bfile MDS_merge2 --extract indepSNP.prune.in --genome --out MDS_merge2
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --extract indepSNP.prune.in to
select only SNPs in approximate linkage equilibrium with one another to
do the MDS analysis with - --genome to perform an
identity-by-descent computation - --out specifies the
prefix that will be used when creating the IBD report
plink --bfile MDS_merge2 --read-genome MDS_merge2.genome --cluster --mds-plot 10 --out MDS_merge2
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --read-genome uses the
identity-by-descent results from the previous computation as the basis
for clustering - --cluster performs linkage clustering on
the MDS_merge2 data set - --mds-plot to do a
multidimensional scaling analysis with 10 components -
--out specifies the prefix that will be used when creating
the multidimensional scaling report
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/20100804.ALL.panel
- Download the file with population information of the 1000 Genomes
dataset - The file 20100804.ALL.panel contains population
codes of the individuals of 1000 genomes.
awk '{print$1,$1,$2}' 20100804.ALL.panel > race_1kG.txt
sed 's/JPT/ASN/g' race_1kG.txt > race_1kG2.txt
sed 's/ASW/AFR/g' race_1kG2.txt > race_1kG3.txt
sed 's/CEU/EUR/g' race_1kG3.txt > race_1kG4.txt
sed 's/CHB/ASN/g' race_1kG4.txt > race_1kG5.txt
sed 's/CHD/ASN/g' race_1kG5.txt > race_1kG6.txt
sed 's/YRI/AFR/g' race_1kG6.txt > race_1kG7.txt
sed 's/LWK/AFR/g' race_1kG7.txt > race_1kG8.txt
sed 's/TSI/EUR/g' race_1kG8.txt > race_1kG9.txt
sed 's/MXL/AMR/g' race_1kG9.txt > race_1kG10.txt
sed 's/GBR/EUR/g' race_1kG10.txt > race_1kG11.txt
sed 's/FIN/EUR/g' race_1kG11.txt > race_1kG12.txt
sed 's/CHS/ASN/g' race_1kG12.txt > race_1kG13.txt
sed 's/PUR/AMR/g' race_1kG13.txt > race_1kG14.txt
race_1kG14.txt
fileawk '{print$1,$2,"OWN"}' HapMap_MDS.fam > racefile_own.txt
- Get the FID and variant ID for each row in the HapMap
data and label it as “OWN” - Write the output to
racefile_own.txt
cat race_1kG14.txt racefile_own.txt | gsed -e '1 i\FID IID race' > racefile.txt
- Concatenate the 1000 Genomes population file and the HapMap population
file - Add a FID IID race header to the concatenated output
- Write the concatenated output to racefile.txt
Rscript --no-save MDS_merged.R - The output file MDS.pdf
demonstrates that our own data falls within the European group of the
1000 genomes data. Therefore, we do not have to remove subjects.
The cut-off levels are not fixed thresholds but have to be determined based on the visualization of the first two dimensions. To exclude ethnic outliers, the thresholds need to be set around the cluster of population of interest.
awk '{ if ($4 <- 0.04 && $5 > 0.03) print $1,$2 }' MDS_merge2.mds > EUR_MDS_merge2
- Only include individuals in HapMap data within cut-off thresholds.
Other individuals will be excluded from further analysis - Select based
on the first two dimensions - Get the FID and
IID of the individuals to exclude - Print the
individuals to keep to the file EUR_MDS_merge2
plink --bfile HapMap_3_r3_12 --keep EUR_MDS_merge2 --make-bed --out HapMap_3_r3_13
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --keep to only keep the
individuals listed in EUR_MDS_merge2 in
the current analysis - As the HapMap data did not include any ethnic
outliers, no individuals were removed at this step -
--make-bed directs plink to create a new
PLINK1 binary file set (3 files consisting of a .bed, a
.bim, and a .fam file) after keeping only
samples of EUR origin - --out specifies
the prefix that will be used when creating the PLINK1 binary file
set
plink --bfile HapMap_3_r3_13 --extract indepSNP.prune.in --genome --out HapMap_3_r3_13
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --extract to only keep the
variants listed in indepSNP.prune.in. These will be
variants in approximate linkage equilibrium -
--genome performs an identity-by-descent computation -
--out specifies the prefix that will be used when writing
the output of this computation
plink --bfile HapMap_3_r3_13 --read-genome HapMap_3_r3_13.genome --cluster --mds-plot 10 --out HapMap_3_r3_13_mds
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --read-genome to use the
.genome file created by the identity-by-descent computation
as the basis for clustering - --cluster performs linkage
clustering on the HapMap_3_r3_13 data set -
--mds-plot to do a multidimensional scaling analysis with
10 components - --out specifies the prefix
that will be used when creating the multidimensional scaling report
awk '{print$1, $2, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13}' HapMap_3_r3_13_mds.mds > covar_mds.txt
- Change the format of the .mds file into a plink co-variate file - Take
the fields of the HapMap .mds file that are shared between
the .mds format and the plink co-variate file format
The values in covar_mds.txt will be used as co-variates, to adjust for remaining population stratification in the following GWAS
| Since the phenotype is binary encoded (1 = control, 2 = case), association tests should account for a binary outcome. |
| Using Chi-Squared Tests and Logistic Regressions across the genome is appropriate here. |
plink --bfile HapMap_3_r3_13 --assoc --adjust --out assoc_results
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --assoc performs standard
case/control association analysis using a basic Chi-Squared Test of
Association across the genome - This option does not allow
adjustments/corrections for co-variates such as principal components or
MDS components - `-–out` specifies the prefix that will be
used when creating the output of the association analysis
plink --bfile HapMap_3_r3_13 --covar covar_mds.txt --logistic --hide-covar --out logistic_results
- --bfile option tells plink what the
names of the .bed, .bim, and .fam
files to take as input are - --covar to allow for a
co-variate file to adjust/correct the effects of in the logistic
regression - By passing the co-variate file produced as a result of the
earlier MDS workflow, any effect due to population stratification that
was not eliminated through (not) eliminating ethnic outliers can be
controlled for - --hide-covar to only show the additive
results of the SNPs in the output file - --out specifies
the prefix that will be used when creating the output of the regression
analysis
awk '!/'NA'/' logistic_results.assoc.logistic > logistic_results.assoc_2.logistic
- Remove NA values since they could create errors in the
visualization scripts
#adjust plink –bfile HapMap_3_r3_13 -assoc –adjust –out adjusted_assoc_results # This file gives a Bonferroni corrected p-value, along with FDR and others.
Rscript –no-save Manhattan_plot.R Rscript –no-save QQ_plot.R