GWAS Analysis using the ADNI dataset

In this research project, we aim to identify the single nucleotide polymorphisms (SNPs) associated with a diagnosis of Alzheimer’s disease, and thereby to elucidate the genetic factors underlying the development and progression of this neurodegenerative disorder. The original tutorial for pre-processing the genetic data can be found at: https://rpubs.com/maffleur/452627

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

1. Quality control: check sex mismatches and missingness of genotypes

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
inputfile=ADNI_cluster_01_forward_757LONI
plink --bfile $inputfile --check-sex
plink --bfile $inputfile --missing

We can check the number of subjects with mismatches in reported sex by inspecting the plink.sexcheck file. Similarly, we check the number of subjects missing more than 10% of genotypes by inspecting the plink.imiss file.

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink

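# Count data rows (skipping the header) in the per-subject missingness report
total_subjects=$(awk 'NR > 1 {count++} END {print count + 0}' plink.imiss)
# STATUS (column 5) is PROBLEM when reported and genotype-inferred sex disagree
sex_mismatch_count=$(awk 'NR > 1 && $5 == "PROBLEM" {count++} END {print count + 0}' plink.sexcheck)
# F_MISS (column 6) is the fraction of genotypes missing for each subject
missing_genotypes_count=$(awk 'NR > 1 && $6 > 0.10 {count++} END {print count + 0}' plink.imiss)

echo "Total number of subjects: ${total_subjects}"
echo "Number of subjects with sex mismatches: ${sex_mismatch_count}"
echo "Number of subjects missing more than 10% of genotypes: ${missing_genotypes_count}"

## Total number of subjects: 757
## Number of subjects with sex mismatches: 2
## Number of subjects missing more than 10% of genotypes: 1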
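
The counts above do not show which subjects were flagged. A minimal sketch for listing them, assuming the standard PLINK 1.9 column layouts (STATUS in column 5 of plink.sexcheck, F_MISS in column 6 of plink.imiss). The function names are illustrative and not part of the original workflow:

```shell
# Print "FID IID" for subjects whose STATUS column in a .sexcheck file is PROBLEM.
list_sex_mismatches() {
    awk 'NR > 1 && $5 == "PROBLEM" {print $1, $2}' "$1"
}

# Print "FID IID" for subjects whose F_MISS in an .imiss file exceeds a threshold.
# $1: .imiss file, $2: threshold (defaults to 0.10, the cut-off used above).
list_high_missingness() {
    awk -v thr="${2:-0.10}" 'NR > 1 && ($6 + 0) > (thr + 0) {print $1, $2}' "$1"
}
```

Running `{ list_sex_mismatches plink.sexcheck; list_high_missingness plink.imiss 0.10; } | sort -u > qc_fail_ids.txt` would write the flagged subjects to a list (qc_fail_ids.txt is a hypothetical file name) in the FID/IID format that plink --remove accepts, should they need to be excluded.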

Only 2 of the 757 subjects show a sex mismatch and 1 is missing more than 10% of genotypes, so the dataset appears to be of acceptable quality. We now prepare the data for imputation with the Haplotype Reference Consortium (HRC) v1.1 panel. The following steps check for consistency of strand, alleles, positions, Ref/Alt assignments and allele frequencies between our SNPs and the HRC panel. They are explained further at https://rpubs.com/maffleur/452627

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
inputfile=ADNI_cluster_01_forward_757LONI
plink --bfile ADNI_cluster_01_forward_757LONI --freq
perl HRC-1000G-check-bim.pl -b ADNI_cluster_01_forward_757LONI.bim -f plink.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
inputfile=ADNI_cluster_01_forward_757LONI
plink --bfile TEMP4 --a2-allele Force-Allele1-${inputfile}-HRC.txt --autosome --recode vcf-iid bgz --out ${inputfile}-updated

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
sh Run-plink.sh

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
python checkVCF.py -r hs37d5.fa -o test ADNI_cluster_01_forward_757LONI-updated.vcf.gz

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
# -Oz writes bgzip-compressed VCF, matching the .vcf.gz file name (-Ob would write BCF)
bcftools +fixref ADNI_cluster_01_forward_757LONI-updated.vcf.gz -Oz -o ADNI_cluster_01_forward_757LONI-updated-REFfixed.vcf.gz -- -f hs37d5.fa -m top

cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
python checkVCF.py -r hs37d5.fa -o test ADNI_cluster_01_forward_757LONI-updated-REFfixed.vcf.gz

Here, we upload the ADNI_cluster_01_forward_757LONI-updated-REFfixed.vcf.gz file to the Sanger Imputation Service (https://imputation.sanger.ac.uk/) for imputation, a process that infers missing and ungenotyped variants from a reference panel and so increases the density of the genetic data. Following imputation, we download the imputed VCF file and use it to generate PLINK files (.bed, .bim and .fam) for the subsequent GWAS.
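
The conversion from the imputed VCF back to PLINK binary files can be sketched as below. This is an illustration rather than the command actually used: the function name and the DRY_RUN switch are hypothetical, and --double-id is an assumption (PLINK 1.9 generally needs one of --double-id, --const-fid or --id-delim when sample IDs contain underscores, as ADNI IDs do).

```shell
# Convert an imputed VCF to PLINK .bed/.bim/.fam files.
# $1: imputed VCF (bgzipped), $2: output prefix.
# With DRY_RUN=1 the command is printed instead of executed.
vcf_to_plink() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo plink --vcf "$1" --double-id --make-bed --out "$2"
    else
        # --double-id sets both FID and IID to the VCF sample ID (assumed choice)
        plink --vcf "$1" --double-id --make-bed --out "$2"
    fi
}
```

Calling `DRY_RUN=1 vcf_to_plink imputed.vcf.gz imputed` (both names are placeholders) prints the command so it can be checked before it is run on the real files.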

2. GWAS: Investigate SNPs associated with Alzheimer’s disease

After imputation and combining the ADNI1 and ADNI3 datasets into chip1_2, we run the following GWAS with the diagnosis of Alzheimer’s disease as the phenotype of interest.

cd /Users/seymour/Desktop/Bioinformatics/GWAS/chip1_2/
plink --bfile chip1_2_Dx \
      --pheno chip1_2_DX.txt \
      --pheno-name DX_bl \
      --allow-no-sex \
      --covar chip1_2_CEU80_EA_predpc_covs.txt \
      --covar-number 4,5 \
      --maf 0.05 \
      --hwe 5e-7 \
      --logistic \
      --out logistic_regression_results_DX
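
Because covariates are supplied, the resulting logistic_regression_results_DX.assoc.logistic file holds one row per SNP for the additive effect (TEST equal to ADD) plus one row per covariate. A minimal sketch for pulling out the additive rows below a P-value threshold, assuming the standard PLINK 1.9 header names; the function name and the default of 5e-8 (the conventional genome-wide significance level) are choices made for this illustration:

```shell
# Print CHR, SNP, BP, OR and P for additive-model rows with P below a threshold.
# $1: .assoc.logistic file, $2: P-value threshold (defaults to 5e-8).
significant_hits() {
    awk -v thr="${2:-5e-8}" '
        # Map header names to column numbers so the column order does not matter
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        $(col["TEST"]) == "ADD" && $(col["P"]) != "NA" && ($(col["P"]) + 0) < (thr + 0) {
            print $(col["CHR"]), $(col["SNP"]), $(col["BP"]), $(col["OR"]), $(col["P"])
        }' "$1"
}
```

For example, `significant_hits logistic_regression_results_DX.assoc.logistic` lists the genome-wide significant SNPs, and a second argument such as 1e-5 relaxes the threshold to a suggestive level.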

3. Plotting GWAS results

In this section, we generate a Manhattan plot and a QQ plot to visualize the results of our GWAS. Manhattan plots are commonly used in GWAS to identify genomic regions associated with a trait or phenotype of interest, which in this case is the diagnosis of Alzheimer’s disease. QQ plots compare the observed distribution of P-values with the distribution expected under the null hypothesis.

library(qqman)

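# Read the PLINK logistic regression output (one row per SNP and model term)
results <- read.table("logistic_regression_results_DX.assoc.logistic", header = TRUE)
# Keep only the additive SNP effect; covariate rows share each SNP's position and would otherwise be plotted too
results_clean <- subset(na.omit(results), TEST == "ADD")
manhattan(results_clean, col = c("blue", "red"), main = "GWAS Manhattan Plot", annotatePval = 0.01)

qq(results_clean$P, main = "QQ Plot")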

Based on our analysis, rs2075650 emerges as the SNP most significantly associated with the diagnosis of Alzheimer’s disease. This variant lies in TOMM40, adjacent to APOE on chromosome 19, a region with a well-established role in Alzheimer’s disease risk. The QQ plot also deviates from the expected null distribution. A deviation confined to the smallest P-values is consistent with an enrichment of true associations, whereas a departure along the whole distribution would instead point to inflation from sources such as population stratification, so the plot should be read with that distinction in mind.
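
One common way to quantify the QQ-plot deviation is the genomic inflation factor, the median observed chi-square statistic divided by 0.4549364, the median of a chi-square distribution with 1 degree of freedom. The sketch below is not part of the original analysis. It assumes the STAT column of the .assoc.logistic file holds the Wald statistic for each term, so that its square is approximately chi-square distributed with 1 degree of freedom, and the function name is illustrative:

```shell
# Estimate the genomic inflation factor (lambda) from additive-model rows.
# $1: .assoc.logistic file. Prints lambda to three decimals, or NA if no rows qualify.
genomic_lambda() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
         $(col["TEST"]) == "ADD" && $(col["STAT"]) != "NA" {
             s = $(col["STAT"]) + 0
             printf "%.10f\n", s * s      # squared Wald statistic
         }' "$1" |
    sort -n |
    awk '{ x[NR] = $1 }
         END {
             if (NR == 0) { print "NA"; exit }
             # Median of the sorted values (mean of the two middle values if NR is even)
             median = (NR % 2) ? x[(NR + 1) / 2] : (x[NR / 2] + x[NR / 2 + 1]) / 2
             printf "%.3f\n", median / 0.4549364
         }'
}
```

A value close to 1 suggests little systematic inflation, while values well above 1 would suggest that stratification or other confounding is contributing to the deviation.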

2024-02-01