This research project aims to identify the single nucleotide
polymorphisms (SNPs) associated with a diagnosis of Alzheimer’s disease
and, through them, to elucidate the genetic factors underlying the
development and progression of this neurodegenerative disorder.
The original tutorial for pre-processing genetic data can be found at: https://rpubs.com/maffleur/452627
Data used in preparation of this article were obtained from the
Alzheimer’s Disease Neuroimaging Initiative (ADNI) database
(adni.loni.usc.edu). As such, the investigators within the ADNI
contributed to the design and implementation of ADNI and/or provided
data but did not participate in analysis or writing of this report. A
complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
inputfile=ADNI_cluster_01_forward_757LONI
plink --bfile $inputfile --check-sex
plink --bfile $inputfile --missing
We can check the number of subjects with mismatches in reported sex by inspecting the plink.sexcheck file. Similarly, we check the number of subjects missing more than 10% of genotypes by inspecting the plink.imiss file.
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
total_subjects=$(awk 'NR>1 {count++} END {print count}' "plink.imiss")
sex_mismatch_count=$(awk '$5 == "PROBLEM"' "plink.sexcheck" | wc -l)
missing_genotypes_count=$(awk 'NR>1 && ($6 > 0.10)' "plink.imiss" | wc -l)
echo "Total number of subjects: ${total_subjects}"
echo "Number of subjects with sex mismatches: ${sex_mismatch_count}"
echo "Number of subjects missing more than 10% of genotypes: ${missing_genotypes_count}"
## Total number of subjects: 757
## Number of subjects with sex mismatches: 2
## Number of subjects missing more than 10% of genotypes: 1
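To see which subjects lie behind these counts, the same two files can be scanned for the individual IDs. The following is a minimal sketch, assuming the standard PLINK 1.9 column layouts for `.sexcheck` and `.imiss`; the function name `list_flagged_subjects` is hypothetical and not part of the original workflow.

```shell
# Print "FID IID reason" for every subject flagged by either QC check.
# Assumed column layouts (PLINK 1.9 defaults):
#   .sexcheck: FID IID PEDSEX SNPSEX STATUS F
#   .imiss:    FID IID MISS_PHENO N_MISS N_GENO F_MISS
# Arguments: sexcheck file, imiss file, missingness threshold (default 0.10).
list_flagged_subjects() (
  sexcheck_file=$1
  imiss_file=$2
  max_missing=${3:-0.10}
  # Subjects whose reported sex disagrees with the genotype-inferred sex
  awk 'NR > 1 && $5 == "PROBLEM" { print $1, $2, "sex_mismatch" }' "$sexcheck_file"
  # Subjects whose genotype missingness exceeds the threshold
  awk -v max="$max_missing" \
    'NR > 1 && ($6 + 0) > (max + 0) { print $1, $2, "high_missingness" }' "$imiss_file"
)

# Example call, using the files written by the plink commands above:
# list_flagged_subjects plink.sexcheck plink.imiss 0.10
```

The first two columns of the output are FID and IID, so they could be saved to a file and passed to `plink --remove` if these subjects are to be excluded.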
Only 2 of the 757 subjects show a sex mismatch and only 1 is missing more than 10% of genotypes, so the dataset appears acceptable at this stage. We now prepare the data for imputation against the Haplotype Reference Consortium (HRC) v1.1 panel. The following steps check that strand, alleles, positions, Ref/Alt assignments and allele frequencies are consistent between our SNPs and the HRC panel. These steps are explained further in https://rpubs.com/maffleur/452627
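As an illustration of why the strand check matters: for A/T and C/G SNPs the two alleles are complementary, so the strand cannot be inferred from the alleles alone. The sketch below counts such strand-ambiguous SNPs in a PLINK `.bim` file. It is not the logic of `HRC-1000G-check-bim.pl`; the function name `count_ambiguous_snps` is hypothetical, and the standard six-column `.bim` layout is assumed.

```shell
# Count strand-ambiguous (palindromic) SNPs in a PLINK .bim file.
# Assumed .bim columns: chromosome, SNP ID, cM, bp position, allele 1, allele 2.
count_ambiguous_snps() (
  awk '{
    a = toupper($5); b = toupper($6)
    # A/T and C/G pairs read the same on both strands
    if ((a == "A" && b == "T") || (a == "T" && b == "A") ||
        (a == "C" && b == "G") || (a == "G" && b == "C")) n++
  } END { print n + 0 }' "$1"
)

# Example call on the dataset used in this tutorial:
# count_ambiguous_snps ADNI_cluster_01_forward_757LONI.bim
```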
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
inputfile=ADNI_cluster_01_forward_757LONI
plink --bfile ADNI_cluster_01_forward_757LONI --freq
perl HRC-1000G-check-bim.pl -b ADNI_cluster_01_forward_757LONI.bim -f plink.frq -r HRC.r1-1.GRCh37.wgs.mac5.sites.tab -h
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
inputfile=ADNI_cluster_01_forward_757LONI
plink --bfile TEMP4 --a2-allele Force-Allele1-${inputfile}-HRC.txt --autosome --recode vcf-iid bgz --out ${inputfile}-updated
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
sh Run-plink.sh
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
python checkVCF.py -r hs37d5.fa -o test ADNI_cluster_01_forward_757LONI-updated.vcf.gz
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
bcftools +fixref ADNI_cluster_01_forward_757LONI-updated.vcf.gz -Oz -o ADNI_cluster_01_forward_757LONI-updated-REFfixed.vcf.gz -- -f hs37d5.fa -m top
cd /Users/seymour/Desktop/Bioinformatics/GWAS/ADNI1/ADNI_1_GWAS_Plink
python checkVCF.py -r hs37d5.fa -o test ADNI_cluster_01_forward_757LONI-updated-REFfixed.vcf.gz
Next, we upload the
ADNI_cluster_01_forward_757LONI-updated-REFfixed.vcf.gz file to the
Sanger Imputation Service (https://imputation.sanger.ac.uk/) for imputation, a
process that infers missing genotypes and improves the resolution of
genetic data. Following imputation, we download the imputed VCF file
and use it to generate PLINK files (.fam, .bim and .bed) for subsequent
GWAS analysis.
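The conversion from the imputed VCF to PLINK binary files is not shown in this tutorial. One possible way to do it is sketched below. The file names `imputed.vcf.gz` and `chip1_imputed` are placeholders, and `--double-id` is an assumption: ADNI sample IDs contain underscores, which PLINK 1.9 otherwise tries to split into family and individual IDs. The function prints the command by default and only runs it when `DRY_RUN=0`.

```shell
# Convert an imputed VCF into PLINK .bed/.bim/.fam files.
# Arguments: path to the VCF, output prefix (both placeholders here).
# With DRY_RUN=1 (the default) the command is printed, not executed.
imputed_vcf_to_plink() (
  vcf=$1
  out_prefix=$2
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "plink --vcf $vcf --double-id --make-bed --out $out_prefix"
  else
    # --double-id uses the full VCF sample ID as both FID and IID
    plink --vcf "$vcf" --double-id --make-bed --out "$out_prefix"
  fi
)

# Example: print the command without running it
# imputed_vcf_to_plink imputed.vcf.gz chip1_imputed
```

If the imputation service returns one VCF per chromosome, the files would need to be concatenated or merged first; that step is not covered by this sketch.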
After imputing and combining the ADNI1 and ADNI3 datasets into chip1_2, we run the following GWAS with the diagnosis of Alzheimer’s disease as the phenotype of interest.
cd /Users/seymour/Desktop/Bioinformatics/GWAS/chip1_2/
plink --bfile chip1_2_Dx \
  --pheno chip1_2_DX.txt \
  --pheno-name DX_bl \
  --allow-no-sex \
  --covar chip1_2_CEU80_EA_predpc_covs.txt \
  --covar-number 4,5 \
  --maf 0.05 \
  --hwe 5e-7 \
  --logistic \
  --out logistic_regression_results_DX
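Because covariates are supplied, the `.assoc.logistic` output contains one row per SNP for the additive genotype effect (`TEST` equal to `ADD`) plus one row per covariate. A minimal sketch for listing the SNP-level hits below a p-value threshold is given here. The function name `genome_wide_hits` and the default threshold of 5e-8 (the conventional genome-wide significance level) are assumptions, not part of the original analysis; columns are located by header name.

```shell
# List additive-model SNP results with P below a threshold.
# Arguments: PLINK .assoc.logistic file, p-value threshold (default 5e-8).
# Output columns: CHR SNP BP P
genome_wide_hits() (
  awk -v thr="${2:-5e-8}" '
    # Map header names to column numbers
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Keep SNP effect rows only, skip missing p-values, apply the threshold
    $col["TEST"] == "ADD" && $col["P"] != "NA" && ($col["P"] + 0) < (thr + 0) {
      print $col["CHR"], $col["SNP"], $col["BP"], $col["P"]
    }' "$1"
)

# Example call on the output of the plink command above:
# genome_wide_hits logistic_regression_results_DX.assoc.logistic 5e-8
```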
In this section, we generate a Manhattan plot and a QQ plot to visualize the results of our GWAS. Manhattan plots are commonly used in GWAS to visually identify genomic regions associated with a trait or phenotype of interest, which in this case is the diagnosis of Alzheimer’s disease.
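library(qqman)
# Read the PLINK logistic regression output (includes CHR, SNP, BP, TEST and P columns)
results <- read.table("logistic_regression_results_DX.assoc.logistic", header = TRUE)
# Keep only the additive SNP effect rows (with --covar, PLINK also writes one row per covariate), then drop rows with NA
results_clean <- na.omit(results[results$TEST == "ADD", ])
manhattan_plot <- manhattan(results_clean, col = c("blue", "red"), main = "GWAS Manhattan Plot", annotatePval = 0.01)
qq_plot <- qq(results_clean$P, main = "QQ Plot")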
Based on our analysis, the SNP rs2075650 emerges as the most
significant genetic variant associated with the diagnosis of Alzheimer’s
disease. This finding points to the potential importance of this
genetic locus in the neurodegenerative processes of Alzheimer’s
disease. The QQ plot also deviated from the expected null distribution,
which is consistent with an enrichment of true associations, although
such a deviation can also arise from confounding such as population
stratification.
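One common way to quantify the deviation seen in a QQ plot is the genomic inflation factor. It was not computed in this analysis; the definition is given here only as a reference, with $M$ denoting the number of tested SNPs and $P_i$ the p-value of SNP $i$.

```latex
\lambda_{GC} = \frac{\operatorname{median}\{\chi^2_i : i = 1, \dots, M\}}{0.4549},
\qquad
\chi^2_i = F^{-1}_{\chi^2(1)}\left(1 - P_i\right)
```

Here $F^{-1}_{\chi^2(1)}$ is the quantile function of the chi-square distribution with 1 degree of freedom, and 0.4549 is approximately the median of that distribution. Values close to 1 suggest little systematic inflation, while larger values suggest that part of the deviation may reflect confounding rather than true associations.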