created: 10.04.2022

updated: 09.05.2022

1 Introduction

Alzheimer’s disease (AD) is a polygenic and multifactorial chronic neurodegenerative disorder that accounts for approximately 50 to 70 percent of patients with age-related dementia (Milligan Armstrong et al., 2021). The two characteristic subtypes of AD are early-onset AD (EOAD, including familial EOAD), which affects individuals <65 years of age, and late-onset AD (LOAD), which affects individuals ≥65 years of age (Nagaraj et al., 2017; Vacher et al., 2019). AD pathology can develop for up to 20 years before symptoms appear, and patients commonly present with increases in amyloid-beta plaques and tau proteins (Milligan Armstrong et al., 2021; Vacher et al., 2019). The progression of AD is associated with synaptic loss, neuritic plaque build-up, and clinically significant increases in neurofibrillary tangles that influence cognitive decline, memory loss, and behavioural changes (Nagaraj et al., 2017). Increases in ethnic diversity and the underrepresentation of non-European ethnic groups highlight a knowledge gap in current research (Shigemizu et al., 2021). As such, the unique single nucleotide polymorphism (SNP) and AD associations for individual ethnic populations remain incompletely understood (Shigemizu et al., 2021). Therefore, the aim of this report is to investigate unique SNP-to-phenotype associations for the different ethnic groups associated with AD.

2 Objective

The main objective of this research project is to explore single nucleotide polymorphisms for AD in different ethnic groups.

3 Hypothesis

The hypothesis is that the different ethnic groups will present with unique SNP-to-phenotype associations in relation to AD.

4 Methodology

4.1 Project population

4.1.1 Database

This research project used the NCBI specialised Phenotype-Genotype Integrator (PheGenI) database consisting of genome-wide association study (GWAS) data retrieved from NCBI databases including NCBI molQTL (eQTL data) and NCBI dbGaP (dbGaP studies), for human genomic-phenotype associations for AD (National Center for Biotechnology Information, 2021).

4.1.2 Inclusion criteria for analysis populations

The original study aims of the research projects listed on PheGenI were to investigate the different components of AD including genomic overview, gene associations, SNPs, population-based studies, and phenotype characteristics (National Center for Biotechnology Information, 2021). This research project used the PheGenI GWAS database consisting of 873 samples from multiple study designs including Family, control set; Case-control; Community, longitudinal, cross-sectional; Family, longitudinal, case-control; Family, sibling cohort; Case-control, longitudinal, family; Case-control, prospective, cohort; Case-control, family; and Longitudinal, case-control; for population distribution analysis (National Center for Biotechnology Information, 2021).

4.1.3 Analysis populations

The ethnic groups investigated in this research project included the PheGenI research population consisting of African, African American; African, European, African American, Indian American; East Asian, Asian, European; East Asian, European; European; European Hispanic; and Hispanic ethnicities (National Center for Biotechnology Information, 2021).

4.2 Variables of interest

4.2.1 Continuous variables

The continuous variable (Allen et al., 2019) used in this project includes,

P-Value.

4.2.2 Categorical variables

The categorical variables (Allen et al., 2019) used in this report include,

Trait, Alzheimer disease
SNP rs
Context
Gene
Gene ID
Chromosome
Population.

For population, the designated value levels and labels (Wickham, 2019) include,

0 – African|African American
1 – African|European|African American|Indian American
2 – East Asian|Asian|European
3 – East Asian|European
4 – European
5 – European|Hispanic
6 – Hispanic.

4.3 Missing data

Missing data was recorded as NA, and NAs were converted to 0, where applicable (Wickham, 2019).
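The replacement of missing values can be sketched as follows. The project itself was carried out in R; this Python fragment is only an illustration, with `None` standing in for R's `NA`, and the record fields and the second row are hypothetical.

```python
# Hypothetical records; None stands in for R's NA.
records = [
    {"snp_rs": "2075650", "gene_id": 10452},
    {"snp_rs": "0000001", "gene_id": None},  # made-up row with a missing value
]


def replace_missing(rows, fill=0):
    """Return copies of the rows with every None value replaced by `fill`."""
    return [
        {key: (fill if value is None else value) for key, value in row.items()}
        for row in rows
    ]


cleaned = replace_missing(records)
print(cleaned[1]["gene_id"])
```

The original records are left untouched, so the raw and cleaned versions can be compared during quality control.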

5 Ethics statement

This research project conforms with the 1975 Declaration of Helsinki and was approved by the Edith Cowan University Ethics Committee (# REMS NO: 2022-03371-KIRBY). The Ethics Approval email is provided in Appendix 1.

6 Data preparation

Data was sourced from the online NCBI PheGenI platform, and data preparation was conducted in RStudio using R functions in R Markdown.

6.1 Extract data from PheGenI

A search on PheGenI with Alzheimer Disease as the Trait for Phenotype Selection was performed. All resulting associated SNPs were downloaded and the datafile was saved in an Excel format.

6.2 Data preparation steps

The sequence of data preparation steps included,

Set working directory
Load dependencies, as applicable
Call-in datafiles
Explore data summary including the search summary, as illustrated in Table 1, and the study designs, as shown in Table 2
Explore total SNP rs distribution in relation to each ethnic group, as illustrated in Table 3 and Figure 1
Explore SNP rs data distribution with duplicates removed, and statistical significance (Bonferroni correction, calculated by dividing 0.05 by 873 samples, which gives approximately 0.000057 and was applied as a p-value threshold of ≤0.00005), for each ethnic group, as shown in Table 4 and Figure 2
Remove matching SNP rs of non-European ethnic groups against the European population and tabulate the unique SNP rs summary with percentage scores for each ethnic group, as illustrated in Table 5
Convert p-values to z-scores for ease of interpretation and visualisation in plot analysis
Perform a dimension reduction analysis with PCA and identify the number of principal components, as illustrated in Figure 3
Investigate unique SNP rs and chromosome associations, as shown in Table 6
Identify the most statistically significant unique SNP rs for each ethnic group, as shown in Table 7
Aggregate data for each ethnic group and export a comma-delimited file of the final list of associated unique SNPs
Investigate gene enrichment using the Gene Ontology enRIchment anaLysis and visuaLizAtion tool (http://cbl-gorilla.cs.technion.ac.il/) and the unique genes associated with the unique SNP rs of all ethnic groups, as shown in Figure 4 and Table 8
Search PheGenI with the final list of unique SNPs, for each ethnic group, as the SNP for Genotype Selection. Download the results as an Excel file and identify other phenotypes associated with AD, as shown in Table 9.
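The Bonferroni threshold named in the steps above is simple arithmetic and can be checked with a short sketch. The analysis itself was run in R; the Python below is illustrative only, and the function names are hypothetical. The exact quotient is slightly larger than the rounded cut-off of 0.00005 quoted in the report, so the rounded value is marginally stricter.

```python
ALPHA = 0.05    # family-wise significance level stated in the report
N_TESTS = 873   # number of AD association results returned by PheGenI
REPORT_CUTOFF = 0.00005  # rounded threshold used in the report


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def passes(p_value, cutoff=REPORT_CUTOFF):
    """True when a p-value meets the (rounded) corrected threshold."""
    return p_value <= cutoff


threshold = bonferroni_threshold(ALPHA, N_TESTS)
print(f"exact threshold: {threshold:.2e}; report cut-off: {REPORT_CUTOFF:.0e}")
```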

6.3 Pre-processing of omics data

The dataset was examined and manipulated into a tidy data format where each variable had its own column, each observation had its own row, and each value had its own cell.

6.4 Transformation and normalisation

Data was transformed to numeric values, where applicable
Data normalisation included converting NA values to 0.
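One transformation listed among the data preparation steps is the conversion of p-values to z-scores. The report does not state which convention was used, so the sketch below assumes a two-sided p-value and the standard normal distribution; it is written in Python for illustration, whereas the project used R, and `p_to_z` is a hypothetical helper name.

```python
from statistics import NormalDist


def p_to_z(p_value):
    """Convert a two-sided p-value to a non-negative z-score.

    Assumed convention: z is the standard normal quantile that leaves
    p/2 in the upper tail.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    return -NormalDist().inv_cdf(p_value / 2.0)


for p in (0.05, 2e-9, 5e-39):
    print(p, round(p_to_z(p), 2))
```

Smaller p-values map to larger z-scores, which makes very small values easier to compare on a plot axis.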

6.5 Quality control

Initial data explorations were conducted to diagnose data quality and to understand the data
Data was investigated following consecutive data preparation steps, to confirm that correct results were acquired or to flag discrepancies for reconciliation
The session information was retained including the version of R and the R packages used for the data analysis.

6.6 Methods to derive summary variables

Unique SNPs to each ethnic group were tabulated
A PCA plot was created to observe any latent variables associated with each ethnic group

6.7 R packages and online resources used for data preparation and analysis

Useful for coding in R Markdown with Bioconductor, as applicable,

to call in datafiles: readxl
for data manipulation: tidyverse, dplyr, scales, forcats
for data visualisation: kableExtra, ggplot2, ggrepel, mixOmics
for 7th APA style referencing: tinytex, papaja
CRAN, RPubs (https://rpubs.com), and Stack Overflow (https://stackoverflow.com) were referenced for general coding knowledge.

7 Statistical analysis

7.1 Brief plan of approach

Normally distributed data with a Bonferroni correction was used to tabulate the SNP rs for each ethnic group, as shown in Table 4, and the percentage distributions are shown in Figure 2 (Allen et al., 2019)
The matching SNP rs of non-European ethnic groups to the European population were removed from further analysis, as shown in Table 5
A PCA plot was generated to identify principal components, as shown in Figure 3 (James et al., 2013)
The unique SNP rs were matched against all chromosomes, for each ethnic group, as shown in Table 6 (Allen et al., 2019)
The most statistically significant unique SNPs were extracted to identify patterns of interest, as shown in Table 7 (Allen et al., 2019)
Gene names of the unique SNP rs were adapted for the pathway analysis, as shown in Figure 4 and Table 8 (James et al., 2013)
The unique SNP rs for AD were searched against the PheGenI GWAS catalogue, to identify other phenotypes that are significantly associated with AD, as shown in Table 9.

7.2 Dependent and independent variables

The independent variables for statistical analysis included SNP-rs, Context, Gene, Gene ID, Chromosome, Location, Population, and P-value. SNP-rs was used to identify unique SNPs (Allen et al., 2019). The Context informed the type of SNP (National Center for Biotechnology Information, 2021). The Gene, Gene ID, Chromosome, and Location identified where the SNP-rs were situated in the genome (National Center for Biotechnology Information, 2021). P-value was used to identify statistically significant unique SNP-rs, and Population was referenced to identify the different ethnic groups (Allen et al., 2019; National Center for Biotechnology Information, 2021).
The dependent variable is Trait (Alzheimer’s disease) which was evaluated in association with the independent variables (Allen et al., 2019).

7.3 Considerations for co-variates

Co-variates including age, sex, medical co-morbidities, disease severity, and age of onset are considerations for AD; however, this information was not clearly defined on PheGenI and was therefore not applied in this data analysis (Rountree et al., 2012).

8 Results and discussion

8.1 Data summary

8.1.1 Research data distribution

A search summary for AD from PheGenI, as illustrated in Table 1, informed of 873 AD trait associated results from across GWAS. 30 SNPs and 26 genes across 9 chromosomes were identified in relation to page 1 of association results, and 16 eQTL Data and 13 dbGaP Studies were also listed in relation to page 1 of association results.

Table 1

*Search summary for Alzheimer’s disease from PheGenI*
Search	Result	Search type
Association Results	873	Searched by phenotype trait.
Genes	28	Searched by gene IDs retrieved from page 1 of association results.
SNPs	31	Searched by SNP rs numbers retrieved from page 1 of association results.
eQTL Data	16	Searched by SNP rs numbers retrieved from page 1 of association results.
dbGaP Studies	14	Searched by traits retrieved from page 1 of association results.
Genome View	26 genes across 9 chromosomes

Note. PheGenI search summary for Alzheimer’s disease (AD) yielded 873 AD association results. 26 genes, 30 SNPs, 16 eQTL Data, and 13 dbGaP Studies were listed for AD, in relation to page 1 of association results. The 26 genes were identified across 9 chromosomes.

The study design summary for AD provided by dbGaP on PheGenI is illustrated in Table 2. The studies investigated AD alone, AD together with other neurological disorders such as dementia and Parkinson disease, and AD together with other systemic disorders such as Diabetes Mellitus Type 2 and heart diseases (Nagaraj et al., 2017). The study type for AD research was predominantly case-control, with some studies reported as longitudinal and family oriented, and the study cohorts ranged from 20 to 15,630 participants. Study IDs were provided for further investigation of individual studies, and the Platform: Vendor column indicated the genotyping or sequencing technology applied in each study. The cohort and Platform: Vendor information was not supplied for Study ID phs001440.v1.p1.

Table 2

*Summary of study designs for Alzheimer’s disease from PheGenI*
Disease	Study Type	Study Name	Study ID	Participants	Platform: Vendor
Alzheimer Disease; Alzheimer Disease	Family, Control Set	Genetics Consortium for Late Onset of Alzheimer’s Disease (LOAD CIDR Project)	phs000160.v1.p1	2398	Illumina: Linkage-IVb Marker Panel
Alzheimer Disease; Alzheimer Disease	Case-Control	GenADA/LONG/Imaging (Genetic Alzheimer’s Disease Associations)	phs000219.v1.p1	1718	Affymetrix: Mapping250K_Nsp; Affymetrix: Mapping250K_Sty
Alzheimer Disease; Dementia; Alzheimer Disease	Case-Control	ADGC Genome Wide Association Study	phs000372.v1.p1	6065	Illumina: Human660W-Quad_v1_A; Illumina: HumanOmniExpress
Alzheimer Disease; Cognition Disorders; Dementia; Alzheimer Disease	Community, Longitudinal, Cross-Sectional	Indianapolis-Ibadan, Nigeria Comparative Epidemiological Study of Dementia	phs000378.v1.p1	1251	Illumina: HumanOmni2.5
Alzheimer Disease; Dementia; Alzheimer Disease	Family, Longitudinal, Case-Control	NIA - Late Onset Alzheimer’s Disease and National Cell Repository for Alzheimer’s Disease Family Study: Genome-Wide Association Study for Susceptibility Loci	phs000168.v2.p2	5220	Illumina: Human610_Quadv1_B
Alzheimer Disease; Alzheimer Disease	Family, Sibling Cohort	Massachusetts General Hospital/Eisai NIMH AD Genetics Initiative Study	phs000483.v1.p1	1504	Affymetrix: Mapping250K_Nsp; Affymetrix: Mapping250K_Sty
Alzheimer Disease; Alzheimer Disease	Case-Control, Longitudinal, Family	Columbia University Study of Caribbean Hispanics and Late Onset Alzheimer’s disease	phs000496.v1.p1	3139	Illumina: HumanOmni1-Quad_v1-0_B
Alzheimer Disease; Parkinson Disease; Alzheimer Disease	Case-Control	miRNA profiles in serum and CSF of Parkinson’s and Alzheimer’s patients	phs000727.v1.p1	211	Illumina: TruSeq Small RNA Sample Prep Kit
Alzheimer Disease; Alzheimer Disease	Case-Control	RNAseq analysis of posterior cingulate astrocytes in Alzheimer’s disease	phs000745.v1.p1	20	NuGEN: Ovation RNA-Seq System V2
NA; Renal Insufficiency, Chronic; Alzheimer Disease; Aortic Aneurysm, Abdominal; Asthma; Attention Deficit Disorder with Hyperactivity; Clostridium difficile; Dermatitis, Atopic; Diabetes Mellitus, Type 2; Heart Diseases; Herpes Zoster; Hypothyroidism; Methicillin-Resistant Staphylococcus aureus; Prostatic Hyperplasia; NA	Case-Control, Prospective, Cohort	eMERGE III: Columbia GENIE (Genomic Integration with EHR)	phs000961.v1.p1	3065	Affymetrix: AFFY_6.0; Illumina: Mega Consortium - 15063755
Dementia; Alzheimer Disease; Aging; Aged; Age Factors; Dementia	Case-Control	Group Health/UW Aging and Dementia eMERGE Study	phs000234.v2.p2	3756	Illumina: Human660W-Quad_v1_A
Alzheimer Disease; Alzheimer Disease	Case-Control, Family	Alzheimer’s Disease Sequencing Project (ADSP)	phs000572.v8.p4	15630	Illumina: HiSeq 2000; Illumina: HiSeq 2000
Alzheimer Disease; Alzheimer Disease	Longitudinal, Case-Control, Population	AMP-AD/M²OVE-AD Genomic Umbrella	phs001440.v1.p1

Note. The dbGaP study design summary on PheGenI for Alzheimer’s disease listed the disease, study type, study name, study ID, participants, and the platform: vendor for each research study. The cohort and platform: vendor information was not listed for study ID phs001440.v1.p1.

8.2 Exploratory data analyses

8.2.1 Total SNP rs distribution

The 873 SNP rs associations, shown in Table 3 and Figure 1, were distributed across the ethnic groups as follows: 49.37 percent for the European population with 431 SNP rs, 15.92 percent for the African|African American population with 139 SNP rs, 2.06 percent for the African|European|African American|Indian American population with 18 SNP rs, 0.34 percent for each of the East Asian|European and Hispanic populations with 3 SNP rs each, and 0.23 percent for each of the East Asian|Asian|European and European|Hispanic populations with 2 SNP rs each. The remaining 31.50 percent (275 SNP rs) was allocated to NR, for which the origin and ethnicity were not defined in the PheGenI GWAS catalogue. As such, approximately one-third of the data was removed from further analysis, because data associated with NR was not useful for identifying unique SNP rs associations with specific ethnic groups. The research summary for each ethnic group highlights that the majority of studies were associated with the European population rather than non-European populations.

Table 3

*SNP rs counts for each ethnic group in association with Alzheimer’s disease*
Ethnic_group	Research_summary
European	431
***NR	275
African\|African American	139
African\|European\|African American\|Indian American	18
East Asian\|European	3
Hispanic	3
East Asian\|Asian\|European	2
European\|Hispanic	2
	Total = 873

Note. The total SNP rs distribution summary in association with the different ethnic groups for Alzheimer’s disease, from the PheGenI GWAS catalogue. The SNP rs were distributed as European (n=431), African|African American (n=139), African|European|African American|Indian American (n=18), East Asian|European (n=3), Hispanic (n=3), East Asian|Asian|European (n=2), European|Hispanic (n=2), and NR (n=275).

Figure 1

Bar plot representing the SNP rs distribution and percentage scores for ethnic groups of Alzheimer’s disease

Note. Bar plot for the number of SNP rs and the percentage distribution associated with each ethnic group for Alzheimer’s disease, in the PheGenI GWAS catalogue. European (n=431, 49.37%), African|African American (n=139, 15.92%), African|European|African American|Indian American (n=18, 2.06%), East Asian|European (n=3, 0.34%), Hispanic (n=3, 0.34%), East Asian|Asian|European (n=2, 0.23%), European|Hispanic (n=2, 0.23%), and NR (n=275, 31.50%).
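The percentage scores reported for Table 3 and Figure 1 follow directly from the counts. As a check, the sketch below recomputes them in Python (the project itself used R); the counts are transcribed from Table 3, and the function name is hypothetical.

```python
# SNP rs counts per ethnic group, transcribed from Table 3.
TABLE_3 = {
    "European": 431,
    "NR": 275,
    "African|African American": 139,
    "African|European|African American|Indian American": 18,
    "East Asian|European": 3,
    "Hispanic": 3,
    "East Asian|Asian|European": 2,
    "European|Hispanic": 2,
}


def percentage_distribution(counts):
    """Share of the total for each group, in percent, to two decimals."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 2) for group, n in counts.items()}


shares = percentage_distribution(TABLE_3)
for group, share in shares.items():
    print(f"{group}: n={TABLE_3[group]}, {share:.2f}%")
```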

8.2.2 SNP rs data distribution with duplicates removed

The PheGenI GWAS catalogue contains normalised and statistically significant data; however, as a precaution, a Bonferroni correction with a p-value threshold of ≤0.00005 was applied for this research project. All of the 873 SNP rs were retained after the correction; however, as illustrated in Table 4 and Figure 2, the exclusion of the NR records and the removal of duplicate SNP rs within each ethnic group resulted in a reduced total count of 556 SNP rs. In descending order, the percentage distribution and the number of SNP rs for the European population was the highest at 70.86 percent with 394 SNP rs, followed by a considerably lower count for the African|African American population at 24.10 percent with 134 SNP rs, the African|European|African American|Indian American population at 3.24 percent with 18 SNP rs, both the East Asian|European and the Hispanic populations at 0.54 percent with 3 SNP rs each, and both the East Asian|Asian|European and European|Hispanic populations at 0.36 percent with 2 SNP rs each. The SNP rs summary for each ethnic group highlights that the majority of SNP rs are associated with the European population rather than non-European ethnic groups. As such, the SNP rs of each ethnic group were compared against the European SNP rs profile for further analysis, to distinguish the unique SNPs associated with the individual ethnic groups.

Table 4

*SNP rs summary, with duplicates removed, for each ethnic group associated with Alzheimer’s disease*
Ethnic_group	SNPrs_summary
European	394
African\|African American	134
African\|European\|African American\|Indian American	18
East Asian\|European	3
Hispanic	3
East Asian\|Asian\|European	2
European\|Hispanic	2
	Total = 556

Figure 2

Pie chart representing the SNP rs, with duplicates removed, percentage distribution for each ethnic group associated with Alzheimer’s disease

Note. Pie chart for the percentage distribution of SNP rs with duplicates removed for each ethnic group in association with Alzheimer’s disease. European (n=394, 70.86%), African|African American (n=134, 24.10%), African|European|African American|Indian American (n=18, 3.24%), East Asian|European (n=3, 0.54%), Hispanic (n=3, 0.54%), East Asian|Asian|European (n=2, 0.36%), and European|Hispanic (n=2, 0.36%).
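Removing duplicate SNP rs within each ethnic group amounts to collecting the distinct identifiers per group. The sketch below shows the idea in Python (the project used R); the rows and rs identifiers are made up for illustration and are not taken from the PheGenI data.

```python
from collections import defaultdict

# Hypothetical (population, SNP rs) association rows; identifiers are made up.
rows = [
    ("European", "rs1"), ("European", "rs1"), ("European", "rs2"),
    ("Hispanic", "rs1"), ("Hispanic", "rs3"), ("Hispanic", "rs3"),
]


def unique_snps_per_group(association_rows):
    """Collect the distinct SNP rs identifiers reported for each population."""
    groups = defaultdict(set)
    for population, snp in association_rows:
        groups[population].add(snp)
    return dict(groups)


unique = unique_snps_per_group(rows)
dedup_counts = {group: len(snps) for group, snps in unique.items()}
print(dedup_counts)
```

In this toy input, `rs1` appears in both groups and is kept in each, because at this stage duplicates are only removed within a group, not across groups.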

8.3 Identification of unique SNP rs

8.3.1 Unique SNP rs summary

One matching SNP rs was removed from each of the African|African American, East Asian|Asian|European, and European|Hispanic ethnic groups when compared against the European population. As illustrated in Table 5, a total of 553 unique SNP rs were retained and distributed between the ethnic groups as 394 SNP rs at 71.25 percent for the European population, followed by a considerably lower count of 133 SNP rs at 24.05 percent for the African|African American population, 18 SNP rs at 3.25 percent for the African|European|African American|Indian American population, 3 SNP rs at 0.54 percent for each of the East Asian|European and Hispanic populations, and 1 SNP rs at 0.18 percent for each of the East Asian|Asian|European and European|Hispanic populations. As such, most of the SNP rs remained unique to their respective groups; however, with the removal of matching SNP rs, the SNP rs count was reduced to one for the East Asian|Asian|European and the European|Hispanic populations.

Table 5

*Unique SNP rs summary for each ethnic group associated with Alzheimer’s disease*
Ethnic_group	Unique_SNPrs_summary	Percentage
European	394	71.25%
African\|African American	133	24.05%
African\|European\|African American\|Indian American	18	3.25%
East Asian\|European	3	0.54%
Hispanic	3	0.54%
East Asian\|Asian\|European	1	0.18%
European\|Hispanic	1	0.18%
	Total = 553

Note. The unique SNP rs and percentage distribution associated with each ethnic group, for Alzheimer’s disease. European (n=394, 71.25%), African|African American (n=133, 24.05%), African|European|African American|Indian American (n=18, 3.25%), East Asian|European (n=3, 0.54%), Hispanic (n=3, 0.54%), East Asian|Asian|European (n=1, 0.18%), and European|Hispanic (n=1, 0.18%).
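Removing the SNP rs that match the European population is a set difference against the European identifiers. The sketch below illustrates this in Python (the project used R) with made-up rs identifiers, and then recomputes the Table 5 percentages from the reported counts; the function name is hypothetical.

```python
def remove_reference_matches(groups, reference="European"):
    """Drop from every non-reference group the SNP rs also seen in the reference."""
    ref_snps = groups[reference]
    return {
        name: set(snps) if name == reference else set(snps) - ref_snps
        for name, snps in groups.items()
    }


# Hypothetical identifiers, for illustration only.
toy = {"European": {"rs1", "rs2"}, "Hispanic": {"rs1", "rs3"}}
filtered = remove_reference_matches(toy)
print(filtered)

# Unique SNP rs counts transcribed from Table 5.
TABLE_5 = {
    "European": 394,
    "African|African American": 133,
    "African|European|African American|Indian American": 18,
    "East Asian|European": 3,
    "Hispanic": 3,
    "East Asian|Asian|European": 1,
    "European|Hispanic": 1,
}
total_unique = sum(TABLE_5.values())
table_5_shares = {g: round(100 * n / total_unique, 2) for g, n in TABLE_5.items()}
print(total_unique, table_5_shares)
```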

8.3.2 Principal component analysis

A principal component analysis was performed for Ethnic groups against SNP rs distribution and P-Values, and 100 percent of the variance was explained by one principal component, as illustrated in Figure 3. Only statistically significant data was used for analysis thus, the variance could be explained by one principal component [allen2018spss].

Figure 3

Principal component analysis for ethnic groups against SNP rs and P-Values, for Alzheimer’s disease analysis

Note. Principal component analysis for ethnic groups, SNP rs, and P-value in association with Alzheimer’s disease. 100% of the variance was explained by one principal component.

8.3.3 Unique SNP rs and chromosome summary

SNP rs were identified on the X sex chromosome (n=1) and all 22 autosomal chromosomes (n=552), across the different ethnic groups associated with AD, as illustrated in Table 6. SNP rs associations for chromosome 21 and the X chromosome were identified only with the European population. The highest number of SNP rs (n=49) was identified for chromosome 2, and chromosome 19 was associated with the highest number of ethnic groups (n=5), including European, African|African American, African|European|African American|Indian American, Hispanic, and East Asian|Asian|European. Ethnic groups with a higher number of SNP rs were associated with a greater number of chromosomes. SNP rs were associated with 496 different genes.

Table 6

*Summary of SNP rs counts across all chromosomes for each ethnic group associated with Alzheimer’s disease*

	European	African\|African American	African\|European\|African American\|Indian American	East Asian\|European	Hispanic	East Asian\|Asian\|European	European\|Hispanic
Total number of genes = 496
Overall SNP count (n)	394	133	18	3	3	1	1
Chromosome (n)
1	19	12
2	33	13	3
3	21	7	1		1
4	20	6	2
5	31	9	2		1
6	30	7
7	25	13		1
8	25	8
9	17	8	2
10	15	2	3				1
11	30	6	1
12	20	10		1
13	16	4
14	11	2
15	11	4
16	13	4
17	13	5	1
18	5	3		1
19	20	6	3		1	1
20	7	2
21	5
22	6	2
X	1
Y

Note. Summary of genes, and the SNP rs counts across all chromosomes for the different ethnic groups associated with Alzheimer’s disease. The highest number of SNP rs (n=49) was identified for chromosome 2, and chromosome 19 had the most associations with the different ethnic groups (n=5). SNP rs were associated with 496 different genes.
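The chromosome-level statements can be checked by tallying Table 6 directly. The sketch below is written in Python for illustration (the project used R); the counts are transcribed from Table 6 with blank cells entered as 0, and the variable names are hypothetical.

```python
GROUPS = (
    "European",
    "African|African American",
    "African|European|African American|Indian American",
    "East Asian|European",
    "Hispanic",
    "East Asian|Asian|European",
    "European|Hispanic",
)

# SNP rs counts per chromosome, in GROUPS order (blank cells entered as 0).
TABLE_6 = {
    "1": (19, 12, 0, 0, 0, 0, 0),
    "2": (33, 13, 3, 0, 0, 0, 0),
    "3": (21, 7, 1, 0, 1, 0, 0),
    "4": (20, 6, 2, 0, 0, 0, 0),
    "5": (31, 9, 2, 0, 1, 0, 0),
    "6": (30, 7, 0, 0, 0, 0, 0),
    "7": (25, 13, 0, 1, 0, 0, 0),
    "8": (25, 8, 0, 0, 0, 0, 0),
    "9": (17, 8, 2, 0, 0, 0, 0),
    "10": (15, 2, 3, 0, 0, 0, 1),
    "11": (30, 6, 1, 0, 0, 0, 0),
    "12": (20, 10, 0, 1, 0, 0, 0),
    "13": (16, 4, 0, 0, 0, 0, 0),
    "14": (11, 2, 0, 0, 0, 0, 0),
    "15": (11, 4, 0, 0, 0, 0, 0),
    "16": (13, 4, 0, 0, 0, 0, 0),
    "17": (13, 5, 1, 0, 0, 0, 0),
    "18": (5, 3, 0, 1, 0, 0, 0),
    "19": (20, 6, 3, 0, 1, 1, 0),
    "20": (7, 2, 0, 0, 0, 0, 0),
    "21": (5, 0, 0, 0, 0, 0, 0),
    "22": (6, 2, 0, 0, 0, 0, 0),
    "X": (1, 0, 0, 0, 0, 0, 0),
}

# Total SNP rs per chromosome and number of ethnic groups represented on it.
snp_totals = {chrom: sum(row) for chrom, row in TABLE_6.items()}
group_counts = {chrom: sum(1 for n in row if n > 0) for chrom, row in TABLE_6.items()}

# Column sums give the overall SNP rs count per ethnic group.
group_totals = tuple(sum(col) for col in zip(*TABLE_6.values()))

busiest = max(snp_totals, key=snp_totals.get)
most_shared = max(group_counts, key=group_counts.get)
print(busiest, snp_totals[busiest], most_shared, group_counts[most_shared])
print(dict(zip(GROUPS, group_totals)))
```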

8.3.4 Most statistically significant and unique SNP rs for each ethnic group

The most statistically significant and unique SNP rs for each ethnic group associated with AD are illustrated in Table 7. For five ethnic groups, namely European (SNP rs 2075650), African|African American (SNP rs 115550680), African|European|African American|Indian American (SNP rs 157582), Hispanic (SNP rs 394819), and East Asian|Asian|European (SNP rs 519113), the most statistically significant and unique SNP rs was observed on chromosome 19. The gene and chromosomal location were unique for the most statistically significant and unique SNP rs of the African|African American (ABCA7, 1050421) and the East Asian|Asian|European (PVRL2, 44873027) populations, whereas those of the European, African|European|African American|Indian American, and Hispanic populations were all located in the TOMM40 gene. Within TOMM40, unique chromosomal locations were identified for the most statistically significant SNP rs of the European (44892362), African|European|African American|Indian American (44892962), and Hispanic (44901322) ethnic groups. Independent gene and chromosomal locations were observed for the most statistically significant and unique SNP rs of both the East Asian|European population (CNTNAP2, chromosome 7, 146265094) and the European|Hispanic population (CELF2, chromosome 10, 10958376). As expected, the most statistically significant SNP rs was identified for the European population at P-Value=2 × 10^-157, whereas the least statistically significant was identified for the East Asian|European population at P-Value=1 × 10^-6, followed by the European|Hispanic population at P-Value=2 × 10^-7. Surprisingly, the most significant unique SNP rs for the East Asian|Asian|European population had a smaller, and therefore more significant, P-Value (5 × 10^-39) than the most significant unique SNP rs for the African|African American population (P-Value=2 × 10^-9).
All of the most statistically significant and unique SNPs were reported as intronic in context.

Table 7

*Summary of the most statistically significant and unique SNP rs for each ethnic group associated with Alzheimer’s disease*
SNP rs	Context	Gene	Gene ID	Chromosome	Location	P-Value	Population
2075650	intron	TOMM40	10452	19	44892362	2e-157	European
115550680	intron	ABCA7	10347	19	1050421	2e-9	African\|African American
157582	intron	TOMM40	10452	19	44892962	9e-52	African\|European\|African American\|Indian American
802571	intron	CNTNAP2	26047	7	146265094	1e-6	East Asian\|European
394819	intron	TOMM40	10452	19	44901322	8e-11	Hispanic
519113	intron	PVRL2	5819	19	44873027	5e-39	East Asian\|Asian\|European
62209	intron	CELF2	10659	10	10958376	2e-7	European\|Hispanic

Note. Summary of the most statistically significant and unique SNP rs for each ethnic group including European, African|African American, African|European|African American|Indian American, East Asian|European, Hispanic, East Asian|Asian|European, and European|Hispanic, in association with Alzheimer’s disease. The SNP rs were intronic in context, across chromosomes 19, 10, and 7. The different genes included TOMM40, ABCA7, CNTNAP2, PVRL2, and CELF2. The P-Values ranged between 2 × 10^-157 (European) and 1 × 10^-6 (East Asian|European).
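The ordering of the groups by significance can be read off Table 7 by sorting on the P-Value. The sketch below does this in Python for illustration (the project used R); the rows are transcribed from Table 7, with P-Values rounded to one significant figure, which is an assumption about how the stored floating-point values should be read.

```python
# (SNP rs, gene, chromosome, P-Value, population), transcribed from Table 7.
TABLE_7 = [
    ("2075650", "TOMM40", "19", 2e-157, "European"),
    ("115550680", "ABCA7", "19", 2e-9, "African|African American"),
    ("157582", "TOMM40", "19", 9e-52,
     "African|European|African American|Indian American"),
    ("802571", "CNTNAP2", "7", 1e-6, "East Asian|European"),
    ("394819", "TOMM40", "19", 8e-11, "Hispanic"),
    ("519113", "PVRL2", "19", 5e-39, "East Asian|Asian|European"),
    ("62209", "CELF2", "10", 2e-7, "European|Hispanic"),
]

# Smaller P-Values are more significant, so an ascending sort ranks the rows.
ranked = sorted(TABLE_7, key=lambda row: row[3])
for snp, gene, chrom, p_value, population in ranked:
    print(f"{population}: rs{snp} ({gene}, chr{chrom}), P={p_value:.0e}")

# Populations whose top SNP rs lies on chromosome 19.
on_chr19 = [row[4] for row in TABLE_7 if row[2] == "19"]
print(len(on_chr19))
```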

8.3.5 Pathway analysis of the unique SNP rs in association with genes

The Gene Ontology enRIchment anaLysis tool was used for the enrichment pathway analysis, as illustrated in Figure 4, and the top ten enriched GO terms are listed in Table 8. The top four GO terms, which had the lowest P-Values and therefore the most significant enrichment, were negative regulation of amyloid precursor protein catabolic process (GO:1902992), regulation of amyloid precursor protein catabolic process (GO:1902991), regulation of amyloid-beta formation (GO:1902003), and regulation of endocytosis (GO:0030100). The gene ABCA7, which also contains the most statistically significant unique SNP rs for the African|African American population, was present in all of the listed GO terms, and PVRL2, which also contains the most statistically significant unique SNP rs for the East Asian|Asian|European population, was enriched in membrane organization (GO:0061024).

Figure 4

Enrichment pathway analysis with genes associated with the unique SNP rs of ethnic groups associated with Alzheimer’s disease

Note. Enrichment pathway analysis with genes associated with the unique SNP rs of ethnic groups associated with Alzheimer’s disease. The darker gradient represents more enriched pathways. The top ten pathways include negative regulation of amyloid precursor protein catabolic process, regulation of amyloid precursor protein catabolic process, regulation of amyloid-beta formation, regulation of endocytosis, negative regulation of amyloid-beta formation, regulation of cellular amide metabolic process, negative regulation of cellular amide metabolic process, regulation of vesicle-mediated transport, membrane organization, and negative regulation of protein metabolic process.

Table 8

*GO terms enriched in the enrichment pathway analysis with genes associated with the unique SNP rs of ethnic groups associated with Alzheimer’s disease*
GO Term	Description	P-value	Genes
GO:1902992	negative regulation of amyloid precursor protein catabolic process	2.13e-09	SORL1, PICALM, APOE, CLU, BIN1, ABCA7
GO:1902991	regulation of amyloid precursor protein catabolic process	2.13e-09	SORL1, APOE, CLU, BIN1, ABCA7
GO:1902003	regulation of amyloid-beta formation	2.13e-09	SORL1, APOE, CLU, BIN1, ABCA7
GO:0030100	regulation of endocytosis	5.58e-09	APOE, PICALM, CLU, BIN1, CD2AP, APOC1, TREM2, ABCA7
GO:1902430	negative regulation of amyloid-beta formation	1.07e-07	SORL1, APOE, CLU, SPON1, BIN1, ABCA7
GO:0034248	regulation of cellular amide metabolic process	2.67e-07	SORL1, PTK2B, APOE, PICALM, CLU, BIN1, ABCA7
GO:0034249	negative regulation of cellular amide metabolic process	5.47e-07	SORL1, APOE, CLU, BIN1, ABCA7
GO:0060627	regulation of vesicle-mediated transport	1.37e-06	SORL1, APOE, PICALM, CLU, BIN1, APOC1, ABCA7
GO:0061024	membrane organization	1.67e-06	CR1, APOE, PICALM, CLU, BIN1, TREM2, PVRL2, ABCA7
GO:0051248	negative regulation of protein metabolic process	2.06e-06	SORL1, CR1, APOE, PICALM, CLU, BIN1, ABCA7

Note. GO terms for enrichment pathway analysis with genes associated with the unique SNP rs of ethnic groups associated with Alzheimer’s disease, performed using the Gene Ontology enRIchment anaLysis tool. The top ten GO terms, ordered from the most to the least significant P-Value, are associated with pathways including negative regulation of amyloid precursor protein catabolic process (GO:1902992), regulation of amyloid precursor protein catabolic process (GO:1902991), regulation of amyloid-beta formation (GO:1902003), regulation of endocytosis (GO:0030100), negative regulation of amyloid-beta formation (GO:1902430), regulation of cellular amide metabolic process (GO:0034248), negative regulation of cellular amide metabolic process (GO:0034249), regulation of vesicle-mediated transport (GO:0060627), membrane organization (GO:0061024), and negative regulation of protein metabolic process (GO:0051248). The top four GO terms were the most significantly enriched in the enrichment pathway analysis.
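How often each gene appears across the ten GO terms can be tallied from the gene lists in Table 8. The sketch below is a Python illustration only (the project used R and the GOrilla web tool); it does not reproduce the enrichment test itself, and the gene lists are transcribed from Table 8.

```python
from collections import Counter

# GO term -> genes, transcribed from Table 8.
TABLE_8 = {
    "GO:1902992": ["SORL1", "PICALM", "APOE", "CLU", "BIN1", "ABCA7"],
    "GO:1902991": ["SORL1", "APOE", "CLU", "BIN1", "ABCA7"],
    "GO:1902003": ["SORL1", "APOE", "CLU", "BIN1", "ABCA7"],
    "GO:0030100": ["APOE", "PICALM", "CLU", "BIN1", "CD2AP", "APOC1", "TREM2", "ABCA7"],
    "GO:1902430": ["SORL1", "APOE", "CLU", "SPON1", "BIN1", "ABCA7"],
    "GO:0034248": ["SORL1", "PTK2B", "APOE", "PICALM", "CLU", "BIN1", "ABCA7"],
    "GO:0034249": ["SORL1", "APOE", "CLU", "BIN1", "ABCA7"],
    "GO:0060627": ["SORL1", "APOE", "PICALM", "CLU", "BIN1", "APOC1", "ABCA7"],
    "GO:0061024": ["CR1", "APOE", "PICALM", "CLU", "BIN1", "TREM2", "PVRL2", "ABCA7"],
    "GO:0051248": ["SORL1", "CR1", "APOE", "PICALM", "CLU", "BIN1", "ABCA7"],
}

# Number of listed GO terms in which each gene appears.
term_counts = Counter(gene for genes in TABLE_8.values() for gene in genes)

# Genes present in every listed term, and the terms that contain PVRL2.
in_all_terms = sorted(g for g, n in term_counts.items() if n == len(TABLE_8))
pvrl2_terms = [term for term, genes in TABLE_8.items() if "PVRL2" in genes]

print(in_all_terms)
print(pvrl2_terms)
```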

8.3.6 Unique SNP rs of Alzheimer’s disease in association with other phenotypes from the PheGenI GWAS catalogue

The unique SNP rs associated with the different ethnic groups in association with AD were further explored with the PheGenI GWAS catalogue, as illustrated in Table 9. A total of 65 phenotypes, including Alzheimer Disease itself, were identified with varying numbers of associated SNP rs, including Age of Onset (n=221), Depression (n=19), LDL Cholesterol (n=15), HDL Cholesterol (n=12), Cholesterol (n=7), Lewy Body Disease (n=3), Hematocrit (n=2), Neoplasms (n=1), and Wet Macular Degeneration (n=1). These phenotype associations of the unique SNP rs illustrate the multifactorial nature of AD.

Table 9

*Significant other phenotypes in association with the unique SNP rs for Alzheimer’s disease*
Phenotype	SNPrs_count	Phenotype_continued	SNPrs_count_continued
Alzheimer Disease	618	Glucose	2
Age of Onset	221	Neurofibrillary Tangles	2
Depression	19	1-Alkyl-2-acetylglycerophosphocholine Esterase	2
Cholesterol, LDL	15	Electrocardiography	2
Brain	15	Corneal Topography	1
Amyloid beta-Peptides	14	Coronary Disease	1
tau Proteins	13	Psychomotor Performance	1
Cholesterol, HDL	12	Intelligence Tests	1
Triglycerides	12	Cocaine-Related Disorders	1
C-Reactive Protein	12	Body Height	1
amyloid beta-protein (1-42)	10	Frontotemporal Dementia	1
Cholesterol	7	Lipids	1
Macular Degeneration	6	Smell	1
Cerebrospinal Fluid	6	Arteries	1
Longevity	5	Hematologic Tests	1
Body Mass Index	5	Diabetes Mellitus	1
Neuropsychological Tests	4	Atrial Fibrillation	1
Memory	4	Diabetes Mellitus, Type 2	1
Body Fat Distribution	4	Paliperidone Palmitate	1
Cerebral Cortex	4	Metabolic Syndrome X	1
Dementia	3	Protein S	1
Blood Pressure	3	Migraine Disorders	1
Neuroimaging	3	Schizophrenia	1
Cholinesterase Inhibitors	3	Mortality	1
Lewy Body Disease	3	Coronary Artery Disease	1
Cognition Disorders	3	Myocardial Infarction	1
Plaque, Amyloid	2	Cerebral Amyloid Angiopathy	1
Heart Failure	2	Neoplasms	1
Waist-Hip Ratio	2	Neurodegenerative Diseases	1
Waist Circumference	2	Wet Macular Degeneration	1
Hematocrit	2	Entorhinal Cortex	1
Amyloidosis, Cerebral, with Spongiform Encephalopathy	2	Erythrocyte Indices	1
Stroke	2

Note A total of 65 phenotypes in association with the unique SNP rs from the different ethnic groups associated with Alzheimer’s disease were identified, from the PheGenI GWAS catalogue.
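A tally such as Table 9 can, in principle, be produced by counting how many SNP rs records map to each phenotype. The following is a minimal sketch in R, assuming a hypothetical data frame `toy_hits` with `SNP_rs` and `Trait` columns. The rows are illustrative only and are not project data.

```r
# Hypothetical toy records: one row per SNP rs-to-phenotype association
toy_hits <- data.frame(
  SNP_rs = c(1, 2, 3, 3, 4, 5),
  Trait  = c("Alzheimer Disease", "Alzheimer Disease", "Age of Onset",
             "Depression", "Age of Onset", "Alzheimer Disease"),
  stringsAsFactors = FALSE
)

# Count the records per phenotype
pheno_counts <- as.data.frame(table(toy_hits$Trait), stringsAsFactors = FALSE)
names(pheno_counts) <- c("Phenotype", "SNPrs_count")

# Sort from the most to the least frequent phenotype
pheno_counts <- pheno_counts[order(pheno_counts$SNPrs_count, decreasing = TRUE), ]
print(pheno_counts)
```

The column names `Phenotype` and `SNPrs_count` mirror the Table 9 headers. The real PheGenI export may use different field names.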

9 Conclusion

9.1 Main objective and hypothesis

The main objective of this research project was to explore single nucleotide polymorphisms for Alzheimer’s disease in different ethnic groups. The hypothesis that the different ethnic groups will present with unique SNPs-to-phenotype associations in relation to AD was supported; however, the European population was substantially overrepresented in comparison to the non-European populations, which were underrepresented.

9.2 Summary of project findings

The data from the PheGenI GWAS catalogue was combined with eQTL data and dbGaP studies, as shown in Table 1. The dbGaP study summary, shown in Table 2, included research investigating AD alone, AD together with other neurological disorders including dementia and Parkinson disease, and AD together with other systemic disorders including Diabetes Mellitus Type 2 and heart diseases (Nagaraj et al., 2017). The full set of 873 SNP rs, shown in Table 3, together with the percentage distribution of the SNP rs per population group, illustrated in Figure 1, highlighted that the majority of SNP rs were associated with the European population rather than the non-European populations. In addition, 31.50 percent of the SNP rs (NR) were not associated with any specific ethnic group and were therefore removed from further analysis, since the NR data was unsuitable for identifying unique SNP rs for individual ethnic groups associated with AD, as illustrated in Figure 1. The normally distributed and statistically significant data was filtered to remove duplicate SNP rs within each ethnic group, which resulted in a total of 556 SNP rs, as shown in Table 4. The percentage distribution of the SNP rs for each ethnic group, illustrated in Figure 2, further highlighted that the majority of the SNP rs were associated with the European population. The SNP rs of each ethnic group were then compared against the European SNP rs profile to distinguish the unique SNP rs associated with the individual ethnic groups and, as shown in Table 5, most of the SNP rs remained unique to their respective ethnic groups. However, the removal of matching SNP rs left only one unique SNP rs each for the East Asian|Asian|European and the European|Hispanic ethnic groups, as shown in Table 5. A principal component analysis, illustrated in Figure 3, was also performed to observe the variance. Since only statistically significant data was used for analysis, the variance could be explained by one principal component (Allen et al., 2019).
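The two filtering steps summarised above (removing duplicate SNP rs within each ethnic group, then removing SNP rs shared with the European profile) can be sketched on toy data as follows. The data frame `toy_snps`, its column names, and its values are hypothetical and serve only to illustrate the logic.

```r
# Hypothetical toy data: SNP rs identifiers recorded per ethnic group
toy_snps <- data.frame(
  SNP_rs     = c(101, 101, 102, 103, 102, 104, 104),
  Population = c("European", "European", "European", "European",
                 "Hispanic", "Hispanic", "Hispanic"),
  stringsAsFactors = FALSE
)

# Step 1: remove duplicate SNP rs within each ethnic group
toy_dedup <- toy_snps[!duplicated(toy_snps[c("Population", "SNP_rs")]), ]

# Step 2: keep only the non-European SNP rs that are absent from the European profile
european        <- toy_dedup[toy_dedup$Population == "European", ]
hispanic        <- toy_dedup[toy_dedup$Population == "Hispanic", ]
hispanic_unique <- hispanic[!(hispanic$SNP_rs %in% european$SNP_rs), ]

print(hispanic_unique)
```

In this toy case the Hispanic group keeps only the SNP rs that the European group does not carry, which is the sense in which "unique" is used in Table 5.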

When observing the chromosomal associations of the SNP rs, ethnic groups with a higher number of SNP rs tended to show associations across a greater number of chromosomes, a relationship that can be described as linear, as shown in Table 6. Furthermore, the most statistically significant SNP rs for individual ethnic groups were sought to identify patterns of interest, as shown in Table 7. Considering the larger portion of SNP rs associated with the European population (n=394, 71.25%), the most statistically significant SNP rs was expected for the European population and was observed with a P-Value of approximately 2 × 10^-157, as shown in Table 7. Surprisingly, for the East Asian|Asian|European population, which holds a far smaller portion of the unique SNP rs (n=1, 0.18%), the most statistically significant SNP rs had a smaller, and therefore more significant, P-Value (approximately 5 × 10^-39) than the SNP rs for the African|African American population (n=133, 24.05%; P-Value of approximately 2 × 10^-9), as shown in Table 7. The most statistically significant and unique SNP rs (802571) for the East Asian|Asian population was independently located on chromosome 7 (location 146265094, gene CNTNAP2), as shown in Table 7. Similarly, the most statistically significant and unique SNP rs (62209) for the European|Hispanic population was independently located on chromosome 10 (location 10958376, gene CELF2), as shown in Table 7. Although located on the same chromosome (chromosome 19), the genes and chromosomal locations differed for the most statistically significant and unique SNP rs of the African|African American (SNP rs 115550680, chromosome 19, location 1050421, gene ABCA7) and the East Asian|Asian|European (SNP rs 519113, chromosome 19, location 44873027, gene PVRL2) ethnic groups, as shown in Table 7. Likewise, although located on the same chromosome (chromosome 19) and in the same gene (TOMM40), the chromosomal locations differed for the European (SNP rs 2075650, chromosome 19, location 44892362, gene TOMM40), the African|European|African American|Indian American (SNP rs 157582, chromosome 19, location 44892962, gene TOMM40), and the Hispanic (SNP rs 394819, chromosome 19, location 44901322, gene TOMM40) ethnic groups, as shown in Table 7.
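Because a smaller P-Value indicates a stronger association, very small values are often easier to compare on a negative log10 scale, where a larger number means a more significant result. The sketch below applies that conventional transformation to the three P-Values quoted above, rounded to one significant figure. It is an illustration only and was not part of the project's analysis.

```r
# Most significant P-Values quoted in the text, rounded to one significant figure
p_values <- c(
  "European"                  = 2e-157,
  "East Asian|Asian|European" = 5e-39,
  "African|African American"  = 2e-9
)

# Negative log10 scale: larger values correspond to smaller (more significant) P-Values
neg_log10_p <- -log10(p_values)

# Rank the ethnic groups from the most to the least significant SNP rs
ranked <- sort(neg_log10_p, decreasing = TRUE)
print(round(ranked, 2))
```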

The enrichment pathway analysis, illustrated in Figure 4, was constructed with the unique genes (n=496) associated with the unique SNP rs of the ethnic groups associated with AD, and the top ten GO terms are shown in Table 8. The gene ABCA7, which also contains the most statistically significant and unique SNP rs for the African|African American population, was present in all of the listed GO terms, including the four most significant GO terms, which in order of significance were negative regulation of amyloid precursor protein catabolic process (GO:1902992), regulation of amyloid precursor protein catabolic process (GO:1902991), regulation of amyloid-beta formation (GO:1902003), and regulation of endocytosis (GO:0030100), as shown in Figure 4 and Table 8. The gene PVRL2, which also contains the most statistically significant and unique SNP rs for the East Asian|Asian|European population, was enriched in membrane organization (GO:0061024), as shown in Figure 4 and Table 8. Since AD is polygenic and multifactorial, the unique SNP rs from all of the ethnic groups associated with AD were searched against the PheGenI GWAS catalogue to identify other significantly associated phenotypes, as shown in Table 9. These associations could provide further insights, through future projects, into patterns of interest including enrichment pathways, the progression of AD, and unique multifactorial SNP rs representing the individual ethnic groups associated with AD.
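To give an intuition for what an enrichment P-Value measures, the sketch below computes a one-sided hypergeometric test for a single GO term. The enrichment tool's own statistical model is not described in this report, so the hypergeometric test is used here only as a common stand-in. The study list size (496) comes from the text, while the background size, the annotation count, and the overlap are hypothetical.

```r
# Counts for one GO term; all values except the study list size are hypothetical
background_genes <- 20000  # assumed number of genes in the background set
annotated_genes  <- 50     # assumed background genes annotated to the GO term
study_genes      <- 496    # unique genes in the study list (from the text)
overlap          <- 7      # assumed study genes annotated to the GO term

# P(X >= overlap) when drawing study_genes genes at random from the background
p_enrich <- phyper(overlap - 1,
                   annotated_genes,
                   background_genes - annotated_genes,
                   study_genes,
                   lower.tail = FALSE)
print(p_enrich)
```

A small value means that an overlap of this size would be unlikely if the study genes were drawn at random from the background.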

9.3 Strengths and limitations

The PheGenI GWAS catalogue is freely available and easily accessible for the exploration and analysis of AD, other phenotypes, genes, and SNPs. Access to the individual research projects associated with a search is also available via PheGenI. Removal of NR-related data resulted in the loss of 275 SNP rs (31.50% of SNP rs) from the analysis, which potentially restricted the identification of unique SNP rs, particularly because the removed SNP rs could have assisted with identifying unique SNP rs for the underrepresented non-European ethnic groups associated with AD.
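The percentages quoted in this report follow directly from the SNP rs counts. As a quick check, the sketch below recomputes them from the totals given in the text and the appendix. The helper `pct()` is a hypothetical convenience function introduced only for this illustration.

```r
total_snps   <- 873  # all SNP rs retrieved from PheGenI
nr_snps      <- 275  # SNP rs with no reported ethnic group (NR)
unique_total <- 553  # unique SNP rs across all ethnic groups

# Hypothetical helper: percentage of a whole, rounded to two decimal places
pct <- function(part, whole) round(100 * part / whole, 2)

nr_share       <- pct(nr_snps, total_snps)  # NR share of all SNP rs
european_share <- pct(394, unique_total)    # European share of the unique SNP rs
african_share  <- pct(133, unique_total)    # African|African American share
single_share   <- pct(1, unique_total)      # share of a group with one unique SNP rs

print(c(nr_share, european_share, african_share, single_share))
```

These evaluate to the 31.50%, 71.25%, 24.05%, and 0.18% figures quoted in the text.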

9.4 Recommendation for future projects

Future research on AD in the underrepresented non-European ethnic groups could expand the pool of unique SNP rs for the respective populations associated with AD.

10 Acknowledgements

I would like to thank Dr Alyce Russell and Dr Tenielle Porter for their guidance throughout the unit of Clinical Bioinformatics. Their support has been fundamental to the construction of this industry-style research project.

11 Appendix

##Data preparation steps


#set working directory
setwd("C:/Users/ladki/Desktop/AD_Project")


#_extract relevant datasets from
#https://www.ncbi.nlm.nih.gov/gap/phegeni


#_load dependencies
library(readxl) #to call in excel datafile
library(kableExtra) #to create tables
library(tidyverse) #for data manipulation
library(dplyr) #for data manipulation
#install.packages("devtools") #for referencing with APA 7th ed.
#devtools::install_github("crsh/papaja") #for referencing with APA
#remotes::install_github("crsh/papaja@devel") #for referencing with APA 7th ed.
library(tinytex) #for referencing with APA 7th ed and papaja.
library(papaja) #for referencing with APA 7th ed.
r_refs("r-adRef.bib") #reference .bib r text file
#install.packages("ggplot2")
library(ggplot2) #for data visualisation
library(ggrepel) #for data visualisation
library(forcats) #for data manipulation
library(scales) #for data manipulation
#if (!require("BiocManager", quietly = TRUE))
    #install.packages("BiocManager")
#BiocManager::install("mixOmics")
library(mixOmics) #for pca


#_call-in data files
ad<-read_excel("./data/AD.xlsx", sheet=1) #AD data
searchSummary<-read_excel("./data/AD.xlsx", sheet=2) #searchSummary data
studyDesigns<-read_excel("./data/AD.xlsx", sheet=3) #studyDesign data


#_research data summary

#research data distribution - search summary, table 1
s<-"" #create variable for blank
searchSummary[is.na(searchSummary)]<-s#rename NA
#create searchSummary table
searchSummary %>%
  kbl(caption = "_Search summary for Alzheimer's disease from PheGenI_") %>% #Table title
  kable_classic(full_width=F, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=T, color="white", background = "gray") #output heading and body layout


#research data distribution - study designs, table 2
studyDesigns<-as.data.frame(studyDesigns)
studyDesigns[is.na(studyDesigns)]<-s#rename NA
#remove number column in studyDesigns
studyDesigns<-studyDesigns[, 2:ncol(studyDesigns)]
#create studyDesigns table
studyDesigns %>%
  kbl(caption = "_Summary of study designs for Alzheimer's disease from PheGenI_") %>% #Table title
  kable_classic(full_width=F, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=T, color="white", background = "gray") #output heading and body layout


#_exploratory data analysis

#total snprs distribution, table 3
#data distribution for the number of snprs for each ethnic group
p<-table(ad$Population) #create ethnicity summary
rownames (p)[rownames(p)=="NR"]= ("***NR") #change row name label 
p<-data.frame(p) #dataframe ethnicity summary
p<-rename(p, Ethnic_group=Var1) #rename Var1 as Ethnic_group
p<-rename(p, Research_summary=Freq) #rename Freq to Research_summary
p<-p[order(p$Research_summary, decreasing=TRUE),] #change Research_summary count to descending order
studyPbar_set<-p #duplicate p for bar plot
p[9, ] <- c("" , "Total = 873", "") #create row for total
p <- sapply(p, as.character) #convert to as.character
p[is.na(p)] <-""#replace NA with blank
p<-data.frame(p) #convert to dataframe
rownames(p)<- c() #remove row names
#create table
p %>%
  kbl(caption = "_SNP rs counts for each ethnic group in association with Alzheimer's disease_") %>% #Table title
  kable_classic(full_width=T, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=T, color="white", background = "gray") #%>% #output heading and body layout
  #row_spec(2, strikeout = T) #strikeout row 2


#total snprs distribution, figure 1
#create research summary bar plot
studyPbar_set %>% #create pipeline to allocate percentages
  mutate(Percent = percent(Research_summary / sum(Research_summary))) -> studyPbar_set
studyPbar<-ggplot(studyPbar_set, aes(x=reorder(Ethnic_group, -Research_summary), y=Research_summary, fill=("Percent"))) +
  geom_col() + #add color to bar
  geom_text(
    aes(label=Percent), #add percentage
    hjust=0.5, vjust=-0.5, size=5) + #adjust percentage text position
  scale_y_continuous(limits=c(NA, 450)) + #length of y-axis
  labs(title=expression(underline(bold(The~"SNP rs and Percentage scores for ethnic groups of Alzheimer's disease")))) + #underlined title
  labs(x=expression(bold("Ethnic group")), y=expression(bold("Number of SNP rs"))) + #bold axis titles
  theme(plot.title.position="plot") +
  theme_classic() +
  theme(axis.text.x = element_text(angle=-90, vjust=0.5, hjust=0.5, size=14)) +
  theme(axis.text.y = element_text(angle=-90, vjust=0.5, hjust=0.5, size=14)) +
  theme(
    plot.title = element_text(hjust=0.7, size = 16), #adjust plot title position (hjust passed by name, not as the font family)
    axis.title.x = element_text(vjust=-1, size=14), #adjust x-axis title position
    axis.title.y = element_text(vjust=1.8, size=14)) #adjust y-axis title position
studyPbar


#snprs data distribution with duplicates removed, table 4
#Identify unique SNPs for each ethnic group
#main dataset "ad" manipulation
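ad$'P-Value' <- format(ad$'P-Value', scientific=F) #turn-off scientific notation for P-Value (the column becomes character)
ad <- filter(ad, as.numeric(`P-Value`) < 0.00001) #keep P-Value <1*10^-5; backticks refer to the column (quotes would compare a text string); all 17 variables and 873 obs are expected to be retained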
#assign numeric levels to population - factor levels
snpAll<-ad
snpAll$Population <- ifelse(snpAll$Population=="African|African American", 0,
                            ifelse(snpAll$Population=="African|European|African American|Indian American", 1,
                                    ifelse(snpAll$Population=="East Asian|Asian|European", 2,
                                           ifelse(snpAll$Population=="East Asian|European", 3,
                                                  ifelse(snpAll$Population=="European", 4,
                                                         ifelse(snpAll$Population=="European|Hispanic", 5,
                                                                ifelse(snpAll$Population=="Hispanic", 6, NA)))))))

snpAll$Population <- factor(snpAll$Population,
                            levels=c(0,1,2,3,4,5,6))
                            #labels=c("African|African American", "African|European|African American|Indian American", "East Asian|Asian|European", "East Asian|European", "European", "European|Hispanic", "Hispanic"))

#subset each ethnic group and remove duplicates
#Ethnic group - European "p_e"
p_e <-subset(snpAll, Population==4) #17 variables, 431 obs
p_e <- p_e[!duplicated(p_e[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 394 obs
#Ethnic group - African|African American "p_fm"
p_fm <- subset(snpAll, Population==0) #17 variables, 139 obs
p_fm <- p_fm[!duplicated(p_fm[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 134 obs
#Ethnic group - African|European|African American|Indian American "p_femi"
p_femi <- subset(snpAll, Population==1) #17 variables, 18 obs
p_femi <- p_femi[!duplicated(p_femi[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 18 obs
#Ethnic group - East Asian|European "p_se"
p_se <- subset(snpAll, Population==3) #17 variables, 3 obs
p_se <- p_se[!duplicated(p_se[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 3 obs
#Ethnic group - Hispanic "p_h"
p_h <- subset(snpAll, Population==6) #17 variables, 3 obs
p_h <- p_h[!duplicated(p_h[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 3 obs
#Ethnic group - East Asian|Asian|European "p_sne"
p_sne <- subset(snpAll, Population==2) #17 variables, 2 obs
p_sne <- p_sne[!duplicated(p_sne[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 2 obs
#Ethnic group - European|Hispanic "p_eh"
p_eh <- subset(snpAll, Population==5) #17 variables, 2 obs
p_eh <- p_eh[!duplicated(p_eh[c('SNP rs')]),] #remove duplicate SNPrs, 17 variables, 2 obs
#re-group the reduced SNPrs of each ethnic group to a new df
snpReduced<-bind_rows(p_e, p_fm, p_femi, p_se, p_h, p_sne, p_eh) #bind all SNPrs from individual ethnic groups, 17 variables,556 obs
snpRtable<-table(snpReduced$Population) #summarise SNPrs for each ethnic group
snpRtable<-data.frame(snpRtable) #dataframe SNPrs summary
snpRtable<-rename(snpRtable, Ethnic_group=Var1) #rename Var1 as Ethnic_group
snpRtable<-rename(snpRtable, SNPrs_summary=Freq) #rename Freq to SNPrs_summary
snpRtable$Ethnic_group <- factor(snpRtable$Ethnic_group, #factor labels for levels
                            levels=c(0,1,2,3,4,5,6),
                            labels=c("African|African American", "African|European|African American|Indian American", "East Asian|Asian|European", "East Asian|European", "European", "European|Hispanic", "Hispanic"))
snpRtable<-snpRtable[order(snpRtable$SNPrs_summary, decreasing=TRUE),] #change SNPrs_summary count to descending order
snpRpie_set<-snpRtable #duplicate snpRtable for pie chart
snpRtable[8, ] <- c("" , "Total = 556", "") #create row for total
snpRtable <- sapply(snpRtable, as.character) #convert to as.character
snpRtable[is.na(snpRtable)] <-""#replace NA with blank
snpRtable<-data.frame(snpRtable) #convert to dataframe
rownames(snpRtable)<- c() #remove row names
#create table summary for SNPrs of each ethnic group
snpRtable %>%
  kbl(caption = "_SNP rs summary, with duplicates removed, for each ethnic group associated with Alzheimer's disease_") %>% #table title
  kable_classic(full_width=T, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=T, color="white", background = "gray")


#snprs data distribution with duplicates removed, figure 2
#create pie chart for reduced SNPrs distribution
snpRpie_set %>% #create pipeline to allocate percentages
  mutate(Percentage = percent(SNPrs_summary / sum(SNPrs_summary))) -> snpRpie_set 
snpRpie <- ggplot(snpRpie_set, aes(x = "", y = SNPrs_summary, fill = fct_inorder(Ethnic_group))) + #order by ethnic group
  geom_col(width = 1, color = 1) +
  coord_polar(theta="y") +
  geom_label_repel(aes(label = Percentage), size=3, show.legend = F, nudge_x = 1) + #label pie slices with percentages
  guides(fill = guide_legend(title ="Ethnic group")) +
  labs(title=expression(underline("SNP rs percentage distribution, with duplicates removed, for each Ethnic group"))) +
  theme(plot.title.position="plot") +
  theme_void()
snpRpie <- snpRpie +
  theme(
    plot.title = element_text(hjust=-1)) #adjust plot title position
snpRpie


#_identification of unique snprs

#unique snprs summary - table 5
#find unmatched unique SNPrs for each ethnic group
#Q = unique
snpQ_e<-p_e #unique SNPrs - European population "e", 17 variables, 394 obs
snpQ_fm<-p_fm[!(p_fm$`SNP rs` %in% snpQ_e$`SNP rs`),] #unique SNPrs - African|African American population "fm", 17 variables, 133 obs
snpQ_femi<-p_femi[!(p_femi$`SNP rs` %in% snpQ_e$`SNP rs`),] #unique SNPrs - African|European|African American|Indian American "femi", 17 variables, 18 obs
snpQ_se<-p_se[!(p_se$`SNP rs` %in% snpQ_e$`SNP rs`),]#unique SNPrs - East Asian|European "se", 17 variables, 3 obs
snpQ_h<-p_h[!(p_h$`SNP rs` %in% snpQ_e$`SNP rs`),]#unique SNPrs - Hispanic "h", 17 variables, 3 obs
snpQ_sne<-p_sne[!(p_sne$`SNP rs` %in% snpQ_e$`SNP rs`),]#unique SNPrs - East Asian|Asian|European "sne", 17 variables, 1 obs
snpQ_eh<-p_eh[!(p_eh$`SNP rs` %in% snpQ_e$`SNP rs`),]#unique SNPrs - European|Hispanic "eh", 17 variables, 1 obs
#group the unmatched unique SNPrs of each ethnic group
snpQ<-bind_rows(snpQ_e, snpQ_fm, snpQ_femi, snpQ_se, snpQ_h, snpQ_sne, snpQ_eh) #bind all unique SNPrs from individual ethnic groups, 17 variables,553 obs
snpQtable<-table(snpQ$Population) #summarise unique SNPrs for each ethnic group
snpQtable<-data.frame(snpQtable) #dataframe SNPrs summary
snpQtable<-rename(snpQtable, Ethnic_group=Var1) #rename Var1 as Ethnic_group
snpQtable<-rename(snpQtable, Unique_SNPrs_summary=Freq) #rename Freq to Unique SNPrs summary
snpQtable$Ethnic_group <- factor(snpQtable$Ethnic_group, #factor labels for levels
                            levels=c(0,1,2,3,4,5,6),
                            labels=c("African|African American", "African|European|African American|Indian American", "East Asian|Asian|European", "East Asian|European", "European", "European|Hispanic", "Hispanic"))
snpQtable<-snpQtable[order(snpQtable$Unique_SNPrs_summary, decreasing=TRUE),] #change Unique SNPrs summary count to descending order
snpQtable %>% #create pipeline to allocate percentage scores
  mutate(Percentage = percent(Unique_SNPrs_summary / sum(Unique_SNPrs_summary))) -> snpQtable 
snpQtable[8, ] <- c("" , "Total = 553", "") #create row for total
snpQtable <- sapply(snpQtable, as.character) #convert to as.character
snpQtable[is.na(snpQtable)] <-""#replace NA with blank
snpQtable<-data.frame(snpQtable) #convert to dataframe
rownames(snpQtable)<- c() #remove row names
#create table summary for SNPrs of each ethnic group
snpQtable %>%
  kbl(caption = "_Unique SNP rs summary for each ethnic group associated with Alzheimer's disease_") %>% #table title
  kable_classic(full_width=T, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=T, color="white", background = "gray") #%>% #output heading and body layout


#principal component analysis - figure 3
#create pca plot
c<-c(3, 11, 17) #create specific column selection for pca
snpQpca_set<-snpQ[,c] #save specific columns to new variable
snpQpca_set$Population <- factor(snpQpca_set$Population, #factor labels for levels
                            levels=c(0,1,2,3,4,5,6),
                            labels=c("African|African American", "African|European|African American|Indian American", "East Asian|Asian|European", "East Asian|European", "European", "European|Hispanic", "Hispanic"))
snpQpca_set$`P-Value`<-qnorm(as.numeric(snpQpca_set$`P-Value`)) #convert p-values to Z scores
snpQpca<-pca(snpQpca_set)#run method for pca
plotIndiv(snpQpca, group = snpQpca_set$Population, legend = TRUE, title = "SNP rs, PCA", size.title = rel(1.3), legend.title = "Ethnic group") #plots samples for pca


#unique snprs and chromosome summary - table 6
#snprs to chromosome distribution for all ethnic groups together
snpQchr_All<-as.data.frame(table(snpQ$Chromosome)) #create df to explore snprs to chromosome distribution
snpQchr_All<-rename(snpQchr_All, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_All<-rename(snpQchr_All, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_All$Chromosome <- match(as.character(snpQchr_All$Chromosome), c(1:22, "X")) #map labels "1"-"22" to 1-22 and "X" to 23; any other label becomes NA
snpQchr_All<-snpQchr_All[order(snpQchr_All$Chromosome),] #change chromosome to ascending order 
#snprs to chromosome distribution for "e"
snpQchr_e<-as.data.frame(table(snpQ_e$Chromosome))
snpQchr_e<-rename(snpQchr_e, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_e<-rename(snpQchr_e, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_e$Chromosome <- match(as.character(snpQchr_e$Chromosome), c(1:22, "X")) #map labels "1"-"22" to 1-22 and "X" to 23; any other label becomes NA

snpQchr_e<-snpQchr_e[order(snpQchr_e$Chromosome),] #change chromosome to ascending order 
#snprs to chromosome distribution for "fm"
snpQchr_fm<-as.data.frame(table(snpQ_fm$Chromosome))
snpQchr_fm<-rename(snpQchr_fm, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_fm<-rename(snpQchr_fm, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_fm$Chromosome <- match(as.character(snpQchr_fm$Chromosome), c(1:22, "X")) #map labels "1"-"22" to 1-22 and "X" to 23; any other label becomes NA
snpQchr_fm<-snpQchr_fm[order(snpQchr_fm$Chromosome),] #change chromosome to ascending order
#snprs to chromosome distribution for "femi"
snpQchr_femi<-as.data.frame(table(snpQ_femi$Chromosome))
snpQchr_femi<-rename(snpQchr_femi, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_femi<-rename(snpQchr_femi, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_femi$Chromosome<-match(as.character(snpQchr_femi$Chromosome), c(1:22, "X")) #"1"-"22" map to 1-22, "X" to 23, any other label to NA
snpQchr_femi<-snpQchr_femi[order(snpQchr_femi$Chromosome),] #change chromosome to ascending order
#snprs to chromosome distribution for "se"
snpQchr_se<-as.data.frame(table(snpQ_se$Chromosome))
snpQchr_se<-rename(snpQchr_se, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_se<-rename(snpQchr_se, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_se$Chromosome<-match(as.character(snpQchr_se$Chromosome), c(1:22, "X")) #"1"-"22" map to 1-22, "X" to 23, any other label to NA
snpQchr_se<-snpQchr_se[order(snpQchr_se$Chromosome),] #change chromosome to ascending order
#snprs to chromosome distribution for "h"
snpQchr_h<-as.data.frame(table(snpQ_h$Chromosome))
snpQchr_h<-rename(snpQchr_h, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_h<-rename(snpQchr_h, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_h$Chromosome<-match(as.character(snpQchr_h$Chromosome), c(1:22, "X")) #"1"-"22" map to 1-22, "X" to 23, any other label to NA
snpQchr_h<-snpQchr_h[order(snpQchr_h$Chromosome),] #change chromosome to ascending order
#snprs to chromosome distribution for "sne"
snpQchr_sne<-as.data.frame(table(snpQ_sne$Chromosome))
snpQchr_sne<-rename(snpQchr_sne, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_sne<-rename(snpQchr_sne, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_sne$Chromosome<-match(as.character(snpQchr_sne$Chromosome), c(1:22, "X")) #"1"-"22" map to 1-22, "X" to 23, any other label to NA
snpQchr_sne<-snpQchr_sne[order(snpQchr_sne$Chromosome),] #change chromosome to ascending order
#snprs to chromosome distribution for "eh"
snpQchr_eh<-as.data.frame(table(snpQ_eh$Chromosome))
snpQchr_eh<-rename(snpQchr_eh, Chromosome=Var1) #rename Var1 as Chromosome
snpQchr_eh<-rename(snpQchr_eh, count=Freq) #rename Freq as count
#assign levels to chromosomes
snpQchr_eh$Chromosome<-match(as.character(snpQchr_eh$Chromosome), c(1:22, "X")) #"1"-"22" map to 1-22, "X" to 23, any other label to NA
snpQchr_eh<-snpQchr_eh[order(snpQchr_eh$Chromosome),] #change chromosome to ascending order
#create table for snprs for each chromosome for each population
snpQchromPtabe<-data.frame(vars= c("Total number of genes = 496", "**Overall SNP count** _(n)_", "**Chromosome** _(n)_", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21", "22", "X", "Y"), #create df with snprs and chromosome data for each ethnic group, and total number of genes
                           
                           European= c(s, snpQtable$Unique_SNPrs_summary[1], s, snpQchr_e$count[1:23], s),
                           
                           African_AfricanAmerican= c(s, snpQtable$Unique_SNPrs_summary[2], s, snpQchr_fm$count[1:20], s, snpQchr_fm$count[21], s,s),
                           African_European_AfricanAmerican_IndianAmerican= c(s, snpQtable$Unique_SNPrs_summary[3], s, s, snpQchr_femi$count[1:4], s, s, s, snpQchr_femi$count[5:7], s, s, s, s, s, snpQchr_femi$count[8], s, snpQchr_femi$count[9], s, s, s, s, s),
                           EastAsian_European= c(s, snpQtable$Unique_SNPrs_summary[4], s, s, s, s, s, s, s, snpQchr_se$count[1], s, s, s, s, snpQchr_se$count[2], s, s, s, s, s, snpQchr_se$count[3], s, s, s, s, s, s),
                           Hispanic= c(s, snpQtable$Unique_SNPrs_summary[5], s, s, s, snpQchr_h$count[1], s, snpQchr_h$count[2], s, s, s, s, s, s, s, s, s, s, s, s, s, snpQchr_h$count[3], s, s, s, s, s),
                           EastAsian_Asian_European= c(s, snpQtable$Unique_SNPrs_summary[6], s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, s, snpQchr_sne$count[1], s, s, s, s, s),
                           European_Hispanic= c(s,snpQtable$Unique_SNPrs_summary[7], s, s, s, s, s, s, s, s, s, s, snpQchr_eh$count[1], s, s, s, s, s, s, s, s, s, s, s, s, s, s),
                           stringsAsFactors = FALSE)
#create summary table of SNPrs to chromosome for each ethnic group, and total number of gene associations
kable(snpQchromPtabe,
      col.names = c("", "European", "African|African American", "African|European|African American|Indian American", "East Asian|European", "Hispanic", "East Asian|Asian|European", "European|Hispanic"),
      align="cccccccc",
      type="") %>%
  column_spec(1, width_min = "5cm", border_right = TRUE) %>%
  column_spec(2:8, width_min = "2cm") %>%
  row_spec(0, bold = TRUE, color="ivory", background = "gray") %>%
  row_spec(1:27, color="black", background = "white") %>%
  kable_styling(bootstrap_options = ("bordered"), full_width=FALSE, font_size=12)


#most statistically significant and unique SNP rs for each ethnic group - table 7
#most significantly unique SNP for each ethnic group
snpQmost<-bind_rows(snpQ_e[1, ], snpQ_fm[1, ], snpQ_femi[1, ], snpQ_se[1, ], snpQ_h[1, ], snpQ_sne[1, ], snpQ_eh[1, ]) #gather most significant unmatched unique snprs
c2<-c(3:6, 9:11, 17) #select columns to keep
snpQmost<-snpQmost[,c2] #run columns to keep
snpQmost$`P-Value`[1]<-"1.999999999999999886549084e-157" #assign pv
snpQmost$`P-Value`[2]<-"2.000000000000000124466409e-9"
snpQmost$`P-Value`[3]<-"9.000000000000000069388939e-52"
snpQmost$`P-Value`[4]<-"9.999999999999999547237172e-7"
snpQmost$`P-Value`[5]<-"7.999999999999999516012150e-11"
snpQmost$`P-Value`[6]<-"4.999999999999999808746736e-39"
snpQmost$`P-Value`[7]<-"1.999999999999999909447434e-7"
snpQmost$Population <- factor(snpQmost$Population, #factor labels for levels
                            levels=c(0,1,2,3,4,5,6),
                            labels=c("African|African American", "African|European|African American|Indian American", "East Asian|Asian|European", "East Asian|European", "European", "European|Hispanic", "Hispanic"))
#create significantly most unique SNP rs table
snpQmost %>%
  kbl(caption = "_Summary of the most statistically significant and unique SNP rs for each ethnic group associated with Alzheimer's disease_",
      align = "lcccclll") %>% #table title
  column_spec(c(1,2,3,4),width_min = "2.5cm") %>%
  column_spec(c(5,6,7,8),width_min = "3cm") %>%
  kable_classic(full_width=FALSE, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=TRUE, color="white", background = "gray") #output heading and body layout

#export unmatched unique snp rs data to "output" folder
#all ethnic groups
write.table(snpQ, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_allEthnicGroups.csv", row.names=FALSE, sep=",")

#Ethnic group - European
write.table(snpQ_e, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_european.csv", row.names=FALSE, sep=",")
#Ethnic group - African|African American
write.table(snpQ_fm, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_african_africanAmerican.csv", row.names=FALSE, sep=",")
#Ethnic group - African|European|African American|Indian American
write.table(snpQ_femi, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_african_european_africanAmerican_indianAmerican.csv", row.names=FALSE, sep=",")
#Ethnic group - East Asian|European
write.table(snpQ_se, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_eastAsian_european.csv", row.names=FALSE, sep=",")
#Ethnic group - Hispanic
write.table(snpQ_h, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_hispanic.csv", row.names=FALSE, sep=",")
#Ethnic group - East Asian|Asian|European
write.table(snpQ_sne, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_eastAsian_asian_european.csv", row.names=FALSE, sep=",")
#Ethnic group - European|Hispanic
write.table(snpQ_eh, file="C:/Users/ladki/Desktop/AD_Project/output/uniqueSNP_european_hispanic.csv", row.names=FALSE, sep=",")


#pathway analysis of the unique SNP rs in association with genes - figure 4
knitr::include_graphics("C:/Users/ladki/Desktop/AD_Project/images/GO.png")


#pathway analysis of the unique SNP rs in association with genes - table 8
GO_AD<-read_excel("./data/GO_AD_function.xlsx", sheet=1) #GO_AD_functions data
GO_AD$`P-value`<-as.character(GO_AD$`P-value`)
#create table for GO_AD_functions terms 
GO_AD %>%
  kbl(caption = "_GO terms enriched in the enrichment pathway analysis with genes associated with the unique SNP rs of ethnic groups associated with Alzheimer's disease_ ",
      align = "llcl") %>% #table title
  column_spec(c(1,2,3,4), width_min = "3cm") %>%
  kable_classic(full_width=FALSE, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=TRUE, color="white", background = "gray") #output heading and body layout


#unique SNP rs of Alzheimer's disease in association with other phenotypes from the PheGenI GWAS catalogue, table 9
ad_o_ph<- read_excel("./data/PheGenI_otherPhenotype.xlsx", sheet=2) #other phenotype data
ad_o_ph<-rename(ad_o_ph, Phenotype=Phenotype...1) #rename column
ad_o_ph<-rename(ad_o_ph, SNPrs_count=SNPrs_count...2) #rename column
ad_o_ph<-rename(ad_o_ph, Phenotype_continued=Phenotype...3) #rename column
ad_o_ph<-rename(ad_o_ph, SNPrs_count_continued=SNPrs_count...4) #rename column
ad_o_ph<-as.data.frame(ad_o_ph) #save as df to change na
ad_o_ph[is.na(ad_o_ph)]<-s #change na to blank
#create table for other phenotypes
ad_o_ph %>%
  kbl(caption = "_Significant other phenotypes in association with the unique SNP rs for Alzheimer's disease_ ",
      align = "rlrl") %>% #table title
  column_spec(c(1,2,3,4), width_min = "3cm") %>%
  kable_classic(full_width=FALSE, html_font = "Calibri") %>% #output width and font
  kable_styling(table.envir = "ctable", font_size=16) %>% #output fontSize
  row_spec(0, bold=TRUE, color="white", background = "gray") #output heading and body layout

12 Session information

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] mixOmics_6.16.3   lattice_0.20-45   MASS_7.3-54       scales_1.2.0     
##  [5] ggrepel_0.9.1     papaja_0.1.0.9999 tinylabels_0.2.3  tinytex_0.38     
##  [9] forcats_0.5.1     stringr_1.4.0     dplyr_1.0.8       purrr_0.3.4      
## [13] readr_2.1.2       tidyr_1.2.0       tibble_3.1.6      ggplot2_3.3.5    
## [17] tidyverse_1.3.1   kableExtra_1.3.4  readxl_1.4.0     
## 
## loaded via a namespace (and not attached):
##  [1] matrixStats_0.62.0  fs_1.5.2            lubridate_1.8.0    
##  [4] webshot_0.5.3       RColorBrewer_1.1-3  httr_1.4.2         
##  [7] tools_4.1.2         backports_1.4.1     bslib_0.3.1        
## [10] utf8_1.2.2          R6_2.5.1            DBI_1.1.2          
## [13] colorspace_2.0-3    withr_2.5.0         tidyselect_1.1.2   
## [16] gridExtra_2.3       compiler_4.1.2      cli_3.2.0          
## [19] rvest_1.0.2         xml2_1.3.3          labeling_0.4.2     
## [22] sass_0.4.1          systemfonts_1.0.4   digest_0.6.29      
## [25] rmarkdown_2.14      svglite_2.1.0       pkgconfig_2.0.3    
## [28] htmltools_0.5.2     dbplyr_2.1.1        fastmap_1.1.0      
## [31] highr_0.9           rlang_1.0.2         rstudioapi_0.13    
## [34] farver_2.1.0        jquerylib_0.1.4     generics_0.1.2     
## [37] jsonlite_1.8.0      BiocParallel_1.26.2 magrittr_2.0.3     
## [40] Matrix_1.3-4        Rcpp_1.0.8.3        munsell_0.5.0      
## [43] fansi_1.0.3         lifecycle_1.0.1     stringi_1.7.6      
## [46] yaml_2.3.5          plyr_1.8.7          grid_4.1.2         
## [49] parallel_4.1.2      crayon_1.5.1        haven_2.5.0        
## [52] hms_1.1.1           knitr_1.39          pillar_1.7.0       
## [55] igraph_1.3.1        corpcor_1.6.10      reshape2_1.4.4     
## [58] reprex_2.0.1        glue_1.6.2          evaluate_0.15      
## [61] modelr_0.1.8        png_0.1-7           vctrs_0.4.1        
## [64] tzdb_0.3.0          cellranger_1.1.0    gtable_0.3.0       
## [67] assertthat_0.2.1    xfun_0.30           broom_0.8.0        
## [70] RSpectra_0.16-1     viridisLite_0.4.0   rARPACK_0.11-0     
## [73] ellipse_0.4.2       ellipsis_0.3.2

13 References

Allen, P., Bennett, K., & Heritage, B. (2019). SPSS statistics (4th ed.). Cengage Learning Australia.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112). Springer.

Milligan Armstrong, A., Porter, T., Quek, H., White, A., Haynes, J., Jackaman, C., Villemagne, V., Munyard, K., Laws, S. M., & Verdile, G. (2021). Chronic stress and Alzheimer’s disease: The interplay between the hypothalamic–pituitary–adrenal axis, genetics and microglia. Biological Reviews, 96(5), 2209–2228. https://doi.org/10.1111/brv.12750

Nagaraj, S., Laskowska-Kaszub, K., Dębski, K. J., Wojsiat, J., Dąbrowski, M., Gabryelewicz, T., Kuźnicki, J., & Wojda, U. (2017). Profile of 6 microRNA in blood plasma distinguish early stage Alzheimer’s disease patients from non-demented subjects. Oncotarget, 8(10), 16122. https://doi.org/10.18632/oncotarget.15109

National Center for Biotechnology Information. (2021). Phenotype-Genotype Integrator. https://www.ncbi.nlm.nih.gov/gap/phegeni#pgSNP

Rountree, S. D., Chan, W., Pavlik, V. N., Darby, E. J., & Doody, R. S. (2012). Factors that influence survival in a probable Alzheimer disease cohort. Alzheimer’s Research & Therapy, 4(3), 1–6. https://doi.org/10.1186/alzrt119

Shigemizu, D., Mitsumori, R., Akiyama, S., Miyashita, A., Morizono, T., Higaki, S., Asanomi, Y., Hara, N., Tamiya, G., & Kinoshita, K. (2021). Ethnic and trans-ethnic genome-wide association studies identify new loci influencing Japanese Alzheimer’s disease risk. Translational Psychiatry, 11(1), 1–10. https://doi.org/10.1038/s41398-021-01272-3

Vacher, M., Porter, T., Villemagne, V. L., Milicic, L., Peretti, M., Fowler, C., Martins, R., Rainey-Smith, S., Ames, D., & Masters, C. L. (2019). Validation of a priori candidate Alzheimer’s disease SNPs with brain amyloid-beta deposition. Scientific Reports, 9(1), 1–8. https://doi.org/10.1038/s41598-019-53604-5

Wickham, H. (2019). Advanced R. CRC Press.

Unique single nucleotide polymorphisms for Alzheimer’s Disease in different ethnic groups

Student name: Artika Kirby

Student ID: 10463490