Load required packages
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 3.0.4 ✓ purrr 0.3.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s look at the population polygenic scores for height and educational attainment from the two largest GWAS, respectively (Yengo et al., 2018) and (Lee et al., 2018). The polygenic scores are computed using allele frequencies weighted by effect size for the GWAS significant SNPs.
Height PGS in HGDP populations
Height PGS in 1KG populations
Education PGS in HGDP populations
Education PGS in 1KG populations
The polygenic scores show population differentiation and seem to follow the average IQ and height of the populations. A vesion of the Qst-Fst test, known as Fst enrichment test (Guo et al., 2018) was used to detect divergent selection. A more powerful version is used here, which avoids the issue of multiple comparisons found in pairwise tests: global Qst-Fst. The trait-associated SNPs were matched (after LD clumping) to random SNPs stratified by MAF and not in close LD (r< 0.1). This process was repeated 1000 times to generate a distribution of mean Fst under neutrality.
In 1KG, for height the following mean Fst was found
## [1] 0.08609203
This was compared to the global Fst of the control SNPs.
## [1] 0.07627578
## [1] 0.07627577
The height-associated SNPs clearly have higher Fst than the random ones as can be seen in the density plot.
The Z score and empirical p value are significant as well:
## [1] 9.040245
## [1] 0.001998002
The education SNPs significantly deviate from the “neutral” Fst as well, although to a smaller extent. Mean random and GWAS Fst, respectively:
## [1] 0.07577113
## [1] 0.08038633
Z score and p value:
## [1] 3.11107
## [1] 0.001998002
This can be visualized from the density plot:
Similar deviations from neutrality are found using the HGDP dataset…
For height, mean random and GWAS Fst:
## [1] 0.08536426
## [1] 0.09343812
Z score and empirical p value..
Z_score= (mean_GWAS_Fst - mean(mean_by_set$average))/sd(mean_by_set$average)
Z_score
## [1] 7.945418
#Monte Carlo simulation
r_w=length(which(mean_by_set$average>= mean_GWAS_Fst))
n_w=length(mean_by_set$average)
MC_p_w=(r_w+1)/(n_w+1)
MC_p_w
## [1] 0.000999001
p <- ggplot(mean_by_set, aes(x=average)) +
geom_density(colour="black", fill="lightblue") +
geom_vline(aes(xintercept= mean_GWAS_Fst), # Ignore NA values for mean
color="red", linetype="dashed", size=1)+
labs(title="Height SNPs Global Fst. HGDP",x="Global Fst control SNPs", y = "Density")
p
And for education, mean Fst for random SNPs and for GWAS, respectively:
## [1] 0.08423303
## [1] 0.08920302
Z score and empirical p value:
## [1] 3.679664
## [1] 0.000999001
The height GWAS with the largest predictive power to date is Chung et al. (2018). Unfortunately, it does not provide p values or S.E. so it is not possible to perform the LD clumping required for the Qst-Fst test. However, it’s possible to compute polygenic scores. The highest predictive power was achieved using LASSO+CTPR (Cross-trait penalized regression) so its weights will be used for computing the PGS (note that the PGS computed using other weights are extremely correlated).
1000 Genomes populations:
Using estimated average height for the populations (only those from developed countries), the correlation is r> 0.9.
## [1] 0.9227997
Now let’s have a look at the gnomAD dataset (v. 3.0):
Education….
Height…