This exercise generally follows the vignette by Florian Privé available online here.
The most relevant literature is Privé et al., bioRxiv 2020, which presents LDpred2. For more details on the original LDpred method, see Vilhjálmsson et al., AJHG 2015. The R packages bigsnpr and bigstatsr are presented and extended in the following publications: Privé et al., Bioinformatics 2018; Privé et al., Genetics 2019; Privé et al., AJHG 2019; Privé et al., Bioinformatics 2020.
There are several requirements for completing this exercise: the bigsnpr R package and the example data in the tmp-data folder. You can find the data files here.
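If needed, you can install the packages used in this exercise first (a minimal setup; skip any you already have):
# Install the packages used in this exercise (skip those already installed)
install.packages("bigsnpr")  # also installs bigstatsr
install.packages(c("R.utils", "ggplot2", "viridis", "dplyr"))
First we load the bigsnpr package.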
library(bigsnpr)
## Warning: package 'bigsnpr' was built under R version 3.6.2
## Loading required package: bigstatsr
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.0 (2020-02-14 07:10:20 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.23.0 successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.9.2 successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, nullfile,
## parse, warnings
#For plotting
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
Then we read from the bed/bim/fam files; this generates the .bk and .rds backing files. Make sure the data folder is in the working directory. You can check the working directory using getwd() and change it using setwd("your/working/directory").
setwd("/Users/au507860/Dropbox/Cloud_folder/Courses/DiabetesAcademy2020/DDA_2020_LDpred2")
snp_readBed("tmp-data/public-data.bed")
## [1] "/Users/au507860/Dropbox/Cloud_folder/Courses/DiabetesAcademy2020/DDA_2020_LDpred2/tmp-data/public-data.rds"
Load the data and store it in the “bigSNP” R session object.
obj.bigSNP <- snp_attach("tmp-data/public-data.rds")
Inspect the contents of this object.
str(obj.bigSNP, max.level = 2, strict.width = "cut")
## List of 3
## $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 15 ..
## ..and 26 methods, of which 12 are possibly relevant:
## .. add_columns, as.FBM, bm, bm.desc, check_dimensions,
## .. check_write_permissions, copy#envRefClass, initialize,
## .. initialize#FBM, save, show#envRefClass, show#FBM
## $ fam :'data.frame': 559 obs. of 6 variables:
## ..$ family.ID : chr [1:559] "EUR_GBR" "EUR_GBR" "EUR_GBR" "EUR_GBR" ...
## ..$ sample.ID : chr [1:559] "HG00096" "HG00097" "HG00099" "HG00100" ...
## ..$ paternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ maternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ sex : int [1:559] 1 2 2 2 1 2 1 2 2 1 ...
## ..$ affection : int [1:559] 1 2 1 1 1 1 2 1 2 1 ...
## $ map :'data.frame': 130816 obs. of 6 variables:
## ..$ chromosome : int [1:130816] 2 2 2 2 2 2 2 2 2 2 ...
## ..$ marker.ID : chr [1:130816] "rs13400442" "rs7594567" "rs7597758" "..
## ..$ genetic.dist: int [1:130816] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ physical.pos: int [1:130816] 18506 21833 22398 28228 32003 32005 36..
## ..$ allele1 : chr [1:130816] "C" "G" "T" "A" ...
## ..$ allele2 : chr [1:130816] "T" "C" "C" "G" ...
## - attr(*, "class")= chr "bigSNP"
Get aliases for useful bigSNP variables
G <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y <- obj.bigSNP$fam$affection - 1
NCORES <- nb_cores()
LDpred requires an LD reference, i.e., a set of genotypes that ideally satisfies the two following conditions.
The ancestry of the LD reference should be similar to that of the sample on which the summary statistics are based.
The individuals should not be closely genetically related, e.g., they should be more distant than cousins.
The test/validation data can sometimes be used as the LD reference. You can use bigsnpr to prune the data to ensure it satisfies the two conditions above.
The details are shown in Privé et al., Bioinformatics 2020, and a vignette for how to conduct PCA analyses using the package can be found here.
First we download the 1000 Genomes data, which can take some time if the internet connection is slow.
bedfile <- download_1000G("data")
First, let us detect all pairs of related individuals.
plink2 <- download_plink2("data")
rel <- snp_plinkKINGQC(
plink2.path = plink2,
bedfile.in = bedfile,
thr.king = 2^-4.5,
make.bed = FALSE,
ncores = nb_cores()
)
str(rel)
## 'data.frame': 31 obs. of 8 variables:
## $ FID1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IID1 : chr "HG00120" "HG00240" "HG00542" "HG00595" ...
## $ FID2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IID2 : chr "HG00116" "HG00238" "HG00475" "HG00584" ...
## $ NSNP : int 1664852 1664852 1664852 1664852 1664852 1664852 1664852 1664852 1664852 1664852 ...
## $ HETHET : num 0.111 0.1105 0.1024 0.101 0.0992 ...
## $ IBS0 : num 0.0333 0.0367 0.0302 0.037 0.0367 ...
## $ KINSHIP: num 0.0821 0.068 0.0854 0.0541 0.0535 ...
We then prune related individuals and compute PCA on the remaining (unrelated) individuals. The function bed_autoSVD() iteratively prunes variants to reduce the risk of the PCA capturing Linkage Disequilibrium (LD).
(obj.bed <- bed(bedfile))
## A 'bed' object with 2490 samples and 1664852 variants.
ind.rel <- match(c(rel$IID1, rel$IID2), obj.bed$fam$sample.ID)
ind.norel <- rows_along(obj.bed)[-ind.rel]
obj.svd <- bed_autoSVD(obj.bed, ind.row = ind.norel, k = 20,
ncores = nb_cores())
##
## Phase of clumping (on MAC) at r^2 > 0.2.. keep 392665 variants.
## Discarding 0 variant with MAC < 10.
##
## Iteration 1:
## Computing SVD..
## 679 outlier variants detected..
## 7 long-range LD regions detected..
##
## Iteration 2:
## Computing SVD..
## 0 outlier variant detected..
##
## Converged!
We can plot these PCs.
anc_info <- read.delim('data/1000G_phase3_common_norel.fam2')[-ind.rel,]
library(cowplot)  # provides plot_grid()
plot_grid(plotlist = lapply(1:4, function(k) {
plot(obj.svd, type = "scores", scores = 2 * k - 1:0, coeff = 0.6) +
aes(color=anc_info$Super.Population) +
scale_color_viridis(name="Ancestry", discrete = TRUE)
}), scale = 0.95)
Then, we check whether there are individual outliers, which could be evidence of genotyping issues.
prob <- bigutilsr::prob_dist(obj.svd$u, ncores = nb_cores())
S <- prob$dist.self / sqrt(prob$dist.nn)
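Before filtering, you can inspect the distribution of this statistic; the 0.5 cutoff used in the next step is a visual choice, not a fixed rule:
# Histogram of the outlierness statistic S; individuals with large S
# (here, S >= 0.5) are removed as outliers in the next step
ggplot(data.frame(S = S), aes(S)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = 0.5, color = "red", linetype = 2)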
We can then rerun the PCA without these outliers (individuals with S >= 0.5).
ind.row <- ind.norel[S < 0.5]
ind.col <- attr(obj.svd, "subset")
obj.svd2 <- bed_autoSVD(obj.bed, ind.row = ind.row,
ind.col = ind.col, thr.r2 = NA,
k = 20, ncores = nb_cores())
##
## Skipping clumping.
## Discarding 0 variant with MAC < 10.
##
## Iteration 1:
## Computing SVD..
## 22 outlier variants detected..
## 1 long-range LD region detected..
##
## Iteration 2:
## Computing SVD..
## 0 outlier variant detected..
##
## Converged!
We would still like to obtain PCA values for the remaining individuals. We do this by projecting them onto these PCs, to get PCs for all individuals.
PCs <- matrix(NA, nrow(obj.bed), ncol(obj.svd2$u))
PCs[ind.row, ] <- predict(obj.svd2)
proj <- bed_projectSelfPCA(obj.svd2, obj.bed,
ind.row = rows_along(obj.bed)[-ind.row],
ncores = 1) # parallelism not useful here -> too few individuals
PCs[-ind.row, ] <- proj$OADP_proj
We can plot the PCs again.
anc_info <- read.delim('data/1000G_phase3_common_norel.fam2')
PCs_df <- as.data.frame(PCs)
colnames(PCs_df) <- paste("PC",1:20,sep="")
ggplot(data = PCs_df, mapping = aes(x = PC1, y = PC2)) +
scale_color_viridis(name="Ancestry", discrete = TRUE)+
geom_point(aes(color = anc_info$Super.Population))
ggplot(data = PCs_df, mapping = aes(x = PC3, y = PC4)) +
scale_color_viridis(name="Ancestry", discrete = TRUE)+
geom_point(aes(color = anc_info$Super.Population))
We want the set of individuals in our LD reference to have similar ancestry to the individuals on which our summary statistics are based. E.g., if we want to use summary statistics based on individuals with East Asian ancestry, we should exclude individuals with a different ancestry from our LD reference. If ancestry labels are available, we can use those. If not, we can project the LD reference data onto the 1000 Genomes PCs and exclude individuals that land far away from the individuals with the target ancestry. You can use cov <- bigutilsr::covrob_ogk(PC[ind_eur, ]) and log_dist <- log(mahalanobis(PC, center = cov$center, cov = cov$cov)) for this purpose. A minimal sketch, assuming European-ancestry summary statistics (ind_eur and the distance cutoff below are illustrative):
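# A sketch: robustly estimate the center and covariance of the
# target-ancestry cluster, then flag individuals far from it.
# 'ind_eur' and the distance cutoff are illustrative assumptions.
ind_eur <- which(anc_info$Super.Population == "EUR")
cov <- bigutilsr::covrob_ogk(PCs[ind_eur, ])
log_dist <- log(mahalanobis(PCs, center = cov$center, cov = cov$cov))
hist(log_dist, breaks = 50)  # inspect before choosing a cutoff
ind_keep <- which(log_dist < 5)  # hypothetical cutoff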
To train the polygenic risk score, the user should load external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
str(sumstats)
## 'data.frame': 130816 obs. of 10 variables:
## $ chromosome : int 2 2 2 2 2 2 2 2 2 2 ...
## $ marker.ID : chr "rs13400442" "rs7594567" "rs7597758" "rs13383216" ...
## $ physical.pos: int 18506 21833 22398 28228 32003 32005 36787 55237 56916 61687 ...
## $ allele1 : chr "C" "G" "T" "A" ...
## $ allele2 : chr "T" "C" "C" "G" ...
## $ beta : num -0.073 0.0439 -0.3325 -0.5445 -0.4881 ...
## $ beta_se : num 0.277 0.248 0.192 0.247 0.242 ...
## $ n_case : int 157 157 157 157 157 157 157 157 157 157 ...
## $ n_control : int 402 402 402 402 402 402 402 402 402 402 ...
## $ p : num 0.7925 0.8593 0.0846 0.028 0.0439 ...
We split the genotype data, using one part to tune the hyper-parameters (the validation set) and the other part to evaluate statistical properties of the polygenic risk score, such as AUC (the test set). Here we use 400 individuals as the validation set.
set.seed(1)
ind.val <- sample(nrow(G), 400)
ind.test <- setdiff(rows_along(G), ind.val)
To match the variants of the genotype data with those of the summary statistics, the variables “chr” (chromosome number), “pos” (physical position), “a0” (reference allele) and “a1” (derived allele) should be available in both the summary statistics and the genotype data. These 4 variables are used to match variants between the two data frames.
sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
sumstats$n_case <- sumstats$n_control <- NULL
names(sumstats) <- c("chr", "rsid", "pos", "a0", "a1", "beta", "beta_se", "p", "n_eff")
map <- obj.bigSNP$map[-(2:3)]
names(map) <- c("chr", "pos", "a0", "a1")
info_snp <- snp_match(sumstats, map)
## 130,816 variants to be matched.
## 18,932 ambiguous SNPs have been removed.
## 111,884 variants have been matched; 0 were flipped and 0 were reversed.
If no or few variants are actually flipped, you might want to disable the strand flipping option. Here, these are simulated data so all variants use the same strand and the same reference.
info_snp <- snp_match(sumstats, map, strand_flip = FALSE)
## 130,816 variants to be matched.
## 130,816 variants have been matched; 0 were flipped and 0 were reversed.
Some variants may still not match in terms of allele frequencies and LD, and we want to exclude those. As these data are simulated, we do not have this issue, but in practice we recommend you follow the procedure outlined below (and in the supplementary note of the paper).
sd <- sqrt(big_colstats(G, ind.val, ncores = NCORES)$var)
sd_val <- sd[info_snp$`_NUM_ID_`]
sd_ss <- with(info_snp, 1 / sqrt(n_eff / 4 * beta_se^2))
is_bad <-
sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) | sd_ss < 0.1 | sd_val < 0.05
qplot(sd_val, sd_ss, color = is_bad, alpha = I(0.5)) +
theme_bigstatsr() +
coord_equal() +
scale_color_viridis_d(direction = -1) +
geom_abline(linetype = 2, color = "red") +
labs(x = "Standard deviations in the validation set",
y = "Standard deviations derived from the summary statistics",
color = "Removed?")
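If some variants were flagged, you could remove them before the next steps (not needed for these simulated data):
# Not needed here, but in practice exclude the flagged variants:
info_snp_qc <- info_snp[!is_bad, ]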
In practice, we recommend using the HapMap3 variants used in PRS-CS and the LDpred2 papers (until we find a better set of variants). Information about these variants can be retrieved with
HM3_info <- readRDS(url("https://github.com/privefl/bigsnpr/raw/master/data-raw/hm3_variants.rds"))
str(HM3_info)
## 'data.frame': 1120696 obs. of 8 variables:
## $ chr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rsid : chr "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
## $ pos : int 752721 754182 760912 768448 779322 838555 846808 853954 854250 861808 ...
## $ a1 : chr "A" "A" "C" "A" ...
## $ a0 : chr "G" "G" "T" "G" ...
## $ maf : num 0.161 0.1282 0.16 0.0925 0.1183 ...
## $ pos_hg19: int 762858 764319 771049 778585 789459 848692 856945 864091 864387 871945 ...
## $ pos_hg17: int 802721 804182 810912 818448 829322 888555 896808 903954 904250 912088 ...
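For example, you could restrict your summary statistics to these variants before matching (for illustration only; the simulated data used in this exercise is not restricted to HapMap3 variants):
# Illustration only: keep summary statistics for HapMap3 variants
sumstats_hm3 <- snp_match(sumstats, HM3_info[, c("chr", "pos", "a0", "a1")])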
First, you need to compute correlations between variants. We recommend using a window size of 3 cM (see the LDpred2 paper).
POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = 3)
## indices in info_snp
ind.chr <- which(info_snp$chr == 2)
df_beta <- info_snp[ind.chr, c("beta", "beta_se", "n_eff")]
## indices in G
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
infos.pos = POS2[ind.chr2], size = 3 / 1000)
corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
We then estimate the SNP heritability h2 from the summary statistics using LD score regression:
(ldsc <- snp_ldsc2(corr0, df_beta))
## int h2
## 1.0000000 0.1152166
h2_est <- ldsc[["h2"]]
We first run the infinitesimal model (LDpred2-inf), which has an analytical solution given h2:
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_inf, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.66347145 0.56739110 0.75341191 0.04784474
In practice, we recommend testing multiple values for h2 and p.
(h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4))
## [1] 0.0807 0.1152 0.1613
(p_seq <- signif(seq_log(1e-4, 1, length.out = 12), 2))
## [1] 0.00010 0.00023 0.00053 0.00120 0.00280 0.00660 0.01500 0.03500
## [9] 0.08100 0.19000 0.43000 1.00000
(params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE)))
This takes several minutes if you do not have many cores.
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
pred_grid <- big_prodMat(G, beta_grid, ind.col = ind.chr2)
params$score <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val])$score
library(ggplot2)
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
theme_bigstatsr() +
geom_point() +
geom_line() +
scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
facet_wrap(~ sparse, labeller = label_both) +
labs(y = "GLM Z-Score", color = "h2") +
theme(legend.position = "top", panel.spacing = unit(1, "lines"))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
params %>%
mutate(sparsity = colMeans(beta_grid == 0), id = row_number()) %>%
arrange(desc(score)) %>%
mutate_at(c("score", "sparsity"), round, digits = 3) %>%
slice(1:10)
You can then choose the best model according to your preferred criterion (e.g., max AUC). Here, we use the Z-score from the regression of the phenotype on the PRS, since we have found it more robust than using the AUC. It also enables adjusting for covariates in this step.
best_grid_nosp <- params %>%
mutate(id = row_number()) %>%
filter(!sparse) %>%
arrange(desc(score)) %>%
slice(1) %>%
pull(id) %>%
beta_grid[, .]
pred_nosp <- big_prodVec(G, best_grid_nosp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_nosp, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.67077797 0.57109988 0.76419644 0.04922476
best_grid_sp <- params %>%
mutate(id = row_number()) %>%
filter(sparse) %>%
arrange(desc(score)) %>%
slice(1) %>%
pull(id) %>%
beta_grid[, .]
pred_sp <- big_prodVec(G, best_grid_sp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_sp, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.67181865 0.57173198 0.76486522 0.04934336
LDpred2-auto estimates p and h2 directly from the data, so it does not require a validation set to tune hyper-parameters. You can run many chains in parallel with different initial values for p.
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
ncores = NCORES)
str(multi_auto)
## List of 6
## $ :List of 5
## ..$ beta_est : num [1:52556] -1.27e-07 7.60e-08 -5.86e-07 -9.69e-07 -8.65e-07 ...
## ..$ p_est : num 0.00338
## ..$ h2_est : num 0.000205
## ..$ path_p_est : num [1:1500] 0.000119 0.000145 0.000131 0.00021 0.000176 ...
## ..$ path_h2_est: num [1:1500] 0.112 0.03273 0.00694 0.00905 0.00842 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -8.54e-08 5.14e-08 -3.89e-07 -6.38e-07 -5.72e-07 ...
## ..$ p_est : num 0.00902
## ..$ h2_est : num 0.000136
## ..$ path_p_est : num [1:1500] 0.000521 0.000532 0.000539 0.000493 0.000596 ...
## ..$ path_h2_est: num [1:1500] 0.0661 0.0424 0.0423 0.0568 0.0523 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -4.17e-07 2.51e-07 -1.90e-06 -3.12e-06 -2.80e-06 ...
## ..$ p_est : num 0.0105
## ..$ h2_est : num 0.000665
## ..$ path_p_est : num [1:1500] 0.0038 0.00348 0.00402 0.0043 0.00443 ...
## ..$ path_h2_est: num [1:1500] 0.116 0.131 0.117 0.125 0.159 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -1.32e-06 7.95e-07 -6.02e-06 -9.86e-06 -8.84e-06 ...
## ..$ p_est : num 0.132
## ..$ h2_est : num 0.0021
## ..$ path_p_est : num [1:1500] 0.0217 0.0223 0.022 0.0232 0.0232 ...
## ..$ path_h2_est: num [1:1500] 0.0992 0.0981 0.1033 0.1001 0.0984 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -9.05e-06 5.55e-06 -4.15e-05 -6.80e-05 -6.10e-05 ...
## ..$ p_est : num 0.16
## ..$ h2_est : num 0.0146
## ..$ path_p_est : num [1:1500] 0.145 0.146 0.143 0.138 0.137 ...
## ..$ path_h2_est: num [1:1500] 0.12 0.125 0.13 0.121 0.123 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -7.96e-06 4.83e-06 -3.65e-05 -5.98e-05 -5.36e-05 ...
## ..$ p_est : num 0.933
## ..$ h2_est : num 0.0128
## ..$ path_p_est : num [1:1500] 0.904 0.904 0.905 0.903 0.9 ...
## ..$ path_h2_est: num [1:1500] 0.118 0.111 0.112 0.115 0.112 ...
You should verify whether the chains “converged”. This is not the case here, probably because the dataset is so small.
auto <- multi_auto[[1]]
plot_grid(
qplot(y = auto$path_p_est) +
theme_bigstatsr() +
geom_hline(yintercept = auto$p_est, col = "blue") +
scale_y_log10() +
labs(y = "p"),
qplot(y = auto$path_h2_est) +
theme_bigstatsr() +
geom_hline(yintercept = auto$h2_est, col = "blue") +
labs(y = "h2"),
ncol = 1, align = "hv"
)
beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = ind.chr2)
Make sure the scale of the predictions is okay (by comparing to LDpred2-inf and the other auto chains) and possibly keep only the chains that look good (e.g., see the code of the paper: https://github.com/privefl/paper-ldpred2/blob/master/code/run-ldpred2.R#L101-L108).
c(mad(pred_inf), apply(pred_auto, 2, mad))
## [1] 0.2173099580 0.0003842597 0.0002544793 0.0012422195 0.0039347399
## [6] 0.0271166625 0.0239165971
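If some chains clearly diverged (as several do here, judging from mad(pred_inf) above), you may want to average only the chains whose predictions have a comparable scale. A minimal sketch, using a simple median-based heuristic rather than the exact rule from the paper:
# Keep chains whose prediction scale (MAD) is close to the median across chains
mads <- apply(pred_auto, 2, mad)
keep <- (mads > 0.5 * median(mads)) & (mads < 2 * median(mads))
pred_auto_kept <- rowMeans(pred_auto[, keep, drop = FALSE])
Here, for simplicity, we average all chains: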
final_pred_auto <- rowMeans(pred_auto)
AUCBoot(final_pred_auto, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.6631118 0.5676395 0.7514999 0.0473751
We have seen how to run 3 versions of LDpred2 (“-inf”, “-grid” and “-auto”) for one chromosome. You need to do this for each chromosome and combine the results. Normally, you can simply sum the resulting scores or, equivalently, append the effect sizes; see the sketch below.
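A sketch of what the per-chromosome loop could look like; beta_chr stands for whichever LDpred2 effect sizes you computed for that chromosome, and the variable names are illustrative:
# Illustrative sketch: compute the PRS per chromosome and sum the scores
pred_all <- rep(0, length(ind.test))
for (chr in 1:22) {
  # indices in info_snp and in G for this chromosome
  ind.chr <- which(info_snp$chr == chr)
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  # ... compute corr for this chromosome and run LDpred2 to get beta_chr ...
  pred_all <- pred_all +
    big_prodVec(G, beta_chr, ind.row = ind.test, ind.col = ind.chr2)
}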