This exercise generally follows the vignette by Florian Privé available online here.
The most relevant literature is Privé et al., bioRxiv 2020, which presents LDpred2. For more details on the original LDpred method, see Vilhjálmsson et al., AJHG 2015. The R packages bigsnpr and bigstatsr are presented and extended in the following publications: Privé et al., Bioinformatics 2018; Privé et al., Genetics 2019; Privé et al., AJHG 2019; Privé et al., Bioinformatics 2020.
There are several requirements for completing this exercise: the bigsnpr R package and the example data in the tmp-data folder. You can find the data files here.
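If needed, you can install the packages used in this exercise first (a minimal setup; skip any you already have):
# Install the packages used in this exercise (skip those already installed)
install.packages("bigsnpr")  # also installs bigstatsr
install.packages(c("R.utils", "ggplot2", "viridis", "dplyr"))
First we load the bigsnpr package.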
library(bigsnpr)
## Warning: package 'bigsnpr' was built under R version 3.6.2
## Loading required package: bigstatsr
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.0 (2020-02-14 07:10:20 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.23.0 successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.9.2 successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, nullfile,
## parse, warnings
#For plotting
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
Then we read from the bed/bim/fam files; this generates the .bk and .rds backing files. Make sure the data folder is in the working directory. You can check the working directory using getwd() and change it using setwd("your/working/directory").
setwd("/Users/au507860/Dropbox/Cloud_folder/Courses/DiabetesAcademy2020/DDA_2020_LDpred2")
snp_readBed("tmp-data/public-data.bed")
## [1] "/Users/au507860/Dropbox/Cloud_folder/Courses/DiabetesAcademy2020/DDA_2020_LDpred2/tmp-data/public-data.rds"
Load the data and store it in the “bigSNP” R session object.
obj.bigSNP <- snp_attach("tmp-data/public-data.rds")
Inspect the contents of this object.
str(obj.bigSNP, max.level = 2, strict.width = "cut")
## List of 3
## $ genotypes:Reference class 'FBM.code256' [package "bigstatsr"] with 15 ..
## ..and 26 methods, of which 12 are possibly relevant:
## .. add_columns, as.FBM, bm, bm.desc, check_dimensions,
## .. check_write_permissions, copy#envRefClass, initialize,
## .. initialize#FBM, save, show#envRefClass, show#FBM
## $ fam :'data.frame': 559 obs. of 6 variables:
## ..$ family.ID : chr [1:559] "EUR_GBR" "EUR_GBR" "EUR_GBR" "EUR_GBR" ...
## ..$ sample.ID : chr [1:559] "HG00096" "HG00097" "HG00099" "HG00100" ...
## ..$ paternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ maternal.ID: int [1:559] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ sex : int [1:559] 1 2 2 2 1 2 1 2 2 1 ...
## ..$ affection : int [1:559] 1 2 1 1 1 1 2 1 2 1 ...
## $ map :'data.frame': 130816 obs. of 6 variables:
## ..$ chromosome : int [1:130816] 2 2 2 2 2 2 2 2 2 2 ...
## ..$ marker.ID : chr [1:130816] "rs13400442" "rs7594567" "rs7597758" "..
## ..$ genetic.dist: int [1:130816] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ physical.pos: int [1:130816] 18506 21833 22398 28228 32003 32005 36..
## ..$ allele1 : chr [1:130816] "C" "G" "T" "A" ...
## ..$ allele2 : chr [1:130816] "T" "C" "C" "G" ...
## - attr(*, "class")= chr "bigSNP"
Get aliases for useful bigSNP variables
G <- obj.bigSNP$genotypes
CHR <- obj.bigSNP$map$chromosome
POS <- obj.bigSNP$map$physical.pos
y <- obj.bigSNP$fam$affection - 1
NCORES <- nb_cores()
LDpred requires an LD reference, i.e., a set of genotypes that ideally satisfies the two following conditions.
The ancestry of the LD reference should be similar to that of the sample on which the summary statistics are based.
The individuals should not be closely genetically related, e.g., they should be more distant than cousins.
The test/validation data can sometimes be used as the LD reference. You can use bigsnpr to prune the data to ensure it satisfies the two conditions above.
The details are shown in Privé et al., Bioinformatics 2020, and a vignette for how to conduct PCA analyses using the package can be found here.
First we download the 1000 Genomes data, which can take some time if the internet connection is slow.
bedfile <- download_1000G("data")
First, let us detect all pairs of related individuals.
plink2 <- download_plink2("data")
rel <- snp_plinkKINGQC(
plink2.path = plink2,
bedfile.in = bedfile,
thr.king = 2^-4.5,
make.bed = FALSE,
ncores = nb_cores()
)
str(rel)
## 'data.frame': 31 obs. of 8 variables:
## $ FID1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IID1 : chr "HG00120" "HG00240" "HG00542" "HG00595" ...
## $ FID2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ IID2 : chr "HG00116" "HG00238" "HG00475" "HG00584" ...
## $ NSNP : int 1664852 1664852 1664852 1664852 1664852 1664852 1664852 1664852 1664852 1664852 ...
## $ HETHET : num 0.111 0.1105 0.1024 0.101 0.0992 ...
## $ IBS0 : num 0.0333 0.0367 0.0302 0.037 0.0367 ...
## $ KINSHIP: num 0.0821 0.068 0.0854 0.0541 0.0535 ...
We then prune related individuals and compute PCA on the remaining (unrelated) individuals. The function bed_autoSVD() iteratively prunes variants to reduce the risk of the PCA capturing Linkage Disequilibrium (LD).
(obj.bed <- bed(bedfile))
## A 'bed' object with 2490 samples and 1664852 variants.
ind.rel <- match(c(rel$IID1, rel$IID2), obj.bed$fam$sample.ID)
ind.norel <- rows_along(obj.bed)[-ind.rel]
obj.svd <- bed_autoSVD(obj.bed, ind.row = ind.norel, k = 20,
ncores = nb_cores())
##
## Phase of clumping (on MAC) at r^2 > 0.2.. keep 392665 variants.
## Discarding 0 variant with MAC < 10.
##
## Iteration 1:
## Computing SVD..
## 679 outlier variants detected..
## 7 long-range LD regions detected..
##
## Iteration 2:
## Computing SVD..
## 0 outlier variant detected..
##
## Converged!
We can plot these PCs.
anc_info <- read.delim('data/1000G_phase3_common_norel.fam2')[-ind.rel,]
library(cowplot)  # provides plot_grid()
plot_grid(plotlist = lapply(1:4, function(k) {
plot(obj.svd, type = "scores", scores = 2 * k - 1:0, coeff = 0.6) +
aes(color=anc_info$Super.Population) +
scale_color_viridis(name="Ancestry", discrete = TRUE)
}), scale = 0.95)
Then, we check whether there are individual outliers, which could be evidence of genotyping issues.
prob <- bigutilsr::prob_dist(obj.svd$u, ncores = nb_cores())
S <- prob$dist.self / sqrt(prob$dist.nn)
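Before filtering, you can inspect the distribution of this statistic; the 0.5 cutoff used in the next step is a visual choice, not a fixed rule:
# Histogram of the outlierness statistic S; individuals with large S
# (here, S >= 0.5) are removed as outliers in the next step
ggplot(data.frame(S = S), aes(S)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = 0.5, color = "red", linetype = 2)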
We can then rerun the PCA without these outliers (individuals with S >= 0.5).
ind.row <- ind.norel[S < 0.5]
ind.col <- attr(obj.svd, "subset")
obj.svd2 <- bed_autoSVD(obj.bed, ind.row = ind.row,
ind.col = ind.col, thr.r2 = NA,
k = 20, ncores = nb_cores())
##
## Skipping clumping.
## Discarding 0 variant with MAC < 10.
##
## Iteration 1:
## Computing SVD..
## 22 outlier variants detected..
## 1 long-range LD region detected..
##
## Iteration 2:
## Computing SVD..
## 0 outlier variant detected..
##
## Converged!
We would still like to obtain PCA values for the remaining individuals. We do this by projecting them onto these PCs, to get PCs for all individuals.
PCs <- matrix(NA, nrow(obj.bed), ncol(obj.svd2$u))
PCs[ind.row, ] <- predict(obj.svd2)
proj <- bed_projectSelfPCA(obj.svd2, obj.bed,
ind.row = rows_along(obj.bed)[-ind.row],
ncores = 1) # parallelism not useful here -> too few individuals
PCs[-ind.row, ] <- proj$OADP_proj
We can plot the PCs again.
anc_info <- read.delim('data/1000G_phase3_common_norel.fam2')
PCs_df <- as.data.frame(PCs)
colnames(PCs_df) <- paste("PC",1:20,sep="")
ggplot(data = PCs_df, mapping = aes(x = PC1, y = PC2)) +
scale_color_viridis(name="Ancestry", discrete = TRUE)+
geom_point(aes(color = anc_info$Super.Population))
ggplot(data = PCs_df, mapping = aes(x = PC3, y = PC4)) +
scale_color_viridis(name="Ancestry", discrete = TRUE)+
geom_point(aes(color = anc_info$Super.Population))
We want the set of individuals in our LD reference to have similar ancestry to the individuals on which our summary statistics are based. E.g., if we want to use summary statistics based on individuals with East Asian ancestry, we should exclude individuals with a different ancestry from our LD reference. If ancestry labels are available, we can use those. If not, we can project the LD reference data onto the 1000 Genomes PCs and exclude individuals that land far away from the individuals with the target ancestry. You can use cov <- bigutilsr::covrob_ogk(PC[ind_eur, ]) and log_dist <- log(mahalanobis(PC, center = cov$center, cov = cov$cov)) for this purpose. A minimal sketch, assuming European-ancestry summary statistics (ind_eur and the distance cutoff below are illustrative):
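# A sketch: robustly estimate the center and covariance of the
# target-ancestry cluster, then flag individuals far from it.
# 'ind_eur' and the distance cutoff are illustrative assumptions.
ind_eur <- which(anc_info$Super.Population == "EUR")
cov <- bigutilsr::covrob_ogk(PCs[ind_eur, ])
log_dist <- log(mahalanobis(PCs, center = cov$center, cov = cov$cov))
hist(log_dist, breaks = 50)  # inspect before choosing a cutoff
ind_keep <- which(log_dist < 5)  # hypothetical cutoff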
To train the polygenic risk score, the user should load external summary statistics
sumstats <- bigreadr::fread2("tmp-data/public-data-sumstats.txt")
str(sumstats)
## 'data.frame': 130816 obs. of 10 variables:
## $ chromosome : int 2 2 2 2 2 2 2 2 2 2 ...
## $ marker.ID : chr "rs13400442" "rs7594567" "rs7597758" "rs13383216" ...
## $ physical.pos: int 18506 21833 22398 28228 32003 32005 36787 55237 56916 61687 ...
## $ allele1 : chr "C" "G" "T" "A" ...
## $ allele2 : chr "T" "C" "C" "G" ...
## $ beta : num -0.073 0.0439 -0.3325 -0.5445 -0.4881 ...
## $ beta_se : num 0.277 0.248 0.192 0.247 0.242 ...
## $ n_case : int 157 157 157 157 157 157 157 157 157 157 ...
## $ n_control : int 402 402 402 402 402 402 402 402 402 402 ...
## $ p : num 0.7925 0.8593 0.0846 0.028 0.0439 ...
We split the genotype data, using one part to tune the hyper-parameters (the validation set) and the other part to evaluate statistical properties of the polygenic risk score, such as AUC (the test set). Here we use 400 individuals as the validation set.
set.seed(1)
ind.val <- sample(nrow(G), 400)
ind.test <- setdiff(rows_along(G), ind.val)
To match the variants of the genotype data with those of the summary statistics, the variables “chr” (chromosome number), “pos” (physical position), “a0” (reference allele) and “a1” (derived allele) should be available in both the summary statistics and the genotype data. These 4 variables are used to match variants between the two data frames.
sumstats$n_eff <- 4 / (1 / sumstats$n_case + 1 / sumstats$n_control)
sumstats$n_case <- sumstats$n_control <- NULL
names(sumstats) <- c("chr", "rsid", "pos", "a0", "a1", "beta", "beta_se", "p", "n_eff")
map <- obj.bigSNP$map[-(2:3)]
names(map) <- c("chr", "pos", "a0", "a1")
info_snp <- snp_match(sumstats, map)
## 130,816 variants to be matched.
## 18,932 ambiguous SNPs have been removed.
## 111,884 variants have been matched; 0 were flipped and 0 were reversed.
If no or few variants are actually flipped, you might want to disable the strand flipping option. Here, these are simulated data so all variants use the same strand and the same reference.
info_snp <- snp_match(sumstats, map, strand_flip = FALSE)
## 130,816 variants to be matched.
## 130,816 variants have been matched; 0 were flipped and 0 were reversed.
Some variants may still not match in terms of allele frequencies and LD, and we want to exclude those. As these data are simulated, we do not have this issue, but in practice we recommend you follow the procedure outlined below (and in the supplementary note of the paper).
sd <- sqrt(big_colstats(G, ind.val, ncores = NCORES)$var)
sd_val <- sd[info_snp$`_NUM_ID_`]
sd_ss <- with(info_snp, 1 / sqrt(n_eff / 4 * beta_se^2))
is_bad <-
sd_ss < (0.5 * sd_val) | sd_ss > (sd_val + 0.1) | sd_ss < 0.1 | sd_val < 0.05
qplot(sd_val, sd_ss, color = is_bad, alpha = I(0.5)) +
theme_bigstatsr() +
coord_equal() +
scale_color_viridis_d(direction = -1) +
geom_abline(linetype = 2, color = "red") +
labs(x = "Standard deviations in the validation set",
y = "Standard deviations derived from the summary statistics",
color = "Removed?")
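If some variants were flagged, you could remove them before the next steps (not needed for these simulated data):
# Not needed here, but in practice exclude the flagged variants:
info_snp_qc <- info_snp[!is_bad, ]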
In practice, we recommend using the HapMap3 variants used in PRS-CS and the LDpred2 papers (until we find a better set of variants). Information about these variants can be retrieved with
HM3_info <- readRDS(url("https://github.com/privefl/bigsnpr/raw/master/data-raw/hm3_variants.rds"))
str(HM3_info)
## 'data.frame': 1120696 obs. of 8 variables:
## $ chr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ rsid : chr "rs3131972" "rs3131969" "rs1048488" "rs12562034" ...
## $ pos : int 752721 754182 760912 768448 779322 838555 846808 853954 854250 861808 ...
## $ a1 : chr "A" "A" "C" "A" ...
## $ a0 : chr "G" "G" "T" "G" ...
## $ maf : num 0.161 0.1282 0.16 0.0925 0.1183 ...
## $ pos_hg19: int 762858 764319 771049 778585 789459 848692 856945 864091 864387 871945 ...
## $ pos_hg17: int 802721 804182 810912 818448 829322 888555 896808 903954 904250 912088 ...
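For example, you could restrict your summary statistics to these variants before matching (for illustration only; the simulated data used in this exercise is not restricted to HapMap3 variants):
# Illustration only: keep summary statistics for HapMap3 variants
sumstats_hm3 <- snp_match(sumstats, HM3_info[, c("chr", "pos", "a0", "a1")])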
First, you need to compute correlations between variants. We recommend using a window size of 3 cM (see the LDpred2 paper).
POS2 <- snp_asGeneticPos(CHR, POS, dir = "tmp-data", ncores = 3)
## indices in info_snp
ind.chr <- which(info_snp$chr == 2)
df_beta <- info_snp[ind.chr, c("beta", "beta_se", "n_eff")]
## indices in G
ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
corr0 <- snp_cor(G, ind.col = ind.chr2, ncores = NCORES,
infos.pos = POS2[ind.chr2], size = 3 / 1000)
corr <- bigsparser::as_SFBM(as(corr0, "dgCMatrix"))
We then estimate the SNP heritability h2 from the summary statistics using LD score regression:
(ldsc <- snp_ldsc2(corr0, df_beta))
## int h2
## 1.0000000 0.1152166
h2_est <- ldsc[["h2"]]
We first run the infinitesimal model (LDpred2-inf), which has an analytical solution given h2:
beta_inf <- snp_ldpred2_inf(corr, df_beta, h2 = h2_est)
pred_inf <- big_prodVec(G, beta_inf, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_inf, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.66347145 0.56739110 0.75341191 0.04784474
In practice, we recommend testing multiple values for h2 and p.
(h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4))
## [1] 0.0807 0.1152 0.1613
(p_seq <- signif(seq_log(1e-4, 1, length.out = 12), 2))
## [1] 0.00010 0.00023 0.00053 0.00120 0.00280 0.00660 0.01500 0.03500
## [9] 0.08100 0.19000 0.43000 1.00000
(params <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE)))
This takes several minutes if you do not have many cores.
beta_grid <- snp_ldpred2_grid(corr, df_beta, params, ncores = NCORES)
pred_grid <- big_prodMat(G, beta_grid, ind.col = ind.chr2)
params$score <- big_univLogReg(as_FBM(pred_grid[ind.val, ]), y[ind.val])$score
library(ggplot2)
ggplot(params, aes(x = p, y = score, color = as.factor(h2))) +
theme_bigstatsr() +
geom_point() +
geom_line() +
scale_x_log10(breaks = 10^(-5:0), minor_breaks = params$p) +
facet_wrap(~ sparse, labeller = label_both) +
labs(y = "GLM Z-Score", color = "h2") +
theme(legend.position = "top", panel.spacing = unit(1, "lines"))
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
params %>%
mutate(sparsity = colMeans(beta_grid == 0), id = row_number()) %>%
arrange(desc(score)) %>%
mutate_at(c("score", "sparsity"), round, digits = 3) %>%
slice(1:10)
You can then choose the best model according to your preferred criterion (e.g., max AUC). Here, we use the Z-score from the regression of the phenotype on the PRS, since we have found it more robust than using the AUC. It also enables adjusting for covariates in this step.
best_grid_nosp <- params %>%
mutate(id = row_number()) %>%
filter(!sparse) %>%
arrange(desc(score)) %>%
slice(1) %>%
pull(id) %>%
beta_grid[, .]
pred_nosp <- big_prodVec(G, best_grid_nosp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_nosp, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.67077797 0.57109988 0.76419644 0.04922476
best_grid_sp <- params %>%
mutate(id = row_number()) %>%
filter(sparse) %>%
arrange(desc(score)) %>%
slice(1) %>%
pull(id) %>%
beta_grid[, .]
pred_sp <- big_prodVec(G, best_grid_sp, ind.row = ind.test, ind.col = ind.chr2)
AUCBoot(pred_sp, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.67181865 0.57173198 0.76486522 0.04934336
LDpred2-auto estimates p and h2 directly from the data, so it does not require a validation set to tune hyper-parameters. You can run many chains in parallel with different initial values for p.
multi_auto <- snp_ldpred2_auto(corr, df_beta, h2_init = h2_est,
vec_p_init = seq_log(1e-4, 0.9, length.out = NCORES),
ncores = NCORES)
str(multi_auto)
## List of 6
## $ :List of 5
## ..$ beta_est : num [1:52556] -1.27e-07 7.60e-08 -5.86e-07 -9.69e-07 -8.65e-07 ...
## ..$ p_est : num 0.00338
## ..$ h2_est : num 0.000205
## ..$ path_p_est : num [1:1500] 0.000119 0.000145 0.000131 0.00021 0.000176 ...
## ..$ path_h2_est: num [1:1500] 0.112 0.03273 0.00694 0.00905 0.00842 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -8.54e-08 5.14e-08 -3.89e-07 -6.38e-07 -5.72e-07 ...
## ..$ p_est : num 0.00902
## ..$ h2_est : num 0.000136
## ..$ path_p_est : num [1:1500] 0.000521 0.000532 0.000539 0.000493 0.000596 ...
## ..$ path_h2_est: num [1:1500] 0.0661 0.0424 0.0423 0.0568 0.0523 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -4.17e-07 2.51e-07 -1.90e-06 -3.12e-06 -2.80e-06 ...
## ..$ p_est : num 0.0105
## ..$ h2_est : num 0.000665
## ..$ path_p_est : num [1:1500] 0.0038 0.00348 0.00402 0.0043 0.00443 ...
## ..$ path_h2_est: num [1:1500] 0.116 0.131 0.117 0.125 0.159 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -1.32e-06 7.95e-07 -6.02e-06 -9.86e-06 -8.84e-06 ...
## ..$ p_est : num 0.132
## ..$ h2_est : num 0.0021
## ..$ path_p_est : num [1:1500] 0.0217 0.0223 0.022 0.0232 0.0232 ...
## ..$ path_h2_est: num [1:1500] 0.0992 0.0981 0.1033 0.1001 0.0984 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -9.05e-06 5.55e-06 -4.15e-05 -6.80e-05 -6.10e-05 ...
## ..$ p_est : num 0.16
## ..$ h2_est : num 0.0146
## ..$ path_p_est : num [1:1500] 0.145 0.146 0.143 0.138 0.137 ...
## ..$ path_h2_est: num [1:1500] 0.12 0.125 0.13 0.121 0.123 ...
## $ :List of 5
## ..$ beta_est : num [1:52556] -7.96e-06 4.83e-06 -3.65e-05 -5.98e-05 -5.36e-05 ...
## ..$ p_est : num 0.933
## ..$ h2_est : num 0.0128
## ..$ path_p_est : num [1:1500] 0.904 0.904 0.905 0.903 0.9 ...
## ..$ path_h2_est: num [1:1500] 0.118 0.111 0.112 0.115 0.112 ...
You should verify whether the chains “converged”. This is not the case here, probably because the dataset is so small.
auto <- multi_auto[[1]]
plot_grid(
qplot(y = auto$path_p_est) +
theme_bigstatsr() +
geom_hline(yintercept = auto$p_est, col = "blue") +
scale_y_log10() +
labs(y = "p"),
qplot(y = auto$path_h2_est) +
theme_bigstatsr() +
geom_hline(yintercept = auto$h2_est, col = "blue") +
labs(y = "h2"),
ncol = 1, align = "hv"
)
beta_auto <- sapply(multi_auto, function(auto) auto$beta_est)
pred_auto <- big_prodMat(G, beta_auto, ind.row = ind.test, ind.col = ind.chr2)
Make sure the scale of the predictions is okay (by comparing to LDpred2-inf and the other auto chains) and possibly keep only the chains that look good (e.g., see the code of the paper: https://github.com/privefl/paper-ldpred2/blob/master/code/run-ldpred2.R#L101-L108).
c(mad(pred_inf), apply(pred_auto, 2, mad))
## [1] 0.2173099580 0.0003842597 0.0002544793 0.0012422195 0.0039347399
## [6] 0.0271166625 0.0239165971
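If some chains clearly diverged (as several do here, judging from mad(pred_inf) above), you may want to average only the chains whose predictions have a comparable scale. A minimal sketch, using a simple median-based heuristic rather than the exact rule from the paper:
# Keep chains whose prediction scale (MAD) is close to the median across chains
mads <- apply(pred_auto, 2, mad)
keep <- (mads > 0.5 * median(mads)) & (mads < 2 * median(mads))
pred_auto_kept <- rowMeans(pred_auto[, keep, drop = FALSE])
Here, for simplicity, we average all chains: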
final_pred_auto <- rowMeans(pred_auto)
AUCBoot(final_pred_auto, y[ind.test])
## Mean 2.5% 97.5% Sd
## 0.6631118 0.5676395 0.7514999 0.0473751
We have seen how to run 3 versions of LDpred2 (“-inf”, “-grid” and “-auto”) for one chromosome. You need to do this for each chromosome and combine the results. Normally, you can simply sum the resulting scores or, equivalently, append the effect sizes; see the sketch below.
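A sketch of what the per-chromosome loop could look like; beta_chr stands for whichever LDpred2 effect sizes you computed for that chromosome, and the variable names are illustrative:
# Illustrative sketch: compute the PRS per chromosome and sum the scores
pred_all <- rep(0, length(ind.test))
for (chr in 1:22) {
  # indices in info_snp and in G for this chromosome
  ind.chr <- which(info_snp$chr == chr)
  ind.chr2 <- info_snp$`_NUM_ID_`[ind.chr]
  # ... compute corr for this chromosome and run LDpred2 to get beta_chr ...
  pred_all <- pred_all +
    big_prodVec(G, beta_chr, ind.row = ind.test, ind.col = ind.chr2)
}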