If you use decemedip in published research, please cite:
Shen, Ning, and Keegan Korthauer. “decemedip: hierarchical Bayesian modeling for cell type deconvolution of immunoprecipitation-based DNA methylome” bioRxiv, forthcoming.
This package will be submitted to Bioconductor. For now, you can install the development version from GitHub:
# Install stable version from Bioconductor (once available)
# BiocManager::install("decemedip")
# Install development version from GitHub
remotes::install_github("nshen7/decemedip")
After installation, load the decemedip package:
library(SummarizedExperiment)
library(dplyr)
library(ggplot2)
library(decemedip)
options(digits = 2)
Cell-free and bulk DNA methylation data obtained through MeDIP-seq reflect a mixture of methylation signals across multiple cell types. Decomposing these signals to infer cell type composition can provide valuable insights for cancer diagnosis, immune response monitoring, and other biomedical applications. However, challenges like enrichment-induced biases and sparse reference data make this task complex. The decemedip package addresses these challenges through a hierarchical Bayesian framework that estimates cell type proportions and models the relationship between MeDIP-seq counts and reference methylation data.
decemedip couples a logit-normal model with a generalize additive model (GAM) framework. For each site \(i \in \{1, ..., N\}\) in the reference panel, the input to the model is the fractional methylation levels \(x_{ik}\) for each cell type \(k \in \{1, ..., K\}\), the CpG density level \(z_i\) and the MeDIP-seq read count \(y_i\), where \(K\) is the total number of cell types. A unit simplex variable that follows a logit-normal prior, \(\boldsymbol{\pi} = (\pi_1, ..., \pi_K)\) where \(\pi_k > 0, \sum_{k=1}^K \pi_k = 1\), is included to describe the proportions of cell types in the reference panel while taking into account the correlations between these cell types.
decemedip requires three primary inputs:
Our package provides default reference matrices for hg19 and hg38 along with the corresponding CpG density information, as objects in class SummarizedExperiment. The objects can be accessed by calling data(hg19.ref.cts.se) and data(hg19.ref.anc.se), or data(hg38.ref.cts.se) and data(hg38.ref.anc.se). By default, the main function decemedip() applies hg19.ref.cts.se and hg19.ref.anc.se as the reference panels. Please refer to the manuscript for details of how the default reference panels were constructed.
On a side note, we provide the function makeReferencePanel() to allow user to build their own reference panels, which only requires input of reference CpGs and corresponded fractional methylation level matrix. The function computes CpG density on its own. Note that the cell type-specific sites and anchor sites need to be included in two SummarizedExperiment objects to be inputs to the main function decemedip(). See ?makeReferencePanel for more information.
We show how the reference panel look like in the following chunks:
data(hg19.ref.cts.se)
print(hg19.ref.cts.se)
## class: RangedSummarizedExperiment
## dim: 2500 25
## metadata(0):
## assays(1): beta
## rownames(2500): cg18856478 cg20820767 ... cg01071459 cg20726993
## rowData names(4): probe label pos n_cpgs_100bp
## colnames(25): Monocytes_EPIC B-cells_EPIC ... Upper_GI Uterus_cervix
## colData names(0):
head(granges(hg19.ref.cts.se))
## GRanges object with 6 ranges and 4 metadata columns:
## seqnames ranges strand | probe label
## <Rle> <IRanges> <Rle> | <character> <character>
## cg18856478 chr1 43814358 * | cg18856478 Monocytes_EPIC hyper..
## cg20820767 chr1 45082840 * | cg20820767 Monocytes_EPIC hyper..
## cg14855367 chr3 191048308 * | cg14855367 Monocytes_EPIC hyper..
## cg25913761 chr15 90727560 * | cg25913761 Monocytes_EPIC hyper..
## cg21546950 chr1 77904032 * | cg21546950 Monocytes_EPIC hyper..
## cg14981189 chr10 5113871 * | cg14981189 Monocytes_EPIC hyper..
## pos n_cpgs_100bp
## <integer> <integer>
## cg18856478 43814358 7
## cg20820767 45082840 12
## cg14855367 191048308 1
## cg25913761 90727560 4
## cg21546950 77904032 1
## cg14981189 5113871 1
## -------
## seqinfo: 22 sequences from hg19 genome; no seqlengths
data(hg19.ref.anc.se)
print(hg19.ref.anc.se)
## class: RangedSummarizedExperiment
## dim: 1000 25
## metadata(0):
## assays(1): beta
## rownames(1000): 353534 76294 ... 73948 87963
## rowData names(7): probe pos ... avg_beta_rank n_cpgs_100bp
## colnames(25): Monocytes_EPIC B-cells_EPIC ... Upper_GI Uterus_cervix
## colData names(0):
head(granges(hg19.ref.anc.se))
## GRanges object with 6 ranges and 7 metadata columns:
## seqnames ranges strand | probe pos label
## <Rle> <IRanges> <Rle> | <character> <integer> <character>
## 353534 chr19 19496477 * | cg15264323 19496477 All-tissue U
## 76294 chr1 154193656 * | cg13576006 154193656 All-tissue U
## 300825 chr12 860787 * | cg13284045 860787 All-tissue U
## 369864 chr5 43514988 * | cg20545087 43514988 All-tissue U
## 247131 chr16 278788 * | cg06819375 278788 All-tissue U
## 308091 chr12 96429111 * | cg09725090 96429111 All-tissue U
## margin avg_beta avg_beta_rank n_cpgs_100bp
## <numeric> <numeric> <numeric> <integer>
## 353534 0.1 0.063240 70720.0 10
## 76294 0.1 0.013492 19877.5 7
## 300825 0.1 0.029155 36307.5 1
## 369864 0.1 0.038771 45387.5 6
## 247131 0.1 0.032970 39748.5 2
## 308091 0.1 0.076555 77269.0 7
## -------
## seqinfo: 22 sequences from hg19 genome; no seqlengths
The main function decemedip() fits the decemedip model. It allows two types of input:
We provide instructions for both input types as follows.
sample_bam_file <- "path/to/bam/files"
paired <- TRUE # whether the sequencing is paired-end
output <- decemedip(sample_bam_file = sample_bam_file,
paired = paired,
cores = 4)
PS: By default, the decemedip() function uses a hg19 reference panel. But users may add the arguments ref_cts = hg38.ref.cts.se, ref_anc = hg38.ref.anc.se to apply read counts extraction on hg38 data.
We use built-in objects of the package that contains read counts of the prostate tumor patient-derived xenograft (PDX) samples from the Berchuck et al. 2022 [1] study, pdx.counts.cts.se and pdx.counts.anc.se, to demonstrate the output and diagnostics in this vignette.
[1] Berchuck JE, Baca SC, McClure HM, Korthauer K, Tsai HK, Nuzzo PV, et al. Detecting neuroendocrine prostate cancer through tissue-informed cell-free DNA methylation analysis. Clinical Cancer Research. 2022;28(5):928–938.
data(pdx.counts.cts.se)
data(pdx.counts.anc.se)
We extract the sample LuCaP_147CR from this example dataset for follwing illustration:
counts_cts <- assays(pdx.counts.cts.se)$counts[,'LuCaP_147CR'] # read counts of cell type-specific CpGs of the sample 'LuCaP_147CR'
counts_anc <- assays(pdx.counts.anc.se)$counts[,'LuCaP_147CR'] # read counts of anchor CpGs of the sample 'LuCaP_147CR'
Due to the vignette running time limit by Bioconductor, we only run 300 iterations (iter = 300) for the purpose of demonstration, which causes the warning of Bulk Effective Samples Size (ESS) is too low. In regular cases, we recommend to run 2000 iterations (the default), or at least 1000 iterations for a stable posterior inference.
output <- decemedip(counts_cts = counts_cts,
counts_anc = counts_anc,
diagnostics = TRUE,
cores = 4,
iter = 200)
## MCMC converged with seed 2024
The output of decemedip() is a list containing two elements:
names(output)
## [1] "data_list" "posterior"
data_list: An organized list of variables used as input to the Stan posterior sampling function.posterior: An stanfit object produced by Stan representing the fitted posteriors.After running the model, you may extract and save the summary of fitted posteriors using the monitor() and extract() functions provided by the RStan package. See documentation of RStan for details of these functions.
library(rstan)
Extract the fitted posterior of cell type proportions (\(\boldsymbol\pi\)):
smr_pi.df <- getSummaryOnPi(output$posterior)
head(smr_pi.df)
## cell_type mean se_mean sd 2.5% 25% 50% 75%
## pi[1] Monocytes_EPIC 0.00020 4.2e-05 0.00056 3.8e-11 7.5e-08 1.5e-06 6.9e-05
## pi[2] B-cells_EPIC 0.00037 1.1e-04 0.00165 8.2e-11 1.3e-07 5.3e-06 6.5e-05
## pi[3] CD4T-cells_EPIC 0.00024 5.8e-05 0.00092 7.5e-12 2.5e-08 3.5e-06 8.6e-05
## pi[4] NK-cells_EPIC 0.00023 5.7e-05 0.00081 4.9e-12 7.1e-08 1.9e-06 6.5e-05
## pi[5] CD8T-cells_EPIC 0.00056 2.6e-04 0.00301 4.5e-11 1.1e-07 4.8e-06 8.6e-05
## pi[6] Neutrophils_EPIC 0.00029 7.0e-05 0.00102 3.4e-12 8.4e-08 2.9e-06 6.8e-05
## 97.5% n_eff Rhat valid
## pi[1] 0.0020 167 1 1
## pi[2] 0.0046 203 1 1
## pi[3] 0.0020 235 1 1
## pi[4] 0.0024 183 1 1
## pi[5] 0.0039 126 1 1
## pi[6] 0.0035 184 1 1
cell_type: The name of the parameter or variable being analyzed.mean: The posterior mean, representing the point estimate of the parameter.se_mean: The standard error of the mean, calculated as sd / sqrt(n_eff), indicating precision of the mean estimate.sd: The posterior standard deviation, representing the spread or uncertainty of the parameter estimate.2.5%, 25%, 50% (median), 75%, 97.5%: Percentiles of the posterior distribution, providing a summary of parameter uncertainty. These define the 95% credible interval (2.5% to 97.5%).n_eff: The effective sample size, indicating how many independent samples the chain produced after accounting for autocorrelation.Rhat: The potential scale reduction factor, measuring chain convergence. Values close to 1.00 suggest good convergence.valid: A flag indicating whether diagnostic checks (e.g., Rhat and n_eff) passed for this parameter (1 = passed, 0 = potential issues).Plotting out the fitted cell type proportions with credible intervals:
labels <- gsub('_', ' ', smr_pi.df$cell_type)
labels <- gsub('(.*) EPIC', '\\1', labels)
smr_pi.df |>
mutate(cell_type = factor(cell_type, labels = labels)) |>
ggplot(aes(cell_type, mean)) +
geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`),
position = position_dodge2(width = 0.035),
linewidth = 7, alpha = 0.3) +
geom_linerange(aes(ymin = `25%`, ymax = `75%`),
position = position_dodge2(width = 0.035),
linewidth = 7, alpha = 1) +
geom_point(position = position_dodge2(width = 0.035),
fill = 'white', shape = 21, size = 8) +
theme_classic() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Note that this plot is only accessible when diagnostics is set toTRUE in the decemedip() function.
The actual read counts (orange) and the fitted counts predicted by the GAM component (black) are shown in the figure across varying levels of CpG density. Grey area represents the 95% credible intervals of the predicted counts. `CpG density: x’ means that there are x CpGs in the 100-bp window surrounding the CpG.
plotDiagnostics(output, plot_type = "y_fit") +
ylim(0, 300)
The decemedip package provides a robust framework for cell type deconvolution from MeDIP-seq data. By following this vignette, users can apply the method to their own datasets, extract key model outputs, and generate diagnostic plots for analysis.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Vancouver
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] rstan_2.32.6 StanHeaders_2.32.10
## [3] decemedip_0.99.0 ggplot2_3.5.1
## [5] dplyr_1.1.4 SummarizedExperiment_1.36.0
## [7] Biobase_2.66.0 GenomicRanges_1.58.0
## [9] GenomeInfoDb_1.42.3 IRanges_2.40.1
## [11] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [13] MatrixGenerics_1.18.1 matrixStats_1.5.0
## [15] BiocStyle_2.34.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 bitops_1.0-9 gridExtra_2.3
## [4] httr2_1.1.0 inline_0.3.21 biomaRt_2.62.1
## [7] rlang_1.1.5 magrittr_2.0.3 MEDIPS_1.58.0
## [10] compiler_4.4.2 RSQLite_2.3.9 loo_2.8.0
## [13] png_0.1-8 vctrs_0.6.5 stringr_1.5.1
## [16] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [19] dbplyr_2.5.0 XVector_0.46.0 labeling_0.4.3
## [22] Rsamtools_2.22.0 rmarkdown_2.29 UCSC.utils_1.2.0
## [25] preprocessCore_1.68.0 tinytex_0.54 purrr_1.0.4
## [28] bit_4.5.0.1 xfun_0.50 zlibbioc_1.52.0
## [31] cachem_1.1.0 jsonlite_1.8.9 progress_1.2.3
## [34] blob_1.2.4 DelayedArray_0.32.0 BiocParallel_1.40.0
## [37] parallel_4.4.2 prettyunits_1.2.0 R6_2.6.1
## [40] bslib_0.9.0 stringi_1.8.4 rtracklayer_1.66.0
## [43] DNAcopy_1.80.0 jquerylib_0.1.4 Rcpp_1.0.14
## [46] bookdown_0.42 knitr_1.49 R.utils_2.12.3
## [49] Matrix_1.7-1 tidyselect_1.2.1 rstudioapi_0.17.1
## [52] abind_1.4-8 yaml_2.3.10 codetools_0.2-20
## [55] curl_6.2.0 pkgbuild_1.4.6 lattice_0.22-6
## [58] tibble_3.2.1 withr_3.0.2 KEGGREST_1.46.0
## [61] evaluate_1.0.3 RcppParallel_5.1.10 BiocFileCache_2.14.0
## [64] xml2_1.3.6 Biostrings_2.74.1 pillar_1.10.1
## [67] BiocManager_1.30.25 filelock_1.0.3 generics_0.1.3
## [70] RCurl_1.98-1.16 hms_1.1.3 rstantools_2.4.0
## [73] munsell_0.5.1 scales_1.3.0 gtools_3.9.5
## [76] glue_1.8.0 tools_4.4.2 BiocIO_1.16.0
## [79] BSgenome_1.74.0 GenomicAlignments_1.42.0 XML_3.99-0.18
## [82] grid_4.4.2 QuickJSR_1.5.1 AnnotationDbi_1.68.0
## [85] colorspace_2.1-1 GenomeInfoDbData_1.2.13 restfulr_0.0.15
## [88] cli_3.6.4 rappdirs_0.3.3 S4Arrays_1.6.0
## [91] gtable_0.3.6 R.methodsS3_1.8.2 sass_0.4.9
## [94] digest_0.6.37 SparseArray_1.6.1 farver_2.1.2
## [97] rjson_0.2.23 R.oo_1.27.0 memoise_2.0.1
## [100] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
## [103] bit64_4.6.0-1