scMET Bayesian modelling of DNA methylation heterogeneity at single-cell resolution

Andreas C. Kapourani1,2,3*, Ricard Argelaguet4**, Guido Sanguinetti3,5*** and Catalina Vallejos1,2****

1Human Genetic Unit (HGU), University of Edinburgh, UK
2Institute of Genetics and Molecular Medicine (IGMM), University of Edinburgh, UK
3School of Informatics, University of Edinburgh, UK
4European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
5SISSA, International School of Advanced Studies, Trieste, Italy

*c.a.kapourani@ed.ac.uk or kapouranis.andreas@gmail.com
**ricard@ebi.ac.uk
***gsanguin@sissa.it
****catalina.vallejos@igmm.ed.ac.uk

2020-07-08

Package

scMET 0.0.0.9000

1 Installation

## Install from Github repository
# install.packages("devtools")
devtools::install_github("andreaskapou/scMET")

2 Background

High throughput measurements of DNA methylomes at single-cell resolution are a promising resource to quantify the heterogeneity of DNA methylation and uncover its role in gene regulation. However, limitations of the technology result in sparse CpG coverage, effectively posing challenges to robustly quantify genuine DNA methylation heterogeneity. Here we introduce scMET, a Bayesian framework for the analysis of single-cell DNA methylation data. This modelling approach combines a hierarchical beta-binomial specification with a generalised linear model framework with the aim of capturing biological overdispersion and overcome data sparsity by sharing information across cells and genomic features.

To disentangle technical from biological variability and overcome data sparsity, scMET couples a hierarchical BB model with a GLM framework (Fig.1a-b). For each cell \(i\) and region \(j\), the input for scMET is the number of CpG sites that are observed to be methylated (\(Y_{ij}\)) and the total number of sites for which methylation status was recorded (\(n_{ij}\)). The BB model uses feature-specific mean parameters \(\mu_j\) to quantify overall DNAm across all cells and biological overdispersion parameters \(\gamma_j\) as a proxy for cell-to-cell DNAm heterogeneity. These parameters capture the amount of variability that is not explained by binomial sampling noise, which would only account for technical variation.

The GLM framework is incorporated at two levels. Firstly, to introduce feature-specific covariates \(\mathbf{x}_{j}\) (e.g. CpG density) that may explain differences in mean methylation \(\mu_j\) across features. Secondly, similar to Eling2018, we use a non-linear regression framework to capture the mean-overdispersion trend that is typically observed in high throughput sequencing data, such as scBS-seq (Fig.1c). This trend is used to derive residual overdispersion parameters \(\epsilon_j\) — a measure of cell-to-cell variability that is not confounded by mean methylation. Feature-specific parameters are subsequently used for: (i) feature selection, to identify highly variable features (HVFs) that drive cell-to-cell epigenetic heterogeneity (Fig.1d) and (ii) differential methylation testing, to highlight features that show differences in DNAm mean or variability between specified groups of cells (Fig.1e).

Figure 1: scMET model overview.

3 scMET analysis on synthetic data

To highlight the main functionality of scMET we first apply it on synthetic data, where we can assess the accuracy of the inferred parameters (\(\mu\) and \(\gamma\)) as well as compare it with the maximum likelihood estimation of simple beta-binomial (BB-MLE) model.

# Load package
suppressPackageStartupMessages(library(scMET))
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(magrittr))
set.seed(123)

3.1 Loading synthetic data

scMET comes with simulated scBS-seq data from a single population. The users can generate their own synthetic data using the scmet_simulate function.

# Synthetic data: list with following elements
names(scmet_dt)

## [1] "Y"                 "X"                 "theta_true"       
## [4] "theta_priors_true" "opts"

Here, Y is a data.table containing the scBS-seq data in long format, where total_reads corresponds to the total number of covered CpGs \(n_{ij}\) and met_reads to number of methylated CpGs \(Y_{ij}\) in cell \(i\) and feature \(j\) (see Fig.1). This is the data format that scMET requires as input.

head(scmet_dt$Y)

##      Feature    Cell total_reads met_reads
## 1: Feature_1 Cell_13          14         9
## 2: Feature_1 Cell_21           9         9
## 3: Feature_1 Cell_11          10         8
## 4: Feature_1 Cell_50          10         1
## 5: Feature_1 Cell_44          11         6
## 6: Feature_1 Cell_14          12         8

Next, X is a matrix containing the feature-specific covariates that might explain differences in mean methylation. The first column of X is the bias term and contains a vector of 1s. This is an optional input when performing inference using scMET.

head(scmet_dt$X)

##             intercept cpg_density
## Feature_1           1 -0.77132814
## Feature_10          1  0.30623073
## Feature_100         1 -1.46447533
## Feature_101         1 -1.05901022
## Feature_102         1  0.96694264
## Feature_103         1  0.06700105

The theta_true element just contains the true \(\mu\) and \(\gamma\) parameters used to generate the scBS-seq data Y. We will use these to assess our prediction performance later. Finally the theta_priors_true contains the prior hyper-parameters which we defined to generate the synthetic data (top level parameters in Fig.1).

# Parameters \mu and \gamma
head(scmet_dt$theta_true)

##                    mu     gamma
## Feature_1   0.6201633 0.2953617
## Feature_10  0.3066003 0.3630327
## Feature_100 0.8417903 0.1483870
## Feature_101 0.9307007 0.1619025
## Feature_102 0.1089840 0.2645526
## Feature_103 0.4856963 0.3154443

# Hyper-paramter values
scmet_dt$theta_priors_true

## $w_mu
## [1] -0.5 -1.5
## 
## $w_gamma
## [1] -1.2 -0.3  1.1 -0.9
## 
## $s_mu
## [1] 1
## 
## $s_gamma
## [1] 0.2

Now let’s visualise how the mean-overdispersion trend looks like in our synthetic data and also the association between mean methylation and covariates X, which here we will assume is CpG density.

par(mfrow = c(1,2))
plot(scmet_dt$theta_true$mu, scmet_dt$theta_true$gamma, pch = 20,
     xlab = expression(paste("Mean methylation ",  mu)), 
     ylab = expression(paste("Overdsispersion ",  gamma)))
plot(scmet_dt$X[,2], scmet_dt$theta_true$mu, pch = 20,
     xlab = "X: CpG density", 
     ylab = expression(paste("Mean methylation ",  mu)))

3.2 scMET inference

Now we can perform inference on the simulated data using the scmet function. In addition to providing the input data Y and covariates X, we set L = 4 radial basis functions to model the mean-overdispersion relationship. Also, we set the total number of VB iterations to iter = 5000 for efficiency purposes. We let all the other parameters as default.

NOTE 1 The user should set this to a much higher value, e.g. around 20,000 when running VB.

NOTE 2 For relatively small datasets we recommend using the MCMC implementation of scMET, i.e. setting use_mcmc = TRUE, since it is more stable than the VB approximation and gives more accurate results.

# Run with seed for reproducibility
fit_obj <- scmet(Y = scmet_dt$Y, X = scmet_dt$X, L = 4,
                 iter = 5000, seed = 12)

## Sorting features.
## Total # of cells:  50 .
## Total # of features:  150 .
## Wed Jul  8 10:10:34 2020 : Using EB to set model priors.
## Usin EB estimates as initial values for posterior inference.
## Wed Jul  8 10:10:37 2020 : Finished EB.
## Wed Jul  8 10:10:37 2020 : Starting inference.
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1:   This procedure has not been thoroughly tested and may be unstable
## Chain 1:   or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1: 
## Chain 1: 
## Chain 1: 
## Chain 1: Gradient evaluation took 0.001535 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 15.35 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1: 
## Chain 1: Begin stochastic gradient ascent.
## Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
## Chain 1:    100        -8816.455             1.000            1.000
## Chain 1:    200        -8185.937             0.539            1.000
## Chain 1:    300        -8119.809             0.362            0.077
## Chain 1:    400        -8024.891             0.274            0.077
## Chain 1:    500        -8012.193             0.220            0.012
## Chain 1:    600        -7966.628             0.021            0.008
## Chain 1:    700        -7987.344             0.006            0.006
## Chain 1:    800        -7972.263             0.005            0.003
## Chain 1:    900        -7952.611             0.003            0.002
## Chain 1:   1000        -7997.019             0.004            0.003
## Chain 1:   1100        -7965.491             0.003            0.003
## Chain 1:   1200        -7953.481             0.003            0.002
## Chain 1:   1300        -7926.468             0.003            0.003
## Chain 1:   1400        -7907.015             0.003            0.003
## Chain 1:   1500        -7912.552             0.002            0.002
## Chain 1:   1600        -7948.448             0.003            0.002
## Chain 1:   1700        -7925.476             0.003            0.003
## Chain 1:   1800        -7912.310             0.002            0.002
## Chain 1:   1900        -7921.115             0.002            0.002
## Chain 1:   2000        -7916.323             0.002            0.002
## Chain 1:   2100        -7910.660             0.001            0.001
## Chain 1:   2200        -7911.353             0.001            0.001
## Chain 1:   2300        -7897.388             0.001            0.001
## Chain 1:   2400        -7897.071             0.001            0.001
## Chain 1:   2500        -7908.039             0.001            0.001
## Chain 1:   2600        -7899.486             0.001            0.001
## Chain 1:   2700        -7886.011             0.001            0.001
## Chain 1:   2800        -7892.220             0.001            0.001
## Chain 1:   2900        -7891.941             0.001            0.001
## Chain 1:   3000        -7905.204             0.001            0.001
## Chain 1:   3100        -7890.422             0.001            0.002
## Chain 1:   3200        -7889.078             0.001            0.001
## Chain 1:   3300        -7884.443             0.001            0.001
## Chain 1:   3400        -7880.764             0.001            0.001
## Chain 1:   3500        -7885.446             0.001            0.001
## Chain 1:   3600        -7880.486             0.000            0.001
## Chain 1:   3700        -7902.075             0.001            0.001
## Chain 1:   3800        -7878.813             0.001            0.001
## Chain 1:   3900        -7886.469             0.002            0.001
## Chain 1:   4000        -7882.935             0.002            0.001
## Chain 1:   4100        -7874.693             0.002            0.001
## Chain 1:   4200        -7885.800             0.001            0.001
## Chain 1:   4300        -7884.911             0.001            0.001
## Chain 1:   4400        -7885.343             0.001            0.000
## Chain 1:   4500        -7871.911             0.001            0.001
## Chain 1:   4600        -7874.018             0.001            0.000
## Chain 1:   4700        -7866.202             0.001            0.000
## Chain 1:   4800        -7875.572             0.001            0.001
## Chain 1:   4900        -7878.063             0.001            0.001
## Chain 1:   5000        -7883.020             0.001            0.001
## Chain 1: Informational Message: The maximum number of iterations is reached! The algorithm may not have converged.
## Chain 1: This variational approximation is not guaranteed to be meaningful.
## Chain 1: 
## Chain 1: Drawing a sample of size 2000 from the approximate posterior... 
## Chain 1: COMPLETED.
## Wed Jul  8 10:10:53 2020 : Posterior inference finished.

3.2.1 Output summary

The scmet returns a scmet_<mode> object (mode either vb or mcmc) with the following structure:

class(fit_obj)

## [1] "scmet_vb"

names(fit_obj)

## [1] "posterior"     "Y"             "X"             "feature_names"
## [5] "theta_priors"  "opts"

Here the most important entry is the posterior slot, which is a list containing the posterior draws for each of the model parameters. All other slots contain the input data and initialised hyper-parameter values for reproducibility purposes.

# Elements of the posterior list
names(fit_obj$posterior)

## [1] "mu"      "gamma"   "epsilon" "w_mu"    "w_gamma" "s_gamma" "lp__"

# Rows correspond to draws and columns to parameter dimensions
# here number of features.
dim(fit_obj$posterior$mu)

## [1] 2000  150

# First 5 draws across 3 features for \mu parameter
fit_obj$posterior$mu[1:5, 1:3]

##           
## iterations Feature_1 Feature_10 Feature_100
##       [1,]     0.542      0.229       0.840
##       [2,]     0.448      0.166       0.798
##       [3,]     0.524      0.241       0.827
##       [4,]     0.582      0.175       0.902
##       [5,]     0.575      0.211       0.829

# First 5 draws across 3 features for \gamma parameter
fit_obj$posterior$gamma[1:5, 1:3]

##           
## iterations Feature_1 Feature_10 Feature_100
##       [1,]     0.124      0.305       0.237
##       [2,]     0.154      0.205       0.131
##       [3,]     0.198      0.340       0.259
##       [4,]     0.084      0.354       0.145
##       [5,]     0.156      0.419       0.268

# First 5 draws for covariate coefficients
# number of columns equal to ncol(X) = 2
fit_obj$posterior$w_mu[1:5, ]

##           
## iterations intercept cpg_density
##       [1,]    -0.635      -1.565
##       [2,]    -0.418      -1.357
##       [3,]    -0.446      -1.628
##       [4,]    -0.469      -1.465
##       [5,]    -0.549      -1.566

# First 5 draws for RBF coefficients
# number of columns equal to L = 4
fit_obj$posterior$w_gamma[1:5, ]

##           
## iterations   rbf1  rbf2  rbf3   rbf4
##       [1,] -1.663 0.081 1.119 -0.690
##       [2,] -1.697 0.001 1.081 -0.713
##       [3,] -1.763 0.107 1.174 -0.637
##       [4,] -1.571 0.074 1.112 -0.715
##       [5,] -1.677 0.010 1.032 -0.677

3.3 Plotting mean-overdispersion relationship

Next we plot the posterior median parameter estimates of \(\mu\) and \(\gamma\) together with the fitted non-linear trend. The color code is used to represent areas with high (green and yellow) and low (blue) concentration of features via kernel density estimation. Mostly useful when visualising datasets with large number of features.

gg1 <- scmet_plot_mean_var(obj = fit_obj, y = "gamma", 
                           task = NULL, show_fit = TRUE)
gg2 <- scmet_plot_mean_var(obj = fit_obj, y = "epsilon", 
                           task = NULL, show_fit = TRUE)
cowplot::plot_grid(gg1, gg2, ncol = 2)

As expected we observe no association between the mean - residual overdisperion relatioship.

3.4 Comparing true versus estimated parameters

Next we assess the predictive performance of scMET by comparing the posterior estimates with the true parameter values used to generate the synthetic data.

# Mean methylation estimates
gg1 <- scmet_plot_estimated_vs_true(obj = fit_obj, sim_dt = scmet_dt, 
                                    param = "mu")
# Overdispersion estimates
gg2 <- scmet_plot_estimated_vs_true(obj = fit_obj, sim_dt = scmet_dt, 
                                    param = "gamma")
cowplot::plot_grid(gg1, gg2, ncol = 2)

3.4.1 scMET versus beta-binomial MLE (shrinkage)

Below we assess the predictive performance of scMET compared to the maximum likelihood estimates from a beta-binomial model. First we obtain MLE estimates:

# Obtain MLE estimates by calling the bb_mle function
bbmle_fit <- scmet_dt$Y[, bb_mle(cbind(total_reads, met_reads))
                      [c("mu", "gamma")], by = c("Feature")]
head(bbmle_fit)

##        Feature         mu        gamma
## 1:   Feature_1 0.62682041 1.800892e-01
## 2:  Feature_10 0.25923353 2.530192e-01
## 3: Feature_100 0.83186860 2.486058e-01
## 4: Feature_101 0.95475110 8.172775e-07
## 5: Feature_102 0.08642232 1.352157e-01
## 6: Feature_103 0.47120018 2.669786e-01

As we can see, scMET can more accurately estimate the overdispersion parameter \(\gamma\) by shrinking the MLE estimates towards to population average, resulting in more robust estimates.

# Overdispersion estimates MLE vs scMET
# subset of features to avoid over-plotting
plot(scmet_plot_estimated_vs_true(obj = fit_obj, sim_dt = scmet_dt, 
                                  param = "gamma", mle_fit = bbmle_fit))

The above sections mostly focused on assessing the prediction performance of scMET. Below we show how the user can use scMET for downstream analyses, such as feature selection and differential methylation testing.

3.5 Identifying highly variable features

After fitting the scMET model, we now show how to perform feature selection; that is to identify highly variable features (HVF) which potentially drive cell-to-cell DNA methylation heterogeneity. We will identify HVFs based on the residual overdispersion estimates \(\epsilon\) (This is the default and recommended approach of scMET. The user can also perform HVF analysis based on the overdispersion parameter \(\gamma\), which however will be confounded by the mean methylation levels).

We can perform HVF analysis simply by calling the scmet_hvf function. This will create a new slot named hvf within the fit_obj object.

# Run HVF analysis
fit_obj <- scmet_hvf(scmet_obj = fit_obj, delta_e = 0.75, 
                     evidence_thresh = 0.8, efdr = 0.1)

## 6 Features classified as highly variable using: 
## - The  75 percentile of variable genes 
## - Evidence threshold = 0.83025
## - EFDR = 10.15% 
## - EFNR = 30.69%

# Summary of HVF analysis
head(fit_obj$hvf$summary)

##     feature_name     mu  gamma epsilon tail_prob is_variable
## 16   Feature_112 0.1390 0.4290  1.0030    0.9660        TRUE
## 138   Feature_88 0.7440 0.3210  0.9130    0.9470        TRUE
## 95    Feature_49 0.3065 0.4610  0.8180    0.9195        TRUE
## 9    Feature_106 0.4110 0.4435  0.6735    0.8930        TRUE
## 50   Feature_143 0.0530 0.3490  0.8175    0.8340        TRUE
## 44   Feature_138 0.9660 0.2415  0.7830    0.8315        TRUE

Here the tail_prob column denotes the evidence our model provides of a feature being called as HVF (the higher the larger the evidence), and the is_variable column denotes the output whether the feature was called as HVF or not.

Below we show the grid search to obtain the optimal posterior evidence threshold to achieve EFDR = 10%.

plot(scmet_plot_efdr_efnr_grid(obj = fit_obj, task = "hvf"))

Next we can also plot the corresponding tail posterior probabilities as a function of \(\mu\) and plot the mean - overdispersion relationship coloured by the HVF analysis.

gg1 <- scmet_plot_vf_tail_prob(obj = fit_obj, x = "mu", task = "hvf")
gg2 <- scmet_plot_mean_var(obj = fit_obj, y = "gamma", task = "hvf")
cowplot::plot_grid(gg1, gg2, ncol = 2)

NOTE One could perform the exact analysis for identifying lowly variable features (LVFs), although in general these are less useful for downstream analysis, such as clustering, since they denote regulatory regions that have stable DNAm patterns across cells.

3.6 Differential analysis with scMET

scMET has additional functionality for calling differentially methylated (DM) and differentially variable (DV) features when analysing cells from two conditions/populations.

To perform differential testing, scMET contains simulated data from two conditions/groups of cells, where a subset of features are DM and some others are DV between groups A and B. The user can generate their own synthetic data using the scmet_simulate_diff function. Below we show the structure of the scmet_diff_dt object.

# Structure of simulated data from two populations
names(scmet_diff_dt)

## [1] "scmet_dt_A"              "scmet_dt_B"             
## [3] "opts"                    "diff_var_features"      
## [5] "diff_var_features_up"    "diff_var_features_down" 
## [7] "diff_mean_features"      "diff_mean_features_up"  
## [9] "diff_mean_features_down"

Below we plot the mean-overdispersion relationship of the true parameters for each group, and highlight with red colour the features that are truly differentially variable in group B compared to group A.

# Extract DV features
dv <- scmet_diff_dt$diff_var_features$feature_idx
# Parameters for each group
theta_A <- scmet_diff_dt$scmet_dt_A$theta_true
theta_B <- scmet_diff_dt$scmet_dt_B$theta_true

par(mfrow = c(1,2))
# Group A mean - overdispersion relationship
plot(theta_A$mu, theta_A$gamma, pch = 20, main = "Group A",
     xlab = expression(paste("Mean methylation ",  mu)), 
     ylab = expression(paste("Overdsispersion ",  gamma)))
points(theta_A$mu[dv], theta_A$gamma[dv], col = "red", pch = 20)

# Group B mean - overdispersion relationship
plot(theta_B$mu, theta_B$gamma, pch = 20, main = "Group B",
     xlab = expression(paste("Mean methylation ",  mu)), 
     ylab = expression(paste("Overdsispersion ",  gamma)))
points(theta_B$mu[dv], theta_B$gamma[dv], col = "red", pch = 20)

The scmet_dt_A and scmet_dt_B are the two simulated datasets, having the same structure as the scmet_dt object in the above sections. The additional slots contain information about which featurers are DM or DV across populations.

3.6.1 Fitting scMET for each group

To perform differential analysis, we first need to run the scmet function on each dataset/group independently.

NOTE 1 The user should set this to a much higher value, e.g. around 20,000 when running VB.

# Run scMET for group A
fit_A <- scmet(Y = scmet_diff_dt$scmet_dt_A$Y,
               X = scmet_diff_dt$scmet_dt_A$X, L = 4, 
               iter = 5000, seed = 12)

## Sorting features.
## Total # of cells:  100 .
## Total # of features:  200 .
## Wed Jul  8 10:11:01 2020 : Using EB to set model priors.
## Usin EB estimates as initial values for posterior inference.
## Wed Jul  8 10:11:07 2020 : Finished EB.
## Wed Jul  8 10:11:07 2020 : Starting inference.
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1:   This procedure has not been thoroughly tested and may be unstable
## Chain 1:   or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1: 
## Chain 1: 
## Chain 1: 
## Chain 1: Gradient evaluation took 0.002938 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 29.38 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1: 
## Chain 1: Begin stochastic gradient ascent.
## Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
## Chain 1:    100       -25145.285             1.000            1.000
## Chain 1:    200       -22617.738             0.556            1.000
## Chain 1:    300       -22382.208             0.374            0.112
## Chain 1:    400       -22245.028             0.282            0.112
## Chain 1:    500       -22129.156             0.227            0.011
## Chain 1:    600       -22039.409             0.028            0.006
## Chain 1:    700       -22023.275             0.005            0.005
## Chain 1:    800       -21995.103             0.003            0.004
## Chain 1:    900       -22070.022             0.003            0.003
## Chain 1:   1000       -21997.430             0.003            0.003
## Chain 1:   1100       -21959.225             0.002            0.002
## Chain 1:   1200       -21918.591             0.002            0.002
## Chain 1:   1300       -21990.373             0.003            0.003
## Chain 1:   1400       -21937.503             0.003            0.002
## Chain 1:   1500       -21939.525             0.002            0.002
## Chain 1:   1600       -21844.662             0.002            0.002
## Chain 1:   1700       -21861.427             0.002            0.002
## Chain 1:   1800       -21847.020             0.002            0.001
## Chain 1:   1900       -21867.540             0.001            0.001
## Chain 1:   2000       -21882.949             0.001            0.001
## Chain 1:   2100       -21866.715             0.001            0.001
## Chain 1:   2200       -21893.634             0.001            0.001
## Chain 1:   2300       -21861.194             0.001            0.001
## Chain 1:   2400       -21844.990             0.001            0.001
## Chain 1:   2500       -21838.353             0.001            0.001
## Chain 1:   2600       -21814.193             0.001            0.001
## Chain 1:   2700       -21829.680             0.001            0.001
## Chain 1:   2800       -21860.334             0.001            0.001
## Chain 1:   2900       -21808.208             0.001            0.001
## Chain 1:   3000       -21831.019             0.001            0.001
## Chain 1:   3100       -21853.390             0.001            0.001
## Chain 1:   3200       -21800.829             0.002            0.001
## Chain 1:   3300       -21824.084             0.002            0.001
## Chain 1:   3400       -21808.145             0.001            0.001
## Chain 1:   3500       -21816.374             0.001            0.001
## Chain 1:   3600       -21807.685             0.001            0.001
## Chain 1:   3700       -21789.537             0.001            0.001
## Chain 1:   3800       -21824.486             0.001            0.001
## Chain 1:   3900       -21793.810             0.001            0.001
## Chain 1:   4000       -21813.947             0.001            0.001
## Chain 1:   4100       -21780.700             0.001            0.001
## Chain 1:   4200       -21792.542             0.001            0.001
## Chain 1:   4300       -21802.062             0.001            0.001
## Chain 1:   4400       -21825.179             0.001            0.001
## Chain 1:   4500       -21783.956             0.001            0.001
## Chain 1:   4600       -21790.168             0.001            0.001
## Chain 1:   4700       -21778.296             0.001            0.001
## Chain 1:   4800       -21781.593             0.001            0.001
## Chain 1:   4900       -21814.312             0.001            0.001
## Chain 1:   5000       -21782.304             0.001            0.001
## Chain 1: Informational Message: The maximum number of iterations is reached! The algorithm may not have converged.
## Chain 1: This variational approximation is not guaranteed to be meaningful.
## Chain 1: 
## Chain 1: Drawing a sample of size 2000 from the approximate posterior... 
## Chain 1: COMPLETED.
## Wed Jul  8 10:11:35 2020 : Posterior inference finished.

# Run scMET for group B
fit_B <- scmet(Y = scmet_diff_dt$scmet_dt_B$Y,
               X = scmet_diff_dt$scmet_dt_B$X, L = 4, 
               iter = 5000, seed = 12)

## Sorting features.
## Total # of cells:  100 .
## Total # of features:  200 .
## Wed Jul  8 10:11:35 2020 : Using EB to set model priors.
## Usin EB estimates as initial values for posterior inference.
## Wed Jul  8 10:11:40 2020 : Finished EB.
## Wed Jul  8 10:11:40 2020 : Starting inference.
## Chain 1: ------------------------------------------------------------
## Chain 1: EXPERIMENTAL ALGORITHM:
## Chain 1:   This procedure has not been thoroughly tested and may be unstable
## Chain 1:   or buggy. The interface is subject to change.
## Chain 1: ------------------------------------------------------------
## Chain 1: 
## Chain 1: 
## Chain 1: 
## Chain 1: Gradient evaluation took 0.00305 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 30.5 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Begin eta adaptation.
## Chain 1: Iteration:   1 / 250 [  0%]  (Adaptation)
## Chain 1: Iteration:  50 / 250 [ 20%]  (Adaptation)
## Chain 1: Iteration: 100 / 250 [ 40%]  (Adaptation)
## Chain 1: Iteration: 150 / 250 [ 60%]  (Adaptation)
## Chain 1: Iteration: 200 / 250 [ 80%]  (Adaptation)
## Chain 1: Success! Found best value [eta = 1] earlier than expected.
## Chain 1: 
## Chain 1: Begin stochastic gradient ascent.
## Chain 1:   iter             ELBO   delta_ELBO_mean   delta_ELBO_med   notes 
## Chain 1:    100       -24710.879             1.000            1.000
## Chain 1:    200       -22169.801             0.557            1.000
## Chain 1:    300       -21916.277             0.375            0.115
## Chain 1:    400       -21824.776             0.283            0.115
## Chain 1:    500       -21736.215             0.227            0.012
## Chain 1:    600       -21642.469             0.028            0.004
## Chain 1:    700       -21629.921             0.005            0.004
## Chain 1:    800       -21600.443             0.003            0.004
## Chain 1:    900       -21655.167             0.003            0.003
## Chain 1:   1000       -21579.544             0.002            0.003
## Chain 1:   1100       -21567.996             0.002            0.001
## Chain 1:   1200       -21553.666             0.002            0.001
## Chain 1:   1300       -21565.885             0.002            0.001
## Chain 1:   1400       -21559.217             0.001            0.001
## Chain 1:   1500       -21525.993             0.001            0.001
## Chain 1:   1600       -21478.306             0.001            0.001
## Chain 1:   1700       -21488.489             0.001            0.001
## Chain 1:   1800       -21477.082             0.001            0.001
## Chain 1:   1900       -21504.395             0.001            0.001
## Chain 1:   2000       -21481.828             0.001            0.001
## Chain 1:   2100       -21486.260             0.001            0.001
## Chain 1:   2200       -21504.159             0.001            0.001
## Chain 1:   2300       -21487.278             0.001            0.001
## Chain 1:   2400       -21472.391             0.001            0.001
## Chain 1:   2500       -21475.653             0.001            0.001
## Chain 1:   2600       -21446.043             0.001            0.001
## Chain 1:   2700       -21466.509             0.001            0.001
## Chain 1:   2800       -21458.501             0.001            0.001
## Chain 1:   2900       -21440.613             0.001            0.001
## Chain 1:   3000       -21450.232             0.001            0.001
## Chain 1:   3100       -21440.061             0.001            0.000
## Chain 1:   3200       -21433.531             0.000            0.000
## Chain 1:   3300       -21449.140             0.001            0.000
## Chain 1:   3400       -21444.471             0.000            0.000
## Chain 1:   3500       -21439.966             0.000            0.000
## Chain 1:   3600       -21442.333             0.000            0.000
## Chain 1:   3700       -21418.777             0.000            0.000
## Chain 1:   3800       -21452.922             0.001            0.000
## Chain 1:   3900       -21428.062             0.001            0.001
## Chain 1:   4000       -21426.504             0.001            0.001
## Chain 1:   4100       -21425.608             0.001            0.001
## Chain 1:   4200       -21426.581             0.001            0.000   MEDIAN ELBO CONVERGED
## Chain 1: 
## Chain 1: Drawing a sample of size 2000 from the approximate posterior... 
## Chain 1: COMPLETED.
## Wed Jul  8 10:12:06 2020 : Posterior inference finished.

Showing mean - overdispersion plots for each group

gg1 <- scmet_plot_mean_var(obj = fit_A, y = "gamma", task = NULL, 
                           title = "Group A")
gg2 <- scmet_plot_mean_var(obj = fit_B, y = "gamma", task = NULL, 
                           title = "Group B")
cowplot::plot_grid(gg1, gg2, ncol = 2)

3.6.2 Running differential analysis

Now we can perform the DM and DV analysis by calling the scmet_differential function. NOTE that only those features that are common across the two groups will be kept for this analysis.

# Run differential analysis with small evidence_thresh
# tp obtain more hits.
diff_obj <- scmet_differential(obj_A = fit_A, obj_B = fit_B,
                               evidence_thresh_m = 0.65,
                               evidence_thresh_e = 0.65,
                               group_label_A = "A",
                               group_label_B = "B")

## --------------------------------------- 
## The user excluded 0 features.
## These features are marked as 'ExcludedByUser' 
## and are excluded from EFDR calibration.

## -------------------------------------------------------------
## 15 features with a change in mean methylation:
## - Higher methylation in A samples: 4
## - Higher methylation in B samples: 11
## - Odds ratio tolerance = 1.5
## - Probability evidence threshold=0.85875
## - EFDR = 5.26% 
## - EFNR = 30.97%. 
## -------------------------------------------------------------
## 
## -------------------------------------------------------------
## 22 genes with a change in over-dispersion:
## - Higher dispersion in A samples: 5
## - Higher dispersion in B samples: 17
## - Odds ratio tolerance = 1.5
## - Probability evidence threshold=0.84575
## - EFDR = 4.91% 
## - EFNR = 45.27% 
## NOTE: differential dispersion assessment only applied to the 
## 185 genes for which the mean did not change 
## and that were included for testing. 
## --------------------------------------------------------------
## -------------------------------------------------------------
## 22 genes with a change in residual over dispersion:
## - Higher residual dispersion in A samples: 6
## - Higher residual dispersion in B samples: 16
## - Odds ratio tolerance = 1.5
## - Probability evidence threshold=0.84675
## - EFDR = 5.07% 
## - EFNR = 46.73%. 
## --------------------------------------------------------------

NOTE sometimes the method returns a message that it could not obtain an acceptable evidence threshold to achieve EFDR = 5%. Then by by default the evidence threshold is set to the initial value. This does not mean though that the test failed.

The diff_obj stores all the differential analysis information in the following structure:

# Structure of diff_obj
class(diff_obj)

## [1] "scmet_differential"

names(diff_obj)

## [1] "diff_mu_summary"      "diff_epsilon_summary" "diff_gamma_summary"  
## [4] "diff_mu_thresh"       "diff_epsilon_thresh"  "diff_gamma_thresh"   
## [7] "opts"

The diff_<mode>_summaryslots contain the summary of the differential analysis, whereas the diff_<mode>_thresh slots contain information about the optimal posterior evidence threshold search to achieve the required EFDR.

# DM results
head(diff_obj$diff_mu_summary)

##     feature_name mu_overall   mu_A  mu_B     mu_LOR     mu_OR mu_tail_prob
## 1      Feature_1    0.20850 0.3190 0.098  1.4619423 4.3143311       1.0000
## 177   Feature_78    0.10400 0.1800 0.028  2.0182202 7.5249204       1.0000
## 58   Feature_150    0.35150 0.2350 0.468 -1.0532352 0.3488075       0.9975
## 198   Feature_97    0.35000 0.2490 0.451 -0.9072614 0.4036281       0.9960
## 60   Feature_152    0.76750 0.6780 0.857 -1.0509501 0.3496054       0.9925
## 20   Feature_116    0.29825 0.1915 0.405 -1.0610694 0.3460855       0.9825
##     mu_diff_test
## 1             A+
## 177           A+
## 58            B+
## 198           B+
## 60            B+
## 20            B+

# Summary of DMs
as.data.table(diff_obj$diff_mu_summary) %>%
  .[, .N, by = "mu_diff_test"]

##    mu_diff_test   N
## 1:           A+   4
## 2:           B+  11
## 3:       NoDiff 185

# DV (based on epsilon) results
head(diff_obj$diff_epsilon_summary)

##     feature_name epsilon_overall epsilon_A epsilon_B epsilon_change
## 187   Feature_87         0.16825   -0.7415    1.0780        -1.8345
## 126   Feature_31         0.67300   -0.0370    1.3830        -1.4250
## 190    Feature_9         0.41925   -0.3445    1.1830        -1.5225
## 75   Feature_166         0.23600   -0.3970    0.8690        -1.2730
## 180   Feature_80         0.49200   -0.1990    1.1830        -1.3870
## 143   Feature_47        -0.54675    0.1440   -1.2375         1.3785
##     epsilon_tail_prob epsilon_diff_test mu_overall  mu_A  mu_B
## 187            1.0000                B+     0.5860 0.617 0.555
## 126            0.9985                B+     0.7335 0.720 0.747
## 190            0.9975                B+     0.9265 0.933 0.920
## 75             0.9970                B+     0.3625 0.406 0.319
## 180            0.9970                B+     0.5135 0.546 0.481
## 143            0.9895                A+     0.2200 0.225 0.215

# Summary of DVs
as.data.table(diff_obj$diff_epsilon_summary) %>%
  .[, .N, by = "epsilon_diff_test"]

##    epsilon_diff_test   N
## 1:                B+  16
## 2:                A+   6
## 3:            NoDiff 178

Let us plot first the grid search for optimal evidence threshold.

gg1 <- scmet_plot_efdr_efnr_grid(obj = diff_obj, task = "diff_mu")
gg2 <- scmet_plot_efdr_efnr_grid(obj = diff_obj, task = "diff_epsilon")
cowplot::plot_grid(gg1, gg2, ncol = 2)

3.6.3 Plotting differential hits

Below we show volcano plots for DM and DV analysis. Red and green colours correspond to features that are more methylated or variable in group A and group B, respectively.

# DM volcano plot
plot(scmet_plot_volcano(diff_obj, task = "diff_mu"))

# DV based on epsilon volcano plot
plot(scmet_plot_volcano(diff_obj, task = "diff_epsilon"))

Below we show MA plots for DM and DV analysis.

# MA plot for DM analysis and x axis overall mean methylation
gg1 <- scmet_plot_ma(diff_obj, task = "diff_mu", x = "mu")
# MA plot for DV analysis and x axis overall mean methylation
gg2 <- scmet_plot_ma(diff_obj, task = "diff_epsilon", x = "mu")
cowplot::plot_grid(gg1, gg2, ncol = 2)

4 scMET on real data

TODO

5 Session Info

This vignette was compiled using:

sessionInfo()

## R version 4.0.0 (2020-04-24)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] magrittr_1.5      data.table_1.12.8 scMET_0.0.0.9000  knitr_1.28       
## [5] BiocStyle_2.16.0 
## 
## loaded via a namespace (and not attached):
##  [1] rstan_2.19.3        tidyselect_1.1.0    xfun_0.13          
##  [4] purrr_0.3.4         lattice_0.20-41     splines_4.0.0      
##  [7] colorspace_1.4-1    vctrs_0.3.0         viridisLite_0.3.0  
## [10] htmltools_0.4.0     stats4_4.0.0        loo_2.2.0          
## [13] yaml_2.2.1          rlang_0.4.6         pkgbuild_1.0.8     
## [16] pillar_1.4.4        glue_1.4.1          matrixStats_0.56.0 
## [19] lifecycle_0.2.0     stringr_1.4.0       munsell_0.5.0      
## [22] gtable_0.3.0        coda_0.19-3         codetools_0.2-16   
## [25] VGAM_1.1-3          evaluate_0.14       labeling_0.3       
## [28] inline_0.3.15       callr_3.4.3         ps_1.3.3           
## [31] parallel_4.0.0      fansi_0.4.1         highr_0.8          
## [34] Rcpp_1.0.4.6        scales_1.1.1        BiocManager_1.30.10
## [37] magick_2.3          StanHeaders_2.19.2  farver_2.0.3       
## [40] gridExtra_2.3       ggplot2_3.3.0       png_0.1-7          
## [43] digest_0.6.25       stringi_1.4.6       bookdown_0.18      
## [46] processx_3.4.2      dplyr_0.8.5         cowplot_1.0.0      
## [49] grid_4.0.0          logitnorm_0.8.37    cli_2.0.2          
## [52] tools_4.0.0         tibble_3.0.1        crayon_1.3.4       
## [55] pkgconfig_2.0.3     MASS_7.3-51.6       ellipsis_0.3.1     
## [58] prettyunits_1.1.1   viridis_0.5.1       assertthat_0.2.1   
## [61] rmarkdown_2.1       R6_2.4.1            compiler_4.0.0

6 Acknowledgements

This study was supported by funding from the University of Edinburgh and Medical Research Council (core grant to the MRC Institute of Genetics and Molecular Medicine).