CompSignLena Morrill Gavarró, Dominique-Laurent Couturier, Florian Markowetz 2024
CompSign is a toolkit for differential abundance
analysis of mutational signatures using a mixed effects
Dirichlet-multinominal model (or simpler variations). The compositional
nature of mutational signature exposures has often been overlooked but
has important implications, as the analyses must be done in relative
terms.
#> [1] "Date of compilation: 2025-01-12 18:48:19.625433"
CompSign can be installed as usual from github:
These are the libraries used in this vignette:
library(CompSign)
library(gridExtra)
library(TMB)
library(HMP)
library(gridExtra)
library(ggplot2)
theme_set(theme_bw())Several datasets and objects can be loaded as follows using the
data() function:
data(PancEndocrine_signaturesMSE)
data(ProstAdenoCA_chrom)
data(res_patient_lambda)
data(simplified_object)
data(ProstAdenoCA_trinucleotide)
data(obj_multilambda)
data(obj_multilambda_parameters)The package contains the following datasets of exposures of mutational signatures and metadata of the corresponding samples. These datasets are:
PancEndocrine_signaturesMSE: Signature exposures for
early and late mutations, in the PCAWG Panc-Endocrine cohortProstAdenoCA_chrom: Signature exposures for each
chromosome, in the PCAWG Prost-AdenoCA cohortsimplified_object: Nucleotide changes in clonal and
subclonal mutations in the Lung-AdenoCA cohortProstAdenoCA_trinucleotide: Trinucleotide abundandances
for clonal and subclonal mutations in the PCAWG Prost-AdenoCA
cohortThis is a minimal example for running the model and extracting the estimated coefficients.
The input dataset is the argument object, which is a
list with the following structure: - x: covariate matrix
(p x n) - z: matrix of random effects
indicating which patient-specific subsample belongs to which patient
(n x N) - Y (n x d) -
d
The function wrapper_run_TMB() is used to run all
variations of the model.
A minimal example is:
diagDM_no_small_sigs <- wrapper_run_TMB(object = simplified_object,
model = "diagRE_DM", use_nlminb=T, smart_init_vals=F)
in which simplified_object is a list containing
simplified_object$x, simplified_object$z,
simplified_object$Y. The object input for
wrapper_run_TMB is always a list containing a matrix of
covariates x, a matrix of observations Y and a
matrix matching observations patients and the labels corresponding to
the random effects. All three matrices should have the same number of
rows - the number of observations. In this case,
simplified_object corresponds to the nucleotides abundances
in clonal and subclonal mutations in the Lung-AdenoCA cohort.
simplified_object$x, contains two columns (a column of 1
for the intercept and a column of 0 and 1 indicating whether the
observation corresponds to clonal (0) or subclonal (1) mutations.
simplified_object$z contains a matrix where each
observation is paired with its corresponding patient,
simplified_object$Y is a matrix of counts for the
nucleotide abundances for each patient and (clonal/subclonal) group.
Here we are using the model diagRE_DM, which is a
mixed-effect Dirichlet-multinomial model with non-correlated random
effects and two dispersion parameters (one per group). See below for the
list of alternative models available.
You can load simplified_object, as well as several other
example datasets, using the data funtion:
data(package='CompSign')
data(simplified_object)
Depending on the model, the output of wrapper_run_TMB
contains the estimated values of different coefficients. The parameters
of interest are beta_0, beta_1 and
lambda. Both beta are values in log-ratio
space in with respect to the last category included in the input
object for wrapper_run_TMB (i.e. the last
colun of object$Y). beta_0 indicates the
coefficient for the baseline - in the two-group example of
simplified_object, this baseline reflects the abundance in
the first (clonal) group. beta_1 indicates the cofficient
for the second column - here representing the difference between clonal
and subclonal mutations. If beta_1 is a vector of 0, this
indicates that none of the log-ratios of abundances have changed between
the two groups, and hence that there is no differential abundance.
To test for differential abundance, a generalised Wald test can be
used with the function wald_TMB_wrapper, which gives a
p-value as output:
wald_TMB_wrapper(diagDM_no_small_sigs)
The estimated beta coefficients can be visualised as
follows:
library(ggplot2)
plot_betas(diagDM_no_small_sigs, return_ggplot = T)
These beta coefficients are in log-ratio space with
respect to a common category (the last column of object$Y).
Therefore, it is difficult to determine which of the abundances changed
in absolute terms. Assuming most categories do not change (i.e. assuming
minimal perturbation), we can detect which categories deviate from the
general beta_1 value. Without being able to draw any
conclusion about the baseline category, as we do not have standard error
values for it, we can detect for which categories in the numerator their
beta_1 has extreme values: this can be done as follows with
the plot_betas function:
plot_betas(diagDM_no_small_sigs, return_ggplot = T, add_median = T, line_zero = F, add_confint = T)
In this example workflow we determine differential abundance between clonal and subclonal mutations in Pancreatic neuroendocrine tumors.
sign objects: how to
load themPancEndocrine_signaturesMSE is an S4 object of class
sign. sign objects include all information
about the name of the samples, the mutation signature exposures (of all
signatures, or only of the subset of active signatures as determined by
Alexandrov et al. 2020). However, the use of these S4 objects is not
necessary for running the models of differential abundance, which
instead take a list as argument for the input data.
PancEndocrine_signaturesMSE = load_PCAWG("../inst/extdata/roo/Panc-Endocrine_signaturesMSE_ROO.RDS",
read_directly = T,
typedata = "signaturesMSE", override_warning_X_Z = T)
PancEndocrine_signaturesMSE_v2 = load_PCAWG(ct = "Panc-Endocrine", typedata = "signaturesMSE", path_to_data = "../inst/extdata/", load_all_sigs = F, override_warning_X_Z = T)If signature extraction has not been performed, the wrapper function
extract_sigs_TMB_obj outputs signature exposures from a
matrix of trinucleotide abundances. We use the trinucleotide abundances
from the object ProstAdenoCA_trinucleotide. In this wrapper
function two methods for signature extraction are implemented:
mutSigExtractor and quadratic programming (QP); this can be specified in
the signature_fitting_method argument. The version of
signature definitions matrix can be specified in the argument
signature_version (which can be v2 or
v3 of the COSMIC signatures. Alternatively, any signature
definition matrix can be specified in the
signature_definition, where the signature definition should
be a F x K signature definition matrix, where
K is the number of signatures and F the number
of features (for COSMIC signatures, F=96). The active
signature list can be specified in subset_signatures.
active_sigs_in_ct <- c("SBS1", "SBS2", "SBS3", "SBS5", "SBS8", "SBS13", "SBS18", "SBS37", "SBS40", "SBS41")
exposures_from_mutSigExtractor <- extract_sigs_TMB_obj(dataset_obj_trinucleotide=ProstAdenoCA_trinucleotide,
subset_signatures = active_sigs_in_ct,
signature_definition=NULL, signature_fitting_method = 'mutSigExtractor')
#> Loading required package: mutSigExtractor
#> [1] "v3"
#> Subsetting to active signatures
#> Running signature extraction
#> ... done.
exposures_from_QP <- extract_sigs_TMB_obj(dataset_obj_trinucleotide=ProstAdenoCA_trinucleotide,
subset_signatures = active_sigs_in_ct,
signature_definition=NULL, signature_fitting_method = 'QP')
#> [1] "v3"
#> Subsetting to active signatures
#> Running signature extraction
#> ... done.
plot(log(as.vector(exposures_from_mutSigExtractor$Y)), log(as.vector(exposures_from_QP$Y)),
xlab='Exposures from mutSigExtractor', ylab='Exposures from quadratic programming')
abline(coef = c(0,1), lty='dashed')All samples - clonal and subclonal - sorted by increasing SBS3 exposure:
createBarplot(normalise_rw(non_duplicated_rows(PancEndocrine_signaturesMSE$Y)),
order_labels = names(sort(non_duplicated_rows(PancEndocrine_signaturesMSE$Y)[,'SBS3'],
decreasing = F)), remove_labels=T,
custom_color_palette=color_list)+ggtitle('Sorted by SBS3')We create a simplified object containing exposures of fewer signatures (i.e. a subcomposition of the original signature vectors):
simplified_object <- give_subset_sigs_TMBobj(PancEndocrine_signaturesMSE,
sigs_to_remove = c('SBS13', 'SBS17a', 'SBS17b', 'SBS30'))In which simplified_object is a list containing
simplified_object$x, simplified_object$z,
simplified_object$Y, simplified_object$d (see
Input dataset section below)
The clonal and subclonal exposures are, respectively, the two barplots below:
do.call('grid.arrange', list(grobs=lapply(split_matrix_in_half(simplified_object$Y), function(i) createBarplot(normalise_rw(i), remove_labels = T, custom_color_palette=color_list)), nrow=1))
#> Creating plot... it might take some time if the data are large. Number of samples: 70
#> Creating plot... it might take some time if the data are large. Number of samples: 70There are several variations on the model, each with a particular
combination of the following: - Base model: Dirichlet-multinomial in
most cases, but two simpler models – multinomial models
diagRE_M and fullRE_M are implemented for
comparison. - Random effects: correlated or not (e.g. uncorrelated in
diagRE_DM, correlated in fullRE_DM). - Number
of dispersion parameters: generally one per group (e.g. in
diagRE_DM), but possibly fewer or more if specified in the
name (e.g. one single dipersion parameter in
diagRE_DM_singlelambda, one dispersion parameter per
patient in diagRE_DM_patientlambda).
The names of the model correspond to
{random effects}_{base model}_{particularities of the dispersion parameter, if any}.
The first column is the <model> argument in the
function wrapper_run_TMB().
| Name of model (for user) | Description and usage |
|---|---|
FE_DM_singlelambda |
Dirichlet-multinomial with no RE and a single lambda
which can be used in a two-group comparison (if we consider the
dipersion to be the same in both groups), a multi-group comparison, or
any regression setting. |
FE_DM |
Dirichlet-multinomial with no random effects and two
lambda, used for the comparison between two groups in which
we want to account for group-specific dispersion. Model slightly more
complex than FE_DM_singlelambda |
diagRE_M |
Multinomial with non-correlated multivariate effects. Simpler model
than diagRE_DM, as no dispersion is included |
diagRE_DM_singlelambda |
Dirichlet-multinomial with non-correlated multivariate RE and one: equivalent of diagRE_DM but with a shared . It can be used in a two-group comparison (if we consider the dipersion to be the same in both groups), a multi-group comparison, or any regression setting. |
singleRE_DM |
Dirichlet-multinomial with a single RE intercept and two
lambda: simple model in which random effects are not
multivariate, but where group-specific dispersion is needed, in a
two-group comparison. |
diagRE_DM |
Dirichlet-multinomial with independent RE and two
lambda: used most commonly throughout the paper. The data
are matched to warrant multivariate RE, but correlations between
categories are not explicitly modelled. Faster than
fullRE_DM with often comparable results for
beta_1. |
fullRE_DM_singlelambda |
Dirichlet-multinomial with correlated RE and two lambdab: equivalent
of diagR_DM_singlelambda that can be used if categories
have strong correlations. |
fullRE_M |
Multinomial with correlated RE: assuming no overdispersion, it can be used in any regression setting |
fullRE_DM |
Dirichlet-multinomial with correlated RE and two
lambda: model with multivariate RE with correlations
modelled explicitly. As a (K-1)*(K-1) covariance matrix is
estimated, this model is not recommended where the ratio of signatures
to samples is high. |
diagRE_DM_patientlambda |
Dirichlet-multinomial with non-correlated RE and one lambda$ per
patient: model more complex than diag_RE_DM that can be
used when there are several observations per patient and we wish to
include a patient-specific dispersion parameter |
fullRE_DM_patientlambda |
Dirichlet-multinomial with correlated RE and one per patient:
equivalent of diagRE_DM_patientlambda, but modelling
correlations between categories |
We run the model diagRE_DM_singlelambda with the dataset
simplified_object:
diagDM_no_small_sigs <- wrapper_run_TMB(object = simplified_object,
model = "diagRE_DM_singlelambda", use_nlminb=T, smart_init_vals=F)This is the resulting object with the estimates and their standard deviations:
diagDM_no_small_sigs
#> sdreport(.) result
#> Estimate Std. Error
#> beta0_SBS1_wrt_SBS39 2.03832104 0.22363954
#> beta1_SBS2_wrt_SBS39 -0.70454525 0.29906580
#> beta0_SBS3_wrt_SBS39 0.98806820 0.24807094
#> beta1_SBS5_wrt_SBS39 0.27978736 0.31041289
#> beta0_SBS6_wrt_SBS39 2.35669154 0.25394023
#> beta1_SBS8_wrt_SBS39 -0.54662130 0.29140102
#> beta0_SBS9_wrt_SBS39 3.72920999 0.22353776
#> beta1_SBS11_wrt_SBS39 -0.50234381 0.27244515
#> beta0_SBS26_wrt_SBS39 1.12048603 0.26120868
#> beta1_SBS36_wrt_SBS39 -0.40359421 0.31695478
#> beta0_SBS1_wrt_SBS39 2.85335968 0.20725141
#> beta1_SBS2_wrt_SBS39 -0.01029434 0.27904434
#> beta0_SBS3_wrt_SBS39 1.39433292 0.22601115
#> beta1_SBS5_wrt_SBS39 -0.02357902 0.30765208
#> beta0_SBS6_wrt_SBS39 0.88267656 0.24738726
#> beta1_SBS8_wrt_SBS39 -0.11575696 0.32338552
#> beta0_SBS9_wrt_SBS39 1.40320978 0.26009746
#> beta1_SBS11_wrt_SBS39 -0.24411805 0.30749405
#> beta0_SBS26_wrt_SBS39 0.56690763 0.39567437
#> beta1_SBS36_wrt_SBS39 -0.11893183 0.32177527
#> logs_sd_RE_SBS1_wrt_SBS39 -1.47032407 0.45921974
#> logs_sd_RE_SBS2_wrt_SBS39 -0.88112230 0.33980483
#> logs_sd_RE_SBS3_wrt_SBS39 0.16826194 0.25484956
#> logs_sd_RE_SBS5_wrt_SBS39 -0.49531043 0.23244324
#> logs_sd_RE_SBS6_wrt_SBS39 -0.19668190 0.30654793
#> logs_sd_RE_SBS8_wrt_SBS39 -11.95985277 455.21714653
#> logs_sd_RE_SBS9_wrt_SBS39 -12.11088723 318.80943762
#> logs_sd_RE_SBS11_wrt_SBS39 -1.26108779 0.53819666
#> logs_sd_RE_SBS26_wrt_SBS39 0.03013492 0.26234018
#> logs_sd_RE_SBS36_wrt_SBS39 1.78101081 0.22929287
#> log_lambda 2.85822798 0.05954625
#> Maximum gradient component: 0.01596245We can see how in this case we only have a single estimate for the
dispersion parameter (log_lambda). The number of \(\beta\) (20) corresonds to \(2*(d-1)\) where \(d\) is the number of mutations in the
object (11, as seen in the bar plots above). The last signature or
category, SBS39, is used as baseline signature in the ALR
transformation. All beta parameters are in ALR-transformed space, using
SBS39 as baseline. \(\beta_0\) and
\(\beta_1\) results are intercalated,
i.e. the \(\beta\) in the first row
corresponds to the \(\beta_0\) of SBS1
wrt (with respect to) SBS39, the \(\beta\) in the third row corresponds to the
\(\beta_0\) of SBS2 wrt (with respect
to) SBS39, the \(\beta\) in the first
row corresponds to the \(\beta_0\) of
SBS1 wrt (with respect to) SBS39, etc.
These \(\beta\) estimates can be plotted as follows:
In a scenario where we have several samples for each patient, which can still be grouped in two or more groups (fixed effects), we might be interested in patient-specific dispersion parameters, besides the patient-specific intercepts that we had. This has been implemented:
n <- 40 ## patients
num_samples_per_patient = 30
stopifnot((num_samples_per_patient %% 2)==0)
z <- do.call('rbind', replicate(expr = diag(n), n= num_samples_per_patient, simplify = F))
x <- cbind(rep(1, n*num_samples_per_patient),
c(rep(0, (n*num_samples_per_patient)/2),
rep(1, (n*num_samples_per_patient)/2)))
d = 5 ## number of signatures
random_effects <- matrix(runif(n*(d-1)), nrow=n, ncol=d-1)
fixed_effects <- matrix(runif(2*(d-1)), nrow=2, ncol=d-1)
alpha_mat <- x %*% fixed_effects + z %*% random_effects
alpha_mat_softmax = softmax( cbind(alpha_mat, 0) )
## lambda per patient
log_lambda_vec = runif(n, 3, 4)
## get patient-specific overdispersions
log_lambda_vec_per_obs <- z %*% log_lambda_vec
## simulate counts
Nm = rep(1000, n*num_samples_per_patient) ## fixed number of mutations
y = matrix(NA, nrow = n*num_samples_per_patient, ncol=d)
for(l in 1:(n*num_samples_per_patient)){
## for each patient-specific subsample
alpha_l = alpha_mat_softmax[l,]*exp(log_lambda_vec_per_obs[l])
y[l,] = HMP::Dirichlet.multinomial(Nrs = Nm[l], shape = alpha_l)
}##' create object with information about covariates, which observations correspond
##' to which patients, and the input (counts)
obj_multilambda <- list(Y = y, x=x, z = z, num_individuals = n)The true parameters used for the simulation are found in the object
obj_multilambda_parameters.
We can see how the parameters of interest are well recovered:
and two additional sets of parameters suffer from a bias due to
having used the uncorrelated version of the model
(i.e. diagRE_DM_patientlambda instead of
fullDMpatientlambda):
diagRE_DM_patientlambda is used.fixed_effects_estimate <- matrix(python_like_select_name(res_patient_lambda$par.fixed, 'beta'), nrow=2)
random_effects_estimate <- matrix(res_patient_lambda$par.random, ncol=ncol(obj_multilambda_parameters$fixed_effects))
log_lambda_vec_estimate <- python_like_select_name(res_patient_lambda$par.fixed, 'log_lambda')
list_comparisons_true_and_estimate = list(
all_betas = data.frame(x = as.vector(obj_multilambda_parameters$fixed_effects),
y = as.vector(fixed_effects_estimate)),
beta_1 = data.frame(x = obj_multilambda_parameters$fixed_effects[2,],
y = fixed_effects_estimate[2,]),
random_intercepts = data.frame(x = as.vector(obj_multilambda_parameters$random_effects),
y = as.vector(random_effects_estimate)),
log_lambda = data.frame(x = as.vector(obj_multilambda_parameters$log_lambda_vec),
y = as.vector(log_lambda_vec_estimate)))
plots_comparisons_true_and_estimate <- lapply(1:length(list_comparisons_true_and_estimate), function(i){
ggplot(list_comparisons_true_and_estimate[[i]], aes(x=x, y=y))+
geom_point()+
geom_abline(slope = 1, intercept = 0, lty='dashed')+
ggtitle(gsub('_', ' ', names(list_comparisons_true_and_estimate)[i]))+
labs(x='True value', y='Estimate')
})
do.call('grid.arrange', c(plots_comparisons_true_and_estimate, nrow=1))sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sonoma 14.6.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] latex2exp_0.9.6 RColorBrewer_1.1-3 reshape2_1.4.4
#> [4] mutSigExtractor_1.29 ggplot2_3.5.1 HMP_2.0.1
#> [7] dirmult_0.1.3-5 gridExtra_2.3 CompSign_0.1.0
#> [10] RcppEigen_0.3.4.0.2 TMB_1.9.15
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.2.1 farver_2.1.2
#> [3] dplyr_1.1.4 Biostrings_2.70.3
#> [5] bitops_1.0-9 fastmap_1.2.0
#> [7] RCurl_1.98-1.16 GenomicAlignments_1.40.0
#> [9] XML_3.99-0.17 digest_0.6.37
#> [11] rpart_4.1.23 lifecycle_1.0.4
#> [13] cluster_2.1.6 magrittr_2.0.3
#> [15] compiler_4.4.1 rlang_1.1.4
#> [17] tools_4.4.1 utf8_1.2.4
#> [19] yaml_2.3.10 rtracklayer_1.64.0
#> [21] knitr_1.48 labeling_0.4.3
#> [23] S4Arrays_1.2.1 curl_5.2.3
#> [25] DelayedArray_0.28.0 plyr_1.8.9
#> [27] abind_1.4-8 BiocParallel_1.36.0
#> [29] KernSmooth_2.23-24 withr_3.0.1
#> [31] BiocGenerics_0.48.1 grid_4.4.1
#> [33] stats4_4.4.1 fansi_1.0.6
#> [35] caTools_1.18.3 colorspace_2.1-1
#> [37] scales_1.3.0 gtools_3.9.5
#> [39] iterators_1.0.14 MASS_7.3-60.2
#> [41] SummarizedExperiment_1.32.0 cli_3.6.3
#> [43] rmarkdown_2.28 vegan_2.6-8
#> [45] crayon_1.5.3 generics_0.1.3
#> [47] rstudioapi_0.16.0 rjson_0.2.23
#> [49] httr_1.4.7 stringr_1.5.1
#> [51] zlibbioc_1.48.2 splines_4.4.1
#> [53] parallel_4.4.1 XVector_0.42.0
#> [55] restfulr_0.0.15 matrixStats_1.4.1
#> [57] vctrs_0.6.5 Matrix_1.7-0
#> [59] IRanges_2.38.1 S4Vectors_0.40.2
#> [61] foreach_1.5.2 glue_1.8.0
#> [63] codetools_0.2-20 cowplot_1.1.3
#> [65] stringi_1.8.4 gtable_0.3.5
#> [67] GenomeInfoDb_1.38.8 quadprog_1.5-8
#> [69] GenomicRanges_1.56.1 BiocIO_1.12.0
#> [71] rpart.plot_3.1.2 munsell_0.5.1
#> [73] tibble_3.2.1 pillar_1.9.0
#> [75] htmltools_0.5.8.1 gplots_3.2.0
#> [77] BSgenome.Hsapiens.UCSC.hg19_1.4.3 GenomeInfoDbData_1.2.11
#> [79] BSgenome_1.72.0 R6_2.5.1
#> [81] Biobase_2.62.0 evaluate_1.0.0
#> [83] lattice_0.22-6 highr_0.11
#> [85] Rsamtools_2.20.0 Rcpp_1.0.13
#> [87] SparseArray_1.2.4 nlme_3.1-164
#> [89] permute_0.9-7 mgcv_1.9-1
#> [91] xfun_0.47 MatrixGenerics_1.16.0
#> [93] pkgconfig_2.0.3