CompSign is a toolkit for differential abundance analysis of mutational signatures using a mixed effects Dirichlet-multinominal model (or simpler variations). The compositional nature of mutational signature exposures has often been overlooked but has important implications, as the analyses must be done in relative terms.

Installation

library(devtools)
devtools::install_github("lm687/CompSign")

library(CompSign)
library(gridExtra)
library(TMB)
library(HMP)
library(gridExtra)
library(ggplot2)
theme_set(theme_bw())

Datasets

Several datasets and objects can be loaded as follows using the data() function:

data(PancEndocrine_signaturesMSE)
data(ProstAdenoCA_chrom)
data(res_patient_lambda)
data(simplified_object)
data(ProstAdenoCA_trinucleotide)
data(obj_multilambda)
data(obj_multilambda_parameters)

The package contains the following datasets of exposures of mutational signatures and metadata of the corresponding samples. These datasets are:

Minimal example workflow (Example 1)

This is a minimal example for running the model and extracting the estimated coefficients.

Input dataset

The input dataset is the argument object, which is a list with the following structure: - x: covariate matrix (p x n) - z: matrix of random effects indicating which patient-specific subsample belongs to which patient (n x N) - Y (n x d) - d

Fitting the model

in which simplified_object is a list containing simplified_object$x, simplified_object$z, simplified_object$Y. The object input for wrapper_run_TMB is always a list containing a matrix of covariates x, a matrix of observations Y and a matrix matching observations patients and the labels corresponding to the random effects. All three matrices should have the same number of rows - the number of observations. In this case, simplified_object corresponds to the nucleotides abundances in clonal and subclonal mutations in the Lung-AdenoCA cohort. simplified_object$x, contains two columns (a column of 1 for the intercept and a column of 0 and 1 indicating whether the observation corresponds to clonal (0) or subclonal (1) mutations. simplified_object$z contains a matrix where each observation is paired with its corresponding patient, simplified_object$Y is a matrix of counts for the nucleotide abundances for each patient and (clonal/subclonal) group.

Here we are using the model diagRE_DM, which is a mixed-effect Dirichlet-multinomial model with non-correlated random effects and two dispersion parameters (one per group). See below for the list of alternative models available.

Data included in CompSign

You can load simplified_object, as well as several other example datasets, using the data funtion:

Testing for differential abundance

Depending on the model, the output of wrapper_run_TMB contains the estimated values of different coefficients. The parameters of interest are beta_0, beta_1 and lambda. Both beta are values in log-ratio space in with respect to the last category included in the input object for wrapper_run_TMB (i.e. the last colun of object$Y). beta_0 indicates the coefficient for the baseline - in the two-group example of simplified_object, this baseline reflects the abundance in the first (clonal) group. beta_1 indicates the cofficient for the second column - here representing the difference between clonal and subclonal mutations. If beta_1 is a vector of 0, this indicates that none of the log-ratios of abundances have changed between the two groups, and hence that there is no differential abundance.

To test for differential abundance, a generalised Wald test can be used with the function wald_TMB_wrapper, which gives a p-value as output:

Visualising the output

These beta coefficients are in log-ratio space with respect to a common category (the last column of object$Y). Therefore, it is difficult to determine which of the abundances changed in absolute terms. Assuming most categories do not change (i.e. assuming minimal perturbation), we can detect which categories deviate from the general beta_1 value. Without being able to draw any conclusion about the baseline category, as we do not have standard error values for it, we can detect for which categories in the numerator their beta_1 has extreme values: this can be done as follows with the plot_betas function:

Workflow with mutational signatures (example 2)

In this example workflow we determine differential abundance between clonal and subclonal mutations in Pancreatic neuroendocrine tumors.

sign objects: how to load them

PancEndocrine_signaturesMSE is an S4 object of class sign. sign objects include all information about the name of the samples, the mutation signature exposures (of all signatures, or only of the subset of active signatures as determined by Alexandrov et al. 2020). However, the use of these S4 objects is not necessary for running the models of differential abundance, which instead take a list as argument for the input data.

PancEndocrine_signaturesMSE = load_PCAWG("../inst/extdata/roo/Panc-Endocrine_signaturesMSE_ROO.RDS",
                                         read_directly = T,
                                         typedata = "signaturesMSE", override_warning_X_Z = T)
PancEndocrine_signaturesMSE_v2 = load_PCAWG(ct = "Panc-Endocrine", typedata = "signaturesMSE", path_to_data = "../inst/extdata/", load_all_sigs = F, override_warning_X_Z = T)

Signature extraction

If signature extraction has not been performed, the wrapper function extract_sigs_TMB_obj outputs signature exposures from a matrix of trinucleotide abundances. We use the trinucleotide abundances from the object ProstAdenoCA_trinucleotide. In this wrapper function two methods for signature extraction are implemented: mutSigExtractor and quadratic programming (QP); this can be specified in the signature_fitting_method argument. The version of signature definitions matrix can be specified in the argument signature_version (which can be v2 or v3 of the COSMIC signatures. Alternatively, any signature definition matrix can be specified in the signature_definition, where the signature definition should be a F x K signature definition matrix, where K is the number of signatures and F the number of features (for COSMIC signatures, F=96). The active signature list can be specified in subset_signatures.

active_sigs_in_ct <- c("SBS1", "SBS2", "SBS3", "SBS5", "SBS8", "SBS13", "SBS18", "SBS37", "SBS40", "SBS41")
exposures_from_mutSigExtractor <- extract_sigs_TMB_obj(dataset_obj_trinucleotide=ProstAdenoCA_trinucleotide,
                             subset_signatures = active_sigs_in_ct,
                             signature_definition=NULL, signature_fitting_method = 'mutSigExtractor')
#> Loading required package: mutSigExtractor
#> [1] "v3"
#> Subsetting to active signatures
#> Running signature extraction
#> ... done.

exposures_from_QP <- extract_sigs_TMB_obj(dataset_obj_trinucleotide=ProstAdenoCA_trinucleotide,
                              subset_signatures = active_sigs_in_ct,
                              signature_definition=NULL, signature_fitting_method = 'QP')
#> [1] "v3"
#> Subsetting to active signatures
#> Running signature extraction
#> ... done.

plot(log(as.vector(exposures_from_mutSigExtractor$Y)), log(as.vector(exposures_from_QP$Y)),
     xlab='Exposures from mutSigExtractor', ylab='Exposures from quadratic programming')
abline(coef = c(0,1), lty='dashed')

Visualisation of mutational signature exposures


createBarplot(normalise_rw(non_duplicated_rows(PancEndocrine_signaturesMSE$Y)),
              order_labels = names(sort(non_duplicated_rows(PancEndocrine_signaturesMSE$Y)[,'SBS3'],
                                        decreasing = F)), remove_labels=T,
              custom_color_palette=color_list)+ggtitle('Sorted by SBS3')

Subsetting the exposures

We create a simplified object containing exposures of fewer signatures (i.e. a subcomposition of the original signature vectors):

simplified_object <- give_subset_sigs_TMBobj(PancEndocrine_signaturesMSE,
                                             sigs_to_remove = c('SBS13', 'SBS17a', 'SBS17b', 'SBS30'))

In which simplified_object is a list containing simplified_object$x, simplified_object$z, simplified_object$Y, simplified_object$d (see Input dataset section below)

do.call('grid.arrange', list(grobs=lapply(split_matrix_in_half(simplified_object$Y), function(i) createBarplot(normalise_rw(i), remove_labels = T, custom_color_palette=color_list)), nrow=1))
#> Creating plot... it might take some time if the data are large. Number of samples: 70
#> Creating plot... it might take some time if the data are large. Number of samples: 70

Variations of the model

There are several variations on the model, each with a particular combination of the following: - Base model: Dirichlet-multinomial in most cases, but two simpler models – multinomial models diagRE_M and fullRE_M are implemented for comparison. - Random effects: correlated or not (e.g. uncorrelated in diagRE_DM, correlated in fullRE_DM). - Number of dispersion parameters: generally one per group (e.g. in diagRE_DM), but possibly fewer or more if specified in the name (e.g. one single dipersion parameter in diagRE_DM_singlelambda, one dispersion parameter per patient in diagRE_DM_patientlambda).

The names of the model correspond to {random effects}_{base model}_{particularities of the dispersion parameter, if any}.

Using an alternative model, and parameter interpretation (example 3)

Name of model (for user)	Description and usage
`FE_DM_singlelambda`	Dirichlet-multinomial with no RE and a single `lambda` which can be used in a two-group comparison (if we consider the dipersion to be the same in both groups), a multi-group comparison, or any regression setting.
`FE_DM`	Dirichlet-multinomial with no random effects and two `lambda`, used for the comparison between two groups in which we want to account for group-specific dispersion. Model slightly more complex than `FE_DM_singlelambda`
`diagRE_M`	Multinomial with non-correlated multivariate effects. Simpler model than `diagRE_DM`, as no dispersion is included
`diagRE_DM_singlelambda`	Dirichlet-multinomial with non-correlated multivariate RE and one: equivalent of diagRE_DM but with a shared . It can be used in a two-group comparison (if we consider the dipersion to be the same in both groups), a multi-group comparison, or any regression setting.
`singleRE_DM`	Dirichlet-multinomial with a single RE intercept and two `lambda`: simple model in which random effects are not multivariate, but where group-specific dispersion is needed, in a two-group comparison.
`diagRE_DM`	Dirichlet-multinomial with independent RE and two `lambda`: used most commonly throughout the paper. The data are matched to warrant multivariate RE, but correlations between categories are not explicitly modelled. Faster than `fullRE_DM` with often comparable results for `beta_1`.
`fullRE_DM_singlelambda`	Dirichlet-multinomial with correlated RE and two lambdab: equivalent of `diagR_DM_singlelambda` that can be used if categories have strong correlations.
`fullRE_M`	Multinomial with correlated RE: assuming no overdispersion, it can be used in any regression setting
`fullRE_DM`	Dirichlet-multinomial with correlated RE and two `lambda`: model with multivariate RE with correlations modelled explicitly. As a `(K-1)*(K-1)` covariance matrix is estimated, this model is not recommended where the ratio of signatures to samples is high.
`diagRE_DM_patientlambda`	Dirichlet-multinomial with non-correlated RE and one lambda$ per patient: model more complex than `diag_RE_DM` that can be used when there are several observations per patient and we wish to include a patient-specific dispersion parameter
`fullRE_DM_patientlambda`	Dirichlet-multinomial with correlated RE and one per patient: equivalent of `diagRE_DM_patientlambda`, but modelling correlations between categories

diagDM_no_small_sigs <- wrapper_run_TMB(object = simplified_object,
                                        model = "diagRE_DM_singlelambda", use_nlminb=T, smart_init_vals=F)

diagDM_no_small_sigs
#> sdreport(.) result
#>                                Estimate   Std. Error
#> beta0_SBS1_wrt_SBS39         2.03832104   0.22363954
#> beta1_SBS2_wrt_SBS39        -0.70454525   0.29906580
#> beta0_SBS3_wrt_SBS39         0.98806820   0.24807094
#> beta1_SBS5_wrt_SBS39         0.27978736   0.31041289
#> beta0_SBS6_wrt_SBS39         2.35669154   0.25394023
#> beta1_SBS8_wrt_SBS39        -0.54662130   0.29140102
#> beta0_SBS9_wrt_SBS39         3.72920999   0.22353776
#> beta1_SBS11_wrt_SBS39       -0.50234381   0.27244515
#> beta0_SBS26_wrt_SBS39        1.12048603   0.26120868
#> beta1_SBS36_wrt_SBS39       -0.40359421   0.31695478
#> beta0_SBS1_wrt_SBS39         2.85335968   0.20725141
#> beta1_SBS2_wrt_SBS39        -0.01029434   0.27904434
#> beta0_SBS3_wrt_SBS39         1.39433292   0.22601115
#> beta1_SBS5_wrt_SBS39        -0.02357902   0.30765208
#> beta0_SBS6_wrt_SBS39         0.88267656   0.24738726
#> beta1_SBS8_wrt_SBS39        -0.11575696   0.32338552
#> beta0_SBS9_wrt_SBS39         1.40320978   0.26009746
#> beta1_SBS11_wrt_SBS39       -0.24411805   0.30749405
#> beta0_SBS26_wrt_SBS39        0.56690763   0.39567437
#> beta1_SBS36_wrt_SBS39       -0.11893183   0.32177527
#> logs_sd_RE_SBS1_wrt_SBS39   -1.47032407   0.45921974
#> logs_sd_RE_SBS2_wrt_SBS39   -0.88112230   0.33980483
#> logs_sd_RE_SBS3_wrt_SBS39    0.16826194   0.25484956
#> logs_sd_RE_SBS5_wrt_SBS39   -0.49531043   0.23244324
#> logs_sd_RE_SBS6_wrt_SBS39   -0.19668190   0.30654793
#> logs_sd_RE_SBS8_wrt_SBS39  -11.95985277 455.21714653
#> logs_sd_RE_SBS9_wrt_SBS39  -12.11088723 318.80943762
#> logs_sd_RE_SBS11_wrt_SBS39  -1.26108779   0.53819666
#> logs_sd_RE_SBS26_wrt_SBS39   0.03013492   0.26234018
#> logs_sd_RE_SBS36_wrt_SBS39   1.78101081   0.22929287
#> log_lambda                   2.85822798   0.05954625
#> Maximum gradient component: 0.01596245

We can see how in this case we only have a single estimate for the dispersion parameter (log_lambda). The number of $\beta$ (20) corresonds to $2*(d-1)$ where $d$ is the number of mutations in the object (11, as seen in the bar plots above). The last signature or category, SBS39, is used as baseline signature in the ALR transformation. All beta parameters are in ALR-transformed space, using SBS39 as baseline. $\beta_0$ and $\beta_1$ results are intercalated, i.e. the $\beta$ in the first row corresponds to the $\beta_0$ of SBS1 wrt (with respect to) SBS39, the $\beta$ in the third row corresponds to the $\beta_0$ of SBS2 wrt (with respect to) SBS39, the $\beta$ in the first row corresponds to the $\beta_0$ of SBS1 wrt (with respect to) SBS39, etc.

plot_betas(diagDM_no_small_sigs)
#> Loading required package: latex2exp

Using an alternative model with patient-specific dispersion (example 4)

In a scenario where we have several samples for each patient, which can still be grouped in two or more groups (fixed effects), we might be interested in patient-specific dispersion parameters, besides the patient-specific intercepts that we had. This has been implemented:

Simulating data


n <- 40 ## patients
num_samples_per_patient = 30
stopifnot((num_samples_per_patient %% 2)==0)
z <- do.call('rbind', replicate(expr = diag(n), n= num_samples_per_patient, simplify = F))
x <- cbind(rep(1, n*num_samples_per_patient),
           c(rep(0, (n*num_samples_per_patient)/2),
             rep(1, (n*num_samples_per_patient)/2)))
d = 5 ## number of signatures

random_effects <- matrix(runif(n*(d-1)), nrow=n, ncol=d-1)
fixed_effects <- matrix(runif(2*(d-1)), nrow=2, ncol=d-1)

alpha_mat <- x %*% fixed_effects + z %*% random_effects
alpha_mat_softmax = softmax( cbind(alpha_mat, 0) )

## lambda per patient
log_lambda_vec = runif(n, 3, 4)

## get patient-specific overdispersions
log_lambda_vec_per_obs <- z %*% log_lambda_vec

## simulate counts
Nm = rep(1000, n*num_samples_per_patient) ## fixed number of mutations
y = matrix(NA, nrow = n*num_samples_per_patient, ncol=d)
for(l in 1:(n*num_samples_per_patient)){
  ## for each patient-specific subsample
  alpha_l = alpha_mat_softmax[l,]*exp(log_lambda_vec_per_obs[l])
  y[l,] = HMP::Dirichlet.multinomial(Nrs = Nm[l], shape = alpha_l)
  
}

##' create object with information about covariates, which observations correspond
##' to which patients, and the input (counts)
obj_multilambda <- list(Y = y, x=x, z = z, num_individuals = n)

Estimate parameters

res_patient_lambda <- wrapper_run_TMB(obj_multilambda, model = 'diagRE_DM_patientlambda')

data(res_patient_lambda)
data(obj_multilambda)
data(obj_multilambda_parameters)

Comparison of estimates with ground truth

The true parameters used for the simulation are found in the object obj_multilambda_parameters.

and two additional sets of parameters suffer from a bias due to having used the uncorrelated version of the model (i.e. diagRE_DM_patientlambda instead of fullDMpatientlambda):

fixed_effects_estimate <- matrix(python_like_select_name(res_patient_lambda$par.fixed, 'beta'), nrow=2)
random_effects_estimate <- matrix(res_patient_lambda$par.random, ncol=ncol(obj_multilambda_parameters$fixed_effects))
log_lambda_vec_estimate <- python_like_select_name(res_patient_lambda$par.fixed, 'log_lambda')

list_comparisons_true_and_estimate = list(
  all_betas = data.frame(x = as.vector(obj_multilambda_parameters$fixed_effects),
                         y  = as.vector(fixed_effects_estimate)),
  beta_1 =  data.frame(x = obj_multilambda_parameters$fixed_effects[2,],
                       y  = fixed_effects_estimate[2,]),
  random_intercepts = data.frame(x = as.vector(obj_multilambda_parameters$random_effects), 
                                 y = as.vector(random_effects_estimate)),
  log_lambda =  data.frame(x = as.vector(obj_multilambda_parameters$log_lambda_vec), 
                           y = as.vector(log_lambda_vec_estimate)))

plots_comparisons_true_and_estimate <- lapply(1:length(list_comparisons_true_and_estimate), function(i){
  ggplot(list_comparisons_true_and_estimate[[i]], aes(x=x, y=y))+
    geom_point()+
    geom_abline(slope = 1, intercept = 0, lty='dashed')+
    ggtitle(gsub('_', ' ', names(list_comparisons_true_and_estimate)[i]))+
    labs(x='True value', y='Estimate')
})
do.call('grid.arrange', c(plots_comparisons_true_and_estimate, nrow=1))

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sonoma 14.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/New_York
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] latex2exp_0.9.6      RColorBrewer_1.1-3   reshape2_1.4.4      
#>  [4] mutSigExtractor_1.29 ggplot2_3.5.1        HMP_2.0.1           
#>  [7] dirmult_0.1.3-5      gridExtra_2.3        CompSign_0.1.0      
#> [10] RcppEigen_0.3.4.0.2  TMB_1.9.15          
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.2.1                  farver_2.1.2                     
#>  [3] dplyr_1.1.4                       Biostrings_2.70.3                
#>  [5] bitops_1.0-9                      fastmap_1.2.0                    
#>  [7] RCurl_1.98-1.16                   GenomicAlignments_1.40.0         
#>  [9] XML_3.99-0.17                     digest_0.6.37                    
#> [11] rpart_4.1.23                      lifecycle_1.0.4                  
#> [13] cluster_2.1.6                     magrittr_2.0.3                   
#> [15] compiler_4.4.1                    rlang_1.1.4                      
#> [17] tools_4.4.1                       utf8_1.2.4                       
#> [19] yaml_2.3.10                       rtracklayer_1.64.0               
#> [21] knitr_1.48                        labeling_0.4.3                   
#> [23] S4Arrays_1.2.1                    curl_5.2.3                       
#> [25] DelayedArray_0.28.0               plyr_1.8.9                       
#> [27] abind_1.4-8                       BiocParallel_1.36.0              
#> [29] KernSmooth_2.23-24                withr_3.0.1                      
#> [31] BiocGenerics_0.48.1               grid_4.4.1                       
#> [33] stats4_4.4.1                      fansi_1.0.6                      
#> [35] caTools_1.18.3                    colorspace_2.1-1                 
#> [37] scales_1.3.0                      gtools_3.9.5                     
#> [39] iterators_1.0.14                  MASS_7.3-60.2                    
#> [41] SummarizedExperiment_1.32.0       cli_3.6.3                        
#> [43] rmarkdown_2.28                    vegan_2.6-8                      
#> [45] crayon_1.5.3                      generics_0.1.3                   
#> [47] rstudioapi_0.16.0                 rjson_0.2.23                     
#> [49] httr_1.4.7                        stringr_1.5.1                    
#> [51] zlibbioc_1.48.2                   splines_4.4.1                    
#> [53] parallel_4.4.1                    XVector_0.42.0                   
#> [55] restfulr_0.0.15                   matrixStats_1.4.1                
#> [57] vctrs_0.6.5                       Matrix_1.7-0                     
#> [59] IRanges_2.38.1                    S4Vectors_0.40.2                 
#> [61] foreach_1.5.2                     glue_1.8.0                       
#> [63] codetools_0.2-20                  cowplot_1.1.3                    
#> [65] stringi_1.8.4                     gtable_0.3.5                     
#> [67] GenomeInfoDb_1.38.8               quadprog_1.5-8                   
#> [69] GenomicRanges_1.56.1              BiocIO_1.12.0                    
#> [71] rpart.plot_3.1.2                  munsell_0.5.1                    
#> [73] tibble_3.2.1                      pillar_1.9.0                     
#> [75] htmltools_0.5.8.1                 gplots_3.2.0                     
#> [77] BSgenome.Hsapiens.UCSC.hg19_1.4.3 GenomeInfoDbData_1.2.11          
#> [79] BSgenome_1.72.0                   R6_2.5.1                         
#> [81] Biobase_2.62.0                    evaluate_1.0.0                   
#> [83] lattice_0.22-6                    highr_0.11                       
#> [85] Rsamtools_2.20.0                  Rcpp_1.0.13                      
#> [87] SparseArray_1.2.4                 nlme_3.1-164                     
#> [89] permute_0.9-7                     mgcv_1.9-1                       
#> [91] xfun_0.47                         MatrixGenerics_1.16.0            
#> [93] pkgconfig_2.0.3

Vignette of the package `CompSign`

Installation

Datasets

Minimal example workflow (Example 1)

Input dataset

Fitting the model

Data included in CompSign

Testing for differential abundance

Visualising the output

Workflow with mutational signatures (example 2)

`sign` objects: how to load them

Signature extraction

Visualisation of mutational signature exposures

Subsetting the exposures

Variations of the model

Using an alternative model, and parameter interpretation (example 3)

Using an alternative model with patient-specific dispersion (example 4)

Simulating data

Estimate parameters

Comparison of estimates with ground truth

Session info

Vignette of the package CompSign

Installation

Datasets

Minimal example workflow (Example 1)

Input dataset

Fitting the model

Data included in CompSign

Testing for differential abundance

Visualising the output

Workflow with mutational signatures (example 2)

sign objects: how to load them

Signature extraction

Visualisation of mutational signature exposures

Subsetting the exposures

Variations of the model

Using an alternative model, and parameter interpretation (example 3)

Using an alternative model with patient-specific dispersion (example 4)

Simulating data

Estimate parameters

Comparison of estimates with ground truth

Session info

Vignette of the package `CompSign`

`sign` objects: how to load them