dxsmall

Author

tim

Fitting a prognostic mixture model

Package installation

(this will take a while if you do it locally. maybe install the efficient bspm package first!)

# make fasteru
if (!require("bspm")) install.packages("bspm", quietly=TRUE)
Loading required package: bspm
# hopefully faster now 
suppressPackageStartupMessages(library(bspm))
if (!require("BiocManager")) {
  suppressWarnings(
    suppressMessages(
      install.packages(c("remotes","BiocManager"), quietly=TRUE)
    )
  )
}
Loading required package: BiocManager
suppressWarnings(
  suppressMessages(
    BiocManager::install(c("mclust","survminer","SingleCellExperiment"),
                           quietly=TRUE, update=TRUE, ask=FALSE)
  )
)

The bifurcatoR package is our library where we explore statistical tests meant to identify structured but as-yet-unexplained phenotypic heterogeneity. The dxsmall dataset is packaged with the library, and we will use it for the exercises below.

suppressWarnings(
  suppressMessages(
    BiocManager::install("VanAndelInstitute/bifurcatoR", quietly=TRUE, update=TRUE, ask=FALSE)
  )
)

The dxsmall dataset in the bifurcatoR package

Before we try to fit a mixture to some data, let’s look at what the data represents.

Note that if you install bifurcatoR, there is an interactive web application (using the wonderful iSEE package) which allows you to poke around at both the data and the metadata in the object we load below. In this example, we just focus on one gene in one class of fusion-driven leukemia.

suppressPackageStartupMessages(suppressWarnings(require("bifurcatoR")))
suppressPackageStartupMessages(suppressWarnings(require("SummarizedExperiment")))
suppressPackageStartupMessages(suppressWarnings(require("SingleCellExperiment")))
data("dxsmall", package="bifurcatoR")
show(dxsmall)
class: SingleCellExperiment 
dim: 6 470 
metadata(0):
assays(2): counts logcounts
rownames(6): MECOM PRDM16 ... NCAM1 KDM5D
rowData names(4): gene_id medianLogCounts rangeLogCounts madLogCounts
colnames(470): PAEAKL PAECCE ... PAYFYN PAYIET
colData names(8): AgeGroup Sex ... OS OSI
reducedDimNames(1): UMAP
mainExpName: mRNA
altExpNames(0):

You will notice that the rows of this object correspond to genes (and, specifically, the logcounts assay corresponds to log-normalized abundance of each gene in each subject’s diagnostic sample). The SingleCellExperiment class has some additional cubbyholes for reduced-dimensional visualization of results, but it’s really just a SummarizedExperiment, which is an object that helps keep data married to its metadata (i.e., data-about-the-data). This is important because each column in the object corresponds to a clinical specimen from a rare childhood disease. A schematic is below. If you fire up the iSEEapp function in the bifurcatoR package, you’ll notice that the reducedDims slot holds a UMAP visualization of all the samples’ gene expression. That’s why we use the object.

A SingleCellExperiment

The columns of the object correspond to each subject, which is to say, a pediatric patient who was diagnosed with acute myeloid leukemia and participated in a Children’s Oncology Group (COG) trial, either AAML0531 or AAML1031. The outcomes for each subject that we will model are overall survival (OS) and whether the subject is alive or dead as of last follow-up (OSI). These two quantities are mandatory for performing survival analyses. In addition, since there are two FusionGroup categories in this dataset (MLL, which is encoded by the KMT2A gene; and NUP98, which is encoded by the NUP98 gene), we will use this indicator to focus specifically on MLL fusion cases, where prognosis is wildly more variable than the (uniformly terrible) prognosis of NUP98 fusions. Each of these proteins (MLL and NUP98) have dozens of fusion partner genes that have been observed in leukemia. The precise reason why MLL and NUP98 (the genes for which are located on chromosome 11q23 and 11p15, respectively) are so often rearranged in leukemia is an open research topic.

Fitting a mixture model to gene expression

A mixture model (typically fit using the expectation-maximization, or E-M, algorithm) is useful when confronted with a dataset where more than one discernible group appears to exist, but without complete information on how they differ. The idea is that each group has characteristic generating parameters. The mclust package encapsulates decades of elegant work on this topic, so we tend to reach for it before anything else. (Part of the reason we wrote bifurcatoR package is that we were curious whether anything better exists!)

Let’s use mclust to fit a mixture model on expression of the MECOM gene in MLL-rearranged cases.

suppressPackageStartupMessages(library(mclust))

We will load the SummarizedExperiment package from earlier to manipulate the dxsmall object.

Below, we will plot the fitted mixture density, i.e. the curves for each of the two groups.

suppressPackageStartupMessages(require("SummarizedExperiment"))
fit <- Mclust(logcounts(dxsmall)["MECOM", dxsmall$FusionGroup == "MLL"], G=2, verbose=FALSE)
plot(fit, what="density")

You may want to poke around the mixture fit fit using plot(fit). There are numerous other possibilities for diagnostic plots and the mclust package will ask you which one to display. As a side effect of fitting this mixture of two approximately-Gaussian components, mclust will return its best estimate of each group’s parameters along with the classification of each subject into 1 or 2.

message("Mixing proportions:")
Mixing proportions:
table(fit$classification)/sum(dxsmall$FusionGroup=="MLL")

        1         2 
0.7138965 0.2861035 
message()
message("or equivalently, ")
or equivalently, 
message()
fit$parameters$pro
[1] 0.7138525 0.2861475
message()
message("Mean expression (by class):")
Mean expression (by class):
fit$parameters$mean
        1         2 
 2.552598 12.201440 
message()
message("SD (by class):")
SD (by class):
sqrt(fit$parameters$variance$sigmasq)
[1] 1.733953 1.280319

We will use these results below to fit a data-driven model for outcomes in MLL-rearranged pediatric AML patients, based rather simply on MECOM expression. (It’s not an entirely random choice, but it’s also not currently used for prognostic or predictive stratification in clinical trials.)

Survival analysis

If you are familiar with survival analysis you can skip right ahead to the results. If not, the simplest version is that a Kaplan-Meier curve and its associated log-rank test quantify the odds that two groups would have differences in time-to-event distributions as great as, or greater than, those observed by chance alone. The original paper by Kaplan and Meier has been cited tens of thousands of times, and the resulting plots of survival curves are a staple in medical journals when presenting the results of clinical trials.

suppressPackageStartupMessages(library(mclust))
suppressPackageStartupMessages(library(survival))
suppressPackageStartupMessages(library(bifurcatoR))
data("dxsmall", package="bifurcatoR")
fit <- Mclust(logcounts(dxsmall)["MECOM", dxsmall$FusionGroup == "MLL"], G=2, verbose=FALSE)
colData(dxsmall)$classification <- NA
colData(dxsmall)[dxsmall$FusionGroup=="MLL", "classification"] <-
  fit$classification
(survfit <- survfit(Surv(OS, OSI) ~ classification,
                   data=colData(dxsmall[, dxsmall$FusionGroup == "MLL"])))
Call: survfit(formula = Surv(OS, OSI) ~ classification, data = colData(dxsmall[, 
    dxsmall$FusionGroup == "MLL"]))

   1 observation deleted due to missingness 
                   n events median 0.95LCL 0.95UCL
classification=1 261    112     NA    1482      NA
classification=2 105     67    659     480    1392

The survminer package offers some particularly helpful tools for visualizing the results, complete with 95% confidence intervals. One way to think about confidence intervals is that the less they overlap, the stronger the evidence is for a statistically robust difference in survival (time-to-event, in this case death) between the groups involved. The estimation of survival curves is actually quite fiddly, which is one reason why the Kaplan-Meier and Cox papers have been cited so many times.

In any event, what we want is a figure that communicates the difference between MECOM-low and MECOM-high leukemia in terms of survival, and that’s what survminer will give us.

suppressPackageStartupMessages(require("survminer"))

(Drum roll) The results:

p <- ggsurvplot(survfit, linetype="strata", pval=TRUE, conf.int=TRUE)
print(p)

MECOM expression is a highly informative independent prognostic marker in MLL-rearranged AML.

(Exericise: try fitting the same survival model to the exact fusion, i.e. Surv(OS, OSI) ~ fusion. Which model do you think is more useful for randomizing patients to high-risk arms of a clinical trial?)