| Min. variants | N samples | N training | N test | Accuracy | Kappa |
|---|---|---|---|---|---|
| 30 | 1422 | 996 | 426 | 0.9812 | 0.9466 |
| 50 | 1258 | 881 | 377 | 0.9788 | 0.9445 |
| 75 | 1033 | 724 | 309 | 0.9871 | 0.9698 |
| 100 | 820 | 575 | 245 | 0.9918 | 0.9828 |
| 125 | 652 | 457 | 195 | 0.9949 | 0.9897 |
| 150 | 537 | 376 | 161 | 0.9752 | 0.9484 |
MSI classification from somatic mutation profiles
Utilization of indel distribution and genomic repeat tracks for microsatellite instability (MSI) prediction
Introduction
Microsatellite instability (MSI) is the result of impaired DNA mismatch repair and constitutes a cellular phenotype of clinical significance in many cancer types, most prominently colorectal cancers, stomach cancers, endometrial cancers, and ovarian cancers. Traditionally, MSI detection is based on a PCR-based assay of 5 microsatellite markers (two mono- and three dinucleotide repeats).
Recently, additional detection approaches have been proposed, utilizing next-generating sequencing data. Here, we demonstrate how a robust MSI classifier can be developed from the distribution of somatic mutations in a tumor, taking advantage of insertions/deletions in repetitive DNA, as well as the presence of deleterious mutations in known MSI-associated genes, i.e. POLE and the MMR gene family.
Methods
Assay-based determination of MSI status of TCGA samples
The assay-based MSI status (MSI-H, MSI-L, and MSS) of tumor samples within four different TCGA cohorts (colon (COAD), rectal (READ), endometrial (UCEC), and stomach (STAD)) were obtained through the GenomicDataCommons R package (release45_20251204). TCGA used a panel of four mononucleotide repeats (BAT25, BAT26, BAT40, and TGFBRII) and three dinucleotide repeats (D2S123, D5S346, and D17S250) to determine MSI status, except for a subset of CRC genomes evaluated by five mononucleotide markers (BAT25, BAT26, NR21, NR24, and MONO27). Based on this assay, tumor samples were classified as MSI-H (>40% of markers altered), MSI-L (<40% of markers altered), and MSS (no marker altered). As done in previous studies (e.g. (Huang et al. 2015)), we do not here distinguish between MSI-L and MSS tumors, but rather treat both as MSS (or more precisely non-MSI.H) tumors.
TCGA mutation data
Somatic SNVs/Indels of TCGA samples in four cohorts (colon, rectal, endometrial and stomach) were downloaded as annotated MAF files (build hg38) through the GenomicDataCommons R package (release45_20251204). We excluded noncoding calls (e.g. variants in UTR, regulatory regions (upstream/downstream), intron sequence etc.). Calls with a sequencing depth (t_depth) < 30, or with an allelic fraction (t_alt_count / t_depth) < 0.05 were excluded. Furthermore, we only considered samples for which at least n = 100 variant calls satisfied the AF/DP criteria (selected on the basis of a threshold sensitivity analysis; see below). The list below shows the number of samples analyzed per tumor type, in addition to the median number of somatic mutations (SNVs, InDels) per sample:
- Colon Adenocarcinoma (COAD): 271 samples, median number of mutations (SNVs, InDels) per sample — 140, 7
- Rectum Adenocarcinoma (READ): 69 samples, median number of mutations (SNVs, InDels) per sample — 122, 5
- Stomach Adenocarcinoma (STAD): 254 samples, median number of mutations (SNVs, InDels) per sample — 186, 8
- Uterine Corpus Endometrial Adenocarcinoma (UCEC): 226 samples, median number of mutations (SNVs, InDels) per sample — 477.5, 80
Sequence repeat data
Two tracks containing the locations of repetitive DNA were downloaded from UCSC (build hg38):
- simpleRepeats
- windowMasker + sDust
Statistical analysis
We implemented all statistical modeling and exploratory analysis in R, in particular utilizing the caret package for predictive modelling and ggplot2 package for data visualization.
Results
Threshold sensitivity analysis
Before finalising the training set, we evaluated classifier performance across a range of minimum mutation-count thresholds (n = 30, 50, 75, 100, 125, 150) to select a scientifically grounded cutoff. For each threshold, a random forest was trained on 70% of the passing samples and evaluated on the held-out 30%, with Cohen’s Kappa as the primary metric.
The n ≥ 100 threshold yielded the best Kappa and was selected for the final model. Performance at n ≥ 125 was comparable but reduced the training set substantially, while n ≥ 150 showed a drop likely attributable to reduced sample size.
Allelic fraction distribution check
Prior to AF filtering, we examined whether low-allelic-fraction calls were disproportionately enriched among repeat-region indels — a pattern that would inflate MSI-H features (fracRepeatIndels, fracIndels) through artefactual signal. The table below summarises the AF distribution by variant class before the AF ≥ 0.05 filter was applied.
| Variant class | N | Median AF | Mean AF | SD AF | 5th pct | 25th pct | 75th pct | Frac < 0.10 | Frac < 0.15 |
|---|---|---|---|---|---|---|---|---|---|
| Indel_nonrepeat | 30750 | 0.250 | 0.256 | 0.101 | 0.110 | 0.178 | 0.321 | 0.023 | 0.160 |
| Indel_repeat | 6618 | 0.285 | 0.289 | 0.099 | 0.139 | 0.217 | 0.353 | 0.005 | 0.069 |
| SNV | 883450 | 0.298 | 0.296 | 0.128 | 0.096 | 0.200 | 0.384 | 0.055 | 0.143 |
Repeat-region indels showed the lowest fraction of sub-0.10 AF calls of any class, indicating that the current AF floor does not disproportionately admit artefactual repeat indels. KS test results: SNV vs repeat indels D = 0.0951 (p = 5.48e-52); SNV vs non-repeat indels D = 0.1859 (p = 0).
Feature exploration
We set out to develop a statistical classifier that distinguishes tumors with microsatellite instability from their stable counterparts. Our goal was to perform the classification using data available only within the somatic mutation profile. The training dataset included n = 575 exome-sequenced tumor samples from four different tumor types (TCGA), making up 70% of the total dataset (leaving 30% (n = 245) for the test set). All of these 820 samples had already been assayed for MSI status using a mononucleotide marker assay. Based on observations in previous studies (Huang et al. 2015; Cortes-Ciriano et al. 2017) and explorations of the mutation and repeats data, we defined the following quantities per sample as features for MSI classification:
- tumor mutational burden (rate of SNVs, InDels, SNVs + InDels in a defined coding target region size (Mb), for TCGA we used 34 Mb)
- fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by simpleRepeats
- fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by windowMasker + sDust
- fraction of indels (among indels + SNVs) in non-repetitive chromosomal regions
- fraction of indels (among indels + SNVs)
- fraction of SNVs (among indels + SNVs) in regions annotated as repeats by windowMasker + sDust
- Coding mutation in POLD1, POLE, MSH2, MSH3, MSH6, MLH1, MLH3, PMS1, PMS2
Next, we explored the suggested predictive features for MSS and MSI.H samples within the training dataset. Figures 1, 2 and 3 indicate the discriminatory potential of some key features, particularly the indel fraction (both within and outside of repetitive DNA regions).
Figure 1: Enrichment/depletion of mutated samples - coding MMR and POL mutations
Figure 2: TCGA InDel fraction (non-repetitive DNA)
Figure 3: TCGA InDel fraction in repetitive DNA (windowMasker + sDust)
Model training and performance
We used the Random Forest algorithm of the caret package to train the MSI classifier. Predictive features were subject to preprocessing (scaling and YeoJohnson transformation), and ten-fold cross-validation was applied:
modfit_rf <- caret::train(as.factor(MSI_status) ~ ., method="rf", data=training,
preProcess=c("YeoJohnson","scale"),trControl = caret::trainControl(method = "cv", number = 10),
na.action = na.exclude)The table below indicates the relative importance (scale from 0 to 100) of the variables in the resulting model:
| Feature | Importance |
|---|---|
| tmb_indel | 100 |
| fracIndels | 78.84 |
| fracNonRepeatIndels | 61.19 |
| tmb | 56.44 |
| tmb_snv | 43.98 |
| fracWinMaskIndels | 31.53 |
| fracRepeatIndels | 28.05 |
| fracWinMaskSNVs | 12.26 |
| POLD1 | 4.55 |
| MSH3 | 1.77 |
| MLH1 | 1.49 |
| MSH6 | 1.43 |
| POLE | 1 |
| MLH3 | 0.35 |
| PMS1 | 0.24 |
| PMS2 | 0.1 |
| MSH2 | 0 |
Applied to the held-out test set of n = 245 samples, the classifier obtained the following performance:
- Sensitivity: 0.989
- Specificity: 0.987
Prediction probability calibration
The random forest outputs a class probability P(MSI-H) alongside the binary prediction. To verify that this probability is well-calibrated (i.e. usable as a confidence signal in PCGR), we compared the mean predicted P(MSI-H) against the observed MSI-H rate within decile bins on the held-out test set.
Brier score: 0.0112 (0 = perfect calibration, 0.25 = uninformative).
| Predicted P(MSI-H) bin | N samples | Mean predicted prob. | Observed MSI-H rate |
|---|---|---|---|
| [0,0.1) | 130 | 0.010 | 0.000 |
| [0.1,0.2) | 12 | 0.142 | 0.000 |
| [0.2,0.3) | 2 | 0.263 | 0.000 |
| [0.3,0.4) | 1 | 0.376 | 0.000 |
| [0.4,0.5) | 4 | 0.460 | 0.250 |
| [0.5,0.6) | 3 | 0.547 | 0.333 |
| [0.6,0.7) | 3 | 0.639 | 1.000 |
| [0.8,0.9) | 4 | 0.867 | 1.000 |
| [0.9,1] | 86 | 0.987 | 1.000 |
The vast majority of samples fall in the extreme bins (P < 0.1 or P > 0.9), confirming that the classifier is strongly decisive. Samples scoring between 0.1 and 0.9 are flagged as uncertain in the PCGR report.
Feature variance by mutation count
To quantify how reliably key MSI features can be estimated as a function of sample mutation count, we computed the median and standard deviation of each feature across mutation-count bins spanning all gold-standard samples (including those below the training threshold). This table is shipped with the runtime data and used by PCGR to annotate predictions from low-mutation samples with an expected feature uncertainty.
| Mutation bin | N samples | fracIndels median | fracIndels SD | fracRepeatIndels median | fracRepeatIndels SD | TMB indel median | TMB indel SD |
|---|---|---|---|---|---|---|---|
| <30 | 89 | 0.0417 | 0.0543 | 0.0000 | 0.0000 | 0.0294 | 0.0355 |
| 30-49 | 164 | 0.0625 | 0.0446 | 0.0000 | 0.1481 | 0.0588 | 0.0497 |
| 50-74 | 225 | 0.0577 | 0.0358 | 0.0000 | 0.1429 | 0.0882 | 0.0654 |
| 75-99 | 213 | 0.0440 | 0.0523 | 0.0000 | 0.1434 | 0.1176 | 0.1345 |
| 100-149 | 283 | 0.0396 | 0.0575 | 0.0000 | 0.1507 | 0.1471 | 0.2206 |
| 150-199 | 104 | 0.0363 | 0.0782 | 0.0000 | 0.1106 | 0.1765 | 0.3641 |
| 200-499 | 122 | 0.0880 | 0.0986 | 0.0000 | 0.1591 | 0.6618 | 1.1881 |
| 500+ | 311 | 0.1878 | 0.0948 | 0.1111 | 0.1141 | 5.2059 | 4.0926 |