MSI classification from somatic mutation profiles

Utilization of indel distribution and genomic repeat tracks for microsatellite instability (MSI) prediction

Published

June 4, 2026

Introduction

Microsatellite instability (MSI) is the result of impaired DNA mismatch repair and constitutes a cellular phenotype of clinical significance in many cancer types, most prominently colorectal cancers, stomach cancers, endometrial cancers, and ovarian cancers. Traditionally, MSI detection is based on a PCR-based assay of 5 microsatellite markers (two mono- and three dinucleotide repeats).

Recently, additional detection approaches have been proposed, utilizing next-generating sequencing data. Here, we demonstrate how a robust MSI classifier can be developed from the distribution of somatic mutations in a tumor, taking advantage of insertions/deletions in repetitive DNA, as well as the presence of deleterious mutations in known MSI-associated genes, i.e. POLE and the MMR gene family.

Methods

Assay-based determination of MSI status of TCGA samples

The assay-based MSI status (MSI-H, MSI-L, and MSS) of tumor samples within four different TCGA cohorts (colon (COAD), rectal (READ), endometrial (UCEC), and stomach (STAD)) were obtained through the GenomicDataCommons R package (release45_20251204). TCGA used a panel of four mononucleotide repeats (BAT25, BAT26, BAT40, and TGFBRII) and three dinucleotide repeats (D2S123, D5S346, and D17S250) to determine MSI status, except for a subset of CRC genomes evaluated by five mononucleotide markers (BAT25, BAT26, NR21, NR24, and MONO27). Based on this assay, tumor samples were classified as MSI-H (>40% of markers altered), MSI-L (<40% of markers altered), and MSS (no marker altered). As done in previous studies (e.g. (Huang et al. 2015)), we do not here distinguish between MSI-L and MSS tumors, but rather treat both as MSS (or more precisely non-MSI.H) tumors.

TCGA mutation data

Somatic SNVs/Indels of TCGA samples in four cohorts (colon, rectal, endometrial and stomach) were downloaded as annotated MAF files (build hg38) through the GenomicDataCommons R package (release45_20251204). We excluded noncoding calls (e.g. variants in UTR, regulatory regions (upstream/downstream), intron sequence etc.). Calls with a sequencing depth (t_depth) < 30, or with an allelic fraction (t_alt_count / t_depth) < 0.05 were excluded. Furthermore, we only considered samples for which at least n = 100 variant calls satisfied the AF/DP criteria (selected on the basis of a threshold sensitivity analysis; see below). The list below shows the number of samples analyzed per tumor type, in addition to the median number of somatic mutations (SNVs, InDels) per sample:

  1. Colon Adenocarcinoma (COAD): 271 samples, median number of mutations (SNVs, InDels) per sample — 140, 7
  2. Rectum Adenocarcinoma (READ): 69 samples, median number of mutations (SNVs, InDels) per sample — 122, 5
  3. Stomach Adenocarcinoma (STAD): 254 samples, median number of mutations (SNVs, InDels) per sample — 186, 8
  4. Uterine Corpus Endometrial Adenocarcinoma (UCEC): 226 samples, median number of mutations (SNVs, InDels) per sample — 477.5, 80

Sequence repeat data

Two tracks containing the locations of repetitive DNA were downloaded from UCSC (build hg38):

  • simpleRepeats
  • windowMasker + sDust

Statistical analysis

We implemented all statistical modeling and exploratory analysis in R, in particular utilizing the caret package for predictive modelling and ggplot2 package for data visualization.

Results

Threshold sensitivity analysis

Before finalising the training set, we evaluated classifier performance across a range of minimum mutation-count thresholds (n = 30, 50, 75, 100, 125, 150) to select a scientifically grounded cutoff. For each threshold, a random forest was trained on 70% of the passing samples and evaluated on the held-out 30%, with Cohen’s Kappa as the primary metric.

Classifier performance by minimum mutation-count threshold.
Min. variants N samples N training N test Accuracy Kappa
30 1422 996 426 0.9812 0.9466
50 1258 881 377 0.9788 0.9445
75 1033 724 309 0.9871 0.9698
100 820 575 245 0.9918 0.9828
125 652 457 195 0.9949 0.9897
150 537 376 161 0.9752 0.9484

The n ≥ 100 threshold yielded the best Kappa and was selected for the final model. Performance at n ≥ 125 was comparable but reduced the training set substantially, while n ≥ 150 showed a drop likely attributable to reduced sample size.

Allelic fraction distribution check

Prior to AF filtering, we examined whether low-allelic-fraction calls were disproportionately enriched among repeat-region indels — a pattern that would inflate MSI-H features (fracRepeatIndels, fracIndels) through artefactual signal. The table below summarises the AF distribution by variant class before the AF ≥ 0.05 filter was applied.

AF distribution by variant class (depth filter only, before AF filter).
Variant class N Median AF Mean AF SD AF 5th pct 25th pct 75th pct Frac < 0.10 Frac < 0.15
Indel_nonrepeat 30750 0.250 0.256 0.101 0.110 0.178 0.321 0.023 0.160
Indel_repeat 6618 0.285 0.289 0.099 0.139 0.217 0.353 0.005 0.069
SNV 883450 0.298 0.296 0.128 0.096 0.200 0.384 0.055 0.143

Repeat-region indels showed the lowest fraction of sub-0.10 AF calls of any class, indicating that the current AF floor does not disproportionately admit artefactual repeat indels. KS test results: SNV vs repeat indels D = 0.0951 (p = 5.48e-52); SNV vs non-repeat indels D = 0.1859 (p = 0).

Feature exploration

We set out to develop a statistical classifier that distinguishes tumors with microsatellite instability from their stable counterparts. Our goal was to perform the classification using data available only within the somatic mutation profile. The training dataset included n = 575 exome-sequenced tumor samples from four different tumor types (TCGA), making up 70% of the total dataset (leaving 30% (n = 245) for the test set). All of these 820 samples had already been assayed for MSI status using a mononucleotide marker assay. Based on observations in previous studies (Huang et al. 2015; Cortes-Ciriano et al. 2017) and explorations of the mutation and repeats data, we defined the following quantities per sample as features for MSI classification:

  • tumor mutational burden (rate of SNVs, InDels, SNVs + InDels in a defined coding target region size (Mb), for TCGA we used 34 Mb)
  • fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by simpleRepeats
  • fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by windowMasker + sDust
  • fraction of indels (among indels + SNVs) in non-repetitive chromosomal regions
  • fraction of indels (among indels + SNVs)
  • fraction of SNVs (among indels + SNVs) in regions annotated as repeats by windowMasker + sDust
  • Coding mutation in POLD1, POLE, MSH2, MSH3, MSH6, MLH1, MLH3, PMS1, PMS2

Next, we explored the suggested predictive features for MSS and MSI.H samples within the training dataset. Figures 1, 2 and 3 indicate the discriminatory potential of some key features, particularly the indel fraction (both within and outside of repetitive DNA regions).

Figure 1: Enrichment/depletion of mutated samples - coding MMR and POL mutations



Figure 2: TCGA InDel fraction (non-repetitive DNA)

Figure 3: TCGA InDel fraction in repetitive DNA (windowMasker + sDust)

Model training and performance

We used the Random Forest algorithm of the caret package to train the MSI classifier. Predictive features were subject to preprocessing (scaling and YeoJohnson transformation), and ten-fold cross-validation was applied:

modfit_rf <- caret::train(as.factor(MSI_status) ~ ., method="rf", data=training,
preProcess=c("YeoJohnson","scale"),trControl = caret::trainControl(method = "cv", number = 10),
na.action = na.exclude)

The table below indicates the relative importance (scale from 0 to 100) of the variables in the resulting model:

Feature Importance
tmb_indel 100
fracIndels 78.84
fracNonRepeatIndels 61.19
tmb 56.44
tmb_snv 43.98
fracWinMaskIndels 31.53
fracRepeatIndels 28.05
fracWinMaskSNVs 12.26
POLD1 4.55
MSH3 1.77
MLH1 1.49
MSH6 1.43
POLE 1
MLH3 0.35
PMS1 0.24
PMS2 0.1
MSH2 0



Applied to the held-out test set of n = 245 samples, the classifier obtained the following performance:

  • Sensitivity: 0.989
  • Specificity: 0.987

Prediction probability calibration

The random forest outputs a class probability P(MSI-H) alongside the binary prediction. To verify that this probability is well-calibrated (i.e. usable as a confidence signal in PCGR), we compared the mean predicted P(MSI-H) against the observed MSI-H rate within decile bins on the held-out test set.

Brier score: 0.0112 (0 = perfect calibration, 0.25 = uninformative).

RF calibration on test set (Brier score = 0.0112). Well-calibrated probabilities track the diagonal.
Predicted P(MSI-H) bin N samples Mean predicted prob. Observed MSI-H rate
[0,0.1) 130 0.010 0.000
[0.1,0.2) 12 0.142 0.000
[0.2,0.3) 2 0.263 0.000
[0.3,0.4) 1 0.376 0.000
[0.4,0.5) 4 0.460 0.250
[0.5,0.6) 3 0.547 0.333
[0.6,0.7) 3 0.639 1.000
[0.8,0.9) 4 0.867 1.000
[0.9,1] 86 0.987 1.000

The vast majority of samples fall in the extreme bins (P < 0.1 or P > 0.9), confirming that the classifier is strongly decisive. Samples scoring between 0.1 and 0.9 are flagged as uncertain in the PCGR report.

Feature variance by mutation count

To quantify how reliably key MSI features can be estimated as a function of sample mutation count, we computed the median and standard deviation of each feature across mutation-count bins spanning all gold-standard samples (including those below the training threshold). This table is shipped with the runtime data and used by PCGR to annotate predictions from low-mutation samples with an expected feature uncertainty.

Expected feature stability by mutation count bin (all gold-standard samples).
Mutation bin N samples fracIndels median fracIndels SD fracRepeatIndels median fracRepeatIndels SD TMB indel median TMB indel SD
<30 89 0.0417 0.0543 0.0000 0.0000 0.0294 0.0355
30-49 164 0.0625 0.0446 0.0000 0.1481 0.0588 0.0497
50-74 225 0.0577 0.0358 0.0000 0.1429 0.0882 0.0654
75-99 213 0.0440 0.0523 0.0000 0.1434 0.1176 0.1345
100-149 283 0.0396 0.0575 0.0000 0.1507 0.1471 0.2206
150-199 104 0.0363 0.0782 0.0000 0.1106 0.1765 0.3641
200-499 122 0.0880 0.0986 0.0000 0.1591 0.6618 1.1881
500+ 311 0.1878 0.0948 0.1111 0.1141 5.2059 4.0926



Author contributions

SN: Collected data from TCGA, built the MSI classifier, wrote up the report.

References

Cortes-Ciriano, Isidro, Sejoon Lee, Woong-Yang Park, Tae-Min Kim, and Peter J Park. 2017. “A Molecular Portrait of Microsatellite Instability Across Multiple Cancers.” Nat. Commun. 8: 15180.
Huang, Mi Ni, John R McPherson, Ioana Cutcutache, Bin Tean Teh, Patrick Tan, and Steven G Rozen. 2015. MSIseq: Software for Assessing Microsatellite Instability from Catalogs of Somatic Mutations.” Sci. Rep. 5: 13321.