MSI classification from somatic mutation profiles

Utilization of indel distribution and genomic repeat tracks for microsatellite instability (MSI) prediction

Author

Affiliation

Department of Tumor Biology, Institute for Cancer Research, Oslo, Norway

Published

September 25, 2024

Introduction

Microsatellite instability (MSI) is the result of impaired DNA mismatch repair and constitutes a cellular phenotype of clinical significance in many cancer types, most prominently colorectal cancers, stomach cancers, endometrial cancers, and ovarian cancers. Traditionally, MSI detection is based on a PCR-based assay of 5 microsatellite markers (two mono- and three dinucleotide repeats).

Recently, additional detection approaches have been proposed, utilizing next-generating sequencing data. Here, we demonstrate how a robust MSI classifier can be developed from the distribution of somatic mutations in a tumor, taking advantage of insertions/deletions in repetitive DNA , as well as the presense of deleterious mutations in known MSI-associated genes, i.e. POLE and the MMR gene family.

Methods

Assay-based determination of MSI status of TCGA samples

The assay-based MSI status (MSI-H, MSI-L, and MSS) of tumor samples within four different TCGA cohorts (colon (COAD), rectal (READ), endometrial (UCEC), and stomach (STAD)) were obtained through the TCGAbiolinks package (GDC Data Release 41.0, August 28th, 2024). TCGA used a panel of four mononucleotide repeats (BAT25, BAT26, BAT40, and TGFBRII) and three dinucleotide repeats (D2S123, D5S346, and D17S250) to determine MSI status, except for a subset of CRC genomes evaluated by five mononucleotide markers (BAT25, BAT26, NR21, NR24, and MONO27). Based on this assay, tumor samples were classified as MSI-H (>40% of markers altered), MSI-L (<40% of markers altered), and MSS (no marker altered). As done in previous studies (e.g. (Huang et al. 2015)), we do not here distinguish betweeen MSI-L and MSS tumors, but rather treat both as MSS (or more precisely non-MSI.H) tumors.

TCGA mutation data

Somatic SNVs/Indels of TCGA samples in four cohorts (colon, rectal, endometrial and stomach) were downloaded as annotated MAF files (build hg38) through the TCGAbiolinks package (GDC Data Release 41.0, August 28th, 2024). We excluded noncoding calls (e.g. variants in UTR, regulatory regions (upstream/downstream), intron sequence etc.). The list below shows the number of samples analyzed pr. tumor type, in addition to the median number of somatic mutations (SNVs,InDels) pr. sample:

Colon Adenocarcinoma (COAD): 431 samples, median number of mutations (SNVs,InDels) pr. sample - 119,6
Rectum Adenocarcinoma (READ): 154 samples, median number of mutations (SNVs,InDels) pr. sample - 94,4
Stomach Adenocarcinoma (STAD): 431 samples, median number of mutations (SNVs,InDels) pr. sample - 118,6
Uterine Corpus Endometrial Adenocarcinoma (UCEC): 504 samples, median number of mutations (SNVs,InDels) pr. sample - 77,6

Sequence repeat data

Two tracks containing the locations of repetitive DNA were downloaded from UCSC (build hg38):

simpleRepeats
windowMasker + sDust

Statistical analysis

We implemented all statistical modeling and exploratory analysis in R, in particular utilizing the caret package for predictive modelling and ggplot2 package for data visualization.

Results

We set out to develop a statistical classifier that distinguishes tumors with microsatellite instability from their stable counterparts. Our goal was to perform the classification using data available only within the somatic mutation profile. The training dataset included n = 1065 exome-sequenced tumor samples from four different tumor types (TCGA), making up 70% of the total dataset (leaving 30% (n = 455) for the test set). All of these 1520 samples had already been assayed for MSI status using a mononucleotide marker assay (71 samples in COAD/READ/STAD/UCEC did not have an MSI status, resulting in a slightly reduced total number as to the ones presented above for mutation calls). Based on observations in previous studies (Huang et al. 2015; Cortes-Ciriano et al. 2017) and explorations of the mutation and repeats data, we defined the following quantities per sample as features for MSI classification:

tumor mutational burden (rate of SNVs, InDels, SNVs + InDels in a defined coding target region size (mb), for TCGA we used 34Mb)
fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by simpleRepeats
fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by windowMasker + sDust
fraction of indels (among indels + SNVs) in non-repetitive chromosomal regions
fraction of indels (among indels + SNVs)
fraction of SNVs (among indels + SNVs) in regions annotated as repeats by windowMasker + sDust
Coding mutation in POLD1
Coding mutation in POLE
Coding mutation in MSH2
Coding mutation in MSH3
Coding mutation in MSH6
Coding mutation in MLH1
Coding mutation in MLH3
Coding mutation in PMS1
Coding mutation in PMS2

Next, we explored the suggested predictive features defined above for MSS and MSI.H samples within the training dataset. Figures 1,2 and 3 indicate the discriminatory potential of some key features, particularly the indel fraction (both within and outside of repetitive DNA regions).

Figure 1: Enrichment/depletion of mutated samples - coding MMR and POL mutations

Figure 2: TCGA InDel fraction (non-repetitive DNA)

Figure 3: TCGA InDel fraction in repetitive DNA (windowMasker + sDust)

## NULL

Finally, we used the Random Forest algorithm of the caret package to train an MSI classifier. Specifically, predictive features were subject to preprocessing (scaling and transformation (YeoJohnson)), and ten-fold cross-validation was applied (see R command below):

modfit_rf <- caret::train(as.factor(MSI_status) ~ ., method="rf", data=training,
preProcess=c("YeoJohnson","scale"),trControl = caret::trainControl(method = "cv", number = 10),
na.action = na.exclude)

The table below indicates the relative importance (scale from 0 to 100) of the variables used for prediction (i.e. using the varImp() method) in the resulting model:

Feature	Importance
tmb_indel	100
tmb	64.34
tmb_snv	64.02
fracNonRepeatIndels	63.83
fracIndels	63.09
fracWinMaskIndels	41.18
fracRepeatIndels	27.34
fracWinMaskSNVs	9.61
POLD1	6.67
MSH3	5.96
MSH6	4.72
POLE	4.69
MLH1	2.76
MLH3	1.46
PMS2	0.68
MSH2	0.09
PMS1	0

We applied the resulting MSI classification model to the test set of 435 tumor samples (comprising MSI.H and MSS samples from colon, rectum, stomach, and endometrium), and obtained the following performance:

Positive predictive value: 0.979
Negative predictive value: 0.994

Author contributions

SN: Collected data from TCGA, built the MSI classifier, wrote up the report.

References

Cortes-Ciriano, Isidro, Sejoon Lee, Woong-Yang Park, Tae-Min Kim, and Peter J Park. 2017. “A Molecular Portrait of Microsatellite Instability Across Multiple Cancers.” Nat. Commun. 8: 15180.

Huang, Mi Ni, John R McPherson, Ioana Cutcutache, Bin Tean Teh, Patrick Tan, and Steven G Rozen. 2015. “MSIseq: Software for Assessing Microsatellite Instability from Catalogs of Somatic Mutations.” Sci. Rep. 5: 13321.