MSI classification from somatic mutation profiles
Utilization of indel distribution and genomic repeat tracks for microsatellite instability (MSI) prediction
Introduction
Microsatellite instability (MSI) is the result of impaired DNA mismatch repair and constitutes a cellular phenotype of clinical significance in many cancer types, most prominently colorectal cancers, stomach cancers, endometrial cancers, and ovarian cancers. Traditionally, MSI detection is based on a PCR-based assay of 5 microsatellite markers (two mono- and three dinucleotide repeats).
Recently, additional detection approaches have been proposed, utilizing next-generating sequencing data. Here, we demonstrate how a robust MSI classifier can be developed from the distribution of somatic mutations in a tumor, taking advantage of insertions/deletions in repetitive DNA , as well as the presense of deleterious mutations in known MSI-associated genes, i.e. POLE and the MMR gene family.
Methods
Assay-based determination of MSI status of TCGA samples
The assay-based MSI status (MSI-H, MSI-L, and MSS) of tumor samples within four different TCGA cohorts (colon (COAD), rectal (READ), endometrial (UCEC), and stomach (STAD)) were obtained through the TCGAbiolinks package (GDC Data Release 41.0, August 28th, 2024). TCGA used a panel of four mononucleotide repeats (BAT25, BAT26, BAT40, and TGFBRII) and three dinucleotide repeats (D2S123, D5S346, and D17S250) to determine MSI status, except for a subset of CRC genomes evaluated by five mononucleotide markers (BAT25, BAT26, NR21, NR24, and MONO27). Based on this assay, tumor samples were classified as MSI-H (>40% of markers altered), MSI-L (<40% of markers altered), and MSS (no marker altered). As done in previous studies (e.g. (Huang et al. 2015)), we do not here distinguish betweeen MSI-L and MSS tumors, but rather treat both as MSS (or more precisely non-MSI.H) tumors.
TCGA mutation data
Somatic SNVs/Indels of TCGA samples in four cohorts (colon, rectal, endometrial and stomach) were downloaded as annotated MAF files (build hg38) through the TCGAbiolinks package (GDC Data Release 41.0, August 28th, 2024). We excluded noncoding calls (e.g. variants in UTR, regulatory regions (upstream/downstream), intron sequence etc.). The list below shows the number of samples analyzed pr. tumor type, in addition to the median number of somatic mutations (SNVs,InDels) pr. sample:
- Colon Adenocarcinoma (COAD): 431 samples, median number of mutations (SNVs,InDels) pr. sample - 119,6
- Rectum Adenocarcinoma (READ): 154 samples, median number of mutations (SNVs,InDels) pr. sample - 94,4
- Stomach Adenocarcinoma (STAD): 431 samples, median number of mutations (SNVs,InDels) pr. sample - 118,6
- Uterine Corpus Endometrial Adenocarcinoma (UCEC): 504 samples, median number of mutations (SNVs,InDels) pr. sample - 77,6
Sequence repeat data
Two tracks containing the locations of repetitive DNA were downloaded from UCSC (build hg38):
- simpleRepeats
- windowMasker + sDust
Statistical analysis
We implemented all statistical modeling and exploratory analysis in R, in particular utilizing the caret package for predictive modelling and ggplot2 package for data visualization.
Results
We set out to develop a statistical classifier that distinguishes tumors with microsatellite instability from their stable counterparts. Our goal was to perform the classification using data available only within the somatic mutation profile. The training dataset included n = 1065 exome-sequenced tumor samples from four different tumor types (TCGA), making up 70% of the total dataset (leaving 30% (n = 455) for the test set). All of these 1520 samples had already been assayed for MSI status using a mononucleotide marker assay (71 samples in COAD/READ/STAD/UCEC did not have an MSI status, resulting in a slightly reduced total number as to the ones presented above for mutation calls). Based on observations in previous studies (Huang et al. 2015; Cortes-Ciriano et al. 2017) and explorations of the mutation and repeats data, we defined the following quantities per sample as features for MSI classification:
- tumor mutational burden (rate of SNVs, InDels, SNVs + InDels in a defined coding target region size (mb), for TCGA we used 34Mb)
- fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by simpleRepeats
- fraction of indels (among indels + SNVs) in chromosomal regions annotated as repeats by windowMasker + sDust
- fraction of indels (among indels + SNVs) in non-repetitive chromosomal regions
- fraction of indels (among indels + SNVs)
- fraction of SNVs (among indels + SNVs) in regions annotated as repeats by windowMasker + sDust
- Coding mutation in POLD1
- Coding mutation in POLE
- Coding mutation in MSH2
- Coding mutation in MSH3
- Coding mutation in MSH6
- Coding mutation in MLH1
- Coding mutation in MLH3
- Coding mutation in PMS1
- Coding mutation in PMS2
Next, we explored the suggested predictive features defined above for MSS and MSI.H samples within the training dataset. Figures 1,2 and 3 indicate the discriminatory potential of some key features, particularly the indel fraction (both within and outside of repetitive DNA regions).
Figure 1: Enrichment/depletion of mutated samples - coding MMR and POL mutations
Figure 2: TCGA InDel fraction (non-repetitive DNA)
Figure 3: TCGA InDel fraction in repetitive DNA (windowMasker + sDust)
## NULL
Finally, we used the Random Forest algorithm of the caret package to train an MSI classifier. Specifically, predictive features were subject to preprocessing (scaling and transformation (YeoJohnson)), and ten-fold cross-validation was applied (see R command below):
modfit_rf <- caret::train(as.factor(MSI_status) ~ ., method="rf", data=training,
preProcess=c("YeoJohnson","scale"),trControl = caret::trainControl(method = "cv", number = 10),
na.action = na.exclude)The table below indicates the relative importance (scale from 0 to 100) of the variables used for prediction (i.e. using the varImp() method) in the resulting model:
| Feature | Importance |
|---|---|
| tmb_indel | 100 |
| tmb | 64.34 |
| tmb_snv | 64.02 |
| fracNonRepeatIndels | 63.83 |
| fracIndels | 63.09 |
| fracWinMaskIndels | 41.18 |
| fracRepeatIndels | 27.34 |
| fracWinMaskSNVs | 9.61 |
| POLD1 | 6.67 |
| MSH3 | 5.96 |
| MSH6 | 4.72 |
| POLE | 4.69 |
| MLH1 | 2.76 |
| MLH3 | 1.46 |
| PMS2 | 0.68 |
| MSH2 | 0.09 |
| PMS1 | 0 |
We applied the resulting MSI classification model to the test set of 435 tumor samples (comprising MSI.H and MSS samples from colon, rectum, stomach, and endometrium), and obtained the following performance:
- Positive predictive value: 0.979
- Negative predictive value: 0.994