Utilization of indel distribution and genomic repeat tracks for microsatellite instability prediction
Microsatellite instability (MSI) is the result of impaired DNA mismatch repair and constitutes a cellular phenotype of clinical significance in many cancer types, most prominently colorectal cancers, stomach cancers, endometrial cancers, and ovarian cancers. Traditionally, MSI detection is based on a PCR-based assay of 5 microsatellite markers (two mono- and three dinucleotide repeats).
Recently, additional detection approaches have been proposed that utilize next-generation sequencing data. Here, we demonstrate how a robust MSI classifier can be developed from the distribution of somatic mutations in a tumor, taking advantage of insertions/deletions in repetitive DNA, as well as the presence of deleterious mutations in known MSI-associated genes, i.e. POLE and the MMR gene family.
The assay-based MSI status (MSI-H, MSI-L, and MSS) of tumor samples within four different TCGA cohorts (colon (COAD), rectal (READ), endometrial (UCEC), and stomach (STAD)) was obtained through the TCGAbiolinks package (GDC Data Release 27, October 29th, 2020). TCGA used a panel of four mononucleotide repeats (BAT25, BAT26, BAT40, and TGFBRII) and three dinucleotide repeats (D2S123, D5S346, and D17S250) to determine MSI status, except for a subset of CRC genomes evaluated by five mononucleotide markers (BAT25, BAT26, NR21, NR24, and MONO27). Based on this assay, tumor samples were classified as MSI-H (>40% of markers altered), MSI-L (<40% of markers altered), and MSS (no marker altered). As done in previous studies (e.g. Huang et al. 2015), we do not distinguish between MSI-L and MSS tumors here, but rather treat both as MSS (or, more precisely, non-MSI.H) tumors.
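The labelling rule above can be written down as a short R sketch. The function and column names (`classify_msi`, `collapse_msi`, `n_altered`, `n_markers`) are hypothetical and the marker counts are toy values. The text does not say how a sample with exactly 40% altered markers is labelled, so the sketch assumes such a sample falls into MSI-L.

```r
# Hypothetical helper: map marker counts to an assay label using the
# thresholds quoted in the text (no marker altered = MSS, >40% = MSI-H).
# Assumption: exactly 40% altered is treated as MSI-L.
classify_msi <- function(n_altered, n_markers) {
  frac <- n_altered / n_markers
  ifelse(n_altered == 0, "MSS",
         ifelse(frac > 0.4, "MSI-H", "MSI-L"))
}

# Collapse the three assay labels into the two classes used for modelling:
# MSI-L is merged with MSS, MSI-H becomes "MSI.H".
collapse_msi <- function(status) {
  ifelse(status == "MSI-H", "MSI.H", "MSS")
}

# Toy assay results (not TCGA data)
assay <- data.frame(
  sample    = c("S1", "S2", "S3", "S4"),
  n_altered = c(0, 1, 3, 5),
  n_markers = c(7, 7, 7, 5),
  stringsAsFactors = FALSE
)
assay$status     <- classify_msi(assay$n_altered, assay$n_markers)
assay$MSI_status <- collapse_msi(assay$status)
print(assay)
```

Keeping the three-level label in its own column makes it easy to revisit the MSI-L/MSS merge later without re-deriving the assay calls.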
Somatic SNVs/InDels of TCGA samples in four cohorts (colon, rectal, endometrial, and stomach) were downloaded as annotated MAF files (build hg38) through the TCGAbiolinks package (GDC Data Release 22, January 16th, 2020). Within each cohort, mutations called by different variant callers (mutect2, varscan2, muse, and somaticsniper) were combined. To minimize the impact of false positive calls, we excluded SNVs and InDels called by a single algorithm only. Furthermore, we excluded noncoding calls (e.g. variants in UTRs, upstream/downstream regulatory regions, and intron sequence). The list below shows the number of samples analyzed per tumor type, in addition to the median number of somatic mutations (SNVs, InDels) per sample:
Two tracks containing the locations of repetitive DNA were downloaded from UCSC (build hg38):
We implemented all statistical modeling and exploratory analysis in R, using the caret package for predictive modeling and the ggplot2 package for data visualization.
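As an illustration of the call filtering described earlier (keep variants reported by at least two callers, drop noncoding calls), the following base R sketch works on a toy long-format table with one row per variant and caller. The column names, the variant identifiers, and the set of noncoding consequence terms are assumptions made for this example and are not taken from the actual MAF processing.

```r
# Toy call table: one row per (sample, variant, caller); not real TCGA calls.
calls <- data.frame(
  sample  = c("S1", "S1", "S1", "S2", "S2", "S2", "S2"),
  variant = c("chr1:100:A>T", "chr1:100:A>T", "chr2:200:delA",
              "chr3:300:G>C", "chr3:300:G>C",
              "chr4:400:insT", "chr4:400:insT"),
  caller  = c("mutect2", "muse", "varscan2",
              "mutect2", "varscan2",
              "mutect2", "varscan2"),
  consequence = c("missense_variant", "missense_variant", "frameshift_variant",
                  "intron_variant", "intron_variant",
                  "frameshift_variant", "frameshift_variant"),
  stringsAsFactors = FALSE
)

# Illustrative (assumed) set of consequence terms treated as noncoding
noncoding <- c("intron_variant", "5_prime_UTR_variant", "3_prime_UTR_variant",
               "upstream_gene_variant", "downstream_gene_variant")

# Count distinct callers supporting each variant within each sample
key <- paste(calls$sample, calls$variant)
n_callers <- tapply(calls$caller, key, function(x) length(unique(x)))
calls$n_callers <- as.integer(n_callers[key])

# Keep coding variants supported by two or more callers, one row per variant
keep <- calls$n_callers >= 2 & !(calls$consequence %in% noncoding)
consensus <- unique(calls[keep, c("sample", "variant", "consequence")])
print(consensus)
```

In this toy input, the single-caller deletion in S1 and the two-caller intronic SNV in S2 are both dropped, leaving one variant per sample.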
We set out to develop a statistical classifier that distinguishes tumors with microsatellite instability from their stable counterparts. Our goal was to perform the classification using only data available within the somatic mutation profile. The training dataset included 999 exome-sequenced tumor samples from four different tumor types (TCGA), making up 70% of the total dataset and leaving 30% (427 samples) for the test set. All of these 1426 samples had already been assayed for MSI status using a mononucleotide marker assay (71 samples in COAD/READ/STAD/UCEC did not have an MSI status, so the total is slightly lower than the sample numbers presented above for mutation calls). Based on observations in previous studies (Huang et al. 2015; Cortes-Ciriano et al. 2017) and explorations of the mutation and repeats data, we defined the following quantities per sample as features for MSI classification:
Next, we explored the suggested predictive features defined above for MSS and MSI.H samples within the training dataset. Figures 1, 2, and 3 indicate the discriminatory potential of some key features, particularly the indel fraction (both within and outside of repetitive DNA regions).
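To make the indel-fraction features concrete, the sketch below derives them from a toy mutation table and a toy repeat track. It rests on assumptions: each fraction uses the total number of mutations in the sample as its denominator, repeat intervals are treated as closed (UCSC tracks are 0-based and half-open, so real coordinates would need adjusting), and the helper `in_repeat` and all column names are hypothetical. The feature names are borrowed from the importance table, but their exact definitions in the model may differ.

```r
# Toy somatic mutations for two samples (not real data)
muts <- data.frame(
  sample = c("S1", "S1", "S1", "S1", "S2", "S2", "S2", "S2"),
  chrom  = "chr1",
  pos    = c(105, 250, 400, 910, 120, 300, 650, 905),
  type   = c("INDEL", "INDEL", "SNV", "INDEL", "SNV", "SNV", "INDEL", "SNV"),
  stringsAsFactors = FALSE
)

# Toy repeat track; intervals treated as closed [start, end] for simplicity
repeats <- data.frame(chrom = "chr1", start = c(100, 900), end = c(150, 950),
                      stringsAsFactors = FALSE)

# Hypothetical helper: does each position fall inside any repeat interval?
in_repeat <- function(chrom, pos, track) {
  vapply(seq_along(pos), function(i) {
    any(track$chrom == chrom[i] & track$start <= pos[i] & pos[i] <= track$end)
  }, logical(1))
}
muts$in_repeat <- in_repeat(muts$chrom, muts$pos, repeats)

# Per-sample fractions (assumed denominator: all mutations in the sample)
features <- do.call(rbind, lapply(split(muts, muts$sample), function(d) {
  is_indel <- d$type == "INDEL"
  data.frame(
    sample              = d$sample[1],
    n_mut               = nrow(d),
    fracIndels          = mean(is_indel),
    fracRepeatIndels    = sum(is_indel & d$in_repeat) / nrow(d),
    fracNonRepeatIndels = sum(is_indel & !d$in_repeat) / nrow(d),
    stringsAsFactors = FALSE
  )
}))
print(features)
```

For genome-scale tracks, an interval-overlap package would be far more efficient than this row-by-row lookup; the loop is used here only to keep the example dependency-free.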
Finally, we used the Random Forest algorithm of the caret package to train an MSI classifier. Specifically, predictive features were subject to preprocessing (scaling and transformation (YeoJohnson)), and ten-fold cross-validation was applied (see R command below):
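```r
# Train a random forest on all features in `training`, with Yeo-Johnson
# transformation and scaling of predictors and 10-fold cross-validation.
modfit_rf <- caret::train(
  as.factor(MSI_status) ~ .,
  data = training,
  method = "rf",
  preProcess = c("YeoJohnson", "scale"),
  trControl = caret::trainControl(method = "cv", number = 10),
  na.action = na.exclude
)
```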
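The training command expects a `training` data frame. The text does not state how the 70/30 partition was drawn, so the following is only a sketch of one reasonable option: a class-stratified split written in base R, with a hypothetical helper `stratified_split` and toy labels. caret also ships `createDataPartition()` for this purpose, which may well be what was used.

```r
set.seed(42)  # arbitrary seed, for reproducibility of the toy example

# Toy labels (not TCGA data): 20 MSI.H and 80 MSS samples
labels <- data.frame(
  sample     = paste0("S", 1:100),
  MSI_status = rep(c("MSI.H", "MSS"), times = c(20, 80)),
  stringsAsFactors = FALSE
)

# Hypothetical helper: draw a fraction p of the row indices within each class
stratified_split <- function(y, p = 0.7) {
  idx <- unlist(lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = round(p * length(i)))]
  }))
  sort(unname(idx))
}

train_idx <- stratified_split(labels$MSI_status, p = 0.7)
training  <- labels[train_idx, ]
testing   <- labels[-train_idx, ]

# Class balance is preserved in both partitions
print(table(training$MSI_status))
print(table(testing$MSI_status))
```

Stratifying by class matters here because MSI.H tumors are the minority class; a purely random split could leave the test set with too few of them to estimate sensitivity reliably.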
The table below indicates the relative importance (on a scale from 0 to 100) of the variables used for prediction in the resulting model, as computed with caret's varImp() function:
| Feature | Importance |
|---|---|
| tmb_indel | 100 |
| fracIndels | 75.66 |
| fracNonRepeatIndels | 74.62 |
| tmb | 73.47 |
| tmb_snv | 52.26 |
| fracRepeatIndels | 38.99 |
| fracWinMaskIndels | 30.56 |
| fracWinMaskSNVs | 11.06 |
| POLD1 | 8 |
| POLE | 2.73 |
| MLH1 | 2.02 |
| MSH6 | 1.54 |
| PMS2 | 1.27 |
| MSH2 | 1.11 |
| MSH3 | 1.07 |
| MLH3 | 0.47 |
| PMS1 | 0 |
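The 0 to 100 scale in the table is consistent with a min-max rescaling of raw importance scores, which, to our understanding, is what `varImp()` applies by default (`scale = TRUE`). The sketch below shows that rescaling on hypothetical raw scores; `rescale_importance` and the feature names are made up for illustration and the numbers are not the model's actual raw importances.

```r
# Min-max rescaling: the least important feature maps to 0,
# the most important feature maps to 100.
rescale_importance <- function(raw) {
  shifted <- raw - min(raw)
  100 * shifted / max(shifted)
}

# Hypothetical raw importance scores
raw <- c(featA = 42.0, featB = 21.5, featC = 1.0)
scaled <- rescale_importance(raw)
print(round(scaled, 2))
```

One consequence of this rescaling is that a score of 0 (as for PMS1) means "least important among the features used", not necessarily "no contribution at all".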
We applied the resulting MSI classification model to the test set of 427 tumor samples (comprising MSI.H and MSS samples from colon, rectum, stomach, and endometrium), and obtained the following performance:
SN: Collected data from TCGA, built the MSI classifier, wrote up the report.
Cortes-Ciriano, Isidro, Sejoon Lee, Woong-Yang Park, Tae-Min Kim, and Peter J Park. 2017. “A Molecular Portrait of Microsatellite Instability Across Multiple Cancers.” Nat. Commun. 8: 15180.
Huang, Mi Ni, John R McPherson, Ioana Cutcutache, Bin Tean Teh, Patrick Tan, and Steven G Rozen. 2015. “MSIseq: Software for Assessing Microsatellite Instability from Catalogs of Somatic Mutations.” Sci. Rep. 5: 13321.