The OptSurvCutR
package provides a
comprehensive, data-driven workflow for survival analysis, specifically
designed for scenarios where a single cut-point is not sufficient. Its
core functionality is to discover and validate one or more optimal
cut-points for continuous variables in time-to-event data. This is a
particularly useful in clinical and biological research where you have a
non-linear or U-shaped relationship and need to create meaningful,
distinct subgroups from continuous data (e.g., patient age, gene
expression levels, or, in this case, microbial abundance) for prognostic
analysis.
The key features of the package include:
This tutorial demonstrates how to use the
OptSurvCutR
package for survival analysis.
Specifically, we will stratify patients into different prognostic groups
based on a continuous biomarker.
The human gut is home to a complex community of microorganisms known as the microbiome. A key, though often overlooked, component of this community is the gut virome—the collection of all viruses. In a healthy state, this ecosystem is balanced, but in diseases like colorectal cancer (CRC), this balance is often disrupted (a state called dysbiosis).
For this example, we will analyse gut virome data to determine if the relative abundance of the genus Enterovirus can predict survival outcomes. Enterovirus is a common genus of viruses from the Picornaviridae family, known for causing a range of human illnesses from the common cold to more severe conditions. As their name suggests, they often replicate in the gastrointestinal tract and are a regular member of the human gut virome.
While Enterovirus can be found in the guts of both healthy individuals and CRC patients, recent studies have focused on how its prevalence and abundance change in the context of cancer. The relationship is complex and represents a fascinating double-edged sword. On one hand, chronic viral infections can contribute to the pro-inflammatory environment that is a known risk factor for CRC development. On the other hand, some strains of Enterovirus have been identified as being naturally oncolytic—meaning they can preferentially infect and kill cancer cells. This has opened up an exciting new avenue of research into using these viruses as a potential cancer therapy.
Therefore, understanding whether a high or low abundance of
Enterovirus is associated with patient survival is a critical
question. This analysis, inspired by data from Smyth et al. (2024), uses
OptSurvCutR
to explore this exact relationship, seeking to
identify data-driven thresholds that can stratify patients into distinct
prognostic groups based on this viral biomarker.
This guide covers a complete and robust workflow:
First, we load all the necessary R packages. The pacman
package will automatically install any missing packages.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
OptSurvCutR, dplyr,
survival, survminer, patchwork, ggplot2
)
We define the name of the microbe/virus we wish to analyse upfront. Here, we selected Enterovirus - a RNA virus within the Picornaviridae family, known for causing gastroenteritis and other infections.
# Define the biomarker of interest.
microbe_name_to_analyze <- "Enterovirus"
We load the pre-cleaned clinical and virome dataset directly from the
OptSurvCutR
package. With the updated data format, this is
now a single data frame. We will process it to extract the necessary
columns, parse the survival status, and censor the data at 5 years (60
months) for this analysis.
# Load the single, wide-format data frame from the package.
data("crc_virome", package = "OptSurvCutR")
# --- Prepare the final analysis data frame ---
# Add a safeguard: check if the selected microbe exists in the data's column names
if (!microbe_name_to_analyze %in% names(crc_virome)) {
stop(paste("The microbe '", microbe_name_to_analyze, "' was not found in the data.",
" Please check the 'microbe_name_to_analyze' parameter."))
}
# Process the single data frame to create the analysis-ready dataset
analysis_data <- crc_virome %>%
# Select the key columns and rename them for consistency
select(
patient_id = sample_id,
time_months,
status_text = status,
abundance = all_of(microbe_name_to_analyze)
) %>%
# Parse the status column (e.g., "0:LIVING" -> 0, "1:DECEASED" -> 1)
mutate(status_numeric = as.numeric(sub(":.*", "", status_text))) %>%
# Censor data at 5 years (60 months)
mutate(
status = ifelse(time_months > 60, 0, status_numeric), # Censor events after 5 years
time = pmin(time_months, 60) # Cap follow-up time at 5 years
) %>%
# Select the final columns needed for the analysis
select(patient_id, time, status, abundance)
# Display the head of the final, prepared data frame
head(analysis_data)
## patient_id time status abundance
## 1 TCGA-3L-AA1B-01 15.616267 0 3.663065
## 2 TCGA-4N-A93T-01 4.799947 0 1.697278
## 3 TCGA-4T-AA8H-01 12.657396 0 2.809707
## 4 TCGA-5M-AAT4-01 1.610941 1 1.688483
## 5 TCGA-5M-AAT6-01 9.534142 1 2.621937
## 6 TCGA-5M-AATE-01 39.451622 0 2.378831
Before finding where the cut-points are, we should determine how many cut-points are statistically justified.
# This function tests models with 0 to 4 cut-points and compares them using BIC.
number_result <- find_cutpoint_number(
data = analysis_data,
predictor = "abundance",
outcome_time = "time",
outcome_event = "status",
method = "genetic",
criterion = "BIC", # can be switched to AIC or AICc
max_cuts = 4,
nmin = 0.2,
maxiter = 100,
use_parallel = TRUE,
seed = 42 # Use the built-in seed argument for reproducible results
)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 1258.74 7.05 2.5% Essentially no support
## 1 1257.13 5.44 5.5% Considerably less support
## 2 1251.69 0.00 83.8% Substantial support
## 3 1256.33 4.63 8.3% Considerably less support
## 4 Inf Inf 0% Essentially no support
## cuts
## NA
## 1.74
## 1.86, 2.15
## 1.61, 1.86, 2.15
## 1.49, 1.73, 1.94, 2.23
plot(number_result)
BIC scores for models with 0 to 4 cut-points. The lowest score indicates the optimal number.
summary(number_result)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 1258.74 7.05 2.5% Essentially no support
## 1 1257.13 5.44 5.5% Considerably less support
## 2 1251.69 0.00 83.8% Substantial support
## 3 1256.33 4.63 8.3% Considerably less support
## 4 Inf Inf 0% Essentially no support
## cuts
## NA
## 1.74
## 1.86, 2.15
## 1.61, 1.86, 2.15
## 1.49, 1.73, 1.94, 2.23
## Group N Events
## 1 G1 299 64
## 2 G2 142 13
## 3 G3 140 33
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 299 64 NA 51.5 NA
## group=G2 142 13 NA NA NA
## group=G3 140 33 56.3 43.2 NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 581, number of events= 110
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 -0.9646 0.3812 0.3044 -3.169 0.00153 **
## groupG3 0.1003 1.1056 0.2144 0.468 0.63979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 0.3812 2.6236 0.2099 0.6921
## groupG3 1.1056 0.9045 0.7262 1.6830
##
## Concordance= 0.558 (se = 0.026 )
## Likelihood ratio test= 14.78 on 2 df, p=6e-04
## Wald test = 11.64 on 2 df, p=0.003
## Score (logrank) test = 12.63 on 2 df, p=0.002
Interpretation: The model with 2 cut-points has the lowest BIC (Bayesian Information Criterion) and the highest BIC Weight. This provides strong statistical evidence that dividing the patients into three distinct prognostic groups (rather than 2 - High and Low groups) is the most appropriate choice for this dataset.
Now that we have decided to use two cut-points, we use
find_cutpoint()
to identify their optimal values.
# This function uses a genetic algorithm to find the location of the two best cut-points.
multi_cut_result <- find_cutpoint(
data = analysis_data,
predictor = "abundance",
outcome_time = "time",
outcome_event = "status",
method = "genetic",
num_cuts = 2,
nmin = 0.2,
maxiter = 500,
use_parallel = TRUE,
seed = 123 # Use a different seed for this distinct analysis step
)
optimal_cuts <- multi_cut_result$optimal_cuts
plot(multi_cut_result)
Distribution of the biomarker with optimal cut-points overlaid as vertical lines.
summary(multi_cut_result)
## Group N Events
## 1 G1 299 64
## 2 G2 142 13
## 3 G3 140 33
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 299 64 NA 51.5 NA
## group=G2 142 13 NA NA NA
## group=G3 140 33 56.3 43.2 NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 581, number of events= 110
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 -0.9646 0.3812 0.3044 -3.169 0.00153 **
## groupG3 0.1003 1.1056 0.2144 0.468 0.63979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 0.3812 2.6236 0.2099 0.6921
## groupG3 1.1056 0.9045 0.7262 1.6830
##
## Concordance= 0.558 (se = 0.026 )
## Likelihood ratio test= 14.78 on 2 df, p=6e-04
## Wald test = 11.64 on 2 df, p=0.003
## Score (logrank) test = 12.63 on 2 df, p=0.002
## chisq df p
## group 5.02 2 0.081
## GLOBAL 5.02 2 0.081
Interpretation: The genetic algorithm has identified two optimal thresholds. These are the specific abundance values that best separate the patients into three prognostic groups based on the log-rank statistic.
Note on the AIC value: The summary output for the final model shows an AIC (Akaike Information Criterion) score. While we used BIC in Step 3.1 to select the best number of cut-points, the standard summary function in R reports AIC by default as a final measure of model quality. Both AIC and BIC serve a similar purpose: to find a model that best balances complexity and goodness-of-fit. The absolute AIC value itself is not directly interpretable, but it confirms the statistical properties of the final, chosen model.
This is a crucial step to assess how reliable our discovered
cut-points are. We use validate_cutpoint()
to run a
bootstrap analysis.
# Use the built-in seed argument for reproducibility of the bootstrap process.
validation_result <- validate_cutpoint(
cutpoint_result = multi_cut_result,
num_replicates = 200, # Use a higher number (e.g., 500) for a final analysis.
use_parallel = TRUE, # Enable parallel processing.
seed = 456
)
## Lower Upper
## Cut 1 1.507 1.893
## Cut 2 1.778 2.351
summary(validation_result)
## Mean Median SD IQR
## Cut 1 1.769 1.786 0.106 0.145
## Cut 2 2.156 2.160 0.162 0.206
## Lower Upper
## Cut 1 1.507 1.893
## Cut 2 1.778 2.351
Interpretation: This table shows the 95% confidence intervals for our two cut-points. For a cut-point to be considered stable, its confidence interval should be relatively narrow. This gives us a sense of the precision and reliability of our findings.
# Create the plot and store it in an object for later use
bootstrap_plot <- plot(validation_result)
print(bootstrap_plot)
Bootstrap distribution of the optimal cut-points. The solid red line is the original cut-point; dashed lines show the 95% CI.
First, we create a new column that categorises each patient into a
“Low”, “Medium”, or “High” abundance group based on the
optimal_cuts
.
# Create a new column with the abundance group based on the optimal cut-points.
analysis_data_categorised <- analysis_data %>%
mutate(abundance_group = cut(
abundance,
breaks = c(-Inf, optimal_cuts, Inf),
labels = c("Low", "Medium", "High")
)) %>%
filter(!is.na(abundance_group))
# Set "Medium" abundance as the reference group for the Cox model.
analysis_data_categorised$abundance_group <- relevel(
analysis_data_categorised$abundance_group, ref = "Medium"
)
# Fit the final Cox Proportional-Hazards model on the categorised data.
final_formula <- as.formula("Surv(time, status) ~ abundance_group")
cox_model <- coxph(final_formula, data = analysis_data_categorised)
# Fit the Kaplan-Meier model for plotting and summary tables.
km_fit <- survfit(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)
A Kaplan-Meier curve shows the probability of survival over time for different patient groups.
# Create the plot using ggsurvplot.
km_plot <- ggsurvplot(
km_fit, data = analysis_data_categorised, pval = TRUE, risk.table = TRUE,
legend.title = microbe_name_to_analyze, legend.labs = c("Medium", "Low", "High"),
palette = c("#2E9FDF", "#E7B800", "#FC4E07"), # Custom colour palette
ylim = c(0.4, 1.0),
pval.coord = c(0, 0.6)
)
print(km_plot)
Interpretation: The Kaplan-Meier plot visualizes the survival probability for the “Low”, “Medium”, and “High” abundance groups over time. A steeply dropping curve indicates a rapid decrease in survival (poor prognosis), while a flat curve indicates a stable survival rate (good prognosis).
Here, we see a classic “U-shaped” risk relationship: - The Medium group’s curve (N=142) is the flattest, indicating these patients have the best survival outcome. This suggests an intermediate level of Enterovirus is associated with a more stable disease state. - The Low (N=299) and High (N=140) groups have much steeper curves. This means that deviating from the “medium” range in either direction is associated with a significantly worse prognosis. This pattern suggests that gut virome dysbiosis (imbalance), represented by either too little or too much Enterovirus, is a marker of higher risk.
The very small p-value (< 0.05) from the log-rank test confirms that these differences between the curves are statistically significant.
A forest plot visualises the Hazard Ratios (HR) from our Cox model. The HR quantifies the risk of death for one group compared to a reference group.
# ggforest creates a plot of the hazard ratios from the Cox model.
forest_plot <- ggforest(
cox_model, data = analysis_data_categorised,
main = "Hazard Ratios for Abundance Groups"
)
print(forest_plot)
Interpretation: This plot gives us the precise statistical risk for the “Low” and “High” groups, using the “Medium” group as the reference point (Hazard Ratio = 1.0).
- Low Group: The Hazard Ratio is 2.6 (95% CI: 1.4 - 4.8; p = 0.002).
- What this means: At any given time, patients in the “Low” abundance group are 2.6 times more likely to die than patients in the “Medium” group.
- The 95% confidence interval (the horizontal line) ranges from 1.4 to 4.8. Because this interval is entirely above 1.0, the result is statistically significant, which is confirmed by the very low p-value of 0.002.
- High Group: The Hazard Ratio is 2.9 (95% CI: 1.5 - 5.5; p = 0.001).
- What this means: Patients in the “High” abundance group are 2.9 times more likely to die than patients in the “Medium” group.
- This is the highest-risk group. The result is highly statistically significant (p = 0.001), as its confidence interval is also well above 1.0.
In summary, the forest plot statistically confirms what the Kaplan-Meier curve showed visually: both low and high levels of Enterovirus are significant predictors of poor survival compared to an intermediate level.
Finally, we use the patchwork
package to combine our
primary plots into a single, publication-ready figure.
# We extract the plot component from the ggsurvplot object to combine it with other ggplots.
# The bootstrap_plot object was created in the validation step.
publication_figure <- (km_plot$plot) / (forest_plot + bootstrap_plot) +
# Add labels (A, B, C) to the combined plots
plot_annotation(tag_levels = 'A')
print(publication_figure)
Summary of prognostic analysis. (A) Kaplan-Meier survival curves. (B) Forest plot of hazard ratios. (C) Bootstrap distribution of optimal cut-points.
This box plot provides a clear visual confirmation of how the
find_cutpoint()
function has partitioned the patients.
dist_plot <- ggplot(analysis_data_categorised, aes(x = abundance_group, y = abundance, fill = abundance_group)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = paste("Distribution of", microbe_name_to_analyze, "Abundance by Risk Group"),
x = "Prognostic Group",
y = "Relative Abundance"
) +
scale_fill_manual(values = c("#2E9FDF", "#E7B800", "#FC4E07")) +
theme_minimal()
print(dist_plot)
A key assumption of the Cox model is that the effect of a predictor is constant over time. We can check this using a Schoenfeld residuals plot. A non-significant trend (p > 0.05) supports the assumption.
# Test the proportional hazards assumption for our Cox model.
schoenfeld_residuals <- cox.zph(cox_model)
# Print the test results to the console. A non-significant p-value is good.
print(schoenfeld_residuals)
## chisq df p
## abundance_group 5.02 2 0.081
## GLOBAL 5.02 2 0.081
# Visualise the residuals.
ggcoxzph(schoenfeld_residuals)
This vignette has demonstrated the three-step workflow for cut-point
analysis using OptSurvCutR
. By following this workflow,
users can confidently identify and validate robust,
statistically-optimal thresholds in their own survival data, moving
beyond simple median splits to uncover more nuanced relationships.
We encourage you to try OptSurvCutR
with your own
data.
remotes::install_github("paytonyau/OptSurvCutR")
For reproducibility, the session information below lists the R version and all attached packages.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.3.1 survminer_0.5.0 ggpubr_0.6.1 ggplot2_3.5.2
## [5] survival_3.8-3 dplyr_1.1.4 OptSurvCutR_0.1.5 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0 rstatix_0.7.2
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 lifecycle_1.0.4 stringr_1.5.1
## [17] compiler_4.5.1 farver_2.1.2 ggsci_3.2.0 codetools_0.2-20
## [21] carData_3.0-5 litedown_0.7 htmltools_0.5.8.1 sass_0.4.10
## [25] yaml_2.3.10 Formula_1.2-5 pillar_1.11.0 car_3.1-3
## [29] jquerylib_0.1.4 tidyr_1.3.1 cachem_1.1.0 iterators_1.0.14
## [33] rgenoud_5.9-0.11 abind_1.4-8 foreach_1.5.2 km.ci_0.5-6
## [37] commonmark_2.0.0 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7
## [41] purrr_1.1.0 labeling_0.4.3 splines_4.5.1 cowplot_1.2.0
## [45] fastmap_1.2.0 grid_4.5.1 cli_3.6.5 magrittr_2.0.3
## [49] broom_1.0.9 withr_3.0.2 scales_1.4.0 backports_1.5.0
## [53] rmarkdown_2.29 ggtext_0.1.2 gridExtra_2.3 ggsignif_0.6.4
## [57] zoo_1.8-14 evaluate_1.0.4 knitr_1.50 KMsurv_0.1-6
## [61] doParallel_1.0.17 markdown_2.0 survMisc_0.5.6 rlang_1.1.6
## [65] Rcpp_1.1.0 gridtext_0.1.5 xtable_1.8-4 glue_1.8.0
## [69] xml2_1.4.0 rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1