The OptSurvCutR package provides a
comprehensive, data-driven workflow for survival analysis, specifically
designed for scenarios where a single cut-point is not sufficient. Its
core functionality is to discover and validate one or more optimal
cut-points for continuous variables in time-to-event data. This is
particularly useful in clinical and biological research where you have a
non-linear or U-shaped relationship and need to create meaningful,
distinct subgroups from continuous data (e.g., patient age, gene
expression levels, or, in this case, microbial abundance) for prognostic
analysis.
The key features of the package include:
This tutorial demonstrates how to use the
OptSurvCutR package for survival analysis.
Specifically, we will stratify patients into different prognostic groups
based on a continuous biomarker.
The human gut is home to a complex community of microorganisms known as the microbiome. A key, though often overlooked, component of this community is the gut virome—the collection of all viruses. In a healthy state, this ecosystem is balanced, but in diseases like colorectal cancer (CRC), this balance is often disrupted (a state called dysbiosis).
For this example, we will analyse gut virome data to determine if the relative abundance of the genus Enterovirus can predict survival outcomes. Enterovirus is a common genus of viruses from the Picornaviridae family, known for causing a range of human illnesses from the common cold to more severe conditions. As their name suggests, they often replicate in the gastrointestinal tract and are a regular member of the human gut virome.
While Enterovirus can be found in the guts of both healthy individuals and CRC patients, recent studies have focused on how its prevalence and abundance change in the context of cancer. The relationship is complex and represents a fascinating double-edged sword. On one hand, chronic viral infections can contribute to the pro-inflammatory environment that is a known risk factor for CRC development. On the other hand, some strains of Enterovirus have been identified as being naturally oncolytic—meaning they can preferentially infect and kill cancer cells. This has opened up an exciting new avenue of research into using these viruses as a potential cancer therapy.
Therefore, understanding whether a high or low abundance of
Enterovirus is associated with patient survival is a critical
question. This analysis, inspired by data from Smyth et al. (2024), uses
OptSurvCutR to explore this exact relationship, seeking to
identify data-driven thresholds that can stratify patients into distinct
prognostic groups based on this viral biomarker.
This guide covers a complete and robust workflow:
First, we load all the necessary R packages. The pacman
package will automatically install any missing packages.
# Check if pacman is installed, if not, install it
if (!require("pacman")) install.packages("pacman")
# Use pacman to load/install all required packages
pacman::p_load(
OptSurvCutR, # The core package for this analysis
dplyr, # For data manipulation (pipes %>%, select, mutate)
survival, # For core survival analysis functions (Surv, survfit)
survminer, # For plotting survival curves (ggsurvplot, ggforest)
patchwork, # For combining multiple ggplots into one figure
ggplot2, # For creating custom plots
rgenoud # Required for method = "genetic"
)
We define the name of the microbe/virus we wish to analyse upfront. Here, we select Enterovirus—an RNA virus within the Picornaviridae family, known for causing gastroenteritis and other infections.
# Define the biomarker of interest. This string must match a column name.
microbe_name_to_analyze <- "Enterovirus"
We load the pre-cleaned clinical and virome dataset directly from the
OptSurvCutR package. With the updated data format, this is
now a single data frame. We will process it to extract the necessary
columns, parse the survival status, and censor the data at 5 years (60
months) for this analysis.
# Load the single, wide-format data frame from the package.
data("crc_virome", package = "OptSurvCutR")
# --- Prepare the final analysis data frame ---
# Add a safeguard: check if the selected microbe exists in the data's column names
if (!microbe_name_to_analyze %in% names(crc_virome)) {
stop(paste("The microbe '", microbe_name_to_analyze, "' was not found in the data.",
" Please check the 'microbe_name_to_analyze' parameter."))
}
# Process the single data frame to create the analysis-ready dataset
analysis_data <- crc_virome %>%
# Select the key columns and rename them for consistency
select(
patient_id = sample_id,
time_months,
status_text = status, # e.g., "0:LIVING"
abundance = all_of(microbe_name_to_analyze) # Select our biomarker
) %>%
# Parse the status column (e.g., "0:LIVING" -> 0, "1:DECEASED" -> 1)
mutate(status_numeric = as.numeric(sub(":.*", "", status_text))) %>%
# Censor data at 5 years (60 months) for a 5-year survival analysis
mutate(
# If time > 60 months, set status to 0 (censored) regardless of original status
status = ifelse(time_months > 60, 0, status_numeric),
# Cap all follow-up time at 60 months
time = pmin(time_months, 60)
) %>%
# Select the final columns needed for the analysis
select(patient_id, time, status, abundance)
# Display the head of the final, prepared data frame
head(analysis_data)
## patient_id time status abundance
## 1 TCGA-3L-AA1B-01 15.616267 0 3.663065
## 2 TCGA-4N-A93T-01 4.799947 0 1.697278
## 3 TCGA-4T-AA8H-01 12.657396 0 2.809707
## 4 TCGA-5M-AAT4-01 1.610941 1 1.688483
## 5 TCGA-5M-AAT6-01 9.534142 1 2.621937
## 6 TCGA-5M-AATE-01 39.451622 0 2.378831
Before finding where the cut-points are, we should determine how many cut-points are statistically justified.
# This function tests models with 0 to 4 cut-points (creating 1 to 5 groups)
# and compares them using the Bayesian Information Criterion (BIC).
number_result <- find_cutpoint_number(
data = analysis_data,
predictor = "abundance", # The continuous variable to test
outcome_time = "time", # The time-to-event column
outcome_event = "status", # The event status column (0 or 1)
method = "genetic", # Use genetic algorithm for a fast search
criterion = "BIC", # Use BIC for model selection (can be AIC or AICc)
max_cuts = 4, # Test models with 0, 1, 2, 3, and 4 cuts
nmin = 0.2, # Each group must contain at least 20% of data
maxiter = 100, # Iterations for the genetic algorithm
use_parallel = TRUE, # Use parallel processing to speed up the search
n_cores = 2, # Number of cores to use
seed = 42 # Use the built-in seed for reproducible results
)
# The plot method shows the BIC score for each number of cuts.
# The lowest point (highlighted in orange) is the best model.
plot(number_result)
# The summary provides the data in a table format.
summary(number_result)
Interpretation: The model with 2 cut-points has the lowest BIC (Bayesian Information Criterion) and the highest BIC Weight. This provides strong statistical evidence that dividing the patients into three distinct prognostic groups (Low, Medium, High) is the most appropriate choice for this dataset, rather than a simple 2-group (High/Low) split.
Now that we have decided to use two cut-points, we use
find_cutpoint() to identify their optimal values.
# This function uses a genetic algorithm to find the location of the two best
# cut-points that maximize the log-rank test statistic.
multi_cut_result <- find_cutpoint(
data = analysis_data,
predictor = "abundance",
outcome_time = "time",
outcome_event = "status",
method = "genetic",
criterion = "logrank", # Optimise for the best log-rank statistic
num_cuts = 2, # We are looking for 2 cut-points, as found in Step 3.1
nmin = 0.2, # Each of the 3 groups must have at least 20% of data
maxiter = 500, # Use more iterations for a more precise search
seed = 123 # Use a different seed for this distinct analysis step
)
# Store the cut-point values
optimal_cuts <- multi_cut_result$optimal_cuts
# The plot method for a 'find_cutpoint' object shows the distribution
# of the predictor variable and draws vertical lines at the optimal cut-points.
plot(multi_cut_result, type = "distribution")
Distribution of the biomarker with optimal cut-points overlaid as vertical lines.
# The summary shows the cut-points and the Cox model fit for the new groups.
summary(multi_cut_result)
## Group N Events
## 1 G1 299 64
## 2 G2 142 13
## 3 G3 140 33
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 299 64 NA 51.5 NA
## group=G2 142 13 NA NA NA
## group=G3 140 33 56.3 43.2 NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 581, number of events= 110
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 -0.9646 0.3812 0.3044 -3.169 0.00153 **
## groupG3 0.1003 1.1056 0.2144 0.468 0.63979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 0.3812 2.6236 0.2099 0.6921
## groupG3 1.1056 0.9045 0.7262 1.6830
##
## Concordance= 0.558 (se = 0.026 )
## Likelihood ratio test= 14.78 on 2 df, p=6e-04
## Wald test = 11.64 on 2 df, p=0.003
## Score (logrank) test = 12.63 on 2 df, p=0.002
## chisq df p
## group 5.02 2 0.081
## GLOBAL 5.02 2 0.081
Interpretation: The genetic algorithm has identified two optimal thresholds. These are the specific abundance values that best separate the patients into three prognostic groups based on the log-rank statistic.
Note on the AIC value: The summary output for the final model shows an AIC (Akaike Information Criterion) score. While we used BIC in Step 3.1 to select the best number of cut-points, the standard summary function in R reports AIC by default as a final measure of model quality. Both AIC and BIC serve a similar purpose: to find a model that best balances complexity and goodness-of-fit.
This is a crucial step to assess how reliable our discovered
cut-points are. We use validate_cutpoint() to run a
bootstrap analysis. This involves repeatedly resampling
the data, re-running the find_cutpoint algorithm, and
checking how much the cut-points change.
# Use the built-in seed argument for reproducibility of the bootstrap process.
validation_result <- validate_cutpoint(
cutpoint_result = multi_cut_result, # Pass in the results from Step 3.2
num_replicates = 200, # Use 200 replicates (>= 500 is recommended for publication)
use_parallel = TRUE, # Enable parallel processing.
n_cores = 2, # Number of cores to use
maxiter = 100, # Limit iterations *within* each replicate for speed
seed = 456 # Use a different seed for this new random process
)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 1.864, 2.155
## Successful Replicates: 175 / 200 ( 87.5 %)
## Failed Replicates: 25
##
## 95% Confidence Intervals
## ------------------------
## Lower Upper
## Cut 1 1.477 1.893
## Cut 2 1.742 2.355
##
## Bootstrap Summary Statistics
## ---------------------------
## Cut Mean SD Median Q1 Q3
## 25% Cut1 1.767 0.111 1.774 1.720 1.864
## 25%1 Cut2 2.171 0.161 2.178 2.092 2.299
##
## Hint: Use `summary()` for detailed statistics or `plot()` to visualize stability.
# The summary provides 95% Confidence Intervals for the cut-points.
summary(validation_result)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 1.864, 2.155
##
## Bootstrap Distribution Summary
## -----------------------------
## Cut Mean SD Median Q1 Q3
## 25% Cut1 1.767 0.111 1.774 1.720 1.864
## 25%1 Cut2 2.171 0.161 2.178 2.092 2.299
##
## 95% Confidence Intervals
## ------------------------
## Lower Upper
## Cut 1 1.477 1.893
## Cut 2 1.742 2.355
##
## Validation Parameters
## ---------------------
## Replicates Requested: 200
## Successful Replicates: 175 ( 87.5 %)
## Failed Replicates: 25
## Parallel Processing: Enabled
## Cores Used: 2
## Seed: 456
## Minimum Group Size (nmin): 104
## Method: genetic
## Criterion: logrank
## Covariates: None
Interpretation: This table shows the 95% confidence intervals for our two cut-points. For a cut-point to be considered stable, its confidence interval should be relatively narrow. This gives us a sense of the precision and reliability of our findings.
# Create the plot and store it in an object for later use
# This plot shows the histogram of cut-points found in the bootstrap samples.
bootstrap_plot <- plot(validation_result)
print(bootstrap_plot)
Bootstrap distribution of the optimal cut-points. The solid red line is the original cut-point; dashed lines show the 95% CI.
First, we create a new column that categorises each patient into a
“Low”, “Medium”, or “High” abundance group based on the
optimal_cuts. This step is necessary to manually build the
Kaplan-Meier and Forest plots.
# Create a new column with the abundance group based on the optimal cut-points.
analysis_data_categorised <- analysis_data %>%
mutate(abundance_group = cut(
abundance,
breaks = c(-Inf, optimal_cuts, Inf), # Use the cuts from Step 3.2
labels = c("Low", "Medium", "High")
)) %>%
filter(!is.na(abundance_group)) # Remove any rows that didn't fall into a group
# Set "Medium" abundance as the reference group for the Cox model.
# This means "Low" and "High" will be compared *to* "Medium".
analysis_data_categorised$abundance_group <- relevel(
analysis_data_categorised$abundance_group, ref = "Medium"
)
# Fit the final Cox Proportional-Hazards model on the categorised data.
final_formula <- as.formula("Surv(time, status) ~ abundance_group")
cox_model <- coxph(final_formula, data = analysis_data_categorised)
# Fit the Kaplan-Meier model for plotting and summary tables.
km_fit <- survfit(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)
A Kaplan-Meier curve shows the probability of survival over time for different patient groups.
# Create the plot using ggsurvplot.
km_plot <- ggsurvplot(
km_fit, data = analysis_data_categorised, pval = TRUE, risk.table = TRUE,
legend.title = microbe_name_to_analyze, legend.labs = c("Medium", "Low", "High"),
palette = c("#2E9FDF", "#E7B800", "#FC4E07"), # Custom colour palette
ylim = c(0.4, 1.0),
pval.coord = c(0, 0.6)
)
print(km_plot)
Interpretation: The Kaplan-Meier plot visualises the survival probability for the “Low”, “Medium”, and “High” abundance groups over time. A steeply dropping curve indicates a rapid decrease in survival (poor prognosis), while a flat curve indicates a stable survival rate (good prognosis).
Here, we see a classic “U-shaped” risk relationship: - The Medium group’s curve (N=142) is the flattest, indicating these patients have the best survival outcome. This suggests an intermediate level of Enterovirus is associated with a more stable disease state. - The Low (N=299) and High (N=140) groups have much steeper curves. This means that deviating from the “medium” range in either direction is associated with a significantly worse prognosis. This pattern suggests that gut virome dysbiosis (imbalance), represented by either too little or too much Enterovirus, is a marker of higher risk.
The very small p-value (< 0.05) from the log-rank test confirms that these differences between the curves are statistically significant.
A forest plot visualises the Hazard Ratios (HR) from our Cox model. The HR quantifies the risk of death for one group compared to a reference group.
# ggforest creates a plot of the hazard ratios from the Cox model.
forest_plot <- ggforest(
cox_model, data = analysis_data_categorised,
main = "Hazard Ratios for Abundance Groups"
)
print(forest_plot)
Interpretation: This plot gives us the precise statistical risk for the “Low” and “High” groups, using the “Medium” group as the reference point (Hazard Ratio = 1.0).
- Low Group: The Hazard Ratio is 2.6 (95% CI: 1.4 - 4.8; p = 0.002).
- What this means: At any given time, patients in the “Low” abundance group are 2.6 times more likely to die than patients in the “Medium” group.
- The 95% confidence interval (the horizontal line) ranges from 1.4 to 4.8. Because this interval is entirely above 1.0, the result is statistically significant, which is confirmed by the very low p-value of 0.002.
- High Group: The Hazard Ratio is 2.9 (95% CI: 1.5 - 5.5; p = 0.001).
- What this means: Patients in the “High” abundance group are 2.9 times more likely to die than patients in the “Medium” group.
- This is the highest-risk group. The result is highly statistically significant (p = 0.001), as its confidence interval is also well above 1.0.
In summary, the forest plot statistically confirms what the Kaplan-Meier curve showed visually: both low and high levels of Enterovirus are significant predictors of poor survival compared to an intermediate level.
Finally, we use the patchwork package to combine our
primary plots into a single, publication-ready figure.
# We extract the plot component from the ggsurvplot object to combine it with other ggplots.
# The bootstrap_plot object was created in the validation step.
publication_figure <- (km_plot$plot) / (forest_plot + bootstrap_plot) +
# Add labels (A, B, C) to the combined plots
plot_annotation(tag_levels = 'A')
print(publication_figure)
Summary of prognostic analysis. (A) Kaplan-Meier survival curves. (B) Forest plot of hazard ratios. (C) Bootstrap distribution of optimal cut-points.
This box plot provides a clear visual confirmation of how the
find_cutpoint() function has partitioned the patients.
# Create a ggplot boxplot to show the distribution of the biomarker
# within the three groups we created.
dist_plot <- ggplot(analysis_data_categorised, aes(x = abundance_group, y = abundance, fill = abundance_group)) +
geom_boxplot(show.legend = FALSE) +
labs(
title = paste("Distribution of", microbe_name_to_analyze, "Abundance by Risk Group"),
x = "Prognostic Group",
y = "Relative AbDundance"
) +
scale_fill_manual(values = c("#2E9FDF", "#E7B800", "#FC4E07")) +
theme_minimal()
print(dist_plot)
A key assumption of the Cox model is that the effect of a predictor is constant over time. We can check this using a Schoenfeld residuals plot. A non-significant trend (p > 0.05) supports the assumption.
# Test the proportional hazards assumption for our Cox model.
schoenfeld_residuals <- cox.zph(cox_model)
# Print the test results to the console.
# A p-value > 0.05 for GLOBAL is good and supports the assumption.
print(schoenfeld_residuals)
## chisq df p
## abundance_group 5.02 2 0.081
## GLOBAL 5.02 2 0.081
# Visualise the residuals.
# We are looking for lines that are roughly horizontal.
ggcoxzph(schoenfeld_residuals)
This vignette has demonstrated the three-step workflow for cut-point
analysis using OptSurvCutR. By following this workflow,
users can confidently identify and validate robust,
statistically-optimal thresholds in their own survival data, moving
beyond simple median splits to uncover more nuanced relationships.
We encourage you to try OptSurvCutR with your own
data.
remotes::install_github("paytonyau/OptSurvCutR")For reproducibility, the session information below lists the R version and all attached packages.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rgenoud_5.9-0.11 patchwork_1.3.1 survminer_0.5.1 ggpubr_0.6.2
## [5] ggplot2_4.0.0 survival_3.8-3 dplyr_1.1.4 OptSurvCutR_0.1.7
## [9] pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.53 bslib_0.9.0 rstatix_0.7.3
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 S7_0.2.0 lifecycle_1.0.4
## [17] stringr_1.5.2 compiler_4.5.1 farver_2.1.2 codetools_0.2-20
## [21] carData_3.0-5 litedown_0.7 htmltools_0.5.8.1 sass_0.4.10
## [25] yaml_2.3.10 Formula_1.2-5 pillar_1.11.1 car_3.1-3
## [29] jquerylib_0.1.4 tidyr_1.3.1 cachem_1.1.0 iterators_1.0.14
## [33] abind_1.4-8 foreach_1.5.2 km.ci_0.5-6 commonmark_2.0.0
## [37] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 purrr_1.1.0
## [41] labeling_0.4.3 splines_4.5.1 cowplot_1.2.0 fastmap_1.2.0
## [45] grid_4.5.1 cli_3.6.5 magrittr_2.0.4 broom_1.0.10
## [49] withr_3.0.2 scales_1.4.0 backports_1.5.0 rmarkdown_2.30
## [53] ggtext_0.1.2 gridExtra_2.3 ggsignif_0.6.4 zoo_1.8-14
## [57] evaluate_1.0.5 knitr_1.50 KMsurv_0.1-6 doParallel_1.0.17
## [61] markdown_2.0 survMisc_0.5.6 rlang_1.1.6 Rcpp_1.1.0
## [65] gridtext_0.1.5 xtable_1.8-4 glue_1.8.0 xml2_1.4.0
## [69] rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1