About the Package

The OptSurvCutR package provides a comprehensive, data-driven workflow for survival analysis, specifically designed for scenarios where a single cut-point is not sufficient. Its core functionality is to discover and validate one or more optimal cut-points for continuous variables in time-to-event data. This is a particularly useful in clinical and biological research where you have a non-linear or U-shaped relationship and need to create meaningful, distinct subgroups from continuous data (e.g., patient age, gene expression levels, or, in this case, microbial abundance) for prognostic analysis.

The key features of the package include:

  • Optimal Number of Cut-points: Use information criteria (AIC, AICc, BIC) to find the statistically justified number of groups.
  • Efficient Cut-point Discovery: Use advanced optimisation algorithms (genetic, systematic) to find the best cut-point locations.
  • Robust Validation: Use bootstrapping to assess the stability and reliability of your findings.

Introduction

This tutorial demonstrates how to use the OptSurvCutR package for survival analysis. Specifically, we will stratify patients into different prognostic groups based on a continuous biomarker.

The Scientific Background

The human gut is home to a complex community of microorganisms known as the microbiome. A key, though often overlooked, component of this community is the gut virome—the collection of all viruses. In a healthy state, this ecosystem is balanced, but in diseases like colorectal cancer (CRC), this balance is often disrupted (a state called dysbiosis).

For this example, we will analyse gut virome data to determine if the relative abundance of the genus Enterovirus can predict survival outcomes. Enterovirus is a common genus of viruses from the Picornaviridae family, known for causing a range of human illnesses from the common cold to more severe conditions. As their name suggests, they often replicate in the gastrointestinal tract and are a regular member of the human gut virome.

While Enterovirus can be found in the guts of both healthy individuals and CRC patients, recent studies have focused on how its prevalence and abundance change in the context of cancer. The relationship is complex and represents a fascinating double-edged sword. On one hand, chronic viral infections can contribute to the pro-inflammatory environment that is a known risk factor for CRC development. On the other hand, some strains of Enterovirus have been identified as being naturally oncolytic—meaning they can preferentially infect and kill cancer cells. This has opened up an exciting new avenue of research into using these viruses as a potential cancer therapy.

Therefore, understanding whether a high or low abundance of Enterovirus is associated with patient survival is a critical question. This analysis, inspired by data from Smyth et al. (2024), uses OptSurvCutR to explore this exact relationship, seeking to identify data-driven thresholds that can stratify patients into distinct prognostic groups based on this viral biomarker.

Analysis Workflow

This guide covers a complete and robust workflow:

  1. Setup and Configuration: Preparing the R environment and defining key parameters.
  2. Data Loading and Preparation: Loading the package’s built-in datasets and preparing them for analysis.
  3. Optimal Cut-point Discovery: Determine the optimal number of groups and find the cut-point locations.
  4. Cut-point Validation: Assess the stability of the discovered thresholds via bootstrapping.
  5. Visualisation and Interpretation: Create and interpret Kaplan-Meier curves, forest plots, and other key visualisations.
  6. Model Diagnostics: Checking the assumptions of the survival model.

1. Setup and Configuration

Loading Libraries

First, we load all the necessary R packages. The pacman package will automatically install any missing packages.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  OptSurvCutR, dplyr,
  survival, survminer, patchwork, ggplot2
)

Defining Parameters

We define the name of the microbe/virus we wish to analyse upfront. Here, we selected Enterovirus - a RNA virus within the Picornaviridae family, known for causing gastroenteritis and other infections.

# Define the biomarker of interest.
microbe_name_to_analyze <- "Enterovirus"

2. Data Loading and Preparation

We load the pre-cleaned clinical and virome dataset directly from the OptSurvCutR package. With the updated data format, this is now a single data frame. We will process it to extract the necessary columns, parse the survival status, and censor the data at 5 years (60 months) for this analysis.

# Load the single, wide-format data frame from the package.
data("crc_virome", package = "OptSurvCutR")

# --- Prepare the final analysis data frame ---

# Add a safeguard: check if the selected microbe exists in the data's column names
if (!microbe_name_to_analyze %in% names(crc_virome)) {
  stop(paste("The microbe '", microbe_name_to_analyze, "' was not found in the data.",
             " Please check the 'microbe_name_to_analyze' parameter."))
}

# Process the single data frame to create the analysis-ready dataset
analysis_data <- crc_virome %>%
  # Select the key columns and rename them for consistency
  select(
    patient_id = sample_id,
    time_months,
    status_text = status,
    abundance = all_of(microbe_name_to_analyze)
  ) %>%
  # Parse the status column (e.g., "0:LIVING" -> 0, "1:DECEASED" -> 1)
  mutate(status_numeric = as.numeric(sub(":.*", "", status_text))) %>%
  # Censor data at 5 years (60 months)
  mutate(
    status = ifelse(time_months > 60, 0, status_numeric), # Censor events after 5 years
    time = pmin(time_months, 60)                          # Cap follow-up time at 5 years
  ) %>%
  # Select the final columns needed for the analysis
  select(patient_id, time, status, abundance)

# Display the head of the final, prepared data frame
head(analysis_data)
##        patient_id      time status abundance
## 1 TCGA-3L-AA1B-01 15.616267      0  3.663065
## 2 TCGA-4N-A93T-01  4.799947      0  1.697278
## 3 TCGA-4T-AA8H-01 12.657396      0  2.809707
## 4 TCGA-5M-AAT4-01  1.610941      1  1.688483
## 5 TCGA-5M-AAT6-01  9.534142      1  2.621937
## 6 TCGA-5M-AATE-01 39.451622      0  2.378831

3. Optimal Cut-point Discovery

Step 3.1: Determine the Optimal Number of Cut-points

Before finding where the cut-points are, we should determine how many cut-points are statistically justified.

# This function tests models with 0 to 4 cut-points and compares them using BIC.
number_result <- find_cutpoint_number(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic", 
  criterion = "BIC", # can be switched to AIC or AICc
  max_cuts = 4,
  nmin = 0.2,
  maxiter = 100,
  use_parallel = TRUE,
  seed = 42 # Use the built-in seed argument for reproducible results
) 
##  num_cuts     BIC Delta_BIC BIC_Weight                  Evidence
##         0 1258.74      7.05       2.5%    Essentially no support
##         1 1257.13      5.44       5.5% Considerably less support
##         2 1251.69      0.00      83.8%       Substantial support
##         3 1256.33      4.63       8.3% Considerably less support
##         4     Inf       Inf         0%    Essentially no support
##                    cuts
##                      NA
##                    1.74
##              1.86, 2.15
##        1.61, 1.86, 2.15
##  1.49, 1.73, 1.94, 2.23
plot(number_result)
BIC scores for models with 0 to 4 cut-points. The lowest score indicates the optimal number.

BIC scores for models with 0 to 4 cut-points. The lowest score indicates the optimal number.

summary(number_result)
##  num_cuts     BIC Delta_BIC BIC_Weight                  Evidence
##         0 1258.74      7.05       2.5%    Essentially no support
##         1 1257.13      5.44       5.5% Considerably less support
##         2 1251.69      0.00      83.8%       Substantial support
##         3 1256.33      4.63       8.3% Considerably less support
##         4     Inf       Inf         0%    Essentially no support
##                    cuts
##                      NA
##                    1.74
##              1.86, 2.15
##        1.61, 1.86, 2.15
##  1.49, 1.73, 1.94, 2.23
##   Group   N Events
## 1    G1 299     64
## 2    G2 142     13
## 3    G3 140     33
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 299     64     NA    51.5      NA
## group=G2 142     13     NA      NA      NA
## group=G3 140     33   56.3    43.2      NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 581, number of events= 110 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)   
## groupG2 -0.9646    0.3812   0.3044 -3.169  0.00153 **
## groupG3  0.1003    1.1056   0.2144  0.468  0.63979   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2    0.3812     2.6236    0.2099    0.6921
## groupG3    1.1056     0.9045    0.7262    1.6830
## 
## Concordance= 0.558  (se = 0.026 )
## Likelihood ratio test= 14.78  on 2 df,   p=6e-04
## Wald test            = 11.64  on 2 df,   p=0.003
## Score (logrank) test = 12.63  on 2 df,   p=0.002

Interpretation: The model with 2 cut-points has the lowest BIC (Bayesian Information Criterion) and the highest BIC Weight. This provides strong statistical evidence that dividing the patients into three distinct prognostic groups (rather than 2 - High and Low groups) is the most appropriate choice for this dataset.

Step 3.2: Find the Location of the Cut-points

Now that we have decided to use two cut-points, we use find_cutpoint() to identify their optimal values.

# This function uses a genetic algorithm to find the location of the two best cut-points.
multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic",
  num_cuts = 2,
  nmin = 0.2,
  maxiter = 500,
  use_parallel = TRUE,
  seed = 123 # Use a different seed for this distinct analysis step
)

optimal_cuts <- multi_cut_result$optimal_cuts
plot(multi_cut_result)
Distribution of the biomarker with optimal cut-points overlaid as vertical lines.

Distribution of the biomarker with optimal cut-points overlaid as vertical lines.

summary(multi_cut_result)
##   Group   N Events
## 1    G1 299     64
## 2    G2 142     13
## 3    G3 140     33
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 299     64     NA    51.5      NA
## group=G2 142     13     NA      NA      NA
## group=G3 140     33   56.3    43.2      NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 581, number of events= 110 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)   
## groupG2 -0.9646    0.3812   0.3044 -3.169  0.00153 **
## groupG3  0.1003    1.1056   0.2144  0.468  0.63979   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2    0.3812     2.6236    0.2099    0.6921
## groupG3    1.1056     0.9045    0.7262    1.6830
## 
## Concordance= 0.558  (se = 0.026 )
## Likelihood ratio test= 14.78  on 2 df,   p=6e-04
## Wald test            = 11.64  on 2 df,   p=0.003
## Score (logrank) test = 12.63  on 2 df,   p=0.002
##        chisq df     p
## group   5.02  2 0.081
## GLOBAL  5.02  2 0.081

Interpretation: The genetic algorithm has identified two optimal thresholds. These are the specific abundance values that best separate the patients into three prognostic groups based on the log-rank statistic.

Note on the AIC value: The summary output for the final model shows an AIC (Akaike Information Criterion) score. While we used BIC in Step 3.1 to select the best number of cut-points, the standard summary function in R reports AIC by default as a final measure of model quality. Both AIC and BIC serve a similar purpose: to find a model that best balances complexity and goodness-of-fit. The absolute AIC value itself is not directly interpretable, but it confirms the statistical properties of the final, chosen model.


4. Cut-point Validation

This is a crucial step to assess how reliable our discovered cut-points are. We use validate_cutpoint() to run a bootstrap analysis.

# Use the built-in seed argument for reproducibility of the bootstrap process.
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result,
  num_replicates = 200, # Use a higher number (e.g., 500) for a final analysis.
  use_parallel = TRUE,  # Enable parallel processing.
  seed = 456
)
##       Lower Upper
## Cut 1 1.507 1.893
## Cut 2 1.778 2.351
summary(validation_result)
##        Mean Median    SD   IQR
## Cut 1 1.769  1.786 0.106 0.145
## Cut 2 2.156  2.160 0.162 0.206
##       Lower Upper
## Cut 1 1.507 1.893
## Cut 2 1.778 2.351

Interpretation: This table shows the 95% confidence intervals for our two cut-points. For a cut-point to be considered stable, its confidence interval should be relatively narrow. This gives us a sense of the precision and reliability of our findings.

# Create the plot and store it in an object for later use
bootstrap_plot <- plot(validation_result)
print(bootstrap_plot)
Bootstrap distribution of the optimal cut-points. The solid red line is the original cut-point; dashed lines show the 95% CI.

Bootstrap distribution of the optimal cut-points. The solid red line is the original cut-point; dashed lines show the 95% CI.


5. Visualisation and Interpretation

Categorising Patients

First, we create a new column that categorises each patient into a “Low”, “Medium”, or “High” abundance group based on the optimal_cuts.

# Create a new column with the abundance group based on the optimal cut-points.
analysis_data_categorised <- analysis_data %>%
  mutate(abundance_group = cut(
    abundance,
    breaks = c(-Inf, optimal_cuts, Inf),
    labels = c("Low", "Medium", "High")
  )) %>%
  filter(!is.na(abundance_group))

# Set "Medium" abundance as the reference group for the Cox model.
analysis_data_categorised$abundance_group <- relevel(
  analysis_data_categorised$abundance_group, ref = "Medium"
)

# Fit the final Cox Proportional-Hazards model on the categorised data.
final_formula <- as.formula("Surv(time, status) ~ abundance_group")
cox_model <- coxph(final_formula, data = analysis_data_categorised)

# Fit the Kaplan-Meier model for plotting and summary tables.
km_fit <- survfit(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)

Plot A: Kaplan-Meier Curve

A Kaplan-Meier curve shows the probability of survival over time for different patient groups.

# Create the plot using ggsurvplot.
km_plot <- ggsurvplot(
  km_fit, data = analysis_data_categorised, pval = TRUE, risk.table = TRUE,
  legend.title = microbe_name_to_analyze, legend.labs = c("Medium", "Low", "High"),
  palette = c("#2E9FDF", "#E7B800", "#FC4E07"), # Custom colour palette
  ylim = c(0.4, 1.0),
  pval.coord = c(0, 0.6)
)

print(km_plot)

Interpretation: The Kaplan-Meier plot visualizes the survival probability for the “Low”, “Medium”, and “High” abundance groups over time. A steeply dropping curve indicates a rapid decrease in survival (poor prognosis), while a flat curve indicates a stable survival rate (good prognosis).

Here, we see a classic “U-shaped” risk relationship: - The Medium group’s curve (N=142) is the flattest, indicating these patients have the best survival outcome. This suggests an intermediate level of Enterovirus is associated with a more stable disease state. - The Low (N=299) and High (N=140) groups have much steeper curves. This means that deviating from the “medium” range in either direction is associated with a significantly worse prognosis. This pattern suggests that gut virome dysbiosis (imbalance), represented by either too little or too much Enterovirus, is a marker of higher risk.

The very small p-value (< 0.05) from the log-rank test confirms that these differences between the curves are statistically significant.

Plot B: Forest Plot of Hazard Ratios

A forest plot visualises the Hazard Ratios (HR) from our Cox model. The HR quantifies the risk of death for one group compared to a reference group.

# ggforest creates a plot of the hazard ratios from the Cox model.
forest_plot <- ggforest(
  cox_model, data = analysis_data_categorised,
  main = "Hazard Ratios for Abundance Groups"
)

print(forest_plot)

Interpretation: This plot gives us the precise statistical risk for the “Low” and “High” groups, using the “Medium” group as the reference point (Hazard Ratio = 1.0).

  • Low Group: The Hazard Ratio is 2.6 (95% CI: 1.4 - 4.8; p = 0.002).
    • What this means: At any given time, patients in the “Low” abundance group are 2.6 times more likely to die than patients in the “Medium” group.
    • The 95% confidence interval (the horizontal line) ranges from 1.4 to 4.8. Because this interval is entirely above 1.0, the result is statistically significant, which is confirmed by the very low p-value of 0.002.
  • High Group: The Hazard Ratio is 2.9 (95% CI: 1.5 - 5.5; p = 0.001).
    • What this means: Patients in the “High” abundance group are 2.9 times more likely to die than patients in the “Medium” group.
    • This is the highest-risk group. The result is highly statistically significant (p = 0.001), as its confidence interval is also well above 1.0.

In summary, the forest plot statistically confirms what the Kaplan-Meier curve showed visually: both low and high levels of Enterovirus are significant predictors of poor survival compared to an intermediate level.

Final Combined Figure

Finally, we use the patchwork package to combine our primary plots into a single, publication-ready figure.

# We extract the plot component from the ggsurvplot object to combine it with other ggplots.
# The bootstrap_plot object was created in the validation step.
publication_figure <- (km_plot$plot) / (forest_plot + bootstrap_plot) +
  # Add labels (A, B, C) to the combined plots
  plot_annotation(tag_levels = 'A')

print(publication_figure)
Summary of prognostic analysis. (A) Kaplan-Meier survival curves. (B) Forest plot of hazard ratios. (C) Bootstrap distribution of optimal cut-points.

Summary of prognostic analysis. (A) Kaplan-Meier survival curves. (B) Forest plot of hazard ratios. (C) Bootstrap distribution of optimal cut-points.


6. Model Diagnostics

Plot D: Biomarker Distribution by Risk Group

This box plot provides a clear visual confirmation of how the find_cutpoint() function has partitioned the patients.

dist_plot <- ggplot(analysis_data_categorised, aes(x = abundance_group, y = abundance, fill = abundance_group)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = paste("Distribution of", microbe_name_to_analyze, "Abundance by Risk Group"),
    x = "Prognostic Group",
    y = "Relative Abundance"
  ) +
  scale_fill_manual(values = c("#2E9FDF", "#E7B800", "#FC4E07")) +
  theme_minimal()

print(dist_plot)

Plot E: Checking the Proportional Hazards Assumption

A key assumption of the Cox model is that the effect of a predictor is constant over time. We can check this using a Schoenfeld residuals plot. A non-significant trend (p > 0.05) supports the assumption.

# Test the proportional hazards assumption for our Cox model.
schoenfeld_residuals <- cox.zph(cox_model)
# Print the test results to the console. A non-significant p-value is good.
print(schoenfeld_residuals)
##                 chisq df     p
## abundance_group  5.02  2 0.081
## GLOBAL           5.02  2 0.081
# Visualise the residuals.
ggcoxzph(schoenfeld_residuals)


7. Conclusion & Next Steps

This vignette has demonstrated the three-step workflow for cut-point analysis using OptSurvCutR. By following this workflow, users can confidently identify and validate robust, statistically-optimal thresholds in their own survival data, moving beyond simple median splits to uncover more nuanced relationships.

We encourage you to try OptSurvCutR with your own data.

  • Install the package from GitHub: remotes::install_github("paytonyau/OptSurvCutR")
  • Report issues or suggest features on our GitHub page.
  • Star the repository if you find it useful.

8. Session Information

For reproducibility, the session information below lists the R version and all attached packages.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] patchwork_1.3.1   survminer_0.5.0   ggpubr_0.6.1      ggplot2_3.5.2    
## [5] survival_3.8-3    dplyr_1.1.4       OptSurvCutR_0.1.5 pacman_0.5.1     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.52          bslib_0.9.0        rstatix_0.7.2     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 lifecycle_1.0.4    stringr_1.5.1     
## [17] compiler_4.5.1     farver_2.1.2       ggsci_3.2.0        codetools_0.2-20  
## [21] carData_3.0-5      litedown_0.7       htmltools_0.5.8.1  sass_0.4.10       
## [25] yaml_2.3.10        Formula_1.2-5      pillar_1.11.0      car_3.1-3         
## [29] jquerylib_0.1.4    tidyr_1.3.1        cachem_1.1.0       iterators_1.0.14  
## [33] rgenoud_5.9-0.11   abind_1.4-8        foreach_1.5.2      km.ci_0.5-6       
## [37] commonmark_2.0.0   tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7     
## [41] purrr_1.1.0        labeling_0.4.3     splines_4.5.1      cowplot_1.2.0     
## [45] fastmap_1.2.0      grid_4.5.1         cli_3.6.5          magrittr_2.0.3    
## [49] broom_1.0.9        withr_3.0.2        scales_1.4.0       backports_1.5.0   
## [53] rmarkdown_2.29     ggtext_0.1.2       gridExtra_2.3      ggsignif_0.6.4    
## [57] zoo_1.8-14         evaluate_1.0.4     knitr_1.50         KMsurv_0.1-6      
## [61] doParallel_1.0.17  markdown_2.0       survMisc_0.5.6     rlang_1.1.6       
## [65] Rcpp_1.1.0         gridtext_0.1.5     xtable_1.8-4       glue_1.8.0        
## [69] xml2_1.4.0         rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1