About the Package

The OptSurvCutR package provides a comprehensive, data-driven workflow for survival analysis, specifically designed for scenarios where a single cut-point is not sufficient. Its core functionality is to discover and validate one or more optimal cut-points for continuous variables in time-to-event data. This is particularly useful in clinical and biological research where you have a non-linear or U-shaped relationship and need to create meaningful, distinct subgroups from continuous data (e.g., patient age, gene expression levels, or, in this case, microbial abundance) for prognostic analysis.

The key features of the package include:

  • Optimal Number of Cut-points: Uses information criteria (AIC, AICc, BIC) to find the statistically justified number of groups.
  • Efficient Cut-point Discovery: Uses advanced optimisation algorithms (genetic, systematic) to find the best cut-point locations.
  • Robust Validation: Uses bootstrapping to assess the stability and reliability of your findings.

Introduction

This tutorial demonstrates how to use the OptSurvCutR package for survival analysis. Specifically, we will stratify patients into different prognostic groups based on a continuous biomarker.

The Scientific Background

The human gut is home to a complex community of microorganisms known as the microbiome. A key, though often overlooked, component of this community is the gut virome—the collection of all viruses. In a healthy state, this ecosystem is balanced, but in diseases like colorectal cancer (CRC), this balance is often disrupted (a state called dysbiosis).

For this example, we will analyse gut virome data to determine if the relative abundance of the genus Enterovirus can predict survival outcomes. Enterovirus is a common genus of viruses from the Picornaviridae family, known for causing a range of human illnesses from the common cold to more severe conditions. As their name suggests, they often replicate in the gastrointestinal tract and are a regular member of the human gut virome.

While Enterovirus can be found in the guts of both healthy individuals and CRC patients, recent studies have focused on how its prevalence and abundance change in the context of cancer. The relationship is complex and represents a fascinating double-edged sword. On one hand, chronic viral infections can contribute to the pro-inflammatory environment that is a known risk factor for CRC development. On the other hand, some strains of Enterovirus have been identified as being naturally oncolytic—meaning they can preferentially infect and kill cancer cells. This has opened up an exciting new avenue of research into using these viruses as a potential cancer therapy.

Therefore, understanding whether a high or low abundance of Enterovirus is associated with patient survival is a critical question. This analysis, inspired by data from Smyth et al. (2024), uses OptSurvCutR to explore this exact relationship, seeking to identify data-driven thresholds that can stratify patients into distinct prognostic groups based on this viral biomarker.

Analysis Workflow

This guide covers a complete and robust workflow:

  1. Setup and Configuration: Preparing the R environment and defining key parameters.
  2. Data Loading and Preparation: Loading the package’s built-in datasets and preparing them for analysis.
  3. Optimal Cut-point Discovery: Determine the optimal number of groups and find the cut-point locations.
  4. Cut-point Validation: Assess the stability of the discovered thresholds via bootstrapping.
  5. Visualisation and Interpretation: Create and interpret Kaplan-Meier curves, forest plots, and other key visualisations.
  6. Model Diagnostics: Checking the assumptions of the survival model.

1. Setup and Configuration

Loading Libraries

First, we load all the necessary R packages. The pacman package will automatically install any missing packages.

# Check if pacman is installed, if not, install it
if (!require("pacman")) install.packages("pacman")

# Use pacman to load/install all required packages
pacman::p_load(
  OptSurvCutR, # The core package for this analysis
  dplyr,       # For data manipulation (pipes %>%, select, mutate)
  survival,    # For core survival analysis functions (Surv, survfit)
  survminer,   # For plotting survival curves (ggsurvplot, ggforest)
  patchwork,   # For combining multiple ggplots into one figure
  ggplot2,     # For creating custom plots
  rgenoud      # Required for method = "genetic"
)

Defining Parameters

We define the name of the microbe/virus we wish to analyse upfront. Here, we select Enterovirus—an RNA virus within the Picornaviridae family, known for causing gastroenteritis and other infections.

# Define the biomarker of interest. This string must match a column name.
microbe_name_to_analyze <- "Enterovirus"

2. Data Loading and Preparation

We load the pre-cleaned clinical and virome dataset directly from the OptSurvCutR package. With the updated data format, this is now a single data frame. We will process it to extract the necessary columns, parse the survival status, and censor the data at 5 years (60 months) for this analysis.

# Load the single, wide-format data frame from the package.
data("crc_virome", package = "OptSurvCutR")

# --- Prepare the final analysis data frame ---

# Add a safeguard: check if the selected microbe exists in the data's column names
if (!microbe_name_to_analyze %in% names(crc_virome)) {
  stop(paste("The microbe '", microbe_name_to_analyze, "' was not found in the data.",
             " Please check the 'microbe_name_to_analyze' parameter."))
}

# Process the single data frame to create the analysis-ready dataset
analysis_data <- crc_virome %>%
  # Select the key columns and rename them for consistency
  select(
    patient_id = sample_id,
    time_months,
    status_text = status,                     # e.g., "0:LIVING"
    abundance = all_of(microbe_name_to_analyze) # Select our biomarker
  ) %>%
  # Parse the status column (e.g., "0:LIVING" -> 0, "1:DECEASED" -> 1)
  mutate(status_numeric = as.numeric(sub(":.*", "", status_text))) %>%
  # Censor data at 5 years (60 months) for a 5-year survival analysis
  mutate(
    # If time > 60 months, set status to 0 (censored) regardless of original status
    status = ifelse(time_months > 60, 0, status_numeric),
    # Cap all follow-up time at 60 months
    time = pmin(time_months, 60)
  ) %>%
  # Select the final columns needed for the analysis
  select(patient_id, time, status, abundance)

# Display the head of the final, prepared data frame
head(analysis_data)
##        patient_id      time status abundance
## 1 TCGA-3L-AA1B-01 15.616267      0  3.663065
## 2 TCGA-4N-A93T-01  4.799947      0  1.697278
## 3 TCGA-4T-AA8H-01 12.657396      0  2.809707
## 4 TCGA-5M-AAT4-01  1.610941      1  1.688483
## 5 TCGA-5M-AAT6-01  9.534142      1  2.621937
## 6 TCGA-5M-AATE-01 39.451622      0  2.378831

3. Optimal Cut-point Discovery

Step 3.1: Determine the Optimal Number of Cut-points

Before finding where the cut-points are, we should determine how many cut-points are statistically justified.

# This function tests models with 0 to 4 cut-points (creating 1 to 5 groups)
# and compares them using the Bayesian Information Criterion (BIC).
number_result <- find_cutpoint_number(
  data = analysis_data,
  predictor = "abundance",     # The continuous variable to test
  outcome_time = "time",       # The time-to-event column
  outcome_event = "status",    # The event status column (0 or 1)
  method = "genetic",          # Use genetic algorithm for a fast search
  criterion = "BIC",           # Use BIC for model selection (can be AIC or AICc)
  max_cuts = 4,                # Test models with 0, 1, 2, 3, and 4 cuts
  nmin = 0.2,                  # Each group must contain at least 20% of data
  maxiter = 100,               # Iterations for the genetic algorithm
  use_parallel = TRUE,         # Use parallel processing to speed up the search
  n_cores = 2,                 # Number of cores to use
  seed = 42                    # Use the built-in seed for reproducible results
)
# The plot method shows the BIC score for each number of cuts.
# The lowest point (highlighted in orange) is the best model.
plot(number_result)
# The summary provides the data in a table format.
summary(number_result)

Interpretation: The model with 2 cut-points has the lowest BIC (Bayesian Information Criterion) and the highest BIC Weight. This provides strong statistical evidence that dividing the patients into three distinct prognostic groups (Low, Medium, High) is the most appropriate choice for this dataset, rather than a simple 2-group (High/Low) split.

Step 3.2: Find the Location of the Cut-points

Now that we have decided to use two cut-points, we use find_cutpoint() to identify their optimal values.

# This function uses a genetic algorithm to find the location of the two best
# cut-points that maximize the log-rank test statistic.
multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic",
  criterion = "logrank", # Optimise for the best log-rank statistic
  num_cuts = 2,          # We are looking for 2 cut-points, as found in Step 3.1
  nmin = 0.2,            # Each of the 3 groups must have at least 20% of data
  maxiter = 500,         # Use more iterations for a more precise search
  seed = 123             # Use a different seed for this distinct analysis step
)

# Store the cut-point values
optimal_cuts <- multi_cut_result$optimal_cuts
# The plot method for a 'find_cutpoint' object shows the distribution
# of the predictor variable and draws vertical lines at the optimal cut-points.
plot(multi_cut_result, type = "distribution")
Distribution of the biomarker with optimal cut-points overlaid as vertical lines.

Distribution of the biomarker with optimal cut-points overlaid as vertical lines.

# The summary shows the cut-points and the Cox model fit for the new groups.
summary(multi_cut_result)
##   Group   N Events
## 1    G1 299     64
## 2    G2 142     13
## 3    G3 140     33
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 299     64     NA    51.5      NA
## group=G2 142     13     NA      NA      NA
## group=G3 140     33   56.3    43.2      NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 581, number of events= 110 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)   
## groupG2 -0.9646    0.3812   0.3044 -3.169  0.00153 **
## groupG3  0.1003    1.1056   0.2144  0.468  0.63979   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2    0.3812     2.6236    0.2099    0.6921
## groupG3    1.1056     0.9045    0.7262    1.6830
## 
## Concordance= 0.558  (se = 0.026 )
## Likelihood ratio test= 14.78  on 2 df,   p=6e-04
## Wald test            = 11.64  on 2 df,   p=0.003
## Score (logrank) test = 12.63  on 2 df,   p=0.002
##        chisq df     p
## group   5.02  2 0.081
## GLOBAL  5.02  2 0.081

Interpretation: The genetic algorithm has identified two optimal thresholds. These are the specific abundance values that best separate the patients into three prognostic groups based on the log-rank statistic.

Note on the AIC value: The summary output for the final model shows an AIC (Akaike Information Criterion) score. While we used BIC in Step 3.1 to select the best number of cut-points, the standard summary function in R reports AIC by default as a final measure of model quality. Both AIC and BIC serve a similar purpose: to find a model that best balances complexity and goodness-of-fit.


4. Cut-point Validation

This is a crucial step to assess how reliable our discovered cut-points are. We use validate_cutpoint() to run a bootstrap analysis. This involves repeatedly resampling the data, re-running the find_cutpoint algorithm, and checking how much the cut-points change.

# Use the built-in seed argument for reproducibility of the bootstrap process.
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result, # Pass in the results from Step 3.2
  num_replicates = 200, # Use 200 replicates (>= 500 is recommended for publication)
  use_parallel = TRUE,  # Enable parallel processing.
  n_cores = 2,          # Number of cores to use
  maxiter = 100,        # Limit iterations *within* each replicate for speed
  seed = 456            # Use a different seed for this new random process
)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 1.864, 2.155 
## Successful Replicates: 175 / 200 ( 87.5 %)
## Failed Replicates: 25 
## 
## 95% Confidence Intervals
## ------------------------
##       Lower Upper
## Cut 1 1.477 1.893
## Cut 2 1.742 2.355
## 
## Bootstrap Summary Statistics
## ---------------------------
##       Cut  Mean    SD Median    Q1    Q3
## 25%  Cut1 1.767 0.111  1.774 1.720 1.864
## 25%1 Cut2 2.171 0.161  2.178 2.092 2.299
## 
## Hint: Use `summary()` for detailed statistics or `plot()` to visualize stability.
# The summary provides 95% Confidence Intervals for the cut-points.
summary(validation_result)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 1.864, 2.155 
## 
## Bootstrap Distribution Summary
## -----------------------------
##       Cut  Mean    SD Median    Q1    Q3
## 25%  Cut1 1.767 0.111  1.774 1.720 1.864
## 25%1 Cut2 2.171 0.161  2.178 2.092 2.299
## 
## 95% Confidence Intervals
## ------------------------
##       Lower Upper
## Cut 1 1.477 1.893
## Cut 2 1.742 2.355
## 
## Validation Parameters
## ---------------------
## Replicates Requested: 200 
## Successful Replicates: 175 ( 87.5 %)
## Failed Replicates: 25 
## Parallel Processing: Enabled 
## Cores Used: 2 
## Seed: 456 
## Minimum Group Size (nmin): 104 
## Method: genetic 
## Criterion: logrank 
## Covariates: None

Interpretation: This table shows the 95% confidence intervals for our two cut-points. For a cut-point to be considered stable, its confidence interval should be relatively narrow. This gives us a sense of the precision and reliability of our findings.

# Create the plot and store it in an object for later use
# This plot shows the histogram of cut-points found in the bootstrap samples.
bootstrap_plot <- plot(validation_result)
print(bootstrap_plot)
Bootstrap distribution of the optimal cut-points. The solid red line is the original cut-point; dashed lines show the 95% CI.

Bootstrap distribution of the optimal cut-points. The solid red line is the original cut-point; dashed lines show the 95% CI.


5. Visualisation and Interpretation

Categorising Patients

First, we create a new column that categorises each patient into a “Low”, “Medium”, or “High” abundance group based on the optimal_cuts. This step is necessary to manually build the Kaplan-Meier and Forest plots.

# Create a new column with the abundance group based on the optimal cut-points.
analysis_data_categorised <- analysis_data %>%
  mutate(abundance_group = cut(
    abundance,
    breaks = c(-Inf, optimal_cuts, Inf), # Use the cuts from Step 3.2
    labels = c("Low", "Medium", "High")
  )) %>%
  filter(!is.na(abundance_group)) # Remove any rows that didn't fall into a group

# Set "Medium" abundance as the reference group for the Cox model.
# This means "Low" and "High" will be compared *to* "Medium".
analysis_data_categorised$abundance_group <- relevel(
  analysis_data_categorised$abundance_group, ref = "Medium"
)

# Fit the final Cox Proportional-Hazards model on the categorised data.
final_formula <- as.formula("Surv(time, status) ~ abundance_group")
cox_model <- coxph(final_formula, data = analysis_data_categorised)

# Fit the Kaplan-Meier model for plotting and summary tables.
km_fit <- survfit(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)

Plot A: Kaplan-Meier Curve

A Kaplan-Meier curve shows the probability of survival over time for different patient groups.

# Create the plot using ggsurvplot.
km_plot <- ggsurvplot(
  km_fit, data = analysis_data_categorised, pval = TRUE, risk.table = TRUE,
  legend.title = microbe_name_to_analyze, legend.labs = c("Medium", "Low", "High"),
  palette = c("#2E9FDF", "#E7B800", "#FC4E07"), # Custom colour palette
  ylim = c(0.4, 1.0),
  pval.coord = c(0, 0.6)
)

print(km_plot)

Interpretation: The Kaplan-Meier plot visualises the survival probability for the “Low”, “Medium”, and “High” abundance groups over time. A steeply dropping curve indicates a rapid decrease in survival (poor prognosis), while a flat curve indicates a stable survival rate (good prognosis).

Here, we see a classic “U-shaped” risk relationship: - The Medium group’s curve (N=142) is the flattest, indicating these patients have the best survival outcome. This suggests an intermediate level of Enterovirus is associated with a more stable disease state. - The Low (N=299) and High (N=140) groups have much steeper curves. This means that deviating from the “medium” range in either direction is associated with a significantly worse prognosis. This pattern suggests that gut virome dysbiosis (imbalance), represented by either too little or too much Enterovirus, is a marker of higher risk.

The very small p-value (< 0.05) from the log-rank test confirms that these differences between the curves are statistically significant.

Plot B: Forest Plot of Hazard Ratios

A forest plot visualises the Hazard Ratios (HR) from our Cox model. The HR quantifies the risk of death for one group compared to a reference group.

# ggforest creates a plot of the hazard ratios from the Cox model.
forest_plot <- ggforest(
  cox_model, data = analysis_data_categorised,
  main = "Hazard Ratios for Abundance Groups"
)

print(forest_plot)

Interpretation: This plot gives us the precise statistical risk for the “Low” and “High” groups, using the “Medium” group as the reference point (Hazard Ratio = 1.0).

  • Low Group: The Hazard Ratio is 2.6 (95% CI: 1.4 - 4.8; p = 0.002).
    • What this means: At any given time, patients in the “Low” abundance group are 2.6 times more likely to die than patients in the “Medium” group.
    • The 95% confidence interval (the horizontal line) ranges from 1.4 to 4.8. Because this interval is entirely above 1.0, the result is statistically significant, which is confirmed by the very low p-value of 0.002.
  • High Group: The Hazard Ratio is 2.9 (95% CI: 1.5 - 5.5; p = 0.001).
    • What this means: Patients in the “High” abundance group are 2.9 times more likely to die than patients in the “Medium” group.
    • This is the highest-risk group. The result is highly statistically significant (p = 0.001), as its confidence interval is also well above 1.0.

In summary, the forest plot statistically confirms what the Kaplan-Meier curve showed visually: both low and high levels of Enterovirus are significant predictors of poor survival compared to an intermediate level.

Final Combined Figure

Finally, we use the patchwork package to combine our primary plots into a single, publication-ready figure.

# We extract the plot component from the ggsurvplot object to combine it with other ggplots.
# The bootstrap_plot object was created in the validation step.
publication_figure <- (km_plot$plot) / (forest_plot + bootstrap_plot) +
  # Add labels (A, B, C) to the combined plots
  plot_annotation(tag_levels = 'A')

print(publication_figure)
Summary of prognostic analysis. (A) Kaplan-Meier survival curves. (B) Forest plot of hazard ratios. (C) Bootstrap distribution of optimal cut-points.

Summary of prognostic analysis. (A) Kaplan-Meier survival curves. (B) Forest plot of hazard ratios. (C) Bootstrap distribution of optimal cut-points.


6. Model Diagnostics

Plot D: Biomarker Distribution by Risk Group

This box plot provides a clear visual confirmation of how the find_cutpoint() function has partitioned the patients.

# Create a ggplot boxplot to show the distribution of the biomarker
# within the three groups we created.
dist_plot <- ggplot(analysis_data_categorised, aes(x = abundance_group, y = abundance, fill = abundance_group)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = paste("Distribution of", microbe_name_to_analyze, "Abundance by Risk Group"),
    x = "Prognostic Group",
    y = "Relative AbDundance"
  ) +
  scale_fill_manual(values = c("#2E9FDF", "#E7B800", "#FC4E07")) +
  theme_minimal()

print(dist_plot)

Plot E: Checking the Proportional Hazards Assumption

A key assumption of the Cox model is that the effect of a predictor is constant over time. We can check this using a Schoenfeld residuals plot. A non-significant trend (p > 0.05) supports the assumption.

# Test the proportional hazards assumption for our Cox model.
schoenfeld_residuals <- cox.zph(cox_model)

# Print the test results to the console.
# A p-value > 0.05 for GLOBAL is good and supports the assumption.
print(schoenfeld_residuals)
##                 chisq df     p
## abundance_group  5.02  2 0.081
## GLOBAL           5.02  2 0.081
# Visualise the residuals.
# We are looking for lines that are roughly horizontal.
ggcoxzph(schoenfeld_residuals)


7. Conclusion & Next Steps

This vignette has demonstrated the three-step workflow for cut-point analysis using OptSurvCutR. By following this workflow, users can confidently identify and validate robust, statistically-optimal thresholds in their own survival data, moving beyond simple median splits to uncover more nuanced relationships.

We encourage you to try OptSurvCutR with your own data.

  • Install the package from GitHub: remotes::install_github("paytonyau/OptSurvCutR")
  • Report issues or suggest features on our GitHub page.
  • Star the repository if you find it useful.
  • Read the preprint: For a detailed methodological discussion, please see our paper on bioRxiv.

8. Session Information

For reproducibility, the session information below lists the R version and all attached packages.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rgenoud_5.9-0.11  patchwork_1.3.1   survminer_0.5.1   ggpubr_0.6.2     
## [5] ggplot2_4.0.0     survival_3.8-3    dplyr_1.1.4       OptSurvCutR_0.1.7
## [9] pacman_0.5.1     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.53          bslib_0.9.0        rstatix_0.7.3     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 S7_0.2.0           lifecycle_1.0.4   
## [17] stringr_1.5.2      compiler_4.5.1     farver_2.1.2       codetools_0.2-20  
## [21] carData_3.0-5      litedown_0.7       htmltools_0.5.8.1  sass_0.4.10       
## [25] yaml_2.3.10        Formula_1.2-5      pillar_1.11.1      car_3.1-3         
## [29] jquerylib_0.1.4    tidyr_1.3.1        cachem_1.1.0       iterators_1.0.14  
## [33] abind_1.4-8        foreach_1.5.2      km.ci_0.5-6        commonmark_2.0.0  
## [37] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7      purrr_1.1.0       
## [41] labeling_0.4.3     splines_4.5.1      cowplot_1.2.0      fastmap_1.2.0     
## [45] grid_4.5.1         cli_3.6.5          magrittr_2.0.4     broom_1.0.10      
## [49] withr_3.0.2        scales_1.4.0       backports_1.5.0    rmarkdown_2.30    
## [53] ggtext_0.1.2       gridExtra_2.3      ggsignif_0.6.4     zoo_1.8-14        
## [57] evaluate_1.0.5     knitr_1.50         KMsurv_0.1-6       doParallel_1.0.17 
## [61] markdown_2.0       survMisc_0.5.6     rlang_1.1.6        Rcpp_1.1.0        
## [65] gridtext_0.1.5     xtable_1.8-4       glue_1.8.0         xml2_1.4.0        
## [69] rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1