Introduction

This tutorial demonstrates how to use the OptCutR package for survival analysis. Specifically, we will stratify patients into different prognostic groups based on a continuous biomarker.

The Scientific Background

Stratifying patients by a continuous biomarker is a common and powerful technique in clinical research for identifying patient subgroups with different risk levels. For this example, we will analyse gut microbiome data to determine whether the relative abundance of the genus Parvimonas can predict survival outcomes in colorectal cancer (CRC) patients. This analysis is inspired by data published by Smyth et al. (2024).

Analysis Workflow

This guide covers a complete and robust workflow:

  1. Setup and Configuration: Preparing the R environment and defining key parameters.
  2. Data Loading and Preparation: Loading, cleaning, and merging clinical and microbiome datasets.
  3. Optimal Cut-point Discovery: Determining the optimal number of groups and finding the cut-point locations.
  4. Cut-point Validation: Assessing the stability of the discovered thresholds via bootstrapping.
  5. Visualisation and Interpretation: Creating and interpreting Kaplan-Meier curves, forest plots, and other key visualisations.
  6. Model Diagnostics: Checking the assumptions of the survival model.

1. Setup and Configuration

Loading Libraries

First, we load all the necessary R packages. The pacman package will automatically install any missing packages.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  OptCutR, readxl, dplyr, tidyr, 
  survival, survminer, patchwork, ggplot2
)

Defining Parameters

We define the name of the microbe we wish to analyse upfront.

# Define the biomarker of interest.
microbe_name_to_analyze <- "Parvimonas"

2. Data Loading and Preparation

In this section, we load the clinical and microbiome datasets, clean them, and merge them into a single data frame ready for analysis.

# Load the datasets from their respective files.
clinical_data <- read_excel("Colon clinical data.xlsx", .name_repair = "universal")
microbiome_data <- read.csv("Colon Microbiome data.csv", check.names = TRUE)

# --- Prepare clinical (survival) data ---
survival_data <- clinical_data %>%
  select(
    sample_id = Sample.ID,
    time = Overall.Survival..Months.,
    status_text = Overall.Survival.Status
  ) %>%
  mutate(
    status = ifelse(status_text == '1:DECEASED', 1, 0),
    time = as.numeric(time)
  ) %>%
  filter(!is.na(time) & !is.na(status))

# --- Process and prepare microbiome data ---
microbe_row <- microbiome_data %>%
  filter(grepl(microbe_name_to_analyze, NAME, ignore.case = TRUE)) %>%
  slice(1) 

abundance_long <- microbe_row %>%
  select(starts_with("TCGA")) %>%
  pivot_longer(
    cols = everything(),
    names_to = "sample_id",
    values_to = "abundance"
  ) %>%
  mutate(sample_id = gsub("\\.", "-", sample_id))

# --- Merge the two datasets ---
analysis_data <- inner_join(survival_data, abundance_long, by = "sample_id")

head(analysis_data)
## # A tibble: 6 × 5
##   sample_id        time status_text status abundance
##   <chr>           <dbl> <chr>        <dbl>     <dbl>
## 1 TCGA-3L-AA1B-01 15.6  0:LIVING         0      7.51
## 2 TCGA-4N-A93T-01  4.80 0:LIVING         0      7.89
## 3 TCGA-4T-AA8H-01 12.7  0:LIVING         0      7.81
## 4 TCGA-5M-AAT4-01  1.61 1:DECEASED       1      9.16
## 5 TCGA-5M-AAT6-01  9.53 1:DECEASED       1      9.95
## 6 TCGA-5M-AATE-01 39.5  0:LIVING         0      9.71
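
As a quick optional check, we can confirm the size of the merged cohort and the number of death events; these counts should match the n and number of events reported by the Cox model later in the workflow.

# Check cohort size and number of events in the merged data.
nrow(analysis_data)
sum(analysis_data$status)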

3. Optimal Cut-point Discovery

Step 3.1: Determine the Optimal Number of Cut-points

Before finding where the cut-points are, we should determine how many cut-points are statistically justified.

# Set a seed for reproducible results from the genetic algorithm.
set.seed(42)
# This function tests models with 0, 1, and 2 cut-points and compares them using BIC.
number_result <- find_cutpoint_number(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic", 
  max_cuts = 2,
  nmin = 0.1,
  numgen = 50
)
##  num_cuts     BIC Delta_BIC BIC_Weight
##         0 1322.07      3.43      10.6%
##         1 1319.94      1.31      30.6%
##         2 1318.63      0.00      58.9%

Interpretation: The model with 2 cut-points has the lowest BIC and the highest BIC weight, so dividing the patients into three groups is the best-supported choice for this dataset, although the margin over the single-cut-point model is modest (Delta BIC = 1.31).
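
For intuition, the BIC weights in the table can be reproduced (up to rounding) from the Delta BIC column, assuming the standard definition of BIC weights, where each weight is exp(-Delta/2) normalised over the candidate models. A minimal sketch using the rounded values shown above:

# Reproduce the BIC weights from the rounded Delta BIC values:
# weight_i = exp(-Delta_i / 2) / sum_j exp(-Delta_j / 2)
delta_bic <- c(3.43, 1.31, 0.00)
round(100 * exp(-delta_bic / 2) / sum(exp(-delta_bic / 2)), 1)
# Approximately 10.6%, 30.6%, 58.8%; the small discrepancy with the table
# comes from rounding of the Delta BIC values.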

Step 3.2: Find the Location of the Cut-points

Now that we have decided to use two cut-points, we use find_cutpoint() to identify their optimal values.

# Use a different seed for this distinct analysis step.
set.seed(123)
# This function uses a genetic algorithm to find the location of the two best cut-points.
multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic",
  num_cuts = 2,
  nmin = 0.1,
  numgen = 50
)
optimal_cuts <- multi_cut_result$optimal_cuts
print(optimal_cuts)
## [1] 8.416784 8.701395

Interpretation: The genetic algorithm has identified two optimal thresholds. These are the specific abundance values that best separate the patients into three prognostic groups.
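
As a quick sanity check (a sketch, assuming nmin = 0.1 specifies the minimum fraction of patients allowed in each group), we can tabulate how many patients fall into the three groups implied by these thresholds:

# Group sizes implied by the two optimal cut-points.
table(cut(analysis_data$abundance,
          breaks = c(-Inf, optimal_cuts, Inf),
          labels = c("Low", "Medium", "High")))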


4. Cut-point Validation

This is a crucial step to assess how reliable our discovered cut-points are. We use validate_cutpoint() to run a bootstrap analysis. Narrow confidence intervals suggest that the cut-points are stable.

# Seed for reproducibility of the bootstrap process.
set.seed(456)
# This function re-runs the cut-point finding process on many bootstrap samples.
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result,
  num_replicates = 200, # Use a higher number for a final analysis.
  use_parallel = TRUE   # Enable parallel processing.
)
##       Lower  Upper
## Cut 1 6.451  9.979
## Cut 2 7.311 11.814

Interpretation: This table shows the 95% confidence intervals for our two cut-points, giving us a sense of their precision and stability. Here the intervals are fairly wide, so the exact threshold values should be interpreted with some caution.
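
For reference, these limits can be reproduced by hand from the stored bootstrap draws (a sketch, assuming validate_cutpoint() returns the raw draws in $bootstrap_distribution, as used for plotting below, and that the reported limits are simple 2.5%/97.5% percentile intervals):

# Percentile-based 95% intervals computed directly from the bootstrap draws.
apply(validation_result$bootstrap_distribution, 2,
      quantile, probs = c(0.025, 0.975), na.rm = TRUE)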


5. Visualisation and Interpretation

First, we create a new column that categorises each patient into a “Low”, “Medium”, or “High” abundance group based on the optimal_cuts.

# Create a new column with the abundance group based on the optimal cut-points.
analysis_data_categorised <- analysis_data %>%
  mutate(abundance_group = cut(
    abundance,
    breaks = c(-Inf, optimal_cuts, Inf),
    labels = c("Low", "Medium", "High")
  )) %>%
  filter(!is.na(abundance_group))

# Set "Low" abundance as the reference group for the Cox model.
analysis_data_categorised$abundance_group <- relevel(
  analysis_data_categorised$abundance_group, ref = "Low"
)

# Fit the final Cox Proportional-Hazards model on the categorised data.
final_formula <- as.formula("Surv(time, status) ~ abundance_group")
print(final_formula)
## Surv(time, status) ~ abundance_group
cox_model <- coxph(final_formula, data = analysis_data_categorised)
print(cox_model)
## Call:
## coxph(formula = final_formula, data = analysis_data_categorised)
## 
##                          coef exp(coef) se(coef)      z      p
## abundance_groupMedium -1.0593    0.3467   0.4301 -2.463 0.0138
## abundance_groupHigh    0.1061    1.1119   0.1889  0.561 0.5746
## 
## Likelihood ratio test=10.25  on 2 df, p=0.005954
## n= 581, number of events= 119
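
Before plotting, the hazard ratios and their 95% confidence intervals can be pulled straight from the fitted model, which gives a numerical companion to the forest plot shown below.

# Hazard ratios with 95% confidence intervals from the fitted Cox model.
exp(cbind(HR = coef(cox_model), confint(cox_model)))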

Plot A: Kaplan-Meier Curve

A Kaplan-Meier curve shows the probability of survival over time for different patient groups.

# Fit the Kaplan-Meier model.
km_fit <- survfit(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)
# Create the plot using ggsurvplot.
km_plot <- ggsurvplot(
  km_fit, data = analysis_data_categorised, pval = TRUE, risk.table = TRUE,
  legend.title = microbe_name_to_analyze, legend.labs = c("Low", "Medium", "High"),
  palette = c("#2E9FDF", "#E7B800", "#FC4E07") # Custom colour palette
)

print(km_plot)

Plot B: Forest Plot of Hazard Ratios

A forest plot visualises the Hazard Ratios (HR) from our Cox model.

# ggforest creates a plot of the hazard ratios from the Cox model.
forest_plot <- ggforest(
  cox_model, data = analysis_data_categorised,
  main = "Hazard Ratios for Abundance Groups"
)

print(forest_plot)

Plot C: Bootstrap Distribution Plot

This plot shows the distribution of the cut-points found across all bootstrap samples.

# Prepare the bootstrap data for plotting with ggplot2.
bootstrap_df <- as.data.frame(validation_result$bootstrap_distribution)
names(bootstrap_df) <- c("Cut-point 1", "Cut-point 2")
bootstrap_long <- bootstrap_df %>%
  pivot_longer(cols = everything(), names_to = "Cutpoint", values_to = "Value")

# Create the density plot.
bootstrap_plot <- ggplot(bootstrap_long, aes(x = Value, fill = Cutpoint)) +
  geom_density(alpha = 0.7, show.legend = FALSE) +
  # Add vertical lines for the originally identified optimal cuts.
  geom_vline(data = data.frame(Cutpoint = c("Cut-point 1", "Cut-point 2"), val = optimal_cuts),
             aes(xintercept = val), colour = "red", linetype = "dashed", linewidth = 1) +
  labs(title = "Bootstrap Distribution of Cut-points", x = "Abundance Value", y = "Density") +
  theme_minimal() +
  facet_wrap(~ Cutpoint, scales = "free")

print(bootstrap_plot)

Final Combined Figure

Finally, we use the patchwork package to combine our primary plots into a single figure.

# We extract the plot component from the ggsurvplot object to combine it with other ggplots.
publication_figure <- (km_plot$plot) / (forest_plot + bootstrap_plot) +
  plot_annotation(tag_levels = 'A')

print(publication_figure)
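
If the combined figure needs to be saved to disk, ggsave() works directly on the patchwork object (the file name and dimensions below are illustrative):

# Save the combined figure to a file (file name and size are illustrative).
ggsave("Parvimonas_survival_figure.png", plot = publication_figure,
       width = 12, height = 9, dpi = 300)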


6. Model Diagnostics

Plot D: Biomarker Distribution by Risk Group

This box plot provides a clear visual confirmation of how the find_cutpoint() function has partitioned the patients.

dist_plot <- ggplot(analysis_data_categorised, aes(x = abundance_group, y = abundance, fill = abundance_group)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = paste("Distribution of", microbe_name_to_analyze, "Abundance by Risk Group"),
    x = "Prognostic Group",
    y = "Relative Abundance"
  ) +
  scale_fill_manual(values = c("#2E9FDF", "#E7B800", "#FC4E07")) +
  theme_minimal()

print(dist_plot)

Plot E: Checking the Proportional Hazards Assumption

A key assumption of the Cox model is that the effect of a predictor is constant over time (proportional hazards). We can check this with the Schoenfeld residuals test and plot: a non-significant test result, with no systematic trend in the residuals over time, supports the assumption.

# Test the proportional hazards assumption for our Cox model.
schoenfeld_residuals <- cox.zph(cox_model)
# Print the test results to the console. A non-significant p-value is good.
print(schoenfeld_residuals)
##                 chisq df    p
## abundance_group 0.153  2 0.93
## GLOBAL          0.153  2 0.93
# Visualise the residuals.
ggcoxzph(schoenfeld_residuals)


7. Clean Up Environment

It is good practice to remove objects that are no longer needed to keep the R environment tidy.

# Remove all the intermediate and final objects created during the analysis
rm(clinical_data, microbiome_data, survival_data, microbe_row, abundance_long,
   analysis_data, analysis_data_categorised, number_result, multi_cut_result,
   optimal_cuts, validation_result, final_formula, cox_model, km_fit, km_plot,
   forest_plot, bootstrap_df, bootstrap_long, bootstrap_plot, publication_figure,
   dist_plot, schoenfeld_residuals, microbe_name_to_analyze)

8. Session Information

For reproducibility, the session information below lists the R version and all attached packages.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.1 survminer_0.5.0 ggpubr_0.6.1    ggplot2_3.5.2  
##  [5] survival_3.8-3  tidyr_1.3.1     dplyr_1.1.4     readxl_1.4.5   
##  [9] OptCutR_0.1.0   pacman_0.5.1   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.52          bslib_0.9.0        rstatix_0.7.2     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.6  RColorBrewer_1.1-3 lifecycle_1.0.4    stringr_1.5.1     
## [17] compiler_4.5.1     farver_2.1.2       codetools_0.2-20   carData_3.0-5     
## [21] litedown_0.7       htmltools_0.5.8.1  sass_0.4.10        yaml_2.3.10       
## [25] Formula_1.2-5      pillar_1.11.0      car_3.1-3          jquerylib_0.1.4   
## [29] cachem_1.1.0       iterators_1.0.14   rgenoud_5.9-0.11   abind_1.4-8       
## [33] foreach_1.5.2      km.ci_0.5-6        commonmark_2.0.0   tidyselect_1.2.1  
## [37] digest_0.6.37      stringi_1.8.7      purrr_1.0.4        labeling_0.4.3    
## [41] splines_4.5.1      cowplot_1.2.0      fastmap_1.2.0      grid_4.5.1        
## [45] cli_3.6.5          magrittr_2.0.3     utf8_1.2.6         broom_1.0.8       
## [49] withr_3.0.2        scales_1.4.0       backports_1.5.0    rmarkdown_2.29    
## [53] ggtext_0.1.2       gridExtra_2.3      ggsignif_0.6.4     cellranger_1.1.0  
## [57] zoo_1.8-14         evaluate_1.0.4     knitr_1.50         KMsurv_0.1-6      
## [61] doParallel_1.0.17  markdown_2.0       survMisc_0.5.6     rlang_1.1.6       
## [65] Rcpp_1.1.0         gridtext_0.1.5     xtable_1.8-4       glue_1.8.0        
## [69] xml2_1.3.8         rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1