Introduction

In survival analysis, a common challenge is to stratify subjects based on a continuous predictor. The conventional approach is often to dichotomise this predictor using a single cut-point, typically the median. However, this method can be statistically weak and may fail to capture more complex, non-linear relationships. The fundamental question is often not just where to make a cut, but how many cuts are statistically justified.

This vignette demonstrates how the OptSurvCutR package provides a comprehensive workflow to solve this problem. We will address the following research question:

Based on simulated data, can we find the optimal temperature threshold(s) that best separate fast and slow germination in rapeseed?

To answer this, we will use the germination dataset, which is included with the package. This dataset was simulated based on the findings of Haj Sghaier et al. (2022) to provide a realistic case study. The data was generated based on 7 core temperature groups (5, 10, 15, 20, 25, 30, and 35°C) and 6 intermediate groups (7.5, 12.5, 17.5, 22.5, 27.5, and 32.5°C) to create a rich, continuous predictor.

Analysis Workflow

This guide covers a complete workflow for answering this question using OptSurvCutR:

Installation & Setup: Get the package and load our toolkit.
Data Exploration: Visualise the raw germination patterns.
Step 1: Discovering the Optimal Number of Cut-points: Use find_cutpoint_number() to determine if 1, 2, or more thresholds are best.
Step 2: Discovering the Optimal Value of Cut-points: Use find_cutpoint() to identify the specific temperature thresholds.
Step 3: Validating the Cut-points: Use validate_cutpoint() to assess the stability of our findings.
Visualisation and Interpretation: Create publication-ready plots to understand and present the results.

1. Installation & Setup

Installing OptSurvCutR

First, let us get the necessary tools. OptSurvCutR is available on CRAN, and the development version is on GitHub.

# Option 1: Install the development version from GitHub for the latest features
# You will need the 'remotes' package first: install.packages("remotes")
remotes::install_github("paytonyau/OptSurvCutR")

# Option 2: Install the stable version from CRAN - Pending
# install.packages("OptSurvCutR")

Loading Our Toolkit

Now, we load all the R packages we will need for this analysis.

library(OptSurvCutR)
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(patchwork)
library(knitr)
library(tidyr)
library(cli)

The Rapeseed Germination Data

To create a realistic dataset for demonstrating the package’s capabilities, time-to-event data was simulated based on the findings of Haj Sghaier et al. (2022). The simulation was designed in two parts to generate a rich, continuous predictor with a non-linear relationship to the outcome. First, a core dataset was generated based on parameters reported directly in the source manuscript for seven distinct temperature points (5, 10, 15, 20, 25, 30, and 35°C). These parameters included the germination time window (start and end day), coefficients for a linear growth model (slope and intercept), and the overall germination rate for each temperature. Second, to create a more continuous variable suitable for cut-point optimisation, these parameters were linearly interpolated for six intermediate temperature points (7.5, 12.5, 17.5, 22.5, 27.5, and 32.5°C). The final dataset, germination, is included with the package and was formed by combining these two simulated subsets.

# Load the pre-simulated data included with the package
data("germination", package = "OptSurvCutR")
analysis_data <- germination
head(analysis_data)

##   temperature replicate time    growth germinated
## 1           5         1   22 8.1091549          1
## 2           5         1   18 0.0000000          0
## 3           5         1   15 1.2184871          1
## 4           5         1   22 6.0553687          1
## 5           5         1   19 5.1456349          1
## 6           5         1   14 0.4994083          1

2. Initial Data Exploration

Before modelling, we explore the raw data. The summary table below shows key germination metrics for each of the thirteen temperature groups.

# Create a summary table of the data
summary_table <- analysis_data %>%
  group_by(temperature) %>%
  summarise(
    N_Seeds = n(),
    N_Germinated = sum(germinated),
    Germination_Rate_Pct = mean(germinated) * 100,
    Avg_Time_to_Germinate_Days = mean(time[germinated == 1], na.rm = TRUE)
  ) %>%
  rename(`Temperature (°C)` = temperature)

# Display the table using kable for nice formatting
knitr::kable(summary_table,
      digits = 2,
      caption = "Summary of Germination Outcomes by Temperature Group")

Summary of Germination Outcomes by Temperature Group
Temperature (°C)	N_Seeds	N_Germinated	Germination_Rate_Pct	Avg_Time_to_Germinate_Days
5.0	80	70	87.50	18.73
7.5	80	71	88.75	16.08
10.0	80	72	90.00	13.93
12.5	80	73	91.25	11.85
15.0	80	74	92.50	10.64
17.5	80	76	95.00	9.72
20.0	80	78	97.50	9.59
22.5	80	77	96.25	10.18
25.0	80	75	93.75	8.51
27.5	80	74	92.50	9.36
30.0	80	73	91.25	10.75
32.5	80	58	72.50	12.78
35.0	80	44	55.00	14.59

Interpretation: This table confirms our data simulates the expected biological response. The highest germination rates and fastest germination times occur in the 15-25°C range, with performance dropping off at colder and hotter temperatures.

3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

Our first step is to determine how many cut-points the data supports. Forcing a single cut-point might be too simple, while too many might overfit the data. find_cutpoint_number() uses information criteria to provide statistical evidence for this decision.

Choosing Function Arguments

The function’s behaviour is controlled by two key arguments: method and criterion.

Method	How It Works	Recommendation	Rating (Accuracy & Performance)
`"systematic"`	Exhaustive Search: Tests every single possible cut-point.	Best for 1 cut-point; guarantees the optimal result.	Accuracy: ★★★★★ Performance: ★★★☆☆
`"genetic"`	Evolutionary Search: Uses a smart algorithm to efficiently find a near-perfect solution.	Highly recommended for 2+ cut-points. Much faster than the systematic search.	Accuracy: ★★★★☆ Performance: ★★★★★

Criterion	What It Is	Recommendation	Rating (Accuracy & Performance)
`"AIC"`	Akaike Information Criterion	Balances model fit and complexity. A good general-purpose choice.	Accuracy: ★★★★☆ Performance: ★★★★★
`"AICc"`	Corrected AIC	A version of AIC with a greater penalty for extra parameters. It is specifically recommended for smaller sample sizes.	Accuracy: ★★★★★ Performance: ★★★★★
`"BIC"`	Bayesian Information Criterion	Similar to AIC, but applies a stronger penalty for complexity, especially in larger datasets. It tends to favour simpler models.	Accuracy: ★★★★★ Performance: ★★★★★

We will use the "genetic" search method and the "BIC" criterion, which is excellent for balancing model fit and complexity.

# Using BIC (default)
cli_h2("Step 1: Find the number of cut(s) using BIC")
number_result_bic <- find_cutpoint_number(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic",
  criterion = "BIC",
  use_parallel = TRUE,
  max_cuts = 5,
  nmin = 0.1,
  maxiter = 500,
  seed = 42
)

##  num_cuts      BIC Delta_BIC BIC_Weight                  Evidence
##         0 10878.79    421.52         0%    Essentially no support
##         1 10687.53    230.25         0%    Essentially no support
##         2 10532.11     74.83         0%    Essentially no support
##         3 10457.27      0.00        95%       Substantial support
##         4 10463.31      6.03       4.7% Considerably less support
##         5 10468.70     11.42       0.3%    Essentially no support
##                              cuts
##                                NA
##                              8.33
##                      10.25, 30.24
##                9.43, 13.55, 31.29
##            9.52, 13, 22.54, 30.23
##  8.56, 13.51, 18.55, 23.71, 30.45

summary(number_result_bic)

##  num_cuts      BIC Delta_BIC BIC_Weight                  Evidence
##         0 10878.79    421.52         0%    Essentially no support
##         1 10687.53    230.25         0%    Essentially no support
##         2 10532.11     74.83         0%    Essentially no support
##         3 10457.27      0.00        95%       Substantial support
##         4 10463.31      6.03       4.7% Considerably less support
##         5 10468.70     11.42       0.3%    Essentially no support
##                              cuts
##                                NA
##                              8.33
##                      10.25, 30.24
##                9.43, 13.55, 31.29
##            9.52, 13, 22.54, 30.23
##  8.56, 13.51, 18.55, 23.71, 30.45

##   Group   N Events
## 1    G1 160    141
## 2    G2 160    145
## 3    G3 560    527
## 4    G4 160    102

## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 160    141     18      17      19
## group=G2 160    145     13      12      14
## group=G3 560    527     10      10      11
## group=G4 160    102     15      14      16

## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 1040, number of events= 915 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## groupG2  1.3185    3.7378   0.1372  9.613  < 2e-16 ***
## groupG3  2.3464   10.4480   0.1306 17.965  < 2e-16 ***
## groupG4  0.8722    2.3922   0.1463  5.963 2.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2     3.738    0.26754     2.857     4.891
## groupG3    10.448    0.09571     8.088    13.496
## groupG4     2.392    0.41802     1.796     3.187
## 
## Concordance= 0.706  (se = 0.008 )
## Likelihood ratio test= 497.3  on 3 df,   p=<2e-16
## Wald test            = 393.6  on 3 df,   p=<2e-16
## Score (logrank) test = 475.4  on 3 df,   p=<2e-16

A plot makes the choice clear. The lowest point on the curve indicates the optimal number of cuts.

plot(number_result_bic)

Information criterion (BIC) scores by number of cut-points. The lowest score indicates the optimal number.

Interpretation: The BIC analysis strongly suggests that a model with 3 cut-points is optimal, as it has the lowest information criterion score. This provides statistical justification for creating four distinct temperature groups (e.g., Cool, Sub-Optimal, Optimal, Warm).

Step 2: Find the Optimal Temperature Thresholds

Now that we know we need three cut-points, we use find_cutpoint() to discover their specific values. We will optimise for the "logrank" statistic to find the thresholds that create the most statistically significant separation between the survival curves of the resulting groups.

Choosing an Optimisation `criterion`

Criterion	What It Optimises	Recommendation	Rating (Accuracy & Performance)
`"logrank"`	The statistical significance of the separation between survival curves (maximises the chi-squared statistic).	The most common and standard method. Best when the primary goal is to prove a significant difference.	Accuracy: ★★★★★ Performance: ★★★★★
`"hazard_ratio"`	The effect size (maximises the Hazard Ratio).	Best for clinical interpretability. Finds the cut-point that creates the largest practical difference between groups.	Accuracy: ★★★★★ Performance: ★★★★☆
`"p_value"`	The p-value from a Cox model (minimises the p-value).	A powerful way to find the most significant split, but the p-value itself should be interpreted with caution due to multiple testing.	Accuracy: ★★★★☆ Performance: ★★★★☆

cli_alert_info("Step 2: Finding optimal cut-point values...")

multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic",
  criterion = "logrank",
  use_parallel = TRUE,
  num_cuts = 3,
  nmin = 0.1,
  maxiter = 500,
  seed = 123
)

summary(multi_cut_result)

##   Group   N Events
## 1    G1 160    141
## 2    G2 160    145
## 3    G3 560    527
## 4    G4 160    102

## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 160    141     18      17      19
## group=G2 160    145     13      12      14
## group=G3 560    527     10      10      11
## group=G4 160    102     15      14      16

## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 1040, number of events= 915 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## groupG2  1.3185    3.7378   0.1372  9.613  < 2e-16 ***
## groupG3  2.3464   10.4480   0.1306 17.965  < 2e-16 ***
## groupG4  0.8722    2.3922   0.1463  5.963 2.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2     3.738    0.26754     2.857     4.891
## groupG3    10.448    0.09571     8.088    13.496
## groupG4     2.392    0.41802     1.796     3.187
## 
## Concordance= 0.706  (se = 0.008 )
## Likelihood ratio test= 497.3  on 3 df,   p=<2e-16
## Wald test            = 393.6  on 3 df,   p=<2e-16
## Score (logrank) test = 475.4  on 3 df,   p=<2e-16

##        chisq df    p
## group   1.79  3 0.62
## GLOBAL  1.79  3 0.62

A plot of the temperature distribution with the discovered cut-points overlaid gives us a clear visual confirmation.

plot(multi_cut_result, type = "distribution")

Distribution of the predictor variable with optimal cut-points overlaid.

Interpretation: The algorithm has identified three temperature thresholds: 11.1°C, 18.2°C, and 28.1°C. These values are not arbitrary; they represent the points that best separate the rapeseed seeds into four groups with distinct germination behaviours.

Step 3: Validate the Cut-point Stability

A key question is whether these cut-points are stable or just an artefact of our specific sample. We run a bootstrap analysis with validate_cutpoint() to generate 95% confidence intervals.

cli_alert_info("Step 3: Validating cut-point stability with bootstrapping...")

validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result,
  num_replicates = 100, # Use a higher number (e.g., 500) for publication
  use_parallel = TRUE,
  seed = 456,
  maxiter = 100
)

##        Lower Upper
## Cut 1  7.522 11.35
## Cut 2 12.532 17.42
## Cut 3 29.003 32.34

summary(validation_result)

##         Mean Median    SD   IQR
## Cut 1  8.907  8.843 1.016 1.520
## Cut 2 14.573 14.409 1.469 2.521
## Cut 3 31.084 31.099 0.882 1.093

##        Lower Upper
## Cut 1  7.522 11.35
## Cut 2 12.532 17.42
## Cut 3 29.003 32.34

A density plot of the bootstrap results provides a powerful visual for assessing stability. Narrow, sharp peaks indicate a highly stable cut-point.

plot(validation_result)

Bootstrap distribution of the three optimal cut-points. The solid red line represents the original cut-point.

Interpretation: The bootstrap results give us confidence in our findings. The confidence intervals for the cut-points are relatively narrow, especially for the lower and middle thresholds. This suggests that these are robust findings and not just random chance.

4. Visualising and Interpreting the Final Model

Now we can use our validated, optimal cut-points to create a final model and generate publication-ready visualisations.

# Extract the cuts for plotting
optimal_cuts <- multi_cut_result$optimal_cuts

# Create a temperature group variable with four levels
analysis_data$Temp_Group <- cut(
  analysis_data$temperature,
  breaks = c(-Inf, optimal_cuts, Inf),
  labels = c("Cool", "Sub-Optimal", "Optimal", "Warm")
)

Final Group Composition

This table shows which experimental temperatures were assigned to our four statistically-derived groups.

composition_table <- analysis_data %>%
  group_by(Temp_Group) %>%
  summarise(
    Temperatures_in_Group = paste(sort(unique(temperature)), collapse = ", ")
  )

knitr::kable(composition_table, caption = "Composition of Temperature Groups")

Composition of Temperature Groups
Temp_Group	Temperatures_in_Group
Cool	5, 7.5
Sub-Optimal	10, 12.5
Optimal	15, 17.5, 20, 22.5, 25, 27.5, 30
Warm	32.5, 35

Plot A: Kaplan-Meier Curve by Optimal Groups

This plot is the primary result, showing the time-to-germination for the four groups created by our optimal cut-points.

group_levels <- levels(analysis_data$Temp_Group)
palette <- c("#2E9FDF", "#E7B800", "#4CAF50", "#FC4E07")

km_fit_final <- survfit(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
ggsurvplot(
  km_fit_final, data = analysis_data, pval = TRUE, risk.table = TRUE,
  title = "Germination by Optimal Temperature Strata",  
  xlab = "Time (Days)", ylab = "Proportion Ungerminated",  
  legend.title = "Temp Group",  
  legend.labs = group_levels,
  palette = palette
)

Kaplan-Meier survival curve for germination by temperature group.

Kaplan-Meier Plot Interpretation: The plot confirms our cut-points successfully stratified the data. The “Optimal” group curve drops most steeply, indicating the fastest germination. The “Cool” and “Warm” groups show much flatter curves, confirming inhibited germination. The p-value (< 0.0001) indicates the differences between these groups are highly statistically significant.

Plot B: Hazard Ratios (Forest Plot)

The forest plot visualises the Hazard Ratios (HR) for each group relative to the “Optimal” group.

analysis_data$Temp_Group <- relevel(analysis_data$Temp_Group, ref = "Optimal")
cox_model <- coxph(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
ggforest(cox_model, data = analysis_data, main = "Hazard Ratios Relative to Optimal Temperature")

Forest plot of Hazard Ratios for each temperature group relative to the optimal group.

Forest Plot Interpretation: This plot shows the rate of germination for each group compared to the “Optimal” group. An HR less than 1 (like for the “Cool” and “Warm” groups) means a significantly lower rate of germination. An HR greater than 1 would mean a higher rate. Since all confidence intervals are far from crossing the vertical line at 1.0, all groups have a germination rate that is significantly different from the optimal group.

Plot C: Cumulative Hazard Plot

This diagnostic plot helps to visually check the proportional hazards assumption of the Cox model. For the model to be valid, the hazard ratios between the groups should be roughly constant over time, which is represented by parallel, non-crossing lines.

ggsurvplot(
  km_fit_final, data = analysis_data, fun = "cumhaz", pval = TRUE,
  title = "Cumulative Hazard of Germination",  
  xlab = "Time (Days)", ylab = "Cumulative Hazard",  
  legend.title = "Temp Group",  
  legend.labs = group_levels,
  palette = palette
)

Cumulative hazard plot for germination by temperature group, used to check the proportional hazards assumption.

Cumulative Hazard Plot Interpretation: The well-separated and parallel lines in the plot suggest that the proportional hazards assumption is met for this model, giving us confidence in the results from our Cox model and forest plot.

5. Conclusion & Next Steps

This vignette has demonstrated the three-step workflow for cut-point analysis using OptSurvCutR. By following this workflow, users can confidently identify and validate robust, statistically-optimal thresholds in their own survival data, moving beyond simple median splits to uncover more nuanced relationships.

We encourage you to try OptSurvCutR with your own data.

Install the package from GitHub: remotes::install_github("paytonyau/OptSurvCutR")
Report issues or suggest features on our GitHub page.
Star the repository if you find it useful.

6. Session Information

For reproducibility, the session information below lists the R version and all attached packages used to run this analysis.

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cli_3.6.5         tidyr_1.3.1       knitr_1.50        patchwork_1.3.1  
##  [5] dplyr_1.1.4       survminer_0.5.1   ggpubr_0.6.1      ggplot2_3.5.2    
##  [9] survival_3.8-3    OptSurvCutR_0.1.5
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.52          bslib_0.9.0        rstatix_0.7.2     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 lifecycle_1.0.4    stringr_1.5.1     
## [17] compiler_4.5.1     farver_2.1.2       codetools_0.2-20   carData_3.0-5     
## [21] litedown_0.7       htmltools_0.5.8.1  sass_0.4.10        yaml_2.3.10       
## [25] Formula_1.2-5      pillar_1.11.0      car_3.1-3          jquerylib_0.1.4   
## [29] cachem_1.1.0       iterators_1.0.14   rgenoud_5.9-0.11   abind_1.4-8       
## [33] foreach_1.5.2      km.ci_0.5-6        commonmark_2.0.0   tidyselect_1.2.1  
## [37] digest_0.6.37      stringi_1.8.7      purrr_1.1.0        labeling_0.4.3    
## [41] splines_4.5.1      cowplot_1.2.0      fastmap_1.2.0      grid_4.5.1        
## [45] magrittr_2.0.3     broom_1.0.9        withr_3.0.2        scales_1.4.0      
## [49] backports_1.5.0    rmarkdown_2.29     ggtext_0.1.2       gridExtra_2.3     
## [53] ggsignif_0.6.4     zoo_1.8-14         evaluate_1.0.5     KMsurv_0.1-6      
## [57] doParallel_1.0.17  markdown_2.0       survMisc_0.5.6     rlang_1.1.6       
## [61] gridtext_0.1.5     Rcpp_1.1.0         xtable_1.8-4       glue_1.8.0        
## [65] xml2_1.4.0         rstudioapi_0.17.1  jsonlite_2.0.0     R6_2.6.1

Analysing Rapeseed Germination Data with OptSurvCutR

A Workflow for Discovering and Validating Optimal Temperature Thresholds for Time-To-Event Data

Payton Yau

2025-09-09

Introduction

Analysis Workflow

1. Installation & Setup

Installing OptSurvCutR

Loading Our Toolkit

The Rapeseed Germination Data

2. Initial Data Exploration

3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

Choosing Function Arguments

Step 2: Find the Optimal Temperature Thresholds

Choosing an Optimisation `criterion`

Step 3: Validate the Cut-point Stability

4. Visualising and Interpreting the Final Model

Final Group Composition

Plot A: Kaplan-Meier Curve by Optimal Groups

Plot B: Hazard Ratios (Forest Plot)

Plot C: Cumulative Hazard Plot

5. Conclusion & Next Steps

6. Session Information

Analysing Rapeseed Germination Data with OptSurvCutR

A Workflow for Discovering and Validating Optimal Temperature Thresholds for Time-To-Event Data

Payton Yau

2025-09-09

Introduction

Analysis Workflow

1. Installation & Setup

Installing OptSurvCutR

Loading Our Toolkit

The Rapeseed Germination Data

2. Initial Data Exploration

3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

Choosing Function Arguments

Step 2: Find the Optimal Temperature Thresholds

Choosing an Optimisation criterion

Step 3: Validate the Cut-point Stability

4. Visualising and Interpreting the Final Model

Final Group Composition

Plot A: Kaplan-Meier Curve by Optimal Groups

Plot B: Hazard Ratios (Forest Plot)

Plot C: Cumulative Hazard Plot

5. Conclusion & Next Steps

6. Session Information

Choosing an Optimisation `criterion`