Introduction

In survival analysis, a common challenge is to stratify subjects based on a continuous predictor. The conventional approach is often to dichotomise this predictor using a single cut-point, typically the median. However, this method can be statistically weak and may fail to capture more complex, non-linear relationships. The fundamental question is often not just where to make a cut, but how many cuts are statistically justified.

This vignette demonstrates how the OptSurvCutR package provides a comprehensive workflow to solve this problem. We will address the following research question:

Based on simulated data, can we find the optimal temperature threshold(s) that best separate fast and slow germination in rapeseed?

To answer this, we write an explicit data generation engine directly into the workflow. This dataset simulates the biological findings of Haj Sghaier et al. (2022) to provide a highly realistic case study. The simulation engine models 7 core temperature environments (5, 10, 15, 20, 25, 30, and 35°C) alongside 6 intermediate linearly interpolated temperature micro-climates (7.5, 12.5, 17.5, 22.5, 27.5, and 32.5°C) to create a rich, multi-tier continuous tracking field.

Analysis Workflow

This guide covers a complete workflow for answering this question using OptSurvCutR:

Installation & Setup: Load our toolkit and execute the data generation engine.
Data Exploration: Visualise the raw germination patterns.
Step 1: Discovering the Optimal Number of Cut-points: Use find_cutpoint_number() to determine if 1, 2, or more thresholds are best.
Step 2: Discovering the Optimal Value of Cut-points: Use find_cutpoint() to identify the specific temperature thresholds.
Step 3: Validating the Cut-points: Use validate_cutpoint() to assess the stability of our findings.
Visualisation and Interpretation: Create publication-ready plots to understand and present the results.

1. Installation & Setup

Loading Our Toolkit

We load all the R packages we will need for this analysis. We use pacman to automatically manage and install any missing packages.

# Check if pacman is installed, if not, install it
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman

# Use pacman to load/install all required packages
pacman::p_load(
  OptSurvCutR, # The core package for this analysis
  survival,    # For core survival analysis functions (Surv, survfit)
  survminer,   # For plotting survival curves (ggsurvplot, ggforest)
  dplyr,       # For data manipulation (pipes %>%, select, mutate)
  ggplot2,     # For creating custom plots
  patchwork,   # For combining multiple ggplots into one figure
  knitr,       # For creating formatted tables (kable)
  tidyr,       # For data tidying
  cli,         # For stylish console messages
  rgenoud      # Required for method = "genetic"
)

The Rapeseed Germination Simulation Engine

We build our experimental dataset by executing a rigorous parametric simulation pipeline matching the original mathematical distributions reported by Haj Sghaier et al. (2022).

# Set a global seed for full reproducibility of the simulation
set.seed(2025)
 
# Original parameters from the manuscript
original_params <- data.frame(
  temperature = c(5, 10, 15, 20, 25, 30, 35),
  start_day = c(13, 7, 4, 4, 3, 4, 9),
  end_day = c(25, 20, 16, 16, 15, 16, 20),
  slope = c(0.9846, 3.6848, 4.9765, 3.8693, 6.5766, 1.1384, 0.1560),
  intercept = c(-13.627, -34.076, -28.116, -17.012, -35.621, -1.2833, -1.0571),
  r_squared = c(0.9585, 0.9501, 0.9307, 0.9524, 0.9564, 0.9237, 0.9574),
  germination_rate = c(0.88, 0.90, 0.92, 0.98, 0.94, 0.91, 0.55)
)
 
# Core data generation function 
generate_hybrid_data <- function(params) {
  set.seed(as.integer(params$temperature * 10))
  n_replicates <- 4
  n_seeds_per_dish <- 20
  n_samples_total <- n_replicates * n_seeds_per_dish
  
  time_data <- round(runif(n_samples_total, min = params$start_day, max = params$end_day))
  
  perfect_growth <- params$intercept + params$slope * time_data
  perfect_growth_variance <- var(perfect_growth)
  if (params$r_squared < 1 && params$r_squared > 0) {
    residual_variance <- perfect_growth_variance * (1 - params$r_squared) / params$r_squared
  } else {
    residual_variance <- 0
  }
  residual_std_dev <- sqrt(residual_variance)
  residuals <- rnorm(n = n_samples_total, mean = 0, sd = residual_std_dev)
  actual_growth <- perfect_growth + residuals
  actual_growth[actual_growth < 0] <- 0
  
  total_germinated <- round(n_samples_total * params$germination_rate)
  status_vector <- c(rep(1, total_germinated), rep(0, n_samples_total - total_germinated))
  shuffled_status <- sample(status_vector)
  
  replicate_vector <- rep(1:n_replicates, each = n_seeds_per_dish)
  actual_growth[shuffled_status == 0] <- 0
  
  data.frame(
    temperature = params$temperature,
    replicate = replicate_vector,
    time = time_data,
    growth = actual_growth,
    germinated = shuffled_status
  )
}
 
# Create Interpolated Parameters to generate a highly rich continuous range
new_temps <- c(7.5, 12.5, 17.5, 22.5, 27.5, 32.5)
params_to_interp <- c("slope", "intercept", "r_squared", "germination_rate", "start_day", "end_day")
 
interpolated_values <- sapply(params_to_interp, function(param) {
  approx(original_params$temperature, original_params[[param]], xout = new_temps)$y
})
 
interpolated_params <- as.data.frame(interpolated_values)
interpolated_params$temperature <- new_temps
 
# Combine original and interpolated parameter matrices
all_params <- rbind(original_params, interpolated_params)
 
# Execute the data list simulation loop across rows
all_data_list <- lapply(1:nrow(all_params), function(i) generate_hybrid_data(all_params[i, ]))
 
# Row-bind into the master 'analysis_data' workspace 
analysis_data <- do.call(rbind, all_data_list)

head(analysis_data)
##   temperature replicate time    growth germinated
## 1           5         1   22 8.1091549          1
## 2           5         1   18 0.0000000          0
## 3           5         1   15 1.2184871          1
## 4           5         1   22 6.0553687          1
## 5           5         1   19 5.1456349          1
## 6           5         1   14 0.4994083          1

2. Initial Data Exploration

Before modeling, we explore the raw properties of the generated vectors. The summary table below shows key germination metrics across all thirteen simulated temperature groups.

# Create a summary table of the data
summary_table <- analysis_data %>%
  group_by(temperature) %>%
  summarise(
    N_Seeds = n(),
    N_Germinated = sum(germinated),
    Germination_Rate_Pct = mean(germinated) * 100,
    Avg_Time_to_Germinate_Days = mean(time[germinated == 1], na.rm = TRUE)
  ) %>%
  rename(`Temperature (°C)` = temperature)

# Display the table using kable for nice formatting
knitr::kable(summary_table,
      digits = 2,
      caption = "Summary of Germination Outcomes by Temperature Group")

Summary of Germination Outcomes by Temperature Group
Temperature (°C)	N_Seeds	N_Germinated	Germination_Rate_Pct	Avg_Time_to_Germinate_Days
5.0	80	70	87.50	18.73
7.5	80	71	88.75	16.08
10.0	80	72	90.00	13.93
12.5	80	73	91.25	11.85
15.0	80	74	92.50	10.64
17.5	80	76	95.00	9.72
20.0	80	78	97.50	9.59
22.5	80	77	96.25	10.18
25.0	80	75	93.75	8.51
27.5	80	74	92.50	9.36
30.0	80	73	91.25	10.75
32.5	80	58	72.50	12.78
35.0	80	44	55.00	14.59

The Agronomic Insight:

This table confirms our data generation script simulates the expected biological response. The highest germination rates and fastest germination times occur in the 15-25°C range, with performance dropping off at colder and hotter temperatures.

3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

Our first step is to determine how many cut-points the data supports. Forcing a single cut-point might be too simple, while too many might overfit the data. find_cutpoint_number() uses information criteria to provide statistical evidence for this decision.

Choosing Function Arguments

The function’s behaviour is controlled by two key arguments: method and criterion.

Method	How It Works	Recommendation	Rating (Accuracy & Performance)
`"systematic"`	Exhaustive Search: Tests every single possible cut-point.	Best for 1 cut-point; guarantees the optimal result.	Accuracy: ★★★★★ Performance: ★★★☆☆
`"genetic"`	Evolutionary Search: Uses a smart algorithm to efficiently find a near-perfect solution.	Highly recommended for 2+ cut-points. Much faster than the systematic search.	Accuracy: ★★★★☆ Performance: ★★★★★

Criterion	What It Is	Recommendation	Rating (Accuracy & Performance)
`"AIC"`	Akaike Information Criterion	Balances model fit and complexity. A good general-purpose choice.	Accuracy: ★★★★☆ Performance: ★★★★★
`"AICc"`	Corrected AIC	A version of AIC with a greater penalty for extra parameters. It is specifically recommended for smaller sample sizes.	Accuracy: ★★★★★ Performance: ★★★★★
`"BIC"`	Bayesian Information Criterion	Similar to AIC, but applies a stronger penalty for complexity, especially in larger datasets. It tends to favour simpler models.	Accuracy: ★★★★★ Performance: ★★★★★

We will use the "genetic" search method and the "BIC" criterion, which is excellent for balancing model fit and complexity.

# --- Step 1: Find the number of cut(s) using BIC ---
number_result_bic <- find_cutpoint_number(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic",        # Use the fast genetic algorithm
  criterion = "BIC",         # Use BIC for model selection
  max_cuts = 5,              # Test 0-5 cuts (6 groups max)
  nmin = 0.1,                # Min 10% per group
  # THE "R-GROUND" SETTINGS
  max.generations = 100,     # Reduced for vignette speed
  pop.size = 100,   
  boundary.enforcement = 1,
  seed = 123
)
## ℹ nmin 0.1 is a proportion. Min. group size set to 104.
## ℹ Finding optimal cut number: method = genetic
## ℹ Running genetic algorithm for 1 cut-point(s)...
## ℹ Running genetic algorithm for 2 cut-point(s)...
## ℹ Running genetic algorithm for 3 cut-point(s)...
## ℹ Running genetic algorithm for 4 cut-point(s)...
## ℹ Running genetic algorithm for 5 cut-point(s)...
## No valid cut-points found for 5 cut(s) due to genetic algorithm failure or constraints.

summary(number_result_bic)
## 
## ── Optimal Cut-point Number Analysis (Genetic) ─────────────────────────────────────────────────────────────────────────
## ✔ Best Model: 3 Cut-points (Criterion: BIC)
## ℹ Optimal Thresholds: "9.835, 13.047, 31.925"
## 
## ── 1. Model Comparison ──
##  Marker num_cuts      BIC Delta_BIC BIC_Weight    Evidence
##                0 10878.79    421.52         0%     Minimal
##                1 10687.53    230.25         0%     Minimal
##                2 10532.11     74.83         0%     Minimal
##       >        3 10457.27      0.00      90.4% Substantial
##                4 10461.76      4.49       9.6%    Moderate
##                5       NA        NA        NA%        <NA>
## ── 2. Clinical Risk Cohorts ──
##  Group   N Events    Median_CI
##     G1 160    141 18 (17 - 19)
##     G2 160    145 13 (12 - 14)
##     G3 560    527 10 (10 - 11)
##     G4 160    102 15 (14 - 16)
## ── 3. Cox Proportional-Hazards ──
##  Group     HR Lower  Upper P_Value Signif
##     G2  3.738 2.857  4.891       0    ***
##     G3 10.448 8.088 13.496       0    ***
##     G4  2.392 1.796  3.187       0    ***
## ℹ Overall Model: Concordance = 0.706 | Log-rank p = 0
## 
## ── 4. Time-Dependent Diagnostics (Schoenfeld) ──
## 
## ✔ Passed: The proportional hazards assumption holds across the follow-up period (Global p = 0.617).
## 
## ── 5. Analysis Parameters ──
## 
## • Search Method: Genetic
## • Predictor: temperature
## • Criterion: BIC
## • Maximum Cuts Evaluated: 5
## • Minimum Group Size (nmin): 104

A plot makes the choice clear. The lowest point on the curve indicates the optimal number of cuts.

# The S3 plot method shows the BIC for each model.
# The lowest point (in orange) is the best.
plot(number_result_bic)

Information criterion (BIC) scores by number of cut-points. The lowest score indicates the optimal number.

The Number Selection Summary:

The BIC analysis strongly suggests that a model with 3 cut-points is optimal, as it has the lowest information criterion score. This provides statistical justification for creating four distinct temperature groups (e.g., Cool, Sub-Optimal, Optimal, Warm).

Step 2: Locating the Tipping Points

Now that we know we need three cut-points, we use find_cutpoint() to discover their specific values. We will optimise for the "logrank" statistic to find the thresholds that create the most statistically significant separation between the survival curves of the resulting groups.

Choosing an Optimisation `criterion`

Criterion	What It Optimises	Recommendation	Rating (Accuracy & Performance)
`"logrank"`	The statistical significance of the separation between survival curves (maximises the chi-squared statistic).	The most common and standard method. Best when the primary goal is to prove a significant difference.	Accuracy: ★★★★★ Performance: ★★★★★
`"hazard_ratio"`	The effect size (maximises the Hazard Ratio).	Best for clinical interpretability. Finds the cut-point that creates the largest practical difference between groups.	Accuracy: ★★★★★ Performance: ★★★★☆
`"p_value"`	The p-value from a Cox model (minimises the p-value).	A powerful way to find the most significant split, but the p-value itself should be interpreted with caution due to multiple testing.	Accuracy: ★★★★☆ Performance: ★★★★☆

# This function searches for the 3 cut-points that best
# separate the data, according to the log-rank test.
multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic",      # Use genetic algorithm
  criterion = "logrank",   # Optimise for the log-rank statistic
  num_cuts = 3,
  nmin = 0.1,              
  n_perm = 50,   
  n_cores = 6,
  # THE "R-GROUND" SETTINGS
  max.generations = 100,
  pop.size = 100,
  boundary.enforcement = 1,
  seed = 123
)
## ℹ nmin 0.1 is a proportion. Min. group size set to 104.
## ℹ Starting genetic search for 3 cut(s)...
## ℹ Running 50 permutations to calculate adjusted p-value...

summary(multi_cut_result)
## 
## ── Optimal Cut-point Analysis for Survival Data (Genetic) ──────────────────────────────────────────────────────────────
## ✔ Optimal Threshold(s): "8.703, 12.804, 31.826"
## ℹ Permutation-Adjusted P-value (50 runs): 0.0196
## 
## ── 1. Stratified Risk Cohorts ──
## 
##  Group   N Events    Median_CI
##     G1 160    141 18 (17 - 19)
##     G2 160    145 13 (12 - 14)
##     G3 560    527 10 (10 - 11)
##     G4 160    102 15 (14 - 16)
## ── 2. Cox Proportional-Hazards ──
##  Group     HR Lower  Upper P_Value Signif
##     G2  3.738 2.857  4.891       0    ***
##     G3 10.448 8.088 13.496       0    ***
##     G4  2.392 1.796  3.187       0    ***
## ℹ Overall Model: Concordance = 0.706 | Log-rank p = 0
## 
## ── 3. Time-Dependent Diagnostics (Schoenfeld) ──
## 
## ✔ Passed: The proportional hazards assumption holds across the follow-up period (Global p = 0.617).
## 
## ── 4. Analysis Parameters ──
## 
## • Search Method: Genetic
## • Predictor: temperature
## • Number of cuts: 3
## • Minimum group size (nmin): 104
## • Permutations: 50

# Manually create a results data frame from the find_cutpoint object
results_summary_df <- data.frame(
  Parameter = c("Predictor", 
                "Criterion", 
                "Optimal Log-Rank Statistic", 
                "Recommended Cut-point(s)"),
  Value = c(
    multi_cut_result$parameters$predictor,
    multi_cut_result$parameters$criterion,
    round(multi_cut_result$optimal_stat, 4),
    paste(round(multi_cut_result$optimal_cuts, 3), collapse = ', ')
  )
)

# Use kable() to create a beautiful and robust table
knitr::kable(
  results_summary_df,
  col.names = c("Metric", "Result"), # Nicer column names
  caption = "Summary of Optimal Temperature Threshold Analysis"
)

Summary of Optimal Temperature Threshold Analysis
Metric	Result
Predictor	temperature
Criterion	logrank
Optimal Log-Rank Statistic	454.4678
Recommended Cut-point(s)	8.703, 12.804, 31.826

A plot of the temperature distribution with the discovered cut-points overlaid gives us a clear visual confirmation.

# The S3 plot method (type = "distribution") shows a histogram
# of the predictor with the optimal cut-points as vertical lines.
plot(multi_cut_result, 
     type = "distribution", 
     palette = "jco", 
     title = "Optimised Biomarker Stratification")

Distribution of the predictor variable with optimal cut-points overlaid.

Step 3: Validate the Cut-point Stability

A key question is whether these cut-points are stable or just an artefact of our specific sample. We run a bootstrap analysis with validate_cutpoint() to generate 95% confidence intervals.

# This function re-runs the find_cutpoint algorithm many times on
# resampled data to see how much the cut-points vary.
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result, # Pass in the results from Step 2
  num_replicates = 150, # Use a lower number for a fast demo (>= 500 for publication)
  n_cores = 6,          # Specify number of cores (set to your preference)
  # THE "R-GROUND" SETTINGS
  max.generations = 100,
  pop.size = 100,
  boundary.enforcement = 1,
  seed = 123        
)
## ℹ Using random seed 123 for reproducibility.
## ℹ Bootstrap `nmin` not set. Using 93 (90% of original) to improve stability.
## ℹ Validating 3 cut(s) from 'genetic' search using 'logrank'.
## ℹ Running 150 replicates on 6 cores...
## ✔ 150 replicates completed.

summary(validation_result)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 8.703, 12.804, 31.826 
## 
## Bootstrap Distribution Summary
## -----------------------------
##       Cut   Mean    SD Median     Q1     Q3
## 25%  Cut1  8.847 0.837  8.841  8.196  9.371
## 25%1 Cut2 14.511 1.292 14.373 13.636 14.944
## 25%2 Cut3 31.186 0.843 31.198 30.646 31.764
## 
## 95% Confidence Intervals
## ------------------------
##        Lower  Upper
## Cut 1  7.548 10.542
## Cut 2 12.593 17.378
## Cut 3 29.428 32.440
## 
## Validation Parameters
## ---------------------
## Replicates Requested: 150 
## Successful Replicates: 150 / 150 ( 100 %)
## Failed Replicates: 0 
## Cores Used: 6 
## Seed: 123 
## Minimum Group Size (nmin): 93 
## Method: genetic 
## Criterion: logrank 
## Covariates: None 
## 
## 
## Stability Assessment:
## ---------------------
## Maximum CI Width (Relative to 10th-90th Percentile Range): 20.5%
## ✔ Model Status: OPTIMAL (Tier 1)
## The thresholds are highly consistent across samples with clean separation between risk cohorts.

A density plot of the bootstrap results provides a powerful visual for assessing stability. Narrow, sharp peaks indicate a highly stable cut-point.

# The S3 plot method for a validation object shows the density
# plots of the cut-points found during bootstrapping.
plot(validation_result)

Bootstrap distribution of the three optimal cut-points. The solid red line represents the original cut-point.

The Validation Summary:

The bootstrap results give us confidence in our findings. The confidence intervals for the cut-points are relatively narrow, especially for the lower and middle thresholds. This suggests that these are robust findings and not just random chance.

4. Advanced Visualisation & Reporting

OptSurvCutR provides a unified, highly extensible S3 plotting engine. Rather than forcing you to extract data, manually format groups, and fit complex Cox models from scratch, you can simply call plot() on your result object and use the type argument to generate a suite of publication-ready clinical graphics.

By default, the package mathematically assigns your data into groups labelled G1, G2, G3, etc., ordered sequentially from the lowest to highest predictor value.

4.1 Verifying Group Composition (The Escape Hatch)

Before diving into deeper diagnostics, you may want to verify exactly which temperatures fell into G1, G2, G3, and G4. You can use return_data = TRUE to extract the raw data with the groups perfectly assigned, acting as an “escape hatch” for custom tables.

# Extract the raw data frame with the optimal groups natively assigned
final_dataset <- plot(multi_cut_result, return_data = TRUE)

# Example logic for viewing which temperatures map to which group:
composition_table <- final_dataset %>%
  group_by(group) %>%
  summarise(
    Temperatures_in_Group = paste(sort(unique(factor)), collapse = ", ")
  )

knitr::kable(composition_table, caption = "Composition of Discovered Temperature Groups")

Composition of Discovered Temperature Groups
group	Temperatures_in_Group
1	5, 7.5
2	10, 12.5
3	15, 17.5, 20, 22.5, 25, 27.5, 30
4	32.5, 35

Based on this table, we can interpret the biological meaning of our groups:

G1: Cool
G2: Sub-Optimal
G3: Optimal
G4: Warm

4.2 Survival Outcomes (`type = "outcome"`)

To generate a standalone Kaplan-Meier curve, use the outcome type. Because OptSurvCutR passes additional arguments directly to the survminer engine, you can customise titles, palettes, and labels in a single line of code.

# 1. Round and format threshold boundaries dynamically
cuts <- round(sort(multi_cut_result$optimal_cuts), 1)
n_cuts <- length(cuts)

# 2. Vectorized construction of group label matrices (Handles 1, 2, or 3+ cuts automatically)
mid_labels <- if (n_cuts > 1) sprintf("G%d (%s - %s°C)", 
                                      2:n_cuts, cuts[-n_cuts], 
                                      cuts[-1]) else NULL
temp_labels <- c(sprintf("G1 (< %s°C)", cuts[1]), 
                 mid_labels, sprintf("G%d (> %s°C)", 
                                     n_cuts + 1, cuts[n_cuts]))

# 3. Call the internal package plotting engine natively
plot(multi_cut_result, type = "outcome",
     title = "Germination by Optimal Temperature Strata",  
     xlab = "Time (Days)", ylab = "Proportion Ungerminated",  
     legend.title = "Temp.",
     legend.labs = temp_labels)

Kaplan-Meier survival curve for germination by temperature group.

The Curve Interpretation:

The “Optimal” group (G3) curve drops most steeply, indicating the fastest germination. The “Cool” (G1) and “Warm” (G4) groups show much flatter curves, confirming inhibited germination. The global p-value confirms these differences are highly statistically significant.

4.3 Hazard Ratios & Forest Plots (`type = "forest"`)

The forest plot visualises the relative risk (Hazard Ratios) between your groups. By default, standard Cox models use G1 as the baseline. However, OptSurvCutR allows you to instantly change the baseline using the reference_group argument—perfect for comparing everything against our “Optimal” G3 group.

# Generate a Forest Plot comparing all groups against G3 (Optimal)
plot(multi_cut_result, 
     type = "forest", 
     reference_group = "G3",
     main = "Hazard Ratios Relative to Optimal Temperature (G3)")
## `height` was translated to `width`.

Forest plot of Hazard Ratios for each temperature group relative to the optimal group.

The Relative Risk Interpretation:

This plot shows the rate of germination for each group compared to the optimal group (G3). An HR less than 1 (e.g., G1 and G4) means a significantly lower rate of germination compared to the baseline. Because none of the confidence interval lines cross the 1.0 vertical threshold, we can conclude that every other temperature zone is significantly worse than the optimal zone.

4.4 2D Cut-point Stability Surface (`plot_validation()`)

To evaluate threshold behavior across samples, plot_validation() projects the high-dimensional bootstrap distribution onto a smooth, continuous 2D contour density topology map. For multi-cut models, use the focus_cuts parameter to pair boundaries for visualisation (defaulting to Cut 1 vs Cut 2).

plot_validation(validation_result, 
                focus_cuts = c(1, 2), 
                main = "Resampling Convergence & Stability Landscape (Cuts 1 & 2)")

Interpretation Guide:

Concentric Elevation Bands: Display the discovery frequency of threshold pairs across bootstrap iterations. A tight, bright central cluster indicates a highly stable, reproducible statistical signal.
The Orange Diamond: Marks your primary dataset’s optimal coordinates.

Why is the diamond sometimes slightly offset from the peak center? 1. Dimensionality Compression: The plot projects a complex multi-cut architecture onto a flat 2D surface. The global genetic engine optimises all cut-points simultaneously, but this contour plane only maps the localised density of two selected boundaries.

Robust Resampling Anchoring: The proximity of the baseline crosshair to the dense resampling hub—combined with narrow confidence intervals (< 20% width) and a Tier 1 (OPTIMAL) status—confirms that your discovered thresholds are resilient, maintain clear cohort separation, and are not artifacts of data noise.

4.5 Advanced Model Diagnostics (`type = "diagnostic"`)

For major publications, reviewers often require proof that your Cox models do not violate the Proportional Hazards assumption. OptSurvCutR automates this complex check natively by calculating and plotting the Schoenfeld residuals.

# Automatically run survival::cox.zph() and plot the residuals
plot(multi_cut_result, 
     type = "diagnostic")
## `geom_smooth()` using formula = 'y ~ x'

Schoenfeld residuals diagnostic plot to check Proportional Hazards assumptions.

The Diagnostic Interpretation:

For the Cox model to be statistically valid, the hazard ratios between the groups must remain roughly constant over time. If the trend lines in this plot are relatively flat (and the global p-value is > 0.05), the Proportional Hazards assumption is met, giving you confidence in your Forest Plot results.

5. Conclusion & Next Steps

This vignette has demonstrated the three-step workflow for cut-point analysis using OptSurvCutR. By following this workflow, users can confidently identify and validate robust, statistically-optimal thresholds in their own survival data, moving beyond simple median splits to uncover more nuanced relationships.

We encourage you to try OptSurvCutR with your own data.

Install the package from GitHub: remotes::install_github("paytonyau/OptSurvCutR")
Report issues or suggest features on our GitHub page.

6. Session Information

For reproducibility, the session information below lists the R version and all attached packages used to run this analysis.

sessionInfo()
## R version 4.6.0 (2026-04-24 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rgenoud_5.9-0.11  cli_3.6.6         tidyr_1.3.2       knitr_1.51        patchwork_1.3.2   dplyr_1.2.1      
##  [7] survminer_0.5.2   ggpubr_0.6.3      ggplot2_4.0.3     survival_3.8-6    OptSurvCutR_0.4.9 pacman_0.5.1     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     rstatix_0.7.3      lattice_0.22-9     digest_0.6.39      magrittr_2.0.5    
##  [7] evaluate_1.0.5     grid_4.6.0         RColorBrewer_1.1-3 iterators_1.0.14   fastmap_1.2.0      foreach_1.5.2     
## [13] doParallel_1.0.17  jsonlite_2.0.0     Matrix_1.7-5       backports_1.5.1    Formula_1.2-5      gridExtra_2.3     
## [19] mgcv_1.9-4         doRNG_1.8.6.3      purrr_1.2.2        viridisLite_0.4.3  scales_1.4.0       isoband_0.3.0     
## [25] codetools_0.2-20   jquerylib_0.1.4    abind_1.4-8        rlang_1.2.0        splines_4.6.0      withr_3.0.2       
## [31] cachem_1.1.0       yaml_2.3.12        otel_0.2.0         tools_4.6.0        parallel_4.6.0     ggsignif_0.6.4    
## [37] rngtools_1.5.2     broom_1.0.13       vctrs_0.7.3        R6_2.6.1           lifecycle_1.0.5    car_3.1-5         
## [43] MASS_7.3-65        pkgconfig_2.0.3    pillar_1.11.1      bslib_0.11.0       gtable_0.3.6       glue_1.8.1        
## [49] Rcpp_1.1.1-1.1     xfun_0.57          tibble_3.3.1       tidyselect_1.2.1   rstudioapi_0.18.0  farver_2.1.2      
## [55] nlme_3.1-169       htmltools_0.5.9    labeling_0.4.3     carData_3.0-6      rmarkdown_2.31     compiler_4.6.0    
## [61] S7_0.2.2

Analysing Rapeseed Germination Data with OptSurvCutR

A Workflow for Discovering and Validating Optimal Temperature Thresholds for Time-To-Event Data

Payton Yau

2026-06-04

Introduction

Analysis Workflow

1. Installation & Setup

Loading Our Toolkit

The Rapeseed Germination Simulation Engine

2. Initial Data Exploration

3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

Choosing Function Arguments

Step 2: Locating the Tipping Points

Choosing an Optimisation `criterion`

Step 3: Validate the Cut-point Stability

4. Advanced Visualisation & Reporting

4.1 Verifying Group Composition (The Escape Hatch)

4.2 Survival Outcomes (`type = "outcome"`)

4.3 Hazard Ratios & Forest Plots (`type = "forest"`)

4.4 2D Cut-point Stability Surface (`plot_validation()`)

4.5 Advanced Model Diagnostics (`type = "diagnostic"`)

5. Conclusion & Next Steps

6. Session Information

Analysing Rapeseed Germination Data with OptSurvCutR

A Workflow for Discovering and Validating Optimal Temperature Thresholds for Time-To-Event Data

Payton Yau

2026-06-04

Introduction

Analysis Workflow

1. Installation & Setup

Loading Our Toolkit

The Rapeseed Germination Simulation Engine

2. Initial Data Exploration

3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

Choosing Function Arguments

Step 2: Locating the Tipping Points

Choosing an Optimisation criterion

Step 3: Validate the Cut-point Stability

4. Advanced Visualisation & Reporting

4.1 Verifying Group Composition (The Escape Hatch)

4.2 Survival Outcomes (type = "outcome")

4.3 Hazard Ratios & Forest Plots (type = "forest")

4.4 2D Cut-point Stability Surface (plot_validation())

4.5 Advanced Model Diagnostics (type = "diagnostic")

5. Conclusion & Next Steps

6. Session Information

Choosing an Optimisation `criterion`

4.2 Survival Outcomes (`type = "outcome"`)

4.3 Hazard Ratios & Forest Plots (`type = "forest"`)

4.4 2D Cut-point Stability Surface (`plot_validation()`)

4.5 Advanced Model Diagnostics (`type = "diagnostic"`)