In survival analysis, a common challenge is to stratify subjects based on a continuous predictor. The conventional approach is often to dichotomise this predictor using a single cut-point, typically the median. However, this method can be statistically weak and may fail to capture more complex, non-linear relationships. The fundamental question is often not just where to make a cut, but how many cuts are statistically justified.
This vignette demonstrates how the
OptSurvCutR
package provides a
comprehensive workflow to solve this problem. We will address the
following research question:
Based on simulated data, can we find the optimal temperature threshold(s) that best separate fast and slow germination in rapeseed?
To answer this, we will use the germination
dataset,
which is included with the package. This dataset was simulated based on
the findings of Haj Sghaier et al. (2022) to provide a realistic case
study. The data was generated based on 7 core temperature groups (5, 10,
15, 20, 25, 30, and 35°C) and 6 intermediate groups (7.5, 12.5, 17.5,
22.5, 27.5, and 32.5°C) to create a rich, continuous predictor.
This guide covers a complete workflow for answering this question
using OptSurvCutR
:
find_cutpoint_number()
to
determine if 1, 2, or more thresholds are best.find_cutpoint()
to identify the
specific temperature thresholds.validate_cutpoint()
to assess the stability of our
findings.First, let us get the necessary tools. OptSurvCutR
is
available on CRAN, and the development version is on GitHub.
# Option 1: Install the development version from GitHub for the latest features
# You will need the 'remotes' package first: install.packages("remotes")
remotes::install_github("paytonyau/OptSurvCutR")
# Option 2: Install the stable version from CRAN - Pending
# install.packages("OptSurvCutR")
Now, we load all the R packages we will need for this analysis.
library(OptSurvCutR)
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(patchwork)
library(knitr)
library(tidyr)
library(cli)
To create a realistic dataset for demonstrating the package’s
capabilities, time-to-event data was simulated based on the findings of
Haj Sghaier et al. (2022). The simulation was designed in two parts to
generate a rich, continuous predictor with a non-linear relationship to
the outcome. First, a core dataset was generated based on parameters
reported directly in the source manuscript for seven distinct
temperature points (5, 10, 15, 20, 25, 30, and 35°C). These parameters
included the germination time window (start and end day), coefficients
for a linear growth model (slope and intercept), and the overall
germination rate for each temperature. Second, to create a more
continuous variable suitable for cut-point optimisation, these
parameters were linearly interpolated for six intermediate temperature
points (7.5, 12.5, 17.5, 22.5, 27.5, and 32.5°C). The final dataset,
germination
, is included with the package and was formed by
combining these two simulated subsets.
# Load the pre-simulated data included with the package
data("germination", package = "OptSurvCutR")
analysis_data <- germination
head(analysis_data)
## temperature replicate time growth germinated
## 1 5 1 22 8.1091549 1
## 2 5 1 18 0.0000000 0
## 3 5 1 15 1.2184871 1
## 4 5 1 22 6.0553687 1
## 5 5 1 19 5.1456349 1
## 6 5 1 14 0.4994083 1
Before modelling, we explore the raw data. The summary table below shows key germination metrics for each of the thirteen temperature groups.
# Create a summary table of the data
summary_table <- analysis_data %>%
group_by(temperature) %>%
summarise(
N_Seeds = n(),
N_Germinated = sum(germinated),
Germination_Rate_Pct = mean(germinated) * 100,
Avg_Time_to_Germinate_Days = mean(time[germinated == 1], na.rm = TRUE)
) %>%
rename(`Temperature (°C)` = temperature)
# Display the table using kable for nice formatting
knitr::kable(summary_table,
digits = 2,
caption = "Summary of Germination Outcomes by Temperature Group")
Temperature (°C) | N_Seeds | N_Germinated | Germination_Rate_Pct | Avg_Time_to_Germinate_Days |
---|---|---|---|---|
5.0 | 80 | 70 | 87.50 | 18.73 |
7.5 | 80 | 71 | 88.75 | 16.08 |
10.0 | 80 | 72 | 90.00 | 13.93 |
12.5 | 80 | 73 | 91.25 | 11.85 |
15.0 | 80 | 74 | 92.50 | 10.64 |
17.5 | 80 | 76 | 95.00 | 9.72 |
20.0 | 80 | 78 | 97.50 | 9.59 |
22.5 | 80 | 77 | 96.25 | 10.18 |
25.0 | 80 | 75 | 93.75 | 8.51 |
27.5 | 80 | 74 | 92.50 | 9.36 |
30.0 | 80 | 73 | 91.25 | 10.75 |
32.5 | 80 | 58 | 72.50 | 12.78 |
35.0 | 80 | 44 | 55.00 | 14.59 |
Interpretation: This table confirms our data simulates the expected biological response. The highest germination rates and fastest germination times occur in the 15-25°C range, with performance dropping off at colder and hotter temperatures.
Our first step is to determine how many cut-points the data supports.
Forcing a single cut-point might be too simple, while too many might
overfit the data. find_cutpoint_number()
uses information
criteria to provide statistical evidence for this decision.
The function’s behaviour is controlled by two key arguments:
method
and criterion
.
Method | How It Works | Recommendation | Rating (Accuracy & Performance) |
---|---|---|---|
"systematic" |
Exhaustive Search: Tests every single possible cut-point. | Best for 1 cut-point; guarantees the optimal result. | Accuracy: ★★★★★ Performance: ★★★☆☆ |
"genetic" |
Evolutionary Search: Uses a smart algorithm to efficiently find a near-perfect solution. | Highly recommended for 2+ cut-points. Much faster than the systematic search. | Accuracy: ★★★★☆ Performance: ★★★★★ |
Criterion | What It Is | Recommendation | Rating (Accuracy & Performance) |
---|---|---|---|
"AIC" |
Akaike Information Criterion | Balances model fit and complexity. A good general-purpose choice. | Accuracy: ★★★★☆ Performance: ★★★★★ |
"AICc" |
Corrected AIC | A version of AIC with a greater penalty for extra parameters. It is specifically recommended for smaller sample sizes. | Accuracy: ★★★★★ Performance: ★★★★★ |
"BIC" |
Bayesian Information Criterion | Similar to AIC, but applies a stronger penalty for complexity, especially in larger datasets. It tends to favour simpler models. | Accuracy: ★★★★★ Performance: ★★★★★ |
We will use the "genetic"
search method and the
"BIC"
criterion, which is excellent for balancing model fit
and complexity.
# Using BIC (default)
cli_h2("Step 1: Find the number of cut(s) using BIC")
number_result_bic <- find_cutpoint_number(
data = analysis_data,
predictor = "temperature",
outcome_time = "time",
outcome_event = "germinated",
method = "genetic",
criterion = "BIC",
use_parallel = TRUE,
max_cuts = 5,
nmin = 0.1,
maxiter = 500,
seed = 42
)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 10878.79 421.52 0% Essentially no support
## 1 10687.53 230.25 0% Essentially no support
## 2 10532.11 74.83 0% Essentially no support
## 3 10457.27 0.00 95% Substantial support
## 4 10463.31 6.03 4.7% Considerably less support
## 5 10468.70 11.42 0.3% Essentially no support
## cuts
## NA
## 8.33
## 10.25, 30.24
## 9.43, 13.55, 31.29
## 9.52, 13, 22.54, 30.23
## 8.56, 13.51, 18.55, 23.71, 30.45
summary(number_result_bic)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 10878.79 421.52 0% Essentially no support
## 1 10687.53 230.25 0% Essentially no support
## 2 10532.11 74.83 0% Essentially no support
## 3 10457.27 0.00 95% Substantial support
## 4 10463.31 6.03 4.7% Considerably less support
## 5 10468.70 11.42 0.3% Essentially no support
## cuts
## NA
## 8.33
## 10.25, 30.24
## 9.43, 13.55, 31.29
## 9.52, 13, 22.54, 30.23
## 8.56, 13.51, 18.55, 23.71, 30.45
## Group N Events
## 1 G1 160 141
## 2 G2 160 145
## 3 G3 560 527
## 4 G4 160 102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 160 141 18 17 19
## group=G2 160 145 13 12 14
## group=G3 560 527 10 10 11
## group=G4 160 102 15 14 16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 1040, number of events= 915
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 1.3185 3.7378 0.1372 9.613 < 2e-16 ***
## groupG3 2.3464 10.4480 0.1306 17.965 < 2e-16 ***
## groupG4 0.8722 2.3922 0.1463 5.963 2.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 3.738 0.26754 2.857 4.891
## groupG3 10.448 0.09571 8.088 13.496
## groupG4 2.392 0.41802 1.796 3.187
##
## Concordance= 0.706 (se = 0.008 )
## Likelihood ratio test= 497.3 on 3 df, p=<2e-16
## Wald test = 393.6 on 3 df, p=<2e-16
## Score (logrank) test = 475.4 on 3 df, p=<2e-16
A plot makes the choice clear. The lowest point on the curve indicates the optimal number of cuts.
plot(number_result_bic)
Information criterion (BIC) scores by number of cut-points. The lowest score indicates the optimal number.
Interpretation: The BIC analysis strongly suggests that a model with 3 cut-points is optimal, as it has the lowest information criterion score. This provides statistical justification for creating four distinct temperature groups (e.g., Cool, Sub-Optimal, Optimal, Warm).
Now that we know we need three cut-points, we use
find_cutpoint()
to discover their specific values. We will
optimise for the "logrank"
statistic to find the thresholds
that create the most statistically significant separation between the
survival curves of the resulting groups.
criterion
Criterion | What It Optimises | Recommendation | Rating (Accuracy & Performance) |
---|---|---|---|
"logrank" |
The statistical significance of the separation between survival curves (maximises the chi-squared statistic). | The most common and standard method. Best when the primary goal is to prove a significant difference. | Accuracy: ★★★★★ Performance: ★★★★★ |
"hazard_ratio" |
The effect size (maximises the Hazard Ratio). | Best for clinical interpretability. Finds the cut-point that creates the largest practical difference between groups. | Accuracy: ★★★★★ Performance: ★★★★☆ |
"p_value" |
The p-value from a Cox model (minimises the p-value). | A powerful way to find the most significant split, but the p-value itself should be interpreted with caution due to multiple testing. | Accuracy: ★★★★☆ Performance: ★★★★☆ |
cli_alert_info("Step 2: Finding optimal cut-point values...")
multi_cut_result <- find_cutpoint(
data = analysis_data,
predictor = "temperature",
outcome_time = "time",
outcome_event = "germinated",
method = "genetic",
criterion = "logrank",
use_parallel = TRUE,
num_cuts = 3,
nmin = 0.1,
maxiter = 500,
seed = 123
)
summary(multi_cut_result)
## Group N Events
## 1 G1 160 141
## 2 G2 160 145
## 3 G3 560 527
## 4 G4 160 102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 160 141 18 17 19
## group=G2 160 145 13 12 14
## group=G3 560 527 10 10 11
## group=G4 160 102 15 14 16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 1040, number of events= 915
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 1.3185 3.7378 0.1372 9.613 < 2e-16 ***
## groupG3 2.3464 10.4480 0.1306 17.965 < 2e-16 ***
## groupG4 0.8722 2.3922 0.1463 5.963 2.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 3.738 0.26754 2.857 4.891
## groupG3 10.448 0.09571 8.088 13.496
## groupG4 2.392 0.41802 1.796 3.187
##
## Concordance= 0.706 (se = 0.008 )
## Likelihood ratio test= 497.3 on 3 df, p=<2e-16
## Wald test = 393.6 on 3 df, p=<2e-16
## Score (logrank) test = 475.4 on 3 df, p=<2e-16
## chisq df p
## group 1.79 3 0.62
## GLOBAL 1.79 3 0.62
A plot of the temperature distribution with the discovered cut-points overlaid gives us a clear visual confirmation.
plot(multi_cut_result, type = "distribution")
Distribution of the predictor variable with optimal cut-points overlaid.
Interpretation: The algorithm has identified three temperature thresholds: 11.1°C, 18.2°C, and 28.1°C. These values are not arbitrary; they represent the points that best separate the rapeseed seeds into four groups with distinct germination behaviours.
A key question is whether these cut-points are stable or just an
artefact of our specific sample. We run a bootstrap analysis with
validate_cutpoint()
to generate 95% confidence
intervals.
cli_alert_info("Step 3: Validating cut-point stability with bootstrapping...")
validation_result <- validate_cutpoint(
cutpoint_result = multi_cut_result,
num_replicates = 100, # Use a higher number (e.g., 500) for publication
use_parallel = TRUE,
seed = 456,
maxiter = 100
)
## Lower Upper
## Cut 1 7.522 11.35
## Cut 2 12.532 17.42
## Cut 3 29.003 32.34
summary(validation_result)
## Mean Median SD IQR
## Cut 1 8.907 8.843 1.016 1.520
## Cut 2 14.573 14.409 1.469 2.521
## Cut 3 31.084 31.099 0.882 1.093
## Lower Upper
## Cut 1 7.522 11.35
## Cut 2 12.532 17.42
## Cut 3 29.003 32.34
A density plot of the bootstrap results provides a powerful visual for assessing stability. Narrow, sharp peaks indicate a highly stable cut-point.
plot(validation_result)
Bootstrap distribution of the three optimal cut-points. The solid red line represents the original cut-point.
Interpretation: The bootstrap results give us confidence in our findings. The confidence intervals for the cut-points are relatively narrow, especially for the lower and middle thresholds. This suggests that these are robust findings and not just random chance.
Now we can use our validated, optimal cut-points to create a final model and generate publication-ready visualisations.
# Extract the cuts for plotting
optimal_cuts <- multi_cut_result$optimal_cuts
# Create a temperature group variable with four levels
analysis_data$Temp_Group <- cut(
analysis_data$temperature,
breaks = c(-Inf, optimal_cuts, Inf),
labels = c("Cool", "Sub-Optimal", "Optimal", "Warm")
)
This table shows which experimental temperatures were assigned to our four statistically-derived groups.
composition_table <- analysis_data %>%
group_by(Temp_Group) %>%
summarise(
Temperatures_in_Group = paste(sort(unique(temperature)), collapse = ", ")
)
knitr::kable(composition_table, caption = "Composition of Temperature Groups")
Temp_Group | Temperatures_in_Group |
---|---|
Cool | 5, 7.5 |
Sub-Optimal | 10, 12.5 |
Optimal | 15, 17.5, 20, 22.5, 25, 27.5, 30 |
Warm | 32.5, 35 |
This plot is the primary result, showing the time-to-germination for the four groups created by our optimal cut-points.
group_levels <- levels(analysis_data$Temp_Group)
palette <- c("#2E9FDF", "#E7B800", "#4CAF50", "#FC4E07")
km_fit_final <- survfit(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
ggsurvplot(
km_fit_final, data = analysis_data, pval = TRUE, risk.table = TRUE,
title = "Germination by Optimal Temperature Strata",
xlab = "Time (Days)", ylab = "Proportion Ungerminated",
legend.title = "Temp Group",
legend.labs = group_levels,
palette = palette
)
Kaplan-Meier survival curve for germination by temperature group.
Kaplan-Meier Plot Interpretation: The plot confirms our cut-points successfully stratified the data. The “Optimal” group curve drops most steeply, indicating the fastest germination. The “Cool” and “Warm” groups show much flatter curves, confirming inhibited germination. The p-value (< 0.0001) indicates the differences between these groups are highly statistically significant.
The forest plot visualises the Hazard Ratios (HR) for each group relative to the “Optimal” group.
analysis_data$Temp_Group <- relevel(analysis_data$Temp_Group, ref = "Optimal")
cox_model <- coxph(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
ggforest(cox_model, data = analysis_data, main = "Hazard Ratios Relative to Optimal Temperature")
Forest plot of Hazard Ratios for each temperature group relative to the optimal group.
Forest Plot Interpretation: This plot shows the rate of germination for each group compared to the “Optimal” group. An HR less than 1 (like for the “Cool” and “Warm” groups) means a significantly lower rate of germination. An HR greater than 1 would mean a higher rate. Since all confidence intervals are far from crossing the vertical line at 1.0, all groups have a germination rate that is significantly different from the optimal group.
This diagnostic plot helps to visually check the proportional hazards assumption of the Cox model. For the model to be valid, the hazard ratios between the groups should be roughly constant over time, which is represented by parallel, non-crossing lines.
ggsurvplot(
km_fit_final, data = analysis_data, fun = "cumhaz", pval = TRUE,
title = "Cumulative Hazard of Germination",
xlab = "Time (Days)", ylab = "Cumulative Hazard",
legend.title = "Temp Group",
legend.labs = group_levels,
palette = palette
)
Cumulative hazard plot for germination by temperature group, used to check the proportional hazards assumption.
Cumulative Hazard Plot Interpretation: The well-separated and parallel lines in the plot suggest that the proportional hazards assumption is met for this model, giving us confidence in the results from our Cox model and forest plot.
This vignette has demonstrated the three-step workflow for cut-point
analysis using OptSurvCutR
. By following this workflow,
users can confidently identify and validate robust,
statistically-optimal thresholds in their own survival data, moving
beyond simple median splits to uncover more nuanced relationships.
We encourage you to try OptSurvCutR
with your own
data.
remotes::install_github("paytonyau/OptSurvCutR")
For reproducibility, the session information below lists the R version and all attached packages used to run this analysis.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cli_3.6.5 tidyr_1.3.1 knitr_1.50 patchwork_1.3.1
## [5] dplyr_1.1.4 survminer_0.5.1 ggpubr_0.6.1 ggplot2_3.5.2
## [9] survival_3.8-3 OptSurvCutR_0.1.5
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0 rstatix_0.7.2
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 lifecycle_1.0.4 stringr_1.5.1
## [17] compiler_4.5.1 farver_2.1.2 codetools_0.2-20 carData_3.0-5
## [21] litedown_0.7 htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [25] Formula_1.2-5 pillar_1.11.0 car_3.1-3 jquerylib_0.1.4
## [29] cachem_1.1.0 iterators_1.0.14 rgenoud_5.9-0.11 abind_1.4-8
## [33] foreach_1.5.2 km.ci_0.5-6 commonmark_2.0.0 tidyselect_1.2.1
## [37] digest_0.6.37 stringi_1.8.7 purrr_1.1.0 labeling_0.4.3
## [41] splines_4.5.1 cowplot_1.2.0 fastmap_1.2.0 grid_4.5.1
## [45] magrittr_2.0.3 broom_1.0.9 withr_3.0.2 scales_1.4.0
## [49] backports_1.5.0 rmarkdown_2.29 ggtext_0.1.2 gridExtra_2.3
## [53] ggsignif_0.6.4 zoo_1.8-14 evaluate_1.0.5 KMsurv_0.1-6
## [57] doParallel_1.0.17 markdown_2.0 survMisc_0.5.6 rlang_1.1.6
## [61] gridtext_0.1.5 Rcpp_1.1.0 xtable_1.8-4 glue_1.8.0
## [65] xml2_1.4.0 rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1