Introduction

This tutorial demonstrates a powerful application of OptSurvCutR by analysing data inspired by published research. We will explore the findings of the paper “The Effects of Temperature and Water on the Seed Germination and Seedling Development of Rapeseed (Brassica napus L.)” by Haj Sghaier et al. (2022).

The Scientific Background

The study by Haj Sghaier et al. investigated how different temperature levels affect the germination of rapeseed. Their key findings were:

  • There is an optimal temperature range for germination, between 15°C and 25°C.
  • The single best temperature for germination speed and success is 20°C.
  • Temperatures below this range (5°C, 10°C) or above it (30°C, 35°C) significantly inhibit or slow down the germination process.

This creates a classic scenario for a time-to-event (survival) analysis, where the “event” is germination. Our research question is: Based on simulated data, can we find the optimal temperature threshold(s) that best separate fast and slow germination in rapeseed?

Analysis Workflow

This guide covers a complete and robust exploratory workflow for survival data using OptSurvCutR:

  1. Setup and Data Simulation: Load libraries and programmatically generate a realistic dataset.
  2. Initial Data Exploration: Visualise the raw germination patterns across all temperature groups.
  3. Step 1: Optimal Cut-point Number Discovery: Use find_cutpoint_number() to let the data suggest the best number of temperature thresholds using information criteria like BIC or AICc.
  4. Step 2: Optimal Cut-point Value Discovery: Use find_cutpoint() to identify the specific values of the thresholds, optimizing for criteria like the log-rank statistic or the Hazard Ratio.
  5. Step 3: Cut-point Validation: Use bootstrapping with validate_cutpoint() to assess the stability of the discovered thresholds.
  6. Visualisation and Interpretation: Create and interpret Kaplan-Meier curves, forest plots, and other diagnostic plots to understand the results.

1. Setup and Data Simulation

Loading Our Toolkit: The R Packages

First, we load all the necessary R packages.

# This chunk assumes the packages are already installed.
# You can install them with: install.packages(c("OptSurvCutR", "survival", "survminer", "dplyr", "ggplot2", "patchwork", "knitr", "tidyr", "cli"))
library(OptSurvCutR)
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(patchwork)
library(knitr)
library(tidyr)
library(cli)

Simulating the Rapeseed Germination Data

To create a realistic dataset for analysis, we simulate data that mimics the findings of the original study.

Part 1: Simulation Based on Manuscript Data

This chunk defines the parameters for the 7 temperature points taken directly from the manuscript (the experimental window, the growth-curve slope, intercept and R², and the estimated germination rate) and generates the first part of our dataset.

# Define All Target Parameters from the Manuscript
original_params <- data.frame(
  temperature = c(5, 10, 15, 20, 25, 30, 35),
  start_day = c(13, 7, 4, 4, 3, 4, 9),
  end_day = c(25, 20, 16, 16, 15, 16, 20),
  slope = c(0.9846, 3.6848, 4.9765, 3.8693, 6.5766, 1.1384, 0.1560),
  intercept = c(-13.627, -34.076, -28.116, -17.012, -35.621, -1.2833, -1.0571),
  r_squared = c(0.9585, 0.9501, 0.9307, 0.9524, 0.9564, 0.9237, 0.9574),
  germination_rate = c(0.88, 0.90, 0.92, 0.98, 0.94, 0.91, 0.55)
)

# Core Data Generation Function
generate_hybrid_data <- function(params) {
  # One reproducible random stream per temperature level
  set.seed(as.integer(params$temperature * 10))
  n_replicates <- 4
  n_seeds_per_dish <- 20
  n_samples_total <- n_replicates * n_seeds_per_dish
  
  # Draw germination times uniformly within the observed experimental window
  time_data <- round(runif(n_samples_total, min = params$start_day, max = params$end_day))
  
  # Reconstruct growth from the published linear model, then add noise
  # calibrated so the simulated R-squared matches the manuscript:
  # R^2 = var(fit) / (var(fit) + var(resid)), so
  # var(resid) = var(fit) * (1 - R^2) / R^2
  perfect_growth <- params$intercept + params$slope * time_data
  perfect_growth_variance <- var(perfect_growth)
  if (params$r_squared < 1 && params$r_squared > 0) {
      residual_variance <- perfect_growth_variance * (1 - params$r_squared) / params$r_squared
  } else {
      residual_variance <- 0
  }
  residual_std_dev <- sqrt(residual_variance)
  residuals <- rnorm(n = n_samples_total, mean = 0, sd = residual_std_dev)
  actual_growth <- perfect_growth + residuals
  actual_growth[actual_growth < 0] <- 0
  
  # Assign germinated (1) vs censored (0) status to hit the target rate
  total_germinated <- round(n_samples_total * params$germination_rate)
  status_vector <- c(rep(1, total_germinated), rep(0, n_samples_total - total_germinated))
  shuffled_status <- sample(status_vector)
  
  replicate_vector <- rep(1:n_replicates, each = n_seeds_per_dish)
  actual_growth[shuffled_status == 0] <- 0  # seeds that never germinate show no growth
  
  data.frame(
    temperature = params$temperature,
    replicate = replicate_vector,
    time = time_data,
    growth = actual_growth,
    germinated = shuffled_status
  )
}

# Generate data for the original 7 temperature points
original_data_list <- lapply(1:nrow(original_params), function(i) generate_hybrid_data(original_params[i, ]))
original_sim_data <- do.call(rbind, original_data_list)
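
As a quick sanity check (a minimal sketch, not part of the package workflow), the empirical R² of growth against time among germinated seeds at a given temperature should land close to the manuscript target:

# For 20°C the target R-squared was 0.9524
check_20 <- subset(original_sim_data, temperature == 20 & germinated == 1)
summary(lm(growth ~ time, data = check_20))$r.squared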

Part 2: Simulation of Interpolated Data

To create a more continuous predictor variable, we interpolate the parameters for 6 intermediate temperature points and generate the second part of our dataset.

# Define the new temperature points to add
new_temps <- c(7.5, 12.5, 17.5, 22.5, 27.5, 32.5)

# Use R's `approx` function for linear interpolation for each parameter
interp_slope <- approx(original_params$temperature, original_params$slope, xout = new_temps)$y
interp_intercept <- approx(original_params$temperature, original_params$intercept, xout = new_temps)$y
interp_r_squared <- approx(original_params$temperature, original_params$r_squared, xout = new_temps)$y
interp_germ_rate <- approx(original_params$temperature, original_params$germination_rate, xout = new_temps)$y
interp_start_day <- approx(original_params$temperature, original_params$start_day, xout = new_temps)$y
interp_end_day <- approx(original_params$temperature, original_params$end_day, xout = new_temps)$y

interpolated_params <- data.frame(
  temperature = new_temps,
  slope = interp_slope,
  intercept = interp_intercept,
  r_squared = interp_r_squared,
  germination_rate = interp_germ_rate,
  start_day = interp_start_day,
  end_day = interp_end_day
)

# Generate data for the 6 interpolated temperature points
interpolated_data_list <- lapply(1:nrow(interpolated_params), function(i) generate_hybrid_data(interpolated_params[i, ]))
interpolated_sim_data <- do.call(rbind, interpolated_data_list)

# Combine both datasets for the final analysis
analysis_data <- rbind(original_sim_data, interpolated_sim_data)
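
Before moving on, it is worth confirming the shape of the combined dataset: 13 temperature levels with 80 seeds each should give 1,040 rows in total.

# 13 temperatures x 4 dishes x 20 seeds = 1040 rows
str(analysis_data)
table(analysis_data$temperature)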

2. Initial Data Exploration

Before finding a cut-point, it is good practice to explore the raw data. The following summary table shows key germination metrics for each of the thirteen temperature groups.

# Create a summary table of the data
summary_table <- analysis_data %>%
  group_by(temperature) %>%
  summarise(
    N_Seeds = n(),
    N_Germinated = sum(germinated),
    Germination_Rate_Pct = mean(germinated) * 100,
    Avg_Time_to_Germinate_Days = mean(time[germinated == 1], na.rm = TRUE)
  ) %>%
  rename(`Temperature (°C)` = temperature)

# Display the table using kable for nice formatting
knitr::kable(summary_table,
      digits = 2,
      caption = "Summary of Germination Outcomes by Temperature Group")
Summary of Germination Outcomes by Temperature Group
Temperature (°C) N_Seeds N_Germinated Germination_Rate_Pct Avg_Time_to_Germinate_Days
5.0 80 70 87.50 18.73
7.5 80 71 88.75 16.08
10.0 80 72 90.00 13.93
12.5 80 73 91.25 11.85
15.0 80 74 92.50 10.64
17.5 80 76 95.00 9.72
20.0 80 78 97.50 9.59
22.5 80 77 96.25 10.18
25.0 80 75 93.75 8.51
27.5 80 74 92.50 9.36
30.0 80 73 91.25 10.75
32.5 80 58 72.50 12.78
35.0 80 44 55.00 14.59

Interpretation: This table confirms that the simulated data reproduces the expected biological response, with the highest germination rates and fastest germination times in the optimal temperature range.
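
To visualise the raw pattern directly, a simple exploratory plot of mean germination time against temperature makes the U-shaped response obvious. This is one possible sketch using the packages already loaded:

# Mean time to germination among germinated seeds, by temperature
analysis_data %>%
  filter(germinated == 1) %>%
  group_by(temperature) %>%
  summarise(mean_time = mean(time)) %>%
  ggplot(aes(x = temperature, y = mean_time)) +
  geom_point() +
  geom_line() +
  labs(x = "Temperature (°C)", y = "Mean time to germination (days)")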


3. The Three-Step Analysis Workflow

Step 1: Determine the Optimal Number of Cut-points

The first step in our analysis is to determine how many cut-points our data supports. We use find_cutpoint_number() to get statistical evidence that guides this decision.

The function relies on a search method, which is the algorithm that finds the optimal cut-point(s). Your choice of method depends on the size of your dataset and the number of cut-points you need to find.

Choosing a Search Method

  • "systematic" (exhaustive search): tests every single possible cut-point, so it guarantees the optimal result. Best for 1 cut-point. Accuracy: ★★★★★ Performance: ★★★☆☆
  • "genetic" (evolutionary search): uses a genetic algorithm to efficiently find a near-perfect solution. Highly recommended for 2+ cut-points, and much faster than the systematic search. Accuracy: ★★★★☆ Performance: ★★★★★

In addition to the search method, you must also choose a statistical criterion. The criterion is the objective function for the optimisation: it gives the algorithm a statistical measure to maximise or minimise (here, an information criterion score to minimise). The package supports several criteria to help you balance the trade-off between model fit and simplicity.

  • "AIC" (Akaike Information Criterion): balances model fit and complexity. A good general-purpose choice. Accuracy: ★★★★☆ Performance: ★★★★★
  • "AICc" (corrected AIC): a version of AIC with a greater penalty for extra parameters; specifically recommended for smaller sample sizes. Accuracy: ★★★★★ Performance: ★★★★★
  • "BIC" (Bayesian Information Criterion): similar to AIC, but applies a stronger penalty for complexity, especially in larger datasets, so it tends to favour simpler models. Accuracy: ★★★★★ Performance: ★★★★★

# Using BIC (default)
cli_h2("Analysis using BIC")
number_result_bic <- find_cutpoint_number(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic", # can be switched to systematic
  criterion = "BIC", # can be switched to AIC or AICc
  max_cuts = 5,
  nmin = 0.1,
  maxiter = 500,
  seed = 42 # Use the built-in seed argument for reproducibility
)
##  num_cuts      BIC Delta_BIC BIC_Weight                  Evidence
##         0 10878.79    421.52         0%    Essentially no support
##         1 10687.53    230.25         0%    Essentially no support
##         2 10532.11     74.83         0%    Essentially no support
##         3 10457.27      0.00        95%       Substantial support
##         4 10463.31      6.03       4.7% Considerably less support
##         5 10468.70     11.42       0.3%    Essentially no support
##                              cuts
##                                NA
##                              8.33
##                      10.25, 30.24
##                9.43, 13.55, 31.29
##            9.52, 13, 22.54, 30.23
##  8.56, 13.51, 18.55, 23.71, 30.45

Why set a seed? The genetic algorithm and bootstrapping both use random processes. Supplying the built-in seed argument fixes the random number generator so that the sequence of random numbers is the same every time you run the code, making your results perfectly reproducible.
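
The effect is easy to demonstrate in base R (a minimal sketch, independent of OptSurvCutR):

set.seed(42); a <- runif(3)
set.seed(42); b <- runif(3)
identical(a, b)  # TRUE: the same seed gives the same draws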

Now let’s print a summary of the find_cutpoint_number() results. This will display a table of information criterion scores for each model, from zero cut-points up to five. The model with the lowest score is considered the best fit for the data.

summary(number_result_bic)
##  num_cuts      BIC Delta_BIC BIC_Weight                  Evidence
##         0 10878.79    421.52         0%    Essentially no support
##         1 10687.53    230.25         0%    Essentially no support
##         2 10532.11     74.83         0%    Essentially no support
##         3 10457.27      0.00        95%       Substantial support
##         4 10463.31      6.03       4.7% Considerably less support
##         5 10468.70     11.42       0.3%    Essentially no support
##                              cuts
##                                NA
##                              8.33
##                      10.25, 30.24
##                9.43, 13.55, 31.29
##            9.52, 13, 22.54, 30.23
##  8.56, 13.51, 18.55, 23.71, 30.45
##   Group   N Events
## 1    G1 160    141
## 2    G2 160    145
## 3    G3 560    527
## 4    G4 160    102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 160    141     18      17      19
## group=G2 160    145     13      12      14
## group=G3 560    527     10      10      11
## group=G4 160    102     15      14      16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 1040, number of events= 915 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## groupG2  1.3185    3.7378   0.1372  9.613  < 2e-16 ***
## groupG3  2.3464   10.4480   0.1306 17.965  < 2e-16 ***
## groupG4  0.8722    2.3922   0.1463  5.963 2.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2     3.738    0.26754     2.857     4.891
## groupG3    10.448    0.09571     8.088    13.496
## groupG4     2.392    0.41802     1.796     3.187
## 
## Concordance= 0.706  (se = 0.008 )
## Likelihood ratio test= 497.3  on 3 df,   p=<2e-16
## Wald test            = 393.6  on 3 df,   p=<2e-16
## Score (logrank) test = 475.4  on 3 df,   p=<2e-16

To better visualise the results from the summary table, we can generate a plot. This plot will show how the information criterion score changes with the number of cut-points, making it easy to identify the optimal number.

plot(number_result_bic)

Interpretation: The BIC analysis suggests that a model with 3 cut-points is optimal, as it has the lowest information criterion score and carries about 95% of the evidence weight. This provides strong statistical justification for creating four distinct temperature groups (e.g., Cool, Sub-Optimal, Optimal, Warm).
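
The BIC_Weight column can be reproduced by hand from the printed scores. As a sketch (assuming the package computes these weights in the usual Akaike-weight fashion):

bic <- c(10878.79, 10687.53, 10532.11, 10457.27, 10463.31, 10468.70)
delta <- bic - min(bic)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))
round(weights, 3)  # the 3-cut model carries ~95% of the evidence weight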

Step 2: Find the Optimal Temperature Thresholds

With the optimal number of cut-points now identified, our next step is to find their specific values. We use the find_cutpoint() function to discover the best temperature thresholds that separate our data into distinct groups. To do this, the algorithm needs a rule to follow, much like a car navigating a winding road needs a GPS. This rule is the optimisation criterion, which tells the search method what “optimal” means. For instance, a criterion of “logrank” is like a GPS programmed to find the steepest, most significant decline in a road, whereas a criterion of “hazard_ratio” is programmed to find the biggest drop in altitude between two points. This choice determines the path the algorithm takes to find the best cut-point.

Choosing an Optimisation Criterion

  • "logrank": maximises the log-rank chi-squared statistic, i.e. the statistical significance of the separation between survival curves. The most common and standard choice; best when the primary goal is to demonstrate a significant difference. Accuracy: ★★★★★ Performance: ★★★★★
  • "hazard_ratio": maximises the Hazard Ratio (the effect size). Best for clinical interpretability; finds the cut-point that creates the largest practical difference between groups. Accuracy: ★★★★★ Performance: ★★★★☆
  • "p_value": minimises the p-value from a Cox model. A powerful way to find the most significant split, but the p-value itself should be interpreted with caution due to multiple testing. Accuracy: ★★★★☆ Performance: ★★★★☆

We will use the "genetic" method to find three cut-points, optimising with the "logrank" criterion to maximise the log-rank chi-squared statistic.

# We will find three cutpoints based on the result from Step 1
multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic",      # can be switched to systematic
  criterion = "logrank", # Optimising for effect size
  num_cuts = 3,
  nmin = 0.1,
  maxiter = 500, # A higher number of generations for a more thorough search
  seed = 123
)

Now that we have found the optimal cut-point values, let’s display a summary of the results. This summary will show the exact cut-point values, the p-value from a log-rank test for the split, and other key information from the analysis.

summary(multi_cut_result)
##   Group   N Events
## 1    G1 160    141
## 2    G2 160    145
## 3    G3 560    527
## 4    G4 160    102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 160    141     18      17      19
## group=G2 160    145     13      12      14
## group=G3 560    527     10      10      11
## group=G4 160    102     15      14      16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 1040, number of events= 915 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## groupG2  1.3185    3.7378   0.1372  9.613  < 2e-16 ***
## groupG3  2.3464   10.4480   0.1306 17.965  < 2e-16 ***
## groupG4  0.8722    2.3922   0.1463  5.963 2.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2     3.738    0.26754     2.857     4.891
## groupG3    10.448    0.09571     8.088    13.496
## groupG4     2.392    0.41802     1.796     3.187
## 
## Concordance= 0.706  (se = 0.008 )
## Likelihood ratio test= 497.3  on 3 df,   p=<2e-16
## Wald test            = 393.6  on 3 df,   p=<2e-16
## Score (logrank) test = 475.4  on 3 df,   p=<2e-16
##        chisq df    p
## group   1.79  3 0.62
## GLOBAL  1.79  3 0.62

To see the cut-points visually and understand their placement along the predictor variable, we can generate a plot of the results. This plot helps to confirm the location of the cut-points and provides a visual representation of how the data is being divided.

plot(multi_cut_result)

Step 2.1: Summarise the Final Model

After finding the cut-points, we use summary() on the result object to inspect the final statistical model. This provides the detailed Hazard Ratios and p-values for the groups created by our analysis.

# You can also customise the output by turning specific sections off
summary(
  multi_cut_result,
  show_medians = TRUE,    # set to FALSE to hide the median survival table
  show_ph_test = TRUE,    # set to FALSE to hide the proportional hazards test
  show_params = TRUE      # set to FALSE to hide the analysis parameters list
)
##   Group   N Events
## 1    G1 160    141
## 2    G2 160    145
## 3    G3 560    527
## 4    G4 160    102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##            n events median 0.95LCL 0.95UCL
## group=G1 160    141     18      17      19
## group=G2 160    145     13      12      14
## group=G3 560    527     10      10      11
## group=G4 160    102     15      14      16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 1040, number of events= 915 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## groupG2  1.3185    3.7378   0.1372  9.613  < 2e-16 ***
## groupG3  2.3464   10.4480   0.1306 17.965  < 2e-16 ***
## groupG4  0.8722    2.3922   0.1463  5.963 2.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2     3.738    0.26754     2.857     4.891
## groupG3    10.448    0.09571     8.088    13.496
## groupG4     2.392    0.41802     1.796     3.187
## 
## Concordance= 0.706  (se = 0.008 )
## Likelihood ratio test= 497.3  on 3 df,   p=<2e-16
## Wald test            = 393.6  on 3 df,   p=<2e-16
## Score (logrank) test = 475.4  on 3 df,   p=<2e-16
##        chisq df    p
## group   1.79  3 0.62
## GLOBAL  1.79  3 0.62

Interpretation: The summary shows the final Cox model. The exp(coef) column gives the Hazard Ratio (HR) for each group relative to the reference group (G1). A high HR (like for G3) indicates a much faster germination time (a higher “hazard” of germinating).
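
The link between the coef and exp(coef) columns is worth making explicit: the Hazard Ratio is simply the exponentiated Cox coefficient.

# e.g. for group G3, using the coefficient from the summary above
exp(2.3464)  # ~10.45, matching the exp(coef) column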


Step 3: Validate the Cut-point Stability

Finally, we run a bootstrap analysis with validate_cutpoint() to generate 95% confidence intervals for our discovered cut-points. This tells us how stable and reliable our thresholds are.

validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result,
  num_replicates = 100, # A higher number (e.g., 500) is recommended for publication
  use_parallel = TRUE,
  seed = 456, # Use the built-in seed argument
  maxiter = 100 # Use fewer generations for faster validation runs
)
##        Lower  Upper
## Cut 1  7.584 10.860
## Cut 2 12.618 16.209
## Cut 3 28.525 32.426

To see the detailed results from the validation, we can print a summary of the validate_cutpoint() object. This shows summary statistics (mean, median, SD, IQR) for the bootstrapped cut-points alongside their 95% confidence intervals.

summary(validation_result)
##         Mean Median    SD   IQR
## Cut 1  8.960  9.031 0.959 1.320
## Cut 2 14.039 13.963 0.875 1.172
## Cut 3 30.975 31.060 0.994 1.247
##        Lower  Upper
## Cut 1  7.584 10.860
## Cut 2 12.618 16.209
## Cut 3 28.525 32.426

Interpretation: The summary output from validate_cutpoint() provides a crucial stability check for our results. For each cut-point we found in the original analysis, it presents a 95% confidence interval based on the bootstrap resamples. For a cut-point to be considered stable and reliable, its confidence interval should be relatively narrow. A narrow interval suggests that if you were to repeat the experiment and re-run the analysis, the optimal cut-point would likely be found very close to the value you originally discovered. A very wide interval, on the other hand, indicates that the cut-point is highly sensitive to changes in the data and may not be a robust finding.
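
For reference, intervals of this kind are typically percentile intervals over the bootstrap draws. A minimal sketch of the calculation, assuming the replicate cut-points were collected into a matrix boot_cuts (a hypothetical name; the slot inside the validation object may differ):

# boot_cuts: hypothetical matrix with one row per replicate, one column per cut
apply(boot_cuts, 2, quantile, probs = c(0.025, 0.975))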

A good way to visualise the stability of the cut-points is with a density plot. This plot shows the distribution of the cut-points found across all the bootstrap resamples. The narrower the peaks and the more closely they centre on the original values, the more stable the cut-points.

plot(validation_result)
Bootstrap distribution of the three optimal cut-points with 95% confidence intervals. The solid red line represents the cut-point found in the original analysis.



4. Visualising the Results

We will now create a series of plots to interpret our findings.

# Extract the cuts for plotting
optimal_cuts <- multi_cut_result$optimal_cuts

# Create a temperature group variable with four levels
analysis_data$Temp_Group <- cut(
  analysis_data$temperature,
  breaks = c(-Inf, optimal_cuts, Inf),
  labels = c("Cool", "Sub-Optimal", "Optimal", "Warm")
)
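
A quick cross-check: the group sizes should match the N column of the Step 2 summary (160, 160, 560, 160).

table(analysis_data$Temp_Group)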

4.1 Group Composition

This table explicitly shows which of the original experimental temperatures were assigned to each of the four new groups.

composition_table <- analysis_data %>%
  group_by(Temp_Group) %>%
  summarise(
    Temperatures_in_Group = paste(sort(unique(temperature)), collapse = ", ")
  )

knitr::kable(composition_table, caption = "Composition of Temperature Groups")
Composition of Temperature Groups
Temp_Group Temperatures_in_Group
Cool 5, 7.5
Sub-Optimal 10, 12.5
Optimal 15, 17.5, 20, 22.5, 25, 27.5, 30
Warm 32.5, 35

Plot A: Kaplan-Meier Curve by Optimal Cut-points

This plot shows the time-to-germination for the four groups created by our optimal cut-points.

group_levels <- levels(analysis_data$Temp_Group)
num_groups <- length(group_levels)
palette <- c("#2E9FDF", "#4CAF50", "#FC4E07", "#E7B800")

km_fit_final <- survfit(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
km_plot_obj <- ggsurvplot(
  km_fit_final, data = analysis_data, pval = TRUE, risk.table = TRUE,
  title = "Germination by Optimal Temperature Strata",  
  xlab = "Time (Days)", ylab = "Proportion Ungerminated",  
  legend.title = "Temp Group",  
  legend.labs = group_levels,
  palette = palette[1:num_groups]
)

km_plot_obj
Kaplan-Meier survival curve for germination by temperature group.


Kaplan-Meier Plot: The Kaplan-Meier plot visually confirms that our optimal cut-points successfully stratified the data into groups with significantly different germination rates. The curves for the “Optimal” and “Sub-Optimal” groups drop much more steeply than the others, indicating a faster germination time. In contrast, the “Cool” and “Warm” groups show a much flatter curve, meaning the seeds in those groups take much longer to germinate or do not germinate at all.

Plot B: Hazard Ratios of Temperature Groups (Forest Plot)

The forest plot visualises the Hazard Ratios (HR) for the “Cool”, “Sub-Optimal”, and “Warm” groups relative to the “Optimal” group.

analysis_data$Temp_Group <- relevel(analysis_data$Temp_Group, ref = "Optimal")
cox_model <- coxph(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
forest_plot <- ggforest(cox_model, data = analysis_data, main = "Hazard Ratios Relative to Optimal Temperature")

forest_plot
Forest plot of Hazard Ratios for each temperature group relative to the optimal group.


Interpretation of the Forest Plot: This plot shows the “hazard” (in this case, the rate of germination) for each group compared to the “Optimal” group.

  • The “Optimal” group is the reference, so its Hazard Ratio is 1.0.
  • An HR less than 1 (like for the “Cool” group) means a lower rate of germination (slower germination).
  • The horizontal lines represent the 95% confidence interval. If an interval does not cross the vertical line at 1.0, the difference is statistically significant.

Plot C: Bootstrap Distribution of Cut-points

This plot shows the distribution of the optimal cut-points found during the bootstrap validation. The plot() method for the validation result makes this easy.

# The plot method automatically creates a faceted density plot
bootstrap_plot <- plot(validation_result)

bootstrap_plot
Bootstrap distribution of the three optimal cut-points with 95% confidence intervals. The solid red line represents the cut-point found in the original analysis.


Interpretation: The density plots show where the optimal cut-points most frequently landed during the bootstrap resamples. The solid red line shows our original cut-point, while the dashed red lines show the 95% confidence interval. Tight, sharp peaks indicate a very stable and reliable result.

Plot D: Cumulative Hazard Plot

This diagnostic plot visualizes the cumulative hazard for each of the four temperature groups.

cumhaz_plot_obj <- ggsurvplot(
  km_fit_final, data = analysis_data, fun = "cumhaz", pval = TRUE,
  title = "Cumulative Hazard of Germination",  
  xlab = "Time (Days)", ylab = "Cumulative Hazard",  
  legend.title = "Temp Group",  
  legend.labs = group_levels,
  palette = palette[1:num_groups]
)

cumhaz_plot_obj
Cumulative hazard plot for germination by temperature group, used to check the proportional hazards assumption.


Interpretation of the Cumulative Hazard Plot: This is a diagnostic plot used to visually check the proportional hazards assumption of the Cox model. For the model to be valid, the hazard ratios between the groups should be roughly constant over time. We check this by looking for lines that are approximately parallel and do not cross. In this case, the lines are well-separated and do not cross, suggesting the assumption is met.
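
For a formal counterpart to this visual check, the standard test from the survival package can be applied to the Cox model fitted for the forest plot. (The summary in Step 2.1 already reported such a test for its internal model, with p = 0.62, consistent with the assumption holding.)

# Schoenfeld residual test of the proportional hazards assumption
cox.zph(cox_model)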

Final Combined Figure

Finally, we use the patchwork package to combine our key plots into a single, comprehensive figure.

# Extract the plot component from the ggsurvplot objects
km_plot <- km_plot_obj$plot
cumhaz_plot <- cumhaz_plot_obj$plot

# Combine the four plots into a 2x2 grid
publication_figure <- (km_plot + cumhaz_plot) / (forest_plot + bootstrap_plot) +
  plot_annotation(tag_levels = 'A')

print(publication_figure)
A combined figure of the key analysis results, including the Kaplan-Meier curve, cumulative hazard plot, forest plot, and bootstrap distribution.


5. Session Information

For reproducibility, the session information below lists the R version and all attached packages used to run this analysis.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] cli_3.6.5         tidyr_1.3.1       knitr_1.50        patchwork_1.3.1  
##  [5] dplyr_1.1.4       survminer_0.5.0   ggpubr_0.6.1      ggplot2_3.5.2    
##  [9] survival_3.8-3    OptSurvCutR_0.1.5
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.52          bslib_0.9.0        rstatix_0.7.2     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 lifecycle_1.0.4    stringr_1.5.1     
## [17] compiler_4.5.1     farver_2.1.2       ggsci_3.2.0        codetools_0.2-20  
## [21] carData_3.0-5      litedown_0.7       htmltools_0.5.8.1  sass_0.4.10       
## [25] yaml_2.3.10        Formula_1.2-5      pillar_1.11.0      car_3.1-3         
## [29] jquerylib_0.1.4    cachem_1.1.0       iterators_1.0.14   rgenoud_5.9-0.11  
## [33] abind_1.4-8        foreach_1.5.2      km.ci_0.5-6        commonmark_2.0.0  
## [37] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7      purrr_1.1.0       
## [41] labeling_0.4.3     splines_4.5.1      cowplot_1.2.0      fastmap_1.2.0     
## [45] grid_4.5.1         magrittr_2.0.3     broom_1.0.9        withr_3.0.2       
## [49] scales_1.4.0       backports_1.5.0    rmarkdown_2.29     ggtext_0.1.2      
## [53] gridExtra_2.3      ggsignif_0.6.4     zoo_1.8-14         evaluate_1.0.4    
## [57] KMsurv_0.1-6       doParallel_1.0.17  markdown_2.0       survMisc_0.5.6    
## [61] rlang_1.1.6        gridtext_0.1.5     Rcpp_1.1.0         xtable_1.8-4      
## [65] glue_1.8.0         xml2_1.3.8         rstudioapi_0.17.1  jsonlite_2.0.0    
## [69] R6_2.6.1