This tutorial demonstrates a powerful application of
OptSurvCutR
by analysing data inspired by published
research. We will explore the findings of the paper “The Effects of Temperature
and Water on the Seed Germination and Seedling Development of Rapeseed
(Brassica napus L.)” by Haj Sghaier et al.
(2022).
The study by Haj Sghaier et al. investigated how different temperature levels affect the germination of rapeseed. In brief, their data showed that germination was fastest and most complete at intermediate temperatures, while it was slower and substantially reduced at the cold and hot extremes (see the parameter table below).
This creates a classic scenario for a time-to-event (survival) analysis, where the “event” is germination. Our research question is: Based on simulated data, can we find the optimal temperature threshold(s) that best separate fast and slow germination in rapeseed?
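As a minimal illustration of that framing (toy values, not the study data), each seed contributes an observation time and an event indicator, which together form a survival object:
# Toy example: day of germination (or last observation day) plus an event flag.
toy <- data.frame(
  time       = c(4, 6, 25, 10),   # days
  germinated = c(1, 1, 0, 1)      # 1 = germinated, 0 = censored (never germinated)
)
survival::Surv(toy$time, toy$germinated) # the censored seed is printed with a "+"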
This guide covers a complete and robust exploratory workflow for survival data using OptSurvCutR:

- find_cutpoint_number() to let the data suggest the best number of temperature thresholds, using information criteria like BIC or AICc.
- find_cutpoint() to identify the specific values of the thresholds, optimising for criteria like the log-rank statistic or the Hazard Ratio.
- validate_cutpoint() to assess the stability of the discovered thresholds.

First, we load all the necessary R packages.
# This chunk assumes the packages are already installed.
# You can install them with: install.packages(c("survival", "survminer", "dplyr", "ggplot2", "patchwork", "knitr", "tidyr", "cli"))
library(OptSurvCutR)
library(survival)
library(survminer)
library(dplyr)
library(ggplot2)
library(patchwork)
library(knitr)
library(tidyr)
library(cli)
To create a realistic dataset for analysis, we simulate data that mimics the findings of the original study.
This chunk defines, for each of the 7 temperature points, the experiment length, growth-regression parameters, and estimated germination rate taken directly from the manuscript, and then generates the first part of our dataset.
# Define All Target Parameters from the Manuscript
original_params <- data.frame(
  temperature = c(5, 10, 15, 20, 25, 30, 35),
  start_day = c(13, 7, 4, 4, 3, 4, 9),
  end_day = c(25, 20, 16, 16, 15, 16, 20),
  slope = c(0.9846, 3.6848, 4.9765, 3.8693, 6.5766, 1.1384, 0.1560),
  intercept = c(-13.627, -34.076, -28.116, -17.012, -35.621, -1.2833, -1.0571),
  r_squared = c(0.9585, 0.9501, 0.9307, 0.9524, 0.9564, 0.9237, 0.9574),
  germination_rate = c(0.88, 0.90, 0.92, 0.98, 0.94, 0.91, 0.55)
)
# Core Data Generation Function
generate_hybrid_data <- function(params) {
  set.seed(as.integer(params$temperature * 10))
  n_replicates <- 4
  n_seeds_per_dish <- 20
  n_samples_total <- n_replicates * n_seeds_per_dish

  # Draw germination times uniformly within the reported start/end window
  time_data <- round(runif(n_samples_total, min = params$start_day, max = params$end_day))

  # Reconstruct growth from the published regression line
  perfect_growth <- params$intercept + params$slope * time_data
  perfect_growth_variance <- var(perfect_growth)

  # Back-calculate the residual variance implied by the reported R-squared:
  # R^2 = signal / (signal + noise), so noise = signal * (1 - R^2) / R^2
  if (params$r_squared < 1 && params$r_squared > 0) {
    residual_variance <- perfect_growth_variance * (1 - params$r_squared) / params$r_squared
  } else {
    residual_variance <- 0
  }
  residual_std_dev <- sqrt(residual_variance)
  residuals <- rnorm(n = n_samples_total, mean = 0, sd = residual_std_dev)
  actual_growth <- perfect_growth + residuals
  actual_growth[actual_growth < 0] <- 0

  # Assign germinated (1) vs censored (0) status to match the reported rate
  total_germinated <- round(n_samples_total * params$germination_rate)
  status_vector <- c(rep(1, total_germinated), rep(0, n_samples_total - total_germinated))
  shuffled_status <- sample(status_vector)
  replicate_vector <- rep(1:n_replicates, each = n_seeds_per_dish)
  actual_growth[shuffled_status == 0] <- 0

  data.frame(
    temperature = params$temperature,
    replicate = replicate_vector,
    time = time_data,
    growth = actual_growth,
    germinated = shuffled_status
  )
}
# Generate data for the original 7 temperature points
original_data_list <- lapply(1:nrow(original_params), function(i) generate_hybrid_data(original_params[i, ]))
original_sim_data <- do.call(rbind, original_data_list)
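Before adding the interpolated temperatures, a quick optional check of the simulated structure can be reassuring (a sketch, not part of the original workflow):
# Optional sanity check: 7 temperatures, each with 80 seeds (4 dishes x 20 seeds)
head(original_sim_data)
table(original_sim_data$temperature)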
To create a more continuous predictor variable, we interpolate the parameters for 6 intermediate temperature points and generate the second part of our dataset.
# Define the new temperature points to add
new_temps <- c(7.5, 12.5, 17.5, 22.5, 27.5, 32.5)
# Use R's `approx` function for linear interpolation for each parameter
interp_slope <- approx(original_params$temperature, original_params$slope, xout = new_temps)$y
interp_intercept <- approx(original_params$temperature, original_params$intercept, xout = new_temps)$y
interp_r_squared <- approx(original_params$temperature, original_params$r_squared, xout = new_temps)$y
interp_germ_rate <- approx(original_params$temperature, original_params$germination_rate, xout = new_temps)$y
interp_start_day <- approx(original_params$temperature, original_params$start_day, xout = new_temps)$y
interp_end_day <- approx(original_params$temperature, original_params$end_day, xout = new_temps)$y
interpolated_params <- data.frame(
temperature = new_temps,
slope = interp_slope,
intercept = interp_intercept,
r_squared = interp_r_squared,
germination_rate = interp_germ_rate,
start_day = interp_start_day,
end_day = interp_end_day
)
# Generate data for the 6 interpolated temperature points
interpolated_data_list <- lapply(1:nrow(interpolated_params), function(i) generate_hybrid_data(interpolated_params[i, ]))
interpolated_sim_data <- do.call(rbind, interpolated_data_list)
# Combine both datasets for the final analysis
analysis_data <- rbind(original_sim_data, interpolated_sim_data)
Before finding a cut-point, it is good practice to explore the raw data. The following summary table shows key germination metrics for each of the thirteen temperature groups.
# Create a summary table of the data
summary_table <- analysis_data %>%
group_by(temperature) %>%
summarise(
N_Seeds = n(),
N_Germinated = sum(germinated),
Germination_Rate_Pct = mean(germinated) * 100,
Avg_Time_to_Germinate_Days = mean(time[germinated == 1], na.rm = TRUE)
) %>%
rename(`Temperature (°C)` = temperature)
# Display the table using kable for nice formatting
knitr::kable(summary_table,
digits = 2,
caption = "Summary of Germination Outcomes by Temperature Group")
| Temperature (°C) | N_Seeds | N_Germinated | Germination_Rate_Pct | Avg_Time_to_Germinate_Days |
|---|---|---|---|---|
| 5.0 | 80 | 70 | 87.50 | 18.73 |
| 7.5 | 80 | 71 | 88.75 | 16.08 |
| 10.0 | 80 | 72 | 90.00 | 13.93 |
| 12.5 | 80 | 73 | 91.25 | 11.85 |
| 15.0 | 80 | 74 | 92.50 | 10.64 |
| 17.5 | 80 | 76 | 95.00 | 9.72 |
| 20.0 | 80 | 78 | 97.50 | 9.59 |
| 22.5 | 80 | 77 | 96.25 | 10.18 |
| 25.0 | 80 | 75 | 93.75 | 8.51 |
| 27.5 | 80 | 74 | 92.50 | 9.36 |
| 30.0 | 80 | 73 | 91.25 | 10.75 |
| 32.5 | 80 | 58 | 72.50 | 12.78 |
| 35.0 | 80 | 44 | 55.00 | 14.59 |
Interpretation: This table confirms our data simulates the expected biological response, with the highest germination rates and fastest germination times in the optimal temperature range.
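For a visual companion to the table, a simple exploratory plot of germination rate against temperature can be sketched with the packages already loaded (illustrative only, not part of the original workflow):
# Optional exploratory sketch: germination rate across all thirteen temperature groups
analysis_data %>%
  group_by(temperature) %>%
  summarise(rate_pct = mean(germinated) * 100, .groups = "drop") %>%
  ggplot(aes(x = temperature, y = rate_pct)) +
  geom_line() +
  geom_point() +
  labs(x = "Temperature (°C)", y = "Germination rate (%)")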
The first step in our analysis is to determine how many cut-points
our data supports. We use find_cutpoint_number()
to get
statistical evidence that guides this decision.
The function relies on a search method (the method argument), which is the algorithm that finds the optimal cut-point(s). Your choice of method depends on the size of your dataset and the number of cut-points you need to find.
| Method | How It Works | Recommendation | Rating (Accuracy & Performance) |
|---|---|---|---|
| "systematic" | Exhaustive Search: Tests every single possible cut-point. | Best for 1 cut-point; guarantees the optimal result. | Accuracy: ★★★★★ Performance: ★★★☆☆ |
| "genetic" | Evolutionary Search: Uses a smart algorithm to efficiently find a near-perfect solution. | Highly recommended for 2+ cut-points. Much faster than the systematic search. | Accuracy: ★★★★☆ Performance: ★★★★★ |
In addition to the search method, you must also choose a statistical criterion. This criterion is the objective function for the optimization algorithm; it gives the algorithm a statistical measure to maximize or minimize (e.g., maximizing the log-rank statistic). The package supports several criteria to help you balance the trade-off between model fit and simplicity.
| Criterion | What It Is | Recommendation | Rating (Accuracy & Performance) |
|---|---|---|---|
| "AIC" | Akaike Information Criterion | Balances model fit and complexity. A good general-purpose choice. | Accuracy: ★★★★☆ Performance: ★★★★★ |
| "AICc" | Corrected AIC | A version of AIC with a greater penalty for extra parameters. It is specifically recommended for smaller sample sizes. | Accuracy: ★★★★★ Performance: ★★★★★ |
| "BIC" | Bayesian Information Criterion | Similar to AIC, but applies a stronger penalty for complexity, especially in larger datasets. It tends to favor simpler models. | Accuracy: ★★★★★ Performance: ★★★★★ |
# Using BIC (default)
cli_h2("Analysis using BIC")
number_result_bic <- find_cutpoint_number(
data = analysis_data,
predictor = "temperature",
outcome_time = "time",
outcome_event = "germinated",
method = "genetic", # can be switched to systematic
criterion = "BIC", # can be switched to AIC or AICc
max_cuts = 5,
nmin = 0.1,
maxiter = 500,
seed = 42 # Use the built-in seed argument for reproducibility
)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 10878.79 421.52 0% Essentially no support
## 1 10687.53 230.25 0% Essentially no support
## 2 10532.11 74.83 0% Essentially no support
## 3 10457.27 0.00 95% Substantial support
## 4 10463.31 6.03 4.7% Considerably less support
## 5 10468.70 11.42 0.3% Essentially no support
## cuts
## NA
## 8.33
## 10.25, 30.24
## 9.43, 13.55, 31.29
## 9.52, 13, 22.54, 30.23
## 8.56, 13.51, 18.55, 23.71, 30.45
Why set a seed? The genetic algorithm and bootstrapping both use random processes. Supplying a seed (here via the built-in seed argument) ensures that the sequence of random numbers is the same every time you run the code, making your results perfectly reproducible.
Now let’s print a summary of the find_cutpoint_number() results. This displays a table of information criterion scores for each candidate model, from zero cut-points up to five. The model with the lowest score is considered the best fit for the data.
summary(number_result_bic)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 10878.79 421.52 0% Essentially no support
## 1 10687.53 230.25 0% Essentially no support
## 2 10532.11 74.83 0% Essentially no support
## 3 10457.27 0.00 95% Substantial support
## 4 10463.31 6.03 4.7% Considerably less support
## 5 10468.70 11.42 0.3% Essentially no support
## cuts
## NA
## 8.33
## 10.25, 30.24
## 9.43, 13.55, 31.29
## 9.52, 13, 22.54, 30.23
## 8.56, 13.51, 18.55, 23.71, 30.45
## Group N Events
## 1 G1 160 141
## 2 G2 160 145
## 3 G3 560 527
## 4 G4 160 102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 160 141 18 17 19
## group=G2 160 145 13 12 14
## group=G3 560 527 10 10 11
## group=G4 160 102 15 14 16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 1040, number of events= 915
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 1.3185 3.7378 0.1372 9.613 < 2e-16 ***
## groupG3 2.3464 10.4480 0.1306 17.965 < 2e-16 ***
## groupG4 0.8722 2.3922 0.1463 5.963 2.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 3.738 0.26754 2.857 4.891
## groupG3 10.448 0.09571 8.088 13.496
## groupG4 2.392 0.41802 1.796 3.187
##
## Concordance= 0.706 (se = 0.008 )
## Likelihood ratio test= 497.3 on 3 df, p=<2e-16
## Wald test = 393.6 on 3 df, p=<2e-16
## Score (logrank) test = 475.4 on 3 df, p=<2e-16
To better visualise the results from the summary table, we can generate a plot. This plot will show how the information criterion score changes with the number of cut-points, making it easy to identify the optimal number.
plot(number_result_bic)
Interpretation: The BIC analysis suggests that a model with 3 cut-points is optimal, as it has the lowest information criterion score (and carries about 95% of the BIC weight). This provides strong statistical justification for creating four distinct temperature groups (e.g., Cool, Sub-Optimal, Optimal, Warm).
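As an optional robustness check (a sketch reusing the arguments above; output not shown), the selection can be repeated with the small-sample-corrected criterion to see whether it points to the same number of cut-points:
# Re-run the model-selection step with AICc instead of BIC
number_result_aicc <- find_cutpoint_number(
  data = analysis_data,
  predictor = "temperature",
  outcome_time = "time",
  outcome_event = "germinated",
  method = "genetic",
  criterion = "AICc",
  max_cuts = 5,
  nmin = 0.1,
  maxiter = 500,
  seed = 42
)
summary(number_result_aicc)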
With the optimal number of cut-points now identified, our next step is to find their specific values. We use the find_cutpoint() function to discover the best temperature thresholds that separate our data into distinct groups. To do this, the algorithm needs a rule to follow, much like a car navigating a winding road needs a GPS. This rule is the optimisation criterion, which tells the search method what “optimal” means. For instance, a criterion of “logrank” is like a GPS programmed to find the steepest, most significant decline in a road, whereas a criterion of “hazard_ratio” is programmed to find the biggest drop in altitude between two points. This choice determines the path the algorithm takes to find the best cut-point.
The available options for the criterion argument are summarised below.
| Criterion | What It Optimizes | Recommendation | Rating (Accuracy & Performance) |
|---|---|---|---|
| "logrank" | The statistical significance of the separation between survival curves (maximises the chi-squared statistic). | The most common and standard method. Best when the primary goal is to prove a significant difference. | Accuracy: ★★★★★ Performance: ★★★★★ |
| "hazard_ratio" | The effect size (maximises the Hazard Ratio). | Best for clinical interpretability. Finds the cut-point that creates the largest practical difference between groups. | Accuracy: ★★★★★ Performance: ★★★★☆ |
| "p_value" | The p-value from a Cox model (minimises the p-value). | A powerful way to find the most significant split, but the p-value itself should be interpreted with caution due to multiple testing. | Accuracy: ★★★★☆ Performance: ★★★★☆ |
We will use the "genetic" method to find three cut-points, optimising for the "logrank" criterion to maximise the chi-squared statistic.
# We will find three cutpoints based on the result from Step 1
multi_cut_result <- find_cutpoint(
data = analysis_data,
predictor = "temperature",
outcome_time = "time",
outcome_event = "germinated",
method = "genetic", # can be switched to systematic
criterion = "logrank", # Optimising for effect size
num_cuts = 3,
nmin = 0.1,
maxiter = 500, # A higher number of generations for a more thorough search
seed = 123
)
Now that we have found the optimal cut-point values, let’s display a summary of the results. This summary will show the exact cut-point values, the p-value from a log-rank test for the split, and other key information from the analysis.
summary(multi_cut_result)
## Group N Events
## 1 G1 160 141
## 2 G2 160 145
## 3 G3 560 527
## 4 G4 160 102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 160 141 18 17 19
## group=G2 160 145 13 12 14
## group=G3 560 527 10 10 11
## group=G4 160 102 15 14 16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 1040, number of events= 915
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 1.3185 3.7378 0.1372 9.613 < 2e-16 ***
## groupG3 2.3464 10.4480 0.1306 17.965 < 2e-16 ***
## groupG4 0.8722 2.3922 0.1463 5.963 2.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 3.738 0.26754 2.857 4.891
## groupG3 10.448 0.09571 8.088 13.496
## groupG4 2.392 0.41802 1.796 3.187
##
## Concordance= 0.706 (se = 0.008 )
## Likelihood ratio test= 497.3 on 3 df, p=<2e-16
## Wald test = 393.6 on 3 df, p=<2e-16
## Score (logrank) test = 475.4 on 3 df, p=<2e-16
## chisq df p
## group 1.79 3 0.62
## GLOBAL 1.79 3 0.62
To see the cut-points visually and understand their placement along the predictor variable, we can generate a plot of the results. This plot helps to confirm the location of the cut-points and provides a visual representation of how the data is being divided.
plot(multi_cut_result)
The summary() method also accepts arguments that control which sections are printed. Calling it with all sections switched on shows the detailed Hazard Ratios and p-values for the groups created by our analysis.
# You can customise the output by toggling specific sections on or off
summary(
  multi_cut_result,
  show_medians = TRUE, # Set to FALSE to hide the median survival table
  show_ph_test = TRUE, # Set to FALSE to hide the proportional hazards test
  show_params = TRUE # Set to FALSE to hide the analysis parameters list
)
## Group N Events
## 1 G1 160 141
## 2 G2 160 145
## 3 G3 560 527
## 4 G4 160 102
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 160 141 18 17 19
## group=G2 160 145 13 12 14
## group=G3 560 527 10 10 11
## group=G4 160 102 15 14 16
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 1040, number of events= 915
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 1.3185 3.7378 0.1372 9.613 < 2e-16 ***
## groupG3 2.3464 10.4480 0.1306 17.965 < 2e-16 ***
## groupG4 0.8722 2.3922 0.1463 5.963 2.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 3.738 0.26754 2.857 4.891
## groupG3 10.448 0.09571 8.088 13.496
## groupG4 2.392 0.41802 1.796 3.187
##
## Concordance= 0.706 (se = 0.008 )
## Likelihood ratio test= 497.3 on 3 df, p=<2e-16
## Wald test = 393.6 on 3 df, p=<2e-16
## Score (logrank) test = 475.4 on 3 df, p=<2e-16
## chisq df p
## group 1.79 3 0.62
## GLOBAL 1.79 3 0.62
Interpretation: The summary shows the final Cox model. The
exp(coef)
column gives the Hazard Ratio (HR) for each group relative to the reference group (G1). A high HR (like for G3) indicates a much faster germination time (a higher “hazard” of germinating).
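As a quick arithmetic check, the Hazard Ratio is simply the exponentiated Cox coefficient; for example, the G3 coefficient reported above reproduces its exp(coef) value:
# The Hazard Ratio is exp(coef); for the groupG3 coefficient:
exp(2.3464)
# ≈ 10.45, matching the exp(coef) column above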
Finally, we run a bootstrap analysis with
validate_cutpoint()
to generate 95% confidence intervals
for our discovered cut-points. This tells us how stable and reliable our
thresholds are.
validation_result <- validate_cutpoint(
cutpoint_result = multi_cut_result,
num_replicates = 100, # A higher number (e.g., 500) is recommended for publication
use_parallel = TRUE,
seed = 456, # Use the built-in seed argument
maxiter = 100 # Use fewer generations for faster validation runs
)
## Lower Upper
## Cut 1 7.584 10.860
## Cut 2 12.618 16.209
## Cut 3 28.525 32.426
To see the detailed results from the validation, we can print a summary of the validate_cutpoint() object. This shows summary statistics of the bootstrap distribution for each cut-point alongside the 95% confidence intervals.
summary(validation_result)
## Mean Median SD IQR
## Cut 1 8.960 9.031 0.959 1.320
## Cut 2 14.039 13.963 0.875 1.172
## Cut 3 30.975 31.060 0.994 1.247
## Lower Upper
## Cut 1 7.584 10.860
## Cut 2 12.618 16.209
## Cut 3 28.525 32.426
Interpretation: The summary output from validate_cutpoint() provides a crucial stability check for our results. For each cut-point we found in the original analysis, it presents a 95% confidence interval based on the bootstrap resamples. For a cut-point to be considered stable and reliable, its confidence interval should be relatively narrow. A narrow interval suggests that if you were to repeat the experiment and re-run the analysis, the optimal cut-point would likely be found very close to the value you originally discovered. A very wide interval, on the other hand, indicates that the cut-point is highly sensitive to changes in the data and may not be a robust finding.
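To make "narrow" concrete, the interval widths can be computed directly; the sketch below simply transcribes the bootstrap CIs printed above (in practice, pull them from your own validation_result object):
# Widths of the bootstrap 95% confidence intervals (values transcribed from the summary above)
ci <- data.frame(
  cutpoint = c("Cut 1", "Cut 2", "Cut 3"),
  lower    = c(7.584, 12.618, 28.525),
  upper    = c(10.860, 16.209, 32.426)
)
ci$width_degC <- ci$upper - ci$lower
ci
Each interval spans roughly 3 to 4 °C on a predictor that ranges from 5 to 35 °C, which supports treating the cut-points as reasonably stable.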
A good way to visualise the stability of the cut-points is with a density plot. This plot shows the distribution of the cut-points found across all the bootstrap resamples. The narrower the peaks, and the more closely they centre on the original estimates, the more stable the cut-points are.
plot(validation_result)
Bootstrap distribution of the three optimal cut-points with 95% confidence intervals. The solid red line represents the cut-point found in the original analysis.
We will now create a series of plots to interpret our findings.
# Extract the cuts for plotting
optimal_cuts <- multi_cut_result$optimal_cuts
# Create a temperature group variable with four levels
analysis_data$Temp_Group <- cut(
analysis_data$temperature,
breaks = c(-Inf, optimal_cuts, Inf),
labels = c("Cool", "Sub-Optimal", "Optimal", "Warm")
)
This table explicitly shows which of the original experimental temperatures were assigned to each of the four new groups.
composition_table <- analysis_data %>%
group_by(Temp_Group) %>%
summarise(
Temperatures_in_Group = paste(sort(unique(temperature)), collapse = ", ")
)
knitr::kable(composition_table, caption = "Composition of Temperature Groups")
| Temp_Group | Temperatures_in_Group |
|---|---|
| Cool | 5, 7.5 |
| Sub-Optimal | 10, 12.5 |
| Optimal | 15, 17.5, 20, 22.5, 25, 27.5, 30 |
| Warm | 32.5, 35 |
This plot shows the time-to-germination for the four groups created by our optimal cut-points.
group_levels <- levels(analysis_data$Temp_Group)
num_groups <- length(group_levels)
palette <- c("#2E9FDF", "#4CAF50", "#FC4E07", "#E7B800")
km_fit_final <- survfit(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
km_plot_obj <- ggsurvplot(
km_fit_final, data = analysis_data, pval = TRUE, risk.table = TRUE,
title = "Germination by Optimal Temperature Strata",
xlab = "Time (Days)", ylab = "Proportion Ungerminated",
legend.title = "Temp Group",
legend.labs = group_levels,
palette = palette[1:num_groups]
)
km_plot_obj
Kaplan-Meier survival curve for germination by temperature group.
Kaplan-Meier Plot: The Kaplan-Meier plot visually confirms that our optimal cut-points successfully stratified the data into groups with significantly different germination rates. The curves for the “Optimal” and “Sub-Optimal” groups drop much more steeply than the others, indicating a faster germination time. In contrast, the “Cool” and “Warm” groups show a much flatter curve, meaning the seeds in those groups take much longer to germinate or do not germinate at all.
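The p-value printed on the Kaplan-Meier plot comes from a log-rank test; if you want the test object itself, survival::survdiff() can be run directly on the same grouping (a quick sketch):
# Log-rank test comparing the four temperature groups
survdiff(Surv(time, germinated) ~ Temp_Group, data = analysis_data)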
The forest plot visualises the Hazard Ratios (HR) for the “Cool”, “Sub-Optimal”, and “Warm” groups relative to the “Optimal” group.
analysis_data$Temp_Group <- relevel(analysis_data$Temp_Group, ref = "Optimal")
cox_model <- coxph(Surv(time, germinated) ~ Temp_Group, data = analysis_data)
forest_plot <- ggforest(cox_model, data = analysis_data, main = "Hazard Ratios Relative to Optimal Temperature")
forest_plot
Forest plot of Hazard Ratios for each temperature group relative to the optimal group.
Interpretation of the Forest Plot: This plot shows the "hazard" (in this case, the rate of germination) for each group compared to the "Optimal" group.

- The "Optimal" group is the reference, so its Hazard Ratio is 1.0.
- An HR less than 1 (like for the "Cool" group) means a lower rate of germination (slower germination time).
- The horizontal lines represent the 95% confidence interval. If an interval does not cross the vertical line at 1.0, the difference is statistically significant.
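If you also want the forest-plot numbers as a table, the Hazard Ratios and their 95% limits can be pulled from the fitted Cox model (a sketch using base survival and knitr, both already loaded):
# Hazard Ratios (exp(coef)) with 95% confidence limits, relative to the "Optimal" group
hr_table <- summary(cox_model)$conf.int
knitr::kable(hr_table, digits = 3, caption = "Hazard Ratios relative to the Optimal group")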
This plot shows the distribution of the optimal cut-points found
during the bootstrap validation. The plot()
method for the
validation result makes this easy.
# The plot method automatically creates a faceted density plot
bootstrap_plot <- plot(validation_result)
bootstrap_plot
Bootstrap distribution of the three optimal cut-points with 95% confidence intervals. The solid red line represents the cut-point found in the original analysis.
Interpretation: The density plots show where the optimal cut-points most frequently landed during the bootstrap resamples. The solid red line shows our original cut-point, while the dashed red lines show the 95% confidence interval. Tight, sharp peaks indicate a very stable and reliable result.
This diagnostic plot visualizes the cumulative hazard for each of the four temperature groups.
cumhaz_plot_obj <- ggsurvplot(
km_fit_final, data = analysis_data, fun = "cumhaz", pval = TRUE,
title = "Cumulative Hazard of Germination",
xlab = "Time (Days)", ylab = "Cumulative Hazard",
legend.title = "Temp Group",
legend.labs = group_levels,
palette = palette[1:num_groups]
)
cumhaz_plot_obj
Cumulative hazard plot for germination by temperature group, used to check the proportional hazards assumption.
Interpretation of the Cumulative Hazard Plot: This is a diagnostic plot used to visually check the proportional hazards assumption of the Cox model. For the model to be valid, the hazard ratios between the groups should be roughly constant over time. We check this by looking for lines that are approximately parallel and do not cross. In this case, the lines are well-separated and do not cross, suggesting the assumption is met.
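A formal complement to this visual check is the Schoenfeld residual test from survival::cox.zph(), applied to the Cox model fitted above; this mirrors the proportional hazards test already shown in the summary() output (a sketch):
# Test the proportional hazards assumption for the releveled Cox model
ph_check <- cox.zph(cox_model)
ph_check
plot(ph_check) # Schoenfeld residuals over time, one panel per model term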
Finally, we use the patchwork
package to combine our key
plots into a single, comprehensive figure.
# Extract the plot component from the ggsurvplot objects
km_plot <- km_plot_obj$plot
cumhaz_plot <- cumhaz_plot_obj$plot
# Combine the four plots into a 2x2 grid
publication_figure <- (km_plot + cumhaz_plot) / (forest_plot + bootstrap_plot) +
plot_annotation(tag_levels = 'A')
print(publication_figure)
A combined figure of the key analysis results, including the Kaplan-Meier curve, cumulative hazard plot, forest plot, and bootstrap distribution.
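If the combined figure is needed as a standalone file, ggplot2::ggsave() also works on patchwork objects (a sketch; the filename and dimensions are illustrative, not from the original analysis):
# Save the combined figure; adjust size and dpi to the target journal's requirements
ggsave("germination_figure.png", publication_figure, width = 12, height = 10, dpi = 300)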
For reproducibility, the session information below lists the R version and all attached packages used to run this analysis.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cli_3.6.5 tidyr_1.3.1 knitr_1.50 patchwork_1.3.1
## [5] dplyr_1.1.4 survminer_0.5.0 ggpubr_0.6.1 ggplot2_3.5.2
## [9] survival_3.8-3 OptSurvCutR_0.1.5
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0 rstatix_0.7.2
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 lifecycle_1.0.4 stringr_1.5.1
## [17] compiler_4.5.1 farver_2.1.2 ggsci_3.2.0 codetools_0.2-20
## [21] carData_3.0-5 litedown_0.7 htmltools_0.5.8.1 sass_0.4.10
## [25] yaml_2.3.10 Formula_1.2-5 pillar_1.11.0 car_3.1-3
## [29] jquerylib_0.1.4 cachem_1.1.0 iterators_1.0.14 rgenoud_5.9-0.11
## [33] abind_1.4-8 foreach_1.5.2 km.ci_0.5-6 commonmark_2.0.0
## [37] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 purrr_1.1.0
## [41] labeling_0.4.3 splines_4.5.1 cowplot_1.2.0 fastmap_1.2.0
## [45] grid_4.5.1 magrittr_2.0.3 broom_1.0.9 withr_3.0.2
## [49] scales_1.4.0 backports_1.5.0 rmarkdown_2.29 ggtext_0.1.2
## [53] gridExtra_2.3 ggsignif_0.6.4 zoo_1.8-14 evaluate_1.0.4
## [57] KMsurv_0.1-6 doParallel_1.0.17 markdown_2.0 survMisc_0.5.6
## [61] rlang_1.1.6 gridtext_0.1.5 Rcpp_1.1.0 xtable_1.8-4
## [65] glue_1.8.0 xml2_1.3.8 rstudioapi_0.17.1 jsonlite_2.0.0
## [69] R6_2.6.1