This tutorial demonstrates how to use the OptCutR package for survival analysis. Specifically, we will stratify patients into different prognostic groups based on a continuous biomarker.
This is a common and powerful technique in clinical research for identifying patient subgroups with different risk levels. For this example, we will analyse gut microbiome data to determine whether the relative abundance of the genus Parvimonas can predict survival outcomes in colorectal cancer (CRC) patients. This analysis is inspired by data published by Smyth et al. (2024).
This guide covers a complete workflow: deciding how many cut-points are statistically justified, locating and validating them, fitting a Cox proportional-hazards model, and visualising the results.
First, we load all the necessary R packages. The pacman package will automatically install any missing packages.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  OptCutR, readxl, dplyr, tidyr,
  survival, survminer, patchwork, ggplot2
)
We define the name of the microbe we wish to analyse upfront.
# Define the biomarker of interest.
microbe_name_to_analyze <- "Parvimonas"
In this section, we load the clinical and microbiome datasets, clean them, and merge them into a single data frame ready for analysis.
# Load the datasets from their respective files.
clinical_data <- read_excel("Colon clinical data.xlsx", .name_repair = "universal")
microbiome_data <- read.csv("Colon Microbiome data.csv", check.names = TRUE)
# --- Prepare clinical (survival) data ---
survival_data <- clinical_data %>%
  select(
    sample_id = Sample.ID,
    time = Overall.Survival..Months.,
    status_text = Overall.Survival.Status
  ) %>%
  mutate(
    status = ifelse(status_text == "1:DECEASED", 1, 0),
    time = as.numeric(time)
  ) %>%
  filter(!is.na(time) & !is.na(status))
# --- Process and prepare microbiome data ---
microbe_row <- microbiome_data %>%
  filter(grepl(microbe_name_to_analyze, NAME, ignore.case = TRUE)) %>%
  slice(1)

abundance_long <- microbe_row %>%
  select(starts_with("TCGA")) %>%
  pivot_longer(
    cols = everything(),
    names_to = "sample_id",
    values_to = "abundance"
  ) %>%
  mutate(sample_id = gsub("\\.", "-", sample_id))
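Before merging, it is worth checking that the cleaned sample IDs actually overlap between the two tables; a formatting mismatch (for example, a missed separator substitution) would otherwise produce a silently empty join. A minimal check:
# Count the sample IDs shared by the clinical and microbiome tables.
# Zero shared IDs would point to an ID formatting problem, not a real lack of overlap.
length(intersect(survival_data$sample_id, abundance_long$sample_id))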
# --- Merge the two datasets ---
analysis_data <- inner_join(survival_data, abundance_long, by = "sample_id")
head(analysis_data)
## # A tibble: 6 × 5
##   sample_id        time status_text status abundance
##   <chr>           <dbl> <chr>        <dbl>     <dbl>
## 1 TCGA-3L-AA1B-01 15.6  0:LIVING         0      7.51
## 2 TCGA-4N-A93T-01  4.80 0:LIVING         0      7.89
## 3 TCGA-4T-AA8H-01 12.7  0:LIVING         0      7.81
## 4 TCGA-5M-AAT4-01  1.61 1:DECEASED       1      9.16
## 5 TCGA-5M-AAT6-01  9.53 1:DECEASED       1      9.95
## 6 TCGA-5M-AATE-01 39.5  0:LIVING         0      9.71
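Before moving on, it is also useful to confirm the size of the merged cohort and how many events (deaths) it contains:
# Number of patients with both survival and abundance data, and the number of deaths.
nrow(analysis_data)
sum(analysis_data$status)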
Before searching for the cut-point locations, we should first determine how many cut-points are statistically justified.
# Set a seed for reproducible results from the genetic algorithm.
set.seed(42)

# This function tests models with 0, 1, and 2 cut-points and compares them using BIC.
number_result <- find_cutpoint_number(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic",
  max_cuts = 2,
  nmin = 0.1,
  numgen = 50
)
## num_cuts     BIC Delta_BIC BIC_Weight
##        0 1322.07      3.43      10.6%
##        1 1319.94      1.31      30.6%
##        2 1318.63      0.00      58.9%
Interpretation: The model with 2 cut-points has the lowest BIC and the highest BIC Weight. This provides strong statistical evidence that dividing the patients into three groups is the best choice for this dataset.
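For context, these weights can be recomputed from the Delta_BIC column alone. The sketch below assumes OptCutR reports standard Schwarz (BIC) weights, w_i = exp(-Delta_i / 2) / sum_j exp(-Delta_j / 2); check the package documentation for the exact definition.
# Recompute the BIC weights from Delta_BIC (assumes the standard exp(-Delta/2) weighting).
delta_bic <- c(3.43, 1.31, 0.00)
round(100 * exp(-delta_bic / 2) / sum(exp(-delta_bic / 2)), 1)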
Now that we have decided to use two cut-points, we use find_cutpoint() to identify their optimal values.
# Use a different seed for this distinct analysis step.
set.seed(123)

# This function uses a genetic algorithm to find the location of the two best cut-points.
multi_cut_result <- find_cutpoint(
  data = analysis_data,
  predictor = "abundance",
  outcome_time = "time",
  outcome_event = "status",
  method = "genetic",
  num_cuts = 2,
  nmin = 0.1,
  numgen = 50
)
optimal_cuts <- multi_cut_result$optimal_cuts
print(optimal_cuts)
## [1] 8.416784 8.701395
Interpretation: The genetic algorithm has identified two optimal thresholds. These are the specific abundance values that best separate the patients into three prognostic groups.
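As a quick sanity check, we can see how the patients would be split by these thresholds and confirm that each group respects the nmin constraint (here nmin = 0.1, which we take to mean at least 10% of patients per group; see the package documentation for the precise definition):
# Proportion of patients falling into each of the three groups implied by the cut-points.
prop.table(table(cut(
  analysis_data$abundance,
  breaks = c(-Inf, optimal_cuts, Inf),
  labels = c("Low", "Medium", "High")
)))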
Validation is a crucial step for assessing how reliable the discovered cut-points are. We use validate_cutpoint() to run a bootstrap analysis; narrow confidence intervals suggest that the cut-points are stable.
# Seed for reproducibility of the bootstrap process.
set.seed(456)

# This function re-runs the cut-point finding process on many bootstrap samples.
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result,
  num_replicates = 200,  # Use a higher number for a final analysis.
  use_parallel = TRUE    # Enable parallel processing.
)
##        Lower  Upper
## Cut 1  6.451  9.979
## Cut 2  7.311 11.814
Interpretation: This table shows the 95% confidence intervals for our two cut-points, giving us a sense of their precision and stability.
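If you want to relate these intervals to the raw bootstrap output, a simple percentile interval can be computed directly from the stored replicates. The sketch below assumes the replicate cut-points are returned in the bootstrap_distribution element (used again for plotting later) and that the reported intervals are percentile-based; consult the package documentation for the exact method.
# Manually compute percentile 95% intervals from the bootstrap replicates.
apply(validation_result$bootstrap_distribution, 2,
      quantile, probs = c(0.025, 0.975), na.rm = TRUE)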
First, we create a new column that categorises each patient into a “Low”, “Medium”, or “High” abundance group based on the optimal_cuts.
# Create a new column with the abundance group based on the optimal cut-points.
analysis_data_categorised <- analysis_data %>%
  mutate(abundance_group = cut(
    abundance,
    breaks = c(-Inf, optimal_cuts, Inf),
    labels = c("Low", "Medium", "High")
  )) %>%
  filter(!is.na(abundance_group))

# Set "Low" abundance as the reference group for the Cox model.
analysis_data_categorised$abundance_group <- relevel(
  analysis_data_categorised$abundance_group, ref = "Low"
)
# Fit the final Cox Proportional-Hazards model on the categorised data.
final_formula <- as.formula("Surv(time, status) ~ abundance_group")
print(final_formula)
## Surv(time, status) ~ abundance_group
cox_model <- coxph(final_formula, data = analysis_data_categorised)
print(cox_model)
## Call:
## coxph(formula = final_formula, data = analysis_data_categorised)
##
##                          coef exp(coef) se(coef)      z      p
## abundance_groupMedium -1.0593    0.3467   0.4301 -2.463 0.0138
## abundance_groupHigh    0.1061    1.1119   0.1889  0.561 0.5746
##
## Likelihood ratio test=10.25  on 2 df, p=0.005954
## n= 581, number of events= 119
Interpretation: Relative to the Low group, patients with Medium abundance have a significantly lower hazard of death (HR ≈ 0.35, p = 0.014), whereas the High group does not differ significantly from the Low group (HR ≈ 1.11, p = 0.57).
A Kaplan-Meier curve shows the probability of survival over time for the different patient groups.
# Fit the Kaplan-Meier model.
km_fit <- survfit(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)

# Create the plot using ggsurvplot.
km_plot <- ggsurvplot(
  km_fit, data = analysis_data_categorised, pval = TRUE, risk.table = TRUE,
  legend.title = microbe_name_to_analyze, legend.labs = c("Low", "Medium", "High"),
  palette = c("#2E9FDF", "#E7B800", "#FC4E07")  # Custom colour palette
)
print(km_plot)
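By default, the p-value shown on the plot (pval = TRUE) comes from a log-rank test comparing the survival curves of the three groups. The same test can be run directly with survdiff() from the survival package:
# Log-rank test across the three abundance groups (the test behind pval = TRUE).
survdiff(Surv(time, status) ~ abundance_group, data = analysis_data_categorised)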
A forest plot visualises the Hazard Ratios (HR) from our Cox model.
# ggforest creates a plot of the hazard ratios from the Cox model.
forest_plot <- ggforest(
  cox_model, data = analysis_data_categorised,
  main = "Hazard Ratios for Abundance Groups"
)
print(forest_plot)
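If you also need the hazard ratios and their confidence intervals as numbers (for example, for a results table), they can be taken straight from the fitted model rather than read off the plot:
# Hazard ratios with 95% confidence intervals from the Cox model.
summary(cox_model)$conf.int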
This plot shows the distribution of the cut-points found across all bootstrap samples.
# Prepare the bootstrap data for plotting with ggplot2.
bootstrap_df <- as.data.frame(validation_result$bootstrap_distribution)
names(bootstrap_df) <- c("Cut-point 1", "Cut-point 2")

bootstrap_long <- bootstrap_df %>%
  pivot_longer(cols = everything(), names_to = "Cutpoint", values_to = "Value")

# Create the density plot.
bootstrap_plot <- ggplot(bootstrap_long, aes(x = Value, fill = Cutpoint)) +
  geom_density(alpha = 0.7, show.legend = FALSE) +
  # Add vertical lines for the originally identified optimal cuts.
  geom_vline(
    data = data.frame(Cutpoint = c("Cut-point 1", "Cut-point 2"), val = optimal_cuts),
    aes(xintercept = val), colour = "red", linetype = "dashed", linewidth = 1
  ) +
  labs(title = "Bootstrap Distribution of Cut-points", x = "Abundance Value", y = "Density") +
  theme_minimal() +
  facet_wrap(~ Cutpoint, scales = "free")
print(bootstrap_plot)
Finally, we use the patchwork package to combine our primary plots into a single figure.
# We extract the plot component from the ggsurvplot object to combine it with other ggplots.
publication_figure <- (km_plot$plot) / (forest_plot + bootstrap_plot) +
  plot_annotation(tag_levels = "A")
print(publication_figure)
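The combined figure can then be exported at publication resolution with ggsave(), which accepts patchwork objects. The file name and dimensions below are illustrative placeholders:
# Save the combined figure; adjust the width, height, and dpi for your target journal.
ggsave("publication_figure.png", plot = publication_figure,
       width = 12, height = 10, dpi = 300)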
This box plot provides a clear visual confirmation of how the find_cutpoint() function has partitioned the patients.
dist_plot <- ggplot(analysis_data_categorised,
                    aes(x = abundance_group, y = abundance, fill = abundance_group)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = paste("Distribution of", microbe_name_to_analyze, "Abundance by Risk Group"),
    x = "Prognostic Group",
    y = "Relative Abundance"
  ) +
  scale_fill_manual(values = c("#2E9FDF", "#E7B800", "#FC4E07")) +
  theme_minimal()
print(dist_plot)
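For the numbers behind the box plot, a quick per-group summary can be computed with dplyr:
# Group size and median abundance for each prognostic group.
analysis_data_categorised %>%
  group_by(abundance_group) %>%
  summarise(n = n(), median_abundance = median(abundance))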
A key assumption of the Cox model is that the effect of a predictor is constant over time. We can check this using a Schoenfeld residuals plot. A non-significant trend supports the assumption.
# Test the proportional hazards assumption for our Cox model.
schoenfeld_residuals <- cox.zph(cox_model)
# Print the test results to the console. A non-significant p-value is good.
print(schoenfeld_residuals)
##                 chisq df    p
## abundance_group 0.153  2 0.93
## GLOBAL          0.153  2 0.93
# Visualise the residuals.
ggcoxzph(schoenfeld_residuals)
It is good practice to remove objects that are no longer needed to keep the R environment tidy.
# Remove all the intermediate and final objects created during the analysis.
rm(clinical_data, microbiome_data, survival_data, microbe_row, abundance_long,
   analysis_data, analysis_data_categorised, number_result, multi_cut_result,
   optimal_cuts, validation_result, final_formula, cox_model, km_fit, km_plot,
   forest_plot, bootstrap_df, bootstrap_long, bootstrap_plot, publication_figure,
   dist_plot, schoenfeld_residuals, microbe_name_to_analyze)
For reproducibility, the session information below lists the R version and all attached packages.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.3.1 survminer_0.5.0 ggpubr_0.6.1 ggplot2_3.5.2
## [5] survival_3.8-3 tidyr_1.3.1 dplyr_1.1.4 readxl_1.4.5
## [9] OptCutR_0.1.0 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0 rstatix_0.7.2
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.6 RColorBrewer_1.1-3 lifecycle_1.0.4 stringr_1.5.1
## [17] compiler_4.5.1 farver_2.1.2 codetools_0.2-20 carData_3.0-5
## [21] litedown_0.7 htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [25] Formula_1.2-5 pillar_1.11.0 car_3.1-3 jquerylib_0.1.4
## [29] cachem_1.1.0 iterators_1.0.14 rgenoud_5.9-0.11 abind_1.4-8
## [33] foreach_1.5.2 km.ci_0.5-6 commonmark_2.0.0 tidyselect_1.2.1
## [37] digest_0.6.37 stringi_1.8.7 purrr_1.0.4 labeling_0.4.3
## [41] splines_4.5.1 cowplot_1.2.0 fastmap_1.2.0 grid_4.5.1
## [45] cli_3.6.5 magrittr_2.0.3 utf8_1.2.6 broom_1.0.8
## [49] withr_3.0.2 scales_1.4.0 backports_1.5.0 rmarkdown_2.29
## [53] ggtext_0.1.2 gridExtra_2.3 ggsignif_0.6.4 cellranger_1.1.0
## [57] zoo_1.8-14 evaluate_1.0.4 knitr_1.50 KMsurv_0.1-6
## [61] doParallel_1.0.17 markdown_2.0 survMisc_0.5.6 rlang_1.1.6
## [65] Rcpp_1.1.0 gridtext_0.1.5 xtable_1.8-4 glue_1.8.0
## [69] xml2_1.3.8 rstudioapi_0.17.1 jsonlite_2.0.0 R6_2.6.1