As data scientists, we are often asked to predict whether a customer will cancel their subscription, a problem known as “churn” prediction. The standard approach is to build a classification model that predicts a simple “Yes” or “No”. While useful, this does not tell the whole story. A classification model treats churn as a binary outcome and ignores a crucial piece of information: time.
A more powerful approach is to ask a different question: instead of asking if a customer will churn, we can ask when. This is called a time-to-event (or survival) analysis. It allows us to model the entire customer journey and understand the rate at which customers churn over time, providing a much richer insight into customer behaviour.
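Crucially, survival analysis also handles customers who have not churned yet: their records are “censored” rather than discarded. Each customer contributes a (time, event) pair, which the survival package bundles with Surv(). A toy illustration with hypothetical values:
# Three customers: churned at month 5, churned at month 12,
# still subscribed at month 20 (right-censored, printed with a "+")
library(survival)
Surv(time = c(5, 12, 20), event = c(1, 1, 0))
## [1]  5  12  20+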
This example will walk you through a complete, real-world analysis.
We will use a public subscription dataset to uncover the specific
“tipping points” in MonthlyCharges that best predict how
long a customer is likely to stay subscribed.
To do this, we will use the OptSurvCutR package, which provides a complete workflow to:

1. Determine the statistically optimal number of cut-points.
2. Find the precise locations of those cut-points.
3. Validate the stability of the results.
First, we need to install the development version of
OptSurvCutR from GitHub and load the other R packages we
will use for this analysis.
# Install required packages from CRAN
# rgenoud is required for method = "genetic"
install.packages(c("remotes", "dplyr", "survival", "survminer", "knitr", "rgenoud"))
# Option 1: Install the development version from GitHub for the latest features
remotes::install_github("paytonyau/OptSurvCutR")
# Option 2: Install the stable version from CRAN (when available)
# install.packages("OptSurvCutR")
Now, let us load the libraries.
# Load the core packages for this analysis
library(OptSurvCutR)
library(dplyr)
library(survival)
library(survminer)
library(knitr)
library(rgenoud) # Explicitly load rgenoud
# URL for the raw dataset from an IBM repository
data_url <- "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
# Read the data into an R data frame
telco_data <- read.csv(data_url)
# The next steps use the pipe operator (%>%), which passes the data from one
# function to the next, making the code clean and readable.
telco_survival <- telco_data %>%
# The 'TotalCharges' column is read as text; we need to convert it to a number.
mutate(TotalCharges = as.numeric(TotalCharges)) %>%
# This conversion creates some missing values (NAs), which we need to remove.
filter(complete.cases(.)) %>%
# For survival analysis, we need a numeric "event" column.
# We define the "event" as Churn == "Yes" (1), and "No Churn" as 0.
mutate(churn_event = ifelse(Churn == "Yes", 1, 0)) %>%
# A customer tenure of 0 months is not meaningful for this analysis, so we remove those rows.
filter(tenure > 0)
# Let us inspect the first few rows of our prepared data
head(telco_survival)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn churn_event
## 1 Electronic check 29.85 29.85 No 0
## 2 Mailed check 56.95 1889.50 No 0
## 3 Mailed check 53.85 108.15 Yes 1
## 4 Bank transfer (automatic) 42.30 1840.75 No 0
## 5 Electronic check 70.70 151.65 Yes 1
## 6 Electronic check 99.65 820.50 Yes 1
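Before modelling, two quick sanity checks on the prepared data are worthwhile (plain base R):
# Overall event rate and observed follow-up window
mean(telco_survival$churn_event)  # proportion of customers who churned
range(telco_survival$tenure)      # tenure in months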
A standard survival analysis often begins by comparing the time-to-event between pre-defined groups. Let us start with a simple question: is there a difference in customer retention between senior citizens and non-senior citizens?
# The SeniorCitizen column is coded as 0 or 1. Let us make it a more descriptive factor.
telco_survival_senior <- telco_survival %>%
mutate(SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes")))
# To perform a survival analysis, we first need to create a "survival object".
# The Surv() function bundles our time column ('tenure') and our event column ('churn_event') together.
surv_obj_senior <- Surv(time = telco_survival_senior$tenure, event = telco_survival_senior$churn_event)
# Now, we can fit a Kaplan-Meier model. The formula `surv_obj_senior ~ SeniorCitizen`
# tells R to create a separate survival curve for each category in the SeniorCitizen column.
km_fit_senior <- survfit(surv_obj_senior ~ SeniorCitizen, data = telco_survival_senior)
# Finally, we can plot the result using the ggsurvplot function
ggsurvplot(
km_fit_senior,
data = telco_survival_senior,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE,
legend.title = "Senior Citizen",
legend.labs = c("No", "Yes"),
title = "Customer Retention by Senior Citizen Status"
)
How to Read This Plot: A Guide for Data Scientists
This type of visualisation is a Kaplan-Meier plot, the cornerstone of survival analysis. Here is how to interpret it:
- The Y-Axis (Survival probability): This shows the probability that a customer is still subscribed at a given point in time. It starts at 1.0 (100%), as all customers are subscribed at the beginning.
- The X-Axis (tenure): This is our time variable, representing the number of months a customer has been subscribed.
- The Stepped Lines: Each line represents a group of customers (here, “Senior Citizen: No” and “Senior Citizen: Yes”). The line goes down each time a customer in that group churns. A steeper drop means a faster churn rate.
- The Shaded Areas: These are the 95% confidence intervals for the survival curves. They give us a sense of the uncertainty in our estimates.
- The p-value: This comes from a log-rank test. It tests the null hypothesis that there is no difference in survival between the groups. A small p-value (like p < 0.0001 here) provides strong evidence that the difference between the curves is statistically significant.
- The Risk Table: This table at the bottom shows the number of customers still “at risk” of churning at different time points. It provides context for the curves above; for example, you can see how many customers are remaining in each group after 24, 48, and 72 months.
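If you want the log-rank test by itself, outside the plot, survival::survdiff() computes the same test directly using the objects defined above:
# Log-rank test for a difference in survival between the two groups
survdiff(Surv(tenure, churn_event) ~ SeniorCitizen,
         data = telco_survival_senior)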
The more interesting business question is about continuous
predictors. At what MonthlyCharges value does a customer’s
loyalty begin to change? A common but flawed approach is to split the
data at the median. A better method is to let the data itself tell us
where the real tipping point is. We can use find_cutpoint
to find the single best threshold.
# Find the single best cut-point for MonthlyCharges
one_cut_result <- find_cutpoint(
data = telco_survival,
predictor = "MonthlyCharges", # The continuous variable to test
outcome_time = "tenure", # The time-to-event column
outcome_event = "churn_event", # The event status column (0 or 1)
num_cuts = 1, # We are looking for just one cut-point
method = "systematic" # The "systematic" method is fast for finding one cut
)
one_cut_result
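The printed summary reports the chosen threshold, which is also stored on the returned object in the optimal_cuts element (the same element we will use later for the multi-cut model):
# Extract the threshold and inspect the two-group split it implies
single_cut <- one_cut_result$optimal_cuts
table(telco_survival$MonthlyCharges <= single_cut)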
From this point on, we will use the OptSurvCutR
package’s built-in plot() methods, which are much faster
and more direct than building ggsurvplot charts
manually.
# Plot the outcome directly from the result object
# 'type = "outcome"' automatically generates a Kaplan-Meier plot
# using the newly found cut-point to create two groups.
plot(one_cut_result, type = "outcome")
Analysis: This search finds a significant split. However, it assumes a simple relationship where higher charges always lead to higher churn. What if the reality is more complex? A moderate charge might be acceptable, while very low (perhaps indicating low engagement) and very high charges could both lead to churn. A single cut-point cannot capture this.
To investigate more complex, non-linear relationships, we need to
find multiple cut-points. The OptSurvCutR package provides
a robust, three-step workflow for this.
First, we must determine how many cut-points are statistically
justified. We use find_cutpoint_number() to compare models
with different numbers of cuts. We will test up to 7 potential
cut-points to thoroughly explore the possibilities.
# Determine the optimal number of cut-points for MonthlyCharges
number_result <- find_cutpoint_number(
data = telco_survival,
predictor = "MonthlyCharges",
outcome_time = "tenure",
outcome_event = "churn_event",
method = "genetic", # Use "genetic" algorithm for a faster search
criterion = "BIC", # Use BIC (Bayesian Information Criterion) for model selection
use_parallel = TRUE, # Enable parallel processing to speed up the search
n_cores = 2, # Set to 2 cores; increase this based on your system
max_cuts = 7, # Test models with 0 up to 7 cut-points
nmin = 0.1, # Each group must contain at least 10% of the data
maxiter = 500, # Set iterations for the genetic algorithm
seed = 42 # Set a seed for reproducible results
)
## num_cuts      BIC Delta_BIC BIC_Weight    Evidence cuts
##        0 31251.45    450.83         0%     Minimal NA
##        1 31065.05    264.44         0%     Minimal 27.16
##        2 30964.22    163.60         0%     Minimal 28.13, 102.56
##        3 30922.54    121.93         0%     Minimal 27.16, 68.36, 102.5
##        4 30899.22     98.60         0%     Minimal 28.25, 68.57, 76.71, 102.46
##        5 30800.61      0.00       100% Substantial 28.32, 50.95, 68.94, 80.31, 102.47
##        6 31015.04    214.42         0%     Minimal 24.37, 44.59, 61.98, 73.8, 80.5, 102.04
##        7 30972.20    171.59         0%     Minimal 20.48, 33.84, 53.59, 70.33, 80.21, 89.97, 102.47
Analysis: The output table shows that the model with 5 cut-points has the lowest BIC score. This provides strong statistical evidence that the customer base can be meaningfully segmented into six distinct groups based on their monthly charge.
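For readers curious how the BIC_Weight column arises: assuming OptSurvCutR follows the standard BIC-weight formula, it can be reproduced by hand from the BIC column above:
# Reproduce the BIC weights: delta_i = BIC_i - min(BIC),
# w_i = exp(-delta_i / 2) / sum(exp(-delta_j / 2))
bic <- c(31251.45, 31065.05, 30964.22, 30922.54,
         30899.22, 30800.61, 31015.04, 30972.20)
delta <- bic - min(bic)
round(exp(-delta / 2) / sum(exp(-delta / 2)), 4)
# The 5-cut model receives essentially all of the weight, matching the table.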
Now that we have statistical justification for using five cut-points,
we use find_cutpoint() to find their precise locations.
# Find the locations of the 5 optimal cut-points
multi_cut_result <- find_cutpoint(
data = telco_survival,
predictor = "MonthlyCharges",
outcome_time = "tenure",
outcome_event = "churn_event",
method = "genetic",
criterion = "logrank", # Use the log-rank test statistic to find the best separation
num_cuts = 5, # We are now searching for the 5 cuts identified in Step 1
nmin = 0.1, # Each of the 6 groups must have at least 10% of data
maxiter = 500,
seed = 123
)
# Extract the optimal monthly charge thresholds
optimal_charges <- multi_cut_result$optimal_cuts
print(optimal_charges)
## [1] 27.13476 52.59850 68.82812 76.69175 102.49212
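As a quick sanity check (plain base R, not a package feature), we can confirm that the six groups implied by these thresholds each respect the nmin = 0.1 constraint:
# Share of customers in each of the six segments; every share should be >= 0.10
groups <- cut(telco_survival$MonthlyCharges,
              breaks = c(-Inf, optimal_charges, Inf))
round(prop.table(table(groups)), 3)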
Let us visualise where these five tipping points fall on the distribution of monthly charges.
# The plot method can also show the distribution of the predictor
# with the optimal cut-points overlaid as vertical lines.
plot(multi_cut_result, type = "distribution", title = "Distribution of Monthly Charges with 5 Optimal Cut-points")
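If you prefer base R, an equivalent view is a histogram with the cut-points overlaid (a sketch, not a package feature):
# Histogram of MonthlyCharges with the 5 optimal cut-points marked
hist(telco_survival$MonthlyCharges, breaks = 40,
     main = "Distribution of Monthly Charges",
     xlab = "MonthlyCharges")
abline(v = optimal_charges, lty = 2, col = "red")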
Finally, we assess the reliability of our discovered thresholds using
validate_cutpoint(). This bootstrap
analysis works by repeatedly resampling our data and
re-calculating the cut-points to see how much they vary. This generates
95% confidence intervals for each cut-point.
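Conceptually, the procedure looks like the sketch below; this is a simplified, single-cut illustration of the percentile bootstrap, whereas validate_cutpoint() handles all five cuts, failed replicates, and parallelism for you:
# Hand-rolled percentile bootstrap for ONE cut-point (illustration only;
# this re-runs the search 25 times, so it can take a little while)
set.seed(456)
boot_cuts <- replicate(25, {
  idx <- sample(nrow(telco_survival), replace = TRUE)
  refit <- find_cutpoint(
    data = telco_survival[idx, ],
    predictor = "MonthlyCharges",
    outcome_time = "tenure",
    outcome_event = "churn_event",
    num_cuts = 1,
    method = "systematic"
  )
  refit$optimal_cuts
})
quantile(boot_cuts, c(0.025, 0.975))  # approximate 95% confidence interval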
# Validate the 5 discovered cut-points
validation_result <- validate_cutpoint(
cutpoint_result = multi_cut_result, # Pass in our result object from Step 2
num_replicates = 100, # Use 100 bootstrap samples (>= 500 is recommended for publication)
use_parallel = TRUE, # Enable parallel processing
n_cores = 2, # Set to 2 cores
seed = 456, # Set a new seed for this new random process
maxiter = 100 # Iterations for the algorithm *within* each bootstrap replicate
)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 27.135, 52.598, 68.828, 76.692, 102.492
## Successful Replicates: 100 / 100 ( 100 %)
## Failed Replicates: 0
##
## 95% Confidence Intervals
## ------------------------
## Lower Upper
## Cut 1 26.787 29.545
## Cut 2 49.497 55.364
## Cut 3 67.781 69.409
## Cut 4 75.328 81.126
## Cut 5 102.020 104.007
##
## Bootstrap Summary Statistics
## ---------------------------
## Cut Mean SD Median Q1 Q3
## 25% Cut1 28.235 1.266 28.051 27.393 29.361
## 25%1 Cut2 51.512 1.380 51.082 50.866 51.845
## 25%2 Cut3 68.917 0.352 68.947 68.823 69.154
## 25%3 Cut4 78.570 2.115 79.935 76.655 80.342
## 25%4 Cut5 103.060 0.633 103.018 102.473 103.495
##
## Hint: Use `summary()` for detailed statistics or `plot()` to visualize stability.
# Display the confidence intervals in a clean table
knitr::kable(
validation_result$confidence_intervals,
caption = "95% Confidence Intervals for Monthly Charge Cut-points"
)
|       |     Lower |     Upper |
|:------|----------:|----------:|
| Cut 1 |  26.78692 |  29.54460 |
| Cut 2 |  49.49735 |  55.36350 |
| Cut 3 |  67.78119 |  69.40857 |
| Cut 4 |  75.32751 |  81.12565 |
| Cut 5 | 102.01963 | 104.00653 |
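As the printed hint suggests, the validation object also has summary() and plot() methods for a closer look at stability:
# Detailed bootstrap statistics and a stability visualisation
summary(validation_result)
plot(validation_result)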
With our five validated cut-points, we can now create our final six
customer segments and visualise their distinct retention curves. This
shows how to use the outputs from OptSurvCutR to build a
fully custom, publication-ready plot.
# Create more descriptive labels for the final plot dynamically
# The spaces around "<" and ">" stop markdown-aware plot themes (such as
# ggtext) from trying to parse the labels as HTML tags.
charge_labels <- c(
paste0(" < $", round(optimal_charges[1])),
paste0("$", round(optimal_charges[1]), " - $", round(optimal_charges[2])),
paste0("$", round(optimal_charges[2]), " - $", round(optimal_charges[3])),
paste0("$", round(optimal_charges[3]), " - $", round(optimal_charges[4])),
paste0("$", round(optimal_charges[4]), " - $", round(optimal_charges[5])),
paste0(" > $", round(optimal_charges[5]))
)
# Create a new variable with the six groups
telco_categorised <- telco_survival %>%
mutate(charge_group = cut(
MonthlyCharges,
breaks = c(-Inf, optimal_charges, Inf),
labels = charge_labels
))
# Fit the final survival model
km_fit_final <- survfit(Surv(tenure, churn_event) ~ charge_group,
data = telco_categorised)
# Plot the final result
ggsurvplot(
km_fit_final,
data = telco_categorised,
pval = TRUE,
pval.coord = c(10, 0.45),
conf.int = FALSE,
risk.table = TRUE,
legend.title = "Monthly Charge Group",
legend.labs = charge_labels,
title = "Customer Retention by Six Optimal Monthly Charge Segments",
ylim = c(0.4, 1),
break.time.by = 12 # tick marks every 12 months on the time axis
)
Business Interpretation: By following this complete workflow, we have moved beyond a simple classification model. The final Kaplan-Meier plot reveals a complex, non-linear relationship between price and customer loyalty. The legend shows the six risk groups identified by the optimal cut-points.
- High-Risk Tiers: The plot shows two groups in the middle-to-high price range (the $69 - $77 and $77 - $102 groups) that exhibit the highest rates of churn (the steepest-dropping curves).
- Low-Risk Tiers: Interestingly, the customers in the lowest-paying tier (< $27, the top-most curve) are the most loyal, showing the highest retention over the entire period. This contradicts the simple assumption that a lower price always equals a higher churn risk and suggests this entry-level plan is very effective at retaining its customer base.
- Moderate-Risk Tiers: The other groups, including the very highest-paying customers (> $102), are clustered in the middle, displaying moderate-risk behaviour. The premium customers show high loyalty initially, after which their retention declines, though it remains better than in the high-risk tiers.

This data-driven approach, powered by OptSurvCutR, allows us to identify specific, actionable “tipping points” that can inform pricing strategies, marketing campaigns, and customer retention efforts. For example, a business might focus retention efforts on the high-risk tiers or investigate what features make the lower-cost plans so “sticky” for customers.
This article has demonstrated how survival analysis, combined with
the OptSurvCutR package, can uncover deep insights into
customer behaviour that a standard classification model might miss. By
identifying multiple, data-driven “tipping points”, we can create
sophisticated customer segments and develop more targeted business
strategies.
We encourage you to try OptSurvCutR with your own
data.
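As a starting skeleton (my_data, marker, time_col, and event_col below are placeholders for your own data frame and columns):
# Skeleton workflow for your own dataset (placeholder names throughout)
number <- find_cutpoint_number(
  data = my_data, predictor = "marker",
  outcome_time = "time_col", outcome_event = "event_col",
  method = "genetic", criterion = "BIC",
  max_cuts = 5, seed = 1
)
cuts <- find_cutpoint(
  data = my_data, predictor = "marker",
  outcome_time = "time_col", outcome_event = "event_col",
  method = "genetic", criterion = "logrank",
  num_cuts = 2,  # replace with the number selected above
  seed = 1
)
validation <- validate_cutpoint(
  cutpoint_result = cuts,
  num_replicates = 500, seed = 1
)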
remotes::install_github("paytonyau/OptSurvCutR")

For reproducibility, the session information below lists the R version and all attached packages.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rgenoud_5.9-0.11 knitr_1.50 survminer_0.5.1 ggpubr_0.6.2
## [5] ggplot2_4.0.0 survival_3.8-3 dplyr_1.1.4 OptSurvCutR_0.1.7
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.53 bslib_0.9.0 rstatix_0.7.3
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 S7_0.2.0 lifecycle_1.0.4
## [17] stringr_1.5.2 compiler_4.5.1 farver_2.1.2 ggsci_4.1.0
## [21] codetools_0.2-20 carData_3.0-5 litedown_0.7 htmltools_0.5.8.1
## [25] sass_0.4.10 yaml_2.3.10 Formula_1.2-5 pillar_1.11.1
## [29] car_3.1-3 jquerylib_0.1.4 tidyr_1.3.1 cachem_1.1.0
## [33] iterators_1.0.14 abind_1.4-8 foreach_1.5.2 km.ci_0.5-6
## [37] commonmark_2.0.0 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7
## [41] purrr_1.1.0 labeling_0.4.3 splines_4.5.1 fastmap_1.2.0
## [45] grid_4.5.1 cli_3.6.5 magrittr_2.0.4 broom_1.0.10
## [49] withr_3.0.2 scales_1.4.0 backports_1.5.0 rmarkdown_2.30
## [53] ggtext_0.1.2 gridExtra_2.3 ggsignif_0.6.4 zoo_1.8-14
## [57] evaluate_1.0.5 KMsurv_0.1-6 doParallel_1.0.17 markdown_2.0
## [61] survMisc_0.5.6 rlang_1.1.6 gridtext_0.1.5 Rcpp_1.1.0
## [65] xtable_1.8-4 glue_1.8.0 xml2_1.4.0 rstudioapi_0.17.1
## [69] jsonlite_2.0.0 R6_2.6.1