For data scientists, a common task is predicting whether a customer will cancel their subscription, a problem known as “churn”. The standard approach is to build a classification model that predicts a simple “Yes” or “No”. While useful, this does not tell the whole story. A classification model treats churn as a binary outcome, ignoring a crucial piece of information: time.
A more powerful approach is to ask a different question: instead of asking if a customer will churn, we can ask when. This is called a time-to-event (or survival) analysis. It allows us to model the entire customer journey and understand the rate at which customers churn over time, providing a much richer insight into customer behaviour.
This example will walk you through a complete, real-world analysis. We will use a public subscription dataset to uncover the specific “tipping points” in MonthlyCharges that best predict how long a customer is likely to stay subscribed.

To do this, we will use the OptSurvCutR package, which provides a complete workflow to:

1. Determine the statistically optimal number of cut-points.
2. Find the precise location of those cut-points.
3. Validate the stability of the results.

First, we need to install the development version of OptSurvCutR from GitHub and load the other R packages we will use for this analysis.
# Option 1: Install the development version from GitHub for the latest features
# You will need the 'remotes' package first: install.packages("remotes")
remotes::install_github("paytonyau/OptSurvCutR")
# Option 2: Install the stable version from CRAN (when available)
# install.packages("OptSurvCutR")
Now, let us load the libraries. The pacman package is a convenient helper that will automatically install any packages you do not have.
# Load the core packages for this analysis.
# If a package is not installed, pacman will install it for you.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(OptSurvCutR, dplyr, survival, survminer, knitr)
# URL for the raw dataset from an IBM repository
data_url <- "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
# Read the data into an R data frame
telco_data <- read.csv(data_url)
# The next steps use the pipe operator (%>%), which passes the data from one
# function to the next, making the code clean and readable.
telco_survival <- telco_data %>%
# The 'TotalCharges' column is read as text; we need to convert it to a number.
mutate(TotalCharges = as.numeric(TotalCharges)) %>%
# This conversion creates some missing values (NAs), which we need to remove.
filter(complete.cases(.)) %>%
# For survival analysis, we need a numeric "event" column (1 for churn, 0 otherwise).
mutate(churn_event = ifelse(Churn == "Yes", 1, 0)) %>%
# A customer tenure of 0 is not meaningful for this analysis, so we remove it.
filter(tenure > 0)
# Let us inspect the first few rows of our prepared data
head(telco_survival)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service DSL No Yes No
## 2 No DSL Yes No Yes
## 3 No DSL Yes Yes No
## 4 No phone service DSL Yes No Yes
## 5 No Fiber optic No No No
## 6 Yes Fiber optic No No Yes
## TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
## 1 No No No Month-to-month Yes
## 2 No No No One year No
## 3 No No No Month-to-month Yes
## 4 Yes No No One year No
## 5 No No No Month-to-month Yes
## 6 No Yes Yes Month-to-month Yes
## PaymentMethod MonthlyCharges TotalCharges Churn churn_event
## 1 Electronic check 29.85 29.85 No 0
## 2 Mailed check 56.95 1889.50 No 0
## 3 Mailed check 53.85 108.15 Yes 1
## 4 Bank transfer (automatic) 42.30 1840.75 No 0
## 5 Electronic check 70.70 151.65 Yes 1
## 6 Electronic check 99.65 820.50 Yes 1
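Before fitting any models, it is worth checking the balance of observed churn events against censored customers. A quick sanity check using the prepared data frame:
# Count censored (0) versus churned (1) customers
table(telco_survival$churn_event)
# Overall proportion of customers with an observed churn event
mean(telco_survival$churn_event)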
A standard survival analysis often begins by comparing the time-to-event between pre-defined groups. Let us start with a simple question: is there a difference in customer retention between senior citizens and non-senior citizens?
# The SeniorCitizen column is coded as 0 or 1. Let us make it a more descriptive factor.
telco_survival_senior <- telco_survival %>%
mutate(SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes")))
# To perform a survival analysis, we first need to create a "survival object".
# The Surv() function bundles our time column ('tenure') and our event column ('churn_event') together.
surv_obj_senior <- Surv(time = telco_survival_senior$tenure, event = telco_survival_senior$churn_event)
# Now, we can fit a Kaplan-Meier model. The formula `surv_obj_senior ~ SeniorCitizen`
# tells R to create a separate survival curve for each category in the SeniorCitizen column.
km_fit_senior <- survfit(surv_obj_senior ~ SeniorCitizen, data = telco_survival_senior)
# Finally, we can plot the result using the ggsurvplot function
ggsurvplot(
km_fit_senior,
data = telco_survival_senior,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE,
legend.title = "Senior Citizen",
legend.labs = c("No", "Yes"),
title = "Customer Retention by Senior Citizen Status"
)
How to Read This Plot: A Guide for Data Scientists
This type of visualisation is a Kaplan-Meier plot, the cornerstone of survival analysis. Here is how to interpret it:
- The Y-Axis (Survival probability): This shows the probability that a customer is still subscribed at a given point in time. It starts at 1.0 (100%), as all customers are subscribed at the beginning.
- The X-Axis (tenure): This is our time variable, representing the number of months a customer has been subscribed.
- The Stepped Lines: Each line represents a group of customers (here, “Senior Citizen: No” and “Senior Citizen: Yes”). The line goes down each time a customer in that group churns. A steeper drop means a faster churn rate.
- The Shaded Areas: These are the 95% confidence intervals for the survival curves. They give us a sense of the uncertainty in our estimates.
- The p-value: This comes from a log-rank test. It tests the null hypothesis that there is no difference in survival between the groups. A small p-value (like p < 0.0001 here) provides strong evidence that the difference between the curves is statistically significant.
- The Risk Table: This table at the bottom shows the number of customers still “at risk” of churning at different time points. It provides context for the curves above; for example, you can see how many customers are remaining in each group after 24, 48, and 72 months.
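Incidentally, the p-value shown on the plot can be reproduced directly with survdiff() from the survival package, which runs the same log-rank test:
# Log-rank test comparing senior and non-senior customers
survdiff(Surv(tenure, churn_event) ~ SeniorCitizen, data = telco_survival_senior)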
The more interesting business question concerns continuous predictors: at what MonthlyCharges value does a customer’s loyalty begin to change? A common but flawed approach is to split the data at the median (sketched below for contrast). A better method is to let the data itself tell us where the real tipping point is. We can use find_cutpoint() to find the single best threshold.
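For contrast, here is the naive median split as a minimal sketch using the objects defined above. It picks the threshold by data volume rather than by where the survival curves actually separate:
# The naive alternative: split MonthlyCharges at its median
median_charge <- median(telco_survival$MonthlyCharges)
telco_median_cut <- telco_survival %>%
  mutate(charge_group = ifelse(MonthlyCharges >= median_charge, "High", "Low"))
km_fit_median <- survfit(Surv(tenure, churn_event) ~ charge_group, data = telco_median_cut)
The optimal search below replaces this arbitrary threshold with one chosen to maximise the separation between the survival curves.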
# Find the single best cut-point for MonthlyCharges
one_cut_result <- find_cutpoint(
data = telco_survival,
predictor = "MonthlyCharges",
outcome_time = "tenure",
outcome_event = "churn_event",
num_cuts = 1,
method = "systematic" # The "systematic" method is fast for finding one cut
)
optimal_charge_one_cut <- one_cut_result$optimal_cuts
one_cut_result
# Visualise the result of the single cut-point
telco_one_cut <- telco_survival %>%
mutate(charge_group = ifelse(MonthlyCharges >= optimal_charge_one_cut, "High", "Low"))
km_fit_one_cut <- survfit(Surv(tenure, churn_event) ~ charge_group, data = telco_one_cut)
ggsurvplot(
km_fit_one_cut,
data = telco_one_cut,
pval = TRUE,
conf.int = TRUE,
risk.table = TRUE,
legend.title = "Monthly Charge",
title = "Customer Retention by a Single Optimal Cut-point"
)
Analysis: This analysis finds a significant split. However, it assumes a simple relationship where higher charges always lead to higher churn. What if the reality is more complex? A moderate charge might be acceptable, while very low (perhaps indicating low engagement) and very high charges could both lead to churn. A single cut-point cannot capture this.
To investigate more complex, non-linear relationships, we need to find multiple cut-points. The OptSurvCutR package provides a robust, three-step workflow for this.

First, we must determine how many cut-points are statistically justified. We use find_cutpoint_number() to compare models with different numbers of cuts. We will test up to 7 potential cut-points to thoroughly explore the possibilities.
# Determine the optimal number of cut-points for MonthlyCharges
number_result <- find_cutpoint_number(
data = telco_survival,
predictor = "MonthlyCharges",
outcome_time = "tenure",
outcome_event = "churn_event",
method = "genetic",
criterion = "BIC",
use_parallel = TRUE,
max_cuts = 7, # Let us test for up to 7 tipping points
nmin = 0.1,
maxiter = 500,
seed = 42
)
summary(number_result)
## num_cuts BIC Delta_BIC BIC_Weight Evidence
## 0 31251.45 450.83 0% Essentially no support
## 1 31065.05 264.44 0% Essentially no support
## 2 30964.22 163.60 0% Essentially no support
## 3 30922.54 121.93 0% Essentially no support
## 4 30899.22 98.60 0% Essentially no support
## 5 30800.61 0.00 100% Substantial support
## 6 31015.04 214.42 0% Essentially no support
## 7 30972.20 171.59 0% Essentially no support
## cuts
## NA
## 27.16
## 28.13, 102.56
## 27.16, 68.36, 102.5
## 28.25, 68.57, 76.71, 102.46
## 28.32, 50.95, 68.94, 80.31, 102.47
## 24.37, 44.59, 61.98, 73.8, 80.5, 102.04
## 20.48, 33.84, 53.59, 70.33, 80.21, 89.97, 102.47
## Group N Events
## 1 G1 1600 148
## 2 G2 798 247
## 3 G3 887 138
## 4 G4 1129 450
## 5 G5 1907 711
## 6 G6 711 175
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
##
## n events median 0.95LCL 0.95UCL
## group=G1 1600 148 NA NA NA
## group=G2 798 247 71 64 NA
## group=G3 887 138 NA NA NA
## group=G4 1129 450 65 51 NA
## group=G5 1907 711 NA 68 NA
## group=G6 711 175 NA NA NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
##
## n= 7032, number of events= 1869
##
## coef exp(coef) se(coef) z Pr(>|z|)
## groupG2 1.46981 4.34840 0.10409 14.120 < 2e-16 ***
## groupG3 0.43417 1.54369 0.11834 3.669 0.000244 ***
## groupG4 1.62294 5.06794 0.09479 17.121 < 2e-16 ***
## groupG5 1.25168 3.49620 0.09037 13.851 < 2e-16 ***
## groupG6 0.48710 1.62759 0.11207 4.346 1.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## groupG2 4.348 0.2300 3.546 5.333
## groupG3 1.544 0.6478 1.224 1.947
## groupG4 5.068 0.1973 4.209 6.103
## groupG5 3.496 0.2860 2.929 4.174
## groupG6 1.628 0.6144 1.307 2.027
##
## Concordance= 0.659 (se = 0.007 )
## Likelihood ratio test= 549.8 on 5 df, p=<2e-16
## Wald test = 472.7 on 5 df, p=<2e-16
## Score (logrank) test = 534.1 on 5 df, p=<2e-16
Analysis: The output table shows that the model with 5 cut-points has the lowest BIC score. This provides strong statistical evidence that the customer base can be meaningfully segmented into six distinct groups based on their monthly charge.
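For intuition, the BIC_Weight column follows the standard model-weight formula: each candidate's weight is exp(-0.5 * Delta_BIC), normalised over all candidates. A minimal sketch using the Delta_BIC values from the table above (illustrative only; the package's internals may differ):
# Delta_BIC for num_cuts = 0 to 7, taken from the summary table
delta_bic <- c(450.83, 264.44, 163.60, 121.93, 98.60, 0.00, 214.42, 171.59)
bic_weights <- exp(-0.5 * delta_bic) / sum(exp(-0.5 * delta_bic))
round(bic_weights, 4) # the 5-cut model carries essentially all the weight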
Now that we have statistical justification for using five cut-points, we use find_cutpoint() to find their precise locations.
# Find the locations of the 5 optimal cut-points
multi_cut_result <- find_cutpoint(
data = telco_survival,
predictor = "MonthlyCharges",
outcome_time = "tenure",
outcome_event = "churn_event",
method = "genetic",
criterion = "logrank",
use_parallel = TRUE,
num_cuts = 5,
nmin = 0.1,
maxiter = 500,
seed = 123
)
# Extract the optimal monthly charge thresholds
optimal_charges <- multi_cut_result$optimal_cuts
print(optimal_charges)
## [1] 27.13476 52.59850 68.82812 76.69175 102.49212
Let us visualise where these five tipping points fall on the distribution of monthly charges.
plot(multi_cut_result, type = "distribution", title = "Distribution of Monthly Charges with 5 Optimal Cut-points")
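If you prefer to build this view by hand, a base-graphics sketch of the same idea (assuming the optimal_charges vector extracted above):
# Histogram of MonthlyCharges with the five cut-points overlaid
hist(telco_survival$MonthlyCharges, breaks = 50,
     main = "Monthly Charges with 5 Optimal Cut-points",
     xlab = "Monthly Charges")
abline(v = optimal_charges, col = "red", lty = 2, lwd = 2)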
Finally, we assess the reliability of our discovered thresholds using validate_cutpoint(). This bootstrap analysis works by repeatedly resampling our data and re-calculating the cut-points to see how much they vary, which yields a 95% confidence interval for each cut-point.
# Validate the 5 discovered cut-points
validation_result <- validate_cutpoint(
cutpoint_result = multi_cut_result,
num_replicates = 200, # Use a higher number (e.g., 500) for publication
use_parallel = TRUE,
seed = 456,
maxiter = 100
)
## Lower Upper
## Cut 1 26.828 29.843
## Cut 2 49.681 55.368
## Cut 3 68.382 69.398
## Cut 4 75.294 81.631
## Cut 5 101.698 104.248
# Display the confidence intervals in a clean table
knitr::kable(
validation_result$confidence_intervals,
caption = "95% Confidence Intervals for Monthly Charge Cut-points"
)
|       |     Lower |     Upper |
|:------|----------:|----------:|
| Cut 1 |  26.82836 |  29.84281 |
| Cut 2 |  49.68078 |  55.36806 |
| Cut 3 |  68.38244 |  69.39761 |
| Cut 4 |  75.29439 |  81.63111 |
| Cut 5 | 101.69804 | 104.24845 |

Table: 95% Confidence Intervals for Monthly Charge Cut-points
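To make the bootstrap mechanics concrete, here is the percentile-interval idea applied to a simple statistic, the sample median of MonthlyCharges. validate_cutpoint() applies the same resample-and-re-estimate logic to the full cut-point search, so treat this as an illustrative sketch only:
# Percentile bootstrap: resample with replacement, re-estimate the statistic,
# then take the 2.5% and 97.5% quantiles of the bootstrap estimates
set.seed(456)
boot_medians <- replicate(200, median(sample(telco_survival$MonthlyCharges, replace = TRUE)))
quantile(boot_medians, probs = c(0.025, 0.975))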
With our five validated cut-points, we can now create our final six customer segments and visualise their distinct retention curves.
# Create descriptive labels for each price bracket
charge_labels <- c(
paste0("< $", round(optimal_charges[1])),
paste0("$", round(optimal_charges[1]), " - $", round(optimal_charges[2])),
paste0("$", round(optimal_charges[2]), " - $", round(optimal_charges[3])),
paste0("$", round(optimal_charges[3]), " - $", round(optimal_charges[4])),
paste0("$", round(optimal_charges[4]), " - $", round(optimal_charges[5])),
paste0("> $", round(optimal_charges[5]))
)
# Assign each customer to a segment using the validated cut-points
telco_categorised <- telco_survival %>%
mutate(charge_group = cut(
MonthlyCharges,
breaks = c(-Inf, optimal_charges, Inf),
labels = charge_labels
))
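Before plotting, it is worth confirming that each of the six segments contains enough customers and churn events to support a stable curve:
# Customers and observed churn events per segment
telco_categorised %>%
  group_by(charge_group) %>%
  summarise(n = n(), events = sum(churn_event))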
km_fit_final <- survfit(Surv(tenure, churn_event) ~ charge_group, data = telco_categorised)
# Plot the final result
ggsurvplot(
km_fit_final,
data = telco_categorised,
pval = TRUE,
pval.coord = c(10, 0.45),
conf.int = FALSE,
risk.table = TRUE,
legend.title = "Monthly Charge Group",
legend.labs = charge_labels,
title = "Customer Retention by Six Optimal Monthly Charge Segments",
ylim = c(0.4, 1),
break.x.by = 12 # one x-axis break per year of tenure
)
Business Interpretation: By following this complete workflow, we have moved beyond a simple classification model. The final Kaplan-Meier plot reveals a complex, non-linear relationship between price and customer loyalty.
- High-Risk Tiers: The customers in the $69-$77 and $77-$102 price brackets exhibit the highest rates of churn. Surprisingly, the $27-$53 group also falls into this high-risk category, suggesting that customers in this tier may be particularly sensitive to competitor pricing or perceive a low value-for-money proposition.
- Low-Risk Tiers: Interestingly, the customers in the lowest-paying tier (< $27) are the most loyal, showing the highest retention over the entire observation period. This contradicts the simple assumption that a lower price always equals a higher churn risk and suggests this entry-level plan is very effective at retaining its customer base.
- Moderate-Risk Tiers: The $53-$69 group represents a pocket of moderate loyalty. The > $102 group displays unique behaviour: these premium customers show very high loyalty for the first 36-48 months, after which their retention declines significantly, though it remains better than the high-risk tiers.

This data-driven approach, powered by OptSurvCutR, allows us to identify specific, actionable “tipping points” that can inform pricing strategies, marketing campaigns, and customer retention efforts. For example, a business might focus retention efforts on the high-risk tiers or investigate what features make the lower-cost plans so “sticky” for customers.
This article has demonstrated how survival analysis, combined with the OptSurvCutR package, can uncover deep insights into customer behaviour that a standard classification model might miss. By identifying multiple, data-driven “tipping points”, we can create sophisticated customer segments and develop more targeted business strategies.

We encourage you to try OptSurvCutR with your own data.
remotes::install_github("paytonyau/OptSurvCutR")
For reproducibility, the session information below lists the R version and all attached packages.
sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.50 survminer_0.5.1 ggpubr_0.6.1 ggplot2_3.5.2
## [5] survival_3.8-3 dplyr_1.1.4 OptSurvCutR_0.1.5 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.52 bslib_0.9.0 rstatix_0.7.2
## [5] lattice_0.22-7 vctrs_0.6.5 tools_4.5.1 generics_0.1.4
## [9] parallel_4.5.1 tibble_3.3.0 pkgconfig_2.0.3 Matrix_1.7-3
## [13] data.table_1.17.8 RColorBrewer_1.1-3 lifecycle_1.0.4 stringr_1.5.1
## [17] compiler_4.5.1 farver_2.1.2 codetools_0.2-20 carData_3.0-5
## [21] litedown_0.7 htmltools_0.5.8.1 sass_0.4.10 yaml_2.3.10
## [25] Formula_1.2-5 pillar_1.11.0 car_3.1-3 jquerylib_0.1.4
## [29] tidyr_1.3.1 cachem_1.1.0 iterators_1.0.14 rgenoud_5.9-0.11
## [33] abind_1.4-8 foreach_1.5.2 km.ci_0.5-6 commonmark_2.0.0
## [37] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.7 purrr_1.1.0
## [41] labeling_0.4.3 splines_4.5.1 fastmap_1.2.0 grid_4.5.1
## [45] cli_3.6.5 magrittr_2.0.3 broom_1.0.9 withr_3.0.2
## [49] scales_1.4.0 backports_1.5.0 rmarkdown_2.29 ggtext_0.1.2
## [53] gridExtra_2.3 ggsignif_0.6.4 zoo_1.8-14 evaluate_1.0.5
## [57] KMsurv_0.1-6 doParallel_1.0.17 markdown_2.0 survMisc_0.5.6
## [61] rlang_1.1.6 gridtext_0.1.5 Rcpp_1.1.0 xtable_1.8-4
## [65] glue_1.8.0 xml2_1.4.0 rstudioapi_0.17.1 jsonlite_2.0.0
## [69] R6_2.6.1