Introduction: A Better Way to Understand Customer Churn

As data scientists, a common task is to predict if a customer will cancel their subscription—a problem known as “churn”. The standard approach is to build a classification model that predicts a simple “Yes” or “No”. While useful, this does not tell the whole story. A classification model treats churn as a binary outcome, ignoring a crucial piece of information: time.

A more powerful approach is to ask a different question: instead of asking if a customer will churn, we can ask when. This is called a time-to-event (or survival) analysis. It allows us to model the entire customer journey and understand the rate at which customers churn over time, providing a much richer insight into customer behaviour.
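
To make the distinction concrete, here is a toy illustration (with invented values) of how survival data records both the time observed and whether the event has happened yet:

library(survival)

# Three hypothetical customers: months observed, and whether they churned (1)
# or are still subscribed (0). Surv() bundles the two into one outcome.
Surv(time = c(5, 34, 12), event = c(1, 0, 1))
# [1]  5  34+ 12    <- the "+" marks a customer who has not yet churned (censored)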

What You Will Learn in This Guide

This example will walk you through a complete, real-world analysis. We will use a public subscription dataset to uncover the specific “tipping points” in MonthlyCharges that best predict how long a customer is likely to stay subscribed.

To do this, we will use the OptSurvCutR package, which provides a complete workflow to:

  1. Determine the statistically optimal number of cut-points.
  2. Find the precise location of those cut-points.
  3. Validate the stability of the results.
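
In code, these three steps map onto three functions; the outline below is only a roadmap, and the full, runnable calls appear in Section 4.

# The three-step OptSurvCutR workflow in outline (full calls shown later):
# 1. find_cutpoint_number()  - how many cut-points are statistically justified?
# 2. find_cutpoint()         - where exactly are those cut-points?
# 3. validate_cutpoint()     - how stable are they under bootstrap resampling?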


1. Setup and Data Preparation

First, we install OptSurvCutR, either the development version from GitHub or the stable version from CRAN (when available), along with the other R packages we will use for this analysis.

# Install required packages from CRAN
# rgenoud is required for method = "genetic"
install.packages(c("remotes", "dplyr", "survival", "survminer", "knitr", "rgenoud"))

# Option 1: Install the development version from GitHub for the latest features
remotes::install_github("paytonyau/OptSurvCutR")

# Option 2: Install the stable version from CRAN (when available)
# install.packages("OptSurvCutR")

Now, let us load the libraries.

# Load the core packages for this analysis
library(OptSurvCutR)
library(dplyr)
library(survival)
library(survminer)
library(knitr)
library(rgenoud) # Explicitly load rgenoud

# URL for the raw dataset from an IBM repository
data_url <- "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"

# Read the data into an R data frame
telco_data <- read.csv(data_url)

# The next steps use the pipe operator (%>%), which passes the data from one
# function to the next, making the code clean and readable.
telco_survival <- telco_data %>%
  # The 'TotalCharges' column is read as text; we need to convert it to a number.
  mutate(TotalCharges = as.numeric(TotalCharges)) %>%
  # This conversion creates some missing values (NAs), which we need to remove.
  filter(complete.cases(.)) %>%
  # For survival analysis, we need a numeric "event" column.
  # We define the "event" as Churn == "Yes" (1), and "No Churn" as 0.
  mutate(churn_event = ifelse(Churn == "Yes", 1, 0)) %>%
  # A customer tenure of 0 is not meaningful for this analysis, so we remove it.
  filter(tenure > 0)

# Let us inspect the first few rows of our prepared data
head(telco_survival)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn churn_event
## 1          Electronic check          29.85        29.85    No           0
## 2              Mailed check          56.95      1889.50    No           0
## 3              Mailed check          53.85       108.15   Yes           1
## 4 Bank transfer (automatic)          42.30      1840.75    No           0
## 5          Electronic check          70.70       151.65   Yes           1
## 6          Electronic check          99.65       820.50   Yes           1
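
Before modelling, a quick sanity check on the prepared data is worthwhile (the exact counts will depend on the downloaded file):

# Quick sanity checks on the prepared data
nrow(telco_survival)               # customers remaining after filtering
table(telco_survival$churn_event)  # censored (0) vs churned (1)
summary(telco_survival$tenure)     # follow-up time in months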

2. A Simple Case: Comparing Pre-defined Groups

A standard survival analysis often begins by comparing the time-to-event between pre-defined groups. Let us start with a simple question: is there a difference in customer retention between senior citizens and non-senior citizens?

# The SeniorCitizen column is coded as 0 or 1. Let us make it a more descriptive factor.
telco_survival_senior <- telco_survival %>%
  mutate(SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes")))

# To perform a survival analysis, we first need to create a "survival object".
# The Surv() function bundles our time column ('tenure') and our event column ('churn_event') together.
surv_obj_senior <- Surv(time = telco_survival_senior$tenure, event = telco_survival_senior$churn_event)

# Now, we can fit a Kaplan-Meier model. The formula `surv_obj_senior ~ SeniorCitizen`
# tells R to create a separate survival curve for each category in the SeniorCitizen column.
km_fit_senior <- survfit(surv_obj_senior ~ SeniorCitizen, data = telco_survival_senior)

# Finally, we can plot the result using the ggsurvplot function
ggsurvplot(
  km_fit_senior,
  data = telco_survival_senior,
  pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  legend.title = "Senior Citizen",
  legend.labs = c("No", "Yes"),
  title = "Customer Retention by Senior Citizen Status"
)

How to Read This Plot: A Guide for Data Scientists

This type of visualisation is a Kaplan-Meier plot, the cornerstone of survival analysis. Here is how to interpret it:

  • The Y-Axis (Survival probability): This shows the probability that a customer is still subscribed at a given point in time. It starts at 1.0 (100%), as all customers are subscribed at the beginning.
  • The X-Axis (tenure): This is our time variable, representing the number of months a customer has been subscribed.
  • The Stepped Lines: Each line represents a group of customers (here, “Senior Citizen: No” and “Senior Citizen: Yes”). The line goes down each time a customer in that group churns. A steeper drop means a faster churn rate.
  • The Shaded Areas: These are the 95% confidence intervals for the survival curves. They give us a sense of the uncertainty in our estimates.
  • The p-value: This comes from a log-rank test. It tests the null hypothesis that there is no difference in survival between the groups. A small p-value (like p < 0.0001 here) provides strong evidence that the difference between the curves is statistically significant.
  • The Risk Table: This table at the bottom shows the number of customers still “at risk” of churning at different time points. It provides context for the curves above; for example, you can see how many customers remain in each group after 24, 48, and 72 months. (The snippet after this list shows how to reproduce the log-rank test and these at-risk counts directly.)
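
Both of these quantities can be computed directly from the objects we created above; this is a quick cross-check rather than a required step.

# The log-rank test behind the p-value shown on the plot
survdiff(surv_obj_senior ~ SeniorCitizen, data = telco_survival_senior)

# The "at risk" counts from the risk table, at selected tenures (in months)
summary(km_fit_senior, times = c(0, 24, 48, 72))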

3. The Challenge of Continuous Variables: Finding a Single Tipping Point

The more interesting business question is about continuous predictors. At what MonthlyCharges value does a customer’s loyalty begin to change? A common but flawed approach is to split the data at the median. A better method is to let the data itself tell us where the real tipping point is. We can use find_cutpoint to find the single best threshold.
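
For reference, the naive median split looks like this; it is a minimal baseline for comparison, not part of the OptSurvCutR workflow (survdiff() comes from the survival package we have already loaded).

# Baseline for comparison: split MonthlyCharges at its median
median_charge <- median(telco_survival$MonthlyCharges)
telco_median <- telco_survival %>%
  mutate(median_group = ifelse(MonthlyCharges > median_charge,
                               "Above median", "Below median"))

# Log-rank test for the median split
survdiff(Surv(tenure, churn_event) ~ median_group, data = telco_median)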

# Find the single best cut-point for MonthlyCharges
one_cut_result <- find_cutpoint(
  data = telco_survival,
  predictor = "MonthlyCharges",     # The continuous variable to test
  outcome_time = "tenure",          # The time-to-event column
  outcome_event = "churn_event",    # The event status column (0 or 1)
  num_cuts = 1,                     # We are looking for just one cut-point
  method = "systematic"             # The "systematic" method is fast for finding one cut
)

one_cut_result

For quick inspection, we can use the OptSurvCutR package’s built-in plot() methods, which are faster and more direct than building ggsurvplot charts by hand; we will return to a fully custom ggsurvplot for the final figure in Section 5.

# Plot the outcome directly from the result object
# 'type = "outcome"' automatically generates a Kaplan-Meier plot
# using the newly found cut-point to create two groups.
plot(one_cut_result, type = "outcome")

Analysis: The search finds a statistically significant split. However, it assumes a simple, monotonic relationship in which higher charges always lead to higher churn. What if the reality is more complex? A moderate charge might be acceptable, while very low (perhaps indicating low engagement) and very high charges could both lead to churn. A single cut-point cannot capture this.


4. Uncovering Complex Relationships: The Multi-Cut-point Workflow

To investigate more complex, non-linear relationships, we need to find multiple cut-points. The OptSurvCutR package provides a robust, three-step workflow for this.

Step 1: How Many Tipping Points Are There?

First, we must determine how many cut-points are statistically justified. We use find_cutpoint_number() to compare models with different numbers of cuts. We will test up to 7 potential cut-points to thoroughly explore the possibilities.

# Determine the optimal number of cut-points for MonthlyCharges
number_result <- find_cutpoint_number(
  data = telco_survival,
  predictor = "MonthlyCharges",
  outcome_time = "tenure",
  outcome_event = "churn_event",
  method = "genetic",     # Use "genetic" algorithm for a faster search
  criterion = "BIC",      # Use BIC (Bayesian Information Criterion) for model selection
  use_parallel = TRUE,    # Enable parallel processing to speed up the search
  n_cores = 2,            # Set to 2 cores; increase this based on your system
  max_cuts = 7,           # Test models with 0 up to 7 cut-points
  nmin = 0.1,             # Each group must contain at least 10% of the data
  maxiter = 500,          # Set iterations for the genetic algorithm
  seed = 42               # Set a seed for reproducible results
)
##  num_cuts      BIC Delta_BIC BIC_Weight    Evidence
##         0 31251.45    450.83         0%     Minimal
##         1 31065.05    264.44         0%     Minimal
##         2 30964.22    163.60         0%     Minimal
##         3 30922.54    121.93         0%     Minimal
##         4 30899.22     98.60         0%     Minimal
##         5 30800.61      0.00       100% Substantial
##         6 31015.04    214.42         0%     Minimal
##         7 30972.20    171.59         0%     Minimal
##                                              cuts
##                                                NA
##                                             27.16
##                                     28.13, 102.56
##                               27.16, 68.36, 102.5
##                       28.25, 68.57, 76.71, 102.46
##                28.32, 50.95, 68.94, 80.31, 102.47
##           24.37, 44.59, 61.98, 73.8, 80.5, 102.04
##  20.48, 33.84, 53.59, 70.33, 80.21, 89.97, 102.47

Analysis: The output table shows that the model with 5 cut-points has the lowest BIC score. This provides strong statistical evidence that the customer base can be meaningfully segmented into six distinct groups based on their monthly charge.
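
The BIC weights in the table are standard information-criterion weights computed from the Delta_BIC column; the arithmetic below reproduces them for illustration (this is plain R, not an OptSurvCutR call, and assumes the package uses the usual formula).

# Reproduce the BIC weights from the Delta_BIC values reported above
delta_bic <- c(450.83, 264.44, 163.60, 121.93, 98.60, 0.00, 214.42, 171.59)
bic_weight <- exp(-0.5 * delta_bic) / sum(exp(-0.5 * delta_bic))
round(bic_weight, 3)  # the 5-cut model carries essentially all of the weight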

Step 2: Where Are the Tipping Points Located?

Now that we have statistical justification for using five cut-points, we use find_cutpoint() to find their precise locations.

# Find the locations of the 5 optimal cut-points
multi_cut_result <- find_cutpoint(
  data = telco_survival,
  predictor = "MonthlyCharges",
  outcome_time = "tenure",
  outcome_event = "churn_event",
  method = "genetic",
  criterion = "logrank", # Use the log-rank test statistic to find the best separation
  num_cuts = 5,          # We are now searching for the 5 cuts identified in Step 1
  nmin = 0.1,            # Each of the 6 groups must have at least 10% of data
  maxiter = 500,
  seed = 123
)

# Extract the optimal monthly charge thresholds
optimal_charges <- multi_cut_result$optimal_cuts
print(optimal_charges)
## [1]  27.13476  52.59850  68.82812  76.69175 102.49212

Let us visualise where these five tipping points fall on the distribution of monthly charges.

# The plot method can also show the distribution of the predictor
# with the optimal cut-points overlaid as vertical lines.
plot(multi_cut_result, type = "distribution", title = "Distribution of Monthly Charges with 5 Optimal Cut-points")

Step 3: How Stable Are These Tipping Points?

Finally, we assess the reliability of our discovered thresholds using validate_cutpoint(). This bootstrap analysis works by repeatedly resampling our data and re-calculating the cut-points to see how much they vary. This generates 95% confidence intervals for each cut-point.
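
Conceptually, the procedure looks like the minimal sketch below, shown for a single cut-point with a handful of replicates so it runs quickly; validate_cutpoint() does this properly for all five cuts, with parallelisation and failure handling.

# A slow, minimal sketch of the bootstrap idea (single cut-point only)
set.seed(456)
boot_cuts <- replicate(20, {
  resampled <- telco_survival[sample(nrow(telco_survival), replace = TRUE), ]
  find_cutpoint(
    data = resampled,
    predictor = "MonthlyCharges",
    outcome_time = "tenure",
    outcome_event = "churn_event",
    num_cuts = 1,
    method = "systematic"
  )$optimal_cuts
})
quantile(boot_cuts, probs = c(0.025, 0.975))  # percentile-style 95% interval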

# Validate the 5 discovered cut-points
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result, # Pass in our result object from Step 2
  num_replicates = 100, # Use 100 bootstrap samples (>= 500 is recommended for publication)
  use_parallel = TRUE,  # Enable parallel processing
  n_cores = 2,          # Set to 2 cores
  seed = 456,           # Set a new seed for this new random process
  maxiter = 100         # Iterations for the algorithm *within* each bootstrap replicate
)
## Cut-point Stability Analysis (Bootstrap)
## ----------------------------------------
## Original Optimal Cut-point(s): 27.135, 52.598, 68.828, 76.692, 102.492 
## Successful Replicates: 100 / 100 ( 100 %)
## Failed Replicates: 0 
## 
## 95% Confidence Intervals
## ------------------------
##         Lower   Upper
## Cut 1  26.787  29.545
## Cut 2  49.497  55.364
## Cut 3  67.781  69.409
## Cut 4  75.328  81.126
## Cut 5 102.020 104.007
## 
## Bootstrap Summary Statistics
## ---------------------------
##       Cut    Mean    SD  Median      Q1      Q3
## 25%  Cut1  28.235 1.266  28.051  27.393  29.361
## 25%1 Cut2  51.512 1.380  51.082  50.866  51.845
## 25%2 Cut3  68.917 0.352  68.947  68.823  69.154
## 25%3 Cut4  78.570 2.115  79.935  76.655  80.342
## 25%4 Cut5 103.060 0.633 103.018 102.473 103.495
## 
## Hint: Use `summary()` for detailed statistics or `plot()` to visualize stability.
# Display the confidence intervals in a clean table
knitr::kable(
  validation_result$confidence_intervals,
  caption = "95% Confidence Intervals for Monthly Charge Cut-points"
)
95% Confidence Intervals for Monthly Charge Cut-points

            Lower       Upper
Cut 1    26.78692    29.54460
Cut 2    49.49735    55.36350
Cut 3    67.78119    69.40857
Cut 4    75.32751    81.12565
Cut 5   102.01963   104.00653

5. Final Visualisation and Interpretation

With our five validated cut-points, we can now create our final six customer segments and visualise their distinct retention curves. This shows how to use the outputs from OptSurvCutR to build a fully custom, publication-ready plot.

# Create more descriptive labels for the final plot dynamically
# We must use the HTML-safe codes &lt; (for <) and &gt; (for >)
# to prevent ggsurvplot from trying to render them as HTML tags.
charge_labels <- c(
  paste0(" &lt; $", round(optimal_charges[1])),
  paste0("$", round(optimal_charges[1]), " - $", round(optimal_charges[2])),
  paste0("$", round(optimal_charges[2]), " - $", round(optimal_charges[3])),
  paste0("$", round(optimal_charges[3]), " - $", round(optimal_charges[4])),
  paste0("$", round(optimal_charges[4]), " - $", round(optimal_charges[5])),
  paste0(" &gt; $", round(optimal_charges[5]))
)


# Create a new variable with the six groups
telco_categorised <- telco_survival %>%
  mutate(charge_group = cut(
    MonthlyCharges,
    breaks = c(-Inf, optimal_charges, Inf),
    labels = charge_labels
  ))

# Fit the final survival model
km_fit_final <- survfit(Surv(tenure, churn_event) ~ charge_group, 
                        data = telco_categorised)

# Plot the final result
ggsurvplot(
  km_fit_final,
  data = telco_categorised,
  pval = TRUE,
  pval.coord = c(10, 0.45),
  conf.int = FALSE,
  risk.table = TRUE,
  legend.title = "Monthly Charge Group",
  legend.labs = charge_labels,
  title = "Customer Retention by Six Optimal Monthly Charge Segments",
  ylim = c(0.4, 1),
  break.time.by = 12   # label the time axis every 12 months
)

Business Interpretation: By following this complete workflow, we have moved beyond a simple classification model. The final Kaplan-Meier plot reveals a complex, non-linear relationship between price and customer loyalty. The legend shows the six risk groups identified by the optimal cut-points.

  • High-Risk Tiers: The plot shows two groups in the middle-to-high price range (e.g., the groups $69 - $77 and $77 - $102) that exhibit the highest rates of churn (the steepest-dropping curves).

  • Low-Risk Tiers: Interestingly, the customers in the lowest-paying tier (e.g., < $27, the top-most curve) are the most loyal, showing the highest retention over the entire period. This contradicts the simple assumption that a lower price always equals a higher churn risk and suggests this entry-level plan is very effective at retaining its customer base.

  • Moderate-Risk Tiers: The other groups, including the very highest-paying customers (> $102), are clustered in the middle, displaying moderate-risk behaviour. The premium customers show high loyalty initially, after which their retention declines, though it remains better than the other high-risk tiers.

This data-driven approach, powered by OptSurvCutR, allows us to identify specific, actionable “tipping points” that can inform pricing strategies, marketing campaigns, and customer retention efforts. For example, a business might focus retention efforts on the high-risk tiers or investigate what features make the lower-cost plans so “sticky” for customers.
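
To put numbers behind these curves, one option (not shown in the plots above) is the restricted mean survival time, i.e. the average number of months a customer in each segment is retained over the 72-month observation window; the survival package reports this directly from the fitted model.

# Average months retained per charge group, restricted to a 72-month window
print(km_fit_final, rmean = 72)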


6. Conclusion & Next Steps

This article has demonstrated how survival analysis, combined with the OptSurvCutR package, can uncover deep insights into customer behaviour that a standard classification model might miss. By identifying multiple, data-driven “tipping points”, we can create sophisticated customer segments and develop more targeted business strategies.

We encourage you to try OptSurvCutR with your own data.

  • Install the package from GitHub: remotes::install_github("paytonyau/OptSurvCutR")
  • Report issues or suggest features on our GitHub page.
  • Star the repository if you find it useful.
  • Read the preprint: For a detailed methodological discussion, please see our paper on bioRxiv.

7. Session Information

For reproducibility, the session information below lists the R version and all attached packages.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rgenoud_5.9-0.11  knitr_1.50        survminer_0.5.1   ggpubr_0.6.2     
## [5] ggplot2_4.0.0     survival_3.8-3    dplyr_1.1.4       OptSurvCutR_0.1.7
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.53          bslib_0.9.0        rstatix_0.7.3     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 S7_0.2.0           lifecycle_1.0.4   
## [17] stringr_1.5.2      compiler_4.5.1     farver_2.1.2       ggsci_4.1.0       
## [21] codetools_0.2-20   carData_3.0-5      litedown_0.7       htmltools_0.5.8.1 
## [25] sass_0.4.10        yaml_2.3.10        Formula_1.2-5      pillar_1.11.1     
## [29] car_3.1-3          jquerylib_0.1.4    tidyr_1.3.1        cachem_1.1.0      
## [33] iterators_1.0.14   abind_1.4-8        foreach_1.5.2      km.ci_0.5-6       
## [37] commonmark_2.0.0   tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7     
## [41] purrr_1.1.0        labeling_0.4.3     splines_4.5.1      fastmap_1.2.0     
## [45] grid_4.5.1         cli_3.6.5          magrittr_2.0.4     broom_1.0.10      
## [49] withr_3.0.2        scales_1.4.0       backports_1.5.0    rmarkdown_2.30    
## [53] ggtext_0.1.2       gridExtra_2.3      ggsignif_0.6.4     zoo_1.8-14        
## [57] evaluate_1.0.5     KMsurv_0.1-6       doParallel_1.0.17  markdown_2.0      
## [61] survMisc_0.5.6     rlang_1.1.6        gridtext_0.1.5     Rcpp_1.1.0        
## [65] xtable_1.8-4       glue_1.8.0         xml2_1.4.0         rstudioapi_0.17.1 
## [69] jsonlite_2.0.0     R6_2.6.1