Introduction: A Better Way to Understand Customer Churn

As data scientists, a common task is to predict if a customer will cancel their subscription—a problem known as “churn”. The standard approach is to build a classification model that predicts a simple “Yes” or “No”. While useful, this does not tell the whole story. A classification model treats churn as a binary outcome, ignoring a crucial piece of information: time.

A more powerful approach is to ask a different question: instead of asking if a customer will churn, we can ask when. This is called a time-to-event (or survival) analysis. It allows us to model the entire customer journey and understand the rate at which customers churn over time, providing a much richer insight into customer behaviour.
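
Before diving in, a minimal sketch (with made-up numbers, not the dataset used below) shows how survival analysis encodes this extra information. The Surv() function from the survival package stores both the outcome and its timing, marking still-subscribed customers as "censored":

# Toy illustration: a survival object records *when* each outcome happened and
# whether it was observed (churn) or censored (customer still subscribed).
library(survival)

toy <- data.frame(
  tenure  = c(2, 40, 12),  # months each customer has been observed
  churned = c(1, 0, 1)     # 1 = churned, 0 = still subscribed (censored)
)
Surv(time = toy$tenure, event = toy$churned)
# Prints:  [1]  2  40+ 12  -- the "+" marks the censored customer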

What You Will Learn in This Guide

This example will walk you through a complete, real-world analysis. We will use a public subscription dataset to uncover the specific “tipping points” in MonthlyCharges that best predict how long a customer is likely to stay subscribed.

To do this, we will use the OptSurvCutR package, which provides a complete workflow to:

  1. Determine the statistically optimal number of cut-points.
  2. Find the precise location of those cut-points.
  3. Validate the stability of the results.


1. Setup and Data Preparation

First, we need to install the development version of OptSurvCutR from GitHub and load the other R packages we will use for this analysis.

# Option 1: Install the development version from GitHub for the latest features
# You will need the 'remotes' package first: install.packages("remotes")
remotes::install_github("paytonyau/OptSurvCutR")

# Option 2: Install the stable version from CRAN (when available)
# install.packages("OptSurvCutR")

Now, let us load the libraries. The pacman package is a convenient helper that will automatically install any packages you do not have.

# Load the core packages for this analysis.
# If a package is not installed, pacman will install it for you.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(OptSurvCutR, dplyr, survival, survminer, knitr)

# URL for the raw dataset from an IBM repository
data_url <- "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"

# Read the data into an R data frame
telco_data <- read.csv(data_url)

# The next steps use the pipe operator (%>%), which passes the data from one
# function to the next, making the code clean and readable.
telco_survival <- telco_data %>%
  # The 'TotalCharges' column is read as text; we need to convert it to a number.
  mutate(TotalCharges = as.numeric(TotalCharges)) %>%
  # This conversion creates some missing values (NAs), which we need to remove.
  filter(complete.cases(.)) %>%
  # For survival analysis, we need a numeric "event" column (1 for churn, 0 otherwise).
  mutate(churn_event = ifelse(Churn == "Yes", 1, 0)) %>%
  # A customer tenure of 0 is not meaningful for this analysis, so we remove it.
  filter(tenure > 0)

# Let us inspect the first few rows of our prepared data
head(telco_survival)
##   customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female             0     Yes         No      1           No
## 2 5575-GNVDE   Male             0      No         No     34          Yes
## 3 3668-QPYBK   Male             0      No         No      2          Yes
## 4 7795-CFOCW   Male             0      No         No     45           No
## 5 9237-HQITU Female             0      No         No      2          Yes
## 6 9305-CDSKC Female             0      No         No      8          Yes
##      MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
## 1 No phone service             DSL             No          Yes               No
## 2               No             DSL            Yes           No              Yes
## 3               No             DSL            Yes          Yes               No
## 4 No phone service             DSL            Yes           No              Yes
## 5               No     Fiber optic             No           No               No
## 6              Yes     Fiber optic             No           No              Yes
##   TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No          No              No Month-to-month              Yes
## 2          No          No              No       One year               No
## 3          No          No              No Month-to-month              Yes
## 4         Yes          No              No       One year               No
## 5          No          No              No Month-to-month              Yes
## 6          No         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges Churn churn_event
## 1          Electronic check          29.85        29.85    No           0
## 2              Mailed check          56.95      1889.50    No           0
## 3              Mailed check          53.85       108.15   Yes           1
## 4 Bank transfer (automatic)          42.30      1840.75    No           0
## 5          Electronic check          70.70       151.65   Yes           1
## 6          Electronic check          99.65       820.50   Yes           1
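
As an optional sanity check before modelling, we can confirm the event coding and see how many rows the cleaning steps removed (the exact count depends on the dataset version fetched):

# Every "Yes" in Churn should map to churn_event = 1, and vice versa
table(telco_survival$Churn, telco_survival$churn_event)

# Number of rows dropped by the NA and tenure filters
nrow(telco_data) - nrow(telco_survival)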

2. A Simple Case: Comparing Pre-defined Groups

A standard survival analysis often begins by comparing the time-to-event between pre-defined groups. Let us start with a simple question: is there a difference in customer retention between senior citizens and non-senior citizens?

# The SeniorCitizen column is coded as 0 or 1. Let us make it a more descriptive factor.
telco_survival_senior <- telco_survival %>%
  mutate(SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes")))

# To perform a survival analysis, we first need to create a "survival object".
# The Surv() function bundles our time column ('tenure') and our event column ('churn_event') together.
surv_obj_senior <- Surv(time = telco_survival_senior$tenure, event = telco_survival_senior$churn_event)

# Now, we can fit a Kaplan-Meier model. The formula `surv_obj_senior ~ SeniorCitizen`
# tells R to create a separate survival curve for each category in the SeniorCitizen column.
km_fit_senior <- survfit(surv_obj_senior ~ SeniorCitizen, data = telco_survival_senior)

# Finally, we can plot the result using the ggsurvplot function
ggsurvplot(
  km_fit_senior,
  data = telco_survival_senior,
  pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  legend.title = "Senior Citizen",
  legend.labs = c("No", "Yes"),
  title = "Customer Retention by Senior Citizen Status"
)

How to Read This Plot: A Guide for Data Scientists

This type of visualisation is a Kaplan-Meier plot, the cornerstone of survival analysis. Here is how to interpret it:

  • The Y-Axis (Survival probability): This shows the probability that a customer is still subscribed at a given point in time. It starts at 1.0 (100%), as all customers are subscribed at the beginning.
  • The X-Axis (tenure): This is our time variable, representing the number of months a customer has been subscribed.
  • The Stepped Lines: Each line represents a group of customers (here, “Senior Citizen: No” and “Senior Citizen: Yes”). The line goes down each time a customer in that group churns. A steeper drop means a faster churn rate.
  • The Shaded Areas: These are the 95% confidence intervals for the survival curves. They give us a sense of the uncertainty in our estimates.
  • The p-value: This comes from a log-rank test. It tests the null hypothesis that there is no difference in survival between the groups. A small p-value (like p < 0.0001 here) provides strong evidence that the difference between the curves is statistically significant. (You can also run this test directly; see the sketch just after this list.)
  • The Risk Table: This table at the bottom shows the number of customers still “at risk” of churning at different time points. It provides context for the curves above; for example, you can see how many customers are remaining in each group after 24, 48, and 72 months.
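
The log-rank test that produces the plot's p-value can also be run directly with survival::survdiff(). Here is a minimal sketch reusing the data frame prepared above:

# Run the log-rank test behind the plot's p-value directly
survdiff(Surv(tenure, churn_event) ~ SeniorCitizen, data = telco_survival_senior)
# Prints observed vs expected events per group and the chi-square p-value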

3. The Challenge of Continuous Variables: Finding a Single Tipping Point

The more interesting business question is about continuous predictors. At what MonthlyCharges value does a customer’s loyalty begin to change? A common but flawed approach is to split the data at the median. A better method is to let the data itself tell us where the real tipping point is. We can use find_cutpoint() to find the single best threshold.

# Find the single best cut-point for MonthlyCharges
one_cut_result <- find_cutpoint(
  data = telco_survival,
  predictor = "MonthlyCharges",
  outcome_time = "tenure",
  outcome_event = "churn_event",
  num_cuts = 1,
  method = "systematic" # The "systematic" method is fast for finding one cut
)
optimal_charge_one_cut <- one_cut_result$optimal_cuts
one_cut_result
# Visualise the result of the single cut-point
telco_one_cut <- telco_survival %>%
  mutate(charge_group = ifelse(MonthlyCharges >= optimal_charge_one_cut, "High", "Low"))

km_fit_one_cut <- survfit(Surv(tenure, churn_event) ~ charge_group, data = telco_one_cut)

ggsurvplot(
  km_fit_one_cut,
  data = telco_one_cut,
  pval = TRUE,
  conf.int = TRUE,
  risk.table = TRUE,
  legend.title = "Monthly Charge",
  title = "Customer Retention by a Single Optimal Cut-point"
)

Analysis: This analysis finds a significant split. However, it assumes a simple relationship where higher charges always lead to higher churn. What if the reality is more complex? A moderate charge might be acceptable, while very low (perhaps indicating low engagement) and very high charges could both lead to churn. A single cut-point cannot capture this.
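
As an aside, we can quantify how much weaker the naive median split mentioned earlier is. The sketch below (reusing objects defined above) applies the same log-rank test to a median-based grouping; comparing its chi-square statistic with the optimal cut-point's shows how much separation the median split sacrifices:

# Baseline for comparison: split MonthlyCharges at its median instead of the
# data-driven cut-point, then run the same log-rank test.
median_cut <- median(telco_survival$MonthlyCharges)

telco_median_cut <- telco_survival %>%
  mutate(charge_group = ifelse(MonthlyCharges >= median_cut, "High", "Low"))

survdiff(Surv(tenure, churn_event) ~ charge_group, data = telco_median_cut)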


4. Uncovering Complex Relationships: The Multi-Cut-point Workflow

To investigate more complex, non-linear relationships, we need to find multiple cut-points. The OptSurvCutR package provides a robust, three-step workflow for this.

Step 1: How Many Tipping Points Are There?

First, we must determine how many cut-points are statistically justified. We use find_cutpoint_number() to compare models with different numbers of cuts. We will test up to 7 potential cut-points to thoroughly explore the possibilities.

# Determine the optimal number of cut-points for MonthlyCharges
number_result <- find_cutpoint_number(
  data = telco_survival,
  predictor = "MonthlyCharges",
  outcome_time = "tenure",
  outcome_event = "churn_event",
  method = "genetic",
  criterion = "BIC",
  use_parallel = TRUE,
  max_cuts = 7, # Let us test for up to 7 tipping points
  nmin = 0.1,
  maxiter = 500,
  seed = 42
)
summary(number_result)
##  num_cuts      BIC Delta_BIC BIC_Weight               Evidence
##         0 31251.45    450.83         0% Essentially no support
##         1 31065.05    264.44         0% Essentially no support
##         2 30964.22    163.60         0% Essentially no support
##         3 30922.54    121.93         0% Essentially no support
##         4 30899.22     98.60         0% Essentially no support
##         5 30800.61      0.00       100%    Substantial support
##         6 31015.04    214.42         0% Essentially no support
##         7 30972.20    171.59         0% Essentially no support
##                                              cuts
##                                                NA
##                                             27.16
##                                     28.13, 102.56
##                               27.16, 68.36, 102.5
##                       28.25, 68.57, 76.71, 102.46
##                28.32, 50.95, 68.94, 80.31, 102.47
##           24.37, 44.59, 61.98, 73.8, 80.5, 102.04
##  20.48, 33.84, 53.59, 70.33, 80.21, 89.97, 102.47
##   Group    N Events
## 1    G1 1600    148
## 2    G2  798    247
## 3    G3  887    138
## 4    G4 1129    450
## 5    G5 1907    711
## 6    G6  711    175
## Call: survfit(formula = survival::Surv(time, event) ~ group, data = data)
## 
##             n events median 0.95LCL 0.95UCL
## group=G1 1600    148     NA      NA      NA
## group=G2  798    247     71      64      NA
## group=G3  887    138     NA      NA      NA
## group=G4 1129    450     65      51      NA
## group=G5 1907    711     NA      68      NA
## group=G6  711    175     NA      NA      NA
## Call:
## survival::coxph(formula = as.formula(formula_str), data = data)
## 
##   n= 7032, number of events= 1869 
## 
##            coef exp(coef) se(coef)      z Pr(>|z|)    
## groupG2 1.46981   4.34840  0.10409 14.120  < 2e-16 ***
## groupG3 0.43417   1.54369  0.11834  3.669 0.000244 ***
## groupG4 1.62294   5.06794  0.09479 17.121  < 2e-16 ***
## groupG5 1.25168   3.49620  0.09037 13.851  < 2e-16 ***
## groupG6 0.48710   1.62759  0.11207  4.346 1.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##         exp(coef) exp(-coef) lower .95 upper .95
## groupG2     4.348     0.2300     3.546     5.333
## groupG3     1.544     0.6478     1.224     1.947
## groupG4     5.068     0.1973     4.209     6.103
## groupG5     3.496     0.2860     2.929     4.174
## groupG6     1.628     0.6144     1.307     2.027
## 
## Concordance= 0.659  (se = 0.007 )
## Likelihood ratio test= 549.8  on 5 df,   p=<2e-16
## Wald test            = 472.7  on 5 df,   p=<2e-16
## Score (logrank) test = 534.1  on 5 df,   p=<2e-16

Analysis: The output table shows that the model with 5 cut-points has the lowest BIC score. This provides strong statistical evidence that the customer base can be meaningfully segmented into six distinct groups based on their monthly charge.
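
As a side note, the BIC_Weight column can be reproduced with the standard information-criterion weight formula, w_i = exp(-0.5 * Delta_i) / sum(exp(-0.5 * Delta_j)). A minimal sketch, computed from the Delta_BIC values printed above:

# Reproduce the BIC_Weight column from the Delta_BIC values in the table
delta_bic <- c(450.83, 264.44, 163.60, 121.93, 98.60, 0.00, 214.42, 171.59)
bic_weight <- exp(-0.5 * delta_bic) / sum(exp(-0.5 * delta_bic))
round(bic_weight * 100, 2)  # the 5-cut model (delta = 0) carries ~100% of the weight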

Step 2: Where Are the Tipping Points Located?

Now that we have statistical justification for using five cut-points, we use find_cutpoint() to find their precise locations.

# Find the locations of the 5 optimal cut-points
multi_cut_result <- find_cutpoint(
  data = telco_survival,
  predictor = "MonthlyCharges",
  outcome_time = "tenure",
  outcome_event = "churn_event",
  method = "genetic",
  criterion = "logrank",
  use_parallel = TRUE,
  num_cuts = 5,
  nmin = 0.1,
  maxiter = 500,
  seed = 123
)

# Extract the optimal monthly charge thresholds
optimal_charges <- multi_cut_result$optimal_cuts
print(optimal_charges)
## [1]  27.13476  52.59850  68.82812  76.69175 102.49212

Let us visualise where these five tipping points fall on the distribution of monthly charges.

plot(multi_cut_result, type = "distribution", title = "Distribution of Monthly Charges with 5 Optimal Cut-points")
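
If you prefer base graphics, a quick alternative (a hedged sketch, not a package function) overlays the same cut-points on a histogram of the predictor:

# Base-R alternative: histogram of MonthlyCharges with the cut-points overlaid
hist(telco_survival$MonthlyCharges, breaks = 50,
     main = "Monthly Charges with 5 Optimal Cut-points",
     xlab = "Monthly Charges")
abline(v = optimal_charges, col = "red", lwd = 2, lty = 2)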

Step 3: How Stable Are These Tipping Points?

Finally, we assess the reliability of our discovered thresholds using validate_cutpoint(). This bootstrap analysis works by repeatedly resampling our data and re-calculating the cut-points to see how much they vary. This generates 95% confidence intervals for each cut-point.
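
To build intuition for the resample-and-recompute idea before running the real validation, here is a simplified, self-contained sketch (using the sample median as a stand-in statistic; this is not the package's internal code):

# Simplified bootstrap illustration: resample rows with replacement, recompute
# a statistic each time, and take percentile bounds as the 95% interval.
set.seed(456)
boot_stats <- replicate(200, {
  resampled <- telco_survival[sample(nrow(telco_survival), replace = TRUE), ]
  median(resampled$MonthlyCharges)  # stand-in for re-estimating a cut-point
})
quantile(boot_stats, probs = c(0.025, 0.975))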

# Validate the 5 discovered cut-points
validation_result <- validate_cutpoint(
  cutpoint_result = multi_cut_result,
  num_replicates = 200, # Use a higher number (e.g., 500) for publication
  use_parallel = TRUE,
  seed = 456,
  maxiter = 100
)
##         Lower   Upper
## Cut 1  26.828  29.843
## Cut 2  49.681  55.368
## Cut 3  68.382  69.398
## Cut 4  75.294  81.631
## Cut 5 101.698 104.248
# Display the confidence intervals in a clean table
knitr::kable(
  validation_result$confidence_intervals,
  caption = "95% Confidence Intervals for Monthly Charge Cut-points"
)
95% Confidence Intervals for Monthly Charge Cut-points

            Lower      Upper
Cut 1    26.82836   29.84281
Cut 2    49.68078   55.36806
Cut 3    68.38244   69.39761
Cut 4    75.29439   81.63111
Cut 5   101.69804  104.24845
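
The width of each interval is a quick gauge of stability. Since the kable() call above accesses validation_result$confidence_intervals, we can reuse that object (assuming, as the printed output suggests, it has Lower and Upper columns):

# Narrower intervals indicate more stable cut-points (Cut 3 is the tightest here)
ci <- as.data.frame(validation_result$confidence_intervals)
ci$Width <- ci$Upper - ci$Lower
round(ci, 2)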

5. Final Visualisation and Interpretation

With our five validated cut-points, we can now create our final six customer segments and visualise their distinct retention curves.

# Create descriptive labels for the six charge segments
charge_labels <- c(
  paste0("< $", round(optimal_charges[1])),
  paste0("$", round(optimal_charges[1]), " - $", round(optimal_charges[2])),
  paste0("$", round(optimal_charges[2]), " - $", round(optimal_charges[3])),
  paste0("$", round(optimal_charges[3]), " - $", round(optimal_charges[4])),
  paste0("$", round(optimal_charges[4]), " - $", round(optimal_charges[5])),
  paste0("> $", round(optimal_charges[5]))
)

# Assign each customer to one of the six segments defined by the cut-points
telco_categorised <- telco_survival %>%
  mutate(charge_group = cut(
    MonthlyCharges,
    breaks = c(-Inf, optimal_charges, Inf),
    labels = charge_labels
  ))

km_fit_final <- survfit(Surv(tenure, churn_event) ~ charge_group, data = telco_categorised)
# Plot the final result
ggsurvplot(
  km_fit_final,
  data = telco_categorised,
  pval = TRUE,
  pval.coord = c(10, 0.45),
  conf.int = FALSE,
  risk.table = TRUE,
  legend.title = "Monthly Charge Group",
  legend.labs = charge_labels,
  title = "Customer Retention by Six Optimal Monthly Charge Segments",
  ylim = c(0.4, 1),
  break.time.by = 12  # place x-axis tick marks every 12 months
)
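
A quick check of the segment sizes confirms that each group is large enough to support the comparison (counts will be close to, but not identical to, the Group/N table from summary(number_result), because the two runs used different criteria and seeds):

# Number of customers in each of the six final segments
table(telco_categorised$charge_group)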

Business Interpretation: By following this complete workflow, we have moved beyond a simple classification model. The final Kaplan-Meier plot reveals a complex, non-linear relationship between price and customer loyalty.

  • High-Risk Tiers: The customers in the $69-$77 and $77-$102 price brackets exhibit the highest rates of churn. Surprisingly, the $27-$53 group also falls into this high-risk category, suggesting that customers in this tier may be particularly sensitive to competitor pricing or perceive a low value-for-money proposition.

  • Low-Risk Tiers: Interestingly, the customers in the lowest-paying tier (< $27) are the most loyal, showing the highest retention over the entire 72-month observation period. This contradicts the simple assumption that a lower price always equals a higher churn risk and suggests this entry-level plan is very effective at retaining its customer base.

  • Moderate-Risk Tiers: The $53-$69 group represents a pocket of moderate loyalty. The > $102 group displays unique behaviour: these premium customers show very high loyalty for the first 36-48 months, after which their retention declines significantly, though it remains better than the other high-risk tiers.

This data-driven approach, powered by OptSurvCutR, allows us to identify specific, actionable “tipping points” that can inform pricing strategies, marketing campaigns, and customer retention efforts. For example, a business might focus retention efforts on the high-risk tiers or investigate what features make the lower-cost plans so “sticky” for customers.


6. Conclusion & Next Steps

This article has demonstrated how survival analysis, combined with the OptSurvCutR package, can uncover deep insights into customer behaviour that a standard classification model might miss. By identifying multiple, data-driven “tipping points”, we can create sophisticated customer segments and develop more targeted business strategies.

We encourage you to try OptSurvCutR with your own data.

  • Install the package from GitHub: remotes::install_github("paytonyau/OptSurvCutR")
  • Report issues or suggest features on our GitHub page.
  • Star the repository if you find it useful.

7. Session Information

For reproducibility, the session information below lists the R version and all attached packages.

sessionInfo()
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8 
## [2] LC_CTYPE=English_United Kingdom.utf8   
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.utf8    
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.50        survminer_0.5.1   ggpubr_0.6.1      ggplot2_3.5.2    
## [5] survival_3.8-3    dplyr_1.1.4       OptSurvCutR_0.1.5 pacman_0.5.1     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.52          bslib_0.9.0        rstatix_0.7.2     
##  [5] lattice_0.22-7     vctrs_0.6.5        tools_4.5.1        generics_0.1.4    
##  [9] parallel_4.5.1     tibble_3.3.0       pkgconfig_2.0.3    Matrix_1.7-3      
## [13] data.table_1.17.8  RColorBrewer_1.1-3 lifecycle_1.0.4    stringr_1.5.1     
## [17] compiler_4.5.1     farver_2.1.2       codetools_0.2-20   carData_3.0-5     
## [21] litedown_0.7       htmltools_0.5.8.1  sass_0.4.10        yaml_2.3.10       
## [25] Formula_1.2-5      pillar_1.11.0      car_3.1-3          jquerylib_0.1.4   
## [29] tidyr_1.3.1        cachem_1.1.0       iterators_1.0.14   rgenoud_5.9-0.11  
## [33] abind_1.4-8        foreach_1.5.2      km.ci_0.5-6        commonmark_2.0.0  
## [37] tidyselect_1.2.1   digest_0.6.37      stringi_1.8.7      purrr_1.1.0       
## [41] labeling_0.4.3     splines_4.5.1      fastmap_1.2.0      grid_4.5.1        
## [45] cli_3.6.5          magrittr_2.0.3     broom_1.0.9        withr_3.0.2       
## [49] scales_1.4.0       backports_1.5.0    rmarkdown_2.29     ggtext_0.1.2      
## [53] gridExtra_2.3      ggsignif_0.6.4     zoo_1.8-14         evaluate_1.0.5    
## [57] KMsurv_0.1-6       doParallel_1.0.17  markdown_2.0       survMisc_0.5.6    
## [61] rlang_1.1.6        gridtext_0.1.5     Rcpp_1.1.0         xtable_1.8-4      
## [65] glue_1.8.0         xml2_1.4.0         rstudioapi_0.17.1  jsonlite_2.0.0    
## [69] R6_2.6.1