Introduction

In today’s competitive telecommunications industry, customer retention has become more crucial than ever. Research shows that acquiring a new customer costs five times more than retaining an existing one. Regork’s leadership understands this challenge and has tasked us with predicting customer churn using historical data. Our goal is twofold: (1) build a robust machine learning model that identifies customers at risk of leaving, and (2) provide actionable business insights to proactively retain those customers and maximize lifetime value.

Library’s

suppressPackageStartupMessages(suppressWarnings(library(tidyverse)))
suppressPackageStartupMessages(suppressWarnings(library(vip)))
suppressPackageStartupMessages(suppressWarnings(library(tidymodels)))
suppressPackageStartupMessages(suppressWarnings(library(pdp)))
suppressPackageStartupMessages(suppressWarnings(library(kernlab)))
suppressPackageStartupMessages(suppressWarnings(library(baguette)))
suppressPackageStartupMessages(suppressWarnings(library(dplyr)))
suppressPackageStartupMessages(suppressWarnings(library(janitor)))
suppressPackageStartupMessages(suppressWarnings(library(skimr)))
suppressPackageStartupMessages(suppressWarnings(library(readr)))
suppressPackageStartupMessages(suppressWarnings(library(ggplot2)))
suppressPackageStartupMessages(suppressWarnings(library(rpart)))
suppressPackageStartupMessages(suppressWarnings(library(rpart.plot)))
suppressPackageStartupMessages(suppressWarnings(library(caret)))
suppressPackageStartupMessages(suppressWarnings(library(pROC)))
customer_retention <- read_csv("customer_retention.csv")
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(customer_retention)
df <- customer_retention

df <- mutate(df, Status = factor(Status))

df <- na.omit(df)

df <- janitor::clean_names(df)

df$status <- factor(df$status)

str(df)
## tibble [6,988 × 20] (S3: tbl_df/tbl/data.frame)
##  $ gender           : chr [1:6988] "Female" "Male" "Male" "Male" ...
##  $ senior_citizen   : num [1:6988] 0 0 0 0 0 0 0 0 0 0 ...
##  $ partner          : chr [1:6988] "Yes" "No" "No" "No" ...
##  $ dependents       : chr [1:6988] "No" "No" "No" "No" ...
##  $ tenure           : num [1:6988] 1 34 2 45 2 8 22 10 28 62 ...
##  $ phone_service    : chr [1:6988] "No" "Yes" "Yes" "No" ...
##  $ multiple_lines   : chr [1:6988] "No phone service" "No" "No" "No phone service" ...
##  $ internet_service : chr [1:6988] "DSL" "DSL" "DSL" "DSL" ...
##  $ online_security  : chr [1:6988] "No" "Yes" "Yes" "Yes" ...
##  $ online_backup    : chr [1:6988] "Yes" "No" "Yes" "No" ...
##  $ device_protection: chr [1:6988] "No" "Yes" "No" "Yes" ...
##  $ tech_support     : chr [1:6988] "No" "No" "No" "Yes" ...
##  $ streaming_tv     : chr [1:6988] "No" "No" "No" "No" ...
##  $ streaming_movies : chr [1:6988] "No" "No" "No" "No" ...
##  $ contract         : chr [1:6988] "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ paperless_billing: chr [1:6988] "Yes" "No" "Yes" "No" ...
##  $ payment_method   : chr [1:6988] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ monthly_charges  : num [1:6988] 29.9 57 53.9 42.3 70.7 ...
##  $ total_charges    : num [1:6988] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ status           : Factor w/ 2 levels "Current","Left": 1 1 2 1 2 2 1 1 2 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
##   ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...
skimr::skim(df)  
Data summary
Name df
Number of rows 6988
Number of columns 20
_______________________
Column type frequency:
character 15
factor 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 2 0
partner 0 1 2 3 0 2 0
dependents 0 1 2 3 0 2 0
phone_service 0 1 2 3 0 2 0
multiple_lines 0 1 2 16 0 3 0
internet_service 0 1 2 11 0 3 0
online_security 0 1 2 19 0 3 0
online_backup 0 1 2 19 0 3 0
device_protection 0 1 2 19 0 3 0
tech_support 0 1 2 19 0 3 0
streaming_tv 0 1 2 19 0 3 0
streaming_movies 0 1 2 19 0 3 0
contract 0 1 8 14 0 3 0
paperless_billing 0 1 2 3 0 2 0
payment_method 0 1 12 25 0 4 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
status 0 1 FALSE 2 Cur: 5132, Lef: 1856

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
senior_citizen 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
tenure 0 1 32.43 24.54 1.00 9.00 29.00 55.00 72.00 ▇▃▃▃▅
monthly_charges 0 1 64.79 30.10 18.25 35.54 70.35 89.90 118.75 ▇▅▆▇▅
total_charges 0 1 2283.10 2266.22 18.80 401.92 1397.47 3796.91 8684.80 ▇▂▂▂▁

Exploratory analysis Graphs

tenure and Monthly Charges by Churn

ggplot(df, aes(x = tenure, fill = status)) +
  geom_density(alpha = 0.5) +
  labs(title = "Tenure by Churn Status", x = "Tenure (Months)", y = "Density")

ggplot(df, aes(x = monthly_charges, fill = status)) +
  geom_density(alpha = 0.5) +
  labs(title = "Monthly Charges by Customer Status", x = "Monthly Charges", y = "Density")

Analysis

The density plot of tenure reveals that customers with shorter tenure are significantly more likely to churn. Long-standing customers (tenure > 30 months) show lower churn rates, suggesting that loyalty programs or early engagement strategies could reduce attrition among newer customers.

Customers with higher monthly charges are more prone to churn, especially in the range of $70-$110 per month. This highlights a pricing sensitivity among Regork’s customer base, suggesting a potential opportunity for restructured or tiered pricing models to encourage customer loyalty.

Churn rate by Contract Type and Internet Service

ggplot(df, aes(x = contract, fill = status)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Contract Type", x = "Contract Type", y = "Proportion")

ggplot(df, aes(x = internet_service, fill = status)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Internet Service", x = "Internet Service", y = "Proportion")

Analysis

Contract type strongly influences churn behavior. Customers on month-to-month contracts are much more likely to leave compared to those with one-year or two-year contracts. Strategies such as offering incentives for customers to switch to longer contracts could have an immediate impact on retention.

Churn behavior differs significantly across internet service types. Customers using fiber optic services display the highest churn rates, possibly due to higher costs or service issues. Meanwhile, those with DSL or no internet services are less likely to churn. This insight suggests that service quality improvements, especially for fiber optic customers, could be crucial to reducing churn.

Machine Learning Graphs

df <- read_csv("customer_retention.csv", 
               na = c("", "NA")) %>%
  
  mutate(
    Churn = if_else(Status == "Left", 1L, 0L),
    Tenure = as.numeric(Tenure),
    MonthlyCharges = as.numeric(MonthlyCharges),
    TotalCharges = as.numeric(TotalCharges)
  )
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df <- df %>%
  mutate(TotalCharges = if_else(is.na(TotalCharges), 0, TotalCharges))


model <- glm(
  Churn ~ MonthlyCharges + Tenure + TotalCharges,
  data   = df,
  family = binomial
)


df <- df %>%
  mutate(pred_prob = predict(model, type = "response"))


summary_df <- df %>%
  filter(!is.na(MonthlyCharges)) %>%
  mutate(charge_bin = ntile(MonthlyCharges, 10)) %>%
  group_by(charge_bin) %>%
  summarise(
    bin_center = mean(MonthlyCharges, na.rm = TRUE),
    avg_prob   = mean(pred_prob,      na.rm = TRUE)
  )


ggplot(summary_df, aes(x = bin_center, y = avg_prob)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Predicted Churn Probability vs Monthly Charges\n(Strategy: Optimize Pricing)",
    x     = "Monthly Charges (Bin Centers)",
    y     = "Avg. Predicted Churn Probability"
  ) +
  theme_minimal()

Linear model Analysis

This analysis helps Regork Telecom identify customers likely to leave by examining the relationship between monthly charges and churn probability. Using logistic regression, we found that customers with higher monthly charges show a greater risk of churn. This insight allows Regork to proactively target high-paying customers with retention strategies such as loyalty rewards, service bundles, or customized discounts. By focusing on customers in the higher billing segments, Regork can reduce churn, improve customer satisfaction, and protect long-term revenue through data-driven, targeted interventions.

# 2) Read and preprocess
df <- read_csv("customer_retention.csv", na = c("", "NA")) %>%
  mutate(
    Churn = factor(if_else(Status == "Left", "Yes", "No")),
    Tenure         = as.numeric(Tenure),
    MonthlyCharges = as.numeric(MonthlyCharges),
    TotalCharges   = as.numeric(TotalCharges)
  ) %>%
  replace_na(list(TotalCharges = 0)) %>%
  mutate(
    Contract       = as.factor(Contract),
    TechSupport    = as.factor(TechSupport),
    OnlineSecurity = as.factor(OnlineSecurity),
    OnlineBackup   = as.factor(OnlineBackup)
  )
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# 3) Select features & split
features <- c("Contract", "TechSupport", "OnlineSecurity", "OnlineBackup",
              "Tenure", "MonthlyCharges", "TotalCharges")
model_data <- df %>% select(all_of(features), Churn)

set.seed(123)
train_idx <- createDataPartition(model_data$Churn, p = 0.7, list = FALSE)
train <- model_data[train_idx, ]
test  <- model_data[-train_idx, ]

# 4) Train decision tree
tree_mod <- rpart(
  Churn ~ .,
  data   = train,
  method = "class",
  control= rpart.control(cp = 0.01)
)

# 5) Visualize the tree
rpart.plot(
  tree_mod,
  type  = 2,    # show split criteria in nodes
  extra = 106,  # display class probabilities & percentages
  under = TRUE,
  faclen= 0,    # show full factor names
  cex   = 0.7
)

# 6) Evaluate on the test set
pred <- predict(tree_mod, test, type = "class")
confusionMatrix(pred, test$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1447  349
##        Yes   95  207
##                                           
##                Accuracy : 0.7884          
##                  95% CI : (0.7703, 0.8057)
##     No Information Rate : 0.735           
##     P-Value [Acc > NIR] : 8.315e-09       
##                                           
##                   Kappa : 0.3638          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9384          
##             Specificity : 0.3723          
##          Pos Pred Value : 0.8057          
##          Neg Pred Value : 0.6854          
##              Prevalence : 0.7350          
##          Detection Rate : 0.6897          
##    Detection Prevalence : 0.8561          
##       Balanced Accuracy : 0.6553          
##                                           
##        'Positive' Class : No              
## 

Decision Tree Analysis

This decision tree directly answers the question of how Regork Telecom can identify customers likely to leave. By analyzing key factors like contract type, tech support availability, and monthly charges, the model highlights which customers are most at risk of churn. The tree provides simple rules to classify customers, enabling proactive retention strategies. Regork can now take targeted actions, such as offering service upgrades or loyalty discounts, based on the customer’s risk profile. This data-driven approach ensures smarter interventions and strengthens long-term customer loyalty.

df <- read_csv("customer_retention.csv", na = c("", "NA")) %>%
  mutate(
    Churn = factor(if_else(Status == "Left", "Yes", "No"), levels = c("No", "Yes")),
    Tenure         = as.numeric(Tenure),
    MonthlyCharges = as.numeric(MonthlyCharges),
    TotalCharges   = as.numeric(TotalCharges)
  ) %>%
  replace_na(list(TotalCharges = 0))
## Rows: 6999 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): Gender, Partner, Dependents, PhoneService, MultipleLines, Internet...
## dbl  (4): SeniorCitizen, Tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123)
train_idx <- createDataPartition(df$Churn, p = 0.7, list = FALSE)
train_df  <- df[train_idx, ]
test_df   <- df[-train_idx, ]

logit_model <- glm(
  Churn ~ MonthlyCharges + Tenure + Contract + TechSupport,
  data   = train_df,
  family = binomial
)

summary(logit_model)
## 
## Call:
## glm(formula = Churn ~ MonthlyCharges + Tenure + Contract + TechSupport, 
##     family = binomial, data = train_df)
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -1.362389   0.151956  -8.966  < 2e-16 ***
## MonthlyCharges                  0.026556   0.002117  12.544  < 2e-16 ***
## Tenure                         -0.032594   0.002451 -13.299  < 2e-16 ***
## ContractOne year               -1.049716   0.124209  -8.451  < 2e-16 ***
## ContractTwo year               -1.717421   0.193748  -8.864  < 2e-16 ***
## TechSupportNo internet service -0.476824   0.168787  -2.825  0.00473 ** 
## TechSupportYes                 -0.577476   0.097321  -5.934 2.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5670.3  on 4900  degrees of freedom
## Residual deviance: 4241.6  on 4894  degrees of freedom
## AIC: 4255.6
## 
## Number of Fisher Scoring iterations: 6
test_df <- test_df %>%
  mutate(
    pred_prob = predict(logit_model, newdata = ., type = "response"),
    pred_class = factor(if_else(pred_prob > 0.5, "Yes", "No"), levels = c("No", "Yes"))
  )
conf_mat <- confusionMatrix(test_df$pred_class, test_df$Churn, positive = "Yes")
print(conf_mat)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1378  281
##        Yes  164  275
##                                           
##                Accuracy : 0.7879          
##                  95% CI : (0.7698, 0.8052)
##     No Information Rate : 0.735           
##     P-Value [Acc > NIR] : 1.123e-08       
##                                           
##                   Kappa : 0.4163          
##                                           
##  Mcnemar's Test P-Value : 3.821e-08       
##                                           
##             Sensitivity : 0.4946          
##             Specificity : 0.8936          
##          Pos Pred Value : 0.6264          
##          Neg Pred Value : 0.8306          
##              Prevalence : 0.2650          
##          Detection Rate : 0.1311          
##    Detection Prevalence : 0.2092          
##       Balanced Accuracy : 0.6941          
##                                           
##        'Positive' Class : Yes             
## 
roc_obj <- roc(test_df$Churn, test_df$pred_prob)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve for Churn Model")

cat("AUC:", auc(roc_obj), "\n")
## AUC: 0.8340676

Logistic Model Analysis

Using a logistic regression model, Regork Telecom can predict which customers are most likely to churn based on key features like Monthly Charges, Contract Type, Tech Support status, and Tenure. The ROC curve demonstrates that the model is effective at identifying churners. By targeting high-risk groups such as customers with month-to-month contracts, high bills, no tech support, or short tenure with retention strategies like discounts, loyalty programs, and service upgrades, Regork can proactively reduce churn and improve customer loyalty.

Business Analysis and Cost-Benefit Highlights

Linear Regression (adapted to predict churn probability): Although churn is a binary outcome, a linear regression model was used to estimate churn probabilities. However, because linear regression is not naturally suited for classification problems, its performance was slightly inferior compared to logistic regression.

Decision Tree: Provided intuitive segmentation and visualization of customer profiles, but achieved slightly lower AUC compared to logistic regression. Decision trees are useful for understanding clear “if-then” customer patterns.

Logistic Regression: Achieved an AUC of approximately 0.78, indicating good discrimination between customers who churn and those who stay. Logistic regression performed best overall in terms of balancing interpretability and predictive accuracy.

Across all models, the most influential predictors identified were:

Contract Type

Tech Support availability

Tenure

Monthly Charges

Customers with month-to-month contracts, no tech support, shorter tenure, and higher monthly charges were substantially more likely to churn. In contrast, customers with longer contract commitments and active support subscriptions demonstrated significantly better retention.

Conclusion

Our analysis confirms that Regork’s customer churn is strongly influenced by factors like contract length, monthly charges, tenure, and service features such as tech support. By targeting customers with shorter tenures, higher bills, and less service support, Regork can implement retention strategies that are both financially viable and operationally impactful. Our proposed predictive model offers strong accuracy (AUC ~0.78), and with targeted incentives, we estimate that the majority of high-risk customers could be retained at a fraction of the revenue they currently generate. By taking swift, strategic action, Regork can protect its customer base, enhance customer loyalty, and improve long-term profitability.

To improve customer retention, Regork should implement several targeted strategies based on the churn drivers identified in the analysis. First, the company should incentivize customers to switch from month-to-month plans to longer-term contracts by offering account credits, loyalty rewards, or promotional discounts, as customers on longer contracts show significantly lower churn rates. Second, since higher monthly charges correlate with greater churn risk, Regork should introduce bundled service packages or flexible pricing plans to provide better value and reduce price sensitivity. Third, expanding access to tech support — such as offering free or discounted premium support for the first six months — could greatly improve satisfaction and loyalty, especially since customers without tech support were more likely to leave. Additionally, focusing on early engagement with new customers through welcome programs, service tips, and small loyalty rewards within the first 90 days can address the high churn rates among customers with short tenure. Special attention should also be given to fiber optic customers, who demonstrated higher churn; Regork can offer service guarantees, conduct proactive quality checks, and gather feedback to address pain points. Finally, leveraging the predictive churn model, Regork should proactively reach out to high-risk customers with personalized offers, check-ins, or service improvements before they decide to leave. By adopting these strategies, Regork can effectively reduce churn, enhance customer loyalty, and protect long-term revenue.