1. Introduction

Regork has recently expanded into the telecommunications sector, offering internet, phone, and streaming services. In this highly competitive and subscription-driven market, customer retention is critical. Research indicates that acquiring a new customer can be up to five times more costly than retaining an existing one.

This project aims to support Regork in predicting customer churn through advanced machine learning techniques. By identifying customers at risk of leaving, Regork can implement targeted retention strategies, thereby minimizing revenue loss and enhancing customer lifetime value.

Using historical customer data, including demographics, service usage patterns, and billing information, I:

Conducted exploratory data analysis (EDA) to uncover churn patterns

Built and evaluated multiple classification models

Identified key drivers influencing churn

Simulated the potential business impact based on model predictions

Through these insights, Regork can make data-driven decisions to improve customer retention and overall profitability.

Business Problem

The CEO of Regork Telecom has prioritized improving customer retention as a key strategic objective. My primary challenge is to address the following question:

“Can we accurately predict which customers are likely to churn, and what proactive measures can we implement to reduce attrition?”

To support this goal, I leveraged advanced analytics to evaluate churn risk across the customer base and developed a targeted retention strategy aimed at informing and enhancing leadership decision-making.

library(tidyverse)
library(tidymodels)
library(ggplot2)
library(vip)
library(pdp)
library(ranger)
library(earth)
library(skimr)
library(naniar)
set.seed(123)

2. Data Preparation and Exploratory Analysis

Data Loading and Cleaning

df <- read_csv("customer_retention.csv") %>%
  mutate(across(where(is.character), as.factor),
         Status = factor(Status),
         TotalCharges = as.numeric(TotalCharges)) %>%
  drop_na()

I began by loading and preprocessing the data to ensure consistency and reliability for the analysis. Categorical variables were appropriately converted to factors, and the TotalCharges column—originally stored as a character variable—was transformed into a numeric format. To maintain the integrity and interpretability of the models, I removed rows with missing values. Since records with missing TotalCharges comprised less than 1% of the dataset, their removal had a negligible impact on the overall analysis.
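As a quick sanity check on the scale of the removal, the raw file can be profiled before cleaning. The sketch below uses naniar (already loaded above) and assumes the same customer_retention.csv file; blank TotalCharges entries become NA when coerced to numeric.

# Sketch: quantify missingness in the raw file before dropping rows
raw <- read_csv("customer_retention.csv") %>%
  mutate(TotalCharges = as.numeric(TotalCharges))  # blanks become NA (with a warning)

miss_var_summary(raw)              # per-column missing counts and percentages
mean(!complete.cases(raw)) * 100   # share of rows that drop_na() removes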

Data Overview

skim(df)
Data summary
Name df
Number of rows 6988
Number of columns 20
_______________________
Column type frequency:
factor 16
numeric 4
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Gender 0 1 FALSE 2 Mal: 3526, Fem: 3462
Partner 0 1 FALSE 2 No: 3611, Yes: 3377
Dependents 0 1 FALSE 2 No: 4894, Yes: 2094
PhoneService 0 1 FALSE 2 Yes: 6314, No: 674
MultipleLines 0 1 FALSE 3 No: 3366, Yes: 2948, No : 674
InternetService 0 1 FALSE 3 Fib: 3075, DSL: 2400, No: 1513
OnlineSecurity 0 1 FALSE 3 No: 3470, Yes: 2005, No : 1513
OnlineBackup 0 1 FALSE 3 No: 3069, Yes: 2406, No : 1513
DeviceProtection 0 1 FALSE 3 No: 3073, Yes: 2402, No : 1513
TechSupport 0 1 FALSE 3 No: 3447, Yes: 2028, No : 1513
StreamingTV 0 1 FALSE 3 No: 2791, Yes: 2684, No : 1513
StreamingMovies 0 1 FALSE 3 No: 2758, Yes: 2717, No : 1513
Contract 0 1 FALSE 3 Mon: 3847, Two: 1677, One: 1464
PaperlessBilling 0 1 FALSE 2 Yes: 4134, No: 2854
PaymentMethod 0 1 FALSE 4 Ele: 2350, Mai: 1595, Ban: 1532, Cre: 1511
Status 0 1 FALSE 2 Cur: 5132, Lef: 1856

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
SeniorCitizen 0 1 0.16 0.37 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▂
Tenure 0 1 32.43 24.54 1.00 9.00 29.00 55.00 72.00 ▇▃▃▃▅
MonthlyCharges 0 1 64.79 30.10 18.25 35.54 70.35 89.90 118.75 ▇▅▆▇▅
TotalCharges 0 1 2283.10 2266.22 18.80 401.92 1397.47 3796.91 8684.80 ▇▂▂▂▁

After inspecting the dataset using skim(df), I confirmed that the data is well-structured with no evidence of systematic missing values. Approximately 26% of customers in the dataset have churned, establishing a baseline churn rate. Any predictive model developed must significantly outperform this naive baseline in order to provide actionable value.
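For reference, the baseline churn rate quoted above can be computed directly from the Status column; the counts match the skim output.

# Baseline churn rate: share of customers whose Status is "Left"
df %>%
  count(Status) %>%
  mutate(prop = n / sum(n))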

Demographics and Service Relationships

# Relationship between demographics and service usage
ggplot(df, aes(x = InternetService, fill = Gender)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c("Female" = "skyblue", "Male" = "#7db0ff")) +
  labs(title = "Internet Service Usage by Gender", 
       y = "Proportion", 
       x = "Internet Service Type",
       fill = "Gender") +
  theme_minimal()

# Relationship between age and services
ggplot(df, aes(x = InternetService, fill = factor(SeniorCitizen))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(name = "Senior Citizen", 
                    labels = c("No", "Yes"),
                    values = c("0" = "lightblue", "1" = "steelblue")) + 
  labs(title = "Internet Service Usage by Age Group", 
       y = "Proportion", 
       x = "Internet Service Type") +
  theme_minimal()

I explored the relationships between customer demographics and service usage patterns to identify potential drivers of churn.

The analysis reveals distinct trends in service adoption across demographic groups. Senior citizens are significantly less likely to subscribe to fiber optic services compared to younger customers, suggesting a potential sensitivity to service complexity, cost, or installation requirements. In contrast, gender shows only modest differences in internet service preferences.

These insights highlight opportunities for Regork to tailor retention strategies to specific customer segments, particularly by addressing the unique needs and concerns of older customers.

# Service adoption patterns
service_cols <- c("PhoneService", "MultipleLines", "OnlineSecurity", 
                  "OnlineBackup", "DeviceProtection", "TechSupport", 
                  "StreamingTV", "StreamingMovies")

service_adoption <- df %>%
  select(all_of(service_cols), Status) %>%
  pivot_longer(cols = all_of(service_cols), 
               names_to = "Service", 
               values_to = "Adoption") %>%
  filter(Adoption == "Yes") %>%
  group_by(Service, Status) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Service) %>%
  mutate(Proportion = Count / sum(Count),
         Total = sum(Count))

ggplot(service_adoption, aes(x = reorder(Service, -Total), y = Proportion, fill = Status)) +
  geom_col() +
  scale_fill_manual(values = c("Current" = "#7db0ff", "Left" = "#a4d8f2")) + 
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Service Adoption Rates and Churn Proportion", 
       y = "Proportion", 
       x = "Service Type",
       fill = "Customer Status") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I analyzed customer adoption of various services and their relationship to churn outcomes.

The results show that protective services—such as Online Security, Tech Support, and Online Backup—are associated with significantly lower churn rates. Conversely, customers who primarily subscribe to streaming services without these protective offerings exhibit higher churn rates.

These findings suggest that bundling protective services with entertainment options could be an effective strategy to enhance customer retention and reduce churn.
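As a quick numeric check on this pattern, churn rates can be compared across adoption levels of one protective service. The sketch below uses OnlineSecurity; the same summary works for TechSupport or OnlineBackup.

# Churn rate by OnlineSecurity adoption (includes the "No internet service" level)
df %>%
  group_by(OnlineSecurity) %>%
  summarise(customers = n(),
            churn_rate = mean(Status == "Left"),
            .groups = "drop")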

3. Model Building and Evaluation

To determine the most effective model for predicting customer churn, I evaluated three machine learning approaches: Logistic Regression, Multivariate Adaptive Regression Splines (MARS), and Random Forest.

Each model was evaluated with 5-fold cross-validation, with hyperparameters tuned where applicable, to obtain robust performance estimates. Model performance was assessed primarily with the area under the ROC curve (AUC), which summarizes the trade-off between sensitivity (true positive rate) and specificity (true negative rate) across all classification thresholds and is therefore more informative than raw accuracy on this moderately imbalanced dataset (roughly 26% churn).

Data Split

I allocated 70% of the data for model training and reserved the remaining 30% for final evaluation. Stratified sampling was applied based on the churn status to maintain consistent class proportions across both the training and testing sets, ensuring a fair and representative evaluation.

split <- initial_split(df, prop = 0.7, strata = Status)
train <- training(split)
test <- testing(split)
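The tuning chunks below create their resamples inline. An equivalent, slightly tidier setup would define the folds and the evaluation metrics once and reuse them; this is only a sketch of that alternative under the same 5-fold, stratified design.

# Optional: shared resampling scheme and metric set for all models
folds   <- vfold_cv(train, v = 5, strata = Status)
metrics <- metric_set(roc_auc, sens, spec)
# e.g. fit_resamples(log_wf, resamples = folds, metrics = metrics) in the chunks below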

Logistic Regression

I began by fitting a logistic regression model, which serves as a simple yet interpretable baseline. While logistic regression may not capture complex nonlinear relationships, its coefficients provide valuable insights into the direction and relative strength of predictors influencing customer churn. This model establishes an important reference point for evaluating more flexible machine learning methods.

log_reg <- logistic_reg() %>% set_engine("glm")
log_wf <- workflow() %>% add_model(log_reg) %>% add_formula(Status ~ .)
log_res <- fit_resamples(log_wf, vfold_cv(train, v = 5, strata = Status))
collect_metrics(log_res)
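Because the appeal of this baseline is interpretability, the coefficients themselves are worth a look. The sketch below refits the same workflow on the full training set and reports the ten most significant terms as odds ratios; it is an illustrative addition, not part of the tuning pipeline.

# Inspect the baseline's coefficients as odds ratios (illustrative)
log_fit <- fit(log_wf, data = train)
log_fit %>%
  extract_fit_parsnip() %>%
  tidy(exponentiate = TRUE) %>%   # odds ratios instead of log-odds
  arrange(p.value) %>%
  slice_head(n = 10)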

MARS Model

The Multivariate Adaptive Regression Splines (MARS) model extends traditional linear models by automatically capturing non-linearities and interactions between variables. It constructs piecewise linear relationships across different regions of the data, allowing greater flexibility than standard regression models.

I tuned the number of terms and the degree of interactions to optimize the model’s fit to the training data. This approach enables MARS to better model complex patterns that may influence customer churn.

mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
  set_mode("classification") %>%
  set_engine("earth")
mars_wf <- workflow() %>%
  add_recipe(recipe(Status ~ ., data = train)) %>%
  add_model(mars_mod)

mars_grid <- grid_regular(num_terms(range = c(1, 30)), prod_degree(), levels = 5)
mars_res <- tune_grid(mars_wf, resamples = vfold_cv(train, v = 5, strata = Status), grid = mars_grid)
show_best(mars_res, metric = "roc_auc")
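The cross-validated results can also be inspected visually before committing to a configuration; tune provides an autoplot method for tuning results.

# Visualize performance across the num_terms / prod_degree grid
autoplot(mars_res)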

Random Forest

Random Forest emerged as the top-performing model for predicting customer churn. This ensemble method is highly robust to overfitting, handles high-dimensional datasets effectively, and provides insights into variable importance.

I tuned key hyperparameters, including the number of predictors randomly selected at each split (mtry) and the minimum number of observations required in a terminal node (min_n), to optimize model performance. The number of trees was fixed at 500 to balance computational efficiency with predictive accuracy.

rf_mod <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(recipe(Status ~ ., data = train)) %>%
  add_model(rf_mod)

rf_grid <- grid_regular(
  finalize(mtry(), train %>% select(-Status)),  # cap mtry at the number of predictors
  min_n(),
  levels = 3
)

rf_res <- tune_grid(
  rf_wf,
  resamples = vfold_cv(train, v = 5, strata = Status),
  grid = rf_grid
)

show_best(rf_res, metric = "roc_auc")

Final Evaluation

After identifying the best random forest model through cross-validation, I finalized the model with the optimal hyperparameters and retrained it on the entire training set.

I then made predictions on the test set, generating both class probabilities and predicted labels. Model performance was assessed by:

ROC AUC: Measuring the model’s ability to correctly separate churners from non-churners.

Confusion Matrix: Evaluating the balance between correctly identifying churners and minimizing false positives.

This final evaluation step is essential to validate the model’s generalization ability on unseen data and ensure it is reliable for real-world deployment.

# Select the best hyperparameters based on ROC AUC
rf_best <- select_best(rf_res, metric = "roc_auc")

# Finalize the workflow with the best parameters
rf_final_wf <- finalize_workflow(rf_wf, rf_best)

# Fit the finalized model on the full training data
rf_final_fit <- fit(rf_final_wf, data = train)

# Generate predictions on the test set
rf_predictions <- predict(rf_final_fit, test, type = "prob") %>%
  bind_cols(predict(rf_final_fit, test)) %>%
  bind_cols(test %>% select(Status))

# Evaluate model performance
# 1. ROC AUC
roc_auc(rf_predictions, truth = Status, .pred_Left, event_level = "second")
# 2. Confusion Matrix
conf_mat(rf_predictions, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1380  253
##    Left        160  304

Detailed Model Performance Analysis

# Create a more visually informative confusion matrix
conf_matrix <- conf_mat(rf_predictions, truth = Status, estimate = .pred_class)
conf_matrix_tbl <- conf_matrix$table

# Extract cells from the confusion matrix (rows = Prediction, columns = Truth)
tp <- conf_matrix_tbl[2, 2]  # predicted Left, actually Left
fp <- conf_matrix_tbl[2, 1]  # predicted Left, actually Current
fn <- conf_matrix_tbl[1, 2]  # predicted Current, actually Left
tn <- conf_matrix_tbl[1, 1]  # predicted Current, actually Current

# Calculate business-relevant metrics
precision <- tp / (tp + fp) 
recall <- tp / (tp + fn)     
specificity <- tn / (tn + fp) 

# Visualize the confusion matrix with percentages (Blue Theme)
conf_matrix_plot <- as.data.frame(conf_matrix$table)
names(conf_matrix_plot) <- c("Predicted", "Actual", "Freq")  # table columns are Prediction, Truth, Freq

ggplot(conf_matrix_plot, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", size = 5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Confusion Matrix for Random Forest Model",
       subtitle = sprintf("Precision: %.2f, Recall: %.2f, Specificity: %.2f", 
                          precision, recall, specificity),
       fill = "Count") +
  theme_minimal()

# Extract variable importance
rf_imp_data <- rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip::vi()

ggplot(rf_imp_data %>% slice_max(Importance, n = 10), 
       aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col(fill = "#a4d8f2") +
  coord_flip() +
  labs(
    title = "Top 10 Predictors of Customer Churn",
    subtitle = "Random Forest Model Feature Importance",
    x = "Feature",
    y = "Importance"
  ) +
  theme_minimal()

# Partial dependence plots for top predictors
top_predictors <- c("Tenure", "Contract", "MonthlyCharges")

for (pred in top_predictors) {
  p <- rf_final_fit %>%
    extract_fit_parsnip() %>%
    partial(pred.var = pred, train = train,
            grid.resolution = 20, prob = TRUE,
            which.class = 2,  # second Status level ("Left"), i.e. churn probability
            plot = TRUE, plot.engine = "ggplot2")
  
  print(p +
        geom_line(color = "#7db0ff", size = 1.2) +   
        labs(title = paste("Effect of", pred, "on Churn Probability"),
             y = "Predicted Probability of Churn") +
        theme_minimal())
}

The random forest model strikes a solid balance between precision and recall. With a precision of approximately 0.66, roughly two out of three customers the model flags as at risk do in fact churn. This lets Regork target retention offers with reasonable confidence, limiting resources wasted on customers who were unlikely to leave.

The model’s recall of approximately 0.55 means it identifies a little over half of the customers who actually churn. Some churners are still missed (false negatives), but this is an acceptable trade-off: the business cost of missing some at-risk customers is lower than the cost of offering incentives to many customers who would have stayed without intervention, especially given the high expense of customer acquisition.
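These figures can be reproduced directly from the prediction table with yardstick, avoiding the manual indexing into the confusion matrix above. This is a cross-check sketch; event_level = "second" marks "Left" as the positive class, matching the earlier ROC AUC call, and the yardstick:: prefix avoids clashing with the precision and recall objects created above.

# Cross-check precision, recall, and specificity with yardstick
yardstick::precision(rf_predictions, truth = Status, estimate = .pred_class, event_level = "second")
yardstick::recall(rf_predictions, truth = Status, estimate = .pred_class, event_level = "second")
yardstick::spec(rf_predictions, truth = Status, estimate = .pred_class, event_level = "second")  # specificity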

The feature importance analysis reveals several key business insights:

Tenure is the strongest predictor of churn. Customers are significantly more likely to churn within the first 12 months, highlighting the first year as a crucial window for retention efforts.

Contract Type has a major impact on churn risk. Customers on month-to-month contracts exhibit substantially higher churn rates, suggesting that offering incentives or discounts for longer-term contracts could effectively improve retention.

Monthly Charges show a clear threshold effect around $70, beyond which churn risk increases noticeably. This indicates a point of price sensitivity that could be leveraged in pricing and promotional strategies.

Internet Service Type also influences churn, with fiber optic customers exhibiting higher churn rates despite paying premium prices. This suggests an opportunity to investigate service quality or customer experience for fiber customers.

Payment Method is correlated with loyalty. Customers using automatic payment methods demonstrate higher retention compared to those paying via electronic checks, highlighting a simple behavioral indicator that could be used for early churn risk identification.

These insights provide clear, actionable guidance for Regork’s retention strategy, allowing leadership to focus on targeted incentives, contract structuring, pricing adjustments, and service improvements to reduce churn and enhance customer lifetime value.
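A simple descriptive cross-check of the first two drivers, computed straight from the data, is shown below; the tenure cut points are illustrative.

# Churn rate by contract type
df %>%
  group_by(Contract) %>%
  summarise(churn_rate = mean(Status == "Left"), .groups = "drop")

# Churn rate by tenure bucket (illustrative 12/24/48-month cut points)
df %>%
  mutate(TenureGroup = cut(Tenure, breaks = c(0, 12, 24, 48, 72),
                           labels = c("0-12", "13-24", "25-48", "49-72"))) %>%
  group_by(TenureGroup) %>%
  summarise(churn_rate = mean(Status == "Left"), .groups = "drop")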

4. Business Value Analysis

Monthly and Annualized Revenue at Risk

rf_predictions <- predict(rf_final_fit, test, type = "prob") %>%
  bind_cols(predict(rf_final_fit, test)) %>%
  bind_cols(test %>% select(Status, MonthlyCharges))

risk_df <- rf_predictions %>% filter(.pred_class == "Left")
revenue_at_risk <- sum(risk_df$MonthlyCharges)
revenue_at_risk
## [1] 35378.25
annual_revenue_at_risk <- revenue_at_risk * 12
annual_revenue_at_risk
## [1] 424539

Based on the model’s predictions, Regork risks losing approximately $35,378 in monthly revenue if no action is taken. Annualized, this represents a potential revenue loss of roughly $424,539. These figures highlight the urgency of acting on the churn predictions by implementing targeted retention strategies to protect recurring revenue and customer lifetime value.

Incentive Strategy and ROI

discount <- 0.10
cost <- sum(risk_df$MonthlyCharges * discount)
saved_customers <- nrow(risk_df) * 0.5  # 50% retention assumed
benefit <- mean(risk_df$MonthlyCharges) * saved_customers * 6  # 6 months revenue
roi <- benefit - cost
roi
## [1] 102596.9

To reduce churn, I propose offering a 10% loyalty discount to customers identified as high-risk. Assuming a 50% retention rate among these customers, the campaign could generate substantial retained revenue over the next six months.

The cost of the incentive is one month of the 10% discount applied to the monthly charges of all at-risk customers. The benefit is estimated as the revenue retained from the customers assumed to stay for six more months.

The net return on investment, calculated as the difference between the expected benefit and the cost of the incentive, is approximately $102,597, indicating that the proposed campaign would be a highly profitable investment for Regork.
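Because the 50% retention assumption is the most uncertain input, a quick sensitivity check shows how the net return would move as that rate varies; this reuses the cost and six-month horizon defined above and is a sketch rather than a forecast.

# Sensitivity of the net return to the assumed retention rate
tibble(retention_rate = seq(0.2, 0.7, by = 0.1)) %>%
  mutate(saved_customers = nrow(risk_df) * retention_rate,
         benefit = mean(risk_df$MonthlyCharges) * saved_customers * 6,
         net_return = benefit - cost)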

5. Conclusion

This analysis equips Regork Telecom with a data-driven strategy to identify customers at risk of churn and improve retention efforts.

Using predictive modeling, I found that short tenure, month-to-month contracts, and higher monthly charges are key churn drivers. After comparing multiple models, the Random Forest model delivered the strongest predictive performance. I estimated potential revenue loss and showed that a targeted loyalty discount campaign could yield a strong positive return on investment.

I recommend piloting the incentive program and retraining the model quarterly to ensure it adapts to changing customer behavior.

Limitations and Future Improvements

Several limitations should be considered:

Temporal Validity: Model performance may decline over time; quarterly retraining is advised.

Limited Behavioral Data: More detailed service usage data could improve predictions.

Competitive Context: The model does not account for competitor actions or market changes.

Causality vs. Correlation: Strong predictors were found, but causal effects require A/B testing.

Retention Cost Assumptions: The ROI analysis assumes uniform discount effectiveness, which may vary in practice.

Future Work

Future improvements should focus on adding granular behavioral data, integrating competitor insights, conducting experimental testing, modeling customer lifetime value, and establishing a continuous learning feedback loop.