Introduction

Customer retention is a critical priority for Regork Telecom, especially given the high costs associated with acquiring new customers compared to retaining existing ones.

The goal of this project was to analyze customer data and build a predictive model that accurately identifies customers at risk of churning. By doing so, Regork Telecom can proactively target high-risk customers with strategic incentives to improve retention and protect future revenue.

This report outlines the full data analysis process, including exploratory data analysis (EDA), model development and selection, business impact analysis, and a final strategic recommendation based on the predictive model’s insights.

Loading Required Packages

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.4.3
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.8     ✔ rsample      1.2.1
## ✔ dials        1.4.0     ✔ tune         1.3.0
## ✔ infer        1.0.7     ✔ workflows    1.2.0
## ✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
## ✔ parsnip      1.3.0     ✔ yardstick    1.3.2
## ✔ recipes      1.1.1
## Warning: package 'broom' was built under R version 4.4.3
## Warning: package 'dials' was built under R version 4.4.3
## Warning: package 'infer' was built under R version 4.4.3
## Warning: package 'modeldata' was built under R version 4.4.3
## Warning: package 'parsnip' was built under R version 4.4.3
## Warning: package 'rsample' was built under R version 4.4.3
## Warning: package 'tune' was built under R version 4.4.3
## Warning: package 'workflows' was built under R version 4.4.3
## Warning: package 'workflowsets' was built under R version 4.4.3
## Warning: package 'yardstick' was built under R version 4.4.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
library(rpart)
## 
## Attaching package: 'rpart'
## 
## The following object is masked from 'package:dials':
## 
##     prune
library(ranger)
## Warning: package 'ranger' was built under R version 4.4.3

Loading the essential packages for this project. tidyverse handles data wrangling, ggplot2 handles visualization, and tidymodels, rpart, and ranger are used for the machine learning models.

Import Data

df <- read.csv('customer_retention.csv')

head(df)
##   Gender SeniorCitizen Partner Dependents Tenure PhoneService    MultipleLines
## 1 Female             0     Yes         No      1           No No phone service
## 2   Male             0      No         No     34          Yes               No
## 3   Male             0      No         No      2          Yes               No
## 4   Male             0      No         No     45           No No phone service
## 5 Female             0      No         No      2          Yes               No
## 6 Female             0      No         No      8          Yes              Yes
##   InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 1             DSL             No          Yes               No          No
## 2             DSL            Yes           No              Yes          No
## 3             DSL            Yes          Yes               No          No
## 4             DSL            Yes           No              Yes         Yes
## 5     Fiber optic             No           No               No          No
## 6     Fiber optic             No           No              Yes          No
##   StreamingTV StreamingMovies       Contract PaperlessBilling
## 1          No              No Month-to-month              Yes
## 2          No              No       One year               No
## 3          No              No Month-to-month              Yes
## 4          No              No       One year               No
## 5          No              No Month-to-month              Yes
## 6         Yes             Yes Month-to-month              Yes
##               PaymentMethod MonthlyCharges TotalCharges  Status
## 1          Electronic check          29.85        29.85 Current
## 2              Mailed check          56.95      1889.50 Current
## 3              Mailed check          53.85       108.15    Left
## 4 Bank transfer (automatic)          42.30      1840.75 Current
## 5          Electronic check          70.70       151.65    Left
## 6          Electronic check          99.65       820.50    Left

Here I am importing the customer retention data set into R. Then I am displaying the first few rows to understand the data set’s structure and confirm a successful import.

Clean The Data

# Converting total charges to numerics and removing missing values
df <- df %>% 
  mutate(TotalCharges = as.numeric(TotalCharges)) %>% 
  filter(!is.na(TotalCharges))

# Convert Status to a factor
df <- df %>% 
  mutate(Status = factor(Status, levels = c("Current", "Left")))

# Convert Senior Citizen to Factor
df <- df %>% 
  mutate(SeniorCitizen = factor(SeniorCitizen, levels = c(0,1), 
                                labels = c("Not Senior", "Senior")))
# Preview
glimpse(df)
## Rows: 6,988
## Columns: 20
## $ Gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <fct> Not Senior, Not Senior, Not Senior, Not Senior, Not S…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure           <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status           <fct> Current, Current, Left, Current, Left, Left, Current,…
# Check missing values
colSums(is.na(df))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges           Status 
##                0                0                0                0

Preparing the data set for modeling by converting TotalCharges to numeric and removing missing values. I then convert the Status and SeniorCitizen columns into factors for appropriate modeling. Finally, I confirm that all variables have the correct data type and that there are no missing values.

EDA

Now that the data is imported and cleaned, we can visualize it to see which factors may play an important role in the model we will build.

Churn Distribution

ggplot(data = df, aes(x = Status, fill = Status)) +
  geom_bar() +
  labs(title = "Customer Churn Distribution",
       x = "Customer Status", y = "Count") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

Here I am visualizing the distribution of customer status (Current vs. Left). This provides a baseline understanding of the churn rate and class imbalance.
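
To complement the plot with exact numbers, the class balance can also be tabulated directly (a minimal sketch using the cleaned df):

# Exact counts and proportions of each customer status
df %>% 
  count(Status) %>% 
  mutate(proportion = n / sum(n))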

Churn by Contract Type

ggplot(df, aes(x = Contract, fill = Status))+
  geom_bar(position = "dodge")+
  labs(title = "Churn by Contract Type",
       x = "Contract Type", y = "Count")+
  theme_minimal()+
  scale_fill_brewer(palette = "Set1")

This explores how the churn rate relates to the different contract types (Month-to-month, One year, Two year). Contract type is expected to be a strong indicator of churn behavior.
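
A small supporting table (optional, built from the same df) makes the contrast concrete by computing the churn rate within each contract type:

# Churn rate within each contract type
df %>% 
  group_by(Contract) %>% 
  summarise(customers = n(), churn_rate = mean(Status == "Left"))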

Tenure by Churn Status

ggplot(df, aes(x = Status, y = Tenure, fill = Status))+
  geom_boxplot()+
  labs(title = "Tenure by Churn Status",
       x = "Customer Status", y = "Tenure (Months)")+
  theme_minimal()+
  scale_fill_brewer(palette = "Set3")

The Tenure by Churn Status graph compares the length of customer tenure for those who stayed vs. those who left. Shorter tenure typically suggests a higher risk of churn.

Monthly Charges by Churn Status

ggplot(df, aes(x = Status, y = MonthlyCharges, fill = Status))+
  geom_boxplot()+
  labs(title = "Monthly Charges by Churn Status",
       x = "Customer Status", y = "Monthly Charges ($)")+
  theme_minimal()+
  scale_fill_brewer(palette = "Set2")

This graph analyzes how monthly billing amounts differ between customers who churned and those who stayed. High monthly charges might correlate with an increased likelihood of churn.

Churn by Gender

ggplot(df, aes(x= Gender, fill = Status))+
  geom_bar(position = "fill")+
  labs(title = "Churn Rate by Gender",
       x = "Gender", y = "Proportion")+
  theme_minimal()+
  scale_fill_brewer(palette = "Set1")+
  scale_y_continuous(labels = scales::percent)

This stacked bar chart shows the churn rate for male and female customers. The churn rates are nearly identical, indicating that gender does not appear to be a strong predictor of churn in this data set.

Churn by Senior Citizen Status

ggplot(df, aes(x = SeniorCitizen, fill = Status))+
  geom_bar(position = "fill")+
  labs(title = "Churn Rate by Senior Citizen Status",
       x = "Senior Citizen", y = "Proportion")+
  theme_minimal()+
  scale_fill_brewer(palette = "Set2")+
  scale_y_continuous(labels = scales::percent)

From this graph of churn rate by senior citizen status, we can see that senior citizens have a noticeably higher churn rate compared to non-senior customers. This suggests that being a senior is associated with an increased likelihood of churn.

Modeling Setup

After exploring key patterns in the data, we observed many factors that can be associated with customer churn. These findings suggest that predictive modeling could effectively identify customers at high risk of leaving.

Splitting the Data

set.seed(123)
split <- initial_split(df, prop = 0.7, strata = Status)
train_data <- training(split)
test_data <- testing(split)

I split the data set into a 70% training set and a 30% testing set, stratified by churn status to maintain class proportions. This ensures that model evaluation is fair and unbiased.
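
As a quick sanity check on the stratified split (not required, but cheap to verify), the churn proportion in each partition can be compared:

# Compare the churn proportion in the training and test sets
bind_rows(
  train_data %>% count(Status) %>% mutate(set = "train"),
  test_data  %>% count(Status) %>% mutate(set = "test")
) %>% 
  group_by(set) %>% 
  mutate(proportion = n / sum(n))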

Creating the Recipe

churn_recipe <- recipe(Status ~ ., data = train_data) %>% 
  step_dummy(all_nominal_predictors()) %>%  # create dummy variables for categorical predictors
  step_normalize(all_numeric_predictors())  # normalize numeric predictors

This recipe defines the preprocessing workflow used to prepare the data for modeling. Categorical predictors are converted to dummy variables and numeric predictors are normalized so that every model receives consistent input.
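
To see exactly what the models will receive after preprocessing, the recipe can be prepped and baked on the training data (an optional preview; prep() estimates the normalization statistics and bake() applies all steps):

# Preview the fully preprocessed training data
churn_recipe %>% 
  prep(training = train_data) %>% 
  bake(new_data = NULL) %>% 
  glimpse()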

Model Training and Evaluation

Logistic Regression Model

# Define logistic regression model
log_mod <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

# Create workflow
log_wf <- workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(churn_recipe)

# Cross validation setup
set.seed(123)
kfold <- vfold_cv(train_data, v = 5, strata = Status)

# Fit model
log_fit <- fit_resamples(
  log_wf,
  resamples = kfold,
  metrics = metric_set(roc_auc, accuracy),
  control = control_resamples(save_pred = TRUE)
)

# Get results
collect_metrics(log_fit)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.799     5 0.00401 Preprocessor1_Model1
## 2 roc_auc  binary     0.845     5 0.00521 Preprocessor1_Model1

The logistic regression model was evaluated using 5-fold cross-validation. It achieved an average accuracy of roughly 80% and an AUC of 0.845, which suggests the model is effective at distinguishing between customers who will stay and those who will churn. While this model performs well, we will proceed to evaluate more complex models to determine whether further predictive improvements can be made.
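
For interpretability, one optional follow-up is to fit the logistic regression workflow once on the full training set and inspect its coefficients; log_coef_fit is a name introduced here purely for illustration:

# Fit the logistic regression workflow on the full training set
log_coef_fit <- fit(log_wf, data = train_data)

# Tidy the coefficients and show the ten with the smallest p-values
log_coef_fit %>% 
  extract_fit_parsnip() %>% 
  tidy() %>% 
  arrange(p.value) %>% 
  slice_head(n = 10)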

Decision Tree Model

# Define decision tree model
tree_mod <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune()
) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

# Create workflow
tree_wf <- workflow() %>% 
  add_model(tree_mod) %>% 
  add_recipe(churn_recipe)

# Define tuning grid
tree_grid <- grid_regular(
  cost_complexity(range = c(-3,-1)),
  tree_depth(range = c(3,10)),
  levels = 5
)

# Tune model using cross-validation
set.seed(123)
tree_res <- tune_grid(
  tree_wf,
  resamples = kfold,
  grid = tree_grid,
  metrics = metric_set(roc_auc, accuracy)
)

# Get best results
show_best(tree_res, metric = "roc_auc")
## # A tibble: 5 × 8
##   cost_complexity tree_depth .metric .estimator  mean     n std_err .config     
##             <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
## 1         0.001            6 roc_auc binary     0.808     5 0.00623 Preprocesso…
## 2         0.001            8 roc_auc binary     0.803     5 0.00399 Preprocesso…
## 3         0.001           10 roc_auc binary     0.798     5 0.00455 Preprocesso…
## 4         0.00316          6 roc_auc binary     0.792     5 0.0152  Preprocesso…
## 5         0.00316          8 roc_auc binary     0.787     5 0.0152  Preprocesso…

The decision tree model was tuned across different values of cost complexity and tree depth. The best performing decision tree achieved an AUC of 0.808 with a cost complexity of 0.001 and a maximum tree depth of 6. Although this model performed well, its AUC was slightly lower than the logistic regression model's. Lastly, we will evaluate a random forest model to see if it can further improve predictive power.
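
The tuning results can also be inspected visually; autoplot() from the tune package plots the cross-validated metrics across the grid (an optional diagnostic, shown here as a sketch):

# Visualize AUC and accuracy across the tuned grid
autoplot(tree_res)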

Random Forest Model

# Define random forest model
rf_mod <- rand_forest(
  mtry = tune(),
  trees = 500,
  min_n = tune()
) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")

# Create workflow
rf_wf <- workflow() %>% 
  add_model(rf_mod) %>% 
  add_recipe(churn_recipe)

# Tune hyperparameters
rf_grid <- grid_regular(
  mtry(range = c(3,15)),
  min_n(range = c(5,30)),
  levels = 5
)

# Run tuning
set.seed(123)
rf_res <- tune_grid(
  rf_wf,
  resamples = kfold,
  grid = rf_grid,
  metrics = metric_set(roc_auc, accuracy)
)

# Show top performers
show_best(rf_res, metric = "roc_auc")
## # A tibble: 5 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     6    30 roc_auc binary     0.845     5 0.00460 Preprocessor1_Model22
## 2     3    17 roc_auc binary     0.845     5 0.00447 Preprocessor1_Model11
## 3     3    23 roc_auc binary     0.845     5 0.00471 Preprocessor1_Model16
## 4     3    30 roc_auc binary     0.845     5 0.00487 Preprocessor1_Model21
## 5     6    23 roc_auc binary     0.844     5 0.00398 Preprocessor1_Model17

The random forest model was tuned by varying the number of predictors considered at each split (mtry) and the minimum number of observations required in a node (min_n). The best performing forest achieved an AUC of 0.845 with mtry of 6 and min_n of 30. The random forest model slightly outperformed the logistic regression and decision tree models in terms of AUC. Its strong predictive performance makes it the most suitable model for predicting customer churn in this project.

Finalize and Fit Best Model (Random Forest) on Train Data

# Select best model based on AUC
best_rf <- select_best(rf_res, metric = "roc_auc")

# Finalize workflow
final_rf_wf <- finalize_workflow(rf_wf, best_rf)

# Fit final model on training data
final_rf_fit <- fit(final_rf_wf, data = train_data)

Here we finalize the random forest workflow with the best hyperparameters from the cross-validated AUC results, then fit the finalized model to the complete training data set to prepare for test set evaluation.

Predict and Evaluate on Test Set

# Predict on test set
rf_predictions <- predict(final_rf_fit, new_data = test_data, type = "prob") %>% 
  bind_cols(predict(final_rf_fit, new_data = test_data)) %>% 
  bind_cols(test_data %>% select(Status))

# Evaluate AUC
roc_auc(rf_predictions, truth = Status, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.843
# Confusion Matrix
rf_class_preds <- rf_predictions %>% 
  mutate(pred_class = .pred_class)

conf_mat(rf_class_preds, truth = Status, estimate = pred_class)
##           Truth
## Prediction Current Left
##    Current    1389  271
##    Left        151  286

The final random forest model achieved an AUC of 0.843 on the test set, closely matching its cross-validation performance during training. This strong AUC indicates that the model performs well at distinguishing customers who are likely to churn from those who will stay. The confusion matrix shows solid overall classification accuracy, though the model misses more churners (271 false negatives) than it falsely flags (151 false positives), which is worth keeping in mind when sizing a retention campaign.
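
Summary metrics such as accuracy, sensitivity, and specificity can also be derived directly from the confusion matrix with yardstick (an optional supplement to the numbers above):

# Derive additional classification metrics from the confusion matrix
# (yardstick treats the first factor level, "Current", as the event by default)
conf_mat(rf_class_preds, truth = Status, estimate = pred_class) %>% 
  summary() %>% 
  filter(.metric %in% c("accuracy", "sens", "spec"))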

Plotting the ROC curve

# Calculate the ROC curve manually
roc_data <- yardstick::roc_curve(rf_predictions, truth = Status, .pred_Current)

# Plot the ROC curve
ggplot(roc_data, aes(x = 1 - specificity, y = sensitivity)) +
  geom_path() +
  geom_abline(lty = 2, color = "gray") +
  labs(title = "ROC Curve",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)") +
  theme_minimal()

The ROC curve plots the true positive rate against the false positive rate across different classification thresholds. The curve sits well above the diagonal, and the high area under the curve (AUC of 0.843) indicates the model separates likely churners from likely stayers effectively.
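
The same curve can also be produced with yardstick's built-in plotting method, a convenient one-line alternative to the manual ggplot above:

# Equivalent one-line plot using yardstick's autoplot method
autoplot(roc_data)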

Feature Importance

# Extract fitted model
rf_model <- extract_fit_parsnip(final_rf_fit)$fit

# Plot feature importance
vip::vip(rf_model, num_features = 10)

This bar chart displays the relative importance of the top predictors used by the random forest model to classify customers as likely to stay or churn. Feature importance is calculated based on how much each variable improves the model’s ability to make accurate splits across all trees.
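
If the raw importance values are needed (for example, to report them in a table), vip::vi() should return the same information as a tibble rather than a chart:

# Top ten predictors by impurity-based importance, as a table
vip::vi(rf_model) %>% 
  slice_head(n = 10)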

After selecting the final random forest model based on its strong AUC and predictive performance, we now apply the model to conduct a business-focused analysis. Specifically, we use the model to identify customers at high risk of churning, estimate the potential monthly revenue loss if no action is taken, and propose an incentive strategy to retain valuable customers.

Business Analysis

# Predict probabilities and attach the original test data
rf_predictions <- predict(final_rf_fit, new_data = test_data, type = "prob") %>%
  bind_cols(test_data) %>%
  mutate(pred_class = if_else(.pred_Left > 0.5, "Left", "Current"))

# Now filter predicted churners
predicted_leavers <- rf_predictions %>%
  filter(pred_class == "Left")

n_predicted_leavers <- nrow(predicted_leavers)
n_predicted_leavers
## [1] 437

The finalized random forest model was used to predict the probability that each customer in the test set would churn. After generating the predictions, we filtered the results to identify the customers predicted to leave. A total of 437 customers were predicted to churn if no intervention is made.
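
The 0.5 cutoff used above is a choice rather than a requirement. As a quick illustrative check (the threshold values below are arbitrary), we can see how the number of flagged customers changes with the decision threshold:

# Count how many customers would be flagged at different probability thresholds
thresholds <- c(0.3, 0.4, 0.5, 0.6)

tibble(
  threshold = thresholds,
  n_flagged = sapply(thresholds, function(t) sum(rf_predictions$.pred_Left > t))
)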

Estimate Predicted Monthly Revenue Loss

# Sum monthly charges of predicted leavers
predicted_loss <- sum(predicted_leavers$MonthlyCharges)

predicted_loss
## [1] 33517.4

After identifying the 437 customers predicted to churn, we calculated the total monthly revenue associated with them. By summing the MonthlyCharges for all predicted leavers, we estimate that Regork Telecom is at risk of losing approximately $33,517.40 per month if no action is taken to retain them.

Propose an Incentive Plan

# Assume offering $15 discount per month for 6 months
cost_per_customer <- 15 * 6

# Total Cost offered to all predicted leavers
total_incentive_cost <- cost_per_customer * nrow(predicted_leavers)

# Revenue preserved if they stay (6 months)
revenue_preserved <- sum(predicted_leavers$MonthlyCharges) * 6

# Net Benefit
net_benefit <- revenue_preserved - total_incentive_cost

total_incentive_cost
## [1] 39330
revenue_preserved
## [1] 201104.4
net_benefit
## [1] 161774.4

To mitigate this loss, we propose offering a $15 per month discount for six months to each predicted leaver. The total cost of implementing this incentive program would be approximately $39,330. If successful in retaining these customers, the company would preserve approximately $201,104.40 in revenue over the six-month period.

After accounting for the incentive costs, the net financial benefit of retaining these customers would be approximately $161,774.40.

Based on this analysis, implementing a targeted retention campaign offering a modest discount to high-risk customers is financially justified and strongly recommended.
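
The figures above implicitly assume that every targeted customer accepts the offer and stays. As a rough sensitivity sketch (the retention rates below are hypothetical assumptions, not estimates from the data), the net benefit can be recomputed for partial success while still paying the incentive to all 437 targeted customers:

# Hypothetical retention success rates (assumptions for illustration only)
retention_rate <- c(0.25, 0.50, 0.75, 1.00)

tibble(
  retention_rate = retention_rate,
  revenue_saved  = revenue_preserved * retention_rate,
  incentive_cost = total_incentive_cost,
  net_benefit    = revenue_preserved * retention_rate - total_incentive_cost
)

Under these assumptions, even a 25% success rate saves roughly $50,276 in revenue against the $39,330 incentive cost, so the recommendation holds under fairly pessimistic scenarios.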

Conclusion

In this project, we successfully developed a predictive model to identify customers at risk of churning for Regork Telecom.

After extensive exploratory data analysis and model comparison, a random forest model was selected based on its strong cross-validation and test set performance, achieving an AUC of 0.843.

Business analysis using the model’s predictions revealed that Regork Telecom is at risk of losing approximately $33,517.40 per month if no action is taken. We proposed a cost-effective incentive plan offering a $15 per month discount to predicted leavers, resulting in an estimated net financial benefit of $161,774.40 over six months.

Overall, this project demonstrates how predictive modeling combined with targeted business strategies can drive data-informed decisions to increase customer retention and protect company revenue.