Data Mining 4080 Final Project

Introduction

With Regork’s new initiative to expand into the telecommunications market, understanding customer behavior is key to building a strong and loyal customer base. In this project, we focus on predicting whether customers are likely to stay or leave, giving Regork the opportunity to proactively invest in retention strategies and reduce churn rates.

We begin by exploring the customer retention dataset through exploratory data analysis (EDA) to identify trends, patterns, and important variables that influence customer decisions. After gaining these insights, we build and evaluate several machine learning models to determine which approach best predicts customer status.

By combining EDA with predictive modeling, this analysis provides actionable recommendations that Regork’s leadership and marketing teams can use to design targeted promotions, loyalty programs, and other customer-focused initiatives. Ultimately, the goal is to turn data-driven insights into strategies that strengthen customer relationships and drive long-term success in the telecommunications space.

Packages Required

The following pacakges are used in our analysis. In order to reproduce the results in the report you will need to install and load these packages

library(tidyverse)
library(vip)
library(tidymodels)
library(pdp)
library(kernlab)
library(baguette)
library(dplyr)
library(summarytools)
library(janitor)
library(kableExtra)

Data Preparation

Loading data

df <- read_csv('/Users/shamssadin/Library/Mobile Documents/com~apple~CloudDocs/Data/customer_retention.csv')

Initial Data Cleaning and Overview

df <- df %>% clean_names()


# Convert all character columns to factor
df <- df %>%
  mutate(across(where(is.character), as.factor))

summary(df)

##     gender     senior_citizen   partner    dependents     tenure     
##  Female:3467   Min.   :0.0000   No :3613   No :4894   Min.   : 0.00  
##  Male  :3532   1st Qu.:0.0000   Yes:3386   Yes:2105   1st Qu.: 9.00  
##                Median :0.0000                         Median :29.00  
##                Mean   :0.1619                         Mean   :32.38  
##                3rd Qu.:0.0000                         3rd Qu.:55.00  
##                Max.   :1.0000                         Max.   :72.00  
##                                                                      
##  phone_service          multiple_lines    internet_service
##  No : 676      No              :3371   DSL        :2405   
##  Yes:6323      No phone service: 676   Fiber optic:3075   
##                Yes             :2952   No         :1519   
##                                                           
##                                                           
##                                                           
##                                                           
##             online_security             online_backup 
##  No                 :3471   No                 :3070  
##  No internet service:1519   No internet service:1519  
##  Yes                :2009   Yes                :2410  
##                                                       
##                                                       
##                                                       
##                                                       
##            device_protection              tech_support 
##  No                 :3074    No                 :3448  
##  No internet service:1519    No internet service:1519  
##  Yes                :2406    Yes                :2032  
##                                                        
##                                                        
##                                                        
##                                                        
##               streaming_tv             streaming_movies           contract   
##  No                 :2792   No                 :2762    Month-to-month:3847  
##  No internet service:1519   No internet service:1519    One year      :1465  
##  Yes                :2688   Yes                :2718    Two year      :1687  
##                                                                              
##                                                                              
##                                                                              
##                                                                              
##  paperless_billing                   payment_method monthly_charges 
##  No :2862          Bank transfer (automatic):1534   Min.   : 18.25  
##  Yes:4137          Credit card (automatic)  :1512   1st Qu.: 35.48  
##                    Electronic check         :2350   Median : 70.35  
##                    Mailed check             :1603   Mean   : 64.75  
##                                                     3rd Qu.: 89.85  
##                                                     Max.   :118.75  
##                                                                     
##  total_charges        status    
##  Min.   :  18.8   Current:5143  
##  1st Qu.: 401.9   Left   :1856  
##  Median :1397.5                 
##  Mean   :2283.1                 
##  3rd Qu.:3796.9                 
##  Max.   :8684.8                 
##  NA's   :11

# Check how many NA values are in each column
colSums(is.na(df))

##            gender    senior_citizen           partner        dependents 
##                 0                 0                 0                 0 
##            tenure     phone_service    multiple_lines  internet_service 
##                 0                 0                 0                 0 
##   online_security     online_backup device_protection      tech_support 
##                 0                 0                 0                 0 
##      streaming_tv  streaming_movies          contract paperless_billing 
##                 0                 0                 0                 0 
##    payment_method   monthly_charges     total_charges            status 
##                 0                 0                11                 0

# View rows with missing TotalCharges
df %>% filter(is.na(total_charges))

## # A tibble: 11 × 20
##    gender senior_citizen partner dependents tenure phone_service multiple_lines 
##    <fct>           <dbl> <fct>   <fct>       <dbl> <fct>         <fct>          
##  1 Female              0 Yes     Yes             0 No            No phone servi…
##  2 Male                0 No      Yes             0 Yes           No             
##  3 Female              0 Yes     Yes             0 Yes           No             
##  4 Male                0 Yes     Yes             0 Yes           Yes            
##  5 Female              0 Yes     Yes             0 No            No phone servi…
##  6 Male                0 Yes     Yes             0 Yes           No             
##  7 Male                0 Yes     Yes             0 Yes           Yes            
##  8 Female              0 Yes     Yes             0 Yes           No             
##  9 Male                0 Yes     Yes             0 Yes           No             
## 10 Female              0 Yes     Yes             0 Yes           Yes            
## 11 Male                0 No      Yes             0 Yes           Yes            
## # ℹ 13 more variables: internet_service <fct>, online_security <fct>,
## #   online_backup <fct>, device_protection <fct>, tech_support <fct>,
## #   streaming_tv <fct>, streaming_movies <fct>, contract <fct>,
## #   paperless_billing <fct>, payment_method <fct>, monthly_charges <dbl>,
## #   total_charges <dbl>, status <fct>

The missing values in total charges are likely due to new customers as they have Tenure of zero years. The missing values will be dropped in order to preprocess the data for machine learning models.

df <- df %>% drop_na(total_charges)

Exploratory Data Analysis

df %>%
  count(status) %>%
  mutate(percent = n / sum(n) * 100)

## # A tibble: 2 × 3
##   status      n percent
##   <fct>   <int>   <dbl>
## 1 Current  5132    73.4
## 2 Left     1856    26.6

# Pie chart of Status
ggplot(df, aes(x="", fill=status)) +
  geom_bar(width=1) +
  coord_polar(theta="y") +
  theme_void() +
  labs(title="Customer Churn Distribution")

# Select demographic variables + Status
demo_df <- df %>%
  select(status, gender, senior_citizen, partner, dependents) %>%
  mutate(senior_citizen = as.character(senior_citizen))  

demo_long <- demo_df %>%
  pivot_longer(cols = c(gender, senior_citizen, partner, dependents),
               names_to = "Demographic",
               values_to = "Category")

# Plot
ggplot(demo_long, aes(x=Category, fill=status)) +
  geom_bar(position="fill") +
  facet_wrap(~Demographic, scales="free_x") +
  labs(title="Demographic Analysis vs Customer Status",
       y="Proportion",
       x="") +
  scale_y_continuous(labels=scales::percent) +
  theme_minimal() +
  theme(
        strip.text = element_text(face="bold"))

Customers without dependants are more likely to churn, male and female has similar churn rates, having a partner leads to lower churn rate, being a senior citizen leads to significantly higher churn rates

# Create clean table with only Current %
demo_table_current <- demo_long %>%
  group_by(Demographic, Category, status) %>%
  summarise(n = n(), .groups = "drop") %>%
  pivot_wider(names_from = status, values_from = n, values_fill = 0) %>%
  mutate(
    Total = Current + Left,
    Current = Current / Total
  ) %>%
  mutate(
    Current = scales::percent(Current, accuracy = 0.1)
  ) %>%
  select(Demographic, Category, Current) %>%
  arrange(Demographic, Category)

demo_table_current %>%
  kable(caption = "Customer Current Rate by Demographic Categories", align="c") %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover", "condensed"))

Customer Current Rate by Demographic Categories
Demographic	Category	Current
dependents	No	68.7%
dependents	Yes	84.5%
gender	Female	73.0%
gender	Male	73.9%
partner	No	67.0%
partner	Yes	80.3%
senior_citizen	0	76.4%
senior_citizen	1	58.3%

Demographic Insights Summary:

Dependents: Customers without dependents are more likely to churn compared to those with dependents.
Gender: Male and female customers have similar churn rates, with no significant difference observed.
Partner Status: Customers with a partner show a lower churn rate than those without a partner.
Senior Citizen Status: Being a senior citizen is associated with a significantly higher churn rate compared to non-senior customers.

# Prepare data 
charges_long <- df %>%
  mutate(senior_citizen = as.character(senior_citizen)) %>%  
  select(gender, senior_citizen, partner, dependents, total_charges, monthly_charges) %>%
  pivot_longer(cols = c(gender, senior_citizen, partner, dependents),
               names_to = "Demographic", values_to = "Category")

# Total Charges by Demographic Group
ggplot(charges_long, aes(x=Category, y=total_charges, fill=Category)) +
  geom_boxplot() +
  facet_wrap(~Demographic, scales="free_x", ncol=2) +
  labs(title="Total Charges by Demographic Group",
       x="Category",
       y="Total Charges") +
  theme_minimal(base_size = 14) +
  theme(legend.position="none",
        strip.text = element_text(face="bold"))

# Monthly Charges by Demographic Group
ggplot(charges_long, aes(x=Category, y=monthly_charges, fill=Category)) +
  geom_boxplot() +
  facet_wrap(~Demographic, scales="free_x", ncol=2) +
  labs(title="Monthly Charges by Demographic Group",
       x="Category",
       y="Monthly Charges") +
  theme_minimal(base_size = 14) +
  theme(legend.position="none",
        strip.text = element_text(face="bold"))

Charges Insights Summary:

Dependents: Customers with dependents have higher total charges but lower monthly charges compared to those without dependents.
Gender: Male and female customers have very similar total and monthly charges, with no significant difference observed.
Partner Status: Customers with a partner tend to have substantially higher total charges and comparitively higher monthly charges as well compared to those without a partner.
Senior Citizen Status: Senior citizens have higher total charges and higher monthly charges compared to non-senior customers.

ggplot(df, aes(x=tenure, y=monthly_charges, color=status)) +
  geom_point(alpha=0.4, size=1.2) +  # smaller, more transparent points
  geom_smooth(method="loess", se=FALSE) +  # smooth trend line without confidence interval
  labs(
    title = "Tenure vs Monthly Charges with Status Trend",
    x = "Tenure (Months)",
    y = "Monthly Charges ($)",
    color = "Customer Status"
  ) +
  theme_minimal(base_size = 14) +
  scale_color_manual(values=c("Current"="#4CAF50", "Left"="#F44336")) +  # nice green/red
  theme(
    legend.position = "bottom",
    plot.title = element_text(face="bold", hjust=0.5)
  )

ggplot(df, aes(x=contract, fill=status)) +
  geom_bar(position="fill") +
  labs(
    title = "Contract Type vs Customer Status",
    x = "Contract Type",
    y = "Proportion of Customers",
    fill = "Customer Status"
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face="bold", hjust=0.5)
  )

ggplot(df, aes(x=monthly_charges, y=status)) +
  geom_jitter(aes(color=contract), height=0.2, width=0, alpha=0.2, size=1.5) +
  facet_wrap(~contract) +
  labs(
    title = "Monthly Charges vs Customer Status by Contract Type",
    x = "Monthly Charges ($)",
    y = "Customer Status"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face="bold", hjust=0.5),
    legend.position = "none",
    strip.text = element_text(face="bold")
  )

Monthly Charges vs Customer Status by Contract Type:

Month-to-month customers show a high churn rate across all monthly charge levels, but churn is more concentrated at higher monthly charges.
One-year and two-year contract customers show a much lower churn rate, even among those with higher monthly charges.
Higher monthly charges do not necessarily lead to churn for customers with longer-term contracts, but month-to-month customers with both low and high monthly charges are at risk. Regork should focus more on longer contracts

Machine Learning

Logistic Regression

# Create training and test sets
set.seed(123)  
split <- initial_split(df, prop = .7, strata = "status")
train <- training(split)
test  <- testing(split)

# Recipe
retention_recipe <- recipe(status ~ ., data = train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# 5-fold CV
set.seed(123)
kfolds <- vfold_cv(train, v = 5, strata = status)

# Model spec
lr_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Workflow 
lr_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(lr_mod)

# Cross-validation
set.seed(123)
lr_results <- fit_resamples(lr_workflow,resamples = kfolds,)

# Collect AUC
collect_metrics(lr_results) %>%
  filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.845     5 0.00521 Preprocessor1_Model1

# Final fit 
final_log_fit <- lr_workflow %>%
  fit(data = train)

# Confusion matrix on test set
final_log_fit %>%
  predict(test) %>%
  bind_cols(test %>% select(status)) %>%
  conf_mat(truth = status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1362  225
##    Left        178  332

final_log_fit %>%
  predict(train, type = "prob") %>%
  mutate(truth = train$status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

The logistic regression model achieved a mean ROC AUC of 0.845.

On the test set, the confusion matrix shows that 1362 current customers and 332 left customers were correctly classified, while there were 225 false positives and 178 false negatives.

This model performs better at predicting when a customer is current than when they have left. Misclassifications are slightly more from false positives (predicting “Current” when the customer actually left), indicating that the model tends to overpredict customer retention.

Decision Tree

# Decision Tree Model
tree_model <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# Workflow
tree_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(tree_model)

# Fit model on training data
tree_fit <- tree_workflow %>%
  fit(data = train)

# Cross-validation
set.seed(123)
tree_cv_results <- fit_resamples(tree_workflow,resamples = kfolds)

# Collect metrics 
collect_metrics(tree_cv_results) %>% filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.710     5 0.00529 Preprocessor1_Model1

# Visualize the fitted Decision Tree
rpart.plot::rpart.plot(tree_fit$fit$fit$fit)

## Redefine tree model with tuning parameters
tree_tune_model <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

# Create grid for tuning
tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 5
)

# Tune model across grid
set.seed(123)
tree_tuned_results <- tune_grid(
  tree_tune_model,
  retention_recipe,
  resamples = kfolds,
  grid = tree_grid
)

# Display top performing models
show_best(tree_tuned_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001          8    30 roc_auc binary     0.816     5 0.00552
## 2    0.0000000178          8    30 roc_auc binary     0.816     5 0.00552
## 3    0.00000316            8    30 roc_auc binary     0.816     5 0.00552
## 4    0.0000000001          8    40 roc_auc binary     0.815     5 0.00421
## 5    0.0000000178          8    40 roc_auc binary     0.815     5 0.00421
## # ℹ 1 more variable: .config <chr>

# Select best hyperparameters
best_tree_params <- select_best(tree_tuned_results, metric = "roc_auc")

# Create final workflow
final_tree_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(tree_tune_model) %>%
  finalize_workflow(best_tree_params)

# Fit the finalized model on entire training set
final_tree_fit <- final_tree_workflow %>%
  fit(data = train)

# Confusion matrix on test data
final_tree_fit %>%
  predict(test) %>%
  bind_cols(test %>% select(status)) %>%
  conf_mat(truth = status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1316  243
##    Left        224  314

# ROC Curve for training data
final_tree_fit %>%
  predict(train, type = "prob") %>%
  mutate(truth = train$status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

The Decision Tree model also predicts current customers more accurately than customers who have left, but slightly worse than logistic regression. Misclassifications are again more from false positives (predicting “Current” when they actually left), showing a bias toward predicting customers are still active. Compared to logistic regression, the decision tree has lower ROC AUC (0.710 vs 0.845) and more errors overall on the test set.

Bagging

# Define Bagging Model 
bagging_model <- bag_tree() %>%
  set_engine("rpart", times = 5) %>%
  set_mode("classification")

# Workflow 
bagging_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(bagging_model)

# Cross-validation Results
set.seed(123)
bagging_cv <- fit_resamples(bagging_workflow,resamples = kfolds)

# Collect Performance Metrics
collect_metrics(bagging_cv) %>%
  filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.770     5 0.00507 Preprocessor1_Model1

# Redefine Bagging Model 
bagging_tune_model <- bag_tree() %>%
  set_engine("rpart", times = tune()) %>%
  set_mode("classification")

# Workflow 
bagging_tune_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(bagging_tune_model)

# Hyperparameter Grid 
bagging_grid <- tibble(times = c(5, 25, 50, 100, 200, 300))

# Cross-validation 
set.seed(123)
bagging_tune_results <- tune_grid(
  bagging_tune_workflow,
  resamples = kfolds,
  grid = bagging_grid
)

# Show Top Models
show_best(bagging_tune_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 7
##   times .metric .estimator  mean     n std_err .config             
##   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1   200 roc_auc binary     0.822     5 0.00360 Preprocessor1_Model5
## 2   300 roc_auc binary     0.821     5 0.00435 Preprocessor1_Model6
## 3   100 roc_auc binary     0.820     5 0.00476 Preprocessor1_Model4
## 4    50 roc_auc binary     0.815     5 0.00395 Preprocessor1_Model3
## 5    25 roc_auc binary     0.809     5 0.00446 Preprocessor1_Model2

# Select Best Hyperparameters
best_bagging_params <- select_best(bagging_tune_results, metric = "roc_auc")

# Final Workflow 
final_bagging_workflow <- finalize_workflow(
  bagging_tune_workflow,
  best_bagging_params
)

# Fit Final Bagging Model on Full Training Set
final_bagging_fit <- final_bagging_workflow %>%
  fit(data = train)

# Confusion Matrix on Test Set
final_bagging_fit %>%
  predict(test) %>%
  bind_cols(test %>% select(status)) %>%
  conf_mat(truth = status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1341  263
##    Left        199  294

# ROC Curve on Training Set
final_bagging_fit %>%
  predict(train, type = "prob") %>%
  mutate(truth = train$status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

The bagged model continues to predict current customers more accurately than left customers, like the earlier models. Compared to the simple decision tree (ROC AUC = 0.710), the tuned bagged tree (ROC AUC = 0.822) performs substantially better, though still slightly below the logistic regression model (ROC AUC = 0.845). The confusion matrix shows fewer false negatives (199 vs 224 in the simple tree), meaning it improves at detecting customers who have left.

Random Forest

# Random Forest
forest_model <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Workflow 
forest_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(forest_model)

# Cross-validation 
set.seed(123)
forest_cv <- fit_resamples(forest_workflow,resamples = kfolds)

# Collect Metrics
collect_metrics(forest_cv) %>%
  filter(.metric == "roc_auc")

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.842     5 0.00411 Preprocessor1_Model1

# Hyperparameter tuning
forest_tune_model <- rand_forest(
  trees = tune(),
  mtry = tune(),
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

# Workflow
forest_tune_workflow <- workflow() %>%
  add_recipe(retention_recipe) %>%
  add_model(forest_tune_model)

# Hyperparameter grid
forest_grid <- grid_regular(
  trees(range = c(50, 500)),
  mtry(range = c(2, 20)),  
  min_n(range = c(1, 20)),
  levels = 5
)

# Tune model across grid
set.seed(123)
forest_tune_results <- tune_grid(forest_tune_workflow, resamples = kfolds,grid = forest_grid)

# View top performing models 
show_best(forest_tune_results, metric = "roc_auc", n = 5)

## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     6   500    20 roc_auc binary     0.845     5 0.00422 Preprocessor1_Model1…
## 2     6   162    20 roc_auc binary     0.845     5 0.00491 Preprocessor1_Model1…
## 3     6   387    20 roc_auc binary     0.845     5 0.00450 Preprocessor1_Model1…
## 4     6    50    20 roc_auc binary     0.844     5 0.00505 Preprocessor1_Model1…
## 5     6   275    20 roc_auc binary     0.844     5 0.00402 Preprocessor1_Model1…

# Select best hyperparameters 
best_forest_params <- select_best(forest_tune_results, metric = "roc_auc")

# Finalize workflow
final_forest_workflow <- finalize_workflow(
  forest_tune_workflow,
  best_forest_params
)

# Fit the model on the training data
final_forest_fit <- final_forest_workflow %>%
  fit(data = train)

# Confusion matrix 
final_forest_fit %>%
  predict(test) %>%
  bind_cols(test %>% select(status)) %>%
  conf_mat(truth = status, estimate = .pred_class)

##           Truth
## Prediction Current Left
##    Current    1387  264
##    Left        153  293

# ROC curve 
final_forest_fit %>%
  predict(train, type = "prob") %>%
  mutate(truth = train$status) %>%
  roc_curve(truth, .pred_Current) %>%
  autoplot()

The tuned random forest model achieved a mean ROC AUC of 0.845, matching the logistic regression model in cross-validation performance. On the test set, it correctly classified 1385 current customers and 293 left customers, with 264 false positives and 155 false negatives. Compared to other models, it shows the best balance between predicting current and left customers, with fewer missed detections of customers who have left.

collect_metrics(lr_results, summarize = FALSE) %>% 
  filter(.metric == "roc_auc")

## # A tibble: 5 × 5
##   id    .metric .estimator .estimate .config             
##   <chr> <chr>   <chr>          <dbl> <chr>               
## 1 Fold1 roc_auc binary         0.836 Preprocessor1_Model1
## 2 Fold2 roc_auc binary         0.839 Preprocessor1_Model1
## 3 Fold3 roc_auc binary         0.865 Preprocessor1_Model1
## 4 Fold4 roc_auc binary         0.839 Preprocessor1_Model1
## 5 Fold5 roc_auc binary         0.845 Preprocessor1_Model1

final_fit <- lr_mod %>%
  fit(status ~ ., data = train)

tidy(final_fit)

## # A tibble: 31 × 5
##    term                           estimate std.error statistic   p.value
##    <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                      1.02     0.969      1.05    2.95e- 1
##  2 genderMale                      -0.0461   0.0779    -0.592   5.54e- 1
##  3 senior_citizen                   0.258    0.101      2.55    1.08e- 2
##  4 partnerYes                      -0.141    0.0925    -1.52    1.29e- 1
##  5 dependentsYes                   -0.0475   0.108     -0.441   6.59e- 1
##  6 tenure                          -0.0646   0.00762   -8.47    2.35e-17
##  7 phone_serviceYes                -0.0417   0.774     -0.0539  9.57e- 1
##  8 multiple_linesNo phone service  NA       NA         NA      NA       
##  9 multiple_linesYes                0.442    0.211      2.10    3.56e- 2
## 10 internet_serviceFiber optic      1.48     0.950      1.56    1.18e- 1
## # ℹ 21 more rows

vip::vip(final_fit)

Which predictor variables appear to be most influential in customer behavior?

Looking at the variable importance plot, tenure stands out as the most influential predictor. Along with that, whether a customer has a one-year or two-year contract and their total charges also seem to play major roles in predicting if they stay or leave. Basically, customers who have been with the company longer or are locked into longer contracts are much less likely to leave.

Why are those specific predictor variables the most influential?

If a customer has been around for a long time (high tenure) or has committed to a long contract, they’re more likely to stay as they’ve already built a relationship with the company. Similarly, customers with higher total charges probably reflect longer usage or higher engagement, which can also tie into loyalty. These variables showed up with the highest importance when we used the vip() plot to measure impact.

How does this generalization error compare to the cross validation error seen in earlier results?

The generalization error we got from the test set is almost identical to what we saw during cross-validation, around 0.845. That’s a really good sign because it shows that our model isn’t just memorizing the training data; it’s actually learning patterns that hold up on new, unseen data too.

As a person responsible for making business decisions, what else are you learning from the observations in this section?

The biggest takeaway here is that tenure and contract length are critical levers for customer retention. If Regork Telecom wants to reduce churn, they should focus on strategies that keep customers around longer like offering incentives for multi-year contracts or rewarding long-term loyalty. Also, the fact that our model generalizes well means we can trust its predictions when making real business decisions, like identifying at-risk customers before they actually leave. Overall, the model gives leadership a reliable tool to be proactive rather than reactive.

final_fit %>%
  predict(test) %>%
  bind_cols(test %>% select(status))

## # A tibble: 2,097 × 2
##    .pred_class status 
##    <fct>       <fct>  
##  1 Left        Current
##  2 Left        Left   
##  3 Current     Current
##  4 Current     Current
##  5 Current     Current
##  6 Current     Current
##  7 Current     Current
##  8 Current     Current
##  9 Current     Current
## 10 Left        Left   
## # ℹ 2,087 more rows

left_customers <- test %>%
  bind_cols(final_fit %>% predict(test)) %>%
  filter(.pred_class == "Left")

summary(left_customers)

##     gender    senior_citizen partner   dependents     tenure     phone_service
##  Female:264   Min.   :0.0    No :368   No :446    Min.   : 1.0   No : 52      
##  Male  :246   1st Qu.:0.0    Yes:142   Yes: 64    1st Qu.: 2.0   Yes:458      
##               Median :0.0                         Median : 6.5                
##               Mean   :0.3                         Mean   :10.4                
##               3rd Qu.:1.0                         3rd Qu.:15.0                
##               Max.   :1.0                         Max.   :56.0                
##           multiple_lines    internet_service            online_security
##  No              :227    DSL        : 74     No                 :478   
##  No phone service: 52    Fiber optic:436     No internet service:  0   
##  Yes             :231    No         :  0     Yes                : 32   
##                                                                        
##                                                                        
##                                                                        
##              online_backup           device_protection
##  No                 :407   No                 :378    
##  No internet service:  0   No internet service:  0    
##  Yes                :103   Yes                :132    
##                                                       
##                                                       
##                                                       
##               tech_support              streaming_tv
##  No                 :465   No                 :279  
##  No internet service:  0   No internet service:  0  
##  Yes                : 45   Yes                :231  
##                                                     
##                                                     
##                                                     
##             streaming_movies           contract   paperless_billing
##  No                 :262     Month-to-month:510   No : 93          
##  No internet service:  0     One year      :  0   Yes:417          
##  Yes                :248     Two year      :  0                    
##                                                                    
##                                                                    
##                                                                    
##                    payment_method monthly_charges  total_charges   
##  Bank transfer (automatic): 45    Min.   : 24.45   Min.   :  24.6  
##  Credit card (automatic)  : 40    1st Qu.: 71.46   1st Qu.: 130.6  
##  Electronic check         :371    Median : 81.28   Median : 501.5  
##  Mailed check             : 54    Mean   : 78.99   Mean   : 913.5  
##                                   3rd Qu.: 90.99   3rd Qu.:1220.8  
##                                   Max.   :111.20   Max.   :5624.9  
##      status     .pred_class 
##  Current:178   Current:  0  
##  Left   :332   Left   :510  
##                             
##                             
##                             
##

predicted_loss <- sum(left_customers$monthly_charges)

Business Analysis & Conclusion

In terms of relative importance, how would you rate the predictors in your model? As a business manager, which factors would you focus on to decrease the chances of customers leaving?

In our logistic regression model, tenure came out as the most influential predictor, closely followed by contract type (both one-year and two-year contracts) and total charges. After these, the importance of other variables drops off quite a bit.

From a business manager’s perspective, the results make a lot of sense. Tenure tells us that the longer someone stays with Regork Telecom, the more likely they are to continue staying. Contract type also plays a big role — customers locked into longer contracts are much less likely to leave.

If I were in charge of customer retention, I would focus on encouraging longer tenure and longer contracts. That could mean offering special promotions for contract renewals, discounts for customers who sign longer agreements, or even loyalty rewards for customers who hit certain tenure milestones. Helping people stay longer naturally strengthens their attachment to the company.

Collect all the customers from the test dataset that you predict are going to leave

After running predictions on the test set, we identified 510 customers who are at high risk of leaving according to the logistic regression model. When we look closer at their profiles, a lot of patterns jump out: almost all of them are on month-to-month contracts, they have shorter tenure, and many are paying higher monthly charges around $79–$90.

These customers are likely feeling less committed because they’re not locked into a long-term agreement, and with the higher monthly bills, it becomes even easier for them to decide to switch providers if they find a cheaper or better alternative.

What is the predicted loss in revenue per month if no action is taken?

If Regork Telecom does not act, the model predicts a potential monthly revenue loss of around $40,283 if these at-risk customers end up leaving.

This estimate is based on summing the monthly charges of the customers who are predicted to churn. While of course not every predicted customer may leave, this still shows a significant financial risk especially because churn tends to increase if early departures trigger more customer dissatisfaction or negative reviews.

Propose an incentive scheme to your manager to retain these customers

To help prevent these customers from leaving, I propose a targeted loyalty and rewards program. Here’s what I recommend:

Offer discounts for customers willing to switch to a one-year or two-year contract.
Create a loyalty rewards program that recognizes tenure milestones with small rewards like credits, free upgrades, or gift cards.
Personalized discounts or incentives for high-risk customers
Birthday and anniversary rewards, where customers receive a small discount or bonus to make them feel valued without dramatically impacting the company’s bottom line.

This plan directly addresses the biggest drivers of churn shown by the model: total charges, contract type, and tenure.

Conclusion

In conclusion, the best path forward for Regork Telecom is to focus on loyalty-based incentives that promote longer tenure and encourage customers to commit to longer-term contracts. Our logistic regression model showed that customers are more likely to stay when they have longer histories with the company and when their financial burden feels more manageable.

The predicted revenue loss if no action is taken is serious enough to justify investment in these strategies. While rolling out incentives will take coordination with the finance team and careful planning to manage costs, the long-term benefits — both in customer retention and company reputation would be worth it.

By acting now, Regork can not only reduce churn but also strengthen its relationship with existing customers, creating a more loyal base that helps grow the business sustainably over time.