With Regork’s new initiative to expand into the telecommunications market, understanding customer behavior is key to building a strong and loyal customer base. In this project, we focus on predicting whether customers are likely to stay or leave, giving Regork the opportunity to proactively invest in retention strategies and reduce churn rates.
We begin by exploring the customer retention dataset through exploratory data analysis (EDA) to identify trends, patterns, and important variables that influence customer decisions. After gaining these insights, we build and evaluate several machine learning models to determine which approach best predicts customer status.
By combining EDA with predictive modeling, this analysis provides actionable recommendations that Regork’s leadership and marketing teams can use to design targeted promotions, loyalty programs, and other customer-focused initiatives. Ultimately, the goal is to turn data-driven insights into strategies that strengthen customer relationships and drive long-term success in the telecommunications space.
The following pacakges are used in our analysis. In order to reproduce the results in the report you will need to install and load these packages
library(tidyverse)
library(vip)
library(tidymodels)
library(pdp)
library(kernlab)
library(baguette)
library(dplyr)
library(summarytools)
library(janitor)
library(kableExtra)
Loading data
df <- read_csv('/Users/shamssadin/Library/Mobile Documents/com~apple~CloudDocs/Data/customer_retention.csv')
Initial Data Cleaning and Overview
df <- df %>% clean_names()
# Convert all character columns to factor
df <- df %>%
mutate(across(where(is.character), as.factor))
summary(df)
## gender senior_citizen partner dependents tenure
## Female:3467 Min. :0.0000 No :3613 No :4894 Min. : 0.00
## Male :3532 1st Qu.:0.0000 Yes:3386 Yes:2105 1st Qu.: 9.00
## Median :0.0000 Median :29.00
## Mean :0.1619 Mean :32.38
## 3rd Qu.:0.0000 3rd Qu.:55.00
## Max. :1.0000 Max. :72.00
##
## phone_service multiple_lines internet_service
## No : 676 No :3371 DSL :2405
## Yes:6323 No phone service: 676 Fiber optic:3075
## Yes :2952 No :1519
##
##
##
##
## online_security online_backup
## No :3471 No :3070
## No internet service:1519 No internet service:1519
## Yes :2009 Yes :2410
##
##
##
##
## device_protection tech_support
## No :3074 No :3448
## No internet service:1519 No internet service:1519
## Yes :2406 Yes :2032
##
##
##
##
## streaming_tv streaming_movies contract
## No :2792 No :2762 Month-to-month:3847
## No internet service:1519 No internet service:1519 One year :1465
## Yes :2688 Yes :2718 Two year :1687
##
##
##
##
## paperless_billing payment_method monthly_charges
## No :2862 Bank transfer (automatic):1534 Min. : 18.25
## Yes:4137 Credit card (automatic) :1512 1st Qu.: 35.48
## Electronic check :2350 Median : 70.35
## Mailed check :1603 Mean : 64.75
## 3rd Qu.: 89.85
## Max. :118.75
##
## total_charges status
## Min. : 18.8 Current:5143
## 1st Qu.: 401.9 Left :1856
## Median :1397.5
## Mean :2283.1
## 3rd Qu.:3796.9
## Max. :8684.8
## NA's :11
# Check how many NA values are in each column
colSums(is.na(df))
## gender senior_citizen partner dependents
## 0 0 0 0
## tenure phone_service multiple_lines internet_service
## 0 0 0 0
## online_security online_backup device_protection tech_support
## 0 0 0 0
## streaming_tv streaming_movies contract paperless_billing
## 0 0 0 0
## payment_method monthly_charges total_charges status
## 0 0 11 0
# View rows with missing TotalCharges
df %>% filter(is.na(total_charges))
## # A tibble: 11 × 20
## gender senior_citizen partner dependents tenure phone_service multiple_lines
## <fct> <dbl> <fct> <fct> <dbl> <fct> <fct>
## 1 Female 0 Yes Yes 0 No No phone servi…
## 2 Male 0 No Yes 0 Yes No
## 3 Female 0 Yes Yes 0 Yes No
## 4 Male 0 Yes Yes 0 Yes Yes
## 5 Female 0 Yes Yes 0 No No phone servi…
## 6 Male 0 Yes Yes 0 Yes No
## 7 Male 0 Yes Yes 0 Yes Yes
## 8 Female 0 Yes Yes 0 Yes No
## 9 Male 0 Yes Yes 0 Yes No
## 10 Female 0 Yes Yes 0 Yes Yes
## 11 Male 0 No Yes 0 Yes Yes
## # ℹ 13 more variables: internet_service <fct>, online_security <fct>,
## # online_backup <fct>, device_protection <fct>, tech_support <fct>,
## # streaming_tv <fct>, streaming_movies <fct>, contract <fct>,
## # paperless_billing <fct>, payment_method <fct>, monthly_charges <dbl>,
## # total_charges <dbl>, status <fct>
The missing values in total charges are likely due to new customers as they have Tenure of zero years. The missing values will be dropped in order to preprocess the data for machine learning models.
df <- df %>% drop_na(total_charges)
df %>%
count(status) %>%
mutate(percent = n / sum(n) * 100)
## # A tibble: 2 × 3
## status n percent
## <fct> <int> <dbl>
## 1 Current 5132 73.4
## 2 Left 1856 26.6
# Pie chart of Status
ggplot(df, aes(x="", fill=status)) +
geom_bar(width=1) +
coord_polar(theta="y") +
theme_void() +
labs(title="Customer Churn Distribution")
# Select demographic variables + Status
demo_df <- df %>%
select(status, gender, senior_citizen, partner, dependents) %>%
mutate(senior_citizen = as.character(senior_citizen))
demo_long <- demo_df %>%
pivot_longer(cols = c(gender, senior_citizen, partner, dependents),
names_to = "Demographic",
values_to = "Category")
# Plot
ggplot(demo_long, aes(x=Category, fill=status)) +
geom_bar(position="fill") +
facet_wrap(~Demographic, scales="free_x") +
labs(title="Demographic Analysis vs Customer Status",
y="Proportion",
x="") +
scale_y_continuous(labels=scales::percent) +
theme_minimal() +
theme(
strip.text = element_text(face="bold"))
Customers without dependants are more likely to churn, male and female has similar churn rates, having a partner leads to lower churn rate, being a senior citizen leads to significantly higher churn rates
# Create clean table with only Current %
demo_table_current <- demo_long %>%
group_by(Demographic, Category, status) %>%
summarise(n = n(), .groups = "drop") %>%
pivot_wider(names_from = status, values_from = n, values_fill = 0) %>%
mutate(
Total = Current + Left,
Current = Current / Total
) %>%
mutate(
Current = scales::percent(Current, accuracy = 0.1)
) %>%
select(Demographic, Category, Current) %>%
arrange(Demographic, Category)
demo_table_current %>%
kable(caption = "Customer Current Rate by Demographic Categories", align="c") %>%
kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "hover", "condensed"))
| Demographic | Category | Current |
|---|---|---|
| dependents | No | 68.7% |
| dependents | Yes | 84.5% |
| gender | Female | 73.0% |
| gender | Male | 73.9% |
| partner | No | 67.0% |
| partner | Yes | 80.3% |
| senior_citizen | 0 | 76.4% |
| senior_citizen | 1 | 58.3% |
Demographic Insights Summary:
# Prepare data
charges_long <- df %>%
mutate(senior_citizen = as.character(senior_citizen)) %>%
select(gender, senior_citizen, partner, dependents, total_charges, monthly_charges) %>%
pivot_longer(cols = c(gender, senior_citizen, partner, dependents),
names_to = "Demographic", values_to = "Category")
# Total Charges by Demographic Group
ggplot(charges_long, aes(x=Category, y=total_charges, fill=Category)) +
geom_boxplot() +
facet_wrap(~Demographic, scales="free_x", ncol=2) +
labs(title="Total Charges by Demographic Group",
x="Category",
y="Total Charges") +
theme_minimal(base_size = 14) +
theme(legend.position="none",
strip.text = element_text(face="bold"))
# Monthly Charges by Demographic Group
ggplot(charges_long, aes(x=Category, y=monthly_charges, fill=Category)) +
geom_boxplot() +
facet_wrap(~Demographic, scales="free_x", ncol=2) +
labs(title="Monthly Charges by Demographic Group",
x="Category",
y="Monthly Charges") +
theme_minimal(base_size = 14) +
theme(legend.position="none",
strip.text = element_text(face="bold"))
Charges Insights Summary:
ggplot(df, aes(x=tenure, y=monthly_charges, color=status)) +
geom_point(alpha=0.4, size=1.2) + # smaller, more transparent points
geom_smooth(method="loess", se=FALSE) + # smooth trend line without confidence interval
labs(
title = "Tenure vs Monthly Charges with Status Trend",
x = "Tenure (Months)",
y = "Monthly Charges ($)",
color = "Customer Status"
) +
theme_minimal(base_size = 14) +
scale_color_manual(values=c("Current"="#4CAF50", "Left"="#F44336")) + # nice green/red
theme(
legend.position = "bottom",
plot.title = element_text(face="bold", hjust=0.5)
)
ggplot(df, aes(x=contract, fill=status)) +
geom_bar(position="fill") +
labs(
title = "Contract Type vs Customer Status",
x = "Contract Type",
y = "Proportion of Customers",
fill = "Customer Status"
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme_minimal(base_size = 14) +
theme(
legend.position = "bottom",
plot.title = element_text(face="bold", hjust=0.5)
)
ggplot(df, aes(x=monthly_charges, y=status)) +
geom_jitter(aes(color=contract), height=0.2, width=0, alpha=0.2, size=1.5) +
facet_wrap(~contract) +
labs(
title = "Monthly Charges vs Customer Status by Contract Type",
x = "Monthly Charges ($)",
y = "Customer Status"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(face="bold", hjust=0.5),
legend.position = "none",
strip.text = element_text(face="bold")
)
Monthly Charges vs Customer Status by Contract Type:
# Create training and test sets
set.seed(123)
split <- initial_split(df, prop = .7, strata = "status")
train <- training(split)
test <- testing(split)
# Recipe
retention_recipe <- recipe(status ~ ., data = train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
# 5-fold CV
set.seed(123)
kfolds <- vfold_cv(train, v = 5, strata = status)
# Model spec
lr_mod <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Workflow
lr_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(lr_mod)
# Cross-validation
set.seed(123)
lr_results <- fit_resamples(lr_workflow,resamples = kfolds,)
# Collect AUC
collect_metrics(lr_results) %>%
filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
# Final fit
final_log_fit <- lr_workflow %>%
fit(data = train)
# Confusion matrix on test set
final_log_fit %>%
predict(test) %>%
bind_cols(test %>% select(status)) %>%
conf_mat(truth = status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
final_log_fit %>%
predict(train, type = "prob") %>%
mutate(truth = train$status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
The logistic regression model achieved a mean ROC AUC of 0.845.
On the test set, the confusion matrix shows that 1362 current customers and 332 left customers were correctly classified, while there were 225 false positives and 178 false negatives.
This model performs better at predicting when a customer is current than when they have left. Misclassifications are slightly more from false positives (predicting “Current” when the customer actually left), indicating that the model tends to overpredict customer retention.
# Decision Tree Model
tree_model <- decision_tree(mode = "classification") %>%
set_engine("rpart")
# Workflow
tree_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(tree_model)
# Fit model on training data
tree_fit <- tree_workflow %>%
fit(data = train)
# Cross-validation
set.seed(123)
tree_cv_results <- fit_resamples(tree_workflow,resamples = kfolds)
# Collect metrics
collect_metrics(tree_cv_results) %>% filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.710 5 0.00529 Preprocessor1_Model1
# Visualize the fitted Decision Tree
rpart.plot::rpart.plot(tree_fit$fit$fit$fit)
## Redefine tree model with tuning parameters
tree_tune_model <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# Create grid for tuning
tree_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
# Tune model across grid
set.seed(123)
tree_tuned_results <- tune_grid(
tree_tune_model,
retention_recipe,
resamples = kfolds,
grid = tree_grid
)
# Display top performing models
show_best(tree_tuned_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 8 30 roc_auc binary 0.816 5 0.00552
## 2 0.0000000178 8 30 roc_auc binary 0.816 5 0.00552
## 3 0.00000316 8 30 roc_auc binary 0.816 5 0.00552
## 4 0.0000000001 8 40 roc_auc binary 0.815 5 0.00421
## 5 0.0000000178 8 40 roc_auc binary 0.815 5 0.00421
## # ℹ 1 more variable: .config <chr>
# Select best hyperparameters
best_tree_params <- select_best(tree_tuned_results, metric = "roc_auc")
# Create final workflow
final_tree_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(tree_tune_model) %>%
finalize_workflow(best_tree_params)
# Fit the finalized model on entire training set
final_tree_fit <- final_tree_workflow %>%
fit(data = train)
# Confusion matrix on test data
final_tree_fit %>%
predict(test) %>%
bind_cols(test %>% select(status)) %>%
conf_mat(truth = status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1316 243
## Left 224 314
# ROC Curve for training data
final_tree_fit %>%
predict(train, type = "prob") %>%
mutate(truth = train$status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
The Decision Tree model also predicts current customers more accurately
than customers who have left, but slightly worse than logistic
regression. Misclassifications are again more from false positives
(predicting “Current” when they actually left), showing a bias toward
predicting customers are still active. Compared to logistic regression,
the decision tree has lower ROC AUC (0.710 vs 0.845) and more errors
overall on the test set.
# Define Bagging Model
bagging_model <- bag_tree() %>%
set_engine("rpart", times = 5) %>%
set_mode("classification")
# Workflow
bagging_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(bagging_model)
# Cross-validation Results
set.seed(123)
bagging_cv <- fit_resamples(bagging_workflow,resamples = kfolds)
# Collect Performance Metrics
collect_metrics(bagging_cv) %>%
filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.770 5 0.00507 Preprocessor1_Model1
# Redefine Bagging Model
bagging_tune_model <- bag_tree() %>%
set_engine("rpart", times = tune()) %>%
set_mode("classification")
# Workflow
bagging_tune_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(bagging_tune_model)
# Hyperparameter Grid
bagging_grid <- tibble(times = c(5, 25, 50, 100, 200, 300))
# Cross-validation
set.seed(123)
bagging_tune_results <- tune_grid(
bagging_tune_workflow,
resamples = kfolds,
grid = bagging_grid
)
# Show Top Models
show_best(bagging_tune_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 7
## times .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 200 roc_auc binary 0.822 5 0.00360 Preprocessor1_Model5
## 2 300 roc_auc binary 0.821 5 0.00435 Preprocessor1_Model6
## 3 100 roc_auc binary 0.820 5 0.00476 Preprocessor1_Model4
## 4 50 roc_auc binary 0.815 5 0.00395 Preprocessor1_Model3
## 5 25 roc_auc binary 0.809 5 0.00446 Preprocessor1_Model2
# Select Best Hyperparameters
best_bagging_params <- select_best(bagging_tune_results, metric = "roc_auc")
# Final Workflow
final_bagging_workflow <- finalize_workflow(
bagging_tune_workflow,
best_bagging_params
)
# Fit Final Bagging Model on Full Training Set
final_bagging_fit <- final_bagging_workflow %>%
fit(data = train)
# Confusion Matrix on Test Set
final_bagging_fit %>%
predict(test) %>%
bind_cols(test %>% select(status)) %>%
conf_mat(truth = status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1341 263
## Left 199 294
# ROC Curve on Training Set
final_bagging_fit %>%
predict(train, type = "prob") %>%
mutate(truth = train$status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
The bagged model continues to predict current customers more accurately than left customers, like the earlier models. Compared to the simple decision tree (ROC AUC = 0.710), the tuned bagged tree (ROC AUC = 0.822) performs substantially better, though still slightly below the logistic regression model (ROC AUC = 0.845). The confusion matrix shows fewer false negatives (199 vs 224 in the simple tree), meaning it improves at detecting customers who have left.
# Random Forest
forest_model <- rand_forest() %>%
set_engine("ranger") %>%
set_mode("classification")
# Workflow
forest_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(forest_model)
# Cross-validation
set.seed(123)
forest_cv <- fit_resamples(forest_workflow,resamples = kfolds)
# Collect Metrics
collect_metrics(forest_cv) %>%
filter(.metric == "roc_auc")
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.842 5 0.00411 Preprocessor1_Model1
# Hyperparameter tuning
forest_tune_model <- rand_forest(
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
# Workflow
forest_tune_workflow <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(forest_tune_model)
# Hyperparameter grid
forest_grid <- grid_regular(
trees(range = c(50, 500)),
mtry(range = c(2, 20)),
min_n(range = c(1, 20)),
levels = 5
)
# Tune model across grid
set.seed(123)
forest_tune_results <- tune_grid(forest_tune_workflow, resamples = kfolds,grid = forest_grid)
# View top performing models
show_best(forest_tune_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 500 20 roc_auc binary 0.845 5 0.00422 Preprocessor1_Model1…
## 2 6 162 20 roc_auc binary 0.845 5 0.00491 Preprocessor1_Model1…
## 3 6 387 20 roc_auc binary 0.845 5 0.00450 Preprocessor1_Model1…
## 4 6 50 20 roc_auc binary 0.844 5 0.00505 Preprocessor1_Model1…
## 5 6 275 20 roc_auc binary 0.844 5 0.00402 Preprocessor1_Model1…
# Select best hyperparameters
best_forest_params <- select_best(forest_tune_results, metric = "roc_auc")
# Finalize workflow
final_forest_workflow <- finalize_workflow(
forest_tune_workflow,
best_forest_params
)
# Fit the model on the training data
final_forest_fit <- final_forest_workflow %>%
fit(data = train)
# Confusion matrix
final_forest_fit %>%
predict(test) %>%
bind_cols(test %>% select(status)) %>%
conf_mat(truth = status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1387 264
## Left 153 293
# ROC curve
final_forest_fit %>%
predict(train, type = "prob") %>%
mutate(truth = train$status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
The tuned random forest model achieved a mean ROC AUC of 0.845, matching
the logistic regression model in cross-validation performance. On the
test set, it correctly classified 1385 current customers and 293 left
customers, with 264 false positives and 155 false negatives. Compared to
other models, it shows the best balance between predicting current and
left customers, with fewer missed detections of customers who have
left.
collect_metrics(lr_results, summarize = FALSE) %>%
filter(.metric == "roc_auc")
## # A tibble: 5 × 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold1 roc_auc binary 0.836 Preprocessor1_Model1
## 2 Fold2 roc_auc binary 0.839 Preprocessor1_Model1
## 3 Fold3 roc_auc binary 0.865 Preprocessor1_Model1
## 4 Fold4 roc_auc binary 0.839 Preprocessor1_Model1
## 5 Fold5 roc_auc binary 0.845 Preprocessor1_Model1
final_fit <- lr_mod %>%
fit(status ~ ., data = train)
tidy(final_fit)
## # A tibble: 31 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.02 0.969 1.05 2.95e- 1
## 2 genderMale -0.0461 0.0779 -0.592 5.54e- 1
## 3 senior_citizen 0.258 0.101 2.55 1.08e- 2
## 4 partnerYes -0.141 0.0925 -1.52 1.29e- 1
## 5 dependentsYes -0.0475 0.108 -0.441 6.59e- 1
## 6 tenure -0.0646 0.00762 -8.47 2.35e-17
## 7 phone_serviceYes -0.0417 0.774 -0.0539 9.57e- 1
## 8 multiple_linesNo phone service NA NA NA NA
## 9 multiple_linesYes 0.442 0.211 2.10 3.56e- 2
## 10 internet_serviceFiber optic 1.48 0.950 1.56 1.18e- 1
## # ℹ 21 more rows
vip::vip(final_fit)
Which predictor variables appear to be most influential in customer behavior?
Looking at the variable importance plot, tenure stands out as the most influential predictor. Along with that, whether a customer has a one-year or two-year contract and their total charges also seem to play major roles in predicting if they stay or leave. Basically, customers who have been with the company longer or are locked into longer contracts are much less likely to leave.
Why are those specific predictor variables the most influential?
If a customer has been around for a long time (high tenure) or has committed to a long contract, they’re more likely to stay as they’ve already built a relationship with the company. Similarly, customers with higher total charges probably reflect longer usage or higher engagement, which can also tie into loyalty. These variables showed up with the highest importance when we used the vip() plot to measure impact.
How does this generalization error compare to the cross validation error seen in earlier results?
The generalization error we got from the test set is almost identical to what we saw during cross-validation, around 0.845. That’s a really good sign because it shows that our model isn’t just memorizing the training data; it’s actually learning patterns that hold up on new, unseen data too.
As a person responsible for making business decisions, what else are you learning from the observations in this section?
The biggest takeaway here is that tenure and contract length are critical levers for customer retention. If Regork Telecom wants to reduce churn, they should focus on strategies that keep customers around longer like offering incentives for multi-year contracts or rewarding long-term loyalty. Also, the fact that our model generalizes well means we can trust its predictions when making real business decisions, like identifying at-risk customers before they actually leave. Overall, the model gives leadership a reliable tool to be proactive rather than reactive.
final_fit %>%
predict(test) %>%
bind_cols(test %>% select(status))
## # A tibble: 2,097 × 2
## .pred_class status
## <fct> <fct>
## 1 Left Current
## 2 Left Left
## 3 Current Current
## 4 Current Current
## 5 Current Current
## 6 Current Current
## 7 Current Current
## 8 Current Current
## 9 Current Current
## 10 Left Left
## # ℹ 2,087 more rows
left_customers <- test %>%
bind_cols(final_fit %>% predict(test)) %>%
filter(.pred_class == "Left")
summary(left_customers)
## gender senior_citizen partner dependents tenure phone_service
## Female:264 Min. :0.0 No :368 No :446 Min. : 1.0 No : 52
## Male :246 1st Qu.:0.0 Yes:142 Yes: 64 1st Qu.: 2.0 Yes:458
## Median :0.0 Median : 6.5
## Mean :0.3 Mean :10.4
## 3rd Qu.:1.0 3rd Qu.:15.0
## Max. :1.0 Max. :56.0
## multiple_lines internet_service online_security
## No :227 DSL : 74 No :478
## No phone service: 52 Fiber optic:436 No internet service: 0
## Yes :231 No : 0 Yes : 32
##
##
##
## online_backup device_protection
## No :407 No :378
## No internet service: 0 No internet service: 0
## Yes :103 Yes :132
##
##
##
## tech_support streaming_tv
## No :465 No :279
## No internet service: 0 No internet service: 0
## Yes : 45 Yes :231
##
##
##
## streaming_movies contract paperless_billing
## No :262 Month-to-month:510 No : 93
## No internet service: 0 One year : 0 Yes:417
## Yes :248 Two year : 0
##
##
##
## payment_method monthly_charges total_charges
## Bank transfer (automatic): 45 Min. : 24.45 Min. : 24.6
## Credit card (automatic) : 40 1st Qu.: 71.46 1st Qu.: 130.6
## Electronic check :371 Median : 81.28 Median : 501.5
## Mailed check : 54 Mean : 78.99 Mean : 913.5
## 3rd Qu.: 90.99 3rd Qu.:1220.8
## Max. :111.20 Max. :5624.9
## status .pred_class
## Current:178 Current: 0
## Left :332 Left :510
##
##
##
##
predicted_loss <- sum(left_customers$monthly_charges)
In terms of relative importance, how would you rate the predictors in your model? As a business manager, which factors would you focus on to decrease the chances of customers leaving?
In our logistic regression model, tenure came out as the most influential predictor, closely followed by contract type (both one-year and two-year contracts) and total charges. After these, the importance of other variables drops off quite a bit.
From a business manager’s perspective, the results make a lot of sense. Tenure tells us that the longer someone stays with Regork Telecom, the more likely they are to continue staying. Contract type also plays a big role — customers locked into longer contracts are much less likely to leave.
If I were in charge of customer retention, I would focus on encouraging longer tenure and longer contracts. That could mean offering special promotions for contract renewals, discounts for customers who sign longer agreements, or even loyalty rewards for customers who hit certain tenure milestones. Helping people stay longer naturally strengthens their attachment to the company.
Collect all the customers from the test dataset that you predict are going to leave
After running predictions on the test set, we identified 510 customers who are at high risk of leaving according to the logistic regression model. When we look closer at their profiles, a lot of patterns jump out: almost all of them are on month-to-month contracts, they have shorter tenure, and many are paying higher monthly charges around $79–$90.
These customers are likely feeling less committed because they’re not locked into a long-term agreement, and with the higher monthly bills, it becomes even easier for them to decide to switch providers if they find a cheaper or better alternative.
What is the predicted loss in revenue per month if no action is taken?
If Regork Telecom does not act, the model predicts a potential monthly revenue loss of around $40,283 if these at-risk customers end up leaving.
This estimate is based on summing the monthly charges of the customers who are predicted to churn. While of course not every predicted customer may leave, this still shows a significant financial risk especially because churn tends to increase if early departures trigger more customer dissatisfaction or negative reviews.
Propose an incentive scheme to your manager to retain these customers
To help prevent these customers from leaving, I propose a targeted loyalty and rewards program. Here’s what I recommend:
This plan directly addresses the biggest drivers of churn shown by the model: total charges, contract type, and tenure.
Conclusion
In conclusion, the best path forward for Regork Telecom is to focus on loyalty-based incentives that promote longer tenure and encourage customers to commit to longer-term contracts. Our logistic regression model showed that customers are more likely to stay when they have longer histories with the company and when their financial burden feels more manageable.
The predicted revenue loss if no action is taken is serious enough to justify investment in these strategies. While rolling out incentives will take coordination with the finance team and careful planning to manage costs, the long-term benefits — both in customer retention and company reputation would be worth it.
By acting now, Regork can not only reduce churn but also strengthen its relationship with existing customers, creating a more loyal base that helps grow the business sustainably over time.