Customer retention is a critical focus for Regork, especially as the company expands its footprint in the competitive telecommunications industry. Offering services such as internet, phone, and streaming, Regork recognizes that retaining current customers is more cost-effective than acquiring new ones. Understanding why customers leave (churn) is essential for maintaining revenue and fostering long-term loyalty.
The goal of this project is to analyze customer data and build a predictive model to identify individuals at risk of leaving. This model will help Regork implement proactive strategies, such as targeted incentives, to reduce churn and maintain customer satisfaction. By leveraging data-driven insights, we aim to support the company’s efforts to enhance customer loyalty and minimize revenue losses.
Our analysis involves exploring the data to uncover trends, building machine learning models to predict churn, and providing actionable recommendations based on our findings. This comprehensive approach ensures that Regork can address churn effectively and continue its growth in the telecommunications market.
# Load libraries
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)
library(earth)
library(rpart.plot)
library(ranger)
# Load data
c_retention <- read.csv("customer_retention.csv")
str(c_retention)
## 'data.frame': 6999 obs. of 20 variables:
## $ Gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ Tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Status : chr "Current" "Current" "Left" "Current" ...
# Check for missing values
sum(is.na(c_retention))
## [1] 11
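Before dropping any rows, it helps to see which columns hold the missing values; a quick per-column check:
# Count missing values per column to locate the 11 NAs
colSums(is.na(c_retention))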
# Data preprocessing
c_retention <- c_retention %>%
mutate(Status = factor(Status)) %>%
na.omit()
sum(is.na(c_retention))
## [1] 0
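As an optional extra step, the remaining character predictors can be converted to factors up front so every modeling engine treats them consistently:
# Optional: convert character predictors to factors for consistent handling downstream
c_retention <- c_retention %>%
  mutate(across(where(is.character), as.factor))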
## Exploratory Data Analysis
### Device Protection
dp_counts <- c_retention %>%
  filter(Status == "Left") %>%
  count(DeviceProtection) %>%
  arrange(desc(n))
ggplot(dp_counts, aes(x = fct_reorder(DeviceProtection, n), y = n)) +
geom_col(fill = "black", width = 0.5) +
theme_minimal() +
labs(title = "Presence of Device Protection Among Users Who Left",
x = "Device Protection",
y = "Number of Users") +
coord_flip() +
theme(axis.text.y = element_text(size = 10))
### Contract Type
contract_counts <- c_retention %>%
  filter(Status == "Left") %>%
  count(Contract) %>%
  arrange(desc(n))
ggplot(contract_counts, aes(x = fct_reorder(Contract, n), y = n)) +
geom_col(fill = "red", width = 0.5) +
theme_minimal() +
labs(title = "Contract Types of Users Who Left",
x = "Contract Type",
y = "Number of Users") +
coord_flip() +
theme(axis.text.y = element_text(size = 10))
### Tenure
tenure_counts <- c_retention %>%
  filter(Status == "Left") %>%
  count(Tenure) %>%
  arrange(desc(n))
ggplot(tenure_counts, aes(x = Tenure, y = n)) +
geom_line(color = "yellow", linewidth = 1) +
geom_point(color = "yellow", size = 3) +
theme_minimal() +
labs(title = "Distribution of Tenure Among Departed Users",
x = "Months",
y = "Number of Users")
The graph reveals that customer churn decreases as tenure increases, reaching its lowest point around 60 months, after which it begins to rise again.
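To quantify this pattern, churn rates can be computed within tenure bands; a minimal sketch (the upper break of 72 months is an assumption about the maximum tenure):
# Churn rate within 12-month tenure bands
c_retention %>%
  mutate(TenureBand = cut(Tenure, breaks = seq(0, 72, by = 12), include.lowest = TRUE)) %>%
  group_by(TenureBand) %>%
  summarise(ChurnRate = mean(Status == "Left"), Users = n())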
### Partner and Tenure
ggplot(c_retention, aes(Tenure)) +
geom_bar(fill = "black") +
ggtitle("Partner and Tenure") +
facet_wrap(~Partner)
The graph indicates that customers with a partner tend to have longer tenures which, given the tenure pattern above, suggests they are less likely to churn than customers without one.
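A direct churn-rate comparison supports this reading; a quick sketch:
# Churn rate by partner status
c_retention %>%
  group_by(Partner) %>%
  summarise(ChurnRate = mean(Status == "Left"), Users = n())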
## Machine Learning
Our first model is logistic regression.
### Model Training and Evaluation
set.seed(123)
log_split <- initial_split(c_retention, prop = 0.5, strata = Status)
log_train <- training(log_split)
log_test <- testing(log_split)
set.seed(123)
log_kfolds <- vfold_cv(log_train, v = 5, strata = Status)
results <- logistic_reg() %>%
fit_resamples(Status ~ ., log_kfolds)
collect_metrics(results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.809 5 0.00841 Preprocessor1_Model1
## 2 brier_class binary 0.135 5 0.00329 Preprocessor1_Model1
## 3 roc_auc binary 0.847 5 0.00735 Preprocessor1_Model1
We obtained a cross-validated ROC-AUC of 0.847, a strong baseline that the later models will try to improve on.
### Final Model
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = log_train)
tidy(final_fit)
## # A tibble: 31 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.84 1.16 1.59 1.12e- 1
## 2 GenderMale -0.0753 0.0924 -0.815 4.15e- 1
## 3 SeniorCitizen 0.239 0.121 1.98 4.78e- 2
## 4 PartnerYes -0.114 0.111 -1.03 3.02e- 1
## 5 DependentsYes -0.0595 0.127 -0.468 6.40e- 1
## 6 Tenure -0.0565 0.00876 -6.45 1.09e-10
## 7 PhoneServiceYes 0.738 0.923 0.800 4.24e- 1
## 8 MultipleLinesNo phone service NA NA NA NA
## 9 MultipleLinesYes 0.498 0.250 1.99 4.67e- 2
## 10 InternetServiceFiber optic 2.26 1.14 1.98 4.79e- 2
## # ℹ 21 more rows
final_fit %>%
predict(log_test) %>%
bind_cols(log_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2272 412
## Left 294 516
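Beyond the raw confusion matrix, summary test-set metrics can be computed with yardstick (loaded via tidymodels); a sketch:
# Test-set accuracy and kappa for the logistic model
final_fit %>%
  predict(log_test) %>%
  bind_cols(log_test %>% select(Status)) %>%
  metrics(truth = Status, estimate = .pred_class)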
Our second model is multivariate adaptive regression splines (MARS).
### Data Preparation and Tuning
set.seed(123)
mars_split <- initial_split(c_retention, prop = 0.5, strata = Status)
mars_train <- training(mars_split)
mars_test <- testing(mars_split)
mars_recipe <- recipe(Status ~ ., data = mars_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
set.seed(123)
kfolds <- vfold_cv(mars_train, v = 5, strata = Status)
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode('classification')
mars_grid <- grid_regular(num_terms(range = c(1,30)), prod_degree(), levels = 40)
mars_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod)
tuning_results <- mars_wf %>%
tune_grid(resamples = kfolds, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 60 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 15 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 2 16 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 3 17 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 4 18 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 5 19 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 6 20 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 7 21 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 8 22 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 9 23 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## 10 24 1 roc_auc binary 0.850 5 0.00812 Preprocessor1_M…
## # ℹ 50 more rows
We achieved a maximum ROC-AUC of 0.850, which is promising, though we believe there is still room for improvement. This value was obtained with 15 terms and a product degree of 1, i.e., a purely additive model with no interaction terms.
autoplot(tuning_results)
### Final Model
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
add_recipe(mars_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_hyperparameters)
final_fit <- final_wf %>%
fit(data = mars_train)
final_fit %>%
extract_fit_parsnip() %>%
vip()
This model shows our most important features to be tenure, total charges, and payment method.
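To see which basis functions the final MARS model actually selected, the underlying earth fit can be pulled from the workflow; a sketch:
# Inspect the hinge functions retained by the fitted earth model
final_fit %>%
  extract_fit_engine() %>%
  summary()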
final_fit %>%
predict(mars_test) %>%
bind_cols(mars_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2312 449
## Left 254 479
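For comparability with the cross-validated results, the test-set ROC-AUC can also be computed from class probabilities (yardstick treats the first factor level, Current, as the event by default); a sketch:
# Test-set ROC-AUC for the MARS model
final_fit %>%
  predict(mars_test, type = "prob") %>%
  bind_cols(mars_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)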
Our third model is a decision tree.
### Tuning and Feature Importance
dt_mod <- decision_tree(mode = 'classification') %>%
set_engine("rpart")
model_recipe <- recipe(Status ~ ., data = log_train)
dt_fit <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
fit(data = log_train)
rpart.plot::rpart.plot(extract_fit_engine(dt_fit))
The root node of the decision tree is based on the contract type, specifically whether it is a 1-year or 2-year contract.
set.seed(123)
kfold <- vfold_cv(log_train, v = 5, strata = Status)
dt_results <- fit_resamples(dt_mod, model_recipe, kfold)
collect_metrics(dt_results)
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.788 5 0.00624 Preprocessor1_Model1
## 2 brier_class binary 0.148 5 0.00278 Preprocessor1_Model1
## 3 roc_auc binary 0.799 5 0.00604 Preprocessor1_Model1
The decision tree produced an ROC-AUC of 0.799, which is not ideal. With further tuning, we aim to develop a more effective model.
### Final Model
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 5
)
set.seed(123)
dt_results <- tune_grid(dt_mod, model_recipe, resamples = kfold, grid = dt_hyper_grid)
show_best(dt_results, metric = "roc_auc", n = 5)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 11 40 roc_auc binary 0.805 5 0.0111
## 2 0.0000000178 11 40 roc_auc binary 0.805 5 0.0111
## 3 0.00000316 11 40 roc_auc binary 0.805 5 0.0111
## 4 0.0000000001 15 40 roc_auc binary 0.805 5 0.0111
## 5 0.0000000178 15 40 roc_auc binary 0.805 5 0.0111
## # ℹ 1 more variable: .config <chr>
Tuning improves the ROC-AUC to 0.805, better than the untuned tree but still short of our strongest models.
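The tuning profile can be visualized the same way as for MARS; a one-line sketch:
# Visualize how performance varies across the decision tree's tuning grid
autoplot(dt_results)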
dt_best_model <- select_best(dt_results, metric = 'roc_auc')
dt_final_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(dt_mod) %>%
finalize_workflow(dt_best_model)
dt_final_fit <- dt_final_wf %>%
fit(data = log_train)
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
With decision trees, the most influential features identified are tenure, contract type, and total charges.
dt_final_fit %>%
predict(log_test) %>%
bind_cols(log_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2261 446
## Left 305 482
Lastly, we analyze a random forest model.
### Tuning and Feature Importance
For the random forest we reuse the same 50/50 training and test split as the earlier models, so results stay directly comparable and resampling never touches the test set.
rf_recipe <- recipe(Status ~ ., data = log_train)
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(
trees(range = c(50, 200)),
mtry(range = c(2, 10)),
min_n(range = c(1, 5)),
levels = 3)
set.seed(123)
rf_kfold <- vfold_cv(log_train, v = 5, strata = Status)
set.seed(123)
rf_results <- tune_grid(
rf_mod,
rf_recipe,
resamples = rf_kfold,
grid = rf_hyper_grid
)
collect_metrics(rf_results)
## # A tibble: 81 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 50 1 accuracy binary 0.800 5 0.00373 Preprocessor1_M…
## 2 2 50 1 brier_class binary 0.138 5 0.00197 Preprocessor1_M…
## 3 2 50 1 roc_auc binary 0.839 5 0.00555 Preprocessor1_M…
## 4 2 125 1 accuracy binary 0.802 5 0.00392 Preprocessor1_M…
## 5 2 125 1 brier_class binary 0.137 5 0.00226 Preprocessor1_M…
## 6 2 125 1 roc_auc binary 0.842 5 0.00595 Preprocessor1_M…
## 7 2 200 1 accuracy binary 0.802 5 0.00355 Preprocessor1_M…
## 8 2 200 1 brier_class binary 0.137 5 0.00227 Preprocessor1_M…
## 9 2 200 1 roc_auc binary 0.841 5 0.00619 Preprocessor1_M…
## 10 6 50 1 accuracy binary 0.791 5 0.00426 Preprocessor1_M…
## # ℹ 71 more rows
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 200 3 roc_auc binary 0.843 5 0.00548 Preprocessor1_Model12
## 2 2 125 5 roc_auc binary 0.842 5 0.00545 Preprocessor1_Model20
## 3 2 200 5 roc_auc binary 0.842 5 0.00569 Preprocessor1_Model21
## 4 2 125 1 roc_auc binary 0.842 5 0.00545 Preprocessor1_Model02
## 5 2 125 3 roc_auc binary 0.841 5 0.00620 Preprocessor1_Model11
With hyperparameter tuning, the random forest reaches a cross-validated ROC-AUC of 0.843.
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = log_train)
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
Our random forest model identifies tenure, total charges, and contract type as the most important features.
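As with the earlier models, a single comparable test-set ROC-AUC can be computed from class probabilities; a sketch:
# Test-set ROC-AUC for the tuned random forest
rf_final_fit %>%
  predict(log_test, type = "prob") %>%
  bind_cols(log_test %>% select(Status)) %>%
  roc_auc(truth = Status, .pred_Current)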
rf_final_fit %>%
predict(log_test) %>%
bind_cols(log_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 2320 484
## Left 246 444
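Collecting the cross-validated ROC-AUC values reported above gives a compact side-by-side view of the four models; a small summary sketch:
# Cross-validated ROC-AUC by model (values reported in the sections above)
tribble(
  ~Model,                    ~ROC_AUC,
  "Logistic regression",     0.847,
  "MARS (15 terms, deg 1)",  0.850,
  "Decision tree (tuned)",   0.805,
  "Random forest (tuned)",   0.843
)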
Our analysis identified the key factors driving customer churn at Regork and produced predictive models to address it. Among the models tested, MARS performed best, achieving a cross-validated ROC-AUC of 0.850 with 15 terms and a product degree of 1. Tenure, total charges, contract type, and payment method were the dominant drivers, with tenure consistently emerging as the most important feature across all models.
Based on these findings, we recommend that Regork focus on several strategies to improve customer retention. First, incentivizing long-term contracts by offering transparent pricing structures and avoiding unexpected charges can enhance customer trust and satisfaction. Second, rewarding customer loyalty through discounts, free devices for long-term customers, or bundled services can foster stronger relationships. Finally, addressing payment-related concerns by simplifying and diversifying payment methods will reduce friction and further improve customer satisfaction.
These strategies, combined with targeted interventions for at-risk customers, can help Regork reduce churn, increase customer satisfaction, and build long-term brand loyalty. While our analysis demonstrated strong predictive capabilities, it was constrained by the dataset’s size and lack of temporal data. A larger, time-series dataset would provide a more comprehensive understanding of customer behavior and churn trends over time. Additionally, exploring a broader range of machine learning models could yield even greater accuracy.
Despite these limitations, our project provides actionable insights and a solid foundation for Regork to enhance its customer retention strategies. By adopting a data-driven approach, the company is well-positioned to thrive in the competitive telecommunications industry.