Introduction

Customer retention is one of the biggest challenges in the telecommunications sector, where the cost of acquiring a new customer commonly exceeds that of keeping an existing one. This analysis explores customer data to identify trends, builds predictive machine learning models to determine the drivers of churn, and delivers actionable recommendations for the business. Applying the insights from this project will give Regork the ability to make informed resource allocation decisions, create targeted retention strategies, and in turn improve customer satisfaction and bottom-line profits.

Required Packages

tidyverse - For data manipulation and visualization.
tidymodels - For building machine learning workflows and models.
vip - For visualizing feature importance in machine learning models.
pdp - For generating partial dependence plots.
ggthemes - For enhancing the aesthetics of ggplot visualizations.

library(tidyverse)
library(tidymodels)
library(vip)
library(pdp)
library(ggthemes)

# Adjust the file path to your actual dataset location
data <- read.csv("customer_retention.csv", stringsAsFactors = FALSE)

# Convert Status to factor if not already
data$Status <- as.factor(data$Status)

# Check the structure to ensure required columns are present
str(data)
## 'data.frame':    6999 obs. of  20 variables:
##  $ Gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ Tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Status          : Factor w/ 2 levels "Current","Left": 1 1 2 1 2 2 1 1 2 1 ...
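
Before modeling, it is also worth checking for missing values, since the summaries below use na.rm = TRUE. A quick sanity check on the loaded data frame (not part of the original workflow):

# Count missing values per column; columns with NAs may need imputation or
# row removal before the recipe steps defined later.
colSums(is.na(data))

# Number of rows containing at least one missing value.
sum(!complete.cases(data))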

Exploratory Analysis

Status Distribution

This graph reveals the overall churn rate by showing the proportion of customers who are still active (Current) versus those who have left (Left). A significant proportion of customers have churned, providing a baseline for the analysis. The churn rate suggests a pressing need to identify the factors contributing to customer loss. This initial finding highlights the scope of the issue and emphasizes the importance of retention strategies in reducing churn.

data %>% 
  count(Status) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot(aes(x = Status, y = prop, fill = Status)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Status Distribution", x = "Status", y = "Proportion") +
  theme_minimal()

Tenure Distribution by Status

This histogram visualizes how tenure (duration of customer association) differs between active and churned customers. Customers who have churned tend to have shorter tenures compared to those who remain active. Longer tenures are indicative of customer loyalty. Regork should investigate what drives early attrition and implement strategies to engage new customers effectively during the initial months.

data %>%
  ggplot(aes(x = Tenure, fill = Status)) +
  geom_histogram(bins = 30, position = "dodge", alpha = 0.7) +
  labs(title = "Tenure Distribution by Status", x = "Tenure (months)", y = "Count") +
  theme_minimal()

Gender and Multiple Lines Distribution by Status

This grouped bar plot examines the relationship between gender, having multiple lines, and customer churn. The presence of multiple lines appears to vary by customer status, but no strong patterns emerge based on gender alone.

data %>%
  ggplot(aes(x = Gender, fill = MultipleLines)) +
  geom_bar(position = "dodge") +
  facet_wrap(~Status) +
  labs(title = "Gender and MultipleLines Distribution by Status",
       x = "Gender", y = "Count", fill = "Multiple Lines") +
  theme_minimal()

Distribution of Status by Contract Type with Average Monthly Charges

This graph presents the distribution of customer churn (Status) by contract type (Month-to-month, One year, and Two year) and yields one of the most important insights of the analysis: the longer the contract's duration, the more likely customers are to remain with the company. Customers on month-to-month contracts churn at a rate of about 45%, those on one-year contracts at only about 10%, and customers on two-year contracts at less than 5%. This finding underscores the importance of contract type as a key variable influencing customer retention.

Given these results, Regork should focus on promoting longer-term contracts as a strategy to reduce churn. The longer customers stay under contract, the more stable and cost-effective their relationship becomes, providing Regork with a more predictable revenue stream. Additionally, offering competitive pricing models that reward longer-term commitments could further increase customer retention. One potential solution is to introduce 3- and 6-month contract options. While short-term contracts still carry some risk, they could serve as a bridge for customers, allowing them to experience the company’s value before committing to longer-term agreements. This approach would likely reduce churn and generate more revenue, helping Regork create a more stable and profitable customer base.

data_summary <- data %>%
  group_by(Contract, Status) %>%
  summarize(
    count = n(),
    avg_monthly = mean(MonthlyCharges, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(Contract) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

ggplot(data_summary, aes(x = Contract, y = prop, fill = Status)) +
  geom_col() +
  geom_text(
    aes(label = round(avg_monthly, 1)), 
    position = position_stack(vjust = 0.5), 
    color = "white", 
    size = 3
  ) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Distribution of Status by Contract Type with Average Monthly Charges",
    x = "Contract Type",
    y = "Proportion of Customers"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
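
To double-check the churn percentages quoted above (roughly 45% for month-to-month, about 10% for one-year, and under 5% for two-year contracts), the proportions can be printed directly from the data_summary object built for the plot; a small sketch:

# Churn rate per contract type: within-contract proportion of "Left" customers.
data_summary %>%
  filter(Status == "Left") %>%
  select(Contract, churn_rate = prop)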

Machine Learning

set.seed(123)
split <- initial_split(data, prop = 0.7, strata = Status)
train <- training(split)
test <- testing(split)

rec <- recipe(Status ~ Tenure + MonthlyCharges + Gender + MultipleLines + InternetService + 
                OnlineSecurity + OnlineBackup + DeviceProtection + TechSupport +
                StreamingTV + StreamingMovies + Contract + PaymentMethod,
              data = train) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>% 
  step_normalize(all_numeric_predictors())

rec

set.seed(123)
folds <- vfold_cv(train, v = 5, strata = Status)

log_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(log_mod)

log_res <- fit_resamples(log_wf, resamples = folds, metrics = metric_set(roc_auc, accuracy))
collect_metrics(log_res)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.796     5 0.00452 Preprocessor1_Model1
## 2 roc_auc  binary     0.842     5 0.00402 Preprocessor1_Model1
dt_mod <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = tune()
) %>%
  set_engine("rpart")

dt_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(dt_mod)

dt_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  min_n(),
  levels = 3
)

set.seed(123)
dt_res <- tune_grid(dt_wf, resamples = folds, grid = dt_grid, metrics = metric_set(roc_auc, accuracy))
show_best(dt_res, metric = "roc_auc")
## # A tibble: 5 × 9
##   cost_complexity tree_depth min_n .metric .estimator  mean     n std_err
##             <dbl>      <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
## 1    0.0000000001         15    40 roc_auc binary     0.820     5 0.00544
## 2    0.00000316           15    40 roc_auc binary     0.820     5 0.00544
## 3    0.0000000001          8    40 roc_auc binary     0.819     5 0.00445
## 4    0.00000316            8    40 roc_auc binary     0.819     5 0.00445
## 5    0.0000000001          8    21 roc_auc binary     0.817     5 0.00323
## # ℹ 1 more variable: .config <chr>
dt_best <- select_best(dt_res, metric = "roc_auc")

dt_final_wf <- finalize_workflow(dt_wf, dt_best)
dt_final_fit <- fit(dt_final_wf, data = train)

rf_mod <- rand_forest(
  mode = "classification",
  mtry = tune(),
  trees = 500,
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

rf_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(rf_mod)

rf_grid <- grid_regular(mtry(range = c(2, 10)), min_n(range = c(1,10)), levels = 5)

set.seed(123)
rf_res <- tune_grid(rf_wf, resamples = folds, grid = rf_grid, metrics = metric_set(roc_auc, accuracy))
show_best(rf_res, metric = "roc_auc")
## # A tibble: 5 × 8
##    mtry min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4    10 roc_auc binary     0.843     5 0.00416 Preprocessor1_Model22
## 2     4     7 roc_auc binary     0.843     5 0.00419 Preprocessor1_Model17
## 3     4     5 roc_auc binary     0.842     5 0.00389 Preprocessor1_Model12
## 4     4     3 roc_auc binary     0.841     5 0.00386 Preprocessor1_Model07
## 5     4     1 roc_auc binary     0.841     5 0.00402 Preprocessor1_Model02
rf_best <- select_best(rf_res, metric = "roc_auc")

rf_final_wf <- finalize_workflow(rf_wf, rf_best)
rf_final_fit <- fit(rf_final_wf, data = train)

test_preds <- predict(rf_final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(Status))

roc_auc(test_preds, truth = Status, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.838
test_class_preds <- predict(rf_final_fit, test) %>%
  bind_cols(test %>% select(Status))

conf_mat(test_class_preds, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1409  281
##    Left        134  276
rf_final_fit %>%
  extract_fit_parsnip() %>%
  vip::vip(num_features = 10)

Logistic Regression

AUC (Cross-validated): 0.842. Logistic regression performed well and offers good interpretability, but it may lack the flexibility to capture non-linear relationships compared to more complex models. Its test-set performance was not computed in the code above; a sketch for doing so follows.
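
For completeness, the logistic regression workflow can be fit on the full training set and scored on the test set, mirroring the random forest evaluation above. This is a minimal sketch assuming the log_wf, train, and test objects defined earlier; no test-set figure is reported here because it was not computed above.

# Fit the logistic regression workflow on the training set and compute its
# test-set ROC AUC, mirroring the random forest evaluation above.
log_final_fit <- fit(log_wf, data = train)

log_test_preds <- predict(log_final_fit, test, type = "prob") %>%
  bind_cols(test %>% select(Status))

roc_auc(log_test_preds, truth = Status, .pred_Current)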

Decision Tree

Best AUC (Cross-validated): 0.820. Tuned parameters: cost_complexity, tree_depth, and min_n. Decision trees are easy to interpret but performed slightly worse than other models in this analysis.
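
To see how the tuned hyperparameters trade off against one another, the cross-validation results can be plotted directly; this brief sketch reuses the dt_res and dt_best objects created above (autoplot() on tuning results comes from the tune package).

# Cross-validated performance across the decision tree tuning grid.
autoplot(dt_res)

# The hyperparameter combination selected for the final decision tree fit.
dt_best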

Random Forest

Best AUC (Cross-validated): 0.843. Test Set AUC: 0.838. Tuned parameters: mtry (number of variables sampled at each split) and min_n. Random Forest emerged as the best-performing model, offering robust performance and capturing complex interactions in the data.

Confusion Matrix Analysis for the Random Forest model

True Positives: 276 (customers who left and were correctly identified). False Positives: 134 (customers predicted to leave who actually stayed). True Negatives: 1409 (correctly predicted customers who stayed). False Negatives: 281 (customers who left but were predicted to stay). False positives are relatively low, so comparatively few retention offers would be wasted, but the false negatives are substantial: the model misses roughly half of the customers who actually churned (281 of 557), suggesting that a lower classification threshold may be worthwhile if catching churners is the priority.
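
Class-level metrics make this trade-off explicit. The sketch below treats "Left" as the event of interest; because yardstick defaults to the first factor level ("Current") as the event, event_level = "second" is required. The 0.35 cutoff in the second step is purely illustrative, not a tuned value.

# Sensitivity, specificity, and precision for the churn class ("Left").
churn_metrics <- metric_set(accuracy, sens, spec, precision)
churn_metrics(test_class_preds, truth = Status, estimate = .pred_class,
              event_level = "second")

# Lowering the probability cutoff flags more potential churners at the cost
# of more false positives; 0.35 is an illustrative value.
test_preds %>%
  mutate(.pred_class = factor(if_else(.pred_Left > 0.35, "Left", "Current"),
                              levels = levels(test$Status))) %>%
  conf_mat(truth = Status, estimate = .pred_class)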

Feature Importance

The most influential features identified by the Random Forest model are listed below (a partial dependence sketch for Tenure, the strongest predictor, follows the list):

Tenure: The strongest predictor of churn. Longer tenure correlates with loyalty, reducing churn risk.

Monthly Charges: Higher monthly charges are associated with increased churn likelihood, likely due to cost sensitivity.

Internet Service (Fiber Optic): Customers with fiber optic service showed a higher tendency to churn, possibly due to higher costs or service issues.

Contract Type: Longer contracts (one-year or two-year) significantly reduce churn, as they lock customers in for extended periods.

Payment Method: Customers using electronic checks churn more often, possibly due to dissatisfaction or perceived inconvenience.

Additional Services: Add-ons like online security and tech support reduce churn by enhancing the perceived value of the service.
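
The pdp package is listed among the required packages, but no partial dependence plot appears in the code above. The following is a minimal manual sketch of the same idea, using only the fitted workflow (rf_final_fit) and the training data: predicted churn probabilities are averaged over the training rows for each value in a grid of tenures.

# Manual partial dependence for Tenure: for each grid value, set every
# training row's Tenure to that value and average the predicted P(Left).
tenure_grid <- seq(min(train$Tenure), max(train$Tenure), length.out = 25)

pdp_tenure <- map_dfr(tenure_grid, function(t) {
  tibble(
    Tenure = t,
    avg_pred_left = mean(
      predict(rf_final_fit, mutate(train, Tenure = t), type = "prob")$.pred_Left
    )
  )
})

ggplot(pdp_tenure, aes(x = Tenure, y = avg_pred_left)) +
  geom_line() +
  labs(title = "Partial Dependence of Predicted Churn on Tenure",
       x = "Tenure (months)", y = "Average predicted P(Left)") +
  theme_minimal()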

Model Assessment

Generalization Error: The test set AUC (0.838) closely matches the cross-validated AUC (0.843), indicating good generalization and minimal overfitting.

Optimal Model: Random Forest is the optimal model, balancing performance and the ability to capture complex patterns in the data.
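
The same train/test comparison can be reproduced in one step with last_fit(), which refits the finalized workflow on the training portion of the split and evaluates it on the held-out test portion. A brief sketch using the split and rf_final_wf objects defined above:

# One-step final evaluation: fit on the training data, score on the test split.
set.seed(123)
rf_last <- last_fit(rf_final_wf, split, metrics = metric_set(roc_auc, accuracy))
collect_metrics(rf_last)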

Business Analysis

The predictive models, particularly the Random Forest model with an AUC of 0.843, provide valuable insights into customer churn at Regork Telecom. Key business insights suggest targeting high-risk customers: those with short tenures, high monthly charges, fiber optic service, or electronic check payments. Offering targeted discounts, service upgrades, or personalized plans could help retain these customers.

A crucial finding from the analysis is the significant impact of contract length on churn: month-to-month contracts have a much higher churn rate (45%) compared to one-year (10%) and two-year contracts (5%). This indicates that encouraging customers to adopt longer-term contracts could be a powerful retention strategy. By offering incentives such as discounts or loyalty bonuses, Regork can reduce churn and promote long-term stability.

The analysis also highlights the importance of enhancing add-on services like online security, tech support, and device protection, as these services increase customer satisfaction and make customers less likely to churn. Lastly, pricing adjustments should be considered, particularly for fiber optic services and customers with higher bills. Offering bundle deals or revising pricing models could help retain cost-sensitive customers who are more likely to leave due to dissatisfaction with service costs.

By leveraging these insights, Regork can implement focused retention strategies that not only reduce churn but also improve customer loyalty and satisfaction. Targeting high-risk customers with personalized offers, promoting long-term contracts, and adjusting pricing strategies will help Regork optimize its business performance, ensuring a more stable and profitable customer base moving forward.
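
As a hypothetical illustration of the targeting idea above, the fitted model can score the currently active customers and surface those with the highest predicted churn probability. The 0.5 cutoff is an assumption rather than a tuned threshold, and because these rows include data the model was trained on, the probabilities will be optimistic; in practice the scoring would be run on fresh data.

# Score active customers and flag those with a high predicted churn probability.
current_customers <- data %>% filter(Status == "Current")

at_risk <- current_customers %>%
  bind_cols(predict(rf_final_fit, current_customers, type = "prob")) %>%
  filter(.pred_Left > 0.5) %>%   # illustrative cutoff, not tuned
  arrange(desc(.pred_Left)) %>%
  select(Tenure, Contract, MonthlyCharges, PaymentMethod, .pred_Left)

head(at_risk)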

Conclusion

The analysis reveals that customer churn at Regork Telecom is strongly influenced by contract type, tenure, monthly charges, and service usage. The Random Forest model proved to be the most effective tool for predicting churn, providing actionable insights that can help Regork design targeted retention strategies. By focusing on high-risk customers with short tenures, high monthly charges, and fiber optic services, as well as promoting longer-term contracts with incentives, Regork can reduce churn significantly. Enhancing add-on services and adjusting pricing models for high-bill customers can further improve customer satisfaction and retention. Overall, by implementing these insights, Regork can optimize its retention efforts, ensuring a more stable customer base and boosting profitability in the long term.