Reducing Churn Rate for Regork Telecommunications

Introduction

As a new entrant into the telecommunications market, Regork will need to adjust to consumer’s preferences and understand the driving forces causing them to leave in order to thrive in the industry. This analysis addresses the business problem of predicting customer churn and creating strategies to minimize the issue of churn and potential revenue loss to the company.

To understand and address the issues with customer churn, I used a data set of Regork customers, which included information on retention status, length of time using Regork, contracts customers are signed under, and more. Using this dataset, I created a model that performs well in predicting customer churn. Through this analysis, I discovered the top variables predicting churn, and developed strategies to counteract the negative effects of these variables. This report will help Regork identify customers at a high risk of churn. I recommend offering incentives to customers at risk of churning, such as a $30 a month discount for $12 months. In addition, I recommend sending out customer service surveys to better understand pain points and improve the overall customer experience.

Data Preparation

The following packages were used in this analysis:

library(tidymodels)

library(tidyverse)

library(vip)

library(here)

library(ggplot2)

library(dplyr)

library(ranger)

library(rpart.plot)

To begin, I imported the data on customer retention. I checked for null values within the data, and then removed any from the rows. Finally, I mutated the response variable in the set, Status, as a factor for the model.

library(tidymodels)
library(tidyverse)
library(vip)
library(here)
library(ggplot2)
library(dplyr)
library(ranger)
library(rpart.plot)


cusret <- read.csv("customer_retention.csv")
colSums(is.na(cusret))
##           Gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           Tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges           Status 
##                0                0               11                0
cusret <- cusret %>% 
  drop_na(TotalCharges)

cusret <- mutate(cusret, Status = factor(Status))

From there, I calculated the baseline churn rate.

cusret %>%
  count(Status) %>%
  mutate(percent = n / sum(n) * 100)
##    Status    n  percent
## 1 Current 5132 73.44018
## 2    Left 1856 26.55982

The baseline churn rate is 26.56%. This provides insight on the current rate customers are leaving Regork, and will serve as a benchmark for model performance going forward.

Exploratory Data Analysis

cusret %>%
  pivot_longer(cols = c(OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies),
               names_to = "Service", values_to = "Value") %>%
  ggplot(aes(x = Value, fill = Status)) +
  geom_bar(position = "fill") +
  facet_wrap(~Service, scales = "free_x") +
  labs(title = "Churn Rate by Online Services")

This plot illustrates the churn rate depending on whether the customer has different online services, along with a comparison of churn rates for those with no internet service at all. Based on the data, customers that have online security and tech support have the lowest churn rates. For customers that have streaming services, there is no significant difference in churn rates from those who don’t. Interestingly, customers with no internet service have the lowest churn rates of all.

cusret %>%
  ggplot(aes(x = InternetService, fill = Status)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Internet Service")

When comparing internet service providers and their respective churn rates, we can see Fiber Optics has the highest rate, and no internet service has the most current customers associated with it. To gain a deeper insight into why this might be the case, I’ll examine the average monthly charges with each internet service type.

cusret %>%
  group_by(InternetService) %>%
  summarise(AverageMonthlyCharge = mean(MonthlyCharges), .groups = "drop") %>%
  ggplot(aes(x = InternetService, y = AverageMonthlyCharge, fill = InternetService)) +
  geom_col() +
  labs(title = "Average Monthly Charges by Internet Service Type",
       x = "Internet Service",
       y = "Average Monthly Charge") +
  theme_minimal()

Fiber optics has the highest average monthly charges, which is likely why the churn rate for this specific provider is so high. In the same vein, consumers who have no internet service pay the lowest monthly. This demonstrates the importance of pricing on churn rates. However, the method of payment is also a factor to look into.

churn_pay <- cusret %>%
  mutate(MonthlyBin = cut(MonthlyCharges,
                          breaks = c(0, 25, 50, 75, 100, 125),
                          include.lowest = TRUE,
                          right = FALSE))

churn_pay <- churn_pay %>%
  group_by(PaymentMethod, MonthlyBin) %>%
  summarise(
    Total = n(),
    Churned = sum(Status == "Left"),
    ChurnRate = Churned / Total,
    .groups = "drop"
  )


ggplot(churn_pay, aes(x = MonthlyBin, y = ChurnRate, fill = MonthlyBin)) +
  geom_bar(stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  facet_wrap(~ PaymentMethod) +
  labs(title = "Churn Rate by Monthly Charge and Payment Method",
       x = "Monthly Charge Range",
       y = "Churn Rate",
       fill = "Monthly Charges") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In examining the payment methods and the monthly charges for each, we can see that customers who pay by electronic check have the largest churn rates over all. For monthly charges, there is a unique pattern for churn rates. The churn rate mostly increases as monthly charges increase, but dip down at the highest monthly charge rate (100-125). To look further into why electronic checks as a payment method has the highest churn rate, I will compare the average monthly charges of each payment method.

cusret %>%
  group_by(PaymentMethod) %>%
  summarise(AverageMonthlyCharge = mean(MonthlyCharges), .groups = "drop") %>%
  ggplot(aes(x = reorder(PaymentMethod, -AverageMonthlyCharge), 
             y = AverageMonthlyCharge, fill = PaymentMethod)) +
  geom_col() +
  labs(title = "Average Monthly Charges by Payment Method",
       x = "Payment Method",
       y = "Average Monthly Charge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Electronic checks have the highest monthly charges, which is likely why they have the highest churn rate, as well. To discover other factors influencing churn rate, I will look into tenure and contract types impact on customers leaving Regork.

cusret %>%
  ggplot(aes(x = Tenure, fill = Status)) +
  geom_histogram(position = "identity", alpha = 0.6, bins = 30) +
  labs(title = "Tenure vs. Churn Status")

This plot demonstrates the impact tenure (amount of time using Regork) has on churn rate. As tenure increases, the ratio of current customers to customers who have left increases. This suggests that a longer tenure length leads to greater customer retention.

cusret %>%
  group_by(Contract) %>%
  summarise(AverageTenure = mean(Tenure), .groups = "drop") %>%
  ggplot(aes(x = Contract, y = AverageTenure, fill = Contract)) +
  geom_col() +
  labs(title = "Average Tenure by Contract Type",
       x = "Contract Type",
       y = "Average Tenure (Months)") +
  theme_minimal()

As visualized in this plot, new customers often have month-to-month contracts, which is likely a contributing factor to their high turnover rates. Monthly contracts have less commitment attached to them, meaning customers can leave Regork much more easily than those under one or two year contracts. Going forward, I will use machine learning to better assess the factors most influential in predicting churn rate.

Machine Learning

Logistic Regression Model

To begin, I will be using a logistic regression model.

cusret <- cusret %>%
  mutate(Status = factor(Status, levels = c("Current", "Left")))

set.seed(123)
data_split <- initial_split(cusret, prop = 0.8, strata = Status)
train_data <- training(data_split)
test_data  <- testing(data_split)


log_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%  
  step_normalize(all_numeric_predictors())

log_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_wf <- workflow() %>%
  add_model(log_mod) %>%
  add_recipe(log_recipe)

set.seed(123)
folds <- vfold_cv(train_data, v = 5, strata = Status)

log_res <- fit_resamples(
  log_wf,
  resamples = folds,
  metrics = metric_set(roc_auc, accuracy),
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(log_res)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.801     5 0.00564 Preprocessor1_Model1
## 2 roc_auc  binary     0.847     5 0.00573 Preprocessor1_Model1
final_log_fit <- fit(log_wf, data = train_data)

log_preds <- predict(final_log_fit, test_data, type = "prob") %>%
  bind_cols(predict(final_log_fit, test_data)) %>%
  bind_cols(test_data %>% select(Status))

roc_auc(log_preds, truth = Status, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.836
roc_curve(log_preds, truth = Status, .pred_Current) %>%
  autoplot()

My logistic regression model had an AUC of 0.836 on the test set, suggesting it performs well in distinguishing between customers who stay and those who leave.

conf_mat(log_preds, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current     919  159
##    Left        108  213

Based on the confusion matrix, this model had 159 false positives and 108 false negatives. The higher number of false positives means it is more likely to falsely predict churn.

Random Forest Model

The next model I will use is a random forest model.

set.seed(123)
forest_split <- initial_split(cusret, prop = 0.8, strata = Status)
train_data <- training(forest_split)
test_data <- testing(forest_split)

rforest_spec <- rand_forest(
  mode = "classification",
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger", importance = "impurity")

forest_recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes())

rf_workflow <- workflow() %>%
  add_model(rforest_spec) %>%
  add_recipe(forest_recipe)

set.seed(234)
rf_folds <- vfold_cv(train_data, v = 5, strata = Status)

rf_grid <- grid_random(
  mtry(range = c(3, 10)),
  min_n(range = c(2, 10)),
  size = 20
)

set.seed(345)
rf_tuned <- tune_grid(
  rf_workflow,
  resamples = rf_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc, accuracy),
  control = control_grid(save_pred = TRUE)
)

best_rf <- select_best(rf_tuned, metric = "roc_auc")

collect_metrics(rf_tuned) %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))
## # A tibble: 18 × 8
##     mtry min_n .metric .estimator  mean     n std_err .config              
##    <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
##  1     4     4 roc_auc binary     0.850     5 0.00293 Preprocessor1_Model12
##  2     3     8 roc_auc binary     0.850     5 0.00235 Preprocessor1_Model16
##  3     3     9 roc_auc binary     0.850     5 0.00228 Preprocessor1_Model05
##  4     4     3 roc_auc binary     0.850     5 0.00268 Preprocessor1_Model14
##  5     5     7 roc_auc binary     0.849     5 0.00305 Preprocessor1_Model09
##  6     6     8 roc_auc binary     0.848     5 0.00317 Preprocessor1_Model08
##  7     6     7 roc_auc binary     0.847     5 0.00329 Preprocessor1_Model02
##  8     5     2 roc_auc binary     0.847     5 0.00301 Preprocessor1_Model18
##  9     7     6 roc_auc binary     0.846     5 0.00318 Preprocessor1_Model01
## 10     6     2 roc_auc binary     0.846     5 0.00317 Preprocessor1_Model13
## 11     7     7 roc_auc binary     0.846     5 0.00317 Preprocessor1_Model10
## 12     7     5 roc_auc binary     0.845     5 0.00322 Preprocessor1_Model11
## 13     7     3 roc_auc binary     0.844     5 0.00336 Preprocessor1_Model07
## 14     9     7 roc_auc binary     0.843     5 0.00334 Preprocessor1_Model04
## 15    10     7 roc_auc binary     0.842     5 0.00320 Preprocessor1_Model06
## 16     9     4 roc_auc binary     0.842     5 0.00351 Preprocessor1_Model17
## 17    10     6 roc_auc binary     0.841     5 0.00309 Preprocessor1_Model15
## 18    10     2 roc_auc binary     0.839     5 0.00349 Preprocessor1_Model03
rf_conf_mat <- rf_tuned %>%
  collect_predictions() %>%
  filter(mtry == best_rf$mtry, min_n == best_rf$min_n) %>%
  mutate(.pred_class = factor(
    if_else(.pred_Current >= 0.5, "Current", "Left"),
    levels = levels(Status)
  )) %>%
  conf_mat(truth = Status, estimate = .pred_class)


rf_conf_mat
##           Truth
## Prediction Current Left
##    Current    3753  729
##    Left        352  755
rf_tuned %>%
  collect_predictions(parameters = best_rf) %>%
  roc_curve(truth = Status, .pred_Current) %>%
  autoplot()

From the confusion matrix, it can be determined that the model is far better at predicting current customers than predicting customers who have left Regork. It had 729 false negatives and 352 false positives. The highest AUC achieved by the model was 0.850, indicating that it’s a better predictor than the logistic regression model.

Decision Tree Model

The last model I will be using is a decision tree model.

tree_spec <- decision_tree(
  mode = "classification",
  cost_complexity = tune(),
  tree_depth = tune()
) %>%
  set_engine("rpart")

tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(range = c(2, 5)),  
  levels = 5
)

tree_wf <- workflow() %>%
  add_model(tree_spec) %>%
  add_recipe(forest_recipe)

set.seed(555)
tree_tuned <- tune_grid(
  tree_wf,
  resamples = rf_folds,
  grid = tree_grid,
  metrics = metric_set(roc_auc, accuracy),
  control = control_grid(save_pred = TRUE)
)

collect_metrics(tree_tuned) %>%
  arrange(desc(mean))
## # A tibble: 40 × 8
##    cost_complexity tree_depth .metric  .estimator  mean     n std_err .config   
##              <dbl>      <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>     
##  1    0.0000000001          5 accuracy binary     0.791     5 0.00288 Preproces…
##  2    0.0000000178          5 accuracy binary     0.791     5 0.00288 Preproces…
##  3    0.00000316            5 accuracy binary     0.791     5 0.00288 Preproces…
##  4    0.000562              5 accuracy binary     0.791     5 0.00288 Preproces…
##  5    0.0000000001          4 accuracy binary     0.791     5 0.00253 Preproces…
##  6    0.0000000178          4 accuracy binary     0.791     5 0.00253 Preproces…
##  7    0.00000316            4 accuracy binary     0.791     5 0.00253 Preproces…
##  8    0.000562              4 accuracy binary     0.791     5 0.00253 Preproces…
##  9    0.0000000001          2 accuracy binary     0.787     5 0.00372 Preproces…
## 10    0.0000000178          2 accuracy binary     0.787     5 0.00372 Preproces…
## # ℹ 30 more rows
best_tree <- select_best(tree_tuned, metric = "roc_auc")


final_tree_wf <- finalize_workflow(tree_wf, best_tree)

final_tree_fit <- fit(final_tree_wf, data = train_data)

tree_preds <- predict(final_tree_fit, test_data, type = "prob") %>%
  bind_cols(predict(final_tree_fit, test_data)) %>%
  bind_cols(test_data %>% select(Status))

tree_model <- extract_fit_parsnip(final_tree_fit)$fit

rpart.plot(tree_model, type = 2, extra = 106)

conf_mat(tree_preds, truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current     926  197
##    Left        101  175
roc_auc(tree_preds, truth = Status, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.796
roc_curve(tree_preds, truth = Status, .pred_Current) %>%
  autoplot()

From the confusion matrix, there were 101 false positives and 197 false negatives, signifying that it is better at predicting customers who will stay than predicting how many will churn. The decision tree model had an AUC of 0.796, which indicates a relatively strong predictive ability. However, it was lower than the logistic regression model (AUC of .836) and the random forest model (AUC of .850). Therefore, the random forest model was the most optimal model.

Optimal Model- Random Forest

Feature Importance

final_rf <- finalize_workflow(rf_workflow, best_rf) %>%
  fit(data = train_data)

vip::vip(final_rf$fit$fit, num_features = 10)

Based on the plot of feature importance, the top predictor variables include the following:

  • Tenure (time spent with the company). Customers who had shorter tenure showed a higher churn rate. This suggests that customers who have been with Regork longer have higher loyalty than those who have been using Regork for a short period of time. Regork could target newer customers through exclusive benefits or perks. They could also offer loyalty rewards for consumers who stay with Regork longer.

  • Total charges and monthly charges. High charges may lead to dissatisfaction in customers. It also lowers the accessibility for lower income customers. Loyalty rewards could be beneficial for customer retention here, as a way to give back to customers who have paid higher amounts long term. Transparency in communication around billing could also help customers understand where their charges are coming from, and notify them in advance of any increases in price.

  • Fiber Optics as the internet service provider. In the exploratory data analysis, it was discovered that Fiber Optics had the highest average monthly charges, which is likely the reason for the higher churn rates under this provider. Another possible cause is that Fiber Optics is providing a lower quality service to customers. To tackle the issues with pricing, Regork could offer bundles to other services under Fiber Optics. Regork may also send out surveys to customers using Fiber Optics to gather feedback on issues they are perceiving with this provider and ways they could correct them.

  • Contract length (higher contracts correlating with lower churn). Longer contracts likely lower churn due to the commitment they add, while customers under a month-to-month contract have an easier time leaving Regork, making them more sensitive to price or service issues. Regork could encourage customers to enter a two-year contract by offering discounts or bundles under this plan.

  • Electronic checks as the payment method. Customers paying with electronic checks may have a harder time remember to pay their bills, which could cause late fees. As shown in the exploratory data analysis, customers paying with electronic checks had the highest average monthly charges compared to other payment methods, which was shown in the plot to have high importance in the churn rate. There is also a lack of convenience with this payment method.

  • Access to online security and tech support. Online security helps consumers feel safer and more protected, while tech support helps customers resolve issues and feel supported. To remedy this, Regork could push harder to upsell online security and tech support add-ons to their customers during the onboarding process.

final_rf_wf <- finalize_workflow(rf_workflow, best_rf)

final_rf_fit <- fit(final_rf_wf, data = train_data)

rf_preds <- predict(final_rf_fit, test_data, type = "prob") %>%
  bind_cols(predict(final_rf_fit, test_data)) %>%
  bind_cols(test_data %>% select(Status))


accuracy(rf_preds, truth = Status, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.795
roc_auc(rf_preds, truth = Status, .pred_Current)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.821

The generalization error was 20.5%, which was slightly higher than the cross-validation error (19%). This suggests that the model does well in its predictions and can be a reliable tool for Regork to identify customers at risk of churning. By flagging high-risk customers early, Regork can take targeted actions to improve customer satisfaction and reduce churn.

Business Analysis

Based on feature importance plot from the optimal model, the most influential predictors of churn are tenure, monthly and total charges, customers using Fiber Optics as their service provider, contract length, and access to tech support and online security. To decrease the chances of customers leaving Regork, I would focus on offering incentives to newer customers, such as discounts on billing, to encourage consumer loyalty. I would also encourage customers to sign onto longer term contracts (One to two years) by offering discounts or bundled services under these contracts. To get to the root of why customers using Fiber Optics have a higher churn rate, I would sent out customer service surveys to gather feedback and uncover ways to improve the service. I would offer loyalty rewards and benefits to customers who have been paying higher rates in order to show appreciation for their loyalty and improve their overall customer experience.

rf_preds_full <- bind_cols(test_data, rf_preds)

churn_risk <- rf_preds_full %>%
  filter(.pred_class == "Left")

nrow(churn_risk)
## [1] 285
mean(churn_risk$MonthlyCharges)
## [1] 77.76316
predicted_loss <- churn_risk %>%
  summarize(total_loss = sum(TotalCharges, na.rm = TRUE))


predicted_loss
##   total_loss
## 1   228190.3

Based on the model’s predictions, 285 customers are at risk of churning, which would amount to a loss of $228,190.30 in revenue. The retention strategies described prior could help preserve this revenue for the company.

Let’s say one of these incentives is a $30 discount for these customers for 12 months. The total cost of this plan would amount to $102,600.00, which would provide a net gain of $125,590.30 for Regork, making it both targeted and cost-effective.

Conclusion

This analysis revealed that Regork customers with shorter tenure, higher charges, and who use Fiber Optics as their service provider, have a significantly higher churn rate than other customers. For consumers, this creates an opportunity for Regork to respond to concerns and dissatisfaction proactively by offering targeted incentives for retention. Based on the data uncovered and predictions made by the optimal model, I recommend implementing a loyalty discount, such as $30 a month off for 12 months, for customers at risk of churning. Based on my cost benefit analysis, this would result in a net gain of $125,000. In addition, gathering feedback from customers can help improve service and offerings long-term, especially for Fiber Optics users.

Limitations for this report include a lack of scope with the data set provided. There is a lack of information on customer satisfaction, such as survey scores, as well as the prevalence of competitors and the prices they offer. More data here could help reveal external drivers of churn and deeper insights into why customers would leave. In addition, further information on the demographics of customers, such as age and annual income, could help identify whether specific groups are more vulnerable to churn. This would allow Regork to create more personalized retention strategies. Another limitation would be the accuracy of the model. False positives or negatives could lead to spending resources on customers who wouldn’t have churned, or missing those who will. The model could be improved further by incorporating more data over time.