For Regork, this project will be focusing on identifying the key variables that influence customer churn to improve overall customer retention strategies going forward. The objective is to determine which of these top factors contribute most to churn utilizing insight for our data-driven decison making.I will begin by loading necessary packages and dataset for my analysis.

library(tidyverse)
library(vip)
library(tidymodels)
library(pdp)
library(kernlab)
library(baguette)
library(dplyr)
library(rpart.plot)
library(earth)


setwd("C:/Users/csp10/OneDrive/Desktop/Data Mining/uc-bana-4080")

churn <- read_csv("C:/Users/csp10/OneDrive/Desktop/Data Mining/uc-bana-4080/customer_retention (1).csv")

view(churn)

Churn Rate by Contract Type

One hypothesis I have is that those who are on a month to month plan will be more likely to leave services. This next chart I am looking to see if this hypothesis is true.

churn %>%
  group_by(Contract) %>%
  summarise(churn_rate = mean(Status == "Left")) %>%
  ggplot(aes(x = Contract, y = churn_rate)) +
  geom_col(fill = "steelblue") +
  labs(title = "Churn Rate by Contract Type",
       x = "Contract Type", y = "Churn Rate") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()

As expected, Those with month-to-month contract plans are much more likely to leave services. By focusing on this demographic, we will be able to identify how to keep those in a month-to-month service from leaving and eventually switching plans. Therefore, I will filter out the one year and 2 year demographics and identify the top factors for customer retention among this demographic

month_to_month <- churn %>%
  filter(Contract == "Month-to-month")

month_to_month <- month_to_month %>%
  mutate(Retained = ifelse(Status == "Current", 1, 0))


top_vars <- c("TechSupport", "OnlineSecurity", 
              "PhoneService", "StreamingTV", "StreamingMovies", 
              "PaperlessBilling", "MultipleLines")


ret_plot_data <- month_to_month %>%
  pivot_longer(cols = all_of(top_vars), names_to = "Variable", values_to = "Value") %>%
  group_by(Variable, Value) %>%
  summarise(RetentionRate = mean(Retained), .groups = "drop")


ggplot(ret_plot_data, aes(x = Value, y = RetentionRate, fill = Variable)) +
  geom_col() +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(title = "Customer Retention Rates by Variable (Month-to-Month Customers)",
       x = "", y = "Retention Rate") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal()

Overall this chart helps me identify that tech support, phone service, and Online Security tend to higher retention rates. Regork could use this opportunity to push these services on customers to help with retention rates. However, it is important to note that No internet Service is higher on Online Security and Tech support. There is an opportunity to identify which demographics do not have internet service to try and identify where Regork should focus their sales strategy

demo_vars <- c("Gender", "SeniorCitizen", "Dependents")


internet_summary <- map_dfr(demo_vars, function(var) {
  churn %>%
    group_by_at(var) %>%
    summarise(
      pct_no_internet = mean(InternetService == "No"),
      .groups = "drop"
    ) %>%
    mutate(
      Demographic = var,
      Group = as.character(!!sym(var))
    ) %>%
    select(Demographic, Group, pct_no_internet)
})


internet_summary <- internet_summary %>%
  mutate(Group = ifelse(Demographic == "SeniorCitizen" & Group == "0", "No",
                        ifelse(Demographic == "SeniorCitizen" & Group == "1", "Yes", Group)))


ggplot(internet_summary, aes(x = Group, y = pct_no_internet, fill = Demographic)) +
  geom_col() +
  facet_wrap(~ Demographic, scales = "free_x") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = " Demographics of Month-to-Month Customers",
    ,
    x = "", y = "Percent Without Internet"
  ) +
  theme_minimal()

This chart has a lot of takeaways. The most suprising is that the group of month to month customers is much higher in lacking internet for the younger generation . Seeing this is the biggest factor and differenvce in percentage an important push for Regork can be to focus on the younger demographic for internet services. This can leafd to more customers utilizing Online Security and TechSupport which have been proven to increase retention rates

Logistic Regression

churn <- churn %>% mutate(Status = as.factor(Status))

set.seed(123)
churn_split <- initial_split(churn, prop = 0.7, strata = Status)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)

churn_recipe <- recipe(Status ~ ., data = churn_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

log_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

wf_log <- workflow() %>%
  add_model(log_mod) %>%
  add_recipe(churn_recipe)

set.seed(123)
cv_folds <- vfold_cv(churn_train, v = 5, strata = Status)

log_results <- fit_resamples(
  wf_log,
  resamples = cv_folds,
  metrics = metric_set(roc_auc, accuracy)
)

collect_metrics(log_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.798     5 0.00463 Preprocessor1_Model1
## 2 roc_auc  binary     0.844     5 0.00421 Preprocessor1_Model1
log_final_fit <- last_fit(wf_log, churn_split)
collect_metrics(log_final_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.809 Preprocessor1_Model1
## 2 roc_auc     binary         0.848 Preprocessor1_Model1
## 3 brier_class binary         0.134 Preprocessor1_Model1
log_final_fit %>%
  collect_predictions() %>%
  conf_mat(truth = Status, estimate = .pred_class)
##           Truth
## Prediction Current Left
##    Current    1389  247
##    Left        153  310

The logistic regression model received a strong performance with an AUC of .848 and accuracy of 80.9%. However, a rank deficiency warning appeared, but did not seem to affect the predictable of the model, only interpretations of individual variables.

The next model we will test will be a Decision Tree to evaluate how a simple model will perform in predicting customer churn compared with the baseline model of the logistic regression

## Decision Tree 

tree_mod <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

wf_tree <- workflow() %>%
  add_model(tree_mod) %>%
  add_recipe(churn_recipe)

set.seed(123)
tree_results <- fit_resamples(
  wf_tree,
  resamples = cv_folds,
  metrics = metric_set(roc_auc, accuracy)
)

collect_metrics(tree_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.782     5 0.00214 Preprocessor1_Model1
## 2 roc_auc  binary     0.716     5 0.00971 Preprocessor1_Model1
tree_mod <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")




tree_model <- fit(tree_mod, Status ~ ., data = churn_train)
rpart.plot(tree_model$fit, type = 2, extra = 104, fallen.leaves = TRUE)

The output of our decision tree has an AUC of .700 which shows a moderate discriminatory ability to predict retention. The model is effective in providing a clear visual of paths, and helps us explain certain patterns behind customer retention.

An SVM model was also implemented to identify patterns in churn based on customer attributes, particularly among high-dimensional features. This model helps Regork classify churn more effectively by drawing optimal boundaries between those who stay and those who leave, improving segmentation efforts.

svm_mod <- svm_rbf() %>%
  set_engine("kernlab") %>%
  set_mode("classification")


wf_svm <- workflow() %>%
  add_model(svm_mod) %>%
  add_recipe(churn_recipe)


set.seed(123)
svm_results <- fit_resamples(
  wf_svm,
  resamples = cv_folds,
  metrics = metric_set(roc_auc, accuracy)
)


collect_metrics(svm_results)
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.804     1      NA Preprocessor1_Model1
## 2 roc_auc  binary     0.798     1      NA Preprocessor1_Model1

The output shows that the Support Vector Machine model achieved an accuracy of 80.4%, meaning it correctly predicted churn outcomes over 80% of the time. The ROC AUC score of 0.798 indicates strong model performance in distinguishing between customers who stayed and those who left. These results suggest the SVM model is an effective tool for predicting customer churn at Regork.

log_final_fit %>%
  extract_fit_parsnip() %>%
  vip()

collect_metrics(log_final_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.809 Preprocessor1_Model1
## 2 roc_auc     binary         0.848 Preprocessor1_Model1
## 3 brier_class binary         0.134 Preprocessor1_Model1

The logistic regression model was identified as the top model because it produced the top ROC AUC of 0.844 with high accuracy and performed best in predicting churn risk. It is also easily interpretable, and thus it can be used for business decision-making.

Top drivers were Tenure, Contract Type, and Total Charges. The newer customers with month-to-month contracts will likely churn, so it is imperative to have early interaction and longer contracts to keep them.

These results can be employed by Regork’s management to develop targeted approaches to reduce churn. The model’s consistent test set results affirm its value in predicting subsequent customer behavior.

Business Analysis and Conclusion

all_predictions <- collect_predictions(log_final_fit)

churn_test_pred <- bind_cols(churn_test, all_predictions)

churn_test_pred %>%
  filter(.pred_class == "Left") %>%
  summarise(
    total_monthly_loss = sum(MonthlyCharges),
    total_incentive_cost = n() * 10 * 3,
    break_even_months = total_incentive_cost / total_monthly_loss
  )
## # A tibble: 1 × 3
##   total_monthly_loss total_incentive_cost break_even_months
##                <dbl>                <dbl>             <dbl>
## 1             37379.                13890             0.372

Logistic regression was used as the optimal model since it yielded the highest ROC AUC score of 0.844, which is good predictive power in identifying churn risk. The leading predictors that were identified were Tenure, Contract Type, and Total Charges, all of which align with sensible churn behavior patterns. Customers with lower tenure, month-to-month agreements, and higher fees had higher probabilities of churning, suggesting specific risk factors that Regork can proactively monitor and address.

From the test set, the model accurately predicted 463 customers who will churn. A majority of them were short-tenure customers with flexible contract plans, suggesting that these customers are not very invested in the company’s services. If nothing is done, Regork will lose approximately $37,378.80 of monthly revenue from these customers alone. This illustrates both the financial implication of churn and the necessity to act quickly to retain these customers.

To reduce churn, I recommend offering a $10 monthly credit for three months to all at-risk customers, which would cost $13,890. Break-even would occur in just 0.37 months, so the incentive would be an affordable way to retain revenue. Regork’s management should implement this strategy right away, focusing especially on low-tenure and month-to-month customers. Proactive communication and retention incentives can strengthen customer relationships, reduce turnover, and promote long-term profitability.