For Regork, this project will be focusing on identifying the key variables that influence customer churn to improve overall customer retention strategies going forward. The objective is to determine which of these top factors contribute most to churn utilizing insight for our data-driven decison making.I will begin by loading necessary packages and dataset for my analysis.
library(tidyverse)
library(vip)
library(tidymodels)
library(pdp)
library(kernlab)
library(baguette)
library(dplyr)
library(rpart.plot)
library(earth)
setwd("C:/Users/csp10/OneDrive/Desktop/Data Mining/uc-bana-4080")
churn <- read_csv("C:/Users/csp10/OneDrive/Desktop/Data Mining/uc-bana-4080/customer_retention (1).csv")
view(churn)
One hypothesis I have is that those who are on a month to month plan will be more likely to leave services. This next chart I am looking to see if this hypothesis is true.
churn %>%
group_by(Contract) %>%
summarise(churn_rate = mean(Status == "Left")) %>%
ggplot(aes(x = Contract, y = churn_rate)) +
geom_col(fill = "steelblue") +
labs(title = "Churn Rate by Contract Type",
x = "Contract Type", y = "Churn Rate") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
As expected, Those with month-to-month contract plans are much more likely to leave services. By focusing on this demographic, we will be able to identify how to keep those in a month-to-month service from leaving and eventually switching plans. Therefore, I will filter out the one year and 2 year demographics and identify the top factors for customer retention among this demographic
month_to_month <- churn %>%
filter(Contract == "Month-to-month")
month_to_month <- month_to_month %>%
mutate(Retained = ifelse(Status == "Current", 1, 0))
top_vars <- c("TechSupport", "OnlineSecurity",
"PhoneService", "StreamingTV", "StreamingMovies",
"PaperlessBilling", "MultipleLines")
ret_plot_data <- month_to_month %>%
pivot_longer(cols = all_of(top_vars), names_to = "Variable", values_to = "Value") %>%
group_by(Variable, Value) %>%
summarise(RetentionRate = mean(Retained), .groups = "drop")
ggplot(ret_plot_data, aes(x = Value, y = RetentionRate, fill = Variable)) +
geom_col() +
facet_wrap(~ Variable, scales = "free_x") +
labs(title = "Customer Retention Rates by Variable (Month-to-Month Customers)",
x = "", y = "Retention Rate") +
scale_y_continuous(labels = scales::percent) +
theme_minimal()
Overall this chart helps me identify that tech support, phone service, and Online Security tend to higher retention rates. Regork could use this opportunity to push these services on customers to help with retention rates. However, it is important to note that No internet Service is higher on Online Security and Tech support. There is an opportunity to identify which demographics do not have internet service to try and identify where Regork should focus their sales strategy
demo_vars <- c("Gender", "SeniorCitizen", "Dependents")
internet_summary <- map_dfr(demo_vars, function(var) {
churn %>%
group_by_at(var) %>%
summarise(
pct_no_internet = mean(InternetService == "No"),
.groups = "drop"
) %>%
mutate(
Demographic = var,
Group = as.character(!!sym(var))
) %>%
select(Demographic, Group, pct_no_internet)
})
internet_summary <- internet_summary %>%
mutate(Group = ifelse(Demographic == "SeniorCitizen" & Group == "0", "No",
ifelse(Demographic == "SeniorCitizen" & Group == "1", "Yes", Group)))
ggplot(internet_summary, aes(x = Group, y = pct_no_internet, fill = Demographic)) +
geom_col() +
facet_wrap(~ Demographic, scales = "free_x") +
scale_y_continuous(labels = scales::percent) +
labs(
title = " Demographics of Month-to-Month Customers",
,
x = "", y = "Percent Without Internet"
) +
theme_minimal()
This chart has a lot of takeaways. The most suprising is that the group
of month to month customers is much higher in lacking internet for the
younger generation . Seeing this is the biggest factor and differenvce
in percentage an important push for Regork can be to focus on the
younger demographic for internet services. This can leafd to more
customers utilizing Online Security and TechSupport which have been
proven to increase retention rates
churn <- churn %>% mutate(Status = as.factor(Status))
set.seed(123)
churn_split <- initial_split(churn, prop = 0.7, strata = Status)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
churn_recipe <- recipe(Status ~ ., data = churn_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
log_mod <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
wf_log <- workflow() %>%
add_model(log_mod) %>%
add_recipe(churn_recipe)
set.seed(123)
cv_folds <- vfold_cv(churn_train, v = 5, strata = Status)
log_results <- fit_resamples(
wf_log,
resamples = cv_folds,
metrics = metric_set(roc_auc, accuracy)
)
collect_metrics(log_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.798 5 0.00463 Preprocessor1_Model1
## 2 roc_auc binary 0.844 5 0.00421 Preprocessor1_Model1
log_final_fit <- last_fit(wf_log, churn_split)
collect_metrics(log_final_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.809 Preprocessor1_Model1
## 2 roc_auc binary 0.848 Preprocessor1_Model1
## 3 brier_class binary 0.134 Preprocessor1_Model1
log_final_fit %>%
collect_predictions() %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1389 247
## Left 153 310
The logistic regression model received a strong performance with an AUC of .848 and accuracy of 80.9%. However, a rank deficiency warning appeared, but did not seem to affect the predictable of the model, only interpretations of individual variables.
The next model we will test will be a Decision Tree to evaluate how a simple model will perform in predicting customer churn compared with the baseline model of the logistic regression
## Decision Tree
tree_mod <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
wf_tree <- workflow() %>%
add_model(tree_mod) %>%
add_recipe(churn_recipe)
set.seed(123)
tree_results <- fit_resamples(
wf_tree,
resamples = cv_folds,
metrics = metric_set(roc_auc, accuracy)
)
collect_metrics(tree_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.782 5 0.00214 Preprocessor1_Model1
## 2 roc_auc binary 0.716 5 0.00971 Preprocessor1_Model1
tree_mod <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_model <- fit(tree_mod, Status ~ ., data = churn_train)
rpart.plot(tree_model$fit, type = 2, extra = 104, fallen.leaves = TRUE)
The output of our decision tree has an AUC of .700 which shows a moderate discriminatory ability to predict retention. The model is effective in providing a clear visual of paths, and helps us explain certain patterns behind customer retention.
An SVM model was also implemented to identify patterns in churn based on customer attributes, particularly among high-dimensional features. This model helps Regork classify churn more effectively by drawing optimal boundaries between those who stay and those who leave, improving segmentation efforts.
svm_mod <- svm_rbf() %>%
set_engine("kernlab") %>%
set_mode("classification")
wf_svm <- workflow() %>%
add_model(svm_mod) %>%
add_recipe(churn_recipe)
set.seed(123)
svm_results <- fit_resamples(
wf_svm,
resamples = cv_folds,
metrics = metric_set(roc_auc, accuracy)
)
collect_metrics(svm_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.804 1 NA Preprocessor1_Model1
## 2 roc_auc binary 0.798 1 NA Preprocessor1_Model1
The output shows that the Support Vector Machine model achieved an accuracy of 80.4%, meaning it correctly predicted churn outcomes over 80% of the time. The ROC AUC score of 0.798 indicates strong model performance in distinguishing between customers who stayed and those who left. These results suggest the SVM model is an effective tool for predicting customer churn at Regork.
log_final_fit %>%
extract_fit_parsnip() %>%
vip()
collect_metrics(log_final_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.809 Preprocessor1_Model1
## 2 roc_auc binary 0.848 Preprocessor1_Model1
## 3 brier_class binary 0.134 Preprocessor1_Model1
The logistic regression model was identified as the top model because it produced the top ROC AUC of 0.844 with high accuracy and performed best in predicting churn risk. It is also easily interpretable, and thus it can be used for business decision-making.
Top drivers were Tenure, Contract Type, and Total Charges. The newer customers with month-to-month contracts will likely churn, so it is imperative to have early interaction and longer contracts to keep them.
These results can be employed by Regork’s management to develop targeted approaches to reduce churn. The model’s consistent test set results affirm its value in predicting subsequent customer behavior.
all_predictions <- collect_predictions(log_final_fit)
churn_test_pred <- bind_cols(churn_test, all_predictions)
churn_test_pred %>%
filter(.pred_class == "Left") %>%
summarise(
total_monthly_loss = sum(MonthlyCharges),
total_incentive_cost = n() * 10 * 3,
break_even_months = total_incentive_cost / total_monthly_loss
)
## # A tibble: 1 × 3
## total_monthly_loss total_incentive_cost break_even_months
## <dbl> <dbl> <dbl>
## 1 37379. 13890 0.372
Logistic regression was used as the optimal model since it yielded the highest ROC AUC score of 0.844, which is good predictive power in identifying churn risk. The leading predictors that were identified were Tenure, Contract Type, and Total Charges, all of which align with sensible churn behavior patterns. Customers with lower tenure, month-to-month agreements, and higher fees had higher probabilities of churning, suggesting specific risk factors that Regork can proactively monitor and address.
From the test set, the model accurately predicted 463 customers who will churn. A majority of them were short-tenure customers with flexible contract plans, suggesting that these customers are not very invested in the company’s services. If nothing is done, Regork will lose approximately $37,378.80 of monthly revenue from these customers alone. This illustrates both the financial implication of churn and the necessity to act quickly to retain these customers.
To reduce churn, I recommend offering a $10 monthly credit for three months to all at-risk customers, which would cost $13,890. Break-even would occur in just 0.37 months, so the incentive would be an affordable way to retain revenue. Regork’s management should implement this strategy right away, focusing especially on low-tenure and month-to-month customers. Proactive communication and retention incentives can strengthen customer relationships, reduce turnover, and promote long-term profitability.