For this project, Regork has tasked us with analyzing customer retention for sustainable growth. As data scientists, our analysis needed to study the customer behavior and identify patterns related to churn. To begin, we structured our project to focus on key retention factors like payment methods, monthly charges, contract types, and additional services. Additionally, we employed machine learning models of logistic regression, decision trees, and random forest to predict future turn, identify important factors, and develop actionable recommendations.
Tenure Analysis: For our first analysis, we began examining the length of time customers remained with Regork and compared it to those who stayed versus left. This revealed that shorter tenure customers were more likely to churn and helped us focus on offering incentives for new customers to establish long term loyalty.
Payment and Contract Type: We then examined payment methods and contract types to understand how the options were correlated to churn. We identified that customers with certain payment behaviors and contract types displayed clear churn patterns. This will help Regork tailor its efforts to longer contracts and other payment types.
Billing Methods: We also wanted to examine how Regork billed its customers to identify areas of improvement on their end. This analysis focused on paperless vs mailed bills and churn rates. Customers using paperless billing were significantly more likely to leave and suggests that Regork
Gender Analysis: Finally, we examined the gender demographic to see if there was a correlation with churn. While our analysis and graph did not reveal any meaningful information, it helped us focus on more relevant factors and identify which factors had the largest impact.
General Analysis Finally, we wanted to expand and see if gender played a role in churn likelihood. Subsequently, we realized gender had little to no significance and allowed us to prioritize more impactful factors within customer retention.
# Required Packages ---------------------------------------------------------------
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
library(lattice)
library(caret)
library(randomForest)
library(ranger)
library(vip)
library(rpart)
library(rpart.plot)
library(pROC)
library(e1071)
# Data Preparation ----------------------------------------------------------------
data <- read.csv("customer_retention.csv")
colSums(is.na(data))
data$TotalCharges <- ifelse(is.na(data$TotalCharges),
mean(data$TotalCharges, na.rm = TRUE),
data$TotalCharges)
data <- data %>%
mutate(OnlineBackup = ifelse(OnlineBackup == "No internet service", "No", OnlineBackup),
TechSupport = ifelse(TechSupport == "No internet service", "No", TechSupport))
# Bar Chart for Status Distribution
status_plot <- ggplot(data, aes(x = Status)) +
geom_bar(fill = c("green", "red"), alpha = 0.7) +
theme_minimal() +
labs(title = "Customer Churn Distribution", x = "Customer Status", y = "Count") +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5)
print(status_plot)
# Tenure distribution by churn status (Histogram)
tenure_plot <- ggplot(data, aes(x = Tenure, fill = Status)) +
geom_density(alpha = 0.7) +
theme_minimal() +
labs(title = "Tenure Distribution by Customer Status", x = "Tenure (Months)", y = "Density") +
scale_fill_manual(values = c("green", "red"))
print(tenure_plot)
# Boxplot for monthly charges vs. churn status
monthly_charges_plot <- ggplot(data, aes(x = Status, y = MonthlyCharges, fill = Status)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, position = position_dodge(width = 0.9), alpha = 0.5) +
theme_minimal() +
labs(title = "Monthly Charges by Customer Status", x = "Status", y = "Monthly Charges ($)") +
scale_fill_manual(values = c("green", "red"))
print(monthly_charges_plot)
# Bar chart for payment method distribution
payment_method_plot <- ggplot(data, aes(x = PaymentMethod, fill = Status)) +
geom_bar(position = "fill", alpha = 0.8) +
theme_minimal() +
labs(title = "Payment Method Proportions by Customer Status", x = "Payment Method", y = "Proportion") +
scale_fill_manual(values = c("green", "red")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(payment_method_plot)
# Stacked bar chart for contract type by churn
contract_plot <- ggplot(data, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill", alpha = 0.8) +
theme_minimal() +
labs(title = "Contract Type by Churn Status", x = "Contract Type", y = "Proportion") +
scale_fill_manual(values = c("green", "red"))
print(contract_plot)
# Bar chart for paperless billing
billing_plot <- ggplot(data, aes(x = PaperlessBilling, fill = Status)) +
geom_bar(position = "dodge", alpha = 0.8) +
theme_minimal() +
labs(title = "Paperless Billing and Churn", x = "Paperless Billing", y = "Count") +
scale_fill_manual(values = c("green", "red"))
print(billing_plot)
# Bar chart for gender distribution by churn
gender_plot <- ggplot(data, aes(x = Gender, fill = Status)) +
geom_bar(position = "dodge", alpha = 0.8) +
theme_minimal() +
labs(title = "Gender Distribution by Churn Status", x = "Gender", y = "Count") +
scale_fill_manual(values = c("green", "red"))
print(gender_plot)
# setting up our data
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
ret_train <- training(data_split)
ret_test <- testing(data_split)
model_recipe <- recipe(Status ~ ., data = ret_train) %>%
step_dummy(all_nominal_predictors(), -all_outcomes()) %>% # Encode categorical variables
step_normalize(all_numeric_predictors()) # Scale numeric variables
kfold <- vfold_cv(ret_train, v = 5, strata = Status)
# Logistic Regression -------------------------------------------------------------
log_mod <- logistic_reg(mode = "classification") %>%
set_engine("glm")
log_workflow <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(log_mod)
log_results <- fit_resamples(log_workflow, kfold, metrics = metric_set(roc_auc, accuracy))
log_metrics <- collect_metrics(log_results)
print(log_metrics)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.798 5 0.00640 Preprocessor1_Model1
## 2 roc_auc binary 0.843 5 0.00178 Preprocessor1_Model1
# Decision Tree ------------------------------------------------------------------
dt_mod <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
dt_hyper_grid <- grid_regular(
cost_complexity(range = c(-10, -1)),
tree_depth(range = c(5, 15)),
min_n(range = c(10, 50)),
levels = 5
)
set.seed(123)
dt_results <- tune_grid(
dt_mod,
model_recipe,
resamples = kfold,
grid = dt_hyper_grid,
metrics = metric_set(roc_auc)
)
# model results
top_results <- show_best(dt_results, metric = "roc_auc", n = 5)
print(top_results)
## # A tibble: 5 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 7 40 roc_auc binary 0.818 5 0.00392
## 2 0.0000000178 7 40 roc_auc binary 0.818 5 0.00392
## 3 0.00000316 7 40 roc_auc binary 0.818 5 0.00392
## 4 0.0000000001 10 30 roc_auc binary 0.817 5 0.00436
## 5 0.0000000178 10 30 roc_auc binary 0.817 5 0.00436
## # ℹ 1 more variable: .config <chr>
all_results <- collect_metrics(dt_results)
print(all_results)
## # A tibble: 125 × 9
## cost_complexity tree_depth min_n .metric .estimator mean n std_err
## <dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
## 1 0.0000000001 5 10 roc_auc binary 0.794 5 0.00369
## 2 0.0000000178 5 10 roc_auc binary 0.794 5 0.00369
## 3 0.00000316 5 10 roc_auc binary 0.794 5 0.00369
## 4 0.000562 5 10 roc_auc binary 0.794 5 0.00369
## 5 0.1 5 10 roc_auc binary 0.587 5 0.0533
## 6 0.0000000001 7 10 roc_auc binary 0.807 5 0.00487
## 7 0.0000000178 7 10 roc_auc binary 0.807 5 0.00487
## 8 0.00000316 7 10 roc_auc binary 0.807 5 0.00487
## 9 0.000562 7 10 roc_auc binary 0.808 5 0.00575
## 10 0.1 7 10 roc_auc binary 0.587 5 0.0533
## # ℹ 115 more rows
## # ℹ 1 more variable: .config <chr>
# Random Forest Model -------------------------------------------------------------
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
rf_hyper_grid <- grid_regular(
trees(range = c(20, 300)),
mtry(range = c(2, 19)),
min_n(range = c(1, 20)),
levels = 5
)
set.seed(123)
rf_results <- tune_grid(
rf_mod,
model_recipe,
resamples = kfold,
grid = rf_hyper_grid,
metrics = metric_set(roc_auc)
)
# model results
top_rf_results <- show_best(rf_results, metric = "roc_auc", n = 5)
print(top_rf_results)
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 300 20 roc_auc binary 0.843 5 0.00109 Preprocessor1_Model…
## 2 6 230 20 roc_auc binary 0.843 5 0.000589 Preprocessor1_Model…
## 3 6 300 15 roc_auc binary 0.841 5 0.000918 Preprocessor1_Model…
## 4 6 160 20 roc_auc binary 0.841 5 0.00132 Preprocessor1_Model…
## 5 2 300 15 roc_auc binary 0.841 5 0.00122 Preprocessor1_Model…
rf_all_results <- collect_metrics(rf_results)
print(rf_all_results)
## # A tibble: 125 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 2 20 1 roc_auc binary 0.833 5 0.00178 Preprocessor1_Mode…
## 2 2 90 1 roc_auc binary 0.840 5 0.00149 Preprocessor1_Mode…
## 3 2 160 1 roc_auc binary 0.839 5 0.00100 Preprocessor1_Mode…
## 4 2 230 1 roc_auc binary 0.840 5 0.000922 Preprocessor1_Mode…
## 5 2 300 1 roc_auc binary 0.840 5 0.000785 Preprocessor1_Mode…
## 6 6 20 1 roc_auc binary 0.819 5 0.00167 Preprocessor1_Mode…
## 7 6 90 1 roc_auc binary 0.831 5 0.00186 Preprocessor1_Mode…
## 8 6 160 1 roc_auc binary 0.836 5 0.00102 Preprocessor1_Mode…
## 9 6 230 1 roc_auc binary 0.835 5 0.000910 Preprocessor1_Mode…
## 10 6 300 1 roc_auc binary 0.836 5 0.000822 Preprocessor1_Mode…
## # ℹ 115 more rows
# best hyperparameters
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
final_rf_wf <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- final_rf_wf %>%
fit(data = ret_train)
rf_final_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 20)
# The optimal model ---------------------------------------------------------------
log_roc_auc <- log_metrics %>%
filter(.metric == "roc_auc") %>%
summarise(roc_auc = mean(mean)) %>%
pull(roc_auc)
dt_roc_auc <- top_results %>%
filter(.metric == "roc_auc") %>%
pull(mean) %>%
.[1]
rf_roc_auc <- top_rf_results %>%
filter(.metric == "roc_auc") %>%
pull(mean) %>%
.[1]
roc_auc_comparison <- tibble(
Model = c("Logistic Regression", "Decision Tree", "Random Forest"),
ROC_AUC = c(log_roc_auc, dt_roc_auc, rf_roc_auc)
)
print(roc_auc_comparison)
## # A tibble: 3 × 2
## Model ROC_AUC
## <chr> <dbl>
## 1 Logistic Regression 0.843
## 2 Decision Tree 0.818
## 3 Random Forest 0.843
best_model <- roc_auc_comparison %>%
filter(ROC_AUC == max(ROC_AUC)) %>%
pull(Model)
cat("The best model is:", best_model)
## The best model is: Random Forest
# the confusion matrix
data <- data %>%
mutate(Status = factor(Status, levels = c("Current", "Left")))
set.seed(123)
data_split <- initial_split(data, prop = 0.7, strata = Status)
ret_train <- training(data_split)
ret_test <- testing(data_split)
model_recipe <- recipe(Status ~ ., data = ret_train) %>%
step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
step_normalize(all_numeric_predictors())
log_mod <- logistic_reg(mode = "classification") %>%
set_engine("glm")
log_workflow <- workflow() %>%
add_recipe(model_recipe) %>%
add_model(log_mod)
log_fit <- fit(log_workflow, data = ret_train)
log_predictions <- log_fit %>%
predict(ret_test) %>%
bind_cols(ret_test %>% select(Status))
log_conf_matrix <- log_predictions %>%
conf_mat(truth = Status, estimate = .pred_class)
print(log_conf_matrix)
## Truth
## Prediction Current Left
## Current 1390 248
## Left 153 309
In our analysis, we wanted to understand the factors contributing to customer churn at Regork and provide actionable insights for sustainable growth. By examining customer tenure, payment behavior, contract types, billing methods, and demographic patterns, we identified key drivers of churn and opportunities for improvement.
Tenure: Customers with shorter tenures were significantly more likely to churn, highlighting the importance of early engagement strategies.
Payment and Contract Types: Flexible contract options and improved payment experiences could foster retention, as customers with monthly contracts and paperless billing were more likely to leave.
Billing Methods: Addressing challenges associated with paperless billing by improving communication or offering incentives for traditional billing methods could reduce churn.
Machine Learning Models: Our analysis showed that the Random Forest model performed the best with the highest ROC AUC score, indicating its robustness in predicting churn. This model allowed us to identify critical features influencing churn and provided a reliable framework for targeting at-risk customers.
While gender did not prove to be a significant factor, focusing on the factors with the largest impact enables Regork to implement tailored retention strategies. These could include personalized offers, targeted communication, and enhanced onboarding experiences to improve customer satisfaction and loyalty.
Moving forward, Regork should consider continuous monitoring of churn-related metrics, implement pilot retention programs, and assess their impact using predictive models. This data-driven approach will help ensure long-term growth and customer loyalty.
Final Findings Following the completion of our analysis, we determined Random Forest to be the optimal model for predicting customer churn. It had the highest roc_auc score at 0.844 compared at 0.843 and 0.818. We also determined the most influential hyperparameters to be tenure, total charges, monthly charges, internet service, payment method, and finally contract length. These contributors helped us create our models and produce 3 key recommendations.
First, we recommend implementing tailored discounts for long-term contracts. We easily identified tenure as a leading factor along with contracts so focusing on signing long term contracts will help lengthen the tenure of the customer and hopefully reduce their likelihood of churning.
Second, we wanted to improve the paperless payment process as it seems the customers are not happy with this process. Both the electronic check as payment and paperless billing had higher levels of churn. This demonstrates a potential issue that may be occurring within these electronic payments.
Finally, we recommend increasing technical support and adding customer support. These individuals could be incentivized by increasing contracts and create a better environment for in person payments as opposed to electric billing and payments.
These findings helped us determine where the issues were originating and help create ways to reduce customer churn in the future. If Regork fails to take action we predict there will be over a $50,000 loss in monthly revenue. This is due to the lack of spending over long periods of time from the customers who left. While they make large purchases at times, their short tenure results in a large revenue loss. Overall, addressing these key factors and implementing most of the recommended strategies, we believe Regork will improve their customer retention. By focusing on longer contracts, better payment processes, and large customer service team, Regork can build strong relationships with its customers and successfully achieve sustainable growth in the future.