We have been hired to work for Regork to analyze customer data and find a model that predicts which customers will leave in the future so that Regrok can take action to retain these customers. This is very valuable knowledge for the CEO of Regrok because in many cases the costs to attract new customers are much larger than to retain existing customers, so these finding will help Regork become more profitable.
In order to solve this problem, we applied three models machine learning models: Decision Tree Model, Regularized Regression Model, and the Multiple Adaptive Regression Splines (MARS) Model. With all three models split the data, collected the metrics, and assessed the coefficient significance. We then determined which model was the most accurate based upon the area under the curve metric. We found the Multiple Adaptive Regression Splines Model to be the most accurate because it had the the highest AUC. We then used this model to predict which customers would leave and how much revenue would be lost when they leave.
Our analysis will help the Regork Telecom CEO because by identifying which customers are mostly likely to churn and determining their similar characteristics, we can implement strategies to directly combat their churning. First, we proposed a tenure-based loyalty incentive program. Since tenure is the most relevant variable in terms of churning, it is critical to come up with a strategy to combat newer customers leaving. In this program customers will receive perks and milestone rewards to incentivize them to remain with Regrok. The second part of the strategy involves direct customer service. Regork should regularly being reaching out to their customers to address issues and questions about contracts head on in order to lessen the amount of customers churning.
library(tidyverse)
library(tidymodels)
library(baguette)
library(vip)
library(pdp)
library(here)
library(kernlab)
library(ggplot2)
library(ranger)
library(earth)
library(scales)
data <- read_csv("customer_retention.csv")
library(DT)
viewable_data <- data
# Rename columns for display
colnames(viewable_data)[colnames(viewable_data) == "PaperlessBilling"] <- "Paperless Billing"
colnames(viewable_data)[colnames(viewable_data) == "MonthlyCharges"] <- "Monthly Charges"
colnames(viewable_data)[colnames(viewable_data) == "TotalCharges"] <- "Total Charges"
# Select and reorder columns
selected_columns <- c("Status", "Monthly Charges", "Total Charges", "Tenure", "Contract", "Paperless Billing")
# Prepare the data with renamed columns
display_data <- viewable_data[selected_columns]
# Optionally, rename columns for display (if you want to change the display names)
colnames(display_data) <- c(
"Status",
"Monthly Charges",
"Total Charges",
"Tenure (months)",
"Contract",
"Paperless Billing"
)
# Show the datatable with formatting
datatable(
display_data,
options = list(pageLength = 25, autoWidth = TRUE)
) %>%
formatCurrency(c('Monthly Charges', 'Total Charges'), currency = "$") %>% # Add dollar signs
formatStyle(
'Status',
target = 'row',
backgroundColor = styleEqual(
c("Left", "Current"), # use your actual Status values
c('#e5cdf1', '#dcf1f3')
)
)
Customers who have left had a higher average monthly spend than current customers, but their average total spend and tenure were significantly lower, indicating they spent more per month but stayed for a shorter time. This suggests that retaining customers longer may be more valuable for overall revenue than focusing solely on high monthly spenders.
# Calculate averages
status_avg_df <- data %>%
group_by(Status) %>%
summarise(Avg_Monthly = mean(MonthlyCharges, na.rm = TRUE), Avg_Total = mean(TotalCharges, na.rm = TRUE), Avg_Tenure = mean(Tenure, na.rm = TRUE))
# Pivot the data
status_avg_pivot <- status_avg_df %>%
pivot_longer(
cols = starts_with("Avg"),
names_to = "Metric",
values_to = "Value")
# Set order and better facet labels
status_avg_pivot$Metric <- factor(
status_avg_pivot$Metric,
levels = c("Avg_Monthly", "Avg_Total", "Avg_Tenure"),
labels = c("Average Monthly Spend ($)", "Average Total Spend ($)", "Average Tenure (Months)"))
# Custom color palette
status_colors <- c("Left" = "darkorchid3", "Current" = "cadetblue3")
# Plot
ggplot(status_avg_pivot, aes(x = Status, y = Value, fill = Status)) +
geom_bar(stat = "identity", width = 0.8) +
facet_wrap(
~ Metric,
scales = "free_y") +
labs(title = "Comparison of Key Averages by Customer Status",
x = " Customer Status",
y = NULL) +
scale_fill_manual(values = status_colors) +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
strip.text = element_text(size = 11),
plot.title = element_text(size = 13, face = "bold"),
axis.text.x = element_text(size = 10),
panel.spacing = unit(0.5, "lines")) +
scale_y_continuous(
labels = function(x) {if (max(x, na.rm = TRUE) > 50) {label_dollar()(x)} else {x}}) # Use $ for the first two facets, plain for tenure
There is a positive correlation between tenure and average monthly charges, indicating that customers who stay longer tend to pay higher monthly charges. This trend suggests opportunities to increase revenue from long-term customers, but it may also signal potential dissatisfaction or pricing issues that could impact retention. The company should investigate the reasons behind rising charges with tenure and consider loyalty incentives or tiered pricing to maintain satisfaction and reduce churn. Monitoring customer feedback and usage patterns will help optimize pricing strategies and enhance long-term customer value.
tenure_df <- data %>%
group_by(Tenure) %>%
summarize(average = mean(MonthlyCharges))
ggplot(tenure_df, aes(x = Tenure, y = average)) +
geom_point(color = "darkorchid3") +
theme_minimal() +
geom_smooth(se = FALSE, color = "cadetblue3") +
ggtitle('Monthly Charges vs Tenure',
subtitle = 'The average monthy charge amount for each amount of months of tenure') +
scale_x_continuous(name = 'Tenure (in months)',
breaks = seq(0, 75, by = 5)) +
scale_y_continuous(name = 'Average Monthly Charges',
labels = scales::dollar,
breaks = seq(0, 85, by = 5))
Customers on month-to-month contracts have a much higher churn rate compared to those on one-year or two-year contracts, where most customers remain current. This suggests that longer-term contracts are effective in retaining customers. The company should consider promoting or incentivizing longer-term contracts to reduce churn. Additionally, targeted retention strategies for month-to-month customers could further decrease customer attrition
ggplot(data, aes(x = Contract, fill = Status)) +
geom_bar(position = "dodge") +
theme_minimal() +
scale_fill_manual(name = "Customer Status", values = c("Left" = "darkorchid3", "Current" = "cadetblue3")) +
ggtitle('Contract Type vs Customer Status',
subtitle = 'Total count of customers who are either current or have left, based on the contract\ntype selected') +
xlab('Contract Type') +
scale_y_continuous(name = 'Number of Customers',
breaks = seq(0, 2500, by = 250))
Customers who use electronic checks have a significantly higher rate of leaving compared to other payment methods, with nearly as many customers leaving as staying. In contrast, automatic payment methods (bank transfer and credit card) have the lowest customer churn. The company should encourage customers to switch to automatic payment methods to help reduce churn. Targeted incentives or communication campaigns promoting the convenience and reliability of automatic payments could be effective in retaining more customers
ggplot(data, aes(x = PaymentMethod, fill = Status)) +
geom_bar(position = "dodge") +
theme_minimal() +
scale_fill_manual(name = "Customer Status", values = c("Left" = "darkorchid3", "Current" = "cadetblue3")) +
ggtitle('Payment Method vs Customer Status',
subtitle = 'Total count of customers who are either current or have left, based on the customers\npayment method') +
xlab('Payment Method') +
theme(axis.text.x = element_text(angle = 25, vjust = .75, hjust = .75)) +
scale_y_continuous(name = 'Number of Customers',
breaks = seq(0, 2500, by = 250))
Customers using paperless billing have a higher number of both current and former (left) customers compared to those not using paperless billing. Notably, the number of customers who have left is significantly higher among paperless billing users. This suggests a potential link between paperless billing and increased customer churn. The company should investigate the paperless billing experience and consider targeted retention strategies for these customers to reduce churn.
ggplot(data, aes(x = PaperlessBilling, fill = Status)) +
geom_bar(position = "dodge") +
theme_minimal() +
scale_fill_manual(name = "Customer Status", values = c("Left" = "darkorchid3", "Current" = "cadetblue3")) +
ggtitle('Paperless Billing vs Customer Status',
subtitle = 'Total count of customers who are either current or have left, based on if the customer\nuses paperless billing') +
xlab('Paperless Billing (Y/N)') +
scale_y_continuous(name = 'Number of Customers',
breaks = seq(0, 2500, by = 250))
customer_retention <- data
# Step 1: create train and test splits
set.seed(123) # for reproducibility
split <- initial_split(customer_retention, prop = 0.7, strata = Status)
customer_retention_train <- training(split) %>%
na.omit()
customer_retention_test <- testing(split) %>%
na.omit()
# Step 2: Create Model Recipe
customer_retention_recipe <- recipe(Status ~ ., data = customer_retention_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
# Step 3: create a decision tree model
dt_model <- decision_tree(mode = "classification") %>%
set_engine("rpart")
# Step 4: create workflow and fit model
dt_fit <- workflow() %>%
add_recipe(customer_retention_recipe) %>%
add_model(dt_model) %>%
fit(data = customer_retention_train)
# Step 5: Create resampling procedure
set.seed(123)
kfold <- vfold_cv(customer_retention, v = 5)
# Step 6: train model
results <- fit_resamples(dt_model, customer_retention_recipe, kfold)
# Step 7: Model Results
collect_metrics(results) %>%
arrange(desc(mean))
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.802 5 0.00396 Preprocessor1_Model1
## 2 accuracy binary 0.790 5 0.00784 Preprocessor1_Model1
## 3 brier_class binary 0.146 5 0.00359 Preprocessor1_Model1
# Step 8: Create model object with tuning results
dt_model <- decision_tree(
mode = "classification",
cost_complexity = tune(),
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("rpart")
# Step 9: Create hyperparamter grid
hyper_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n())
#Step 10: Train model across hyper parameter grid
set.seed(123)
results <- tune_grid(dt_model, customer_retention_recipe, resamples = kfold, grid = hyper_grid)
# Step 11: get best results:
dt_best_model <- select_best(results, metric = "roc_auc")
### Creating Final Fit Graph
# Step 1: Put together final workflow
dt_final_wf <- workflow() %>%
add_recipe(customer_retention_recipe) %>%
add_model(dt_model) %>%
finalize_workflow(dt_best_model)
# fit final workflow across entire training data
dt_final_fit <- dt_final_wf %>%
fit(data = customer_retention_train)
# plot feature importance
dt_final_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "darkorchid3"))
customer_retention <- data
customer_retention$Status <- as.factor(customer_retention$Status)
set.seed(123)
customer_retention_split <- initial_split(customer_retention, prop = .7 , strata = Status)
customer_retention_train <- training(customer_retention_split)
customer_retention_test <- testing(customer_retention_split)
set.seed(123)
customer_retention_kfolds <- vfold_cv(customer_retention_train, v = 5, strata = Status)
logistic_reg() %>%
fit_resamples(Status ~ ., customer_retention_kfolds) %>%
collect_metrics() %>%
arrange(desc(mean))
## # A tibble: 3 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.843 5 0.00416 Preprocessor1_Model1
## 2 accuracy binary 0.795 5 0.00405 Preprocessor1_Model1
## 3 brier_class binary 0.137 5 0.00184 Preprocessor1_Model1
final_fit <- logistic_reg() %>%
fit(Status ~ ., data = customer_retention_train)
vip(final_fit$fit, aesthetics = list(fill = "darkorchid3"))
customer_retention <- data
# Step 1: create train and test splits
set.seed(123) # for reproducibility
split <- initial_split(customer_retention, prop = 0.7, strata = Status)
customer_retention_train <- training(split) %>%
na.omit()
customer_retention_test <- testing(split) %>%
na.omit()
# Step 2: create model & preprocessing recipe
customer_retention_recipe <- recipe(Status ~ ., data = customer_retention_train) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
# Step 3: fit model across resampling object and collect results
set.seed(123)
kfolds <- vfold_cv(customer_retention_train, v = 5, strata = Status)
# Step 4: create ridge model object
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode('classification')
# Step 5. create our hyperparameter search grid
mars_grid <- grid_regular(num_terms(c(1,30)), prod_degree(), levels = 25)
# Step 6: create workflow object to combine the recipe & model
customer_retention_wf <- workflow() %>%
add_recipe(customer_retention_recipe) %>%
add_model(mars_mod)
# Step 7. perform hyperparamter search
tuning_results <- customer_retention_wf %>%
tune_grid(resamples = kfolds, grid = mars_grid)
# Step 8. assess results
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 50 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 17 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 2 19 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 3 20 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 4 21 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 5 22 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 6 23 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 7 25 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 8 26 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 9 27 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## 10 28 1 roc_auc binary 0.848 5 0.00557 Preprocessor1_M…
## # ℹ 40 more rows
# Step 1. finalize our workflow object with the optimal hyperparameter values
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
final_wf <- workflow() %>%
add_recipe(customer_retention_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_hyperparameters)
# Step 2. fit our final workflow object across the full training set data
mars_final_fit <- final_wf %>%
fit(data = customer_retention_train)
# Step 3. plot the top 10 most influential features
mars_final_fit %>%
extract_fit_parsnip() %>%
vip(aesthetics = list(fill = "darkorchid3"))
We found our MARS Model to be the best fit model because it had the highest area under the curve compared to the other two machine learning examples. The information from this model is crucial for managers to know so that they can best allocate resources to promote customer retention. Overall, the most critical variable affecting customer retention is tenure, with customers with higher tenure leading to higher retention rates. Some additional variables that affect retention are contract length and charges (how much the customer is spending).
Of all of the customers in the data, Current customers have an average of an 18.07% probability of leaving, whereas customer who have left, have an average probability of leaving of 49.92%.
If no action is taken, we expect 117 customers to leave in the first month. This would result in $9,803.25 in revenue being lost for that month. These data points were found assuming customers with a 70% churn probability will leave in the next month.
# Make predictions (probability format)
churn_probabilities <- predict(mars_final_fit,
new_data = data,
type = "prob") %>%
bind_cols(data) %>%
select('ID', .pred_Left, everything()) %>%
arrange(desc(.pred_Left))
# Find average prediction percentage for Left customers
churn_probabilities %>%
group_by(Status) %>%
summarise(average = mean(.pred_Left, na.rm = TRUE))
## # A tibble: 2 × 2
## Status average
## <chr> <dbl>
## 1 Current 0.181
## 2 Left 0.499
# Assume customer will churn in the next month if they have an 70% probability of churning
churnned_cust <- churn_probabilities %>%
na.omit() %>%
filter(Status == "Current") %>%
filter(.pred_Left > .70)
# Revenue Lost Calculation
churnned_cust %>%
summarise(total = sum(MonthlyCharges))
## # A tibble: 1 × 1
## total
## <dbl>
## 1 9803.
Customer loyalty programs have been increasing in popularity over the last several years. We believe Regork forming a loyalty program for newer customer is the best course of action to increase customer tenure. Their loyalty program can increase customer tenure by offering “surprise rewards” for meeting milestones such as 3 months, 6 months, and a year. Reward examples include but are not limited to extra data, discounted plans, and device upgrades.
Additionally, Regork should enhance customer service. Things like customer feedback surveys and personalized calls make customer feel valued by the company, which in turn encourages them to continue to do business with Regork.
Once the customer loyalty program and customer service divisions are enhanced, customers would feel more that their business is valued by the company. Additionally, customers would appreciate any form of deals or rewards offered to them.
Our main limitation with our report is that we estimated churn probability as 70% was used to assess profit loss in the first loss. Further analysis could uncover a more accurate churn probability.
A potential area to build on our analysis would be diving deeper into customer demographics and how they affect contract length and charges. By knowing this information, changes to customer loyalty and customer service could better appeal to key demographic groups.