We show the packages used below:
library(tidyverse)
library(tidymodels)
library(dplyr)
library(ggplot2)
library(DT)
library(gridExtra)
library(pdp)
library(earth)
library(vip)
retention <- read_csv("data/customer_retention.csv")
# Pull in our retention dataset
datatable(head(retention, 15, width = auto))
Below is an overview of the metadata and transformations used in this dataset:
retention <- retention %>%
dplyr::mutate(Status = as.factor(Status))
retention <- drop_na(retention)
glimpse(retention)
## Rows: 6,988
## Columns: 20
## $ Gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ Tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Status <fct> Current, Current, Left, Current, Left, Left, Current,…
print(colSums(is.na(retention)))
## Gender SeniorCitizen Partner Dependents
## 0 0 0 0
## Tenure PhoneService MultipleLines InternetService
## 0 0 0 0
## OnlineSecurity OnlineBackup DeviceProtection TechSupport
## 0 0 0 0
## StreamingTV StreamingMovies Contract PaperlessBilling
## 0 0 0 0
## PaymentMethod MonthlyCharges TotalCharges Status
## 0 0 0 0
Regork_colors <- c("#0071ce", "#4b7f42")
# Retention Rate By Contract Type
ggplot(retention, aes(x = Contract, fill = Status)) +
geom_bar(position = "fill", color = "black") +
scale_fill_manual(values = Regork_colors) +
theme_minimal() +
labs(y = "Retention Rate (%)",
title = "Retention Rate by Contract Type") +
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 14, face = "bold"),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5))
Looking at the different types of contracts that we offer (month-to-month, one-year, and two-year), it looks like people who are on a month-to-month contract are more likely to leave as opposed to those who are on a one-year or two-year contract.
retention %>%
group_by(StreamingTV, Status) %>%
ggplot(aes(x = StreamingTV, fill = Status)) +
geom_bar(position = "fill", color = "black") +
scale_fill_manual(values = Regork_colors) +
labs(y = "Retention Rate (%)",
title = "Retention Rate Based on TV Streaming Package") +
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 14, face = "bold"),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5))
This graph shows the retention rate of those with the
streaming TV service, without the TV service, and
those with the TV service but not the internet
service. Those three categories are then broken into two
proportions, current customers and those that have
left.
One insight that can be pulled from the graph is that the proportion of current customers to leaving customers is nearly identical for those with internet and TV streaming compared to those with internet but no TV Streaming.
plot2data <- retention %>%
mutate(
Gender = as.factor(Gender),
SeniorCitizen = factor(SeniorCitizen, levels = c(0, 1), labels = c("No", "Yes")),
Partner = as.factor(Partner),
Dependents = as.factor(Dependents),
PhoneService = as.factor(PhoneService),
MultipleLines = as.factor(MultipleLines),
Status = as.factor(Status)
)
ggplot(plot2data, aes(x = SeniorCitizen, fill = Status)) +
geom_bar(position = "dodge") +
labs(x = "Senior Citizen?", y = "Count", fill = "Status") +
scale_fill_manual(values = Regork_colors) +
ggtitle("Distribution of customer status\nbased on whether they were a senior citizen") +
theme(axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title = element_text(size = 14, face = "bold"),
plot.title = element_text(size = 18, face = "bold", hjust = 0.5))
From the graph, we can see that there is a larger number of current customers who are not senior citizens compared to those who are. The number of customers who have churned is lower than the number of current customers for all categories. The difference between the number of current and churned customers is larger among non-senior citizens than among senior citizens, suggesting that the churn proportion is much larger for the number of senior citizens. However, the customer segment of non-senior citizens certainly outweighs the senior one.
set.seed(123)
split <- initial_split(retention, prop = 0.7, strata = "Status")
retention_train <- training(split)
retention_test <- testing(split)
set.seed(123)
kfolds <- vfold_cv(retention_train, v = 5, strata = Status)
retention_recipe <- recipe(Status ~ ., data = retention_train) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal_predictors())
logistic_results <- logistic_reg() %>%
fit_resamples(Status ~ ., kfolds)
collect_metrics(logistic_results)
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.799 5 0.00401 Preprocessor1_Model1
## 2 roc_auc binary 0.845 5 0.00521 Preprocessor1_Model1
Let’s define our Logistic regression model . Then, we will asses the model based on the ROC from our model, which is 0.845!
logistic_final_fit <- logistic_reg() %>%
fit(Status ~ ., data = retention_train)
Let go ahead and fit our logistic regression model on our training data.
logistic_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1362 225
## Left 178 332
vip(logistic_final_fit$fit, num_features = 20)
This code creates a confusion matrix, which allows us to see which predictions the model got correct and which predictions it got wrong on our testing set.
It also shows us what logistic regression thinks the most important features are for people churning.
mars_mod <- mars(num_terms = tune(), prod_degree = tune()) %>%
set_mode("classification")
mars_grid <- grid_regular(num_terms(range = c(1,100)), prod_degree(), levels = 25)
retention_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(mars_mod)
tuning_results <- retention_wf %>%
tune_grid(resamples = kfolds, grid = mars_grid)
tuning_results %>%
collect_metrics() %>%
filter(.metric == "roc_auc") %>%
arrange(desc(mean))
## # A tibble: 50 × 8
## num_terms prod_degree .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 21 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 2 25 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 3 29 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 4 34 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 5 38 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 6 42 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 7 46 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 8 50 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 9 54 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## 10 58 1 roc_auc binary 0.849 5 0.00486 Preprocessor1_M…
## # ℹ 40 more rows
Let’s define our MARS model for classification and use a grid to tune our hyperparameters. From there, we will collect the best ROC from our model, which is 0.849!
best_hyperparameters <- select_best(tuning_results, metric = "roc_auc")
mars_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(mars_mod) %>%
finalize_workflow(best_hyperparameters)
mars_final_fit <- mars_final_wf %>%
fit(data = retention_train)
# Predicting on our test set
mars_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1383 242
## Left 157 315
mars_final_fit %>%
predict(retention_test, type = "prob") %>%
mutate(truth = retention_test$Status) %>%
roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.847
Proceeding with fitting the final workflow to the training data, we then predict on the test set, evaluate model performance using a confusion matrix, and calculate the AUC score.
The results of the model prediction on the test set show that out of the instances labeled as “Current” the model correctly predicted 1383 people that are current and incorrectly classified 242 people as “Left”. Similarly, for instances labeled as “Left” the model correctly predicted 315 people that have left and misclassified 157 people as “Current”.
Additionally, the AUC score is calculated as 0.847 which is similar to our training AUC.
mars_final_fit %>%
extract_fit_parsnip() %>%
vip()
We can see here that our most important features on why someone is likely to leave are their total charges, the tenure of their account, and the monthly charges they acquire.
As a person responsible for making business decisions, what else are you learning from the observations in this section?
Our model did well on both the training and testing dataset so therefore we believe moving forward it will do well with unseen data in the future
First, we trained our model with default hyperparameter values and used the same 5-fold cross validation object and achieved the mean 5-fold cross-validated AUC of 0.836. After that, we arrived at the optimal random forest model with the highest AUC score of 0.847 after tuning the following values:
Assess a total of 5 values from each parameter (levels = 5).
rf_mod <- rand_forest(
mode = "classification",
trees = tune(),
mtry = tune(),
min_n = tune()
) %>%
set_engine("ranger", importance = "impurity")
# create the hyperparameter grid
rf_hyper_grid <- grid_regular(
trees(range = c(15, 300)),
mtry(range = c(2, 12)),
min_n(range = c(1, 20)),
levels = 5
)
# train our model across the hyper parameter grid
set.seed(123)
rf_results <- tune_grid(rf_mod, retention_recipe, resamples = kfolds, grid = rf_hyper_grid)
# model results
show_best(rf_results, metric = "roc_auc")
## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 86 20 roc_auc binary 0.847 5 0.00463 Preprocessor1_Model1…
## 2 4 228 20 roc_auc binary 0.846 5 0.00442 Preprocessor1_Model1…
## 3 4 300 20 roc_auc binary 0.845 5 0.00498 Preprocessor1_Model1…
## 4 4 300 15 roc_auc binary 0.844 5 0.00404 Preprocessor1_Model0…
## 5 4 157 10 roc_auc binary 0.844 5 0.00467 Preprocessor1_Model0…
rf_best_hyperparameters <- select_best(rf_results, metric = "roc_auc")
rf_final_wf <- workflow() %>%
add_recipe(retention_recipe) %>%
add_model(rf_mod) %>%
finalize_workflow(rf_best_hyperparameters)
rf_final_fit <- rf_final_wf %>%
fit(data = retention_train)
# Predicting on our test set
rf_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test %>% select(Status)) %>%
conf_mat(truth = Status, estimate = .pred_class)
## Truth
## Prediction Current Left
## Current 1396 268
## Left 144 289
rf_final_fit %>%
predict(retention_test, type = "prob") %>%
mutate(truth = retention_test$Status) %>%
roc_auc(truth, .pred_Current)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.845
This random forest model demonstrates a stronger ability to identify customers who are still with the service rather than those who have left. The higher number of false positives, where the model predicts customers as “Current” when they have actually “Left,” indicates a tendency of the model to be overly optimistic about customer retention. Conversely, the lower number of false negatives suggests that the model is more cautious about predicting that a customer has churned. This pattern reveals a bias in the model towards predicting customer continuation, which may lead to an underestimation of the churn rate.
rf_final_fit %>%
predict(retention_train, type = "prob") %>%
mutate(truth = retention_train$Status) %>%
roc_curve(truth, .pred_Current) %>%
autoplot()
mars_final_fit %>%
extract_fit_parsnip() %>%
vip()
In our optimal model, the most influential variables are ‘Total Charges’, ‘Tenure’, and ‘MonthlyCharges’. Following those variables are ‘PaymentMethod_Electronic.check’ and ‘OnlineSecurity_Yes’.
As a business manager, we would focus on understanding why these predictors are influential. Firstly, if high total charges and monthly charges are a primary reason for customers leaving, we’d look into restructuring pricing or providing incentives/programs that could potentially provide additional value to our customers. If customers with longer tenure are more loyal, we’d explore rewards for customer loyalty. Additionally, improving service quality in the area of tech support and online security could address specific concerns that lead to customer churn. Offering incentives or promotions tied to electronic payment methods, long-term contracts, and bundled services with phone and online backup could also be effective strategies to decrease churn.
Customers that would leave:
customers_left <- mars_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test) %>%
filter(.pred_class == "Left")
customers_left
## # A tibble: 472 × 21
## .pred_class Gender SeniorCitizen Partner Dependents Tenure PhoneService
## <fct> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 Left Female 0 Yes No 28 Yes
## 2 Left Male 1 No No 1 Yes
## 3 Left Male 0 No No 5 Yes
## 4 Left Female 0 No No 2 Yes
## 5 Left Female 0 Yes No 47 Yes
## 6 Left Female 0 No No 12 Yes
## 7 Left Female 1 Yes No 25 Yes
## 8 Left Female 0 No No 13 Yes
## 9 Left Female 0 No No 4 Yes
## 10 Left Female 0 Yes No 2 Yes
## # ℹ 462 more rows
## # ℹ 14 more variables: MultipleLines <chr>, InternetService <chr>,
## # OnlineSecurity <chr>, OnlineBackup <chr>, DeviceProtection <chr>,
## # TechSupport <chr>, StreamingTV <chr>, StreamingMovies <chr>,
## # Contract <chr>, PaperlessBilling <chr>, PaymentMethod <chr>,
## # MonthlyCharges <dbl>, TotalCharges <dbl>, Status <fct>
Proportion of Predicted Churns
proportion_left <- mars_final_fit %>%
predict(retention_test) %>%
bind_cols(retention_test) %>%
count(.pred_class) %>%
spread(key = .pred_class, value = n) %>%
mutate(proportion_Left = Left / (Left + Current)*100) %>%
pull(proportion_Left)
proportion_left
## [1] 22.50835
ggplot(customers_left) +
geom_histogram(aes(x = Tenure), position="identity") +
ggtitle("Distribution of Tenure for Predicted Churned Customers") +
labs(x = "Tenure (in Months)", y = "Count") +
scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, by = 6)) +
scale_y_continuous(limits = c(0, 100)) +
theme_minimal() +
theme(text = element_text(size = 12),
axis.title = element_text(size = 14),
plot.title = element_text(size = 16, hjust = 0.5))
From the graph above, we can see that a significant number of predicted churns occur within the first few months of tenure, with the highest peak at the very beginning of 0-6 months range and 6-12 months range. The number of predicted churns decreases steadily as the tenure increases, indicating that customers are less likely to churn as they stay longer with the service.
Proportion of predicted churns that have only used the service for less than 12 months:
proportion_under_12_months <- customers_left %>%
summarise(under_12_months = sum(Tenure < 12) / n() * 100) %>%
pull(under_12_months)
proportion_under_12_months
## [1] 71.61017
Total monthly charges of predicted churns that only used the service for less than 12 months:
total_monthlycharges_under_12_months <- customers_left %>%
filter(Tenure < 12) %>%
summarise(total_monthlycharges = sum(MonthlyCharges)) %>%
pull(total_monthlycharges)
total_monthlycharges_under_12_months
## [1] 24927.5
BAplot1 <- ggplot(customers_left) +
geom_histogram(aes(x = MonthlyCharges), position = "identity") +
scale_x_continuous(labels = scales::dollar_format()) +
labs(x = "Monthly Charges", y = "Count") +
ggtitle("Distribution of Monthly Charges for Predicted Churned Customers") +
theme_minimal()
BAplot2 <- ggplot(customers_left) +
geom_histogram(aes(x = TotalCharges), position = "identity") +
scale_x_continuous(labels = scales::dollar_format()) +
labs(x = "Total Charges", y = "Count") +
ggtitle("Distribution of Total Charges for Predicted Churned Customers") +
theme_minimal()
gridExtra::grid.arrange(BAplot1, BAplot2)
From the second graph above, we can interpret that the customers who are predicted to churn tend to be on the higher end of the monthly charge scale. For total charges, there is a heavy skew towards the lower end, with a significant number of customers having total charges of less than $1,000. This might suggest that customers who have not invested a lot into the service (have just started using the service - lower total charges) but generally have a higher monthly charges are more likely to churn.
Overall, the service might be losing customers early on, which could indicate issues with initial customer satisfaction or early value realization. Customers with higher monthly charges are more inclined to churn, suggesting that pricing is an important factor in the decision to leave. In addition, customers might not perceive enough value over time to have long-term commitment.
The predicted revenue loss if not action is taken:
customers_left %>%
summarize(lost_revenue = sum(MonthlyCharges))
## # A tibble: 1 × 1
## lost_revenue
## <dbl>
## 1 37229.
If no proactive measures are implemented, Regork may experience a significant loss in revenue. Based on our model, we’ve estimated a potential revenue decline of approximately $37,229. This figure represents the aggregated monthly charges of customers identified by the model as high-risk for departure. While the actual number of customers who may leave—currently estimated at 472 (around 22.5% of customer base)—is not certain, these individuals represent the most vulnerable demographic according to our data. This underscores the importance of targeted retention strategies to mitigate the risk of this projected loss.
Based on our MARS model, we’ve identified critical factors contributing to customer churn, such as Tenure and Total Charges. We propose a discount program that rewards newer customers with discounts, targeting customers who have only used our service less than 6 or 12 months.
The anticipated cost of this program, when offset against the predicted retention-led revenue, indicates a net gain for the company. For instance, offering a 10% discount for customers who have only stayed with us for less than 12 months prevents 71% of predicted churns, the savings in retained revenue would substantially exceed the expense of the discounts applied. The estimated cost of this discount program is $2,492.70, while the potential revenue retained is $24,927.50. This presents us with a net benefit of $22,435, making the incentive scheme a profitable investment for the company.
As a result, the investment in our loyalty program not only strengthens our relationship with current customers but also ensures sustained profitability.
Our solution introduces a discount program targeting key churn factors—tenure and total charges. Financial projections suggest the program’s costs are outweighed by its benefits, with a 10% discount for the most vulnerable customer segment (customers who have only stayed with us for less than 12 months) poised to significantly reduce churn. The expected net gain from this program affirms its potential to bolster both customer loyalty and company profitability.
In order to enhance the strategy based on our analysis, we believe that we can focus on deeper customer segmentation to tailor retention efforts more effectively and introduce personalized incentives that resonate with high-risk churn segments. For instance, when we analyzed the test dataset, we noticed that all of the predicted churned customers were enrolled in the month-to-month contract type. Looking at different strategies to convert month-to-month customers to longer-duration contract customers can be significantly impactful. Our proposed incentives also focus mainly on more recent customers who have only used the service for less than 12 months. Additionally, looking deeper into devising specific loyalty programs that reward tenure can further cover other customer segments that are vulnerable to churning. This approach should be underpinned by a thorough cost-benefit analysis to ensure that our strategies are both effective, economically viable, and sustainable in terms of profit and inferred value to customers.