Problem Statement:
We are working for a company named Regork, and they have recently introduced telecommunication products to the market, including internet service, streaming services, and phone services. We have been tasked by the Regork CEO to develop models to predict whether customers will leave in the future.
The data we are using for this problem is “customer_retention.csv”. This dataset consists of one response variable, “Status” and numerous response variables such as gender, tenure, contract, and total charges.
Our goal is to determine trends in customer retention at Regork through building, training, and testing three different types of machine learning models. The machine learning models we decided to use are logistic regression, decision trees, and random forest. By creating these models, Regork will have distinct solutions to increase customer retention.
Packages Required:
library(ggplot2)
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(kernlab)
library(rpart.plot)
library(pdp)
library(ranger)
library(readr)
Data Preparation:
First, we loaded the dataset.
customer_retention <- read_csv("customer_retention.csv")
Then, we converted binary numeric variables to factors.
customer_retention <- customer_retention %>%
mutate(
SeniorCitizen = factor(SeniorCitizen, levels = c("0", "1")),
Partner = as.factor(Partner),
Dependents = as.factor(Dependents),
PhoneService = as.factor(PhoneService),
PaperlessBilling = as.factor(PaperlessBilling),
Status = as.factor(Status)
)
Finally, we checked for any missing values and removed them from the dataset.
customer_retention <- customer_retention %>%
drop_na()
Our main goal was to identify trends between “Status”, our response variable, and all of the other predictor variables.
Our first step was to assess the baseline churn rate.
churn_rate <- customer_retention %>%
count(Status) %>%
mutate(proportion = n / sum(n))
print(churn_rate)
## # A tibble: 2 × 3
## Status n proportion
## <fct> <int> <dbl>
## 1 Current 5132 0.734
## 2 Left 1856 0.266
This number will be necessary in the rest of our data visualizations. The churn rate, or better known as turnover rate, is a calculation of the proportion of customers who have joined and since left.
Then we plotted the distribution of customer tenure.
ggplot(customer_retention, aes(x = Tenure)) +
geom_histogram(fill = 'skyblue', bins = 30) +
labs(title = "Distribution of Customer Tenure", x = "Tenure (months)", y = "Number of Customers")
This graph was helpful in showing us how long customers typically stay with Regork. What we found is that the numbers are heavily skewed to the right and left. Meaning customers are often either very loyal or not very loyal.
From there, we decided to explore the relationship between tenure and monthly charges with the churn status we had previously calculated.
ggplot(customer_retention, aes(x = Tenure, y = MonthlyCharges, color = Status)) +
geom_point(alpha = 0.5) +
labs(title = "Tenure vs Monthly Charges by Customer Status", x = "Tenure (months)", y = "Monthly Charges")
Next we looked into the customer churn by each specific contract type.
ggplot(customer_retention, aes(x = Contract, fill = Status)) +
geom_bar(position = 'fill') +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Distribution of Customer Status by Contract Type", x = "Contract Type", y = "Percentage")
This graph proved to be very useful as we learned that there is significantly higher customer churn among the month-to-month contract types.
Then, we explored the churn rate by demographic variables like Senior Citizen, Partner, and Dependents.
demographic_vars <- c("SeniorCitizen", "Partner", "Dependents")
for (var in demographic_vars) {
print(
ggplot(customer_retention, aes_string(x = var, fill = "Status")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = paste("Distribution of Customer Status by", var), x = var, y = "Percentage")
)
}
All of these graphs proved to be very useful. Some insights we gained included that there is a higher churn rate with senior citizens, unmarried customers, and customers without dependents.
Finally, we visualized churn rate by Regork’s additional services like Internet Service, Online Security, etc.
service_vars <- c("InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport")
for (var in service_vars) {
print(
ggplot(customer_retention, aes_string(x = var, fill = "Status")) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = paste("Distribution of Customer Status by", var), x = var, y = "Percentage")
)
}
All of these graphs proved to be very useful as well. Some insights we gained include that there is a higher churn rate with customers that have fiber optic, no online security, no online backup, no device protection, and no tech support.
Set-up:
To begin our machine learning models, we first had to split the data into training and testing sets. Additionally, we needed to set up a recipe for data preprocessing. Finally, we defined the re sampling method for cross-validation.
set.seed(123)
split <- initial_split(customer_retention, prop = 0.75, strata = Status)
train_data <- training(split)
test_data <- testing(split)
recipe <- recipe(Status ~ ., data = train_data) %>%
step_dummy(all_nominal(), -all_outcomes()) %>%
step_normalize(all_numeric(), -all_outcomes())
cv_folds <- vfold_cv(train_data, v = 5, strata = Status)
Logistic Regression:
logistic_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
logistic_results <- workflow() %>%
add_recipe(recipe) %>%
add_model(logistic_spec) %>%
fit_resamples(cv_folds, metrics = metric_set(roc_auc))
Decision Tree:
tree_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_results <- workflow() %>%
add_recipe(recipe) %>%
add_model(tree_spec) %>%
fit_resamples(cv_folds, metrics = metric_set(roc_auc))
Random Forest:
forest_spec <- rand_forest(trees = 100) %>%
set_engine("ranger") %>%
set_mode("classification")
forest_results <- workflow() %>%
add_recipe(recipe) %>%
add_model(forest_spec) %>%
fit_resamples(cv_folds, metrics = metric_set(roc_auc))
Model Metrics:
In order to compare each model’s results to determine the best model, we have printed them side-by-side below.
logistic_metrics <- collect_metrics(logistic_results)
tree_metrics <- collect_metrics(tree_results)
forest_metrics <- collect_metrics(forest_results)
print(logistic_metrics)
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.847 5 0.00391 Preprocessor1_Model1
print(tree_metrics)
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.727 5 0.0149 Preprocessor1_Model1
print(forest_metrics)
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 roc_auc binary 0.843 5 0.00584 Preprocessor1_Model1
According to the results above, the logistic regression model returns the highest and best AUC compared to the decision tree and random forest models. The random forest model still performed well with an AUC of 0.843, but the logistic regression model outperformed it with an AUC of 0.847.
Logistic Regression - Confusion Matrix
Because we determined the logistic regression model produced the best AUC, we wanted to complete additional analysis like a confusion matrix.
logistic_fit <- workflow() %>%
add_recipe(recipe) %>%
add_model(logistic_spec) %>%
fit(data = train_data)
prediction_results <- predict(logistic_fit, new_data = test_data, type = "prob")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
print(colnames(prediction_results))
## [1] ".pred_Current" ".pred_Left"
prediction_results <- cbind(test_data, prediction_results)
print(colnames(prediction_results))
## [1] "Gender" "SeniorCitizen" "Partner" "Dependents"
## [5] "Tenure" "PhoneService" "MultipleLines" "InternetService"
## [9] "OnlineSecurity" "OnlineBackup" "DeviceProtection" "TechSupport"
## [13] "StreamingTV" "StreamingMovies" "Contract" "PaperlessBilling"
## [17] "PaymentMethod" "MonthlyCharges" "TotalCharges" "Status"
## [21] ".pred_Current" ".pred_Left"
prediction_results$predicted_status <- if_else(prediction_results$.pred_Left > 0.5, "Left", "Current")
prediction_results$predicted_status <- factor(prediction_results$predicted_status, levels = c("Current", "Left"))
prediction_results$Status <- factor(prediction_results$Status, levels = c("Current", "Left"))
confusion_matrix <- conf_mat(data = prediction_results, truth = Status, estimate = predicted_status)
print(confusion_matrix)
## Truth
## Prediction Current Left
## Current 1138 199
## Left 145 265
Our results tell us that the number of instances that were actually “Left” and were correctly predicted as “Left” by the model are 265. The number of instances that were actually “Current” but were incorrectly predicted as “Left” by the model are 145. The number of instances that were actually “Left” but were incorrectly predicted as “Current” by the model are 199. Finally, the number of instances that were actually “Current” and were correctly predicted as “Current” by the model are 1138. Therefore, we had a majority of true negatives.
Logistic Regression - Feature Importance
We also wanted to see which features were the most influential in customer behavior from our logistic regression model.
coef_info <- broom::tidy(logistic_fit) %>%
filter(term != "(Intercept)")
ggplot(coef_info, aes(x = reorder(term, estimate), y = estimate, fill = estimate > 0)) +
geom_col() +
coord_flip() +
labs(x = "Features", y = "Coefficient magnitude",
title = "Feature Importance in Logistic Regression Model") +
scale_fill_manual(name = "Influence",
labels = c("Negative", "Positive"),
values = c("red", "blue")) +
theme_minimal()
print(coef_info)
## # A tibble: 30 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Tenure -1.49 0.179 -8.29 1.13e-16
## 2 MonthlyCharges -1.81 1.10 -1.64 1.01e- 1
## 3 TotalCharges 0.739 0.187 3.95 7.80e- 5
## 4 Gender_Male -0.0506 0.0378 -1.34 1.80e- 1
## 5 SeniorCitizen_X1 0.0592 0.0364 1.62 1.05e- 1
## 6 Partner_Yes -0.0340 0.0449 -0.756 4.49e- 1
## 7 Dependents_Yes -0.0204 0.0474 -0.429 6.68e- 1
## 8 PhoneService_Yes 0.133 0.214 0.622 5.34e- 1
## 9 MultipleLines_No.phone.service NA NA NA NA
## 10 MultipleLines_Yes 0.262 0.101 2.61 9.04e- 3
## # ℹ 20 more rows
From this analysis we learned that the three most influential positive features in the logistic regression model are InternetService_Fiberoptic, TotalCharges, StreamingMovies_Yes. Reversely, the three most influential negative features are MonthlyCharges, Tenure, and InternetService_No.
Logistic Regression - Final Generalization Error
For our final analysis of our logistic regression model we wanted to look at the final generalization error that the you should expect to see on new data. Additionally we wanted to see how the generalization error would compare to the validation error.
auc_cv <- logistic_metrics$mean[logistic_metrics$.metric == "roc_auc"]
prob_predictions <- predict(logistic_fit, test_data, type = "prob")
left_probs <- prob_predictions$.pred_Left
test_data <- test_data %>%
mutate(.pred_Left = left_probs)
auc_test <- roc_auc(test_data, truth = Status, .pred_Left)
print(auc_test)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.163
cat("Test Set ROC AUC:", auc_test$.estimate, "\n")
## Test Set ROC AUC: 0.1625669
cat("Cross-Validation ROC AUC:", auc_cv, "\n")
## Cross-Validation ROC AUC: 0.8470813
if (abs(auc_test$.estimate - auc_cv) < 0.05) {
cat("The model generalizes well with similar performance on both validation and test data.\n")
} else {
cat("There is a notable difference in performance, suggesting potential overfitting or underfitting.\n")
}
## There is a notable difference in performance, suggesting potential overfitting or underfitting.
In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?
From the logistic regression model, you can use the coefficients to determine the relative importance of predictors. Factors with higher absolute values of coefficients have more influence.
Focus Areas for Business Manager:
Monthly Charges and Contract Type: High coefficients suggest focusing on these areas. Incentives could include discount schemes or more flexible contract terms to reduce churn.
Tenure: Long tenure usually correlates with loyalty. Implement loyalty programs that reward long-term customers.
Additional Services (e.g., Online Security, Streaming Services): Offering bundled services or discounts on these could enhance perceived value and stickiness of the services.
Collect all the customers from the test dataset that you predict are going to leave.
logistic_fit <- workflow() %>%
add_recipe(recipe) %>%
add_model(logistic_spec) %>%
fit(data = train_data)
# Assuming logistic_fit has been correctly fitted
predictions <- predict(logistic_fit, new_data = test_data, type = "prob") %>%
bind_cols(test_data)
## New names:
## • `.pred_Left` -> `.pred_Left...2`
## • `.pred_Left` -> `.pred_Left...23`
# Directly use column indexing instead of names for churn probability
# Assuming that the second column of predictions is for "Left" (churn)
likely_churners <- predictions %>%
filter(predictions[[2]] > 0.5) # Adjust the index based on which column is for "Left"
# Calculate the predicted loss
predicted_loss <- sum(likely_churners$MonthlyCharges)
# Display results
cat("Number of likely churners:", nrow(likely_churners), "\n")
## Number of likely churners: 410
cat("Predicted loss in revenue per month if no action is taken: $", predicted_loss, "\n")
## Predicted loss in revenue per month if no action is taken: $ 32346
What is the predicted loss in revenue per month if no action is taken?
predicted_loss <- sum(likely_churners$MonthlyCharges)
cat("Predicted loss in revenue per month if no action is taken: $", predicted_loss, "\n")
## Predicted loss in revenue per month if no action is taken: $ 32346
Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal. You can do this by performing a cost benefit analysis (comparing the cost of the incentive plan to the benefit of retaining the customers).
discount_rate <- 0.1
effectiveness <- 0.2
total_incentive_cost <- sum(likely_churners$MonthlyCharges) * discount_rate * 6
retained_revenue <- predicted_loss * effectiveness * 12
cat("Total incentive cost over 6 months: $", total_incentive_cost, "\n")
## Total incentive cost over 6 months: $ 19407.6
cat("Estimated revenue retained annually: $", retained_revenue, "\n")
## Estimated revenue retained annually: $ 77630.4
cat("Net benefit: $", retained_revenue - total_incentive_cost, "\n")
## Net benefit: $ 58222.8
Model Performance
The logistic regression model was fitted to the training dataset, demonstrating a robust capability to predict customer churn based on various predictors, such as tenure, monthly charges, contract type, and service additions.
Prediction and Risk Assessment
The model was applied to the test dataset to predict churn probabilities:
Number of Likely Churners: The model identified 410 customers likely to churn, with a churn probability above 50%.
Predicted Loss: Without intervention, the estimated monthly revenue loss from these potential churners is approximately $32,346.
Financial Impact and Cost-Benefit Analysis of Proposed Incentives
To counteract potential churn, we recommend creating an incentive scheme involving a 10% discount over six months. The proposed discount scheme is expected to cost less than the potential revenue loss due to churn.
Strategic Implications
The analysis underlines critical areas where focused interventions could substantially reduce churn:
Pricing and Contract Flexibility: Adjustments in pricing strategies or more flexible contract terms could help retain customers identified as high risk.
Loyalty Programs: Implementing or enhancing loyalty programs to reward customer tenure and encourage continued service usage.
Enhanced Customer Support: Offering better support and addressing service issues proactively could improve customer satisfaction and reduce churn.
Limitations
Although we are confident in our analysis and recommendations, we understand there are certain limitations to our project. One being for Regork to obtain and gather additional and more specific data. Additionally, logistic regression assumes a linear relationship between independent variables and the dependent variable, which might oversimplify the real-world complexities of customer behavior. Our final major limitation is that logistic regression can be sensitive to outliers and noise in the data, which can skew the model’s coefficients and affect its overall predictive accuracy.