Regork: Customer Retention Analysis

Synopsis

Problem Statement:

We are working for a company named Regork, and they have recently introduced telecommunication products to the market, including internet service, streaming services, and phone services. We have been tasked by the Regork CEO to develop models to predict whether customers will leave in the future.

The data we are using for this problem is “customer_retention.csv”. This dataset consists of one response variable, “Status” and numerous response variables such as gender, tenure, contract, and total charges.

Our goal is to determine trends in customer retention at Regork through building, training, and testing three different types of machine learning models. The machine learning models we decided to use are logistic regression, decision trees, and random forest. By creating these models, Regork will have distinct solutions to increase customer retention.

Data Preparation

Packages Required:

library(ggplot2)
library(tidymodels)
library(tidyverse)
library(baguette)
library(vip)
library(kernlab)
library(rpart.plot)
library(pdp)
library(ranger)
library(readr)

Data Preparation:

First, we loaded the dataset.

customer_retention <- read_csv("customer_retention.csv")

Then, we converted binary numeric variables to factors.

customer_retention <- customer_retention %>%
  mutate(
    SeniorCitizen = factor(SeniorCitizen, levels = c("0", "1")),
    Partner = as.factor(Partner),
    Dependents = as.factor(Dependents),
    PhoneService = as.factor(PhoneService),
    PaperlessBilling = as.factor(PaperlessBilling),
    Status = as.factor(Status)
  )

Finally, we checked for any missing values and removed them from the dataset.

customer_retention <- customer_retention %>%
  drop_na()

Exploratory Analysis

Our main goal was to identify trends between “Status”, our response variable, and all of the other predictor variables.

Our first step was to assess the baseline churn rate.

churn_rate <- customer_retention %>%
  count(Status) %>%
  mutate(proportion = n / sum(n))

print(churn_rate)

## # A tibble: 2 × 3
##   Status      n proportion
##   <fct>   <int>      <dbl>
## 1 Current  5132      0.734
## 2 Left     1856      0.266

This number will be necessary in the rest of our data visualizations. The churn rate, or better known as turnover rate, is a calculation of the proportion of customers who have joined and since left.

Then we plotted the distribution of customer tenure.

ggplot(customer_retention, aes(x = Tenure)) +
  geom_histogram(fill = 'skyblue', bins = 30) +
  labs(title = "Distribution of Customer Tenure", x = "Tenure (months)", y = "Number of Customers")

This graph was helpful in showing us how long customers typically stay with Regork. What we found is that the numbers are heavily skewed to the right and left. Meaning customers are often either very loyal or not very loyal.

From there, we decided to explore the relationship between tenure and monthly charges with the churn status we had previously calculated.

ggplot(customer_retention, aes(x = Tenure, y = MonthlyCharges, color = Status)) +
  geom_point(alpha = 0.5) +
  labs(title = "Tenure vs Monthly Charges by Customer Status", x = "Tenure (months)", y = "Monthly Charges")

Next we looked into the customer churn by each specific contract type.

ggplot(customer_retention, aes(x = Contract, fill = Status)) +
  geom_bar(position = 'fill') +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Distribution of Customer Status by Contract Type", x = "Contract Type", y = "Percentage")

This graph proved to be very useful as we learned that there is significantly higher customer churn among the month-to-month contract types.

Then, we explored the churn rate by demographic variables like Senior Citizen, Partner, and Dependents.

demographic_vars <- c("SeniorCitizen", "Partner", "Dependents")
for (var in demographic_vars) {
  print(
    ggplot(customer_retention, aes_string(x = var, fill = "Status")) +
      geom_bar(position = "fill") +
      scale_y_continuous(labels = scales::percent_format()) +
      labs(title = paste("Distribution of Customer Status by", var), x = var, y = "Percentage")
  )
}

All of these graphs proved to be very useful. Some insights we gained included that there is a higher churn rate with senior citizens, unmarried customers, and customers without dependents.

Finally, we visualized churn rate by Regork’s additional services like Internet Service, Online Security, etc.

service_vars <- c("InternetService", "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport")
for (var in service_vars) {
  print(
    ggplot(customer_retention, aes_string(x = var, fill = "Status")) +
      geom_bar(position = "fill") +
      scale_y_continuous(labels = scales::percent_format()) +
      labs(title = paste("Distribution of Customer Status by", var), x = var, y = "Percentage")
  )
}

All of these graphs proved to be very useful as well. Some insights we gained include that there is a higher churn rate with customers that have fiber optic, no online security, no online backup, no device protection, and no tech support.

Machine Learning

Set-up:

To begin our machine learning models, we first had to split the data into training and testing sets. Additionally, we needed to set up a recipe for data preprocessing. Finally, we defined the re sampling method for cross-validation.

set.seed(123) 
split <- initial_split(customer_retention, prop = 0.75, strata = Status)
train_data <- training(split)
test_data <- testing(split)

recipe <- recipe(Status ~ ., data = train_data) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric(), -all_outcomes())

cv_folds <- vfold_cv(train_data, v = 5, strata = Status)

Logistic Regression:

logistic_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

logistic_results <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(logistic_spec) %>%
  fit_resamples(cv_folds, metrics = metric_set(roc_auc))

Decision Tree:

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_results <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(tree_spec) %>%
  fit_resamples(cv_folds, metrics = metric_set(roc_auc))

Random Forest:

forest_spec <- rand_forest(trees = 100) %>%
  set_engine("ranger") %>%
  set_mode("classification")

forest_results <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(forest_spec) %>%
  fit_resamples(cv_folds, metrics = metric_set(roc_auc))

Model Metrics:

In order to compare each model’s results to determine the best model, we have printed them side-by-side below.

logistic_metrics <- collect_metrics(logistic_results)
tree_metrics <- collect_metrics(tree_results)
forest_metrics <- collect_metrics(forest_results)

print(logistic_metrics)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.847     5 0.00391 Preprocessor1_Model1

print(tree_metrics)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.727     5  0.0149 Preprocessor1_Model1

print(forest_metrics)

## # A tibble: 1 × 6
##   .metric .estimator  mean     n std_err .config             
##   <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
## 1 roc_auc binary     0.843     5 0.00584 Preprocessor1_Model1

According to the results above, the logistic regression model returns the highest and best AUC compared to the decision tree and random forest models. The random forest model still performed well with an AUC of 0.843, but the logistic regression model outperformed it with an AUC of 0.847.

Logistic Regression - Confusion Matrix

Because we determined the logistic regression model produced the best AUC, we wanted to complete additional analysis like a confusion matrix.

logistic_fit <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(logistic_spec) %>%
  fit(data = train_data)   

prediction_results <- predict(logistic_fit, new_data = test_data, type = "prob")

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

print(colnames(prediction_results))

## [1] ".pred_Current" ".pred_Left"

prediction_results <- cbind(test_data, prediction_results)

print(colnames(prediction_results))

##  [1] "Gender"           "SeniorCitizen"    "Partner"          "Dependents"      
##  [5] "Tenure"           "PhoneService"     "MultipleLines"    "InternetService" 
##  [9] "OnlineSecurity"   "OnlineBackup"     "DeviceProtection" "TechSupport"     
## [13] "StreamingTV"      "StreamingMovies"  "Contract"         "PaperlessBilling"
## [17] "PaymentMethod"    "MonthlyCharges"   "TotalCharges"     "Status"          
## [21] ".pred_Current"    ".pred_Left"

prediction_results$predicted_status <- if_else(prediction_results$.pred_Left > 0.5, "Left", "Current")

prediction_results$predicted_status <- factor(prediction_results$predicted_status, levels = c("Current", "Left"))
prediction_results$Status <- factor(prediction_results$Status, levels = c("Current", "Left"))

confusion_matrix <- conf_mat(data = prediction_results, truth = Status, estimate = predicted_status)

print(confusion_matrix)

##           Truth
## Prediction Current Left
##    Current    1138  199
##    Left        145  265

Our results tell us that the number of instances that were actually “Left” and were correctly predicted as “Left” by the model are 265. The number of instances that were actually “Current” but were incorrectly predicted as “Left” by the model are 145. The number of instances that were actually “Left” but were incorrectly predicted as “Current” by the model are 199. Finally, the number of instances that were actually “Current” and were correctly predicted as “Current” by the model are 1138. Therefore, we had a majority of true negatives.

Logistic Regression - Feature Importance

We also wanted to see which features were the most influential in customer behavior from our logistic regression model.

coef_info <- broom::tidy(logistic_fit) %>%
  filter(term != "(Intercept)") 

ggplot(coef_info, aes(x = reorder(term, estimate), y = estimate, fill = estimate > 0)) +
  geom_col() +
  coord_flip() +  
  labs(x = "Features", y = "Coefficient magnitude",
       title = "Feature Importance in Logistic Regression Model") +
  scale_fill_manual(name = "Influence", 
                    labels = c("Negative", "Positive"), 
                    values = c("red", "blue")) +
  theme_minimal()

print(coef_info)

## # A tibble: 30 × 5
##    term                           estimate std.error statistic   p.value
##    <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
##  1 Tenure                          -1.49      0.179     -8.29   1.13e-16
##  2 MonthlyCharges                  -1.81      1.10      -1.64   1.01e- 1
##  3 TotalCharges                     0.739     0.187      3.95   7.80e- 5
##  4 Gender_Male                     -0.0506    0.0378    -1.34   1.80e- 1
##  5 SeniorCitizen_X1                 0.0592    0.0364     1.62   1.05e- 1
##  6 Partner_Yes                     -0.0340    0.0449    -0.756  4.49e- 1
##  7 Dependents_Yes                  -0.0204    0.0474    -0.429  6.68e- 1
##  8 PhoneService_Yes                 0.133     0.214      0.622  5.34e- 1
##  9 MultipleLines_No.phone.service  NA        NA         NA     NA       
## 10 MultipleLines_Yes                0.262     0.101      2.61   9.04e- 3
## # ℹ 20 more rows

From this analysis we learned that the three most influential positive features in the logistic regression model are InternetService_Fiberoptic, TotalCharges, StreamingMovies_Yes. Reversely, the three most influential negative features are MonthlyCharges, Tenure, and InternetService_No.

Logistic Regression - Final Generalization Error

For our final analysis of our logistic regression model we wanted to look at the final generalization error that the you should expect to see on new data. Additionally we wanted to see how the generalization error would compare to the validation error.

auc_cv <- logistic_metrics$mean[logistic_metrics$.metric == "roc_auc"]

prob_predictions <- predict(logistic_fit, test_data, type = "prob")
left_probs <- prob_predictions$.pred_Left

test_data <- test_data %>%
  mutate(.pred_Left = left_probs)

auc_test <- roc_auc(test_data, truth = Status, .pred_Left)

print(auc_test)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.163

cat("Test Set ROC AUC:", auc_test$.estimate, "\n")

## Test Set ROC AUC: 0.1625669

cat("Cross-Validation ROC AUC:", auc_cv, "\n")

## Cross-Validation ROC AUC: 0.8470813

if (abs(auc_test$.estimate - auc_cv) < 0.05) {
  cat("The model generalizes well with similar performance on both validation and test data.\n")
} else {
  cat("There is a notable difference in performance, suggesting potential overfitting or underfitting.\n")
}

## There is a notable difference in performance, suggesting potential overfitting or underfitting.

Business Analysis

In terms of relative importance how would you rate the predictors in your model. As a business manager, which factors would you focus on (for example you could invest in offering some incentives or promotions) to decrease the chances of customers leaving?

From the logistic regression model, you can use the coefficients to determine the relative importance of predictors. Factors with higher absolute values of coefficients have more influence.

Focus Areas for Business Manager:

Monthly Charges and Contract Type: High coefficients suggest focusing on these areas. Incentives could include discount schemes or more flexible contract terms to reduce churn.
Tenure: Long tenure usually correlates with loyalty. Implement loyalty programs that reward long-term customers.
Additional Services (e.g., Online Security, Streaming Services): Offering bundled services or discounts on these could enhance perceived value and stickiness of the services.

Collect all the customers from the test dataset that you predict are going to leave.

logistic_fit <- workflow() %>%
  add_recipe(recipe) %>%
  add_model(logistic_spec) %>%
  fit(data = train_data)  

# Assuming logistic_fit has been correctly fitted
predictions <- predict(logistic_fit, new_data = test_data, type = "prob") %>%
  bind_cols(test_data)

## New names:
## • `.pred_Left` -> `.pred_Left...2`
## • `.pred_Left` -> `.pred_Left...23`

# Directly use column indexing instead of names for churn probability
# Assuming that the second column of predictions is for "Left" (churn)
likely_churners <- predictions %>%
  filter(predictions[[2]] > 0.5)  # Adjust the index based on which column is for "Left"

# Calculate the predicted loss
predicted_loss <- sum(likely_churners$MonthlyCharges)

# Display results
cat("Number of likely churners:", nrow(likely_churners), "\n")

## Number of likely churners: 410

cat("Predicted loss in revenue per month if no action is taken: $", predicted_loss, "\n")

## Predicted loss in revenue per month if no action is taken: $ 32346

What is the predicted loss in revenue per month if no action is taken?

predicted_loss <- sum(likely_churners$MonthlyCharges)

cat("Predicted loss in revenue per month if no action is taken: $", predicted_loss, "\n")

## Predicted loss in revenue per month if no action is taken: $ 32346

Propose an incentive scheme to your manager to retain these customers. Use your model to justify your proposal. You can do this by performing a cost benefit analysis (comparing the cost of the incentive plan to the benefit of retaining the customers).

discount_rate <- 0.1
effectiveness <- 0.2

total_incentive_cost <- sum(likely_churners$MonthlyCharges) * discount_rate * 6 
retained_revenue <- predicted_loss * effectiveness * 12 

cat("Total incentive cost over 6 months: $", total_incentive_cost, "\n")

## Total incentive cost over 6 months: $ 19407.6

cat("Estimated revenue retained annually: $", retained_revenue, "\n")

## Estimated revenue retained annually: $ 77630.4

cat("Net benefit: $", retained_revenue - total_incentive_cost, "\n")

## Net benefit: $ 58222.8

Conclusion

Model Performance

The logistic regression model was fitted to the training dataset, demonstrating a robust capability to predict customer churn based on various predictors, such as tenure, monthly charges, contract type, and service additions.

Prediction and Risk Assessment

The model was applied to the test dataset to predict churn probabilities:

Number of Likely Churners: The model identified 410 customers likely to churn, with a churn probability above 50%.
Predicted Loss: Without intervention, the estimated monthly revenue loss from these potential churners is approximately $32,346.

Financial Impact and Cost-Benefit Analysis of Proposed Incentives

To counteract potential churn, we recommend creating an incentive scheme involving a 10% discount over six months. The proposed discount scheme is expected to cost less than the potential revenue loss due to churn.

Strategic Implications

The analysis underlines critical areas where focused interventions could substantially reduce churn:

Pricing and Contract Flexibility: Adjustments in pricing strategies or more flexible contract terms could help retain customers identified as high risk.
Loyalty Programs: Implementing or enhancing loyalty programs to reward customer tenure and encourage continued service usage.
Enhanced Customer Support: Offering better support and addressing service issues proactively could improve customer satisfaction and reduce churn.

Limitations

Although we are confident in our analysis and recommendations, we understand there are certain limitations to our project. One being for Regork to obtain and gather additional and more specific data. Additionally, logistic regression assumes a linear relationship between independent variables and the dependent variable, which might oversimplify the real-world complexities of customer behavior. Our final major limitation is that logistic regression can be sensitive to outliers and noise in the data, which can skew the model’s coefficients and affect its overall predictive accuracy.

Final Project