Churn Prediction Analysis

Author

R4DataScienceNigeria Team

Welcome 🤗 to R4DS Nigeria’s end-to-end data science project, where we harness the power of data to drive insights and inform decision-making. Our team has developed a Quarto dashboard to analyze customer churn, leveraging machine learning workflows with tidymodels and Quarto. Explore the live dashboard to visualize our findings and learn more about our approach: click here. click here

Many thanks to the Posit Team (formerly RStudio) and Simon P Couch, as many of the code and ideas come from his End-to-End Machine Learning Workflow talk. Feel free to reuse the codes - no license needed! Our community is grateful for the wealth of free resources and we echo Posit’s mission to make data science accessible to everyone - that’s at the heart of everything we do at R4DS Nigeria.

Our vision is to learn, grow, and explore the world of data science together with ❤️.

Please note that this project was developed by beginner and intermediate R users, not experts. We’re open to learning from your comments, critiques, or contributions - your feedback will help us grow and improve!

Collaborators

R4DataScienceNigeria Team:

Victor Arowolo.
Eyebiokin Oluwaseun
Adesanya Oluwasegun Qudus
Adebona Oreoluwa Elizabeth
Yusuff Olatunji Sikiru
Chizoma Chidiebube Chikere
Omowunmi Adebayo-Ajeyomi
Okoye Gloria. I.
Adebisi Saheed D.
Aina Ayomide
Azeez Babatunde Akintonde
Omisore Halleluyah
Ojile Cecilia
Abdulwahab Morohunmubo Olajumoke
Daniel Oluwafemi Olofin

1. Customer Retention Analysis

This documentation outlines the development of a robust analysis of customer churn and the development of a dashboard and machine learning model to predict churn.

1.1 Introduction

In today’s competitive financial sector, customer retention is crucial for banks. Increasing customer churn is impacting revenue and raising the cost of acquiring new customers. Despite various engagement strategies, banks lack a reliable system to predict and prevent customer attrition.

1.2 Aim and Objectives

The primary aim of this project is to develop a churn dashboard and a robust machine mearning model that accurately predicts customer churn for banks. Leveraging historical customer data, including transaction patterns, product usage, account tenure, and demographic information, the model will identify key factors contributing to customer attrition.

Aim: To develop dashboard and a predictive model that accurately identifies customers who are likely to churn.
Objectives:
- Collect and preprocess customer data.
- Explore and analyze the data to identify key features influencing churn.
- Develop a chrun dashboard.
- Build and evaluate multiple machine learning models.
- Deploy the best-performing model.
Key Deliverable:
- Recommend Retention Strategies: Develop targeted retention strategies and personalized interventions to improve customer satisfaction and reduce churn rates.

2. Data Information

2.1 Data Source

Source: The data-set was obtained from a multinational bank with branches in Nigeria and across Africa. It was generated in 2023.
Description: The data-set contains 19 variables and 500,000 rows. It is part of a larger data-set of over 20 million records

2.2 Metadata

Columns:
- acct_id: A unique identifier for each customer account.
- years: The number of years a customer has been with the bank.
- churn: A binary indicator of whether the customer has churned (e.g., 0 for not churned, 1 for churned).
- risk_rating: A rating or score that reflects the financial risk associated with the customer.
- currency: The currency used in the customer’s account (e.g., NGN, USD, EUR).
- ave_bal: The average balance in the customer’s account over a specified period.
- scheme_type: The type of banking scheme or product the customer is using. Y = Yes; N = No.
- mobile*app*adoption: Indicates whether the customer uses the bank’s mobile app. Y = Yes; N = No.
- internet_banking_adoption: Indicates whether the customer uses internet banking. Y = Yes; N = No.
- ussd_banking_adoption: Indicates whether the customer uses USSD banking services. Y = Yes; N = No.
- digital_loan: Indicates whether the customer has taken a digital loan. Y = Yes; N = No.
- unsecured_loan: Indicates whether the customer has an unsecured loan. Y = Yes; N = No.
- termloan_status: Status of the customer’s term loan. Y = Yes; N = No.
- credit_card: Indicates whether the customer holds a credit card with the bank. Y = Yes; N = No.
- subsegment: Total volume of credit transactions over the last 12 months.
  
  last_12_months_credit_volume: Total volume of credit transactions over the last 12 months.
- last_12_months_debit_volume: Total volume of debit transactions over the last 12 months.
- last_12_months_debit_value: Total value of debit transactions over the last 12 months.
- last_12_months_credit_value: Total value of credit transactions over the last 12 months.

3. Data Cleaning

Change Column Names to Lower Case

clean_names()
- Convert all column names to lower case for consistency.

Rename Long Column Names

rename(
  years = "years_with_bank",
  risk = "risk_rating",
  scheme = "scheme_type",
  mobile_app = "mobile_app_adoption",
  internet_banking = "internet_banking_adoption",
  ussd_banking = "ussd_banking_adoption",
  termloan = "termloan_status",
  credit_vol = "last_12_months_credit_volume",
  debit_vol = "last_12_months_debit_volume",
  debit_val = "last_12_months_debit_value",
  credit_val = "last_12_months_credit_value"
)

Renames long column names to shorter, more manageable names.

Filter Out Specific Values
```
filter(!(ave %in% c("GBP", "JPY", "NGN", "SBA", "USD")))
```
- Removes rows where the ave column contains specific currency codes instead of average balance.

Convert Numeric Columns.

mutate(across(c(ave, subsegment, debit_vol, debit_val, credit_val), 
              ~ parse_number(str_replace_all(., ",", "") %>%
                               str_replace_all("-", "0"))))

Converts columns to numeric by removing commas and replacing hyphens with zeros.

Convert Character Columns to Factors
```
mutate(across(where(is.character), as.factor))
```
- Converts all character columns to factors for better handling in modeling.
Recode Churn Column.
```
mutate(churn = factor(ifelse(churn == 1, "churned", "not churned")))
```
- Recodes the churn column to a factor with levels “churned” and “not churned”.
Save Clean Data
```
saveRDS(new_churn, file = "Team A/clean_churn.rds")
```
- Saves the cleaned dataset to an RDS file for future use.

4. Data Preprocessing

Normalize numerical features.
Encode categorical features.
Split the data into training and testing sets.

5. Data Exploration

Summary statistics of the data.
Visualizations (e.g., histograms, bar charts, correlation matrix).
Insights from exploratory data analysis.

6. Modelling

Define and train multiple models (e.g., logistic regression, decision tree, random forest).
Evaluate models using metrics such as accuracy, precision, recall, ROC curve, feature importance and F1 score.
Select the best-performing model.
Summary of findings and recommendations

Setup

First, loading tidymodels, along with a few additional tidymodels extension packages:

library(tidymodels)
library(baguette)
library(xgboost)
library(finetune)
library(bundle)
library(readr)
library(vetiver)
library(pins)

tidymodels supports a number of R frameworks for parallel computing:

library(doMC)
library(parallelly)

availableCores()

registerDoMC(cores = max(1, availableCores() - 1))

Now, loading data:

# Load and preprocess data
churn_df <- read_rds("Team A/clean_churn.rds") %>%
  select(-c(credit_val, x21)) %>%
  rename(churn_status = "churn")

Splitting up data

We split data into training and testing sets so that, once we’ve trained our final model, we can get an honest assessment of the model’s performance:

# Reduce dataset size if needed
set.seed(1)
churn_df <- churn_df %>% sample_n(50000)  # Keep 50,000 rows for efficiency

# Split data
set.seed(1)
churn_split <- initial_split(churn_df)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)

# Reduce CV folds from 10 to 3
churn_folds <- vfold_cv(churn_train, v = 3)

Defining our modeling strategies

Our basic strategy is to first try out a bunch of different modeling approaches, and once we have an initial sense for how they perform, delve further into the one that looks the most promising.

We first define a few recipes, which specify how to process the inputted data in such a way that machine learning models will know how to work with predictors:

# Feature engineering recipe
recipe_mixed <- recipe(churn_status ~ ., data = churn_train) %>%
  step_rm(acct_id) %>%  # Remove unique ID
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%  # Remove highly correlated variables
  step_nzv(all_predictors()) %>%  # Remove near-zero variance predictors
  step_YeoJohnson(all_numeric_predictors()) %>%  
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_zv(all_predictors())  # Remove zero-variance predictors

These recipes vary in complexity, from basic checks on the input data to advanced feature engineering techniques like principal component analysis.

We also define several model specifications. tidymodels comes with support for all sorts of machine learning algorithms, from neural networks to XGBoost boosted trees to plain old logistic regression:

# Choose two models (logistic regression and XGBoost)
spec_lr <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")

spec_xgb <- boost_tree(trees = 50, min_n = tune(), tree_depth = tune(),
                       learn_rate = 0.1) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

Note how similar the code for each of these model specifications looks! tidymodels takes care of the “translation” from our unified syntax to the code that these algorithms expect.

If typing all of these out seems cumbersome to you, or you’re not sure how to define a model specification that makes sense for your data, the usemodels RStudio addin may help!

Evaluating models: round 1

We’ll pair machine learning models with the recipes that make the most sense for them:

# Performance metrics
churn_metrics <- metric_set(roc_auc, accuracy)

# Workflow set with fewer models
wf_set <- workflow_set(
  preproc = list(mixed = recipe_mixed),
  models = list(
    logistic_reg = spec_lr,
    boost_tree = spec_xgb
  )
)

 #Define control settings (without grid argument)
ctrl <- control_grid(save_pred = TRUE, parallel_over = "everything")

# Train models with a smaller grid (grid = 5)
wf_set_fit <- workflow_map(
  wf_set, 
  fn = "tune_grid", 
  verbose = TRUE, 
  seed = 1,
  resamples = churn_folds, 
  metrics = churn_metrics,
  grid = 5,  # Specify grid size here, not in control_grid
  control = ctrl
)

save(wf_set_fit, file = "Team C/models/wf_set_fit.Rda")

wf_set_fit <-
  wf_set_fit[
    map_lgl(wf_set_fit$result, 
            ~pluck(., ".metrics", 1) %>% inherits("tbl_df"), 
            "tune_results"),
  ]
# First look at metrics:
metrics_wf_set <- collect_metrics(wf_set_fit, summarize = FALSE)

Taking a look at how these models did:

metrics_wf_set %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(.estimate))

Evaluating models: round 2

It looks like XGBoost with minimal preprocessing was considerably more performant than the other proposed models. Let’s work with those XGBoost results and see if we can make any improvements to performance:

save(metrics_wf_set, file = "Team C/models/metrics_wf_set.Rda")

# Extract best XGBoost model
xgb_res <- extract_workflow_set_result(wf_set_fit, "mixed_boost_tree")

xgb_wflow <-
  workflow() %>%
  add_recipe(recipe_mixed) %>%
  add_model(spec_xgb)

# Simulated annealing tuning
xgb_sim_anneal_fit <-
  tune_sim_anneal(
    object = xgb_wflow,
    resamples = churn_folds,
    iter = 25,
    metrics = churn_metrics,
    initial = xgb_res,
    control = control_sim_anneal(verbose = TRUE, parallel_over = "everything")
  )

save(xgb_sim_anneal_fit, file = "Team C/models/xgb_sim_anneal_fit.Rda")

metrics_xgb <- collect_metrics(xgb_sim_anneal_fit, summarize = FALSE)

save(metrics_xgb, file = "Team C/models/metrics_xgb.Rda")

Looks like we did make a small improvement:

The final model fit

The last_fit() function will take care of fitting the most performant model specification to the whole training dataset:

# Final model fit and evaluation against test set
xgb_final_fit <-
  last_fit(
    finalize_workflow(xgb_wflow, select_best(xgb_sim_anneal_fit, metric = "roc_auc")),
    churn_split
  )

save(xgb_final_fit, file = "Team C/models/xgb_final_fit.Rda")
final_fit <- bundle(xgb_final_fit$.workflow[[1]])
save(final_fit, file = "Team C/models/final_fit.Rda")

7. Dashboard Building

Quarto was used to create the dashboard which is accessible via this link churn dashboard

8. Deployment

Deploy the dashboard using a web hosting service (e.g., GitHub Pages, Netlify).
Ensure the dashboard is accessible and user-friendly.

9. Deploying to Connect

From here, all we need to do to deploy our fitted model is pass it off to vetiver for deployment to Posit Connect:

final_fit_vetiver <- vetiver_model(final_fit, "final_fit")

board <- board_connect()

vetiver_pin_write(board, final_fit_vetiver)

vetiver_deploy_rsconnect(board, "username/final_fit")