rename(
years = "years_with_bank",
risk = "risk_rating",
scheme = "scheme_type",
mobile_app = "mobile_app_adoption",
internet_banking = "internet_banking_adoption",
ussd_banking = "ussd_banking_adoption",
termloan = "termloan_status",
credit_vol = "last_12_months_credit_volume",
debit_vol = "last_12_months_debit_volume",
debit_val = "last_12_months_debit_value",
credit_val = "last_12_months_credit_value"
)
Churn Prediction Analysis
Welcome 🤗 to R4DS Nigeria’s end-to-end data science project, where we harness the power of data to drive insights and inform decision-making. Our team has developed a Quarto dashboard to analyze customer churn, alongside a machine learning workflow built with tidymodels. Explore the live dashboard to visualize our findings and learn more about our approach: click here.
Many thanks to the Posit Team (formerly RStudio) and Simon P Couch, as much of the code and many of the ideas come from his End-to-End Machine Learning Workflow talk. Feel free to reuse the code - no license needed! Our community is grateful for the wealth of free resources and we echo Posit’s mission to make data science accessible to everyone - that’s at the heart of everything we do at R4DS Nigeria.
Our vision is to learn, grow, and explore the world of data science together with ❤️.
Please note that this project was developed by beginner and intermediate R users, not experts. We’re open to learning from your comments, critiques, or contributions - your feedback will help us grow and improve!
Collaborators
- R4DataScienceNigeria Team:
- Victor Arowolo.
- Eyebiokin Oluwaseun
- Adesanya Oluwasegun Qudus
- Adebona Oreoluwa Elizabeth
- Yusuff Olatunji Sikiru
- Chizoma Chidiebube Chikere
- Omowunmi Adebayo-Ajeyomi
- Okoye Gloria. I.
- Adebisi Saheed D.
- Aina Ayomide
- Azeez Babatunde Akintonde
- Omisore Halleluyah
- Ojile Cecilia
- Abdulwahab Morohunmubo Olajumoke
- Daniel Oluwafemi Olofin
1. Customer Retention Analysis
This documentation outlines our analysis of customer churn and the development of a dashboard and a machine learning model to predict churn.
1.1 Introduction
In today’s competitive financial sector, customer retention is crucial for banks. Increasing customer churn is impacting revenue and raising the cost of acquiring new customers. Despite various engagement strategies, banks lack a reliable system to predict and prevent customer attrition.
1.2 Aim and Objectives
The primary aim of this project is to develop a churn dashboard and a robust machine learning model that accurately predicts customer churn for banks. Leveraging historical customer data, including transaction patterns, product usage, account tenure, and demographic information, the model will identify key factors contributing to customer attrition.
- Aim: To develop a dashboard and a predictive model that accurately identifies customers who are likely to churn.
- Objectives:
- Collect and preprocess customer data.
- Explore and analyze the data to identify key features influencing churn.
- Develop a churn dashboard.
- Build and evaluate multiple machine learning models.
- Deploy the best-performing model.
- Key Deliverable:
- Recommend Retention Strategies: Develop targeted retention strategies and personalized interventions to improve customer satisfaction and reduce churn rates.
2. Data Information
2.1 Data Source
- Source: The dataset was obtained from a multinational bank with branches in Nigeria and across Africa. It was generated in 2023.
- Description: The dataset contains 19 variables and 500,000 rows. It is part of a larger dataset of over 20 million records.
2.2 Metadata
- Columns:
- acct_id: A unique identifier for each customer account.
- years: The number of years a customer has been with the bank.
- churn: A binary indicator of whether the customer has churned (e.g., 0 for not churned, 1 for churned).
- risk_rating: A rating or score that reflects the financial risk associated with the customer.
- currency: The currency used in the customer’s account (e.g., NGN, USD, EUR).
- ave_bal: The average balance in the customer’s account over a specified period.
- scheme_type: The type of banking scheme or product the customer is using. Y = Yes; N = No.
- mobile_app_adoption: Indicates whether the customer uses the bank’s mobile app. Y = Yes; N = No.
- internet_banking_adoption: Indicates whether the customer uses internet banking. Y = Yes; N = No.
- ussd_banking_adoption: Indicates whether the customer uses USSD banking services. Y = Yes; N = No.
- digital_loan: Indicates whether the customer has taken a digital loan. Y = Yes; N = No.
- unsecured_loan: Indicates whether the customer has an unsecured loan. Y = Yes; N = No.
- termloan_status: Status of the customer’s term loan. Y = Yes; N = No.
- credit_card: Indicates whether the customer holds a credit card with the bank. Y = Yes; N = No.
- subsegment: Total volume of credit transactions over the last 12 months.
- last_12_months_credit_volume: Total volume of credit transactions over the last 12 months.
- last_12_months_debit_volume: Total volume of debit transactions over the last 12 months.
- last_12_months_debit_value: Total value of debit transactions over the last 12 months.
- last_12_months_credit_value: Total value of credit transactions over the last 12 months.
3. Data Cleaning
Change Column Names to Lower Case
clean_names()
- Convert all column names to lower case for consistency.
Rename Long Column Names
- Renames long column names to shorter, more manageable names.
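As a small illustration of the new_name = "old_name" pattern that rename() expects, the sketch below uses a made-up two-row tibble. The values are invented; only the column names are taken from this project.

```r
library(dplyr)

# Toy data: values are invented, column names follow the raw dataset
raw <- tibble(
  years_with_bank = c(3, 10),
  risk_rating = c("LOW", "HIGH")
)

# rename() takes new_name = old_name pairs
renamed <- raw %>%
  rename(
    years = "years_with_bank",
    risk = "risk_rating"
  )

names(renamed)
```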
Filter Out Specific Values
filter(!(ave %in% c("GBP", "JPY", "NGN", "SBA", "USD")))
- Removes rows where the ave column contains specific currency codes instead of an average balance.
Convert Numeric Columns
mutate(across(c(ave, subsegment, debit_vol, debit_val, credit_val), ~ parse_number(str_replace_all(., ",", "") %>% str_replace_all("-", "0"))))
- Converts columns to numeric by removing commas and replacing hyphens with zeros.
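One caveat with replacing every hyphen: a negative amount such as "-500" would become "0500" and parse as 500. If a lone "-" is what marks an empty or zero amount in the raw file (an assumption about this dataset, not something the source confirms), a more conservative sketch replaces only cells that consist of a single hyphen. The helper name parse_amount is hypothetical.

```r
library(readr)
library(stringr)

# Hypothetical helper: treat a lone "-" as zero, keep real negatives intact
parse_amount <- function(x) {
  x <- str_trim(x)
  x <- str_replace(x, "^-$", "0")  # only a cell that is exactly "-"
  parse_number(x)                  # default locale drops "," grouping marks
}

amounts <- parse_amount(c("1,250.50", "-", "-500"))
amounts
```

It could be dropped into the same mutate(across(...)) call in place of the anonymous function.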
Convert Character Columns to Factors
mutate(across(where(is.character), as.factor))
- Converts all character columns to factors for better handling in modeling.
Recode Churn Column
mutate(churn = factor(ifelse(churn == 1, "churned", "not churned")))
- Recodes the churn column to a factor with levels “churned” and “not churned”.
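Because factor() orders levels alphabetically, "churned" becomes the first level after this recode. yardstick treats the first level as the event of interest by default, so metrics such as roc_auc are then computed for the churned class. A quick check on invented values:

```r
# Invented 0/1 churn flags, recoded the same way as above
churn <- c(0, 1, 0, 0, 1)
churn_fct <- factor(ifelse(churn == 1, "churned", "not churned"))

levels(churn_fct)  # the first level is the one yardstick uses as the event
table(churn_fct)
```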
Save Clean Data
saveRDS(new_churn, file = "Team A/clean_churn.rds")
- Saves the cleaned dataset to an RDS file for future use.
4. Data Preprocessing
- Normalize numerical features.
- Encode categorical features.
- Split the data into training and testing sets.
5. Data Exploration
- Summary statistics of the data.
- Visualizations (e.g., histograms, bar charts, correlation matrix).
- Insights from exploratory data analysis.
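A minimal sketch of the kind of summary this step produces, assuming a grouped churn rate is of interest. The six rows are invented and stand in for the cleaned data.

```r
library(dplyr)

# Invented rows standing in for the cleaned churn data
toy <- tibble(
  churn = factor(c("churned", "not churned", "not churned",
                   "churned", "not churned", "not churned")),
  mobile_app = c("Y", "Y", "N", "N", "Y", "Y"),
  years = c(1, 8, 3, 2, 10, 6)
)

# Churn rate and typical tenure by mobile app adoption
churn_by_app <- toy %>%
  group_by(mobile_app) %>%
  summarise(
    customers = n(),
    churn_rate = mean(churn == "churned"),
    median_years = median(years),
    .groups = "drop"
  )

churn_by_app
```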
6. Modelling
- Define and train multiple models (e.g., logistic regression, decision tree, random forest).
- Evaluate models using metrics such as accuracy, precision, recall, F1 score, and ROC AUC, and review feature importance.
- Select the best-performing model.
- Summarize findings and recommendations.
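To make the metrics listed above concrete, here is a hedged sketch using yardstick on six invented predictions. The column names .pred_class and .pred_churned follow the usual tidymodels naming, and the numbers are made up for illustration.

```r
library(yardstick)
library(tibble)

# Invented predictions for six customers; "churned" is the first (event) level
lv <- c("churned", "not churned")
toy <- tibble(
  truth = factor(c("churned", "churned", "churned",
                   "not churned", "not churned", "not churned"), levels = lv),
  .pred_class = factor(c("churned", "churned", "not churned",
                         "not churned", "not churned", "churned"), levels = lv),
  .pred_churned = c(0.9, 0.8, 0.4, 0.2, 0.1, 0.6)
)

# Class-based metrics are computed from the hard predictions
class_metrics <- metric_set(accuracy, precision, recall, f_meas)
class_res <- class_metrics(toy, truth = truth, estimate = .pred_class)

# ROC AUC uses the predicted probability of the event level
auc_res <- roc_auc(toy, truth, .pred_churned)

class_res
auc_res
```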
Setup
First, we load tidymodels along with a few extension and supporting packages:
library(tidymodels)
library(baguette)
library(xgboost)
library(finetune)
library(bundle)
library(readr)
library(vetiver)
library(pins)
tidymodels supports a number of R frameworks for parallel computing:
library(doMC)
library(parallelly)
availableCores()
registerDoMC(cores = max(1, availableCores() - 1))
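doMC relies on forking, so it is not available on Windows. For collaborators on Windows, one possible alternative is a PSOCK cluster registered through doParallel, sketched below. This is an assumption about what might suit your setup rather than what the project used, and depending on the installed tune version the future framework may be the preferred backend, so it is worth checking the package documentation.

```r
library(doParallel)  # also attaches foreach and parallel

# Leave one core free; fall back to a single worker on small machines
n_workers <- max(1, parallelly::availableCores() - 1)

cl <- parallel::makePSOCKcluster(n_workers)
registerDoParallel(cl)

workers_seen <- foreach::getDoParWorkers()
workers_seen

# Release the workers once tuning is finished
parallel::stopCluster(cl)
foreach::registerDoSEQ()
```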
Now, loading data:
# Load and preprocess data
churn_df <- read_rds("Team A/clean_churn.rds") %>%
  select(-c(credit_val, x21)) %>%
  rename(churn_status = "churn")
Splitting up data
We split data into training and testing sets so that, once we’ve trained our final model, we can get an honest assessment of the model’s performance:
# Reduce dataset size if needed
set.seed(1)
churn_df <- churn_df %>% sample_n(50000) # Keep 50,000 rows for efficiency
# Split data
set.seed(1)
churn_split <- initial_split(churn_df)
churn_train <- training(churn_split)
churn_test <- testing(churn_split)
# Reduce CV folds from 10 to 3
churn_folds <- vfold_cv(churn_train, v = 3)
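Churn outcomes are often imbalanced, and a plain random split can leave the training and testing sets with slightly different churn rates. A possible refinement, not used in the code above, is the strata argument of initial_split() and vfold_cv(). The sketch below uses an invented dataset with a 10% churn rate.

```r
library(rsample)
library(tibble)

set.seed(1)
# Invented, imbalanced outcome: 100 churned vs 900 not churned
toy <- tibble(
  churn_status = factor(rep(c("churned", "not churned"), times = c(100, 900))),
  ave_bal = runif(1000, 0, 1e6)
)

# strata keeps the churn rate similar in the training and testing sets
toy_split <- initial_split(toy, prop = 0.75, strata = churn_status)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

mean(toy_train$churn_status == "churned")
mean(toy_test$churn_status == "churned")

# The same argument is available for resampling
toy_folds <- vfold_cv(toy_train, v = 3, strata = churn_status)
```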
Defining our modeling strategies
Our basic strategy is to first try out a bunch of different modeling approaches, and once we have an initial sense for how they perform, delve further into the one that looks the most promising.
We first define a recipe, which specifies how to process the input data in such a way that machine learning models will know how to work with the predictors:
# Feature engineering recipe
recipe_mixed <- recipe(churn_status ~ ., data = churn_train) %>%
  step_rm(acct_id) %>% # Remove unique ID
  step_corr(all_numeric_predictors(), threshold = 0.9) %>% # Remove highly correlated variables
  step_nzv(all_predictors()) %>% # Remove near-zero variance predictors
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors(), -all_outcomes()) %>%
  step_zv(all_predictors()) # Remove zero-variance predictors
This recipe combines basic checks on the input data (dropping the ID column and uninformative or highly correlated predictors) with feature engineering steps such as the Yeo-Johnson transformation, normalization, and dummy encoding.
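To see what a recipe actually produces, it can be estimated with prep() and applied with bake(). The sketch below does this for a shortened recipe on invented data; the column names mirror this project, but the values and the reduced set of steps are assumptions made for illustration.

```r
library(recipes)
library(dplyr)

set.seed(1)
# Invented data with an ID, one factor predictor and two numeric predictors
toy <- tibble(
  acct_id = 1:200,
  churn_status = factor(sample(c("churned", "not churned"), 200, replace = TRUE)),
  ave_bal = rexp(200, rate = 1 / 50000),
  debit_vol = rpois(200, 20),
  mobile_app = factor(sample(c("Y", "N"), 200, replace = TRUE))
)

toy_recipe <- recipe(churn_status ~ ., data = toy) %>%
  step_rm(acct_id) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# prep() estimates the transformations; bake(new_data = NULL) returns
# the processed training data
baked <- toy_recipe %>% prep() %>% bake(new_data = NULL)

names(baked)
summary(baked$ave_bal)
```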
We also define two model specifications. tidymodels comes with support for all sorts of machine learning algorithms, from neural networks to XGBoost boosted trees to plain old logistic regression:
# Choose two models (logistic regression and XGBoost)
spec_lr <- logistic_reg() %>%
  set_mode("classification") %>%
  set_engine("glm")
spec_xgb <- boost_tree(trees = 50, min_n = tune(), tree_depth = tune(),
                       learn_rate = 0.1) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Note how similar the code for each of these model specifications looks! tidymodels takes care of the “translation” from our unified syntax to the code that these algorithms expect.
If typing all of these out seems cumbersome to you, or you’re not sure how to define a model specification that makes sense for your data, the usemodels RStudio addin may help!
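The "translation" mentioned above can be inspected directly: parsnip's translate() prints the engine-level call template that a specification will be turned into at fit time. A minimal sketch for the logistic regression specification:

```r
library(parsnip)

spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# translate() shows the underlying engine call that parsnip will build
translate(spec)
```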
Evaluating models: round 1
We’ll pair machine learning models with the recipes that make the most sense for them:
# Performance metrics
churn_metrics <- metric_set(roc_auc, accuracy)
# Workflow set with fewer models
wf_set <- workflow_set(
  preproc = list(mixed = recipe_mixed),
  models = list(
    logistic_reg = spec_lr,
    boost_tree = spec_xgb
  )
)
# Define control settings (without grid argument)
ctrl <- control_grid(save_pred = TRUE, parallel_over = "everything")
# Train models with a smaller grid (grid = 5)
wf_set_fit <- workflow_map(
  wf_set,
  fn = "tune_grid",
  verbose = TRUE,
  seed = 1,
  resamples = churn_folds,
  metrics = churn_metrics,
  grid = 5, # Specify grid size here, not in control_grid
  control = ctrl
)
save(wf_set_fit, file = "Team C/models/wf_set_fit.Rda")
# Keep only the workflows whose tuning results contain metrics
wf_set_fit <-
  wf_set_fit[
    map_lgl(wf_set_fit$result,
            ~ pluck(., ".metrics", 1) %>% inherits("tbl_df"),
            "tune_results"),
  ]
# First look at metrics:
metrics_wf_set <- collect_metrics(wf_set_fit, summarize = FALSE)
Taking a look at how these models did:
metrics_wf_set %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(.estimate))
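Since the metrics are collected with summarize = FALSE, there is one row per resample, and an averaged view per workflow can be easier to compare. The sketch below does this on an invented tibble shaped like that output; the estimates are made up, and real results for the boosted tree would also carry a .config column for each tuning candidate.

```r
library(dplyr)

# Invented per-fold results shaped like collect_metrics(..., summarize = FALSE)
toy_metrics <- tibble(
  wflow_id = rep(c("mixed_logistic_reg", "mixed_boost_tree"), each = 3),
  id = rep(c("Fold1", "Fold2", "Fold3"), times = 2),
  .metric = "roc_auc",
  .estimate = c(0.70, 0.72, 0.71, 0.80, 0.83, 0.80)
)

# Average ROC AUC across folds and rank the workflows
toy_summary <- toy_metrics %>%
  filter(.metric == "roc_auc") %>%
  group_by(wflow_id) %>%
  summarise(mean_auc = mean(.estimate), n_folds = n(), .groups = "drop") %>%
  arrange(desc(mean_auc))

toy_summary
```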
Evaluating models: round 2
It looks like XGBoost (the mixed_boost_tree workflow) was considerably more performant than the logistic regression. Let’s work with those XGBoost results and see if we can make any improvements to performance:
save(metrics_wf_set, file = "Team C/models/metrics_wf_set.Rda")
# Extract best XGBoost model
xgb_res <- extract_workflow_set_result(wf_set_fit, "mixed_boost_tree")
xgb_wflow <-
  workflow() %>%
  add_recipe(recipe_mixed) %>%
  add_model(spec_xgb)
# Simulated annealing tuning
xgb_sim_anneal_fit <-
  tune_sim_anneal(
    object = xgb_wflow,
    resamples = churn_folds,
    iter = 25,
    metrics = churn_metrics,
    initial = xgb_res,
    control = control_sim_anneal(verbose = TRUE, parallel_over = "everything")
  )
save(xgb_sim_anneal_fit, file = "Team C/models/xgb_sim_anneal_fit.Rda")
metrics_xgb <- collect_metrics(xgb_sim_anneal_fit, summarize = FALSE)
save(metrics_xgb, file = "Team C/models/metrics_xgb.Rda")
Looks like we did make a small improvement:
The final model fit
The last_fit() function will take care of fitting the most performant model specification to the whole training dataset and evaluating it on the test set:
# Final model fit and evaluation against test set
xgb_final_fit <-
  last_fit(
    finalize_workflow(xgb_wflow, select_best(xgb_sim_anneal_fit, metric = "roc_auc")),
    churn_split
  )
save(xgb_final_fit, file = "Team C/models/xgb_final_fit.Rda")
# Bundle the fitted workflow so it can be saved and reloaded safely
final_fit <- bundle(xgb_final_fit$.workflow[[1]])
save(final_fit, file = "Team C/models/final_fit.Rda")
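To read the test-set results back out of a last_fit() object, collect_metrics() and collect_predictions() can be used. The sketch below runs the whole pattern end to end on simulated data with a plain logistic regression, so it finishes quickly; the data, the coefficients used to simulate churn, and the toy_ object names are all invented for illustration.

```r
library(tidymodels)

set.seed(1)
n <- 400
# Simulated customers: churn probability depends on two invented predictors
toy <- tibble(
  ave_bal = rnorm(n),
  debit_vol = rnorm(n)
) %>%
  mutate(
    p_churn = plogis(1.5 * debit_vol - ave_bal),
    churn_status = factor(ifelse(runif(n) < p_churn, "churned", "not churned"))
  ) %>%
  select(-p_churn)

toy_split <- initial_split(toy, strata = churn_status)

toy_wflow <- workflow() %>%
  add_formula(churn_status ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

# last_fit() trains on the training portion and scores the held-out test portion
toy_final <- last_fit(toy_wflow, toy_split, metrics = metric_set(roc_auc, accuracy))

toy_test_metrics <- collect_metrics(toy_final)  # test-set metrics
toy_test_metrics
head(collect_predictions(toy_final))            # test-set predictions
```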
7. Dashboard Building
Quarto was used to create the dashboard, which is accessible via this link: churn dashboard.
8. Deployment
- Deploy the dashboard using a web hosting service (e.g., GitHub Pages, Netlify).
- Ensure the dashboard is accessible and user-friendly.
9. Deploying to Connect
From here, all we need to do to deploy our fitted model is pass it off to vetiver for deployment to Posit Connect:
final_fit_vetiver <- vetiver_model(final_fit, "final_fit")
board <- board_connect()
vetiver_pin_write(board, final_fit_vetiver)
vetiver_deploy_rsconnect(board, "username/final_fit")
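Before publishing to Connect, the pin-and-read round trip can be rehearsed locally with a temporary pins board. The sketch below is a dry run under stated assumptions: the data are invented, the model is a simple logistic regression rather than the tuned XGBoost workflow, and the pin name toy_churn_model is hypothetical.

```r
library(tidymodels)
library(vetiver)
library(pins)

set.seed(1)
# Invented training data
toy <- tibble(
  ave_bal = rnorm(200),
  churn_status = factor(sample(c("churned", "not churned"), 200, replace = TRUE))
)

toy_fit <- workflow() %>%
  add_formula(churn_status ~ ave_bal) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(data = toy)

# Wrap the fitted workflow and pin it to a temporary local board
v <- vetiver_model(toy_fit, "toy_churn_model")
local_board <- board_temp(versioned = TRUE)
vetiver_pin_write(local_board, v)

# Read the model back and predict with the stored workflow
v_back <- vetiver_pin_read(local_board, "toy_churn_model")
preds <- predict(v_back$model, new_data = tibble(ave_bal = 0.5))
preds
```

Swapping board_temp() for board_connect() gives the deployment path shown above.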