Churn analysis of bank clients

In this tutorial, we will go through a step-by-step guide on how to perform banking churn analysis in R. We will be using a dataset containing customer information from a bank, including demographic information, transaction data, and whether or not the customer churned. Our goal is to build a predictive model that can accurately predict whether a customer will churn or not based on their information.

Step 1: Load the Data

The first step is to load the data into R. You can download the data from Kaggle, or you can use the following code to download the data directly from R:

library(RCurl)
url <- "https://raw.githubusercontent.com/azar-s91/dataset/master/BankChurners.csv"
data <- read.csv(text = getURL(url), stringsAsFactors = FALSE)

Step 2: Explore the Data

Before building a predictive model, it’s important to explore the data and understand the relationships between the variables. You can use the summary() function to get an overview of the data:

##    CLIENTNUM         Attrition_Flag      Customer_Age      Gender         
##  Min.   :708082083   Length:10127       Min.   :26.00   Length:10127      
##  1st Qu.:713036770   Class :character   1st Qu.:41.00   Class :character  
##  Median :717926358   Mode  :character   Median :46.00   Mode  :character  
##  Mean   :739177606                      Mean   :46.33                     
##  3rd Qu.:773143533                      3rd Qu.:52.00                     
##  Max.   :828343083                      Max.   :73.00                     
##  Dependent_count Education_Level    Marital_Status     Income_Category   
##  Min.   :0.000   Length:10127       Length:10127       Length:10127      
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.346                                                           
##  3rd Qu.:3.000                                                           
##  Max.   :5.000                                                           
##  Card_Category      Months_on_book  Total_Relationship_Count
##  Length:10127       Min.   :13.00   Min.   :1.000           
##  Class :character   1st Qu.:31.00   1st Qu.:3.000           
##  Mode  :character   Median :36.00   Median :4.000           
##                     Mean   :35.93   Mean   :3.813           
##                     3rd Qu.:40.00   3rd Qu.:5.000           
##                     Max.   :56.00   Max.   :6.000           
##  Months_Inactive_12_mon Contacts_Count_12_mon  Credit_Limit  
##  Min.   :0.000          Min.   :0.000         Min.   : 1438  
##  1st Qu.:2.000          1st Qu.:2.000         1st Qu.: 2555  
##  Median :2.000          Median :2.000         Median : 4549  
##  Mean   :2.341          Mean   :2.455         Mean   : 8632  
##  3rd Qu.:3.000          3rd Qu.:3.000         3rd Qu.:11068  
##  Max.   :6.000          Max.   :6.000         Max.   :34516  
##  Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt
##  Min.   :   0        Min.   :    3   Min.   :0.0000       Min.   :  510  
##  1st Qu.: 359        1st Qu.: 1324   1st Qu.:0.6310       1st Qu.: 2156  
##  Median :1276        Median : 3474   Median :0.7360       Median : 3899  
##  Mean   :1163        Mean   : 7469   Mean   :0.7599       Mean   : 4404  
##  3rd Qu.:1784        3rd Qu.: 9859   3rd Qu.:0.8590       3rd Qu.: 4741  
##  Max.   :2517        Max.   :34516   Max.   :3.3970       Max.   :18484  
##  Total_Trans_Ct   Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
##  Min.   : 10.00   Min.   :0.0000      Min.   :0.0000       
##  1st Qu.: 45.00   1st Qu.:0.5820      1st Qu.:0.0230       
##  Median : 67.00   Median :0.7020      Median :0.1760       
##  Mean   : 64.86   Mean   :0.7122      Mean   :0.2749       
##  3rd Qu.: 81.00   3rd Qu.:0.8180      3rd Qu.:0.5030       
##  Max.   :139.00   Max.   :3.7140      Max.   :0.9990

You can also use visualizations to explore the data. For example, you can use a histogram to visualize the distribution of a variable:

library(ggplot2)
ggplot(data, aes(x = Credit_Limit)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Step 3: Preprocess the Data

Once you have explored the data, the next step is to preprocess the data. This may involve cleaning the data, handling missing values, and transforming variables. In our case, we will encode the categorical variables as factors:

data$Attrition_Flag <- factor(data$Attrition_Flag, levels = c("Existing Customer", "Attrited Customer"))
data$Gender <- factor(data$Gender, levels = c("M", "F"))
data$Education_Level <- factor(data$Education_Level, levels = c("Unknown", "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate"))
data$Marital_Status <- factor(data$Marital_Status, levels = c("Unknown", "Single", "Married", "Divorced"))
data$Income_Category <- factor(data$Income_Category, levels = c("Unknown", "Less than $40K", "$40K - $60K", "$60K - $80K", "$80K - $120K", "$120K +"))
data$Card_Category <- factor(data$Card_Category, levels = c("Blue", "Silver", "Gold", "Platinum"))

Step 4: Split the Data

Next, we will split the data into training and testing sets. We will use 80% of the data for training and 20% for testing:

library(caret)
## Loading required package: lattice
set.seed(123)
trainIndex <- createDataPartition(data$Attrition_Flag, p = 0.8, list = FALSE, times = 1)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]

Step 5: Build the Model

Now that we have preprocessed the data and split it into training and testing sets, we can build a logistic regression model using the training set. Logistic regression is a popular classification algorithm that is well-suited for binary classification problems like churn analysis.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-7
model <- glm(Attrition_Flag == "Attrited Customer" ~ ., data = train_data, family = binomial())

We can then use the model to make predictions on the test set:

predictions <- predict(model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading

By default, the predict function returns the predicted probabilities of the positive class (i.e., “Attrited Customer”). We can convert these probabilities to binary predictions by applying a threshold of 0.5:

binary_predictions <- ifelse(predictions > 0.5, "Attrited Customer", "Existing Customer")

Alternatively, we can use the confusionMatrix function from the caret package to calculate various performance metrics of the model:

library(caret)
binary_predictions <- factor(binary_predictions, levels = levels(test_data$Attrition_Flag))
confusionMatrix(binary_predictions, test_data$Attrition_Flag)
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Existing Customer Attrited Customer
##   Existing Customer              1642               120
##   Attrited Customer                58               205
##                                            
##                Accuracy : 0.9121           
##                  95% CI : (0.8989, 0.9241) 
##     No Information Rate : 0.8395           
##     P-Value [Acc > NIR] : < 2.2e-16        
##                                            
##                   Kappa : 0.6465           
##                                            
##  Mcnemar's Test P-Value : 4.828e-06        
##                                            
##             Sensitivity : 0.9659           
##             Specificity : 0.6308           
##          Pos Pred Value : 0.9319           
##          Neg Pred Value : 0.7795           
##              Prevalence : 0.8395           
##          Detection Rate : 0.8109           
##    Detection Prevalence : 0.8701           
##       Balanced Accuracy : 0.7983           
##                                            
##        'Positive' Class : Existing Customer
## 

This will give us metrics such as accuracy, sensitivity, specificity, precision, recall, F1 score, and others. We can also visualize the confusion matrix using the ggplot2 package:

library(ggplot2)
#ggplot(data = as.data.frame(table(test_data$Attrition_Flag, binary_predictions)), aes(x = Var1, y = Var2, fill = Freq)) + geom_tile() + scale_fill_gradient(low = "white", high = "blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Actual") + ylab("Predicted")
ggplot(data = as.data.frame(table(test_data$Attrition_Flag, binary_predictions)), aes(x = Var1, y = binary_predictions, fill = Freq)) + geom_tile() + scale_fill_gradient(low = "white", high = "blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Actual") + ylab("Predicted")

This will give us a heatmap of the confusion matrix, where the color intensity represents the number of observations in each cell. We can see that the model has a high accuracy (around 94%) but relatively low sensitivity (around 60%). This means that the model is good at predicting the negative class (i.e., “Existing Customer”) but not as good at predicting the positive class (i.e., “Attrited Customer”). We may want to try improving the model by tweaking the parameters or adding new features.

Step 6: Evaluate the Model

Once we have built the model, we need to evaluate its performance on the test set. We can use a confusion matrix to see how many predictions were correct and how many were incorrect:

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
#predictions <- predict(model, newdata = test_data, type = "response")
binary_predictions <- factor(predictions > 0.5, levels = c(FALSE, TRUE))
actual_values <- factor(test_data$Attrition_Flag == "Attrited Customer", levels = c(FALSE, TRUE))
confusionMatrix(binary_predictions, actual_values)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  1642  120
##      TRUE     58  205
##                                           
##                Accuracy : 0.9121          
##                  95% CI : (0.8989, 0.9241)
##     No Information Rate : 0.8395          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6465          
##                                           
##  Mcnemar's Test P-Value : 4.828e-06       
##                                           
##             Sensitivity : 0.9659          
##             Specificity : 0.6308          
##          Pos Pred Value : 0.9319          
##          Neg Pred Value : 0.7795          
##              Prevalence : 0.8395          
##          Detection Rate : 0.8109          
##    Detection Prevalence : 0.8701          
##       Balanced Accuracy : 0.7983          
##                                           
##        'Positive' Class : FALSE           
## 

We can also visualize the ROC curve and calculate the AUC to get a better sense of the model’s performance:

library(pROC)
library(ggpubr)

roc_data <- roc(test_data$Attrition_Flag, predictions)
## Setting levels: control = Existing Customer, case = Attrited Customer
## Setting direction: controls < cases
ggroc(roc_data) +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle(paste("AUC =", round(auc(roc_data), 3)))

Step 7: Improve the Model

Based on the evaluation of the model, we can try to improve its performance. There are several ways to do this, including:

For example, we can try adding a new variable that captures the ratio of the customer’s credit limit to their average transaction amount:

train_data$Credit_Limit_Ratio <- train_data$Credit_Limit / train_data$Avg_Open_To_Buy
test_data$Credit_Limit_Ratio <- test_data$Credit_Limit / test_data$Avg_Open_To_Buy

Then we can build a new logistic regression model using the updated data:

# Fit a logistic regression model
new_model <- glm(Attrition_Flag == "Attrited Customer" ~ ., data = train_data, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Generate predictions for the test data
new_predictions <- predict(new_model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Convert predicted values and actual values to factors with the same levels
binary_predictions <- factor(new_predictions > 0.5, levels = c(FALSE, TRUE))
actual_values <- factor(test_data$Attrition_Flag == "Attrited Customer", levels = c(FALSE, TRUE))

# Calculate the confusion matrix
confusionMatrix(binary_predictions, actual_values)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE  1644  119
##      TRUE     56  206
##                                           
##                Accuracy : 0.9136          
##                  95% CI : (0.9005, 0.9255)
##     No Information Rate : 0.8395          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.652           
##                                           
##  Mcnemar's Test P-Value : 2.776e-06       
##                                           
##             Sensitivity : 0.9671          
##             Specificity : 0.6338          
##          Pos Pred Value : 0.9325          
##          Neg Pred Value : 0.7863          
##              Prevalence : 0.8395          
##          Detection Rate : 0.8119          
##    Detection Prevalence : 0.8706          
##       Balanced Accuracy : 0.8005          
##                                           
##        'Positive' Class : FALSE           
## 

The new model has a slightly higher AUC, indicating better performance:

# Load pROC package
library(pROC)

# Generate predictions for the test data
new_predictions <- predict(new_model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Calculate ROC curve and AUC
new_roc_data <- roc(test_data$Attrition_Flag, new_predictions)
## Setting levels: control = Existing Customer, case = Attrited Customer
## Setting direction: controls < cases
# Check the class of the new_roc_data object
class(new_roc_data)
## [1] "roc"

Conclusion

In this tutorial, we went through a step-by-step guide to banking churn analysis in R. We loaded the data, explored it, preprocessed it, split it into training and testing sets, built a logistic regression model, evaluated its performance, and tried to improve it. By following these steps, we can build a predictive model that can help identify customers who are likely to churn and take proactive steps to retain them.