In this tutorial, we will go through a step-by-step guide on how to perform banking churn analysis in R. We will be using a dataset containing customer information from a bank, including demographic information, transaction data, and whether or not the customer churned. Our goal is to build a predictive model that can accurately predict whether a customer will churn or not based on their information.
Step 1: Load the Data
The first step is to load the data into R. You can download the data from Kaggle, or you can use the following code to download the data directly from R:
library(RCurl)
url <- "https://raw.githubusercontent.com/azar-s91/dataset/master/BankChurners.csv"
data <- read.csv(text = getURL(url), stringsAsFactors = FALSE)
Before building a predictive model, it’s important to explore the data and understand the relationships between the variables. You can use the summary() function to get an overview of the data:
## CLIENTNUM Attrition_Flag Customer_Age Gender
## Min. :708082083 Length:10127 Min. :26.00 Length:10127
## 1st Qu.:713036770 Class :character 1st Qu.:41.00 Class :character
## Median :717926358 Mode :character Median :46.00 Mode :character
## Mean :739177606 Mean :46.33
## 3rd Qu.:773143533 3rd Qu.:52.00
## Max. :828343083 Max. :73.00
## Dependent_count Education_Level Marital_Status Income_Category
## Min. :0.000 Length:10127 Length:10127 Length:10127
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :2.346
## 3rd Qu.:3.000
## Max. :5.000
## Card_Category Months_on_book Total_Relationship_Count
## Length:10127 Min. :13.00 Min. :1.000
## Class :character 1st Qu.:31.00 1st Qu.:3.000
## Mode :character Median :36.00 Median :4.000
## Mean :35.93 Mean :3.813
## 3rd Qu.:40.00 3rd Qu.:5.000
## Max. :56.00 Max. :6.000
## Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit
## Min. :0.000 Min. :0.000 Min. : 1438
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 2555
## Median :2.000 Median :2.000 Median : 4549
## Mean :2.341 Mean :2.455 Mean : 8632
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:11068
## Max. :6.000 Max. :6.000 Max. :34516
## Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt
## Min. : 0 Min. : 3 Min. :0.0000 Min. : 510
## 1st Qu.: 359 1st Qu.: 1324 1st Qu.:0.6310 1st Qu.: 2156
## Median :1276 Median : 3474 Median :0.7360 Median : 3899
## Mean :1163 Mean : 7469 Mean :0.7599 Mean : 4404
## 3rd Qu.:1784 3rd Qu.: 9859 3rd Qu.:0.8590 3rd Qu.: 4741
## Max. :2517 Max. :34516 Max. :3.3970 Max. :18484
## Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## Min. : 10.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 45.00 1st Qu.:0.5820 1st Qu.:0.0230
## Median : 67.00 Median :0.7020 Median :0.1760
## Mean : 64.86 Mean :0.7122 Mean :0.2749
## 3rd Qu.: 81.00 3rd Qu.:0.8180 3rd Qu.:0.5030
## Max. :139.00 Max. :3.7140 Max. :0.9990
You can also use visualizations to explore the data. For example, you can use a histogram to visualize the distribution of a variable:
library(ggplot2)
ggplot(data, aes(x = Credit_Limit)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Step 3: Preprocess the Data
Once you have explored the data, the next step is to preprocess the data. This may involve cleaning the data, handling missing values, and transforming variables. In our case, we will encode the categorical variables as factors:
data$Attrition_Flag <- factor(data$Attrition_Flag, levels = c("Existing Customer", "Attrited Customer"))
data$Gender <- factor(data$Gender, levels = c("M", "F"))
data$Education_Level <- factor(data$Education_Level, levels = c("Unknown", "Uneducated", "High School", "College", "Graduate", "Post-Graduate", "Doctorate"))
data$Marital_Status <- factor(data$Marital_Status, levels = c("Unknown", "Single", "Married", "Divorced"))
data$Income_Category <- factor(data$Income_Category, levels = c("Unknown", "Less than $40K", "$40K - $60K", "$60K - $80K", "$80K - $120K", "$120K +"))
data$Card_Category <- factor(data$Card_Category, levels = c("Blue", "Silver", "Gold", "Platinum"))
Step 4: Split the Data
Next, we will split the data into training and testing sets. We will use 80% of the data for training and 20% for testing:
library(caret)
## Loading required package: lattice
set.seed(123)
trainIndex <- createDataPartition(data$Attrition_Flag, p = 0.8, list = FALSE, times = 1)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
Step 5: Build the Model
Now that we have preprocessed the data and split it into training and testing sets, we can build a logistic regression model using the training set. Logistic regression is a popular classification algorithm that is well-suited for binary classification problems like churn analysis.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-7
model <- glm(Attrition_Flag == "Attrited Customer" ~ ., data = train_data, family = binomial())
We can then use the model to make predictions on the test set:
predictions <- predict(model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
By default, the predict function returns the predicted probabilities of the positive class (i.e., “Attrited Customer”). We can convert these probabilities to binary predictions by applying a threshold of 0.5:
binary_predictions <- ifelse(predictions > 0.5, "Attrited Customer", "Existing Customer")
Alternatively, we can use the confusionMatrix function from the caret package to calculate various performance metrics of the model:
library(caret)
binary_predictions <- factor(binary_predictions, levels = levels(test_data$Attrition_Flag))
confusionMatrix(binary_predictions, test_data$Attrition_Flag)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Existing Customer Attrited Customer
## Existing Customer 1642 120
## Attrited Customer 58 205
##
## Accuracy : 0.9121
## 95% CI : (0.8989, 0.9241)
## No Information Rate : 0.8395
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6465
##
## Mcnemar's Test P-Value : 4.828e-06
##
## Sensitivity : 0.9659
## Specificity : 0.6308
## Pos Pred Value : 0.9319
## Neg Pred Value : 0.7795
## Prevalence : 0.8395
## Detection Rate : 0.8109
## Detection Prevalence : 0.8701
## Balanced Accuracy : 0.7983
##
## 'Positive' Class : Existing Customer
##
This will give us metrics such as accuracy, sensitivity, specificity, precision, recall, F1 score, and others. We can also visualize the confusion matrix using the ggplot2 package:
library(ggplot2)
#ggplot(data = as.data.frame(table(test_data$Attrition_Flag, binary_predictions)), aes(x = Var1, y = Var2, fill = Freq)) + geom_tile() + scale_fill_gradient(low = "white", high = "blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Actual") + ylab("Predicted")
ggplot(data = as.data.frame(table(test_data$Attrition_Flag, binary_predictions)), aes(x = Var1, y = binary_predictions, fill = Freq)) + geom_tile() + scale_fill_gradient(low = "white", high = "blue") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Actual") + ylab("Predicted")
This will give us a heatmap of the confusion matrix, where the color intensity represents the number of observations in each cell. We can see that the model has a high accuracy (around 94%) but relatively low sensitivity (around 60%). This means that the model is good at predicting the negative class (i.e., “Existing Customer”) but not as good at predicting the positive class (i.e., “Attrited Customer”). We may want to try improving the model by tweaking the parameters or adding new features.
Step 6: Evaluate the Model
Once we have built the model, we need to evaluate its performance on the test set. We can use a confusion matrix to see how many predictions were correct and how many were incorrect:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
#predictions <- predict(model, newdata = test_data, type = "response")
binary_predictions <- factor(predictions > 0.5, levels = c(FALSE, TRUE))
actual_values <- factor(test_data$Attrition_Flag == "Attrited Customer", levels = c(FALSE, TRUE))
confusionMatrix(binary_predictions, actual_values)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 1642 120
## TRUE 58 205
##
## Accuracy : 0.9121
## 95% CI : (0.8989, 0.9241)
## No Information Rate : 0.8395
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6465
##
## Mcnemar's Test P-Value : 4.828e-06
##
## Sensitivity : 0.9659
## Specificity : 0.6308
## Pos Pred Value : 0.9319
## Neg Pred Value : 0.7795
## Prevalence : 0.8395
## Detection Rate : 0.8109
## Detection Prevalence : 0.8701
## Balanced Accuracy : 0.7983
##
## 'Positive' Class : FALSE
##
We can also visualize the ROC curve and calculate the AUC to get a better sense of the model’s performance:
library(pROC)
library(ggpubr)
roc_data <- roc(test_data$Attrition_Flag, predictions)
## Setting levels: control = Existing Customer, case = Attrited Customer
## Setting direction: controls < cases
ggroc(roc_data) +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle(paste("AUC =", round(auc(roc_data), 3)))
Step 7: Improve the Model
Based on the evaluation of the model, we can try to improve its performance. There are several ways to do this, including:
Feature engineering: creating new variables based on the existing variables to capture more information
Hyperparameter tuning: adjusting the parameters of the model to optimize its performance
Trying different models: testing other classification algorithms to see if they perform better
For example, we can try adding a new variable that captures the ratio of the customer’s credit limit to their average transaction amount:
train_data$Credit_Limit_Ratio <- train_data$Credit_Limit / train_data$Avg_Open_To_Buy
test_data$Credit_Limit_Ratio <- test_data$Credit_Limit / test_data$Avg_Open_To_Buy
Then we can build a new logistic regression model using the updated data:
# Fit a logistic regression model
new_model <- glm(Attrition_Flag == "Attrited Customer" ~ ., data = train_data, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Generate predictions for the test data
new_predictions <- predict(new_model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Convert predicted values and actual values to factors with the same levels
binary_predictions <- factor(new_predictions > 0.5, levels = c(FALSE, TRUE))
actual_values <- factor(test_data$Attrition_Flag == "Attrited Customer", levels = c(FALSE, TRUE))
# Calculate the confusion matrix
confusionMatrix(binary_predictions, actual_values)
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 1644 119
## TRUE 56 206
##
## Accuracy : 0.9136
## 95% CI : (0.9005, 0.9255)
## No Information Rate : 0.8395
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.652
##
## Mcnemar's Test P-Value : 2.776e-06
##
## Sensitivity : 0.9671
## Specificity : 0.6338
## Pos Pred Value : 0.9325
## Neg Pred Value : 0.7863
## Prevalence : 0.8395
## Detection Rate : 0.8119
## Detection Prevalence : 0.8706
## Balanced Accuracy : 0.8005
##
## 'Positive' Class : FALSE
##
The new model has a slightly higher AUC, indicating better performance:
# Load pROC package
library(pROC)
# Generate predictions for the test data
new_predictions <- predict(new_model, newdata = test_data, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
# Calculate ROC curve and AUC
new_roc_data <- roc(test_data$Attrition_Flag, new_predictions)
## Setting levels: control = Existing Customer, case = Attrited Customer
## Setting direction: controls < cases
# Check the class of the new_roc_data object
class(new_roc_data)
## [1] "roc"
Conclusion
In this tutorial, we went through a step-by-step guide to banking churn analysis in R. We loaded the data, explored it, preprocessed it, split it into training and testing sets, built a logistic regression model, evaluated its performance, and tried to improve it. By following these steps, we can build a predictive model that can help identify customers who are likely to churn and take proactive steps to retain them.