Credit Card Customer Churn Detection Using Machine Learning Algorithms
Credit Card Fraud Data
https://data.world/vlad/credit-card-fraud-detection
The rapid growth of the banking industry has allowed consumers to be more discerning about the banks they maintain relationships with. Customer retention has therefore become a significant concern for many financial institutions, and it matters especially for credit cards. A high churn rate, the rate at which customers stop doing business with an entity, can lead to significant revenue losses and higher acquisition costs for new customers. This project aims to predict credit card customer churn so that banks can identify and retain customers at risk of leaving.
Customer churn in the banking sector, particularly in credit cards, is a persistent issue. Predicting churn is complex because many factors can influence a customer’s decision to leave, including customer service quality, better offerings from competitors, and changes in the customer’s financial circumstances. Despite the advent of advanced data analytics techniques, many banks still struggle to predict and mitigate customer churn effectively. This project focuses on this problem: it attempts to develop a model that predicts customer churn accurately and thus provides insights that help banks retain their credit card customers.
# Loading necessary libraries
library(readxl)
library(ggplot2)
library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(corrplot)
library(hexbin)
library(tidyr)
library(purrr)
library(gridExtra)
library(ggrepel)
library(pastecs)
library(caret)
#library(ROSE)
library(randomForest)
library(e1071)
library(rpart)
library(rpart.plot)

c_data <- read.csv('BankChurners.csv')
head(c_data, 3)
names(c_data)

# Remove duplicates
c_data <- unique(c_data)

# Check for null values in each column
null_counts <- sapply(c_data, function(x) sum(is.na(x)))

# Drop unnecessary columns
c_data <- c_data[, -c(1, 22, 23)]

The distribution of customer ages in the dataset is approximately normal. The box plot provides an overview of the median, quartiles, and any potential outliers in the age variable. The histogram illustrates the count of customers in each age group.
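The plotting code for the age variable is not shown in this report, so the following is only a minimal sketch of how the histogram and box plot described above could be produced with ggplot2. The bin width, colours, labels, and object names are assumptions, and the ten ages are the first `Customer_Age` values displayed in the `str()` output further down, not the full dataset.

```r
library(ggplot2)

# Illustrative values only: the first ten Customer_Age entries displayed in
# the str() output of this report, not the full 10127 rows.
ages <- data.frame(Customer_Age = c(45, 49, 51, 40, 40, 44, 51, 32, 37, 48))

# Histogram: count of customers per 5-year age bin (bin width is an assumption)
age_hist <- ggplot(ages, aes(x = Customer_Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
  labs(x = "Customer age", y = "Count", title = "Distribution of customer age")

# Box plot: median, quartiles, and points beyond 1.5 * IQR
age_box <- ggplot(ages, aes(y = Customer_Age)) +
  geom_boxplot() +
  labs(y = "Customer age", title = "Customer age box plot")

# The quartiles that the box plot draws
age_quartiles <- quantile(ages$Customer_Age, probs = c(0.25, 0.5, 0.75))

print(age_hist)
print(age_box)
print(age_quartiles)
```

With the full data, `ages` would be replaced by `c_data`, and the two plots could be placed side by side with `gridExtra::grid.arrange(age_hist, age_box, ncol = 2)`.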
Next, we perform a similar exploratory analysis for the other variables in the dataset:
The dataset contains more female than male customers, but the
difference is small, so the genders can be considered roughly
balanced.
The correlation matrix provides an overview of the relationships between the numeric variables in the dataset. It helps identify any strong positive or negative correlations between variables.
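As a rough illustration of the correlation step, the sketch below computes a Pearson correlation matrix with base R and lists the strongly correlated pairs. The three columns and five rows are copied from the `str()` output shown later in this report; the 0.8 cutoff and the helper name `strong_pairs` are assumptions chosen for the example.

```r
# Illustrative values only: the first five rows of three columns, as displayed
# in the str() output of this report. The real analysis would use every
# numeric column of c_data.
sample_rows <- data.frame(
  Credit_Limit        = c(12691, 8256, 3418, 3313, 4716),
  Total_Revolving_Bal = c(777, 864, 0, 2517, 0),
  Avg_Open_To_Buy     = c(11914, 7392, 3418, 796, 4716)
)

# Keep numeric columns only, then compute Pearson correlations
numeric_cols <- sample_rows[, sapply(sample_rows, is.numeric), drop = FALSE]
cor_matrix <- cor(numeric_cols, method = "pearson")

# List variable pairs whose absolute correlation exceeds a cutoff.
# The default of 0.8 is an arbitrary illustrative threshold.
strong_pairs <- function(cm, cutoff = 0.8) {
  idx <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

print(round(cor_matrix, 2))
print(strong_pairs(cor_matrix))

# In these rows the open-to-buy amount equals credit limit minus revolving balance
open_to_buy_identity <- all(
  sample_rows$Avg_Open_To_Buy ==
    sample_rows$Credit_Limit - sample_rows$Total_Revolving_Bal
)
print(open_to_buy_identity)
```

In the displayed rows, `Avg_Open_To_Buy` is exactly `Credit_Limit` minus `Total_Revolving_Bal`, so a strong correlation between those columns is expected. The matrix can be passed to `corrplot::corrplot(cor_matrix)` for the visual version.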
The scatter plot showcases the relationship between the total transaction amount and the total transaction count. It helps visualize any patterns or trends between these two variables.
The summary statistics provide a comprehensive overview of the numerical variables in the dataset. They include measures such as mean, median, standard deviation, minimum, maximum, and various percentiles.
This exploratory analysis provides insights into the distribution, relationships, and summary statistics of the key variables in the dataset. Further exploratory analyses can be conducted for other variables as the project requires.
The dependent counts are approximately normally distributed with a slight right skew.
Even if we assume that most customers with an unknown education status have no formal education, more than 70% of the customers have a formal education level. About 35% have a higher level of education.
## [1] "Kurtosis of Months on book features is: 0.398638886235621"
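The report does not show which function produced the kurtosis figure above, so the estimator is unknown. As a minimal sketch, assuming the plain moment-based excess kurtosis, the calculation looks as follows. Under this convention a normal distribution scores 0, so a small positive value indicates tails only slightly heavier than normal. The function name and the ten sample values (the first `Months_on_book` entries in the `str()` output) are for illustration only.

```r
# Moment-based excess kurtosis: m4 / m2^2 - 3, where m_k is the k-th central
# moment. A normal distribution scores 0 under this convention.
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  centred <- x - mean(x)
  m2 <- mean(centred^2)
  m4 <- mean(centred^4)
  m4 / m2^2 - 3
}

# Illustrative values only: the first ten Months_on_book entries displayed in
# the str() output of this report. The result will differ from the full-data figure.
months_on_book <- c(39, 44, 36, 34, 21, 36, 46, 27, 36, 36)
print(paste("Excess kurtosis of the sample rows:", excess_kurtosis(months_on_book)))
```

Other estimators apply small-sample corrections, so their results can differ slightly from this formula.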
Distribution of the Total Transaction Amount (Last 12 months):
# Identify the column names of categorical variables (character or factor columns)
categorical_columns <- sapply(c_data, function(x) is.character(x) || is.factor(x))
categorical_column_names <- names(c_data)[categorical_columns]
# Print the column names of categorical variables and factors
print(categorical_column_names)

## [1] "Attrition_Flag" "Gender" "Education_Level" "Marital_Status"
## [5] "Income_Category" "Card_Category"
# Convert values of Attrition_Flag to 0 and 1
c_data$Attrition_Flag <- ifelse(c_data$Attrition_Flag == "Existing Customer", 0, 1)

## 'data.frame': 10127 obs. of 20 variables:
## $ Attrition_Flag : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Customer_Age : int 45 49 51 40 40 44 51 32 37 48 ...
## $ Gender : chr "M" "F" "M" "F" ...
## $ Dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
## $ Education_Level : chr "High School" "Graduate" "Graduate" "High School" ...
## $ Marital_Status : chr "Married" "Single" "Married" "Unknown" ...
## $ Income_Category : chr "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
## $ Card_Category : chr "Blue" "Blue" "Blue" "Blue" ...
## $ Months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
## $ Total_Relationship_Count: int 5 6 4 3 5 3 6 2 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
## $ Contacts_Count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
## $ Credit_Limit : num 12691 8256 3418 3313 4716 ...
## $ Total_Revolving_Bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
## $ Avg_Open_To_Buy : num 11914 7392 3418 796 4716 ...
## $ Total_Amt_Chng_Q4_Q1 : num 1.33 1.54 2.59 1.4 2.17 ...
## $ Total_Trans_Amt : int 1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
## $ Total_Trans_Ct : int 42 33 20 20 28 24 31 36 24 32 ...
## $ Total_Ct_Chng_Q4_Q1 : num 1.62 3.71 2.33 2.33 2.5 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
## [1] "Customer_Age" "Dependent_count"
## [3] "Months_on_book" "Total_Relationship_Count"
## [5] "Months_Inactive_12_mon" "Contacts_Count_12_mon"
## [7] "Credit_Limit" "Total_Revolving_Bal"
## [9] "Avg_Open_To_Buy" "Total_Amt_Chng_Q4_Q1"
## [11] "Total_Trans_Amt" "Total_Trans_Ct"
## [13] "Total_Ct_Chng_Q4_Q1" "Avg_Utilization_Ratio"
## [15] "Gender.F" "Gender.M"
## [17] "Education_Level.College" "Education_Level.Doctorate"
## [19] "Education_Level.Graduate" "Education_Level.High School"
## [21] "Education_Level.Post-Graduate" "Education_Level.Uneducated"
## [23] "Education_Level.Unknown" "Marital_Status.Divorced"
## [25] "Marital_Status.Married" "Marital_Status.Single"
## [27] "Marital_Status.Unknown" "Income_Category.$120K +"
## [29] "Income_Category.$40K - $60K" "Income_Category.$60K - $80K"
## [31] "Income_Category.$80K - $120K" "Income_Category.Less than $40K"
## [33] "Income_Category.Unknown" "Card_Category.Blue"
## [35] "Card_Category.Gold" "Card_Category.Platinum"
## [37] "Card_Category.Silver"
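The column names listed above follow a `<column>.<level>` pattern with one indicator per category level, which suggests that the categorical features were one-hot encoded without dropping a reference level. The encoding code itself is not shown, so the base R sketch below is only one possible way to obtain names of that shape. The helper `one_hot_encode` is hypothetical, and the four rows are taken from the `str()` output above.

```r
# One-hot encode the named columns, producing "<column>.<level>" indicator
# columns. Hypothetical helper; the report's own encoding step is not shown.
one_hot_encode <- function(df, cols) {
  # Start from the columns that are left untouched
  encoded <- df[, setdiff(names(df), cols), drop = FALSE]
  for (col in cols) {
    f <- factor(df[[col]])
    for (lv in levels(f)) {
      # One 0/1 indicator column per level; no reference level is dropped
      encoded[[paste(col, lv, sep = ".")]] <- as.integer(f == lv)
    }
  }
  encoded
}

# Illustrative values only: the first four rows displayed in the str() output
toy <- data.frame(
  Customer_Age   = c(45, 49, 51, 40),
  Gender         = c("M", "F", "M", "F"),
  Marital_Status = c("Married", "Single", "Married", "Unknown"),
  stringsAsFactors = FALSE
)

toy_encoded <- one_hot_encode(toy, cols = c("Gender", "Marital_Status"))
print(names(toy_encoded))
print(toy_encoded)
```

Keeping every level is harmless for tree-based models such as Random Forest. For models with an intercept, the indicators of one variable sum to 1, so dropping one level per variable is a common alternative.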
## [1] "Random Forest Accuracy: 0.953258722843976"
## [1] "Random Forest Confusion Matrix:"
##           Reference
## Prediction    0    1
##          0 2509  119
##          1   23  387
## [1] "SVM Accuracy: 0.910138248847926"
## [1] "SVM Confusion Matrix:"
##           Reference
## Prediction    0    1
##          0 2490  231
##          1   42  275
## [1] "Decision Tree Accuracy: 0.929229756418697"
## [1] "Decision Tree Confusion Matrix:"
##           Reference
## Prediction    0    1
##          0 2449  132
##          1   83  374
##           Model  Accuracy Precision    Recall  F1_Score
## 1 Random Forest 0.9532587 0.9547184 0.9909163 0.9724806
## 2           SVM 0.9101382 0.9151047 0.9834123 0.9480297
## 3 Decision Tree 0.9292298 0.9488570 0.9672196 0.9579503
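The metrics in the table can be recomputed directly from the confusion matrices. The sketch below does this for the Random Forest matrix reported above. The reported precision, recall, and F1 values are reproduced when class 0 (existing customers) is treated as the positive class. Because the project's goal is to find churners, the same function is also evaluated with class 1 as the positive class. The helper name `class_metrics` is an assumption for illustration.

```r
# Random Forest confusion matrix as reported above
# (rows = prediction, columns = reference; filled column by column)
rf_cm <- matrix(
  c(2509, 23, 119, 387), nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Accuracy, precision, recall, and F1 for a chosen positive class
class_metrics <- function(cm, positive) {
  tp <- cm[positive, positive]
  fp <- sum(cm[positive, ]) - tp   # predicted positive, actually negative
  fn <- sum(cm[, positive]) - tp   # actually positive, predicted negative
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(accuracy  = sum(diag(cm)) / sum(cm),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

existing_as_positive <- class_metrics(rf_cm, positive = "0")  # matches the table
churn_as_positive    <- class_metrics(rf_cm, positive = "1")  # churners as target

print(round(rbind(existing_as_positive, churn_as_positive), 4))
```

With churners as the positive class, recall is 387 / (119 + 387), so roughly three quarters of the churners in the test set are detected. That is lower than the table suggests at first glance.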
# Reshape the performance data frame for plotting
performance_long <- reshape2::melt(performance, id.vars = "Model")
# Grouped Bar Chart for Performance Metrics
bar_chart <- ggplot(performance_long, aes(x = Model, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Model", y = "Performance", title = "Performance Metrics") +
  scale_fill_manual(values = c(Accuracy = "blue", Precision = "red", Recall = "green", F1_Score = "purple")) +
  theme_minimal() +
  theme(legend.position = "right")
print(bar_chart)

# Performance of the best model (ROC curve)
library(pROC)

## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Calculate ROC curve
rf_predictions_char <- as.numeric(as.character(rf_predictions))
rf_roc <- roc(response = as.numeric(y_test), predictor = rf_predictions_char)

## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
# Plot ROC curve
plot(rf_roc, main = "ROC Curve", print.thres = "best", legacy.axes = TRUE)

We trained three classifiers on the training set and evaluated them on the test set: Random Forest, SVM, and Decision Tree. Random Forest achieves the highest accuracy (about 95.3%, versus 92.9% for the Decision Tree and 91.0% for the SVM) and also outperforms the other two models on precision, recall, and F1 score. Note that the reported precision, recall, and F1 values treat existing customers (class 0) as the positive class, as can be verified from the confusion matrices. We therefore adopt Random Forest as our first-choice model.
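The ROC call above receives hard class labels (0 or 1) as the predictor, so the resulting curve has a single operating point. Its area then equals the average of sensitivity and specificity. The base R sketch below rebuilds the test-set labels and predictions from the Random Forest confusion matrix and checks this identity with the rank-sum formula for AUC. The helper `auc_rank` and the variable names are assumptions for illustration.

```r
# Rebuild test-set labels and hard predictions from the Random Forest
# confusion matrix reported in this document (1 = attrited customer)
actual    <- c(rep(0, 2509), rep(0, 23), rep(1, 119), rep(1, 387))
predicted <- c(rep(0, 2509), rep(1, 23), rep(0, 119), rep(1, 387))

# AUC via the rank-sum (Mann-Whitney) identity; tied scores get average ranks
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

sensitivity <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
specificity <- sum(predicted == 0 & actual == 0) / sum(actual == 0)

auc_hard          <- auc_rank(actual, predicted)
balanced_accuracy <- (sensitivity + specificity) / 2

print(c(auc_hard = auc_hard, balanced_accuracy = balanced_accuracy))
```

To obtain a full curve, the predictor would need to be a continuous score such as the predicted churn probability. For a `randomForest` classifier this could be `predict(rf_model, newdata = X_test, type = "prob")[, "1"]`, where `rf_model` and `X_test` are hypothetical names because the fitting code is not shown in this report.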