WQD7004 Group Assignment
Group Members
- Khairun Nadzirah Binti Abdul Karim (22066330)
- Muhammad Shahzad Rafiq (S2150889)
- Sadman Chowdhury (S2199546)
- Yin QiXiang (S2150692)
- Xin Dong (22060696)
Title:
Credit Card Customer Churn Detection Using Machine Learning Algorithms
Dataset
Credit Card Fraud Data
https://data.world/vlad/credit-card-fraud-detection
Introduction
The rapid growth of the banking industry has allowed consumers to be more discerning about the banks they want to maintain relationships with. Customer retention has therefore become a significant concern for many financial institutions, and credit cards are an area where it matters especially. A high churn rate (the rate at which customers stop doing business with an entity) can lead to significant revenue losses and higher acquisition costs for new customers. This project aims to predict credit card customer churn, helping banks identify and retain customers at risk of churning.
Problem Statement
Customer churn in the banking sector, particularly in credit cards, is a persistent issue. Predicting churn can be a complex task due to the multitude of factors that can influence a customer’s decision to leave, including customer service quality, better offerings from competitors, changes in customer financial circumstances, and more. Despite the advent of advanced data analytics techniques, many banks still struggle to predict and mitigate customer churn effectively. This project will focus on this problem, attempting to develop a model that can accurately predict customer churn and thus provide valuable insights to help banks retain their valuable credit card customers.
Research Objective
To understand the factors that contribute to credit card customer churn: The first step in addressing the problem is understanding its root causes. We will investigate various factors like customer demographics, card usage, and other relevant variables to gain insights into what drives churn.
To develop a predictive model for credit card customer churn: Leveraging machine learning techniques, we will develop a model that can predict the likelihood of a customer churning. This objective will involve data preprocessing, model selection, training, and validation steps.
To provide recommendations for customer churn reduction strategies: Based on the findings from the predictive model and our understanding of churn factors, we will propose strategies that banks could implement to reduce churn rates and improve customer loyalty.
Research Question
What are the key factors influencing credit card customer churn?: Identifying the main predictors of churn will be the first step in developing our predictive model and will provide a foundation for our churn reduction recommendations.
How accurately can we predict credit card customer churn?: This question will guide our model development and evaluation process. It will involve assessing the performance of our predictive model using suitable metrics.
What strategies can banks implement to reduce churn rates among credit card customers?: This question will be addressed in the final part of our research, where we will use our findings to suggest actionable strategies for churn reduction.
By investigating these research questions, this project aims to contribute to the understanding of credit card customer churn and provide tangible benefits to the banking industry by suggesting potential solutions.
Data Loading
# Loading necessary libraries
# Note: plyr is loaded before dplyr, as recommended, so that dplyr's verbs
# (mutate, summarise, etc.) are not masked by plyr
library(readxl)
library(ggplot2)
library(plyr)
library(dplyr)
library(corrplot)
library(hexbin)
library(tidyr)
library(purrr)
library(gridExtra)
library(ggrepel)
library(pastecs)
library(caret)
#library(ROSE)
library(randomForest)
library(e1071)
library(rpart)
library(rpart.plot)
knitr::opts_chunk$set(echo = TRUE, results = 'hide')

# Load the dataset
c_data <- read.csv('BankChurners.csv')
head(c_data, 3)
names(c_data)

Data Cleaning
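Before cleaning, it is worth confirming the table's shape and column types; a small addition, not in the original code:

# Dimensions and column classes of the raw data
dim(c_data)
sapply(c_data, class)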
# Remove duplicate rows
c_data <- unique(c_data)

# Check for null values in each column
null_counts <- sapply(c_data, function(x) sum(is.na(x)))
print(null_counts)

# Drop unnecessary columns (the ID column and the dataset's last two
# pre-computed classifier columns, which should not be used as predictors)
c_data <- c_data[, -c(1, 22, 23)]

Kurtosis of the Months_on_book feature:
# Calculate kurtosis (named to avoid shadowing e1071's kurtosis() function)
mob_kurtosis <- kurtosis(c_data$Months_on_book)
print(paste("Kurtosis of the Months_on_book feature is:", mob_kurtosis))

Distribution of the Total Transaction Amount (Last 12 months):
# Box plot and histogram of the total transaction amount
p1 <- ggplot(c_data, aes(x = "", y = Total_Trans_Amt)) +
  geom_boxplot() +
  labs(x = NULL, y = "Total Transaction Amount") +
  theme_minimal()
p2 <- ggplot(c_data, aes(x = Total_Trans_Amt)) +
  geom_histogram(bins = 30) +  # bin count made explicit (ggplot2's default)
  labs(x = "Total Transaction Amount", y = "Count") +
  theme_minimal()
grid.arrange(p1, p2, nrow = 2)
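Kurtosis measures tail weight but says nothing about direction, so skewness is a natural complement. A small addition using e1071's skewness() on the same two features (not part of the original analysis):

# Skewness of the features examined above (positive values indicate a right tail)
print(paste("Skewness of Months_on_book:", skewness(c_data$Months_on_book)))
print(paste("Skewness of Total_Trans_Amt:", skewness(c_data$Total_Trans_Amt)))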
Data Processing
# Identify the column names of categorical variables
categorical_columns <- sapply(c_data, is.character)
categorical_column_names <- names(c_data[categorical_columns])
print(categorical_column_names)

# Convert Attrition_Flag to 0 (existing customer) and 1 (attrited customer)
c_data$Attrition_Flag <- ifelse(c_data$Attrition_Flag == "Existing Customer", 0, 1)
names(c_data)
str(c_data)

# Convert the categorical predictors to factors
categorical_cols <- c("Gender", "Education_Level", "Marital_Status", "Income_Category", "Card_Category")
c_data[categorical_cols] <- lapply(c_data[categorical_cols], as.factor)

# Create a formula for one-hot encoding
formula <- as.formula(paste("factor(Attrition_Flag) ~", paste(categorical_cols, collapse = "+")))

# Create dummy variables using dummyVars
dummy_data <- predict(dummyVars(formula, data = c_data), newdata = c_data)

# Combine numerical and one-hot encoded data
combined_data <- cbind(c_data[, !(names(c_data) %in% categorical_cols)], dummy_data)

# Split the data into training (70%) and testing (30%) sets;
# createDataPartition keeps the churn proportion similar in both sets
set.seed(42)
train_indices <- createDataPartition(combined_data$Attrition_Flag, p = 0.7, list = FALSE)
train_data <- combined_data[train_indices, ]
test_data <- combined_data[-train_indices, ]

# Separate predictors (X) and target variable (y) in the training and testing sets
X_train <- train_data[, !(names(train_data) %in% "Attrition_Flag")]
y_train <- train_data$Attrition_Flag
X_test <- test_data[, !(names(test_data) %in% "Attrition_Flag")]
y_test <- test_data$Attrition_Flag
names(X_train)

Machine Learning Modeling
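Because only about 16% of customers are churners, it is worth verifying that the 70/30 split preserves this imbalance before training; a quick check, not in the original code:

# The churn proportion should be similar (~16%) in both sets
prop.table(table(y_train))
prop.table(table(y_test))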
# Train the Random Forest model (class levels are taken from the factor y;
# the invalid class.factors argument has been removed)
rf_model <- randomForest(x = X_train, y = as.factor(y_train))

# Train the SVM model
svm_model <- svm(x = X_train, y = as.factor(y_train))

# Train the Decision Tree model (y_train is found in the calling environment)
dt_model <- rpart(as.factor(y_train) ~ ., data = X_train, method = "class")

Performance Metrics
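Accuracy from a single 70/30 split can vary with the random seed. As a cross-check, the random forest could also be evaluated with 5-fold cross-validation via caret; a minimal sketch (the mtry grid is an illustrative assumption, not from the original pipeline):

# 5-fold cross-validated estimate for the random forest
ctrl <- trainControl(method = "cv", number = 5)
rf_cv <- train(x = X_train, y = as.factor(y_train), method = "rf",
               trControl = ctrl, tuneGrid = data.frame(mtry = c(2, 4, 8)))
print(rf_cv)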
# Make predictions on the test set
rf_predictions <- predict(rf_model, X_test)
# Convert y_test to have the same levels as rf_predictions
y_test <- factor(y_test, levels = levels(rf_predictions))
# Calculate accuracy and confusion matrix
rf_accuracy <- sum(rf_predictions == y_test) / length(y_test)
rf_confusion <- confusionMatrix(rf_predictions, y_test)
# Print accuracy and confusion matrix
print(paste("Random Forest Accuracy:", rf_accuracy))
print("Random Forest Confusion Matrix:")
print(rf_confusion$table)

# SVM
# Make predictions on the test set
svm_predictions <- predict(svm_model, X_test)
# Evaluate the model performance
svm_accuracy <- sum(svm_predictions == y_test) / length(y_test)
# Create the confusion matrix
svm_confusion <- confusionMatrix(svm_predictions, y_test)
print(paste("SVM Accuracy:", svm_accuracy))
print("SVM Confusion Matrix:")
print(svm_confusion$table)

# Decision Tree
# Predict class labels on the test set
dt_predictions <- predict(dt_model, newdata = X_test, type = "class")
# Evaluate the model performance
dt_accuracy <- sum(dt_predictions == y_test) / length(y_test)
dt_confusion <- confusionMatrix(dt_predictions, y_test)
print(paste("Decision Tree Accuracy:", dt_accuracy))
print("Decision Tree Confusion Matrix:")
print(dt_confusion$table)

# Create a data frame to store the performance metrics
performance <- data.frame(Model = c("Random Forest", "SVM", "Decision Tree"),
                          Accuracy = numeric(3),
                          Precision = numeric(3),
                          Recall = numeric(3),
                          F1_Score = numeric(3))
# Random Forest
rf_accuracy <- sum(rf_predictions == y_test) / length(y_test)
# positive = "1" so that precision/recall describe the churned class
rf_confusion <- confusionMatrix(rf_predictions, y_test, positive = "1")
rf_precision <- rf_confusion$byClass["Pos Pred Value"]
rf_recall <- rf_confusion$byClass["Sensitivity"]
rf_f1_score <- 2 * (rf_precision * rf_recall) / (rf_precision + rf_recall)
performance[1, c("Accuracy", "Precision", "Recall", "F1_Score")] <- c(rf_accuracy, rf_precision, rf_recall, rf_f1_score)
# SVM
svm_accuracy <- sum(svm_predictions == y_test) / length(y_test)
svm_confusion <- confusionMatrix(svm_predictions, y_test, positive = "1")
svm_precision <- svm_confusion$byClass["Pos Pred Value"]
svm_recall <- svm_confusion$byClass["Sensitivity"]
svm_f1_score <- 2 * (svm_precision * svm_recall) / (svm_precision + svm_recall)
performance[2, c("Accuracy", "Precision", "Recall", "F1_Score")] <- c(svm_accuracy, svm_precision, svm_recall, svm_f1_score)
# Decision Tree
dt_accuracy <- sum(dt_predictions == y_test) / length(y_test)
dt_confusion <- confusionMatrix(dt_predictions, y_test, positive = "1")
dt_precision <- dt_confusion$byClass["Pos Pred Value"]
dt_recall <- dt_confusion$byClass["Sensitivity"]
dt_f1_score <- 2 * (dt_precision * dt_recall) / (dt_precision + dt_recall)
performance[3, c("Accuracy", "Precision", "Recall", "F1_Score")] <- c(dt_accuracy, dt_precision, dt_recall, dt_f1_score)
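# Added sketch: on imbalanced data, accuracy can flatter a model, so AUC is a
# useful complement. Assumes the pROC package is installed (not loaded above).
library(pROC)
rf_probs <- predict(rf_model, X_test, type = "prob")[, "1"]
rf_auc <- auc(roc(response = y_test, predictor = rf_probs))
print(paste("Random Forest AUC:", round(as.numeric(rf_auc), 4)))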
# Print the performance metrics
print(performance)

Conclusion
- 16.07% of customers have churned (a sketch for reproducing these proportions follows this list).
- Gender is almost evenly distributed (52.9% male, 47.1% female), whereas the split between existing and attrited customers (83.9% vs. 16.1%) is highly imbalanced.
- Looking at attrited customers by gender, 14.4% more males than females have churned.
- Churned customers tend to be highly educated: the largest share of attrited customers holds a Graduate-level education (29.9%), followed by Post-Graduate (18.8%).
- By marital status, attrited customers are heavily clustered in the Married (43.6%) and Single (41.1%) categories, compared with Divorced (7.4%) and Unknown (7.9%).
- The income distribution of attrited customers is concentrated in the $60K - $80K bracket (37.6%), followed by Less than $40K (16.7%), compared with $80K - $120K (14.9%) and over $120K (11.5%). This suggests that higher-income customers are less likely to leave their credit card service than middle-income customers.
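The figures above come from exploratory analysis whose code is not shown in this report. As a rough sketch, the headline proportions could be reproduced from the cleaned data (before one-hot encoding) along these lines:

# Overall churn rate (Attrition_Flag is 1 for attrited customers)
mean(c_data$Attrition_Flag)
# Gender split across all customers
prop.table(table(c_data$Gender))
# Gender split among attrited customers only
prop.table(table(c_data$Gender[c_data$Attrition_Flag == 1]))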