WQD7004 Group Assignment

Group Members

Khairun Nadzirah Binti Abdul Karim (22066330)
Muhammad Shahzad Rafiq (S2150889)
Sadman Chowdhury (S2199546)
Yin QiXiang (S2150692)
Xin Dong (22060696)

Title

Credit Card Customer Churn Detection Using Machine Learning Algorithms

Dataset

Credit Card Fraud Data
https://data.world/vlad/credit-card-fraud-detection

Introduction

The rapid growth of the banking industry has allowed consumers to be more discerning about the banks they maintain relationships with, so customer retention has become a significant concern for many financial institutions. Retention matters especially for credit cards. A high churn rate, the rate at which customers stop doing business with an entity, can lead to significant revenue losses and higher acquisition costs for new customers. This project aims to predict credit card customer churn to help banks identify and retain customers at risk of leaving.

Problem Statement

Customer churn in the banking sector, particularly in credit cards, is a persistent issue. Predicting churn is complex because many factors can influence a customer’s decision to leave, including customer service quality, better offerings from competitors, and changes in the customer’s financial circumstances. Despite the advent of advanced data analytics techniques, many banks still struggle to predict and mitigate customer churn effectively. This project focuses on this problem by developing a model that predicts customer churn accurately and provides insights that help banks retain their credit card customers.

Research Objective

To understand the factors that contribute to credit card customer churn.
To develop a predictive model for credit card customer churn.
To provide recommendations for customer churn reduction strategies.

Research Question

What are the key factors influencing credit card customer churn?
How accurately can we predict credit card customer churn?
What strategies can banks implement to reduce churn rates among credit card customers?

Dataset

# Loading necessary libraries
library(readxl)
library(ggplot2)
library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(corrplot)
library(hexbin)
library(tidyr)
library(purrr)
library(gridExtra)
library(ggrepel)
library(pastecs)
library(caret)
#library(ROSE)
library(randomForest)
library(e1071)
library(rpart)
library(rpart.plot)

c_data <- read.csv('BankChurners.csv')
head(c_data, 3)
names(c_data)

Data Cleaning

# Remove duplicates
c_data <- unique(c_data)

# Check for null values in each column
null_counts <- sapply(c_data, function(x) sum(is.na(x)))

# Drop unnecessary columns
c_data <- c_data[, -c(1, 22, 23)]

EDA

Distribution of Customer Age

# Box plot
p1 <- ggplot(c_data, aes(x = "", y = Customer_Age)) +
  geom_boxplot() +
  labs(x = NULL, y = "Customer Age") +
  theme_minimal()

# Histogram
p2 <- ggplot(c_data, aes(x = Customer_Age)) +
  geom_histogram() +
  labs(x = "Customer Age", y = "Count") +
  theme_minimal()

# Combine plots
grid.arrange(p1, p2, nrow = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Customer ages in the dataset are approximately normally distributed. The box plot shows the median, quartiles, and any potential outliers of the age variable, and the histogram shows the number of customers in each age bin.
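The `stat_bin()` message above suggests choosing a bin width explicitly. One common heuristic is the Freedman-Diaconis rule. The sketch below applies it to synthetic ages (an assumption, not the real `Customer_Age` column); `fd_binwidth` is a hypothetical helper name introduced only for this illustration.

```r
# Synthetic ages standing in for c_data$Customer_Age (assumption: not the real data)
set.seed(1)
ages <- round(rnorm(1000, mean = 46, sd = 8))

# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  2 * IQR(x) / length(x)^(1 / 3)
}

bw <- fd_binwidth(ages)
print(bw)

# Base-R histogram counts using breaks spaced by the rounded bin width
bw_int <- max(1, round(bw))
breaks <- seq(floor(min(ages)), ceiling(max(ages)) + bw_int, by = bw_int)
h <- hist(ages, breaks = breaks, plot = FALSE)
print(h$counts)
```

With ggplot2, the resulting width could be passed as `geom_histogram(binwidth = bw)`, which should also silence the `stat_bin()` message.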

Next, we will perform similar EDA analysis for other variables in the dataset:

Distribution of Gender

# Bar plot
ggplot(c_data, aes(x = Gender, fill = Gender)) +
  geom_bar() +
  labs(x = "Gender", y = "Count") +
  theme_minimal()

The dataset contains more female than male customers, but the difference is small, so the two genders are roughly evenly represented.

Distribution of Education Level

# Bar plot
ggplot(c_data, aes(x = Education_Level, fill = Education_Level)) +
  geom_bar() +
  labs(x = "Education Level", y = "Count") +
  theme_minimal()

Correlation between Numeric Variables

# Select numeric variables for correlation analysis
numeric_vars <- c_data %>% select_if(is.numeric)

# Compute correlation matrix
cor_matrix <- cor(numeric_vars)

# Plot correlation matrix
corrplot(cor_matrix, method = "circle", type = "lower", tl.cex = 0.8)

The correlation matrix provides an overview of the relationships between the numeric variables in the dataset. It helps identify any strong positive or negative correlations between variables.
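A correlation plot is easier to act on when the strongest pairs are also listed as a table. The following is a minimal sketch on toy data (the columns `x1` to `x4` are made up for illustration and are not from the dataset); the same steps could be applied to `cor_matrix`.

```r
# Toy numeric data (assumption: stand-in for the numeric columns of c_data)
set.seed(42)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),   # strongly positively related to x1
  x3 = -x1 + rnorm(n, sd = 0.5),  # negatively related to x1
  x4 = rnorm(n)                   # unrelated noise
)

toy_cor <- cor(toy)

# Keep each pair once by looking only at the upper triangle
idx <- which(upper.tri(toy_cor), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(toy_cor)[idx[, "row"]],
  var2 = colnames(toy_cor)[idx[, "col"]],
  r = toy_cor[idx]
)

# Sort by absolute correlation, strongest first
cor_pairs <- cor_pairs[order(-abs(cor_pairs$r)), ]
print(cor_pairs)
```

Sorting by the absolute value keeps strong negative correlations near the top alongside strong positive ones.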

Scatter Plot: Total_Trans_Amt vs. Total_Trans_Ct

ggplot(c_data, aes(x = Total_Trans_Amt, y = Total_Trans_Ct)) +
  geom_point() +
  labs(x = "Total Transaction Amount", y = "Total Transaction Count") +
  theme_minimal()

The scatter plot showcases the relationship between the total transaction amount and the total transaction count. It helps visualize any patterns or trends between these two variables.

Summary Statistics

# Compute summary statistics
summary_stats <- c_data %>%
  select_if(is.numeric) %>%
  stat.desc()

# Print summary statistics
summary_stats

##              Customer_Age Dependent_count Months_on_book
## nbr.val      1.012700e+04    1.012700e+04   1.012700e+04
## nbr.null     0.000000e+00    9.040000e+02   0.000000e+00
## nbr.na       0.000000e+00    0.000000e+00   0.000000e+00
## min          2.600000e+01    0.000000e+00   1.300000e+01
## max          7.300000e+01    5.000000e+00   5.600000e+01
## range        4.700000e+01    5.000000e+00   4.300000e+01
## sum          4.691430e+05    2.376000e+04   3.638470e+05
## median       4.600000e+01    2.000000e+00   3.600000e+01
## mean         4.632596e+01    2.346203e+00   3.592841e+01
## SE.mean      7.966387e-02    1.290738e-02   7.936181e-02
## CI.mean.0.95 1.561570e-01    2.530102e-02   1.555649e-01
## var          6.426931e+01    1.687163e+00   6.378285e+01
## std.dev      8.016814e+00    1.298908e+00   7.986416e+00
## coef.var     1.730523e-01    5.536214e-01   2.222869e-01
##              Total_Relationship_Count Months_Inactive_12_mon
## nbr.val                  1.012700e+04           1.012700e+04
## nbr.null                 0.000000e+00           2.900000e+01
## nbr.na                   0.000000e+00           0.000000e+00
## min                      1.000000e+00           0.000000e+00
## max                      6.000000e+00           6.000000e+00
## range                    5.000000e+00           6.000000e+00
## sum                      3.861000e+04           2.370900e+04
## median                   4.000000e+00           2.000000e+00
## mean                     3.812580e+00           2.341167e+00
## SE.mean                  1.544630e-02           1.004265e-02
## CI.mean.0.95             3.027782e-02           1.968559e-02
## var                      2.416184e+00           1.021358e+00
## std.dev                  1.554408e+00           1.010622e+00
## coef.var                 4.077050e-01           4.316746e-01
##              Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal
## nbr.val               1.012700e+04 1.012700e+04        1.012700e+04
## nbr.null              3.990000e+02 0.000000e+00        2.470000e+03
## nbr.na                0.000000e+00 0.000000e+00        0.000000e+00
## min                   0.000000e+00 1.438300e+03        0.000000e+00
## max                   6.000000e+00 3.451600e+04        2.517000e+03
## range                 6.000000e+00 3.307770e+04        2.517000e+03
## sum                   2.486500e+04 8.741580e+07        1.177582e+07
## median                2.000000e+00 4.549000e+03        1.276000e+03
## mean                  2.455317e+00 8.631954e+03        1.162814e+03
## SE.mean               1.099267e-02 9.031607e+01        8.098609e+00
## CI.mean.0.95          2.154781e-02 1.770374e+02        1.587488e+01
## var                   1.223734e+00 8.260586e+07        6.642044e+05
## std.dev               1.106225e+00 9.088777e+03        8.149873e+02
## coef.var              4.505426e-01 1.052922e+00        7.008750e-01
##              Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt
## nbr.val         1.012700e+04         1.012700e+04    1.012700e+04
## nbr.null        0.000000e+00         5.000000e+00    0.000000e+00
## nbr.na          0.000000e+00         0.000000e+00    0.000000e+00
## min             3.000000e+00         0.000000e+00    5.100000e+02
## max             3.451600e+04         3.397000e+00    1.848400e+04
## range           3.451300e+04         3.397000e+00    1.797400e+04
## sum             7.563998e+07         7.695919e+03    4.460018e+07
## median          3.474000e+03         7.360000e-01    3.899000e+03
## mean            7.469140e+03         7.599407e-01    4.404086e+03
## SE.mean         9.033504e+01         2.178279e-03    3.375761e+01
## CI.mean.0.95    1.770746e+02         4.269859e-03    6.617161e+01
## var             8.264056e+07         4.805161e-02    1.154049e+07
## std.dev         9.090685e+03         2.192068e-01    3.397129e+03
## coef.var        1.217099e+00         2.884525e-01    7.713585e-01
##              Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## nbr.val        1.012700e+04        1.012700e+04          1.012700e+04
## nbr.null       0.000000e+00        7.000000e+00          2.470000e+03
## nbr.na         0.000000e+00        0.000000e+00          0.000000e+00
## min            1.000000e+01        0.000000e+00          0.000000e+00
## max            1.390000e+02        3.714000e+00          9.990000e-01
## range          1.290000e+02        3.714000e+00          9.990000e-01
## sum            6.568240e+05        7.212676e+03          2.783847e+03
## median         6.700000e+01        7.020000e-01          1.760000e-01
## mean           6.485869e+01        7.122224e-01          2.748936e-01
## SE.mean        2.332492e-01        2.365885e-03          2.739573e-03
## CI.mean.0.95   4.572148e-01        4.637604e-03          5.370107e-03
## var            5.509616e+02        5.668499e-02          7.600579e-02
## std.dev        2.347257e+01        2.380861e-01          2.756915e-01
## coef.var       3.619032e-01        3.342862e-01          1.002903e+00

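The summary statistics give an overview of the numeric variables in the dataset. For each variable, stat.desc() reports the number of values, zeros (nbr.null), and missing values (nbr.na); the minimum, maximum, range, sum, median, and mean; the standard error and 95% confidence half-width of the mean; and the variance, standard deviation, and coefficient of variation.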
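As a sanity check, the derived rows of the table can be reproduced from the basic ones. The sketch below uses the Customer_Age values printed above and assumes the usual definitions (standard error as standard deviation over the square root of n, a t-based confidence half-width, and coefficient of variation as standard deviation over mean).

```r
# Values copied from the Customer_Age column of the stat.desc() output
n       <- 10127
avg     <- 46.32596
std_dev <- 8.016814

se_mean  <- std_dev / sqrt(n)                # compare with SE.mean
ci_half  <- qt(0.975, df = n - 1) * se_mean  # compare with CI.mean.0.95
coef_var <- std_dev / avg                    # compare with coef.var

print(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var))
```

The three results should agree with the SE.mean, CI.mean.0.95, and coef.var rows for Customer_Age up to rounding of the inputs.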

This EDA analysis provides insights into the distribution, relationships, and summary statistics of the key variables in the dataset. Further exploratory analyses can be conducted for other variables as per the project requirements.

# Box plot
p1 <- ggplot(c_data, aes(x = "", y = Dependent_count)) +
  geom_boxplot() +
  labs(x = NULL, y = "Dependent count") +
  theme_minimal()

# Histogram
p2 <- ggplot(c_data, aes(x = Dependent_count)) +
  geom_histogram() +
  labs(x = "Dependent count", y = "Count") +
  theme_minimal()

# Combine plots
grid.arrange(p1, p2, nrow = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of dependent counts: The dependent counts are roughly normally distributed, with a slight right skew.

names(c_data)

##  [1] "Attrition_Flag"           "Customer_Age"            
##  [3] "Gender"                   "Dependent_count"         
##  [5] "Education_Level"          "Marital_Status"          
##  [7] "Income_Category"          "Card_Category"           
##  [9] "Months_on_book"           "Total_Relationship_Count"
## [11] "Months_Inactive_12_mon"   "Contacts_Count_12_mon"   
## [13] "Credit_Limit"             "Total_Revolving_Bal"     
## [15] "Avg_Open_To_Buy"          "Total_Amt_Chng_Q4_Q1"    
## [17] "Total_Trans_Amt"          "Total_Trans_Ct"          
## [19] "Total_Ct_Chng_Q4_Q1"      "Avg_Utilization_Ratio"

# Proportion of existing and attrited customers count
attrition_counts <- table(c_data$Attrition_Flag)
barplot(attrition_counts, col = "lightblue", main = "Proportion of Existing and Attrited Customers Count")

# Proportion of existing and attrited customers by gender (countplot)
gender_attrition_counts <- table(c_data$Attrition_Flag, c_data$Gender)
barplot(gender_attrition_counts, col = c("lightblue", "lightgreen"), beside = TRUE,
        legend = rownames(gender_attrition_counts),
        main = "Proportion of Existing and Attrited Customers by Gender")

Proportion of different income levels:

# Pie chart
ggplot(c_data, aes(x = "", fill = Income_Category)) +
  geom_bar(width = 1, color = "white") +
  coord_polar("y", start = 0) +
  labs(x = NULL, fill = "Income Category") +
  theme_minimal() +
  theme(legend.position = "right") +
  ggtitle("Proportion of Different Income Levels")

Distribution of months the customer is part of the bank:

# Box plot and histogram
p1 <- ggplot(c_data, aes(x = "", y = Months_on_book)) +
  geom_boxplot() +
  labs(x = NULL, y = "Months on Book") +
  theme_minimal()

p2 <- ggplot(c_data, aes(x = Months_on_book)) +
  geom_histogram() +
  labs(x = "Months on Book", y = "Count") +
  theme_minimal()

grid.arrange(p1, p2, nrow = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Proportion of entire education levels
education_counts <- table(c_data$Education_Level)
barplot(education_counts, col = "lightblue", main = "Proportion of Entire Education Levels")

# Proportion of education level by existing and attrited customer
attrition_education_counts <- table(c_data$Education_Level, c_data$Attrition_Flag)
barplot(attrition_education_counts, col = c("lightblue", "lightgreen"), beside = TRUE,
        legend = rownames(attrition_education_counts),
        main = "Proportion of Education Level by Existing and Attrited Customer")

# Proportion of education level by gender (countplot)
ggplot(c_data, aes(x = Education_Level, fill = Gender)) +
  geom_bar(position = "fill") +
  labs(title = "Proportion of Education Level by Gender", x = "Education Level", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Proportion of Education Levels: Even assuming that most customers with an unknown education status lack any education, more than 70% of the customers have some formal education. About 35% have a higher level of education.

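# Proportion of marital status by attrited and existing customers
# (cross-tabulate with Attrition_Flag so the plot matches its title)
marital_counts <- table(c_data$Marital_Status, c_data$Attrition_Flag)
barplot(marital_counts, col = c("lightblue", "lightgreen", "lightcoral", "lightyellow"), beside = TRUE,
        legend = rownames(marital_counts),
        main = "Proportion of Marital Status by Attrited and Existing Customers")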

# Correlation using heatmap
correlation_matrix <- cor(c_data[, c("Customer_Age", "Dependent_count", "Months_on_book", "Credit_Limit",
                                     "Total_Revolving_Bal", "Avg_Open_To_Buy", "Total_Amt_Chng_Q4_Q1",
                                     "Total_Trans_Amt", "Total_Trans_Ct", "Total_Ct_Chng_Q4_Q1",
                                     "Avg_Utilization_Ratio")])
heatmap(correlation_matrix, col = colorRampPalette(c("lightblue", "white", "lightcoral"))(100),
        main = "Correlation Heatmap")

# Proportion of income category
income_counts <- table(c_data$Income_Category)
barplot(income_counts, col = "lightblue", main = "Proportion of Income Category")

# Proportion of income category by customer
income_customer_counts <- table(c_data$Income_Category, c_data$Attrition_Flag)
barplot(income_customer_counts, col = c("lightblue", "lightgreen"), beside = TRUE,
        legend = rownames(income_customer_counts),
        main = "Proportion of Income Category by Customer")

# Customer age count by customer
age_counts <- table(c_data$Customer_Age)
barplot(age_counts, col = "lightblue", main = "Customer Age Count by Customer")

Kurtosis of Months on book features:

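# Calculate kurtosis (e1071::kurtosis); use a distinct variable name
# so the result does not shadow the function
months_kurtosis <- kurtosis(c_data$Months_on_book)

# Print kurtosis value
print(paste("Kurtosis of Months on book features is:", months_kurtosis))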

## [1] "Kurtosis of Months on book features is: 0.398638886235621"
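To make the reported number easier to interpret, here is a hand-rolled excess kurtosis. It is a sketch that assumes the "type 3" definition documented for `e1071::kurtosis` (fourth central moment divided by the fourth power of the sample standard deviation, minus 3); `excess_kurtosis` is a hypothetical helper and the input vector is illustrative, not taken from the dataset.

```r
# Excess kurtosis computed by hand.
# Assumption: b2 = m4 / s^4 - 3, where m4 is the fourth central moment
# (divided by n) and s is the sample standard deviation (divided by n - 1).
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m4 <- mean((x - mean(x))^4)
  m4 / sd(x)^4 - 3
}

# Small illustrative vector (not from the dataset)
x <- c(1, 2, 3, 4, 5)
print(excess_kurtosis(x))
```

Under this definition a normal distribution scores close to 0, so the value of about 0.40 for Months_on_book points to a distribution that is somewhat more peaked or heavier-tailed than normal.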

Distribution of the Total Transaction Amount (Last 12 months):

# Box plot and histogram
p1 <- ggplot(c_data, aes(x = "", y = Total_Trans_Amt)) +
  geom_boxplot() +
  labs(x = NULL, y = "Total Transaction Amount") +
  theme_minimal()

p2 <- ggplot(c_data, aes(x = Total_Trans_Amt)) +
  geom_histogram() +
  labs(x = "Total Transaction Amount", y = "Count") +
  theme_minimal()

grid.arrange(p1, p2, nrow = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data Processing

# Identify the column names of categorical variables and factors
categorical_columns <- sapply(c_data,  is.character)
categorical_column_names <- names(c_data[categorical_columns])

# Print the column names of categorical variables and factors
print(categorical_column_names)

## [1] "Attrition_Flag"  "Gender"          "Education_Level" "Marital_Status" 
## [5] "Income_Category" "Card_Category"

# Convert values of Attrition_Flag to 0 and 1
c_data$Attrition_Flag <- ifelse(c_data$Attrition_Flag == "Existing Customer", 0, 1)

names(c_data)

##  [1] "Attrition_Flag"           "Customer_Age"            
##  [3] "Gender"                   "Dependent_count"         
##  [5] "Education_Level"          "Marital_Status"          
##  [7] "Income_Category"          "Card_Category"           
##  [9] "Months_on_book"           "Total_Relationship_Count"
## [11] "Months_Inactive_12_mon"   "Contacts_Count_12_mon"   
## [13] "Credit_Limit"             "Total_Revolving_Bal"     
## [15] "Avg_Open_To_Buy"          "Total_Amt_Chng_Q4_Q1"    
## [17] "Total_Trans_Amt"          "Total_Trans_Ct"          
## [19] "Total_Ct_Chng_Q4_Q1"      "Avg_Utilization_Ratio"

str(c_data)

## 'data.frame':    10127 obs. of  20 variables:
##  $ Attrition_Flag          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Customer_Age            : int  45 49 51 40 40 44 51 32 37 48 ...
##  $ Gender                  : chr  "M" "F" "M" "F" ...
##  $ Dependent_count         : int  3 5 3 4 3 2 4 0 3 2 ...
##  $ Education_Level         : chr  "High School" "Graduate" "Graduate" "High School" ...
##  $ Marital_Status          : chr  "Married" "Single" "Married" "Unknown" ...
##  $ Income_Category         : chr  "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
##  $ Card_Category           : chr  "Blue" "Blue" "Blue" "Blue" ...
##  $ Months_on_book          : int  39 44 36 34 21 36 46 27 36 36 ...
##  $ Total_Relationship_Count: int  5 6 4 3 5 3 6 2 5 6 ...
##  $ Months_Inactive_12_mon  : int  1 1 1 4 1 1 1 2 2 3 ...
##  $ Contacts_Count_12_mon   : int  3 2 0 1 0 2 3 2 0 3 ...
##  $ Credit_Limit            : num  12691 8256 3418 3313 4716 ...
##  $ Total_Revolving_Bal     : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
##  $ Avg_Open_To_Buy         : num  11914 7392 3418 796 4716 ...
##  $ Total_Amt_Chng_Q4_Q1    : num  1.33 1.54 2.59 1.4 2.17 ...
##  $ Total_Trans_Amt         : int  1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
##  $ Total_Trans_Ct          : int  42 33 20 20 28 24 31 36 24 32 ...
##  $ Total_Ct_Chng_Q4_Q1     : num  1.62 3.71 2.33 2.33 2.5 ...
##  $ Avg_Utilization_Ratio   : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...

categorical_cols <- c("Gender", "Education_Level", "Marital_Status", "Income_Category", "Card_Category")

c_data[categorical_cols] <- lapply(c_data[categorical_cols], as.factor)

# Create a formula for one-hot encoding
formula <- as.formula(paste("factor(Attrition_Flag) ~", paste(categorical_cols, collapse = "+")))

# Create dummy variables using dummyVars
dummy_data <- predict(dummyVars(formula, data = c_data), newdata = c_data)

# Combine numerical and one-hot encoded data
combined_data <- cbind(c_data[, !(names(c_data) %in% categorical_cols)], dummy_data)

# Split the data into training and testing sets
set.seed(42)
train_indices <- createDataPartition(combined_data$Attrition_Flag, p = 0.7, list = FALSE)
train_data <- combined_data[train_indices, ]
test_data <- combined_data[-train_indices, ]

# Separate predictors (x) and target variable (y) in the training and testing sets
X_train <- train_data[, !(names(train_data) %in% "Attrition_Flag")]
y_train <- train_data$Attrition_Flag
X_test <- test_data[, !(names(test_data) %in% "Attrition_Flag")]
y_test <- test_data$Attrition_Flag
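Because the classes are imbalanced, it helps to confirm that both partitions keep the same share of churned customers. The sketch below makes class-stratified sampling explicit in base R on synthetic labels; `stratified_indices` is a hypothetical helper, not part of caret, and the 840/160 label counts are made up to roughly mimic the imbalance.

```r
# Toy labels with roughly the class balance seen in this report
# (assumption: synthetic data, not the BankChurners file)
set.seed(42)
y <- c(rep(0, 840), rep(1, 160))

# Sample a fraction p of the row indices within each class separately
stratified_indices <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_indices(y, p = 0.7)
y_tr <- y[train_idx]
y_te <- y[-train_idx]

# Class shares should match in both partitions
print(prop.table(table(y_tr)))
print(prop.table(table(y_te)))
```

The same check, `prop.table(table(...))` on `y_train` and `y_test`, can be run on the real split to verify that the churn rate is similar in both sets.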

names(X_train)

##  [1] "Customer_Age"                   "Dependent_count"               
##  [3] "Months_on_book"                 "Total_Relationship_Count"      
##  [5] "Months_Inactive_12_mon"         "Contacts_Count_12_mon"         
##  [7] "Credit_Limit"                   "Total_Revolving_Bal"           
##  [9] "Avg_Open_To_Buy"                "Total_Amt_Chng_Q4_Q1"          
## [11] "Total_Trans_Amt"                "Total_Trans_Ct"                
## [13] "Total_Ct_Chng_Q4_Q1"            "Avg_Utilization_Ratio"         
## [15] "Gender.F"                       "Gender.M"                      
## [17] "Education_Level.College"        "Education_Level.Doctorate"     
## [19] "Education_Level.Graduate"       "Education_Level.High School"   
## [21] "Education_Level.Post-Graduate"  "Education_Level.Uneducated"    
## [23] "Education_Level.Unknown"        "Marital_Status.Divorced"       
## [25] "Marital_Status.Married"         "Marital_Status.Single"         
## [27] "Marital_Status.Unknown"         "Income_Category.$120K +"       
## [29] "Income_Category.$40K - $60K"    "Income_Category.$60K - $80K"   
## [31] "Income_Category.$80K - $120K"   "Income_Category.Less than $40K"
## [33] "Income_Category.Unknown"        "Card_Category.Blue"            
## [35] "Card_Category.Gold"             "Card_Category.Platinum"        
## [37] "Card_Category.Silver"

Machine Learning Modeling

# Random Forest Classifier
# (class.factors is not a randomForest() argument, so it is omitted here)
rf_model <- randomForest(x = X_train, y = as.factor(y_train))

# Train the SVM model
svm_model <- svm(x = X_train, y = as.factor(y_train))

# Train the Decision Tree model
dt_model <- rpart(y_train ~ ., data = X_train, method = "class")

Performance Metrics

# Make predictions on the test set
rf_predictions <- predict(rf_model, X_test)

# Convert y_test to have the same levels as rf_predictions
y_test <- factor(y_test, levels = levels(rf_predictions))

# Calculate accuracy and confusion matrix
rf_accuracy <- sum(rf_predictions == y_test) / length(y_test)
rf_confusion <- confusionMatrix(rf_predictions, y_test)
# Print accuracy and confusion matrix
print(paste("Random Forest Accuracy:", rf_accuracy))

## [1] "Random Forest Accuracy: 0.953258722843976"

print("Random Forest Confusion Matrix:")

## [1] "Random Forest Confusion Matrix:"

print(rf_confusion$table)

##           Reference
## Prediction    0    1
##          0 2509  119
##          1   23  387

# SVM
# Make predictions on the test set
svm_predictions <- predict(svm_model, X_test)

# Evaluate the model performance
svm_accuracy <- sum(svm_predictions == y_test) / length(y_test)
# Create the confusion matrix
svm_confusion <- confusionMatrix(svm_predictions, y_test)
print(paste("SVM Accuracy:", svm_accuracy))

## [1] "SVM Accuracy: 0.910138248847926"

print("SVM Confusion Matrix:")

## [1] "SVM Confusion Matrix:"

print(svm_confusion$table)

##           Reference
## Prediction    0    1
##          0 2490  231
##          1   42  275

# Decision Tree

# Predict class labels on the test set
dt_predictions <- predict(dt_model, newdata = X_test, type = "class")

# Evaluate the model performance
dt_accuracy <- sum(dt_predictions == y_test) / length(y_test)
dt_confusion <- confusionMatrix(dt_predictions, y_test)
print(paste("Decision Tree Accuracy:", dt_accuracy))

## [1] "Decision Tree Accuracy: 0.929229756418697"

print("Decision Tree Confusion Matrix:")

## [1] "Decision Tree Confusion Matrix:"

print(dt_confusion$table)

##           Reference
## Prediction    0    1
##          0 2449  132
##          1   83  374

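# Create a data frame to store the performance metrics.
# Note: confusionMatrix() uses the first factor level ("0", existing customer) as the
# positive class by default, so Precision and Recall below describe that class.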
performance <- data.frame(Model = c("Random Forest", "SVM", "Decision Tree"),
                          Accuracy = numeric(3),
                          Precision = numeric(3),
                          Recall = numeric(3),
                          F1_Score = numeric(3))

# Random Forest
rf_accuracy <- sum(rf_predictions == y_test) / length(y_test)
rf_confusion <- confusionMatrix(rf_predictions, y_test)
rf_precision <- rf_confusion$byClass["Pos Pred Value"]
rf_recall <- rf_confusion$byClass["Sensitivity"]
rf_f1_score <- 2 * (rf_precision * rf_recall) / (rf_precision + rf_recall)
performance[1, c("Accuracy", "Precision", "Recall", "F1_Score")] <- c(rf_accuracy, rf_precision, rf_recall, rf_f1_score)

# SVM
svm_accuracy <- sum(svm_predictions == y_test) / length(y_test)
svm_confusion <- confusionMatrix(svm_predictions, y_test)
svm_precision <- svm_confusion$byClass["Pos Pred Value"]
svm_recall <- svm_confusion$byClass["Sensitivity"]
svm_f1_score <- 2 * (svm_precision * svm_recall) / (svm_precision + svm_recall)
performance[2, c("Accuracy", "Precision", "Recall", "F1_Score")] <- c(svm_accuracy, svm_precision, svm_recall, svm_f1_score)

# Decision Tree
dt_accuracy <- sum(dt_predictions == y_test) / length(y_test)
dt_confusion <- confusionMatrix(dt_predictions, y_test)
dt_precision <- dt_confusion$byClass["Pos Pred Value"]
dt_recall <- dt_confusion$byClass["Sensitivity"]
dt_f1_score <- 2 * (dt_precision * dt_recall) / (dt_precision + dt_recall)
performance[3, c("Accuracy", "Precision", "Recall", "F1_Score")] <- c(dt_accuracy, dt_precision, dt_recall, dt_f1_score)

# Print the performance metrics
print(performance)

##           Model  Accuracy Precision    Recall  F1_Score
## 1 Random Forest 0.9532587 0.9547184 0.9909163 0.9724806
## 2           SVM 0.9101382 0.9151047 0.9834123 0.9480297
## 3 Decision Tree 0.9292298 0.9488570 0.9672196 0.9579503
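The precision and recall values above match the confusion matrices when class 0 (existing customer) is treated as the positive class. Since the goal is churn detection, the same matrices can be re-read with class 1 (attrited customer) as the positive class. The sketch below does this from the counts printed earlier; `make_cm` and `churn_metrics` are hypothetical helper names.

```r
# Confusion matrices copied from the outputs above (rows = Prediction, cols = Reference)
make_cm <- function(v) {
  matrix(v, nrow = 2,
         dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))
}
cms <- list(
  "Random Forest" = make_cm(c(2509, 23, 119, 387)),
  "SVM"           = make_cm(c(2490, 42, 231, 275)),
  "Decision Tree" = make_cm(c(2449, 83, 132, 374))
)

# Metrics with class "1" (attrited customer) as the positive class
churn_metrics <- function(cm) {
  tp <- cm["1", "1"]  # predicted churn, actually churned
  fp <- cm["1", "0"]  # predicted churn, actually stayed
  fn <- cm["0", "1"]  # predicted stay, actually churned
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(Accuracy = sum(diag(cm)) / sum(cm),
    Precision = precision,
    Recall = recall,
    F1_Score = 2 * precision * recall / (precision + recall))
}

churn_performance <- t(sapply(cms, churn_metrics))
print(round(churn_performance, 4))
```

Read this way, accuracy is unchanged but recall for churned customers is noticeably lower than the table suggests: the Random Forest finds 387 of the 506 churned customers in the test set, about 0.76. For comparison, always predicting "existing customer" would already score 2532 / 3038, about 0.83 accuracy, so accuracy alone overstates performance on this imbalanced data.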

Conclusion

16.07% of customers have churned.
The gender split is almost even (52.9% female and 47.1% male), whereas the split between existing and attrited customers (83.9% and 16.1%) is highly imbalanced.
Among attrited customers, there are 14.4% more females than males.
Customers who have churned are highly educated: the largest share of attrited customers have a Graduate-level education (29.9%), followed by Post-Graduate level (18.8%).
Most customers who have churned are Married (43.6%) or Single (41.1%), compared with Divorced (7.4%) and Unknown (7.9%), so the marital status of attrited customers is highly concentrated in the Married and Single groups.
The income category of attrited customers is highly concentrated around $60K - $80K (37.6%), followed by Less than $40K (16.7%), compared with attrited customers with higher annual incomes of $80K - $120K (14.9%) and $120K + (11.5%). We assume that customers with higher incomes are less likely to leave their credit card services than middle-income customers.