1 Introduction

This document presents the data cleaning, pre-processing, and exploratory data analysis for the WQD7004 group project. Customer churn has become a significant issue in the fast-paced, fiercely competitive banking sector, endangering long-term stability and profitability. Customer churn, the termination of a customer's engagement with a business, directly impacts revenue streams and acquisition expenses. Proactive churn-control measures are crucial since, according to industry studies, acquiring a new client can be five times more expensive than keeping an existing one.

The study addresses this challenge by analyzing a Kaggle dataset comprising 10,000 customer records with 18 variables, ranging from transactional and behavioral indicators to socio-demographic characteristics. The dataset provides a strong basis for understanding churn dynamics in the banking environment, offering an opportunity to explore important churn drivers such as credit score, product utilization, customer satisfaction, and demographic characteristics.

This research seeks actionable insights through a methodical approach that includes exploratory data analysis (EDA), preprocessing, and predictive modeling. To prepare the dataset for modeling, key preprocessing techniques such as feature encoding, scaling, and missing-value analysis were carefully applied. Additionally, oversampling strategies such as SMOTE were used to address class imbalance, yielding a representative and balanced training dataset.

The analysis's conclusions will provide practical suggestions for identifying at-risk clients, allowing financial institutions to put targeted retention tactics into place. Banks can enhance customer satisfaction, maximize client engagement, and ultimately lower churn rates by utilizing data-driven insights. This paper thus contributes to the body of knowledge on churn prediction while presenting practical solutions for improving client retention in the banking industry.

2 Objective

  1. To identify and select the most correlated features associated with bank customer churn, including socio-demographic background, types of cards held, credit score, points earned, and number of products purchased through the bank

  2. To create a predictive model to identify bank customer churn based on relevant factors

3 Data Understanding

3.1 Load Packages

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(dplyr) 
library(ggplot2) 
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(smotefamily)

3.2 Importing Data

# Load the dataset 
df <- read.csv('Customer-Churn-Records.csv')

# Display summary statistics of the dataset
summary(df)
##    RowNumber       CustomerId         Surname           CreditScore   
##  Min.   :    1   Min.   :15565701   Length:10000       Min.   :350.0  
##  1st Qu.: 2501   1st Qu.:15628528   Class :character   1st Qu.:584.0  
##  Median : 5000   Median :15690738   Mode  :character   Median :652.0  
##  Mean   : 5000   Mean   :15690941                      Mean   :650.5  
##  3rd Qu.: 7500   3rd Qu.:15753234                      3rd Qu.:718.0  
##  Max.   :10000   Max.   :15815690                      Max.   :850.0  
##   Geography            Gender               Age            Tenure      
##  Length:10000       Length:10000       Min.   :18.00   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.:32.00   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Median :37.00   Median : 5.000  
##                                        Mean   :38.92   Mean   : 5.013  
##                                        3rd Qu.:44.00   3rd Qu.: 7.000  
##                                        Max.   :92.00   Max.   :10.000  
##     Balance       NumOfProducts    HasCrCard      IsActiveMember  
##  Min.   :     0   Min.   :1.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:     0   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 97199   Median :1.00   Median :1.0000   Median :1.0000  
##  Mean   : 76486   Mean   :1.53   Mean   :0.7055   Mean   :0.5151  
##  3rd Qu.:127644   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :250898   Max.   :4.00   Max.   :1.0000   Max.   :1.0000  
##  EstimatedSalary         Exited          Complain      Satisfaction.Score
##  Min.   :    11.58   Min.   :0.0000   Min.   :0.0000   Min.   :1.000     
##  1st Qu.: 51002.11   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2.000     
##  Median :100193.91   Median :0.0000   Median :0.0000   Median :3.000     
##  Mean   :100090.24   Mean   :0.2038   Mean   :0.2044   Mean   :3.014     
##  3rd Qu.:149388.25   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:4.000     
##  Max.   :199992.48   Max.   :1.0000   Max.   :1.0000   Max.   :5.000     
##   Card.Type          Point.Earned   
##  Length:10000       Min.   : 119.0  
##  Class :character   1st Qu.: 410.0  
##  Mode  :character   Median : 605.0  
##                     Mean   : 606.5  
##                     3rd Qu.: 801.0  
##                     Max.   :1000.0

3.3 Checking Data

# Check the number of rows and columns 
cat("Shape of the dataset: ", dim(df), "\n") 
## Shape of the dataset:  10000 18
# Print the column names 
print(colnames(df))
##  [1] "RowNumber"          "CustomerId"         "Surname"           
##  [4] "CreditScore"        "Geography"          "Gender"            
##  [7] "Age"                "Tenure"             "Balance"           
## [10] "NumOfProducts"      "HasCrCard"          "IsActiveMember"    
## [13] "EstimatedSalary"    "Exited"             "Complain"          
## [16] "Satisfaction.Score" "Card.Type"          "Point.Earned"

The dataset comprises 10,000 rows and 18 columns, representing 10,000 samples and 18 features. To gain a deeper understanding of the dataset structure, the columns are categorized into three primary feature groups based on the type of data they represent:

User Data

Key demographic and personal information about each customer, used to understand the characteristics of the customer base:

  • Customer ID

  • Surname

  • Credit Score

  • Geography

  • Gender

  • Age

  • Has Credit Card

  • Estimated Salary

  • Exited

Card Data

Information about the customer's banking product usage, giving insight into the customer's engagement with the bank's products:

  • Balance

  • Number of Products

  • Card Type

  • Point Earned

Sentiment Driver

Reflects customer sentiment and engagement, helping assess how connected and satisfied customers are with the bank:

  • Complain

  • Satisfaction Score

  • Is Active Member
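For convenience in later steps, these groupings can be captured as plain character vectors in R. This is a small sketch; the vector names (user_cols, card_cols, sentiment_cols) are our own, while the column names follow the dataset's headers.

# Feature groupings as reusable column-name vectors (names are illustrative)
user_cols <- c("CustomerId", "Surname", "CreditScore", "Geography",
               "Gender", "Age", "HasCrCard", "EstimatedSalary", "Exited")
card_cols <- c("Balance", "NumOfProducts", "Card.Type", "Point.Earned")
sentiment_cols <- c("Complain", "Satisfaction.Score", "IsActiveMember")

# Quick structural check of one group
str(df[user_cols])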

3.4 Exploratory Data Analysis (EDA)

# Count exited customers and calculate percentages
exited_customers_df <- as.data.frame(table(df$Exited))
colnames(exited_customers_df) <- c("Exited", "Count")
exited_customers_df$Percentage <- (exited_customers_df$Count / sum(exited_customers_df$Count)) * 100

# Add labels
exited_customers_df$Label <- ifelse(exited_customers_df$Percentage >= 5,
                                    paste0(exited_customers_df$Exited, " (", round(exited_customers_df$Percentage, 1), "%)"),
                                    "")

# Define colors
colors <- c("salmon", "turquoise3")

# Pie chart for percentage of customers
ggplot(exited_customers_df, aes(x = "", y = Count, fill = factor(Exited))) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Percentage of Churned Customers", fill = "Exited") +
  scale_fill_manual(values = colors) +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5))

# Bar chart for count of customers that churned and not churned by Geography
ggplot(df, aes(x = Geography, fill = factor(Exited))) +
  geom_bar(position = "dodge") +
  labs(title = "Customer Churn by Geography",
       x = "Geography",
       y = "Count",
       fill = "Exited") +
  theme_minimal()

# Histogram Age distribution
hist(df[['Age']], main = "Age distribution", xlab = "Age", ylab = "Amount of customers", col = "slateblue", border = "black")

# Filter data to include only churned customers
exited_customers <- df[df$Exited == 1, ]

# Age distribution of churned customers
library(ggplot2)
ggplot(exited_customers, aes(x = Age)) +
  geom_histogram(bins = 20, fill = "slateblue", color = "black", alpha = 0.7) +
  labs(title = "Age Distribution of Churned Customers",
       x = "Age",
       y = "Number of Churned Customers") +
  theme_minimal()

# Bar chart for count of customers by geography and credit card ownership
ggplot(df, aes(x = Geography, fill = factor(HasCrCard))) +
  geom_bar(position = "dodge") +
  scale_fill_manual(name = "Has Credit Card", labels = c("No", "Yes"), values = c("salmon", "turquoise3")) +
  labs(title = "Count of Customers by Geography and Credit Card Ownership", x = "Geography", y = "Count") +
  theme_minimal() +
  theme(legend.position = "top")

# Bar chart for count of customers with/without credit card by churn status
ggplot(df, aes(x = factor(HasCrCard, labels = c("No Credit Card", "Has Credit Card")), 
                 fill = factor(Exited))) +
  geom_bar(position = "dodge") +
  labs(title = "Churn Status of Customers with/without Credit Card",
       x = "Credit Card Ownership",
       y = "Count",
       fill = "Exited") +
  theme_minimal()

# Bar chart for churn status of customers by credit card ownership and geography
ggplot(df, aes(x = Geography, 
                 fill = interaction(factor(HasCrCard, labels = c("No Credit Card", "Has Credit Card")), 
                                    factor(Exited)))) +
  geom_bar(position = "dodge") + 
  scale_fill_manual(values = c("lightsalmon", "salmon", "paleturquoise", "turquoise")) + 
  labs(title = "Churn Status of Customers by Credit Card Ownership and Geography",
       x = "Geography",
       y = "Count",
       fill = "Credit Card & Exited") +
  theme_minimal()

# Bar chart for churned vs. not churned based on Card Type
ggplot(df, aes(x = `Card.Type`, fill = factor(Exited))) +
  geom_bar(position = "dodge") +
  labs(title = "Customer Churn by Card Type",
       x = "Card Type",
       y = "Count",
       fill = "Exited") +
  theme_minimal()

4 Data Preparation

4.1 Data Cleaning

# Display the counts of each data type
sapply(df, class) %>%
  table() %>%
  print()
## .
## character   integer   numeric 
##         4        12         2
# Check number of columns with null values 
null_counts <- colSums(is.na(df)) 
cat("Total number of columns with null values:", sum(null_counts > 0), "/", ncol(df), "\n")
## Total number of columns with null values: 0 / 18
# Create dataframe with column names, data types, and null counts 
column_info <- data.frame( 
  "Column Name" = colnames(df), 
  "Data Type" = sapply(df, class), 
  "Null Count" = null_counts 
) 

# Adjust row indices to start from 1 
rownames(column_info) <- 1:nrow(column_info) 
print(column_info)
##           Column.Name Data.Type Null.Count
## 1           RowNumber   integer          0
## 2          CustomerId   integer          0
## 3             Surname character          0
## 4         CreditScore   integer          0
## 5           Geography character          0
## 6              Gender character          0
## 7                 Age   integer          0
## 8              Tenure   integer          0
## 9             Balance   numeric          0
## 10      NumOfProducts   integer          0
## 11          HasCrCard   integer          0
## 12     IsActiveMember   integer          0
## 13    EstimatedSalary   numeric          0
## 14             Exited   integer          0
## 15           Complain   integer          0
## 16 Satisfaction.Score   integer          0
## 17          Card.Type character          0
## 18       Point.Earned   integer          0

It is observed that there are 12 features of integer data type, 4 of character data type, and 2 of numeric (double) data type. Furthermore, data imputation is excluded from this process as all columns are free of null values.

4.2 Data Type Conversion

All columns with character data type need to be converted into numerical format so that they are suitable for machine learning algorithms. Four columns need to be converted, namely Surname, Geography, Gender, and Card Type.

# Get columns of character data type only 
object_columns <- sapply(df, is.character) 

# Loop through each object column 
for (col in names(df)[object_columns]) { 
  unique_values <- unique(df[[col]]) 
  
  total_num_of_unique_values <- length(unique_values) 
  cat("Total number of unique values in '", col, "': ", total_num_of_unique_values, "\n\n", sep = "") 
}
## Total number of unique values in 'Surname': 2932
## 
## Total number of unique values in 'Geography': 3
## 
## Total number of unique values in 'Gender': 2
## 
## Total number of unique values in 'Card.Type': 4

The above output shows that, among the four character columns, Geography, Gender, and Card Type likely contain categorical data, as they have only a few unique values relative to the number of rows. The Surname column, with 2,932 unique entries, most likely does not contain categorical data.

# Specify the column that might contain categorical data 
categorical_data_columns <- c("Geography", "Gender", "Card.Type") 

# Loop through each column and print unique values 
for (col in categorical_data_columns) { 
  cat(paste("Unique values in", col, ":")) 
  cat(unique(df[[col]]), "\n\n") 
}
## Unique values in Geography :France Spain Germany 
## 
## Unique values in Gender :Female Male 
## 
## Unique values in Card.Type :DIAMOND GOLD SILVER PLATINUM

From the analysis above, it is observed that the Geography, Gender, and Card Type columns indeed contain categorical data. Specifically, Geography has three categories (France, Spain, Germany), Gender has two categories (Female, Male), and Card Type has four ordinal categories (Diamond > Platinum > Gold > Silver). These columns carry meaningful information and are suitable for conversion into numerical values.

4.3 Elimination of Non-Contributing Columns

In contrast, the Surname column has 2,932 unique values, indicating high variability that does not contribute meaningfully to the analysis, as the Customer ID column is already sufficient to identify each individual customer. Thus, this feature, together with Row Number, which carries no meaning, can be excluded from further analysis. The removal of both columns is shown below.

# Drop the 'Surname' and 'Row Number' columns from the data frame 
df <- df[, !(colnames(df) %in% c("Surname", "RowNumber"))] 

# Verify the columns of the modified data frame 
colnames(df)
##  [1] "CustomerId"         "CreditScore"        "Geography"         
##  [4] "Gender"             "Age"                "Tenure"            
##  [7] "Balance"            "NumOfProducts"      "HasCrCard"         
## [10] "IsActiveMember"     "EstimatedSalary"    "Exited"            
## [13] "Complain"           "Satisfaction.Score" "Card.Type"         
## [16] "Point.Earned"

4.4 Encoding and Mapping Implementation

Then, the Geography, Gender, and Card Type columns are converted into integer data type.

# Label encoding for Gender (Female = 0, Male = 1) 
df$Gender <- as.integer(df$Gender == "Male")

# One-Hot encoding for Geography 
df <- cbind(df, model.matrix(~ Geography - 1, data = df))
df$Geography <- NULL

# Card Type Mapping 
card_type_mapping <- c("SILVER" = 1, "GOLD" = 2, "PLATINUM" = 3, "DIAMOND" = 4) 
df$Card.Type <- card_type_mapping[df$Card.Type]

The method used for each conversion and the rationale behind it are described below.

Column: Gender
Method: Label Encoding
Reason: Contains two distinct categories, Male and Female, which can be effectively represented as integers (0 for Female, 1 for Male). This encoding captures the binary distinction without implying any order.

Column: Geography
Method: One-Hot Encoding
Reason: Consists of nominal categories (France, Spain, Germany) without any inherent ranking. One-hot encoding creates separate binary columns for each country, ensuring that the model treats them as distinct and independent categories.

Column: Card Type
Method: Manual Mapping (Ordinal Encoding)
Reason: Has a clear ordinal relationship, with ranks defined as Silver < Gold < Platinum < Diamond. Manually mapping these categories to integers (1 for Silver, 2 for Gold, etc.) preserves the hierarchy, allowing the model to utilize this ranking in its predictions.

# Verify the data transformation by getting the column data types 
column_data_types <- data.frame("Column Name" = colnames(df), 
                                "Data Type" = sapply(df, class)) 

# Adjust the row names to start from 1 
rownames(column_data_types) <- 1:nrow(column_data_types) 

# Display the result 
print(column_data_types)
##           Column.Name Data.Type
## 1          CustomerId   integer
## 2         CreditScore   integer
## 3              Gender   integer
## 4                 Age   integer
## 5              Tenure   integer
## 6             Balance   numeric
## 7       NumOfProducts   integer
## 8           HasCrCard   integer
## 9      IsActiveMember   integer
## 10    EstimatedSalary   numeric
## 11             Exited   integer
## 12           Complain   integer
## 13 Satisfaction.Score   integer
## 14          Card.Type   numeric
## 15       Point.Earned   integer
## 16    GeographyFrance   numeric
## 17   GeographyGermany   numeric
## 18     GeographySpain   numeric

The above results show that the Gender column has been successfully converted to integer values, with 0 representing Female and 1 representing Male. The Geography column has been replaced by three new columns — GeographyFrance, GeographyGermany, and GeographySpain — each containing an indicator value (1 if the customer originates from that country, 0 otherwise). Lastly, the Card Type column has been transformed into numeric values, with 1, 2, 3, and 4 corresponding to Silver, Gold, Platinum, and Diamond, respectively, in ascending order of rank.

4.5 Scaling and Normalization

For this part, all columns with integer or numeric (float) data type need to be checked to determine whether scaling is required. A detailed examination of each column's characteristics helps in making this determination.

# Get columns of integer and numeric (float) data type only 
int_and_float_column_df <- df[sapply(df, function(x) is.integer(x) | is.numeric(x))] 

# Check the value range for the first few rows 
head(int_and_float_column_df)

Several columns do not need scaling, namely those representing categorical, binary, or ordinal features. These include identifiers (e.g., Customer ID), binary variables (e.g., Gender, credit card ownership, membership status, complaint status, and the target variable Exited), and ordinal features that have been appropriately encoded as integers (e.g., Card Type). Additionally, discrete integer variables with a limited range, such as the Satisfaction Score, are not impacted by scale differences and therefore do not require normalization. These features represent distinct categories that do not benefit from being on the same scale as continuous numeric features.

# Select the columns of interest 
columns_to_scale <- c('CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'NumOfProducts', 'Point.Earned') 

# Get summary statistics for these columns 
summary_stats <- summary(df[, columns_to_scale]) 

# Extract min, max, mean, and std (standard deviation) from the summary statistics 
summary_stats <- data.frame( 
  Min  = summary_stats[1, ],  # row 1 of the summary() output is the minimum 
  Max  = summary_stats[6, ],  # row 6 is the maximum 
  Mean = summary_stats[4, ],  # row 4 is the mean (row 3 would be the median) 
  Std  = apply(df[, columns_to_scale], 2, sd) 
) 

# Display the result 
print(summary_stats)
##                                 Min                 Max                Mean
##  CreditScore        Min.   :350.0       Max.   :850.0       Mean   :650.5  
##      Age            Min.   :18.00       Max.   :92.00       Mean   :38.92  
##     Tenure         Min.   : 0.000      Max.   :10.000      Mean   : 5.013  
##    Balance         Min.   :     0      Max.   :250898      Mean   : 76486  
## EstimatedSalary Min.   :    11.58   Max.   :199992.48   Mean   :100090.24  
## NumOfProducts        Min.   :1.00        Max.   :4.00        Mean   :1.53   
##  Point.Earned      Min.   : 119.0      Max.   :1000.0      Mean   : 606.5  
##                          Std
##  CreditScore    9.665330e+01
##      Age        1.048781e+01
##     Tenure      2.892174e+00
##    Balance      6.239741e+04
## EstimatedSalary 5.751049e+04
## NumOfProducts   5.816544e-01
##  Point.Earned   2.259248e+02

In contrast, numeric columns with continuous values—such as Credit Score, Age, Tenure, Balance, Estimated Salary, and Point Earned—require scaling due to the wide range of their values and the variability in their distributions. Features with large numerical ranges, such as Balance and Estimated Salary, are particularly susceptible to dominating the model if not scaled appropriately. For features like Age, Tenure, and Credit Score, scaling helps to standardize the influence of each feature on the model, ensuring that no single feature disproportionately affects the predictions.
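Scaling itself is not shown being applied in this document; as a minimal sketch of what standardization of these columns could look like with the already-loaded caret package (the object names pp and df_scaled are our own), one option is:

# Minimal standardization sketch using caret (an assumed approach, not run above):
# center and scale each selected column to mean 0 and standard deviation 1
pp <- preProcess(df[, columns_to_scale], method = c("center", "scale"))
df_scaled <- df
df_scaled[, columns_to_scale] <- predict(pp, df[, columns_to_scale])

# Verify: means ~ 0 and standard deviations ~ 1 after scaling
round(colMeans(df_scaled[, columns_to_scale]), 3)
round(apply(df_scaled[, columns_to_scale], 2, sd), 3)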

4.6 Outlier Removal

Before applying scaling or normalization techniques to the selected columns, it is essential to address the potential outliers identified in certain columns, such as Balance, Estimated Salary, and Point Earned, due to their wide ranges and high standard deviations. Outliers can significantly distort the results of scaling methods like Min-Max scaling and Standardization, which are sensitive to extreme values. If not handled, outliers can skew the mean and range of the data, which can in turn distort the scaled values and degrade the performance of machine learning models. For instance, in the case of Min-Max scaling, extreme values can stretch the range, compressing the majority of data points into a narrower band, thus affecting the overall distribution and model accuracy.

# Define a vector to store outlier counts for each column 
outlier_counts <- list() 

# Loop through each column of interest 
for (column in columns_to_scale) {
  
  # Calculate Q1, Q3, and IQR 
  Q1 <- quantile(df[[column]], 0.25) 
  Q3 <- quantile(df[[column]], 0.75) 
  IQR <- Q3 - Q1 
  
  # Calculate the lower and upper bounds for outliers 
  lower_bound <- Q1 - 1.5 * IQR 
  upper_bound <- Q3 + 1.5 * IQR 
  
  # Identify outliers 
  outliers <- df[df[[column]] < lower_bound | df[[column]] > upper_bound, ] 
  # Store the count of outliers in the list 
  outlier_counts[[column]] <- nrow(outliers) 
} 

# Display the count of outliers in each column 
print(outlier_counts)
## $CreditScore
## [1] 15
## 
## $Age
## [1] 359
## 
## $Tenure
## [1] 0
## 
## $Balance
## [1] 0
## 
## $EstimatedSalary
## [1] 0
## 
## $NumOfProducts
## [1] 60
## 
## $Point.Earned
## [1] 0

To detect outliers in the identified columns, the Interquartile Range (IQR) method was employed. The IQR is particularly useful for continuous numerical data, such as Credit Score, Age, Tenure, Balance, and Estimated Salary, which are prone to outliers.

Steps to apply the IQR analysis:

  1. Calculate the IQR for each column: the IQR is determined by subtracting the 25th percentile (Q1) from the 75th percentile (Q3) of the data.

  2. Define outlier thresholds: a common rule is to flag any value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as an outlier.

  3. Apply the thresholds to each column: values beyond these thresholds are considered outliers.

From the outlier detection process, only three columns—Credit Score, Age, and Number of Products—were identified as containing outliers. To address these outliers:

# Apply Winsorizing method for the 'NumOfProducts' column 
lower_bound <- quantile(df$NumOfProducts, 0.05) 
upper_bound <- quantile(df$NumOfProducts, 0.95) 

# Clip values within the bounds 
df$NumOfProducts <- pmax(pmin(df$NumOfProducts, upper_bound), lower_bound) 

# Display the modified column 
head(df$NumOfProducts)
## [1] 1 1 2 2 1 2

Number of Products: Given the presence of multiple outliers, the capping (winsorizing) method was applied. This involved capping extreme values to the 5th and 95th percentiles, which helps maintain a reasonable data range while preserving the overall distribution.

# Calculate z-scores for CreditScore 
df$CreditScore_Z <- (df$CreditScore - mean(df$CreditScore)) / sd(df$CreditScore) 

# Define the threshold 
threshold <- 3 

# Filter rows where the absolute value of CreditScore_Z is less than or equal to the threshold 
df_filtered <- df[abs(df$CreditScore_Z) <= threshold, ] 

# Drop the temporary CreditScore_Z column 
df_filtered$CreditScore_Z <- NULL 

# Check the dimensions of the filtered data frame 
dim(df_filtered)
## [1] 9992   18

Credit Score: For this column, which contained only a few outliers, the Z-score method was used to identify and remove extreme values. Any data points whose Z-scores exceeded 3 standard deviations from the mean in absolute value were flagged as outliers and removed. Following this process, the dataset size decreased from 10,000 to 9,992, indicating that 8 samples were removed. This method preserved more data than the IQR method, which had flagged 15 samples as outliers.

# Calculate the correlation between 'Age' and 'Exited' 
correlation_age_exited <- cor(df_filtered$Age, df_filtered$Exited) 

# Print the result 
cat("Correlation between Age and Exited:", correlation_age_exited, "\n")
## Correlation between Age and Exited: 0.2842997

Age: This column contained a significant number of outliers, requiring a more cautious approach. Rather than simply removing these extreme values, capping was applied: since the outliers may represent valid data points, capping adjusts them to lie within a more reasonable range. To assess the impact of these outliers on the target variable, Exited, a correlation analysis was conducted. The correlation between Age and Exited was found to be weak (approximately 0.28), suggesting that outliers in the Age column would not have a substantial impact on the model's predictions. Capping was preferred over imputation because it maintains the natural distribution by simply limiting extreme values to a threshold, without introducing artificial values that could bias the data. To determine the appropriate thresholds for capping, it is essential to assess whether the outliers in the Age column disproportionately affect the prediction of a particular class.

# Define the IQR bounds for the Age column to identify outliers 
Q1 <- quantile(df_filtered$Age, 0.25) 
Q3 <- quantile(df_filtered$Age, 0.75) 
IQR <- Q3 - Q1 
lower_bound <- Q1 - 1.5 * IQR 
upper_bound <- Q3 + 1.5 * IQR 

# Filter rows where Age is considered an outlier 
age_outliers <- df_filtered[df_filtered$Age < lower_bound | df_filtered$Age > upper_bound, ] 

# Calculate the proportion of Exited vs Non-Exited among Age outliers 
age_outliers_exited_counts <- table(age_outliers$Exited) 
total_outliers <- nrow(age_outliers) 
exited_proportion_in_outliers <- age_outliers_exited_counts / total_outliers 

# Print the proportion of 'Exited' within Age outliers 
cat("\nProportion of 'Exited' within Age outliers:")
## 
## Proportion of 'Exited' within Age outliers:
print(exited_proportion_in_outliers)
## 
##         0         1 
## 0.7966574 0.2033426
# Get the count of each class in the Exited column for comparison 
class_counts <- table(df_filtered$Exited) 
total_number_of_samples <- nrow(df_filtered) 
exited_proportion_overall <- class_counts / total_number_of_samples 

# Print the overall proportion of 'Exited' 
cat("\nOverall Proportion of 'Exited':")
## 
## Overall Proportion of 'Exited':
print(exited_proportion_overall)
## 
##         0         1 
## 0.7968375 0.2031625

Upon comparing the proportions of the Exited class between the Age outliers and the overall dataset, it was found that both subsets exhibit a similar 80/20 split between the ‘Exited’ and ‘Non-Exited’ classes. This consistent distribution indicates that there is no need to adjust the capping thresholds to preserve a higher proportion of samples from specific age ranges, which could disproportionately influence the prediction of a particular class. As a result, the standard capping approach, utilizing the 1st and 99th percentiles, is deemed appropriate without further adjustments.

# Define the lower and upper percentiles for capping 
lower_percentile <- 1 
upper_percentile <- 99 

# Calculate the lower and upper bounds using the percentiles 
lower_bound <- quantile(df_filtered$Age, lower_percentile /100) 
upper_bound <- quantile(df_filtered$Age, upper_percentile /100) 

# Apply capping by replacing values below lower bound with the lower bound & values above upper bound with the upper bound 
df_filtered$Age <- pmin(pmax(df_filtered$Age, lower_bound), upper_bound) 

# Print summary statistics for the 'Age' column after capping 
print(summary(df_filtered$Age)) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   32.00   37.00   38.89   44.00   72.00
# Verify that the proportion of Exited vs Non-Exited remains at ~ 80/20 split after capping 
class_counts <- table(df_filtered$Exited) 
total_number_of_samples <- nrow(df_filtered) 
exited_proportion_overall <- class_counts / total_number_of_samples


# Print the overall proportion of 'Exited' 
cat("\nOverall Proportion of 'Exited':")
## 
## Overall Proportion of 'Exited':
print(exited_proportion_overall)
## 
##         0         1 
## 0.7968375 0.2031625

The capping process has successfully adjusted the extreme values in the Age column, ensuring that they now lie within the 1st and 99th percentiles. The minimum age has been capped at 21, and the maximum age at 72. Consequently, the mean age (38.89) is now more representative, reflecting that the extreme values had less influence on these statistics. Although the maximum age shifted significantly from 92 to 72, the overall distribution of the data remains consistent, with an 80/20 split between the ‘Exited’ and ‘Non-Exited’ classes. This suggests that the capping process has not adversely affected the balance of the dataset. The approach preserves the integrity of the data, minimizes the impact of extreme age values, and ensures a more realistic range of values, making it well-suited for downstream analysis and machine learning modeling.

4.7 Feature Selection

With outlier treatment complete, the next step involves feature selection through correlation analysis. Examining feature correlations helps identify highly correlated pairs, which can introduce multicollinearity in certain machine learning models (e.g., linear regression). Highly correlated features may need to be removed or transformed to reduce redundancy, thereby improving model interpretability and performance.

# Calculate the correlation matrix for the numerical features 
correlation_matrix <- cor(df_filtered[sapply(df_filtered, is.numeric)]) 

# Set a large canvas size (e.g., 12x12 inches for a more spacious plot) 
png("correlation_matrix_plot.png", width = 1200, height = 1200) # Set width and height in pixels 

# Visualize the correlation matrix using a heatmap 
corrplot(correlation_matrix, method = "color", type = "upper", 
         col = colorRampPalette(c("blue", "white", "red"))(200), 
         tl.cex = 0.8, tl.col = "black", addCoef.col = "black", 
         title = "Correlation Matrix") 

# Save the plot to file and close the device 
dev.off()
## png 
##   2
cat("Plot saved as 'correlation_matrix_plot.png' in the working directory.")
## Plot saved as 'correlation_matrix_plot.png' in the working directory.

A correlation matrix is used to identify highly correlated feature pairs, and it is observed that the pair (‘Exited’, ‘Complain’) exhibits a strong correlation.

# Identify highly correlated features (e.g. with a correlation coefficient > 0.8 or -0.8) 
threshold <- 0.8 
highly_correlated <- which(abs(correlation_matrix) > threshold, arr.ind = TRUE) 

# Extract the indices of highly correlated features (excluding the diagonal) 
highly_correlated_pairs <- vector("list", length = nrow(highly_correlated)) 
for (i in 1:nrow(highly_correlated)) {
  x <- highly_correlated[i, 1] 
  y <- highly_correlated[i, 2] 
  
  # Avoid duplicates and the diagonal (x != y) 
  if (x !=y && x < y) { 
    highly_correlated_pairs[[i]] <- c(colnames(correlation_matrix)[x], colnames(correlation_matrix)[y]) 
  } 
} 

# Display highly correlated feature pairs 
cat("\nHighly Correlated Feature Pairs (correlation > 0.8 or < - 0.8):\n") 
## 
## Highly Correlated Feature Pairs (correlation > 0.8 or < - 0.8):
for (pair in highly_correlated_pairs) { 
  if (!is.null(pair)) { 
    print(pair) 
  } 
}
## [1] "Exited"   "Complain"

The correlation analysis indicates that the feature pair (‘Exited’, ‘Complain’) shows a strong correlation, with a coefficient greater than 0.8 or less than -0.8. Typically, features with such high correlations (above 0.8 or below -0.8) are considered for removal or transformation to address multicollinearity. However, in this case, multicollinearity is less of a concern, as Exited is the target variable, not another predictor. Therefore, retaining both features is justified. The strong correlation between Exited and Complain suggests that customer complaints may serve as a significant indicator for predicting customer churn. In the case of algorithms like decision trees, Complain is likely to emerge as a key feature in the early stages of the model.

It is crucial, however, to be cautious of overfitting, as a feature this highly correlated with the target can effectively leak the label. To mitigate this, regularization techniques such as L1 or L2 regularization can be applied, allowing retention of critical features like Complain while constraining their influence; an illustrative sketch follows. In this project's pipeline, the Complain column is ultimately dropped before model training (section 4.9) precisely because it is a near-duplicate of the target.
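As an illustrative sketch only (the glmnet package is not used elsewhere in this document, and the object names x_mat, y_vec, and cv_fit are our own), an L1-regularized logistic regression on the cleaned data could look like this:

# Hypothetical L1 (lasso) regularization sketch with glmnet (assumed available)
library(glmnet)

x_mat <- as.matrix(df_filtered[, setdiff(colnames(df_filtered), c("Exited", "CustomerId"))])
y_vec <- df_filtered$Exited

# alpha = 1 selects the L1 (lasso) penalty; alpha = 0 would give L2 (ridge)
cv_fit <- cv.glmnet(x_mat, y_vec, family = "binomial", alpha = 1)

# Coefficients at the cross-validated lambda; weakly informative features
# are shrunk toward (or exactly to) zero
coef(cv_fit, s = "lambda.min")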

4.8 Data Imbalance Analysis

After completing feature selection, the final step is to assess data imbalance. The analysis reveals that the class distribution for the target variable, Exited, is imbalanced, with approximately 79.68% of samples in the non-exited class (0) and only 20.32% in the exited class (1).

# Calculate the proportions of each class for the target variable 'Exited' 
class_proportions <- prop.table(table(df_filtered$Exited)) * 100

# Print the proportions 
print(class_proportions)
## 
##        0        1 
## 79.68375 20.31625

4.9 Implementation of Resampling Technique

This class imbalance could adversely affect model performance, as certain models may be biased towards the majority class (non-exited) and neglect the minority class (exited), leading to skewed predictions. To address this, two resampling techniques are considered:

  • Oversampling the minority class (Exited = 1) using methods such as SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples, thereby increasing representation for the exited class.

  • Undersampling the majority class (Exited = 0) to reduce the number of non-exited samples, balancing the dataset by shrinking the majority class.

Oversampling is preferred over undersampling primarily because the dataset is of moderate size, not large. While undersampling can be useful for very large datasets by reducing computational load, it may result in a loss of valuable information when the dataset is smaller, which could negatively affect model performance and generalization.

By utilizing oversampling (e.g., with SMOTE), the original majority class samples are preserved, and the minority class representation is enhanced through synthetic sample generation. This approach minimizes data loss and ensures the model has comprehensive information, supporting balanced training.
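For contrast, the undersampling alternative that was considered but not adopted can be sketched in a single call with caret's downSample() (illustration only; it is not part of the pipeline below):

# Hypothetical undersampling sketch with caret::downSample (not used in this project):
# majority-class rows are discarded at random until both classes are equal in size
down_train <- downSample(x = df_filtered[, setdiff(colnames(df_filtered), "Exited")],
                         y = factor(df_filtered$Exited),
                         yname = "Exited")
table(down_train$Exited)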

library(caret)         # For train-test split
library(DMwR2)         # Data mining utilities (the SMOTE() used below comes from smotefamily)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(dplyr)         # For data manipulation
library(smotefamily)   # For SMOTE

set.seed(42)

# Split data into training and testing sets
train_indices <- createDataPartition(df_filtered$Exited, p = 0.8, list = FALSE)
train_data <- df_filtered[train_indices, ]
test_data <- df_filtered[-train_indices, ]

# Prepare the training data
X_train <- train_data %>% select(-Exited)  # Features for training
y_train <- train_data$Exited               # Target for training

X_test <- test_data %>% select(-Exited)    # Features for testing
y_test <- test_data$Exited                 # Target for testing

# Check input data dimensions
cat("Dimensions of X_train:", dim(X_train), "\n")
## Dimensions of X_train: 7994 17
cat("Length of y_train:", length(y_train), "\n")
## Length of y_train: 7994
cat("Target class distribution in y_train before SMOTE:\n")
## Target class distribution in y_train before SMOTE:
print(table(y_train))
## y_train
##    0    1 
## 6388 1606
# Ensure target variable is binary (0 and 1)
#y_train <- ifelse(y_train == 1, 1, 0)  # Ensure binary encoding
#y_test <- ifelse(y_test == 1, 1, 0)    # Also ensure binary encoding for testing
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)

# Combine features and target into a single training data frame for SMOTE
train_data_combined <- data.frame(X_train, Class = y_train)

# Convert any character or factor columns into numeric before applying SMOTE
train_data_combined <- train_data_combined %>%
  mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
  mutate(across(where(is.factor), as.numeric))

# Apply SMOTE to balance the dataset; dup_size = 1 adds one synthetic sample per
# minority observation. Note the Class column travels through SMOTE as part of the
# input frame and is cleaned up below, together with the 'class' column SMOTE appends.
smote_result <- SMOTE(train_data_combined, target = train_data_combined$Class, K = 5, dup_size = 1)

# Extract the balanced dataset
balanced_data <- smote_result$data

# Remove the 'class' column introduced by SMOTE
balanced_data <- balanced_data %>%
  select(-class)

# Remove the identifier column, which does not contribute to model training
balanced_data <- balanced_data %>%
  select(-CustomerId)
X_test <- X_test %>%
  select(-CustomerId)

# Remove Complain, which is almost perfectly correlated with the target (section 4.7)
balanced_data <- balanced_data %>% select(-Complain)
X_test <- X_test %>% select(-Complain)

# Convert Class back to binary values: the factor-to-numeric conversion above
# turned the levels into 1/2, so map 2 back to 1 (exited) and 1 back to 0
balanced_data <- balanced_data %>%
  mutate(Class = ifelse(Class == 2, 1, 0))


# Separate features and target from the balanced dataset
X_train_smote <- balanced_data[, -ncol(balanced_data)]  # All columns except the last
y_train_smote <- balanced_data$Class                    # The last column (Class)
y_train_smote <- ifelse(y_train_smote == 1, 1, 0)

# Verify the class distribution after SMOTE
cat("Class distribution after SMOTE:\n")
## Class distribution after SMOTE:
print(table(y_train_smote))
## y_train_smote
##    0    1 
## 6388 3212
# Verify dimensions of the final datasets
cat("Dimensions of X_train_smote:", dim(X_train_smote), "\n")
## Dimensions of X_train_smote: 9600 15
cat("Length of y_train_smote:", length(y_train_smote), "\n")
## Length of y_train_smote: 9600
cat("Dimensions of X_test:", dim(X_test), "\n")
## Dimensions of X_test: 1998 15
cat("Length of y_test:", length(y_test), "\n")
## Length of y_test: 1998

The output following the application of SMOTE shows the distribution of the target variable as follows:

  • Class 0 (non-exited): 6,388 samples (unchanged after SMOTE).

  • Class 1 (exited): 3,212 samples (increased due to oversampling).

SMOTE has successfully balanced the dataset by adding synthetic samples for Class 1. This balanced distribution in the training data should enhance the model’s ability to detect and accurately classify instances of the minority class (“Exited” = 1), providing a richer and more representative dataset for training.

4.10 Export Processed Data as CSV

# Combine the training features and target variable 
train_df <- cbind(X_train_smote, Exited = y_train_smote)

# Combine the testing features and target variable 
test_df <- cbind(X_test, Exited = y_test) 

# Save the training and testing data to CSV files 
write.csv(train_df, "bank_customer_churn_train_data_processed.csv", row.names = FALSE) 
write.csv(test_df, "bank_customer_churn_test_data_processed.csv", row.names = FALSE)

Resampling is applied only to the training data, as the test set must reflect the real-world class distribution. Altering the balance of the test set would lead to overly optimistic performance metrics and fail to represent the model’s ability to generalize to imbalanced, real-world data. Therefore, the final output of the data cleaning and preprocessing steps consists of two CSV files: one for the training dataset and another for the testing dataset.

5 Question 1

5.1 Which variables exhibit the strongest positive or negative correlations among the variables?

The correlation heatmap offers valuable insights into the interrelationships between the variables in the dataset. The observations below highlight the significant positive and negative correlations. The heatmap uses Pearson correlation coefficients, ranging from −1 to 1 (the standard formula is reproduced after this list), where:

  • -1 indicates perfect negative correlation (one variable increases whereas the other variable decreases)

  • 0 indicates no correlation (variables are independent)

  • 1 indicates perfect positive correlation (both variables increase together)
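As supplementary background (a standard definition, not taken from the project text), the Pearson coefficient between two variables x and y is:

r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}

A value near ±1 therefore means the standardized deviations of the two variables move together (or in opposition) almost perfectly.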

5.1.1 Positive Correlations:

  1. Exited vs Complain (1.00): This is the strongest positive relationship in the heatmap, implying that customers who complain are far more likely to churn (exit). Complaints reflect dissatisfaction with the bank's services, products, or customer support, which translates into a higher probability of exiting; in this dataset, nearly every churned customer had lodged a complaint (see the cross-tabulation sketch after this list). This indicates that timely redress of customer grievances and improvement in service quality are crucial for retention, and that monitoring and effective complaint handling could be a key churn-prevention strategy.

  2. GeographyGermany vs Balance (0.40): This is a moderately strong positive correlation. Customers in Germany tend to hold higher balances with the bank, which may reflect regional income differences or a preference for keeping larger amounts in their accounts.

  3. Exited vs Age (0.29): This shows a moderate positive correlation between Age and Exited. Older customers demonstrate a somewhat higher tendency to exit the bank than younger individuals. This could be attributed to changing financial needs or dissatisfaction with banking products that no longer align with their phase of life; older clients may need more flexible retirement-planning options or personalized financial plans. The bank could develop age-appropriate financial products for this segment to increase retention among older clients.

  4. Balance vs Exited (0.12): This is a weak positive correlation, suggesting that customers holding higher balances are slightly more likely to churn. Although the correlation is weak, it may warrant further investigation so the bank can design tailored financial products and retention offers for high-balance customers.
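The complaint/churn overlap noted in item 1 can be verified directly with a cross-tabulation. This is a quick sketch; df_filtered is assumed to still contain both columns at this point, since Complain was dropped only from the modeling datasets in section 4.9.

# Cross-tabulate complaints against churn to inspect the near-perfect overlap
table(Complain = df_filtered$Complain, Exited = df_filtered$Exited)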

5.1.2 Negative Correlations:

  1. The three geography indicator pairs show negative correlations because a customer can belong to only one of the three countries (France, Germany, Spain); this mutual exclusivity necessarily produces negative correlations between the indicator columns:
  • GeographyFrance vs GeographyGermany (-0.58)

  • GeographyFrance vs GeographySpain (-0.58)

  • GeographyGermany vs GeographySpain (-0.33)

  2. Balance vs NumOfProducts (-0.36): This moderately strong negative correlation can be interpreted as customers with fewer products holding higher balances, whereas customers with more products may spread their money across different services. Customers with fewer products may prefer liquid or conservative holdings, while the others invest in different financial products such as loans, mutual funds, or shares. This suggests the bank should target low-product customers with cross-selling strategies to increase sales without affecting customer satisfaction.

  3. Exited vs NumOfProducts (-0.18): This weak negative correlation indicates that customers who own fewer products with the bank may engage less and eventually leave. Conversely, customers with more products show greater commitment to the bank, which reduces churn.

  4. NumOfProducts vs Complain (-0.18): This pair also shows a weak negative correlation, suggesting that lower engagement with the bank's products can be associated with dissatisfaction with its services. Conversely, customers who own more products are less likely to complain. The bank could use the complaints lodged to identify the underlying issues customers face.

  5. IsActiveMember vs Exited (-0.16): This weak negative correlation indicates that active members are less likely to leave the bank; customers who interact regularly with the bank tend to be more satisfied. Banks could build on this to improve retention through personalized offers, loyalty rewards, or automated alerts that keep accounts active.

# Load required packages
library(ggplot2)
library(corrplot)

# Calculate the correlation matrix for the numerical features
correlation_matrix <- cor(df_filtered[sapply(df_filtered, is.numeric)])
corrplot(
  correlation_matrix, 
  method = "color", 
  type = "upper", 
  col = colorRampPalette(c("blue", "white", "red"))(200), 
  tl.cex = 0.6,         
  tl.col = "black",     
  tl.srt = 45,          
  addCoef.col = "black",
  number.cex = 0.5,    
  mar = c(2, 2, 2, 2),  
  title = "Correlation Matrix",
  cl.cex = 0.7          
)

6 Question 2

6.1 Which machine learning model among the five classification algorithms performs best in predicting whether a customer will exit (Exited) based on the available features, and how do the models compare in terms of performance metrics?

Question 2 compares five classification algorithms: Logistic Regression, Decision Tree, Support Vector Machine (SVM), Random Forest, and XGBoost (Extreme Gradient Boosting). These models are trained to predict customer churn and evaluated on metrics such as accuracy, precision, recall (sensitivity), specificity, F1-score, and AUC (Area Under the ROC Curve). The ultimate goal is to identify the model that most accurately predicts whether a customer will churn while minimizing false positive and false negative predictions.

A confusion matrix is used to evaluate the performance of classification models. It compares the true labels (actual values) with the predicted labels produced by the model. For a binary classification problem like predicting customer churn, the confusion matrix is a 2x2 table:

  • True Positives (TP): Customers who churned (Exited = 1) and were correctly predicted as churned.

  • True Negatives (TN): Customers who did not churn (Exited = 0) and were correctly predicted as non-churned.

  • False Positives (FP): Customers who did not churn (Exited = 0) but were incorrectly predicted as churned.

  • False Negatives (FN): Customers who churned (Exited = 1) but were incorrectly predicted as non-churned.

Next, the metrics used to determine model performance are discussed in terms of definition and importance; a reference computation is sketched after the list.

  1. Accuracy: Accuracy is the proportion of correctly classified instances (both churned and non-churned customers) out of all instances. It provides an overall measure of correctness.

  2. Precision: Measures how many of the instances predicted as positive (churned) are actually positive. Precision is important when the cost of false positives (misclassifying non-churned customers as churned) is high.

  3. Recall (Sensitivity or True Positive Rate): Recall measures how many of the actual positive instances (churned customers) were correctly identified by the model. Sensitivity is critical in churn prediction because it ensures that most customers who are likely to leave are correctly identified. A high sensitivity means the model is effective at catching true positives (churn).

  4. Specificity: Specificity measures how many of the actual negative instances (non-churned customers) were correctly identified by the model. A high specificity indicates that the model rarely mislabels loyal customers as churners.

  5. F1-Score: The F1-score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall, especially in datasets with imbalanced classes. A high F1-score indicates a good balance between precision and recall.
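As a reference sketch (the churn_metrics helper below is our own and not part of the project pipeline), all five metrics can be computed directly from the confusion-matrix counts:

# Hypothetical helper computing the metrics above from raw confusion-matrix counts
churn_metrics <- function(TP, TN, FP, FN) {
  accuracy    <- (TP + TN) / (TP + TN + FP + FN)
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)          # sensitivity / true positive rate
  specificity <- TN / (TN + FP)
  f1          <- 2 * precision * recall / (precision + recall)
  round(c(Accuracy = accuracy, Precision = precision, Recall = recall,
          Specificity = specificity, F1 = f1), 4)
}

# Example using the logistic regression counts reported later in this section
churn_metrics(TP = 322, TN = 1070, FP = 504, FN = 102)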

To answer this question, several R packages are used to facilitate model training, evaluation, and visualization. The table below summarizes the packages used:

Package Name Description
caret Provides functions for data preprocessing, train-test splitting, and model evaluation (confusion matrix, metrics).
e1071 Supports SVM implementation and kernel-based learning methods.
rpart Enables creation of Decision Trees and visualization of tree-based models.
randomForest Implements Random Forest for classification and regression tasks.
xgboost Provides an optimized gradient boosting framework for high-performance classification models.
pROC Facilitates computation and visualization of ROC curves and AUC metrics.
ggplot2 Used for creating advanced visualizations, including ROC curves and performance comparisons.
dplyr Simplifies data manipulation tasks such as filtering, mutating, and summarizing datasets.

6.1.1 Modelling & Evaluation

library(e1071)          
library(rpart)          
library(randomForest)   
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(caret)          
library(dplyr)          
library(pROC)          
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(smotefamily)    
library(xgboost)        
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(DT)


# -------------------------------------------------------------------
#                  Prepare Train and Test Datasets
# -------------------------------------------------------------------
train_df <- cbind(X_train_smote, Exited = y_train_smote)
test_df <- cbind(X_test, Exited = y_test)

# Ensure Test Dataset Matches Train Columns
missing_cols <- setdiff(colnames(X_train_smote), colnames(X_test))
for (col in missing_cols) {
  X_test[[col]] <- 0
}
X_test <- X_test[, colnames(X_train_smote)]

# Convert Non-Numeric Columns to Numeric in X_test
X_test <- X_test %>%
  mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
  mutate(across(where(is.factor), as.numeric))

# Ensure y_test is a Factor
y_test <- as.factor(y_test)
y_test <- factor(y_test, levels = c(0, 1))

# -------------------------------------------------------------------
#                       Define Helper Functions
# -------------------------------------------------------------------

# Function to Calculate F1-Score
calculate_f1 <- function(conf_matrix) {
  precision <- conf_matrix$byClass["Pos Pred Value"]
  recall <- conf_matrix$byClass["Sensitivity"]
  f1 <- 2 * ((precision * recall) / (precision + recall))
  return(round(f1, 4))
}
f1_scores <- data.frame(
  Model = character(),
  F1_Score = numeric(),
  stringsAsFactors = FALSE
)
# -------------------------------------------------------------------
#                   Logistic Regression Model
# -------------------------------------------------------------------
log_model <- glm(
  y_train_smote ~ .,
  data = data.frame(X_train_smote, y_train_smote),
  family = binomial(link = "logit")
)
log_predictions <- predict(log_model, X_test, type = "response")
log_class <- ifelse(log_predictions > 0.3, 1, 0)
log_cm <- confusionMatrix(as.factor(log_class), y_test,  positive = "1")
log_f1 <- calculate_f1(log_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Logistic Regression", F1_Score = log_f1))
log_cm_output <- capture.output(print(log_cm))

cat(paste0(
  "Confusion Matrix for Logistic Regression:\n",
  "======================================\n",
  paste(log_cm_output, collapse = "\n"),
  "\n"
))
## Confusion Matrix for Logistic Regression:
## ======================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1070  102
##          1  504  322
##                                          
##                Accuracy : 0.6967         
##                  95% CI : (0.676, 0.7168)
##     No Information Rate : 0.7878         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3262         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.7594         
##             Specificity : 0.6798         
##          Pos Pred Value : 0.3898         
##          Neg Pred Value : 0.9130         
##              Prevalence : 0.2122         
##          Detection Rate : 0.1612         
##    Detection Prevalence : 0.4134         
##       Balanced Accuracy : 0.7196         
##                                          
##        'Positive' Class : 1              
## 
# -------------------------------------------------------------------
#                      Decision Tree Model
# -------------------------------------------------------------------
tree_model <- rpart(
  y_train_smote ~ .,
  data = data.frame(X_train_smote, y_train_smote),
  method = "class",
  parms = list(split = "gini"),  # Try both "gini" and "information"
  control = rpart.control(
    minsplit = 20,  # Minimum number of samples required to split
    cp = 0.01,      # Complexity parameter (smaller values increase tree depth)
    maxdepth = 10   # Limit tree depth to prevent overfitting
  )
)
tree_probs <- predict(tree_model, X_test, type = "prob")
tree_predictions <- ifelse(tree_probs[, 2] >= 0.3, 1, 0)  # Lower threshold to favor positive class
tree_predictions <- factor(tree_predictions, levels = c(0, 1))

#tree_predictions <- predict(tree_model, X_test, type = "class")
tree_cm <- confusionMatrix(tree_predictions, y_test,  positive = "1")
tree_cm_output <- capture.output(print(tree_cm))
tree_f1 <- calculate_f1(tree_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Decision Tree", F1_Score = tree_f1))
cat(paste0(
  "Confusion Matrix for Decision Tree:\n",
  "======================================\n",
  paste(tree_cm_output, collapse = "\n"),
  "\n"
))
## Confusion Matrix for Decision Tree:
## ======================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1288  176
##          1  286  248
##                                           
##                Accuracy : 0.7688          
##                  95% CI : (0.7496, 0.7871)
##     No Information Rate : 0.7878          
##     P-Value [Acc > NIR] : 0.9817          
##                                           
##                   Kappa : 0.3683          
##                                           
##  Mcnemar's Test P-Value : 3.954e-07       
##                                           
##             Sensitivity : 0.5849          
##             Specificity : 0.8183          
##          Pos Pred Value : 0.4644          
##          Neg Pred Value : 0.8798          
##              Prevalence : 0.2122          
##          Detection Rate : 0.1241          
##    Detection Prevalence : 0.2673          
##       Balanced Accuracy : 0.7016          
##                                           
##        'Positive' Class : 1               
## 
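Before moving on, the fitted tree itself is worth inspecting, since its splits are directly interpretable. A one-line sketch, assuming the rpart.plot package is installed:

library(rpart.plot)
rpart.plot(tree_model)  # each node shows the predicted class, churn probability, and share of samples
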
# -------------------------------------------------------------------
#                Support Vector Machine (SVM) Model
# -------------------------------------------------------------------
y_train_smote <- factor(y_train_smote, levels = c(0, 1))
svm_model <- svm(
  y_train_smote ~ .,
  data = data.frame(X_train_smote, y_train_smote),
  kernel = "radial",
  cost = 10,
  gamma = 0.1,
  class.weights = c("0" = 1, "1" = 10),  # penalize missed churners 10x to offset class imbalance
  probability = TRUE
)

svm_probs <- predict(svm_model, X_test, probability = TRUE)
svm_probabilities <- attr(svm_probs, "probabilities")[, "1"]  
svm_predictions <- ifelse(svm_probabilities > 0.3, 1, 0)

svm_cm <- confusionMatrix(as.factor(svm_predictions), y_test, positive = "1")
svm_cm_output <- capture.output(print(svm_cm))
svm_f1 <- calculate_f1(svm_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Support Vector Machine (SVM)", F1_Score = svm_f1))
cat(paste0(
  "Confusion Matrix for SVM:\n",
  "======================================\n",
  paste(svm_cm_output, collapse = "\n"),
  "\n"
))
## Confusion Matrix for SVM:
## ======================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1251  200
##          1  323  224
##                                           
##                Accuracy : 0.7382          
##                  95% CI : (0.7184, 0.7574)
##     No Information Rate : 0.7878          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2921          
##                                           
##  Mcnemar's Test P-Value : 9.571e-08       
##                                           
##             Sensitivity : 0.5283          
##             Specificity : 0.7948          
##          Pos Pred Value : 0.4095          
##          Neg Pred Value : 0.8622          
##              Prevalence : 0.2122          
##          Detection Rate : 0.1121          
##    Detection Prevalence : 0.2738          
##       Balanced Accuracy : 0.6615          
##                                           
##        'Positive' Class : 1               
## 
# -------------------------------------------------------------------
#                     Random Forest Model
# -------------------------------------------------------------------
y_train_smote <- factor(y_train_smote, levels = c(0, 1))
rf_model <- randomForest(
  y_train_smote ~ .,
  data = data.frame(X_train_smote, y_train_smote),
  ntree = 300,
  mtry = floor(sqrt(ncol(X_train_smote)))  # square-root heuristic for features tried per split
)
rf_probs <- predict(rf_model, X_test, type = "prob")
rf_predictions <- ifelse(rf_probs[, 2] >= 0.3, 1, 0)
rf_predictions <- factor(rf_predictions, levels = c(0, 1))
y_test <- factor(y_test, levels = c(0, 1))
rf_cm <- confusionMatrix(rf_predictions, y_test,  positive = "1")

rf_f1 <- calculate_f1(rf_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Random Forest", F1_Score = rf_f1))
rf_cm_output <- capture.output(print(rf_cm))

cat(paste0(
  "Confusion Matrix for Random Forest:\n",
  "======================================\n",
  paste(rf_cm_output, collapse = "\n"),
  "\n"
))
## Confusion Matrix for Random Forest:
## ======================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1351  166
##          1  223  258
##                                           
##                Accuracy : 0.8053          
##                  95% CI : (0.7872, 0.8225)
##     No Information Rate : 0.7878          
##     P-Value [Acc > NIR] : 0.028599        
##                                           
##                   Kappa : 0.445           
##                                           
##  Mcnemar's Test P-Value : 0.004521        
##                                           
##             Sensitivity : 0.6085          
##             Specificity : 0.8583          
##          Pos Pred Value : 0.5364          
##          Neg Pred Value : 0.8906          
##              Prevalence : 0.2122          
##          Detection Rate : 0.1291          
##    Detection Prevalence : 0.2407          
##       Balanced Accuracy : 0.7334          
##                                           
##        'Positive' Class : 1               
## 
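Since Random Forest later emerges as the preferred model, its impurity-based variable importance is also worth recording, as it indicates which features drive the churn predictions. A minimal sketch using randomForest's built-in plot:

varImpPlot(rf_model, main = "Random Forest Variable Importance")  # mean decrease in Gini per feature
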
# -------------------------------------------------------------------
#                        XGBoost Model
# -------------------------------------------------------------------
# Convert the SMOTE-balanced training data to the numeric matrix format xgboost expects
X_train_matrix <- as.matrix(X_train_smote)
X_test_matrix <- as.matrix(X_test)
y_train_numeric <- as.numeric(as.character(y_train_smote))
y_test_numeric <- as.numeric(as.character(y_test))

xgb_model <- xgboost(
  data = X_train_matrix,
  label = y_train_numeric,
  objective = "binary:logistic",
  nrounds = 200,
  eval_metric = "logloss",
  verbose = 0
)
xgb_predictions <- predict(xgb_model, X_test_matrix)
xgb_class <- ifelse(xgb_predictions > 0.3, 1, 0)
xgb_cm <- confusionMatrix(as.factor(xgb_class), y_test,  positive = "1")
xgb_f1 <- calculate_f1(xgb_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "XGBoost", F1_Score = xgb_f1))
xgb_cm_output <- capture.output(print(xgb_cm))

cat(paste0(
  "Confusion Matrix for XGBoost:\n",
  "======================================\n",
  paste(xgb_cm_output, collapse = "\n"),
  "\n"
))
## Confusion Matrix for XGBoost:
## ======================================
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1405  204
##          1  169  220
##                                           
##                Accuracy : 0.8133          
##                  95% CI : (0.7955, 0.8302)
##     No Information Rate : 0.7878          
##     P-Value [Acc > NIR] : 0.002553        
##                                           
##                   Kappa : 0.4243          
##                                           
##  Mcnemar's Test P-Value : 0.078331        
##                                           
##             Sensitivity : 0.5189          
##             Specificity : 0.8926          
##          Pos Pred Value : 0.5656          
##          Neg Pred Value : 0.8732          
##              Prevalence : 0.2122          
##          Detection Rate : 0.1101          
##    Detection Prevalence : 0.1947          
##       Balanced Accuracy : 0.7057          
##                                           
##        'Positive' Class : 1               
## 
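The fitted booster can likewise be interrogated for feature importance, which speaks to the feature-selection objective of this study. A short sketch using xgboost's gain-based importance (assuming the column names of X_train_matrix carry over from the training step):

importance <- xgb.importance(feature_names = colnames(X_train_matrix), model = xgb_model)
head(importance)  # features ranked by Gain, the average loss reduction from splits on each feature
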
library(caret)
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
calculate_metrics <- function(conf_matrix) {
  precision <- conf_matrix$byClass["Pos Pred Value"]    # Precision/PPV
  recall <- conf_matrix$byClass["Sensitivity"]          # Sensitivity/Recall
  specificity <- conf_matrix$byClass["Specificity"]     # Specificity
  npv <- conf_matrix$byClass["Neg Pred Value"]          # NPV
  balanced_acc <- (recall + specificity) / 2            # Balanced Accuracy
  accuracy <- conf_matrix$overall["Accuracy"]           # Accuracy
  f1 <- 2 * ((precision * recall) / (precision + recall)) # F1-Score
  return(data.frame(
    Accuracy = round(accuracy, 4),
    Sensitivity = round(recall, 4),
    Specificity = round(specificity, 4),
    Precision = round(precision, 4),
    NPV = round(npv, 4),
    Balanced_Accuracy = round(balanced_acc, 4),
    F1_Score = round(f1, 4)
  ))
}
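As a quick verification, the helper can be applied to any confusion matrix above; for instance, the Random Forest F1-score in the table below follows directly from its precision and recall via F1 = 2PR / (P + R):

calculate_metrics(rf_cm)
# e.g. F1 = 2 * (0.5364 * 0.6085) / (0.5364 + 0.6085) = 0.5702 for Random Forest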

all_metrics <- data.frame(
  Model = character(),
  Accuracy = numeric(),
  Sensitivity = numeric(),
  Specificity = numeric(),
  Precision = numeric(),
  NPV = numeric(),
  Balanced_Accuracy = numeric(),
  F1_Score = numeric(),
  stringsAsFactors = FALSE
)

# Logistic Regression
if (exists("log_cm")) {
  log_metrics <- calculate_metrics(log_cm)
  log_metrics$Model <- "Logistic Regression"
  all_metrics <- rbind(all_metrics, log_metrics)
}

if (exists("tree_cm")) {
  tree_metrics <- calculate_metrics(tree_cm)
  tree_metrics$Model <- "Decision Tree"
  all_metrics <- rbind(all_metrics, tree_metrics)
}
if (exists("svm_cm")) {
  svm_metrics <- calculate_metrics(svm_cm)
  svm_metrics$Model <- "SVM"
  all_metrics <- rbind(all_metrics, svm_metrics)
}
if (exists("rf_cm")) {
  rf_metrics <- calculate_metrics(rf_cm)
  rf_metrics$Model <- "Random Forest"
  all_metrics <- rbind(all_metrics, rf_metrics)
}
if (exists("xgb_cm")) {
  xgb_metrics <- calculate_metrics(xgb_cm)
  xgb_metrics$Model <- "XGBoost"
  all_metrics <- rbind(all_metrics, xgb_metrics)
}

if (!"Model" %in% colnames(all_metrics)) {
  stop("The 'Model' column is missing in the data frame. Please ensure all models were evaluated.")
}

all_metrics <- all_metrics[, c("Model", "Accuracy", "Sensitivity", "Specificity",
                               "Precision", "NPV", "Balanced_Accuracy", "F1_Score")]

all_metrics %>%
  kbl(caption = "Evaluation Metrics for All Models", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = TRUE, position = "center") %>%
  column_spec(1, bold = TRUE, border_left = TRUE) %>%
  column_spec(ncol(all_metrics), border_right = TRUE)
Evaluation Metrics for All Models

| Model               | Accuracy | Sensitivity | Specificity | Precision | NPV    | Balanced_Accuracy | F1_Score |
|---------------------|---------:|------------:|------------:|----------:|-------:|------------------:|---------:|
| Logistic Regression | 0.6967   | 0.7594      | 0.6798      | 0.3898    | 0.9130 | 0.7196            | 0.5152   |
| Decision Tree       | 0.7688   | 0.5849      | 0.8183      | 0.4644    | 0.8798 | 0.7016            | 0.5177   |
| SVM                 | 0.7382   | 0.5283      | 0.7948      | 0.4095    | 0.8622 | 0.6615            | 0.4614   |
| Random Forest       | 0.8053   | 0.6085      | 0.8583      | 0.5364    | 0.8906 | 0.7334            | 0.5702   |
| XGBoost             | 0.8133   | 0.5189      | 0.8926      | 0.5656    | 0.8732 | 0.7057            | 0.5412   |

The five models (Logistic Regression, Decision Tree, Support Vector Machine (SVM), Random Forest, and XGBoost) were evaluated on accuracy, sensitivity (ability to correctly identify churners), specificity (ability to avoid false alarms), positive predictive value (PPV, or precision), negative predictive value (NPV), balanced accuracy, and F1-score. Among these, Random Forest demonstrated the best overall performance, achieving a balanced accuracy of 73.34%, sensitivity of 60.85%, and a PPV of 53.64%. Its F1-score of 0.57 highlights its effectiveness in balancing precision and recall, making it the most reliable model. XGBoost also performed well, with the highest specificity (89.26%) and a PPV of 56.56%, though its sensitivity and F1-score were slightly lower than Random Forest's.

The Decision Tree and SVM models exhibited moderate performance but struggled to identify churners, as reflected in their lower sensitivity scores of 58.49% and 52.83%, respectively. This limitation makes them less suitable for churn prediction, where catching potential churners is crucial. Logistic Regression, while simple and interpretable, had the lowest precision (PPV of 38.98%), indicating a large number of false positives. Despite its having the highest sensitivity (75.94%), this imbalance makes Logistic Regression less practical, as it would trigger unnecessary customer interventions.

Model performance was shaped by the hyperparameters chosen for each algorithm. Random Forest was configured with 300 trees (ntree = 300) and the square-root heuristic for the number of features per split (mtry = sqrt(p)), leading to consistent performance across all metrics. XGBoost was trained for 200 boosting rounds with a binary logistic objective and log-loss evaluation, with the remaining parameters left at their defaults. The SVM model used a radial kernel with gamma = 0.1 and cost = 10, and class weights were applied to improve its ability to detect churners. Despite these adjustments, SVM underperformed in precision and sensitivity relative to the other models.
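One way to push these settings further is a systematic grid search. The sketch below is illustrative only (the Churn column name and the mtry grid are arbitrary choices, assuming X_train_smote and y_train_smote as above); it uses caret's cross-validation to tune mtry for Random Forest instead of fixing it at sqrt(p):

train_df <- data.frame(X_train_smote, Churn = factor(y_train_smote, levels = c(0, 1)))
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
grid <- expand.grid(mtry = 2:6)                   # candidate values around sqrt(p)
rf_tuned <- train(Churn ~ ., data = train_df, method = "rf",
                  trControl = ctrl, tuneGrid = grid, ntree = 300)
rf_tuned$bestTune                                 # mtry with the best cross-validated accuracy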

In summary, Random Forest emerged as the most effective model for predicting customer churn, offering a strong balance between sensitivity, precision, and F1-score. Although XGBoost showed promising results, the superior overall metrics of Random Forest make it the preferred choice. Further improvements could involve advanced tuning techniques or ensemble methods to enhance model performance and better manage imbalanced data.

6.1.2 ROC Plot & AUC

# Load required libraries
library(pROC)
library(ggplot2)

# Ensure y_test is numeric
y_test <- as.numeric(as.character(y_test))

# Function to calculate ROC for a model
get_roc <- function(y_true, y_pred_prob, model_name) {
  if (length(y_true) != length(y_pred_prob) || any(is.na(y_pred_prob))) {
    stop(paste("Invalid input data for", model_name))
  }
  roc(y_true, y_pred_prob)
}

cat("\n ROC Plot: \n")
## 
##  ROC Plot:
# Logistic Regression
log_prob <- predict(log_model, X_test, type = "response")
roc_log <- get_roc(y_test, log_prob, "Logistic Regression")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_log <- auc(roc_log)
# Decision Tree
tree_prob <- predict(tree_model, X_test, type = "prob")[, 2]
roc_tree <- get_roc(y_test, tree_prob, "Decision Tree")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_tree <- auc(roc_tree)
# Support Vector Machine (SVM)
# Note: column 2 of the "probabilities" attribute is taken here by position; selecting
# by name ([, "1"]) as in the modelling step would be safer. pROC's flipped-direction
# message below indicates column 2 holds the class "0" probabilities, which the
# automatic direction detection compensates for, so the AUC itself is unaffected.
svm_prob <- attr(predict(svm_model, X_test, probability = TRUE), "probabilities")[, 2]
roc_svm <- get_roc(y_test, svm_prob, "SVM")
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
auc_svm <- auc(roc_svm)
# Random Forest
rf_prob <- predict(rf_model, X_test, type = "prob")[, 2]
roc_rf <- get_roc(y_test, rf_prob, "Random Forest")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- auc(roc_rf)
# XGBoost
xgb_prob <- predict(xgb_model, X_test_matrix)
roc_xgb <- get_roc(y_test, xgb_prob, "XGBoost")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_xgb <- auc(roc_xgb)
roc_results <- list(
  LogisticRegression = roc_log,
  DecisionTree = roc_tree,
  SVM = roc_svm,
  RandomForest = roc_rf,
  XGBoost = roc_xgb
)
roc_combined_plot <- ggroc(roc_results, aes = "color") +
  ggtitle("ROC Curves for All Models") +
  theme_minimal() +
  scale_color_manual(
    values = c("LogisticRegression" = "purple", 
               "DecisionTree" = "green", 
               "SVM" = "blue", 
               "RandomForest" = "red", 
               "XGBoost" = "darkorange")
  ) +
  labs(color = "Model") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate")
print(roc_combined_plot)

auc_table <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Support Vector Machine (SVM)", 
            "Random Forest", "XGBoost"),
  AUC = c(round(auc_log, 4), round(auc_tree, 4), round(auc_svm, 4), 
          round(auc_rf, 4), round(auc_xgb, 4))
)
cat("\nAUC Values for All Models\n")
## 
## AUC Values for All Models
library(knitr)
kable(auc_table, caption = "AUC Values for All Models")
AUC Values for All Models

| Model                        | AUC    |
|------------------------------|-------:|
| Logistic Regression          | 0.7856 |
| Decision Tree                | 0.7622 |
| Support Vector Machine (SVM) | 0.7326 |
| Random Forest                | 0.8219 |
| XGBoost                      | 0.7937 |

Above are the Area Under the Curve (AUC) values obtained for each model. A higher AUC indicates better discrimination, with 1.0 being perfect and 0.5 no better than random guessing. Random Forest achieved the highest AUC (0.8219), indicating it is the most effective at distinguishing churners from non-churners. XGBoost followed with an AUC of 0.7937, while Logistic Regression and Decision Tree performed moderately well with AUC values of 0.7856 and 0.7622, respectively. SVM had the lowest AUC at 0.7326.

Based on the evaluation metrics, Random Forest emerged as the best-performing model for predicting customer churn, with the highest AUC value, strong accuracy, and balanced precision-recall performance. This indicates it effectively identifies churned customers while minimizing false predictions. XGBoost was a close second, showing strong overall performance and good reliability.

Other models (SVM, Logistic Regression, and Decision Tree) produced reasonable results but were less effective than Random Forest and XGBoost. For this dataset, therefore, Random Forest is the most suitable model for accurately predicting customer churn, with XGBoost a strong alternative.
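The ROC objects also offer a principled alternative to the hand-picked 0.3 cutoff used throughout: pROC can report the threshold maximizing Youden's J statistic (sensitivity + specificity - 1). A minimal sketch for the Random Forest curve:

best_cutoff <- coords(roc_rf, x = "best", best.method = "youden",
                      ret = c("threshold", "sensitivity", "specificity"))
best_cutoff  # probability cutoff that maximizes sensitivity + specificity - 1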

7 Conclusion

Customer churn remains a critical challenge for the banking industry, requiring proactive measures to retain customers and ensure profitability. This study aimed to predict customer churn using a comprehensive dataset and advanced machine learning techniques. The dataset underwent rigorous preprocessing, including feature engineering, scaling, normalization, and class imbalance handling using SMOTE. These steps ensured the data was suitable for building robust models that could accurately predict customer churn while addressing the inherent class imbalance in the data.

For Question 1, the analysis of correlations revealed key factors influencing customer churn. Complaints showed the strongest positive correlation with churn, highlighting dissatisfaction as a critical predictor. Additionally, older age groups and higher balances were moderately correlated with churn, suggesting the need for tailored financial products for these segments. On the negative side, fewer products owned and lower engagement levels were linked to higher churn rates, emphasizing the importance of cross-selling strategies and customer engagement initiatives to improve retention.

In Question 2, among the five machine learning models evaluated (Logistic Regression, Decision Tree, Support Vector Machine (SVM), Random Forest, and XGBoost), Random Forest demonstrated the best overall performance, with the highest balanced accuracy (73.34%), F1-score (0.57), and AUC (0.822). These metrics indicate that Random Forest strikes a strong balance between identifying true churners (sensitivity) and avoiding false alarms (precision). XGBoost was a close second, offering competitive results with high specificity (89.26%) and precision (56.56%). Both outperformed the simpler Logistic Regression and Decision Tree models, which struggled with the class imbalance and delivered noticeably lower precision.

In conclusion, Random Forest is the most suitable model for predicting customer churn in this dataset, providing actionable insights for retaining customers. While XGBoost is a viable alternative, Random Forest’s consistent performance across all metrics makes it the preferred choice. Future work could involve exploring ensemble techniques, hyperparameter optimization, and additional features to further improve model accuracy and interpretability. These insights equip financial institutions with a data-driven framework to identify at-risk customers, enabling targeted interventions to improve retention and reduce churn rates.