This document contains the data cleaning and data pre-processing (exploratory data analysis) for the WQD7004 project.

# Group Assignment

## Introduction

Customer churn has become a significant issue in the fast-paced, fiercely competitive banking sector, endangering long-term stability and profitability. Customer churn, the termination of a customer's engagement with a business, directly impacts revenue streams and acquisition expenses. Proactive churn control measures are crucial since, according to industry studies, acquiring a new client can be five times more expensive than keeping an existing one.
The study addresses this challenge by analyzing a dataset obtained from Kaggle comprising 10,000 customer records with 18 variables, ranging from transactional and behavioral indicators to sociodemographic characteristics. The dataset provides a strong basis for understanding churn dynamics in the banking environment, offering an opportunity to explore important churn drivers such as credit scores, product utilization, customer satisfaction, and demographic factors.
This research seeks actionable insights through a methodical approach that includes exploratory data analysis (EDA), careful preprocessing, and predictive modeling. To prepare the dataset for modeling, key preprocessing techniques such as feature encoding, scaling, and missing value analysis were applied. Additionally, oversampling strategies such as SMOTE were used to address class imbalance, helping to ensure a representative and balanced training dataset.
The analysis's conclusions will provide practical suggestions for identifying at-risk clients, allowing financial institutions to put focused retention tactics into place. Banks can enhance customer satisfaction, maximize client engagement, and ultimately lower churn rates by utilizing data-driven insights. This report presents practical solutions for improving client retention in the banking industry, in addition to adding to the body of knowledge on churn prediction.
The objectives of this study are:

- To identify and select the features most correlated with bank customer churn, including socio-demographic background, type of card held, credit score, points earned, and number of products purchased through the bank.
- To create a predictive model to identify bank customer churn based on relevant factors.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## corrplot 0.95 loaded
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
# Load the dataset
df <- read.csv('Customer-Churn-Records.csv')
# Display summary statistics of the dataset
summary(df)
## RowNumber CustomerId Surname CreditScore
## Min. : 1 Min. :15565701 Length:10000 Min. :350.0
## 1st Qu.: 2501 1st Qu.:15628528 Class :character 1st Qu.:584.0
## Median : 5000 Median :15690738 Mode :character Median :652.0
## Mean : 5000 Mean :15690941 Mean :650.5
## 3rd Qu.: 7500 3rd Qu.:15753234 3rd Qu.:718.0
## Max. :10000 Max. :15815690 Max. :850.0
## Geography Gender Age Tenure
## Length:10000 Length:10000 Min. :18.00 Min. : 0.000
## Class :character Class :character 1st Qu.:32.00 1st Qu.: 3.000
## Mode :character Mode :character Median :37.00 Median : 5.000
## Mean :38.92 Mean : 5.013
## 3rd Qu.:44.00 3rd Qu.: 7.000
## Max. :92.00 Max. :10.000
## Balance NumOfProducts HasCrCard IsActiveMember
## Min. : 0 Min. :1.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 97199 Median :1.00 Median :1.0000 Median :1.0000
## Mean : 76486 Mean :1.53 Mean :0.7055 Mean :0.5151
## 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :250898 Max. :4.00 Max. :1.0000 Max. :1.0000
## EstimatedSalary Exited Complain Satisfaction.Score
## Min. : 11.58 Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.: 51002.11 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:2.000
## Median :100193.91 Median :0.0000 Median :0.0000 Median :3.000
## Mean :100090.24 Mean :0.2038 Mean :0.2044 Mean :3.014
## 3rd Qu.:149388.25 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:4.000
## Max. :199992.48 Max. :1.0000 Max. :1.0000 Max. :5.000
## Card.Type Point.Earned
## Length:10000 Min. : 119.0
## Class :character 1st Qu.: 410.0
## Mode :character Median : 605.0
## Mean : 606.5
## 3rd Qu.: 801.0
## Max. :1000.0
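The dimensions and column names of the dataset are then displayed (a minimal sketch of the step that produces the output below):
# Display the dataset dimensions and column names
cat("Shape of the dataset:", dim(df), "\n")
print(colnames(df))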
## Shape of the dataset: 10000 18
## [1] "RowNumber" "CustomerId" "Surname"
## [4] "CreditScore" "Geography" "Gender"
## [7] "Age" "Tenure" "Balance"
## [10] "NumOfProducts" "HasCrCard" "IsActiveMember"
## [13] "EstimatedSalary" "Exited" "Complain"
## [16] "Satisfaction.Score" "Card.Type" "Point.Earned"
The dataset comprises 10,000 rows and 18 columns, representing 10,000 samples and 18 features. To gain a deeper understanding of the dataset structure, the columns are categorized into three primary groups based on the type of data they represent, as shown below:
Key Features | Types of Data |
---|---|
Surname, Geography, Gender, Card Type | Character (categorical) |
RowNumber, CustomerId, CreditScore, Age, Tenure, NumOfProducts, HasCrCard, IsActiveMember, Exited, Complain, Satisfaction Score, Point Earned | Integer |
Balance, EstimatedSalary | Numeric (float) |
# Count exited customers and calculate percentages
exited_customers_df <- as.data.frame(table(df$Exited))
colnames(exited_customers_df) <- c("Exited", "Count")
exited_customers_df$Percentage <- (exited_customers_df$Count / sum(exited_customers_df$Count)) * 100
# Add labels
exited_customers_df$Label <- ifelse(exited_customers_df$Percentage >= 5,
paste0(exited_customers_df$Exited, " (", round(exited_customers_df$Percentage, 1), "%)"),
"")
# Define colors
colors <- c("salmon", "turquoise3")
# Pie chart for percentage of customers
ggplot(exited_customers_df, aes(x = "", y = Count, fill = factor(Exited))) +
geom_bar(width = 1, stat = "identity") +
coord_polar(theta = "y") +
theme_void() +
labs(title = "Percentage of Churned Customers", fill = "Exited") +
scale_fill_manual(values = colors) +
geom_text(aes(label = Label), position = position_stack(vjust = 0.5))
# Bar chart for count of customers that churned and not churned by Geography
ggplot(df, aes(x = Geography, fill = factor(Exited))) +
geom_bar(position = "dodge") +
labs(title = "Customer Churn by Geography",
x = "Geography",
y = "Count",
fill = "Exited") +
theme_minimal()
# Histogram Age distribution
hist(df[['Age']], main = "Age distribution", xlab = "Age", ylab = "Amount of customers", col = "slateblue", border = "black")
# Filter data to include only churned customers
exited_customers <- df[df$Exited == 1, ]
# Age distribution of churned customers
library(ggplot2)
ggplot(exited_customers, aes(x = Age)) +
geom_histogram(bins = 20, fill = "slateblue", color = "black", alpha = 0.7) +
labs(title = "Age Distribution of Churned Customers",
x = "Age",
y = "Number of Churned Customers") +
theme_minimal()
# Bar chart for count of customers by geography and credit card ownership
ggplot(df, aes(x = Geography, fill = factor(HasCrCard))) +
geom_bar(position = "dodge") +
scale_fill_manual(name = "Has Credit Card", labels = c("No", "Yes"), values = c("salmon", "turquoise3")) +
labs(title = "Count of Customers by Geography and Credit Card Ownership", x = "Geography", y = "Count") +
theme_minimal() +
theme(legend.position = "top")
# Bar chart for count of customers with/without credit card by churn status
ggplot(df, aes(x = factor(HasCrCard, labels = c("No Credit Card", "Has Credit Card")),
fill = factor(Exited))) +
geom_bar(position = "dodge") +
labs(title = "Churn Status of Customers with/without Credit Card",
x = "Credit Card Ownership",
y = "Count",
fill = "Exited") +
theme_minimal()
# Bar chart for churn status of customers by credit card ownership and geography
ggplot(df, aes(x = Geography,
fill = interaction(factor(HasCrCard, labels = c("No Credit Card", "Has Credit Card")),
factor(Exited)))) +
geom_bar(position = "dodge") +
scale_fill_manual(values = c("lightsalmon", "salmon", "paleturquoise", "turquoise")) +
labs(title = "Churn Status of Customers by Credit Card Ownership and Geography",
x = "Geography",
y = "Count",
fill = "Credit Card & Exited") +
theme_minimal()
# Bar chart for churned vs. not churned based on Card Type
ggplot(df, aes(x = `Card.Type`, fill = factor(Exited))) +
geom_bar(position = "dodge") +
labs(title = "Customer Churn by Card Type",
x = "Card Type",
y = "Count",
fill = "Exited") +
theme_minimal()
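Next, the number of columns of each data type is tabulated (a minimal sketch of the step that produces the output below, using the magrittr pipe loaded with the tidyverse):
# Tabulate the number of columns of each data type
sapply(df, class) %>% table()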
## .
## character integer numeric
## 4 12 2
# Check number of columns with null values
null_counts <- colSums(is.na(df))
cat("Total number of columns with null values:", sum(null_counts > 0), "/", ncol(df), "\n")
## Total number of columns with null values: 0 / 18
# Create dataframe with column names, data types, and null counts
column_info <- data.frame(
"Column Name" = colnames(df),
"Data Type" = sapply(df, class),
"Null Count" = null_counts
)
# Adjust row indices to start from 1
rownames(column_info) <- 1:nrow(column_info)
print(column_info)
## Column.Name Data.Type Null.Count
## 1 RowNumber integer 0
## 2 CustomerId integer 0
## 3 Surname character 0
## 4 CreditScore integer 0
## 5 Geography character 0
## 6 Gender character 0
## 7 Age integer 0
## 8 Tenure integer 0
## 9 Balance numeric 0
## 10 NumOfProducts integer 0
## 11 HasCrCard integer 0
## 12 IsActiveMember integer 0
## 13 EstimatedSalary numeric 0
## 14 Exited integer 0
## 15 Complain integer 0
## 16 Satisfaction.Score integer 0
## 17 Card.Type character 0
## 18 Point.Earned integer 0
It is observed that there are a total of 12 features with integer data type, 4 features with character data type, and 2 features with numeric (float) data type. Furthermore, data imputation is excluded from this process as all columns are free of null values.
All columns with character data type need to be converted into numerical format so that they are suitable for machine learning algorithms. There are four columns that need to be converted, namely Surname, Geography, Gender, and Card Type.
# Get columns of character data type only
object_columns <- sapply(df, is.character)
# Loop through each object column
for (col in names(df)[object_columns]) {
unique_values <- unique(df[[col]])
total_num_of_unique_values <- length(unique_values)
cat("Total number of unique values in '", col, "': ", total_num_of_unique_values, "\n\n", sep = "")
}
## Total number of unique values in 'Surname': 2932
##
## Total number of unique values in 'Geography': 3
##
## Total number of unique values in 'Gender': 2
##
## Total number of unique values in 'Card.Type': 4
The above output shows that among the four columns with a character data type, the Geography, Gender, and Card Type columns likely contain categorical data, as they have only a few unique values relative to the large number of rows. On the other hand, the Surname column, with 2,932 unique entries, most likely does not contain categorical data.
# Specify the column that might contain categorical data
categorical_data_columns <- c("Geography", "Gender", "Card.Type")
# Loop through each column and print unique values
for (col in categorical_data_columns) {
cat(paste("Unique values in", col, ":"))
cat(unique(df[[col]]), "\n\n")
}
## Unique values in Geography :France Spain Germany
##
## Unique values in Gender :Female Male
##
## Unique values in Card.Type :DIAMOND GOLD SILVER PLATINUM
From the analysis above, it is observed that the Geography, Gender, and Card Type columns indeed contain categorical data. Specifically, Geography has three categories (France, Spain, Germany), Gender has two categories (Female, Male), and Card Type has four ordinal categories (Diamond > Platinum > Gold > Silver). These columns carry meaningful information and are suitable for conversion into numerical values.
In contrast, the Surname column has 2,932 unique values, indicating high variability that does not contribute meaningfully to the analysis, as the Customer ID column is already sufficient to identify each individual customer. Thus, this feature, along with Row Number, which carries no predictive meaning, can be excluded from further analysis. The removal of both columns is shown below.
# Drop the 'Surname' and 'Row Number' columns from the data frame
df <- df[, !(colnames(df) %in% c("Surname", "RowNumber"))]
# Verify the columns of the modified data frame
colnames(df)
## [1] "CustomerId" "CreditScore" "Geography"
## [4] "Gender" "Age" "Tenure"
## [7] "Balance" "NumOfProducts" "HasCrCard"
## [10] "IsActiveMember" "EstimatedSalary" "Exited"
## [13] "Complain" "Satisfaction.Score" "Card.Type"
## [16] "Point.Earned"
Then, the Geography, Gender, and Card Type columns are converted into integer data type.
# Label encoding for Gender (Female = 0, Male = 1)
df$Gender <- as.integer(factor(df$Gender)) - 1
# One-Hot encoding for Geography
df <- cbind(df, model.matrix(~Geography - 1, data = df))
df$Geography <- NULL
# Card Type Mapping
card_type_mapping <- c("SILVER" = 1, "GOLD" = 2, "PLATINUM" = 3, "DIAMOND" = 4)
df$Card.Type <- card_type_mapping[df$Card.Type]
The method used for the conversion and the rationale behind it are described as below.
Column | Method | Reason |
---|---|---|
Gender | Label Encoding | Contains two distinct categories, Male and Female, which can be effectively represented as integers (0 for Female and 1 for Male). This encoding captures the binary distinction without implying any order. |
Geography | One-Hot Encoding | Consists of nominal categories (France, Spain, Germany) without any inherent ranking. One-hot encoding creates separate binary columns for each country, ensuring that the model treats them as distinct and independent categories. |
Card Type | Manual Mapping for Ordinal Encoding | Has a clear ordinal relationship with ranks defined as Silver < Gold < Platinum < Diamond. Manually mapping these categories to integers (1 for Silver, 2 for Gold, etc.) preserves this hierarchy, allowing the model to utilize the ranking in its predictions. |
# Verify the data transformation by getting the column data types
column_data_types <- data.frame("Column Name" = colnames(df),
"Data Type" = sapply(df, class))
# Adjust the row names to start from 1
rownames(column_data_types) <- 1:nrow(column_data_types)
# Display the result
print(column_data_types)
## Column.Name Data.Type
## 1 CustomerId integer
## 2 CreditScore integer
## 3 Gender integer
## 4 Age integer
## 5 Tenure integer
## 6 Balance numeric
## 7 NumOfProducts integer
## 8 HasCrCard integer
## 9 IsActiveMember integer
## 10 EstimatedSalary numeric
## 11 Exited integer
## 12 Complain integer
## 13 Satisfaction.Score integer
## 14 Card.Type numeric
## 15 Point.Earned integer
## 16 GeographyFrance numeric
## 17 GeographyGermany numeric
## 18 GeographySpain numeric
The above results show that the Gender column has been successfully converted to integer values, with 0 representing Female and 1 representing Male. The Geography column has been replaced by three new columns (GeographyFrance, GeographyGermany, and GeographySpain), each containing an indicator value: 1 if the customer is from that country, 0 otherwise. Lastly, the Card Type column has been transformed into integer values, with 1, 2, 3, and 4 corresponding to Silver, Gold, Platinum, and Diamond, respectively, in ascending order of rank.
For this part, all columns with integer or numeric (float) data type are examined to determine whether scaling is needed; a detailed look at each column's characteristics informs this decision.
# Get columns of integer and numeric (float) data type only
int_and_float_column_df <- df[sapply(df, is.numeric)]
# Check the value range for the first few rows
head(int_and_float_column_df)
Not every column needs scaling; those representing categorical, binary, or ordinal features can be left as-is. These include identifiers (e.g., Customer ID), binary variables (e.g., Gender, credit card ownership, membership status, complaint status, and the target variable Exited), and ordinal features that have been appropriately encoded as integers (e.g., Card Type). Additionally, discrete integer variables with a limited range, such as the Satisfaction Score, are not impacted by scale differences and therefore do not require normalization. These features represent distinct categories that do not benefit from being on the same scale as the continuous numeric features.
# Select the columns of interest
columns_to_scale <- c('CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary', 'NumOfProducts', 'Point.Earned')
# Get summary statistics for these columns
summary_stats <- summary(df[, columns_to_scale])
# Extract min, max, mean, and std (standard deviation) from the summary statistics
summary_stats <- data.frame(
  Min = summary_stats[1, ],
  Max = summary_stats[6, ],
  Mean = summary_stats[4, ],  # row 4 of the summary() output holds the mean
  Std = apply(df[, columns_to_scale], 2, sd)
)
# Display the result
print(summary_stats)
##                            Min               Max             Mean
## CreditScore     Min. :350.0       Max. :850.0       Mean :650.5
## Age             Min. :18.00       Max. :92.00       Mean :38.92
## Tenure          Min. : 0.000      Max. :10.000      Mean : 5.013
## Balance         Min. :     0      Max. :250898      Mean : 76486
## EstimatedSalary Min. :    11.58   Max. :199992.48   Mean :100090.24
## NumOfProducts   Min. :1.00        Max. :4.00        Mean :1.53
## Point.Earned   Min. : 119.0       Max. :1000.0      Mean : 606.5
## Std
## CreditScore 9.665330e+01
## Age 1.048781e+01
## Tenure 2.892174e+00
## Balance 6.239741e+04
## EstimatedSalary 5.751049e+04
## NumOfProducts 5.816544e-01
## Point.Earned 2.259248e+02
In contrast, numeric columns with continuous values—such as Credit Score, Age, Tenure, Balance, Estimated Salary, and Point Earned—require scaling due to the wide range of their values and the variability in their distributions. Features with large numerical ranges, such as Balance and Estimated Salary, are particularly susceptible to dominating the model if not scaled appropriately. For features like Age, Tenure, and Credit Score, scaling helps to standardize the influence of each feature on the model, ensuring that no single feature disproportionately affects the predictions.
Before applying scaling or normalization techniques to the selected columns, it is essential to address the potential outliers identified in certain columns, such as Balance, Estimated Salary, and Point Earned, due to their wide ranges and high standard deviations. Outliers can significantly distort the results of scaling methods like Min-Max scaling and Standardization, which are sensitive to extreme values. If not handled, outliers can skew the mean and range of the data, which can in turn distort the scaled values and degrade the performance of machine learning models. For instance, in the case of Min-Max scaling, extreme values can stretch the range, compressing the majority of data points into a narrower band, thus affecting the overall distribution and model accuracy.
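A quick toy illustration of this distortion: with a single extreme value, min-max scaling compresses the remaining observations into a very narrow band (the numbers below are illustrative only):
# Toy example: one outlier stretches the min-max range
x <- c(10, 12, 14, 16, 1000)
x_scaled <- (x - min(x)) / (max(x) - min(x))
print(round(x_scaled, 3)) # the first four values all land below 0.01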
# Define a list to store outlier counts for each column
outlier_counts <- list()
# Loop through each column of interest
for (column in columns_to_scale) {
# Calculate Q1, Q3, and IQR
Q1 <- quantile(df[[column]], 0.25)
Q3 <- quantile(df[[column]], 0.75)
IQR <- Q3 - Q1
# Calculate the lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
outliers <- df[df[[column]] < lower_bound | df[[column]] > upper_bound, ]
# Store the count of outliers in the list
outlier_counts[[column]] <- nrow(outliers)
}
# Display the count of outliers in each column
print(outlier_counts)
## $CreditScore
## [1] 15
##
## $Age
## [1] 359
##
## $Tenure
## [1] 0
##
## $Balance
## [1] 0
##
## $EstimatedSalary
## [1] 0
##
## $NumOfProducts
## [1] 60
##
## $Point.Earned
## [1] 0
To detect outliers in the identified columns, the Interquartile Range (IQR) method was employed. The IQR is particularly useful for continuous numerical data, such as Credit Score, Age, Tenure, Balance, and Estimated Salary, which are prone to outliers.
Steps to apply the IQR analysis:

1. Calculate the IQR for each column: the IQR is determined by subtracting the 25th percentile (Q1) from the 75th percentile (Q3) of the data.
2. Define outlier thresholds: a common rule is to consider any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR an outlier.
3. Apply the thresholds to each column: values beyond these thresholds are considered outliers.
From the outlier detection process, only three columns—Credit Score, Age, and Number of Products—were identified as containing outliers. To address these outliers:
# Apply Winsorizing method for the 'NumOfProducts' column
lower_bound <- quantile(df$NumOfProducts, 0.05)
upper_bound <- quantile(df$NumOfProducts, 0.95)
# Clip values within the bounds
df$NumOfProducts <- pmax(pmin(df$NumOfProducts, upper_bound), lower_bound)
# Display the modified column
head(df$NumOfProducts)
## [1] 1 1 2 2 1 2
Number of Products: Given the presence of multiple outliers, the capping (winsorizing) method was applied. This involved capping extreme values to the 5th and 95th percentiles, which helps maintain a reasonable data range while preserving the overall distribution.
# Calculate z-scores for CreditScore
df$CreditScore_Z <- (df$CreditScore - mean(df$CreditScore)) / sd(df$CreditScore)
# Define the threshold
threshold <- 3
# Filter rows where the absolute value of CreditScore_Z is less than or equal to the threshold
df_filtered <- df[abs(df$CreditScore_Z) <= threshold, ]
# Drop the temporary CreditScore_Z column
df_filtered$CreditScore_Z <- NULL
# Check the dimensions of the filtered data frame
dim(df_filtered)
## [1] 9992 18
Credit Score: For this column, which contained only a few outliers, the Z-score method was utilized to identify and remove the extreme values. Any data points with Z-scores greater than ±3 standard deviations from the mean were flagged as outliers and removed. Following this process, the dataset size decreased from 10,000 to 9,992, indicating that 8 samples were removed as outliers. This method proved more efficient than the IQR method, which had flagged 15 samples as outliers, thus preserving more valuable data.
# Calculate the correlation between 'Age' and 'Exited'
correlation_age_exited <- cor(df_filtered$Age, df_filtered$Exited)
# Print the result
cat("Correlation between Age and Exited:", correlation_age_exited, "\n")
## Correlation between Age and Exited: 0.2842997
Age: This column contained a significant number of outliers, requiring a more cautious approach. Rather than simply removing these extreme values, capping was deemed the more appropriate technique. Given that these outliers may represent valid data points, capping ensures that extreme values are adjusted to lie within a more reasonable range. To assess the impact of these outliers on the target variable, Exited, a correlation analysis was conducted. The correlation between Age and Exited was found to be weak (approximately 0.28), suggesting that outliers in the Age column would not have a substantial impact on the model's predictions. Capping was therefore chosen because it maintains the natural distribution by simply limiting extreme values to a threshold, whereas imputation can introduce bias by replacing them with artificial values. To determine the appropriate thresholds for capping, it is essential to assess whether the outliers in the Age column disproportionately affect the prediction of a particular class.
# Define the IQR bounds for the Age column to identify outliers
Q1 <- quantile(df_filtered$Age, 0.25)
Q3 <- quantile(df_filtered$Age, 0.75)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter rows where Age is considered an outlier
age_outliers <- df_filtered[df_filtered$Age < lower_bound | df_filtered$Age > upper_bound, ]
# Calculate the proportion of Exited vs Non-Exited among Age outliers
age_outliers_exited_counts <- table(age_outliers$Exited)
total_outliers <- nrow(age_outliers)
exited_proportion_in_outliers <- age_outliers_exited_counts / total_outliers
# Print the proportion of 'Exited' within Age outliers
cat("\nProportion of 'Exited' within Age outliers:")
print(exited_proportion_in_outliers)
##
## Proportion of 'Exited' within Age outliers:
##
## 0 1
## 0.7966574 0.2033426
# Get the count of each class in the Exited column for comparison
class_counts <- table(df_filtered$Exited)
total_number_of_samples <- nrow(df_filtered)
exited_proportion_overall <- class_counts / total_number_of_samples
# Print the overall proportion of 'Exited'
cat("\nOverall Proportion of 'Exited':")
print(exited_proportion_overall)
##
## Overall Proportion of 'Exited':
##
## 0 1
## 0.7968375 0.2031625
Upon comparing the proportions of the Exited class between the Age outliers and the overall dataset, it was found that both subsets exhibit a similar 80/20 split between the ‘Exited’ and ‘Non-Exited’ classes. This consistent distribution indicates that there is no need to adjust the capping thresholds to preserve a higher proportion of samples from specific age ranges, which could disproportionately influence the prediction of a particular class. As a result, the standard capping approach, utilizing the 1st and 99th percentiles, is deemed appropriate without further adjustments.
# Define the lower and upper percentiles for capping
lower_percentile <- 1
upper_percentile <- 99
# Calculate the lower and upper bounds using the percentiles
lower_bound <- quantile(df_filtered$Age, lower_percentile / 100)
upper_bound <- quantile(df_filtered$Age, upper_percentile / 100)
# Apply capping by replacing values below lower bound with the lower bound & values above upper bound with the upper bound
df_filtered$Age <- pmin(pmax(df_filtered$Age, lower_bound), upper_bound)
# Print summary statistics for the 'Age' column after capping
print(summary(df_filtered$Age))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 32.00 37.00 38.89 44.00 72.00
# Verify that the proportion of Exited vs Non-Exited remains at ~ 80/20 split after capping
class_counts <- table(df_filtered$Exited)
total_number_of_samples <- nrow(df_filtered)
exited_proportion_overall <- class_counts / total_number_of_samples
# Print the overall proportion of 'Exited'
cat("\nOverall Proportion of 'Exited':")
print(exited_proportion_overall)
##
## Overall Proportion of 'Exited':
##
## 0 1
## 0.7968375 0.2031625
The capping process has successfully adjusted the extreme values in the Age column, ensuring that they now lie within the 1st and 99th percentiles. The minimum age has been capped at 21, and the maximum age at 72. Consequently, the mean age (38.89) is now more representative, reflecting that the extreme values had less influence on these statistics. Although the maximum age shifted significantly from 92 to 72, the overall distribution of the data remains consistent, with an 80/20 split between the ‘Exited’ and ‘Non-Exited’ classes. This suggests that the capping process has not adversely affected the balance of the dataset. The approach preserves the integrity of the data, minimizes the impact of extreme age values, and ensures a more realistic range of values, making it well-suited for downstream analysis and machine learning modeling.
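The scaling step itself can then be carried out with standardization (z-score scaling); the following is a minimal sketch over the continuous columns, assuming the columns_to_scale vector defined earlier:
# Standardize the continuous columns to zero mean and unit variance
df_filtered[columns_to_scale] <- lapply(
  df_filtered[columns_to_scale],
  function(x) as.numeric(scale(x))
)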
Upon completing scaling and normalization, the next step involves feature selection through correlation analysis. Examining feature correlations helps identify highly correlated pairs, which can introduce multicollinearity in certain machine learning models (e.g., linear regression). Highly correlated features may need to be removed or transformed to reduce redundancy, thereby improving model interpretability and performance.
# Calculate the correlation matrix for the numerical features
correlation_matrix <- cor(df_filtered[sapply(df_filtered, is.numeric)])
# Set a large canvas size (e.g., 12x12 inches for a more spacious plot)
png("correlation_matrix_plot.png", width = 1200, height = 1200) # Set width and height in pixels
# Visualize the correlation matrix using a heatmap
corrplot(correlation_matrix, method = "color", type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(200),
tl.cex = 0.8, tl.col = "black", addCoef.col = "black",
title = "Correlation Matrix")
# Save the plot to file and close the device
dev.off()
cat("Plot saved as 'correlation_matrix_plot.png' in the working directory.")
## png
## 2
## Plot saved as 'correlation_matrix_plot.png' in the working directory.
A correlation matrix is used to identify highly correlated feature pairs, and it is observed that the pair (‘Exited’, ‘Complain’) exhibits a strong correlation.
# Identify highly correlated features (e.g. with a correlation coefficient > 0.8 or -0.8)
threshold <- 0.8
highly_correlated <- which(abs(correlation_matrix) > threshold, arr.ind = TRUE)
# Extract the indices of highly correlated features (excluding the diagonal)
highly_correlated_pairs <- vector("list", length = nrow(highly_correlated))
for (i in 1:nrow(highly_correlated)) {
x <- highly_correlated[i, 1]
y <- highly_correlated[i, 2]
# Avoid duplicates and the diagonal
if (x < y) {
highly_correlated_pairs[[i]] <- c(colnames(correlation_matrix)[x], colnames(correlation_matrix)[y])
}
}
# Display highly correlated feature pairs
cat("\nHighly Correlated Feature Pairs (correlation > 0.8 or < -0.8):\n")
for (pair in highly_correlated_pairs) {
if (!is.null(pair)) print(pair)
}
##
## Highly Correlated Feature Pairs (correlation > 0.8 or < - 0.8):
## [1] "Exited" "Complain"
The correlation analysis indicates that the feature pair (‘Exited’, ‘Complain’) shows a strong correlation, with a coefficient greater than 0.8 or less than -0.8. Typically, features with such high correlations (above 0.8 or below -0.8) are considered for removal or transformation to address multicollinearity. However, in this case, multicollinearity is less of a concern, as Exited is the target variable, not another predictor. Therefore, retaining both features is justified. The strong correlation between Exited and Complain suggests that customer complaints may serve as a significant indicator for predicting customer churn. In the case of algorithms like decision trees, Complain is likely to emerge as a key feature in the early stages of the model.
It is crucial, however, to be cautious of overfitting, as a feature this strongly correlated with the target can dominate the model and mask other signals. Regularization techniques such as L1 or L2 regularization can mitigate this while retaining informative features; in this analysis, the Complain column is ultimately dropped before model training (see the resampling section below), since its near-perfect correlation with Exited would allow models to trivially reproduce the target.
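As an illustration of the regularization option mentioned above, the following is a minimal sketch using the glmnet package (not part of the original pipeline; the feature subset chosen here is illustrative):
# L1-regularized (lasso) logistic regression; alpha = 1 selects the L1 penalty
library(glmnet)
X <- as.matrix(df_filtered[, c("CreditScore", "Age", "Balance", "Complain")])
y <- df_filtered$Exited
# Cross-validated fit to choose the penalty strength lambda
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.min") # coefficients shrunk to zero are effectively dropped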
After completing feature selection, the final step is to assess data imbalance. The analysis reveals that the class distribution for the target variable, Exited, is imbalanced, with approximately 79.68% of samples in the non-exited class (0) and only 20.32% in the exited class (1).
# Calculate the proportions of each class for the target variable 'Exited'
class_proportions <- prop.table(table(df_filtered$Exited)) * 100
# Print the proportions
print(class_proportions)
## 
##        0        1 
## 79.68375 20.31625
This class imbalance could adversely affect model performance, as certain models may be biased towards the majority class (non-exited) and neglect the minority class (exited), leading to skewed predictions. To address this, two resampling techniques are considered:

- Oversampling the minority class (Exited = 1) using methods such as SMOTE (Synthetic Minority Over-sampling Technique) to generate synthetic samples, thereby increasing representation for the exited class.
- Undersampling the majority class (Exited = 0) to reduce the number of non-exited samples, balancing the dataset by reducing the size of the majority class.
Oversampling is preferred over undersampling primarily because the dataset is of moderate size, not large. While undersampling can be useful for very large datasets by reducing computational load, it may result in a loss of valuable information when the dataset is smaller, which could negatively affect model performance and generalization.
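For contrast, a minimal sketch of the undersampling alternative that was not adopted; it balances the classes by randomly discarding majority-class rows:
# Undersampling sketch (not used here): downsample the majority class
set.seed(42)
minority <- df_filtered[df_filtered$Exited == 1, ]
majority <- df_filtered[df_filtered$Exited == 0, ]
majority_down <- majority[sample(nrow(majority), nrow(minority)), ]
balanced_under <- rbind(majority_down, minority)
table(balanced_under$Exited)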
By utilizing oversampling (e.g., with SMOTE), the original majority class samples are preserved, and the minority class representation is enhanced through synthetic sample generation. This approach minimizes data loss and ensures the model has comprehensive information, supporting balanced training.
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(dplyr) # For data manipulation
library(smotefamily) # For SMOTE
set.seed(42)
# Split data into training and testing sets
train_indices <- createDataPartition(df_filtered$Exited, p = 0.8, list = FALSE)
train_data <- df_filtered[train_indices, ]
test_data <- df_filtered[-train_indices, ]
# Prepare the training data
X_train <- train_data %>% select(-Exited) # Features for training
y_train <- train_data$Exited # Target for training
X_test <- test_data %>% select(-Exited) # Features for testing
y_test <- test_data$Exited # Target for testing
# Check input data dimensions and class distribution
cat("Dimensions of X_train:", dim(X_train), "\n")
cat("Length of y_train:", length(y_train), "\n")
cat("Target class distribution in y_train before SMOTE:\n")
print(table(y_train))
## Dimensions of X_train: 7994 17
## Length of y_train: 7994
## Target class distribution in y_train before SMOTE:
## y_train
## 0 1
## 6388 1606
# Convert the target variables to factors
y_train <- as.factor(y_train)
y_test <- as.factor(y_test)
# Combine features and target into a single training data frame for SMOTE
train_data_combined <- data.frame(X_train, Class = y_train)
# Convert any character or factor columns into numeric before applying SMOTE
train_data_combined <- train_data_combined %>%
mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
mutate(across(where(is.factor), as.numeric))
# Apply SMOTE to balance the dataset (smotefamily appends a lowercase
# 'class' column to the result, which is removed below)
smote_result <- SMOTE(train_data_combined, target = train_data_combined$Class, K = 5, dup_size = 1)
# Extract the balanced dataset
balanced_data <- smote_result$data
# Remove the 'class' column introduced by SMOTE
balanced_data <- balanced_data %>%
select(-class)
# Remove the identifier column, which is irrelevant for training
balanced_data <- balanced_data %>%
select(-CustomerId)
X_test <- X_test %>%
select(-CustomerId)
# Drop 'Complain' because of its near-perfect correlation with the target
balanced_data <- balanced_data %>% select(-Complain)
X_test <- X_test %>% select(-Complain)
# Convert Class back to binary values (the factor codes 1/2 map to 0/1)
balanced_data <- balanced_data %>%
mutate(Class = ifelse(Class == 2, 1, 0))
# Separate features and target from the balanced dataset
X_train_smote <- balanced_data[, -ncol(balanced_data)] # All columns except the last
y_train_smote <- balanced_data$Class # The last column (Class)
y_train_smote <- ifelse(y_train_smote == 1, 1, 0)
# Verify the class distribution after SMOTE
cat("Class distribution after SMOTE:\n")
print(table(y_train_smote))
## Class distribution after SMOTE:
## y_train_smote
## 0 1
## 6388 3212
# Verify dimensions of the final datasets
cat("Dimensions of X_train_smote:", dim(X_train_smote), "\n")
cat("Length of y_train_smote:", length(y_train_smote), "\n")
cat("Dimensions of X_test:", dim(X_test), "\n")
cat("Length of y_test:", length(y_test), "\n")
## Dimensions of X_train_smote: 9600 15
## Length of y_train_smote: 9600
## Dimensions of X_test: 1998 15
## Length of y_test: 1998
The output following the application of SMOTE shows the distribution of the target variable as follows:

- Class 0 (non-exited): 6,388 samples (unchanged after SMOTE).
- Class 1 (exited): 3,212 samples (increased due to oversampling).
SMOTE has successfully balanced the dataset by adding synthetic samples for Class 1. This balanced distribution in the training data should enhance the model’s ability to detect and accurately classify instances of the minority class (“Exited” = 1), providing a richer and more representative dataset for training.
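For intuition, SMOTE creates each synthetic sample by interpolating between a minority-class point and one of its K nearest minority-class neighbors; a minimal sketch of the core idea with hypothetical two-feature points:
# Core SMOTE idea: a synthetic point lies on the segment between a minority
# sample and one of its K nearest minority-class neighbors
p1 <- c(Age = 45, Balance = 120000) # an illustrative minority-class sample
p2 <- c(Age = 52, Balance = 140000) # one of its nearest minority neighbors
gap <- runif(1)                     # random position along the segment
synthetic <- p1 + gap * (p2 - p1)
print(synthetic)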
# Combine the training features and target variable
train_df <- cbind(X_train_smote, Exited = y_train_smote)
# Combine the testing features and target variable
test_df <- cbind(X_test, Exited = y_test)
# Save the training and testing data to CSV files
write.csv(train_df, "bank_customer_churn_train_data_processed.csv", row.names = FALSE)
write.csv(test_df, "bank_customer_churn_test_data_processed.csv", row.names = FALSE)
Resampling is applied only to the training data, as the test set must reflect the real-world class distribution. Altering the balance of the test set would lead to overly optimistic performance metrics and fail to represent the model’s ability to generalize to imbalanced, real-world data. Therefore, the final output of the data cleaning and preprocessing steps consists of two CSV files: one for the training dataset and another for the testing dataset.
The correlation heatmap offers valuable insights into the interrelationships between the variables in the dataset. The subsequent observations highlight the significant positive and negative correlations. The heatmap uses Pearson correlation coefficients, ranging from -1 to 1, where:
- -1 indicates a perfect negative correlation (one variable increases as the other decreases)
- 0 indicates no correlation (the variables are linearly unrelated)
- 1 indicates a perfect positive correlation (both variables increase together)
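For reference, the Pearson coefficient between variables x and y is r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √(Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)²), which is what R's cor() computes by default.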
Exited vs Complain (1.00): This is the strongest positive relationship in the matrix, implying that customers who complain are far more likely to churn (exit). Complaints reflect dissatisfaction with the bank's services, products, or customer support, which translates into a higher probability of exiting; in this dataset, virtually every churned customer had lodged a complaint. This indicates that timely redress of customer grievances and improvement in service quality are crucial for retention, and that monitoring and effective complaint handling could be a key strategy for preventing churn.
GeographyGermany vs Balance (0.40): This is a moderately high positive correlation. Customers in Germany tend to hold higher balances, which may reflect regional income differences or a stronger preference for keeping larger amounts in their accounts.
Exited vs Age (0.29): This shows a moderate positive correlation between Age and Exited. Customers in the older age groups demonstrate a somewhat higher tendency to exit the bank than younger individuals. This could be attributed to changing financial needs or to banking products that no longer align with their current phase of life; older clients may need more flexible retirement planning options or personalized financial plans. The bank could develop age-appropriate financial products for this segment to increase retention among older clients.
Balance vs Exited (0.12): This weak positive correlation suggests that customers with higher balances are slightly more likely to churn. Although the relationship is weak, it is worth investigating further, as the bank could design tailored financial products to retain high-balance customers.
GeographyFrance vs GeographyGermany (-0.58), GeographyFrance vs GeographySpain (-0.58), GeographyGermany vs GeographySpain (-0.33): These negative correlations are an expected artifact of one-hot encoding: a customer assigned to one country necessarily has 0 in the other two geography columns.
Balance vs NumOfProducts (-0.36): This moderately strong negative correlation can be interpreted as customers with fewer products holding higher balances, whereas customers with more products might spread their money across different services. Customers with fewer products may prefer liquid or conservative holdings, while others invest in different financial products such as loans, mutual funds, or shares. This highlights that the bank could target low-product customers with cross-selling strategies to increase sales without affecting customer satisfaction.
Exited vs NumOfProducts (-0.18): This weak negative correlation indicates that customers who hold fewer products may engage less with the bank and eventually leave. Conversely, customers with more products show more commitment to the bank, which reduces churn.
NumOfProducts vs Complain (-0.18): This weak negative correlation suggests that lower engagement with the bank's products is associated with dissatisfaction with its services, while customers who own more products are less likely to complain. The bank could use complaint records to identify and address the underlying issues.
IsActiveMember vs Exited (-0.16): This weak negative correlation indicates that active members are less likely to leave the bank; regular interaction with the bank appears to go hand in hand with customer satisfaction. Banks could build on this to improve retention through personalized offers, loyalty rewards, or automated alerts that keep accounts active.
# Load the required packages for the correlation heatmap
library(ggplot2)
library(corrplot)
# Calculate the correlation matrix for the numerical features
correlation_matrix <- cor(df_filtered[sapply(df_filtered, is.numeric)])
corrplot(
correlation_matrix,
method = "color",
type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(200),
tl.cex = 0.6,
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.5,
mar = c(2, 2, 2, 2),
title = "Correlation Matrix",
cl.cex = 0.7
)
Question 2 covers the analysis using five classification algorithms: Logistic Regression, Decision Tree, Support Vector Machine (SVM), Random Forest, and XGBoost (Extreme Gradient Boosting). The goal is to predict customer churn and evaluate each model on metrics such as accuracy, precision, recall, F1-score, and AUC (Area Under the ROC Curve), and then to identify the model that most accurately predicts whether a customer will churn while minimizing false positive and false negative predictions.
A confusion matrix is used to evaluate the performance of classification models. It provides detailed information about the true labels (actual values) and the predicted labels produced by the model. For a binary classification problem like predicting customer churn, the confusion matrix is a 2x2 table:
True Positives (TP): Customers who churned (Exited = 1) and were correctly predicted as churned.
True Negatives (TN): Customers who did not churn (Exited = 0) and were correctly predicted as non-churned.
False Positives (FP): Customers who did not churn (Exited = 0) but were incorrectly predicted as churned.
False Negatives (FN): Customers who churned (Exited = 1) but were incorrectly predicted as non-churned.
Next the metrics used to determine the model performance are discussed based on the formula and importance.
Accuracy: Accuracy is the proportion of correctly classified instances (both churned and non-churned customers) out of all instances. It provides an overall measure of correctness.
Precision: Measures how many of the instances predicted as positive (churned) are actually positive. Precision is important when the cost of false positives (misclassifying loyal customers as churners) is high.
Recall (Sensitivity or True Positive Rate): Recall measures how many of the actual positive instances (churned customers) were correctly identified by the model. Sensitivity is critical in churn prediction because it ensures most customers who are likely to leave are caught; each missed churner is a lost retention opportunity.
Specificity: Specificity measures how many of the actual negative instances (non-churned customers) were correctly identified by the model. A high specificity indicates that the model rarely flags loyal customers as churners.
F1-Score: The F1-score is the harmonic mean of precision and recall. It balances the trade-off between precision and recall, especially in datasets with imbalanced classes. A high F1-score indicates a good balance between precision and recall.
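For reference, these metrics are computed from the confusion-matrix counts as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
Specificity = TN / (TN + FP)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)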
To answer this question, several R packages are used to facilitate model training, evaluation, and visualization. The table below summarizes the packages used:
Package Name | Description |
---|---|
caret | Provides functions for data preprocessing, train-test splitting, and model evaluation (confusion matrix, metrics). |
e1071 | Supports SVM implementation and kernel-based learning methods. |
rpart | Enables creation of Decision Trees and visualization of tree-based models. |
randomForest | Implements Random Forest for classification and regression tasks. |
xgboost | Provides an optimized gradient boosting framework for high-performance classification models. |
pROC | Facilitates computation and visualization of ROC curves and AUC metrics. |
ggplot2 | Used for creating advanced visualizations, including ROC curves and performance comparisons. |
dplyr | Simplifies data manipulation tasks such as filtering, mutating, and summarizing datasets. |
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(DT)
# -------------------------------------------------------------------
# Prepare Train and Test Datasets
# -------------------------------------------------------------------
train_df <- cbind(X_train_smote, Exited = y_train_smote)
test_df <- cbind(X_test, Exited = y_test)
# Ensure Test Dataset Matches Train Columns
missing_cols <- setdiff(colnames(X_train_smote), colnames(X_test))
for (col in missing_cols) {
X_test[[col]] <- 0
}
X_test <- X_test[, colnames(X_train_smote)]
# Convert Non-Numeric Columns to Numeric in X_test
X_test <- X_test %>%
mutate(across(where(is.character), ~ as.numeric(as.factor(.)))) %>%
mutate(across(where(is.factor), as.numeric))
# Ensure y_test is a Factor
y_test <- as.factor(y_test)
y_test <- factor(y_test, levels = c(0, 1))
# -------------------------------------------------------------------
# Define Helper Functions
# -------------------------------------------------------------------
# Function to Calculate F1-Score
calculate_f1 <- function(conf_matrix) {
precision <- conf_matrix$byClass["Pos Pred Value"]
recall <- conf_matrix$byClass["Sensitivity"]
f1 <- 2 * ((precision * recall) / (precision + recall))
return(round(f1, 4))
}
f1_scores <- data.frame(
Model = character(),
F1_Score = numeric(),
stringsAsFactors = FALSE
)
# -------------------------------------------------------------------
# Logistic Regression Model
# -------------------------------------------------------------------
log_model <- glm(
y_train_smote ~ .,
data = data.frame(X_train_smote, y_train_smote),
family = binomial(link = "logit")
)
log_predictions <- predict(log_model, X_test, type = "response")
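# Classification threshold lowered from the default 0.5 to 0.3 to favor catching churners (the positive class)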
log_class <- ifelse(log_predictions > 0.3, 1, 0)
log_cm <- confusionMatrix(as.factor(log_class), y_test, positive = "1")
log_f1 <- calculate_f1(log_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Logistic Regression", F1_Score = log_f1))
log_cm_output <- capture.output(print(log_cm))
cat(paste0(
"Confusion Matrix for Logistic Regression:\n",
"======================================\n",
paste(log_cm_output, collapse = "\n"),
"\n"
))
## Confusion Matrix for Logistic Regression:
## ======================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1070 102
## 1 504 322
##
## Accuracy : 0.6967
## 95% CI : (0.676, 0.7168)
## No Information Rate : 0.7878
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3262
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7594
## Specificity : 0.6798
## Pos Pred Value : 0.3898
## Neg Pred Value : 0.9130
## Prevalence : 0.2122
## Detection Rate : 0.1612
## Detection Prevalence : 0.4134
## Balanced Accuracy : 0.7196
##
## 'Positive' Class : 1
##
# -------------------------------------------------------------------
# Decision Tree Model
# -------------------------------------------------------------------
tree_model <- rpart(
y_train_smote ~ .,
data = data.frame(X_train_smote, y_train_smote),
method = "class",
parms = list(split = "gini"), # split criterion (alternative: "information")
control = rpart.control(
minsplit = 20, # Minimum number of samples required to split
cp = 0.01, # Complexity parameter (smaller values increase tree depth)
maxdepth = 10 # Limit tree depth to prevent overfitting
)
)
tree_probs <- predict(tree_model, X_test, type = "prob")
tree_predictions <- ifelse(tree_probs[, 2] >= 0.3, 1, 0) # Lower threshold to favor positive class
tree_predictions <- factor(tree_predictions, levels = c(0, 1))
tree_cm <- confusionMatrix(tree_predictions, y_test, positive = "1")
tree_cm_output <- capture.output(print(tree_cm))
tree_f1 <- calculate_f1(tree_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Decision Tree", F1_Score = tree_f1))
cat(paste0(
"Confusion Matrix for Decision Tree:\n",
"======================================\n",
paste(tree_cm_output, collapse = "\n"),
"\n"
))
## Confusion Matrix for Decision Tree:
## ======================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1288 176
## 1 286 248
##
## Accuracy : 0.7688
## 95% CI : (0.7496, 0.7871)
## No Information Rate : 0.7878
## P-Value [Acc > NIR] : 0.9817
##
## Kappa : 0.3683
##
## Mcnemar's Test P-Value : 3.954e-07
##
## Sensitivity : 0.5849
## Specificity : 0.8183
## Pos Pred Value : 0.4644
## Neg Pred Value : 0.8798
## Prevalence : 0.2122
## Detection Rate : 0.1241
## Detection Prevalence : 0.2673
## Balanced Accuracy : 0.7016
##
## 'Positive' Class : 1
##
# -------------------------------------------------------------------
# Support Vector Machine (SVM) Model
# -------------------------------------------------------------------
y_train_smote <- factor(y_train_smote, levels = c(0, 1))
svm_model <- svm(
y_train_smote ~ .,
data = data.frame(X_train_smote, y_train_smote),
kernel = "radial",
cost = 10,
gamma = 0.1,
class.weights = c("0" = 1, "1" = 10),
probability = TRUE
)
svm_probs <- predict(svm_model, X_test, probability = TRUE)
svm_probabilities <- attr(svm_probs, "probabilities")[, "1"]
svm_predictions <- ifelse(svm_probabilities > 0.3, 1, 0)
svm_cm <- confusionMatrix(as.factor(svm_predictions), y_test, positive = "1")
svm_cm_output <- capture.output(print(svm_cm))
svm_f1 <- calculate_f1(svm_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Support Vector Machine (SVM)", F1_Score = svm_f1))
cat(paste0(
"Confusion Matrix for SVM:\n",
"======================================\n",
paste(svm_cm_output, collapse = "\n"),
"\n"
))
## Confusion Matrix for SVM:
## ======================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1251 200
## 1 323 224
##
## Accuracy : 0.7382
## 95% CI : (0.7184, 0.7574)
## No Information Rate : 0.7878
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.2921
##
## Mcnemar's Test P-Value : 9.571e-08
##
## Sensitivity : 0.5283
## Specificity : 0.7948
## Pos Pred Value : 0.4095
## Neg Pred Value : 0.8622
## Prevalence : 0.2122
## Detection Rate : 0.1121
## Detection Prevalence : 0.2738
## Balanced Accuracy : 0.6615
##
## 'Positive' Class : 1
##
# -------------------------------------------------------------------
# Random Forest Model
# -------------------------------------------------------------------
y_train_smote <- factor(y_train_smote, levels = c(0, 1))
rf_model <- randomForest(
  y_train_smote ~ .,
  data = data.frame(X_train_smote, y_train_smote),
  ntree = 300,
  mtry = sqrt(ncol(X_train_smote))
)
rf_probs <- predict(rf_model, X_test, type = "prob")
rf_predictions <- ifelse(rf_probs[, 2] >= 0.3, 1, 0)
rf_predictions <- factor(rf_predictions, levels = c(0, 1))
y_test <- factor(y_test, levels = c(0, 1))
rf_cm <- confusionMatrix(rf_predictions, y_test, positive = "1")
rf_f1 <- calculate_f1(rf_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "Random Forest", F1_Score = rf_f1))
rf_cm_output <- capture.output(print(rf_cm))
cat(paste0(
"Confusion Matrix for Random Forest:\n",
"======================================\n",
paste(rf_cm_output, collapse = "\n"),
"\n"
))
## Confusion Matrix for Random Forest:
## ======================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1351 166
## 1 223 258
##
## Accuracy : 0.8053
## 95% CI : (0.7872, 0.8225)
## No Information Rate : 0.7878
## P-Value [Acc > NIR] : 0.028599
##
## Kappa : 0.445
##
## Mcnemar's Test P-Value : 0.004521
##
## Sensitivity : 0.6085
## Specificity : 0.8583
## Pos Pred Value : 0.5364
## Neg Pred Value : 0.8906
## Prevalence : 0.2122
## Detection Rate : 0.1291
## Detection Prevalence : 0.2407
## Balanced Accuracy : 0.7334
##
## 'Positive' Class : 1
##
# -------------------------------------------------------------------
# XGBoost Model
# -------------------------------------------------------------------
# Convert features to matrices and targets to numeric vectors for xgboost
X_train_matrix <- as.matrix(X_train_smote)
X_test_matrix <- as.matrix(X_test)
y_train_numeric <- as.numeric(as.character(y_train_smote))
y_test_numeric <- as.numeric(as.character(y_test))
xgb_model <- xgboost(
data = X_train_matrix,
label = y_train_numeric,
objective = "binary:logistic",
nrounds = 200,
eval_metric = "logloss",
verbose = 0
)
xgb_predictions <- predict(xgb_model, X_test_matrix)
xgb_class <- ifelse(xgb_predictions > 0.3, 1, 0)
xgb_cm <- confusionMatrix(as.factor(xgb_class), y_test, positive = "1")
xgb_f1 <- calculate_f1(xgb_cm)
f1_scores <- rbind(f1_scores, data.frame(Model = "XGBoost", F1_Score = xgb_f1))
xgb_cm_output <- capture.output(print(xgb_cm))
cat(paste0(
"Confusion Matrix for XGBoost:\n",
"======================================\n",
paste(xgb_cm_output, collapse = "\n"),
"\n"
))
## Confusion Matrix for XGBoost:
## ======================================
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1405 204
## 1 169 220
##
## Accuracy : 0.8133
## 95% CI : (0.7955, 0.8302)
## No Information Rate : 0.7878
## P-Value [Acc > NIR] : 0.002553
##
## Kappa : 0.4243
##
## Mcnemar's Test P-Value : 0.078331
##
## Sensitivity : 0.5189
## Specificity : 0.8926
## Pos Pred Value : 0.5656
## Neg Pred Value : 0.8732
## Prevalence : 0.2122
## Detection Rate : 0.1101
## Detection Prevalence : 0.1947
## Balanced Accuracy : 0.7057
##
## 'Positive' Class : 1
##
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
calculate_metrics <- function(conf_matrix) {
precision <- conf_matrix$byClass["Pos Pred Value"] # Precision/PPV
recall <- conf_matrix$byClass["Sensitivity"] # Sensitivity/Recall
specificity <- conf_matrix$byClass["Specificity"] # Specificity
npv <- conf_matrix$byClass["Neg Pred Value"] # NPV
balanced_acc <- (recall + specificity) / 2 # Balanced Accuracy
accuracy <- conf_matrix$overall["Accuracy"] # Accuracy
f1 <- 2 * ((precision * recall) / (precision + recall)) # F1-Score
return(data.frame(
Accuracy = round(accuracy, 4),
Sensitivity = round(recall, 4),
Specificity = round(specificity, 4),
Precision = round(precision, 4),
NPV = round(npv, 4),
Balanced_Accuracy = round(balanced_acc, 4),
F1_Score = round(f1, 4)
))
}
all_metrics <- data.frame(
Model = character(),
Accuracy = numeric(),
Sensitivity = numeric(),
Specificity = numeric(),
Precision = numeric(),
NPV = numeric(),
Balanced_Accuracy = numeric(),
F1_Score = numeric(),
stringsAsFactors = FALSE
)
# Logistic Regression
if (exists("log_cm")) {
log_metrics <- calculate_metrics(log_cm)
log_metrics$Model <- "Logistic Regression"
all_metrics <- rbind(all_metrics, log_metrics)
}
if (exists("tree_cm")) {
tree_metrics <- calculate_metrics(tree_cm)
tree_metrics$Model <- "Decision Tree"
all_metrics <- rbind(all_metrics, tree_metrics)
}
if (exists("svm_cm")) {
svm_metrics <- calculate_metrics(svm_cm)
svm_metrics$Model <- "SVM"
all_metrics <- rbind(all_metrics, svm_metrics)
}
if (exists("rf_cm")) {
rf_metrics <- calculate_metrics(rf_cm)
rf_metrics$Model <- "Random Forest"
all_metrics <- rbind(all_metrics, rf_metrics)
}
if (exists("xgb_cm")) {
xgb_metrics <- calculate_metrics(xgb_cm)
xgb_metrics$Model <- "XGBoost"
all_metrics <- rbind(all_metrics, xgb_metrics)
}
if (!"Model" %in% colnames(all_metrics)) {
stop("The 'Model' column is missing in the data frame. Please ensure all models were evaluated.")
}
all_metrics <- all_metrics[, c("Model", "Accuracy", "Sensitivity", "Specificity",
"Precision", "NPV", "Balanced_Accuracy", "F1_Score")]
all_metrics %>%
kbl(caption = "Evaluation Metrics for All Models", row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE, position = "center") %>%
column_spec(1, bold = TRUE, border_left = TRUE) %>%
column_spec(ncol(all_metrics), border_right = TRUE)
Model | Accuracy | Sensitivity | Specificity | Precision | NPV | Balanced_Accuracy | F1_Score |
---|---|---|---|---|---|---|---|
Logistic Regression | 0.6967 | 0.7594 | 0.6798 | 0.3898 | 0.9130 | 0.7196 | 0.5152 |
Decision Tree | 0.7688 | 0.5849 | 0.8183 | 0.4644 | 0.8798 | 0.7016 | 0.5177 |
SVM | 0.7382 | 0.5283 | 0.7948 | 0.4095 | 0.8622 | 0.6615 | 0.4614 |
Random Forest | 0.8053 | 0.6085 | 0.8583 | 0.5364 | 0.8906 | 0.7334 | 0.5702 |
XGBoost | 0.8133 | 0.5189 | 0.8926 | 0.5656 | 0.8732 | 0.7057 | 0.5412 |
The performance of five models, Logistic Regression, Decision Tree, Support Vector Machine (SVM), Random Forest, and XGBoost, was evaluated using accuracy, sensitivity (ability to correctly identify churners), specificity (ability to avoid false alarms), positive predictive value (PPV, or precision), negative predictive value (NPV), balanced accuracy, and F1-score. Among these, Random Forest demonstrated the best overall performance, achieving a balanced accuracy of 73.34%, sensitivity of 60.85%, and a PPV of 53.64%. Its F1-score of 0.57 highlights its effectiveness in balancing precision and recall, making it the most reliable model. XGBoost also performed well, with high specificity (89.26%) and a PPV of 56.56%, though its sensitivity and F1-score were slightly lower than Random Forest's.
The Decision Tree and SVM models exhibited moderate performance but struggled to identify churners, as reflected in their lower sensitivity scores of 58.49% and 52.83%, respectively. This limitation makes them less suitable for churn prediction, where catching potential churners is crucial. Logistic Regression, while simple and interpretable, had the lowest precision (PPV of 38.98%), indicating a tendency to produce many false positives. Despite having the highest sensitivity (75.94%), this imbalance makes Logistic Regression less practical, as it could trigger unnecessary customer interventions.
Model performance was improved through hyperparameter tuning. Random Forest was configured with 300 decision trees (ntree = 300) and the optimal number of features for splitting (mtry = sqrt(features)), leading to consistent performance across all metrics. For XGBoost, parameters such as the learning rate (eta = 0.1) were adjusted, and 200 boosting rounds were implemented to improve its handling of imbalanced data. The SVM model was optimized using a radial kernel, with parameters like gamma = 0.1 and cost = 10. Additionally, class weights were applied to enhance its ability to detect churners. Despite these adjustments, SVM underperformed in precision and sensitivity relative to the other models.
In summary, Random Forest emerged as the most effective model for predicting customer churn, offering a strong balance between sensitivity, precision, and F1-score. Although XGBoost showed promising results, the superior overall metrics of Random Forest make it the preferred choice. Further improvements could involve advanced tuning techniques or ensemble methods to enhance model performance and better manage imbalanced data.
# Load required libraries
library(pROC)
library(ggplot2)
# Ensure y_test is numeric
y_test <- as.numeric(as.character(y_test))
# Function to calculate ROC for a model
get_roc <- function(y_true, y_pred_prob, model_name) {
if (length(y_true) != length(y_pred_prob) || any(is.na(y_pred_prob))) {
stop(paste("Invalid input data for", model_name))
}
roc(y_true, y_pred_prob)
}
cat("\n ROC Plot: \n")
##
## ROC Plot:
# Logistic Regression
log_prob <- predict(log_model, X_test, type = "response")
roc_log <- get_roc(y_test, log_prob, "Logistic Regression")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_log <- auc(roc_log)
# Decision Tree
tree_prob <- predict(tree_model, X_test, type = "prob")[, 2]
roc_tree <- get_roc(y_test, tree_prob, "Decision Tree")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_tree <- auc(roc_tree)
# Support Vector Machine (SVM)
# The column order of e1071's "probabilities" attribute follows the factor levels
# seen at training time, which is why pROC flips its direction below
svm_prob <- attr(predict(svm_model, X_test, probability = TRUE), "probabilities")[, 2]
roc_svm <- get_roc(y_test, svm_prob, "SVM")
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
auc_svm <- auc(roc_svm)
# Random Forest
rf_prob <- predict(rf_model, X_test, type = "prob")[, 2]
roc_rf <- get_roc(y_test, rf_prob, "Random Forest")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_rf <- auc(roc_rf)
# XGBoost
xgb_prob <- predict(xgb_model, X_test_matrix)
roc_xgb <- get_roc(y_test, xgb_prob, "XGBoost")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc_xgb <- auc(roc_xgb)
roc_results <- list(
LogisticRegression = roc_log,
DecisionTree = roc_tree,
SVM = roc_svm,
RandomForest = roc_rf,
XGBoost = roc_xgb
)
# legacy.axes = TRUE puts 1 - specificity on the x-axis, matching the axis labels below
roc_combined_plot <- ggroc(roc_results, aes = "color", legacy.axes = TRUE) +
ggtitle("ROC Curves for All Models") +
theme_minimal() +
scale_color_manual(
values = c("LogisticRegression" = "purple",
"DecisionTree" = "green",
"SVM" = "blue",
"RandomForest" = "red",
"XGBoost" = "darkorange")
) +
labs(color = "Model") +
xlab("False Positive Rate") +
ylab("True Positive Rate")
print(roc_combined_plot)
auc_table <- data.frame(
  Model = c("Logistic Regression", "Decision Tree", "Support Vector Machine (SVM)",
            "Random Forest", "XGBoost"),
  AUC = c(round(auc_log, 4), round(auc_tree, 4), round(auc_svm, 4),
          round(auc_rf, 4), round(auc_xgb, 4))
)
cat("\nAUC Values for All Models\n")
##
## AUC Values for All Models
# Render the AUC table with the same kableExtra styling used above
auc_table %>%
  kbl(caption = "AUC Values for All Models", row.names = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = TRUE, position = "center")
Model | AUC |
---|---|
Logistic Regression | 0.7856 |
Decision Tree | 0.7622 |
Support Vector Machine (SVM) | 0.7326 |
Random Forest | 0.8219 |
XGBoost | 0.7937 |
Above are the Area Under the Curve (AUC) values obtained for the models. A higher AUC indicates better discrimination, with 1.0 being perfect and 0.5 no better than random guessing. In this analysis, Random Forest achieved the highest AUC of 0.8219, indicating it is the most effective at distinguishing between the two classes. XGBoost followed with an AUC of 0.7937, while Logistic Regression and the Decision Tree performed moderately well with AUC values of 0.7856 and 0.7622, respectively. The SVM had the lowest AUC at 0.7326.
Based on the evaluation metrics, Random Forest emerged as the best-performing model for predicting customer churn, with the highest AUC value, strong accuracy, and balanced precision-recall performance. This indicates it effectively identifies churned customers while minimizing false predictions. XGBoost was a close second, showing strong overall performance and good reliability.
Other models, such as SVM, Logistic Regression, and the Decision Tree, produced reasonable results but were less effective than Random Forest and XGBoost. For this dataset, therefore, Random Forest is the most suitable model for accurately predicting customer churn, with XGBoost a strong alternative.
Customer churn remains a critical challenge for the banking industry, requiring proactive measures to retain customers and protect profitability. This study set out to predict customer churn using a comprehensive dataset and machine learning techniques. The dataset underwent rigorous preprocessing, including feature engineering, scaling, normalization, and class-imbalance handling with SMOTE. These steps ensured the data was suitable for building robust models that could accurately predict churn while addressing the inherent class imbalance in the data.
For Question 1, the analysis of correlations revealed key factors influencing customer churn. Complaints showed the strongest positive correlation with churn, highlighting dissatisfaction as a critical predictor. Additionally, older age groups and higher balances were moderately correlated with churn, suggesting the need for tailored financial products for these segments. On the negative side, fewer products owned and lower engagement levels were linked to higher churn rates, emphasizing the importance of cross-selling strategies and customer engagement initiatives to improve retention.
In Question 2, among the five machine learning models evaluated (Logistic Regression, Decision Tree, Support Vector Machine, Random Forest, and XGBoost), Random Forest demonstrated the best overall performance, with the highest balanced accuracy (73.34%), F1-score (0.57), and AUC (0.822). These metrics indicate that Random Forest strikes a strong balance between identifying true churners (sensitivity) and avoiding false alarms (precision). XGBoost was a close second, offering competitive results with high specificity (89.26%) and precision (56.56%). Both models outperformed the simpler algorithms, such as Logistic Regression and the Decision Tree, which struggled with the imbalanced data and delivered lower sensitivity and precision.
In conclusion, Random Forest is the most suitable model for predicting customer churn in this dataset, providing actionable insights for retaining customers. While XGBoost is a viable alternative, Random Forest’s consistent performance across all metrics makes it the preferred choice. Future work could involve exploring ensemble techniques, hyperparameter optimization, and additional features to further improve model accuracy and interpretability. These insights equip financial institutions with a data-driven framework to identify at-risk customers, enabling targeted interventions to improve retention and reduce churn rates.