Group 1 Members

Num	Matric Number	Full Name
1	24215430	WANG YANRONG
2	25066551	XU JING
3	24069475	HU RUI
4	U2103571	SOO YAN LIN
5	U2103393	YAH HONG XUAN

Chapter 1 Introduction

1.1 Project Background

Credit cards are widely used in modern financial systems and generate rich behavioural data that reflect customers’ spending, repayment, and credit utilisation patterns. These data provide valuable insights into customer behaviour but are often high-dimensional and complex, making direct interpretation through simple descriptive analysis challenging.

Traditional customer analysis in the financial sector typically relies on predefined rules or manually constructed categories, which may overlook behavioural structures naturally embedded in the data. As a result, data-driven analytical approaches have become increasingly important for systematically uncovering behavioural patterns in large-scale financial datasets. In the absence of labelled outcomes, unsupervised learning methods particularly clustering techniques offer an effective means of grouping customers based on similarities in credit card usage, enabling the identification of latent behavioural segments without prior assumptions.

However, clustering alone is not sufficient for predictive analysis or automated decision-making. Supervised learning techniques, including regression and classification, can complement clustering by modelling relationships between behavioural variables and enabling the categorization of customers into meaningful groups. Integrating unsupervised and supervised learning methods therefore provides a more comprehensive framework for credit card customer profiling.

1.2 Dataset Overview

The analysis is based on a cleaned credit card customer dataset containing 8,950 observations and 17 variables. Each observation represents an individual credit card customer described by a set of behavioural and financial attributes.

The variables can be broadly categorised into three groups. Monetary-related variables (e.g., BALANCE, PURCHASES, PAYMENTS, and CASH_ADVANCE) capture the monetary scale of credit card usage. Frequency-related variables (e.g., PURCHASES_FREQUENCY and CASH_ADVANCE_FREQUENCY) describe how regularly customers use their credit cards. In addition, behavioural proportion variables such as PRC_FULL_PAYMENT reflect customers’ repayment behaviour.

The dataset does not contain predefined class labels, making it suitable for unsupervised learning techniques such as clustering. A small proportion of missing values is present in several behavioural variables, which motivates the need for appropriate data preprocessing prior to analysis.

1.3 Problem Statement

Although credit card datasets contain rich behavioural information related to spending, payment, and credit usage, several challenges hinder effective analysis of customer behaviour. First, customer usage patterns are highly heterogeneous and cannot be readily understood from raw transactional variables alone, making it difficult to characterize behavioural trends through simple descriptive analysis. Second, the relationships between key behavioural attributes, such as balance levels, amount of purchase, transaction activity and repayment behaviour remain insufficiently understood. Third, the lack of predefined outcome labels in credit card behavioural data restricts the application of supervised learning approaches for customer segmentation.

Without a structured analytical framework that integrates exploratory data analysis, regression-based modelling, and unsupervised clustering techniques, it is challenging to systematically analyse customer behaviour or to extract meaningful and interpretable customer segments from credit card data.

1.4 Research Objectives

The objectives of this study are as follows:

To explore and describe credit card customer behaviour through exploratory data analysis (EDA), focusing on spending, payment, and credit usage characteristics.
To examine the relationships between customer behavioural variables and total purchase amount by applying multiple linear regression analysis.
To identify and interpret distinct customer segments based on credit card usage patterns using unsupervised clustering techniques, supported by appropriate visualization and evaluation methods.

Chapter 2 Data Preprocessing

2.1 Import Data

library(readr)
library(tidyr)
library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(car)

## Loading required package: carData

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.1
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ purrr::lift()   masks caret::lift()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some()   masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(corrplot)

## corrplot 0.94 loaded

library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8

library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(dplyr)
library(gridExtra)

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library(cluster)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(fmsb)

## Registered S3 methods overwritten by 'fmsb':
##   method    from
##   print.roc pROC
##   plot.roc  pROC

df <- read_csv("C:/Users/User/Desktop/Msc Data Science/WQD7004 PROGRAMMING FOR DATA SCIENCE/df_main_wqd7004.csv", 
    col_types = cols(...1 = col_skip(), BALANCE = col_number(), 
        BALANCE_FREQUENCY = col_number(), 
        PURCHASES = col_number(), ONEOFF_PURCHASES = col_number(), 
        INSTALLMENTS_PURCHASES = col_number(), 
        CASH_ADVANCE = col_number(), PURCHASES_FREQUENCY = col_number(), 
        ONEOFF_PURCHASES_FREQUENCY = col_number(), 
        PURCHASES_INSTALLMENTS_FREQUENCY = col_number(), 
        CASH_ADVANCE_FREQUENCY = col_number(), 
        CASH_ADVANCE_TRX = col_number(), 
        PURCHASES_TRX = col_number(), CREDIT_LIMIT = col_number(), 
        PAYMENTS = col_number(), MINIMUM_PAYMENTS = col_number(), 
        PRC_FULL_PAYMENT = col_number(), 
        TENURE = col_integer()))

## New names:
## • `` -> `...1`

library(dplyr)
overview_table <- data.frame(
  Variable = names(df),
  Data_Type = sapply(df, class),
  Unique_Values = sapply(df, function(x) length(unique(x))),
  Missing_Count = sapply(df, function(x) sum(is.na(x))),
  Missing_Percentage = round(sapply(df, function(x) sum(is.na(x))) / nrow(df) * 100, 2),
  stringsAsFactors = FALSE
)
overview_table$Duplicate_Values <- nrow(df) - overview_table$Unique_Values
overview_table

##                                                          Variable Data_Type
## CUST_ID                                                   CUST_ID character
## BALANCE                                                   BALANCE   numeric
## BALANCE_FREQUENCY                               BALANCE_FREQUENCY   numeric
## PURCHASES                                               PURCHASES   numeric
## ONEOFF_PURCHASES                                 ONEOFF_PURCHASES   numeric
## INSTALLMENTS_PURCHASES                     INSTALLMENTS_PURCHASES   numeric
## CASH_ADVANCE                                         CASH_ADVANCE   numeric
## PURCHASES_FREQUENCY                           PURCHASES_FREQUENCY   numeric
## ONEOFF_PURCHASES_FREQUENCY             ONEOFF_PURCHASES_FREQUENCY   numeric
## PURCHASES_INSTALLMENTS_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY   numeric
## CASH_ADVANCE_FREQUENCY                     CASH_ADVANCE_FREQUENCY   numeric
## CASH_ADVANCE_TRX                                 CASH_ADVANCE_TRX   numeric
## PURCHASES_TRX                                       PURCHASES_TRX   numeric
## CREDIT_LIMIT                                         CREDIT_LIMIT   numeric
## PAYMENTS                                                 PAYMENTS   numeric
## MINIMUM_PAYMENTS                                 MINIMUM_PAYMENTS   numeric
## PRC_FULL_PAYMENT                                 PRC_FULL_PAYMENT   numeric
## TENURE                                                     TENURE   integer
##                                  Unique_Values Missing_Count Missing_Percentage
## CUST_ID                                   8950             0               0.00
## BALANCE                                   8871             0               0.00
## BALANCE_FREQUENCY                           43             0               0.00
## PURCHASES                                 5915           425               4.75
## ONEOFF_PURCHASES                          4014             0               0.00
## INSTALLMENTS_PURCHASES                    4452             0               0.00
## CASH_ADVANCE                              4323             0               0.00
## PURCHASES_FREQUENCY                         47             0               0.00
## ONEOFF_PURCHASES_FREQUENCY                  47             0               0.00
## PURCHASES_INSTALLMENTS_FREQUENCY            47             0               0.00
## CASH_ADVANCE_FREQUENCY                      54             0               0.00
## CASH_ADVANCE_TRX                            65             0               0.00
## PURCHASES_TRX                              173             0               0.00
## CREDIT_LIMIT                               206             1               0.01
## PAYMENTS                                  8711             0               0.00
## MINIMUM_PAYMENTS                          8637           313               3.50
## PRC_FULL_PAYMENT                            48           413               4.61
## TENURE                                       7             0               0.00
##                                  Duplicate_Values
## CUST_ID                                         0
## BALANCE                                        79
## BALANCE_FREQUENCY                            8907
## PURCHASES                                    3035
## ONEOFF_PURCHASES                             4936
## INSTALLMENTS_PURCHASES                       4498
## CASH_ADVANCE                                 4627
## PURCHASES_FREQUENCY                          8903
## ONEOFF_PURCHASES_FREQUENCY                   8903
## PURCHASES_INSTALLMENTS_FREQUENCY             8903
## CASH_ADVANCE_FREQUENCY                       8896
## CASH_ADVANCE_TRX                             8885
## PURCHASES_TRX                                8777
## CREDIT_LIMIT                                 8744
## PAYMENTS                                      239
## MINIMUM_PAYMENTS                              313
## PRC_FULL_PAYMENT                             8902
## TENURE                                       8943

glimpse(df)

## Rows: 8,950
## Columns: 18
## $ CUST_ID                          <chr> "C10001", "C10002", "C10003", "C10004…
## $ BALANCE                          <dbl> 40.90075, 3202.46742, 2495.14886, 166…
## $ BALANCE_FREQUENCY                <dbl> 0.818182, 0.909091, 1.000000, 0.63636…
## $ PURCHASES                        <dbl> 95.40, NA, 773.17, 1499.00, 16.00, 13…
## $ ONEOFF_PURCHASES                 <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0…
## $ INSTALLMENTS_PURCHASES           <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2…
## $ CASH_ADVANCE                     <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, …
## $ PURCHASES_FREQUENCY              <dbl> 0.166667, 0.000000, 1.000000, 0.08333…
## $ ONEOFF_PURCHASES_FREQUENCY       <dbl> 0.000000, 0.000000, 1.000000, 0.08333…
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000…
## $ CASH_ADVANCE_FREQUENCY           <dbl> 0.000000, 0.250000, 0.000000, 0.08333…
## $ CASH_ADVANCE_TRX                 <dbl> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PURCHASES_TRX                    <dbl> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, …
## $ CREDIT_LIMIT                     <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1…
## $ PAYMENTS                         <dbl> 201.8021, 4103.0326, 622.0667, 0.0000…
## $ MINIMUM_PAYMENTS                 <dbl> 139.50979, 1072.34022, 627.28479, NA,…
## $ PRC_FULL_PAYMENT                 <dbl> 0.000000, 0.222222, 0.000000, 0.00000…
## $ TENURE                           <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…

2.1 Import Data

Out of the 17 variables, four variables were found to contain missing values. Table below summarises the affected variables, the number of missing observations, and the proportion of missingness relative to the total number of records (8,950). The proportion of missing values for all variables is below 5%, indicating that the dataset remains largely complete. Nonetheless, suitable imputation strategies were required to preserve information and avoid bias, particularly for behavioural variables relevant to clustering.

# Get the variables that have missing values
missing_only_table <- overview_table %>%
  filter(Missing_Count > 0)

missing_only_table

##                          Variable Data_Type Unique_Values Missing_Count
## PURCHASES               PURCHASES   numeric          5915           425
## CREDIT_LIMIT         CREDIT_LIMIT   numeric           206             1
## MINIMUM_PAYMENTS MINIMUM_PAYMENTS   numeric          8637           313
## PRC_FULL_PAYMENT PRC_FULL_PAYMENT   numeric            48           413
##                  Missing_Percentage Duplicate_Values
## PURCHASES                      4.75             3035
## CREDIT_LIMIT                   0.01             8744
## MINIMUM_PAYMENTS               3.50              313
## PRC_FULL_PAYMENT               4.61             8902

2.2 Data Cleaning

2.2.1 Imputation for `MINIMUM_PAYMENTS`

The variable MINIMUM_PAYMENTS represents the minimum amount of payment paid by a customer in a billing cycle. Its value is closely related to the actual payment made by the customer (PAYMENTS). Therefore, a rule-based conditional imputation strategy was applied. If PAYMENTS = 0, then MINIMUM_PAYMENTS was set to 0, reflecting non-payment behaviour. If 0 < PAYMENTS < mean(PAYMENTS), then MINIMUM_PAYMENTS was set equal to PAYMENTS, assuming that customers paying small amounts tend to meet only the minimum requirement. Otherwise, MINIMUM_PAYMENTS was imputed using the overall mean of MINIMUM_PAYMENTS, representing a typical minimum payment value. This approach preserves realistic repayment behaviour while avoiding arbitrary statistical substitution.

df_main <- df
mean_payments <- mean(df_main$PAYMENTS, na.rm = TRUE)
mean_min_payments <- mean(df_main$MINIMUM_PAYMENTS, na.rm = TRUE)

df_main <- df_main %>%
  mutate(
    MINIMUM_PAYMENTS = case_when(
      is.na(MINIMUM_PAYMENTS) & PAYMENTS == 0 ~ 0,
      is.na(MINIMUM_PAYMENTS) & PAYMENTS > 0 & PAYMENTS < mean_payments ~ PAYMENTS,
      is.na(MINIMUM_PAYMENTS) ~ mean_min_payments,
      TRUE ~ MINIMUM_PAYMENTS
    )
  )

2.2.2 Imputation for `PRC_FULL_PAYMENT`

The variable PRC_FULL_PAYMENT indicates the proportion of the statement balance paid in full by the customer. Exploratory inspection revealed that this variable is highly skewed, with a substantial proportion of customers exhibiting values close to zero. To address this, missing values in PRC_FULL_PAYMENT were imputed using the median rather than the mean. The median is a robust measure of central tendency that is less influenced by extreme values and skewed distributions. Given that full payment behaviour is unevenly distributed across customers, median imputation provides a more representative estimate of typical repayment behaviour and avoids inflating values due to a small number of high full-payment observations.

median_prc <- median(df_main$PRC_FULL_PAYMENT, na.rm = TRUE)

df_main <- df_main %>%
  mutate(
    PRC_FULL_PAYMENT = ifelse(
      is.na(PRC_FULL_PAYMENT),
      median_prc,
      PRC_FULL_PAYMENT
    )
  )

2.2.3 Drop the single row which has missing value for `CREDIT_LIMIT`

The variable CREDIT_LIMIT contained only one missing observation (0.01% of the dataset). Given the negligible proportion of missingness and the importance of maintaining data integrity for credit-related variables, the single affected record was removed from the dataset. This decision avoids introducing unnecessary estimation error while having no material impact on the overall analysis.

df_main <- df_main %>%
  filter(!is.na(CREDIT_LIMIT))

2.2.4 Imputation for `PURCHASES`

The variable PURCHASES captures the total amount spent by a customer and is conceptually composed of two subcomponents, which are ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES. Preliminary analysis showed a strong relationship between PURCHASES and these two variables. Therefore, missing values in PURCHASES were imputed using a linear regression model, with ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES as predictors.

This regression-based imputation leverages the underlying transaction structure of the dataset and ensures that imputed values are consistent with observed purchasing behaviour. The strong association between these variables will be further demonstrated through a correlation heatmap in the subsequent Exploratory Data Analysis section.

train_data <- df_main %>%
  filter(!is.na(PURCHASES))

reg_model <- lm(
  PURCHASES ~ ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES,
  data = train_data
)
summary(reg_model)

## 
## Call:
## lm(formula = PURCHASES ~ ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -595.97    0.15    0.16    0.38   14.15 
## 
## Coefficients:
##                          Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)            -1.577e-01  1.217e-01    -1.296    0.195    
## ONEOFF_PURCHASES        1.000e+00  7.251e-05 13793.693   <2e-16 ***
## INSTALLMENTS_PURCHASES  9.994e-01  1.277e-04  7823.878   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.981 on 8521 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.776e+08 on 2 and 8521 DF,  p-value: < 2.2e-16

missing_purchases <- df_main %>%
  filter(is.na(PURCHASES))

pred_values <- predict(reg_model, newdata = missing_purchases)

# Replace negative predictions with 0
pred_values <- ifelse(pred_values < 0, 0, pred_values)

df_main$PURCHASES[is.na(df_main$PURCHASES)] <- pred_values

Check if there’s any missing values

sum(is.na(df_main))

## [1] 0

2.3 Data Transformation

2.3.1 Checking outliers

Following missing value treatment, an outlier assessment was conducted using boxplots for all numerical variables. The analysis revealed a large number of extreme values, particularly for monetary variables such as PURCHASES, BALANCE, CASH_ADVANCE, and PAYMENTS. These outliers reflect genuine customer behaviour (e.g., high spenders) rather than data errors and are therefore retained.

numeric_cols <- df_main %>%
  select(where(is.numeric))
  
# Convert to long format
df_long <- numeric_cols %>%
  pivot_longer(
    cols = everything(),
    names_to = "Variable",
    values_to = "Value"
  )

# Plot boxplots
ggplot(df_long, aes(x = Value)) +
  geom_boxplot(fill = "steelblue", outlier.size = 0.6, outlier.alpha = 0.5) +
  facet_wrap(~ Variable, scales = "free_x", ncol = 4) +
  theme_minimal(base_size = 11) +
  theme(
    strip.text = element_text(face = "bold", size = 9),
    axis.title.x = element_blank(),
    axis.text.y = element_blank(),
    panel.grid.major.y = element_blank()
  ) +
  labs(
    title = "Distribution and Outlier Characteristics of Credit Card Usage Variables",
    subtitle = "Boxplots prior to log transformation and robust scaling"
  )

2.3.2 Log transformation and robust scaling

Before scaling, logarithmic transformation was applied to selected monetary and count-based variables, including balance amounts, purchase values, payment amounts, credit limits, and transaction counts, due to their highly right-skewed distributions and the presence of extreme values. These variables span several orders of magnitude, with a small number of customers exhibiting exceptionally high spending or transaction activity. Applying a log transformation reduces skewness, compresses extreme values, and stabilizes variance, thereby improving numerical stability and model performance in both regression and clustering analyses. The log1p transformation was used to safely accommodate zero-valued observations while preserving relative differences among smaller values. In contrast, proportion and frequency-based variables bounded between fixed ranges were not transformed to maintain interpretability and avoid distortion of behavioural intensity near zero.

log_cols <- c(
  "BALANCE",
  "PURCHASES",
  "ONEOFF_PURCHASES",
  "INSTALLMENTS_PURCHASES",
  "CASH_ADVANCE",
  "PAYMENTS",
  "MINIMUM_PAYMENTS",
  "CREDIT_LIMIT",
  "PURCHASES_TRX",
  "CASH_ADVANCE_TRX"
)

df_log <- df_main
df_log[log_cols] <- lapply(df_log[log_cols], log1p)

Robust scaling was used to mitigate the influence of outliers while preserving the inherent structure of non-monetary variables.

numeric_cols <- df_log %>%
  select(where(is.numeric))

robust_scaler <- preProcess(
  numeric_cols,
  method = c("center", "scale"),
  center = apply(numeric_cols, 2, median),
  scale = apply(numeric_cols, 2, IQR)
)

df_scaled <- predict(robust_scaler, numeric_cols)

Chapter 3 Exploratory Data Analysis (EDA)

3.1 Univariate Analysis

Monetary Variables

amount_vars <- df_main %>%
select(
BALANCE,
PURCHASES,
CASH_ADVANCE,
PAYMENTS,
MINIMUM_PAYMENTS,
CREDIT_LIMIT
)

amount_vars %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
geom_freqpoly(
    bins = 50,
    color = "lightblue",
    linewidth = 0.8
  ) +
facet_wrap(~ name, scales = "free") +
theme_minimal() +
labs(
title = "Distribution of Amount-related Variables",
x = "Value",
y = "Count"
)

amount_vars %>%
  pivot_longer(cols = everything()) %>%
  filter(value > 0) %>% 
  ggplot(aes(x = value, y = name, fill = name)) +
  geom_boxplot(
    outlier.alpha = 0.4,
    outlier.size = 1
  ) +
  scale_x_log10() +
  theme_bw() +
  theme(
    legend.position = "none",
    axis.title.y = element_blank()
  ) +
  labs(
    title = "Boxplot of Amount Variables (Log Scale)",
    x = "Value (log scale)"
  )

The distributions of amount-related variables, including balance, purchase, cash_advance and paymentare highly right-skewed, with most customers exhibiting relatively low balances and purchase amounts, while a small number of customers show extremely large values. After logarithmic transformation, the boxplot continue to display numerous high-value outliers, indicating significant differences in the scale of fund usage among customers. In contrast, the distribution of credit_limit appears comparatively smoother, but a certain proportion of high-limit accounts are still retained. This suggests a layered structure in credit limits within the customer population.

Usage Frequency Variables

freq_vars <- df_main %>%
select(
BALANCE_FREQUENCY,
PURCHASES_FREQUENCY,
ONEOFF_PURCHASES_FREQUENCY,
PURCHASES_INSTALLMENTS_FREQUENCY,
CASH_ADVANCE_FREQUENCY,
PRC_FULL_PAYMENT
)

freq_vars %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 20, fill = "steelblue", alpha = 0.7) +
facet_wrap(~ name, scales = "free") +
theme_minimal() +
labs(
title = "Distribution of Frequency-related Variables",
x = "Frequency",
y = "Count"
)

Frequency-related variables display more discrete distribution patterns. Balance_frequency and purchases_frequency are highly concentrated near 1, indicating that most customers use their credit cards on a regular basis. However, oneoff_purchases_frequency, cash_advance_frequency and prc_full_payment show clear spikes near 0, suggesting that a large proportion of customers rarely make one-off purchases, use cash advances, or pay their balances in full. Overall, these univariate distributions reveal significant disparities in spending patterns and usage behaviors among credit card users.

3.2 Bivariate Analysis

Purchases vs Cash Advance

ggplot(df_main, aes(x = PURCHASES, y = CASH_ADVANCE)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Purchases vs Cash Advance",
x = "Cash Advance Amount",
y = "Purchases Amount"
)

Purchases(log scale) vs Cash Advance Amount(log scale)

ggplot(df_main, aes(x = PURCHASES, y = CASH_ADVANCE)) +
geom_point(alpha = 0.35,color = "steelblue") +
scale_x_log10() +
scale_y_log10() +
theme_minimal() +
labs(
title = "Purchases(log scale) vs Cash Advance Amount(log scale)",
x = "Purchases (log scale)",
y = "Cash Advance Amount (log scale)"
)

## Warning in scale_x_log10(): log-10 transformation introduced infinite values.

## Warning in scale_y_log10(): log-10 transformation introduced infinite values.

The relationship between purchase and cash_advance shows a typical “L-shaped” distribution pattern. Most observations are concentrated along the two axes, while the central area is relatively sparse. This indicates that most customers tend to rely either on card purchases or on cash advances, whereas very few customers frequently and heavily engage in both behaviors at the same time. Users with zero PURCHASES are likely not typical credit card users. Some of them may hold the card mainly for emergency cash withdrawals, while others may be inactive accounts opened for promotional benefits but never truly used. Meanwhile, CASH_ADVANCE exhibits a pronounced spike near zero, indicating the fact that cash advances usually involve higher fees and interest costs.

Credit Limit vs Purchases

ggplot(df_main, aes(x = CREDIT_LIMIT, y = PURCHASES)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Credit Limit vs Purchases",
x = "Credit Limit",
y = "Purchases Amount"
)

Credit Limit vs Payments

ggplot(df_main, aes(x = CREDIT_LIMIT, y = PAYMENTS)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Credit Limit vs Payments",
x = "Credit Limit",
y = "Payments Amount"
)

Credit limit shows a moderate positive correlation with both purchases and repayment amounts. However, the credit limits of the majority of customers are still concentrated below 5,000, indicating that increases in credit limit do not substantially alter the overall distribution structure at the population level. Although higher credit limits are generally associated with larger spending and repayment volumes, a large number of customers hold relatively high credit limits but exhibit low activity. The explanatory power for actual behavior remains limited.

Purchases Frequency vs Purchases

ggplot(df_main, aes(x = PURCHASES_FREQUENCY, y = PURCHASES)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Purchases Frequency vs Purchases",
x = "Purchases Frequency",
y = "Purchases Amount"
)

Cash Advance Frequency vs Cash Advance

ggplot(df_main, aes(x = CASH_ADVANCE_FREQUENCY, y = CASH_ADVANCE)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Cash Advance Frequency vs Cash Advance",
x = "Cash Advance Frequency",
y = "Cash Advance Amount"
)

The relationship between frequency variables and their corresponding monetary variables reveals a strong consistency between the two. Higher transaction frequency is generally associated with larger cumulative amounts, with high-frequency users mainly concentrated in higher spending ranges. This suggests that transaction frequency can serve as a useful indicator of customer activity level and overall spending scale.

Customer Behaviour Pattern

plot_df_main <- df_main %>%
  mutate(
    log_purchases = log10(PURCHASES + 1),
    log_oneoff = log10(ONEOFF_PURCHASES + 1),
    log_installments = log10(INSTALLMENTS_PURCHASES + 1)
  )


p1 <- ggplot(plot_df_main, aes(x = log_oneoff, y = log_purchases)) +
  geom_point(alpha = 0.4, size = 1, color = "#1f77b4") +
  theme_minimal(base_size = 10) +
  labs(
    title = "ONEOFF_PURCHASES vs PURCHASES",
    x = "ONEOFF_PURCHASES",
    y = "PURCHASES"
  )

p2 <- ggplot(plot_df_main, aes(x = log_installments, y = log_purchases)) +
  geom_point(alpha = 0.4, size = 1, color = "#1f77b4") +
  theme_minimal(base_size = 10) +
  labs(
    title = "INSTALLMENTS_PURCHASES vs PURCHASES",
    x = "INSTALLMENTS_PURCHASES",
    y = "PURCHASES"
  )

grid.arrange(p1, p2, ncol = 2)

Based on the combined relationships among oneoff_purchases, installments_purchases and cash_advance, credit card usage behavior can be grouped into three representative customer segments.

The first group consists of one-off Users, who only make one-off card purchases and do not use installment payments or cash advances. In ONEOFF_PURCHASES plot, these customers lie close to the diagonal line, while forming along the Y-axis (x = 0) in the installments_purchases plot. This indicates that 100% of their spending comes from one-off transactions, with no reliance on installment plans. These users are likely frequent everyday spenders, using credit cards for routine expenses such as groceries, dining, or fuel. The second group is installment users, whose distribution pattern is the opposite of the first group. Their spending is entirely driven by installment transactions, with no one-off purchases and no cash advances. This suggests that these customers typically use credit cards only for large purchases or high-value bills, while maintaining relatively low usage frequency in daily transactions. The third group comprises mixed users, who are located in the relatively sparse triangular region between the diagonal line and the vertical axis. These customers engage in both one-off purchases and installment payments. Points closer to the diagonal indicate a higher proportion of one-off spending, while points closer to the vertical axis suggest greater reliance on installment payments.

3.3 Correlation Analysis

num_df_main <- df_main %>% select(where(is.numeric))
cor_matrix <- cor(num_df_main, use = "complete.obs")
cor_long <- melt(cor_matrix)

# Keep lower triangle only
cor_long <- cor_long %>%
  filter(as.numeric(Var1) >= as.numeric(Var2))

ggplot(cor_long, aes(Var2, Var1, fill = value)) +
  geom_tile(color = "white", size = 0.2) +
  scale_fill_gradient2(
    low = "#4575b4",
    mid = "white",
    high = "#d73027",
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  theme_minimal(base_size = 10) +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 0.5,size=6),
    axis.text.y = element_text(size=6),
    axis.title = element_blank(),
    panel.grid = element_blank()
  ) +
  labs(title = "Correlation Heatmap of Credit Card Variables")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The correlation heatmap shows that credit card usage behavior exhibits a strong degree of functional differentiation:

Purchases-related variables display strong positive correlations with one another. This indicates that spending frequency and spending amount tend to move together within the same behavioral dimension. In other words, customers with higher spending levels not only spend more in total, but also make transactions more frequently, forming a stable “card-purchase–oriented” behavior cluster.
Cash-advance–related variables show strongly correlated with one another, but show generally weak correlations with spending-related variables. This result indicates that cash advances constitute a relatively independent pattern of credit card usage.
At the account level, credit_limit shows moderate positive correlations with balance and purchases, suggesting that higher credit limits are usually associated with larger usage volumes and outstanding balances. However, credit limit does not form strong associations with frequency-related variables, further confirming that credit availability alone does not directly determine customer activity levels.
Repayment behavior variables , prc_full_payment exhibits a clear negative correlation with balance, which indicates that customers who regularly pay their balances in full tend to maintain low outstanding balances, while customers with high balances are more likely to make partial payments.

Overall, this correlation structure reveals clear segmentation in customer behavior across spending, cash advance, and repayment dimensions. These patterns provide statistical support for subsequent customer segmentation, feature selection, and modeling strategies.

3.4 Discussion on EDA Results

This section conducts an examination of the overall distribution characteristics, the relationship between variables and the potential data structure. The results obvious non-equilibrium and structural characteristics in terms of variable relationship.

At the univariate analysis, most numerical variables at the low value interval and accompanied by a long right tail distribution, which indicates that there are more low value observations but more high value samples significantly impact the overall distribution. This not only reflects the common heterogeneity in behavioral data, but also suggest pay attention to abnormal value processing and variable scale differences in subsequent modelling.

Bivariate analysis shows the correlation structures between different types of variables have strong distinction. Some variables show a significant correlation, while others show a weaker or even close to zero correlation, which indicates that there may be different behaviour patterns in the data. Additionally, some variable relationships show “L-type” distribution characteristics, indicating that simple linear assumptions may not be enough to describe the true relationship between variables.

EDA results also reveal some potential challenges in the modelling. For example, the high bias and zero expansion characteristics of the variable distribution may affect the performance of the model. While the existence of correlation structures suggests multiple nonlinear problems. These findings emphases the importance of full exploration before formal modelling as well as to improve the explanatory and robustness of the model results.

Chapter 4 Regression Analysis

4.1 Multiple Linear Regression (MLR)

A multiple linear regression (MLR) is employed to examine the factors associated with PURCHASES, which is defined as the total amount of purchases made by a credit card holder. PURCHASES is selected as the dependent variable because it represents overall spending behaviour and directly reflects customers’ consumption intensity, making it a central outcome of interest for understanding credit card usage patterns. As an initial modelling approach, all relevant behavioural and financial predictors were first included in a full MLR specification before addressing issues of multicollinearity and model refinement in subsequent steps. The customer ID is removed as it does not contribute to prediction.

df_mlr <- df_log %>% select(-CUST_ID)

full_mlr <- lm(PURCHASES ~ ., data = df_mlr)
summary(full_mlr)

## 
## Call:
## lm(formula = PURCHASES ~ ., data = df_mlr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1536 -0.6180 -0.1722  0.6254  3.8952 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       1.440327   0.121649  11.840  < 2e-16 ***
## BALANCE                           0.011124   0.010129   1.098 0.272153    
## BALANCE_FREQUENCY                -0.554955   0.060846  -9.121  < 2e-16 ***
## ONEOFF_PURCHASES                  0.432027   0.005394  80.092  < 2e-16 ***
## INSTALLMENTS_PURCHASES            0.261802   0.007318  35.776  < 2e-16 ***
## CASH_ADVANCE                     -0.087947   0.006784 -12.965  < 2e-16 ***
## PURCHASES_FREQUENCY               3.353339   0.087385  38.374  < 2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY       -3.605413   0.063931 -56.395  < 2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -3.372167   0.074389 -45.331  < 2e-16 ***
## CASH_ADVANCE_FREQUENCY            0.347421   0.120966   2.872 0.004088 ** 
## CASH_ADVANCE_TRX                 -0.055998   0.034612  -1.618 0.105727    
## PURCHASES_TRX                     1.073645   0.024462  43.890  < 2e-16 ***
## CREDIT_LIMIT                      0.033757   0.013201   2.557 0.010573 *  
## PAYMENTS                          0.026647   0.008977   2.969 0.003000 ** 
## MINIMUM_PAYMENTS                  0.000777   0.010750   0.072 0.942386    
## PRC_FULL_PAYMENT                  0.147057   0.040354   3.644 0.000270 ***
## TENURE                           -0.025097   0.007613  -3.296 0.000983 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8764 on 8932 degrees of freedom
## Multiple R-squared:  0.9099, Adjusted R-squared:  0.9097 
## F-statistic:  5636 on 16 and 8932 DF,  p-value: < 2.2e-16

Before we intepret the model’s result, the Variance Inflation Factor (VIF) is used to detect multicollinearity among independent variables. High VIF values indicate redundant predictors that may destabilize coefficient estimates.

library(car)
vif_values <- vif(full_mlr)
vif_values

##                          BALANCE                BALANCE_FREQUENCY 
##                         4.844802                         2.418743 
##                 ONEOFF_PURCHASES           INSTALLMENTS_PURCHASES 
##                         3.572808                         5.930134 
##                     CASH_ADVANCE              PURCHASES_FREQUENCY 
##                         6.819572                        14.332294 
##       ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY 
##                         4.238706                        10.184913 
##           CASH_ADVANCE_FREQUENCY                 CASH_ADVANCE_TRX 
##                         6.828649                        14.220993 
##                    PURCHASES_TRX                     CREDIT_LIMIT 
##                        13.158182                         1.364229 
##                         PAYMENTS                 MINIMUM_PAYMENTS 
##                         2.377033                         3.086028 
##                 PRC_FULL_PAYMENT                           TENURE 
##                         1.568745                         1.207463

Value of VIF exceeding 10 commonly indicating severe multicollinearity. The full MLR model exhibits several predictors with elevated VIF values, most notably PURCHASES_FREQUENCY, which records a VIF (14.3) well above this threshold, indicating substantial redundancy with other purchase-related variables. Such inflation implies unstable coefficient estimates and reduced interpretability. Consequently, PURCHASES_FREQUENCY is removed in the subsequent model to mitigate multicollinearity and improve model reliability.

reduced_mlr_1 <- lm(
  PURCHASES ~ . - PURCHASES_FREQUENCY,
  data = df_mlr
)

vif(reduced_mlr_1)

##                          BALANCE                BALANCE_FREQUENCY 
##                         4.668755                         2.295774 
##                 ONEOFF_PURCHASES           INSTALLMENTS_PURCHASES 
##                         3.405527                         5.780320 
##                     CASH_ADVANCE       ONEOFF_PURCHASES_FREQUENCY 
##                         6.814807                         3.469917 
## PURCHASES_INSTALLMENTS_FREQUENCY           CASH_ADVANCE_FREQUENCY 
##                         6.212949                         6.828235 
##                 CASH_ADVANCE_TRX                    PURCHASES_TRX 
##                        14.220753                         9.698076 
##                     CREDIT_LIMIT                         PAYMENTS 
##                         1.364202                         2.372881 
##                 MINIMUM_PAYMENTS                 PRC_FULL_PAYMENT 
##                         3.076119                         1.568538 
##                           TENURE 
##                         1.168177

After removing PURCHASES_FREQUENCY, the overall multicollinearity structure improves. However, VIF diagnostics indicate that CASH_ADVANCE_TRX remains highly problematic, exhibiting a VIF value well above 10. This suggests strong linear dependence between transaction count and other cash-advance-related variables. Hence, CASH_ADVANCE_TRX is removed in the next model iteration to further mitigate multicollinearity and stabilise the regression estimates.

reduced_mlr_2 <- lm(
  PURCHASES ~ . -PURCHASES_FREQUENCY -CASH_ADVANCE_TRX,
  data = df_mlr
)
vif(reduced_mlr_2)

##                          BALANCE                BALANCE_FREQUENCY 
##                         4.667995                         2.295581 
##                 ONEOFF_PURCHASES           INSTALLMENTS_PURCHASES 
##                         3.398304                         5.779331 
##                     CASH_ADVANCE       ONEOFF_PURCHASES_FREQUENCY 
##                         3.372961                         3.466866 
## PURCHASES_INSTALLMENTS_FREQUENCY           CASH_ADVANCE_FREQUENCY 
##                         6.212336                         2.548100 
##                    PURCHASES_TRX                     CREDIT_LIMIT 
##                         9.695037                         1.357867 
##                         PAYMENTS                 MINIMUM_PAYMENTS 
##                         2.363191                         3.055479 
##                 PRC_FULL_PAYMENT                           TENURE 
##                         1.566818                         1.140738

Now, the VIF values of all remaining variables fall below the critical threshold of 10, indicating that serious multicollinearity is no longer present. Although PURCHASES_TRX still shows a relatively higher VIF than other predictors, its value remains within an acceptable range and does not indicate severe redundancy. More importantly, PURCHASES_TRX represents the number of purchase transactions and captures customers’ purchasing activity directly. Removing this variable would result in a loss of important behavioural information. Therefore, PURCHASES_TRX is retained, and no further variables are removed.

par(mfrow = c(2, 2))
plot(reduced_mlr_2)

Before we interpret the model’s result, diagnostic plots were examined to evaluate the assumptions of the multiple linear regression model. The Residuals vs Fitted plot shows a slight curved pattern, indicating that the relationship between the predictors and PURCHASES is not perfectly linear. The Q–Q plot reveals deviations at the tails, suggesting that the residuals are not fully normally distributed. The Scale–Location plot indicates some variation in residual spread across fitted values, which implies mild heteroscedasticity. The Residuals vs Leverage plot shows that no individual observation exceeds Cook’s distance thresholds, indicating that no data point has an undue influence on the model. Overall, these diagnostics suggest that while classical regression assumptions are not perfectly satisfied, the violations are mild and structural in nature rather than caused by individual predictors.

summary(reduced_mlr_2)

## 
## Call:
## lm(formula = PURCHASES ~ . - PURCHASES_FREQUENCY - CASH_ADVANCE_TRX, 
##     data = df_mlr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1475 -0.6808 -0.1776  0.6595  5.2179 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       2.217248   0.129228  17.158  < 2e-16 ***
## BALANCE                          -0.062787   0.010731  -5.851 5.06e-09 ***
## BALANCE_FREQUENCY                -0.027692   0.063976  -0.433 0.665139    
## ONEOFF_PURCHASES                  0.386883   0.005678  68.139  < 2e-16 ***
## INSTALLMENTS_PURCHASES            0.217031   0.007797  27.836  < 2e-16 ***
## CASH_ADVANCE                     -0.101865   0.005149 -19.784  < 2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY       -2.558097   0.062402 -40.994  < 2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -1.588650   0.062704 -25.336  < 2e-16 ***
## CASH_ADVANCE_FREQUENCY            0.171426   0.079752   2.150 0.031622 *  
## PURCHASES_TRX                     1.554472   0.022662  68.593  < 2e-16 ***
## CREDIT_LIMIT                      0.032845   0.014215   2.311 0.020875 *  
## PAYMENTS                          0.013088   0.009660   1.355 0.175490    
## MINIMUM_PAYMENTS                  0.022869   0.011545   1.981 0.047635 *  
## PRC_FULL_PAYMENT                  0.162921   0.043527   3.743 0.000183 ***
## TENURE                           -0.079471   0.007987  -9.950  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9458 on 8934 degrees of freedom
## Multiple R-squared:  0.895,  Adjusted R-squared:  0.8948 
## F-statistic:  5439 on 14 and 8934 DF,  p-value: < 2.2e-16

Stepwise regression was applied to assess whether further variable reduction could improve the model’s behaviour.

step_mlr <- step(
  reduced_mlr_2,   
  direction = "both",
  trace = FALSE
)

summary(step_mlr)

## 
## Call:
## lm(formula = PURCHASES ~ BALANCE + ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES + 
##     CASH_ADVANCE + ONEOFF_PURCHASES_FREQUENCY + PURCHASES_INSTALLMENTS_FREQUENCY + 
##     CASH_ADVANCE_FREQUENCY + PURCHASES_TRX + CREDIT_LIMIT + MINIMUM_PAYMENTS + 
##     PRC_FULL_PAYMENT + TENURE, data = df_mlr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1295 -0.6810 -0.1776  0.6597  5.2182 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       2.193172   0.124954  17.552  < 2e-16 ***
## BALANCE                          -0.066231   0.008671  -7.638 2.43e-14 ***
## ONEOFF_PURCHASES                  0.388310   0.005582  69.565  < 2e-16 ***
## INSTALLMENTS_PURCHASES            0.218420   0.007710  28.330  < 2e-16 ***
## CASH_ADVANCE                     -0.100729   0.005043 -19.975  < 2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY       -2.554060   0.061716 -41.384  < 2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -1.592288   0.061565 -25.863  < 2e-16 ***
## CASH_ADVANCE_FREQUENCY            0.172108   0.079475   2.166 0.030372 *  
## PURCHASES_TRX                     1.552382   0.022616  68.641  < 2e-16 ***
## CREDIT_LIMIT                      0.037044   0.013729   2.698 0.006986 ** 
## MINIMUM_PAYMENTS                  0.031632   0.009526   3.321 0.000902 ***
## PRC_FULL_PAYMENT                  0.177668   0.040495   4.387 1.16e-05 ***
## TENURE                           -0.078391   0.007876  -9.954  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9458 on 8936 degrees of freedom
## Multiple R-squared:  0.895,  Adjusted R-squared:  0.8948 
## F-statistic:  6346 on 12 and 8936 DF,  p-value: < 2.2e-16

par(mfrow = c(2, 2))
plot(step_mlr)

In stepwise regression model, BALANCE_FREQUENCY and PAYMENTS are removed. However, when the resulting diagnostic plots were compared with previous the reduced model, additional variable removal does not substantially change the residual patterns. This suggests that the observed behaviour arises from the underlying data characteristics rather than from redundant predictors.Thus, the reduced multiple linear regression model was retained for interpretation purpose.

library(sandwich)

## Warning: package 'sandwich' was built under R version 4.3.3

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.3.1

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.3.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

final_mlr <- reduced_mlr_2
coeftest(final_mlr, vcov = vcovHC(final_mlr, type = "HC3"))

## 
## t test of coefficients:
## 
##                                    Estimate Std. Error  t value  Pr(>|t|)    
## (Intercept)                       2.2172484  0.1370937  16.1732 < 2.2e-16 ***
## BALANCE                          -0.0627871  0.0112041  -5.6039 2.158e-08 ***
## BALANCE_FREQUENCY                -0.0276916  0.0639507  -0.4330  0.665015    
## ONEOFF_PURCHASES                  0.3868834  0.0066306  58.3478 < 2.2e-16 ***
## INSTALLMENTS_PURCHASES            0.2170307  0.0086326  25.1408 < 2.2e-16 ***
## CASH_ADVANCE                     -0.1018653  0.0054947 -18.5389 < 2.2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY       -2.5580971  0.0686045 -37.2876 < 2.2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -1.5886498  0.0679727 -23.3719 < 2.2e-16 ***
## CASH_ADVANCE_FREQUENCY            0.1714261  0.0783186   2.1888  0.028635 *  
## PURCHASES_TRX                     1.5544719  0.0267667  58.0748 < 2.2e-16 ***
## CREDIT_LIMIT                      0.0328451  0.0141387   2.3231  0.020198 *  
## PAYMENTS                          0.0130880  0.0103802   1.2609  0.207392    
## MINIMUM_PAYMENTS                  0.0228692  0.0126453   1.8085  0.070560 .  
## PRC_FULL_PAYMENT                  0.1629212  0.0439206   3.7095  0.000209 ***
## TENURE                           -0.0794713  0.0085934  -9.2480 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To deal with the mild heteroscedasticity observed in the residual diagnostics previously, robust standard errors were applied. Comparing the original regression output with the robust results shows that coefficient estimates remain unchanged, while standard errors and significance levels are slightly adjusted for some variables. This confirms that the direction and magnitude of relationships are stable, while statistical inference is made more reliable under non-constant error variance. The use of robust standard errors therefore strengthens the validity of hypothesis testing without altering the substantive interpretation of the model.

4.2 Interpretation and Discussion of the Reduced Multiple Linear Regression Model

The reduced multiple linear regression model explains a large proportion of the variation in customers’ total purchase spending, with an adjusted R² of approximately 0.895, indicating strong explanatory performance. The results show that customers who make larger individual purchases, spend more through instalment plans, and carry out a higher number of purchase transactions tend to have substantially higher overall purchase amounts. This highlights the importance of both transaction size and transaction volume in driving total spending. In contrast, customers who rely more heavily on cash withdrawals from their credit cards tend to spend less on regular purchase transactions, suggesting a shift away from conventional consumption behaviour. In addition, customers with higher credit limits and those who pay a larger proportion of their balance in full are associated with higher purchase spending, reflecting the role of credit capacity and disciplined repayment behaviour. However, the frequency with which account balances are updated does not show a meaningful relationship with total purchase amount once other financial and behavioural factors are taken into account. Overall, the findings indicate that direct spending behaviour and financial capacity are more influential determinants of purchase activity than account maintenance patterns.

Although the reduced multiple linear regression model demonstrates strong explanatory power, diagnostic analysis indicates mild departures from linearity and homoscedasticity, which are common in large-scale financial behaviour data. In such contexts, a regularized regression approach, such as LASSO regression, may be suitable as a complementary model. LASSO is particularly useful when predictors are correlated, as it performs automatic variable selection and reduces coefficient instability by shrinking less informative predictors toward zero. This makes it well suited for assessing predictive relevance rather than statistical significance. Alternatively, for capturing potential non-linear relationships and interaction effects inherent in customer spending behaviour, tree-based models such as Random Forest could be considered, as they do not rely on linearity assumptions. These models can serve as robustness or predictive benchmarks alongside the interpretable linear regression framework.

4.3 LASSO Regression

y <- df_mlr$PURCHASES
X <- model.matrix(PURCHASES ~ ., data = df_mlr)[, -1]

set.seed(123)

cv_lasso <- cv.glmnet(
  X, y,
  alpha = 1,            # alpha = 1 → LASSO
  standardize = TRUE,   # let glmnet standardize internally
  nfolds = 10
)
plot(cv_lasso)

cv_lasso$lambda.min

## [1] 0.001666535

cv_lasso$lambda.1se

## [1] 0.01175707

The optimal regularisation parameter was selected using 10-fold cross-validation. The value λₘᵢₙ corresponds to the minimum cross-validated prediction error, while λ₁ₛₑ represents a more conservative choice within one standard error of the minimum. The relatively small λ values are expected due to predictor standardisation and the log-transformed response variable. Using 10-fold cross-validation, the optimal regularisation parameter was identified at λ₁ₛₑ = 0.0118, which favours a more stable and conservative model.

coef_lasso_1se <- coef(cv_lasso, s = "lambda.1se")
coef_lasso_1se

## 17 x 1 sparse Matrix of class "dgCMatrix"
##                                           s1
## (Intercept)                       1.62058269
## BALANCE                           .         
## BALANCE_FREQUENCY                -0.44453443
## ONEOFF_PURCHASES                  0.41921889
## INSTALLMENTS_PURCHASES            0.23689770
## CASH_ADVANCE                     -0.08701131
## PURCHASES_FREQUENCY               2.71526133
## ONEOFF_PURCHASES_FREQUENCY       -3.10468189
## PURCHASES_INSTALLMENTS_FREQUENCY -2.68447365
## CASH_ADVANCE_FREQUENCY            .         
## CASH_ADVANCE_TRX                  .         
## PURCHASES_TRX                     1.08026058
## CREDIT_LIMIT                      0.02075249
## PAYMENTS                          0.01758830
## MINIMUM_PAYMENTS                  .         
## PRC_FULL_PAYMENT                  0.11529048
## TENURE                           -0.02190428

lasso_coef_df <- as.data.frame(as.matrix(coef_lasso_1se))
colnames(lasso_coef_df) <- "Coefficient"
lasso_coef_df$Variable <- rownames(lasso_coef_df)

lasso_coef_df

##                                  Coefficient                         Variable
## (Intercept)                       1.62058269                      (Intercept)
## BALANCE                           0.00000000                          BALANCE
## BALANCE_FREQUENCY                -0.44453443                BALANCE_FREQUENCY
## ONEOFF_PURCHASES                  0.41921889                 ONEOFF_PURCHASES
## INSTALLMENTS_PURCHASES            0.23689770           INSTALLMENTS_PURCHASES
## CASH_ADVANCE                     -0.08701131                     CASH_ADVANCE
## PURCHASES_FREQUENCY               2.71526133              PURCHASES_FREQUENCY
## ONEOFF_PURCHASES_FREQUENCY       -3.10468189       ONEOFF_PURCHASES_FREQUENCY
## PURCHASES_INSTALLMENTS_FREQUENCY -2.68447365 PURCHASES_INSTALLMENTS_FREQUENCY
## CASH_ADVANCE_FREQUENCY            0.00000000           CASH_ADVANCE_FREQUENCY
## CASH_ADVANCE_TRX                  0.00000000                 CASH_ADVANCE_TRX
## PURCHASES_TRX                     1.08026058                    PURCHASES_TRX
## CREDIT_LIMIT                      0.02075249                     CREDIT_LIMIT
## PAYMENTS                          0.01758830                         PAYMENTS
## MINIMUM_PAYMENTS                  0.00000000                 MINIMUM_PAYMENTS
## PRC_FULL_PAYMENT                  0.11529048                 PRC_FULL_PAYMENT
## TENURE                           -0.02190428                           TENURE

selected_vars <- lasso_coef_df %>%
  filter(Coefficient != 0)

selected_vars

##                                  Coefficient                         Variable
## (Intercept)                       1.62058269                      (Intercept)
## BALANCE_FREQUENCY                -0.44453443                BALANCE_FREQUENCY
## ONEOFF_PURCHASES                  0.41921889                 ONEOFF_PURCHASES
## INSTALLMENTS_PURCHASES            0.23689770           INSTALLMENTS_PURCHASES
## CASH_ADVANCE                     -0.08701131                     CASH_ADVANCE
## PURCHASES_FREQUENCY               2.71526133              PURCHASES_FREQUENCY
## ONEOFF_PURCHASES_FREQUENCY       -3.10468189       ONEOFF_PURCHASES_FREQUENCY
## PURCHASES_INSTALLMENTS_FREQUENCY -2.68447365 PURCHASES_INSTALLMENTS_FREQUENCY
## PURCHASES_TRX                     1.08026058                    PURCHASES_TRX
## CREDIT_LIMIT                      0.02075249                     CREDIT_LIMIT
## PAYMENTS                          0.01758830                         PAYMENTS
## PRC_FULL_PAYMENT                  0.11529048                 PRC_FULL_PAYMENT
## TENURE                           -0.02190428                           TENURE

The retained predictors largely correspond to transaction-driven and repayment-related behaviours. Variables representing the number of purchase transactions, one-off purchase amount, and instalment purchase amount remain positively associated with total purchase spending, reinforcing the importance of transaction intensity and spending structure. In contrast, variables related to cash advance usage and certain frequency measures are either negatively associated or removed entirely, suggesting that these behaviours contribute less to explaining purchase spending once regularisation is imposed. Notably, account balance level is excluded by LASSO, indicating that while balance may appear statistically significant in linear regression after multicollinearity control, it is not essential for predictive performance when competing correlated variables are present.

rmse_lasso_min <- sqrt(
  cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.min]
)

rmse_lasso_min

## [1] 0.8778477

The LASSO model achieves a root mean squared error (RMSE) of approximately 0.88, which is comparable to the reduced multiple linear regression model. This suggests that reducing the number of predictors does not substantially compromise predictive accuracy. The results indicate that a compact set of behavioural variables is sufficient to explain most of the variation in purchase spending, supporting the robustness of the behavioural patterns identified earlier.

Chapter 5 Clustering Analysis

5.1 Determination of Optimal Cluster Count

5.1.1 Silhouette Analysis

fviz_nbclust(df_scaled, kmeans, method='silhouette')

Optimal number of clusters, k = 2
Indicates the dataset contains 2 primary divisions at a fundamental level

5.1.2 Elbow Method

fviz_nbclust(df_scaled, kmeans, method = "wss")

Optimal number of clusters chosen, k = 4
Indicates the dataset contains 4 primary divisions

5.2 Visualization and Model Selection

5.2.1 Two-Cluster Model Visualization

set.seed(12345)
k2 <- kmeans(df_scaled, centers = 2 , nstart = 25)

# Calculate instead of printing since too lengthy
bss_tss_ratio_k2 <- (k2$betweenss / k2$totss) * 100
cat(sprintf("between_SS / total_SS = %5.1f %%\n", bss_tss_ratio_k2))

## between_SS / total_SS =  25.3 %

fviz_cluster(k2, data = df_scaled)

Some degree of overlap is observed between the two clusters. The sum of squares obtained for this two-cluster analysis is 21.5%, which represents that small proportion of the total variance in the dataset is explained by the clustering. The low between cluster to total sum of squares (BSS/TSS) ratio suggests that it may not provide a best fit.

Given the limited variance explained and the overlap between clusters, a two-cluster solution may be insufficient to capture the diversity of credit card usage behaviours. Therefore, it leads us to examine the elbow plot to determine whether adding additional clusters is still justified.

5.2 Visualization and Model Selection

5.2.1 Four-Cluster Model Visualization

set.seed(12345)
k4 <- kmeans(df_scaled, centers = 4 , nstart = 25)

fviz_cluster(k4, data = df_scaled)

Many degrees of overlap are observed between the clusters, especially Cluster 2 and 3. The clusters seemed to originate from a shared central region. The dense overlapping area represents the customers with moderate to low activity levels. These people are believed not to have strong differentiating behaviours. However, as the distance from the center increases, the customers’ behaviours are diverging outwards into distinct behavioral extremes shown at the outer tips. The graph captures the total variance up to 43.3% from the first 2 principles while the rest might be hidden from this 2D view, which explains the visual overlap. The increasing percentage of total variance being explained suggests that extending the model from two clusters to four clusters is reasonable and it can improve the segment compactness compared to the two-cluster model.

dist_mat <- dist(df_scaled)
sil_k4 <- silhouette(k4$cluster, dist_mat)
mean(sil_k4[, 3])

## [1] 0.2225641

The average silhouette score for the four-cluster solution is 0.22, indicating moderate overlap among clusters. This suggests that customer spending behaviour forms a continuum rather than strictly separated groups, which is common in financial transaction data. Despite the overlap, the clusters exhibit distinct behavioural profiles and remain suitable for segmentation and interpretation

To have a better understanding at the overlapping visual and the characteristics driving this divergence, we will look at their cluster means and visual profiles accordingly.

bss_tss_ratio <- (k4$betweenss / k4$totss) * 100

cat(sprintf("between_SS / total_SS = %5.1f %%\n", bss_tss_ratio)) ; cat("K-means clustering with 4 clusters of sizes: [", k4$size, "]", "\n") ; cat("Cluster means: ", "\n") ; print(k4$centers)

## between_SS / total_SS =  43.3 %

## K-means clustering with 4 clusters of sizes: [ 2753 1521 2103 2572 ]

## Cluster means:

##      BALANCE BALANCE_FREQUENCY   PURCHASES ONEOFF_PURCHASES
## 1  0.6139900         0.2615515 -1.11152051      -0.58129630
## 2 -1.2163200        -1.4840022 -0.05123344       0.01143858
## 3 -0.4539744         0.2167806  0.43105973      -0.60282677
## 4  0.4332878         0.4203835  0.86758298       1.10834189
##   INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1             -0.9273885    1.0058457          -1.0376709
## 2             -0.4582181   -0.6300383          -0.6018928
## 3              0.8749787   -0.6844516           0.7862244
## 4              0.5481999   -0.1444025           0.8237780
##   ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1                 -0.5355237                       -0.8440946
## 2                 -0.3025640                       -0.5876509
## 3                 -0.5464054                        0.9888346
## 4                  1.1989063                        0.4424923
##   CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT   PAYMENTS
## 1              0.8387892        0.9393544    -1.0630909  -0.01768346  0.1427545
## 2             -0.5604024       -0.6230793    -0.4455991  -0.27017542 -0.9445516
## 3             -0.5507468       -0.6345278     0.4949848  -0.43002370 -0.1784366
## 4             -0.1160942       -0.1181676     0.9966923   0.53031074  0.5516765
##   MINIMUM_PAYMENTS PRC_FULL_PAYMENT      TENURE
## 1        0.4225857     -0.403527657 -0.08988408
## 2       -1.1173981      0.007804537 -0.15575872
## 3       -0.1309691      0.429805495 -0.03314988
## 4        0.3155568      0.075878686  0.21542539

5.3 Profiling and Interpretation

# Need original data in numeric
data_for_radar <- df_main[, sapply(df_main, is.numeric)]

# Add Cluster Labels
data_for_radar$cluster <- k4$cluster

# Mean for each cluster
cluster_means <- aggregate(. ~ cluster, data = data_for_radar, mean)

# Normalize the data (0 to 1 scaling)
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
cluster_means_norm <- as.data.frame(lapply(cluster_means[, -1], normalize))

# Rename because long name will overlap when showing
colnames(cluster_means_norm) <- c(
  "Balance", "Bal_Freq", "Purchases", "OneOff_Purch", "Install_Purch", "Cash_Advance", "Purch_Freq", "OneOff_Purch_Freq", "Inst_Freq", "Cash_Freq", "Cash_Adv_Trx", "Purch_Trx", "Credit_Limit", "Payments", "Min_Pay", "Prc_Full_Pay", "Tenure"
)

data_plot <- rbind(rep(1, ncol(cluster_means_norm)), 
                   rep(0, ncol(cluster_means_norm)), 
                   cluster_means_norm)               

colors_border <- c("#F88379", "#79c314", "#00c5cd","#B480FF")

par(mar = c(2, 2, 2, 2))

radarchart(data_plot, axistype = 1, pcol = colors_border,
           pfcol = scales::alpha(colors_border, 0.4),
           plwd = 2, plty = 1, cglcol = "grey", cglty = 1, axislabcol = "grey",
           caxislabels = seq(0, 1, 0.25),
           title = "Profile of The 4 Customer Segments",
           vlcex = 0.75)

legend(x = "topright", legend = paste("Cluster", cluster_means$cluster),
       bty = "n", pch = 20, col = colors_border, pt.cex = 2)

Cluster 1: At-Risk Customers

Cluster 1 represents a large segment, accounting for approximately 30% of the total customer base (2,753 customers). It shows a highly distinct and extreme segment in the radar chart. The red polygon spikes aggressively downwards, dominating the cash-related axes, while collapsing inward on purchase-related dimensions. This visual is further supported by the cluster centroids, where this group holds the highest positive scores for Cash Advance Transactions (+0.94) and Cash Advance Frequency (+0.84). In contrast, they hold the lowest values for spending compared to other clusters, with very low scores for Purchases (−1.11), One-Off Purchases (−0.58) and Purchases Frequency (−1.04). This result confirms they rely on the card for liquid cash withdrawals rather than merchant transactions. This also suggests the cluster is likely facing financial pressure and is simply paying off their debt. Therefore, Cluster 1 represents a high-risk segment since reliance on cash advances is an expensive form of credit.

Cluster 2: Dormant Customers

Cluster 2 represents the smallest segment, accounting for approximately 17% of the total customer base (1,521 customers). It shows a smaller polygon in overall dimension compared to other clusters, indicating that they have minimal activities. This lack of engagement is proven statistically. They record the lowest values among all clusters for Payments (−0.94) and Minimum Payments (−1.12), respectively. They also show low spending behaviour with Purchases (−0.05) and the lowest Balance (−1.22). It is shown that their engagement levels are massively below the average. This profile suggests that the customers in Cluster 2 rarely use the card, making them a cluster with low revenue but also low risk.

Cluster 3: Loyal Customers

Cluster 3 has shown a distinct preference for structured financing. They are the most regular consumers of installment plans, even though they do not have the purchasing power of the high-value cluster. This is seen in the chart where the cyan polygon on the installment frequency axis reaches the outer boundary of the radar chart. This also aligns with their highest centroid score for Installment Purchases Frequency (+0.99). In contrast, their low cash usage shown by the inward polygon indicates they use credit cards mainly for transactions rather than cash withdrawals, as reflected by low Cash Advance Transactions (−0.63) and low Cash Advance Frequency (−0.55). Furthermore, they exhibit the most responsible repayment behaviour in the portfolio, shown by the highest Prc_Full_Payment score (+0.43). This makes them a low-risk segment for credit card use since they clear their statement balances consistently. Therefore, they are customers who use credit cards for planned installment purchases and maintain a good repayment record.

Cluster 4: High-Value Customers

Cluster 4 is the most active and high-value spending segment, accounting for approximately 28% of the total customer base (2,572 customers). Visually, their purple polygon expands the most along the one-off purchase and credit limit axes, indicating a profile with significant spending power. The statistics confirm this pattern, as they record the highest centroid values for Purchases (+0.87) and One-Off Purchases (+1.11). Moreover, they possess the highest credit limit (+0.53) and a relatively higher tenure (+0.22), indicating that they are long-standing and trusted customers with strong purchasing capacity.

5.4 Actionable Insights and Recommendations

Cluster 1: At-Risk Customers

Bank should revise the credit limits offered to this segment and should not increase limits for them. Instead, bank should flag this cluster and monitor their card usage. It may offer low-interest personal loans to help customers manage their cash flow while reducing the risk of bad debts.

Cluster 2: Dormant Customers

Bank can plan strategies to increase the engagement of Cluster 2. It may launch low-barrier promotional campaigns to kickstart activities. For example, offer zero annual fee and cashback in dining and grocery category with low minimum spend. This aims to attract dormant customers to resume spending on basic daily expenses like dining and groceries.

Cluster 3: Loyal Customers

Bank can plan to further locking the loyalty of these clients. Bank can strengthen partnerships with electronics, furniture, and appliance retailers to offer exclusive 0% interest installment plans to the credit card holders. This can cater to the purchasing preferences of these clients directly. Buy Now Pay Later (BNPL) plans can be offered to this cluster for short-term options via certain business partner.

Cluster 4: High-Value Customers

Bank should provide experiential rewards instead of generic cashback offers to this cluster. For example, premium airport lounge access with accompanying guest, overseas spend benefits, and concierge services, which align with their lifestyle and purchasing power. To retain these high-value customers, the marketing need to pay more attention to their customer experience.

Chapter 6 Conclusion

This study aimed to explore credit card customer behaviour through exploratory data analysis, examine the relationship between behavioural attributes and purchase spending using multiple linear regression, and identify distinct customer segments through clustering techniques. These objectives were achieved by systematically preprocessing the data, examining variable distributions and relationships, and applying appropriate transformations to address skewness and scale differences. Together, these steps provided a structured and coherent understanding of customer spending, repayment, and credit usage behaviour beyond what can be observed from raw transactional data alone.

The analytical results reveal several important behavioural insights. Regression analysis indicates that total purchase spending is primarily driven by transaction-related factors, including the number of purchase transactions, one-off purchase amounts, and instalment spending, while cash advance usage is negatively associated with purchase activity. This suggests a clear behavioural distinction between customers who use credit cards for consumption and those who rely on them for liquidity. Clustering analysis further supports this separation by identifying customer groups with distinct usage profiles, including high-value spenders, instalment-oriented customers with disciplined repayment behaviour, cash-advance-reliant users, and largely inactive customers. Although cluster separation is not sharp, the identified segments exhibit meaningful and interpretable differences in spending intensity, cash usage, and repayment patterns, making them suitable for behavioural profiling and segmentation purposes.

Despite the usefulness of these findings, several methodological limitations remain. Diagnostic analysis of the regression model reveals mild non-linearity and heteroscedasticity, indicating that linear regression may not fully capture the complex relationships present in customer behaviour. Future work could therefore consider more flexible modelling approaches, such as LASSO regression to handle correlated predictors and improve model stability, or Random Forest models to capture non-linear effects and interaction patterns without strict distributional assumptions. Similarly, while k-means clustering provides an intuitive and interpretable segmentation, its reliance on Euclidean distance and spherical cluster assumptions may limit its effectiveness for heterogeneous financial data. Alternative approaches such as hierarchical clustering, Gaussian mixture models, or density-based methods could be explored to capture more nuanced cluster structures. The relatively low silhouette score observed in this study reflects the continuous and overlapping nature of real-world credit card behaviour rather than poor clustering quality, highlighting the challenge of achieving strong geometric separation in behavioural datasets. Overall, these extensions would further enhance the robustness and explanatory power of customer segmentation analysis.

WQD7004 GA

Group 1

2026-01-12

Group 1 Members

Chapter 1 Introduction

1.1 Project Background

1.2 Dataset Overview

1.3 Problem Statement

1.4 Research Objectives

Chapter 2 Data Preprocessing

2.1 Import Data

2.1 Import Data

2.2 Data Cleaning

2.2.1 Imputation for MINIMUM_PAYMENTS

2.2.2 Imputation for PRC_FULL_PAYMENT

2.2.3 Drop the single row which has missing value for CREDIT_LIMIT

2.2.4 Imputation for PURCHASES

2.3 Data Transformation

2.3.1 Checking outliers

2.3.2 Log transformation and robust scaling

Chapter 3 Exploratory Data Analysis (EDA)

3.1 Univariate Analysis

Monetary Variables

Usage Frequency Variables

3.2 Bivariate Analysis

Purchases vs Cash Advance

Purchases(log scale) vs Cash Advance Amount(log scale)

Credit Limit vs Purchases

Credit Limit vs Payments

Purchases Frequency vs Purchases

Cash Advance Frequency vs Cash Advance

Customer Behaviour Pattern

3.3 Correlation Analysis

3.4 Discussion on EDA Results

Chapter 4 Regression Analysis

4.1 Multiple Linear Regression (MLR)

4.2 Interpretation and Discussion of the Reduced Multiple Linear Regression Model

4.3 LASSO Regression

Chapter 5 Clustering Analysis

5.1 Determination of Optimal Cluster Count

5.1.1 Silhouette Analysis

5.1.2 Elbow Method

5.2 Visualization and Model Selection

5.2.1 Two-Cluster Model Visualization

5.2 Visualization and Model Selection

5.2.1 Four-Cluster Model Visualization

5.3 Profiling and Interpretation

5.4 Actionable Insights and Recommendations

Chapter 6 Conclusion

2.2.1 Imputation for `MINIMUM_PAYMENTS`

2.2.2 Imputation for `PRC_FULL_PAYMENT`

2.2.3 Drop the single row which has missing value for `CREDIT_LIMIT`

2.2.4 Imputation for `PURCHASES`