| Num | Matric Number | Full Name |
|---|---|---|
| 1 | 24215430 | WANG YANRONG |
| 2 | 25066551 | XU JING |
| 3 | 24069475 | HU RUI |
| 4 | U2103571 | SOO YAN LIN |
| 5 | U2103393 | YAH HONG XUAN |
Credit cards are widely used in modern financial systems and generate rich behavioural data that reflect customers’ spending, repayment, and credit utilisation patterns. These data provide valuable insights into customer behaviour but are often high-dimensional and complex, making direct interpretation through simple descriptive analysis challenging.
Traditional customer analysis in the financial sector typically relies on predefined rules or manually constructed categories, which may overlook behavioural structures naturally embedded in the data. As a result, data-driven analytical approaches have become increasingly important for systematically uncovering behavioural patterns in large-scale financial datasets. In the absence of labelled outcomes, unsupervised learning methods particularly clustering techniques offer an effective means of grouping customers based on similarities in credit card usage, enabling the identification of latent behavioural segments without prior assumptions.
However, clustering alone is not sufficient for predictive analysis or automated decision-making. Supervised learning techniques, including regression and classification, can complement clustering by modelling relationships between behavioural variables and enabling the categorization of customers into meaningful groups. Integrating unsupervised and supervised learning methods therefore provides a more comprehensive framework for credit card customer profiling.
The analysis is based on a cleaned credit card customer dataset containing 8,950 observations and 17 variables. Each observation represents an individual credit card customer described by a set of behavioural and financial attributes.
The variables can be broadly categorised into three groups.
Monetary-related variables (e.g., BALANCE,
PURCHASES, PAYMENTS, and
CASH_ADVANCE) capture the monetary scale of credit card
usage. Frequency-related variables (e.g.,
PURCHASES_FREQUENCY and
CASH_ADVANCE_FREQUENCY) describe how regularly customers
use their credit cards. In addition, behavioural proportion variables
such as PRC_FULL_PAYMENT reflect customers’ repayment
behaviour.
The dataset does not contain predefined class labels, making it suitable for unsupervised learning techniques such as clustering. A small proportion of missing values is present in several behavioural variables, which motivates the need for appropriate data preprocessing prior to analysis.
Although credit card datasets contain rich behavioural information related to spending, payment, and credit usage, several challenges hinder effective analysis of customer behaviour. First, customer usage patterns are highly heterogeneous and cannot be readily understood from raw transactional variables alone, making it difficult to characterize behavioural trends through simple descriptive analysis. Second, the relationships between key behavioural attributes, such as balance levels, amount of purchase, transaction activity and repayment behaviour remain insufficiently understood. Third, the lack of predefined outcome labels in credit card behavioural data restricts the application of supervised learning approaches for customer segmentation.
Without a structured analytical framework that integrates exploratory data analysis, regression-based modelling, and unsupervised clustering techniques, it is challenging to systematically analyse customer behaviour or to extract meaningful and interpretable customer segments from credit card data.
The objectives of this study are as follows:
To explore and describe credit card customer behaviour through exploratory data analysis (EDA), focusing on spending, payment, and credit usage characteristics.
To examine the relationships between customer behavioural variables and total purchase amount by applying multiple linear regression analysis.
To identify and interpret distinct customer segments based on credit card usage patterns using unsupervised clustering techniques, supported by appropriate visualization and evaluation methods.
library(readr)
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(car)
## Loading required package: carData
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.0.1
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ dplyr::recode() masks car::recode()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(corrplot)
## corrplot 0.94 loaded
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(dplyr)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(cluster)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(fmsb)
## Registered S3 methods overwritten by 'fmsb':
## method from
## print.roc pROC
## plot.roc pROC
df <- read_csv("C:/Users/User/Desktop/Msc Data Science/WQD7004 PROGRAMMING FOR DATA SCIENCE/df_main_wqd7004.csv",
col_types = cols(...1 = col_skip(), BALANCE = col_number(),
BALANCE_FREQUENCY = col_number(),
PURCHASES = col_number(), ONEOFF_PURCHASES = col_number(),
INSTALLMENTS_PURCHASES = col_number(),
CASH_ADVANCE = col_number(), PURCHASES_FREQUENCY = col_number(),
ONEOFF_PURCHASES_FREQUENCY = col_number(),
PURCHASES_INSTALLMENTS_FREQUENCY = col_number(),
CASH_ADVANCE_FREQUENCY = col_number(),
CASH_ADVANCE_TRX = col_number(),
PURCHASES_TRX = col_number(), CREDIT_LIMIT = col_number(),
PAYMENTS = col_number(), MINIMUM_PAYMENTS = col_number(),
PRC_FULL_PAYMENT = col_number(),
TENURE = col_integer()))
## New names:
## • `` -> `...1`
library(dplyr)
overview_table <- data.frame(
Variable = names(df),
Data_Type = sapply(df, class),
Unique_Values = sapply(df, function(x) length(unique(x))),
Missing_Count = sapply(df, function(x) sum(is.na(x))),
Missing_Percentage = round(sapply(df, function(x) sum(is.na(x))) / nrow(df) * 100, 2),
stringsAsFactors = FALSE
)
overview_table$Duplicate_Values <- nrow(df) - overview_table$Unique_Values
overview_table
## Variable Data_Type
## CUST_ID CUST_ID character
## BALANCE BALANCE numeric
## BALANCE_FREQUENCY BALANCE_FREQUENCY numeric
## PURCHASES PURCHASES numeric
## ONEOFF_PURCHASES ONEOFF_PURCHASES numeric
## INSTALLMENTS_PURCHASES INSTALLMENTS_PURCHASES numeric
## CASH_ADVANCE CASH_ADVANCE numeric
## PURCHASES_FREQUENCY PURCHASES_FREQUENCY numeric
## ONEOFF_PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY numeric
## PURCHASES_INSTALLMENTS_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY numeric
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_FREQUENCY numeric
## CASH_ADVANCE_TRX CASH_ADVANCE_TRX numeric
## PURCHASES_TRX PURCHASES_TRX numeric
## CREDIT_LIMIT CREDIT_LIMIT numeric
## PAYMENTS PAYMENTS numeric
## MINIMUM_PAYMENTS MINIMUM_PAYMENTS numeric
## PRC_FULL_PAYMENT PRC_FULL_PAYMENT numeric
## TENURE TENURE integer
## Unique_Values Missing_Count Missing_Percentage
## CUST_ID 8950 0 0.00
## BALANCE 8871 0 0.00
## BALANCE_FREQUENCY 43 0 0.00
## PURCHASES 5915 425 4.75
## ONEOFF_PURCHASES 4014 0 0.00
## INSTALLMENTS_PURCHASES 4452 0 0.00
## CASH_ADVANCE 4323 0 0.00
## PURCHASES_FREQUENCY 47 0 0.00
## ONEOFF_PURCHASES_FREQUENCY 47 0 0.00
## PURCHASES_INSTALLMENTS_FREQUENCY 47 0 0.00
## CASH_ADVANCE_FREQUENCY 54 0 0.00
## CASH_ADVANCE_TRX 65 0 0.00
## PURCHASES_TRX 173 0 0.00
## CREDIT_LIMIT 206 1 0.01
## PAYMENTS 8711 0 0.00
## MINIMUM_PAYMENTS 8637 313 3.50
## PRC_FULL_PAYMENT 48 413 4.61
## TENURE 7 0 0.00
## Duplicate_Values
## CUST_ID 0
## BALANCE 79
## BALANCE_FREQUENCY 8907
## PURCHASES 3035
## ONEOFF_PURCHASES 4936
## INSTALLMENTS_PURCHASES 4498
## CASH_ADVANCE 4627
## PURCHASES_FREQUENCY 8903
## ONEOFF_PURCHASES_FREQUENCY 8903
## PURCHASES_INSTALLMENTS_FREQUENCY 8903
## CASH_ADVANCE_FREQUENCY 8896
## CASH_ADVANCE_TRX 8885
## PURCHASES_TRX 8777
## CREDIT_LIMIT 8744
## PAYMENTS 239
## MINIMUM_PAYMENTS 313
## PRC_FULL_PAYMENT 8902
## TENURE 8943
glimpse(df)
## Rows: 8,950
## Columns: 18
## $ CUST_ID <chr> "C10001", "C10002", "C10003", "C10004…
## $ BALANCE <dbl> 40.90075, 3202.46742, 2495.14886, 166…
## $ BALANCE_FREQUENCY <dbl> 0.818182, 0.909091, 1.000000, 0.63636…
## $ PURCHASES <dbl> 95.40, NA, 773.17, 1499.00, 16.00, 13…
## $ ONEOFF_PURCHASES <dbl> 0.00, 0.00, 773.17, 1499.00, 16.00, 0…
## $ INSTALLMENTS_PURCHASES <dbl> 95.40, 0.00, 0.00, 0.00, 0.00, 1333.2…
## $ CASH_ADVANCE <dbl> 0.0000, 6442.9455, 0.0000, 205.7880, …
## $ PURCHASES_FREQUENCY <dbl> 0.166667, 0.000000, 1.000000, 0.08333…
## $ ONEOFF_PURCHASES_FREQUENCY <dbl> 0.000000, 0.000000, 1.000000, 0.08333…
## $ PURCHASES_INSTALLMENTS_FREQUENCY <dbl> 0.083333, 0.000000, 0.000000, 0.00000…
## $ CASH_ADVANCE_FREQUENCY <dbl> 0.000000, 0.250000, 0.000000, 0.08333…
## $ CASH_ADVANCE_TRX <dbl> 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ PURCHASES_TRX <dbl> 2, 0, 12, 1, 1, 8, 64, 12, 5, 3, 12, …
## $ CREDIT_LIMIT <dbl> 1000, 7000, 7500, 7500, 1200, 1800, 1…
## $ PAYMENTS <dbl> 201.8021, 4103.0326, 622.0667, 0.0000…
## $ MINIMUM_PAYMENTS <dbl> 139.50979, 1072.34022, 627.28479, NA,…
## $ PRC_FULL_PAYMENT <dbl> 0.000000, 0.222222, 0.000000, 0.00000…
## $ TENURE <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 1…
Out of the 17 variables, four variables were found to contain missing values. Table below summarises the affected variables, the number of missing observations, and the proportion of missingness relative to the total number of records (8,950). The proportion of missing values for all variables is below 5%, indicating that the dataset remains largely complete. Nonetheless, suitable imputation strategies were required to preserve information and avoid bias, particularly for behavioural variables relevant to clustering.
# Get the variables that have missing values
missing_only_table <- overview_table %>%
filter(Missing_Count > 0)
missing_only_table
## Variable Data_Type Unique_Values Missing_Count
## PURCHASES PURCHASES numeric 5915 425
## CREDIT_LIMIT CREDIT_LIMIT numeric 206 1
## MINIMUM_PAYMENTS MINIMUM_PAYMENTS numeric 8637 313
## PRC_FULL_PAYMENT PRC_FULL_PAYMENT numeric 48 413
## Missing_Percentage Duplicate_Values
## PURCHASES 4.75 3035
## CREDIT_LIMIT 0.01 8744
## MINIMUM_PAYMENTS 3.50 313
## PRC_FULL_PAYMENT 4.61 8902
MINIMUM_PAYMENTSThe variable MINIMUM_PAYMENTS represents the minimum
amount of payment paid by a customer in a billing cycle. Its value is
closely related to the actual payment made by the customer
(PAYMENTS). Therefore, a rule-based conditional imputation
strategy was applied. If PAYMENTS = 0, then
MINIMUM_PAYMENTS was set to 0, reflecting non-payment
behaviour. If 0 < PAYMENTS <
mean(PAYMENTS), then MINIMUM_PAYMENTS was set
equal to PAYMENTS, assuming that customers paying small
amounts tend to meet only the minimum requirement. Otherwise,
MINIMUM_PAYMENTS was imputed using the overall mean of
MINIMUM_PAYMENTS, representing a typical minimum payment
value. This approach preserves realistic repayment behaviour while
avoiding arbitrary statistical substitution.
df_main <- df
mean_payments <- mean(df_main$PAYMENTS, na.rm = TRUE)
mean_min_payments <- mean(df_main$MINIMUM_PAYMENTS, na.rm = TRUE)
df_main <- df_main %>%
mutate(
MINIMUM_PAYMENTS = case_when(
is.na(MINIMUM_PAYMENTS) & PAYMENTS == 0 ~ 0,
is.na(MINIMUM_PAYMENTS) & PAYMENTS > 0 & PAYMENTS < mean_payments ~ PAYMENTS,
is.na(MINIMUM_PAYMENTS) ~ mean_min_payments,
TRUE ~ MINIMUM_PAYMENTS
)
)
PRC_FULL_PAYMENTThe variable PRC_FULL_PAYMENT indicates the proportion
of the statement balance paid in full by the customer. Exploratory
inspection revealed that this variable is highly skewed, with a
substantial proportion of customers exhibiting values close to zero. To
address this, missing values in PRC_FULL_PAYMENT were
imputed using the median rather than the mean. The median is a robust
measure of central tendency that is less influenced by extreme values
and skewed distributions. Given that full payment behaviour is unevenly
distributed across customers, median imputation provides a more
representative estimate of typical repayment behaviour and avoids
inflating values due to a small number of high full-payment
observations.
median_prc <- median(df_main$PRC_FULL_PAYMENT, na.rm = TRUE)
df_main <- df_main %>%
mutate(
PRC_FULL_PAYMENT = ifelse(
is.na(PRC_FULL_PAYMENT),
median_prc,
PRC_FULL_PAYMENT
)
)
CREDIT_LIMITThe variable CREDIT_LIMIT contained only one missing
observation (0.01% of the dataset). Given the negligible proportion of
missingness and the importance of maintaining data integrity for
credit-related variables, the single affected record was removed from
the dataset. This decision avoids introducing unnecessary estimation
error while having no material impact on the overall analysis.
df_main <- df_main %>%
filter(!is.na(CREDIT_LIMIT))
PURCHASESThe variable PURCHASES captures the total amount spent
by a customer and is conceptually composed of two subcomponents, which
are ONEOFF_PURCHASES and
INSTALLMENTS_PURCHASES. Preliminary analysis showed a
strong relationship between PURCHASES and these two
variables. Therefore, missing values in PURCHASES were imputed using a
linear regression model, with ONEOFF_PURCHASES and
INSTALLMENTS_PURCHASES as predictors.
This regression-based imputation leverages the underlying transaction structure of the dataset and ensures that imputed values are consistent with observed purchasing behaviour. The strong association between these variables will be further demonstrated through a correlation heatmap in the subsequent Exploratory Data Analysis section.
train_data <- df_main %>%
filter(!is.na(PURCHASES))
reg_model <- lm(
PURCHASES ~ ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES,
data = train_data
)
summary(reg_model)
##
## Call:
## lm(formula = PURCHASES ~ ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -595.97 0.15 0.16 0.38 14.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.577e-01 1.217e-01 -1.296 0.195
## ONEOFF_PURCHASES 1.000e+00 7.251e-05 13793.693 <2e-16 ***
## INSTALLMENTS_PURCHASES 9.994e-01 1.277e-04 7823.878 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.981 on 8521 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.776e+08 on 2 and 8521 DF, p-value: < 2.2e-16
missing_purchases <- df_main %>%
filter(is.na(PURCHASES))
pred_values <- predict(reg_model, newdata = missing_purchases)
# Replace negative predictions with 0
pred_values <- ifelse(pred_values < 0, 0, pred_values)
df_main$PURCHASES[is.na(df_main$PURCHASES)] <- pred_values
Check if there’s any missing values
sum(is.na(df_main))
## [1] 0
Following missing value treatment, an outlier assessment was
conducted using boxplots for all numerical variables. The analysis
revealed a large number of extreme values, particularly for monetary
variables such as PURCHASES, BALANCE,
CASH_ADVANCE, and PAYMENTS. These outliers
reflect genuine customer behaviour (e.g., high spenders) rather than
data errors and are therefore retained.
numeric_cols <- df_main %>%
select(where(is.numeric))
# Convert to long format
df_long <- numeric_cols %>%
pivot_longer(
cols = everything(),
names_to = "Variable",
values_to = "Value"
)
# Plot boxplots
ggplot(df_long, aes(x = Value)) +
geom_boxplot(fill = "steelblue", outlier.size = 0.6, outlier.alpha = 0.5) +
facet_wrap(~ Variable, scales = "free_x", ncol = 4) +
theme_minimal(base_size = 11) +
theme(
strip.text = element_text(face = "bold", size = 9),
axis.title.x = element_blank(),
axis.text.y = element_blank(),
panel.grid.major.y = element_blank()
) +
labs(
title = "Distribution and Outlier Characteristics of Credit Card Usage Variables",
subtitle = "Boxplots prior to log transformation and robust scaling"
)
Before scaling, logarithmic transformation was applied to selected
monetary and count-based variables, including balance amounts, purchase
values, payment amounts, credit limits, and transaction counts, due to
their highly right-skewed distributions and the presence of extreme
values. These variables span several orders of magnitude, with a small
number of customers exhibiting exceptionally high spending or
transaction activity. Applying a log transformation reduces skewness,
compresses extreme values, and stabilizes variance, thereby improving
numerical stability and model performance in both regression and
clustering analyses. The log1p transformation was used to
safely accommodate zero-valued observations while preserving relative
differences among smaller values. In contrast, proportion and
frequency-based variables bounded between fixed ranges were not
transformed to maintain interpretability and avoid distortion of
behavioural intensity near zero.
log_cols <- c(
"BALANCE",
"PURCHASES",
"ONEOFF_PURCHASES",
"INSTALLMENTS_PURCHASES",
"CASH_ADVANCE",
"PAYMENTS",
"MINIMUM_PAYMENTS",
"CREDIT_LIMIT",
"PURCHASES_TRX",
"CASH_ADVANCE_TRX"
)
df_log <- df_main
df_log[log_cols] <- lapply(df_log[log_cols], log1p)
Robust scaling was used to mitigate the influence of outliers while preserving the inherent structure of non-monetary variables.
numeric_cols <- df_log %>%
select(where(is.numeric))
robust_scaler <- preProcess(
numeric_cols,
method = c("center", "scale"),
center = apply(numeric_cols, 2, median),
scale = apply(numeric_cols, 2, IQR)
)
df_scaled <- predict(robust_scaler, numeric_cols)
amount_vars <- df_main %>%
select(
BALANCE,
PURCHASES,
CASH_ADVANCE,
PAYMENTS,
MINIMUM_PAYMENTS,
CREDIT_LIMIT
)
amount_vars %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
geom_freqpoly(
bins = 50,
color = "lightblue",
linewidth = 0.8
) +
facet_wrap(~ name, scales = "free") +
theme_minimal() +
labs(
title = "Distribution of Amount-related Variables",
x = "Value",
y = "Count"
)
amount_vars %>%
pivot_longer(cols = everything()) %>%
filter(value > 0) %>%
ggplot(aes(x = value, y = name, fill = name)) +
geom_boxplot(
outlier.alpha = 0.4,
outlier.size = 1
) +
scale_x_log10() +
theme_bw() +
theme(
legend.position = "none",
axis.title.y = element_blank()
) +
labs(
title = "Boxplot of Amount Variables (Log Scale)",
x = "Value (log scale)"
)
The distributions of amount-related variables, including
balance, purchase, cash_advance
and paymentare highly right-skewed, with most customers
exhibiting relatively low balances and purchase amounts, while a small
number of customers show extremely large values. After logarithmic
transformation, the boxplot continue to display numerous high-value
outliers, indicating significant differences in the scale of fund usage
among customers. In contrast, the distribution of
credit_limit appears comparatively smoother, but a certain
proportion of high-limit accounts are still retained. This suggests a
layered structure in credit limits within the customer population.
freq_vars <- df_main %>%
select(
BALANCE_FREQUENCY,
PURCHASES_FREQUENCY,
ONEOFF_PURCHASES_FREQUENCY,
PURCHASES_INSTALLMENTS_FREQUENCY,
CASH_ADVANCE_FREQUENCY,
PRC_FULL_PAYMENT
)
freq_vars %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 20, fill = "steelblue", alpha = 0.7) +
facet_wrap(~ name, scales = "free") +
theme_minimal() +
labs(
title = "Distribution of Frequency-related Variables",
x = "Frequency",
y = "Count"
)
Frequency-related variables display more discrete distribution
patterns. Balance_frequency and
purchases_frequency are highly concentrated near 1,
indicating that most customers use their credit cards on a regular
basis. However, oneoff_purchases_frequency, cash_advance_frequency and
prc_full_payment show clear spikes near 0, suggesting that
a large proportion of customers rarely make one-off purchases, use cash
advances, or pay their balances in full. Overall, these univariate
distributions reveal significant disparities in spending patterns and
usage behaviors among credit card users.
ggplot(df_main, aes(x = PURCHASES, y = CASH_ADVANCE)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Purchases vs Cash Advance",
x = "Cash Advance Amount",
y = "Purchases Amount"
)
ggplot(df_main, aes(x = PURCHASES, y = CASH_ADVANCE)) +
geom_point(alpha = 0.35,color = "steelblue") +
scale_x_log10() +
scale_y_log10() +
theme_minimal() +
labs(
title = "Purchases(log scale) vs Cash Advance Amount(log scale)",
x = "Purchases (log scale)",
y = "Cash Advance Amount (log scale)"
)
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
The relationship between purchase and cash_advance shows a typical
“L-shaped” distribution pattern. Most observations are concentrated
along the two axes, while the central area is relatively sparse. This
indicates that most customers tend to rely either on card purchases or
on cash advances, whereas very few customers frequently and heavily
engage in both behaviors at the same time. Users with zero
PURCHASES are likely not typical credit card users. Some of
them may hold the card mainly for emergency cash withdrawals, while
others may be inactive accounts opened for promotional benefits but
never truly used. Meanwhile, CASH_ADVANCE exhibits a
pronounced spike near zero, indicating the fact that cash advances
usually involve higher fees and interest costs.
ggplot(df_main, aes(x = CREDIT_LIMIT, y = PURCHASES)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Credit Limit vs Purchases",
x = "Credit Limit",
y = "Purchases Amount"
)
ggplot(df_main, aes(x = CREDIT_LIMIT, y = PAYMENTS)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Credit Limit vs Payments",
x = "Credit Limit",
y = "Payments Amount"
)
Credit limit shows a moderate positive correlation with both purchases and repayment amounts. However, the credit limits of the majority of customers are still concentrated below 5,000, indicating that increases in credit limit do not substantially alter the overall distribution structure at the population level. Although higher credit limits are generally associated with larger spending and repayment volumes, a large number of customers hold relatively high credit limits but exhibit low activity. The explanatory power for actual behavior remains limited.
ggplot(df_main, aes(x = PURCHASES_FREQUENCY, y = PURCHASES)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Purchases Frequency vs Purchases",
x = "Purchases Frequency",
y = "Purchases Amount"
)
ggplot(df_main, aes(x = CASH_ADVANCE_FREQUENCY, y = CASH_ADVANCE)) +
geom_point(alpha = 0.35,color = "steelblue") +
theme_minimal() +
labs(
title = "Cash Advance Frequency vs Cash Advance",
x = "Cash Advance Frequency",
y = "Cash Advance Amount"
)
The relationship between frequency variables and their corresponding monetary variables reveals a strong consistency between the two. Higher transaction frequency is generally associated with larger cumulative amounts, with high-frequency users mainly concentrated in higher spending ranges. This suggests that transaction frequency can serve as a useful indicator of customer activity level and overall spending scale.
plot_df_main <- df_main %>%
mutate(
log_purchases = log10(PURCHASES + 1),
log_oneoff = log10(ONEOFF_PURCHASES + 1),
log_installments = log10(INSTALLMENTS_PURCHASES + 1)
)
p1 <- ggplot(plot_df_main, aes(x = log_oneoff, y = log_purchases)) +
geom_point(alpha = 0.4, size = 1, color = "#1f77b4") +
theme_minimal(base_size = 10) +
labs(
title = "ONEOFF_PURCHASES vs PURCHASES",
x = "ONEOFF_PURCHASES",
y = "PURCHASES"
)
p2 <- ggplot(plot_df_main, aes(x = log_installments, y = log_purchases)) +
geom_point(alpha = 0.4, size = 1, color = "#1f77b4") +
theme_minimal(base_size = 10) +
labs(
title = "INSTALLMENTS_PURCHASES vs PURCHASES",
x = "INSTALLMENTS_PURCHASES",
y = "PURCHASES"
)
grid.arrange(p1, p2, ncol = 2)
Based on the combined relationships among
oneoff_purchases, installments_purchases and
cash_advance, credit card usage behavior can be grouped
into three representative customer segments.
The first group consists of one-off Users, who only make one-off card
purchases and do not use installment payments or cash advances. In
ONEOFF_PURCHASES plot, these customers lie close to the
diagonal line, while forming along the Y-axis (x = 0) in the
installments_purchases plot. This indicates that 100% of
their spending comes from one-off transactions, with no reliance on
installment plans. These users are likely frequent everyday spenders,
using credit cards for routine expenses such as groceries, dining, or
fuel. The second group is installment users, whose distribution pattern
is the opposite of the first group. Their spending is entirely driven by
installment transactions, with no one-off purchases and no cash
advances. This suggests that these customers typically use credit cards
only for large purchases or high-value bills, while maintaining
relatively low usage frequency in daily transactions. The third group
comprises mixed users, who are located in the relatively sparse
triangular region between the diagonal line and the vertical axis. These
customers engage in both one-off purchases and installment payments.
Points closer to the diagonal indicate a higher proportion of one-off
spending, while points closer to the vertical axis suggest greater
reliance on installment payments.
num_df_main <- df_main %>% select(where(is.numeric))
cor_matrix <- cor(num_df_main, use = "complete.obs")
cor_long <- melt(cor_matrix)
# Keep lower triangle only
cor_long <- cor_long %>%
filter(as.numeric(Var1) >= as.numeric(Var2))
ggplot(cor_long, aes(Var2, Var1, fill = value)) +
geom_tile(color = "white", size = 0.2) +
scale_fill_gradient2(
low = "#4575b4",
mid = "white",
high = "#d73027",
midpoint = 0,
limits = c(-1, 1),
name = "Correlation"
) +
theme_minimal(base_size = 10) +
theme(
axis.text.x = element_text(angle = 90, hjust = 0.5,size=6),
axis.text.y = element_text(size=6),
axis.title = element_blank(),
panel.grid = element_blank()
) +
labs(title = "Correlation Heatmap of Credit Card Variables")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The correlation heatmap shows that credit card usage behavior exhibits a strong degree of functional differentiation:
Purchases-related variables display strong positive correlations with one another. This indicates that spending frequency and spending amount tend to move together within the same behavioral dimension. In other words, customers with higher spending levels not only spend more in total, but also make transactions more frequently, forming a stable “card-purchase–oriented” behavior cluster.
Cash-advance–related variables show strongly correlated with one another, but show generally weak correlations with spending-related variables. This result indicates that cash advances constitute a relatively independent pattern of credit card usage.
At the account level, credit_limit shows moderate
positive correlations with balance and purchases, suggesting that higher
credit limits are usually associated with larger usage volumes and
outstanding balances. However, credit limit does not form strong
associations with frequency-related variables, further confirming that
credit availability alone does not directly determine customer activity
levels.
Repayment behavior variables , prc_full_payment
exhibits a clear negative correlation with balance, which indicates that
customers who regularly pay their balances in full tend to maintain low
outstanding balances, while customers with high balances are more likely
to make partial payments.
Overall, this correlation structure reveals clear segmentation in customer behavior across spending, cash advance, and repayment dimensions. These patterns provide statistical support for subsequent customer segmentation, feature selection, and modeling strategies.
This section conducts an examination of the overall distribution characteristics, the relationship between variables and the potential data structure. The results obvious non-equilibrium and structural characteristics in terms of variable relationship.
At the univariate analysis, most numerical variables at the low value interval and accompanied by a long right tail distribution, which indicates that there are more low value observations but more high value samples significantly impact the overall distribution. This not only reflects the common heterogeneity in behavioral data, but also suggest pay attention to abnormal value processing and variable scale differences in subsequent modelling.
Bivariate analysis shows the correlation structures between different types of variables have strong distinction. Some variables show a significant correlation, while others show a weaker or even close to zero correlation, which indicates that there may be different behaviour patterns in the data. Additionally, some variable relationships show “L-type” distribution characteristics, indicating that simple linear assumptions may not be enough to describe the true relationship between variables.
EDA results also reveal some potential challenges in the modelling. For example, the high bias and zero expansion characteristics of the variable distribution may affect the performance of the model. While the existence of correlation structures suggests multiple nonlinear problems. These findings emphases the importance of full exploration before formal modelling as well as to improve the explanatory and robustness of the model results.
A multiple linear regression (MLR) is employed to examine the factors
associated with PURCHASES, which is defined as the total
amount of purchases made by a credit card holder. PURCHASES
is selected as the dependent variable because it represents overall
spending behaviour and directly reflects customers’ consumption
intensity, making it a central outcome of interest for understanding
credit card usage patterns. As an initial modelling approach, all
relevant behavioural and financial predictors were first included in a
full MLR specification before addressing issues of multicollinearity and
model refinement in subsequent steps. The customer ID is removed as it
does not contribute to prediction.
df_mlr <- df_log %>% select(-CUST_ID)
full_mlr <- lm(PURCHASES ~ ., data = df_mlr)
summary(full_mlr)
##
## Call:
## lm(formula = PURCHASES ~ ., data = df_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1536 -0.6180 -0.1722 0.6254 3.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.440327 0.121649 11.840 < 2e-16 ***
## BALANCE 0.011124 0.010129 1.098 0.272153
## BALANCE_FREQUENCY -0.554955 0.060846 -9.121 < 2e-16 ***
## ONEOFF_PURCHASES 0.432027 0.005394 80.092 < 2e-16 ***
## INSTALLMENTS_PURCHASES 0.261802 0.007318 35.776 < 2e-16 ***
## CASH_ADVANCE -0.087947 0.006784 -12.965 < 2e-16 ***
## PURCHASES_FREQUENCY 3.353339 0.087385 38.374 < 2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY -3.605413 0.063931 -56.395 < 2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -3.372167 0.074389 -45.331 < 2e-16 ***
## CASH_ADVANCE_FREQUENCY 0.347421 0.120966 2.872 0.004088 **
## CASH_ADVANCE_TRX -0.055998 0.034612 -1.618 0.105727
## PURCHASES_TRX 1.073645 0.024462 43.890 < 2e-16 ***
## CREDIT_LIMIT 0.033757 0.013201 2.557 0.010573 *
## PAYMENTS 0.026647 0.008977 2.969 0.003000 **
## MINIMUM_PAYMENTS 0.000777 0.010750 0.072 0.942386
## PRC_FULL_PAYMENT 0.147057 0.040354 3.644 0.000270 ***
## TENURE -0.025097 0.007613 -3.296 0.000983 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8764 on 8932 degrees of freedom
## Multiple R-squared: 0.9099, Adjusted R-squared: 0.9097
## F-statistic: 5636 on 16 and 8932 DF, p-value: < 2.2e-16
Before we intepret the model’s result, the Variance Inflation Factor (VIF) is used to detect multicollinearity among independent variables. High VIF values indicate redundant predictors that may destabilize coefficient estimates.
library(car)
vif_values <- vif(full_mlr)
vif_values
## BALANCE BALANCE_FREQUENCY
## 4.844802 2.418743
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 3.572808 5.930134
## CASH_ADVANCE PURCHASES_FREQUENCY
## 6.819572 14.332294
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 4.238706 10.184913
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX
## 6.828649 14.220993
## PURCHASES_TRX CREDIT_LIMIT
## 13.158182 1.364229
## PAYMENTS MINIMUM_PAYMENTS
## 2.377033 3.086028
## PRC_FULL_PAYMENT TENURE
## 1.568745 1.207463
Value of VIF exceeding 10 commonly indicating severe
multicollinearity. The full MLR model exhibits several predictors with
elevated VIF values, most notably PURCHASES_FREQUENCY,
which records a VIF (14.3) well above this threshold, indicating
substantial redundancy with other purchase-related variables. Such
inflation implies unstable coefficient estimates and reduced
interpretability. Consequently, PURCHASES_FREQUENCY is removed in the
subsequent model to mitigate multicollinearity and improve model
reliability.
reduced_mlr_1 <- lm(
PURCHASES ~ . - PURCHASES_FREQUENCY,
data = df_mlr
)
vif(reduced_mlr_1)
## BALANCE BALANCE_FREQUENCY
## 4.668755 2.295774
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 3.405527 5.780320
## CASH_ADVANCE ONEOFF_PURCHASES_FREQUENCY
## 6.814807 3.469917
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY
## 6.212949 6.828235
## CASH_ADVANCE_TRX PURCHASES_TRX
## 14.220753 9.698076
## CREDIT_LIMIT PAYMENTS
## 1.364202 2.372881
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT
## 3.076119 1.568538
## TENURE
## 1.168177
After removing PURCHASES_FREQUENCY, the overall
multicollinearity structure improves. However, VIF diagnostics indicate
that CASH_ADVANCE_TRX remains highly problematic,
exhibiting a VIF value well above 10. This suggests strong linear
dependence between transaction count and other cash-advance-related
variables. Hence, CASH_ADVANCE_TRX is removed in the next
model iteration to further mitigate multicollinearity and stabilise the
regression estimates.
reduced_mlr_2 <- lm(
PURCHASES ~ . -PURCHASES_FREQUENCY -CASH_ADVANCE_TRX,
data = df_mlr
)
vif(reduced_mlr_2)
## BALANCE BALANCE_FREQUENCY
## 4.667995 2.295581
## ONEOFF_PURCHASES INSTALLMENTS_PURCHASES
## 3.398304 5.779331
## CASH_ADVANCE ONEOFF_PURCHASES_FREQUENCY
## 3.372961 3.466866
## PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY
## 6.212336 2.548100
## PURCHASES_TRX CREDIT_LIMIT
## 9.695037 1.357867
## PAYMENTS MINIMUM_PAYMENTS
## 2.363191 3.055479
## PRC_FULL_PAYMENT TENURE
## 1.566818 1.140738
Now, the VIF values of all remaining variables fall below the
critical threshold of 10, indicating that serious multicollinearity is
no longer present. Although PURCHASES_TRX still shows a
relatively higher VIF than other predictors, its value remains within an
acceptable range and does not indicate severe redundancy. More
importantly, PURCHASES_TRX represents the number of
purchase transactions and captures customers’ purchasing activity
directly. Removing this variable would result in a loss of important
behavioural information. Therefore, PURCHASES_TRX is retained, and no
further variables are removed.
par(mfrow = c(2, 2))
plot(reduced_mlr_2)
Before we interpret the model’s result, diagnostic plots were examined
to evaluate the assumptions of the multiple linear regression model. The
Residuals vs Fitted plot shows a slight curved pattern, indicating that
the relationship between the predictors and PURCHASES is not perfectly
linear. The Q–Q plot reveals deviations at the tails, suggesting that
the residuals are not fully normally distributed. The Scale–Location
plot indicates some variation in residual spread across fitted values,
which implies mild heteroscedasticity. The Residuals vs Leverage plot
shows that no individual observation exceeds Cook’s distance thresholds,
indicating that no data point has an undue influence on the model.
Overall, these diagnostics suggest that while classical regression
assumptions are not perfectly satisfied, the violations are mild and
structural in nature rather than caused by individual predictors.
summary(reduced_mlr_2)
##
## Call:
## lm(formula = PURCHASES ~ . - PURCHASES_FREQUENCY - CASH_ADVANCE_TRX,
## data = df_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1475 -0.6808 -0.1776 0.6595 5.2179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.217248 0.129228 17.158 < 2e-16 ***
## BALANCE -0.062787 0.010731 -5.851 5.06e-09 ***
## BALANCE_FREQUENCY -0.027692 0.063976 -0.433 0.665139
## ONEOFF_PURCHASES 0.386883 0.005678 68.139 < 2e-16 ***
## INSTALLMENTS_PURCHASES 0.217031 0.007797 27.836 < 2e-16 ***
## CASH_ADVANCE -0.101865 0.005149 -19.784 < 2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY -2.558097 0.062402 -40.994 < 2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -1.588650 0.062704 -25.336 < 2e-16 ***
## CASH_ADVANCE_FREQUENCY 0.171426 0.079752 2.150 0.031622 *
## PURCHASES_TRX 1.554472 0.022662 68.593 < 2e-16 ***
## CREDIT_LIMIT 0.032845 0.014215 2.311 0.020875 *
## PAYMENTS 0.013088 0.009660 1.355 0.175490
## MINIMUM_PAYMENTS 0.022869 0.011545 1.981 0.047635 *
## PRC_FULL_PAYMENT 0.162921 0.043527 3.743 0.000183 ***
## TENURE -0.079471 0.007987 -9.950 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9458 on 8934 degrees of freedom
## Multiple R-squared: 0.895, Adjusted R-squared: 0.8948
## F-statistic: 5439 on 14 and 8934 DF, p-value: < 2.2e-16
Stepwise regression was applied to assess whether further variable reduction could improve the model’s behaviour.
step_mlr <- step(
reduced_mlr_2,
direction = "both",
trace = FALSE
)
summary(step_mlr)
##
## Call:
## lm(formula = PURCHASES ~ BALANCE + ONEOFF_PURCHASES + INSTALLMENTS_PURCHASES +
## CASH_ADVANCE + ONEOFF_PURCHASES_FREQUENCY + PURCHASES_INSTALLMENTS_FREQUENCY +
## CASH_ADVANCE_FREQUENCY + PURCHASES_TRX + CREDIT_LIMIT + MINIMUM_PAYMENTS +
## PRC_FULL_PAYMENT + TENURE, data = df_mlr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1295 -0.6810 -0.1776 0.6597 5.2182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.193172 0.124954 17.552 < 2e-16 ***
## BALANCE -0.066231 0.008671 -7.638 2.43e-14 ***
## ONEOFF_PURCHASES 0.388310 0.005582 69.565 < 2e-16 ***
## INSTALLMENTS_PURCHASES 0.218420 0.007710 28.330 < 2e-16 ***
## CASH_ADVANCE -0.100729 0.005043 -19.975 < 2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY -2.554060 0.061716 -41.384 < 2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -1.592288 0.061565 -25.863 < 2e-16 ***
## CASH_ADVANCE_FREQUENCY 0.172108 0.079475 2.166 0.030372 *
## PURCHASES_TRX 1.552382 0.022616 68.641 < 2e-16 ***
## CREDIT_LIMIT 0.037044 0.013729 2.698 0.006986 **
## MINIMUM_PAYMENTS 0.031632 0.009526 3.321 0.000902 ***
## PRC_FULL_PAYMENT 0.177668 0.040495 4.387 1.16e-05 ***
## TENURE -0.078391 0.007876 -9.954 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9458 on 8936 degrees of freedom
## Multiple R-squared: 0.895, Adjusted R-squared: 0.8948
## F-statistic: 6346 on 12 and 8936 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(step_mlr)
In stepwise regression model,
BALANCE_FREQUENCY and
PAYMENTS are removed. However, when the resulting
diagnostic plots were compared with previous the reduced model,
additional variable removal does not substantially change the residual
patterns. This suggests that the observed behaviour arises from the
underlying data characteristics rather than from redundant
predictors.Thus, the reduced multiple linear regression model was
retained for interpretation purpose.
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.3.3
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.3.1
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.3.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
final_mlr <- reduced_mlr_2
coeftest(final_mlr, vcov = vcovHC(final_mlr, type = "HC3"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2172484 0.1370937 16.1732 < 2.2e-16 ***
## BALANCE -0.0627871 0.0112041 -5.6039 2.158e-08 ***
## BALANCE_FREQUENCY -0.0276916 0.0639507 -0.4330 0.665015
## ONEOFF_PURCHASES 0.3868834 0.0066306 58.3478 < 2.2e-16 ***
## INSTALLMENTS_PURCHASES 0.2170307 0.0086326 25.1408 < 2.2e-16 ***
## CASH_ADVANCE -0.1018653 0.0054947 -18.5389 < 2.2e-16 ***
## ONEOFF_PURCHASES_FREQUENCY -2.5580971 0.0686045 -37.2876 < 2.2e-16 ***
## PURCHASES_INSTALLMENTS_FREQUENCY -1.5886498 0.0679727 -23.3719 < 2.2e-16 ***
## CASH_ADVANCE_FREQUENCY 0.1714261 0.0783186 2.1888 0.028635 *
## PURCHASES_TRX 1.5544719 0.0267667 58.0748 < 2.2e-16 ***
## CREDIT_LIMIT 0.0328451 0.0141387 2.3231 0.020198 *
## PAYMENTS 0.0130880 0.0103802 1.2609 0.207392
## MINIMUM_PAYMENTS 0.0228692 0.0126453 1.8085 0.070560 .
## PRC_FULL_PAYMENT 0.1629212 0.0439206 3.7095 0.000209 ***
## TENURE -0.0794713 0.0085934 -9.2480 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To deal with the mild heteroscedasticity observed in the residual diagnostics previously, robust standard errors were applied. Comparing the original regression output with the robust results shows that coefficient estimates remain unchanged, while standard errors and significance levels are slightly adjusted for some variables. This confirms that the direction and magnitude of relationships are stable, while statistical inference is made more reliable under non-constant error variance. The use of robust standard errors therefore strengthens the validity of hypothesis testing without altering the substantive interpretation of the model.
The reduced multiple linear regression model explains a large proportion of the variation in customers’ total purchase spending, with an adjusted R² of approximately 0.895, indicating strong explanatory performance. The results show that customers who make larger individual purchases, spend more through instalment plans, and carry out a higher number of purchase transactions tend to have substantially higher overall purchase amounts. This highlights the importance of both transaction size and transaction volume in driving total spending. In contrast, customers who rely more heavily on cash withdrawals from their credit cards tend to spend less on regular purchase transactions, suggesting a shift away from conventional consumption behaviour. In addition, customers with higher credit limits and those who pay a larger proportion of their balance in full are associated with higher purchase spending, reflecting the role of credit capacity and disciplined repayment behaviour. However, the frequency with which account balances are updated does not show a meaningful relationship with total purchase amount once other financial and behavioural factors are taken into account. Overall, the findings indicate that direct spending behaviour and financial capacity are more influential determinants of purchase activity than account maintenance patterns.
Although the reduced multiple linear regression model demonstrates strong explanatory power, diagnostic analysis indicates mild departures from linearity and homoscedasticity, which are common in large-scale financial behaviour data. In such contexts, a regularized regression approach, such as LASSO regression, may be suitable as a complementary model. LASSO is particularly useful when predictors are correlated, as it performs automatic variable selection and reduces coefficient instability by shrinking less informative predictors toward zero. This makes it well suited for assessing predictive relevance rather than statistical significance. Alternatively, for capturing potential non-linear relationships and interaction effects inherent in customer spending behaviour, tree-based models such as Random Forest could be considered, as they do not rely on linearity assumptions. These models can serve as robustness or predictive benchmarks alongside the interpretable linear regression framework.
y <- df_mlr$PURCHASES
X <- model.matrix(PURCHASES ~ ., data = df_mlr)[, -1]
set.seed(123)
cv_lasso <- cv.glmnet(
X, y,
alpha = 1, # alpha = 1 → LASSO
standardize = TRUE, # let glmnet standardize internally
nfolds = 10
)
plot(cv_lasso)
cv_lasso$lambda.min
## [1] 0.001666535
cv_lasso$lambda.1se
## [1] 0.01175707
The optimal regularisation parameter was selected using 10-fold cross-validation. The value λₘᵢₙ corresponds to the minimum cross-validated prediction error, while λ₁ₛₑ represents a more conservative choice within one standard error of the minimum. The relatively small λ values are expected due to predictor standardisation and the log-transformed response variable. Using 10-fold cross-validation, the optimal regularisation parameter was identified at λ₁ₛₑ = 0.0118, which favours a more stable and conservative model.
coef_lasso_1se <- coef(cv_lasso, s = "lambda.1se")
coef_lasso_1se
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.62058269
## BALANCE .
## BALANCE_FREQUENCY -0.44453443
## ONEOFF_PURCHASES 0.41921889
## INSTALLMENTS_PURCHASES 0.23689770
## CASH_ADVANCE -0.08701131
## PURCHASES_FREQUENCY 2.71526133
## ONEOFF_PURCHASES_FREQUENCY -3.10468189
## PURCHASES_INSTALLMENTS_FREQUENCY -2.68447365
## CASH_ADVANCE_FREQUENCY .
## CASH_ADVANCE_TRX .
## PURCHASES_TRX 1.08026058
## CREDIT_LIMIT 0.02075249
## PAYMENTS 0.01758830
## MINIMUM_PAYMENTS .
## PRC_FULL_PAYMENT 0.11529048
## TENURE -0.02190428
lasso_coef_df <- as.data.frame(as.matrix(coef_lasso_1se))
colnames(lasso_coef_df) <- "Coefficient"
lasso_coef_df$Variable <- rownames(lasso_coef_df)
lasso_coef_df
## Coefficient Variable
## (Intercept) 1.62058269 (Intercept)
## BALANCE 0.00000000 BALANCE
## BALANCE_FREQUENCY -0.44453443 BALANCE_FREQUENCY
## ONEOFF_PURCHASES 0.41921889 ONEOFF_PURCHASES
## INSTALLMENTS_PURCHASES 0.23689770 INSTALLMENTS_PURCHASES
## CASH_ADVANCE -0.08701131 CASH_ADVANCE
## PURCHASES_FREQUENCY 2.71526133 PURCHASES_FREQUENCY
## ONEOFF_PURCHASES_FREQUENCY -3.10468189 ONEOFF_PURCHASES_FREQUENCY
## PURCHASES_INSTALLMENTS_FREQUENCY -2.68447365 PURCHASES_INSTALLMENTS_FREQUENCY
## CASH_ADVANCE_FREQUENCY 0.00000000 CASH_ADVANCE_FREQUENCY
## CASH_ADVANCE_TRX 0.00000000 CASH_ADVANCE_TRX
## PURCHASES_TRX 1.08026058 PURCHASES_TRX
## CREDIT_LIMIT 0.02075249 CREDIT_LIMIT
## PAYMENTS 0.01758830 PAYMENTS
## MINIMUM_PAYMENTS 0.00000000 MINIMUM_PAYMENTS
## PRC_FULL_PAYMENT 0.11529048 PRC_FULL_PAYMENT
## TENURE -0.02190428 TENURE
selected_vars <- lasso_coef_df %>%
filter(Coefficient != 0)
selected_vars
## Coefficient Variable
## (Intercept) 1.62058269 (Intercept)
## BALANCE_FREQUENCY -0.44453443 BALANCE_FREQUENCY
## ONEOFF_PURCHASES 0.41921889 ONEOFF_PURCHASES
## INSTALLMENTS_PURCHASES 0.23689770 INSTALLMENTS_PURCHASES
## CASH_ADVANCE -0.08701131 CASH_ADVANCE
## PURCHASES_FREQUENCY 2.71526133 PURCHASES_FREQUENCY
## ONEOFF_PURCHASES_FREQUENCY -3.10468189 ONEOFF_PURCHASES_FREQUENCY
## PURCHASES_INSTALLMENTS_FREQUENCY -2.68447365 PURCHASES_INSTALLMENTS_FREQUENCY
## PURCHASES_TRX 1.08026058 PURCHASES_TRX
## CREDIT_LIMIT 0.02075249 CREDIT_LIMIT
## PAYMENTS 0.01758830 PAYMENTS
## PRC_FULL_PAYMENT 0.11529048 PRC_FULL_PAYMENT
## TENURE -0.02190428 TENURE
The retained predictors largely correspond to transaction-driven and repayment-related behaviours. Variables representing the number of purchase transactions, one-off purchase amount, and instalment purchase amount remain positively associated with total purchase spending, reinforcing the importance of transaction intensity and spending structure. In contrast, variables related to cash advance usage and certain frequency measures are either negatively associated or removed entirely, suggesting that these behaviours contribute less to explaining purchase spending once regularisation is imposed. Notably, account balance level is excluded by LASSO, indicating that while balance may appear statistically significant in linear regression after multicollinearity control, it is not essential for predictive performance when competing correlated variables are present.
rmse_lasso_min <- sqrt(
cv_lasso$cvm[cv_lasso$lambda == cv_lasso$lambda.min]
)
rmse_lasso_min
## [1] 0.8778477
The LASSO model achieves a root mean squared error (RMSE) of approximately 0.88, which is comparable to the reduced multiple linear regression model. This suggests that reducing the number of predictors does not substantially compromise predictive accuracy. The results indicate that a compact set of behavioural variables is sufficient to explain most of the variation in purchase spending, supporting the robustness of the behavioural patterns identified earlier.
fviz_nbclust(df_scaled, kmeans, method='silhouette')
fviz_nbclust(df_scaled, kmeans, method = "wss")
set.seed(12345)
k2 <- kmeans(df_scaled, centers = 2 , nstart = 25)
# Calculate instead of printing since too lengthy
bss_tss_ratio_k2 <- (k2$betweenss / k2$totss) * 100
cat(sprintf("between_SS / total_SS = %5.1f %%\n", bss_tss_ratio_k2))
## between_SS / total_SS = 25.3 %
fviz_cluster(k2, data = df_scaled)
Some degree of overlap is observed between the two clusters. The sum of squares obtained for this two-cluster analysis is 21.5%, which represents that small proportion of the total variance in the dataset is explained by the clustering. The low between cluster to total sum of squares (BSS/TSS) ratio suggests that it may not provide a best fit.
Given the limited variance explained and the overlap between clusters, a two-cluster solution may be insufficient to capture the diversity of credit card usage behaviours. Therefore, it leads us to examine the elbow plot to determine whether adding additional clusters is still justified.
set.seed(12345)
k4 <- kmeans(df_scaled, centers = 4 , nstart = 25)
fviz_cluster(k4, data = df_scaled)
Many degrees of overlap are observed between the clusters, especially Cluster 2 and 3. The clusters seemed to originate from a shared central region. The dense overlapping area represents the customers with moderate to low activity levels. These people are believed not to have strong differentiating behaviours. However, as the distance from the center increases, the customers’ behaviours are diverging outwards into distinct behavioral extremes shown at the outer tips. The graph captures the total variance up to 43.3% from the first 2 principles while the rest might be hidden from this 2D view, which explains the visual overlap. The increasing percentage of total variance being explained suggests that extending the model from two clusters to four clusters is reasonable and it can improve the segment compactness compared to the two-cluster model.
dist_mat <- dist(df_scaled)
sil_k4 <- silhouette(k4$cluster, dist_mat)
mean(sil_k4[, 3])
## [1] 0.2225641
The average silhouette score for the four-cluster solution is 0.22, indicating moderate overlap among clusters. This suggests that customer spending behaviour forms a continuum rather than strictly separated groups, which is common in financial transaction data. Despite the overlap, the clusters exhibit distinct behavioural profiles and remain suitable for segmentation and interpretation
To have a better understanding at the overlapping visual and the characteristics driving this divergence, we will look at their cluster means and visual profiles accordingly.
bss_tss_ratio <- (k4$betweenss / k4$totss) * 100
cat(sprintf("between_SS / total_SS = %5.1f %%\n", bss_tss_ratio)) ; cat("K-means clustering with 4 clusters of sizes: [", k4$size, "]", "\n") ; cat("Cluster means: ", "\n") ; print(k4$centers)
## between_SS / total_SS = 43.3 %
## K-means clustering with 4 clusters of sizes: [ 2753 1521 2103 2572 ]
## Cluster means:
## BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES
## 1 0.6139900 0.2615515 -1.11152051 -0.58129630
## 2 -1.2163200 -1.4840022 -0.05123344 0.01143858
## 3 -0.4539744 0.2167806 0.43105973 -0.60282677
## 4 0.4332878 0.4203835 0.86758298 1.10834189
## INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY
## 1 -0.9273885 1.0058457 -1.0376709
## 2 -0.4582181 -0.6300383 -0.6018928
## 3 0.8749787 -0.6844516 0.7862244
## 4 0.5481999 -0.1444025 0.8237780
## ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY
## 1 -0.5355237 -0.8440946
## 2 -0.3025640 -0.5876509
## 3 -0.5464054 0.9888346
## 4 1.1989063 0.4424923
## CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS
## 1 0.8387892 0.9393544 -1.0630909 -0.01768346 0.1427545
## 2 -0.5604024 -0.6230793 -0.4455991 -0.27017542 -0.9445516
## 3 -0.5507468 -0.6345278 0.4949848 -0.43002370 -0.1784366
## 4 -0.1160942 -0.1181676 0.9966923 0.53031074 0.5516765
## MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
## 1 0.4225857 -0.403527657 -0.08988408
## 2 -1.1173981 0.007804537 -0.15575872
## 3 -0.1309691 0.429805495 -0.03314988
## 4 0.3155568 0.075878686 0.21542539
# Need original data in numeric
data_for_radar <- df_main[, sapply(df_main, is.numeric)]
# Add Cluster Labels
data_for_radar$cluster <- k4$cluster
# Mean for each cluster
cluster_means <- aggregate(. ~ cluster, data = data_for_radar, mean)
# Normalize the data (0 to 1 scaling)
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x)))
}
cluster_means_norm <- as.data.frame(lapply(cluster_means[, -1], normalize))
# Rename because long name will overlap when showing
colnames(cluster_means_norm) <- c(
"Balance", "Bal_Freq", "Purchases", "OneOff_Purch", "Install_Purch", "Cash_Advance", "Purch_Freq", "OneOff_Purch_Freq", "Inst_Freq", "Cash_Freq", "Cash_Adv_Trx", "Purch_Trx", "Credit_Limit", "Payments", "Min_Pay", "Prc_Full_Pay", "Tenure"
)
data_plot <- rbind(rep(1, ncol(cluster_means_norm)),
rep(0, ncol(cluster_means_norm)),
cluster_means_norm)
colors_border <- c("#F88379", "#79c314", "#00c5cd","#B480FF")
par(mar = c(2, 2, 2, 2))
radarchart(data_plot, axistype = 1, pcol = colors_border,
pfcol = scales::alpha(colors_border, 0.4),
plwd = 2, plty = 1, cglcol = "grey", cglty = 1, axislabcol = "grey",
caxislabels = seq(0, 1, 0.25),
title = "Profile of The 4 Customer Segments",
vlcex = 0.75)
legend(x = "topright", legend = paste("Cluster", cluster_means$cluster),
bty = "n", pch = 20, col = colors_border, pt.cex = 2)
Cluster 1: At-Risk Customers
Cluster 1 represents a large segment, accounting for approximately 30% of the total customer base (2,753 customers). It shows a highly distinct and extreme segment in the radar chart. The red polygon spikes aggressively downwards, dominating the cash-related axes, while collapsing inward on purchase-related dimensions. This visual is further supported by the cluster centroids, where this group holds the highest positive scores for Cash Advance Transactions (+0.94) and Cash Advance Frequency (+0.84). In contrast, they hold the lowest values for spending compared to other clusters, with very low scores for Purchases (−1.11), One-Off Purchases (−0.58) and Purchases Frequency (−1.04). This result confirms they rely on the card for liquid cash withdrawals rather than merchant transactions. This also suggests the cluster is likely facing financial pressure and is simply paying off their debt. Therefore, Cluster 1 represents a high-risk segment since reliance on cash advances is an expensive form of credit.
Cluster 2: Dormant Customers
Cluster 2 represents the smallest segment, accounting for approximately 17% of the total customer base (1,521 customers). It shows a smaller polygon in overall dimension compared to other clusters, indicating that they have minimal activities. This lack of engagement is proven statistically. They record the lowest values among all clusters for Payments (−0.94) and Minimum Payments (−1.12), respectively. They also show low spending behaviour with Purchases (−0.05) and the lowest Balance (−1.22). It is shown that their engagement levels are massively below the average. This profile suggests that the customers in Cluster 2 rarely use the card, making them a cluster with low revenue but also low risk.
Cluster 3: Loyal Customers
Cluster 3 has shown a distinct preference for structured financing. They are the most regular consumers of installment plans, even though they do not have the purchasing power of the high-value cluster. This is seen in the chart where the cyan polygon on the installment frequency axis reaches the outer boundary of the radar chart. This also aligns with their highest centroid score for Installment Purchases Frequency (+0.99). In contrast, their low cash usage shown by the inward polygon indicates they use credit cards mainly for transactions rather than cash withdrawals, as reflected by low Cash Advance Transactions (−0.63) and low Cash Advance Frequency (−0.55). Furthermore, they exhibit the most responsible repayment behaviour in the portfolio, shown by the highest Prc_Full_Payment score (+0.43). This makes them a low-risk segment for credit card use since they clear their statement balances consistently. Therefore, they are customers who use credit cards for planned installment purchases and maintain a good repayment record.
Cluster 4: High-Value Customers
Cluster 4 is the most active and high-value spending segment, accounting for approximately 28% of the total customer base (2,572 customers). Visually, their purple polygon expands the most along the one-off purchase and credit limit axes, indicating a profile with significant spending power. The statistics confirm this pattern, as they record the highest centroid values for Purchases (+0.87) and One-Off Purchases (+1.11). Moreover, they possess the highest credit limit (+0.53) and a relatively higher tenure (+0.22), indicating that they are long-standing and trusted customers with strong purchasing capacity.
Cluster 1: At-Risk Customers
Bank should revise the credit limits offered to this segment and should not increase limits for them. Instead, bank should flag this cluster and monitor their card usage. It may offer low-interest personal loans to help customers manage their cash flow while reducing the risk of bad debts.
Cluster 2: Dormant Customers
Bank can plan strategies to increase the engagement of Cluster 2. It may launch low-barrier promotional campaigns to kickstart activities. For example, offer zero annual fee and cashback in dining and grocery category with low minimum spend. This aims to attract dormant customers to resume spending on basic daily expenses like dining and groceries.
Cluster 3: Loyal Customers
Bank can plan to further locking the loyalty of these clients. Bank can strengthen partnerships with electronics, furniture, and appliance retailers to offer exclusive 0% interest installment plans to the credit card holders. This can cater to the purchasing preferences of these clients directly. Buy Now Pay Later (BNPL) plans can be offered to this cluster for short-term options via certain business partner.
Cluster 4: High-Value Customers
Bank should provide experiential rewards instead of generic cashback offers to this cluster. For example, premium airport lounge access with accompanying guest, overseas spend benefits, and concierge services, which align with their lifestyle and purchasing power. To retain these high-value customers, the marketing need to pay more attention to their customer experience.
This study aimed to explore credit card customer behaviour through exploratory data analysis, examine the relationship between behavioural attributes and purchase spending using multiple linear regression, and identify distinct customer segments through clustering techniques. These objectives were achieved by systematically preprocessing the data, examining variable distributions and relationships, and applying appropriate transformations to address skewness and scale differences. Together, these steps provided a structured and coherent understanding of customer spending, repayment, and credit usage behaviour beyond what can be observed from raw transactional data alone.
The analytical results reveal several important behavioural insights. Regression analysis indicates that total purchase spending is primarily driven by transaction-related factors, including the number of purchase transactions, one-off purchase amounts, and instalment spending, while cash advance usage is negatively associated with purchase activity. This suggests a clear behavioural distinction between customers who use credit cards for consumption and those who rely on them for liquidity. Clustering analysis further supports this separation by identifying customer groups with distinct usage profiles, including high-value spenders, instalment-oriented customers with disciplined repayment behaviour, cash-advance-reliant users, and largely inactive customers. Although cluster separation is not sharp, the identified segments exhibit meaningful and interpretable differences in spending intensity, cash usage, and repayment patterns, making them suitable for behavioural profiling and segmentation purposes.
Despite the usefulness of these findings, several methodological limitations remain. Diagnostic analysis of the regression model reveals mild non-linearity and heteroscedasticity, indicating that linear regression may not fully capture the complex relationships present in customer behaviour. Future work could therefore consider more flexible modelling approaches, such as LASSO regression to handle correlated predictors and improve model stability, or Random Forest models to capture non-linear effects and interaction patterns without strict distributional assumptions. Similarly, while k-means clustering provides an intuitive and interpretable segmentation, its reliance on Euclidean distance and spherical cluster assumptions may limit its effectiveness for heterogeneous financial data. Alternative approaches such as hierarchical clustering, Gaussian mixture models, or density-based methods could be explored to capture more nuanced cluster structures. The relatively low silhouette score observed in this study reflects the continuous and overlapping nature of real-world credit card behaviour rather than poor clustering quality, highlighting the challenge of achieving strong geometric separation in behavioural datasets. Overall, these extensions would further enhance the robustness and explanatory power of customer segmentation analysis.