The relationship between customer characteristics and default
risk.
Research Objective:
Analyze the potential relationship between demographic characteristics
(gender, education level, marital status, age) and default risk (default
payment next month). This can help to find out which groups are more
likely to default based on customer attributes, thus providing decision
support for credit risk management.
Dataset: https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(ggplot2)
file_path <- "C:/Users/admin/Desktop/unsupervied-learning/project/association rule/default of credit card clients.xlsx"
credit <- read_excel(file_path)
## New names:
## • `` -> `...1`
# Convert the first row to column names and remove that row
colnames(credit) <- as.character(unlist(credit[1, ]))
credit <- credit[-1, ]
# Remove the ID column (not useful for association rules)
credit <- credit %>% select(-ID)
# Check dataset structure and dimensions
head(credit)
## # A tibble: 6 × 24
## LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 20000 2 2 1 24 2 2 -1 -1 -2 -2
## 2 120000 2 2 2 26 -1 2 0 0 0 2
## 3 90000 2 2 2 34 0 0 0 0 0 0
## 4 50000 2 2 1 37 0 0 0 0 0 0
## 5 50000 1 2 1 57 -1 0 -1 0 0 0
## 6 50000 1 1 2 37 0 0 0 0 0 0
## # ℹ 13 more variables: BILL_AMT1 <chr>, BILL_AMT2 <chr>, BILL_AMT3 <chr>,
## # BILL_AMT4 <chr>, BILL_AMT5 <chr>, BILL_AMT6 <chr>, PAY_AMT1 <chr>,
## # PAY_AMT2 <chr>, PAY_AMT3 <chr>, PAY_AMT4 <chr>, PAY_AMT5 <chr>,
## # PAY_AMT6 <chr>, `default payment next month` <chr>
str(credit)
## tibble [30,000 × 24] (S3: tbl_df/tbl/data.frame)
## $ LIMIT_BAL : chr [1:30000] "20000" "120000" "90000" "50000" ...
## $ SEX : chr [1:30000] "2" "2" "2" "2" ...
## $ EDUCATION : chr [1:30000] "2" "2" "2" "2" ...
## $ MARRIAGE : chr [1:30000] "1" "2" "2" "1" ...
## $ AGE : chr [1:30000] "24" "26" "34" "37" ...
## $ PAY_0 : chr [1:30000] "2" "-1" "0" "0" ...
## $ PAY_2 : chr [1:30000] "2" "2" "0" "0" ...
## $ PAY_3 : chr [1:30000] "-1" "0" "0" "0" ...
## $ PAY_4 : chr [1:30000] "-1" "0" "0" "0" ...
## $ PAY_5 : chr [1:30000] "-2" "0" "0" "0" ...
## $ PAY_6 : chr [1:30000] "-2" "2" "0" "0" ...
## $ BILL_AMT1 : chr [1:30000] "3913" "2682" "29239" "46990" ...
## $ BILL_AMT2 : chr [1:30000] "3102" "1725" "14027" "48233" ...
## $ BILL_AMT3 : chr [1:30000] "689" "2682" "13559" "49291" ...
## $ BILL_AMT4 : chr [1:30000] "0" "3272" "14331" "28314" ...
## $ BILL_AMT5 : chr [1:30000] "0" "3455" "14948" "28959" ...
## $ BILL_AMT6 : chr [1:30000] "0" "3261" "15549" "29547" ...
## $ PAY_AMT1 : chr [1:30000] "0" "0" "1518" "2000" ...
## $ PAY_AMT2 : chr [1:30000] "689" "1000" "1500" "2019" ...
## $ PAY_AMT3 : chr [1:30000] "0" "1000" "1000" "1200" ...
## $ PAY_AMT4 : chr [1:30000] "0" "1000" "1000" "1100" ...
## $ PAY_AMT5 : chr [1:30000] "0" "0" "1000" "1069" ...
## $ PAY_AMT6 : chr [1:30000] "0" "2000" "5000" "1000" ...
## $ default payment next month: chr [1:30000] "1" "1" "0" "0" ...
dim(credit)
## [1] 30000 24
The dataset consists of 30000 observations and 24 variables.
credit <- credit %>%
# 1) Rename the original "default payment next month" column to "DEFAULT", for clarity
rename(DEFAULT = `default payment next month`) %>%
# 2) Use a single mutate() to perform various type conversions, binning, and factorization
mutate(
# Convert SEX, EDUCATION, and MARRIAGE into factors
SEX = as.factor(SEX),
EDUCATION = as.factor(EDUCATION),
MARRIAGE = as.factor(MARRIAGE),
# Convert AGE to numeric and then bin it
AGE = as.numeric(AGE),
AGE_BIN = cut(
x = AGE,
breaks = c(0, 25, 35, 50, 150),
labels = c("Age_<=25", "Age_26_35", "Age_36_50", "Age_>50")
),
# Convert LIMIT_BAL to numeric and then bin it
LIMIT_BAL = as.numeric(LIMIT_BAL),
LIMIT_BAL_BIN = cut(
x = LIMIT_BAL,
breaks = c(-Inf, 100000, 300000, 600000, Inf),
labels = c("Limit_Low", "Limit_Mid", "Limit_High", "Limit_VeryHigh")
),
# Convert DEFAULT from 0/1 to "Default_No"/"Default_Yes"
DEFAULT = as.factor(ifelse(DEFAULT == 1, "Default_Yes", "Default_No"))
) %>%
# 3) Select only the columns needed for the association rules analysis
select(SEX, EDUCATION, MARRIAGE, AGE_BIN, LIMIT_BAL_BIN, DEFAULT)
# Inspect the transformed data
str(credit)
## tibble [30,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ SEX : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 2 1 ...
## $ EDUCATION : Factor w/ 7 levels "0","1","2","3",..: 3 3 3 3 3 2 2 3 4 4 ...
## $ MARRIAGE : Factor w/ 4 levels "0","1","2","3": 2 3 3 2 2 3 3 3 2 3 ...
## $ AGE_BIN : Factor w/ 4 levels "Age_<=25","Age_26_35",..: 1 2 2 3 4 3 2 1 2 2 ...
## $ LIMIT_BAL_BIN: Factor w/ 4 levels "Limit_Low","Limit_Mid",..: 1 2 1 1 1 1 3 1 2 1 ...
## $ DEFAULT : Factor w/ 2 levels "Default_No","Default_Yes": 2 2 1 1 1 1 1 1 1 1 ...
head(credit)
## # A tibble: 6 × 6
## SEX EDUCATION MARRIAGE AGE_BIN LIMIT_BAL_BIN DEFAULT
## <fct> <fct> <fct> <fct> <fct> <fct>
## 1 2 2 1 Age_<=25 Limit_Low Default_Yes
## 2 2 2 2 Age_26_35 Limit_Mid Default_Yes
## 3 2 2 2 Age_26_35 Limit_Low Default_No
## 4 2 2 1 Age_36_50 Limit_Low Default_No
## 5 1 2 1 Age_>50 Limit_Low Default_No
## 6 1 1 2 Age_36_50 Limit_Low Default_No
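The age bins depend on `cut()` closing each interval on the right, which is its default behaviour. The check below is a small sketch of how the breaks treat values that sit exactly on a boundary; the ages are made-up probe values, not rows from the dataset.

```r
# Hypothetical boundary ages, chosen only to probe the bin edges
ages <- c(25, 26, 35, 36, 50, 51)

age_bin <- cut(
  x = ages,
  breaks = c(0, 25, 35, 50, 150),  # same breaks as in the mutate() call
  labels = c("Age_<=25", "Age_26_35", "Age_36_50", "Age_>50")
)  # right = TRUE is the default, so the intervals are (0,25], (25,35], ...

print(data.frame(age = ages, bin = age_bin))
```

A customer aged exactly 25 therefore falls in `Age_<=25`, and one aged exactly 50 in `Age_36_50`, which matches the labels.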
# Convert the current data.frame into transactions
credit_trans <- as(credit, "transactions")
# Check the transaction summary
credit_trans
## transactions in sparse format with
## 30000 transactions (rows) and
## 23 items (columns)
summary(credit_trans)
## transactions as itemMatrix in sparse format with
## 30000 rows (elements/itemsets/transactions) and
## 23 columns (items) and a density of 0.2608696
##
## most frequent items:
## DEFAULT=Default_No SEX=2 MARRIAGE=2 EDUCATION=2
## 23364 18112 15964 14030
## MARRIAGE=1 (Other)
## 13659 94871
##
## element (itemset/transaction) length distribution:
## sizes
## 6
## 30000
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 6 6 6 6 6
##
## includes extended item information - examples:
## labels variables levels
## 1 SEX=1 SEX 1
## 2 SEX=2 SEX 2
## 3 EDUCATION=0 EDUCATION 0
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
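The item count and density in this summary follow from the factor levels printed by `str(credit)`: every level becomes one item, and each row sets exactly one item per column. A short arithmetic sketch, with the level counts copied by hand from that output:

```r
# Number of factor levels per column, as reported by str(credit)
levels_per_column <- c(
  SEX = 2, EDUCATION = 7, MARRIAGE = 4,
  AGE_BIN = 4, LIMIT_BAL_BIN = 4, DEFAULT = 2
)

n_items <- sum(levels_per_column)                # one item per factor level
density <- length(levels_per_column) / n_items   # 6 items set per row

print(c(items = n_items, density = round(density, 7)))
```

This reproduces the 23 items and the density of about 0.26 shown above, and explains why every transaction has length 6.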
Apriori is an algorithm for frequent item set mining and association
rule learning over relational databases. It proceeds by identifying the
frequent individual items in the database and extending them to larger
and larger item sets as long as those item sets appear sufficiently
often in the database. The frequent item sets determined by Apriori can
be used to determine association rules which highlight general trends in
the database: this has applications in domains such as market basket
analysis.
In the context of analyzing credit card customer data, Apriori is
particularly useful for uncovering combinations of demographic
attributes (such as gender, education level, marital status, and age)
that correlate with a higher risk of default. By binning continuous
attributes like credit limit and age, we transform the dataset into a
transaction-style format. Then, Apriori helps identify frequent
itemsets—for instance, characteristic groups that appear frequently—and
shows how these groups link to default outcomes.
The intuitive “if–then” structure of Apriori’s output gives credit analysts and risk managers direct insights into which demographic profiles or credit usage patterns are most strongly associated with default. This makes it easier to implement targeted interventions or refine credit policies based on clear, interpretable rules rather than black-box predictions alone.
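Each rule is judged by three measures: support, confidence, and lift. As a rough illustration, the sketch below recomputes them by hand for the rule {AGE_BIN=Age_>50} => {DEFAULT=Default_No}, using counts that appear in this report's output. The LHS count of 2,269 is not printed directly; it is derived here from the reported coverage (0.07563333 × 30,000).

```r
n      <- 30000  # total transactions
n_lhs  <- 2269   # customers with AGE_BIN=Age_>50 (derived from coverage)
n_rhs  <- 23364  # customers with DEFAULT=Default_No
n_both <- 1692   # customers matching both sides of the rule

support    <- n_both / n                 # how often the whole rule occurs
confidence <- n_both / n_lhs             # P(RHS | LHS)
lift       <- confidence / (n_rhs / n)   # confidence relative to P(RHS)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

A lift above 1 means the LHS makes the RHS more likely than its overall rate, and a lift below 1 means less likely. Here the lift is just under 1, so being over 50 slightly lowers the chance of "no default" relative to the population.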
rules <- apriori(
data = credit_trans,
parameter = list(
supp = 0.05, # Minimum support (5%)
conf = 0.6, # Minimum confidence (60%)
minlen = 2 # Minimum rule length
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1500
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23 item(s), 30000 transaction(s)] done [0.01s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [306 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Check the result of rule mining
rules
## set of 306 rules
summary(rules)
## set of 306 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 26 112 134 34
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.575 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05003 Min. :0.6002 Min. :0.05867 Min. :0.8683
## 1st Qu.:0.06044 1st Qu.:0.6634 1st Qu.:0.08737 1st Qu.:1.0164
## Median :0.08343 Median :0.7285 Median :0.11568 Median :1.0686
## Mean :0.10813 Mean :0.7313 Mean :0.14967 Mean :1.1861
## 3rd Qu.:0.12339 3rd Qu.:0.7994 3rd Qu.:0.17288 3rd Qu.:1.3904
## Max. :0.47830 Max. :0.8900 Max. :0.77880 Max. :1.9012
## count
## Min. : 1501
## 1st Qu.: 1813
## Median : 2503
## Mean : 3244
## 3rd Qu.: 3702
## Max. :14349
##
## mining info:
## data ntransactions support confidence
## credit_trans 30000 0.05 0.6
## call
## apriori(data = credit_trans, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))
The maximum lift (~1.90) is moderate—these rules do show noticeable relationships, but not extremely strong ones (e.g., a lift of 3 or 4 would be very strong).
# Inspect the first 10 rules
inspect(head(rules, n = 10))
## lhs rhs support
## [1] {AGE_BIN=Age_>50} => {MARRIAGE=1} 0.05646667
## [2] {AGE_BIN=Age_>50} => {DEFAULT=Default_No} 0.05640000
## [3] {AGE_BIN=Age_<=25} => {LIMIT_BAL_BIN=Limit_Low} 0.10220000
## [4] {AGE_BIN=Age_<=25} => {MARRIAGE=2} 0.11376667
## [5] {AGE_BIN=Age_<=25} => {SEX=2} 0.09020000
## [6] {AGE_BIN=Age_<=25} => {DEFAULT=Default_No} 0.09463333
## [7] {LIMIT_BAL_BIN=Limit_High} => {DEFAULT=Default_No} 0.12950000
## [8] {EDUCATION=3} => {DEFAULT=Default_No} 0.12266667
## [9] {EDUCATION=1} => {MARRIAGE=2} 0.22696667
## [10] {EDUCATION=1} => {DEFAULT=Default_No} 0.28496667
## confidence coverage lift count
## [1] 0.7465844 0.07563333 1.6397637 1694
## [2] 0.7457030 0.07563333 0.9575025 1692
## [3] 0.7920434 0.12903333 1.9012084 3066
## [4] 0.8816843 0.12903333 1.6568861 3413
## [5] 0.6990442 0.12903333 1.1578691 2706
## [6] 0.7334022 0.12903333 0.9417080 2839
## [7] 0.8664139 0.14946667 1.1124986 3885
## [8] 0.7484238 0.16390000 0.9609962 3680
## [9] 0.6432688 0.35283333 1.2088489 6809
## [10] 0.8076523 0.35283333 1.0370472 8549
# Subset: rules where the consequent (rhs) contains "DEFAULT=Default_Yes"
rules_default_yes <- subset(rules, subset = rhs %pin% "Default_Yes")
inspect(rules_default_yes)
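The `inspect()` call above prints nothing because no mined rule has `Default_Yes` on the right-hand side. Only about 22% of customers default, so a rule predicting default with confidence of at least 0.6 would need a lift well above the maximum reported by `summary(rules)`. The sketch below checks that arithmetic and then shows one possible way to target default rules with the `appearance` argument of `apriori()`. The thresholds `supp = 0.01` and `conf = 0.25` are illustrative assumptions, not values used in this analysis, and whether any rules clear them is not shown in this report.

```r
n_total  <- 30000
n_no     <- 23364                        # DEFAULT=Default_No count from summary(credit_trans)
base_yes <- (n_total - n_no) / n_total   # overall default rate
min_lift <- 0.6 / base_yes               # smallest lift a Default_Yes rule could have at conf >= 0.6

print(round(c(base_yes = base_yes, min_lift = min_lift), 4))

# The mining step only runs in a session where credit_trans already exists
if (exists("credit_trans")) {
  library(arules)
  rules_yes <- apriori(
    data = credit_trans,
    parameter = list(supp = 0.01, conf = 0.25, minlen = 2),  # assumed thresholds
    appearance = list(rhs = "DEFAULT=Default_Yes", default = "lhs")
  )
  inspect(head(sort(rules_yes, by = "lift"), 10))
}
```

Restricting the RHS this way keeps the search focused on default outcomes, and the confidence threshold sits just above the base default rate so that rules with a lift greater than 1 can pass.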
Age and Credit Limit:
Younger customers (Age ≤ 25) strongly tend to have a lower credit limit
(rule 3, lift ≈ 1.90), which aligns with typical bank policies assigning
lower limits to younger or less-established clients.
Age and Default:
Customers over 50 mostly do not default (confidence 74.6%), but the
lift is ≈ 0.96, so they are slightly less likely than average to have
no default. The rules do not show lower risk among older borrowers.
Younger customers (≤25) also skew “no default” (confidence 73.3%), but
the lift is ≈ 0.94, indicating they are slightly less safe than
the population average, though the difference is not large.
Marital Status:
Age > 50 strongly associates with “Married” (lift ≈ 1.64), while Age
≤ 25 strongly associates with “Single” (lift ≈ 1.66). These findings are
intuitively consistent with life stages.
Education and Default:
EDUCATION=1 (graduate school in the UCI coding) shows a weak positive
association with “no default” (lift ≈ 1.04), while EDUCATION=3 (high
school) sits slightly below the baseline (lift ≈ 0.96). Education on
its own shifts default risk only marginally.
Lift Values:
Lifts range from about 0.87 to 1.90 with a median near 1.07, so at
least half of the rules sit close to independence. The strongest rules
link age to credit limit and marital status rather than to default.
Still, these rules can guide preliminary segmentation (e.g., customers
with high limits are relatively lower risk, lift ≈ 1.11).
# Basic scatter plot (support vs. confidence) with color shading for lift
plot(rules, measure="support", shading="lift")
# Grouped matrix visualization
plot(rules, method = "grouped", k = 18)
# Focus on top 20 rules by lift
sub_rules <- head(sort(rules, by="lift", decreasing=TRUE), 20)
plot(sub_rules, method = "graph", engine = "html")
rules_new <- apriori(
data = credit_trans,
parameter = list(
supp = 0.03, # Lower support threshold (from 0.05 to 0.03)
conf = 0.7, # Higher confidence (from 0.6 to 0.7)
minlen = 2
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.03 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 900
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23 item(s), 30000 transaction(s)] done [0.01s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [291 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_new
## set of 291 rules
summary(rules_new)
## set of 291 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 17 78 128 65 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 4.000 3.859 4.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.03010 Min. :0.7009 Min. :0.03440 Min. :0.9001
## 1st Qu.:0.04270 1st Qu.:0.7372 1st Qu.:0.05485 1st Qu.:0.9921
## Median :0.05807 Median :0.7849 Median :0.07473 Median :1.0607
## Mean :0.08279 Mean :0.7857 Mean :0.10585 Mean :1.2097
## 3rd Qu.:0.09373 3rd Qu.:0.8254 3rd Qu.:0.12590 3rd Qu.:1.5389
## Max. :0.47830 Max. :0.9600 Max. :0.60373 Max. :2.1291
## count
## Min. : 903
## 1st Qu.: 1281
## Median : 1742
## Mean : 2484
## 3rd Qu.: 2812
## Max. :14349
##
## mining info:
## data ntransactions support confidence
## credit_trans 30000 0.03 0.7
## call
## apriori(data = credit_trans, parameter = list(supp = 0.03, conf = 0.7, minlen = 2))
# Inspect the first 10 rules
inspect(head(rules_new, n = 10))
## lhs rhs support
## [1] {AGE_BIN=Age_>50} => {MARRIAGE=1} 0.05646667
## [2] {AGE_BIN=Age_>50} => {DEFAULT=Default_No} 0.05640000
## [3] {AGE_BIN=Age_<=25} => {LIMIT_BAL_BIN=Limit_Low} 0.10220000
## [4] {AGE_BIN=Age_<=25} => {MARRIAGE=2} 0.11376667
## [5] {AGE_BIN=Age_<=25} => {DEFAULT=Default_No} 0.09463333
## [6] {LIMIT_BAL_BIN=Limit_High} => {DEFAULT=Default_No} 0.12950000
## [7] {EDUCATION=3} => {DEFAULT=Default_No} 0.12266667
## [8] {EDUCATION=1} => {DEFAULT=Default_No} 0.28496667
## [9] {AGE_BIN=Age_36_50} => {DEFAULT=Default_No} 0.28200000
## [10] {SEX=1} => {DEFAULT=Default_No} 0.30050000
## confidence coverage lift count
## [1] 0.7465844 0.07563333 1.6397637 1694
## [2] 0.7457030 0.07563333 0.9575025 1692
## [3] 0.7920434 0.12903333 1.9012084 3066
## [4] 0.8816843 0.12903333 1.6568861 3413
## [5] 0.7334022 0.12903333 0.9417080 2839
## [6] 0.8664139 0.14946667 1.1124986 3885
## [7] 0.7484238 0.16390000 0.9609962 3680
## [8] 0.8076523 0.35283333 1.0370472 8549
## [9] 0.7745834 0.36406667 0.9945858 8460
## [10] 0.7583277 0.39626667 0.9737131 9015
rules_default_yes <- subset(rules_new, subset = rhs %pin% "Default_Yes")
inspect(rules_default_yes)
sub20_rules <- head(sort(rules_new, by="lift", decreasing=TRUE), 20)
plot(sub20_rules, method = "graph", engine = "html")
Lowering minimum support to 0.03 captures more niche patterns that
occur in at least 3% of transactions (900 of the 30,000 rows). The
maximum lift of ~2.13 shows that some rules more than double the
baseline probability of the RHS. The mean lift is ~1.21, indicating a
moderate average level of association strength. Raising confidence to
0.7 ensures each rule’s LHS strongly predicts the RHS, yielding more
reliable (though fewer) rules: 291 versus 306.
Younger customers (≤25) typically have Low Limits and a high proportion
of “No Default” but with a lift of about 0.94, so they’re slightly more
prone to default than the baseline. Customers over 50 also have a high
no-default rate, but their lift is about 0.96, so they are close to
average and marginally below it.
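Because DEFAULT has only two levels, a “No Default” rule with lift below 1 can be restated as a default rate above the baseline. The sketch below makes that conversion for three single-attribute rules, using the confidences printed by `inspect()` and the overall no-default count from `summary(credit_trans)`. It is a back-of-the-envelope calculation, not output from `apriori()`.

```r
base_no  <- 23364 / 30000   # overall share of Default_No
base_yes <- 1 - base_no     # overall default rate

# Confidence of {segment} => {DEFAULT=Default_No}, copied from inspect()
conf_no <- c(
  "Age_<=25"   = 0.7334022,
  "Age_>50"    = 0.7457030,
  "Limit_High" = 0.8664139
)

default_rate <- 1 - conf_no               # share of each segment that defaults
yes_lift     <- default_rate / base_yes   # implied lift of {segment} => {Default_Yes}

print(round(rbind(default_rate, yes_lift), 3))
```

Under these figures, customers aged 25 or under default about 1.2 times as often as the average customer, those over 50 about 1.15 times as often, and customers with high limits only about 0.6 times as often.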
Our analysis aimed to identify which demographic segments are more
inclined to default on credit card payments. By applying the Apriori
algorithm (with varying support and confidence thresholds) to
demographic and credit-limit features, we uncovered a range of
association rules that offer insights into default behavior:
Younger Customers (Age ≤ 25)
Frequently tied to Low credit limits (lift ~1.90), indicating that banks
often set more conservative limits for younger clients. They show a high
proportion of “No Default” in absolute terms (confidence 73.3%), but the
lift relative to the overall no-default rate is about 0.94, so while
most young customers do not default, their default risk is slightly
higher than average.
Older Customers (Age > 50)
Strongly associated with Married status (lift ~1.64). They also exhibit
a high proportion of “No Default,” though the lift is about 0.96,
indicating their default risk is roughly on par with, or marginally
above, the broader population.
High Credit Limits
Show a moderate positive association with “No Default” (lift ~1.1),
suggesting that having a higher limit correlates slightly with lower
risk, though the effect is not pronounced.
Education and Marital Status
Education levels 1 and 3 both appear in “Default=No” rules, but their
lifts sit close to 1 (≈1.04 for “EDUCATION=1”, ≈0.96 for
“EDUCATION=3”), so neither departs much from the baseline. Similarly,
marital status interacts with age bins in predictable ways (e.g.,
younger singles, older marrieds) rather than showing strong default
patterns on its own.
When we lowered support to 0.03 and raised confidence to 0.70,
additional niche rules surfaced, but again the most notable lifts
centered on younger vs. older groups and credit-limit tiers. Subsetting
for Default=Yes returned no rules in either run. Only about 22% of
customers default (6,636 of 30,000), so a rule predicting default with
confidence of 0.6 or more would need a lift of at least 2.7, above the
maxima observed (1.90 and 2.13). Isolating high-risk patterns therefore
requires a much lower confidence threshold, and potentially lower
support or additional features (e.g., repayment history).
Overall, our results indicate that while demographics (age, marital
status, education) and credit limit do show correlations with default
behavior, the strength of these relationships is modest, with
default-related lifts staying close to 1. For
credit risk management, these findings can guide targeted monitoring or
policy adjustments—for example, re-evaluating limits for specific age
segments or combining demographic data with more detailed payment
histories to refine risk assessments.