Association Rules with Credit Card

Introduction
Data Preparation
Association Rules (Apriori)
Conclusion

Introduction

This analysis examines the relationship between customer characteristics and default risk.
Research Objective:
Analyze the potential relationship between demographic characteristics (gender, education level, marital status, age) and default risk (default payment next month). Identifying which customer groups are more likely to default, based on their attributes, can support decisions in credit risk management.

Dataset: https://archive.ics.uci.edu/dataset/350/default+of+credit+card+clients

Data Preparation

library(readxl)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
library(ggplot2)
file_path <- "C:/Users/admin/Desktop/unsupervied-learning/project/association rule/default of credit card clients.xlsx"
credit <- read_excel(file_path)

## New names:
## • `` -> `...1`

# Convert the first row to column names and remove that row
colnames(credit) <- as.character(unlist(credit[1, ])) 
credit <- credit[-1, ] 
# Remove the ID column (not useful for association rules)
credit <- credit %>% select(-ID)
# Check dataset structure and dimensions
head(credit)

## # A tibble: 6 × 24
##   LIMIT_BAL SEX   EDUCATION MARRIAGE AGE   PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6
##   <chr>     <chr> <chr>     <chr>    <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 20000     2     2         1        24    2     2     -1    -1    -2    -2   
## 2 120000    2     2         2        26    -1    2     0     0     0     2    
## 3 90000     2     2         2        34    0     0     0     0     0     0    
## 4 50000     2     2         1        37    0     0     0     0     0     0    
## 5 50000     1     2         1        57    -1    0     -1    0     0     0    
## 6 50000     1     1         2        37    0     0     0     0     0     0    
## # ℹ 13 more variables: BILL_AMT1 <chr>, BILL_AMT2 <chr>, BILL_AMT3 <chr>,
## #   BILL_AMT4 <chr>, BILL_AMT5 <chr>, BILL_AMT6 <chr>, PAY_AMT1 <chr>,
## #   PAY_AMT2 <chr>, PAY_AMT3 <chr>, PAY_AMT4 <chr>, PAY_AMT5 <chr>,
## #   PAY_AMT6 <chr>, `default payment next month` <chr>

str(credit)

## tibble [30,000 × 24] (S3: tbl_df/tbl/data.frame)
##  $ LIMIT_BAL                 : chr [1:30000] "20000" "120000" "90000" "50000" ...
##  $ SEX                       : chr [1:30000] "2" "2" "2" "2" ...
##  $ EDUCATION                 : chr [1:30000] "2" "2" "2" "2" ...
##  $ MARRIAGE                  : chr [1:30000] "1" "2" "2" "1" ...
##  $ AGE                       : chr [1:30000] "24" "26" "34" "37" ...
##  $ PAY_0                     : chr [1:30000] "2" "-1" "0" "0" ...
##  $ PAY_2                     : chr [1:30000] "2" "2" "0" "0" ...
##  $ PAY_3                     : chr [1:30000] "-1" "0" "0" "0" ...
##  $ PAY_4                     : chr [1:30000] "-1" "0" "0" "0" ...
##  $ PAY_5                     : chr [1:30000] "-2" "0" "0" "0" ...
##  $ PAY_6                     : chr [1:30000] "-2" "2" "0" "0" ...
##  $ BILL_AMT1                 : chr [1:30000] "3913" "2682" "29239" "46990" ...
##  $ BILL_AMT2                 : chr [1:30000] "3102" "1725" "14027" "48233" ...
##  $ BILL_AMT3                 : chr [1:30000] "689" "2682" "13559" "49291" ...
##  $ BILL_AMT4                 : chr [1:30000] "0" "3272" "14331" "28314" ...
##  $ BILL_AMT5                 : chr [1:30000] "0" "3455" "14948" "28959" ...
##  $ BILL_AMT6                 : chr [1:30000] "0" "3261" "15549" "29547" ...
##  $ PAY_AMT1                  : chr [1:30000] "0" "0" "1518" "2000" ...
##  $ PAY_AMT2                  : chr [1:30000] "689" "1000" "1500" "2019" ...
##  $ PAY_AMT3                  : chr [1:30000] "0" "1000" "1000" "1200" ...
##  $ PAY_AMT4                  : chr [1:30000] "0" "1000" "1000" "1100" ...
##  $ PAY_AMT5                  : chr [1:30000] "0" "0" "1000" "1069" ...
##  $ PAY_AMT6                  : chr [1:30000] "0" "2000" "5000" "1000" ...
##  $ default payment next month: chr [1:30000] "1" "1" "0" "0" ...

dim(credit)

## [1] 30000    24

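The dataset consists of 30,000 observations and 24 variables. Every column is stored as character at this point (see the str() output above), so the variables used in the analysis are converted to factors or numeric values in the next step.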

credit <- credit %>%
  # 1) Rename the original "default payment next month" column to "DEFAULT", for clarity
  rename(DEFAULT = `default payment next month`) %>%
  
  # 2) Use a single mutate() to perform various type conversions, binning, and factorization
  mutate(
    # Convert SEX, EDUCATION, and MARRIAGE into factors
    SEX        = as.factor(SEX),
    EDUCATION  = as.factor(EDUCATION),
    MARRIAGE   = as.factor(MARRIAGE),
    
    # Convert AGE to numeric and then bin it
    AGE = as.numeric(AGE),
    AGE_BIN = cut(
      x      = AGE,
      breaks = c(0, 25, 35, 50, 150),
      labels = c("Age_<=25", "Age_26_35", "Age_36_50", "Age_>50")
    ),
    
    # Convert LIMIT_BAL to numeric and then bin it
    LIMIT_BAL = as.numeric(LIMIT_BAL),
    LIMIT_BAL_BIN = cut(
      x      = LIMIT_BAL,
      breaks = c(-Inf, 100000, 300000, 600000, Inf),
      labels = c("Limit_Low", "Limit_Mid", "Limit_High", "Limit_VeryHigh")
    ),
    
    # Convert DEFAULT from 0/1 to "Default_No"/"Default_Yes"
    DEFAULT = as.factor(ifelse(DEFAULT == 1, "Default_Yes", "Default_No"))
  ) %>%
  
  # 3) Select only the columns needed for the association rules analysis
  select(SEX, EDUCATION, MARRIAGE, AGE_BIN, LIMIT_BAL_BIN, DEFAULT)

# Inspect the transformed data
str(credit)

## tibble [30,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ SEX          : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 1 2 2 1 ...
##  $ EDUCATION    : Factor w/ 7 levels "0","1","2","3",..: 3 3 3 3 3 2 2 3 4 4 ...
##  $ MARRIAGE     : Factor w/ 4 levels "0","1","2","3": 2 3 3 2 2 3 3 3 2 3 ...
##  $ AGE_BIN      : Factor w/ 4 levels "Age_<=25","Age_26_35",..: 1 2 2 3 4 3 2 1 2 2 ...
##  $ LIMIT_BAL_BIN: Factor w/ 4 levels "Limit_Low","Limit_Mid",..: 1 2 1 1 1 1 3 1 2 1 ...
##  $ DEFAULT      : Factor w/ 2 levels "Default_No","Default_Yes": 2 2 1 1 1 1 1 1 1 1 ...

head(credit)

## # A tibble: 6 × 6
##   SEX   EDUCATION MARRIAGE AGE_BIN   LIMIT_BAL_BIN DEFAULT    
##   <fct> <fct>     <fct>    <fct>     <fct>         <fct>      
## 1 2     2         1        Age_<=25  Limit_Low     Default_Yes
## 2 2     2         2        Age_26_35 Limit_Mid     Default_Yes
## 3 2     2         2        Age_26_35 Limit_Low     Default_No 
## 4 2     2         1        Age_36_50 Limit_Low     Default_No 
## 5 1     2         1        Age_>50   Limit_Low     Default_No 
## 6 1     1         2        Age_36_50 Limit_Low     Default_No
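The binning step relies on the default behaviour of cut(), where intervals are closed on the right. The following is a minimal base-R sketch on a few hand-picked values (the vectors `ages` and `limits` are illustrative and not taken from the dataset) showing which bin each boundary value lands in under the same breaks and labels as above.

```r
# Illustrative values only; chosen to sit on or next to the bin boundaries
ages   <- c(21, 25, 26, 35, 36, 50, 51, 79)
limits <- c(10000, 100000, 100001, 300000, 600000, 1000000)

# Same breaks and labels as in the mutate() call; cut() uses right = TRUE by default,
# so the intervals are (0, 25], (25, 35], (35, 50], (50, 150]
age_bin <- cut(
  x      = ages,
  breaks = c(0, 25, 35, 50, 150),
  labels = c("Age_<=25", "Age_26_35", "Age_36_50", "Age_>50")
)

limit_bin <- cut(
  x      = limits,
  breaks = c(-Inf, 100000, 300000, 600000, Inf),
  labels = c("Limit_Low", "Limit_Mid", "Limit_High", "Limit_VeryHigh")
)

print(data.frame(age = ages, bin = age_bin))
print(data.frame(limit = limits, bin = limit_bin))
```

A boundary value belongs to the lower bin: age 25 falls in "Age_<=25" and a limit of exactly 100000 falls in "Limit_Low".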

# Convert the current data.frame into transactions
credit_trans <- as(credit, "transactions")

# Check the transaction summary
credit_trans

## transactions in sparse format with
##  30000 transactions (rows) and
##  23 items (columns)

summary(credit_trans)

## transactions as itemMatrix in sparse format with
##  30000 rows (elements/itemsets/transactions) and
##  23 columns (items) and a density of 0.2608696 
## 
## most frequent items:
## DEFAULT=Default_No              SEX=2         MARRIAGE=2        EDUCATION=2 
##              23364              18112              15964              14030 
##         MARRIAGE=1            (Other) 
##              13659              94871 
## 
## element (itemset/transaction) length distribution:
## sizes
##     6 
## 30000 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6 
## 
## includes extended item information - examples:
##        labels variables levels
## 1       SEX=1       SEX      1
## 2       SEX=2       SEX      2
## 3 EDUCATION=0 EDUCATION      0
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
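The figures in this summary can be cross-checked by hand. The sketch below is plain arithmetic on numbers printed above (factor level counts from str(credit) and the item frequencies from summary(credit_trans)); it does not touch the data itself.

```r
# Number of levels per factor, as reported by str(credit)
levels_per_variable <- c(SEX = 2, EDUCATION = 7, MARRIAGE = 4,
                         AGE_BIN = 4, LIMIT_BAL_BIN = 4, DEFAULT = 2)

# Each factor level becomes one item of the form "VARIABLE=level"
n_items <- sum(levels_per_variable)

# Every customer contributes exactly one item per variable
items_per_transaction <- length(levels_per_variable)

# Density = share of filled cells in the 30000 x n_items item matrix
density <- items_per_transaction / n_items

# The "(Other)" count is the total number of items minus the five most frequent ones
top5  <- c(23364, 18112, 15964, 14030, 13659)
other <- 30000 * items_per_transaction - sum(top5)

print(c(n_items = n_items, density = density, other = other))
```

This reproduces the 23 items, the density of about 0.26, and the "(Other)" count of 94871 shown in the summary.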

Association Rule Mining

Apriori algorithm

Apriori is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
In the context of analyzing credit card customer data, Apriori is particularly useful for uncovering combinations of demographic attributes (such as gender, education level, marital status, and age) that correlate with a higher risk of default. By binning continuous attributes like credit limit and age, we transform the dataset into a transaction-style format. Then, Apriori helps identify frequent itemsets—for instance, characteristic groups that appear frequently—and shows how these groups link to default outcomes.
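The level-wise search described above can be sketched in a few lines of base R. This is a simplified illustration on hypothetical toy transactions, not the implementation used by the arules package: it joins frequent itemsets into larger candidates and recounts their support, but omits the subset-pruning step of the full algorithm. All names (`transactions`, `min_count`, `support_count`) are made up for the example.

```r
# Toy transactions (hypothetical, not from the credit card data)
transactions <- list(
  c("young", "low_limit", "single"),
  c("young", "low_limit"),
  c("older", "married", "low_limit"),
  c("young", "single"),
  c("older", "married")
)
min_count <- 2  # an itemset is "frequent" if at least 2 transactions contain it

# Number of transactions that contain every item of `itemset`
support_count <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: frequent single items
items        <- sort(unique(unlist(transactions)))
frequent     <- Filter(function(s) support_count(s) >= min_count, as.list(items))
all_frequent <- frequent

# Level k: join pairs of frequent (k-1)-itemsets into k-itemsets and keep the frequent ones
while (length(frequent) > 1) {
  k <- length(frequent[[1]]) + 1
  candidates <- list()
  for (i in seq_along(frequent)) {
    for (j in seq_along(frequent)) {
      if (i < j) {
        u <- sort(union(frequent[[i]], frequent[[j]]))
        # Keying by the item string removes duplicate candidates
        if (length(u) == k) candidates[[paste(u, collapse = ",")]] <- u
      }
    }
  }
  frequent     <- unname(Filter(function(s) support_count(s) >= min_count, candidates))
  all_frequent <- c(all_frequent, frequent)
}

for (s in all_frequent) {
  cat(sprintf("{%s}: count = %d\n", paste(s, collapse = ", "), support_count(s)))
}
```

The loop stops as soon as a level yields fewer than two frequent itemsets, because no larger candidate can then be formed.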

The intuitive “if–then” structure of Apriori’s output gives credit analysts and risk managers direct insights into which demographic profiles or credit usage patterns are most strongly associated with default. This makes it easier to implement targeted interventions or refine credit policies based on clear, interpretable rules rather than black-box predictions alone.
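For reference, the quality measures reported for each rule in the outputs below follow the standard definitions. Here $X$ is the left-hand side (lhs), $Y$ the right-hand side (rhs), $N$ the number of transactions, and $\operatorname{supp}(Z)$ the fraction of transactions that contain every item of $Z$; the notation is chosen for this note and does not appear in the package output.

```latex
\begin{aligned}
\operatorname{supp}(Z) &= \frac{\lvert \{\, t : Z \subseteq t \,\} \rvert}{N} \\
\operatorname{support}(X \Rightarrow Y) &= \operatorname{supp}(X \cup Y) \\
\operatorname{coverage}(X \Rightarrow Y) &= \operatorname{supp}(X) \\
\operatorname{confidence}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{confidence}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\end{aligned}
```

A lift of 1 means the rhs is exactly as frequent among transactions matching the lhs as it is overall; values above 1 indicate a positive association and values below 1 a negative one.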

rules <- apriori(
  data = credit_trans,
  parameter = list(
    supp = 0.05,    # Minimum support (5%)
    conf = 0.6,     # Minimum confidence (60%)
    minlen = 2      # Minimum rule length
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1500 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23 item(s), 30000 transaction(s)] done [0.01s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [306 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Check the result of rule mining
rules

## set of 306 rules

summary(rules)

## set of 306 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##  26 112 134  34 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.575   4.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.05003   Min.   :0.6002   Min.   :0.05867   Min.   :0.8683  
##  1st Qu.:0.06044   1st Qu.:0.6634   1st Qu.:0.08737   1st Qu.:1.0164  
##  Median :0.08343   Median :0.7285   Median :0.11568   Median :1.0686  
##  Mean   :0.10813   Mean   :0.7313   Mean   :0.14967   Mean   :1.1861  
##  3rd Qu.:0.12339   3rd Qu.:0.7994   3rd Qu.:0.17288   3rd Qu.:1.3904  
##  Max.   :0.47830   Max.   :0.8900   Max.   :0.77880   Max.   :1.9012  
##      count      
##  Min.   : 1501  
##  1st Qu.: 1813  
##  Median : 2503  
##  Mean   : 3244  
##  3rd Qu.: 3702  
##  Max.   :14349  
## 
## mining info:
##          data ntransactions support confidence
##  credit_trans         30000    0.05        0.6
##                                                                                 call
##  apriori(data = credit_trans, parameter = list(supp = 0.05, conf = 0.6, minlen = 2))

The maximum lift (~1.90) is moderate: these rules show noticeable relationships, but not extremely strong ones (e.g., a lift of 3 or 4 would be very strong). The median lift is only about 1.07, so most rules are close to what independence would produce.

# Inspect the first 10 rules
inspect(head(rules, n = 10))

##      lhs                           rhs                       support   
## [1]  {AGE_BIN=Age_>50}          => {MARRIAGE=1}              0.05646667
## [2]  {AGE_BIN=Age_>50}          => {DEFAULT=Default_No}      0.05640000
## [3]  {AGE_BIN=Age_<=25}         => {LIMIT_BAL_BIN=Limit_Low} 0.10220000
## [4]  {AGE_BIN=Age_<=25}         => {MARRIAGE=2}              0.11376667
## [5]  {AGE_BIN=Age_<=25}         => {SEX=2}                   0.09020000
## [6]  {AGE_BIN=Age_<=25}         => {DEFAULT=Default_No}      0.09463333
## [7]  {LIMIT_BAL_BIN=Limit_High} => {DEFAULT=Default_No}      0.12950000
## [8]  {EDUCATION=3}              => {DEFAULT=Default_No}      0.12266667
## [9]  {EDUCATION=1}              => {MARRIAGE=2}              0.22696667
## [10] {EDUCATION=1}              => {DEFAULT=Default_No}      0.28496667
##      confidence coverage   lift      count
## [1]  0.7465844  0.07563333 1.6397637 1694 
## [2]  0.7457030  0.07563333 0.9575025 1692 
## [3]  0.7920434  0.12903333 1.9012084 3066 
## [4]  0.8816843  0.12903333 1.6568861 3413 
## [5]  0.6990442  0.12903333 1.1578691 2706 
## [6]  0.7334022  0.12903333 0.9417080 2839 
## [7]  0.8664139  0.14946667 1.1124986 3885 
## [8]  0.7484238  0.16390000 0.9609962 3680 
## [9]  0.6432688  0.35283333 1.2088489 6809 
## [10] 0.8076523  0.35283333 1.0370472 8549
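As a sanity check on how the columns of this table relate to each other, rule [2] ({AGE_BIN=Age_>50} => {DEFAULT=Default_No}) can be recomputed from counts. The rule count and the Default_No frequency come from the outputs above; the lhs count of 2269 is inferred here as coverage × 30000 and is therefore an assumption of this sketch.

```r
n_total    <- 30000   # number of transactions
count_rule <- 1692    # transactions with Age_>50 AND Default_No (count column, rule [2])
count_lhs  <- 2269    # transactions with Age_>50, inferred from coverage 0.07563333 * 30000
count_rhs  <- 23364   # transactions with Default_No (most frequent item in the summary)

support    <- count_rule / n_total                # share of all customers matching the whole rule
coverage   <- count_lhs / n_total                 # share of all customers matching the lhs
confidence <- count_rule / count_lhs              # share of lhs customers who also match the rhs
lift       <- confidence / (count_rhs / n_total)  # confidence relative to the overall rhs rate

print(round(c(support = support, coverage = coverage,
              confidence = confidence, lift = lift), 7))
```

The confidence of about 0.746 looks high in isolation, but it is below the overall no-default rate of about 0.779, which is why the lift comes out under 1.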

# Subset: rules where the consequent (rhs) contains "DEFAULT=Default_Yes"
rules_default_yes <- subset(rules, subset = rhs %pin% "Default_Yes")
inspect(rules_default_yes)
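No rules are printed for this subset, and a short calculation suggests why. Since lift is confidence divided by the overall frequency of the rhs, a rule predicting Default_Yes can only pass the confidence threshold if its lift is large enough. The sketch below uses only the Default_No count from summary(credit_trans); the variable names are illustrative.

```r
n_total      <- 30000
n_no_default <- 23364                       # DEFAULT=Default_No count from the summary
base_rate    <- 1 - n_no_default / n_total  # overall share of Default_Yes

# lift = confidence / base_rate, so the smallest lift that reaches a
# given minimum confidence is min_conf / base_rate
min_conf      <- c("conf = 0.6" = 0.6, "conf = 0.7" = 0.7)
required_lift <- min_conf / base_rate

print(round(base_rate, 4))
print(round(required_lift, 2))
```

With roughly 22% of customers defaulting, a Default_Yes rule needs a lift of about 2.7 to reach 60% confidence, well above the maximum lift of about 1.90 reported by summary(rules). An empty subset is therefore expected with these thresholds rather than a sign of a coding error.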

Age and Credit Limit:
Younger customers (Age ≤ 25) strongly tend to have a lower credit limit (rule 3, lift ≈ 1.90), which aligns with typical bank policies assigning lower limits to younger or less-established clients.

Age and Default:
Customers over 50 mostly fall under “no default” (rule 2, confidence 74.6%), but the lift is about 0.96, so they are slightly less likely than the average customer to avoid default. This rule does not indicate lower risk among older borrowers.
Younger customers (≤25) also skew “no default” (rule 6, confidence 73.3%), but the lift is about 0.94, indicating they are slightly more likely to default than the population average, though the difference is not large.

Marital Status:
Age > 50 strongly associates with “Married” (lift ≈ 1.64), while Age ≤ 25 strongly associates with “Single” (lift ≈ 1.66). These findings are intuitively consistent with life stages.

Education and Default:
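Education levels show only weak associations with “no default.” EDUCATION=1 (graduate school in the UCI coding) has a lift of about 1.04 (rule 10), while EDUCATION=3 (high school) has a lift of about 0.96 (rule 8). One category is marginally above the overall no-default rate and the other marginally below it, so education on its own has little effect here.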

Lift Values:
Lifts in this run range from about 0.87 to 1.90, with a median near 1.07, so most rules sit close to independence and only a few show moderate relationships. Still, these rules can guide preliminary segmentation (e.g., customers with high credit limits appear slightly lower risk, with a lift of about 1.11 for “no default”).

Visualization of Rules

# Basic scatter plot (support vs. confidence) with color shading for lift
plot(rules, measure = c("support", "confidence"), shading = "lift")

# Grouped matrix visualization
plot(rules, method = "grouped", k = 18)

# Focus on top 20 rules by lift
sub_rules <- head(sort(rules, by = "lift", decreasing = TRUE), 20)
plot(sub_rules, method = "graph", engine = "html")

Re-run Apriori with Different Parameters

rules_new <- apriori(
  data = credit_trans,
  parameter = list(
    supp   = 0.03, # Lower support threshold (from 0.05 to 0.03)
    conf   = 0.7,  # Higher confidence (from 0.6 to 0.7)
    minlen = 2
  )
)

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5    0.03      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 900 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23 item(s), 30000 transaction(s)] done [0.01s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [291 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

rules_new

## set of 291 rules

summary(rules_new)

## set of 291 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5   6 
##  17  78 128  65   3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   4.000   3.859   4.000   6.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.03010   Min.   :0.7009   Min.   :0.03440   Min.   :0.9001  
##  1st Qu.:0.04270   1st Qu.:0.7372   1st Qu.:0.05485   1st Qu.:0.9921  
##  Median :0.05807   Median :0.7849   Median :0.07473   Median :1.0607  
##  Mean   :0.08279   Mean   :0.7857   Mean   :0.10585   Mean   :1.2097  
##  3rd Qu.:0.09373   3rd Qu.:0.8254   3rd Qu.:0.12590   3rd Qu.:1.5389  
##  Max.   :0.47830   Max.   :0.9600   Max.   :0.60373   Max.   :2.1291  
##      count      
##  Min.   :  903  
##  1st Qu.: 1281  
##  Median : 1742  
##  Mean   : 2484  
##  3rd Qu.: 2812  
##  Max.   :14349  
## 
## mining info:
##          data ntransactions support confidence
##  credit_trans         30000    0.03        0.7
##                                                                                 call
##  apriori(data = credit_trans, parameter = list(supp = 0.03, conf = 0.7, minlen = 2))

# Inspect the first 10 rules
inspect(head(rules_new, n = 10))

##      lhs                           rhs                       support   
## [1]  {AGE_BIN=Age_>50}          => {MARRIAGE=1}              0.05646667
## [2]  {AGE_BIN=Age_>50}          => {DEFAULT=Default_No}      0.05640000
## [3]  {AGE_BIN=Age_<=25}         => {LIMIT_BAL_BIN=Limit_Low} 0.10220000
## [4]  {AGE_BIN=Age_<=25}         => {MARRIAGE=2}              0.11376667
## [5]  {AGE_BIN=Age_<=25}         => {DEFAULT=Default_No}      0.09463333
## [6]  {LIMIT_BAL_BIN=Limit_High} => {DEFAULT=Default_No}      0.12950000
## [7]  {EDUCATION=3}              => {DEFAULT=Default_No}      0.12266667
## [8]  {EDUCATION=1}              => {DEFAULT=Default_No}      0.28496667
## [9]  {AGE_BIN=Age_36_50}        => {DEFAULT=Default_No}      0.28200000
## [10] {SEX=1}                    => {DEFAULT=Default_No}      0.30050000
##      confidence coverage   lift      count
## [1]  0.7465844  0.07563333 1.6397637 1694 
## [2]  0.7457030  0.07563333 0.9575025 1692 
## [3]  0.7920434  0.12903333 1.9012084 3066 
## [4]  0.8816843  0.12903333 1.6568861 3413 
## [5]  0.7334022  0.12903333 0.9417080 2839 
## [6]  0.8664139  0.14946667 1.1124986 3885 
## [7]  0.7484238  0.16390000 0.9609962 3680 
## [8]  0.8076523  0.35283333 1.0370472 8549 
## [9]  0.7745834  0.36406667 0.9945858 8460 
## [10] 0.7583277  0.39626667 0.9737131 9015

rules_default_yes <- subset(rules_new, subset = rhs %pin% "Default_Yes")
inspect(rules_default_yes)



sub20_rules <- head(sort(rules_new, by = "lift", decreasing = TRUE), 20)
plot(sub20_rules, method = "graph", engine = "html")

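Lowering minimum support to 0.03 admits more niche patterns, since a rule now only needs to occur in at least 3% of transactions (900 records). The maximum lift of ~2.13 means that, for the strongest rule, the RHS is a little over twice as likely given the LHS as it is overall. The mean lift is ~1.21 and the median ~1.06, so the typical rule is only weakly above independence. Raising confidence to 0.7 keeps only rules whose RHS holds in at least 70% of the transactions matching the LHS, which yields slightly fewer rules (291 versus 306). High confidence alone does not imply a strong association when the RHS is already common, as with “No Default” (about 78% of customers).
Younger customers (≤25) typically have low limits and a high proportion of “No Default,” but with a lift of about 0.94, so they are slightly more prone to default than the baseline. Customers over 50 also have a high no-default rate, with a lift of about 0.96, so they are close to average and marginally below the baseline rather than safer.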

Conclusion

Our analysis aimed to identify which demographic segments are more inclined to default on credit card payments. By applying the Apriori algorithm (with varying support and confidence thresholds) to demographic and credit-limit features, we uncovered a range of association rules that offer insights into default behavior:
Younger Customers (Age ≤ 25)
Frequently tied to low credit limits (lift ~1.90), indicating that banks often set more conservative limits for younger clients. They show a high proportion of “No Default” in absolute terms (about 73%), but the lift for “No Default” is slightly below 1 (~0.94), so their default risk is slightly above average rather than lower.

Older Customers (Age > 50)
Strongly associated with Married status (lift ~1.64). They also exhibit a high proportion of “No Default” (about 75%), though the lift is about 0.96, indicating their default risk is on par with, or marginally above, that of the broader population.

High Credit Limits
Show a moderate positive association with “No Default” (lift ~1.1), suggesting that having a higher limit correlates slightly with lower risk, though the effect is not pronounced.

Education and Marital Status
Education levels sit close to the baseline for “No Default”: “EDUCATION=1” is marginally above it (lift ~1.04) and “EDUCATION=3” marginally below it (lift ~0.96). Similarly, marital status interacts with age bins in predictable ways (e.g., younger singles, older marrieds) rather than showing strong default patterns on its own.

When we lowered support to 0.03 and raised confidence to 0.70, additional niche rules surfaced, but the most notable lifts again centered on age groups and credit-limit tiers. Subsetting for Default=Yes returned no rules in either run. With about 22% of customers defaulting, a rule predicting default needs a lift of at least about 2.7 to reach 60% confidence (about 3.2 for 70%), which exceeds the maximum lifts observed (1.90 and 2.13). Isolating high-risk patterns would therefore require a lower confidence threshold or additional features (e.g., repayment history).
Overall, our results indicate that while demographics (age, marital status, education) and credit limit do show some association with default behavior, the strength of these relationships is weak to moderate. For credit risk management, these findings can guide targeted monitoring or policy adjustments, for example re-evaluating limits for specific age segments or combining demographic data with more detailed payment histories to refine risk assessments.

Meifang Wu

2025-01-25