Analyzing Discretization Outcomes on Bank Client Data

Introduction

The aim of this project is to analyse the impact of different discretization methods on the partitioning of continuous data, on the association rules generated from it, and on the interpretability of those rules. The analysis compares five discretization methods applied to a dataset containing both qualitative and quantitative data. First the data will be described and cleaned. Then the five methods will be applied to the continuous variables, and the resulting partitions and mined association rules will be compared.

Description of the dataset and cleaning

The analysis will be conducted on a credit card consumer dataset¹ found on Kaggle. The dataset was constructed mainly for predicting credit card consumer segmentation: it contains detailed data for around 10,000 customers, including whether each one is an existing or a former (attrited) customer. For my analysis I will use both the qualitative data (gender, education level, marital status etc.), which will be converted to factor type, and the quantitative data (age, number of dependents, transaction count etc.), which will be divided into partitions using different discretization methods.

The dataset consists of 8,500 observations of current customers and 1,627 observations of former customers. By employing the different discretization methods, I aim to explore how the way continuous variables are divided affects the association rules created for the customers. I want to focus specifically on characterizing the attrited customers, as this can provide insight into which clients are most likely to churn.

From the 23 variables in the dataset I chose 16 of them:

Attrition_Flag - information whether it’s an existing or attrited customer
Customer_Age - age of the customer
Gender - gender of the customer
Dependent_count - number of dependents (people that rely on the client for support)
Education_Level - educational level of the client
Marital_Status - marital status of the client
Income_Category - income category of the client
Card_Category - card category of the client
Months_on_book - how long the customer was a client
Total_Relationship_Count - number of relationships the client has with the credit card provider
Months_Inactive_12_mon - number of months the client has been inactive in the last twelve months
Contacts_Count_12_mon - number of contacts the client has had in the last twelve months
Credit_Limit - credit limit of the client
Total_Revolving_Bal - total revolving balance of the client
Total_Trans_Ct - total transaction count of the client
Avg_Utilization_Ratio - utilization ratio of the client (amount of revolving credit the client is using divided by the total credit available)

Let’s load the data and look at the variables.

clients <- read.csv("BankChurners.csv")
clients <- clients[, c(2:15, 19, 21)]

str(clients)

## 'data.frame':    10127 obs. of  16 variables:
##  $ Attrition_Flag          : chr  "Existing Customer" "Existing Customer" "Existing Customer" "Existing Customer" ...
##  $ Customer_Age            : int  45 49 51 40 40 44 51 32 37 48 ...
##  $ Gender                  : chr  "M" "F" "M" "F" ...
##  $ Dependent_count         : int  3 5 3 4 3 2 4 0 3 2 ...
##  $ Education_Level         : chr  "High School" "Graduate" "Graduate" "High School" ...
##  $ Marital_Status          : chr  "Married" "Single" "Married" "Unknown" ...
##  $ Income_Category         : chr  "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
##  $ Card_Category           : chr  "Blue" "Blue" "Blue" "Blue" ...
##  $ Months_on_book          : int  39 44 36 34 21 36 46 27 36 36 ...
##  $ Total_Relationship_Count: int  5 6 4 3 5 3 6 2 5 6 ...
##  $ Months_Inactive_12_mon  : int  1 1 1 4 1 1 1 2 2 3 ...
##  $ Contacts_Count_12_mon   : int  3 2 0 1 0 2 3 2 0 3 ...
##  $ Credit_Limit            : num  12691 8256 3418 3313 4716 ...
##  $ Total_Revolving_Bal     : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
##  $ Total_Trans_Ct          : int  42 33 20 20 28 24 31 36 24 32 ...
##  $ Avg_Utilization_Ratio   : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...

Within the dataset, there are 10 integer/numerical variables and 6 character variables. I’m going to change the labels within some of the character variables so they are easier to interpret, and then convert all of the character variables to factor type for further analysis.

# 3 - Gender
clients$Gender <- ifelse(clients[, 3] == "M", "Male",
                         ifelse(clients[, 3] == "F", "Female",
                                NA))

# 5 - Education_level
clients$Education_Level <- ifelse(clients[, 5] == "Unknown", "Unknown education",
                               ifelse(clients[, 5] == "High School", "High School",
                                      ifelse(clients[, 5] == "Graduate", "Graduate",
                                             ifelse(clients[, 5] == "Uneducated", "Uneducated",
                                                    ifelse(clients[, 5] == "College", "College",
                                                           ifelse(clients[, 5] =="Post-Graduate", "Post-Graduate",
                                                                  ifelse(clients[, 5] == "Doctorate", "Doctorate",
                                                                         NA)))))))
# 6 - Marital_status
clients$Marital_Status <- ifelse(clients[, 6] == "Unknown", "Unknown marital status",
                              ifelse(clients[, 6] == "Married", "Married",
                                     ifelse(clients[, 6] == "Single", "Single",
                                            ifelse(clients[, 6] == "Divorced", "Divorced",
                                                   NA))))

# 7 - Income_Category
clients$Income_Category <- ifelse(clients[, 7] == "Unknown", "Unknown income",
                               ifelse(clients[, 7] == "Less than $40K", "Income under $40K",
                                      ifelse(clients[, 7] == "$40K - $60K", "Income $40K - $60K",
                                             ifelse(clients[, 7] == "$60K - $80K", "Income $60K - $80K",
                                                    ifelse(clients[, 7] == "$80K - $120K", "Income $80K - $120K",
                                                           ifelse(clients[, 7] == "$120K +", "Income over $120K",
                                                                  NA))))))

# converting to factor type
columns_to_factor <- c("Attrition_Flag", "Gender", "Education_Level",
                       "Marital_Status", "Income_Category", "Card_Category")
clients[columns_to_factor] <- lapply(clients[columns_to_factor], factor)

# checking the type of the variables
str(clients)

## 'data.frame':    10127 obs. of  16 variables:
##  $ Attrition_Flag          : Factor w/ 2 levels "Attrited Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Customer_Age            : int  45 49 51 40 40 44 51 32 37 48 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 2 2 2 ...
##  $ Dependent_count         : int  3 5 3 4 3 2 4 0 3 2 ...
##  $ Education_Level         : Factor w/ 7 levels "College","Doctorate",..: 4 3 3 4 6 3 7 4 6 3 ...
##  $ Marital_Status          : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
##  $ Income_Category         : Factor w/ 6 levels "Income $40K - $60K",..: 2 5 3 5 2 1 4 2 2 3 ...
##  $ Card_Category           : Factor w/ 4 levels "Blue","Gold",..: 1 1 1 1 1 1 2 4 1 1 ...
##  $ Months_on_book          : int  39 44 36 34 21 36 46 27 36 36 ...
##  $ Total_Relationship_Count: int  5 6 4 3 5 3 6 2 5 6 ...
##  $ Months_Inactive_12_mon  : int  1 1 1 4 1 1 1 2 2 3 ...
##  $ Contacts_Count_12_mon   : int  3 2 0 1 0 2 3 2 0 3 ...
##  $ Credit_Limit            : num  12691 8256 3418 3313 4716 ...
##  $ Total_Revolving_Bal     : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
##  $ Total_Trans_Ct          : int  42 33 20 20 28 24 31 36 24 32 ...
##  $ Avg_Utilization_Ratio   : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...

Basics of association rules

Description

Association rules represent an unsupervised learning technique designed to reveal patterns within large datasets. This method allows us to uncover connections that may not be apparent at first glance, such as identifying items that frequently occur together.

Association rules consist of two crucial components: if (antecedent) and then (consequent). One good explanation is: “an antecedent is something that’s found in data, and a consequent is an item that is found in combination with the antecedent.”² For instance, in the dataset of bank clients, one could look for a rule such as “if income is under $40K, then the client is an existing customer.” The metrics used to assess the quality/strength of the rules are:

support - frequency of the combined occurrence of the antecedent and consequent (for support = 0.5 -> 50% of transactions contain both income under $40K and existing customer)
confidence - probability of the consequent occurring when the antecedent is in place (for confidence = 0.7 -> the probability of a customer being an existing customer when their income is under $40K is 70%)
lift - how much more often the antecedent and consequent occur together than would be expected if they were independent, i.e. confidence divided by the overall frequency of the consequent (for lift = 2 -> customers with income under $40K are twice as likely to be existing customers as a randomly chosen customer; lift = 1 means no association)
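To make the three metrics concrete, here is a minimal base R sketch that computes them by hand for a single rule. The data frame `toy`, its columns `low_income` and `existing`, and the helper `rule_metrics()` are hypothetical names invented for this illustration; the values are not taken from the bank dataset.

```r
# Hypothetical toy data: one row per customer (not from BankChurners.csv)
toy <- data.frame(
  low_income = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
  existing   = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

# Metrics for the rule "if antecedent, then consequent"
rule_metrics <- function(antecedent, consequent) {
  support    <- mean(antecedent & consequent)  # share of rows containing both
  confidence <- support / mean(antecedent)     # P(consequent | antecedent)
  lift       <- confidence / mean(consequent)  # confidence relative to the base rate
  c(support = support, confidence = confidence, lift = lift)
}

m <- rule_metrics(toy$low_income, toy$existing)
print(round(m, 3))
```

For this toy data the rule has support 0.4 and confidence 0.8, and because 70% of all rows are existing customers the lift is only about 1.14. A high confidence can therefore go together with a modest lift when the consequent is common to begin with.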

Dealing with continuous data

An important limitation is that association rules can be mined only from qualitative data - numerical values cannot be used directly. To deal with numerical data, one has to split it into intervals, which can then be treated as categories. Let’s look at a variable from our dataset - age.

summary(clients$Customer_Age)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   41.00   46.00   46.33   52.00   73.00

hist(clients$Customer_Age, col = "brown",
     xlab = "Age", ylab = "Number of customers",
     main = "Histogram of age variable")

The minimum age is 26, the median is 46 and the maximum is 73. Some options for grouping this variable would be:

2 groups: 1st (age from 26 to 46), 2nd (age from 47 to 73)
3 groups: 1st (age below 40), 2nd (age from 40 to 50), 3rd (age over 50)
4 groups: 1st (age below 35), 2nd (age from 35 to 44), 3rd (age from 45 to 55), 4th (age over 55)
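Manual groupings like these can be sketched with base R's cut(). The vector `ages` below is a hypothetical handful of values rather than the real Customer_Age column, and the break points assume integer ages, so "from 40 to 50" is encoded as the interval (39, 50].

```r
# Hypothetical ages, used only to illustrate the manual groupings
ages <- c(26, 34, 35, 40, 45, 46, 50, 55, 61, 73)

# 2 groups: 26 to 46 and 47 to 73
two_groups <- cut(ages, breaks = c(26, 46, 73), include.lowest = TRUE,
                  labels = c("26 to 46", "47 to 73"))

# 3 groups: below 40, 40 to 50, over 50 (intervals are right-closed)
three_groups <- cut(ages, breaks = c(-Inf, 39, 50, Inf),
                    labels = c("under 40", "40 to 50", "over 50"))

print(table(two_groups))
print(table(three_groups))
```

Every choice of `breaks` yields a different set of items for the rule miner, which is exactly why hand-picking them for each variable quickly becomes impractical.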

These are just 3 options out of endless possibilities - which one should we choose? It’s hard to group every quantitative variable in a dataset by hand, as each might need different intervals depending on what it is analysed together with. That’s why, for this kind of data, we use CBA - Classification Based on Associations. It still looks for if-then connections, but we have to specify which qualitative variable is on the RHS (Right Hand Side), while the variables on the LHS (Left Hand Side) may originally be continuous - they are partitioned with respect to the RHS.

Several algorithms build on CBA; some of them let the user specify the discretization method, while others do not. In this analysis we will concentrate on one that offers this option - mineCARs() from the arulesCBA package. To use it, the continuous variables first have to be discretized, and then the rules can be generated.

Discretization methods

In my analysis I want to compare the outcomes of different discretization methods. More specifically, I will concentrate on five methods: the first three are supervised and the remaining two are unsupervised.

Supervised methods (using discretizeDF.supervised()):

MDLP (Minimum Description Length Principle) - recursively splits each continuous attribute at the cut point that minimizes class entropy, using the Minimum Description Length criterion as the stopping rule
CAIM (Class-Attribute Interdependence Maximization) - chooses cut points that maximize the CAIM criterion, which measures the dependency between the class variable and the discretized attribute
ChiMerge - uses the $\chi^2$ statistic to determine whether the relative class frequencies of adjoining intervals are distinctly different or similar enough to justify merging them into a single interval

Unsupervised methods (using discretizeDF()):

Interval - partitions of equal interval width
Frequency - partitions of equal frequency (roughly the same number of observations in each)
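The difference between the two unsupervised methods is easiest to see on skewed data. The following base R sketch only illustrates the idea and does not call discretizeDF(); the vector `x` and the choice of three bins are assumptions made for this example.

```r
# Hypothetical skewed variable: eight small values and one outlier
x <- c(1:8, 100)

# Equal width: three bins of identical length between min and max
width_breaks <- seq(min(x), max(x), length.out = 4)
by_width <- cut(x, breaks = width_breaks, include.lowest = TRUE)

# Equal frequency: break points placed at the 1/3 and 2/3 quantiles
freq_breaks <- quantile(x, probs = seq(0, 1, length.out = 4))
by_freq <- cut(x, breaks = freq_breaks, include.lowest = TRUE)

print(table(by_width))  # almost everything lands in the first bin
print(table(by_freq))   # observations are spread evenly
```

With equal width, the single outlier stretches the range so that eight of the nine values share one bin and the middle bin stays empty. Equal frequency puts three values in each bin, at the cost of intervals with very different widths.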

For each of the methods I will follow the same steps. First I will discretize the continuous variables using the selected method and convert the data to the required structure (transactions). Then I will mine the rules with mineCARs(), setting the support parameter to 0.03 and the minimum confidence to 0.7. (In the summaries below it is the minimum coverage, i.e. the support of the LHS, that sits just above 0.03, while the support of the whole rule can fall slightly below it, which suggests the threshold is applied to the LHS.) Next I will remove the redundant rules - the rules that do not provide additional meaningful information beyond what is already covered by other rules. Finally I will look at the mined rules, especially those where RHS = “Attrited Customer”, to gain more insight into the characteristics of the clients who churned.
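The last two steps, removing redundant rules and keeping only those that predict churn, can be sketched as follows. This is a minimal example under stated assumptions: it uses a hypothetical, already-discretized data frame `toy` with made-up columns `status`, `trans` and `gender`, and it calls apriori() from arules directly with the RHS restricted to the class items instead of mineCARs(), so that only one package is needed.

```r
library(arules)

# Hypothetical, already-discretized toy data (every column is a factor)
toy <- data.frame(
  status = factor(c("Attrited", "Attrited", "Attrited", "Existing",
                    "Existing", "Existing", "Existing", "Existing")),
  trans  = factor(c("low", "low", "low", "low",
                    "high", "high", "high", "high")),
  gender = factor(c("F", "F", "M", "M", "F", "M", "F", "M"))
)

# Each "variable=level" pair becomes one item
toy_trans <- transactions(toy)

# Mine rules whose RHS is restricted to the class variable
rules <- apriori(
  toy_trans,
  parameter  = list(support = 0.1, confidence = 0.7),
  appearance = list(rhs = c("status=Attrited", "status=Existing"),
                    default = "lhs"),
  control    = list(verbose = FALSE)
)

# Drop rules that add nothing over a more general rule
rules_clean <- rules[!is.redundant(rules)]

# Keep only the rules that predict churn, strongest first
attrited <- subset(rules_clean, rhs %in% "status=Attrited")
inspect(sort(attrited, by = "confidence"))
```

On the real data the same filter would use the item label built from the column name and factor level, which for this dataset would presumably be "Attrition_Flag=Attrited Customer".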

Performing different discretization methods and mining the rules

MDLP

library(arulesCBA)
library(arulesViz)

D1.disc <- discretizeDF.supervised(Attrition_Flag ~ ., data = clients, method = "mdlp")
summary(D1.disc)

##            Attrition_Flag      Customer_Age      Gender        Dependent_count 
##  Attrited Customer:1627   [-Inf, Inf]:10127   Female:5358   [-Inf, Inf]:10127  
##  Existing Customer:8500                       Male  :4769                      
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##           Education_Level                Marital_Status
##  College          :1013   Divorced              : 748  
##  Doctorate        : 451   Married               :4687  
##  Graduate         :3128   Single                :3943  
##  High School      :2013   Unknown marital status: 749  
##  Post-Graduate    : 516                                
##  Uneducated       :1487                                
##  Unknown education:1519                                
##             Income_Category  Card_Category      Months_on_book 
##  Income $40K - $60K :1790   Blue    :9436   [-Inf, Inf]:10127  
##  Income $60K - $80K :1402   Gold    : 116                      
##  Income $80K - $120K:1535   Platinum:  20                      
##  Income over $120K  : 727   Silver  : 555                      
##  Income under $40K  :3561                                      
##  Unknown income     :1112                                      
##                                                                
##  Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
##  [-Inf,2.5):2153          [-Inf,0.5):  29        [-Inf,0.5): 399      
##  [2.5,3.5) :2305          [0.5,1.5) :2233        [0.5,1.5) :1499      
##  [3.5, Inf]:5669          [1.5,2.5) :3282        [1.5,2.5) :3227      
##                           [2.5, Inf]:4583        [2.5,5.5) :4948      
##                                                  [5.5, Inf]:  54      
##                                                                       
##                                                                       
##             Credit_Limit       Total_Revolving_Bal     Total_Trans_Ct
##  [-Inf,1.9e+03)   :1246   [-Inf,66)      :2470     [64.5,78.5):2613  
##  [1.9e+03,3.4e+03):2805   [66,427)       :  78     [78.5,94.5):2025  
##  [3.4e+03, Inf]   :6076   [427,582)      : 133     [37.5,51.5):1683  
##                           [582,980)      :1141     [20.5,37.5):1418  
##                           [980,2.38e+03) :5585     [94.5, Inf]: 882  
##                           [2.38e+03, Inf]: 720     [57.5,64.5): 871  
##                                                    (Other)    : 635  
##     Avg_Utilization_Ratio
##  [-Inf,0.0255) :2556     
##  [0.0255,0.451):4671     
##  [0.451,0.798) :2429     
##  [0.798, Inf]  : 471     
##                          
##                          
##

D1.trans <- transactions(D1.disc)
D1.ass <- mineCARs(Attrition_Flag ~ ., transactions = D1.trans, support = 0.03, confidence = 0.7)

D1.clean <- D1.ass[!is.redundant(D1.ass)]
summary(D1.clean)

## set of 877 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3   4   5 
##   1  26 191 388 271 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   4.000   4.029   5.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02182   Min.   :0.7075   Min.   :0.03002   Min.   :1.000  
##  1st Qu.:0.03575   1st Qu.:0.9126   1st Qu.:0.03762   1st Qu.:1.092  
##  Median :0.04779   Median :0.9554   Median :0.05046   Median :1.140  
##  Mean   :0.07067   Mean   :0.9411   Mean   :0.07594   Mean   :1.171  
##  3rd Qu.:0.07692   3rd Qu.:0.9777   3rd Qu.:0.08186   3rd Qu.:1.166  
##  Max.   :0.83934   Max.   :1.0000   Max.   :1.00000   Max.   :5.322  
##      count       
##  Min.   : 221.0  
##  1st Qu.: 362.0  
##  Median : 484.0  
##  Mean   : 715.7  
##  3rd Qu.: 779.0  
##  Max.   :8500.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         10127    0.03        0.7
##                                                                                                                                         call
##  apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)

inspectDT(D1.clean)

Statistics:

partitions - 1-6+ (some had [-Inf, Inf] while others over 6 partitions)
number of rules after cleaning - 877
number of rules for rhs = Attrited Customer - 11
highest support for rhs = Attrited Customer - 0.044
highest confidence for rhs = Attrited Customer - 0.855
highest lift for rhs = Attrited Customer - 5.322

This method was quick but produced quite uneven partitions: some variables were not split at all ([-Inf, Inf]) while others received six or more intervals. 877 rules were mined, but only 11 of them concern attrited customers. The rule with the highest confidence (0.855) states that the probability of a customer being attrited is about 85.5% when the customer is female, has a transaction count between 37.5 and 51.5 and a utilization ratio below 0.0255. The rule with the second highest confidence states that the probability of being attrited is about 85% for a female customer with a transaction count between 37.5 and 51.5 and a revolving balance below 66.

CAIM

D2.disc <- discretizeDF.supervised(Attrition_Flag ~ ., data = clients, method = "caim")
summary(D2.disc)

##            Attrition_Flag    Customer_Age     Gender     Dependent_count
##  Attrited Customer:1627   [-Inf,26):   0   Female:5358   [-Inf,0):   0  
##  Existing Customer:8500   [26,39.5):2036   Male  :4769   [0,2.5) :5397  
##                           [39.5,73):8090                 [2.5,5) :4306  
##                           [73, Inf]:   1                 [5, Inf]: 424  
##                                                                         
##                                                                         
##                                                                         
##           Education_Level                Marital_Status
##  College          :1013   Divorced              : 748  
##  Doctorate        : 451   Married               :4687  
##  Graduate         :3128   Single                :3943  
##  High School      :2013   Unknown marital status: 749  
##  Post-Graduate    : 516                                
##  Uneducated       :1487                                
##  Unknown education:1519                                
##             Income_Category  Card_Category    Months_on_book
##  Income $40K - $60K :1790   Blue    :9436   [-Inf,13):   0  
##  Income $60K - $80K :1402   Gold    : 116   [13,35.5):3802  
##  Income $80K - $120K:1535   Platinum:  20   [35.5,56):6222  
##  Income over $120K  : 727   Silver  : 555   [56, Inf]: 103  
##  Income under $40K  :3561                                   
##  Unknown income     :1112                                   
##                                                             
##  Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
##  [-Inf,1):   0            [-Inf,0):   0          [-Inf,0):    0       
##  [1,2.5) :2153            [0,1.5) :2262          [0,5.5) :10073       
##  [2.5,6) :6108            [1.5,6) :7741          [5.5,6) :    0       
##  [6, Inf]:1866            [6, Inf]: 124          [6, Inf]:   54       
##                                                                       
##                                                                       
##                                                                       
##              Credit_Limit       Total_Revolving_Bal    Total_Trans_Ct
##  [-Inf,1.44e+03)   :   0   [-Inf,0)       :   0     [-Inf,10) :   0  
##  [1.44e+03,1.9e+03):1246   [0,582)        :2681     [10,54.5) :3458  
##  [1.9e+03,3.45e+04):8373   [582,2.52e+03) :6938     [54.5,139):6668  
##  [3.45e+04, Inf]   : 508   [2.52e+03, Inf]: 508     [139, Inf]:   1  
##                                                                      
##                                                                      
##                                                                      
##     Avg_Utilization_Ratio
##  [-Inf,0)      :   0     
##  [0,0.0205)    :2515     
##  [0.0205,0.999):7611     
##  [0.999, Inf]  :   1     
##                          
##                          
##

D2.trans <- transactions(D2.disc)
D2.ass <- mineCARs(Attrition_Flag ~ ., transactions = D2.trans, support = 0.03, confidence = 0.7)

D2.clean <- D2.ass[!is.redundant(D2.ass)]
summary(D2.clean)

## set of 2458 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3    4    5 
##    1   22  194  751 1490 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.508   5.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02192   Min.   :0.7173   Min.   :0.03002   Min.   :1.000  
##  1st Qu.:0.04058   1st Qu.:0.8878   1st Qu.:0.04345   1st Qu.:1.063  
##  Median :0.06335   Median :0.9420   Median :0.06794   Median :1.127  
##  Mean   :0.09964   Mean   :0.9279   Mean   :0.10731   Mean   :1.220  
##  3rd Qu.:0.11650   3rd Qu.:0.9647   3rd Qu.:0.12511   3rd Qu.:1.153  
##  Max.   :0.83934   Max.   :1.0000   Max.   :1.00000   Max.   :5.660  
##      count       
##  Min.   : 222.0  
##  1st Qu.: 411.0  
##  Median : 641.5  
##  Mean   :1009.1  
##  3rd Qu.:1179.8  
##  Max.   :8500.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         10127    0.03        0.7
##                                                                                                                                         call
##  apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)

inspectDT(D2.clean)

Statistics:

partitions - 4 (some of them didn’t have observations in them)
number of rules after cleaning - 2458
number of rules for rhs = Attrited Customer - 70
highest support for rhs = Attrited Customer - 0.082
highest confidence for rhs = Attrited Customer - 0.909
highest lift for rhs = Attrited Customer - 5.660

This method was quick and produced a uniform number of partitions - each discretized variable was split into 4 intervals, although the outermost intervals are often empty or nearly empty. 2458 rules were mined, 70 of them for attrited customers. The rule with the highest confidence (0.909) states that the probability of a customer being attrited is about 91% when the customer has a transaction count between 10 and 54.5, between 35.5 and 56 months on book and a relationship count between 1 and 2.5. The rule with the second highest confidence states that the probability of being attrited is about 90% for a customer with a blue card, a transaction count between 10 and 54.5, between 1.5 and 6 months inactive and a relationship count between 1 and 2.5.

ChiMerge

D3.disc <- discretizeDF.supervised(Attrition_Flag ~ ., data = clients, method = "chimerge")
summary(D3.disc)

##            Attrition_Flag      Customer_Age     Gender       Dependent_count
##  Attrited Customer:1627   [-Inf,29.5): 195   Female:5358   [-Inf,2.5):5397  
##  Existing Customer:8500   [29.5,39.5):1841   Male  :4769   [2.5, Inf]:4730  
##                           [39.5,53.5):6174                                  
##                           [53.5,54.5): 307                                  
##                           [54.5,58.5): 921                                  
##                           [58.5,59.5): 157                                  
##                           [59.5, Inf]: 532                                  
##           Education_Level                Marital_Status
##  College          :1013   Divorced              : 748  
##  Doctorate        : 451   Married               :4687  
##  Graduate         :3128   Single                :3943  
##  High School      :2013   Unknown marital status: 749  
##  Post-Graduate    : 516                                
##  Uneducated       :1487                                
##  Unknown education:1519                                
##             Income_Category  Card_Category      Months_on_book
##  Income $40K - $60K :1790   Blue    :9436   [-Inf,29.5):1920  
##  Income $60K - $80K :1402   Gold    : 116   [29.5,30.5): 300  
##  Income $80K - $120K:1535   Platinum:  20   [30.5,31.5): 318  
##  Income over $120K  : 727   Silver  : 555   [31.5,49.5):7075  
##  Income under $40K  :3561                   [49.5,52.5): 238  
##  Unknown income     :1112                   [52.5, Inf]: 276  
##                                                               
##  Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
##  [-Inf,2.5):2153          [-Inf,0.5):  29        [-Inf,0.5): 399      
##  [2.5,3.5) :2305          [0.5,1.5) :2233        [0.5,1.5) :1499      
##  [3.5, Inf]:5669          [1.5,2.5) :3282        [1.5,2.5) :3227      
##                           [2.5,3.5) :3846        [2.5,4.5) :4772      
##                           [3.5,4.5) : 435        [4.5,5.5) : 176      
##                           [4.5, Inf]: 302        [5.5, Inf]:  54      
##                                                                       
##           Credit_Limit       Total_Revolving_Bal     Total_Trans_Ct
##  [-Inf,1503.5)  : 594   [-Inf,66)      :2470     [78.5,94.5):2025  
##  [34167.5, Inf] : 513   [2512.5, Inf]  : 512     [71.5,78.5):1329  
##  [2302.5,2434.5): 249   [1012.5,1119.5): 389     [64.5,71.5):1284  
##  [2157.5,2245.5): 205   [864.5,945.5)  : 292     [22.5,34.5): 938  
##  [1819.5,1905.5): 140   [1707.5,1770.5): 272     [94.5, Inf]: 882  
##  [1506.5,1589)  : 120   [1384.5,1430.5): 241     [57.5,64.5): 871  
##  (Other)        :8306   (Other)        :5951     (Other)    :2798  
##      Avg_Utilization_Ratio
##  [-Inf,0.002)   :2470     
##  [0.2025,0.3025): 899     
##  [0.1135,0.1675): 689     
##  [0.0755,0.1105): 601     
##  [0.5715,0.6325): 490     
##  [0.0525,0.0725): 479     
##  (Other)        :4499

D3.trans <- transactions(D3.disc)
D3.ass <- mineCARs(Attrition_Flag ~ ., transactions = D3.trans, support = 0.03, confidence = 0.7)

D3.clean <- D3.ass[!is.redundant(D3.ass)]
summary(D3.clean)

## set of 801 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3   4   5 
##   1  34 187 347 232 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   3.968   5.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02637   Min.   :0.8393   Min.   :0.03002   Min.   :1.000  
##  1st Qu.:0.03347   1st Qu.:0.8766   1st Qu.:0.03594   1st Qu.:1.044  
##  Median :0.04394   Median :0.9115   Median :0.04760   Median :1.086  
##  Mean   :0.06137   Mean   :0.9189   Mean   :0.06760   Mean   :1.095  
##  3rd Qu.:0.06626   3rd Qu.:0.9683   3rd Qu.:0.07337   3rd Qu.:1.154  
##  Max.   :0.83934   Max.   :1.0000   Max.   :1.00000   Max.   :1.191  
##      count       
##  Min.   : 267.0  
##  1st Qu.: 339.0  
##  Median : 445.0  
##  Mean   : 621.5  
##  3rd Qu.: 671.0  
##  Max.   :8500.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         10127    0.03        0.7
##                                                                                                                                         call
##  apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)

inspectDT(D3.clean)

This method took the longest of them all.

Statistics:

partitions - 2-6+
number of rules after cleaning - 801
number of rules for rhs = Attrited Customer - no rules
highest support for rhs = Attrited Customer - x
highest confidence for rhs = Attrited Customer - x
highest lift for rhs = Attrited Customer - x

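This method took at least a few minutes to run and produced between 2 and 6+ partitions per variable. 801 rules were mined, but none of them has Attrited Customer on the RHS (the maximum lift of 1.191 in the summary is what a rule with confidence 1 for Existing Customer yields, 1/0.839), so we can’t determine which characteristics were most common among the customers who churned.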

Interval

D4.disc <- discretizeDF(clients, default = list(method = "interval", breaks = 3, labels = c("low", "moderate", "high")))
summary(D4.disc)

##            Attrition_Flag   Customer_Age     Gender     Dependent_count
##  Attrited Customer:1627   low     :2776   Female:5358   low     :2742  
##  Existing Customer:8500   moderate:6505   Male  :4769   moderate:5387  
##                           high    : 846                 high    :1998  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##           Education_Level                Marital_Status
##  College          :1013   Divorced              : 748  
##  Doctorate        : 451   Married               :4687  
##  Graduate         :3128   Single                :3943  
##  High School      :2013   Unknown marital status: 749  
##  Post-Graduate    : 516                                
##  Uneducated       :1487                                
##  Unknown education:1519                                
##             Income_Category  Card_Category   Months_on_book
##  Income $40K - $60K :1790   Blue    :9436   low     :1404  
##  Income $60K - $80K :1402   Gold    : 116   moderate:6537  
##  Income $80K - $120K:1535   Platinum:  20   high    :2186  
##  Income over $120K  : 727   Silver  : 555                  
##  Income under $40K  :3561                                  
##  Unknown income     :1112                                  
##                                                            
##  Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
##  low     :2153            low     :2262          low     :1898        
##  moderate:4217            moderate:7128          moderate:6607        
##  high    :3757            high    : 737          high    :1622        
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##    Credit_Limit  Total_Revolving_Bal  Total_Trans_Ct Avg_Utilization_Ratio
##  low     :7892   low     :3330       low     :3284   low     :6484        
##  moderate:1231   moderate:3780       moderate:6001   moderate:2333        
##  high    :1004   high    :3017       high    : 842   high    :1310        
##                                                                           
##                                                                           
##                                                                           
##

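# Convert the discretized data frame to transactions (one item per variable level)
D4.trans <- transactions(D4.disc)
# Mine class association rules with Attrition_Flag on the right-hand side
D4.ass <- mineCARs(Attrition_Flag ~ ., transactions = D4.trans, support = 0.03, confidence = 0.7)

# Drop redundant rules before summarising
D4.clean <- D4.ass[!is.redundant(D4.ass)]
summary(D4.clean)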

## set of 1727 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3   4   5 
##   1  27 238 656 805 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   4.000   4.295   5.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02340   Min.   :0.7016   Min.   :0.03002   Min.   :1.000  
##  1st Qu.:0.03412   1st Qu.:0.8802   1st Qu.:0.03663   1st Qu.:1.057  
##  Median :0.04532   Median :0.9491   Median :0.04898   Median :1.134  
##  Mean   :0.06448   Mean   :0.9259   Mean   :0.06970   Mean   :1.231  
##  3rd Qu.:0.07041   3rd Qu.:0.9654   3rd Qu.:0.07643   3rd Qu.:1.153  
##  Max.   :0.83934   Max.   :1.0000   Max.   :1.00000   Max.   :5.683  
##      count       
##  Min.   : 237.0  
##  1st Qu.: 345.5  
##  Median : 459.0  
##  Mean   : 652.9  
##  3rd Qu.: 713.0  
##  Max.   :8500.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         10127    0.03        0.7
##                                                                                                                                         call
##  apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)

inspectDT(D4.clean)

Statistics:

partitions - 3
number of rules after cleaning - 1727
number of rules for rhs = Attrited Customer - 57
highest support for rhs = Attrited Customer - 0.081
highest confidence for rhs = Attrited Customer - 0.913
highest lift for rhs = Attrited Customer - 5.683

This method was quick and was set to produce 3 partitions. It mined 1727 rules, 57 of them for the attrited customers. The rule with the highest confidence (0.913) states that a customer is an attrited customer with a probability of 91% when they hold a blue card and have a low transaction count, a low relationship count and a low utilization ratio. The rule with the second-highest confidence states that the probability is also about 91% for a customer with a low relationship count, a low credit limit and a low transaction count.
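
The uneven bin sizes in the summary above (for example, 7892 of the 10127 customers fall into the low Credit_Limit bin) follow from how equal-width binning works: the range is cut into intervals of the same length, regardless of where the observations lie. Below is a minimal base R sketch on made-up values. It assumes that cut() with evenly spaced breaks is a fair stand-in for method = "interval"; the vector x and the helper name equal_width_bins are illustrative and not part of the original analysis.

```r
# Hypothetical helper: split a numeric vector into k equal-width bins.
equal_width_bins <- function(x, k = 3, labels = c("low", "moderate", "high")) {
  # k + 1 evenly spaced cut points between the minimum and the maximum
  breaks <- seq(min(x), max(x), length.out = k + 1)
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Made-up, right-skewed values: one large value stretches the range
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

width_counts <- table(equal_width_bins(x))
print(width_counts)
```

With these values the cut points are 1, 34, 67 and 100, so nine of the ten observations land in the low bin and the moderate bin stays empty. A skewed variable behaves the same way on the real data.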

Frequency

# Equal-frequency discretization: every numeric column is split into 3 bins
D5.disc <- discretizeDF(clients, default = list(method = "frequency", breaks = 3, labels = c("low", "moderate", "high")))
summary(D5.disc)

##            Attrition_Flag   Customer_Age     Gender     Dependent_count
##  Attrited Customer:1627   low     :3202   Female:5358   low     :2742  
##  Existing Customer:8500   moderate:3395   Male  :4769   moderate:2655  
##                           high    :3530                 high    :4730  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##           Education_Level                Marital_Status
##  College          :1013   Divorced              : 748  
##  Doctorate        : 451   Married               :4687  
##  Graduate         :3128   Single                :3943  
##  High School      :2013   Unknown marital status: 749  
##  Post-Graduate    : 516                                
##  Uneducated       :1487                                
##  Unknown education:1519                                
##             Income_Category  Card_Category   Months_on_book
##  Income $40K - $60K :1790   Blue    :9436   low     :3132  
##  Income $60K - $80K :1402   Gold    : 116   moderate:3491  
##  Income $80K - $120K:1535   Platinum:  20   high    :3504  
##  Income over $120K  : 727   Silver  : 555                  
##  Income under $40K  :3561                                  
##  Unknown income     :1112                                  
##                                                            
##  Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
##  low     :2153            low     :2262          low     :1898        
##  moderate:4217            moderate:3282          moderate:3227        
##  high    :3757            high    :4583          high    :5002        
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##    Credit_Limit  Total_Revolving_Bal  Total_Trans_Ct Avg_Utilization_Ratio
##  low     :3375   low     :3375       low     :3369   low     :3367        
##  moderate:3376   moderate:3375       moderate:3266   moderate:3384        
##  high    :3376   high    :3377       high    :3492   high    :3376        
##                                                                           
##                                                                           
##                                                                           
##

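# Convert the discretized data frame to transactions (one item per variable level)
D5.trans <- transactions(D5.disc)

# Mine class association rules with Attrition_Flag on the right-hand side
D5.ass <- mineCARs(Attrition_Flag ~ ., transactions = D5.trans, support = 0.03, confidence = 0.7)

# Drop redundant rules before summarising
D5.clean <- D5.ass[!is.redundant(D5.ass)]
summary(D5.clean)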

## set of 2049 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3    4    5 
##    1   28  323 1076  621 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   4.000   4.117   5.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.02242   Min.   :0.7000   Min.   :0.03002   Min.   :1.000  
##  1st Qu.:0.03318   1st Qu.:0.8849   1st Qu.:0.03565   1st Qu.:1.064  
##  Median :0.04108   Median :0.9259   Median :0.04434   Median :1.108  
##  Mean   :0.05468   Mean   :0.9211   Mean   :0.05983   Mean   :1.215  
##  3rd Qu.:0.05678   3rd Qu.:0.9578   3rd Qu.:0.06201   3rd Qu.:1.146  
##  Max.   :0.83934   Max.   :1.0000   Max.   :1.00000   Max.   :5.859  
##      count       
##  Min.   : 227.0  
##  1st Qu.: 336.0  
##  Median : 416.0  
##  Mean   : 553.8  
##  3rd Qu.: 575.0  
##  Max.   :8500.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions         10127    0.03        0.7
##                                                                                                                                         call
##  apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)

inspectDT(D5.clean)

Statistics:

partitions - 3
number of rules after cleaning - 2049
number of rules for rhs = Attrited Customer - 62
highest support for rhs = Attrited Customer - 0.074
highest confidence for rhs = Attrited Customer - 0.941
highest lift for rhs = Attrited Customer - 5.859

This method was quick and was set to produce 3 partitions. It mined 2049 rules, 62 of them for the attrited customers. The rule with the highest confidence (0.941) states that a customer is an attrited customer with a probability of 94% when they have a low transaction count, a low relationship count and a low contacts count. The rule with the second-highest confidence states that the probability is 89% for a customer with a blue card, a low transaction count and a low relationship count.
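
Equal-frequency binning places the cut points at quantiles, so each bin receives roughly the same number of observations. Below is a minimal base R sketch on made-up values. It assumes that quantile-based breaks passed to cut() approximate what method = "frequency" does; the vector x and the helper name equal_frequency_bins are illustrative only.

```r
# Hypothetical helper: split a numeric vector into k bins of similar size.
equal_frequency_bins <- function(x, k = 3, labels = c("low", "moderate", "high")) {
  # Cut points at the 0, 1/k, ..., 1 quantiles of the data
  breaks <- quantile(x, probs = seq(0, 1, length.out = k + 1), names = FALSE)
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Same made-up, right-skewed values as one might use for equal-width binning
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

freq_counts <- table(equal_frequency_bins(x))
print(freq_counts)
```

Each bin now holds about a third of the values even though one of them is extreme. Variables with only a few distinct values, such as Months_Inactive_12_mon in the summary above, cannot be split as evenly, because tied observations have to share a bin.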

Comparison of the results

Statistics comparison
Method	MDLP	CAIM	ChiMerge	Interval	Frequency
partitions	1-6+	4	2-6+	3	3
number of rules after cleaning	877	2458	801	1727	2049
number of rules for rhs = Attrited Customer	11	70	0	57	62
highest support for rhs = Attrited Customer	0.044	0.082	x	0.081	0.074
highest confidence for rhs = Attrited Customer	0.855	0.909	x	0.913	0.941
highest lift for rhs = Attrited Customer	5.322	5.660	x	5.683	5.859
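
The table can also be held as a data frame so that the best method per measure is read off programmatically. This is only a small sketch: the values are copied from the table above, the x entries for ChiMerge are entered as NA, and the column names are my own.

```r
# Comparison table re-entered by hand; NA marks the "x" cells for ChiMerge
comparison <- data.frame(
  method         = c("MDLP", "CAIM", "ChiMerge", "Interval", "Frequency"),
  rules_total    = c(877, 2458, 801, 1727, 2049),
  rules_attrited = c(11, 70, 0, 57, 62),
  max_support    = c(0.044, 0.082, NA, 0.081, 0.074),
  max_confidence = c(0.855, 0.909, NA, 0.913, 0.941),
  max_lift       = c(5.322, 5.660, NA, 5.683, 5.859),
  stringsAsFactors = FALSE
)

# Method with the highest value of each measure (which.max skips NA)
measures <- c("max_support", "max_confidence", "max_lift")
best <- sapply(comparison[measures], function(col) comparison$method[which.max(col)])
print(best)

# Share of the cleaned rules that describe attrited customers, in percent
comparison$attrited_share <- round(100 * comparison$rules_attrited / comparison$rules_total, 1)
print(comparison[c("method", "attrited_share")])
```

This picks CAIM for the highest support and Frequency for both the highest confidence and the highest lift.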

The results differed somewhat between the methods. MDLP and ChiMerge produced quite uneven partitions, while the CAIM algorithm separated every variable into 4 partitions. ChiMerge was the only method that did not mine any rules for rhs = Attrited Customer, so it did not allow the characteristics of the former customers to be defined. The other methods mined between 11 and 70 such rules. The rule with the highest support (the frequency with which the antecedent and the consequent occur together) was observed for the data discretized with the CAIM algorithm. Both the highest confidence (the probability of the consequent occurring when the antecedent is present) and the highest lift (how much more often the antecedent and the consequent occur together than would be expected if they were independent) were observed for the data discretized by frequency.
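
The three measures can be reproduced from raw counts, which also makes it possible to sanity-check the reported numbers. Below is a minimal base R sketch; the helper name rule_measures is hypothetical, while the counts (10127 transactions, 8500 existing and 1627 attrited customers) are taken from the summaries above.

```r
# Hypothetical helper: quality measures of a rule lhs => rhs from raw counts.
rule_measures <- function(n_lhs, n_rhs, n_both, n) {
  support    <- n_both / n                 # share of transactions with lhs and rhs
  coverage   <- n_lhs / n                  # share of transactions with lhs
  confidence <- n_both / n_lhs             # estimate of P(rhs | lhs)
  lift       <- confidence / (n_rhs / n)   # confidence relative to baseline P(rhs)
  c(support = support, coverage = coverage, confidence = confidence, lift = lift)
}

n          <- 10127  # number of transactions
n_existing <- 8500   # Existing Customer
n_attrited <- 1627   # Attrited Customer

# Rule with an empty antecedent and rhs = Existing Customer
baseline <- rule_measures(n_lhs = n, n_rhs = n_existing, n_both = n_existing, n = n)
print(round(baseline, 5))

# Lift implied by a confidence of 0.913 for rhs = Attrited Customer
implied_lift <- 0.913 / (n_attrited / n)
print(round(implied_lift, 3))
```

The first call gives a support of 0.83934, a coverage of 1 and a lift of 1. These match the maximum support, maximum coverage and minimum lift in the rule summaries, which is consistent with the single length-1 rule having an empty antecedent. The second value is about 5.683, the highest lift reported for the interval method. Because only about 16% of the customers are attrited, even a rule with roughly 91% confidence reaches a lift above 5.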

Looking at the rules with the highest confidence across the different partitionings, the same items kept recurring. The probability of a customer being an attrited customer was highest for customers with combinations of the following characteristics:

below 50 / low transaction count
low utilization ratio
below 2 / low relationship count
blue card

By analysing the rules in more depth, one can find further interesting patterns for the attrited customers.

Summary

The proposed analysis illustrates the impact of various discretization methods on rule mining outcomes. Although the statistics and the number of rules vary, the key elements of the rules with the highest confidence remain consistent. This kind of comparison makes it possible to evaluate the advantages and disadvantages of specific discretization methods: some provide favorable statistics, while others fall short of the desired results. The most suitable method should be selected on the basis of a thorough analysis of which one performs best in a given scenario.

Sources:

UL class materials
Association rules explained
Descriptions of the discretization methods

Monika Kot

January 2024
