The aim of this project is to analyse how different discretization methods affect the partitioning of continuous data, the association rules generated from it, and the interpretability of those rules. The analysis compares 5 discretization methods applied to a dataset containing both qualitative and quantitative data. First the data will be described and cleaned. After that, the 5 discretization methods will be applied to the continuous data, and the resulting partitions and mined association rules will be compared.
The analysis will be conducted on a credit card consumer dataset [1] found on Kaggle. The dataset was constructed mainly for predicting credit card customer segmentation, as it contains extensive data on around 10 000 customers, including whether each one is an existing or a former (attrited) customer. For my analysis I will use both the qualitative data (gender, education level, marital status etc.), which will be converted to factor type, and the quantitative data (age, number of dependents, transaction count etc.), which will be divided into partitions using different discretization methods.
The dataset consists of 8500 observations of current customers and 1627 observations of former customers. By employing the different discretization methods, I aim to explore how the way of dividing continuous variables affects the association rules created for the customers. I want to focus specifically on characterizing the attrited customers, as this can provide insight into which clients are most likely to churn.
From the 23 variables in the dataset I chose 16, which are listed in the str() output below.
Let’s load the data and look at the variables.
# Read the raw Kaggle export
clients <- read.csv("BankChurners.csv")
# Keep the 16 columns used in this analysis
clients <- clients[, c(2:15, 19, 21)]
str(clients)
## 'data.frame': 10127 obs. of 16 variables:
## $ Attrition_Flag : chr "Existing Customer" "Existing Customer" "Existing Customer" "Existing Customer" ...
## $ Customer_Age : int 45 49 51 40 40 44 51 32 37 48 ...
## $ Gender : chr "M" "F" "M" "F" ...
## $ Dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
## $ Education_Level : chr "High School" "Graduate" "Graduate" "High School" ...
## $ Marital_Status : chr "Married" "Single" "Married" "Unknown" ...
## $ Income_Category : chr "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
## $ Card_Category : chr "Blue" "Blue" "Blue" "Blue" ...
## $ Months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
## $ Total_Relationship_Count: int 5 6 4 3 5 3 6 2 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
## $ Contacts_Count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
## $ Credit_Limit : num 12691 8256 3418 3313 4716 ...
## $ Total_Revolving_Bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
## $ Total_Trans_Ct : int 42 33 20 20 28 24 31 36 24 32 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
Within the dataset, there are 10 integer/numerical variables and 6 character variables. I’m going to change the labels within some of the character variables so that they are easier to interpret. After that I will transform all of the character variables to factor type for further analysis.
# 3 - Gender
clients$Gender <- ifelse(clients[, 3] == "M", "Male",
                  ifelse(clients[, 3] == "F", "Female",
                         NA))
# 5 - Education_Level
clients$Education_Level <- ifelse(clients[, 5] == "Unknown", "Unknown education",
                           ifelse(clients[, 5] == "High School", "High School",
                           ifelse(clients[, 5] == "Graduate", "Graduate",
                           ifelse(clients[, 5] == "Uneducated", "Uneducated",
                           ifelse(clients[, 5] == "College", "College",
                           ifelse(clients[, 5] == "Post-Graduate", "Post-Graduate",
                           ifelse(clients[, 5] == "Doctorate", "Doctorate",
                                  NA)))))))
# 6 - Marital_Status
clients$Marital_Status <- ifelse(clients[, 6] == "Unknown", "Unknown marital status",
                          ifelse(clients[, 6] == "Married", "Married",
                          ifelse(clients[, 6] == "Single", "Single",
                          ifelse(clients[, 6] == "Divorced", "Divorced",
                                 NA))))
# 7 - Income_Category
clients$Income_Category <- ifelse(clients[, 7] == "Unknown", "Unknown income",
                           ifelse(clients[, 7] == "Less than $40K", "Income under $40K",
                           ifelse(clients[, 7] == "$40K - $60K", "Income $40K - $60K",
                           ifelse(clients[, 7] == "$60K - $80K", "Income $60K - $80K",
                           ifelse(clients[, 7] == "$80K - $120K", "Income $80K - $120K",
                           ifelse(clients[, 7] == "$120K +", "Income over $120K",
                                  NA))))))
# converting to factor type
columns_to_factor <- c("Attrition_Flag", "Gender", "Education_Level",
"Marital_Status", "Income_Category", "Card_Category")
clients[columns_to_factor] <- lapply(clients[columns_to_factor], factor)
# checking the type of the variables
str(clients)
## 'data.frame': 10127 obs. of 16 variables:
## $ Attrition_Flag : Factor w/ 2 levels "Attrited Customer",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Customer_Age : int 45 49 51 40 40 44 51 32 37 48 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 2 2 2 2 ...
## $ Dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
## $ Education_Level : Factor w/ 7 levels "College","Doctorate",..: 4 3 3 4 6 3 7 4 6 3 ...
## $ Marital_Status : Factor w/ 4 levels "Divorced","Married",..: 2 3 2 4 2 2 2 4 3 3 ...
## $ Income_Category : Factor w/ 6 levels "Income $40K - $60K",..: 2 5 3 5 2 1 4 2 2 3 ...
## $ Card_Category : Factor w/ 4 levels "Blue","Gold",..: 1 1 1 1 1 1 2 4 1 1 ...
## $ Months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
## $ Total_Relationship_Count: int 5 6 4 3 5 3 6 2 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
## $ Contacts_Count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
## $ Credit_Limit : num 12691 8256 3418 3313 4716 ...
## $ Total_Revolving_Bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
## $ Total_Trans_Ct : int 42 33 20 20 28 24 31 36 24 32 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
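The nested ifelse() recoding above can also be expressed with a named lookup vector, which some readers may find easier to maintain. The following is only a sketch on a small hypothetical vector (income_raw stands in for the Income_Category column; income_labels, income_new and income_factor are illustrative names, not part of the original analysis).

```r
# Hypothetical stand-in for a few raw Income_Category values.
income_raw <- c("$60K - $80K", "Less than $40K", "Unknown", "$120K +")

# Named vector: names are the raw labels, values are the new labels.
income_labels <- c(
  "Unknown"        = "Unknown income",
  "Less than $40K" = "Income under $40K",
  "$40K - $60K"    = "Income $40K - $60K",
  "$60K - $80K"    = "Income $60K - $80K",
  "$80K - $120K"   = "Income $80K - $120K",
  "$120K +"        = "Income over $120K"
)

# Indexing by name recodes every element at once; a raw label that is
# missing from the lookup comes back as NA, like the final NA branch
# of the nested ifelse() calls.
income_new <- unname(income_labels[income_raw])

# Convert to factor, as is done for the real columns.
income_factor <- factor(income_new)
print(income_factor)
```

The behaviour matches the ifelse() chain for the labels listed; the choice between the two is mostly a matter of readability.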
Association rule mining is an unsupervised learning technique designed to reveal patterns within large datasets. This method allows us to uncover connections that may not be apparent at first glance, such as identifying items that frequently occur together.
An association rule consists of two components: an if part (the antecedent) and a then part (the consequent). One good explanation is: “an antecedent is something that’s found in data, and a consequent is an item that is found in combination with the antecedent” [2]. For instance, in the dataset of bank clients, one could look for a rule such as “if income is under $40K, then the client is an existing customer.” The metrics used to assess the strength of a rule are support (how often the antecedent and consequent occur together), confidence (the probability of the consequent given the antecedent) and lift (how much more often the two occur together than would be expected if they were independent).
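To make the three usual rule-quality measures (support, confidence and lift) concrete, here is a minimal base R sketch on made-up data. The ten toy clients and the vectors low_income and existing are hypothetical and are not taken from the bank dataset.

```r
# Hypothetical toy data: 10 clients, one logical flag per item.
low_income <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
existing   <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)

# Rule under study: {low income} => {existing customer}

# Support: share of all clients where antecedent and consequent both hold.
rule_support <- mean(low_income & existing)

# Confidence: share of antecedent clients for whom the consequent holds.
rule_confidence <- rule_support / mean(low_income)

# Lift: confidence relative to the overall share of the consequent.
rule_lift <- rule_confidence / mean(existing)

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were unrelated; a lift of exactly 1 means the antecedent tells us nothing about the consequent.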
Importantly, association rules can be mined only from qualitative (categorical) data; raw numerical values cannot be used as items directly. To handle numerical data, one has to split it into intervals, which then act as categories. Let’s look at a variable from our dataset: age.
summary(clients$Customer_Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 41.00 46.00 46.33 52.00 73.00
hist(clients$Customer_Age, col = "brown",
     xlab = "Age", ylab = "Number of customers",
     main = "Histogram of age variable")
The minimum age is 26, the median is 46 and the maximum is 73. Some options for grouping this variable would be:
These are just 3 options from endless possibilities, so which one should a person choose? It is hard to hand-pick intervals for every quantitative variable in a dataset, as each might need different intervals depending on what it is analysed with. That is why, for this kind of data, we use CBA (Classification Based on Associations). It still looks for if-then connections, but the qualitative variable on the RHS (right-hand side) has to be specified in advance; the continuous variables on the LHS (left-hand side) can then be partitioned with respect to that RHS.
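As an illustration of how arbitrary manual grouping is, the sketch below bins a handful of ages in two different ways with base R's cut(). The age vector and both sets of breaks are hypothetical choices made only for this example; they are not the options considered in the original analysis.

```r
# Hypothetical ages standing in for clients$Customer_Age.
age <- c(26, 33, 41, 46, 46, 52, 58, 65, 73)

# Option A: two groups split at the median reported above (46).
# With the default right = TRUE the intervals are (-Inf, 46] and (46, Inf].
by_median <- cut(age, breaks = c(-Inf, 46, Inf),
                 labels = c("46 or younger", "older than 46"))

# Option B: decade-style bins; right = FALSE gives intervals of the form [a, b).
by_decade <- cut(age, breaks = c(20, 30, 40, 50, 60, 70, 80), right = FALSE)

print(table(by_median))
print(table(by_decade))
```

The same nine clients end up in two or in six groups depending on a choice the analyst has to justify by hand, which is the problem that automatic discretization methods try to address.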
Several algorithms make use of CBA; some of them let the user specify the discretization method, while others do not. In this analysis, we will concentrate on one that offers this option, namely mineCARs() from the arulesCBA package. To use it, the continuous variables first have to be discretized, and then the rules can be generated.
In my analysis I want to compare the outcomes of different discretization methods. More specifically, I will concentrate on five methods: three supervised and two unsupervised.
Supervised methods (using discretizeDF.supervised()): MDLP, CAIM and ChiMerge.
Unsupervised methods (using discretizeDF()): equal-width intervals and equal-frequency intervals.
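The difference between the two unsupervised approaches can be sketched in base R. This is a rough illustration of the idea on hypothetical, right-skewed values (x is made up), not a reproduction of what discretizeDF() does internally.

```r
# Hypothetical right-skewed values, loosely resembling a credit limit.
x <- c(1500, 1800, 2000, 2500, 3000, 3500, 5000, 9000, 34000)
bin_labels <- c("low", "moderate", "high")

# Equal width: split the range [min, max] into 3 intervals of equal length.
width_breaks <- seq(min(x), max(x), length.out = 4)
width_bins <- cut(x, breaks = width_breaks, include.lowest = TRUE,
                  labels = bin_labels)

# Equal frequency: cut at the 1/3 and 2/3 quantiles so that each bin
# holds roughly the same number of observations.
freq_breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
freq_bins <- cut(x, breaks = freq_breaks, include.lowest = TRUE,
                 labels = bin_labels)

print(table(width_bins))
print(table(freq_bins))
```

On skewed data the equal-width bins are dominated by the lowest interval, while the equal-frequency bins stay balanced. The same pattern is visible further down in the summaries for Credit_Limit under the interval and frequency methods.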
For each of the methods I will follow the same order. First I will discretize the continuous variables using the selected method and transform the data into the required structure (transactions form). After that I will use the mineCARs() algorithm to mine the rules, setting the minimum support to 0.03 and the minimum confidence to 0.7. Then I will remove the redundant rules, that is, the rules that do not provide additional meaningful information beyond what is already covered by other rules. Finally I will look at the mined rules, especially those where RHS = “Attrited Customer”. This will give more insight into the characteristics of the clients who churned.
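The idea of a redundant rule can be sketched in base R. The check below treats a rule as redundant when a more general rule (same RHS, LHS a proper subset) has equal or higher confidence. The three rules and their confidence values are hypothetical, and the actual is.redundant() implementation may differ in details.

```r
# Hypothetical rules, all with the same RHS ("Attrited Customer").
rules <- data.frame(
  lhs = c("low_trans", "low_trans,female", "low_trans,low_rel"),
  confidence = c(0.80, 0.78, 0.91),
  stringsAsFactors = FALSE
)

# Split each LHS into its individual items.
lhs_items <- strsplit(rules$lhs, ",", fixed = TRUE)

# Rule i is flagged when some other rule j has an LHS that is a proper
# subset of rule i's LHS and a confidence at least as high.
is_redundant <- vapply(seq_along(lhs_items), function(i) {
  any(vapply(seq_along(lhs_items), function(j) {
    j != i &&
      length(lhs_items[[j]]) < length(lhs_items[[i]]) &&
      all(lhs_items[[j]] %in% lhs_items[[i]]) &&
      rules$confidence[j] >= rules$confidence[i]
  }, logical(1)))
}, logical(1))

# Keep only the non-redundant rules.
print(rules[!is_redundant, ])
```

Here adding "female" to the antecedent lowers the confidence, so that rule is dropped, while adding "low_rel" raises it and the longer rule is kept.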
library(arulesCBA)
library(arulesViz)
D1.disc <- discretizeDF.supervised(Attrition_Flag ~ ., data = clients, method = "mdlp")
summary(D1.disc)
## Attrition_Flag Customer_Age Gender Dependent_count
## Attrited Customer:1627 [-Inf, Inf]:10127 Female:5358 [-Inf, Inf]:10127
## Existing Customer:8500 Male :4769
##
##
##
##
##
## Education_Level Marital_Status
## College :1013 Divorced : 748
## Doctorate : 451 Married :4687
## Graduate :3128 Single :3943
## High School :2013 Unknown marital status: 749
## Post-Graduate : 516
## Uneducated :1487
## Unknown education:1519
## Income_Category Card_Category Months_on_book
## Income $40K - $60K :1790 Blue :9436 [-Inf, Inf]:10127
## Income $60K - $80K :1402 Gold : 116
## Income $80K - $120K:1535 Platinum: 20
## Income over $120K : 727 Silver : 555
## Income under $40K :3561
## Unknown income :1112
##
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## [-Inf,2.5):2153 [-Inf,0.5): 29 [-Inf,0.5): 399
## [2.5,3.5) :2305 [0.5,1.5) :2233 [0.5,1.5) :1499
## [3.5, Inf]:5669 [1.5,2.5) :3282 [1.5,2.5) :3227
## [2.5, Inf]:4583 [2.5,5.5) :4948
## [5.5, Inf]: 54
##
##
## Credit_Limit Total_Revolving_Bal Total_Trans_Ct
## [-Inf,1.9e+03) :1246 [-Inf,66) :2470 [64.5,78.5):2613
## [1.9e+03,3.4e+03):2805 [66,427) : 78 [78.5,94.5):2025
## [3.4e+03, Inf] :6076 [427,582) : 133 [37.5,51.5):1683
## [582,980) :1141 [20.5,37.5):1418
## [980,2.38e+03) :5585 [94.5, Inf]: 882
## [2.38e+03, Inf]: 720 [57.5,64.5): 871
## (Other) : 635
## Avg_Utilization_Ratio
## [-Inf,0.0255) :2556
## [0.0255,0.451):4671
## [0.451,0.798) :2429
## [0.798, Inf] : 471
##
##
##
D1.trans <- transactions(D1.disc)
D1.ass <- mineCARs(Attrition_Flag ~ ., transactions = D1.trans, support = 0.03, confidence = 0.7)
D1.clean <- D1.ass[!is.redundant(D1.ass)]
summary(D1.clean)
## set of 877 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 26 191 388 271
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 4.000 4.029 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02182 Min. :0.7075 Min. :0.03002 Min. :1.000
## 1st Qu.:0.03575 1st Qu.:0.9126 1st Qu.:0.03762 1st Qu.:1.092
## Median :0.04779 Median :0.9554 Median :0.05046 Median :1.140
## Mean :0.07067 Mean :0.9411 Mean :0.07594 Mean :1.171
## 3rd Qu.:0.07692 3rd Qu.:0.9777 3rd Qu.:0.08186 3rd Qu.:1.166
## Max. :0.83934 Max. :1.0000 Max. :1.00000 Max. :5.322
## count
## Min. : 221.0
## 1st Qu.: 362.0
## Median : 484.0
## Mean : 715.7
## 3rd Qu.: 779.0
## Max. :8500.0
##
## mining info:
## data ntransactions support confidence
## transactions 10127 0.03 0.7
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspectDT(D1.clean)
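Statistics for the rules with RHS = Attrited Customer: highest support 0.044, highest confidence 0.855, highest lift 5.322.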
This method was quick and resulted in quite uneven partitions: some variables (Customer_Age, Dependent_count, Months_on_book) were not split at all, while others were split into more than 6 intervals. There were 877 rules mined, but only 11 of them were for the attrited customers. The rule with the highest confidence (0.855) stated that the probability of being an attrited customer is 85.5% for a female client with a transaction count between 37.5 and 51.5 and a utilization ratio below 0.0255. The rule with the second highest confidence stated that the probability of being an attrited customer is about 85% for a female client with a transaction count between 37.5 and 51.5 and a revolving balance below 66.
D2.disc <- discretizeDF.supervised(Attrition_Flag ~ ., data = clients, method = "caim")
summary(D2.disc)
## Attrition_Flag Customer_Age Gender Dependent_count
## Attrited Customer:1627 [-Inf,26): 0 Female:5358 [-Inf,0): 0
## Existing Customer:8500 [26,39.5):2036 Male :4769 [0,2.5) :5397
## [39.5,73):8090 [2.5,5) :4306
## [73, Inf]: 1 [5, Inf]: 424
##
##
##
## Education_Level Marital_Status
## College :1013 Divorced : 748
## Doctorate : 451 Married :4687
## Graduate :3128 Single :3943
## High School :2013 Unknown marital status: 749
## Post-Graduate : 516
## Uneducated :1487
## Unknown education:1519
## Income_Category Card_Category Months_on_book
## Income $40K - $60K :1790 Blue :9436 [-Inf,13): 0
## Income $60K - $80K :1402 Gold : 116 [13,35.5):3802
## Income $80K - $120K:1535 Platinum: 20 [35.5,56):6222
## Income over $120K : 727 Silver : 555 [56, Inf]: 103
## Income under $40K :3561
## Unknown income :1112
##
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## [-Inf,1): 0 [-Inf,0): 0 [-Inf,0): 0
## [1,2.5) :2153 [0,1.5) :2262 [0,5.5) :10073
## [2.5,6) :6108 [1.5,6) :7741 [5.5,6) : 0
## [6, Inf]:1866 [6, Inf]: 124 [6, Inf]: 54
##
##
##
## Credit_Limit Total_Revolving_Bal Total_Trans_Ct
## [-Inf,1.44e+03) : 0 [-Inf,0) : 0 [-Inf,10) : 0
## [1.44e+03,1.9e+03):1246 [0,582) :2681 [10,54.5) :3458
## [1.9e+03,3.45e+04):8373 [582,2.52e+03) :6938 [54.5,139):6668
## [3.45e+04, Inf] : 508 [2.52e+03, Inf]: 508 [139, Inf]: 1
##
##
##
## Avg_Utilization_Ratio
## [-Inf,0) : 0
## [0,0.0205) :2515
## [0.0205,0.999):7611
## [0.999, Inf] : 1
##
##
##
D2.trans <- transactions(D2.disc)
D2.ass <- mineCARs(Attrition_Flag ~ ., transactions = D2.trans, support = 0.03, confidence = 0.7)
D2.clean <- D2.ass[!is.redundant(D2.ass)]
summary(D2.clean)
## set of 2458 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 22 194 751 1490
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 4.508 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02192 Min. :0.7173 Min. :0.03002 Min. :1.000
## 1st Qu.:0.04058 1st Qu.:0.8878 1st Qu.:0.04345 1st Qu.:1.063
## Median :0.06335 Median :0.9420 Median :0.06794 Median :1.127
## Mean :0.09964 Mean :0.9279 Mean :0.10731 Mean :1.220
## 3rd Qu.:0.11650 3rd Qu.:0.9647 3rd Qu.:0.12511 3rd Qu.:1.153
## Max. :0.83934 Max. :1.0000 Max. :1.00000 Max. :5.660
## count
## Min. : 222.0
## 1st Qu.: 411.0
## Median : 641.5
## Mean :1009.1
## 3rd Qu.:1179.8
## Max. :8500.0
##
## mining info:
## data ntransactions support confidence
## transactions 10127 0.03 0.7
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspectDT(D2.clean)
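Statistics for the rules with RHS = Attrited Customer: highest support 0.082, highest confidence 0.909, highest lift 5.660.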
This method was quick and resulted in an even number of partitions: each of the discretized variables was given 4 intervals, although the outermost intervals are often empty or nearly empty (for example, no customer is younger than 26). There were 2458 rules mined, with 70 for the attrited customers. The rule with the highest confidence (0.909) stated that the probability of being an attrited customer is about 91% for a client with a transaction count between 10 and 54.5, between 35.5 and 56 months on book and a relationship count between 1 and 2.5. The rule with the second highest confidence stated that the probability of being an attrited customer is about 90% for a customer with a Blue card, a transaction count between 10 and 54.5, between 1.5 and 6 months inactive and a relationship count between 1 and 2.5.
D3.disc <- discretizeDF.supervised(Attrition_Flag ~ ., data = clients, method = "chimerge")
summary(D3.disc)
## Attrition_Flag Customer_Age Gender Dependent_count
## Attrited Customer:1627 [-Inf,29.5): 195 Female:5358 [-Inf,2.5):5397
## Existing Customer:8500 [29.5,39.5):1841 Male :4769 [2.5, Inf]:4730
## [39.5,53.5):6174
## [53.5,54.5): 307
## [54.5,58.5): 921
## [58.5,59.5): 157
## [59.5, Inf]: 532
## Education_Level Marital_Status
## College :1013 Divorced : 748
## Doctorate : 451 Married :4687
## Graduate :3128 Single :3943
## High School :2013 Unknown marital status: 749
## Post-Graduate : 516
## Uneducated :1487
## Unknown education:1519
## Income_Category Card_Category Months_on_book
## Income $40K - $60K :1790 Blue :9436 [-Inf,29.5):1920
## Income $60K - $80K :1402 Gold : 116 [29.5,30.5): 300
## Income $80K - $120K:1535 Platinum: 20 [30.5,31.5): 318
## Income over $120K : 727 Silver : 555 [31.5,49.5):7075
## Income under $40K :3561 [49.5,52.5): 238
## Unknown income :1112 [52.5, Inf]: 276
##
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## [-Inf,2.5):2153 [-Inf,0.5): 29 [-Inf,0.5): 399
## [2.5,3.5) :2305 [0.5,1.5) :2233 [0.5,1.5) :1499
## [3.5, Inf]:5669 [1.5,2.5) :3282 [1.5,2.5) :3227
## [2.5,3.5) :3846 [2.5,4.5) :4772
## [3.5,4.5) : 435 [4.5,5.5) : 176
## [4.5, Inf]: 302 [5.5, Inf]: 54
##
## Credit_Limit Total_Revolving_Bal Total_Trans_Ct
## [-Inf,1503.5) : 594 [-Inf,66) :2470 [78.5,94.5):2025
## [34167.5, Inf] : 513 [2512.5, Inf] : 512 [71.5,78.5):1329
## [2302.5,2434.5): 249 [1012.5,1119.5): 389 [64.5,71.5):1284
## [2157.5,2245.5): 205 [864.5,945.5) : 292 [22.5,34.5): 938
## [1819.5,1905.5): 140 [1707.5,1770.5): 272 [94.5, Inf]: 882
## [1506.5,1589) : 120 [1384.5,1430.5): 241 [57.5,64.5): 871
## (Other) :8306 (Other) :5951 (Other) :2798
## Avg_Utilization_Ratio
## [-Inf,0.002) :2470
## [0.2025,0.3025): 899
## [0.1135,0.1675): 689
## [0.0755,0.1105): 601
## [0.5715,0.6325): 490
## [0.0525,0.0725): 479
## (Other) :4499
D3.trans <- transactions(D3.disc)
D3.ass <- mineCARs(Attrition_Flag ~ ., transactions = D3.trans, support = 0.03, confidence = 0.7)
D3.clean <- D3.ass[!is.redundant(D3.ass)]
summary(D3.clean)
## set of 801 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 34 187 347 232
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.968 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02637 Min. :0.8393 Min. :0.03002 Min. :1.000
## 1st Qu.:0.03347 1st Qu.:0.8766 1st Qu.:0.03594 1st Qu.:1.044
## Median :0.04394 Median :0.9115 Median :0.04760 Median :1.086
## Mean :0.06137 Mean :0.9189 Mean :0.06760 Mean :1.095
## 3rd Qu.:0.06626 3rd Qu.:0.9683 3rd Qu.:0.07337 3rd Qu.:1.154
## Max. :0.83934 Max. :1.0000 Max. :1.00000 Max. :1.191
## count
## Min. : 267.0
## 1st Qu.: 339.0
## Median : 445.0
## Mean : 621.5
## 3rd Qu.: 671.0
## Max. :8500.0
##
## mining info:
## data ntransactions support confidence
## transactions 10127 0.03 0.7
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspectDT(D3.clean)
This method took the longest of them all, running for at least a few minutes.
Statistics for the rules with RHS = Attrited Customer: none, because no such rules were mined.
The number of partitions per variable ranged from 2 to more than 6. There were 801 rules mined, but none for the attrited customers, so with this method we cannot determine the most common characteristics of the customers who churned.
D4.disc <- discretizeDF(clients, default = list(method = "interval", breaks = 3, labels = c("low", "moderate", "high")))
summary(D4.disc)
## Attrition_Flag Customer_Age Gender Dependent_count
## Attrited Customer:1627 low :2776 Female:5358 low :2742
## Existing Customer:8500 moderate:6505 Male :4769 moderate:5387
## high : 846 high :1998
##
##
##
##
## Education_Level Marital_Status
## College :1013 Divorced : 748
## Doctorate : 451 Married :4687
## Graduate :3128 Single :3943
## High School :2013 Unknown marital status: 749
## Post-Graduate : 516
## Uneducated :1487
## Unknown education:1519
## Income_Category Card_Category Months_on_book
## Income $40K - $60K :1790 Blue :9436 low :1404
## Income $60K - $80K :1402 Gold : 116 moderate:6537
## Income $80K - $120K:1535 Platinum: 20 high :2186
## Income over $120K : 727 Silver : 555
## Income under $40K :3561
## Unknown income :1112
##
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## low :2153 low :2262 low :1898
## moderate:4217 moderate:7128 moderate:6607
## high :3757 high : 737 high :1622
##
##
##
##
## Credit_Limit Total_Revolving_Bal Total_Trans_Ct Avg_Utilization_Ratio
## low :7892 low :3330 low :3284 low :6484
## moderate:1231 moderate:3780 moderate:6001 moderate:2333
## high :1004 high :3017 high : 842 high :1310
##
##
##
##
D4.trans <- transactions(D4.disc)
D4.ass <- mineCARs(Attrition_Flag ~ ., transactions = D4.trans, support = 0.03, confidence = 0.7)
D4.clean <- D4.ass[!is.redundant(D4.ass)]
summary(D4.clean)
## set of 1727 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 27 238 656 805
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 4.000 4.295 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02340 Min. :0.7016 Min. :0.03002 Min. :1.000
## 1st Qu.:0.03412 1st Qu.:0.8802 1st Qu.:0.03663 1st Qu.:1.057
## Median :0.04532 Median :0.9491 Median :0.04898 Median :1.134
## Mean :0.06448 Mean :0.9259 Mean :0.06970 Mean :1.231
## 3rd Qu.:0.07041 3rd Qu.:0.9654 3rd Qu.:0.07643 3rd Qu.:1.153
## Max. :0.83934 Max. :1.0000 Max. :1.00000 Max. :5.683
## count
## Min. : 237.0
## 1st Qu.: 345.5
## Median : 459.0
## Mean : 652.9
## 3rd Qu.: 713.0
## Max. :8500.0
##
## mining info:
## data ntransactions support confidence
## transactions 10127 0.03 0.7
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspectDT(D4.clean)
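Statistics for the rules with RHS = Attrited Customer: highest support 0.081, highest confidence 0.913, highest lift 5.683.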
This method was quick and was set to produce 3 partitions. There were 1727 rules mined, with 57 for the attrited customers. The rule with the highest confidence (0.913) stated that the probability of being an attrited customer is 91% for a client with a Blue card, a low transaction count, a low relationship count and a low utilization ratio. The rule with the second highest confidence stated that the probability of being an attrited customer is 91% for a customer with a low relationship count, a low credit limit and a low transaction count.
D5.disc <- discretizeDF(clients, default = list(method = "frequency", breaks = 3, labels = c("low", "moderate", "high")))
summary(D5.disc)
## Attrition_Flag Customer_Age Gender Dependent_count
## Attrited Customer:1627 low :3202 Female:5358 low :2742
## Existing Customer:8500 moderate:3395 Male :4769 moderate:2655
## high :3530 high :4730
##
##
##
##
## Education_Level Marital_Status
## College :1013 Divorced : 748
## Doctorate : 451 Married :4687
## Graduate :3128 Single :3943
## High School :2013 Unknown marital status: 749
## Post-Graduate : 516
## Uneducated :1487
## Unknown education:1519
## Income_Category Card_Category Months_on_book
## Income $40K - $60K :1790 Blue :9436 low :3132
## Income $60K - $80K :1402 Gold : 116 moderate:3491
## Income $80K - $120K:1535 Platinum: 20 high :3504
## Income over $120K : 727 Silver : 555
## Income under $40K :3561
## Unknown income :1112
##
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## low :2153 low :2262 low :1898
## moderate:4217 moderate:3282 moderate:3227
## high :3757 high :4583 high :5002
##
##
##
##
## Credit_Limit Total_Revolving_Bal Total_Trans_Ct Avg_Utilization_Ratio
## low :3375 low :3375 low :3369 low :3367
## moderate:3376 moderate:3375 moderate:3266 moderate:3384
## high :3376 high :3377 high :3492 high :3376
##
##
##
##
D5.trans <- transactions(D5.disc)
D5.ass <- mineCARs(Attrition_Flag ~ ., transactions = D5.trans, support = 0.03, confidence = 0.7)
D5.clean <- D5.ass[!is.redundant(D5.ass)]
summary(D5.clean)
## set of 2049 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 28 323 1076 621
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 4.000 4.117 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02242 Min. :0.7000 Min. :0.03002 Min. :1.000
## 1st Qu.:0.03318 1st Qu.:0.8849 1st Qu.:0.03565 1st Qu.:1.064
## Median :0.04108 Median :0.9259 Median :0.04434 Median :1.108
## Mean :0.05468 Mean :0.9211 Mean :0.05983 Mean :1.215
## 3rd Qu.:0.05678 3rd Qu.:0.9578 3rd Qu.:0.06201 3rd Qu.:1.146
## Max. :0.83934 Max. :1.0000 Max. :1.00000 Max. :5.859
## count
## Min. : 227.0
## 1st Qu.: 336.0
## Median : 416.0
## Mean : 553.8
## 3rd Qu.: 575.0
## Max. :8500.0
##
## mining info:
## data ntransactions support confidence
## transactions 10127 0.03 0.7
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspectDT(D5.clean)
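Statistics for the rules with RHS = Attrited Customer: highest support 0.074, highest confidence 0.941, highest lift 5.859.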
This method was quick and was set to produce 3 partitions. There were 2049 rules mined, with 62 for the attrited customers. The rule with the highest confidence (0.941) stated that the probability of being an attrited customer is 94% for a client with a low transaction count, a low relationship count and a low contacts count. The rule with the second highest confidence stated that the probability of being an attrited customer is 89% for a customer with a Blue card, a low transaction count and a low relationship count.
| Method | MDLP | CAIM | ChiMerge | Interval | Frequency |
|---|---|---|---|---|---|
| partitions | 1-6+ | 4 | 2-6+ | 3 | 3 |
| number of rules after cleaning | 877 | 2458 | 801 | 1727 | 2049 |
| number of rules for rhs = Attrited Customer | 11 | 70 | 0 | 57 | 62 |
| highest support for rhs = Attrited Customer | 0.044 | 0.082 | x | 0.081 | 0.074 |
| highest confidence for rhs = Attrited Customer | 0.855 | 0.909 | x | 0.913 | 0.941 |
| highest lift for rhs = Attrited Customer | 5.322 | 5.660 | x | 5.683 | 5.859 |
For each of the methods, the results were slightly different. MDLP and ChiMerge produced quite uneven partitions, while the CAIM algorithm split every variable into 4 partitions. ChiMerge was the only method that did not mine any rules for rhs = Attrited Customer, so it did not allow the characteristics of former customers to be defined. The remaining methods mined between 11 and 70 such rules. The rule with the highest support (the frequency with which the antecedent and consequent occur together) was observed for the data discretized with CAIM. Both the highest confidence (the probability of the consequent occurring when the antecedent is present) and the highest lift (how much more often the antecedent and consequent occur together than would be expected if they were independent) were observed for the data discretized by equal frequency.
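The lift values in the table can be cross-checked from the confidence values alone, because for a fixed RHS lift equals confidence divided by the overall share of that RHS. The sketch below assumes the class counts printed in the summaries (1627 attrited customers out of 10127) and the rounded confidences from the table; the variable names are illustrative.

```r
# Class counts taken from the summary() output.
n_attrited <- 1627
n_total    <- 10127
base_rate  <- n_attrited / n_total  # overall share of attrited customers

# Highest confidence per method for rhs = Attrited Customer (from the table).
best_conf <- c(MDLP = 0.855, CAIM = 0.909, Interval = 0.913, Frequency = 0.941)

# Highest lift per method as reported in the table.
table_lift <- c(MDLP = 5.322, CAIM = 5.660, Interval = 5.683, Frequency = 5.859)

# lift = confidence / P(RHS)
implied_lift <- best_conf / base_rate

# Compare the implied and the reported values side by side.
print(round(cbind(implied_lift, table_lift), 3))
```

The implied values agree with the reported ones up to the rounding of the confidences, which also shows that, with the RHS fixed, the ranking of methods by highest lift has to match the ranking by highest confidence.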
Looking at the highest-confidence rules across the differently partitioned datasets, the same items kept recurring. The probability of being an attrited customer was highest for customers who combined the following characteristics: a low transaction count (in every method that produced such rules), a low relationship count, a Blue card and, in some rules, a low utilization ratio, revolving balance or credit limit.
By analysing the rules more in-depth one can find more interesting patterns for the attrited customers.
The analysis presented here illustrates the impact of various discretization methods on rule mining outcomes. Although the statistics and the number of rules vary, the key elements of the rules with the highest confidence remain consistent. This kind of comparison makes it possible to evaluate the advantages and disadvantages of specific discretization methods: some may provide favorable statistics, while others may fall short of the desired results. The selection of the most suitable method should be based on a thorough analysis to determine which one performs best in a given scenario.
Sources: